-rw-r--r--  README.md | 4
-rw-r--r--  SECURITY.md | 16
-rw-r--r--  configure.py | 5
-rw-r--r--  tensorflow/compiler/xla/service/generic_transfer_manager.cc | 9
-rw-r--r--  tensorflow/compiler/xla/tests/convolution_test.cc | 2
-rw-r--r--  tensorflow/contrib/BUILD | 3
-rw-r--r--  tensorflow/contrib/cmake/README.md | 12
-rw-r--r--  tensorflow/contrib/cmake/external/grpc.cmake | 1
-rw-r--r--  tensorflow/contrib/cmake/external/protobuf.cmake | 2
-rw-r--r--  tensorflow/contrib/cmake/tf_tests.cmake | 4
-rw-r--r--  tensorflow/contrib/data/__init__.py | 4
-rw-r--r--  tensorflow/contrib/data/python/kernel_tests/BUILD | 17
-rw-r--r--  tensorflow/contrib/data/python/kernel_tests/resample_test.py | 4
-rw-r--r--  tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py | 242
-rw-r--r--  tensorflow/contrib/data/python/ops/BUILD | 1
-rw-r--r--  tensorflow/contrib/data/python/ops/resampling.py | 4
-rw-r--r--  tensorflow/contrib/data/python/ops/sliding.py | 102
-rw-r--r--  tensorflow/contrib/factorization/BUILD | 5
-rw-r--r--  tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc | 9
-rw-r--r--  tensorflow/contrib/framework/python/ops/arg_scope.py | 2
-rw-r--r--  tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py | 8
-rw-r--r--  tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py | 4
-rw-r--r--  tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py | 2
-rw-r--r--  tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py | 4
-rw-r--r--  tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py | 2
-rw-r--r--  tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py | 2
-rwxr-xr-x  tensorflow/contrib/image/BUILD | 1
-rw-r--r--  tensorflow/contrib/kafka/BUILD | 108
-rw-r--r--  tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc | 4
-rw-r--r--  tensorflow/contrib/kafka/ops/dataset_ops.cc | 44
-rw-r--r--  tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py | 9
-rw-r--r--  tensorflow/contrib/kafka/python/ops/kafka_op_loader.py | 24
-rw-r--r--  tensorflow/contrib/kfac/python/ops/optimizer.py | 2
-rw-r--r--  tensorflow/contrib/layers/python/layers/embedding_ops.py | 2
-rw-r--r--  tensorflow/contrib/layers/python/layers/encoders.py | 2
-rw-r--r--  tensorflow/contrib/learn/BUILD | 7
-rw-r--r--  tensorflow/contrib/learn/python/learn/estimators/estimator.py | 4
-rw-r--r--  tensorflow/contrib/learn/python/learn/experiment.py | 2
-rw-r--r--  tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py | 2
-rw-r--r--  tensorflow/contrib/lite/Makefile | 9
-rw-r--r--  tensorflow/contrib/lite/README.md | 4
-rw-r--r--  tensorflow/contrib/lite/arena_planner.h | 2
-rwxr-xr-x  tensorflow/contrib/lite/build_rpi_lib.sh | 22
-rw-r--r--  tensorflow/contrib/lite/error_reporter.h | 2
-rw-r--r--  tensorflow/contrib/lite/g3doc/ios.md | 9
-rw-r--r--  tensorflow/contrib/lite/g3doc/rpi.md | 50
-rw-r--r--  tensorflow/contrib/lite/interpreter.h | 2
-rw-r--r--  tensorflow/contrib/lite/interpreter_test.cc | 2
-rw-r--r--  tensorflow/contrib/lite/kernels/conv.cc | 2
-rw-r--r--  tensorflow/contrib/lite/kernels/depthwise_conv.cc | 2
-rw-r--r--  tensorflow/contrib/lite/kernels/fully_connected.cc | 2
-rw-r--r--  tensorflow/contrib/lite/kernels/internal/spectrogram.cc | 1
-rw-r--r--  tensorflow/contrib/lite/kernels/kernel_util.h | 2
-rw-r--r--  tensorflow/contrib/lite/kernels/lsh_projection.cc | 2
-rw-r--r--  tensorflow/contrib/lite/kernels/lstm.cc | 6
-rw-r--r--  tensorflow/contrib/lite/kernels/reshape.cc | 12
-rw-r--r--  tensorflow/contrib/lite/kernels/reshape_test.cc | 2
-rw-r--r--  tensorflow/contrib/lite/kernels/test_util.cc | 4
-rw-r--r--  tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc | 2
-rw-r--r--  tensorflow/contrib/lite/memory_planner.h | 4
-rw-r--r--  tensorflow/contrib/lite/model.h | 2
-rw-r--r--  tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h | 2
-rw-r--r--  tensorflow/contrib/lite/rpi_makefile.inc | 33
-rw-r--r--  tensorflow/contrib/lite/schema/upgrade_schema.py | 8
-rw-r--r--  tensorflow/contrib/lite/simple_memory_arena.cc | 6
-rw-r--r--  tensorflow/contrib/lite/simple_memory_arena.h | 6
-rw-r--r--  tensorflow/contrib/makefile/README.md | 2
-rwxr-xr-x  tensorflow/contrib/makefile/build_all_ios.sh | 5
-rw-r--r--  tensorflow/contrib/metrics/python/ops/metric_ops.py | 2
-rw-r--r--  tensorflow/contrib/model_pruning/python/layers/layers.py | 2
-rw-r--r--  tensorflow/contrib/model_pruning/python/pruning.py | 2
-rw-r--r--  tensorflow/contrib/mpi/mpi_utils.h | 2
-rw-r--r--  tensorflow/contrib/predictor/predictor_factories.py | 4
-rw-r--r--  tensorflow/contrib/py2tf/converters/single_return.py | 2
-rw-r--r--  tensorflow/contrib/quantize/python/fold_batch_norms.py | 4
-rw-r--r--  tensorflow/contrib/quantize/python/quant_ops.py | 4
-rw-r--r--  tensorflow/contrib/quantize/python/quantize.py | 2
-rw-r--r--  tensorflow/contrib/quantize/python/quantize_graph.py | 2
-rw-r--r--  tensorflow/contrib/quantize/python/quantize_parameterized_test.py | 8
-rw-r--r--  tensorflow/contrib/quantize/python/quantize_test.py | 2
-rw-r--r--  tensorflow/contrib/rnn/ops/gru_ops.cc | 2
-rw-r--r--  tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py | 2
-rw-r--r--  tensorflow/contrib/rnn/python/ops/lstm_ops.py | 3
-rw-r--r--  tensorflow/contrib/rnn/python/ops/rnn_cell.py | 10
-rw-r--r--  tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py | 3
-rw-r--r--  tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py | 104
-rw-r--r--  tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py | 19
-rw-r--r--  tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py | 176
-rw-r--r--  tensorflow/contrib/slim/README.md | 2
-rw-r--r--  tensorflow/contrib/solvers/python/ops/least_squares.py | 2
-rw-r--r--  tensorflow/contrib/solvers/python/ops/linear_equations.py | 2
-rw-r--r--  tensorflow/contrib/tensorrt/BUILD | 2
-rw-r--r--  tensorflow/contrib/tensorrt/README.md | 23
-rw-r--r--  tensorflow/contrib/tensorrt/__init__.py | 18
-rw-r--r--  tensorflow/contrib/tensorrt/convert/convert_graph.cc | 256
-rw-r--r--  tensorflow/contrib/tensorrt/convert/convert_graph.h | 10
-rw-r--r--  tensorflow/contrib/tensorrt/convert/convert_nodes.cc | 1469
-rw-r--r--  tensorflow/contrib/tensorrt/convert/convert_nodes.h | 53
-rw-r--r--  tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc | 11
-rw-r--r--  tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc | 39
-rw-r--r--  tensorflow/contrib/tensorrt/log/trt_logger.cc | 8
-rw-r--r--  tensorflow/contrib/tensorrt/log/trt_logger.h | 4
-rw-r--r--  tensorflow/contrib/tensorrt/python/__init__.py | 1
-rw-r--r--  tensorflow/contrib/tensorrt/python/trt_convert.py | 70
-rw-r--r--  tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc | 56
-rw-r--r--  tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h | 15
-rw-r--r--  tensorflow/contrib/tensorrt/test/test_tftrt.py | 57
-rw-r--r--  tensorflow/contrib/tensorrt/trt_conversion.i | 63
-rw-r--r--  tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc | 6
-rw-r--r--  tensorflow/contrib/tpu/python/tpu/device_assignment.py | 4
-rw-r--r--  tensorflow/contrib/tpu/python/tpu/tpu_config.py | 2
-rw-r--r--  tensorflow/contrib/tpu/python/tpu/tpu_context.py | 4
-rw-r--r--  tensorflow/contrib/tpu/python/tpu/tpu_estimator.py | 12
-rw-r--r--  tensorflow/contrib/tpu/python/tpu/training_loop.py | 2
-rw-r--r--  tensorflow/core/BUILD | 4
-rw-r--r--  tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt | 3
-rw-r--r--  tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt | 3
-rw-r--r--  tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt | 18
-rw-r--r--  tensorflow/core/distributed_runtime/tensor_coding.cc | 4
-rw-r--r--  tensorflow/core/graph/mkl_layout_pass.cc | 31
-rw-r--r--  tensorflow/core/grappler/optimizers/loop_optimizer.cc | 8
-rw-r--r--  tensorflow/core/kernels/BUILD | 12
-rw-r--r--  tensorflow/core/kernels/concat_op.cc | 98
-rw-r--r--  tensorflow/core/kernels/conv_ops_test.cc | 2
-rw-r--r--  tensorflow/core/kernels/data/BUILD | 14
-rw-r--r--  tensorflow/core/kernels/data/slide_dataset_op.cc | 252
-rw-r--r--  tensorflow/core/kernels/depthtospace_op.cc | 3
-rw-r--r--  tensorflow/core/kernels/depthtospace_op_gpu.cu.cc | 6
-rw-r--r--  tensorflow/core/kernels/hexagon/BUILD | 1
-rw-r--r--  tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc | 81
-rw-r--r--  tensorflow/core/kernels/mkl_conv_grad_input_ops.cc | 18
-rw-r--r--  tensorflow/core/kernels/mkl_conv_ops.cc | 146
-rw-r--r--  tensorflow/core/kernels/mkl_conv_ops.h | 117
-rw-r--r--  tensorflow/core/kernels/mkl_input_conversion_op.cc | 7
-rw-r--r--  tensorflow/core/kernels/mkl_relu_op.cc | 23
-rw-r--r--  tensorflow/core/kernels/segment_reduction_ops.h | 8
-rw-r--r--  tensorflow/core/kernels/spacetodepth_op.cc | 3
-rw-r--r--  tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc | 6
-rw-r--r--  tensorflow/core/lib/io/record_reader.cc | 2
-rw-r--r--  tensorflow/core/lib/io/record_reader.h | 4
-rw-r--r--  tensorflow/core/ops/dataset_ops.cc | 12
-rw-r--r--  tensorflow/core/ops/nn_ops.cc | 8
-rw-r--r--  tensorflow/core/platform/windows/windows_file_system.cc | 3
-rw-r--r--  tensorflow/core/public/version.h | 2
-rw-r--r--  tensorflow/core/util/stat_summarizer.h | 2
-rw-r--r--  tensorflow/docs_src/community/welcome.md | 4
-rw-r--r--  tensorflow/docs_src/install/install_c.md | 2
-rw-r--r--  tensorflow/docs_src/install/install_go.md | 2
-rw-r--r--  tensorflow/docs_src/install/install_java.md | 22
-rw-r--r--  tensorflow/docs_src/install/install_linux.md | 66
-rw-r--r--  tensorflow/docs_src/install/install_mac.md | 23
-rw-r--r--  tensorflow/docs_src/install/install_sources.md | 4
-rw-r--r--  tensorflow/docs_src/install/install_windows.md | 5
-rw-r--r--  tensorflow/docs_src/performance/performance_guide.md | 2
-rw-r--r--  tensorflow/docs_src/performance/xla/jit.md | 2
-rw-r--r--  tensorflow/docs_src/programmers_guide/debugger.md | 3
-rw-r--r--  tensorflow/docs_src/programmers_guide/faq.md | 4
-rw-r--r--  tensorflow/docs_src/programmers_guide/saved_model.md | 4
-rw-r--r--  tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md | 2
-rw-r--r--  tensorflow/docs_src/programmers_guide/using_tpu.md | 7
-rw-r--r--  tensorflow/docs_src/tutorials/deep_cnn.md | 2
-rw-r--r--  tensorflow/docs_src/tutorials/image_retraining.md | 2
-rw-r--r--  tensorflow/docs_src/tutorials/kernel_methods.md | 6
-rw-r--r--  tensorflow/docs_src/tutorials/layers.md | 12
-rw-r--r--  tensorflow/docs_src/tutorials/recurrent_quickdraw.md | 3
-rw-r--r--  tensorflow/docs_src/tutorials/wide.md | 16
-rw-r--r--  tensorflow/examples/android/AndroidManifest.xml | 4
-rw-r--r--  tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java | 7
-rw-r--r--  tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java | 60
-rw-r--r--  tensorflow/examples/ios/README.md | 6
-rw-r--r--  tensorflow/examples/learn/mnist.py | 6
-rwxr-xr-x  tensorflow/examples/learn/resnet.py | 12
-rw-r--r--  tensorflow/python/BUILD | 12
-rw-r--r--  tensorflow/python/client/timeline_test.py | 7
-rw-r--r--  tensorflow/python/estimator/estimator.py | 34
-rw-r--r--  tensorflow/python/estimator/run_config.py | 2
-rw-r--r--  tensorflow/python/estimator/training.py | 26
-rw-r--r--  tensorflow/python/keras/_impl/keras/engine/training.py | 2
-rw-r--r--  tensorflow/python/keras/_impl/keras/layers/recurrent.py | 4
-rw-r--r--  tensorflow/python/keras/_impl/keras/utils/generic_utils.py | 4
-rw-r--r--  tensorflow/python/keras/_impl/keras/utils/vis_utils.py | 2
-rw-r--r--  tensorflow/python/kernel_tests/concat_op_test.py | 11
-rw-r--r--  tensorflow/python/kernel_tests/conv_ops_test.py | 20
-rw-r--r--  tensorflow/python/kernel_tests/depthtospace_op_test.py | 10
-rw-r--r--  tensorflow/python/kernel_tests/spacetodepth_op_test.py | 10
-rw-r--r--  tensorflow/python/layers/base.py | 2
-rw-r--r--  tensorflow/python/layers/normalization.py | 9
-rw-r--r--  tensorflow/python/lib/io/file_io_test.py | 5
-rw-r--r--  tensorflow/python/lib/io/tf_record.py | 18
-rw-r--r--  tensorflow/python/ops/linalg_ops.py | 2
-rw-r--r--  tensorflow/python/ops/nn_ops.py | 16
-rw-r--r--  tensorflow/python/ops/random_ops.py | 2
-rw-r--r--  tensorflow/python/ops/rnn.py | 17
-rw-r--r--  tensorflow/python/ops/special_math_ops.py | 4
-rw-r--r--  tensorflow/python/ops/special_math_ops_test.py | 5
-rw-r--r--  tensorflow/python/tools/freeze_graph.py | 36
-rw-r--r--  tensorflow/python/tools/inspect_checkpoint.py | 4
-rw-r--r--  tensorflow/python/tools/saved_model_cli.py | 60
-rw-r--r--  tensorflow/python/tools/saved_model_cli_test.py | 22
-rw-r--r--  tensorflow/python/training/saver.py | 5
-rw-r--r--  tensorflow/stream_executor/cuda/cuda_dnn.cc | 33
-rw-r--r--  tensorflow/tools/api/tests/api_compatibility_test.py | 2
-rw-r--r--  tensorflow/tools/ci_build/Dockerfile.cmake | 5
-rw-r--r--  tensorflow/tools/compatibility/tf_upgrade.py | 6
-rw-r--r--  tensorflow/tools/dist_test/README.md | 8
-rwxr-xr-x  tensorflow/tools/dist_test/local_test.sh | 22
-rw-r--r--  tensorflow/tools/dist_test/python/mnist_replica.py | 2
-rw-r--r--  tensorflow/tools/docker/Dockerfile.gpu | 9
-rwxr-xr-x  tensorflow/tools/git/gen_git_source.py | 7
-rw-r--r--  tensorflow/tools/graph_transforms/BUILD | 1
-rw-r--r--  tensorflow/tools/graph_transforms/fold_old_batch_norms.cc | 67
-rw-r--r--  tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc | 97
-rw-r--r--  tensorflow/tools/pip_package/BUILD | 1
-rw-r--r--  tensorflow/tools/pip_package/setup.py | 4
-rw-r--r--  tensorflow/tools/test/upload_test_benchmarks.py | 9
-rw-r--r--  third_party/jpeg/jpeg.BUILD | 4
-rw-r--r--  third_party/kafka/BUILD | 13
-rw-r--r--  third_party/py/BUILD.tpl | 22
-rw-r--r--  third_party/tensorrt/tensorrt_configure.bzl | 4
219 files changed, 4306 insertions(+), 984 deletions(-)
diff --git a/README.md b/README.md
index ef5bdc66ef..3cdb6e478d 100644
--- a/README.md
+++ b/README.md
@@ -22,6 +22,10 @@ organization for the purposes of conducting machine learning and deep neural
networks research. The system is general enough to be applicable in a wide
variety of other domains, as well.
+Keep up to date with release announcements and security updates by
+subscribing to
+[announce@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/announce).
+
## Installation
*See [Installing TensorFlow](https://www.tensorflow.org/get_started/os_setup.html) for instructions on how to install our release binaries or how to build from source.*
diff --git a/SECURITY.md b/SECURITY.md
index fea24b2739..378e776967 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -6,7 +6,7 @@ report vulnerabilities in TensorFlow.
## TensorFlow models are programs
-TensorFlow's runtime system interprets and executes programs. What machine
+TensorFlow's runtime system interprets and executes programs. What machine
learning practitioners term
[**models**](https://developers.google.com/machine-learning/glossary/#model) are
expressed as programs that TensorFlow executes. TensorFlow programs are encoded
@@ -28,12 +28,12 @@ data you supply to TensorFlow to train a model, or to use a model to run
inference on the data.
**TensorFlow models are programs, and need to be treated as such from a security
-perspective.**
+perspective.**
## Running untrusted models
As a general rule: **Always** execute untrusted models inside a sandbox (e.g.,
-[nsjail](https://github.com/google/nsjail)).
+[nsjail](https://github.com/google/nsjail)).
There are several ways in which a model could become untrusted. Obviously, if an
untrusted party supplies TensorFlow kernels, arbitrary code may be executed.
@@ -109,11 +109,11 @@ graphs known to the `ModelServer`. This means that an attacker may run
graphs using untrusted inputs as described above, but they would not be able to
execute arbitrary graphs. It is possible to safely expose a `ModelServer`
directly to an untrusted network, **but only if the graphs it is configured to
-use have been carefully audited to be safe**.
+use have been carefully audited to be safe**.
Similar to best practices for other servers, we recommend running any
`ModelServer` with appropriate privileges (i.e., using a separate user with
-reduced permisisons). In the spirit of defense in depth, we recommend
+reduced permissions). In the spirit of defense in depth, we recommend
authenticating requests to any TensorFlow server connected to an untrusted
network, as well as sandboxing the server to minimize the adverse effects of
any breach.
@@ -129,11 +129,11 @@ with specially crafted inputs.
### What is a vulnerability?
Given TensorFlow's flexibility, it is possible to specify computation graphs
-which exhibit unexpected or unwanted behaviors. The fact that TensorFlow models
+which exhibit unexpected or unwanted behavior. The fact that TensorFlow models
can perform arbitrary computations means that they may read and write files,
communicate via the network, produce deadlocks and infinite loops, or run out
of memory. It is only when these behaviors are outside the specifications of the
-operations involved that such behavior is a vulnerability.
+operations involved that such behavior is a vulnerability.
A `FileWriter` writing a file is not unexpected behavior and therefore is not a
vulnerability in TensorFlow. A `MatMul` allowing arbitrary binary code execution
@@ -168,7 +168,7 @@ below).
Please use a descriptive subject line for your report email. After the initial
reply to your report, the security team will endeavor to keep you informed of
-the progress being made towards a fix and announcement.
+the progress being made towards a fix and announcement.
If you believe that an existing (public) issue is security-related, please send
an email to `security@tensorflow.org`. The email should include the issue ID and
diff --git a/configure.py b/configure.py
index 97f46757ee..7d61c2e5e3 100644
--- a/configure.py
+++ b/configure.py
@@ -1048,7 +1048,10 @@ def set_tf_tensorrt_install_path(environ_cp):
for lib_file in possible_files:
if is_compatible(lib_file, cuda_ver, cudnn_ver):
- ver_str = nvinfer_pattern.search(lib_file).group(1)
+ matches = nvinfer_pattern.search(lib_file)
+ if len(matches.groups()) == 0:
+ continue
+ ver_str = matches.group(1)
ver = convert_version_to_int(ver_str) if len(ver_str) else 0
if ver > highest_ver[0]:
highest_ver = [ver, ver_str, lib_file]
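For illustration only (not part of the patch), a minimal sketch of the guarded version extraction; the pattern and file names below are hypothetical stand-ins for what configure.py actually uses:

```python
import re

# Hypothetical pattern resembling the nvinfer version matcher in configure.py.
nvinfer_pattern = re.compile(r"\.so\.?(.*)$")

def extract_ver_str(lib_file):
  matches = nvinfer_pattern.search(lib_file)
  # Skip candidates that yield no captured version group, as the patch does;
  # a completely non-matching name (matches is None) still needs its own check.
  if matches is None or len(matches.groups()) == 0:
    return None
  return matches.group(1)

print(extract_ver_str("libnvinfer.so.4.1.2"))  # -> "4.1.2"
print(extract_ver_str("libnvinfer.so"))        # -> "" (treated as version 0)
```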
diff --git a/tensorflow/compiler/xla/service/generic_transfer_manager.cc b/tensorflow/compiler/xla/service/generic_transfer_manager.cc
index 78dc0ad4fc..a99e2b7794 100644
--- a/tensorflow/compiler/xla/service/generic_transfer_manager.cc
+++ b/tensorflow/compiler/xla/service/generic_transfer_manager.cc
@@ -38,14 +38,7 @@ namespace xla {
GenericTransferManager::GenericTransferManager(se::Platform::Id platform_id,
size_t pointer_size)
- : platform_id_(platform_id), pointer_size_(pointer_size) {
- // We currently only support kHostPlatformId for CPU, kCudaPlatformId for
- // GPU and kInterpreterPlatformId for Interpreter. Before supporting other
- // platforms, we need to test this transfer manager on them.
- CHECK(platform_id_ == se::host::kHostPlatformId ||
- platform_id_ == se::interpreter::kInterpreterPlatformId ||
- platform_id_ == se::cuda::kCudaPlatformId);
-}
+ : platform_id_(platform_id), pointer_size_(pointer_size) {}
se::Platform::Id GenericTransferManager::PlatformId() const {
return platform_id_;
diff --git a/tensorflow/compiler/xla/tests/convolution_test.cc b/tensorflow/compiler/xla/tests/convolution_test.cc
index 99640f5bb5..72715398de 100644
--- a/tensorflow/compiler/xla/tests/convolution_test.cc
+++ b/tensorflow/compiler/xla/tests/convolution_test.cc
@@ -723,7 +723,7 @@ INSTANTIATE_TEST_CASE_P(
);
#endif
-TEST_F(ConvolutionTest, Convolve_bf16_1x1x1x2_1x1x1x2_Valid) {
+XLA_TEST_F(ConvolutionTest, Convolve_bf16_1x1x1x2_1x1x1x2_Valid) {
ComputationBuilder builder(client_, TestName());
Shape input_shape = ShapeUtil::MakeShape(BF16, {1, 1, 1, 2});
Shape filter_shape = ShapeUtil::MakeShape(BF16, {1, 1, 1, 2});
diff --git a/tensorflow/contrib/BUILD b/tensorflow/contrib/BUILD
index bab37e8906..d103da79e3 100644
--- a/tensorflow/contrib/BUILD
+++ b/tensorflow/contrib/BUILD
@@ -121,6 +121,7 @@ cc_library(
"//tensorflow/contrib/coder:all_kernels",
"//tensorflow/contrib/cudnn_rnn:cudnn_rnn_kernels",
"//tensorflow/contrib/data/kernels:dataset_kernels",
+ "//tensorflow/contrib/kafka:dataset_kernels",
"//tensorflow/contrib/factorization/kernels:all_kernels",
"//tensorflow/contrib/input_pipeline:input_pipeline_ops_kernels",
"//tensorflow/contrib/layers:sparse_feature_cross_op_kernel",
@@ -147,7 +148,7 @@ cc_library(
"//tensorflow/contrib/factorization:all_ops",
"//tensorflow/contrib/framework:all_ops",
"//tensorflow/contrib/input_pipeline:input_pipeline_ops_op_lib",
- "//tensorflow/contrib/kafka:kafka_ops_op_lib",
+ "//tensorflow/contrib/kafka:dataset_ops_op_lib",
"//tensorflow/contrib/layers:sparse_feature_cross_op_op_lib",
"//tensorflow/contrib/nccl:nccl_ops_op_lib",
"//tensorflow/contrib/nearest_neighbor:nearest_neighbor_ops_op_lib",
diff --git a/tensorflow/contrib/cmake/README.md b/tensorflow/contrib/cmake/README.md
index 8f85a75ee4..fe83bb3204 100644
--- a/tensorflow/contrib/cmake/README.md
+++ b/tensorflow/contrib/cmake/README.md
@@ -26,7 +26,7 @@ The CMake files in this directory can build the core TensorFlow runtime, an
example C++ binary, and a PIP package containing the runtime and Python
bindings.
-### Pre-requisites
+### Prerequisites
* CMake version 3.5 or later.
@@ -34,14 +34,16 @@ bindings.
* [SWIG](http://www.swig.org/download.html)
-* Additional pre-requisites for Microsoft Windows:
+* Additional prerequisites for Microsoft Windows:
- Visual Studio 2015
- Python 3.5
- - NumPy 1.11.0 or later
-* Additional pre-requisites for Linux:
+* Additional prerequisites for Linux:
- Python 2.7 or later
- [Docker](https://www.docker.com/) (for automated testing)
+
+* Python dependencies:
+ - wheel
- NumPy 1.11.0 or later
### Known-good configurations
@@ -102,7 +104,7 @@ ops or APIs.
Step-by-step Windows build
==========================
-1. Install the pre-requisites detailed above, and set up your environment.
+1. Install the prerequisites detailed above, and set up your environment.
* The following commands assume that you are using the Windows Command
Prompt (`cmd.exe`). You will need to set up your environment to use the
diff --git a/tensorflow/contrib/cmake/external/grpc.cmake b/tensorflow/contrib/cmake/external/grpc.cmake
index 95106dba1f..cc218e8ab8 100644
--- a/tensorflow/contrib/cmake/external/grpc.cmake
+++ b/tensorflow/contrib/cmake/external/grpc.cmake
@@ -35,6 +35,7 @@ else()
set(grpc_STATIC_LIBRARIES
${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc++_unsecure.a
${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgrpc_unsecure.a
+ ${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/third_party/cares/cares/lib/libcares.a
${CMAKE_CURRENT_BINARY_DIR}/grpc/src/grpc/libgpr.a)
endif()
diff --git a/tensorflow/contrib/cmake/external/protobuf.cmake b/tensorflow/contrib/cmake/external/protobuf.cmake
index aba8a5244e..ab464bc99a 100644
--- a/tensorflow/contrib/cmake/external/protobuf.cmake
+++ b/tensorflow/contrib/cmake/external/protobuf.cmake
@@ -16,7 +16,7 @@ include (ExternalProject)
set(PROTOBUF_INCLUDE_DIRS ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/src)
set(PROTOBUF_URL https://github.com/google/protobuf.git)
-set(PROTOBUF_TAG 396336eb961b75f03b25824fe86cf6490fb75e3a)
+set(PROTOBUF_TAG b04e5cba356212e4e8c66c61bbe0c3a20537c5b9)
if(WIN32)
if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
diff --git a/tensorflow/contrib/cmake/tf_tests.cmake b/tensorflow/contrib/cmake/tf_tests.cmake
index cdf48b3584..237f4fe33a 100644
--- a/tensorflow/contrib/cmake/tf_tests.cmake
+++ b/tensorflow/contrib/cmake/tf_tests.cmake
@@ -478,6 +478,10 @@ if (tensorflow_BUILD_CC_TESTS)
"${tensorflow_source_dir}/tensorflow/core/profiler/internal/advisor/*_test.cc"
)
+ list(REMOVE_ITEM tf_test_src_simple
+ ${tf_core_profiler_test_srcs}
+ )
+
set(tf_test_lib tf_test_lib)
add_library(${tf_test_lib} STATIC ${tf_src_testlib})
diff --git a/tensorflow/contrib/data/__init__.py b/tensorflow/contrib/data/__init__.py
index f09d156832..9212b69700 100644
--- a/tensorflow/contrib/data/__init__.py
+++ b/tensorflow/contrib/data/__init__.py
@@ -40,6 +40,7 @@ See the @{$datasets$Importing Data} Programmer's Guide for an overview.
@@rejection_resample
@@scan
@@shuffle_and_repeat
+@@sliding_window_batch
@@sloppy_interleave
@@unbatch
@@ -72,6 +73,9 @@ from tensorflow.contrib.data.python.ops.readers import SqlDataset
from tensorflow.contrib.data.python.ops.resampling import rejection_resample
from tensorflow.contrib.data.python.ops.scan_ops import scan
from tensorflow.contrib.data.python.ops.shuffle_ops import shuffle_and_repeat
+from tensorflow.contrib.data.python.ops.sliding import sliding_window_batch
+from tensorflow.python.data.ops.iterator_ops import Iterator
+from tensorflow.python.ops.parsing_ops import parse_single_example_v2 as parse_single_example
# pylint: enable=unused-import
from tensorflow.python.util.all_util import remove_undocumented
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index 22418b38e3..2c4d4adfda 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -498,6 +498,23 @@ py_test(
],
)
+tf_py_test(
+ name = "slide_dataset_op_test",
+ size = "small",
+ srcs = ["slide_dataset_op_test.py"],
+ additional_deps = [
+ "//tensorflow/contrib/data/python/ops:dataset_ops",
+ "//tensorflow/contrib/data/python/ops:transformation_ops",
+ "//tensorflow/python:array_ops",
+ "//tensorflow/python:client_testlib",
+ "//tensorflow/python:dtypes",
+ "//tensorflow/python:errors",
+ "//tensorflow/python:math_ops",
+ "//tensorflow/python:sparse_tensor",
+ "//third_party/py/numpy",
+ ],
+)
+
filegroup(
name = "all_files",
srcs = glob(
diff --git a/tensorflow/contrib/data/python/kernel_tests/resample_test.py b/tensorflow/contrib/data/python/kernel_tests/resample_test.py
index 3c7b46629e..913ab9b9f8 100644
--- a/tensorflow/contrib/data/python/kernel_tests/resample_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/resample_test.py
@@ -45,12 +45,10 @@ class ResampleTest(test.TestCase):
target_dist=target_dist,
initial_dist=initial_dist,
class_func=lambda c, _: c,
- seed=27)).make_initializable_iterator())
- init_op = iterator.initializer
+ seed=27)).make_one_shot_iterator())
get_next = iterator.get_next()
with self.test_session() as sess:
- sess.run(init_op)
returned = []
with self.assertRaises(errors.OutOfRangeError):
while True:
diff --git a/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py
new file mode 100644
index 0000000000..33c48e20be
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/slide_dataset_op_test.py
@@ -0,0 +1,242 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental input pipeline ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.data.python.ops import sliding
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class SlideDatasetTest(test.TestCase):
+
+ def testSlideDataset(self):
+ """Test an dataset that maps a TF function across its input elements."""
+ components = (np.arange(7),
+ np.array([[1, 2, 3]]) * np.arange(7)[:, np.newaxis],
+ np.array(37.0) * np.arange(7))
+
+ count = array_ops.placeholder(dtypes.int64, shape=[])
+ window_size = array_ops.placeholder(dtypes.int64, shape=[])
+ stride = array_ops.placeholder(dtypes.int64, shape=[])
+
+ def _map_fn(x, y, z):
+ return math_ops.square(x), math_ops.square(y), math_ops.square(z)
+
+ # The pipeline is TensorSliceDataset -> MapDataset(square_3) ->
+ # RepeatDataset(count) -> _SlideDataset(window_size, stride).
+ iterator = (dataset_ops.Dataset.from_tensor_slices(components)
+ .map(_map_fn)
+ .repeat(count)
+ .apply(sliding.sliding_window_batch(window_size, stride))
+ .make_initializable_iterator())
+ init_op = iterator.initializer
+ get_next = iterator.get_next()
+
+ self.assertEqual([[None] + list(c.shape[1:]) for c in components],
+ [t.shape.as_list() for t in get_next])
+
+ with self.test_session() as sess:
+ # Slide over a finite input, where the window_size divides the
+ # total number of elements.
+ sess.run(init_op, feed_dict={count: 20, window_size: 14, stride: 7})
+ # Same formula with convolution layer.
+ num_batches = (20 * 7 - 14) // 7 + 1
+ for i in range(num_batches):
+ result = sess.run(get_next)
+ for component, result_component in zip(components, result):
+ for j in range(14):
+ self.assertAllEqual(component[(i*7 + j) % 7]**2,
+ result_component[j])
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ # Slide over a finite input, where the window_size does not
+ # divide the total number of elements.
+ sess.run(init_op, feed_dict={count: 20, window_size: 17, stride: 9})
+
+ num_batches = (20 * 7 - 17) // 9 + 1
+ for i in range(num_batches):
+ result = sess.run(get_next)
+ for component, result_component in zip(components, result):
+ for j in range(17):
+ self.assertAllEqual(component[(i*9 + j) % 7]**2,
+ result_component[j])
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ # Slide over a finite input, which is less than window_size,
+ # should fail straight away.
+ sess.run(init_op, feed_dict={count: 1, window_size: 10, stride: 4})
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ sess.run(init_op, feed_dict={count: 1, window_size: 10, stride: 8})
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ # Slide over an empty input should fail straight away.
+ sess.run(init_op, feed_dict={count: 0, window_size: 8, stride: 4})
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ # Empty window_size should be an initialization time error.
+ with self.assertRaises(errors.InvalidArgumentError):
+ sess.run(init_op, feed_dict={count: 14, window_size: 0, stride: 0})
+
+ # Invalid stride should be an initialization time error.
+ with self.assertRaises(errors.InvalidArgumentError):
+ sess.run(init_op, feed_dict={count: 14, window_size: 3, stride: 0})
+ with self.assertRaises(errors.InvalidArgumentError):
+ sess.run(init_op, feed_dict={count: 14, window_size: 3, stride: 3})
+ with self.assertRaises(errors.InvalidArgumentError):
+ sess.run(init_op, feed_dict={count: 14, window_size: 3, stride: 5})
+
+ def assertSparseValuesEqual(self, a, b):
+ self.assertAllEqual(a.indices, b.indices)
+ self.assertAllEqual(a.values, b.values)
+ self.assertAllEqual(a.dense_shape, b.dense_shape)
+
+ def testSlideSparse(self):
+
+ def _sparse(i):
+ return sparse_tensor.SparseTensorValue(
+ indices=[[0]], values=(i * [1]), dense_shape=[1])
+
+ iterator = dataset_ops.Dataset.range(10).map(_sparse).apply(
+ sliding.sliding_window_batch(5, 3)).make_initializable_iterator()
+ init_op = iterator.initializer
+ get_next = iterator.get_next()
+
+ with self.test_session() as sess:
+ sess.run(init_op)
+ num_batches = (10 - 5) // 3 + 1
+ for i in range(num_batches):
+ actual = sess.run(get_next)
+ expected = sparse_tensor.SparseTensorValue(
+ indices=[[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]],
+ values=[i * 3, i * 3 + 1, i * 3 + 2, i * 3 + 3, i * 3 + 4],
+ dense_shape=[5, 1])
+ self.assertTrue(sparse_tensor.is_sparse(actual))
+ self.assertSparseValuesEqual(actual, expected)
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ def testSlideSparseWithDifferentDenseShapes(self):
+
+ def _sparse(i):
+ return sparse_tensor.SparseTensorValue(
+ indices=array_ops.expand_dims(
+ math_ops.range(i, dtype=dtypes.int64), 1),
+ values=array_ops.fill([math_ops.to_int32(i)], i),
+ dense_shape=[i])
+
+ iterator = dataset_ops.Dataset.range(10).map(_sparse).apply(
+ sliding.sliding_window_batch(5, 3)).make_initializable_iterator()
+ init_op = iterator.initializer
+ get_next = iterator.get_next()
+
+ with self.test_session() as sess:
+ sess.run(init_op)
+ num_batches = (10 - 5) // 3 + 1
+ for i in range(num_batches):
+ actual = sess.run(get_next)
+ expected_indices = []
+ expected_values = []
+ for j in range(5):
+ for k in range(i * 3 + j):
+ expected_indices.append([j, k])
+ expected_values.append(i * 3 + j)
+ expected = sparse_tensor.SparseTensorValue(
+ indices=expected_indices,
+ values=expected_values,
+ dense_shape=[5, i * 3 + 5 - 1])
+ self.assertTrue(sparse_tensor.is_sparse(actual))
+ self.assertSparseValuesEqual(actual, expected)
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ def testNestedSlideSparse(self):
+
+ def _sparse(i):
+ return sparse_tensor.SparseTensorValue(
+ indices=[[0]], values=(i * [1]), dense_shape=[1])
+
+ iterator = (dataset_ops.Dataset.range(10)
+ .map(_sparse)
+ .apply(sliding.sliding_window_batch(4, 2))
+ .apply(sliding.sliding_window_batch(3, 1))
+ .make_initializable_iterator())
+ init_op = iterator.initializer
+ get_next = iterator.get_next()
+
+ with self.test_session() as sess:
+ sess.run(init_op)
+ # Slide: 1st batch.
+ actual = sess.run(get_next)
+ expected = sparse_tensor.SparseTensorValue(
+ indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0],
+ [1, 0, 0], [1, 1, 0], [1, 2, 0], [1, 3, 0],
+ [2, 0, 0], [2, 1, 0], [2, 2, 0], [2, 3, 0]],
+ values=[0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7],
+ dense_shape=[3, 4, 1])
+ self.assertTrue(sparse_tensor.is_sparse(actual))
+ self.assertSparseValuesEqual(actual, expected)
+ # Slide: 2nd batch.
+ actual = sess.run(get_next)
+ expected = sparse_tensor.SparseTensorValue(
+ indices=[[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0],
+ [1, 0, 0], [1, 1, 0], [1, 2, 0], [1, 3, 0],
+ [2, 0, 0], [2, 1, 0], [2, 2, 0], [2, 3, 0]],
+ values=[2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9],
+ dense_shape=[3, 4, 1])
+ self.assertTrue(sparse_tensor.is_sparse(actual))
+ self.assertSparseValuesEqual(actual, expected)
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(get_next)
+
+ def testSlideShapeError(self):
+
+ def generator():
+ yield [1.0, 2.0, 3.0]
+ yield [4.0, 5.0, 6.0]
+ yield [7.0, 8.0, 9.0, 10.0]
+
+ iterator = (dataset_ops.Dataset.from_generator(generator, dtypes.float32,
+ output_shapes=[None])
+ .apply(sliding.sliding_window_batch(3, 1))
+ .make_initializable_iterator())
+ next_element = iterator.get_next()
+
+ with self.test_session() as sess:
+ sess.run(iterator.initializer)
+ with self.assertRaisesRegexp(
+ errors.InvalidArgumentError,
+ r"Cannot batch tensors with different shapes in component 0. "
+ r"First element had shape \[3\] and element 2 had shape \[4\]."):
+ sess.run(next_element)
+
+
+if __name__ == "__main__":
+ test.main()
diff --git a/tensorflow/contrib/data/python/ops/BUILD b/tensorflow/contrib/data/python/ops/BUILD
index f03430c5c5..c3331e9636 100644
--- a/tensorflow/contrib/data/python/ops/BUILD
+++ b/tensorflow/contrib/data/python/ops/BUILD
@@ -106,6 +106,7 @@ py_library(
"interleave_ops.py",
"resampling.py",
"scan_ops.py",
+ "sliding.py",
"stats_ops.py",
"threadpool.py",
"unique.py",
diff --git a/tensorflow/contrib/data/python/ops/resampling.py b/tensorflow/contrib/data/python/ops/resampling.py
index 56f526a330..f4015f19fb 100644
--- a/tensorflow/contrib/data/python/ops/resampling.py
+++ b/tensorflow/contrib/data/python/ops/resampling.py
@@ -54,7 +54,7 @@ def rejection_resample(class_func, target_dist, initial_dist=None, seed=None):
def _apply_fn(dataset):
"""Function from `Dataset` to `Dataset` that applies the transformation."""
dist_estimation_batch_size = 32
- target_dist_t = ops.convert_to_tensor(target_dist, name="initial_dist")
+ target_dist_t = ops.convert_to_tensor(target_dist, name="target_dist")
class_values_ds = dataset.map(class_func)
if initial_dist is not None:
initial_dist_t = ops.convert_to_tensor(initial_dist, name="initial_dist")
@@ -151,7 +151,7 @@ def _calculate_acceptance_probs(initial_probs, target_probs):
```
- A solution for a_i in terms of the other variabes is the following:
+ A solution for a_i in terms of the other variables is the following:
```a_i = (t_i / p_i) / max_i[t_i / p_i]```
"""
# Add tiny to initial_probs to avoid divide by zero.
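For context (not part of the patch), a small numeric check of the acceptance-probability formula quoted in the docstring above:

```python
import numpy as np

initial_probs = np.array([0.5, 0.3, 0.2])       # p_i, observed class distribution
target_probs = np.array([1.0, 1.0, 1.0]) / 3.0  # t_i, desired distribution
ratio = target_probs / initial_probs
accept = ratio / ratio.max()  # a_i = (t_i / p_i) / max_i[t_i / p_i]
print(accept)  # [0.4  0.66666667  1.0] -- the under-represented class is always accepted
```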
diff --git a/tensorflow/contrib/data/python/ops/sliding.py b/tensorflow/contrib/data/python/ops/sliding.py
new file mode 100644
index 0000000000..19cc3cb89f
--- /dev/null
+++ b/tensorflow/contrib/data/python/ops/sliding.py
@@ -0,0 +1,102 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Sliding dataset transformations."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import sparse
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import gen_dataset_ops
+
+
+class _SlideDataset(dataset_ops.Dataset):
+ """A `Dataset` that passes a sliding window over its input."""
+
+ def __init__(self, input_dataset, window_size, stride=1):
+ """See `sliding_window_batch` for details."""
+ super(_SlideDataset, self).__init__()
+ self._input_dataset = input_dataset
+ self._window_size = ops.convert_to_tensor(
+ window_size, dtype=dtypes.int64, name="window_size")
+ self._stride = ops.convert_to_tensor(
+ stride, dtype=dtypes.int64, name="stride")
+
+ def _as_variant_tensor(self):
+ return gen_dataset_ops.slide_dataset(
+ self._input_dataset._as_variant_tensor(), # pylint: disable=protected-access
+ window_size=self._window_size,
+ stride=self._stride,
+ output_shapes=nest.flatten(
+ sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
+ output_types=nest.flatten(
+ sparse.as_dense_types(self.output_types, self.output_classes)))
+
+ @property
+ def output_classes(self):
+ return self._input_dataset.output_classes
+
+ @property
+ def output_shapes(self):
+ input_shapes = self._input_dataset.output_shapes
+ return nest.pack_sequence_as(input_shapes, [
+ tensor_shape.vector(None).concatenate(s)
+ for s in nest.flatten(self._input_dataset.output_shapes)
+ ])
+
+ @property
+ def output_types(self):
+ return self._input_dataset.output_types
+
+
+def sliding_window_batch(window_size, stride=1):
+ """A sliding window with size of `window_size` and step of `stride`.
+
+ This transformation passes a sliding window over this dataset. The
+ window size is `window_size` and step size is `stride`. If the left
+ elements cannot fill up the sliding window, this transformation will
+ drop the final smaller element. For example:
+
+ ```python
+ # NOTE: The following examples use `{ ... }` to represent the
+ # contents of a dataset.
+ a = { [1], [2], [3], [4], [5], [6] }
+
+ a.apply(tf.contrib.data.sliding_window_batch(window_size=3, stride=2)) ==
+ {
+ [[1], [2], [3]],
+ [[3], [4], [5]],
+ }
+ ```
+
+ Args:
+ window_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
+ elements in the sliding window.
+ stride: (Optional.) A `tf.int64` scalar `tf.Tensor`, representing the
+ steps moving the sliding window forward for one iteration. The default
+ is `1`. It must be in `[1, window_size)`.
+
+ Returns:
+ A `Dataset` transformation function, which can be passed to
+ @{tf.data.Dataset.apply}.
+ """
+ def _apply_fn(dataset):
+ return _SlideDataset(dataset, window_size, stride)
+
+ return _apply_fn
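For context (not part of the patch), a minimal usage sketch of the new transformation, assuming the `tf.contrib.data.sliding_window_batch` export added to `__init__.py` earlier in this change:

```python
import tensorflow as tf

# Windows of 3 elements, advancing 2 elements per step, over the range 0..5.
dataset = tf.data.Dataset.range(6).apply(
    tf.contrib.data.sliding_window_batch(window_size=3, stride=2))
iterator = dataset.make_one_shot_iterator()
next_window = iterator.get_next()

with tf.Session() as sess:
  print(sess.run(next_window))  # [0 1 2]
  print(sess.run(next_window))  # [2 3 4]
  # The trailing elements [4, 5] cannot fill a window and are dropped.
```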
diff --git a/tensorflow/contrib/factorization/BUILD b/tensorflow/contrib/factorization/BUILD
index 90f10f1fa8..ad8568ad44 100644
--- a/tensorflow/contrib/factorization/BUILD
+++ b/tensorflow/contrib/factorization/BUILD
@@ -224,7 +224,10 @@ py_test(
srcs = ["python/ops/kmeans_test.py"],
shard_count = 4,
srcs_version = "PY2AND3",
- tags = ["notsan"], # b/67512932
+ tags = [
+ "nomac", # b/73741358
+ "notsan", # b/67512932
+ ],
deps = [
":factorization_py",
":factorization_py_CYCLIC_DEPENDENCIES_THAT_NEED_TO_GO",
diff --git a/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc b/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc
index e61221a6b0..35341406a0 100644
--- a/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc
+++ b/tensorflow/contrib/ffmpeg/default/ffmpeg_lib.cc
@@ -256,6 +256,9 @@ Status ReadInfoFile(const string& filename, uint32* width, uint32* height,
if (p != std::string::npos) {
string rgb24 = line.substr(p + 9, line.find(" ", p + 9));
rgb24 = rgb24.substr(0, rgb24.find(","));
+ // Strip anything after " ", in case the format is
+ // `640x360 [SAR 1:1 DAR 16:9]`
+ rgb24 = rgb24.substr(0, rgb24.find(" "));
string rgb24_width = rgb24.substr(0, rgb24.find("x"));
string rgb24_height = rgb24.substr(rgb24_width.length() + 1);
if (strings::safe_strtou32(rgb24_width, &width_value) &&
@@ -270,8 +273,10 @@ Status ReadInfoFile(const string& filename, uint32* width, uint32* height,
// We only look for the first stream mapping to have the number of the
// frames.
// Once processed we will not further process stream mapping section.
- if (line.find("frame= ") == 0) {
- string number = line.substr(8, line.find(" ", 8));
+ if (line.find("frame=") == 0) {
+ // The format might be `frame= 166 ` or `frame=12488 `
+ string number = line.substr(6);
+ number = number.substr(number.find_first_not_of(" "));
number = number.substr(0, number.find(" "));
if (strings::safe_strtou32(number, &frames_value)) {
in_mapping = false;
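For illustration only (not part of the patch), the same parsing logic sketched in Python, showing why the extra strips are needed for lines like `640x360 [SAR 1:1 DAR 16:9]` and `frame= 166`:

```python
# The resolution field may carry a trailing aspect-ratio block after a space.
rgb24 = "640x360 [SAR 1:1 DAR 16:9]"
rgb24 = rgb24.split(",")[0].split(" ")[0]   # -> "640x360"
width, height = rgb24.split("x")            # -> "640", "360"

# The frame count may or may not be space-padded after "frame=".
for line in ("frame= 166 fps=0.0", "frame=12488 fps=0.0"):
    number = line[len("frame="):].lstrip(" ").split(" ")[0]
    print(width, height, number)            # 640 360 166, then 640 360 12488
```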
diff --git a/tensorflow/contrib/framework/python/ops/arg_scope.py b/tensorflow/contrib/framework/python/ops/arg_scope.py
index 409657fe1d..3cad1fee19 100644
--- a/tensorflow/contrib/framework/python/ops/arg_scope.py
+++ b/tensorflow/contrib/framework/python/ops/arg_scope.py
@@ -142,7 +142,7 @@ def arg_scope(list_ops_or_scope, **kwargs):
else:
# Assumes that list_ops_or_scope is a list/tuple of ops with kwargs.
if not isinstance(list_ops_or_scope, (list, tuple)):
- raise TypeError('list_ops_or_scope must either be a list/tuple or reused'
+ raise TypeError('list_ops_or_scope must either be a list/tuple or reused '
'scope (i.e. dict)')
try:
current_scope = current_arg_scope().copy()
diff --git a/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py b/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py
index 7e86d10b64..47e51415fd 100644
--- a/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py
+++ b/tensorflow/contrib/gan/python/eval/python/classifier_metrics_impl.py
@@ -321,7 +321,7 @@ def classifier_score(images, classifier_fn, num_batches=1):
NOTE: This function consumes images, computes their logits, and then
computes the classifier score. If you would like to precompute many logits for
- large batches, use clasifier_score_from_logits(), which this method also
+ large batches, use classifier_score_from_logits(), which this method also
uses.
Args:
@@ -454,7 +454,7 @@ def frechet_classifier_distance(real_images,
This technique is described in detail in https://arxiv.org/abs/1706.08500.
Given two Gaussian distribution with means m and m_w and covariance matrices
- C and C_w, this function calcuates
+ C and C_w, this function calculates
|m - m_w|^2 + Tr(C + C_w - 2(C * C_w)^(1/2))
@@ -467,7 +467,7 @@ def frechet_classifier_distance(real_images,
Frechet distance is biased. It is more biased for small sample sizes. (e.g.
even if the two distributions are the same, for a small sample size, the
expected Frechet distance is large). It is important to use the same
- sample size to compute frechet classifier distance when comparing two
+ sample size to compute Frechet classifier distance when comparing two
generative models.
NOTE: This function consumes images, computes their activations, and then
@@ -659,7 +659,7 @@ def frechet_classifier_distance_from_activations(real_activations,
This technique is described in detail in https://arxiv.org/abs/1706.08500.
Given two Gaussian distribution with means m and m_w and covariance matrices
- C and C_w, this function calcuates
+ C and C_w, this function calculates
|m - m_w|^2 + Tr(C + C_w - 2(C * C_w)^(1/2))
diff --git a/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py b/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py
index 9bebcacbe4..4b10bc0f8e 100644
--- a/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py
+++ b/tensorflow/contrib/gan/python/eval/python/sliced_wasserstein_impl.py
@@ -212,7 +212,7 @@ def sliced_wasserstein_distance(real_images,
Args:
real_images: (tensor) Real images (batch, height, width, channels).
fake_images: (tensor) Fake images (batch, height, width, channels).
- resolution_min: (int) Minimum resolution for the Laplacion pyramid.
+ resolution_min: (int) Minimum resolution for the Laplacian pyramid.
patches_per_image: (int) Number of patches to extract per image per
Laplacian level.
patch_size: (int) Width of a square patch.
@@ -221,7 +221,7 @@ def sliced_wasserstein_distance(real_images,
use_svd: experimental method to compute a more accurate distance.
Returns:
List of tuples (distance_real, distance_fake) for each level of the
- Laplacian pyramid from the highest resoluion to the lowest.
+ Laplacian pyramid from the highest resolution to the lowest.
distance_real is the Wasserstein distance between real images
distance_fake is the Wasserstein distance between real and fake images.
Raises:
diff --git a/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py b/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py
index cd31c62667..e2594faf85 100644
--- a/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py
+++ b/tensorflow/contrib/gan/python/features/python/conditioning_utils_impl.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
-"""Miscellanous utilities for TFGAN code and examples.
+"""Miscellaneous utilities for TFGAN code and examples.
Includes:
1) Conditioning the value of a Tensor, based on techniques from
diff --git a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
index 4cfae0de44..9e4ec59e70 100644
--- a/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
+++ b/tensorflow/contrib/gan/python/features/python/random_tensor_pool_impl.py
@@ -17,7 +17,7 @@
We use this to keep a history of values created by a generator, such that
a discriminator can randomly be trained on some older samples, not just the
current one. This can help to not let the discriminator get too far ahead of the
-generator and also to keep the system from oscilating, if the discriminator
+generator and also to keep the system from oscillating, if the discriminator
forgets too fast what past samples from the generator looked like.
See the following papers for more details.
@@ -97,7 +97,7 @@ def tensor_pool(input_values,
dtypes=[v.dtype for v in input_values],
shapes=None)
- # In pseudeo code this code does the following:
+ # In pseudo code this code does the following:
# if not pool_full:
# enqueue(input_values)
# return input_values
diff --git a/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py b/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py
index 845f89827b..2fe06a2872 100644
--- a/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py
+++ b/tensorflow/contrib/gan/python/features/python/virtual_batchnorm_test.py
@@ -148,7 +148,7 @@ class VirtualBatchnormTest(test.TestCase):
self.assertAllClose(bn_np[i, ...], vb_np)
def test_minibatch_independent(self):
- """Test that virtual batch normalized exampels are independent.
+ """Test that virtual batch normalized examples are independent.
Unlike batch normalization, virtual batch normalization has the property
that the virtual batch normalized value of an example is independent of the
diff --git a/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py b/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py
index 252788140f..bcd2a34c4e 100644
--- a/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py
+++ b/tensorflow/contrib/grid_rnn/python/ops/grid_rnn_cell.py
@@ -110,7 +110,7 @@ class GridRNNCell(rnn.RNNCell):
logging.warning('%s: Using a concatenated state is slower and will '
'soon be deprecated. Use state_is_tuple=True.', self)
if not output_is_tuple:
- logging.warning('%s: Using a concatenated output is slower and will'
+ logging.warning('%s: Using a concatenated output is slower and will '
'soon be deprecated. Use output_is_tuple=True.', self)
if num_dims < 1:
diff --git a/tensorflow/contrib/image/BUILD b/tensorflow/contrib/image/BUILD
index 2924aef815..79eb3762ed 100755
--- a/tensorflow/contrib/image/BUILD
+++ b/tensorflow/contrib/image/BUILD
@@ -259,6 +259,7 @@ cuda_py_test(
"//tensorflow/core:protos_all_py",
],
data = [":sparse_image_warp_test_data"],
+ tags = ["no_pip"],
)
filegroup(
diff --git a/tensorflow/contrib/kafka/BUILD b/tensorflow/contrib/kafka/BUILD
index efb403462a..1c3974871c 100644
--- a/tensorflow/contrib/kafka/BUILD
+++ b/tensorflow/contrib/kafka/BUILD
@@ -1,66 +1,93 @@
-package(
- default_visibility = ["//visibility:private"],
-)
+package(default_visibility = ["//tensorflow:internal"])
licenses(["notice"]) # Apache 2.0
exports_files(["LICENSE"])
-load("//tensorflow:tensorflow.bzl", "tf_gen_op_libs")
-load("//tensorflow:tensorflow.bzl", "tf_gen_op_wrapper_py")
-load("//tensorflow:tensorflow.bzl", "tf_kernel_library")
-load("//tensorflow:tensorflow.bzl", "tf_py_test")
+load(
+ "//tensorflow:tensorflow.bzl",
+ "tf_gen_op_wrapper_py",
+ "tf_kernel_library",
+ "tf_custom_op_library",
+ "tf_custom_op_py_library",
+ "tf_gen_op_libs",
+ "tf_py_test",
+)
-tf_kernel_library(
- name = "kafka_kernels",
+py_library(
+ name = "kafka",
+ srcs = ["__init__.py"],
+ srcs_version = "PY2AND3",
+ deps = [
+ ":dataset_ops",
+ ],
+)
+
+tf_custom_op_library(
+ name = "_dataset_ops.so",
+ srcs = ["ops/dataset_ops.cc"],
+ deps = [":dataset_kernels"],
+)
+
+tf_gen_op_libs(
+ op_lib_names = ["dataset_ops"],
+)
+
+cc_library(
+ name = "dataset_kernels",
srcs = ["kernels/kafka_dataset_ops.cc"],
- visibility = ["//visibility:public"],
deps = [
- "//tensorflow/core:framework",
- "//tensorflow/core:lib",
- "//tensorflow/core:lib_internal",
- "//tensorflow/core/kernels:bounds_check_lib",
- "//tensorflow/core/kernels:dataset",
+ "//tensorflow/core:framework_headers_lib",
"//third_party/eigen3",
"@kafka",
+ "@protobuf_archive//:protobuf_headers",
],
+ alwayslink = 1,
)
-tf_gen_op_libs(
- op_lib_names = ["kafka_ops"],
+py_library(
+ name = "dataset_ops",
+ srcs = [
+ "python/ops/kafka_dataset_ops.py",
+ ],
+ srcs_version = "PY2AND3",
deps = [
- "//tensorflow/core:lib",
+ ":kafka_op_loader",
+ "//tensorflow/python:dataset_ops_gen",
+ "//tensorflow/python:util",
+ "//tensorflow/python/data/ops:dataset_ops",
+ "//tensorflow/python/data/util:nest",
],
)
tf_gen_op_wrapper_py(
- name = "gen_kafka_ops",
- out = "python/ops/gen_kafka_ops.py",
- require_shape_functions = True,
- deps = [":kafka_ops_op_lib"],
+ name = "gen_dataset_ops",
+ out = "python/ops/gen_dataset_ops.py",
+ deps = ["//tensorflow/contrib/kafka:dataset_ops_op_lib"],
)
-py_library(
- name = "kafka",
- srcs = [
- "__init__.py",
- "python/ops/kafka_dataset_ops.py",
+tf_kernel_library(
+ name = "dataset_ops_kernels",
+ deps = [
+ ":dataset_kernels",
+ "//tensorflow/core:framework",
+ ],
+ alwayslink = 1,
+)
+
+tf_custom_op_py_library(
+ name = "kafka_op_loader",
+ srcs = ["python/ops/kafka_op_loader.py"],
+ dso = ["//tensorflow/contrib/kafka:_dataset_ops.so"],
+ kernels = [
+ ":dataset_ops_kernels",
+ "//tensorflow/contrib/kafka:dataset_ops_op_lib",
],
srcs_version = "PY2AND3",
- visibility = ["//visibility:public"],
deps = [
- ":gen_kafka_ops",
+ ":gen_dataset_ops",
"//tensorflow/contrib/util:util_py",
- "//tensorflow/python:array_ops",
- "//tensorflow/python:control_flow_ops",
- "//tensorflow/python:framework",
- "//tensorflow/python:framework_for_generated_wrappers",
"//tensorflow/python:platform",
- "//tensorflow/python:state_ops",
- "//tensorflow/python:training",
- "//tensorflow/python/data/ops:dataset_ops",
- "//tensorflow/python/data/ops:iterator_ops",
- "//tensorflow/python/data/ops:readers",
],
)
@@ -88,6 +115,7 @@ tf_py_test(
],
tags = [
"manual",
+ "no_windows",
"notap",
],
)
@@ -95,7 +123,9 @@ tf_py_test(
filegroup(
name = "all_files",
srcs = glob(
- ["**/*"],
+ include = [
+ "**/*",
+ ],
exclude = [
"**/METADATA",
"**/OWNERS",
diff --git a/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc b/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
index 88ef5f3571..a4cd4a2cc4 100644
--- a/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
+++ b/tensorflow/contrib/kafka/kernels/kafka_dataset_ops.cc
@@ -13,9 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
-#include "tensorflow/core/kernels/dataset.h"
-
-#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/dataset.h"
#include "src-cpp/rdkafkacpp.h"
diff --git a/tensorflow/contrib/kafka/ops/dataset_ops.cc b/tensorflow/contrib/kafka/ops/dataset_ops.cc
new file mode 100644
index 0000000000..8cdf16103b
--- /dev/null
+++ b/tensorflow/contrib/kafka/ops/dataset_ops.cc
@@ -0,0 +1,44 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/common_shape_fns.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+
+namespace tensorflow {
+
+REGISTER_OP("KafkaDataset")
+ .Input("topics: string")
+ .Input("servers: string")
+ .Input("group: string")
+ .Input("eof: bool")
+ .Input("timeout: int64")
+ .Output("handle: variant")
+ .SetIsStateful()
+ .SetShapeFn(shape_inference::ScalarShape)
+ .Doc(R"doc(
+Creates a dataset that emits the messages of one or more Kafka topics.
+
+topics: A `tf.string` tensor containing one or more subscriptions,
+ in the format of [topic:partition:offset:length],
+ by default length is -1 for unlimited.
+servers: A list of bootstrap servers.
+group: The consumer group id.
+eof: If True, the kafka reader will stop on EOF.
+timeout: The timeout value for the Kafka Consumer to wait
+ (in millisecond).
+)doc");
+
+} // namespace tensorflow
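For context (not part of the patch), a minimal sketch of consuming this op through the Python `KafkaDataset` wrapper modified below; the broker address and topic are placeholders, and the constructor signature is assumed from the wrapper code in this change:

```python
import tensorflow as tf
from tensorflow.contrib.kafka.python.ops.kafka_dataset_ops import KafkaDataset

# Subscription format per the op doc: topic:partition:offset:length
# (length -1 means unlimited).
dataset = KafkaDataset(
    topics=["my_topic:0:0:-1"], servers="localhost:9092",
    group="my_group", eof=True, timeout=1000)
iterator = dataset.make_one_shot_iterator()
next_message = iterator.get_next()

with tf.Session() as sess:
  try:
    while True:
      print(sess.run(next_message))
  except tf.errors.OutOfRangeError:
    pass  # eof=True ends the dataset once the subscribed partitions hit EOF.
```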
diff --git a/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py b/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py
index 8e51d27a34..a1624614d1 100644
--- a/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py
+++ b/tensorflow/contrib/kafka/python/ops/kafka_dataset_ops.py
@@ -17,8 +17,9 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-from tensorflow.contrib.kafka.python.ops import gen_kafka_ops
-from tensorflow.python.data.ops.readers import Dataset
+from tensorflow.contrib.kafka.python.ops import kafka_op_loader # pylint: disable=unused-import
+from tensorflow.contrib.kafka.python.ops import gen_dataset_ops
+from tensorflow.python.data.ops.dataset_ops import Dataset
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.framework import tensor_shape
@@ -58,8 +59,8 @@ class KafkaDataset(Dataset):
timeout, dtype=dtypes.int64, name="timeout")
def _as_variant_tensor(self):
- return gen_kafka_ops.kafka_dataset(self._topics, self._servers, self._group,
- self._eof, self._timeout)
+ return gen_dataset_ops.kafka_dataset(self._topics, self._servers,
+ self._group, self._eof, self._timeout)
@property
def output_classes(self):
diff --git a/tensorflow/contrib/kafka/python/ops/kafka_op_loader.py b/tensorflow/contrib/kafka/python/ops/kafka_op_loader.py
new file mode 100644
index 0000000000..ec2fdea962
--- /dev/null
+++ b/tensorflow/contrib/kafka/python/ops/kafka_op_loader.py
@@ -0,0 +1,24 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Python helper for loading kafka ops and kernels."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.util import loader
+from tensorflow.python.platform import resource_loader
+
+_dataset_ops = loader.load_op_library(
+ resource_loader.get_path_to_datafile("../../_dataset_ops.so"))
diff --git a/tensorflow/contrib/kfac/python/ops/optimizer.py b/tensorflow/contrib/kfac/python/ops/optimizer.py
index dee55cfa39..083da768ec 100644
--- a/tensorflow/contrib/kfac/python/ops/optimizer.py
+++ b/tensorflow/contrib/kfac/python/ops/optimizer.py
@@ -153,7 +153,7 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
raise ValueError("Unsupported momentum type {}. Must be one of {}."
.format(momentum_type, legal_momentum_types))
if momentum_type != "regular" and norm_constraint is not None:
- raise ValueError("Update clipping is only supported with momentum"
+ raise ValueError("Update clipping is only supported with momentum "
"type 'regular'.")
if momentum_type not in ["regular", "adam"] and momentum != 0:
raise ValueError("Momentum must be unspecified if using a momentum_type "
diff --git a/tensorflow/contrib/layers/python/layers/embedding_ops.py b/tensorflow/contrib/layers/python/layers/embedding_ops.py
index b62e3050cd..ffa208540d 100644
--- a/tensorflow/contrib/layers/python/layers/embedding_ops.py
+++ b/tensorflow/contrib/layers/python/layers/embedding_ops.py
@@ -470,7 +470,7 @@ def embedding_lookup_unique(params, ids, name=None):
ids = ops.convert_to_tensor(ids)
shape = array_ops.shape(ids)
ids_flat = array_ops.reshape(
- ids, math_ops.reduce_prod(shape, keep_dims=True))
+ ids, math_ops.reduce_prod(shape, keepdims=True))
unique_ids, idx = array_ops.unique(ids_flat)
unique_embeddings = embedding_ops.embedding_lookup(params, unique_ids)
embeds_flat = array_ops.gather(unique_embeddings, idx)
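The hunk above only renames `keep_dims` to `keepdims`, but for readers unfamiliar with `embedding_lookup_unique`, the flatten/unique/gather pattern it touches can be sketched in plain NumPy (illustrative only, not the contrib API):

```python
import numpy as np

def embedding_lookup_unique_np(params, ids):
  """Look up each distinct id once, then scatter the results back."""
  ids = np.asarray(ids)
  ids_flat = ids.reshape(-1)                               # flatten ids
  unique_ids, idx = np.unique(ids_flat, return_inverse=True)
  unique_embeddings = params[unique_ids]                    # one lookup per distinct id
  embeds_flat = unique_embeddings[idx]                      # gather back to original order
  return embeds_flat.reshape(ids.shape + (params.shape[-1],))

params = np.arange(12, dtype=np.float32).reshape(4, 3)      # 4 ids, embedding dim 3
print(embedding_lookup_unique_np(params, [[0, 2], [2, 0]]).shape)  # (2, 2, 3)
```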
diff --git a/tensorflow/contrib/layers/python/layers/encoders.py b/tensorflow/contrib/layers/python/layers/encoders.py
index 89c9d37bd0..f42112206d 100644
--- a/tensorflow/contrib/layers/python/layers/encoders.py
+++ b/tensorflow/contrib/layers/python/layers/encoders.py
@@ -125,7 +125,7 @@ def embed_sequence(ids,
`reuse` is `None` or `False`.
"""
if not (reuse or (vocab_size and embed_dim)):
- raise ValueError('Must specify vocab size and embedding dimension when not'
+ raise ValueError('Must specify vocab size and embedding dimension when not '
'reusing. Got vocab_size=%s and embed_dim=%s' % (
vocab_size, embed_dim))
with variable_scope.variable_scope(
diff --git a/tensorflow/contrib/learn/BUILD b/tensorflow/contrib/learn/BUILD
index f837ca3265..9c59150580 100644
--- a/tensorflow/contrib/learn/BUILD
+++ b/tensorflow/contrib/learn/BUILD
@@ -5,6 +5,8 @@ licenses(["notice"]) # Apache 2.0
exports_files(["LICENSE"])
+load("//tensorflow:tensorflow.bzl", "py_test")
+
package(default_visibility = [
"//engedu/ml/tf_from_scratch:__pkg__",
"//tensorflow:internal",
@@ -426,7 +428,10 @@ py_test(
size = "medium",
srcs = ["python/learn/estimators/kmeans_test.py"],
srcs_version = "PY2AND3",
- tags = ["noasan"],
+ tags = [
+ "noasan", # b/73741358
+ "nomac",
+ ],
deps = [
":learn",
"//tensorflow/python:array_ops",
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator.py b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
index d8ccb1e7dc..7a026a15e4 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
@@ -917,8 +917,8 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
if feed_fn:
hooks.append(basic_session_run_hooks.FeedFnHook(feed_fn))
if steps == 0:
- logging.warning('evaluation steps are 0. If `input_fn` does not raise'
- 'OutOfRangeError`, the evaluation will never stop.'
+ logging.warning('evaluation steps are 0. If `input_fn` does not raise '
+ '`OutOfRangeError`, the evaluation will never stop. '
'Use steps=None if intended.')
if steps:
hooks.append(
diff --git a/tensorflow/contrib/learn/python/learn/experiment.py b/tensorflow/contrib/learn/python/learn/experiment.py
index 9a7c4cd685..3744abd860 100644
--- a/tensorflow/contrib/learn/python/learn/experiment.py
+++ b/tensorflow/contrib/learn/python/learn/experiment.py
@@ -358,7 +358,7 @@ class Experiment(object):
self._start_server()
elif config.cluster_spec and config.master:
raise ValueError(
- "For distributed runtime, Experiment class only works with"
+ "For distributed runtime, Experiment class only works with "
"tf.contrib.learn.RunConfig for now, but provided {}".format(
type(config)))
diff --git a/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py b/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
index b3b067b8e1..8f9811cf25 100644
--- a/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
@@ -61,7 +61,7 @@ def embedding_lookup(params, ids, name='embedding_lookup'):
ids = ops.convert_to_tensor(ids)
shape = array_ops_.shape(ids)
ids_flat = array_ops_.reshape(
- ids, math_ops.reduce_prod(shape, keep_dims=True))
+ ids, math_ops.reduce_prod(shape, keepdims=True))
embeds_flat = nn.embedding_lookup(params, ids_flat, name)
embed_shape = array_ops_.concat([shape, [-1]], 0)
embeds = array_ops_.reshape(embeds_flat, embed_shape)
diff --git a/tensorflow/contrib/lite/Makefile b/tensorflow/contrib/lite/Makefile
index 7f31629272..b4504f246a 100644
--- a/tensorflow/contrib/lite/Makefile
+++ b/tensorflow/contrib/lite/Makefile
@@ -27,10 +27,10 @@ LIBDIR := $(MAKEFILE_DIR)/gen/lib/
GENDIR := $(MAKEFILE_DIR)/gen/obj/
# Settings for the host compiler.
-CXX := $(CC_PREFIX) gcc
+CXX := $(CC_PREFIX)gcc
CXXFLAGS := --std=c++11 -O3 -DNDEBUG
-CC := $(CC_PREFIX) gcc
-CFLAGS :=
+CC := $(CC_PREFIX)gcc
+CFLAGS := -O3 -DNDEBUG
LDOPTS :=
LDOPTS += -L/usr/local/lib
ARFLAGS := -r
@@ -57,10 +57,11 @@ LIBS := \
# If we're on Linux, also link in the dl library.
ifeq ($(HOST_OS),LINUX)
- LIBS += -ldl -lpthread
+ LIBS += -ldl
endif
include $(MAKEFILE_DIR)/ios_makefile.inc
+include $(MAKEFILE_DIR)/rpi_makefile.inc
# This library is the main target for this makefile. It will contain a minimal
# runtime that can be linked in to other programs.
diff --git a/tensorflow/contrib/lite/README.md b/tensorflow/contrib/lite/README.md
index df8c1c623c..2680d515eb 100644
--- a/tensorflow/contrib/lite/README.md
+++ b/tensorflow/contrib/lite/README.md
@@ -99,7 +99,7 @@ Similar to the Android demo app, there's an iOS camera app that uses exactly the
This demo app requires a camera so it doesn't work with simulators. It needs to be executed on a real iOS device. Follow the instructions to build and run the demo app:
-1. Run `third_party/tensorflow/contrib/lite/examples/ios/download_models.sh` to download the model files used by the demo app.
+1. Run `tensorflow/contrib/lite/examples/ios/download_models.sh` to download the model files used by the demo app.
1. Install [CocoaPods](https://cocoapods.org/) if it wasn't installed yet: `sudo gem install cocoapods`.
1. Run `pod install` in `tensorflow/contrib/lite/examples/ios/camera` to generate the workspace file.
1. Open the project by running `open tflite_camera_example.xcworkspace`, and build the app in Xcode.
@@ -165,7 +165,7 @@ bazel-bin/tensorflow/python/tools/freeze_graph\
--input_graph=/tmp/mobilenet_v1_224.pb \
--input_checkpoint=/tmp/checkpoints/mobilenet-10202.ckpt \
--input_binary=true --output_graph=/tmp/frozen_mobilenet_v1_224.pb \
- --output_node_names=MobileNet/Predictions/Reshape_1
+ --output_node_names=MobilenetV1/Predictions/Reshape_1
```
The user has to first build the freeze_graph script using bazel and then run the script. The input_binary flag has to be enabled to ensure that the protobuf is read and written in binary format. The user has to input the .pb and the .ckpt files to freeze the graph. The output_node_names may not be obvious outside of the code that built the model. The easiest way to find them is to visualize the graph, either with
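The same freeze step can also be driven from Python instead of the command line; a minimal sketch using the TF 1.x `freeze_graph` helper with the paths from the command above (the `restore_op_name` and `filename_tensor_name` values shown here are the conventional defaults and are assumptions, not part of this change):

```python
from tensorflow.python.tools import freeze_graph

# Mirrors the freeze_graph command-line invocation shown above.
freeze_graph.freeze_graph(
    input_graph="/tmp/mobilenet_v1_224.pb",
    input_saver="",
    input_binary=True,
    input_checkpoint="/tmp/checkpoints/mobilenet-10202.ckpt",
    output_node_names="MobilenetV1/Predictions/Reshape_1",
    restore_op_name="save/restore_all",
    filename_tensor_name="save/Const:0",
    output_graph="/tmp/frozen_mobilenet_v1_224.pb",
    clear_devices=True,
    initializer_nodes="")
```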
diff --git a/tensorflow/contrib/lite/arena_planner.h b/tensorflow/contrib/lite/arena_planner.h
index 58bc164619..f84b3dad95 100644
--- a/tensorflow/contrib/lite/arena_planner.h
+++ b/tensorflow/contrib/lite/arena_planner.h
@@ -33,7 +33,7 @@ class AllocationInfo;
// each tensor needs to be allocated and deallocated, and preallocates all the
// necessary memory (the PlanAllocations phase). It then assigns portions of
// this memory buffer to each tensor (the ExecuteAllocations phase). Tensors may
-// share some of the bufer if a tensor B is to be allocated after another tensor
+// share some of the buffer if a tensor B is to be allocated after another tensor
// A has been deallocated.
//
// If dynamic tensors are used the planning steps can be repeated during model
diff --git a/tensorflow/contrib/lite/build_rpi_lib.sh b/tensorflow/contrib/lite/build_rpi_lib.sh
new file mode 100755
index 0000000000..3824b16412
--- /dev/null
+++ b/tensorflow/contrib/lite/build_rpi_lib.sh
@@ -0,0 +1,22 @@
+#!/bin/bash -x
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR/../../.."
+
+CC_PREFIX=arm-linux-gnueabihf- make -j 3 -f tensorflow/contrib/lite/Makefile TARGET=RPI TARGET_ARCH=armv7
diff --git a/tensorflow/contrib/lite/error_reporter.h b/tensorflow/contrib/lite/error_reporter.h
index da193d2586..3c5f805f12 100644
--- a/tensorflow/contrib/lite/error_reporter.h
+++ b/tensorflow/contrib/lite/error_reporter.h
@@ -30,7 +30,7 @@ namespace tflite {
// va_list args;
// foo.Report("test %d", args); // where args is va_list
//
-// Sublclass ErrorReporter to provide another reporting destination.
+// Subclass ErrorReporter to provide another reporting destination.
// For example, if you have a GUI program, you might redirect to a buffer
// that drives a GUI error log box.
class ErrorReporter {
diff --git a/tensorflow/contrib/lite/g3doc/ios.md b/tensorflow/contrib/lite/g3doc/ios.md
index a359b8d4b4..e0358a444d 100644
--- a/tensorflow/contrib/lite/g3doc/ios.md
+++ b/tensorflow/contrib/lite/g3doc/ios.md
@@ -22,6 +22,15 @@ Then install
brew install automake
brew install libtool
```
+If either automake or libtool installs but does not link correctly, you'll first need to run:
+```bash
+sudo chown -R $(whoami) /usr/local/*
+```
+Then follow the instructions to perform the linking:
+```bash
+brew link automake
+brew link libtool
+```
Then you need to run a shell script to download the dependencies you need:
diff --git a/tensorflow/contrib/lite/g3doc/rpi.md b/tensorflow/contrib/lite/g3doc/rpi.md
new file mode 100644
index 0000000000..7a3a231626
--- /dev/null
+++ b/tensorflow/contrib/lite/g3doc/rpi.md
@@ -0,0 +1,50 @@
+# TensorFlow Lite for Raspberry Pi
+
+## Cross compiling
+### Installing the toolchain
+This has been tested on Ubuntu 16.04.3 64-bit and the TensorFlow devel Docker image [tensorflow/tensorflow:nightly-devel](https://hub.docker.com/r/tensorflow/tensorflow/tags/).
+
+To cross compile TensorFlow Lite, first install the toolchain and libs:
+```bash
+sudo apt-get update
+sudo apt-get install crossbuild-essential-armhf
+```
+> If you are using Docker, you may not need `sudo`.
+
+### Building
+Clone this TensorFlow repository and run this script at the root of the repository to download all the dependencies:
+> If you are using the `tensorflow/tensorflow:nightly-devel` Docker image, the TensorFlow repository is already in `/tensorflow`, so you can run the script from there.
+```bash
+./tensorflow/contrib/lite/download_dependencies.sh
+```
+Note that you only need to do this once.
+
+You should then be able to compile:
+```bash
+./tensorflow/contrib/lite/build_rpi_lib.sh
+```
+
+This should compile a static library in:
+`tensorflow/contrib/lite/gen/lib/rpi_armv7/libtensorflow-lite.a`.
+
+## Native compiling
+This has been tested on Raspberry Pi 3b, Raspbian GNU/Linux 9.1 (stretch), gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1).
+
+Log in to your Raspberry Pi and install the toolchain:
+```bash
+sudo apt-get install build-essential
+```
+
+First, clone this TensorFlow repository. Run this at the root of the repository:
+```bash
+./tensorflow/contrib/lite/download_dependencies.sh
+```
+Note that you only need to do this once.
+
+You should then be able to compile:
+```bash
+./tensorflow/contrib/lite/build_rpi_lib.sh
+```
+
+This should compile a static library in:
+`tensorflow/contrib/lite/gen/lib/rpi_armv7/libtensorflow-lite.a`.
diff --git a/tensorflow/contrib/lite/interpreter.h b/tensorflow/contrib/lite/interpreter.h
index 788546fd60..77db178783 100644
--- a/tensorflow/contrib/lite/interpreter.h
+++ b/tensorflow/contrib/lite/interpreter.h
@@ -493,7 +493,7 @@ class Interpreter {
// During Invoke(), Interpreter will allocate input tensors first, which are
// known to be fixed size. Then it will allocate outputs from nodes as many
// as possible. When there is a node that produces dynamic sized tensor.
- // Intepreter will stop allocating tensors, set the value of next allocate
+ // Interpreter will stop allocating tensors, set the value of next allocate
// node id, and execute the node to generate the output tensor before continue
// to allocate successors. This process repeats until all nodes are executed.
// NOTE: this relies on the order of nodes that is in topological order.
diff --git a/tensorflow/contrib/lite/interpreter_test.cc b/tensorflow/contrib/lite/interpreter_test.cc
index efb29d5c9d..131e088079 100644
--- a/tensorflow/contrib/lite/interpreter_test.cc
+++ b/tensorflow/contrib/lite/interpreter_test.cc
@@ -42,7 +42,7 @@ TEST(BasicInterpreter, InvokeInvalidModel) {
ASSERT_EQ(interpreter.Invoke(), kTfLiteOk);
}
-// Test size accesser functions.
+// Test size accessor functions.
TEST(BasicInterpreter, TestSizeFunctions) {
Interpreter interpreter;
int base_index;
diff --git a/tensorflow/contrib/lite/kernels/conv.cc b/tensorflow/contrib/lite/kernels/conv.cc
index b91ba1a03d..e0cd12f1b4 100644
--- a/tensorflow/contrib/lite/kernels/conv.cc
+++ b/tensorflow/contrib/lite/kernels/conv.cc
@@ -64,7 +64,7 @@ struct OpData {
TfLitePaddingValues padding;
// The scaling factor from input to output (aka the 'real multiplier') can
- // be represented as a fixed point multipler plus a left shift.
+ // be represented as a fixed point multiplier plus a left shift.
int32_t output_multiplier;
int output_shift;
// The range of the fused activation layer. For example for kNone and
diff --git a/tensorflow/contrib/lite/kernels/depthwise_conv.cc b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
index 15dbfe08c8..cad9ce114c 100644
--- a/tensorflow/contrib/lite/kernels/depthwise_conv.cc
+++ b/tensorflow/contrib/lite/kernels/depthwise_conv.cc
@@ -52,7 +52,7 @@ enum KernelType {
struct OpData {
TfLitePaddingValues padding;
// The scaling factor from input to output (aka the 'real multiplier') can
- // be represented as a fixed point multipler plus a left shift.
+ // be represented as a fixed point multiplier plus a left shift.
int32_t output_multiplier;
int output_shift;
// The range of the fused activation layer. For example for kNone and
diff --git a/tensorflow/contrib/lite/kernels/fully_connected.cc b/tensorflow/contrib/lite/kernels/fully_connected.cc
index a77fe94e49..888e67966c 100644
--- a/tensorflow/contrib/lite/kernels/fully_connected.cc
+++ b/tensorflow/contrib/lite/kernels/fully_connected.cc
@@ -48,7 +48,7 @@ enum KernelType {
struct OpData {
// The scaling factor from input to output (aka the 'real multiplier') can
- // be represented as a fixed point multipler plus a left shift.
+ // be represented as a fixed point multiplier plus a left shift.
int32_t output_multiplier;
int output_shift;
// The range of the fused activation layer. For example for kNone and
diff --git a/tensorflow/contrib/lite/kernels/internal/spectrogram.cc b/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
index 0e481a9d40..4eddf7bf0a 100644
--- a/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
+++ b/tensorflow/contrib/lite/kernels/internal/spectrogram.cc
@@ -15,6 +15,7 @@ limitations under the License.
#include "tensorflow/contrib/lite/kernels/internal/spectrogram.h"
+#include <assert.h>
#include <math.h>
#include "third_party/fft2d/fft.h"
diff --git a/tensorflow/contrib/lite/kernels/kernel_util.h b/tensorflow/contrib/lite/kernels/kernel_util.h
index 28f53b9fbb..21da1daff7 100644
--- a/tensorflow/contrib/lite/kernels/kernel_util.h
+++ b/tensorflow/contrib/lite/kernels/kernel_util.h
@@ -58,7 +58,7 @@ inline bool IsConstantTensor(TfLiteTensor* tensor) {
}
// Determines whether tensor is dynamic. Note that a tensor can be non-const and
-// not dynamic. This function specificially checks for a dynamic tensor.
+// not dynamic. This function specifically checks for a dynamic tensor.
inline bool IsDynamicTensor(TfLiteTensor* tensor) {
return tensor->allocation_type == kTfLiteDynamic;
}
diff --git a/tensorflow/contrib/lite/kernels/lsh_projection.cc b/tensorflow/contrib/lite/kernels/lsh_projection.cc
index 5f73b56ed9..0ee35775d5 100644
--- a/tensorflow/contrib/lite/kernels/lsh_projection.cc
+++ b/tensorflow/contrib/lite/kernels/lsh_projection.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
-// LSH Projection projects an input to a bit vector via locality senstive
+// LSH Projection projects an input to a bit vector via locality sensitive
// hashing.
//
// Options:
diff --git a/tensorflow/contrib/lite/kernels/lstm.cc b/tensorflow/contrib/lite/kernels/lstm.cc
index b9255b23a5..8cf1165135 100644
--- a/tensorflow/contrib/lite/kernels/lstm.cc
+++ b/tensorflow/contrib/lite/kernels/lstm.cc
@@ -213,9 +213,9 @@ TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
// present.
// 2) If projection weight is present, then projection bias is optional.
// TODO(ghodrat): make sure this is correct.
- const bool projecton_tensors_consistent =
+ const bool projection_tensors_consistent =
((projection_weights != nullptr) || (projection_bias == nullptr));
- TF_LITE_ENSURE(context, projecton_tensors_consistent == true);
+ TF_LITE_ENSURE(context, projection_tensors_consistent == true);
return kTfLiteOk;
}
@@ -357,7 +357,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
const int n_output = recurrent_to_output_weights->dims->data[1];
// Since we have already checked that weights are all there or none, we can
- // check the existense of only one to the get the condition.
+ // check the existence of only one to get the condition.
const bool use_cifg = (input_to_input_weights == nullptr);
const bool use_peephole = (cell_to_output_weights != nullptr);
diff --git a/tensorflow/contrib/lite/kernels/reshape.cc b/tensorflow/contrib/lite/kernels/reshape.cc
index f3e6ddc9f4..438f70d311 100644
--- a/tensorflow/contrib/lite/kernels/reshape.cc
+++ b/tensorflow/contrib/lite/kernels/reshape.cc
@@ -49,20 +49,20 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
TfLiteIntArray* output_size = TfLiteIntArrayCreate(params->num_dimensions);
int num_output_elements = 1;
- int strech_dim = -1;
+ int stretch_dim = -1;
for (int i = 0; i < params->num_dimensions; ++i) {
int value = params->shape[i];
if (value == -1) {
- TF_LITE_ENSURE_EQ(context, strech_dim, -1);
- strech_dim = i;
+ TF_LITE_ENSURE_EQ(context, stretch_dim, -1);
+ stretch_dim = i;
} else {
num_output_elements *= value;
output_size->data[i] = value;
}
}
- if (strech_dim != -1) {
- output_size->data[strech_dim] = num_input_elements / num_output_elements;
- num_output_elements *= output_size->data[strech_dim];
+ if (stretch_dim != -1) {
+ output_size->data[stretch_dim] = num_input_elements / num_output_elements;
+ num_output_elements *= output_size->data[stretch_dim];
}
TF_LITE_ENSURE_EQ(context, num_input_elements, num_output_elements);
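For intuition, the stretch-dimension handling renamed above resolves a single `-1` entry the same way a NumPy-style reshape does; a small Python sketch of the same bookkeeping (a hypothetical helper, not TFLite code):

```python
def resolve_output_shape(num_input_elements, requested_shape):
  """Resolve a single -1 ("stretch") dimension, mirroring the kernel logic."""
  stretch_dim = -1
  num_output_elements = 1
  output_shape = list(requested_shape)
  for i, value in enumerate(requested_shape):
    if value == -1:
      assert stretch_dim == -1, "only one stretch dimension is allowed"
      stretch_dim = i
    else:
      num_output_elements *= value
  if stretch_dim != -1:
    output_shape[stretch_dim] = num_input_elements // num_output_elements
    num_output_elements *= output_shape[stretch_dim]
  assert num_input_elements == num_output_elements
  return output_shape

print(resolve_output_shape(8, [2, -1]))  # [2, 4]
```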
diff --git a/tensorflow/contrib/lite/kernels/reshape_test.cc b/tensorflow/contrib/lite/kernels/reshape_test.cc
index 0fbcf6e6aa..aecbd0399f 100644
--- a/tensorflow/contrib/lite/kernels/reshape_test.cc
+++ b/tensorflow/contrib/lite/kernels/reshape_test.cc
@@ -60,7 +60,7 @@ TEST(ReshapeOpTest, TooManyDimensions) {
TEST(ReshapeOpTest, TooManySpecialDimensions) {
EXPECT_DEATH(ReshapeOpModel({1, 2, 4, 1}, {-1, -1, 2, 4}),
- "strech_dim != -1");
+ "stretch_dim != -1");
}
TEST(ReshapeOpTest, SimpleTest) {
diff --git a/tensorflow/contrib/lite/kernels/test_util.cc b/tensorflow/contrib/lite/kernels/test_util.cc
index 373310bd87..0bb28b50b2 100644
--- a/tensorflow/contrib/lite/kernels/test_util.cc
+++ b/tensorflow/contrib/lite/kernels/test_util.cc
@@ -141,8 +141,8 @@ void SingleOpModel::SetBuiltinOp(BuiltinOperator type,
void SingleOpModel::SetCustomOp(
const string& name, const std::vector<uint8_t>& custom_option,
- const std::function<TfLiteRegistration*()>& registeration) {
- custom_registrations_[name] = registeration;
+ const std::function<TfLiteRegistration*()>& registration) {
+ custom_registrations_[name] = registration;
opcodes_.push_back(
CreateOperatorCodeDirect(builder_, BuiltinOperator_CUSTOM, name.data()));
operators_.push_back(CreateOperator(
diff --git a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
index 508a570e2e..42941a97db 100644
--- a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
+++ b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
@@ -360,7 +360,7 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
const int n_output = recurrent_to_output_weights->dims->data[1];
// Since we have already checked that weights are all there or none, we can
- // check the existense of only one to the get the condition.
+ // check the existence of only one to get the condition.
const bool use_cifg = (input_to_input_weights == nullptr);
const bool use_peephole = (cell_to_output_weights != nullptr);
diff --git a/tensorflow/contrib/lite/memory_planner.h b/tensorflow/contrib/lite/memory_planner.h
index 5cd6c20850..0294ec815c 100644
--- a/tensorflow/contrib/lite/memory_planner.h
+++ b/tensorflow/contrib/lite/memory_planner.h
@@ -34,8 +34,8 @@ class MemoryPlanner {
// [first_node, last_node].
virtual TfLiteStatus ExecuteAllocations(int first_node, int last_node) = 0;
- // Invalidates allocations made earliers. This is called when tensors sizes
- // have change. All planned allocations remain, but can't be used until
+ // Invalidates allocations made earlier. This is called when tensors sizes
+ // have changed. All planned allocations remain, but can't be used until
// ExecuteAllocations() is called.
virtual TfLiteStatus ResetAllocations() = 0;
};
diff --git a/tensorflow/contrib/lite/model.h b/tensorflow/contrib/lite/model.h
index 0c777760cb..036dc46e03 100644
--- a/tensorflow/contrib/lite/model.h
+++ b/tensorflow/contrib/lite/model.h
@@ -81,7 +81,7 @@ class FlatBufferModel {
const tflite::Model* model_spec,
ErrorReporter* error_reporter = DefaultErrorReporter());
- // Releases memory or unmaps mmaped meory.
+ // Releases memory or unmaps mmaped memory.
~FlatBufferModel();
// Copying or assignment is disallowed to simplify ownership semantics.
diff --git a/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h b/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h
index 76032771af..bd49d327c9 100644
--- a/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h
+++ b/tensorflow/contrib/lite/nnapi/NeuralNetworksShim.h
@@ -569,7 +569,7 @@ enum {
ANEURALNETWORKS_LOGISTIC = 14,
/**
- * Projects an input to a bit vector via locality senstive hashing.
+ * Projects an input to a bit vector via locality sensitive hashing.
*
* Inputs:
* * 0: Hash functions. Dim.size == 2, DataType: Float.
diff --git a/tensorflow/contrib/lite/rpi_makefile.inc b/tensorflow/contrib/lite/rpi_makefile.inc
new file mode 100644
index 0000000000..832ef5824b
--- /dev/null
+++ b/tensorflow/contrib/lite/rpi_makefile.inc
@@ -0,0 +1,33 @@
+# Settings for Raspberry Pi.
+ifeq ($(TARGET), RPI)
+ ifeq ($(TARGET_ARCH), armv7)
+ CXXFLAGS += \
+ -march=armv7-a \
+ -mfpu=neon-vfpv4 \
+ -funsafe-math-optimizations \
+ -ftree-vectorize
+
+ CCFLAGS += \
+ -march=armv7-a \
+ -mfpu=neon-vfpv4 \
+ -funsafe-math-optimizations \
+ -ftree-vectorize
+
+ LDFLAGS := \
+ -Wl,--no-export-dynamic \
+ -Wl,--exclude-libs,ALL \
+ -Wl,--gc-sections \
+ -Wl,--as-needed
+ endif
+
+ LIBS := \
+ -lstdc++ \
+ -lpthread \
+ -lm \
+ -ldl
+
+ OBJDIR := $(OBJDIR)rpi_$(TARGET_ARCH)/
+ LIBDIR := $(LIBDIR)rpi_$(TARGET_ARCH)/
+ BINDIR := $(BINDIR)rpi_$(TARGET_ARCH)/
+ DEPDIR := $(DEPDIR)rpi_$(TARGET_ARCH)/
+endif
diff --git a/tensorflow/contrib/lite/schema/upgrade_schema.py b/tensorflow/contrib/lite/schema/upgrade_schema.py
index 94f5730be5..e0b36d3d3e 100644
--- a/tensorflow/contrib/lite/schema/upgrade_schema.py
+++ b/tensorflow/contrib/lite/schema/upgrade_schema.py
@@ -39,8 +39,8 @@ import tensorflow as tf
from tensorflow.python.platform import resource_loader
parser = argparse.ArgumentParser(
- description="Script to move TFLite models from pre-release schema to"
- " new schema.")
+ description="Script to move TFLite models from pre-release schema to "
+ "new schema.")
parser.add_argument(
"input",
type=str,
@@ -48,7 +48,7 @@ parser.add_argument(
parser.add_argument(
"output",
type=str,
- help="Output json or bin TensorFlow lite model compliant with"
+ help="Output json or bin TensorFlow lite model compliant with "
"the new schema. Extension must be `.json`, `.bin` or `.tflite`.")
@@ -258,7 +258,7 @@ class Converter(object):
# Check if builtin_code is the appropriate string type
# use type("") instead of str or unicode. for py2and3
if not isinstance(operator_code["builtin_code"], type(u"")):
- raise ValueError("builtin_code %r is non-string. this usually means"
+ raise ValueError("builtin_code %r is non-string. this usually means "
"your model has consistency problems." %
(operator_code["builtin_code"]))
operator_code["builtin_code"] = (RemapOperator(
diff --git a/tensorflow/contrib/lite/simple_memory_arena.cc b/tensorflow/contrib/lite/simple_memory_arena.cc
index 4aab244989..2f2004f56b 100644
--- a/tensorflow/contrib/lite/simple_memory_arena.cc
+++ b/tensorflow/contrib/lite/simple_memory_arena.cc
@@ -113,21 +113,21 @@ TfLiteStatus SimpleMemoryArena::Commit(TfLiteContext* context) {
underlying_buffer_size_ = required_size;
underlying_buffer_aligned_ptr_ = new_underlying_buffer_aligned_ptr;
}
- commited_ = true;
+ committed_ = true;
return underlying_buffer_ != nullptr ? kTfLiteOk : kTfLiteError;
}
TfLiteStatus SimpleMemoryArena::ResolveAlloc(TfLiteContext* context,
const ArenaAlloc& alloc,
char** output_ptr) {
- TF_LITE_ENSURE(context, commited_);
+ TF_LITE_ENSURE(context, committed_);
TF_LITE_ENSURE(context, output_ptr != nullptr);
*output_ptr = underlying_buffer_aligned_ptr_ + alloc.offset;
return kTfLiteOk;
}
TfLiteStatus SimpleMemoryArena::Clear() {
- commited_ = false;
+ committed_ = false;
high_water_mark_ = 0;
allocs_.clear();
return kTfLiteOk;
diff --git a/tensorflow/contrib/lite/simple_memory_arena.h b/tensorflow/contrib/lite/simple_memory_arena.h
index 0535522374..5faf78b59e 100644
--- a/tensorflow/contrib/lite/simple_memory_arena.h
+++ b/tensorflow/contrib/lite/simple_memory_arena.h
@@ -22,7 +22,7 @@ limitations under the License.
namespace tflite {
// This little structure holds the offset and the size for a dynamic memory
-// allocation in the memory arena. When the arena is commited and the
+// allocation in the memory arena. When the arena is committed and the
// underlying buffer is set, the alloc can be resolved into an actual memory
// pointer.
struct ArenaAlloc {
@@ -43,7 +43,7 @@ struct ArenaAlloc {
class SimpleMemoryArena {
public:
explicit SimpleMemoryArena(size_t arena_alignment)
- : commited_(false),
+ : committed_(false),
arena_alignment_(arena_alignment),
high_water_mark_(0),
underlying_buffer_size_(0),
@@ -73,7 +73,7 @@ class SimpleMemoryArena {
}
private:
- bool commited_;
+ bool committed_;
size_t arena_alignment_;
size_t high_water_mark_;
std::unique_ptr<char[]> underlying_buffer_;
diff --git a/tensorflow/contrib/makefile/README.md b/tensorflow/contrib/makefile/README.md
index 995230dfa8..6c3b02e12b 100644
--- a/tensorflow/contrib/makefile/README.md
+++ b/tensorflow/contrib/makefile/README.md
@@ -194,6 +194,8 @@ with:
srcs = glob(["libs/arm64-v8a/*.so"]),
```
+If you are building for Android TV (Shield TV devices), replace "portrait" with "landscape" for `android:screenOrientation` in all four activities in `tensorflow/examples/android/AndroidManifest.xml`.
+
Then run:
```bash
# Create dir for native libs
diff --git a/tensorflow/contrib/makefile/build_all_ios.sh b/tensorflow/contrib/makefile/build_all_ios.sh
index 2d99791839..0a458a27b3 100755
--- a/tensorflow/contrib/makefile/build_all_ios.sh
+++ b/tensorflow/contrib/makefile/build_all_ios.sh
@@ -80,10 +80,9 @@ if [[ ! -z "${OPTIMIZE_FOR_GRAPH}" ]]; then
fi
else
echo "${PRNT_SLCTV_BIN} found. Using it"
- ${PRNT_SLCTV_BIN} --graphs=${OPTIMIZE_FOR_GRAPH} > ${TOP_SRCDIR}/tensorflow/core/framework/ops_to_register.h
-
fi
+ ${PRNT_SLCTV_BIN} --graphs=${OPTIMIZE_FOR_GRAPH} > ${TOP_SRCDIR}/tensorflow/core/framework/ops_to_register.h
fi
if [[ "${ONLY_MAKE_TENSORFLOW}" != "true" ]]; then
@@ -111,7 +110,7 @@ if [[ -z "${BUILD_ARCH}" ]]; then
TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios`
else
# arch specified so build just that
- TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios -a ${BUILD_ARCH}`
+ TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios -a "${BUILD_ARCH}"`
fi
export HOST_NSYNC_LIB TARGET_NSYNC_LIB
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops.py b/tensorflow/contrib/metrics/python/ops/metric_ops.py
index 0fee584f8e..81f05e7ce5 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops.py
@@ -3647,7 +3647,7 @@ def cohen_kappa(labels,
RuntimeError: If eager execution is enabled.
"""
if context.executing_eagerly():
- raise RuntimeError('tf.contrib.metrics.cohen_kappa is not supported'
+ raise RuntimeError('tf.contrib.metrics.cohen_kappa is not supported '
'when eager execution is enabled.')
if num_classes < 2:
raise ValueError('`num_classes` must be >= 2.'
diff --git a/tensorflow/contrib/model_pruning/python/layers/layers.py b/tensorflow/contrib/model_pruning/python/layers/layers.py
index 988748ad75..466daf204a 100644
--- a/tensorflow/contrib/model_pruning/python/layers/layers.py
+++ b/tensorflow/contrib/model_pruning/python/layers/layers.py
@@ -214,7 +214,7 @@ def masked_convolution(inputs,
elif data_format == 'NCHW':
df = 'channels_first'
else:
- raise ValueError('Unsupported data fromat', data_format)
+ raise ValueError('Unsupported data format', data_format)
layer = layer_class(
filters=num_outputs,
diff --git a/tensorflow/contrib/model_pruning/python/pruning.py b/tensorflow/contrib/model_pruning/python/pruning.py
index 86963be4b8..5146a4a2de 100644
--- a/tensorflow/contrib/model_pruning/python/pruning.py
+++ b/tensorflow/contrib/model_pruning/python/pruning.py
@@ -216,7 +216,7 @@ def _partitioned_variable_assign(partitioned_var, new_value):
"""Assign op for partitioned variables.
Args:
- partitioned_var: A partitioned tensotflow variable
+ partitioned_var: A partitioned tensorflow variable
new_value: Value to be assigned to the variable var
Returns:
diff --git a/tensorflow/contrib/mpi/mpi_utils.h b/tensorflow/contrib/mpi/mpi_utils.h
index fa297c28cb..df055ff567 100644
--- a/tensorflow/contrib/mpi/mpi_utils.h
+++ b/tensorflow/contrib/mpi/mpi_utils.h
@@ -24,6 +24,8 @@ limitations under the License.
#include "tensorflow/core/lib/strings/str_util.h"
+// Skip MPI C++ bindings support; this matches the usage in other places.
+#define OMPI_SKIP_MPICXX
#include "third_party/mpi/mpi.h"
#define MPI_CHECK(cmd) \
do { \
diff --git a/tensorflow/contrib/predictor/predictor_factories.py b/tensorflow/contrib/predictor/predictor_factories.py
index 04b5d5bdf1..6e77e934fe 100644
--- a/tensorflow/contrib/predictor/predictor_factories.py
+++ b/tensorflow/contrib/predictor/predictor_factories.py
@@ -53,7 +53,7 @@ def from_contrib_estimator(estimator,
`Estimator`.
"""
if isinstance(estimator, core_estimator.Estimator):
- raise TypeError('Espected estimator to be of type '
+ raise TypeError('Expected estimator to be of type '
'tf.contrib.learn.Estimator, but got type '
'tf.python.estimator.Estimator. You likely want to call '
'from_estimator.')
@@ -88,7 +88,7 @@ def from_estimator(estimator,
`Estimator`.
"""
if isinstance(estimator, contrib_estimator.Estimator):
- raise TypeError('Espected estimator to be of type '
+ raise TypeError('Expected estimator to be of type '
'tf.python.estimator.Estimator, but got type '
'tf.contrib.learn.Estimator. You likely want to call '
'from_contrib_estimator.')
diff --git a/tensorflow/contrib/py2tf/converters/single_return.py b/tensorflow/contrib/py2tf/converters/single_return.py
index 90bc22008f..1194b98f5e 100644
--- a/tensorflow/contrib/py2tf/converters/single_return.py
+++ b/tensorflow/contrib/py2tf/converters/single_return.py
@@ -212,7 +212,7 @@ class DetectReturnInUnsupportedControlFlow(gast.NodeVisitor):
def __init__(self):
self.cant_return = False
- super(gast.NodeVisitor, self).__init__()
+ super(DetectReturnInUnsupportedControlFlow, self).__init__()
def visit_While(self, node):
self.cant_return = True
diff --git a/tensorflow/contrib/quantize/python/fold_batch_norms.py b/tensorflow/contrib/quantize/python/fold_batch_norms.py
index e8a0d41425..5750be6f4c 100644
--- a/tensorflow/contrib/quantize/python/fold_batch_norms.py
+++ b/tensorflow/contrib/quantize/python/fold_batch_norms.py
@@ -237,7 +237,7 @@ def _FindFusedBatchNorms(graph):
# The batch variance used during forward and backward prop is biased,
# i.e it is calculated as: V=sum(x(k)-mu)^2/N. For the moving average
# calculation, the variance is corrected by the term N/N-1 (Bessel's
- # correction). The variance tensor read from FuseBatchNorm has bessel's
+ # correction). The variance tensor read from FuseBatchNorm has Bessel's
# correction applied, so we undo it here.
scope, sep, _ = bn_op.name.rpartition('/')
g = ops.get_default_graph()
@@ -306,7 +306,7 @@ def _ComputeBatchNormCorrections(context, match, freeze_batch_norm_delay,
Args:
context: The scope under which we look for batch norm params
- match: Object containg required batch norm tensors for correction
+ match: Object containing required batch norm tensors for correction
computation.
freeze_batch_norm_delay: Delay in steps at which computation switches
from regular batch norm to frozen mean and variance.
diff --git a/tensorflow/contrib/quantize/python/quant_ops.py b/tensorflow/contrib/quantize/python/quant_ops.py
index 0a8e35080c..a4f7b1b221 100644
--- a/tensorflow/contrib/quantize/python/quant_ops.py
+++ b/tensorflow/contrib/quantize/python/quant_ops.py
@@ -282,8 +282,8 @@ def _FakeQuantWithMinMaxVars(inputs, min_var, max_var, per_channel, num_bits,
Args:
inputs: a tensor containing values to be quantized.
min_var: a variable containing quantization range lower end(s).
- max_var: a variable containing quantization range lupper end(s).
- per_channel: a boolean specifying whether to use per-channel quantizatioh.
+ max_var: a variable containing quantization range upper end(s).
+ per_channel: a boolean specifying whether to use per-channel quantization.
num_bits: Number of bits to use for quantization, must be between 2 and 8.
narrow_range: Whether to use the narrow quantization range
[1; 2^num_bits - 1] or wide range [0; 2^num_bits - 1].
diff --git a/tensorflow/contrib/quantize/python/quantize.py b/tensorflow/contrib/quantize/python/quantize.py
index 2b5b877e8e..33f14e8d0e 100644
--- a/tensorflow/contrib/quantize/python/quantize.py
+++ b/tensorflow/contrib/quantize/python/quantize.py
@@ -341,7 +341,7 @@ def _InsertQuantOp(context,
"""Inserts a quant op between a producer op and (multiple) consumer ops.
Args:
- context: Context w,here producer and consumer operations are nested.
+ context: Context where producer and consumer operations are nested.
name: Name for the new quantization op within the context.
producer: Producer operation of the pairs where quantization will be
inserted.
diff --git a/tensorflow/contrib/quantize/python/quantize_graph.py b/tensorflow/contrib/quantize/python/quantize_graph.py
index d0fb55da74..0b74b438ac 100644
--- a/tensorflow/contrib/quantize/python/quantize_graph.py
+++ b/tensorflow/contrib/quantize/python/quantize_graph.py
@@ -155,7 +155,7 @@ def experimental_create_training_graph(input_graph=None,
often fail.
Args:
- input_graph: The tf.Graph to be transformed,if None then defaults to the
+ input_graph: The tf.Graph to be transformed, if None then defaults to the
default graph.
weight_bits: Number of bits to use for quantizing weights.
activation_bits: Number of bits to use for quantizing activations.
diff --git a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
index 0624cc878b..db745aa562 100644
--- a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
@@ -419,7 +419,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
normalizer_params=self._BatchNormParams(fused_batch_norm),
scope=scope)
- # Manually add a bypass (optionaly) and an activation.
+ # Manually add a bypass (optional) and an activation.
if with_bypass:
node = math_ops.add(inputs, node, name='test/Add')
@@ -470,7 +470,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
normalizer_params=self._BatchNormParams(fused_batch_norm),
scope=scope)
- # Manually add a bypass (optionaly) and an activation.
+ # Manually add a bypass (optional) and an activation.
if with_bypass:
node = math_ops.add(inputs, node, name='test/Add')
@@ -526,7 +526,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
normalizer_params=self._BatchNormParams(fused_batch_norm),
scope=scope)
- # Manually add a bypass (optionaly) and an activation.
+ # Manually add a bypass (optional) and an activation.
if with_bypass:
node = math_ops.add(inputs, node, name='test/Add')
@@ -565,7 +565,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
stddev: Standard deviation of normal variable.
Returns:
- An initialized that initialzes with a truncated normal variable.
+ An initializer that initializes with a truncated normal distribution.
"""
return init_ops.truncated_normal_initializer(stddev=stddev)
diff --git a/tensorflow/contrib/quantize/python/quantize_test.py b/tensorflow/contrib/quantize/python/quantize_test.py
index 216310abe4..bef58bad8d 100644
--- a/tensorflow/contrib/quantize/python/quantize_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_test.py
@@ -197,7 +197,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
stddev: Standard deviation of normal variable.
Returns:
- An initialized that initialzes with a truncated normal variable.
+ An initializer that initializes with a truncated normal distribution.
"""
return init_ops.truncated_normal_initializer(stddev=stddev)
diff --git a/tensorflow/contrib/rnn/ops/gru_ops.cc b/tensorflow/contrib/rnn/ops/gru_ops.cc
index e91d1e8a80..9c8e40851a 100644
--- a/tensorflow/contrib/rnn/ops/gru_ops.cc
+++ b/tensorflow/contrib/rnn/ops/gru_ops.cc
@@ -69,7 +69,7 @@ Element-wise dot product of a and b is represented by ab
Element-wise dot product is represented by \circ
Matrix multiplication is represented by *
-Baises are initialized with :
+Biases are initialized with :
`b_ru` - constant_initializer(1.0)
`b_c` - constant_initializer(0.0)
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py b/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py
index 7957edf68c..ffd2421894 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/lstm_ops_test.py
@@ -54,7 +54,7 @@ def blocks_match(sess, use_peephole):
initializer = init_ops.random_uniform_initializer(-0.01, 0.01, seed=19890212)
with variable_scope.variable_scope("test", initializer=initializer):
- # magic naming so that the cells pick up these variables and resuse them
+ # magic naming so that the cells pick up these variables and reuse them
if use_peephole:
wci = variable_scope.get_variable(
"rnn/lstm_cell/w_i_diag", shape=[cell_size], dtype=dtypes.float32)
diff --git a/tensorflow/contrib/rnn/python/ops/lstm_ops.py b/tensorflow/contrib/rnn/python/ops/lstm_ops.py
index 4eb4fbcd92..9e61fc54d1 100644
--- a/tensorflow/contrib/rnn/python/ops/lstm_ops.py
+++ b/tensorflow/contrib/rnn/python/ops/lstm_ops.py
@@ -480,8 +480,7 @@ class LSTMBlockWrapper(base_layer.Layer):
"""Run this LSTM on inputs, starting from the given state.
Args:
- inputs: `3-D` tensor with shape `[time_len, batch_size, input_size]`
- or a list of `time_len` tensors of shape `[batch_size, input_size]`.
+ inputs: `3-D` tensor with shape `[time_len, batch_size, input_size]`.
initial_state: a tuple `(initial_cell_state, initial_output)` with tensors
of shape `[batch_size, self._num_units]`. If this is not provided, the
cell is expected to create a zero initial state of type `dtype`.
diff --git a/tensorflow/contrib/rnn/python/ops/rnn_cell.py b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
index 73f2607d84..2f6ae9f367 100644
--- a/tensorflow/contrib/rnn/python/ops/rnn_cell.py
+++ b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
@@ -534,7 +534,7 @@ class GridLSTMCell(rnn_cell_impl.RNNCell):
initializer: (optional) The initializer to use for the weight and
projection matrices, default None.
num_unit_shards: (optional) int, default 1, How to split the weight
- matrix. If > 1,the weight matrix is stored across num_unit_shards.
+ matrix. If > 1, the weight matrix is stored across num_unit_shards.
forget_bias: (optional) float, default 1.0, The initial bias of the
forget gates, used to reduce the scale of forgetting at the beginning
of the training.
@@ -993,7 +993,7 @@ class BidirectionalGridLSTMCell(GridLSTMCell):
initializer: (optional) The initializer to use for the weight and
projection matrices, default None.
num_unit_shards: (optional) int, default 1, How to split the weight
- matrix. If > 1,the weight matrix is stored across num_unit_shards.
+ matrix. If > 1, the weight matrix is stored across num_unit_shards.
forget_bias: (optional) float, default 1.0, The initial bias of the
forget gates, used to reduce the scale of forgetting at the beginning
of the training.
@@ -2133,7 +2133,7 @@ class Conv1DLSTMCell(ConvLSTMCell):
def __init__(self, name="conv_1d_lstm_cell", **kwargs):
"""Construct Conv1DLSTM. See `ConvLSTMCell` for more details."""
- super(Conv1DLSTMCell, self).__init__(conv_ndims=1, **kwargs)
+ super(Conv1DLSTMCell, self).__init__(conv_ndims=1, name=name, **kwargs)
class Conv2DLSTMCell(ConvLSTMCell):
@@ -2144,7 +2144,7 @@ class Conv2DLSTMCell(ConvLSTMCell):
def __init__(self, name="conv_2d_lstm_cell", **kwargs):
"""Construct Conv2DLSTM. See `ConvLSTMCell` for more details."""
- super(Conv2DLSTMCell, self).__init__(conv_ndims=2, **kwargs)
+ super(Conv2DLSTMCell, self).__init__(conv_ndims=2, name=name, **kwargs)
class Conv3DLSTMCell(ConvLSTMCell):
@@ -2155,7 +2155,7 @@ class Conv3DLSTMCell(ConvLSTMCell):
def __init__(self, name="conv_3d_lstm_cell", **kwargs):
"""Construct Conv3DLSTM. See `ConvLSTMCell` for more details."""
- super(Conv3DLSTMCell, self).__init__(conv_ndims=3, **kwargs)
+ super(Conv3DLSTMCell, self).__init__(conv_ndims=3, name=name, **kwargs)
def _conv(args, filter_size, num_features, bias, bias_start=0.0):
diff --git a/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py b/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py
index b427dff88b..c4139dde49 100644
--- a/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py
+++ b/tensorflow/contrib/seq2seq/python/kernel_tests/attention_wrapper_test.py
@@ -222,6 +222,9 @@ class AttentionWrapperTest(test.TestCase):
self.assertEqual(
(None, batch_size, None),
tuple(state_alignment_history.get_shape().as_list()))
+ nest.assert_same_structure(
+ cell.state_size,
+ cell.zero_state(batch_size, dtypes.float32))
# Remove the history from final_state for purposes of the
# remainder of the tests.
final_state = final_state._replace(alignment_history=()) # pylint: disable=protected-access
diff --git a/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py b/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
index 9265540317..178328619f 100644
--- a/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
+++ b/tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_decoder_test.py
@@ -27,6 +27,7 @@ from tensorflow.contrib.seq2seq.python.ops import beam_search_ops
from tensorflow.contrib.seq2seq.python.ops import decoder
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
from tensorflow.python.framework import ops
from tensorflow.python.layers import core as layers_core
from tensorflow.python.ops import array_ops
@@ -70,6 +71,98 @@ class TestGatherTree(test.TestCase):
self.assertAllEqual(expected_result, res_)
+ def _test_gather_tree_from_array(self,
+ depth_ndims=0,
+ merged_batch_beam=False):
+ array = np.array(
+ [[[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 0, 0]],
+ [[2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 0]]]).transpose([1, 0, 2])
+ parent_ids = np.array(
+ [[[0, 0, 0], [0, 1, 1], [2, 1, 2], [-1, -1, -1]],
+ [[0, 0, 0], [1, 1, 0], [2, 0, 1], [0, 1, 0]]]).transpose([1, 0, 2])
+ expected_array = np.array(
+ [[[2, 2, 2], [6, 5, 6], [7, 8, 9], [0, 0, 0]],
+ [[2, 3, 2], [7, 5, 7], [8, 9, 8], [11, 12, 0]]]).transpose([1, 0, 2])
+ sequence_length = [[3, 3, 3], [4, 4, 3]]
+
+ array = ops.convert_to_tensor(
+ array, dtype=dtypes.float32)
+ parent_ids = ops.convert_to_tensor(
+ parent_ids, dtype=dtypes.int32)
+ expected_array = ops.convert_to_tensor(
+ expected_array, dtype=dtypes.float32)
+
+ max_time = array_ops.shape(array)[0]
+ batch_size = array_ops.shape(array)[1]
+ beam_width = array_ops.shape(array)[2]
+
+ def _tile_in_depth(tensor):
+ # Generate higher rank tensors by concatenating tensor and tensor + 1.
+ for _ in range(depth_ndims):
+ tensor = array_ops.stack([tensor, tensor + 1], -1)
+ return tensor
+
+ if merged_batch_beam:
+ array = array_ops.reshape(
+ array, [max_time, batch_size * beam_width])
+ expected_array = array_ops.reshape(
+ expected_array, [max_time, batch_size * beam_width])
+
+ if depth_ndims > 0:
+ array = _tile_in_depth(array)
+ expected_array = _tile_in_depth(expected_array)
+
+ sorted_array = beam_search_decoder.gather_tree_from_array(
+ array, parent_ids, sequence_length)
+
+ with self.test_session() as sess:
+ sorted_array = sess.run(sorted_array)
+ expected_array = sess.run(expected_array)
+ self.assertAllEqual(expected_array, sorted_array)
+
+ def test_gather_tree_from_array_scalar(self):
+ self._test_gather_tree_from_array()
+
+ def test_gather_tree_from_array_1d(self):
+ self._test_gather_tree_from_array(depth_ndims=1)
+
+ def test_gather_tree_from_array_1d_with_merged_batch_beam(self):
+ self._test_gather_tree_from_array(depth_ndims=1, merged_batch_beam=True)
+
+ def test_gather_tree_from_array_2d(self):
+ self._test_gather_tree_from_array(depth_ndims=2)
+
+
+class TestArrayShapeChecks(test.TestCase):
+
+ def _test_array_shape_dynamic_checks(self, static_shape, dynamic_shape,
+ batch_size, beam_width, is_valid=True):
+ t = array_ops.placeholder_with_default(
+ np.random.randn(*static_shape).astype(np.float32),
+ shape=dynamic_shape)
+
+ batch_size = array_ops.constant(batch_size)
+ check_op = beam_search_decoder._check_batch_beam(t, batch_size, beam_width) # pylint: disable=protected-access
+
+ with self.test_session() as sess:
+ if is_valid:
+ sess.run(check_op)
+ else:
+ with self.assertRaises(errors.InvalidArgumentError):
+ sess.run(check_op)
+
+ def test_array_shape_dynamic_checks(self):
+ self._test_array_shape_dynamic_checks(
+ (8, 4, 5, 10), (None, None, 5, 10), 4, 5, is_valid=True)
+ self._test_array_shape_dynamic_checks(
+ (8, 20, 10), (None, None, 10), 4, 5, is_valid=True)
+ self._test_array_shape_dynamic_checks(
+ (8, 21, 10), (None, None, 10), 4, 5, is_valid=False)
+ self._test_array_shape_dynamic_checks(
+ (8, 4, 6, 10), (None, None, None, 10), 4, 5, is_valid=False)
+ self._test_array_shape_dynamic_checks(
+ (8, 4), (None, None), 4, 5, is_valid=False)
+
class TestEosMasking(test.TestCase):
"""Tests EOS masking used in beam search."""
@@ -319,7 +412,8 @@ class TestLargeBeamStep(test.TestCase):
class BeamSearchDecoderTest(test.TestCase):
- def _testDynamicDecodeRNN(self, time_major, has_attention):
+ def _testDynamicDecodeRNN(self, time_major, has_attention,
+ with_alignment_history=False):
encoder_sequence_length = np.array([3, 2, 3, 1, 1])
decoder_sequence_length = np.array([2, 0, 1, 2, 3])
batch_size = 5
@@ -359,7 +453,7 @@ class BeamSearchDecoderTest(test.TestCase):
cell=cell,
attention_mechanism=attention_mechanism,
attention_layer_size=attention_depth,
- alignment_history=False)
+ alignment_history=with_alignment_history)
cell_state = cell.zero_state(
dtype=dtypes.float32, batch_size=batch_size_tensor * beam_width)
if has_attention:
@@ -420,6 +514,12 @@ class BeamSearchDecoderTest(test.TestCase):
def testDynamicDecodeRNNBatchMajorYesAttention(self):
self._testDynamicDecodeRNN(time_major=False, has_attention=True)
+ def testDynamicDecodeRNNBatchMajorYesAttentionWithAlignmentHistory(self):
+ self._testDynamicDecodeRNN(
+ time_major=False,
+ has_attention=True,
+ with_alignment_history=True)
+
if __name__ == '__main__':
test.main()
diff --git a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
index f8da5a3e17..9ff8a343f1 100644
--- a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
+++ b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
@@ -1278,7 +1278,8 @@ class AttentionWrapper(rnn_cell_impl.RNNCell):
attention_state=self._item_or_tuple(
a.state_size for a in self._attention_mechanisms),
alignment_history=self._item_or_tuple(
- () for _ in self._attention_mechanisms)) # sometimes a TensorArray
+ a.alignments_size if self._alignment_history else ()
+ for a in self._attention_mechanisms)) # sometimes a TensorArray
def zero_state(self, batch_size, dtype):
"""Return an initial (zero) state tuple for this `AttentionWrapper`.
@@ -1318,22 +1319,26 @@ class AttentionWrapper(rnn_cell_impl.RNNCell):
cell_state = nest.map_structure(
lambda s: array_ops.identity(s, name="checked_cell_state"),
cell_state)
+ initial_alignments = [
+ attention_mechanism.initial_alignments(batch_size, dtype)
+ for attention_mechanism in self._attention_mechanisms]
return AttentionWrapperState(
cell_state=cell_state,
time=array_ops.zeros([], dtype=dtypes.int32),
attention=_zero_state_tensors(self._attention_layer_size, batch_size,
dtype),
- alignments=self._item_or_tuple(
- attention_mechanism.initial_alignments(batch_size, dtype)
- for attention_mechanism in self._attention_mechanisms),
+ alignments=self._item_or_tuple(initial_alignments),
attention_state=self._item_or_tuple(
attention_mechanism.initial_state(batch_size, dtype)
for attention_mechanism in self._attention_mechanisms),
alignment_history=self._item_or_tuple(
- tensor_array_ops.TensorArray(dtype=dtype, size=0,
- dynamic_size=True)
+ tensor_array_ops.TensorArray(
+ dtype,
+ size=0,
+ dynamic_size=True,
+ element_shape=alignment.shape)
if self._alignment_history else ()
- for _ in self._attention_mechanisms))
+ for alignment in initial_alignments))
def call(self, inputs, state):
"""Perform a step of attention-wrapped RNN.
diff --git a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
index 03fe31abf7..a26107b0d7 100644
--- a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
+++ b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
@@ -35,6 +35,7 @@ from tensorflow.python.ops import math_ops
from tensorflow.python.ops import nn_ops
from tensorflow.python.ops import rnn_cell_impl
from tensorflow.python.ops import tensor_array_ops
+from tensorflow.python.platform import tf_logging
from tensorflow.python.util import nest
__all__ = [
@@ -121,14 +122,114 @@ def tile_batch(t, multiplier, name=None):
return nest.map_structure(lambda t_: _tile_batch(t_, multiplier), t)
+def gather_tree_from_array(t, parent_ids, sequence_length):
+ """Calculates the full beams for `TensorArray`s.
+
+ Args:
+ t: A stacked `TensorArray` of size `max_time` that contains `Tensor`s of
+ shape `[batch_size, beam_width, s]` or `[batch_size * beam_width, s]`
+ where `s` is the depth shape.
+ parent_ids: The parent ids of shape `[max_time, batch_size, beam_width]`.
+ sequence_length: The sequence length of shape `[batch_size, beam_width]`.
+
+ Returns:
+ A `Tensor` which is a stacked `TensorArray` of the same size and type as
+ `t` and where beams are sorted in each `Tensor` according to `parent_ids`.
+ """
+ max_time = parent_ids.shape[0].value or array_ops.shape(parent_ids)[0]
+ batch_size = parent_ids.shape[1].value or array_ops.shape(parent_ids)[1]
+ beam_width = parent_ids.shape[2].value or array_ops.shape(parent_ids)[2]
+
+ # Generate beam ids that will be reordered by gather_tree.
+ beam_ids = array_ops.expand_dims(
+ array_ops.expand_dims(math_ops.range(beam_width), 0), 0)
+ beam_ids = array_ops.tile(beam_ids, [max_time, batch_size, 1])
+
+ mask = array_ops.sequence_mask(
+ sequence_length, maxlen=max_time, dtype=dtypes.int32)
+ mask = array_ops.transpose(mask, perm=[2, 0, 1])
+
+ # Use beam_width + 1 to mark the end of beam.
+ masked_beam_ids = (beam_ids * mask) + (1 - mask) * (beam_width + 1)
+
+ max_sequence_lengths = math_ops.to_int32(
+ math_ops.reduce_max(sequence_length, axis=1))
+ sorted_beam_ids = beam_search_ops.gather_tree(
+ step_ids=masked_beam_ids,
+ parent_ids=parent_ids,
+ max_sequence_lengths=max_sequence_lengths,
+ end_token=beam_width + 1)
+
+ # For out of range steps, simply copy the same beam.
+ sorted_beam_ids = array_ops.where(
+ math_ops.cast(mask, dtypes.bool), x=sorted_beam_ids, y=beam_ids)
+
+ # Generate indices for gather_nd.
+ time_ind = array_ops.tile(array_ops.reshape(
+ math_ops.range(max_time), [-1, 1, 1]), [1, batch_size, beam_width])
+ batch_ind = array_ops.tile(array_ops.reshape(
+ math_ops.range(batch_size), [-1, 1, 1]), [1, max_time, beam_width])
+ batch_ind = array_ops.transpose(batch_ind, perm=[1, 0, 2])
+ indices = array_ops.stack([time_ind, batch_ind, sorted_beam_ids], -1)
+
+ # Gather from a tensor with collapsed additional dimensions.
+ gather_from = t
+ final_shape = array_ops.shape(gather_from)
+ gather_from = array_ops.reshape(
+ gather_from, [max_time, batch_size, beam_width, -1])
+ ordered = array_ops.gather_nd(gather_from, indices)
+ ordered = array_ops.reshape(ordered, final_shape)
+
+ return ordered
+
+
def _check_maybe(t):
- if isinstance(t, tensor_array_ops.TensorArray):
- raise TypeError(
- "TensorArray state is not supported by BeamSearchDecoder: %s" % t.name)
if t.shape.ndims is None:
raise ValueError(
"Expected tensor (%s) to have known rank, but ndims == None." % t)
+def _check_static_batch_beam_maybe(shape, batch_size, beam_width):
+  """Warns if the static shape is not compatible with reshaping to
+  [batch_size, beam_size, -1] and returns False in that case.
+  """
+ reshaped_shape = tensor_shape.TensorShape([batch_size, beam_width, None])
+ if (batch_size is not None and shape[0].value is not None
+ and (shape[0] != batch_size * beam_width
+ or (shape.ndims >= 2 and shape[1].value is not None
+ and (shape[0] != batch_size or shape[1] != beam_width)))):
+ tf_logging.warn("TensorArray reordering expects elements to be "
+ "reshapable to %s which is incompatible with the "
+ "current shape %s. Consider setting "
+ "reorder_tensor_arrays to False to disable TensorArray "
+ "reordering during the beam search."
+ % (reshaped_shape, shape))
+ return False
+ return True
+
+def _check_batch_beam(t, batch_size, beam_width):
+ """Returns an Assert operation checking that the elements of the stacked
+ TensorArray can be reshaped to [batch_size, beam_size, -1]. At this point,
+ the TensorArray elements have a known rank of at least 1.
+ """
+ error_message = ("TensorArray reordering expects elements to be "
+ "reshapable to [batch_size, beam_size, -1] which is "
+ "incompatible with the dynamic shape of %s elements. "
+ "Consider setting reorder_tensor_arrays to False to disable "
+ "TensorArray reordering during the beam search."
+ % (t.name))
+ rank = t.shape.ndims
+ shape = array_ops.shape(t)
+ if rank == 2:
+ condition = math_ops.equal(shape[1], batch_size * beam_width)
+ else:
+ condition = math_ops.logical_or(
+ math_ops.equal(shape[1], batch_size * beam_width),
+ math_ops.logical_and(
+ math_ops.equal(shape[1], batch_size),
+ math_ops.equal(shape[2], beam_width)))
+ return control_flow_ops.Assert(condition, [error_message])
+
+
class BeamSearchDecoder(decoder.Decoder):
"""BeamSearch sampling decoder.
@@ -173,7 +274,8 @@ class BeamSearchDecoder(decoder.Decoder):
initial_state,
beam_width,
output_layer=None,
- length_penalty_weight=0.0):
+ length_penalty_weight=0.0,
+ reorder_tensor_arrays=True):
"""Initialize the BeamSearchDecoder.
Args:
@@ -188,6 +290,12 @@ class BeamSearchDecoder(decoder.Decoder):
`tf.layers.Dense`. Optional layer to apply to the RNN output prior
to storing the result or sampling.
length_penalty_weight: Float weight to penalize length. Disabled with 0.0.
+ reorder_tensor_arrays: If `True`, `TensorArray`s' elements within the cell
+ state will be reordered according to the beam search path. If the
+ `TensorArray` can be reordered, the stacked form will be returned.
+ Otherwise, the `TensorArray` will be returned as is. Set this flag to
+ `False` if the cell state contains `TensorArray`s that are not amenable
+ to reordering.
Raises:
TypeError: if `cell` is not an instance of `RNNCell`,
@@ -202,6 +310,7 @@ class BeamSearchDecoder(decoder.Decoder):
"output_layer must be a Layer, received: %s" % type(output_layer))
self._cell = cell
self._output_layer = output_layer
+ self._reorder_tensor_arrays = reorder_tensor_arrays
if callable(embedding):
self._embedding_fn = embedding
@@ -299,12 +408,13 @@ class BeamSearchDecoder(decoder.Decoder):
"""
finished, start_inputs = self._finished, self._start_inputs
+ dtype = nest.flatten(self._initial_cell_state)[0].dtype
log_probs = array_ops.one_hot( # shape(batch_sz, beam_sz)
array_ops.zeros([self._batch_size], dtype=dtypes.int32),
depth=self._beam_width,
- on_value=0.0,
- off_value=-np.Inf,
- dtype=nest.flatten(self._initial_cell_state)[0].dtype)
+ on_value=ops.convert_to_tensor(0.0, dtype=dtype),
+ off_value=ops.convert_to_tensor(-np.Inf, dtype=dtype),
+ dtype=dtype)
initial_state = BeamSearchDecoderState(
cell_state=self._initial_cell_state,
@@ -341,6 +451,11 @@ class BeamSearchDecoder(decoder.Decoder):
outputs.parent_ids,
max_sequence_lengths=max_sequence_lengths,
end_token=self._end_token)
+ if self._reorder_tensor_arrays:
+ final_state = final_state._replace(cell_state=nest.map_structure(
+ lambda t: self._maybe_sort_array_beams(
+ t, outputs.parent_ids, final_state.lengths),
+ final_state.cell_state))
outputs = FinalBeamSearchDecoderOutput(
beam_search_decoder_output=outputs, predicted_ids=predicted_ids)
return outputs, final_state
@@ -431,9 +546,10 @@ class BeamSearchDecoder(decoder.Decoder):
returned unchanged.
Raises:
- TypeError: If `t` is an instance of `TensorArray`.
ValueError: If the rank of `t` is not statically known.
"""
+ if isinstance(t, tensor_array_ops.TensorArray):
+ return t
_check_maybe(t)
if t.shape.ndims >= 1:
return self._split_batch_beams(t, s)
@@ -454,15 +570,55 @@ class BeamSearchDecoder(decoder.Decoder):
A reshaped version of t with shape `[batch_size, beam_width] + s`.
Raises:
- TypeError: If `t` is an instance of `TensorArray`.
ValueError: If the rank of `t` is not statically known.
"""
+ if isinstance(t, tensor_array_ops.TensorArray):
+ return t
_check_maybe(t)
if t.shape.ndims >= 2:
return self._merge_batch_beams(t, s)
else:
return t
+ def _maybe_sort_array_beams(self, t, parent_ids, sequence_length):
+ """Maybe sorts beams within a `TensorArray`.
+
+ Args:
+ t: A `TensorArray` of size `max_time` that contains `Tensor`s of shape
+ `[batch_size, beam_width, s]` or `[batch_size * beam_width, s]` where
+ `s` is the depth shape.
+ parent_ids: The parent ids of shape `[max_time, batch_size, beam_width]`.
+ sequence_length: The sequence length of shape `[batch_size, beam_width]`.
+
+ Returns:
+ A `TensorArray` where beams are sorted in each `Tensor` or `t` itself if
+ it is not a `TensorArray` or does not meet shape requirements.
+ """
+ if not isinstance(t, tensor_array_ops.TensorArray):
+ return t
+ # pylint: disable=protected-access
+ if (not t._infer_shape or not t._element_shape
+ or t._element_shape[0].ndims is None
+ or t._element_shape[0].ndims < 1):
+ shape = (
+ t._element_shape[0] if t._infer_shape and t._element_shape
+ else tensor_shape.TensorShape(None))
+ tf_logging.warn("The TensorArray %s in the cell state is not amenable to "
+ "sorting based on the beam search result. For a "
+                      "TensorArray to be sorted, its element shape must be "
+                      "defined and have a rank of at least 1, but saw shape: %s"
+ % (t.handle.name, shape))
+ return t
+ shape = t._element_shape[0]
+ # pylint: enable=protected-access
+ if not _check_static_batch_beam_maybe(
+ shape, tensor_util.constant_value(self._batch_size), self._beam_width):
+ return t
+ t = t.stack()
+ with ops.control_dependencies(
+ [_check_batch_beam(t, self._batch_size, self._beam_width)]):
+ return gather_tree_from_array(t, parent_ids, sequence_length)
+
def step(self, time, inputs, state, name=None):
"""Perform a decoding step.
@@ -757,6 +913,8 @@ def _maybe_tensor_gather_helper(gather_indices, gather_from, batch_size,
output: Gathered tensor of shape tf.shape(gather_from)[:1+len(gather_shape)]
or the original tensor if its dimensions are too small.
"""
+ if isinstance(gather_from, tensor_array_ops.TensorArray):
+ return gather_from
_check_maybe(gather_from)
if gather_from.shape.ndims >= len(gather_shape):
return _tensor_gather_helper(
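To illustrate the `reorder_tensor_arrays` flag introduced above, here is a minimal sketch of constructing the decoder with reordering disabled; it assumes the TF 1.x contrib APIs and uses made-up sizes, so treat it as illustrative rather than canonical:

```python
import tensorflow as tf
from tensorflow.contrib import seq2seq

batch_size, beam_width, vocab_size, num_units = 4, 5, 100, 32
cell = tf.nn.rnn_cell.LSTMCell(num_units)
embedding = tf.get_variable("embedding", [vocab_size, num_units])
start_tokens = tf.fill([batch_size], 1)
# The initial state must already be tiled to batch_size * beam_width.
initial_state = cell.zero_state(batch_size * beam_width, tf.float32)

decoder = seq2seq.BeamSearchDecoder(
    cell=cell,
    embedding=embedding,
    start_tokens=start_tokens,
    end_token=2,
    initial_state=initial_state,
    beam_width=beam_width,
    length_penalty_weight=0.0,
    reorder_tensor_arrays=False)  # keep TensorArrays in the cell state as-is
```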
diff --git a/tensorflow/contrib/slim/README.md b/tensorflow/contrib/slim/README.md
index 2d9df8f27e..40f484fd78 100644
--- a/tensorflow/contrib/slim/README.md
+++ b/tensorflow/contrib/slim/README.md
@@ -94,7 +94,7 @@ of thin wrapper functions in
[variables.py](https://www.tensorflow.org/code/tensorflow/contrib/framework/python/ops/variables.py)
which allow callers to easily define variables.
-For example, to create a `weight` variable, initialize it using a truncated
+For example, to create a `weights` variable, initialize it using a truncated
normal distribution, regularize it with an `l2_loss` and place it on the `CPU`,
one need only declare the following:
diff --git a/tensorflow/contrib/solvers/python/ops/least_squares.py b/tensorflow/contrib/solvers/python/ops/least_squares.py
index fb7c0eb649..6e164f5342 100644
--- a/tensorflow/contrib/solvers/python/ops/least_squares.py
+++ b/tensorflow/contrib/solvers/python/ops/least_squares.py
@@ -33,7 +33,7 @@ def cgls(operator, rhs, tol=1e-6, max_iter=20, name="cgls"):
r"""Conjugate gradient least squares solver.
Solves a linear least squares problem \\(||A x - rhs||_2\\) for a single
- righ-hand side, using an iterative, matrix-free algorithm where the action of
+ right-hand side, using an iterative, matrix-free algorithm where the action of
the matrix A is represented by `operator`. The CGLS algorithm implicitly
applies the symmetric conjugate gradient algorithm to the normal equations
\\(A^* A x = A^* rhs\\). The iteration terminates when either
diff --git a/tensorflow/contrib/solvers/python/ops/linear_equations.py b/tensorflow/contrib/solvers/python/ops/linear_equations.py
index d791d46763..9305c6a11c 100644
--- a/tensorflow/contrib/solvers/python/ops/linear_equations.py
+++ b/tensorflow/contrib/solvers/python/ops/linear_equations.py
@@ -41,7 +41,7 @@ def conjugate_gradient(operator,
r"""Conjugate gradient solver.
Solves a linear system of equations `A*x = rhs` for selfadjoint, positive
- definite matrix `A` and righ-hand side vector `rhs`, using an iterative,
+ definite matrix `A` and right-hand side vector `rhs`, using an iterative,
matrix-free algorithm where the action of the matrix A is represented by
`operator`. The iteration terminates when either the number of iterations
exceeds `max_iter` or when the residual norm has been reduced to `tol`
diff --git a/tensorflow/contrib/tensorrt/BUILD b/tensorflow/contrib/tensorrt/BUILD
index c832c6f2e0..906cc3f034 100644
--- a/tensorflow/contrib/tensorrt/BUILD
+++ b/tensorflow/contrib/tensorrt/BUILD
@@ -83,6 +83,7 @@ cc_library(
"kernels/trt_engine_op.h",
],
copts = tf_copts(),
+ visibility = ["//visibility:public"],
deps = [
":trt_logging",
":trt_resources",
@@ -154,6 +155,7 @@ py_library(
deps = [
":trt_convert_py",
":trt_ops_py",
+ "//tensorflow/python:errors",
],
)
diff --git a/tensorflow/contrib/tensorrt/README.md b/tensorflow/contrib/tensorrt/README.md
index dfcce0fd00..461e627e99 100644
--- a/tensorflow/contrib/tensorrt/README.md
+++ b/tensorflow/contrib/tensorrt/README.md
@@ -2,7 +2,8 @@ Using TensorRT in TensorFlow
============================
This module provides necessary bindings and introduces TRT_engine_op
-operator that wraps a subgraph in TensorRT.
+operator that wraps a subgraph in TensorRT. This is still a work in progress
+but should be usable with most common graphs.
Compilation
-----------
@@ -15,26 +16,10 @@ configure script should find the necessary components from the system
automatically. If installed from tar packages, the user has to set the path to
the location where the library is installed during configuration.
-
-```
+```shell
bazel build --config=cuda --config=opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/
```
After the installation of tensorflow package, TensorRT transformation
-will be available. An example use is shown below.
-
-```python
-import tensorflow as tf
-import tensorflow.contrib.tensorrt as trt
-#... create and train or load model
-gdef = sess.graph.as_graph_def()
-trt_gdef = trt.create_inference_graph(
- gdef, #original graph_def
- ["output"], #name of output node(s)
- max_batch_size, #maximum batch size to run the inference
- max_workspace_size_bytes) # max memory for TensorRT to use
-tf.reset_default_graph()
-tf.import_graph_def(graph_def=trt_gdef)
-#...... run inference
-```
+will be available. An example use can be found in test/test_tftrt.py.
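For convenience, a minimal sketch of that usage, modeled on the example previously shown in this README; `sess` stands for a session holding a trained model and `"output"` for its output node name, so both are placeholders:

```python
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

max_batch_size = 16                  # maximum batch size to run the inference
max_workspace_size_bytes = 1 << 30   # max memory for TensorRT to use

gdef = sess.graph.as_graph_def()     # original graph_def from a trained model
trt_gdef = trt.create_inference_graph(
    gdef,                      # original graph_def
    ["output"],                # name of output node(s)
    max_batch_size,            # maximum batch size to run the inference
    max_workspace_size_bytes)  # max memory for TensorRT to use

tf.reset_default_graph()
tf.import_graph_def(graph_def=trt_gdef)
# ... run inference on the optimized graph
```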
diff --git a/tensorflow/contrib/tensorrt/__init__.py b/tensorflow/contrib/tensorrt/__init__.py
index fd551d70b4..140ad48282 100644
--- a/tensorflow/contrib/tensorrt/__init__.py
+++ b/tensorflow/contrib/tensorrt/__init__.py
@@ -18,6 +18,18 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-# pylint: disable=unused-import,wildcard-import
-from tensorflow.contrib.tensorrt.python import *
-# pylint: enable=unused-import,wildcard-import
+from tensorflow.python.framework import errors
+
+# pylint: disable=unused-import,wildcard-import,g-import-not-at-top
+try:
+ from tensorflow.contrib.tensorrt.python import *
+except errors.NotFoundError as e:
+ no_trt_message = (
+ '**** Failed to initialize TensorRT. This is either because the TensorRT'
+ ' installation path is not in LD_LIBRARY_PATH, or because you do not have'
+ ' it installed. If not installed, please go to'
+ ' https://developer.nvidia.com/tensorrt to download and install'
+ ' TensorRT ****')
+ print(no_trt_message)
+ raise e
+# pylint: enable=unused-import,wildcard-import,g-import-not-at-top
diff --git a/tensorflow/contrib/tensorrt/convert/convert_graph.cc b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
index 970f810473..eea8c8efa2 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_graph.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
@@ -15,6 +15,7 @@ limitations under the License.
#include "tensorflow/contrib/tensorrt/convert/convert_graph.h"
+#include <list>
#include <map>
#include <set>
#include <unordered_map>
@@ -48,13 +49,29 @@ namespace tensorrt {
namespace convert {
namespace {
-static bool IsTensorRTCandidate(const tensorflow::NodeDef& node_def) {
+bool IsTensorRTCandidate(const tensorflow::NodeDef& node_def) {
// LINT.IfChange
// TODO(jie): Segmentation shouldn't be associated with op name.
// Split it into a registration for each kernel.
static const std::set<string> candidate_ops = {
- "Identity", "Const", "Conv2D", "MaxPool", "BiasAdd", "Relu",
- "Add", "Mul", "Sub", "Rsqrt", "Pad" // "Placeholder" ,"Mean"
+ "Identity",
+ "Const",
+ "Conv2D",
+ "MaxPool",
+ "BiasAdd",
+ "Relu",
+ "Add",
+ "Mul",
+ "Sub",
+ "Rsqrt",
+ "Pad",
+ "Mean",
+ "AvgPool",
+ "ConcatV2",
+ "DepthwiseConv2dNative",
+ "FusedBatchNorm",
+ "FusedBatchNormV2",
+ // TODO(ben,jie): ...
};
// LINT.ThenChange(//tensorflow/contrib/tensorrt/convert/convert_nodes.h)
return candidate_ops.count(node_def.op());
@@ -69,6 +86,8 @@ void GetSubGraphIncomingEdges(const tensorflow::Graph& graph,
if (!subgraph_node_ids.count(edge->src()->id()) &&
!edge->src()->IsSource()) {
incoming_edges->insert(edge);
+ } else {
+ VLOG(2) << edge->src()->name() << " N, ";
}
}
}
@@ -82,7 +101,10 @@ void GetSubGraphOutgoingEdges(const tensorflow::Graph& graph,
for (const tensorflow::Edge* edge : node->out_edges()) {
if (!subgraph_node_ids.count(edge->dst()->id()) &&
!edge->dst()->IsSink()) {
+ VLOG(2) << edge->dst()->name() << " Y, ";
outgoing_edges->insert(edge);
+ } else {
+ VLOG(2) << edge->dst()->name() << " N, ";
}
}
}
@@ -109,74 +131,150 @@ std::unordered_map<string, std::vector<int>> BuildTensorNameMap(
}
return result;
}
-
-tensorflow::Status ConvertSubGraphToTensorRT(
- const std::vector<string>& output_names,
- const std::set<int>& subgraph_node_ids,
- size_t max_batch_size, // Max batch size that engine will be created for
- // Max amount of memory that engine will be allowed to consume, in bytes
- size_t max_workspace_size_bytes,
- const tensorflow::grappler::GraphProperties& graph_properties,
- tensorflow::Graph* graph) {
- tensorflow::EdgeSet subgraph_incoming_edges;
- GetSubGraphIncomingEdges(*graph, subgraph_node_ids, &subgraph_incoming_edges);
-
+// TODO(sami): convert references to pointers
+struct ConvertGraphParams {
+ ConvertGraphParams(
+ tensorflow::Graph& inp_graph,
+ const std::vector<string>& output_node_names,
+ const std::set<int>& subgraph_node_id_numbers,
+ size_t max_supported_batch_size, size_t max_consumed_workspace_size_bytes,
+ const tensorflow::grappler::GraphProperties& current_graph_properties,
+ std::unordered_map<string, std::pair<int, string>>* output_edges,
+ int engine_precision_mode)
+ : graph(inp_graph),
+ output_names(output_node_names),
+ subgraph_node_ids(subgraph_node_id_numbers),
+ max_batch_size(max_supported_batch_size),
+ max_workspace_size_bytes(max_consumed_workspace_size_bytes),
+ graph_properties(current_graph_properties),
+ output_edge_map(output_edges),
+ precision_mode(engine_precision_mode) {}
+ tensorflow::Graph& graph;
+ const std::vector<string>& output_names;
+ const std::set<int>& subgraph_node_ids;
+ size_t max_batch_size;
+ size_t max_workspace_size_bytes;
+ const tensorflow::grappler::GraphProperties& graph_properties;
+ std::unordered_map<string, std::pair<int, string>>* output_edge_map;
+ int precision_mode;
std::vector<std::pair<int, int>> subgraph_inputs;
+ std::vector<std::pair<int, int>> subgraph_outputs;
+ tensorflow::EdgeSet subgraph_incoming_edges;
+ tensorflow::EdgeSet subgraph_outgoing_edges;
+};
- // Collect inputs by looking for incoming edges
- for (const tensorflow::Edge* edge : subgraph_incoming_edges) {
- subgraph_inputs.push_back({edge->src()->id(), edge->src_output()});
+static tensorflow::Status FillSubGraphEdgeSets(ConvertGraphParams* p) {
+ GetSubGraphIncomingEdges(p->graph, p->subgraph_node_ids,
+ &p->subgraph_incoming_edges);
+ for (const tensorflow::Edge* edge : p->subgraph_incoming_edges) {
+ p->subgraph_inputs.push_back({edge->src()->id(), edge->src_output()});
}
+ auto output_name_to_index_map = BuildTensorNameMap(p->output_names);
std::set<std::pair<int, int>> subgraph_outputs_set;
// Collect outputs referenced from output_names
- auto output_name_to_index_map = BuildTensorNameMap(output_names);
- for (int node_id : subgraph_node_ids) {
- tensorflow::Node* node = graph->FindNodeId(node_id);
+ for (int node_id : p->subgraph_node_ids) {
+ tensorflow::Node* node = p->graph.FindNodeId(node_id);
if (output_name_to_index_map.count(node->name())) {
for (int index : output_name_to_index_map.at(node->name())) {
subgraph_outputs_set.insert({node_id, index});
}
}
}
- // Collect outputs referenced from outgoing edges
- tensorflow::EdgeSet subgraph_outgoing_edges;
- GetSubGraphOutgoingEdges(*graph, subgraph_node_ids, &subgraph_outgoing_edges);
- for (const tensorflow::Edge* edge : subgraph_outgoing_edges) {
+ GetSubGraphOutgoingEdges(p->graph, p->subgraph_node_ids,
+ &p->subgraph_outgoing_edges);
+ for (const tensorflow::Edge* edge : p->subgraph_outgoing_edges) {
subgraph_outputs_set.insert({edge->src()->id(), edge->src_output()});
}
- // Impose an ordering on the outputs
- std::vector<std::pair<int, int>> subgraph_outputs(
- subgraph_outputs_set.begin(), subgraph_outputs_set.end());
- // Build TensorRT node and add it to the graph
+ p->subgraph_outputs.reserve(subgraph_outputs_set.size());
+ p->subgraph_outputs.insert(p->subgraph_outputs.begin(),
+ subgraph_outputs_set.begin(),
+ subgraph_outputs_set.end());
+ return tensorflow::Status::OK();
+};
+
+tensorflow::Status GetCalibNode(ConvertGraphParams* params) {
+ TF_RETURN_IF_ERROR(FillSubGraphEdgeSets(params));
tensorflow::NodeDef trt_node_def;
- TF_RETURN_IF_ERROR(ConvertSubGraphToTensorRTNodeDef(
- *graph, subgraph_node_ids, subgraph_inputs, subgraph_outputs,
- max_batch_size, max_workspace_size_bytes, graph_properties,
- &trt_node_def));
+ SubGraphParams s(params->graph, params->subgraph_node_ids,
+ params->subgraph_inputs, params->subgraph_outputs,
+ params->max_batch_size, params->max_workspace_size_bytes,
+ params->graph_properties, params->output_edge_map,
+ &trt_node_def, params->precision_mode);
+ TF_RETURN_IF_ERROR(InjectCalibrationNode(s));
tensorflow::Status status;
- tensorflow::Node* trt_node = graph->AddNode(trt_node_def, &status);
+ tensorflow::Node* trt_node = params->graph.AddNode(trt_node_def, &status);
+
+ TF_RETURN_IF_ERROR(status);
+
+ for (auto in_edge :
+ params->subgraph_incoming_edges) { // loop over incoming edges and
+ // attach them to calib node
+ // tensorflow::Node* src_node = in_edge->src();
+ auto src_output = in_edge->src_output();
+ auto dst_node = in_edge->dst();
+ auto dst_input = in_edge->dst_input();
+ VLOG(1) << " update edge " << trt_node->name() << ":" << src_output
+ << " -> " << dst_node->name() << ":" << dst_input;
+ TF_RETURN_IF_ERROR(
+ params->graph.UpdateEdge(trt_node, src_output, dst_node, dst_input));
+ }
+ return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertSubGraphToTensorRT(ConvertGraphParams* params) {
+ TF_RETURN_IF_ERROR(FillSubGraphEdgeSets(params));
+ tensorflow::NodeDef trt_node_def;
+
+ SubGraphParams s(params->graph, params->subgraph_node_ids,
+ params->subgraph_inputs, params->subgraph_outputs,
+ params->max_batch_size, params->max_workspace_size_bytes,
+ params->graph_properties, params->output_edge_map,
+ &trt_node_def, params->precision_mode);
+ TF_RETURN_IF_ERROR(ConvertSubGraphToTensorRTNodeDef(s));
+ tensorflow::Status status;
+ tensorflow::Node* trt_node = params->graph.AddNode(trt_node_def, &status);
+
+ // AddNode does not wire edges.
+ // Re-map incoming edges to use the new TRT node instead of the orig subgraph
+ std::map<std::pair<int, int>, int> subgraph_edge_to_input_map;
+ for (size_t i = 0; i < params->subgraph_inputs.size(); ++i) {
+ subgraph_edge_to_input_map.insert({params->subgraph_inputs.at(i), i});
+ }
+ for (const tensorflow::Edge* edge : params->subgraph_incoming_edges) {
+ std::pair<int, int> old_src = {edge->src()->id(), edge->src_output()};
+ int new_src_output = subgraph_edge_to_input_map.at(old_src);
+ params->graph.AddEdge(edge->src(), edge->src_output(), trt_node,
+ new_src_output);
+ params->graph.RemoveEdge(edge);
+ }
+
+ VLOG(2) << "new wiring edges: " << trt_node->in_edges().size();
+ for (const tensorflow::Edge* edge : trt_node->in_edges()) {
+ VLOG(2) << edge->src()->name() << " port: " << edge->src_output();
+ }
+
TF_RETURN_IF_ERROR(status);
// Re-map outgoing edges to use the new TRT node instead of the orig subgraph
std::map<std::pair<int, int>, int> subgraph_edge_to_output_map;
- for (size_t i = 0; i < subgraph_outputs.size(); ++i) {
- subgraph_edge_to_output_map.insert({subgraph_outputs.at(i), i});
+ for (size_t i = 0; i < params->subgraph_outputs.size(); ++i) {
+ subgraph_edge_to_output_map.insert({params->subgraph_outputs.at(i), i});
}
TF_RETURN_IF_ERROR(status);
- for (const tensorflow::Edge* edge : subgraph_outgoing_edges) {
+ for (const tensorflow::Edge* edge : params->subgraph_outgoing_edges) {
std::pair<int, int> old_src = {edge->src()->id(), edge->src_output()};
int new_src_output = subgraph_edge_to_output_map.at(old_src);
- TF_RETURN_IF_ERROR(graph->UpdateEdge(trt_node, new_src_output, edge->dst(),
- edge->dst_input()));
+ TF_RETURN_IF_ERROR(params->graph.UpdateEdge(
+ trt_node, new_src_output, edge->dst(), edge->dst_input()));
}
// Remove the original subgraph
- for (int node_id : subgraph_node_ids) {
- tensorflow::Node* node = graph->FindNodeId(node_id);
+ for (int node_id : params->subgraph_node_ids) {
+ tensorflow::Node* node = params->graph.FindNodeId(node_id);
// Don't remove the input placeholders
if (node->type_string() == "Placeholder") {
continue;
}
- graph->RemoveNode(node);
+ params->graph.RemoveNode(node);
}
return tensorflow::Status::OK();
}
@@ -194,12 +292,39 @@ tensorflow::Status BuildNodeMap(
}
} // namespace
+tensorflow::Status ConvertCalibGraphToInferGraph(
+ const tensorflow::GraphDef& graph_def, tensorflow::GraphDef* infer_graph) {
+ VLOG(0) << "Starting Calib Conversion";
+ tensorflow::Graph graph(tensorflow::OpRegistry::Global());
+ TF_RETURN_IF_ERROR(tensorflow::ConvertGraphDefToGraph(
+ tensorflow::GraphConstructorOptions(), graph_def, &graph));
+ // get calib nodes
+ std::vector<tensorflow::Node*> calib_nodes;
+ for (auto node : graph.op_nodes()) {
+ if (node->type_string() == "TRTCalibOp") {
+ VLOG(1) << "Found Calib Node";
+ calib_nodes.push_back(node);
+ }
+ }
+ VLOG(0) << "Num Calib nodes in graph= " << calib_nodes.size();
+ if (calib_nodes.size() == 0)
+ return tensorflow::errors::FailedPrecondition(
+        "Graph doesn't contain any calibration nodes!"
+        " Please generate the calibration graph and run calibration first.");
+ for (auto n : calib_nodes) {
+ TF_RETURN_IF_ERROR(
+ tensorrt::convert::ConvertCalibrationNodeToEngineNode(graph, n));
+ }
+ graph.ToGraphDef(infer_graph);
+ return tensorflow::Status::OK();
+}
tensorflow::Status ConvertGraphDefToTensorRT(
const tensorflow::GraphDef& graph_def,
const std::vector<string>& output_names, size_t max_batch_size,
- size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def) {
- // Optimization pass
+ size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def,
+ int precision_mode = FP32MODE, int minimum_segment_size = 3) {
+ // optimization pass
tensorflow::grappler::GrapplerItem item;
item.fetch = output_names;
tensorflow::GraphDef gdef;
@@ -209,16 +334,23 @@ tensorflow::Status ConvertGraphDefToTensorRT(
tensorflow::grappler::LayoutOptimizer optimizer;
tensorflow::grappler::Cluster* cluster;
- // Virtual cluster
+ // virtual cluster
tensorflow::DeviceProperties device_properties;
+
device_properties.set_type("GPU");
device_properties.mutable_environment()->insert({"architecture", "6"});
cluster =
new tensorflow::grappler::VirtualCluster({{"/GPU:0", device_properties}});
+ // single machine
+ int num_cpu_cores = tensorflow::grappler::GetNumAvailableLogicalCPUCores();
+ int num_gpus = tensorflow::grappler::GetNumAvailableGPUs();
+ VLOG(2) << "cpu_cores: " << num_cpu_cores;
+ VLOG(2) << "gpus: " << num_gpus;
+
TF_RETURN_IF_ERROR(optimizer.Optimize(cluster, item, &gdef));
- // Constant folding
+ // constant folding
item.graph = gdef;
tensorflow::grappler::ConstantFolding fold(nullptr);
TF_RETURN_IF_ERROR(fold.Optimize(nullptr, item, &gdef));
@@ -226,7 +358,6 @@ tensorflow::Status ConvertGraphDefToTensorRT(
// AJ refactoring shape inference through grappler/GraphProperties.
tensorflow::grappler::GraphProperties static_graph_properties(item);
TF_RETURN_IF_ERROR(static_graph_properties.InferStatically(false));
-
// Build full graph
tensorflow::FunctionLibraryDefinition flib(tensorflow::OpRegistry::Global(),
gdef.library());
@@ -243,7 +374,7 @@ tensorflow::Status ConvertGraphDefToTensorRT(
}
// TODO(sami): this should be passed as a knob!!!!
- segment_options.minimum_segment_size = 2;
+ segment_options.minimum_segment_size = minimum_segment_size;
tensorflow::tensorrt::segment::SegmentNodesVector segments;
TF_RETURN_IF_ERROR(tensorrt::segment::SegmentGraph(
gdef, IsTensorRTCandidate, segment_options, &segments));
@@ -252,14 +383,37 @@ tensorflow::Status ConvertGraphDefToTensorRT(
}
std::unordered_map<string, tensorflow::Node*> node_map;
TF_RETURN_IF_ERROR(BuildNodeMap(graph, &node_map));
+ std::unordered_map<string, std::pair<int, string>> output_edge_map;
+ int count = 0;
+ float total_num_nodes_in_segments = 0.;
+ for (auto s : segments) {
+ total_num_nodes_in_segments += s.size();
+ }
for (const std::set<string>& subgraph_node_names : segments) {
std::set<int> subgraph_node_ids;
+ size_t max_mem_per_engine =
+ max_workspace_size_bytes *
+ ((float)subgraph_node_names.size() / total_num_nodes_in_segments);
+ std::stringstream oss;
for (const string& node_name : subgraph_node_names) {
+ oss << " " << node_name;
subgraph_node_ids.insert(node_map.at(node_name)->id());
}
- TF_RETURN_IF_ERROR(ConvertSubGraphToTensorRT(
- output_names, subgraph_node_ids, max_batch_size,
- max_workspace_size_bytes, static_graph_properties, &graph));
+ VLOG(2) << "Subgraph nodes" << oss.str();
+ ConvertGraphParams p(graph, output_names, subgraph_node_ids, max_batch_size,
+ max_mem_per_engine, static_graph_properties,
+ &output_edge_map, precision_mode);
+ if (precision_mode == INT8MODE) {
+ TF_RETURN_IF_ERROR(GetCalibNode(&p));
+ } else {
+ tensorflow::Status status = ConvertSubGraphToTensorRT(&p);
+ if (status != tensorflow::Status::OK()) {
+ LOG(WARNING) << "subgraph conversion error for subgraph_index:" << count
+ << " due to: \n"
+ << status.ToString() << " SKIPPING......";
+ }
+ count++;
+ }
}
graph.ToGraphDef(new_graph_def);
return tensorflow::Status::OK();
diff --git a/tensorflow/contrib/tensorrt/convert/convert_graph.h b/tensorflow/contrib/tensorrt/convert/convert_graph.h
index 154ad3f2e8..e01e4a5328 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_graph.h
+++ b/tensorflow/contrib/tensorrt/convert/convert_graph.h
@@ -28,14 +28,20 @@ namespace tensorflow {
namespace tensorrt {
namespace convert {
+// Converts a previously generated calibration graph, used during calibration
+// runs, into an inference graph.
+tensorflow::Status ConvertCalibGraphToInferGraph(
+ const tensorflow::GraphDef& graph_def, tensorflow::GraphDef* new_graph_def);
+
// max_batch_size: the maximum batch size that can be used at inference time;
// the optimization targets an inference run with this batch size.
-// max_workspace_size_bytes: The upper bound of memory allowence for
+// max_workspace_size_bytes: The upper bound of memory allowance for
// engine building.
tensorflow::Status ConvertGraphDefToTensorRT(
const tensorflow::GraphDef& graph_def,
const std::vector<string>& output_names, size_t max_batch_size,
- size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def);
+ size_t max_workspace_size_bytes, tensorflow::GraphDef* new_graph_def,
+ int precision_mode, int minimum_segment_size);
} // namespace convert
} // namespace tensorrt
diff --git a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
index 9ee717dd7f..92a692baa7 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_nodes.cc
@@ -24,6 +24,10 @@ limitations under the License.
#include <utility>
#include <vector>
+#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resources.h"
+#include "tensorflow/core/framework/node_def.pb.h" // NOLINT
#include "tensorflow/core/framework/node_def_builder.h"
#include "tensorflow/core/framework/tensor_shape.pb.h" // NOLINT
#include "tensorflow/core/framework/types.h"
@@ -32,6 +36,7 @@ limitations under the License.
#include "tensorflow/core/graph/graph_constructor.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/strings/str_util.h"
#include "tensorflow/core/lib/strings/strcat.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/tensor_coding.h"
@@ -39,7 +44,6 @@ limitations under the License.
#if GOOGLE_CUDA
#if GOOGLE_TENSORRT
-#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
#include "tensorrt/include/NvInfer.h"
// Check if the types are equal. Cast to int first so that failure log message
@@ -49,6 +53,7 @@ limitations under the License.
namespace tensorflow {
namespace tensorrt {
namespace convert {
+using ::tensorflow::strings::StrCat;
namespace {
@@ -65,7 +70,8 @@ inline tensorflow::Status ConvertDType(tensorflow::DataType tf_dtype,
*trt_dtype = nvinfer1::DataType::kHALF;
break;
default:
- return tensorflow::errors::InvalidArgument("Unsupported data type");
+ return tensorflow::errors::InvalidArgument(
+ "Unsupported data type " + tensorflow::DataTypeString(tf_dtype));
}
return tensorflow::Status::OK();
}
@@ -112,6 +118,18 @@ static std::vector<std::pair<int, int>> CreateSamePadding(
return padding;
}
+string GetCommonNameScope(const string& op_name_a, const string& op_name_b) {
+ size_t last_scope_separator = 0;
+ for (size_t i = 0; i < std::min(op_name_a.size(), op_name_b.size()); ++i) {
+ if (op_name_a[i] != op_name_b[i]) {
+ break;
+ } else if (op_name_a[i] == '/') {
+ last_scope_separator = i + 1;
+ }
+ }
+ return op_name_a.substr(0, last_scope_separator);
+}
+
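For illustration, a hedged Python rendering of `GetCommonNameScope` (not part of the change itself) shows what the helper returns for a pair of op names:

```python
def get_common_name_scope(op_name_a, op_name_b):
  """Longest shared name scope of two op names, up to the final '/'."""
  last_scope_separator = 0
  for i in range(min(len(op_name_a), len(op_name_b))):
    if op_name_a[i] != op_name_b[i]:
      break
    if op_name_a[i] == '/':
      last_scope_separator = i + 1
  return op_name_a[:last_scope_separator]

assert get_common_name_scope("block1/conv/kernel", "block1/bn/gamma") == "block1/"
```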
class TRT_ShapedWeights {
public:
TRT_ShapedWeights(tensorflow::DataType type, const void* values,
@@ -245,6 +263,11 @@ std::vector<int> TFAttrs::get<std::vector<int>>(string key) const {
}
template <>
+std::vector<string> TFAttrs::get<std::vector<string>>(string key) const {
+ auto attr = this->at(key)->list().s();
+ return std::vector<string>(attr.begin(), attr.end());
+}
+template <>
nvinfer1::Dims TFAttrs::get<nvinfer1::Dims>(string key) const {
auto values = this->get<std::vector<int>>(key);
nvinfer1::Dims dims;
@@ -266,6 +289,17 @@ tensorflow::DataType TFAttrs::get<tensorflow::DataType>(string key) const {
return this->at(key)->type();
}
+template <>
+float TFAttrs::get<float>(string key) const {
+ return this->at(key)->f();
+}
+
+template <>
+bool TFAttrs::get<bool>(string key) const {
+ return this->at(key)->b();
+}
+
+// TODO(jie): reorder4 & reorder2 should be merged?
template <typename T>
void Reorder4(nvinfer1::DimsNCHW shape, const T* idata,
nvinfer1::DimsNCHW istrides, T* odata,
@@ -283,29 +317,87 @@ void Reorder4(nvinfer1::DimsNCHW shape, const T* idata,
}
}
+template <typename T>
+void Reorder2(nvinfer1::DimsHW shape, const T* idata, nvinfer1::DimsHW istrides,
+ T* odata, nvinfer1::DimsHW ostrides) {
+ for (int h = 0; h < shape.h(); ++h) {
+ for (int w = 0; w < shape.w(); ++w) {
+ odata[h * ostrides.h() + w * ostrides.w()] =
+          idata[h * istrides.h() + w * istrides.w()];
+ }
+ }
+}
+
+// TODO(jie): fallback to tensorflow!!
+void ReorderCKtoKC(const TRT_ShapedWeights& iweights,
+ TRT_ShapedWeights* oweights) {
+ int c = iweights.shape_.d[0];
+ int k = iweights.shape_.d[1];
+ oweights->shape_.d[0] = k;
+ oweights->shape_.d[1] = c;
+ nvinfer1::DimsHW istrides = {1, k};
+ nvinfer1::DimsHW ostrides = {c, 1};
+ switch (iweights.type_) {
+ case tensorflow::DataType::DT_FLOAT: {
+ Reorder2({k, c}, static_cast<float const*>(iweights.GetValues()),
+ istrides,
+ static_cast<float*>(const_cast<void*>(oweights->GetValues())),
+ ostrides);
+ break;
+ }
+ case tensorflow::DataType::DT_HALF: {
+ Reorder2(
+ {k, c}, static_cast<Eigen::half const*>(iweights.GetValues()),
+ istrides,
+ static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues())),
+ ostrides);
+ break;
+ }
+ default:
+      LOG(FATAL) << "Unsupported type in reorder, expected fp32 or fp16 but got "
+ << DataTypeString(iweights.type_);
+ }
+}
+
void ReorderRSCKToKCRS(const TRT_ShapedWeights& iweights,
- TRT_ShapedWeights* oweights) {
+ TRT_ShapedWeights* oweights, int num_groups) {
CHECK_EQ(iweights.type_, oweights->type_);
CHECK_EQ(iweights.size_bytes(), oweights->size_bytes());
int r = iweights.shape_.d[0];
int s = iweights.shape_.d[1];
- int c = iweights.shape_.d[2];
- int k = iweights.shape_.d[3];
- oweights->shape_.d[0] = k;
- oweights->shape_.d[1] = c;
+ // TRT requires GKcRS, while TF depthwise has RSCK
+ // where c=1, C=G
+ VLOG(2) << "num_groups: " << num_groups;
+ int c = iweights.shape_.d[2] / num_groups;
+ VLOG(2) << "c" << iweights.shape_.d[2] << " then " << c;
+ int k = iweights.shape_.d[3] * num_groups;
+ VLOG(2) << "k" << iweights.shape_.d[3] << " then " << k;
+ oweights->shape_.d[0] = k / num_groups;
+ oweights->shape_.d[1] = c * num_groups;
oweights->shape_.d[2] = r;
oweights->shape_.d[3] = s;
nvinfer1::DimsNCHW istrides = {1, k, s * k * c, c * k};
nvinfer1::DimsNCHW ostrides = {c * r * s, r * s, s, 1};
switch (iweights.type_) {
- case tensorflow::DataType::DT_FLOAT:
+ case tensorflow::DataType::DT_FLOAT: {
Reorder4({k, c, r, s}, static_cast<float const*>(iweights.GetValues()),
istrides,
static_cast<float*>(const_cast<void*>(oweights->GetValues())),
ostrides);
break;
+ }
+ case tensorflow::DataType::DT_HALF: {
+ Reorder4(
+ {k, c, r, s}, static_cast<Eigen::half const*>(iweights.GetValues()),
+ istrides,
+ static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues())),
+ ostrides);
+ break;
+ }
+
default:
- LOG(FATAL) << "!!!!!!!!!!!!!!!!!!!!!!!!broke!!!!!!!!!!!!";
+ LOG(FATAL) << "Unsupported type, expected fp32 or fp16 but got "
+ << DataTypeString(iweights.type_);
}
}
@@ -323,12 +415,11 @@ inline std::shared_ptr<T> infer_object(T* obj) {
return std::shared_ptr<T>(obj, InferDeleter());
}
-// Logger for GIE info/warning/errors
class Converter;
using OpConverter =
std::function<tensorflow::Status(Converter&, const tensorflow::NodeDef&,
- std::vector<TRT_TensorOrWeights> const&,
+ const std::vector<TRT_TensorOrWeights>&,
std::vector<TRT_TensorOrWeights>*)>;
class Converter {
@@ -336,34 +427,57 @@ class Converter {
std::unordered_map<string, OpConverter> op_registry_;
nvinfer1::INetworkDefinition* trt_network_;
std::list<std::vector<uint8_t>> temp_bufs_;
-
+ tensorflow::tensorrt::TRTWeightStore* weight_store_;
+ bool fp16_;
void register_op_converters();
-
std::vector<TRT_TensorOrWeights> get_inputs(
const tensorflow::NodeDef& node_def) {
std::vector<TRT_TensorOrWeights> inputs;
- for (const auto& input_name : node_def.input()) {
- VLOG(2) << "Retrieve input: " << input_name;
- inputs.push_back(trt_tensors_.at(input_name));
+ for (auto const& input_name : node_def.input()) {
+ /*************************************************************************
+ * TODO(jie) handle case 1) here
+ * Normalizes the inputs and extracts associated metadata:
+ * 1) Inputs can contain a colon followed by a suffix of characters.
+ * That suffix may be a single number (e.g. inputName:1) or several
+ * word characters separated from a number by a colon
+ * (e.g. inputName:foo:1). The
+ * latter case is used to denote inputs and outputs of functions.
+ * 2) Control dependency inputs contain caret at the beginning and we
+ * remove this and annotate the edge as a control dependency.
+ ************************************************************************/
+ string name = input_name[0] == '^' ? input_name.substr(1) : input_name;
+ auto first = name.find_first_of(':');
+ if (first != string::npos && first + 2 == name.size() &&
+ name[first + 1] == '0')
+ name.erase(first);
+
+ VLOG(2) << "retrieve input: " << name;
+ if (trt_tensors_.count(name)) {
+ inputs.push_back(trt_tensors_.at(name));
+ } else {
+ LOG(FATAL) << "input: " << name << " not available for node at, "
+ << node_def.name();
+ }
}
return inputs;
}
public:
- explicit Converter(nvinfer1::INetworkDefinition* trt_network)
- : trt_network_(trt_network) {
+ explicit Converter(nvinfer1::INetworkDefinition* trt_network,
+ tensorflow::tensorrt::TRTWeightStore* ws, bool fp16)
+ : trt_network_(trt_network), weight_store_(ws), fp16_(fp16) {
this->register_op_converters();
}
-
+ tensorflow::tensorrt::TRTWeightStore* weight_store() { return weight_store_; }
TRT_ShapedWeights get_temp_weights(tensorflow::DataType type,
nvinfer1::Dims shape) {
TRT_ShapedWeights weights(type, nullptr, shape);
// TODO(jie): check weights size_bytes. 0 means type error
- temp_bufs_.push_back(std::vector<uint8_t>(weights.size_bytes()));
- weights.SetValues(temp_bufs_.back().data());
+ weight_store_->store_.push_back(std::vector<uint8_t>(weights.size_bytes()));
+ weights.SetValues(weight_store_->store_.back().data());
return weights;
}
-
+ bool isFP16() { return fp16_; };
TRT_ShapedWeights get_temp_weights_like(const TRT_ShapedWeights& weights) {
return this->get_temp_weights(weights.type_, weights.shape_);
}
@@ -382,7 +496,7 @@ class Converter {
TRT_TensorOrWeights output = outputs.at(i);
// TODO(jie): tf protobuf seems to be omitting the :0 suffix
string output_name = node_def.name();
- if (i != 0) output_name = output_name + ":" + std::to_string(i);
+ if (i != 0) output_name = StrCat(output_name, ":", i);
if (output.is_tensor()) {
output.tensor()->setName(output_name.c_str());
}
@@ -448,7 +562,7 @@ struct LambdaFactory {
switch (op) {
case OP_CATEGORY::RSQRT: {
VLOG(2) << "RSQRT GETS DONE";
- return [](T t) -> T { return 1.0 / std::sqrt(t); };
+ return [](T t) -> T { return 1.0 / sqrt(t); };
}
case OP_CATEGORY::NEG:
return [](T t) -> T { return -t; };
@@ -534,6 +648,22 @@ struct LambdaFactory {
}
};
+template <>
+std::function<Eigen::half(Eigen::half)> LambdaFactory::unary<Eigen::half>() {
+ switch (op) {
+ case OP_CATEGORY::RSQRT: {
+ VLOG(2) << "RSQRT GETS DONE";
+ return [](Eigen::half t) -> Eigen::half {
+ return Eigen::half(1.0 / sqrt(float(t)));
+ };
+ }
+ case OP_CATEGORY::NEG:
+ return [](Eigen::half t) -> Eigen::half { return -t; };
+ default:
+ VLOG(2) << "Not supported op for unary: " << static_cast<int>(op);
+ return nullptr;
+ }
+}
tensorflow::Status UnaryCompute(const TRT_ShapedWeights& iweights,
TRT_ShapedWeights* oweights,
LambdaFactory unary_op) {
@@ -545,6 +675,14 @@ tensorflow::Status UnaryCompute(const TRT_ShapedWeights& iweights,
std::transform(inp, inp + iweights.count(), oup, unary_op.unary<float>());
break;
}
+ case tensorflow::DataType::DT_HALF: {
+ auto inp = static_cast<Eigen::half const*>(iweights.GetValues());
+ auto oup =
+ static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues()));
+ std::transform(inp, inp + iweights.count(), oup,
+ unary_op.unary<Eigen::half>());
+ break;
+ }
default:
return tensorflow::errors::Unimplemented(
"Data type not supported: " +
@@ -588,6 +726,32 @@ tensorflow::Status BinaryCompute(const TRT_ShapedWeights& iweights_l,
}
break;
}
+ case tensorflow::DataType::DT_HALF: {
+ auto inp_l = static_cast<const Eigen::half*>(iweights_l.GetValues());
+ auto inp_r = static_cast<const Eigen::half*>(iweights_r.GetValues());
+ auto oup =
+ static_cast<Eigen::half*>(const_cast<void*>(oweights->GetValues()));
+
+ if (iweights_l.count() != iweights_r.count()) {
+      // We only support broadcasting of rank-zero (scalar) weights
+ if (iweights_l.count() == 1) {
+ VLOG(2) << "I bet it is not working!" << (*inp_l);
+ std::transform(inp_r, inp_r + iweights_r.count(), oup,
+ binary_op.broadcast_l<Eigen::half>(*inp_l));
+ } else if (iweights_r.count() == 1) {
+ VLOG(2) << "I bet it is not working!" << (*inp_r);
+ std::transform(inp_l, inp_l + iweights_l.count(), oup,
+ binary_op.broadcast_r<Eigen::half>(*inp_r));
+ } else {
+ return tensorflow::errors::Unimplemented(
+ "Binary op with non-rankZero broadcast not supported");
+ }
+ } else {
+ std::transform(inp_l, inp_l + iweights_l.count(), inp_r, oup,
+ binary_op.binary<Eigen::half>());
+ }
+ break;
+ }
default:
return tensorflow::errors::Unimplemented(
"Data type not supported: " +
@@ -599,7 +763,7 @@ tensorflow::Status BinaryCompute(const TRT_ShapedWeights& iweights_l,
tensorflow::Status ConstantFoldUnary(
Converter& ctx, const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
TRT_ShapedWeights weights_input = inputs.at(0).weights();
@@ -613,13 +777,12 @@ tensorflow::Status ConstantFoldUnary(
CHECK_EQ(weights_input.type_,
TFAttrs(node_def).get<tensorflow::DataType>("T"));
- // Maybe I should do a switch
LambdaFactory unary_op;
if (node_def.op() == "Rsqrt") {
// Compute rsqrt
unary_op.op = LambdaFactory::OP_CATEGORY::RSQRT;
auto ret = UnaryCompute(weights_input, &weights_output, unary_op);
- // PAss the output
+ // Pass the output
if (ret == tensorflow::Status::OK()) {
outputs->push_back(TRT_TensorOrWeights(weights_output));
}
@@ -631,11 +794,11 @@ tensorflow::Status ConstantFoldUnary(
}
// TODO(jie,ben) broadcast is needed yet not implemented
-// Let's get the simple stuff working first. Maybe we should fall bakc to TF
+// Let's get the simple stuff working first. Maybe we should fall back to TF
// approach for constant folding
tensorflow::Status ConstantFoldBinary(
Converter& ctx, const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
TRT_ShapedWeights weights_input_l = inputs.at(0).weights();
TRT_ShapedWeights weights_input_r = inputs.at(1).weights();
@@ -648,12 +811,12 @@ tensorflow::Status ConstantFoldBinary(
"Binary op implicit broadcast not supported: " + node_def.op());
// TODO(jie): constant fold should really fall back to TF.
- int nb_dims = weights_input_l.shape_.nbDims;
+ int num_dims = weights_input_l.shape_.nbDims;
nvinfer1::Dims output_shape;
- output_shape.nbDims = nb_dims;
- VLOG(2) << "nb_dims: " << nb_dims
+ output_shape.nbDims = num_dims;
+ VLOG(2) << "nb_dims: " << num_dims
<< ", the other: " << weights_input_r.shape_.nbDims;
- for (int i = 0; i < nb_dims; i++) {
+ for (int i = 0; i < num_dims; i++) {
if (weights_input_l.shape_.d[i] == weights_input_r.shape_.d[i]) {
output_shape.d[i] = weights_input_l.shape_.d[i];
} else if (weights_input_l.shape_.d[i] == 1 ||
@@ -678,7 +841,6 @@ tensorflow::Status ConstantFoldBinary(
// Allocate output weights
TRT_ShapedWeights weights_output = ctx.get_temp_weights(dtype, output_shape);
- // Maybe I should do a switch
LambdaFactory binary_op;
if (node_def.op() == "Sub") {
binary_op.op = LambdaFactory::OP_CATEGORY::SUB;
@@ -712,48 +874,90 @@ tensorflow::Status BinaryTensorOpWeight(
// Maybe this part has to be moved into the block of rsqrt later
// Check type consistency
- auto dtype = TFAttrs(node_def).get<nvinfer1::DataType>("T");
- CHECK_EQ_TYPE(tensor->getType(), dtype); // Cast to int for error messages
nvinfer1::DataType ttype;
TF_CHECK_OK(ConvertDType(weights.type_, &ttype));
- CHECK_EQ_TYPE(ttype, dtype); // Cast to int for error message
// Check scale mode
auto dims_w = weights.shape_;
auto dims_t = tensor->getDimensions();
- // Default to channel-wise
+ // default to element-wise
auto scale_mode = nvinfer1::ScaleMode::kELEMENTWISE;
+ // TODO(jie): maybe use a permutation instead to support more cases;
+ bool permutation_flag = false;
+
if (weights.count() == 1) {
VLOG(2) << "UNIFORM";
scale_mode = nvinfer1::ScaleMode::kUNIFORM;
} else {
- // No broadcasting on Batch dimension;
- assert(dims_w.d[0] == 1);
-
- // Broadcasting on Channel dimension only allowed in kUNIFORM
- assert(dims_w.d[1] == dims_t.d[0]);
- assert(dims_w.nbDims == dims_t.nbDims);
-
- // Default is element;
- for (int i = 2; i < dims_w.nbDims; i++) {
- if (dims_w.d[i] != dims_t.d[i - 1]) {
- scale_mode = nvinfer1::ScaleMode::kCHANNEL;
- break;
+ // no broadcasting on Batch dimension;
+ VLOG(2) << "WEIGHTS DIM: " << dims_w.nbDims
+ << " tensor DIM: " << dims_t.nbDims;
+ if (dims_w.nbDims == dims_t.nbDims + 1) {
+ if (dims_w.d[0] == 1) {
+ for (int i = 1; i < dims_w.nbDims; i++) {
+ dims_w.d[i - 1] = dims_w.d[i];
+ }
+ dims_w.nbDims--;
+ } else {
+ return tensorflow::errors::InvalidArgument(
+ "Binary op cannot operate on batch, " + node_def.name());
}
}
- if (scale_mode == nvinfer1::ScaleMode::kELEMENTWISE) {
+
+ if (dims_w.nbDims == dims_t.nbDims && dims_w.d[0] == dims_t.d[0]) {
scale_mode = nvinfer1::ScaleMode::kELEMENTWISE;
- for (int i = 2; i < dims_w.nbDims; i++) {
- if (dims_w.d[i] != 1)
- return tensorflow::errors::InvalidArgument(
- "Weight shape not compatible at, " + node_def.name());
+ // default is element;
+ for (int i = 1; i < dims_w.nbDims; i++) {
+ if (dims_w.d[i] != dims_t.d[i]) {
+ // if dimension does not match, switch back to channel;
+ VLOG(2) << "channel";
+ scale_mode = nvinfer1::ScaleMode::kCHANNEL;
+ break;
+ }
+ }
+ // if channel as candidate, validate it
+ if (scale_mode == nvinfer1::ScaleMode::kCHANNEL) {
+ for (int i = 1; i < dims_w.nbDims; i++) {
+ if (dims_w.d[i] != 1)
+ return tensorflow::errors::InvalidArgument(
+ "Weight shape not compatible at, " + node_def.name());
+ }
+ } else {
+ VLOG(2) << "elementwise";
}
+ } else if (dims_w.nbDims == 1 &&
+ dims_w.d[0] == dims_t.d[dims_t.nbDims - 1]) {
+ // channel wise and broadcast required;
+ permutation_flag = true;
+ scale_mode = nvinfer1::ScaleMode::kCHANNEL;
+ } else {
+ return tensorflow::errors::InvalidArgument(
+ "Weight shape not compatible at, " + node_def.name());
}
}
- // Prepare weights
+ // transpose last dimension
+ std::vector<int> permutation(dims_t.nbDims + 1);
+ if (permutation_flag) {
+ if (scale_mode == nvinfer1::ScaleMode::kCHANNEL && dims_t.nbDims > 1) {
+      // We swap the last dimension into the channel dimension for TRT,
+      // because of TensorFlow's default broadcasting rules.
+ for (int i = 0; i < static_cast<int>(permutation.size()); i++) {
+ permutation[i] = i;
+ }
+ permutation[1] = dims_t.nbDims;
+ permutation[dims_t.nbDims] = 1;
+ tensor = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor),
+ permutation);
+ } else {
+ return tensorflow::errors::InvalidArgument(
+ "Transpose cannot be applied, " + node_def.name());
+ }
+ }
+
+ // prepare weights
TRT_ShapedWeights shift_weights(weights.type_);
TRT_ShapedWeights scale_weights(weights.type_);
TRT_ShapedWeights power_weights(weights.type_);
@@ -779,88 +983,26 @@ tensorflow::Status BinaryTensorOpWeight(
scale_weights, power_weights);
nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+ // transpose back dimension
+ if (permutation_flag) {
+ output_tensor = ctx.TransposeTensor(output_tensor, permutation);
+ }
// Pass the output
outputs->push_back(TRT_TensorOrWeights(output_tensor));
return tensorflow::Status::OK();
}
-tensorflow::Status BinaryTensorOpTensor(
- Converter& ctx, const tensorflow::NodeDef& node_def,
- const nvinfer1::ITensor* tensor_l, const nvinfer1::ITensor* tensor_r,
- std::vector<TRT_TensorOrWeights>* outputs) {
- static const std::unordered_map<string, nvinfer1::ElementWiseOperation> ops{
- {"Add", nvinfer1::ElementWiseOperation::kSUM},
- {"Mul", nvinfer1::ElementWiseOperation::kPROD},
- // {"max", nvinfer1::ElementWiseOperation::kMAX},
- // {"min", nvinfer1::ElementWiseOperation::kMIN},
- {"Sub", nvinfer1::ElementWiseOperation::kSUB},
- {"Div", nvinfer1::ElementWiseOperation::kDIV},
- };
-
- // FIXME assume type matches input weights
- // Get trt type & shape
- TFAttrs attrs(node_def);
- // Maybe this part has to be moved into the block of rsqrt later
- nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("T");
-
- // Check type consistency
- CHECK_EQ_TYPE(tensor_l->getType(), dtype);
- CHECK_EQ_TYPE(tensor_r->getType(), dtype);
- auto op_pair = ops.find(node_def.op());
- if (op_pair == ops.end())
- return tensorflow::errors::Unimplemented("binary op: " + node_def.op() +
- " not supported at: " +
- node_def.name());
+enum class ConvolutionType { DEFAULT, DEPTHWISE_CONV };
- nvinfer1::IElementWiseLayer* layer = ctx.network()->addElementWise(
- *const_cast<nvinfer1::ITensor*>(tensor_l),
- *const_cast<nvinfer1::ITensor*>(tensor_r), op_pair->second);
-
- nvinfer1::ITensor* output_tensor = layer->getOutput(0);
-
- // Pass the output
- outputs->push_back(TRT_TensorOrWeights(output_tensor));
- return tensorflow::Status::OK();
-}
-
-tensorflow::Status ConvertPlaceholder(
+tensorflow::Status ConvertConv2DHelper(
Converter& ctx, const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
- std::vector<TRT_TensorOrWeights>* outputs) {
- VLOG(2) << "Placeholder should have been replace already";
- return tensorflow::errors::Unimplemented(", cannot convert Placeholder op");
- // OK this make sense since we are supposed to replace it with input
- TFAttrs attrs(node_def);
- nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("dtype");
- nvinfer1::Dims dims = attrs.get<nvinfer1::Dims>("shape");
-
- dims.nbDims--;
- for (int i = 0; i < dims.nbDims; i++) dims.d[i] = dims.d[i + 1];
-
- nvinfer1::ITensor* output =
- ctx.network()->addInput(node_def.name().c_str(), dtype, dims);
- if (!output) {
- return tensorflow::errors::InvalidArgument("Failed to create Input layer");
- }
- outputs->push_back(TRT_TensorOrWeights(output));
- return tensorflow::Status::OK();
-}
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs,
+    int group  // group == 0 specifies depthwise conv
+) {
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
-tensorflow::Status ConvertConv2D(Converter& ctx,
- const tensorflow::NodeDef& node_def,
- const std::vector<TRT_TensorOrWeights>& inputs,
- std::vector<TRT_TensorOrWeights>* outputs) {
- nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
- // TODO(jie): handle NHWC/NCHW transpose;
- TRT_ShapedWeights weights_rsck = inputs.at(1).weights();
- TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_rsck);
- ReorderRSCKToKCRS(weights_rsck, &weights);
- TRT_ShapedWeights biases(weights.type_);
- int noutput = weights.shape_.d[0];
- nvinfer1::DimsHW kernel_size;
- kernel_size.h() = weights.shape_.d[2];
- kernel_size.w() = weights.shape_.d[3];
TFAttrs attrs(node_def);
int h_index = 2;
@@ -874,11 +1016,31 @@ tensorflow::Status ConvertConv2D(Converter& ctx,
// TODO(jie): transpose it
}
+ // tensor after transpose (NCHW)
+ auto tensor_dim = tensor->getDimensions();
+
+ int num_groups = group;
+ if (num_groups == 0) // depthwise convolution
+ num_groups = tensor_dim.d[0];
+ VLOG(2) << "groups count: " << num_groups;
+
+ TRT_ShapedWeights weights_rsck = inputs.at(1).weights();
+ TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_rsck);
+ ReorderRSCKToKCRS(weights_rsck, &weights, num_groups);
+ TRT_ShapedWeights biases(weights.type_);
+ int noutput = weights.shape_.d[0] * num_groups;
+ nvinfer1::DimsHW kernel_size;
+ kernel_size.h() = weights.shape_.d[2];
+ kernel_size.w() = weights.shape_.d[3];
+ VLOG(2) << "kernel size: " << kernel_size.h() << ", " << kernel_size.w();
+
// TODO(jie): stride. (NHWC/NCHW)
auto tf_stride = attrs.get<std::vector<int>>("strides");
+ VLOG(2) << "h_INDEX" << h_index << ", w_index " << w_index;
+ VLOG(2) << "stride!!!: " << tf_stride[0] << tf_stride[1] << tf_stride[2]
+ << tf_stride[3];
nvinfer1::DimsHW stride(tf_stride[h_index], tf_stride[w_index]);
- auto tensor_dim = tensor->getDimensions();
std::vector<std::pair<int, int>> padding;
// TODO(jie): padding.
if (attrs.get<string>("padding") == "SAME") {
@@ -919,10 +1081,11 @@ tensorflow::Status ConvertConv2D(Converter& ctx,
layer->setStride(stride);
layer->setPadding({padding[0].first, padding[1].first});
layer->setName(node_def.name().c_str());
+ layer->setNbGroups(num_groups);
nvinfer1::ITensor* output_tensor = layer->getOutput(0);
auto dim_after = output_tensor->getDimensions();
- VLOG(2) << "TENSOR out: " << dim_after.d[0] << ", " << dim_after.d[1]
+ VLOG(2) << "TENSOR out: " << dim_after.d[0] << ", " << dim_after.d[1] << ", "
<< dim_after.d[2] << ", " << dim_after.d[3];
if (data_format == "NHWC") {
@@ -935,11 +1098,101 @@ tensorflow::Status ConvertConv2D(Converter& ctx,
return tensorflow::Status::OK();
}
+tensorflow::Status ConvertConv2DHelper(
+ Converter& ctx, const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs, ConvolutionType type) {
+ switch (type) {
+ case ConvolutionType::DEFAULT:
+ return ConvertConv2DHelper(ctx, node_def, inputs, outputs, 1);
+ case ConvolutionType::DEPTHWISE_CONV:
+ return ConvertConv2DHelper(ctx, node_def, inputs, outputs, 0);
+ }
+ return tensorflow::errors::Unimplemented("unsupported convolution type at, " +
+ node_def.name());
+}
+
+tensorflow::Status BinaryTensorOpTensor(
+ Converter& ctx, const tensorflow::NodeDef& node_def,
+ const nvinfer1::ITensor* tensor_l, const nvinfer1::ITensor* tensor_r,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+ static const std::unordered_map<string, nvinfer1::ElementWiseOperation> ops{
+ {"Add", nvinfer1::ElementWiseOperation::kSUM},
+ {"Mul", nvinfer1::ElementWiseOperation::kPROD},
+ {"Sub", nvinfer1::ElementWiseOperation::kSUB},
+ {"Div", nvinfer1::ElementWiseOperation::kDIV},
+ };
+
+ // FIXME assume type matches input weights
+ // get trt type & shape
+ TFAttrs attrs(node_def);
+ // maybe this part has to be moved into the block of rsqrt later
+ nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("T");
+
+ // check type consistency
+ CHECK_EQ_TYPE(tensor_l->getType(), dtype);
+ CHECK_EQ_TYPE(tensor_r->getType(), dtype);
+ auto op_pair = ops.find(node_def.op());
+ if (op_pair == ops.end())
+ return tensorflow::errors::Unimplemented("binary op: " + node_def.op() +
+ " not supported at: " +
+ node_def.name());
+
+ nvinfer1::IElementWiseLayer* layer = ctx.network()->addElementWise(
+ *const_cast<nvinfer1::ITensor*>(tensor_l),
+ *const_cast<nvinfer1::ITensor*>(tensor_r), op_pair->second);
+
+ nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+
+ // pass the output
+ outputs->push_back(TRT_TensorOrWeights(output_tensor));
+ return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertPlaceholder(
+ Converter& ctx, const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+  VLOG(2) << "Placeholder should have been replaced already";
+  return tensorflow::errors::Unimplemented("cannot convert Placeholder op");
+  // OK, this makes sense since we are supposed to replace it with an input
+ TFAttrs attrs(node_def);
+ nvinfer1::DataType dtype = attrs.get<nvinfer1::DataType>("dtype");
+ nvinfer1::Dims dims = attrs.get<nvinfer1::Dims>("shape");
+
+ dims.nbDims--;
+ for (int i = 0; i < dims.nbDims; i++) dims.d[i] = dims.d[i + 1];
+
+ nvinfer1::ITensor* output =
+ ctx.network()->addInput(node_def.name().c_str(), dtype, dims);
+ if (!output) {
+ return tensorflow::errors::InvalidArgument("Failed to create Input layer");
+ }
+ outputs->push_back(TRT_TensorOrWeights(output));
+ return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertConv2D(Converter& ctx,
+ const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+ return ConvertConv2DHelper(ctx, node_def, inputs, outputs,
+ ConvolutionType::DEFAULT);
+}
+
+tensorflow::Status ConvertConv2DDepthwise(
+ Converter& ctx, const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+ return ConvertConv2DHelper(ctx, node_def, inputs, outputs,
+ ConvolutionType::DEPTHWISE_CONV);
+}
+
tensorflow::Status ConvertPool(Converter& ctx,
const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
- nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
TFAttrs attrs(node_def);
int h_index = 2;
@@ -957,6 +1210,8 @@ tensorflow::Status ConvertPool(Converter& ctx,
// TODO(jie): support other pooling type
if (node_def.op() == "MaxPool")
type = nvinfer1::PoolingType::kMAX;
+ else if (node_def.op() == "AvgPool")
+ type = nvinfer1::PoolingType::kAVERAGE;
else
return tensorflow::errors::Unimplemented("Only supports Max pool");
@@ -1019,9 +1274,9 @@ tensorflow::Status ConvertPool(Converter& ctx,
tensorflow::Status ConvertActivation(
Converter& ctx, const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
- nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
nvinfer1::IActivationLayer* layer = ctx.network()->addActivation(
*const_cast<nvinfer1::ITensor*>(tensor), nvinfer1::ActivationType::kRELU);
nvinfer1::ITensor* output_tensor = layer->getOutput(0);
@@ -1031,14 +1286,14 @@ tensorflow::Status ConvertActivation(
tensorflow::Status ConvertScale(Converter& ctx,
const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
!inputs.at(1).is_weights())
return tensorflow::errors::Unimplemented(
"Only supports tensor op weight for now, at " + node_def.name());
// Implement tensor binaryOp weight [channel wise] for now;
- nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
// TODO(jie): handle NHWC/NCHW transpose;
TRT_ShapedWeights weights = inputs.at(1).weights();
@@ -1055,9 +1310,26 @@ tensorflow::Status ConvertScale(Converter& ctx,
} else {
VLOG(2) << "NCHW !!!!";
}
- nvinfer1::IScaleLayer* layer = ctx.network()->addScale(
- *const_cast<nvinfer1::ITensor*>(tensor), nvinfer1::ScaleMode::kCHANNEL,
- weights, empty_weights, empty_weights);
+
+ auto dims = tensor->getDimensions();
+ VLOG(2) << "tensor dimensions: " << dims.nbDims;
+ for (int i = 0; i < dims.nbDims; i++) {
+ VLOG(2) << "i: " << dims.d[i];
+ }
+ dims = weights.shape_;
+ VLOG(2) << "tensor dimensions: " << dims.nbDims;
+ for (int i = 0; i < dims.nbDims; i++) {
+ VLOG(2) << "i: " << dims.d[i];
+ }
+
+ nvinfer1::ScaleMode mode = nvinfer1::ScaleMode::kCHANNEL;
+ if (weights.shape_.d[0] == 1) {
+ mode = nvinfer1::ScaleMode::kUNIFORM;
+ }
+
+ nvinfer1::IScaleLayer* layer =
+ ctx.network()->addScale(*const_cast<nvinfer1::ITensor*>(tensor), mode,
+ weights, empty_weights, empty_weights);
nvinfer1::ITensor* output_tensor = layer->getOutput(0);
if (data_format == "NHWC") {
@@ -1072,7 +1344,7 @@ tensorflow::Status ConvertScale(Converter& ctx,
tensorflow::Status ConvertConst(Converter& ctx,
const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
const auto& weights_tensor = node_def.attr().at("value").tensor();
@@ -1091,20 +1363,144 @@ tensorflow::Status ConvertConst(Converter& ctx,
VLOG(2) << "SCALAR!!!" << node_def.name();
nvinfer1::Dims scalar_shape;
if (tensor.dims() > 0) {
- VLOG(2) << "Dimensions: " << tensor.dims();
- weights = TRT_ShapedWeights(dtype, weights_tensor.float_val().data(),
- GetTensorShape(tensor));
+ VLOG(2) << "dimensions: " << tensor.dims();
+ VLOG(2) << "size: " << weights_tensor.float_val_size();
+ scalar_shape = GetTensorShape(tensor);
+ for (int i = 0; i < scalar_shape.nbDims; i++)
+ VLOG(2) << scalar_shape.d[i];
+ if (GetShapeSize(scalar_shape) != weights_tensor.float_val_size()) {
+ if (weights_tensor.float_val_size() == 1 ||
+ scalar_shape.d[0] == weights_tensor.float_val_size()) {
+ scalar_shape.nbDims = 1;
+ // no dimension provided. flatten it
+ scalar_shape.d[0] = weights_tensor.float_val_size();
+ scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
+ } else {
+ LOG(FATAL) << "Broadcast on weights only supports kCHANNEL and"
+ << " kUNIFORM, at: " << node_def.name();
+ }
+ }
} else {
VLOG(2) << "Dimensions: " << tensor.dims();
scalar_shape.nbDims = 1;
- scalar_shape.d[0] = 1;
+ // no dimension provided. flatten it
+ scalar_shape.d[0] = weights_tensor.float_val_size();
scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
for (int i = 1; i < nvinfer1::Dims::MAX_DIMS; i++) {
scalar_shape.d[i] = 0;
scalar_shape.type[i] = nvinfer1::DimensionType::kSPATIAL;
}
- weights = TRT_ShapedWeights(dtype, weights_tensor.float_val().data(),
- scalar_shape);
+ }
+ if (ctx.isFP16()) {
+ auto dtype_new = tensorflow::DataType::DT_HALF;
+ size_t len_data = tensorflow::DataTypeSize(dtype_new);
+ for (int i = 0; i < scalar_shape.nbDims; i++)
+ len_data *= scalar_shape.d[i];
+ ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
+ void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
+ tensorflow::Tensor temp_tensor(tensorflow::DT_HALF, tensor.shape());
+ auto half_tensor = temp_tensor.flat<Eigen::half>();
+ Eigen::DefaultDevice defd;
+ half_tensor.device(defd) =
+ tensor.flat<float>().template cast<Eigen::half>();
+ memcpy(dst, half_tensor.data(), len_data); // store into weight store
+ weights = TRT_ShapedWeights(dtype_new, dst, scalar_shape);
+ } else {
+ size_t len_data = tensorflow::DataTypeSize(dtype);
+ for (int i = 0; i < scalar_shape.nbDims; i++)
+ len_data *= scalar_shape.d[i];
+ ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
+ void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
+ std::vector<float> tensor_data(
+ weights_tensor.float_val().begin(),
+ weights_tensor.float_val()
+ .end()); // make a local copy first to flatten
+ memcpy(dst, tensor_data.data(), len_data); // store into weight store
+ weights = TRT_ShapedWeights(dtype, dst, scalar_shape);
+ }
+ } else if (!weights_tensor.int_val().empty()) {
+ VLOG(2) << "int!!!" << node_def.name();
+ nvinfer1::Dims scalar_shape;
+ if (tensor.dims() > 0) {
+ VLOG(2) << "dimensions: " << tensor.dims();
+ scalar_shape = GetTensorShape(tensor);
+ if (GetShapeSize(scalar_shape) != weights_tensor.int_val_size()) {
+ if (weights_tensor.int_val_size() == 1 ||
+ scalar_shape.d[0] == weights_tensor.int_val_size()) {
+ scalar_shape.nbDims = 1;
+ // no dimension provided. flatten it
+ scalar_shape.d[0] = weights_tensor.int_val_size();
+ scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
+ } else {
+ LOG(FATAL) << "Broadcast on weights only supports kCHANNEL and"
+ << " kUNIFORM, at: " << node_def.name();
+ }
+ }
+ } else {
+ VLOG(2) << "dimensions: " << tensor.dims();
+ scalar_shape.nbDims = 1;
+ // no dimension provided. flatten it
+ scalar_shape.d[0] = weights_tensor.int_val_size();
+ scalar_shape.type[0] = nvinfer1::DimensionType::kSPATIAL;
+ for (int i = 1; i < nvinfer1::Dims::MAX_DIMS; i++) {
+ scalar_shape.d[i] = 0;
+ scalar_shape.type[i] = nvinfer1::DimensionType::kSPATIAL;
+ }
+ }
+ if (ctx.isFP16()) {
+ auto dtype_new = tensorflow::DataType::DT_HALF;
+ size_t len_data = tensorflow::DataTypeSize(dtype_new);
+ for (int i = 0; i < scalar_shape.nbDims; i++)
+ len_data *= scalar_shape.d[i];
+ ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
+ void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
+ tensorflow::Tensor temp_tensor(tensorflow::DT_HALF, tensor.shape());
+ TTypes<Eigen::half>::Flat half_tensor = temp_tensor.flat<Eigen::half>();
+ Eigen::DefaultDevice defd;
+ switch (dtype) {
+ case (tensorflow::DT_INT32): {
+ half_tensor.device(defd) =
+ tensor.flat<int32>().template cast<Eigen::half>();
+ break;
+ }
+ case (tensorflow::DT_INT16): {
+ half_tensor.device(defd) =
+ tensor.flat<int16>().template cast<Eigen::half>();
+ break;
+ }
+ case (tensorflow::DT_INT8): {
+ half_tensor.device(defd) =
+ tensor.flat<int8>().template cast<Eigen::half>();
+ break;
+ }
+ case (tensorflow::DT_UINT8): {
+ half_tensor.device(defd) =
+ tensor.flat<uint8>().template cast<Eigen::half>();
+ break;
+ }
+ default:
+ return tensorflow::errors::InvalidArgument(
+ "Datatype " + tensorflow::DataTypeString(dtype) +
+ " for FP16 conversion");
+ break;
+ };
+ memcpy(dst, half_tensor.data(), len_data); // store into weight store
+ weights = TRT_ShapedWeights(dtype_new, dst, scalar_shape);
+ } else {
+ size_t len_data = tensorflow::DataTypeSize(dtype);
+ for (int i = 0; i < scalar_shape.nbDims; i++)
+ len_data *= scalar_shape.d[i];
+ size_t len_tensor = weights_tensor.int_val_size() * sizeof(int32);
+ len_data = std::max(len_data, len_tensor);
+ ctx.weight_store()->store_.push_back(std::vector<uint8_t>(len_data));
+ void* dst = static_cast<void*>(&(ctx.weight_store()->store_.back()[0]));
+ std::vector<int32> tensor_data(
+ weights_tensor.int_val().begin(),
+ weights_tensor.int_val()
+ .end()); // make a local copy first to flatten
+ // doesn't have to be contiguous
+ memcpy(dst, tensor_data.data(), len_tensor); // store into weight store
+ weights = TRT_ShapedWeights(dtype, dst, scalar_shape);
}
} else if (!weights_tensor.tensor_content().empty()) {
VLOG(2) << "TENSOR!!!" << node_def.name();
@@ -1130,7 +1526,7 @@ tensorflow::Status ConvertConst(Converter& ctx,
tensorflow::Status ConvertIdentity(
Converter& ctx, const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
outputs->push_back(inputs.at(0));
return tensorflow::Status::OK();
@@ -1138,7 +1534,7 @@ tensorflow::Status ConvertIdentity(
tensorflow::Status ConvertBinary(Converter& ctx,
const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
if (inputs.size() != 2)
return tensorflow::errors::FailedPrecondition(
@@ -1165,7 +1561,7 @@ tensorflow::Status ConvertBinary(Converter& ctx,
tensorflow::Status ConvertUnary(Converter& ctx,
const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
if (inputs.size() != 1)
return tensorflow::errors::FailedPrecondition(
@@ -1183,7 +1579,7 @@ tensorflow::Status ConvertUnary(Converter& ctx,
tensorflow::Status ConvertReduce(Converter& ctx,
const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
!inputs.at(1).is_weights())
@@ -1191,7 +1587,7 @@ tensorflow::Status ConvertReduce(Converter& ctx,
"Input expects tensor and weights, at" + node_def.name());
// Implement tensor binaryOp weight [channel wise] for now;
- nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
auto dims = tensor->getDimensions();
// Restore implicit batch dimension
int nb_dims = dims.nbDims + 1;
@@ -1229,6 +1625,7 @@ tensorflow::Status ConvertReduce(Converter& ctx,
return tensorflow::errors::InvalidArgument("TRT cannot reduce at 0, at" +
node_def.name());
if (index_list_data[i] == 1) permuted_index = 1;
+
idx_set.emplace(index_list_data[i]);
}
@@ -1236,7 +1633,7 @@ tensorflow::Status ConvertReduce(Converter& ctx,
nvinfer1::DimsHW pool_kernel;
if (permuted_index == 1) {
for (int i = 2; i < nb_dims; i++) {
- if (idx_set.count(i)) {
+ if (idx_set.count(i) == 0) {
permuted_index = i;
break;
}
@@ -1271,12 +1668,13 @@ tensorflow::Status ConvertReduce(Converter& ctx,
output_tensor = ctx.TransposeTensor(
const_cast<nvinfer1::ITensor*>(output_tensor), permutation_order);
}
+ outputs->push_back(TRT_TensorOrWeights(output_tensor));
return tensorflow::Status::OK();
}
tensorflow::Status ConvertPad(Converter& ctx,
const tensorflow::NodeDef& node_def,
- std::vector<TRT_TensorOrWeights> const& inputs,
+ const std::vector<TRT_TensorOrWeights>& inputs,
std::vector<TRT_TensorOrWeights>* outputs) {
if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
!inputs.at(1).is_weights())
@@ -1284,7 +1682,7 @@ tensorflow::Status ConvertPad(Converter& ctx,
"Input expects tensor and weights, at" + node_def.name());
// Implement tensor binaryOp weight [channel wise] for now;
- nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
auto dims = tensor->getDimensions();
// Restore implicit batch dimension
int nb_dims = dims.nbDims + 1;
@@ -1371,19 +1769,287 @@ tensorflow::Status ConvertPad(Converter& ctx,
return tensorflow::Status::OK();
}
+tensorflow::Status ConvertConcat(Converter& ctx,
+ const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+ // not including the last input (axis) here
+ int input_size = static_cast<int>(inputs.size()) - 1;
+
+ if (!inputs.at(0).is_tensor())
+ return tensorflow::errors::InvalidArgument(
+ "Concat in TRT support only Tensor input, at " + node_def.name());
+
+ // We are retrieving the axis
+ TRT_ShapedWeights axis = inputs.at(input_size).weights();
+
+ TFAttrs attrs(node_def);
+ // auto attr_size = attrs.at("N")->i();
+ // auto data_type = attrs.get<nvinfer1::DataType>("T");
+ auto index_type = attrs.get<tensorflow::DataType>("Tidx");
+
+ // TODO(jie): handle data type
+ // Only expect to handle INT32 as index attributes for now
+ if (index_type != tensorflow::DataType::DT_INT32)
+ return tensorflow::errors::Unimplemented(
+ "Tidx supports only DT_INT32, at " + node_def.name());
+
+ int index = *(static_cast<int*>(const_cast<void*>(axis.GetValues())));
+
+ // TODO(jie): early termination with no-op (attr_size==1)
+
+ auto dim = inputs.at(0).tensor()->getDimensions();
+ // dimension check
+ if (index > dim.nbDims + 1)
+ return tensorflow::errors::InvalidArgument(
+ "Concatenate on axis out of dimension range, at " + node_def.name());
+
+ if (index == 0)
+ return tensorflow::errors::InvalidArgument(
+ "Concatenate on batch dimension not supported, at " + node_def.name());
+
+ // in case we need a permutation
+ std::vector<int> permutation_order(dim.nbDims + 1);
+
+ for (int i = 0; i < dim.nbDims + 1; i++) permutation_order[i] = i;
+
+ if (index != 1) {
+ permutation_order[1] = index - 1;
+ permutation_order[index - 1] = 1;
+ }
+
+ std::vector<nvinfer1::ITensor const*> inputs_vec;
+ // Shape check (all input tensors should have the same shape)
+ // starting from 0 since we are probably also doing transpose here;
+ for (int i = 0; i < input_size; i++) {
+ auto tensor_i = inputs.at(i).tensor();
+ auto dim_i = tensor_i->getDimensions();
+ if (dim_i.nbDims != dim.nbDims)
+ return tensorflow::errors::InvalidArgument(
+ "Concatenate receives inputs with inconsistent dimensions, at " +
+ node_def.name());
+
+ for (int j = 0; j < dim.nbDims; j++) {
+ // check dimension consistency on non-concatenate axis
+ if (j != index - 1 && dim_i.d[j] != dim.d[j])
+ return tensorflow::errors::InvalidArgument(
+ "Concatenate receives inputs with inconsistent shape, at" +
+ node_def.name());
+ }
+
+ // TRT does concatenation only on channel!
+ if (index != 1)
+ tensor_i = ctx.TransposeTensor(const_cast<nvinfer1::ITensor*>(tensor_i),
+ permutation_order);
+
+ inputs_vec.push_back(tensor_i);
+ }
+
+ // nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+ nvinfer1::IConcatenationLayer* layer = ctx.network()->addConcatenation(
+ const_cast<nvinfer1::ITensor* const*>(inputs_vec.data()),
+ inputs_vec.size());
+ nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+
+ if (index != 1) {
+ output_tensor = ctx.TransposeTensor(output_tensor, permutation_order);
+ }
+ outputs->push_back(TRT_TensorOrWeights(output_tensor));
+ return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertFusedBatchNorm(
+ Converter& ctx, const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+ TFAttrs attrs(node_def);
+ float epsilon = attrs.get<float>("epsilon");
+ auto data_format = attrs.get<string>("data_format");
+ if (data_format != "NCHW") {
+ return tensorflow::errors::Unimplemented(
+ "only data_format=NCHW is supported, at " + node_def.name());
+ }
+ bool is_training = attrs.get<bool>("is_training");
+ if (is_training) {
+ return tensorflow::errors::Unimplemented(
+ "only is_training=false is supported, at " + node_def.name());
+ }
+ nvinfer1::ITensor const* tensor = inputs.at(0).tensor();
+ TRT_ShapedWeights scale_weights = inputs.at(1).weights();
+ TRT_ShapedWeights offset_weights = inputs.at(2).weights();
+ TRT_ShapedWeights mean_weights = inputs.at(3).weights();
+ TRT_ShapedWeights variance_weights = inputs.at(4).weights();
+ TRT_ShapedWeights dummy_power_weights(scale_weights.type_);
+ TRT_ShapedWeights combined_scale_weights =
+ ctx.get_temp_weights_like(scale_weights);
+ TRT_ShapedWeights combined_offset_weights =
+ ctx.get_temp_weights_like(offset_weights);
+ size_t nweight = scale_weights.count();
+ if ((scale_weights.type_ == offset_weights.type_) &&
+ (mean_weights.type_ == variance_weights.type_) &&
+ (scale_weights.type_ == variance_weights.type_)) {
+ if ((scale_weights.type_ != tensorflow::DataType::DT_FLOAT) &&
+ (scale_weights.type_ != tensorflow::DataType::DT_HALF)) {
+ return tensorflow::errors::Unimplemented(
+ "only float32 or float16 weight data type is supported, for node " +
+ node_def.name() + " got " +
+ tensorflow::DataTypeString(scale_weights.type_));
+ }
+ if (scale_weights.type_ == tensorflow::DT_FLOAT) {
+ for (size_t i = 0; i < nweight; ++i) {
+ float scale = (static_cast<float const*>(scale_weights.GetValues()))[i];
+ float offset =
+ (static_cast<float const*>(offset_weights.GetValues()))[i];
+ float mean = (static_cast<float const*>(mean_weights.GetValues()))[i];
+ float variance =
+ (static_cast<float const*>(variance_weights.GetValues()))[i];
+ float& combined_scale_ref = const_cast<float*>(
+ static_cast<float const*>(combined_scale_weights.GetValues()))[i];
+ float& combined_offset_ref = const_cast<float*>(
+ static_cast<float const*>(combined_offset_weights.GetValues()))[i];
+ combined_scale_ref = scale / sqrtf(variance + epsilon);
+ combined_offset_ref = offset - mean * combined_scale_ref;
+ }
+ } else {
+ const Eigen::half* scale_vals =
+ (static_cast<Eigen::half const*>(scale_weights.GetValues()));
+ const Eigen::half* off_vals =
+ (static_cast<Eigen::half const*>(offset_weights.GetValues()));
+ const Eigen::half* mean_vals =
+ (static_cast<Eigen::half const*>(mean_weights.GetValues()));
+ const Eigen::half* variance_vals =
+ (static_cast<Eigen::half const*>(variance_weights.GetValues()));
+ Eigen::half* comb_scale_vals = const_cast<Eigen::half*>(
+ static_cast<Eigen::half const*>(combined_scale_weights.GetValues()));
+ Eigen::half* comb_off_vals = const_cast<Eigen::half*>(
+ static_cast<Eigen::half const*>(combined_offset_weights.GetValues()));
+ for (size_t i = 0; i < nweight; ++i) {
+ float scale(scale_vals[i]);
+ float offset(off_vals[i]);
+ float mean(mean_vals[i]);
+ float variance(variance_vals[i]);
+ float combined_scale_ref = scale / sqrtf(variance + epsilon);
+ comb_scale_vals[i] = Eigen::half(combined_scale_ref);
+ float combined_offset_ref = offset - mean * combined_scale_ref;
+ comb_off_vals[i] = Eigen::half(combined_offset_ref);
+ }
+ }
+ }
+ nvinfer1::IScaleLayer* layer = ctx.network()->addScale(
+ *const_cast<nvinfer1::ITensor*>(tensor), nvinfer1::ScaleMode::kCHANNEL,
+ combined_offset_weights.GetWeightsForTRT(),
+ combined_scale_weights.GetWeightsForTRT(),
+ dummy_power_weights.GetWeightsForTRT());
+ nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+ outputs->push_back(TRT_TensorOrWeights(output_tensor));
+ return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertMatMul(Converter& ctx,
+ const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
+
+ // TODO(jie): transpose!
+ TFAttrs attrs(node_def);
+
+ TRT_ShapedWeights weights_ck = inputs.at(1).weights();
+ TRT_ShapedWeights weights = ctx.get_temp_weights_like(weights_ck);
+ ReorderCKtoKC(weights_ck, &weights);
+ TRT_ShapedWeights biases(weights.type_);
+
+ int noutput = weights.shape_.d[0];
+
+ nvinfer1::IFullyConnectedLayer* layer = ctx.network()->addFullyConnected(
+ *const_cast<nvinfer1::ITensor*>(tensor), noutput, weights, biases);
+
+ nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+ outputs->push_back(TRT_TensorOrWeights(output_tensor));
+ return tensorflow::Status::OK();
+}
+
+tensorflow::Status ConvertReshape(
+ Converter& ctx, const tensorflow::NodeDef& node_def,
+ const std::vector<TRT_TensorOrWeights>& inputs,
+ std::vector<TRT_TensorOrWeights>* outputs) {
+ if (inputs.size() != 2 || !inputs.at(0).is_tensor() ||
+ !inputs.at(1).is_weights())
+ return tensorflow::errors::InvalidArgument(
+ "Input expects tensor and weights, at" + node_def.name());
+
+ // implement tensor binaryOp weight [channel wise] for now;
+ const nvinfer1::ITensor* tensor = inputs.at(0).tensor();
+ auto dims = tensor->getDimensions();
+ // restore implicit batch dimension
+
+ TRT_ShapedWeights shape = inputs.at(1).weights();
+
+ TFAttrs attrs(node_def);
+
+ auto padding_type = attrs.get<tensorflow::DataType>("Tshape");
+
+ if (shape.shape_.nbDims != 1)
+ return tensorflow::errors::InvalidArgument(
+ "reshape new shape is not 1 dimensional, at " + node_def.name());
+
+ // Only expect to handle INT32 as attributes for now
+ if (padding_type != tensorflow::DataType::DT_INT32)
+ return tensorflow::errors::Unimplemented(
+ "reshape new shape supports only DT_INT32, at " + node_def.name());
+
+ auto shape_data = static_cast<int*>(const_cast<void*>(shape.GetValues()));
+
+ if (shape_data[0] != -1)
+ return tensorflow::errors::InvalidArgument(
+ "reshape new shape first dimension is not -1, at " + node_def.name());
+
+ auto shape_num_dims = shape.shape_.d[0];
+ VLOG(2) << "shape dimensions: " << shape_num_dims;
+ int volume_w = 1;
+ for (int i = 1; i < shape.shape_.d[0]; i++) volume_w *= shape_data[i];
+
+ int volume_t = 1;
+ for (int i = 0; i < dims.nbDims; i++) volume_t *= dims.d[i];
+
+ VLOG(2) << "volume: " << volume_t << " volume weights: " << volume_w;
+ if (volume_w != volume_t)
+ return tensorflow::errors::InvalidArgument(
+ "volume does not agree between tensor and new shape, at " +
+ node_def.name());
+
+ nvinfer1::IShuffleLayer* layer =
+ ctx.network()->addShuffle(*const_cast<nvinfer1::ITensor*>(tensor));
+
+ nvinfer1::Dims reshape_dims;
+ VLOG(2) << "new dimension: " << shape_num_dims - 1;
+ reshape_dims.nbDims = shape_num_dims - 1;
+ for (int32_t i = 0; i < reshape_dims.nbDims; ++i) {
+ reshape_dims.d[i] = shape_data[i + 1];
+ }
+ layer->setReshapeDimensions(reshape_dims);
+ VLOG(2) << "new dimension: " << shape_num_dims - 1;
+
+ nvinfer1::ITensor* output_tensor = layer->getOutput(0);
+ auto dims_output = output_tensor->getDimensions();
+ VLOG(2) << "output tensor dimension:" << dims_output.nbDims;
+ outputs->push_back(TRT_TensorOrWeights(output_tensor));
+ return tensorflow::Status::OK();
+}
+
void Converter::register_op_converters() {
// vgg_16 slim implementation
op_registry_["Placeholder"] = ConvertPlaceholder;
op_registry_["Conv2D"] = ConvertConv2D;
+ op_registry_["DepthwiseConv2dNative"] = ConvertConv2DDepthwise;
op_registry_["Relu"] = ConvertActivation;
op_registry_["MaxPool"] = ConvertPool;
+ op_registry_["AvgPool"] = ConvertPool;
// This could be really handled as ConvertBinary
op_registry_["BiasAdd"] = ConvertScale;
op_registry_["Const"] = ConvertConst;
- // op_registry_["MatMul"] = ConvertFullyConnected; // Not used in vgg
// TODO(ben,jie): this is a temp hack.
op_registry_["Identity"] = ConvertIdentity; // Identity should be removed
- // op_registry_["AvgPool"] = ConvertPool;
// resnet_50_v1 slim implementation
op_registry_["Add"] = ConvertBinary;
@@ -1393,26 +2059,364 @@ void Converter::register_op_converters() {
op_registry_["Mean"] = ConvertReduce;
op_registry_["Pad"] = ConvertPad;
// TODO(ben,jie): Add more ops
+
+ op_registry_["ConcatV2"] = ConvertConcat;
+ op_registry_["MatMul"] = ConvertMatMul;
+ op_registry_["Reshape"] = ConvertReshape;
+ op_registry_["FusedBatchNorm"] = ConvertFusedBatchNorm;
+ op_registry_["FusedBatchNormV2"] = ConvertFusedBatchNorm;
}
} // namespace
+tensorflow::Status GetTensorRTGraph(tensorrt::convert::SubGraphParams& s) {
+ return tensorflow::errors::Unimplemented("Not implemented yet");
+}
+tensorflow::Status ConvertCalibrationNodeToEngineNode(
+ tensorflow::Graph& graph, tensorflow::Node* c_node) {
+ const auto ndef = c_node->def();
+
+ TFAttrs attrs(ndef);
+ std::vector<string> segment_nodes(
+ attrs.get<std::vector<string>>("segment_nodes"));
+ std::vector<string> output_nodes(
+ attrs.get<std::vector<string>>("segment_output_names"));
+ std::vector<string> input_names(
+ attrs.get<std::vector<string>>("input_names"));
+ string res_name = attrs.get<string>("resource_name");
+ VLOG(1) << "Node name " << c_node->name() << " res_name " << res_name;
+ string engine_name = "my_trt_op";
+ {
+ const auto node_id = tensorflow::str_util::Split(res_name, "_");
+ engine_name += node_id.back();
+ }
+ std::map<string, tensorflow::Node*> node_maps;
+
+ for (auto n : graph.op_nodes()) {
+ node_maps.insert({n->name(), n});
+ }
+ VLOG(1) << "Output Nodes:";
+ std::vector<tensorflow::DataType> out_types;
+ std::vector<const tensorflow::Edge*> out_edges;
+ for (auto& i : output_nodes) {
+ auto node_port = tensorflow::str_util::Split(i, ":");
+ VLOG(1) << " " << i << " in graph " << node_maps.count(i);
+ auto out_node_name = node_port.at(0);
+ if (node_port.size() > 1) {
+ VLOG(1) << "Multi port output" << node_port.at(0) << " "
+ << node_port.at(1) << " size=" << node_port.size();
+ }
+ auto node_it = node_maps.find(out_node_name);
+ if (node_it != node_maps.end()) {
+ tensorflow::Node* out_node = node_it->second;
+ int port = 0;
+ if (node_port.size() == 2) {
+ port = std::strtoul(node_port.at(1).c_str(), nullptr, 10);
+ out_types.push_back(out_node->output_type(port));
+ } else {
+ out_types.push_back(out_node->output_type(0));
+ }
+ for (auto out_edge : out_node->out_edges()) {
+ if (out_edge->src_output() == port) {
+ out_edges.push_back(out_edge);
+ break;
+ }
+ }
+ } else {
+ LOG(WARNING) << " couldn't find output node " << out_node_name;
+ }
+ }
+ VLOG(1) << "Input Nodes:";
+ for (auto& i : input_names) {
+ VLOG(1) << " " << i << " in graph " << node_maps.count(i);
+ }
+ auto trt_rm = tensorflow::tensorrt::TRTResourceManager::instance();
+ auto resmgr = trt_rm->getManager("TRTCalibOps");
+ tensorflow::tensorrt::TRTCalibrationResource* calib_res = nullptr;
+ auto status = resmgr->Lookup(res_name, res_name, &calib_res);
+ if (!status.ok() || !calib_res->calibrator_) {
+ return tensorflow::errors::FailedPrecondition(
+ "You must run calibration"
+ " and inference conversion in the same proces");
+ }
+
+ calib_res->calibrator_->setDone();
+ calib_res->thr_->join();
+ delete calib_res->thr_;
+ if (!calib_res->engine_) {
+ LOG(FATAL) << "Calibration failed!, engine is nullptr. Did you run "
+ "calibration graph?";
+ }
+ auto weight_rmgr = trt_rm->getManager("WeightStore");
+ TF_CHECK_OK(weight_rmgr->Delete<tensorflow::tensorrt::TRTWeightStore>(
+ res_name, res_name));
+ auto engine_plan = calib_res->engine_->serialize();
+ calib_res->engine_->destroy();
+ calib_res->network_->destroy();
+ calib_res->builder_->destroy();
+ calib_res->thr_ = nullptr;
+ calib_res->engine_ = nullptr;
+ calib_res->builder_ = nullptr;
+ tensorflow::NodeDefBuilder op_builder(engine_name, "TRTEngineOp");
+ std::vector<tensorflow::NodeDefBuilder::NodeOut> income_edges;
+ for (const auto in_edge : c_node->in_edges()) {
+ auto src = in_edge->src();
+ int dest_port = in_edge->dst_input();
+ income_edges.emplace_back(src->name(), in_edge->src_output(),
+ c_node->input_type(dest_port));
+ }
+ tensorflow::gtl::ArraySlice<tensorflow::NodeDefBuilder::NodeOut> input_list(
+ income_edges);
+ op_builder.Input(input_list);
+ tensorflow::NodeDef engine_node;
+ const char* engine_plan_data = static_cast<const char*>(engine_plan->data());
+ string engine_plan_string(engine_plan_data,
+ engine_plan_data + engine_plan->size());
+ status = op_builder.Attr("serialized_engine", engine_plan_string)
+ .Attr("input_nodes", input_names)
+ .Attr("output_nodes", output_nodes)
+ .Attr("OutT", out_types)
+ .Finalize(&engine_node);
+ if (!status.ok()) {
+ LOG(ERROR) << "Engine Node creation failed";
+ return status;
+ }
+ auto trt_engine_node = graph.AddNode(engine_node, &status);
+ TF_CHECK_OK(status);
+ for (size_t i = 0; i < out_edges.size(); i++) {
+ VLOG(1) << "Connecting trt_engine_node output " << i << " with "
+ << out_edges.at(i)->dst()->name() << " port "
+ << out_edges.at(i)->dst_input();
+ TF_RETURN_IF_ERROR(graph.UpdateEdge(trt_engine_node, i,
+ out_edges.at(i)->dst(),
+ out_edges.at(i)->dst_input()));
+ }
+ VLOG(1) << "Segment nodes:";
+ for (auto& i : segment_nodes) {
+ VLOG(1) << " " << i << " in graph " << node_maps.count(i);
+ auto it = node_maps.find(i);
+ if (it != node_maps.end()) {
+ graph.RemoveNode(it->second);
+ }
+ }
+ graph.RemoveNode(c_node);
+ return tensorflow::Status::OK();
+}
+
+tensorflow::Status InjectCalibrationNode(tensorrt::convert::SubGraphParams& s) {
+ // Visit nodes in reverse topological order and construct the TRT network.
+
+ // Toposort
+ std::vector<tensorflow::Node*> order_vec;
+ tensorflow::GetPostOrder(s.graph, &order_vec);
+ // Select just the subgraph
+ std::list<tensorflow::Node*> order;
+ for (tensorflow::Node* node : order_vec) {
+ if (s.subgraph_node_ids.count(node->id())) {
+ order.push_front(node); // we want topological order to construct the
+ // network layer by layer
+ }
+ }
+ // topological order is needed to build TRT network
+ static int static_id = 0;
+ string subgraph_name_scope;
+ if (!order.empty()) {
+ subgraph_name_scope = order.front()->name();
+ }
+ for (const tensorflow::Node* node : order) {
+ subgraph_name_scope = GetCommonNameScope(subgraph_name_scope, node->name());
+ }
+ // TODO(sami,ben,jie): proper naming!
+ string calib_op_name =
+ StrCat(subgraph_name_scope, "my_trt_calib_op_", static_id);
+ string engine_name = StrCat(subgraph_name_scope, "my_trt_op", static_id);
+ static_id++;
+ auto trt_rmgr = tensorflow::tensorrt::TRTResourceManager::instance();
+ auto op_rmgr = trt_rmgr->getManager("TRTCalibOps");
+ auto op_res = new tensorflow::tensorrt::TRTCalibrationResource();
+ TF_CHECK_OK(op_rmgr->Create(calib_op_name, calib_op_name, op_res));
+ op_res->logger_ = new tensorflow::tensorrt::Logger();
+ op_res->builder_ = nvinfer1::createInferBuilder(*(op_res->logger_));
+
+ if (!op_res->builder_) {
+ return tensorflow::errors::Internal(
+ "failed to create TensorRT builder object");
+ }
+
+ op_res->network_ = op_res->builder_->createNetwork();
+ if (!op_res->network_) {
+ return tensorflow::errors::Internal(
+ "failed to create TensorRT network object");
+ }
+
+ // Build the network
+ auto weight_rmgr = trt_rmgr->getManager("WeightStore");
+ auto ws = new tensorflow::tensorrt::TRTWeightStore();
+ TF_CHECK_OK(weight_rmgr->Create(calib_op_name, calib_op_name, ws));
+ Converter converter(op_res->network_, ws, s.precision_mode == FP16MODE);
+ std::vector<string> input_names;
+ std::vector<tensorflow::DataType> input_dtypes;
+ for (const std::pair<int, int>& input : s.input_inds) {
+ VLOG(2) << "parsing input. Node id= " << input.first;
+ int node_id = input.first;
+ int output_idx = input.second;
+ tensorflow::Node* node = s.graph.FindNodeId(node_id);
+ auto node_name = node->name();
+ input_names.push_back(node_name); // insert original node name without port
+ // TODO(jie): alternative :)
+ if (!s.graph_properties.HasOutputProperties(node_name))
+ return tensorflow::errors::Internal("failed to find input node: " +
+ node_name);
+
+ auto op_info_vec = s.graph_properties.GetOutputProperties(node_name);
+ if (static_cast<int>(op_info_vec.size()) <= output_idx)
+ return tensorflow::errors::Internal(
+ "accessing output index of: ", output_idx, ", at node: ", node_name,
+ "with output entry from shape_map: ", op_info_vec.size());
+
+ auto op_info = op_info_vec.at(output_idx);
+
+ tensorflow::DataType tf_dtype = op_info.dtype();
+ input_dtypes.push_back(tf_dtype);
+
+ nvinfer1::DataType dtype(nvinfer1::DataType::kFLOAT);
+ TF_CHECK_OK(ConvertDType(tf_dtype, &dtype));
+
+ VLOG(2) << "accessing output index of: " << output_idx
+ << ", at node: " << node_name
+ << "with output entry from shape_map: " << op_info_vec.size();
+
+ // TODO(ben,jie): update TRT input format/dimension
+ nvinfer1::DimsCHW input_dim_psuedo_chw;
+ for (int i = 0; i < 3; i++) input_dim_psuedo_chw.d[i] = 1;
+
+ for (int i = 1; i < op_info.shape().dim_size(); i++) {
+ VLOG(2) << "dimension: " << i
+ << " , size: " << op_info.shape().dim(i).size();
+ input_dim_psuedo_chw.d[i - 1] = op_info.shape().dim(i).size();
+ }
+
+ // TODO(ben,jie): proper way to restore input tensor name?
+ auto input_tensor_name = node_name;
+ if (output_idx != 0) input_tensor_name = StrCat(node_name, ":", output_idx);
+
+ nvinfer1::ITensor* input_tensor = converter.network()->addInput(
+ input_tensor_name.c_str(), dtype, input_dim_psuedo_chw);
+
+ if (!input_tensor)
+ return tensorflow::errors::InvalidArgument(
+ "Failed to create Input layer");
+ VLOG(2) << "input tensor name :" << input_tensor_name;
+
+ if (!converter.insert_input_tensor(input_tensor_name, input_tensor))
+ return tensorflow::errors::AlreadyExists(
+ "output tensor already exists for op: " + input_tensor_name);
+ }
+
+ VLOG(2) << "finished sorting";
+
+ for (const tensorflow::Node* node : order) {
+ const tensorflow::NodeDef& node_def = node->def();
+ VLOG(2) << "converting node: " << node_def.name() << " , " << node_def.op();
+ TF_RETURN_IF_ERROR(converter.convert_node(node_def));
+ }
+
+ VLOG(2) << "finished conversion";
+
+ // Gather output metadata
+ std::vector<string> output_names;
+ std::vector<tensorflow::DataType> output_dtypes;
+ int trt_engine_op_output_idx = 0;
+ for (const std::pair<int, int>& output : s.output_inds) {
+ int node_id = output.first;
+ int output_idx = output.second;
+ tensorflow::Node* node = s.graph.FindNodeId(node_id);
+ string op_name = node->name();
+ string tensor_name = op_name;
+
+ s.output_edge_map->insert(
+ {trt_engine_op_output_idx == 0
+ ? engine_name
+ : StrCat(engine_name, ":", trt_engine_op_output_idx),
+ {output_idx, tensor_name}});
+ trt_engine_op_output_idx++;
+ if (output_idx != 0) {
+ tensor_name = StrCat(tensor_name, ":", output_idx);
+ }
+ VLOG(1) << "output tensor name: " << tensor_name;
+ output_names.push_back(tensor_name);
+ auto tensor_or_weights = converter.get_tensor(tensor_name);
+ if (!tensor_or_weights.is_tensor()) {
+ return tensorflow::errors::InvalidArgument(
+ "Output node is weights not tensor");
+ }
+ nvinfer1::ITensor* tensor = tensor_or_weights.tensor();
+ if (!tensor) {
+ return tensorflow::errors::NotFound("Output tensor not found: " +
+ tensor_name);
+ }
+ converter.network()->markOutput(*tensor);
+ tensorflow::DataType tf_dtype = node->output_type(output_idx);
+ output_dtypes.push_back(tf_dtype);
+ nvinfer1::DataType trt_dtype = nvinfer1::DataType::kFLOAT;
+ TF_RETURN_IF_ERROR(ConvertDType(tf_dtype, &trt_dtype));
+ tensor->setType(trt_dtype);
+ }
+
+ VLOG(2) << "finished output";
+
+ // Build the engine
+ op_res->builder_->setMaxBatchSize(s.max_batch_size);
+ op_res->builder_->setMaxWorkspaceSize(s.max_workspace_size_bytes);
+
+ // Build the TRT op
+ // TODO(sami,ben,jie): proper naming!
+ tensorflow::NodeDefBuilder op_builder(calib_op_name, "TRTCalibOp");
+ std::vector<tensorflow::NodeDefBuilder::NodeOut> income_edges;
+ for (size_t i = 0; i < input_names.size(); ++i) {
+ int output_idx = s.input_inds.at(i).second;
+ // we wired up the input here already, it is redundant to do it again in
+ // ConvertSubGraphToTensorRT(convert_graph.cc)
+ auto incoming_edge = tensorflow::NodeDefBuilder::NodeOut(
+ input_names.at(i), output_idx, input_dtypes.at(i));
+ VLOG(1) << calib_op_name << " input " << i << " = " << input_names.at(i)
+ << ":" << output_idx
+ << " dType= " << tensorflow::DataTypeString(input_dtypes.at(i));
+ income_edges.push_back(incoming_edge);
+ }
+ tensorflow::gtl::ArraySlice<tensorflow::NodeDefBuilder::NodeOut> input_list(
+ income_edges);
+ op_builder.Input(input_list);
+ std::vector<string> segment_names;
+ segment_names.reserve(s.subgraph_node_ids.size());
+ for (int i : s.subgraph_node_ids) {
+ auto node = s.graph.FindNodeId(i);
+ segment_names.push_back(node->name());
+ }
+ LOG(INFO) << "finished op preparation";
+
+ auto status = op_builder.Attr("segment_nodes", segment_names)
+ .Attr("input_names", input_names)
+ .Attr("segment_output_names", output_names)
+ .Attr("resource_name", calib_op_name)
+ .Finalize(s.trt_node);
+
+ LOG(INFO) << status.ToString();
+ LOG(INFO) << "finished op building";
+
+ return tensorflow::Status::OK();
+}
tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
- const tensorflow::Graph& graph, const std::set<int>& subgraph_node_ids,
- const std::vector<std::pair<int, int>>& input_inds,
- const std::vector<std::pair<int, int>>& output_inds, size_t max_batch_size,
- size_t max_workspace_size_bytes,
- const tensorflow::grappler::GraphProperties& graph_properties,
- tensorflow::NodeDef* trt_node) {
+ tensorrt::convert::SubGraphParams& s) {
// Visit nodes in reverse topological order and construct the TRT network.
// Toposort
std::vector<tensorflow::Node*> order_vec;
- tensorflow::GetPostOrder(graph, &order_vec);
+ tensorflow::GetPostOrder(s.graph, &order_vec);
// Select just the subgraph
std::list<tensorflow::Node*> order;
for (tensorflow::Node* node : order_vec) {
- if (subgraph_node_ids.count(node->id())) {
+ if (s.subgraph_node_ids.count(node->id())) {
// We want topological order to contstruct the
// network layer by layer
order.push_front(node);
@@ -1434,46 +2438,86 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
"Failed to create TensorRT network object");
}
+ string subgraph_name_scope;
+ if (!order.empty()) {
+ subgraph_name_scope = order.front()->name();
+ }
+ for (const tensorflow::Node* node : order) {
+ subgraph_name_scope = GetCommonNameScope(subgraph_name_scope, node->name());
+ }
+ static int static_id = 0;
+ // TODO(sami,ben,jie): proper naming!
+ string engine_name = StrCat(subgraph_name_scope, "my_trt_op");
+ engine_name = StrCat(engine_name, static_id++);
+ auto trt_rmgr = tensorflow::tensorrt::TRTResourceManager::instance();
+ auto weight_rmgr = trt_rmgr->getManager("WeightStore");
+ auto ws = new tensorflow::tensorrt::TRTWeightStore();
+ TF_CHECK_OK(weight_rmgr->Create(engine_name, engine_name, ws));
+
// Build the network
- Converter converter(trt_network.get());
+ Converter converter(trt_network.get(), ws, s.precision_mode == FP16MODE);
std::vector<string> input_names;
std::vector<tensorflow::DataType> input_dtypes;
- for (std::pair<int, int> const& input : input_inds) {
+ for (const std::pair<int, int>& input : s.input_inds) {
+ VLOG(2) << "parsing input!!!!!";
int node_id = input.first;
int output_idx = input.second;
- tensorflow::Node* node = graph.FindNodeId(node_id);
+ tensorflow::Node* node = s.graph.FindNodeId(node_id);
auto node_name = node->name();
- input_names.push_back(node_name); // Insert original node name without port
- // TODO(jie): alternative :)
- if (!graph_properties.HasOutputProperties(node_name))
- return tensorflow::errors::Internal("Failed to find input node: " +
- node_name);
+ // input_names should use the node name in the graph;
+ // here it should be the input tensor name, matching the engine binding
+ // (the original node name, with ":<port>" appended when the port is non-zero)
+ auto tensor_name = node_name;
+ if (output_idx != 0) {
+ tensor_name = StrCat(tensor_name, ":", output_idx);
+ }
- auto op_info_vec = graph_properties.GetOutputProperties(node_name);
- if (static_cast<int>(op_info_vec.size()) < output_idx)
- return tensorflow::errors::Internal(
- "Accessing output index of: " + std::to_string(output_idx) +
- ", at node: " + node_name + " with output entry from shape_map: " +
- std::to_string(op_info_vec.size()));
+ VLOG(2) << "input name: " << node_name << " tensor_name: " << tensor_name
+ << " idx: " << output_idx;
- auto op_info = op_info_vec.at(output_idx);
+ auto shape_inference_node_name = node_name;
+ auto shape_inference_output_idx = output_idx;
+ // rewire the shape inference to original node in the graph
+ if (s.output_edge_map->count(tensor_name)) {
+ shape_inference_node_name = s.output_edge_map->at(tensor_name).second;
+ shape_inference_output_idx = s.output_edge_map->at(tensor_name).first;
+ }
+ if (shape_inference_output_idx < 0) continue;
+ VLOG(2) << "shapeinference name: " << shape_inference_node_name
+ << " idx: " << shape_inference_output_idx;
+ if (!s.graph_properties.HasOutputProperties(shape_inference_node_name))
+ return tensorflow::errors::Internal("failed to find input node: " +
+ shape_inference_node_name);
+
+ auto op_info_vec =
+ s.graph_properties.GetOutputProperties(shape_inference_node_name);
+ if (static_cast<int>(op_info_vec.size()) <= shape_inference_output_idx)
+ return tensorflow::errors::Internal(
+ "accessing output index of: ", shape_inference_output_idx,
+ ", at node: ", shape_inference_node_name,
+ " with output entry from shape_map: ", op_info_vec.size());
+
+ auto op_info = op_info_vec.at(shape_inference_output_idx);
tensorflow::DataType tf_dtype = op_info.dtype();
input_dtypes.push_back(tf_dtype);
nvinfer1::DataType dtype(nvinfer1::DataType::kFLOAT);
TF_CHECK_OK(ConvertDType(tf_dtype, &dtype));
- VLOG(2) << "Accessing output index of: " << std::to_string(output_idx)
+ VLOG(2) << "Accessing output index of: " << output_idx
<< ", at node: " << node_name
- << " with output entry from shape_map: "
- << std::to_string(op_info_vec.size());
-
+ << " with output entry from shape_map: " << op_info_vec.size();
// TODO(ben,jie): update TRT input format/dimension
nvinfer1::DimsCHW input_dim_psuedo_chw;
for (int i = 0; i < 3; i++) input_dim_psuedo_chw.d[i] = 1;
+ // TODO(jie): TRT 3.x only supports 4-dimensional input tensors.
+ // Update the code once TRT 4.0 comes out.
+ if (op_info.shape().dim_size() != 4)
+ return tensorflow::errors::Unimplemented("require 4 dimensional input");
+
for (int i = 1; i < op_info.shape().dim_size(); i++) {
VLOG(2) << "dimension: " << i
<< " , size: " << op_info.shape().dim(i).size();
@@ -1482,9 +2526,11 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
// TODO(ben,jie): proper way to restore input tensor name?
auto input_tensor_name = node_name;
- if (output_idx != 0)
- input_tensor_name = node_name + ":" + std::to_string(output_idx);
+ if (output_idx != 0) {
+ input_tensor_name = StrCat(node_name, ":", output_idx);
+ }
+ input_names.push_back(input_tensor_name);
nvinfer1::ITensor* input_tensor = converter.network()->addInput(
input_tensor_name.c_str(), dtype, input_dim_psuedo_chw);
@@ -1511,14 +2557,22 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
// Gather output metadata
std::vector<string> output_names;
std::vector<tensorflow::DataType> output_dtypes;
- for (std::pair<int, int> const& output : output_inds) {
+ int trt_engine_op_output_idx = 0;
+ for (const std::pair<int, int>& output : s.output_inds) {
int node_id = output.first;
int output_idx = output.second;
- tensorflow::Node* node = graph.FindNodeId(node_id);
+ tensorflow::Node* node = s.graph.FindNodeId(node_id);
string op_name = node->name();
string tensor_name = op_name;
+
+ s.output_edge_map->insert(
+ {trt_engine_op_output_idx == 0
+ ? engine_name
+ : StrCat(engine_name, ":", trt_engine_op_output_idx),
+ {output_idx, tensor_name}});
+ trt_engine_op_output_idx++;
if (output_idx != 0)
- tensor_name = tensor_name + ":" + std::to_string(output_idx);
+ tensorflow::strings::StrAppend(&tensor_name, ":", output_idx);
VLOG(2) << "Output tensor name: " << tensor_name;
output_names.push_back(tensor_name);
auto tensor_or_weights = converter.get_tensor(tensor_name);
@@ -1540,19 +2594,25 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
}
VLOG(2) << "Finished output";
- // TODO(jie): static_id is not thread safe.
- static int static_id = 0;
// Build the engine
- trt_builder->setMaxBatchSize(max_batch_size);
- trt_builder->setMaxWorkspaceSize(max_workspace_size_bytes);
- VLOG(0) << "Starting build engine " << static_id;
- // TODO(ben,jie): half2 and int8 mode support
+ trt_builder->setMaxBatchSize(s.max_batch_size);
+ trt_builder->setMaxWorkspaceSize(s.max_workspace_size_bytes);
+ VLOG(0) << "Max batch size= " << s.max_batch_size
+ << " max workspace size= " << s.max_workspace_size_bytes;
+ if (s.precision_mode == FP16MODE) {
+ trt_builder->setHalf2Mode(true);
+ VLOG(0) << "Using FP16 precision mode";
+ }
+ LOG(INFO) << "starting build engine";
string engine_plan_string;
{
auto trt_engine =
infer_object(trt_builder->buildCudaEngine(*converter.network()));
VLOG(0) << "Built network";
+ if (trt_engine.get() == nullptr) {
+ return tensorflow::errors::Internal("Engine building failure");
+ }
auto engine_plan = infer_object(trt_engine->serialize());
VLOG(0) << "Serialized engine";
const char* engine_plan_data =
@@ -1560,18 +2620,19 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
engine_plan_string =
string(engine_plan_data, engine_plan_data + engine_plan->size());
}
-
- VLOG(0) << "Finished engine";
+ TF_RETURN_IF_ERROR(weight_rmgr->Delete<tensorflow::tensorrt::TRTWeightStore>(
+ engine_name, engine_name));
+ LOG(INFO) << "finished engine " << engine_name;
// Build the TRT op
- // TODO(sami,ben,jie): proper naming!
- tensorflow::NodeDefBuilder op_builder(
- tensorflow::strings::StrCat("my_trt_op", static_id++), "TRTEngineOp");
+ tensorflow::NodeDefBuilder op_builder(engine_name, "TRTEngineOp");
std::vector<tensorflow::NodeDefBuilder::NodeOut> income_edges;
+ VLOG(2) << "input edge size: " << input_names.size();
for (size_t i = 0; i < input_names.size(); ++i) {
- int output_idx = input_inds.at(i).second;
- // We wired up the input here already, it is redundant to do it again in
- // ConvertSubGraphToTensorRT(convert_graph.cc)
+ VLOG(2) << "input edges: " << i << " " << input_names.at(i);
+ int output_idx = s.input_inds.at(i).second;
+ // we wired up the input here already, it is redundant to do it again in
+ // ConvertSubGraphToTensorRT(convert_graph.cc)
auto incoming_edge = tensorflow::NodeDefBuilder::NodeOut(
input_names.at(i), output_idx, input_dtypes.at(i));
income_edges.push_back(incoming_edge);
@@ -1586,7 +2647,7 @@ tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
.Attr("input_nodes", input_names)
.Attr("output_nodes", output_names)
.Attr("OutT", output_dtypes)
- .Finalize(trt_node);
+ .Finalize(s.trt_node);
VLOG(0) << status.ToString() << " finished op building";
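
For reference on the ConvertConcat path added above: this TensorRT version only concatenates along the channel axis, so when the requested axis is not the channel axis the converter transposes each input so that the concat axis becomes axis 1, concatenates, and transposes the result back. A minimal NumPy sketch of that trick (illustration only, not part of the patch; the array shapes are made up):

import numpy as np

def concat_via_channel(tensors, axis):
    # Swap the requested axis with the channel axis (1), concatenate on the
    # channel axis, then apply the same swap again to undo it.
    perm = list(range(tensors[0].ndim))
    perm[1], perm[axis] = perm[axis], perm[1]
    moved = [np.transpose(t, perm) for t in tensors]
    out = np.concatenate(moved, axis=1)
    return np.transpose(out, perm)  # the swap is its own inverse

a = np.arange(120.0).reshape(2, 3, 4, 5)
b = np.arange(48.0).reshape(2, 3, 4, 2)
assert np.array_equal(concat_via_channel([a, b], axis=3),
                      np.concatenate([a, b], axis=3))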
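
Similarly, ConvertFusedBatchNorm folds the scale, offset, mean and variance inputs into a single per-channel scale/offset pair so the whole op maps onto one IScaleLayer. A short NumPy check of the identity it relies on (illustration only; the values are arbitrary):

import numpy as np

scale = np.array([1.5, 0.7])
offset = np.array([0.1, -0.2])
mean = np.array([0.3, 0.9])
variance = np.array([2.0, 0.5])
epsilon = 1e-3

# Folding done once at conversion time.
combined_scale = scale / np.sqrt(variance + epsilon)
combined_offset = offset - mean * combined_scale

x = np.random.randn(4, 2)  # 4 examples, 2 channels
batch_norm = scale * (x - mean) / np.sqrt(variance + epsilon) + offset
assert np.allclose(combined_scale * x + combined_offset, batch_norm)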
diff --git a/tensorflow/contrib/tensorrt/convert/convert_nodes.h b/tensorflow/contrib/tensorrt/convert/convert_nodes.h
index 2e7fd19566..954a1e72f8 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_nodes.h
+++ b/tensorflow/contrib/tensorrt/convert/convert_nodes.h
@@ -17,6 +17,8 @@ limitations under the License.
#define TENSORFLOW_CONTRIB_TENSORRT_CONVERT_CONVERT_NODES_H_
#include <set>
+#include <string>
+#include <unordered_map>
#include <utility>
#include <vector>
@@ -32,16 +34,49 @@ namespace tensorflow {
namespace tensorrt {
namespace convert {
-tensorflow::Status ConvertSubGraphToTensorRTNodeDef(
- const tensorflow::Graph& graph, const std::set<int>& subgraph_node_ids,
- const std::vector<std::pair<int, int>>&
- input_inds, // {node_id, output_idx}
- const std::vector<std::pair<int, int>>&
- output_inds, // {node_id, output_idx}
- size_t max_batch_size, size_t max_workspace_size_bytes,
- const tensorflow::grappler::GraphProperties& graph_prop,
- tensorflow::NodeDef* trt_node);
+const int FP32MODE = 0;
+const int FP16MODE = 1;
+const int INT8MODE = 2;
+struct SubGraphParams {
+ SubGraphParams(
+ tensorflow::Graph& inp_graph,
+ const std::set<int>& subgraph_node_id_numbers,
+ const std::vector<std::pair<int, int>>& input_indices,
+ const std::vector<std::pair<int, int>>& output_indices,
+ size_t max_supported_batch_size, size_t max_consumed_workspace_size_bytes,
+ const tensorflow::grappler::GraphProperties& current_graph_properties,
+ std::unordered_map<string, std::pair<int, string>>* output_edges,
+ tensorflow::NodeDef* constructed_trt_node,
+ int engine_precision_mode = FP32MODE)
+ : graph(inp_graph),
+ subgraph_node_ids(subgraph_node_id_numbers),
+ input_inds(input_indices),
+ output_inds(output_indices),
+ max_batch_size(max_supported_batch_size),
+ max_workspace_size_bytes(max_consumed_workspace_size_bytes),
+ graph_properties(current_graph_properties),
+ output_edge_map(output_edges),
+ trt_node(constructed_trt_node),
+ precision_mode(engine_precision_mode) {}
+
+ tensorflow::Graph& graph;
+ const std::set<int>& subgraph_node_ids;
+ const std::vector<std::pair<int, int>>& input_inds; // {node_id, output_idx}
+ const std::vector<std::pair<int, int>>& output_inds; // {node_id, output_idx}
+ size_t max_batch_size;
+ size_t max_workspace_size_bytes;
+ const tensorflow::grappler::GraphProperties& graph_properties;
+ std::unordered_map<string, std::pair<int, string>>* output_edge_map;
+ tensorflow::NodeDef* trt_node;
+ const int precision_mode;
+};
+
+// TODO(sami): Replace references with const reference or pointers
+tensorflow::Status ConvertSubGraphToTensorRTNodeDef(SubGraphParams& params);
+tensorflow::Status InjectCalibrationNode(SubGraphParams& params);
+tensorflow::Status ConvertCalibrationNodeToEngineNode(tensorflow::Graph& graph,
+ tensorflow::Node* c_node);
} // namespace convert
} // namespace tensorrt
} // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc
index 1dcb87e768..aea44fd8a2 100644
--- a/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc
+++ b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc
@@ -21,10 +21,11 @@ limitations under the License.
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/platform/stream_executor.h"
#if GOOGLE_CUDA
#if GOOGLE_TENSORRT
-#include "cuda_runtime_api.h"
+#include "cuda/include/cuda_runtime_api.h"
#include "tensorrt/include/NvInfer.h"
namespace tensorflow {
@@ -113,7 +114,13 @@ void TRTCalibOp::Compute(tensorflow::OpKernelContext* ctx) {
ctx->set_output(i, t);
}
VLOG(2) << "Filled map for sending";
- calib_res->calibrator_->setBatch(input_data);
+ // copied from cuda_kernel_helper since it seems only valid in *.cu.cc files
+ const cudaStream_t* stream = CHECK_NOTNULL(
+ reinterpret_cast<const cudaStream_t*>(ctx->op_device_context()
+ ->stream()
+ ->implementation()
+ ->CudaStreamMemberHack()));
+ calib_res->calibrator_->setBatch(input_data, *stream);
VLOG(2) << "Passed calibration data";
// TODO(aaroey): make sure we wait for the completion of calibration on the
// last batch in future PR.
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
index 8efdf63ebe..b32371b642 100644
--- a/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
+++ b/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc
@@ -24,8 +24,12 @@ limitations under the License.
#include "cuda/include/cuda_runtime_api.h"
namespace tensorflow {
-namespace tensorrt {
static ::tensorflow::tensorrt::Logger logger;
+namespace gpu = ::perftools::gputools;
+using IRuntime = nvinfer1::IRuntime;
+using Dims = nvinfer1::Dims;
+
+namespace tensorrt {
TRTEngineOp::TRTEngineOp(OpKernelConstruction* context) : OpKernel(context) {
// read serialized_engine
@@ -40,10 +44,21 @@ TRTEngineOp::TRTEngineOp(OpKernelConstruction* context) : OpKernel(context) {
// TODO(samikama) runtime should be taken from a resourcemanager as well.
// Only engine should be in the op and context and runtime should be taken
// from resourcemanager
- nvinfer1::IRuntime* infer = nvinfer1::createInferRuntime(logger);
+ // TODO(jie): cudaSetDevice make sure trt engine is allocated on the same
+ // gpu where the input/output is also located.
+ int gpu_id = context->device()->tensorflow_gpu_device_info()->gpu_id;
+ cudaSetDevice(gpu_id);
+ int device;
+ cudaGetDevice(&device);
+ if (gpu_id != device) LOG(FATAL) << "set device failed!";
+
+ // TODO(samikama) runtime should be taken from a resourcemanager as well.
+ // Only engine should be in the op and context and runtime should be taken
+ // from resourcemanager
+
+ IRuntime* infer = nvinfer1::createInferRuntime(logger);
trt_engine_ptr_.reset(infer->deserializeCudaEngine(
serialized_engine.c_str(), serialized_engine.size(), nullptr));
-
trt_execution_context_ptr_.reset(trt_engine_ptr_->createExecutionContext());
// Runtime is safe to delete after engine creation
infer->destroy();
@@ -55,7 +70,6 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
size_t binding_index;
int num_batch = 0;
- bool valid = true;
for (int i = 0; i < context->num_inputs(); i++) {
// Grab the input tensor
binding_index = trt_engine_ptr_->getBindingIndex(input_nodes_[i].c_str());
@@ -64,8 +78,12 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
const TensorShape& input_shape = input_tensor.shape();
if (i == 0) {
num_batch = input_shape.dim_size(0);
+ if (num_batch > trt_engine_ptr_->getMaxBatchSize()) {
+ LOG(FATAL) << "input tensor batch larger than max_batch_size: "
+ << trt_engine_ptr_->getMaxBatchSize();
+ }
} else if (num_batch != input_shape.dim_size(0)) {
- valid = false;
+ LOG(FATAL) << "input data inconsistent batch size";
break;
}
switch (trt_engine_ptr_->getBindingDataType(binding_index)) {
@@ -81,9 +99,6 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
}
}
- // Might want a different way to inform the user of batch size inconsistency
- if (!valid) LOG(WARNING) << "input data inconsistent batch size";
-
for (int i = 0; i < static_cast<int>(output_nodes_.size()); i++) {
// This is bad that we have to reallocate output buffer every run.
// Create an output tensor
@@ -126,9 +141,11 @@ void TRTEngineOp::Compute(OpKernelContext* context) {
->implementation()
->CudaStreamMemberHack()));
- // execution handled by TF since we are getting stream from TF.
- // it is safe for CPU pointer array (buffers) to go out of scope after enqueue
- trt_execution_context_ptr_->enqueue(num_batch, &buffers[0], *stream, nullptr);
+ // TODO(jie): trt enqueue does not return error
+ auto ret = trt_execution_context_ptr_->enqueue(num_batch, &buffers[0],
+ *stream, nullptr);
+ VLOG(2) << "enqueue returns: " << ret;
+ // sync should be done by TF.
}
REGISTER_KERNEL_BUILDER(Name("TRTEngineOp").Device(DEVICE_GPU), TRTEngineOp);
diff --git a/tensorflow/contrib/tensorrt/log/trt_logger.cc b/tensorflow/contrib/tensorrt/log/trt_logger.cc
index 7add8cb8b3..dda0dc9e71 100644
--- a/tensorflow/contrib/tensorrt/log/trt_logger.cc
+++ b/tensorflow/contrib/tensorrt/log/trt_logger.cc
@@ -27,19 +27,19 @@ void Logger::log(Severity severity, const char* msg) {
// Suppress info-level messages
switch (severity) {
case Severity::kINFO: { // Mark TRT info messages as debug!
- VLOG(2) << msg;
+ VLOG(2) << name_ << " " << msg;
break;
}
case Severity::kWARNING: {
- LOG(WARNING) << msg;
+ LOG(WARNING) << name_ << " " << msg;
break;
}
case Severity::kERROR: {
- LOG(ERROR) << msg;
+ LOG(ERROR) << name_ << " " << msg;
break;
}
case Severity::kINTERNAL_ERROR: {
- LOG(FATAL) << msg;
+ LOG(FATAL) << name_ << " " << msg;
break;
}
// This is useless for now. But would catch it in future if enum changes. It
diff --git a/tensorflow/contrib/tensorrt/log/trt_logger.h b/tensorflow/contrib/tensorrt/log/trt_logger.h
index d71f66b933..7f3544f8cf 100644
--- a/tensorflow/contrib/tensorrt/log/trt_logger.h
+++ b/tensorflow/contrib/tensorrt/log/trt_logger.h
@@ -27,9 +27,11 @@ namespace tensorrt {
// Logger for GIE info/warning/errors
class Logger : public nvinfer1::ILogger {
- private:
+ public:
+ Logger(string name = "DefaultLogger") : name_(name){};
void log(nvinfer1::ILogger::Severity severity, const char* msg) override;
+ private:
string name_;
};
diff --git a/tensorflow/contrib/tensorrt/python/__init__.py b/tensorflow/contrib/tensorrt/python/__init__.py
index 7e050a768c..0b2321b5fc 100644
--- a/tensorflow/contrib/tensorrt/python/__init__.py
+++ b/tensorflow/contrib/tensorrt/python/__init__.py
@@ -20,5 +20,6 @@ from __future__ import print_function
# pylint: disable=unused-import,line-too-long
from tensorflow.contrib.tensorrt.python.ops import trt_engine_op
+from tensorflow.contrib.tensorrt.python.trt_convert import calib_graph_to_infer_graph
from tensorflow.contrib.tensorrt.python.trt_convert import create_inference_graph
# pylint: enable=unused-import,line-too-long
diff --git a/tensorflow/contrib/tensorrt/python/trt_convert.py b/tensorflow/contrib/tensorrt/python/trt_convert.py
index 9454862f85..338475d90e 100644
--- a/tensorflow/contrib/tensorrt/python/trt_convert.py
+++ b/tensorflow/contrib/tensorrt/python/trt_convert.py
@@ -20,11 +20,17 @@ from __future__ import print_function
# pylint: disable=unused-import,line-too-long
import six as _six
+from tensorflow.contrib.tensorrt.wrap_conversion import calib_convert
from tensorflow.contrib.tensorrt.wrap_conversion import trt_convert
from tensorflow.core.framework import graph_pb2
+from tensorflow.core.protobuf import rewriter_config_pb2
from tensorflow.python.framework import errors
from tensorflow.python.framework import errors_impl as _impl
+from tensorflow.python.framework import meta_graph
from tensorflow.python.framework import ops
+from tensorflow.python.grappler import tf_optimizer
+from tensorflow.python.util import compat
+# pylint: enable=unused-import,line-too-long
# TODO(skama): get outputs from session when implemented as c++
@@ -32,22 +38,33 @@ from tensorflow.python.framework import ops
def create_inference_graph(input_graph_def,
outputs,
max_batch_size=1,
- max_workspace_size_bytes=2 << 20):
- """Python wrapper for the TRT transormation.
-
+ max_workspace_size_bytes=2 << 20,
+ precision_mode="FP32",
+ minimum_segment_size=3):
+ """Python wrapper for the TRT transformation.
Args:
input_graph_def: GraphDef object containing a model to be transformed.
- outputs: List of tensors or node names for the model outputs.
+ outputs: list of tensors or node names for the model outputs.
max_batch_size: max size for the input batch
max_workspace_size_bytes: parameter to control memory allocation (in Bytes)
+    precision_mode: one of 'FP32', 'FP16', or 'INT8'
+ minimum_segment_size: the minimum number of nodes required for a subgraph to
+ be replaced by TRTEngineOp.
Returns:
New GraphDef with TRTEngineOps placed in graph replacing subgraphs.
Raises:
+ ValueError: if the provided precision mode is invalid.
RuntimeError: if the returned status message is malformed.
"""
+ supported_precision_modes = {"FP32": 0, "FP16": 1, "INT8": 2}
+ if precision_mode.upper() not in supported_precision_modes:
+ raise ValueError(("precision mode '{}' is not supported."
+ "It should be one of {}").format(
+ precision_mode, "{'FP32', 'FP16', 'INT8'}"))
+ mode = supported_precision_modes[precision_mode.upper()]
def py2bytes(inp):
return inp
@@ -83,7 +100,7 @@ def create_inference_graph(input_graph_def,
# pair or strings where first one is encoded status and the second
# one is the transformed graphs protobuf string.
out = trt_convert(input_graph_def_str, out_names, max_batch_size,
- max_workspace_size_bytes)
+ max_workspace_size_bytes, mode, minimum_segment_size)
status = to_string(out[0])
output_graph_def_string = out[1]
del input_graph_def_str # Save some memory
@@ -101,3 +118,46 @@ def create_inference_graph(input_graph_def,
output_graph_def.ParseFromString(output_graph_def_string)
del output_graph_def_string # Save some memory
return output_graph_def
+
+
+def calib_graph_to_infer_graph(calibration_graph_def):
+ """Convert an existing calibration graph to inference graph.
+
+ Args:
+ calibration_graph_def: the calibration GraphDef object with calibration data
+ Returns:
+ New GraphDef with TRTEngineOps placed in graph replacing calibration nodes.
+ Raises:
+ RuntimeError: if the returned status message is malformed.
+ """
+
+ def py2string(inp):
+ return inp
+
+ def py3string(inp):
+ return inp.decode("utf-8")
+
+ if _six.PY2:
+ to_string = py2string
+ else:
+ to_string = py3string
+
+ graph_str = calibration_graph_def.SerializeToString()
+ out = calib_convert(graph_str)
+ status = to_string(out[0])
+ output_graph_def_string = out[1]
+ del graph_str # Save some memory
+ if len(status) < 2:
+ raise _impl.UnknownError(None, None, status)
+ if status[:2] != "OK":
+ msg = status.split(";")
+ if len(msg) == 1:
+ raise RuntimeError("Status message is malformed {}".format(status))
+ # pylint: disable=protected-access
+ raise _impl._make_specific_exception(None, None, ";".join(msg[1:]),
+ int(msg[0]))
+ # pylint: enable=protected-access
+ output_graph_def = graph_pb2.GraphDef()
+ output_graph_def.ParseFromString(output_graph_def_string)
+ del output_graph_def_string # Save some memory
+ return output_graph_def
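The Python wrappers above and the SWIG layer later in this change exchange results through a plain status string rather than a TF_Status: the C++ side returns "OK;..." on success or "<numeric error code>;<message>" on failure, and the Python side converts failures into exceptions. Below is a minimal, self-contained sketch of that protocol; the helper name and the use of RuntimeError are illustrative only and not part of the module.

def _check_trt_status(status):
  # Status strings shorter than two characters cannot even hold "OK".
  if len(status) < 2:
    raise RuntimeError("Status message is malformed {}".format(status))
  if status[:2] == "OK":
    return
  parts = status.split(";")
  if len(parts) == 1:
    raise RuntimeError("Status message is malformed {}".format(status))
  code, message = int(parts[0]), ";".join(parts[1:])
  raise RuntimeError("TRT conversion failed (code {}): {}".format(code, message))


_check_trt_status("OK;All good!")  # success, returns silently
try:
  _check_trt_status("9;TensorRT is not enabled!")  # FAILED_PRECONDITION
except RuntimeError as err:
  print(err)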
diff --git a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc
index 3d5cc76c42..dc7c93f869 100644
--- a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc
+++ b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc
@@ -23,7 +23,7 @@ limitations under the License.
#if GOOGLE_CUDA
#if GOOGLE_TENSORRT
-#include "cuda_runtime_api.h"
+#include "cuda/include/cuda_runtime_api.h"
namespace tensorflow {
namespace tensorrt {
@@ -38,22 +38,18 @@ TRTInt8Calibrator::TRTInt8Calibrator(
done_(false),
dev_buffers_(dev_buffers),
calib_running_(false),
+ batch_is_set_(false),
engine_name_(engine_name) {}
-bool TRTInt8Calibrator::setBatch(
- const std::unordered_map<string, void*>& data) {
- // TODO(aaroey): make sure that in future PR:
- // 1. the mutex_lock is outside of the loop
- // 2. wait() is used instead of wait_for()
- // 3. done_ is to be protected by the mutex
- // 4. the first batch is not missed
- if (done_) return false;
- while (calib_running_.load(
- std::memory_order_acquire)) { // wait while calibration is running
- tensorflow::mutex_lock l(cond_mtx_);
- cond_.wait_for(l, std::chrono::milliseconds(50));
- if (done_) return false;
+bool TRTInt8Calibrator::setBatch(const std::unordered_map<string, void*>& data,
+ const cudaStream_t stream) {
+ tensorflow::mutex_lock lock(cond_mtx_);
+ while ((calib_running_ || batch_is_set_) &&
+ !done_) { // wait while calibration is running
+ cond_.wait(lock);
}
+ if (done_) return false;
+ CHECK(!calib_running_ && !batch_is_set_);
VLOG(1) << "Set Batch Waiting finished";
for (const auto it : data) {
auto devptr = dev_buffers_.find(it.first);
@@ -65,27 +61,32 @@ bool TRTInt8Calibrator::setBatch(
// TODO(aaroey): we should not use sync copy on default stream. Make sure
// stream->ThenMemcpy() is used in future PRs.
- auto status =
- cudaMemcpy(d.first, it.second, d.second, cudaMemcpyDeviceToDevice);
+ // TODO(sami,aaroey): Need to figure out a way to ensure synchronization
+  // between streams, perhaps using a tensor?
+ auto status = cudaMemcpyAsync(d.first, it.second, d.second,
+ cudaMemcpyDeviceToDevice, stream);
if (status != cudaSuccess) {
LOG(FATAL) << "cudaMemcpy " << engine_name_ << " for '" << it.first
<< "' failed with " << status;
}
}
- calib_running_.store(true, std::memory_order_release); // release builder
+
+  // TODO(Sami, aaroey): Find an alternative way!
+ cudaStreamSynchronize(
+ stream); // we have to wait for the stream before returning!
+ batch_is_set_ = true;
cond_.notify_all();
return true;
}
bool TRTInt8Calibrator::getBatch(void** bindings, const char** names,
int num_bindings) {
- calib_running_.store(false, std::memory_order_release); // wait for new batch
+ tensorflow::mutex_lock lock(cond_mtx_);
+ calib_running_ = false;
cond_.notify_all();
- while (!calib_running_.load(
- std::memory_order_acquire)) { // wait until new batch arrives
- tensorflow::mutex_lock l(cond_mtx_);
- cond_.wait_for(l, std::chrono::milliseconds(50));
- if (done_) return false;
+ while ((!batch_is_set_ && !done_)) { // wait until new batch arrives
+ cond_.wait(lock);
+
}
if (done_) {
return false;
@@ -100,6 +101,8 @@ bool TRTInt8Calibrator::getBatch(void** bindings, const char** names,
bindings[i] = it->second.first;
}
+ batch_is_set_ = false;
+ calib_running_ = true;
return true;
}
@@ -107,6 +110,12 @@ const void* TRTInt8Calibrator::readCalibrationCache(std::size_t& length) {
return nullptr;
}
+void TRTInt8Calibrator::setDone() {
+ tensorflow::mutex_lock lock(cond_mtx_);
+ done_ = true;
+ cond_.notify_all();
+}
+
void TRTInt8Calibrator::writeCalibrationCache(const void* ptr,
std::size_t length) {}
TRTInt8Calibrator::~TRTInt8Calibrator() {
@@ -115,5 +124,6 @@ TRTInt8Calibrator::~TRTInt8Calibrator() {
} // namespace tensorrt
} // namespace tensorflow
+
#endif
#endif
diff --git a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h
index 8830f7efe7..d77aa2c5ab 100644
--- a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h
+++ b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h
@@ -24,7 +24,10 @@ limitations under the License.
#if GOOGLE_CUDA
#if GOOGLE_TENSORRT
+
+#include "cuda/include/cuda_runtime_api.h"
#include "tensorrt/include/NvInfer.h"
+
namespace tensorflow {
namespace tensorrt {
// This class provides a 1 element queue to match TFs push model to
@@ -39,8 +42,9 @@ struct TRTInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator {
int getBatchSize() const override;
bool getBatch(void* bindings[], const char* names[],
int num_bindings) override;
- bool setBatch(const std::unordered_map<string, void*>& data);
- void setDone() { done_ = true; }
+ bool setBatch(const std::unordered_map<string, void*>& data,
+ const cudaStream_t stream);
+ void setDone();
const void* readCalibrationCache(std::size_t& length) override;
void writeCalibrationCache(const void* ptr, std::size_t length) override;
~TRTInt8Calibrator();
@@ -55,11 +59,14 @@ struct TRTInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator {
const std::unordered_map<string, std::pair<void*, size_t>>
dev_buffers_; // map to keep tensorrt input buffers and sizes keyed with
// buffer names
- std::atomic_bool calib_running_;
+ bool calib_running_;
+ bool batch_is_set_;
string engine_name_;
};
+
} // namespace tensorrt
} // namespace tensorflow
-#endif // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
+
#endif
#endif
+#endif // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
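The calibrator above implements a one-element hand-off between the TensorFlow op feeding batches (setBatch) and the TensorRT builder thread consuming them (getBatch), now using a mutex and condition variable instead of the earlier atomics-and-polling loop. The following is a rough Python sketch of the same synchronization pattern, for illustration only; it is not TensorFlow code.

import threading


class OneElementQueue(object):
  """Mirrors the calib_running_/batch_is_set_/done_ hand-off above."""

  def __init__(self):
    self._cond = threading.Condition()
    self._batch = None
    self._batch_is_set = False
    self._calib_running = False
    self._done = False

  def set_batch(self, batch):  # producer, like TRTInt8Calibrator::setBatch
    with self._cond:
      while (self._calib_running or self._batch_is_set) and not self._done:
        self._cond.wait()
      if self._done:
        return False
      self._batch = batch
      self._batch_is_set = True
      self._cond.notify_all()
      return True

  def get_batch(self):  # consumer, like TRTInt8Calibrator::getBatch
    with self._cond:
      self._calib_running = False
      self._cond.notify_all()
      while not self._batch_is_set and not self._done:
        self._cond.wait()
      if self._done:
        return None
      self._batch_is_set = False
      self._calib_running = True
      return self._batch

  def set_done(self):  # like TRTInt8Calibrator::setDone
    with self._cond:
      self._done = True
      self._cond.notify_all()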
diff --git a/tensorflow/contrib/tensorrt/test/test_tftrt.py b/tensorflow/contrib/tensorrt/test/test_tftrt.py
index c78f6f2224..ad01bedd8f 100644
--- a/tensorflow/contrib/tensorrt/test/test_tftrt.py
+++ b/tensorflow/contrib/tensorrt/test/test_tftrt.py
@@ -60,6 +60,7 @@ def get_simple_graph_def():
def run_graph(gdef, dumm_inp):
+ """Run given graphdef once."""
gpu_options = cpb2.GPUOptions(per_process_gpu_memory_fraction=0.50)
ops.reset_default_graph()
g = ops.Graph()
@@ -74,15 +75,65 @@ def run_graph(gdef, dumm_inp):
return val
+# Use real data that is representative of the inference dataset
+# for calibration. For this test script it is random data.
+def run_calibration(gdef, dumm_inp):
+ """Run given calibration graph multiple times."""
+ gpu_options = cpb2.GPUOptions(per_process_gpu_memory_fraction=0.50)
+ ops.reset_default_graph()
+ g = ops.Graph()
+ with g.as_default():
+ inp, out = importer.import_graph_def(
+ graph_def=gdef, return_elements=["input", "output"])
+ inp = inp.outputs[0]
+ out = out.outputs[0]
+ with csess.Session(
+ config=cpb2.ConfigProto(gpu_options=gpu_options), graph=g) as sess:
+      # Run over real calibration data here; we are mimicking a calibration set
+      # of 30 different batches. Use as much calibration data as you want.
+ for _ in range(30):
+ val = sess.run(out, {inp: dumm_inp})
+ return val
+
+
if "__main__" in __name__:
inp_dims = (100, 24, 24, 2)
dummy_input = np.random.random_sample(inp_dims)
- gdef = get_simple_graph_def()
+ orig_graph = get_simple_graph_def() # use a frozen graph for inference
# Get optimized graph
- trt_graph = trt.create_inference_graph(gdef, ["output"], inp_dims[0])
- o1 = run_graph(gdef, dummy_input)
+ trt_graph = trt.create_inference_graph(
+ input_graph_def=orig_graph,
+ outputs=["output"],
+ max_batch_size=inp_dims[0],
+ max_workspace_size_bytes=1 << 25,
+ precision_mode="FP32", # TRT Engine precision "FP32","FP16" or "INT8"
+ minimum_segment_size=2 # minimum number of nodes in an engine
+ )
+ o1 = run_graph(orig_graph, dummy_input)
o2 = run_graph(trt_graph, dummy_input)
o3 = run_graph(trt_graph, dummy_input)
assert np.array_equal(o1, o2)
assert np.array_equal(o3, o2) # sanity check
+ fp16_graph = trt.create_inference_graph(
+ input_graph_def=orig_graph,
+ outputs=["output"],
+ max_batch_size=inp_dims[0],
+ max_workspace_size_bytes=1 << 25,
+ precision_mode="FP16", # TRT Engine precision "FP32","FP16" or "INT8"
+ minimum_segment_size=2 # minimum number of nodes in an engine
+ )
+ int8_calib_gdef = trt.create_inference_graph(
+ input_graph_def=orig_graph,
+ outputs=["output"],
+ max_batch_size=inp_dims[0],
+ max_workspace_size_bytes=1 << 25,
+ precision_mode="INT8", # TRT Engine precision "FP32","FP16" or "INT8"
+ minimum_segment_size=2 # minimum number of nodes in an engine
+ )
+ o4 = run_graph(fp16_graph, dummy_input)
+ _ = run_calibration(int8_calib_gdef, dummy_input)
+ int8_graph = trt.calib_graph_to_infer_graph(int8_calib_gdef)
+ o5 = run_graph(int8_graph, dummy_input)
+ assert np.allclose(o1, o4)
+ assert np.allclose(o1, o5)
print("Pass")
diff --git a/tensorflow/contrib/tensorrt/trt_conversion.i b/tensorflow/contrib/tensorrt/trt_conversion.i
index d679945d56..46480e99a1 100644
--- a/tensorflow/contrib/tensorrt/trt_conversion.i
+++ b/tensorflow/contrib/tensorrt/trt_conversion.i
@@ -64,13 +64,17 @@ PyObject* pair_helper(std::pair<string, string>* in) {
%ignoreall
%unignore tensorflow;
%unignore trt_convert;
+%unignore calib_convert;
%{
+
std::pair<string, string> trt_convert(
string graph_def_string, // The serialized GraphDef string.
std::vector<string> output_names,
size_t max_batch_size,
- size_t max_workspace_size_bytes
+ size_t max_workspace_size_bytes,
+ int precision_mode,
+ int minimum_segment_size
// Unfortunately we can't use TF_Status here since it
// is in c/c_api and brings in a lot of other libraries
// which in turn declare ops. These ops are included
@@ -90,16 +94,64 @@ std::pair<string, string> trt_convert(
return std::pair<string, string>{out_status, ""};
}
+  if (precision_mode < 0 || precision_mode > 2) {
+ out_status = "InvalidArgument;Invalid precision_mode";
+ return std::pair<string, string>{out_status, ""};
+ }
if (!output_names.size()) {
out_status = "InvalidArgument;Size of the output_names vector is 0";
return std::pair<string, string>{out_status, ""};
- // return "";
}
tensorflow::GraphDef outGraph;
tensorflow::Status conversion_status =
tensorflow::tensorrt::convert::ConvertGraphDefToTensorRT(
graph_def, output_names, max_batch_size, max_workspace_size_bytes,
- &outGraph);
+ &outGraph, precision_mode, minimum_segment_size);
+ if (!conversion_status.ok()) {
+ auto retCode = (int)conversion_status.code();
+ char buff[2000];
+ snprintf(buff, 2000, "%d;%s", retCode,
+ conversion_status.error_message().c_str());
+ out_status = buff;
+ return std::pair<string, string>{out_status, ""};
+ }
+ string result;
+ if (!outGraph.SerializeToString(&result)) {
+ out_status = "InvalidArgument;Couldn't serialize output as a GraphDef";
+ return std::pair<string, string>{out_status, ""};
+ }
+ out_status = "OK;All good!";
+ return std::pair<string, string>{out_status, result};
+#else
+ // Returns FAILED_PRECONDITION.
+ return std::pair<string, string>{"9;TensorRT is not enabled!", ""};
+#endif // GOOGLE_CUDA && GOOGLE_TENSORRT
+}
+
+std::pair<string, string> calib_convert(string graph_def_string // const tensorflow::GraphDef&
+ // unfortunately we can't use TF_Status here since it
+ // is in c/c_api and brings in a lot of other libraries
+ // which in turn declare ops. These ops are included
+ // statically in our library and cause an abort when
+ // module is loaded due to double registration
+ // until Tensorflow properly exposes these headers
+ // we have to work around this by returning a string
+ // and converting it to exception on python side.
+ //,TF_Status* out_status) {
+) {
+#if GOOGLE_CUDA && GOOGLE_TENSORRT
+ string out_status;
+
+ tensorflow::GraphDef graph_def;
+ if (!graph_def.ParseFromString(graph_def_string)) {
+ out_status = "InvalidArgument;Couldn't interpret input as a GraphDef";
+ return std::pair<string, string>{out_status, ""};
+ }
+
+ tensorflow::GraphDef outGraph;
+ tensorflow::Status conversion_status =
+ tensorflow::tensorrt::convert::ConvertCalibGraphToInferGraph(graph_def,
+ &outGraph);
if (!conversion_status.ok()) {
auto retCode = (int)conversion_status.code();
char buff[2000];
@@ -122,10 +174,13 @@ std::pair<string, string> trt_convert(
}
%}
+std::pair<string, string> calib_convert(string graph_def_string);
+
std::pair<string, string> trt_convert(string graph_def_string,
std::vector<string> output_names,
size_t max_batch_size,
- size_t max_workspace_size_bytes);
+ size_t max_workspace_size_bytes,
+ int precision_mode, int minimum_segment_size);
%unignoreall
diff --git a/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc b/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc
index cc32a26528..72d37f774c 100644
--- a/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc
+++ b/tensorflow/contrib/tpu/ops/tpu_embedding_ops.cc
@@ -50,7 +50,7 @@ namespace tensorflow {
// TPU Embeddings use dedicated ops to enforce Host/TPU consistency in the
// state of embedding table variables. Before beginning training or inference,
// the model must Load the optimizer parameters into the TPU memories. Before
-// saving a checkpoint, the model must Retreieve the parameters back into the
+// saving a checkpoint, the model must Retrieve the parameters back into the
// host CPU memory.
REGISTER_OP("TPUEmbeddingLoadGradientDescentParameters")
@@ -263,7 +263,7 @@ REGISTER_OP("TPUEmbeddingReceiveActivations")
.SetIsStateful()
.SetShapeFn(tpu_embedding_config_util::ActivationShapes)
.Doc(R"doc(
-An op that receives embeddng activations on the TPU.
+An op that receives embedding activations on the TPU.
The TPU system performs the embedding lookups and aggregations specified by
the arguments to TPUEmbeddingEnqueueSparseBatch. The results of these
@@ -293,7 +293,7 @@ REGISTER_OP("TPUEmbeddingActivations")
An op enabling differentiation of TPU Embeddings.
This op simply returns its first input, which is assumed to have been sliced
-from the Tensors returnd by TPUEmbeddingDequeueActivations. The presence of this
+from the Tensors returned by TPUEmbeddingDequeueActivations. The presence of this
op, and its first argument being a trainable Variable, enables automatic
differentiation of graphs containing embeddings via the TPU Embedding Python
libraries.
diff --git a/tensorflow/contrib/tpu/python/tpu/device_assignment.py b/tensorflow/contrib/tpu/python/tpu/device_assignment.py
index bdd9b88af5..726b2d248e 100644
--- a/tensorflow/contrib/tpu/python/tpu/device_assignment.py
+++ b/tensorflow/contrib/tpu/python/tpu/device_assignment.py
@@ -191,9 +191,9 @@ class DeviceAssignment(object):
logical_core: A tuple of three integers which represents a logical core.
Returns:
A sorted list of the replicas that are attached to that task and
- loical_core.
+ logical_core.
Raises:
- ValueError: If no replica exisis in the task which contains the logical
+ ValueError: If no replica exists in the task which contains the logical
core.
"""
try:
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_config.py b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
index 009326e3d0..38b5ea2310 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_config.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
@@ -161,7 +161,7 @@ class RunConfig(run_config_lib.RunConfig):
self._tpu_config = tpu_config or TPUConfig()
self._cluster = cluster
- # If user sets master and/or evaluation_master explicilty, including empty
+ # If user sets master and/or evaluation_master explicitly, including empty
# string '', take it. Otherwise, take the values set by parent class.
if master is not None:
if cluster is not None:
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_context.py b/tensorflow/contrib/tpu/python/tpu/tpu_context.py
index c5c46ea741..3bac2db77e 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_context.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_context.py
@@ -39,7 +39,7 @@ class _TPUContext(object):
This immutable object holds TPUEstimator config, train/eval batch size, and
`TPUEstimator.use_tpu`, which is expected to be passed around. It also
- provides utility functions, basded on the current state, to determine other
+ provides utility functions, based on the current state, to determine other
information commonly required by TPU computation, such as TPU device names,
TPU hosts, shard batch size, etc.
@@ -218,7 +218,7 @@ class _TPUContext(object):
model, when mode == PREDICT. Only with this bool, we could
tell whether user is calling the Estimator.predict or
Estimator.export_savedmodel, which are running on TPU and CPU
- respectively. Parent class Estimator does not distingush these two.
+ respectively. Parent class Estimator does not distinguish these two.
Returns:
bool, whether current input_fn or model_fn should be running on CPU.
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
index f61f6bb52e..4354735744 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
@@ -137,7 +137,7 @@ def _increase_eval_step_op(iterations_per_loop):
"""Returns an op to increase the eval step for TPU evaluation.
Args:
- iterations_per_loop: Tensor. The number of eval steps runnining in TPU
+ iterations_per_loop: Tensor. The number of eval steps running in TPU
system before returning to CPU host for each `Session.run`.
Returns:
@@ -609,17 +609,17 @@ class _StoppingPredictHook(session_run_hook.SessionRunHook):
# batch. And we append one more batch to signal the system it should stop.
# The data flow might look like
#
- # batch 0: images, labels, stop = 0 (user provideded)
- # batch 1: images, labels, stop = 0 (user provideded)
+ # batch 0: images, labels, stop = 0 (user provided)
+ # batch 1: images, labels, stop = 0 (user provided)
# ...
- # batch 99: images, labels, stop = 0 (user provideded)
+ # batch 99: images, labels, stop = 0 (user provided)
# batch 100: images, labels, stop = 1 (TPUEstimator appended)
#
# where the final batch (id = 100) is appended by TPUEstimator, so we
# should drop it before returning the predictions to user.
# To achieve that, we throw the OutOfRangeError in after_run. Once
# Monitored Session sees this error in SessionRunHook.after_run, the
- # "current" prediciton, i.e., batch with id=100, will be discarded
+ # "current" prediction, i.e., batch with id=100, will be discarded
# immediately
raise errors.OutOfRangeError(None, None, 'Stopped by stopping signal.')
@@ -758,7 +758,7 @@ class _InputPipeline(object):
2. (features, labels)
Internally, form 1 is reformed to `(features, None)` as features and labels
- are passed separatedly to underlying methods. For TPU training, TPUEstimator
+ are passed separately to underlying methods. For TPU training, TPUEstimator
may expect multiple `features` and `labels` tuples one for each core.
TPUEstimator allows various different structures for inputs (namely `features`
diff --git a/tensorflow/contrib/tpu/python/tpu/training_loop.py b/tensorflow/contrib/tpu/python/tpu/training_loop.py
index 3d7896127a..82a75d0255 100644
--- a/tensorflow/contrib/tpu/python/tpu/training_loop.py
+++ b/tensorflow/contrib/tpu/python/tpu/training_loop.py
@@ -170,7 +170,7 @@ def while_loop(condition, body, inputs=None, infeed_queue=None, name=None):
def repeat(n, body, inputs=None, infeed_queue=None, name=None):
- """Builds a training loop that executes a fixed number of interations.
+ """Builds a training loop that executes a fixed number of iterations.
The set of loop-carried tensors correspond to `inputs`.
`body` must be a function that takes and returns the values of the
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index 42d222ff6b..a14eeed1a5 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -3338,6 +3338,10 @@ tf_cc_test(
size = "small",
srcs = ["common_runtime/function_test.cc"],
linkstatic = tf_kernel_tests_linkstatic(),
+ tags = [
+ "manual",
+ "no_oss",
+ ],
deps = [
":core",
":core_cpu",
diff --git a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt
index 51d63eeb56..7be9a958ab 100644
--- a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEig.pbtxt
@@ -19,6 +19,7 @@ form square matrices, with the same constraints as the single matrix
SelfAdjointEig.
The result is a [..., M+1, M] matrix with [..., 0,:] containing the
-eigenvalues, and subsequent [...,1:, :] containing the eigenvectors.
+eigenvalues, and subsequent [...,1:, :] containing the eigenvectors. The eigenvalues
+are sorted in non-decreasing order.
END
}
diff --git a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt
index 4a5e125258..fae9e84fc8 100644
--- a/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt
+++ b/tensorflow/core/api_def/base_api/api_def_SelfAdjointEigV2.pbtxt
@@ -31,7 +31,8 @@ END
summary: "Computes the eigen decomposition of one or more square self-adjoint matrices."
description: <<END
Computes the eigenvalues and (optionally) eigenvectors of each inner matrix in
-`input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`.
+`input` such that `input[..., :, :] = v[..., :, :] * diag(e[..., :])`. The eigenvalues
+are sorted in non-decreasing order.
```python
# a is a tensor.
diff --git a/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt
new file mode 100644
index 0000000000..9fabe7863e
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_SlideDataset.pbtxt
@@ -0,0 +1,18 @@
+op {
+ graph_op_name: "SlideDataset"
+ in_arg {
+ name: "window_size"
+ description: <<END
+A scalar representing the number of elements in the
+sliding window.
+END
+ }
+ in_arg {
+ name: "stride"
+ description: <<END
+A scalar representing the number of elements by which the sliding window
+moves forward in each iteration. It must be in `[1, window_size)`.
+END
+ }
+ summary: "Creates a dataset that passes a sliding window over `input_dataset`."
+}
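A short pure-Python sketch of the sliding-window semantics described above: each output window contains window_size consecutive input elements, the window advances by stride elements per iteration (with 1 <= stride < window_size), and a trailing window shorter than window_size is dropped. This illustrates the semantics only; it is not the kernel implementation.

def sliding_windows(elements, window_size, stride):
  windows = []
  start = 0
  # Windows with fewer than window_size elements at the end are dropped.
  while start + window_size <= len(elements):
    windows.append(elements[start:start + window_size])
    start += stride
  return windows


print(sliding_windows(list(range(7)), window_size=3, stride=2))
# [[0, 1, 2], [2, 3, 4], [4, 5, 6]]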
diff --git a/tensorflow/core/distributed_runtime/tensor_coding.cc b/tensorflow/core/distributed_runtime/tensor_coding.cc
index 34a4013547..fe2d1a1293 100644
--- a/tensorflow/core/distributed_runtime/tensor_coding.cc
+++ b/tensorflow/core/distributed_runtime/tensor_coding.cc
@@ -81,7 +81,7 @@ void TensorResponse::InitPartial(const RecvTensorResponse& response) {
Status TensorResponse::ParseFrom(Source* source) {
if (!on_host_) {
protobuf::io::CodedInputStream input(source->contents());
- input.SetTotalBytesLimit(INT_MAX); // Unlimited
+ input.SetTotalBytesLimit(INT_MAX, INT_MAX); // Unlimited
// Pre-parse into local storage, then delegate to device.
if (!meta_.ParseFromCodedStream(&input) || !input.ConsumedEntireMessage()) {
@@ -217,7 +217,7 @@ bool TensorResponse::ParseTensorSubmessage(
bool TensorResponse::ParseFast(Source* source) {
protobuf::io::CodedInputStream input(source->contents());
- input.SetTotalBytesLimit(INT_MAX); // Unlimited
+ input.SetTotalBytesLimit(INT_MAX, INT_MAX); // Unlimited
while (true) {
auto p = input.ReadTagWithCutoff(127);
int tag = GetTagFieldNumber(p.first);
diff --git a/tensorflow/core/graph/mkl_layout_pass.cc b/tensorflow/core/graph/mkl_layout_pass.cc
index 02038c5d77..1507b6eae2 100644
--- a/tensorflow/core/graph/mkl_layout_pass.cc
+++ b/tensorflow/core/graph/mkl_layout_pass.cc
@@ -2492,10 +2492,10 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
mkl_op_registry::GetMklOpName(csinfo_.identity),
CopyAttrsDataType, AlwaysRewrite});
rinfo_.push_back({csinfo_.lrn, mkl_op_registry::GetMklOpName(csinfo_.lrn),
- CopyAttrsLRN, AlwaysRewrite});
+ CopyAttrsLRN, LrnRewrite});
rinfo_.push_back({csinfo_.lrn_grad,
mkl_op_registry::GetMklOpName(csinfo_.lrn_grad),
- CopyAttrsLRN, AlwaysRewrite});
+ CopyAttrsLRN, LrnRewrite});
rinfo_.push_back({csinfo_.max_pool,
mkl_op_registry::GetMklOpName(csinfo_.max_pool),
CopyAttrsPooling, NonDepthBatchWisePoolRewrite});
@@ -2865,6 +2865,28 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
return false;
}
+  // If the depth_radius of LRN is not 2, then MKL DNN takes the unoptimized
+  // path. The unoptimized path is slow, so we don't rewrite the node and
+  // instead use the default Eigen implementation. For depth_radius=2, the
+  // MKL DNN optimized path is taken, i.e., the Eigen node is rewritten as an
+  // MKL DNN node.
+ static bool LrnRewrite(const Node* n) {
+ CHECK_NOTNULL(n);
+
+ int depth_radius;
+ CHECK_EQ(GetNodeAttr(n->def(), "depth_radius", &depth_radius).ok(), true);
+
+    // If the depth_radius of LRN is not 2, don't rewrite the node with MKL
+    // DNN; use the Eigen node instead.
+ if (depth_radius == 2) {
+ return true;
+ }
+ VLOG(1) << "LrnRewrite: The model sets depth_radius as not 2 which"
+ << "case is not optimized by Intel MKL, thus using Eigen op"
+ << "for LRN " ;
+
+ return false;
+ }
+
static bool AddNRewrite(const Node* n) {
CHECK_NOTNULL(n);
@@ -3528,11 +3550,13 @@ void MklLayoutRewritePass::CopyAttrsConv2D(const Node* orig_node,
string data_format;
string padding;
std::vector<int32> strides;
+ std::vector<int32> dilations;
bool use_cudnn_on_gpu;
// Get all attributes from old node.
TF_CHECK_OK(GetNodeAttr(orig_node->def(), "T", &T));
TF_CHECK_OK(GetNodeAttr(orig_node->def(), "strides", &strides));
+ TF_CHECK_OK(GetNodeAttr(orig_node->def(), "dilations", &dilations));
TF_CHECK_OK(GetNodeAttr(orig_node->def(), "padding", &padding));
TF_CHECK_OK(GetNodeAttr(orig_node->def(), "data_format", &data_format));
TF_CHECK_OK(
@@ -3541,6 +3565,7 @@ void MklLayoutRewritePass::CopyAttrsConv2D(const Node* orig_node,
// Add attributes to new node.
nb->Attr("T", T);
nb->Attr("strides", strides);
+ nb->Attr("dilations", dilations);
nb->Attr("padding", padding);
nb->Attr("data_format", data_format);
nb->Attr("use_cudnn_on_gpu", use_cudnn_on_gpu);
@@ -3778,12 +3803,14 @@ Status MklLayoutRewritePass::MergeConv2DWithBiasAdd(std::unique_ptr<Graph>* g,
DataType T_pred, T_succ;
string padding;
std::vector<int32> strides;
+ std::vector<int32> dilations;
string data_format_pred, data_format_succ;
bool use_cudnn_on_gnu;
TF_CHECK_OK(GetNodeAttr(pred->def(), "T", &T_pred));
TF_CHECK_OK(GetNodeAttr(succ->def(), "T", &T_succ));
TF_CHECK_OK(GetNodeAttr(pred->def(), "padding", &padding));
TF_CHECK_OK(GetNodeAttr(pred->def(), "strides", &strides));
+ TF_CHECK_OK(GetNodeAttr(pred->def(), "dilations", &dilations));
TF_CHECK_OK(GetNodeAttr(pred->def(), "data_format", &data_format_pred));
TF_CHECK_OK(GetNodeAttr(succ->def(), "data_format", &data_format_succ));
TF_CHECK_OK(GetNodeAttr(pred->def(), "use_cudnn_on_gpu", &use_cudnn_on_gnu));
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer.cc b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
index f78036d78c..bd0d94b83f 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
@@ -25,6 +25,7 @@ limitations under the License.
#include "tensorflow/core/framework/attr_value.pb.h"
#include "tensorflow/core/framework/node_def.pb.h"
#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
#include "tensorflow/core/framework/types.h"
#include "tensorflow/core/grappler/costs/graph_properties.h"
#include "tensorflow/core/grappler/grappler_item.h"
@@ -413,7 +414,7 @@ Status LoopOptimizer::LoopInvariantNodeMotion() {
frame_children_[frame_ids[0]].insert(frame_ids[1]);
frame_parent_[frame_ids.back()] = frame_ids[frame_ids.size() - 2];
}
- if (!frame_ids.empty()) {
+ if (frame_ids.size() >= 1) {
frame_children_.insert(std::make_pair(frame_ids.back(), empty_set_));
if (node->op() == "LoopCond") {
if (loop_cond_.count(frame_ids.back())) {
@@ -432,7 +433,7 @@ Status LoopOptimizer::LoopInvariantNodeMotion() {
}
for (auto it = frame_children_.begin(); it != frame_children_.end(); ++it) {
- if (it->second.empty()) {
+ if (it->second.size() == 0) {
worklist.push_back(it->first);
}
}
@@ -445,7 +446,7 @@ Status LoopOptimizer::LoopInvariantNodeMotion() {
if (parent_it != frame_parent_.end()) {
int parent_id = parent_it->second;
frame_children_[parent_id].erase(frame_id);
- if (frame_children_[parent_id].empty()) {
+ if (frame_children_[parent_id].size() == 0) {
worklist.push_back(parent_id);
}
}
@@ -468,6 +469,7 @@ Status LoopOptimizer::LoopInvariantNodeMotion() {
Status LoopOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
GraphDef* optimized_graph) {
+
TF_RETURN_IF_ERROR(RemoveStackOps(item, optimized_graph));
if (opt_level_ == RewriterConfig::AGGRESSIVE) {
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index 48d5955ad1..2e39f25fc1 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -5157,7 +5157,6 @@ tf_kernel_library(
],
hdrs = [
"meta_support.h",
- "quantization_utils.h",
"reference_gemm.h",
],
deps = [
@@ -5232,6 +5231,7 @@ tf_cc_test(
name = "quantization_utils_test",
srcs = ["quantization_utils_test.cc"],
deps = [
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:core_cpu",
@@ -5294,6 +5294,7 @@ tf_cc_test(
deps = [
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:framework",
@@ -5355,6 +5356,7 @@ tf_cc_test(
":math",
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/cc:cc_ops",
"//tensorflow/cc:client_session",
@@ -5377,6 +5379,7 @@ tf_cc_test(
deps = [
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/cc:cc_ops",
"//tensorflow/cc:client_session",
@@ -5441,6 +5444,7 @@ tf_cc_test(
deps = [
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:framework",
@@ -5461,6 +5465,7 @@ tf_cc_test(
deps = [
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:framework",
@@ -5500,6 +5505,7 @@ tf_cc_test(
deps = [
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:framework",
@@ -5556,6 +5562,7 @@ tf_cc_test(
":math",
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/cc:cc_ops",
"//tensorflow/cc:client_session",
@@ -5578,6 +5585,7 @@ tf_cc_test(
deps = [
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:framework",
@@ -5614,6 +5622,7 @@ tf_cc_test(
deps = [
":ops_testutil",
":ops_util",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:core_cpu",
@@ -5635,6 +5644,7 @@ tf_cc_test(
deps = [
":batch_norm_op",
":ops_testutil",
+ ":quantization_utils",
":quantized_ops",
"//tensorflow/core:array_ops_op_lib",
"//tensorflow/core:core_cpu_internal",
diff --git a/tensorflow/core/kernels/concat_op.cc b/tensorflow/core/kernels/concat_op.cc
index 7011550f7e..f16766315f 100644
--- a/tensorflow/core/kernels/concat_op.cc
+++ b/tensorflow/core/kernels/concat_op.cc
@@ -18,7 +18,6 @@ limitations under the License.
#include <limits>
#include <vector>
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
@@ -28,6 +27,7 @@ limitations under the License.
#include "tensorflow/core/kernels/concat_lib.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/platform/types.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
namespace tensorflow {
@@ -53,17 +53,38 @@ class ConcatBaseOp : public OpKernel {
void Compute(OpKernelContext* c) override {
const Tensor* concat_dim_tensor;
const char* axis_attribute_name =
- AxisArgName == NAME_IS_AXIS
- ? "axis"
- : AxisArgName == NAME_IS_CONCAT_DIM ? "concat_dim" : "<invalid>";
+ AxisArgName == NAME_IS_AXIS ? "axis" : AxisArgName == NAME_IS_CONCAT_DIM
+ ? "concat_dim"
+ : "<invalid>";
OP_REQUIRES_OK(c, c->input(axis_attribute_name, &concat_dim_tensor));
OP_REQUIRES(c, IsLegacyScalar(concat_dim_tensor->shape()),
errors::InvalidArgument(
axis_attribute_name,
" tensor should be a scalar integer, but got shape ",
concat_dim_tensor->shape().DebugString()));
- const int32 concat_dim =
- internal::SubtleMustCopy(concat_dim_tensor->scalar<int32>()());
+ int64 concat_dim;
+ // In case of ConcatV2, "axis" could be int32 or int64
+ if (AxisArgName == NAME_IS_AXIS) {
+ OP_REQUIRES(
+ c, (concat_dim_tensor->dtype() == DT_INT32 ||
+ concat_dim_tensor->dtype() == DT_INT64),
+ errors::InvalidArgument(axis_attribute_name,
+ " tensor should be int32 or int64, but got ",
+ concat_dim_tensor->dtype()));
+ } else {
+ OP_REQUIRES(c, (concat_dim_tensor->dtype() == DT_INT32),
+ errors::InvalidArgument(axis_attribute_name,
+ " tensor should be int32, but got ",
+ concat_dim_tensor->dtype()));
+ }
+ if (concat_dim_tensor->dtype() == DT_INT32) {
+ concat_dim =
+ internal::SubtleMustCopy(concat_dim_tensor->scalar<int32>()());
+ } else {
+ concat_dim =
+ internal::SubtleMustCopy(concat_dim_tensor->scalar<int64>()());
+ }
+
OpInputList values;
OP_REQUIRES_OK(c, c->input_list("values", &values));
const int N = values.size();
@@ -154,17 +175,16 @@ using ConcatOp = ConcatBaseOp<Device, T, NAME_IS_CONCAT_DIM>;
template <typename Device, typename T>
using ConcatV2Op = ConcatBaseOp<Device, T, NAME_IS_AXIS>;
-#define REGISTER_CONCAT(type) \
- REGISTER_KERNEL_BUILDER(Name("Concat") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .HostMemory("concat_dim"), \
- ConcatOp<CPUDevice, type>) \
- REGISTER_KERNEL_BUILDER(Name("ConcatV2") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("axis"), \
+#define REGISTER_CONCAT(type) \
+ REGISTER_KERNEL_BUILDER(Name("Concat") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .HostMemory("concat_dim"), \
+ ConcatOp<CPUDevice, type>) \
+ REGISTER_KERNEL_BUILDER(Name("ConcatV2") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .HostMemory("axis"), \
ConcatV2Op<CPUDevice, type>)
TF_CALL_POD_STRING_TYPES(REGISTER_CONCAT);
@@ -178,17 +198,16 @@ REGISTER_CONCAT(qint32);
#if GOOGLE_CUDA
-#define REGISTER_GPU(type) \
- REGISTER_KERNEL_BUILDER(Name("Concat") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .HostMemory("concat_dim"), \
- ConcatOp<GPUDevice, type>) \
- REGISTER_KERNEL_BUILDER(Name("ConcatV2") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("axis"), \
+#define REGISTER_GPU(type) \
+ REGISTER_KERNEL_BUILDER(Name("Concat") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .HostMemory("concat_dim"), \
+ ConcatOp<GPUDevice, type>) \
+ REGISTER_KERNEL_BUILDER(Name("ConcatV2") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .HostMemory("axis"), \
ConcatV2Op<GPUDevice, type>)
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU);
@@ -212,7 +231,6 @@ REGISTER_KERNEL_BUILDER(Name("Concat")
REGISTER_KERNEL_BUILDER(Name("ConcatV2")
.Device(DEVICE_GPU)
.TypeConstraint<int32>("T")
- .TypeConstraint<int32>("Tidx")
.HostMemory("values")
.HostMemory("axis")
.HostMemory("output"),
@@ -221,17 +239,16 @@ REGISTER_KERNEL_BUILDER(Name("ConcatV2")
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL(type) \
- REGISTER_KERNEL_BUILDER(Name("Concat") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .HostMemory("concat_dim"), \
- ConcatOp<SYCLDevice, type>) \
- REGISTER_KERNEL_BUILDER(Name("ConcatV2") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("axis"), \
+#define REGISTER_SYCL(type) \
+ REGISTER_KERNEL_BUILDER(Name("Concat") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .HostMemory("concat_dim"), \
+ ConcatOp<SYCLDevice, type>) \
+ REGISTER_KERNEL_BUILDER(Name("ConcatV2") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .HostMemory("axis"), \
ConcatV2Op<SYCLDevice, type>)
TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
@@ -246,7 +263,6 @@ REGISTER_KERNEL_BUILDER(Name("Concat")
REGISTER_KERNEL_BUILDER(Name("ConcatV2")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
- .TypeConstraint<int32>("Tidx")
.HostMemory("values")
.HostMemory("axis")
.HostMemory("output"),
diff --git a/tensorflow/core/kernels/conv_ops_test.cc b/tensorflow/core/kernels/conv_ops_test.cc
index 666bca265c..e2e166c02f 100644
--- a/tensorflow/core/kernels/conv_ops_test.cc
+++ b/tensorflow/core/kernels/conv_ops_test.cc
@@ -401,7 +401,7 @@ class ConvOpTest : public OpsTestBase {
// (1*0)+(4*5)+(7*6)+(2*0)+(5*9)+(8*10)+(3*0)+(6*0)+(9*0)=187
// (1*5)+(4*6)+(7*7)+(2*9)+(5*10)+(8*11)+(3*0)+(6*0)+(9*0)=234
// (1*6)+(4*7)+(7*8)+(2*10)+(5*11)+(8*12)+(3*0)+(6*0)+(9*0)=261
- // (1*7)+(4*11)+(7*0)+(2*8)+(5*12)+(8*0)+(3*0)+(6*0)+(9*0)=121
+ // (1*7)+(4*8)+(7*0)+(2*11)+(5*12)+(8*0)+(3*0)+(6*0)+(9*0)=121
// This means we should end up with this matrix:
// | 105 | 150 | 183 | 95 |
// | 235 | 312 | 357 | 178 |
diff --git a/tensorflow/core/kernels/data/BUILD b/tensorflow/core/kernels/data/BUILD
index 484d4f88d6..01754ec21a 100644
--- a/tensorflow/core/kernels/data/BUILD
+++ b/tensorflow/core/kernels/data/BUILD
@@ -114,6 +114,19 @@ tf_kernel_library(
)
tf_kernel_library(
+ name = "slide_dataset_op",
+ srcs = ["slide_dataset_op.cc"],
+ deps = [
+ ":dataset",
+ "//tensorflow/core:dataset_ops_op_lib",
+ "//tensorflow/core:framework",
+ "//tensorflow/core:lib",
+ "//tensorflow/core:lib_internal",
+ "//tensorflow/core/kernels:batch_util",
+ ],
+)
+
+tf_kernel_library(
name = "padded_batch_dataset_op",
srcs = ["padded_batch_dataset_op.cc"],
deps = [
@@ -538,6 +551,7 @@ tf_kernel_library(
":scan_dataset_op",
":shuffle_dataset_op",
":skip_dataset_op",
+ ":slide_dataset_op",
":sparse_tensor_slice_dataset_op",
":sql_dataset_ops",
":stats_aggregator_ops",
diff --git a/tensorflow/core/kernels/data/slide_dataset_op.cc b/tensorflow/core/kernels/data/slide_dataset_op.cc
new file mode 100644
index 0000000000..4f3537b691
--- /dev/null
+++ b/tensorflow/core/kernels/data/slide_dataset_op.cc
@@ -0,0 +1,252 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/framework/partial_tensor_shape.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/kernels/batch_util.h"
+#include "tensorflow/core/kernels/data/dataset.h"
+
+namespace tensorflow {
+
+namespace {
+
+// See documentation in ../ops/dataset_ops.cc for a high-level
+// description of the following op.
+
+class SlideDatasetOp : public UnaryDatasetOpKernel {
+ public:
+ explicit SlideDatasetOp(OpKernelConstruction* ctx)
+ : UnaryDatasetOpKernel(ctx) {}
+
+ void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+ DatasetBase** output) override {
+ int64 window_size = 0;
+ int64 stride = 1;
+ OP_REQUIRES_OK(ctx,
+ ParseScalarArgument<int64>(ctx, "window_size", &window_size));
+ OP_REQUIRES_OK(ctx,
+ ParseScalarArgument<int64>(ctx, "stride", &stride));
+ OP_REQUIRES(
+ ctx, window_size > 0,
+ errors::InvalidArgument("Window size must be greater than zero."));
+ OP_REQUIRES(
+ ctx, stride > 0 && stride < window_size,
+ errors::InvalidArgument("Stride must be in [1, window_size)."));
+
+ *output = new Dataset(ctx, window_size, stride, input);
+ }
+
+ private:
+ class Dataset : public GraphDatasetBase {
+ public:
+ Dataset(OpKernelContext* ctx, int64 window_size, int64 stride, const DatasetBase* input)
+ : GraphDatasetBase(ctx), window_size_(window_size), stride_(stride), input_(input) {
+ input_->Ref();
+
+ const auto& input_shapes = input_->output_shapes();
+ output_shapes_.reserve(input_shapes.size());
+ for (const auto& input_shape : input_shapes) {
+ output_shapes_.emplace_back(
+ PartialTensorShape({-1}).Concatenate(input_shape));
+ }
+ }
+
+ ~Dataset() override { input_->Unref(); }
+
+ std::unique_ptr<IteratorBase> MakeIterator(
+ const string& prefix) const override {
+ return std::unique_ptr<IteratorBase>(new Iterator(
+ Iterator::Params{this, strings::StrCat(prefix, "::Slide")}));
+ }
+
+ const DataTypeVector& output_dtypes() const override {
+ return input_->output_dtypes();
+ }
+
+ const std::vector<PartialTensorShape>& output_shapes() const override {
+ return output_shapes_;
+ }
+
+ string DebugString() override {
+ return strings::StrCat("SlideDatasetOp(", window_size_, ", ", stride_, ")::Dataset");
+ }
+
+ protected:
+ Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+ Node** output) const override {
+ Node* input_graph_node = nullptr;
+ TF_RETURN_IF_ERROR(b->AddParentDataset(ctx, input_, &input_graph_node));
+ Node* window_size = nullptr;
+ Node* stride = nullptr;
+ TF_RETURN_IF_ERROR(b->AddScalar(window_size_, &window_size));
+ TF_RETURN_IF_ERROR(b->AddScalar(stride_, &stride));
+ TF_RETURN_IF_ERROR(
+ b->AddDataset(this, {input_graph_node, window_size, stride}, output));
+ return Status::OK();
+ }
+
+ private:
+
+ class Iterator : public DatasetIterator<Dataset> {
+ public:
+ explicit Iterator(const Params& params)
+ : DatasetIterator<Dataset>(params),
+ input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+
+ Status GetNextInternal(IteratorContext* ctx,
+ std::vector<Tensor>* out_tensors,
+ bool* end_of_sequence) override {
+ const int64 window_size = dataset()->window_size_;
+ const int64 stride = dataset()->stride_;
+ std::vector<std::vector<Tensor>> batch_elements;
+ {
+ mutex_lock l(mu_);
+ if (!input_impl_) {
+ *end_of_sequence = true;
+ return Status::OK();
+ }
+ batch_elements.reserve(window_size);
+ const bool first_call = cache_.empty();
+ if (first_call) {
+ cache_.reserve(window_size);
+ } else {
+          // Reuse the cache from the previous iteration.
+ cache_.swap(batch_elements);
+ }
+ // Fill up with new elements.
+ *end_of_sequence = false;
+ for (size_t i = batch_elements.size(); i < window_size && !*end_of_sequence;
+ ++i) {
+ std::vector<Tensor> batch_element_tuple;
+ TF_RETURN_IF_ERROR(input_impl_->GetNext(ctx, &batch_element_tuple,
+ end_of_sequence));
+ if (!*end_of_sequence) {
+ batch_elements.push_back(std::move(batch_element_tuple));
+ } else {
+ input_impl_.reset();
+ }
+ }
+          // Drop the final window if it has fewer than window_size elements.
+ if (batch_elements.size() < window_size) {
+ DCHECK(*end_of_sequence);
+ return Status::OK();
+ }
+ // Cache the data used for the next iteration.
+ for (size_t i = stride; i < window_size; ++i) {
+ cache_.emplace_back(batch_elements[i]);
+ }
+ }
+
+ // Construct output tensors.
+      // The code below is copied from batch_dataset_op.cc.
+ const size_t num_tuple_components = batch_elements[0].size();
+ const int64 num_batch_elements = batch_elements.size();
+ for (size_t component_index = 0; component_index < num_tuple_components;
+ ++component_index) {
+ const Tensor& first_element = batch_elements[0][component_index];
+ TensorShape batch_component_shape({num_batch_elements});
+ batch_component_shape.AppendShape(first_element.shape());
+ Tensor batch_component(cpu_allocator(), first_element.dtype(),
+ batch_component_shape);
+ // Build the output tuple component by copying one slice
+ // from each input element in the batch.
+ for (size_t i = 0; i < num_batch_elements; ++i) {
+ if (batch_elements[i][component_index].shape() !=
+ first_element.shape()) {
+ return errors::InvalidArgument(
+ "Cannot batch tensors with different shapes in component ",
+ component_index, ". First element had shape ",
+ first_element.shape().DebugString(), " and element ", i,
+ " had shape ",
+ batch_elements[i][component_index].shape().DebugString(),
+ ".");
+ }
+ TF_RETURN_IF_ERROR(batch_util::CopyElementToSlice(
+ std::move(batch_elements[i][component_index]), &batch_component,
+ i));
+ }
+ out_tensors->emplace_back(std::move(batch_component));
+ }
+ *end_of_sequence = false;
+ return Status::OK();
+ }
+
+ protected:
+ Status SaveInternal(IteratorStateWriter* writer) override {
+ mutex_lock l(mu_);
+ if (!input_impl_) {
+ TF_RETURN_IF_ERROR(
+ writer->WriteScalar(full_name("input_impl_empty"), ""));
+ } else {
+ TF_RETURN_IF_ERROR(SaveParent(writer, input_impl_));
+ }
+ // Save cache.
+ TF_RETURN_IF_ERROR(
+ writer->WriteScalar(strings::StrCat("cache_size"), cache_.size()));
+ for (int64 i = 0; i < cache_.size(); i++) {
+ TF_RETURN_IF_ERROR(writer->WriteScalar(
+ strings::StrCat("cache[", i, "]_size"), cache_[i].size()));
+ for (int64 j = 0; j < cache_[i].size(); j++) {
+ TF_RETURN_IF_ERROR(writer->WriteTensor(
+ strings::StrCat("cache[", i, "][", j, "]"), cache_[i][j]));
+ }
+ }
+ return Status::OK();
+ }
+
+ Status RestoreInternal(IteratorContext* ctx,
+ IteratorStateReader* reader) override {
+ mutex_lock l(mu_);
+ if (!reader->Contains(full_name("input_impl_empty"))) {
+ TF_RETURN_IF_ERROR(RestoreParent(ctx, reader, input_impl_));
+ } else {
+ input_impl_.reset();
+ }
+ // Restore cache.
+ int64 cache_size;
+ TF_RETURN_IF_ERROR(
+ reader->ReadScalar(strings::StrCat("cache_size"), &cache_size));
+ cache_.resize(cache_size);
+ for (int64 i = 0; i < cache_size; i++) {
+ int64 vector_size;
+ TF_RETURN_IF_ERROR(reader->ReadScalar(
+ strings::StrCat("cache[", i, "]_size"), &vector_size));
+ cache_[i].resize(vector_size);
+ for (int64 j = 0; j < vector_size; j++) {
+ TF_RETURN_IF_ERROR(reader->ReadTensor(
+ strings::StrCat("cache[", i, "][", j, "]"), &cache_[i][j]));
+ }
+ }
+ return Status::OK();
+ }
+
+ private:
+ mutex mu_;
+ std::vector<std::vector<Tensor>> cache_ GUARDED_BY(mu_);
+ std::unique_ptr<IteratorBase> input_impl_ GUARDED_BY(mu_);
+ };
+
+ const int64 window_size_;
+ const int64 stride_;
+ const DatasetBase* const input_;
+ std::vector<PartialTensorShape> output_shapes_;
+ };
+};
+
+REGISTER_KERNEL_BUILDER(Name("SlideDataset").Device(DEVICE_CPU),
+ SlideDatasetOp);
+
+} // namespace
+
+} // namespace tensorflow
diff --git a/tensorflow/core/kernels/depthtospace_op.cc b/tensorflow/core/kernels/depthtospace_op.cc
index 39aa3e9eb0..b74a09e2cb 100644
--- a/tensorflow/core/kernels/depthtospace_op.cc
+++ b/tensorflow/core/kernels/depthtospace_op.cc
@@ -188,6 +188,9 @@ REGISTER_KERNEL_BUILDER(
Name("DepthToSpace").Device(DEVICE_GPU).TypeConstraint<float>("T"),
DepthToSpaceOp<GPUDevice, float>);
REGISTER_KERNEL_BUILDER(
+ Name("DepthToSpace").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"),
+ DepthToSpaceOp<GPUDevice, Eigen::half>);
+REGISTER_KERNEL_BUILDER(
Name("DepthToSpace").Device(DEVICE_GPU).TypeConstraint<qint8>("T"),
DepthToSpaceOp<GPUDevice, qint8>);
#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc b/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc
index 184c703599..0656081177 100644
--- a/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthtospace_op_gpu.cu.cc
@@ -238,6 +238,12 @@ struct DepthToSpaceOpFunctor<GPUDevice, T, FORMAT_NCHW> {
template struct functor::DepthToSpaceOpFunctor<GPUDevice, float, FORMAT_NCHW>;
template struct functor::DepthToSpaceOpFunctor<GPUDevice, float, FORMAT_NHWC>;
+// Instantiate the GPU implementations for Eigen::half.
+template struct functor::DepthToSpaceOpFunctor<GPUDevice, Eigen::half,
+ FORMAT_NCHW>;
+template struct functor::DepthToSpaceOpFunctor<GPUDevice, Eigen::half,
+ FORMAT_NHWC>;
+
// NCHW_VECT_C with 4 x qint8 can be treated as NCHW int32.
template struct functor::DepthToSpaceOpFunctor<GPUDevice, int32, FORMAT_NCHW>;
diff --git a/tensorflow/core/kernels/hexagon/BUILD b/tensorflow/core/kernels/hexagon/BUILD
index 108d59db2c..7688305019 100644
--- a/tensorflow/core/kernels/hexagon/BUILD
+++ b/tensorflow/core/kernels/hexagon/BUILD
@@ -45,6 +45,7 @@ tf_cc_test(
"//tensorflow/core:test_main",
"//tensorflow/core:testlib",
"//tensorflow/core/kernels:cwise_op",
+ "//tensorflow/core/kernels:quantization_utils",
"//tensorflow/core/kernels:quantized_ops",
"//tensorflow/core/kernels:reduction_ops",
"//tensorflow/core/kernels:remote_fused_graph_execute_utils",
diff --git a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
index 1401bc65a4..e0706568b1 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
@@ -444,6 +444,7 @@ class MklConv2DCustomBackpropFilterOp
~MklConv2DCustomBackpropFilterOp() {}
private:
+ const int kDilationH = 0, kDilationW = 1;
void ValidateMklShapes(const MklDnnShape& input_mkl_shape,
const MklDnnShape& filter_mkl_shape,
const MklDnnShape& obp_mkl_shape) {
@@ -492,7 +493,9 @@ class MklConv2DCustomBackpropFilterOp
const convolution_forward::primitive_desc& conv_fwd_pd,
MklDnnData<T>* input, MklDnnData<T>* filter,
MklDnnData<T>* outbackprop, MklDnnData<T>* output,
- Tensor** output_tensor, const memory::dims& strides,
+ Tensor** output_tensor,
+ const memory::dims& strides,
+ const memory::dims& dilations,
const memory::dims& padding_l,
const memory::dims& padding_r, padding_kind padding,
const memory::dims& bwd_output_dims,
@@ -518,31 +521,32 @@ class MklConv2DCustomBackpropFilterOp
bias_grad->SetOpMemDesc(bias_grad_dims, memory::format::x);
}
- // Create convolution backward weights primitive.
- auto bwd_desc =
- (biasEnabled && (bias_grad != nullptr))
- ? convolution_backward_weights::desc(
- convolution_direct, input->GetOpMemDesc(),
- output->GetOpMemDesc(), bias_grad->GetOpMemDesc(),
- outbackprop->GetOpMemDesc(), strides, padding_l, padding_r,
- padding)
- : convolution_backward_weights::desc(
- convolution_direct, input->GetOpMemDesc(),
- output->GetOpMemDesc(), outbackprop->GetOpMemDesc(), strides,
- padding_l, padding_r, padding);
-
- auto bwd_pd = convolution_backward_weights::primitive_desc(
- bwd_desc, cpu_engine, conv_fwd_pd);
-
- // Allocate output tensor.
- AllocateOutputTensor(context, bwd_pd, bwd_output_dims, bwd_output_format,
- output_tensor);
-
- CHECK_NOTNULL(*output_tensor);
- // Set buffer handle using allocated output tensor.
- output->SetUsrMemDataHandle(*output_tensor);
-
if (biasEnabled && (bias_grad != nullptr)) {
+ // Create convolution backward weights with bias primitive.
+ // Use dilated convolution in case dilate rates are greater than zero.
+ auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
+ convolution_backward_weights::desc(convolution_direct,
+ input->GetOpMemDesc(), output->GetOpMemDesc(),
+ bias_grad->GetOpMemDesc(),
+ outbackprop->GetOpMemDesc(), strides,
+ dilations, padding_l, padding_r, padding) :
+ convolution_backward_weights::desc(convolution_direct,
+ input->GetOpMemDesc(), output->GetOpMemDesc(),
+ bias_grad->GetOpMemDesc(),
+ outbackprop->GetOpMemDesc(),
+ strides, padding_l, padding_r, padding);
+ auto bwd_pd = convolution_backward_weights::primitive_desc(bwd_desc,
+ cpu_engine,
+ conv_fwd_pd);
+
+ // Allocate output tensor.
+ AllocateOutputTensor(context, bwd_pd, bwd_output_dims,
+ bwd_output_format, output_tensor);
+
+ CHECK_NOTNULL(*output_tensor);
+ // Set buffer handle using allocated output tensor.
+ output->SetUsrMemDataHandle(*output_tensor);
+
// Allocate bias_grad tensor
TensorShape bias_grad_shape({depth});
Tensor* bias_grad_tensor = nullptr;
@@ -553,11 +557,32 @@ class MklConv2DCustomBackpropFilterOp
memory::desc({bias_grad_dims}, MklDnnType<T>(), memory::format::x);
bias_grad->SetUsrMem(bias_grad_md, bias_grad_tensor);
bias_grad->SetUsrMemDataHandle(bias_grad_tensor);
- }
- if (biasEnabled && (bias_grad != nullptr)) {
- PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output, bias_grad);
+ PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output,
+ bias_grad);
} else {
+ // Create convolution backward weights primitive.
+      // Use dilated convolution when the dilation rates are greater than zero.
+ auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
+ convolution_backward_weights::desc(convolution_direct,
+ input->GetOpMemDesc(), output->GetOpMemDesc(),
+ outbackprop->GetOpMemDesc(), strides,
+ dilations, padding_l, padding_r, padding) :
+ convolution_backward_weights::desc(convolution_direct,
+ input->GetOpMemDesc(), output->GetOpMemDesc(),
+ outbackprop->GetOpMemDesc(),
+ strides, padding_l, padding_r, padding);
+ auto bwd_pd = convolution_backward_weights::primitive_desc(bwd_desc,
+ cpu_engine,
+ conv_fwd_pd);
+
+ // Allocate output tensor.
+ AllocateOutputTensor(context, bwd_pd, bwd_output_dims,
+ bwd_output_format, output_tensor);
+
+ CHECK_NOTNULL(*output_tensor);
+ // Set buffer handle using allocated output tensor.
+ output->SetUsrMemDataHandle(*output_tensor);
PrepareAndExecutePrimitive(bwd_pd, input, outbackprop, output);
}
}
diff --git a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
index eeed009531..d203c04934 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
@@ -369,6 +369,7 @@ class MklConv2DCustomBackpropInputOp
private:
const int kInputIndex_Filter = 1, kInputIndex_InputSizes = 0,
kInputIndex_OutBackProp = 2;
+ const int kDilationH = 0, kDilationW = 1;
void ValidateMklShapes(const MklDnnShape& input_mkl_shape,
const MklDnnShape& filter_mkl_shape,
const MklDnnShape& obp_mkl_shape) {
@@ -419,7 +420,9 @@ class MklConv2DCustomBackpropInputOp
const convolution_forward::primitive_desc& conv_fwd_pd,
MklDnnData<T>* input, MklDnnData<T>* filter,
MklDnnData<T>* outbackprop, MklDnnData<T>* output,
- Tensor** output_tensor, const memory::dims& strides,
+ Tensor** output_tensor,
+ const memory::dims& strides,
+ const memory::dims& dilations,
const memory::dims& padding_l,
const memory::dims& padding_r, padding_kind padding,
const memory::dims& bwd_output_dims,
@@ -432,9 +435,16 @@ class MklConv2DCustomBackpropInputOp
CHECK_NOTNULL(output_tensor);
// Create convolution backward data primitive.
- auto bwd_desc = convolution_backward_data::desc(
- convolution_direct, output->GetOpMemDesc(), filter->GetOpMemDesc(),
- outbackprop->GetOpMemDesc(), strides, padding_l, padding_r, padding);
+    // Use dilated convolution when the dilation rates are greater than zero.
+ auto bwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0) ?
+ convolution_backward_data::desc(convolution_direct,
+ output->GetOpMemDesc(), filter->GetOpMemDesc(),
+ outbackprop->GetOpMemDesc(), strides,
+ dilations, padding_l, padding_r, padding):
+ convolution_backward_data::desc(convolution_direct,
+ output->GetOpMemDesc(), filter->GetOpMemDesc(),
+ outbackprop->GetOpMemDesc(),
+ strides, padding_l, padding_r, padding);
auto bwd_pd = convolution_backward_data::primitive_desc(
bwd_desc, cpu_engine, conv_fwd_pd);
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index 1440da8f82..f0818eb96d 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -493,6 +493,7 @@ class MklConv2DOp : public OpKernel {
~MklConv2DOp() {}
explicit MklConv2DOp(OpKernelConstruction* context) : OpKernel(context) {
+ OP_REQUIRES_OK(context, context->GetAttr("dilations", &dilations_));
OP_REQUIRES_OK(context, context->GetAttr("strides", &strides_));
string data_format;
OP_REQUIRES_OK(context, context->GetAttr("data_format", &data_format));
@@ -509,6 +510,20 @@ class MklConv2DOp : public OpKernel {
errors::InvalidArgument("Current implementation does not yet support "
"strides in the batch and depth dimensions."));
OP_REQUIRES_OK(context, context->GetAttr("padding", &padding_));
+ OP_REQUIRES(context, dilations_.size() == 4,
+ errors::InvalidArgument("Sliding window dilations field must "
+ "specify 4 dimensions"));
+ const int64 dilation_n = GetTensorDim(dilations_, data_format_, 'N');
+ const int64 dilation_c = GetTensorDim(dilations_, data_format_, 'C');
+ const int64 dilation_h = GetTensorDim(dilations_, data_format_, 'H');
+ const int64 dilation_w = GetTensorDim(dilations_, data_format_, 'W');
+ OP_REQUIRES(context, dilation_n == 1 && dilation_c == 1,
+ errors::InvalidArgument(
+ "Current implementation does not yet support "
+ "dilations in the batch and depth dimensions."));
+ OP_REQUIRES(
+ context, dilation_h > 0 && dilation_w > 0,
+ errors::InvalidArgument("Dilated rates should be larger than 0."));
}
void Compute(OpKernelContext* context) override {
@@ -530,17 +545,19 @@ class MklConv2DOp : public OpKernel {
MklDnnData<T> filter(&cpu_engine);
MklDnnData<T> output(&cpu_engine);
- memory::dims src_dims, filter_dims, padding_l, padding_r, strides;
+ memory::dims src_dims, filter_dims, padding_l, padding_r,
+ dilations, strides;
memory::dims output_dims_tf_order, output_dims_mkl_order;
// Get shapes of input tensors in MKL-DNN order
- MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_);
+ MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_,
+ dilations_);
auto src_tf_shape = GetTfShape(context, kInputIndex_Src);
auto filter_tf_shape = GetTfShape(context, kInputIndex_Filter);
conv_utl.GetConvFwdSizesInMklOrder(
src_tf_shape, filter_tf_shape, &src_dims, &filter_dims, &strides,
- &output_dims_tf_order, &output_dims_mkl_order, &padding_l,
- &padding_r);
+ &dilations, &output_dims_tf_order, &output_dims_mkl_order,
+ &padding_l, &padding_r);
if (!context->status().ok()) return;
// Check for corner case - if there is nothing to compute, return.
@@ -553,6 +570,7 @@ class MklConv2DOp : public OpKernel {
// Need semantics for Null MKL tensor
MklDnnShape output_mkl_shape;
output_mkl_shape.SetMklTensor(false);
+
AllocateOutputSetMklShape(context, kOutputIndex_Dst, &output_tensor,
src_tf_shape, output_mkl_shape);
@@ -596,55 +614,79 @@ class MklConv2DOp : public OpKernel {
filter.SetOpMemDesc(filter_dims, memory::format::any);
output.SetOpMemDesc(output_dims_mkl_order, memory::format::any);
- // If bias is enabled, then do the same steps as above for bias.
+      // MKL-DNN dilation starts from 0 (TensorFlow dilation starts from 1).
+ dilations[kDilationH] -= 1;
+ dilations[kDilationW] -= 1;
+
if (biasEnabled) {
- MklDnnData<T> bias(&cpu_engine);
- memory::dims bias_size;
- conv_utl.GetBiasSizeInMklOrder(kInputIndex_Bias, &bias_size);
- const Tensor& bias_tensor = MklGetInput(context, kInputIndex_Bias);
- bias.SetUsrMem(bias_size, memory::format::x, &bias_tensor);
- bias.SetOpMemDesc(bias_size, memory::format::any);
-
- // Create convolution primitive with Bias.
- auto conv_desc = convolution_forward::desc(
- prop_kind::forward, convolution_direct, src.GetOpMemDesc(),
- filter.GetOpMemDesc(), bias.GetOpMemDesc(), output.GetOpMemDesc(),
- strides, padding_l, padding_r, TFPaddingToMklDnnPadding(padding_));
-
- auto conv_prim_desc =
- convolution_forward::primitive_desc(conv_desc, cpu_engine);
- AllocateOutputTensor(context, conv_prim_desc, output_dims_mkl_order,
- tf_fmt, &output_tensor);
- // Set data handle for output.
- output.SetUsrMemDataHandle(output_tensor);
-
- Tensor* filter_out_tensor = nullptr;
- AllocateFilterOutputTensor(context, conv_prim_desc,
- TFShapeToMklDnnDims(filter_tf_shape),
- &filter_out_tensor);
-
- PrepareAndExecuteNet(conv_prim_desc, &src, &filter, &bias, &output,
- filter_out_tensor);
+        // Set up bias memory for the fused convolution-with-bias primitive.
+ MklDnnData<T> bias(&cpu_engine);
+ memory::dims bias_size;
+ conv_utl.GetBiasSizeInMklOrder(kInputIndex_Bias, &bias_size);
+ const Tensor& bias_tensor = MklGetInput(context, kInputIndex_Bias);
+ bias.SetUsrMem(bias_size, memory::format::x, &bias_tensor);
+ bias.SetOpMemDesc(bias_size, memory::format::any);
+
+ // Create convolution primitive with Bias.
+        // Use MKL-DNN dilated convolution when the dilation rate is greater than 0.
+ auto conv_desc = (dilations[kDilationH] > 0 ||
+ dilations[kDilationW] > 0) ?
+ convolution_forward::desc(prop_kind::forward,
+ convolution_direct, src.GetOpMemDesc(),
+ filter.GetOpMemDesc(), bias.GetOpMemDesc(),
+ output.GetOpMemDesc(), strides, dilations,
+ padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_)):
+ convolution_forward::desc(prop_kind::forward,
+ convolution_direct, src.GetOpMemDesc(),
+ filter.GetOpMemDesc(), bias.GetOpMemDesc(),
+ output.GetOpMemDesc(), strides,
+ padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_));
+
+ auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc,
+ cpu_engine);
+ AllocateOutputTensor(context, conv_prim_desc,
+ output_dims_mkl_order, tf_fmt, &output_tensor);
+ // Set data handle for output.
+ output.SetUsrMemDataHandle(output_tensor);
+
+ Tensor* filter_out_tensor = nullptr;
+ AllocateFilterOutputTensor(context, conv_prim_desc,
+ TFShapeToMklDnnDims(filter_tf_shape),
+ &filter_out_tensor);
+
+ PrepareAndExecuteNet(conv_prim_desc, &src, &filter, &bias, &output,
+ filter_out_tensor);
} else {
- // Create convolution primitive without Bias.
- auto conv_desc = convolution_forward::desc(
- prop_kind::forward, convolution_direct, src.GetOpMemDesc(),
- filter.GetOpMemDesc(), output.GetOpMemDesc(), strides, padding_l,
- padding_r, TFPaddingToMklDnnPadding(padding_));
-
- auto conv_prim_desc =
- convolution_forward::primitive_desc(conv_desc, cpu_engine);
- AllocateOutputTensor(context, conv_prim_desc, output_dims_mkl_order,
- tf_fmt, &output_tensor);
- // Set data handle for output.
- output.SetUsrMemDataHandle(output_tensor);
-
- Tensor* filter_out_tensor = nullptr;
- AllocateFilterOutputTensor(context, conv_prim_desc,
- TFShapeToMklDnnDims(filter_tf_shape),
- &filter_out_tensor);
- PrepareAndExecuteNet(conv_prim_desc, &src, &filter, nullptr, &output,
- filter_out_tensor);
+ // Create convolution primitive without Bias.
+        // Use MKL-DNN dilated convolution when the dilation rate is greater than 0.
+ auto conv_desc = (dilations[kDilationH] > 0 ||
+ dilations[kDilationW] > 0) ?
+ convolution_forward::desc(prop_kind::forward,
+ convolution_direct, src.GetOpMemDesc(),
+ filter.GetOpMemDesc(), output.GetOpMemDesc(),
+ strides, dilations, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_)):
+ convolution_forward::desc(prop_kind::forward,
+ convolution_direct, src.GetOpMemDesc(),
+ filter.GetOpMemDesc(), output.GetOpMemDesc(),
+ strides, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_));
+
+ auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc,
+ cpu_engine);
+ AllocateOutputTensor(context, conv_prim_desc, output_dims_mkl_order,
+ tf_fmt, &output_tensor);
+ // Set data handle for output.
+ output.SetUsrMemDataHandle(output_tensor);
+
+ Tensor* filter_out_tensor = nullptr;
+ AllocateFilterOutputTensor(context, conv_prim_desc,
+ TFShapeToMklDnnDims(filter_tf_shape),
+ &filter_out_tensor);
+ PrepareAndExecuteNet(conv_prim_desc, &src, &filter,
+ nullptr, &output, filter_out_tensor);
}
} catch (mkldnn::error& e) {
string error_msg = "Status: " + std::to_string(e.status) +
@@ -658,10 +700,12 @@ class MklConv2DOp : public OpKernel {
private:
std::vector<int32> strides_;
+ std::vector<int32> dilations_;
Padding padding_;
TensorFormat data_format_;
const int kInputIndex_Src = 0, kInputIndex_Filter = 1, kInputIndex_Bias = 2;
const int kOutputIndex_Dst = 0, kOutputIndex_Filter = 1;
+ const int kDilationH = 0, kDilationW = 1;
// Allocate output tensor.
void AllocateOutputTensor(
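
The hunks above thread a new `dilations` attribute through `MklConv2DOp`: the attribute must be 4-D with batch and depth rates of 1, and the H/W rates are shifted down by one because MKL-DNN counts dilation from 0 while TensorFlow counts from 1. The following is a minimal standalone sketch of that conversion, assuming NHWC attribute order; the names `DilationParams` and `ToMklDnnDilations` are illustrative and not part of the kernel.

```cpp
// Illustrative sketch only; mirrors the dilation handling added above
// without depending on TensorFlow or MKL-DNN headers.
#include <array>
#include <iostream>
#include <stdexcept>

struct DilationParams {
  int h;  // MKL-DNN convention: 0 means "no dilation".
  int w;
  bool use_dilated_primitive() const { return h > 0 || w > 0; }
};

// `dilations` comes from the "dilations" attribute in NHWC order: {N, H, W, C}.
DilationParams ToMklDnnDilations(const std::array<int, 4>& dilations) {
  const int n = dilations[0], h = dilations[1], w = dilations[2], c = dilations[3];
  if (n != 1 || c != 1)
    throw std::invalid_argument("dilations in batch/depth dims must be 1");
  if (h <= 0 || w <= 0)
    throw std::invalid_argument("dilation rates must be larger than 0");
  // TensorFlow uses 1 for "no dilation"; MKL-DNN uses 0.
  return {h - 1, w - 1};
}

int main() {
  DilationParams p = ToMklDnnDilations({1, 2, 2, 1});
  std::cout << "mkl-dnn dilations: " << p.h << "x" << p.w
            << ", dilated primitive: " << std::boolalpha
            << p.use_dilated_primitive() << "\n";  // prints 1x1, true
}
```
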
diff --git a/tensorflow/core/kernels/mkl_conv_ops.h b/tensorflow/core/kernels/mkl_conv_ops.h
index 9dd88221a8..7ca10db895 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.h
+++ b/tensorflow/core/kernels/mkl_conv_ops.h
@@ -58,13 +58,16 @@ class MklDnnConvUtil {
protected:
OpKernelContext* context_; // We don't own this.
std::vector<int32> strides_;
+ std::vector<int32> dilations_;
Padding padding_;
TensorFormat data_format_;
public:
MklDnnConvUtil(OpKernelContext* context, const std::vector<int32>& strides,
- Padding pad, TensorFormat fm)
- : context_(context), strides_(strides), padding_(pad), data_format_(fm) {}
+ Padding pad, TensorFormat fm,
+ const std::vector<int32>& dilations) :
+ context_(context), strides_(strides), padding_(pad),
+ data_format_(fm), dilations_(dilations) {}
virtual ~MklDnnConvUtil() { context_ = nullptr; }
@@ -78,6 +81,16 @@ class MklDnnConvUtil {
*strides = {stride_rows, stride_cols};
}
+ // Calculate Convolution dilations
+ virtual inline void GetDilationsInMklOrder(memory::dims *dilations) {
+ // For now we take the dilation from the second and third dimensions only
+ // (we do not support dilation on the batch or depth dimension).
+ CHECK_NOTNULL(dilations);
+ int dilations_rows = GetTensorDim(dilations_, data_format_, 'H');
+ int dilations_cols = GetTensorDim(dilations_, data_format_, 'W');
+ *dilations = {dilations_rows, dilations_cols};
+ }
+
// Calculate Convolution input size in MKL-DNN order. MKL-DNN
// requires input in NCHW format. Function does not return anything.
// But errors arising from sanity checks are returned in context's
@@ -213,7 +226,8 @@ class MklDnnConvUtil {
// TODO(nhasabni): Add similar function for input and filter in MklShape.
virtual inline void GetOutputAndPadSizeInMklOrder(
const TensorShape& input_shape, const TensorShape& filter_shape,
- const memory::dims& strides, memory::dims* output_dims_tf_order,
+ const memory::dims& strides, const memory::dims& dilations,
+ memory::dims* output_dims_tf_order,
memory::dims* output_dims_mkl_order, memory::dims* pad_l,
memory::dims* pad_r) {
CHECK_NOTNULL(output_dims_tf_order);
@@ -232,6 +246,8 @@ class MklDnnConvUtil {
// Stride is vector of 2 elements: {s_r, s_c}
int stride_rows = strides[0];
int stride_cols = strides[1];
+ int dilation_rows = dilations[0];
+ int dilation_cols = dilations[1];
// Output batch is same as input batch.
int out_batch = GetTensorDim(input_shape, data_format_, 'N');
@@ -241,11 +257,13 @@ class MklDnnConvUtil {
int64 out_rows = 0, out_cols = 0;
int64 pad_top = 0, pad_bottom = 0, pad_left, pad_right;
- OP_REQUIRES_OK(context_, GetWindowedOutputSizeVerbose(
- input_rows, filter_rows, stride_rows, padding_,
+ OP_REQUIRES_OK(context_,
+ GetWindowedOutputSizeVerboseV2(input_rows, filter_rows,
+ dilation_rows, stride_rows, padding_,
&out_rows, &pad_top, &pad_bottom));
- OP_REQUIRES_OK(context_, GetWindowedOutputSizeVerbose(
- input_cols, filter_cols, stride_cols, padding_,
+ OP_REQUIRES_OK(context_,
+ GetWindowedOutputSizeVerboseV2(input_cols, filter_cols,
+ dilation_cols, stride_cols, padding_,
&out_cols, &pad_left, &pad_right));
// Tensorflow output is in data_format order. (NHWC or NCHW)
@@ -271,7 +289,8 @@ class MklDnnConvUtil {
//
// Function does not return anything, but sets error in context status.
inline void GetOutputAndPadSizeInMklOrder(
- size_t src_index, size_t filter_index, const memory::dims& strides,
+ size_t src_index, size_t filter_index,
+ const memory::dims& strides, const memory::dims& dilations,
memory::dims* output_dims_tf_order, memory::dims* output_dims_mkl_order,
memory::dims* pad_l, memory::dims* pad_r) {
CHECK_NOTNULL(output_dims_tf_order);
@@ -286,9 +305,9 @@ class MklDnnConvUtil {
errors::InvalidArgument("input must be 4-dimensional",
input_tf_shape.DebugString()));
- GetOutputAndPadSizeInMklOrder(input_tf_shape, filter_tf_shape, strides,
- output_dims_tf_order, output_dims_mkl_order,
- pad_l, pad_r);
+ GetOutputAndPadSizeInMklOrder(input_tf_shape, filter_tf_shape,
+ strides, dilations, output_dims_tf_order,
+ output_dims_mkl_order, pad_l, pad_r);
}
// Wrapper function to calculate input, filter, and output sizes of
@@ -300,12 +319,14 @@ class MklDnnConvUtil {
inline void GetConvFwdSizesInMklOrder(
const TensorShape& input_shape, const TensorShape& filter_shape,
memory::dims* input_dims, memory::dims* filter_dims,
- memory::dims* strides, memory::dims* output_dims_tf_order,
+ memory::dims* strides, memory::dims *dilations,
+ memory::dims* output_dims_tf_order,
memory::dims* output_dims_mkl_order, memory::dims* pad_l,
memory::dims* pad_r) {
CHECK_NOTNULL(input_dims);
CHECK_NOTNULL(filter_dims);
CHECK_NOTNULL(strides);
+ CHECK_NOTNULL(dilations);
CHECK_NOTNULL(output_dims_tf_order);
CHECK_NOTNULL(output_dims_mkl_order);
CHECK_NOTNULL(pad_l);
@@ -316,7 +337,9 @@ class MklDnnConvUtil {
GetFilterSizeInMklOrder(input_shape, filter_shape, filter_dims);
if (!context_->status().ok()) return;
GetStridesInMklOrder(strides);
- GetOutputAndPadSizeInMklOrder(input_shape, filter_shape, *strides,
+ GetDilationsInMklOrder(dilations);
+ GetOutputAndPadSizeInMklOrder(input_shape, filter_shape,
+ *strides, *dilations,
output_dims_tf_order, output_dims_mkl_order,
pad_l, pad_r);
if (!context_->status().ok()) return;
@@ -344,7 +367,21 @@ class MklConv2DBackpropCommonOp : public OpKernel {
context, (stride_n == 1 && stride_c == 1),
errors::InvalidArgument("Current implementation does not yet support "
"strides in the batch and depth dimensions."));
-
+ OP_REQUIRES_OK(context, context->GetAttr("dilations", &dilations_));
+ OP_REQUIRES(context, dilations_.size() == 4,
+ errors::InvalidArgument("Sliding window dilations field must "
+ "specify 4 dimensions"));
+ int dilation_n = GetTensorDim(dilations_, data_format_, 'N');
+ int dilation_c = GetTensorDim(dilations_, data_format_, 'C');
+ int dilation_h = GetTensorDim(dilations_, data_format_, 'H');
+ int dilation_w = GetTensorDim(dilations_, data_format_, 'W');
+ OP_REQUIRES(context, (dilation_n == 1 && dilation_c == 1),
+ errors::InvalidArgument(
+ "Current implementation does not yet support "
+ "dilations in the batch and depth dimensions."));
+ OP_REQUIRES(
+ context, dilation_h > 0 && dilation_w > 0,
+ errors::InvalidArgument("Dilated rates should be larger than 0."));
OP_REQUIRES_OK(context, context->GetAttr("padding", &padding_));
}
@@ -406,15 +443,16 @@ class MklConv2DBackpropCommonOp : public OpKernel {
// By default, all dims are in MKL order. Only dims in TF order
// are those with prefix tf_order.
memory::dims outbprop_dims, fwd_input_dims, fwd_filter_dims;
- memory::dims padding_l, padding_r, strides, fwd_output_dims;
+ memory::dims padding_l, padding_r, dilations, strides, fwd_output_dims;
memory::dims fwd_output_dims_tf_order;
// Get forward convolution parameters.
- MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_);
+ MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_,
+ dilations_);
conv_utl.GetConvFwdSizesInMklOrder(
input_tf_shape, filter_tf_shape, &fwd_input_dims, &fwd_filter_dims,
- &strides, &fwd_output_dims_tf_order, &fwd_output_dims, &padding_l,
- &padding_r);
+ &strides, &dilations, &fwd_output_dims_tf_order, &fwd_output_dims,
+ &padding_l, &padding_r);
if (!context->status().ok()) return;
// Create Convolution forward descriptor since Convolution backward
@@ -437,10 +475,21 @@ class MklConv2DBackpropCommonOp : public OpKernel {
memory::format::hwio);
// Tensorflow Output of Conv2D is in data_format order.
auto fwd_out_md = memory::desc(fwd_output_dims, MklDnnType<T>(), tf_fmt);
- auto fwd_desc = convolution_forward::desc(
- prop_kind::forward, convolution_direct, fwd_input_md, fwd_filter_md,
- fwd_out_md, strides, padding_l, padding_r,
- TFPaddingToMklDnnPadding(padding_));
+
+ const int kDilationH = 0, kDilationW = 1;
+ dilations[kDilationH] -= 1;
+ dilations[kDilationW] -= 1;
+ auto fwd_desc = (dilations[kDilationH] > 0 || dilations[kDilationW] > 0)?
+ convolution_forward::desc(prop_kind::forward,
+ convolution_direct, fwd_input_md,
+ fwd_filter_md, fwd_out_md,
+ strides, dilations, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_)) :
+ convolution_forward::desc(prop_kind::forward,
+ convolution_direct, fwd_input_md,
+ fwd_filter_md, fwd_out_md,
+ strides, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_));
auto fwd_pd = convolution_forward::primitive_desc(fwd_desc, cpu_engine);
// Create memory for user data. Describe how the inputs and outputs of
@@ -485,8 +534,9 @@ class MklConv2DBackpropCommonOp : public OpKernel {
// Operator-specific call to create and execute primitive.
CreatePrimitive(context, cpu_engine, fwd_pd, &input, &filter,
- &outbackprop, &output, &output_tensor, strides, padding_l,
- padding_r, TFPaddingToMklDnnPadding(padding_),
+ &outbackprop, &output, &output_tensor,
+ strides, dilations, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_),
bwd_output_dims, bwd_output_format);
} catch (mkldnn::error& e) {
string error_msg = "Status: " + std::to_string(e.status) +
@@ -535,20 +585,21 @@ class MklConv2DBackpropCommonOp : public OpKernel {
virtual memory::format GetOutputFormat(const memory::format data_format) = 0;
/// Create and execute the primitive storing output in the output_tensor.
- virtual void CreatePrimitive(
- OpKernelContext* context, const engine& cpu_engine,
- const convolution_forward::primitive_desc& conv_fwd_pd,
- MklDnnData<T>* input, MklDnnData<T>* filter, MklDnnData<T>* outbackprop,
- MklDnnData<T>* output, Tensor** output_tensor,
- const memory::dims& strides, const memory::dims& padding_l,
- const memory::dims& padding_r, padding_kind padding,
- const memory::dims& bwd_output_dims,
- memory::format bwd_output_format) = 0;
+ virtual void CreatePrimitive(OpKernelContext* context,
+ const engine& cpu_engine,
+ const convolution_forward::primitive_desc& conv_fwd_pd,
+ MklDnnData<T>* input, MklDnnData<T>* filter, MklDnnData<T>* outbackprop,
+ MklDnnData<T>* output, Tensor** output_tensor, const memory::dims& strides,
+ const memory::dims& dilations, const memory::dims& padding_l,
+ const memory::dims& padding_r, padding_kind padding,
+ const memory::dims& bwd_output_dims,
+ memory::format bwd_output_format) = 0;
// Get the data_format {NCHW, NHWC}
TensorFormat GetTFDataFormat() { return data_format_; }
private:
+ std::vector<int32> dilations_;
std::vector<int32> strides_;
Padding padding_;
TensorFormat data_format_;
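
The switch from `GetWindowedOutputSizeVerbose` to `GetWindowedOutputSizeVerboseV2` above lets the utility fold the dilation rate into an effective filter size before computing the output extent and padding. Below is a hedged, standalone sketch of that arithmetic for one spatial dimension; it follows the usual TensorFlow SAME/VALID convention but is not the TensorFlow implementation itself.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

enum class Padding { kValid, kSame };

// One spatial dimension of the windowed-output computation with dilation.
void WindowedOutputSize(int64_t in, int64_t filter, int64_t dilation,
                        int64_t stride, Padding padding, int64_t* out,
                        int64_t* pad_before, int64_t* pad_after) {
  const int64_t effective_filter = (filter - 1) * dilation + 1;
  if (padding == Padding::kValid) {
    *out = (in - effective_filter + stride) / stride;  // ceil((in - eff + 1) / stride)
    *pad_before = *pad_after = 0;
  } else {  // SAME
    *out = (in + stride - 1) / stride;                 // ceil(in / stride)
    const int64_t pad_total =
        std::max<int64_t>(0, (*out - 1) * stride + effective_filter - in);
    *pad_before = pad_total / 2;
    *pad_after = pad_total - *pad_before;
  }
}

int main() {
  int64_t out, pb, pa;
  // A 3x3 filter with dilation 2 covers the same span as a 5x5 filter.
  WindowedOutputSize(/*in=*/32, /*filter=*/3, /*dilation=*/2, /*stride=*/1,
                     Padding::kValid, &out, &pb, &pa);
  std::cout << "VALID: out=" << out << "\n";                       // 28
  WindowedOutputSize(32, 3, 2, 1, Padding::kSame, &out, &pb, &pa);
  std::cout << "SAME: out=" << out << " pad=" << pb << "+" << pa << "\n";  // 32, 2+2
}
```
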
diff --git a/tensorflow/core/kernels/mkl_input_conversion_op.cc b/tensorflow/core/kernels/mkl_input_conversion_op.cc
index e9a2376b54..d91f7107c5 100644
--- a/tensorflow/core/kernels/mkl_input_conversion_op.cc
+++ b/tensorflow/core/kernels/mkl_input_conversion_op.cc
@@ -442,12 +442,11 @@ class MklInputConversionOp : public OpKernel {
auto input_tf_md = mkl_output_mkl_shape.GetTfLayout();
tf_input.SetUsrMem(input_tf_md, tf_tensor);
- // Create reorder between tensorflow layout and Mkl layout.
+    // Create reorder between TensorFlow layout and MKL layout if necessary.
std::vector<primitive> net;
- CHECK_EQ(tf_input.CheckReorderToOpMem(
+ tf_input.CheckReorderToOpMem(
memory::primitive_desc(output_mkl_md, cpu_engine),
- tensor_out, &net),
- true);
+ tensor_out, &net);
stream(stream::kind::eager).submit(net).wait();
// -- The tensor in MKL format passes through --
diff --git a/tensorflow/core/kernels/mkl_relu_op.cc b/tensorflow/core/kernels/mkl_relu_op.cc
index 267f4f8d12..0a0f69522f 100644
--- a/tensorflow/core/kernels/mkl_relu_op.cc
+++ b/tensorflow/core/kernels/mkl_relu_op.cc
@@ -392,7 +392,7 @@ class MklReluOpBase : public OpKernel {
Tensor* dst_tensor = nullptr;
if (src_tensor.dims() == 0) {
- Compute_Scalar(context);
+ Compute_Scalar(context); // scalar case doesn't use in-place operation
return;
}
@@ -437,11 +437,15 @@ class MklReluOpBase : public OpKernel {
dnn_shape_dst.SetMklTensor(false);
tf_shape_dst = src_tensor.shape();
}
- AllocateOutputSetMklShape(context, dst_index, &dst_tensor, tf_shape_dst,
- dnn_shape_dst);
+
+ // Allocate output and MklDnnShape tensors separately for possible
+ // in-place operation
+ OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
+ {src_index}, dst_index, tf_shape_dst, &dst_tensor));
+ AllocateOutputSetMklShape(context, dst_index, dnn_shape_dst);
// Destination memory descriptor is same as source memory descriptor.
- auto dst_md = src_md;
+ auto &dst_md = src_md;
dst.SetUsrMem(dst_md, dst_tensor);
// execute net
@@ -492,7 +496,7 @@ class MklReluGradOpBase : public OpKernel {
int src_dims_size = src_tensor.dims();
if (src_dims_size == 0) {
- Compute_Scalar(context);
+ Compute_Scalar(context); // scalar case doesn't use in-place operation
return;
}
@@ -603,8 +607,13 @@ class MklReluGradOpBase : public OpKernel {
// so it is ok to get TensorFlow shape.
tf_shape_diff_src = src_tensor.shape();
}
- AllocateOutputSetMklShape(context, diff_src_index, &diff_src_tensor,
- tf_shape_diff_src, dnn_shape_diff_src);
+
+ // Allocate diff_src and MklDnnShape tensors separately for possible
+ // in-place operation
+ OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
+ {diff_dst_index}, diff_src_index, tf_shape_diff_src,
+ &diff_src_tensor));
+ AllocateOutputSetMklShape(context, diff_src_index, dnn_shape_diff_src);
// diff_src memory descriptor is same as memory descriptor for both
// inputs.
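
The `mkl_relu_op.cc` hunks above switch output allocation to `forward_input_or_allocate_output` so the kernel can reuse the input buffer when possible. That is safe because ReLU and its gradient are elementwise; the short standalone example below only illustrates the in-place property and is not TensorFlow code.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// ReLU reads and writes each element independently, so the destination may
// alias the source; this is the property the in-place optimization relies on.
void ReluInPlace(std::vector<float>* buf) {
  for (float& x : *buf) x = std::max(x, 0.0f);
}

int main() {
  std::vector<float> activations = {-1.5f, 0.0f, 2.5f, -0.25f};
  ReluInPlace(&activations);  // no second buffer allocated
  for (float x : activations) std::cout << x << " ";  // 0 0 2.5 0
  std::cout << "\n";
}
```
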
diff --git a/tensorflow/core/kernels/segment_reduction_ops.h b/tensorflow/core/kernels/segment_reduction_ops.h
index 4abfbfb1a6..d65692a552 100644
--- a/tensorflow/core/kernels/segment_reduction_ops.h
+++ b/tensorflow/core/kernels/segment_reduction_ops.h
@@ -16,6 +16,14 @@ limitations under the License.
#ifndef TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
#define TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
+
+// This file requires the following include because it uses CudaAtomicMax:
+// #include "tensorflow/core/util/cuda_kernel_helper.h"
+
+// Unfortunately we can't add the #include, since it breaks compilation for
+// non-GPU targets. This only breaks in clang, because it's more strict for
+// template code and CudaAtomicMax is used in template context.
+
// This file requires the following include because it uses CudaAtomicMax:
// #include "tensorflow/core/util/cuda_kernel_helper.h"
diff --git a/tensorflow/core/kernels/spacetodepth_op.cc b/tensorflow/core/kernels/spacetodepth_op.cc
index 23df1c35e5..e59adfc6ac 100644
--- a/tensorflow/core/kernels/spacetodepth_op.cc
+++ b/tensorflow/core/kernels/spacetodepth_op.cc
@@ -188,6 +188,9 @@ REGISTER_KERNEL_BUILDER(
Name("SpaceToDepth").Device(DEVICE_GPU).TypeConstraint<float>("T"),
SpaceToDepthOp<GPUDevice, float>);
REGISTER_KERNEL_BUILDER(
+ Name("SpaceToDepth").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"),
+ SpaceToDepthOp<GPUDevice, Eigen::half>);
+REGISTER_KERNEL_BUILDER(
Name("SpaceToDepth").Device(DEVICE_GPU).TypeConstraint<qint8>("T"),
SpaceToDepthOp<GPUDevice, qint8>);
#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc b/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc
index db05ca1ed2..f38459724a 100644
--- a/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc
@@ -234,6 +234,12 @@ struct SpaceToDepthOpFunctor<GPUDevice, T, FORMAT_NCHW> {
template struct functor::SpaceToDepthOpFunctor<GPUDevice, float, FORMAT_NCHW>;
template struct functor::SpaceToDepthOpFunctor<GPUDevice, float, FORMAT_NHWC>;
+// Instantiate the GPU implementations for Eigen::half.
+template struct functor::SpaceToDepthOpFunctor<GPUDevice, Eigen::half,
+ FORMAT_NCHW>;
+template struct functor::SpaceToDepthOpFunctor<GPUDevice, Eigen::half,
+ FORMAT_NHWC>;
+
// NCHW_VECT_C with 4 x qint8 can be treated as NCHW int32.
template struct functor::SpaceToDepthOpFunctor<GPUDevice, int32, FORMAT_NCHW>;
diff --git a/tensorflow/core/lib/io/record_reader.cc b/tensorflow/core/lib/io/record_reader.cc
index 254fdf115d..6de850bb20 100644
--- a/tensorflow/core/lib/io/record_reader.cc
+++ b/tensorflow/core/lib/io/record_reader.cc
@@ -205,7 +205,9 @@ Status RecordReader::SkipNBytes(uint64 offset) {
if (options_.buffer_size > 0) {
TF_RETURN_IF_ERROR(input_stream_->SkipNBytes(offset));
}
+#if !defined(IS_SLIM_BUILD)
}
+#endif
return Status::OK();
} // namespace io
diff --git a/tensorflow/core/lib/io/record_reader.h b/tensorflow/core/lib/io/record_reader.h
index 62dd2efb79..26278e0328 100644
--- a/tensorflow/core/lib/io/record_reader.h
+++ b/tensorflow/core/lib/io/record_reader.h
@@ -16,10 +16,10 @@ limitations under the License.
#ifndef TENSORFLOW_LIB_IO_RECORD_READER_H_
#define TENSORFLOW_LIB_IO_RECORD_READER_H_
-#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/lib/core/stringpiece.h"
-#if !defined(IS_SLIM_BUILD)
#include "tensorflow/core/lib/io/inputstream_interface.h"
+#if !defined(IS_SLIM_BUILD)
#include "tensorflow/core/lib/io/zlib_compression_options.h"
#include "tensorflow/core/lib/io/zlib_inputstream.h"
#endif // IS_SLIM_BUILD
diff --git a/tensorflow/core/ops/dataset_ops.cc b/tensorflow/core/ops/dataset_ops.cc
index f32baee45e..e2453b9712 100644
--- a/tensorflow/core/ops/dataset_ops.cc
+++ b/tensorflow/core/ops/dataset_ops.cc
@@ -1,4 +1,4 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -266,6 +266,16 @@ REGISTER_OP("BatchDataset")
.Attr("output_shapes: list(shape) >= 1")
.SetShapeFn(shape_inference::ScalarShape);
+// TODO(mrry): move SlideDataset to contrib in the future.
+REGISTER_OP("SlideDataset")
+ .Input("input_dataset: variant")
+ .Input("window_size: int64")
+ .Input("stride: int64")
+ .Output("handle: variant")
+ .Attr("output_types: list(type) >= 1")
+ .Attr("output_shapes: list(shape) >= 1")
+ .SetShapeFn(shape_inference::ScalarShape);
+
REGISTER_OP("PaddedBatchDataset")
.Input("input_dataset: variant")
.Input("batch_size: int64")
diff --git a/tensorflow/core/ops/nn_ops.cc b/tensorflow/core/ops/nn_ops.cc
index 910fbaca9e..d6a0f38033 100644
--- a/tensorflow/core/ops/nn_ops.cc
+++ b/tensorflow/core/ops/nn_ops.cc
@@ -1498,6 +1498,7 @@ REGISTER_OP("_MklConv2D")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.SetShapeFn(shape_inference::Conv2DShape)
.Doc(R"doc(
MKL version of Conv2D operator. Uses MKL DNN APIs to perform 2D convolution.
@@ -1516,6 +1517,7 @@ REGISTER_OP("__MklDummyConv2DWithBias")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.Doc(R"doc(
Dummy node that enables fusing Conv2D and BiasAdd operator for MKL. This node
does not perform anything. It is just created as an intermediate output of
@@ -1541,6 +1543,7 @@ REGISTER_OP("_MklConv2DWithBias")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.Doc(R"doc(
MKL version of Conv2D and BiasAdd operator. Uses MKL DNN APIs to perform
2D convolution and add Bias to the output of convolution.
@@ -1563,6 +1566,7 @@ REGISTER_OP("_MklConv2DBackpropFilter")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.SetShapeFn([](InferenceContext* c) {
ShapeHandle s;
TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(1, &s));
@@ -1589,6 +1593,7 @@ REGISTER_OP("__MklDummyConv2DBackpropFilterWithBias")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.SetShapeFn([](InferenceContext* c) {
ShapeHandle input_shape;
// Fetch the data_format attribute, which may not exist.
@@ -1633,6 +1638,7 @@ REGISTER_OP("_MklConv2DBackpropFilterWithBias")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.SetShapeFn([](InferenceContext* c) {
ShapeHandle input_shape;
// Fetch the data_format attribute, which may not exist.
@@ -1668,6 +1674,7 @@ REGISTER_OP("_MklConv2DWithBiasBackpropBias")
.Attr("T: {half, float, double}")
.Attr("strides: list(int)")
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.Doc(R"doc(
MKL version of Conv2DBackpropBias. Uses MKL DNN APIs to compute the
gradients of convolution with respect to the bias.
@@ -1690,6 +1697,7 @@ REGISTER_OP("_MklConv2DBackpropInput")
.Attr("use_cudnn_on_gpu: bool = true")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
+ .Attr("dilations: list(int) = [1, 1, 1, 1]")
.SetShapeFn([](InferenceContext* c) {
ShapeHandle s;
TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(0, &s));
diff --git a/tensorflow/core/platform/windows/windows_file_system.cc b/tensorflow/core/platform/windows/windows_file_system.cc
index b6b3722caa..682e46e0fc 100644
--- a/tensorflow/core/platform/windows/windows_file_system.cc
+++ b/tensorflow/core/platform/windows/windows_file_system.cc
@@ -382,7 +382,8 @@ Status WindowsFileSystem::NewReadOnlyMemoryRegionFromFile(
Status WindowsFileSystem::FileExists(const string& fname) {
constexpr int kOk = 0;
- if (_access(TranslateName(fname).c_str(), kOk) == 0) {
+ std::wstring ws_translated_fname = Utf8ToWideChar(TranslateName(fname));
+ if (_waccess(ws_translated_fname.c_str(), kOk) == 0) {
return Status::OK();
}
return errors::NotFound(fname, " not found");
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index 7405e01e14..22f2c02b78 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -24,7 +24,7 @@ limitations under the License.
// TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
// "-beta", "-rc", "-rc.1")
-#define TF_VERSION_SUFFIX "-rc1"
+#define TF_VERSION_SUFFIX ""
#define TF_STR_HELPER(x) #x
#define TF_STR(x) TF_STR_HELPER(x)
diff --git a/tensorflow/core/util/stat_summarizer.h b/tensorflow/core/util/stat_summarizer.h
index f7b63e8686..79fa63723e 100644
--- a/tensorflow/core/util/stat_summarizer.h
+++ b/tensorflow/core/util/stat_summarizer.h
@@ -186,7 +186,7 @@ class StatSummarizer {
void Reset();
// Returns number of runs.
- int num_runs() const { return run_total_us_.count(); }
+ int num_runs() const { return static_cast<int>(run_total_us_.count()); }
// Returns stats of total microseconds spent by all nodes in each run.
const Stat<int64>& run_total_us() const { return run_total_us_; }
diff --git a/tensorflow/docs_src/community/welcome.md b/tensorflow/docs_src/community/welcome.md
index 9f6fe91b14..6d0458e678 100644
--- a/tensorflow/docs_src/community/welcome.md
+++ b/tensorflow/docs_src/community/welcome.md
@@ -51,6 +51,8 @@ Europe:
TensorFlow provides multiple communication paths. To pick the right path,
please read the following list carefully:
+ * For new release announcements and security updates, subscribe to
+ [announce@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/announce).
* To ask or answer technical questions about TensorFlow, use
[Stack Overflow](https://stackoverflow.com/questions/tagged/tensorflow).
For example, ask or search Stack Overflow about a particular error message
@@ -65,5 +67,5 @@ please read the following list carefully:
on GitHub. For example, use the issue tracker to request a
new operation in TensorFlow.
* To report vulnerabilities, please follow our
- [vulnerability disclosure guidelines](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/SECURITY.md).
+ [vulnerability disclosure guidelines](https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md).
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index 818798555a..0481c97885 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -38,7 +38,7 @@ enable TensorFlow for C:
OS="linux" # Change to "darwin" for macOS
TARGET_DIRECTORY="/usr/local"
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.6.0-rc1.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.6.0.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index 4c6dfa8daf..8f89898c92 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -38,7 +38,7 @@ steps to install this library and enable TensorFlow for Go:
TF_TYPE="cpu" # Change to "gpu" for GPU support
TARGET_DIRECTORY='/usr/local'
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.6.0-rc1.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.6.0.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 527884863e..0ee9c849e1 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -36,7 +36,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.6.0-rc1</version>
+ <version>1.6.0</version>
</dependency>
```
@@ -65,7 +65,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.6.0-rc1</version>
+ <version>1.6.0</version>
</dependency>
</dependencies>
</project>
@@ -123,12 +123,12 @@ instead:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>libtensorflow</artifactId>
- <version>1.6.0-rc1</version>
+ <version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>libtensorflow_jni_gpu</artifactId>
- <version>1.6.0-rc1</version>
+ <version>1.6.0</version>
</dependency>
```
@@ -147,7 +147,7 @@ refer to the simpler instructions above instead.
Take the following steps to install TensorFlow for Java on Linux or macOS:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.6.0-rc1.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.6.0.jar),
which is the TensorFlow Java Archive (JAR).
2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@@ -166,7 +166,7 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
mkdir -p ./jni
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.6.0-rc1.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.6.0.tar.gz" |
tar -xz -C ./jni
### Install on Windows
@@ -174,10 +174,10 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
Take the following steps to install TensorFlow for Java on Windows:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.6.0-rc1.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.6.0.jar),
which is the TensorFlow Java Archive (JAR).
2. Download the following Java Native Interface (JNI) file appropriate for
- [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.6.0-rc1.zip).
+ [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.6.0.zip).
3. Extract this .zip file.
@@ -225,7 +225,7 @@ must be part of your `classpath`. For example, you can include the
downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
as follows:
-<pre><b>javac -cp libtensorflow-1.6.0-rc1.jar HelloTF.java</b></pre>
+<pre><b>javac -cp libtensorflow-1.6.0.jar HelloTF.java</b></pre>
### Running
@@ -239,11 +239,11 @@ two files are available to the JVM:
For example, the following command line executes the `HelloTF` program on Linux
and macOS X:
-<pre><b>java -cp libtensorflow-1.6.0-rc1.jar:. -Djava.library.path=./jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.6.0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
And the following command line executes the `HelloTF` program on Windows:
-<pre><b>java -cp libtensorflow-1.6.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>d
+<pre><b>java -cp libtensorflow-1.6.0.jar;. -Djava.library.path=jni HelloTF</b></pre>
If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
installed TensorFlow for Java and are ready to use the API. If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index 2741b61bb2..8612762271 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -39,16 +39,9 @@ must be installed on your system:
* [cuDNN SDK v7](http://developer.nvidia.com/cudnn). For details, see
[NVIDIA's documentation](http://docs.nvidia.com/deeplearning/sdk/cudnn-install/).
Ensure that you create the `CUDA_HOME` environment variable as
- described in NVIDIA's documentation.
- * [CUPTI](http://docs.nvidia.com/cuda/cupti/) ships with the CUDA Toolkit, but
- you also need to append its path to the `LD_LIBRARY_PATH` environment
- variable:
-
- <pre> $ <b>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64</b> </pre>
-
-In order to run TensorFlow computations on the GPU, you also need:
-
- * A GPU card with CUDA Compute Capability 3.0 or higher. See
+ described in the NVIDIA documentation.
+  * A GPU card with CUDA Compute Capability 3.0 or higher for building
+    from source, and 3.5 or higher for our binaries. See
[NVIDIA documentation](https://developer.nvidia.com/cuda-gpus) for
a list of supported GPU cards.
* [GPU drivers](http://nvidia.com/driver) supporting your version of the CUDA
@@ -172,7 +165,7 @@ Take the following steps to install TensorFlow with Virtualenv:
Virtualenv environment:
<pre>(tensorflow)$ <b>pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp34-cp34m-linux_x86_64.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common_installation_problems).
@@ -277,7 +270,7 @@ take the following steps:
<pre>
$ <b>sudo pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl</b>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp34-cp34m-linux_x86_64.whl</b>
</pre>
If this step fails, see
@@ -340,24 +333,23 @@ where:
to 6006.
* <tt><i>TensorFlowCPUImage</i></tt> is required. It identifies the Docker
container. Specify one of the following values:
- * <tt>gcr.io/tensorflow/tensorflow</tt>, which is the TensorFlow CPU binary image.
- * <tt>gcr.io/tensorflow/tensorflow:latest-devel</tt>, which is the latest
+ * <tt>tensorflow/tensorflow</tt>, which is the TensorFlow CPU binary image.
+ * <tt>tensorflow/tensorflow:latest-devel</tt>, which is the latest
TensorFlow CPU Binary image plus source code.
- * <tt>gcr.io/tensorflow/tensorflow:<i>version</i></tt>, which is the
+ * <tt>tensorflow/tensorflow:<i>version</i></tt>, which is the
specified version (for example, 1.1.0rc1) of TensorFlow CPU binary image.
- * <tt>gcr.io/tensorflow/tensorflow:<i>version</i>-devel</tt>, which is
+ * <tt>tensorflow/tensorflow:<i>version</i>-devel</tt>, which is
the specified version (for example, 1.1.0rc1) of the TensorFlow GPU
binary image plus source code.
- <tt>gcr.io</tt> is the Google Container Registry. Note that some
- TensorFlow images are also available at
+ TensorFlow images are available at
[dockerhub](https://hub.docker.com/r/tensorflow/tensorflow/).
For example, the following command launches the latest TensorFlow CPU binary image
in a Docker container from which you can run TensorFlow programs in a shell:
<pre>
-$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b>
+$ <b>docker run -it tensorflow/tensorflow bash</b>
</pre>
The following command also launches the latest TensorFlow CPU binary image in a
@@ -365,7 +357,7 @@ Docker container. However, in this Docker container, you can run TensorFlow
programs in a Jupyter notebook:
<pre>
-$ <b>docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow</b>
+$ <b>docker run -it -p 8888:8888 tensorflow/tensorflow</b>
</pre>
Docker will download the TensorFlow binary image the first time you launch it.
@@ -389,14 +381,14 @@ where:
<tt><i>hostPort</i></tt> and <code><em>containerPort</em></code> to `8888`.
* <i>TensorFlowGPUImage</i> specifies the Docker container. You must
specify one of the following values:
- * <tt>gcr.io/tensorflow/tensorflow:latest-gpu</tt>, which is the latest
+ * <tt>tensorflow/tensorflow:latest-gpu</tt>, which is the latest
TensorFlow GPU binary image.
- * <tt>gcr.io/tensorflow/tensorflow:latest-devel-gpu</tt>, which is
+ * <tt>tensorflow/tensorflow:latest-devel-gpu</tt>, which is
the latest TensorFlow GPU Binary image plus source code.
- * <tt>gcr.io/tensorflow/tensorflow:<i>version</i>-gpu</tt>, which is the
+ * <tt>tensorflow/tensorflow:<i>version</i>-gpu</tt>, which is the
specified version (for example, 0.12.1) of the TensorFlow GPU
binary image.
- * <tt>gcr.io/tensorflow/tensorflow:<i>version</i>-devel-gpu</tt>, which is
+ * <tt>tensorflow/tensorflow:<i>version</i>-devel-gpu</tt>, which is
the specified version (for example, 0.12.1) of the TensorFlow GPU
binary image plus source code.
@@ -405,7 +397,7 @@ following command launches the latest TensorFlow GPU binary image in a
Docker container from which you can run TensorFlow programs in a shell:
<pre>
-$ <b>nvidia-docker run -it gcr.io/tensorflow/tensorflow:latest-gpu bash</b>
+$ <b>nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash</b>
</pre>
The following command also launches the latest TensorFlow GPU binary image
@@ -413,13 +405,13 @@ in a Docker container. In this Docker container, you can run TensorFlow
programs in a Jupyter notebook:
<pre>
-$ <b>nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu</b>
+$ <b>nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu</b>
</pre>
The following command installs an older TensorFlow version (0.12.1):
<pre>
-$ <b>nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:0.12.1-gpu</b>
+$ <b>nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:0.12.1-gpu</b>
</pre>
Docker will download the TensorFlow binary image the first time you launch it.
@@ -464,7 +456,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
<pre>
(tensorflow)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp34-cp34m-linux_x86_64.whl</b></pre>
<a name="ValidateYourInstallation"></a>
## Validate your installation
@@ -489,7 +481,7 @@ If you installed through Docker, start a Docker container
from which you can run bash. For example:
<pre>
-$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b>
+$ <b>docker run -it tensorflow/tensorflow bash</b>
</pre>
@@ -631,14 +623,14 @@ This section documents the relevant values for Linux installations.
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp27-none-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0-cp27-none-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -650,14 +642,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp34-cp34m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0-cp34-cp34m-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -669,14 +661,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp35-cp35m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0-cp35-cp35m-linux_x86_64.whl
</pre>
@@ -688,14 +680,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp36-cp36m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0rc1-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.6.0-cp36-cp36m-linux_x86_64.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index 99745fcc6d..7207cb4f2b 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -119,7 +119,7 @@ Take the following steps to install TensorFlow with Virtualenv:
TensorFlow in the active Virtualenv is as follows:
<pre> $ <b>pip install --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0-py2-none-any.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common-installation-problems).
@@ -242,7 +242,7 @@ take the following steps:
issue the following command:
<pre> $ <b>sudo pip install --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b> </pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0-py2-none-any.whl</b> </pre>
If the preceding command fails, see
[installation problems](#common-installation-problems).
@@ -292,24 +292,23 @@ where:
to 6006.
* <i>TensorFlowImage</i> is required. It identifies the Docker container.
You must specify one of the following values:
- * <code>gcr.io/tensorflow/tensorflow</code>: TensorFlow binary image.
- * <code>gcr.io/tensorflow/tensorflow:latest-devel</code>: TensorFlow
+ * <code>tensorflow/tensorflow</code>: TensorFlow binary image.
+ * <code>tensorflow/tensorflow:latest-devel</code>: TensorFlow
Binary image plus source code.
-<code>gcr.io</code> is the Google Container Registry. Note that some
-TensorFlow images are also available at
+The TensorFlow images are available at
[dockerhub](https://hub.docker.com/r/tensorflow/tensorflow/).
For example, the following command launches a TensorFlow CPU binary image
in a Docker container from which you can run TensorFlow programs in a shell:
-<pre>$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b></pre>
+<pre>$ <b>docker run -it tensorflow/tensorflow bash</b></pre>
The following command also launches a TensorFlow CPU binary image in a
Docker container. However, in this Docker container, you can run
TensorFlow programs in a Jupyter notebook:
-<pre>$ <b>docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow</b></pre>
+<pre>$ <b>docker run -it -p 8888:8888 tensorflow/tensorflow</b></pre>
Docker will download the TensorFlow binary image the first time you launch it.
@@ -351,7 +350,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
TensorFlow for Python 2.7:
<pre> (<i>targetDirectory</i>)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0-py2-none-any.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@@ -376,7 +375,7 @@ do the following:
If you installed through Docker, start a Docker container that runs bash.
For example:
-<pre>$ <b>docker run -it gcr.io/tensorflow/tensorflow bash</b></pre>
+<pre>$ <b>docker run -it tensorflow/tensorflow bash</b></pre>
@@ -519,7 +518,7 @@ The value you specify depends on your Python version.
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0-py2-none-any.whl
</pre>
@@ -527,5 +526,5 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-a
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py3-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0-py3-none-any.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index 4e7b07d78b..a7f33819b4 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -350,10 +350,10 @@ Invoke `pip install` to install that pip package.
The filename of the `.whl` file depends on your platform.
For example, the following command will install the pip package
-for TensorFlow 1.6.0rc1 on Linux:
+for TensorFlow 1.6.0 on Linux:
<pre>
-$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.6.0rc1-py2-none-any.whl</b>
+$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.6.0-py2-none-any.whl</b>
</pre>
## Validate your installation
diff --git a/tensorflow/docs_src/install/install_windows.md b/tensorflow/docs_src/install/install_windows.md
index f0a30ee394..2413bc9cfb 100644
--- a/tensorflow/docs_src/install/install_windows.md
+++ b/tensorflow/docs_src/install/install_windows.md
@@ -17,7 +17,7 @@ You must choose one of the following types of TensorFlow to install:
NVIDIA® GPU, you must install this version. Note that this version of
TensorFlow is typically much easier to install (typically,
in 5 or 10 minutes), so even if you have an NVIDIA GPU, we recommend
- installing this version first.
+ installing this version first. Prebuilt binaries will use AVX instructions.
* **TensorFlow with GPU support**. TensorFlow programs typically run
significantly faster on a GPU than on a CPU. Therefore, if your
system has a NVIDIA® GPU meeting the prerequisites shown below
@@ -41,7 +41,8 @@ installed on your system:
Note that cuDNN is typically installed in a different location from the
other CUDA DLLs. Ensure that you add the directory where you installed
the cuDNN DLL to your `%PATH%` environment variable.
- * GPU card with CUDA Compute Capability 3.0 or higher. See
+  * A GPU card with CUDA Compute Capability 3.0 or higher for building
+    from source, and 3.5 or higher for our binaries. See
[NVIDIA documentation](https://developer.nvidia.com/cuda-gpus) for a
list of supported GPU cards.
diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md
index cd47fc2803..580a899ac4 100644
--- a/tensorflow/docs_src/performance/performance_guide.md
+++ b/tensorflow/docs_src/performance/performance_guide.md
@@ -78,7 +78,7 @@ training CIFAR-10 illustrates the use of the `tf.data` API along with
The `tf.data` API utilizes C++ multi-threading and has a much lower overhead
than the Python-based `queue_runner` that is limited by Python's multi-threading
performance. A detailed performance guide for the `tf.data` API can be found
-[here](#datasets_performance).
+[here](@{$datasets_performance}).
While feeding data using a `feed_dict` offers a high level of flexibility, in
general `feed_dict` does not provide a scalable solution. If only a single GPU
diff --git a/tensorflow/docs_src/performance/xla/jit.md b/tensorflow/docs_src/performance/xla/jit.md
index d4dc3e57c8..d9a979ccbd 100644
--- a/tensorflow/docs_src/performance/xla/jit.md
+++ b/tensorflow/docs_src/performance/xla/jit.md
@@ -157,7 +157,7 @@ to fuse Ops is visible by starting at `hlo_graph_0.dot` and viewing each diagram
in succession.
To Render the .dot file into a png, install
-[GraphViz](http://www.graphviz.org/Download..php) and run:
+[GraphViz](https://www.graphviz.org/download/) and run:
```shell
dot -Tpng hlo_graph_80.dot -o hlo_graph_80.png
diff --git a/tensorflow/docs_src/programmers_guide/debugger.md b/tensorflow/docs_src/programmers_guide/debugger.md
index 5fb1c2da88..d1399814ee 100644
--- a/tensorflow/docs_src/programmers_guide/debugger.md
+++ b/tensorflow/docs_src/programmers_guide/debugger.md
@@ -459,7 +459,7 @@ accuracy_score = classifier.evaluate(x=test_set.data,
[debug_tflearn_iris.py](https://www.tensorflow.org/code/tensorflow/python/debug/examples/debug_tflearn_iris.py),
-based on {$tflearn$tf-learn's iris tutorial}, contains a full example of how to
+based on [tf-learn's iris tutorial](https://www.tensorflow.org/versions/r1.2/get_started/tflearn), contains a full example of how to
use the tfdbg with `Estimator`s. To run this example, do:
```none
@@ -753,6 +753,7 @@ There are three possible workarounds or solutions:
# For LocalCLIDebugHook
hooks = [tf_debug.LocalCLIDebugHook(dump_root="/with/lots/of/space")]
```
+
Make sure that the directory pointed to by dump_root is empty or nonexistent.
tfdbg cleans up the dump directories before exiting.
* Reduce the batch size used during the runs.
diff --git a/tensorflow/docs_src/programmers_guide/faq.md b/tensorflow/docs_src/programmers_guide/faq.md
index 1548d43877..392ac6f7f1 100644
--- a/tensorflow/docs_src/programmers_guide/faq.md
+++ b/tensorflow/docs_src/programmers_guide/faq.md
@@ -159,7 +159,7 @@ available. These operations allow you to build sophisticated
@{$reading_data$input pipelines}, at the cost of making the
TensorFlow computation somewhat more complicated. See the how-to documentation
for
-@{$reading_data#creating-threads-to-prefetch-using-queuerunner-objects$using `QueueRunner` objects to drive queues and readers}
+@{$reading_data#creating_threads_to_prefetch_using_queuerunner_objects$using `QueueRunner` objects to drive queues and readers}
for more information on how to use them.
## Variables
@@ -272,7 +272,7 @@ Prefer predefined TensorFlow operations such as @{tf.decode_raw},
If your data is not easily parsable with the built-in TensorFlow operations,
consider converting it, offline, to a format that is easily parsable, such
-as ${tf.python_io.TFRecordWriter$`TFRecord`} format.
+as @{tf.python_io.TFRecordWriter$`TFRecord`} format.
The more efficient method to customize the parsing behavior is to
@{$adding_an_op$add a new op written in C++} that parses your
diff --git a/tensorflow/docs_src/programmers_guide/saved_model.md b/tensorflow/docs_src/programmers_guide/saved_model.md
index d01d187e86..55ee42dd64 100644
--- a/tensorflow/docs_src/programmers_guide/saved_model.md
+++ b/tensorflow/docs_src/programmers_guide/saved_model.md
@@ -3,7 +3,7 @@
The @{tf.train.Saver} class provides methods to save and restore models. The
@{tf.saved_model.simple_save} function is an easy way to build a
@{tf.saved_model$saved model} suitable for serving.
-[Estimators](/programmers_guide/estimators) automatically save and restore
+[Estimators](@{$programmers_guide/estimators}) automatically save and restore
variables in the `model_dir`.
## Save and restore variables
@@ -400,7 +400,7 @@ defined in:
After training an `Estimator` model, you may want to create a service
from that model that takes requests and returns a result. You can run such a
-service locally on your machine or deploy it scalably in the cloud.
+service locally on your machine or deploy it in the cloud.
To prepare a trained Estimator for serving, you must export it in the standard
SavedModel format. This section explains how to:
diff --git a/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md b/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md
index 79280d246a..fadfa03e78 100644
--- a/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md
+++ b/tensorflow/docs_src/programmers_guide/summaries_and_tensorboard.md
@@ -83,7 +83,7 @@ data than you need, though. Instead, consider running the merged summary op
every `n` steps.
The code example below is a modification of the
-@{$layers$simple MNIST tutorial},
+[simple MNIST tutorial](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist.py),
in which we have added some summary ops, and run them every ten steps. If you
run this and then launch `tensorboard --logdir=/tmp/tensorflow/mnist`, you'll be able
to visualize statistics, such as how the weights or accuracy varied during
diff --git a/tensorflow/docs_src/programmers_guide/using_tpu.md b/tensorflow/docs_src/programmers_guide/using_tpu.md
index d74d7f3181..a9c2cb3e33 100644
--- a/tensorflow/docs_src/programmers_guide/using_tpu.md
+++ b/tensorflow/docs_src/programmers_guide/using_tpu.md
@@ -129,10 +129,9 @@ my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
Typically the `FLAGS` would be set by command line arguments. To switch from
training locally to training on a cloud TPU you would need to:
- 1) Set `FLAGS.use_tpu` to `True`
- 1) Set `FLAGS.tpu_name` so the
- `tf.contrib.cluster_resolver.TPUClusterResolver` can find it
- 1) Set `FLAGS.model_dir` to a Google Cloud Storage bucket url (`gs://`).
+* Set `FLAGS.use_tpu` to `True`
+* Set `FLAGS.tpu_name` so the `tf.contrib.cluster_resolver.TPUClusterResolver` can find it
+* Set `FLAGS.model_dir` to a Google Cloud Storage bucket url (`gs://`).
## Optimizer
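As a minimal illustration of the three settings listed above (not part of the patch), the flags could be wired up roughly as follows; the flag names mirror the docs, while the argparse plumbing and defaults are assumptions.

```python
# Hypothetical sketch of the three flags described above; names mirror the
# docs, the argparse wiring and defaults are assumptions for illustration.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--use_tpu', action='store_true',
                    help='Train on a Cloud TPU instead of locally.')
parser.add_argument('--tpu_name', default=None,
                    help='Passed to tf.contrib.cluster_resolver.TPUClusterResolver.')
parser.add_argument('--model_dir', default='/tmp/model',
                    help='Use a gs:// bucket URL when training on a Cloud TPU.')
FLAGS = parser.parse_args()
```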
diff --git a/tensorflow/docs_src/tutorials/deep_cnn.md b/tensorflow/docs_src/tutorials/deep_cnn.md
index 3692a02f2e..6361079671 100644
--- a/tensorflow/docs_src/tutorials/deep_cnn.md
+++ b/tensorflow/docs_src/tutorials/deep_cnn.md
@@ -268,7 +268,7 @@ in `cifar10_input.py`.
`cifar10_train.py` periodically @{tf.train.Saver$saves}
all model parameters in
-@{$variables#saving-and-restoring$checkpoint files}
+@{$programmers_guide/saved_model$checkpoint files}
but it does *not* evaluate the model. The checkpoint file
will be used by `cifar10_eval.py` to measure the predictive
performance (see [Evaluating a Model](#evaluating-a-model) below).
diff --git a/tensorflow/docs_src/tutorials/image_retraining.md b/tensorflow/docs_src/tutorials/image_retraining.md
index 246a420400..93d7c86e42 100644
--- a/tensorflow/docs_src/tutorials/image_retraining.md
+++ b/tensorflow/docs_src/tutorials/image_retraining.md
@@ -115,7 +115,7 @@ process is progressing. The training's objective is to make the loss as small as
possible, so you can tell if the learning is working by keeping an eye on
whether the loss keeps trending downwards, ignoring the short-term noise.
-By default this script will run 4,000 training steps. Each step chooses ten
+By default this script will run 4,000 training steps. Each step chooses 100
images at random from the training set, finds their bottlenecks from the cache,
and feeds them into the final layer to get predictions. Those predictions are
then compared against the actual labels to update the final layer's weights
diff --git a/tensorflow/docs_src/tutorials/kernel_methods.md b/tensorflow/docs_src/tutorials/kernel_methods.md
index 63f408c2ca..b1f06ce0a3 100644
--- a/tensorflow/docs_src/tutorials/kernel_methods.md
+++ b/tensorflow/docs_src/tutorials/kernel_methods.md
@@ -1,9 +1,9 @@
# Improving Linear Models Using Explicit Kernel Methods
-Note: This document uses a deprecated version of ${tf.estimator},
-which has a ${tf.contrib.learn.estimator$different interface}.
+Note: This document uses a deprecated version of @{tf.estimator},
+which has a different interface (see `tf.contrib.learn Estimator`).
It also uses other `contrib` methods whose
-${$version_compat#not_covered$API may not be stable}.
+@{$version_compat#not_covered$API may not be stable}.
In this tutorial, we demonstrate how combining (explicit) kernel methods with
linear models can drastically increase the latters' quality of predictions
diff --git a/tensorflow/docs_src/tutorials/layers.md b/tensorflow/docs_src/tutorials/layers.md
index ee03f440c9..9b17d0d4d5 100644
--- a/tensorflow/docs_src/tutorials/layers.md
+++ b/tensorflow/docs_src/tutorials/layers.md
@@ -193,7 +193,7 @@ to calculate loss, configure the training op, and generate predictions. If
you're already experienced with CNNs and @{$get_started/custom_estimators$TensorFlow `Estimator`s},
and find the above code intuitive, you may want to skim these sections or just
skip ahead to ["Training and Evaluating the CNN MNIST
-Classifier"](#training-and-evaluating-the-cnn-mnist-classifier).
+Classifier"](#training_and_evaluating_the_cnn_mnist_classifier).
### Input Layer
@@ -446,7 +446,7 @@ tf.nn.softmax(logits, name="softmax_tensor")
> Note: We use the `name` argument to explicitly name this operation
> `softmax_tensor`, so we can reference it later. (We'll set up logging for the
-> softmax values in ["Set Up a Logging Hook"](#set-up-a-logging-hook).
+> softmax values in ["Set Up a Logging Hook"](#set-up-a-logging-hook)).
We compile our predictions in a dict, and return an `EstimatorSpec` object:
@@ -534,9 +534,8 @@ if mode == tf.estimator.ModeKeys.TRAIN:
```
> Note: For a more in-depth look at configuring training ops for Estimator model
-> functions, see @{$get_started/custom_estimators#defining-the-training-op-for-the-model$"Defining
-> the training op for the model"} in the @{$get_started/custom_estimators$"Creating Estimations in
-> tf.estimator"} tutorial.
+> functions, see @{$get_started/custom_estimators#defining_the_training_op_for_the_model$"Defining the training op for the model"}
+> in the @{$get_started/custom_estimators$"Creating Estimators in tf.estimator."} tutorial.
### Add evaluation metrics
@@ -625,7 +624,8 @@ operation earlier when we generated the probabilities in `cnn_model_fn`.
> Note: If you don't explicitly assign a name to an operation via the `name`
> argument, TensorFlow will assign a default name. A couple easy ways to
> discover the names applied to operations are to visualize your graph on
-> @{$graph_viz$TensorBoard}) or to enable the @{$debugger$TensorFlow Debugger (tfdbg)}.
+> @{$graph_viz$TensorBoard}) or to enable the
+> @{$programmers_guide/debugger$TensorFlow Debugger (tfdbg)}.
Next, we create the `LoggingTensorHook`, passing `tensors_to_log` to the
`tensors` argument. We set `every_n_iter=50`, which specifies that probabilities
diff --git a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md b/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
index e22536adb6..7584a76ba5 100644
--- a/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
+++ b/tensorflow/docs_src/tutorials/recurrent_quickdraw.md
@@ -109,7 +109,8 @@ This download will take a while and download a bit more than 23GB of data.
To convert the `ndjson` files to
@{$python/python_io#tfrecords_format_details$TFRecord} files containing
-${tf.train.Example} protos run the following command.
+[`tf.train.Example`](https://www.tensorflow.org/code/tensorflow/core/example/example.proto)
+protos run the following command.
```shell
python create_dataset.py --ndjson_path rnn_tutorial_data \
diff --git a/tensorflow/docs_src/tutorials/wide.md b/tensorflow/docs_src/tutorials/wide.md
index 005dc020f9..27ce75a30d 100644
--- a/tensorflow/docs_src/tutorials/wide.md
+++ b/tensorflow/docs_src/tutorials/wide.md
@@ -74,8 +74,8 @@ Here's a list of columns available in the Census Income dataset:
| relationship | Categorical | Wife, Own-child, Husband, |
: : : Not-in-family, Other-relative, :
: : : Unmarried. :
-| race | Categorical | White, Asian-Pac-Islander, |
-: : : Amer-Indian-Eskimo, Other, Black. :
+| race | Categorical | Amer-Indian-Eskimo, Asian-Pac- |
+: : : Islander, Black, White, Other. :
| gender | Categorical | Female, Male. |
| capital_gain | Continuous | Capital gains recorded. |
| capital_loss | Continuous | Capital Losses recorded. |
@@ -247,7 +247,7 @@ hours_per_week = tf.feature_column.numeric_column('hours_per_week')
### Making Continuous Features Categorical through Bucketization
Sometimes the relationship between a continuous feature and the label is not
-linear. As an hypothetical example, a person's income may grow with age in the
+linear. As a hypothetical example, a person's income may grow with age in the
early stage of one's career, then the growth may slow at some point, and finally
the income decreases after retirement. In this scenario, using the raw `age` as
a real-valued feature column might not be a good choice because the model can
@@ -361,6 +361,16 @@ The first line of the final output should be something like
`accuracy: 0.83557522`, which means the accuracy is 83.6%. Feel free to try more
features and transformations and see if you can do even better!
+After the model is evaluated, we can use it to predict whether an individual has an annual income of over
+50,000 dollars given that individual's information.
+```python
+ pred_iter = model.predict(input_fn=lambda: input_fn(FLAGS.test_data, 1, False, 1))
+ for pred in pred_iter:
+ print(pred['classes'])
+```
+
+The model prediction output will look like `[b'1']` or `[b'0']`, indicating whether the corresponding individual has an annual income of over 50,000 dollars.
+
If you'd like to see a working end-to-end example, you can download our
[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/wide_deep.py)
and set the `model_type` flag to `wide`.
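For the bucketization hunk above, a minimal sketch of turning the continuous `age` column into a categorical one; the boundary values are illustrative, not taken from the tutorial.

```python
# Minimal sketch of the bucketization discussed above; boundary values are
# illustrative only.
import tensorflow as tf

age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
```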
diff --git a/tensorflow/examples/android/AndroidManifest.xml b/tensorflow/examples/android/AndroidManifest.xml
index bb75431a1f..5c47ce6b67 100644
--- a/tensorflow/examples/android/AndroidManifest.xml
+++ b/tensorflow/examples/android/AndroidManifest.xml
@@ -40,6 +40,7 @@
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
+ <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
</intent-filter>
</activity>
@@ -49,6 +50,7 @@
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
+ <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
</intent-filter>
</activity>
@@ -58,6 +60,7 @@
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
+ <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
</intent-filter>
</activity>
@@ -67,6 +70,7 @@
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
+ <category android:name="android.intent.category.LEANBACK_LAUNCHER" />
</intent-filter>
</activity>
</application>
diff --git a/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java b/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java
index 8bd4abb154..429138abe5 100644
--- a/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java
+++ b/tensorflow/examples/android/src/org/tensorflow/demo/CameraActivity.java
@@ -351,6 +351,10 @@ public abstract class CameraActivity extends Activity
protected void setFragment() {
String cameraId = chooseCamera();
+ if (cameraId == null) {
+ Toast.makeText(this, "No Camera Detected", Toast.LENGTH_SHORT).show();
+ finish();
+ }
Fragment fragment;
if (useCamera2API) {
@@ -416,7 +420,8 @@ public abstract class CameraActivity extends Activity
@Override
public boolean onKeyDown(final int keyCode, final KeyEvent event) {
- if (keyCode == KeyEvent.KEYCODE_VOLUME_DOWN || keyCode == KeyEvent.KEYCODE_VOLUME_UP) {
+ if (keyCode == KeyEvent.KEYCODE_VOLUME_DOWN || keyCode == KeyEvent.KEYCODE_VOLUME_UP
+ || keyCode == KeyEvent.KEYCODE_BUTTON_L1 || keyCode == KeyEvent.KEYCODE_DPAD_CENTER) {
debug = !debug;
requestRender();
onSetDebug(debug);
diff --git a/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java b/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java
index 6a66ec3927..33ec65e9f7 100644
--- a/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java
+++ b/tensorflow/examples/android/src/org/tensorflow/demo/StylizeActivity.java
@@ -16,8 +16,10 @@
package org.tensorflow.demo;
+import android.app.UiModeManager;
import android.content.Context;
import android.content.res.AssetManager;
+import android.content.res.Configuration;
import android.graphics.Bitmap;
import android.graphics.Bitmap.Config;
import android.graphics.BitmapFactory;
@@ -31,9 +33,11 @@ import android.graphics.Typeface;
import android.media.ImageReader.OnImageAvailableListener;
import android.os.Bundle;
import android.os.SystemClock;
+import android.util.DisplayMetrics;
import android.util.Size;
import android.util.TypedValue;
import android.view.Display;
+import android.view.KeyEvent;
import android.view.MotionEvent;
import android.view.View;
import android.view.View.OnClickListener;
@@ -43,6 +47,7 @@ import android.widget.BaseAdapter;
import android.widget.Button;
import android.widget.GridView;
import android.widget.ImageView;
+import android.widget.RelativeLayout;
import android.widget.Toast;
import java.io.IOException;
import java.io.InputStream;
@@ -381,6 +386,27 @@ public class StylizeActivity extends CameraActivity implements OnImageAvailableL
grid = (GridView) findViewById(R.id.grid_layout);
grid.setAdapter(adapter);
grid.setOnTouchListener(gridTouchAdapter);
+
+ // Change UI on Android TV
+ UiModeManager uiModeManager = (UiModeManager) getSystemService(UI_MODE_SERVICE);
+ if (uiModeManager.getCurrentModeType() == Configuration.UI_MODE_TYPE_TELEVISION) {
+ DisplayMetrics displayMetrics = new DisplayMetrics();
+ getWindowManager().getDefaultDisplay().getMetrics(displayMetrics);
+ int styleSelectorHeight = displayMetrics.heightPixels;
+ int styleSelectorWidth = displayMetrics.widthPixels - styleSelectorHeight;
+ RelativeLayout.LayoutParams layoutParams = new RelativeLayout.LayoutParams(styleSelectorWidth, ViewGroup.LayoutParams.MATCH_PARENT);
+
+ // Calculate number of style in a row, so all the style can show up without scrolling
+ int numOfStylePerRow = 3;
+ while (styleSelectorWidth / numOfStylePerRow * Math.ceil((float) (adapter.getCount() - 2) / numOfStylePerRow) > styleSelectorHeight) {
+ numOfStylePerRow++;
+ }
+ grid.setNumColumns(numOfStylePerRow);
+ layoutParams.addRule(RelativeLayout.ALIGN_PARENT_RIGHT);
+ grid.setLayoutParams(layoutParams);
+ adapter.buttons.clear();
+ }
+
setStyle(adapter.items[0], 1.0f);
}
@@ -602,4 +628,38 @@ public class StylizeActivity extends CameraActivity implements OnImageAvailableL
borderedText.drawLines(canvas, 10, canvas.getHeight() - 10, lines);
}
+
+ @Override
+ public boolean onKeyDown(int keyCode, KeyEvent event) {
+ int moveOffset = 0;
+ switch (keyCode) {
+ case KeyEvent.KEYCODE_DPAD_LEFT:
+ moveOffset = -1;
+ break;
+ case KeyEvent.KEYCODE_DPAD_RIGHT:
+ moveOffset = 1;
+ break;
+ case KeyEvent.KEYCODE_DPAD_UP:
+ moveOffset = -1 * grid.getNumColumns();
+ break;
+ case KeyEvent.KEYCODE_DPAD_DOWN:
+ moveOffset = grid.getNumColumns();
+ break;
+ default:
+ return super.onKeyDown(keyCode, event);
+ }
+
+ // get the highest selected style
+ int currentSelect = 0;
+ float highestValue = 0;
+ for (int i = 0; i < adapter.getCount(); i++) {
+ if (adapter.items[i].value > highestValue) {
+ currentSelect = i;
+ highestValue = adapter.items[i].value;
+ }
+ }
+ setStyle(adapter.items[(currentSelect + moveOffset + adapter.getCount()) % adapter.getCount()], 1);
+
+ return true;
+ }
}
diff --git a/tensorflow/examples/ios/README.md b/tensorflow/examples/ios/README.md
index 5bdaeb43ce..5d7bd36837 100644
--- a/tensorflow/examples/ios/README.md
+++ b/tensorflow/examples/ios/README.md
@@ -119,11 +119,13 @@ rundown:
`tensorflow/contrib/makefile/gen/lib` to the Library Search Paths setting.
- You'll also need to add `libprotobuf.a` and `libprotobuf-lite.a` from
- `tensorflow/contrib/makefile/gen/protobuf_ios/lib` to your _Build Stages_ and
- _Library Search Paths_.
+ `tensorflow/contrib/makefile/gen/protobuf_ios/lib`
+ and `nsync.a` from `tensorflow/contrib/makefile/downloads/nsync/builds/lipo.ios.c++11`
+ to your _Build Stages_ and _Library Search Paths_.
- The _Header Search_ paths needs to contain:
- the root folder of tensorflow,
+ - `tensorflow/contrib/makefile/downloads/nsync/public`
- `tensorflow/contrib/makefile/downloads/protobuf/src`
- `tensorflow/contrib/makefile/downloads`,
- `tensorflow/contrib/makefile/downloads/eigen`, and
diff --git a/tensorflow/examples/learn/mnist.py b/tensorflow/examples/learn/mnist.py
index 98819b20bf..3ead8614b6 100644
--- a/tensorflow/examples/learn/mnist.py
+++ b/tensorflow/examples/learn/mnist.py
@@ -61,8 +61,10 @@ def conv_model(features, labels, mode):
# Densely connected layer with 1024 neurons.
h_fc1 = tf.layers.dense(h_pool2_flat, 1024, activation=tf.nn.relu)
- if mode == tf.estimator.ModeKeys.TRAIN:
- h_fc1 = tf.layers.dropout(h_fc1, rate=0.5)
+ h_fc1 = tf.layers.dropout(
+ h_fc1,
+ rate=0.5,
+ training=(mode == tf.estimator.ModeKeys.TRAIN))
# Compute logits (1 per class) and compute loss.
logits = tf.layers.dense(h_fc1, N_DIGITS, activation=None)
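The hunk above replaces the mode check with the `training` argument, so the dropout layer is always in the graph and only its behavior changes per mode. A stand-alone sketch of that idiom (layer sizes are placeholders):

```python
# Stand-alone sketch of the idiom introduced above: dropout stays in the
# graph and the `training` argument toggles it per mode. Sizes are placeholders.
import tensorflow as tf

def dense_head(features, mode):
  h = tf.layers.dense(features, 1024, activation=tf.nn.relu)
  h = tf.layers.dropout(
      h, rate=0.5, training=(mode == tf.estimator.ModeKeys.TRAIN))
  return tf.layers.dense(h, 10, activation=None)
```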
diff --git a/tensorflow/examples/learn/resnet.py b/tensorflow/examples/learn/resnet.py
index 9542e55250..c00de932a8 100755
--- a/tensorflow/examples/learn/resnet.py
+++ b/tensorflow/examples/learn/resnet.py
@@ -53,6 +53,8 @@ def res_net_model(features, labels, mode):
ndim = int(sqrt(input_shape[1]))
x = tf.reshape(x, [-1, ndim, ndim, 1])
+ training = (mode == tf.estimator.ModeKeys.TRAIN)
+
# First convolution expands to 64 channels
with tf.variable_scope('conv_layer1'):
net = tf.layers.conv2d(
@@ -60,7 +62,7 @@ def res_net_model(features, labels, mode):
filters=64,
kernel_size=7,
activation=tf.nn.relu)
- net = tf.layers.batch_normalization(net)
+ net = tf.layers.batch_normalization(net, training=training)
# Max pool
net = tf.layers.max_pooling2d(
@@ -88,7 +90,7 @@ def res_net_model(features, labels, mode):
kernel_size=1,
padding='valid',
activation=tf.nn.relu)
- conv = tf.layers.batch_normalization(conv)
+ conv = tf.layers.batch_normalization(conv, training=training)
with tf.variable_scope(name + '/conv_bottleneck'):
conv = tf.layers.conv2d(
@@ -97,7 +99,7 @@ def res_net_model(features, labels, mode):
kernel_size=3,
padding='same',
activation=tf.nn.relu)
- conv = tf.layers.batch_normalization(conv)
+ conv = tf.layers.batch_normalization(conv, training=training)
# 1x1 convolution responsible for restoring dimension
with tf.variable_scope(name + '/conv_out'):
@@ -108,7 +110,7 @@ def res_net_model(features, labels, mode):
kernel_size=1,
padding='valid',
activation=tf.nn.relu)
- conv = tf.layers.batch_normalization(conv)
+ conv = tf.layers.batch_normalization(conv, training=training)
# shortcut connections that turn the network into its counterpart
# residual function (identity shortcut)
@@ -154,7 +156,7 @@ def res_net_model(features, labels, mode):
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
# Create training op.
- if mode == tf.estimator.ModeKeys.TRAIN:
+ if training:
optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index d11ee6f74c..54e944c264 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -86,7 +86,6 @@ py_library(
":ops",
":platform",
":pywrap_tensorflow",
- ":saver_test_utils",
":script_ops",
":session_ops",
":sets",
@@ -96,14 +95,15 @@ py_library(
":standard_ops",
":state_ops",
":string_ops",
- ":subscribe",
":summary",
":tensor_array_ops",
+ ":training",
+ ":saver_test_utils",
+ ":subscribe",
":test_ops", # TODO: Break testing code out into separate rule.
- ":tf_cluster",
":tf_item",
+ ":tf_cluster",
":tf_optimizer",
- ":training",
":util",
":weights_broadcast_ops",
"//third_party/py/numpy",
@@ -3971,7 +3971,11 @@ py_test(
srcs = ["training/checkpoint_utils_test.py"],
srcs_version = "PY2AND3",
tags = [
+ "manual",
+ "no_cuda_on_cpu_tap",
+ "no_oss",
"no_windows",
+ "notap",
],
deps = [
":client",
diff --git a/tensorflow/python/client/timeline_test.py b/tensorflow/python/client/timeline_test.py
index 9641b8b7f2..5e6b5acdb0 100644
--- a/tensorflow/python/client/timeline_test.py
+++ b/tensorflow/python/client/timeline_test.py
@@ -155,9 +155,12 @@ class TimelineTest(test.TestCase):
ctf = step_analysis.chrome_trace.format_to_string()
self._validateTrace(ctf)
maximums = step_analysis.allocator_maximums
- self.assertTrue('cpu' in maximums)
+ cpuname = 'cpu'
+ if 'mklcpu' in maximums:
+ cpuname = 'mkl' + cpuname
+ self.assertTrue(cpuname in maximums)
cpu_max = maximums[
- 'cuda_host_bfc'] if 'cuda_host_bfc' in maximums else maximums['cpu']
+ 'cuda_host_bfc'] if 'cuda_host_bfc' in maximums else maximums[cpuname]
# At least num1 + num2, both float32s (4 bytes each)
self.assertGreater(cpu_max.num_bytes, 8)
self.assertGreater(cpu_max.timestamp, 0)
diff --git a/tensorflow/python/estimator/estimator.py b/tensorflow/python/estimator/estimator.py
index 9fcbd4ff77..6a4132bca2 100644
--- a/tensorflow/python/estimator/estimator.py
+++ b/tensorflow/python/estimator/estimator.py
@@ -139,8 +139,8 @@ class Estimator(object):
to configure Estimators from hyper parameter tuning.
* `config`: Optional configuration object. Will receive what is passed
to Estimator in `config` parameter, or the default `config`.
- Allows updating things in your model_fn based on configuration
- such as `num_ps_replicas`, or `model_dir`.
+ Allows updating things in your `model_fn` based on
+ configuration such as `num_ps_replicas`, or `model_dir`.
* Returns:
`EstimatorSpec`
@@ -301,11 +301,11 @@ class Estimator(object):
* A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
tuple (features, labels) with same constraints as below.
- * A tuple (features, labels): Where features is a `Tensor` or a
- dictionary of string feature name to `Tensor` and labels is a
+ * A tuple (features, labels): Where `features` is a `Tensor` or a
+ dictionary of string feature name to `Tensor` and `labels` is a
`Tensor` or a dictionary of string label name to `Tensor`. Both
- features and labels are consumed by `model_fn`. They should satisfy
- the expectation of `model_fn` from inputs.
+ `features` and `labels` are consumed by `model_fn`. They should
+ satisfy the expectation of `model_fn` from inputs.
hooks: List of `SessionRunHook` subclass instances. Used for callbacks
inside the training loop.
@@ -381,11 +381,11 @@ class Estimator(object):
* A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
tuple (features, labels) with same constraints as below.
- * A tuple (features, labels): Where features is a `Tensor` or a
- dictionary of string feature name to `Tensor` and labels is a
+ * A tuple (features, labels): Where `features` is a `Tensor` or a
+ dictionary of string feature name to `Tensor` and `labels` is a
`Tensor` or a dictionary of string label name to `Tensor`. Both
- features and labels are consumed by `model_fn`. They should satisfy
- the expectation of `model_fn` from inputs.
+ `features` and `labels` are consumed by `model_fn`. They should
+ satisfy the expectation of `model_fn` from inputs.
steps: Number of steps for which to evaluate model. If `None`, evaluates
until `input_fn` raises an end-of-input exception.
@@ -457,17 +457,17 @@ class Estimator(object):
checkpoint_path: Path of a specific checkpoint to predict. If `None`, the
latest checkpoint in `model_dir` is used.
yield_single_examples: If False, yield the whole batch as returned by the
- model_fn instead of decomposing the batch into individual elements. This
- is useful if model_fn return some tensor with first dimension not
- equal to the batch size
+ `model_fn` instead of decomposing the batch into individual elements.
+ This is useful if `model_fn` returns some tensors whose first dimension
+ is not equal to the batch size.
Yields:
Evaluated values of `predictions` tensors.
Raises:
- ValueError: Could not find a trained model in model_dir.
- ValueError: if batch length of predictions are not same and
- yield_single_examples is True.
+ ValueError: Could not find a trained model in `model_dir`.
+ ValueError: If batch length of predictions is not the same and
+ `yield_single_examples` is True.
ValueError: If there is a conflict between `predict_keys` and
`predictions`. For example if `predict_keys` is not `None` but
`EstimatorSpec.predictions` is not a `dict`.
@@ -849,7 +849,7 @@ class Estimator(object):
'loss': estimator_spec.loss,
'step': global_step_tensor
},
- every_n_iter=100)
+ every_n_iter=self._config.log_step_count_steps)
])
worker_hooks.extend(estimator_spec.training_hooks)
diff --git a/tensorflow/python/estimator/run_config.py b/tensorflow/python/estimator/run_config.py
index 62f035bce5..820fda7765 100644
--- a/tensorflow/python/estimator/run_config.py
+++ b/tensorflow/python/estimator/run_config.py
@@ -423,7 +423,7 @@ class RunConfig(object):
to be saved. The default value of 10,000 hours effectively disables
the feature.
log_step_count_steps: The frequency, in number of global steps, that the
- global step/sec will be logged during training.
+ global step/sec and the loss will be logged during training.
Raises:
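With the estimator change above, the built-in loss logging frequency now follows `log_step_count_steps` instead of a hard-coded 100 steps. A hedged sketch of setting it (the `model_fn` is a placeholder):

```python
# Hedged sketch: lowering log_step_count_steps now also makes the built-in
# loss logging more frequent. The model_fn below is a placeholder.
import tensorflow as tf

config = tf.estimator.RunConfig(log_step_count_steps=50)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)
```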
diff --git a/tensorflow/python/estimator/training.py b/tensorflow/python/estimator/training.py
index 2cc3331a15..e38b765da5 100644
--- a/tensorflow/python/estimator/training.py
+++ b/tensorflow/python/estimator/training.py
@@ -128,9 +128,16 @@ class TrainSpec(
"""Creates a validated `TrainSpec` instance.
Args:
- input_fn: Training input function returning a tuple of:
- features - `Tensor` or dictionary of string feature name to `Tensor`.
- labels - `Tensor` or dictionary of `Tensor` with labels.
+ input_fn: A function that provides input data for training as minibatches.
+ See @{$get_started/premade_estimators#create_input_functions} for more
+ information. The function should construct and return one of
+ the following:
+ * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
+ tuple (features, labels) with same constraints as below.
+ * A tuple (features, labels): Where features is a `Tensor` or a
+ dictionary of string feature name to `Tensor` and labels is a
+ `Tensor` or a dictionary of string label name to `Tensor`.
+
max_steps: Int. Positive number of total steps for which to train model.
If `None`, train forever. The training `input_fn` is not expected to
generate `OutOfRangeError` or `StopIteration` exceptions. See the
@@ -185,9 +192,16 @@ class EvalSpec(
"""Creates a validated `EvalSpec` instance.
Args:
- input_fn: Evaluation input function returning a tuple of:
- features - `Tensor` or dictionary of string feature name to `Tensor`.
- labels - `Tensor` or dictionary of `Tensor` with labels.
+ input_fn: A function that constructs the input data for evaluation.
+ See @{$get_started/premade_estimators#create_input_functions} for more
+ information. The function should construct and return one of
+ the following:
+ * A 'tf.data.Dataset' object: Outputs of `Dataset` object must be a
+ tuple (features, labels) with same constraints as below.
+ * A tuple (features, labels): Where features is a `Tensor` or a
+ dictionary of string feature name to `Tensor` and labels is a
+ `Tensor` or a dictionary of string label name to `Tensor`.
+
steps: Int. Positive number of steps for which to evaluate model. If
`None`, evaluates until `input_fn` raises an end-of-input exception.
See `Estimator.evaluate` for details.
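A minimal `input_fn` of the kind the new docstrings describe, returning a `tf.data.Dataset` whose elements are `(features, labels)` tuples; the in-memory toy data is an assumption for illustration.

```python
# Minimal input_fn matching the new docstring: it returns a tf.data.Dataset
# whose elements are (features, labels) tuples. Toy data is illustrative.
import tensorflow as tf

def train_input_fn():
  features = {'x': [[1.0], [2.0], [3.0], [4.0]]}
  labels = [0, 0, 1, 1]
  dataset = tf.data.Dataset.from_tensor_slices((features, labels))
  return dataset.shuffle(4).repeat().batch(2)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
```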
diff --git a/tensorflow/python/keras/_impl/keras/engine/training.py b/tensorflow/python/keras/_impl/keras/engine/training.py
index 57506f9aff..4acb41553e 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training.py
@@ -266,7 +266,7 @@ class Model(Network):
# initialization for Eager mode execution
if context.executing_eagerly():
if target_tensors is not None:
- raise ValueError('target_tensors are not currently supported in Eager'
+ raise ValueError('target_tensors are not currently supported in Eager '
'mode.')
self.total_loss = None
self.metrics_tensors = []
diff --git a/tensorflow/python/keras/_impl/keras/layers/recurrent.py b/tensorflow/python/keras/_impl/keras/layers/recurrent.py
index 2910719807..791f9b3113 100644
--- a/tensorflow/python/keras/_impl/keras/layers/recurrent.py
+++ b/tensorflow/python/keras/_impl/keras/layers/recurrent.py
@@ -546,8 +546,8 @@ class RNN(Layer):
raise ValueError('The initial state or constants of an RNN'
' layer cannot be specified with a mix of'
' Keras tensors and non-Keras tensors'
- '(a "Keras tensor" is a tensor that was'
- 'returned by a Keras layer, or by `Input`)')
+ ' (a "Keras tensor" is a tensor that was'
+ ' returned by a Keras layer, or by `Input`)')
if is_keras_tensor:
# Compute the full input spec, including state and constants
diff --git a/tensorflow/python/keras/_impl/keras/utils/generic_utils.py b/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
index 5196bf1740..3bbe87f92d 100644
--- a/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
+++ b/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
@@ -490,8 +490,8 @@ def slice_arrays(arrays, start=None, stop=None):
if arrays is None:
return [None]
if isinstance(start, list) and stop is not None:
- raise ValueError('The stop argument has to be None if the value of start is'
- 'a list.')
+ raise ValueError('The stop argument has to be None if the value of start '
+ 'is a list.')
elif isinstance(arrays, list):
if hasattr(start, '__len__'):
# hdf5 datasets only support list objects as indices
diff --git a/tensorflow/python/keras/_impl/keras/utils/vis_utils.py b/tensorflow/python/keras/_impl/keras/utils/vis_utils.py
index 45c1b92075..4761cece82 100644
--- a/tensorflow/python/keras/_impl/keras/utils/vis_utils.py
+++ b/tensorflow/python/keras/_impl/keras/utils/vis_utils.py
@@ -120,7 +120,7 @@ def model_to_dot(model, show_shapes=False, show_layer_names=True, rankdir='TB'):
layer_id = str(id(layer))
for i, node in enumerate(layer._inbound_nodes):
node_key = layer.name + '_ib-' + str(i)
- if node_key in model._container_nodes:
+ if node_key in model._network_nodes: # pylint: disable=protected-access
for inbound_layer in node.inbound_layers:
inbound_layer_id = str(id(inbound_layer))
layer_id = str(id(layer))
diff --git a/tensorflow/python/kernel_tests/concat_op_test.py b/tensorflow/python/kernel_tests/concat_op_test.py
index 81c6a4aa6e..c22934ce47 100644
--- a/tensorflow/python/kernel_tests/concat_op_test.py
+++ b/tensorflow/python/kernel_tests/concat_op_test.py
@@ -606,6 +606,17 @@ class ConcatOpTest(test.TestCase):
inp_tensors_placeholders, -2, output_shape=[2, 3],
gather_indexes=[2, 0], feed_dict=feed_dict)
+ def testConcatAxisType(self):
+ for dtype in [dtypes.int32, dtypes.int64]:
+ with self.test_session(use_gpu=True):
+ t1 = [[1, 2, 3], [4, 5, 6]]
+ t2 = [[7, 8, 9], [10, 11, 12]]
+
+ c = gen_array_ops.concat_v2([t1, t2],
+ constant_op.constant(1, dtype=dtype))
+ self.assertEqual([2, 6], c.get_shape().as_list())
+ output = c.eval()
+ self.assertAllEqual([[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]], output)
class ConcatOffsetTest(test.TestCase):
diff --git a/tensorflow/python/kernel_tests/conv_ops_test.py b/tensorflow/python/kernel_tests/conv_ops_test.py
index f4fe01f868..25525cc128 100644
--- a/tensorflow/python/kernel_tests/conv_ops_test.py
+++ b/tensorflow/python/kernel_tests/conv_ops_test.py
@@ -970,7 +970,7 @@ class Conv2DTest(test.TestCase):
self.assertArrayNear(value_2.flatten(), value.flatten(), err)
def testConv2D2x2Depth3ValidBackpropFilterStride1x1Dilation2x1(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropFilterDilation(
input_sizes=[1, 3, 6, 1],
@@ -984,7 +984,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2D2x2Depth1ValidBackpropFilterDilation1x2(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropFilterDilation(
input_sizes=[1, 2, 3, 1],
@@ -998,7 +998,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2DEmptyBackpropFilterDilation1x2(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropFilterDilation(
input_sizes=[1, 2, 3, 1],
@@ -1012,7 +1012,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2D2x2Depth3ValidBackpropFilterDilation2x2(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropFilterDilation(
input_sizes=[1, 3, 4, 3],
@@ -1026,7 +1026,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2DKernelSizeMatchesInputSizeBackpropFilterDilation2x2(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropFilterDilation(
input_sizes=[1, 3, 3, 1],
@@ -1040,7 +1040,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2D2x2Depth3ValidBackpropInputStride1x1Dilation2x1(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropInputDilation(
input_sizes=[1, 3, 6, 1],
@@ -1054,7 +1054,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2D2x2Depth1ValidBackpropInputDilation1x2(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropInputDilation(
input_sizes=[1, 2, 3, 1],
@@ -1068,7 +1068,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2DEmptyBackpropInputDilation1x2(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropInputDilation(
input_sizes=[0, 2, 3, 1],
@@ -1082,7 +1082,7 @@ class Conv2DTest(test.TestCase):
err=1e-5)
def testConv2D2x2Depth3ValidBackpropInputDilation2x1(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
# The GPU version of this test is not very stable. So adjusting the
# error threshold to 1e-4.
@@ -1098,7 +1098,7 @@ class Conv2DTest(test.TestCase):
err=1e-4)
def testConv2DKernelSizeMatchesInputSizeBackpropInputDilation2x2(self):
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True) or test_util.IsMklEnabled():
for (data_format, use_gpu) in GetTestConfigs():
self._RunAndVerifyBackpropInputDilation(
input_sizes=[1, 3, 3, 1],
diff --git a/tensorflow/python/kernel_tests/depthtospace_op_test.py b/tensorflow/python/kernel_tests/depthtospace_op_test.py
index 96c9718b83..f0beabb4e2 100644
--- a/tensorflow/python/kernel_tests/depthtospace_op_test.py
+++ b/tensorflow/python/kernel_tests/depthtospace_op_test.py
@@ -35,8 +35,8 @@ from tensorflow.python.platform import tf_logging
class DepthToSpaceTest(test.TestCase):
- def _testOne(self, inputs, block_size, outputs):
- input_nhwc = math_ops.to_float(inputs)
+ def _testOne(self, inputs, block_size, outputs, dtype=dtypes.float32):
+ input_nhwc = math_ops.cast(inputs, dtype)
with self.test_session(use_gpu=False):
# test NHWC (default) on CPU
x_tf = array_ops.depth_to_space(input_nhwc, block_size)
@@ -59,6 +59,12 @@ class DepthToSpaceTest(test.TestCase):
x_out = [[[[1], [2]], [[3], [4]]]]
self._testOne(x_np, block_size, x_out)
+ def testBasicFloat16(self):
+ x_np = [[[[1, 2, 3, 4]]]]
+ block_size = 2
+ x_out = [[[[1], [2]], [[3], [4]]]]
+ self._testOne(x_np, block_size, x_out, dtype=dtypes.float16)
+
# Tests for larger input dimensions. To make sure elements are
# correctly ordered spatially.
def testBlockSize2(self):
diff --git a/tensorflow/python/kernel_tests/spacetodepth_op_test.py b/tensorflow/python/kernel_tests/spacetodepth_op_test.py
index b76135764f..cd90d16aac 100644
--- a/tensorflow/python/kernel_tests/spacetodepth_op_test.py
+++ b/tensorflow/python/kernel_tests/spacetodepth_op_test.py
@@ -34,8 +34,8 @@ from tensorflow.python.platform import tf_logging
class SpaceToDepthTest(test.TestCase):
- def _testOne(self, inputs, block_size, outputs):
- input_nhwc = math_ops.to_float(inputs)
+ def _testOne(self, inputs, block_size, outputs, dtype=dtypes.float32):
+ input_nhwc = math_ops.cast(inputs, dtype)
with self.test_session(use_gpu=False):
# test NHWC (default) on CPU
x_tf = array_ops.space_to_depth(input_nhwc, block_size)
@@ -58,6 +58,12 @@ class SpaceToDepthTest(test.TestCase):
x_out = [[[[1, 2, 3, 4]]]]
self._testOne(x_np, block_size, x_out)
+ def testBasicFloat16(self):
+ x_np = [[[[1], [2]], [[3], [4]]]]
+ block_size = 2
+ x_out = [[[[1, 2, 3, 4]]]]
+ self._testOne(x_np, block_size, x_out, dtype=dtypes.float16)
+
# Tests for larger input dimensions. To make sure elements are
# correctly ordered spatially.
def testLargerInput2x2(self):
diff --git a/tensorflow/python/layers/base.py b/tensorflow/python/layers/base.py
index e9066d3fda..e4395bea92 100644
--- a/tensorflow/python/layers/base.py
+++ b/tensorflow/python/layers/base.py
@@ -578,7 +578,7 @@ class Layer(checkpointable.CheckpointableBase):
if isinstance(variable, tf_variables.PartitionedVariable):
raise RuntimeError(
'Partitioned variable regularization is not yet '
- 'supported when executing eagerly. File a feature request'
+ 'supported when executing eagerly. File a feature request '
'if this is important to you.')
# Save a zero-argument lambda which runs the regularizer on the
# variable, to be executed when `Layer.losses` is requested.
diff --git a/tensorflow/python/layers/normalization.py b/tensorflow/python/layers/normalization.py
index 11daf01670..29fb92ccb5 100644
--- a/tensorflow/python/layers/normalization.py
+++ b/tensorflow/python/layers/normalization.py
@@ -664,9 +664,16 @@ def batch_normalization(inputs,
Note: when training, the moving_mean and moving_variance need to be updated.
By default the update ops are placed in `tf.GraphKeys.UPDATE_OPS`, so they
- need to be added as a dependency to the `train_op`. For example:
+ need to be added as a dependency to the `train_op`. Also, be sure to add
+ any batch_normalization ops before getting the update_ops collection.
+ Otherwise, update_ops will be empty, and training/inference will not work
+ properly. For example:
```python
+ x_norm = tf.layers.batch_normalization(x, training=training)
+
+ # ...
+
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss)
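A slightly fuller sketch of the ordering constraint the new docstring spells out: build the `batch_normalization` layers first, then fetch `UPDATE_OPS`, then create the train op. Layer sizes and the optimizer choice are placeholders.

```python
# Fuller sketch of the ordering described above; sizes and the optimizer are
# placeholders for illustration.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 32])
labels = tf.placeholder(tf.int64, [None])
training = tf.placeholder(tf.bool, [])

net = tf.layers.dense(x, 64, activation=tf.nn.relu)
net = tf.layers.batch_normalization(net, training=training)  # BN built first
logits = tf.layers.dense(net, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)  # fetched after BN exists
with tf.control_dependencies(update_ops):
  train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```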
diff --git a/tensorflow/python/lib/io/file_io_test.py b/tensorflow/python/lib/io/file_io_test.py
index a751607aaa..223858edfa 100644
--- a/tensorflow/python/lib/io/file_io_test.py
+++ b/tensorflow/python/lib/io/file_io_test.py
@@ -485,6 +485,11 @@ class FileIoTest(test.TestCase):
f.flush()
self.assertEqual(content, f.read(len(content) + 1))
+ def testUTF8StringPathExists(self):
+ file_path = os.path.join(self._base_dir, "UTF8测试_file_exist")
+ file_io.write_string_to_file(file_path, "testing")
+ v = file_io.file_exists(file_path)
+ self.assertEqual(v, True)
if __name__ == "__main__":
test.main()
diff --git a/tensorflow/python/lib/io/tf_record.py b/tensorflow/python/lib/io/tf_record.py
index 48ea107a14..6fcf9c91d8 100644
--- a/tensorflow/python/lib/io/tf_record.py
+++ b/tensorflow/python/lib/io/tf_record.py
@@ -75,14 +75,16 @@ def tf_record_iterator(path, options=None):
if reader is None:
raise IOError("Could not open %s." % path)
- while True:
- try:
- with errors.raise_exception_on_not_ok_status() as status:
- reader.GetNext(status)
- except errors.OutOfRangeError:
- break
- yield reader.record()
- reader.Close()
+ try:
+ while True:
+ try:
+ with errors.raise_exception_on_not_ok_status() as status:
+ reader.GetNext(status)
+ except errors.OutOfRangeError:
+ break
+ yield reader.record()
+ finally:
+ reader.Close()
@tf_export("python_io.TFRecordWriter")
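Usage sketch for the iterator patched above: with the `try`/`finally`, leaving the loop early still closes the underlying reader. The file path is a placeholder.

```python
# Usage sketch; the file path is a placeholder. Breaking out early no longer
# leaks the underlying reader thanks to the try/finally above.
import tensorflow as tf

for record in tf.python_io.tf_record_iterator('/tmp/data.tfrecord'):
    example = tf.train.Example()
    example.ParseFromString(record)
    break
```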
diff --git a/tensorflow/python/ops/linalg_ops.py b/tensorflow/python/ops/linalg_ops.py
index 37470e00d7..5b4fb4f7c8 100644
--- a/tensorflow/python/ops/linalg_ops.py
+++ b/tensorflow/python/ops/linalg_ops.py
@@ -341,7 +341,7 @@ def self_adjoint_eig(tensor, name=None):
name: string, optional name of the operation.
Returns:
- e: Eigenvalues. Shape is `[..., N]`.
+ e: Eigenvalues. Shape is `[..., N]`. Sorted in non-decreasing order.
v: Eigenvectors. Shape is `[..., N, N]`. The columns of the inner most
matrices contain eigenvectors of the corresponding matrices in `tensor`
"""
diff --git a/tensorflow/python/ops/nn_ops.py b/tensorflow/python/ops/nn_ops.py
index fb3fe77b4d..a74de39eab 100644
--- a/tensorflow/python/ops/nn_ops.py
+++ b/tensorflow/python/ops/nn_ops.py
@@ -150,14 +150,12 @@ class _NonAtrousConvolution(object):
conv_dims))
if conv_dims == 1:
# conv1d uses the 2-d data format names
- if data_format is None or data_format == "NWC":
- data_format_2d = "NHWC"
- elif data_format == "NCW":
- data_format_2d = "NCHW"
- else:
+ if data_format is None:
+ data_format = "NWC"
+ elif data_format not in {"NCW", "NWC", "NCHW", "NHWC"}:
raise ValueError("data_format must be \"NWC\" or \"NCW\".")
self.strides = strides[0]
- self.data_format = data_format_2d
+ self.data_format = data_format
self.conv_op = self._conv1d
elif conv_dims == 2:
if data_format is None or data_format == "NHWC":
@@ -699,7 +697,7 @@ def convolution(
`padded_input` is obtained by zero padding the input using an effective
spatial filter shape of `(spatial_filter_shape-1) * dilation_rate + 1` and
output striding `strides` as described in the
- @{tf.nn.convolution$comment here}.
+ @{$python/nn#Convolution$comment here}.
In the case that `data_format` does start with `"NC"`, the `input` and output
(but not the `filter`) are simply transposed as follows:
@@ -1043,9 +1041,7 @@ def pool(
@tf_export("nn.atrous_conv2d")
def atrous_conv2d(value, filters, rate, padding, name=None):
- """Atrous convolution (a.k.a.
-
- convolution with holes or dilated convolution).
+ """Atrous convolution (a.k.a. convolution with holes or dilated convolution).
This function is a simpler wrapper around the more general
@{tf.nn.convolution}, and exists only for backwards compatibility. You can
diff --git a/tensorflow/python/ops/random_ops.py b/tensorflow/python/ops/random_ops.py
index db8159579a..6a2dd3f1cd 100644
--- a/tensorflow/python/ops/random_ops.py
+++ b/tensorflow/python/ops/random_ops.py
@@ -209,7 +209,7 @@ def random_uniform(shape,
maxval: A 0-D Tensor or Python value of type `dtype`. The upper bound on
the range of random values to generate. Defaults to 1 if `dtype` is
floating point.
- dtype: The type of the output: 'float16`, `float32`, `float64`, `int32`,
+ dtype: The type of the output: `float16`, `float32`, `float64`, `int32`,
or `int64`.
seed: A Python integer. Used to create a random seed for the distribution.
See @{tf.set_random_seed}
diff --git a/tensorflow/python/ops/rnn.py b/tensorflow/python/ops/rnn.py
index c59eccc174..42af7f8b27 100644
--- a/tensorflow/python/ops/rnn.py
+++ b/tensorflow/python/ops/rnn.py
@@ -867,7 +867,7 @@ def raw_rnn(cell, loop_fn,
```python
time = tf.constant(0, dtype=tf.int32)
- (finished, next_input, initial_state, _, loop_state) = loop_fn(
+ (finished, next_input, initial_state, emit_structure, loop_state) = loop_fn(
time=time, cell_output=None, cell_state=None, loop_state=None)
emit_ta = TensorArray(dynamic_size=True, dtype=initial_state.dtype)
state = initial_state
@@ -878,7 +878,7 @@ def raw_rnn(cell, loop_fn,
loop_state=loop_state)
# Emit zeros and copy forward state for minibatch entries that are finished.
state = tf.where(finished, state, next_state)
- emit = tf.where(finished, tf.zeros_like(emit), emit)
+ emit = tf.where(finished, tf.zeros_like(emit_structure), emit)
emit_ta = emit_ta.write(time, emit)
# If any new minibatch entries are marked as finished, mark these.
finished = tf.logical_or(finished, next_finished)
@@ -938,10 +938,15 @@ def raw_rnn(cell, loop_fn,
and `emit_output`: the output to store for this iteration.
Note that `emit_output` should be a `Tensor` or (possibly nested)
- tuple of tensors with shapes and structure matching `cell.output_size`
- and `cell_output` above. The parameter `cell_state` and output
- `next_cell_state` may be either a single or (possibly nested) tuple
- of tensors. The parameter `loop_state` and
+ tuple of tensors which is aggregated in the `emit_ta` inside the
+ `while_loop`. For the first call to `loop_fn`, the `emit_output`
+ corresponds to the `emit_structure` which is then used to determine the
+ size of the `zero_tensor` for the `emit_ta` (defaults to
+ `cell.output_size`). For the subsequent calls to the `loop_fn`, the
+ `emit_output` corresponds to the actual output tensor
+ that is to be aggregated in the `emit_ta`. The parameter `cell_state`
+ and output `next_cell_state` may be either a single or (possibly nested)
+ tuple of tensors. The parameter `loop_state` and
output `next_loop_state` may be either a single or (possibly nested) tuple
of `Tensor` and `TensorArray` objects. This last parameter
may be ignored by `loop_fn` and the return value may be `None`. If it
diff --git a/tensorflow/python/ops/special_math_ops.py b/tensorflow/python/ops/special_math_ops.py
index 6d7eaababc..5e2146b79f 100644
--- a/tensorflow/python/ops/special_math_ops.py
+++ b/tensorflow/python/ops/special_math_ops.py
@@ -163,7 +163,7 @@ def einsum(equation, *inputs, **kwargs):
if '...' in equation:
raise ValueError('Subscripts with ellipses are not yet supported.')
- match = re.match('([a-z,]+)(->[a-z]*)?', equation)
+ match = re.match('^([a-zA-Z,]+)(->[a-zA-Z]*)?$', equation)
if not match:
raise ValueError('Indices have incorrect format: %s' % equation)
@@ -402,7 +402,7 @@ def _exponential_space_einsum(equation, *inputs):
if '...' in equation:
raise ValueError('Subscripts with ellipses are not yet supported.')
- match = re.match('([a-z,]+)(->[a-z]*)?', equation)
+ match = re.match('^([a-zA-Z,]+)(->[a-zA-Z]*)?$', equation)
if not match:
raise ValueError('Indices have incorrect format: %s' % equation)
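Example mirroring the new `'iJ,Jk->ik'` test case: uppercase subscripts are now accepted by the equation parser.

```python
# Mirrors the new test case 'iJ,Jk->ik'; equivalent to tf.matmul(a, b).
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(2, 3), dtype=tf.float32)
b = tf.constant(np.random.rand(3, 4), dtype=tf.float32)
c = tf.einsum('iJ,Jk->ik', a, b)
```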
diff --git a/tensorflow/python/ops/special_math_ops_test.py b/tensorflow/python/ops/special_math_ops_test.py
index 2c212f4548..d7c3a7e8dc 100644
--- a/tensorflow/python/ops/special_math_ops_test.py
+++ b/tensorflow/python/ops/special_math_ops_test.py
@@ -192,6 +192,9 @@ class EinsumTest(test.TestCase):
'abc,cba',
'dba,ead,cad->bce',
'aef,fbc,dca->bde',
+ 'iJ,Jk->ik',
+ 'iJ,Ki->JK',
+ 'iJk,Jklm->Jk'
]
long_cases = [
@@ -208,6 +211,8 @@ class EinsumTest(test.TestCase):
'ijk ijk',
'ij.jk->ik',
'ij...,jk...->ik...',
+ 'ij,k ->kji',
+ 'ij,k-> kji',
# axis in output that does not exist
'ij,jk->im',
diff --git a/tensorflow/python/tools/freeze_graph.py b/tensorflow/python/tools/freeze_graph.py
index a52f325ddb..e9f1def48c 100644
--- a/tensorflow/python/tools/freeze_graph.py
+++ b/tensorflow/python/tools/freeze_graph.py
@@ -56,8 +56,6 @@ from tensorflow.python.saved_model import tag_constants
from tensorflow.python.tools import saved_model_utils
from tensorflow.python.training import saver as saver_lib
-FLAGS = None
-
def freeze_graph_with_def_protos(input_graph_def,
input_saver_def,
@@ -256,25 +254,24 @@ def freeze_graph(input_graph,
checkpoint_version=checkpoint_version)
-def main(unused_args):
- if FLAGS.checkpoint_version == 1:
+def main(unused_args, flags):
+ if flags.checkpoint_version == 1:
checkpoint_version = saver_pb2.SaverDef.V1
- elif FLAGS.checkpoint_version == 2:
+ elif flags.checkpoint_version == 2:
checkpoint_version = saver_pb2.SaverDef.V2
else:
print("Invalid checkpoint version (must be '1' or '2'): %d" %
- FLAGS.checkpoint_version)
+ flags.checkpoint_version)
return -1
- freeze_graph(FLAGS.input_graph, FLAGS.input_saver, FLAGS.input_binary,
- FLAGS.input_checkpoint, FLAGS.output_node_names,
- FLAGS.restore_op_name, FLAGS.filename_tensor_name,
- FLAGS.output_graph, FLAGS.clear_devices, FLAGS.initializer_nodes,
- FLAGS.variable_names_whitelist, FLAGS.variable_names_blacklist,
- FLAGS.input_meta_graph, FLAGS.input_saved_model_dir,
- FLAGS.saved_model_tags, checkpoint_version)
-
+ freeze_graph(flags.input_graph, flags.input_saver, flags.input_binary,
+ flags.input_checkpoint, flags.output_node_names,
+ flags.restore_op_name, flags.filename_tensor_name,
+ flags.output_graph, flags.clear_devices, flags.initializer_nodes,
+ flags.variable_names_whitelist, flags.variable_names_blacklist,
+ flags.input_meta_graph, flags.input_saved_model_dir,
+ flags.saved_model_tags, checkpoint_version)
-if __name__ == "__main__":
+def run_main():
parser = argparse.ArgumentParser()
parser.register("type", "bool", lambda v: v.lower() == "true")
parser.add_argument(
@@ -376,5 +373,10 @@ if __name__ == "__main__":
separated by \',\'. For tag-set contains multiple tags, all tags \
must be passed in.\
""")
- FLAGS, unparsed = parser.parse_known_args()
- app.run(main=main, argv=[sys.argv[0]] + unparsed)
+ flags, unparsed = parser.parse_known_args()
+
+ my_main = lambda unused_args: main(unused_args, flags)
+ app.run(main=my_main, argv=[sys.argv[0]] + unparsed)
+
+if __name__ == '__main__':
+ run_main()
diff --git a/tensorflow/python/tools/inspect_checkpoint.py b/tensorflow/python/tools/inspect_checkpoint.py
index dd876cbe7f..6504fbc107 100644
--- a/tensorflow/python/tools/inspect_checkpoint.py
+++ b/tensorflow/python/tools/inspect_checkpoint.py
@@ -30,7 +30,7 @@ FLAGS = None
def print_tensors_in_checkpoint_file(file_name, tensor_name, all_tensors,
- all_tensor_names):
+ all_tensor_names=False):
"""Prints tensors in a checkpoint file.
If no `tensor_name` is provided, prints the tensor names and shapes
@@ -139,7 +139,7 @@ if __name__ == "__main__":
const=True,
type="bool",
default=False,
- help="If True, print the values of all the tensors.")
+ help="If True, print the names and values of all the tensors.")
parser.add_argument(
"--all_tensor_names",
nargs="?",
diff --git a/tensorflow/python/tools/saved_model_cli.py b/tensorflow/python/tools/saved_model_cli.py
index b0e9e3e5ed..b88be4ae04 100644
--- a/tensorflow/python/tools/saved_model_cli.py
+++ b/tensorflow/python/tools/saved_model_cli.py
@@ -38,11 +38,15 @@ from tensorflow.core.example import example_pb2
from tensorflow.core.framework import types_pb2
from tensorflow.python.client import session
from tensorflow.python.debug.wrappers import local_cli_wrapper
+from tensorflow.python.framework import meta_graph as meta_graph_lib
from tensorflow.python.framework import ops as ops_lib
from tensorflow.python.platform import app # pylint: disable=unused-import
from tensorflow.python.saved_model import loader
from tensorflow.python.tools import saved_model_utils
+# Set of ops to blacklist.
+_OP_BLACKLIST = set(['WriteFile', 'ReadFile'])
+
def _show_tag_sets(saved_model_dir):
"""Prints the tag-sets stored in SavedModel directory.
@@ -242,6 +246,27 @@ def get_signature_def_map(saved_model_dir, tag_set):
return meta_graph.signature_def
+def scan_meta_graph_def(meta_graph_def):
+ """Scans meta_graph_def and reports if there are ops on blacklist.
+
+ Print ops if they are on black list, or print success if no blacklisted ops
+ found.
+
+ Args:
+ meta_graph_def: MetaGraphDef protocol buffer.
+ """
+ all_ops_set = set(
+ meta_graph_lib.ops_used_by_graph_def(meta_graph_def.graph_def))
+ blacklisted_ops = _OP_BLACKLIST & all_ops_set
+ if blacklisted_ops:
+ # TODO(yifeif): print more warnings
+ print('MetaGraph with tag set %s contains the following blacklisted ops:' %
+ meta_graph_def.meta_info_def.tags, blacklisted_ops)
+ else:
+ print('MetaGraph with tag set %s does not contain blacklisted ops.' %
+ meta_graph_def.meta_info_def.tags)
+
+
def run_saved_model_with_feed_dict(saved_model_dir, tag_set, signature_def_key,
input_tensor_key_feed_dict, outdir,
overwrite_flag, tf_debug=False):
@@ -609,6 +634,21 @@ def run(args):
args.overwrite, tf_debug=args.tf_debug)
+def scan(args):
+ """Function triggered by scan command.
+
+ Args:
+ args: A namespace parsed from command line.
+ """
+ if args.tag_set:
+ scan_meta_graph_def(
+ saved_model_utils.get_meta_graph_def(args.dir, args.tag_set))
+ else:
+ saved_model = reader.read_saved_model(args.dir)
+ for meta_graph_def in saved_model.meta_graphs:
+ scan_meta_graph_def(meta_graph_def)
+
+
def create_parser():
"""Creates a parser that parse the command line arguments.
@@ -730,6 +770,26 @@ def create_parser():
'SavedModel.')
parser_run.set_defaults(func=run)
+ # scan command
+ scan_msg = ('Usage example:\n'
+ 'To scan for blacklisted ops in SavedModel:\n'
+ '$saved_model_cli scan --dir /tmp/saved_model\n'
+ 'To scan a specific MetaGraph, pass in --tag_set\n')
+ parser_scan = subparsers.add_parser(
+ 'scan',
+ description=scan_msg,
+ formatter_class=argparse.RawTextHelpFormatter)
+ parser_scan.add_argument(
+ '--dir',
+ type=str,
+ required=True,
help='directory containing the SavedModel to scan')
+ parser_scan.add_argument(
+ '--tag_set',
+ type=str,
+ help='tag-set of graph in SavedModel to scan, separated by \',\'')
+ parser_scan.set_defaults(func=scan)
+
return parser
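
The new `scan` subcommand can be exercised from the command line (`saved_model_cli scan --dir ...`) or programmatically through the same parser, as the test below does. A hedged sketch with a placeholder SavedModel path:

```python
# Hedged sketch: scan every MetaGraph in a SavedModel for the blacklisted ops
# ('WriteFile', 'ReadFile'). The directory path is a hypothetical placeholder.
from tensorflow.python.tools import saved_model_cli

parser = saved_model_cli.create_parser()
args = parser.parse_args(['scan', '--dir', '/tmp/saved_model'])
args.func(args)  # dispatches to scan(), which prints one report per MetaGraph
```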
diff --git a/tensorflow/python/tools/saved_model_cli_test.py b/tensorflow/python/tools/saved_model_cli_test.py
index f99c844845..eedc893a38 100644
--- a/tensorflow/python/tools/saved_model_cli_test.py
+++ b/tensorflow/python/tools/saved_model_cli_test.py
@@ -525,6 +525,28 @@ signature_def['serving_default']:
y_expected = np.array([[2.5], [3.0]])
self.assertAllClose(y_expected, y_actual)
+ def testScanCommand(self):
+ self.parser = saved_model_cli.create_parser()
+ base_path = test.test_src_dir_path(SAVED_MODEL_PATH)
+ args = self.parser.parse_args(['scan', '--dir', base_path])
+ with captured_output() as (out, _):
+ saved_model_cli.scan(args)
+ output = out.getvalue().strip()
+ self.assertTrue('does not contain blacklisted ops' in output)
+
+ def testScanCommandFoundBlacklistedOp(self):
+ self.parser = saved_model_cli.create_parser()
+ base_path = test.test_src_dir_path(SAVED_MODEL_PATH)
+ args = self.parser.parse_args(
+ ['scan', '--dir', base_path, '--tag_set', 'serve'])
+ op_blacklist = saved_model_cli._OP_BLACKLIST
+ saved_model_cli._OP_BLACKLIST = set(['VariableV2'])
+ with captured_output() as (out, _):
+ saved_model_cli.scan(args)
+ saved_model_cli._OP_BLACKLIST = op_blacklist
+ output = out.getvalue().strip()
+ self.assertTrue('\'VariableV2\'' in output)
+
if __name__ == '__main__':
test.main()
diff --git a/tensorflow/python/training/saver.py b/tensorflow/python/training/saver.py
index 5ef8bd9e9c..ba0d038475 100644
--- a/tensorflow/python/training/saver.py
+++ b/tensorflow/python/training/saver.py
@@ -1135,8 +1135,9 @@ class Saver(object):
the proliferation of checkpoint files on disk:
* `max_to_keep` indicates the maximum number of recent checkpoint files to
- keep. As new files are created, older files are deleted. If None or 0,
- all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent
+ keep. As new files are created, older files are deleted. If None or 0,
+ no checkpoints are deleted from the filesystem, but only the last one is
+ kept in the `checkpoint` file. Defaults to 5 (that is, the 5 most recent
checkpoint files are kept.)
* `keep_checkpoint_every_n_hours`: In addition to keeping the most recent
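
A hedged sketch of the clarified `max_to_keep=None` behavior; the variable, path, and step count are placeholders:

```python
# Hedged sketch: with max_to_keep=None no checkpoint files are deleted; only
# the `checkpoint` protocol file is updated to point at the latest one.
import tensorflow as tf

v = tf.Variable(0.0, name='v')
saver = tf.train.Saver(max_to_keep=None)
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for step in range(3):
    saver.save(sess, '/tmp/model.ckpt', global_step=step)
# /tmp/model.ckpt-0, -1 and -2 all remain on disk; the `checkpoint` file
# names only model.ckpt-2 as the latest checkpoint.
```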
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.cc b/tensorflow/stream_executor/cuda/cuda_dnn.cc
index 0b3b060fe7..03e3e0857f 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.cc
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.cc
@@ -274,7 +274,8 @@ CUDNN_DNN_ROUTINE_EACH_R6(PERFTOOLS_GPUTOOLS_CUDNN_WRAP)
// clang-format off
#if CUDNN_VERSION >= 7000
#define CUDNN_DNN_ROUTINE_EACH_R7(__macro) \
- __macro(cudnnSetConvolutionMathType)
+ __macro(cudnnSetConvolutionMathType) \
+ __macro(cudnnSetRNNMatrixMathType)
// clang-format on
CUDNN_DNN_ROUTINE_EACH_R7(PERFTOOLS_GPUTOOLS_CUDNN_WRAP)
@@ -586,6 +587,19 @@ static bool TensorOpMathEnabled() {
return is_enabled;
}
+// A helper function to decide whether to enable the TENSOR_OP_MATH math type
+// for RNNs.
+static bool RnnTensorOpMathEnabled() {
+ static bool is_enabled = [] {
+ bool is_disabled = false;
+ TF_CHECK_OK(
+ tensorflow::ReadBoolFromEnvVar("TF_DISABLE_CUDNN_RNN_TENSOR_OP_MATH",
+ /*default_val=*/false, &is_disabled));
+ return !is_disabled;
+ }();
+ return is_enabled;
+}
+
// A helper function to decide whether to use CUDNN_BATCHNORM_SPATIAL_PERSISTENT
// in batchnorm. This mode can be faster in some tasks because an optimized path
// may be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF data types, compute
@@ -1124,6 +1138,9 @@ class CudnnRnnDescriptor : public CudnnDescriptorCommon<dnn::RnnDescriptor> {
SetFailure(cudnn_params_desc_->Status());
return;
}
+ if (data_type == CUDNN_DATA_HALF) {
+ set_use_tensor_op_math(true);
+ }
}
~CudnnRnnDescriptor() override {
if (rnn_desc_) {
@@ -1132,6 +1149,20 @@ class CudnnRnnDescriptor : public CudnnDescriptorCommon<dnn::RnnDescriptor> {
CUDNN_RETURN_IF_FAIL(status, "Unable to destroy RNN descriptor");
}
}
+ void set_use_tensor_op_math(bool use_tensor_op_math) {
+#if CUDNN_VERSION >= 7000
+ cudnnMathType_t math_type =
+ (use_tensor_op_math ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH);
+ if (RnnTensorOpMathEnabled()) {
+ cudnnStatus_t status =
+ wrap::cudnnSetRNNMatrixMathType(parent_, rnn_desc_, math_type);
+ if (status != CUDNN_STATUS_SUCCESS) {
+ LOG(FATAL) << "could not set cudnn RNN math type: "
+ << ToString(status);
+ }
+ }
+#endif
+ }
cudnnRNNDescriptor_t handle() const {
if (!ok()) return nullptr;
return rnn_desc_;
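
The switch added above is read once from the environment, so it has to be set before TensorFlow initializes cuDNN. A hedged sketch:

```python
# Hedged sketch: opt out of TENSOR_OP_MATH for cuDNN RNNs by setting the
# environment variable introduced above before TensorFlow is imported.
import os
os.environ['TF_DISABLE_CUDNN_RNN_TENSOR_OP_MATH'] = '1'

import tensorflow as tf  # noqa: E402  (import after the env var is set)
# ... build and run an fp16 cudnn_rnn model as usual; tensor cores stay off ...
```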
diff --git a/tensorflow/tools/api/tests/api_compatibility_test.py b/tensorflow/tools/api/tests/api_compatibility_test.py
index 99c47fd601..96f501e163 100644
--- a/tensorflow/tools/api/tests/api_compatibility_test.py
+++ b/tensorflow/tools/api/tests/api_compatibility_test.py
@@ -247,6 +247,8 @@ class ApiCompatibilityTest(test.TestCase):
public_api_visitor = public_api.PublicAPIVisitor(visitor)
public_api_visitor.do_not_descend_map['tf'].append('contrib')
public_api_visitor.do_not_descend_map['tf.GPUOptions'] = ['Experimental']
+ # TODO(annarev): Make slide_dataset available in API.
+ public_api_visitor.private_map['tf'] = ['slide_dataset']
traverse.traverse(api, public_api_visitor)
proto_dict = visitor.GetProtos()
diff --git a/tensorflow/tools/ci_build/Dockerfile.cmake b/tensorflow/tools/ci_build/Dockerfile.cmake
index ec90c83aac..d5dea4f3e4 100644
--- a/tensorflow/tools/ci_build/Dockerfile.cmake
+++ b/tensorflow/tools/ci_build/Dockerfile.cmake
@@ -23,11 +23,12 @@ RUN /install/install_deb_packages.sh
RUN apt-get update
RUN apt-get install -y --no-install-recommends python-pip
+RUN pip install --upgrade wheel
RUN pip install --upgrade astor
RUN pip install --upgrade gast
RUN pip install --upgrade numpy
RUN pip install --upgrade termcolor
# Install golang
-RUN add-apt-repository -y ppa:ubuntu-lxc/lxd-stable
-RUN apt-get install -y golang
+RUN apt-get install -t xenial-backports -y golang-1.9
+ENV PATH=${PATH}:/usr/lib/go-1.9/bin
diff --git a/tensorflow/tools/compatibility/tf_upgrade.py b/tensorflow/tools/compatibility/tf_upgrade.py
index 6e90b286c9..1f8833582a 100644
--- a/tensorflow/tools/compatibility/tf_upgrade.py
+++ b/tensorflow/tools/compatibility/tf_upgrade.py
@@ -662,9 +662,9 @@ class TFAPIChangeSpec(APIChangeSpec):
def _reverse_handler(file_edit_recorder, node):
# TODO(aselle): Could check for a literal list of bools and try to convert
# them to indices.
- comment = ("ERROR: tf.reverse has had its argument semantics changed\n"
- "significantly the converter cannot detect this reliably, so you"
- "need to inspect this usage manually.\n")
+ comment = ("ERROR: tf.reverse has had its argument semantics changed "
+ "significantly the converter cannot detect this reliably, so "
+ "you need to inspect this usage manually.\n")
file_edit_recorder.add(
comment,
node.lineno,
diff --git a/tensorflow/tools/dist_test/README.md b/tensorflow/tools/dist_test/README.md
index c1b1f79bbd..228d5ee35d 100644
--- a/tensorflow/tools/dist_test/README.md
+++ b/tensorflow/tools/dist_test/README.md
@@ -17,6 +17,14 @@ census model:
./local_test.sh --model_name CENSUS_WIDENDEEP
+You can test a specific version of TensorFlow:
+
+```shell
+./local_test.sh ${whl_file_url}
+```
+
+For example, you can find the TensorFlow Python package URLs for Ubuntu [here](https://www.tensorflow.org/install/install_linux#the_url_of_the_tensorflow_python_package).
+
**2) Launch a remote k8s cluster on Google Kubernetes Engine (GKE) and run the
test suite on it**
diff --git a/tensorflow/tools/dist_test/local_test.sh b/tensorflow/tools/dist_test/local_test.sh
index 435f9d0dc9..caae7fd530 100755
--- a/tensorflow/tools/dist_test/local_test.sh
+++ b/tensorflow/tools/dist_test/local_test.sh
@@ -16,12 +16,11 @@
#
# Tests distributed TensorFlow on a locally running TF GRPC cluster.
#
-# This script peforms the following steps:
-# 1) Build the docker-in-docker (dind) image capable of running docker and
-# Kubernetes (k8s) cluster inside.
+# This script performs the following steps:
+# 1) Build the docker image capable of running distributed TensorFlow in docker.
# 2) Run a container from the aforementioned image and start docker service
# in it
-# 3) Call a script to launch a k8s TensorFlow GRPC cluster inside the container
+# 3) Call a script to launch a distributed TensorFlow GRPC cluster inside the container
# and run the distributed test suite.
#
# Usage: local_test.sh <whl_file_location>
@@ -64,15 +63,9 @@ die() {
# Configurations
DOCKER_IMG_NAME="tensorflow/tf-dist-test-local-cluster"
-LOCAL_K8S_CACHE=${HOME}/kubernetes
-# Helper function
-get_container_id_by_image_name() {
- # Get the id of a container by image name
- # Usage: get_docker_container_id_by_image_name <img_name>
-
- docker ps | grep $1 | awk '{print $1}'
-}
+# Use TensorFlow v1.5.0 for Python 2.7 and CPU only, as we set num_gpus to 0 below
+DEFAULT_WHL_FILE_LOCATION="https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.5.0-cp27-none-linux_x86_64.whl"
# Parse input arguments
LEAVE_CONTAINER_RUNNING=0
@@ -84,7 +77,8 @@ SYNC_REPLICAS_FLAG=""
WHL_FILE_LOCATION=${1}
if [[ -z "${WHL_FILE_LOCATION}" ]]; then
- die "whl file location is not specified"
+ WHL_FILE_LOCATION=${DEFAULT_WHL_FILE_LOCATION}
+ echo "use default whl file location"
fi
while true; do
@@ -121,7 +115,7 @@ DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Get utility functions
source ${DIR}/scripts/utils.sh
-# Build docker-in-docker image for local k8s cluster.
+# Build docker image for local distributed TensorFlow cluster.
NO_CACHE_FLAG=""
if [[ ! -z "${TF_DIST_DOCKER_NO_CACHE}" ]] &&
[[ "${TF_DIST_DOCKER_NO_CACHE}" != "0" ]]; then
diff --git a/tensorflow/tools/dist_test/python/mnist_replica.py b/tensorflow/tools/dist_test/python/mnist_replica.py
index a2d12442c4..d6e7f317dd 100644
--- a/tensorflow/tools/dist_test/python/mnist_replica.py
+++ b/tensorflow/tools/dist_test/python/mnist_replica.py
@@ -56,7 +56,7 @@ flags.DEFINE_integer("task_index", None,
flags.DEFINE_integer("num_gpus", 1, "Total number of gpus for each machine."
"If you don't use GPU, please set it to '0'")
flags.DEFINE_integer("replicas_to_aggregate", None,
- "Number of replicas to aggregate before parameter update"
+ "Number of replicas to aggregate before parameter update "
"is applied (For sync_replicas mode only; default: "
"num_workers)")
flags.DEFINE_integer("hidden_units", 100,
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index b6682cd681..625321e123 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -1,11 +1,18 @@
-FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
+FROM nvidia/cuda:9.0-base-ubuntu16.04
LABEL maintainer="Craig Citro <craigcitro@google.com>"
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
+ cuda-command-line-tools-9-0 \
+ cuda-cublas-9-0 \
+ cuda-cufft-9-0 \
+ cuda-curand-9-0 \
+ cuda-cusolver-9-0 \
+ cuda-cusparse-9-0 \
curl \
+ libcudnn7=7.0.5.15-1+cuda9.0 \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
diff --git a/tensorflow/tools/git/gen_git_source.py b/tensorflow/tools/git/gen_git_source.py
index 3630dbd740..cbcdbf5b80 100755
--- a/tensorflow/tools/git/gen_git_source.py
+++ b/tensorflow/tools/git/gen_git_source.py
@@ -114,6 +114,13 @@ def configure(src_base_path, gen_path, debug=False):
for target, src in link_map.items():
if src is None:
open(os.path.join(gen_path, target), "w").write("")
+ elif not os.path.exists(src):
+ # The git repo is configured in a way we don't support, such as having
+ # packed refs. Even though we are in a git repo, tf.__git_version__ will
+ # not be accurate.
+ # TODO(mikecase): Support grabbing git info when using packed refs.
+ open(os.path.join(gen_path, target), "w").write("")
+ spec["git"] = False
else:
try:
# In python 3.5, symlink function exists even on Windows. But requires
diff --git a/tensorflow/tools/graph_transforms/BUILD b/tensorflow/tools/graph_transforms/BUILD
index b7d7fac315..6e21aa2846 100644
--- a/tensorflow/tools/graph_transforms/BUILD
+++ b/tensorflow/tools/graph_transforms/BUILD
@@ -178,6 +178,7 @@ tf_cc_test(
"//tensorflow/core:test",
"//tensorflow/core:test_main",
"//tensorflow/core:testlib",
+ "//tensorflow/core/kernels:quantization_utils",
"//tensorflow/core/kernels:quantized_ops",
"//tensorflow/core/util/tensor_bundle",
],
diff --git a/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc b/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
index d89afe85c7..d86f65325b 100644
--- a/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
+++ b/tensorflow/tools/graph_transforms/fold_old_batch_norms.cc
@@ -182,6 +182,36 @@ Status FuseBatchNormWithConv(const NodeMatch& match,
return Status::OK();
}
+Status FuseBatchNormWithBatchToSpace(const NodeMatch& match,
+ std::vector<NodeDef>* new_nodes) {
+ // Calculate the scale and offset values to apply.
+ std::vector<float> scale_values;
+ std::vector<float> offset_values;
+ TF_RETURN_IF_ERROR(
+ GetScaleAndOffsetValues(match, &scale_values, &offset_values));
+
+ // Fuse the conv weights, and set the final output node's name to that of batch_norm_node.
+ const NodeDef& batch_norm_node = match.node;
+ const NodeMatch& batch_to_space_node_match = match.inputs[0];
+ const NodeMatch& conv_node_match = batch_to_space_node_match.inputs[0];
+ const NodeDef& batch_to_space_node = batch_to_space_node_match.node;
+ const NodeDef& conv_node = conv_node_match.node;
+
+ string biasadd_name = conv_node.name() + "/biasadd";
+ TF_RETURN_IF_ERROR(
+ FuseScaleOffsetToConvWeights(scale_values, offset_values, conv_node_match,
+ biasadd_name, new_nodes));
+
+ NodeDef new_batch_to_space_node = batch_to_space_node;
+ // reuse batch_norm node name
+ new_batch_to_space_node.set_name(batch_norm_node.name());
+ new_batch_to_space_node.set_input(0, biasadd_name);
+ new_nodes->push_back(batch_to_space_node_match.inputs[1].node);
+ new_nodes->push_back(batch_to_space_node_match.inputs[2].node);
+ new_nodes->push_back(new_batch_to_space_node);
+ return Status::OK();
+}
+
Status FuseBatchNormWithConvConcat(const NodeMatch& match,
std::vector<NodeDef>* new_nodes) {
// Calculate the scale and offset values to apply.
@@ -287,6 +317,43 @@ Status FoldOldBatchNorms(const GraphDef& input_graph_def,
do {
did_graph_change = false;
GraphDef replaced_graph_def;
+ TF_RETURN_IF_ERROR(ReplaceMatchingOpTypes(
+ current_graph_def, // clang-format off
+ {"BatchNormWithGlobalNormalization|FusedBatchNorm", // batch_norm_node
+ {
+ {"BatchToSpaceND", // batch_to_space_node
+ {
+ {"Conv2D", // conv_node
+ {
+ {"*"}, // input_node
+ {"Const"}, // weights_node
+ }
+ },
+ {"Const"}, // block_shape
+ {"Const"}, // crops
+ }
+ },
+ {"Const"}, // mean_node
+ {"Const"}, // variance_node
+ {"Const"}, // beta_node
+ {"Const"}, // gamma_node
+ }
+ }, // clang-format on
+ [&did_graph_change](const NodeMatch& match,
+ const std::set<string>& input_nodes,
+ const std::set<string>& output_nodes,
+ std::vector<NodeDef>* new_nodes) {
+ TF_RETURN_IF_ERROR(FuseBatchNormWithBatchToSpace(match, new_nodes));
+ did_graph_change = true;
+ return Status::OK();
+ },
+ {}, &replaced_graph_def));
+ current_graph_def = replaced_graph_def;
+ } while (did_graph_change);
+
+ do {
+ did_graph_change = false;
+ GraphDef replaced_graph_def;
// Replace BatchNorm with concat as input.
TF_RETURN_IF_ERROR(ReplaceMatchingOpTypes(
current_graph_def, // clang-format off
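
The new match above covers graphs where the convolution is wrapped in SpaceToBatchND/BatchToSpaceND, which is how atrous (dilated) convolutions are expressed. A hedged Python sketch of a graph that should now be folded by the `fold_old_batch_norms` transform; shapes and constants are placeholders:

```python
# Hedged sketch: tf.nn.atrous_conv2d emits SpaceToBatchND -> Conv2D ->
# BatchToSpaceND, so following it with an inference-mode fused batch norm
# produces the BatchToSpaceND/FusedBatchNorm pattern matched above.
import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 8, 8, 2).astype(np.float32))
w = tf.constant(np.random.rand(3, 3, 2, 2).astype(np.float32))
conv = tf.nn.atrous_conv2d(x, w, rate=2, padding='SAME')
y, _, _ = tf.nn.fused_batch_norm(
    conv,
    scale=tf.constant([1.0, 2.0]),
    offset=tf.constant([0.1, 0.6]),
    mean=tf.constant([10.0, 20.0]),
    variance=tf.constant([0.25, 0.5]),
    is_training=False)
```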
diff --git a/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc b/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
index b30ba9ac8b..7651a03fe5 100644
--- a/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
+++ b/tensorflow/tools/graph_transforms/fold_old_batch_norms_test.cc
@@ -16,6 +16,7 @@ limitations under the License.
#include "tensorflow/cc/ops/const_op.h"
#include "tensorflow/cc/ops/image_ops.h"
#include "tensorflow/cc/ops/nn_ops.h"
+#include "tensorflow/cc/ops/array_ops.h"
#include "tensorflow/cc/ops/sendrecv_ops.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor_testutil.h"
@@ -298,6 +299,96 @@ class FoldOldBatchNormsTest : public ::testing::Test {
}
};
+void TestFoldFusedBatchNormsWithBatchToSpace() {
+ auto root = tensorflow::Scope::NewRootScope();
+ using namespace ::tensorflow::ops; // NOLINT(build/namespaces)
+
+ Tensor input_data(DT_FLOAT, TensorShape({2, 1, 3, 2}));
+ test::FillValues<float>(
+ &input_data, {1.0f, 4.0f, 2.0f, 5.0f, 3.0f, 6.0f, -1.0f, -4.0f, -2.0f,
+ -5.0f, -3.0f, -6.0f});
+ Output input_op =
+ Const(root.WithOpName("input_op"), Input::Initializer(input_data));
+
+ Tensor weights_data(DT_FLOAT, TensorShape({1, 2, 2, 2}));
+ test::FillValues<float>(&weights_data,
+ {1.0f, 2.0f, 3.0f, 4.0f, 0.1f, 0.2f, 0.3f, 0.4f});
+ Output weights_op =
+ Const(root.WithOpName("weights_op"), Input::Initializer(weights_data));
+
+ Output conv_op = Conv2D(root.WithOpName("conv_op"), input_op, weights_op,
+ {1, 1, 1, 1}, "VALID");
+
+ Tensor block_shape_data(DT_INT32, TensorShape({2}));
+ test::FillValues<int32>(&block_shape_data, {1, 2});
+ Output block_shape_op =
+ Const(root.WithOpName("block_shape_op"), Input::Initializer(block_shape_data));
+
+ Tensor crops_data(DT_INT32, TensorShape({2, 2}));
+ test::FillValues<int32>(&crops_data, {0, 0, 0, 1});
+ Output crops_op =
+ Const(root.WithOpName("crops_op"), Input::Initializer(crops_data));
+
+ Output batch_to_space_op = BatchToSpaceND(root.WithOpName("batch_to_space_op"),
+ conv_op, block_shape_op, crops_op);
+
+ Tensor mean_data(DT_FLOAT, TensorShape({2}));
+ test::FillValues<float>(&mean_data, {10.0f, 20.0f});
+ Output mean_op =
+ Const(root.WithOpName("mean_op"), Input::Initializer(mean_data));
+
+ Tensor variance_data(DT_FLOAT, TensorShape({2}));
+ test::FillValues<float>(&variance_data, {0.25f, 0.5f});
+ Output variance_op = Const(root.WithOpName("variance_op"),
+ Input::Initializer(variance_data));
+
+ Tensor beta_data(DT_FLOAT, TensorShape({2}));
+ test::FillValues<float>(&beta_data, {0.1f, 0.6f});
+ Output beta_op =
+ Const(root.WithOpName("beta_op"), Input::Initializer(beta_data));
+
+ Tensor gamma_data(DT_FLOAT, TensorShape({2}));
+ test::FillValues<float>(&gamma_data, {1.0f, 2.0f});
+ Output gamma_op =
+ Const(root.WithOpName("gamma_op"), Input::Initializer(gamma_data));
+
+ GraphDef original_graph_def;
+ TF_ASSERT_OK(root.ToGraphDef(&original_graph_def));
+
+ NodeDef batch_norm_node;
+ batch_norm_node.set_op("FusedBatchNorm");
+ batch_norm_node.set_name("output");
+ AddNodeInput("batch_to_space_op", &batch_norm_node);
+ AddNodeInput("gamma_op", &batch_norm_node);
+ AddNodeInput("beta_op", &batch_norm_node);
+ AddNodeInput("mean_op", &batch_norm_node);
+ AddNodeInput("variance_op", &batch_norm_node);
+ SetNodeAttr("T", DT_FLOAT, &batch_norm_node);
+ SetNodeAttr("epsilon", 0.00001f, &batch_norm_node);
+ SetNodeAttr("is_training", false, &batch_norm_node);
+ *(original_graph_def.mutable_node()->Add()) = batch_norm_node;
+
+ std::unique_ptr<Session> original_session(NewSession(SessionOptions()));
+ TF_ASSERT_OK(original_session->Create(original_graph_def));
+ std::vector<Tensor> original_outputs;
+ TF_ASSERT_OK(original_session->Run({}, {"output"}, {}, &original_outputs));
+
+ GraphDef fused_graph_def;
+ TF_ASSERT_OK(FoldOldBatchNorms(original_graph_def, {{}, {"output"}},
+ &fused_graph_def));
+
+ std::unique_ptr<Session> fused_session(NewSession(SessionOptions()));
+ TF_ASSERT_OK(fused_session->Create(fused_graph_def));
+ std::vector<Tensor> fused_outputs;
+ TF_ASSERT_OK(fused_session->Run({}, {"output"}, {}, &fused_outputs));
+
+ test::ExpectTensorNear<float>(original_outputs[0], fused_outputs[0], 1e-5);
+
+ for (const NodeDef& node : fused_graph_def.node()) {
+ EXPECT_NE("FusedBatchNormWithBatchToSpace", node.op());
+ }
+}
+
TEST_F(FoldOldBatchNormsTest, TestFoldOldBatchNorms) {
TestFoldOldBatchNorms();
}
@@ -307,7 +398,7 @@ TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNorms) {
}
TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNormsWithConcat) {
- // Test axis is not 3, so all weigths and offsets are fused to each of inputs
+ // Test axis is not 3, so all weights and offsets are fused to each of inputs
// of conv2d.
TestFoldFusedBatchNormsWithConcat(/*split=*/true);
// Test axis = 3, BatchNorm weights and offsets will be split before fused
@@ -315,5 +406,9 @@ TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNormsWithConcat) {
TestFoldFusedBatchNormsWithConcat(/*split=*/false);
}
+TEST_F(FoldOldBatchNormsTest, TestFoldFusedBatchNormsWithBatchToSpace) {
+ TestFoldFusedBatchNormsWithBatchToSpace();
+}
+
} // namespace graph_transforms
} // namespace tensorflow
diff --git a/tensorflow/tools/pip_package/BUILD b/tensorflow/tools/pip_package/BUILD
index 1833d67d82..d55a883df5 100644
--- a/tensorflow/tools/pip_package/BUILD
+++ b/tensorflow/tools/pip_package/BUILD
@@ -108,6 +108,7 @@ filegroup(
"@highwayhash//:LICENSE",
"@jemalloc//:COPYING",
"@jpeg//:LICENSE.md",
+ "@kafka//:LICENSE",
"@libxsmm_archive//:LICENSE",
"@lmdb//:LICENSE",
"@local_config_sycl//sycl:LICENSE.text",
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index e1a5f091ba..e0152da4df 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -29,7 +29,7 @@ from setuptools.dist import Distribution
# This version string is semver compatible, but incompatible with pip.
# For pip, we will remove all '-' characters from this string, and use the
# result for pip.
-_VERSION = '1.6.0-rc1'
+_VERSION = '1.6.0'
REQUIRED_PACKAGES = [
'absl-py >= 0.1.6',
@@ -72,7 +72,7 @@ if sys.version_info < (3, 4):
# pylint: disable=line-too-long
CONSOLE_SCRIPTS = [
- 'freeze_graph = tensorflow.python.tools.freeze_graph:main',
+ 'freeze_graph = tensorflow.python.tools.freeze_graph:run_main',
'toco_from_protos = tensorflow.contrib.lite.toco.python.toco_from_protos:main',
'toco = tensorflow.contrib.lite.toco.python.toco_wrapper:main',
'saved_model_cli = tensorflow.python.tools.saved_model_cli:main',
diff --git a/tensorflow/tools/test/upload_test_benchmarks.py b/tensorflow/tools/test/upload_test_benchmarks.py
index 77cc9f75f7..c030575109 100644
--- a/tensorflow/tools/test/upload_test_benchmarks.py
+++ b/tensorflow/tools/test/upload_test_benchmarks.py
@@ -87,6 +87,7 @@ import json
import os
import shutil
+from six import text_type
from google.cloud import datastore
@@ -150,7 +151,7 @@ def upload_benchmark_data(client, data):
"""
test_result = json.loads(data)
- test_name = unicode(test_result["name"])
+ test_name = text_type(test_result["name"])
start_time = datetime.datetime.utcfromtimestamp(
float(test_result["startTime"]))
batch = []
@@ -162,7 +163,7 @@ def upload_benchmark_data(client, data):
t_val.update({
"test": test_name,
"start": start_time,
- "info": unicode(data)
+ "info": text_type(data)
})
batch.append(t_val)
@@ -170,7 +171,7 @@ def upload_benchmark_data(client, data):
# the attribute to be fetched and displayed. The full entry information is
# also stored as a non-indexed JSON blob.
for ent in test_result["entries"].get("entry", []):
- ent_name = unicode(ent["name"])
+ ent_name = text_type(ent["name"])
e_key = client.key("Entry")
e_val = datastore.Entity(e_key, exclude_from_indexes=["info"])
e_val.update({
@@ -178,7 +179,7 @@ def upload_benchmark_data(client, data):
"start": start_time,
"entry": ent_name,
"timing": ent["wallTime"],
- "info": unicode(json.dumps(ent))
+ "info": text_type(json.dumps(ent))
})
batch.append(e_val)
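
Swapping the Python 2-only `unicode` builtin for `six.text_type` keeps the upload script working under both Python 2 and 3; a tiny hedged sketch:

```python
# Hedged sketch: six.text_type is `unicode` on Python 2 and `str` on Python 3,
# so JSON values can be normalized to text the same way under both.
import json
from six import text_type

test_result = json.loads('{"name": "my_benchmark"}')
test_name = text_type(test_result["name"])
assert isinstance(test_name, text_type)
```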
diff --git a/third_party/jpeg/jpeg.BUILD b/third_party/jpeg/jpeg.BUILD
index 87a23925c4..4418ac32fc 100644
--- a/third_party/jpeg/jpeg.BUILD
+++ b/third_party/jpeg/jpeg.BUILD
@@ -526,12 +526,12 @@ config_setting(
config_setting(
name = "armeabi-v7a",
- values = {"android_cpu": "armeabi-v7a"},
+ values = {"cpu": "armeabi-v7a"},
)
config_setting(
name = "arm64-v8a",
- values = {"android_cpu": "arm64-v8a"},
+ values = {"cpu": "arm64-v8a"},
)
config_setting(
diff --git a/third_party/kafka/BUILD b/third_party/kafka/BUILD
index a61a9e1f6c..a839ca717e 100644
--- a/third_party/kafka/BUILD
+++ b/third_party/kafka/BUILD
@@ -130,12 +130,16 @@ cc_library(
],
hdrs = [
"config.h",
+ "src-cpp/rdkafkacpp.h",
+ "src-cpp/rdkafkacpp_int.h",
+ "src/lz4.c",
+ "src/snappy_compat.h",
],
- defines = [
+ copts = [
+ "-Iexternal/kafka/src",
+ "-Iexternal/kafka/src-cpp",
],
- includes = [
- "src",
- "src-cpp",
+ defines = [
],
linkopts = [
"-lpthread",
@@ -143,5 +147,6 @@ cc_library(
visibility = ["//visibility:public"],
deps = [
"@boringssl//:ssl",
+ "@zlib_archive//:zlib",
],
)
diff --git a/third_party/py/BUILD.tpl b/third_party/py/BUILD.tpl
index de06ad5f27..1dd8ab433a 100644
--- a/third_party/py/BUILD.tpl
+++ b/third_party/py/BUILD.tpl
@@ -2,20 +2,26 @@ licenses(["restricted"])
package(default_visibility = ["//visibility:public"])
+# To build a Python C/C++ extension on Windows, we need to link against the Python import library pythonXY.lib
+# See https://docs.python.org/3/extending/windows.html
+cc_import(
+ name = "python_lib",
+ interface_library = select({
+ ":windows": ":python_import_lib",
+ # A placeholder for Unix platforms which makes --no_build happy.
+ "//conditions:default": "not-existing.lib",
+ }),
+ system_provided = 1,
+)
+
cc_library(
name = "python_headers",
hdrs = [":python_include"],
- data = select({
- ":windows": [":python_import_lib"],
+ deps = select({
+ ":windows": [":python_lib"],
"//conditions:default": [],
}),
includes = ["python_include"],
- linkopts = select({
- # TODO(pcloudy): Ideally, this should just go into deps after resolving
- # https://github.com/bazelbuild/bazel/issues/3237,
- ":windows": ["$(locations :python_import_lib)"],
- "//conditions:default": [],
- }),
)
cc_library(
diff --git a/third_party/tensorrt/tensorrt_configure.bzl b/third_party/tensorrt/tensorrt_configure.bzl
index 8e76e5d02a..9b946505a6 100644
--- a/third_party/tensorrt/tensorrt_configure.bzl
+++ b/third_party/tensorrt/tensorrt_configure.bzl
@@ -57,6 +57,10 @@ def _find_trt_header_dir(repository_ctx, trt_install_path):
path = "/usr/include/x86_64-linux-gnu"
if _headers_exist(repository_ctx, path):
return path
+ if trt_install_path == "/usr/lib/aarch64-linux-gnu":
+ path = "/usr/include/aarch64-linux-gnu"
+ if _headers_exist(repository_ctx, path):
+ return path
path = str(repository_ctx.path("%s/../include" % trt_install_path).realpath)
if _headers_exist(repository_ctx, path):
return path