-rw-r--r--  CODE_OF_CONDUCT.md | 6
-rw-r--r--  README.md | 4
-rw-r--r--  configure.py | 38
-rw-r--r--  tensorflow/BUILD | 16
-rw-r--r--  tensorflow/compiler/aot/tfcompile.bzl | 11
-rw-r--r--  tensorflow/compiler/tests/BUILD | 2
-rw-r--r--  tensorflow/compiler/tests/fused_batchnorm_test.py | 25
-rw-r--r--  tensorflow/compiler/xla/service/BUILD | 2
-rw-r--r--  tensorflow/compiler/xla/service/hlo_instruction.h | 2
-rw-r--r--  tensorflow/compiler/xla/service/hlo_instruction_test.cc | 4
-rw-r--r--  tensorflow/contrib/batching/BUILD | 1
-rw-r--r--  tensorflow/contrib/batching/kernels/batch_kernels.cc | 2
-rw-r--r--  tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py | 2
-rw-r--r--  tensorflow/contrib/cmake/CMakeLists.txt | 147
-rw-r--r--  tensorflow/contrib/cmake/external/boringssl.cmake | 6
-rw-r--r--  tensorflow/contrib/cmake/external/jsoncpp.cmake | 6
-rw-r--r--  tensorflow/contrib/cmake/external/lmdb.cmake | 6
-rw-r--r--  tensorflow/contrib/cmake/external/png.cmake | 6
-rw-r--r--  tensorflow/contrib/cmake/external/protobuf.cmake | 6
-rw-r--r--  tensorflow/contrib/cmake/external/re2.cmake | 8
-rw-r--r--  tensorflow/contrib/cmake/external/snappy.cmake | 8
-rw-r--r--  tensorflow/contrib/cmake/external/sqlite.cmake | 6
-rw-r--r--  tensorflow/contrib/cmake/external/zlib.cmake | 6
-rw-r--r--  tensorflow/contrib/cmake/tf_cc_ops.cmake | 36
-rw-r--r--  tensorflow/contrib/cmake/tf_core_kernels.cmake | 23
-rw-r--r--  tensorflow/contrib/cmake/tf_label_image_example.cmake | 5
-rwxr-xr-x  tensorflow/contrib/cmake/tf_python.cmake | 38
-rw-r--r--  tensorflow/contrib/cmake/tf_shared_lib.cmake | 45
-rw-r--r--  tensorflow/contrib/cmake/tf_stream_executor.cmake | 3
-rw-r--r--  tensorflow/contrib/cmake/tf_tools.cmake | 13
-rw-r--r--  tensorflow/contrib/cmake/tf_tutorials.cmake | 5
-rw-r--r--  tensorflow/contrib/crf/python/ops/crf.py | 19
-rw-r--r--  tensorflow/contrib/data/python/kernel_tests/BUILD | 8
-rw-r--r--  tensorflow/contrib/distributions/BUILD | 17
-rw-r--r--  tensorflow/contrib/distributions/__init__.py | 2
-rw-r--r--  tensorflow/contrib/distributions/python/kernel_tests/cauchy_test.py | 437
-rw-r--r--  tensorflow/contrib/distributions/python/ops/cauchy.py | 223
-rw-r--r--  tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb | 4
-rw-r--r--  tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb | 6
-rw-r--r--  tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb | 10
-rw-r--r--  tensorflow/contrib/layers/python/layers/layers.py | 18
-rw-r--r--  tensorflow/contrib/layers/python/layers/layers_test.py | 73
-rw-r--r--  tensorflow/contrib/learn/python/learn/estimators/head.py | 2
-rw-r--r--  tensorflow/contrib/learn/python/learn/estimators/model_fn.py | 6
-rw-r--r--  tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py | 12
-rw-r--r--  tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py | 11
-rw-r--r--  tensorflow/contrib/lite/python/BUILD | 1
-rw-r--r--  tensorflow/contrib/lite/testing/generate_examples.py | 17
-rw-r--r--  tensorflow/contrib/lite/toco/python/BUILD | 1
-rw-r--r--  tensorflow/contrib/makefile/Makefile | 3
-rw-r--r--  tensorflow/contrib/makefile/README.md | 41
-rwxr-xr-x  tensorflow/contrib/makefile/build_all_ios.sh | 54
-rwxr-xr-x  tensorflow/contrib/makefile/compile_ios_protobuf.sh | 369
-rwxr-xr-x  tensorflow/contrib/makefile/compile_ios_tensorflow.sh | 155
-rwxr-xr-x  tensorflow/contrib/makefile/compile_nsync.sh | 5
-rw-r--r--  tensorflow/contrib/nn/__init__.py | 2
-rw-r--r--  tensorflow/contrib/opt/BUILD | 18
-rw-r--r--  tensorflow/contrib/opt/__init__.py | 5
-rw-r--r--  tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py | 138
-rw-r--r--  tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper_test.py | 119
-rw-r--r--  tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py | 42
-rw-r--r--  tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py | 44
-rw-r--r--  tensorflow/contrib/rnn/python/ops/rnn_cell.py | 344
-rw-r--r--  tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py | 51
-rw-r--r--  tensorflow/contrib/slim/README.md | 2
-rw-r--r--  tensorflow/contrib/slim/python/slim/nets/resnet_v1_test.py | 2
-rw-r--r--  tensorflow/contrib/verbs/README.md | 14
-rw-r--r--  tensorflow/contrib/verbs/rdma.cc | 413
-rw-r--r--  tensorflow/contrib/verbs/rdma.h | 40
-rw-r--r--  tensorflow/core/BUILD | 1
-rw-r--r--  tensorflow/core/common_runtime/mkl_cpu_allocator.h | 2
-rw-r--r--  tensorflow/core/common_runtime/sycl/sycl_device.h | 22
-rw-r--r--  tensorflow/core/graph/graph.cc | 15
-rw-r--r--  tensorflow/core/graph/graph.h | 5
-rw-r--r--  tensorflow/core/graph/graph_partition.cc | 4
-rw-r--r--  tensorflow/core/graph/graph_test.cc | 64
-rw-r--r--  tensorflow/core/graph/mkl_graph_util.h | 179
-rw-r--r--  tensorflow/core/graph/mkl_layout_pass.cc | 2
-rw-r--r--  tensorflow/core/graph/mkl_tfconversion_pass.cc | 4
-rw-r--r--  tensorflow/core/grappler/costs/graph_properties.h | 6
-rw-r--r--  tensorflow/core/grappler/utils.cc | 2
-rw-r--r--  tensorflow/core/kernels/BUILD | 31
-rw-r--r--  tensorflow/core/kernels/avgpooling_op.cc | 7
-rw-r--r--  tensorflow/core/kernels/bincount_op.cc | 115
-rw-r--r--  tensorflow/core/kernels/bincount_op.h | 41
-rw-r--r--  tensorflow/core/kernels/bincount_op_gpu.cu.cc | 114
-rw-r--r--  tensorflow/core/kernels/bincount_op_test.cc | 75
-rw-r--r--  tensorflow/core/kernels/bucketize_op.cc | 66
-rw-r--r--  tensorflow/core/kernels/bucketize_op.h | 41
-rw-r--r--  tensorflow/core/kernels/bucketize_op_gpu.cu.cc | 101
-rw-r--r--  tensorflow/core/kernels/conv_grad_ops_3d.cc | 42
-rw-r--r--  tensorflow/core/kernels/conv_ops_3d.cc | 5
-rw-r--r--  tensorflow/core/kernels/cwise_op_acosh.cc | 12
-rw-r--r--  tensorflow/core/kernels/cwise_op_asinh.cc | 14
-rw-r--r--  tensorflow/core/kernels/cwise_op_atanh.cc | 14
-rw-r--r--  tensorflow/core/kernels/cwise_ops.h | 12
-rw-r--r--  tensorflow/core/kernels/depthwise_conv_grad_op.cc | 10
-rw-r--r--  tensorflow/core/kernels/depthwise_conv_op.cc | 10
-rw-r--r--  tensorflow/core/kernels/depthwise_conv_op.h | 4
-rw-r--r--  tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc | 19
-rw-r--r--  tensorflow/core/kernels/dynamic_partition_op_gpu.cu.cc | 376
-rw-r--r--  tensorflow/core/kernels/dynamic_partition_op_test.cc | 58
-rw-r--r--  tensorflow/core/kernels/fused_batch_norm_op.cc | 70
-rw-r--r--  tensorflow/core/kernels/fused_batch_norm_op.h | 22
-rwxr-xr-x  tensorflow/core/kernels/lmdb_reader_op.cc | 7
-rw-r--r--  tensorflow/core/kernels/maxpooling_op.cc | 47
-rw-r--r--  tensorflow/core/kernels/maxpooling_op_gpu.cu.cc | 40
-rw-r--r--  tensorflow/core/kernels/maxpooling_op_gpu.h | 2
-rw-r--r--  tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc | 78
-rw-r--r--  tensorflow/core/kernels/mkl_conv_grad_input_ops.cc | 86
-rw-r--r--  tensorflow/core/kernels/mkl_conv_ops.cc | 82
-rw-r--r--  tensorflow/core/kernels/mkl_conv_ops.h | 140
-rw-r--r--  tensorflow/core/kernels/mkl_tfconv_op.h | 80
-rw-r--r--  tensorflow/core/kernels/pooling_ops_common.cc | 10
-rw-r--r--  tensorflow/core/kernels/pooling_ops_common_gpu.h | 4
-rw-r--r--  tensorflow/core/kernels/quantized_add_op.cc | 2
-rw-r--r--  tensorflow/core/kernels/random_op.cc | 4
-rw-r--r--  tensorflow/core/kernels/segment_reduction_ops.cc | 3
-rw-r--r--  tensorflow/core/kernels/segment_reduction_ops.h | 36
-rw-r--r--  tensorflow/core/kernels/shape_ops.cc | 43
-rw-r--r--  tensorflow/core/kernels/shape_ops.h | 13
-rw-r--r--  tensorflow/core/kernels/slice_op.cc | 116
-rw-r--r--  tensorflow/core/kernels/slice_op.h | 109
-rw-r--r--  tensorflow/core/kernels/slice_op_gpu.cu.cc | 56
-rw-r--r--  tensorflow/core/kernels/strided_slice_op.cc | 1
-rw-r--r--  tensorflow/core/kernels/strided_slice_op_impl.h | 25
-rw-r--r--  tensorflow/core/kernels/strided_slice_op_test.cc | 49
-rw-r--r--  tensorflow/core/kernels/transpose_op.cc | 35
-rw-r--r--  tensorflow/core/kernels/unique_op.cc | 113
-rw-r--r--  tensorflow/core/ops/array_ops.cc | 44
-rw-r--r--  tensorflow/core/ops/math_ops.cc | 2
-rw-r--r--  tensorflow/core/ops/nn_ops.cc | 12
-rw-r--r--  tensorflow/core/ops/ops.pbtxt | 5
-rw-r--r--  tensorflow/core/platform/default/build_config/BUILD | 20
-rw-r--r--  tensorflow/core/platform/default/notification.h | 2
-rw-r--r--  tensorflow/core/platform/posix/error.cc | 11
-rw-r--r--  tensorflow/core/platform/posix/port.cc | 6
-rw-r--r--  tensorflow/core/public/version.h | 2
-rw-r--r--  tensorflow/core/util/mkl_util.h | 691
-rw-r--r--  tensorflow/core/util/mkl_util_test.cc | 92
-rw-r--r--  tensorflow/docs_src/api_guides/python/threading_and_queues.md | 2
-rw-r--r--  tensorflow/docs_src/get_started/get_started.md | 6
-rw-r--r--  tensorflow/docs_src/get_started/input_fn.md | 6
-rw-r--r--  tensorflow/docs_src/install/install_c.md | 2
-rw-r--r--  tensorflow/docs_src/install/install_go.md | 2
-rw-r--r--  tensorflow/docs_src/install/install_java.md | 18
-rw-r--r--  tensorflow/docs_src/install/install_linux.md | 22
-rw-r--r--  tensorflow/docs_src/install/install_mac.md | 10
-rw-r--r--  tensorflow/docs_src/install/install_sources.md | 19
-rw-r--r--  tensorflow/docs_src/mobile/prepare_models.md | 2
-rw-r--r--  tensorflow/docs_src/programmers_guide/debugger.md | 19
-rw-r--r--  tensorflow/docs_src/programmers_guide/tensors.md | 12
-rw-r--r--  tensorflow/examples/speech_commands/models.py | 2
-rw-r--r--  tensorflow/go/android.go | 20
-rw-r--r--  tensorflow/go/operation_test.go | 8
-rw-r--r--  tensorflow/go/tensor.go | 9
-rw-r--r--  tensorflow/go/tensor_test.go | 9
-rw-r--r--  tensorflow/java/src/main/java/org/tensorflow/Shape.java | 32
-rw-r--r--  tensorflow/java/src/test/java/org/tensorflow/ShapeTest.java | 26
-rw-r--r--  tensorflow/python/BUILD | 4
-rw-r--r--  tensorflow/python/estimator/canned/head.py | 2
-rw-r--r--  tensorflow/python/estimator/inputs/numpy_io.py | 83
-rw-r--r--  tensorflow/python/estimator/inputs/numpy_io_test.py | 87
-rw-r--r--  tensorflow/python/framework/ops.py | 4
-rw-r--r--  tensorflow/python/framework/tensor_util.py | 1
-rw-r--r--  tensorflow/python/framework/test_util.py | 3
-rw-r--r--  tensorflow/python/kernel_tests/array_ops_test.py | 52
-rw-r--r--  tensorflow/python/kernel_tests/bincount_op_test.py | 25
-rw-r--r--  tensorflow/python/kernel_tests/bucketize_op_test.py | 8
-rw-r--r--  tensorflow/python/kernel_tests/constant_op_test.py | 14
-rw-r--r--  tensorflow/python/kernel_tests/conv1d_test.py | 43
-rw-r--r--  tensorflow/python/kernel_tests/conv_ops_3d_test.py | 267
-rw-r--r--  tensorflow/python/kernel_tests/depthwise_conv_op_test.py | 20
-rw-r--r--  tensorflow/python/kernel_tests/distributions/BUILD | 1
-rw-r--r--  tensorflow/python/kernel_tests/distributions/multinomial_test.py | 12
-rw-r--r--  tensorflow/python/kernel_tests/dynamic_partition_op_test.py | 106
-rw-r--r--  tensorflow/python/kernel_tests/pooling_ops_test.py | 60
-rw-r--r--  tensorflow/python/kernel_tests/reader_ops_test.py | 41
-rw-r--r--  tensorflow/python/kernel_tests/segment_reduction_ops_test.py | 29
-rw-r--r--  tensorflow/python/kernel_tests/shape_ops_test.py | 10
-rw-r--r--  tensorflow/python/kernel_tests/slice_op_test.py | 25
-rw-r--r--  tensorflow/python/kernel_tests/unique_op_test.py | 26
-rw-r--r--  tensorflow/python/layers/base.py | 8
-rw-r--r--  tensorflow/python/layers/convolutional.py | 2
-rw-r--r--  tensorflow/python/layers/normalization.py | 22
-rw-r--r--  tensorflow/python/layers/normalization_test.py | 98
-rw-r--r--  tensorflow/python/ops/array_ops.py | 38
-rw-r--r--  tensorflow/python/ops/distributions/dirichlet.py | 2
-rw-r--r--  tensorflow/python/ops/distributions/multinomial.py | 49
-rw-r--r--  tensorflow/python/ops/image_ops_impl.py | 23
-rw-r--r--  tensorflow/python/ops/linalg_ops.py | 31
-rw-r--r--  tensorflow/python/ops/math_grad_test.py | 17
-rw-r--r--  tensorflow/python/ops/math_ops.py | 258
-rw-r--r--  tensorflow/python/ops/metrics_impl.py | 2
-rw-r--r--  tensorflow/python/ops/nn_fused_batchnorm_test.py | 119
-rw-r--r--  tensorflow/python/ops/nn_impl.py | 16
-rw-r--r--  tensorflow/python/ops/nn_ops.py | 125
-rw-r--r--  tensorflow/python/ops/variables.py | 4
-rw-r--r-- [-rwxr-xr-x]  tensorflow/python/tools/import_pb_to_tensorboard.py | 0
-rw-r--r--  tensorflow/stream_executor/cuda/cuda_dnn.cc | 9
-rw-r--r--  tensorflow/stream_executor/dnn.cc | 13
-rw-r--r--  tensorflow/stream_executor/dnn.h | 6
-rw-r--r--  tensorflow/tools/api/golden/tensorflow.linalg.pbtxt | 2
-rw-r--r--  tensorflow/tools/api/golden/tensorflow.nn.pbtxt | 6
-rw-r--r--  tensorflow/tools/api/golden/tensorflow.pbtxt | 22
-rwxr-xr-x  tensorflow/tools/ci_build/ci_parameterized_build.sh | 2
-rwxr-xr-x  tensorflow/tools/ci_build/install/install_golang.sh | 2
-rwxr-xr-x  tensorflow/tools/ci_build/linux/libtensorflow_docker.sh | 2
-rwxr-xr-x  tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh | 2
-rwxr-xr-x  tensorflow/tools/ci_build/osx/libtensorflow_gpu.sh | 2
-rwxr-xr-x  tensorflow/tools/ci_build/pi/build_raspberry_pi.sh | 6
-rw-r--r--  tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh | 4
-rw-r--r--  tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7 | 7
-rw-r--r--  tensorflow/tools/docker/Dockerfile.gpu | 2
-rw-r--r--  tensorflow/tools/docker/README.md | 14
-rw-r--r--  tensorflow/tools/graph_transforms/BUILD | 2
-rw-r--r--  tensorflow/tools/graph_transforms/quantize_nodes.cc | 2
-rw-r--r--  tensorflow/tools/pip_package/setup.py | 2
-rw-r--r--  third_party/aws.BUILD | 3
-rw-r--r--  third_party/curl.BUILD | 1
-rwxr-xr-x  third_party/sycl/crosstool/CROSSTOOL.tpl | 8
-rw-r--r--  third_party/sycl/crosstool/trisycl.tpl | 73
-rwxr-xr-x  third_party/sycl/sycl/BUILD.tpl | 17
-rwxr-xr-x  third_party/sycl/sycl/build_defs.bzl.tpl | 17
-rw-r--r--  third_party/sycl/sycl_configure.bzl | 86
-rw-r--r--  third_party/zlib.BUILD | 2
-rw-r--r--  tools/bazel.rc | 7
-rw-r--r--  util/python/BUILD | 2
228 files changed, 1807 insertions, 7328 deletions
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index ff11d13140..10fd595fec 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -42,7 +42,7 @@ The Code of Conduct also applies within project spaces and in public spaces when
Conflicts in an open source project can take many forms, from someone having a bad day and using harsh and hurtful language in the issue queue, to more serious instances such as sexist/racist statements or threats of violence, and everything in between.
-If the behavior is threatening or harassing, or for other reasons requires immediate escalation, please see below.
+If the behaviour is threatening or harassing, or for other reasons requires immediate escalation, please see below.
However, for the vast majority of issues, we aim to empower individuals to first resolve conflicts themselves, asking for help when needed, and only after that fails to escalate further. This approach gives people more control over the outcome of their dispute.
@@ -55,14 +55,14 @@ If you are experiencing or witnessing conflict, we ask you to use the following
## Reporting Violations
-Violations of the Code of Conduct can be reported to TensorFlow’s Project Stewards, Edd Wilder-James (ewj@google.com) and Sarah Novotny (sarahnovotny@google.com). The Project Steward will determine whether the Code of Conduct was violated, and will issue an appropriate sanction, possibly including a written warning or expulsion from the project, project sponsored spaces, or project forums. We ask that you make a good-faith effort to resolve your conflict via the conflict resolution policy before submitting a report.
+Violations of the Code of Conduct can be reported to TensorFlow’s Project Steward at conduct@tensorflow.org. The Project Steward will determine whether the Code of Conduct was violated, and will issue an appropriate sanction, possibly including a written warning or expulsion from the project, project sponsored spaces, or project forums. We ask that you make a good-faith effort to resolve your conflict via the conflict resolution policy before submitting a report.
Violations of the Code of Conduct can occur in any setting, even those unrelated to the project. We will only consider complaints about conduct that has occurred within one year of the report.
## Enforcement
-If the Project Stewards receive a report alleging a violation of the Code of Conduct, the Project Stewards will notify the accused of the report, and provide them an opportunity to discuss the report before a sanction is issued. The Project Stewards will do their utmost to keep the reporter anonymous. If the act is ongoing (such as someone engaging in harassment), or involves a threat to anyone's safety (e.g. threats of violence), the Project Stewards may issue sanctions without notice.
+If the Project Steward receives a report alleging a violation of the Code of Conduct, the Project Steward will notify the accused of the report, and provide them an opportunity to discuss the report before a sanction is issued. The Project Steward will do their utmost to keep the reporter anonymous. If the act is ongoing (such as someone engaging in harassment), or involves a threat to anyone's safety (e.g. threats of violence), the Project Steward may issue sanctions without notice.
## Attribution
diff --git a/README.md b/README.md
index aff3427bdd..24bbb6cec1 100644
--- a/README.md
+++ b/README.md
@@ -73,11 +73,11 @@ $ python
## For more information
-* [TensorFlow Website](https://www.tensorflow.org)
+* [TensorFlow website](https://www.tensorflow.org)
* [TensorFlow White Papers](https://www.tensorflow.org/about/bib)
* [TensorFlow Model Zoo](https://github.com/tensorflow/models)
* [TensorFlow MOOC on Udacity](https://www.udacity.com/course/deep-learning--ud730)
-* [TensorFlow Course at Stanford](https://web.stanford.edu/class/cs20si)
+* [TensorFlow course at Stanford](https://web.stanford.edu/class/cs20si)
Learn more about the TensorFlow community at the [community page of tensorflow.org](https://www.tensorflow.org/community) for a few ways to participate.
diff --git a/configure.py b/configure.py
index 26da09bd94..0d1afbfe15 100644
--- a/configure.py
+++ b/configure.py
@@ -43,7 +43,6 @@ _DEFAULT_CUDA_PATH_WIN = ('C:/Program Files/NVIDIA GPU Computing '
'Toolkit/CUDA/v%s' % _DEFAULT_CUDA_VERSION)
_TF_OPENCL_VERSION = '1.2'
_DEFAULT_COMPUTECPP_TOOLKIT_PATH = '/usr/local/computecpp'
-_DEFAULT_TRISYCL_INCLUDE_DIR = '/usr/local/triSYCL/include'
def is_windows():
@@ -637,7 +636,7 @@ def set_tf_cuda_version(environ_cp):
write_action_env_to_bazelrc('TF_CUDA_VERSION', tf_cuda_version)
-def set_tf_cudnn_version(environ_cp):
+def set_tf_cunn_version(environ_cp):
"""Set CUDNN_INSTALL_PATH and TF_CUDNN_VERSION."""
ask_cudnn_version = (
'Please specify the cuDNN version you want to use. '
@@ -883,27 +882,6 @@ def set_computecpp_toolkit_path(environ_cp):
write_action_env_to_bazelrc('COMPUTECPP_TOOLKIT_PATH',
computecpp_toolkit_path)
-def set_trisycl_include_dir(environ_cp):
- """Set TRISYCL_INCLUDE_DIR"""
- ask_trisycl_include_dir = ('Please specify the location of the triSYCL '
- 'include directory. (Use --config=sycl_trisycl '
- 'when building with Bazel) '
- '[Default is %s]: '
- ) % (_DEFAULT_TRISYCL_INCLUDE_DIR)
- while True:
- trisycl_include_dir = get_from_env_or_user_or_default(
- environ_cp, 'TRISYCL_INCLUDE_DIR', ask_trisycl_include_dir,
- _DEFAULT_TRISYCL_INCLUDE_DIR)
- if os.path.exists(trisycl_include_dir):
- break
-
- print('Invalid triSYCL include directory, %s cannot be found'
- % (trisycl_include_dir))
-
- # Set TRISYCL_INCLUDE_DIR
- environ_cp['TRISYCL_INCLUDE_DIR'] = trisycl_include_dir
- write_action_env_to_bazelrc('TRISYCL_INCLUDE_DIR',
- trisycl_include_dir)
def set_mpi_home(environ_cp):
"""Set MPI_HOME."""
@@ -1019,8 +997,6 @@ def main():
environ_cp['TF_NEED_GCP'] = '0'
environ_cp['TF_NEED_HDFS'] = '0'
environ_cp['TF_NEED_JEMALLOC'] = '0'
- environ_cp['TF_NEED_OPENCL_SYCL'] = '0'
- environ_cp['TF_NEED_COMPUTECPP'] = '0'
environ_cp['TF_NEED_OPENCL'] = '0'
environ_cp['TF_CUDA_CLANG'] = '0'
@@ -1042,21 +1018,17 @@ def main():
set_build_var(environ_cp, 'TF_NEED_VERBS', 'VERBS', 'with_verbs_support',
False, 'verbs')
- set_action_env_var(environ_cp, 'TF_NEED_OPENCL_SYCL', 'OpenCL SYCL', False)
- if environ_cp.get('TF_NEED_OPENCL_SYCL') == '1':
+ set_action_env_var(environ_cp, 'TF_NEED_OPENCL', 'OpenCL', False)
+ if environ_cp.get('TF_NEED_OPENCL') == '1':
set_host_cxx_compiler(environ_cp)
set_host_c_compiler(environ_cp)
- set_action_env_var(environ_cp, 'TF_NEED_COMPUTECPP', 'ComputeCPP', True)
- if environ_cp.get('TF_NEED_COMPUTECPP') == '1':
- set_computecpp_toolkit_path(environ_cp)
- else:
- set_trisycl_include_dir(environ_cp)
+ set_computecpp_toolkit_path(environ_cp)
set_action_env_var(environ_cp, 'TF_NEED_CUDA', 'CUDA', False)
if (environ_cp.get('TF_NEED_CUDA') == '1' and
'TF_CUDA_CONFIG_REPO' not in environ_cp):
set_tf_cuda_version(environ_cp)
- set_tf_cudnn_version(environ_cp)
+ set_tf_cunn_version(environ_cp)
set_tf_cuda_compute_capabilities(environ_cp)
set_tf_cuda_clang(environ_cp)
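
Note: the configure.py hunks above collapse SYCL setup back to a single TF_NEED_OPENCL gate, with ComputeCpp as the only toolchain and no triSYCL include-dir prompt. A minimal Python sketch of the resulting flow, with the real configure.py helpers replaced by stand-in stubs (the default path is the _DEFAULT_COMPUTECPP_TOOLKIT_PATH constant visible in this file):

    import os

    # Stand-in stubs for the real configure.py helpers; only the control
    # flow is the point here.
    def set_host_cxx_compiler(env):
        env['HOST_CXX_COMPILER'] = '/usr/bin/g++'

    def set_host_c_compiler(env):
        env['HOST_C_COMPILER'] = '/usr/bin/gcc'

    def set_computecpp_toolkit_path(env):
        # Default taken from _DEFAULT_COMPUTECPP_TOOLKIT_PATH above.
        env['COMPUTECPP_TOOLKIT_PATH'] = '/usr/local/computecpp'

    environ_cp = dict(os.environ)
    environ_cp.setdefault('TF_NEED_OPENCL', '0')

    if environ_cp.get('TF_NEED_OPENCL') == '1':
        set_host_cxx_compiler(environ_cp)
        set_host_c_compiler(environ_cp)
        # ComputeCpp is configured unconditionally; the triSYCL
        # alternative no longer exists after this change.
        set_computecpp_toolkit_path(environ_cp)
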
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index c8f0b6b061..49828cd4d6 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -55,15 +55,6 @@ config_setting(
)
config_setting(
- name = "raspberry_pi_armeabi",
- values = {
- "crosstool_top": "@local_config_arm_compiler//:toolchain",
- "cpu": "armeabi",
- },
- visibility = ["//visibility:public"],
-)
-
-config_setting(
name = "android_arm",
values = {
"crosstool_top": "//external:android/crosstool",
@@ -769,13 +760,6 @@ tf_cc_shared_object(
],
)
-exports_files(
- [
- "tf_version_script.lds",
- "tf_exported_symbols.lds",
- ],
-)
-
py_library(
name = "tensorflow_py",
srcs = ["__init__.py"],
diff --git a/tensorflow/compiler/aot/tfcompile.bzl b/tensorflow/compiler/aot/tfcompile.bzl
index b795afd5b8..ee291c12d0 100644
--- a/tensorflow/compiler/aot/tfcompile.bzl
+++ b/tensorflow/compiler/aot/tfcompile.bzl
@@ -130,10 +130,6 @@ def tf_library(name, graph, config,
header_file = name + ".h"
object_file = name + ".o"
ep = ("__" + PACKAGE_NAME + "__" + name).replace("/", "_")
- if type(tfcompile_flags) == type(""):
- flags = tfcompile_flags
- else:
- flags = " ".join(["'" + arg.replace("'", "'\\''") + "'" for arg in (tfcompile_flags or [])])
native.genrule(
name=("gen_" + name),
srcs=[
@@ -152,7 +148,7 @@ def tf_library(name, graph, config,
" --target_triple=" + target_llvm_triple() +
" --out_header=$(@D)/" + header_file +
" --out_object=$(@D)/" + object_file +
- flags),
+ " " + (tfcompile_flags or "")),
tools=[tfcompile_tool],
visibility=visibility,
testonly=testonly,
@@ -189,7 +185,7 @@ def tf_library(name, graph, config,
" --cpp_class=" + cpp_class +
" --target_triple=" + target_llvm_triple() +
" --out_session_module=$(@D)/" + session_module_pb +
- flags),
+ " " + (tfcompile_flags or "")),
tools=[tfcompile_tool],
visibility=visibility,
testonly=testonly,
@@ -199,7 +195,8 @@ def tf_library(name, graph, config,
# The cc_library rule packaging up the header and object file, and needed
# kernel implementations.
- need_xla_data_proto = (flags and flags.find("--gen_program_shape") != -1)
+ need_xla_data_proto = (tfcompile_flags and
+ tfcompile_flags.find("--gen_program_shape") != -1)
native.cc_library(
name=name,
srcs=[object_file],
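
For reference, the list handling removed above shell-quoted each flag before joining, so list-typed tfcompile_flags could safely contain spaces or quotes; after this change the flags are a single string appended verbatim to the command line. A small Python reproduction of the removed quoting logic (flag values are illustrative):

    # Reproduces the transformation the removed list branch performed.
    tfcompile_flags = ["--xla_cpu_multi_thread_eigen=false", "--note=it's ok"]
    flags = " ".join(
        "'" + arg.replace("'", "'\\''") + "'" for arg in (tfcompile_flags or []))
    print(flags)
    # '--xla_cpu_multi_thread_eigen=false' '--note=it'\''s ok'
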
diff --git a/tensorflow/compiler/tests/BUILD b/tensorflow/compiler/tests/BUILD
index 79c4befd36..c372e05474 100644
--- a/tensorflow/compiler/tests/BUILD
+++ b/tensorflow/compiler/tests/BUILD
@@ -672,7 +672,7 @@ tf_library(
cpp_class = "LSTMLayerInference",
graph = "lstm_layer_inference.pbtxt",
tags = ["manual"],
- tfcompile_flags = ["--xla_cpu_multi_thread_eigen=false"],
+ tfcompile_flags = "--xla_cpu_multi_thread_eigen=false",
)
# -----------------------------------------------------------------------------
diff --git a/tensorflow/compiler/tests/fused_batchnorm_test.py b/tensorflow/compiler/tests/fused_batchnorm_test.py
index a773b5a947..936fcf8b6b 100644
--- a/tensorflow/compiler/tests/fused_batchnorm_test.py
+++ b/tensorflow/compiler/tests/fused_batchnorm_test.py
@@ -36,7 +36,7 @@ class FusedBatchNormTest(XLATestCase):
x_square = x * x
x_square_sum = np.sum(x_square, (0, 1, 2))
x_sum = np.sum(x, axis=(0, 1, 2))
- element_count = np.size(x) / int(np.shape(x)[-1])
+ element_count = np.size(x) / int(np.shape(x)[0])
mean = x_sum / element_count
var = x_square_sum / element_count - mean * mean
normalized = (x - mean) / np.sqrt(var + epsilon)
@@ -64,9 +64,8 @@ class FusedBatchNormTest(XLATestCase):
return grad_x, grad_scale, grad_offset
def testInference(self):
- channel = 3
- x_shape = [2, 2, 6, channel]
- scale_shape = [channel]
+ x_shape = [2, 2, 6, 2]
+ scale_shape = [2]
x_val = np.random.random_sample(x_shape).astype(np.float32)
scale_val = np.random.random_sample(scale_shape).astype(np.float32)
@@ -75,8 +74,8 @@ class FusedBatchNormTest(XLATestCase):
with self.test_session() as sess, self.test_scope():
# To avoid constant folding
t_val = array_ops.placeholder(np.float32, shape=x_shape, name="x")
- scale = array_ops.placeholder(np.float32, shape=scale_shape, name="scale")
- offset = array_ops.placeholder(np.float32, shape=scale_shape, name="offset")
+ scale = array_ops.placeholder(np.float32, shape=[2], name="scale")
+ offset = array_ops.placeholder(np.float32, shape=[2], name="offset")
epsilon = 0.001
y_ref, mean_ref, var_ref = self._reference_training(
x_val, scale_val, offset_val, epsilon, data_format)
@@ -98,9 +97,8 @@ class FusedBatchNormTest(XLATestCase):
self.assertAllClose(y_val, y_ref, atol=1e-3)
def _testLearning(self, use_gradient_checker):
- channel = 3
- x_shape = [2, 2, 6, channel]
- scale_shape = [channel]
+ x_shape = [2, 2, 6, 2]
+ scale_shape = [2]
x_val = np.random.random_sample(x_shape).astype(np.float32)
scale_val = np.random.random_sample(scale_shape).astype(np.float32)
@@ -111,8 +109,8 @@ class FusedBatchNormTest(XLATestCase):
with self.test_session() as sess, self.test_scope():
# To avoid constant folding
t_val = array_ops.placeholder(np.float32, shape=x_shape, name="x")
- scale = array_ops.placeholder(np.float32, shape=scale_shape, name="scale")
- offset = array_ops.placeholder(np.float32, shape=scale_shape, name="offset")
+ scale = array_ops.placeholder(np.float32, shape=[2], name="scale")
+ offset = array_ops.placeholder(np.float32, shape=[2], name="offset")
epsilon = 0.001
y, mean, var = nn.fused_batch_norm(
t_val,
@@ -156,9 +154,8 @@ class FusedBatchNormTest(XLATestCase):
def testGradient(self):
# TODO(b/64270657): Use gradient_checker here in addition to comparing with
# this reference implementation.
- channel = 3
- x_shape = [2, 2, 6, channel]
- scale_shape = [channel]
+ x_shape = [2, 2, 6, 2]
+ scale_shape = [2]
grad_val = np.random.random_sample(x_shape).astype(np.float32)
x_val = np.random.random_sample(x_shape).astype(np.float32)
scale_val = np.random.random_sample(scale_shape).astype(np.float32)
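
A quick NumPy check of the reference moments used above, under the [2, 2, 6, 2] shape this revision adopts: because the first and last dimensions are both 2, dividing np.size(x) by shape[0] happens to give the same 24-elements-per-channel count as dividing by the channel dimension, so the reference stays self-consistent for these shapes. Sketch only, not part of the test:

    import numpy as np

    x = np.random.random_sample([2, 2, 6, 2]).astype(np.float32)
    x_sum = np.sum(x, axis=(0, 1, 2))                 # per-channel, shape [2]
    element_count = np.size(x) / int(np.shape(x)[0])  # 48 / 2 = 24 per channel
    mean = x_sum / element_count
    var = np.sum(x * x, axis=(0, 1, 2)) / element_count - mean * mean

    np.testing.assert_allclose(mean, np.mean(x, axis=(0, 1, 2)), rtol=1e-5)
    np.testing.assert_allclose(var, np.var(x, axis=(0, 1, 2)), rtol=1e-4)
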
diff --git a/tensorflow/compiler/xla/service/BUILD b/tensorflow/compiler/xla/service/BUILD
index fb980e7056..db265510f2 100644
--- a/tensorflow/compiler/xla/service/BUILD
+++ b/tensorflow/compiler/xla/service/BUILD
@@ -90,6 +90,8 @@ cc_library(
":shape_inference",
"//tensorflow/compiler/xla:literal_util",
"//tensorflow/compiler/xla:shape_util",
+ "//tensorflow/compiler/xla:status",
+ "//tensorflow/compiler/xla:status_macros",
"//tensorflow/compiler/xla:statusor",
"//tensorflow/compiler/xla:types",
"//tensorflow/compiler/xla:util",
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.h b/tensorflow/compiler/xla/service/hlo_instruction.h
index cda8b07c61..1bd0cca945 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.h
+++ b/tensorflow/compiler/xla/service/hlo_instruction.h
@@ -222,7 +222,7 @@ class HloInstruction {
tensorflow::gtl::ArraySlice<int64> strides);
// Creates a slice instruction, where the first operand is sliced by
- // start indices specified in the second operand, and by size specified in
+ // start indices specified in the second operand, and by size specfied in
// 'slice_sizes'.
static std::unique_ptr<HloInstruction> CreateDynamicSlice(
const Shape& shape, HloInstruction* operand,
diff --git a/tensorflow/compiler/xla/service/hlo_instruction_test.cc b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
index 76b12fc8d3..070bb4bc42 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
@@ -792,8 +792,8 @@ TEST_F(HloInstructionTest, ComplexFusionOp) {
// sub = Sub(mul, clamp)
// tuple = Tuple({sub, sub, mul, C1})
//
- // Notable complexities are repeated operands in the same instruction,
- // different shapes, use of value in different expressions.
+ // Notable complexities are repeated operands in a same instruction, different
+ // shapes, use of value in different expressions.
auto c1 = builder.AddInstruction(
HloInstruction::CreateConstant(Literal::CreateR0<float>(1.1f)));
auto c2 = builder.AddInstruction(
diff --git a/tensorflow/contrib/batching/BUILD b/tensorflow/contrib/batching/BUILD
index a111cfecb3..8b7df4a84c 100644
--- a/tensorflow/contrib/batching/BUILD
+++ b/tensorflow/contrib/batching/BUILD
@@ -82,7 +82,6 @@ cc_library(
tf_cc_test(
name = "adaptive_shared_batch_scheduler_test",
srcs = ["adaptive_shared_batch_scheduler_test.cc"],
- tags = ["manual"], # b/69013768
deps = [
":adaptive_shared_batch_scheduler",
"//tensorflow/contrib/batching/test_util:fake_clock_env",
diff --git a/tensorflow/contrib/batching/kernels/batch_kernels.cc b/tensorflow/contrib/batching/kernels/batch_kernels.cc
index 6041d8c9b2..3b7c538fcc 100644
--- a/tensorflow/contrib/batching/kernels/batch_kernels.cc
+++ b/tensorflow/contrib/batching/kernels/batch_kernels.cc
@@ -461,7 +461,7 @@ class BatchResource : public ResourceBase {
return Status::OK();
}
- // Looks up the batcher queue for 'queue_name'. If it didn't previously exist,
+ // Looks up the batcher queue for 'queue_name'. If it did't previously exist,
// creates it.
Status LookupOrCreateBatcherQueue(const string& queue_name,
BatcherQueue** queue) {
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py
index 2e94b7206d..8c6a614beb 100644
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py
+++ b/tensorflow/contrib/bayesflow/python/kernel_tests/csiszar_divergence_test.py
@@ -759,7 +759,7 @@ class CsiszarVIMCOTest(test.TestCase):
def _csiszar_vimco_helper_grad(self, logu, delta):
"""Finite difference approximation of `grad(csiszar_vimco_helper, logu)`."""
- # This code actually estimates the sum of the Jacobiab because that's what
+ # This code actually estimates the sum of the Jacobiab because thats what
# TF's `gradients` does.
np_log_avg_u1, np_log_sooavg_u1 = self._csiszar_vimco_helper(
logu[..., None] + np.diag([delta]*len(logu)))
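
The perturbation above builds an n-by-n batch in which column j is logu with coordinate j bumped by delta, so a single forward pass of the helper yields all finite-difference quotients at once. A toy version of the same trick against log-mean-exp, whose exact gradient is the softmax (illustrative, not taken from the test):

    import numpy as np

    def log_avg(u):
        # Reduces over axis 0, producing one output per column.
        return np.log(np.mean(np.exp(u), axis=0))

    logu = np.random.randn(4)
    delta = 1e-6

    # Column j of `perturbed` is logu with coordinate j bumped by delta.
    perturbed = logu[..., None] + np.diag([delta] * len(logu))
    grad_fd = (log_avg(perturbed) - log_avg(logu[..., None])) / delta

    # Exact gradient of log-mean-exp is the softmax of logu.
    grad_exact = np.exp(logu) / np.exp(logu).sum()
    np.testing.assert_allclose(grad_fd, grad_exact, rtol=1e-3)
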
diff --git a/tensorflow/contrib/cmake/CMakeLists.txt b/tensorflow/contrib/cmake/CMakeLists.txt
index 77a3fc0c83..8744fc492f 100644
--- a/tensorflow/contrib/cmake/CMakeLists.txt
+++ b/tensorflow/contrib/cmake/CMakeLists.txt
@@ -34,41 +34,13 @@ option(tensorflow_BUILD_SHARED_LIB "Build TensorFlow as a shared library" OFF)
option(tensorflow_OPTIMIZE_FOR_NATIVE_ARCH "Enable compiler optimizations for the native processor architecture (if available)" ON)
option(tensorflow_WIN_CPU_SIMD_OPTIONS "Enables CPU SIMD instructions")
option(tensorflow_ENABLE_SNAPPY_SUPPORT "Enable SNAPPY compression support" ON)
-if(HAIKU)
- option(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE "Enable PIE support" OFF)
-else()
- option(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE "Enable PIE support" ON)
-endif()
-
if (NOT WIN32)
# Threads: defines CMAKE_THREAD_LIBS_INIT and adds -pthread compile option
# for targets that link ${CMAKE_THREAD_LIBS_INIT}.
find_package (Threads)
-
- option(tensorflow_PATH_STATIC_LIB "Additional library search path for libcudnn_static.a, libnccl_static.a, libculibos.a" /usr/local/cuda/lib64/)
- option(tensorflow_CUDNN_INCLUDE "cudnn.h header install path" /usr/include/)
- if (NOT tensorflow_CUDNN_INCLUDE)
- # option's default value is OFF. Fill it with real default values
- set(tensorflow_CUDNN_INCLUDE /usr/include)
- endif (NOT tensorflow_CUDNN_INCLUDE)
- option(tensorflow_PATH_CUDNN_STATIC_LIB "Override PATH_STATIC_LIB for libcudnn_static.a" ${tensorflow_PATH_STATIC_LIB})
- option(tensorflow_PATH_NCCL_STATIC_LIB "Override PATH_STATIC_LIB for libnccl_static.a" ${tensorflow_PATH_STATIC_LIB})
- option(tensorflow_CUDA_LIBRARY_PATH "Designate the default CUDA library paths" /usr/local/cuda/lib64)
- if (NOT tensorflow_CUDA_LIBRARY_PATH)
- # option's default value is OFF. Fill it with real default values
- set(tensorflow_CUDA_LIBRARY_PATH /usr/local/cuda/lib64)
- endif (NOT tensorflow_CUDA_LIBRARY_PATH)
endif()
-if (WIN32)
- set(BOOL_WIN32 ON)
-else (WIN32)
- set(BOOL_WIN32 OFF)
- set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fPIC")
- set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC")
-endif (WIN32)
-
# [CLEANUP] Remove when done
# For debugging
function(SHOW_VARIABLES)
@@ -86,12 +58,7 @@ set (DOWNLOAD_LOCATION "${CMAKE_CURRENT_BINARY_DIR}/downloads"
CACHE PATH "Location where external projects will be downloaded.")
mark_as_advanced(DOWNLOAD_LOCATION)
-if (tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- set(CMAKE_POSITION_INDEPENDENT_CODE ON)
-else()
- set(CMAKE_POSITION_INDEPENDENT_CODE OFF)
-endif()
-
+set(CMAKE_POSITION_INDEPENDENT_CODE ON)
add_definitions(-DEIGEN_AVOID_STL_ARRAY)
if(WIN32)
add_definitions(-DNOMINMAX -D_WIN32_WINNT=0x0A00 -DLANG_CXX11 -DCOMPILER_MSVC)
@@ -250,35 +217,20 @@ endif()
if(UNIX)
list(APPEND tensorflow_EXTERNAL_LIBRARIES ${CMAKE_THREAD_LIBS_INIT} ${CMAKE_DL_LIBS})
endif()
-if(HAIKU)
- list(APPEND tensorflow_EXTERNAL_LIBRARIES network)
-endif()
if (tensorflow_ENABLE_GPU)
- if (NOT WIN32)
- # Default install paths for cuda libraries in Linux
- # In some Linux distros, find_package(CUDA) seems to require CMAKE_LIBRARY_PATH to include cuda-lib paths
- list(APPEND CMAKE_LIBRARY_PATH "${tensorflow_CUDA_LIBRARY_PATH}")
- list(APPEND CMAKE_LIBRARY_PATH "${tensorflow_CUDA_LIBRARY_PATH}/stubs")
- endif (NOT WIN32)
-
- find_package(CUDA 8.0 REQUIRED)
-
- # by default we assume compute cabability 3.5 and 5.2. If you change this change it in
- # CUDA_NVCC_FLAGS and cuda_config.h below
- set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_30,code=\"sm_30,compute_30\";-gencode arch=compute_35,code=\"sm_35,compute_35\";-gencode arch=compute_52,code=\"sm_52,compute_52\")
- set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};--include-path ${PROJECT_BINARY_DIR}/$\{build_configuration\};--expt-relaxed-constexpr)
- set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-ftz=true) # Flush denormals to zero
- set(CUDA_INCLUDE ${CUDA_TOOLKIT_TARGET_DIR} ${CUDA_TOOLKIT_TARGET_DIR}/extras/CUPTI/include)
- include_directories(${CUDA_INCLUDE})
if (WIN32)
+ find_package(CUDA 8.0 REQUIRED)
+
+ # by default we assume compute cabability 3.5 and 5.2. If you change this change it in
+ # CUDA_NVCC_FLAGS and cuda_config.h below
+ set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_30,code=\"sm_30,compute_30\";-gencode arch=compute_35,code=\"sm_35,compute_35\";-gencode arch=compute_52,code=\"sm_52,compute_52\")
+ set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};--include-path ${PROJECT_BINARY_DIR}/$\{build_configuration\};--expt-relaxed-constexpr)
+ set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-ftz=true) # Flush denormals to zero
+ set(CUDA_INCLUDE ${CUDA_TOOLKIT_TARGET_DIR} ${CUDA_TOOLKIT_TARGET_DIR}/extras/CUPTI/include)
+ include_directories(${CUDA_INCLUDE})
add_definitions(-DGOOGLE_CUDA=1 -DTF_EXTRA_CUDA_CAPABILITIES=3.0,3.5,5.2)
- else (WIN32)
- # Without these double quotes, cmake in Linux makes it "-DTF_EXTRA_CUDA_CAPABILITIES=3.0, -D3.5, -D5.2" for cc, which incurs build breaks
- add_definitions(-DGOOGLE_CUDA=1 -D"TF_EXTRA_CUDA_CAPABILITIES=3.0,3.5,5.2")
- endif (WIN32)
- if (WIN32)
# add cudnn
if(NOT CUDNN_HOME)
set(CUDNN_HOME ${CUDA_TOOLKIT_TARGET_DIR})
@@ -286,48 +238,18 @@ if (tensorflow_ENABLE_GPU)
include_directories(${CUDNN_HOME})
set(CUDA_LIBRARIES ${CUDA_LIBRARIES} ${CUDA_CUDA_LIBRARY} ${CUDA_CUBLAS_LIBRARIES} ${CUDA_CUFFT_LIBRARIES}
${CUDA_curand_LIBRARY} ${CUDA_cupti_LIBRARY} ${CUDA_cusolver_LIBRARY} ${CUDNN_HOME}/lib/x64/cudnn.lib)
- else (WIN32)
- set(CUDNN_INCLUDE "${tensorflow_CUDNN_INCLUDE}")
-
- find_library(nccl_STATIC_LIBRARY NAMES libnccl_static.a PATHS ${tensorflow_PATH_NCCL_STATIC_LIB} ${CUDA_TOOLKIT_ROOT_DIR})
- if (NOT nccl_STATIC_LIBRARY)
- message(FATAL_ERROR "NCCL is required for GPU-build")
- else (NOT nccl_STATIC_LIBRARY)
- message("nccl-static: ${nccl_STATIC_LIBRARY}")
- # something like /usr/lib64/libnccl_static.a
- endif (NOT nccl_STATIC_LIBRARY)
-
- find_library(cudnn_STATIC_LIBRARY NAMES libcudnn_static.a PATHS ${tensorflow_PATH_CUDNN_STATIC_LIB} ${CUDA_TOOLKIT_ROOT_DIR})
- if (NOT cudnn_STATIC_LIBRARY)
- message(FATAL_ERROR "CUDNN is required for GPU-build")
- else (NOT cudnn_STATIC_LIBRARY)
- message("cudnn-static: ${cudnn_STATIC_LIBRARY}")
- endif (NOT cudnn_STATIC_LIBRARY)
- find_library(culibos_STATIC_LIBRARY NAMES libculibos.a PATHS ${tensorflow_PATH_STATIC_LIB} ${CUDA_TOOLKIT_ROOT_DIR})
- if (NOT culibos_STATIC_LIBRARY)
- message(FATAL_ERROR "CULIBOS is required for GPU-build")
- else (NOT culibos_STATIC_LIBRARY)
- message("culibos-static: ${culibos_STATIC_LIBRARY}")
- endif (NOT culibos_STATIC_LIBRARY)
-
- include_directories(${CUDNN_INCLUDE})
- set(CUDA_LIBRARIES ${CUDA_LIBRARIES} ${CUDA_CUDA_LIBRARY} ${CUDA_CUBLAS_LIBRARIES} ${CUDA_CUFFT_LIBRARIES}
- ${CUDA_curand_LIBRARY} ${CUDA_cupti_LIBRARY} ${CUDA_cusolver_LIBRARY} ${cudnn_STATIC_LIBRARY} ${culibos_STATIC_LIBRARY} ${nccl_STATIC_LIBRARY})
- endif (WIN32)
-
- # create cuda_config.h
- FILE(WRITE ${tensorflow_source_dir}/third_party/gpus/cuda/cuda_config.h
- "#ifndef CUDA_CUDA_CONFIG_H_\n"
- "#define CUDA_CUDA_CONFIG_H_\n"
- "#define TF_CUDA_CAPABILITIES CudaVersion(\"3.0\"),CudaVersion(\"3.5\"),CudaVersion(\"5.2\")\n"
- "#define TF_CUDA_VERSION \"64_80\"\n"
- "#define TF_CUDNN_VERSION \"64_6\"\n"
- "#define TF_CUDA_TOOLKIT_PATH \"${CUDA_TOOLKIT_ROOT_DIR}\"\n"
- "#endif // CUDA_CUDA_CONFIG_H_\n"
- )
+ # create cuda_config.h
+ FILE(WRITE ${tensorflow_source_dir}/third_party/gpus/cuda/cuda_config.h
+ "#ifndef CUDA_CUDA_CONFIG_H_\n"
+ "#define CUDA_CUDA_CONFIG_H_\n"
+ "#define TF_CUDA_CAPABILITIES CudaVersion(\"3.0\"),CudaVersion(\"3.5\"),CudaVersion(\"5.2\")\n"
+ "#define TF_CUDA_VERSION \"64_80\"\n"
+ "#define TF_CUDNN_VERSION \"64_6\"\n"
+ "#define TF_CUDA_TOOLKIT_PATH \"${CUDA_TOOLKIT_ROOT_DIR}\"\n"
+ "#endif // CUDA_CUDA_CONFIG_H_\n"
+ )
- if (WIN32)
# tf assumes in various places header files to be in cuda/include. On windows the cuda sdk
# installs them under cuda/version/include and to avoid that we need to change tf we copy a
# few files to cuda/include
@@ -339,25 +261,12 @@ if (tensorflow_ENABLE_GPU)
${CUDA_TOOLKIT_TARGET_DIR}/include/cusolverDn.h
DESTINATION ${tensorflow_source_dir}/third_party/gpus/cuda/include
)
- else(WIN32)
- # Linux has slightly differnt install paths than Windows
- FILE(COPY
- ${CUDA_TOOLKIT_TARGET_DIR}/include/cuda.h ${CUDA_TOOLKIT_TARGET_DIR}/include/cuComplex.h
- ${CUDA_TOOLKIT_TARGET_DIR}/include/cublas_v2.h ${CUDNN_INCLUDE}/cudnn.h
- ${CUDA_TOOLKIT_TARGET_DIR}/include/cufft.h ${CUDA_TOOLKIT_TARGET_DIR}/include/curand.h
- ${CUDA_TOOLKIT_TARGET_DIR}/include/cuda_runtime_api.h
- ${CUDA_TOOLKIT_TARGET_DIR}/include/cusolverDn.h
- DESTINATION ${tensorflow_source_dir}/third_party/gpus/cuda/include
- )
- endif(WIN32)
+ include_directories(${tensorflow_source_dir}/third_party/gpus)
+ # add cuda libraries to tensorflow_EXTERNAL_LIBRARIES
+ list(APPEND tensorflow_EXTERNAL_LIBRARIES ${CUDA_LIBRARIES})
- include_directories(${tensorflow_source_dir}/third_party/gpus)
- # add cuda libraries to tensorflow_EXTERNAL_LIBRARIES
- list(APPEND tensorflow_EXTERNAL_LIBRARIES ${CUDA_LIBRARIES})
-
- # NOTE(mrry): Update these flags when the version of CUDA or cuDNN used
- # in the default build is upgraded.
- if(WIN32)
+ # NOTE(mrry): Update these flags when the version of CUDA or cuDNN used
+ # in the default build is upgraded.
set(tensorflow_BUILD_INFO_FLAGS --build_config cuda --key_value
msvcp_dll_name=msvcp140.dll
cudart_dll_name=cudart64_80.dll
@@ -366,9 +275,7 @@ if (tensorflow_ENABLE_GPU)
cudnn_dll_name=cudnn64_6.dll
cudnn_version_number=6)
else(WIN32)
- set(tensorflow_BUILD_INFO_FLAGS --build_config cuda --key_value
- cuda_version_number=8.0
- cudnn_version_number=6)
+ message(FATAL_ERROR "CMake GPU build is currently only supported on Windows.")
endif(WIN32)
else(tensorflow_ENABLE_GPU)
set(tensorflow_BUILD_INFO_FLAGS --build_config cpu --key_value
@@ -386,7 +293,9 @@ include(tf_core_framework.cmake)
# NOTE: Disabled until issue #3996 is fixed.
# include(tf_stream_executor.cmake)
if (tensorflow_ENABLE_GPU)
+ if (WIN32)
include(tf_stream_executor.cmake)
+ endif()
endif()
include(tf_core_cpu.cmake)
diff --git a/tensorflow/contrib/cmake/external/boringssl.cmake b/tensorflow/contrib/cmake/external/boringssl.cmake
index cca8444e2a..dc27eadaca 100644
--- a/tensorflow/contrib/cmake/external/boringssl.cmake
+++ b/tensorflow/contrib/cmake/external/boringssl.cmake
@@ -39,12 +39,8 @@ ExternalProject_Add(boringssl
# BUILD_IN_SOURCE 1
INSTALL_COMMAND ""
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
)
diff --git a/tensorflow/contrib/cmake/external/jsoncpp.cmake b/tensorflow/contrib/cmake/external/jsoncpp.cmake
index d2ae4c76e8..5127d7e8f7 100644
--- a/tensorflow/contrib/cmake/external/jsoncpp.cmake
+++ b/tensorflow/contrib/cmake/external/jsoncpp.cmake
@@ -42,12 +42,8 @@ ExternalProject_Add(jsoncpp
BUILD_IN_SOURCE 1
INSTALL_COMMAND ""
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
)
diff --git a/tensorflow/contrib/cmake/external/lmdb.cmake b/tensorflow/contrib/cmake/external/lmdb.cmake
index e41384f023..79971b7cfc 100644
--- a/tensorflow/contrib/cmake/external/lmdb.cmake
+++ b/tensorflow/contrib/cmake/external/lmdb.cmake
@@ -29,14 +29,10 @@ ExternalProject_Add(lmdb
INSTALL_DIR ${lmdb_INSTALL}
DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
-DCMAKE_INSTALL_PREFIX:STRING=${lmdb_INSTALL}
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
)
if(WIN32)
diff --git a/tensorflow/contrib/cmake/external/png.cmake b/tensorflow/contrib/cmake/external/png.cmake
index aad6618f52..2b2bd47d1c 100644
--- a/tensorflow/contrib/cmake/external/png.cmake
+++ b/tensorflow/contrib/cmake/external/png.cmake
@@ -41,14 +41,10 @@ ExternalProject_Add(png
INSTALL_DIR ${png_INSTALL}
DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
-DCMAKE_INSTALL_PREFIX:STRING=${png_INSTALL}
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
-DZLIB_ROOT:STRING=${ZLIB_INSTALL}
)
diff --git a/tensorflow/contrib/cmake/external/protobuf.cmake b/tensorflow/contrib/cmake/external/protobuf.cmake
index b53857a47b..1e300e21df 100644
--- a/tensorflow/contrib/cmake/external/protobuf.cmake
+++ b/tensorflow/contrib/cmake/external/protobuf.cmake
@@ -44,12 +44,8 @@ ExternalProject_Add(protobuf
${PROTOBUF_ADDITIONAL_CMAKE_OPTIONS}
INSTALL_COMMAND ""
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
-DZLIB_ROOT:STRING=${ZLIB_INSTALL}
)
diff --git a/tensorflow/contrib/cmake/external/re2.cmake b/tensorflow/contrib/cmake/external/re2.cmake
index b56f4b0898..cb4ec9c2de 100644
--- a/tensorflow/contrib/cmake/external/re2.cmake
+++ b/tensorflow/contrib/cmake/external/re2.cmake
@@ -38,11 +38,7 @@ ExternalProject_Add(re2
BUILD_IN_SOURCE 1
DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_INSTALL_PREFIX:STRING=${re2_INSTALL}
-)
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
+)
\ No newline at end of file
diff --git a/tensorflow/contrib/cmake/external/snappy.cmake b/tensorflow/contrib/cmake/external/snappy.cmake
index 926c271fd9..2d2451521c 100644
--- a/tensorflow/contrib/cmake/external/snappy.cmake
+++ b/tensorflow/contrib/cmake/external/snappy.cmake
@@ -40,15 +40,11 @@ ExternalProject_Add(snappy
LOG_CONFIGURE ON
LOG_BUILD ON
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
-DSNAPPY_BUILD_TESTS:BOOL=OFF
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
)
# actually enables snappy in the source code
-add_definitions(-DTF_USE_SNAPPY)
\ No newline at end of file
+add_definitions(-DTF_USE_SNAPPY)
diff --git a/tensorflow/contrib/cmake/external/sqlite.cmake b/tensorflow/contrib/cmake/external/sqlite.cmake
index 785039a469..1770dcb1fd 100644
--- a/tensorflow/contrib/cmake/external/sqlite.cmake
+++ b/tensorflow/contrib/cmake/external/sqlite.cmake
@@ -53,13 +53,9 @@ else()
INSTALL_DIR ${sqlite_INSTALL}
DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
-DCMAKE_INSTALL_PREFIX:STRING=${sqlite_INSTALL}
)
diff --git a/tensorflow/contrib/cmake/external/zlib.cmake b/tensorflow/contrib/cmake/external/zlib.cmake
index f10f84336e..c8af611e1e 100644
--- a/tensorflow/contrib/cmake/external/zlib.cmake
+++ b/tensorflow/contrib/cmake/external/zlib.cmake
@@ -42,13 +42,9 @@ ExternalProject_Add(zlib
BUILD_IN_SOURCE 1
DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
CMAKE_CACHE_ARGS
- if(tensorflow_ENABLE_POSITION_INDEPENDENT_CODE)
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
- else()
- -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=OFF
- endif()
-DCMAKE_BUILD_TYPE:STRING=Release
-DCMAKE_INSTALL_PREFIX:STRING=${ZLIB_INSTALL}
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
)
# put zlib includes in the directory where they are expected
diff --git a/tensorflow/contrib/cmake/tf_cc_ops.cmake b/tensorflow/contrib/cmake/tf_cc_ops.cmake
index 6e2ac203f9..45eeb11062 100644
--- a/tensorflow/contrib/cmake/tf_cc_ops.cmake
+++ b/tensorflow/contrib/cmake/tf_cc_ops.cmake
@@ -148,11 +148,7 @@ list(REMOVE_ITEM tf_cc_srcs ${tf_cc_test_srcs})
add_library(tf_cc OBJECT ${tf_cc_srcs})
add_dependencies(tf_cc tf_cc_framework tf_cc_ops)
-if (WIN32)
- set (pywrap_tensorflow_lib "${CMAKE_CURRENT_BINARY_DIR}/${CMAKE_BUILD_TYPE}/pywrap_tensorflow_internal.lib")
-else (WIN32)
- set (pywrap_tensorflow_lib "${CMAKE_CURRENT_BINARY_DIR}/libpywrap_tensorflow_internal.so")
-endif (WIN32)
+set (pywrap_tensorflow_lib "${CMAKE_CURRENT_BINARY_DIR}/${CMAKE_BUILD_TYPE}/pywrap_tensorflow_internal.lib")
add_custom_target(tf_extension_ops)
function(AddUserOps)
@@ -168,13 +164,15 @@ function(AddUserOps)
# create shared library from source and cuda obj
add_library(${_AT_TARGET} SHARED ${_AT_SOURCES} ${gpu_lib})
target_link_libraries(${_AT_TARGET} ${pywrap_tensorflow_lib})
- if (tensorflow_ENABLE_GPU AND _AT_GPUSOURCES)
- # some ops call out to cuda directly; need to link libs for the cuda dlls
- target_link_libraries(${_AT_TARGET} ${CUDA_LIBRARIES})
- endif()
- if (_AT_DISTCOPY)
- add_custom_command(TARGET ${_AT_TARGET} POST_BUILD
- COMMAND ${CMAKE_COMMAND} -E copy $<TARGET_FILE:${_AT_TARGET}> ${_AT_DISTCOPY}/)
+ if(WIN32)
+ if (tensorflow_ENABLE_GPU AND _AT_GPUSOURCES)
+ # some ops call out to cuda directly; need to link libs for the cuda dlls
+ target_link_libraries(${_AT_TARGET} ${CUDA_LIBRARIES})
+ endif()
+ if (_AT_DISTCOPY)
+ add_custom_command(TARGET ${_AT_TARGET} POST_BUILD
+ COMMAND ${CMAKE_COMMAND} -E copy $<TARGET_FILE:${_AT_TARGET}> ${_AT_DISTCOPY}/)
+ endif()
endif()
if (_AT_DEPENDS)
add_dependencies(${_AT_TARGET} ${_AT_DEPENDS})
@@ -182,19 +180,9 @@ function(AddUserOps)
# make sure TF_COMPILE_LIBRARY is not defined for this target
get_target_property(target_compile_flags ${_AT_TARGET} COMPILE_FLAGS)
if(target_compile_flags STREQUAL "target_compile_flags-NOTFOUND")
- if (WIN32)
- set(target_compile_flags "/UTF_COMPILE_LIBRARY")
- else (WIN32)
- # gcc uses UTF as default
- set(target_compile_flags "-finput-charset=UTF-8")
- endif (WIN32)
+ set(target_compile_flags "/UTF_COMPILE_LIBRARY")
else()
- if (WIN32)
- set(target_compile_flags "${target_compile_flags} /UTF_COMPILE_LIBRARY")
- else (WIN32)
- # gcc uses UTF as default
- set(target_compile_flags "${target_compile_flags} -finput-charset=UTF-8")
- endif (WIN32)
+ set(target_compile_flags "${target_compile_flags} /UTF_COMPILE_LIBRARY")
endif()
set_target_properties(${_AT_TARGET} PROPERTIES COMPILE_FLAGS ${target_compile_flags})
add_dependencies(tf_extension_ops ${_AT_TARGET})
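
AddUserOps links each user-op library against pywrap_tensorflow_internal and, when DISTCOPY is given, copies the result next to the corresponding Python package in the build tree. At runtime such a library would typically be pulled in with tf.load_op_library; a sketch under those assumptions (the path and op name below are illustrative, modeled on the _beam_search_ops target defined later in tf_python.cmake):

    import tensorflow as tf

    # Illustrative path: DISTCOPY places the built library under the
    # package's python/ops directory in the CMake build tree.
    beam_search_module = tf.load_op_library(
        "tf_python/tensorflow/contrib/seq2seq/python/ops/_beam_search_ops.so")
    # Generated wrappers for the ops registered by the library become
    # attributes of the module, e.g. beam_search_module.gather_tree(...).
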
diff --git a/tensorflow/contrib/cmake/tf_core_kernels.cmake b/tensorflow/contrib/cmake/tf_core_kernels.cmake
index 2d015908a8..d6b8990664 100644
--- a/tensorflow/contrib/cmake/tf_core_kernels.cmake
+++ b/tensorflow/contrib/cmake/tf_core_kernels.cmake
@@ -179,7 +179,6 @@ file(GLOB_RECURSE tf_core_gpu_kernels_srcs
"${tensorflow_source_dir}/tensorflow/contrib/image/kernels/*.cu.cc"
"${tensorflow_source_dir}/tensorflow/contrib/rnn/kernels/*.cu.cc"
"${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/*.cu.cc"
- "${tensorflow_source_dir}/tensorflow/contrib/resampler/kernels/*.cu.cc"
)
if(WIN32 AND tensorflow_ENABLE_GPU)
@@ -203,16 +202,16 @@ endif(WIN32 AND tensorflow_ENABLE_GPU)
add_library(tf_core_kernels OBJECT ${tf_core_kernels_srcs})
add_dependencies(tf_core_kernels tf_core_cpu)
-if (WIN32)
+if(WIN32)
target_compile_options(tf_core_kernels PRIVATE /MP)
-endif (WIN32)
-if (tensorflow_ENABLE_GPU)
- set_source_files_properties(${tf_core_gpu_kernels_srcs} PROPERTIES CUDA_SOURCE_PROPERTY_FORMAT OBJ)
- set(tf_core_gpu_kernels_lib tf_core_gpu_kernels)
- cuda_add_library(${tf_core_gpu_kernels_lib} ${tf_core_gpu_kernels_srcs})
- set_target_properties(${tf_core_gpu_kernels_lib}
- PROPERTIES DEBUG_POSTFIX ""
- COMPILE_FLAGS "${TF_REGULAR_CXX_FLAGS}"
- )
- add_dependencies(${tf_core_gpu_kernels_lib} tf_core_cpu)
+ if (tensorflow_ENABLE_GPU)
+ set_source_files_properties(${tf_core_gpu_kernels_srcs} PROPERTIES CUDA_SOURCE_PROPERTY_FORMAT OBJ)
+ set(tf_core_gpu_kernels_lib tf_core_gpu_kernels)
+ cuda_add_library(${tf_core_gpu_kernels_lib} ${tf_core_gpu_kernels_srcs})
+ set_target_properties(${tf_core_gpu_kernels_lib}
+ PROPERTIES DEBUG_POSTFIX ""
+ COMPILE_FLAGS "${TF_REGULAR_CXX_FLAGS}"
+ )
+ add_dependencies(${tf_core_gpu_kernels_lib} tf_core_cpu)
+ endif()
endif()
diff --git a/tensorflow/contrib/cmake/tf_label_image_example.cmake b/tensorflow/contrib/cmake/tf_label_image_example.cmake
index 7f2f60b089..0d3a4699eb 100644
--- a/tensorflow/contrib/cmake/tf_label_image_example.cmake
+++ b/tensorflow/contrib/cmake/tf_label_image_example.cmake
@@ -34,8 +34,3 @@ target_link_libraries(tf_label_image_example PUBLIC
${tf_core_gpu_kernels_lib}
${tensorflow_EXTERNAL_LIBRARIES}
)
-
-install(TARGETS tf_label_image_example
- RUNTIME DESTINATION bin
- LIBRARY DESTINATION lib
- ARCHIVE DESTINATION lib)
\ No newline at end of file
diff --git a/tensorflow/contrib/cmake/tf_python.cmake b/tensorflow/contrib/cmake/tf_python.cmake
index 61b3fd715d..9b863f7bc6 100755
--- a/tensorflow/contrib/cmake/tf_python.cmake
+++ b/tensorflow/contrib/cmake/tf_python.cmake
@@ -715,9 +715,6 @@ function(GENERATE_PYTHON_OP_LIB tf_python_op_lib_name)
set(require_shape_fn 1)
endif()
- get_filename_component(GENERATE_PYTHON_OP_LIB_MKDIRPATH ${GENERATE_PYTHON_OP_LIB_DESTINATION} PATH)
- file(MAKE_DIRECTORY ${GENERATE_PYTHON_OP_LIB_MKDIRPATH})
-
# Create a C++ executable that links in the appropriate op
# registrations and generates Python wrapper code based on the
# registered ops.
@@ -746,7 +743,6 @@ function(GENERATE_PYTHON_OP_LIB tf_python_op_lib_name)
${GENERATE_PYTHON_OP_LIB_DESTINATION} PARENT_SCOPE)
endfunction()
-GENERATE_PYTHON_OP_LIB("audio_ops")
GENERATE_PYTHON_OP_LIB("array_ops")
GENERATE_PYTHON_OP_LIB("bitwise_ops")
GENERATE_PYTHON_OP_LIB("math_ops")
@@ -991,7 +987,7 @@ add_library(pywrap_tensorflow_internal SHARED
$<TARGET_OBJECTS:tf_tools_transform_graph_lib>
$<$<BOOL:${tensorflow_ENABLE_GRPC_SUPPORT}>:$<TARGET_OBJECTS:tf_core_distributed_runtime>>
$<TARGET_OBJECTS:tf_core_kernels>
- $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<$<BOOL:${BOOL_WIN32}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>>
+ $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>
$<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_stream_executor>>
${pywrap_tensorflow_deffile}
)
@@ -1067,23 +1063,25 @@ if(WIN32)
DISTCOPY ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/rnn/python/ops/)
endif(WIN32)
-# include contrib/seq2seq as .so
-#
-set(tf_beam_search_srcs
- "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/beam_search_ops.cc"
- "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/beam_search_ops.h"
- "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/ops/beam_search_ops.cc"
-)
+if(WIN32)
+ # include contrib/seq2seq as .so
+ #
+ set(tf_beam_search_srcs
+ "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/beam_search_ops.cc"
+ "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/beam_search_ops.h"
+ "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/ops/beam_search_ops.cc"
+ )
-set(tf_beam_search_gpu_srcs
- "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/beam_search_ops_gpu.cu.cc"
-)
+ set(tf_beam_search_gpu_srcs
+ "${tensorflow_source_dir}/tensorflow/contrib/seq2seq/kernels/beam_search_ops_gpu.cu.cc"
+ )
-AddUserOps(TARGET _beam_search_ops
- SOURCES "${tf_beam_search_srcs}"
- GPUSOURCES ${tf_beam_search_gpu_srcs}
- DEPENDS pywrap_tensorflow_internal tf_python_ops
- DISTCOPY ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/seq2seq/python/ops/)
+ AddUserOps(TARGET _beam_search_ops
+ SOURCES "${tf_beam_search_srcs}"
+ GPUSOURCES ${tf_beam_search_gpu_srcs}
+ DEPENDS pywrap_tensorflow_internal tf_python_ops
+ DISTCOPY ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/seq2seq/python/ops/)
+endif(WIN32)
############################################################
# Build a PIP package containing the TensorFlow runtime.
diff --git a/tensorflow/contrib/cmake/tf_shared_lib.cmake b/tensorflow/contrib/cmake/tf_shared_lib.cmake
index 3e3fe0cdfa..9bf45bab30 100644
--- a/tensorflow/contrib/cmake/tf_shared_lib.cmake
+++ b/tensorflow/contrib/cmake/tf_shared_lib.cmake
@@ -73,7 +73,7 @@ add_library(tensorflow SHARED
$<TARGET_OBJECTS:tf_tools_transform_graph_lib>
$<$<BOOL:${tensorflow_ENABLE_GRPC_SUPPORT}>:$<TARGET_OBJECTS:tf_core_distributed_runtime>>
$<TARGET_OBJECTS:tf_core_kernels>
- $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<$<BOOL:${BOOL_WIN32}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>>
+ $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>
$<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_stream_executor>>
${tensorflow_deffile}
)
@@ -94,46 +94,3 @@ endif()
if(WIN32)
add_dependencies(tensorflow tensorflow_static)
endif(WIN32)
-
-install(TARGETS tensorflow
- RUNTIME DESTINATION bin
- LIBRARY DESTINATION lib
- ARCHIVE DESTINATION lib)
-
-# install necessary headers
-# tensorflow headers
-install(DIRECTORY ${tensorflow_source_dir}/tensorflow/cc/
- DESTINATION include/tensorflow/cc
- FILES_MATCHING PATTERN "*.h")
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/tensorflow/cc/
- DESTINATION include/tensorflow/cc
- FILES_MATCHING PATTERN "*.h")
-install(DIRECTORY ${tensorflow_source_dir}/tensorflow/core/
- DESTINATION include/tensorflow/core
- FILES_MATCHING PATTERN "*.h")
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/tensorflow/core/
- DESTINATION include/tensorflow/core
- FILES_MATCHING PATTERN "*.h")
-install(DIRECTORY ${tensorflow_source_dir}/tensorflow/stream_executor/
- DESTINATION include/tensorflow/stream_executor
- FILES_MATCHING PATTERN "*.h")
-# google protobuf headers
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/src/google/
- DESTINATION include/google
- FILES_MATCHING PATTERN "*.h")
-# nsync headers
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/external/nsync/
- DESTINATION include/external/nsync
- FILES_MATCHING PATTERN "*.h")
-# Eigen directory
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/eigen/src/eigen/Eigen/
- DESTINATION include/Eigen)
-# external directory
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/external/eigen_archive/
- DESTINATION include/external/eigen_archive)
-# third_party eigen directory
-install(DIRECTORY ${tensorflow_source_dir}/third_party/eigen3/
- DESTINATION include/third_party/eigen3)
-# unsupported Eigen directory
-install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/eigen/src/eigen/unsupported/Eigen/
- DESTINATION include/unsupported/Eigen)
diff --git a/tensorflow/contrib/cmake/tf_stream_executor.cmake b/tensorflow/contrib/cmake/tf_stream_executor.cmake
index 8d95f0d3e8..3d84f1ebb9 100644
--- a/tensorflow/contrib/cmake/tf_stream_executor.cmake
+++ b/tensorflow/contrib/cmake/tf_stream_executor.cmake
@@ -74,9 +74,6 @@ endif()
#)
#list(REMOVE_ITEM tf_stream_executor_srcs ${tf_stream_executor_test_srcs})
-if (NOT WIN32)
- set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lgomp")
-endif (NOT WIN32)
add_library(tf_stream_executor OBJECT ${tf_stream_executor_srcs})
add_dependencies(tf_stream_executor
diff --git a/tensorflow/contrib/cmake/tf_tools.cmake b/tensorflow/contrib/cmake/tf_tools.cmake
index cb58a2e7df..6ef9598963 100644
--- a/tensorflow/contrib/cmake/tf_tools.cmake
+++ b/tensorflow/contrib/cmake/tf_tools.cmake
@@ -73,7 +73,7 @@ add_executable(${transform_graph}
$<TARGET_OBJECTS:tf_core_direct_session>
$<TARGET_OBJECTS:tf_tools_transform_graph_lib>
$<TARGET_OBJECTS:tf_core_kernels>
- $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<$<BOOL:${BOOL_WIN32}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>>
+ $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>
$<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_stream_executor>>
)
@@ -95,7 +95,7 @@ add_executable(${summarize_graph}
$<TARGET_OBJECTS:tf_core_direct_session>
$<TARGET_OBJECTS:tf_tools_transform_graph_lib>
$<TARGET_OBJECTS:tf_core_kernels>
- $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<$<BOOL:${BOOL_WIN32}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>>
+ $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>
$<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_stream_executor>>
)
@@ -117,7 +117,7 @@ add_executable(${compare_graphs}
$<TARGET_OBJECTS:tf_core_direct_session>
$<TARGET_OBJECTS:tf_tools_transform_graph_lib>
$<TARGET_OBJECTS:tf_core_kernels>
- $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<$<BOOL:${BOOL_WIN32}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>>
+ $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>
$<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_stream_executor>>
)
@@ -138,7 +138,7 @@ add_executable(${benchmark_model}
$<TARGET_OBJECTS:tf_core_ops>
$<TARGET_OBJECTS:tf_core_direct_session>
$<TARGET_OBJECTS:tf_core_kernels>
- $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<$<BOOL:${BOOL_WIN32}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>>
+ $<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_core_kernels_cpu_only>>
$<$<BOOL:${tensorflow_ENABLE_GPU}>:$<TARGET_OBJECTS:tf_stream_executor>>
)
@@ -147,8 +147,3 @@ target_link_libraries(${benchmark_model} PUBLIC
${tf_core_gpu_kernels_lib}
${tensorflow_EXTERNAL_LIBRARIES}
)
-
-install(TARGETS ${transform_graph} ${summarize_graph} ${compare_graphs} ${benchmark_model}
- RUNTIME DESTINATION bin
- LIBRARY DESTINATION lib
- ARCHIVE DESTINATION lib)
diff --git a/tensorflow/contrib/cmake/tf_tutorials.cmake b/tensorflow/contrib/cmake/tf_tutorials.cmake
index e63fccc181..858e7dda92 100644
--- a/tensorflow/contrib/cmake/tf_tutorials.cmake
+++ b/tensorflow/contrib/cmake/tf_tutorials.cmake
@@ -34,8 +34,3 @@ target_link_libraries(tf_tutorials_example_trainer PUBLIC
${tf_core_gpu_kernels_lib}
${tensorflow_EXTERNAL_LIBRARIES}
)
-
-install(TARGETS tf_tutorials_example_trainer
- RUNTIME DESTINATION bin
- LIBRARY DESTINATION lib
- ARCHIVE DESTINATION lib)
diff --git a/tensorflow/contrib/crf/python/ops/crf.py b/tensorflow/contrib/crf/python/ops/crf.py
index 1612c75179..4282be5ec8 100644
--- a/tensorflow/contrib/crf/python/ops/crf.py
+++ b/tensorflow/contrib/crf/python/ops/crf.py
@@ -363,8 +363,8 @@ class CrfDecodeForwardRnnCell(rnn_cell.RNNCell):
scope: Unused variable scope of this cell.
Returns:
- backpointers: A [batch_size, num_tags] matrix of backpointers.
- new_state: A [batch_size, num_tags] matrix of new score values.
+ backpointers: [batch_size, num_tags], containing backpointers.
+ new_state: [batch_size, num_tags], containing new score values.
"""
# For simplicity, in shape comments, denote:
# 'batch_size' by 'B', 'max_seq_len' by 'T' , 'num_tags' by 'O' (output).
@@ -404,9 +404,8 @@ class CrfDecodeBackwardRnnCell(rnn_cell.RNNCell):
"""Build the CrfDecodeBackwardRnnCell.
Args:
- inputs: A [batch_size, num_tags] matrix of
- backpointer of next step (in time order).
- state: A [batch_size, 1] matrix of tag index of next step.
+ inputs: [batch_size, num_tags], backpointer of next step (in time order).
+ state: [batch_size, 1], next position's tag index.
scope: Unused variable scope of this cell.
Returns:
@@ -430,16 +429,16 @@ def crf_decode(potentials, transition_params, sequence_length):
This is a function that operates on tensors.
Args:
- potentials: A [batch_size, max_seq_len, num_tags] tensor of
+ potentials: A [batch_size, max_seq_len, num_tags] tensor, matrix of
unary potentials.
- transition_params: A [num_tags, num_tags] matrix of
+ transition_params: A [num_tags, num_tags] tensor, matrix of
binary potentials.
- sequence_length: A [batch_size] vector of true sequence lengths.
+ sequence_length: A [batch_size] tensor, containing sequence lengths.
Returns:
- decode_tags: A [batch_size, max_seq_len] matrix, with dtype `tf.int32`.
+ decode_tags: A [batch_size, max_seq_len] tensor, with dtype tf.int32.
Contains the highest scoring tag indices.
- best_score: A [batch_size] vector, containing the score of `decode_tags`.
+ best_score: A [batch_size] tensor, containing the score of decode_tags.
"""
# For simplicity, in shape comments, denote:
# 'batch_size' by 'B', 'max_seq_len' by 'T' , 'num_tags' by 'O' (output).
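For readers cross-checking the docstring shapes above, a minimal usage sketch of `crf_decode` follows; the batch size, lengths, and values below are illustrative assumptions, not taken from this change:

```python
import numpy as np
import tensorflow as tf

# Toy shapes: batch_size B=2, max_seq_len T=5, num_tags O=4.
potentials = tf.constant(np.random.rand(2, 5, 4), dtype=tf.float32)
transition_params = tf.constant(np.random.rand(4, 4), dtype=tf.float32)
sequence_length = tf.constant([5, 3])

decode_tags, best_score = tf.contrib.crf.crf_decode(
    potentials, transition_params, sequence_length)
# decode_tags: [B, T] tensor with dtype tf.int32; highest scoring tag indices.
# best_score:  [B] tensor containing the score of decode_tags.
```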
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index dd0457d54b..1923c0586a 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -11,7 +11,6 @@ py_test(
size = "small",
srcs = ["batch_dataset_op_test.py"],
srcs_version = "PY2AND3",
- tags = ["no_pip"],
deps = [
":dataset_serialization_test",
"//tensorflow/contrib/data/python/ops:dataset_ops",
@@ -373,7 +372,6 @@ py_test(
size = "small",
srcs = ["sequence_dataset_op_test.py"],
srcs_version = "PY2AND3",
- tags = ["no_pip"],
deps = [
":dataset_serialization_test",
"//tensorflow/contrib/data/python/ops:dataset_ops",
@@ -451,7 +449,6 @@ py_test(
size = "small",
srcs = ["zip_dataset_op_test.py"],
srcs_version = "PY2AND3",
- tags = ["no_pip"],
deps = [
":dataset_serialization_test",
"//tensorflow/contrib/data/python/ops:dataset_ops",
@@ -468,10 +465,7 @@ py_test(
size = "small",
srcs = ["prefetching_ops_test.py"],
srcs_version = "PY2AND3",
- tags = [
- "manual",
- "no_oss", # b/68785503
- ],
+ tags = ["no_oss"], # b/68785503
deps = [
"//tensorflow/contrib/data/python/ops:prefetching_py",
"//tensorflow/core:protos_all_py",
diff --git a/tensorflow/contrib/distributions/BUILD b/tensorflow/contrib/distributions/BUILD
index 145b9495ff..2dc8ad9483 100644
--- a/tensorflow/contrib/distributions/BUILD
+++ b/tensorflow/contrib/distributions/BUILD
@@ -141,23 +141,6 @@ cuda_py_test(
)
cuda_py_test(
- name = "cauchy_test",
- size = "medium",
- srcs = ["python/kernel_tests/cauchy_test.py"],
- additional_deps = [
- ":distributions_py",
- "//third_party/py/numpy",
- "//tensorflow/python:array_ops",
- "//tensorflow/python:client_testlib",
- "//tensorflow/python:framework_for_generated_wrappers",
- "//tensorflow/python:framework_test_lib",
- "//tensorflow/python:gradients",
- "//tensorflow/python:platform_test",
- "//tensorflow/python:variables",
- ],
-)
-
-cuda_py_test(
name = "chi2_test",
srcs = ["python/kernel_tests/chi2_test.py"],
additional_deps = [
diff --git a/tensorflow/contrib/distributions/__init__.py b/tensorflow/contrib/distributions/__init__.py
index 0d12d83893..16f6533e57 100644
--- a/tensorflow/contrib/distributions/__init__.py
+++ b/tensorflow/contrib/distributions/__init__.py
@@ -24,7 +24,6 @@ from __future__ import print_function
from tensorflow.contrib.distributions.python.ops import bijectors
from tensorflow.contrib.distributions.python.ops.binomial import *
-from tensorflow.contrib.distributions.python.ops.cauchy import *
from tensorflow.contrib.distributions.python.ops.chi2 import *
from tensorflow.contrib.distributions.python.ops.conditional_distribution import *
from tensorflow.contrib.distributions.python.ops.conditional_transformed_distribution import *
@@ -84,7 +83,6 @@ from tensorflow.python.util.all_util import remove_undocumented
_allowed_symbols = [
'bijectors',
- 'Cauchy',
'ConditionalDistribution',
'ConditionalTransformedDistribution',
'FULLY_REPARAMETERIZED',
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/cauchy_test.py b/tensorflow/contrib/distributions/python/kernel_tests/cauchy_test.py
deleted file mode 100644
index 7f7697357c..0000000000
--- a/tensorflow/contrib/distributions/python/kernel_tests/cauchy_test.py
+++ /dev/null
@@ -1,437 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for Cauchy."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import importlib
-import numpy as np
-
-from tensorflow.contrib.distributions.python.ops import cauchy as cauchy_lib
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gradients_impl
-from tensorflow.python.ops import variables
-from tensorflow.python.platform import test
-from tensorflow.python.platform import tf_logging
-
-
-def try_import(name): # pylint: disable=invalid-name
- module = None
- try:
- module = importlib.import_module(name)
- except ImportError as e:
- tf_logging.warning("Could not import %s: %s" % (name, str(e)))
- return module
-
-stats = try_import("scipy.stats")
-
-
-class CauchyTest(test.TestCase):
-
- def setUp(self):
- self._rng = np.random.RandomState(123)
-
- def assertAllFinite(self, tensor):
- is_finite = np.isfinite(tensor.eval())
- all_true = np.ones_like(is_finite, dtype=np.bool)
- self.assertAllEqual(all_true, is_finite)
-
- def _testParamShapes(self, sample_shape, expected):
- with self.test_session():
- param_shapes = cauchy_lib.Cauchy.param_shapes(sample_shape)
- loc_shape, scale_shape = param_shapes["loc"], param_shapes["scale"]
- self.assertAllEqual(expected, loc_shape.eval())
- self.assertAllEqual(expected, scale_shape.eval())
- loc = array_ops.zeros(loc_shape)
- scale = array_ops.ones(scale_shape)
- self.assertAllEqual(
- expected,
- array_ops.shape(cauchy_lib.Cauchy(loc, scale).sample()).eval())
-
- def _testParamStaticShapes(self, sample_shape, expected):
- param_shapes = cauchy_lib.Cauchy.param_static_shapes(sample_shape)
- loc_shape, scale_shape = param_shapes["loc"], param_shapes["scale"]
- self.assertEqual(expected, loc_shape)
- self.assertEqual(expected, scale_shape)
-
- def testParamShapes(self):
- sample_shape = [10, 3, 4]
- self._testParamShapes(sample_shape, sample_shape)
- self._testParamShapes(constant_op.constant(sample_shape), sample_shape)
-
- def testParamStaticShapes(self):
- sample_shape = [10, 3, 4]
- self._testParamStaticShapes(sample_shape, sample_shape)
- self._testParamStaticShapes(
- tensor_shape.TensorShape(sample_shape), sample_shape)
-
- def testCauchyLogPDF(self):
- with self.test_session():
- batch_size = 6
- loc = constant_op.constant([3.0] * batch_size)
- scale = constant_op.constant([np.sqrt(10.0)] * batch_size)
- x = np.array([-2.5, 2.5, 4.0, 0.0, -1.0, 2.0], dtype=np.float32)
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- log_pdf = cauchy.log_prob(x)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(),
- log_pdf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(),
- log_pdf.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, log_pdf.shape)
- self.assertAllEqual(cauchy.batch_shape, log_pdf.eval().shape)
-
- pdf = cauchy.prob(x)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), pdf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), pdf.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, pdf.shape)
- self.assertAllEqual(cauchy.batch_shape, pdf.eval().shape)
-
- if not stats:
- return
- expected_log_pdf = stats.cauchy(loc.eval(), scale.eval()).logpdf(x)
- self.assertAllClose(expected_log_pdf, log_pdf.eval())
- self.assertAllClose(np.exp(expected_log_pdf), pdf.eval())
-
- def testCauchyLogPDFMultidimensional(self):
- with self.test_session():
- batch_size = 6
- loc = constant_op.constant([[3.0, -3.0]] * batch_size)
- scale = constant_op.constant([[np.sqrt(10.0), np.sqrt(15.0)]] *
- batch_size)
- x = np.array([[-2.5, 2.5, 4.0, 0.0, -1.0, 2.0]], dtype=np.float32).T
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- log_pdf = cauchy.log_prob(x)
- log_pdf_values = log_pdf.eval()
- self.assertEqual(log_pdf.shape, (6, 2))
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(),
- log_pdf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(),
- log_pdf.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, log_pdf.shape)
- self.assertAllEqual(cauchy.batch_shape, log_pdf.eval().shape)
-
- pdf = cauchy.prob(x)
- pdf_values = pdf.eval()
- self.assertEqual(pdf.shape, (6, 2))
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), pdf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), pdf_values.shape)
- self.assertAllEqual(cauchy.batch_shape, pdf.shape)
- self.assertAllEqual(cauchy.batch_shape, pdf_values.shape)
-
- if not stats:
- return
- expected_log_pdf = stats.cauchy(loc.eval(), scale.eval()).logpdf(x)
- self.assertAllClose(expected_log_pdf, log_pdf_values)
- self.assertAllClose(np.exp(expected_log_pdf), pdf_values)
-
- def testCauchyCDF(self):
- with self.test_session():
- batch_size = 50
- loc = self._rng.randn(batch_size)
- scale = self._rng.rand(batch_size) + 1.0
- x = np.linspace(-8.0, 8.0, batch_size).astype(np.float64)
-
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
- cdf = cauchy.cdf(x)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), cdf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), cdf.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, cdf.shape)
- self.assertAllEqual(cauchy.batch_shape, cdf.eval().shape)
- if not stats:
- return
- expected_cdf = stats.cauchy(loc, scale).cdf(x)
- self.assertAllClose(expected_cdf, cdf.eval(), atol=0)
-
- def testCauchySurvivalFunction(self):
- with self.test_session():
- batch_size = 50
- loc = self._rng.randn(batch_size)
- scale = self._rng.rand(batch_size) + 1.0
- x = np.linspace(-8.0, 8.0, batch_size).astype(np.float64)
-
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- sf = cauchy.survival_function(x)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), sf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), sf.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, sf.shape)
- self.assertAllEqual(cauchy.batch_shape, sf.eval().shape)
- if not stats:
- return
- expected_sf = stats.cauchy(loc, scale).sf(x)
- self.assertAllClose(expected_sf, sf.eval(), atol=0)
-
- def testCauchyLogCDF(self):
- with self.test_session():
- batch_size = 50
- loc = self._rng.randn(batch_size)
- scale = self._rng.rand(batch_size) + 1.0
- x = np.linspace(-100.0, 10.0, batch_size).astype(np.float64)
-
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- cdf = cauchy.log_cdf(x)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), cdf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), cdf.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, cdf.shape)
- self.assertAllEqual(cauchy.batch_shape, cdf.eval().shape)
-
- if not stats:
- return
- expected_cdf = stats.cauchy(loc, scale).logcdf(x)
- self.assertAllClose(expected_cdf, cdf.eval(), atol=0, rtol=1e-5)
-
- def testFiniteGradientAtDifficultPoints(self):
- for dtype in [np.float32, np.float64]:
- g = ops.Graph()
- with g.as_default():
- loc = variables.Variable(dtype(0.0))
- scale = variables.Variable(dtype(1.0))
- dist = cauchy_lib.Cauchy(loc=loc, scale=scale)
- x = np.array([-100., -20., -5., 0., 5., 20., 100.]).astype(dtype)
- for func in [
- dist.cdf, dist.log_cdf, dist.survival_function,
- dist.log_survival_function, dist.log_prob, dist.prob
- ]:
- value = func(x)
- grads = gradients_impl.gradients(value, [loc, scale])
- with self.test_session(graph=g):
- variables.global_variables_initializer().run()
- self.assertAllFinite(value)
- self.assertAllFinite(grads[0])
- self.assertAllFinite(grads[1])
-
- def testCauchyLogSurvivalFunction(self):
- with self.test_session():
- batch_size = 50
- loc = self._rng.randn(batch_size)
- scale = self._rng.rand(batch_size) + 1.0
- x = np.linspace(-10.0, 100.0, batch_size).astype(np.float64)
-
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- sf = cauchy.log_survival_function(x)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), sf.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), sf.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, sf.shape)
- self.assertAllEqual(cauchy.batch_shape, sf.eval().shape)
-
- if not stats:
- return
- expected_sf = stats.cauchy(loc, scale).logsf(x)
- self.assertAllClose(expected_sf, sf.eval(), atol=0, rtol=1e-5)
-
- def testCauchyEntropy(self):
- with self.test_session():
- loc = np.array([1.0, 1.0, 1.0])
- scale = np.array([[1.0, 2.0, 3.0]])
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- entropy = cauchy.entropy()
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(),
- entropy.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(),
- entropy.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, entropy.shape)
- self.assertAllEqual(cauchy.batch_shape, entropy.eval().shape)
-
- if not stats:
- return
- expected_entropy = stats.cauchy(loc, scale).entropy()
- self.assertAllClose(expected_entropy, entropy.eval())
-
- def testCauchyMode(self):
- with self.test_session():
- # Mu will be broadcast to [7, 7, 7].
- loc = [7.]
- scale = [11., 12., 13.]
-
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- self.assertAllEqual((3,), cauchy.mode().shape)
- self.assertAllEqual([7., 7, 7], cauchy.mode().eval())
-
- def testCauchyMean(self):
- with self.test_session():
- loc = [1., 2., 3.]
- scale = [7.]
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- self.assertAllEqual((3,), cauchy.mean().shape)
- self.assertAllEqual([np.nan] * 3, cauchy.mean().eval())
-
- def testCauchyNanMean(self):
- with self.test_session():
- loc = [1., 2., 3.]
- scale = [7.]
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale, allow_nan_stats=False)
-
- with self.assertRaises(ValueError):
- cauchy.mean().eval()
-
- def testCauchyQuantile(self):
- with self.test_session():
- batch_size = 50
- loc = self._rng.randn(batch_size)
- scale = self._rng.rand(batch_size) + 1.0
- p = np.linspace(0.000001, 0.999999, batch_size).astype(np.float64)
-
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
- x = cauchy.quantile(p)
-
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), x.shape)
- self.assertAllEqual(cauchy.batch_shape_tensor().eval(), x.eval().shape)
- self.assertAllEqual(cauchy.batch_shape, x.shape)
- self.assertAllEqual(cauchy.batch_shape, x.eval().shape)
-
- if not stats:
- return
- expected_x = stats.cauchy(loc, scale).ppf(p)
- self.assertAllClose(expected_x, x.eval(), atol=0.)
-
- def testCauchyVariance(self):
- with self.test_session():
- # scale will be broadcast to [7, 7, 7]
- loc = [1., 2., 3.]
- scale = [7.]
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- self.assertAllEqual((3,), cauchy.variance().shape)
- self.assertAllEqual([np.nan] * 3, cauchy.variance().eval())
-
- def testCauchyNanVariance(self):
- with self.test_session():
- # scale will be broadcast to [7, 7, 7]
- loc = [1., 2., 3.]
- scale = [7.]
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale, allow_nan_stats=False)
-
- with self.assertRaises(ValueError):
- cauchy.variance().eval()
-
- def testCauchyStandardDeviation(self):
- with self.test_session():
- # scale will be broadcast to [7, 7, 7]
- loc = [1., 2., 3.]
- scale = [7.]
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- self.assertAllEqual((3,), cauchy.stddev().shape)
- self.assertAllEqual([np.nan] * 3, cauchy.stddev().eval())
-
- def testCauchyNanStandardDeviation(self):
- with self.test_session():
- # scale will be broadcast to [7, 7, 7]
- loc = [1., 2., 3.]
- scale = [7.]
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale, allow_nan_stats=False)
-
- with self.assertRaises(ValueError):
- cauchy.stddev().eval()
-
- def testCauchySample(self):
- with self.test_session():
- loc = constant_op.constant(3.0)
- scale = constant_op.constant(1.0)
- loc_v = 3.0
- n = constant_op.constant(100000)
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
- samples = cauchy.sample(n)
- sample_values = samples.eval()
-
- self.assertEqual(sample_values.shape, (100000,))
- self.assertAllClose(np.median(sample_values), loc_v, atol=1e-1)
-
- expected_shape = tensor_shape.TensorShape([n.eval()]).concatenate(
- tensor_shape.TensorShape(cauchy.batch_shape_tensor().eval()))
-
- self.assertAllEqual(expected_shape, samples.shape)
- self.assertAllEqual(expected_shape, sample_values.shape)
-
- expected_shape = (tensor_shape.TensorShape(
- [n.eval()]).concatenate(cauchy.batch_shape))
-
- self.assertAllEqual(expected_shape, samples.shape)
- self.assertAllEqual(expected_shape, sample_values.shape)
-
- def testCauchySampleMultiDimensional(self):
- with self.test_session():
- batch_size = 2
- loc = constant_op.constant([[3.0, -3.0]] * batch_size)
- scale = constant_op.constant([[0.5, 1.0]] * batch_size)
- loc_v = [3.0, -3.0]
- n = constant_op.constant(100000)
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
- samples = cauchy.sample(n)
- sample_values = samples.eval()
- self.assertEqual(samples.shape, (100000, batch_size, 2))
- self.assertAllClose(np.median(sample_values[:, 0, 0]),
- loc_v[0], atol=1e-1)
- self.assertAllClose(np.median(sample_values[:, 0, 1]),
- loc_v[1], atol=1e-1)
-
- expected_shape = tensor_shape.TensorShape([n.eval()]).concatenate(
- tensor_shape.TensorShape(cauchy.batch_shape_tensor().eval()))
- self.assertAllEqual(expected_shape, samples.shape)
- self.assertAllEqual(expected_shape, sample_values.shape)
-
- expected_shape = (tensor_shape.TensorShape(
- [n.eval()]).concatenate(cauchy.batch_shape))
- self.assertAllEqual(expected_shape, samples.shape)
- self.assertAllEqual(expected_shape, sample_values.shape)
-
-  def testCauchyNegativeScaleFails(self):
- with self.test_session():
- cauchy = cauchy_lib.Cauchy(loc=[1.], scale=[-5.], validate_args=True)
- with self.assertRaisesOpError("Condition x > 0 did not hold"):
- cauchy.mode().eval()
-
- def testCauchyShape(self):
- with self.test_session():
- loc = constant_op.constant([-3.0] * 5)
- scale = constant_op.constant(11.0)
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- self.assertEqual(cauchy.batch_shape_tensor().eval(), [5])
- self.assertEqual(cauchy.batch_shape, tensor_shape.TensorShape([5]))
- self.assertAllEqual(cauchy.event_shape_tensor().eval(), [])
- self.assertEqual(cauchy.event_shape, tensor_shape.TensorShape([]))
-
- def testCauchyShapeWithPlaceholders(self):
- loc = array_ops.placeholder(dtype=dtypes.float32)
- scale = array_ops.placeholder(dtype=dtypes.float32)
- cauchy = cauchy_lib.Cauchy(loc=loc, scale=scale)
-
- with self.test_session() as sess:
- # get_batch_shape should return an "<unknown>" tensor.
- self.assertEqual(cauchy.batch_shape, tensor_shape.TensorShape(None))
- self.assertEqual(cauchy.event_shape, ())
- self.assertAllEqual(cauchy.event_shape_tensor().eval(), [])
- self.assertAllEqual(
- sess.run(cauchy.batch_shape_tensor(),
- feed_dict={loc: 5.0,
- scale: [1.0, 2.0]}), [2])
-
-
-if __name__ == "__main__":
- test.main()
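The deleted test file guards its SciPy comparisons with the `try_import` helper; here is a compact, self-contained sketch of that optional-dependency pattern (using the standard-library logger rather than `tf_logging`):

```python
import importlib
import logging

def try_import(name):
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        logging.warning("Could not import %s: %s", name, e)
        return None

stats = try_import("scipy.stats")
if stats:
    print(stats.cauchy(0.0, 1.0).cdf(0.0))  # 0.5 at the median
```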
diff --git a/tensorflow/contrib/distributions/python/ops/cauchy.py b/tensorflow/contrib/distributions/python/ops/cauchy.py
deleted file mode 100644
index a17bb091f6..0000000000
--- a/tensorflow/contrib/distributions/python/ops/cauchy.py
+++ /dev/null
@@ -1,223 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""The Cauchy distribution class."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import check_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import random_ops
-from tensorflow.python.ops.distributions import distribution
-
-
-__all__ = [
- "Cauchy",
-]
-
-
-class Cauchy(distribution.Distribution):
- """The Cauchy distribution with location `loc` and scale `scale`.
-
- #### Mathematical details
-
- The probability density function (pdf) is,
-
- ```none
- pdf(x; loc, scale) = 1 / (pi * scale * (1 + ((x - loc) / scale)**2))
- ```
- where `loc` is the location, and `scale` is the scale.
-
- The Cauchy distribution is a member of the [location-scale family](
- https://en.wikipedia.org/wiki/Location-scale_family), i.e.
-
- ```none
- X ~ Cauchy(loc=0, scale=1)
- Y ~ Cauchy(loc=loc, scale=scale)
- Y = loc + scale * X
- ```
-
- #### Examples
-
- Examples of initialization of one or a batch of distributions.
-
- ```python
- # Define a single scalar Cauchy distribution.
- dist = Cauchy(loc=0., scale=3.)
-
- # Evaluate the cdf at 1, returning a scalar.
- dist.cdf(1.)
-
- # Define a batch of two scalar valued Cauchy distributions.
- dist = Cauchy(loc=[1, 2.], scale=[11, 22.])
-
- # Evaluate the pdf of the first distribution on 0, and the second on 1.5,
- # returning a length two tensor.
- dist.prob([0, 1.5])
-
- # Get 3 samples, returning a 3 x 2 tensor.
- dist.sample([3])
- ```
-
- Arguments are broadcast when possible.
-
- ```python
- # Define a batch of two scalar valued Cauchy distributions.
- # Both have median 1, but different scales.
- dist = tf.contrib.distributions.Cauchy(loc=1., scale=[11, 22.])
- # Evaluate the pdf of both distributions on the same point, 3.0,
- # returning a length 2 tensor.
- dist.prob(3.0)
- ```
- """
-
- def __init__(self,
- loc,
- scale,
- validate_args=False,
- allow_nan_stats=True,
- name="Cauchy"):
- """Construct Cauchy distributions with loc and and scale `loc` and `scale`.
-
- The parameters `loc` and `scale` must be shaped in a way that supports
- broadcasting (e.g. `loc + scale` is a valid operation).
-
- Args:
- loc: Floating point tensor; the modes of the distribution(s).
-      scale: Floating point tensor; the scales of the distribution(s).
- Must contain only positive values.
- validate_args: Python `bool`, default `False`. When `True` distribution
- parameters are checked for validity despite possibly degrading runtime
- performance. When `False` invalid inputs may silently render incorrect
- outputs.
- allow_nan_stats: Python `bool`, default `True`. When `True`,
- statistics (e.g., mean, mode, variance) use the value "`NaN`" to
- indicate the result is undefined. When `False`, an exception is raised
- if one or more of the statistic's batch members are undefined.
- name: Python `str` name prefixed to Ops created by this class.
-
- Raises:
- TypeError: if `loc` and `scale` have different `dtype`.
- """
- parameters = locals()
- with ops.name_scope(name, values=[loc, scale]):
- with ops.control_dependencies([check_ops.assert_positive(scale)] if
- validate_args else []):
- self._loc = array_ops.identity(loc, name="loc")
- self._scale = array_ops.identity(scale, name="scale")
- check_ops.assert_same_float_dtype([self._loc, self._scale])
- super(Cauchy, self).__init__(
- dtype=self._scale.dtype,
- reparameterization_type=distribution.FULLY_REPARAMETERIZED,
- validate_args=validate_args,
- allow_nan_stats=allow_nan_stats,
- parameters=parameters,
- graph_parents=[self._loc, self._scale],
- name=name)
-
- @staticmethod
- def _param_shapes(sample_shape):
- return dict(
- zip(("loc", "scale"), ([ops.convert_to_tensor(
- sample_shape, dtype=dtypes.int32)] * 2)))
-
- @property
- def loc(self):
- """Distribution parameter for the mean."""
- return self._loc
-
- @property
- def scale(self):
- """Distribution parameter for standard deviation."""
- return self._scale
-
- def _batch_shape_tensor(self):
- return array_ops.broadcast_dynamic_shape(
- array_ops.shape(self.loc),
- array_ops.shape(self.scale))
-
- def _batch_shape(self):
- return array_ops.broadcast_static_shape(
- self.loc.shape,
- self.scale.shape)
-
- def _event_shape_tensor(self):
- return constant_op.constant([], dtype=dtypes.int32)
-
- def _event_shape(self):
- return tensor_shape.scalar()
-
- def _sample_n(self, n, seed=None):
- shape = array_ops.concat([[n], self.batch_shape_tensor()], 0)
- probs = random_ops.random_uniform(
- shape=shape, minval=0., maxval=1., dtype=self.dtype, seed=seed)
- return self._quantile(probs)
-
- def _log_prob(self, x):
- return self._log_unnormalized_prob(x) - self._log_normalization()
-
- def _cdf(self, x):
- return math_ops.atan(self._z(x)) / np.pi + 0.5
-
- def _log_cdf(self, x):
- return math_ops.log1p(2 / np.pi * math_ops.atan(self._z(x))) - np.log(2)
-
- def _log_unnormalized_prob(self, x):
- return -math_ops.log1p(math_ops.square(self._z(x)))
-
- def _log_normalization(self):
- return np.log(np.pi) + math_ops.log(self.scale)
-
- def _entropy(self):
- h = np.log(4 * np.pi) + math_ops.log(self.scale)
- return h * array_ops.ones_like(self.loc)
-
- def _quantile(self, p):
- return self.loc + self.scale * math_ops.tan(np.pi * (p - 0.5))
-
- def _mode(self):
- return self.loc * array_ops.ones_like(self.scale)
-
- def _z(self, x):
- """Standardize input `x`."""
- with ops.name_scope("standardize", values=[x]):
- return (x - self.loc) / self.scale
-
- def _inv_z(self, z):
- """Reconstruct input `x` from a its normalized version."""
- with ops.name_scope("reconstruct", values=[z]):
- return z * self.scale + self.loc
-
- def _mean(self):
- if self.allow_nan_stats:
- return array_ops.fill(self.batch_shape_tensor(),
- self.dtype.as_numpy_dtype(np.nan))
- else:
- raise ValueError("`mean` is undefined for Cauchy distribution.")
-
- def _stddev(self):
- if self.allow_nan_stats:
- return array_ops.fill(self.batch_shape_tensor(),
- self.dtype.as_numpy_dtype(np.nan))
- else:
- raise ValueError("`stddev` is undefined for Cauchy distribution.")
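The removed `_sample_n` draws uniforms and maps them through `_quantile`, i.e. inverse-CDF sampling. A rough NumPy sketch of that idea (an illustration of the math, not the removed TensorFlow code):

```python
import numpy as np

def sample_cauchy(loc, scale, n, rng=None):
    """Inverse-CDF sampling: loc + scale * tan(pi * (u - 0.5)), u ~ U(0, 1)."""
    rng = rng or np.random.RandomState(123)
    u = rng.uniform(size=n)
    return loc + scale * np.tan(np.pi * (u - 0.5))

samples = sample_cauchy(loc=3.0, scale=1.0, n=100000)
# The sample median approaches loc even though the Cauchy mean is undefined.
print(np.median(samples))  # roughly 3.0
```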
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb
index 459f2f4a7d..01616f2e7d 100644
--- a/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb
+++ b/tensorflow/contrib/eager/python/examples/notebooks/1_basics.ipynb
@@ -429,9 +429,7 @@
"cpu_tensor = tf.random_normal([SIZE, SIZE])\n",
"\n",
"if is_gpu_available:\n",
- " gpu_tensor = cpu_tensor.gpu()\n",
- "else:\n",
- " print(\"GPU not available.\")"
+ " gpu_tensor = cpu_tensor.gpu()"
]
},
{
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
index e6c7c11733..3b7e2cd435 100644
--- a/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
+++ b/tensorflow/contrib/eager/python/examples/notebooks/2_gradients.ipynb
@@ -383,7 +383,7 @@
"\n",
"`implicit_value_and_gradients()` returns a function that accepts the same inputs as the function passed in, and returns a tuple consisting of:\n",
"\n",
- "1. the value returned by the function passed in (in this case, the loss calculated by `loss_fn()`), and\n",
+ "1. the value returned by the function passed in (in this case, the loss calculated by `calculate_linear_model_loss()`), and\n",
"1. a list of tuples consisting of:\n",
" 1. The value of the gradient (a `tf.Tensor`) with respect to a given variable\n",
" 1. The corresponding variable (`tf.Variable`)\n",
@@ -698,7 +698,7 @@
"source": [
"## Other Ways to Compute Gradients\n",
"\n",
- "Using our loss function as an example (`loss_fn()`), there are several other ways we could compute gradients:\n",
+ "Using our loss function as an example (`calculate_linear_model_loss()`), there are several other ways we could compute gradients:\n",
"\n",
"1. `tfe.implicit_gradients()`\n",
"1. `tfe.gradients_function()`\n",
@@ -841,7 +841,7 @@
"# tfe.implicit_value_and_gradients() demo\n",
"value_gradients_fn = tfe.implicit_value_and_gradients(loss_fn)\n",
"\n",
- "# Returns the value returned by the function passed in, gradients, and variables:\n",
+ "# Returns only gradients:\n",
"value_gradients_fn(inputs, labels, wb)"
]
}
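As a rough sketch of the `implicit_value_and_gradients()` behavior discussed in this notebook (`calculate_linear_model_loss` is the notebook's own function; the stand-in loss and eager setup below are assumptions, and the eager API location varies across contrib-era releases):

```python
import tensorflow as tf
import tensorflow.contrib.eager as tfe

tfe.enable_eager_execution()

w = tfe.Variable(2.0)
b = tfe.Variable(1.0)

def loss_fn(inputs, labels):
    # Simple stand-in for the notebook's linear-model loss.
    return tf.reduce_mean(tf.square(w * inputs + b - labels))

value_gradients_fn = tfe.implicit_value_and_gradients(loss_fn)
loss, grads_and_vars = value_gradients_fn(tf.constant([1.0, 2.0]),
                                          tf.constant([3.0, 5.0]))
# loss: the value returned by loss_fn.
# grads_and_vars: list of (gradient, tf.Variable) pairs.
```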
diff --git a/tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb b/tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb
index 0088da5c4b..ebcc7027c1 100644
--- a/tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb
+++ b/tensorflow/contrib/eager/python/examples/notebooks/3_datasets.ipynb
@@ -9,7 +9,7 @@
"source": [
"# Eager Execution Tutorial: Importing Data\n",
"\n",
- "This notebook demonstrates the use of the [`tf.data.Dataset` API](https://www.tensorflow.org/programmers_guide/datasets) to build pipelines to feed data to your program. It covers:\n",
+ "This notebook demonstrates the use of the [`tf.contrib.data.Dataset` API](https://www.tensorflow.org/programmers_guide/datasets) to build pipelines to feed data to your program. It covers:\n",
"\n",
"* Creating a `Dataset`.\n",
"* Iteration over a `Dataset` with eager execution enabled.\n",
@@ -64,7 +64,7 @@
"source": [
"# Step 1: Create a source `Dataset`\n",
"\n",
- "Create a _source_ dataset using one of the factory functions like [`Dataset.from_tensors`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensors), [`Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) or using objects that read from files like [`TextLineDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset) or [`TFRecordDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset). See the [Programmer's Guide](https://www.google.com/url?sa=D\u0026q=https%3A%2F%2Fwww.tensorflow.org%2Fprogrammers_guide%2Fdatasets%23reading_input_data) for more information."
+ "Create a _source_ dataset using one of the factory functions like [`Dataset.from_tensors`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#from_tensors), [`Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#from_tensor_slices) or using objects that read from files like [`TextLineDataset`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/TextLineDataset) or [`TFRecordDataset`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/TFRecordDataset). See the [Programmer's Guide](https://www.google.com/url?sa=D\u0026q=https%3A%2F%2Fwww.tensorflow.org%2Fprogrammers_guide%2Fdatasets%23reading_input_data) for more information."
]
},
{
@@ -83,7 +83,7 @@
},
"outputs": [],
"source": [
- "ds_tensors = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])\n",
+ "ds_tensors = tf.contrib.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])\n",
"\n",
"# Create a CSV file\n",
"import tempfile\n",
@@ -93,7 +93,7 @@
"Line 2\n",
"Line 3\n",
" \"\"\")\n",
- "ds_file = tf.data.TextLineDataset(filename)\n"
+ "ds_file = tf.contrib.data.TextLineDataset(filename)\n"
]
},
{
@@ -105,7 +105,7 @@
"source": [
"# Step 2: Apply transformations\n",
"\n",
- "Use the transformations functions like [`map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map), [`batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch), [`shuffle`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) etc. to apply transformations to the records of the dataset. See the [API documentation for `tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) for details."
+ "Use the transformations functions like [`map`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#map), [`batch`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#batch), [`shuffle`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#shuffle) etc. to apply transformations to the records of the dataset. See the [API documentation for `tf.contrib.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset) for details."
]
},
{
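For orientation, a compact sketch of the contrib-era pipeline the notebook reverts to, chaining the Step 2 transformations (element values are illustrative):

```python
import tensorflow as tf

ds = tf.contrib.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
ds = ds.map(lambda x: x * x).shuffle(buffer_size=6).batch(2)
# With eager execution enabled the batches can be iterated directly;
# in graph mode an iterator would be created from the dataset.
```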
diff --git a/tensorflow/contrib/layers/python/layers/layers.py b/tensorflow/contrib/layers/python/layers/layers.py
index 9378fe8799..46b3eeae91 100644
--- a/tensorflow/contrib/layers/python/layers/layers.py
+++ b/tensorflow/contrib/layers/python/layers/layers.py
@@ -286,6 +286,7 @@ def _fused_batch_norm(inputs,
ValueError: If the rank of `inputs` is neither 2 or 4.
ValueError: If rank or `C` dimension of `inputs` is undefined.
"""
+ # TODO(reedwm): Add support for fp16 inputs.
if data_format not in (DATA_FORMAT_NCHW, DATA_FORMAT_NHWC):
raise ValueError('data_format has to be either NCHW or NHWC.')
with variable_scope.variable_scope(
@@ -319,10 +320,9 @@ def _fused_batch_norm(inputs,
(inputs.name, params_shape))
# Allocate parameters for the beta and gamma of the normalization.
+ trainable_beta = trainable and center
beta_collections = utils.get_variable_collections(variables_collections,
'beta')
- # Float32 required to avoid precision-loss when using fp16 input/output
- variable_dtype = dtypes.float32
if not param_initializers:
param_initializers = {}
if not param_regularizers:
@@ -336,13 +336,13 @@ def _fused_batch_norm(inputs,
beta = variables.model_variable(
'beta',
shape=params_shape,
- dtype=variable_dtype,
+ dtype=dtype,
initializer=beta_initializer,
regularizer=beta_regularizer,
collections=beta_collections,
- trainable=trainable)
+ trainable=trainable_beta)
else:
- beta = array_ops.constant(0.0, dtype=variable_dtype, shape=params_shape)
+ beta = array_ops.constant(0.0, shape=params_shape)
if scale:
gamma_collections = utils.get_variable_collections(
@@ -352,13 +352,13 @@ def _fused_batch_norm(inputs,
gamma = variables.model_variable(
'gamma',
shape=params_shape,
- dtype=variable_dtype,
+ dtype=dtype,
initializer=gamma_initializer,
regularizer=gamma_regularizer,
collections=gamma_collections,
trainable=trainable)
else:
- gamma = array_ops.constant(1.0, dtype=variable_dtype, shape=params_shape)
+ gamma = array_ops.constant(1.0, shape=params_shape)
# Create moving_mean and moving_variance variables and add them to the
# appropriate collections. We disable variable partitioning while creating
@@ -375,7 +375,7 @@ def _fused_batch_norm(inputs,
moving_mean = variables.model_variable(
'moving_mean',
shape=params_shape,
- dtype=variable_dtype,
+ dtype=dtype,
initializer=moving_mean_initializer,
trainable=False,
collections=moving_mean_collections)
@@ -386,7 +386,7 @@ def _fused_batch_norm(inputs,
moving_variance = variables.model_variable(
'moving_variance',
shape=params_shape,
- dtype=variable_dtype,
+ dtype=dtype,
initializer=moving_variance_initializer,
trainable=False,
collections=moving_variance_collections)
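The hunk above reverts `_fused_batch_norm` to creating beta, gamma, and the moving statistics in the input dtype instead of forcing float32. A minimal call through the public wrapper, mirroring the tests touched below (shapes assumed):

```python
import numpy as np
from tensorflow.contrib.layers.python.layers import layers as _layers

images = np.random.uniform(size=(5, 3, 3, 3)).astype('f')  # NHWC, float32
output = _layers.batch_norm(images, fused=True)
# After this revert, the normalization parameters take the input's dtype.
```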
diff --git a/tensorflow/contrib/layers/python/layers/layers_test.py b/tensorflow/contrib/layers/python/layers/layers_test.py
index 5aa2253516..ff7f0e4462 100644
--- a/tensorflow/contrib/layers/python/layers/layers_test.py
+++ b/tensorflow/contrib/layers/python/layers/layers_test.py
@@ -1774,12 +1774,10 @@ class BatchNormTest(test.TestCase):
with self.assertRaisesRegexp(ValueError, 'undefined'):
_layers.batch_norm(inputs, data_format='NCHW')
- def _testCreateOp(self, fused, dtype=None):
- if dtype is None:
- dtype = dtypes.float32
+ def _testCreateOp(self, fused):
height, width = 3, 3
with self.test_session():
- images = np.random.uniform(size=(5, height, width, 3)).astype(dtype.as_numpy_dtype)
+ images = np.random.uniform(size=(5, height, width, 3)).astype('f')
output = _layers.batch_norm(images, fused=fused)
expected_name = ('BatchNorm/FusedBatchNorm' if fused else
'BatchNorm/batchnorm')
@@ -1794,9 +1792,6 @@ class BatchNormTest(test.TestCase):
def testCreateOpFused(self):
self._testCreateOp(True)
- def testCreateOpFusedFloat16(self):
- self._testCreateOp(True, dtypes.float16)
-
def _testCreateOpBetaRegularizer(self, fused=True):
height, width = 3, 3
with self.test_session():
@@ -2664,68 +2659,10 @@ class BatchNormTest(test.TestCase):
def testBatchNormBeta(self):
# Test case for 11673
with self.test_session() as sess:
- a_32 = array_ops.placeholder(dtypes.float32, shape=(10, 10, 10, 10))
- b_32 = _layers.batch_norm(a_32, center=False, data_format='NCHW',
- zero_debias_moving_mean=True)
- a_16 = array_ops.placeholder(dtypes.float16, shape=(10, 10, 10, 10))
- b_16 = _layers.batch_norm(a_16, center=False, data_format='NCHW',
- zero_debias_moving_mean=True)
- sess.run(variables_lib.global_variables_initializer())
-
- def testVariablesAreFloat32(self):
- height, width = 3, 3
- with self.test_session():
- images = random_ops.random_uniform((5, height, width, 3),
- seed=1, dtype=dtypes.float16)
- _layers.batch_norm(images, scale=True)
- beta = variables.get_variables_by_name('beta')[0]
- gamma = variables.get_variables_by_name('gamma')[0]
- self.assertEqual(beta.dtype, dtypes.float32_ref)
- self.assertEqual(gamma.dtype, dtypes.float32_ref)
- moving_mean = variables.get_variables_by_name('moving_mean')[0]
- moving_variance = variables.get_variables_by_name('moving_variance')[0]
- self.assertEqual(moving_mean.dtype, dtypes.float32_ref)
- self.assertEqual(moving_variance.dtype, dtypes.float32_ref)
-
- def _runFusedBatchNorm(self, shape, dtype):
- channels = shape[1]
- images = np.arange(np.product(shape), dtype=dtype).reshape(shape)
- beta = init_ops.constant_initializer(
- np.arange(
- 2, channels + 2, dtype=np.float32))
- gamma = init_ops.constant_initializer(
- np.arange(
- 10, channels + 10, dtype=np.float32) * 2.0)
- mean = init_ops.constant_initializer(
- np.arange(
- 3, channels + 3, dtype=np.float32) * 5.0)
- variance = init_ops.constant_initializer(
- np.arange(
- 1, channels + 1, dtype=np.float32) * 4.0)
- output = _layers.batch_norm(
- images,
- fused=True,
- is_training=True,
- scale=True,
- epsilon=0.5,
- param_initializers={
- 'beta': beta,
- 'gamma': gamma,
- 'moving_mean': mean,
- 'moving_variance': variance,
- },
- data_format='NCHW')
- with self.test_session(use_gpu=True) as sess:
+ a = array_ops.placeholder(dtypes.float32, shape=(10, 10, 10, 10))
+ b = _layers.batch_norm(a, center=False, data_format='NCHW',
+ zero_debias_moving_mean=True)
sess.run(variables_lib.global_variables_initializer())
- return sess.run(output)
-
- def testFusedBatchNormFloat16MatchesFloat32(self):
- if test.is_gpu_available(cuda_only=True):
- shape = [5, 4, 2, 3]
- res_32 = self._runFusedBatchNorm(shape, np.float32)
- res_16 = self._runFusedBatchNorm(shape, np.float16)
- self.assertAllClose(res_32, res_16, rtol=1e-3)
-
def testAdjustmentCreated(self):
# Tests that the adjustment is appropriately passed to and used by the core
diff --git a/tensorflow/contrib/learn/python/learn/estimators/head.py b/tensorflow/contrib/learn/python/learn/estimators/head.py
index bc0e6fc009..468d792a0d 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/head.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/head.py
@@ -119,7 +119,7 @@ class Head(object):
update_op = tf.contrib.layers.optimize_loss(optimizer=sync,
loss=model_fn_ops.loss, ...)
hooks = [sync.make_session_run_hook(is_chief)]
- ... update train_op and hooks in ModelFnOps and return
+          ... update train_op and hooks in ModelFnOps and return
```
"""
__metaclass__ = abc.ABCMeta
diff --git a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
index 44e6c7c52d..8be9c72adf 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
@@ -23,6 +23,7 @@ import collections
import six
+from tensorflow.contrib import framework as contrib_framework
from tensorflow.contrib.framework import get_graph_from_inputs
from tensorflow.contrib.learn.python.learn.estimators import constants
from tensorflow.contrib.learn.python.learn.estimators import metric_key
@@ -31,7 +32,6 @@ from tensorflow.python.estimator import model_fn as core_model_fn_lib
from tensorflow.python.estimator.export import export_output as core_export_lib
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
-from tensorflow.python.framework import sparse_tensor
from tensorflow.python.framework import tensor_shape
from tensorflow.python.ops import array_ops
from tensorflow.python.platform import tf_logging as logging
@@ -156,11 +156,11 @@ class ModelFnOps(
else:
if isinstance(predictions, dict):
predictions = {
- k: sparse_tensor.convert_to_tensor_or_sparse_tensor(v)
+ k: contrib_framework.convert_to_tensor_or_sparse_tensor(v)
for k, v in six.iteritems(predictions)
}
else:
- predictions = sparse_tensor.convert_to_tensor_or_sparse_tensor(
+ predictions = contrib_framework.convert_to_tensor_or_sparse_tensor(
predictions)
# Validate eval_metric_ops
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py b/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
index db18ebf05d..4c50d40aaa 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
@@ -28,14 +28,13 @@ import six
from six.moves import xrange # pylint: disable=redefined-builtin
from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_util
from tensorflow.python.ops import array_ops
from tensorflow.python.platform import tf_logging as logging
# pylint: disable=g-multiple-import,g-bad-import-order
from .pandas_io import HAS_PANDAS, extract_pandas_data, extract_pandas_matrix, extract_pandas_labels
from .dask_io import HAS_DASK, extract_dask_data, extract_dask_labels
+
# pylint: enable=g-multiple-import,g-bad-import-order
@@ -366,13 +365,8 @@ class DataFeeder(object):
self.random_state = np.random.RandomState(
42) if random_state is None else random_state
- if x_is_dict:
- num_samples = list(self._x.values())[0].shape[0]
- elif tensor_util.is_tensor(self._x):
- num_samples = self._x.shape[0].value # shape will be a Dimension, extract an int
- else:
- num_samples = self._x.shape[0]
-
+ num_samples = list(self._x.values())[0].shape[
+ 0] if x_is_dict else self._x.shape[0]
if self._shuffle:
self.indices = self.random_state.permutation(num_samples)
else:
diff --git a/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py b/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
index 86d8484391..13f2f0f502 100644
--- a/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
+++ b/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
@@ -238,10 +238,10 @@ class SdcaModel(object):
with name_scope('sdca/prediction'):
sparse_variables = self._convert_n_to_tensor(self._variables[
'sparse_features_weights'])
- result_sparse = 0.0
+ result = 0.0
for sfc, sv in zip(examples['sparse_features'], sparse_variables):
# TODO(sibyl-Aix6ihai): following does not take care of missing features.
- result_sparse += math_ops.segment_sum(
+ result += math_ops.segment_sum(
math_ops.multiply(
array_ops.gather(sv, sfc.feature_indices), sfc.feature_values),
sfc.example_indices)
@@ -249,13 +249,12 @@ class SdcaModel(object):
dense_variables = self._convert_n_to_tensor(self._variables[
'dense_features_weights'])
- result_dense = 0.0
for i in range(len(dense_variables)):
- result_dense += math_ops.matmul(
- dense_features[i], array_ops.expand_dims(dense_variables[i], -1))
+ result += math_ops.matmul(dense_features[i],
+ array_ops.expand_dims(dense_variables[i], -1))
# Reshaping to allow shape inference at graph construction time.
- return array_ops.reshape(result_dense, [-1]) + result_sparse
+ return array_ops.reshape(result, [-1])
def predictions(self, examples):
"""Add operations to compute predictions by the model.
diff --git a/tensorflow/contrib/lite/python/BUILD b/tensorflow/contrib/lite/python/BUILD
index 89e8693490..b4aa032ff8 100644
--- a/tensorflow/contrib/lite/python/BUILD
+++ b/tensorflow/contrib/lite/python/BUILD
@@ -23,7 +23,6 @@ py_library(
py_test(
name = "lite_test",
srcs = ["lite_test.py"],
- srcs_version = "PY2AND3",
deps = [
":lite",
"//tensorflow/python:array_ops",
diff --git a/tensorflow/contrib/lite/testing/generate_examples.py b/tensorflow/contrib/lite/testing/generate_examples.py
index b122818221..86540d58a6 100644
--- a/tensorflow/contrib/lite/testing/generate_examples.py
+++ b/tensorflow/contrib/lite/testing/generate_examples.py
@@ -36,10 +36,6 @@ import traceback
import zipfile
import numpy as np
from six import StringIO
-
-# TODO(aselle): Disable GPU for now
-os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
-
import tensorflow as tf
from google.protobuf import text_format
# TODO(aselle): switch to TensorFlow's resource_loader
@@ -383,13 +379,12 @@ def make_zip_of_tests(zip_path,
report["toco_log"] = ""
tf.reset_default_graph()
- with tf.device('/cpu:0'):
- try:
- inputs, outputs = make_graph(param_dict_real)
- except (tf.errors.UnimplementedError, tf.errors.InvalidArgumentError,
- ValueError):
- report["tf_log"] += traceback.format_exc()
- return None, report
+ try:
+ inputs, outputs = make_graph(param_dict_real)
+ except (tf.errors.UnimplementedError, tf.errors.InvalidArgumentError,
+ ValueError):
+ report["tf_log"] += traceback.format_exc()
+ return None, report
sess = tf.Session()
try:
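This hunk trades explicit `tf.device('/cpu:0')` pinning for hiding GPUs via `CUDA_VISIBLE_DEVICES`. A minimal sketch of the two approaches (the environment variable must be set before TensorFlow initializes its devices):

```python
import os
# Approach restored above: hide all GPUs from the whole process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

# Approach removed above: pin individual ops to the CPU device.
with tf.device("/cpu:0"):
    x = tf.constant([1.0, 2.0]) * 2.0
```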
diff --git a/tensorflow/contrib/lite/toco/python/BUILD b/tensorflow/contrib/lite/toco/python/BUILD
index 17115047d2..92246a8aed 100644
--- a/tensorflow/contrib/lite/toco/python/BUILD
+++ b/tensorflow/contrib/lite/toco/python/BUILD
@@ -61,7 +61,6 @@ tf_py_test(
data = [
":toco_from_protos",
],
- tags = ["no_pip"],
)
filegroup(
diff --git a/tensorflow/contrib/makefile/Makefile b/tensorflow/contrib/makefile/Makefile
index e2e6c05591..dba1464653 100644
--- a/tensorflow/contrib/makefile/Makefile
+++ b/tensorflow/contrib/makefile/Makefile
@@ -314,8 +314,7 @@ ifeq ($(TARGET),ANDROID)
-Wno-narrowing \
-fomit-frame-pointer \
$(MARCH_OPTION) \
--fPIE \
--fPIC
+-fPIE
INCLUDES = \
-I$(NDK_ROOT)/sources/android/support/include \
-I$(NDK_ROOT)/sources/cxx-stl/gnu-libstdc++/4.9/include \
diff --git a/tensorflow/contrib/makefile/README.md b/tensorflow/contrib/makefile/README.md
index 65bd60c12a..715eb51577 100644
--- a/tensorflow/contrib/makefile/README.md
+++ b/tensorflow/contrib/makefile/README.md
@@ -174,26 +174,10 @@ tensorflow/contrib/makefile/build_all_ios.sh
This process will take around twenty minutes on a modern MacBook Pro.
-When it completes, you will have a unified library for all architectures
-(i386sim, x86_64sim, armv7, armv7s and arm64) and the benchmark program.
-Although successfully compiling the benchmark program is a
+When it completes, you will have a library for a single architecture and the
+benchmark program. Although successfully compiling the benchmark program is a
sign of success, the program is not a complete iOS app.
-If you would only like to build only one architecture to save time:
-(iOS 11+ only supports 64bit so you can get away with arm64)
-
-```bash
-tensorflow/contrib/makefile/build_all_ios.sh -a arm64
-```
-
-After the first build if you would like to just build the tensorflow
-library you can pass the -T flag to avoid a clean & rebuild. This should
-take you just a few seconds to generate the library if you modified one file.
-
-```bash
-tensorflow/contrib/makefile/build_all_ios.sh -a arm64 -T
-```
-
To see TensorFlow running on iOS, the example Xcode project in
[tensorflow/examples/ios](../../examples/ios/) shows how to use the static
library in a simple app.
@@ -209,18 +193,19 @@ If you have not already, you will need to download dependencies:
tensorflow/contrib/makefile/download_dependencies.sh
```
-Next, you will need to compile protobufs for iOS (optionally takes the -a $ARCH flag):
+Next, you will need to compile protobufs for iOS:
```bash
-tensorflow/contrib/makefile/compile_ios_protobuf.sh
+tensorflow/contrib/makefile/compile_ios_protobuf.sh
```
-Then, you will need to compile the nsync library for iOS (optionally takes -a $ARCH flag):
+Then, you will need to compile the nsync library for iOS:
```bash
export HOST_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh`
export TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios`
```
+
Then, you can run the makefile specifying iOS as the target, along with the
architecture you want to build for:
@@ -234,6 +219,10 @@ This creates a library in
`tensorflow/contrib/makefile/gen/lib/libtensorflow-core.a` that you can link any
xcode project against.
+At this point, you will have a library for a single architecture and the
+benchmark program. Although successfully compiling the benchmark program is a
+sign of success, the program is not a complete iOS app.
+
To see TensorFlow running on iOS, the example Xcode project in
[tensorflow/examples/ios](../../examples/ios/) shows how to use the static
library in a simple app.
@@ -248,14 +237,6 @@ time follow it with:
compile_ios_tensorflow.sh
```
-`compile_ios_tensorflow.sh` takes the -a flag to build only for one architecture.
-In case you run into issues with unresolved symbols with nsync you can also pass
--h ${HOST_NSYNC_LIB} and -n {TARGET_NSYNC_LIB} so it would look like:
-
-```bash
-tensorflow/contrib/makefile/compile_ios_tensorflow.sh -f "-O3" -h tensorflow/contrib/makefile/downloads/nsync/builds/default.macos.c++11/nsync.a -n tensorflow/contrib/makefile/downloads/nsync/builds/lipo.ios.c++11/nsync.a -a arm64
-```
-
In XCode, you will need to use -force_load in the linker flags
section of the build settings to pull in the global constructors that are used
to register ops and kernels.
@@ -268,7 +249,7 @@ debug mode. If you are concerned about performance or are working on a release
build, you would likely want a higher optimization setting, like so:
```bash
-compile_ios_tensorflow.sh -f "-Os"
+compile_ios_tensorflow.sh "-Os"
```
For other variations of valid optimization flags, see [clang optimization levels](http://stackoverflow.com/questions/15548023/clang-optimization-levels).
diff --git a/tensorflow/contrib/makefile/build_all_ios.sh b/tensorflow/contrib/makefile/build_all_ios.sh
index 988e12b482..a49bbe4565 100755
--- a/tensorflow/contrib/makefile/build_all_ios.sh
+++ b/tensorflow/contrib/makefile/build_all_ios.sh
@@ -23,29 +23,14 @@ if [[ $(uname) != "Darwin" ]]; then
exit 1
fi
-usage() {
- echo "Usage: $(basename "$0") [-a:T]"
- echo "-a [build_arch] build only for specified arch x86_64 [default=all]"
- echo "-T only build tensorflow (dont download other deps etc)"
- exit 1
-}
-
-while getopts "a:T" opt_name; do
- case "$opt_name" in
- a) BUILD_ARCH="${OPTARG}";;
- T) ONLY_MAKE_TENSORFLOW="true";;
- *) usage;;
- esac
-done
-shift $((OPTIND - 1))
-
-
# Make sure we're in the correct directory, at the root of the source tree.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd ${SCRIPT_DIR}/../../../
-source "${SCRIPT_DIR}/build_helper.subr"
-JOB_COUNT="${JOB_COUNT:-$(get_job_count)}"
+
+# Remove any old files first.
+make -f tensorflow/contrib/makefile/Makefile clean
+rm -rf tensorflow/contrib/makefile/downloads
# Setting a deployment target is required for building with bitcode,
# otherwise linking will fail with:
@@ -56,37 +41,20 @@ if [[ -n MACOSX_DEPLOYMENT_TARGET ]]; then
export MACOSX_DEPLOYMENT_TARGET=$(sw_vers -productVersion)
fi
-if [[ "${ONLY_MAKE_TENSORFLOW}" != "true" ]]; then
- # Remove any old files first.
- make -f tensorflow/contrib/makefile/Makefile clean
- rm -rf tensorflow/contrib/makefile/downloads
+# Pull down the required versions of the frameworks we need.
+tensorflow/contrib/makefile/download_dependencies.sh
- # Pull down the required versions of the frameworks we need.
- tensorflow/contrib/makefile/download_dependencies.sh
-
- # Compile protobuf for the target iOS device architectures.
- tensorflow/contrib/makefile/compile_ios_protobuf.sh
-fi
+# Compile protobuf for the target iOS device architectures.
+tensorflow/contrib/makefile/compile_ios_protobuf.sh
# Compile nsync for the target iOS device architectures.
# Don't use export var=`something` syntax; it swallows the exit status.
HOST_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh`
-if [[ -z "${BUILD_ARCH}" ]]; then
- # No arch specified so build all architectures
- TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios`
-else
- # arch specified so build just that
- TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios -a ${BUILD_ARCH}`
-fi
+TARGET_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh -t ios`
export HOST_NSYNC_LIB TARGET_NSYNC_LIB
-if [[ -z "${BUILD_ARCH}" ]]; then
- # build the ios tensorflow libraries.
- tensorflow/contrib/makefile/compile_ios_tensorflow.sh -f "-O3" -h $HOST_NSYNC_LIB -n $TARGET_NSYNC_LIB
-else
- # arch specified so build just that
- tensorflow/contrib/makefile/compile_ios_tensorflow.sh -f "-O3" -a "${BUILD_ARCH}" -h $HOST_NSYNC_LIB -n $TARGET_NSYNC_LIB
-fi
+# Build the iOS TensorFlow libraries.
+tensorflow/contrib/makefile/compile_ios_tensorflow.sh "-O3"
# Creates a static universal library in
# tensorflow/contrib/makefile/gen/lib/libtensorflow-core.a
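
The comment above about the `export` form swallowing the exit status reflects
standard shell semantics rather than anything specific to this script: the
exit status of an `export` assignment is the status of `export` itself, so a
failure inside the command substitution goes unnoticed. A minimal
demonstration, independent of the build:

```bash
# The export form reports success even though the substituted command failed.
export BROKEN=`false`
echo $?   # prints 0 -- the failure of `false` is swallowed

# Assigning first preserves the exit status; export afterwards.
OK=`false`
echo $?   # prints 1 -- the failure is visible
export OK
```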
diff --git a/tensorflow/contrib/makefile/compile_ios_protobuf.sh b/tensorflow/contrib/makefile/compile_ios_protobuf.sh
index 43e5809dd2..4056db18a7 100755
--- a/tensorflow/contrib/makefile/compile_ios_protobuf.sh
+++ b/tensorflow/contrib/makefile/compile_ios_protobuf.sh
@@ -21,28 +21,10 @@ if [[ -n MACOSX_DEPLOYMENT_TARGET ]]; then
export MACOSX_DEPLOYMENT_TARGET=$(sw_vers -productVersion)
fi
-usage() {
- echo "Usage: $(basename "$0") [-a]"
- echo "-a [build_arch] build for specified arch comma separate for multiple archs (eg: x86_64,arm64)"
- echo "default arch i386, x86_64, armv7, armv7s, arm64"
- exit 1
-}
-
-BUILD_TARGET="i386 x86_64 armv7 armv7s arm64"
-while getopts "a:" opt_name; do
- case "$opt_name" in
- a) BUILD_TARGET="${OPTARG}";;
- *) usage;;
- esac
-done
-shift $((OPTIND - 1))
-
-IFS=' ' read -r -a build_targets <<< "${BUILD_TARGET}"
-
-SCRIPT_DIR=$(cd `dirname $0` && pwd)
+SCRIPT_DIR=$(dirname $0)
source "${SCRIPT_DIR}/build_helper.subr"
-cd ${SCRIPT_DIR}
+cd tensorflow/contrib/makefile
HOST_GENDIR="$(pwd)/gen/protobuf-host"
mkdir -p "${HOST_GENDIR}"
@@ -82,10 +64,6 @@ else
echo "protoc found. Skip building host tools."
fi
-# Remove old libs
-rm -f ${LIBDIR}/libprotobuf.a
-rm -f ${LIBDIR}/libprotobuf-lite.a
-
./autogen.sh
if [ $? -ne 0 ]
then
@@ -93,192 +71,157 @@ then
exit 1
fi
-package_pb_library() {
- pb_libs="${LIBDIR}/${1}/lib/libprotobuf.a"
- if [ -f "${LIBDIR}/libprotobuf.a" ]; then
- pb_libs="$pb_libs ${LIBDIR}/libprotobuf.a"
- fi
- lipo \
- $pb_libs \
- -create \
- -output ${LIBDIR}/libprotobuf.a
-
- pblite_libs="${LIBDIR}/${1}/lib/libprotobuf-lite.a"
- if [ -f "${LIBDIR}/libprotobuf-lite.a" ]; then
- pblite_libs="$pblite_libs ${LIBDIR}/libprotobuf-lite.a"
- fi
- lipo \
- $pblite_libs \
- -create \
- -output ${LIBDIR}/libprotobuf-lite.a
-}
-
-build_target() {
-case "$1" in
- i386) make distclean
- ./configure \
- --host=i386-apple-${OSX_VERSION} \
- --disable-shared \
- --enable-cross-compile \
- --with-protoc="${PROTOC_PATH}" \
- --prefix=${LIBDIR}/iossim_386 \
- --exec-prefix=${LIBDIR}/iossim_386 \
- "CFLAGS=${CFLAGS} \
- -mios-simulator-version-min=${MIN_SDK_VERSION} \
- -arch i386 \
- -fembed-bitcode \
- -isysroot ${IPHONESIMULATOR_SYSROOT}" \
- "CXX=${CXX}" \
- "CXXFLAGS=${CXXFLAGS} \
- -mios-simulator-version-min=${MIN_SDK_VERSION} \
- -arch i386 \
- -fembed-bitcode \
- -isysroot \
- ${IPHONESIMULATOR_SYSROOT}" \
- LDFLAGS="-arch i386 \
- -fembed-bitcode \
- -mios-simulator-version-min=${MIN_SDK_VERSION} \
- ${LDFLAGS} \
- -L${IPHONESIMULATOR_SYSROOT}/usr/lib/ \
- -L${IPHONESIMULATOR_SYSROOT}/usr/lib/system" \
- "LIBS=${LIBS}"
- make -j"${JOB_COUNT}"
- make install
-
- package_pb_library "iossim_386"
- ;;
-
- x86_64) make distclean
- ./configure \
- --host=x86_64-apple-${OSX_VERSION} \
- --disable-shared \
- --enable-cross-compile \
- --with-protoc="${PROTOC_PATH}" \
- --prefix=${LIBDIR}/iossim_x86_64 \
- --exec-prefix=${LIBDIR}/iossim_x86_64 \
- "CFLAGS=${CFLAGS} \
- -mios-simulator-version-min=${MIN_SDK_VERSION} \
- -arch x86_64 \
- -fembed-bitcode \
- -isysroot ${IPHONESIMULATOR_SYSROOT}" \
- "CXX=${CXX}" \
- "CXXFLAGS=${CXXFLAGS} \
- -mios-simulator-version-min=${MIN_SDK_VERSION} \
- -arch x86_64 \
- -fembed-bitcode \
- -isysroot \
- ${IPHONESIMULATOR_SYSROOT}" \
- LDFLAGS="-arch x86_64 \
- -fembed-bitcode \
- -mios-simulator-version-min=${MIN_SDK_VERSION} \
- ${LDFLAGS} \
- -L${IPHONESIMULATOR_SYSROOT}/usr/lib/ \
- -L${IPHONESIMULATOR_SYSROOT}/usr/lib/system" \
- "LIBS=${LIBS}"
- make -j"${JOB_COUNT}"
- make install
-
- package_pb_library "iossim_x86_64"
- ;;
-
- armv7) make distclean
- ./configure \
- --host=armv7-apple-${OSX_VERSION} \
- --with-protoc="${PROTOC_PATH}" \
- --disable-shared \
- --prefix=${LIBDIR}/ios_arm7 \
- --exec-prefix=${LIBDIR}/ios_arm7 \
- "CFLAGS=${CFLAGS} \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- -arch armv7 \
- -fembed-bitcode \
- -isysroot ${IPHONEOS_SYSROOT}" \
- "CXX=${CXX}" \
- "CXXFLAGS=${CXXFLAGS} \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- -arch armv7 \
- -fembed-bitcode \
- -isysroot ${IPHONEOS_SYSROOT}" \
- LDFLAGS="-arch armv7 \
- -fembed-bitcode \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- ${LDFLAGS}" \
- "LIBS=${LIBS}"
- make -j"${JOB_COUNT}"
- make install
-
- package_pb_library "ios_arm7"
- ;;
-
- armv7s) make distclean
- ./configure \
- --host=armv7s-apple-${OSX_VERSION} \
- --with-protoc="${PROTOC_PATH}" \
- --disable-shared \
- --prefix=${LIBDIR}/ios_arm7s \
- --exec-prefix=${LIBDIR}/ios_arm7s \
- "CFLAGS=${CFLAGS} \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- -arch armv7s \
- -fembed-bitcode \
- -isysroot ${IPHONEOS_SYSROOT}" \
- "CXX=${CXX}" \
- "CXXFLAGS=${CXXFLAGS} \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- -arch armv7s \
- -fembed-bitcode \
- -isysroot ${IPHONEOS_SYSROOT}" \
- LDFLAGS="-arch armv7s \
- -fembed-bitcode \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- ${LDFLAGS}" \
- "LIBS=${LIBS}"
- make -j"${JOB_COUNT}"
- make install
-
- package_pb_library "ios_arm7s"
- ;;
-
- arm64) make distclean
- ./configure \
- --host=arm \
- --with-protoc="${PROTOC_PATH}" \
- --disable-shared \
- --prefix=${LIBDIR}/ios_arm64 \
- --exec-prefix=${LIBDIR}/ios_arm64 \
- "CFLAGS=${CFLAGS} \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- -arch arm64 \
- -fembed-bitcode \
- -isysroot ${IPHONEOS_SYSROOT}" \
- "CXXFLAGS=${CXXFLAGS} \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- -arch arm64 \
- -fembed-bitcode \
- -isysroot ${IPHONEOS_SYSROOT}" \
- LDFLAGS="-arch arm64 \
- -fembed-bitcode \
- -miphoneos-version-min=${MIN_SDK_VERSION} \
- ${LDFLAGS}" \
- "LIBS=${LIBS}"
- make -j"${JOB_COUNT}"
- make install
-
- package_pb_library "ios_arm64"
- ;;
- *)
- echo "Unknown ARCH"
- exit 1
- ;;
-esac
-}
-
-for build_element in "${build_targets[@]}"
-do
- echo "$build_element"
- build_target "$build_element"
-done
-
-file ${LIBDIR}/libprotobuf.a
-file ${LIBDIR}/libprotobuf-lite.a
-echo "Done building and packaging the libraries"
+make distclean
+./configure \
+--host=i386-apple-${OSX_VERSION} \
+--disable-shared \
+--enable-cross-compile \
+--with-protoc="${PROTOC_PATH}" \
+--prefix=${LIBDIR}/iossim_386 \
+--exec-prefix=${LIBDIR}/iossim_386 \
+"CFLAGS=${CFLAGS} \
+-mios-simulator-version-min=${MIN_SDK_VERSION} \
+-arch i386 \
+-fembed-bitcode \
+-isysroot ${IPHONESIMULATOR_SYSROOT}" \
+"CXX=${CXX}" \
+"CXXFLAGS=${CXXFLAGS} \
+-mios-simulator-version-min=${MIN_SDK_VERSION} \
+-arch i386 \
+-fembed-bitcode \
+-isysroot \
+${IPHONESIMULATOR_SYSROOT}" \
+LDFLAGS="-arch i386 \
+-fembed-bitcode \
+-mios-simulator-version-min=${MIN_SDK_VERSION} \
+${LDFLAGS} \
+-L${IPHONESIMULATOR_SYSROOT}/usr/lib/ \
+-L${IPHONESIMULATOR_SYSROOT}/usr/lib/system" \
+"LIBS=${LIBS}"
+make -j"${JOB_COUNT}"
+make install
+
+make distclean
+./configure \
+--host=x86_64-apple-${OSX_VERSION} \
+--disable-shared \
+--enable-cross-compile \
+--with-protoc="${PROTOC_PATH}" \
+--prefix=${LIBDIR}/iossim_x86_64 \
+--exec-prefix=${LIBDIR}/iossim_x86_64 \
+"CFLAGS=${CFLAGS} \
+-mios-simulator-version-min=${MIN_SDK_VERSION} \
+-arch x86_64 \
+-fembed-bitcode \
+-isysroot ${IPHONESIMULATOR_SYSROOT}" \
+"CXX=${CXX}" \
+"CXXFLAGS=${CXXFLAGS} \
+-mios-simulator-version-min=${MIN_SDK_VERSION} \
+-arch x86_64 \
+-fembed-bitcode \
+-isysroot \
+${IPHONESIMULATOR_SYSROOT}" \
+LDFLAGS="-arch x86_64 \
+-fembed-bitcode \
+-mios-simulator-version-min=${MIN_SDK_VERSION} \
+${LDFLAGS} \
+-L${IPHONESIMULATOR_SYSROOT}/usr/lib/ \
+-L${IPHONESIMULATOR_SYSROOT}/usr/lib/system" \
+"LIBS=${LIBS}"
+make -j"${JOB_COUNT}"
+make install
+
+make distclean
+./configure \
+--host=armv7-apple-${OSX_VERSION} \
+--with-protoc="${PROTOC_PATH}" \
+--disable-shared \
+--prefix=${LIBDIR}/ios_arm7 \
+--exec-prefix=${LIBDIR}/ios_arm7 \
+"CFLAGS=${CFLAGS} \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+-arch armv7 \
+-fembed-bitcode \
+-isysroot ${IPHONEOS_SYSROOT}" \
+"CXX=${CXX}" \
+"CXXFLAGS=${CXXFLAGS} \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+-arch armv7 \
+-fembed-bitcode \
+-isysroot ${IPHONEOS_SYSROOT}" \
+LDFLAGS="-arch armv7 \
+-fembed-bitcode \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+${LDFLAGS}" \
+"LIBS=${LIBS}"
+make -j"${JOB_COUNT}"
+make install
+
+make distclean
+./configure \
+--host=armv7s-apple-${OSX_VERSION} \
+--with-protoc="${PROTOC_PATH}" \
+--disable-shared \
+--prefix=${LIBDIR}/ios_arm7s \
+--exec-prefix=${LIBDIR}/ios_arm7s \
+"CFLAGS=${CFLAGS} \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+-arch armv7s \
+-fembed-bitcode \
+-isysroot ${IPHONEOS_SYSROOT}" \
+"CXX=${CXX}" \
+"CXXFLAGS=${CXXFLAGS} \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+-arch armv7s \
+-fembed-bitcode \
+-isysroot ${IPHONEOS_SYSROOT}" \
+LDFLAGS="-arch armv7s \
+-fembed-bitcode \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+${LDFLAGS}" \
+"LIBS=${LIBS}"
+make -j"${JOB_COUNT}"
+make install
+
+make distclean
+./configure \
+--host=arm \
+--with-protoc="${PROTOC_PATH}" \
+--disable-shared \
+--prefix=${LIBDIR}/ios_arm64 \
+--exec-prefix=${LIBDIR}/ios_arm64 \
+"CFLAGS=${CFLAGS} \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+-arch arm64 \
+-fembed-bitcode \
+-isysroot ${IPHONEOS_SYSROOT}" \
+"CXXFLAGS=${CXXFLAGS} \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+-arch arm64 \
+-fembed-bitcode \
+-isysroot ${IPHONEOS_SYSROOT}" \
+LDFLAGS="-arch arm64 \
+-fembed-bitcode \
+-miphoneos-version-min=${MIN_SDK_VERSION} \
+${LDFLAGS}" \
+"LIBS=${LIBS}"
+make -j"${JOB_COUNT}"
+make install
+
+lipo \
+${LIBDIR}/iossim_386/lib/libprotobuf.a \
+${LIBDIR}/iossim_x86_64/lib/libprotobuf.a \
+${LIBDIR}/ios_arm7/lib/libprotobuf.a \
+${LIBDIR}/ios_arm7s/lib/libprotobuf.a \
+${LIBDIR}/ios_arm64/lib/libprotobuf.a \
+-create \
+-output ${LIBDIR}/libprotobuf.a
+
+lipo \
+${LIBDIR}/iossim_386/lib/libprotobuf-lite.a \
+${LIBDIR}/iossim_x86_64/lib/libprotobuf-lite.a \
+${LIBDIR}/ios_arm7/lib/libprotobuf-lite.a \
+${LIBDIR}/ios_arm7s/lib/libprotobuf-lite.a \
+${LIBDIR}/ios_arm64/lib/libprotobuf-lite.a \
+-create \
+-output ${LIBDIR}/libprotobuf-lite.a
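
After the five per-architecture builds and the two `lipo` merges, the
resulting universal archives can be sanity-checked with the standard
toolchain; `${LIBDIR}` here is the library directory the script sets up
earlier:

```bash
# Each command should list all five slices: i386 x86_64 armv7 armv7s arm64.
lipo -info ${LIBDIR}/libprotobuf.a
lipo -info ${LIBDIR}/libprotobuf-lite.a
```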
diff --git a/tensorflow/contrib/makefile/compile_ios_tensorflow.sh b/tensorflow/contrib/makefile/compile_ios_tensorflow.sh
index ae82163e11..5d1cc8b375 100755
--- a/tensorflow/contrib/makefile/compile_ios_tensorflow.sh
+++ b/tensorflow/contrib/makefile/compile_ios_tensorflow.sh
@@ -43,124 +43,55 @@ then
exit 1
fi
-usage() {
- echo "Usage: $(basename "$0") [-a]"
- echo "-a [build_arch] build for specified arch comma separate for multiple archs (eg: x86_64,arm64)"
- echo "default is [i386, x86_64, armv7, armv7s, arm64]"
- exit 1
-}
-
-BUILD_TARGET="i386 x86_64 armv7 armv7s arm64"
-while getopts "a:f:h:n:" opt_name; do
- case "$opt_name" in
- a) BUILD_TARGET="${OPTARG}";;
- f) BUILD_OPT="${OPTARG}";;
- h) NSYNC_HOST="${OPTARG}";;
- n) NSYNC_TARGET="${OPTARG}";;
- *) usage;;
- esac
-done
-shift $((OPTIND - 1))
-
-IFS=' ' read -r -a build_targets <<< "${BUILD_TARGET}"
-
-SCRIPT_DIR=$(cd `dirname $0` && pwd)
-source "${SCRIPT_DIR}/build_helper.subr"
-
-
GENDIR=tensorflow/contrib/makefile/gen/
LIBDIR=${GENDIR}lib
LIB_PREFIX=libtensorflow-core
-#remove any old artifacts
-rm -rf ${LIBDIR}/${LIB_PREFIX}.a
+make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
+TARGET=IOS IOS_ARCH=ARMV7 LIB_NAME=${LIB_PREFIX}-armv7.a OPTFLAGS="$1"
+if [ $? -ne 0 ]
+then
+ echo "armv7 compilation failed."
+ exit 1
+fi
-package_tf_library() {
- CAP_DIR=`echo $1 | tr 'a-z' 'A-Z'`
- tf_libs="${LIBDIR}/ios_${CAP_DIR}/${LIB_PREFIX}-${1}.a"
- if [ -f "${LIBDIR}/${LIB_PREFIX}.a" ]; then
- tf_libs="$tf_libs ${LIBDIR}/${LIB_PREFIX}.a"
- fi
- lipo \
- $tf_libs \
- -create \
- -output ${LIBDIR}/${LIB_PREFIX}.a
-}
+make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
+TARGET=IOS IOS_ARCH=ARMV7S LIB_NAME=${LIB_PREFIX}-armv7s.a OPTFLAGS="$1"
+if [ $? -ne 0 ]
+then
+ echo "arm7vs compilation failed."
+ exit 1
+fi
-build_tf_target() {
-case "$1" in
- armv7)
- make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
- TARGET=IOS IOS_ARCH=ARMV7 LIB_NAME=${LIB_PREFIX}-armv7.a \
- OPTFLAGS="${BUILD_OPT}" HOST_NSYNC_LIB="${NSYNC_HOST}" \
- TARGET_NSYNC_LIB="${NSYNC_TARGET}"
- if [ $? -ne 0 ]
- then
- echo "armv7 compilation failed."
- exit 1
- fi
- package_tf_library "armv7"
- ;;
- armv7s)
- make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
- TARGET=IOS IOS_ARCH=ARMV7S LIB_NAME=${LIB_PREFIX}-armv7s.a \
- OPTFLAGS="${BUILD_OPT}" HOST_NSYNC_LIB="${NSYNC_HOST}" \
- TARGET_NSYNC_LIB="${NSYNC_TARGET}"
+make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
+TARGET=IOS IOS_ARCH=ARM64 LIB_NAME=${LIB_PREFIX}-arm64.a OPTFLAGS="$1"
+if [ $? -ne 0 ]
+then
+ echo "arm64 compilation failed."
+ exit 1
+fi
- if [ $? -ne 0 ]
- then
- echo "arm7vs compilation failed."
- exit 1
- fi
- package_tf_library "armv7s"
- ;;
- arm64)
- make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
- TARGET=IOS IOS_ARCH=ARM64 LIB_NAME=${LIB_PREFIX}-arm64.a \
- OPTFLAGS="${BUILD_OPT}" HOST_NSYNC_LIB="${NSYNC_HOST}" \
- TARGET_NSYNC_LIB="${NSYNC_TARGET}"
- if [ $? -ne 0 ]
- then
- echo "arm64 compilation failed."
- exit 1
- fi
- package_tf_library "arm64"
- ;;
- i386)
- make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
- TARGET=IOS IOS_ARCH=I386 LIB_NAME=${LIB_PREFIX}-i386.a \
- OPTFLAGS="${BUILD_OPT}" HOST_NSYNC_LIB="${NSYNC_HOST}" \
- TARGET_NSYNC_LIB="${NSYNC_TARGET}"
- if [ $? -ne 0 ]
- then
- echo "i386 compilation failed."
- exit 1
- fi
- package_tf_library "i386"
- ;;
- x86_64)
- make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
- TARGET=IOS IOS_ARCH=X86_64 LIB_NAME=${LIB_PREFIX}-x86_64.a \
- OPTFLAGS="${BUILD_OPT}" HOST_NSYNC_LIB="${NSYNC_HOST}" \
- TARGET_NSYNC_LIB="${NSYNC_TARGET}"
- if [ $? -ne 0 ]
- then
- echo "x86_64 compilation failed."
- exit 1
- fi
- package_tf_library "x86_64"
- ;;
- *)
- echo "Unknown ARCH"
- exit 1
-esac
-}
+make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
+TARGET=IOS IOS_ARCH=I386 LIB_NAME=${LIB_PREFIX}-i386.a OPTFLAGS="$1"
+if [ $? -ne 0 ]
+then
+ echo "i386 compilation failed."
+ exit 1
+fi
-for build_tf_element in "${build_targets[@]}"
-do
- echo "$build_tf_element"
- build_tf_target "$build_tf_element"
-done
+make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
+TARGET=IOS IOS_ARCH=X86_64 LIB_NAME=${LIB_PREFIX}-x86_64.a OPTFLAGS="$1"
+if [ $? -ne 0 ]
+then
+ echo "x86_64 compilation failed."
+ exit 1
+fi
-echo "Done building and packaging TF"
-file ${LIBDIR}/${LIB_PREFIX}.a
+lipo \
+${LIBDIR}/ios_ARMV7/${LIB_PREFIX}-armv7.a \
+${LIBDIR}/ios_ARMV7S/${LIB_PREFIX}-armv7s.a \
+${LIBDIR}/ios_ARM64/${LIB_PREFIX}-arm64.a \
+${LIBDIR}/ios_I386/${LIB_PREFIX}-i386.a \
+${LIBDIR}/ios_X86_64/${LIB_PREFIX}-x86_64.a \
+-create \
+-output ${LIBDIR}/${LIB_PREFIX}.a
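
The five `make` invocations above differ only in the architecture name.
Purely as an illustration of the pattern (not what the script actually
contains), the same sequence can be expressed as a loop:

```bash
# Illustrative condensation of the per-architecture builds; variable names
# mirror the script, and "$1" is the optimization-flags argument.
LIB_PREFIX=libtensorflow-core
for arch in armv7 armv7s arm64 i386 x86_64; do
  upper=$(echo "$arch" | tr 'a-z' 'A-Z')
  make -j"${JOB_COUNT}" -f tensorflow/contrib/makefile/Makefile \
    TARGET=IOS IOS_ARCH="$upper" LIB_NAME="${LIB_PREFIX}-${arch}.a" \
    OPTFLAGS="$1" || { echo "${arch} compilation failed."; exit 1; }
done
```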
diff --git a/tensorflow/contrib/makefile/compile_nsync.sh b/tensorflow/contrib/makefile/compile_nsync.sh
index 930e6b8dea..ecbd9bb825 100755
--- a/tensorflow/contrib/makefile/compile_nsync.sh
+++ b/tensorflow/contrib/makefile/compile_nsync.sh
@@ -265,7 +265,7 @@ for arch in $archs; do
-I$(NDK_ROOT)/sources/cxx-stl/gnu-libstdc++/4.9/libs/'"$arch"'/include \
-I../../platform/c++11 -I../../platform/gcc \
-I../../platform/posix -pthread
- PLATFORM_CFLAGS=-std=c++11 -Wno-narrowing '"$march_option"' -fPIE -fPIC
+ PLATFORM_CFLAGS=-std=c++11 -Wno-narrowing '"$march_option"' -fPIE
PLATFORM_LDFLAGS=-pthread
MKDEP=${CC} -M -std=c++11
PLATFORM_C=../../platform/c++11/src/nsync_semaphore_mutex.cc \
@@ -301,9 +301,6 @@ done
case "$target_platform" in
ios) nsync_platform_dir="$nsync_builds_dir/lipo.$target_platform.c++11"
- if [ -d "$nsync_platform_dir" ]; then
- rm -rf "$nsync_platform_dir"
- fi
mkdir "$nsync_platform_dir"
eval lipo $platform_libs -create -output '$nsync_platform_dir/nsync.a'
echo "$nsync_platform_dir/nsync.a"
diff --git a/tensorflow/contrib/nn/__init__.py b/tensorflow/contrib/nn/__init__.py
index 0bc133a00e..3bf795d19a 100644
--- a/tensorflow/contrib/nn/__init__.py
+++ b/tensorflow/contrib/nn/__init__.py
@@ -15,7 +15,6 @@
"""Module for variants of ops in tf.nn.
@@alpha_dropout
-@@conv1d_transpose
@@deprecated_flipped_softmax_cross_entropy_with_logits
@@deprecated_flipped_sparse_softmax_cross_entropy_with_logits
@@deprecated_flipped_sigmoid_cross_entropy_with_logits
@@ -33,7 +32,6 @@ from tensorflow.contrib.nn.python.ops.alpha_dropout import *
from tensorflow.contrib.nn.python.ops.cross_entropy import *
from tensorflow.contrib.nn.python.ops.sampling_ops import *
from tensorflow.contrib.nn.python.ops.scaled_softplus import *
-from tensorflow.python.ops.nn_ops import conv1d_transpose
from tensorflow.python.ops.nn_ops import nth_element
# pylint: enable=unused-import,wildcard-import
diff --git a/tensorflow/contrib/opt/BUILD b/tensorflow/contrib/opt/BUILD
index a9a63cbce0..8c46becf2c 100644
--- a/tensorflow/contrib/opt/BUILD
+++ b/tensorflow/contrib/opt/BUILD
@@ -19,7 +19,6 @@ py_library(
"python/training/external_optimizer.py",
"python/training/lazy_adam_optimizer.py",
"python/training/moving_average_optimizer.py",
- "python/training/multitask_optimizer_wrapper.py",
"python/training/nadam_optimizer.py",
"python/training/powersign.py",
"python/training/sign_decay.py",
@@ -100,23 +99,6 @@ py_test(
)
py_test(
- name = "multitask_optimizer_wrapper_test",
- srcs = ["python/training/multitask_optimizer_wrapper_test.py"],
- srcs_version = "PY2AND3",
- deps = [
- ":opt_py",
- "//tensorflow/python:client",
- "//tensorflow/python:client_testlib",
- "//tensorflow/python:constant_op",
- "//tensorflow/python:dtypes",
- "//tensorflow/python:training",
- "//tensorflow/python:variables",
- "//third_party/py/numpy",
- "@six_archive//:six",
- ],
-)
-
-py_test(
name = "lazy_adam_optimizer_test",
srcs = ["python/training/lazy_adam_optimizer_test.py"],
srcs_version = "PY2AND3",
diff --git a/tensorflow/contrib/opt/__init__.py b/tensorflow/contrib/opt/__init__.py
index 4c60c99342..caf22536bb 100644
--- a/tensorflow/contrib/opt/__init__.py
+++ b/tensorflow/contrib/opt/__init__.py
@@ -24,7 +24,7 @@ from tensorflow.contrib.opt.python.training.drop_stale_gradient_optimizer import
from tensorflow.contrib.opt.python.training.external_optimizer import *
from tensorflow.contrib.opt.python.training.lazy_adam_optimizer import *
from tensorflow.contrib.opt.python.training.moving_average_optimizer import *
-from tensorflow.contrib.opt.python.training.multitask_optimizer_wrapper import *
+from tensorflow.contrib.opt.python.training.nadam_optimizer import *
from tensorflow.contrib.opt.python.training.nadam_optimizer import *
from tensorflow.contrib.opt.python.training.powersign import *
from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import *
@@ -38,8 +38,7 @@ _allowed_symbols = [
'DelayCompensatedGradientDescentOptimizer',
'DropStaleGradientOptimizer', 'ExternalOptimizerInterface',
'LazyAdamOptimizer', 'NadamOptimizer', 'MovingAverageOptimizer',
- 'ScipyOptimizerInterface', 'VariableClippingOptimizer',
- 'MultitaskOptimizerWrapper', 'clip_gradients_by_global_norm',
+ 'ScipyOptimizerInterface', 'VariableClippingOptimizer'
]
remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py b/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py
deleted file mode 100644
index c26037935d..0000000000
--- a/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper.py
+++ /dev/null
@@ -1,138 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-
-"""An optimizer wrapper that ensures correct behaviour
-of stateful optimizers with multitask loss."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import types
-import six
-
-from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import clip_ops
-from tensorflow.python.ops import control_flow_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.training import optimizer
-
-__all__ = ["MultitaskOptimizerWrapper",
- "clip_gradients_by_global_norm"]
-
-def _is_all_zeros(grad):
- all_zeros = math_ops.equal(math_ops.count_nonzero(grad), 0)
- return all_zeros
-
-def _get_wrapper(fn, opt):
- def wrapper(self, grad, *args, **kwargs): # pylint: disable=unused-argument
- all_zeros = _is_all_zeros(grad)
- return control_flow_ops.cond(
- all_zeros,
- control_flow_ops.no_op,
- lambda: fn(grad, *args, **kwargs))
- wrapper = types.MethodType(wrapper, opt)
- return wrapper
-
-class MultitaskOptimizerWrapper(object):
- """Optimizer wrapper that ensures that
- all-zero gradients don't affect the optimizer state.
-
- This might be useful when a multi-task loss is used,
- and some components of the loss might be
- not present (e.g. masked out) in some training batches.
- Technically their gradient would be zero,
- which would normally affect the optimizer state
- (e.g. push running average to zero).
- However this is not the desired behaviour,
- since the missing loss component
- should be treated as unknown rather than zero.
-
- This wrapper filters out all-zero gradient tensors,
- therefore preserving the optimizer state.
-
- If gradient clipping by global norm is used,
- the provided function clip_gradients_by_global_norm
- should be used (and specified explicitly by the user).
- Otherwise the global norm would be underestimated
- because of all-zero tensors that should be ignored.
-
- The gradient calculation and application
- are delegated to an underlying optimizer.
- The gradient application is altered only for all-zero tensors.
-
- Example:
- ```python
- momentum_optimizer = tf.train.MomentumOptimizer(
- learning_rate, momentum=0.9)
- multitask_momentum_optimizer = tf.contrib.opt.MultitaskOptimizerWrapper(
- momentum_optimizer)
- gradvars = multitask_momentum_optimizer.compute_gradients(
- loss)
- gradvars_clipped, _ = tf.contrib.opt.clip_gradients_by_global_norm(
- gradvars, 15.0)
- train_op = multitask_momentum_optimizer.apply_gradients(
- gradvars_clipped, global_step=batch)
- ```
- """
- def __init__(self, opt):
- """
- Args:
- opt: an instance of a class that implements tf.train.Optimizer.
- """
- if not isinstance(opt, optimizer.Optimizer):
- raise TypeError(
- "Supplied optimizer must be an instance of tf.train.Optimizer")
- self._opt = opt
- overriden_methods = ('_apply_dense',
- '_resource_apply_dense',
- '_apply_sparse',
- '_resource_apply_sparse')
- for name in overriden_methods:
- fn = getattr(self._opt, name)
- wrapper = _get_wrapper(fn, self._opt)
- setattr(self._opt, name, wrapper)
-
- def __getattr__(self, name):
- return getattr(self._opt, name)
-
-
-def clip_gradients_by_global_norm(gradients_variables, clip_norm=20.):
- """Clips gradients of a multitask loss by their global norm.
- Ignores all-zero tensors when computing the global norm.
-
- Args:
- gradients_variables: a list of pairs (gradient, variable).
- clip_norm: a float Tensor, the global norm to clip on. Default is 20.0.
-
- Returns:
- list: A list of pairs of the same type as gradients_variables,.
- fixed_global_norm: A 0-D (scalar) Tensor representing the global norm.
- """
- gradients, variables = six.moves.zip(*gradients_variables)
- def _replace_nonexisting_grad(grad):
- if grad is None:
- return grad
- all_zeros = _is_all_zeros(grad)
- return control_flow_ops.cond(all_zeros,
- lambda: array_ops.zeros(
- [], dtype=dtypes.as_dtype(grad.dtype)),
- lambda: grad)
- nonzero_gradients = [_replace_nonexisting_grad(g) for g in gradients]
- fixed_global_norm = clip_ops.global_norm(nonzero_gradients)
- gradients, _ = clip_ops.clip_by_global_norm(gradients, clip_norm,
- use_norm=fixed_global_norm)
- return list(six.moves.zip(gradients, variables)), fixed_global_norm
diff --git a/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper_test.py b/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper_test.py
deleted file mode 100644
index b06213f715..0000000000
--- a/tensorflow/contrib/opt/python/training/multitask_optimizer_wrapper_test.py
+++ /dev/null
@@ -1,119 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for MultitaskOptimizerWrapper."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.opt.python.training import multitask_optimizer_wrapper
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import variables
-from tensorflow.python.platform import test
-from tensorflow.python.training import momentum
-
-import numpy as np
-import six
-
-class MultitaskOptimizerWrapperTest(test.TestCase):
- """
- Tests for the multitask optimizer wrapper.
- """
- def testWrapper(self):
- with self.test_session():
- var0 = variables.Variable([1.0, 2.0], dtype=dtypes.float32)
- var1 = variables.Variable([3.0, 4.0], dtype=dtypes.float32)
- grads0 = constant_op.constant([0.1, 0.1], dtype=dtypes.float32)
- grads1 = constant_op.constant([0.01, 0.01], dtype=dtypes.float32)
- grads_allzero = constant_op.constant([0.0, 0.0], dtype=dtypes.float32)
- mom_opt_impl = momentum.MomentumOptimizer(
- learning_rate=2.0, momentum=0.9)
- mom_opt = multitask_optimizer_wrapper.MultitaskOptimizerWrapper(
- mom_opt_impl)
- mom_update = mom_opt.apply_gradients(
- zip([grads0, grads1], [var0, var1]))
- mom_update_partial = mom_opt.apply_gradients(
- zip([grads_allzero, grads1], [var0, var1]))
- mom_update_no_action = mom_opt.apply_gradients(
- zip([grads_allzero, grads_allzero], [var0, var1]))
- self.evaluate(variables.global_variables_initializer())
- # Fetch params to validate initial values
- self.assertAllClose([1.0, 2.0], self.evaluate(var0))
- self.assertAllClose([3.0, 4.0], self.evaluate(var1))
-
- self.assertEqual(["momentum"], mom_opt.get_slot_names())
- slot0 = mom_opt.get_slot(var0, "momentum")
- self.assertEquals(slot0.get_shape(), var0.get_shape())
- slot1 = mom_opt.get_slot(var1, "momentum")
- self.assertEquals(slot1.get_shape(), var1.get_shape())
-
- # Step 1: normal momentum update.
- self.evaluate(mom_update)
- # Check that the momentum accumulators have been updated.
- self.assertAllCloseAccordingToType(np.array([0.1, 0.1]),
- self.evaluate(slot0))
- self.assertAllCloseAccordingToType(np.array([0.01, 0.01]),
- self.evaluate(slot1))
- # Check that the parameters have been updated.
- self.assertAllCloseAccordingToType(
- np.array([1.0 - (0.1 * 2.0), 2.0 - (0.1 * 2.0)]),
- self.evaluate(var0))
- self.assertAllCloseAccordingToType(
- np.array([3.0 - (0.01 * 2.0), 4.0 - (0.01 * 2.0)]),
- self.evaluate(var1))
-
- # Step 2: momentum update that changes only slot1 but not slot0.
- self.evaluate(mom_update_partial)
- # Check that only the relevant momentum accumulator has been updated.
- self.assertAllCloseAccordingToType(np.array([0.1, 0.1]),
- self.evaluate(slot0))
- self.assertAllCloseAccordingToType(
- np.array([(0.9 * 0.01 + 0.01), (0.9 * 0.01 + 0.01)]),
- self.evaluate(slot1))
-
- # Step 3: momentum update that does not change anything.
- self.evaluate(mom_update_no_action)
- # Check that the momentum accumulators have *NOT* been updated.
- self.assertAllCloseAccordingToType(np.array([0.1, 0.1]),
- self.evaluate(slot0))
- self.assertAllCloseAccordingToType(
- np.array([(0.9 * 0.01 + 0.01), (0.9 * 0.01 + 0.01)]),
- self.evaluate(slot1))
-
- def testGradientClipping(self):
- with self.test_session():
- var0 = variables.Variable([1.0, 2.0], dtype=dtypes.float32)
- var1 = variables.Variable([3.0, 4.0], dtype=dtypes.float32)
- var2 = variables.Variable([3.0, 4.0], dtype=dtypes.float32)
- var3 = variables.Variable([3.0, 4.0], dtype=dtypes.float32)
- grads0 = constant_op.constant([10.0, 15.0], dtype=dtypes.float32)
- grads1 = constant_op.constant([0.0, 5.0], dtype=dtypes.float32)
- grads2 = constant_op.constant([0.0, 0.0], dtype=dtypes.float32)
- grads3 = None
- varlist = [var0, var1, var2, var3]
- gradients = [grads0, grads1, grads2, grads3]
- clipped_gradvars, global_norm = multitask_optimizer_wrapper.clip_gradients_by_global_norm(
- six.moves.zip(gradients, varlist), clip_norm=1.0)
- clipped_grads = list(six.moves.zip(*clipped_gradvars))[0]
- reference_global_norm = np.sqrt(np.sum(np.square([10.0, 15.0, 0.0, 5.0])))
- self.assertAllCloseAccordingToType(
- self.evaluate(global_norm), reference_global_norm)
- self.assertAllCloseAccordingToType(
- self.evaluate(clipped_grads[2]), np.array([0., 0.]))
- self.assertEqual(clipped_grads[3], None)
-
-if __name__ == "__main__":
- test.main()
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
index 16b6d145e3..909c6aba2b 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
@@ -38,9 +38,6 @@ from tensorflow.python.ops import rnn_cell_impl
from tensorflow.python.ops import variable_scope
from tensorflow.python.ops import variables as variables_lib
from tensorflow.python.platform import test
-from tensorflow.python.framework import test_util
-from tensorflow.contrib.rnn.python.ops import rnn_cell as contrib_rnn_cell
-
# pylint: enable=protected-access
@@ -361,45 +358,6 @@ class RNNCellTest(test.TestCase):
self.assertEquals(variables[2].op.name,
"root/lstm_cell/projection/kernel")
- def testLSTMCellLayerNorm(self):
- with self.test_session() as sess:
- num_units = 2
- num_proj = 3
- batch_size = 1
- input_size = 4
- with variable_scope.variable_scope(
- "root", initializer=init_ops.constant_initializer(0.5)):
- x = array_ops.zeros([batch_size, input_size])
- c = array_ops.zeros([batch_size, num_units])
- h = array_ops.zeros([batch_size, num_proj])
- state = rnn_cell_impl.LSTMStateTuple(c, h)
- cell = contrib_rnn_cell.LayerNormLSTMCell(
- num_units=num_units,
- num_proj=num_proj,
- forget_bias=1.0,
- layer_norm=True,
- norm_gain=1.0,
- norm_shift=0.0)
- g, out_m = cell(x, state)
- sess.run([variables_lib.global_variables_initializer()])
- res = sess.run([g, out_m], {
- x.name: np.ones((batch_size, input_size)),
- c.name: 0.1 * np.ones((batch_size, num_units)),
- h.name: 0.1 * np.ones((batch_size, num_proj))
- })
- self.assertEqual(len(res), 2)
- # The numbers in results were not calculated, this is mostly just a
- # smoke test.
- self.assertEqual(res[0].shape, (batch_size, num_proj))
- self.assertEqual(res[1][0].shape, (batch_size, num_units))
- self.assertEqual(res[1][1].shape, (batch_size, num_proj))
- # Different inputs so different outputs and states
- for i in range(1, batch_size):
- self.assertTrue(
- float(np.linalg.norm((res[0][0, :] - res[0][i, :]))) < 1e-6)
- self.assertTrue(
- float(np.linalg.norm((res[1][0, :] - res[1][i, :]))) < 1e-6)
-
def testOutputProjectionWrapper(self):
with self.test_session() as sess:
with variable_scope.variable_scope(
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py b/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
index b4a5f2d7eb..ebd4564f12 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
@@ -37,7 +37,6 @@ from tensorflow.python.ops import math_ops
from tensorflow.python.ops import random_ops
from tensorflow.python.ops import rnn
from tensorflow.python.ops import rnn_cell
-from tensorflow.python.ops import rnn_cell_impl
from tensorflow.python.ops import variable_scope
from tensorflow.python.ops import variables
from tensorflow.python.platform import test
@@ -1276,49 +1275,6 @@ class LayerNormBasicLSTMCellTest(test.TestCase):
self.assertAllClose(res[2].c, expected_c1, 1e-5)
self.assertAllClose(res[2].h, expected_h1, 1e-5)
-
- def testBasicLSTMCellWithStateTupleLayerNorm(self):
- """The results of LSTMCell and LayerNormBasicLSTMCell
- should be same. """
- with self.test_session() as sess:
- with variable_scope.variable_scope(
- "root", initializer=init_ops.constant_initializer(0.5)):
- x = array_ops.zeros([1, 2])
- c0 = array_ops.zeros([1, 2])
- h0 = array_ops.zeros([1, 2])
- state0 = rnn_cell_impl.LSTMStateTuple(c0, h0)
- c1 = array_ops.zeros([1, 2])
- h1 = array_ops.zeros([1, 2])
- state1 = rnn_cell_impl.LSTMStateTuple(c1, h1)
- cell = rnn_cell_impl.MultiRNNCell(
- [contrib_rnn_cell.LayerNormLSTMCell(
- 2,
- layer_norm=True,
- norm_gain=1.0,
- norm_shift=0.0) for _ in range(2)])
- h, (s0, s1) = cell(x, (state0, state1))
- sess.run([variables.global_variables_initializer()])
- res = sess.run([h, s0, s1], {
- x.name: np.array([[1., 1.]]),
- c0.name: 0.1 * np.asarray([[0, 1]]),
- h0.name: 0.1 * np.asarray([[2, 3]]),
- c1.name: 0.1 * np.asarray([[4, 5]]),
- h1.name: 0.1 * np.asarray([[6, 7]]),
- })
-
- expected_h = np.array([[-0.38079708, 0.38079708]])
- expected_h0 = np.array([[-0.38079708, 0.38079708]])
- expected_c0 = np.array([[-1.0, 1.0]])
- expected_h1 = np.array([[-0.38079708, 0.38079708]])
- expected_c1 = np.array([[-1.0, 1.0]])
-
- self.assertEqual(len(res), 3)
- self.assertAllClose(res[0], expected_h, 1e-5)
- self.assertAllClose(res[1].c, expected_c0, 1e-5)
- self.assertAllClose(res[1].h, expected_h0, 1e-5)
- self.assertAllClose(res[2].c, expected_c1, 1e-5)
- self.assertAllClose(res[2].h, expected_h1, 1e-5)
-
def testBasicLSTMCellWithDropout(self):
def _is_close(x, y, digits=4):
diff --git a/tensorflow/contrib/rnn/python/ops/rnn_cell.py b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
index 5e85c125df..d4691f2c27 100644
--- a/tensorflow/contrib/rnn/python/ops/rnn_cell.py
+++ b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
@@ -36,7 +36,6 @@ from tensorflow.python.ops import nn_ops
from tensorflow.python.ops import random_ops
from tensorflow.python.ops import rnn_cell_impl
from tensorflow.python.ops import variable_scope as vs
-from tensorflow.python.ops import partitioned_variables
from tensorflow.python.platform import tf_logging as logging
from tensorflow.python.util import nest
@@ -77,18 +76,6 @@ def _get_sharded_variable(name, shape, dtype, num_shards):
return shards
-def _norm(g, b, inp, scope):
- shape = inp.get_shape()[-1:]
- gamma_init = init_ops.constant_initializer(g)
- beta_init = init_ops.constant_initializer(b)
- with vs.variable_scope(scope):
- # Initialize beta and gamma for use by layer_norm.
- vs.get_variable("gamma", shape=shape, initializer=gamma_init)
- vs.get_variable("beta", shape=shape, initializer=beta_init)
- normalized = layers.layer_norm(inp, reuse=True, scope=scope)
- return normalized
-
-
class CoupledInputForgetGateLSTMCell(rnn_cell_impl.RNNCell):
"""Long short-term memory unit (LSTM) recurrent network cell.
@@ -115,24 +102,13 @@ class CoupledInputForgetGateLSTMCell(rnn_cell_impl.RNNCell):
The class uses optional peep-hole connections, and an optional projection
layer.
-
- Layer normalization implementation is based on:
-
- https://arxiv.org/abs/1607.06450.
-
- "Layer Normalization"
- Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
-
- and is applied before the internal nonlinearities.
-
"""
def __init__(self, num_units, use_peepholes=False,
initializer=None, num_proj=None, proj_clip=None,
num_unit_shards=1, num_proj_shards=1,
forget_bias=1.0, state_is_tuple=True,
- activation=math_ops.tanh, reuse=None,
- layer_norm=False, norm_gain=1.0, norm_shift=0.0):
+ activation=math_ops.tanh, reuse=None):
"""Initialize the parameters for an LSTM cell.
Args:
@@ -159,13 +135,6 @@ class CoupledInputForgetGateLSTMCell(rnn_cell_impl.RNNCell):
reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not `True`, and the existing scope already has
the given variables, an error is raised.
- layer_norm: If `True`, layer normalization will be applied.
- norm_gain: float, The layer normalization gain initial value. If
- `layer_norm` has been set to `False`, this argument will be ignored.
- norm_shift: float, The layer normalization shift initial value. If
- `layer_norm` has been set to `False`, this argument will be ignored.
-
-
"""
super(CoupledInputForgetGateLSTMCell, self).__init__(_reuse=reuse)
if not state_is_tuple:
@@ -183,9 +152,6 @@ class CoupledInputForgetGateLSTMCell(rnn_cell_impl.RNNCell):
self._state_is_tuple = state_is_tuple
self._activation = activation
self._reuse = reuse
- self._layer_norm = layer_norm
- self._norm_gain = norm_gain
- self._norm_shift = norm_shift
if num_proj:
self._state_size = (rnn_cell_impl.LSTMStateTuple(num_units, num_proj)
@@ -254,20 +220,9 @@ class CoupledInputForgetGateLSTMCell(rnn_cell_impl.RNNCell):
# j = new_input, f = forget_gate, o = output_gate
cell_inputs = array_ops.concat([inputs, m_prev], 1)
- lstm_matrix = math_ops.matmul(cell_inputs, concat_w)
-
- # If layer nomalization is applied, do not add bias
- if not self._layer_norm:
- lstm_matrix = nn_ops.bias_add(lstm_matrix, b)
-
+ lstm_matrix = nn_ops.bias_add(math_ops.matmul(cell_inputs, concat_w), b)
j, f, o = array_ops.split(value=lstm_matrix, num_or_size_splits=3, axis=1)
- # Apply layer normalization
- if self._layer_norm:
- j = _norm(self._norm_gain, self._norm_shift, j, "transform")
- f = _norm(self._norm_gain, self._norm_shift, f, "forget")
- o = _norm(self._norm_gain, self._norm_shift, o, "output")
-
# Diagonal connections
if self._use_peepholes:
w_f_diag = vs.get_variable(
@@ -281,10 +236,6 @@ class CoupledInputForgetGateLSTMCell(rnn_cell_impl.RNNCell):
f_act = sigmoid(f + self._forget_bias)
c = (f_act * c_prev + (1 - f_act) * self._activation(j))
- # Apply layer normalization
- if self._layer_norm:
- c = _norm(self._norm_gain, self._norm_shift, c, "state")
-
if self._use_peepholes:
m = sigmoid(o + w_o_diag * c) * self._activation(c)
else:
@@ -1350,8 +1301,8 @@ class LayerNormBasicLSTMCell(rnn_cell_impl.RNNCell):
self._keep_prob = dropout_keep_prob
self._seed = dropout_prob_seed
self._layer_norm = layer_norm
- self._norm_gain = norm_gain
- self._norm_shift = norm_shift
+ self._g = norm_gain
+ self._b = norm_shift
self._reuse = reuse
@property
@@ -1362,25 +1313,24 @@ class LayerNormBasicLSTMCell(rnn_cell_impl.RNNCell):
def output_size(self):
return self._num_units
- def _norm(self, inp, scope, dtype=dtypes.float32):
+ def _norm(self, inp, scope):
shape = inp.get_shape()[-1:]
- gamma_init = init_ops.constant_initializer(self._norm_gain)
- beta_init = init_ops.constant_initializer(self._norm_shift)
+ gamma_init = init_ops.constant_initializer(self._g)
+ beta_init = init_ops.constant_initializer(self._b)
with vs.variable_scope(scope):
# Initialize beta and gamma for use by layer_norm.
- vs.get_variable("gamma", shape=shape, initializer=gamma_init, dtype=dtype)
- vs.get_variable("beta", shape=shape, initializer=beta_init, dtype=dtype)
+ vs.get_variable("gamma", shape=shape, initializer=gamma_init)
+ vs.get_variable("beta", shape=shape, initializer=beta_init)
normalized = layers.layer_norm(inp, reuse=True, scope=scope)
return normalized
def _linear(self, args):
out_size = 4 * self._num_units
proj_size = args.get_shape()[-1]
- dtype = args.dtype
- weights = vs.get_variable("kernel", [proj_size, out_size], dtype=dtype)
+ weights = vs.get_variable("kernel", [proj_size, out_size])
out = math_ops.matmul(args, weights)
if not self._layer_norm:
- bias = vs.get_variable("bias", [out_size], dtype=dtype)
+ bias = vs.get_variable("bias", [out_size])
out = nn_ops.bias_add(out, bias)
return out
@@ -1389,14 +1339,13 @@ class LayerNormBasicLSTMCell(rnn_cell_impl.RNNCell):
c, h = state
args = array_ops.concat([inputs, h], 1)
concat = self._linear(args)
- dtype = args.dtype
i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)
if self._layer_norm:
- i = self._norm(i, "input", dtype=dtype)
- j = self._norm(j, "transform", dtype=dtype)
- f = self._norm(f, "forget", dtype=dtype)
- o = self._norm(o, "output", dtype=dtype)
+ i = self._norm(i, "input")
+ j = self._norm(j, "transform")
+ f = self._norm(f, "forget")
+ o = self._norm(o, "output")
g = self._activation(j)
if (not isinstance(self._keep_prob, float)) or self._keep_prob < 1:
@@ -1405,7 +1354,7 @@ class LayerNormBasicLSTMCell(rnn_cell_impl.RNNCell):
new_c = (c * math_ops.sigmoid(f + self._forget_bias)
+ math_ops.sigmoid(i) * g)
if self._layer_norm:
- new_c = self._norm(new_c, "state", dtype=dtype)
+ new_c = self._norm(new_c, "state")
new_h = self._activation(new_c) * math_ops.sigmoid(o)
new_state = rnn_cell_impl.LSTMStateTuple(new_c, new_h)
@@ -2357,264 +2306,3 @@ class GLSTMCell(rnn_cell_impl.RNNCell):
new_state = rnn_cell_impl.LSTMStateTuple(c, m)
return m, new_state
-
-
-class LayerNormLSTMCell(rnn_cell_impl.RNNCell):
- """Long short-term memory unit (LSTM) recurrent network cell.
-
- The default non-peephole implementation is based on:
-
- http://www.bioinf.jku.at/publications/older/2604.pdf
-
- S. Hochreiter and J. Schmidhuber.
- "Long Short-Term Memory". Neural Computation, 9(8):1735-1780, 1997.
-
- The peephole implementation is based on:
-
- https://research.google.com/pubs/archive/43905.pdf
-
- Hasim Sak, Andrew Senior, and Francoise Beaufays.
- "Long short-term memory recurrent neural network architectures for
- large scale acoustic modeling." INTERSPEECH, 2014.
-
- The class uses optional peep-hole connections, optional cell clipping, and
- an optional projection layer.
-
- Layer normalization implementation is based on:
-
- https://arxiv.org/abs/1607.06450.
-
- "Layer Normalization"
- Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
-
- and is applied before the internal nonlinearities.
-
- """
-
- def __init__(self, num_units,
- use_peepholes=False, cell_clip=None,
- initializer=None, num_proj=None, proj_clip=None,
- forget_bias=1.0,
- activation=None, layer_norm=False,
- norm_gain=1.0, norm_shift=0.0, reuse=None):
- """Initialize the parameters for an LSTM cell.
-
- Args:
- num_units: int, The number of units in the LSTM cell
- use_peepholes: bool, set True to enable diagonal/peephole connections.
- cell_clip: (optional) A float value, if provided the cell state is clipped
- by this value prior to the cell output activation.
- initializer: (optional) The initializer to use for the weight and
- projection matrices.
- num_proj: (optional) int, The output dimensionality for the projection
- matrices. If None, no projection is performed.
- proj_clip: (optional) A float value. If `num_proj > 0` and `proj_clip` is
- provided, then the projected values are clipped elementwise to within
- `[-proj_clip, proj_clip]`.
- forget_bias: Biases of the forget gate are initialized by default to 1
- in order to reduce the scale of forgetting at the beginning of
- the training. Must set it manually to `0.0` when restoring from
- CudnnLSTM trained checkpoints.
- activation: Activation function of the inner states. Default: `tanh`.
- layer_norm: If `True`, layer normalization will be applied.
- norm_gain: float, The layer normalization gain initial value. If
- `layer_norm` has been set to `False`, this argument will be ignored.
- norm_shift: float, The layer normalization shift initial value. If
- `layer_norm` has been set to `False`, this argument will be ignored.
- reuse: (optional) Python boolean describing whether to reuse variables
- in an existing scope. If not `True`, and the existing scope already has
- the given variables, an error is raised.
-
- When restoring from CudnnLSTM-trained checkpoints, must use
- CudnnCompatibleLSTMCell instead.
- """
- super(LayerNormLSTMCell, self).__init__(_reuse=reuse)
-
- self._num_units = num_units
- self._use_peepholes = use_peepholes
- self._cell_clip = cell_clip
- self._initializer = initializer
- self._num_proj = num_proj
- self._proj_clip = proj_clip
- self._forget_bias = forget_bias
- self._activation = activation or math_ops.tanh
- self._layer_norm = layer_norm
- self._norm_gain = norm_gain
- self._norm_shift = norm_shift
-
- if num_proj:
- self._state_size = (rnn_cell_impl.LSTMStateTuple(num_units, num_proj))
- self._output_size = num_proj
- else:
- self._state_size = (rnn_cell_impl.LSTMStateTuple(num_units, num_units))
- self._output_size = num_units
-
- @property
- def state_size(self):
- return self._state_size
-
- @property
- def output_size(self):
- return self._output_size
-
-
- def _linear(self,
- args,
- output_size,
- bias,
- bias_initializer=None,
- kernel_initializer=None,
- layer_norm=False):
- """Linear map: sum_i(args[i] * W[i]), where W[i] is a Variable.
-
- Args:
- args: a 2D Tensor or a list of 2D, batch x n, Tensors.
- output_size: int, second dimension of W[i].
- bias: boolean, whether to add a bias term or not.
- bias_initializer: starting value to initialize the bias
- (default is all zeros).
- kernel_initializer: starting value to initialize the weight.
- layer_norm: boolean, whether to apply layer normalization.
-
-
- Returns:
- A 2D Tensor with shape [batch x output_size] taking value
- sum_i(args[i] * W[i]), where each W[i] is a newly created Variable.
-
- Raises:
- ValueError: if some of the arguments has unspecified or wrong shape.
- """
- if args is None or (nest.is_sequence(args) and not args):
- raise ValueError("`args` must be specified")
- if not nest.is_sequence(args):
- args = [args]
-
- # Calculate the total size of arguments on dimension 1.
- total_arg_size = 0
- shapes = [a.get_shape() for a in args]
- for shape in shapes:
- if shape.ndims != 2:
- raise ValueError("linear is expecting 2D arguments: %s" % shapes)
- if shape[1].value is None:
- raise ValueError("linear expects shape[1] to be provided for shape %s, "
- "but saw %s" % (shape, shape[1]))
- else:
- total_arg_size += shape[1].value
-
- dtype = [a.dtype for a in args][0]
-
- # Now the computation.
- scope = vs.get_variable_scope()
- with vs.variable_scope(scope) as outer_scope:
- weights = vs.get_variable(
- "kernel", [total_arg_size, output_size],
- dtype=dtype,
- initializer=kernel_initializer)
- if len(args) == 1:
- res = math_ops.matmul(args[0], weights)
- else:
- res = math_ops.matmul(array_ops.concat(args, 1), weights)
- if not bias:
- return res
- with vs.variable_scope(outer_scope) as inner_scope:
- inner_scope.set_partitioner(None)
- if bias_initializer is None:
- bias_initializer = init_ops.constant_initializer(0.0, dtype=dtype)
- biases = vs.get_variable(
- "bias", [output_size],
- dtype=dtype,
- initializer=bias_initializer)
-
- if not layer_norm:
- res = nn_ops.bias_add(res, biases)
-
- return res
-
- def call(self, inputs, state):
- """Run one step of LSTM.
-
- Args:
- inputs: input Tensor, 2D, batch x num_units.
- state: this must be a tuple of state Tensors,
- both `2-D`, with column sizes `c_state` and
- `m_state`.
-
- Returns:
- A tuple containing:
-
- - A `2-D, [batch x output_dim]`, Tensor representing the output of the
- LSTM after reading `inputs` when previous state was `state`.
- Here output_dim is:
- num_proj if num_proj was set,
- num_units otherwise.
- - Tensor(s) representing the new state of LSTM after reading `inputs` when
- the previous state was `state`. Same type and shape(s) as `state`.
-
- Raises:
- ValueError: If input size cannot be inferred from inputs via
- static shape inference.
- """
- num_proj = self._num_units if self._num_proj is None else self._num_proj
- sigmoid = math_ops.sigmoid
-
- (c_prev, m_prev) = state
-
- dtype = inputs.dtype
- input_size = inputs.get_shape().with_rank(2)[1]
- if input_size.value is None:
- raise ValueError("Could not infer input size from inputs.get_shape()[-1]")
- scope = vs.get_variable_scope()
- with vs.variable_scope(scope, initializer=self._initializer) as unit_scope:
-
- # i = input_gate, j = new_input, f = forget_gate, o = output_gate
- lstm_matrix = self._linear([inputs, m_prev], 4 * self._num_units, bias=True,
- bias_initializer=None, layer_norm=self._layer_norm)
- i, j, f, o = array_ops.split(
- value=lstm_matrix, num_or_size_splits=4, axis=1)
-
- if self._layer_norm:
- i = _norm(self._norm_gain, self._norm_shift, i, "input")
- j = _norm(self._norm_gain, self._norm_shift, j, "transform")
- f = _norm(self._norm_gain, self._norm_shift, f, "forget")
- o = _norm(self._norm_gain, self._norm_shift, o, "output")
-
- # Diagonal connections
- if self._use_peepholes:
- with vs.variable_scope(unit_scope) as projection_scope:
- w_f_diag = vs.get_variable(
- "w_f_diag", shape=[self._num_units], dtype=dtype)
- w_i_diag = vs.get_variable(
- "w_i_diag", shape=[self._num_units], dtype=dtype)
- w_o_diag = vs.get_variable(
- "w_o_diag", shape=[self._num_units], dtype=dtype)
-
- if self._use_peepholes:
- c = (sigmoid(f + self._forget_bias + w_f_diag * c_prev) * c_prev +
- sigmoid(i + w_i_diag * c_prev) * self._activation(j))
- else:
- c = (sigmoid(f + self._forget_bias) * c_prev + sigmoid(i) *
- self._activation(j))
-
- if self._layer_norm:
- c = _norm(self._norm_gain, self._norm_shift, c, "state")
-
- if self._cell_clip is not None:
- # pylint: disable=invalid-unary-operand-type
- c = clip_ops.clip_by_value(c, -self._cell_clip, self._cell_clip)
- # pylint: enable=invalid-unary-operand-type
- if self._use_peepholes:
- m = sigmoid(o + w_o_diag * c) * self._activation(c)
- else:
- m = sigmoid(o) * self._activation(c)
-
- if self._num_proj is not None:
- with vs.variable_scope("projection") as proj_scope:
- m = self._linear(m, self._num_proj, bias=False)
-
- if self._proj_clip is not None:
- # pylint: disable=invalid-unary-operand-type
- m = clip_ops.clip_by_value(m, -self._proj_clip, self._proj_clip)
- # pylint: enable=invalid-unary-operand-type
-
- new_state = (rnn_cell_impl.LSTMStateTuple(c, m))
- return m, new_state
diff --git a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
index c3b180d9f4..87230e3355 100644
--- a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
+++ b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
@@ -149,7 +149,7 @@ class _BaseAttentionMechanism(AttentionMechanism):
memory_sequence_length=None,
memory_layer=None,
check_inner_dims_defined=True,
- score_mask_value=None,
+ score_mask_value=float("-inf"),
name=None):
"""Construct base AttentionMechanism class.
@@ -187,12 +187,9 @@ class _BaseAttentionMechanism(AttentionMechanism):
"memory_layer is not a Layer: %s" % type(memory_layer).__name__)
self._query_layer = query_layer
self._memory_layer = memory_layer
- self.dtype = memory_layer.dtype
if not callable(probability_fn):
raise TypeError("probability_fn must be callable, saw type: %s" %
type(probability_fn).__name__)
- if score_mask_value is None:
- score_mask_value = dtypes.as_dtype(self._memory_layer.dtype).as_numpy_dtype(-np.inf)
self._probability_fn = lambda score, prev: ( # pylint:disable=g-long-lambda
probability_fn(
_maybe_mask_score(score, memory_sequence_length, score_mask_value),
@@ -337,8 +334,7 @@ class LuongAttention(_BaseAttentionMechanism):
memory_sequence_length=None,
scale=False,
probability_fn=None,
- score_mask_value=None,
- dtype=None,
+ score_mask_value=float("-inf"),
name="LuongAttention"):
"""Construct the AttentionMechanism mechanism.
@@ -357,20 +353,17 @@ class LuongAttention(_BaseAttentionMechanism):
score_mask_value: (optional) The mask value for score before passing into
`probability_fn`. The default is -inf. Only used if
`memory_sequence_length` is not None.
- dtype: The data type for the memory layer of the attention mechanism.
name: Name to use when creating ops.
"""
# For LuongAttention, we only transform the memory layer; thus
# num_units **must** match expected the query depth.
if probability_fn is None:
probability_fn = nn_ops.softmax
- if dtype is None:
- dtype = dtypes.float32
wrapped_probability_fn = lambda score, _: probability_fn(score)
super(LuongAttention, self).__init__(
query_layer=None,
memory_layer=layers_core.Dense(
- num_units, name="memory_layer", use_bias=False, dtype=dtype),
+ num_units, name="memory_layer", use_bias=False),
memory=memory,
probability_fn=wrapped_probability_fn,
memory_sequence_length=memory_sequence_length,
@@ -482,8 +475,7 @@ class BahdanauAttention(_BaseAttentionMechanism):
memory_sequence_length=None,
normalize=False,
probability_fn=None,
- score_mask_value=None,
- dtype=None,
+ score_mask_value=float("-inf"),
name="BahdanauAttention"):
"""Construct the Attention mechanism.
@@ -502,20 +494,16 @@ class BahdanauAttention(_BaseAttentionMechanism):
score_mask_value: (optional): The mask value for score before passing into
`probability_fn`. The default is -inf. Only used if
`memory_sequence_length` is not None.
- dtype: The data type for the query and memory layers of the attention
- mechanism.
name: Name to use when creating ops.
"""
if probability_fn is None:
probability_fn = nn_ops.softmax
- if dtype is None:
- dtype = dtypes.float32
wrapped_probability_fn = lambda score, _: probability_fn(score)
super(BahdanauAttention, self).__init__(
query_layer=layers_core.Dense(
- num_units, name="query_layer", use_bias=False, dtype=dtype),
+ num_units, name="query_layer", use_bias=False),
memory_layer=layers_core.Dense(
- num_units, name="memory_layer", use_bias=False, dtype=dtype),
+ num_units, name="memory_layer", use_bias=False),
memory=memory,
probability_fn=wrapped_probability_fn,
memory_sequence_length=memory_sequence_length,
@@ -750,12 +738,11 @@ class BahdanauMonotonicAttention(_BaseMonotonicAttentionMechanism):
memory,
memory_sequence_length=None,
normalize=False,
- score_mask_value=None,
+ score_mask_value=float("-inf"),
sigmoid_noise=0.,
sigmoid_noise_seed=None,
score_bias_init=0.,
mode="parallel",
- dtype=None,
name="BahdanauMonotonicAttention"):
"""Construct the Attention mechanism.
@@ -779,21 +766,17 @@ class BahdanauMonotonicAttention(_BaseMonotonicAttentionMechanism):
mode: How to compute the attention distribution. Must be one of
'recursive', 'parallel', or 'hard'. See the docstring for
`tf.contrib.seq2seq.monotonic_attention` for more information.
- dtype: The data type for the query and memory layers of the attention
- mechanism.
name: Name to use when creating ops.
"""
# Set up the monotonic probability fn with supplied parameters
- if dtype is None:
- dtype = dtypes.float32
wrapped_probability_fn = functools.partial(
_monotonic_probability_fn, sigmoid_noise=sigmoid_noise, mode=mode,
seed=sigmoid_noise_seed)
super(BahdanauMonotonicAttention, self).__init__(
query_layer=layers_core.Dense(
- num_units, name="query_layer", use_bias=False, dtype=dtype),
+ num_units, name="query_layer", use_bias=False),
memory_layer=layers_core.Dense(
- num_units, name="memory_layer", use_bias=False, dtype=dtype),
+ num_units, name="memory_layer", use_bias=False),
memory=memory,
probability_fn=wrapped_probability_fn,
memory_sequence_length=memory_sequence_length,
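A sketch of the monotonic variant under the same assumptions; `mode` selects how `_monotonic_probability_fn` evaluates the attention distribution, as described in the docstring above:

```python
attention = tf.contrib.seq2seq.BahdanauMonotonicAttention(
    num_units=256,
    memory=memory,
    memory_sequence_length=lengths,
    sigmoid_noise=1.0,   # pre-sigmoid noise, encourages discrete attention
    mode="parallel")     # one of 'recursive', 'parallel', or 'hard'
```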
@@ -851,12 +834,11 @@ class LuongMonotonicAttention(_BaseMonotonicAttentionMechanism):
memory,
memory_sequence_length=None,
scale=False,
- score_mask_value=None,
+ score_mask_value=float("-inf"),
sigmoid_noise=0.,
sigmoid_noise_seed=None,
score_bias_init=0.,
mode="parallel",
- dtype=None,
name="LuongMonotonicAttention"):
"""Construct the Attention mechanism.
@@ -880,21 +862,17 @@ class LuongMonotonicAttention(_BaseMonotonicAttentionMechanism):
mode: How to compute the attention distribution. Must be one of
'recursive', 'parallel', or 'hard'. See the docstring for
`tf.contrib.seq2seq.monotonic_attention` for more information.
- dtype: The data type for the query and memory layers of the attention
- mechanism.
name: Name to use when creating ops.
"""
# Set up the monotonic probability fn with supplied parameters
- if dtype is None:
- dtype = dtypes.float32
wrapped_probability_fn = functools.partial(
_monotonic_probability_fn, sigmoid_noise=sigmoid_noise, mode=mode,
seed=sigmoid_noise_seed)
super(LuongMonotonicAttention, self).__init__(
query_layer=layers_core.Dense(
- num_units, name="query_layer", use_bias=False, dtype=dtype),
+ num_units, name="query_layer", use_bias=False),
memory_layer=layers_core.Dense(
- num_units, name="memory_layer", use_bias=False, dtype=dtype),
+ num_units, name="memory_layer", use_bias=False),
memory=memory,
probability_fn=wrapped_probability_fn,
memory_sequence_length=memory_sequence_length,
@@ -1145,9 +1123,8 @@ class AttentionWrapper(rnn_cell_impl.RNNCell):
% (len(attention_layer_sizes), len(attention_mechanisms)))
self._attention_layers = tuple(
layers_core.Dense(
- attention_layer_size, name="attention_layer", use_bias=False,
- dtype=attention_mechanisms[i].dtype)
- for i, attention_layer_size in enumerate(attention_layer_sizes))
+ attention_layer_size, name="attention_layer", use_bias=False)
+ for attention_layer_size in attention_layer_sizes)
self._attention_layer_size = sum(attention_layer_sizes)
else:
self._attention_layers = None
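Even with the per-mechanism `dtype` plumbing removed, one attention layer is still built per mechanism. A hedged sketch of the multi-mechanism case, where `attention_a` and `attention_b` are hypothetical stand-ins for any two mechanisms constructed as above:

```python
cell = tf.nn.rnn_cell.LSTMCell(256)
wrapper = tf.contrib.seq2seq.AttentionWrapper(
    cell,
    [attention_a, attention_b],       # one mechanism per attention head
    attention_layer_size=[128, 128])  # one Dense layer size per mechanism
```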
diff --git a/tensorflow/contrib/slim/README.md b/tensorflow/contrib/slim/README.md
index f7a85557ca..0bfd0801d5 100644
--- a/tensorflow/contrib/slim/README.md
+++ b/tensorflow/contrib/slim/README.md
@@ -237,7 +237,7 @@ One way to reduce this code duplication would be via a `for` loop:
```python
net = ...
for i in range(3):
- net = slim.conv2d(net, 256, [3, 3], scope='conv3_%d' % (i+1))
+ net = slim.conv2d(net, 256, [3, 3], scope='conv3_%d' % (i+1))
net = slim.max_pool2d(net, [2, 2], scope='pool2')
```
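For reference, slim's `repeat` helper collapses the same loop into a single call and auto-numbers the scopes (`conv3/conv3_1` through `conv3/conv3_3`); this sketch assumes the standard `slim.repeat` API:

```python
net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
net = slim.max_pool2d(net, [2, 2], scope='pool2')
```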
diff --git a/tensorflow/contrib/slim/python/slim/nets/resnet_v1_test.py b/tensorflow/contrib/slim/python/slim/nets/resnet_v1_test.py
index 576444214d..b4fd2580c2 100644
--- a/tensorflow/contrib/slim/python/slim/nets/resnet_v1_test.py
+++ b/tensorflow/contrib/slim/python/slim/nets/resnet_v1_test.py
@@ -386,7 +386,7 @@ class ResnetCompleteNetworkTest(test.TestCase):
inputs, None, is_training=False, global_pool=False)
sess.run(variables.global_variables_initializer())
self.assertAllClose(
- output.eval(), expected.eval(), atol=2e-4, rtol=1e-4)
+ output.eval(), expected.eval(), atol=1e-4, rtol=1e-4)
def testUnknownBatchSize(self):
batch = 2
diff --git a/tensorflow/contrib/verbs/README.md b/tensorflow/contrib/verbs/README.md
index dcb390b0a5..da5f2b0223 100644
--- a/tensorflow/contrib/verbs/README.md
+++ b/tensorflow/contrib/verbs/README.md
@@ -1,4 +1,4 @@
-## How to compile, use and configure RDMA-enabled TensorFlow
+## How to compile and use RDMA-enabled TensorFlow
1. Follow the regular TF compilation instructions. During the configure step, if you want ibverbs-based RDMA support, answer yes to this question:
```Do you wish to build TensorFlow with VERBS-RDMA support [y/N]```
@@ -7,18 +7,6 @@
```server = tf.train.Server(cluster, job_name="local", task_index=0, protocol='grpc+verbs') # default protocol is 'grpc'```
-3. RDMA configuration is done by setting the following environment variables:
- * **RDMA_DEVICE**: The RDMA device name to be used. If not defined by user, a default device with an active port will be set if exists.
- * **RDMA_DEVICE_PORT**: The port within the selected device. Not relevant if RDMA_DEVICE is not defined. If not defined by user, a default active port will be set if exists.
- * **RDMA_GID_INDEX**: The GID index of the port. If not defined by user, a default suitable GID index will be set (RoCEV2 is favourable as default).
- * **RDMA_QP_PKEY_INDEX**: The Pkey for the QP. If not defined by user, the default value is 0.
- * **RDMA_QP_QUEUE_DEPTH**: TX/RX queue size for the QP. If not defined by user, the default value is 1024.
- * **RDMA_QP_TIMEOUT**: The retransmission timeout for QPs. If not defined by user, the default value is 14.
- * **RDMA_QP_RETRY_COUNT**: Number of retransmission for QPs. If not defined by user, the default value is 7.
- * **RDMA_QP_SL**: Service level configuration for QOS and ECN, valid values are 0-7. If not defined by user, the default value is 0.
- * **RDMA_QP_MTU**: MTU configuration for the QPs. If not defined by user, the default value is active MTU from query_port.
- * **RDMA_TRAFFIC_CLASS**: Traffic class configuration for QP, in case of DSCP trust level QoS configuration. If not defined by user, the default value is 0. For more info see [HowTo Configure Trust state on Mellanox Adapters](https://community.mellanox.com/docs/DOC-2866).
-
## Overview
The design is based on TensorFlow r1.0. An RDMA path is added between servers for tensor transfer (weights, gradients, etc). The existing GRPC path remains and is responsible for "administrative" tasks, such as setting up the RDMA path, exchanging computation graphs, etc.
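A minimal sketch of step 2 above in a two-worker cluster, with hypothetical host addresses; only the `protocol` argument differs from a plain gRPC setup:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["host1:2222", "host2:2222"],  # hypothetical addresses
})
# 'grpc+verbs' moves tensor transfers onto the RDMA path, while gRPC
# keeps handling the administrative traffic described in the overview.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
```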
diff --git a/tensorflow/contrib/verbs/rdma.cc b/tensorflow/contrib/verbs/rdma.cc
index 331943a3ef..26e18b28aa 100644
--- a/tensorflow/contrib/verbs/rdma.cc
+++ b/tensorflow/contrib/verbs/rdma.cc
@@ -17,7 +17,6 @@ limitations under the License.
#include "tensorflow/contrib/verbs/rdma.h"
#include <cstdlib>
-#include <fcntl.h>
#include "tensorflow/contrib/verbs/verbs_util.h"
#include "tensorflow/core/common_runtime/device_mgr.h"
#include "tensorflow/core/common_runtime/dma_helper.h"
@@ -34,8 +33,6 @@ limitations under the License.
namespace tensorflow {
-#define RoCE_V2 "RoCE v2"
-
namespace {
// hash name to 32-bit integer
uint32_t NameHash(const string& name) {
@@ -69,337 +66,16 @@ string MessageTypeToString(RdmaMessageType rmt) {
}
} // namespace
-// Function to get environment variable
-// Args:
-// var_name - the name of the environmental variable
-// Returns:
-// string with it's value or empty string if not set
-string get_env_var(char const* var_name) {
- char const* var_temp = getenv(var_name);
-
- return (var_temp == NULL) ? string() : string(var_temp);
-}
-
-// Function to open device
-// Args:
-// ibv_dev device to open
-// Returns:
-// context of the opened device
-ibv_context* open_device(ibv_device* ibv_dev) {
- ibv_context* context = ibv_open_device(ibv_dev);
-
- CHECK(context) << "Open context failed for " << ibv_get_device_name(ibv_dev);
- return context;
-}
-
-// Function to count the number of active ports for device
-// Args:
-// device - to check active ports
-// Returns:
-// number of active ports of the given device
-int get_dev_active_port_count(ibv_device* device) {
- ibv_device_attr device_att;
- ibv_port_attr port_attr;
- ibv_context* context = NULL;
- int rc, port_index, active_ports = 0;
-
- context = ibv_open_device(device);
- CHECK(context) << "Open context failed for " << ibv_get_device_name(device);
- rc = ibv_query_device(context, &device_att);
- CHECK(!rc) << "Failed to query the device";
-
- for (port_index = 1; port_index <= device_att.phys_port_cnt; port_index++) {
- rc = ibv_query_port(context, port_index, &port_attr);
- CHECK(!rc) << "Failed to query the port" << port_index;
- if (port_attr.state == IBV_PORT_ACTIVE) {
- active_ports++;
- }
- }
- ibv_close_device(context);
- return active_ports;
-}
-
-// Function to set device. If RDMA_DEVICE not set, search for device with active
-// port.
-// Fails if more than one device with active port was found.
-// Returns:
-// device to use
-ibv_device* set_device() {
+ibv_context* open_default_device() {
ibv_device** dev_list;
- int dev_num, device_index, device_to_open = 0;
- int num_devs_with_active_port = 0;
- string env_p_rdma_device, str_port_num;
-
- dev_list = ibv_get_device_list(&dev_num);
+ ibv_device* ib_dev;
+ dev_list = ibv_get_device_list(NULL);
CHECK(dev_list) << "No InfiniBand device found";
-
- env_p_rdma_device = get_env_var("RDMA_DEVICE");
- if (!env_p_rdma_device.empty()) {
- for (device_index = 0; device_index < dev_num; device_index++) {
- if (!env_p_rdma_device.compare(
- ibv_get_device_name(dev_list[device_index]))) {
- CHECK(get_dev_active_port_count(dev_list[device_index]) != 0)
- << "Device " << ibv_get_device_name(dev_list[device_index])
- << " has no active ports";
- return dev_list[device_index];
- }
- }
- // check validity of input device
- CHECK(false) << "The device " << env_p_rdma_device << " wasn't found";
- } else {
- // set default device
- str_port_num = get_env_var("RDMA_DEVICE_PORT");
- CHECK(str_port_num.empty())
- << "RDMA_DEVICE should be provided if RDMA_DEVICE_PORT is set by user";
- for (device_index = 0; device_index < dev_num; device_index++) {
- // get port_num
- if (get_dev_active_port_count(dev_list[device_index]) > 0) {
- num_devs_with_active_port++;
- CHECK(num_devs_with_active_port <= 1) << ". More than one device with "
- "active port in the system. "
- "Please enter RDMA_DEVICE";
- // found device with at least 1 active port
- device_to_open = device_index;
- }
- }
- CHECK(num_devs_with_active_port > 0)
- << "There is no active port in the system";
- return dev_list[device_to_open];
- }
- CHECK(false) << "No device was set!";
- return NULL; // never happens
-}
-
-// Function to set port for device.
-// If RDMA_DEVICE_PORT not set, first active port of the device will be set.
-// Args:
-// context of the device
-// Returns:
-// port to use
-uint8_t set_port(ibv_context* context) {
- uint8_t port_num = 0; //0 is illegal port number
- string str_port_num;
- ibv_device_attr device_att;
- ibv_port_attr port_attr;
- int rc, port_index;
-
- rc = ibv_query_device(context, &device_att);
- CHECK(!rc) << "Failed to query the device\n";
-
- str_port_num = get_env_var("RDMA_DEVICE_PORT");
- // user defined port
- if (!str_port_num.empty()) {
- port_num = stoi(str_port_num);
- CHECK(port_num > 0) << "RDMA_DEVICE_PORT should be positive";
- CHECK(port_num <= device_att.phys_port_cnt) << "RDMA_DEVICE_PORT should be "
- "less or equal to amount of "
- "available ports";
- rc = ibv_query_port(context, port_num, &port_attr);
- CHECK(!rc) << "Failed to query the port" << port_num;
- // check if port id active
- CHECK(port_attr.state == IBV_PORT_ACTIVE)
- << "Selected RDMA_DEVICE_PORT is not active";
- }
- // set default port
- else {
- for (port_index = 1; port_index <= device_att.phys_port_cnt; port_index++) {
- rc = ibv_query_port(context, port_index, &port_attr);
- CHECK(!rc) << "Failed to query the port" << port_index;
- if (port_attr.state == IBV_PORT_ACTIVE) {
- port_num = port_index;
- break;
- }
- }
- CHECK_GT(port_num, 0) << "No active ports";
- }
- return port_num;
-}
-
-// Function read from sysfs file
-// Args:
-// dir - directory
-// file - file
-// buff - buffer for the result
-// size - buffer size
-// Returns:
-// number of bytes were read or -1 if failed
-int read_sysfs_file(const char* dir, const char* file, char* buf, size_t size) {
- char* path;
- int fd;
- int len;
-
- if (asprintf(&path, "%s/%s", dir, file) < 0) return -1;
-
- fd = open(path, O_RDONLY);
- if (fd < 0) {
- free(path);
- return -1;
- }
-
- len = read(fd, buf, size);
-
- close(fd);
- free(path);
-
- if (len > 0 && buf[len - 1] == '\n') buf[--len] = '\0';
-
- return len;
-}
-
-// Function to check if GID index support RoCE V2
-// Args:
-// context - device context
-// port_num - port number
-// index - GID index
-// Returns:
-// if GID supports RoCE V2 - true, otherwise - false.
-bool is_gid_type_roce_v2(ibv_context* context, uint8_t port_num,
- uint8_t index) {
- char name[32];
- char buff[41];
-
- snprintf(name, sizeof(name), "ports/%d/gid_attrs/types/%d", port_num, index);
- if (read_sysfs_file(context->device->ibdev_path, name, buff, sizeof(buff)) <=
- 0) {
- return false;
- }
- return !strcmp(buff, RoCE_V2);
-}
-
-// Function to set GID index.
-// If the port link is IB, no GID index should be selected.
-// If Ethernet but RDMA_GID_INDEX not set gid index that supports
-// RoCE V2 will be chosen(fails if more then one IP is configured)
-// Args:
-// context - device context
-// port_num - port number
-// Returns:
-// GID index to use
-uint8_t set_gid(uint8_t port_num, ibv_context* context) {
- ibv_port_attr port_attr;
- string gid_str;
- int rc, i, gids_num = 0, v2_ip_num = 0;
- union ibv_gid gid;
- uint8_t gid_index = 0;
-
- rc = ibv_query_port(context, port_num, &port_attr);
- CHECK(!rc) << "Failed to query the port" << port_num;
-
- for (i = 0; i < port_attr.gid_tbl_len; i++) {
- rc = ibv_query_gid(context, port_num, i, &gid);
- CHECK(!rc) << "Failed to query gid to port " << (int)port_num << " index "
- << i;
- if (gid.global.interface_id) {
- gids_num++;
- if (gid.global.subnet_prefix == 0 &&
- is_gid_type_roce_v2(context, port_num, i)) {
- if (v2_ip_num == 0) {
- // can be overwritten by RDMA_GID_INDEX later
- gid_index = i;
- }
- v2_ip_num++;
- }
- }
- }
- switch (port_attr.link_layer) {
- case(IBV_LINK_LAYER_ETHERNET) :
- gid_str = get_env_var("RDMA_GID_INDEX");
- if (!gid_str.empty()) {
- gid_index = stoi(gid_str);
- CHECK(gid_index < gids_num)
- << "RDMA_GID_INDEX should be less than GIDs amount" << gids_num;
- } else {
- CHECK(v2_ip_num <= 1)
- << "More than one IP is available, please specify GID_INDEX";
- }
- break;
- case(IBV_LINK_LAYER_INFINIBAND) : // no need in GID index
- break;
- default:
- LOG(INFO) << "Unknown port link layer. Currently supporting Ethernet and "
- "InfiniBand only. ";
- }
- if (!is_gid_type_roce_v2(context, port_num, gid_index)) {
- LOG(INFO) << "RoCE v2 is not configured for GID_INDEX " << (int)gid_index;
- }
- return gid_index;
-}
-
-// set the default or environment value to the configuration parameter.
-// Args:
-// default_val- the default value for this parameter
-// env_param- the environment parameter's name
-// Returns:
-// 32-bit value
-uint32_t set_param(uint32_t default_val, const char* env_param) {
- uint32_t val = default_val;
- string val_s;
-
- val_s = get_env_var(env_param);
-
- if (!val_s.empty()) {
- val = stoi(val_s);
- }
- return val;
-}
-
-enum ibv_mtu set_mtu(uint8_t port_num, ibv_context* context) {
- ibv_port_attr port_attr;
- enum ibv_mtu mtu;
- string mtu_s;
- int rc, mtu_i;
-
- rc = ibv_query_port(context, port_num, &port_attr);
- CHECK(!rc) << "Failed to query the port" << port_num;
-
- mtu_s = get_env_var("RDMA_MTU");
-
- if (!mtu_s.empty()) {
- mtu_i = stoi(mtu_s);
- switch (mtu_i) {
- case 256:
- mtu = IBV_MTU_256;
- break;
- case 512:
- mtu = IBV_MTU_512;
- break;
- case 1024:
- mtu = IBV_MTU_1024;
- break;
- case 2048:
- mtu = IBV_MTU_2048;
- break;
- case 4096:
- mtu = IBV_MTU_4096;
- break;
- default:
- CHECK(0) << "Error: MTU input value must be one of the following: 256, "
- "512, 1024, 2048, 4096. MTU " << mtu << " is invalid\n";
- break;
- }
- CHECK(mtu < port_attr.active_mtu)
- << "MTU configuration for the QPs is larger than active MTU";
- } else {
- mtu = port_attr.active_mtu;
- }
- return mtu;
-}
-
-RdmaParams params_init(ibv_context* context) {
- RdmaParams params;
-
- params.port_num = set_port(context);
- params.sgid_index = set_gid(params.port_num, context);
- params.pkey_index = (uint8_t)set_param(PKEY_DEFAULT, "RDMA_PKEY");
- params.queue_depth = set_param(QUEUE_DEPTH_DEFAULT, "RDMA_QUEUE_DEPTH");
- params.timeout = (uint8_t)set_param(TIMEOUT_DEFAULT, "RDMA_TIMEOUT");
- params.retry_cnt = (uint8_t)set_param(RETRY_CNT_DEFAULT, "RDMA_RETRY_CNT");
- params.sl = (uint8_t)set_param(SL_DEFAULT, "RDMA_SL");
- CHECK(params.sl <= 7) << "SL value is " << (int)params.sl
- << ". Valid values are 0-7.";
- params.mtu = set_mtu(params.port_num, context);
- params.traffic_class = set_param(TRAFFIC_CLASS, "RDMA_TRAFFIC_CLASS");
- return params;
+ ib_dev = dev_list[0];
+ CHECK(ib_dev) << "No InfiniBand device found";
+ ibv_context* context = ibv_open_device(ib_dev);
+ CHECK(context) << "Open context failed for " << ibv_get_device_name(ib_dev);
+ return context;
}
ibv_pd* alloc_protection_domain(ibv_context* context) {
@@ -409,8 +85,7 @@ ibv_pd* alloc_protection_domain(ibv_context* context) {
}
RdmaAdapter::RdmaAdapter(const WorkerEnv* worker_env)
- : context_(open_device(set_device())),
- params_(params_init(context_)),
+ : context_(open_default_device()),
pd_(alloc_protection_domain(context_)),
worker_env_(worker_env) {
event_channel_ = ibv_create_comp_channel(context_);
@@ -453,9 +128,9 @@ void RdmaAdapter::Process_CQ() {
CHECK_GE(ne, 0);
for (int i = 0; i < ne; ++i) {
CHECK(wc_[i].status == IBV_WC_SUCCESS)
- << "Failed status \n" << ibv_wc_status_str(wc_[i].status) << " "
- << wc_[i].status << " " << static_cast<int>(wc_[i].wr_id) << " "
- << wc_[i].vendor_err;
+ << "Failed status \n"
+ << ibv_wc_status_str(wc_[i].status) << " " << wc_[i].status << " "
+ << static_cast<int>(wc_[i].wr_id) << " " << wc_[i].vendor_err;
if (wc_[i].opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
RdmaChannel* rc = reinterpret_cast<RdmaChannel*>(wc_[i].wr_id);
// put back a recv wr.
@@ -567,8 +242,8 @@ RdmaChannel::RdmaChannel(const RdmaAdapter* adapter, const string local_name,
memset(&attr, 0, sizeof(ibv_qp_init_attr));
attr.send_cq = adapter_->cq_;
attr.recv_cq = adapter_->cq_;
- attr.cap.max_send_wr = adapter_->params_.queue_depth;
- attr.cap.max_recv_wr = adapter_->params_.queue_depth;
+ attr.cap.max_send_wr = RdmaAdapter::MAX_CONCURRENT_WRITES;
+ attr.cap.max_recv_wr = RdmaAdapter::MAX_CONCURRENT_WRITES;
attr.cap.max_send_sge = 1;
attr.cap.max_recv_sge = 1;
attr.qp_type = IBV_QPT_RC;
@@ -582,8 +257,8 @@ RdmaChannel::RdmaChannel(const RdmaAdapter* adapter, const string local_name,
struct ibv_qp_attr attr;
memset(&attr, 0, sizeof(ibv_qp_attr));
attr.qp_state = IBV_QPS_INIT;
- attr.pkey_index = adapter_->params_.pkey_index;
- attr.port_num = adapter_->params_.port_num;
+ attr.pkey_index = 0;
+ attr.port_num = 1;
attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
int mask =
@@ -594,15 +269,13 @@ RdmaChannel::RdmaChannel(const RdmaAdapter* adapter, const string local_name,
// Local address
{
struct ibv_port_attr attr;
- CHECK(
- !ibv_query_port(adapter_->context_, adapter_->params_.port_num, &attr))
+ CHECK(!ibv_query_port(adapter_->context_, (uint8_t)1, &attr))
<< "Query port";
self_.lid = attr.lid;
self_.qpn = qp_->qp_num;
self_.psn = static_cast<uint32_t>(random::New64()) & 0xffffff;
union ibv_gid gid;
- CHECK(!ibv_query_gid(adapter_->context_, adapter_->params_.port_num,
- adapter_->params_.sgid_index, &gid))
+ CHECK(!ibv_query_gid(adapter_->context_, (uint8_t)1, 0, &gid))
<< "Query gid";
self_.snp = gid.global.subnet_prefix;
self_.iid = gid.global.interface_id;
@@ -611,7 +284,7 @@ RdmaChannel::RdmaChannel(const RdmaAdapter* adapter, const string local_name,
// create message and ack buffers, then initialize the tables.
{
const string buffer_names[] = {"tx_message_buffer", "rx_message_buffer",
- "tx_ack_buffer", "rx_ack_buffer"};
+ "tx_ack_buffer", "rx_ack_buffer"};
tx_message_buffer_ = new RdmaMessageBuffer(this, buffer_names[0]);
rx_message_buffer_ = new RdmaMessageBuffer(this, buffer_names[1]);
tx_ack_buffer_ = new RdmaAckBuffer(this, buffer_names[2]);
@@ -672,7 +345,7 @@ void RdmaChannel::SetRemoteAddress(const RdmaAddress& ra, bool override) {
void RdmaChannel::Recv() {
struct ibv_recv_wr wr;
memset(&wr, 0, sizeof(wr));
- wr.wr_id = (uint64_t) this;
+ wr.wr_id = (uint64_t)this;
struct ibv_recv_wr* bad_wr;
CHECK(!ibv_post_recv(qp_, &wr, &bad_wr)) << "Failed to post recv";
}
@@ -806,9 +479,11 @@ void RdmaChannel::Connect(const RdmaAddress& remoteAddr) {
struct ibv_qp_attr attr;
memset(&attr, 0, sizeof(ibv_qp_attr));
attr.qp_state = IBV_QPS_RTR;
-
+ struct ibv_port_attr port_attr;
+ CHECK(!ibv_query_port(adapter_->context_, (uint8_t)1, &port_attr))
+ << "Query port failed";
// This assumes both QP's ports are configured with the same MTU
- attr.path_mtu = adapter_->params_.mtu;
+ attr.path_mtu = port_attr.active_mtu;
attr.dest_qp_num = remoteAddr.qpn;
attr.rq_psn = remoteAddr.psn;
attr.max_dest_rd_atomic = 1;
@@ -819,32 +494,30 @@ void RdmaChannel::Connect(const RdmaAddress& remoteAddr) {
attr.ah_attr.grh.flow_label = 0;
attr.ah_attr.grh.hop_limit = 255;
attr.ah_attr.dlid = remoteAddr.lid;
- attr.ah_attr.sl = adapter_->params_.sl;
+ attr.ah_attr.sl = 0;
attr.ah_attr.src_path_bits = 0;
- attr.ah_attr.port_num = adapter_->params_.port_num;
- attr.ah_attr.grh.sgid_index = adapter_->params_.sgid_index;
- attr.ah_attr.grh.traffic_class = adapter_->params_.traffic_class;
+ attr.ah_attr.port_num = 1;
int r;
- CHECK(!(r = ibv_modify_qp(qp_, &attr, IBV_QP_STATE | IBV_QP_AV |
- IBV_QP_PATH_MTU |
- IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
- IBV_QP_MAX_DEST_RD_ATOMIC |
- IBV_QP_MIN_RNR_TIMER)))
+ CHECK(!(r = ibv_modify_qp(qp_, &attr,
+ IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
+ IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
+ IBV_QP_MAX_DEST_RD_ATOMIC |
+ IBV_QP_MIN_RNR_TIMER)))
<< "QP to Ready to Receive " << r;
memset(&attr, 0, sizeof(ibv_qp_attr));
attr.qp_state = IBV_QPS_RTS;
attr.sq_psn = self_.psn;
- attr.timeout = adapter_->params_.timeout;
- attr.retry_cnt = adapter_->params_.retry_cnt;
+ attr.timeout = 14;
+ attr.retry_cnt = 7;
attr.rnr_retry = 7; /* infinite */
attr.max_rd_atomic = 1;
- CHECK(!(r = ibv_modify_qp(qp_, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
- IBV_QP_RETRY_CNT |
- IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
- IBV_QP_MAX_QP_RD_ATOMIC)))
+ CHECK(!(r = ibv_modify_qp(qp_, &attr,
+ IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
+ IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
+ IBV_QP_MAX_QP_RD_ATOMIC)))
<< "QP to Ready to Send " << r;
connected_ = true;
@@ -931,7 +604,7 @@ void RdmaBuffer::Write(uint32_t imm_data, size_t buffer_size) {
struct ibv_send_wr wr;
memset(&wr, 0, sizeof(wr));
- wr.wr_id = (uint64_t) this;
+ wr.wr_id = (uint64_t)this;
wr.sg_list = &list;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
@@ -1026,9 +699,9 @@ Rendezvous::DoneCallback RdmaTensorBuffer::getRecvTensorCallback(
TensorProto proto;
if (src_dev->tensorflow_gpu_device_info() &&
(!send_args.alloc_attrs.on_host())) {
- CHECK(send_args.device_context) << "send dev name: " << src_dev->name()
- << " gpu_info: "
- << src_dev->tensorflow_gpu_device_info();
+ CHECK(send_args.device_context)
+ << "send dev name: " << src_dev->name()
+ << " gpu_info: " << src_dev->tensorflow_gpu_device_info();
if (can_memcpy) {
AllocatorAttributes host_alloc_attrs;
@@ -1054,8 +727,8 @@ Rendezvous::DoneCallback RdmaTensorBuffer::getRecvTensorCallback(
// async instead
GPUUtil::SetProtoFromGPU(
in, src_dev, send_args.device_context, &proto, is_dead,
- [this, proto, buffer_size, key, in, step_id, key_with_step_id,
- is_dead, send_args, recv_args](const Status& s) mutable {
+ [this, proto, buffer_size, key, in, step_id, key_with_step_id,
+ is_dead, send_args, recv_args](const Status& s) mutable {
CHECK(s.ok()) << "copy proto from gpu sync";
auto tensor_bytes = proto.ByteSize();
buffer_size += tensor_bytes;
diff --git a/tensorflow/contrib/verbs/rdma.h b/tensorflow/contrib/verbs/rdma.h
index 52d92a7c5b..e1e07db776 100644
--- a/tensorflow/contrib/verbs/rdma.h
+++ b/tensorflow/contrib/verbs/rdma.h
@@ -36,24 +36,7 @@ limitations under the License.
#include "tensorflow/core/platform/mutex.h"
namespace tensorflow {
-#define PKEY_DEFAULT 0
-#define QUEUE_DEPTH_DEFAULT 1024
-#define TIMEOUT_DEFAULT 14
-#define RETRY_CNT_DEFAULT 7
-#define SL_DEFAULT 0
-#define TRAFFIC_CLASS 0
-
-struct RdmaParams {
- uint8_t port_num;
- uint8_t sgid_index;
- uint8_t pkey_index;
- uint32_t queue_depth;
- uint8_t timeout;
- uint8_t retry_cnt;
- uint8_t sl;
- enum ibv_mtu mtu;
- uint8_t traffic_class;
-};
+
// structure to save the address of remote channels.
struct RdmaAddress {
uint32_t lid;
@@ -67,20 +50,9 @@ struct RemoteMR {
uint64_t remote_addr;
uint32_t rkey;
};
-enum BufferStatus {
- none,
- idle,
- busy
-};
-enum Location {
- local,
- remote
-};
-enum BufferType {
- ACK,
- MESSAGE,
- TENSOR
-};
+enum BufferStatus { none, idle, busy };
+enum Location { local, remote };
+enum BufferType { ACK, MESSAGE, TENSOR };
enum RdmaMessageType {
RDMA_MESSAGE_ACK,
RDMA_MESSAGE_BUFFER_IDLE,
@@ -112,8 +84,6 @@ class RdmaAdapter {
protected:
static const int MAX_CONCURRENT_WRITES = 1000;
ibv_context* context_;
- // RDMA configuration parameters
- RdmaParams params_;
// ibverbs protection domain
ibv_pd* pd_;
// Completion event channel, to wait for work completions
@@ -213,7 +183,7 @@ class RdmaBuffer {
}
void FreeBuffer();
void EnqueueItem(string Item);
- virtual void SendNextItem() {};
+ virtual void SendNextItem(){};
void CreateCPUBuffer(size_t size, bool lock = true);
void SetRemoteMR(RemoteMR rmi, bool override);
uint32_t LookupBufferIndex(const string& buffer_name) {
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index 30ff4ef358..d71f314e11 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -2710,7 +2710,6 @@ tf_cc_test_mkl(
srcs = [
"graph/mkl_layout_pass_test.cc",
"graph/mkl_tfconversion_pass_test.cc",
- "util/mkl_util_test.cc",
],
linkstatic = 1,
deps = [
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.h b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
index 63b74e8dbf..53e80b1ee3 100644
--- a/tensorflow/core/common_runtime/mkl_cpu_allocator.h
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
@@ -81,7 +81,7 @@ class MklCPUAllocator : public Allocator {
}
#if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
if (user_val > max_mem_bytes) {
- LOG(WARNING) << "The user specified a memory limit " << kMaxLimitStr
+ LOG(WARNING) << "The user specifed a memory limit " << kMaxLimitStr
<< "=" << user_val
<< " greater than available physical memory: "
<< max_mem_bytes
diff --git a/tensorflow/core/common_runtime/sycl/sycl_device.h b/tensorflow/core/common_runtime/sycl/sycl_device.h
index cc272d156e..9caa076c72 100644
--- a/tensorflow/core/common_runtime/sycl/sycl_device.h
+++ b/tensorflow/core/common_runtime/sycl/sycl_device.h
@@ -46,8 +46,8 @@ class GSYCLInterface {
if (!found_device) {
// Currently Intel GPU is not supported
- LOG(WARNING) << "No OpenCL GPU found that is supported by "
- << "ComputeCpp/triSYCL, trying OpenCL CPU";
+ LOG(WARNING) << "No OpenCL GPU found that is supported by ComputeCpp, "
+ "trying OpenCL CPU";
}
for (const auto& device : device_list) {
@@ -59,23 +59,9 @@ class GSYCLInterface {
}
if (!found_device) {
- LOG(WARNING) << "No OpenCL CPU found that is supported by "
- << "ComputeCpp/triSYCL, checking for host sycl device";
- }
-
- for (const auto& device : device_list) {
- // triSYCL only supports the host device for now
- if (device.is_host()) {
- LOG(WARNING) << "Found SYCL host device";
- AddDevice(device);
- found_device = true;
- }
- }
-
- if (!found_device) {
// Currently Intel GPU is not supported
- LOG(FATAL) << "No SYCL host and no OpenCL GPU nor CPU"
- << " supported by ComputeCPP/triSYCL was found";
+ LOG(FATAL)
+ << "No OpenCL GPU nor CPU found that is supported by ComputeCpp";
} else {
LOG(INFO) << "Found following OpenCL devices:";
for (int i = 0; i < device_list.size(); i++) {
diff --git a/tensorflow/core/graph/graph.cc b/tensorflow/core/graph/graph.cc
index fd1b5d33b9..87c41186d5 100644
--- a/tensorflow/core/graph/graph.cc
+++ b/tensorflow/core/graph/graph.cc
@@ -453,21 +453,6 @@ const Edge* Graph::AddControlEdge(Node* source, Node* dest,
return AddEdge(source, kControlSlot, dest, kControlSlot);
}
-void Graph::RemoveControlEdge(const Edge* e) {
- if (!e->src_->IsSource() && !e->dst_->IsSink()) {
- e->dst_->MaybeCopyOnWrite();
- std::string e_src_name = strings::StrCat("^", e->src_->name());
- auto* inputs = e->dst_->props_->node_def.mutable_input();
- for (auto it = inputs->begin(); it != inputs->end(); ++it) {
- if (*it == e_src_name) {
- inputs->erase(it);
- break;
- }
- }
- }
- RemoveEdge(e);
-}
-
Status Graph::UpdateEdge(Node* new_src, int new_src_index, Node* dst,
int dst_index) {
TF_RETURN_IF_ERROR(IsValidOutputTensor(new_src, new_src_index));
diff --git a/tensorflow/core/graph/graph.h b/tensorflow/core/graph/graph.h
index d0dba6e1f0..c5dde722fa 100644
--- a/tensorflow/core/graph/graph.h
+++ b/tensorflow/core/graph/graph.h
@@ -451,11 +451,6 @@ class Graph {
// REQUIRES: The edge must exist.
void RemoveEdge(const Edge* edge);
- // Removes control edge `edge` from the graph. Note that this also updates
- // the corresponding NodeDef to reflect the change.
- // REQUIRES: The control edge must exist.
- void RemoveControlEdge(const Edge* e);
-
// Updates the input to a node. The existing edge to `dst` is removed and an
// edge from `new_src` to `dst` is created. The NodeDef associated with `dst`
// is also updated.
diff --git a/tensorflow/core/graph/graph_partition.cc b/tensorflow/core/graph/graph_partition.cc
index 1924c05d3d..b9e3cba035 100644
--- a/tensorflow/core/graph/graph_partition.cc
+++ b/tensorflow/core/graph/graph_partition.cc
@@ -117,7 +117,7 @@ DataType EdgeType(const Edge* e) {
}
}
-// Return true iff we need to add the same device send/recv for 'edge'.
+// Return true iff we need to add a same-device send/recv for 'edge'.
bool NeedSameDeviceSendRecv(const Edge* edge, const GraphInfo& info) {
if (edge->IsControlEdge()) {
return false;
@@ -1116,7 +1116,7 @@ Status Partition(const PartitionOptions& opts, Graph* g,
// before the data is available.
AddInput(real_recv, send->name(), Graph::kControlSlot);
} else if (control_flow_edge != nullptr) {
- // Redirect control edge to the real recv since this is not the same
+ // Redirect control edge to the real recv since this is not a same-
// device send/recv.
--num_control_flow_edges;
AddInput(real_recv, control_flow_edge->src()->name(),
diff --git a/tensorflow/core/graph/graph_test.cc b/tensorflow/core/graph/graph_test.cc
index 2aa1b31e15..7686cef219 100644
--- a/tensorflow/core/graph/graph_test.cc
+++ b/tensorflow/core/graph/graph_test.cc
@@ -118,25 +118,6 @@ class GraphTest : public ::testing::Test {
LOG(FATAL) << name;
}
- bool ControlEdgeExistsInGraphOrNodeDef(const Node* src,
- const Node* dst) {
- for (const Edge *e : dst->in_edges()) {
- if (e->IsControlEdge() &&
- e->src() == src &&
- e->src_output() == Graph::kControlSlot &&
- e->dst_input() == Graph::kControlSlot) {
- return true;
- }
- }
- std::string control_edge_name = strings::StrCat("^", src->name());
- for (int i = 0; i < dst->def().input_size(); ++i) {
- if (dst->def().input(i) == control_edge_name) {
- return true;
- }
- }
- return false;
- }
-
Graph graph_;
private:
@@ -477,8 +458,8 @@ TEST_F(GraphTest, AddControlEdge) {
EXPECT_TRUE(edge == nullptr);
EXPECT_EQ(b->def().input_size(), 2);
- // Can add redundant control edge with allow_duplicates.
- edge = graph_.AddControlEdge(a, b, /*allow_duplicates=*/true);
+ // Can add redundant control edge with create_duplicate.
+ edge = graph_.AddControlEdge(a, b, /*create_duplicate=*/true);
EXPECT_TRUE(edge != nullptr);
// create_duplicate causes the NodeDef not to be updated.
ASSERT_EQ(b->def().input_size(), 2);
@@ -496,47 +477,6 @@ TEST_F(GraphTest, AddControlEdge) {
EXPECT_EQ(b->def().input_size(), 2);
}
-TEST_F(GraphTest, RemoveControlEdge) {
- FromGraphDef(
- "node { name: 'A' op: 'OneOutput' }"
- "node { name: 'B' op: 'OneInputTwoOutputs' input: [ 'A:0' ] }"
- "node { name: 'C' op: 'NoOp' } ");
- Node* a = FindNode("A");
- Node* b = FindNode("B");
- Node* c = FindNode("C");
-
- // Add a control edge.
- const Edge* edge_1 = graph_.AddControlEdge(c, a);
- const Edge* edge_2 = graph_.AddControlEdge(a, b);
- ASSERT_TRUE(edge_1 != nullptr);
- ASSERT_TRUE(edge_2 != nullptr);
-
- ASSERT_TRUE(ControlEdgeExistsInGraphOrNodeDef(c, a));
- ASSERT_TRUE(ControlEdgeExistsInGraphOrNodeDef(a, b));
-
- graph_.RemoveControlEdge(edge_1);
- ASSERT_TRUE(!ControlEdgeExistsInGraphOrNodeDef(c, a));
- ASSERT_TRUE(ControlEdgeExistsInGraphOrNodeDef(a, b));
-
- graph_.RemoveControlEdge(edge_2);
- ASSERT_TRUE(!ControlEdgeExistsInGraphOrNodeDef(c, a));
- ASSERT_TRUE(!ControlEdgeExistsInGraphOrNodeDef(a, b));
-
- // Test removing a duplicate control edge.
- // Note that unless allow_duplicates is true, the duplicate edge
- // will not be added. That's why we expect edge_4 to be a null
- // pointer. We are not testing with allow_duplicates set to true,
- // as that is a highly unlikely use case that does not make much
- // sense.
- const Edge* edge_3 = graph_.AddControlEdge(c, a);
- const Edge* edge_4 = graph_.AddControlEdge(c, a);
- ASSERT_TRUE(edge_3 != nullptr);
- ASSERT_TRUE(edge_4 == nullptr);
-
- graph_.RemoveControlEdge(edge_3);
- ASSERT_TRUE(!ControlEdgeExistsInGraphOrNodeDef(c, a));
-}
-
TEST_F(GraphTest, UpdateEdge) {
// Build a little graph
Node* a = FromNodeDef("A", "OneOutput", 0);
diff --git a/tensorflow/core/graph/mkl_graph_util.h b/tensorflow/core/graph/mkl_graph_util.h
index 880e4e712e..cb32d64334 100644
--- a/tensorflow/core/graph/mkl_graph_util.h
+++ b/tensorflow/core/graph/mkl_graph_util.h
@@ -21,108 +21,107 @@ limitations under the License.
#include "tensorflow/core/framework/op_kernel.h"
namespace tensorflow {
- // Since our ops are going to produce and also consume N addition tensors
- // (Mkl) for N Tensorflow tensors, we can have following different
- // orderings among these 2N tensors.
- //
- // E.g., for Tensorflow tensors A, B, and C, our ops will produce and
- // consume A_m, B_m, and C_m additionally.
- //
- // INTERLEAVED: in this case 2N tensors are interleaved. So for above
- // example, the ordering looks like: A, A_m, B, B_m, C, C_m.
- //
- // CONTIGUOUS: in thi case N Tensorflow tensors are contiguous followed
- // by N Mkl tensors. So for above example, the ordering looks
- // like: A, B, C, A_m, B_m, C_m
- //
- // Following APIs map index of original Tensorflow tensors to their
- // appropriate position based on selected ordering. For contiguous ordering,
- // we need to know the total number of tensors (parameter total).
- //
- typedef enum { TENSORS_INTERLEAVED, TENSORS_CONTIGUOUS } MklTfTensorOrdering;
- // NOTE: Currently, we use contiguous ordering. If you change this, then you
- // would need to change Mkl op definitions in nn_ops.cc.
- static MklTfTensorOrdering kTensorOrdering = TENSORS_CONTIGUOUS;
+// Since our ops are going to produce and also consume N additional tensors
+// (Mkl) for N Tensorflow tensors, we can have the following different
+// orderings among these 2N tensors.
+//
+// E.g., for Tensorflow tensors A, B, and C, our ops will produce and
+// consume A_m, B_m, and C_m additionally.
+//
+// INTERLEAVED: in this case 2N tensors are interleaved. So for above
+// example, the ordering looks like: A, A_m, B, B_m, C, C_m.
+//
+// CONTIGUOUS: in this case N Tensorflow tensors are contiguous followed
+// by N Mkl tensors. So for above example, the ordering looks
+// like: A, B, C, A_m, B_m, C_m
+//
+// The following APIs map the index of an original Tensorflow tensor to its
+// appropriate position based on the selected ordering. For contiguous
+// ordering, we need to know the total number of tensors (parameter total).
+//
+typedef enum { TENSORS_INTERLEAVED, TENSORS_CONTIGUOUS } MklTfTensorOrdering;
+// NOTE: Currently, we use contiguous ordering. If you change this, then you
+// would need to change Mkl op definitions in nn_ops.cc.
+static MklTfTensorOrdering kTensorOrdering = TENSORS_CONTIGUOUS;
- // Get index of MetaData tensor from index 'n' of Data tensor.
- inline int DataIndexToMetaDataIndex(int n, int total_tensors) {
- if (kTensorOrdering == MklTfTensorOrdering::TENSORS_INTERLEAVED) {
- // For interleaved ordering, Mkl tensor follows immediately after
- // Tensorflow tensor.
- return n + 1;
- } else {
- CHECK_EQ(kTensorOrdering, MklTfTensorOrdering::TENSORS_CONTIGUOUS);
- // For contiguous ordering, Mkl tensor is n+total_tensors / 2 away.
- return n + total_tensors / 2;
- }
+// Get index of MetaData tensor from index 'n' of Data tensor.
+inline int DataIndexToMetaDataIndex(int n, int total_tensors) {
+ if (kTensorOrdering == MklTfTensorOrdering::TENSORS_INTERLEAVED) {
+ // For interleaved ordering, Mkl tensor follows immediately after
+ // Tensorflow tensor.
+ return n + 1;
+ } else {
+ CHECK_EQ(kTensorOrdering, MklTfTensorOrdering::TENSORS_CONTIGUOUS);
+ // For contiguous ordering, Mkl tensor is n + total_tensors/2 away.
+ return n + total_tensors / 2;
}
+}
- int inline GetTensorDataIndex(int n, int total_tensors) {
- if (kTensorOrdering == MklTfTensorOrdering::TENSORS_INTERLEAVED) {
- return 2 * n; // index corresponding to nth input/output tensor
- } else {
- CHECK_EQ(kTensorOrdering, MklTfTensorOrdering::TENSORS_CONTIGUOUS);
- return n;
- }
- }
+int inline GetTensorDataIndex(int n, int total_tensors) {
+ if (kTensorOrdering == MklTfTensorOrdering::TENSORS_INTERLEAVED) {
+ return 2 * n; // index corresponding to nth input/output tensor
+ } else {
+ CHECK_EQ(kTensorOrdering, MklTfTensorOrdering::TENSORS_CONTIGUOUS);
+ return n;
+ }
+}
- int inline GetTensorMetaDataIndex(int n, int total_tensors) {
- // Get index for TensorData first and then use mapping function
- // to get TensorMetaData index from TensorData index.
- int tidx = GetTensorDataIndex(n, total_tensors);
- return DataIndexToMetaDataIndex(tidx, total_tensors);
- }
+int inline GetTensorMetaDataIndex(int n, int total_tensors) {
+ // Get index for TensorData first and then use mapping function
+ // to get TensorMetaData index from TensorData index.
+ int tidx = GetTensorDataIndex(n, total_tensors);
+ return DataIndexToMetaDataIndex(tidx, total_tensors);
+}
namespace mkl_op_registry {
- static const char* kMklOpLabel = "MklOp";
- static const char* kMklOpLabelPattern = "label='MklOp'";
+static const char* kMklOpLabel = "MklOp";
+static const char* kMklOpLabelPattern = "label='MklOp'";
- // Get the name of Mkl op from original TensorFlow op
- // We prefix 'Mkl' to the original op to get Mkl op.
- inline string GetMklOpName(const string& name) {
- // Prefix that we add to Tensorflow op name to construct Mkl op name.
- const char* const kMklOpPrefix = "_Mkl";
- return string(kMklOpPrefix) + name;
- }
+// Get the name of Mkl op from original TensorFlow op
+// We prefix 'Mkl' to the original op to get Mkl op.
+inline string GetMklOpName(const string& name) {
+ // Prefix that we add to Tensorflow op name to construct Mkl op name.
+ const char* const kMklOpPrefix = "_Mkl";
+ return string(kMklOpPrefix) + name;
+}
- // Check whether opname with type T is registered as MKL-compliant.
- //
- // @input: name of the op
- // @input: T datatype to be used for checking op
- // @return: true if opname is registered as Mkl op; false otherwise
- static inline bool IsMklOp(const std::string& op_name, DataType T) {
- string kernel = KernelsRegisteredForOp(op_name);
- bool result =
- kernel.find(kMklOpLabelPattern) != string::npos && (T == DT_FLOAT);
- if (result) {
- VLOG(1) << "mkl_op_registry::" << op_name << " is " << kMklOpLabel;
- }
- return result;
+// Check whether opname with type T is registered as MKL-compliant.
+//
+// @input: name of the op
+// @input: T datatype to be used for checking op
+// @return: true if opname is registered as Mkl op; false otherwise
+static inline bool IsMklOp(const std::string& op_name, DataType T) {
+ string kernel = KernelsRegisteredForOp(op_name);
+ bool result =
+ kernel.find(kMklOpLabelPattern) != string::npos && (T == DT_FLOAT);
+ if (result) {
+ VLOG(1) << "mkl_op_registry::" << op_name << " is " << kMklOpLabel;
}
+ return result;
+}
- // Check whether opname with type T is registered as MKL-compliant and
- // is element-wise.
- //
- // @input: name of the op
- // @input: T datatype to be used for checking op
- // @return: true if opname is registered as element-wise Mkl op;
- // false otherwise
- static inline bool IsMklElementWiseOp(const std::string& op_name,
- DataType T) {
- if (!IsMklOp(op_name, T)) {
- return false;
- }
+// Check whether opname with type T is registered as MKL-compliant and
+// is element-wise.
+//
+// @input: name of the op
+// @input: T datatype to be used for checking op
+// @return: true if opname is registered as element-wise Mkl op;
+// false otherwise
+static inline bool IsMklElementWiseOp(const std::string& op_name, DataType T) {
+ if (!IsMklOp(op_name, T)) {
+ return false;
+ }
- bool result = (0 == op_name.compare(GetMklOpName("Add")) ||
- 0 == op_name.compare(GetMklOpName("Sub")) ||
- 0 == op_name.compare(GetMklOpName("Mul")) ||
- 0 == op_name.compare(GetMklOpName("Maximum")) ||
- 0 == op_name.compare(GetMklOpName("SquaredDifference")));
+ bool result = (0 == op_name.compare(GetMklOpName("Add")) ||
+ 0 == op_name.compare(GetMklOpName("Sub")) ||
+ 0 == op_name.compare(GetMklOpName("Mul")) ||
+ 0 == op_name.compare(GetMklOpName("Maximum")) ||
+ 0 == op_name.compare(GetMklOpName("SquaredDifference")));
- VLOG(1) << "mkl_op_registry::" << op_name
- << " is elementwise MKL op: " << result;
- return result;
- }
+ VLOG(1) << "mkl_op_registry::" << op_name
+ << " is elementwise MKL op: " << result;
+ return result;
+}
} // namespace mkl_op_registry
} // namespace tensorflow
#endif // INTEL_MKL
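To make the two orderings concrete, a small Python sketch of the index arithmetic implemented by the helpers above; the function names only mirror the C++ ones, this is an illustration rather than the shipped code:

```python
def data_index(n, total_tensors, interleaved=False):
    # Position of the nth TensorFlow data tensor among all 2N slots.
    return 2 * n if interleaved else n

def metadata_index(n, total_tensors, interleaved=False):
    # Position of the Mkl metadata tensor paired with data tensor n.
    tidx = data_index(n, total_tensors, interleaved)
    return tidx + 1 if interleaved else tidx + total_tensors // 2

# Tensors A, B, C plus A_m, B_m, C_m (total of 6 slots).
# Contiguous:  A, B, C, A_m, B_m, C_m  -> B_m sits at index 4.
assert metadata_index(1, 6) == 4
# Interleaved: A, A_m, B, B_m, C, C_m  -> B_m sits at index 3.
assert metadata_index(1, 6, interleaved=True) == 3
```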
diff --git a/tensorflow/core/graph/mkl_layout_pass.cc b/tensorflow/core/graph/mkl_layout_pass.cc
index 912075aa28..f4c9073dee 100644
--- a/tensorflow/core/graph/mkl_layout_pass.cc
+++ b/tensorflow/core/graph/mkl_layout_pass.cc
@@ -37,8 +37,8 @@ limitations under the License.
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/util/tensor_format.h"
-#include "tensorflow/core/graph/mkl_layout_pass.h"
#include "tensorflow/core/graph/mkl_graph_util.h"
+#include "tensorflow/core/graph/mkl_layout_pass.h"
namespace tensorflow {
diff --git a/tensorflow/core/graph/mkl_tfconversion_pass.cc b/tensorflow/core/graph/mkl_tfconversion_pass.cc
index 599bb88f01..fe4588389e 100644
--- a/tensorflow/core/graph/mkl_tfconversion_pass.cc
+++ b/tensorflow/core/graph/mkl_tfconversion_pass.cc
@@ -33,8 +33,8 @@ limitations under the License.
#include "tensorflow/core/lib/hash/hash.h"
#include "tensorflow/core/platform/logging.h"
-#include "tensorflow/core/graph/mkl_tfconversion_pass.h"
#include "tensorflow/core/graph/mkl_graph_util.h"
+#include "tensorflow/core/graph/mkl_tfconversion_pass.h"
namespace tensorflow {
@@ -68,7 +68,7 @@ namespace tensorflow {
// take place before we hit the op. For this, we add a new op before each
// element-wise MKL op to deal with the inputs, called _MklInputConversion.
// This pass has been enhanced to add this capability.
-//
+//
// The _MklInputConversion op will check the inputs to the elementwise op and
// make sure that either both are in MKL format or both are in TF format,
// depending on their initial state and whether broadcast is needed or not.
diff --git a/tensorflow/core/grappler/costs/graph_properties.h b/tensorflow/core/grappler/costs/graph_properties.h
index 239b5ac244..ee279b7e0a 100644
--- a/tensorflow/core/grappler/costs/graph_properties.h
+++ b/tensorflow/core/grappler/costs/graph_properties.h
@@ -58,12 +58,6 @@ class GraphProperties {
const std::vector<OpInfo::TensorProperties>& GetOutputProperties(
const string& node_name) const;
- static void FillTensorPropertiesFromContext(
- const shape_inference::ShapeHandle&, const DataType&,
- shape_inference::InferenceContext*,
- std::unordered_map<const shape_inference::Dimension*, int>* dim_ids,
- OpInfo::TensorProperties*);
-
private:
// Inputs
GrapplerItem item_;
diff --git a/tensorflow/core/grappler/utils.cc b/tensorflow/core/grappler/utils.cc
index 9ab889beb5..7fd1876371 100644
--- a/tensorflow/core/grappler/utils.cc
+++ b/tensorflow/core/grappler/utils.cc
@@ -62,7 +62,7 @@ const std::set<NodeDef*>& NodeMap::GetOutputs(const string& node_name) const {
void NodeMap::AddNode(const string& name, NodeDef* node) {
auto ret = nodes_.insert(std::make_pair(name, node));
CHECK(ret.second) << "Pair (" << name << "," << node
- << ") is not inserted because the same key already exists.";
+ << ") is not inserted because a same key already exists.";
}
void NodeMap::AddOutput(const string& node_name, const string& output_name) {
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index f1cb9a1860..d7b457eab7 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -930,25 +930,6 @@ tf_cc_test(
)
tf_cuda_cc_test(
- name = "bincount_op_test",
- size = "small",
- srcs = ["bincount_op_test.cc"],
- deps = [
- ":bincount_op",
- ":ops_testutil",
- ":ops_util",
- "//tensorflow/core:core_cpu",
- "//tensorflow/core:framework",
- "//tensorflow/core:lib",
- "//tensorflow/core:math_ops_op_lib",
- "//tensorflow/core:protos_all_cc",
- "//tensorflow/core:test",
- "//tensorflow/core:test_main",
- "//tensorflow/core:testlib",
- ],
-)
-
-tf_cuda_cc_test(
name = "constant_op_test",
size = "small",
srcs = ["constant_op_test.cc"],
@@ -1636,10 +1617,7 @@ DYNAMIC_DEPS = [
tf_kernel_library(
name = "dynamic_partition_op",
prefix = "dynamic_partition_op",
- deps = DYNAMIC_DEPS + [
- ":fill_functor",
- ":gather_functor",
- ] + if_cuda(["@cub_archive//:cub"]),
+ deps = DYNAMIC_DEPS,
)
tf_kernel_library(
@@ -1709,7 +1687,7 @@ tf_kernel_library(
],
)
-tf_cuda_cc_tests(
+tf_cc_tests(
name = "dynamic_op_test",
size = "small",
srcs = [
@@ -2594,9 +2572,8 @@ tf_kernel_library(
tf_kernel_library(
name = "bucketize_op",
- gpu_srcs = ["cuda_device_array.h"],
prefix = "bucketize_op",
- deps = ARRAY_DEPS,
+ deps = MATH_DEPS,
)
tf_kernel_library(
@@ -3197,7 +3174,7 @@ tf_kernel_library(
"//tensorflow/core:lib",
"//tensorflow/core:lib_internal",
"//third_party/eigen3",
- ] + if_cuda(["@cub_archive//:cub"]),
+ ],
)
tf_kernel_library(
diff --git a/tensorflow/core/kernels/avgpooling_op.cc b/tensorflow/core/kernels/avgpooling_op.cc
index f918023693..af629d0de8 100644
--- a/tensorflow/core/kernels/avgpooling_op.cc
+++ b/tensorflow/core/kernels/avgpooling_op.cc
@@ -153,8 +153,7 @@ class AvgPoolingOp<GPUDevice, T> : public UnaryOp<T> {
if (data_format_ == FORMAT_NCHW) {
DnnPoolingOp<T>::Compute(
context, perftools::gputools::dnn::PoolingMode::kAverage, ksize_,
- stride_, padding_, data_format_, tensor_in, output_shape,
- /*propagate_nans=*/false);
+ stride_, padding_, data_format_, tensor_in, output_shape);
} else {
Tensor* output = nullptr;
OP_REQUIRES_OK(context,
@@ -409,7 +408,7 @@ class AvgPoolingGradOp<GPUDevice, T> : public OpKernel {
DnnPoolingGradOp<T>::Compute(
context, perftools::gputools::dnn::PoolingMode::kAverage, ksize_,
stride_, padding_, data_format_, nullptr, nullptr, out_backprop,
- output_shape, /*propagate_nans=*/false);
+ output_shape);
}
private:
@@ -533,7 +532,7 @@ class AvgPoolingGradOpCustomGPUKernel : public OpKernel {
DnnPoolingGradOp<T>::Compute(
context, perftools::gputools::dnn::PoolingMode::kAverage, ksize_,
stride_, padding_, data_format_, nullptr, nullptr, out_backprop,
- output_shape, /*propagate_nans=*/false);
+ output_shape);
}
}
diff --git a/tensorflow/core/kernels/bincount_op.cc b/tensorflow/core/kernels/bincount_op.cc
index 766d63e3be..1cd5943ef3 100644
--- a/tensorflow/core/kernels/bincount_op.cc
+++ b/tensorflow/core/kernels/bincount_op.cc
@@ -17,7 +17,6 @@ limitations under the License.
#define EIGEN_USE_THREADS
-#include "tensorflow/core/kernels/bincount_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/types.h"
@@ -28,37 +27,46 @@ namespace tensorflow {
using thread::ThreadPool;
-typedef Eigen::ThreadPoolDevice CPUDevice;
-typedef Eigen::GpuDevice GPUDevice;
-
-namespace functor {
-
template <typename T>
-struct BincountFunctor<CPUDevice, T> {
- static Status Compute(OpKernelContext* context,
- const typename TTypes<int32, 1>::ConstTensor& arr,
- const typename TTypes<T, 1>::ConstTensor& weights,
- typename TTypes<T, 1>::Tensor& output) {
- int size = output.size();
+class BincountOp : public OpKernel {
+ public:
+ explicit BincountOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ void Compute(OpKernelContext* ctx) override {
+ const Tensor& arr_t = ctx->input(0);
+ const Tensor& size_tensor = ctx->input(1);
+ const Tensor& weights_t = ctx->input(2);
+ int32 size = size_tensor.scalar<int32>()();
+ OP_REQUIRES(
+ ctx, size >= 0,
+ errors::InvalidArgument("size (", size, ") must be non-negative"));
+ const bool has_weights = weights_t.NumElements() > 0;
+ OP_REQUIRES(ctx, !(has_weights && arr_t.shape() != weights_t.shape()),
+ errors::InvalidArgument(
+ "If weights are passed, they must have the same shape (" +
+ weights_t.shape().DebugString() + ") as arr (" +
+ arr_t.shape().DebugString() + ")"));
+ const auto arr = arr_t.flat<int32>();
+ const auto weights = weights_t.flat<T>();
Tensor all_nonneg_t;
- TF_RETURN_IF_ERROR(context->allocate_temp(
- DT_BOOL, TensorShape({}), &all_nonneg_t, AllocatorAttributes()));
- all_nonneg_t.scalar<bool>().device(context->eigen_cpu_device()) =
+ OP_REQUIRES_OK(ctx,
+ ctx->allocate_temp(DT_BOOL, TensorShape({}), &all_nonneg_t,
+ AllocatorAttributes()));
+ all_nonneg_t.scalar<bool>().device(ctx->eigen_cpu_device()) =
(arr >= 0).all();
- if (!all_nonneg_t.scalar<bool>()()) {
- return errors::InvalidArgument("Input arr must be non-negative!");
- }
+ OP_REQUIRES(ctx, all_nonneg_t.scalar<bool>()(),
+ errors::InvalidArgument("Input arr must be non-negative!"));
// Allocate partial output bin sums for each worker thread. Worker ids in
// ParallelForWithWorkerId range from 0 to NumThreads() inclusive.
ThreadPool* thread_pool =
- context->device()->tensorflow_cpu_worker_threads()->workers;
+ ctx->device()->tensorflow_cpu_worker_threads()->workers;
const int64 num_threads = thread_pool->NumThreads() + 1;
Tensor partial_bins_t;
- TF_RETURN_IF_ERROR(context->allocate_temp(DataTypeToEnum<T>::value,
- TensorShape({num_threads, size}),
- &partial_bins_t));
+ OP_REQUIRES_OK(ctx, ctx->allocate_temp(weights_t.dtype(),
+ TensorShape({num_threads, size}),
+ &partial_bins_t));
auto partial_bins = partial_bins_t.matrix<T>();
partial_bins.setZero();
thread_pool->ParallelForWithWorkerId(
@@ -67,7 +75,7 @@ struct BincountFunctor<CPUDevice, T> {
for (int64 i = start_ind; i < limit_ind; i++) {
int32 value = arr(i);
if (value < size) {
- if (weights.size()) {
+ if (has_weights) {
partial_bins(worker_id, value) += weights(i);
} else {
// Complex numbers don't support "++".
@@ -76,62 +84,25 @@ struct BincountFunctor<CPUDevice, T> {
}
}
});
-
+ TensorShape output_shape({size});
+ Tensor* output_t;
+ OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output_t));
// Sum the partial bins along the 0th axis.
Eigen::array<int, 1> reduce_dims({0});
- output.device(context->eigen_cpu_device()) = partial_bins.sum(reduce_dims);
- return Status::OK();
- }
-};
-
-} // namespace functor
-
-template <typename Device, typename T>
-class BincountOp : public OpKernel {
- public:
- explicit BincountOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
-
- void Compute(OpKernelContext* ctx) override {
- const Tensor& arr_t = ctx->input(0);
- const Tensor& size_tensor = ctx->input(1);
- const Tensor& weights_t = ctx->input(2);
-
- int32 size = size_tensor.scalar<int32>()();
- OP_REQUIRES(ctx, size >= 0, errors::InvalidArgument(
- "size (", size, ") must be non-negative"));
-
- const auto arr = arr_t.flat<int32>();
- const auto weights = weights_t.flat<T>();
- Tensor* output_t;
- OP_REQUIRES_OK(ctx,
- ctx->allocate_output(0, TensorShape({size}), &output_t));
- auto output = output_t->flat<T>();
- OP_REQUIRES_OK(ctx, functor::BincountFunctor<Device, T>::Compute(
- ctx, arr, weights, output));
+ output_t->flat<T>().device(ctx->eigen_cpu_device()) =
+ partial_bins.sum(reduce_dims);
}
};
-#define REGISTER_KERNELS(type) \
+#define REGISTER(TYPE) \
REGISTER_KERNEL_BUILDER( \
- Name("Bincount").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
- BincountOp<CPUDevice, type>)
-
-TF_CALL_NUMBER_TYPES(REGISTER_KERNELS);
-#undef REGISTER_KERNELS
-
-#if GOOGLE_CUDA
-
-#define REGISTER_KERNELS(type) \
- REGISTER_KERNEL_BUILDER(Name("Bincount") \
- .Device(DEVICE_GPU) \
- .HostMemory("size") \
- .TypeConstraint<type>("T"), \
- BincountOp<GPUDevice, type>)
+ Name("Bincount").Device(DEVICE_CPU).TypeConstraint<TYPE>("T"), \
+ BincountOp<TYPE>)
-TF_CALL_int32(REGISTER_KERNELS);
-TF_CALL_float(REGISTER_KERNELS);
-#undef REGISTER_KERNELS
+TF_CALL_NUMBER_TYPES(REGISTER);
-#endif // GOOGLE_CUDA
+// TODO(ringwalt): Add a GPU implementation. We probably want to take a
+// different approach, e.g. threads in a warp each taking a pass over the same
+// data, and each thread summing a single bin.
} // end namespace tensorflow
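A quick sketch of the kernel's contract through the public `tf.bincount` wrapper (TF 1.x assumed): without weights it counts occurrences per bin; with weights it sums the matching weights per bin:

```python
import tensorflow as tf

arr = tf.constant([1, 1, 2, 5])
weights = tf.constant([0.5, 0.25, 1.0, 1.0])

with tf.Session() as sess:
    print(sess.run(tf.bincount(arr)))           # [0 2 1 0 0 1]
    print(sess.run(tf.bincount(arr, weights)))  # [0.  0.75 1.  0.  0.  1.]
```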
diff --git a/tensorflow/core/kernels/bincount_op.h b/tensorflow/core/kernels/bincount_op.h
deleted file mode 100644
index 0f8dd2b82a..0000000000
--- a/tensorflow/core/kernels/bincount_op.h
+++ /dev/null
@@ -1,41 +0,0 @@
-/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_BINCOUNT_OP_H_
-#define TENSORFLOW_BINCOUNT_OP_H_
-
-#include "tensorflow/core/framework/op_kernel.h"
-#include "tensorflow/core/framework/tensor_types.h"
-#include "tensorflow/core/framework/types.h"
-#include "tensorflow/core/lib/core/errors.h"
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-
-namespace tensorflow {
-
-namespace functor {
-
-template <typename Device, typename T>
-struct BincountFunctor {
- static Status Compute(OpKernelContext* context,
- const typename TTypes<int32, 1>::ConstTensor& arr,
- const typename TTypes<T, 1>::ConstTensor& weights,
- typename TTypes<T, 1>::Tensor& output);
-};
-
-} // end namespace functor
-
-} // end namespace tensorflow
-
-#endif // TENSORFLOW_BINCOUNT_OP_H_
diff --git a/tensorflow/core/kernels/bincount_op_gpu.cu.cc b/tensorflow/core/kernels/bincount_op_gpu.cu.cc
deleted file mode 100644
index ae9e26ffdf..0000000000
--- a/tensorflow/core/kernels/bincount_op_gpu.cu.cc
+++ /dev/null
@@ -1,114 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#if GOOGLE_CUDA
-
-#define EIGEN_USE_GPU
-
-#include "tensorflow/core/kernels/bincount_op.h"
-#include "external/cub_archive/cub/device/device_histogram.cuh"
-#include "tensorflow/core/framework/op_kernel.h"
-#include "tensorflow/core/framework/register_types.h"
-#include "tensorflow/core/framework/tensor.h"
-#include "tensorflow/core/framework/tensor_shape.h"
-#include "tensorflow/core/platform/logging.h"
-#include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/util/cuda_kernel_helper.h"
-
-namespace tensorflow {
-
-typedef Eigen::GpuDevice GPUDevice;
-
-namespace functor {
-
-template <typename T>
-struct BincountFunctor<GPUDevice, T> {
- static Status Compute(OpKernelContext* context,
- const typename TTypes<int32, 1>::ConstTensor& arr,
- const typename TTypes<T, 1>::ConstTensor& weights,
- typename TTypes<T, 1>::Tensor& output) {
- if (weights.size() != 0) {
- return errors::InvalidArgument(
- "Weights should not be passed as it should be "
- "handled by unsorted_segment_sum");
- }
- if (output.size() == 0) {
- return Status::OK();
- }
- // In case weight.size() == 0, use CUB
- size_t temp_storage_bytes = 0;
- const int32* d_samples = arr.data();
- T* d_histogram = output.data();
- int num_levels = output.size() + 1;
- int32 lower_level = 0;
- int32 upper_level = output.size();
- int num_samples = arr.size();
- const cudaStream_t& stream = GetCudaStream(context);
-
- // The first HistogramEven is to obtain the temp storage size required
- // with d_temp_storage = NULL passed to the call.
- auto err = cub::DeviceHistogram::HistogramEven(
- /* d_temp_storage */ NULL,
- /* temp_storage_bytes */ temp_storage_bytes,
- /* d_samples */ d_samples,
- /* d_histogram */ d_histogram,
- /* num_levels */ num_levels,
- /* lower_level */ lower_level,
- /* upper_level */ upper_level,
- /* num_samples */ num_samples,
- /* stream */ stream);
- if (err != cudaSuccess) {
- return errors::Internal(
- "Could not launch HistogramEven to get temp storage: ",
- cudaGetErrorString(err), ".");
- }
- Tensor temp_storage;
- TF_RETURN_IF_ERROR(context->allocate_temp(
- DataTypeToEnum<int8>::value,
- TensorShape({static_cast<int64>(temp_storage_bytes)}), &temp_storage));
-
- void* d_temp_storage = temp_storage.flat<int8>().data();
- // The second HistogramEven actually runs with d_temp_storage
- // allocated with temp_storage_bytes.
- err = cub::DeviceHistogram::HistogramEven(
- /* d_temp_storage */ d_temp_storage,
- /* temp_storage_bytes */ temp_storage_bytes,
- /* d_samples */ d_samples,
- /* d_histogram */ d_histogram,
- /* num_levels */ num_levels,
- /* lower_level */ lower_level,
- /* upper_level */ upper_level,
- /* num_samples */ num_samples,
- /* stream */ stream);
- if (err != cudaSuccess) {
- return errors::Internal("Could not launch HistogramEven: ",
- cudaGetErrorString(err), ".");
- }
- return Status::OK();
- }
-};
-
-} // end namespace functor
-
-#define REGISTER_GPU_SPEC(type) \
- template struct functor::BincountFunctor<GPUDevice, type>;
-
-TF_CALL_int32(REGISTER_GPU_SPEC);
-TF_CALL_float(REGISTER_GPU_SPEC);
-#undef REGISTER_GPU_SPEC
-
-} // namespace tensorflow
-
-#endif // GOOGLE_CUDA
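For reference, the deleted kernel's two-call CUB idiom (query the scratch size with a null temp-storage pointer, then run for real) as a minimal standalone sketch. Raw cudaMalloc stands in for context->allocate_temp, and the plain cub/ include path is an assumption, since TensorFlow vendored CUB under external/cub_archive.

#include <cuda_runtime.h>
#include <cub/device/device_histogram.cuh>

// Histogram of non-negative int samples into num_bins unit-width bins,
// mirroring the deleted BincountFunctor<GPUDevice, T> with T = float.
cudaError_t BincountSketch(const int* d_samples, int num_samples,
                           float* d_histogram, int num_bins,
                           cudaStream_t stream) {
  size_t temp_storage_bytes = 0;
  // First call: d_temp_storage == nullptr, so only temp_storage_bytes is set.
  cudaError_t err = cub::DeviceHistogram::HistogramEven(
      nullptr, temp_storage_bytes, d_samples, d_histogram,
      /* num_levels */ num_bins + 1, /* lower_level */ 0,
      /* upper_level */ num_bins, num_samples, stream);
  if (err != cudaSuccess) return err;
  void* d_temp_storage = nullptr;
  err = cudaMalloc(&d_temp_storage, temp_storage_bytes);
  if (err != cudaSuccess) return err;
  // Second call: same arguments, now with the allocated scratch space.
  err = cub::DeviceHistogram::HistogramEven(
      d_temp_storage, temp_storage_bytes, d_samples, d_histogram,
      num_bins + 1, 0, num_bins, num_samples, stream);
  cudaFree(d_temp_storage);
  return err;
}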
diff --git a/tensorflow/core/kernels/bincount_op_test.cc b/tensorflow/core/kernels/bincount_op_test.cc
deleted file mode 100644
index 14becc87a7..0000000000
--- a/tensorflow/core/kernels/bincount_op_test.cc
+++ /dev/null
@@ -1,75 +0,0 @@
-/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
-#include "tensorflow/core/framework/fake_input.h"
-#include "tensorflow/core/framework/node_def_builder.h"
-#include "tensorflow/core/framework/tensor.h"
-#include "tensorflow/core/graph/node_builder.h"
-#include "tensorflow/core/kernels/ops_testutil.h"
-#include "tensorflow/core/platform/test.h"
-#include "tensorflow/core/platform/test_benchmark.h"
-
-namespace tensorflow {
-
-static Graph* Bincount(int arr_size, int nbins) {
- Graph* g = new Graph(OpRegistry::Global());
-
- Tensor arr(DT_INT32, TensorShape({arr_size}));
- arr.flat<int32>() = arr.flat<int32>().setRandom().abs();
-
- Tensor size(DT_INT32, TensorShape({(int32)1}));
- size.flat<int32>()(0) = (int32)nbins;
-
- Tensor weights(DT_INT32, TensorShape({0}));
-
- Node* node;
- TF_CHECK_OK(NodeBuilder(g->NewName("n"), "Bincount")
- .Input(test::graph::Constant(g, arr))
- .Input(test::graph::Constant(g, size))
- .Input(test::graph::Constant(g, weights))
- .Attr("T", DT_INT32)
- .Finalize(g, &node));
- return g;
-}
-
-#define BM_BincountDev(K, NBINS, type) \
- static void BM_Bincount##_##type##_##K##_##NBINS(int iters) { \
- testing::ItemsProcessed(static_cast<int64>(iters) * K * 1024); \
- test::Benchmark(#type, Bincount(K * 1024, NBINS)).Run(iters); \
- } \
- BENCHMARK(BM_Bincount##_##type##_##K##_##NBINS);
-
-BM_BincountDev(32, 1000, cpu);
-BM_BincountDev(32, 2000, cpu);
-BM_BincountDev(32, 5000, cpu);
-BM_BincountDev(64, 1000, cpu);
-BM_BincountDev(64, 2000, cpu);
-BM_BincountDev(64, 5000, cpu);
-BM_BincountDev(128, 1000, cpu);
-BM_BincountDev(128, 2000, cpu);
-BM_BincountDev(128, 5000, cpu);
-
-BM_BincountDev(32, 1000, gpu);
-BM_BincountDev(32, 2000, gpu);
-BM_BincountDev(32, 5000, gpu);
-BM_BincountDev(64, 1000, gpu);
-BM_BincountDev(64, 2000, gpu);
-BM_BincountDev(64, 5000, gpu);
-BM_BincountDev(128, 1000, gpu);
-BM_BincountDev(128, 2000, gpu);
-BM_BincountDev(128, 5000, gpu);
-
-} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/bucketize_op.cc b/tensorflow/core/kernels/bucketize_op.cc
index c1693de538..93c2d01221 100644
--- a/tensorflow/core/kernels/bucketize_op.cc
+++ b/tensorflow/core/kernels/bucketize_op.cc
@@ -15,43 +15,15 @@ limitations under the License.
// See docs in ../ops/math_ops.cc.
-#include "tensorflow/core/kernels/bucketize_op.h"
+#include <algorithm>
+#include <vector>
+
#include "tensorflow/core/framework/op_kernel.h"
-#include "tensorflow/core/framework/register_types.h"
-#include "tensorflow/core/framework/tensor.h"
-#include "tensorflow/core/framework/tensor_shape.h"
-#include "tensorflow/core/platform/logging.h"
-#include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/lib/core/errors.h"
namespace tensorflow {
-using thread::ThreadPool;
-
-typedef Eigen::ThreadPoolDevice CPUDevice;
-typedef Eigen::GpuDevice GPUDevice;
-
-namespace functor {
-
template <typename T>
-struct BucketizeFunctor<CPUDevice, T> {
- // PRECONDITION: boundaries_vector must be sorted.
- static Status Compute(OpKernelContext* context,
- const typename TTypes<T, 1>::ConstTensor& input,
- const std::vector<float>& boundaries_vector,
- typename TTypes<int32, 1>::Tensor& output) {
- const int N = input.size();
- for (int i = 0; i < N; i++) {
- auto first_bigger_it = std::upper_bound(
- boundaries_vector.begin(), boundaries_vector.end(), input(i));
- output(i) = first_bigger_it - boundaries_vector.begin();
- }
-
- return Status::OK();
- }
-};
-} // namespace functor
-
-template <typename Device, typename T>
class BucketizeOp : public OpKernel {
public:
explicit BucketizeOp(OpKernelConstruction* context) : OpKernel(context) {
@@ -62,42 +34,36 @@ class BucketizeOp : public OpKernel {
void Compute(OpKernelContext* context) override {
const Tensor& input_tensor = context->input(0);
- const auto input = input_tensor.flat<T>();
-
+ auto input = input_tensor.flat<T>();
Tensor* output_tensor = nullptr;
OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
&output_tensor));
auto output = output_tensor->template flat<int32>();
- OP_REQUIRES_OK(context, functor::BucketizeFunctor<Device, T>::Compute(
- context, input, boundaries_, output));
+
+ const int N = input.size();
+ for (int i = 0; i < N; i++) {
+ output(i) = CalculateBucketIndex(input(i));
+ }
}
private:
+ int32 CalculateBucketIndex(const T value) {
+ auto first_bigger_it =
+ std::upper_bound(boundaries_.begin(), boundaries_.end(), value);
+ return first_bigger_it - boundaries_.begin();
+ }
std::vector<float> boundaries_;
};
#define REGISTER_KERNEL(T) \
REGISTER_KERNEL_BUILDER( \
Name("Bucketize").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
- BucketizeOp<CPUDevice, T>);
-
-REGISTER_KERNEL(int32);
-REGISTER_KERNEL(int64);
-REGISTER_KERNEL(float);
-REGISTER_KERNEL(double);
-#undef REGISTER_KERNEL
-
-#if GOOGLE_CUDA
-#define REGISTER_KERNEL(T) \
- REGISTER_KERNEL_BUILDER( \
- Name("Bucketize").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
- BucketizeOp<GPUDevice, T>);
+ BucketizeOp<T>);
REGISTER_KERNEL(int32);
REGISTER_KERNEL(int64);
REGISTER_KERNEL(float);
REGISTER_KERNEL(double);
#undef REGISTER_KERNEL
-#endif // GOOGLE_CUDA
} // namespace tensorflow
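The inlined CalculateBucketIndex keeps the rule from the deleted functor: with sorted boundaries, a value's bucket index is the number of boundaries less than or equal to it, which is exactly the offset std::upper_bound returns. A minimal host-side check (boundary and test values are made up):

#include <algorithm>
#include <cassert>
#include <vector>

int BucketIndex(const std::vector<float>& boundaries, float value) {
  // First boundary strictly greater than value; its offset is the bucket.
  return std::upper_bound(boundaries.begin(), boundaries.end(), value) -
         boundaries.begin();
}

int main() {
  const std::vector<float> boundaries = {0.f, 10.f, 100.f};
  assert(BucketIndex(boundaries, -5.f) == 0);   // below every boundary
  assert(BucketIndex(boundaries, 10.f) == 2);   // ties go to the right
  assert(BucketIndex(boundaries, 150.f) == 3);  // past the last boundary
  return 0;
}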
diff --git a/tensorflow/core/kernels/bucketize_op.h b/tensorflow/core/kernels/bucketize_op.h
deleted file mode 100644
index c8e461beb9..0000000000
--- a/tensorflow/core/kernels/bucketize_op.h
+++ /dev/null
@@ -1,41 +0,0 @@
-/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifndef TENSORFLOW_BUCKETIZE_OP_H_
-#define TENSORFLOW_BUCKETIZE_OP_H_
-
-#include <vector>
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-
-#include "tensorflow/core/framework/op_kernel.h"
-#include "tensorflow/core/framework/tensor_types.h"
-#include "tensorflow/core/framework/types.h"
-#include "tensorflow/core/lib/core/errors.h"
-
-namespace tensorflow {
-namespace functor {
-
-template <typename Device, typename T>
-struct BucketizeFunctor {
- static Status Compute(OpKernelContext* context,
- const typename TTypes<T, 1>::ConstTensor& input,
- const std::vector<float>& boundaries_vector,
- typename TTypes<int32, 1>::Tensor& output);
-};
-
-} // namespace functor
-} // namespace tensorflow
-
-#endif // TENSORFLOW_BUCKETIZE_OP_H_
diff --git a/tensorflow/core/kernels/bucketize_op_gpu.cu.cc b/tensorflow/core/kernels/bucketize_op_gpu.cu.cc
deleted file mode 100644
index aafbbe41b4..0000000000
--- a/tensorflow/core/kernels/bucketize_op_gpu.cu.cc
+++ /dev/null
@@ -1,101 +0,0 @@
-/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#if GOOGLE_CUDA
-
-#define EIGEN_USE_GPU
-
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-
-#include "tensorflow/core/framework/op_kernel.h"
-#include "tensorflow/core/framework/register_types.h"
-#include "tensorflow/core/framework/tensor.h"
-#include "tensorflow/core/framework/tensor_shape.h"
-#include "tensorflow/core/kernels/bucketize_op.h"
-#include "tensorflow/core/kernels/cuda_device_array.h"
-#include "tensorflow/core/platform/logging.h"
-#include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/util/cuda_kernel_helper.h"
-
-namespace tensorflow {
-
-typedef Eigen::GpuDevice GPUDevice;
-
-template <typename T>
-__global__ void BucketizeCustomKernel(
- const int32 size_in, const T* in, const int32 size_boundaries,
- CudaDeviceArrayStruct<float> boundaries_array, int32* out) {
- const float* boundaries = GetCudaDeviceArrayOnDevice(&boundaries_array);
- CUDA_1D_KERNEL_LOOP(i, size_in) {
- T value = in[i];
- int32 bucket = 0;
- int32 count = size_boundaries;
- while (count > 0) {
- int32 l = bucket;
- int32 step = count / 2;
- l += step;
- if (!(value < static_cast<T>(boundaries[l]))) {
- bucket = ++l;
- count -= step + 1;
- } else {
- count = step;
- }
- }
- out[i] = bucket;
- }
-}
-
-namespace functor {
-
-template <typename T>
-struct BucketizeFunctor<GPUDevice, T> {
- // PRECONDITION: boundaries_vector must be sorted.
- static Status Compute(OpKernelContext* context,
- const typename TTypes<T, 1>::ConstTensor& input,
- const std::vector<float>& boundaries_vector,
- typename TTypes<int32, 1>::Tensor& output) {
- const GPUDevice& d = context->eigen_device<GPUDevice>();
-
- CudaDeviceArrayOnHost<float> boundaries_array(context,
- boundaries_vector.size());
- TF_RETURN_IF_ERROR(boundaries_array.Init());
- for (int i = 0; i < boundaries_vector.size(); ++i) {
- boundaries_array.Set(i, boundaries_vector[i]);
- }
- TF_RETURN_IF_ERROR(boundaries_array.Finalize());
-
- CudaLaunchConfig config = GetCudaLaunchConfig(input.size(), d);
- BucketizeCustomKernel<
- T><<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
- input.size(), input.data(), boundaries_vector.size(),
- boundaries_array.data(), output.data());
-
- return Status::OK();
- }
-};
-} // namespace functor
-
-#define REGISTER_GPU_SPEC(type) \
- template struct functor::BucketizeFunctor<GPUDevice, type>;
-
-REGISTER_GPU_SPEC(int32);
-REGISTER_GPU_SPEC(int64);
-REGISTER_GPU_SPEC(float);
-REGISTER_GPU_SPEC(double);
-#undef REGISTER_GPU_SPEC
-
-} // namespace tensorflow
-
-#endif // GOOGLE_CUDA
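The deleted BucketizeCustomKernel open-codes the same upper_bound search as a halving loop, so each GPU thread does O(log n) loads with no recursion or library calls. A host-side replica of that loop, checked against std::upper_bound (test values are made up):

#include <algorithm>
#include <cassert>
#include <vector>

int UpperBoundLoop(const std::vector<float>& boundaries, float value) {
  int bucket = 0;
  int count = static_cast<int>(boundaries.size());
  while (count > 0) {
    int l = bucket;
    int step = count / 2;
    l += step;
    if (!(value < boundaries[l])) {  // value >= boundaries[l]: go right
      bucket = l + 1;
      count -= step + 1;
    } else {  // value < boundaries[l]: keep searching the left half
      count = step;
    }
  }
  return bucket;
}

int main() {
  const std::vector<float> boundaries = {-1.f, 0.f, 2.5f, 7.f};
  for (float v : {-2.f, -1.f, 1.f, 7.f, 9.f}) {
    assert(UpperBoundLoop(boundaries, v) ==
           std::upper_bound(boundaries.begin(), boundaries.end(), v) -
               boundaries.begin());
  }
  return 0;
}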
diff --git a/tensorflow/core/kernels/conv_grad_ops_3d.cc b/tensorflow/core/kernels/conv_grad_ops_3d.cc
index f819fccbfb..21f5cb1716 100644
--- a/tensorflow/core/kernels/conv_grad_ops_3d.cc
+++ b/tensorflow/core/kernels/conv_grad_ops_3d.cc
@@ -236,7 +236,6 @@ class Conv3DBackpropInputOp : public OpKernel {
REGISTER_KERNEL_BUILDER( \
Name("Conv3DBackpropInputV2").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
Conv3DBackpropInputOp<CPUDevice, T>);
-TF_CALL_half(REGISTER_CPU_KERNEL);
TF_CALL_float(REGISTER_CPU_KERNEL);
TF_CALL_double(REGISTER_CPU_KERNEL);
#undef REGISTER_CPU_KERNEL
@@ -384,7 +383,6 @@ class Conv3DBackpropFilterOp : public OpKernel {
.Device(DEVICE_CPU) \
.TypeConstraint<T>("T"), \
Conv3DBackpropFilterOp<CPUDevice, T>);
-TF_CALL_half(REGISTER_CPU_KERNEL);
TF_CALL_float(REGISTER_CPU_KERNEL);
TF_CALL_double(REGISTER_CPU_KERNEL);
#undef REGISTER_CPU_KERNEL
@@ -411,7 +409,6 @@ namespace functor {
const std::array<int, 3>& padding_right, \
typename TTypes<T, 5, int>::Tensor out, TensorFormat format);
-DECLARE_GPU_SPEC(Eigen::half);
DECLARE_GPU_SPEC(float);
#undef DECLARE_GPU_SPEC
} // namespace functor
@@ -1101,29 +1098,22 @@ class Conv3DBackpropFilterOp<GPUDevice, T> : public OpKernel {
bool cudnn_use_autotune_;
};
-
-
-#define REGISTER_GPU_KERNEL(T) \
- REGISTER_KERNEL_BUILDER( \
- Name("Conv3DBackpropInput").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
- Conv3DBackpropInputOp<GPUDevice, T>); \
- REGISTER_KERNEL_BUILDER(Name("Conv3DBackpropInputV2") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .HostMemory("input_sizes"), \
- Conv3DBackpropInputOp<GPUDevice, T>); \
- REGISTER_KERNEL_BUILDER( \
- Name("Conv3DBackpropFilter").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
- Conv3DBackpropFilterOp<GPUDevice, T>); \
- REGISTER_KERNEL_BUILDER(Name("Conv3DBackpropFilterV2") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .HostMemory("filter_sizes"), \
- Conv3DBackpropFilterOp<GPUDevice, T>);
-TF_CALL_half(REGISTER_GPU_KERNEL);
-TF_CALL_float(REGISTER_GPU_KERNEL);
-#undef REGISTER_GPU_KERNEL
-
+REGISTER_KERNEL_BUILDER(
+ Name("Conv3DBackpropInput").Device(DEVICE_GPU).TypeConstraint<float>("T"),
+ Conv3DBackpropInputOp<GPUDevice, float>);
+REGISTER_KERNEL_BUILDER(Name("Conv3DBackpropInputV2")
+ .Device(DEVICE_GPU)
+ .TypeConstraint<float>("T")
+ .HostMemory("input_sizes"),
+ Conv3DBackpropInputOp<GPUDevice, float>);
+REGISTER_KERNEL_BUILDER(
+ Name("Conv3DBackpropFilter").Device(DEVICE_GPU).TypeConstraint<float>("T"),
+ Conv3DBackpropFilterOp<GPUDevice, float>);
+REGISTER_KERNEL_BUILDER(Name("Conv3DBackpropFilterV2")
+ .Device(DEVICE_GPU)
+ .TypeConstraint<float>("T")
+ .HostMemory("filter_sizes"),
+ Conv3DBackpropFilterOp<GPUDevice, float>);
#endif // GOOGLE_CUDA
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/conv_ops_3d.cc b/tensorflow/core/kernels/conv_ops_3d.cc
index 37cb67bc51..8a89d564de 100644
--- a/tensorflow/core/kernels/conv_ops_3d.cc
+++ b/tensorflow/core/kernels/conv_ops_3d.cc
@@ -145,7 +145,6 @@ class Conv3DOp : public BinaryOp<T> {
REGISTER_KERNEL_BUILDER( \
Name("Conv3D").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
Conv3DOp<CPUDevice, T>);
-TF_CALL_half(REGISTER_CPU_KERNEL);
TF_CALL_float(REGISTER_CPU_KERNEL);
TF_CALL_double(REGISTER_CPU_KERNEL);
#undef REGISTER_CPU_KERNEL
@@ -483,7 +482,6 @@ namespace functor {
const std::array<int, 3>& padding_right, \
typename TTypes<T, 5, int>::Tensor out, TensorFormat format);
-DECLARE_GPU_SPEC(Eigen::half);
DECLARE_GPU_SPEC(float);
#undef DECLARE_GPU_SPEC
@@ -491,9 +489,6 @@ DECLARE_GPU_SPEC(float);
// Registration of the GPU implementations.
REGISTER_KERNEL_BUILDER(
- Name("Conv3D").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"),
- Conv3DOp<GPUDevice, Eigen::half>);
-REGISTER_KERNEL_BUILDER(
Name("Conv3D").Device(DEVICE_GPU).TypeConstraint<float>("T"),
Conv3DOp<GPUDevice, float>);
#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/cwise_op_acosh.cc b/tensorflow/core/kernels/cwise_op_acosh.cc
index 39c8814073..7bdd8d22a3 100644
--- a/tensorflow/core/kernels/cwise_op_acosh.cc
+++ b/tensorflow/core/kernels/cwise_op_acosh.cc
@@ -20,8 +20,16 @@ namespace tensorflow {
REGISTER4(UnaryOp, CPU, "Acosh", functor::acosh, float, double,
complex64, complex128);
-#ifdef TENSORFLOW_USE_SYCL
-REGISTER2(UnaryOp, SYCL, "Acosh", functor::acosh, float, double);
+#if TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNEL(TYPE) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Acosh") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<TYPE>("T"), \
+ UnaryOp<SYCLDevice, functor::acosh<TYPE>>);
+REGISTER_SYCL_KERNEL(float);
+REGISTER_SYCL_KERNEL(double);
+#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/cwise_op_asinh.cc b/tensorflow/core/kernels/cwise_op_asinh.cc
index 8d44208aa7..e0644323c0 100644
--- a/tensorflow/core/kernels/cwise_op_asinh.cc
+++ b/tensorflow/core/kernels/cwise_op_asinh.cc
@@ -20,9 +20,17 @@ namespace tensorflow {
REGISTER4(UnaryOp, CPU, "Asinh", functor::asinh, float, double,
complex64, complex128);
-#ifdef TENSORFLOW_USE_SYCL
-REGISTER2(UnaryOp, SYCL, "Asinh", functor::asinh, float, double);
-#endif // TENSORFLOW_USE_SYCL
+#if TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNEL(TYPE) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Asinh") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<TYPE>("T"), \
+ UnaryOp<SYCLDevice, functor::asinh<TYPE>>);
+REGISTER_SYCL_KERNEL(float);
+REGISTER_SYCL_KERNEL(double);
+#undef REGISTER_SYCL_KERNEL
+#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Asinh", functor::asinh, float, double);
diff --git a/tensorflow/core/kernels/cwise_op_atanh.cc b/tensorflow/core/kernels/cwise_op_atanh.cc
index bbc69e45aa..058f5140c5 100644
--- a/tensorflow/core/kernels/cwise_op_atanh.cc
+++ b/tensorflow/core/kernels/cwise_op_atanh.cc
@@ -20,9 +20,17 @@ namespace tensorflow {
REGISTER4(UnaryOp, CPU, "Atanh", functor::atanh, float, double,
complex64, complex128);
-#ifdef TENSORFLOW_USE_SYCL
-REGISTER2(UnaryOp, SYCL, "Atanh", functor::atanh, float, double);
-#endif // TENSORFLOW_USE_SYCL
+#if TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNEL(TYPE) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Atanh") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<TYPE>("T"), \
+ UnaryOp<SYCLDevice, functor::atanh<TYPE>>);
+REGISTER_SYCL_KERNEL(float);
+REGISTER_SYCL_KERNEL(double);
+#undef REGISTER_SYCL_KERNEL
+#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
REGISTER2(UnaryOp, GPU, "Atanh", functor::atanh, float, double);
diff --git a/tensorflow/core/kernels/cwise_ops.h b/tensorflow/core/kernels/cwise_ops.h
index d32185b6bf..6c22b124de 100644
--- a/tensorflow/core/kernels/cwise_ops.h
+++ b/tensorflow/core/kernels/cwise_ops.h
@@ -49,11 +49,7 @@ template <typename T>
struct scalar_asinh_op {
EIGEN_EMPTY_STRUCT_CTOR(scalar_asinh_op)
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const T operator()(const T& a) const {
-#if EIGEN_HAS_CXX11_MATH
- return numext::asinh(a);
-#else
return std::asinh(a);
-#endif // EIGEN_HAS_CXX11_MATH
}
};
template <typename T>
@@ -65,11 +61,7 @@ template <typename T>
struct scalar_acosh_op {
EIGEN_EMPTY_STRUCT_CTOR(scalar_acosh_op)
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const T operator()(const T& a) const {
-#if EIGEN_HAS_CXX11_MATH
- return numext::acosh(a);
-#else
return std::acosh(a);
-#endif // EIGEN_HAS_CXX11_MATH
}
};
template <typename T>
@@ -81,11 +73,7 @@ template <typename T>
struct scalar_atanh_op {
EIGEN_EMPTY_STRUCT_CTOR(scalar_atanh_op)
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const T operator()(const T& a) const {
-#if EIGEN_HAS_CXX11_MATH
- return numext::atanh(a);
-#else
return std::atanh(a);
-#endif // EIGEN_HAS_CXX11_MATH
}
};
template <typename T>
diff --git a/tensorflow/core/kernels/depthwise_conv_grad_op.cc b/tensorflow/core/kernels/depthwise_conv_grad_op.cc
index 53d65a22d1..9804d7d38e 100644
--- a/tensorflow/core/kernels/depthwise_conv_grad_op.cc
+++ b/tensorflow/core/kernels/depthwise_conv_grad_op.cc
@@ -231,7 +231,7 @@ static void CopyOutputBackpropRegion(const DepthwiseArgs& args,
}
// Pad to vector-register width (if needed).
for (int64 d = 0; d < pad_size; ++d) {
- buffer[buf_base + vectorized_size + scalar_size + d] = static_cast<T>(0);
+ buffer[buf_base + vectorized_size + scalar_size + d] = 0;
}
}
}
@@ -297,7 +297,7 @@ static void ComputeBackpropInput(const DepthwiseArgs& args,
for (int i = 0; i < output_vectorized_size; i += kPacketSize) {
// Reset accumulator.
- auto vaccum = Eigen::internal::pset1<Packet>(static_cast<T>(0));
+ auto vaccum = Eigen::internal::pset1<Packet>(0);
for (int j = 0; j < filter_spatial_size; ++j) {
// Calculate index.
const int64 index = i + j * padded_filter_inner_dim_size;
@@ -318,7 +318,7 @@ static void ComputeBackpropInput(const DepthwiseArgs& args,
}
if (output_scalar_size > 0) {
- auto vaccum = Eigen::internal::pset1<Packet>(static_cast<T>(0));
+ auto vaccum = Eigen::internal::pset1<Packet>(0);
for (int j = 0; j < filter_spatial_size; ++j) {
const int64 index =
output_vectorized_size + j * padded_filter_inner_dim_size;
@@ -346,7 +346,7 @@ static void ComputeBackpropInput(const DepthwiseArgs& args,
if (depth_multiplier > 1) {
for (int64 d = 0; d < in_depth; ++d) {
const int64 index = d * args.depth_multiplier;
- T accum = static_cast<T>(0);
+ T accum = 0;
for (int64 dm = 0; dm < dm_vectorized_size; dm += kPacketSize) {
const auto v = Eigen::internal::ploadu<Packet>(out_buffer + index + dm);
accum += Eigen::internal::predux(v);
@@ -510,7 +510,6 @@ static void DepthwiseConvBackpropInputReference(const DepthwiseArgs& args,
#if GOOGLE_CUDA
-extern template struct LaunchDepthwiseConvBackpropInputOp<GPUDevice, Eigen::half>;
extern template struct LaunchDepthwiseConvBackpropInputOp<GPUDevice, float>;
extern template struct LaunchDepthwiseConvBackpropInputOp<GPUDevice, double>;
@@ -885,7 +884,6 @@ static void DepthwiseConvBackpropFilterReference(const DepthwiseArgs& args,
#if GOOGLE_CUDA
-extern template struct LaunchDepthwiseConvBackpropFilterOp<GPUDevice, Eigen::half>;
extern template struct LaunchDepthwiseConvBackpropFilterOp<GPUDevice, float>;
extern template struct LaunchDepthwiseConvBackpropFilterOp<GPUDevice, double>;
diff --git a/tensorflow/core/kernels/depthwise_conv_op.cc b/tensorflow/core/kernels/depthwise_conv_op.cc
index 2759ecb2f1..bbeeaf7895 100644
--- a/tensorflow/core/kernels/depthwise_conv_op.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op.cc
@@ -94,7 +94,7 @@ struct DepthwiseConv2DKernel {
for (int i = 0; i < output_vectorized_size; i += kPacketSize) {
// Reset accumulator.
- auto vaccum = Eigen::internal::pset1<Packet>(static_cast<T>(0));
+ auto vaccum = Eigen::internal::pset1<Packet>(0);
for (int j = 0; j < filter_spatial_size; ++j) {
// Calculate index.
const int64 index = i + j * padded_filter_inner_dim_size;
@@ -115,7 +115,7 @@ struct DepthwiseConv2DKernel {
}
if (output_scalar_size > 0) {
- auto vaccum = Eigen::internal::pset1<Packet>(static_cast<T>(0));
+ auto vaccum = Eigen::internal::pset1<Packet>(0);
for (int j = 0; j < filter_spatial_size; ++j) {
const int64 index =
output_vectorized_size + j * padded_filter_inner_dim_size;
@@ -246,7 +246,6 @@ extern template class LaunchConv2DOp<CPUDevice, float>;
#if GOOGLE_CUDA
// Extern template instantiated in depthwise_conv_op_gpu.cc.
-extern template struct LaunchDepthwiseConvOp<GPUDevice, Eigen::half>;
extern template struct LaunchDepthwiseConvOp<GPUDevice, float>;
extern template struct LaunchDepthwiseConvOp<GPUDevice, double>;
@@ -420,7 +419,6 @@ class DepthwiseConv2dNativeOp : public BinaryOp<T> {
Name("DepthwiseConv2dNative").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
DepthwiseConv2dNativeOp<CPUDevice, T>);
-TF_CALL_half(REGISTER_CPU_KERNEL);
TF_CALL_float(REGISTER_CPU_KERNEL);
#if !defined(PLATFORM_WINDOWS) || !defined(_DEBUG)
TF_CALL_double(REGISTER_CPU_KERNEL);
@@ -428,10 +426,6 @@ TF_CALL_double(REGISTER_CPU_KERNEL);
#if GOOGLE_CUDA
REGISTER_KERNEL_BUILDER(
- Name("DepthwiseConv2dNative").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"),
- DepthwiseConv2dNativeOp<GPUDevice, Eigen::half>);
-
-REGISTER_KERNEL_BUILDER(
Name("DepthwiseConv2dNative").Device(DEVICE_GPU).TypeConstraint<float>("T"),
DepthwiseConv2dNativeOp<GPUDevice, float>);
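Context for the pset1(0) changes above: these kernels accumulate with Eigen's packet primitives, and passing a plain 0 relies on an implicit conversion from int to the scalar type, which is unambiguous for float and double and lines up with the Eigen::half registrations being removed elsewhere in this diff. A minimal standalone sketch of the same accumulate-then-reduce pattern, here as a float dot product rather than TensorFlow code:

#include <Eigen/Core>

float DotProduct(const float* a, const float* b, int n) {
  using Packet = Eigen::internal::packet_traits<float>::type;
  constexpr int kPacketSize = Eigen::internal::packet_traits<float>::size;
  // Reset accumulator: zero in every packet lane.
  Packet vaccum = Eigen::internal::pset1<Packet>(0);
  int i = 0;
  for (; i + kPacketSize <= n; i += kPacketSize) {
    const Packet va = Eigen::internal::ploadu<Packet>(a + i);
    const Packet vb = Eigen::internal::ploadu<Packet>(b + i);
    vaccum = Eigen::internal::pmadd(va, vb, vaccum);  // vaccum += va * vb
  }
  // Horizontal sum across the packet lanes, then the scalar tail.
  float sum = Eigen::internal::predux(vaccum);
  for (; i < n; ++i) sum += a[i] * b[i];
  return sum;
}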
diff --git a/tensorflow/core/kernels/depthwise_conv_op.h b/tensorflow/core/kernels/depthwise_conv_op.h
index 11aed5b415..aa5b5c76f6 100644
--- a/tensorflow/core/kernels/depthwise_conv_op.h
+++ b/tensorflow/core/kernels/depthwise_conv_op.h
@@ -158,7 +158,7 @@ struct DepthwiseFilterPadOp {
}
// Pad the remainder of output to vector-register boundary.
for (int64 j = 0; j < pad_size; ++j) {
- padded_filter[output_base + vectorized_size + scalar_size + j] = static_cast<T>(0);
+ padded_filter[output_base + vectorized_size + scalar_size + j] = 0;
}
}
}
@@ -266,7 +266,7 @@ struct DepthwiseInputCopyOp {
// Pad the remainder of the output to vector register boundary.
for (int64 d = 0; d < output_pad_size; ++d) {
- in_buf[d] = static_cast<T>(0);
+ in_buf[d] = 0;
}
in_buf += output_pad_size;
diff --git a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
index 903aac5d68..ecfe51d599 100644
--- a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
@@ -105,7 +105,7 @@ __global__ void __launch_bounds__(1024, 2)
const int input_row_end = input_row_start + filter_rows;
const int input_col_end = input_col_start + filter_cols;
- T sum = static_cast<T>(0);
+ T sum = 0;
const int input_offset_temp = in_rows * OB;
if (input_row_start >= 0 && input_col_start >= 0 &&
@@ -258,8 +258,8 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
__syncthreads();
if (depth_in_range) {
- T sum1 = static_cast<T>(0);
- T sum2 = static_cast<T>(0);
+ T sum1 = 0;
+ T sum2 = 0;
int shared_offset = data_idx;
const T* filter_ptr = filter_read_offset + shared_data;
UNROLL for (int r = 0; r < filter_rows; ++r) {
@@ -369,7 +369,7 @@ __global__ void __launch_bounds__(1024, 2)
const int input_row_end = input_row_start + filter_rows;
const int input_col_end = input_col_start + filter_cols;
- T sum = static_cast<T>(0);
+ T sum = 0;
if (input_row_start >= 0 && input_col_start >= 0 &&
input_row_end < in_rows && input_col_end < in_cols) {
// Loop that doesn't need to check for boundary conditions.
@@ -529,8 +529,8 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
__syncthreads();
if (slice_in_range) {
- T sum1 = static_cast<T>(0);
- T sum2 = static_cast<T>(0);
+ T sum1 = 0;
+ T sum2 = 0;
int shared_offset = data_idx;
const T* filter_ptr = filter_read_offset + shared_data;
UNROLL for (int r = 0; r < filter_rows; ++r) {
@@ -710,7 +710,6 @@ void LaunchDepthwiseConvOp<GPUDevice, T>::operator()(OpKernelContext* ctx,
"Launch of gpu kernel for DepthwiseConv2dGPULaunch failed"));
}
-template struct LaunchDepthwiseConvOp<GPUDevice, Eigen::half>;
template struct LaunchDepthwiseConvOp<GPUDevice, float>;
template struct LaunchDepthwiseConvOp<GPUDevice, double>;
@@ -745,7 +744,7 @@ __global__ void __launch_bounds__(640, 2)
const int in_r = (thread_id / in_depth / in_cols) % in_rows;
const int b = thread_id / in_depth / in_cols / in_rows;
- T sum = static_cast<T>(0);
+ T sum = 0;
const int out_r_start =
tf_max<int>(0, (in_r - filter_rows + pad_rows + stride) / stride);
@@ -811,7 +810,7 @@ __global__ void __launch_bounds__(640, 2)
const int in_d = (thread_id / in_cols / in_rows) % in_depth;
const int b = thread_id / in_depth / in_cols / in_rows;
- T sum = static_cast<T>(0);
+ T sum = 0;
const int out_d_start = in_d * depth_multiplier;
const int out_d_end = out_d_start + depth_multiplier;
@@ -920,7 +919,6 @@ void LaunchDepthwiseConvBackpropInputOp<GPUDevice, T>::operator()(
"utGPULaunch failed"));
}
-template struct LaunchDepthwiseConvBackpropInputOp<GPUDevice, Eigen::half>;
template struct LaunchDepthwiseConvBackpropInputOp<GPUDevice, float>;
template struct LaunchDepthwiseConvBackpropInputOp<GPUDevice, double>;
@@ -1633,7 +1631,6 @@ void LaunchDepthwiseConvBackpropFilterOp<GPUDevice, T>::operator()(
"terGPULaunch failed"));
}
-template struct LaunchDepthwiseConvBackpropFilterOp<GPUDevice, Eigen::half>;
template struct LaunchDepthwiseConvBackpropFilterOp<GPUDevice, float>;
template struct LaunchDepthwiseConvBackpropFilterOp<GPUDevice, double>;
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/dynamic_partition_op_gpu.cu.cc b/tensorflow/core/kernels/dynamic_partition_op_gpu.cu.cc
deleted file mode 100644
index 7249c8c66c..0000000000
--- a/tensorflow/core/kernels/dynamic_partition_op_gpu.cu.cc
+++ /dev/null
@@ -1,376 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-// The algorithm for dynamic partition has the following steps:
-// 1. Let N be the size of partitions. We initialize a new vector indices_in
-// with the values 0, 1, 2, ..., N-1.
-// 2. We apply cub::DeviceRadixSort::SortPairs to the key - value pairs given
- // 2. We apply cub::DeviceRadixSort::SortPairs to the key-value pairs given
-// partitions_out and indices_out, with partitions_out sorted.
-// 3. The first dimension of outputs[i] is equal to the length of the interval
-// of i-values in partitions_out. We determine it in two steps:
-// - compute the starting and ending point of each interval,
-// - subtract the starting and ending points to find the length.
-// The result is placed in partition_count.
-// 4. Because partition_count is on the GPU, we bring it asynchronously to
-// the CPU. Then we can allocate the output tensors.
-// 5. Finally, we use indices_out and the gather functor to collect the output.
-// This works, because for each interval of i-values, indices_out points
-// to the slices which should form output[i].
-
-#if GOOGLE_CUDA
-
-#define EIGEN_USE_GPU
-
-#include "external/cub_archive/cub/device/device_radix_sort.cuh"
-#include "tensorflow/core/common_runtime/gpu/gpu_event_mgr.h"
-#include "tensorflow/core/framework/op_kernel.h"
-#include "tensorflow/core/framework/register_types.h"
-#include "tensorflow/core/framework/tensor.h"
-#include "tensorflow/core/framework/types.h"
-#include "tensorflow/core/kernels/bounds_check.h"
-#include "tensorflow/core/kernels/fill_functor.h"
-#include "tensorflow/core/kernels/gather_functor_gpu.cu.h"
-#include "tensorflow/core/util/cuda_kernel_helper.h"
-
-namespace tensorflow {
-
-typedef Eigen::GpuDevice GPUDevice;
-
-namespace {
-
-template <typename T>
-__global__ void RangeInitKernel(const T start, const T delta, const int32 size,
- T* out) {
- CUDA_1D_KERNEL_LOOP(i, size) { out[i] = start + i * delta; }
-}
-
-__global__ void FindEndpointsKernel(const int32* partitions, int32 size,
- int32 nump, int32* start, int32* end) {
- CUDA_1D_KERNEL_LOOP(i, size) {
- int32 current = ldg(partitions + i);
- if (FastBoundsCheck(current, nump)) {
- if (i == 0)
- start[current] = i;
- else {
- int32 before = ldg(partitions + i - 1);
- if (before != current) start[current] = i;
- }
- if (i == size - 1)
- end[current] = i + 1;
- else {
- int32 after = ldg(partitions + i + 1);
- if (after != current) end[current] = i + 1;
- }
- }
- }
-}
-
-// We create a local version of subtract, because the tf.subtract kernel
-// is not defined for int32. We use it to compute the length of an interval
-// by subtracting the endpoints.
-__global__ void IntervalLengthKernel(int32* start, int32 size, int32* end) {
- CUDA_1D_KERNEL_LOOP(i, size) {
- int32 start_point = ldg(start + i);
- end[i] = end[i] - start_point;
- }
-}
-
-// Initialize out with range start, start + delta, start + 2 * delta, ...
-// This is needed because tf.range has no GPU implementation.
-template <typename T>
-void RangeInit(const GPUDevice& d, const T start, const T delta,
- const int32 size, typename TTypes<T>::Flat out) {
- CudaLaunchConfig config = GetCudaLaunchConfig(size, d);
- RangeInitKernel<
- T><<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
- start, delta, size, out.data());
-}
-
- // Partitions is a sorted vector of N non-negative integers.
-// This function computes the starting and ending points of each interval
-// of values.
-void ComputeIntervals(const GPUDevice& d, Tensor* partitions, int32 N,
- int32 nump, int32* start_ptr, int32* end_ptr) {
- CudaLaunchConfig config = GetCudaLaunchConfig(N, d);
- FindEndpointsKernel<<<config.block_count, config.thread_per_block, 0,
- d.stream()>>>(partitions->flat<int32>().data(), N, nump,
- start_ptr, end_ptr);
-}
-
-// Subtract the ending points of each interval to obtain the interval length.
-void ComputeItvLength(const GPUDevice& d, int32 num, int32* start_ptr,
- int32* end_ptr) {
- CudaLaunchConfig config = GetCudaLaunchConfig(num, d);
- IntervalLengthKernel<<<config.block_count, config.thread_per_block, 0,
- d.stream()>>>(start_ptr, num, end_ptr);
-}
-
-template <typename T>
-void CallGatherKernel(const GPUDevice& d, const T* params, const int32* indices,
- T* out, int64 gather_dim_size, int64 indices_size,
- int64 slice_size, int64 out_size) {
- CudaLaunchConfig config = GetCudaLaunchConfig(out_size, d);
- GatherOpKernel<
- T, int32,
- true><<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
- params, indices, out, gather_dim_size, indices_size, slice_size,
- out_size);
-}
-
-} // namespace
-
-// The current implementation has memory cost on GPU
-// I + P + max(3N + R, O + N), where:
-// I - the size of the input
-// N - the size of the partitions tensor
-// R - the temporary storage used by cub::RadixSort, about 2N
-// P - the number of partitions
-// O - the size of the output
-// So roughly the cost is I + P + max(5N, O + N).
-template <typename T>
-class DynamicPartitionOpGPU : public AsyncOpKernel {
- public:
- explicit DynamicPartitionOpGPU(OpKernelConstruction* c) : AsyncOpKernel(c) {
- OP_REQUIRES_OK(c, c->GetAttr("num_partitions", &num_partitions_));
- OP_REQUIRES(c, num_partitions_ >= 1,
- errors::InvalidArgument("num_partitions must be at least 1"));
- }
-
- void AllocateTempSpace(OpKernelContext* c, int32 N, Tensor* indices_in,
- Tensor* partitions_out, Tensor* indices_out,
- DoneCallback done) {
- int32 M = std::max(N, num_partitions_);
- // indices_in will be made slightly larger to accommodate
- // later computations.
- OP_REQUIRES_OK_ASYNC(
- c, c->allocate_temp(DT_INT32, TensorShape({M}), indices_in), done);
- OP_REQUIRES_OK_ASYNC(
- c, c->allocate_temp(DT_INT32, TensorShape({N}), partitions_out), done);
- OP_REQUIRES_OK_ASYNC(
- c, c->allocate_temp(DT_INT32, TensorShape({N}), indices_out), done);
- }
-
- void AllocateOutputs(OpKernelContext* c, const Tensor* data,
- const Tensor* partitions, const Tensor* partition_count,
- OpOutputList* Tout, DoneCallback done) {
- auto e_part_count = partition_count->flat<int32>();
- // Allocate output tensors of the right size
- OP_REQUIRES_OK_ASYNC(c, c->output_list("outputs", Tout), done);
- for (int p = 0; p < num_partitions_; p++) {
- TensorShape shape;
- shape.AddDim(e_part_count(p));
- for (int i = partitions->dims(); i < data->dims(); i++) {
- shape.AddDim(data->dim_size(i));
- }
- Tensor* out;
- OP_REQUIRES_OK_ASYNC(c, Tout->allocate(p, shape, &out), done);
- }
- }
-
- void ComputeAsync(OpKernelContext* c, DoneCallback done) {
- const Tensor& data = c->input(0);
- const Tensor& partitions = c->input(1);
-
- OP_REQUIRES_ASYNC(
- c, TensorShapeUtils::StartsWith(data.shape(), partitions.shape()),
- errors::InvalidArgument("data.shape must start with partitions.shape, ",
- "got data.shape = ", data.shape().DebugString(),
- ", partitions.shape = ",
- partitions.shape().DebugString()),
- done);
-
- Tensor partition_count;
-
- // We must handle the case of empty partitions separately,
- // because kernels don't work with 0-sized tensors.
- if (partitions.NumElements() == 0) {
- AllocatorAttributes alloc_attr;
- alloc_attr.set_on_host(true);
- OP_REQUIRES_OK_ASYNC(
- c, c->allocate_temp(DT_INT32, TensorShape({num_partitions_}),
- &partition_count, alloc_attr),
- done);
- auto e_part_count = partition_count.flat<int32>();
- for (int i = 0; i < num_partitions_; i++) e_part_count(i) = 0;
- OpOutputList outputs;
- this->AllocateOutputs(c, &data, &partitions, &partition_count, &outputs,
- done);
- if (c->status().ok()) done();
- return;
- }
-
- // Prepare for counting.
- OP_REQUIRES_OK_ASYNC(
- c, c->allocate_temp(DT_INT32, TensorShape({num_partitions_}),
- &partition_count),
- done);
- Tensor indices_out;
- // Count how many times each partition index occurs.
- // Also sort the info in partitions and output it in indices_out,
- // in preparation for the next step.
- this->CountAndSortParts(c, &partitions, &partition_count, &indices_out,
- done);
- if (!c->status().ok()) return;
-
- // In order to allocate the output tensor we have to move partition_count
- // to CPU.
- auto* stream = c->op_device_context()->stream();
- OP_REQUIRES_ASYNC(c, stream, errors::Internal("No GPU stream available."),
- done);
- Tensor cpu_tensor;
- AllocatorAttributes alloc_attr;
- alloc_attr.set_on_host(true);
- alloc_attr.set_gpu_compatible(true);
- OP_REQUIRES_OK_ASYNC(
- c, c->allocate_temp(partition_count.dtype(), partition_count.shape(),
- &cpu_tensor, alloc_attr),
- done);
- perftools::gputools::DeviceMemoryBase wrapped(
- partition_count.flat<int32>().data(), num_partitions_ * sizeof(int32));
- const bool status =
- stream
- ->ThenMemcpy(cpu_tensor.flat<int32>().data(), wrapped,
- num_partitions_ * sizeof(int32))
- .ok();
- OP_REQUIRES_ASYNC(
- c, status,
- errors::Internal("Failed to launch copy from device to host."), done);
-
- // Keep a reference to partition_count so that the buffer
- // is not deallocated at the end of the function, before
- // memcpy is completed.
- TensorReference partition_ref(partition_count);
- auto wrapped_callback = [this, c, &data, &partitions, indices_out,
- partition_ref, cpu_tensor, done]() {
- OpOutputList outputs;
- this->AllocateOutputs(c, &data, &partitions, &cpu_tensor, &outputs, done);
- if (!c->status().ok()) {
- partition_ref.Unref();
- return;
- }
- int32 N = partitions.NumElements();
- int64 slice_size = data.NumElements() / N;
- this->GatherSlices(c, &data, &indices_out, N, slice_size, outputs);
- partition_ref.Unref();
- done();
- };
-
- c->device()->tensorflow_gpu_device_info()->event_mgr->ThenExecute(
- stream, wrapped_callback);
- }
-
- protected:
- void RadixSort(OpKernelContext* c, const Tensor* partitions,
- Tensor* indices_in, Tensor* partitions_out,
- Tensor* indices_out, DoneCallback done) {
- int32 N = partitions->NumElements();
- const GPUDevice& device = c->eigen_device<GPUDevice>();
- const cudaStream_t& cu_stream = GetCudaStream(c);
-
- // Initialize the indices_in tensor using the Range GPU kernel.
- RangeInit(device, 0, 1, N, indices_in->flat<int32>());
- // Obtain the pointers to inner buffers.
- const int32* partitions_ptr = partitions->flat<int32>().data();
- int32* partitions_out_ptr = partitions_out->flat<int32>().data();
- int32* indices_in_ptr = indices_in->flat<int32>().data();
- int32* indices_out_ptr = indices_out->flat<int32>().data();
- // Determine temporary device storage requirements.
- Tensor cub_temp_storage;
- size_t temp_storage_bytes = 0;
- cub::DeviceRadixSort::SortPairs(
- NULL, temp_storage_bytes, partitions_ptr, partitions_out_ptr,
- indices_in_ptr, indices_out_ptr, N, 0, sizeof(int32) * 8, cu_stream);
- // Allocate temporary storage.
- OP_REQUIRES_OK_ASYNC(
- c, c->allocate_temp(
- DT_INT8, TensorShape({static_cast<int64>(temp_storage_bytes)}),
- &cub_temp_storage),
- done);
- // Radix-sort the partition information.
- cub::DeviceRadixSort::SortPairs(
- cub_temp_storage.flat<int8>().data(), temp_storage_bytes,
- partitions_ptr, partitions_out_ptr, indices_in_ptr, indices_out_ptr, N,
- 0, sizeof(int32) * 8, cu_stream);
- } // At this point cub_temp_storage will be marked for deallocation.
-
- void CountAndSortParts(OpKernelContext* c, const Tensor* partitions,
- Tensor* partition_count, Tensor* indices_out,
- DoneCallback done) {
- const GPUDevice& device = c->eigen_device<GPUDevice>();
- int32 N = partitions->NumElements();
- Tensor indices_in;
- Tensor partitions_out;
-
- // Allocate memory for Radix-Sort.
- this->AllocateTempSpace(c, N, &indices_in, &partitions_out, indices_out,
- done);
- if (!c->status().ok()) return;
- this->RadixSort(c, partitions, &indices_in, &partitions_out, indices_out,
- done);
- if (!c->status().ok()) return;
- // We still need a little bit of additional memory. However,
- // we can reuse the indices_in tensor. We could also use atomic
- // operations and no additional memory, but this approach seems faster.
-
- // Zero-out the allocated memory.
- functor::SetZeroFunctor<GPUDevice, int32> zero_functor;
- zero_functor(device, partition_count->flat<int32>());
- zero_functor(device, indices_in.flat<int32>());
- // Obtain the pointers to inner buffers.
- int32* start_ptr = indices_in.flat<int32>().data();
- int32* end_ptr = partition_count->flat<int32>().data();
- // Obtain the starting and ending points of each interval.
- ComputeIntervals(device, &partitions_out, N, num_partitions_, start_ptr,
- end_ptr);
- // Subtract to compute the number of appearances of each id.
- ComputeItvLength(device, num_partitions_, start_ptr, end_ptr);
- } // At this point indices_in and partitions_out will be marked
- // for deallocation.
-
- void GatherSlices(OpKernelContext* c, const Tensor* data,
- const Tensor* indices, int32 N, int64 slice_size,
- OpOutputList& outs) {
- const GPUDevice& device = c->eigen_device<GPUDevice>();
- const int32* ind_base = indices->flat<int32>().data();
- const T* data_base = data->flat<T>().data();
-
- for (int p = 0; p < num_partitions_; p++) {
- int32 indices_size = outs[p]->dim_size(0);
- int64 out_size = outs[p]->NumElements();
- T* out_base = outs[p]->flat<T>().data();
- if (out_size > 0)
- CallGatherKernel<T>(device, data_base, ind_base, out_base, N,
- indices_size, slice_size, out_size);
- ind_base += indices_size;
- }
- }
-
- int num_partitions_;
-};
-
-#define REGISTER_DYNAMIC_PARTITION_GPU(T) \
- REGISTER_KERNEL_BUILDER( \
- Name("DynamicPartition").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
- DynamicPartitionOpGPU<T>)
-
-TF_CALL_GPU_NUMBER_TYPES(REGISTER_DYNAMIC_PARTITION_GPU);
-TF_CALL_complex64(REGISTER_DYNAMIC_PARTITION_GPU);
-TF_CALL_complex128(REGISTER_DYNAMIC_PARTITION_GPU);
-#undef REGISTER_DYNAMIC_PARTITION_GPU
-
-} // namespace tensorflow
-
-#endif // GOOGLE_CUDA
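The five-step algorithm from the deleted header comment, restated as a minimal host-side sketch: std::stable_sort stands in for cub::DeviceRadixSort::SortPairs, the per-partition counts are taken directly rather than via the interval-endpoint kernels, and data is one scalar per index for brevity.

#include <algorithm>
#include <numeric>
#include <vector>

std::vector<std::vector<float>> DynamicPartitionSketch(
    const std::vector<float>& data, const std::vector<int>& partitions,
    int num_partitions) {
  const int n = static_cast<int>(partitions.size());
  // Step 1: indices_in = 0, 1, ..., N-1.
  std::vector<int> indices(n);
  std::iota(indices.begin(), indices.end(), 0);
  // Step 2: sort the indices by partition id (radix sort on the GPU).
  std::stable_sort(indices.begin(), indices.end(), [&](int a, int b) {
    return partitions[a] < partitions[b];
  });
  // Step 3: interval lengths, i.e. how often each partition id occurs.
  std::vector<int> counts(num_partitions, 0);
  for (int p : partitions) ++counts[p];
  // Steps 4 and 5: size the outputs from the counts, then gather through
  // the sorted indices so each interval lands in its own output.
  std::vector<std::vector<float>> outs(num_partitions);
  for (int p = 0; p < num_partitions; ++p) outs[p].reserve(counts[p]);
  for (int i : indices) outs[partitions[i]].push_back(data[i]);
  return outs;
}

For partitions = {0, 1, 0} and data = {10, 20, 30}, this yields outs[0] = {10, 30} and outs[1] = {20}.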
diff --git a/tensorflow/core/kernels/dynamic_partition_op_test.cc b/tensorflow/core/kernels/dynamic_partition_op_test.cc
index 9a7ed0af21..0e8fbc0a67 100644
--- a/tensorflow/core/kernels/dynamic_partition_op_test.cc
+++ b/tensorflow/core/kernels/dynamic_partition_op_test.cc
@@ -16,7 +16,6 @@ limitations under the License.
#include <functional>
#include <memory>
-#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/fake_input.h"
#include "tensorflow/core/framework/node_def_builder.h"
@@ -24,14 +23,10 @@ limitations under the License.
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/types.h"
#include "tensorflow/core/framework/types.pb.h"
-#include "tensorflow/core/graph/node_builder.h"
-#include "tensorflow/core/graph/testlib.h"
#include "tensorflow/core/kernels/ops_testutil.h"
#include "tensorflow/core/kernels/ops_util.h"
#include "tensorflow/core/lib/core/status_test_util.h"
-#include "tensorflow/core/lib/random/simple_philox.h"
#include "tensorflow/core/platform/test.h"
-#include "tensorflow/core/platform/test_benchmark.h"
namespace tensorflow {
namespace {
@@ -158,58 +153,5 @@ TEST_F(DynamicPartitionOpTest, Error_IndexOutOfRange) {
<< s;
}
-Node* DynamicPartitionNode(Graph* g, Node* in0, Node* in1, int num_partitions) {
- Node* ret;
- TF_CHECK_OK(NodeBuilder(g->NewName("n"), "DynamicPartition")
- .Input(in0)
- .Input(in1)
- .Attr("num_partitions", num_partitions)
- .Finalize(g, &ret));
- return ret;
-}
-
-template <typename T>
-static Graph* DynamicPartition(int num_partitions, int dim) {
- Graph* g = new Graph(OpRegistry::Global());
- // Always use a 128MB buffer.
- const int kRows = ((128 << 20) / sizeof(T)) / dim;
- Tensor data(DataTypeToEnum<T>::value, TensorShape({kRows, dim}));
- data.flat<T>().setRandom();
-
- random::PhiloxRandom philox(301, 17);
- random::SimplePhilox rnd(&philox);
- Tensor partitions(DT_INT32, TensorShape({kRows}));
- for (int i = 0; i < kRows; i++) {
- partitions.flat<int32>()(i) = rnd.Uniform(num_partitions);
- }
- DynamicPartitionNode(g, test::graph::Constant(g, data),
- test::graph::Constant(g, partitions), num_partitions);
- return g;
-}
-
-#define BM_DYNAMIC_PARTITION(DEVICE, T, num) \
- static void BM_##DEVICE##_dynpart_##T##_##num(int iters, int dim) { \
- const int64 items = ((128 << 20) / sizeof(T)); \
- const int64 tot = static_cast<int64>(iters) * items; \
- testing::ItemsProcessed(tot); \
- testing::UseRealTime(); \
- test::Benchmark(#DEVICE, DynamicPartition<T>(num, dim)).Run(iters); \
- } \
- BENCHMARK(BM_##DEVICE##_dynpart_##T##_##num)->Arg(1)->Arg(256)
-
-BM_DYNAMIC_PARTITION(cpu, float, 2);
-BM_DYNAMIC_PARTITION(cpu, float, 100);
-BM_DYNAMIC_PARTITION(cpu, double, 2);
-BM_DYNAMIC_PARTITION(cpu, double, 100);
-BM_DYNAMIC_PARTITION(cpu, complex64, 2);
-BM_DYNAMIC_PARTITION(cpu, complex64, 100);
-
-BM_DYNAMIC_PARTITION(gpu, float, 2);
-BM_DYNAMIC_PARTITION(gpu, float, 100);
-BM_DYNAMIC_PARTITION(gpu, double, 2);
-BM_DYNAMIC_PARTITION(gpu, double, 100);
-BM_DYNAMIC_PARTITION(gpu, complex64, 2);
-BM_DYNAMIC_PARTITION(gpu, complex64, 100);
-
} // namespace
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/fused_batch_norm_op.cc b/tensorflow/core/kernels/fused_batch_norm_op.cc
index 1688674eb7..0ecb829f34 100644
--- a/tensorflow/core/kernels/fused_batch_norm_op.cc
+++ b/tensorflow/core/kernels/fused_batch_norm_op.cc
@@ -54,20 +54,25 @@ struct FusedBatchNorm<CPUDevice, T, U> {
Tensor* batch_var_output, Tensor* saved_mean_output,
Tensor* saved_var_output, TensorFormat tensor_format,
bool is_training) {
+ // Currently U is ignored, since we only support the case where T and U are
+ // both float32.
+ // TODO(reedwm): Add float16 support, use U, and remove these asserts.
+ static_assert(std::is_same<T, float>::value, "T currently must be float.");
+ static_assert(std::is_same<U, float>::value, "U currently must be float.");
OP_REQUIRES(context, tensor_format == FORMAT_NHWC,
errors::Internal("The CPU implementation of FusedBatchNorm "
"only supports NHWC tensor format for now."));
typename TTypes<T, 4>::ConstTensor x(x_input.tensor<T, 4>());
- typename TTypes<U>::ConstVec scale(scale_input.vec<U>());
- typename TTypes<U>::ConstVec offset(offset_input.vec<U>());
- typename TTypes<U>::ConstVec estimated_mean(estimated_mean_input.vec<U>());
- typename TTypes<U>::ConstVec estimated_variance(
- estimated_variance_input.vec<U>());
+ typename TTypes<T>::ConstVec scale(scale_input.vec<T>());
+ typename TTypes<T>::ConstVec offset(offset_input.vec<T>());
+ typename TTypes<T>::ConstVec estimated_mean(estimated_mean_input.vec<T>());
+ typename TTypes<T>::ConstVec estimated_variance(
+ estimated_variance_input.vec<T>());
typename TTypes<T, 4>::Tensor y(y_output->tensor<T, 4>());
- typename TTypes<U>::Vec batch_mean(batch_mean_output->vec<U>());
- typename TTypes<U>::Vec batch_var(batch_var_output->vec<U>());
- typename TTypes<U>::Vec saved_mean(saved_mean_output->vec<U>());
- typename TTypes<U>::Vec saved_var(saved_var_output->vec<U>());
+ typename TTypes<T>::Vec batch_mean(batch_mean_output->vec<T>());
+ typename TTypes<T>::Vec batch_var(batch_var_output->vec<T>());
+ typename TTypes<T>::Vec saved_mean(saved_mean_output->vec<T>());
+ typename TTypes<T>::Vec saved_var(saved_var_output->vec<T>());
const CPUDevice& d = context->eigen_device<CPUDevice>();
@@ -88,15 +93,15 @@ struct FusedBatchNorm<CPUDevice, T, U> {
bcast_spec.set(0, rest_size);
#endif
- auto x_rest_by_depth = x.reshape(rest_by_depth).template cast<U>();
+ auto x_rest_by_depth = x.reshape(rest_by_depth);
const int rest_size_minus_one = (rest_size > 1) ? (rest_size - 1) : 1;
- U rest_size_inv = static_cast<U>(1.0f / static_cast<U>(rest_size));
+ T rest_size_inv = static_cast<T>(1.0f / static_cast<T>(rest_size));
// This adjustment is for Bessel's correction
- U rest_size_adjust =
- static_cast<U>(rest_size) / static_cast<U>(rest_size_minus_one);
+ T rest_size_adjust =
+ static_cast<T>(rest_size) / static_cast<T>(rest_size_minus_one);
- Eigen::Tensor<U, 1, Eigen::RowMajor> mean(depth);
- Eigen::Tensor<U, 1, Eigen::RowMajor> variance(depth);
+ Eigen::Tensor<T, 1, Eigen::RowMajor> mean(depth);
+ Eigen::Tensor<T, 1, Eigen::RowMajor> variance(depth);
if (is_training) {
mean.device(d) = (x_rest_by_depth.sum(reduce_dims) * rest_size_inv);
batch_mean.device(d) = mean;
@@ -124,7 +129,7 @@ struct FusedBatchNorm<CPUDevice, T, U> {
auto x_shifted =
x_scaled + offset.reshape(one_by_depth).broadcast(bcast_spec);
- y.reshape(rest_by_depth).device(d) = x_shifted.template cast<T>();
+ y.reshape(rest_by_depth).device(d) = x_shifted;
}
};
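A small numeric illustration of the rest_size_adjust factor above: dividing the sum of squared deviations by N gives the population variance used for normalization, and multiplying that by N/(N-1) applies Bessel's correction to recover the unbiased sample variance. The sample values below are made up.

#include <cstdio>

int main() {
  const float x[4] = {1.f, 2.f, 3.f, 4.f};
  const int n = 4;
  float mean = 0.f, var = 0.f;
  for (float v : x) mean += v;
  mean /= n;  // rest_size_inv plays the role of 1/N
  for (float v : x) var += (v - mean) * (v - mean);
  var /= n;  // population variance: 1.25
  const float rest_size_adjust = float(n) / float(n - 1);  // N/(N-1)
  std::printf("population=%f bessel_corrected=%f\n", var,
              var * rest_size_adjust);  // 1.25 and 1.666667
  return 0;
}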
@@ -133,7 +138,7 @@ struct FusedBatchNormGrad<CPUDevice, T, U> {
void operator()(OpKernelContext* context, const Tensor& y_backprop_input,
const Tensor& x_input, const Tensor& scale_input,
const Tensor& mean_input, const Tensor& variance_input,
- U epsilon, Tensor* x_backprop_output,
+ T epsilon, Tensor* x_backprop_output,
Tensor* scale_backprop_output, Tensor* offset_backprop_output,
TensorFormat tensor_format) {
OP_REQUIRES(context, tensor_format == FORMAT_NHWC,
@@ -142,12 +147,12 @@ struct FusedBatchNormGrad<CPUDevice, T, U> {
typename TTypes<T, 4>::ConstTensor y_backprop(
y_backprop_input.tensor<T, 4>());
typename TTypes<T, 4>::ConstTensor x(x_input.tensor<T, 4>());
- typename TTypes<U>::ConstVec scale(scale_input.vec<U>());
- typename TTypes<U>::ConstVec mean(mean_input.vec<U>());
- typename TTypes<U>::ConstVec variance(variance_input.vec<U>());
+ typename TTypes<T>::ConstVec scale(scale_input.vec<T>());
+ typename TTypes<T>::ConstVec mean(mean_input.vec<T>());
+ typename TTypes<T>::ConstVec variance(variance_input.vec<T>());
typename TTypes<T, 4>::Tensor x_backprop(x_backprop_output->tensor<T, 4>());
- typename TTypes<U>::Vec scale_backprop(scale_backprop_output->vec<U>());
- typename TTypes<U>::Vec offset_backprop(offset_backprop_output->vec<U>());
+ typename TTypes<T>::Vec scale_backprop(scale_backprop_output->vec<T>());
+ typename TTypes<T>::Vec offset_backprop(offset_backprop_output->vec<T>());
// Note: the following formulas are used to compute the gradients for
// back propagation.
@@ -176,8 +181,8 @@ struct FusedBatchNormGrad<CPUDevice, T, U> {
bcast_spec.set(0, rest_size);
#endif
- auto x_rest_by_depth = x.reshape(rest_by_depth).template cast<U>();
- U rest_size_inv = static_cast<U>(1.0f / static_cast<U>(rest_size));
+ auto x_rest_by_depth = x.reshape(rest_by_depth);
+ T rest_size_inv = static_cast<T>(1.0f / static_cast<T>(rest_size));
auto x_mean_rest_by_depth =
mean.reshape(one_by_depth).broadcast(bcast_spec);
@@ -187,8 +192,7 @@ struct FusedBatchNormGrad<CPUDevice, T, U> {
coef0.eval().reshape(one_by_depth).broadcast(bcast_spec);
auto x_scaled = x_centered * coef0_rest_by_depth;
- auto y_backprop_rest_by_depth =
- y_backprop.eval().reshape(rest_by_depth).template cast<U>();
+ auto y_backprop_rest_by_depth = y_backprop.eval().reshape(rest_by_depth);
scale_backprop.device(d) =
(y_backprop_rest_by_depth * x_scaled).sum(reduce_dims);
auto y_backprop_sum = y_backprop_rest_by_depth.sum(reduce_dims);
@@ -210,7 +214,7 @@ struct FusedBatchNormGrad<CPUDevice, T, U> {
.reshape(one_by_depth)
.broadcast(bcast_spec);
x_backprop.reshape(rest_by_depth).device(d) =
- (coef1 * (y_backprop_centered - x_centered * coef2)).template cast<T>();
+ coef1 * (y_backprop_centered - x_centered * coef2);
}
};
@@ -685,18 +689,6 @@ REGISTER_KERNEL_BUILDER(Name("FusedBatchNormGradV2")
.TypeConstraint<float>("U"),
FusedBatchNormGradOp<CPUDevice, float, float>);
-REGISTER_KERNEL_BUILDER(Name("FusedBatchNormV2")
- .Device(DEVICE_CPU)
- .TypeConstraint<Eigen::half>("T")
- .TypeConstraint<float>("U"),
- FusedBatchNormOp<CPUDevice, Eigen::half, float>);
-
-REGISTER_KERNEL_BUILDER(Name("FusedBatchNormGradV2")
- .Device(DEVICE_CPU)
- .TypeConstraint<Eigen::half>("T")
- .TypeConstraint<float>("U"),
- FusedBatchNormGradOp<CPUDevice, Eigen::half, float>);
-
#if GOOGLE_CUDA
REGISTER_KERNEL_BUILDER(
diff --git a/tensorflow/core/kernels/fused_batch_norm_op.h b/tensorflow/core/kernels/fused_batch_norm_op.h
index 3af104bf95..38b24d7011 100644
--- a/tensorflow/core/kernels/fused_batch_norm_op.h
+++ b/tensorflow/core/kernels/fused_batch_norm_op.h
@@ -92,28 +92,26 @@ struct FusedBatchNormFreezeGrad {
// offset_backprop = sum(y_backprop)
// scale_backprop = y_backprop * ((x - pop_mean) * rsqrt(pop_var + epsilon))
// x_backprop = y_backprop * (scale * rsqrt(pop_var + epsilon))
-
- auto y_backprop_rest_by_depth =
- y_backprop.reshape(rest_by_depth).template cast<U>();
- auto input_rest_by_depth = input.reshape(rest_by_depth).template cast<U>();
-
- offset_backprop.device(d) = y_backprop_rest_by_depth.sum(reduction_axis);
+ offset_backprop.device(d) = y_backprop.reshape(rest_by_depth)
+ .template cast<U>()
+ .sum(reduction_axis);
// scratch1 = rsqrt(pop_var + epsilon)
scratch1.device(d) = (pop_var + pop_var.constant(epsilon)).rsqrt();
// scratch2 = sum(y_backprop * (x - mean))
scratch2.device(d) =
- (y_backprop_rest_by_depth *
- (input_rest_by_depth -
+ (y_backprop.reshape(rest_by_depth).template cast<U>() *
+ (input.reshape(rest_by_depth).template cast<U>() -
pop_mean.reshape(one_by_depth).broadcast(rest_by_one)))
.sum(reduction_axis);
x_backprop.reshape(rest_by_depth).device(d) =
- (y_backprop_rest_by_depth * ((scratch1 * scale)
- .eval()
- .reshape(one_by_depth)
- .broadcast(rest_by_one)))
+ (y_backprop.reshape(rest_by_depth).template cast<U>() *
+ ((scratch1 * scale)
+ .eval()
+ .reshape(one_by_depth)
+ .broadcast(rest_by_one)))
.template cast<T>();
scale_backprop.device(d) = scratch2 * scratch1;
}
diff --git a/tensorflow/core/kernels/lmdb_reader_op.cc b/tensorflow/core/kernels/lmdb_reader_op.cc
index 31a427f2c9..3bb07301b5 100755
--- a/tensorflow/core/kernels/lmdb_reader_op.cc
+++ b/tensorflow/core/kernels/lmdb_reader_op.cc
@@ -36,7 +36,7 @@ class LMDBReader : public ReaderBase {
Status OnWorkStartedLocked() override {
MDB_CHECK(mdb_env_create(&mdb_env_));
- int flags = MDB_RDONLY | MDB_NOTLS | MDB_NOLOCK;
+ int flags = MDB_RDONLY | MDB_NOTLS;
// Check if the LMDB filename is actually a file instead of a directory.
// If so, set appropriate flags so we can open it.
@@ -57,13 +57,10 @@ class LMDBReader : public ReaderBase {
if (mdb_env_ != nullptr) {
if (mdb_cursor_) {
mdb_cursor_close(mdb_cursor_);
- mdb_cursor_ = nullptr;
}
- mdb_dbi_close(mdb_env_, mdb_dbi_);
mdb_txn_abort(mdb_txn_);
+ mdb_dbi_close(mdb_env_, mdb_dbi_);
mdb_env_close(mdb_env_);
- mdb_txn_ = nullptr;
- mdb_dbi_ = 0;
mdb_env_ = nullptr;
}
return Status::OK();
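The teardown reordering above closes handles from the inside out: cursor first, then the read transaction, and only then the DBI handle and the environment. A minimal read-only open/close sketch using the same LMDB calls (error handling elided; the path is a placeholder for a directory-style environment):

#include <lmdb.h>

void ReadOnlyOpenClose() {
  MDB_env* env = nullptr;
  MDB_txn* txn = nullptr;
  MDB_cursor* cursor = nullptr;
  MDB_dbi dbi;
  mdb_env_create(&env);
  mdb_env_open(env, "/tmp/lmdb-env", MDB_RDONLY | MDB_NOTLS, 0664);
  mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn);
  mdb_dbi_open(txn, nullptr, 0, &dbi);
  mdb_cursor_open(txn, dbi, &cursor);
  // ... iterate with mdb_cursor_get ...
  mdb_cursor_close(cursor);
  mdb_txn_abort(txn);  // finish the read txn before closing the DBI
  mdb_dbi_close(env, dbi);
  mdb_env_close(env);
}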
diff --git a/tensorflow/core/kernels/maxpooling_op.cc b/tensorflow/core/kernels/maxpooling_op.cc
index 157ce106ce..e2cf605811 100644
--- a/tensorflow/core/kernels/maxpooling_op.cc
+++ b/tensorflow/core/kernels/maxpooling_op.cc
@@ -20,6 +20,7 @@ limitations under the License.
#include "tensorflow/core/kernels/maxpooling_op.h"
#include <vector>
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/common_runtime/device.h"
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
@@ -33,11 +34,9 @@ limitations under the License.
#include "tensorflow/core/kernels/pooling_ops_common.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/lib/gtl/array_slice.h"
-#include "tensorflow/core/util/env_var.h"
#include "tensorflow/core/util/padding.h"
#include "tensorflow/core/util/tensor_format.h"
#include "tensorflow/core/util/use_cudnn.h"
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#if GOOGLE_CUDA
#include "tensorflow/core/kernels/maxpooling_op_gpu.h"
@@ -359,7 +358,6 @@ class MaxPoolingGradOp<Eigen::GpuDevice, T> : public OpKernel {
OP_REQUIRES_OK(context, context->GetAttr("padding", &padding_));
use_dnn_ = CanUseCudnn();
- ReadBoolFromEnvVar("TF_ENABLE_MAXPOOL_NANPROP", false, &propagate_nans_);
}
void Compute(OpKernelContext* context) override {
@@ -407,7 +405,7 @@ class MaxPoolingGradOp<Eigen::GpuDevice, T> : public OpKernel {
DnnPoolingGradOp<T>::Compute(
context, perftools::gputools::dnn::PoolingMode::kMaximum, ksize,
stride, padding_, data_format_, &tensor_in, &tensor_out, out_backprop,
- output_shape, propagate_nans_);
+ output_shape);
} else {
CHECK(data_format_ == FORMAT_NHWC)
<< "Non-Cudnn MaxPoolGrad only supports NHWC format";
@@ -422,7 +420,6 @@ class MaxPoolingGradOp<Eigen::GpuDevice, T> : public OpKernel {
Padding padding_;
TensorFormat data_format_;
bool use_dnn_;
- bool propagate_nans_;
};
#endif // GOOGLE_CUDA
@@ -887,8 +884,6 @@ class MaxPoolingWithArgmaxOp : public OpKernel {
OP_REQUIRES(context, ksize_[0] == 1 && stride_[0] == 1,
errors::Unimplemented(
"Pooling is not yet supported on the batch dimension."));
-
- ReadBoolFromEnvVar("TF_ENABLE_MAXPOOL_NANPROP", false, &propagate_nans_);
}
void Compute(OpKernelContext* context) override {
@@ -907,15 +902,14 @@ class MaxPoolingWithArgmaxOp : public OpKernel {
Tensor* argmax = nullptr;
OP_REQUIRES_OK(context, context->allocate_output(1, out_shape, &argmax));
- LaunchMaxPoolingWithArgmax<Device, T>::launch(
- context, params, tensor_in, output, argmax, propagate_nans_);
+ LaunchMaxPoolingWithArgmax<Device, T>::launch(context, params, tensor_in,
+ output, argmax);
}
private:
std::vector<int32> ksize_;
std::vector<int32> stride_;
Padding padding_;
- bool propagate_nans_;
};
template <typename Device, typename T>
@@ -1051,8 +1045,6 @@ class MaxPoolingNoMaskOp<GPUDevice, T> : public OpKernel {
errors::Unimplemented(
"Pooling is not yet supported on the batch dimension."));
use_dnn_ = CanUseCudnn();
-
- ReadBoolFromEnvVar("TF_ENABLE_MAXPOOL_NANPROP", false, &propagate_nans_);
}
void Compute(OpKernelContext* context) override {
@@ -1076,10 +1068,9 @@ class MaxPoolingNoMaskOp<GPUDevice, T> : public OpKernel {
// These is_int8x4 checks avoid linker errors for missing qint8 kernels.
if (!is_int8x4 && use_dnn_ && data_format_ == FORMAT_NCHW) {
- DnnPoolingOp<T>::Compute(context,
- perftools::gputools::dnn::PoolingMode::kMaximum,
- ksize_, stride_, padding_, data_format_,
- tensor_in, out_shape, propagate_nans_);
+ DnnPoolingOp<T>::Compute(
+ context, perftools::gputools::dnn::PoolingMode::kMaximum, ksize_,
+ stride_, padding_, data_format_, tensor_in, out_shape);
} else {
Tensor* output = nullptr;
OP_REQUIRES_OK(context, context->allocate_output(0, out_shape, &output));
@@ -1088,7 +1079,7 @@ class MaxPoolingNoMaskOp<GPUDevice, T> : public OpKernel {
tensor_in, output);
} else if (data_format_ == FORMAT_NHWC) {
LaunchMaxPoolingNoMask<Device, T>::launch(context, params, tensor_in,
- output, propagate_nans_);
+ output);
} else {
LOG(FATAL) << "MaxPool currently only supports the following (layout, "
"type) combinations: (NHWC, non-qint8), "
@@ -1107,7 +1098,6 @@ class MaxPoolingNoMaskOp<GPUDevice, T> : public OpKernel {
Padding padding_;
TensorFormat data_format_;
bool use_dnn_;
- bool propagate_nans_;
};
template <typename T>
@@ -1137,7 +1127,6 @@ class MaxPoolingNoMaskV2Op<GPUDevice, T> : public OpKernel {
}
OP_REQUIRES_OK(context, context->GetAttr("padding", &padding_));
use_dnn_ = CanUseCudnn();
- ReadBoolFromEnvVar("TF_ENABLE_MAXPOOL_NANPROP", false, &propagate_nans_);
}
void Compute(OpKernelContext* context) override {
@@ -1179,17 +1168,16 @@ class MaxPoolingNoMaskV2Op<GPUDevice, T> : public OpKernel {
ShapeFromFormat(data_format_, params.tensor_in_batch, params.out_height,
params.out_width, params.depth);
if (use_dnn_ && data_format_ == FORMAT_NCHW) {
- DnnPoolingOp<T>::Compute(context,
- perftools::gputools::dnn::PoolingMode::kMaximum,
- ksize, stride, padding_, data_format_, tensor_in,
- out_shape, propagate_nans_);
+ DnnPoolingOp<T>::Compute(
+ context, perftools::gputools::dnn::PoolingMode::kMaximum, ksize,
+ stride, padding_, data_format_, tensor_in, out_shape);
} else {
CHECK(data_format_ == FORMAT_NHWC)
<< "Non-Cudnn MaxPool only supports NHWC format";
Tensor* output = nullptr;
OP_REQUIRES_OK(context, context->allocate_output(0, out_shape, &output));
LaunchMaxPoolingNoMask<Device, T>::launch(context, params, tensor_in,
- output, propagate_nans_);
+ output);
}
}
@@ -1199,20 +1187,18 @@ class MaxPoolingNoMaskV2Op<GPUDevice, T> : public OpKernel {
Padding padding_;
TensorFormat data_format_;
bool use_dnn_;
- bool propagate_nans_;
};
template <typename T>
struct LaunchMaxPoolingNoMask<Eigen::GpuDevice, T> {
static void launch(OpKernelContext* context, const PoolParameters& params,
- const Tensor& input, Tensor* output, bool propagate_nans) {
+ const Tensor& input, Tensor* output) {
bool status = functor::MaxPoolForwardWithOptionalArgmax<T>()(
input.flat<T>().data(), params.tensor_in_batch, params.tensor_in_rows,
params.tensor_in_cols, params.depth, params.out_height,
params.out_width, params.window_rows, params.window_cols,
params.row_stride, params.col_stride, params.pad_rows, params.pad_cols,
- output->flat<T>().data(), nullptr, context->eigen_gpu_device(),
- propagate_nans);
+ output->flat<T>().data(), nullptr, context->eigen_gpu_device());
if (!status) {
context->SetStatus(
errors::Internal("Failed launching MaxPoolForwardNoMask"));
@@ -1223,8 +1209,7 @@ struct LaunchMaxPoolingNoMask<Eigen::GpuDevice, T> {
template <typename T>
struct LaunchMaxPoolingWithArgmax<Eigen::GpuDevice, T> {
static void launch(OpKernelContext* context, const PoolParameters& params,
- const Tensor& input, Tensor* output, Tensor* argmax,
- bool propagate_nans) {
+ const Tensor& input, Tensor* output, Tensor* argmax) {
bool status = functor::MaxPoolForwardWithOptionalArgmax<T>()(
input.flat<T>().data(), params.tensor_in_batch, params.tensor_in_rows,
params.tensor_in_cols, params.depth, params.out_height,
@@ -1232,7 +1217,7 @@ struct LaunchMaxPoolingWithArgmax<Eigen::GpuDevice, T> {
params.row_stride, params.col_stride, params.pad_rows, params.pad_cols,
output->flat<T>().data(),
reinterpret_cast<int64*>(argmax->flat<int64>().data()),
- context->eigen_gpu_device(), propagate_nans);
+ context->eigen_gpu_device());
if (!status) {
context->SetStatus(
errors::Internal("Failed launching MaxPoolForwardWithArgmax"));
diff --git a/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc b/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc
index d96b844383..26f5274804 100644
--- a/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc
@@ -29,15 +29,6 @@ limitations under the License.
namespace tensorflow {
namespace {
-template <bool propagate_nans, typename dtype>
-EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE bool IsGreaterThan(dtype a, dtype b) {
- if (propagate_nans) {
- return !(a <= b);
- } else {
- return a > b;
- }
-}
-
// This is Yangqing's custom kernel for the maxpooling operation. There are
// three functions: MaxPoolForwardNCHW and MaxPoolForwardNHWC are the two
// forward functions, dealing with the forward case. MaxPoolBackward is the
@@ -60,7 +51,7 @@ EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE bool IsGreaterThan(dtype a, dtype b) {
// const int output_size = batch * channels * pooled_height * pooled_width;
// MaxPoolForwardNCHW<<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock,
// kThreadsPerBlock, 0, cuda_stream>>>(...);
-template <bool propagate_nans, typename dtype>
+template <typename dtype>
__global__ void MaxPoolForwardNCHW(const int nthreads, const dtype* bottom_data,
const int channels, const int height,
const int width, const int pooled_height,
@@ -86,7 +77,7 @@ __global__ void MaxPoolForwardNCHW(const int nthreads, const dtype* bottom_data,
for (int h = hstart; h < hend; ++h) {
for (int w = wstart; w < wend; ++w) {
int idx = c * height * width + h * width + w;
- if (IsGreaterThan<propagate_nans>(bottom_data_n[idx], maxval)) {
+ if (bottom_data_n[idx] > maxval) {
maxidx = idx;
maxval = bottom_data_n[idx];
}
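The IsGreaterThan helper deleted above is what made NaN propagation work: every comparison with NaN is false, so !(a <= b) is true when a is NaN and the NaN displaces the running maximum, while the plain a > b that replaces it ignores NaNs. A standalone illustration:

    #include <cmath>
    #include <cstdio>

    // NaN-propagating vs. IEEE-default comparison, as in the deleted helper.
    bool GreaterPropagateNans(float a, float b) { return !(a <= b); }
    bool GreaterDefault(float a, float b) { return a > b; }

    int main() {
      const float nan = std::nanf("");
      printf("%d %d\n",
             GreaterPropagateNans(nan, 1.0f),  // 1: NaN wins the max
             GreaterDefault(nan, 1.0f));       // 0: NaN is ignored
      return 0;
    }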
@@ -135,7 +126,7 @@ __global__ void MaxPoolForwardNoMaskKernel_NCHW_VECT_C(
}
}
-template <bool propagate_nans, typename dtype>
+template <typename dtype>
__global__ void MaxPoolForwardNHWC(const int nthreads, const dtype* bottom_data,
const int height, const int width,
const int channels, const int pooled_height,
@@ -162,7 +153,7 @@ __global__ void MaxPoolForwardNHWC(const int nthreads, const dtype* bottom_data,
for (int h = hstart; h < hend; ++h) {
for (int w = wstart; w < wend; ++w) {
int idx = (h * width + w) * channels + c;
- if (IsGreaterThan<propagate_nans>(bottom_data_n[idx], maxval)) {
+ if (bottom_data_n[idx] > maxval) {
maxidx = idx;
maxval = bottom_data_n[idx];
}
@@ -399,24 +390,15 @@ bool MaxPoolForwardWithOptionalArgmax<T>::operator()(
const int channels, const int pooled_height, const int pooled_width,
const int kernel_h, const int kernel_w, const int stride_h,
const int stride_w, const int pad_t, const int pad_l, T* top_data,
- int64* mask, const Eigen::GpuDevice& d, bool propagate_nans) {
+ int64* mask, const Eigen::GpuDevice& d) {
const int kThreadsPerBlock = 1024;
const int output_size = batch * channels * pooled_height * pooled_width;
- if (propagate_nans) {
- MaxPoolForwardNHWC<true>
- <<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock,
- kThreadsPerBlock, 0, d.stream()>>>
- (output_size, bottom_data, height, width, channels, pooled_height,
- pooled_width, kernel_h, kernel_w, stride_h, stride_w, pad_t, pad_l,
- top_data, mask);
- } else {
- MaxPoolForwardNHWC<false>
- <<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock,
- kThreadsPerBlock, 0, d.stream()>>>
- (output_size, bottom_data, height, width, channels, pooled_height,
- pooled_width, kernel_h, kernel_w, stride_h, stride_w, pad_t, pad_l,
- top_data, mask);
- }
+
+ MaxPoolForwardNHWC<<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock,
+ kThreadsPerBlock, 0, d.stream()>>>(
+ output_size, bottom_data, height, width, channels, pooled_height,
+ pooled_width, kernel_h, kernel_w, stride_h, stride_w, pad_t, pad_l,
+ top_data, mask);
return d.ok();
}
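The launch configuration retained above uses the usual ceiling-division idiom so that a partially filled final block is still launched. In isolation:

    // One thread per output element, rounded up to whole blocks.
    int NumBlocks(int output_size, int threads_per_block) {
      return (output_size + threads_per_block - 1) / threads_per_block;
    }
    // e.g. NumBlocks(1500, 1024) == 2: the second block covers the tail.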
diff --git a/tensorflow/core/kernels/maxpooling_op_gpu.h b/tensorflow/core/kernels/maxpooling_op_gpu.h
index 38ebb34248..34203797cf 100644
--- a/tensorflow/core/kernels/maxpooling_op_gpu.h
+++ b/tensorflow/core/kernels/maxpooling_op_gpu.h
@@ -39,7 +39,7 @@ struct MaxPoolForwardWithOptionalArgmax {
const int pooled_width, const int kernel_h,
const int kernel_w, const int stride_h, const int stride_w,
const int pad_t, const int pad_l, T* top_data, int64* mask,
- const Eigen::GpuDevice& d, bool propagate_nans);
+ const Eigen::GpuDevice& d);
};
struct MaxPoolForwardNoMask_NCHW_VECT_C {
diff --git a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
index f291281108..9080bf7be8 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
@@ -45,12 +45,12 @@ limitations under the License.
#ifdef INTEL_MKL_DNN
#include "mkldnn.hpp"
-using mkldnn::stream;
using mkldnn::prop_kind;
+using mkldnn::stream;
-using mkldnn::convolution_forward;
using mkldnn::convolution_backward_weights;
using mkldnn::convolution_direct;
+using mkldnn::convolution_forward;
#endif
@@ -463,12 +463,13 @@ class MklConv2DCustomBackpropFilterOp : public OpKernel {
// Generate input shapes.
TensorShape filter_shape;
- OP_REQUIRES(context, TensorShapeUtils::IsVector(filter_tensor.shape()),
- errors::InvalidArgument(
+ OP_REQUIRES(
+ context, TensorShapeUtils::IsVector(filter_tensor.shape()),
+ errors::InvalidArgument(
"Conv2DBackpropFilter: filter_sizes input must be 1-dim, not ",
filter_tensor.dims()));
OP_REQUIRES_OK(context, TensorShapeUtils::MakeShape(
- filter_tensor.vec<int32>(), &filter_shape));
+ filter_tensor.vec<int32>(), &filter_shape));
TensorShape input_shape = input_tensor.shape();
TensorShape obp_shape = obp_tensor.shape();
@@ -480,27 +481,26 @@ class MklConv2DCustomBackpropFilterOp : public OpKernel {
// Get forward convolution parameters.
MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_);
- conv_utl.GetConvFwdSizesInMklOrder(input_shape, filter_shape,
- &fwd_input_dims, &fwd_filter_dims,
- &strides,
- &fwd_output_dims_tf_order,
- &fwd_output_dims,
- &padding_l, &padding_r);
+ conv_utl.GetConvFwdSizesInMklOrder(
+ input_shape, filter_shape, &fwd_input_dims, &fwd_filter_dims,
+ &strides, &fwd_output_dims_tf_order, &fwd_output_dims, &padding_l,
+ &padding_r);
if (!context->status().ok()) return;
// Create Convolution forward descriptor since Convolution backward
// API needs it. For that, we first need to create input, filter
// and output memory descriptors.
auto mkl_data_format = TFDataFormatToMklDnnDataFormat(data_format_);
- auto fwd_src_md = memory::desc(fwd_input_dims, MklDnnType<T>(),
- mkl_data_format);
- auto fwd_filter_md = memory::desc(fwd_filter_dims, MklDnnType<T>(),
- memory::format::hwio);
- auto fwd_out_md = memory::desc(fwd_output_dims, MklDnnType<T>(),
- mkl_data_format);
- auto fwd_desc = convolution_forward::desc(prop_kind::forward,
- convolution_direct, fwd_src_md, fwd_filter_md, fwd_out_md,
- strides, padding_l, padding_r, TFPaddingToMklDnnPadding(padding_));
+ auto fwd_src_md =
+ memory::desc(fwd_input_dims, MklDnnType<T>(), mkl_data_format);
+ auto fwd_filter_md =
+ memory::desc(fwd_filter_dims, MklDnnType<T>(), memory::format::hwio);
+ auto fwd_out_md =
+ memory::desc(fwd_output_dims, MklDnnType<T>(), mkl_data_format);
+ auto fwd_desc = convolution_forward::desc(
+ prop_kind::forward, convolution_direct, fwd_src_md, fwd_filter_md,
+ fwd_out_md, strides, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_));
auto fwd_pd = convolution_forward::primitive_desc(fwd_desc, cpu_engine);
// Allocate output tensor and shape
@@ -537,23 +537,22 @@ class MklConv2DCustomBackpropFilterOp : public OpKernel {
output.SetOpMemDesc(bwd_output_dims, memory::format::any);
// Create convolution backward weights primitive.
- auto bwd_desc = convolution_backward_weights::desc(convolution_direct,
- input.GetOpMemDesc(), output.GetOpMemDesc(),
- outbackprop.GetOpMemDesc(), strides, padding_l,
- padding_r, TFPaddingToMklDnnPadding(padding_));
+ auto bwd_desc = convolution_backward_weights::desc(
+ convolution_direct, input.GetOpMemDesc(), output.GetOpMemDesc(),
+ outbackprop.GetOpMemDesc(), strides, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_));
- auto bwd_pd = convolution_backward_weights::primitive_desc(bwd_desc,
- cpu_engine,
- fwd_pd);
+ auto bwd_pd = convolution_backward_weights::primitive_desc(
+ bwd_desc, cpu_engine, fwd_pd);
PrepareAndExecutePrimitive(bwd_pd, &input, &outbackprop, &output);
- } catch (mkldnn::error &e) {
- string error_msg = "Status: " + std::to_string(e.status) +
- ", message: " + string(e.message) +
- ", in file " + string(__FILE__) + ":" +
- std::to_string(__LINE__);
- OP_REQUIRES_OK(context, errors::Aborted("Operation received an exception:",
- error_msg));
+ } catch (mkldnn::error& e) {
+ string error_msg = "Status: " + std::to_string(e.status) +
+ ", message: " + string(e.message) + ", in file " +
+ string(__FILE__) + ":" + std::to_string(__LINE__);
+ OP_REQUIRES_OK(
+ context,
+ errors::Aborted("Operation received an exception:", error_msg));
}
}
@@ -564,9 +563,8 @@ class MklConv2DCustomBackpropFilterOp : public OpKernel {
// Prepare and execute net - checks for input and output reorders.
void PrepareAndExecutePrimitive(
- const convolution_backward_weights::primitive_desc& conv_pd,
- MklDnnData<T>* input, MklDnnData<T>* obp,
- MklDnnData<T>* output) {
+ const convolution_backward_weights::primitive_desc& conv_pd,
+ MklDnnData<T>* input, MklDnnData<T>* obp, MklDnnData<T>* output) {
    // Create reorders between the user layout and the MKL layout if they are
    // needed, and add them to the net before the convolution.
std::vector<primitive> net;
@@ -577,10 +575,10 @@ class MklConv2DCustomBackpropFilterOp : public OpKernel {
// output side, we will prepare reorder primitive in case output
// reorder to user memory is required.
bool output_reorder_required = output->PrepareReorderToUserMemIfReq(
- conv_pd.diff_weights_primitive_desc());
+ conv_pd.diff_weights_primitive_desc());
- net.push_back(convolution_backward_weights(conv_pd, input->GetOpMem(),
- obp->GetOpMem(), output->GetOpMem()));
+ net.push_back(convolution_backward_weights(
+ conv_pd, input->GetOpMem(), obp->GetOpMem(), output->GetOpMem()));
// Insert reorder primitive in the net for output reorder if reorder is
// required.
diff --git a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
index 4a47d0463e..4b6bf92e42 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
@@ -23,6 +23,8 @@ limitations under the License.
#define EIGEN_USE_THREADS
#include <algorithm>
#include <vector>
+#include "mkl_dnn.h"
+#include "mkl_dnn_types.h"
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
@@ -41,18 +43,16 @@ limitations under the License.
#include "tensorflow/core/util/tensor_format.h"
#include "tensorflow/core/util/use_cudnn.h"
#include "tensorflow/core/util/work_sharder.h"
-#include "mkl_dnn.h"
-#include "mkl_dnn_types.h"
#ifdef INTEL_MKL_DNN
#include "mkldnn.hpp"
-using mkldnn::stream;
using mkldnn::prop_kind;
+using mkldnn::stream;
-using mkldnn::convolution_forward;
-using mkldnn::convolution_direct;
using mkldnn::convolution_backward_data;
+using mkldnn::convolution_direct;
+using mkldnn::convolution_forward;
#endif
namespace tensorflow {
@@ -397,12 +397,13 @@ class MklConv2DCustomBackpropInputOp : public OpKernel {
// Generate input shape.
TensorShape input_shape;
- OP_REQUIRES(context, TensorShapeUtils::IsVector(input_tensor.shape()),
- errors::InvalidArgument(
+ OP_REQUIRES(
+ context, TensorShapeUtils::IsVector(input_tensor.shape()),
+ errors::InvalidArgument(
"Conv2DBackpropInput: input_sizes input must be 1-dim, not ",
input_tensor.dims()));
OP_REQUIRES_OK(context, TensorShapeUtils::MakeShape(
- input_tensor.vec<int32>(), &input_shape));
+ input_tensor.vec<int32>(), &input_shape));
TensorShape filter_shape = filter_tensor.shape();
TensorShape obp_shape = obp_tensor.shape();
@@ -414,27 +415,26 @@ class MklConv2DCustomBackpropInputOp : public OpKernel {
// Get forward convolution parameters.
MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_);
- conv_utl.GetConvFwdSizesInMklOrder(input_shape, filter_shape,
- &fwd_input_dims, &fwd_filter_dims,
- &strides,
- &fwd_output_dims_tf_order,
- &fwd_output_dims,
- &padding_l, &padding_r);
+ conv_utl.GetConvFwdSizesInMklOrder(
+ input_shape, filter_shape, &fwd_input_dims, &fwd_filter_dims,
+ &strides, &fwd_output_dims_tf_order, &fwd_output_dims, &padding_l,
+ &padding_r);
if (!context->status().ok()) return;
// Create Convolution forward descriptor since Convolution backward
// API needs it. For that, we first need to create input, filter
// and output memory descriptors.
auto mkl_data_format = TFDataFormatToMklDnnDataFormat(data_format_);
- auto fwd_src_md = memory::desc(fwd_input_dims, MklDnnType<T>(),
- mkl_data_format);
- auto fwd_filter_md = memory::desc(fwd_filter_dims, MklDnnType<T>(),
- memory::format::hwio);
- auto fwd_out_md = memory::desc(fwd_output_dims, MklDnnType<T>(),
- mkl_data_format);
- auto fwd_desc = convolution_forward::desc(prop_kind::forward,
- convolution_direct, fwd_src_md, fwd_filter_md, fwd_out_md,
- strides, padding_l, padding_r, TFPaddingToMklDnnPadding(padding_));
+ auto fwd_src_md =
+ memory::desc(fwd_input_dims, MklDnnType<T>(), mkl_data_format);
+ auto fwd_filter_md =
+ memory::desc(fwd_filter_dims, MklDnnType<T>(), memory::format::hwio);
+ auto fwd_out_md =
+ memory::desc(fwd_output_dims, MklDnnType<T>(), mkl_data_format);
+ auto fwd_desc = convolution_forward::desc(
+ prop_kind::forward, convolution_direct, fwd_src_md, fwd_filter_md,
+ fwd_out_md, strides, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_));
auto fwd_pd = convolution_forward::primitive_desc(fwd_desc, cpu_engine);
// Allocate output tensor and shape
@@ -475,23 +475,22 @@ class MklConv2DCustomBackpropInputOp : public OpKernel {
output.SetOpMemDesc(bwd_output_dims, memory::format::any);
// Create convolution backward data primitive.
- auto bwd_desc = convolution_backward_data::desc(convolution_direct,
- output.GetOpMemDesc(), filter.GetOpMemDesc(),
- outbackprop.GetOpMemDesc(), strides, padding_l,
- padding_r, TFPaddingToMklDnnPadding(padding_));
+ auto bwd_desc = convolution_backward_data::desc(
+ convolution_direct, output.GetOpMemDesc(), filter.GetOpMemDesc(),
+ outbackprop.GetOpMemDesc(), strides, padding_l, padding_r,
+ TFPaddingToMklDnnPadding(padding_));
- auto bwd_pd = convolution_backward_data::primitive_desc(bwd_desc,
- cpu_engine,
- fwd_pd);
+ auto bwd_pd = convolution_backward_data::primitive_desc(
+ bwd_desc, cpu_engine, fwd_pd);
PrepareAndExecutePrimitive(bwd_pd, &filter, &outbackprop, &output);
- } catch (mkldnn::error &e) {
- string error_msg = "Status: " + std::to_string(e.status) +
- ", message: " + string(e.message) +
- ", in file " + string(__FILE__) + ":" +
- std::to_string(__LINE__);
- OP_REQUIRES_OK(context, errors::Aborted("Operation received an exception:",
- error_msg));
+ } catch (mkldnn::error& e) {
+ string error_msg = "Status: " + std::to_string(e.status) +
+ ", message: " + string(e.message) + ", in file " +
+ string(__FILE__) + ":" + std::to_string(__LINE__);
+ OP_REQUIRES_OK(
+ context,
+ errors::Aborted("Operation received an exception:", error_msg));
}
}
@@ -502,9 +501,8 @@ class MklConv2DCustomBackpropInputOp : public OpKernel {
// Prepare and execute net - checks for input and output reorders.
void PrepareAndExecutePrimitive(
- const convolution_backward_data::primitive_desc& conv_pd,
- MklDnnData<T>* filter, MklDnnData<T>* obp,
- MklDnnData<T>* output) {
+ const convolution_backward_data::primitive_desc& conv_pd,
+ MklDnnData<T>* filter, MklDnnData<T>* obp, MklDnnData<T>* output) {
    // Create reorders between the user layout and the MKL layout if they are
    // needed, and add them to the net before the convolution.
std::vector<primitive> net;
@@ -514,11 +512,11 @@ class MklConv2DCustomBackpropInputOp : public OpKernel {
// Memory for output of convolution. Since we may need reorder on the
// output side, we will prepare reorder primitive in case output
// reorder to user memory is required.
- bool output_reorder_required = output->PrepareReorderToUserMemIfReq(
- conv_pd.diff_src_primitive_desc());
+ bool output_reorder_required =
+ output->PrepareReorderToUserMemIfReq(conv_pd.diff_src_primitive_desc());
- net.push_back(convolution_backward_data(conv_pd, obp->GetOpMem(),
- filter->GetOpMem(), output->GetOpMem()));
+ net.push_back(convolution_backward_data(
+ conv_pd, obp->GetOpMem(), filter->GetOpMem(), output->GetOpMem()));
// Insert reorder primitive in the net for output reorder if reorder is
// required.
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index a9872b8d6d..369f632fb4 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -18,8 +18,8 @@ limitations under the License.
#include <string.h>
#include <map>
-#include <vector>
#include <string>
+#include <vector>
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
@@ -46,11 +46,11 @@ limitations under the License.
#ifdef INTEL_MKL_DNN
#include "mkldnn.hpp"
-using mkldnn::stream;
using mkldnn::prop_kind;
+using mkldnn::stream;
-using mkldnn::convolution_forward;
using mkldnn::convolution_direct;
+using mkldnn::convolution_forward;
#endif
namespace tensorflow {
@@ -523,19 +523,16 @@ class MklConv2DOp : public OpKernel {
// Get shapes of input tensors in MKL-DNN order
MklDnnConvUtil conv_utl(context, strides_, padding_, data_format_);
- conv_utl.GetConvFwdSizesInMklOrder(src_tensor.shape(),
- filter_tensor.shape(),
- &src_dims, &filter_dims, &strides,
- &output_dims_tf_order,
- &output_dims_mkl_order, &padding_l,
- &padding_r);
+ conv_utl.GetConvFwdSizesInMklOrder(
+ src_tensor.shape(), filter_tensor.shape(), &src_dims, &filter_dims,
+ &strides, &output_dims_tf_order, &output_dims_mkl_order, &padding_l,
+ &padding_r);
if (!context->status().ok()) return;
// Check for corner case - if there is nothing to compute, return.
- TensorShape tf_output_shape({output_dims_tf_order[0],
- output_dims_tf_order[1],
- output_dims_tf_order[2],
- output_dims_tf_order[3]});
+ TensorShape tf_output_shape(
+ {output_dims_tf_order[0], output_dims_tf_order[1],
+ output_dims_tf_order[2], output_dims_tf_order[3]});
Tensor* output_tensor = nullptr;
MklShape mkl_output_mkl_shape;
mkl_output_mkl_shape.SetMklTensor(false);
@@ -572,13 +569,13 @@ class MklConv2DOp : public OpKernel {
// the layout is Tensorflow's layout (NHWC or NCHW depending on data
// format).
src.SetUsrMem(src_dims, TFDataFormatToMklDnnDataFormat(data_format_),
- const_cast<void*>(static_cast<const void*>(
- src_tensor.flat<T>().data())));
+ const_cast<void*>(
+ static_cast<const void*>(src_tensor.flat<T>().data())));
// Although filter shape (filter_dims) required is in MKL-DNN order,
// the layout is Tensorflow's layout (HWIO).
filter.SetUsrMem(filter_dims, memory::format::hwio,
const_cast<void*>(static_cast<const void*>(
- filter_tensor.flat<T>().data())));
+ filter_tensor.flat<T>().data())));
// Although output shape (output_dims) required is in MKL-DNN order,
// layout is Tensorflow's layout (NHWC or NCHW depending on data format).
output.SetUsrMem(output_dims_mkl_order,
@@ -598,36 +595,36 @@ class MklConv2DOp : public OpKernel {
const Tensor& bias_tensor = MklGetInput(context, 2);
bias.SetUsrMem(bias_size, memory::format::x,
const_cast<void*>(static_cast<const void*>(
- bias_tensor.flat<T>().data())));
+ bias_tensor.flat<T>().data())));
bias.SetOpMemDesc(bias_size, memory::format::any);
// Create convolution primitive with Bias.
- auto conv_desc = convolution_forward::desc(prop_kind::forward,
- convolution_direct, src.GetOpMemDesc(), filter.GetOpMemDesc(),
- bias.GetOpMemDesc(), output.GetOpMemDesc(), strides,
- padding_l, padding_r, TFPaddingToMklDnnPadding(padding_));
+ auto conv_desc = convolution_forward::desc(
+ prop_kind::forward, convolution_direct, src.GetOpMemDesc(),
+ filter.GetOpMemDesc(), bias.GetOpMemDesc(), output.GetOpMemDesc(),
+ strides, padding_l, padding_r, TFPaddingToMklDnnPadding(padding_));
- auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc,
- cpu_engine);
+ auto conv_prim_desc =
+ convolution_forward::primitive_desc(conv_desc, cpu_engine);
PrepareAndExecuteNet(conv_prim_desc, &src, &filter, &bias, &output);
} else {
// Create convolution primitive without Bias.
- auto conv_desc = convolution_forward::desc(prop_kind::forward,
- convolution_direct, src.GetOpMemDesc(), filter.GetOpMemDesc(),
- output.GetOpMemDesc(), strides, padding_l, padding_r,
- TFPaddingToMklDnnPadding(padding_));
+ auto conv_desc = convolution_forward::desc(
+ prop_kind::forward, convolution_direct, src.GetOpMemDesc(),
+ filter.GetOpMemDesc(), output.GetOpMemDesc(), strides, padding_l,
+ padding_r, TFPaddingToMklDnnPadding(padding_));
- auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc,
- cpu_engine);
+ auto conv_prim_desc =
+ convolution_forward::primitive_desc(conv_desc, cpu_engine);
PrepareAndExecuteNet(conv_prim_desc, &src, &filter, nullptr, &output);
}
- } catch (mkldnn::error &e) {
+ } catch (mkldnn::error& e) {
string error_msg = "Status: " + std::to_string(e.status) +
- ", message: " + std::string(e.message) +
- ", in file " + std::string(__FILE__) + ":" +
- std::to_string(__LINE__);
- OP_REQUIRES_OK(context,
- errors::Aborted("Operation received an exception:", error_msg));
+ ", message: " + std::string(e.message) + ", in file " +
+ std::string(__FILE__) + ":" + std::to_string(__LINE__);
+ OP_REQUIRES_OK(
+ context,
+ errors::Aborted("Operation received an exception:", error_msg));
}
}
@@ -638,9 +635,9 @@ class MklConv2DOp : public OpKernel {
// Prepare and execute net - checks for input and output reorders.
void PrepareAndExecuteNet(
- const convolution_forward::primitive_desc& conv_prim_desc,
- MklDnnData<T>* src, MklDnnData<T>* filter,
- MklDnnData<T>* bias, MklDnnData<T>* output) {
+ const convolution_forward::primitive_desc& conv_prim_desc,
+ MklDnnData<T>* src, MklDnnData<T>* filter, MklDnnData<T>* bias,
+ MklDnnData<T>* output) {
    // Create reorders between the user layout and the MKL layout if they are
    // needed, and add them to the net before the convolution.
std::vector<primitive> net;
@@ -651,18 +648,19 @@ class MklConv2DOp : public OpKernel {
// output side, we will prepare reorder primitive in case output
// reorder to user memory is required.
bool output_reorder_required = output->PrepareReorderToUserMemIfReq(
- conv_prim_desc.dst_primitive_desc());
+ conv_prim_desc.dst_primitive_desc());
// Create convolution primitive and add it to net.
if (bias) {
CHECK_EQ(biasEnabled, true);
net.push_back(convolution_forward(conv_prim_desc, src->GetOpMem(),
- filter->GetOpMem(), bias->GetOpMem(),
- output->GetOpMem()));
+ filter->GetOpMem(), bias->GetOpMem(),
+ output->GetOpMem()));
} else {
CHECK_EQ(biasEnabled, false);
net.push_back(convolution_forward(conv_prim_desc, src->GetOpMem(),
- filter->GetOpMem(), output->GetOpMem()));
+ filter->GetOpMem(),
+ output->GetOpMem()));
}
// Insert reorder primitive in the net for output reorder if reorder is
diff --git a/tensorflow/core/kernels/mkl_conv_ops.h b/tensorflow/core/kernels/mkl_conv_ops.h
index f0cb37f8a4..e29af19ca9 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.h
+++ b/tensorflow/core/kernels/mkl_conv_ops.h
@@ -16,8 +16,8 @@ limitations under the License.
#ifndef TENSORFLOW_CORE_KERNELS_MKL_CONV_OPS_H_
#define TENSORFLOW_CORE_KERNELS_MKL_CONV_OPS_H_
-#include <vector>
#include <limits>
+#include <vector>
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
@@ -26,8 +26,8 @@ limitations under the License.
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/tensor_slice.h"
#include "tensorflow/core/kernels/bounds_check.h"
-#include "tensorflow/core/kernels/ops_util.h"
#include "tensorflow/core/kernels/conv_grad_ops.h"
+#include "tensorflow/core/kernels/ops_util.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/lib/gtl/array_slice.h"
#include "tensorflow/core/lib/strings/numbers.h"
@@ -49,15 +49,15 @@ namespace tensorflow {
class MklDnnConvUtil {
protected:
- OpKernelContext* context_; // We don't own this.
+ OpKernelContext *context_; // We don't own this.
std::vector<int32> strides_;
Padding padding_;
TensorFormat data_format_;
public:
- MklDnnConvUtil(OpKernelContext* context, const std::vector<int32>& strides,
- Padding pad, TensorFormat fm) : context_(context),
- strides_(strides), padding_(pad), data_format_(fm) {}
+ MklDnnConvUtil(OpKernelContext *context, const std::vector<int32> &strides,
+ Padding pad, TensorFormat fm)
+ : context_(context), strides_(strides), padding_(pad), data_format_(fm) {}
virtual ~MklDnnConvUtil() { context_ = nullptr; }
@@ -75,14 +75,14 @@ class MklDnnConvUtil {
  // requires input in NCHW format. The function does not return anything;
  // errors arising from sanity checks are reported through the context's
  // status.
- virtual inline void
- GetInputSizeInMklOrder(const TensorShape& input_shape,
- memory::dims *input_dims) {
- #define CHECK_BOUNDS(val, err_msg) do { \
- OP_REQUIRES(context_, FastBoundsCheck(val, \
- std::numeric_limits<int>::max()), \
- errors::InvalidArgument(err_msg)); \
- }while(0)
+ virtual inline void GetInputSizeInMklOrder(const TensorShape &input_shape,
+ memory::dims *input_dims) {
+#define CHECK_BOUNDS(val, err_msg) \
+ do { \
+ OP_REQUIRES(context_, \
+ FastBoundsCheck(val, std::numeric_limits<int>::max()), \
+ errors::InvalidArgument(err_msg)); \
+ } while (0)
CHECK_NOTNULL(input_dims);
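The reformatted CHECK_BOUNDS macro keeps the do { ... } while (0) wrapper, which makes a multi-statement macro behave as a single statement. A standalone sketch of the failure it prevents (the macro and names here are illustrative, not from the patch):

    #include <cstdio>
    #include <limits>

    #define BOUNDS_OK(val)                                                  \
      do {                                                                  \
        if ((val) > std::numeric_limits<int>::max()) printf("too large\n"); \
      } while (0)

    int main() {
      long long batch = 1;
      if (batch > 0)
        BOUNDS_OK(batch);   // expands to one statement, so the else binds
      else
        printf("empty\n");  // without do/while(0) this else would not parse
      return 0;
    }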
@@ -105,7 +105,7 @@ class MklDnnConvUtil {
CHECK_BOUNDS(input_batch_raw, "Input batch too large");
int input_batch = static_cast<int>(input_batch_raw);
- #undef CHECK_BOUNDS
+#undef CHECK_BOUNDS
// MKL-DNN always requires input in NCHW format.
*input_dims = {input_batch, input_depth, input_rows, input_cols};
@@ -125,10 +125,9 @@ class MklDnnConvUtil {
// forward gets actual tensor as input).
//
// TODO(nhasabni): Add similar function for input and filter in MklShape.
- virtual inline void
- GetFilterSizeInMklOrder(const TensorShape& input_shape,
- const TensorShape& filter_shape,
- memory::dims *filter_dims) {
+ virtual inline void GetFilterSizeInMklOrder(const TensorShape &input_shape,
+ const TensorShape &filter_shape,
+ memory::dims *filter_dims) {
CHECK_NOTNULL(filter_dims);
OP_REQUIRES(context_, filter_shape.dims() == 4,
@@ -136,17 +135,18 @@ class MklDnnConvUtil {
filter_shape.DebugString()));
for (int i = 0; i < 3; i++) {
- OP_REQUIRES(context_, FastBoundsCheck(filter_shape.dim_size(i),
- std::numeric_limits<int>::max()),
- errors::InvalidArgument("filter too large"));
+ OP_REQUIRES(context_,
+ FastBoundsCheck(filter_shape.dim_size(i),
+ std::numeric_limits<int>::max()),
+ errors::InvalidArgument("filter too large"));
}
int input_depth = GetTensorDim(input_shape, data_format_, 'C');
- OP_REQUIRES(
- context_, input_depth == filter_shape.dim_size(2),
- errors::InvalidArgument("input and filter must have the same depth: ",
- input_depth, " vs ", filter_shape.dim_size(2)));
+ OP_REQUIRES(context_, input_depth == filter_shape.dim_size(2),
+ errors::InvalidArgument(
+ "input and filter must have the same depth: ", input_depth,
+ " vs ", filter_shape.dim_size(2)));
// TF filter is always in (rows, cols, in_depth, out_depth) order.
int filter_rows = static_cast<int>(filter_shape.dim_size(0));
@@ -163,25 +163,25 @@ class MklDnnConvUtil {
  // requires filter in OIHW format. The function does not return anything;
  // errors arising from sanity checks are reported through the context's
  // status.
- virtual inline void
- GetFilterSizeInMklOrder(size_t src_index, size_t filter_index,
- memory::dims *filter_dims) {
+ virtual inline void GetFilterSizeInMklOrder(size_t src_index,
+ size_t filter_index,
+ memory::dims *filter_dims) {
CHECK_NOTNULL(filter_dims);
- const Tensor& input = MklGetInput(context_, src_index);
- const Tensor& filter = MklGetInput(context_, filter_index);
+ const Tensor &input = MklGetInput(context_, src_index);
+ const Tensor &filter = MklGetInput(context_, filter_index);
GetFilterSizeInMklOrder(input.shape(), filter.shape(), filter_dims);
}
// Calculate Bias size for 2D Convolution. Function does not return
// anything, but sets error in context status.
- virtual inline void
- GetBiasSizeInMklOrder(size_t bias_index, memory::dims *bias_dims) {
- const Tensor& bias = MklGetInput(context_, bias_index);
+ virtual inline void GetBiasSizeInMklOrder(size_t bias_index,
+ memory::dims *bias_dims) {
+ const Tensor &bias = MklGetInput(context_, bias_index);
OP_REQUIRES(context_, bias.dims() == 1,
errors::InvalidArgument("bias must be 1-dimensional: ",
bias.shape().DebugString()));
- *bias_dims = { static_cast<int>(bias.dim_size(0)) };
+ *bias_dims = {static_cast<int>(bias.dim_size(0))};
}
// Function to calculate output and padding size for 2D convolution.
@@ -193,13 +193,11 @@ class MklDnnConvUtil {
// status is returned via context status.
//
// TODO(nhasabni): Add similar function for input and filter in MklShape.
- virtual inline void
- GetOutputAndPadSizeInMklOrder(const TensorShape& input_shape,
- const TensorShape& filter_shape,
- const memory::dims& strides,
- memory::dims *output_dims_tf_order,
- memory::dims *output_dims_mkl_order,
- memory::dims *pad_l, memory::dims *pad_r) {
+ virtual inline void GetOutputAndPadSizeInMklOrder(
+ const TensorShape &input_shape, const TensorShape &filter_shape,
+ const memory::dims &strides, memory::dims *output_dims_tf_order,
+ memory::dims *output_dims_mkl_order, memory::dims *pad_l,
+ memory::dims *pad_r) {
CHECK_NOTNULL(output_dims_tf_order);
CHECK_NOTNULL(output_dims_mkl_order);
CHECK_NOTNULL(pad_l);
@@ -225,21 +223,21 @@ class MklDnnConvUtil {
int64 out_rows = 0, out_cols = 0;
int64 pad_top = 0, pad_bottom = 0, pad_left, pad_right;
- OP_REQUIRES_OK(context_,
- GetWindowedOutputSizeVerbose(input_rows, filter_rows, stride_rows,
- padding_, &out_rows, &pad_top, &pad_bottom));
- OP_REQUIRES_OK(context_,
- GetWindowedOutputSizeVerbose(input_cols, filter_cols, stride_cols,
- padding_, &out_cols, &pad_left, &pad_right));
+ OP_REQUIRES_OK(context_, GetWindowedOutputSizeVerbose(
+ input_rows, filter_rows, stride_rows, padding_,
+ &out_rows, &pad_top, &pad_bottom));
+ OP_REQUIRES_OK(context_, GetWindowedOutputSizeVerbose(
+ input_cols, filter_cols, stride_cols, padding_,
+ &out_cols, &pad_left, &pad_right));
// Tensorflow output is in data_format order. (NHWC or NCHW)
- TensorShape out_shape = ShapeFromFormat(data_format_, out_batch,
- out_rows, out_cols, out_depth);
+ TensorShape out_shape =
+ ShapeFromFormat(data_format_, out_batch, out_rows, out_cols, out_depth);
*output_dims_tf_order = TFShapeToMklDnnDims(out_shape);
// MKL-DNN always needs output in NCHW format.
*output_dims_mkl_order = {out_batch, out_depth, static_cast<int>(out_rows),
- static_cast<int>(out_cols)};
+ static_cast<int>(out_cols)};
    // Now handle padding. MKL-DNN uses asymmetric padding.
*pad_l = {static_cast<int>(pad_top), static_cast<int>(pad_left)};
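GetWindowedOutputSizeVerbose above returns separate before/after padding because SAME padding splits the total with the extra unit going to the bottom/right; that is why pad_l and pad_r can differ. A sketch of the standard computation:

    #include <algorithm>
    #include <cstdio>

    void SamePadding(int in, int filter, int stride, int* out,
                     int* pad_before, int* pad_after) {
      *out = (in + stride - 1) / stride;  // ceil(in / stride)
      int pad_total = std::max((*out - 1) * stride + filter - in, 0);
      *pad_before = pad_total / 2;        // smaller half on top/left
      *pad_after = pad_total - *pad_before;
    }

    int main() {
      int out, before, after;
      SamePadding(/*in=*/6, /*filter=*/3, /*stride=*/2, &out, &before, &after);
      printf("out=%d pad=(%d,%d)\n", out, before, after);  // out=3 pad=(0,1)
      return 0;
    }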
@@ -250,27 +248,25 @@ class MklDnnConvUtil {
// See comment on GetConvOutputAndPadSizeInMklOrder for parameters.
//
// Function does not return anything, but sets error in context status.
- inline void
- GetOutputAndPadSizeInMklOrder(size_t src_index, size_t filter_index,
- const memory::dims& strides,
- memory::dims *output_dims_tf_order,
- memory::dims *output_dims_mkl_order,
- memory::dims *pad_l, memory::dims *pad_r) {
+ inline void GetOutputAndPadSizeInMklOrder(
+ size_t src_index, size_t filter_index, const memory::dims &strides,
+ memory::dims *output_dims_tf_order, memory::dims *output_dims_mkl_order,
+ memory::dims *pad_l, memory::dims *pad_r) {
CHECK_NOTNULL(output_dims_tf_order);
CHECK_NOTNULL(output_dims_mkl_order);
CHECK_NOTNULL(pad_l);
CHECK_NOTNULL(pad_r);
- const Tensor& input = MklGetInput(context_, src_index);
- const Tensor& filter = MklGetInput(context_, filter_index);
+ const Tensor &input = MklGetInput(context_, src_index);
+ const Tensor &filter = MklGetInput(context_, filter_index);
OP_REQUIRES(context_, input.dims() == 4,
errors::InvalidArgument("input must be 4-dimensional",
- input.shape().DebugString()));
+ input.shape().DebugString()));
- GetOutputAndPadSizeInMklOrder(input.shape(), filter.shape(),
- strides, output_dims_tf_order,
- output_dims_mkl_order, pad_l, pad_r);
+ GetOutputAndPadSizeInMklOrder(input.shape(), filter.shape(), strides,
+ output_dims_tf_order, output_dims_mkl_order,
+ pad_l, pad_r);
}
// Wrapper function to calculate input, filter, and output sizes of
@@ -279,15 +275,12 @@ class MklDnnConvUtil {
// also calculates strides and paddings for 2D Convolution.
//
// Function does not return anything, but sets error in context status.
- inline void GetConvFwdSizesInMklOrder(const TensorShape& input_shape,
- const TensorShape& filter_shape,
- memory::dims *input_dims,
- memory::dims *filter_dims,
- memory::dims *strides,
- memory::dims *output_dims_tf_order,
- memory::dims *output_dims_mkl_order,
- memory::dims *pad_l,
- memory::dims *pad_r) {
+ inline void GetConvFwdSizesInMklOrder(
+ const TensorShape &input_shape, const TensorShape &filter_shape,
+ memory::dims *input_dims, memory::dims *filter_dims,
+ memory::dims *strides, memory::dims *output_dims_tf_order,
+ memory::dims *output_dims_mkl_order, memory::dims *pad_l,
+ memory::dims *pad_r) {
CHECK_NOTNULL(input_dims);
CHECK_NOTNULL(filter_dims);
CHECK_NOTNULL(strides);
@@ -302,8 +295,7 @@ class MklDnnConvUtil {
if (!context_->status().ok()) return;
GetStridesInMklOrder(strides);
GetOutputAndPadSizeInMklOrder(input_shape, filter_shape, *strides,
- output_dims_tf_order,
- output_dims_mkl_order,
+ output_dims_tf_order, output_dims_mkl_order,
pad_l, pad_r);
if (!context_->status().ok()) return;
}
diff --git a/tensorflow/core/kernels/mkl_tfconv_op.h b/tensorflow/core/kernels/mkl_tfconv_op.h
index 0a5be4fec9..a240ee44fb 100644
--- a/tensorflow/core/kernels/mkl_tfconv_op.h
+++ b/tensorflow/core/kernels/mkl_tfconv_op.h
@@ -13,11 +13,11 @@ See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
+#ifdef INTEL_MKL
+
#ifndef TENSORFLOW_CORE_KERNELS_MKL_TFCONV_OP_H_
#define TENSORFLOW_CORE_KERNELS_MKL_TFCONV_OP_H_
-#ifdef INTEL_MKL
-
#include <algorithm>
#include <vector>
#include "tensorflow/core/framework/numeric_op.h"
@@ -35,10 +35,6 @@ limitations under the License.
#include "mkl_dnn_types.h"
#include "tensorflow/core/util/mkl_util.h"
-#ifdef INTEL_MKL_DNN
-using mkldnn::stream;
-#endif
-
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
@@ -61,71 +57,6 @@ class MklToTfOp : public OpKernel {
    VLOG(1) << "MKLToTFConversion completed successfully.";
}
-#ifdef INTEL_MKL_DNN
- static void ConvertMklToTf(OpKernel* op_kernel, OpKernelContext* context,
- string data_format_str, DataType op_data_type,
- bool has_avx512f, uint input_number) {
- try {
- // Check that input tensor is in MKL format.
- const Tensor& input_tensor = MklGetInput(context, input_number);
- MklDnnShape input_shape;
- GetMklShape(context, input_number, &input_shape);
-
- // if input is already in Tf format, then copy input tensor to output.
- if (!input_shape.IsMklTensor()) {
- context->set_output(input_number, input_tensor);
- VLOG(1) << "MKLToTFConversion: No conversion needed, "
- << "copying input to output";
- return;
- }
-
- // Check that input data type is same as operator data type and that it
- // is same as output data type.
- DataType input_data_type = op_kernel->input_type(input_number);
- DataType output_data_type = op_kernel->output_type(input_number);
- CHECK_EQ(op_data_type, input_data_type);
- CHECK_EQ(op_data_type, output_data_type);
-
- auto cpu_engine = engine(engine::cpu, 0);
- MklDnnData<T> input(&cpu_engine);
-
- // Get Mkl layout of input tensor.
- auto input_mkl_md = input_shape.GetMklLayout();
- // Get TensorFlow layout of input tensor. Expected output of conversion
- // has same layout as Tensorflow layout of input tensor.
- auto output_tf_md = input_shape.GetTfLayout();
- auto output_tf_pd = memory::primitive_desc(output_tf_md, cpu_engine);
- // Set input Mkl layout as the user layout.
- input.SetUsrMem(input_mkl_md, &input_tensor);
-
- // Allocate output tensor.
- TensorShape output_shape = input_shape.GetTfShape();
- Tensor* output_tensor = NULL;
- OP_REQUIRES_OK(context, context->allocate_output(input_number,
- output_shape, &output_tensor));
- CHECK_NOTNULL(output_tensor);
-
- // Do we need to reorder Mkl layout into TensorFlow layout?
- if (input.IsReorderNeeded(output_tf_pd)) {
- // Insert reorder between Mkl layout and TensorFlow layout.
- std::vector<primitive> net;
- CHECK_EQ(input.CheckReorderToOpMem(output_tf_pd, output_tensor, &net),
- true);
- stream(stream::kind::eager).submit(net).wait();
- } else {
- // If not, just forward input tensor to output tensor.
- CHECK(output_tensor->CopyFrom(input_tensor, output_shape));
- }
- } catch (mkldnn::error &e) {
- string error_msg = "Status: " + std::to_string(e.status) +
- ", message: " + std::string(e.message) +
- ", in file " + std::string(__FILE__) + ":" +
- std::to_string(__LINE__);
- OP_REQUIRES_OK(context,
- errors::Aborted("Operation received an exception:", error_msg));
- }
- }
-#else
static void ConvertMklToTf(OpKernel* op_kernel, OpKernelContext* context,
string data_format_str, DataType op_data_type,
bool has_avx512f, uint input_number) {
@@ -160,8 +91,8 @@ class MklToTfOp : public OpKernel {
// Allocate output tensor.
Tensor* output_tensor = NULL;
- OP_REQUIRES_OK(context, context->allocate_output(input_number,
- output_shape, &output_tensor));
+ OP_REQUIRES_OK(context,
+ context->allocate_output(input_number, output_shape, &output_tensor));
dnnLayout_t output_layout =
static_cast<dnnLayout_t>(input_shape.GetTfLayout());
@@ -175,7 +106,6 @@ class MklToTfOp : public OpKernel {
output_buffer);
    VLOG(1) << "MKLToTFConversion completed successfully.";
}
-#endif
private:
/// Data format of the operation
@@ -202,5 +132,5 @@ class MklToTfOp : public OpKernel {
TF_CALL_NUMBER_TYPES(REGISTER_CPU);
#undef REGISTER_CPU
} // namespace tensorflow
-#endif // INTEL_MKL
#endif // TENSORFLOW_CORE_KERNELS_MKL_TFCONV_OP_H_
+#endif // INTEL_MKL
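Moving #ifdef INTEL_MKL outside the include guard, as above, keeps the guard macro undefined in builds without MKL, so the header contributes nothing at all. The pattern in miniature (header and macro names are hypothetical):

    // my_feature_op.h -- hypothetical header
    #ifdef MY_FEATURE             // feature gate first ...
    #ifndef MY_FEATURE_OP_H_      // ... include guard second
    #define MY_FEATURE_OP_H_

    inline int FeatureOnlyHelper() { return 42; }

    #endif  // MY_FEATURE_OP_H_
    #endif  // MY_FEATURE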
diff --git a/tensorflow/core/kernels/pooling_ops_common.cc b/tensorflow/core/kernels/pooling_ops_common.cc
index ac90f67ce0..7dee751c4f 100644
--- a/tensorflow/core/kernels/pooling_ops_common.cc
+++ b/tensorflow/core/kernels/pooling_ops_common.cc
@@ -143,7 +143,7 @@ void DnnPoolingOp<T>::Compute(
perftools::gputools::dnn::PoolingMode pooling_mode,
const std::vector<int32>& size, const std::vector<int32>& stride,
Padding padding, TensorFormat data_format, const Tensor& tensor_in,
- const TensorShape& tensor_out_shape, bool propagate_nans) {
+ const TensorShape& tensor_out_shape) {
Tensor* tensor_out = nullptr;
OP_REQUIRES_OK(context,
context->allocate_output(0, tensor_out_shape, &tensor_out));
@@ -188,8 +188,7 @@ void DnnPoolingOp<T>::Compute(
.set_vertical_stride(params.row_stride)
.set_horizontal_stride(params.col_stride)
.set_vertical_padding(params.pad_rows)
- .set_horizontal_padding(params.pad_cols)
- .set_propagate_nans(propagate_nans);
+ .set_horizontal_padding(params.pad_cols);
perftools::gputools::dnn::BatchDescriptor input_desc;
input_desc.set_count(params.tensor_in_batch)
@@ -238,7 +237,7 @@ void DnnPoolingGradOp<T>::Compute(
const std::vector<int32>& size, const std::vector<int32>& stride,
Padding padding, TensorFormat data_format, const Tensor* tensor_in,
const Tensor* tensor_out, const Tensor& out_backprop,
- const TensorShape& tensor_in_shape, bool propagate_nans) {
+ const TensorShape& tensor_in_shape) {
CHECK((pooling_mode != perftools::gputools::dnn::PoolingMode::kMaximum) ||
(tensor_in && tensor_out))
<< "For MaxPoolGrad, both tensor_in and tensor_out needs to be "
@@ -328,8 +327,7 @@ void DnnPoolingGradOp<T>::Compute(
.set_vertical_stride(params.row_stride)
.set_horizontal_stride(params.col_stride)
.set_vertical_padding(params.pad_rows)
- .set_horizontal_padding(params.pad_cols)
- .set_propagate_nans(propagate_nans);
+ .set_horizontal_padding(params.pad_cols);
perftools::gputools::dnn::BatchDescriptor orig_output_desc;
orig_output_desc.set_count(params.tensor_in_batch)
diff --git a/tensorflow/core/kernels/pooling_ops_common_gpu.h b/tensorflow/core/kernels/pooling_ops_common_gpu.h
index 1458456585..b594f39fad 100644
--- a/tensorflow/core/kernels/pooling_ops_common_gpu.h
+++ b/tensorflow/core/kernels/pooling_ops_common_gpu.h
@@ -44,7 +44,7 @@ class DnnPoolingOp {
const std::vector<int32>& size,
const std::vector<int32>& stride, Padding padding,
TensorFormat data_format, const Tensor& tensor_in,
- const TensorShape& tensor_out_shape, bool propagate_nans);
+ const TensorShape& tensor_out_shape);
};
// A helper class that launch the cudnn pooling backward operations.
@@ -60,7 +60,7 @@ class DnnPoolingGradOp {
const std::vector<int32>& stride, Padding padding,
TensorFormat data_format, const Tensor* tensor_in,
const Tensor* tensor_out, const Tensor& out_backprop,
- const TensorShape& tensor_in_shape, bool propagate_nans);
+ const TensorShape& tensor_in_shape);
};
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/quantized_add_op.cc b/tensorflow/core/kernels/quantized_add_op.cc
index 337c8e5c17..8be0c56798 100644
--- a/tensorflow/core/kernels/quantized_add_op.cc
+++ b/tensorflow/core/kernels/quantized_add_op.cc
@@ -489,7 +489,7 @@ class QuantizedAddOp : public OpKernel {
// adding zero leaves the result unchanged, and to contain the largest of
// the two input values with some room to spare.
const float smallest_min = std::min(min_x, min_y);
    const float largest_max = std::max(max_x, max_y);
const float biggest_range =
std::max(std::abs(smallest_min), std::abs(largest_max));
const float output_range = (biggest_range * (1 << 14));
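The range arithmetic above, in isolation: the output interval is made symmetric (so adding zero is exact) and widened by the 1 << 14 headroom factor taken from the code; publishing it as [-output_range, output_range] is an assumption here, consistent with the comment:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    void AddOutputRange(float min_x, float max_x, float min_y, float max_y,
                        float* output_min, float* output_max) {
      const float smallest_min = std::min(min_x, min_y);
      const float largest_max = std::max(max_x, max_y);
      const float biggest_range =  // widest endpoint by magnitude
          std::max(std::fabs(smallest_min), std::fabs(largest_max));
      const float output_range = biggest_range * (1 << 14);
      *output_min = -output_range;  // assumed symmetric publication
      *output_max = output_range;
    }

    int main() {
      float lo, hi;
      AddOutputRange(-1.0f, 2.0f, -3.0f, 1.5f, &lo, &hi);
      printf("[%g, %g]\n", lo, hi);  // [-49152, 49152]
      return 0;
    }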
diff --git a/tensorflow/core/kernels/random_op.cc b/tensorflow/core/kernels/random_op.cc
index 55a8b9c9b6..a37c757865 100644
--- a/tensorflow/core/kernels/random_op.cc
+++ b/tensorflow/core/kernels/random_op.cc
@@ -577,7 +577,7 @@ struct FillPhiloxRandomKernel<Distribution, false> {
const size_t kGroupSize = Distribution::kResultElementCount;
const size_t item_id = item.get_global(0);
- const size_t total_item_count = item.get_global_range();
+ const size_t total_item_count = item.get_global_range(0);
size_t offset = item_id * kGroupSize;
gen_.Skip(item_id);
@@ -633,7 +633,7 @@ struct FillPhiloxRandomKernel<Distribution, true> {
PhiloxRandom::kResultElementCount;
const size_t item_id = item.get_global(0);
- const size_t total_item_count = item.get_global_range();
+ const size_t total_item_count = item.get_global_range(0);
size_t group_index = item_id;
size_t offset = group_index * kGroupSize;
diff --git a/tensorflow/core/kernels/segment_reduction_ops.cc b/tensorflow/core/kernels/segment_reduction_ops.cc
index 2334e50f1d..4302a68a18 100644
--- a/tensorflow/core/kernels/segment_reduction_ops.cc
+++ b/tensorflow/core/kernels/segment_reduction_ops.cc
@@ -376,9 +376,6 @@ struct UnsortedSegmentSumFunctor<CPUDevice, T, Index>
auto data_flat = typename TTypes<T, 2>::ConstTensor(data, N, data_size / N);
for (int64 i = 0; i < N; ++i) {
Index j = internal::SubtleMustCopy(segment_ids(i));
- if (j < 0) {
- continue;
- }
OP_REQUIRES(ctx, FastBoundsCheck(j, output_rows),
errors::InvalidArgument(
"segment_ids", SliceDebugString(segment_ids_shape, i),
diff --git a/tensorflow/core/kernels/segment_reduction_ops.h b/tensorflow/core/kernels/segment_reduction_ops.h
index b10bea72ba..412c1d601d 100644
--- a/tensorflow/core/kernels/segment_reduction_ops.h
+++ b/tensorflow/core/kernels/segment_reduction_ops.h
@@ -30,14 +30,14 @@ namespace functor {
#ifdef GOOGLE_CUDA
typedef Eigen::GpuDevice GPUDevice;
// Functor for SegmentSumGPUOp.
-// output_rows: the number of output segments (unique segment ids in
+// 'output_rows': the number of output segments (unique segment ids in
// 'segment_ids').
-// segment_ids_shape: shape of 'segment_ids' tensor.
-// segment_ids: unsorted map from input to output segment ids at which to
+// 'segment_ids_shape': shape of 'segment_ids' tensor.
+// 'segment_ids': unsorted map from input to output segment ids at which to
// perform segment sum operation.
-// data_size: size of input data tensor.
-// data: input data tensor.
-// output: output reshaped to {output_rows, output.size/output_rows}
+// 'data_size': size of input data tensor.
+// 'data': input data tensor.
+// 'output': output reshaped to {output_rows, output.size/output_rows}
template <typename T, typename Index>
struct SegmentSumFunctor {
void operator()(OpKernelContext* ctx, const GPUDevice& d,
@@ -61,14 +61,14 @@ struct UnsortedSegmentBaseFunctor{
};
// Functor for UnsortedSegmentSumOp.
-// output_rows: the number of output segments (unique segment ids in
+// 'output_rows': the number of output segments (unique segment ids in
// 'segment_ids').
-// segment_ids_shape: shape of 'segment_ids' tensor.
-// segment_ids: unsorted map from input to output segment ids at which to
+// 'segment_ids_shape': shape of 'segment_ids' tensor.
+// 'segment_ids': unsorted map from input to output segment ids at which to
// perform segment sum operation.
-// data_size: size of input data tensor.
-// data: input data tensor.
-// output: output reshaped to {output_rows, output.size/output_rows}
+// 'data_size': size of input data tensor.
+// 'data': input data tensor.
+// 'output': output reshaped to {output_rows, output.size/output_rows}
template <typename Device, typename T, typename Index>
struct UnsortedSegmentSumFunctor: public UnsortedSegmentBaseFunctor<Device, T, Index> {
void operator()(OpKernelContext* ctx, const Device& d,
@@ -79,14 +79,14 @@ struct UnsortedSegmentSumFunctor: public UnsortedSegmentBaseFunctor<Device, T, I
};
// Functor for UnsortedSegmentMaxOp.
-// output_rows: the number of output segments (unique segment ids in
+// 'output_rows': the number of output segments (unique segment ids in
// 'segment_ids').
-// segment_ids_shape: shape of 'segment_ids' tensor.
-// segment_ids: unsorted map from input to output segment ids at which to
+// 'segment_ids_shape': shape of 'segment_ids' tensor.
+// 'segment_ids': unsorted map from input to output segment ids at which to
// perform segment sum operation.
-// data_size: size of input data tensor.
-// data: input data tensor.
-// output: output reshaped to {output_rows, output.size/output_rows}
+// 'data_size': size of input data tensor.
+// 'data': input data tensor.
+// 'output': output reshaped to {output_rows, output.size/output_rows}
template <typename Device, typename T, typename Index>
struct UnsortedSegmentMaxFunctor: public UnsortedSegmentBaseFunctor<Device, T, Index> {
void operator()(OpKernelContext* ctx, const Device& d,
diff --git a/tensorflow/core/kernels/shape_ops.cc b/tensorflow/core/kernels/shape_ops.cc
index 28a39bae3f..721f9b949b 100644
--- a/tensorflow/core/kernels/shape_ops.cc
+++ b/tensorflow/core/kernels/shape_ops.cc
@@ -341,12 +341,7 @@ REGISTER_KERNEL_BUILDER(Name("ExpandDims")
.Device(DEVICE_CPU)
.HostMemory("dim")
.TypeConstraint<int32>("Tdim"),
- ExpandDimsOp<int32>);
-REGISTER_KERNEL_BUILDER(Name("ExpandDims")
- .Device(DEVICE_CPU)
- .HostMemory("dim")
- .TypeConstraint<int64>("Tdim"),
- ExpandDimsOp<int64>);
+ ExpandDimsOp);
#if GOOGLE_CUDA
#define REGISTER_GPU_KERNEL(type) \
@@ -355,13 +350,7 @@ REGISTER_KERNEL_BUILDER(Name("ExpandDims")
.TypeConstraint<type>("T") \
.TypeConstraint<int32>("Tdim") \
.HostMemory("dim"), \
- ExpandDimsOp<int32>); \
- REGISTER_KERNEL_BUILDER(Name("ExpandDims") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int64>("Tdim") \
- .HostMemory("dim"), \
- ExpandDimsOp<int64>);
+ ExpandDimsOp);
TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER_GPU_KERNEL);
TF_CALL_bool(REGISTER_GPU_KERNEL);
#undef REGISTER_GPU_KERNEL
@@ -373,15 +362,7 @@ REGISTER_KERNEL_BUILDER(Name("ExpandDims")
.HostMemory("input")
.HostMemory("dim")
.HostMemory("output"),
- ExpandDimsOp<int32>);
-REGISTER_KERNEL_BUILDER(Name("ExpandDims")
- .Device(DEVICE_GPU)
- .TypeConstraint<int32>("T")
- .TypeConstraint<int64>("Tdim")
- .HostMemory("input")
- .HostMemory("dim")
- .HostMemory("output"),
- ExpandDimsOp<int64>);
+ ExpandDimsOp);
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
@@ -391,13 +372,7 @@ REGISTER_KERNEL_BUILDER(Name("ExpandDims")
.TypeConstraint<type>("T") \
.TypeConstraint<int32>("Tdim") \
.HostMemory("dim"), \
- ExpandDimsOp<int32>); \
- REGISTER_KERNEL_BUILDER(Name("ExpandDims") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int64>("Tdim") \
- .HostMemory("dim"), \
- ExpandDimsOp<int64>);
+ ExpandDimsOp);
TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER_SYCL_KERNEL);
TF_CALL_bool(REGISTER_SYCL_KERNEL);
#undef REGISTER_SYCL_KERNEL
@@ -409,15 +384,7 @@ REGISTER_KERNEL_BUILDER(Name("ExpandDims")
.HostMemory("input")
.HostMemory("dim")
.HostMemory("output"),
- ExpandDimsOp<int32>);
-REGISTER_KERNEL_BUILDER(Name("ExpandDims")
- .Device(DEVICE_SYCL)
- .TypeConstraint<int32>("T")
- .TypeConstraint<int64>("Tdim")
- .HostMemory("input")
- .HostMemory("dim")
- .HostMemory("output"),
- ExpandDimsOp<int64>);
+ ExpandDimsOp);
#endif // TENSORFLOW_USE_SYCL
// Squeeze ---------------------------------------
diff --git a/tensorflow/core/kernels/shape_ops.h b/tensorflow/core/kernels/shape_ops.h
index 8d9d0ea846..ac607f4e8b 100644
--- a/tensorflow/core/kernels/shape_ops.h
+++ b/tensorflow/core/kernels/shape_ops.h
@@ -145,7 +145,6 @@ class SizeOp : public OpKernel {
bool IsExpensive() override { return false; }
};
-template <typename Tdim>
class ExpandDimsOp : public OpKernel {
public:
explicit ExpandDimsOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
@@ -154,7 +153,7 @@ class ExpandDimsOp : public OpKernel {
OP_REQUIRES(ctx, ctx->input(0).dtype() != DT_VARIANT,
errors::InvalidArgument("ExpandDims on Variant not supported"));
- Tdim dim = ctx->input(1).flat<Tdim>()(0);
+ int32 dim = ctx->input(1).flat<int32>()(0);
OP_REQUIRES(
ctx, (dim >= -1 - ctx->input(0).dims() && dim <= ctx->input(0).dims()),
errors::InvalidArgument("Tried to expand dim index ", dim,
@@ -176,7 +175,7 @@ class ExpandDimsOp : public OpKernel {
}
// Clamp to the end if needed.
- dim = std::min<Tdim>(dim, existing_dims_size);
+ dim = std::min<int32>(dim, existing_dims_size);
new_shape.emplace(new_shape.begin() + dim, 1);
const TensorShape output_shape(new_shape);
@@ -235,10 +234,10 @@ class SqueezeOp : public OpKernel {
if (!wrapped_squeeze_dims.empty()) {
if (wrapped_squeeze_dims.count(i) > 0) {
OP_REQUIRES(ctx, existing_dim == 1,
- errors::InvalidArgument("Tried to explicitly squeeze "
- "dimension ",
- i, " but dimension was not 1: ",
- existing_dim));
+ errors::InvalidArgument(
+ "Tried to explicitly squeeze "
+ "dimension ",
+ i, " but dimension was not 1: ", existing_dim));
} else {
// This dimension is not being squeezed.
new_shape.push_back(existing_dim);
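
The reverted `ExpandDimsOp` above reads the insert position as a scalar `int32`, accepts negative positions counted from the end, and clamps to the number of existing dimensions. A hypothetical helper illustrating just that shape computation (the names are illustrative, not TensorFlow's):

```
#include <vector>

// A negative dim counts from the end (dim += rank + 1), and the insert
// position is clamped to the end, matching the kernel logic above.
std::vector<long long> ExpandDimsShape(std::vector<long long> shape, int dim) {
  const int existing = static_cast<int>(shape.size());
  if (dim < 0) dim += existing + 1;    // e.g. dim == -1 appends a trailing 1
  if (dim > existing) dim = existing;  // clamp to the end if needed
  shape.insert(shape.begin() + dim, 1);
  return shape;
}
// ExpandDimsShape({2, 3}, 0)  -> {1, 2, 3}
// ExpandDimsShape({2, 3}, -1) -> {2, 3, 1}
```
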
diff --git a/tensorflow/core/kernels/slice_op.cc b/tensorflow/core/kernels/slice_op.cc
index 28a379774b..d46701749b 100644
--- a/tensorflow/core/kernels/slice_op.cc
+++ b/tensorflow/core/kernels/slice_op.cc
@@ -190,25 +190,41 @@ class SliceOp : public OpKernel {
}
return;
}
-#define HANDLE_DIM(NDIM) \
- if (input_dims == NDIM) { \
- functor::Slice<Device, T, NDIM>()( \
- context->eigen_device<Device>(), result, input, begin, size); \
- return; \
+#define HANDLE_DIM(NDIM) \
+ if (input_dims == NDIM) { \
+ HandleCase<NDIM>(context, begin, size, result); \
+ return; \
}
+
HANDLE_DIM(1);
HANDLE_DIM(2);
HANDLE_DIM(3);
HANDLE_DIM(4);
HANDLE_DIM(5);
HANDLE_DIM(6);
+ HANDLE_DIM(7);
#undef HANDLE_DIM
- // handle cases which dim >= 7
- functor::Slice<Device, T, 7>()(
- context->eigen_device<Device>(), result, input, begin, size);
+ OP_REQUIRES(context, false, errors::Unimplemented(
+ "SliceOp : Unhandled input dimensions"));
+ }
+ }
+
+ private:
+ template <int NDIM>
+ void HandleCase(OpKernelContext* context, const gtl::ArraySlice<int64>& begin,
+ const gtl::ArraySlice<int64>& size, Tensor* result) {
+ Eigen::DSizes<Eigen::DenseIndex, NDIM> indices;
+ Eigen::DSizes<Eigen::DenseIndex, NDIM> sizes;
+ for (int i = 0; i < NDIM; ++i) {
+ indices[i] = begin[i];
+ sizes[i] = size[i];
}
+
+ functor::Slice<Device, T, NDIM>()(
+ context->eigen_device<Device>(), result->tensor<T, NDIM>(),
+ context->input(0).tensor<T, NDIM>(), indices, sizes);
}
};
@@ -248,16 +264,11 @@ class MklSliceOp : public OpKernel {
}
return;
}
- // Special case for handling 4-D tensor slice.
- if (input_dims == 4) {
- HandleCase4D(context, begin, size, result);
- } else {
-#define HANDLE_DIM(NDIM) \
- if (input_dims == NDIM) { \
- functor::Slice<Device, T, NDIM>()( \
- context->eigen_device<Device>(), result, input, begin, size); \
- return; \
- }
+#define HANDLE_DIM(NDIM) \
+ if (input_dims == NDIM) { \
+ HandleCase<NDIM>(context, begin, size, result); \
+ return; \
+ }
HANDLE_DIM(1);
HANDLE_DIM(2);
@@ -265,13 +276,12 @@ class MklSliceOp : public OpKernel {
HANDLE_DIM(4);
HANDLE_DIM(5);
HANDLE_DIM(6);
+ HANDLE_DIM(7);
#undef HANDLE_DIM
- // handle cases which dim >= 7
- functor::Slice<Device, T, 7>()(
- context->eigen_device<Device>(), result, input, begin, size);
- }
+ OP_REQUIRES(context, false, errors::Unimplemented(
+ "SliceOp : Unhandled input dimensions"));
}
}
@@ -318,7 +328,8 @@ class MklSliceOp : public OpKernel {
return false;
}
- void HandleCase4D(OpKernelContext* context,
+ template <int NDIM>
+ void HandleCase(OpKernelContext* context,
const gtl::ArraySlice<int64>& begin,
const gtl::ArraySlice<int64>& size, Tensor* result) {
int slice_dim = -1;
@@ -327,7 +338,8 @@ class MklSliceOp : public OpKernel {
// differs from the input tensor in only 1 out of 4 dimensions.
// This case arises in the context of Slice of 4-D tensor in NHWC or NCHW
// format over channel dimension.
- if (DoesSliceShapeDifferInOnly1D(in_shape, begin, size, &slice_dim)) {
+ if (NDIM == 4 &&
+ DoesSliceShapeDifferInOnly1D(in_shape, begin, size, &slice_dim)) {
size_t in_strides[4] = { (size_t) in_shape.dim_size(1) *
in_shape.dim_size(2) *
in_shape.dim_size(3),
@@ -391,8 +403,16 @@ class MklSliceOp : public OpKernel {
// slice_dim is not 1 or 3, then we fallback to Eigen implementation.
}
- functor::Slice<Device, T, 4>()(
- context->eigen_device<Device>(), result, context->input(0), begin, size);
+ Eigen::DSizes<Eigen::DenseIndex, NDIM> indices;
+ Eigen::DSizes<Eigen::DenseIndex, NDIM> sizes;
+ for (int i = 0; i < NDIM; ++i) {
+ indices[i] = begin[i];
+ sizes[i] = size[i];
+ }
+
+ functor::Slice<Device, T, NDIM>()(
+ context->eigen_device<Device>(), result->tensor<T, NDIM>(),
+ context->input(0).tensor<T, NDIM>(), indices, sizes);
}
};
#endif
@@ -400,13 +420,13 @@ class MklSliceOp : public OpKernel {
// Forward declarations of the functor specializations for declared in the
// sharded source files.
namespace functor {
-#define DECLARE_CPU_SPEC(T, NDIM) \
- template <> \
- void Slice<CPUDevice, T, NDIM>::operator()( \
- const CPUDevice& d, Tensor* output, \
- const Tensor& input, \
- const gtl::ArraySlice<int64>& slice_indices, \
- const gtl::ArraySlice<int64>& slice_sizes); \
+#define DECLARE_CPU_SPEC(T, NDIM) \
+ template <> \
+ void Slice<CPUDevice, T, NDIM>::operator()( \
+ const CPUDevice& d, typename TTypes<T, NDIM>::Tensor output, \
+ typename TTypes<T, NDIM>::ConstTensor input, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& indices, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& sizes); \
extern template struct Slice<CPUDevice, T, NDIM>;
#define DECLARE_FOR_N(T) \
@@ -456,14 +476,13 @@ REGISTER_SLICE(bfloat16);
#if GOOGLE_CUDA
// Forward declarations of the functor specializations for GPU.
namespace functor {
-#define DECLARE_GPU_SPEC(T, NDIM) \
- template <> \
- void Slice<GPUDevice, T, NDIM>::operator()( \
- const GPUDevice& d, \
- Tensor* output, \
- const Tensor& input, \
- const gtl::ArraySlice<int64>& slice_indices, \
- const gtl::ArraySlice<int64>& slice_sizes); \
+#define DECLARE_GPU_SPEC(T, NDIM) \
+ template <> \
+ void Slice<GPUDevice, T, NDIM>::operator()( \
+ const GPUDevice& d, typename TTypes<T, NDIM>::Tensor output, \
+ typename TTypes<T, NDIM>::ConstTensor input, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& indices, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& sizes); \
extern template struct Slice<GPUDevice, T, NDIM>;
#define DECLARE_FOR_N(T) \
@@ -517,14 +536,13 @@ REGISTER_KERNEL_BUILDER(Name("Slice")
#ifdef TENSORFLOW_USE_SYCL
// Forward declarations of the functor specializations for SYCL.
namespace functor {
-#define DECLARE_SYCL_SPEC(T, NDIM) \
- template <> \
- void Slice<SYCLDevice, T, NDIM>::operator()( \
- const SYCLDevice& d, \
- Tensor* output, \
- const Tensor& input, \
- const gtl::ArraySlice<int64>& slice_indices, \
- const gtl::ArraySlice<int64>& slice_sizes); \
+#define DECLARE_SYCL_SPEC(T, NDIM) \
+ template <> \
+ void Slice<SYCLDevice, T, NDIM>::operator()( \
+ const SYCLDevice& d, typename TTypes<T, NDIM>::Tensor output,\
+ typename TTypes<T, NDIM>::ConstTensor input, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& indices, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& sizes); \
extern template struct Slice<SYCLDevice, T, NDIM>;
#define DECLARE_FOR_N(T) \
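
The `HANDLE_DIM` macro above turns the runtime rank into a compile-time template argument, so each rank gets its own Eigen instantiation and anything above rank 7 now fails with `errors::Unimplemented` instead of silently reusing the rank-7 path. A simplified, self-contained sketch of that dispatch pattern (not the kernel itself):

```
#include <cstdio>

// Rank-specific work; the real kernel runs the Eigen slice in
// SliceOp::HandleCase<NDIM> shown in the hunk above.
template <int NDIM>
void HandleCaseSketch() { std::printf("slicing at rank %d\n", NDIM); }

void DispatchSketch(int input_dims) {
#define HANDLE_DIM(NDIM)      \
  if (input_dims == NDIM) {   \
    HandleCaseSketch<NDIM>(); \
    return;                   \
  }
  HANDLE_DIM(1);
  HANDLE_DIM(2);
  HANDLE_DIM(3);
  HANDLE_DIM(4);
  HANDLE_DIM(5);
  HANDLE_DIM(6);
  HANDLE_DIM(7);
#undef HANDLE_DIM
  std::printf("rank %d unhandled\n", input_dims);  // Unimplemented path
}
```
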
diff --git a/tensorflow/core/kernels/slice_op.h b/tensorflow/core/kernels/slice_op.h
index 55a4be985b..db7eded745 100644
--- a/tensorflow/core/kernels/slice_op.h
+++ b/tensorflow/core/kernels/slice_op.h
@@ -19,104 +19,31 @@ limitations under the License.
// Functor definition for SliceOp, must be compilable by nvcc.
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_types.h"
-#include "tensorflow/core/kernels/ops_util.h"
namespace tensorflow {
-
-namespace internal {
-
-template <typename Device, typename T>
-void SliceSimple(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int64>& slice_indices);
-template <typename Device, typename T>
-void SliceSimpleGpu(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int64>& slice_indices);
-
-template <typename Device, typename T>
-void SliceSimple(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int64>& slice_indices) {
- const int ndims = in.dims();
- const int64 nelem = out->NumElements();
- const gtl::InlinedVector<int64, 8> in_strides = ComputeStride<int64>(in.shape());
- const gtl::InlinedVector<int64, 8> out_strides = ComputeStride<int64>(out->shape());
- const T* p = in.flat<T>().data();
- T* q = out->flat<T>().data();
-
- std::vector<int64> i_idx(nelem, 0);
- std::vector<int64> t(nelem, 0);
-
- for (int64 o_idx = 0; o_idx < nelem; ++o_idx) {
- t[o_idx] = o_idx;
- }
- for (int i = 0; i < ndims; ++i) {
- int64 n = (nelem + 7) / 8;
- int64 o_idx = 0;
- switch (nelem % 8) {
-#define CALC_INPUT_IDX \
- i_idx[o_idx] += (t[o_idx] / out_strides[i] + slice_indices[i]) * in_strides[i]; \
- t[o_idx] %= out_strides[i]; \
- ++o_idx;
- case 0: do { CALC_INPUT_IDX;
- case 7: CALC_INPUT_IDX;
- case 6: CALC_INPUT_IDX;
- case 5: CALC_INPUT_IDX;
- case 4: CALC_INPUT_IDX;
- case 3: CALC_INPUT_IDX;
- case 2: CALC_INPUT_IDX;
- case 1: CALC_INPUT_IDX;
-#undef CALC_INPUT_IDX
- } while (--n > 0);
- }
- }
- for (int64 o_idx = 0; o_idx < nelem; ++o_idx) {
- q[o_idx] = p[i_idx[o_idx]];
- }
-}
-
-template <typename Device, typename T, int NDIMS>
-void SliceUsingEigen(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int64>& slice_indices,
- const gtl::ArraySlice<int64>& slice_sizes) {
- auto input = in.tensor<T, NDIMS>();
- auto output = out->tensor<T, NDIMS>();
- Eigen::DSizes<int, NDIMS> indices;
- for (int i = 0; i < NDIMS; ++i) {
- indices[i] = slice_indices[i];
- }
- Eigen::DSizes<int, NDIMS> sizes;
- for (int i = 0; i < NDIMS; ++i) {
- sizes[i] = slice_sizes[i];
- }
- const bool use_64bit = input.size() > Eigen::NumTraits<int>::highest();
- if (!use_64bit &&
- Eigen::internal::is_same<Device, Eigen::GpuDevice>::value) {
- To32Bit(output).device(d) = To32Bit(input).slice(indices, sizes);
- } else {
- output.device(d) = input.slice(indices, sizes);
- }
-}
-
-} // namespace internal
-
namespace functor {
-// Template parameter NDIM is not neccesary here. The aim of keeping it
-// is to compile struct slice seperately which minimizes the compiling time.
-template <typename Device, typename T, int NDIM>
+template <typename Device, typename T, int NDIMS>
struct Slice {
- void operator()(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int64>& slice_indices,
- const gtl::ArraySlice<int64>& slice_sizes) {
- if (in.dims() == NDIM) {
- internal::SliceUsingEigen<Device, T, NDIM>(d, out, in, slice_indices, slice_sizes);
+ void operator()(const Device& d, typename TTypes<T, NDIMS>::Tensor output,
+ typename TTypes<T, NDIMS>::ConstTensor input,
+ const Eigen::DSizes<Eigen::DenseIndex, NDIMS>& slice_indices,
+ const Eigen::DSizes<Eigen::DenseIndex, NDIMS>& slice_sizes) {
+ bool use_64bit = (input.size() > Eigen::NumTraits<int>::highest());
+ if (!use_64bit &&
+ Eigen::internal::is_same<Device, Eigen::GpuDevice>::value) {
+ Eigen::DSizes<int, NDIMS> indices;
+ for (int i = 0; i < NDIMS; ++i) {
+ indices[i] = slice_indices[i];
+ }
+ Eigen::DSizes<int, NDIMS> sizes;
+ for (int i = 0; i < NDIMS; ++i) {
+ sizes[i] = slice_sizes[i];
+ }
+ To32Bit(output).device(d) = To32Bit(input).slice(indices, sizes);
} else {
- if (Eigen::internal::is_same<Device, Eigen::GpuDevice>::value) {
- internal::SliceSimpleGpu<Device, T>(d, out, in, slice_indices);
- } else {
- internal::SliceSimple<Device, T>(d, out, in, slice_indices);
- }
+ output.device(d) = input.slice(slice_indices, slice_sizes);
}
}
};
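
The rewritten functor above hands typed Eigen tensors straight to `input.slice(indices, sizes)`, keeping a 32-bit index fast path for small tensors on GPU. A standalone sketch of that Eigen call for a rank-2 CPU tensor, assuming Eigen's unsupported Tensor module is on the include path:

```
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  Eigen::Tensor<float, 2> input(4, 5);
  input.setRandom();
  Eigen::DSizes<Eigen::DenseIndex, 2> indices(1, 2);  // start offset per dim
  Eigen::DSizes<Eigen::DenseIndex, 2> sizes(2, 3);    // extent per dim
  Eigen::Tensor<float, 2> output = input.slice(indices, sizes);
  return output.size() == 6 ? 0 : 1;  // 2 x 3 slice
}
```
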
diff --git a/tensorflow/core/kernels/slice_op_gpu.cu.cc b/tensorflow/core/kernels/slice_op_gpu.cu.cc
index 3039b3d777..a301986f2f 100644
--- a/tensorflow/core/kernels/slice_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/slice_op_gpu.cu.cc
@@ -21,65 +21,9 @@ limitations under the License.
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor_types.h"
-#include "tensorflow/core/kernels/ops_util.h"
#include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/util/cuda_kernel_helper.h"
namespace tensorflow {
-namespace internal {
-
-template <typename T>
-__global__ void SliceKernel(int nthreads, const T* src, const int32* buf,
- const int32 ndims, T* dst) {
- const int32* in_strides = buf;
- const int32* out_strides = buf + ndims;
- const int32* slice_indices = buf + ndims * 2;
- CUDA_1D_KERNEL_LOOP(o_idx, nthreads) {
- int32 i_idx = 0;
- int32 t = o_idx;
- for (int i = 0; i < ndims; ++i) {
- i_idx += (t / out_strides[i] + slice_indices[i]) * in_strides[i];
- t %= out_strides[i];
- }
- dst[o_idx] = ldg(src + i_idx);
- }
-}
-
-template <typename Device, typename T>
-void SliceSimpleGpu(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int64>& slice_indices) {
- // Ensures we can use 32-bit index.
- const int64 in_nelem = in.NumElements();
- CHECK_LT(in_nelem, kint32max) << "Tensor too large to transpose on GPU";
- const int64 out_nelem = out->NumElements();
- CHECK_LT(out_nelem, kint32max) << "Tensor too large to transpose on GPU";
- // Pack strides and slice indices sizes into one buffer.
- const int32 ndims = in.dims();
- gtl::InlinedVector<int32, 24> host_buf(ndims * 3);
- gtl::InlinedVector<int32, 8> in_strides = ComputeStride<int32>(in.shape());
- gtl::InlinedVector<int32, 8> out_strides = ComputeStride<int32>(out->shape());
- for (int i = 0; i < ndims; ++i) {
- host_buf[i] = in_strides[i];
- host_buf[ndims + i] = out_strides[i];
- host_buf[ndims * 2 + i] = slice_indices[i];
- }
- auto num_bytes = sizeof(int64) * host_buf.size();
- auto dev_buf = d.allocate(num_bytes);
- // NOTE: host_buf is not allocated by CudaHostAllocator, and
- // therefore we are doing a sync copy effectively.
- d.memcpyHostToDevice(dev_buf, host_buf.data(), num_bytes);
- // Launch kernel to q[...] = p[...].
- const T* p = in.flat<T>().data();
- T* q = out->flat<T>().data();
- CudaLaunchConfig cfg = GetCudaLaunchConfig(out_nelem, d);
- SliceKernel<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>>(
- cfg.virtual_thread_count, p, reinterpret_cast<const int32*>(dev_buf),
- ndims, q);
- // Safe to deallocate immediately after the kernel launch.
- d.deallocate(dev_buf);
-}
-
-} // namespace internal
typedef Eigen::GpuDevice GPUDevice;
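
The deleted `SliceKernel` mapped each flat output index to a flat input index using row-major strides. For reference, this is the same arithmetic expressed as a plain CPU function (an illustration, not code from the tree):

```
#include <vector>

// Decompose the flat output index digit-by-digit with the output strides,
// shift each coordinate by the slice start, and recompose with the input
// strides -- exactly the loop the removed CUDA kernel ran per thread.
long long MapOutputToInputIndex(long long o_idx,
                                const std::vector<long long>& in_strides,
                                const std::vector<long long>& out_strides,
                                const std::vector<long long>& slice_indices) {
  long long i_idx = 0;
  long long t = o_idx;
  for (size_t i = 0; i < out_strides.size(); ++i) {
    i_idx += (t / out_strides[i] + slice_indices[i]) * in_strides[i];
    t %= out_strides[i];
  }
  return i_idx;
}
```
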
diff --git a/tensorflow/core/kernels/strided_slice_op.cc b/tensorflow/core/kernels/strided_slice_op.cc
index 8fc40db3cc..73b6d4cf6a 100644
--- a/tensorflow/core/kernels/strided_slice_op.cc
+++ b/tensorflow/core/kernels/strided_slice_op.cc
@@ -427,6 +427,7 @@ REGISTER_STRIDED_SLICE(bfloat16);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU);
TF_CALL_complex64(REGISTER_GPU);
TF_CALL_complex128(REGISTER_GPU);
+TF_CALL_int64(REGISTER_GPU);
// A special GPU kernel for int32.
// TODO(b/25387198): Also enable int32 in device memory. This kernel
diff --git a/tensorflow/core/kernels/strided_slice_op_impl.h b/tensorflow/core/kernels/strided_slice_op_impl.h
index 7d42887426..afe3a051e6 100644
--- a/tensorflow/core/kernels/strided_slice_op_impl.h
+++ b/tensorflow/core/kernels/strided_slice_op_impl.h
@@ -84,16 +84,16 @@ void HandleStridedSliceCase(OpKernelContext* context,
gtl::InlinedVector<int64, 4> processing_dims = processing_shape.dim_sizes();
if (is_simple_slice) {
- gtl::InlinedVector<int64, 4> sizes(begin.size());
+ Eigen::DSizes<Eigen::DenseIndex, NDIM> begin_di;
+ Eigen::DSizes<Eigen::DenseIndex, NDIM> sizes_di;
for (int i = 0; i < NDIM; ++i) {
- sizes[i] = end[i] - begin[i];
+ begin_di[i] = begin[i];
+ sizes_di[i] = end[i] - begin[i];
}
- const TensorShape final_shape = result->shape();
- CHECK(result->CopyFrom(*result, processing_shape));
- const Tensor input = context->input(0);
- functor::Slice<Device, T, NDIM>()(
- context->eigen_device<Device>(), result, input, begin, sizes);
- CHECK(result->CopyFrom(*result, final_shape));
+ functor::Slice<Device, Proxy, NDIM>()(
+ context->eigen_device<Device>(),
+ result->bit_casted_shaped<Proxy, NDIM>(processing_dims),
+ context->input(0).bit_casted_tensor<Proxy, NDIM>(), begin_di, sizes_di);
} else {
Eigen::DSizes<Eigen::DenseIndex, NDIM> begin_di;
Eigen::DSizes<Eigen::DenseIndex, NDIM> end_di;
@@ -196,9 +196,10 @@ class HandleStridedSliceAssignCase<Device, T, 0> {
extern template struct StridedSlice<GPUDevice, T, NDIM>; \
template <> \
void Slice<GPUDevice, T, NDIM>::operator()( \
- const GPUDevice& d, Tensor* output, const Tensor& input, \
- const gtl::ArraySlice<int64>& slice_indices, \
- const gtl::ArraySlice<int64>& slice_sizes); \
+ const GPUDevice& d, typename TTypes<T, NDIM>::Tensor output, \
+ typename TTypes<T, NDIM>::ConstTensor input, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& indices, \
+ const Eigen::DSizes<Eigen::DenseIndex, NDIM>& sizes); \
extern template struct Slice<GPUDevice, T, NDIM>; \
template <> \
void StridedSliceGrad<GPUDevice, T, NDIM>::operator()( \
@@ -283,6 +284,7 @@ TF_CALL_GPU_NUMBER_TYPES(DECLARE_FOR_N_GPU);
TF_CALL_complex64(DECLARE_FOR_N_GPU);
TF_CALL_complex128(DECLARE_FOR_N_GPU);
DECLARE_FOR_N_GPU(int32);
+DECLARE_FOR_N_GPU(int64);
#endif // END GOOGLE_CUDA
TF_CALL_ALL_TYPES(DECLARE_FOR_N_CPU);
@@ -298,6 +300,7 @@ DECLARE_FOR_N_CPU(bfloat16);
TF_CALL_SYCL_PROXY_TYPES(PREVENT_FOR_N_SYCL);
TF_CALL_GPU_NUMBER_TYPES_NO_HALF(DECLARE_FOR_N_SYCL);
DECLARE_FOR_N_SYCL(int32);
+DECLARE_FOR_N_SYCL(int64);
#undef DECLARE_FOR_N_SYCL
#endif // TENSORFLOW_USE_SYCL
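
In the simple-slice path above, the tensor is reinterpreted through a `Proxy` type via `bit_casted_tensor`; as far as this hunk shows, the point is that types of equal byte width can share one `Slice` instantiation, since slicing only moves bytes. A hypothetical standalone illustration of that bit-cast idea (names and types are examples, not TensorFlow's):

```
#include <complex>
#include <cstring>
#include <vector>

// Copy a contiguous sub-range of complex<float> (8 bytes) through a
// double-sized "proxy" view; values are preserved bit-for-bit.
std::vector<std::complex<float>> SliceViaProxy(
    const std::vector<std::complex<float>>& in, size_t begin, size_t size) {
  static_assert(sizeof(std::complex<float>) == sizeof(double),
                "proxy type must have the same byte width");
  std::vector<std::complex<float>> out(size);
  const char* src =
      reinterpret_cast<const char*>(in.data()) + begin * sizeof(double);
  std::memcpy(out.data(), src, size * sizeof(double));
  return out;
}
```
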
diff --git a/tensorflow/core/kernels/strided_slice_op_test.cc b/tensorflow/core/kernels/strided_slice_op_test.cc
index 78bb15463c..281ca0f58f 100644
--- a/tensorflow/core/kernels/strided_slice_op_test.cc
+++ b/tensorflow/core/kernels/strided_slice_op_test.cc
@@ -76,69 +76,20 @@ static void SliceHelper(int iters, int size) {
testing::UseRealTime();
}
-template <typename T>
-static void Dim8SliceHelper(int iters, int size) {
- testing::StopTiming();
- Graph* g = new Graph(OpRegistry::Global());
- DataType dt = DataTypeToEnum<T>::v();
- int kDim = 100;
- int kMaxSize = 15000;
- CHECK_LT(size, kMaxSize);
-
- Tensor begin(DT_INT32, TensorShape({8}));
- begin.flat<int32>()(0) = 10;
- for (int i = 1; i < 7; ++i) {
- begin.flat<int32>()(i) = 0;
- }
- begin.flat<int32>()(7) = 10;
-
- Tensor end(DT_INT32, TensorShape({8}));
- end.flat<int32>()(0) = 10 + kDim;
- for (int i = 1; i < 7; ++i) {
- end.flat<int32>()(i) = 1;
- }
- end.flat<int32>()(7) = 10 + size;
-
- Tensor strides(DT_INT32, TensorShape({8}));
- for (int i = 0; i < 8; ++i) {
- strides.flat<int32>()(i) = 1;
- }
-
- Tensor input(dt, TensorShape({2*kDim, 1, 1, 1, 1, 1, 1, kMaxSize}));
- input.flat<T>().setRandom();
-
- Node* node;
- TF_CHECK_OK(NodeBuilder(g->NewName("n"), "StridedSlice")
- .Input(test::graph::Constant(g, input))
- .Input(test::graph::Constant(g, begin))
- .Input(test::graph::Constant(g, end))
- .Input(test::graph::Constant(g, strides))
- .Attr("T", dt)
- .Finalize(g, &node));
-
- testing::BytesProcessed(static_cast<int64>(iters) * kDim * size * sizeof(T));
- testing::StartTiming();
- test::Benchmark("cpu", g).Run(iters);
- testing::UseRealTime();
-}
-
static void BM_SliceFloat(int iters, int dim2) {
SliceHelper<float>(iters, dim2);
- Dim8SliceHelper<float>(iters, dim2);
}
BENCHMARK(BM_SliceFloat)->Arg(100)->Arg(1000)->Arg(10000);
static void BM_SliceComplex64(int iters, int dim2) {
SliceHelper<std::complex<float>>(iters, dim2);
- Dim8SliceHelper<std::complex<float>>(iters, dim2);
}
BENCHMARK(BM_SliceComplex64)->Arg(100)->Arg(1000)->Arg(10000);
static void BM_SliceBFloat16(int iters, int dim2) {
SliceHelper<bfloat16>(iters, dim2);
- Dim8SliceHelper<bfloat16>(iters, dim2);
}
BENCHMARK(BM_SliceBFloat16)->Arg(100)->Arg(1000)->Arg(10000);
diff --git a/tensorflow/core/kernels/transpose_op.cc b/tensorflow/core/kernels/transpose_op.cc
index 96c051c636..20f0edf309 100644
--- a/tensorflow/core/kernels/transpose_op.cc
+++ b/tensorflow/core/kernels/transpose_op.cc
@@ -31,14 +31,13 @@ limitations under the License.
namespace tensorflow {
-// inv = InvertPermutationOp(T<int32/int64> p) takes a permutation of
+// inv = InvertPermutationOp(T<int32> p) takes a permutation of
// integers 0, 1, ..., n - 1 and returns the inverted
// permutation of p. I.e., inv[p[i]] == i, for i in [0 .. n).
//
-// REQUIRES: input is a vector of int32 or int64.
+// REQUIRES: input is a vector of int32.
// REQUIRES: input is a permutation of 0, 1, ..., n-1.
-template <typename T>
class InvertPermutationOp : public OpKernel {
public:
explicit InvertPermutationOp(OpKernelConstruction* context)
@@ -49,19 +48,20 @@ class InvertPermutationOp : public OpKernel {
OP_REQUIRES(
context, TensorShapeUtils::IsVector(input.shape()),
errors::InvalidArgument("invert_permutation expects a 1D vector."));
- auto Tin = input.vec<T>();
+ auto Tin = input.vec<int32>();
OP_REQUIRES(context,
FastBoundsCheck(Tin.size(), std::numeric_limits<int32>::max()),
errors::InvalidArgument("permutation of nonnegative int32s "
"must have <= int32 max elements"));
- const T N = static_cast<T>(Tin.size()); // Safe: bounds-checked above.
+ const int32 N =
+ static_cast<int32>(Tin.size()); // Safe: bounds-checked above.
Tensor* output = nullptr;
OP_REQUIRES_OK(context,
context->allocate_output(0, input.shape(), &output));
- auto Tout = output->vec<T>();
+ auto Tout = output->vec<int32>();
std::fill_n(Tout.data(), N, -1);
for (int i = 0; i < N; ++i) {
- const T d = internal::SubtleMustCopy(Tin(i));
+ const int32 d = internal::SubtleMustCopy(Tin(i));
OP_REQUIRES(context, FastBoundsCheck(d, N),
errors::InvalidArgument(d, " is not between 0 and ", N));
OP_REQUIRES(context, Tout(d) == -1,
@@ -73,23 +73,14 @@ class InvertPermutationOp : public OpKernel {
REGISTER_KERNEL_BUILDER(
Name("InvertPermutation").Device(DEVICE_CPU).TypeConstraint<int32>("T"),
- InvertPermutationOp<int32>);
-REGISTER_KERNEL_BUILDER(
- Name("InvertPermutation").Device(DEVICE_CPU).TypeConstraint<int64>("T"),
- InvertPermutationOp<int64>);
+ InvertPermutationOp);
REGISTER_KERNEL_BUILDER(Name("InvertPermutation")
.Device(DEVICE_GPU)
.TypeConstraint<int32>("T")
.HostMemory("x")
.HostMemory("y"),
- InvertPermutationOp<int32>);
-REGISTER_KERNEL_BUILDER(Name("InvertPermutation")
- .Device(DEVICE_GPU)
- .TypeConstraint<int64>("T")
- .HostMemory("x")
- .HostMemory("y"),
- InvertPermutationOp<int64>);
+ InvertPermutationOp);
#ifdef TENSORFLOW_USE_SYCL
REGISTER_KERNEL_BUILDER(Name("InvertPermutation")
@@ -97,13 +88,7 @@ REGISTER_KERNEL_BUILDER(Name("InvertPermutation")
.TypeConstraint<int32>("T")
.HostMemory("x")
.HostMemory("y"),
- InvertPermutationOp<int32>);
-REGISTER_KERNEL_BUILDER(Name("InvertPermutation")
- .Device(DEVICE_SYCL)
- .TypeConstraint<int64>("T")
- .HostMemory("x")
- .HostMemory("y"),
- InvertPermutationOp<int64>);
+ InvertPermutationOp);
#endif // TENSORFLOW_USE_SYCL
namespace {
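
The kernel above computes `inv[p[i]] = i` while using `-1` as a sentinel to reject out-of-range and duplicated entries. A minimal sketch of the same inversion, separate from the op machinery:

```
#include <vector>

// Returns false if p is not a permutation of 0..n-1; otherwise fills inv
// with the inverse permutation, mirroring the checks in the kernel above.
bool InvertPermutationSketch(const std::vector<int>& p, std::vector<int>* inv) {
  const int n = static_cast<int>(p.size());
  inv->assign(n, -1);
  for (int i = 0; i < n; ++i) {
    const int d = p[i];
    if (d < 0 || d >= n) return false;  // not between 0 and n
    if ((*inv)[d] != -1) return false;  // d appears twice
    (*inv)[d] = i;
  }
  return true;
}
// p = {2, 0, 1} yields inv = {1, 2, 0}.
```
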
diff --git a/tensorflow/core/kernels/unique_op.cc b/tensorflow/core/kernels/unique_op.cc
index d087784c8a..701c5f6d2b 100644
--- a/tensorflow/core/kernels/unique_op.cc
+++ b/tensorflow/core/kernels/unique_op.cc
@@ -13,7 +13,6 @@ See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
-#include <functional>
#include <unordered_map>
#include <utility>
@@ -22,7 +21,6 @@ limitations under the License.
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/lib/core/status.h"
-#include "tensorflow/core/lib/hash/hash.h"
namespace tensorflow {
@@ -35,6 +33,8 @@ class UniqueOp : public OpKernel {
void Compute(OpKernelContext* context) override {
const Tensor& input = context->input(0);
+ OP_REQUIRES(context, TensorShapeUtils::IsVector(input.shape()),
+ errors::InvalidArgument("unique expects a 1D vector."));
// TODO(dga): Make unique polymorphic for returning int32 and int64
// vectors to support large tensors.
OP_REQUIRES(context,
@@ -42,102 +42,31 @@ class UniqueOp : public OpKernel {
errors::InvalidArgument(
"unique does not support input tensors larger than ",
std::numeric_limits<int32>::max(), " elements"));
-
- int64 axis = 0;
- std::vector<int64> new_sizes{1, input.NumElements(), 1};
- if (context->num_inputs() == 1) {
- OP_REQUIRES(context, TensorShapeUtils::IsVector(input.shape()),
- errors::InvalidArgument("unique expects a 1D vector."));
- } else {
- // In case of UniqueV2, the axis is a 1D vector. The purpose is
- // to allow specifying either "no axis" or "axis". The `[]` means
- // "no axis", while `[x]` means `axis = x`.
- const Tensor& axis_tensor = context->input(1);
- OP_REQUIRES(context, TensorShapeUtils::IsVector(axis_tensor.shape()),
- errors::InvalidArgument("axis expects a 1D vector."));
- OP_REQUIRES(
- context, axis_tensor.NumElements() <= 1,
- errors::InvalidArgument(
- "axis does not support input tensors larger than 1 elements"));
- if (axis_tensor.NumElements() == 0) {
- OP_REQUIRES(context, TensorShapeUtils::IsVector(input.shape()),
- errors::InvalidArgument("unique expects a 1D vector."));
- } else {
- auto axis_vec = axis_tensor.vec<int64>();
- axis = axis_vec(0);
- axis = axis < 0 ? axis + input.dims() : axis;
- OP_REQUIRES(context, 0 <= axis && axis < input.dims(),
- errors::InvalidArgument("axis has to be between [0, ",
- input.dims(), ")"));
- if (axis > 0) {
- for (int64 i = 0; i < axis; i++) {
- new_sizes[0] *= input.dim_size(i);
- }
- }
- new_sizes[1] = input.dim_size(axis);
- if (axis + 1 < input.dims()) {
- for (int64 i = axis + 1; i < input.dims(); i++) {
- new_sizes[2] *= input.dim_size(i);
- }
- }
- }
- }
-
- auto Tin = input.shaped<T, 3>(new_sizes);
+ auto Tin = input.vec<T>();
+ const int64 N = static_cast<int64>(Tin.size());
Tensor* idx = nullptr;
- OP_REQUIRES_OK(context, context->allocate_output(
- 1, TensorShape({Tin.dimension(1)}), &idx));
+ OP_REQUIRES_OK(context, context->forward_input_or_allocate_output(
+ {0}, 1, input.shape(), &idx));
auto idx_vec = idx->template vec<TIndex>();
- auto hash_fn = [&Tin](const int64& key) -> unsigned long {
- size_t h = 0;
- for (int64 i = 0; i < Tin.dimension(0); i++) {
- for (int64 j = 0; j < Tin.dimension(2); j++) {
- h = Hash64Combine(h, hash<T>{}(Tin(i, key, j)));
- }
- }
- return h;
- };
-
- auto equal_to_fn = [&Tin](const int64& lhs, const int64& rhs) {
- for (int64 i = 0; i < Tin.dimension(0); i++) {
- for (int64 j = 0; j < Tin.dimension(2); j++) {
- if (Tin(i, lhs, j) != Tin(i, rhs, j)) {
- return false;
- }
- }
- }
- return true;
- };
-
- std::unordered_map<int64, int64, decltype(hash_fn), decltype(equal_to_fn)>
- uniq(0, hash_fn, equal_to_fn);
-
- uniq.reserve(2 * Tin.dimension(1));
-
- for (int64 i = 0, j = 0; i < Tin.dimension(1); ++i) {
- auto it = uniq.insert(std::make_pair(i, j));
+ std::unordered_map<T, TIndex> uniq;
+ uniq.reserve(2 * N);
+ for (int64 i = 0, j = 0; i < N; ++i) {
+ auto it = uniq.insert(std::make_pair(Tin(i), j));
idx_vec(i) = it.first->second;
if (it.second) {
++j;
}
}
-
int64 uniq_size = static_cast<int64>(uniq.size());
- new_sizes[1] = uniq_size;
- TensorShape output_shape(input.shape());
- output_shape.set_dim(axis, uniq_size);
Tensor* output = nullptr;
- OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output));
- auto Tout = output->shaped<T, 3>(new_sizes);
+ OP_REQUIRES_OK(context, context->allocate_output(
+ 0, TensorShape({uniq_size}), &output));
+ auto output_vec = output->template vec<T>();
for (auto it : uniq) {
- for (int64 i = 0; i < Tin.dimension(0); i++) {
- for (int64 j = 0; j < Tin.dimension(2); j++) {
- Tout(i, it.second, j) = Tin(i, it.first, j);
- }
- }
+ output_vec(it.second) = it.first;
}
if (num_outputs() > 2) {
@@ -145,7 +74,7 @@ class UniqueOp : public OpKernel {
2, TensorShape({uniq_size}), &output));
auto count_output_vec = output->template vec<TIndex>();
count_output_vec.setZero();
- for (int64 i = 0; i < Tin.dimension(1); ++i) {
+ for (int64 i = 0; i < N; ++i) {
count_output_vec(idx_vec(i))++;
}
}
@@ -163,16 +92,6 @@ class UniqueOp : public OpKernel {
.TypeConstraint<type>("T") \
.TypeConstraint<int64>("out_idx"), \
UniqueOp<type, int64>); \
- REGISTER_KERNEL_BUILDER(Name("UniqueV2") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("out_idx"), \
- UniqueOp<type, int32>); \
- REGISTER_KERNEL_BUILDER(Name("UniqueV2") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int64>("out_idx"), \
- UniqueOp<type, int64>); \
REGISTER_KERNEL_BUILDER(Name("UniqueWithCounts") \
.Device(DEVICE_CPU) \
.TypeConstraint<type>("T") \
@@ -257,5 +176,5 @@ REGISTER_KERNEL_BUILDER(Name("Unique")
.HostMemory("y")
.HostMemory("idx"),
UniqueOp<int64, int64>);
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
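
After the revert, `UniqueOp` is again the single-pass hash-map algorithm: each first occurrence claims the next output slot, `idx` records the slot of every input element, and counts are tallied from `idx`. A standalone sketch of that algorithm (illustrative only):

```
#include <unordered_map>
#include <vector>

void UniqueSketch(const std::vector<int>& x, std::vector<int>* y,
                  std::vector<int>* idx, std::vector<int>* count) {
  std::unordered_map<int, int> uniq;
  uniq.reserve(2 * x.size());
  idx->resize(x.size());
  int next = 0;
  for (size_t i = 0; i < x.size(); ++i) {
    auto it = uniq.insert({x[i], next});  // first occurrence claims 'next'
    (*idx)[i] = it.first->second;
    if (it.second) ++next;
  }
  y->resize(uniq.size());
  for (const auto& kv : uniq) (*y)[kv.second] = kv.first;
  count->assign(uniq.size(), 0);
  for (int slot : *idx) ++(*count)[slot];  // occurrences per unique value
}
// x = {1, 1, 2, 4, 4, 4, 7, 8, 8} -> y = {1, 2, 4, 7, 8},
// idx = {0, 0, 1, 2, 2, 2, 3, 4, 4}, count = {2, 1, 3, 1, 2}.
```
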
diff --git a/tensorflow/core/ops/array_ops.cc b/tensorflow/core/ops/array_ops.cc
index 9fa6423d59..be2916f154 100644
--- a/tensorflow/core/ops/array_ops.cc
+++ b/tensorflow/core/ops/array_ops.cc
@@ -723,9 +723,7 @@ y: a tensor of the same shape and type as x but filled with zeros.
REGISTER_OP("OnesLike")
.Input("x: T")
.Output("y: T")
- .Attr(
- "T: {float, double, int8, uint8, int16, uint16, int32, int64, "
- "complex64, complex128, bool}")
+ .Attr("T: {float, double, int32, int64, complex64, complex128}")
.SetShapeFn(shape_inference::UnchangedShape)
.Doc(R"doc(
Returns a tensor of ones with the same shape and type as x.
@@ -2033,46 +2031,6 @@ y: 1-D.
idx: 1-D.
)doc");
-REGISTER_OP("UniqueV2")
- .Input("x: T")
- .Input("axis: int64")
- .Output("y: T")
- .Output("idx: out_idx")
- .Attr("T: type")
- .Attr("out_idx: {int32, int64} = DT_INT32")
- .SetShapeFn([](InferenceContext* c) {
- c->set_output(0, c->Vector(InferenceContext::kUnknownDim));
- c->set_output(1, c->input(0));
- return Status::OK();
- })
- .Doc(R"doc(
-Finds unique elements in a 1-D tensor.
-
-This operation returns a tensor `y` containing all of the unique elements of `x`
-sorted in the same order that they occur in `x`. This operation also returns a
-tensor `idx` the same size as `x` that contains the index of each value of `x`
-in the unique output `y`. In other words:
-
-`y[idx[i]] = x[i] for i in [0, 1,...,rank(x) - 1]`
-
-For example:
-
-```
-# tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
-y, idx = unique(x)
-y ==> [1, 2, 4, 7, 8]
-idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
-```
-
-
-x: A `Tensor`.
-axis: A `Tensor` of type `int64` (default: 0). The axis of the Tensor to
- find the unique elements.
-y: A `Tensor`. Unique elements along the `axis` of `Tensor` x.
-idx: A 1-D Tensor. Has the same type as x that contains the index of each
- value of x in the output y.
-)doc");
-
// --------------------------------------------------------------------------
REGISTER_OP("UniqueWithCounts")
.Input("x: T")
diff --git a/tensorflow/core/ops/math_ops.cc b/tensorflow/core/ops/math_ops.cc
index d30b847696..7b10af9f44 100644
--- a/tensorflow/core/ops/math_ops.cc
+++ b/tensorflow/core/ops/math_ops.cc
@@ -1829,8 +1829,6 @@ need not be sorted and need not cover all values in the full
range of valid values.
If the sum is empty for a given segment ID `i`, `output[i] = 0`.
-If the given segment ID `i` is negative, the value is dropped and will not be
-added to the sum of the segment.
`num_segments` should equal the number of distinct segment IDs.
diff --git a/tensorflow/core/ops/nn_ops.cc b/tensorflow/core/ops/nn_ops.cc
index a242a13878..e245c8ba91 100644
--- a/tensorflow/core/ops/nn_ops.cc
+++ b/tensorflow/core/ops/nn_ops.cc
@@ -819,7 +819,7 @@ REGISTER_OP("DepthwiseConv2dNative")
.Input("input: T")
.Input("filter: T")
.Output("output: T")
- .Attr("T: {half, float, double}")
+ .Attr("T: {float, double}")
.Attr("strides: list(int)")
.Attr(GetPaddingAttrString())
.Attr(GetConvnetDataFormatAttrString())
@@ -945,7 +945,7 @@ REGISTER_OP("Conv3D")
.Input("input: T")
.Input("filter: T")
.Output("output: T")
- .Attr("T: {half, float, double}")
+ .Attr("T: {float, double}")
.Attr("strides: list(int) >= 5")
.Attr(GetPaddingAttrString())
.Attr(GetConvnet3dDataFormatAttrString())
@@ -977,7 +977,7 @@ REGISTER_OP("Conv3DBackpropInput")
.Input("filter: T")
.Input("out_backprop: T")
.Output("output: T")
- .Attr("T: {half, float, double}")
+ .Attr("T: {float, double}")
.Attr("strides: list(int) >= 5")
.Attr(GetPaddingAttrString())
.Deprecated(10, "Use Conv3DBackpropInputV2")
@@ -1003,7 +1003,7 @@ REGISTER_OP("Conv3DBackpropFilter")
.Input("filter: T")
.Input("out_backprop: T")
.Output("output: T")
- .Attr("T: {half, float, double}")
+ .Attr("T: {float, double}")
.Attr("strides: list(int) >= 5")
.Attr(GetPaddingAttrString())
.Deprecated(10, "Use Conv3DBackpropFilterV2")
@@ -1032,7 +1032,7 @@ REGISTER_OP("Conv3DBackpropInputV2")
.Input("filter: T")
.Input("out_backprop: T")
.Output("output: T")
- .Attr("T: {half, float, double}")
+ .Attr("T: {float, double}")
.Attr("strides: list(int) >= 5")
.Attr(GetPaddingAttrString())
.Attr(GetConvnet3dDataFormatAttrString())
@@ -1069,7 +1069,7 @@ REGISTER_OP("Conv3DBackpropFilterV2")
.Input("filter_sizes: int32")
.Input("out_backprop: T")
.Output("output: T")
- .Attr("T: {half, float, double}")
+ .Attr("T: {float, double}")
.Attr("strides: list(int) >= 5")
.Attr(GetPaddingAttrString())
.Attr(GetConvnet3dDataFormatAttrString())
diff --git a/tensorflow/core/ops/ops.pbtxt b/tensorflow/core/ops/ops.pbtxt
index 9c41957ae6..6ce0b70c9d 100644
--- a/tensorflow/core/ops/ops.pbtxt
+++ b/tensorflow/core/ops/ops.pbtxt
@@ -5449,7 +5449,6 @@ op {
type: "type"
allowed_values {
list {
- type: DT_HALF
type: DT_FLOAT
type: DT_DOUBLE
}
@@ -5516,7 +5515,6 @@ op {
type: "type"
allowed_values {
list {
- type: DT_HALF
type: DT_FLOAT
type: DT_DOUBLE
}
@@ -5572,7 +5570,6 @@ op {
type: "type"
allowed_values {
list {
- type: DT_HALF
type: DT_FLOAT
type: DT_DOUBLE
}
@@ -5638,7 +5635,6 @@ op {
type: "type"
allowed_values {
list {
- type: DT_HALF
type: DT_FLOAT
type: DT_DOUBLE
}
@@ -5694,7 +5690,6 @@ op {
type: "type"
allowed_values {
list {
- type: DT_HALF
type: DT_FLOAT
type: DT_DOUBLE
}
diff --git a/tensorflow/core/platform/default/build_config/BUILD b/tensorflow/core/platform/default/build_config/BUILD
index f2fadb4558..f746b15fee 100644
--- a/tensorflow/core/platform/default/build_config/BUILD
+++ b/tensorflow/core/platform/default/build_config/BUILD
@@ -12,7 +12,6 @@ load("//tensorflow:tensorflow.bzl", "tf_copts")
load("//tensorflow:tensorflow.bzl", "tf_cuda_library")
load("//tensorflow/core:platform/default/build_config_root.bzl", "if_static")
load("@local_config_sycl//sycl:platform.bzl", "sycl_library_path")
-load("@local_config_sycl//sycl:build_defs.bzl", "if_ccpp")
cc_library(
name = "gtest",
@@ -195,16 +194,17 @@ cc_library(
cc_library(
name = "sycl",
- data = if_ccpp([
+ data = [
"@local_config_sycl//sycl:{}".format(sycl_library_path("ComputeCpp")),
- ]),
- linkopts = if_ccpp([
- "-Wl,-rpath,../local_config_sycl/sycl/lib",
- ]),
- deps = if_ccpp(
- ["@local_config_sycl//sycl:syclrt"],
- ["@local_config_sycl//sycl:sycl_headers"],
- ),
+ ],
+ linkopts = select({
+ "//conditions:default": [
+ "-Wl,-rpath,../local_config_sycl/sycl/lib",
+ ],
+ }),
+ deps = [
+ "@local_config_sycl//sycl:syclrt",
+ ],
)
filegroup(
diff --git a/tensorflow/core/platform/default/notification.h b/tensorflow/core/platform/default/notification.h
index 5c401b7477..6a214dbd0a 100644
--- a/tensorflow/core/platform/default/notification.h
+++ b/tensorflow/core/platform/default/notification.h
@@ -73,7 +73,7 @@ class Notification {
}
mutex mu_; // protects mutations of notified_
- condition_variable cv_; // signaled when notified_ becomes non-zero
+ condition_variable cv_; // signalled when notified_ becomes non-zero
std::atomic<bool> notified_; // mutations under mu_
};
diff --git a/tensorflow/core/platform/posix/error.cc b/tensorflow/core/platform/posix/error.cc
index f8b0285c50..e9baad5422 100644
--- a/tensorflow/core/platform/posix/error.cc
+++ b/tensorflow/core/platform/posix/error.cc
@@ -72,7 +72,7 @@ error::Code ErrnoToCode(int err_number) {
case EBUSY: // Device or resource busy
case ECHILD: // No child processes
case EISCONN: // Socket is connected
-#if !defined(_WIN32) && !defined(__HAIKU__)
+#if !defined(_WIN32)
case ENOTBLK: // Block device required
#endif
case ENOTCONN: // The socket is not connected
@@ -94,7 +94,7 @@ error::Code ErrnoToCode(int err_number) {
case ENODATA: // No message is available on the STREAM read queue
case ENOMEM: // Not enough space
case ENOSR: // No STREAM resources
-#if !defined(_WIN32) && !defined(__HAIKU__)
+#if !defined(_WIN32)
case EUSERS: // Too many users
#endif
code = error::RESOURCE_EXHAUSTED;
@@ -111,7 +111,7 @@ error::Code ErrnoToCode(int err_number) {
case EPFNOSUPPORT: // Protocol family not supported
#endif
case EPROTONOSUPPORT: // Protocol not supported
-#if !defined(_WIN32) && !defined(__HAIKU__)
+#if !defined(_WIN32)
case ESOCKTNOSUPPORT: // Socket type not supported
#endif
case EXDEV: // Improper link
@@ -131,8 +131,7 @@ error::Code ErrnoToCode(int err_number) {
case ENETUNREACH: // Network unreachable
case ENOLCK: // No locks available
case ENOLINK: // Link has been severed
-#if !(defined(__APPLE__) || defined(__FreeBSD__) || defined(_WIN32) \
- || defined(__HAIKU__))
+#if !(defined(__APPLE__) || defined(__FreeBSD__) || defined(_WIN32))
case ENONET: // Machine is not on the network
#endif
code = error::UNAVAILABLE;
@@ -157,7 +156,7 @@ error::Code ErrnoToCode(int err_number) {
case ENOEXEC: // Exec format error
case ENOMSG: // No message of the desired type
case EPROTO: // Protocol error
-#if !defined(_WIN32) && !defined(__HAIKU__)
+#if !defined(_WIN32)
case EREMOTE: // Object is remote
#endif
code = error::UNKNOWN;
diff --git a/tensorflow/core/platform/posix/port.cc b/tensorflow/core/platform/posix/port.cc
index 09f69a95c1..6cba40ccfc 100644
--- a/tensorflow/core/platform/posix/port.cc
+++ b/tensorflow/core/platform/posix/port.cc
@@ -37,8 +37,7 @@ limitations under the License.
#ifdef TF_USE_SNAPPY
#include "snappy.h"
#endif
-#if (defined(__APPLE__) && defined(__MACH__)) || defined(__FreeBSD__) \
- || defined(__HAIKU__)
+#if (defined(__APPLE__) && defined(__MACH__)) || defined(__FreeBSD__)
#include <thread>
#endif
@@ -62,8 +61,7 @@ int NumSchedulableCPUs() {
}
perror("sched_getaffinity");
#endif
-#if (defined(__APPLE__) && defined(__MACH__)) || defined(__FreeBSD__) \
- || defined(__HAIKU__)
+#if (defined(__APPLE__) && defined(__MACH__)) || defined(__FreeBSD__)
unsigned int count = std::thread::hardware_concurrency();
if (count > 0) return static_cast<int>(count);
#endif
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index ec077c4283..1bf9c93101 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -24,7 +24,7 @@ limitations under the License.
// TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
// "-beta", "-rc", "-rc.1")
-#define TF_VERSION_SUFFIX ""
+#define TF_VERSION_SUFFIX "-rc1"
#define TF_STR_HELPER(x) #x
#define TF_STR(x) TF_STR_HELPER(x)
diff --git a/tensorflow/core/util/mkl_util.h b/tensorflow/core/util/mkl_util.h
index 118ff0d0d6..1bfa4f83a3 100644
--- a/tensorflow/core/util/mkl_util.h
+++ b/tensorflow/core/util/mkl_util.h
@@ -26,23 +26,18 @@ limitations under the License.
#include "mkl_trans.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"
+
#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/graph/mkl_graph_util.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/lib/gtl/array_slice.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/macros.h"
#include "tensorflow/core/util/padding.h"
#include "tensorflow/core/util/tensor_format.h"
-#include "tensorflow/core/graph/mkl_graph_util.h"
#ifdef INTEL_MKL_DNN
#include "mkldnn.hpp"
-
-using mkldnn::memory;
-using mkldnn::reorder;
-using mkldnn::primitive;
-using mkldnn::padding_kind;
-using mkldnn::engine;
#endif
// The file contains a number of utility classes and functions used by MKL
@@ -56,8 +51,6 @@ namespace tensorflow {
// Tensorflow tensor.
typedef enum { W = 0, H = 1, C = 2, N = 3 } MklDims;
-typedef enum { Dim_N = 0, Dim_C = 1, Dim_H = 2, Dim_W = 3,
- Dim_O = 0, Dim_I = 1 } MklDnnDims;
class MklShape {
public:
@@ -150,9 +143,7 @@ class MklShape {
size_t GetDimension() const { return dimension_; }
const size_t* GetSizes() const { return sizes_; }
int64 dim_size(int index) const { return sizes_[index]; }
- int64 tf_dim_size(int index) const {
- return sizes_[tf_to_mkl_dim_map_[index]];
- }
+ int64 tf_dim_size(int index) const { return sizes_[tf_to_mkl_dim_map_[index]]; }
const size_t* GetStrides() const { return strides_; }
const size_t* GetTfToMklDimMap() const { return tf_to_mkl_dim_map_; }
size_t tf_dim_idx(int index) const { return tf_to_mkl_dim_map_[index]; }
@@ -236,8 +227,7 @@ class MklShape {
(IS_MKL_TENSOR_OFFSET + sizeof(size_t)) // Location of dimension_
// Location of sizes. Note dim is not used here, left here
// to make macros consistent.
-#define SIZES_OFFSET(dims) \
- (DIMS_OFFSET + sizeof(size_t))
+#define SIZES_OFFSET(dims) (DIMS_OFFSET + sizeof(size_t))
#define STRIDES_OFFSET(dims) \
(SIZES_OFFSET(dims) + dims * sizeof(size_t)) // Location of strides
#define MKL_LAYOUT_OFFSET(dims) \
@@ -319,266 +309,6 @@ class MklShape {
nullptr; // TF dimension corresponding to this MKL dimension
};
-#ifdef INTEL_MKL_DNN
-
-// Forward decl
-TensorFormat MklDnnDataFormatToTFDataFormat(memory::format format);
-
-class MklDnnShape {
- private:
- typedef struct {
- /// Flag to indicate if the tensor is an MKL tensor or not
- bool is_mkl_tensor_ = false;
- /// Number of dimensions in Tensorflow format
- size_t dimension_ = 0;
- /// Required by MKLDNN for conversions
- mkldnn_dims_t sizes_; // Required by MKL for conversions
- memory::format tf_data_format_ = memory::format::format_undef;
- memory::data_type T_ = memory::data_type::data_undef;
- // MKL layout
- mkldnn_memory_desc_t mkl_md_;
- /// TF dimension corresponding to this MKL dimension
- mkldnn_dims_t map_;
- } MklShapeData;
- MklShapeData data_;
-
- typedef std::remove_extent<mkldnn_dims_t>::type mkldnn_dim_t;
-#define INVALID_DIM_SIZE -1
-
-
- public:
- MklDnnShape() {
- for (size_t i = 0; i < sizeof(data_.sizes_) /
- sizeof(data_.sizes_[0]); ++i) {
- data_.sizes_[i] = -1;
- }
- for (size_t i = 0; i < sizeof(data_.map_) /
- sizeof(data_.map_[0]); ++i) {
- data_.map_[i] = -1;
- }
- }
-
- ~MklDnnShape() {}
- TF_DISALLOW_COPY_AND_ASSIGN(MklDnnShape); // Cannot copy
-
- inline const bool IsMklTensor() const { return data_.is_mkl_tensor_; }
- inline void SetMklTensor(bool is_mkl_tensor) {
- data_.is_mkl_tensor_ = is_mkl_tensor;
- }
-
- inline void SetDimensions(const size_t dimension) {
- data_.dimension_ = dimension;
- }
- inline size_t GetDimension(char dimension)const {
- int index = GetMklDnnTensorDimIndex(dimension);
- CHECK(index >= 0 && index < this->GetDimension())
- << "Invalid index from the dimension: " << index << ", " << dimension;
- return this->DimSize(index);
- }
-
- inline int32 GetMklDnnTensorDimIndex(char dimension)const {
- switch (dimension) {
- case 'N':
- return MklDnnDims::Dim_N;
- case 'C':
- return MklDnnDims::Dim_C;
- case 'H':
- return MklDnnDims::Dim_H;
- case 'W':
- return MklDnnDims::Dim_W;
- default:
- LOG(FATAL) << "Invalid dimension: " << dimension;
- return -1; // Avoid compiler warning about missing return value
- }
- }
-
- inline size_t GetDimension() const { return data_.dimension_; }
- inline const int* GetSizes() const {
- return reinterpret_cast<const int*>(&data_.sizes_[0]);
- }
-
- // Returns an mkldnn::memory::dims object that contains the sizes of this
- // MklDnnShape object.
- inline memory::dims GetSizesAsMklDnnDims() const {
- memory::dims retVal;
- if (data_.is_mkl_tensor_) {
- int dimensions = sizeof(data_.sizes_) / sizeof(data_.sizes_[0]);
- for (size_t i = 0 ; i < dimensions; i++) {
- if (data_.sizes_[i] != INVALID_DIM_SIZE)
- retVal.push_back(data_.sizes_[i]);
- }
- } else {
- CHECK_EQ(data_.is_mkl_tensor_, true);
- }
- return retVal;
- }
-
- inline int64 DimSize(int index) const {
- CHECK_LT(index, sizeof(data_.sizes_)/sizeof(data_.sizes_[0]));
- return data_.sizes_[index];
- }
-
- /// Return TensorShape that describes the Tensorflow shape of the tensor
- /// represented by this MklShape.
- inline TensorShape GetTfShape() {
- CHECK_EQ(data_.is_mkl_tensor_, true);
-
- std::vector<int32> shape(data_.dimension_, -1);
- for (size_t idx = 0; idx < data_.dimension_; ++idx) {
- shape[idx] = data_.sizes_[TfDimIdx(idx)];
- }
-
- TensorShape ts;
- bool ret = TensorShapeUtils::MakeShape(shape, &ts).ok();
- CHECK_EQ(ret, true);
- return ts;
- }
-
- inline void SetElemType(memory::data_type dt) { data_.T_ = dt; }
- inline const memory::data_type GetElemType() { return data_.T_; }
-
- inline void SetMklLayout(memory::primitive_desc* pd) {
- CHECK_NOTNULL(pd);
- data_.mkl_md_ = pd->desc().data;
- }
- inline const memory::desc GetMklLayout() const {
- return memory::desc(data_.mkl_md_);
- }
-
- inline memory::format GetTfDataFormat() const {
- return data_.tf_data_format_;
- }
- /// We don't create primitive_descriptor for TensorFlow layout now.
- /// We use lazy evaluation and create it only when needed.
- inline void SetTfLayout(size_t dims, const memory::dims& sizes,
- memory::format format) {
- CHECK_EQ(dims, sizes.size());
- data_.dimension_ = dims;
- for (size_t ii = 0; ii < dims; ii++) {
- data_.sizes_[ii] = sizes[ii];
- }
- data_.tf_data_format_ = format;
- SetTfDimOrder(dims, format);
- }
- inline const memory::desc GetTfLayout() const {
- memory::dims dims;
- for (size_t ii = 0; ii < data_.dimension_; ii++) {
- dims.push_back(data_.sizes_[ii]);
- }
- return memory::desc(dims, data_.T_, data_.tf_data_format_);
- }
- inline const memory::desc GetCurLayout() const {
- return IsMklTensor() ? GetMklLayout() : GetTfLayout();
- }
-
- // nhasabni - I've removed SetTfDimOrder that was setting default order in
- // case of MKL-ML. We don't need a case of default dimension order because
- // when an operator that does not get data_format attribute gets all inputs
- // in Tensorflow format, it will produce output in Tensorflow format.
- inline void SetTfDimOrder(const size_t dimension, const mkldnn_dims_t map) {
- CHECK(dimension == data_.dimension_);
- for (size_t ii = 0; ii < dimension; ii++) {
- data_.map_[ii] = map[ii];
- }
- }
-
- inline void SetTfDimOrder(const size_t dimension, TensorFormat data_format) {
- // TODO(nhasabni): Why do we restrict this to 4D?
- CHECK_EQ(dimension, 4);
- CHECK(dimension == data_.dimension_);
- data_.map_[GetTensorDimIndex<2>(data_format, 'W')] = MklDnnDims::Dim_W;
- data_.map_[GetTensorDimIndex<2>(data_format, 'H')] = MklDnnDims::Dim_H;
- data_.map_[GetTensorDimIndex<2>(data_format, 'C')] = MklDnnDims::Dim_C;
- data_.map_[GetTensorDimIndex<2>(data_format, 'N')] = MklDnnDims::Dim_N;
- }
-
- inline void SetTfDimOrder(const size_t dimension, memory::format format) {
- TensorFormat data_format = MklDnnDataFormatToTFDataFormat(format);
- SetTfDimOrder(dimension, data_format);
- }
-
- inline const mkldnn_dim_t* GetTfToMklDimMap() const {
- return &data_.map_[0];
- }
- inline size_t TfDimIdx(int index) const { return data_.map_[index]; }
- inline int64 TfDimSize(int index) const {
- return data_.sizes_[TfDimIdx(index)];
- }
-
- /// Query TF-MKL dimension ordering map and check if Tensorflow dimension 'd'
- /// corresponds to MKL's Channel dimension.
- inline bool IsMklChannelDim(int d) const {
- return TfDimIdx(d) == MklDnnDims::Dim_C;
- }
- /// Query TF-MKL dimension ordering map and check if Tensorflow dimension 'd'
- /// corresponds to MKL's Batch dimension.
- inline bool IsMklBatchDim(int d) const {
- return TfDimIdx(d) == MklDnnDims::Dim_N;
- }
- /// Query TF-MKL dimension ordering map and check if Tensorflow dimension 'd'
- /// corresponds to MKL's Width dimension.
- inline bool IsMklWidthDim(int d) const {
- return TfDimIdx(d) == MklDnnDims::Dim_W;
- }
- /// Query TF-MKL dimension ordering map and check if Tensorflow dimension 'd'
- /// corresponds to MKL's Height dimension.
- inline bool IsMklHeightDim(int d) const {
- return TfDimIdx(d) == MklDnnDims::Dim_H;
- }
-
- /// Check if the TF-Mkl dimension ordering map specifies if the input
- /// tensor is in NCHW format.
- inline bool IsTensorInNCHWFormat() const {
- TensorFormat data_format = FORMAT_NCHW;
- return (IsMklBatchDim(GetTensorDimIndex<2>(data_format, 'N')) &&
- IsMklChannelDim(GetTensorDimIndex<2>(data_format, 'C')) &&
- IsMklHeightDim(GetTensorDimIndex<2>(data_format, 'H')) &&
- IsMklWidthDim(GetTensorDimIndex<2>(data_format, 'W')));
- }
-
- /// Check if the TF-Mkl dimension ordering map specifies if the input
- /// tensor is in NHWC format.
- inline bool IsTensorInNHWCFormat() const {
- TensorFormat data_format = FORMAT_NHWC;
- return (IsMklBatchDim(GetTensorDimIndex<2>(data_format, 'N')) &&
- IsMklChannelDim(GetTensorDimIndex<2>(data_format, 'C')) &&
- IsMklHeightDim(GetTensorDimIndex<2>(data_format, 'H')) &&
- IsMklWidthDim(GetTensorDimIndex<2>(data_format, 'W')));
- }
-
- /// The following methods are used for serializing and de-serializing the
- /// contents of the mklshape object.
- /// The data is serialized in this order
- /// is_mkl_tensor_ : dimension_ : sizes_ : map_: format_ : T_ : mkl_pd_;
-
- /// Size of buffer to hold the serialized object, the size is computed by
- /// following above mentioned order
- inline size_t GetSerializeBufferSize() const {
- return sizeof(MklShapeData);
- }
-
- void SerializeMklDnnShape(unsigned char* buf, size_t buf_size) const {
- CHECK(buf_size >= GetSerializeBufferSize())
- << "Buffer size is too small to SerializeMklDnnShape";
- *reinterpret_cast<MklShapeData*>(buf) = data_;
- }
-
- void DeSerializeMklDnnShape(const unsigned char* buf, size_t buf_size) {
- // Make sure buffer holds at least is_mkl_tensor_.
- CHECK(buf_size >= sizeof(data_.is_mkl_tensor_))
- << "Buffer size is too small in DeSerializeMklDnnShape";
-
- const bool is_mkl_tensor = *reinterpret_cast<const bool*>(buf);
- if (is_mkl_tensor) { // If it is an MKL Tensor then read the rest
- CHECK(buf_size >= GetSerializeBufferSize())
- << "Buffer size is too small in DeSerializeMklDnnShape";
- data_ = *reinterpret_cast<const MklShapeData*>(buf);
- }
- }
-};
-
-#endif
-
// List of MklShape objects. Used in Concat/Split layers.
typedef std::vector<MklShape> MklShapeList;
@@ -617,36 +347,6 @@ inline Tensor ConvertMklToTF(OpKernelContext* context, const Tensor& mkl_tensor,
return output_tensor;
}
-#ifdef INTEL_MKL_DNN
-template <typename T>
-inline Tensor ConvertMklToTF(OpKernelContext* context, const Tensor& mkl_tensor,
- const MklDnnShape& mkl_shape) {
- Tensor output_tensor;
- TensorShape output_shape;
-
-#if 0
- // TODO(nhasabni): need to implement
- for (size_t j = 0; j < mkl_shape.GetDimension(); j++) {
- // Outermost to innermost dimension
- output_shape.AddDim(mkl_shape.GetSizes()[mkl_shape.tf_dim_idx(j)]);
- }
-
- // Allocate output tensor.
- context->allocate_temp(DataTypeToEnum<T>::v(), output_shape, &output_tensor);
-
- dnnLayout_t output_layout = static_cast<dnnLayout_t>(mkl_shape.GetTfLayout());
- void* input_buffer = const_cast<T*>(mkl_tensor.flat<T>().data());
- void* output_buffer = const_cast<T*>(output_tensor.flat<T>().data());
-
- if (mkl_tensor.NumElements() != 0) {
- mkl_shape.GetConvertedFlatData(output_layout, input_buffer, output_buffer);
- }
-#endif
-
- return output_tensor;
-}
-#endif
-
// Get the MKL shape from the second string tensor
inline void GetMklShape(OpKernelContext* ctext, int n, MklShape* mklshape) {
mklshape->DeSerializeMklShape(
@@ -659,20 +359,6 @@ inline void GetMklShape(OpKernelContext* ctext, int n, MklShape* mklshape) {
sizeof(uint8));
}
-#ifdef INTEL_MKL_DNN
-inline void GetMklShape(OpKernelContext* ctext, int n,
- MklDnnShape* mklshape) {
- mklshape->DeSerializeMklDnnShape(
- ctext->input(GetTensorMetaDataIndex(n, ctext->num_inputs()))
- .flat<uint8>()
- .data(),
- ctext->input(GetTensorMetaDataIndex(n, ctext->num_inputs()))
- .flat<uint8>()
- .size() *
- sizeof(uint8));
-}
-#endif
-
// Gets the actual input
inline const Tensor& MklGetInput(OpKernelContext* ctext, int n) {
return ctext->input(GetTensorDataIndex(n, ctext->num_inputs()));
@@ -696,27 +382,6 @@ inline void GetMklShapeList(OpKernelContext* ctext, StringPiece name,
}
}
-#ifdef INTEL_MKL_DNN
-/// Get shape of input tensor pointed by 'input_idx' in TensorShape format.
-/// If the input tensor is in MKL layout, then obtains TensorShape from
-/// MklShape.
-inline TensorShape GetTfShape(OpKernelContext* context,
- size_t input_idx) {
- // Sanity check.
- CHECK_NOTNULL(context);
- CHECK_LT(input_idx, context->num_inputs());
-
- MklDnnShape input_mkl_shape;
- GetMklShape(context, input_idx, &input_mkl_shape);
- if (input_mkl_shape.IsMklTensor()) {
- return input_mkl_shape.GetTfShape();
- } else {
- const Tensor& t = MklGetInput(context, input_idx);
- return t.shape();
- }
-}
-#endif
-
// Allocate the second output tensor that will contain
// the MKL shape serialized
inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
@@ -732,23 +397,6 @@ inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
second_tensor->flat<uint8>().size() * sizeof(uint8));
}
-#ifdef INTEL_MKL_DNN
-// Allocate the second output tensor that will contain
-// the MKL shape serialized
-inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
- const MklDnnShape& mkl_shape) {
- Tensor* second_tensor = nullptr;
- TensorShape second_shape;
- second_shape.AddDim(mkl_shape.GetSerializeBufferSize());
- OP_REQUIRES_OK(ctext, ctext->allocate_output(
- GetTensorMetaDataIndex(n, ctext->num_outputs()),
- second_shape, &second_tensor));
- mkl_shape.SerializeMklDnnShape(
- second_tensor->flat<uint8>().data(),
- second_tensor->flat<uint8>().size() * sizeof(uint8));
-}
-#endif
-
// Allocate the output tensor, create a second output tensor that will contain
// the MKL shape serialized
inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
@@ -769,43 +417,9 @@ inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
second_tensor->flat<uint8>().size() * sizeof(uint8));
}
-#ifdef INTEL_MKL_DNN
-// Allocate the output tensor, create a second output tensor that will contain
-// the MKL shape serialized
-inline void AllocateOutputSetMklShape(OpKernelContext* ctext, int n,
- Tensor** output,
- const TensorShape& tf_shape,
- const MklDnnShape& mkl_shape) {
- Tensor* second_tensor = nullptr;
- TensorShape second_shape;
- second_shape.AddDim(mkl_shape.GetSerializeBufferSize());
- OP_REQUIRES_OK(
- ctext, ctext->allocate_output(GetTensorDataIndex(n, ctext->num_outputs()),
- tf_shape, output));
- OP_REQUIRES_OK(ctext, ctext->allocate_output(
- GetTensorMetaDataIndex(n, ctext->num_outputs()),
- second_shape, &second_tensor));
- mkl_shape.SerializeMklDnnShape(
- second_tensor->flat<uint8>().data(),
- second_tensor->flat<uint8>().size() * sizeof(uint8));
-}
-#endif
-
// Allocates a temp tensor and returns the data buffer for temporary storage.
// Currently
-#ifdef INTEL_MKL_DNN
-template <typename T>
-inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
- const memory::primitive_desc& pd, void** buf_out) {
- TensorShape tf_shape;
-
- tf_shape.AddDim(pd.get_size() / sizeof(T) + 1);
- OP_REQUIRES_OK(context, context->allocate_temp(DataTypeToEnum<T>::v(),
- tf_shape, tensor_out));
- *buf_out = static_cast<void*>(tensor_out->flat<T>().data());
-}
-#endif
-
+// we only support F32; we will need to templatize if other types are added.
inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
dnnLayout_t lt_buff, void** buf_out) {
TensorShape tf_shape;
@@ -821,7 +435,7 @@ inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
template <typename T>
inline void AllocTmpBuffer(OpKernelContext* context, Tensor* tensor_out,
- TensorShape tf_shape) {
+ TensorShape tf_shape) {
OP_REQUIRES_OK(context, context->allocate_temp(DataTypeToEnum<T>::v(),
tf_shape, tensor_out));
}
@@ -1055,8 +669,6 @@ inline bool MklCompareShapes(const TensorShape* input_shape_0,
return true;
}
-// These functions do not compile with MKL-DNN since mkl.h is missing.
-// We may need to remove them later.
// TODO(intel_tf): Remove this routine when faster MKL layout conversion is
// out.
inline void MklNHWCToNCHW(const Tensor& input, Tensor** output) {
@@ -1095,11 +707,18 @@ inline void MklNCHWToNHWC(const Tensor& input, Tensor** output) {
#ifdef INTEL_MKL_DNN
+using mkldnn::engine;
+using mkldnn::memory;
+using mkldnn::padding_kind;
+using mkldnn::primitive;
+using mkldnn::reorder;
+
/// Return MKL-DNN data type (memory::data_type) for input type T
///
/// @input None
/// @return memory::data_type corresponding to type T
-template<typename T> static memory::data_type MklDnnType();
+template <typename T>
+static memory::data_type MklDnnType();
/// Instantiation for float type. Add similar instantiations for other
/// type if needed.
@@ -1114,26 +733,15 @@ memory::data_type MklDnnType<float>() {
/// @return: memory::format corresponding to TensorFlow data format;
/// Fails with an error if invalid data format.
inline memory::format TFDataFormatToMklDnnDataFormat(TensorFormat format) {
- if (format == FORMAT_NHWC) return memory::format::nhwc;
- else if (format == FORMAT_NCHW) return memory::format::nchw;
- TF_CHECK_OK(Status(error::Code::INVALID_ARGUMENT,
- "Unsupported data format"));
+ if (format == FORMAT_NHWC)
+ return memory::format::nhwc;
+ else if (format == FORMAT_NCHW)
+ return memory::format::nchw;
+ TF_CHECK_OK(Status(error::Code::INVALID_ARGUMENT, "Unsupported data format"));
// Return to get rid of compiler warning
return memory::format::format_undef;
}
-/// Map MKL-DNN data format to TensorFlow's data format
-///
-/// @input: memory::format
-/// @return: Tensorflow data format corresponding to memory::format
-/// Fails with an error if invalid data format.
-inline TensorFormat MklDnnDataFormatToTFDataFormat(memory::format format) {
- if (format == memory::format::nhwc) return FORMAT_NHWC;
- else if (format == memory::format::nchw) return FORMAT_NCHW;
- TF_CHECK_OK(Status(error::Code::INVALID_ARGUMENT,
- "Unsupported data format"));
-}
-
/// Map TensorShape object into memory::dims required by MKL-DNN
///
/// This function will simply map input TensorShape into MKL-DNN dims
@@ -1145,7 +753,7 @@ inline TensorFormat MklDnnDataFormatToTFDataFormat(memory::format format) {
/// @return memory::dims corresponding to TensorShape
inline memory::dims TFShapeToMklDnnDims(const TensorShape& shape) {
memory::dims dims(shape.dims());
- for (int d = 0; d < shape.dims(); ++d) {
+ for (unsigned int d = 0; d < shape.dims(); ++d) {
dims[d] = shape.dim_size(d);
}
return dims;
@@ -1161,7 +769,7 @@ inline memory::dims TFShapeToMklDnnDims(const TensorShape& shape) {
/// @input TensorShape object in shape
/// @return memory::dims in MKL-DNN required NCHW format
inline memory::dims TFShapeToMklDnnDimsInNCHW(const TensorShape& shape,
- TensorFormat format) {
+ TensorFormat format) {
// Check validity of format.
CHECK_NE(TFDataFormatToMklDnnDataFormat(format),
memory::format::format_undef);
@@ -1175,43 +783,6 @@ inline memory::dims TFShapeToMklDnnDimsInNCHW(const TensorShape& shape,
return memory::dims({n, c, h, w});
}
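
For orientation, here is a minimal sketch, not part of the patch, of how the surviving shape and format helpers compose. It assumes an INTEL_MKL_DNN build and uses only the helpers shown above; the shape values are illustrative.

```c++
#include "tensorflow/core/util/mkl_util.h"

namespace tensorflow {

void ShapeMappingSketch() {
  // A batch of 8 RGB 32x32 images stored in TensorFlow's NHWC layout.
  TensorShape nhwc_shape({8, 32, 32, 3});

  // Order-preserving mapping: yields {8, 32, 32, 3}.
  memory::dims as_is = TFShapeToMklDnnDims(nhwc_shape);

  // Format-aware mapping: permutes to the NCHW order MKL-DNN expects,
  // yielding {8, 3, 32, 32}.
  memory::dims nchw = TFShapeToMklDnnDimsInNCHW(nhwc_shape, FORMAT_NHWC);

  // Data type and physical layout for a memory descriptor. Note that the
  // dims are logical (NCHW) while the format records the physical layout.
  memory::desc md(nchw, MklDnnType<float>(),
                  TFDataFormatToMklDnnDataFormat(FORMAT_NHWC));
  (void)as_is;
  (void)md;
}

}  // namespace tensorflow
```
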
-/// Map MklDnn memory::dims object into TensorShape object.
-///
-/// This function will simply map input shape in MKL-DNN memory::dims format
-/// in Tensorflow's TensorShape object by perserving dimension order.
-///
-/// @input MKL-DNN memory::dims object
-/// @output TensorShape corresponding to memory::dims
-inline TensorShape MklDnnDimsToTFShape(const memory::dims& dims) {
- std::vector<int32> shape(dims.size(), -1);
- for (int d = 0; d < dims.size(); d++) {
- shape[d] = dims[d];
- }
-
- TensorShape ret;
- CHECK_EQ(TensorShapeUtils::MakeShape(shape, &ret).ok(), true);
- return ret;
-}
-
-/// Function to calculate strides given tensor shape in Tensorflow order
-/// E.g., if dims_tf_order is {1, 2, 3, 4}, then as per Tensorflow convention,
-/// dimesion with size 1 is outermost dimension; while dimension with size 4 is
-/// innermost dimension. So strides for this tensor would be {4 * 3 * 2,
-/// 4 * 3, 4, 1}, i.e., {24, 12, 4, 1}.
-///
-/// @input Tensorflow shape in memory::dims type
-/// @return memory::dims containing strides for the tensor.
-inline memory::dims CalculateTFStrides(const memory::dims& dims_tf_order) {
- CHECK_GT(dims_tf_order.size(), 0);
- memory::dims strides(dims_tf_order.size());
- int last_dim_idx = dims_tf_order.size() - 1;
- strides[last_dim_idx] = 1;
- for (int d = last_dim_idx - 1; d >= 0; d--) {
- strides[d] = strides[d + 1] * dims_tf_order[d + 1];
- }
- return strides;
-}
-
inline padding_kind TFPaddingToMklDnnPadding(Padding pad) {
// MKL-DNN only supports zero padding.
return padding_kind::zero;
@@ -1237,21 +808,23 @@ class MklDnnData {
const engine* cpu_engine_;
public:
- explicit MklDnnData(const engine* e) : user_memory_(nullptr),
- reorder_memory_(nullptr),
- op_md_(nullptr), cpu_engine_(e) {}
+ explicit MklDnnData(const engine* e)
+ : user_memory_(nullptr),
+ reorder_memory_(nullptr),
+ op_md_(nullptr),
+ cpu_engine_(e) {}
~MklDnnData() {
cpu_engine_ = nullptr; // We don't own this.
- delete(user_memory_);
- delete(reorder_memory_);
- delete(op_md_);
+ delete (user_memory_);
+ delete (reorder_memory_);
+ delete (op_md_);
}
- inline void* GetTensorBuffer(const Tensor* tensor) const {
+ void* GetTensorBuffer(const Tensor* tensor) {
CHECK_NOTNULL(tensor);
- return const_cast<void*>(static_cast<const void*>(
- tensor->flat<T>().data()));
+ return const_cast<void*>(
+ static_cast<const void*>(tensor->flat<T>().data()));
}
/// Set user memory primitive using specified dimensions, memory format and
@@ -1262,83 +835,35 @@ class MklDnnData {
/// an operation. E.g., filter of Conv2D is of shape {1, 2, 3, 4}, and
/// memory format HWIO, and the buffer that contains actual values is
/// pointed by data_buffer.
- inline void SetUsrMem(const memory::dims& dim, memory::format fm,
- void* data_buffer = nullptr) {
- auto md = memory::desc(dim, MklDnnType<T>(), fm);
- SetUsrMem(md, data_buffer);
+ void SetUsrMem(memory::dims dim, memory::format fm, void* data_buffer) {
+ CHECK_NOTNULL(data_buffer);
+ CHECK_NOTNULL(cpu_engine_);
+ // TODO(nhasabni): can we remove dynamic memory allocation?
+ user_memory_ =
+ new memory(memory::primitive_desc(
+ memory::desc(dim, MklDnnType<T>(), fm), *cpu_engine_),
+ data_buffer);
}
- inline void SetUsrMem(const memory::dims& dim, memory::format fm,
- const Tensor* tensor) {
+ void SetUsrMem(memory::dims dim, memory::format fm, const Tensor* tensor) {
CHECK_NOTNULL(tensor);
SetUsrMem(dim, fm, GetTensorBuffer(tensor));
}
- /// Helper function to create memory descriptor in Blocked format
- ///
- /// @input: Tensor dimensions
- /// @input: strides corresponding to dimensions. One can use utility
- /// function such as CalculateTFStrides to compute strides
- /// for given dimensions.
- /// @return: memory::desc object corresponding to blocked memory format
- /// for given dimensions and strides.
- static inline memory::desc CreateBlockedMemDesc(const memory::dims& dim,
- const memory::dims& strides) {
- CHECK_EQ(dim.size(), strides.size());
-
- // We have to construct memory descriptor in a C style. This is not at all
- // ideal but MKLDNN does not offer any API to construct descriptor in
- // blocked format except a copy constructor that accepts
- // mkldnn_memory_desc_t.
- mkldnn_memory_desc_t md;
- md.primitive_kind = mkldnn_memory;
- md.ndims = dim.size();
- md.format = mkldnn_blocked;
- md.data_type = memory::convert_to_c(MklDnnType<T>());
-
- for (size_t i = 0; i < dim.size(); i++) {
- md.layout_desc.blocking.block_dims[i] = 1;
- md.layout_desc.blocking.strides[1][i] = 1;
- md.layout_desc.blocking.strides[0][i] = strides[i];
- md.layout_desc.blocking.padding_dims[i] = dim[i];
- md.layout_desc.blocking.offset_padding_to_data[i] = 0;
- md.dims[i] = dim[i];
- }
- md.layout_desc.blocking.offset_padding = 0;
-
- return memory::desc(md);
- }
-
- /// A version of SetUsrMem call that allows user to create memory in blocked
- /// format. So in addition to accepting dimensions, it also accepts strides.
- /// This allows user to create memory for tensor in a format that is not
- /// supported by MKLDNN. E.g., MKLDNN does not support tensor format for 6
- /// dimensional tensor as a native format. But by using blocked format, a user
- /// can create memory for 6D tensor.
- inline void SetUsrMem(const memory::dims& dim, const memory::dims& strides,
- void* data_buffer = nullptr) {
- CHECK_EQ(dim.size(), strides.size());
- auto blocked_md = MklDnnData<T>::CreateBlockedMemDesc(dim, strides);
- SetUsrMem(blocked_md, data_buffer);
- }
-
- inline void SetUsrMem(const memory::dims& dim, const memory::dims& strides,
- const Tensor* tensor) {
- CHECK_NOTNULL(tensor);
- SetUsrMem(dim, strides, GetTensorBuffer(tensor));
- }
-
/// A version of function to set user memory primitive that accepts memory
/// descriptor directly, instead of accepting dimensions and format. This
/// function is more generic that the one above, but the function above is
/// sufficient in most cases.
- inline void SetUsrMem(const memory::desc& md, void* data_buffer = nullptr) {
- auto pd = memory::primitive_desc(md, *cpu_engine_);
- SetUsrMem(pd, data_buffer);
+ void SetUsrMem(memory::desc md, void* data_buffer) {
+ CHECK_NOTNULL(data_buffer);
+ CHECK_NOTNULL(cpu_engine_);
+ // TODO(nhasabni): can we remove dynamic memory allocation?
+ user_memory_ =
+ new memory(memory::primitive_desc(md, *cpu_engine_), data_buffer);
}
/// A version of SetUsrMem with memory descriptor and tensor
- inline void SetUsrMem(const memory::desc& md, const Tensor* tensor) {
+ void SetUsrMem(memory::desc md, const Tensor* tensor) {
CHECK_NOTNULL(tensor);
SetUsrMem(md, GetTensorBuffer(tensor));
}
@@ -1347,60 +872,41 @@ class MklDnnData {
/// descriptor directly, instead of accepting dimensions and format. This
/// function is more generic that the one above, but the function above is
/// sufficient in most cases.
- inline void SetUsrMem(const memory::primitive_desc& pd,
- void* data_buffer = nullptr) {
+ void SetUsrMem(memory::primitive_desc pd, void* data_buffer) {
+ CHECK_NOTNULL(data_buffer);
CHECK_NOTNULL(cpu_engine_);
// TODO(nhasabni): can we remove dynamic memory allocation?
- if (data_buffer) {
- user_memory_ = new memory(pd, data_buffer);
- } else {
- user_memory_ = new memory(pd);
- }
+ user_memory_ = new memory(pd, data_buffer);
}
/// A version of SetUsrMem with primitive descriptor and tensor
- inline void SetUsrMem(const memory::primitive_desc& pd,
- const Tensor* tensor) {
+ void SetUsrMem(memory::primitive_desc pd, const Tensor* tensor) {
CHECK_NOTNULL(tensor);
SetUsrMem(pd, GetTensorBuffer(tensor));
}
/// Get function for user memory primitive.
- inline const memory* GetUsrMem() const { return user_memory_; }
+ const memory* GetUsrMem() const { return user_memory_; }
/// Get function for primitive descriptor of user memory primitive.
- inline const memory::primitive_desc GetUsrMemPrimDesc() const {
+ const memory::primitive_desc GetUsrMemPrimDesc() const {
CHECK_NOTNULL(user_memory_);
return user_memory_->get_primitive_desc();
}
/// Get function for descriptor of user memory.
- inline memory::desc GetUsrMemDesc() {
+ memory::desc GetUsrMemDesc() {
// This is ugly. Why MKL-DNN does not provide desc() method of const type??
const memory::primitive_desc pd = GetUsrMemPrimDesc();
return const_cast<memory::primitive_desc*>(&pd)->desc();
}
/// Get function for data buffer of user memory primitive.
- inline void* GetUsrMemDataHandle() const {
+ void* GetUsrMemDataHandle() const {
CHECK_NOTNULL(user_memory_);
return user_memory_->get_data_handle();
}
- /// Set function for data buffer of user memory primitive.
- inline void* SetUsrMemDataHandle(void* data_buffer) {
- CHECK_NOTNULL(user_memory_);
- CHECK_NOTNULL(data_buffer);
- return user_memory_->set_data_handle(data_buffer);
- }
-
- /// Set function for data buffer of user memory primitive.
- inline void SetUsrMemDataHandle(const Tensor* tensor) {
- CHECK_NOTNULL(user_memory_);
- CHECK_NOTNULL(tensor);
- user_memory_->set_data_handle(GetTensorBuffer(tensor));
- }
-
/// Get the memory primitive for input and output of an op. If inputs
/// to an op require reorders, then this function returns memory primitive
/// for reorder. Otherwise, it will return memory primitive for user memory.
@@ -1409,7 +915,7 @@ class MklDnnData {
/// execute Conv2D, we need memory primitive for I and F. Buf if reorder is
/// required for I and F (say I_r is reorder primitive for I; F_r is reorder
/// primitive for F), then we need I_r and F_r to perform Conv2D.
- inline const memory& GetOpMem() const {
+ const memory& GetOpMem() const {
return reorder_memory_ ? *reorder_memory_ : *user_memory_;
}
@@ -1417,32 +923,13 @@ class MklDnnData {
/// format. E.g., For Conv2D, the dimensions would be same as user dimensions
/// but memory::format would be mkldnn::any because we want MKL-DNN to choose
/// best layout/format for given input dimensions.
- inline void SetOpMemDesc(const memory::dims& dim, memory::format fm) {
+ void SetOpMemDesc(const memory::dims& dim, memory::format fm) {
// TODO(nhasabni): can we remove dynamic memory allocation?
op_md_ = new memory::desc(dim, MklDnnType<T>(), fm);
}
/// Get function for memory descriptor for an operation
- inline const memory::desc& GetOpMemDesc() const { return *op_md_; }
-
- /// Predicate that checks if we need to reorder user's memory into memory
- /// pointed by op_pd.
- ///
- /// @input: op_pd - memory primitive descriptor of the given input of an
- /// operation
- /// @return: true in case reorder of input is needed; false, otherwise.
- inline bool IsReorderNeeded(const memory::primitive_desc& op_pd) const {
- CHECK_NOTNULL(user_memory_);
- return op_pd != user_memory_->get_primitive_desc();
- }
-
- /// Function to create a reorder from memory pointed by from to memory pointed
- /// by to. Returns created primitive.
- inline primitive CreateReorder(const memory* from, const memory* to) const {
- CHECK_NOTNULL(from);
- CHECK_NOTNULL(to);
- return reorder(*from, *to);
- }
+ const memory::desc& GetOpMemDesc() const { return *op_md_; }
/// Function to handle input reordering
///
@@ -1458,62 +945,19 @@ class MklDnnData {
/// operation
/// @input: net - net to which to add reorder primitive in case it is needed.
/// @return: true in case reorder of input is needed; false, otherwise.
- inline bool CheckReorderToOpMem(const memory::primitive_desc& op_pd,
- std::vector<primitive>* net) {
+ bool CheckReorderToOpMem(const memory::primitive_desc& op_pd,
+ std::vector<primitive>* net) {
CHECK_NOTNULL(net);
CHECK_NOTNULL(user_memory_);
- if (IsReorderNeeded(op_pd)) {
+ if (op_pd != user_memory_->get_primitive_desc()) {
// TODO(nhasabni): can we remove dynamic memory allocation?
reorder_memory_ = new memory(op_pd);
- net->push_back(CreateReorder(user_memory_, reorder_memory_));
- return true;
- }
- return false;
- }
-
- /// Overloaded version of above function that accepts memory buffer
- /// where output of reorder needs to be stored.
- ///
- /// @input: op_pd - memory primitive descriptor of the given input of an
- /// operation
- /// @reorder_data_handle - memory buffer where output of reorder needs to be
- /// stored. Primitive does not check if buffer is
- /// enough size to write.
- /// @input: net - net to which to add reorder primitive in case it is needed.
- /// @return: true in case reorder of input is needed; false, otherwise.
- inline bool CheckReorderToOpMem(const memory::primitive_desc& op_pd,
- void* reorder_data_handle,
- std::vector<primitive>* net) {
- CHECK_NOTNULL(net);
- CHECK_NOTNULL(reorder_data_handle);
- CHECK_NOTNULL(user_memory_);
- if (IsReorderNeeded(op_pd)) {
- // TODO(nhasabni): can we remove dynamic memory allocation?
- reorder_memory_ = new memory(op_pd, reorder_data_handle);
- net->push_back(CreateReorder(user_memory_, reorder_memory_));
+ net->push_back(reorder(*user_memory_, *reorder_memory_));
return true;
}
return false;
}
- /// Another overloaded version of CheckReorderToOpMem that accepts Tensor
- /// where output of reorder needs to be stored.
- ///
- /// @input: op_pd - memory primitive descriptor of the given input of an
- /// operation
- /// @reorder_tensor - Tensor whose buffer is to be used to store output of
- /// reorder. Primitive does not check if buffer is
- /// enough size to write.
- /// @input: net - net to which to add reorder primitive in case it is needed.
- /// @return: true in case reorder of input is needed; false, otherwise.
- inline bool CheckReorderToOpMem(const memory::primitive_desc& op_pd,
- Tensor* reorder_tensor,
- std::vector<primitive>* net) {
- CHECK_NOTNULL(net);
- CHECK_NOTNULL(reorder_tensor);
- return CheckReorderToOpMem(op_pd, GetTensorBuffer(reorder_tensor), net);
- }
-
/// Function to handle output reorder
///
/// This function performs very similar functionality as input reordering
@@ -1526,10 +970,9 @@ class MklDnnData {
///
/// @input memory primitive descriptor for the given output of an operation
/// @return: true in case reorder of output is needed; false, otherwise.
- inline bool PrepareReorderToUserMemIfReq(
- const memory::primitive_desc& op_pd) {
+ bool PrepareReorderToUserMemIfReq(const memory::primitive_desc& op_pd) {
CHECK_NOTNULL(user_memory_);
- if (IsReorderNeeded(op_pd)) {
+ if (op_pd != user_memory_->get_primitive_desc()) {
// TODO(nhasabni): can we remove dynamic memory allocation?
reorder_memory_ = new memory(op_pd);
return true;
@@ -1544,11 +987,11 @@ class MklDnnData {
/// to the user-specified output buffer.
///
/// @input: net - net to which to add reorder primitive
- inline void InsertReorderToUserMem(std::vector<primitive>* net) {
+ void InsertReorderToUserMem(std::vector<primitive>* net) {
CHECK_NOTNULL(net);
CHECK_NOTNULL(user_memory_);
CHECK_NOTNULL(reorder_memory_);
- net->push_back(CreateReorder(reorder_memory_, user_memory_));
+ net->push_back(reorder(*reorder_memory_, *user_memory_));
}
};
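
To make the slimmed-down API concrete, the following is a minimal sketch, not part of the patch, of the usage pattern the class comments describe: describe user memory, reorder inputs into the op's preferred layout, run the op, and reorder the result back. `op_src_pd` and `op_dst_pd` are hypothetical primitive descriptors obtained from whatever MKL-DNN operation the kernel builds (e.g. a convolution); the op primitive itself is elided.

```c++
#include <vector>
#include "tensorflow/core/util/mkl_util.h"

namespace tensorflow {

void ReorderPatternSketch(const Tensor& src_tensor, Tensor* dst_tensor,
                          const memory::dims& src_dims,
                          const memory::dims& dst_dims,
                          const memory::primitive_desc& op_src_pd,
                          const memory::primitive_desc& op_dst_pd) {
  auto cpu_engine = engine(engine::cpu, 0);
  MklDnnData<float> src(&cpu_engine);
  MklDnnData<float> dst(&cpu_engine);
  std::vector<primitive> net;

  // Describe the user buffers in TensorFlow's NHWC layout.
  src.SetUsrMem(src_dims, memory::format::nhwc, &src_tensor);
  dst.SetUsrMem(dst_dims, memory::format::nhwc, dst_tensor);

  // Appends a reorder to `net` iff the op wants a layout different from the
  // user's; afterwards src.GetOpMem() is the memory the op should consume.
  src.CheckReorderToOpMem(op_src_pd, &net);

  // Stage an output-side reorder target when the op writes another layout.
  bool dst_reordered = dst.PrepareReorderToUserMemIfReq(op_dst_pd);

  // ... push the actual op primitive here, reading src.GetOpMem() and
  // writing dst.GetOpMem() ...

  // Copy the op's result back into the user's buffer if one was staged.
  if (dst_reordered) dst.InsertReorderToUserMem(&net);

  mkldnn::stream(mkldnn::stream::kind::eager).submit(net).wait();
}

}  // namespace tensorflow
```
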
diff --git a/tensorflow/core/util/mkl_util_test.cc b/tensorflow/core/util/mkl_util_test.cc
deleted file mode 100644
index 6aef3d86e9..0000000000
--- a/tensorflow/core/util/mkl_util_test.cc
+++ /dev/null
@@ -1,92 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#ifdef INTEL_MKL
-
-#include "tensorflow/core/util/mkl_util.h"
-
-#include "tensorflow/core/platform/test.h"
-
-namespace tensorflow {
-namespace {
-
-#ifdef INTEL_MKL_DNN
-
-TEST(MklUtilTest, MklDnnTfShape) {
- auto cpu_engine = engine(engine::cpu, 0);
- MklDnnData<float> a(&cpu_engine);
-
- const int N = 1, C = 2, H = 3, W = 4;
- memory::dims a_dims = {N, C, H, W};
- MklDnnShape a_mkldnn_shape;
- a_mkldnn_shape.SetMklTensor(true);
- // Create TF layout in NCHW.
- a_mkldnn_shape.SetTfLayout(a_dims.size(), a_dims, memory::format::nchw);
- TensorShape a_tf_shape_nchw({N, C, H, W});
- TensorShape a_tf_shape_nhwc({N, H, W, C});
- TensorShape a_mkldnn_tf_shape = a_mkldnn_shape.GetTfShape();
- // Check that returned shape is in NCHW format.
- EXPECT_EQ(a_tf_shape_nchw, a_mkldnn_tf_shape);
- EXPECT_NE(a_tf_shape_nhwc, a_mkldnn_tf_shape);
-
- memory::dims b_dims = {N, C, H, W};
- MklDnnShape b_mkldnn_shape;
- b_mkldnn_shape.SetMklTensor(true);
- // Create TF layout in NHWC.
- b_mkldnn_shape.SetTfLayout(b_dims.size(), b_dims, memory::format::nhwc);
- TensorShape b_tf_shape_nhwc({N, H, W, C});
- TensorShape b_tf_shape_nchw({N, C, H, W});
- TensorShape b_mkldnn_tf_shape = b_mkldnn_shape.GetTfShape();
- // Check that returned shape is in NHWC format.
- EXPECT_EQ(b_tf_shape_nhwc, b_mkldnn_tf_shape);
- EXPECT_NE(b_tf_shape_nchw, b_mkldnn_tf_shape);
-}
-
-
-TEST(MklUtilTest, MklDnnBlockedFormatTest) {
- // Let's create 2D tensor of shape {3, 4} with 3 being innermost dimension
- // first (case 1) and then it being outermost dimension (case 2).
- auto cpu_engine = engine(engine::cpu, 0);
-
- // Setting for case 1
- MklDnnData<float> a(&cpu_engine);
- memory::dims dim1 = {3, 4};
- memory::dims strides1 = {1, 3};
- a.SetUsrMem(dim1, strides1);
-
- memory::desc a_md1 = a.GetUsrMemDesc();
- EXPECT_EQ(a_md1.data.ndims, 2);
- EXPECT_EQ(a_md1.data.dims[0], 3);
- EXPECT_EQ(a_md1.data.dims[1], 4);
- EXPECT_EQ(a_md1.data.format, mkldnn_blocked);
-
- // Setting for case 2
- MklDnnData<float> b(&cpu_engine);
- memory::dims dim2 = {3, 4};
- memory::dims strides2 = {4, 1};
- b.SetUsrMem(dim2, strides2);
-
- memory::desc b_md2 = b.GetUsrMemDesc();
- EXPECT_EQ(b_md2.data.ndims, 2);
- EXPECT_EQ(b_md2.data.dims[0], 3);
- EXPECT_EQ(b_md2.data.dims[1], 4);
- EXPECT_EQ(b_md2.data.format, mkldnn_blocked);
-}
-
-#endif // INTEL_MKL_DNN
-} // namespace
-} // namespace tensorflow
-
-#endif // INTEL_MKL
diff --git a/tensorflow/docs_src/api_guides/python/threading_and_queues.md b/tensorflow/docs_src/api_guides/python/threading_and_queues.md
index 8ad4c4c075..ab95ce0af9 100644
--- a/tensorflow/docs_src/api_guides/python/threading_and_queues.md
+++ b/tensorflow/docs_src/api_guides/python/threading_and_queues.md
@@ -3,7 +3,7 @@
Note: In versions of TensorFlow before 1.2, we recommended using multi-threaded,
queue-based input pipelines for performance. Beginning with TensorFlow 1.4,
however, we recommend using the `tf.data` module instead. (See
-@{$datasets$Datasets} for details. In TensorFlow 1.2 and 1.3, the module was
+[Datasets](datasets) for details. In TensorFlow 1.2 and 1.3, the module was
called `tf.contrib.data`.) The `tf.data` module offers an easier-to-use
interface for constructing efficient input pipelines. Furthermore, we've stopped
developing the old multi-threaded, queue-based input pipelines. We've retained
diff --git a/tensorflow/docs_src/get_started/get_started.md b/tensorflow/docs_src/get_started/get_started.md
index be14ab4026..8409962744 100644
--- a/tensorflow/docs_src/get_started/get_started.md
+++ b/tensorflow/docs_src/get_started/get_started.md
@@ -272,7 +272,7 @@ train = optimizer.minimize(loss)
```
```python
-sess.run(init) # reset variables to incorrect defaults.
+sess.run(init) # reset values to incorrect defaults.
for i in range(1000):
sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
@@ -317,7 +317,7 @@ y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
-sess.run(init) # initialize variables with incorrect defaults.
+sess.run(init) # reset values to wrong defaults.
for i in range(1000):
sess.run(train, {x: x_train, y: y_train})
@@ -383,7 +383,7 @@ train_input_fn = tf.estimator.inputs.numpy_input_fn(
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
{"x": x_eval}, y_eval, batch_size=4, num_epochs=1000, shuffle=False)
-# We can invoke 1000 training steps by invoking the method and passing the
+# We can invoke 1000 training steps by calling the method and passing the
# training data set.
estimator.train(input_fn=input_fn, steps=1000)
diff --git a/tensorflow/docs_src/get_started/input_fn.md b/tensorflow/docs_src/get_started/input_fn.md
index 0db5c6143a..9d3af5d96a 100644
--- a/tensorflow/docs_src/get_started/input_fn.md
+++ b/tensorflow/docs_src/get_started/input_fn.md
@@ -191,7 +191,7 @@ import pandas as pd
def get_input_fn_from_pandas(data_set, num_epochs=None, shuffle=True):
return tf.estimator.inputs.pandas_input_fn(
- x=pd.DataFrame(...),
+ x=pd.DataFrame(...),
y=pd.Series(...),
num_epochs=num_epochs,
shuffle=shuffle)
@@ -267,8 +267,8 @@ tf.logging.set_verbosity(tf.logging.INFO)
Define the column names for the data set in `COLUMNS`. To distinguish features
from the label, also define `FEATURES` and `LABEL`. Then read the three CSVs
-([train](http://download.tensorflow.org/data/boston_train.csv),
-[test](http://download.tensorflow.org/data/boston_test.csv), and
+([train](http://download.tensorflow.org/data/boston_train.csv),
+[test](http://download.tensorflow.org/data/boston_test.csv), and
[predict](http://download.tensorflow.org/data/boston_predict.csv)) into _pandas_
`DataFrame`s:
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index df622c6ac5..3a153e8114 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -38,7 +38,7 @@ enable TensorFlow for C:
OS="linux" # Change to "darwin" for macOS
TARGET_DIRECTORY="/usr/local"
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.4.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.4.0-rc1.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index 8b3da49a0d..df43255896 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -38,7 +38,7 @@ steps to install this library and enable TensorFlow for Go:
TF_TYPE="cpu" # Change to "gpu" for GPU support
TARGET_DIRECTORY='/usr/local'
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.4.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.4.0-rc1.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 6eb8158249..f7f2c3cdc7 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -36,7 +36,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.4.0</version>
+ <version>1.4.0-rc1</version>
</dependency>
```
@@ -65,7 +65,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.4.0</version>
+ <version>1.4.0-rc1</version>
</dependency>
</dependencies>
</project>
@@ -124,7 +124,7 @@ refer to the simpler instructions above instead.
Take the following steps to install TensorFlow for Java on Linux or macOS:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.4.0.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.4.0-rc1.jar),
which is the TensorFlow Java Archive (JAR).
2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@@ -143,7 +143,7 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
mkdir -p ./jni
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.4.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.4.0-rc1.tar.gz" |
tar -xz -C ./jni
### Install on Windows
@@ -151,10 +151,10 @@ Take the following steps to install TensorFlow for Java on Linux or macOS:
Take the following steps to install TensorFlow for Java on Windows:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.4.0.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.4.0-rc1.jar),
which is the TensorFlow Java Archive (JAR).
2. Download the following Java Native Interface (JNI) file appropriate for
- [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.4.0.zip).
+ [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.4.0-rc1.zip).
3. Extract this .zip file.
@@ -202,7 +202,7 @@ must be part of your `classpath`. For example, you can include the
downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
as follows:
-<pre><b>javac -cp libtensorflow-1.4.0.jar HelloTF.java</b></pre>
+<pre><b>javac -cp libtensorflow-1.4.0-rc1.jar HelloTF.java</b></pre>
### Running
@@ -216,11 +216,11 @@ two files are available to the JVM:
For example, the following command line executes the `HelloTF` program on Linux
and macOS X:
-<pre><b>java -cp libtensorflow-1.4.0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.4.0-rc1.jar:. -Djava.library.path=./jni HelloTF</b></pre>
And the following command line executes the `HelloTF` program on Windows:
-<pre><b>java -cp libtensorflow-1.4.0.jar;. -Djava.library.path=jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.4.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>
If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
installed TensorFlow for Java and are ready to use the API. If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index f7380bac8a..414ab7b1f7 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -188,7 +188,7 @@ Take the following steps to install TensorFlow with Virtualenv:
Virtualenv environment:
<pre>(tensorflow)$ <b>pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0-cp34-cp34m-linux_x86_64.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common_installation_problems).
@@ -293,7 +293,7 @@ take the following steps:
<pre>
$ <b>sudo pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0-cp34-cp34m-linux_x86_64.whl</b>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc1-cp34-cp34m-linux_x86_64.whl</b>
</pre>
If this step fails, see
@@ -480,7 +480,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
<pre>
(tensorflow)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0-cp34-cp34m-linux_x86_64.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@@ -648,14 +648,14 @@ This section documents the relevant values for Linux installations.
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc1-cp27-none-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc1-cp27-none-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -667,14 +667,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc1-cp34-cp34m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc1-cp34-cp34m-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -686,14 +686,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc1-cp35-cp35m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc1-cp35-cp35m-linux_x86_64.whl
</pre>
@@ -705,14 +705,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc1-cp36-cp36m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc1-cp36-cp36m-linux_x86_64.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index 79b383817b..9a95710bfa 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -114,7 +114,7 @@ Take the following steps to install TensorFlow with Virtualenv:
TensorFlow in the active Virtualenv is as follows:
<pre> $ <b>pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0-py2-none-any.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc1-py2-none-any.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common-installation-problems).
@@ -235,7 +235,7 @@ take the following steps:
issue the following command:
<pre> $ <b>sudo pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0-py2-none-any.whl</b> </pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc1-py2-none-any.whl</b> </pre>
If the preceding command fails, see
[installation problems](#common-installation-problems).
@@ -344,7 +344,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
TensorFlow for Python 2.7:
<pre> (tensorflow)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0-py2-none-any.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc1-py2-none-any.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@@ -517,7 +517,7 @@ This section documents the relevant values for Mac OS installations.
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0-py2-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc1-py2-none-any.whl
</pre>
@@ -525,7 +525,7 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0-py2-none-any.
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0-py3-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc1-py3-none-any.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index aa4ae6c876..6d0dcdcd4a 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -355,10 +355,10 @@ Invoke `pip install` to install that pip package.
The filename of the `.whl` file depends on your platform.
For example, the following command will install the pip package
-for TensorFlow 1.4.0 on Linux:
+for TensorFlow 1.4.0rc1 on Linux:
<pre>
-$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.4.0-py2-none-any.whl</b>
+$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.4.0rc1-py2-none-any.whl</b>
</pre>
## Validate your installation
@@ -447,10 +447,8 @@ Stack Overflow and specify the `tensorflow` tag.
**Linux**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
-<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.5.4</td><td>N/A</td><td>N/A</td></tr>
-<tr><td>tensorflow_gpu-1.4.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.5.4</td><td>6</td><td>8</td></tr>
- <tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
-<tr><td>tensorflow_gpu-1.3.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>6</td><td>8</td></tr>
+<tr><td>tensorflow-1.4.0rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.4.0rc1</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.2.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>5.1</td><td>8</td></tr>
<tr><td>tensorflow-1.1.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.2</td><td>N/A</td><td>N/A</td></tr>
@@ -462,8 +460,7 @@ Stack Overflow and specify the `tensorflow` tag.
**Mac**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
-<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.5.4</td><td>N/A</td><td>N/A</td></tr>
- <tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow-1.4.0rc1</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow-1.1.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.1.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>5.1</td><td>8</td></tr>
@@ -474,10 +471,8 @@ Stack Overflow and specify the `tensorflow` tag.
**Windows**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
-<tr><td>tensorflow-1.4.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
-<tr><td>tensorflow_gpu-1.4.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>6</td><td>8</td></tr>
-<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
-<tr><td>tensorflow_gpu-1.3.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>6</td><td>8</td></tr>
+<tr><td>tensorflow-1.4.0rc1</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.4.0rc1</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.2.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>5.1</td><td>8</td></tr>
<tr><td>tensorflow-1.1.0</td><td>CPU</td><td>3.5</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
diff --git a/tensorflow/docs_src/mobile/prepare_models.md b/tensorflow/docs_src/mobile/prepare_models.md
index 8fc65be35a..c5a560e074 100644
--- a/tensorflow/docs_src/mobile/prepare_models.md
+++ b/tensorflow/docs_src/mobile/prepare_models.md
@@ -296,6 +296,6 @@ complains about missing header files, add the .h’s that are needed into
the
[`android_extended_ops`](https://www.tensorflow.org/code/tensorflow/core/kernels/BUILD#L3525) target.
-If you’re using a makefile targeting iOS, Raspberry Pi, etc, go to
+If you’re using a makefile targeting iOS, Raspberry Pi, etc., go to
[`tensorflow/contrib/makefile/tf_op_files.txt`](https://www.tensorflow.org/code/tensorflow/contrib/makefile/tf_op_files.txt) and
add the right implementation files there.
diff --git a/tensorflow/docs_src/programmers_guide/debugger.md b/tensorflow/docs_src/programmers_guide/debugger.md
index 25cb72008d..1f856bbf3f 100644
--- a/tensorflow/docs_src/programmers_guide/debugger.md
+++ b/tensorflow/docs_src/programmers_guide/debugger.md
@@ -9,19 +9,11 @@ lets you view the internal structure and states of running TensorFlow graphs
during training and inference, which is difficult to debug with general-purpose
debuggers such as Python's `pdb` due to TensorFlow's computation-graph paradigm.
-> NOTE: TensorFlow debugger uses a
-> [curses](https://en.wikipedia.org/wiki/Curses_\(programming_library\))-based
-> text user interface. On Mac OS X, the `ncurses` library is required and can
-> be installed with `brew install homebrew/dupes/ncurses`. On Windows, curses
-> isn't as well supported, so a
-> [readline](https://en.wikipedia.org/wiki/GNU_Readline)-based interface can
-> be used with tfdbg by installing `pyreadline` with pip.
-> If you use Anaconda3, you can install it with a command
+> NOTE: The system requirements of tfdbg on supported external platforms include
+> the following. On Mac OS X, the `ncurses` library is required. It can be
+> installed with `brew install homebrew/dupes/ncurses`. On Windows, `pyreadline`
+> is required. If you use Anaconda3, you can install it with a command
> such as `"C:\Program Files\Anaconda3\Scripts\pip.exe" install pyreadline`.
-> Unofficial Windows curses packages can be downloaded
-> [here](https://www.lfd.uci.edu/~gohlke/pythonlibs/#curses), then subsequently
-> installed using `pip install <your_version>.whl`, however curses on Windows
-> may not work as reliably as curses on Linux or Mac.
This tutorial demonstrates how to use the **tfdbg** command-line interface
(CLI) to debug the appearance of [`nan`s](https://en.wikipedia.org/wiki/NaN)
@@ -157,7 +149,6 @@ Try the following commands at the `tfdbg>` prompt (referencing the code at
| | `pt <tensor>[slicing]` | Print a subarray of tensor, using [numpy](http://www.numpy.org/)-style array slicing. | `pt hidden/Relu:0[0:50,:]` |
| | `-a` | Print the entirety of a large tensor, without using ellipses. (May take a long time for large tensors.) | `pt -a hidden/Relu:0[0:50,:]` |
| | `-r <range>` | Highlight elements falling into specified numerical range. Multiple ranges can be used in conjunction. | `pt hidden/Relu:0 -a -r [[-inf,-1],[1,inf]]` |
-| | `-n <number>` | Print dump corresponding to specified 0-based dump number. Required for tensors with multiple dumps. | `pt -n 0 hidden/Relu:0` |
| | `-s` | Include a summary of the numeric values of the tensor (applicable only to non-empty tensors with Boolean and numeric types such as `int*` and `float*`.) | `pt -s hidden/Relu:0[0:50,:]` |
| **`@[coordinates]`** | | Navigate to specified element in `pt` output. | `@[10,0]` or `@10,0` |
| **`/regex`** | | [less](https://linux.die.net/man/1/less)-style search for given regular expression. | `/inf` |
@@ -175,12 +166,10 @@ Try the following commands at the `tfdbg>` prompt (referencing the code at
| | `-r` | List the inputs to node, recursively (the input tree.) | `li -r hidden/Relu:0` |
| | `-d <max_depth>` | Limit recursion depth under the `-r` mode. | `li -r -d 3 hidden/Relu:0` |
| | `-c` | Include control inputs. | `li -c -r hidden/Relu:0` |
-| | `-t` | Show op types of input nodes. | `li -t -r hidden/Relu:0` |
| **`lo`** | | **List output recipients of node** | |
| | `-r` | List the output recipients of node, recursively (the output tree.) | `lo -r hidden/Relu:0` |
| | `-d <max_depth>` | Limit recursion depth under the `-r` mode. | `lo -r -d 3 hidden/Relu:0` |
| | `-c` | Include recipients via control edges. | `lo -c -r hidden/Relu:0` |
-| | `-t` | Show op types of recipient nodes. | `lo -t -r hidden/Relu:0` |
| **`ls`** | | **List Python source files involved in node creation.** | |
| | `-p <path_pattern>` | Limit output to source files matching given regular-expression path pattern. | `ls -p .*debug_mnist.*` |
| | `-n` | Limit output to node names matching given regular-expression pattern. | `ls -n Softmax.*` |
diff --git a/tensorflow/docs_src/programmers_guide/tensors.md b/tensorflow/docs_src/programmers_guide/tensors.md
index 88eb277e35..d6f80430cd 100644
--- a/tensorflow/docs_src/programmers_guide/tensors.md
+++ b/tensorflow/docs_src/programmers_guide/tensors.md
@@ -29,8 +29,8 @@ Some types of tensors are special, and these will be covered in other
units of the Programmer's guide. The main ones are:
* `tf.Variable`
- * `tf.constant`
- * `tf.placeholder`
+ * `tf.constant`
+ * `tf.placeholder`
* `tf.SparseTensor`
With the exception of `tf.Variable`, the value of a tensor is immutable, which
@@ -64,7 +64,7 @@ The following snippet demonstrates creating a few rank 0 variables:
mammal = tf.Variable("Elephant", tf.string)
ignition = tf.Variable(451, tf.int16)
floating = tf.Variable(3.14159265359, tf.float64)
-its_complicated = tf.Variable(12.3 - 4.85j, tf.complex64)
+its_complicated = tf.Variable(12.3 - 4.85j, tf.complex64)
```
Note: A string is treated as a single item in TensorFlow, not as a sequence of
@@ -79,7 +79,7 @@ initial value. For example:
mystr = tf.Variable(["Hello"], tf.string)
cool_numbers = tf.Variable([3.14159, 2.71828], tf.float32)
first_primes = tf.Variable([2, 3, 5, 7, 11], tf.int32)
-its_very_complicated = tf.Variable([12.3 - 4.85j, 7.5 - 6.23j], tf.complex64)
+its_very_complicated = tf.Variable([12.3 - 4.85j, 7.5 - 6.23j], tf.complex64)
```
@@ -275,8 +275,8 @@ Graphs and Sessions for more information).
Sometimes it is not possible to evaluate a `tf.Tensor` with no context because
its value might depend on dynamic information that is not available. For
-example, tensors that depend on `placeholder`s can't be evaluated without
-providing a value for the `placeholder`.
+example, tensors that depend on `placeholder`s can't be evaluated without
+providing a value for the `placeholder`.
``` python
p = tf.placeholder(tf.float32)
diff --git a/tensorflow/examples/speech_commands/models.py b/tensorflow/examples/speech_commands/models.py
index ab611f414a..82d6a94ea1 100644
--- a/tensorflow/examples/speech_commands/models.py
+++ b/tensorflow/examples/speech_commands/models.py
@@ -326,7 +326,7 @@ def create_low_latency_conv_model(fingerprint_input, model_settings,
first_filter_height = input_time_size
first_filter_count = 186
first_filter_stride_x = 1
- first_filter_stride_y = 1
+ first_filter_stride_y = 4
first_weights = tf.Variable(
tf.truncated_normal(
[first_filter_height, first_filter_width, 1, first_filter_count],
diff --git a/tensorflow/go/android.go b/tensorflow/go/android.go
deleted file mode 100644
index 3db3ddfec5..0000000000
--- a/tensorflow/go/android.go
+++ /dev/null
@@ -1,20 +0,0 @@
-// Copyright 2016 The TensorFlow Authors. All Rights Reserved.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// +build android
-
-package tensorflow
-
-// #cgo LDFLAGS: -landroid -llog -lm -lz -ldl
-import "C"
diff --git a/tensorflow/go/operation_test.go b/tensorflow/go/operation_test.go
index 40c951ab8c..7cba043af2 100644
--- a/tensorflow/go/operation_test.go
+++ b/tensorflow/go/operation_test.go
@@ -123,14 +123,6 @@ func TestOutputDataTypeAndShape(t *testing.T) {
[]int64{2, 3},
Double,
},
- { // Matrix of Uint64
- [][]uint64{
- {1, 2, 3},
- {4, 5, 6},
- },
- []int64{2, 3},
- Uint64,
- },
}
for idx, test := range testdata {
t.Run(fmt.Sprintf("#%d Value %T", idx, test.Value), func(t *testing.T) {
diff --git a/tensorflow/go/tensor.go b/tensorflow/go/tensor.go
index 1326a95278..36a74c0081 100644
--- a/tensorflow/go/tensor.go
+++ b/tensorflow/go/tensor.go
@@ -101,7 +101,7 @@ func NewTensor(value interface{}) (*Tensor, error) {
return nil, bug("NewTensor incorrectly calculated the size of a tensor with type %v and shape %v as %v bytes instead of %v", dataType, shape, nbytes, buf.Len())
}
} else {
- e := stringEncoder{offsets: buf, data: raw[nflattened*8:], status: newStatus()}
+ e := stringEncoder{offsets: buf, data: raw[nflattened*8 : len(raw)], status: newStatus()}
if err := e.encode(reflect.ValueOf(value), shape); err != nil {
return nil, err
}
@@ -207,9 +207,6 @@ func (t *Tensor) WriteContentsTo(w io.Writer) (int64, error) {
func tensorData(c *C.TF_Tensor) []byte {
// See: https://github.com/golang/go/wiki/cgo#turning-c-arrays-into-go-slices
cbytes := C.TF_TensorData(c)
- if cbytes == nil {
- return nil
- }
length := int(C.TF_TensorByteSize(c))
slice := (*[1 << 30]byte)(unsafe.Pointer(cbytes))[:length:length]
return slice
@@ -313,7 +310,7 @@ func encodeTensor(w *bytes.Buffer, v reflect.Value, shape []int64) error {
if err := w.WriteByte(b); err != nil {
return err
}
- case reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64, reflect.Float32, reflect.Float64, reflect.Complex64, reflect.Complex128:
+ case reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64, reflect.Uint8, reflect.Uint16, reflect.Float32, reflect.Float64, reflect.Complex64, reflect.Complex128:
if err := binary.Write(w, nativeEndian, v.Interface()); err != nil {
return err
}
@@ -352,7 +349,7 @@ func decodeTensor(r *bytes.Reader, shape []int64, typ reflect.Type, ptr reflect.
return err
}
ptr.Elem().SetBool(b == 1)
- case reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64, reflect.Float32, reflect.Float64, reflect.Complex64, reflect.Complex128:
+ case reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64, reflect.Uint8, reflect.Uint16, reflect.Float32, reflect.Float64, reflect.Complex64, reflect.Complex128:
if err := binary.Read(r, nativeEndian, ptr.Interface()); err != nil {
return err
}
diff --git a/tensorflow/go/tensor_test.go b/tensorflow/go/tensor_test.go
index 674a8ce86f..35bd2fd9a5 100644
--- a/tensorflow/go/tensor_test.go
+++ b/tensorflow/go/tensor_test.go
@@ -34,15 +34,11 @@ func TestNewTensor(t *testing.T) {
{nil, int64(5)},
{nil, uint8(5)},
{nil, uint16(5)},
- {nil, uint32(5)},
- {nil, uint64(5)},
{nil, float32(5)},
{nil, float64(5)},
{nil, complex(float32(5), float32(6))},
{nil, complex(float64(5), float64(6))},
{nil, "a string"},
- {[]int64{1}, []uint32{1}},
- {[]int64{1}, []uint64{1}},
{[]int64{2}, []bool{true, false}},
{[]int64{1}, []float64{1}},
{[]int64{1}, [1]float64{1}},
@@ -75,6 +71,11 @@ func TestNewTensor(t *testing.T) {
// native ints not supported
int(5),
[]int{5},
+ // uint32 and uint64 are not supported in TensorFlow
+ uint32(5),
+ []uint32{5},
+ uint64(5),
+ []uint64{5},
// Mismatched dimensions
[][]float32{{1, 2, 3}, {4}},
// Mismatched dimensions. Should return "mismatched slice lengths" error instead of "BUG"
diff --git a/tensorflow/java/src/main/java/org/tensorflow/Shape.java b/tensorflow/java/src/main/java/org/tensorflow/Shape.java
index d533c3d480..9aa92be111 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/Shape.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/Shape.java
@@ -77,24 +77,6 @@ public final class Shape {
return shape[i];
}
- @Override
- public int hashCode() {
- return Arrays.hashCode(shape);
- }
-
- @Override
- public boolean equals(Object obj) {
- if (this == obj) {
- return true;
- }
-
- if (obj instanceof Shape && Arrays.equals(this.shape, ((Shape) obj).shape)) {
- return !hasUnknownDimension();
- }
-
- return super.equals(obj);
- }
-
/** Succinct description of the shape meant for debugging. */
@Override
public String toString() {
@@ -116,18 +98,4 @@ public final class Shape {
}
private long[] shape;
-
- private boolean hasUnknownDimension() {
- if (shape == null) {
- return true;
- }
-
- for (long dimension : shape) {
- if (dimension == -1) {
- return true;
- }
- }
-
- return false;
- }
}
diff --git a/tensorflow/java/src/test/java/org/tensorflow/ShapeTest.java b/tensorflow/java/src/test/java/org/tensorflow/ShapeTest.java
index 92cc3bd60e..3b027700c5 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/ShapeTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/ShapeTest.java
@@ -16,7 +16,6 @@ limitations under the License.
package org.tensorflow;
import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertNotEquals;
import org.junit.Test;
import org.junit.runner.RunWith;
@@ -78,29 +77,4 @@ public class ShapeTest {
assertEquals(5, n.shape().size(1));
}
}
-
- @Test
- public void equalsWorksCorrectly() {
- assertEquals(Shape.scalar(), Shape.scalar());
- assertEquals(Shape.make(1, 2, 3), Shape.make(1, 2, 3));
-
- assertNotEquals(Shape.make(1,2), null);
- assertNotEquals(Shape.make(1,2), new Object());
- assertNotEquals(Shape.make(1, 2, 3), Shape.make(1, 2, 4));
-
-
- assertNotEquals(Shape.unknown(), Shape.unknown());
- assertNotEquals(Shape.make(-1), Shape.make(-1));
- assertNotEquals(Shape.make(1, -1, 3), Shape.make(1, -1, 3));
- }
-
- @Test
- public void hashCodeIsAsExpected() {
- assertEquals(Shape.make(1, 2, 3, 4).hashCode(), Shape.make(1, 2, 3, 4).hashCode());
- assertEquals(Shape.scalar().hashCode(), Shape.scalar().hashCode());
- assertEquals(Shape.unknown().hashCode(), Shape.unknown().hashCode());
-
- assertNotEquals(Shape.make(1, 2).hashCode(), Shape.make(1, 3).hashCode());
- }
}
-
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index 54c43c1337..5ae4aace16 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -5,10 +5,7 @@ package(
default_visibility = [
"//engedu/ml/tf_from_scratch:__pkg__",
"//tensorflow:internal",
- "//tensorflow/contrib/lite/toco/python:__pkg__",
"//tensorflow_models:__subpackages__",
- # TODO(aselle): to pass open source test.
- "//bazel_pip/tensorflow/contrib/lite/toco/python:__pkg__",
],
)
@@ -48,7 +45,6 @@ py_library(
"//tensorflow/compiler/aot/tests:__pkg__", # TODO(b/34059704): remove when fixed
"//tensorflow/contrib/learn:__pkg__", # TODO(b/34059704): remove when fixed
"//tensorflow/contrib/learn/python/learn/datasets:__pkg__", # TODO(b/34059704): remove when fixed
- "//tensorflow/contrib/lite/toco/python:__pkg__", # TODO(b/34059704): remove when fixed
"//tensorflow/python/debug:__pkg__", # TODO(b/34059704): remove when fixed
"//tensorflow/python/tools:__pkg__", # TODO(b/34059704): remove when fixed
"//tensorflow/tools/api/generator:__pkg__",
diff --git a/tensorflow/python/estimator/canned/head.py b/tensorflow/python/estimator/canned/head.py
index fa5d02c476..62fea05867 100644
--- a/tensorflow/python/estimator/canned/head.py
+++ b/tensorflow/python/estimator/canned/head.py
@@ -117,7 +117,7 @@ class _Head(object):
update_op = tf.contrib.layers.optimize_loss(optimizer=sync,
loss=estimator_spec.loss, ...)
hooks = [sync.make_session_run_hook(is_chief)]
- ... update train_op and hooks in EstimatorSpec and return
+ ... update train_op and hooks in EstimatorSpec and return
```
"""
__metaclass__ = abc.ABCMeta
diff --git a/tensorflow/python/estimator/inputs/numpy_io.py b/tensorflow/python/estimator/inputs/numpy_io.py
index 3512f66284..c9f37f06e8 100644
--- a/tensorflow/python/estimator/inputs/numpy_io.py
+++ b/tensorflow/python/estimator/inputs/numpy_io.py
@@ -19,7 +19,6 @@ from __future__ import division
from __future__ import print_function
import collections
-from six import string_types
from tensorflow.python.estimator.inputs.queues import feeding_functions
# Key name to pack the target into dict of `features`. See
@@ -52,9 +51,8 @@ def numpy_input_fn(x,
num_threads=1):
"""Returns input function that would feed dict of numpy arrays into the model.
- This returns a function outputting `features` and `targets` based on the dict
- of numpy arrays. The dict `features` has the same keys as the `x`. The dict
- `targets` has the same keys as the `y` if `y` is a dict.
+ This returns a function outputting `features` and `target` based on the dict
+ of numpy arrays. The dict `features` has the same keys as `x`.
Example:
@@ -71,7 +69,7 @@ def numpy_input_fn(x,
Args:
x: dict of numpy array object.
- y: numpy array object or dict of numpy array object. `None` if absent.
+ y: numpy array object. `None` if absent.
batch_size: Integer, size of batches to return.
num_epochs: Integer, number of epochs to iterate over data. If `None` will
run forever.
@@ -83,13 +81,11 @@ def numpy_input_fn(x,
such as in prediction and evaluation mode, `num_threads` should be 1.
Returns:
- Function, that has signature of ()->(dict of `features`, `targets`)
+ Function, that has signature of ()->(dict of `features`, `target`)
Raises:
ValueError: if the shape of `y` mismatches the shape of values in `x` (i.e.,
values in `x` have same shape).
- ValueError: if duplicate keys are in both `x` and `y` when `y` is a dict.
- ValueError: if x or y is an empty dict.
TypeError: `x` is not a dict or `shuffle` is not bool.
"""
@@ -101,76 +97,43 @@ def numpy_input_fn(x,
"""Numpy input function."""
if not isinstance(x, dict):
raise TypeError('x must be dict; got {}'.format(type(x).__name__))
- if not x:
- raise ValueError('x cannot be empty')
# Make a shallow copy and also ensure the order of iteration is consistent.
- ordered_dict_data = collections.OrderedDict(
+ ordered_dict_x = collections.OrderedDict(
sorted(x.items(), key=lambda t: t[0]))
- # Deep copy keys which is a view in python 3
- feature_keys = list(ordered_dict_data.keys())
-
- if y is None:
- target_keys = None
- elif isinstance(y, dict):
- if not y:
- raise ValueError('y cannot be empty dict, use None instead.')
-
- ordered_dict_y = collections.OrderedDict(
- sorted(y.items(), key=lambda t: t[0]))
- target_keys = list(ordered_dict_y.keys())
-
- duplicate_keys = set(feature_keys).intersection(set(target_keys))
- if len(duplicate_keys):
- raise ValueError('{} duplicate keys are found in both x and y: '
- '{}'.format(len(duplicate_keys), duplicate_keys))
-
- ordered_dict_data.update(ordered_dict_y)
- else:
- target_keys = _get_unique_target_key(ordered_dict_data)
- ordered_dict_data[target_keys] = y
-
- if len(set(v.shape[0] for v in ordered_dict_data.values())) != 1:
- shape_dict_of_x = {k: ordered_dict_data[k].shape
- for k in feature_keys}
-
- if target_keys is None:
- shape_of_y = None
- elif isinstance(target_keys, string_types):
- shape_of_y = y.shape
- else:
- shape_of_y = {k: ordered_dict_data[k].shape
- for k in target_keys}
+ unique_target_key = _get_unique_target_key(ordered_dict_x)
+ if y is not None:
+ ordered_dict_x[unique_target_key] = y
+
+ if len(set(v.shape[0] for v in ordered_dict_x.values())) != 1:
+ shape_dict_of_x = {k: ordered_dict_x[k].shape
+ for k in ordered_dict_x.keys()}
+ shape_of_y = None if y is None else y.shape
raise ValueError('Length of tensors in x and y is mismatched. All '
'elements in x and y must have the same length.\n'
'Shapes in x: {}\n'
- 'Shapes in y: {}\n'.format(shape_dict_of_x, shape_of_y))
+ 'Shape for y: {}\n'.format(shape_dict_of_x, shape_of_y))
queue = feeding_functions._enqueue_data( # pylint: disable=protected-access
- ordered_dict_data,
+ ordered_dict_x,
queue_capacity,
shuffle=shuffle,
num_threads=num_threads,
enqueue_size=batch_size,
num_epochs=num_epochs)
- batch = (queue.dequeue_many(batch_size) if num_epochs is None
+ features = (queue.dequeue_many(batch_size) if num_epochs is None
else queue.dequeue_up_to(batch_size))
- # Remove the first `Tensor` in `batch`, which is the row number.
- if len(batch) > 0:
- batch.pop(0)
+ # Remove the first `Tensor` in `features`, which is the row number.
+ if len(features) > 0:
+ features.pop(0)
- features = dict(zip(feature_keys, batch[:len(feature_keys)]))
- if target_keys is None:
- # TODO(martinwicke), return consistent result
- return features
- elif isinstance(target_keys, string_types):
- target = batch[-1]
- return features, target
- else:
- target = dict(zip(target_keys, batch[-len(target_keys):]))
+ features = dict(zip(ordered_dict_x.keys(), features))
+ if y is not None:
+ target = features.pop(unique_target_key)
return features, target
+ return features
return input_fn
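
For orientation, a minimal sketch of calling the simplified `numpy_input_fn` after this change — `y` is now a single numpy array (dict-valued `y` is no longer accepted), and the feature names below follow the removed tests rather than any fixed API contract:

    import numpy as np
    from tensorflow.python.estimator.inputs import numpy_io

    x = {'a': np.arange(4) * 1.0, 'b': np.arange(32, 36)}
    y = np.arange(-32, -28)

    # input_fn() builds queue-backed tensors: a dict of feature tensors keyed
    # like x, plus the target tensor. With y=None only the dict is returned.
    input_fn = numpy_io.numpy_input_fn(
        x, y, batch_size=2, shuffle=False, num_epochs=1)
    features, target = input_fn()
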
diff --git a/tensorflow/python/estimator/inputs/numpy_io_test.py b/tensorflow/python/estimator/inputs/numpy_io_test.py
index 65eae7a7dc..02df22b632 100644
--- a/tensorflow/python/estimator/inputs/numpy_io_test.py
+++ b/tensorflow/python/estimator/inputs/numpy_io_test.py
@@ -239,40 +239,6 @@ class NumpyIoTest(test.TestCase):
x, y, batch_size=2, shuffle=False, num_epochs=1)
failing_input_fn()
- def testNumpyInputFnWithXIsEmptyDict(self):
- x = {}
- y = np.arange(4)
- with self.test_session():
- with self.assertRaisesRegexp(ValueError, 'x cannot be empty'):
- failing_input_fn = numpy_io.numpy_input_fn(x, y, shuffle=False)
- failing_input_fn()
-
- def testNumpyInputFnWithYIsNone(self):
- a = np.arange(4) * 1.0
- b = np.arange(32, 36)
- x = {'a': a, 'b': b}
- y = None
-
- with self.test_session() as session:
- input_fn = numpy_io.numpy_input_fn(
- x, y, batch_size=2, shuffle=False, num_epochs=1)
- features_tensor = input_fn()
-
- coord = coordinator.Coordinator()
- threads = queue_runner_impl.start_queue_runners(session, coord=coord)
-
- feature = session.run(features_tensor)
- self.assertEqual(len(feature), 2)
- self.assertAllEqual(feature['a'], [0, 1])
- self.assertAllEqual(feature['b'], [32, 33])
-
- session.run([features_tensor])
- with self.assertRaises(errors.OutOfRangeError):
- session.run([features_tensor])
-
- coord.request_stop()
- coord.join(threads)
-
def testNumpyInputFnWithNonBoolShuffle(self):
x = np.arange(32, 36)
y = np.arange(4)
@@ -319,59 +285,6 @@ class NumpyIoTest(test.TestCase):
num_epochs=1)
failing_input_fn()
- def testNumpyInputFnWithYAsDict(self):
- a = np.arange(4) * 1.0
- b = np.arange(32, 36)
- x = {'a': a, 'b': b}
- y = {'y1': np.arange(-32, -28), 'y2': np.arange(32, 28, -1)}
-
- with self.test_session() as session:
- input_fn = numpy_io.numpy_input_fn(
- x, y, batch_size=2, shuffle=False, num_epochs=1)
- features_tensor, targets_tensor = input_fn()
-
- coord = coordinator.Coordinator()
- threads = queue_runner_impl.start_queue_runners(session, coord=coord)
-
- features, targets = session.run([features_tensor, targets_tensor])
- self.assertEqual(len(features), 2)
- self.assertAllEqual(features['a'], [0, 1])
- self.assertAllEqual(features['b'], [32, 33])
- self.assertEqual(len(targets), 2)
- self.assertAllEqual(targets['y1'], [-32, -31])
- self.assertAllEqual(targets['y2'], [32, 31])
-
- session.run([features_tensor, targets_tensor])
- with self.assertRaises(errors.OutOfRangeError):
- session.run([features_tensor, targets_tensor])
-
- coord.request_stop()
- coord.join(threads)
-
- def testNumpyInputFnWithYIsEmptyDict(self):
- a = np.arange(4) * 1.0
- b = np.arange(32, 36)
- x = {'a': a, 'b': b}
- y = {}
- with self.test_session():
- with self.assertRaisesRegexp(ValueError, 'y cannot be empty'):
- failing_input_fn = numpy_io.numpy_input_fn(x, y, shuffle=False)
- failing_input_fn()
-
- def testNumpyInputFnWithDuplicateKeysInXAndY(self):
- a = np.arange(4) * 1.0
- b = np.arange(32, 36)
- x = {'a': a, 'b': b}
- y = {'y1': np.arange(-32, -28),
- 'a': a,
- 'y2': np.arange(32, 28, -1),
- 'b': b}
- with self.test_session():
- with self.assertRaisesRegexp(
- ValueError, '2 duplicate keys are found in both x and y'):
- failing_input_fn = numpy_io.numpy_input_fn(x, y, shuffle=False)
- failing_input_fn()
-
if __name__ == '__main__':
test.main()
diff --git a/tensorflow/python/framework/ops.py b/tensorflow/python/framework/ops.py
index dc4ffb1747..2785aed13e 100644
--- a/tensorflow/python/framework/ops.py
+++ b/tensorflow/python/framework/ops.py
@@ -860,10 +860,6 @@ def convert_to_tensor(value, dtype=None, name=None, preferred_dtype=None):
inputs, which allows those ops to accept numpy arrays, Python lists,
and scalars in addition to `Tensor` objects.
- Note: This function diverges from default Numpy behavior for `float` and
- `string` types when `None` is present in a Python list or scalar. Rather
- than silently converting `None` values, an error will be thrown.
-
Args:
value: An object whose type has a registered `Tensor` conversion function.
dtype: Optional element type for the returned tensor. If missing, the
diff --git a/tensorflow/python/framework/tensor_util.py b/tensorflow/python/framework/tensor_util.py
index e283542172..7e74c19124 100644
--- a/tensorflow/python/framework/tensor_util.py
+++ b/tensorflow/python/framework/tensor_util.py
@@ -286,7 +286,6 @@ _TF_TO_IS_OK = {
dtypes.bool: [_FilterBool],
dtypes.complex128: [_FilterComplex],
dtypes.complex64: [_FilterComplex],
- dtypes.float16: [_FilterFloat],
dtypes.float32: [_FilterFloat],
dtypes.float64: [_FilterFloat],
dtypes.int16: [_FilterInt],
diff --git a/tensorflow/python/framework/test_util.py b/tensorflow/python/framework/test_util.py
index 1610214d54..cfa5fe5e3e 100644
--- a/tensorflow/python/framework/test_util.py
+++ b/tensorflow/python/framework/test_util.py
@@ -986,9 +986,8 @@ class TensorFlowTestCase(googletest.TestCase):
err: A float value.
msg: An optional string message to append to the failure message.
"""
- # f1 == f2 is needed here as we might have: f1, f2 = inf, inf
self.assertTrue(
- f1 == f2 or math.fabs(f1 - f2) <= err,
+ math.fabs(f1 - f2) <= err,
"%f != %f +/- %f%s" % (f1, f2, err, " (%s)" % msg
if msg is not None else ""))
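
One consequence of dropping the `f1 == f2` short-circuit is worth spelling out: for `f1 = f2 = inf`, `math.fabs(f1 - f2)` is NaN, and `nan <= err` is False, so `assertNear(inf, inf, err)` now fails. A quick plain-Python illustration:

    import math
    inf = float("inf")
    print(math.fabs(inf - inf))          # nan
    print(math.fabs(inf - inf) <= 1e-6)  # False -- the assertion would fire
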
diff --git a/tensorflow/python/kernel_tests/array_ops_test.py b/tensorflow/python/kernel_tests/array_ops_test.py
index 76b80e60ea..6eb9c66d06 100644
--- a/tensorflow/python/kernel_tests/array_ops_test.py
+++ b/tensorflow/python/kernel_tests/array_ops_test.py
@@ -107,41 +107,22 @@ class BooleanMaskTest(test_util.TensorFlowTestCase):
def setUp(self):
self.rng = np.random.RandomState(42)
- def CheckVersusNumpy(self, ndims_mask, arr_shape, make_mask=None, axis=None):
+ def CheckVersusNumpy(self, ndims_mask, arr_shape, make_mask=None):
"""Check equivalence between boolean_mask and numpy masking."""
if make_mask is None:
make_mask = lambda shape: self.rng.randint(0, 2, size=shape).astype(bool)
arr = np.random.rand(*arr_shape)
mask = make_mask(arr_shape[:ndims_mask])
- if axis is not None:
- mask = make_mask(arr_shape[axis:ndims_mask+axis])
- if axis is None or axis == 0:
- masked_arr = arr[mask]
- elif axis == 1:
- masked_arr = arr[:,mask]
- elif axis == 2:
- masked_arr = arr[:,:,mask]
- with self.test_session() as sess:
- masked_tensor = array_ops.boolean_mask(arr, mask, axis=axis)
+ masked_arr = arr[mask]
+ with self.test_session():
+ masked_tensor = array_ops.boolean_mask(arr, mask)
# Leading dimension size of masked_tensor is always unknown until runtime
# since we don't know how many elements will be kept.
- leading = 1 if axis is None else axis + 1
- self.assertAllEqual(masked_tensor.get_shape()[leading:],
- masked_arr.shape[leading:])
+ self.assertAllEqual(masked_tensor.get_shape()[1:], masked_arr.shape[1:])
self.assertAllClose(masked_arr, masked_tensor.eval())
- def testMaskDim1ArrDim2Axis1(self):
- ndims_mask = 1
- for arr_shape in [(1, 1), (2, 2), (2, 5)]:
- self.CheckVersusNumpy(ndims_mask, arr_shape, axis=1)
-
- def testMaskDim2ArrDim2Axis1(self):
- ndims_mask = 2
- for arr_shape in [(1, 1), (2, 2), (2, 5)]:
- self.CheckVersusNumpy(ndims_mask, arr_shape, axis=1)
-
def testMaskDim1ArrDim1(self):
ndims_mask = 1
for arr_shape in [(1,), (2,), (3,), (10,)]:
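
The deleted `axis` cases pinned `boolean_mask` to numpy's advanced indexing: `boolean_mask(arr, mask, axis=1)` was checked against `arr[:, mask]`. That correspondence in plain numpy, with hypothetical values:

    import numpy as np
    arr = np.arange(10, dtype=np.float32).reshape(2, 5)
    mask = np.array([True, False, True, False, True])
    # With axis=1 the mask runs along columns, mirroring the removed check:
    print(arr[:, mask])   # shape (2, 3) -- keeps columns 0, 2, 4
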
@@ -505,7 +486,7 @@ class StridedSliceTest(test_util.TensorFlowTestCase):
_ = checker2[...]
_ = checker2[tuple()]
- def testInt64GPU(self):
+ def testFloatSlicedArrayAndInt64IndicesGPU(self):
if not test_util.is_gpu_available():
self.skipTest("No GPU available")
with self.test_session(use_gpu=True, force_gpu=True):
@@ -516,6 +497,17 @@ class StridedSliceTest(test_util.TensorFlowTestCase):
s = array_ops.strided_slice(x, begin, end, strides)
self.assertAllEqual([3.], self.evaluate(s))
+ def testInt64SlicedArrayAndIndicesGPU(self):
+ if not test_util.is_gpu_available():
+ self.skipTest("No GPU available")
+ with self.test_session(use_gpu=True, force_gpu=True):
+ x = constant_op.constant([1, 2, 3], dtype=dtypes.int64)
+ begin = constant_op.constant([2], dtype=dtypes.int64)
+ end = constant_op.constant([3], dtype=dtypes.int64)
+ strides = constant_op.constant([1], dtype=dtypes.int64)
+ s = array_ops.strided_slice(x, begin, end, strides)
+ self.assertAllEqual([3], self.evaluate(s))
+
def testDegenerateSlices(self):
with self.test_session(use_gpu=True):
checker = StridedSliceChecker(self, StridedSliceChecker.REF_TENSOR)
@@ -1078,16 +1070,6 @@ class PadTest(test_util.TensorFlowTestCase):
[0, 0, 4, 5, 6, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
-class InvertPermutationTest(test_util.TensorFlowTestCase):
-
- def testInvertPermutation(self):
- for dtype in [dtypes.int32, dtypes.int64]:
- with self.test_session(use_gpu=True):
- x = constant_op.constant([3, 4, 0, 2, 1], dtype=dtype)
- y = array_ops.invert_permutation(x)
- self.assertAllEqual(y.get_shape(), [5])
- self.assertAllEqual(y.eval(), [2, 4, 3, 0, 1])
-
if __name__ == "__main__":
test_lib.main()
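
For the record, the invariant the deleted `InvertPermutationTest` asserted is `y[x[i]] = i`; reproduced in plain numpy with the test's own vector:

    import numpy as np
    x = np.array([3, 4, 0, 2, 1])
    y = np.empty_like(x)
    y[x] = np.arange(len(x))   # position i goes to slot x[i]
    print(y)                   # [2 4 3 0 1], the removed expected value
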
diff --git a/tensorflow/python/kernel_tests/bincount_op_test.py b/tensorflow/python/kernel_tests/bincount_op_test.py
index 79285476b4..7a610debd1 100644
--- a/tensorflow/python/kernel_tests/bincount_op_test.py
+++ b/tensorflow/python/kernel_tests/bincount_op_test.py
@@ -25,10 +25,11 @@ from tensorflow.python.framework import test_util
from tensorflow.python.ops import math_ops
from tensorflow.python.platform import googletest
+
class BincountTest(test_util.TensorFlowTestCase):
def test_empty(self):
- with self.test_session(use_gpu=True):
+ with self.test_session():
self.assertAllEqual(
math_ops.bincount([], minlength=5).eval(), [0, 0, 0, 0, 0])
self.assertAllEqual(math_ops.bincount([], minlength=1).eval(), [0])
@@ -41,7 +42,7 @@ class BincountTest(test_util.TensorFlowTestCase):
np.float64)
def test_values(self):
- with self.test_session(use_gpu=True):
+ with self.test_session():
self.assertAllEqual(
math_ops.bincount([1, 1, 1, 2, 2, 3]).eval(), [0, 3, 2, 1])
arr = [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]
@@ -56,14 +57,14 @@ class BincountTest(test_util.TensorFlowTestCase):
math_ops.bincount(np.arange(10000)).eval(), np.ones(10000))
def test_maxlength(self):
- with self.test_session(use_gpu=True):
+ with self.test_session():
self.assertAllEqual(math_ops.bincount([5], maxlength=3).eval(), [0, 0, 0])
self.assertAllEqual(math_ops.bincount([1], maxlength=3).eval(), [0, 1])
self.assertAllEqual(math_ops.bincount([], maxlength=3).eval(), [])
def test_random_with_weights(self):
num_samples = 10000
- with self.test_session(use_gpu=True):
+ with self.test_session():
np.random.seed(42)
for dtype in [dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64]:
arr = np.random.randint(0, 1000, num_samples)
@@ -71,29 +72,17 @@ class BincountTest(test_util.TensorFlowTestCase):
weights = np.random.randint(-100, 100, num_samples)
else:
weights = np.random.random(num_samples)
- self.assertAllClose(
+ self.assertAllEqual(
math_ops.bincount(arr, weights).eval(),
np.bincount(arr, weights))
- def test_random_without_weights(self):
- num_samples = 10000
- with self.test_session(use_gpu=True):
- np.random.seed(42)
- for dtype in [np.int32, np.float32]:
- arr = np.random.randint(0, 1000, num_samples)
- weights = np.ones(num_samples).astype(dtype)
- self.assertAllClose(
- math_ops.bincount(arr, None).eval(),
- np.bincount(arr, weights))
-
def test_zero_weights(self):
- with self.test_session(use_gpu=True):
+ with self.test_session():
self.assertAllEqual(
math_ops.bincount(np.arange(1000), np.zeros(1000)).eval(),
np.zeros(1000))
def test_negative(self):
- # unsorted_segment_sum will only report InvalidArgumentError on CPU
with self.test_session():
with self.assertRaises(errors.InvalidArgumentError):
math_ops.bincount([1, 2, 3, -1, 6, 8]).eval()
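
For reference, the numpy semantics these tests compare `math_ops.bincount` against — `out[i]` sums the weights of every occurrence of value `i` — shown with small hypothetical data:

    import numpy as np
    arr = np.array([1, 1, 2, 3])
    w = np.array([0.5, 0.25, 1.0, 2.0])
    print(np.bincount(arr, w))   # [0.  0.75 1.  2. ]
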
diff --git a/tensorflow/python/kernel_tests/bucketize_op_test.py b/tensorflow/python/kernel_tests/bucketize_op_test.py
index e612b1c134..6db3592055 100644
--- a/tensorflow/python/kernel_tests/bucketize_op_test.py
+++ b/tensorflow/python/kernel_tests/bucketize_op_test.py
@@ -31,7 +31,7 @@ class BucketizationOpTest(test.TestCase):
constant_op.constant([-5, 0, 2, 3, 5, 8, 10, 11, 12]),
boundaries=[0, 3, 8, 11])
expected_out = [0, 1, 1, 2, 2, 3, 3, 4, 4]
- with self.test_session(use_gpu=True) as sess:
+ with self.test_session() as sess:
self.assertAllEqual(expected_out, sess.run(op))
def testFloat(self):
@@ -39,7 +39,7 @@ class BucketizationOpTest(test.TestCase):
constant_op.constant([-5., 0., 2., 3., 5., 8., 10., 11., 12.]),
boundaries=[0., 3., 8., 11.])
expected_out = [0, 1, 1, 2, 2, 3, 3, 4, 4]
- with self.test_session(use_gpu=True) as sess:
+ with self.test_session() as sess:
self.assertAllEqual(expected_out, sess.run(op))
def test2DInput(self):
@@ -47,13 +47,13 @@ class BucketizationOpTest(test.TestCase):
constant_op.constant([[-5, 0, 2, 3, 5], [8, 10, 11, 12, 0]]),
boundaries=[0, 3, 8, 11])
expected_out = [[0, 1, 1, 2, 2], [3, 3, 4, 4, 1]]
- with self.test_session(use_gpu=True) as sess:
+ with self.test_session() as sess:
self.assertAllEqual(expected_out, sess.run(op))
def testInvalidBoundariesOrder(self):
op = math_ops._bucketize(
constant_op.constant([-5, 0]), boundaries=[0, 8, 3, 11])
- with self.test_session(use_gpu=True) as sess:
+ with self.test_session() as sess:
with self.assertRaisesRegexp(
errors_impl.InvalidArgumentError, "Expected sorted boundaries"):
sess.run(op)
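
The boundary rule these cases pin down: with boundaries `[0, 3, 8, 11]`, bucket `i` covers `[boundaries[i-1], boundaries[i])`, i.e. a right-bisection. A plain-Python check on the data from `testInt`:

    import bisect
    boundaries = [0, 3, 8, 11]
    values = [-5, 0, 2, 3, 5, 8, 10, 11, 12]
    print([bisect.bisect_right(boundaries, v) for v in values])
    # [0, 1, 1, 2, 2, 3, 3, 4, 4] -- the expected_out above
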
diff --git a/tensorflow/python/kernel_tests/constant_op_test.py b/tensorflow/python/kernel_tests/constant_op_test.py
index 6cbdd4cbb3..6167cb9999 100644
--- a/tensorflow/python/kernel_tests/constant_op_test.py
+++ b/tensorflow/python/kernel_tests/constant_op_test.py
@@ -439,10 +439,9 @@ class ZerosLikeTest(test.TestCase):
def testZerosLikeCPU(self):
for dtype in [
- dtypes_lib.float32, dtypes_lib.float64,
- dtypes_lib.int8, dtypes_lib.uint8, dtypes_lib.int16, dtypes_lib.uint16,
- dtypes_lib.int32, dtypes_lib.int64, dtypes_lib.bool,
- dtypes_lib.complex64, dtypes_lib.complex128,
+ dtypes_lib.float32, dtypes_lib.float64, dtypes_lib.int32,
+ dtypes_lib.uint8, dtypes_lib.int16, dtypes_lib.int8,
+ dtypes_lib.complex64, dtypes_lib.complex128, dtypes_lib.int64,
dtypes_lib.string
]:
self._compareZeros(dtype, fully_defined_shape=False, use_gpu=False)
@@ -574,10 +573,9 @@ class OnesLikeTest(test.TestCase):
def testOnesLike(self):
for dtype in [
- dtypes_lib.float32, dtypes_lib.float64,
- dtypes_lib.int8, dtypes_lib.uint8, dtypes_lib.int16, dtypes_lib.uint16,
- dtypes_lib.int32, dtypes_lib.int64, dtypes_lib.bool,
- dtypes_lib.complex64, dtypes_lib.complex128
+ dtypes_lib.float32, dtypes_lib.float64, dtypes_lib.int32,
+ dtypes_lib.uint8, dtypes_lib.int16, dtypes_lib.int8,
+ dtypes_lib.complex64, dtypes_lib.complex128, dtypes_lib.int64
]:
numpy_dtype = dtype.as_numpy_dtype
with self.test_session():
diff --git a/tensorflow/python/kernel_tests/conv1d_test.py b/tensorflow/python/kernel_tests/conv1d_test.py
index a7e23ead1c..b67a4e3f89 100644
--- a/tensorflow/python/kernel_tests/conv1d_test.py
+++ b/tensorflow/python/kernel_tests/conv1d_test.py
@@ -17,9 +17,6 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-import numpy as np
-from six.moves import xrange # pylint: disable=redefined-builtin
-
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
from tensorflow.python.ops import array_ops
@@ -53,45 +50,5 @@ class Conv1DTest(test.TestCase):
self.assertAllClose(output, [2 * 1 + 1 * 2, 2 * 3 + 1 * 4])
- def testConv1DTranspose(self):
- with self.test_session():
- stride = 2
-
- # Input, output: [batch, width, depth]
- x_shape = [2, 4, 3]
- y_shape = [2, 9, 2]
-
- # Filter: [kernel_width, output_depth, input_depth]
- f_shape = [3, 2, 3]
-
- x = constant_op.constant(
- 1.0, shape=x_shape, name="x", dtype=dtypes.float32)
- f = constant_op.constant(
- 1.0, shape=f_shape, name="filter", dtype=dtypes.float32)
- output = nn_ops.conv1d_transpose(
- x, f, y_shape, stride=stride, padding="VALID")
- value = output.eval()
-
- cache_values = np.zeros(y_shape, dtype=np.float32)
-
- # The amount of padding added
- pad = 1
-
- for n in xrange(x_shape[0]):
- for k in xrange(f_shape[1]):
- for w in xrange(pad, y_shape[1] - pad):
- target = 3.0
- # We add a case for locations divisible by the stride.
- w_in = w % stride == 0 and w > pad and w < y_shape[1] - 1 - pad
- if w_in:
- target += 3.0
- cache_values[n, w, k] = target
-
- # copy values in the border
- cache_values[n, 0, k] = cache_values[n, 1, k]
- cache_values[n, -1, k] = cache_values[n, -2, k]
-
- self.assertAllClose(cache_values, value)
-
if __name__ == "__main__":
test.main()
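
A quick sanity check on the shapes in the removed `conv1d_transpose` test: with VALID padding, a transposed convolution yields out_width = (in_width - 1) * stride + kernel_width = (4 - 1) * 2 + 3 = 9, which is where y_shape = [2, 9, 2] came from.
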
diff --git a/tensorflow/python/kernel_tests/conv_ops_3d_test.py b/tensorflow/python/kernel_tests/conv_ops_3d_test.py
index 116681fc4c..14622ab467 100644
--- a/tensorflow/python/kernel_tests/conv_ops_3d_test.py
+++ b/tensorflow/python/kernel_tests/conv_ops_3d_test.py
@@ -21,8 +21,6 @@ from __future__ import print_function
import collections
import math
-import numpy as np
-
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import test_util
@@ -47,19 +45,8 @@ def GetTestConfigs():
class Conv3DTest(test.TestCase):
- def _DtypesToTest(self, use_gpu):
- if use_gpu:
- if not test_util.CudaSupportsHalfMatMulAndConv():
- return [dtypes.float32]
- else:
- # It is important that float32 comes before float16 here,
- # as we will be using its gradients as reference for fp16 gradients.
- return [dtypes.float32, dtypes.float16]
- else:
- return [dtypes.float64, dtypes.float32, dtypes.float16]
-
def _SetupValuesForDevice(self, tensor_in_sizes, filter_in_sizes, stride,
- padding, data_format, dtype, use_gpu):
+ padding, data_format, use_gpu):
total_size_1 = 1
total_size_2 = 1
for s in tensor_in_sizes:
@@ -67,14 +54,13 @@ class Conv3DTest(test.TestCase):
for s in filter_in_sizes:
total_size_2 *= s
- # Initializes the input tensor with array containing numbers from 0 to 1.
- # We keep the input tensor values fairly small to avoid overflowing a float16
- # tensor during the conv3d
- x1 = [f * 1.0 / total_size_1 for f in range(1, total_size_1 + 1)]
- x2 = [f * 1.0 / total_size_2 for f in range(1, total_size_2 + 1)]
+ # Initializes the input tensor with an array containing numbers
+ # incrementing from 1.
+ x1 = [f * 1.0 for f in range(1, total_size_1 + 1)]
+ x2 = [f * 1.0 for f in range(1, total_size_2 + 1)]
with self.test_session(use_gpu=use_gpu):
- t1 = constant_op.constant(x1, shape=tensor_in_sizes, dtype=dtype)
- t2 = constant_op.constant(x2, shape=filter_in_sizes, dtype=dtype)
+ t1 = constant_op.constant(x1, shape=tensor_in_sizes)
+ t2 = constant_op.constant(x2, shape=filter_in_sizes)
if isinstance(stride, collections.Iterable):
strides = [1] + list(stride) + [1]
@@ -95,35 +81,27 @@ class Conv3DTest(test.TestCase):
expected):
results = []
for data_format, use_gpu in GetTestConfigs():
- for dtype in self._DtypesToTest(use_gpu):
- result = self._SetupValuesForDevice(
- tensor_in_sizes,
- filter_in_sizes,
- stride,
- padding,
- data_format,
- dtype,
- use_gpu=use_gpu)
- results.append(result)
-
+ result = self._SetupValuesForDevice(
+ tensor_in_sizes,
+ filter_in_sizes,
+ stride,
+ padding,
+ data_format,
+ use_gpu=use_gpu)
+ results.append(result)
+ tolerance = 1e-2 if use_gpu else 1e-5
with self.test_session() as sess:
values = sess.run(results)
for value in values:
print("expected = ", expected)
print("actual = ", value)
- tol = 1e-6
- if value.dtype == np.float16:
- tol = 1e-3
-
- self.assertAllClose(expected, value.flatten(), atol=tol,
- rtol=tol)
+ self.assertAllClose(expected, value.flatten(), atol=tolerance,
+ rtol=1e-6)
def testConv3D1x1x1Filter(self):
expected_output = [
- 0.18518519, 0.22222222, 0.25925926, 0.40740741, 0.5 ,
- 0.59259259, 0.62962963, 0.77777778, 0.92592593, 0.85185185,
- 1.05555556, 1.25925926, 1.07407407, 1.33333333, 1.59259259,
- 1.2962963 , 1.61111111, 1.92592593
+ 30.0, 36.0, 42.0, 66.0, 81.0, 96.0, 102.0, 126.0, 150.0, 138.0, 171.0,
+ 204.0, 174.0, 216.0, 258.0, 210.0, 261.0, 312.0
]
# These are equivalent to the Conv2D1x1 case.
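
Where the new integer expected values come from: the inputs now count up from 1 instead of being normalized into [0, 1], so the 1x1x1-filter case reduces to a depth-3 dot product per pixel. A quick check — the [input_depth, output_depth] filter layout here is inferred from the 18-element expected output, so treat it as an assumption:

    # First input pixel [1, 2, 3] against a 3x3 filter filled with 1..9.
    x = [1, 2, 3]
    w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print([sum(x[i] * w[i][o] for i in range(3)) for o in range(3)])
    # [30, 36, 42] -- the first three entries of expected_output
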
@@ -149,10 +127,8 @@ class Conv3DTest(test.TestCase):
# Expected values computed using scipy's correlate function.
def testConv3D2x2x2Filter(self):
expected_output = [
- 3.77199074, 3.85069444, 3.92939815, 4.2650463 , 4.35763889,
- 4.45023148, 6.73032407, 6.89236111, 7.05439815, 7.22337963,
- 7.39930556, 7.57523148, 9.68865741, 9.93402778, 10.17939815,
- 10.18171296, 10.44097222, 10.70023148
+ 19554., 19962., 20370., 22110., 22590., 23070., 34890., 35730., 36570.,
+ 37446., 38358., 39270., 50226., 51498., 52770., 52782., 54126., 55470.
]
# expected_shape = [1, 3, 1, 2, 5]
self._VerifyValues(
@@ -164,19 +140,15 @@ class Conv3DTest(test.TestCase):
def testConv3DStrides(self):
expected_output = [
- 0.06071429, 0.08988095, 0.10238095, 0.11488095, 0.12738095,
- 0.13988095, 0.08452381, 0.26071429, 0.35238095, 0.36488095,
- 0.37738095, 0.38988095, 0.40238095, 0.23452381, 0.46071429,
- 0.61488095, 0.62738095, 0.63988095, 0.65238095, 0.66488095,
- 0.38452381, 1.12738095, 1.48988095, 1.50238095, 1.51488095,
- 1.52738095, 1.53988095, 0.88452381, 1.32738095, 1.75238095,
- 1.76488095, 1.77738095, 1.78988095, 1.80238095, 1.03452381,
- 1.52738095, 2.01488095, 2.02738095, 2.03988095, 2.05238095,
- 2.06488095, 1.18452381, 2.19404762, 2.88988095, 2.90238095,
- 2.91488095, 2.92738095, 2.93988095, 1.68452381, 2.39404762,
- 3.15238095, 3.16488095, 3.17738095, 3.18988095, 3.20238095,
- 1.83452381, 2.59404762, 3.41488095, 3.42738095, 3.43988095,
- 3.45238095, 3.46488095, 1.98452381
+ 102., 151., 172., 193., 214., 235., 142.,
+ 438., 592., 613., 634., 655., 676., 394.,
+ 774., 1033., 1054., 1075., 1096., 1117., 646.,
+ 1894., 2503., 2524., 2545., 2566., 2587., 1486.,
+ 2230., 2944., 2965., 2986., 3007., 3028., 1738.,
+ 2566., 3385., 3406., 3427., 3448., 3469., 1990.,
+ 3686., 4855., 4876., 4897., 4918., 4939., 2830.,
+ 4022., 5296., 5317., 5338., 5359., 5380., 3082.,
+ 4358., 5737., 5758., 5779., 5800., 5821., 3334.,
]
self._VerifyValues(
tensor_in_sizes=[1, 5, 8, 7, 1],
@@ -186,10 +212,7 @@ class Conv3DTest(test.TestCase):
expected=expected_output)
def testConv3D2x2x2FilterStride2(self):
- expected_output = [
- 3.77199074, 3.85069444, 3.92939815, 9.68865741, 9.93402778,
- 10.17939815
- ]
+ expected_output = [19554., 19962., 20370., 50226., 51498., 52770.]
self._VerifyValues(
tensor_in_sizes=[1, 4, 2, 3, 3],
filter_in_sizes=[2, 2, 2, 3, 3],
@@ -199,14 +222,11 @@ class Conv3DTest(test.TestCase):
def testConv3DStride3(self):
expected_output = [
- 1.51140873, 1.57167659, 1.63194444, 1.56349206, 1.62673611,
- 1.68998016, 1.6155754 , 1.68179563, 1.74801587, 1.9280754 ,
- 2.01215278, 2.09623016, 1.98015873, 2.0672123 , 2.15426587,
- 2.03224206, 2.12227183, 2.21230159, 4.4280754 , 4.65500992,
- 4.88194444, 4.48015873, 4.71006944, 4.93998016, 4.53224206,
- 4.76512897, 4.99801587, 4.84474206, 5.09548611, 5.34623016,
- 4.8968254 , 5.15054563, 5.40426587, 4.94890873, 5.20560516,
- 5.46230159
+ 36564., 38022., 39480., 37824., 39354., 40884., 39084., 40686., 42288.,
+ 46644., 48678., 50712., 47904., 50010., 52116., 49164., 51342., 53520.,
+ 107124., 112614., 118104., 108384., 113946., 119508., 109644., 115278.,
+ 120912., 117204., 123270., 129336., 118464., 124602., 130740., 119724.,
+ 125934., 132144.
]
self._VerifyValues(
tensor_in_sizes=[1, 6, 7, 8, 2],
@@ -217,9 +237,8 @@ class Conv3DTest(test.TestCase):
def testConv3D2x2x2FilterStride2Same(self):
expected_output = [
- 3.77199074, 3.85069444, 3.92939815, 2.0162037 , 2.06597222,
- 2.11574074, 9.68865741, 9.93402778, 10.17939815, 4.59953704,
- 4.73263889, 4.86574074
+ 19554., 19962., 20370., 10452., 10710., 10968., 50226., 51498., 52770.,
+ 23844., 24534., 25224.
]
self._VerifyValues(
tensor_in_sizes=[1, 4, 2, 3, 3],
@@ -229,10 +248,7 @@ class Conv3DTest(test.TestCase):
expected=expected_output)
def testKernelSmallerThanStride(self):
- expected_output = [
- 0.03703704, 0.11111111, 0.25925926, 0.33333333, 0.7037037 ,
- 0.77777778, 0.92592593, 1.
- ]
+ expected_output = [1., 3., 7., 9., 19., 21., 25., 27.]
self._VerifyValues(
tensor_in_sizes=[1, 3, 3, 3, 1],
filter_in_sizes=[1, 1, 1, 1, 1],
@@ -247,12 +263,9 @@ class Conv3DTest(test.TestCase):
expected=expected_output)
expected_output = [
- 0.54081633, 0.58017493, 0.28061224, 0.81632653, 0.85568513,
- 0.40306122, 0.41873178, 0.4340379 , 0.19642857, 2.46938776,
- 2.50874636, 1.1377551 , 2.74489796, 2.78425656, 1.26020408,
- 1.16873178, 1.1840379 , 0.51785714, 1.09511662, 1.10604956,
- 0.44642857, 1.17164723, 1.18258017, 0.47704082, 0.3691691 ,
- 0.37244898, 0.125
+ 1484., 1592., 770., 2240., 2348., 1106., 1149., 1191., 539., 6776.,
+ 6884., 3122., 7532., 7640., 3458., 3207., 3249., 1421., 3005., 3035.,
+ 1225., 3215., 3245., 1309., 1013., 1022., 343.
]
self._VerifyValues(
tensor_in_sizes=[1, 7, 7, 7, 1],
@@ -261,10 +274,7 @@ class Conv3DTest(test.TestCase):
padding="SAME",
expected=expected_output)
- expected_output = [
- 0.540816, 0.580175, 0.816327, 0.855685, 2.469388, 2.508746,
- 2.744898, 2.784257
- ]
+ expected_output = [1484., 1592., 2240., 2348., 6776., 6884., 7532., 7640.]
self._VerifyValues(
tensor_in_sizes=[1, 7, 7, 7, 1],
filter_in_sizes=[2, 2, 2, 1, 1],
@@ -278,7 +288,7 @@ class Conv3DTest(test.TestCase):
filter_in_sizes=[2, 1, 2, 1, 2],
stride=1,
padding="VALID",
- expected=[1.5625, 1.875])
+ expected=[50, 60])
def _ConstructAndTestGradientForConfig(
self, batch, input_shape, filter_shape, in_depth, out_depth, stride,
@@ -318,63 +328,50 @@ class Conv3DTest(test.TestCase):
input_data = [x * 1.0 / input_size for x in range(0, input_size)]
filter_data = [x * 1.0 / filter_size for x in range(0, filter_size)]
-
- for data_type in self._DtypesToTest(use_gpu=use_gpu):
+ if test.is_gpu_available() and use_gpu:
+ data_type = dtypes.float32
# TODO(mjanusz): Modify gradient_checker to also provide max relative
# error and synchronize the tolerance levels between the tests for forward
# and backward computations.
- if data_type == dtypes.float64:
- tolerance = 1e-8
- elif data_type == dtypes.float32:
+ if test.is_gpu_available():
tolerance = 5e-3
- elif data_type == dtypes.float16:
- tolerance = 1e-3
-
-
- with self.test_session(use_gpu=use_gpu):
- orig_input_tensor = constant_op.constant(
+ else:
+ # As of Aug 2016, higher tolerance is needed for some CPU architectures.
+ # Runs on a single machine can also generate slightly different errors
+ # because of multithreading.
+ tolerance = 8e-3
+ else:
+ data_type = dtypes.float64
+ tolerance = 1e-8
+ with self.test_session(use_gpu=use_gpu):
+ orig_input_tensor = constant_op.constant(
input_data, shape=input_shape, dtype=data_type, name="input")
- filter_tensor = constant_op.constant(
+ filter_tensor = constant_op.constant(
filter_data, shape=filter_shape, dtype=data_type, name="filter")
- if data_format == "NCDHW":
- input_tensor = test_util.NHWCToNCHW(orig_input_tensor)
- new_strides = test_util.NHWCToNCHW(strides)
- else:
- input_tensor = orig_input_tensor
- new_strides = strides
+ if data_format == "NCDHW":
+ input_tensor = test_util.NHWCToNCHW(orig_input_tensor)
+ strides = test_util.NHWCToNCHW(strides)
+ else:
+ input_tensor = orig_input_tensor
- conv = nn_ops.conv3d(
- input_tensor, filter_tensor, new_strides, padding,
+ conv = nn_ops.conv3d(
+ input_tensor, filter_tensor, strides, padding,
data_format=data_format, name="conv")
- if data_format == "NCDHW":
- conv = test_util.NCHWToNHWC(conv)
-
-
- if test_input:
- jacob_t, jacob_n = gradient_checker.compute_gradient(orig_input_tensor,
- input_shape,
- conv,
- output_shape)
- else:
- jacob_t, jacob_n = gradient_checker.compute_gradient(filter_tensor,
- filter_shape,
- conv,
- output_shape)
-
-
- if data_type != dtypes.float16:
- reference_jacob_t = jacob_t
- err = np.fabs(jacob_t - jacob_n).max()
- else:
- # Compare fp16 theoretical gradients to fp32 theoretical gradients,
- # since fp16 numerical gradients are too imprecise.
- err = np.fabs(jacob_t - reference_jacob_t).max()
-
- print("conv3d gradient error = ", err)
- self.assertLess(err, tolerance)
+ if data_format == "NCDHW":
+ conv = test_util.NCHWToNHWC(conv)
+ if test_input:
+ err = gradient_checker.compute_gradient_error(orig_input_tensor,
+ input_shape,
+ conv, output_shape)
+ else:
+ err = gradient_checker.compute_gradient_error(filter_tensor,
+ filter_shape, conv,
+ output_shape)
+ print("conv3d gradient error = ", err)
+ self.assertLess(err, tolerance)
def ConstructAndTestGradient(self, **kwargs):
for data_format, use_gpu in GetTestConfigs():
diff --git a/tensorflow/python/kernel_tests/depthwise_conv_op_test.py b/tensorflow/python/kernel_tests/depthwise_conv_op_test.py
index f7ae1a0f37..3298092fbe 100644
--- a/tensorflow/python/kernel_tests/depthwise_conv_op_test.py
+++ b/tensorflow/python/kernel_tests/depthwise_conv_op_test.py
@@ -122,9 +122,7 @@ class DepthwiseConv2DTest(test.TestCase):
x1 = [f * 1.0 for f in range(1, total_size_1 + 1)]
x2 = [f * 1.0 for f in range(1, total_size_2 + 1)]
with self.test_session(use_gpu=use_gpu) as sess:
- if data_type == dtypes.float16:
- tolerance = 1e-5
- elif data_type == dtypes.float32:
+ if data_type == dtypes.float32:
tolerance = 1e-5
else:
self.assertEqual(data_type, dtypes.float64)
@@ -171,7 +169,7 @@ class DepthwiseConv2DTest(test.TestCase):
padding) in enumerate(ConfigsToTest()):
print("Testing DepthwiseConv2D,", index, "th config:", input_size, "*",
filter_size, "stride:", stride, "padding:", padding)
- for data_type in [dtypes.float16, dtypes.float32, dtypes.float64]:
+ for data_type in [dtypes.float32, dtypes.float64]:
self._VerifyValues(
input_size, filter_size, stride, padding, data_type, use_gpu=True)
@@ -183,7 +181,7 @@ class DepthwiseConv2DTest(test.TestCase):
padding) in enumerate(ConfigsToTest()):
print("Testing DepthwiseConv2DFormat,", index, "th config:", input_size,
"*", filter_size, "stride:", stride, "padding:", padding)
- for data_type in [dtypes.float16, dtypes.float32, dtypes.float64]:
+ for data_type in [dtypes.float32, dtypes.float64]:
self._VerifyValues(
input_size,
filter_size,
@@ -320,9 +318,7 @@ class DepthwiseConv2DTest(test.TestCase):
input_data = [x * 1.0 / input_size for x in range(0, input_size)]
filter_data = [x * 1.0 / filter_size for x in range(0, filter_size)]
with self.test_session(use_gpu=use_gpu):
- if data_type == dtypes.float16:
- tolerance = 0.002
- elif data_type == dtypes.float32:
+ if data_type == dtypes.float32:
tolerance = 0.002
else:
self.assertEqual(data_type, dtypes.float64)
@@ -373,8 +369,6 @@ class DepthwiseConv2DTest(test.TestCase):
print("Testing DepthwiseConv2DInputGrad,", index, "th config:",
input_size, "*", filter_size, "stride:", stride, "padding:",
padding)
- # Note: float16 test for DepthwiseConv2DInputGrad is not enabled,
- # calculations are not very precise.
for data_type in [dtypes.float32, dtypes.float64]:
self._ConstructAndTestGradient(
input_size,
@@ -395,8 +389,6 @@ class DepthwiseConv2DTest(test.TestCase):
print("Testing DepthwiseConv2DInputGradFormat,", index, "th config:",
input_size, "*", filter_size, "stride:", stride, "padding:",
padding)
- # Note: float16 test for DepthwiseConv2DInputGradFormat is not enabled,
- # calculations are not very precise.
for data_type in [dtypes.float32, dtypes.float64]:
self._ConstructAndTestGradient(
input_size,
@@ -415,8 +407,6 @@ class DepthwiseConv2DTest(test.TestCase):
print("Testing DepthwiseConv2DFilterGrad,", index, "th config:",
input_size, "*", filter_size, "stride:", stride, "padding:",
padding)
- # Note: float16 test for DepthwiseConv2DFilterGrad is not enabled,
- # calculations are not very precise.
for data_type in [dtypes.float32, dtypes.float64]:
self._ConstructAndTestGradient(
input_size,
@@ -437,8 +427,6 @@ class DepthwiseConv2DTest(test.TestCase):
print("Testing DepthwiseConv2DFilterGradFormat,", index, "th config:",
input_size, "*", filter_size, "stride:", stride, "padding:",
padding)
- # Note: float16 test for DepthwiseConv2DFilterGradFormat is not enabled,
- # calculations are not very precise.
for data_type in [dtypes.float32, dtypes.float64]:
self._ConstructAndTestGradient(
input_size,
diff --git a/tensorflow/python/kernel_tests/distributions/BUILD b/tensorflow/python/kernel_tests/distributions/BUILD
index e220d05692..e21446c2ef 100644
--- a/tensorflow/python/kernel_tests/distributions/BUILD
+++ b/tensorflow/python/kernel_tests/distributions/BUILD
@@ -193,7 +193,6 @@ cuda_py_test(
"//tensorflow/python:math_ops",
"//tensorflow/python:platform_test",
],
- tags = ["manual"], # b/69001419
)
cuda_py_test(
diff --git a/tensorflow/python/kernel_tests/distributions/multinomial_test.py b/tensorflow/python/kernel_tests/distributions/multinomial_test.py
index e24e8ade73..ebc89f15c5 100644
--- a/tensorflow/python/kernel_tests/distributions/multinomial_test.py
+++ b/tensorflow/python/kernel_tests/distributions/multinomial_test.py
@@ -250,11 +250,13 @@ class MultinomialTest(test.TestCase):
theta = np.array([[1., 2, 3],
[2.5, 4, 0.01]], dtype=np.float32)
theta /= np.sum(theta, 1)[..., array_ops.newaxis]
- n = np.array([[10., 9.], [8., 7.], [6., 5.]], dtype=np.float32)
+ # Ideally we'd be able to test broadcasting, but the multinomial sampler
+ # doesn't support different total counts.
+ n = np.float32(5)
with self.test_session() as sess:
- # batch_shape=[3, 2], event_shape=[3]
+ # batch_shape=[2], event_shape=[3]
dist = multinomial.Multinomial(n, theta)
- x = dist.sample(int(1000e3), seed=1)
+ x = dist.sample(int(250e3), seed=1)
sample_mean = math_ops.reduce_mean(x, 0)
x_centered = x - sample_mean[array_ops.newaxis, ...]
sample_cov = math_ops.reduce_mean(math_ops.matmul(
@@ -289,9 +291,9 @@ class MultinomialTest(test.TestCase):
def testSampleUnbiasedNonScalarBatch(self):
with self.test_session() as sess:
dist = multinomial.Multinomial(
- total_count=[7., 6., 5.],
+ total_count=5.,
logits=math_ops.log(2. * self._rng.rand(4, 3, 2).astype(np.float32)))
- n = int(3e4)
+ n = int(3e3)
x = dist.sample(n, seed=0)
sample_mean = math_ops.reduce_mean(x, 0)
# Cyclically rotate event dims left.
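
The statistics both sampler tests check are the standard multinomial moments, E[X] = n*p and Cov[X] = n*(diag(p) - p p^T); in numpy form, with p built from the theta row above:

    import numpy as np
    n = 5.0
    p = np.array([1., 2., 3.], dtype=np.float32)
    p /= p.sum()
    mean = n * p                              # compared against sample_mean
    cov = n * (np.diag(p) - np.outer(p, p))   # compared against sample_cov
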
diff --git a/tensorflow/python/kernel_tests/dynamic_partition_op_test.py b/tensorflow/python/kernel_tests/dynamic_partition_op_test.py
index 2460950aa9..4883095707 100644
--- a/tensorflow/python/kernel_tests/dynamic_partition_op_test.py
+++ b/tensorflow/python/kernel_tests/dynamic_partition_op_test.py
@@ -33,8 +33,8 @@ from tensorflow.python.platform import test
class DynamicPartitionTest(test.TestCase):
def testSimpleOneDimensional(self):
- with self.test_session(use_gpu=True) as sess:
- data = constant_op.constant([0, 13, 2, 39, 4, 17], dtype=dtypes.float32)
+ with self.test_session() as sess:
+ data = constant_op.constant([0, 13, 2, 39, 4, 17])
indices = constant_op.constant([0, 0, 2, 3, 2, 1])
partitions = data_flow_ops.dynamic_partition(
data, indices, num_partitions=4)
@@ -52,10 +52,9 @@ class DynamicPartitionTest(test.TestCase):
self.assertEqual([None], partitions[3].get_shape().as_list())
def testSimpleTwoDimensional(self):
- with self.test_session(use_gpu=True) as sess:
+ with self.test_session() as sess:
data = constant_op.constant([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11],
- [12, 13, 14], [15, 16, 17]],
- dtype=dtypes.float32)
+ [12, 13, 14], [15, 16, 17]])
indices = constant_op.constant([0, 0, 2, 3, 2, 1])
partitions = data_flow_ops.dynamic_partition(
data, indices, num_partitions=4)
@@ -72,61 +71,9 @@ class DynamicPartitionTest(test.TestCase):
self.assertEqual([None, 3], partitions[2].get_shape().as_list())
self.assertEqual([None, 3], partitions[3].get_shape().as_list())
- def testLargeOneDimensional(self):
- num = 100000
- data_list = [x for x in range(num)]
- indices_list = [x % 2 for x in range(num)]
- part1 = [x for x in range(num) if x % 2 == 0]
- part2 = [x for x in range(num) if x % 2 == 1]
- with self.test_session(use_gpu=True) as sess:
- data = constant_op.constant(data_list, dtype=dtypes.float32)
- indices = constant_op.constant(indices_list, dtype=dtypes.int32)
- partitions = data_flow_ops.dynamic_partition(
- data, indices, num_partitions=2)
- partition_vals = sess.run(partitions)
-
- self.assertAllEqual(part1, partition_vals[0])
- self.assertAllEqual(part2, partition_vals[1])
-
- def testLargeTwoDimensional(self):
- rows = 100000
- cols = 100
- data_list = [None] * rows
- for i in range(rows):
- data_list[i] = [i for _ in range(cols)]
- num_partitions = 97
- indices_list = [(i ** 2) % num_partitions for i in range(rows)]
- parts = [[] for _ in range(num_partitions)]
- for i in range(rows):
- parts[(i ** 2) % num_partitions].append(data_list[i])
- with self.test_session(use_gpu=True) as sess:
- data = constant_op.constant(data_list, dtype=dtypes.float32)
- indices = constant_op.constant(indices_list, dtype=dtypes.int32)
- partitions = data_flow_ops.dynamic_partition(
- data, indices, num_partitions=num_partitions)
- partition_vals = sess.run(partitions)
-
- for i in range(num_partitions):
- # reshape because of empty parts
- parts_np = np.array(parts[i], dtype=np.float).reshape(-1, cols)
- self.assertAllEqual(parts_np, partition_vals[i])
-
- def testSimpleComplex(self):
- data_list = [1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j]
- indices_list = [1, 0, 1, 0]
- with self.test_session(use_gpu=True) as sess:
- data = constant_op.constant(data_list, dtype=dtypes.complex64)
- indices = constant_op.constant(indices_list, dtype=dtypes.int32)
- partitions = data_flow_ops.dynamic_partition(
- data, indices, num_partitions=2)
- partition_vals = sess.run(partitions)
-
- self.assertAllEqual([3 + 4j, 7 + 8j], partition_vals[0])
- self.assertAllEqual([1 + 2j, 5 + 6j], partition_vals[1])
-
def testHigherRank(self):
np.random.seed(7)
- with self.test_session(use_gpu=True) as sess:
+ with self.test_session() as sess:
for n in 2, 3:
for shape in (4,), (4, 5), (4, 5, 2):
partitions = np.random.randint(n, size=np.prod(shape)).reshape(shape)
@@ -148,49 +95,6 @@ class DynamicPartitionTest(test.TestCase):
self.assertEqual(grads[1], None) # Partitions has no gradients
self.assertAllEqual(7 * data, sess.run(grads[0]))
- def testEmptyParts(self):
- data_list = [1, 2, 3, 4]
- indices_list = [1, 3, 1, 3]
- with self.test_session(use_gpu=True) as sess:
- data = constant_op.constant(data_list, dtype=dtypes.float32)
- indices = constant_op.constant(indices_list, dtype=dtypes.int32)
- partitions = data_flow_ops.dynamic_partition(
- data, indices, num_partitions=4)
- partition_vals = sess.run(partitions)
-
- self.assertAllEqual([], partition_vals[0])
- self.assertAllEqual([1, 3], partition_vals[1])
- self.assertAllEqual([], partition_vals[2])
- self.assertAllEqual([2, 4], partition_vals[3])
-
- def testEmptyDataTwoDimensional(self):
- data_list = [[], []]
- indices_list = [0, 1]
- with self.test_session(use_gpu=True) as sess:
- data = constant_op.constant(data_list, dtype=dtypes.float32)
- indices = constant_op.constant(indices_list, dtype=dtypes.int32)
- partitions = data_flow_ops.dynamic_partition(
- data, indices, num_partitions=3)
- partition_vals = sess.run(partitions)
-
- self.assertAllEqual([[]], partition_vals[0])
- self.assertAllEqual([[]], partition_vals[1])
- self.assertAllEqual(np.array([], dtype=np.float).reshape(0, 0),
- partition_vals[2])
-
- def testEmptyPartitions(self):
- data_list = []
- indices_list = []
- with self.test_session(use_gpu=True) as sess:
- data = constant_op.constant(data_list, dtype=dtypes.float32)
- indices = constant_op.constant(indices_list, dtype=dtypes.int32)
- partitions = data_flow_ops.dynamic_partition(
- data, indices, num_partitions=2)
- partition_vals = sess.run(partitions)
-
- self.assertAllEqual([], partition_vals[0])
- self.assertAllEqual([], partition_vals[1])
-
def testErrorIndexOutOfRange(self):
with self.test_session() as sess:
data = constant_op.constant([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11],
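
For reference, the partitioning rule `dynamic_partition` implements — element `data[i]` goes to output partition `indices[i]`, with order preserved within each partition — sketched in plain Python using the values from `testSimpleOneDimensional`:

    data = [0, 13, 2, 39, 4, 17]
    indices = [0, 0, 2, 3, 2, 1]
    parts = [[] for _ in range(4)]
    for d, i in zip(data, indices):
        parts[i].append(d)
    print(parts)   # [[0, 13], [17], [2, 4], [39]]
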
diff --git a/tensorflow/python/kernel_tests/pooling_ops_test.py b/tensorflow/python/kernel_tests/pooling_ops_test.py
index 150e2ff7f2..a126180414 100644
--- a/tensorflow/python/kernel_tests/pooling_ops_test.py
+++ b/tensorflow/python/kernel_tests/pooling_ops_test.py
@@ -19,7 +19,6 @@ from __future__ import division
from __future__ import print_function
import numpy as np
-import os
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
@@ -1342,33 +1341,11 @@ class PoolingTest(test.TestCase):
return
# Test the GPU implementation that uses cudnn for now.
- saved_nanprop = os.environ.get("TF_ENABLE_MAXPOOL_NANPROP")
- # Do not propagate the diff in cases of NaNs
- os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = "0"
+ # It does not propagate the diff in cases of NaNs
expected_input_backprop_cudnn = [
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0
]
-
- for v2 in [True, False]:
- self._testMaxPoolGradDirect(
- input_data,
- output_backprop,
- expected_input_backprop_cudnn,
- input_sizes=[1, 4, 4, 1],
- output_sizes=[1, 3, 3, 1],
- window_rows=2,
- window_cols=2,
- row_stride=1,
- col_stride=1,
- padding="VALID",
- use_gpu=True,
- v2=v2)
-
- # Propagate the diff in cases of NaNs
- os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = "1"
- expected_input_backprop_cudnn = expected_input_backprop_tf_cpu
-
for v2 in [True, False]:
self._testMaxPoolGradDirect(
input_data,
@@ -1384,11 +1361,6 @@ class PoolingTest(test.TestCase):
use_gpu=True,
v2=v2)
- if saved_nanprop:
- os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = saved_nanprop
- else:
- del os.environ["TF_ENABLE_MAXPOOL_NANPROP"]
-
def _testMaxPoolGradDirectWithNans2_2(self):
input_data = [float("nan")] * 16
output_backprop = [
@@ -1419,14 +1391,11 @@ class PoolingTest(test.TestCase):
return
# Test the GPU implementation that uses cudnn for now.
- saved_nanprop = os.environ.get("TF_ENABLE_MAXPOOL_NANPROP")
- # Do not propagate the diff in cases of NaNs
- os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = "0"
+ # It does not propagate the diff in cases of NaNs
expected_input_backprop_cudnn = [
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0
]
-
for v2 in [True, False]:
self._testMaxPoolGradDirect(
input_data,
@@ -1442,31 +1411,6 @@ class PoolingTest(test.TestCase):
use_gpu=True,
v2=v2)
-
- # Propagate the diff in cases of NaNs
- os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = "1"
- expected_input_backprop_cudnn = expected_input_backprop_tf_cpu
-
- for v2 in [True, False]:
- self._testMaxPoolGradDirect(
- input_data,
- output_backprop,
- expected_input_backprop_cudnn,
- input_sizes=[1, 4, 4, 1],
- output_sizes=[1, 3, 3, 1],
- window_rows=2,
- window_cols=2,
- row_stride=1,
- col_stride=1,
- padding="VALID",
- use_gpu=True,
- v2=v2)
-
- if saved_nanprop:
- os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = saved_nanprop
- else:
- del os.environ["TF_ENABLE_MAXPOOL_NANPROP"]
-
def testMaxPoolGradDirect(self):
self._testMaxPoolGradDirect1_1()
self._testMaxPoolGradDirect1_2()
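
The deleted blocks toggled cudnn's NaN handling through the `TF_ENABLE_MAXPOOL_NANPROP` environment variable and restored the prior value afterwards. The save/flip/restore pattern they used, as a standalone sketch (using an `is not None` check so an empty prior value also round-trips):

    import os
    saved = os.environ.get("TF_ENABLE_MAXPOOL_NANPROP")
    os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = "1"   # propagate NaNs
    try:
        pass  # ... run the NaN-propagating test variant here ...
    finally:
        if saved is not None:
            os.environ["TF_ENABLE_MAXPOOL_NANPROP"] = saved
        else:
            del os.environ["TF_ENABLE_MAXPOOL_NANPROP"]
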
diff --git a/tensorflow/python/kernel_tests/reader_ops_test.py b/tensorflow/python/kernel_tests/reader_ops_test.py
index 8e54d10f32..5630259b7b 100644
--- a/tensorflow/python/kernel_tests/reader_ops_test.py
+++ b/tensorflow/python/kernel_tests/reader_ops_test.py
@@ -35,9 +35,6 @@ from tensorflow.python.ops import data_flow_ops
from tensorflow.python.ops import io_ops
from tensorflow.python.ops import variables
from tensorflow.python.platform import test
-from tensorflow.python.training import coordinator
-from tensorflow.python.training import input as input_lib
-from tensorflow.python.training import queue_runner_impl
from tensorflow.python.util import compat
prefix_path = "tensorflow/core/lib"
@@ -1014,25 +1011,6 @@ class LMDBReaderTest(test.TestCase):
"\\(requested 1, current size 0\\)"):
k, v = sess.run([key, value])
- def testReadFromSameFile(self):
- with self.test_session() as sess:
- reader1 = io_ops.LMDBReader(name="test_read_from_same_file1")
- reader2 = io_ops.LMDBReader(name="test_read_from_same_file2")
- filename_queue = input_lib.string_input_producer([self.db_path],
- num_epochs=None)
- key1, value1 = reader1.read(filename_queue)
- key2, value2 = reader2.read(filename_queue)
-
- coord = coordinator.Coordinator()
- threads = queue_runner_impl.start_queue_runners(sess, coord=coord)
- for i in range(3):
- for j in range(10):
- k1, v1, k2, v2 = sess.run([key1, value1, key2, value2])
- self.assertAllEqual(compat.as_bytes(k1), compat.as_bytes(k2))
- self.assertAllEqual(compat.as_bytes(v1), compat.as_bytes(v2))
- coord.request_stop()
- coord.join(threads)
-
def testReadFromFolder(self):
with self.test_session() as sess:
reader = io_ops.LMDBReader(name="test_read_from_folder")
@@ -1051,25 +1029,6 @@ class LMDBReaderTest(test.TestCase):
"\\(requested 1, current size 0\\)"):
k, v = sess.run([key, value])
- def testReadFromFileRepeatedly(self):
- with self.test_session() as sess:
- reader = io_ops.LMDBReader(name="test_read_from_file_repeated")
- filename_queue = input_lib.string_input_producer([self.db_path],
- num_epochs=None)
- key, value = reader.read(filename_queue)
-
- coord = coordinator.Coordinator()
- threads = queue_runner_impl.start_queue_runners(sess, coord=coord)
- # Iterate over the lmdb 3 times.
- for i in range(3):
- # Go over all 10 records each time.
- for j in range(10):
- k, v = sess.run([key, value])
- self.assertAllEqual(compat.as_bytes(k), compat.as_bytes(str(j)))
- self.assertAllEqual(
- compat.as_bytes(v), compat.as_bytes(str(chr(ord("a") + j))))
- coord.request_stop()
- coord.join(threads)
if __name__ == "__main__":
test.main()
diff --git a/tensorflow/python/kernel_tests/segment_reduction_ops_test.py b/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
index 3a02f24902..516a9d000e 100644
--- a/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
+++ b/tensorflow/python/kernel_tests/segment_reduction_ops_test.py
@@ -323,9 +323,8 @@ class UnsortedSegmentSumTest(SegmentReductionHelper):
def testBadIndices(self):
# Note: GPU kernel does not return the out-of-range error needed for this
# test, so this test is marked as cpu-only.
- # Note: With PR #13055 a negative index will be ignored silently.
with self.test_session(use_gpu=False):
- for bad in [[2]], [[7]]:
+ for bad in [[-1]], [[7]]:
unsorted = math_ops.unsorted_segment_sum([[17]], bad, num_segments=2)
with self.assertRaisesOpError(
r"segment_ids\[0,0\] = %d is out of range \[0, 2\)" % bad[0][0]):
@@ -361,32 +360,6 @@ class UnsortedSegmentSumTest(SegmentReductionHelper):
x_init_value=np_x.astype(np.double), delta=1)
self.assertAllClose(jacob_t, jacob_n)
- def testDropNegatives(self):
- # Note: the test is done by replacing segment_ids with 8 to -1
- # for index and replace values generated by numpy with 0.
- dtypes = [
- dtypes_lib.float32, dtypes_lib.float64, dtypes_lib.int64,
- dtypes_lib.int32, dtypes_lib.complex64, dtypes_lib.complex128
- ]
- indices_flat = np.array([0, 4, 0, 8, 3, 8, 4, 7, 7, 3])
- num_segments = 12
- for indices in indices_flat, indices_flat.reshape(5, 2):
- shape = indices.shape + (2,)
- for dtype in dtypes:
- with self.test_session(use_gpu=True):
- tf_x, np_x = self._input(shape, dtype=dtype)
- np_ans = self._segmentReduce(
- indices, np_x, np.add, op2=None, num_out_rows=num_segments)
- # Replace np_ans[8] with 0 for the value
- np_ans[8:] = 0
- # Replace 8 with -1 in indices
- np.place(indices, indices==8, [-1])
- s = math_ops.unsorted_segment_sum(
- data=tf_x, segment_ids=indices, num_segments=num_segments)
- tf_ans = s.eval()
- self.assertAllClose(np_ans, tf_ans)
- self.assertShapeEqual(np_ans, s)
-
class SparseSegmentReductionHelper(SegmentReductionHelper):
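
With `testBadIndices` back to probing `[[-1]]`, a negative segment id is once again an out-of-range error rather than being silently dropped (the behavior the deleted `testDropNegatives` and its PR note covered). The reduction rule itself, for valid ids, in plain numpy:

    import numpy as np
    # unsorted_segment_sum: out[s] sums data[i] over all i with ids[i] == s.
    data = np.array([1., 2., 3., 4.])
    ids = np.array([0, 0, 1, 1])
    out = np.zeros(2)
    np.add.at(out, ids, data)
    print(out)   # [3. 7.]
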
diff --git a/tensorflow/python/kernel_tests/shape_ops_test.py b/tensorflow/python/kernel_tests/shape_ops_test.py
index 7368251ab6..a9fc699b21 100644
--- a/tensorflow/python/kernel_tests/shape_ops_test.py
+++ b/tensorflow/python/kernel_tests/shape_ops_test.py
@@ -258,16 +258,6 @@ class ShapeOpsTest(test.TestCase):
self.assertAllEqual([True], array_ops.expand_dims(inp, 0).eval())
self.assertAllEqual([True], array_ops.expand_dims(inp, -1).eval())
- def testExpandDimsDimType(self):
- for dtype in [dtypes.int32, dtypes.int64]:
- x = np.zeros([2])
- np_ans = np.expand_dims(x, axis=0)
- with self.test_session(use_gpu=True):
- tensor = array_ops.expand_dims(x, constant_op.constant(0, dtype))
- tf_ans = tensor.eval()
- self.assertShapeEqual(np_ans, tensor)
- self.assertAllEqual(np_ans, tf_ans)
-
def _compareSqueeze(self, x, squeeze_dims, use_gpu):
with self.test_session(use_gpu=use_gpu):
if squeeze_dims:
diff --git a/tensorflow/python/kernel_tests/slice_op_test.py b/tensorflow/python/kernel_tests/slice_op_test.py
index 6cdc7872f9..051a25080b 100644
--- a/tensorflow/python/kernel_tests/slice_op_test.py
+++ b/tensorflow/python/kernel_tests/slice_op_test.py
@@ -217,30 +217,6 @@ class SliceTest(test.TestCase):
self.assertEqual(expected_val.shape, slice_t.get_shape())
self.assertEqual(expected_val.shape, slice2_t.get_shape())
- def testRandomHighRank(self):
- # Random dims of rank 8
- input_shape = np.random.randint(0, 20, size=8)
- inp = np.random.rand(*input_shape).astype("f")
- with self.test_session(use_gpu=True) as sess:
- a = constant_op.constant(
- [float(x) for x in inp.ravel(order="C")],
- shape=input_shape,
- dtype=dtypes.float32)
- indices = [0 if x == 0 else np.random.randint(x) for x in input_shape]
- sizes = [
- np.random.randint(0, input_shape[i] - indices[i] + 1)
- for i in range(8)
- ]
- slice_t = array_ops.slice(a, indices, sizes)
- slice_val = sess.run(slice_t)
-
- expected_val = inp[indices[0]:indices[0] + sizes[0], indices[1]:indices[1] + sizes[
- 1], indices[2]:indices[2] + sizes[2], indices[3]:indices[3] + sizes[3], indices[
- 4]:indices[4] + sizes[4], indices[5]:indices[5] + sizes[5], indices[6]:indices[
- 6] + sizes[6], indices[7]:indices[7] + sizes[7]]
- self.assertAllEqual(slice_val, expected_val)
- self.assertEqual(expected_val.shape, slice_t.get_shape())
-
def testPartialShapeInference(self):
z = array_ops.zeros((1, 2, 3))
self.assertAllEqual(z.get_shape().as_list(), [1, 2, 3])
@@ -251,6 +227,7 @@ class SliceTest(test.TestCase):
m2 = array_ops.slice(z, [0, 0, 0], [constant_op.constant(1) + 0, 2, -1])
self.assertAllEqual(m2.get_shape().as_list(), [None, 2, None])
+
def _testGradientSlice(self, input_shape, slice_begin, slice_size):
with self.test_session(use_gpu=True):
num_inputs = np.prod(input_shape)
diff --git a/tensorflow/python/kernel_tests/unique_op_test.py b/tensorflow/python/kernel_tests/unique_op_test.py
index 04758ce45a..a50f53b3cd 100644
--- a/tensorflow/python/kernel_tests/unique_op_test.py
+++ b/tensorflow/python/kernel_tests/unique_op_test.py
@@ -22,7 +22,6 @@ import numpy as np
from tensorflow.python.framework import dtypes
from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gen_array_ops
from tensorflow.python.platform import test
@@ -62,31 +61,6 @@ class UniqueTest(test.TestCase):
for i in range(len(x)):
self.assertEqual(x[i], tf_y[tf_idx[i]].decode('ascii'))
- def testInt32Axis(self):
- x = np.array([[1, 0, 0], [1, 0, 0], [2, 0, 0]])
- with self.test_session() as sess:
- y0, idx0 = gen_array_ops.unique_v2(x, axis=[0])
- tf_y0, tf_idx0 = sess.run([y0, idx0])
- y1, idx1 = gen_array_ops.unique_v2(x, axis=[1])
- tf_y1, tf_idx1 = sess.run([y1, idx1])
- self.assertAllEqual(tf_y0, np.array([[1, 0, 0], [2, 0, 0]]))
- self.assertAllEqual(tf_idx0, np.array([0, 0, 1]))
- self.assertAllEqual(tf_y1, np.array([[1, 0], [1, 0], [2, 0]]))
- self.assertAllEqual(tf_idx1, np.array([0, 1, 1]))
-
- def testInt32V2(self):
- # This test is only temporary, once V2 is used
- # by default, the axis will be wrapped to allow `axis=None`.
- x = np.random.randint(2, high=10, size=7000)
- with self.test_session() as sess:
- y, idx = gen_array_ops.unique_v2(x, axis=[])
- tf_y, tf_idx = sess.run([y, idx])
-
- self.assertEqual(len(x), len(tf_idx))
- self.assertEqual(len(tf_y), len(np.unique(x)))
- for i in range(len(x)):
- self.assertEqual(x[i], tf_y[tf_idx[i]])
-
class UniqueWithCountsTest(test.TestCase):
def testInt32(self):
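
The axis-aware behavior the removed `testInt32Axis` captured — deduplicating whole rows or columns — has a direct numpy analogue, shown here with the test's own matrix:

    import numpy as np
    x = np.array([[1, 0, 0], [1, 0, 0], [2, 0, 0]])
    y0, idx0 = np.unique(x, axis=0, return_inverse=True)
    print(y0)     # [[1 0 0], [2 0 0]] -- unique rows
    print(idx0)   # [0 0 1]
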
diff --git a/tensorflow/python/layers/base.py b/tensorflow/python/layers/base.py
index 6be2bc3e76..74b85da845 100644
--- a/tensorflow/python/layers/base.py
+++ b/tensorflow/python/layers/base.py
@@ -221,7 +221,7 @@ class Layer(object):
Weight updates (for instance, the updates of the moving mean and variance
in a BatchNormalization layer) may be dependent on the inputs passed
- when calling a layer. Hence, when reusing the same layer on
+ when calling a layer. Hence, when reusing the same layer on
different inputs `a` and `b`, some entries in `layer.updates` may be
dependent on `a` and some on `b`. This method automatically keeps track
of dependencies.
@@ -295,9 +295,9 @@ class Layer(object):
"""Add loss tensor(s), potentially dependent on layer inputs.
Some losses (for instance, activity regularization losses) may be dependent
- on the inputs passed when calling a layer. Hence, when reusing the same
- layer on different inputs `a` and `b`, some entries in `layer.losses` may
- be dependent on `a` and some on `b`. This method automatically keeps track
+ on the inputs passed when calling a layer. Hence, when reusing the same layer
+ on different inputs `a` and `b`, some entries in `layer.losses` may be
+ dependent on `a` and some on `b`. This method automatically keeps track
of dependencies.
The `get_losses_for` method allows retrieving the losses relevant to a
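When one layer instance is called on several inputs, each call records its own update and loss tensors, and `get_updates_for`/`get_losses_for` filter them by input. A minimal graph-mode sketch, assuming the `tf.layers` API of this era exposes the `BatchNormalization` class:

```python
# Sketch: reusing one layer on two inputs creates update ops tied to each.
import tensorflow as tf

bn = tf.layers.BatchNormalization()
a = tf.placeholder(tf.float32, [None, 4])
b = tf.placeholder(tf.float32, [None, 4])

ya = bn(a, training=True)  # moving-average updates depending on `a`
yb = bn(b, training=True)  # a second set of updates depending on `b`

print(len(bn.updates))          # expected 4: two update ops per call
print(bn.get_updates_for([a]))  # only the updates computed from `a`
```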
diff --git a/tensorflow/python/layers/convolutional.py b/tensorflow/python/layers/convolutional.py
index 8c327d7e27..0c7ce02835 100644
--- a/tensorflow/python/layers/convolutional.py
+++ b/tensorflow/python/layers/convolutional.py
@@ -813,7 +813,6 @@ def conv3d(inputs,
bias_constraint=bias_constraint,
trainable=trainable,
name=name,
- dtype=inputs.dtype.base_dtype,
_reuse=reuse,
_scope=name)
return layer.apply(inputs)
@@ -1747,7 +1746,6 @@ def conv3d_transpose(inputs,
bias_constraint=bias_constraint,
trainable=trainable,
name=name,
- dtype=inputs.dtype.base_dtype,
_reuse=reuse,
_scope=name)
return layer.apply(inputs)
diff --git a/tensorflow/python/layers/normalization.py b/tensorflow/python/layers/normalization.py
index 4d5fb97845..9d9b2b3941 100644
--- a/tensorflow/python/layers/normalization.py
+++ b/tensorflow/python/layers/normalization.py
@@ -26,7 +26,6 @@ import numpy as np
from tensorflow.python.eager import context
from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.framework import tensor_shape
from tensorflow.python.layers import base
@@ -240,12 +239,6 @@ class BatchNormalization(base.Layer):
raise ValueError('Unsupported axis, fused batch norm only supports '
'axis == [1] or axis == [3]')
- # Raise parameters of fp16 batch norm to fp32
- if self.dtype == dtypes.float16:
- param_dtype = dtypes.float32
- else:
- param_dtype = self.dtype or dtypes.float32
-
axis_to_dim = {x: input_shape[x].value for x in self.axis}
for x in axis_to_dim:
if axis_to_dim[x] is None:
@@ -269,7 +262,6 @@ class BatchNormalization(base.Layer):
if self.scale:
self.gamma = self.add_variable(name='gamma',
shape=param_shape,
- dtype=param_dtype,
initializer=self.gamma_initializer,
regularizer=self.gamma_regularizer,
constraint=self.gamma_constraint,
@@ -277,14 +269,11 @@ class BatchNormalization(base.Layer):
else:
self.gamma = None
if self.fused:
- self._gamma_const = array_ops.constant(1.0,
- dtype=param_dtype,
- shape=param_shape)
+ self._gamma_const = array_ops.constant(1.0, shape=param_shape)
if self.center:
self.beta = self.add_variable(name='beta',
shape=param_shape,
- dtype=param_dtype,
initializer=self.beta_initializer,
regularizer=self.beta_regularizer,
constraint=self.beta_constraint,
@@ -292,9 +281,7 @@ class BatchNormalization(base.Layer):
else:
self.beta = None
if self.fused:
- self._beta_const = array_ops.constant(0.0,
- dtype=param_dtype,
- shape=param_shape)
+ self._beta_const = array_ops.constant(0.0, shape=param_shape)
# Disable variable partitioning when creating the moving mean and variance
try:
@@ -306,14 +293,12 @@ class BatchNormalization(base.Layer):
self.moving_mean = self.add_variable(
name='moving_mean',
shape=param_shape,
- dtype=param_dtype,
initializer=self.moving_mean_initializer,
trainable=False)
self.moving_variance = self.add_variable(
name='moving_variance',
shape=param_shape,
- dtype=param_dtype,
initializer=self.moving_variance_initializer,
trainable=False)
@@ -329,7 +314,6 @@ class BatchNormalization(base.Layer):
def _renorm_variable(name, shape):
var = self.add_variable(name=name,
shape=shape,
- dtype=param_dtype,
initializer=init_ops.zeros_initializer(),
trainable=False)
return var
@@ -372,6 +356,7 @@ class BatchNormalization(base.Layer):
def _fused_batch_norm(self, inputs, training):
"""Returns the output of fused batch norm."""
+ # TODO(reedwm): Add support for fp16 inputs.
beta = self.beta if self.center else self._beta_const
gamma = self.gamma if self.scale else self._gamma_const
@@ -767,7 +752,6 @@ def batch_normalization(inputs,
virtual_batch_size=virtual_batch_size,
adjustment=adjustment,
name=name,
- dtype=inputs.dtype.base_dtype,
_reuse=reuse,
_scope=name)
return layer.apply(inputs, training=training)
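The deleted branch kept the batch-norm parameters (`gamma`, `beta`, and the moving statistics) in float32 even for float16 activations; after this change the parameters simply follow the layer dtype, and fused fp16 support is left as a TODO. A hedged NumPy sketch of the mixed-precision pattern the removed code implemented:

```python
# Sketch only: fp32 parameters around fp16 activations.
import numpy as np

def batch_norm_fp16(x, gamma, beta, eps=1e-3):
    # x: float16 activations; gamma/beta: float32 parameters.
    x32 = x.astype(np.float32)
    mean = x32.mean(axis=0)
    var = x32.var(axis=0)
    y32 = gamma * (x32 - mean) / np.sqrt(var + eps) + beta
    return y32.astype(np.float16)  # output matches the input dtype

x = np.random.rand(8, 4).astype(np.float16)
y = batch_norm_fp16(x, np.ones(4, np.float32), np.zeros(4, np.float32))
assert y.dtype == np.float16
```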
diff --git a/tensorflow/python/layers/normalization_test.py b/tensorflow/python/layers/normalization_test.py
index b2876c58c2..90ebdc8c86 100644
--- a/tensorflow/python/layers/normalization_test.py
+++ b/tensorflow/python/layers/normalization_test.py
@@ -68,12 +68,11 @@ class BNTest(test.TestCase):
use_gpu,
is_fused,
restore=False,
- freeze_mode=False,
- dtype=dtypes.float32):
+ freeze_mode=False):
ops.reset_default_graph()
graph = ops.get_default_graph()
with self.test_session(graph=graph, use_gpu=use_gpu) as sess:
- image = array_ops.placeholder(dtype=dtype, shape=shape)
+ image = array_ops.placeholder(dtype='float32', shape=shape)
loss, train_op, saver = self._simple_model(image, is_fused, freeze_mode)
if restore:
saver.restore(sess, checkpoint_path)
@@ -81,7 +80,7 @@ class BNTest(test.TestCase):
sess.run(variables.global_variables_initializer())
np.random.seed(0)
for _ in range(2):
- image_val = np.random.rand(*shape).astype(dtype.as_numpy_dtype)
+ image_val = np.random.rand(*shape).astype(np.float32)
sess.run([loss, train_op], feed_dict={image: image_val})
if restore:
all_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
@@ -91,74 +90,15 @@ class BNTest(test.TestCase):
saver.save(sess, checkpoint_path)
def _infer(self, checkpoint_path, image_val, shape, use_gpu, is_fused):
- dtype = image_val.dtype
ops.reset_default_graph()
graph = ops.get_default_graph()
with self.test_session(graph=graph, use_gpu=use_gpu) as sess:
- image = array_ops.placeholder(dtype=dtype, shape=shape)
+ image = array_ops.placeholder(dtype='float32', shape=shape)
loss, _, saver = self._simple_model(image, is_fused, True)
saver.restore(sess, checkpoint_path)
loss_val = sess.run(loss, feed_dict={image: image_val})
return loss_val
- def _trainEvalSequence(self,
- dtype,
- train1_use_gpu,
- train2_use_gpu,
- infer_use_gpu):
- batch, height, width, input_channels = 2, 4, 5, 3
- shape = [batch, height, width, input_channels]
- checkpoint = os.path.join(self.get_temp_dir(), 'cp_%s_%s_%s_%s' %
- (dtype, train1_use_gpu, train2_use_gpu, infer_use_gpu))
-
- self._train(
- checkpoint,
- shape,
- use_gpu=train1_use_gpu,
- is_fused=True,
- restore=False,
- freeze_mode=False,
- dtype=dtype)
-
- train_vars = self._train(
- checkpoint,
- shape,
- use_gpu=train2_use_gpu,
- is_fused=True,
- restore=True,
- freeze_mode=False,
- dtype=dtype)
-
- np.random.seed(0)
- image_val = np.random.rand(batch,
- height,
- width,
- input_channels).astype(dtype.as_numpy_dtype)
- loss_val = self._infer(checkpoint, image_val, shape,
- use_gpu=infer_use_gpu, is_fused=True)
-
- return train_vars, loss_val
-
- def testHalfPrecision(self):
- ref_vars, ref_loss = self._trainEvalSequence(dtype=dtypes.float32,
- train1_use_gpu=True,
- train2_use_gpu=True,
- infer_use_gpu=True)
-
- self.assertEqual(len(ref_vars), 5)
-
- for train1_use_gpu in [True, False]:
- for train2_use_gpu in [True, False]:
- for infer_use_gpu in [True, False]:
- test_vars, test_loss = self._trainEvalSequence(dtypes.float16,
- train1_use_gpu,
- train2_use_gpu,
- infer_use_gpu)
- self.assertEqual(len(test_vars), 5)
- for test_var, ref_var in zip(test_vars, ref_vars):
- self.assertAllClose(test_var, ref_var, rtol=1.e-3, atol=1.e-3)
- self.assertAllClose(test_loss, ref_loss, rtol=1.e-3, atol=1.e-3)
-
def _testCheckpoint(self, is_fused_checkpoint_a, is_fused_checkpoint_b,
use_gpu_checkpoint_a, use_gpu_checkpoint_b,
use_gpu_test_a, use_gpu_test_b, freeze_mode):
@@ -278,36 +218,6 @@ class BNTest(test.TestCase):
ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES),
bn.trainable_variables)
- def testCreateFusedBNFloat16(self):
- # Call layer.
- bn = normalization_layers.BatchNormalization(axis=1, fused=True)
- inputs = random_ops.random_uniform((5, 4, 3, 3),
- seed=1,
- dtype=dtypes.float16)
- training = array_ops.placeholder(dtype='bool')
- outputs = bn.apply(inputs, training=training)
-
- # Verify shape.
- self.assertListEqual(outputs.get_shape().as_list(), [5, 4, 3, 3])
-
- # Verify layer attributes.
- self.assertEqual(len(bn.updates), 2)
- self.assertEqual(len(bn.variables), 4)
- self.assertEqual(len(bn.trainable_variables), 2)
- self.assertEqual(len(bn.non_trainable_variables), 2)
- for var in bn.variables:
- self.assertEqual(var.dtype, dtypes.float32_ref)
-
- # Test that updates were created and added to UPDATE_OPS.
- self.assertEqual(len(bn.updates), 2)
- self.assertListEqual(
- ops.get_collection(ops.GraphKeys.UPDATE_OPS), bn.updates)
-
- # Test that weights were created and added to TRAINABLE_VARIABLES.
- self.assertListEqual(
- ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES),
- bn.trainable_variables)
-
def test3DInputAxis1(self):
epsilon = 1e-3
bn = normalization_layers.BatchNormalization(
diff --git a/tensorflow/python/ops/array_ops.py b/tensorflow/python/ops/array_ops.py
index 43238757c7..c3c7ecd080 100644
--- a/tensorflow/python/ops/array_ops.py
+++ b/tensorflow/python/ops/array_ops.py
@@ -1132,7 +1132,7 @@ def concat(values, axis, name="concat"):
return gen_array_ops._concat_v2(values=values, axis=axis, name=name)
-def boolean_mask(tensor, mask, name="boolean_mask", axis=None):
+def boolean_mask(tensor, mask, name="boolean_mask"):
"""Apply boolean mask to tensor. Numpy equivalent is `tensor[mask]`.
```python
@@ -1146,17 +1146,11 @@ def boolean_mask(tensor, mask, name="boolean_mask", axis=None):
the first K dimensions of `tensor`'s shape. We then have:
`boolean_mask(tensor, mask)[i, j1,...,jd] = tensor[i1,...,iK,j1,...,jd]`
where `(i1,...,iK)` is the ith `True` entry of `mask` (row-major order).
- The `axis` could be used with `mask` to indicate the axis to mask from.
- In that case, `axis + dim(mask) <= dim(tensor)` and `mask`'s shape must match
- the first `axis + dim(mask)` dimensions of `tensor`'s shape.
Args:
tensor: N-D tensor.
mask: K-D boolean tensor, K <= N and K must be known statically.
name: A name for this operation (optional).
- axis: A 0-D int Tensor representing the axis in `tensor` to mask from.
- By default, axis is 0 which will mask from the first dimension. Otherwise
- K + axis <= N.
Returns:
(N-K+1)-dimensional tensor populated by entries in `tensor` corresponding
@@ -1175,10 +1169,10 @@ def boolean_mask(tensor, mask, name="boolean_mask", axis=None):
```
"""
- def _apply_mask_1d(reshaped_tensor, mask, axis=None):
+ def _apply_mask_1d(reshaped_tensor, mask):
"""Mask tensor along dimension 0 with a 1-D mask."""
indices = squeeze(where(mask), squeeze_dims=[1])
- return gather(reshaped_tensor, indices, axis=axis)
+ return gather(reshaped_tensor, indices)
with ops.name_scope(name, values=[tensor, mask]):
tensor = ops.convert_to_tensor(tensor, name="tensor")
@@ -1193,22 +1187,19 @@ def boolean_mask(tensor, mask, name="boolean_mask", axis=None):
raise ValueError(
"Number of mask dimensions must be specified, even if some dimensions"
" are None. E.g. shape=[None] is ok, but shape=None is not.")
- axis = 0 if axis is None else axis
- shape_tensor[axis:axis+ndims_mask].assert_is_compatible_with(shape_mask)
+ shape_tensor[:ndims_mask].assert_is_compatible_with(shape_mask)
- leading_size = gen_math_ops._prod(shape(tensor)[axis:axis+ndims_mask], [0])
+ leading_size = gen_math_ops._prod(shape(tensor)[:ndims_mask], [0])
tensor = reshape(tensor,
- concat([shape(tensor)[:axis],
- [leading_size],
- shape(tensor)[axis+ndims_mask:]], 0))
- first_dim = shape_tensor[axis:axis+ndims_mask].num_elements()
+ concat([[leading_size],
+ shape(tensor)[ndims_mask:]], 0))
+ first_dim = shape_tensor[:ndims_mask].num_elements()
tensor.set_shape(
- tensor_shape.as_shape(shape_tensor[:axis])
- .concatenate([first_dim])
- .concatenate(shape_tensor[axis+ndims_mask:]))
+ tensor_shape.as_shape([first_dim])
+ .concatenate(shape_tensor[ndims_mask:]))
mask = reshape(mask, [-1])
- return _apply_mask_1d(tensor, mask, axis)
+ return _apply_mask_1d(tensor, mask)
def sparse_mask(a, mask_indices, name=None):
@@ -1530,8 +1521,7 @@ def zeros_like(tensor, dtype=None, name=None, optimize=True):
Args:
tensor: A `Tensor`.
dtype: A type for the returned `Tensor`. Must be `float32`, `float64`,
- `int8`, `uint8`, `int16`, `uint16`, int32`, `int64`,
- `complex64`, `complex128` or `bool`.
+ `int8`, `int16`, `int32`, `int64`, `uint8`, `complex64`, or `complex128`.
name: A name for the operation (optional).
optimize: if true, attempt to statically determine the shape of 'tensor'
and encode it as a constant.
@@ -1582,8 +1572,8 @@ def ones_like(tensor, dtype=None, name=None, optimize=True):
Args:
tensor: A `Tensor`.
dtype: A type for the returned `Tensor`. Must be `float32`, `float64`,
- `int8`, `uint8`, `int16`, `uint16`, int32`, `int64`,
- `complex64`, `complex128` or `bool`.
+ `int8`, `int16`, `int32`, `int64`, `uint8`, `complex64`, `complex128` or
+ `bool`.
name: A name for the operation (optional).
optimize: if true, attempt to statically determine the shape of 'tensor'
and encode it as a constant.
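`boolean_mask` flattens the first K dimensions covered by `mask` and gathers the entries whose mask value is `True`; NumPy's `tensor[mask]` is the documented reference behavior. A small sketch (the `axis` variant removed here generalized this to start masking at a later dimension):

```python
import numpy as np

tensor = np.array([[1, 2], [3, 4], [5, 6]])
mask = np.array([True, False, True])
assert (tensor[mask] == np.array([[1, 2], [5, 6]])).all()

# 2-D mask (K=2): the masked dims are flattened, matching the
# reshape-then-gather implementation above.
mask2 = np.array([[True, False], [False, True], [True, True]])
assert (tensor[mask2] == np.array([1, 4, 5, 6])).all()
```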
diff --git a/tensorflow/python/ops/distributions/dirichlet.py b/tensorflow/python/ops/distributions/dirichlet.py
index 2accedf1b9..923696a553 100644
--- a/tensorflow/python/ops/distributions/dirichlet.py
+++ b/tensorflow/python/ops/distributions/dirichlet.py
@@ -196,7 +196,7 @@ class Dirichlet(distribution.Distribution):
alpha=self.concentration,
dtype=self.dtype,
seed=seed)
- return gamma_sample / math_ops.reduce_sum(gamma_sample, -1, keepdims=True)
+ return gamma_sample / math_ops.reduce_sum(gamma_sample, -1, keep_dims=True)
@distribution_util.AppendDocstring(_dirichlet_sample_note)
def _log_prob(self, x):
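The sampler above relies on the standard identity: if `g_i ~ Gamma(alpha_i, 1)` independently, then `g / sum(g) ~ Dirichlet(alpha)`. A NumPy sketch:

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0])
g = np.random.gamma(shape=alpha, scale=1.0, size=(10000, 3))
samples = g / g.sum(axis=-1, keepdims=True)  # each row sums to 1

# The Dirichlet mean is alpha / sum(alpha) = [1/6, 2/6, 3/6].
print(samples.mean(axis=0))  # approximately [0.167, 0.333, 0.5]
```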
diff --git a/tensorflow/python/ops/distributions/multinomial.py b/tensorflow/python/ops/distributions/multinomial.py
index d49fac59ca..00b5697c83 100644
--- a/tensorflow/python/ops/distributions/multinomial.py
+++ b/tensorflow/python/ops/distributions/multinomial.py
@@ -26,7 +26,6 @@ from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import nn_ops
from tensorflow.python.ops import random_ops
-from tensorflow.python.ops import functional_ops
from tensorflow.python.ops.distributions import distribution
from tensorflow.python.ops.distributions import util as distribution_util
@@ -141,8 +140,6 @@ class Multinomial(distribution.Distribution):
counts = [[2., 1, 1], [3, 1, 1]]
dist.prob(counts) # Shape [2]
-
- dist.sample(5) # Shape [5, 2, 3]
```
"""
@@ -234,35 +231,29 @@ class Multinomial(distribution.Distribution):
def _sample_n(self, n, seed=None):
n_draws = math_ops.cast(self.total_count, dtype=dtypes.int32)
+ if self.total_count.get_shape().ndims is not None:
+ if self.total_count.get_shape().ndims != 0:
+ raise NotImplementedError(
+ "Sample only supported for scalar number of draws.")
+ elif self.validate_args:
+ is_scalar = check_ops.assert_rank(
+ n_draws, 0,
+ message="Sample only supported for scalar number of draws.")
+ n_draws = control_flow_ops.with_dependencies([is_scalar], n_draws)
k = self.event_shape_tensor()[0]
-
- # broadcast the total_count and logits to the same shape
- n_draws = array_ops.ones_like(
- self.logits[..., 0], dtype=n_draws.dtype) * n_draws
- logits = array_ops.ones_like(
- n_draws[..., array_ops.newaxis], dtype=self.logits.dtype) * self.logits
-
- # flatten the total_count and logits
- flat_logits = array_ops.reshape(logits, [-1, k]) # [B1B2...Bm, k]
- flat_ndraws = n * array_ops.reshape(n_draws, [-1]) # [B1B2...Bm]
-
- # computes each total_count and logits situation by map_fn
- def _sample_single(args):
- logits, n_draw = args[0], args[1] # [K], []
- x = random_ops.multinomial(logits[array_ops.newaxis, ...],
- n_draw, seed) # [1, n*n_draw]
- x = array_ops.reshape(x, shape=[n, -1]) # [n, n_draw]
- x = math_ops.reduce_sum(array_ops.one_hot(x, depth=k), axis=-2) # [n, k]
- return x
- x = functional_ops.map_fn(_sample_single,
- [flat_logits, flat_ndraws],
- dtype=self.dtype) # [B1B2...Bm, n, k]
-
- # reshape the results to proper shape
+ # Flatten batch dims so logits has shape [B, k],
+ # where B = reduce_prod(self.batch_shape_tensor()).
+ x = random_ops.multinomial(
+ logits=array_ops.reshape(self.logits, [-1, k]),
+ num_samples=n * n_draws,
+ seed=seed)
+ x = array_ops.reshape(x, shape=[-1, n, n_draws])
+ x = math_ops.reduce_sum(array_ops.one_hot(x, depth=k),
+ axis=-2) # shape: [B, n, k]
x = array_ops.transpose(x, perm=[1, 0, 2])
final_shape = array_ops.concat([[n], self.batch_shape_tensor(), [k]], 0)
- x = array_ops.reshape(x, final_shape) # [n, B1, B2,..., Bm, k]
- return x
+ x = array_ops.reshape(x, final_shape)
+ return math_ops.cast(x, self.dtype)
@distribution_util.AppendDocstring(_multinomial_sample_note)
def _log_prob(self, counts):
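The rewritten `_sample_n` draws `n * n_draws` categorical indices per batch from the logits, one-hot encodes them, and sums over the draws axis to obtain multinomial counts, which is why it now requires a scalar `total_count`. A NumPy sketch of that scheme, with `np.random.choice` as a stand-in for `random_ops.multinomial`:

```python
import numpy as np

n, n_draws, k = 4, 5, 3                    # samples, total_count, classes
probs = np.array([0.2, 0.3, 0.5])

idx = np.random.choice(k, size=(n, n_draws), p=probs)  # [n, n_draws]
counts = np.eye(k)[idx].sum(axis=-2)                   # [n, k]
assert (counts.sum(axis=-1) == n_draws).all()
```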
diff --git a/tensorflow/python/ops/image_ops_impl.py b/tensorflow/python/ops/image_ops_impl.py
index 7c23321ca5..2946dbe81e 100644
--- a/tensorflow/python/ops/image_ops_impl.py
+++ b/tensorflow/python/ops/image_ops_impl.py
@@ -1121,7 +1121,7 @@ def rgb_to_grayscale(images, name=None):
rank_1 = array_ops.expand_dims(array_ops.rank(images) - 1, 0)
gray_float = math_ops.reduce_sum(flt_image * rgb_weights,
rank_1,
- keepdims=True)
+ keep_dims=True)
gray_float.set_shape(images.get_shape()[:-1].concatenate([1]))
return convert_image_dtype(gray_float, orig_dtype, name=name)
@@ -1212,7 +1212,26 @@ def adjust_hue(image, delta, name=None):
orig_dtype = image.dtype
flt_image = convert_image_dtype(image, dtypes.float32)
- rgb_altered = gen_image_ops.adjust_hue(flt_image, delta)
+ # TODO(zhengxq): we will switch to the fused version after we add a GPU
+ # kernel for that.
+ fused = os.environ.get('TF_ADJUST_HUE_FUSED', '')
+ fused = fused.lower() in ('true', 't', '1')
+
+ if not fused:
+ hsv = gen_image_ops.rgb_to_hsv(flt_image)
+
+ hue = array_ops.slice(hsv, [0, 0, 0], [-1, -1, 1])
+ saturation = array_ops.slice(hsv, [0, 0, 1], [-1, -1, 1])
+ value = array_ops.slice(hsv, [0, 0, 2], [-1, -1, 1])
+
+ # Note that we add 1 to guarantee that the resulting hue is a positive
+ # floating point number, since hue is normalized to [0, 1] and delta is
+ # in [-0.5, 0.5].
+ hue = math_ops.mod(hue + (delta + 1.), 1.)
+
+ hsv_altered = array_ops.concat([hue, saturation, value], 2)
+ rgb_altered = gen_image_ops.hsv_to_rgb(hsv_altered)
+ else:
+ rgb_altered = gen_image_ops.adjust_hue(flt_image, delta)
return convert_image_dtype(rgb_altered, orig_dtype)
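The non-fused path added above rotates the hue channel in HSV space, where hue is normalized to [0, 1]; adding 1 before the modulo keeps the result non-negative for `delta` in [-0.5, 0.5]. A NumPy sketch (`adjust_hue_hsv` is an illustrative name, not a TF API):

```python
import numpy as np

def adjust_hue_hsv(hsv, delta):
    hue, sat, val = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    hue = np.mod(hue + (delta + 1.0), 1.0)
    return np.stack([hue, sat, val], axis=-1)

hsv = np.array([[[0.9, 0.5, 0.5]]])   # hue near the wrap point
out = adjust_hue_hsv(hsv, 0.2)
assert np.isclose(out[0, 0, 0], 0.1)  # 0.9 + 0.2 wraps to 0.1
```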
diff --git a/tensorflow/python/ops/linalg_ops.py b/tensorflow/python/ops/linalg_ops.py
index 14a039ffd0..2cb467c891 100644
--- a/tensorflow/python/ops/linalg_ops.py
+++ b/tensorflow/python/ops/linalg_ops.py
@@ -30,7 +30,6 @@ from tensorflow.python.ops import math_ops
from tensorflow.python.ops.gen_linalg_ops import *
# pylint: enable=wildcard-import
from tensorflow.python.util import compat
-from tensorflow.python.util.deprecation import deprecated_args
# Names below are lower_case.
# pylint: disable=invalid-name
@@ -439,10 +438,7 @@ def svd(tensor, full_matrices=False, compute_uv=True, name=None):
# pylint: disable=redefined-builtin
-@deprecated_args(None, "keep_dims is deprecated, use keepdims instead",
- "keep_dims")
-def norm(tensor, ord='euclidean', axis=None, keepdims=None, name=None,
- keep_dims=None):
+def norm(tensor, ord='euclidean', axis=None, keep_dims=False, name=None):
r"""Computes the norm of vectors, matrices, and tensors.
This function can compute several different vector norms (the 1-norm, the
@@ -475,13 +471,13 @@ def norm(tensor, ord='euclidean', axis=None, keepdims=None, name=None,
can be either a matrix or a batch of matrices at runtime, pass
`axis=[-2,-1]` instead of `axis=None` to make sure that matrix norms are
computed.
- keepdims: If True, the axis indicated in `axis` are kept with size 1.
+ keep_dims: If True, the axes indicated in `axis` are kept with size 1.
Otherwise, the dimensions in `axis` are removed from the output shape.
name: The name of the op.
Returns:
output: A `Tensor` of the same type as tensor, containing the vector or
- matrix norms. If `keepdims` is True then the rank of output is equal to
+ matrix norms. If `keep_dims` is True then the rank of output is equal to
the rank of `tensor`. Otherwise, if `axis` is None the output is a scalar,
if `axis` is an integer, the rank of `output` is one less than the rank
of `tensor`, if `axis` is a 2-tuple the rank of `output` is two less
@@ -501,13 +497,6 @@ def norm(tensor, ord='euclidean', axis=None, keepdims=None, name=None,
@end_compatibility
"""
- if keep_dims is not None:
- if keepdims is not None:
- raise ValueError("Cannot specify both 'keep_dims' and 'keepdims'")
- keepdims = keep_dims
- if keepdims is None:
- keepdims = False
-
is_matrix_norm = ((isinstance(axis, tuple) or isinstance(axis, list)) and
len(axis) == 2)
if is_matrix_norm:
@@ -539,25 +528,25 @@ def norm(tensor, ord='euclidean', axis=None, keepdims=None, name=None,
# matrices.
result = math_ops.sqrt(
math_ops.reduce_sum(
- tensor * math_ops.conj(tensor), axis, keepdims=True))
+ tensor * math_ops.conj(tensor), axis, keep_dims=True))
else:
result = math_ops.abs(tensor)
if ord == 1:
sum_axis = None if axis is None else axis[0]
- result = math_ops.reduce_sum(result, sum_axis, keepdims=True)
+ result = math_ops.reduce_sum(result, sum_axis, keep_dims=True)
if is_matrix_norm:
- result = math_ops.reduce_max(result, axis[-1], keepdims=True)
+ result = math_ops.reduce_max(result, axis[-1], keep_dims=True)
elif ord == np.inf:
if is_matrix_norm:
- result = math_ops.reduce_sum(result, axis[1], keepdims=True)
+ result = math_ops.reduce_sum(result, axis[1], keep_dims=True)
max_axis = None if axis is None else axis[0]
- result = math_ops.reduce_max(result, max_axis, keepdims=True)
+ result = math_ops.reduce_max(result, max_axis, keep_dims=True)
else:
# General p-norms (positive p only)
result = math_ops.pow(
math_ops.reduce_sum(
- math_ops.pow(result, ord), axis, keepdims=True), 1.0 / ord)
- if not keepdims:
+ math_ops.pow(result, ord), axis, keep_dims=True), 1.0 / ord)
+ if not keep_dims:
result = array_ops.squeeze(result, axis)
return result
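All intermediate reductions in `norm` run with `keep_dims=True` so that matrix norms can chain a second reduction over the remaining axis, and the final squeeze only happens when `keep_dims` is false. A NumPy sketch of the Frobenius and matrix 1-norm branches:

```python
import numpy as np

x = np.array([[3.0, 4.0], [0.0, 0.0]])

# Frobenius norm: sqrt(sum(conj(x) * x)) over both matrix axes.
fro = np.sqrt(np.sum(x * np.conj(x), axis=(-2, -1), keepdims=True))
assert fro.shape == (1, 1) and np.isclose(fro.item(), 5.0)

# Matrix 1-norm: max over columns of the column-wise absolute sums.
one_norm = np.abs(x).sum(axis=0, keepdims=True).max(axis=1, keepdims=True)
assert np.isclose(one_norm.item(), 4.0)
```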
diff --git a/tensorflow/python/ops/math_grad_test.py b/tensorflow/python/ops/math_grad_test.py
index 04eeb00518..5732c756ce 100644
--- a/tensorflow/python/ops/math_grad_test.py
+++ b/tensorflow/python/ops/math_grad_test.py
@@ -113,23 +113,6 @@ class MinOrMaxGradientTest(test.TestCase):
self.assertLess(error, 1e-4)
-class MaximumOrMinimumGradientTest(test.TestCase):
-
- def testMaximumGradient(self):
- inputs = constant_op.constant([1.0, 2.0, 3.0, 4.0], dtype=dtypes.float32)
- outputs = math_ops.maximum(inputs, 3.0)
- with self.test_session():
- error = gradient_checker.compute_gradient_error(inputs, [4], outputs, [4])
- self.assertLess(error, 1e-4)
-
- def testMinimumGradient(self):
- inputs = constant_op.constant([1.0, 2.0, 3.0, 4.0], dtype=dtypes.float32)
- outputs = math_ops.minimum(inputs, 2.0)
- with self.test_session():
- error = gradient_checker.compute_gradient_error(inputs, [4], outputs, [4])
- self.assertLess(error, 1e-4)
-
-
class ProdGradientTest(test.TestCase):
def testProdGradient(self):
diff --git a/tensorflow/python/ops/math_ops.py b/tensorflow/python/ops/math_ops.py
index e2e23dccef..4c400423b6 100644
--- a/tensorflow/python/ops/math_ops.py
+++ b/tensorflow/python/ops/math_ops.py
@@ -170,13 +170,14 @@ from tensorflow.python.ops import state_ops
from tensorflow.python.ops.gen_math_ops import *
# pylint: enable=wildcard-import
from tensorflow.python.util import compat
-from tensorflow.python.util import deprecation
+from tensorflow.python.util.deprecation import deprecated
+from tensorflow.python.util.deprecation import deprecated_args
# Aliases for some automatically-generated names.
linspace = gen_math_ops.lin_space
-arg_max = deprecation.deprecated(None, "Use `argmax` instead")(arg_max) # pylint: disable=used-before-assignment
-arg_min = deprecation.deprecated(None, "Use `argmin` instead")(arg_min) # pylint: disable=used-before-assignment
+arg_max = deprecated(None, "Use `argmax` instead")(arg_max) # pylint: disable=used-before-assignment
+arg_min = deprecated(None, "Use `argmin` instead")(arg_min) # pylint: disable=used-before-assignment
def _set_doc(doc):
@@ -189,8 +190,7 @@ def _set_doc(doc):
# pylint: disable=redefined-builtin
-@deprecation.deprecated_args(None, "Use the `axis` argument instead",
- "dimension")
+@deprecated_args(None, "Use the `axis` argument instead", "dimension")
@_set_doc(
gen_math_ops.arg_max.__doc__.replace("dimensions", "axes").replace(
"dimension", "axis"))
@@ -208,8 +208,7 @@ def argmax(input,
return gen_math_ops.arg_max(input, axis, name=name, output_type=output_type)
-@deprecation.deprecated_args(None, "Use the `axis` argument instead",
- "dimension")
+@deprecated_args(None, "Use the `axis` argument instead", "dimension")
@_set_doc(
gen_math_ops.arg_min.__doc__.replace("dimensions", "axes").replace(
"dimension", "axis"))
@@ -325,7 +324,7 @@ multiply.__doc__ = gen_math_ops._mul.__doc__.replace("Mul", "`tf.multiply`")
# TODO(aselle): put deprecation in after another round of global code changes
-@deprecation.deprecated(
+@deprecated(
"2016-12-30",
"`tf.mul(x, y)` is deprecated, please use `tf.multiply(x, y)` or `x * y`")
def _mul(x, y, name=None):
@@ -344,7 +343,7 @@ subtract.__doc__ = gen_math_ops._sub.__doc__.replace("`Sub`", "`tf.subtract`")
# TODO(aselle): put deprecation in after another round of global code changes
-@deprecation.deprecated(
+@deprecated(
"2016-12-30",
"`tf.sub(x, y)` is deprecated, please use `tf.subtract(x, y)` or `x - y`")
def _sub(x, y, name=None):
@@ -382,9 +381,8 @@ def negative(x, name=None):
# pylint: disable=g-docstring-has-escape
-@deprecation.deprecated(
- "2016-12-30",
- "`tf.neg(x)` is deprecated, please use `tf.negative(x)` or `-x`")
+@deprecated("2016-12-30",
+ "`tf.neg(x)` is deprecated, please use `tf.negative(x)` or `-x`")
def _neg(x, name=None):
"""Computes numerical negative value element-wise.
@@ -1271,27 +1269,24 @@ def _ReductionDims(x, axis, reduction_indices):
return range(0, array_ops.rank(x))
-def _may_reduce_to_scalar(keepdims, axis, reduction_indices, output):
+def _may_reduce_to_scalar(keep_dims, axis, reduction_indices, output):
"""Set a reduction's output's shape to be a scalar if we are certain."""
- if (not output.shape.is_fully_defined()) and (not keepdims) and (
+ if (not output.shape.is_fully_defined()) and (not keep_dims) and (
axis is None) and (reduction_indices is None):
output.set_shape(())
return output
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_sum(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes the sum of elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1304,7 +1299,7 @@ def reduce_sum(input_tensor,
tf.reduce_sum(x) # 6
tf.reduce_sum(x, 0) # [2, 2, 2]
tf.reduce_sum(x, 1) # [3, 3]
- tf.reduce_sum(x, 1, keepdims=True) # [[3], [3]]
+ tf.reduce_sum(x, 1, keep_dims=True) # [[3], [3]]
tf.reduce_sum(x, [0, 1]) # 6
```
@@ -1313,10 +1308,9 @@ def reduce_sum(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
@@ -1325,34 +1319,26 @@ def reduce_sum(input_tensor,
Equivalent to np.sum
@end_compatibility
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
- if keepdims is None:
- keepdims = False
-
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices,
gen_math_ops._sum(
input_tensor,
_ReductionDims(input_tensor, axis,
reduction_indices),
- keepdims,
+ keep_dims,
name=name))
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def count_nonzero(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
dtype=dtypes.int64,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes number of nonzero elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1369,7 +1355,7 @@ def count_nonzero(input_tensor,
tf.count_nonzero(x) # 3
tf.count_nonzero(x, 0) # [1, 2, 0]
tf.count_nonzero(x, 1) # [1, 2]
- tf.count_nonzero(x, 1, keepdims=True) # [[1], [2]]
+ tf.count_nonzero(x, 1, keep_dims=True) # [[1], [2]]
tf.count_nonzero(x, [0, 1]) # 3
```
@@ -1378,20 +1364,14 @@ def count_nonzero(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
dtype: The output dtype; defaults to `tf.int64`.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor (number of nonzero values).
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
- if keepdims is None:
- keepdims = False
-
with ops.name_scope(name, "count_nonzero", [input_tensor]):
input_tensor = ops.convert_to_tensor(input_tensor, name="input_tensor")
zero = input_tensor.dtype.as_numpy_dtype()
@@ -1400,24 +1380,21 @@ def count_nonzero(input_tensor,
# int64 reduction happens on GPU
to_int64(gen_math_ops.not_equal(input_tensor, zero)),
axis=axis,
- keepdims=keepdims,
+ keep_dims=keep_dims,
reduction_indices=reduction_indices),
dtype=dtype)
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_mean(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes the mean of elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1437,58 +1414,36 @@ def reduce_mean(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
@compatibility(numpy)
Equivalent to np.mean
-
- Please note that `np.mean` has a `dtype` parameter that could be used to
- specify the output type. By default this is `dtype=float64`. On the other
- hand, `tf.reduce_mean` has an aggressive type inference from `input_tensor`,
- for example:
-
- ```python
- x = tf.constant([1, 0, 1, 0])
- tf.reduce_mean(x) # 0
- y = tf.constant([1., 0., 1., 0.])
- tf.reduce_mean(y) # 0.5
- ```
-
@end_compatibility
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
-
- if keepdims is None:
- keepdims = False
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices,
gen_math_ops._mean(
input_tensor,
_ReductionDims(input_tensor, axis,
reduction_indices),
- keepdims,
+ keep_dims,
name=name))
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_prod(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes the product of elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1499,10 +1454,9 @@ def reduce_prod(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
@@ -1511,33 +1465,25 @@ def reduce_prod(input_tensor,
Equivalent to np.prod
@end_compatibility
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
-
- if keepdims is None:
- keepdims = False
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices,
gen_math_ops._prod(
input_tensor,
_ReductionDims(input_tensor, axis,
reduction_indices),
- keepdims,
+ keep_dims,
name=name))
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_min(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes the minimum of elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1548,10 +1494,9 @@ def reduce_min(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
@@ -1560,32 +1505,25 @@ def reduce_min(input_tensor,
Equivalent to np.min
@end_compatibility
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
- if keepdims is None:
- keepdims = False
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices,
gen_math_ops._min(
input_tensor,
_ReductionDims(input_tensor, axis,
reduction_indices),
- keepdims,
+ keep_dims,
name=name))
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_max(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes the maximum of elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1596,10 +1534,9 @@ def reduce_max(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
@@ -1608,32 +1545,25 @@ def reduce_max(input_tensor,
Equivalent to np.max
@end_compatibility
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
- if keepdims is None:
- keepdims = False
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices,
gen_math_ops._max(
input_tensor,
_ReductionDims(input_tensor, axis,
reduction_indices),
- keepdims,
+ keep_dims,
name=name))
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_all(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes the "logical and" of elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1653,10 +1583,9 @@ def reduce_all(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
@@ -1665,32 +1594,25 @@ def reduce_all(input_tensor,
Equivalent to np.all
@end_compatibility
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
- if keepdims is None:
- keepdims = False
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices,
gen_math_ops._all(
input_tensor,
_ReductionDims(input_tensor, axis,
reduction_indices),
- keepdims,
+ keep_dims,
name=name))
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_any(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes the "logical or" of elements across dimensions of a tensor.
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1710,10 +1632,9 @@ def reduce_any(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
@@ -1722,32 +1643,25 @@ def reduce_any(input_tensor,
Equivalent to np.any
@end_compatibility
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
- if keepdims is None:
- keepdims = False
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices,
gen_math_ops._any(
input_tensor,
_ReductionDims(input_tensor, axis,
reduction_indices),
- keepdims,
+ keep_dims,
name=name))
-@deprecation.deprecated_args(
- None, "keep_dims is deprecated, use keepdims instead", "keep_dims")
def reduce_logsumexp(input_tensor,
axis=None,
- keepdims=None,
+ keep_dims=False,
name=None,
- reduction_indices=None,
- keep_dims=None):
+ reduction_indices=None):
"""Computes log(sum(exp(elements across dimensions of a tensor))).
Reduces `input_tensor` along the dimensions given in `axis`.
- Unless `keepdims` is true, the rank of the tensor is reduced by 1 for each
- entry in `axis`. If `keepdims` is true, the reduced dimensions
+ Unless `keep_dims` is true, the rank of the tensor is reduced by 1 for each
+ entry in `axis`. If `keep_dims` is true, the reduced dimensions
are retained with length 1.
If `axis` has no entries, all dimensions are reduced, and a
@@ -1764,7 +1678,7 @@ def reduce_logsumexp(input_tensor,
tf.reduce_logsumexp(x) # log(6)
tf.reduce_logsumexp(x, 0) # [log(2), log(2), log(2)]
tf.reduce_logsumexp(x, 1) # [log(3), log(3)]
- tf.reduce_logsumexp(x, 1, keepdims=True) # [[log(3)], [log(3)]]
+ tf.reduce_logsumexp(x, 1, keep_dims=True) # [[log(3)], [log(3)]]
tf.reduce_logsumexp(x, [0, 1]) # log(6)
```
@@ -1773,24 +1687,19 @@ def reduce_logsumexp(input_tensor,
axis: The dimensions to reduce. If `None` (the default),
reduces all dimensions. Must be in the range
`[-rank(input_tensor), rank(input_tensor))`.
- keepdims: If true, retains reduced dimensions with length 1.
+ keep_dims: If true, retains reduced dimensions with length 1.
name: A name for the operation (optional).
reduction_indices: The old (deprecated) name for axis.
- keep_dims: Deprecated alias for `keepdims`.
Returns:
The reduced tensor.
"""
- keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
- "keep_dims", keep_dims)
- if keepdims is None:
- keepdims = False
with ops.name_scope(name, "ReduceLogSumExp", [input_tensor]) as name:
raw_max = reduce_max(
input_tensor,
axis=axis,
reduction_indices=reduction_indices,
- keepdims=True)
+ keep_dims=True)
my_max = array_ops.stop_gradient(
array_ops.where(
gen_math_ops.is_finite(raw_max), raw_max,
@@ -1799,13 +1708,13 @@ def reduce_logsumexp(input_tensor,
reduce_sum(
gen_math_ops.exp(input_tensor - my_max),
axis,
- keepdims=True,
+ keep_dims=True,
reduction_indices=reduction_indices)) + my_max
- if not keepdims:
+ if not keep_dims:
if isinstance(axis, int):
axis = [axis]
result = array_ops.squeeze(result, axis)
- return _may_reduce_to_scalar(keepdims, axis, reduction_indices, result)
+ return _may_reduce_to_scalar(keep_dims, axis, reduction_indices, result)
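`reduce_logsumexp` subtracts a stopped-gradient maximum before exponentiating, i.e. `logsumexp(x) = m + log(sum(exp(x - m)))` with `m = max(x)`, which avoids overflow. A NumPy sketch:

```python
import numpy as np

x = np.array([1000.0, 1000.0])
naive = np.log(np.sum(np.exp(x)))           # exp overflows to inf
m = np.max(x)
stable = m + np.log(np.sum(np.exp(x - m)))  # 1000 + log(2)
assert np.isinf(naive) and np.isclose(stable, 1000.0 + np.log(2.0))
```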
def trace(x, name=None):
@@ -2307,10 +2216,9 @@ def bincount(arr,
maxlength = ops.convert_to_tensor(
maxlength, name="maxlength", dtype=dtypes.int32)
output_size = gen_math_ops.minimum(maxlength, output_size)
- if weights is not None:
- weights = ops.convert_to_tensor(weights, name="weights")
- return gen_math_ops.unsorted_segment_sum(weights, arr, output_size)
- weights = constant_op.constant([], dtype)
+ weights = (
+ ops.convert_to_tensor(weights, name="weights")
+ if weights is not None else constant_op.constant([], dtype))
return gen_math_ops.bincount(arr, output_size, weights)
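With `weights` supplied, `bincount` accumulates the weight of each occurrence instead of a count; the empty-constant default keeps the kernel signature uniform. NumPy's `np.bincount` has the same semantics:

```python
import numpy as np

arr = np.array([0, 1, 1, 2])
weights = np.array([0.5, 1.0, 2.0, 4.0])
assert (np.bincount(arr) == np.array([1, 2, 1])).all()
assert np.allclose(np.bincount(arr, weights), np.array([0.5, 3.0, 4.0]))
```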
@@ -2473,7 +2381,7 @@ def reduced_shape(input_shape, axes):
input_shape: 1-D Tensor, the shape of the Tensor being reduced.
axes: 1-D Tensor, the reduction axes.
Returns:
- A 1-D Tensor, the output shape as if keepdims were set to True.
+ A 1-D Tensor, the output shape as if keep_dims were set to True.
"""
# Example:
# cast needed for SparseTensor reductions
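`reduced_shape` returns the input shape with each reduced axis replaced by 1, i.e. the shape the output would have under `keep_dims=True`. A hedged NumPy sketch that illustrates only the resulting shape, not the graph construction:

```python
import numpy as np

def reduced_shape(input_shape, axes):
    out = np.array(input_shape)
    out[np.array(axes) % len(out)] = 1  # modulo handles negative axes
    return out

assert (reduced_shape([2, 3, 5, 7], [1, 2]) == np.array([2, 1, 1, 7])).all()
```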
diff --git a/tensorflow/python/ops/metrics_impl.py b/tensorflow/python/ops/metrics_impl.py
index d30f6b92ad..717ee1254f 100644
--- a/tensorflow/python/ops/metrics_impl.py
+++ b/tensorflow/python/ops/metrics_impl.py
@@ -794,7 +794,7 @@ def mean_cosine_distance(labels, predictions, dim, weights=None,
radial_diffs = math_ops.multiply(predictions, labels)
radial_diffs = math_ops.reduce_sum(radial_diffs,
reduction_indices=[dim,],
- keepdims=True)
+ keep_dims=True)
mean_distance, update_op = mean(radial_diffs, weights,
None,
None,
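`mean_cosine_distance` reduces `predictions * labels` along `dim` with `keep_dims=True` and reports one minus the running mean of that similarity; the inputs are expected to be unit-normalized. A NumPy sketch:

```python
import numpy as np

labels = np.array([[1.0, 0.0], [0.0, 1.0]])
predictions = np.array([[1.0, 0.0], [1.0, 0.0]])
radial_diffs = (predictions * labels).sum(axis=1, keepdims=True)
cos_dist = 1.0 - radial_diffs  # 0 for identical, 1 for orthogonal vectors
assert np.allclose(cos_dist.ravel(), [0.0, 1.0])
```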
diff --git a/tensorflow/python/ops/nn_fused_batchnorm_test.py b/tensorflow/python/ops/nn_fused_batchnorm_test.py
index e72d34d1f7..1fcd0384da 100644
--- a/tensorflow/python/ops/nn_fused_batchnorm_test.py
+++ b/tensorflow/python/ops/nn_fused_batchnorm_test.py
@@ -335,22 +335,22 @@ class BatchNormalizationTest(test.TestCase):
def testInference(self):
x_shape = [1, 1, 6, 1]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_inference(
x_shape, dtype, [1], np.float32, use_gpu=True, data_format='NHWC')
self._test_inference(
x_shape, dtype, [1], np.float32, use_gpu=True, data_format='NCHW')
- self._test_inference(
- x_shape, dtype, [1], np.float32, use_gpu=False, data_format='NHWC')
+ self._test_inference(
+ x_shape, np.float32, [1], np.float32, use_gpu=False, data_format='NHWC')
x_shape = [1, 1, 6, 2]
if test.is_gpu_available(cuda_only=True):
for dtype in [np.float16, np.float32]:
self._test_inference(
x_shape, dtype, [2], np.float32, use_gpu=True, data_format='NHWC')
- self._test_inference(
- x_shape, dtype, [2], np.float32, use_gpu=False, data_format='NHWC')
+ self._test_inference(
+ x_shape, np.float32, [2], np.float32, use_gpu=False, data_format='NHWC')
x_shape = [1, 2, 1, 6]
if test.is_gpu_available(cuda_only=True):
@@ -359,33 +359,33 @@ class BatchNormalizationTest(test.TestCase):
x_shape, dtype, [2], np.float32, use_gpu=True, data_format='NCHW')
x_shape = [27, 131, 127, 6]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_inference(
x_shape, dtype, [131], np.float32, use_gpu=True, data_format='NCHW')
self._test_inference(
x_shape, dtype, [6], np.float32, use_gpu=True, data_format='NHWC')
- self._test_inference(
- x_shape, dtype, [6], np.float32, use_gpu=False, data_format='NHWC')
+ self._test_inference(
+ x_shape, np.float32, [6], np.float32, use_gpu=False, data_format='NHWC')
def testTraining(self):
x_shape = [1, 1, 6, 1]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_training(
x_shape, dtype, [1], np.float32, use_gpu=True, data_format='NHWC')
self._test_training(
x_shape, dtype, [1], np.float32, use_gpu=True, data_format='NCHW')
- self._test_training(
- x_shape, dtype, [1], np.float32, use_gpu=False, data_format='NHWC')
+ self._test_training(
+ x_shape, np.float32, [1], np.float32, use_gpu=False, data_format='NHWC')
x_shape = [1, 1, 6, 2]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_training(
x_shape, dtype, [2], np.float32, use_gpu=True, data_format='NHWC')
- self._test_training(
- x_shape, dtype, [2], np.float32, use_gpu=False, data_format='NHWC')
+ self._test_training(
+ x_shape, np.float32, [2], np.float32, use_gpu=False, data_format='NHWC')
x_shape = [1, 2, 1, 6]
if test.is_gpu_available(cuda_only=True):
@@ -394,20 +394,20 @@ class BatchNormalizationTest(test.TestCase):
x_shape, dtype, [2], np.float32, use_gpu=True, data_format='NCHW')
x_shape = [27, 131, 127, 6]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_training(
x_shape, dtype, [131], np.float32, use_gpu=True, data_format='NCHW')
self._test_training(
x_shape, dtype, [6], np.float32, use_gpu=True, data_format='NHWC')
- self._test_training(
- x_shape, dtype, [6], np.float32, use_gpu=False, data_format='NHWC')
+ self._test_training(
+ x_shape, np.float32, [6], np.float32, use_gpu=False, data_format='NHWC')
def testBatchNormGrad(self):
for is_training in [True, False]:
x_shape = [1, 1, 6, 1]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_gradient(
x_shape,
dtype, [1],
@@ -422,17 +422,17 @@ class BatchNormalizationTest(test.TestCase):
use_gpu=True,
data_format='NCHW',
is_training=is_training)
- self._test_gradient(
- x_shape,
- dtype, [1],
- np.float32,
- use_gpu=False,
- data_format='NHWC',
- is_training=is_training)
+ self._test_gradient(
+ x_shape,
+ np.float32, [1],
+ np.float32,
+ use_gpu=False,
+ data_format='NHWC',
+ is_training=is_training)
x_shape = [1, 1, 6, 2]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_gradient(
x_shape,
dtype, [2],
@@ -440,13 +440,13 @@ class BatchNormalizationTest(test.TestCase):
use_gpu=True,
data_format='NHWC',
is_training=is_training)
- self._test_gradient(
- x_shape,
- dtype, [2],
- np.float32,
- use_gpu=False,
- data_format='NHWC',
- is_training=is_training)
+ self._test_gradient(
+ x_shape,
+ np.float32, [2],
+ np.float32,
+ use_gpu=False,
+ data_format='NHWC',
+ is_training=is_training)
x_shape = [1, 2, 1, 6]
if test.is_gpu_available(cuda_only=True):
@@ -460,8 +460,8 @@ class BatchNormalizationTest(test.TestCase):
is_training=is_training)
x_shape = [5, 7, 11, 4]
- for dtype in [np.float16, np.float32]:
- if test.is_gpu_available(cuda_only=True):
+ if test.is_gpu_available(cuda_only=True):
+ for dtype in [np.float16, np.float32]:
self._test_gradient(
x_shape,
dtype, [7],
@@ -476,13 +476,13 @@ class BatchNormalizationTest(test.TestCase):
use_gpu=True,
data_format='NHWC',
is_training=is_training)
- self._test_gradient(
- x_shape,
- dtype, [4],
- np.float32,
- use_gpu=False,
- data_format='NHWC',
- is_training=is_training)
+ self._test_gradient(
+ x_shape,
+ np.float32, [4],
+ np.float32,
+ use_gpu=False,
+ data_format='NHWC',
+ is_training=is_training)
def _testBatchNormGradGrad(self, config):
shape = config['shape']
@@ -506,14 +506,15 @@ class BatchNormalizationTest(test.TestCase):
data_format='NCHW',
is_training=is_training,
err_tolerance=err_tolerance)
- self._test_grad_grad(
- shape,
- dtype, [shape[3]],
- np.float32,
- use_gpu=False,
- data_format='NHWC',
- is_training=is_training,
- err_tolerance=err_tolerance)
+ if dtype != np.float16:
+ self._test_grad_grad(
+ shape,
+ np.float32, [shape[3]],
+ np.float32,
+ use_gpu=False,
+ data_format='NHWC',
+ is_training=is_training,
+ err_tolerance=err_tolerance)
def testBatchNormGradGrad(self):
configs = [{
@@ -525,10 +526,6 @@ class BatchNormalizationTest(test.TestCase):
'err_tolerance': 1e-3,
'dtype': np.float32,
}, {
- 'shape': [2, 3, 4, 5],
- 'err_tolerance': 1e-2,
- 'dtype': np.float16,
- }, {
'shape': [2, 3, 2, 2],
'err_tolerance': 2e-3,
'dtype': np.float16,
diff --git a/tensorflow/python/ops/nn_impl.py b/tensorflow/python/ops/nn_impl.py
index da037a7983..431ea1186a 100644
--- a/tensorflow/python/ops/nn_impl.py
+++ b/tensorflow/python/ops/nn_impl.py
@@ -32,8 +32,6 @@ from tensorflow.python.ops import math_ops
from tensorflow.python.ops import nn_ops
from tensorflow.python.ops import sparse_ops
from tensorflow.python.ops import variables
-from tensorflow.python.util.deprecation import deprecated_args
-from tensorflow.python.util.deprecation import deprecated_argument_lookup
def log_poisson_loss(targets, log_input, compute_full_loss=False, name=None):
@@ -315,20 +313,19 @@ def swish(features):
return features * math_ops.sigmoid(features)
-@deprecated_args(None, "dim is deprecated, use axis instead", "dim")
-def l2_normalize(x, axis=None, epsilon=1e-12, name=None, dim=None):
- """Normalizes along dimension `axis` using an L2 norm.
+def l2_normalize(x, dim, epsilon=1e-12, name=None):
+ """Normalizes along dimension `dim` using an L2 norm.
- For a 1-D tensor with `axis = 0`, computes
+ For a 1-D tensor with `dim = 0`, computes
output = x / sqrt(max(sum(x**2), epsilon))
For `x` with more dimensions, independently normalizes each 1-D slice along
- dimension `axis`.
+ dimension `dim`.
Args:
x: A `Tensor`.
- axis: Dimension along which to normalize. A scalar or a vector of
+ dim: Dimension along which to normalize. A scalar or a vector of
integers.
epsilon: A lower bound value for the norm. Will use `sqrt(epsilon)` as the
divisor if `norm < sqrt(epsilon)`.
@@ -338,9 +335,8 @@ def l2_normalize(x, axis=None, epsilon=1e-12, name=None, dim=None):
A `Tensor` with the same shape as `x`.
"""
with ops.name_scope(name, "l2_normalize", [x]) as name:
- axis = deprecated_argument_lookup("axis", axis, "dim", dim)
x = ops.convert_to_tensor(x, name="x")
- square_sum = math_ops.reduce_sum(math_ops.square(x), axis, keep_dims=True)
+ square_sum = math_ops.reduce_sum(math_ops.square(x), dim, keep_dims=True)
x_inv_norm = math_ops.rsqrt(math_ops.maximum(square_sum, epsilon))
return math_ops.multiply(x, x_inv_norm, name=name)
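A NumPy sketch of the docstring's formula, `output = x / sqrt(max(sum(x**2), epsilon))` along `dim`:

```python
import numpy as np

def l2_normalize(x, dim, epsilon=1e-12):
    square_sum = np.sum(np.square(x), axis=dim, keepdims=True)
    x_inv_norm = 1.0 / np.sqrt(np.maximum(square_sum, epsilon))
    return x * x_inv_norm

x = np.array([[3.0, 4.0]])
assert np.allclose(l2_normalize(x, dim=1), [[0.6, 0.8]])
```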
diff --git a/tensorflow/python/ops/nn_ops.py b/tensorflow/python/ops/nn_ops.py
index 61fa462988..bdaac65904 100644
--- a/tensorflow/python/ops/nn_ops.py
+++ b/tensorflow/python/ops/nn_ops.py
@@ -23,7 +23,6 @@ import numbers
import numpy as np
from tensorflow.python.eager import context
-from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import graph_util
from tensorflow.python.framework import ops
@@ -38,8 +37,6 @@ from tensorflow.python.ops import random_ops
# pylint: disable=wildcard-import
from tensorflow.python.ops.gen_nn_ops import *
# pylint: enable=wildcard-import
-from tensorflow.python.util.deprecation import deprecated_args
-from tensorflow.python.util.deprecation import deprecated_argument_lookup
from tensorflow.python.util import deprecation
@@ -1648,18 +1645,17 @@ def _softmax(logits, compute_op, dim=-1, name=None):
return output
-@deprecated_args(None, "dim is deprecated, use axis instead", "dim")
-def softmax(logits, axis=None, name=None, dim=None):
+def softmax(logits, dim=-1, name=None):
"""Computes softmax activations.
This function performs the equivalent of
- softmax = tf.exp(logits) / tf.reduce_sum(tf.exp(logits), axis)
+ softmax = tf.exp(logits) / tf.reduce_sum(tf.exp(logits), dim)
Args:
logits: A non-empty `Tensor`. Must be one of the following types: `half`,
`float32`, `float64`.
- axis: The dimension softmax would be performed on. The default is -1 which
+ dim: The dimension softmax would be performed on. The default is -1 which
indicates the last dimension.
name: A name for the operation (optional).
@@ -1667,27 +1663,23 @@ def softmax(logits, axis=None, name=None, dim=None):
A `Tensor`. Has the same type and shape as `logits`.
Raises:
- InvalidArgumentError: if `logits` is empty or `axis` is beyond the last
+ InvalidArgumentError: if `logits` is empty or `dim` is beyond the last
dimension of `logits`.
"""
- axis = deprecated_argument_lookup("axis", axis, "dim", dim)
- if axis is None:
- axis = -1
- return _softmax(logits, gen_nn_ops._softmax, axis, name)
+ return _softmax(logits, gen_nn_ops._softmax, dim, name)
-@deprecated_args(None, "dim is deprecated, use axis instead", "dim")
-def log_softmax(logits, axis=None, name=None, dim=None):
+def log_softmax(logits, dim=-1, name=None):
"""Computes log softmax activations.
For each batch `i` and class `j` we have
- logsoftmax = logits - log(reduce_sum(exp(logits), axis))
+ logsoftmax = logits - log(reduce_sum(exp(logits), dim))
Args:
logits: A non-empty `Tensor`. Must be one of the following types: `half`,
`float32`, `float64`.
- axis: The dimension softmax would be performed on. The default is -1 which
+ dim: The dimension softmax would be performed on. The default is -1 which
indicates the last dimension.
name: A name for the operation (optional).
@@ -1695,13 +1687,10 @@ def log_softmax(logits, axis=None, name=None, dim=None):
A `Tensor`. Has the same type as `logits`. Same shape as `logits`.
Raises:
- InvalidArgumentError: if `logits` is empty or `axis` is beyond the last
+ InvalidArgumentError: if `logits` is empty or `dim` is beyond the last
dimension of `logits`.
"""
- axis = deprecated_argument_lookup("axis", axis, "dim", dim)
- if axis is None:
- axis = -1
- return _softmax(logits, gen_nn_ops._log_softmax, axis, name)
+ return _softmax(logits, gen_nn_ops._log_softmax, dim, name)
def _ensure_xent_args(name, sentinel, labels, logits):
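As with `l2_normalize`, the reverted softmax entry points accept `dim` (default -1) directly rather than looking up a deprecated argument; a short sketch, assuming the same TF 1.4-era API:

```python
import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 3.0]])
probs = tf.nn.softmax(logits, dim=-1)   # sums to 1 along the last dimension
log_probs = tf.nn.log_softmax(logits)   # dim defaults to -1

with tf.Session() as sess:
    p, lp = sess.run([probs, log_probs])  # lp equals log(p) elementwise
```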
@@ -2316,100 +2305,6 @@ def conv1d(value, filters, stride, padding,
return array_ops.squeeze(result, [spatial_start_dim])
-def conv1d_transpose(value,
- filter,
- output_shape,
- stride,
- padding="SAME",
- data_format="NWC",
- name=None):
- """The transpose of `conv1d`.
-
- This operation is sometimes called "deconvolution" after [Deconvolutional
- Networks](http://www.matthewzeiler.com/pubs/cvpr2010/cvpr2010.pdf), but is
- actually the transpose (gradient) of `conv1d` rather than an actual
- deconvolution.
-
- Args:
- value: A 3-D `Tensor` of type `float` and shape
- `[batch, in_width, in_channels]` for `NWC` data format or
- `[batch, in_channels, in_width]` for `NCW` data format.
- filter: A 3-D `Tensor` with the same type as `value` and shape
- `[filter_width, output_channels, in_channels]`. `filter`'s
- `in_channels` dimension must match that of `value`.
- output_shape: A 1-D `Tensor` representing the output shape of the
- deconvolution op.
- stride: An `integer`. The number of entries by which
- the filter is moved right at each step.
- padding: A string, either `'VALID'` or `'SAME'`. The padding algorithm.
- See the @{tf.nn.convolution$comment here}
- data_format: A string. 'NHWC' and 'NCHW' are supported.
- name: Optional name for the returned tensor.
-
- Returns:
- A `Tensor` with the same type as `value`.
-
- Raises:
- ValueError: If input/output depth does not match `filter`'s shape, or if
- padding is other than `'VALID'` or `'SAME'`.
- """
- with ops.name_scope(name, "conv1d_transpose",
- [value, filter, output_shape]) as name:
- output_shape_ = ops.convert_to_tensor(output_shape, name="output_shape")
- if not output_shape_.get_shape().is_compatible_with(tensor_shape.vector(3)):
- raise ValueError("output_shape must have shape (3,), got {}"
- .format(output_shape_.get_shape()))
-
- # The format could be either NWC or NCW, map to NHWC or NCHW
- if data_format is None or data_format == "NWC":
- data_format_2d = "NHWC"
- axis = 2
- elif data_format == "NCW":
- data_format_2d = "NCHW"
- axis = 1
- else:
- raise ValueError("data_format must be \"NWC\" or \"NCW\".")
-
- if not value.get_shape()[axis].is_compatible_with(filter.get_shape()[2]):
- raise ValueError("input channels does not match filter's input channels, "
- "{} != {}".format(value.get_shape()[axis],
- filter.get_shape()[2]))
-
- if isinstance(output_shape, (list, np.ndarray)):
- # output_shape's shape should be == [3] if reached this point.
- if not filter.get_shape()[1].is_compatible_with(output_shape[axis]):
- raise ValueError(
- "output_shape does not match filter's output channels, "
- "{} != {}".format(output_shape[axis], filter.get_shape()[1]))
-
- if padding != "VALID" and padding != "SAME":
- raise ValueError("padding must be either VALID or SAME:"
- " {}".format(padding))
-
- # Reshape the input tensor to [batch, 1, in_width, in_channels]
- if data_format_2d == "NHWC":
- output_shape_ = array_ops.concat([output_shape_[:1], [1],
- output_shape_[1:]], axis=0)
- spatial_start_dim = 1
- strides = [1, 1, stride, 1]
- else:
- output_shape_ = array_ops.concat([output_shape_[:2], [1],
- output_shape_[2:]], axis=0)
- spatial_start_dim = 2
- strides = [1, 1, 1, stride]
- value = array_ops.expand_dims(value, spatial_start_dim)
- filter = array_ops.expand_dims(filter, 0)
-
- result = gen_nn_ops.conv2d_backprop_input(input_sizes=output_shape_,
- filter=filter,
- out_backprop=value,
- strides=strides,
- padding=padding,
- data_format=data_format_2d,
- name=name)
- return array_ops.squeeze(result, [spatial_start_dim])
-
-
@ops.RegisterStatistics("Dilation2D", "flops")
def _calc_dilation2d_flops(graph, node):
"""Calculates the compute resources needed for Dilation2D."""
diff --git a/tensorflow/python/ops/variables.py b/tensorflow/python/ops/variables.py
index a1e4305de1..e9b1c67d16 100644
--- a/tensorflow/python/ops/variables.py
+++ b/tensorflow/python/ops/variables.py
@@ -1063,13 +1063,13 @@ class Variable(object):
class PartitionedVariable(object):
"""A container for partitioned `Variable` objects.
- @compatibility(eager) `tf.PartitionedVariable` is not compatible with
+ @compatiblity(eager) `tf.PartitionedVariable` is not compatible with
eager execution. Use `tfe.Variable` instead, which is compatible
with both eager execution and graph construction. See [the
TensorFlow Eager Execution
guide](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/g3doc/guide.md#variables-and-optimizers)
for details on how variables work in eager execution.
- @end_compatibility
+ @end_compatiblity
"""
class PartitionedVariableIterator(object):
diff --git a/tensorflow/python/tools/import_pb_to_tensorboard.py b/tensorflow/python/tools/import_pb_to_tensorboard.py
index 00de044505..00de044505 100755..100644
--- a/tensorflow/python/tools/import_pb_to_tensorboard.py
+++ b/tensorflow/python/tools/import_pb_to_tensorboard.py
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.cc b/tensorflow/stream_executor/cuda/cuda_dnn.cc
index d78362d4fb..99bed86a17 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.cc
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.cc
@@ -232,6 +232,7 @@ CUDNN_DNN_ROUTINE_EACH_R3(PERFTOOLS_GPUTOOLS_CUDNN_WRAP)
__macro(cudnnRNNBackwardData) \
__macro(cudnnRNNBackwardWeights) \
__macro(cudnnSetRNNDescriptor) \
+ __macro(cudnnSetRNNDescriptor_v6) \
__macro(cudnnGetFilterNdDescriptor)
// clang-format on
@@ -244,8 +245,7 @@ CUDNN_DNN_ROUTINE_EACH_R5(PERFTOOLS_GPUTOOLS_CUDNN_WRAP)
// clang-format off
#if CUDNN_VERSION >= 6000
#define CUDNN_DNN_ROUTINE_EACH_R6(__macro) \
- __macro(cudnnConvolutionBiasActivationForward) \
- __macro(cudnnSetRNNDescriptor_v6)
+ __macro(cudnnConvolutionBiasActivationForward)
// clang-format on
CUDNN_DNN_ROUTINE_EACH_R6(PERFTOOLS_GPUTOOLS_CUDNN_WRAP)
@@ -665,6 +665,7 @@ class ScopedPoolingDescriptor {
LOG(FATAL) << "could not create cudnn pooling descriptor: "
<< ToString(status);
}
+
const std::vector<int64> strides64 = pooling_descriptor.strides();
const std::vector<int64> padding64 = pooling_descriptor.padding();
const std::vector<int64> shape64 = pooling_descriptor.window();
@@ -679,14 +680,14 @@ class ScopedPoolingDescriptor {
&CheckedNarrowing<int64, int>);
std::transform(shape64.cbegin(), shape64.cend(), shape.begin(),
&CheckedNarrowing<int64, int>);
- bool propagate_nans = pooling_descriptor.propagate_nans();
status = wrap::cudnnSetPoolingNdDescriptor(
parent_, handle_,
(pooling_descriptor.mode() == dnn::PoolingMode::kMaximum
? CUDNN_POOLING_MAX
: CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING),
#if CUDNN_VERSION >= 5000
- propagate_nans ? CUDNN_PROPAGATE_NAN : CUDNN_NOT_PROPAGATE_NAN,
+ // Always propagate nans.
+ CUDNN_PROPAGATE_NAN,
#endif
nd, shape.data(), padding.data(), strides.data());
if (status != CUDNN_STATUS_SUCCESS) {
diff --git a/tensorflow/stream_executor/dnn.cc b/tensorflow/stream_executor/dnn.cc
index 43d2d3cd48..07fe8a85f4 100644
--- a/tensorflow/stream_executor/dnn.cc
+++ b/tensorflow/stream_executor/dnn.cc
@@ -482,7 +482,6 @@ void PoolingDescriptor::CloneFrom(const PoolingDescriptor& other) {
window_ = other.window_;
padding_ = other.padding_;
strides_ = other.strides_;
- propagate_nans_ = other.propagate_nans_;
}
string PoolingDescriptor::ToString() const {
@@ -496,12 +495,9 @@ string PoolingDescriptor::ToString() const {
port::Appendf(&padding, "%lld", padding_[i]);
}
- const char* propagate_string = propagate_nans_ ? "Yes" : "No";
-
- return port::Printf(
- "{mode: %s window: %s strides: %s padding: %s propagate NaNs: %s}",
- mode_string, window.c_str(), strides.c_str(), padding.c_str(),
- propagate_string);
+ return port::Printf("{mode: %s window: %s strides: %s padding: %s}",
+ mode_string, window.c_str(), strides.c_str(),
+ padding.c_str());
}
string PoolingDescriptor::ToShortString() const {
@@ -512,8 +508,7 @@ string PoolingDescriptor::ToShortString() const {
port::Appendf(&padding, "_p%d:%lld", i, padding_[i]);
}
return port::StrCat(mode_ == dnn::PoolingMode::kMaximum ? "max" : "avg",
- window, strides, padding,
- propagate_nans_ ? "propagate_nans" : "ignore_nans");
+ window, strides, padding);
}
// -- NormalizeDescriptor
diff --git a/tensorflow/stream_executor/dnn.h b/tensorflow/stream_executor/dnn.h
index 0d2cd4a9f2..49235167ab 100644
--- a/tensorflow/stream_executor/dnn.h
+++ b/tensorflow/stream_executor/dnn.h
@@ -661,10 +661,6 @@ class PoolingDescriptor {
SetDim(&strides_, dim, value);
return *this;
}
- PoolingDescriptor& set_propagate_nans(bool value) {
- propagate_nans_ = value;
- return *this;
- }
int ndims() const { return ndims_; }
void CloneFrom(const PoolingDescriptor& other);
@@ -685,12 +681,10 @@ class PoolingDescriptor {
std::vector<int64> window() const { return window_; }
std::vector<int64> padding() const { return padding_; }
std::vector<int64> strides() const { return strides_; }
- bool propagate_nans() const { return propagate_nans_; }
private:
PoolingMode mode_;
int ndims_;
- bool propagate_nans_;
// Stored as: ..., y, x.
std::vector<int64> window_;
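With `propagate_nans` gone from `PoolingDescriptor` and the cuDNN call hard-coded to `CUDNN_PROPAGATE_NAN`, a NaN inside a pooling window always surfaces in the output on cuDNN-backed GPUs. A sketch of the observable behavior (an assumption about GPU execution; CPU kernels may differ):

```python
import numpy as np
import tensorflow as tf

x = np.array([[1.0, np.nan], [3.0, 4.0]], dtype=np.float32).reshape(1, 2, 2, 1)
pooled = tf.nn.max_pool(tf.constant(x), ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    print(sess.run(pooled))  # expected NaN when the window contains one
```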
diff --git a/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt b/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt
index 62e634afb8..9fd38a29b7 100644
--- a/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.linalg.pbtxt
@@ -94,7 +94,7 @@ tf_module {
}
member_method {
name: "norm"
- argspec: "args=[\'tensor\', \'ord\', \'axis\', \'keepdims\', \'name\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'euclidean\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'tensor\', \'ord\', \'axis\', \'keep_dims\', \'name\'], varargs=None, keywords=None, defaults=[\'euclidean\', \'None\', \'False\', \'None\'], "
}
member_method {
name: "qr"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.pbtxt
index ebd9c079b5..24c0448dea 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.pbtxt
@@ -170,7 +170,7 @@ tf_module {
}
member_method {
name: "l2_normalize"
- argspec: "args=[\'x\', \'axis\', \'epsilon\', \'name\', \'dim\'], varargs=None, keywords=None, defaults=[\'None\', \'1e-12\', \'None\', \'None\'], "
+ argspec: "args=[\'x\', \'dim\', \'epsilon\', \'name\'], varargs=None, keywords=None, defaults=[\'1e-12\', \'None\'], "
}
member_method {
name: "leaky_relu"
@@ -190,7 +190,7 @@ tf_module {
}
member_method {
name: "log_softmax"
- argspec: "args=[\'logits\', \'axis\', \'name\', \'dim\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
+ argspec: "args=[\'logits\', \'dim\', \'name\'], varargs=None, keywords=None, defaults=[\'-1\', \'None\'], "
}
member_method {
name: "log_uniform_candidate_sampler"
@@ -282,7 +282,7 @@ tf_module {
}
member_method {
name: "softmax"
- argspec: "args=[\'logits\', \'axis\', \'name\', \'dim\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
+ argspec: "args=[\'logits\', \'dim\', \'name\'], varargs=None, keywords=None, defaults=[\'-1\', \'None\'], "
}
member_method {
name: "softmax_cross_entropy_with_logits"
diff --git a/tensorflow/tools/api/golden/tensorflow.pbtxt b/tensorflow/tools/api/golden/tensorflow.pbtxt
index 0edd4153d7..bf7bc6a7c1 100644
--- a/tensorflow/tools/api/golden/tensorflow.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.pbtxt
@@ -750,7 +750,7 @@ tf_module {
}
member_method {
name: "boolean_mask"
- argspec: "args=[\'tensor\', \'mask\', \'name\', \'axis\'], varargs=None, keywords=None, defaults=[\'boolean_mask\', \'None\'], "
+ argspec: "args=[\'tensor\', \'mask\', \'name\'], varargs=None, keywords=None, defaults=[\'boolean_mask\'], "
}
member_method {
name: "broadcast_dynamic_shape"
@@ -858,7 +858,7 @@ tf_module {
}
member_method {
name: "count_nonzero"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'dtype\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \"<dtype: \'int64\'>\", \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'dtype\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \"<dtype: \'int64\'>\", \'None\', \'None\'], "
}
member_method {
name: "count_up_to"
@@ -1414,7 +1414,7 @@ tf_module {
}
member_method {
name: "norm"
- argspec: "args=[\'tensor\', \'ord\', \'axis\', \'keepdims\', \'name\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'euclidean\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'tensor\', \'ord\', \'axis\', \'keep_dims\', \'name\'], varargs=None, keywords=None, defaults=[\'euclidean\', \'None\', \'False\', \'None\'], "
}
member_method {
name: "not_equal"
@@ -1546,11 +1546,11 @@ tf_module {
}
member_method {
name: "reduce_all"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "reduce_any"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "reduce_join"
@@ -1558,27 +1558,27 @@ tf_module {
}
member_method {
name: "reduce_logsumexp"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "reduce_max"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "reduce_mean"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "reduce_min"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "reduce_prod"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "reduce_sum"
- argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
+ argspec: "args=[\'input_tensor\', \'axis\', \'keep_dims\', \'name\', \'reduction_indices\'], varargs=None, keywords=None, defaults=[\'None\', \'False\', \'None\', \'None\'], "
}
member_method {
name: "register_tensor_conversion_function"
diff --git a/tensorflow/tools/ci_build/ci_parameterized_build.sh b/tensorflow/tools/ci_build/ci_parameterized_build.sh
index c27f4953e3..5f791d7bc7 100755
--- a/tensorflow/tools/ci_build/ci_parameterized_build.sh
+++ b/tensorflow/tools/ci_build/ci_parameterized_build.sh
@@ -165,7 +165,7 @@ else
BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:embedding_lookup_test"
BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:embedding_lookup_sparse_test"
BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:fully_connected_test"
- # BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/testing:generated_examples_zip_test"
+ BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/testing:generated_examples_zip_test"
BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:hashtable_lookup_test"
BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:local_response_norm_test"
BAZEL_TARGET="${BAZEL_TARGET} //tensorflow/contrib/lite/kernels:lsh_projection_test"
diff --git a/tensorflow/tools/ci_build/install/install_golang.sh b/tensorflow/tools/ci_build/install/install_golang.sh
index e1edd62cc5..55c1674495 100755
--- a/tensorflow/tools/ci_build/install/install_golang.sh
+++ b/tensorflow/tools/ci_build/install/install_golang.sh
@@ -16,7 +16,7 @@
set -ex
-GOLANG_URL="https://storage.googleapis.com/golang/go1.9.2.linux-amd64.tar.gz"
+GOLANG_URL="https://storage.googleapis.com/golang/go1.9.1.linux-amd64.tar.gz"
sudo mkdir -p /usr/local
wget -q -O - "${GOLANG_URL}" | sudo tar -C /usr/local -xz
diff --git a/tensorflow/tools/ci_build/linux/libtensorflow_docker.sh b/tensorflow/tools/ci_build/linux/libtensorflow_docker.sh
index e5d8303c6e..dcda8228bc 100755
--- a/tensorflow/tools/ci_build/linux/libtensorflow_docker.sh
+++ b/tensorflow/tools/ci_build/linux/libtensorflow_docker.sh
@@ -48,6 +48,6 @@ ${DOCKER_BINARY} run \
-e "TF_NEED_GCP=0" \
-e "TF_NEED_HDFS=0" \
-e "TF_NEED_CUDA=${TF_NEED_CUDA}" \
- -e "TF_NEED_OPENCL_SYCL=0" \
+ -e "TF_NEED_OPENCL=0" \
"${DOCKER_IMAGE}" \
"/workspace/tensorflow/tools/ci_build/linux/libtensorflow.sh"
diff --git a/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh b/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh
index e1b56b9a25..d90a1b905d 100755
--- a/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh
+++ b/tensorflow/tools/ci_build/osx/libtensorflow_cpu.sh
@@ -27,7 +27,7 @@ export PYTHON_BIN_PATH="/usr/bin/python"
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
export TF_NEED_CUDA=0
-export TF_NEED_OPENCL_SYCL=0
+export TF_NEED_OPENCL=0
export TF_NEED_MKL=0
export COMPUTECPP_PATH="/usr/local"
diff --git a/tensorflow/tools/ci_build/osx/libtensorflow_gpu.sh b/tensorflow/tools/ci_build/osx/libtensorflow_gpu.sh
index 5a901af3e5..79973647c1 100755
--- a/tensorflow/tools/ci_build/osx/libtensorflow_gpu.sh
+++ b/tensorflow/tools/ci_build/osx/libtensorflow_gpu.sh
@@ -28,7 +28,7 @@ export LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:${L
export PYTHON_BIN_PATH="/usr/bin/python"
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
-export TF_NEED_OPENCL_SYCL=0
+export TF_NEED_OPENCL=0
export TF_NEED_MKL=0
export COMPUTECPP_PATH="/usr/local"
diff --git a/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh b/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh
index 88116d9f24..5244898c40 100755
--- a/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh
+++ b/tensorflow/tools/ci_build/pi/build_raspberry_pi.sh
@@ -75,23 +75,17 @@ if [[ $1 == "PI_ONE" ]]; then
PI_COPTS="--copt=-march=armv6 --copt=-mfpu=vfp
--copt=-DUSE_GEMM_FOR_CONV --copt=-DUSE_OPENBLAS
--copt=-isystem --copt=${OPENBLAS_INSTALL_PATH}/include/
- --copt=-std=gnu11 --copt=-DS_IREAD=S_IRUSR --copt=-DS_IWRITE=S_IWUSR
--linkopt=-L${OPENBLAS_INSTALL_PATH}/lib/
--linkopt=-l:libopenblas.a"
echo "Building for the Pi One/Zero, with no NEON support"
else
PI_COPTS='--copt=-march=armv7-a --copt=-mfpu=neon-vfpv4
- --copt=-std=gnu11 --copt=-DS_IREAD=S_IRUSR --copt=-DS_IWRITE=S_IWUSR
--copt=-U__GCC_HAVE_SYNC_COMPARE_AND_SWAP_1
--copt=-U__GCC_HAVE_SYNC_COMPARE_AND_SWAP_2
--copt=-U__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8'
echo "Building for the Pi Two/Three, with NEON acceleration"
fi
-# We need to pass down the environment variable with a possible alternate Python
-# include path for Python 3.x builds to work.
-export CROSSTOOL_PYTHON_INCLUDE_PATH
-
cd ${WORKSPACE_PATH}
bazel build -c opt ${PI_COPTS} \
--config=monolithic \
diff --git a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
index 44b6d52952..924ab1a4ae 100644
--- a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
@@ -117,7 +117,7 @@ function run_configure_for_cpu_build {
export TF_NEED_VERBS=0
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
- export TF_NEED_OPENCL_SYCL=0
+ export TF_NEED_OPENCL=0
echo "" | ./configure
}
@@ -141,7 +141,7 @@ function run_configure_for_gpu_build {
export TF_NEED_MKL=0
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
- export TF_NEED_OPENCL_SYCL=0
+ export TF_NEED_OPENCL=0
# TODO(pcloudy): Remove this after TensorFlow uses its own CROSSTOOL
# for GPU build on Windows
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7 b/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
index 9bcc3925a8..64ebc4607a 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
@@ -101,11 +101,12 @@ RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/lib
--jobs=${TF_AVAILABLE_CPUS} \
tensorflow/tools/pip_package:build_pip_package && \
mkdir /pip_pkg && \
- bazel-bin/tensorflow/tools/pip_package/build_pip_package /pip_pkg && \
- pip --no-cache-dir install --upgrade /pip_pkg/tensorflow-*.whl && \
+ bazel-bin/tensorflow/tools/pip_package/build_pip_package /pip_pkg
+
+# Clean up pip wheel and Bazel cache when done.
+RUN pip --no-cache-dir install --upgrade /pip_pkg/tensorflow-*.whl && \
rm -rf /pip_pkg && \
rm -rf /root/.cache
-# Clean up pip wheel and Bazel cache when done.
WORKDIR /root
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index e212d10290..0571dd7391 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:8.0-cudnn6-runtime-ubuntu16.04
+FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
LABEL maintainer="Craig Citro <craigcitro@google.com>"
diff --git a/tensorflow/tools/docker/README.md b/tensorflow/tools/docker/README.md
index e35c58ff80..2e5a0038ed 100644
--- a/tensorflow/tools/docker/README.md
+++ b/tensorflow/tools/docker/README.md
@@ -60,20 +60,6 @@ Building TensorFlow Docker containers should be done through the
script. The raw Dockerfiles should not be used directly as they contain strings
to be replaced by the script during the build.
-Attempting to run [parameterized_docker_build.sh](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/parameterized_docker_build.sh)
-from a binary docker image such as for example `tensorflow/tensorflow:latest` will
-not work. One needs to execute the script from a developer docker image since by
-contrast with a binary docker image it contains not only the compiled solution but
-also the tensorflow source code. Please select the appropriate developer docker
-image of tensorflow at `tensorflow/tensorflow:[.](https://hub.docker.com/r/tensorflow/tensorflow/tags/)`.
-
-The smallest command line to generate a docker image will then be:
-```docker run -it tensorflow/tensorflow:"right_tag"```
-
-If you would like to start a jupyter notebook on your docker container, make sure
-to map the port 8888 of your docker container by adding -p 8888:8888 to the above
-command.
-
To use the script, specify the container type (`CPU` vs. `GPU`), the desired
Python version (`PYTHON2` vs. `PYTHON3`) and whether the developer Docker image
is to be built (`NO` vs. `YES`). In addition, you need to specify the central
diff --git a/tensorflow/tools/graph_transforms/BUILD b/tensorflow/tools/graph_transforms/BUILD
index 9216008600..1bf7113c9e 100644
--- a/tensorflow/tools/graph_transforms/BUILD
+++ b/tensorflow/tools/graph_transforms/BUILD
@@ -131,8 +131,6 @@ cc_library(
"//tensorflow/core:lib",
"//tensorflow/core:protos_all_cc",
"//tensorflow/core:tensorflow",
- "//tensorflow/contrib/rnn:gru_ops_op_lib",
- "//tensorflow/contrib/rnn:lstm_ops_op_lib",
] + if_not_windows([
"//tensorflow/core/kernels:quantized_ops",
"//tensorflow/core/kernels:remote_fused_graph_rewriter_transform",
diff --git a/tensorflow/tools/graph_transforms/quantize_nodes.cc b/tensorflow/tools/graph_transforms/quantize_nodes.cc
index 97e8f77616..2b85e7e83c 100644
--- a/tensorflow/tools/graph_transforms/quantize_nodes.cc
+++ b/tensorflow/tools/graph_transforms/quantize_nodes.cc
@@ -759,7 +759,6 @@ Status QuantizeNodes(const GraphDef& input_graph_def,
NodeDef reshape_dims;
reshape_dims.set_op("Const");
reshape_dims.set_name(unique_input_name + "/reshape_dims");
- AddNodeInput("^" + input_name, &reshape_dims);
SetNodeAttr("dtype", DT_INT32, &reshape_dims);
Tensor reshape_dims_tensor(DT_INT32, {1});
reshape_dims_tensor.flat<int32>()(0) = -1;
@@ -769,7 +768,6 @@ Status QuantizeNodes(const GraphDef& input_graph_def,
NodeDef reduction_dims;
reduction_dims.set_op("Const");
reduction_dims.set_name(unique_input_name + "/reduction_dims");
- AddNodeInput("^" + input_name, &reduction_dims);
SetNodeAttr("dtype", DT_INT32, &reduction_dims);
Tensor reduction_dims_tensor(DT_INT32, {1});
reduction_dims_tensor.flat<int32>()(0) = 0;
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index a493c6f2aa..60282f6aa3 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -29,7 +29,7 @@ from setuptools.dist import Distribution
# This version string is semver compatible, but incompatible with pip.
# For pip, we will remove all '-' characters from this string, and use the
# result for pip.
-_VERSION = '1.4.0'
+_VERSION = '1.4.0-rc1'
REQUIRED_PACKAGES = [
'absl-py',
diff --git a/third_party/aws.BUILD b/third_party/aws.BUILD
index bc9e37ffb3..bc6a2fd8cc 100644
--- a/third_party/aws.BUILD
+++ b/third_party/aws.BUILD
@@ -21,9 +21,6 @@ cc_library(
"@%ws%//tensorflow:linux_ppc64le": glob([
"aws-cpp-sdk-core/source/platform/linux-shared/*.cpp",
]),
- "@%ws%//tensorflow:raspberry_pi_armeabi": glob([
- "aws-cpp-sdk-core/source/platform/linux-shared/*.cpp",
- ]),
"//conditions:default": [],
}) + glob([
"aws-cpp-sdk-core/include/**/*.h",
diff --git a/third_party/curl.BUILD b/third_party/curl.BUILD
index 805a30d262..882967df1c 100644
--- a/third_party/curl.BUILD
+++ b/third_party/curl.BUILD
@@ -477,6 +477,7 @@ genrule(
"# define HAVE_RAND_EGD 1",
"# define HAVE_RAND_STATUS 1",
"# define HAVE_SSL_GET_SHUTDOWN 1",
+ "# define HAVE_STROPTS_H 1",
"# define HAVE_TERMIOS_H 1",
"# define OS \"x86_64-pc-linux-gnu\"",
"# define RANDOM_FILE \"/dev/urandom\"",
diff --git a/third_party/sycl/crosstool/CROSSTOOL.tpl b/third_party/sycl/crosstool/CROSSTOOL.tpl
index f8e50efcc6..32884d71e7 100755
--- a/third_party/sycl/crosstool/CROSSTOOL.tpl
+++ b/third_party/sycl/crosstool/CROSSTOOL.tpl
@@ -35,10 +35,10 @@ toolchain {
tool_path { name: "compat-ld" path: "/usr/bin/ld" }
tool_path { name: "cpp" path: "/usr/bin/cpp" }
tool_path { name: "dwp" path: "/usr/bin/dwp" }
- tool_path { name: "gcc" path: "%{sycl_impl}" }
+ tool_path { name: "gcc" path: "computecpp" }
# Use "-std=c++11" for nvcc. For consistency, force both the host compiler
# and the device compiler to use "-std=c++11".
- cxx_flag: "%{c++_std}"
+ cxx_flag: "-std=c++11"
linker_flag: "-Wl,-no-as-needed"
linker_flag: "-lstdc++"
linker_flag: "-B/usr/bin/"
@@ -53,7 +53,7 @@ toolchain {
cxx_builtin_include_directory: "/usr/local/include"
cxx_builtin_include_directory: "/usr/include"
- cxx_builtin_include_directory: "%{sycl_include_dir}"
+ cxx_builtin_include_directory: "%{computecpp_toolkit_path}"
cxx_builtin_include_directory: "%{python_lib_path}"
tool_path { name: "gcov" path: "/usr/bin/gcov" }
@@ -214,4 +214,4 @@ toolchain {
compiler_flag: "-O2"
compiler_flag: "-DNDEBUG"
}
-} \ No newline at end of file
+}
diff --git a/third_party/sycl/crosstool/trisycl.tpl b/third_party/sycl/crosstool/trisycl.tpl
deleted file mode 100644
index b470772fbf..0000000000
--- a/third_party/sycl/crosstool/trisycl.tpl
+++ /dev/null
@@ -1,73 +0,0 @@
-#!/usr/bin/env python
-
-import os
-import sys
-import tempfile
-from subprocess import call
-
-CPU_CXX_COMPILER = ('%{host_cxx_compiler}')
-CPU_C_COMPILER = ('%{host_c_compiler}')
-
-CURRENT_DIR = os.path.dirname(sys.argv[0])
-TRISYCL_INCLUDE_DIR = CURRENT_DIR + '/../sycl/include'
-
-def main():
- compiler_flags = []
-
- remove_flags = ('-Wl,--no-undefined', '-Wno-unused-but-set-variable', '-Wignored-attributes', '-fno-exceptions')
- # remove -fsanitize-coverage from string with g++
- if 'g++' in CPU_CXX_COMPILER:
- remove_flags += ('-fsanitize-coverage',)
- compiler_flags += ['-fopenmp']
- else:
- compiler_flags += ['-fopenmp=libomp']
-
- compiler_flags += [flag for flag in sys.argv[1:] if not flag.startswith(remove_flags)]
-
-
- output_file_index = compiler_flags.index('-o') + 1
- output_file_name = compiler_flags[output_file_index]
-
- if(output_file_index == 1):
- # we are linking
- return call([CPU_CXX_COMPILER] + compiler_flags +
- ['-Wl,--no-undefined'])
-
- # find what we compile
- compiling_cpp = 0
- if('-c' in compiler_flags):
- compiled_file_index = compiler_flags.index('-c') + 1
- compiled_file_name = compiler_flags[compiled_file_index]
- if(compiled_file_name.endswith(('.cc', '.c++', '.cpp', '.CPP',
- '.C', '.cxx'))):
- compiling_cpp = 1;
-
- debug_flags = ['-DTRISYCL_DEBUG', '-DBOOST_LOG_DYN_LINK', '-DTRISYCL_TRACE_KERNEL', '-lpthread', '-lboost_log', '-g', '-rdynamic']
-
- opt_flags = ['-DNDEBUG', '-DBOOST_DISABLE_ASSERTS', '-O3']
-
- compiler_flags = compiler_flags + ['-DEIGEN_USE_SYCL=1',
- '-DEIGEN_HAS_C99_MATH',
- '-DEIGEN_MAX_ALIGN_BYTES=16',
- '-DTENSORFLOW_USE_SYCL'] + opt_flags
-
- if(compiling_cpp == 1):
- # create a blacklist of folders that will be skipped when compiling
- # with triSYCL
- skip_extensions = [".cu.cc"]
- skip_folders = ["tensorflow/compiler", "tensorflow/docs_src", "tensorflow/tensorboard", "third_party", "external", "hexagon"]
- skip_folders = [(folder + '/') for folder in skip_folders]
- # if compiling external project skip triSYCL
- if any(compiled_file_name.endswith(_ext) for _ext in skip_extensions) or any(_folder in output_file_name for _folder in skip_folders):
- return call([CPU_CXX_COMPILER] + compiler_flags)
-
- host_compiler_flags = ['-xc++', '-Wno-unused-variable',
- '-I', TRISYCL_INCLUDE_DIR] + compiler_flags
- x = call([CPU_CXX_COMPILER] + host_compiler_flags)
- return x
- else:
- # compile for C
- return call([CPU_C_COMPILER] + compiler_flags)
-
-if __name__ == '__main__':
- sys.exit(main())
diff --git a/third_party/sycl/sycl/BUILD.tpl b/third_party/sycl/sycl/BUILD.tpl
index b6ceaadda7..6cad190630 100755
--- a/third_party/sycl/sycl/BUILD.tpl
+++ b/third_party/sycl/sycl/BUILD.tpl
@@ -10,27 +10,16 @@ package(default_visibility = ["//visibility:public"])
exports_files(["LICENSE.text"])
config_setting(
- name = "using_sycl_ccpp",
- define_values = {
- "using_sycl": "true",
- "using_trisycl": "false",
+ name = "using_sycl",
+ values = {
+ "define": "using_sycl=true",
},
)
-config_setting(
- name = "using_sycl_trisycl",
- define_values = {
- "using_sycl": "true",
- "using_trisycl": "false",
- },
-)
-
-
cc_library(
name = "sycl_headers",
hdrs = glob([
"**/*.h",
- "**/*.hpp",
]),
includes = [".", "include"],
)
diff --git a/third_party/sycl/sycl/build_defs.bzl.tpl b/third_party/sycl/sycl/build_defs.bzl.tpl
index 33386f8957..09bef0a661 100755
--- a/third_party/sycl/sycl/build_defs.bzl.tpl
+++ b/third_party/sycl/sycl/build_defs.bzl.tpl
@@ -5,24 +5,9 @@ def if_sycl(if_true, if_false = []):
Returns a select statement which evaluates to if_true if we're building
with SYCL enabled. Otherwise, the select statement evaluates to if_false.
- If we are building with triSYCL instead of ComputeCPP, a list with
- the first element of if_true is returned.
- """
- return select({
- "@local_config_sycl//sycl:using_sycl_ccpp": if_true,
- "@local_config_sycl//sycl:using_sycl_trisycl": if_true[0:1],
- "//conditions:default": if_false
- })
-
-def if_ccpp(if_true, if_false = []):
- """Shorthand for select()'ing if we are building with ComputeCPP.
- Returns a select statement which evaluates to if_true if we're building
- with ComputeCPP enabled. Otherwise, the select statement evaluates
- to if_false.
"""
return select({
- "@local_config_sycl//sycl:using_sycl_ccpp": if_true,
- "@local_config_sycl//sycl:using_sycl_trisycl": if_false,
+ "@local_config_sycl//sycl:using_sycl": if_true,
"//conditions:default": if_false
})
diff --git a/third_party/sycl/sycl_configure.bzl b/third_party/sycl/sycl_configure.bzl
index a0c9e4e43a..7af063178e 100644
--- a/third_party/sycl/sycl_configure.bzl
+++ b/third_party/sycl/sycl_configure.bzl
@@ -5,26 +5,20 @@
* HOST_CXX_COMPILER: The host C++ compiler
* HOST_C_COMPILER: The host C compiler
* COMPUTECPP_TOOLKIT_PATH: The path to the ComputeCpp toolkit.
- * TRISYCL_INCLUDE_DIR: The path to the include directory of triSYCL.
- (if using triSYCL instead of ComputeCPP)
* PYTHON_LIB_PATH: The path to the python lib
"""
_HOST_CXX_COMPILER = "HOST_CXX_COMPILER"
_HOST_C_COMPILER= "HOST_C_COMPILER"
_COMPUTECPP_TOOLKIT_PATH = "COMPUTECPP_TOOLKIT_PATH"
-_TRISYCL_INCLUDE_DIR = "TRISYCL_INCLUDE_DIR"
_PYTHON_LIB_PATH = "PYTHON_LIB_PATH"
def _enable_sycl(repository_ctx):
- if "TF_NEED_OPENCL_SYCL" in repository_ctx.os.environ:
- enable_sycl = repository_ctx.os.environ["TF_NEED_OPENCL_SYCL"].strip()
+ if "TF_NEED_OPENCL" in repository_ctx.os.environ:
+ enable_sycl = repository_ctx.os.environ["TF_NEED_OPENCL"].strip()
return enable_sycl == "1"
return False
-def _enable_compute_cpp(repository_ctx):
- return _COMPUTECPP_TOOLKIT_PATH in repository_ctx.os.environ
-
def auto_configure_fail(msg):
"""Output failure message when auto configuration fails."""
red = "\033[0;31m"
@@ -65,15 +59,6 @@ def find_computecpp_root(repository_ctx):
return sycl_name
fail("Cannot find SYCL compiler, please correct your path")
-def find_trisycl_include_dir(repository_ctx):
- """Find triSYCL include directory. """
- sycl_name = ""
- if _TRISYCL_INCLUDE_DIR in repository_ctx.os.environ:
- sycl_name = repository_ctx.os.environ[_TRISYCL_INCLUDE_DIR].strip()
- if sycl_name.startswith("/"):
- return sycl_name
- fail( "Cannot find triSYCL include directory, please correct your path")
-
def find_python_lib(repository_ctx):
"""Returns python path."""
if _PYTHON_LIB_PATH in repository_ctx.os.environ:
@@ -186,53 +171,26 @@ def _sycl_autoconf_imp(repository_ctx):
_tpl(repository_ctx, "sycl:platform.bzl")
_tpl(repository_ctx, "crosstool:BUILD")
_file(repository_ctx, "sycl:LICENSE.text")
-
- if _enable_compute_cpp(repository_ctx):
- _tpl(repository_ctx, "crosstool:computecpp",
- {
- "%{host_cxx_compiler}" : find_cc(repository_ctx),
- "%{host_c_compiler}" : find_c(repository_ctx)
- })
-
- computecpp_root = find_computecpp_root(repository_ctx);
- _check_dir(repository_ctx, computecpp_root)
-
- _tpl(repository_ctx, "crosstool:CROSSTOOL",
- {
- "%{sycl_include_dir}" : computecpp_root,
- "%{sycl_impl}" : "computecpp",
- "%{c++_std}" : "-std=c++11",
- "%{python_lib_path}" : find_python_lib(repository_ctx),
- })
-
- # symlink libraries
- _check_lib(repository_ctx, computecpp_root+"/lib", "libComputeCpp.so" )
- _symlink_dir(repository_ctx, computecpp_root + "/lib", "sycl/lib")
- _symlink_dir(repository_ctx, computecpp_root + "/include", "sycl/include")
- _symlink_dir(repository_ctx, computecpp_root + "/bin", "sycl/bin")
- else:
-
- trisycl_include_dir = find_trisycl_include_dir(repository_ctx);
- _check_dir(repository_ctx, trisycl_include_dir)
-
- _tpl(repository_ctx, "crosstool:trisycl",
- {
- "%{host_cxx_compiler}" : find_cc(repository_ctx),
- "%{host_c_compiler}" : find_c(repository_ctx),
- "%{trisycl_include_dir}" : trisycl_include_dir
- })
-
-
- _tpl(repository_ctx, "crosstool:CROSSTOOL",
- {
- "%{sycl_include_dir}" : trisycl_include_dir,
- "%{sycl_impl}" : "trisycl",
- "%{c++_std}" : "-std=c++1y",
- "%{python_lib_path}" : find_python_lib(repository_ctx),
- })
-
- _symlink_dir(repository_ctx, trisycl_include_dir, "sycl/include")
-
+ _tpl(repository_ctx, "crosstool:computecpp",
+ {
+ "%{host_cxx_compiler}" : find_cc(repository_ctx),
+ "%{host_c_compiler}" : find_c(repository_ctx),
+ })
+
+ computecpp_root = find_computecpp_root(repository_ctx)
+ _check_dir(repository_ctx, computecpp_root)
+
+ _tpl(repository_ctx, "crosstool:CROSSTOOL",
+ {
+ "%{computecpp_toolkit_path}" : computecpp_root,
+ "%{python_lib_path}" : find_python_lib(repository_ctx),
+ })
+
+ # symlink libraries
+ _check_lib(repository_ctx, computecpp_root+"/lib", "libComputeCpp.so" )
+ _symlink_dir(repository_ctx, computecpp_root + "/lib", "sycl/lib")
+ _symlink_dir(repository_ctx, computecpp_root + "/include", "sycl/include")
+ _symlink_dir(repository_ctx, computecpp_root + "/bin", "sycl/bin")
sycl_configure = repository_rule(
implementation = _sycl_autoconf_imp,
diff --git a/third_party/zlib.BUILD b/third_party/zlib.BUILD
index d164ee719c..8509668891 100644
--- a/third_party/zlib.BUILD
+++ b/third_party/zlib.BUILD
@@ -49,7 +49,7 @@ cc_library(
":windows_msvc": [],
"//conditions:default": [
"-Wno-shift-negative-value",
- "-DZ_HAVE_UNISTD_H",
+ "-Wno-implicit-function-declaration",
],
}),
includes = ["."],
diff --git a/tools/bazel.rc b/tools/bazel.rc
index 04c24d7511..2d7201ae57 100644
--- a/tools/bazel.rc
+++ b/tools/bazel.rc
@@ -9,16 +9,13 @@ build:win-cuda --define=using_cuda=true --define=using_cuda_nvcc=true
build:mkl --define=using_mkl=true
build:sycl --crosstool_top=@local_config_sycl//crosstool:toolchain
-build:sycl --define=using_sycl=true --define=using_trisycl=false
+build:sycl --define=using_sycl=true
build:sycl_nodouble --crosstool_top=@local_config_sycl//crosstool:toolchain
build:sycl_nodouble --define=using_sycl=true --cxxopt -DTENSORFLOW_SYCL_NO_DOUBLE
build:sycl_asan --crosstool_top=@local_config_sycl//crosstool:toolchain
-build:sycl_asan --define=using_sycl=true --define=using_trisycl=false --copt -fno-omit-frame-pointer --copt -fsanitize-coverage=3 --copt -DGPR_NO_DIRECT_SYSCALLS --linkopt -fPIC --linkopt -fsanitize=address
-
-build:sycl_trisycl --crosstool_top=@local_config_sycl//crosstool:toolchain
-build:sycl_trisycl --define=using_sycl=true --define=using_trisycl=true
+build:sycl_asan --define=using_sycl=true --copt -fno-omit-frame-pointer --copt -fsanitize-coverage=3 --copt -DGPR_NO_DIRECT_SYSCALLS --linkopt -fPIC --linkopt -fsanitize=address
build --define=use_fast_cpp_protos=true
build --define=allow_oversize_protos=true
diff --git a/util/python/BUILD b/util/python/BUILD
index f5fa0c6d29..96daf9947a 100644
--- a/util/python/BUILD
+++ b/util/python/BUILD
@@ -1,4 +1,4 @@
-licenses(["notice"]) # New BSD, Python Software Foundation
+licenses(["restricted"])
package(default_visibility = ["//visibility:public"])