aboutsummaryrefslogtreecommitdiffhomepage
diff options
context:
space:
mode:
-rw-r--r--.gitignore5
-rw-r--r--README.md9
-rw-r--r--RELEASE.md81
-rw-r--r--configure.py5
-rw-r--r--tensorflow/BUILD1
-rw-r--r--tensorflow/c/c_api.h2
-rw-r--r--tensorflow/cc/gradients/math_grad.cc32
-rw-r--r--tensorflow/cc/gradients/math_grad_test.cc58
-rw-r--r--tensorflow/compiler/jit/BUILD1
-rw-r--r--tensorflow/compiler/plugin/BUILD56
-rw-r--r--tensorflow/compiler/plugin/README.md16
-rw-r--r--tensorflow/compiler/xla/service/hlo_computation_test.cc2
-rw-r--r--tensorflow/compiler/xla/service/inliner.cc6
-rw-r--r--tensorflow/compiler/xla/service/inliner_test.cc39
-rw-r--r--tensorflow/contrib/all_reduce/python/all_reduce.py2
-rw-r--r--tensorflow/contrib/boosted_trees/README.md2
-rw-r--r--tensorflow/contrib/boosted_trees/examples/binary_mnist.py2
-rw-r--r--tensorflow/contrib/boosted_trees/examples/mnist.py2
-rw-r--r--tensorflow/contrib/cmake/external/cub.cmake4
-rw-r--r--tensorflow/contrib/cmake/external/protobuf.cmake4
-rw-r--r--tensorflow/contrib/cmake/tf_core_kernels.cmake4
-rw-r--r--tensorflow/contrib/cmake/tf_tests.cmake90
-rw-r--r--tensorflow/contrib/cudnn_rnn/python/kernel_tests/cudnn_rnn_test.py6
-rw-r--r--tensorflow/contrib/data/python/kernel_tests/BUILD6
-rw-r--r--tensorflow/contrib/framework/BUILD29
-rw-r--r--tensorflow/contrib/framework/python/ops/accumulate_n_v2.py111
-rw-r--r--tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py85
-rw-r--r--tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py123
-rwxr-xr-xtensorflow/contrib/image/__init__.py4
-rw-r--r--tensorflow/contrib/image/python/kernel_tests/image_ops_test.py33
-rw-r--r--tensorflow/contrib/image/python/ops/image_ops.py294
-rw-r--r--tensorflow/contrib/kfac/python/ops/loss_functions.py6
-rw-r--r--tensorflow/contrib/kfac/python/ops/op_queue.py2
-rw-r--r--tensorflow/contrib/layers/__init__.py1
-rw-r--r--tensorflow/contrib/learn/python/learn/learn_runner.py2
-rw-r--r--tensorflow/contrib/losses/python/losses/loss_ops.py17
-rw-r--r--tensorflow/contrib/makefile/Makefile4
-rwxr-xr-xtensorflow/contrib/makefile/download_dependencies.sh2
-rw-r--r--tensorflow/contrib/makefile/tf_op_files.txt1
-rw-r--r--tensorflow/contrib/meta_graph_transform/meta_graph_transform.py2
-rw-r--r--tensorflow/contrib/metrics/python/ops/metric_ops_test.py54
-rw-r--r--tensorflow/contrib/mpi_collectives/__init__.py2
-rw-r--r--tensorflow/contrib/nn/__init__.py2
-rw-r--r--tensorflow/contrib/receptive_field/python/util/receptive_field.py134
-rw-r--r--tensorflow/contrib/receptive_field/python/util/receptive_field_test.py56
-rw-r--r--tensorflow/contrib/stateless/python/kernel_tests/stateless_random_ops_test.py16
-rw-r--r--tensorflow/core/BUILD34
-rw-r--r--tensorflow/core/common_runtime/accumulate_n_optimizer.cc191
-rw-r--r--tensorflow/core/common_runtime/mkl_cpu_allocator.h61
-rw-r--r--tensorflow/core/common_runtime/mkl_cpu_allocator_test.cc53
-rw-r--r--tensorflow/core/framework/common_shape_fns.cc58
-rw-r--r--tensorflow/core/framework/node_def.proto2
-rw-r--r--tensorflow/core/framework/register_types.h4
-rw-r--r--tensorflow/core/framework/rendezvous.cc2
-rw-r--r--tensorflow/core/graph/graph.h2
-rw-r--r--tensorflow/core/graph/mkl_layout_pass.cc2
-rw-r--r--tensorflow/core/graph/mkl_layout_pass_test.cc2
-rw-r--r--tensorflow/core/graph/testlib.cc18
-rw-r--r--tensorflow/core/graph/testlib.h6
-rw-r--r--tensorflow/core/grappler/optimizers/model_pruner.cc2
-rw-r--r--tensorflow/core/kernels/BUILD55
-rw-r--r--tensorflow/core/kernels/batchtospace_op.cc50
-rw-r--r--tensorflow/core/kernels/conv_ops_gpu_3.cu.cc2
-rw-r--r--tensorflow/core/kernels/crop_and_resize_op_test.cc6
-rw-r--r--tensorflow/core/kernels/dataset.h2
-rw-r--r--tensorflow/core/kernels/diag_op.cc295
-rw-r--r--tensorflow/core/kernels/diag_op.h43
-rw-r--r--tensorflow/core/kernels/diag_op_gpu.cu.cc139
-rw-r--r--tensorflow/core/kernels/diag_op_test.cc54
-rw-r--r--tensorflow/core/kernels/histogram_op.cc147
-rw-r--r--tensorflow/core/kernels/histogram_op.h38
-rw-r--r--tensorflow/core/kernels/histogram_op_gpu.cu.cc125
-rw-r--r--tensorflow/core/kernels/listdiff_op.cc16
-rw-r--r--tensorflow/core/kernels/map_stage_op.cc12
-rw-r--r--tensorflow/core/kernels/mirror_pad_op.cc200
-rw-r--r--tensorflow/core/kernels/mirror_pad_op.h13
-rw-r--r--tensorflow/core/kernels/mirror_pad_op_cpu_impl.h12
-rw-r--r--tensorflow/core/kernels/mirror_pad_op_gpu.cu.cc32
-rw-r--r--tensorflow/core/kernels/mkl_conv_ops.cc6
-rw-r--r--tensorflow/core/kernels/nth_element_op.cc139
-rw-r--r--tensorflow/core/kernels/nth_element_op.h39
-rw-r--r--tensorflow/core/kernels/pad_op.cc144
-rw-r--r--tensorflow/core/kernels/pad_op.h10
-rw-r--r--tensorflow/core/kernels/pad_op_gpu.cu.cc20
-rw-r--r--tensorflow/core/kernels/reduction_ops_all.cc16
-rw-r--r--tensorflow/core/kernels/reduction_ops_any.cc16
-rw-r--r--tensorflow/core/kernels/reduction_ops_common.cc22
-rw-r--r--tensorflow/core/kernels/reduction_ops_common.h27
-rw-r--r--tensorflow/core/kernels/reduction_ops_max.cc90
-rw-r--r--tensorflow/core/kernels/reduction_ops_mean.cc68
-rw-r--r--tensorflow/core/kernels/reduction_ops_min.cc90
-rw-r--r--tensorflow/core/kernels/reduction_ops_prod.cc68
-rw-r--r--tensorflow/core/kernels/reduction_ops_sum.cc90
-rw-r--r--tensorflow/core/kernels/resize_bicubic_op.cc85
-rw-r--r--tensorflow/core/kernels/resize_bicubic_op_test.cc20
-rw-r--r--tensorflow/core/kernels/reverse_sequence_op.cc3
-rw-r--r--tensorflow/core/kernels/reverse_sequence_op_gpu.cu.cc1
-rw-r--r--tensorflow/core/kernels/scan_ops.cc98
-rw-r--r--tensorflow/core/kernels/sequence_ops.cc48
-rw-r--r--tensorflow/core/kernels/sequence_ops_test.cc148
-rw-r--r--tensorflow/core/kernels/spacetobatch_op.cc50
-rw-r--r--tensorflow/core/kernels/sparse_matmul_op.h6
-rw-r--r--tensorflow/core/kernels/stage_op.cc14
-rw-r--r--tensorflow/core/kernels/stateless_random_ops.cc3
-rw-r--r--tensorflow/core/kernels/tile_functor.h39
-rw-r--r--tensorflow/core/kernels/tile_functor_cpu.cc12
-rw-r--r--tensorflow/core/kernels/tile_functor_gpu.cu.cc12
-rw-r--r--tensorflow/core/kernels/tile_ops.cc249
-rw-r--r--tensorflow/core/kernels/transpose_op.cc134
-rw-r--r--tensorflow/core/ops/array_ops.cc44
-rw-r--r--tensorflow/core/ops/array_ops_test.cc13
-rw-r--r--tensorflow/core/ops/image_ops.cc43
-rw-r--r--tensorflow/core/ops/image_ops_test.cc6
-rw-r--r--tensorflow/core/ops/math_ops.cc72
-rw-r--r--tensorflow/core/ops/nn_ops.cc50
-rw-r--r--tensorflow/core/ops/nn_ops_test.cc24
-rw-r--r--tensorflow/core/platform/s3/s3_crypto.cc2
-rw-r--r--tensorflow/core/profiler/README.md2
-rw-r--r--tensorflow/core/profiler/g3doc/options.md2
-rw-r--r--tensorflow/core/public/version.h4
-rw-r--r--tensorflow/docs_src/api_guides/python/reading_data.md2
-rw-r--r--tensorflow/docs_src/get_started/estimator.md20
-rw-r--r--tensorflow/docs_src/install/install_c.md2
-rw-r--r--tensorflow/docs_src/install/install_go.md2
-rw-r--r--tensorflow/docs_src/install/install_java.md18
-rw-r--r--tensorflow/docs_src/install/install_linux.md38
-rw-r--r--tensorflow/docs_src/install/install_mac.md12
-rw-r--r--tensorflow/docs_src/install/install_sources.md25
-rw-r--r--tensorflow/docs_src/install/install_windows.md2
-rw-r--r--tensorflow/docs_src/performance/performance_guide.md2
-rw-r--r--tensorflow/docs_src/performance/performance_models.md2
-rw-r--r--tensorflow/docs_src/programmers_guide/datasets.md2
-rw-r--r--tensorflow/docs_src/programmers_guide/graphs.md4
-rw-r--r--tensorflow/docs_src/programmers_guide/saved_model.md33
-rw-r--r--tensorflow/docs_src/tutorials/wide.md3
-rw-r--r--tensorflow/examples/get_started/regression/imports85.py2
-rw-r--r--tensorflow/examples/get_started/regression/linear_regression_categorical.py2
-rwxr-xr-xtensorflow/examples/learn/resnet.py4
-rw-r--r--tensorflow/examples/tutorials/word2vec/word2vec_basic.py6
-rw-r--r--tensorflow/java/BUILD39
-rw-r--r--tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java164
-rw-r--r--tensorflow/java/src/gen/resources/META-INF/services/javax.annotation.processing.Processor1
-rw-r--r--tensorflow/java/src/main/java/org/tensorflow/op/annotation/Operator.java2
-rw-r--r--tensorflow/java/src/test/java/org/tensorflow/processor/OperatorProcessorTest.java51
-rw-r--r--tensorflow/java/src/test/resources/org/tensorflow/processor/operator/bad/BasicBad.java22
-rw-r--r--tensorflow/java/src/test/resources/org/tensorflow/processor/operator/good/BasicGood.java21
-rw-r--r--tensorflow/python/BUILD13
-rw-r--r--tensorflow/python/debug/cli/tensor_format.py2
-rw-r--r--tensorflow/python/estimator/training.py6
-rw-r--r--tensorflow/python/kernel_tests/BUILD15
-rw-r--r--tensorflow/python/kernel_tests/batchtospace_op_test.py36
-rw-r--r--tensorflow/python/kernel_tests/diag_op_test.py64
-rw-r--r--tensorflow/python/kernel_tests/listdiff_op_test.py20
-rw-r--r--tensorflow/python/kernel_tests/metrics_test.py51
-rw-r--r--tensorflow/python/kernel_tests/nth_element_op_test.py174
-rw-r--r--tensorflow/python/kernel_tests/pad_op_test.py28
-rw-r--r--tensorflow/python/kernel_tests/reduction_ops_test.py52
-rw-r--r--tensorflow/python/kernel_tests/scan_ops_test.py18
-rw-r--r--tensorflow/python/kernel_tests/shape_ops_test.py18
-rw-r--r--tensorflow/python/kernel_tests/slice_op_test.py11
-rw-r--r--tensorflow/python/kernel_tests/transpose_op_test.py13
-rw-r--r--tensorflow/python/ops/hidden_ops.txt3
-rw-r--r--tensorflow/python/ops/histogram_ops.py31
-rw-r--r--tensorflow/python/ops/histogram_ops_test.py8
-rw-r--r--tensorflow/python/ops/image_ops_test.py29
-rw-r--r--tensorflow/python/ops/losses/losses_impl.py22
-rw-r--r--tensorflow/python/ops/metrics_impl.py14
-rw-r--r--tensorflow/python/ops/nn_grad.py36
-rw-r--r--tensorflow/python/ops/nn_grad_test.py48
-rw-r--r--tensorflow/python/ops/nn_ops.py28
-rw-r--r--tensorflow/python/platform/self_check.py8
-rw-r--r--tensorflow/tools/api/golden/tensorflow.losses.pbtxt2
-rw-r--r--tensorflow/tools/ci_build/Dockerfile.pi3
-rw-r--r--tensorflow/tools/ci_build/Dockerfile.pi-python323
-rw-r--r--tensorflow/tools/ci_build/README.md143
-rwxr-xr-xtensorflow/tools/ci_build/builds/android_full.sh4
-rwxr-xr-xtensorflow/tools/ci_build/builds/libtensorflow.sh45
-rwxr-xr-xtensorflow/tools/ci_build/install/install_golang.sh2
-rwxr-xr-xtensorflow/tools/ci_build/install/install_pi_python3_toolchain.sh29
-rwxr-xr-xtensorflow/tools/ci_build/install/install_pi_toolchain.sh2
-rwxr-xr-xtensorflow/tools/ci_build/linux/cpu/run_mkl.sh36
-rw-r--r--tensorflow/tools/docker/Dockerfile2
-rw-r--r--tensorflow/tools/docker/Dockerfile.devel4
-rw-r--r--tensorflow/tools/docker/Dockerfile.devel-gpu15
-rw-r--r--tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn733
-rw-r--r--tensorflow/tools/docker/Dockerfile.gpu2
-rw-r--r--tensorflow/tools/docker/README.md1
-rw-r--r--tensorflow/tools/pip_package/setup.py5
-rw-r--r--tensorflow/workspace.bzl44
-rw-r--r--third_party/aws.BUILD1
-rw-r--r--third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h51
-rw-r--r--third_party/toolchains/cpus/arm/CROSSTOOL.tpl2
-rw-r--r--third_party/toolchains/cpus/arm/arm_compiler_configure.bzl11
193 files changed, 5403 insertions, 1408 deletions
diff --git a/.gitignore b/.gitignore
index 09734fe497..9ae0d9c96f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,3 +17,8 @@ cmake_build/
.idea/**
/build/
/tensorflow/core/util/version_info.cc
+/tensorflow/python/framework/fast_tensor_util.cpp
+Pods
+Podfile.lock
+*.pbxproj
+*.xcworkspacedata
diff --git a/README.md b/README.md
index 6339c57c95..24bbb6cec1 100644
--- a/README.md
+++ b/README.md
@@ -38,10 +38,11 @@ People who are a little more adventurous can also try our nightly binaries:
**Nightly pip packages**
* We are pleased to announce that TensorFlow now offers nightly pip packages
-under the [tf-nightly](https://pypi.python.org/pypi/tf-nightly) project on pypi.
-Simply run `pip install tf-nightly` in a clean environment to install the nightly
-tensorflow build. We currently only support CPU packages on Linux, Mac, and Windows.
-GPU packages on all platforms will arrive soon!
+under the [tf-nightly](https://pypi.python.org/pypi/tf-nightly) and
+[tf-nightly-gpu](https://pypi.python.org/pypi/tf-nightly-gpu) project on pypi.
+Simply run `pip install tf-nightly` or `pip install tf-nightly-gpu` in a clean
+environment to install the nightly TensorFlow build. We support CPU and GPU
+packages on Linux, Mac, and Windows.
**Individual whl files**
diff --git a/RELEASE.md b/RELEASE.md
index 2c6535c15d..4a33bce8b2 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,20 +1,51 @@
# Release 1.4.0
## Major Features And Improvements
+* `tf.keras` is now part of the core TensorFlow API.
* [`tf.data`](http://tensorflow.org/programmers_guide/datasets) is now part of
the core TensorFlow API.
* The API is now subject to backwards compatibility guarantees.
* For a guide to migrating from the `tf.contrib.data` API, see the
- [README] (https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/contrib/data/README.md).
+ [README](https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/contrib/data/README.md).
* Major new features include `Dataset.from_generator()` (for building an input
pipeline from a Python generator), and the `Dataset.apply()` method for
applying custom transformation functions.
* Several custom transformation functions have been added, including
`tf.contrib.data.batch_and_drop_remainder()` and
`tf.contrib.data.sloppy_interleave()`.
+* Add `train_and_evaluate` for simple distributed `Estimator` training.
+* Add `tf.spectral.dct` for computing the DCT-II.
+* Add Mel-Frequency Cepstral Coefficient support to `tf.contrib.signal`
+ (with GPU and gradient support).
+* Add a self-check on `import tensorflow` for Windows DLL issues.
+* Add NCHW support to `tf.depth_to_space` on GPU.
+* SinhArcsinh (scalar) distribution added to `contrib.distributions`.
+* Make `GANEstimator` opensource.
+* `Estimator.export_savedmodel()` now includes all valid serving signatures
+ that can be constructed from the Serving Input Receiver and all available
+ ExportOutputs. For instance, a classifier may provide regression- and
+ prediction-flavored outputs, in addition to the classification-flavored one.
+ Building signatures from these allows TF Serving to honor requests using the
+ different APIs (Classify, Regress, and Predict). Furthermore,
+ `serving_input_receiver_fn()` may now specify alternative subsets of nodes
+ that may act as inputs. This allows, for instance, producing a prediction
+ signature for a classifier that accepts raw `Tensors` instead of a serialized
+ `tf.Example`.
+* Add `tf.contrib.bayesflow.hmc`.
+* Add `tf.contrib.distributions.MixtureSameFamily`.
+* Make `Dataset.shuffle()` always reshuffles after each iteration by default.
+* Add `tf.contrib.bayesflow.metropolis_hastings`.
+* Add `log_rate` parameter to `tf.contrib.distributions.Poisson`.
+* Extend `tf.contrib.distributions.bijector` API to handle some non-injective
+ transforms.
* Java:
- * Generics (e.g., `Tensor<Integer>`) for improved type-safety (courtesy @andrewcmyers).
+ * Generics (e.g., `Tensor<Integer>`) for improved type-safety
+ (courtesy @andrewcmyers).
* Support for multi-dimensional string tensors.
+ * Support loading of custom operations (e.g. many in `tf.contrib`) on Linux
+ and OS X
+* All our prebuilt binaries have been built with CUDA 8 and cuDNN 6.
+ We anticipate releasing TensorFlow 1.5 with CUDA 9 and cuDNN 7.
## Bug Fixes and Other Changes
* `tf.nn.rnn_cell.DropoutWrapper` is now more careful about dropping out LSTM
@@ -26,11 +57,57 @@
* Removed `tf.contrib.training.python_input`. The same behavior, in a more
flexible and reproducible package, is available via the new
`tf.contrib.data.Dataset.from_generator` method!
+* Fix `tf.contrib.distributions.Affine` incorrectly computing log-det-jacobian.
+* Fix `tf.random_gamma` incorrectly handling non-batch, scalar draws.
+* Resolved a race condition in TensorForest TreePredictionsV4Op.
+* Google Cloud Storage file system and Hadoop file system support are now
+ default build options.
+* Custom op libraries must link against libtensorflow_framework.so
+ (installed at `tf.sysconfig.get_lib()`).
## Breaking Changes to the API
* The signature of the `tf.contrib.data.rejection_resample()` function has been
changed. It now returns a function that can be used as an argument to
`Dataset.apply()`.
+* Remove `tf.contrib.data.Iterator.from_dataset()` method. Use
+ `Dataset.make_initializable_iterator()` instead.
+* Remove seldom used and unnecessary `tf.contrib.data.Iterator.dispose_op()`.
+* Reorder some TFGAN loss functions in a non-backwards compatible way.
+
+## Thanks to our Contributors
+
+This release contains contributions from many people at Google, as well as:
+
+4d55397500, Abdullah Alrasheed, abenmao, Adam Salvail, Aditya Dhulipala, Ag Ramesh,
+Akimasa Kimura, Alan Du, Alan Yee, Alexander, Amit Kushwaha, Amy, Andrei Costinescu,
+Andrei Nigmatulin, Andrew Erlichson, Andrew Myers, Andrew Stepanov, Androbin, AngryPowman,
+Anish Shah, Anton Daitche, Artsiom Chapialiou, asdf2014, Aseem Raj Baranwal, Ash Hall,
+Bart Kiers, Batchu Venkat Vishal, ben, Ben Barsdell, Bill Piel, Carl Thomé, Catalin Voss,
+Changming Sun, Chengzhi Chen, Chi Zeng, Chris Antaki, Chris Donahue, Chris Oelmueller,
+Chris Tava, Clayne Robison, Codrut, Courtial Florian, Dalmo Cirne, Dan J, Darren Garvey,
+David Kristoffersson, David Norman, David RöThlisberger, DavidNorman, Dhruv, DimanNe,
+Dorokhov, Duncan Mac-Vicar P, EdwardDixon, EMCP, error.d, FAIJUL, Fan Xia,
+Francois Xavier, Fred Reiss, Freedom" Koan-Sin Tan, Fritz Obermeyer, Gao, Xiang,
+Guenther Schmuelling, Guo Yejun (郭叶军), Hans Gaiser, HectorSVC, Hyungsuk Yoon,
+James Pruegsanusak, Jay Young, Jean Wanka, Jeff Carpenter, Jeremy Rutman, Jeroen BéDorf,
+Jett Jones, Jimmy Jia, jinghuangintel, jinze1994, JKurland, Joel Hestness, joetoth,
+John B Nelson, John Impallomeni, John Lawson, Jonas, Jonathan Dekhtiar, joshkyh, Jun Luan,
+Jun Mei, Kai Sasaki, Karl Lessard, karl@kubx.ca, Kb Sriram, Kenichi Ueno, Kevin Slagle,
+Kongsea, Lakshay Garg, lhlmgr, Lin Min, liu.guangcong, Loki Der Quaeler, Louie Helm,
+lucasmoura, Luke Iwanski, Lyndon White, Mahmoud Abuzaina, Marcel Puyat, Mark Aaron Shirley,
+Michele Colombo, MtDersvan, Namrata-Ibm, Nathan Luehr, Naurril, Nayana Thorat, Nicolas Lopez,
+Niranjan Hasabnis, Nolan Liu, Nouce, Oliver Hennigh, osdamv, Patrik Erdes,
+Patryk Chrabaszcz, Pavel Christof, Penghao Cen, postBG, Qingqing Cao, Qingying Chen, qjivy,
+Raphael, Rasmi, raymondxyang, Renze Yu, resec, Roffel, Ruben Vereecken, Ryohei Kuroki,
+sandipmgiri, Santiago Castro, Scott Kirkland, Sean Vig, Sebastian Raschka, Sebastian Weiss,
+Sergey Kolesnikov, Sergii Khomenko, Shahid, Shivam Kotwalia, Stuart Berg, Sumit Gouthaman,
+superzerg, Sven Mayer, tetris, Ti Zhou, Tiago Freitas Pereira, Tian Jin, Tomoaki Oiki,
+Vaibhav Sood, vfdev, Vivek Rane, Vladimir Moskva, wangqr, Weber Xie, Will Frey,
+Yan Facai (颜发才), yanivbl6, Yaroslav Bulatov, Yixing Lao, Yong Tang, youkaichao,
+Yuan (Terry) Tang, Yue Zhang, Yuxin Wu, Ziming Dong, ZxYuan, 黄璞
+
+We are also grateful to all who filed issues or helped resolve them, asked and
+answered questions, and were part of inspiring discussions.
# Release 1.3.0
diff --git a/configure.py b/configure.py
index ea3f598f3d..425eae676c 100644
--- a/configure.py
+++ b/configure.py
@@ -989,6 +989,7 @@ def main():
run_gen_git_source(environ_cp)
if is_windows():
+ environ_cp['TF_NEED_S3'] = '0'
environ_cp['TF_NEED_GCP'] = '0'
environ_cp['TF_NEED_HDFS'] = '0'
environ_cp['TF_NEED_JEMALLOC'] = '0'
@@ -1001,9 +1002,9 @@ def main():
set_build_var(environ_cp, 'TF_NEED_JEMALLOC', 'jemalloc as malloc',
'with_jemalloc', True)
set_build_var(environ_cp, 'TF_NEED_GCP', 'Google Cloud Platform',
- 'with_gcp_support', False, 'gcp')
+ 'with_gcp_support', True, 'gcp')
set_build_var(environ_cp, 'TF_NEED_HDFS', 'Hadoop File System',
- 'with_hdfs_support', False, 'hdfs')
+ 'with_hdfs_support', True, 'hdfs')
set_build_var(environ_cp, 'TF_NEED_S3', 'Amazon S3 File System',
'with_s3_support', True, 's3')
set_build_var(environ_cp, 'TF_ENABLE_XLA', 'XLA JIT', 'with_xla_support',
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index 673e433a8a..20f02ad50a 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -323,6 +323,7 @@ filegroup(
"//tensorflow/compiler/jit/kernels:all_files",
"//tensorflow/compiler/jit/legacy_flags:all_files",
"//tensorflow/compiler/jit/ops:all_files",
+ "//tensorflow/compiler/plugin:all_files",
"//tensorflow/compiler/tests:all_files",
"//tensorflow/compiler/tf2xla:all_files",
"//tensorflow/compiler/tf2xla/cc:all_files",
diff --git a/tensorflow/c/c_api.h b/tensorflow/c/c_api.h
index 0c6bb53d01..1e8bfdc7b0 100644
--- a/tensorflow/c/c_api.h
+++ b/tensorflow/c/c_api.h
@@ -1153,7 +1153,7 @@ TF_CAPI_EXPORT extern TF_Function* TF_FunctionImportFunctionDef(
const void* proto, size_t proto_len, TF_Status* status);
// Sets function attribute named `attr_name` to value stored in `proto`.
-// If this attribute is already set to another value, it is overriden.
+// If this attribute is already set to another value, it is overridden.
// `proto` should point to a sequence of bytes of length `proto_len`
// representing a binary serialization of an AttrValue protocol
// buffer.
diff --git a/tensorflow/cc/gradients/math_grad.cc b/tensorflow/cc/gradients/math_grad.cc
index 2417bf18a9..d7446b9560 100644
--- a/tensorflow/cc/gradients/math_grad.cc
+++ b/tensorflow/cc/gradients/math_grad.cc
@@ -13,6 +13,9 @@ See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
+#define _USE_MATH_DEFINES
+#include <cmath>
+
#include "tensorflow/cc/ops/array_ops_internal.h"
#include "tensorflow/cc/ops/math_ops_internal.h"
#include "tensorflow/cc/ops/standard_ops.h"
@@ -200,8 +203,8 @@ Status TanhGrad(const Scope& scope, const Operation& op,
// evaluated.
Scope grad_scope = scope.WithControlDependencies(grad);
auto y = ConjugateHelper(grad_scope, op.output(0));
- grad_outputs->push_back(internal::TanhGrad(scope, y, grad));
- return scope.status();
+ grad_outputs->push_back(internal::TanhGrad(grad_scope, y, grad));
+ return grad_scope.status();
}
REGISTER_GRADIENT_OP("Tanh", TanhGrad);
@@ -256,8 +259,8 @@ Status SigmoidGrad(const Scope& scope, const Operation& op,
// evaluated.
Scope grad_scope = scope.WithControlDependencies(grad);
auto y = ConjugateHelper(grad_scope, op.output(0));
- grad_outputs->push_back(internal::SigmoidGrad(scope, y, grad));
- return scope.status();
+ grad_outputs->push_back(internal::SigmoidGrad(grad_scope, y, grad));
+ return grad_scope.status();
}
REGISTER_GRADIENT_OP("Sigmoid", SigmoidGrad);
@@ -696,15 +699,32 @@ Status MeanGrad(const Scope& scope, const Operation& op,
}
REGISTER_GRADIENT_OP("Mean", MeanGrad);
+Status ErfGrad(const Scope& scope, const Operation& op,
+ const std::vector<Output>& grad_inputs,
+ std::vector<Output>* grad_outputs) {
+ auto grad = grad_inputs[0];
+ auto two_over_root_pi = Cast(scope, Const(scope, 2 / std::sqrt(M_PI)),
+ grad.type());
+ Scope grad_scope = scope.WithControlDependencies(grad);
+ auto x = ConjugateHelper(grad_scope, op.input(0));
+ // grad * 2/sqrt(pi) * exp(-x**2)
+ auto dx = Mul(grad_scope,
+ Mul(grad_scope, grad, two_over_root_pi),
+ Exp(grad_scope, Neg(grad_scope, Square(grad_scope, x))));
+ grad_outputs->push_back(dx);
+ return grad_scope.status();
+}
+REGISTER_GRADIENT_OP("Erf", ErfGrad);
+
Status LgammaGrad(const Scope& scope, const Operation& op,
const std::vector<Output>& grad_inputs,
std::vector<Output>* grad_outputs) {
auto grad = grad_inputs[0];
Scope grad_scope = scope.WithControlDependencies(grad);
auto x = ConjugateHelper(grad_scope, op.input(0));
- auto dx = Mul(scope, grad, Digamma(scope, x));
+ auto dx = Mul(grad_scope, grad, Digamma(grad_scope, x));
grad_outputs->push_back(dx);
- return scope.status();
+ return grad_scope.status();
}
REGISTER_GRADIENT_OP("Lgamma", LgammaGrad);
diff --git a/tensorflow/cc/gradients/math_grad_test.cc b/tensorflow/cc/gradients/math_grad_test.cc
index a174f223ad..6313f41da5 100644
--- a/tensorflow/cc/gradients/math_grad_test.cc
+++ b/tensorflow/cc/gradients/math_grad_test.cc
@@ -64,7 +64,9 @@ class CWiseUnaryGradTest : public ::testing::Test {
IMAG,
CONJ,
COMPLEX,
- ANGLE
+ ANGLE,
+ LGAMMA,
+ ERF
};
template <typename X_T, typename Y_T>
@@ -168,6 +170,12 @@ class CWiseUnaryGradTest : public ::testing::Test {
case ANGLE:
y = Angle(scope_, x);
break;
+ case LGAMMA:
+ y = Lgamma(scope_, x);
+ break;
+ case ERF:
+ y = Erf(scope_, x);
+ break;
}
float max_error;
@@ -503,6 +511,42 @@ TEST_F(CWiseUnaryGradTest, Angle) {
TestCWiseGrad<complex64, float>(ANGLE, x_fn);
}
+TEST_F(CWiseUnaryGradTest, Lgamma) {
+ auto x_fn = [this](const int i) {
+ return RV({-3.5, -2.5, -1.5, 1.0, 2.0, 3.5});
+ };
+ TestCWiseGrad<float, float>(LGAMMA, x_fn);
+}
+
+TEST_F(CWiseUnaryGradTest, Lgamma_Complex) {
+ auto x_fn = [this](const int i) {
+ return CRV({{-3.5, 0.5}, {-1.5, -0.5}, {1.5, -1.0}, {3.5, 1.0}});
+ };
+ // TODO(kbsriram)
+ // Add test when the lgamma kernel supports complex numbers
+ if (false) {
+ TestCWiseGrad<complex64, complex64>(LGAMMA, x_fn);
+ }
+}
+
+TEST_F(CWiseUnaryGradTest, Erf) {
+ auto x_fn = [this](const int i) {
+ return RV({-1.2, -1.0, -0.5, 0.3, 0.5, 1.3});
+ };
+ TestCWiseGrad<float, float>(ERF, x_fn);
+}
+
+TEST_F(CWiseUnaryGradTest, Erf_Complex) {
+ auto x_fn = [this](const int i) {
+ return CRV({{-1.2, 0.5}, {-0.5, -0.5}, {0.5, 0.5}, {1.2, -0.5}});
+ };
+ // TODO(kbsriram)
+ // Add test when the erf kernel supports complex numbers
+ if (false) {
+ TestCWiseGrad<complex64, complex64>(ERF, x_fn);
+ }
+}
+
class MathGradTest : public ::testing::Test {
protected:
MathGradTest() : root_(Scope::NewRootScope().WithDevice("/cpu:0")) {}
@@ -821,17 +865,5 @@ TEST_F(NaryGradTest, Minimum) {
RunTest(x, x_init_value, y, shape);
}
-TEST_F(NaryGradTest, Lgamma) {
- TensorShape shape({3, 2});
- auto x = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(shape));
- auto y = Lgamma(scope_, x);
- // Select values to avoid instability when computing finite differences.
- // Ref: https://en.wikipedia.org/wiki/File:Gamma_plot.svg
- Tensor x_init_value =
- test::AsTensor<float>({-3.5f, -2.5f, -1.5f, 1.0f, 2.0f, 3.5f}, {3, 2});
- RunTest(x, x_init_value, y, shape);
- // TODO(suharshs): add test case for complex values
-}
-
} // namespace
} // namespace tensorflow
diff --git a/tensorflow/compiler/jit/BUILD b/tensorflow/compiler/jit/BUILD
index bf63b7e501..bf7d9cf14d 100644
--- a/tensorflow/compiler/jit/BUILD
+++ b/tensorflow/compiler/jit/BUILD
@@ -33,6 +33,7 @@ cc_library(
deps = [
":xla_cpu_device",
":xla_cpu_jit",
+ "//tensorflow/compiler/plugin",
] + if_cuda_is_configured([
":xla_gpu_device",
":xla_gpu_jit",
diff --git a/tensorflow/compiler/plugin/BUILD b/tensorflow/compiler/plugin/BUILD
new file mode 100644
index 0000000000..c1edf2448c
--- /dev/null
+++ b/tensorflow/compiler/plugin/BUILD
@@ -0,0 +1,56 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Configuration file for an XLA plugin.
+
+ please don't check in changes to this file. to prevent changes appearing
+ in git status, use:
+
+ git update-index --assume-unchanged tensorflow/compiler/plugin/BUILD
+
+ To add additional devices to the XLA subsystem, add targets to the
+ dependency list in the 'plugin' target. For instance:
+
+ deps = ["//tensorflow/compiler/plugin/example:plugin_lib"],
+
+ ** Please don't remove this file - it is supporting some 3rd party plugins **
+"""
+
+licenses(["notice"])
+
+package(
+ default_visibility = ["//visibility:public"],
+)
+
+cc_library(
+ name = "plugin",
+ deps = [
+ #"//tensorflow/compiler/plugin/example:example_lib",
+ ],
+)
+
+#-----------------------------------------------------------------------------
+
+filegroup(
+ name = "all_files",
+ srcs = glob(
+ ["**/*"],
+ exclude = [
+ "**/METADATA",
+ "**/OWNERS",
+ ],
+ ),
+ visibility = ["//tensorflow:__subpackages__"],
+)
diff --git a/tensorflow/compiler/plugin/README.md b/tensorflow/compiler/plugin/README.md
new file mode 100644
index 0000000000..9dd0d2bdab
--- /dev/null
+++ b/tensorflow/compiler/plugin/README.md
@@ -0,0 +1,16 @@
+3rd party XLA devices
+---------------------
+
+This directory is intended as a place for 3rd party XLA devices which are _not_
+integrated into the public repository.
+
+By adding entries to the BUILD target in this directory, a third party device
+can be included as a dependency of the JIT subsystem.
+
+For integration into the unit test system, see the files:
+
+- tensorflow/compiler/tests/plugin.bzl
+- tensorflow/compiler/xla/tests/plugin.bzl
+
+
+-
diff --git a/tensorflow/compiler/xla/service/hlo_computation_test.cc b/tensorflow/compiler/xla/service/hlo_computation_test.cc
index ccab7bf348..7b7588f4ba 100644
--- a/tensorflow/compiler/xla/service/hlo_computation_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_computation_test.cc
@@ -310,7 +310,7 @@ TEST_F(HloComputationTest, DeepCopyArrayAtIndices) {
}
TEST_F(HloComputationTest, DeepCopyTupleAtIndices) {
- // Test that DeepCopyInstruction properly copies elements of a a tuple as
+ // Test that DeepCopyInstruction properly copies elements of a tuple as
// specified by the given indices.
auto builder = HloComputation::Builder(TestName());
auto constant1 = builder.AddInstruction(HloInstruction::CreateConstant(
diff --git a/tensorflow/compiler/xla/service/inliner.cc b/tensorflow/compiler/xla/service/inliner.cc
index 0682434bfb..6ea0f127d5 100644
--- a/tensorflow/compiler/xla/service/inliner.cc
+++ b/tensorflow/compiler/xla/service/inliner.cc
@@ -90,8 +90,12 @@ Status InlinerVisitor::HandleMap(
// different than the map shape. Hence, a broadcast is needed, else the
// cloned operand with new shape and operands work.
if (root.opcode() != HloOpcode::kConstant) {
+ std::vector<HloInstruction*> params;
+ for (int64 o = 0; o < root.operands().size(); o++) {
+ params.push_back(operands[root.operand(o)->parameter_number()]);
+ }
HloInstruction* placed_instruction = computation_->AddInstruction(
- root.CloneWithNewOperands(map->shape(), operands));
+ root.CloneWithNewOperands(map->shape(), params));
TF_RETURN_IF_ERROR(
computation_->ReplaceInstruction(map, placed_instruction));
} else {
diff --git a/tensorflow/compiler/xla/service/inliner_test.cc b/tensorflow/compiler/xla/service/inliner_test.cc
index 9d845c5545..7aa1c7c835 100644
--- a/tensorflow/compiler/xla/service/inliner_test.cc
+++ b/tensorflow/compiler/xla/service/inliner_test.cc
@@ -108,5 +108,44 @@ TEST_F(InlinerTest, MapConstant) {
LiteralTestUtil::ExpectEqual(*result, *expected);
}
+TEST_F(InlinerTest, MapSubtractOppositeOrder) {
+ Shape r0f32 = ShapeUtil::MakeShape(F32, {});
+
+ // Note that the parameter ordinals are in the opposite order to their
+ // position as operands
+ auto max_builder = HloComputation::Builder(TestName());
+ auto param1 = max_builder.AddInstruction(
+ HloInstruction::CreateParameter(1, r0f32, "x"));
+ auto param2 = max_builder.AddInstruction(
+ HloInstruction::CreateParameter(0, r0f32, "y"));
+ max_builder.AddInstruction(HloInstruction::CreateBinary(
+ param1->shape(), HloOpcode::kSubtract, param1, param2));
+ auto max_f32 = max_builder.Build();
+
+ auto builder = HloComputation::Builder("MapSubFunction");
+ auto lhs = builder.AddInstruction(
+ HloInstruction::CreateConstant(Literal::CreateR1<float>({1, 2, 3, 4})));
+ auto rhs = builder.AddInstruction(
+ HloInstruction::CreateConstant(Literal::CreateR1<float>({4, 3, 2, 1})));
+ builder.AddInstruction(
+ HloInstruction::CreateMap(lhs->shape(), {lhs, rhs}, max_f32.get()));
+
+ auto computation = builder.Build();
+ auto hlo_module = CreateNewModule();
+ hlo_module->AddEmbeddedComputation(std::move(max_f32));
+ hlo_module->AddEntryComputation(std::move(computation));
+
+ Inliner inliner;
+ EXPECT_TRUE(inliner.Run(hlo_module.get()).ValueOrDie());
+ EXPECT_THAT(hlo_module->entry_computation()->root_instruction(),
+ op::Subtract(rhs, lhs));
+
+ // Verify execution on CPU.
+ auto result = ExecuteAndTransfer(std::move(hlo_module), {});
+ auto expected = Literal::CreateR1<float>({3, 1, -1, -3});
+ LiteralTestUtil::ExpectEqual(*result, *expected);
+}
+
+
} // namespace
} // namespace xla
diff --git a/tensorflow/contrib/all_reduce/python/all_reduce.py b/tensorflow/contrib/all_reduce/python/all_reduce.py
index 22d7633ce2..a5057da9fd 100644
--- a/tensorflow/contrib/all_reduce/python/all_reduce.py
+++ b/tensorflow/contrib/all_reduce/python/all_reduce.py
@@ -191,7 +191,7 @@ def _ragged_split(tensor, pieces):
def _ring_permutations(num_workers, num_subchunks, gpu_perm):
- """"Generate an array of device index arrays, one for for each subchunk.
+ """"Generate an array of device index arrays, one for each subchunk.
In the basic ring reduction algorithm there are size(T)/num_devices
data chunks and each device process one chunk per tick, i.e. sending
diff --git a/tensorflow/contrib/boosted_trees/README.md b/tensorflow/contrib/boosted_trees/README.md
index 9ce700f1a1..7d30032e53 100644
--- a/tensorflow/contrib/boosted_trees/README.md
+++ b/tensorflow/contrib/boosted_trees/README.md
@@ -1,7 +1,7 @@
# TF Boosted Trees (TFBT)
TF Boosted trees is an implementation of a gradient boosting algorithm with
-trees used as week learners.
+trees used as weak learners.
## Examples
Folder "examples" demonstrates how TFBT estimators can be used for various
diff --git a/tensorflow/contrib/boosted_trees/examples/binary_mnist.py b/tensorflow/contrib/boosted_trees/examples/binary_mnist.py
index c003b1de66..47ee3d816f 100644
--- a/tensorflow/contrib/boosted_trees/examples/binary_mnist.py
+++ b/tensorflow/contrib/boosted_trees/examples/binary_mnist.py
@@ -21,7 +21,7 @@ r"""Demonstrates multiclass MNIST TF Boosted trees example.
python tensorflow/contrib/boosted_trees/examples/binary_mnist.py \
--output_dir="/tmp/binary_mnist" --depth=4 --learning_rate=0.3 \
--batch_size=10761 --examples_per_layer=10761 --eval_batch_size=1030 \
- --num_eval_steps=1 --num_trees=10 --l2=1 --vmodule=training_ops=1 \
+ --num_eval_steps=1 --num_trees=10 --l2=1 --vmodule=training_ops=1
When training is done, accuracy on eval data is reported. Point tensorboard
to the directory for the run to see how the training progresses:
diff --git a/tensorflow/contrib/boosted_trees/examples/mnist.py b/tensorflow/contrib/boosted_trees/examples/mnist.py
index 0539d77720..817c6eb3e1 100644
--- a/tensorflow/contrib/boosted_trees/examples/mnist.py
+++ b/tensorflow/contrib/boosted_trees/examples/mnist.py
@@ -22,7 +22,7 @@ r"""Demonstrates multiclass MNIST TF Boosted trees example.
python tensorflow/contrib/boosted_trees/examples/mnist.py \
--output_dir="/tmp/mnist" --depth=4 --learning_rate=0.3 --batch_size=60000 \
--examples_per_layer=60000 --eval_batch_size=10000 --num_eval_steps=1 \
- --num_trees=10 --l2=1 --vmodule=training_ops=1 \
+ --num_trees=10 --l2=1 --vmodule=training_ops=1
When training is done, accuracy on eval data is reported. Point tensorboard
to the directory for the run to see how the training progresses:
diff --git a/tensorflow/contrib/cmake/external/cub.cmake b/tensorflow/contrib/cmake/external/cub.cmake
index e03026b1b0..8368898955 100644
--- a/tensorflow/contrib/cmake/external/cub.cmake
+++ b/tensorflow/contrib/cmake/external/cub.cmake
@@ -14,8 +14,8 @@
# ==============================================================================
include (ExternalProject)
-set(cub_URL https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.7.3.zip)
-set(cub_HASH SHA256=b7ead9e291d34ffa8074243541c1380d63be63f88de23de8ee548db573b72ebe)
+set(cub_URL https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.7.4.zip)
+set(cub_HASH SHA256=20a1a39fd97e5da7f40f5f2e7fd73fd2ea59f9dc4bb8a6c5f228aa543e727e31)
set(cub_BUILD ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
set(cub_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
set(cub_ARCHIVE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/cub_archive)
diff --git a/tensorflow/contrib/cmake/external/protobuf.cmake b/tensorflow/contrib/cmake/external/protobuf.cmake
index d600d8c3c0..1e300e21df 100644
--- a/tensorflow/contrib/cmake/external/protobuf.cmake
+++ b/tensorflow/contrib/cmake/external/protobuf.cmake
@@ -15,8 +15,8 @@
include (ExternalProject)
set(PROTOBUF_INCLUDE_DIRS ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/src)
-set(PROTOBUF_URL https://github.com/mrry/protobuf.git) # Includes MSVC fix.
-set(PROTOBUF_TAG 1d2c7b6c7376f396c8c7dd9b6afd2d4f83f3cb05)
+set(PROTOBUF_URL https://github.com/google/protobuf.git)
+set(PROTOBUF_TAG b04e5cba356212e4e8c66c61bbe0c3a20537c5b9)
if(WIN32)
set(protobuf_STATIC_LIBRARIES
diff --git a/tensorflow/contrib/cmake/tf_core_kernels.cmake b/tensorflow/contrib/cmake/tf_core_kernels.cmake
index 46c680aad5..65565aad7e 100644
--- a/tensorflow/contrib/cmake/tf_core_kernels.cmake
+++ b/tensorflow/contrib/cmake/tf_core_kernels.cmake
@@ -33,6 +33,8 @@ else(tensorflow_BUILD_ALL_KERNELS)
"${tensorflow_source_dir}/tensorflow/core/kernels/matmul_op.cc"
"${tensorflow_source_dir}/tensorflow/core/kernels/no_op.h"
"${tensorflow_source_dir}/tensorflow/core/kernels/no_op.cc"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/ops_util.h"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/ops_util.cc"
"${tensorflow_source_dir}/tensorflow/core/kernels/sendrecv_ops.h"
"${tensorflow_source_dir}/tensorflow/core/kernels/sendrecv_ops.cc"
)
@@ -65,6 +67,8 @@ if(tensorflow_BUILD_CONTRIB_KERNELS)
"${tensorflow_source_dir}/tensorflow/contrib/boosted_trees/ops/split_handler_ops.cc"
"${tensorflow_source_dir}/tensorflow/contrib/boosted_trees/ops/stats_accumulator_ops.cc"
"${tensorflow_source_dir}/tensorflow/contrib/boosted_trees/ops/training_ops.cc"
+ "${tensorflow_source_dir}/tensorflow/contrib/cudnn_rnn/kernels/cudnn_rnn_ops.cc"
+ "${tensorflow_source_dir}/tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops.cc"
"${tensorflow_source_dir}/tensorflow/contrib/factorization/kernels/clustering_ops.cc"
"${tensorflow_source_dir}/tensorflow/contrib/factorization/kernels/masked_matmul_ops.cc"
"${tensorflow_source_dir}/tensorflow/contrib/factorization/kernels/wals_solver_ops.cc"
diff --git a/tensorflow/contrib/cmake/tf_tests.cmake b/tensorflow/contrib/cmake/tf_tests.cmake
index 1d58b1d416..ac55b9ea92 100644
--- a/tensorflow/contrib/cmake/tf_tests.cmake
+++ b/tensorflow/contrib/cmake/tf_tests.cmake
@@ -179,6 +179,9 @@ if (tensorflow_BUILD_PYTHON_TESTS)
# exclude the ones we don't want
set(tf_test_src_py_exclude
+ # generally excluded
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/__init__.py"
+
# Python source line inspection tests are flaky on Windows (b/36375074).
"${tensorflow_source_dir}/tensorflow/python/debug/cli/analyzer_cli_test.py"
"${tensorflow_source_dir}/tensorflow/python/debug/cli/profile_analyzer_cli_test.py"
@@ -188,19 +191,16 @@ if (tensorflow_BUILD_PYTHON_TESTS)
"${tensorflow_source_dir}/tensorflow/python/debug/lib/dist_session_debug_grpc_test.py"
"${tensorflow_source_dir}/tensorflow/python/debug/lib/session_debug_grpc_test.py"
# generally not working
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/__init__.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/benchmark_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/resource_variable_ops_test.py"
"${tensorflow_source_dir}/tensorflow/python/profiler/pprof_profiler_test.py"
# flaky test
"${tensorflow_source_dir}/tensorflow/python/profiler/internal/run_metadata_test.py"
+ # Fails because uses data dependencies with bazel
"${tensorflow_source_dir}/tensorflow/python/saved_model/saved_model_test.py"
# requires scipy
"${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/preprocessing/*_test.py"
"${tensorflow_source_dir}/tensorflow/contrib/tfprof/python/tools/tfprof/pprof_profiler_test.py"
- # flaky tests
+ # Takes very long to run without sharding (defined in bazel build file).
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/cwise_ops_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/tfprof/python/tools/tfprof/internal/run_metadata_test.py"
# Loading resources in contrib doesn't seem to work on Windows
"${tensorflow_source_dir}/tensorflow/contrib/tensor_forest/client/random_forest_test.py"
"${tensorflow_source_dir}/tensorflow/contrib/tensor_forest/python/tensor_forest_test.py"
@@ -213,47 +213,57 @@ if (tensorflow_BUILD_PYTHON_TESTS)
if (WIN32)
set(tf_test_src_py_exclude
${tf_test_src_py_exclude}
- # generally excluded
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/__init__.py"
-
# TODO: failing tests.
# Nothing critical in here but should get this list down to []
# The failing list is grouped by failure source
+
# stl on windows handles overflows different
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/as_string_op_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/cast_op_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/string_to_number_op_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/clip_ops_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/tensor_array_ops_test.py" # Needs portpicker.
- # Matrix_set_diag failing on GPU on windows.
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/cholesky_op_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/diag_op_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/linalg_ops_test.py"
- "${tensorflow_source_dir}/tensorflow/python/ops/init_ops.py"
+ # Numerical issues, calculations off.
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/concat_op_test.py"
+ "${tensorflow_source_dir}/tensorflow/contrib/factorization/python/ops/wals_test.py"
+ # Float division by zero
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/benchmark_test.py"
+ # Flaky, for unknown reasons. Cannot reproduce in terminal. Revisit once we can get stack traces.
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/batch_matmul_op_test.py"
+ # Flaky because of local cluster creation.
+ "${tensorflow_source_dir}/tensorflow/python/training/sync_replicas_optimizer_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/debug/lib/session_debug_grpc_test.py"
+ "${tensorflow_source_dir}tensorflow/python/training/localhost_cluster_performance_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/iterator_ops_cluster_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/functional_ops_test.py"
+ "${tensorflow_source_dir}/tensorflow/contrib/data/python/kernel_tests/iterator_ops_cluster_test.py"
+ # Type error in testRemoteIteratorUsingRemoteCallOpDirectSessionGPUCPU.
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/iterator_ops_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/self_adjoint_eig_op_test.py"
- # misc
+ "${tensorflow_source_dir}/tensorflow/contrib/data/python/kernel_tests/iterator_ops_test.py"
+ # IteratorGetMax OutOfRangeError
"${tensorflow_source_dir}/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/variable_scope_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/reshape_op_test.py"
- "${tensorflow_source_dir}/tensorflow/python/training/evaluation_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/neon_depthwise_conv_op_test.py" # Depends on gemmlowp -> pthread.
+ # Depends on gemmlowp -> pthread
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/neon_depthwise_conv_op_test.py"
# int32/int64 mixup
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/cast_op_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/variable_scope_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/functional_ops_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/py_func_test.py"
+ # Windows file management related issues.
+ "${tensorflow_source_dir}/tensorflow/python/training/evaluation_test.py"
# training tests
"${tensorflow_source_dir}/tensorflow/python/training/basic_session_run_hooks_test.py" # Needs tf.contrib fix.
"${tensorflow_source_dir}/tensorflow/python/training/localhost_cluster_performance_test.py" # Needs portpicker.
"${tensorflow_source_dir}/tensorflow/python/training/quantize_training_test.py" # Needs quantization ops to be included in windows.
"${tensorflow_source_dir}/tensorflow/python/training/supervisor_test.py" # Flaky I/O error on rename.
- "${tensorflow_source_dir}/tensorflow/python/training/sync_replicas_optimizer_test.py" # Needs portpicker.
"${tensorflow_source_dir}/tensorflow/python/training/server_lib_test.py" # Test occasionally deadlocks.
- "${tensorflow_source_dir}/tensorflow/python/debug/lib/session_debug_multi_gpu_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/debug/lib/session_debug_multi_gpu_test.py" # Fails on multiple GPUs.
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/concat_op_test.py" # numerical issues
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/linalg_grad_test.py" # cudaSolver handle creation fails.
-
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/array_ops_test.py" # depends on python/framework/test_ops
# Dataset tests
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/dataset_constructor_op_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/dataset_constructor_op_test.py" # Segfaults on windows
+ "${tensorflow_source_dir}/tensorflow/contrib/data/python/kernel_tests/dataset_constructor_op_test.py" # Segfaults on Windows.
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/iterator_ops_cluster_test.py"
# Broken tensorboard test due to cmake issues.
"${tensorflow_source_dir}/tensorflow/contrib/data/python/kernel_tests/iterator_ops_cluster_test.py" # Needs portpicker
@@ -264,8 +274,6 @@ if (tensorflow_BUILD_PYTHON_TESTS)
"${tensorflow_source_dir}/tensorflow/contrib/tensor_forest/python/kernel_tests/scatter_add_ndim_op_test.py" # Bad placement.
"${tensorflow_source_dir}/tensorflow/contrib/tensor_forest/python/topn_test.py" # Results inaccurate
"${tensorflow_source_dir}/tensorflow/python/ops/cloud/bigquery_reader_ops_test.py" # No libcurl support
- # Newly running on Windows since TensorBoard backend move. Fail on Windows and need debug.
- "${tensorflow_source_dir}/tensorflow/contrib/data/python/kernel_tests/dataset_constructor_op_test.py" # Segfaults on Windows.
# Dask.Dataframe bugs on Window Build
"${tensorflow_source_dir}/tensorflow/contrib/learn/python/learn/tests/dataframe/tensorflow_dataframe_test.py"
"${tensorflow_source_dir}/tensorflow/contrib/learn/python/learn/learn_io/data_feeder_test.py"
@@ -274,37 +282,15 @@ if (tensorflow_BUILD_PYTHON_TESTS)
# Need extra build
"${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/conditional_distribution_test.py"
"${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/conditional_transformed_distribution_test.py"
+ "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/estimator_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/array_ops_test.py" # depends on python/framework/test_ops
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/depthtospace_op_test.py" # QuantizeV2
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/spacetodepth_op_test.py" # QuantizeV2
# Windows Path
"${tensorflow_source_dir}/tensorflow/contrib/framework/python/ops/checkpoint_ops_test.py" #TODO: Fix path
- "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/models_test.py"
- # Related to Windows Multiprocessing https://github.com/fchollet/keras/issues/5071
- "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/engine/training_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/utils/data_utils_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/callbacks_test.py"
- # Scipy needed
- "${tensorflow_source_dir}/tensorflow/contrib/keras/python/keras/preprocessing/image_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sigmoid_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/binomial_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/chi2_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/geometric_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/inverse_gamma_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/logistic_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/mixture_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/mvn_diag_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/mvn_full_covariance_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/mvn_tril_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/negative_binomial_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/poisson_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/quantized_distribution_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/relaxed_bernoulli_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/relaxed_onehot_categorical_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/transformed_distribution_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/vector_student_t_test.py"
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/wishart_test.py"
"${tensorflow_source_dir}/tensorflow/contrib/factorization/python/ops/kmeans_test.py"
"${tensorflow_source_dir}/tensorflow/contrib/learn/python/learn/estimators/kmeans_test.py"
- # Failing with TF 1.3 (TODO)
- "${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/estimator_test.py"
+ # Numpy upgrade needed?
"${tensorflow_source_dir}/tensorflow/contrib/distributions/python/kernel_tests/bijectors/sinh_arcsinh_test.py"
# Test should only be run manually
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/reduction_ops_test_big.py"
diff --git a/tensorflow/contrib/cudnn_rnn/python/kernel_tests/cudnn_rnn_test.py b/tensorflow/contrib/cudnn_rnn/python/kernel_tests/cudnn_rnn_test.py
index 9e627bcaf4..1ce8954bb0 100644
--- a/tensorflow/contrib/cudnn_rnn/python/kernel_tests/cudnn_rnn_test.py
+++ b/tensorflow/contrib/cudnn_rnn/python/kernel_tests/cudnn_rnn_test.py
@@ -385,7 +385,7 @@ class CudnnRNNTestSaveRestore(TensorFlowTestCase):
reset_op = state_ops.assign(
opaque_params,
array_ops.zeros(array_ops.shape(opaque_params), dtype=dtype))
- # Passing graph explictly, otherwise an old sess would be reused.
+ # Passing graph explicitly, otherwise an old sess would be reused.
with self.test_session(use_gpu=True, graph=g) as sess:
sess.run(variables.global_variables_initializer())
val = saver.save(sess, save_path)
@@ -436,7 +436,7 @@ class CudnnRNNTestSaveRestore(TensorFlowTestCase):
save_path = os.path.join(self.get_temp_dir(),
"save-restore-variable-test2")
saver = saver_lib.Saver()
- # Passing graph explictly, otherwise an old sess would be reused.
+ # Passing graph explicitly, otherwise an old sess would be reused.
with self.test_session(use_gpu=True, graph=g) as sess:
sess.run(variables.global_variables_initializer())
val = saver.save(sess, save_path)
@@ -484,7 +484,7 @@ class CudnnRNNTestSaveRestore(TensorFlowTestCase):
array_ops.zeros(
array_ops.shape(rnn.trainable_variables[0]), dtype=dtype))
- # Passing graph explictly, otherwise an old sess would be reused.
+ # Passing graph explicitly, otherwise an old sess would be reused.
with self.test_session(use_gpu=True, graph=g) as sess:
sess.run(variables.global_variables_initializer())
inputs, initial_state = model.SynthesizeInput(seq_length, batch_size)
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index 96447abd7c..5339ebb689 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -11,6 +11,9 @@ py_test(
size = "small",
srcs = ["batch_dataset_op_test.py"],
srcs_version = "PY2AND3",
+ tags = [
+ "manual", # b/67958604
+ ],
deps = [
"//tensorflow/contrib/data/python/ops:dataset_ops",
"//tensorflow/contrib/data/python/ops:transformation_ops",
@@ -358,6 +361,9 @@ py_test(
size = "small",
srcs = ["sloppy_transformation_dataset_op_test.py"],
srcs_version = "PY2AND3",
+ tags = [
+ "manual", # b/67958761
+ ],
deps = [
"//tensorflow/contrib/data/python/ops:dataset_ops",
"//tensorflow/contrib/data/python/ops:transformation_ops",
diff --git a/tensorflow/contrib/framework/BUILD b/tensorflow/contrib/framework/BUILD
index 6b0599ddd2..dd882acb8e 100644
--- a/tensorflow/contrib/framework/BUILD
+++ b/tensorflow/contrib/framework/BUILD
@@ -10,9 +10,8 @@ package(default_visibility = [
"//tensorflow:__subpackages__",
])
-load("//tensorflow:tensorflow.bzl", "cuda_py_test")
-load("//tensorflow:tensorflow.bzl", "tf_custom_op_py_library")
load("//tensorflow:tensorflow.bzl", "py_test")
+load("//tensorflow:tensorflow.bzl", "tf_custom_op_py_library")
load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")
load("//tensorflow:tensorflow.bzl", "tf_gen_op_wrapper_py")
load("//tensorflow:tensorflow.bzl", "tf_gen_op_libs")
@@ -27,6 +26,7 @@ tf_custom_op_py_library(
"python/framework/experimental.py",
"python/framework/tensor_util.py",
"python/ops/__init__.py",
+ "python/ops/accumulate_n_v2.py",
"python/ops/arg_scope.py",
"python/ops/audio_ops.py",
"python/ops/checkpoint_ops.py",
@@ -150,6 +150,31 @@ py_test(
)
py_test(
+ name = "accumulate_n_v2_test",
+ size = "small",
+ srcs = ["python/ops/accumulate_n_v2_test.py"],
+ srcs_version = "PY2AND3",
+ deps = [
+ ":framework_py",
+ "//tensorflow/python:client_testlib",
+ "//tensorflow/python:framework_for_generated_wrappers",
+ ],
+)
+
+py_test(
+ name = "accumulate_n_v2_eager_test",
+ size = "small",
+ srcs = ["python/ops/accumulate_n_v2_eager_test.py"],
+ srcs_version = "PY2AND3",
+ deps = [
+ ":framework_py",
+ "//tensorflow/python:client_testlib",
+ "//tensorflow/python:framework_for_generated_wrappers",
+ "//tensorflow/python/eager:backprop",
+ ],
+)
+
+py_test(
name = "ops_test",
size = "small",
srcs = ["python/ops/ops_test.py"],
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py b/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py
new file mode 100644
index 0000000000..a0667bd489
--- /dev/null
+++ b/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py
@@ -0,0 +1,111 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Ops that will eventually be folded into tensorflow/python/ops/math_ops.py
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+from tensorflow.python.eager import context
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import gen_math_ops
+from tensorflow.python.ops import math_ops
+
+
+
+def accumulate_n_v2(inputs, shape=None, tensor_dtype=None, name=None):
+ """Returns the element-wise sum of a list of tensors.
+
+ Optionally, pass `shape` and `tensor_dtype` for shape and type checking,
+ otherwise, these are inferred.
+
+ `tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
+ wait for all of its inputs to be ready before beginning to sum. This can
+ save memory if inputs are ready at different times, since minimum temporary
+ storage is proportional to the output size rather than the inputs size.
+
+ Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
+
+ For example:
+
+ ```python
+ a = tf.constant([[1, 2], [3, 4]])
+ b = tf.constant([[5, 0], [0, 6]])
+ tf.accumulate_n_v2([a, b, a]) # [[7, 4], [6, 14]]
+
+ # Explicitly pass shape and type
+ tf.accumulate_n_v2([a, b, a], shape=[2, 2], tensor_dtype=tf.int32)
+ # [[7, 4],
+ # [6, 14]]
+ ```
+
+ Args:
+ inputs: A list of `Tensor` objects, each with same shape and type.
+ shape: Shape of elements of `inputs`.
+ tensor_dtype: The type of `inputs`.
+ name: A name for the operation (optional).
+
+ Returns:
+ A `Tensor` of same shape and type as the elements of `inputs`.
+
+ Raises:
+ ValueError: If `inputs` don't all have same shape and dtype or the shape
+ cannot be inferred.
+ """
+ _INPUTS_ERR_MSG = ValueError("inputs must be a list of at least one Tensor"
+ "with the same dtype and shape")
+ if not inputs or not isinstance(inputs, (list, tuple)):
+ raise _INPUTS_ERR_MSG
+ inputs = ops.convert_n_to_tensor_or_indexed_slices(inputs)
+ if not all(isinstance(x, ops.Tensor) for x in inputs):
+ raise _INPUTS_ERR_MSG
+ if not all(x.dtype == inputs[0].dtype for x in inputs):
+ raise _INPUTS_ERR_MSG
+ if shape is not None:
+ shape = tensor_shape.as_shape(shape)
+ else:
+ shape = tensor_shape.unknown_shape()
+ for input_tensor in inputs:
+ if isinstance(input_tensor, ops.Tensor):
+ shape = shape.merge_with(input_tensor.get_shape())
+
+ # tensor_dtype is for safety only; operator's output type computed in C++
+ if tensor_dtype is not None and tensor_dtype != inputs[0].dtype:
+ raise TypeError("tensor_dtype is {}, but input is of type {}"
+ .format(tensor_dtype, inputs[0].dtype))
+
+ if len(inputs) == 1 and name is None:
+ return inputs[0]
+ elif len(inputs) == 1 and name is not None:
+ return array_ops.identity(inputs[0], name=name)
+ elif context.in_eager_mode():
+ # TemporaryVariable not currently supported in eager mode; fall back
+ # onto AddN for now.
+ # TODO(frreiss) remove this once the lifetime of eager variables gets
+ # addressed
+ return math_ops.add_n(inputs, name=name)
+ else:
+ return gen_math_ops._accumulate_nv2(inputs, name=name, shape=shape)
+
+# The following code should eventually be merged into
+# tensorflow/python/ops/math_grad.py
+@ops.RegisterGradient("AccumulateNV2")
+def _AddNGrad(op, grad):
+ """Same as gradient for AddN. Copies the gradient to all inputs."""
+ # Not broadcasting.
+ return [grad] * len(op.inputs)
+
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py b/tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py
new file mode 100644
index 0000000000..c2229bb8ad
--- /dev/null
+++ b/tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py
@@ -0,0 +1,85 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for new version of accumulate_n op that will eventually go into
+`ops.math_ops`.
+
+These test cases specifically exercise the `eager` APIs. They need to be in a
+separate file from the remaining tests because eager mode is currently something
+you can turn on but can't turn off for the lifetime of the current process."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.framework.python.ops import accumulate_n_v2 as av2
+
+from tensorflow.python.eager import backprop
+from tensorflow.python.eager import context as eager_context
+from tensorflow.python.eager import tape
+
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes as dtypes_lib
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import gradients
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.platform import test
+
+
+
+class AccumulateNV2EagerTest(test_util.TensorFlowTestCase):
+ """Tests of the new, differentiable version of accumulate_n"""
+
+ def testMinimalEagerMode(self):
+ forty = constant_op.constant(40)
+ two = constant_op.constant(2)
+ answer = av2.accumulate_n_v2([forty, two])
+ self.assertEqual(42, answer.numpy())
+
+
+ def testFloat(self):
+ np.random.seed(12345)
+ x = [np.random.random((1, 2, 3, 4, 5)) - 0.5 for _ in range(5)]
+ tf_x = ops.convert_n_to_tensor(x)
+ with self.test_session(use_gpu=True):
+ self.assertAllClose(sum(x), av2.accumulate_n_v2(tf_x).numpy())
+ self.assertAllClose(x[0] * 5, av2.accumulate_n_v2([tf_x[0]] * 5).numpy())
+
+ def testGrad(self):
+ np.random.seed(42)
+ num_inputs = 3
+ input_vars = [
+ resource_variable_ops.ResourceVariable(10.0 * np.random.random(),
+ name="t%d" % i)
+ for i in range(0, num_inputs)
+ ]
+
+ def fn(first, second, third):
+ return av2.accumulate_n_v2([first, second, third])
+
+ grad_fn = backprop.gradients_function(fn)
+ grad = grad_fn(input_vars[0], input_vars[1], input_vars[2])
+ self.assertAllEqual(np.repeat(1.0, num_inputs), # d/dx (x + y + ...) = 1
+ [elem.numpy() for elem in grad])
+
+
+
+if __name__ == "__main__":
+ ops.enable_eager_execution()
+ test.main()
+
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py b/tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py
new file mode 100644
index 0000000000..3386e849d5
--- /dev/null
+++ b/tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py
@@ -0,0 +1,123 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for new version of accumulate_n op that will eventually go into
+`ops.math_ops`."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.framework.python.ops import accumulate_n_v2 as av2
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes as dtypes_lib
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import gradients
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import googletest
+
+
+
+class AccumulateNV2Test(test_util.TensorFlowTestCase):
+ """Tests of the new, differentiable version of accumulate_n"""
+
+ def testFloat(self):
+ np.random.seed(12345)
+ x = [np.random.random((1, 2, 3, 4, 5)) - 0.5 for _ in range(5)]
+ tf_x = ops.convert_n_to_tensor(x)
+ with self.test_session(use_gpu=True):
+ self.assertAllClose(sum(x), av2.accumulate_n_v2(tf_x).eval())
+ self.assertAllClose(x[0] * 5, av2.accumulate_n_v2([tf_x[0]] * 5).eval())
+
+ def testInt(self):
+ np.random.seed(54321)
+ x = [np.random.randint(-128, 128, (5, 4, 3, 2, 1)) for _ in range(6)]
+ tf_x = ops.convert_n_to_tensor(x)
+ with self.test_session(use_gpu=True):
+ self.assertAllEqual(sum(x), av2.accumulate_n_v2(tf_x).eval())
+ self.assertAllEqual(x[0] * 6, av2.accumulate_n_v2([tf_x[0]] * 6).eval())
+
+ def testGrad(self):
+ np.random.seed(42)
+ for num_inputs in range(1, 10):
+ with self.test_session(use_gpu=True) as sess:
+ input_vars = [
+ variables.Variable(10.0 * np.random.random())
+ for i in range(0, num_inputs)
+ ]
+ accum_n = av2.accumulate_n_v2(input_vars)
+ sess.run(variables.global_variables_initializer())
+ accum_n_grad = gradients.gradients(accum_n, input_vars)
+ self.assertAllEqual(np.repeat(1.0, num_inputs), # d/dx (x + y + ...) = 1
+ [g.eval() for g in accum_n_grad])
+
+ # The tests below used to be in a separate class under cwise_ops_test.py,
+ # which did not run in the default test target.
+ # Putting them here so that everything that exercises AccumulateNV2 is in
+ # one place and the default build runs all unit tests.
+ def testSimple(self):
+ with self.test_session():
+ random_arrays = [
+ np.random.rand(16, 16, 16, 16).astype(np.float32) for _ in range(20)
+ ]
+ random_tensors = [
+ ops.convert_to_tensor(
+ x, dtype=dtypes_lib.float32) for x in random_arrays
+ ]
+ tf_val = av2.accumulate_n_v2(random_tensors)
+ np_val = random_arrays[0]
+ for random_array in random_arrays[1:]:
+ np_val += random_array
+ self.assertAllClose(np_val, tf_val.eval())
+
+ def testZeroArgs(self):
+ with self.test_session():
+ with self.assertRaises(ValueError):
+ tf_val = av2.accumulate_n_v2([])
+ tf_val.eval()
+
+ def testWrongShape(self):
+ with self.test_session():
+ with self.assertRaises(ValueError):
+ a = variables.Variable(0.2)
+ b = variables.Variable(0.1)
+ tf_val = av2.accumulate_n_v2([a,b], shape=[2,2]) # Should be shape=[]
+
+ def testIncompatibleShapes(self):
+ with self.test_session():
+ with self.assertRaises(ValueError):
+ a = variables.Variable(np.array([0.1,0.2]))
+ b = variables.Variable(np.array([[0.3],[0.4]]))
+ tf_val = av2.accumulate_n_v2([a,b])
+
+ def testWrongType(self):
+ with self.test_session():
+ with self.assertRaises(TypeError):
+ a = variables.Variable(0.2, dtype=np.float32)
+ b = variables.Variable(0.1, dtype=np.float32)
+ tf_val = av2.accumulate_n_v2([a,b], tensor_dtype=np.int32)
+
+ def testWrongTypeOneInput(self):
+ # Scenario that used to trigger a bug, even when testWrongType() worked
+ with self.test_session():
+ with self.assertRaises(TypeError):
+ a = variables.Variable(0.2, dtype=np.float32)
+ tf_val = av2.accumulate_n_v2([a], tensor_dtype=np.int32)
+
+
+if __name__ == "__main__":
+ googletest.main()
diff --git a/tensorflow/contrib/image/__init__.py b/tensorflow/contrib/image/__init__.py
index 59a322d3ca..d030dffade 100755
--- a/tensorflow/contrib/image/__init__.py
+++ b/tensorflow/contrib/image/__init__.py
@@ -26,6 +26,8 @@ projective transforms (including rotation) are supported.
@@random_yiq_hsv
@@rotate
@@transform
+@@translate
+@@translations_to_projective_transforms
@@bipartite_match
@@single_image_random_dot_stereograms
"""
@@ -41,6 +43,8 @@ from tensorflow.contrib.image.python.ops.image_ops import angles_to_projective_t
from tensorflow.contrib.image.python.ops.image_ops import compose_transforms
from tensorflow.contrib.image.python.ops.image_ops import rotate
from tensorflow.contrib.image.python.ops.image_ops import transform
+from tensorflow.contrib.image.python.ops.image_ops import translate
+from tensorflow.contrib.image.python.ops.image_ops import translations_to_projective_transforms
from tensorflow.contrib.image.python.ops.single_image_random_dot_stereograms import single_image_random_dot_stereograms
from tensorflow.python.util.all_util import remove_undocumented
diff --git a/tensorflow/contrib/image/python/kernel_tests/image_ops_test.py b/tensorflow/contrib/image/python/kernel_tests/image_ops_test.py
index b8a0706b61..b50177ae56 100644
--- a/tensorflow/contrib/image/python/kernel_tests/image_ops_test.py
+++ b/tensorflow/contrib/image/python/kernel_tests/image_ops_test.py
@@ -36,8 +36,8 @@ _DTYPES = set(
class ImageOpsTest(test_util.TensorFlowTestCase):
def test_zeros(self):
- with self.test_session():
- for dtype in _DTYPES:
+ for dtype in _DTYPES:
+ with self.test_session():
for shape in [(5, 5), (24, 24), (2, 24, 24, 3)]:
for angle in [0, 1, np.pi / 2.0]:
image = array_ops.zeros(shape, dtype)
@@ -46,8 +46,8 @@ class ImageOpsTest(test_util.TensorFlowTestCase):
np.zeros(shape, dtype.as_numpy_dtype()))
def test_rotate_even(self):
- with self.test_session():
- for dtype in _DTYPES:
+ for dtype in _DTYPES:
+ with self.test_session():
image = array_ops.reshape(
math_ops.cast(math_ops.range(36), dtype), (6, 6))
image_rep = array_ops.tile(image[None, :, :, None], [3, 1, 1, 1])
@@ -68,8 +68,8 @@ class ImageOpsTest(test_util.TensorFlowTestCase):
[1, 7, 13, 19, 25, 31], [0, 6, 12, 18, 24, 30]]])
def test_rotate_odd(self):
- with self.test_session():
- for dtype in _DTYPES:
+ for dtype in _DTYPES:
+ with self.test_session():
image = array_ops.reshape(
math_ops.cast(math_ops.range(25), dtype), (5, 5))
image_rep = array_ops.tile(image[None, :, :, None], [3, 1, 1, 1])
@@ -87,9 +87,25 @@ class ImageOpsTest(test_util.TensorFlowTestCase):
[22, 17, 12, 7, 2], [23, 18, 13, 8, 3],
[24, 19, 14, 9, 4]]])
+ def test_translate(self):
+ for dtype in _DTYPES:
+ with self.test_session():
+ image = constant_op.constant(
+ [[1, 0, 1, 0],
+ [0, 1, 0, 1],
+ [1, 0, 1, 0],
+ [0, 1, 0, 1]], dtype=dtype)
+ translation = constant_op.constant([-1, -1], dtypes.float32)
+ image_translated = image_ops.translate(image, translation)
+ self.assertAllEqual(image_translated.eval(),
+ [[1, 0, 1, 0],
+ [0, 1, 0, 0],
+ [1, 0, 1, 0],
+ [0, 0, 0, 0]])
+
def test_compose(self):
- with self.test_session():
- for dtype in _DTYPES:
+ for dtype in _DTYPES:
+ with self.test_session():
image = constant_op.constant(
[[1, 1, 1, 0],
[1, 0, 0, 0],
@@ -246,4 +262,3 @@ class BipartiteMatchTest(test_util.TensorFlowTestCase):
if __name__ == "__main__":
googletest.main()
-
diff --git a/tensorflow/contrib/image/python/ops/image_ops.py b/tensorflow/contrib/image/python/ops/image_ops.py
index aef3e385b5..011ddeaa9a 100644
--- a/tensorflow/contrib/image/python/ops/image_ops.py
+++ b/tensorflow/contrib/image/python/ops/image_ops.py
@@ -37,16 +37,18 @@ _IMAGE_DTYPES = set(
ops.RegisterShape("ImageProjectiveTransform")(common_shapes.call_cpp_shape_fn)
-def rotate(images, angles, interpolation="NEAREST"):
+def rotate(images, angles, interpolation="NEAREST", name=None):
"""Rotate image(s) by the passed angle(s) in radians.
Args:
images: A tensor of shape (num_images, num_rows, num_columns, num_channels)
(NHWC), (num_rows, num_columns, num_channels) (HWC), or
- (num_rows, num_columns) (HW).
+ (num_rows, num_columns) (HW). The rank must be statically known (the
+ shape is not `TensorShape(None)`).
angles: A scalar angle to rotate all images by, or (if images has rank 4)
a vector of length num_images, with an angle for each image in the batch.
interpolation: Interpolation mode. Supported values: "NEAREST", "BILINEAR".
+ name: The name of the op.
Returns:
Image(s) with the same type and shape as `images`, rotated by the given
@@ -55,38 +57,77 @@ def rotate(images, angles, interpolation="NEAREST"):
Raises:
TypeError: If `image` is an invalid type.
"""
- image_or_images = ops.convert_to_tensor(images, name="images")
- if image_or_images.dtype.base_dtype not in _IMAGE_DTYPES:
- raise TypeError("Invalid dtype %s." % image_or_images.dtype)
- if len(image_or_images.get_shape()) == 2:
- images = image_or_images[None, :, :, None]
- elif len(image_or_images.get_shape()) == 3:
- images = image_or_images[None, :, :, :]
- elif len(image_or_images.get_shape()) == 4:
- images = image_or_images
- else:
- raise TypeError("Images should have rank between 2 and 4.")
-
- image_height = math_ops.cast(array_ops.shape(images)[1], dtypes.float32)[None]
- image_width = math_ops.cast(array_ops.shape(images)[2], dtypes.float32)[None]
- output = transform(
- images,
- angles_to_projective_transforms(angles, image_height, image_width),
- interpolation=interpolation)
- if len(image_or_images.get_shape()) == 2:
- return output[0, :, :, 0]
- elif len(image_or_images.get_shape()) == 3:
- return output[0, :, :, :]
- else:
- return output
+ with ops.name_scope(name, "rotate"):
+ image_or_images = ops.convert_to_tensor(images)
+ if image_or_images.dtype.base_dtype not in _IMAGE_DTYPES:
+ raise TypeError("Invalid dtype %s." % image_or_images.dtype)
+ elif image_or_images.get_shape().ndims is None:
+ raise TypeError("image_or_images rank must be statically known")
+ elif len(image_or_images.get_shape()) == 2:
+ images = image_or_images[None, :, :, None]
+ elif len(image_or_images.get_shape()) == 3:
+ images = image_or_images[None, :, :, :]
+ elif len(image_or_images.get_shape()) == 4:
+ images = image_or_images
+ else:
+ raise TypeError("Images should have rank between 2 and 4.")
+
+ image_height = math_ops.cast(array_ops.shape(images)[1],
+ dtypes.float32)[None]
+ image_width = math_ops.cast(array_ops.shape(images)[2],
+ dtypes.float32)[None]
+ output = transform(
+ images,
+ angles_to_projective_transforms(angles, image_height, image_width),
+ interpolation=interpolation)
+ if image_or_images.get_shape().ndims is None:
+ raise TypeError("image_or_images rank must be statically known")
+ elif len(image_or_images.get_shape()) == 2:
+ return output[0, :, :, 0]
+ elif len(image_or_images.get_shape()) == 3:
+ return output[0, :, :, :]
+ else:
+ return output
+
+
+def translate(images, translations, interpolation="NEAREST", name=None):
+ """Translate image(s) by the passed vector(s).
+ Args:
+ images: A tensor of shape (num_images, num_rows, num_columns, num_channels)
+ (NHWC), (num_rows, num_columns, num_channels) (HWC), or
+ (num_rows, num_columns) (HW). The rank must be statically known (the
+ shape is not `TensorShape(None)`).
+ translations: A vector representing [dx, dy] or (if images has rank 4)
+ a matrix of length num_images, with a [dx, dy] vector for each image in
+ the batch.
+ interpolation: Interpolation mode. Supported values: "NEAREST", "BILINEAR".
+ name: The name of the op.
-def angles_to_projective_transforms(angles, image_height, image_width):
+ Returns:
+ Image(s) with the same type and shape as `images`, translated by the given
+ vector(s). Empty space due to the translation will be filled with zeros.
+
+ Raises:
+ TypeError: If `image` is an invalid type.
+ """
+ with ops.name_scope(name, "translate"):
+ return transform(
+ images,
+ translations_to_projective_transforms(translations),
+ interpolation=interpolation)
+
+
+def angles_to_projective_transforms(angles,
+ image_height,
+ image_width,
+ name=None):
"""Returns projective transform(s) for the given angle(s).
Args:
angles: A scalar angle to rotate all images by, or (for batches of images)
- a vector with an angle to rotate each image in the batch.
+ a vector with an angle to rotate each image in the batch. The rank must
+ be statically known (the shape is not `TensorShape(None)`).
image_height: Height of the image(s) to be transformed.
image_width: Width of the image(s) to be transformed.
@@ -94,41 +135,89 @@ def angles_to_projective_transforms(angles, image_height, image_width):
A tensor of shape (num_images, 8). Projective transforms which can be given
to `tf.contrib.image.transform`.
"""
- angle_or_angles = ops.convert_to_tensor(
- angles, name="angles", dtype=dtypes.float32)
- if len(angle_or_angles.get_shape()) == 0: # pylint: disable=g-explicit-length-test
- angles = angle_or_angles[None]
- elif len(angle_or_angles.get_shape()) == 1:
- angles = angle_or_angles
- else:
- raise TypeError("Angles should have rank 0 or 1.")
- x_offset = ((image_width - 1) - (math_ops.cos(angles) *
- (image_width - 1) - math_ops.sin(angles) *
- (image_height - 1))) / 2.0
- y_offset = ((image_height - 1) - (math_ops.sin(angles) *
- (image_width - 1) + math_ops.cos(angles) *
- (image_height - 1))) / 2.0
- num_angles = array_ops.shape(angles)[0]
- return array_ops.concat(
- values=[
- math_ops.cos(angles)[:, None],
- -math_ops.sin(angles)[:, None],
- x_offset[:, None],
- math_ops.sin(angles)[:, None],
- math_ops.cos(angles)[:, None],
- y_offset[:, None],
- array_ops.zeros((num_angles, 2), dtypes.float32),
- ],
- axis=1)
-
-
-def transform(images, transforms, interpolation="NEAREST"):
+ with ops.name_scope(name, "angles_to_projective_transforms"):
+ angle_or_angles = ops.convert_to_tensor(
+ angles, name="angles", dtype=dtypes.float32)
+ if len(angle_or_angles.get_shape()) == 0: # pylint: disable=g-explicit-length-test
+ angles = angle_or_angles[None]
+ elif len(angle_or_angles.get_shape()) == 1:
+ angles = angle_or_angles
+ else:
+ raise TypeError("Angles should have rank 0 or 1.")
+ x_offset = ((image_width - 1) - (math_ops.cos(angles) *
+ (image_width - 1) - math_ops.sin(angles) *
+ (image_height - 1))) / 2.0
+ y_offset = ((image_height - 1) - (math_ops.sin(angles) *
+ (image_width - 1) + math_ops.cos(angles) *
+ (image_height - 1))) / 2.0
+ num_angles = array_ops.shape(angles)[0]
+ return array_ops.concat(
+ values=[
+ math_ops.cos(angles)[:, None],
+ -math_ops.sin(angles)[:, None],
+ x_offset[:, None],
+ math_ops.sin(angles)[:, None],
+ math_ops.cos(angles)[:, None],
+ y_offset[:, None],
+ array_ops.zeros((num_angles, 2), dtypes.float32),
+ ],
+ axis=1)
+
+
+def translations_to_projective_transforms(translations, name=None):
+ """Returns projective transform(s) for the given translation(s).
+
+ Args:
+ translations: A 2-element list representing [dx, dy] or a matrix of
+ 2-element lists representing [dx, dy] to translate for each image
+ (for a batch of images). The rank must be statically known (the shape
+ is not `TensorShape(None)`).
+ name: The name of the op.
+
+ Returns:
+ A tensor of shape (num_images, 8) projective transforms which can be given
+ to `tf.contrib.image.transform`.
+ """
+ with ops.name_scope(name, "translations_to_projective_transforms"):
+ translation_or_translations = ops.convert_to_tensor(
+ translations, name="translations", dtype=dtypes.float32)
+ if translation_or_translations.get_shape().ndims is None:
+ raise TypeError(
+ "translation_or_translations rank must be statically known")
+ elif len(translation_or_translations.get_shape()) == 1:
+ translations = translation_or_translations[None]
+ elif len(translation_or_translations.get_shape()) == 2:
+ translations = translation_or_translations
+ else:
+ raise TypeError("Translations should have rank 1 or 2.")
+ num_translations = array_ops.shape(translations)[0]
+ # The translation matrix looks like:
+ # [[1 0 -dx]
+ # [0 1 -dy]
+ # [0 0 1]]
+ # where the last entry is implicit.
+ # Translation matrices are always float32.
+ return array_ops.concat(
+ values=[
+ array_ops.ones((num_translations, 1), dtypes.float32),
+ array_ops.zeros((num_translations, 1), dtypes.float32),
+ -translations[:, 0, None],
+ array_ops.zeros((num_translations, 1), dtypes.float32),
+ array_ops.ones((num_translations, 1), dtypes.float32),
+ -translations[:, 1, None],
+ array_ops.zeros((num_translations, 2), dtypes.float32),
+ ],
+ axis=1)
+
+
+def transform(images, transforms, interpolation="NEAREST", name=None):
"""Applies the given transform(s) to the image(s).
Args:
images: A tensor of shape (num_images, num_rows, num_columns, num_channels)
(NHWC), (num_rows, num_columns, num_channels) (HWC), or
- (num_rows, num_columns) (HW).
+ (num_rows, num_columns) (HW). The rank must be statically known (the
+ shape is not `TensorShape(None)`).
transforms: Projective transform matrix/matrices. A vector of length 8 or
tensor of size N x 8. If one row of transforms is
[a0, a1, a2, b0, b1, b2, c0, c1], then it maps the *output* point
@@ -146,34 +235,40 @@ def transform(images, transforms, interpolation="NEAREST"):
Raises:
TypeError: If `image` is an invalid type.
"""
- image_or_images = ops.convert_to_tensor(images, name="images")
- transform_or_transforms = ops.convert_to_tensor(
- transforms, name="transforms", dtype=dtypes.float32)
- if image_or_images.dtype.base_dtype not in _IMAGE_DTYPES:
- raise TypeError("Invalid dtype %s." % image_or_images.dtype)
- if len(image_or_images.get_shape()) == 2:
- images = image_or_images[None, :, :, None]
- elif len(image_or_images.get_shape()) == 3:
- images = image_or_images[None, :, :, :]
- elif len(image_or_images.get_shape()) == 4:
- images = image_or_images
- else:
- raise TypeError("Images should have rank between 2 and 4.")
-
- if len(transform_or_transforms.get_shape()) == 1:
- transforms = transform_or_transforms[None]
- elif len(transform_or_transforms.get_shape()) == 2:
- transforms = transform_or_transforms
- else:
- raise TypeError("Transforms should have rank 1 or 2.")
- output = gen_image_ops.image_projective_transform(
- images, transforms, interpolation=interpolation.upper())
- if len(image_or_images.get_shape()) == 2:
- return output[0, :, :, 0]
- elif len(image_or_images.get_shape()) == 3:
- return output[0, :, :, :]
- else:
- return output
+ with ops.name_scope(name, "transform"):
+ image_or_images = ops.convert_to_tensor(images, name="images")
+ transform_or_transforms = ops.convert_to_tensor(
+ transforms, name="transforms", dtype=dtypes.float32)
+ if image_or_images.dtype.base_dtype not in _IMAGE_DTYPES:
+ raise TypeError("Invalid dtype %s." % image_or_images.dtype)
+ elif image_or_images.get_shape().ndims is None:
+ raise TypeError("image_or_images rank must be statically known")
+ elif len(image_or_images.get_shape()) == 2:
+ images = image_or_images[None, :, :, None]
+ elif len(image_or_images.get_shape()) == 3:
+ images = image_or_images[None, :, :, :]
+ elif len(image_or_images.get_shape()) == 4:
+ images = image_or_images
+ else:
+ raise TypeError("Images should have rank between 2 and 4.")
+
+ if len(transform_or_transforms.get_shape()) == 1:
+ transforms = transform_or_transforms[None]
+ elif transform_or_transforms.get_shape().ndims is None:
+ raise TypeError(
+ "transform_or_transforms rank must be statically known")
+ elif len(transform_or_transforms.get_shape()) == 2:
+ transforms = transform_or_transforms
+ else:
+ raise TypeError("Transforms should have rank 1 or 2.")
+ output = gen_image_ops.image_projective_transform(
+ images, transforms, interpolation=interpolation.upper())
+ if len(image_or_images.get_shape()) == 2:
+ return output[0, :, :, 0]
+ elif len(image_or_images.get_shape()) == 3:
+ return output[0, :, :, :]
+ else:
+ return output
def compose_transforms(*transforms):
@@ -191,11 +286,12 @@ def compose_transforms(*transforms):
order.
"""
assert transforms, "transforms cannot be empty"
- composed = _flat_transforms_to_matrices(transforms[0])
- for tr in transforms[1:]:
- # Multiply batches of matrices.
- composed = math_ops.matmul(composed, _flat_transforms_to_matrices(tr))
- return _transform_matrices_to_flat(composed)
+ with ops.name_scope("compose_transforms"):
+ composed = _flat_transforms_to_matrices(transforms[0])
+ for tr in transforms[1:]:
+ # Multiply batches of matrices.
+ composed = math_ops.matmul(composed, _flat_transforms_to_matrices(tr))
+ return _transform_matrices_to_flat(composed)
def _flat_transforms_to_matrices(transforms):
@@ -211,8 +307,8 @@ def _flat_transforms_to_matrices(transforms):
def _transform_matrices_to_flat(transform_matrices):
# Flatten each matrix.
- transforms = array_ops.reshape(
- transform_matrices, constant_op.constant([-1, 9]))
+ transforms = array_ops.reshape(transform_matrices,
+ constant_op.constant([-1, 9]))
# Divide each matrix by the last entry (normally 1).
transforms /= transforms[:, 8:9]
return transforms[:, :8]
@@ -260,10 +356,10 @@ def _image_projective_transform_grad(op, grad):
return [output, None]
-def bipartite_match(
- distance_mat,
- num_valid_rows,
- top_k=-1):
+def bipartite_match(distance_mat,
+ num_valid_rows,
+ top_k=-1,
+ name="bipartite_match"):
"""Find bipartite matching based on a given distance matrix.
A greedy bi-partite matching algorithm is used to obtain the matching with
@@ -282,6 +378,7 @@ def bipartite_match(
top_k: A scalar that specifies the number of top-k matches to retrieve.
If set to be negative, then is set according to the maximum number of
matches from `distance_mat`.
+ name: The name of the op.
Returns:
row_to_col_match_indices: A vector of length num_rows, which is the number
@@ -292,7 +389,8 @@ def bipartite_match(
If `col_to_row_match_indices[j]` is not -1, column j is matched to row
`col_to_row_match_indices[j]`.
"""
- result = gen_image_ops.bipartite_match(distance_mat, num_valid_rows, top_k)
+ result = gen_image_ops.bipartite_match(
+ distance_mat, num_valid_rows, top_k, name=name)
return result
diff --git a/tensorflow/contrib/kfac/python/ops/loss_functions.py b/tensorflow/contrib/kfac/python/ops/loss_functions.py
index 0b5c3d4928..69d97f0b5b 100644
--- a/tensorflow/contrib/kfac/python/ops/loss_functions.py
+++ b/tensorflow/contrib/kfac/python/ops/loss_functions.py
@@ -104,7 +104,7 @@ class LossFunction(object):
@abc.abstractmethod
def multiply_hessian_factor_transpose(self, vector):
- """Right-multiply a vector by the tranpose of a factor B of the Hessian.
+ """Right-multiply a vector by the transpose of a factor B of the Hessian.
Here the 'Hessian' is the Hessian matrix (i.e. matrix of 2nd-derivatives)
of the loss function with respect to its inputs. Typically this will be
@@ -218,7 +218,7 @@ class NegativeLogProbLoss(LossFunction):
@abc.abstractmethod
def multiply_fisher_factor_transpose(self, vector):
- """Right-multiply a vector by the tranpose of a factor B of the Fisher.
+ """Right-multiply a vector by the transpose of a factor B of the Fisher.
Here the 'Fisher' is the Fisher information matrix (i.e. expected outer-
product of gradients) with respect to the parameters of the underlying
@@ -397,7 +397,7 @@ class NormalMeanVarianceNegativeLogProbLoss(DistributionNegativeLogProbLoss):
This class parameterizes a multivariate normal distribution with n independent
dimensions. Unlike `NormalMeanNegativeLogProbLoss`, this class does not
- assume the variance is held constant. The Fisher Information for for n = 1
+ assume the variance is held constant. The Fisher Information for n = 1
is given by,
F = [[1 / variance, 0],
diff --git a/tensorflow/contrib/kfac/python/ops/op_queue.py b/tensorflow/contrib/kfac/python/ops/op_queue.py
index 0617c5be4d..831870fca4 100644
--- a/tensorflow/contrib/kfac/python/ops/op_queue.py
+++ b/tensorflow/contrib/kfac/python/ops/op_queue.py
@@ -61,7 +61,7 @@ class OpQueue(object):
sess: tf.Session.
Returns:
- Next Op chosen from from 'ops'.
+ Next Op chosen from 'ops'.
"""
# In Python 3, type(next_op_name) == bytes. Calling bytes.decode('ascii')
# returns a str.
diff --git a/tensorflow/contrib/layers/__init__.py b/tensorflow/contrib/layers/__init__.py
index d8ab7c2d70..d309ba958d 100644
--- a/tensorflow/contrib/layers/__init__.py
+++ b/tensorflow/contrib/layers/__init__.py
@@ -47,6 +47,7 @@ See the @{$python/contrib.layers} guide.
@@separable_conv2d
@@separable_convolution2d
@@softmax
+@@spatial_softmax
@@stack
@@unit_norm
@@bow_encoder
diff --git a/tensorflow/contrib/learn/python/learn/learn_runner.py b/tensorflow/contrib/learn/python/learn/learn_runner.py
index 9f9740ec49..2af723a0d6 100644
--- a/tensorflow/contrib/learn/python/learn/learn_runner.py
+++ b/tensorflow/contrib/learn/python/learn/learn_runner.py
@@ -165,7 +165,7 @@ def run(experiment_fn, output_dir=None, schedule=None, run_config=None,
must be None.
2) It accepts two arguments `run_config` and `hparams`, which should be
used to create the `Estimator` (`run_config` passed as `config` to its
- constructor; `hparams` used as the hyper-paremeters of the model).
+ constructor; `hparams` used as the hyper-parameters of the model).
It must return an `Experiment`. For this case, `output_dir` must be None.
output_dir: Base output directory [Deprecated].
schedule: The name of the method in the `Experiment` to run.
diff --git a/tensorflow/contrib/losses/python/losses/loss_ops.py b/tensorflow/contrib/losses/python/losses/loss_ops.py
index 1d2477b8b7..7c523ad492 100644
--- a/tensorflow/contrib/losses/python/losses/loss_ops.py
+++ b/tensorflow/contrib/losses/python/losses/loss_ops.py
@@ -28,6 +28,7 @@ from tensorflow.python.ops import math_ops
from tensorflow.python.ops import nn
from tensorflow.python.ops import nn_ops
from tensorflow.python.util.deprecation import deprecated
+from tensorflow.python.util.deprecation import deprecated_args
__all__ = ["absolute_difference",
"add_loss",
@@ -623,8 +624,9 @@ def mean_pairwise_squared_error(
@deprecated("2016-12-30", "Use tf.losses.cosine_distance instead.")
+@deprecated_args(None, "dim is deprecated, use axis instead", "dim")
def cosine_distance(
- predictions, labels=None, dim=None, weights=1.0, scope=None):
+ predictions, labels=None, axis=None, weights=1.0, scope=None, dim=None):
"""Adds a cosine-distance loss to the training procedure.
Note that the function assumes that `predictions` and `labels` are already
@@ -633,10 +635,11 @@ def cosine_distance(
Args:
predictions: An arbitrary matrix.
labels: A `Tensor` whose shape matches 'predictions'
- dim: The dimension along which the cosine distance is computed.
+ axis: The dimension along which the cosine distance is computed.
weights: Coefficients for the loss a scalar, a tensor of shape
[batch_size] or a tensor whose shape matches `predictions`.
scope: The scope for the operations performed in computing the loss.
+ dim: The old (deprecated) name for `axis`.
Returns:
A scalar `Tensor` representing the loss value.
@@ -645,8 +648,12 @@ def cosine_distance(
ValueError: If `predictions` shape doesn't match `labels` shape, or
`weights` is `None`.
"""
- if dim is None:
- raise ValueError("`dim` cannot be None.")
+ if dim is not None:
+ if axis is not None:
+ raise ValueError("Cannot specify both 'axis' and 'dim'")
+ axis = dim
+ if axis is None and dim is None:
+ raise ValueError("You must specify 'axis'.")
with ops.name_scope(scope, "cosine_distance_loss",
[predictions, labels, weights]) as scope:
predictions.get_shape().assert_is_compatible_with(labels.get_shape())
@@ -655,5 +662,5 @@ def cosine_distance(
labels = math_ops.to_float(labels)
radial_diffs = math_ops.multiply(predictions, labels)
- losses = 1 - math_ops.reduce_sum(radial_diffs, reduction_indices=[dim,])
+ losses = 1 - math_ops.reduce_sum(radial_diffs, reduction_indices=[axis,])
return compute_weighted_loss(losses, weights, scope=scope)
diff --git a/tensorflow/contrib/makefile/Makefile b/tensorflow/contrib/makefile/Makefile
index 81024c26a4..b582493131 100644
--- a/tensorflow/contrib/makefile/Makefile
+++ b/tensorflow/contrib/makefile/Makefile
@@ -194,6 +194,10 @@ LIBFLAGS :=
# If we're on OS X, make sure that globals aren't stripped out.
ifeq ($(TARGET),OSX)
+ifeq ($(HAS_GEN_HOST_PROTOC),true)
+ LIBFLAGS += -L$(MAKEFILE_DIR)/gen/protobuf-host/lib
+ export LD_LIBRARY_PATH=$(MAKEFILE_DIR)/gen/protobuf-host/lib
+endif
LDFLAGS += -all_load
endif
# Make sure that we don't strip global constructors on Linux.
diff --git a/tensorflow/contrib/makefile/download_dependencies.sh b/tensorflow/contrib/makefile/download_dependencies.sh
index f0b9658e3d..12e3f58930 100755
--- a/tensorflow/contrib/makefile/download_dependencies.sh
+++ b/tensorflow/contrib/makefile/download_dependencies.sh
@@ -54,7 +54,7 @@ download_and_extract() {
elif [[ "${url}" == *zip ]]; then
tempdir=$(mktemp -d)
tempdir2=$(mktemp -d)
- wget ${url} -P ${tempdir}
+ wget -P ${tempdir} ${url}
unzip ${tempdir}/* -d ${tempdir2}
# unzip has no strip components, so unzip to a temp dir, and move the files
# we want from the tempdir to destination.
diff --git a/tensorflow/contrib/makefile/tf_op_files.txt b/tensorflow/contrib/makefile/tf_op_files.txt
index a8690a04ad..8b77c99cb5 100644
--- a/tensorflow/contrib/makefile/tf_op_files.txt
+++ b/tensorflow/contrib/makefile/tf_op_files.txt
@@ -264,3 +264,4 @@ tensorflow/core/kernels/spacetobatch_functor.cc
tensorflow/core/kernels/spacetobatch_op.cc
tensorflow/core/kernels/batchtospace_op.cc
tensorflow/core/kernels/warn_about_ints.cc
+tensorflow/core/kernels/segment_reduction_ops.cc
diff --git a/tensorflow/contrib/meta_graph_transform/meta_graph_transform.py b/tensorflow/contrib/meta_graph_transform/meta_graph_transform.py
index 303c02dfa4..2932ae1c8d 100644
--- a/tensorflow/contrib/meta_graph_transform/meta_graph_transform.py
+++ b/tensorflow/contrib/meta_graph_transform/meta_graph_transform.py
@@ -749,7 +749,7 @@ def meta_graph_transform(
base_meta_graph_def, meta_graph_def, collection_name,
removed_op_names)
- # Append newly added initalizers to collection.
+ # Append newly added initializers to collection.
_add_new_inits_to_collection(meta_graph_def, updated_initializer_names)
# Copy signature_defs, excluding any pruned nodes
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
index f24bec7f11..6e038481e3 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
@@ -5856,7 +5856,7 @@ class StreamingMeanIOUTest(test.TestCase):
sess.run(variables.local_variables_initializer())
for _ in range(5):
sess.run(update_op)
- desired_output = np.mean([1.0 / 3.0, 2.0 / 4.0, 0.])
+ desired_output = np.mean([1.0 / 3.0, 2.0 / 4.0])
self.assertAlmostEqual(desired_output, miou.eval())
def testUpdateOpEvalIsAccumulatedConfusionMatrix(self):
@@ -5938,6 +5938,58 @@ class StreamingMeanIOUTest(test.TestCase):
desired_miou = np.mean([2. / 4., 4. / 6.])
self.assertAlmostEqual(desired_miou, miou.eval())
+ def testMissingClassInLabels(self):
+ labels = constant_op.constant([
+ [[0, 0, 1, 1, 0, 0],
+ [1, 0, 0, 0, 0, 1]],
+ [[1, 1, 1, 1, 1, 1],
+ [0, 0, 0, 0, 0, 0]]])
+ predictions = constant_op.constant([
+ [[0, 0, 2, 1, 1, 0],
+ [0, 1, 2, 2, 0, 1]],
+ [[0, 0, 2, 1, 1, 1],
+ [1, 1, 2, 0, 0, 0]]])
+ num_classes = 3
+ with self.test_session() as sess:
+ miou, update_op = metrics.streaming_mean_iou(
+ predictions, labels, num_classes)
+ sess.run(variables.local_variables_initializer())
+ self.assertAllEqual([[7, 4, 3], [3, 5, 2], [0, 0, 0]], update_op.eval())
+ self.assertAlmostEqual(
+ 1 / 3 * (7 / (7 + 3 + 7) + 5 / (5 + 4 + 5) + 0 / (0 + 5 + 0)),
+ miou.eval())
+
+ def testMissingClassOverallSmall(self):
+ labels = constant_op.constant([0])
+ predictions = constant_op.constant([0])
+ num_classes = 2
+ with self.test_session() as sess:
+ miou, update_op = metrics.streaming_mean_iou(
+ predictions, labels, num_classes)
+ sess.run(variables.local_variables_initializer())
+ self.assertAllEqual([[1, 0], [0, 0]], update_op.eval())
+ self.assertAlmostEqual(1, miou.eval())
+
+ def testMissingClassOverallLarge(self):
+ labels = constant_op.constant([
+ [[0, 0, 1, 1, 0, 0],
+ [1, 0, 0, 0, 0, 1]],
+ [[1, 1, 1, 1, 1, 1],
+ [0, 0, 0, 0, 0, 0]]])
+ predictions = constant_op.constant([
+ [[0, 0, 1, 1, 0, 0],
+ [1, 1, 0, 0, 1, 1]],
+ [[0, 0, 0, 1, 1, 1],
+ [1, 1, 1, 0, 0, 0]]])
+ num_classes = 3
+ with self.test_session() as sess:
+ miou, update_op = metrics.streaming_mean_iou(
+ predictions, labels, num_classes)
+ sess.run(variables.local_variables_initializer())
+ self.assertAllEqual([[9, 5, 0], [3, 7, 0], [0, 0, 0]], update_op.eval())
+ self.assertAlmostEqual(
+ 1 / 2 * (9 / (9 + 3 + 5) + 7 / (7 + 5 + 3)), miou.eval())
+
class StreamingConcatTest(test.TestCase):
diff --git a/tensorflow/contrib/mpi_collectives/__init__.py b/tensorflow/contrib/mpi_collectives/__init__.py
index b94f7b0a35..9ed16a6f07 100644
--- a/tensorflow/contrib/mpi_collectives/__init__.py
+++ b/tensorflow/contrib/mpi_collectives/__init__.py
@@ -194,7 +194,7 @@ class DistributedOptimizer(tf.train.Optimizer):
See Optimizer.compute_gradients() for more info.
- In DistributedOptimizer, compute_gradients() is overriden to also
+ In DistributedOptimizer, compute_gradients() is overridden to also
allreduce the gradients before returning them.
"""
gradients = (super(DistributedOptimizer, self)
diff --git a/tensorflow/contrib/nn/__init__.py b/tensorflow/contrib/nn/__init__.py
index 7007e26bac..3bf795d19a 100644
--- a/tensorflow/contrib/nn/__init__.py
+++ b/tensorflow/contrib/nn/__init__.py
@@ -18,6 +18,7 @@
@@deprecated_flipped_softmax_cross_entropy_with_logits
@@deprecated_flipped_sparse_softmax_cross_entropy_with_logits
@@deprecated_flipped_sigmoid_cross_entropy_with_logits
+@@nth_element
@@rank_sampled_softmax_loss
@@scaled_softplus
"""
@@ -31,6 +32,7 @@ from tensorflow.contrib.nn.python.ops.alpha_dropout import *
from tensorflow.contrib.nn.python.ops.cross_entropy import *
from tensorflow.contrib.nn.python.ops.sampling_ops import *
from tensorflow.contrib.nn.python.ops.scaled_softplus import *
+from tensorflow.python.ops.nn_ops import nth_element
# pylint: enable=unused-import,wildcard-import
from tensorflow.python.util.all_util import remove_undocumented
diff --git a/tensorflow/contrib/receptive_field/python/util/receptive_field.py b/tensorflow/contrib/receptive_field/python/util/receptive_field.py
index db190a1a41..8b34465d21 100644
--- a/tensorflow/contrib/receptive_field/python/util/receptive_field.py
+++ b/tensorflow/contrib/receptive_field/python/util/receptive_field.py
@@ -27,13 +27,15 @@ import math
from tensorflow.contrib.receptive_field.python.util import graph_compute_order
from tensorflow.contrib.util import make_ndarray
from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.framework import ops as framework_ops
+import numpy as np
# White-listed layer operations, which do not affect the receptive field
# computation.
_UNCHANGED_RF_LAYER_OPS = [
- "Softplus", "Relu", "BiasAdd", "Mul", "Add", "Const", "Identity",
- "VariableV2", "Sub", "Rsqrt", "ConcatV2"
-]
+ 'Add', 'BiasAdd', 'Ceil', 'ConcatV2', 'Const', 'Floor', 'Identity', 'Log',
+ 'Mul', 'Pow', 'RealDiv', 'Relu', 'Round', 'Rsqrt', 'Softplus', 'Sub',
+ 'VariableV2']
# Different ways in which padding modes may be spelled.
_VALID_PADDING = ["VALID", b"VALID"]
@@ -238,7 +240,8 @@ def _get_layer_params(node, name_to_order_node):
padding_x = 0
padding_y = 0
else:
- raise ValueError("Unknown layer op: %s" % node.op)
+ raise ValueError("Unknown layer for operation '%s': %s" %
+ (node.name, node.op))
return kernel_size_x, kernel_size_y, stride_x, stride_y, padding_x, padding_y
@@ -304,13 +307,103 @@ def _get_effective_padding_node_input(stride, padding,
return stride * effective_padding_output + padding
-def compute_receptive_field_from_graph_def(graph_def, input_node, output_node):
- """Computes receptive field (RF) parameters from a GraphDef object.
+class ReceptiveField:
+ """
+ Receptive field of a convolutional neural network.
+
+ Args:
+ size: Receptive field size.
+ stride: Effective stride.
+ padding: Effective padding.
+ """
+ def __init__(self, size, stride, padding):
+ self.size = np.asarray(size)
+ self.stride = np.asarray(stride)
+ self.padding = np.asarray(padding)
+
+ def compute_input_center_coordinates(self, y, axis=None):
+ """
+ Computes the center of the receptive field that generated a feature.
+
+ Args:
+ y: An array of feature coordinates with shape `(..., d)`, where `d` is the
+ number of dimensions of the coordinates.
+ axis: The dimensions for which to compute the input center coordinates.
+ If `None` (the default), compute the input center coordinates for all
+ dimensions.
+
+ Returns:
+ x: Center of the receptive field that generated the features, at the input
+ of the network.
+
+ Raises:
+ ValueError: If the number of dimensions of the feature coordinates does
+ not match the number of elements in `axis`.
+ """
+ # Use all dimensions.
+ if axis is None:
+ axis = range(self.size.size)
+ # Ensure axis is a list because tuples have different indexing behavior.
+ axis = list(axis)
+ y = np.asarray(y)
+ if y.shape[-1] != len(axis):
+ raise ValueError("Dimensionality of the feature coordinates `y` (%d) "
+ "does not match dimensionality of `axis` (%d)" %
+ (y.shape[-1], len(axis)))
+ return - self.padding[axis] + y * self.stride[axis] + \
+ (self.size[axis] - 1) / 2
+
+ def compute_feature_coordinates(self, x, axis=None):
+ """
+ Computes the position of a feature given the center of a receptive field.
+
+ Args:
+ x: An array of input center coordinates with shape `(..., d)`, where `d`
+ is the number of dimensions of the coordinates.
+ axis: The dimensions for which to compute the feature coordinates.
+ If `None` (the default), compute the feature coordinates for all
+ dimensions.
+
+ Returns:
+ y: Coordinates of the features.
+
+ Raises:
+ ValueError: If the number of dimensions of the input center coordinates
+ does not match the number of elements in `axis`.
+ """
+ # Use all dimensions.
+ if axis is None:
+ axis = range(self.size.size)
+ # Ensure axis is a list because tuples have different indexing behavior.
+ axis = list(axis)
+ x = np.asarray(x)
+ if x.shape[-1] != len(axis):
+ raise ValueError("Dimensionality of the input center coordinates `x` "
+ "(%d) does not match dimensionality of `axis` (%d)" %
+ (x.shape[-1], len(axis)))
+ return (x + self.padding[axis] + (1 - self.size[axis]) / 2) / \
+ self.stride[axis]
+
+ def __iter__(self):
+ return iter(np.concatenate([self.size, self.stride, self.padding]))
+
+
+def compute_receptive_field_from_graph_def(graph_def, input_node, output_node,
+ stop_propagation=None):
+ """Computes receptive field (RF) parameters from a Graph or GraphDef object.
+
+ The algorithm stops the calculation of the receptive field whenever it
+ encounters an operation in the list `stop_propagation`. Stopping the
+ calculation early can be useful to calculate the receptive field of a
+ subgraph such as a single branch of the
+ [inception network](https://arxiv.org/abs/1512.00567).
Args:
- graph_def: GraphDef object.
- input_node: Name of the input node from graph.
- output_node: Name of the output node from graph.
+ graph_def: Graph or GraphDef object.
+ input_node: Name of the input node or Tensor object from graph.
+ output_node: Name of the output node or Tensor object from graph.
+ stop_propagation: List of operation or scope names for which to stop the
+ propagation of the receptive field.
Returns:
rf_size_x: Receptive field size of network in the horizontal direction, with
@@ -331,6 +424,18 @@ def compute_receptive_field_from_graph_def(graph_def, input_node, output_node):
cannot be found. For network criterion alignment, see
photos/vision/features/delf/g3doc/rf_computation.md
"""
+ # Convert a graph to graph_def if necessary.
+ if isinstance(graph_def, framework_ops.Graph):
+ graph_def = graph_def.as_graph_def()
+
+ # Convert tensors to names.
+ if isinstance(input_node, framework_ops.Tensor):
+ input_node = input_node.op.name
+ if isinstance(output_node, framework_ops.Tensor):
+ output_node = output_node.op.name
+
+ stop_propagation = stop_propagation or []
+
# Computes order of computation for a given graph.
name_to_order_node = graph_compute_order.get_compute_order(
graph_def=graph_def)
@@ -422,6 +527,10 @@ def compute_receptive_field_from_graph_def(graph_def, input_node, output_node):
# Loop over this node's inputs and potentially propagate information down.
for inp_name in node.input:
+ # Stop the propagation of the receptive field.
+ if any(inp_name.startswith(stop) for stop in stop_propagation):
+ logging.vlog(3, "Skipping explicitly ignored node %s.", node.name)
+ continue
logging.vlog(4, "inp_name = %s", inp_name)
inp_node = name_to_order_node[inp_name].node
logging.vlog(4, "inp_node = \n%s", inp_node)
@@ -480,6 +589,7 @@ def compute_receptive_field_from_graph_def(graph_def, input_node, output_node):
raise ValueError("Output node was not found")
if input_node not in rf_sizes_x:
raise ValueError("Input node was not found")
- return (rf_sizes_x[input_node], rf_sizes_y[input_node],
- effective_strides_x[input_node], effective_strides_y[input_node],
- effective_paddings_x[input_node], effective_paddings_y[input_node])
+ return ReceptiveField(
+ (rf_sizes_x[input_node], rf_sizes_y[input_node]),
+ (effective_strides_x[input_node], effective_strides_y[input_node]),
+ (effective_paddings_x[input_node], effective_paddings_y[input_node]))
diff --git a/tensorflow/contrib/receptive_field/python/util/receptive_field_test.py b/tensorflow/contrib/receptive_field/python/util/receptive_field_test.py
index 2771389250..8d7d5440f6 100644
--- a/tensorflow/contrib/receptive_field/python/util/receptive_field_test.py
+++ b/tensorflow/contrib/receptive_field/python/util/receptive_field_test.py
@@ -25,6 +25,7 @@ from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import nn
from tensorflow.python.platform import test
+import numpy as np
def create_test_network_1():
@@ -150,6 +151,31 @@ def create_test_network_5():
return g
+def create_test_network_6():
+ """Aligned network with dropout for test.
+
+ The graph is similar to create_test_network_1(), except that the right branch
+  has a dropout layer.
+
+ Returns:
+ g: Tensorflow graph object (Graph proto).
+ """
+ g = ops.Graph()
+ with g.as_default():
+ # An 8x8 test image.
+ x = array_ops.placeholder(dtypes.float32, (1, 8, 8, 1), name='input_image')
+ # Left branch.
+ l1 = slim.conv2d(x, 1, [1, 1], stride=4, scope='L1', padding='VALID')
+ # Right branch.
+ l2_pad = array_ops.pad(x, [[0, 0], [1, 0], [1, 0], [0, 0]])
+ l2 = slim.conv2d(l2_pad, 1, [3, 3], stride=2, scope='L2', padding='VALID')
+ l3 = slim.conv2d(l2, 1, [1, 1], stride=2, scope='L3', padding='VALID')
+ dropout = slim.dropout(l3)
+ # Addition.
+ nn.relu(l1 + dropout, name='output')
+ return g
+
+
class RfUtilsTest(test.TestCase):
def testComputeRFFromGraphDefAligned(self):
@@ -220,6 +246,36 @@ class RfUtilsTest(test.TestCase):
self.assertEqual(effective_padding_x, 0)
self.assertEqual(effective_padding_y, 0)
+ def testComputeRFFromGraphDefStopPropagation(self):
+ graph_def = create_test_network_6().as_graph_def()
+ input_node = 'input_image'
+ output_node = 'output'
+ # Compute the receptive field but stop the propagation for the random
+ # uniform variable of the dropout.
+ (receptive_field_x, receptive_field_y, effective_stride_x,
+ effective_stride_y, effective_padding_x, effective_padding_y) = (
+ receptive_field.compute_receptive_field_from_graph_def(
+ graph_def, input_node, output_node,
+ ['Dropout/dropout/random_uniform']))
+ self.assertEqual(receptive_field_x, 3)
+ self.assertEqual(receptive_field_y, 3)
+ self.assertEqual(effective_stride_x, 4)
+ self.assertEqual(effective_stride_y, 4)
+ self.assertEqual(effective_padding_x, 1)
+ self.assertEqual(effective_padding_y, 1)
+
+ def testComputeCoordinatesRoundtrip(self):
+ graph_def = create_test_network_1()
+ input_node = 'input_image'
+ output_node = 'output'
+ rf = receptive_field.compute_receptive_field_from_graph_def(
+ graph_def, input_node, output_node)
+
+ x = np.random.randint(0, 100, (50, 2))
+ y = rf.compute_feature_coordinates(x)
+ x2 = rf.compute_input_center_coordinates(y)
+
+ self.assertAllEqual(x, x2)
if __name__ == '__main__':
test.main()
diff --git a/tensorflow/contrib/stateless/python/kernel_tests/stateless_random_ops_test.py b/tensorflow/contrib/stateless/python/kernel_tests/stateless_random_ops_test.py
index 9a36bdc2f9..cd4d46aa07 100644
--- a/tensorflow/contrib/stateless/python/kernel_tests/stateless_random_ops_test.py
+++ b/tensorflow/contrib/stateless/python/kernel_tests/stateless_random_ops_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
import numpy as np
from tensorflow.contrib import stateless
+from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import random_seed
from tensorflow.python.ops import array_ops
@@ -79,6 +80,21 @@ class StatelessOpsTest(test.TestCase):
for s1, v1 in values:
self.assertEqual(s0 == s1, np.all(v0 == v1))
+ def testShapeType(self):
+ with self.test_session(use_gpu=True):
+ for shape_dtype in [dtypes.int32, dtypes.int64]:
+ seed_t = array_ops.placeholder(dtypes.int64, shape=[2])
+ seeds = [(x, y) for x in range(5) for y in range(5)] * 3
+ for stateless_op, _ in CASES:
+ for shape in (), (3,), (2, 5):
+ pure = stateless_op(constant_op.constant(shape, dtype=shape_dtype),
+ seed=seed_t)
+ values = [(seed, pure.eval(feed_dict={seed_t: seed}))
+ for seed in seeds]
+ for s0, v0 in values:
+ for s1, v1 in values:
+ self.assertEqual(s0 == s1, np.all(v0 == v1))
+
if __name__ == '__main__':
test.main()
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index c4f880da9d..1c58aa3315 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -783,6 +783,7 @@ cc_library(
"//tensorflow/core/kernels:dataset_ops",
"//tensorflow/core/kernels:fake_quant_ops",
"//tensorflow/core/kernels:function_ops",
+ "//tensorflow/core/kernels:histogram_op",
"//tensorflow/core/kernels:image",
"//tensorflow/core/kernels:io",
"//tensorflow/core/kernels:linalg",
@@ -1943,6 +1944,7 @@ CORE_CPU_LIB_HEADERS = CORE_CPU_BASE_HDRS + [
tf_cuda_library(
name = "core_cpu_impl",
srcs = [
+ "common_runtime/accumulate_n_optimizer.cc",
"common_runtime/allocator_retry.cc",
"common_runtime/bfc_allocator.cc",
"common_runtime/build_graph_options.cc",
@@ -2178,6 +2180,7 @@ tf_cuda_library(
":lib",
":lib_internal",
":protos_all_cc",
+ ":stream_executor",
"//third_party/eigen3",
] + if_static([":gpu_runtime_impl"]),
)
@@ -2674,6 +2677,22 @@ tf_cc_tests(
)
tf_cc_test_mkl(
+ name = "mkl_runtime_tests",
+ size = "small",
+ srcs = ["common_runtime/mkl_cpu_allocator_test.cc"],
+ linkstatic = 1,
+ deps = [
+ ":core",
+ ":core_cpu",
+ ":framework",
+ ":framework_internal",
+ ":test",
+ ":test_main",
+ ":testlib",
+ ],
+)
+
+tf_cc_test_mkl(
name = "mkl_related_tests",
size = "small",
srcs = [
@@ -2700,7 +2719,20 @@ tf_cc_test_mkl(
"//tensorflow/cc:sendrecv_ops",
"//tensorflow/core/kernels:ops_util",
"//third_party/eigen3",
- ],
+ ] + if_mkl([
+ "//tensorflow/core/kernels:mkl_aggregate_ops",
+ "//tensorflow/core/kernels:mkl_concat_op",
+ "//tensorflow/core/kernels:mkl_conv_op",
+ "//tensorflow/core/kernels:mkl_cwise_ops_common",
+ "//tensorflow/core/kernels:mkl_fused_batch_norm_op",
+ "//tensorflow/core/kernels:mkl_identity_op",
+ "//tensorflow/core/kernels:mkl_input_conversion_op",
+ "//tensorflow/core/kernels:mkl_lrn_op",
+ "//tensorflow/core/kernels:mkl_pooling_ops",
+ "//tensorflow/core/kernels:mkl_relu_op",
+ "//tensorflow/core/kernels:mkl_reshape_op",
+ "//tensorflow/core/kernels:mkl_tfconv_op",
+ ]),
)
tf_cc_tests_gpu(
diff --git a/tensorflow/core/common_runtime/accumulate_n_optimizer.cc b/tensorflow/core/common_runtime/accumulate_n_optimizer.cc
new file mode 100644
index 0000000000..81cd44870e
--- /dev/null
+++ b/tensorflow/core/common_runtime/accumulate_n_optimizer.cc
@@ -0,0 +1,191 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+
+#include "tensorflow/core/common_runtime/optimization_registry.h"
+#include "tensorflow/core/graph/node_builder.h"
+
+
+namespace tensorflow {
+namespace {
+
+Tensor make_zeros(const DataType& dtype, const TensorShapeProto& shape) {
+ Tensor tensor(dtype, TensorShape(shape));
+
+ // Conveniently, all numeric data types have 0x0 == zero. Otherwise we would
+ // need a giant switch statement here.
+ memset(const_cast<char*>(tensor.tensor_data().data()), 0,
+ tensor.tensor_data().size());
+
+ return tensor;
+}
+
+// Replaces occurrences of the "AccumulateNV2" stub operator with a graph of
+// lower-level ops. The graph is equivalent (modulo certain corner cases)
+// to the semantics of the original accumulate_n() Python op in math_ops.py.
+// Implementing the op with a rewrite allows this new variant of accumulate_n
+// to be differentiable.
+//
+// The binary code that generates AccumulateNV2 stub ops is located in a
+// dynamic library built out of tensorflow/contrib/framework. Ideally, this
+// class would also be in contrib, but calls to REGISTER_OPTIMIZATION() from
+// third-party libraries aren't currently supported.
+class AccumulateNV2RemovePass : public GraphOptimizationPass {
+ public:
+
+ Status Run(const GraphOptimizationPassOptions& options) override {
+ // TODO(freiss.oss@gmail.com): Substantial shared code with
+ // ParallelConcatRemovePass::Run(). Consider refactoring if someone makes
+ // a third similar rewrite.
+ if (options.graph == nullptr) {
+ // TODO(apassos) returning OK feels weird here as we can't do anything
+ // without a graph, but some tests require this.
+ return Status::OK();
+ }
+
+ Graph* g = options.graph->get();
+ if (g == nullptr) {
+ return errors::Internal(
+ "AccumulateNV2 removal should happen before partitioning and a "
+ "graph should be available.");
+ }
+
+ // Build up a todo list of ops to replace, *then* modify the graph
+ gtl::InlinedVector<Node*, 2> matches;
+ for (Node* n : g->op_nodes()) {
+ if (n->type_string() == "AccumulateNV2") {
+ matches.push_back(n);
+ }
+ }
+ for (Node* n : matches) {
+ TF_RETURN_IF_ERROR(rewriteNode(n, g));
+ }
+ return Status::OK();
+ }
+
+ Status rewriteNode(Node* n, Graph* g) {
+ AttrSlice n_attrs = n->attrs();
+ auto base_make_node = [n, g, &n_attrs](const string& op,
+ const string& name) {
+ NodeBuilder node_builder(name, op);
+
+ // The pieces of AccumulateNV2 should all be on the same node.
+ node_builder.Device(n->requested_device());
+ string colo;
+ if (GetNodeAttr(n_attrs, kColocationAttrName, &colo).ok()) {
+ node_builder.Attr(kColocationAttrName, colo);
+ }
+ return node_builder;
+ };
+ auto make_node = [n, g, &n_attrs, &base_make_node](string op) {
+ return base_make_node(
+ op, g->NewName(strings::StrCat(n->name(), "/Internal")));
+ };
+
+ DataType dtype;
+ TF_RETURN_IF_ERROR(GetNodeAttr(n_attrs, "T", &dtype));
+ TensorShapeProto shape;
+ TF_RETURN_IF_ERROR(GetNodeAttr(n_attrs, "shape", &shape));
+
+ std::vector<const Edge*> data_edges, control_edges;
+ for (const Edge* input_edge : n->in_edges()) {
+ if (input_edge->IsControlEdge()) {
+ control_edges.push_back(input_edge);
+ } else {
+ data_edges.push_back(input_edge);
+ }
+ }
+
+ // Create the following ops to replace the AccumulateNV2 placeholder:
+ Node* create_accumulator = nullptr; // TemporaryVariable op
+ Node* initial_val = nullptr; // Const op
+ Node* initialize_accumulator = nullptr; // Assign op
+ std::vector<Node*> add_values_to_accumulator; // AssignAdd ops
+ Node* clean_up_accumulator = nullptr; // DestroyTemporaryVariable
+
+ const string accumulator_name =
+ strings::StrCat(n->name(), "/Internal/Accumulator");
+ TF_RETURN_IF_ERROR(make_node("TemporaryVariable")
+ .Attr("shape", shape)
+ .Attr("dtype", dtype)
+ .Attr("var_name", accumulator_name)
+ .Finalize(g, &create_accumulator));
+ TF_RETURN_IF_ERROR(make_node("Const")
+ .Attr("value", make_zeros(dtype, shape))
+ .Attr("dtype", dtype)
+ .Finalize(g, &initial_val));
+ TF_RETURN_IF_ERROR(make_node("Assign")
+ .Attr("T", dtype)
+ .Input(create_accumulator) // ref: Ref(T)
+ .Input(initial_val) // value: T
+ .Finalize(g, &initialize_accumulator));
+ for (int i = 0; i < data_edges.size(); ++i) {
+ Node* assignAdd;
+ TF_RETURN_IF_ERROR(make_node("AssignAdd")
+ .Attr("T", dtype)
+ .Attr("use_locking", true)
+ .Input(initialize_accumulator) // ref: Ref(T)
+ .Input(data_edges[i]->src(),
+ data_edges[i]->src_output()) // value: T
+ .Finalize(g, &assignAdd));
+
+ add_values_to_accumulator.push_back(assignAdd);
+ }
+
+ // Note that we use the original placeholder op's name here
+ TF_RETURN_IF_ERROR(base_make_node("DestroyTemporaryVariable", n->name())
+ .Attr("T", dtype)
+ .Attr("var_name", accumulator_name)
+ .Input(initialize_accumulator)
+ .Finalize(g, &clean_up_accumulator));
+
+ // Add edges to the graph to ensure that operations occur in the right
+ // order:
+ // 1. Do anything that had a control edge to the AccumulateNV2 placeholder
+ // 2. Initialize accumulator
+ // 3. Add input values to accumulator (already handled by data edges
+ // added above)
+ // 4. Reclaim the buffer that held the accumulator
+ // 5. Do anything that depended on the AccumulateNV2 placeholder
+ for (const Edge* control_edge : control_edges) {
+ g->AddControlEdge(control_edge->src(), initialize_accumulator);
+ }
+
+ for (Node* assign_add : add_values_to_accumulator) {
+ g->AddControlEdge(assign_add, clean_up_accumulator);
+ }
+
+ for (const Edge* out_edge : n->out_edges()) {
+ if (out_edge->IsControlEdge()) {
+ g->AddControlEdge(clean_up_accumulator, out_edge->dst());
+ } else {
+ g->AddEdge(clean_up_accumulator, 0, out_edge->dst(),
+ out_edge->dst_input());
+ }
+ }
+
+ // Remove the original AccumulateNV2 placeholder op.
+ // This removal modifies the op and must happen after we have finished
+ // using its incoming/outgoing edge sets.
+ g->RemoveNode(n);
+
+ return Status::OK();
+ }
+};
+REGISTER_OPTIMIZATION(OptimizationPassRegistry::PRE_PLACEMENT, 0,
+ AccumulateNV2RemovePass);
+
+} // namespace
+} // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.h b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
index f16da10d7a..53e80b1ee3 100644
--- a/tensorflow/core/common_runtime/mkl_cpu_allocator.h
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
@@ -21,9 +21,13 @@ limitations under the License.
#ifdef INTEL_MKL
+#include <unistd.h>
+#include <cstdlib>
#include <string>
#include "tensorflow/core/common_runtime/bfc_allocator.h"
#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/lib/strings/numbers.h"
+#include "tensorflow/core/lib/strings/str_util.h"
#include "tensorflow/core/platform/mem.h"
#include "i_malloc.h"
@@ -46,10 +50,50 @@ class MklCPUAllocator : public Allocator {
public:
// Constructor and other standard functions
- MklCPUAllocator() {
+  /// Environment variable that the user can set to specify an upper bound on memory allocation
+ static constexpr const char* kMaxLimitStr = "TF_MKL_ALLOC_MAX_BYTES";
+
+ /// Default upper limit on allocator size - 64GB
+ static const size_t kDefaultMaxLimit = 64LL << 30;
+
+ MklCPUAllocator() { TF_CHECK_OK(Initialize()); }
+
+ ~MklCPUAllocator() override { delete allocator_; }
+
+ Status Initialize() {
VLOG(2) << "MklCPUAllocator: In MklCPUAllocator";
- allocator_ =
- new BFCAllocator(new MklSubAllocator, kMaxMemSize, kAllowGrowth, kName);
+
+ // Set upper bound on memory allocation to physical RAM available on the
+ // CPU unless explicitly specified by user
+ uint64 max_mem_bytes = kDefaultMaxLimit;
+#if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
+ max_mem_bytes =
+ (uint64)sysconf(_SC_PHYS_PAGES) * (uint64)sysconf(_SC_PAGESIZE);
+#endif
+ char* user_mem_bytes = getenv(kMaxLimitStr);
+
+ if (user_mem_bytes != NULL) {
+ uint64 user_val = 0;
+ if (!strings::safe_strtou64(user_mem_bytes, &user_val)) {
+ return errors::InvalidArgument("Invalid memory limit (", user_mem_bytes,
+ ") specified for MKL allocator through ",
+ kMaxLimitStr);
+ }
+#if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
+ if (user_val > max_mem_bytes) {
+        LOG(WARNING) << "The user specified a memory limit " << kMaxLimitStr
+ << "=" << user_val
+ << " greater than available physical memory: "
+ << max_mem_bytes
+ << ". This could significantly reduce performance!";
+ }
+#endif
+ max_mem_bytes = user_val;
+ }
+
+ VLOG(1) << "MklCPUAllocator: Setting max_mem_bytes: " << max_mem_bytes;
+ allocator_ = new BFCAllocator(new MklSubAllocator, max_mem_bytes,
+ kAllowGrowth, kName);
// For redirecting all allocations from MKL to this allocator
// From: http://software.intel.com/en-us/node/528565
@@ -57,9 +101,9 @@ class MklCPUAllocator : public Allocator {
i_calloc = CallocHook;
i_realloc = ReallocHook;
i_free = FreeHook;
- }
- ~MklCPUAllocator() override { delete allocator_; }
+ return Status::OK();
+ }
inline string Name() override { return kName; }
@@ -71,6 +115,8 @@ class MklCPUAllocator : public Allocator {
allocator_->DeallocateRaw(ptr);
}
+ void GetStats(AllocatorStats* stats) { return allocator_->GetStats(stats); }
+
private:
// Hooks provided by this allocator for memory allocation routines from MKL
@@ -96,11 +142,6 @@ class MklCPUAllocator : public Allocator {
TF_CHECK_OK(s); // way to assert with an error message
}
- // TODO(jbobba): We should ideally move this into CPUOptions in config.proto.
- /// Memory limit - 64GB
- static const size_t kMaxMemSize =
- static_cast<size_t>(64) * 1024 * 1024 * 1024;
-
/// Do we allow growth in BFC Allocator
static const bool kAllowGrowth = true;
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator_test.cc b/tensorflow/core/common_runtime/mkl_cpu_allocator_test.cc
new file mode 100644
index 0000000000..a67411cd2e
--- /dev/null
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator_test.cc
@@ -0,0 +1,53 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef INTEL_MKL
+
+#include "tensorflow/core/common_runtime/mkl_cpu_allocator.h"
+
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+
+TEST(MKLBFCAllocatorTest, TestMaxLimit) {
+ AllocatorStats stats;
+ setenv(MklCPUAllocator::kMaxLimitStr, "1000", 1);
+ MklCPUAllocator a;
+ TF_EXPECT_OK(a.Initialize());
+ a.GetStats(&stats);
+ EXPECT_EQ(stats.bytes_limit, 1000);
+
+ unsetenv(MklCPUAllocator::kMaxLimitStr);
+ TF_EXPECT_OK(a.Initialize());
+ a.GetStats(&stats);
+ uint64 max_mem_bytes = MklCPUAllocator::kDefaultMaxLimit;
+#if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
+ max_mem_bytes =
+ (uint64)sysconf(_SC_PHYS_PAGES) * (uint64)sysconf(_SC_PAGESIZE);
+#endif
+ EXPECT_EQ(stats.bytes_limit, max_mem_bytes);
+
+ setenv(MklCPUAllocator::kMaxLimitStr, "wrong-input", 1);
+ EXPECT_TRUE(errors::IsInvalidArgument(a.Initialize()));
+
+ setenv(MklCPUAllocator::kMaxLimitStr, "-20", 1);
+ EXPECT_TRUE(errors::IsInvalidArgument(a.Initialize()));
+}
+
+} // namespace tensorflow
+
+#endif // INTEL_MKL
diff --git a/tensorflow/core/framework/common_shape_fns.cc b/tensorflow/core/framework/common_shape_fns.cc
index 4796c3c00a..315c99d32b 100644
--- a/tensorflow/core/framework/common_shape_fns.cc
+++ b/tensorflow/core/framework/common_shape_fns.cc
@@ -1020,6 +1020,29 @@ Status UnknownShape(shape_inference::InferenceContext* c) {
return Status::OK();
}
+template <typename T>
+Status ReductionShapeHelper(const Tensor* reduction_indices_t,
+ const int32 input_rank,
+ std::set<int64>& true_indices) {
+ auto reduction_indices = reduction_indices_t->flat<T>();
+ for (int i = 0; i < reduction_indices_t->NumElements(); ++i) {
+ const T reduction_index = reduction_indices(i);
+ if (reduction_index < -input_rank || reduction_index >= input_rank) {
+ return errors::InvalidArgument("Invalid reduction dimension ",
+ reduction_index, " for input with ",
+ input_rank, " dimensions.");
+ }
+
+ auto wrapped_index = reduction_index;
+ if (wrapped_index < 0) {
+ wrapped_index += input_rank;
+ }
+
+ true_indices.insert(wrapped_index);
+ }
+ return Status::OK();
+}
+
Status ReductionShape(InferenceContext* c) {
ShapeHandle input = c->input(0);
@@ -1050,22 +1073,16 @@ Status ReductionShape(InferenceContext* c) {
}
const int32 input_rank = c->Rank(input);
- std::set<int32> true_indices;
- auto reduction_indices = reduction_indices_t->flat<int32>();
- for (int i = 0; i < reduction_indices_t->NumElements(); ++i) {
- int32 reduction_index = reduction_indices(i);
- if (reduction_index < -input_rank || reduction_index >= input_rank) {
- return errors::InvalidArgument("Invalid reduction dimension ",
- reduction_index, " for input with ",
- input_rank, " dimensions.");
- }
-
- int32 wrapped_index = reduction_index;
- if (wrapped_index < 0) {
- wrapped_index += input_rank;
- }
-
- true_indices.insert(wrapped_index);
+ std::set<int64> true_indices;
+ if (reduction_indices_t->dtype() == DataType::DT_INT32) {
+ TF_RETURN_IF_ERROR(ReductionShapeHelper<int32>(reduction_indices_t,
+ input_rank, true_indices));
+ } else if (reduction_indices_t->dtype() == DataType::DT_INT64) {
+ TF_RETURN_IF_ERROR(ReductionShapeHelper<int64>(reduction_indices_t,
+ input_rank, true_indices));
+ } else {
+ return errors::InvalidArgument(
+ "reduction_indices can only be int32 or int64");
}
std::vector<DimensionHandle> dims;
@@ -1319,11 +1336,10 @@ Status ScatterNdUpdateShape(InferenceContext* c) {
Status s = c->Merge(prefix_indices, prefix_updates, &unused);
if (!s.ok()) {
return errors::InvalidArgument(
- "The outer ", num_outer_dims,
- " dimensions of indices.shape=", c->DebugString(indices_shape),
- " must match the outer ", num_outer_dims,
- " dimensions of updates.shape=", c->DebugString(updates_shape),
- ": ", s.error_message());
+ "The outer ", num_outer_dims, " dimensions of indices.shape=",
+ c->DebugString(indices_shape), " must match the outer ",
+ num_outer_dims, " dimensions of updates.shape=",
+ c->DebugString(updates_shape), ": ", s.error_message());
}
ShapeHandle input_suffix;
diff --git a/tensorflow/core/framework/node_def.proto b/tensorflow/core/framework/node_def.proto
index 53aa03108a..1fd2e50b51 100644
--- a/tensorflow/core/framework/node_def.proto
+++ b/tensorflow/core/framework/node_def.proto
@@ -35,7 +35,7 @@ message NodeDef {
// CONSTRAINT ::= ("job:" JOB_NAME)
// | ("replica:" [1-9][0-9]*)
// | ("task:" [1-9][0-9]*)
- // | ( ("gpu" | "cpu") ":" ([1-9][0-9]* | "*") )
+ // | ("device:" ("gpu" | "cpu") ":" ([1-9][0-9]* | "*") )
//
// Valid values for this string include:
// * "/job:worker/replica:0/task:1/device:GPU:3" (full specification)
diff --git a/tensorflow/core/framework/register_types.h b/tensorflow/core/framework/register_types.h
index 61e722e57b..c31ab18cc1 100644
--- a/tensorflow/core/framework/register_types.h
+++ b/tensorflow/core/framework/register_types.h
@@ -87,7 +87,7 @@ limitations under the License.
#elif defined(__ANDROID_TYPES_FULL__)
-// Only half, float, int32, int64, and quantized types are supported.
+// Only half, float, int32, int64, bool, and quantized types are supported.
#define TF_CALL_float(m) m(float)
#define TF_CALL_double(m)
#define TF_CALL_int32(m) m(::tensorflow::int32)
@@ -117,7 +117,7 @@ limitations under the License.
#else // defined(IS_MOBILE_PLATFORM) && !defined(__ANDROID_TYPES_FULL__)
-// Only float and int32 are supported.
+// Only float, int32, and bool are supported.
#define TF_CALL_float(m) m(float)
#define TF_CALL_double(m)
#define TF_CALL_int32(m) m(::tensorflow::int32)
diff --git a/tensorflow/core/framework/rendezvous.cc b/tensorflow/core/framework/rendezvous.cc
index 90426defa0..a9e4c1cfb1 100644
--- a/tensorflow/core/framework/rendezvous.cc
+++ b/tensorflow/core/framework/rendezvous.cc
@@ -210,7 +210,7 @@ class LocalRendezvousImpl : public Rendezvous {
ItemQueue* queue = &table_[key_hash];
if (queue->empty() || !queue->front()->IsSendValue()) {
// There is no message to pick up.
- // Only recv-related fileds need to be filled.
+ // Only recv-related fields need to be filled.
Item* item = new Item;
item->waiter = std::move(done);
item->recv_args = recv_args;
diff --git a/tensorflow/core/graph/graph.h b/tensorflow/core/graph/graph.h
index 7c7f641265..c5dde722fa 100644
--- a/tensorflow/core/graph/graph.h
+++ b/tensorflow/core/graph/graph.h
@@ -639,7 +639,7 @@ class Graph {
std::unordered_map<string, int> device_names_map_;
// All the while contexts owned by this graph, keyed by frame name,
- // corresonding to all the while loops contained in this graph (including
+ // corresponding to all the while loops contained in this graph (including
// nested loops). The stored contexts are usually accessed via
// AddWhileContext() or Node::while_ctx(), but this manages the lifetime.
std::map<string, WhileContext> while_ctxs_;
diff --git a/tensorflow/core/graph/mkl_layout_pass.cc b/tensorflow/core/graph/mkl_layout_pass.cc
index f87a94a76a..f4c9073dee 100644
--- a/tensorflow/core/graph/mkl_layout_pass.cc
+++ b/tensorflow/core/graph/mkl_layout_pass.cc
@@ -543,7 +543,7 @@ class MklLayoutRewritePass : public GraphOptimizationPass {
string reason;
// Substring that should be checked for in device name for CPU device.
- const char* const kCPUDeviceSubStr = "cpu";
+ const char* const kCPUDeviceSubStr = "CPU";
// If Op has been specifically assigned to a non-CPU device, then No.
if (!n->assigned_device_name().empty() &&
diff --git a/tensorflow/core/graph/mkl_layout_pass_test.cc b/tensorflow/core/graph/mkl_layout_pass_test.cc
index a2b2f6530d..abc63e4f35 100644
--- a/tensorflow/core/graph/mkl_layout_pass_test.cc
+++ b/tensorflow/core/graph/mkl_layout_pass_test.cc
@@ -39,7 +39,7 @@ limitations under the License.
namespace tensorflow {
namespace {
-const char kCPUDevice[] = "/job:a/replica:0/task:0/cpu:0";
+const char kCPUDevice[] = "/job:a/replica:0/task:0/device:CPU:0";
const char kGPUDevice[] = "/job:a/replica:0/task:0/device:GPU:0";
static void InitGraph(const string& s, Graph* graph,
diff --git a/tensorflow/core/graph/testlib.cc b/tensorflow/core/graph/testlib.cc
index be52438747..172471e34b 100644
--- a/tensorflow/core/graph/testlib.cc
+++ b/tensorflow/core/graph/testlib.cc
@@ -480,6 +480,24 @@ Node* Conv2D(Graph* g, Node* in0, Node* in1) {
return ret;
}
+Node* Diag(Graph* g, Node* in, DataType type) {
+ Node* ret;
+ TF_CHECK_OK(NodeBuilder(g->NewName("n"), "Diag")
+ .Input(in)
+ .Attr("T", type)
+ .Finalize(g, &ret));
+ return ret;
+}
+
+Node* DiagPart(Graph* g, Node* in, DataType type) {
+ Node* ret;
+ TF_CHECK_OK(NodeBuilder(g->NewName("n"), "DiagPart")
+ .Input(in)
+ .Attr("T", type)
+ .Finalize(g, &ret));
+ return ret;
+}
+
void ToGraphDef(Graph* g, GraphDef* gdef) { g->ToGraphDef(gdef); }
} // end namespace graph
diff --git a/tensorflow/core/graph/testlib.h b/tensorflow/core/graph/testlib.h
index a38809e6b4..06597778bb 100644
--- a/tensorflow/core/graph/testlib.h
+++ b/tensorflow/core/graph/testlib.h
@@ -199,6 +199,12 @@ Node* BiasAdd(Graph* g, Node* value, Node* bias);
// Add a Conv2D node in "g".
Node* Conv2D(Graph* g, Node* in0, Node* in1);
+// Add a Diag node in "g".
+Node* Diag(Graph* g, Node* in, DataType type);
+
+// Add a DiagPart node in "g".
+Node* DiagPart(Graph* g, Node* in, DataType type);
+
} // end namespace graph
} // end namespace test
} // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/model_pruner.cc b/tensorflow/core/grappler/optimizers/model_pruner.cc
index e087621c3b..b9df196f83 100644
--- a/tensorflow/core/grappler/optimizers/model_pruner.cc
+++ b/tensorflow/core/grappler/optimizers/model_pruner.cc
@@ -104,7 +104,7 @@ Status ModelPruner::Optimize(Cluster* cluster, const GrapplerItem& item,
// - Don't remove nodes that receive reference values, as those can be
// converting references to non-references. It is important to preserve
// these non-references since the partitioner will avoid sending
- // non-references accross partitions more than once.
+ // non-references across partitions more than once.
if (!rewriter.DrivesControlDependency(node) &&
!rewriter.IsDrivenByControlDependency(node) &&
!rewriter.IsConnectedToFunction(node) &&
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index f5bfa60199..92a0dbd0ab 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -2499,6 +2499,7 @@ cc_library(
":cross_op",
":cwise_op",
":fft_ops",
+ ":histogram_op",
":matmul_op",
":population_count_op",
":reduction_ops",
@@ -2635,6 +2636,24 @@ tf_kernel_library(
deps = MATH_DEPS,
)
+tf_cc_test(
+ name = "sequence_ops_test",
+ size = "small",
+ srcs = ["sequence_ops_test.cc"],
+ deps = [
+ ":ops_testutil",
+ ":ops_util",
+ ":sequence_ops",
+ "//tensorflow/core:core_cpu",
+ "//tensorflow/core:framework",
+ "//tensorflow/core:lib",
+ "//tensorflow/core:protos_all_cc",
+ "//tensorflow/core:test",
+ "//tensorflow/core:test_main",
+ "//tensorflow/core:testlib",
+ ],
+)
+
tf_cuda_cc_test(
name = "cast_op_test",
size = "small",
@@ -2893,6 +2912,24 @@ tf_cuda_cc_test(
],
)
+tf_cuda_cc_test(
+ name = "diag_op_test",
+ size = "small",
+ srcs = ["diag_op_test.cc"],
+ deps = [
+ ":diag_op",
+ ":ops_testutil",
+ ":ops_util",
+ "//tensorflow/core:core_cpu",
+ "//tensorflow/core:framework",
+ "//tensorflow/core:lib",
+ "//tensorflow/core:protos_all_cc",
+ "//tensorflow/core:test",
+ "//tensorflow/core:test_main",
+ "//tensorflow/core:testlib",
+ ],
+)
+
# conv_grad_ops currently has to be built with conv_ops*.
# TODO(josh11b, zhengxq): put these a separate libraries in ":nn" below once
# conv_ops_gpu.h has be separated into its own library.
@@ -2993,6 +3030,7 @@ cc_library(
":in_topk_op",
":l2loss_op",
":lrn_op",
+ ":nth_element_op",
":relu_op",
":softmax_op",
":softplus_op",
@@ -3080,6 +3118,12 @@ tf_kernel_library(
)
tf_kernel_library(
+ name = "nth_element_op",
+ prefix = "nth_element_op",
+ deps = NN_DEPS,
+)
+
+tf_kernel_library(
name = "xent_op",
prefix = "xent_op",
deps = NN_DEPS,
@@ -3097,6 +3141,17 @@ tf_kernel_library(
)
tf_kernel_library(
+ name = "histogram_op",
+ prefix = "histogram_op",
+ deps = [
+ "//tensorflow/core:framework",
+ "//tensorflow/core:lib",
+ "//tensorflow/core:lib_internal",
+ "//third_party/eigen3",
+ ] + if_cuda(["@cub_archive//:cub"]),
+)
+
+tf_kernel_library(
name = "l2loss_op",
prefix = "l2loss_op",
deps = [
diff --git a/tensorflow/core/kernels/batchtospace_op.cc b/tensorflow/core/kernels/batchtospace_op.cc
index 99b5d3daaa..c1c0d6d329 100644
--- a/tensorflow/core/kernels/batchtospace_op.cc
+++ b/tensorflow/core/kernels/batchtospace_op.cc
@@ -249,40 +249,34 @@ class BatchToSpaceOp : public OpKernel {
Tensor block_shape_;
};
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("BatchToSpaceND") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tblock_shape") \
- .TypeConstraint<int32>("Tcrops") \
- .HostMemory("block_shape") \
- .HostMemory("crops"), \
- BatchToSpaceNDOp<CPUDevice, T>); \
- REGISTER_KERNEL_BUILDER(Name("BatchToSpace") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("crops"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("BatchToSpaceND") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("block_shape") \
+ .HostMemory("crops"), \
+ BatchToSpaceNDOp<CPUDevice, T>); \
+ REGISTER_KERNEL_BUILDER(Name("BatchToSpace") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("crops"), \
BatchToSpaceOp<CPUDevice, T>);
TF_CALL_REAL_NUMBER_TYPES(REGISTER);
#undef REGISTER
#if GOOGLE_CUDA
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("BatchToSpaceND") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tblock_shape") \
- .TypeConstraint<int32>("Tcrops") \
- .HostMemory("block_shape") \
- .HostMemory("crops"), \
- BatchToSpaceNDOp<GPUDevice, T>); \
- REGISTER_KERNEL_BUILDER(Name("BatchToSpace") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("crops"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("BatchToSpaceND") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("block_shape") \
+ .HostMemory("crops"), \
+ BatchToSpaceNDOp<GPUDevice, T>); \
+ REGISTER_KERNEL_BUILDER(Name("BatchToSpace") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("crops"), \
BatchToSpaceOp<GPUDevice, T>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER);
diff --git a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
index 6e10b53cf7..9a00a091bd 100644
--- a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
+++ b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
@@ -394,7 +394,7 @@ __global__ void SwapDimension1And2InTensor3SmallDim(const T* input,
int output_block_idx = SmallDim2 ? block_offset : block_offset * small_dim;
int output_block_origin_idx = output_block_offset + output_block_idx;
- // Store the tranposed memory region in shared memory to device.
+ // Store the transposed memory region in shared memory to device.
if (x < tile_height) {
for (int y = 0; y < small_dim; y++) {
int output_idx = output_block_origin_idx + x +
diff --git a/tensorflow/core/kernels/crop_and_resize_op_test.cc b/tensorflow/core/kernels/crop_and_resize_op_test.cc
index 22c659b587..a35e1b0788 100644
--- a/tensorflow/core/kernels/crop_and_resize_op_test.cc
+++ b/tensorflow/core/kernels/crop_and_resize_op_test.cc
@@ -61,8 +61,12 @@ class CropAndResizeOpTest : public OpsTestBase {
REGISTER_TEST(float)
REGISTER_TEST(double)
-REGISTER_TEST(int8)
REGISTER_TEST(uint8)
+REGISTER_TEST(uint16)
+REGISTER_TEST(int8)
+REGISTER_TEST(int16)
+REGISTER_TEST(int32)
+REGISTER_TEST(int64)
#undef REGISTER_TEST
diff --git a/tensorflow/core/kernels/dataset.h b/tensorflow/core/kernels/dataset.h
index a906113466..a431889409 100644
--- a/tensorflow/core/kernels/dataset.h
+++ b/tensorflow/core/kernels/dataset.h
@@ -412,7 +412,7 @@ class DatasetIterator : public IteratorBase {
// Owns one reference on the shared dataset resource.
const DatasetType* dataset;
- // Identifies the sequence of iterators leading up to to this iterator.
+ // Identifies the sequence of iterators leading up to this iterator.
const string prefix;
};
diff --git a/tensorflow/core/kernels/diag_op.cc b/tensorflow/core/kernels/diag_op.cc
index c800859d90..be862b82f1 100644
--- a/tensorflow/core/kernels/diag_op.cc
+++ b/tensorflow/core/kernels/diag_op.cc
@@ -14,65 +14,32 @@ limitations under the License.
==============================================================================*/
// See docs in ../ops/array_ops.cc
+
+#define EIGEN_USE_THREADS
+
+#if GOOGLE_CUDA
+#define EIGEN_USE_GPU
+#endif // GOOGLE_CUDA
+
+#include "tensorflow/core/kernels/diag_op.h"
+
+#include <algorithm>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/platform/types.h"
#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/util/work_sharder.h"
namespace tensorflow {
-namespace {
-template <typename T, size_t NumDims, size_t DoubleNumDims>
-class DiagonalGenerator {
- public:
- explicit DiagonalGenerator(const Tensor& diagonal) : diagonal_(diagonal) {
- static_assert(DoubleNumDims == 2 * NumDims,
- "The second size must be the double of the first size.");
- CHECK_EQ(diagonal.dims(), NumDims);
- }
- T operator()(
- const Eigen::array<Eigen::DenseIndex, DoubleNumDims>& coordinates) const {
- Eigen::array<Eigen::DenseIndex, NumDims> index;
- for (size_t i = 0; i < NumDims; ++i) {
- if (coordinates[i] != coordinates[NumDims + i]) {
- return T(0);
- }
- index[i] = coordinates[i];
- }
- return diagonal_.tensor<T, NumDims>()(index);
- }
- private:
- Tensor diagonal_;
-};
-
-template <typename T, size_t NumDims>
-class DiagonalExtractor {
- public:
- explicit DiagonalExtractor(const Tensor& tensor) : tensor_(tensor) {
- CHECK_EQ(tensor.dims(), 2 * NumDims);
- }
- T operator()(const Eigen::array<Eigen::Index, NumDims>& coordinates) const {
- Eigen::array<Eigen::Index, 2 * NumDims> index;
- for (size_t j = 0; j < NumDims; ++j){
- index[j] = coordinates[j];
- }
- for (size_t j = NumDims; j < 2 * NumDims; ++j){
- index[j] = index[j - NumDims];
- }
- return tensor_.tensor<T, 2 * NumDims>()(index);
- }
-
- private:
- Tensor tensor_;
-};
-
-} // namespace
+typedef Eigen::ThreadPoolDevice CPUDevice;
+typedef Eigen::GpuDevice GPUDevice;
// Generate the diagonal tensor with the diagonal set to the input tensor.
-// It only allows up to rank 3 input tensor, so the output tensor is up to
-// rank 6.
-template <typename T>
+template <typename Device, typename T>
class DiagOp : public OpKernel {
public:
explicit DiagOp(OpKernelConstruction* context) : OpKernel(context) {}
@@ -80,9 +47,8 @@ class DiagOp : public OpKernel {
void Compute(OpKernelContext* context) override {
const Tensor& diagonal = context->input(0);
const int num_dims = diagonal.dims();
- OP_REQUIRES(context, 1 <= num_dims && num_dims <= 3,
- errors::InvalidArgument("Expected 1 <= dims <= 3, got shape ",
- diagonal.shape().DebugString()));
+ OP_REQUIRES(context, 0 != num_dims, errors::InvalidArgument(
+ "Input must be at least rank 1, got 0"));
TensorShape out_shape;
for (int i = 0; i < num_dims; ++i) {
out_shape.AddDim(diagonal.dim_size(i));
@@ -93,45 +59,17 @@ class DiagOp : public OpKernel {
Tensor* output_tensor = nullptr;
OP_REQUIRES_OK(context,
context->allocate_output(0, out_shape, &output_tensor));
- switch (num_dims) {
- case 1:
- output_tensor->tensor<T, 2>() = output_tensor->tensor<T, 2>().generate(
- DiagonalGenerator<T, 1, 2>(diagonal));
- break;
- case 2:
- output_tensor->tensor<T, 4>() = output_tensor->tensor<T, 4>().generate(
- DiagonalGenerator<T, 2, 4>(diagonal));
- break;
- case 3:
- output_tensor->tensor<T, 6>() = output_tensor->tensor<T, 6>().generate(
- DiagonalGenerator<T, 3, 6>(diagonal));
- break;
- default:
- context->SetStatus(errors::Unimplemented(
- "Diagonal of rank ", num_dims, " tensor is not supported yet."));
- return;
- }
+ functor::DiagFunctor<Device, T> diagFunc;
+ Status s = diagFunc(context,
+ diagonal.NumElements(),
+ diagonal.flat<T>().data(),
+ output_tensor->flat<T>().data());
+ OP_REQUIRES_OK(context, s);
}
};
-#define REGISTER_DIAGOP(T) \
- REGISTER_KERNEL_BUILDER( \
- Name("Diag").Device(DEVICE_CPU).TypeConstraint<T>("T"), DiagOp<T>)
-
-REGISTER_DIAGOP(double);
-REGISTER_DIAGOP(float);
-REGISTER_DIAGOP(int32);
-REGISTER_DIAGOP(int64);
-REGISTER_DIAGOP(complex64);
-REGISTER_DIAGOP(complex128);
-
-#undef REGISTER_DIAGOP
-
-
-// Generate the diagonal tensor with the diagonal set to the input tensor.
-// It only allows rank 2, 4, or 6 input tensor, so the output tensor is
-// rank 1, 2, or 3.
-template <typename T>
+// Extract the diagonal elements of the input tensor into the output.
+template <typename Device, typename T>
class DiagPartOp : public OpKernel {
public:
explicit DiagPartOp(OpKernelConstruction* context) : OpKernel(context) {}
@@ -140,9 +78,9 @@ class DiagPartOp : public OpKernel {
const Tensor& tensor = context->input(0);
const int num_dims = tensor.dims();
const int out_dims = num_dims / 2;
- OP_REQUIRES(context, 2 == num_dims || 4 == num_dims || 6 == num_dims,
- errors::InvalidArgument("The rank of the tensor should be 2, \
- 4, or 6, got shape ",
+ OP_REQUIRES(context, 0 == num_dims % 2,
+ errors::InvalidArgument("The rank of the tensor should be \
+ even and positive, got shape ",
tensor.shape().DebugString()));
for (int i = 0; i < out_dims; i++){
OP_REQUIRES(context, tensor.dim_size(i) == tensor.dim_size(i + out_dims),
@@ -160,39 +98,158 @@ class DiagPartOp : public OpKernel {
Tensor* output = nullptr;
OP_REQUIRES_OK(context,
context->allocate_output(0, out_shape, &output));
+ functor::DiagPartFunctor<Device, T> diagPartFunc;
+ Status s = diagPartFunc(context,
+ out_shape.num_elements(),
+ tensor.flat<T>().data(),
+ output->flat<T>().data());
+ OP_REQUIRES_OK(context, s);
+ }
+};
- switch (num_dims) {
- case 2:
- output->tensor<T, 1>() = output->tensor<T, 1>().generate(
- DiagonalExtractor<T, 1>(tensor));
- break;
- case 4:
- output->tensor<T, 2>() = output->tensor<T, 2>().generate(
- DiagonalExtractor<T, 2>(tensor));
- break;
- case 6:
- output->tensor<T, 3>() = output->tensor<T, 3>().generate(
- DiagonalExtractor<T, 3>(tensor));
- break;
- default:
- context->SetStatus(errors::Unimplemented(
- "Diagonal of rank ", num_dims, " tensor is not supported yet."));
- return;
- }
+// Implementation of the functor specialization for CPU.
+//
+// According to the diagonal definition,
+// `output[i1,..., ik, i1,..., ik] = input[i1,..., ik]`,
+//
+// Let the shape of the input be [s1,..., sk]; then any offset into the
+// input's buffer can be represented by a coordinate [i1,..., ik],
+// where `index = i1*(s2*...*sk) + i2*(s3*...*sk) +... + ik`
+//
+// Let new_index be the offset into the output's buffer at coordinate
+// [i1,..., ik, i1,..., ik]; then we have
+// `new_index = i1*(s2*...sk*s1*...*sk) + i2*(s3*...*sk*s1*...*sk) +... + \
+// ik*(s1*...*sk) + i1*(s2*...*sk) + i2*(s3*...*sk) +... + ik
+// = (i1*(s2*...*sk) + i2*(s3*...*sk) +... + ik) * (1 + s1*...*sk)
+// = index * (1 + s1*...*sk)
+//
+// Let `size = s1*...*sk`, we finally have `new_index = index * (1 + size)`,
+// which is the transfer function we use below.
+// This trick makes the implementation clear and easy to parallelize.
+namespace functor {
+template <typename T>
+struct DiagFunctor<CPUDevice, T> {
+ EIGEN_ALWAYS_INLINE Status
+ operator() (OpKernelContext* context, const int64 size,
+ const T* in, T* out) {
+ // Each shard is responsible for writing values in the index range
+ // [start*size, limit*size)
+ auto subDiag = [in, out, size](int64 start, int64 limit) {
+ std::fill(out + size * start, out + size * limit, T());
+ for (int64 index = start; index < limit; ++index) {
+ out[(1 + size) * index] = in[index];
+ }
+ };
+
+ // Here, 5 is an empirical factor of cost_per_unit.
+ auto worker_threads = *(context->device()->tensorflow_cpu_worker_threads());
+ Shard(worker_threads.num_threads, worker_threads.workers, size,
+ 5 * size, subDiag);
+ return Status::OK();
+ }
+};
+
+template <typename T>
+struct DiagPartFunctor<CPUDevice, T> {
+ EIGEN_ALWAYS_INLINE Status
+ operator() (OpKernelContext* context, const int64 size,
+ const T* in, T* out) {
+ // Each shard is responsible for extracting values in the index range
+ // [start, limit)
+ auto subDiagPart = [in, out, size](int64 start, int64 limit) {
+ for (int64 index = start; index < limit; ++index) {
+ out[index] = in[(1 + size) * index];
+ }
+ };
+
+ // Here, 5 is an empirical factor of cost_per_unit.
+ auto worker_threads = *(context->device()->tensorflow_cpu_worker_threads());
+ Shard(worker_threads.num_threads, worker_threads.workers, size,
+ 5, subDiagPart);
+ return Status::OK();
}
};
+} // namespace functor
-#define REGISTER_DIAGPARTOP(T) \
- REGISTER_KERNEL_BUILDER( \
- Name("DiagPart").Device(DEVICE_CPU).TypeConstraint<T>("T"), DiagPartOp<T>)
-REGISTER_DIAGPARTOP(double);
-REGISTER_DIAGPARTOP(float);
-REGISTER_DIAGPARTOP(int32);
-REGISTER_DIAGPARTOP(int64);
-REGISTER_DIAGPARTOP(complex64);
-REGISTER_DIAGPARTOP(complex128);
+// Register the CPU kernels.
+#define REGISTER_DIAGOP(T) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Diag").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
+ DiagOp<CPUDevice, T>)
+TF_CALL_double(REGISTER_DIAGOP);
+TF_CALL_float(REGISTER_DIAGOP);
+TF_CALL_int32(REGISTER_DIAGOP);
+TF_CALL_int64(REGISTER_DIAGOP);
+TF_CALL_complex64(REGISTER_DIAGOP);
+TF_CALL_complex128(REGISTER_DIAGOP);
+#undef REGISTER_DIAGOP
+
+#define REGISTER_DIAGPARTOP(T) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("DiagPart").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
+ DiagPartOp<CPUDevice, T>)
+
+TF_CALL_double(REGISTER_DIAGPARTOP);
+TF_CALL_float(REGISTER_DIAGPARTOP);
+TF_CALL_int32(REGISTER_DIAGPARTOP);
+TF_CALL_int64(REGISTER_DIAGPARTOP);
+TF_CALL_complex64(REGISTER_DIAGPARTOP);
+TF_CALL_complex128(REGISTER_DIAGPARTOP);
#undef REGISTER_DIAGPARTOP
-
+
+// Register the GPU kernels.
+#ifdef GOOGLE_CUDA
+
+// Forward declarations of the functor specializations for GPU.
+namespace functor {
+extern template struct DiagFunctor<GPUDevice, double>;
+extern template struct DiagFunctor<GPUDevice, float>;
+extern template struct DiagFunctor<GPUDevice, int32>;
+extern template struct DiagFunctor<GPUDevice, int64>;
+extern template struct DiagFunctor<GPUDevice, complex64>;
+extern template struct DiagFunctor<GPUDevice, complex128>;
+} // namespace functor
+
+#define REGISTER_DIAGOP_GPU(T) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Diag").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
+ DiagOp<GPUDevice, T>)
+
+TF_CALL_double(REGISTER_DIAGOP_GPU);
+TF_CALL_float(REGISTER_DIAGOP_GPU);
+TF_CALL_int32(REGISTER_DIAGOP_GPU);
+TF_CALL_int64(REGISTER_DIAGOP_GPU);
+TF_CALL_complex64(REGISTER_DIAGOP_GPU);
+TF_CALL_complex128(REGISTER_DIAGOP_GPU);
+#undef REGISTER_DIAGOP_GPU
+
+// Forward declarations of the functor specializations for GPU.
+namespace functor {
+extern template struct DiagPartFunctor<GPUDevice, double>;
+extern template struct DiagPartFunctor<GPUDevice, float>;
+extern template struct DiagPartFunctor<GPUDevice, int32>;
+extern template struct DiagPartFunctor<GPUDevice, int64>;
+extern template struct DiagPartFunctor<GPUDevice, complex64>;
+extern template struct DiagPartFunctor<GPUDevice, complex128>;
+} // namespace functor
+
+#define REGISTER_DIAGPARTOP_GPU(T) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("DiagPart").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
+ DiagPartOp<GPUDevice, T>)
+
+TF_CALL_double(REGISTER_DIAGPARTOP_GPU);
+TF_CALL_float(REGISTER_DIAGPARTOP_GPU);
+TF_CALL_int32(REGISTER_DIAGPARTOP_GPU);
+TF_CALL_int64(REGISTER_DIAGPARTOP_GPU);
+TF_CALL_complex64(REGISTER_DIAGPARTOP_GPU);
+TF_CALL_complex128(REGISTER_DIAGPARTOP_GPU);
+#undef REGISTER_DIAGPARTOP_GPU
+
+#endif // GOOGLE_CUDA
+
+
} // namespace tensorflow
+
diff --git a/tensorflow/core/kernels/diag_op.h b/tensorflow/core/kernels/diag_op.h
new file mode 100644
index 0000000000..c6ca6a2047
--- /dev/null
+++ b/tensorflow/core/kernels/diag_op.h
@@ -0,0 +1,43 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_KERNELS_DIAG_OP_H_
+#define TENSORFLOW_CORE_KERNELS_DIAG_OP_H_
+
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+namespace functor {
+
+template <typename Device, typename T>
+struct DiagFunctor {
+ Status operator() (OpKernelContext* context, const int64 size,
+ const T* in, T* out);
+};
+
+template <typename Device, typename T>
+struct DiagPartFunctor {
+ Status operator() (OpKernelContext* context, const int64 size,
+ const T* in, T* out);
+};
+
+} // namespace functor
+
+} // namespace tensorflow
+
+#endif // TENSORFLOW_CORE_KERNELS_DIAG_OP_H_
diff --git a/tensorflow/core/kernels/diag_op_gpu.cu.cc b/tensorflow/core/kernels/diag_op_gpu.cu.cc
new file mode 100644
index 0000000000..684f00ea61
--- /dev/null
+++ b/tensorflow/core/kernels/diag_op_gpu.cu.cc
@@ -0,0 +1,139 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA
+
+#define EIGEN_USE_GPU
+
+#include <complex>
+#include "tensorflow/core/framework/register_types.h"
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+#include "tensorflow/core/kernels/diag_op.h"
+
+namespace tensorflow {
+namespace functor {
+
+typedef Eigen::GpuDevice GPUDevice;
+
+template <typename T>
+__global__ void DiagCudaKernel(const int num_threads,
+ const int64 size,
+ const T* in,
+ T* out) {
+ CUDA_1D_KERNEL_LOOP(index, num_threads) {
+ // Fill the diagonal elements; set all other entries to zero.
+ if (index % (1 + size) == 0) {
+ out[index] = in[index / (1 + size)];
+ } else {
+ out[index] = T(0);
+ }
+ }
+}
+
+template <typename T>
+struct DiagFunctor<GPUDevice, T> {
+ EIGEN_ALWAYS_INLINE Status
+ operator() (OpKernelContext* context, const int64 size,
+ const T* in, T* out) {
+ // An empty tensor cannot launch the kernel; return early.
+ if (size == 0) {
+ return Status::OK();
+ }
+
+ // CudaLaunchConfig uses an int for virtual_thread_count,
+ // so `size*size` may overflow in extreme cases;
+ // check for integer multiplication overflow here.
+ if (size && (int(size * size) / size) != size) {
+ return errors::Internal(
+ "DiagOp got input size too large.");
+ }
+ int virtual_thread_count = int(size * size);
+
+ // Launch the GPU kernel.
+ const GPUDevice& device = context->eigen_device<GPUDevice>();
+ CudaLaunchConfig diag_config = GetCudaLaunchConfig(
+ virtual_thread_count, device);
+ DiagCudaKernel<<<diag_config.block_count,
+ diag_config.thread_per_block,
+ 0, device.stream()>>>(
+ diag_config.virtual_thread_count, size, in, out);
+
+ auto err = cudaGetLastError();
+ if (err != cudaSuccess) {
+ return errors::Internal(
+ "Could not launch DiagOp kernel: ",
+ cudaGetErrorString(err), ".");
+ }
+ return Status::OK();
+ }
+};
+
+template struct DiagFunctor<GPUDevice, double>;
+template struct DiagFunctor<GPUDevice, float>;
+template struct DiagFunctor<GPUDevice, int32>;
+template struct DiagFunctor<GPUDevice, int64>;
+template struct DiagFunctor<GPUDevice, complex64>;
+template struct DiagFunctor<GPUDevice, complex128>;
+
+
+template <typename T>
+__global__ void DiagPartCudaKernel(const int num_threads,
+ const int64 size,
+ const T* in,
+ T* out) {
+ CUDA_1D_KERNEL_LOOP(index, num_threads) {
+ out[index] = in[(1 + size) * index];
+ }
+}
+
+template <typename T>
+struct DiagPartFunctor<GPUDevice, T> {
+ EIGEN_ALWAYS_INLINE Status
+ operator() (OpKernelContext* context, const int64 size,
+ const T* in, T* out) {
+ // An empty tensor cannot launch the kernel; return early.
+ if (size == 0) {
+ return Status::OK();
+ }
+ const GPUDevice& device = context->eigen_device<GPUDevice>();
+
+ // Extract the diagonal elements.
+ CudaLaunchConfig diag_config = GetCudaLaunchConfig(size, device);
+ DiagPartCudaKernel<<<diag_config.block_count,
+ diag_config.thread_per_block,
+ 0, device.stream()>>>(
+ diag_config.virtual_thread_count, size, in, out);
+
+ auto err = cudaGetLastError();
+ if (err != cudaSuccess) {
+ return errors::Internal(
+ "Could not launch DiagPartOp kernel: ",
+ cudaGetErrorString(err), ".");
+ }
+ return Status::OK();
+ }
+};
+
+template struct DiagPartFunctor<GPUDevice, double>;
+template struct DiagPartFunctor<GPUDevice, float>;
+template struct DiagPartFunctor<GPUDevice, int32>;
+template struct DiagPartFunctor<GPUDevice, int64>;
+template struct DiagPartFunctor<GPUDevice, complex64>;
+template struct DiagPartFunctor<GPUDevice, complex128>;
+
+} // end namespace functor
+} // end namespace tensorflow
+
+#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/diag_op_test.cc b/tensorflow/core/kernels/diag_op_test.cc
new file mode 100644
index 0000000000..2d1417854c
--- /dev/null
+++ b/tensorflow/core/kernels/diag_op_test.cc
@@ -0,0 +1,54 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
+
+namespace tensorflow {
+
+template <typename T>
+static Graph* Diag(int n, DataType type) {
+ Graph* g = new Graph(OpRegistry::Global());
+ Tensor in(type, TensorShape({n}));
+ in.flat<T>().setRandom();
+ Node* out = test::graph::Diag(g, test::graph::Constant(g, in), type);
+ test::graph::DiagPart(g, out, type);
+ return g;
+}
+
+#define BM_DiagDev(N, T, TFTYPE, DEVICE) \
+ static void BM_Diag##_##N##_##TFTYPE##_##DEVICE(int iters) { \
+ testing::UseRealTime(); \
+ testing::ItemsProcessed(static_cast<int64>(iters) * N * N); \
+ test::Benchmark(#DEVICE, Diag<T>(N, TFTYPE)).Run(iters); \
+ } \
+ BENCHMARK(BM_Diag##_##N##_##TFTYPE##_##DEVICE);
+
+#define BM_Diag(N) \
+ BM_DiagDev(N, int, DT_INT32, cpu); \
+ BM_DiagDev(N, float, DT_FLOAT, cpu); \
+ BM_DiagDev(N, std::complex<float>, DT_COMPLEX64, cpu); \
+ BM_DiagDev(N, int, DT_INT32, gpu); \
+ BM_DiagDev(N, float, DT_FLOAT, gpu); \
+ BM_DiagDev(N, std::complex<float>, DT_COMPLEX64, gpu);
+
+BM_Diag(16);
+BM_Diag(128);
+BM_Diag(512);
+
+} // end namespace tensorflow
+
diff --git a/tensorflow/core/kernels/histogram_op.cc b/tensorflow/core/kernels/histogram_op.cc
new file mode 100644
index 0000000000..4e035286f6
--- /dev/null
+++ b/tensorflow/core/kernels/histogram_op.cc
@@ -0,0 +1,147 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// See docs in ../ops/math_ops.cc.
+
+#define EIGEN_USE_THREADS
+
+#include "tensorflow/core/kernels/histogram_op.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/register_types.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/lib/core/threadpool.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+typedef Eigen::ThreadPoolDevice CPUDevice;
+typedef Eigen::GpuDevice GPUDevice;
+
+namespace functor {
+
+template <typename T, typename Tout>
+struct HistogramFixedWidthFunctor<CPUDevice, T, Tout> {
+ static Status Compute(OpKernelContext* context,
+ const typename TTypes<T, 1>::ConstTensor& values,
+ const typename TTypes<T, 1>::ConstTensor& value_range,
+ int32 nbins, typename TTypes<Tout, 1>::Tensor& out) {
+ const CPUDevice& d = context->eigen_device<CPUDevice>();
+
+ Tensor index_to_bin_tensor;
+
+ TF_RETURN_IF_ERROR(context->forward_input_or_allocate_temp(
+ {0}, DataTypeToEnum<int32>::value, TensorShape({values.size()}),
+ &index_to_bin_tensor));
+ auto index_to_bin = index_to_bin_tensor.flat<int32>();
+
+ const double step = static_cast<double>(value_range(1) - value_range(0)) /
+ static_cast<double>(nbins);
+
+    // Map each entry of `values` to its bin index.
+    // For a value range [a, b]:
+    //   step = (b - a) / nbins
+    //   bin(x) = (x - a) / step, clamped to [0, nbins - 1]
+    // and then count the number of entries that fall into each bin.
+ index_to_bin.device(d) =
+ ((values.cwiseMax(value_range(0)) - values.constant(value_range(0)))
+ .template cast<double>() /
+ step)
+ .template cast<int32>()
+ .cwiseMin(nbins - 1);
+
+ out.setZero();
+ for (int32 i = 0; i < index_to_bin.size(); i++) {
+ out(index_to_bin(i)) += Tout(1);
+ }
+ return Status::OK();
+ }
+};
+
+} // namespace functor
+
+template <typename Device, typename T, typename Tout>
+class HistogramFixedWidthOp : public OpKernel {
+ public:
+ explicit HistogramFixedWidthOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ void Compute(OpKernelContext* ctx) override {
+ const Tensor& values_tensor = ctx->input(0);
+ const Tensor& value_range_tensor = ctx->input(1);
+ const Tensor& nbins_tensor = ctx->input(2);
+
+ OP_REQUIRES(ctx, TensorShapeUtils::IsVector(value_range_tensor.shape()),
+ errors::InvalidArgument("value_range should be a vector."));
+ OP_REQUIRES(ctx, (value_range_tensor.shape().num_elements() == 2),
+ errors::InvalidArgument(
+ "value_range should be a vector of 2 elements."));
+ OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(nbins_tensor.shape()),
+ errors::InvalidArgument("nbins should be a scalar."));
+
+ const auto values = values_tensor.flat<T>();
+ const auto value_range = value_range_tensor.flat<T>();
+ const auto nbins = nbins_tensor.scalar<int32>()();
+
+ OP_REQUIRES(
+ ctx, (value_range(0) < value_range(1)),
+ errors::InvalidArgument("value_range should satisfy value_range[0] < "
+ "value_range[1], but got '[",
+ value_range(0), ", ", value_range(1), "]'"));
+ OP_REQUIRES(
+ ctx, (nbins > 0),
+ errors::InvalidArgument("nbins should be a positive number, but got '",
+ nbins, "'"));
+
+ Tensor* out_tensor;
+ OP_REQUIRES_OK(ctx,
+ ctx->allocate_output(0, TensorShape({nbins}), &out_tensor));
+ auto out = out_tensor->flat<Tout>();
+
+ OP_REQUIRES_OK(
+ ctx, functor::HistogramFixedWidthFunctor<Device, T, Tout>::Compute(
+ ctx, values, value_range, nbins, out));
+ }
+};
+
+#define REGISTER_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("HistogramFixedWidth") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("dtype"), \
+ HistogramFixedWidthOp<CPUDevice, type, int32>) \
+ REGISTER_KERNEL_BUILDER(Name("HistogramFixedWidth") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("dtype"), \
+ HistogramFixedWidthOp<CPUDevice, type, int64>)
+
+TF_CALL_REAL_NUMBER_TYPES(REGISTER_KERNELS);
+#undef REGISTER_KERNELS
+
+#if GOOGLE_CUDA
+#define REGISTER_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("HistogramFixedWidth") \
+ .Device(DEVICE_GPU) \
+ .HostMemory("value_range") \
+ .HostMemory("nbins") \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("dtype"), \
+ HistogramFixedWidthOp<GPUDevice, type, int32>)
+
+TF_CALL_GPU_NUMBER_TYPES(REGISTER_KERNELS);
+#undef REGISTER_KERNELS
+
+#endif // GOOGLE_CUDA
+
+} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/histogram_op.h b/tensorflow/core/kernels/histogram_op.h
new file mode 100644
index 0000000000..1b253f7fed
--- /dev/null
+++ b/tensorflow/core/kernels/histogram_op.h
@@ -0,0 +1,38 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_HISTOGRAM_OP_H_
+#define TENSORFLOW_HISTOGRAM_OP_H_
+
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/lib/core/errors.h"
+
+namespace tensorflow {
+namespace functor {
+
+template <typename Device, typename T, typename Tout>
+struct HistogramFixedWidthFunctor {
+ static Status Compute(OpKernelContext* context,
+ const typename TTypes<T, 1>::ConstTensor& values,
+ const typename TTypes<T, 1>::ConstTensor& value_range,
+ int32 nbins, typename TTypes<Tout, 1>::Tensor& out);
+};
+
+} // end namespace functor
+} // end namespace tensorflow
+
+#endif // TENSORFLOW_HISTOGRAM_OP_H_
diff --git a/tensorflow/core/kernels/histogram_op_gpu.cu.cc b/tensorflow/core/kernels/histogram_op_gpu.cu.cc
new file mode 100644
index 0000000000..c2bb958be8
--- /dev/null
+++ b/tensorflow/core/kernels/histogram_op_gpu.cu.cc
@@ -0,0 +1,125 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA
+
+#define EIGEN_USE_GPU
+
+#include "tensorflow/core/kernels/histogram_op.h"
+#include "external/cub_archive/cub/device/device_histogram.cuh"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/register_types.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/types.h"
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+
+namespace tensorflow {
+
+typedef Eigen::GpuDevice GPUDevice;
+
+namespace functor {
+
+// TODO(yongtang): int64 atomicAdd is not yet supported.
+template <typename T, typename Tout>
+struct HistogramFixedWidthFunctor<GPUDevice, T, Tout> {
+ static Status Compute(OpKernelContext* context,
+ const typename TTypes<T, 1>::ConstTensor& values,
+ const typename TTypes<T, 1>::ConstTensor& value_range,
+ int32 nbins, typename TTypes<Tout, 1>::Tensor& out) {
+ tensorflow::AllocatorAttributes pinned_allocator;
+ pinned_allocator.set_on_host(true);
+ pinned_allocator.set_gpu_compatible(true);
+
+ Tensor levels_tensor;
+ TF_RETURN_IF_ERROR(context->allocate_temp(
+ DataTypeToEnum<T>::value, TensorShape({nbins + 1}), &levels_tensor,
+ pinned_allocator));
+ auto levels = levels_tensor.flat<T>();
+
+ const double step = static_cast<double>(value_range(1) - value_range(0)) /
+ static_cast<double>(nbins);
+ levels(0) = std::numeric_limits<T>::lowest();
+ for (int i = 1; i < nbins; i++) {
+ levels(i) =
+ static_cast<T>(static_cast<double>(value_range(0)) + step * i);
+ }
+ levels(nbins) = std::numeric_limits<T>::max();
+
+ size_t temp_storage_bytes = 0;
+ const T* d_samples = values.data();
+ Tout* d_histogram = out.data();
+ int num_levels = levels.size();
+ T* d_levels = levels.data();
+ int num_samples = values.size();
+ const cudaStream_t& stream = GetCudaStream(context);
+
+    // The first HistogramRange call only queries the required temp storage
+    // size; passing d_temp_storage = NULL requests the size computation.
+ auto err = cub::DeviceHistogram::HistogramRange(
+ /* d_temp_storage */ NULL,
+ /* temp_storage_bytes */ temp_storage_bytes,
+ /* d_samples */ d_samples,
+ /* d_histogram */ d_histogram,
+ /* num_levels */ num_levels,
+ /* d_levels */ d_levels,
+ /* num_samples */ num_samples,
+ /* stream */ stream);
+ if (err != cudaSuccess) {
+ return errors::Internal(
+ "Could not launch HistogramRange to get temp storage: ",
+ cudaGetErrorString(err), ".");
+ }
+
+ Tensor temp_storage;
+ TF_RETURN_IF_ERROR(context->allocate_temp(
+ DataTypeToEnum<int8>::value,
+ TensorShape({static_cast<int64>(temp_storage_bytes)}), &temp_storage));
+
+ void* d_temp_storage = temp_storage.flat<int8>().data();
+
+    // The second HistogramRange call actually computes the histogram, using
+    // the d_temp_storage buffer of temp_storage_bytes allocated above.
+ err = cub::DeviceHistogram::HistogramRange(
+ /* d_temp_storage */ d_temp_storage,
+ /* temp_storage_bytes */ temp_storage_bytes,
+ /* d_samples */ d_samples,
+ /* d_histogram */ d_histogram,
+ /* num_levels */ num_levels,
+ /* d_levels */ d_levels,
+ /* num_samples */ num_samples,
+ /* stream */ stream);
+ if (err != cudaSuccess) {
+ return errors::Internal("Could not launch HistogramRange: ",
+ cudaGetErrorString(err), ".");
+ }
+
+ return Status::OK();
+ }
+};
+
+} // end namespace functor
+
+#define REGISTER_GPU_SPEC(type) \
+ template struct functor::HistogramFixedWidthFunctor<GPUDevice, type, int32>;
+
+TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_SPEC);
+#undef REGISTER_GPU_SPEC
+
+} // namespace tensorflow
+
+#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/listdiff_op.cc b/tensorflow/core/kernels/listdiff_op.cc
index d303bdd560..d28a2729d4 100644
--- a/tensorflow/core/kernels/listdiff_op.cc
+++ b/tensorflow/core/kernels/listdiff_op.cc
@@ -24,12 +24,13 @@ limitations under the License.
#include "tensorflow/core/lib/core/status.h"
namespace tensorflow {
-template <typename T>
+template <typename T, typename Tidx>
class ListDiffOp : public OpKernel {
public:
explicit ListDiffOp(OpKernelConstruction* context) : OpKernel(context) {
const DataType dt = DataTypeToEnum<T>::v();
- OP_REQUIRES_OK(context, context->MatchSignature({dt, dt}, {dt, DT_INT32}));
+ const DataType dtidx = DataTypeToEnum<Tidx>::v();
+ OP_REQUIRES_OK(context, context->MatchSignature({dt, dt}, {dt, dtidx}));
}
void Compute(OpKernelContext* context) override {
@@ -72,9 +73,9 @@ class ListDiffOp : public OpKernel {
Tensor* indices = nullptr;
OP_REQUIRES_OK(context, context->allocate_output(1, {out_size}, &indices));
- auto Tindices = indices->vec<int32>();
+ auto Tindices = indices->vec<Tidx>();
- for (int i = 0, p = 0; i < static_cast<int32>(x_size); ++i) {
+ for (Tidx i = 0, p = 0; i < static_cast<Tidx>(x_size); ++i) {
if (y_set.count(Tx(i)) == 0) {
OP_REQUIRES(context, p < out_size,
errors::InvalidArgument(
@@ -95,7 +96,12 @@ class ListDiffOp : public OpKernel {
.Device(DEVICE_CPU) \
.TypeConstraint<type>("T") \
.TypeConstraint<int32>("out_idx"), \
- ListDiffOp<type>)
+ ListDiffOp<type, int32>) \
+ REGISTER_KERNEL_BUILDER(Name("ListDiff") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("out_idx"), \
+ ListDiffOp<type, int64>)
TF_CALL_REAL_NUMBER_TYPES(REGISTER_LISTDIFF);
REGISTER_LISTDIFF(string);
diff --git a/tensorflow/core/kernels/map_stage_op.cc b/tensorflow/core/kernels/map_stage_op.cc
index 0168b57d35..7b5a464b72 100644
--- a/tensorflow/core/kernels/map_stage_op.cc
+++ b/tensorflow/core/kernels/map_stage_op.cc
@@ -111,15 +111,21 @@ class StagingMap : public ResourceBase {
void notify_inserters_if_bounded(std::unique_lock<std::mutex>* lock) {
if (has_capacity() || has_memory_limit()) {
lock->unlock();
- full_.notify_one();
+      // Notify all inserters: the removal of a single
+      // element may free enough memory for several
+      // inserters to make progress.
+ full_.notify_all();
}
}
- // Notify any removers waiting to extract values
+ // Notify all removers waiting to extract values
// that data is now available
void notify_removers(std::unique_lock<std::mutex>* lock) {
lock->unlock();
- not_empty_.notify_one();
+ // Notify all removers. This is because they are
+ // waiting for specific keys to appear in the map
+ // so we don't know which one to wake up.
+ not_empty_.notify_all();
}
bool has_capacity() const { return capacity_ > 0; }
diff --git a/tensorflow/core/kernels/mirror_pad_op.cc b/tensorflow/core/kernels/mirror_pad_op.cc
index e3643f9447..fbdeaf43eb 100644
--- a/tensorflow/core/kernels/mirror_pad_op.cc
+++ b/tensorflow/core/kernels/mirror_pad_op.cc
@@ -18,10 +18,10 @@ limitations under the License.
#define EIGEN_USE_THREADS
#include "tensorflow/core/kernels/mirror_pad_op.h"
-
#include <string>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
@@ -35,7 +35,7 @@ limitations under the License.
namespace tensorflow {
-template <typename Device, typename T>
+template <typename Device, typename T, typename Tpaddings>
class MirrorPadOp : public OpKernel {
public:
explicit MirrorPadOp(OpKernelConstruction* context) : OpKernel(context) {
@@ -82,10 +82,10 @@ class MirrorPadOp : public OpKernel {
// Compute the shape of the output tensor, and allocate it.
TensorShape output_shape;
- TTypes<int32>::ConstMatrix paddings = in1.matrix<int32>();
+ typename TTypes<Tpaddings>::ConstMatrix paddings = in1.matrix<Tpaddings>();
for (int d = 0; d < dims; ++d) {
- const int32 before = paddings(d, 0); // Pad before existing elements.
- const int32 after = paddings(d, 1); // Pad after existing elements.
+ const Tpaddings before = paddings(d, 0); // Pad before existing elements.
+ const Tpaddings after = paddings(d, 1); // Pad after existing elements.
OP_REQUIRES(context, before >= 0 && after >= 0,
errors::InvalidArgument("paddings must be non-negative: ",
before, " ", after));
@@ -121,7 +121,7 @@ class MirrorPadOp : public OpKernel {
#define MIRROR_PAD_CASE(i) \
case i: { \
- functor::MirrorPad<Device, T, i>()( \
+ functor::MirrorPad<Device, T, Tpaddings, i>()( \
context->eigen_device<Device>(), To32Bit(output->tensor<T, i>()), \
To32Bit(in0.tensor<T, i>()), paddings, offset_); \
break; \
@@ -152,20 +152,25 @@ using GpuDevice = Eigen::GpuDevice;
namespace functor {
// Forward declarations of the functor specializations defined in the sharded
// files.
-#define DECLARE_CPU_SPEC(T, i) \
- template <> \
- void MirrorPad<CpuDevice, T, i>::operator()( \
- const CpuDevice&, typename TTypes<T, i, int32>::Tensor, \
- typename TTypes<T, i, int32>::ConstTensor, TTypes<int32>::ConstMatrix, \
- int); \
- extern template struct MirrorPad<CpuDevice, T, i>;
-
-#define DECLARE_CPU_SPECS(T) \
- DECLARE_CPU_SPEC(T, 1); \
- DECLARE_CPU_SPEC(T, 2); \
- DECLARE_CPU_SPEC(T, 3); \
- DECLARE_CPU_SPEC(T, 4); \
- DECLARE_CPU_SPEC(T, 5);
+#define DECLARE_CPU_SPEC(T, Tpaddings, i) \
+ template <> \
+ void MirrorPad<CpuDevice, T, Tpaddings, i>::operator()( \
+ const CpuDevice&, typename TTypes<T, i, int32>::Tensor, \
+ typename TTypes<T, i, int32>::ConstTensor, \
+ TTypes<Tpaddings>::ConstMatrix, int); \
+ extern template struct MirrorPad<CpuDevice, T, Tpaddings, i>;
+
+#define DECLARE_CPU_SPECS(T) \
+ DECLARE_CPU_SPEC(T, int32, 1); \
+ DECLARE_CPU_SPEC(T, int32, 2); \
+ DECLARE_CPU_SPEC(T, int32, 3); \
+ DECLARE_CPU_SPEC(T, int32, 4); \
+ DECLARE_CPU_SPEC(T, int32, 5); \
+ DECLARE_CPU_SPEC(T, int64, 1); \
+ DECLARE_CPU_SPEC(T, int64, 2); \
+ DECLARE_CPU_SPEC(T, int64, 3); \
+ DECLARE_CPU_SPEC(T, int64, 4); \
+ DECLARE_CPU_SPEC(T, int64, 5);
TF_CALL_POD_TYPES(DECLARE_CPU_SPECS);
@@ -179,7 +184,13 @@ TF_CALL_POD_TYPES(DECLARE_CPU_SPECS);
.TypeConstraint<type>("T") \
.TypeConstraint<int32>("Tpaddings") \
.HostMemory("paddings"), \
- MirrorPadOp<CpuDevice, type>);
+ MirrorPadOp<CpuDevice, type, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("MirrorPad") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings"), \
+ MirrorPadOp<CpuDevice, type, int64>);
// Note that we do register for bool type, but not in the gradient op.
TF_CALL_POD_TYPES(REGISTER_KERNEL);
@@ -188,20 +199,25 @@ TF_CALL_POD_TYPES(REGISTER_KERNEL);
#if GOOGLE_CUDA
namespace functor {
// Forward declarations of the functor specializations for GPU.
-#define DECLARE_GPU_SPEC(T, i) \
- template <> \
- void MirrorPad<GpuDevice, T, i>::operator()( \
- const GpuDevice&, typename TTypes<T, i, int32>::Tensor, \
- typename TTypes<T, i, int32>::ConstTensor, TTypes<int32>::ConstMatrix, \
- int); \
- extern template struct MirrorPad<GpuDevice, T, i>;
-
-#define DECLARE_GPU_SPECS(T) \
- DECLARE_GPU_SPEC(T, 1); \
- DECLARE_GPU_SPEC(T, 2); \
- DECLARE_GPU_SPEC(T, 3); \
- DECLARE_GPU_SPEC(T, 4); \
- DECLARE_GPU_SPEC(T, 5);
+#define DECLARE_GPU_SPEC(T, Tpaddings, i) \
+ template <> \
+ void MirrorPad<GpuDevice, T, Tpaddings, i>::operator()( \
+ const GpuDevice&, typename TTypes<T, i, int32>::Tensor, \
+ typename TTypes<T, i, int32>::ConstTensor, \
+ TTypes<Tpaddings>::ConstMatrix, int); \
+ extern template struct MirrorPad<GpuDevice, T, Tpaddings, i>;
+
+#define DECLARE_GPU_SPECS(T) \
+ DECLARE_GPU_SPEC(T, int32, 1); \
+ DECLARE_GPU_SPEC(T, int32, 2); \
+ DECLARE_GPU_SPEC(T, int32, 3); \
+ DECLARE_GPU_SPEC(T, int32, 4); \
+ DECLARE_GPU_SPEC(T, int32, 5); \
+ DECLARE_GPU_SPEC(T, int64, 1); \
+ DECLARE_GPU_SPEC(T, int64, 2); \
+ DECLARE_GPU_SPEC(T, int64, 3); \
+ DECLARE_GPU_SPEC(T, int64, 4); \
+ DECLARE_GPU_SPEC(T, int64, 5);
TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
#undef DECLARE_GPU_SPECS
@@ -215,14 +231,20 @@ TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
.TypeConstraint<T>("T") \
.TypeConstraint<int32>("Tpaddings") \
.HostMemory("paddings"), \
- MirrorPadOp<GpuDevice, T>)
+ MirrorPadOp<GpuDevice, T, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("MirrorPad") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings"), \
+ MirrorPadOp<GpuDevice, T, int64>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNEL);
#undef REGISTER_GPU_KERNEL
#endif // GOOGLE_CUDA
// Gradient op.
-template <typename Device, typename T>
+template <typename Device, typename T, typename Tpaddings>
class MirrorPadGradOp : public OpKernel {
public:
explicit MirrorPadGradOp(OpKernelConstruction* context) : OpKernel(context) {
@@ -269,10 +291,10 @@ class MirrorPadGradOp : public OpKernel {
// Compute the shape of the output tensor, and allocate it.
TensorShape output_shape;
- TTypes<int32>::ConstMatrix paddings = in1.matrix<int32>();
+ typename TTypes<Tpaddings>::ConstMatrix paddings = in1.matrix<Tpaddings>();
for (int d = 0; d < dims; ++d) {
- const int32 before = paddings(d, 0); // Pad before existing elements.
- const int32 after = paddings(d, 1); // Pad after existing elements.
+ const Tpaddings before = paddings(d, 0); // Pad before existing elements.
+ const Tpaddings after = paddings(d, 1); // Pad after existing elements.
OP_REQUIRES(context, before >= 0 && after >= 0,
errors::InvalidArgument("Paddings must be non-negative: ",
before, ", ", after));
@@ -308,7 +330,7 @@ class MirrorPadGradOp : public OpKernel {
#define MIRROR_PAD_GRAD_CASE(k) \
case k: { \
- functor::MirrorPadGrad<Device, T, k>()( \
+ functor::MirrorPadGrad<Device, T, Tpaddings, k>()( \
context->eigen_device<Device>(), To32Bit(output->tensor<T, k>()), \
To32Bit(in0.tensor<T, k>()), paddings, offset_, \
To32Bit(scratch.tensor<T, k>())); \
@@ -337,33 +359,45 @@ class MirrorPadGradOp : public OpKernel {
namespace functor {
// Forward declarations of the functor specializations defined in the sharded
// files.
-#define DECLARE_CPU_SPEC(T, k) \
- template <> \
- void MirrorPadGrad<CpuDevice, T, k>::operator()( \
- const CpuDevice&, typename TTypes<T, k, int32>::Tensor, \
- typename TTypes<T, k, int32>::ConstTensor, TTypes<int32>::ConstMatrix, \
- int, typename TTypes<T, k, int32>::Tensor); \
- extern template struct MirrorPadGrad<CpuDevice, T, k>;
-
-#define DECLARE_CPU_SPECS(T) \
- DECLARE_CPU_SPEC(T, 1); \
- DECLARE_CPU_SPEC(T, 2); \
- DECLARE_CPU_SPEC(T, 3); \
- DECLARE_CPU_SPEC(T, 4); \
- DECLARE_CPU_SPEC(T, 5);
+#define DECLARE_CPU_SPEC(T, Tpaddings, k) \
+ template <> \
+ void MirrorPadGrad<CpuDevice, T, Tpaddings, k>::operator()( \
+ const CpuDevice&, typename TTypes<T, k, int32>::Tensor, \
+ typename TTypes<T, k, int32>::ConstTensor, \
+ TTypes<Tpaddings>::ConstMatrix, int, \
+ typename TTypes<T, k, int32>::Tensor); \
+ extern template struct MirrorPadGrad<CpuDevice, T, Tpaddings, k>;
+
+#define DECLARE_CPU_SPECS(T) \
+ DECLARE_CPU_SPEC(T, int32, 1); \
+ DECLARE_CPU_SPEC(T, int32, 2); \
+ DECLARE_CPU_SPEC(T, int32, 3); \
+ DECLARE_CPU_SPEC(T, int32, 4); \
+ DECLARE_CPU_SPEC(T, int32, 5); \
+ DECLARE_CPU_SPEC(T, int64, 1); \
+ DECLARE_CPU_SPEC(T, int64, 2); \
+ DECLARE_CPU_SPEC(T, int64, 3); \
+ DECLARE_CPU_SPEC(T, int64, 4); \
+ DECLARE_CPU_SPEC(T, int64, 5);
TF_CALL_NUMBER_TYPES(DECLARE_CPU_SPECS);
#undef DECLARE_CPU_SPECS
#undef DECLARE_CPU_SPEC
} // namespace functor
-#define REGISTER_KERNEL(type) \
- REGISTER_KERNEL_BUILDER(Name("MirrorPadGrad") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tpaddings") \
- .HostMemory("paddings"), \
- MirrorPadGradOp<CpuDevice, type>);
+#define REGISTER_KERNEL(type) \
+ REGISTER_KERNEL_BUILDER(Name("MirrorPadGrad") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tpaddings") \
+ .HostMemory("paddings"), \
+ MirrorPadGradOp<CpuDevice, type, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("MirrorPadGrad") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings"), \
+ MirrorPadGradOp<CpuDevice, type, int64>);
TF_CALL_NUMBER_TYPES(REGISTER_KERNEL);
#undef REGISTER_KERNEL
@@ -371,20 +405,26 @@ TF_CALL_NUMBER_TYPES(REGISTER_KERNEL);
#if GOOGLE_CUDA
namespace functor {
// Forward declarations of the functor specializations for GPU.
-#define DECLARE_GPU_SPEC(T, k) \
- template <> \
- void MirrorPadGrad<GpuDevice, T, k>::operator()( \
- const GpuDevice&, typename TTypes<T, k, int32>::Tensor, \
- typename TTypes<T, k, int32>::ConstTensor, TTypes<int32>::ConstMatrix, \
- int, typename TTypes<T, k, int32>::Tensor); \
- extern template struct MirrorPadGrad<GpuDevice, T, k>;
-
-#define DECLARE_GPU_SPECS(T) \
- DECLARE_GPU_SPEC(T, 1); \
- DECLARE_GPU_SPEC(T, 2); \
- DECLARE_GPU_SPEC(T, 3); \
- DECLARE_GPU_SPEC(T, 4); \
- DECLARE_GPU_SPEC(T, 5);
+#define DECLARE_GPU_SPEC(T, Tpaddings, k) \
+ template <> \
+ void MirrorPadGrad<GpuDevice, T, Tpaddings, k>::operator()( \
+ const GpuDevice&, typename TTypes<T, k, int32>::Tensor, \
+ typename TTypes<T, k, int32>::ConstTensor, \
+ TTypes<Tpaddings>::ConstMatrix, int, \
+ typename TTypes<T, k, int32>::Tensor); \
+ extern template struct MirrorPadGrad<GpuDevice, T, Tpaddings, k>;
+
+#define DECLARE_GPU_SPECS(T) \
+ DECLARE_GPU_SPEC(T, int32, 1); \
+ DECLARE_GPU_SPEC(T, int32, 2); \
+ DECLARE_GPU_SPEC(T, int32, 3); \
+ DECLARE_GPU_SPEC(T, int32, 4); \
+ DECLARE_GPU_SPEC(T, int32, 5); \
+ DECLARE_GPU_SPEC(T, int64, 1); \
+ DECLARE_GPU_SPEC(T, int64, 2); \
+ DECLARE_GPU_SPEC(T, int64, 3); \
+ DECLARE_GPU_SPEC(T, int64, 4); \
+ DECLARE_GPU_SPEC(T, int64, 5);
TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
#undef DECLARE_GPU_SPECS
@@ -398,7 +438,13 @@ TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
.TypeConstraint<T>("T") \
.TypeConstraint<int32>("Tpaddings") \
.HostMemory("paddings"), \
- MirrorPadGradOp<GpuDevice, T>)
+ MirrorPadGradOp<GpuDevice, T, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("MirrorPadGrad") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings"), \
+ MirrorPadGradOp<GpuDevice, T, int64>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNEL);
#undef REGISTER_GPU_KERNEL
diff --git a/tensorflow/core/kernels/mirror_pad_op.h b/tensorflow/core/kernels/mirror_pad_op.h
index b83d2223d0..81150a9e79 100644
--- a/tensorflow/core/kernels/mirror_pad_op.h
+++ b/tensorflow/core/kernels/mirror_pad_op.h
@@ -64,9 +64,8 @@ class TensorMirrorPadOp
StorageKind;
typedef typename Eigen::internal::traits<TensorMirrorPadOp>::Index Index;
- EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
- TensorMirrorPadOp(const XprType& expr, const PaddingDimensions& padding_dims,
- Index offset)
+ EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE TensorMirrorPadOp(
+ const XprType& expr, const PaddingDimensions& padding_dims, Index offset)
: xpr_(expr), padding_dims_(padding_dims), offset_(offset) {}
EIGEN_DEVICE_FUNC
@@ -336,12 +335,12 @@ namespace functor {
// offset argument must be either 0 or 1. This controls whether the boundary
// values are replicated (offset == 0) or not replicated (offset == 1).
-template <typename Device, typename T, int Dims>
+template <typename Device, typename T, typename Tpaddings, int Dims>
struct MirrorPad {
void operator()(const Device& device,
typename TTypes<T, Dims, int32>::Tensor output,
typename TTypes<T, Dims, int32>::ConstTensor input,
- TTypes<int32>::ConstMatrix padding, int offset) {
+ typename TTypes<Tpaddings>::ConstMatrix padding, int offset) {
Eigen::array<Eigen::IndexPair<int32>, Dims> padding_dims;
for (int i = 0; i < Dims; ++i) {
@@ -363,12 +362,12 @@ struct MirrorPad {
// offset argument must be either 0 or 1. This controls whether the boundary
// values are replicated (offset == 0) or not replicated (offset == 1).
-template <typename Device, typename T, int Dims>
+template <typename Device, typename T, typename Tpaddings, int Dims>
struct MirrorPadGrad {
void operator()(const Device& device,
typename TTypes<T, Dims, int32>::Tensor output,
typename TTypes<T, Dims, int32>::ConstTensor input,
- TTypes<int32>::ConstMatrix paddings, int offset,
+ typename TTypes<Tpaddings>::ConstMatrix paddings, int offset,
typename TTypes<T, Dims, int32>::Tensor scratch) {
// Copy the gradient input into the scratch buffer.
scratch.device(device) = input;
diff --git a/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h b/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h
index 9864f5633a..bb22b2aa91 100644
--- a/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h
+++ b/tensorflow/core/kernels/mirror_pad_op_cpu_impl.h
@@ -25,13 +25,17 @@ namespace tensorflow {
using CpuDevice = Eigen::ThreadPoolDevice;
-#define DEFINE_CPU_SPECS(T) \
- template struct functor::MirrorPad<CpuDevice, T, CPU_PROVIDED_IXDIM>;
+#define DEFINE_CPU_SPECS(T) \
+ template struct functor::MirrorPad<CpuDevice, T, int32, CPU_PROVIDED_IXDIM>; \
+ template struct functor::MirrorPad<CpuDevice, T, int64, CPU_PROVIDED_IXDIM>;
TF_CALL_POD_TYPES(DEFINE_CPU_SPECS);
#undef DEFINE_CPU_SPECS
-#define DEFINE_CPU_SPECS(T) \
- template struct functor::MirrorPadGrad<CpuDevice, T, CPU_PROVIDED_IXDIM>;
+#define DEFINE_CPU_SPECS(T) \
+ template struct functor::MirrorPadGrad<CpuDevice, T, int32, \
+ CPU_PROVIDED_IXDIM>; \
+ template struct functor::MirrorPadGrad<CpuDevice, T, int64, \
+ CPU_PROVIDED_IXDIM>;
TF_CALL_NUMBER_TYPES(DEFINE_CPU_SPECS);
#undef DEFINE_CPU_SPECS
diff --git a/tensorflow/core/kernels/mirror_pad_op_gpu.cu.cc b/tensorflow/core/kernels/mirror_pad_op_gpu.cu.cc
index 8074aa9624..dbd0a9bd8f 100644
--- a/tensorflow/core/kernels/mirror_pad_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/mirror_pad_op_gpu.cu.cc
@@ -25,17 +25,27 @@ namespace tensorflow {
using GpuDevice = Eigen::GpuDevice;
-#define DEFINE_GPU_SPECS(T) \
- template struct functor::MirrorPad<GpuDevice, T, 1>; \
- template struct functor::MirrorPad<GpuDevice, T, 2>; \
- template struct functor::MirrorPad<GpuDevice, T, 3>; \
- template struct functor::MirrorPad<GpuDevice, T, 4>; \
- template struct functor::MirrorPad<GpuDevice, T, 5>; \
- template struct functor::MirrorPadGrad<GpuDevice, T, 1>; \
- template struct functor::MirrorPadGrad<GpuDevice, T, 2>; \
- template struct functor::MirrorPadGrad<GpuDevice, T, 3>; \
- template struct functor::MirrorPadGrad<GpuDevice, T, 4>; \
- template struct functor::MirrorPadGrad<GpuDevice, T, 5>;
+#define DEFINE_GPU_SPECS(T) \
+ template struct functor::MirrorPad<GpuDevice, T, int32, 1>; \
+ template struct functor::MirrorPad<GpuDevice, T, int32, 2>; \
+ template struct functor::MirrorPad<GpuDevice, T, int32, 3>; \
+ template struct functor::MirrorPad<GpuDevice, T, int32, 4>; \
+ template struct functor::MirrorPad<GpuDevice, T, int32, 5>; \
+ template struct functor::MirrorPad<GpuDevice, T, int64, 1>; \
+ template struct functor::MirrorPad<GpuDevice, T, int64, 2>; \
+ template struct functor::MirrorPad<GpuDevice, T, int64, 3>; \
+ template struct functor::MirrorPad<GpuDevice, T, int64, 4>; \
+ template struct functor::MirrorPad<GpuDevice, T, int64, 5>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int32, 1>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int32, 2>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int32, 3>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int32, 4>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int32, 5>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int64, 1>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int64, 2>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int64, 3>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int64, 4>; \
+ template struct functor::MirrorPadGrad<GpuDevice, T, int64, 5>;
TF_CALL_GPU_NUMBER_TYPES(DEFINE_GPU_SPECS);
#undef DEFINE_GPU_SPECS
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index 57661e8b10..369f632fb4 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -288,8 +288,10 @@ class MklConv2DOp : public OpKernel {
mkl_filter_output_mkl_shape.SetMklLayout(mkl_context.prim_fwd,
dnnResourceFilter);
- size_t filter_sizes[4] = {filter.dim_size(0), filter.dim_size(1),
- filter.dim_size(2), filter.dim_size(3)};
+ size_t filter_sizes[4] = {static_cast<size_t>(filter.dim_size(0)),
+ static_cast<size_t>(filter.dim_size(1)),
+ static_cast<size_t>(filter.dim_size(2)),
+ static_cast<size_t>(filter.dim_size(3))};
mkl_filter_output_mkl_shape.SetTfLayout(filter.dims(), filter_sizes,
mkl_context.filter_strides);
diff --git a/tensorflow/core/kernels/nth_element_op.cc b/tensorflow/core/kernels/nth_element_op.cc
new file mode 100644
index 0000000000..da825e408c
--- /dev/null
+++ b/tensorflow/core/kernels/nth_element_op.cc
@@ -0,0 +1,139 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// See docs in ../ops/nn_ops.cc.
+#include "tensorflow/core/kernels/nth_element_op.h"
+
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/register_types.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/util/work_sharder.h"
+#include <vector>
+#include <algorithm>
+#include <iostream>
+
+namespace tensorflow {
+
+typedef Eigen::ThreadPoolDevice CPUDevice;
+
+template <typename Device, typename T>
+class NthElementOp : public OpKernel {
+ public:
+ explicit NthElementOp(OpKernelConstruction* context) : OpKernel(context) {
+ OP_REQUIRES_OK(context, context->GetAttr("reverse", &reverse_));
+ }
+
+ void Compute(OpKernelContext* context) override {
+ // The second arg is N, which must be a non-negative scalar.
+ const auto& n_in = context->input(1);
+ OP_REQUIRES(context, TensorShapeUtils::IsScalar(n_in.shape()),
+ errors::InvalidArgument("N must be scalar, got shape ",
+ n_in.shape().DebugString()));
+ int n = n_in.scalar<int32>()();
+ OP_REQUIRES(context, n >= 0,
+ errors::InvalidArgument("Need n >= 0, got ", n));
+
+ // The first arg is the input tensor, which must have at least 1 dimension.
+ const Tensor& input_in = context->input(0);
+ const int num_dims = input_in.dims();
+ OP_REQUIRES(context, num_dims >= 1,
+ errors::InvalidArgument("Input must be >= 1-D, got shape ",
+ input_in.shape().DebugString()));
+ // The last dimension of input tensor must be greater than N.
+ OP_REQUIRES(context, input_in.dim_size(num_dims-1) > n,
+ errors::InvalidArgument("Input must have at least n+1 columns"));
+
+ // std::nth_element only support the nth-smallest selection.
+ if (reverse_) {
+ n = input_in.dim_size(num_dims - 1) - n - 1;
+ }
+
+ // Assume input_shape is [d1,d2,...dk], and output_shape is [d1,d2...dk-1].
+ TensorShape out_shape;
+ for (int i = 0; i < num_dims-1; ++i) {
+ out_shape.AddDim(input_in.dim_size(i));
+ }
+ Tensor* output_tensor = nullptr;
+ OP_REQUIRES_OK(context,
+ context->allocate_output(0, out_shape, &output_tensor));
+
+ functor::NthElementFunctor<Device, T> nthElementFunc;
+ nthElementFunc(context, input_in, *output_tensor, n, reverse_);
+ }
+
+ private:
+ bool reverse_;
+};
+
+namespace functor {
+
+template <typename T>
+struct NthElementFunctor<CPUDevice, T> {
+ void operator() (OpKernelContext* context,
+ const Tensor& input_tensor,
+ Tensor& output_tensor,
+ int n,
+ bool reverse) {
+ const T* input = input_tensor.flat<T>().data();
+ T* output = output_tensor.flat<T>().data();
+
+ // Assume input_shape is [d1,d2,...dk], and output_shape is [d1,d2...dk-1],
+ // then num_rows = d1*d2...dk-1, last_dim = dk.
+ const int num_rows = output_tensor.NumElements();
+ const int last_dim = input_tensor.dim_size(input_tensor.dims()-1);
+
+ // Allocate each row to different shard.
+ auto SubNthElement = [&, input, output, last_dim, n](int start,
+ int limit) {
+ // std::nth_element would rearrange the array, so we need a new buffer.
+ std::vector<T> buf(last_dim);
+
+ for (int b = start; b < limit; ++b) {
+ // Copy from one row of elements to buffer
+ const T* input_start = input + b * last_dim;
+ const T* input_end = input + (b+1) * last_dim;
+ std::copy(input_start, input_end, buf.begin());
+
+ std::nth_element(buf.begin(), buf.begin()+n, buf.end());
+ // The element placed in the nth position is exactly the element that
+ // would occur in this position if the range was fully sorted.
+ output[b] = buf[n];
+ }
+ };
+
+ auto worker_threads = *(context->device()->tensorflow_cpu_worker_threads());
+ // The average time complexity of partition-based nth_element (BFPRT) is O(n),
+ // although the worst-case time complexity could be O(n^2).
+ // Here, 20 is an empirical factor of cost_per_unit.
+ Shard(worker_threads.num_threads, worker_threads.workers, num_rows,
+ 20 * last_dim, SubNthElement);
+ }
+};
+
+} // namespace functor
+
+
+#define REGISTER_NTHOP(T) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("NthElement").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
+ NthElementOp<CPUDevice, T>)
+
+TF_CALL_REAL_NUMBER_TYPES(REGISTER_NTHOP);
+#undef REGISTER_NTHOP
+
+} // end namespace tensorflow
+
diff --git a/tensorflow/core/kernels/nth_element_op.h b/tensorflow/core/kernels/nth_element_op.h
new file mode 100644
index 0000000000..11a6c996b0
--- /dev/null
+++ b/tensorflow/core/kernels/nth_element_op.h
@@ -0,0 +1,39 @@
+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_NTH_ELEMENT_OP_H_
+#define TENSORFLOW_NTH_ELEMENT_OP_H_
+
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+namespace functor {
+
+template <typename Device, typename T>
+struct NthElementFunctor {
+ void operator() (OpKernelContext* context,
+ const Tensor& input_tensor,
+ Tensor& output_tensor,
+ int n, bool reverse);
+};
+
+} // namespace functor
+
+} // namespace tensorflow
+
+#endif // TENSORFLOW_NTH_ELEMENT_OP_H_
diff --git a/tensorflow/core/kernels/pad_op.cc b/tensorflow/core/kernels/pad_op.cc
index 6196c5ed93..eff3e4d92c 100644
--- a/tensorflow/core/kernels/pad_op.cc
+++ b/tensorflow/core/kernels/pad_op.cc
@@ -40,9 +40,9 @@ typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
-template <typename Device, typename T>
+template <typename Device, typename T, typename Tpadding>
class PadOp : public OpKernel {
public:
explicit PadOp(OpKernelConstruction* context) : OpKernel(context) {}
@@ -82,10 +82,11 @@ class PadOp : public OpKernel {
// Compute the shape of the output tensor, and allocate it.
TensorShape output_shape;
- TTypes<int32>::ConstMatrix paddings = in1.matrix<int32>();
+ typename TTypes<Tpadding>::ConstMatrix paddings = in1.matrix<Tpadding>();
for (int d = 0; d < fixed_dims; ++d) {
- const int32 before_d = paddings(d, 0); // Pad before existing elements.
- const int32 after_d = paddings(d, 1); // Pad after existing elements.
+ const Tpadding before_d =
+ paddings(d, 0); // Pad before existing elements.
+ const Tpadding after_d = paddings(d, 1); // Pad after existing elements.
OP_REQUIRES(context, before_d >= 0 && after_d >= 0,
errors::InvalidArgument("Paddings must be non-negative: ",
before_d, " ", after_d));
@@ -142,32 +143,47 @@ class PadOp : public OpKernel {
template <int Dims>
void Operate(OpKernelContext* context,
typename TTypes<T, Dims>::ConstTensor input,
- TTypes<int32>::ConstMatrix paddings, T pad_value,
+ typename TTypes<Tpadding>::ConstMatrix paddings, T pad_value,
Tensor* output) {
CHECK_EQ(Dims, paddings.dimension(0));
CHECK_EQ(2, paddings.dimension(1));
- Eigen::array<Eigen::IndexPair<int32>, Dims> paddings_array;
+ Eigen::array<Eigen::IndexPair<Tpadding>, Dims> paddings_array;
for (int i = 0; i < Dims; ++i) {
paddings_array[i] = {paddings(i, 0), paddings(i, 1)};
}
- functor::Pad<Device, T, Dims> functor;
+ functor::Pad<Device, T, Tpadding, Dims> functor;
functor(context->eigen_device<Device>(), output->tensor<T, Dims>(), input,
paddings_array, pad_value);
}
};
-#define REGISTER_KERNEL(type) \
- REGISTER_KERNEL_BUILDER(Name("Pad") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .HostMemory("paddings"), \
- PadOp<CPUDevice, type>); \
- REGISTER_KERNEL_BUILDER(Name("PadV2") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .HostMemory("paddings") \
- .HostMemory("constant_values"), \
- PadOp<CPUDevice, type>);
+#define REGISTER_KERNEL(type) \
+ REGISTER_KERNEL_BUILDER(Name("Pad") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tpaddings") \
+ .HostMemory("paddings"), \
+ PadOp<CPUDevice, type, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("Pad") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings"), \
+ PadOp<CPUDevice, type, int64>); \
+ REGISTER_KERNEL_BUILDER(Name("PadV2") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tpaddings") \
+ .HostMemory("paddings") \
+ .HostMemory("constant_values"), \
+ PadOp<CPUDevice, type, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("PadV2") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings") \
+ .HostMemory("constant_values"), \
+ PadOp<CPUDevice, type, int64>);
TF_CALL_POD_TYPES(REGISTER_KERNEL);
#undef REGISTER_KERNEL
@@ -177,11 +193,17 @@ TF_CALL_POD_TYPES(REGISTER_KERNEL);
namespace functor {
#define DECLARE_GPU_SPEC(T, Dims) \
template <> \
- void Pad<GPUDevice, T, Dims>::operator()( \
+ void Pad<GPUDevice, T, int32, Dims>::operator()( \
const GPUDevice& d, typename TTypes<T, Dims>::Tensor output, \
typename TTypes<T, Dims>::ConstTensor input, \
Eigen::array<Eigen::IndexPair<int32>, Dims> paddings, T pad_value); \
- extern template struct Pad<GPUDevice, T, Dims>;
+ extern template struct Pad<GPUDevice, T, int32, Dims>; \
+ template <> \
+ void Pad<GPUDevice, T, int64, Dims>::operator()( \
+ const GPUDevice& d, typename TTypes<T, Dims>::Tensor output, \
+ typename TTypes<T, Dims>::ConstTensor input, \
+ Eigen::array<Eigen::IndexPair<int64>, Dims> paddings, T pad_value); \
+ extern template struct Pad<GPUDevice, T, int64, Dims>;
#define DECLARE_GPU_SPECS(T) \
DECLARE_GPU_SPEC(T, 0); \
@@ -202,14 +224,27 @@ TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
.TypeConstraint<T>("T") \
.TypeConstraint<int32>("Tpaddings") \
.HostMemory("paddings"), \
- PadOp<GPUDevice, T>); \
+ PadOp<GPUDevice, T, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("Pad") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings"), \
+ PadOp<GPUDevice, T, int64>); \
REGISTER_KERNEL_BUILDER(Name("PadV2") \
.Device(DEVICE_GPU) \
.TypeConstraint<T>("T") \
.TypeConstraint<int32>("Tpaddings") \
.HostMemory("paddings") \
.HostMemory("constant_values"), \
- PadOp<GPUDevice, T>)
+ PadOp<GPUDevice, T, int32>) \
+ REGISTER_KERNEL_BUILDER(Name("PadV2") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings") \
+ .HostMemory("constant_values"), \
+ PadOp<GPUDevice, T, int64>)
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNEL);
@@ -223,7 +258,15 @@ REGISTER_KERNEL_BUILDER(Name("Pad")
.HostMemory("input")
.HostMemory("paddings")
.HostMemory("output"),
- PadOp<CPUDevice, int32>);
+ PadOp<CPUDevice, int32, int32>);
+REGISTER_KERNEL_BUILDER(Name("Pad")
+ .Device(DEVICE_GPU)
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tpaddings")
+ .HostMemory("input")
+ .HostMemory("paddings")
+ .HostMemory("output"),
+ PadOp<CPUDevice, int32, int64>);
REGISTER_KERNEL_BUILDER(Name("PadV2")
.Device(DEVICE_GPU)
.TypeConstraint<int32>("T")
@@ -232,7 +275,16 @@ REGISTER_KERNEL_BUILDER(Name("PadV2")
.HostMemory("paddings")
.HostMemory("constant_values")
.HostMemory("output"),
- PadOp<CPUDevice, int32>);
+ PadOp<CPUDevice, int32, int32>);
+REGISTER_KERNEL_BUILDER(Name("PadV2")
+ .Device(DEVICE_GPU)
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tpaddings")
+ .HostMemory("input")
+ .HostMemory("paddings")
+ .HostMemory("constant_values")
+ .HostMemory("output"),
+ PadOp<CPUDevice, int32, int64>);
#endif
#ifdef TENSORFLOW_USE_SYCL
@@ -243,14 +295,27 @@ REGISTER_KERNEL_BUILDER(Name("PadV2")
.TypeConstraint<T>("T") \
.TypeConstraint<int32>("Tpaddings") \
.HostMemory("paddings"), \
- PadOp<SYCLDevice, T>); \
+ PadOp<SYCLDevice, T, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("Pad") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<T>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings"), \
+ PadOp<SYCLDevice, T, int64>); \
REGISTER_KERNEL_BUILDER(Name("PadV2") \
.Device(DEVICE_SYCL) \
.TypeConstraint<T>("T") \
.TypeConstraint<int32>("Tpaddings") \
.HostMemory("paddings") \
.HostMemory("constant_values"), \
- PadOp<SYCLDevice, T>)
+ PadOp<SYCLDevice, T, int32>) \
+ REGISTER_KERNEL_BUILDER(Name("PadV2") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<T>("T") \
+ .TypeConstraint<int64>("Tpaddings") \
+ .HostMemory("paddings") \
+ .HostMemory("constant_values"), \
+ PadOp<SYCLDevice, T, int64>)
TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL_KERNEL);
REGISTER_KERNEL_BUILDER(Name("Pad")
@@ -260,7 +325,15 @@ REGISTER_KERNEL_BUILDER(Name("Pad")
.HostMemory("input")
.HostMemory("paddings")
.HostMemory("output"),
- PadOp<CPUDevice, int32>);
+ PadOp<CPUDevice, int32, int32>);
+REGISTER_KERNEL_BUILDER(Name("Pad")
+ .Device(DEVICE_SYCL)
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tpaddings")
+ .HostMemory("input")
+ .HostMemory("paddings")
+ .HostMemory("output"),
+ PadOp<CPUDevice, int32, int64>);
REGISTER_KERNEL_BUILDER(Name("PadV2")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
@@ -269,8 +342,17 @@ REGISTER_KERNEL_BUILDER(Name("PadV2")
.HostMemory("paddings")
.HostMemory("constant_values")
.HostMemory("output"),
- PadOp<CPUDevice, int32>);
+ PadOp<CPUDevice, int32, int32>);
+REGISTER_KERNEL_BUILDER(Name("PadV2")
+ .Device(DEVICE_SYCL)
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tpaddings")
+ .HostMemory("input")
+ .HostMemory("paddings")
+ .HostMemory("constant_values")
+ .HostMemory("output"),
+ PadOp<CPUDevice, int32, int64>);
#undef REGISTER_SYCL_KERNEL
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/pad_op.h b/tensorflow/core/kernels/pad_op.h
index 95a7c9a3ae..ee9e0f0330 100644
--- a/tensorflow/core/kernels/pad_op.h
+++ b/tensorflow/core/kernels/pad_op.h
@@ -25,13 +25,13 @@ namespace tensorflow {
namespace functor {
// Functor used by PadOp to do the computations.
-template <typename Device, typename T, int Dims>
+template <typename Device, typename T, typename Tpadding, int Dims>
struct Pad {
// Pad "input" into "output", as specified by "paddings" and "pad_value".
// See pad_op.cc for details.
void operator()(const Device& d, typename TTypes<T, Dims>::Tensor output,
typename TTypes<T, Dims>::ConstTensor input,
- Eigen::array<Eigen::IndexPair<int32>, Dims> paddings,
+ Eigen::array<Eigen::IndexPair<Tpadding>, Dims> paddings,
T pad_value) {
if (Eigen::internal::is_same<Device, Eigen::GpuDevice>::value &&
(output.size() <= std::numeric_limits<int32>::max())) {
@@ -42,12 +42,12 @@ struct Pad {
}
};
-template <typename Device, typename T>
-struct Pad<Device, T, 0> {
+template <typename Device, typename T, typename Tpadding>
+struct Pad<Device, T, Tpadding, 0> {
// In the scalar case we simply copy the input.
void operator()(const Device& d, typename TTypes<T, 0>::Tensor output,
typename TTypes<T, 0>::ConstTensor input,
- Eigen::array<Eigen::IndexPair<int32>, 0>, T) {
+ Eigen::array<Eigen::IndexPair<Tpadding>, 0>, T) {
output.device(d) = input;
}
};
diff --git a/tensorflow/core/kernels/pad_op_gpu.cu.cc b/tensorflow/core/kernels/pad_op_gpu.cu.cc
index f98631df17..613ad62825 100644
--- a/tensorflow/core/kernels/pad_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/pad_op_gpu.cu.cc
@@ -26,14 +26,18 @@ namespace tensorflow {
typedef Eigen::GpuDevice GPUDevice;
// Definition of the GPU implementations declared in pad_op.cc.
-#define DEFINE_GPU_SPECS(T) \
- template struct functor::Pad<GPUDevice, T, 0>; \
- template struct functor::Pad<GPUDevice, T, 1>; \
- template struct functor::Pad<GPUDevice, T, 2>; \
- template struct functor::Pad<GPUDevice, T, 3>; \
- template struct functor::Pad<GPUDevice, T, 4>; \
- template struct functor::Pad<GPUDevice, T, 5>; \
- template struct functor::Pad<GPUDevice, T, 6>;
+#define DEFINE_GPU_PAD_SPECS(T, Tpadding) \
+ template struct functor::Pad<GPUDevice, T, Tpadding, 0>; \
+ template struct functor::Pad<GPUDevice, T, Tpadding, 1>; \
+ template struct functor::Pad<GPUDevice, T, Tpadding, 2>; \
+ template struct functor::Pad<GPUDevice, T, Tpadding, 3>; \
+ template struct functor::Pad<GPUDevice, T, Tpadding, 4>; \
+ template struct functor::Pad<GPUDevice, T, Tpadding, 5>; \
+ template struct functor::Pad<GPUDevice, T, Tpadding, 6>;
+
+#define DEFINE_GPU_SPECS(T) \
+ DEFINE_GPU_PAD_SPECS(T, int32) \
+ DEFINE_GPU_PAD_SPECS(T, int64)
TF_CALL_GPU_NUMBER_TYPES(DEFINE_GPU_SPECS);
diff --git a/tensorflow/core/kernels/reduction_ops_all.cc b/tensorflow/core/kernels/reduction_ops_all.cc
index 41abc2b957..4a34c4ef51 100644
--- a/tensorflow/core/kernels/reduction_ops_all.cc
+++ b/tensorflow/core/kernels/reduction_ops_all.cc
@@ -22,7 +22,13 @@ REGISTER_KERNEL_BUILDER(
.TypeConstraint<int32>("Tidx")
.Device(DEVICE_CPU)
.HostMemory("reduction_indices"),
- ReductionOp<CPUDevice, bool, Eigen::internal::AndReducer>);
+ ReductionOp<CPUDevice, bool, int32, Eigen::internal::AndReducer>);
+REGISTER_KERNEL_BUILDER(
+ Name("All")
+ .TypeConstraint<int64>("Tidx")
+ .Device(DEVICE_CPU)
+ .HostMemory("reduction_indices"),
+ ReductionOp<CPUDevice, bool, int64, Eigen::internal::AndReducer>);
#if GOOGLE_CUDA
REGISTER_KERNEL_BUILDER(
@@ -30,7 +36,13 @@ REGISTER_KERNEL_BUILDER(
.TypeConstraint<int32>("Tidx")
.Device(DEVICE_GPU)
.HostMemory("reduction_indices"),
- ReductionOp<GPUDevice, bool, Eigen::internal::AndReducer>);
+ ReductionOp<GPUDevice, bool, int32, Eigen::internal::AndReducer>);
+REGISTER_KERNEL_BUILDER(
+ Name("All")
+ .TypeConstraint<int64>("Tidx")
+ .Device(DEVICE_GPU)
+ .HostMemory("reduction_indices"),
+ ReductionOp<GPUDevice, bool, int64, Eigen::internal::AndReducer>);
#endif
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_any.cc b/tensorflow/core/kernels/reduction_ops_any.cc
index a2087cc3b7..6c0519de95 100644
--- a/tensorflow/core/kernels/reduction_ops_any.cc
+++ b/tensorflow/core/kernels/reduction_ops_any.cc
@@ -22,7 +22,13 @@ REGISTER_KERNEL_BUILDER(
.TypeConstraint<int32>("Tidx")
.Device(DEVICE_CPU)
.HostMemory("reduction_indices"),
- ReductionOp<CPUDevice, bool, Eigen::internal::OrReducer>);
+ ReductionOp<CPUDevice, bool, int32, Eigen::internal::OrReducer>);
+REGISTER_KERNEL_BUILDER(
+ Name("Any")
+ .TypeConstraint<int64>("Tidx")
+ .Device(DEVICE_CPU)
+ .HostMemory("reduction_indices"),
+ ReductionOp<CPUDevice, bool, int64, Eigen::internal::OrReducer>);
#if GOOGLE_CUDA
REGISTER_KERNEL_BUILDER(
@@ -30,7 +36,13 @@ REGISTER_KERNEL_BUILDER(
.TypeConstraint<int32>("Tidx")
.Device(DEVICE_GPU)
.HostMemory("reduction_indices"),
- ReductionOp<GPUDevice, bool, Eigen::internal::OrReducer>);
+ ReductionOp<GPUDevice, bool, int32, Eigen::internal::OrReducer>);
+REGISTER_KERNEL_BUILDER(
+ Name("Any")
+ .TypeConstraint<int64>("Tidx")
+ .Device(DEVICE_GPU)
+ .HostMemory("reduction_indices"),
+ ReductionOp<GPUDevice, bool, int64, Eigen::internal::OrReducer>);
#endif
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_common.cc b/tensorflow/core/kernels/reduction_ops_common.cc
index 5eba4288ac..8daab0d6be 100644
--- a/tensorflow/core/kernels/reduction_ops_common.cc
+++ b/tensorflow/core/kernels/reduction_ops_common.cc
@@ -57,13 +57,12 @@ gtl::InlinedVector<int32, 8> ReductionHelper::permutation() {
return perm;
}
-Status ReductionHelper::Simplify(const Tensor& data, const Tensor& axis,
- const bool keep_dims) {
- // bitmap[i] indicates whether to reduce data along i-th axis.
- gtl::InlinedVector<bool, 4> bitmap(data.dims(), false);
- auto axis_vec = axis.flat<int32>();
+template <typename Tperm>
+Status SimplifyHelper(const Tensor& data, const Tensor& axis,
+ gtl::InlinedVector<bool, 4>& bitmap) {
+ auto axis_vec = axis.flat<Tperm>();
for (int64 i = 0; i < axis.NumElements(); ++i) {
- int32 index = axis_vec(i);
+ Tperm index = axis_vec(i);
if (index < -data.dims() || index >= data.dims()) {
return errors::InvalidArgument("Invalid reduction dimension (", index,
" for input with ", data.dims(),
@@ -72,7 +71,18 @@ Status ReductionHelper::Simplify(const Tensor& data, const Tensor& axis,
index = (index + data.dims()) % data.dims();
bitmap[index] = true;
}
+ return Status::OK();
+}
+Status ReductionHelper::Simplify(const Tensor& data, const Tensor& axis,
+ const bool keep_dims) {
+ // bitmap[i] indicates whether to reduce data along i-th axis.
+ gtl::InlinedVector<bool, 4> bitmap(data.dims(), false);
+ if (axis.dtype() == DT_INT32) {
+ TF_RETURN_IF_ERROR(SimplifyHelper<int32>(data, axis, bitmap));
+ } else {
+ TF_RETURN_IF_ERROR(SimplifyHelper<int64>(data, axis, bitmap));
+ }
// Output tensor's dim sizes.
out_shape_.clear();
for (int i = 0; i < data.dims(); ++i) {
diff --git a/tensorflow/core/kernels/reduction_ops_common.h b/tensorflow/core/kernels/reduction_ops_common.h
index 71af9d88dc..9da992ccd1 100644
--- a/tensorflow/core/kernels/reduction_ops_common.h
+++ b/tensorflow/core/kernels/reduction_ops_common.h
@@ -25,6 +25,7 @@ limitations under the License.
#include "third_party/eigen3/Eigen/Core"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
@@ -42,7 +43,7 @@ typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
template <typename Device>
struct Constants {
@@ -68,11 +69,13 @@ struct ConstantsBase {
const Eigen::IndexList<Eigen::type2index<1>> kOne;
const Eigen::IndexList<Eigen::type2index<0>, Eigen::type2index<2>> kZeroTwo;
};
-template<> struct Constants<CPUDevice> : ConstantsBase{};
+template <>
+struct Constants<CPUDevice> : ConstantsBase {};
#ifdef TENSORFLOW_USE_SYCL
-template<> struct Constants<SYCLDevice> : ConstantsBase{};
-#endif // TENSORFLOW_USE_SYCL
-#endif // EIGEN_HAS_INDEX_LIST
+template <>
+struct Constants<SYCLDevice> : ConstantsBase {};
+#endif // TENSORFLOW_USE_SYCL
+#endif // EIGEN_HAS_INDEX_LIST
class ReductionHelper {
public:
@@ -131,12 +134,13 @@ class ReductionHelper {
// For operations where the output is a reduction function along some
// dimensions of the input.
-template <typename Device, class T, typename Reducer>
+template <typename Device, class T, typename Tperm, typename Reducer>
class ReductionOp : public OpKernel {
public:
explicit ReductionOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
const DataType dt = DataTypeToEnum<T>::v();
- OP_REQUIRES_OK(ctx, ctx->MatchSignature({dt, DT_INT32}, {dt}));
+ const DataType pt = DataTypeToEnum<Tperm>::v();
+ OP_REQUIRES_OK(ctx, ctx->MatchSignature({dt, pt}, {dt}));
OP_REQUIRES_OK(ctx, ctx->GetAttr("keep_dims", &keep_dims_));
}
@@ -266,20 +270,19 @@ struct ReduceFunctorBase {
}
template <typename OUT_T>
- static void FillIdentity(const Device& d, OUT_T out,
- const Reducer& reducer) {
+ static void FillIdentity(const Device& d, OUT_T out, const Reducer& reducer) {
FillIdentityEigenImpl(d, out, reducer);
}
};
template <typename Reducer>
struct ReduceFunctor<CPUDevice, Reducer>
- : ReduceFunctorBase<CPUDevice, Reducer>{};
+ : ReduceFunctorBase<CPUDevice, Reducer> {};
#if TENSORFLOW_USE_SYCL
template <typename Reducer>
struct ReduceFunctor<SYCLDevice, Reducer>
- : ReduceFunctorBase<SYCLDevice, Reducer>{};
-#endif // TENSORFLOW_USE_SYCL
+ : ReduceFunctorBase<SYCLDevice, Reducer> {};
+#endif // TENSORFLOW_USE_SYCL
} // namespace functor
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_max.cc b/tensorflow/core/kernels/reduction_ops_max.cc
index 4ca5c11a48..9cf953f4bf 100644
--- a/tensorflow/core/kernels/reduction_ops_max.cc
+++ b/tensorflow/core/kernels/reduction_ops_max.cc
@@ -17,26 +17,39 @@ limitations under the License.
namespace tensorflow {
-#define REGISTER_CPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Max") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx"), \
- ReductionOp<CPUDevice, type, Eigen::internal::MaxReducer<type>>);
+#define REGISTER_CPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Max") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx"), \
+ ReductionOp<CPUDevice, type, int32, Eigen::internal::MaxReducer<type>>); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Max") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx"), \
+ ReductionOp<CPUDevice, type, int64, Eigen::internal::MaxReducer<type>>);
TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
#undef REGISTER_CPU_KERNELS
#if GOOGLE_CUDA
-#define REGISTER_GPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Max") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<GPUDevice, type, Eigen::internal::MaxReducer<type>>);
+#define REGISTER_GPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Max") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int32, Eigen::internal::MaxReducer<type>>); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Max") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int64, Eigen::internal::MaxReducer<type>>);
REGISTER_GPU_KERNELS(float);
REGISTER_GPU_KERNELS(double);
REGISTER_GPU_KERNELS(int64);
@@ -52,21 +65,37 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("output")
.TypeConstraint<int32>("T")
.TypeConstraint<int32>("Tidx"),
- ReductionOp<CPUDevice, int32, Eigen::internal::MaxReducer<int32>>);
+ ReductionOp<CPUDevice, int32, int32, Eigen::internal::MaxReducer<int32>>);
+REGISTER_KERNEL_BUILDER(
+ Name("Max")
+ .Device(DEVICE_GPU)
+ .HostMemory("reduction_indices")
+ .HostMemory("input")
+ .HostMemory("output")
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tidx"),
+ ReductionOp<CPUDevice, int32, int64, Eigen::internal::MaxReducer<int32>>);
#undef REGISTER_GPU_KERNELS
#endif
#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Max") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<SYCLDevice, type, Eigen::internal::MaxReducer<type>>);
+#define REGISTER_SYCL_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Max") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int32, \
+ Eigen::internal::MaxReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Max") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int64, \
+ Eigen::internal::MaxReducer<type>>);
REGISTER_SYCL_KERNELS(float);
REGISTER_SYCL_KERNELS(double);
@@ -78,8 +107,17 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("output")
.TypeConstraint<int32>("T")
.TypeConstraint<int32>("Tidx"),
- ReductionOp<CPUDevice, int32, Eigen::internal::MaxReducer<int32>>);
+ ReductionOp<CPUDevice, int32, int32, Eigen::internal::MaxReducer<int32>>);
+REGISTER_KERNEL_BUILDER(
+ Name("Max")
+ .Device(DEVICE_SYCL)
+ .HostMemory("reduction_indices")
+ .HostMemory("input")
+ .HostMemory("output")
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tidx"),
+ ReductionOp<CPUDevice, int32, int64, Eigen::internal::MaxReducer<int32>>);
#undef REGISTER_SYCL_KERNELS
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_mean.cc b/tensorflow/core/kernels/reduction_ops_mean.cc
index 5b01de8ddb..f61589f913 100644
--- a/tensorflow/core/kernels/reduction_ops_mean.cc
+++ b/tensorflow/core/kernels/reduction_ops_mean.cc
@@ -17,26 +17,39 @@ limitations under the License.
namespace tensorflow {
-#define REGISTER_CPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Mean") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx"), \
- ReductionOp<CPUDevice, type, Eigen::internal::MeanReducer<type>>);
+#define REGISTER_CPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Mean") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx"), \
+ ReductionOp<CPUDevice, type, int32, \
+ Eigen::internal::MeanReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Mean") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx"), \
+ ReductionOp<CPUDevice, type, int64, \
+ Eigen::internal::MeanReducer<type>>);
TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
#undef REGISTER_CPU_KERNELS
#if GOOGLE_CUDA
-#define REGISTER_GPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Mean") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<GPUDevice, type, Eigen::internal::MeanReducer<type>>);
+#define REGISTER_GPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Mean") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int32, \
+ Eigen::internal::MeanReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Mean") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int64, \
+ Eigen::internal::MeanReducer<type>>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
TF_CALL_complex64(REGISTER_GPU_KERNELS);
TF_CALL_complex128(REGISTER_GPU_KERNELS);
@@ -45,17 +58,24 @@ TF_CALL_complex128(REGISTER_GPU_KERNELS);
#endif
#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Mean") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<SYCLDevice, type, Eigen::internal::MeanReducer<type>>);
+#define REGISTER_SYCL_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Mean") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int32, \
+ Eigen::internal::MeanReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Mean") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int64, \
+ Eigen::internal::MeanReducer<type>>);
REGISTER_SYCL_KERNELS(float);
REGISTER_SYCL_KERNELS(double);
#undef REGISTER_SYCL_KERNELS
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_min.cc b/tensorflow/core/kernels/reduction_ops_min.cc
index 1e394bea41..807ac0a456 100644
--- a/tensorflow/core/kernels/reduction_ops_min.cc
+++ b/tensorflow/core/kernels/reduction_ops_min.cc
@@ -17,26 +17,39 @@ limitations under the License.
namespace tensorflow {
-#define REGISTER_CPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Min") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx"), \
- ReductionOp<CPUDevice, type, Eigen::internal::MinReducer<type>>);
+#define REGISTER_CPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Min") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx"), \
+ ReductionOp<CPUDevice, type, int32, Eigen::internal::MinReducer<type>>); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Min") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx"), \
+ ReductionOp<CPUDevice, type, int64, Eigen::internal::MinReducer<type>>);
TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
#undef REGISTER_CPU_KERNELS
#if GOOGLE_CUDA
-#define REGISTER_GPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Min") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<GPUDevice, type, Eigen::internal::MinReducer<type>>);
+#define REGISTER_GPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Min") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int32, Eigen::internal::MinReducer<type>>); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Min") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int64, Eigen::internal::MinReducer<type>>);
REGISTER_GPU_KERNELS(float);
REGISTER_GPU_KERNELS(double);
@@ -51,21 +64,37 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("output")
.TypeConstraint<int32>("T")
.TypeConstraint<int32>("Tidx"),
- ReductionOp<CPUDevice, int32, Eigen::internal::MinReducer<int32>>);
+ ReductionOp<CPUDevice, int32, int32, Eigen::internal::MinReducer<int32>>);
+REGISTER_KERNEL_BUILDER(
+ Name("Min")
+ .Device(DEVICE_GPU)
+ .HostMemory("reduction_indices")
+ .HostMemory("input")
+ .HostMemory("output")
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tidx"),
+ ReductionOp<CPUDevice, int32, int64, Eigen::internal::MinReducer<int32>>);
#undef REGISTER_GPU_KERNELS
#endif
#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Min") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<SYCLDevice, type, Eigen::internal::MinReducer<type>>);
+#define REGISTER_SYCL_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Min") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int32, \
+ Eigen::internal::MinReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Min") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int64, \
+ Eigen::internal::MinReducer<type>>);
REGISTER_SYCL_KERNELS(float);
REGISTER_SYCL_KERNELS(double);
@@ -77,8 +106,17 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("output")
.TypeConstraint<int32>("T")
.TypeConstraint<int32>("Tidx"),
- ReductionOp<CPUDevice, int32, Eigen::internal::MinReducer<int32>>);
+ ReductionOp<CPUDevice, int32, int32, Eigen::internal::MinReducer<int32>>);
+REGISTER_KERNEL_BUILDER(
+ Name("Min")
+ .Device(DEVICE_SYCL)
+ .HostMemory("reduction_indices")
+ .HostMemory("input")
+ .HostMemory("output")
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tidx"),
+ ReductionOp<CPUDevice, int32, int64, Eigen::internal::MinReducer<int32>>);
#undef REGISTER_SYCL_KERNELS
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_prod.cc b/tensorflow/core/kernels/reduction_ops_prod.cc
index 33f6ae6bae..e9b23df746 100644
--- a/tensorflow/core/kernels/reduction_ops_prod.cc
+++ b/tensorflow/core/kernels/reduction_ops_prod.cc
@@ -17,26 +17,39 @@ limitations under the License.
namespace tensorflow {
-#define REGISTER_CPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Prod") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx"), \
- ReductionOp<CPUDevice, type, Eigen::internal::ProdReducer<type>>);
+#define REGISTER_CPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Prod") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx"), \
+ ReductionOp<CPUDevice, type, int32, \
+ Eigen::internal::ProdReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Prod") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx"), \
+ ReductionOp<CPUDevice, type, int64, \
+ Eigen::internal::ProdReducer<type>>);
TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
#undef REGISTER_CPU_KERNELS
#if GOOGLE_CUDA
-#define REGISTER_GPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Prod") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<GPUDevice, type, Eigen::internal::ProdReducer<type>>);
+#define REGISTER_GPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Prod") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int32, \
+ Eigen::internal::ProdReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Prod") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int64, \
+ Eigen::internal::ProdReducer<type>>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
TF_CALL_int32(REGISTER_GPU_KERNELS);
TF_CALL_complex64(REGISTER_GPU_KERNELS);
@@ -46,18 +59,25 @@ TF_CALL_complex128(REGISTER_GPU_KERNELS);
#endif
#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Prod") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<SYCLDevice, type, Eigen::internal::ProdReducer<type>>);
+#define REGISTER_SYCL_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Prod") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int32, \
+ Eigen::internal::ProdReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Prod") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int64, \
+ Eigen::internal::ProdReducer<type>>);
REGISTER_SYCL_KERNELS(int32);
REGISTER_SYCL_KERNELS(float);
REGISTER_SYCL_KERNELS(double);
#undef REGISTER_SYCL_KERNELS
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_sum.cc b/tensorflow/core/kernels/reduction_ops_sum.cc
index c1f4f3475a..5318d8c133 100644
--- a/tensorflow/core/kernels/reduction_ops_sum.cc
+++ b/tensorflow/core/kernels/reduction_ops_sum.cc
@@ -17,26 +17,39 @@ limitations under the License.
namespace tensorflow {
-#define REGISTER_CPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Sum") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx"), \
- ReductionOp<CPUDevice, type, Eigen::internal::SumReducer<type>>);
+#define REGISTER_CPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Sum") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx"), \
+ ReductionOp<CPUDevice, type, int32, Eigen::internal::SumReducer<type>>); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Sum") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx"), \
+ ReductionOp<CPUDevice, type, int64, Eigen::internal::SumReducer<type>>);
TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
#undef REGISTER_CPU_KERNELS
#if GOOGLE_CUDA
-#define REGISTER_GPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Sum") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<GPUDevice, type, Eigen::internal::SumReducer<type>>);
+#define REGISTER_GPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Sum") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int32, Eigen::internal::SumReducer<type>>); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Sum") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<GPUDevice, type, int64, Eigen::internal::SumReducer<type>>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
TF_CALL_complex64(REGISTER_GPU_KERNELS);
TF_CALL_complex128(REGISTER_GPU_KERNELS);
@@ -53,19 +66,35 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("input")
.HostMemory("output")
.HostMemory("reduction_indices"),
- ReductionOp<CPUDevice, int32, Eigen::internal::SumReducer<int32>>);
+ ReductionOp<CPUDevice, int32, int32, Eigen::internal::SumReducer<int32>>);
+REGISTER_KERNEL_BUILDER(
+ Name("Sum")
+ .Device(DEVICE_GPU)
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tidx")
+ .HostMemory("input")
+ .HostMemory("output")
+ .HostMemory("reduction_indices"),
+ ReductionOp<CPUDevice, int32, int64, Eigen::internal::SumReducer<int32>>);
#endif
#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Sum") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("reduction_indices"), \
- ReductionOp<SYCLDevice, type, Eigen::internal::SumReducer<type>>);
+#define REGISTER_SYCL_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER(Name("Sum") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int32, \
+ Eigen::internal::SumReducer<type>>); \
+ REGISTER_KERNEL_BUILDER(Name("Sum") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("reduction_indices"), \
+ ReductionOp<SYCLDevice, type, int64, \
+ Eigen::internal::SumReducer<type>>);
REGISTER_SYCL_KERNELS(float);
REGISTER_SYCL_KERNELS(double);
@@ -77,8 +106,17 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("input")
.HostMemory("output")
.HostMemory("reduction_indices"),
- ReductionOp<CPUDevice, int32, Eigen::internal::SumReducer<int32>>);
+ ReductionOp<CPUDevice, int32, int32, Eigen::internal::SumReducer<int32>>);
+REGISTER_KERNEL_BUILDER(
+ Name("Sum")
+ .Device(DEVICE_SYCL)
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int64>("Tidx")
+ .HostMemory("input")
+ .HostMemory("output")
+ .HostMemory("reduction_indices"),
+ ReductionOp<CPUDevice, int32, int64, Eigen::internal::SumReducer<int32>>);
#undef REGISTER_SYCL_KERNELS
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/resize_bicubic_op.cc b/tensorflow/core/kernels/resize_bicubic_op.cc
index 1c43e77e7c..1a9cf4c640 100644
--- a/tensorflow/core/kernels/resize_bicubic_op.cc
+++ b/tensorflow/core/kernels/resize_bicubic_op.cc
@@ -20,7 +20,6 @@ limitations under the License.
#include <algorithm>
#include <array>
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
@@ -29,6 +28,7 @@ limitations under the License.
#include "tensorflow/core/kernels/image_resizer_state.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/platform/logging.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
namespace tensorflow {
namespace {
@@ -235,6 +235,7 @@ inline void interpolate_with_caching(
const T* input_b_ptr = input_data.data();
float* output_y_ptr = output_data.data();
+ std::vector<float> cached_value(num_channels == 3 ? 0 : 4 * num_channels, 0);
for (int64 b = 0; b < resizer_state.batch_size;
++b, input_b_ptr += in_batch_width) {
@@ -248,6 +249,7 @@ inline void interpolate_with_caching(
const T* y_ptr_1 = input_b_ptr + y_wai.index_1 * in_row_width;
const T* y_ptr_2 = input_b_ptr + y_wai.index_2 * in_row_width;
const T* y_ptr_3 = input_b_ptr + y_wai.index_3 * in_row_width;
+
if (num_channels == 3) {
// Manually unroll case of 3 channels.
float cached_value_0[4] = {0};
@@ -330,48 +332,61 @@ inline void interpolate_with_caching(
x_wai.weight_2, x_wai.weight_3);
}
} else {
- for (int64 c = 0; c < num_channels; ++c) {
- float cached_value[4] = {0};
- for (int64 x = 0; x < resizer_state.out_width; ++x) {
- const WeightsAndIndices& x_wai = x_wais[x];
- // Shift values in cached_value to fill first 'advance' values.
- switch (x_wai.advance) {
- case 3:
- cached_value[0] = cached_value[1];
- cached_value[1] = cached_value[2];
- cached_value[2] = cached_value[3];
- break;
- case 2:
- cached_value[0] = cached_value[2];
- cached_value[1] = cached_value[3];
- break;
- case 1: {
- cached_value[0] = cached_value[3];
- break;
+ for (int64 x = 0; x < resizer_state.out_width; ++x) {
+ const WeightsAndIndices& x_wai = x_wais[x];
+ // Shift values in cached_value to fill first 'advance' values.
+ switch (x_wai.advance) {
+ case 3:
+ for (int64 c = 0; c < num_channels; ++c) {
+ cached_value[4 * c + 0] = cached_value[4 * c + 1];
+ cached_value[4 * c + 1] = cached_value[4 * c + 2];
+ cached_value[4 * c + 2] = cached_value[4 * c + 3];
+ }
+ break;
+ case 2:
+ for (int64 c = 0; c < num_channels; ++c) {
+ cached_value[4 * c + 0] = cached_value[4 * c + 2];
+ cached_value[4 * c + 1] = cached_value[4 * c + 3];
+ }
+ break;
+ case 1: {
+ for (int64 c = 0; c < num_channels; ++c) {
+ cached_value[4 * c + 0] = cached_value[4 * c + 3];
}
+ break;
}
+ }
- // Set the remaining '4-advance' values by computing.
- switch (x_wai.advance) {
- case 0:
- cached_value[0] = ComputeYInterpolation(
+ // Set the remaining '4-advance' values by computing.
+ switch (x_wai.advance) {
+ case 0:
+ for (int64 c = 0; c < num_channels; ++c) {
+ cached_value[4 * c + 0] = ComputeYInterpolation(
0, c, y_wai, y_ptr_0, y_ptr_1, y_ptr_2, y_ptr_3, x_wai);
- TF_FALLTHROUGH_INTENDED;
- case 1:
- cached_value[1] = ComputeYInterpolation(
+ }
+ TF_FALLTHROUGH_INTENDED;
+ case 1:
+ for (int64 c = 0; c < num_channels; ++c) {
+ cached_value[4 * c + 1] = ComputeYInterpolation(
1, c, y_wai, y_ptr_0, y_ptr_1, y_ptr_2, y_ptr_3, x_wai);
- TF_FALLTHROUGH_INTENDED;
- case 2:
- cached_value[2] = ComputeYInterpolation(
+ }
+ TF_FALLTHROUGH_INTENDED;
+ case 2:
+ for (int64 c = 0; c < num_channels; ++c) {
+ cached_value[4 * c + 2] = ComputeYInterpolation(
2, c, y_wai, y_ptr_0, y_ptr_1, y_ptr_2, y_ptr_3, x_wai);
- TF_FALLTHROUGH_INTENDED;
- case 3:
- cached_value[3] = ComputeYInterpolation(
+ }
+ TF_FALLTHROUGH_INTENDED;
+ case 3:
+ for (int64 c = 0; c < num_channels; ++c) {
+ cached_value[4 * c + 3] = ComputeYInterpolation(
3, c, y_wai, y_ptr_0, y_ptr_1, y_ptr_2, y_ptr_3, x_wai);
- break;
- }
+ }
+ break;
+ }
+ for (int64 c = 0; c < num_channels; ++c) {
output_y_ptr[x * num_channels + c] =
- Compute(cached_value, x_wai.weight_0, x_wai.weight_1,
+ Compute(&cached_value[4 * c], x_wai.weight_0, x_wai.weight_1,
x_wai.weight_2, x_wai.weight_3);
}
}
diff --git a/tensorflow/core/kernels/resize_bicubic_op_test.cc b/tensorflow/core/kernels/resize_bicubic_op_test.cc
index ae14d2804e..9e10fec423 100644
--- a/tensorflow/core/kernels/resize_bicubic_op_test.cc
+++ b/tensorflow/core/kernels/resize_bicubic_op_test.cc
@@ -251,14 +251,15 @@ TEST_F(ResizeBicubicOpTest, TestAreaRandomDataSeveralInputsSizes4Channels) {
RunManyRandomTests(4);
}
-static Graph* ResizeBicubic(int batch_size, int size, int channels) {
+static Graph* ResizeBicubic(int batch_size, int size, int channels,
+ float scale_y = 0.3, float scale_x = 0.7) {
Graph* g = new Graph(OpRegistry::Global());
Tensor input(DT_FLOAT, TensorShape({batch_size, size, size, channels}));
input.flat<float>().setRandom();
Tensor shape(DT_INT32, TensorShape({2}));
auto shape_t = shape.flat<int32>();
- shape_t(0) = 0.3 * size;
- shape_t(1) = 0.7 * size;
+ shape_t(0) = scale_y * size;
+ shape_t(1) = scale_x * size;
test::graph::Binary(g, "ResizeBicubic", test::graph::Constant(g, input),
test::graph::Constant(g, shape));
return g;
@@ -285,4 +286,17 @@ BM_ResizeBicubicDev(32, 128, 3);
BM_ResizeBicubicDev(32, 512, 3);
BM_ResizeBicubicDev(32, 1024, 3);
+#define BM_ResizeBicubicExpand(BATCH, SIZE, CHANNELS) \
+ static void BM_ResizeBicubicExpand##_##BATCH##_##SIZE##_##CHANNELS(int iters) { \
+ testing::ItemsProcessed(static_cast<int64>(iters) * BATCH * SIZE * SIZE * \
+ CHANNELS * 8 * 8); \
+ test::Benchmark("cpu", ResizeBicubic(BATCH, SIZE, CHANNELS, 8, 8)) \
+ .Run(iters); \
+ } \
+ BENCHMARK(BM_ResizeBicubicExpand##_##BATCH##_##SIZE##_##CHANNELS);
+
+BM_ResizeBicubicExpand(12, 48, 1);
+BM_ResizeBicubicExpand(12, 48, 3);
+BM_ResizeBicubicExpand(12, 48, 40);
+
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/reverse_sequence_op.cc b/tensorflow/core/kernels/reverse_sequence_op.cc
index 505c512cc4..d1980d4b65 100644
--- a/tensorflow/core/kernels/reverse_sequence_op.cc
+++ b/tensorflow/core/kernels/reverse_sequence_op.cc
@@ -175,6 +175,7 @@ class ReverseSequenceOp : public OpKernel {
REGISTER_REVERSE_SEQUENCE(type, int64);
TF_CALL_NUMBER_TYPES(REGISTER_REVERSE_SEQUENCE_LEN);
+TF_CALL_bool(REGISTER_REVERSE_SEQUENCE_LEN);
#if GOOGLE_CUDA
@@ -200,6 +201,7 @@ namespace functor {
DECLARE_GPU_SPEC_LEN(T, 5);
TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
+TF_CALL_bool(DECLARE_GPU_SPECS);
} // namespace functor
@@ -215,6 +217,7 @@ TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPECS);
REGISTER_REVERSE_SEQUENCE_GPU(type, int64);
TF_CALL_GPU_NUMBER_TYPES(REGISTER_REVERSE_SEQUENCE_GPU_LEN);
+TF_CALL_bool(REGISTER_REVERSE_SEQUENCE_GPU_LEN);
#undef REGISTER_REVERSE_SEQUENCE_GPU
diff --git a/tensorflow/core/kernels/reverse_sequence_op_gpu.cu.cc b/tensorflow/core/kernels/reverse_sequence_op_gpu.cu.cc
index 373fd60687..cb49f14525 100644
--- a/tensorflow/core/kernels/reverse_sequence_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/reverse_sequence_op_gpu.cu.cc
@@ -39,6 +39,7 @@ typedef Eigen::GpuDevice GPUDevice;
DEFINE_GPU_SPEC_LEN(T, 5);
TF_CALL_GPU_NUMBER_TYPES(DEFINE_GPU_SPECS);
+TF_CALL_bool(DEFINE_GPU_SPECS);
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/scan_ops.cc b/tensorflow/core/kernels/scan_ops.cc
index cc434ab0ae..0a6848361a 100644
--- a/tensorflow/core/kernels/scan_ops.cc
+++ b/tensorflow/core/kernels/scan_ops.cc
@@ -35,7 +35,7 @@ namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
-template <typename Device, class T, typename Reducer>
+template <typename Device, class T, typename Reducer, typename Tidx>
class ScanOp : public OpKernel {
public:
explicit ScanOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
@@ -51,8 +51,9 @@ class ScanOp : public OpKernel {
errors::InvalidArgument("ScanOp: axis must be a scalar, not ",
tensor_axis.shape().DebugString()));
- const int axis_arg = internal::SubtleMustCopy(tensor_axis.scalar<int>()());
- const int axis = (axis_arg < 0) ? input.dims() + axis_arg : axis_arg;
+ const Tidx axis_arg =
+ internal::SubtleMustCopy(tensor_axis.scalar<Tidx>()());
+ const Tidx axis = (axis_arg < 0) ? input.dims() + axis_arg : axis_arg;
OP_REQUIRES(ctx, FastBoundsCheck(axis, input.dims()),
errors::InvalidArgument(
"ScanOp: Expected scan axis in the range [", -input.dims(),
@@ -70,11 +71,11 @@ class ScanOp : public OpKernel {
// Dim reduction.
int64 reduced_shape[3] = {1, 1, 1};
- for (int i = 0; i < axis; ++i) {
+ for (Tidx i = 0; i < axis; ++i) {
reduced_shape[0] *= input.dim_size(i);
}
reduced_shape[1] = input.dim_size(axis);
- for (int i = axis + 1; i < input.dims(); ++i) {
+ for (Tidx i = axis + 1; i < input.dims(); ++i) {
reduced_shape[2] *= input.dim_size(i);
}
@@ -112,51 +113,76 @@ TF_CALL_GPU_NUMBER_TYPES(DECLARE_FOR_ALL_REDUCERS);
} // namespace functor
#endif // GOOGLE_CUDA
-
// Register Cumsum kernels
-#define REGISTER_CPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Cumsum") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx"), \
- ScanOp<CPUDevice, type, Eigen::internal::SumReducer<type>>)
+#define REGISTER_CPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumsum") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx"), \
+ ScanOp<CPUDevice, type, Eigen::internal::SumReducer<type>, int32>) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumsum") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx"), \
+ ScanOp<CPUDevice, type, Eigen::internal::SumReducer<type>, int64>)
TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
#undef REGISTER_CPU_KERNELS
#if GOOGLE_CUDA
-#define REGISTER_GPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Cumsum") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("axis"), \
- ScanOp<GPUDevice, type, Eigen::internal::SumReducer<type>>)
+#define REGISTER_GPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumsum") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("axis"), \
+ ScanOp<GPUDevice, type, Eigen::internal::SumReducer<type>, int32>) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumsum") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("axis"), \
+ ScanOp<GPUDevice, type, Eigen::internal::SumReducer<type>, int64>)
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS)
#undef REGISTER_GPU_KERNELS
#endif // GOOGLE_CUDA
// Register Cumprod kernels
-#define REGISTER_CPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Cumprod") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx"), \
- ScanOp<CPUDevice, type, Eigen::internal::ProdReducer<type>>)
+#define REGISTER_CPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumprod") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx"), \
+ ScanOp<CPUDevice, type, Eigen::internal::ProdReducer<type>, int32>) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumprod") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx"), \
+ ScanOp<CPUDevice, type, Eigen::internal::ProdReducer<type>, int64>)
TF_CALL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
#undef REGISTER_CPU_KERNELS
#if GOOGLE_CUDA
-#define REGISTER_GPU_KERNELS(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Cumprod") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<type>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("axis"), \
- ScanOp<GPUDevice, type, Eigen::internal::ProdReducer<type>>)
+#define REGISTER_GPU_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumprod") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tidx") \
+ .HostMemory("axis"), \
+ ScanOp<GPUDevice, type, Eigen::internal::ProdReducer<type>, int32>) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Cumprod") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tidx") \
+ .HostMemory("axis"), \
+ ScanOp<GPUDevice, type, Eigen::internal::ProdReducer<type>, int64>)
TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS)
#undef REGISTER_GPU_KERNELS
#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/sequence_ops.cc b/tensorflow/core/kernels/sequence_ops.cc
index c8ea923020..e2e3758d87 100644
--- a/tensorflow/core/kernels/sequence_ops.cc
+++ b/tensorflow/core/kernels/sequence_ops.cc
@@ -96,7 +96,7 @@ TF_CALL_double(REGISTER_SYCL_KERNEL);
TF_CALL_int32(REGISTER_SYCL_KERNEL);
TF_CALL_int64(REGISTER_SYCL_KERNEL);
#undef REGISTER_SYCL_KERNEL
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
TF_CALL_float(REGISTER_CPU_KERNEL);
TF_CALL_double(REGISTER_CPU_KERNEL);
@@ -116,7 +116,7 @@ TF_CALL_int64(REGISTER_GPU_KERNEL);
#undef REGISTER_CPU_KERNEL
#undef REGISTER_GPU_KERNEL
-template <typename T>
+template <typename T, typename Tnum>
class LinSpaceOp : public OpKernel {
public:
explicit LinSpaceOp(OpKernelConstruction* context) : OpKernel(context) {}
@@ -136,7 +136,7 @@ class LinSpaceOp : public OpKernel {
num_in.shape().DebugString()));
const T start = start_in.scalar<T>()();
const T stop = stop_in.scalar<T>()();
- const int32 num = num_in.scalar<int32>()();
+ const Tnum num = num_in.scalar<Tnum>()();
OP_REQUIRES(context, num > 0,
errors::InvalidArgument("Requires num > 0: ", num));
Tensor* out = nullptr;
@@ -147,34 +147,46 @@ class LinSpaceOp : public OpKernel {
flat(0) = start;
} else {
const T step = (stop - start) / (num - 1);
- for (int32 i = 0; i < num; ++i) flat(i) = start + step * i;
+ for (Tnum i = 0; i < num; ++i) flat(i) = start + step * i;
}
}
};
-#define REGISTER_KERNEL(DEV, T) \
- REGISTER_KERNEL_BUILDER(Name("LinSpace") \
- .Device(DEV) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tidx") \
- .HostMemory("start") \
- .HostMemory("stop") \
- .HostMemory("num") \
- .HostMemory("output"), \
- LinSpaceOp<T>);
-#define REGISTER_CPU_KERNEL(T) REGISTER_KERNEL(DEVICE_CPU, T)
+#define REGISTER_KERNEL(DEV, T, Tidx) \
+ REGISTER_KERNEL_BUILDER(Name("LinSpace") \
+ .Device(DEV) \
+ .TypeConstraint<T>("T") \
+ .TypeConstraint<Tidx>("Tidx") \
+ .HostMemory("start") \
+ .HostMemory("stop") \
+ .HostMemory("num") \
+ .HostMemory("output"), \
+ LinSpaceOp<T, Tidx>);
+
+#define REGISTER_KERNEL_ALL_NUMS(dev, T) \
+ REGISTER_KERNEL(dev, T, int32); \
+ REGISTER_KERNEL(dev, T, int64)
+
+#define REGISTER_CPU_KERNEL(T) REGISTER_KERNEL_ALL_NUMS(DEVICE_CPU, T)
TF_CALL_float(REGISTER_CPU_KERNEL);
TF_CALL_double(REGISTER_CPU_KERNEL);
// NOTE(touts): We register the op on GPU but it still runs on CPU
// because its inputs and outputs are tagged as HostMemory.
-#define REGISTER_GPU_KERNEL(T) REGISTER_KERNEL(DEVICE_GPU, T)
+#define REGISTER_GPU_KERNEL(T) REGISTER_KERNEL_ALL_NUMS(DEVICE_GPU, T)
TF_CALL_float(REGISTER_GPU_KERNEL);
TF_CALL_double(REGISTER_GPU_KERNEL);
+#undef REGISTER_GPU_KERNEL
#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNEL(T) REGISTER_KERNEL(DEVICE_SYCL, T)
+#define REGISTER_SYCL_KERNEL(T) REGISTER_KERNEL_ALL_NUMS(DEVICE_SYCL, T)
TF_CALL_float(REGISTER_SYCL_KERNEL);
TF_CALL_double(REGISTER_SYCL_KERNEL);
-#endif // TENSORFLOW_USE_SYCL
+#undef REGISTER_SYCL_KERNEL
+#endif // TENSORFLOW_USE_SYCL
+
+#undef REGISTER_CPU_KERNEL
+#undef REGISTER_KERNEL_ALL_NUMS
+#undef REGISTER_KERNEL
+
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/sequence_ops_test.cc b/tensorflow/core/kernels/sequence_ops_test.cc
new file mode 100644
index 0000000000..5f0e0a69a8
--- /dev/null
+++ b/tensorflow/core/kernels/sequence_ops_test.cc
@@ -0,0 +1,148 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/fake_input.h"
+#include "tensorflow/core/framework/node_def_builder.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/framework/types.pb.h"
+#include "tensorflow/core/kernels/ops_testutil.h"
+#include "tensorflow/core/kernels/ops_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace {
+
+class RangeOpTest : public OpsTestBase {
+ protected:
+ void MakeOp(DataType input_type) {
+ TF_ASSERT_OK(NodeDefBuilder("myop", "Range")
+ .Input(FakeInput(input_type))
+ .Input(FakeInput(input_type))
+ .Input(FakeInput(input_type))
+ .Finalize(node_def()));
+ TF_ASSERT_OK(InitOp());
+ }
+};
+
+class LinSpaceOpTest : public OpsTestBase {
+ protected:
+ void MakeOp(DataType input_type, DataType index_type) {
+ TF_ASSERT_OK(NodeDefBuilder("myop", "LinSpace")
+ .Input(FakeInput(input_type))
+ .Input(FakeInput(input_type))
+ .Input(FakeInput(index_type))
+ .Finalize(node_def()));
+ TF_ASSERT_OK(InitOp());
+ }
+};
+
+TEST_F(RangeOpTest, Simple_D32) {
+ MakeOp(DT_INT32);
+
+ // Feed and run
+ AddInputFromArray<int32>(TensorShape({}), {0});
+ AddInputFromArray<int32>(TensorShape({}), {10});
+ AddInputFromArray<int32>(TensorShape({}), {2});
+ TF_ASSERT_OK(RunOpKernel());
+
+ // Check the output
+ Tensor expected(allocator(), DT_INT32, TensorShape({5}));
+ test::FillValues<int32>(&expected, {0, 2, 4, 6, 8});
+ test::ExpectTensorEqual<int32>(expected, *GetOutput(0));
+}
+
+TEST_F(RangeOpTest, Simple_Float) {
+ MakeOp(DT_FLOAT);
+
+ // Feed and run
+ AddInputFromArray<float>(TensorShape({}), {0.5});
+ AddInputFromArray<float>(TensorShape({}), {2});
+ AddInputFromArray<float>(TensorShape({}), {0.3});
+ TF_ASSERT_OK(RunOpKernel());
+
+ // Check the output
+ Tensor expected(allocator(), DT_FLOAT, TensorShape({5}));
+ test::FillValues<float>(&expected, {0.5, 0.8, 1.1, 1.4, 1.7});
+ test::ExpectTensorEqual<float>(expected, *GetOutput(0));
+}
+
+TEST_F(RangeOpTest, Large_Double) {
+ MakeOp(DT_DOUBLE);
+
+ // Feed and run
+ AddInputFromArray<double>(TensorShape({}), {0.0});
+ AddInputFromArray<double>(TensorShape({}), {10000});
+ AddInputFromArray<double>(TensorShape({}), {0.5});
+ TF_ASSERT_OK(RunOpKernel());
+
+ // Check the output
+ Tensor expected(allocator(), DT_DOUBLE, TensorShape({20000}));
+ std::vector<double> result;
+ for (int32 i = 0; i < 20000; ++i) result.push_back(i * 0.5);
+ test::FillValues<double>(&expected, gtl::ArraySlice<double>(result));
+ test::ExpectTensorEqual<double>(expected, *GetOutput(0));
+}
+
+TEST_F(LinSpaceOpTest, Simple_D32) {
+ MakeOp(DT_FLOAT, DT_INT32);
+
+ // Feed and run
+ AddInputFromArray<float>(TensorShape({}), {3.0});
+ AddInputFromArray<float>(TensorShape({}), {7.0});
+ AddInputFromArray<int32>(TensorShape({}), {3});
+ TF_ASSERT_OK(RunOpKernel());
+
+ // Check the output
+ Tensor expected(allocator(), DT_FLOAT, TensorShape({3}));
+ test::FillValues<float>(&expected, {3.0, 5.0, 7.0});
+ test::ExpectTensorEqual<float>(expected, *GetOutput(0));
+}
+
+TEST_F(LinSpaceOpTest, Single_D64) {
+ MakeOp(DT_FLOAT, DT_INT64);
+
+ // Feed and run
+ AddInputFromArray<float>(TensorShape({}), {9.0});
+ AddInputFromArray<float>(TensorShape({}), {100.0});
+ AddInputFromArray<int64>(TensorShape({}), {1});
+ TF_ASSERT_OK(RunOpKernel());
+
+ // Check the output
+ Tensor expected(allocator(), DT_FLOAT, TensorShape({1}));
+ test::FillValues<float>(&expected, {9.0});
+ test::ExpectTensorEqual<float>(expected, *GetOutput(0));
+}
+
+TEST_F(LinSpaceOpTest, Simple_Double) {
+ MakeOp(DT_DOUBLE, DT_INT32);
+
+ // Feed and run
+ AddInputFromArray<double>(TensorShape({}), {5.0});
+ AddInputFromArray<double>(TensorShape({}), {6.0});
+ AddInputFromArray<int32>(TensorShape({}), {6});
+ TF_ASSERT_OK(RunOpKernel());
+
+ // Check the output
+ Tensor expected(allocator(), DT_DOUBLE, TensorShape({6}));
+ test::FillValues<double>(&expected, {5.0, 5.2, 5.4, 5.6, 5.8, 6.0});
+ test::ExpectTensorEqual<double>(expected, *GetOutput(0));
+}
+
+} // namespace
+} // namespace tensorflow
diff --git a/tensorflow/core/kernels/spacetobatch_op.cc b/tensorflow/core/kernels/spacetobatch_op.cc
index c513683918..95c1f5e7e8 100644
--- a/tensorflow/core/kernels/spacetobatch_op.cc
+++ b/tensorflow/core/kernels/spacetobatch_op.cc
@@ -248,40 +248,34 @@ class SpaceToBatchOp : public OpKernel {
Tensor block_shape_;
};
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("SpaceToBatchND") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tblock_shape") \
- .TypeConstraint<int32>("Tpaddings") \
- .HostMemory("block_shape") \
- .HostMemory("paddings"), \
- SpaceToBatchNDOp<CPUDevice, T>); \
- REGISTER_KERNEL_BUILDER(Name("SpaceToBatch") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tpaddings") \
- .HostMemory("paddings"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("SpaceToBatchND") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("block_shape") \
+ .HostMemory("paddings"), \
+ SpaceToBatchNDOp<CPUDevice, T>); \
+ REGISTER_KERNEL_BUILDER(Name("SpaceToBatch") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("paddings"), \
SpaceToBatchOp<CPUDevice, T>);
TF_CALL_REAL_NUMBER_TYPES(REGISTER);
#undef REGISTER
#if GOOGLE_CUDA
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("SpaceToBatchND") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tblock_shape") \
- .TypeConstraint<int32>("Tpaddings") \
- .HostMemory("block_shape") \
- .HostMemory("paddings"), \
- SpaceToBatchNDOp<GPUDevice, T>); \
- REGISTER_KERNEL_BUILDER(Name("SpaceToBatch") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tpaddings") \
- .HostMemory("paddings"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("SpaceToBatchND") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("block_shape") \
+ .HostMemory("paddings"), \
+ SpaceToBatchNDOp<GPUDevice, T>); \
+ REGISTER_KERNEL_BUILDER(Name("SpaceToBatch") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("paddings"), \
SpaceToBatchOp<GPUDevice, T>);
TF_CALL_GPU_NUMBER_TYPES(REGISTER);
diff --git a/tensorflow/core/kernels/sparse_matmul_op.h b/tensorflow/core/kernels/sparse_matmul_op.h
index 308b641b54..cca52558ae 100644
--- a/tensorflow/core/kernels/sparse_matmul_op.h
+++ b/tensorflow/core/kernels/sparse_matmul_op.h
@@ -54,8 +54,9 @@ EIGEN_DEVICE_FUNC inline Packet pexpand_bf16_u(const Packet& from) {
}
// Specialization non-scalar version on non-sse.
+// Enable vectorization on z13 and higher
#if defined(EIGEN_VECTORIZE_ALTIVEC) || defined(EIGEN_VECTORIZE_VSX) || \
- defined(EIGEN_VECTORIZE_NEON)
+ defined(EIGEN_VECTORIZE_NEON) || defined(EIGEN_VECTORIZE_ZVECTOR)
template <typename Packet>
EIGEN_DEVICE_FUNC inline Packet4f pexpand_bf16_l(const Packet4f& from) {
float r[4];
@@ -126,8 +127,9 @@ EIGEN_DEVICE_FUNC inline Packet pload2bf16(
}
// Specialization for pload4bf16 and pload2bf16 for non-sse.
+// Enable vectorization on z13 and higher.
#if defined(EIGEN_VECTORIZE_ALTIVEC) || defined(EIGEN_VECTORIZE_VSX) || \
- defined(EIGEN_VECTORIZE_NEON)
+ defined(EIGEN_VECTORIZE_NEON) || defined(EIGEN_VECTORIZE_ZVECTOR)
template <>
EIGEN_STRONG_INLINE Packet4f pload4bf16<Packet4f>(const float* from) {
tensorflow::uint32 p[4];
diff --git a/tensorflow/core/kernels/stage_op.cc b/tensorflow/core/kernels/stage_op.cc
index 1717428adf..0fae46dea6 100644
--- a/tensorflow/core/kernels/stage_op.cc
+++ b/tensorflow/core/kernels/stage_op.cc
@@ -53,7 +53,10 @@ class Buffer : public ResourceBase {
void notify_inserters_if_bounded(std::unique_lock<std::mutex>* lock) {
if (IsBounded()) {
lock->unlock();
- full_cond_var_.notify_one();
+ // Notify all inserters. The removal of an element
+ // may make memory available for many inserters
+ // to insert new elements
+ full_cond_var_.notify_all();
}
}
@@ -115,9 +118,12 @@ class Buffer : public ResourceBase {
buf_.push_back(std::move(*tuple));
lock.unlock();
- // maybe possible to optimize by reducing
- // how often this signal is sent
- non_empty_cond_var_.notify_one();
+ // Notify all removers. Removers
+ // may be peeking at a specific element or waiting
+ // for the element at the front of the deque.
+ // As we don't know the appropriate one to wake up
+ // we should wake them all.
+ non_empty_cond_var_.notify_all();
return Status::OK();
}
diff --git a/tensorflow/core/kernels/stateless_random_ops.cc b/tensorflow/core/kernels/stateless_random_ops.cc
index 79d0c07acd..f6fb0a121d 100644
--- a/tensorflow/core/kernels/stateless_random_ops.cc
+++ b/tensorflow/core/kernels/stateless_random_ops.cc
@@ -137,7 +137,6 @@ TF_CALL_double(REGISTER);
.Device(DEVICE_GPU) \
.HostMemory("shape") \
.HostMemory("seed") \
- .TypeConstraint<int32>("T") \
.TypeConstraint<TYPE>("dtype"), \
StatelessRandomOp<GPUDevice, random::UniformDistribution< \
random::PhiloxRandom, TYPE> >); \
@@ -146,7 +145,6 @@ TF_CALL_double(REGISTER);
.Device(DEVICE_GPU) \
.HostMemory("shape") \
.HostMemory("seed") \
- .TypeConstraint<int32>("T") \
.TypeConstraint<TYPE>("dtype"), \
StatelessRandomOp<GPUDevice, random::NormalDistribution< \
random::PhiloxRandom, TYPE> >); \
@@ -155,7 +153,6 @@ TF_CALL_double(REGISTER);
.Device(DEVICE_GPU) \
.HostMemory("shape") \
.HostMemory("seed") \
- .TypeConstraint<int32>("T") \
.TypeConstraint<TYPE>("dtype"), \
StatelessRandomOp< \
GPUDevice, \
diff --git a/tensorflow/core/kernels/tile_functor.h b/tensorflow/core/kernels/tile_functor.h
index 28af2dace3..189be9239b 100644
--- a/tensorflow/core/kernels/tile_functor.h
+++ b/tensorflow/core/kernels/tile_functor.h
@@ -17,6 +17,7 @@ limitations under the License.
#define TENSORFLOW_KERNELS_TILE_FUNCTOR_H_
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_types.h"
#include "tensorflow/core/platform/types.h"
@@ -29,13 +30,13 @@ namespace internal {
template <typename Device, typename T>
void TileSimple(const Device& d, Tensor* out, const Tensor& in);
-template <typename Device, typename T, int NDIM>
+template <typename Device, typename T, typename Tmultiples, int NDIM>
void TileUsingEigen(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int32>& broadcast_array) {
+ const gtl::ArraySlice<Tmultiples>& broadcast_array) {
auto x = in.tensor<T, NDIM>();
auto y = out->tensor<T, NDIM>();
- Eigen::array<int32, NDIM> b;
+ Eigen::array<Tmultiples, NDIM> b;
for (int i = 0; i < NDIM; ++i) b[i] = broadcast_array[i];
if (Eigen::internal::is_same<Device, Eigen::GpuDevice>::value) {
// Use 32bit indexing to speed up the computations
@@ -45,9 +46,9 @@ void TileUsingEigen(const Device& d, Tensor* out, const Tensor& in,
}
}
-template <typename Device, typename T>
+template <typename Device, typename T, typename Tmultiples>
void TileUsingEigen(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int32>&) {
+ const gtl::ArraySlice<Tmultiples>&) {
auto x = in.tensor<T, 0>();
auto y = out->tensor<T, 0>();
// In the scalar case we simply copy the input.
@@ -58,34 +59,42 @@ void TileUsingEigen(const Device& d, Tensor* out, const Tensor& in,
namespace functor {
-template <typename Device, typename T>
+template <typename Device, typename T, typename Tmultiples>
struct Tile {
void operator()(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int32> broadcast_array) const {
+ const gtl::ArraySlice<Tmultiples> broadcast_array) const {
switch (in.dims()) {
case 0:
- internal::TileUsingEigen<Device, T>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples>(d, out, in,
+ broadcast_array);
break;
case 1:
- internal::TileUsingEigen<Device, T, 1>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples, 1>(d, out, in,
+ broadcast_array);
break;
case 2:
- internal::TileUsingEigen<Device, T, 2>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples, 2>(d, out, in,
+ broadcast_array);
break;
case 3:
- internal::TileUsingEigen<Device, T, 3>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples, 3>(d, out, in,
+ broadcast_array);
break;
case 4:
- internal::TileUsingEigen<Device, T, 4>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples, 4>(d, out, in,
+ broadcast_array);
break;
case 5:
- internal::TileUsingEigen<Device, T, 5>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples, 5>(d, out, in,
+ broadcast_array);
break;
case 6:
- internal::TileUsingEigen<Device, T, 6>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples, 6>(d, out, in,
+ broadcast_array);
break;
case 7:
- internal::TileUsingEigen<Device, T, 7>(d, out, in, broadcast_array);
+ internal::TileUsingEigen<Device, T, Tmultiples, 7>(d, out, in,
+ broadcast_array);
break;
default:
internal::TileSimple<Device, T>(d, out, in);
diff --git a/tensorflow/core/kernels/tile_functor_cpu.cc b/tensorflow/core/kernels/tile_functor_cpu.cc
index 5952d49221..b2fd669541 100644
--- a/tensorflow/core/kernels/tile_functor_cpu.cc
+++ b/tensorflow/core/kernels/tile_functor_cpu.cc
@@ -15,10 +15,10 @@ limitations under the License.
#define EIGEN_USE_THREADS
+#include "tensorflow/core/kernels/tile_functor.h"
#include "tensorflow/core/framework/attr_value.pb.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/kernels/ops_util.h"
-#include "tensorflow/core/kernels/tile_functor.h"
namespace tensorflow {
@@ -51,7 +51,9 @@ namespace functor {
typedef Eigen::ThreadPoolDevice CPUDevice;
// Register functors used for Tile functor.
-#define DEFINE_TYPE(T) template struct Tile<CPUDevice, T>;
+#define DEFINE_TYPE(T) \
+ template struct Tile<CPUDevice, T, int32>; \
+ template struct Tile<CPUDevice, T, int64>;
TF_CALL_bool(DEFINE_TYPE);
TF_CALL_float(DEFINE_TYPE);
@@ -70,7 +72,9 @@ TF_CALL_string(DEFINE_TYPE);
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
-#define DEFINE_TYPE(T) template struct Tile<SYCLDevice, T>;
+#define DEFINE_TYPE(T) \
+ template struct Tile<SYCLDevice, T, int32>; \
+ template struct Tile<SYCLDevice, T, int64>;
TF_CALL_bool(DEFINE_TYPE);
TF_CALL_float(DEFINE_TYPE);
@@ -81,7 +85,7 @@ TF_CALL_int16(DEFINE_TYPE);
TF_CALL_int64(DEFINE_TYPE);
#undef DEFINE_TYPE
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // end namespace functor
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/tile_functor_gpu.cu.cc b/tensorflow/core/kernels/tile_functor_gpu.cu.cc
index 1c61c3030a..5a36e7567b 100644
--- a/tensorflow/core/kernels/tile_functor_gpu.cu.cc
+++ b/tensorflow/core/kernels/tile_functor_gpu.cu.cc
@@ -18,10 +18,11 @@ limitations under the License.
#define EIGEN_USE_GPU
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-#include "tensorflow/core/kernels/tile_functor.h"
+
+#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/kernels/ops_util.h"
+#include "tensorflow/core/kernels/tile_functor.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
-#include "tensorflow/core/framework/register_types.h"
namespace tensorflow {
namespace internal {
@@ -60,7 +61,8 @@ void TileSimple(const Device& d, Tensor* out, const Tensor& in) {
host_buf[ndims + i] = out_strides[i];
host_buf[ndims * 2 + i] = in.dim_size(i);
}
- // Copies the input strides, output strides and input dimension sizes to the device.
+ // Copies the input strides, output strides and input dimension sizes to the
+ // device.
auto num_bytes = sizeof(int64) * host_buf.size();
auto dev_buf = d.allocate(num_bytes);
// NOTE: host_buf is not allocated by CudaHostAllocator, and
@@ -84,7 +86,9 @@ namespace functor {
typedef Eigen::GpuDevice GPUDevice;
// Register functors used for Tile functor.
-#define DEFINE_TYPE(T) template struct Tile<GPUDevice, T>;
+#define DEFINE_TYPE(T) \
+ template struct Tile<GPUDevice, T, int32>; \
+ template struct Tile<GPUDevice, T, int64>;
TF_CALL_int16(DEFINE_TYPE);
TF_CALL_int32(DEFINE_TYPE);
diff --git a/tensorflow/core/kernels/tile_ops.cc b/tensorflow/core/kernels/tile_ops.cc
index c49ebc0685..4c496a12c2 100644
--- a/tensorflow/core/kernels/tile_ops.cc
+++ b/tensorflow/core/kernels/tile_ops.cc
@@ -42,14 +42,14 @@ typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
// Forward declarations of functors that will be defined in tile_ops_impl.h
namespace functor {
-template <typename Device, typename T>
+template <typename Device, typename T, typename Tmultiple>
struct Tile {
void operator()(const Device& d, Tensor* out, const Tensor& in,
- const gtl::ArraySlice<int32> broadcast_array) const;
+ const gtl::ArraySlice<Tmultiple> broadcast_array) const;
};
template <typename Device, typename T, int NDIM>
@@ -80,7 +80,7 @@ struct ReduceAndReshape {
} // namespace functor
// --------------------------------------------------------------------------
-template <typename Device>
+template <typename Device, typename Tmultiples>
class TileOp : public OpKernel {
public:
explicit TileOp(OpKernelConstruction* context) : OpKernel(context) {}
@@ -105,8 +105,8 @@ class TileOp : public OpKernel {
return;
}
- const gtl::ArraySlice<int32> multiples_array(multiples.flat<int32>().data(),
- input_dims);
+ const gtl::ArraySlice<Tmultiples> multiples_array(
+ multiples.flat<Tmultiples>().data(), input_dims);
TensorShape output_shape;
for (int i = 0; i < input_dims; ++i) {
OP_REQUIRES(
@@ -125,10 +125,10 @@ class TileOp : public OpKernel {
// If there's no output, there's nothing to do.
if (output_shape.num_elements() == 0) return;
-#define HANDLE_TYPE(DT) \
- if (context->input(0).dtype() == DT) { \
- HandleCase<DT>(context, multiples_array, result); \
- return; \
+#define HANDLE_TYPE(DT) \
+ if (context->input(0).dtype() == DT) { \
+ HandleCase<DT>(context, multiples_array, result); \
+ return; \
}
#define HANDLE_TYPE_NAME(T) HANDLE_TYPE(DataTypeToEnum<T>::value)
@@ -158,27 +158,27 @@ class TileOp : public OpKernel {
private:
template <DataType DT>
void HandleCaseImpl(OpKernelContext* context,
- const gtl::ArraySlice<int32>& multiples_array,
+ const gtl::ArraySlice<Tmultiples>& multiples_array,
Tensor* result) {
typedef typename EnumToDataType<DT>::Type T;
- functor::Tile<Device, T>() (
- context->eigen_device<Device>(), result,
- context->input(0), multiples_array);
+ functor::Tile<Device, T, Tmultiples>()(context->eigen_device<Device>(),
+ result, context->input(0),
+ multiples_array);
}
template <DataType DT>
void HandleCase(OpKernelContext* context,
- const gtl::ArraySlice<int32>& multiples_array,
+ const gtl::ArraySlice<Tmultiples>& multiples_array,
Tensor* result);
TF_DISALLOW_COPY_AND_ASSIGN(TileOp);
};
-template <typename Device>
+template <typename Device, typename Tmultiples>
template <DataType DT>
-inline void TileOp<Device>::HandleCase(
- OpKernelContext* context, const gtl::ArraySlice<int32>& multiples_array,
- Tensor* result) {
+inline void TileOp<Device, Tmultiples>::HandleCase(
+ OpKernelContext* context,
+ const gtl::ArraySlice<Tmultiples>& multiples_array, Tensor* result) {
// TODO(vrv): print out the device name if useful. Currently disabled to avoid
// having to use RTTI.
LOG(FATAL) << "TileOp: Invalid combination of Device, DT: "
@@ -186,25 +186,28 @@ inline void TileOp<Device>::HandleCase(
<< DataTypeString(DT);
}
-#define HANDLE_CASE(device, dtype) \
- template <> \
- template <> \
- void TileOp<device>::HandleCase<dtype>( \
- OpKernelContext * context, \
- const gtl::ArraySlice<int32>& multiples_array, Tensor* result) { \
- HandleCaseImpl<dtype>(context, multiples_array, result); \
+#define HANDLE_CASE(device, dtype, Tmultiples) \
+ template <> \
+ template <> \
+ void TileOp<device, Tmultiples>::HandleCase<dtype>( \
+ OpKernelContext * context, \
+ const gtl::ArraySlice<Tmultiples>& multiples_array, Tensor* result) { \
+ HandleCaseImpl<dtype>(context, multiples_array, result); \
}
-#define HANDLE_TYPE_NAME_CPU(T) \
- HANDLE_CASE(CPUDevice, DataTypeToEnum<T>::value);
+#define HANDLE_TYPE_NAME_CPU(T) \
+ HANDLE_CASE(CPUDevice, DataTypeToEnum<T>::value, int32); \
+ HANDLE_CASE(CPUDevice, DataTypeToEnum<T>::value, int64);
-#define HANDLE_TYPE_NAME_GPU(T) \
- HANDLE_CASE(GPUDevice, DataTypeToEnum<T>::value);
+#define HANDLE_TYPE_NAME_GPU(T) \
+ HANDLE_CASE(GPUDevice, DataTypeToEnum<T>::value, int32); \
+ HANDLE_CASE(GPUDevice, DataTypeToEnum<T>::value, int64);
#ifdef TENSORFLOW_USE_SYCL
-#define HANDLE_TYPE_NAME_SYCL(T) \
- HANDLE_CASE(SYCLDevice, DataTypeToEnum<T>::value);
-#endif // TENSORFLOW_USE_SYCL
+#define HANDLE_TYPE_NAME_SYCL(T) \
+ HANDLE_CASE(SYCLDevice, DataTypeToEnum<T>::value, int32); \
+ HANDLE_CASE(SYCLDevice, DataTypeToEnum<T>::value, int64);
+#endif // TENSORFLOW_USE_SYCL
TF_CALL_bool(HANDLE_TYPE_NAME_CPU);
TF_CALL_float(HANDLE_TYPE_NAME_CPU);
@@ -235,13 +238,13 @@ TF_CALL_double(HANDLE_TYPE_NAME_SYCL);
TF_CALL_int16(HANDLE_TYPE_NAME_SYCL);
TF_CALL_int32(HANDLE_TYPE_NAME_SYCL);
TF_CALL_int64(HANDLE_TYPE_NAME_SYCL);
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
#undef HANDLE_TYPE_NAME_CPU
#undef HANDLE_TYPE_NAME_GPU
#ifdef TENSORFLOW_USE_SYCL
#undef HANDLE_TYPE_NAME_SYCL
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
#undef HANDLE_CASE
// --------------------------------------------------------------------------
@@ -494,7 +497,7 @@ TF_CALL_int16(HANDLE_TYPE_NAME_SYCL);
TF_CALL_int32(HANDLE_TYPE_NAME_SYCL);
TF_CALL_int64(HANDLE_TYPE_NAME_SYCL);
#undef HANDLE_TYPE_NAME_SYCL
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
#undef HANDLE_TYPE_NAME_CPU
#undef HANDLE_TYPE_NAME_GPU
@@ -505,127 +508,73 @@ REGISTER_KERNEL_BUILDER(Name("Tile")
.Device(DEVICE_CPU)
.HostMemory("multiples")
.TypeConstraint<int32>("Tmultiples"),
- TileOp<CPUDevice>);
+ TileOp<CPUDevice, int32>);
+REGISTER_KERNEL_BUILDER(Name("Tile")
+ .Device(DEVICE_CPU)
+ .HostMemory("multiples")
+ .TypeConstraint<int64>("Tmultiples"),
+ TileOp<CPUDevice, int64>);
REGISTER_KERNEL_BUILDER(
Name("TileGrad").Device(DEVICE_CPU).HostMemory("multiples"),
TileGradientOp<CPUDevice>);
#if GOOGLE_CUDA
-
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_GPU)
- .TypeConstraint<float>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_GPU)
- .TypeConstraint<double>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_GPU)
- .TypeConstraint<Eigen::half>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_GPU)
- .TypeConstraint<int16>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_GPU)
- .TypeConstraint<int32>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_GPU)
- .TypeConstraint<complex64>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_GPU)
- .TypeConstraint<complex128>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<GPUDevice>);
-
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_GPU)
- .TypeConstraint<float>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_GPU)
- .TypeConstraint<double>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_GPU)
- .TypeConstraint<Eigen::half>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_GPU)
- .TypeConstraint<int16>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_GPU)
- .TypeConstraint<int32>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_GPU)
- .TypeConstraint<complex64>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<GPUDevice>);
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_GPU)
- .TypeConstraint<complex128>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<GPUDevice>);
-
+#define REGISTER_GPU(type) \
+ REGISTER_KERNEL_BUILDER(Name("Tile") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tmultiples") \
+ .HostMemory("multiples"), \
+ TileOp<GPUDevice, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("Tile") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tmultiples") \
+ .HostMemory("multiples"), \
+ TileOp<GPUDevice, int64>); \
+ REGISTER_KERNEL_BUILDER(Name("TileGrad") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tmultiples") \
+ .HostMemory("multiples"), \
+ TileGradientOp<GPUDevice>);
+
+TF_CALL_float(REGISTER_GPU);
+TF_CALL_double(REGISTER_GPU);
+TF_CALL_half(REGISTER_GPU);
+TF_CALL_int16(REGISTER_GPU);
+TF_CALL_int32(REGISTER_GPU);
+TF_CALL_complex64(REGISTER_GPU);
+TF_CALL_complex128(REGISTER_GPU)
+
+#undef REGISTER_GPU
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_SYCL)
- .TypeConstraint<float>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<SYCLDevice>);
-REGISTER_KERNEL_BUILDER(Name("Tile")
- .Device(DEVICE_SYCL)
- .TypeConstraint<double>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileOp<SYCLDevice>);
-
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_SYCL)
- .TypeConstraint<float>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<SYCLDevice>);
-REGISTER_KERNEL_BUILDER(Name("TileGrad")
- .Device(DEVICE_SYCL)
- .TypeConstraint<double>("T")
- .TypeConstraint<int32>("Tmultiples")
- .HostMemory("multiples"),
- TileGradientOp<SYCLDevice>);
-#endif // TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL(type) \
+ REGISTER_KERNEL_BUILDER(Name("Tile") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tmultiples") \
+ .HostMemory("multiples"), \
+ TileOp<SYCLDevice, int32>); \
+ REGISTER_KERNEL_BUILDER(Name("Tile") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int64>("Tmultiples") \
+ .HostMemory("multiples"), \
+ TileOp<SYCLDevice, int64>); \
+ REGISTER_KERNEL_BUILDER(Name("TileGrad") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T") \
+ .TypeConstraint<int32>("Tmultiples") \
+ .HostMemory("multiples"), \
+ TileGradientOp<SYCLDevice>);
+
+ TF_CALL_float(REGISTER_SYCL);
+TF_CALL_double(REGISTER_SYCL);
+
+#undef REGISTER_SYCL
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/transpose_op.cc b/tensorflow/core/kernels/transpose_op.cc
index e151b38d90..20f0edf309 100644
--- a/tensorflow/core/kernels/transpose_op.cc
+++ b/tensorflow/core/kernels/transpose_op.cc
@@ -91,6 +91,26 @@ REGISTER_KERNEL_BUILDER(Name("InvertPermutation")
InvertPermutationOp);
#endif // TENSORFLOW_USE_SYCL
+namespace {
+template <typename Tperm>
+Status PermutationHelper(const Tensor& perm, const int dims,
+ std::vector<int32>* permutation) {
+ auto Vperm = perm.vec<Tperm>();
+ if (dims != Vperm.size()) {
+ return errors::InvalidArgument("transpose expects a vector of size ", dims,
+ ". But input(1) is a vector of size ",
+ Vperm.size());
+ }
+ // using volatile instead of SubtleMustCopy here so that the
+ // asynchrony boundary is permutation.
+ const volatile Tperm* perm_begin =
+ reinterpret_cast<const volatile Tperm*>(Vperm.data());
+ *permutation = std::vector<int32>(perm_begin, perm_begin + dims);
+
+ return Status::OK();
+}
+} // namespace
+
// output = TransposeOp(T<any> input, T<int32> perm) takes a tensor
// of type T and rank N, and a permutation of 0, 1, ..., N-1. It
// shuffles the dimensions of the input tensor according to permutation.
@@ -113,17 +133,16 @@ void TransposeOp::Compute(OpKernelContext* ctx) {
OP_REQUIRES(ctx, TensorShapeUtils::IsVector(perm.shape()),
errors::InvalidArgument("perm must be a vector, not ",
perm.shape().DebugString()));
- auto Vperm = perm.vec<int32>();
+
+ // Although Tperm may be an int64 type, an int32 is sufficient to hold
+ // dimension range values, so the narrowing here should be safe.
+ std::vector<int32> permutation;
const int dims = input.dims();
- OP_REQUIRES(ctx, dims == Vperm.size(),
- errors::InvalidArgument(
- "transpose expects a vector of size ", input.dims(),
- ". But input(1) is a vector of size ", Vperm.size()));
- // using volatile instead of SubtleMustCopy here so that the
- // asynchrony boundary is permutation.
- const volatile int32* perm_begin =
- reinterpret_cast<const volatile int32*>(Vperm.data());
- const std::vector<int32> permutation(perm_begin, perm_begin + dims);
+ if (perm.dtype() == DT_INT32) {
+ OP_REQUIRES_OK(ctx, PermutationHelper<int32>(perm, dims, &permutation));
+ } else {
+ OP_REQUIRES_OK(ctx, PermutationHelper<int64>(perm, dims, &permutation));
+ }
TensorShape shape;
// Check whether permutation is a permutation of integers of [0 .. dims).
@@ -142,10 +161,9 @@ void TransposeOp::Compute(OpKernelContext* ctx) {
}
}
for (int i = 0; i < dims; ++i) {
- OP_REQUIRES(
- ctx, bits[i],
- errors::InvalidArgument(i, " is missing from {",
- str_util::Join(permutation, ","), "}."));
+ OP_REQUIRES(ctx, bits[i], errors::InvalidArgument(
+ i, " is missing from {",
+ str_util::Join(permutation, ","), "}."));
}
// 0-D, 1-D, and identity transposes do nothing.
@@ -185,18 +203,16 @@ Status ConjugateTransposeCpuOp::DoTranspose(OpKernelContext* ctx,
}
#ifdef INTEL_MKL
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("Transpose") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
- MklTransposeCpuOp); \
- REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("Transpose") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
+ MklTransposeCpuOp); \
+ REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
MklConjugateTransposeCpuOp);
TF_CALL_ALL_TYPES(REGISTER);
REGISTER(bfloat16);
@@ -204,18 +220,16 @@ REGISTER(bfloat16);
#else // INTEL_MKL
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("Transpose") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
- TransposeCpuOp); \
- REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
- .Device(DEVICE_CPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("Transpose") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
+ TransposeCpuOp); \
+ REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
+ .Device(DEVICE_CPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
ConjugateTransposeCpuOp);
TF_CALL_ALL_TYPES(REGISTER)
REGISTER(bfloat16);
@@ -238,18 +252,16 @@ Status ConjugateTransposeGpuOp::DoTranspose(OpKernelContext* ctx,
perm, out);
}
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("Transpose") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
- TransposeGpuOp); \
- REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
- .Device(DEVICE_GPU) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("Transpose") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
+ TransposeGpuOp); \
+ REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
ConjugateTransposeGpuOp);
TF_CALL_POD_TYPES(REGISTER);
#undef REGISTER
@@ -270,18 +282,16 @@ Status ConjugateTransposeSyclOp::DoTranspose(OpKernelContext* ctx,
return ::tensorflow::DoConjugateTranspose(ctx->eigen_device<SYCLDevice>(), in,
perm, out);
}
-#define REGISTER(T) \
- REGISTER_KERNEL_BUILDER(Name("Transpose") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
- TransposeSyclOp); \
- REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<T>("T") \
- .TypeConstraint<int32>("Tperm") \
- .HostMemory("perm"), \
+#define REGISTER(T) \
+ REGISTER_KERNEL_BUILDER(Name("Transpose") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
+ TransposeSyclOp); \
+ REGISTER_KERNEL_BUILDER(Name("ConjugateTranspose") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("perm"), \
ConjugateTransposeSyclOp);
TF_CALL_POD_TYPES(REGISTER);
#undef REGISTER
diff --git a/tensorflow/core/ops/array_ops.cc b/tensorflow/core/ops/array_ops.cc
index 14b87f0edf..c5935141f8 100644
--- a/tensorflow/core/ops/array_ops.cc
+++ b/tensorflow/core/ops/array_ops.cc
@@ -739,7 +739,7 @@ REGISTER_OP("Diag")
.Attr("T: {float, double, int32, int64, complex64, complex128}")
.SetShapeFn([](InferenceContext* c) {
ShapeHandle in = c->input(0);
- TF_RETURN_IF_ERROR(c->WithRankAtMost(in, 3, &in));
+ TF_RETURN_IF_ERROR(c->WithRankAtLeast(in, 1, &in));
// Output shape is original concatenated with itself.
ShapeHandle out;
TF_RETURN_IF_ERROR(c->Concatenate(in, in, &out));
@@ -767,7 +767,7 @@ tf.diag(diagonal) ==> [[1, 0, 0, 0]
[0, 0, 0, 4]]
```
-diagonal: Rank k tensor where k is at most 3.
+diagonal: Rank k tensor where k is at least 1.
)doc");
// --------------------------------------------------------------------------
@@ -783,9 +783,9 @@ REGISTER_OP("DiagPart")
}
// Rank must be even, and result will have rank <rank/2>.
const int32 rank = c->Rank(in);
- if ((rank % 2) != 0 || rank > 6) {
+ if ((rank % 2) != 0 || rank <= 0) {
return errors::InvalidArgument(
- "Input must have even rank <= 6, input rank is ", rank);
+ "Input must have even and non-zero rank, input rank is ", rank);
}
const int32 mid = rank / 2;
@@ -820,7 +820,7 @@ For example:
tf.diag_part(input) ==> [1, 2, 3, 4]
```
-input: Rank k tensor where k is 2, 4, or 6.
+input: Rank k tensor where k is even and not zero.
diagonal: The extracted diagonal.
)doc");
@@ -1175,7 +1175,7 @@ For example:
# [20, 21, 22, 23]]]]
# tensor 't' shape is [1, 2, 3, 4]
-# 'dims' is [3] or 'dims' is -1
+# 'dims' is [3] or 'dims' is [-1]
reverse(t, dims) ==> [[[[ 3, 2, 1, 0],
[ 7, 6, 5, 4],
[ 11, 10, 9, 8]],
@@ -2283,6 +2283,8 @@ size(t) ==> 12
namespace {
+// This SliceHelper computes the output shape of the `Slice` op
+// when the `sizes` tensor is available.
template <typename T>
Status SliceHelper(InferenceContext* c, ShapeHandle begin_value,
const Tensor* sizes_value,
@@ -2308,7 +2310,6 @@ Status SliceHelper(InferenceContext* c, ShapeHandle begin_value,
return Status::OK();
}
-
} // namespace
// --------------------------------------------------------------------------
@@ -2339,9 +2340,10 @@ REGISTER_OP("Slice")
ShapeHandle begin_value;
TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(1, &begin_value));
- // NOTE(mrry): We can't use `MakeShapeFromShapeTensor` for `sizes` because
- // it might contain -1, which can't be represented (-1 in the ShapeHandle
- // would mean "unknown".
+ // We check the tensor value here and will only use
+ // `MakeShapeFromShapeTensor` when `sizes_value` is null.
+      // The reason is that `sizes` might contain -1, which can't
+      // be represented (-1 in the ShapeHandle would mean "unknown").
const Tensor* sizes_value = c->input_tensor(2);
if (sizes_value != nullptr) {
@@ -2361,6 +2363,28 @@ REGISTER_OP("Slice")
c->set_output(0, c->MakeShape(dims));
return Status::OK();
} else {
+ // In case `sizes` is not available (`sizes_value` is null),
+ // we could try to use `MakeShapeFromShapeTensor` here.
+      // If `sizes` contains -1, we will simply consider it as `Unknown`.
+      // This is less than ideal but still an improvement to shape inference.
+ // The following is an example that returns [None, 1, None] with this
+ // code path:
+ // z = tf.zeros((1, 2, 3))
+ // m = tf.slice(z, [0, 0, 0], [tf.constant(1) + 0, 1, -1])
+ // m.get_shape().as_list()
+ ShapeHandle sizes_value;
+ TF_RETURN_IF_ERROR(c->MakeShapeFromShapeTensor(2, &sizes_value));
+ if (c->RankKnown(sizes_value)) {
+ TF_RETURN_IF_ERROR(
+ c->WithRank(begin_value, c->Rank(sizes_value), &begin_value));
+ std::vector<DimensionHandle> dims;
+ for (int i = 0; i < c->Rank(sizes_value); ++i) {
+ dims.emplace_back(c->Dim(sizes_value, i));
+ }
+ c->set_output(0, c->MakeShape(dims));
+ return Status::OK();
+ }
+
// We might know the rank of the input.
if (c->RankKnown(input)) {
c->set_output(0, c->UnknownShapeOfRank(c->Rank(input)));
diff --git a/tensorflow/core/ops/array_ops_test.cc b/tensorflow/core/ops/array_ops_test.cc
index a5d7a32e05..94eb120175 100644
--- a/tensorflow/core/ops/array_ops_test.cc
+++ b/tensorflow/core/ops/array_ops_test.cc
@@ -186,21 +186,20 @@ TEST(ArrayOpsTest, Identity_ShapeFnHandles) {
TEST(ArrayOpsTest, Diag_ShapeFn) {
ShapeInferenceTestOp op("Diag");
INFER_OK(op, "?", "?");
- INFER_OK(op, "[]", "[]");
INFER_OK(op, "[1,?,3]", "[d0_0,d0_1,d0_2,d0_0,d0_1,d0_2]");
- INFER_ERROR("Shape must be at most rank 3 but is rank 4", op, "[?,1,2,3]");
+ INFER_OK(op, "[?,1,2,3]", "[d0_0,d0_1,d0_2,d0_3,d0_0,d0_1,d0_2,d0_3]");
+ INFER_ERROR("Shape must be at least rank 1 but is rank 0", op, "[]");
}
TEST(ArrayOpsTest, DiagPart_ShapeFn) {
ShapeInferenceTestOp op("DiagPart");
INFER_OK(op, "?", "?");
- INFER_OK(op, "[]", "[]");
INFER_OK(op, "[1,?,?,4]", "[d0_0,d0_3]");
INFER_OK(op, "[1,?,3,?,4,3]", "[d0_0,d0_4,d0_2|d0_5]");
- INFER_ERROR("Input must have even rank <= 6, input rank is 1", op, "[?]");
- INFER_ERROR("Input must have even rank <= 6, input rank is 3", op, "[1,2,3]");
- INFER_ERROR("Input must have even rank <= 6, input rank is 8", op,
- "[1,2,3,?,?,?,?,?]");
+ INFER_OK(op, "[1,2,3,?,?,?,?,4]", "[d0_0,d0_1,d0_2,d0_7]");
+ INFER_ERROR("Input must have even and non-zero rank", op, "[]");
+ INFER_ERROR("Input must have even and non-zero rank", op, "[?]");
+ INFER_ERROR("Input must have even and non-zero rank", op, "[1,2,3]");
INFER_ERROR("Dimensions must be equal, but are 2 and 10", op, "[1,2,?,10]");
}
diff --git a/tensorflow/core/ops/image_ops.cc b/tensorflow/core/ops/image_ops.cc
index a44bac60bf..e9bf29d172 100644
--- a/tensorflow/core/ops/image_ops.cc
+++ b/tensorflow/core/ops/image_ops.cc
@@ -151,7 +151,7 @@ REGISTER_OP("ResizeArea")
.Input("images: T")
.Input("size: int32")
.Output("resized_images: float")
- .Attr("T: {uint8, int8, int16, int32, int64, half, float, double}")
+ .Attr("T: {int8, uint8, int16, uint16, int32, int64, half, float, double}")
.Attr("align_corners: bool = false")
.SetShapeFn(ResizeShapeFn)
.Doc(R"doc(
@@ -179,7 +179,7 @@ REGISTER_OP("ResizeBicubic")
.Input("images: T")
.Input("size: int32")
.Output("resized_images: float")
- .Attr("T: {uint8, int8, int16, int32, int64, half, float, double}")
+ .Attr("T: {int8, uint8, int16, uint16, int32, int64, half, float, double}")
.Attr("align_corners: bool = false")
.SetShapeFn(ResizeShapeFn)
.Doc(R"doc(
@@ -227,7 +227,7 @@ REGISTER_OP("ResizeBilinear")
.Input("images: T")
.Input("size: int32")
.Output("resized_images: float")
- .Attr("T: {uint8, int8, int16, int32, int64, half, float, double}")
+ .Attr("T: {int8, uint8, int16, uint16, int32, int64, half, float, double}")
.Attr("align_corners: bool = false")
.SetShapeFn(ResizeShapeFn)
.Doc(R"doc(
@@ -311,7 +311,7 @@ REGISTER_OP("ResizeNearestNeighbor")
.Input("images: T")
.Input("size: int32")
.Output("resized_images: T")
- .Attr("T: {uint8, int8, int16, int32, int64, half, float, double}")
+ .Attr("T: {int8, uint8, int16, uint16, int32, int64, half, float, double}")
.Attr("align_corners: bool = false")
.SetShapeFn(ResizeShapeFn)
.Doc(R"doc(
@@ -453,7 +453,36 @@ REGISTER_OP("DecodeAndCropJpeg")
.Attr("acceptable_fraction: float = 1.0")
.Attr("dct_method: string = ''")
.Output("image: uint8")
- .SetShapeFn(DecodeImageShapeFn)
+ .SetShapeFn([](InferenceContext* c) {
+ ShapeHandle unused;
+ TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &unused));
+ DimensionHandle channels_dim = c->UnknownDim();
+ DimensionHandle h = c->UnknownDim();
+ DimensionHandle w = c->UnknownDim();
+
+ int32 channels;
+ TF_RETURN_IF_ERROR(c->GetAttr("channels", &channels));
+ if (channels != 0) {
+ if (channels < 0) {
+ return errors::InvalidArgument("channels must be non-negative, got ",
+ channels);
+ }
+ channels_dim = c->MakeDim(channels);
+ }
+
+ DimensionHandle unused_dim;
+ TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 1, &unused));
+ TF_RETURN_IF_ERROR(c->WithValue(c->Dim(unused, 0), 4, &unused_dim));
+
+ const Tensor* crop_window = c->input_tensor(1);
+ if (crop_window != nullptr) {
+ auto crop_window_vec = crop_window->vec<int32>();
+ h = c->MakeDim(crop_window_vec(2));
+ w = c->MakeDim(crop_window_vec(3));
+ }
+ c->set_output(0, c->MakeShape({h, w, channels_dim}));
+ return Status::OK();
+ })
.Doc(strings::StrCat(R"doc(
Decode and Crop a JPEG-encoded image to a uint8 tensor.
)doc",
@@ -1068,7 +1097,7 @@ REGISTER_OP("CropAndResize")
.Input("box_ind: int32")
.Input("crop_size: int32")
.Output("crops: float")
- .Attr("T: {uint8, int8, int16, int32, int64, half, float, double}")
+ .Attr("T: {uint8, uint16, int8, int16, int32, int64, half, float, double}")
.Attr("method: {'bilinear'} = 'bilinear'")
.Attr("extrapolation_value: float = 0")
.SetShapeFn([](InferenceContext* c) {
@@ -1175,7 +1204,7 @@ REGISTER_OP("CropAndResizeGradBoxes")
.Input("boxes: float")
.Input("box_ind: int32")
.Output("output: float")
- .Attr("T: {uint8, int8, int16, int32, int64, half, float, double}")
+ .Attr("T: {uint8, uint16, int8, int16, int32, int64, half, float, double}")
.Attr("method: {'bilinear'} = 'bilinear'")
.SetShapeFn([](InferenceContext* c) {
c->set_output(0, c->input(2));
diff --git a/tensorflow/core/ops/image_ops_test.cc b/tensorflow/core/ops/image_ops_test.cc
index c34b11a15e..5f0b391b0d 100644
--- a/tensorflow/core/ops/image_ops_test.cc
+++ b/tensorflow/core/ops/image_ops_test.cc
@@ -105,7 +105,7 @@ TEST(ImageOpsTest, DecodeAndCropJpeg_ShapeFn) {
.Input({"img", 0, DT_STRING})
.Input({"crop_window", 1, DT_INT32})
.Finalize(&op.node_def));
- INFER_OK(op, "[];[]", "[?,?,?]");
+ INFER_OK(op, "[];[?]", "[?,?,?]");
// Set the channel, so that part of output shape is known.
TF_ASSERT_OK(NodeDefBuilder("test", op_name)
@@ -113,7 +113,7 @@ TEST(ImageOpsTest, DecodeAndCropJpeg_ShapeFn) {
.Input({"crop_window", 1, DT_INT32})
.Attr("channels", 4)
.Finalize(&op.node_def));
- INFER_OK(op, "[];[]", "[?,?,4]");
+ INFER_OK(op, "[];[?]", "[?,?,4]");
// Negative channel value is rejected.
TF_ASSERT_OK(NodeDefBuilder("test", op_name)
@@ -139,7 +139,7 @@ TEST(ImageOpsTest, DecodeAndCropJpeg_InvalidCropWindow) {
.Input({"img", 0, DT_STRING})
.Input({"crop_window", 1, DT_INT32})
.Finalize(&op.node_def));
- INFER_OK(op, "[];[]", "[?,?,?]");
+ INFER_OK(op, "[];[?]", "[?,?,?]");
}
TEST(ImageOpsTest, EncodeImage_ShapeFn) {
diff --git a/tensorflow/core/ops/math_ops.cc b/tensorflow/core/ops/math_ops.cc
index ab0bc258f7..61db896c51 100644
--- a/tensorflow/core/ops/math_ops.cc
+++ b/tensorflow/core/ops/math_ops.cc
@@ -49,6 +49,38 @@ inputs: Must all be the same size and shape.
// --------------------------------------------------------------------------
+// Note that the following operator is just a placeholder and has no
+// associated kernel. The code in accumulate_n_optimizer.cc replaces
+// this placeholder with a graph of operators that do have kernels.
+// The Python code that generates instances of this op is currently in
+// contrib/framework/python/ops/accumulate_n_v2.py
+REGISTER_OP("AccumulateNV2")
+ .Input("inputs: N * T")
+ .Output("sum: T")
+ .Attr("N: int >= 1")
+ .Attr("T: numbertype")
+ .Attr("shape: shape")
+ .SetIsCommutative()
+ .SetIsAggregate()
+ .SetShapeFn(shape_inference::ExplicitShape)
+ .Doc(R"doc(
+Returns the element-wise sum of a list of tensors.
+
+`tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
+wait for all of its inputs to be ready before beginning to sum. This can
+save memory if inputs are ready at different times, since minimum temporary
+storage is proportional to the output size rather than the inputs size.
+
+Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
+
+Returns a `Tensor` of same shape and type as the elements of `inputs`.
+
+inputs: A list of `Tensor` objects, each with same shape and type.
+shape: Shape of elements of `inputs`.
+)doc");
+
+// --------------------------------------------------------------------------
+
REGISTER_OP("BatchMatMul")
.Input("x: T")
.Input("y: T")
@@ -591,7 +623,7 @@ REGISTER_OP("TruncateDiv")
Returns x / y element-wise for integer types.
Truncation designates that negative numbers will round fractional quantities
-toward zero. I.e. -7 / 5 = 1. This matches C semantics but it is different
+toward zero. I.e. -7 / 5 = -1. This matches C semantics but it is different
than Python semantics. See `FloorDiv` for a division function that matches
Python Semantics.
@@ -2218,6 +2250,44 @@ product: Pairwise cross product of the vectors in `a` and `b`.
// --------------------------------------------------------------------------
+REGISTER_OP("HistogramFixedWidth")
+ .Input("values: T")
+ .Input("value_range: T")
+ .Input("nbins: int32")
+ .Output("out: dtype")
+ .Attr("T: {int32, int64, float32, float64}")
+ .Attr("dtype: {int32, int64} = DT_INT32")
+ .SetShapeFn([](InferenceContext* c) {
+ c->set_output(0, c->UnknownShapeOfRank(1));
+ return Status::OK();
+ })
+ .Doc(R"doc(
+Return histogram of values.
+
+Given the tensor `values`, this operation returns a rank 1 histogram counting
+the number of entries in `values` that fall into every bin. The bins are
+equal width and determined by the arguments `value_range` and `nbins`.
+
+```python
+# Bins will be: (-inf, 1), [1, 2), [2, 3), [3, 4), [4, inf)
+nbins = 5
+value_range = [0.0, 5.0]
+new_values = [-1.0, 0.0, 1.5, 2.0, 5.0, 15]
+
+with tf.get_default_session() as sess:
+ hist = tf.histogram_fixed_width(new_values, value_range, nbins=5)
+ variables.global_variables_initializer().run()
+ sess.run(hist) => [2, 1, 1, 0, 2]
+```
+
+values: Numeric `Tensor`.
+value_range: Shape [2] `Tensor` of same `dtype` as `values`.
+ values <= value_range[0] will be mapped to hist[0],
+ values >= value_range[1] will be mapped to hist[-1].
+nbins: Scalar `int32 Tensor`. Number of histogram bins.
+out: A 1-D `Tensor` holding histogram of values.
+)doc");
+
REGISTER_OP("Bincount")
.Input("arr: int32")
.Input("size: int32")
diff --git a/tensorflow/core/ops/nn_ops.cc b/tensorflow/core/ops/nn_ops.cc
index 5efa55b496..1d26660a4b 100644
--- a/tensorflow/core/ops/nn_ops.cc
+++ b/tensorflow/core/ops/nn_ops.cc
@@ -2260,6 +2260,56 @@ indices: The indices of `values` within the last dimension of `input`.
// --------------------------------------------------------------------------
+REGISTER_OP("NthElement")
+ .Input("input: T")
+ .Input("n: int32")
+ .Output("values: T")
+ .Attr("reverse: bool = false")
+ .Attr("T: realnumbertype")
+ .SetShapeFn([](InferenceContext* c) {
+ ShapeHandle input;
+ TF_RETURN_IF_ERROR(c->WithRankAtLeast(c->input(0), 1, &input));
+
+    // Get the n value from the input tensor, and make sure it is a scalar.
+ DimensionHandle n_dim;
+ TF_RETURN_IF_ERROR(c->MakeDimForScalarInput(1, &n_dim));
+
+ // The last dimension of input tensor must be greater than N.
+ DimensionHandle last_dim = c->Dim(input, -1);
+ if (c->ValueKnown(last_dim) && c->ValueKnown(n_dim) &&
+ c->Value(last_dim) <= c->Value(n_dim)) {
+ return errors::InvalidArgument(
+ "Input must have last dimension > n = ", c->Value(n_dim), " but is ",
+ c->Value(last_dim));
+ }
+
+ // Reduce last_dim for output tensor
+ ShapeHandle s;
+ TF_RETURN_IF_ERROR(c->Subshape(input, 0, -1, &s));
+ c->set_output(0, s);
+ return Status::OK();
+ })
+ .Doc(R"doc(
+Finds values of the `n`-th order statistic for the last dimension.
+
+If the input is a vector (rank-1), finds the entry which is the nth-smallest
+value in the vector and outputs its value as a scalar tensor.
+
+For matrices (resp. higher rank input), computes the entry which is the
+nth-smallest value in each row (resp. vector along the last dimension). Thus,
+
+ values.shape = input.shape[:-1]
+
+input: 1-D or higher with last dimension at least `n+1`.
+n: 0-D. Position of sorted vector to select along the last dimension (along
+  each row for matrices). Valid range of n is `[0, input.shape[-1])`
+reverse: When set to True, find the nth-largest value in the vector and vice
+ versa.
+values: The `n`-th order statistic along each last dimensional slice.
+)doc");
+
+// --------------------------------------------------------------------------
+
REGISTER_OP("FractionalMaxPool")
.Input("value: T")
.Output("output: T")
diff --git a/tensorflow/core/ops/nn_ops_test.cc b/tensorflow/core/ops/nn_ops_test.cc
index 4628b725f8..94ecf4d5db 100644
--- a/tensorflow/core/ops/nn_ops_test.cc
+++ b/tensorflow/core/ops/nn_ops_test.cc
@@ -81,6 +81,30 @@ TEST(NNOpsTest, TopKV2_ShapeFn) {
op, "[1,2,3,4];[]");
}
+TEST(NNOpsTest, NthElement_ShapeFn) {
+ ShapeInferenceTestOp op("NthElement");
+ op.input_tensors.resize(2);
+
+ Tensor n_t;
+ op.input_tensors[1] = &n_t;
+ n_t = test::AsScalar<int32>(20);
+
+ INFER_OK(op, "?;[]", "?");
+ INFER_OK(op, "[21];[]", "[]");
+ INFER_OK(op, "[2,?,?];[]", "[d0_0,d0_1]");
+ INFER_OK(op, "[?,3,?,21];[]", "[d0_0,d0_1,d0_2]");
+
+ INFER_ERROR("Shape must be at least rank 1 but is rank 0", op, "[];[]");
+ INFER_ERROR("Input must have last dimension > n = 20 but is 1", op,
+ "[1];[]");
+ INFER_ERROR("Input must have last dimension > n = 20 but is 20", op,
+ "[1,2,3,20];[]");
+ n_t = test::AsScalar<int32>(-1);
+ INFER_ERROR(
+ "Dimension size, given by scalar input 1, must be non-negative but is -1",
+ op, "[1,2,3,4];[]");
+}
+
TEST(NNOpsTest, BatchNormWithGlobalNormalization_ShapeFn) {
ShapeInferenceTestOp op("BatchNormWithGlobalNormalization");
diff --git a/tensorflow/core/platform/s3/s3_crypto.cc b/tensorflow/core/platform/s3/s3_crypto.cc
index 14bbed19a5..d7062a59d2 100644
--- a/tensorflow/core/platform/s3/s3_crypto.cc
+++ b/tensorflow/core/platform/s3/s3_crypto.cc
@@ -71,7 +71,7 @@ class S3Sha256OpenSSLImpl : public Aws::Utils::Crypto::Hash {
SHA256_Init(&sha256);
auto currentPos = stream.tellg();
- if (currentPos == -1) {
+ if (currentPos == std::streampos(std::streamoff(-1))) {
currentPos = 0;
stream.clear();
}
diff --git a/tensorflow/core/profiler/README.md b/tensorflow/core/profiler/README.md
index 92bce9c1ce..8ca26fa5dc 100644
--- a/tensorflow/core/profiler/README.md
+++ b/tensorflow/core/profiler/README.md
@@ -48,7 +48,7 @@ bazel-bin/tensorflow/python/profiler/profiler_ui \
# Create options to profile the time and memory information.
builder = tf.profiler.ProfileOptionBuilder
opts = builder(builder.time_and_memory()).order_by('micros').build()
-# Create a profiling context, set contructor argument `trace_steps`,
+# Create a profiling context, set constructor argument `trace_steps`,
# `dump_steps` to empty for explicit control.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir',
trace_steps=[],
diff --git a/tensorflow/core/profiler/g3doc/options.md b/tensorflow/core/profiler/g3doc/options.md
index ddee63ad42..4c73e372e3 100644
--- a/tensorflow/core/profiler/g3doc/options.md
+++ b/tensorflow/core/profiler/g3doc/options.md
@@ -43,7 +43,7 @@ In graph view, in means the number of hops in the <b>graph</b>.
### Times
-Most machines have mutli-core CPUs. Some installs one or more accelerators.
+Most machines have multi-core CPUs. Some machines also have one or more accelerators.
Each accelerator usually performs massive parallel processing. The profiler
tracks the accumulated processing times. Hence, the accumulated processing
time is likely larger than the time of each step.
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index ccb861c93a..5d2298f7b7 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -19,12 +19,12 @@ limitations under the License.
// TensorFlow uses semantic versioning, see http://semver.org/.
#define TF_MAJOR_VERSION 1
-#define TF_MINOR_VERSION 3
+#define TF_MINOR_VERSION 4
#define TF_PATCH_VERSION 0
// TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
// "-beta", "-rc", "-rc.1")
-#define TF_VERSION_SUFFIX ""
+#define TF_VERSION_SUFFIX "-rc0"
#define TF_STR_HELPER(x) #x
#define TF_STR(x) TF_STR_HELPER(x)
diff --git a/tensorflow/docs_src/api_guides/python/reading_data.md b/tensorflow/docs_src/api_guides/python/reading_data.md
index 7609ca91d0..b3ebaa0f0a 100644
--- a/tensorflow/docs_src/api_guides/python/reading_data.md
+++ b/tensorflow/docs_src/api_guides/python/reading_data.md
@@ -67,7 +67,7 @@ A typical queue-based pipeline for reading records from files has the following
8. Example queue
Warning: This section discusses implementing input pipelines using the
-queue-based APIs which can be cleanly replaced by the @{$datasets$Dataset API}.
+queue-based APIs which can be cleanly replaced by the @{$datasets$Datasets API}.
### Filenames, shuffling, and epoch limits
diff --git a/tensorflow/docs_src/get_started/estimator.md b/tensorflow/docs_src/get_started/estimator.md
index ab270d1408..790de6679b 100644
--- a/tensorflow/docs_src/get_started/estimator.md
+++ b/tensorflow/docs_src/get_started/estimator.md
@@ -28,7 +28,7 @@ from __future__ import division
from __future__ import print_function
import os
-import urllib
+from six.moves.urllib.request import urlopen
import numpy as np
import tensorflow as tf
@@ -44,13 +44,13 @@ IRIS_TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"
def main():
# If the training and test sets aren't stored locally, download them.
if not os.path.exists(IRIS_TRAINING):
- raw = urllib.urlopen(IRIS_TRAINING_URL).read()
- with open(IRIS_TRAINING, "w") as f:
+ raw = urlopen(IRIS_TRAINING_URL).read()
+ with open(IRIS_TRAINING, "wb") as f:
f.write(raw)
if not os.path.exists(IRIS_TEST):
- raw = urllib.urlopen(IRIS_TEST_URL).read()
- with open(IRIS_TEST, "w") as f:
+ raw = urlopen(IRIS_TEST_URL).read()
+ with open(IRIS_TEST, "wb") as f:
f.write(raw)
# Load datasets.
@@ -167,7 +167,7 @@ from __future__ import division
from __future__ import print_function
import os
-import urllib
+from six.moves.urllib.request import urlopen
import tensorflow as tf
import numpy as np
@@ -184,13 +184,13 @@ them.
```python
if not os.path.exists(IRIS_TRAINING):
- raw = urllib.urlopen(IRIS_TRAINING_URL).read()
- with open(IRIS_TRAINING,'w') as f:
+ raw = urlopen(IRIS_TRAINING_URL).read()
+ with open(IRIS_TRAINING,'wb') as f:
f.write(raw)
if not os.path.exists(IRIS_TEST):
- raw = urllib.urlopen(IRIS_TEST_URL).read()
- with open(IRIS_TEST,'w') as f:
+ raw = urlopen(IRIS_TEST_URL).read()
+ with open(IRIS_TEST,'wb') as f:
f.write(raw)
```
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index 7ebf5c4a2c..586bb6dead 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -35,7 +35,7 @@ enable TensorFlow for C:
OS="linux" # Change to "darwin" for Mac OS
TARGET_DIRECTORY="/usr/local"
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.3.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.4.0-rc0.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index b991fd0f93..1d00661d83 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -35,7 +35,7 @@ steps to install this library and enable TensorFlow for Go:
TF_TYPE="cpu" # Change to "gpu" for GPU support
TARGET_DIRECTORY='/usr/local'
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.3.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.4.0-rc0.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 2adcd4da73..3b3acfdcb3 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -34,7 +34,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.3.0</version>
+ <version>1.4.0-rc0</version>
</dependency>
```
@@ -63,7 +63,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.3.0</version>
+ <version>1.4.0-rc0</version>
</dependency>
</dependencies>
</project>
@@ -122,7 +122,7 @@ refer to the simpler instructions above instead.
Take the following steps to install TensorFlow for Java on Linux or Mac OS:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.3.0.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.4.0-rc0.jar),
which is the TensorFlow Java Archive (JAR).
2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@@ -141,7 +141,7 @@ Take the following steps to install TensorFlow for Java on Linux or Mac OS:
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
mkdir -p ./jni
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.3.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.4.0-rc0.tar.gz" |
tar -xz -C ./jni
### Install on Windows
@@ -149,10 +149,10 @@ Take the following steps to install TensorFlow for Java on Linux or Mac OS:
Take the following steps to install TensorFlow for Java on Windows:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.3.0.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.4.0-rc0.jar),
which is the TensorFlow Java Archive (JAR).
2. Download the following Java Native Interface (JNI) file appropriate for
- [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.3.0.zip).
+ [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.4.0-rc0.zip).
3. Extract this .zip file.
@@ -200,7 +200,7 @@ must be part of your `classpath`. For example, you can include the
downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
as follows:
-<pre><b>javac -cp libtensorflow-1.3.0.jar HelloTF.java</b></pre>
+<pre><b>javac -cp libtensorflow-1.4.0-rc0.jar HelloTF.java</b></pre>
### Running
@@ -214,11 +214,11 @@ two files are available to the JVM:
For example, the following command line executes the `HelloTF` program on Linux
and Mac OS X:
-<pre><b>java -cp libtensorflow-1.3.0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.4.0-rc0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
And the following command line executes the `HelloTF` program on Windows:
-<pre><b>java -cp libtensorflow-1.3.0.jar;. -Djava.library.path=jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.4.0-rc0.jar;. -Djava.library.path=jni HelloTF</b></pre>
If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
installed TensorFlow for Java and are ready to use the API. If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index 576099f054..9d204cc246 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -42,9 +42,21 @@ must be installed on your system:
a list of supported GPU cards.
* The libcupti-dev library, which is the NVIDIA CUDA Profile Tools Interface.
This library provides advanced profiling support. To install this library,
- issue the following command:
+ issue the following command for CUDA Toolkit >= 8.0:
<pre>
+ $ <b>sudo apt-get install cuda-command-line-tools</b>
+ </pre>
+
+ and add its path to your `LD_LIBRARY_PATH` environment variable:
+
+ <pre>
+ $ <b>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64</b>
+ </pre>
+
+ For CUDA Toolkit <= 7.5 do:
+
+ <pre>
$ <b>sudo apt-get install libcupti-dev</b>
</pre>
@@ -172,7 +184,7 @@ Take the following steps to install TensorFlow with Virtualenv:
virtualenv environment:
<pre>(tensorflow)$ <b>pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0dev-cp34-cp34m-linux_x86_64.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common_installation_problems).
@@ -277,7 +289,7 @@ take the following steps:
<pre>
$ <b>sudo pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0dev-cp34-cp34m-linux_x86_64.whl</b>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc0-cp34-cp34m-linux_x86_64.whl</b>
</pre>
If this step fails, see
@@ -445,7 +457,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
2. Create a conda environment named <tt>tensorflow</tt> to run a version
of Python by invoking the following command:
- <pre>$ <b>conda create -n tensorflow python=2.7 # or python=3.3, etc.</b></pre>
+ <pre>$ <b>conda create -n tensorflow pip python=2.7 # or python=3.3, etc.</b></pre>
3. Activate the conda environment by issuing the following command:
@@ -464,7 +476,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
<pre>
(tensorflow)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0dev-cp34-cp34m-linux_x86_64.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc0-cp34-cp34m-linux_x86_64.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@@ -632,14 +644,14 @@ This section documents the relevant values for Linux installations.
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0dev-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc0-cp27-none-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0dev-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc0-cp27-none-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -651,14 +663,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0dev-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc0-cp34-cp34m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0dev-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc0-cp34-cp34m-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -670,14 +682,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0dev-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc0-cp35-cp35m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0dev-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc0-cp35-cp35m-linux_x86_64.whl
</pre>
@@ -689,14 +701,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0dev-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.4.0rc0-cp36-cp36m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0dev-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0rc0-cp36-cp36m-linux_x86_64.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index b6daeb0dd6..6da22784bf 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -109,7 +109,7 @@ Take the following steps to install TensorFlow with Virtualenv:
TensorFlow in the active Virtualenv is as follows:
<pre> $ <b>pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0dev-py2-none-any.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc0-py2-none-any.whl</b></pre>
If you encounter installation problems, see
[Common Installation Problems](#common-installation-problems).
@@ -230,7 +230,7 @@ take the following steps:
issue the following command:
<pre> $ <b>sudo pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0dev-py2-none-any.whl</b> </pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc0-py2-none-any.whl</b> </pre>
If the preceding command fails, see
[installation problems](#common-installation-problems).
@@ -321,7 +321,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
2. Create a conda environment named `tensorflow`
by invoking the following command:
- <pre>$ <b>conda create -n tensorflow python=2.7 # or python=3.3, etc.</b></pre>
+ <pre>$ <b>conda create -n tensorflow pip python=2.7 # or python=3.3, etc.</b></pre>
3. Activate the conda environment by issuing the following command:
@@ -339,7 +339,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
TensorFlow for Python 2.7:
<pre> (tensorflow)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0dev-py2-none-any.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc0-py2-none-any.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@@ -512,7 +512,7 @@ This section documents the relevant values for Mac OS installations.
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0dev-py2-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc0-py2-none-any.whl
</pre>
@@ -520,7 +520,7 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0dev-py2-none-a
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0dev-py3-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.4.0rc0-py3-none-any.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index e6a4088656..b853d87816 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -137,8 +137,15 @@ The following NVIDIA <i>software</i> must be installed on your system:
particularly the description of appending the appropriate pathname
to your `LD_LIBRARY_PATH` environment variable.
-Finally, you must also install `libcupti-dev` by invoking the following
-command:
+Finally, you must also install `libcupti`. For CUDA Toolkit >= 8.0, install it via
+
+<pre> $ <b>sudo apt-get install cuda-command-line-tools</b> </pre>
+
+and add its path to your `LD_LIBRARY_PATH` environment variable:
+
+<pre> $ <b>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64</b> </pre>
+
+For CUDA Toolkit <= 7.5, you install `libcupti-dev` by invoking the following command:
<pre> $ <b>sudo apt-get install libcupti-dev</b> </pre>
@@ -342,10 +349,10 @@ Invoke `pip install` to install that pip package.
The filename of the `.whl` file depends on your platform.
For example, the following command will install the pip package
-for TensorFlow 1.4.0dev on Linux:
+for TensorFlow 1.4.0rc0 on Linux:
<pre>
-$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.4.0dev-py2-none-any.whl</b>
+$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.4.0rc0-py2-none-any.whl</b>
</pre>
## Validate your installation
@@ -434,8 +441,8 @@ Stack Overflow and specify the `tensorflow` tag.
**Linux**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
-<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
-<tr><td>tensorflow_gpu-1.3.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>6</td><td>8</td></tr>
+<tr><td>tensorflow-1.4.0rc0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.4.0rc0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.2.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.5</td><td>5.1</td><td>8</td></tr>
<tr><td>tensorflow-1.1.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>GCC 4.8</td><td>Bazel 0.4.2</td><td>N/A</td><td>N/A</td></tr>
@@ -447,7 +454,7 @@ Stack Overflow and specify the `tensorflow` tag.
**Mac**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
-<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow-1.4.0rc0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>ttensorflow-1.2.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.5</td><td>N/A</td><td>N/A</td></tr>
<tr><td>ttensorflow-1.1.0</td><td>CPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>N/A</td><td>N/A</td></tr>
<tr><td>ttensorflow_gpu-1.1.0</td><td>GPU</td><td>2.7, 3.3-3.6</td><td>Clang from xcode</td><td>Bazel 0.4.2</td><td>5.1</td><td>8</td></tr>
@@ -458,8 +465,8 @@ Stack Overflow and specify the `tensorflow` tag.
**Windows**
<table>
<tr><th>Version:</th><th>CPU/GPU:</th><th>Python Version:</th><th>Compiler:</th><th>Build Tools:</th><th>cuDNN:</th><th>CUDA:</th></tr>
-<tr><td>tensorflow-1.3.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
-<tr><td>tensorflow_gpu-1.3.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>6</td><td>8</td></tr>
+<tr><td>tensorflow-1.4.0rc0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>tensorflow_gpu-1.4.0rc0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>6</td><td>8</td></tr>
<tr><td>tensorflow-1.2.0</td><td>CPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
<tr><td>tensorflow_gpu-1.2.0</td><td>GPU</td><td>3.5-3.6</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>5.1</td><td>8</td></tr>
<tr><td>tensorflow-1.1.0</td><td>CPU</td><td>3.5</td><td>MSVC 2015 update 3</td><td>Cmake v3.6.3</td><td>N/A</td><td>N/A</td></tr>
diff --git a/tensorflow/docs_src/install/install_windows.md b/tensorflow/docs_src/install/install_windows.md
index ae8749c231..f0d580d803 100644
--- a/tensorflow/docs_src/install/install_windows.md
+++ b/tensorflow/docs_src/install/install_windows.md
@@ -105,7 +105,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
2. Create a conda environment named <tt>tensorflow</tt>
by invoking the following command:
- <pre>C:\> <b>conda create -n tensorflow python=3.5</b> </pre>
+ <pre>C:\> <b>conda create -n tensorflow pip python=3.5</b> </pre>
3. Activate the conda environment by issuing the following command:
diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md
index 06bb40f64d..da556bd848 100644
--- a/tensorflow/docs_src/performance/performance_guide.md
+++ b/tensorflow/docs_src/performance/performance_guide.md
@@ -127,7 +127,7 @@ Reading large numbers of small files significantly impacts I/O performance.
One approach to get maximum I/O throughput is to preprocess input data into
larger (~100MB) `TFRecord` files. For smaller data sets (200MB-1GB), the best
approach is often to load the entire data set into memory. The document
-[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/slim#Data)
+[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/research/slim#Data)
includes information and scripts for creating `TFRecords` and this
[script](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py)
converts the CIFAR-10 data set into `TFRecords`.
diff --git a/tensorflow/docs_src/performance/performance_models.md b/tensorflow/docs_src/performance/performance_models.md
index 183bbc75a9..fcda19e74c 100644
--- a/tensorflow/docs_src/performance/performance_models.md
+++ b/tensorflow/docs_src/performance/performance_models.md
@@ -345,7 +345,7 @@ executing the main script
* **`num_gpus`**: Number of GPUs to use.
* **`data_dir`**: Path to data to process. If not set, synthetic data is used.
To use Imagenet data use these
- [instructions](https://github.com/tensorflow/models/tree/master/inception#getting-started)
+ [instructions](https://github.com/tensorflow/models/tree/master/research/inception#getting-started)
as a starting point.
* **`batch_size`**: Batch size for each GPU.
* **`variable_update`**: The method for managing variables: `parameter_server`
diff --git a/tensorflow/docs_src/programmers_guide/datasets.md b/tensorflow/docs_src/programmers_guide/datasets.md
index 38e5612fb4..f458cbcef2 100644
--- a/tensorflow/docs_src/programmers_guide/datasets.md
+++ b/tensorflow/docs_src/programmers_guide/datasets.md
@@ -44,7 +44,7 @@ To start an input pipeline, you must define a *source*. For example, to
construct a `Dataset` from some tensors in memory, you can use
`tf.data.Dataset.from_tensors()` or
`tf.data.Dataset.from_tensor_slices()`. Alternatively, if your input
-data are on disk in the recommend TFRecord format, you can construct a
+data are on disk in the recommended TFRecord format, you can construct a
`tf.data.TFRecordDataset`.
Once you have a `Dataset` object, you can *transform* it into a new `Dataset` by
diff --git a/tensorflow/docs_src/programmers_guide/graphs.md b/tensorflow/docs_src/programmers_guide/graphs.md
index 6ba8bb7a34..10f53fe8f2 100644
--- a/tensorflow/docs_src/programmers_guide/graphs.md
+++ b/tensorflow/docs_src/programmers_guide/graphs.md
@@ -404,8 +404,8 @@ y = tf.square(x)
with tf.Session() as sess:
# Feeding a value changes the result that is returned when you evaluate `y`.
- print(sess.run(y, {x: [1.0, 2.0, 3.0]}) # => "[1.0, 4.0, 9.0]"
- print(sess.run(y, {x: [0.0, 0.0, 5.0]}) # => "[0.0, 0.0, 25.0]"
+ print(sess.run(y, {x: [1.0, 2.0, 3.0]})) # => "[1.0, 4.0, 9.0]"
+ print(sess.run(y, {x: [0.0, 0.0, 5.0]})) # => "[0.0, 0.0, 25.0]"
# Raises `tf.errors.InvalidArgumentError`, because you must feed a value for
# a `tf.placeholder()` when evaluating a tensor that depends on it.
diff --git a/tensorflow/docs_src/programmers_guide/saved_model.md b/tensorflow/docs_src/programmers_guide/saved_model.md
index 9262143ad8..6bc2cbb9e3 100644
--- a/tensorflow/docs_src/programmers_guide/saved_model.md
+++ b/tensorflow/docs_src/programmers_guide/saved_model.md
@@ -158,6 +158,39 @@ Notes:
optionally choose names for the variables in the checkpoint files.
+### Inspect variables in a checkpoint
+
+We can quickly inspect variables in a checkpoint with the
+[`inspect_checkpoint`](https://www.tensorflow.org/code/tensorflow/python/tools/inspect_checkpoint.py) library.
+
+Continuing from the save/restore examples shown earlier:
+
+```python
+# import the inspect_checkpoint library
+from tensorflow.python.tools import inspect_checkpoint as chkp
+
+# print all tensors in checkpoint file
+chkp.print_tensors_in_checkpoint_file("/tmp/model.ckpt", tensor_name='', all_tensors=True)
+
+# tensor_name: v1
+# [ 1. 1. 1.]
+# tensor_name: v2
+# [-1. -1. -1. -1. -1.]
+
+# print only tensor v1 in checkpoint file
+chkp.print_tensors_in_checkpoint_file("/tmp/model.ckpt", tensor_name='v1', all_tensors=False)
+
+# tensor_name: v1
+# [ 1. 1. 1.]
+
+# print only tensor v2 in checkpoint file
+chkp.print_tensors_in_checkpoint_file("/tmp/model.ckpt", tensor_name='v2', all_tensors=False)
+
+# tensor_name: v2
+# [-1. -1. -1. -1. -1.]
+```
+
+
<a name="models"></a>
## Overview of saving and restoring models
diff --git a/tensorflow/docs_src/tutorials/wide.md b/tensorflow/docs_src/tutorials/wide.md
index 3055c54021..6292c1a01e 100644
--- a/tensorflow/docs_src/tutorials/wide.md
+++ b/tensorflow/docs_src/tutorials/wide.md
@@ -426,8 +426,7 @@ m = tf.estimator.LinearClassifier(
optimizer=tf.train.FtrlOptimizer(
learning_rate=0.1,
l1_regularization_strength=1.0,
- l2_regularization_strength=1.0),
- model_dir=model_dir)
+ l2_regularization_strength=1.0))
```
One important difference between L1 and L2 regularization is that L1
diff --git a/tensorflow/examples/get_started/regression/imports85.py b/tensorflow/examples/get_started/regression/imports85.py
index 96a464920a..6bee556eb8 100644
--- a/tensorflow/examples/get_started/regression/imports85.py
+++ b/tensorflow/examples/get_started/regression/imports85.py
@@ -127,7 +127,7 @@ def dataset(y_name="price", train_fraction=0.7):
def in_test_set(line):
"""Returns a boolean tensor, true if the line is in the training set."""
# Items not in the training set are in the test set.
- # This line must use `~` instead of `not` beacuse `not` only works on python
+ # This line must use `~` instead of `not` because `not` only works on python
# booleans but we are dealing with symbolic tensors.
return ~in_training_set(line)
diff --git a/tensorflow/examples/get_started/regression/linear_regression_categorical.py b/tensorflow/examples/get_started/regression/linear_regression_categorical.py
index 860d0e437c..e2ad415fbc 100644
--- a/tensorflow/examples/get_started/regression/linear_regression_categorical.py
+++ b/tensorflow/examples/get_started/regression/linear_regression_categorical.py
@@ -67,7 +67,7 @@ def main(argv):
# The second way, appropriate for an unspecified vocabulary, is to create a
# hashed column. It will create a fixed length list of weights, and
- # automatically assign each input categort to a weight. Due to the
+ # automatically assign each input category to a weight. Due to the
# pseudo-randomness of the process, some weights may be shared between
# categories, while others will remain unused.
make_column = tf.feature_column.categorical_column_with_hash_bucket(
diff --git a/tensorflow/examples/learn/resnet.py b/tensorflow/examples/learn/resnet.py
index 33a09bb6e0..1e0966475b 100755
--- a/tensorflow/examples/learn/resnet.py
+++ b/tensorflow/examples/learn/resnet.py
@@ -190,8 +190,8 @@ def main(unused_args):
# Calculate accuracy.
test_input_fn = tf.estimator.inputs.numpy_input_fn(
- x={X_FEATURE: mnist.train.images},
- y=mnist.train.labels.astype(np.int32),
+ x={X_FEATURE: mnist.test.images},
+ y=mnist.test.labels.astype(np.int32),
num_epochs=1,
shuffle=False)
scores = classifier.evaluate(input_fn=test_input_fn)
diff --git a/tensorflow/examples/tutorials/word2vec/word2vec_basic.py b/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
index 1fa2b14869..142e45a2e8 100644
--- a/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
+++ b/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
@@ -115,11 +115,9 @@ def generate_batch(batch_size, num_skips, skip_window):
data_index += span
for i in range(batch_size // num_skips):
context_words = [w for w in range(span) if w != skip_window]
- random.shuffle(context_words)
- words_to_use = collections.deque(context_words)
- for j in range(num_skips):
+ words_to_use = random.sample(context_words, num_skips)
+ for j, context_word in enumerate(words_to_use):
batch[i * num_skips + j] = buffer[skip_window]
- context_word = words_to_use.pop()
labels[i * num_skips + j, 0] = buffer[context_word]
if data_index == len(data):
buffer[:] = data[:span]
diff --git a/tensorflow/java/BUILD b/tensorflow/java/BUILD
index a380bc2c71..d74cb32c5a 100644
--- a/tensorflow/java/BUILD
+++ b/tensorflow/java/BUILD
@@ -24,6 +24,7 @@ java_library(
],
data = [":libtensorflow_jni"],
javacopts = JAVACOPTS,
+ plugins = [":processor"],
visibility = ["//visibility:public"],
)
@@ -41,6 +42,21 @@ filegroup(
],
)
+java_plugin(
+ name = "processor",
+ generates_api = True,
+ processor_class = "org.tensorflow.processor.OperatorProcessor",
+ visibility = ["//visibility:public"],
+ deps = [":processor_library"],
+)
+
+java_library(
+ name = "processor_library",
+ srcs = glob(["src/gen/java/org/tensorflow/processor/**/*.java"]),
+ javacopts = JAVACOPTS,
+ resources = glob(["src/gen/resources/META-INF/services/javax.annotation.processing.Processor"]),
+)
+
filegroup(
name = "java_op_sources",
srcs = glob(["src/main/java/org/tensorflow/op/**/*.java"]) + [
@@ -264,6 +280,29 @@ tf_java_test(
],
)
+#java_test(
+# name = "OperatorProcessorTest",
+# size = "small",
+# srcs = ["src/test/java/org/tensorflow/processor/OperatorProcessorTest.java"],
+# javacopts = JAVACOPTS,
+# resources = [":processor_test_resources"],
+# test_class = "org.tensorflow.processor.OperatorProcessorTest",
+# deps = [
+# ":processor_library",
+# "@junit",
+# "@com_google_testing_compile",
+# "@com_google_truth",
+# ],
+#)
+
+filegroup(
+ name = "processor_test_resources",
+ srcs = glob([
+ "src/test/resources/org/tensorflow/**/*.java",
+ "src/main/java/org/tensorflow/op/annotation/Operator.java",
+ ]),
+)
+
filegroup(
name = "libtensorflow_jni",
srcs = select({
diff --git a/tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java b/tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java
new file mode 100644
index 0000000000..45e42878c7
--- /dev/null
+++ b/tensorflow/java/src/gen/java/org/tensorflow/processor/OperatorProcessor.java
@@ -0,0 +1,164 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+package org.tensorflow.processor;
+
+import java.io.IOException;
+import java.io.PrintWriter;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.Set;
+import javax.annotation.processing.AbstractProcessor;
+import javax.annotation.processing.Filer;
+import javax.annotation.processing.Messager;
+import javax.annotation.processing.ProcessingEnvironment;
+import javax.annotation.processing.RoundEnvironment;
+import javax.lang.model.SourceVersion;
+import javax.lang.model.element.Element;
+import javax.lang.model.element.TypeElement;
+import javax.tools.Diagnostic.Kind;
+
+/**
+ * A compile-time Processor that aggregates classes annotated with {@link
+ * org.tensorflow.op.annotation.Operator} and generates the {@code Ops} convenience API. Please
+ * refer to the {@link org.tensorflow.op.annotation.Operator} annotation for details about the API
+ * generated for each annotated class.
+ *
+ * <p>Note that this processor can only be invoked once, in a single compilation run that includes
+ * all the {@code Operator} annotated source classes. The reason is that the {@code Ops} API is an
+ * "aggregating" API, and annotation processing does not permit modifying an already generated
+ * class.
+ *
+ * @see org.tensorflow.op.annotation.Operator
+ */
+public final class OperatorProcessor extends AbstractProcessor {
+
+ @Override
+ public SourceVersion getSupportedSourceVersion() {
+ return SourceVersion.latestSupported();
+ }
+
+ @Override
+ public synchronized void init(ProcessingEnvironment processingEnv) {
+ super.init(processingEnv);
+ messager = processingEnv.getMessager();
+ filer = processingEnv.getFiler();
+ }
+
+ @Override
+ public boolean process(Set<? extends TypeElement> annotations, RoundEnvironment roundEnv) {
+ // Nothing needs to be done at the end of all rounds.
+ if (roundEnv.processingOver()) {
+ return false;
+ }
+
+ // Nothing to look at in this round.
+ if (annotations.size() == 0) {
+ return false;
+ }
+
+ // We expect to be registered for exactly one annotation.
+ if (annotations.size() != 1) {
+ throw new IllegalStateException(
+ "Unexpected - multiple annotations registered: " + annotations);
+ }
+ TypeElement annotation = annotations.iterator().next();
+ Set<? extends Element> annotated = roundEnv.getElementsAnnotatedWith(annotation);
+
+    // If there are no annotated elements, claim the annotation but do nothing.
+ if (annotated.size() == 0) {
+ return true;
+ }
+
+ // This processor has to aggregate all op classes in one round, as it generates a single Ops
+ // API class which cannot be modified once generated. If we find an annotation after we've
+ // generated our code, flag the location of each such class.
+ if (hasRun) {
+ for (Element e : annotated) {
+ error(
+ e,
+ "The Operator processor has already processed @Operator annotated sources\n"
+ + "and written out an Ops API. It cannot process additional @Operator sources.\n"
+ + "One reason this can happen is if other annotation processors generate\n"
+ + "new @Operator source files.");
+ }
+ return true;
+ }
+
+ // Collect all classes tagged with our annotation.
+ Set<TypeElement> opClasses = new HashSet<TypeElement>();
+ if (!collectOpClasses(roundEnv, opClasses, annotation)) {
+ return true;
+ }
+
+ // Nothing to do when there are no tagged classes.
+ if (opClasses.isEmpty()) {
+ return true;
+ }
+
+    // TODO(kbsriram): validate operator classes and generate Op API.
+ writeApi();
+ hasRun = true;
+ return true;
+ }
+
+ @Override
+ public Set<String> getSupportedAnnotationTypes() {
+ return Collections.singleton(String.format("%s.annotation.Operator", OP_PACKAGE));
+ }
+
+ private void writeApi() {
+ // Generate an empty class for now and get the build working correctly. This will be changed to
+ // generate the actual API once we've done with build-related changes.
+    // TODO(kbsriram):
+ try (PrintWriter writer =
+ new PrintWriter(filer.createSourceFile(String.format("%s.Ops", OP_PACKAGE)).openWriter())) {
+ writer.println(String.format("package %s;", OP_PACKAGE));
+ writer.println("public class Ops{}");
+ } catch (IOException e) {
+ error(null, "Unexpected failure generating API: %s", e.getMessage());
+ }
+ }
+
+ private boolean collectOpClasses(
+ RoundEnvironment roundEnv, Set<TypeElement> opClasses, TypeElement annotation) {
+ boolean result = true;
+ for (Element e : roundEnv.getElementsAnnotatedWith(annotation)) {
+ // @Operator can only apply to types, so e must be a TypeElement.
+ if (!(e instanceof TypeElement)) {
+ error(
+ e,
+ "@Operator can only be applied to classes, but this is a %s",
+ e.getKind().toString());
+ result = false;
+ continue;
+ }
+ opClasses.add((TypeElement) e);
+ }
+ return result;
+ }
+
+ private void error(Element e, String message, Object... args) {
+ if (args != null && args.length > 0) {
+ message = String.format(message, args);
+ }
+ messager.printMessage(Kind.ERROR, message, e);
+ }
+
+ private Filer filer;
+ private Messager messager;
+ private boolean hasRun = false;
+ private static final String OP_PACKAGE = "org.tensorflow.op";
+}
diff --git a/tensorflow/java/src/gen/resources/META-INF/services/javax.annotation.processing.Processor b/tensorflow/java/src/gen/resources/META-INF/services/javax.annotation.processing.Processor
new file mode 100644
index 0000000000..9a4fc98a89
--- /dev/null
+++ b/tensorflow/java/src/gen/resources/META-INF/services/javax.annotation.processing.Processor
@@ -0,0 +1 @@
+org.tensorflow.processor.OperatorProcessor
diff --git a/tensorflow/java/src/main/java/org/tensorflow/op/annotation/Operator.java b/tensorflow/java/src/main/java/org/tensorflow/op/annotation/Operator.java
index 59476fb43d..3782240edb 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/op/annotation/Operator.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/op/annotation/Operator.java
@@ -54,7 +54,7 @@ import java.lang.annotation.Target;
*/
@Documented
@Target(ElementType.TYPE)
-@Retention(RetentionPolicy.CLASS)
+@Retention(RetentionPolicy.SOURCE)
public @interface Operator {
/**
* Specify an optional group within the {@code Ops} class.
diff --git a/tensorflow/java/src/test/java/org/tensorflow/processor/OperatorProcessorTest.java b/tensorflow/java/src/test/java/org/tensorflow/processor/OperatorProcessorTest.java
new file mode 100644
index 0000000000..9fa1bad20d
--- /dev/null
+++ b/tensorflow/java/src/test/java/org/tensorflow/processor/OperatorProcessorTest.java
@@ -0,0 +1,51 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+package org.tensorflow.processor;
+
+import static com.google.testing.compile.CompilationSubject.assertThat;
+
+import com.google.testing.compile.Compilation;
+import com.google.testing.compile.Compiler;
+import com.google.testing.compile.JavaFileObjects;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Basic tests for {@link org.tensorflow.processor.OperatorProcessor}. */
+@RunWith(JUnit4.class)
+public final class OperatorProcessorTest {
+
+ @Test
+ public void basicGood() {
+ Compilation compile = compile("org/tensorflow/processor/operator/good/BasicGood.java");
+ assertThat(compile).succeededWithoutWarnings();
+ assertThat(compile).generatedSourceFile("org.tensorflow.op.Ops");
+ }
+
+ @Test
+ public void basicBad() {
+ assertThat(compile("org/tensorflow/processor/operator/bad/BasicBad.java")).failed();
+ }
+
+ // Create a compilation unit that includes the @Operator annotation and processor.
+ private static Compilation compile(String path) {
+ return Compiler.javac()
+ .withProcessors(new OperatorProcessor())
+ .compile(
+ JavaFileObjects.forResource("src/main/java/org/tensorflow/op/annotation/Operator.java"),
+ JavaFileObjects.forResource(path));
+ }
+}
diff --git a/tensorflow/java/src/test/resources/org/tensorflow/processor/operator/bad/BasicBad.java b/tensorflow/java/src/test/resources/org/tensorflow/processor/operator/bad/BasicBad.java
new file mode 100644
index 0000000000..7d12857dfa
--- /dev/null
+++ b/tensorflow/java/src/test/resources/org/tensorflow/processor/operator/bad/BasicBad.java
@@ -0,0 +1,22 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+package org.tensorflow.processor.operator.bad;
+
+import org.tensorflow.op.annotation.Operator;
+
+public class BasicBad {
+ @Operator int foo;
+}
diff --git a/tensorflow/java/src/test/resources/org/tensorflow/processor/operator/good/BasicGood.java b/tensorflow/java/src/test/resources/org/tensorflow/processor/operator/good/BasicGood.java
new file mode 100644
index 0000000000..4cf175f00d
--- /dev/null
+++ b/tensorflow/java/src/test/resources/org/tensorflow/processor/operator/good/BasicGood.java
@@ -0,0 +1,21 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+package org.tensorflow.processor.operator.good;
+
+import org.tensorflow.op.annotation.Operator;
+
+@Operator
+public class BasicGood {}
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index 4382eeb9a8..953aa566f0 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -4202,6 +4202,19 @@ cuda_py_test(
main = "client/session_benchmark.py",
)
+cuda_py_test(
+ name = "nn_grad_test",
+ size = "small",
+ srcs = ["ops/nn_grad_test.py"],
+ additional_deps = [
+ ":client_testlib",
+ ":framework_for_generated_wrappers",
+ ":nn_grad",
+ ":nn_ops",
+ "//third_party/py/numpy",
+ ],
+)
+
py_library(
name = "tf_item",
srcs = [
diff --git a/tensorflow/python/debug/cli/tensor_format.py b/tensorflow/python/debug/cli/tensor_format.py
index 7a5597db12..05ccf93f15 100644
--- a/tensorflow/python/debug/cli/tensor_format.py
+++ b/tensorflow/python/debug/cli/tensor_format.py
@@ -480,7 +480,7 @@ def _pad_string_to_length(string, length):
def numeric_summary(tensor):
- """Get a text summmary of a numeric tensor.
+ """Get a text summary of a numeric tensor.
This summary is only available for numeric (int*, float*, complex*) and
Boolean tensors.
diff --git a/tensorflow/python/estimator/training.py b/tensorflow/python/estimator/training.py
index 64b014a6b5..1131995b3e 100644
--- a/tensorflow/python/estimator/training.py
+++ b/tensorflow/python/estimator/training.py
@@ -199,7 +199,7 @@ class EvalSpec(
evaluations on different data sets. Metrics for different evaluations
are saved in separate folders, and appear separately in tensorboard.
hooks: Iterable of `tf.train.SessionRunHook` objects to run
- on all workers (including chief) during training.
+ during evaluation.
exporters: Iterable of `Exporter`s, or a single one, or `None`.
`exporters` will be invoked after each evaluation.
start_delay_secs: Int. Start evaluating after waiting for this many
@@ -408,8 +408,8 @@ def train_and_evaluate(estimator, train_spec, eval_spec):
Args:
estimator: An `Estimator` instance to train and evaluate.
- train_spec: A `TrainSpec instance to specify the training specification.
- eval_spec: A `EvalSpec instance to specify the evaluation and export
+ train_spec: A `TrainSpec` instance to specify the training specification.
+ eval_spec: A `EvalSpec` instance to specify the evaluation and export
specification.
Raises:
diff --git a/tensorflow/python/kernel_tests/BUILD b/tensorflow/python/kernel_tests/BUILD
index b02bae95fd..d8ecabcdea 100644
--- a/tensorflow/python/kernel_tests/BUILD
+++ b/tensorflow/python/kernel_tests/BUILD
@@ -903,6 +903,21 @@ cuda_py_test(
],
)
+cuda_py_test(
+ name = "nth_element_op_test",
+ size = "small",
+ srcs = ["nth_element_op_test.py"],
+ additional_deps = [
+ "//third_party/py/numpy",
+ "//tensorflow/python:array_ops",
+ "//tensorflow/python:client_testlib",
+ "//tensorflow/python:framework_for_generated_wrappers",
+ "//tensorflow/python:gradients",
+ "//tensorflow/python:nn_grad",
+ "//tensorflow/python:nn_ops",
+ ],
+)
+
tf_py_test(
name = "unique_op_test",
size = "small",
diff --git a/tensorflow/python/kernel_tests/batchtospace_op_test.py b/tensorflow/python/kernel_tests/batchtospace_op_test.py
index 8ec93119f2..0c802476a0 100644
--- a/tensorflow/python/kernel_tests/batchtospace_op_test.py
+++ b/tensorflow/python/kernel_tests/batchtospace_op_test.py
@@ -24,6 +24,7 @@ from __future__ import print_function
import numpy as np
+from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
@@ -52,14 +53,15 @@ class BatchToSpaceDepthToSpace(test.TestCase, PythonOpImpl):
def testDepthToSpaceTranspose(self):
x = np.arange(20 * 5 * 8 * 7, dtype=np.float32).reshape([20, 5, 8, 7])
block_size = 2
- crops = np.zeros((2, 2), dtype=np.int32)
- y1 = self.batch_to_space(x, crops, block_size=block_size)
- y2 = array_ops.transpose(
- array_ops.depth_to_space(
- array_ops.transpose(x, [3, 1, 2, 0]), block_size=block_size),
- [3, 1, 2, 0])
- with self.test_session():
- self.assertAllEqual(y1.eval(), y2.eval())
+ for crops_dtype in [dtypes.int64, dtypes.int32]:
+ crops = array_ops.zeros((2, 2), dtype=crops_dtype)
+ y1 = self.batch_to_space(x, crops, block_size=block_size)
+ y2 = array_ops.transpose(
+ array_ops.depth_to_space(
+ array_ops.transpose(x, [3, 1, 2, 0]), block_size=block_size),
+ [3, 1, 2, 0])
+ with self.test_session():
+ self.assertAllEqual(y1.eval(), y2.eval())
class BatchToSpaceDepthToSpaceCpp(BatchToSpaceDepthToSpace, CppOpImpl):
@@ -287,9 +289,10 @@ class BatchToSpaceGradientCppTest(BatchToSpaceGradientTest, CppOpImpl):
class BatchToSpaceNDGradientTest(test.TestCase):
# Check the gradients.
- def _checkGrad(self, x, block_shape, crops):
+ def _checkGrad(self, x, block_shape, crops, crops_dtype):
block_shape = np.array(block_shape)
- crops = np.array(crops).reshape((len(block_shape), 2))
+ crops = constant_op.constant(
+ np.array(crops).reshape((len(block_shape), 2)), crops_dtype)
with self.test_session():
tf_x = ops.convert_to_tensor(x)
tf_y = array_ops.batch_to_space_nd(tf_x, block_shape, crops)
@@ -304,23 +307,26 @@ class BatchToSpaceNDGradientTest(test.TestCase):
self.assertAllClose(x_jacob_t, x_jacob_n, rtol=1e-2, atol=epsilon)
- def _compare(self, input_shape, block_shape, crops):
+ def _compare(self, input_shape, block_shape, crops, crops_dtype):
input_shape = list(input_shape)
input_shape[0] *= np.prod(block_shape)
x = np.random.normal(
0, 1, np.prod(input_shape)).astype(np.float32).reshape(input_shape)
- self._checkGrad(x, block_shape, crops)
+ self._checkGrad(x, block_shape, crops, crops_dtype)
# Don't use very large numbers as dimensions here as the result is tensor
# with cartesian product of the dimensions.
def testSmall(self):
- self._compare([1, 2, 3, 5], [2, 2], [[0, 0], [0, 0]])
+ for dtype in [dtypes.int64, dtypes.int32]:
+ self._compare([1, 2, 3, 5], [2, 2], [[0, 0], [0, 0]], dtype)
def testSmall2(self):
- self._compare([2, 4, 3, 2], [2, 2], [[0, 0], [0, 0]])
+ for dtype in [dtypes.int64, dtypes.int32]:
+ self._compare([2, 4, 3, 2], [2, 2], [[0, 0], [0, 0]], dtype)
def testSmallCrop1x1(self):
- self._compare([1, 2, 3, 5], [2, 2], [[1, 1], [1, 1]])
+ for dtype in [dtypes.int64, dtypes.int32]:
+ self._compare([1, 2, 3, 5], [2, 2], [[1, 1], [1, 1]], dtype)
if __name__ == "__main__":
diff --git a/tensorflow/python/kernel_tests/diag_op_test.py b/tensorflow/python/kernel_tests/diag_op_test.py
index f0b7885732..6cfa9b37fe 100644
--- a/tensorflow/python/kernel_tests/diag_op_test.py
+++ b/tensorflow/python/kernel_tests/diag_op_test.py
@@ -279,7 +279,7 @@ class MatrixDiagPartTest(test.TestCase):
class DiagTest(test.TestCase):
- def diagOp(self, diag, dtype, expected_ans, use_gpu=False):
+ def _diagOp(self, diag, dtype, expected_ans, use_gpu):
with self.test_session(use_gpu=use_gpu):
tf_ans = array_ops.diag(ops.convert_to_tensor(diag.astype(dtype)))
out = tf_ans.eval()
@@ -290,6 +290,10 @@ class DiagTest(test.TestCase):
self.assertShapeEqual(expected_ans, tf_ans)
self.assertShapeEqual(diag, tf_ans_inv)
+ def diagOp(self, diag, dtype, expected_ans):
+ self._diagOp(diag, dtype, expected_ans, False)
+ self._diagOp(diag, dtype, expected_ans, True)
+
def testEmptyTensor(self):
x = np.array([])
expected_ans = np.empty([0, 0])
@@ -400,13 +404,53 @@ class DiagTest(test.TestCase):
dtype=dtype)
self.diagOp(x, dtype, expected_ans)
+ def testRankFourNumberTensor(self):
+ for dtype in [np.float32, np.float64, np.int64, np.int32]:
+ # Input with shape [2, 1, 2, 3]
+ x = np.array([[[[ 1, 2, 3],
+ [ 4, 5, 6]]],
+ [[[ 7, 8, 9],
+ [10, 11, 12]]]], dtype=dtype)
+ # Output with shape [2, 1, 2, 3, 2, 1, 2, 3]
+ expected_ans = np.array(
+ [[[[[[[[1, 0, 0], [0, 0, 0]]],
+ [[[0, 0, 0], [0, 0, 0]]]],
+ [[[[0, 2, 0], [0, 0, 0]]],
+ [[[0, 0, 0], [0, 0, 0]]]],
+ [[[[0, 0, 3], [0, 0, 0]]],
+ [[[0, 0, 0], [0, 0, 0]]]]],
+ [[[[[0, 0, 0], [4, 0, 0]]],
+ [[[0, 0, 0], [0, 0, 0]]]],
+ [[[[0, 0, 0], [0, 5, 0]]],
+ [[[0, 0, 0], [0, 0, 0]]]],
+ [[[[0, 0, 0], [0, 0, 6]]],
+ [[[0, 0, 0], [0, 0, 0]]]]]]],
+
+ [[[[[[[0, 0, 0], [0, 0, 0]]],
+ [[[7, 0, 0], [0, 0, 0]]]],
+ [[[[0, 0, 0], [0, 0, 0]]],
+ [[[0, 8, 0], [0, 0, 0]]]],
+ [[[[0, 0, 0], [0, 0, 0]]],
+ [[[0, 0, 9], [0, 0, 0]]]]],
+ [[[[[0, 0, 0], [0, 0, 0]]],
+ [[[0, 0, 0], [10, 0, 0]]]],
+ [[[[0, 0, 0], [0, 0, 0]]],
+ [[[0, 0, 0], [0, 11, 0]]]],
+ [[[[0, 0, 0], [0, 0, 0]]],
+ [[[0, 0, 0], [0, 0, 12]]]]]]]], dtype=dtype)
+ self.diagOp(x, dtype, expected_ans)
+
+ def testInvalidRank(self):
+ with self.assertRaisesRegexp(ValueError, "must be at least rank 1"):
+ array_ops.diag(0.0)
+
class DiagPartOpTest(test.TestCase):
def setUp(self):
np.random.seed(0)
- def diagPartOp(self, tensor, dtype, expected_ans, use_gpu=False):
+ def _diagPartOp(self, tensor, dtype, expected_ans, use_gpu):
with self.test_session(use_gpu=use_gpu):
tensor = ops.convert_to_tensor(tensor.astype(dtype))
tf_ans_inv = array_ops.diag_part(tensor)
@@ -414,6 +458,10 @@ class DiagPartOpTest(test.TestCase):
self.assertAllClose(inv_out, expected_ans)
self.assertShapeEqual(expected_ans, tf_ans_inv)
+ def diagPartOp(self, tensor, dtype, expected_ans):
+ self._diagPartOp(tensor, dtype, expected_ans, False)
+ self._diagPartOp(tensor, dtype, expected_ans, True)
+
def testRankTwoFloatTensor(self):
x = np.random.rand(3, 3)
i = np.arange(3)
@@ -451,11 +499,23 @@ class DiagPartOpTest(test.TestCase):
self.diagPartOp(x, np.float32, expected_ans)
self.diagPartOp(x, np.float64, expected_ans)
+ def testRankEightComplexTensor(self):
+ x = np.random.rand(2, 2, 2, 3, 2, 2, 2, 3)
+ i = np.arange(2)[:, None, None, None]
+ j = np.arange(2)[:, None, None]
+ k = np.arange(2)[:, None]
+ l = np.arange(3)
+ expected_ans = x[i, j, k, l, i, j, k, l]
+ self.diagPartOp(x, np.complex64, expected_ans)
+ self.diagPartOp(x, np.complex128, expected_ans)
+
def testOddRank(self):
w = np.random.rand(2)
x = np.random.rand(2, 2, 2)
self.assertRaises(ValueError, self.diagPartOp, w, np.float32, 0)
self.assertRaises(ValueError, self.diagPartOp, x, np.float32, 0)
+ with self.assertRaises(ValueError):
+ array_ops.diag_part(0.0)
def testUnevenDimensions(self):
w = np.random.rand(2, 5)
diff --git a/tensorflow/python/kernel_tests/listdiff_op_test.py b/tensorflow/python/kernel_tests/listdiff_op_test.py
index 4f053d2a21..ee86cf0b24 100644
--- a/tensorflow/python/kernel_tests/listdiff_op_test.py
+++ b/tensorflow/python/kernel_tests/listdiff_op_test.py
@@ -41,15 +41,17 @@ class ListDiffTest(test.TestCase):
y = [compat.as_bytes(str(a)) for a in y]
out = [compat.as_bytes(str(a)) for a in out]
for diff_func in [array_ops.setdiff1d]:
- with self.test_session() as sess:
- x_tensor = ops.convert_to_tensor(x, dtype=dtype)
- y_tensor = ops.convert_to_tensor(y, dtype=dtype)
- out_tensor, idx_tensor = diff_func(x_tensor, y_tensor)
- tf_out, tf_idx = sess.run([out_tensor, idx_tensor])
- self.assertAllEqual(tf_out, out)
- self.assertAllEqual(tf_idx, idx)
- self.assertEqual(1, out_tensor.get_shape().ndims)
- self.assertEqual(1, idx_tensor.get_shape().ndims)
+ for index_dtype in [dtypes.int32, dtypes.int64]:
+ with self.test_session() as sess:
+ x_tensor = ops.convert_to_tensor(x, dtype=dtype)
+ y_tensor = ops.convert_to_tensor(y, dtype=dtype)
+ out_tensor, idx_tensor = diff_func(x_tensor, y_tensor,
+ index_dtype=index_dtype)
+ tf_out, tf_idx = sess.run([out_tensor, idx_tensor])
+ self.assertAllEqual(tf_out, out)
+ self.assertAllEqual(tf_idx, idx)
+ self.assertEqual(1, out_tensor.get_shape().ndims)
+ self.assertEqual(1, idx_tensor.get_shape().ndims)
def testBasic1(self):
x = [1, 2, 3, 4]
diff --git a/tensorflow/python/kernel_tests/metrics_test.py b/tensorflow/python/kernel_tests/metrics_test.py
index f21b0dfeab..e5b7cbce7a 100644
--- a/tensorflow/python/kernel_tests/metrics_test.py
+++ b/tensorflow/python/kernel_tests/metrics_test.py
@@ -3426,7 +3426,7 @@ class MeanIOUTest(test.TestCase):
sess.run(variables.local_variables_initializer())
for _ in range(5):
sess.run(update_op)
- desired_output = np.mean([1.0 / 3.0, 2.0 / 4.0, 0.])
+ desired_output = np.mean([1.0 / 3.0, 2.0 / 4.0])
self.assertAlmostEqual(desired_output, miou.eval())
def testUpdateOpEvalIsAccumulatedConfusionMatrix(self):
@@ -3505,6 +3505,55 @@ class MeanIOUTest(test.TestCase):
desired_miou = np.mean([2. / 4., 4. / 6.])
self.assertAlmostEqual(desired_miou, miou.eval())
+ def testMissingClassInLabels(self):
+ labels = constant_op.constant([
+ [[0, 0, 1, 1, 0, 0],
+ [1, 0, 0, 0, 0, 1]],
+ [[1, 1, 1, 1, 1, 1],
+ [0, 0, 0, 0, 0, 0]]])
+ predictions = constant_op.constant([
+ [[0, 0, 2, 1, 1, 0],
+ [0, 1, 2, 2, 0, 1]],
+ [[0, 0, 2, 1, 1, 1],
+ [1, 1, 2, 0, 0, 0]]])
+ num_classes = 3
+ with self.test_session() as sess:
+ miou, update_op = metrics.mean_iou(labels, predictions, num_classes)
+ sess.run(variables.local_variables_initializer())
+ self.assertAllEqual([[7, 4, 3], [3, 5, 2], [0, 0, 0]], update_op.eval())
+ self.assertAlmostEqual(
+ 1 / 3 * (7 / (7 + 3 + 7) + 5 / (5 + 4 + 5) + 0 / (0 + 5 + 0)),
+ miou.eval())
+
+ def testMissingClassOverallSmall(self):
+ labels = constant_op.constant([0])
+ predictions = constant_op.constant([0])
+ num_classes = 2
+ with self.test_session() as sess:
+ miou, update_op = metrics.mean_iou(labels, predictions, num_classes)
+ sess.run(variables.local_variables_initializer())
+ self.assertAllEqual([[1, 0], [0, 0]], update_op.eval())
+ self.assertAlmostEqual(1, miou.eval())
+
+ def testMissingClassOverallLarge(self):
+ labels = constant_op.constant([
+ [[0, 0, 1, 1, 0, 0],
+ [1, 0, 0, 0, 0, 1]],
+ [[1, 1, 1, 1, 1, 1],
+ [0, 0, 0, 0, 0, 0]]])
+ predictions = constant_op.constant([
+ [[0, 0, 1, 1, 0, 0],
+ [1, 1, 0, 0, 1, 1]],
+ [[0, 0, 0, 1, 1, 1],
+ [1, 1, 1, 0, 0, 0]]])
+ num_classes = 3
+ with self.test_session() as sess:
+ miou, update_op = metrics.mean_iou(labels, predictions, num_classes)
+ sess.run(variables.local_variables_initializer())
+ self.assertAllEqual([[9, 5, 0], [3, 7, 0], [0, 0, 0]], update_op.eval())
+ self.assertAlmostEqual(
+ 1 / 2 * (9 / (9 + 3 + 5) + 7 / (7 + 5 + 3)), miou.eval())
+
class MeanPerClassAccuracyTest(test.TestCase):
diff --git a/tensorflow/python/kernel_tests/nth_element_op_test.py b/tensorflow/python/kernel_tests/nth_element_op_test.py
new file mode 100644
index 0000000000..58cd46d2d5
--- /dev/null
+++ b/tensorflow/python/kernel_tests/nth_element_op_test.py
@@ -0,0 +1,174 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+import tensorflow.python.ops.nn_grad # pylint: disable=unused-import
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gradients_impl
+from tensorflow.python.platform import test
+
+
+class NthElementTest(test.TestCase):
+
+ def _validateNthElement(self, inputs, dtype, n, reverse, expected_values):
+ np_expected_values = np.array(expected_values)
+ with self.test_session(use_gpu=False) as sess:
+ inputs_op = ops.convert_to_tensor(inputs, dtype=dtype)
+ values_op = nn_ops.nth_element(inputs_op, n, reverse=reverse)
+ values = sess.run(values_op)
+
+ self.assertShapeEqual(np_expected_values, values_op)
+ self.assertAllClose(np_expected_values, values)
+
+ def testExample1(self):
+ inputs = [2.2, 4.4, 1.1, 5.5, 3.3]
+ self._validateNthElement(inputs, dtypes.float32, 1, False, 2.2)
+ self._validateNthElement(inputs, dtypes.float32, 1, True, 4.4)
+
+ def testExample2(self):
+ inputs = [[2.2, 4.4, 1.1], [5.5, 3.3, 6.6]]
+ self._validateNthElement(inputs, dtypes.float64, 2, False, [4.4, 6.6])
+ self._validateNthElement(inputs, dtypes.float64, 2, True, [1.1, 3.3])
+
+ def testExample3(self):
+ inputs = [[[2, 4, 1], [5, -3, 6]],
+ [[7, 9, -8], [9, 0, 4]]]
+ self._validateNthElement(inputs, dtypes.int32, 0, False,
+ [[1, -3], [-8, 0]])
+ self._validateNthElement(inputs, dtypes.int64, 0, True,
+ [[4, 6], [9, 9]])
+
+ def _testFloatLargeInput(self, input_shape):
+ inputs = np.random.random_sample(input_shape)
+ n = np.random.randint(input_shape[-1])
+ sort_inputs = np.sort(inputs)
+ expected_values = sort_inputs[..., n]
+ self._validateNthElement(
+ inputs, dtypes.float32, n, False, expected_values)
+ expected_values = sort_inputs[..., ::-1][..., n]
+ self._validateNthElement(
+ inputs, dtypes.float64, n, True, expected_values)
+
+ def _testIntLargeInput(self, input_shape):
+ inputs = np.random.randint(-1e3, 1e3, input_shape)
+ n = np.random.randint(input_shape[-1])
+ sort_inputs = np.sort(inputs)
+ expected_values = sort_inputs[..., n]
+ self._validateNthElement(
+ inputs, dtypes.int32, n, False, expected_values)
+ expected_values = sort_inputs[..., ::-1][..., n]
+ self._validateNthElement(
+ inputs, dtypes.int64, n, True, expected_values)
+
+ def _testLargeInput(self, input_shape):
+ self._testFloatLargeInput(input_shape)
+ self._testIntLargeInput(input_shape)
+
+ def testLargeInput(self):
+ self._testLargeInput([1])
+ self._testLargeInput([10])
+ self._testLargeInput([5, 10])
+ self._testLargeInput([50, 100])
+ self._testLargeInput([50, 10000])
+ self._testLargeInput([50, 10, 100])
+ self._testLargeInput([50, 10, 10, 100])
+
+ def _testEnumerateN(self, input_shape):
+ inputs = np.random.random_sample(input_shape)
+ sort_inputs = np.sort(inputs)
+ for n in range(input_shape[-1]):
+ expected_values = sort_inputs[..., n]
+ self._validateNthElement(
+ inputs, dtypes.float32, n, False, expected_values)
+ expected_values = sort_inputs[..., ::-1][..., n]
+ self._validateNthElement(
+ inputs, dtypes.float64, n, True, expected_values)
+
+ def testEnumerateN(self):
+ self._testEnumerateN([1])
+ self._testEnumerateN([10])
+ self._testEnumerateN([10, 10])
+ self._testEnumerateN([10, 10, 10])
+ self._testEnumerateN([10, 10, 10, 10])
+
+ def testInvalidInput(self):
+ with self.assertRaisesRegexp(ValueError,
+ "at least rank 1 but is rank 0"):
+ nn_ops.nth_element(5, 0)
+
+ def testInvalidInputAtEval(self):
+ with self.test_session(use_gpu=False):
+ v = array_ops.placeholder(dtype=dtypes.float32)
+ with self.assertRaisesOpError("Input must be >= 1-D"):
+ nn_ops.nth_element(v, 0).eval(feed_dict={v: 5.0})
+
+ def testInvalidN(self):
+ with self.assertRaisesRegexp(ValueError,
+ "non-negative but is -1"):
+ nn_ops.nth_element([5], -1)
+ with self.assertRaisesRegexp(ValueError,
+ "scalar but has rank 1"):
+ nn_ops.nth_element([5, 6, 3], [1])
+
+ def testInvalidNAtEval(self):
+ inputs = [[0.1, 0.2], [0.3, 0.4]]
+ with self.test_session(use_gpu=False):
+ n = array_ops.placeholder(dtypes.int32)
+ values = nn_ops.nth_element(inputs, n)
+ with self.assertRaisesOpError("Need n >= 0, got -7"):
+ values.eval(feed_dict={n: -7})
+
+ def testNTooLarge(self):
+ inputs = [[0.1, 0.2], [0.3, 0.4]]
+ with self.assertRaisesRegexp(ValueError,
+ "must have last dimension > n = 2"):
+ nn_ops.nth_element(inputs, 2)
+
+ def testNTooLargeAtEval(self):
+ inputs = [[0.1, 0.2], [0.3, 0.4]]
+ with self.test_session(use_gpu=False):
+ n = array_ops.placeholder(dtypes.int32)
+ values = nn_ops.nth_element(inputs, n)
+ with self.assertRaisesOpError(r"Input must have at least n\+1 columns"):
+ values.eval(feed_dict={n: 2})
+
+ def testGradients(self):
+ with self.test_session(use_gpu=False) as sess:
+ inputs = array_ops.placeholder(dtypes.int32, shape=[3, 5])
+ values = nn_ops.nth_element(inputs, 3)
+ grad = sess.run(
+ gradients_impl.gradients(
+ values, inputs, grad_ys=[[-1., 2., 5.]]),
+ feed_dict={inputs: [[2, -1, 1000, 3, 1000],
+ [1, 5, 2, 4, 3],
+ [2, 2, 2, 2, 2],
+ ]})
+ self.assertAllClose(grad[0], [[0, 0, -0.5, 0, -0.5],
+ [0, 0, 0, 2, 0],
+ [1, 1, 1, 1, 1],
+ ])
+
+
+
+if __name__ == "__main__":
+ test.main()
diff --git a/tensorflow/python/kernel_tests/pad_op_test.py b/tensorflow/python/kernel_tests/pad_op_test.py
index ca1f3f878f..2c766e3640 100644
--- a/tensorflow/python/kernel_tests/pad_op_test.py
+++ b/tensorflow/python/kernel_tests/pad_op_test.py
@@ -193,6 +193,25 @@ class PadOpTest(test.TestCase):
with self.assertRaisesRegexp(ValueError, "Unknown padding mode"):
array_ops.pad(x, [[1, 0], [2, 1]], mode="weird").eval()
+ def testPaddingTypes(self):
+ paddings = [[1, 0], [2, 3], [0, 2]]
+ inputs = np.random.randint(-100, 100, (4, 4, 3)).astype(np.float32)
+ for mode in ("CONSTANT", "REFLECT", "SYMMETRIC", "reflect", "symmetric",
+ "constant"):
+ for padding_dtype in [dtypes.int32, dtypes.int64]:
+ np_val = self._npPad(inputs,
+ paddings,
+ mode=mode,
+ constant_values=0)
+ with self.test_session(use_gpu=True):
+ tf_val = array_ops.pad(inputs,
+ constant_op.constant(paddings, padding_dtype),
+ mode=mode,
+ constant_values=0)
+ out = tf_val.eval()
+ self.assertAllEqual(np_val, out)
+ self.assertShapeEqual(np_val, tf_val)
+
def testIntTypes(self):
# TODO(touts): Figure out why the padding tests do not work on GPU
# for int types and rank > 2.
@@ -284,6 +303,15 @@ class PadOpTest(test.TestCase):
self.assertAllEqual(inp, out)
self.assertShapeEqual(inp, tf_val)
+ def testPadTypes(self):
+ for dtype in [dtypes.int32, dtypes.int64]:
+ paddings = np.zeros((0, 2))
+ inp = np.asarray(7)
+ with self.test_session(use_gpu=True):
+ tf_val = array_ops.pad(inp, constant_op.constant(paddings, dtype=dtype))
+ out = tf_val.eval()
+ self.assertAllEqual(inp, out)
+ self.assertShapeEqual(inp, tf_val)
if __name__ == "__main__":
test.main()
diff --git a/tensorflow/python/kernel_tests/reduction_ops_test.py b/tensorflow/python/kernel_tests/reduction_ops_test.py
index c794351fe9..2dc65b1384 100644
--- a/tensorflow/python/kernel_tests/reduction_ops_test.py
+++ b/tensorflow/python/kernel_tests/reduction_ops_test.py
@@ -163,6 +163,13 @@ class SumReductionTest(BaseReductionTest):
reduction_axes = tuple(reduction_axes)
return np.sum(x, axis=reduction_axes, keepdims=keep_dims)
+ def testAxesType(self):
+ for dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True) as sess:
+ v = math_ops.reduce_sum([0, 0], constant_op.constant(0, dtype=dtype))
+ tf_v = sess.run(v)
+ self.assertAllEqual(tf_v, 0)
+
def testInfinity(self):
for dtype in [np.float32, np.float64]:
for special_value_x in [-np.inf, np.inf]:
@@ -193,6 +200,7 @@ class SumReductionTest(BaseReductionTest):
tf_out_mean = sess.run(tf_mean)
self.assertAllClose(tf_out_mean, 1.)
+
def testFloat32(self):
for rank in range(1, _MAX_RANK + 1):
np_arr = self._makeIncremental((2,) * rank, dtypes.float32)
@@ -369,6 +377,13 @@ class MeanReductionTest(BaseReductionTest):
return np_sum // count
return np_sum / count
+ def testAxesType(self):
+ for dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True) as sess:
+ v = math_ops.reduce_mean([0, 0], constant_op.constant(0, dtype=dtype))
+ tf_v = sess.run(v)
+ self.assertAllEqual(tf_v, 0)
+
def testInfinity(self):
for dtype in [np.float32, np.float64]:
for special_value_x in [-np.inf, np.inf]:
@@ -435,6 +450,13 @@ class ProdReductionTest(BaseReductionTest):
reduction_axes = tuple(reduction_axes)
return np.prod(x, axis=reduction_axes, keepdims=keep_dims)
+ def testAxesType(self):
+ for dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True) as sess:
+ v = math_ops.reduce_prod([0, 0], constant_op.constant(0, dtype=dtype))
+ tf_v = sess.run(v)
+ self.assertAllEqual(tf_v, 0)
+
def testInfinity(self):
for dtype in [np.float32, np.float64]:
for special_value_x in [-np.inf, np.inf]:
@@ -531,6 +553,13 @@ class MinReductionTest(test.TestCase):
self._compare(x, reduction_axes, True, use_gpu=True)
self._compare(x, reduction_axes, True, use_gpu=False)
+ def testAxesType(self):
+ for dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True) as sess:
+ v = math_ops.reduce_min([0, 0], constant_op.constant(0, dtype=dtype))
+ tf_v = sess.run(v)
+ self.assertAllEqual(tf_v, 0)
+
def testInfinity(self):
for dtype in [np.float32, np.float64]:
for special_value_x in [-np.inf, np.inf]:
@@ -637,6 +666,13 @@ class MaxReductionTest(test.TestCase):
self._compare(x, reduction_axes, True, use_gpu=True)
self._compare(x, reduction_axes, True, use_gpu=False)
+ def testAxesType(self):
+ for dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True) as sess:
+ v = math_ops.reduce_max([0, 0], constant_op.constant(0, dtype=dtype))
+ tf_v = sess.run(v)
+ self.assertAllEqual(tf_v, 0)
+
def testInfinity(self):
for dtype in [np.float32, np.float64]:
for special_value_x in [-np.inf, np.inf]:
@@ -757,6 +793,14 @@ class AllReductionTest(test.TestCase):
self._compare(x, reduction_axes, True, use_gpu=True)
self._compare(x, reduction_axes, True, use_gpu=False)
+ def testAxesType(self):
+ for dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True) as sess:
+ v = math_ops.reduce_all([True, True],
+ constant_op.constant(0, dtype=dtype))
+ tf_v = sess.run(v)
+ self.assertAllEqual(tf_v, True)
+
def testAll3D(self):
# Create a 3D array of bools and reduce across all possible
# dimensions
@@ -798,6 +842,14 @@ class AnyReductionTest(test.TestCase):
self._compare(x, reduction_axes, True, use_gpu=True)
self._compare(x, reduction_axes, True, use_gpu=False)
+ def testAxesType(self):
+ for dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True) as sess:
+ v = math_ops.reduce_any([True, True],
+ constant_op.constant(0, dtype=dtype))
+ tf_v = sess.run(v)
+ self.assertAllEqual(tf_v, True)
+
def testAll3D(self):
# Create a 3D array of bools and reduce across all possible
# dimensions
diff --git a/tensorflow/python/kernel_tests/scan_ops_test.py b/tensorflow/python/kernel_tests/scan_ops_test.py
index 6b2b589a06..08b4a2aaae 100644
--- a/tensorflow/python/kernel_tests/scan_ops_test.py
+++ b/tensorflow/python/kernel_tests/scan_ops_test.py
@@ -20,6 +20,8 @@ from __future__ import print_function
import numpy as np
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
from tensorflow.python.framework import errors_impl
from tensorflow.python.framework import ops
from tensorflow.python.ops import gradient_checker
@@ -92,6 +94,14 @@ class CumsumTest(test.TestCase):
for axis in (-1, 0):
self._compareAll(x, axis)
+ def testAxisType(self):
+ for dtype in self.valid_dtypes:
+ x = np.arange(1, 6).reshape([5]).astype(dtype)
+ for axis_dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True):
+ axis = constant_op.constant(0, axis_dtype)
+ tf_out = math_ops.cumsum(x, axis).eval()
+
def test1D(self):
for dtype in self.valid_dtypes:
x = np.arange(1, 6).reshape([5]).astype(dtype)
@@ -190,6 +200,14 @@ class CumprodTest(test.TestCase):
for axis in (-1, 0):
self._compareAll(x, axis)
+ def testAxisType(self):
+ for dtype in self.valid_dtypes:
+ x = np.arange(1, 6).reshape([5]).astype(dtype)
+ for axis_dtype in [dtypes.int64, dtypes.int32]:
+ with self.test_session(use_gpu=True):
+ axis = constant_op.constant(0, axis_dtype)
+ tf_out = math_ops.cumprod(x, axis).eval()
+
def test1D(self):
for dtype in self.valid_dtypes:
x = np.arange(1, 6).reshape([5]).astype(dtype)
diff --git a/tensorflow/python/kernel_tests/shape_ops_test.py b/tensorflow/python/kernel_tests/shape_ops_test.py
index 52cf904528..a9fc699b21 100644
--- a/tensorflow/python/kernel_tests/shape_ops_test.py
+++ b/tensorflow/python/kernel_tests/shape_ops_test.py
@@ -411,14 +411,16 @@ class TileTest(test.TestCase):
self.assertEqual(7, result)
def testSimple(self):
- with self.test_session():
- inp = np.random.rand(4, 1).astype(np.float32)
- a = constant_op.constant(inp)
- tiled = array_ops.tile(a, [1, 4])
- result = tiled.eval()
- self.assertEqual(result.shape, (4, 4))
- self.assertEqual([4, 4], tiled.get_shape())
- self.assertTrue((result == np.tile(inp, (1, 4))).all())
+ # multiples could be int32 or int64
+ for dtype in [dtypes.int32, dtypes.int64]:
+ with self.test_session(use_gpu=True):
+ inp = np.random.rand(4, 1).astype(np.float32)
+ a = constant_op.constant(inp)
+ tiled = array_ops.tile(a, constant_op.constant([1, 4], dtype=dtype))
+ result = tiled.eval()
+ self.assertEqual(result.shape, (4, 4))
+ self.assertEqual([4, 4], tiled.get_shape())
+ self.assertTrue((result == np.tile(inp, (1, 4))).all())
def testIdentityTileAndGrad(self):
with self.test_session():
diff --git a/tensorflow/python/kernel_tests/slice_op_test.py b/tensorflow/python/kernel_tests/slice_op_test.py
index f6997e9c61..f415d9e70d 100644
--- a/tensorflow/python/kernel_tests/slice_op_test.py
+++ b/tensorflow/python/kernel_tests/slice_op_test.py
@@ -217,6 +217,17 @@ class SliceTest(test.TestCase):
self.assertEqual(expected_val.shape, slice_t.get_shape())
self.assertEqual(expected_val.shape, slice2_t.get_shape())
+ def testPartialShapeInference(self):
+ z = array_ops.zeros((1, 2, 3))
+ self.assertAllEqual(z.get_shape().as_list(), [1, 2, 3])
+
+ m1 = array_ops.slice(z, [0, 0, 0], [-1, -1, -1])
+ self.assertAllEqual(m1.get_shape().as_list(), [1, 2, 3])
+
+ m2 = array_ops.slice(z, [0, 0, 0], [constant_op.constant(1) + 0, 2, -1])
+ self.assertAllEqual(m2.get_shape().as_list(), [None, 2, None])
+
+
def _testGradientSlice(self, input_shape, slice_begin, slice_size):
with self.test_session(use_gpu=True):
num_inputs = np.prod(input_shape)
diff --git a/tensorflow/python/kernel_tests/transpose_op_test.py b/tensorflow/python/kernel_tests/transpose_op_test.py
index 3b352937c8..c551d9c3d0 100644
--- a/tensorflow/python/kernel_tests/transpose_op_test.py
+++ b/tensorflow/python/kernel_tests/transpose_op_test.py
@@ -317,6 +317,19 @@ class TransposeTest(test.TestCase):
np.arange(0, 8).reshape([2, 4]).astype(np.float32),
np.array([1, 0]).astype(np.int32))
+ def testPermType(self):
+ for perm_dtype in [np.int64, np.int32]:
+ x = np.arange(0, 8).reshape([2, 4]).astype(np.float32)
+ p = np.array([1, 0]).astype(perm_dtype)
+ np_ans = np.copy(x).transpose(p)
+ with self.test_session(use_gpu=True):
+ inx = ops.convert_to_tensor(x)
+ inp = constant_op.constant(p)
+ y = array_ops.transpose(inx, inp)
+ tf_ans = y.eval()
+ self.assertShapeEqual(np_ans, y)
+ self.assertAllEqual(np_ans, tf_ans)
+
def testHalf(self):
self._compare(np.arange(0, 21).reshape([3, 7]).astype(np.float16))
self._compare(np.arange(0, 210).reshape([2, 3, 5, 7]).astype(np.float16))
diff --git a/tensorflow/python/ops/hidden_ops.txt b/tensorflow/python/ops/hidden_ops.txt
index fcd378e3c0..86bc038e86 100644
--- a/tensorflow/python/ops/hidden_ops.txt
+++ b/tensorflow/python/ops/hidden_ops.txt
@@ -43,6 +43,7 @@ UniformCandidateSampler
GenerateVocabRemapping
LoadAndRemapMatrix
+
# control_flow_ops
Switch
Merge
@@ -241,6 +242,7 @@ TensorSummaryV2
# math_ops
Abs
+AccumulateNV2
AddN
All
Any
@@ -257,6 +259,7 @@ ComplexAbs
Conj
FloorDiv
FloorMod
+HistogramFixedWidth
Max
Mean
Min
diff --git a/tensorflow/python/ops/histogram_ops.py b/tensorflow/python/ops/histogram_ops.py
index c2077d51af..51e4be9343 100644
--- a/tensorflow/python/ops/histogram_ops.py
+++ b/tensorflow/python/ops/histogram_ops.py
@@ -28,6 +28,7 @@ from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import clip_ops
+from tensorflow.python.ops import gen_math_ops
from tensorflow.python.ops import math_ops
@@ -69,30 +70,6 @@ def histogram_fixed_width(values,
```
"""
with ops.name_scope(name, 'histogram_fixed_width',
- [values, value_range, nbins]) as scope:
- values = ops.convert_to_tensor(values, name='values')
- values = array_ops.reshape(values, [-1])
- value_range = ops.convert_to_tensor(value_range, name='value_range')
- nbins = ops.convert_to_tensor(nbins, dtype=dtypes.int32, name='nbins')
- nbins_float = math_ops.cast(nbins, values.dtype)
-
- # Map tensor values that fall within value_range to [0, 1].
- scaled_values = math_ops.truediv(values - value_range[0],
- value_range[1] - value_range[0],
- name='scaled_values')
-
- # map tensor values within the open interval value_range to {0,.., nbins-1},
- # values outside the open interval will be zero or less, or nbins or more.
- indices = math_ops.floor(nbins_float * scaled_values, name='indices')
-
- # Clip edge cases (e.g. value = value_range[1]) or "outliers."
- indices = math_ops.cast(
- clip_ops.clip_by_value(indices, 0, nbins_float - 1), dtypes.int32)
-
- # TODO(langmore) This creates an array of ones to add up and place in the
- # bins. This is inefficient, so replace when a better Op is available.
- return math_ops.unsorted_segment_sum(
- array_ops.ones_like(indices, dtype=dtype),
- indices,
- nbins,
- name=scope)
+ [values, value_range, nbins]) as name:
+ return gen_math_ops._histogram_fixed_width(values, value_range, nbins,
+ dtype=dtype, name=name)
diff --git a/tensorflow/python/ops/histogram_ops_test.py b/tensorflow/python/ops/histogram_ops_test.py
index e819e0234d..bf6e0296f6 100644
--- a/tensorflow/python/ops/histogram_ops_test.py
+++ b/tensorflow/python/ops/histogram_ops_test.py
@@ -36,7 +36,7 @@ class HistogramFixedWidthTest(test.TestCase):
value_range = [0.0, 5.0]
values = []
expected_bin_counts = [0, 0, 0, 0, 0]
- with self.test_session():
+ with self.test_session(use_gpu=True):
hist = histogram_ops.histogram_fixed_width(values, value_range, nbins=5)
self.assertEqual(dtypes.int32, hist.dtype)
self.assertAllClose(expected_bin_counts, hist.eval())
@@ -47,7 +47,7 @@ class HistogramFixedWidthTest(test.TestCase):
value_range = [0.0, 5.0]
values = [-1.0, 0.0, 1.5, 2.0, 5.0, 15]
expected_bin_counts = [2, 1, 1, 0, 2]
- with self.test_session():
+ with self.test_session(use_gpu=True):
hist = histogram_ops.histogram_fixed_width(
values, value_range, nbins=5, dtype=dtypes.int64)
self.assertEqual(dtypes.int64, hist.dtype)
@@ -59,7 +59,7 @@ class HistogramFixedWidthTest(test.TestCase):
value_range = np.float64([0.0, 5.0])
values = np.float64([-1.0, 0.0, 1.5, 2.0, 5.0, 15])
expected_bin_counts = [2, 1, 1, 0, 2]
- with self.test_session():
+ with self.test_session(use_gpu=True):
hist = histogram_ops.histogram_fixed_width(values, value_range, nbins=5)
self.assertEqual(dtypes.int32, hist.dtype)
self.assertAllClose(expected_bin_counts, hist.eval())
@@ -70,7 +70,7 @@ class HistogramFixedWidthTest(test.TestCase):
value_range = [0.0, 5.0]
values = [[-1.0, 0.0, 1.5], [2.0, 5.0, 15]]
expected_bin_counts = [2, 1, 1, 0, 2]
- with self.test_session():
+ with self.test_session(use_gpu=True):
hist = histogram_ops.histogram_fixed_width(values, value_range, nbins=5)
self.assertEqual(dtypes.int32, hist.dtype)
self.assertAllClose(expected_bin_counts, hist.eval())
diff --git a/tensorflow/python/ops/image_ops_test.py b/tensorflow/python/ops/image_ops_test.py
index ebbf581204..d1554b399f 100644
--- a/tensorflow/python/ops/image_ops_test.py
+++ b/tensorflow/python/ops/image_ops_test.py
@@ -1374,6 +1374,25 @@ class PadToBoundingBoxTest(test_util.TensorFlowTestCase):
y = image_ops.pad_to_bounding_box(image, 0, 0, height, width)
self.assertEqual(y.get_shape().as_list(), post_shape)
+ def testInt64(self):
+ x = [1, 2, 3,
+ 4, 5, 6,
+ 7, 8, 9]
+ x_shape = [3, 3, 1]
+
+ y = [0, 0, 0,
+ 1, 2, 3,
+ 4, 5, 6,
+ 7, 8, 9]
+ y_shape = [4, 3, 1]
+ x = np.array(x).reshape(x_shape)
+ y = np.array(y).reshape(y_shape)
+
+ i = constant_op.constant([1, 0, 4, 3], dtype=dtypes.int64)
+ y_tf = image_ops.pad_to_bounding_box(x, i[0], i[1], i[2], i[3])
+ with self.test_session(use_gpu=True):
+ self.assertAllClose(y, y_tf.eval())
+
def testNoOp(self):
x_shape = [10, 10, 10]
x = np.random.uniform(size=x_shape)
@@ -1672,8 +1691,8 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
image_ops.ResizeMethod.BICUBIC,
image_ops.ResizeMethod.AREA]
- TYPES = [np.uint8, np.int8, np.int16, np.int32, np.int64,
- np.float32, np.float64]
+ TYPES = [np.uint8, np.int8, np.uint16, np.int16, np.int32, np.int64,
+ np.float16, np.float32, np.float64]
def _assertShapeInference(self, pre_shape, size, post_shape):
# Try single image resize
@@ -2434,9 +2453,13 @@ class JpegTest(test_util.TensorFlowTestCase):
y, x, h, w = crop_window
image1_crop = image_ops.crop_to_bounding_box(image1, y, x, h, w)
- # Combined crop+decode.
+ # Combined decode+crop.
image2 = image_ops.decode_and_crop_jpeg(jpeg0, crop_window)
+ # Combined decode+crop should have the same shape inference
+ self.assertAllEqual(image1_crop.get_shape().as_list(),
+ image2.get_shape().as_list())
+
# CropAndDecode should be equal to DecodeJpeg+Crop.
image1_crop, image2 = sess.run([image1_crop, image2])
self.assertAllEqual(image1_crop, image2)
diff --git a/tensorflow/python/ops/losses/losses_impl.py b/tensorflow/python/ops/losses/losses_impl.py
index 752d260fba..55a18d28ca 100644
--- a/tensorflow/python/ops/losses/losses_impl.py
+++ b/tensorflow/python/ops/losses/losses_impl.py
@@ -27,6 +27,7 @@ from tensorflow.python.ops import nn
from tensorflow.python.ops import nn_ops
from tensorflow.python.ops import weights_broadcast_ops
from tensorflow.python.ops.losses import util
+from tensorflow.python.util.deprecation import deprecated_args
class Reduction(object):
@@ -230,10 +231,12 @@ def absolute_difference(
losses, weights, scope, loss_collection, reduction=reduction)
+@deprecated_args(None, "dim is deprecated, use axis instead", "dim")
def cosine_distance(
- labels, predictions, dim=None, weights=1.0, scope=None,
+ labels, predictions, axis=None, weights=1.0, scope=None,
loss_collection=ops.GraphKeys.LOSSES,
- reduction=Reduction.SUM_BY_NONZERO_WEIGHTS):
+ reduction=Reduction.SUM_BY_NONZERO_WEIGHTS,
+ dim=None):
"""Adds a cosine-distance loss to the training procedure.
Note that the function assumes that `predictions` and `labels` are already
@@ -242,13 +245,14 @@ def cosine_distance(
Args:
labels: `Tensor` whose shape matches 'predictions'
predictions: An arbitrary matrix.
- dim: The dimension along which the cosine distance is computed.
+ axis: The dimension along which the cosine distance is computed.
weights: Optional `Tensor` whose rank is either 0, or the same rank as
`labels`, and must be broadcastable to `labels` (i.e., all dimensions must
be either `1`, or the same as the corresponding `losses` dimension).
scope: The scope for the operations performed in computing the loss.
loss_collection: collection to which this loss will be added.
reduction: Type of reduction to apply to loss.
+ dim: The old (deprecated) name for `axis`.
Returns:
Weighted loss float `Tensor`. If `reduction` is `NONE`, this has the same
@@ -256,10 +260,14 @@ def cosine_distance(
Raises:
ValueError: If `predictions` shape doesn't match `labels` shape, or
- `dim`, `labels`, `predictions` or `weights` is `None`.
+ `axis`, `labels`, `predictions` or `weights` is `None`.
"""
- if dim is None:
- raise ValueError("`dim` cannot be None.")
+ if dim is not None:
+ if axis is not None:
+ raise ValueError("Cannot specify both 'axis' and 'dim'")
+ axis = dim
+ if axis is None and dim is None:
+ raise ValueError("You must specify 'axis'.")
if labels is None:
raise ValueError("labels must not be None.")
if predictions is None:
@@ -271,7 +279,7 @@ def cosine_distance(
predictions.get_shape().assert_is_compatible_with(labels.get_shape())
radial_diffs = math_ops.multiply(predictions, labels)
- losses = 1 - math_ops.reduce_sum(radial_diffs, axis=(dim,), keep_dims=True)
+ losses = 1 - math_ops.reduce_sum(radial_diffs, axis=(axis,), keep_dims=True)
return compute_weighted_loss(
losses, weights, scope, loss_collection, reduction=reduction)
diff --git a/tensorflow/python/ops/metrics_impl.py b/tensorflow/python/ops/metrics_impl.py
index 9273659a77..10ff4be2dd 100644
--- a/tensorflow/python/ops/metrics_impl.py
+++ b/tensorflow/python/ops/metrics_impl.py
@@ -949,6 +949,12 @@ def mean_iou(labels,
cm_diag = math_ops.to_float(array_ops.diag_part(total_cm))
denominator = sum_over_row + sum_over_col - cm_diag
+ # The mean is only computed over classes that appear in the
+ # label or prediction tensor. If the denominator is 0, we need to
+ # ignore the class.
+ num_valid_entries = math_ops.reduce_sum(math_ops.cast(
+ math_ops.not_equal(denominator, 0), dtype=dtypes.float32))
+
# If the value of the denominator is 0, set it to 1 to avoid
# zero division.
denominator = array_ops.where(
@@ -956,7 +962,13 @@ def mean_iou(labels,
denominator,
array_ops.ones_like(denominator))
iou = math_ops.div(cm_diag, denominator)
- return math_ops.reduce_mean(iou, name=name)
+
+ # If the number of valid entries is 0 (no classes) we return 0.
+ result = array_ops.where(
+ math_ops.greater(num_valid_entries, 0),
+ math_ops.reduce_sum(iou, name=name) / num_valid_entries,
+ 0)
+ return result
mean_iou_v = compute_mean_iou('mean_iou')
diff --git a/tensorflow/python/ops/nn_grad.py b/tensorflow/python/ops/nn_grad.py
index af610d8fdb..557f39fb42 100644
--- a/tensorflow/python/ops/nn_grad.py
+++ b/tensorflow/python/ops/nn_grad.py
@@ -352,6 +352,13 @@ def _Relu6Grad(op, grad):
return gen_nn_ops._relu6_grad(grad, op.outputs[0]) # pylint: disable=protected-access
+@ops.RegisterGradient("Relu6Grad")
+def _Relu6GradGrad(op, grad):
+ x = op.inputs[1]
+ return (gen_nn_ops._relu6_grad(grad, x), array_ops.zeros(
+ shape=array_ops.shape(x), dtype=x.dtype))
+
+
@ops.RegisterGradient("Elu")
def _EluGrad(op, grad):
return gen_nn_ops._elu_grad(grad, op.outputs[0])
@@ -934,3 +941,32 @@ def _TopKGrad(op, grad, _):
validate_indices=False),
in_shape), array_ops.zeros(
[], dtype=dtypes.int32)]
+
+
+@ops.RegisterGradient("NthElement")
+def _NthElementGrad(op, grad):
+ """Return the gradients for NthElement.
+
+ Args:
+ op: The NthElementOp for which we need to generate gradients.
+ grad: Tensor. The gradients passed to the NthElementOp
+
+ Returns:
+ A list of two tensors, the first being the gradient w.r.t. the input,
+ the second being the gradient w.r.t. the N (None).
+ """
+ input = op.inputs[0]
+ output = op.outputs[0]
+
+ # Compute the number of elements which equal to output in each reduction
+ # dimension. If there are multiple elements then the gradient will be
+ # divided between them.
+ indicators = math_ops.cast(
+ math_ops.equal(array_ops.expand_dims(output, -1), input),
+ grad.dtype)
+
+ grad = array_ops.expand_dims(grad, -1)
+ num_selected = array_ops.expand_dims(
+ math_ops.reduce_sum(indicators, -1), -1)
+
+ return [math_ops.div(indicators, num_selected) * grad, None]
diff --git a/tensorflow/python/ops/nn_grad_test.py b/tensorflow/python/ops/nn_grad_test.py
new file mode 100644
index 0000000000..f7541c0e89
--- /dev/null
+++ b/tensorflow/python/ops/nn_grad_test.py
@@ -0,0 +1,48 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Python ops defined in nn_grad.py."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import gradient_checker
+from tensorflow.python.ops import gradients_impl
+from tensorflow.python.ops import nn_grad
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.platform import test
+
+
+class Relu6OpTest(test.TestCase):
+ def testRelu6GradGrad(self):
+ inputs = constant_op.constant([[-2, -1, 1, 3], [5, 7, 8, 9]],
+ dtype=dtypes.float32)
+ x_init_value = np.array([[-3.5, -1.5, 2, 4], [4.5, 7.5, 8.5, 11]])
+ r = nn_ops.relu6(inputs)
+ r_g = gradients_impl.gradients(r, inputs)[0]
+ with self.test_session():
+ error = gradient_checker.compute_gradient_error(
+ inputs, inputs.get_shape().as_list(),
+ r_g, r_g.get_shape().as_list(),
+ x_init_value=x_init_value)
+ self.assertLess(error, 1e-4)
+
+
+if __name__ == "__main__":
+ test.main()
diff --git a/tensorflow/python/ops/nn_ops.py b/tensorflow/python/ops/nn_ops.py
index 5f82323bfc..a37b68c6fa 100644
--- a/tensorflow/python/ops/nn_ops.py
+++ b/tensorflow/python/ops/nn_ops.py
@@ -2140,6 +2140,34 @@ def top_k(input, k=1, sorted=True, name=None):
return gen_nn_ops._top_kv2(input, k=k, sorted=sorted, name=name)
+def nth_element(input, n, reverse=False, name=None):
+  r"""Finds values of the `n`-th order statistic for the last dimension.
+
+  If the input is a vector (rank-1), finds the entry which is the nth-smallest
+  value in the vector and outputs its value as a scalar tensor.
+
+  For matrices (resp. higher rank input), computes the entry which is the
+  nth-smallest value in each row (resp. vector along the last dimension). Thus,
+
+ values.shape = input.shape[:-1]
+
+ Args:
+ input: 1-D or higher `Tensor` with last dimension at least `n+1`.
+ n: A `Tensor` of type `int32`.
+ 0-D. Position of sorted vector to select along the last dimension (along
+      each row for matrices). Valid range of n is `[0, input.shape[-1])`
+ reverse: An optional `bool`. Defaults to `False`.
+ When set to True, find the nth-largest value in the vector and vice
+ versa.
+ name: A name for the operation (optional).
+
+ Returns:
+ A `Tensor`. Has the same type as `input`.
+ The `n`-th order statistic along each last dimensional slice.
+ """
+ return gen_nn_ops.nth_element(input, n, reverse=reverse, name=name)
+
+
def conv1d(value, filters, stride, padding,
use_cudnn_on_gpu=None, data_format=None,
name=None):
diff --git a/tensorflow/python/platform/self_check.py b/tensorflow/python/platform/self_check.py
index 39d38d7bbc..966a094e55 100644
--- a/tensorflow/python/platform/self_check.py
+++ b/tensorflow/python/platform/self_check.py
@@ -21,7 +21,13 @@ from __future__ import print_function
import os
-from tensorflow.python.platform import build_info
+try:
+ from tensorflow.python.platform import build_info
+except ImportError:
+ raise ImportError("Could not import tensorflow. Do not import tensorflow "
+ "from its source directory; change directory to outside "
+ "the TensorFlow source tree, and relaunch your Python "
+ "interpreter from there.")
def preload_check():
diff --git a/tensorflow/tools/api/golden/tensorflow.losses.pbtxt b/tensorflow/tools/api/golden/tensorflow.losses.pbtxt
index 79443839b9..c1d190ae11 100644
--- a/tensorflow/tools/api/golden/tensorflow.losses.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.losses.pbtxt
@@ -18,7 +18,7 @@ tf_module {
}
member_method {
name: "cosine_distance"
- argspec: "args=[\'labels\', \'predictions\', \'dim\', \'weights\', \'scope\', \'loss_collection\', \'reduction\'], varargs=None, keywords=None, defaults=[\'None\', \'1.0\', \'None\', \'losses\', \'weighted_sum_by_nonzero_weights\'], "
+ argspec: "args=[\'labels\', \'predictions\', \'axis\', \'weights\', \'scope\', \'loss_collection\', \'reduction\', \'dim\'], varargs=None, keywords=None, defaults=[\'None\', \'1.0\', \'None\', \'losses\', \'weighted_sum_by_nonzero_weights\', \'None\'], "
}
member_method {
name: "get_losses"
diff --git a/tensorflow/tools/ci_build/Dockerfile.pi b/tensorflow/tools/ci_build/Dockerfile.pi
index 9d12ededb8..2fddd6a2c0 100644
--- a/tensorflow/tools/ci_build/Dockerfile.pi
+++ b/tensorflow/tools/ci_build/Dockerfile.pi
@@ -14,6 +14,9 @@ RUN /install/install_proto3.sh
RUN /install/install_buildifier.sh
RUN /install/install_auditwheel.sh
RUN /install/install_golang.sh
+
+# The following line installs the Python cross-compilation toolchain. All the
+# preceding dependencies should be kept in sync with the main CPU docker file.
RUN /install/install_pi_toolchain.sh
# Set up the master bazelrc configuration file.
diff --git a/tensorflow/tools/ci_build/Dockerfile.pi-python3 b/tensorflow/tools/ci_build/Dockerfile.pi-python3
new file mode 100644
index 0000000000..18b131ea19
--- /dev/null
+++ b/tensorflow/tools/ci_build/Dockerfile.pi-python3
@@ -0,0 +1,23 @@
+FROM ubuntu:14.04
+
+MAINTAINER Jan Prach <jendap@google.com>
+
+# Copy and run the install scripts.
+COPY install/*.sh /install/
+RUN /install/install_bootstrap_deb_packages.sh
+RUN add-apt-repository -y ppa:openjdk-r/ppa && \
+ add-apt-repository -y ppa:george-edison55/cmake-3.x
+RUN /install/install_deb_packages.sh
+RUN /install/install_pip_packages.sh
+RUN /install/install_bazel.sh
+RUN /install/install_proto3.sh
+RUN /install/install_buildifier.sh
+RUN /install/install_auditwheel.sh
+RUN /install/install_golang.sh
+
+# The following line installs the Python cross-compilation toolchain. All the
+# preceding dependencies should be kept in sync with the main CPU docker file.
+RUN /install/install_pi_python3_toolchain.sh
+
+# Set up the master bazelrc configuration file.
+COPY install/.bazelrc /etc/bazel.bazelrc
diff --git a/tensorflow/tools/ci_build/README.md b/tensorflow/tools/ci_build/README.md
index ad83669950..acef833909 100644
--- a/tensorflow/tools/ci_build/README.md
+++ b/tensorflow/tools/ci_build/README.md
@@ -1,115 +1,76 @@
# TensorFlow Builds
-This directory contains all the files and setup instructions to run all
-the important builds and tests. **You can trivially run it yourself!** It also
-run continuous integration [ci.tensorflow.org](https://ci.tensorflow.org).
-
-
+This directory contains all the files and setup instructions to run all the
+important builds and tests. You can run it yourself!
## Run It Yourself
-1. Install [Docker](http://www.docker.com/). Follow instructions
- [on the Docker site](https://docs.docker.com/installation/).
-
- You can run all the jobs **without docker** if you are on mac or on linux
- and you just don't want docker. Just install all the dependencies from
- [Installing TensorFlow](https://www.tensorflow.org/install/).
- Then run any of the one liners below without the
- `tensorflow/tools/ci_build/ci_build.sh` in them.
-
-2. Clone tensorflow repository.
-
- ```bash
- git clone https://github.com/tensorflow/tensorflow.git
- ```
-
-3. Go to tensorflow directory
-
- ```bash
- cd tensorflow
- ```
-
-4. Build what you want, for example
-
- ```bash
- tensorflow/tools/ci_build/ci_build.sh CPU bazel test //tensorflow/...
- ```
- If you are using the Docker image on Windows or OS X, the Docker VM's default
- memory limit may be too low to build TensorFlow. This can result in
- strange-looking errors, e.g. the compilation may fail with `gcc: internal
- compiler error: Killed (program cc1plus)`. Try increasing the memory limit in
- the Docker preferences.
-
-
-## Jobs
-
-The jobs run by [ci.tensorflow.org](https://ci.tensorflow.org) include following:
-
-```bash
-# Note: You can run the following one-liners yourself if you have Docker. Run
-# without `tensorflow/tools/ci_build/ci_build.sh` on mac or linux without Docker.
-
-# build and run cpu tests
-tensorflow/tools/ci_build/ci_build.sh CPU bazel test //tensorflow/...
+You have two options when running TensorFlow tests locally on your
+machine. First, using docker, you can run our Continuous Integration
+(CI) scripts on tensorflow devel images. The other option is to install
+all TensorFlow dependencies on your machine and run the scripts
+natively on your system.
-# build and run gpu tests (note if you get unstable results you may be running
-# out of gpu memory - if so add "--jobs=1" argument)
-tensorflow/tools/ci_build/ci_build.sh GPU bazel test -c opt --config=cuda //tensorflow/...
+### Run TensorFlow CI Scripts using Docker
-# build pip with gpu support
-tensorflow/tools/ci_build/ci_build.sh GPU tensorflow/tools/ci_build/builds/pip.sh GPU -c opt --config=cuda
+1. Install Docker following the [instructions on the docker website](https://docs.docker.com/engine/installation/).
-# build and run gpu tests using python 3
-CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3" tensorflow/tools/ci_build/ci_build.sh GPU tensorflow/tools/ci_build/builds/pip.sh GPU -c opt --config=cuda
+2. Start a container with one of the devel images here:
+ https://hub.docker.com/r/tensorflow/tensorflow/tags/.
-# build android example app
-tensorflow/tools/ci_build/ci_build.sh ANDROID tensorflow/tools/ci_build/builds/android.sh
+3. Based on your choice of the image, pick one of the scripts under
+ https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/ci_build/linux
+ and run them from the TensorFlow repository root.
-# cmake cpu build and test
-tensorflow/tools/ci_build/ci_build.sh CPU tensorflow/tools/ci_build/builds/cmake.sh
+### Run TensorFlow CI Scripts Natively on your Machine
-# run bash inside the container
-CI_DOCKER_EXTRA_PARAMS='-it --rm' tensorflow/tools/ci_build/ci_build.sh CPU /bin/bash
-```
+1. Follow the instructions at https://www.tensorflow.org/install/install_sources,
+ but stop when you get to the section "Configure the installation". You do not
+ need to configure the installation to run the CI scripts.
-**Note**: The set of jobs and how they are triggered is still evolving.
-There are builds for master branch on cpu, gpu and android. There is a build
-for incoming gerrit changes. Gpu tests and benchmark are coming soon. Check
-[ci.tensorflow.org](https://ci.tensorflow.org) for current jobs.
+2. Pick the appropriate OS and python version you have installed,
+ and run the script under tensorflow/tools/ci_build/<OS>.
+## TensorFlow Continuous Integration
+To verify that new changes don’t break TensorFlow, we run builds and
+tests on either [Jenkins](https://jenkins-ci.org/) or a CI system
+internal to Google.
-## How Does TensorFlow Continuous Integration Work
+We can trigger builds and tests on updates to master or on each pull
+request. Contact one of the repository maintainers to trigger builds
+on your pull request.
-We use [jenkins](https://jenkins-ci.org/) as our continuous integration.
-It is running at [ci.tensorflow.org](https://ci.tensorflow.org).
-All the jobs are run within [docker](http://www.docker.com/) containers.
+### View CI Results
-Builds can be triggered by push to master, push a change set or manually.
-The build started in jenkins will first pull the git tree. Then jenkins builds
-a docker container (using one of those Dockerfile.* files in this directory).
-The build itself is run within the container itself.
+The Pull Request will show if the change passed or failed the checks.
-Source tree lives in jenkins job workspace. Docker container for jenkins
-are transient - deleted after the build. Containers build very fast thanks
-to docker caching. Individual builds are fast thanks to bazel caching.
+From the pull request, click **Show all checks** to see the list of builds
+and tests. Click on **Details** to see the results from Jenkins or the internal
+CI system.
+Results from Jenkins are displayed in the Jenkins UI. For more information,
+see the [Jenkins documentation](https://jenkins.io/doc/).
+Results from the internal CI system are displayed in the Build Status UI. In
+this UI, to see the logs for a failed build:
-## Implementation Details
+* Click on the **INVOCATION LOG** tab to see the invocation log.
-* The ci_build.sh script create and run docker container with all dependencies.
- The builds/with_the_same_user together with ci_build.sh creates an environment
- which is the same inside the container as it is outside. The same user, group,
- path, so that docker symlinks work inside and outside the container. You can
- use it for your development. Edit files in your git clone directory. If you
- run the ci_build.sh it gets this directory mapped inside the container and
- build your tree.
+* Click on the **ARTIFACTS** tab to see a list of all artifacts, including logs.
-* The unusual `bazel-ci_build-cache` directory is mapped to docker container
- performing the build using docker's --volume parameter. This way we cache
- bazel output between builds.
+* Individual test logs may be available. To see these logs, from the **TARGETS**
+ tab, click on the failed target. Then, click on the **TARGET LOG** tab to see
+ its test log.
-* The `builds` directory within this folder contains shell scripts to run within
- the container. They essentially contains workarounds for current limitations
- of bazel.
+ If you’re looking at target that is sharded or a test that is flaky, then
+ the build tool divided the target into multiple shards or ran the test
+ multiple times. Each test log is specific to the shard, run, and attempt.
+ To see a specific log:
+
+ 1. Click on the log icon that is on the right next to the shard, run,
+ and attempt number.
+
+ 2. In the grid that appears on the right, click on the specific shard,
+ run, and attempt to view its log. You can also type the desired shard,
+ run, or attempt number in the field above its grid.
diff --git a/tensorflow/tools/ci_build/builds/android_full.sh b/tensorflow/tools/ci_build/builds/android_full.sh
index 63250e0a4d..9d449241e8 100755
--- a/tensorflow/tools/ci_build/builds/android_full.sh
+++ b/tensorflow/tools/ci_build/builds/android_full.sh
@@ -40,7 +40,7 @@ rm -rf ${AAR_LIB_TMP}
for CPU in ${CPUS//,/ }
do
echo "========== Building native libs for Android ${CPU} =========="
- bazel build -c opt --cpu=${CPU} \
+ bazel build -c opt --config=monolithic --cpu=${CPU} \
--crosstool_top=//external:android/crosstool \
--host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
//tensorflow/core:android_tensorflow_lib \
@@ -62,7 +62,7 @@ done
# in assets/ dir (see https://github.com/bazelbuild/bazel/issues/2334)
# TODO(gunan): remove extra flags once sandboxing is enabled for all builds.
echo "========== Building TensorFlow Android Jar and Demo =========="
-bazel --bazelrc=/dev/null build -c opt --fat_apk_cpu=${CPUS} \
+bazel --bazelrc=/dev/null build -c opt --config=monolithic --fat_apk_cpu=${CPUS} \
--spawn_strategy=sandboxed --genrule_strategy=sandboxed \
//tensorflow/contrib/android:android_tensorflow_inference_java \
//tensorflow/contrib/android:android_tensorflow_inference_java.aar \
diff --git a/tensorflow/tools/ci_build/builds/libtensorflow.sh b/tensorflow/tools/ci_build/builds/libtensorflow.sh
index 5052d3626c..26713dded8 100755
--- a/tensorflow/tools/ci_build/builds/libtensorflow.sh
+++ b/tensorflow/tools/ci_build/builds/libtensorflow.sh
@@ -78,9 +78,52 @@ function build_libtensorflow_tarball() {
//tensorflow/tools/lib_package:libtensorflow_proto.zip
mkdir -p ${DIR}
+
cp bazel-bin/tensorflow/tools/lib_package/libtensorflow.tar.gz ${DIR}/libtensorflow${TARBALL_SUFFIX}.tar.gz
cp bazel-bin/tensorflow/tools/lib_package/libtensorflow_jni.tar.gz ${DIR}/libtensorflow_jni${TARBALL_SUFFIX}.tar.gz
- cp bazel-bin/tensorflow/java/libtensorflow.jar bazel-bin/tensorflow/java/libtensorflow-src.jar ${DIR}
+ cp bazel-bin/tensorflow/java/libtensorflow.jar ${DIR}
+ cp_normalized_srcjar bazel-bin/tensorflow/java/libtensorflow-src.jar ${DIR}/libtensorflow-src.jar
cp bazel-genfiles/tensorflow/tools/lib_package/libtensorflow_proto.zip ${DIR}
chmod -x ${DIR}/*
}
+
+# Helper function to copy a srcjar after moving any source files
+# directly under the root to the "maven-style" src/main/java layout
+#
+# Source files generated by annotation processors appear directly
+# under the root of srcjars jars created by bazel, rather than under
+# the maven-style src/main/java subdirectory.
+#
+# Bazel manages annotation generated source as follows: First, it
+# calls javac with options that create generated files under a
+# bazel-out directory. Next, it archives the generated source files
+# into a srcjar directly under the root. There doesn't appear to be a
+# simple way to parameterize this from bazel, hence this helper to
+# "normalize" the srcjar layout.
+#
+# Arguments:
+# src_jar - path to the original srcjar
+# dest_jar - path to the destination
+# Returns:
+# None
+function cp_normalized_srcjar() {
+ local src_jar="$1"
+ local dest_jar="$2"
+ if [[ -z "${src_jar}" || -z "${dest_jar}" ]]; then
+ echo "Unexpected: missing arguments" >&2
+ exit 2
+ fi
+ local tmp_dir
+ tmp_dir=$(mktemp -d)
+ cp "${src_jar}" "${tmp_dir}/orig.jar"
+ pushd "${tmp_dir}"
+ # Extract any src/ files
+ jar -xf "${tmp_dir}/orig.jar" src/
+ # Extract any org/ files under src/main/java
+ (mkdir -p src/main/java && cd src/main/java && jar -xf "${tmp_dir}/orig.jar" org/)
+ # Repackage src/
+ jar -cMf "${tmp_dir}/new.jar" src
+ popd
+ cp "${tmp_dir}/new.jar" "${dest_jar}"
+ rm -rf "${tmp_dir}"
+}
diff --git a/tensorflow/tools/ci_build/install/install_golang.sh b/tensorflow/tools/ci_build/install/install_golang.sh
index 596265b069..55c1674495 100755
--- a/tensorflow/tools/ci_build/install/install_golang.sh
+++ b/tensorflow/tools/ci_build/install/install_golang.sh
@@ -16,7 +16,7 @@
set -ex
-GOLANG_URL="https://storage.googleapis.com/golang/go1.9.linux-amd64.tar.gz"
+GOLANG_URL="https://storage.googleapis.com/golang/go1.9.1.linux-amd64.tar.gz"
sudo mkdir -p /usr/local
wget -q -O - "${GOLANG_URL}" | sudo tar -C /usr/local -xz
diff --git a/tensorflow/tools/ci_build/install/install_pi_python3_toolchain.sh b/tensorflow/tools/ci_build/install/install_pi_python3_toolchain.sh
new file mode 100755
index 0000000000..9d8e3df3b5
--- /dev/null
+++ b/tensorflow/tools/ci_build/install/install_pi_python3_toolchain.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+dpkg --add-architecture armhf
+echo 'deb [arch=armhf] http://ports.ubuntu.com/ trusty main restricted universe multiverse' >> /etc/apt/sources.list.d/armhf.list
+echo 'deb [arch=armhf] http://ports.ubuntu.com/ trusty-updates main restricted universe multiverse' >> /etc/apt/sources.list.d/armhf.list
+echo 'deb [arch=armhf] http://ports.ubuntu.com/ trusty-security main restricted universe multiverse' >> /etc/apt/sources.list.d/armhf.list
+echo 'deb [arch=armhf] http://ports.ubuntu.com/ trusty-backports main restricted universe multiverse' >> /etc/apt/sources.list.d/armhf.list
+sed -i 's#deb http://archive.ubuntu.com/ubuntu/#deb [arch=amd64] http://archive.ubuntu.com/ubuntu/#g' /etc/apt/sources.list
+apt-get update
+apt-get install -y libpython3-all-dev:armhf
+echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
+curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
+apt-get update
+rm -rf /usr/local/bin/bazel
+apt-get install -y bazel python3 python3-numpy python3-dev python3-pip
diff --git a/tensorflow/tools/ci_build/install/install_pi_toolchain.sh b/tensorflow/tools/ci_build/install/install_pi_toolchain.sh
index ef30ba58c2..03c43cc838 100755
--- a/tensorflow/tools/ci_build/install/install_pi_toolchain.sh
+++ b/tensorflow/tools/ci_build/install/install_pi_toolchain.sh
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/tensorflow/tools/ci_build/linux/cpu/run_mkl.sh b/tensorflow/tools/ci_build/linux/cpu/run_mkl.sh
new file mode 100755
index 0000000000..dbf376be6f
--- /dev/null
+++ b/tensorflow/tools/ci_build/linux/cpu/run_mkl.sh
@@ -0,0 +1,36 @@
+#!/usr/bin/env bash
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# ==============================================================================
+
+set -e
+set -x
+
+N_JOBS=$(grep -c ^processor /proc/cpuinfo)
+
+echo ""
+echo "Bazel will use ${N_JOBS} concurrent job(s)."
+echo ""
+
+# Run configure.
+export TF_NEED_CUDA=0
+export PYTHON_BIN_PATH=`which python2`
+yes "" | $PYTHON_BIN_PATH configure.py
+
+# Run bazel test command. Double test timeouts to avoid flakes.
+bazel test --test_tag_filters=-no_oss,-oss_serial,-gpu,-benchmark-test --test_lang_filters=py -k \
+ --jobs=${N_JOBS} --test_timeout 300,450,1200,3600 --build_tests_only \
+ --config=mkl --config=opt --test_output=errors -- \
+ //tensorflow/... -//tensorflow/compiler/... -//tensorflow/contrib/...
diff --git a/tensorflow/tools/docker/Dockerfile b/tensorflow/tools/docker/Dockerfile
index 07a972400d..024cb40eb4 100644
--- a/tensorflow/tools/docker/Dockerfile
+++ b/tensorflow/tools/docker/Dockerfile
@@ -1,6 +1,6 @@
FROM ubuntu:16.04
-MAINTAINER Craig Citro <craigcitro@google.com>
+LABEL maintainer="Craig Citro <craigcitro@google.com>"
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
diff --git a/tensorflow/tools/docker/Dockerfile.devel b/tensorflow/tools/docker/Dockerfile.devel
index 60a94504b7..20e1dcd085 100644
--- a/tensorflow/tools/docker/Dockerfile.devel
+++ b/tensorflow/tools/docker/Dockerfile.devel
@@ -1,6 +1,6 @@
FROM ubuntu:16.04
-MAINTAINER Craig Citro <craigcitro@google.com>
+LABEL maintainer="Craig Citro <craigcitro@google.com>"
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
@@ -72,7 +72,7 @@ RUN mkdir /bazel && \
RUN git clone https://github.com/tensorflow/tensorflow.git && \
cd tensorflow && \
- git checkout r1.3
+ git checkout r1.4
WORKDIR /tensorflow
# TODO(craigcitro): Don't install the pip package, since it makes it
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index 04773376e9..21a44ee404 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -1,6 +1,6 @@
FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
-MAINTAINER Craig Citro <craigcitro@google.com>
+LABEL maintainer="Craig Citro <craigcitro@google.com>"
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
@@ -73,20 +73,23 @@ RUN mkdir /bazel && \
RUN git clone https://github.com/tensorflow/tensorflow.git && \
cd tensorflow && \
- git checkout r1.3
+ git checkout r1.4
WORKDIR /tensorflow
# Configure the build for our CUDA configuration.
ENV CI_BUILD_PYTHON python
-ENV LD_LIBRARY_PATH /usr/local/cuda/lib64/stubs:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
+ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
ENV TF_NEED_CUDA 1
ENV TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2,6.0,6.1
-RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
-RUN tensorflow/tools/ci_build/builds/configured GPU \
- bazel build -c opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
+RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
+ LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH} \
+ tensorflow/tools/ci_build/builds/configured GPU \
+ bazel build -c opt --config=cuda \
+ --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
tensorflow/tools/pip_package:build_pip_package && \
+ rm /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/pip && \
pip --no-cache-dir install --upgrade /tmp/pip/tensorflow-*.whl && \
rm -rf /tmp/pip && \
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7 b/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
index ac1a437031..4558bc5293 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu-cuda9-cudnn7
@@ -42,6 +42,7 @@ RUN pip --no-cache-dir install \
scipy \
sklearn \
pandas \
+ wheel \
&& \
python -m ipykernel.kernelspec
@@ -80,22 +81,32 @@ RUN git clone https://github.com/tensorflow/tensorflow.git && \
WORKDIR /tensorflow
# Configure the build for our CUDA configuration.
-ENV CI_BUILD_PYTHON python
-ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
-ENV TF_NEED_CUDA 1
-ENV TF_CUDA_COMPUTE_CAPABILITIES 3.0,3.5,5.2,6.0,6.1
-ENV TF_CUDA_VERSION 9.0
-ENV TF_CUDNN_VERSION 7.0
+ENV CI_BUILD_PYTHON=python \
+ LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:${LD_LIBRARY_PATH} \
+ CUDNN_INSTALL_PATH=/usr/lib/x86_64-linux-gnu \
+ PYTHON_BIN_PATH=/usr/bin/python \
+ PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages \
+ TF_NEED_CUDA=1 \
+ TF_CUDA_VERSION=9.0 \
+ TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2,6.0,6.1,7.0 \
+ TF_CUDNN_VERSION=7
RUN ./configure
-RUN LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH} \
- bazel build -c opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
- --jobs=${TF_AVAILABLE_CPUS} \
- tensorflow/tools/pip_package:build_pip_package && \
- mkdir -p /pip_pkg && \
+# Build and Install TensorFlow.
+RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
+ LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH} \
+ bazel build -c opt \
+ --config=cuda \
+ --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
+ --jobs=${TF_AVAILABLE_CPUS} \
+ tensorflow/tools/pip_package:build_pip_package && \
+ mkdir /pip_pkg && \
bazel-bin/tensorflow/tools/pip_package/build_pip_package /pip_pkg
+# Clean up pip wheel and Bazel cache when done.
RUN pip --no-cache-dir install --upgrade /pip_pkg/tensorflow-*.whl && \
+ rm -rf /pip_pkg && \
+ rm -rf /root/.cache
WORKDIR /root
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index da83a30058..0571dd7391 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -1,6 +1,6 @@
FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
-MAINTAINER Craig Citro <craigcitro@google.com>
+LABEL maintainer="Craig Citro <craigcitro@google.com>"
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
diff --git a/tensorflow/tools/docker/README.md b/tensorflow/tools/docker/README.md
index 3780bde2be..2e5a0038ed 100644
--- a/tensorflow/tools/docker/README.md
+++ b/tensorflow/tools/docker/README.md
@@ -41,6 +41,7 @@ Note: If you would have a problem running nvidia-docker you may try the old meth
we have used. But it is not recommended. If you find a bug in nvidia-docker, please report
it there and try using nvidia-docker as described above.
+ $ # The old, not recommended way to run docker with gpu support:
$ export CUDA_SO=$(\ls /usr/lib/x86_64-linux-gnu/libcuda.* | xargs -I{} echo '-v {}:{}')
$ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
$ docker run -it -p 8888:8888 $CUDA_SO $DEVICES gcr.io/tensorflow/tensorflow:latest-gpu
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index c05d39e942..4f0de8f768 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -29,14 +29,14 @@ from setuptools.dist import Distribution
# This version string is semver compatible, but incompatible with pip.
# For pip, we will remove all '-' characters from this string, and use the
# result for pip.
-_VERSION = '1.3.0'
+_VERSION = '1.4.0-rc0'
REQUIRED_PACKAGES = [
'enum34 >= 1.1.6',
'numpy >= 1.12.1',
'six >= 1.10.0',
'protobuf >= 3.4.0',
- 'tensorflow-tensorboard >= 0.1.0, < 0.2.0',
+ 'tensorflow-tensorboard >= 0.4.0rc1, < 0.5.0',
]
project_name = 'tensorflow'
@@ -67,6 +67,7 @@ if sys.version_info < (3, 4):
# pylint: disable=line-too-long
CONSOLE_SCRIPTS = [
+ 'freeze_graph = tensorflow.python.tools.freeze_graph:main',
'saved_model_cli = tensorflow.python.tools.saved_model_cli:main',
# We need to keep the TensorBoard command, even though the console script
# is now declared by the tensorboard pip package. If we remove the
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 4d577fc246..a14469a0be 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -590,16 +590,13 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
actual = "@jsoncpp_git//:jsoncpp",
)
- patched_http_archive(
+ native.http_archive(
name = "boringssl",
urls = [
- "https://mirror.bazel.build/github.com/google/boringssl/archive/e3860009a091cd1bd2bc189cdbc3c6d095abde84.tar.gz",
- # "https://github.com/google/boringssl/archive/e3860009a091cd1bd2bc189cdbc3c6d095abde84.tar.gz", # 2017-07-07
+ "https://mirror.bazel.build/github.com/google/boringssl/archive/a0fb951d2a26a8ee746b52f3ba81ab011a0af778.tar.gz",
],
- sha256 = "02f5950f93c4fd3691771c07c9d04cf2999ab01383ff99da345249e93b0fcfb2",
- strip_prefix = "boringssl-e3860009a091cd1bd2bc189cdbc3c6d095abde84",
- # Add patch to boringssl code to support s390x
- patch_file = str(Label("//third_party/boringssl:add_boringssl_s390x.patch")),
+ sha256 = "524ba98a56300149696481b4cb9ddebd0c7b7ac9b9f6edee81da2d2d7e5d2bb3",
+ strip_prefix = "boringssl-a0fb951d2a26a8ee746b52f3ba81ab011a0af778",
)
native.new_http_archive(
@@ -701,6 +698,31 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
repository = tf_repo_name,
)
+ java_import_external(
+ name = "com_google_testing_compile",
+ jar_sha256 = "edc180fdcd9f740240da1a7a45673f46f59c5578d8cd3fbc912161f74b5aebb8",
+ jar_urls = [
+ "http://mirror.bazel.build/repo1.maven.org/maven2/com/google/testing/compile/compile-testing/0.11/compile-testing-0.11.jar",
+ "http://repo1.maven.org/maven2/com/google/testing/compile/compile-testing/0.11/compile-testing-0.11.jar",
+ "http://maven.ibiblio.org/maven2/com/google/testing/compile/compile-testing/0.11/compile-testing-0.11.jar",
+ ],
+ licenses = ["notice"], # New BSD License
+ testonly_ = True,
+ deps = ["@com_google_guava", "@com_google_truth"],
+ )
+
+ java_import_external(
+ name = "com_google_truth",
+ jar_sha256 = "032eddc69652b0a1f8d458f999b4a9534965c646b8b5de0eba48ee69407051df",
+ jar_urls = [
+ "http://mirror.bazel.build/repo1.maven.org/maven2/com/google/truth/truth/0.32/truth-0.32.jar",
+ "http://repo1.maven.org/maven2/com/google/truth/truth/0.32/truth-0.32.jar",
+ ],
+ licenses = ["notice"], # Apache 2.0
+ testonly_ = True,
+ deps = ["@com_google_guava"],
+ )
+
native.new_http_archive(
name = "com_google_pprof",
urls = [
@@ -715,11 +737,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
native.new_http_archive(
name = "cub_archive",
urls = [
- "https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.7.3.zip",
- # "https://github.com/NVlabs/cub/archive/1.7.3.zip",
+ "https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.7.4.zip",
+ # "https://github.com/NVlabs/cub/archive/1.7.4.zip",
],
- sha256 = "b7ead9e291d34ffa8074243541c1380d63be63f88de23de8ee548db573b72ebe",
- strip_prefix = "cub-1.7.3",
+ sha256 = "20a1a39fd97e5da7f40f5f2e7fd73fd2ea59f9dc4bb8a6c5f228aa543e727e31",
+ strip_prefix = "cub-1.7.4",
build_file = str(Label("//third_party:cub.BUILD")),
)
diff --git a/third_party/aws.BUILD b/third_party/aws.BUILD
index 38b7e0e543..9d8e7946cd 100644
--- a/third_party/aws.BUILD
+++ b/third_party/aws.BUILD
@@ -18,6 +18,7 @@ cc_library(
"@%ws%//tensorflow:darwin": glob([
"aws-cpp-sdk-core/source/platform/linux-shared/*.cpp",
]),
+ "//conditions:default": [],
}) + glob([
"aws-cpp-sdk-core/include/**/*.h",
"aws-cpp-sdk-core/source/*.cpp",
diff --git a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h
index 078be83e0d..c210b1712c 100644
--- a/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h
+++ b/third_party/eigen3/unsupported/Eigen/CXX11/src/FixedPoint/PacketMathAVX2.h
@@ -1,6 +1,35 @@
#ifndef THIRD_PARTY_EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
#define THIRD_PARTY_EIGEN3_UNSUPPORTED_EIGEN_CXX11_SRC_FIXEDPOINT_PACKETMATHAVX2_H_
+#ifdef _MSC_VER
+
+#include <immintrin.h>
+#include <emmintrin.h>
+#include <smmintrin.h>
+
+#endif
+
+inline int _mm256_extract_epi16_N0(const __m256i X)
+{
+ return _mm_extract_epi16(_mm256_extractf128_si256(X, 0 >> 3), 0 % 8);
+}
+
+inline int _mm256_extract_epi16_N1(const __m256i X)
+{
+ return _mm_extract_epi16(_mm256_extractf128_si256(X, 1 >> 3), 1 % 8);
+}
+
+inline int _mm256_extract_epi8_N0(const __m256i X)
+{
+ return _mm_extract_epi8(_mm256_extractf128_si256((X), 0 >> 4), 0 % 16);
+}
+
+inline int _mm256_extract_epi8_N1(const __m256i X)
+{
+ return _mm_extract_epi8(_mm256_extractf128_si256((X), 1 >> 4), 1 % 16);
+}
+
+
namespace Eigen {
namespace internal {
@@ -271,15 +300,15 @@ EIGEN_STRONG_INLINE QInt32 pfirst<Packet8q32i>(const Packet8q32i& a) {
}
template <>
EIGEN_STRONG_INLINE QInt16 pfirst<Packet16q16i>(const Packet16q16i& a) {
- return _mm256_extract_epi16(a.val, 0);
+ return _mm256_extract_epi16_N0(a.val);
}
template <>
EIGEN_STRONG_INLINE QUInt8 pfirst<Packet32q8u>(const Packet32q8u& a) {
- return static_cast<uint8_t>(_mm256_extract_epi8(a.val, 0));
+ return static_cast<uint8_t>(_mm256_extract_epi8_N0(a.val));
}
template <>
EIGEN_STRONG_INLINE QInt8 pfirst<Packet32q8i>(const Packet32q8i& a) {
- return _mm256_extract_epi8(a.val, 0);
+ return _mm256_extract_epi8_N0(a.val);
}
// Initialize to constant value.
@@ -391,7 +420,7 @@ EIGEN_STRONG_INLINE QInt16 predux_min<Packet16q16i>(const Packet16q16i& a) {
tmp =
_mm256_min_epi16(tmp, _mm256_shuffle_epi32(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
tmp = _mm256_min_epi16(tmp, _mm256_shuffle_epi32(tmp, 1));
- return std::min(_mm256_extract_epi16(tmp, 0), _mm256_extract_epi16(tmp, 1));
+ return std::min(_mm256_extract_epi16_N0(tmp), _mm256_extract_epi16_N1(tmp));
}
template <>
EIGEN_STRONG_INLINE QInt16 predux_max<Packet16q16i>(const Packet16q16i& a) {
@@ -399,7 +428,7 @@ EIGEN_STRONG_INLINE QInt16 predux_max<Packet16q16i>(const Packet16q16i& a) {
tmp =
_mm256_max_epi16(tmp, _mm256_shuffle_epi32(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
tmp = _mm256_max_epi16(tmp, _mm256_shuffle_epi32(tmp, 1));
- return std::max(_mm256_extract_epi16(tmp, 0), _mm256_extract_epi16(tmp, 1));
+ return std::max(_mm256_extract_epi16_N0(tmp), _mm256_extract_epi16_N1(tmp));
}
template <>
@@ -410,8 +439,8 @@ EIGEN_STRONG_INLINE QUInt8 predux_min<Packet32q8u>(const Packet32q8u& a) {
tmp = _mm256_min_epu8(tmp, _mm256_shuffle_epi32(tmp, 1));
tmp = _mm256_min_epu8(tmp,
_mm256_shufflelo_epi16(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
- return std::min(static_cast<uint8_t>(_mm256_extract_epi8(tmp, 0)),
- static_cast<uint8_t>(_mm256_extract_epi8(tmp, 1)));
+ return std::min(static_cast<uint8_t>(_mm256_extract_epi8_N0(tmp)),
+ static_cast<uint8_t>(_mm256_extract_epi8_N1(tmp)));
}
template <>
EIGEN_STRONG_INLINE QUInt8 predux_max<Packet32q8u>(const Packet32q8u& a) {
@@ -421,8 +450,8 @@ EIGEN_STRONG_INLINE QUInt8 predux_max<Packet32q8u>(const Packet32q8u& a) {
tmp = _mm256_max_epu8(tmp, _mm256_shuffle_epi32(tmp, 1));
tmp = _mm256_max_epu8(tmp,
_mm256_shufflelo_epi16(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
- return std::max(static_cast<uint8_t>(_mm256_extract_epi8(tmp, 0)),
- static_cast<uint8_t>(_mm256_extract_epi8(tmp, 1)));
+ return std::max(static_cast<uint8_t>(_mm256_extract_epi8_N0(tmp)),
+ static_cast<uint8_t>(_mm256_extract_epi8_N1(tmp)));
}
template <>
@@ -431,7 +460,7 @@ EIGEN_STRONG_INLINE QInt8 predux_min<Packet32q8i>(const Packet32q8i& a) {
tmp = _mm256_min_epi8(tmp, _mm256_shuffle_epi32(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
tmp = _mm256_min_epi8(tmp, _mm256_shuffle_epi32(tmp, 1));
tmp = _mm256_min_epi8(tmp, _mm256_shufflelo_epi16(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
- return std::min(_mm256_extract_epi8(tmp, 0), _mm256_extract_epi8(tmp, 1));
+ return std::min(_mm256_extract_epi8_N0(tmp), _mm256_extract_epi8_N1(tmp));
}
template <>
EIGEN_STRONG_INLINE QInt8 predux_max<Packet32q8i>(const Packet32q8i& a) {
@@ -439,7 +468,7 @@ EIGEN_STRONG_INLINE QInt8 predux_max<Packet32q8i>(const Packet32q8i& a) {
tmp = _mm256_max_epi8(tmp, _mm256_shuffle_epi32(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
tmp = _mm256_max_epi8(tmp, _mm256_shuffle_epi32(tmp, 1));
tmp = _mm256_max_epi8(tmp, _mm256_shufflelo_epi16(tmp, _MM_SHUFFLE(1, 0, 3, 2)));
- return std::max(_mm256_extract_epi8(tmp, 0), _mm256_extract_epi8(tmp, 1));
+ return std::max(_mm256_extract_epi8_N0(tmp), _mm256_extract_epi8_N1(tmp));
}
// Vectorized scaling of Packet32q8i by float.
diff --git a/third_party/toolchains/cpus/arm/CROSSTOOL.tpl b/third_party/toolchains/cpus/arm/CROSSTOOL.tpl
index ad7f5596d0..f0e17d1fe0 100644
--- a/third_party/toolchains/cpus/arm/CROSSTOOL.tpl
+++ b/third_party/toolchains/cpus/arm/CROSSTOOL.tpl
@@ -87,7 +87,7 @@ toolchain {
cxx_flag: "-isystem"
cxx_flag: "/usr/include/arm-linux-gnueabihf"
cxx_flag: "-isystem"
- cxx_flag: "/usr/include/python2.7"
+ cxx_flag: "%{PYTHON_INCLUDE_PATH}%"
cxx_flag: "-isystem"
cxx_flag: "/usr/include/"
linker_flag: "-lstdc++"
diff --git a/third_party/toolchains/cpus/arm/arm_compiler_configure.bzl b/third_party/toolchains/cpus/arm/arm_compiler_configure.bzl
index 5eb3b7bb1c..ab6eac115c 100644
--- a/third_party/toolchains/cpus/arm/arm_compiler_configure.bzl
+++ b/third_party/toolchains/cpus/arm/arm_compiler_configure.bzl
@@ -11,9 +11,20 @@ def _tpl(repository_ctx, tpl, substitutions={}, out=None):
def _arm_compiler_configure_impl(repository_ctx):
+ # We need to find a cross-compilation include directory for Python, so look
+ # for an environment variable. Be warned, this crosstool template is only
+ # regenerated on the first run of Bazel, so if you change the variable after
+ # it may not be reflected in later builds. Doing a shutdown and clean of Bazel
+ # doesn't fix this, you'll need to delete the generated file at something like:
+ # external/local_config_arm_compiler/CROSSTOOL in your Bazel install.
+ if "CROSSTOOL_PYTHON_INCLUDE_PATH" in repository_ctx.os.environ:
+ python_include_path = repository_ctx.os.environ["CROSSTOOL_PYTHON_INCLUDE_PATH"]
+ else:
+ python_include_path = "/usr/include/python2.7"
_tpl(repository_ctx, "CROSSTOOL", {
"%{ARM_COMPILER_PATH}%": str(repository_ctx.path(
repository_ctx.attr.remote_config_repo)),
+ "%{PYTHON_INCLUDE_PATH}%": python_include_path,
})
repository_ctx.symlink(repository_ctx.attr.build_file, "BUILD")