-rw-r--r--  ISSUE_TEMPLATE.md | 10
-rw-r--r--  README.md | 16
-rw-r--r--  RELEASE.md | 17
-rwxr-xr-x  configure | 8
-rw-r--r--  tensorflow/BUILD | 9
-rw-r--r--  tensorflow/contrib/cmake/CMakeLists.txt | 62
-rw-r--r--  tensorflow/contrib/cmake/README.md | 257
-rw-r--r--  tensorflow/contrib/cmake/external/eigen.cmake | 34
-rw-r--r--  tensorflow/contrib/cmake/external/jpeg.cmake | 75
-rw-r--r--  tensorflow/contrib/cmake/external/png.cmake | 38
-rw-r--r--  tensorflow/contrib/cmake/external/re2.cmake | 46
-rw-r--r--  tensorflow/contrib/cmake/install.cmake | 1
-rw-r--r--  tensorflow/contrib/cmake/patches/jpeg/CMakeLists.txt | 76
-rw-r--r--  tensorflow/contrib/cmake/tests.cmake | 1
-rw-r--r--  tensorflow/contrib/cmake/tf_cc_ops.cmake | 204
-rw-r--r--  tensorflow/contrib/cmake/tf_core_cpu.cmake | 53
-rw-r--r--  tensorflow/contrib/cmake/tf_core_direct_session.cmake | 35
-rw-r--r--  tensorflow/contrib/cmake/tf_core_framework.cmake | 165
-rw-r--r--  tensorflow/contrib/cmake/tf_core_kernels.cmake | 53
-rw-r--r--  tensorflow/contrib/cmake/tf_core_ops.cmake | 181
-rw-r--r--  tensorflow/contrib/cmake/tf_models.cmake | 95
-rw-r--r--  tensorflow/contrib/cmake/tf_stream_executor.cmake | 81
-rw-r--r--  tensorflow/contrib/cmake/tf_tutorials.cmake | 54
-rw-r--r--  tensorflow/contrib/layers/python/ops/loss_ops.py | 2
-rw-r--r--  tensorflow/contrib/linear_optimizer/kernels/sdca_ops.cc | 186
-rw-r--r--  tensorflow/contrib/linear_optimizer/ops/sdca_ops.cc | 36
-rw-r--r--  tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py | 47
-rw-r--r--  tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py | 138
-rw-r--r--  tensorflow/core/distributed_runtime/README.md | 2
-rw-r--r--  tensorflow/core/kernels/BUILD | 1
-rw-r--r--  tensorflow/core/kernels/bounds_check.h | 21
-rw-r--r--  tensorflow/core/kernels/conv_grad_ops.cc | 2
-rw-r--r--  tensorflow/core/kernels/conv_ops_gpu_3.cu.cc | 2
-rw-r--r--  tensorflow/core/kernels/diag_op.cc | 98
-rw-r--r--  tensorflow/core/kernels/matrix_solve_ls_op.cc | 2
-rw-r--r--  tensorflow/core/kernels/reduction_ops_gpu.cu.cc | 1
-rw-r--r--  tensorflow/core/kernels/reduction_ops_max.cc | 1
-rw-r--r--  tensorflow/core/kernels/reduction_ops_min.cc | 1
-rw-r--r--  tensorflow/core/kernels/reduction_ops_prod.cc | 1
-rw-r--r--  tensorflow/core/kernels/reduction_ops_sum.cc | 1
-rw-r--r--  tensorflow/core/kernels/resize_nearest_neighbor_op.cc | 99
-rw-r--r--  tensorflow/core/kernels/resize_nearest_neighbor_op_benchmark_test.cc | 52
-rw-r--r--  tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.cu.cc | 86
-rw-r--r--  tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h | 37
-rw-r--r--  tensorflow/core/kernels/sparse_matmul_op.cc | 4
-rw-r--r--  tensorflow/core/kernels/tensor_array_ops.cc | 2
-rw-r--r--  tensorflow/core/kernels/transpose_functor.h | 2
-rw-r--r--  tensorflow/core/ops/array_ops.cc | 32
-rw-r--r--  tensorflow/core/ops/compat/ops_history.v0.pbtxt | 23
-rw-r--r--  tensorflow/core/ops/ops.pbtxt | 27
-rw-r--r--  tensorflow/core/public/version.h | 2
-rw-r--r--  tensorflow/core/util/test_log.proto | 37
-rw-r--r--  tensorflow/examples/how_tos/reading_data/convert_to_records.py | 1
-rw-r--r--  tensorflow/examples/image_retraining/retrain.py | 17
-rw-r--r--  tensorflow/examples/tutorials/mnist/input_data.py | 4
-rw-r--r--  tensorflow/examples/tutorials/mnist/mnist.py | 2
-rw-r--r--  tensorflow/examples/tutorials/mnist/mnist_with_summaries.py | 12
-rw-r--r--  tensorflow/examples/tutorials/word2vec/word2vec_basic.py | 2
-rw-r--r--  tensorflow/examples/udacity/3_regularization.ipynb | 4
-rw-r--r--  tensorflow/examples/udacity/5_word2vec.ipynb | 2
-rw-r--r--  tensorflow/examples/udacity/Dockerfile | 1
-rw-r--r--  tensorflow/g3doc/api_docs/python/nn.md | 4
-rw-r--r--  tensorflow/g3doc/api_docs/python/train.md | 8
-rw-r--r--  tensorflow/g3doc/get_started/os_setup.md | 30
-rw-r--r--  tensorflow/g3doc/how_tos/image_retraining/index.md | 26
-rw-r--r--  tensorflow/g3doc/how_tos/summaries_and_tensorboard/index.md | 10
-rw-r--r--  tensorflow/g3doc/how_tos/tool_developers/index.md | 28
-rw-r--r--  tensorflow/g3doc/resources/dims_types.md | 2
-rw-r--r--  tensorflow/g3doc/resources/index.md | 5
-rw-r--r--  tensorflow/g3doc/tutorials/deep_cnn/index.md | 6
-rw-r--r--  tensorflow/g3doc/tutorials/mnist/download/index.md | 2
-rw-r--r--  tensorflow/g3doc/tutorials/recurrent/index.md | 4
-rw-r--r--  tensorflow/g3doc/tutorials/seq2seq/index.md | 6
-rw-r--r--  tensorflow/models/image/mnist/convolutional.py | 2
-rw-r--r--  tensorflow/models/rnn/ptb/ptb_word_lm.py | 2
-rw-r--r--  tensorflow/models/rnn/translate/data_utils.py | 2
-rw-r--r--  tensorflow/python/framework/importer.py | 4
-rw-r--r--  tensorflow/python/framework/ops.py | 2
-rw-r--r--  tensorflow/python/framework/ops_test.py | 16
-rw-r--r--  tensorflow/python/framework/tensor_util.py | 3
-rw-r--r--  tensorflow/python/kernel_tests/constant_op_test.py | 2
-rw-r--r--  tensorflow/python/kernel_tests/control_flow_ops_py_test.py | 2
-rw-r--r--  tensorflow/python/kernel_tests/depthtospace_op_test.py | 2
-rw-r--r--  tensorflow/python/kernel_tests/diag_op_test.py | 85
-rw-r--r--  tensorflow/python/kernel_tests/init_ops_test.py | 2
-rw-r--r--  tensorflow/python/kernel_tests/matmul_op_test.py | 2
-rw-r--r--  tensorflow/python/kernel_tests/reduction_ops_test.py | 44
-rw-r--r--  tensorflow/python/kernel_tests/rnn_test.py | 33
-rw-r--r--  tensorflow/python/kernel_tests/seq2seq_test.py | 99
-rw-r--r--  tensorflow/python/kernel_tests/trace_op_test.py | 71
-rw-r--r--  tensorflow/python/ops/array_ops.py | 33
-rw-r--r--  tensorflow/python/ops/image_ops.py | 2
-rw-r--r--  tensorflow/python/ops/image_ops_test.py | 76
-rw-r--r--  tensorflow/python/ops/math_ops.py | 35
-rw-r--r--  tensorflow/python/ops/nn_ops.py | 4
-rw-r--r--  tensorflow/python/ops/rnn.py | 10
-rw-r--r--  tensorflow/python/ops/seq2seq.py | 101
-rw-r--r--  tensorflow/python/platform/default/_gfile.py | 2
-rw-r--r--  tensorflow/python/training/learning_rate_decay.py | 6
-rw-r--r--  tensorflow/python/training/moving_averages_test.py | 2
-rw-r--r--  tensorflow/python/training/optimizer.py | 2
-rw-r--r--  tensorflow/python/training/saver.py | 4
-rw-r--r--  tensorflow/tensorboard/backend/server_test.py | 8
-rw-r--r--  tensorflow/tensorboard/components/tf-dashboard-common/tf-url-generator.html | 2
-rw-r--r--  tensorflow/tensorboard/components/tf-graph-common/lib/render.ts | 10
-rw-r--r--  tensorflow/tensorboard/components/tf-graph-common/lib/scene/annotation.ts | 2
-rw-r--r--  tensorflow/tensorboard/components/tf-graph-common/lib/scene/node.ts | 1
-rw-r--r--  tensorflow/tensorboard/components/tf-graph/tf-graph-icon.html | 4
-rw-r--r--  tensorflow/tensorboard/dist/tf-tensorboard.html | 1
-rw-r--r--  tensorflow/tensorboard/lib/js/requestManager/requestManager.ts | 2
-rw-r--r--  tensorflow/tensorboard/lib/js/requestManager/test/requestManagerTest.ts | 4
-rw-r--r--  tensorflow/tools/ci_build/Dockerfile.android | 7
-rw-r--r--  tensorflow/tools/ci_build/Dockerfile.cpu | 7
-rw-r--r--  tensorflow/tools/ci_build/Dockerfile.debian.jessie.cpu | 14
-rw-r--r--  tensorflow/tools/ci_build/Dockerfile.gpu | 9
-rw-r--r--  tensorflow/tools/ci_build/README.md | 2
-rwxr-xr-x  tensorflow/tools/ci_build/builds/configured | 2
-rwxr-xr-x  tensorflow/tools/ci_build/builds/docker_test.sh | 127
-rwxr-xr-x  tensorflow/tools/ci_build/builds/pip.sh | 286
-rwxr-xr-x  tensorflow/tools/ci_build/builds/print_build_info.sh | 2
-rwxr-xr-x  tensorflow/tools/ci_build/builds/test_installation.sh | 292
-rw-r--r--  tensorflow/tools/ci_build/builds/test_tutorials.sh | 46
-rwxr-xr-x  tensorflow/tools/ci_build/builds/with_the_same_user | 2
-rwxr-xr-x  tensorflow/tools/ci_build/ci_build.sh | 30
-rwxr-xr-x  tensorflow/tools/ci_build/ci_parameterized_build.sh | 94
-rwxr-xr-x  tensorflow/tools/ci_build/install/install_bazel.sh | 2
-rwxr-xr-x  tensorflow/tools/ci_build/install/install_bootstrap_deb_packages.sh (renamed from tensorflow/tools/ci_build/install/install_openjdk8_from_ppa.sh) | 6
-rwxr-xr-x  tensorflow/tools/ci_build/install/install_deb_packages.sh | 5
-rwxr-xr-x  tensorflow/tools/ci_build/update_version.sh | 134
-rw-r--r--  tensorflow/tools/docker/Dockerfile | 12
-rw-r--r--  tensorflow/tools/docker/Dockerfile.devel | 2
-rw-r--r--  tensorflow/tools/docker/Dockerfile.devel-gpu | 5
-rw-r--r--  tensorflow/tools/docker/Dockerfile.gpu | 14
-rw-r--r--  tensorflow/tools/docker/README.md | 2
-rwxr-xr-x  tensorflow/tools/docker/docker_run_gpu.sh | 2
-rw-r--r--  tensorflow/tools/pip_package/setup.py | 14
-rw-r--r--  tensorflow/tools/test/BUILD | 40
-rw-r--r--  tensorflow/tools/test/__init__.py | 20
-rw-r--r--  tensorflow/tools/test/gpu_info_lib.py | 184
-rw-r--r--  tensorflow/tools/test/system_info.py | 33
-rw-r--r--  tensorflow/tools/test/system_info_lib.py | 149
-rwxr-xr-x  third_party/gpus/cuda/cuda_config.sh | 4
-rwxr-xr-x  util/python/python_config.sh | 6
143 files changed, 4130 insertions(+), 800 deletions(-)
diff --git a/ISSUE_TEMPLATE.md b/ISSUE_TEMPLATE.md
index 37a2a6ddf7..3134c5eaf3 100644
--- a/ISSUE_TEMPLATE.md
+++ b/ISSUE_TEMPLATE.md
@@ -1,5 +1,11 @@
-For bugs/issues, please fill in the following. The more information you
-provide, the more likely we can help you.
+GitHub issues are for bugs / installation problems / feature requests.
+For general support from the community, see [StackOverflow](https://stackoverflow.com/questions/tagged/tensorflow).
+To make bugs and feature requests easier to find and organize, we close issues that are deemed
+out of scope for GitHub Issues and point people to StackOverflow.
+
+For bugs or installation issues, please provide the following information.
+The more information you provide, the more easily we will be able to offer
+help and advice.
### Environment info
Operating System:
diff --git a/README.md b/README.md
index 74891d0e34..dd407a3184 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
| **`Linux CPU`** | **`Linux GPU PIP`** | **`Mac OS CPU`** | **`Android`** |
|-------------------|----------------------|------------------|----------------|
-| [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master)](http://ci.tensorflow.org/job/tensorflow-master) | [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-gpu_pip)](http://ci.tensorflow.org/job/tensorflow-master-gpu_pip) | [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-mac)](http://ci.tensorflow.org/job/tensorflow-master-mac) | [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-android)](http://ci.tensorflow.org/job/tensorflow-master-android) |
+| [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-cpu)](http://ci.tensorflow.org/job/tensorflow-master-cpu) | [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-gpu_pip)](http://ci.tensorflow.org/job/tensorflow-master-gpu_pip) | [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-mac)](http://ci.tensorflow.org/job/tensorflow-master-mac) | [![Build Status](http://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-android)](http://ci.tensorflow.org/job/tensorflow-master-android) |
**TensorFlow** is an open source software library for numerical computation using
data flow graphs. Nodes in the graph represent mathematical operations, while
@@ -27,7 +27,14 @@ tracking requests and bugs, but please see
and discussion.**
## Installation
-*See [Download and Setup](tensorflow/g3doc/get_started/os_setup.md).*
+*See [Download and Setup](tensorflow/g3doc/get_started/os_setup.md) for instructions on how to install our release binaries or how to build from source.*
+
+People who are a little bit adventurous can also try our nightly binaries:
+
+* Linux CPU only: [Python 2](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-cp27-none-linux_x86_64.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/)) / [Python 3](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py3-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/))
+* Linux GPU: [Python 2](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py2-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-slave/)) / [Python 3](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py3-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nigntly-matrix-linux-gpu/TF_BUILD_CONTAINER_TYPE=GPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-slave/))
+* Mac CPU only: [Python 2](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py2-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/)) / [Python 3](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-py3-none-any.whl) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/))
+* [Android](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-android/TF_BUILD_CONTAINER_TYPE=ANDROID,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=NO_PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=android-slave/lastSuccessfulBuild/artifact/bazel-out/local_linux/bin/tensorflow/examples/android/tensorflow_demo.apk) ([build history](http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-android/TF_BUILD_CONTAINER_TYPE=ANDROID,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=NO_PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=android-slave/))
#### *Try your first TensorFlow program*
```python
@@ -46,6 +53,9 @@ Hello, TensorFlow!
```
##For more information
+
* [TensorFlow website](http://tensorflow.org)
* [TensorFlow whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf)
-* [Tensorflow MOOC on Udacity] (https://www.udacity.com/course/deep-learning--ud730)
+* [TensorFlow MOOC on Udacity](https://www.udacity.com/course/deep-learning--ud730)
+
+The TensorFlow community has created amazing things with TensorFlow; please see the [resources section of tensorflow.org](https://www.tensorflow.org/versions/master/resources#community) for an incomplete list.
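For readers following along, the nightly wheels listed in the README hunk above install like any other pip package. An illustrative command using the Linux CPU / Python 2 artifact from that list (URL copied verbatim; substitute the wheel that matches your platform and Python version):

    # Illustrative only: install a nightly CPU wheel directly from the CI server.
    pip install --upgrade http://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-0.7.1-cp27-none-linux_x86_64.whl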
diff --git a/RELEASE.md b/RELEASE.md
index 350e36df42..535b5153ba 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,3 +1,20 @@
+# Release 0.7.1
+
+## Bug Fixes and Other Changes
+
+* Added gfile.Open and gfile.Copy, used by input_data.py.
+* Fixed a Saver bug when MakeDirs tried to create an empty directory.
+* GPU pip wheels are built with CUDA 7.5 and cuDNN v4, so these versions
+ are required for the binary releases. Lower versions of CUDA/cuDNN can
+ be supported by installing from source and setting the options
+ during ./configure.
+* Fix dataset encoding example for Python3 (@danijar)
+* Fix PIP installation by not packaging protobuf as part of wheel,
+ require protobuf 3.0.0b2.
+* Fix Mac pip installation of numpy by requiring pip >= 1.10.1.
+* Improvements and fixes to Docker image.
+
+
# Release 0.7.0
## Major Features and Improvements
diff --git a/configure b/configure
index 3f0a067eae..2d7ec77aec 100755
--- a/configure
+++ b/configure
@@ -99,12 +99,18 @@ while true; do
else
TF_CUDNN_EXT=".$TF_CUDNN_VERSION"
fi
- if [ -e "$CUDNN_INSTALL_PATH/libcudnn.so${CUDNNEXT}" -o -e "$CUDNN_INSTALL_PATH/lib64/libcudnn.so${TF_CUDNN_EXT}" ]; then
+ if [ -e "$CUDNN_INSTALL_PATH/libcudnn.so${TF_CUDNN_EXT}" -o -e "$CUDNN_INSTALL_PATH/lib64/libcudnn.so${TF_CUDNN_EXT}" ]; then
+ break
+ fi
+ CUDNN_PATH_FROM_LDCONFIG="$(ldconfig -p | sed -n 's/.*libcudnn.so .* => \(.*\)/\1/p')"
+ if [ -e "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}" ]; then
+ CUDNN_INSTALL_PATH="$(dirname ${CUDNN_PATH_FROM_LDCONFIG})"
break
fi
echo "Invalid path to cuDNN ${TF_CUDNN_VERSION} toolkit. Neither of the following two files can be found:"
echo "$CUDNN_INSTALL_PATH/lib64/libcudnn.so${TF_CUDNN_EXT}"
echo "$CUDNN_INSTALL_PATH/libcudnn.so${TF_CUDNN_EXT}"
+ echo "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}"
if [ -z "$fromuser" ]; then
exit 1
fi
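The configure change above adds a fallback that asks the dynamic linker cache where libcudnn.so lives when the user-supplied path does not contain it. A minimal standalone sketch of that lookup, assuming GNU ldconfig is available and using a hypothetical cuDNN version suffix:

    # Sketch of the ldconfig fallback added above (the TF_CUDNN_EXT value is hypothetical;
    # configure derives it from TF_CUDNN_VERSION).
    TF_CUDNN_EXT=".4"
    CUDNN_PATH_FROM_LDCONFIG="$(ldconfig -p | sed -n 's/.*libcudnn.so .* => \(.*\)/\1/p')"
    if [ -e "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}" ]; then
      # Use the directory reported by the linker cache as the cuDNN install path.
      CUDNN_INSTALL_PATH="$(dirname "${CUDNN_PATH_FROM_LDCONFIG}")"
      echo "Found cuDNN via ldconfig: ${CUDNN_INSTALL_PATH}"
    fi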
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index 226c28994f..a3e6aadc22 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -54,6 +54,15 @@ cc_binary(
],
)
+cc_binary(
+ name = "libtensorflow_cc.so",
+ linkshared = 1,
+ deps = [
+ "//tensorflow/cc:cc_ops",
+ "//tensorflow/core:tensorflow",
+ ],
+)
+
py_library(
name = "tensorflow_py",
srcs = ["__init__.py"],
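The new cc_binary target above bundles the C++ client ops and the core runtime into a single shared object. For illustration only (not part of this change), it can be built with a plain Bazel invocation once the tree has been configured:

    # Illustrative: build the new shared library target after running ./configure.
    bazel build -c opt //tensorflow:libtensorflow_cc.so
    # The library is written to bazel-bin/tensorflow/libtensorflow_cc.so.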
diff --git a/tensorflow/contrib/cmake/CMakeLists.txt b/tensorflow/contrib/cmake/CMakeLists.txt
new file mode 100644
index 0000000000..2fcec53b4d
--- /dev/null
+++ b/tensorflow/contrib/cmake/CMakeLists.txt
@@ -0,0 +1,62 @@
+# Minimum CMake required
+cmake_minimum_required(VERSION 2.8)
+
+# Project
+project(tensorflow C CXX)
+
+# Actual source is the ../../.. directory
+get_filename_component(tf_contrib_source_dir ${tensorflow_SOURCE_DIR} PATH)
+get_filename_component(tf_tf_source_dir ${tf_contrib_source_dir} PATH)
+get_filename_component(tensorflow_source_dir ${tf_tf_source_dir} PATH)
+
+# [CLEANUP] Not sure if this is needed (copied from Protobuf)
+# CMake policies
+cmake_policy(SET CMP0022 NEW)
+
+# Options
+option(tensorflow_VERBOSE "Enable for verbose output" OFF)
+option(tensorflow_BUILD_TESTS "Build tests" ON)
+
+#Threads: defines CMAKE_THREAD_LIBS_INIT and adds -pthread compile option for
+# targets that link ${CMAKE_THREAD_LIBS_INIT}.
+find_package (Threads)
+
+# [CLEANUP] Remove when done
+# For debugging
+function(SHOW_VARIABLES)
+ get_cmake_property(_variableNames VARIABLES)
+ foreach (_variableName ${_variableNames})
+ message(STATUS "${_variableName}=${${_variableName}}")
+ endforeach()
+endfunction()
+
+# External dependencies
+set(CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/external)
+
+# Location where external projects will be downloaded
+set (DOWNLOAD_LOCATION "${CMAKE_CURRENT_BINARY_DIR}/downloads"
+ CACHE PATH "Location where external projects will be downloaded.")
+mark_as_advanced(DOWNLOAD_LOCATION)
+
+# External dependencies
+include(png)
+include(jpeg)
+include(re2)
+include(eigen)
+
+# Let's get to work!
+include(tf_core_framework.cmake)
+include(tf_stream_executor.cmake)
+include(tf_core_cpu.cmake)
+include(tf_models.cmake)
+include(tf_core_ops.cmake)
+include(tf_core_direct_session.cmake)
+include(tf_core_kernels.cmake)
+include(tf_cc_ops.cmake)
+include(tf_tutorials.cmake)
+
+if (tensorflow_BUILD_TESTS)
+ include(tests.cmake)
+endif (tensorflow_BUILD_TESTS)
+
+include(install.cmake)
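The top-level CMakeLists.txt above pulls in the external dependencies and the tf_*.cmake fragments added later in this change. As a hedged illustration of driving it on a Unix-like host with the default Makefile generator (paths hypothetical; the README added below targets Windows, so treat this as a sketch rather than a supported recipe):

    # Hypothetical out-of-source configure-and-build run of the new CMake project.
    mkdir -p /tmp/tf-cmake-build && cd /tmp/tf-cmake-build
    cmake /path/to/tensorflow/tensorflow/contrib/cmake -DCMAKE_BUILD_TYPE=Release
    make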
diff --git a/tensorflow/contrib/cmake/README.md b/tensorflow/contrib/cmake/README.md
new file mode 100644
index 0000000000..18d535faea
--- /dev/null
+++ b/tensorflow/contrib/cmake/README.md
@@ -0,0 +1,257 @@
+This directory contains *CMake* files that can be used to build the TensorFlow
+core library.
+
+You need to have [CMake](http://www.cmake.org) and [Git](http://git-scm.com)
+installed on your computer before proceeding.
+
+Most of the instructions are given for the *Command Prompt*, but the same
+actions can be performed using the appropriate GUI tools.
+
+Environment Setup
+=================
+
+Open the appropriate *Command Prompt* from the *Start* menu.
+
+For example *VS2013 x64 Native Tools Command Prompt*:
+
+ C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64>
+
+Change to your working directory:
+
+ C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64>cd C:\Path\to
+ C:\Path\to>
+
+Where *C:\Path\to* is the path to your real working directory.
+
+Create a folder where the TensorFlow headers, libraries, and binaries will be
+installed after they are built:
+
+ C:\Path\to>mkdir install
+
+If the *cmake* command is not available from the *Command Prompt*, add it to
+the system *PATH* variable:
+
+ C:\Path\to>set PATH=%PATH%;C:\Program Files (x86)\CMake\bin
+
+If the *git* command is not available from the *Command Prompt*, add it to
+the system *PATH* variable:
+
+ C:\Path\to>set PATH=%PATH%;C:\Program Files\Git\cmd
+
+Good. Now you are ready to continue.
+
+Getting Sources
+===============
+
+You can get the latest stable source packages from the
+[releases](https://github.com/tensorflow/tensorflow/releases) page.
+Or you can type:
+
+ C:\Path\to> git clone --recursive -b [release_tag] https://github.com/tensorflow/tensorflow.git
+
+Where *[release_tag]* is a git tag like *v0.6.0* or a branch name like *master*
+if you want to get the latest code.
+
+Go to the project folder:
+
+ C:\Path\to>cd tensorflow
+ C:\Path\to\tensorflow>
+
+Now go to the *tensorflow\contrib\cmake* folder in TensorFlow's contrib sources:
+
+ C:\Path\to\tensorflow>cd tensorflow\contrib\cmake
+ C:\Path\to\tensorflow\tensorflow\contrib\cmake>
+
+Good. Now you are ready to configure *CMake*.
+
+CMake Configuration
+===================
+
+*CMake* supports a lot of different
+[generators](http://www.cmake.org/cmake/help/latest/manual/cmake-generators.7.html)
+for various native build systems. We are only interested in
+[Makefile](http://www.cmake.org/cmake/help/latest/manual/cmake-generators.7.html#makefile-generators)
+and
+[Visual Studio](http://www.cmake.org/cmake/help/latest/manual/cmake-generators.7.html#visual-studio-generators)
+generators.
+
+We will use shadow building to separate the temporary files from the TensorFlow
+source code.
+
+Create a temporary *build* folder and change your working directory to it:
+
+ C:\Path\to\tensorflow\tensorflow\contrib\cmake>mkdir build & cd build
+ C:\Path\to\tensorflow\tensorflow\contrib\cmake\build>
+
+The *Makefile* generator can build the project in only one configuration, so
+you need to create a separate build folder for each configuration.
+
+To start using a *Release* configuration:
+
+ [...]\contrib\cmake\build>mkdir release & cd release
+ [...]\contrib\cmake\build\release>cmake -G "NMake Makefiles" ^
+ -DCMAKE_BUILD_TYPE=Release ^
+ -DCMAKE_INSTALL_PREFIX=../../../../../../install ^
+ ../..
+
+This will generate an *nmake* *Makefile* in the current directory.
+
+To use *Debug* configuration:
+
+ [...]\contrib\cmake\build>mkdir debug & cd debug
+ [...]\contrib\cmake\build\debug>cmake -G "NMake Makefiles" ^
+ -DCMAKE_BUILD_TYPE=Debug ^
+ -DCMAKE_INSTALL_PREFIX=../../../../../../install ^
+ ../..
+
+This will generate an *nmake* *Makefile* in the current directory.
+
+To create *Visual Studio* solution file:
+
+ [...]\contrib\cmake\build>mkdir solution & cd solution
+ [...]\contrib\cmake\build\solution>cmake -G "Visual Studio 12 2013 Win64" ^
+ -DCMAKE_INSTALL_PREFIX=../../../../../../install ^
+ ../..
+
+This will generate the *Visual Studio* solution file *tensorflow.sln* in the
+current directory.
+
+If the *gmock* directory does not exist, or you do not want to build the
+TensorFlow unit tests, add the *cmake* command argument
+`-Dtensorflow_BUILD_TESTS=OFF` to disable testing.
+
+Compiling
+=========
+
+To compile TensorFlow:
+
+ [...]\contrib\cmake\build\release>nmake
+
+or
+
+ [...]\contrib\cmake\build\debug>nmake
+
+And wait for the compilation to finish.
+
+If you prefer to use the IDE:
+
+ * Open the generated tensorflow.sln file in Microsoft Visual Studio.
+ * Choose "Debug" or "Release" configuration as desired.
+ * From the Build menu, choose "Build Solution".
+
+And wait for the compilation to finish.
+
+Testing
+=======
+
+To run the unit tests:
+
+ [...]\contrib\cmake\build\release>nmake check
+
+or
+
+ [...]\contrib\cmake\build\debug>nmake check
+
+You can also build the *check* project from the Visual Studio solution.
+Yes, it may sound strange, but it works.
+
+You should see output similar to:
+
+ Running main() from gmock_main.cc
+ [==========] Running 1546 tests from 165 test cases.
+
+ ...
+
+ [==========] 1546 tests from 165 test cases ran. (2529 ms total)
+ [ PASSED ] 1546 tests.
+
+To run specific tests:
+
+ C:\Path\to\tensorflow>tensorflow\contrib\cmake\build\release\tests.exe ^
+ --gtest_filter=AnyTest*
+ Running main() from gmock_main.cc
+ Note: Google Test filter = AnyTest*
+ [==========] Running 3 tests from 1 test case.
+ [----------] Global test environment set-up.
+ [----------] 3 tests from AnyTest
+ [ RUN ] AnyTest.TestPackAndUnpack
+ [ OK ] AnyTest.TestPackAndUnpack (0 ms)
+ [ RUN ] AnyTest.TestPackAndUnpackAny
+ [ OK ] AnyTest.TestPackAndUnpackAny (0 ms)
+ [ RUN ] AnyTest.TestIs
+ [ OK ] AnyTest.TestIs (0 ms)
+ [----------] 3 tests from AnyTest (1 ms total)
+
+ [----------] Global test environment tear-down
+ [==========] 3 tests from 1 test case ran. (2 ms total)
+ [ PASSED ] 3 tests.
+
+Note that the tests must be run from the source folder.
+
+If all tests pass, you can safely continue.
+
+Installing
+==========
+
+To install TensorFlow to the specified *install* folder:
+
+ [...]\contrib\cmake\build\release>nmake install
+
+or
+
+ [...]\contrib\cmake\build\debug>nmake install
+
+You can also build the *INSTALL* project from the Visual Studio solution.
+This one sounds less strange, and it also works.
+
+This will create the following folders under the *install* location:
+ * bin - contains the TensorFlow binaries;
+ * include - contains the C++ headers and TensorFlow *.proto files;
+ * lib - contains the linking libraries and *CMake* configuration files for
+ the *tensorflow* package.
+
+Now, if needed, you can:
+ * Copy the contents of the include directory to wherever you want to put
+ headers.
+ * Copy binaries wherever you put build tools (probably somewhere in your
+ PATH).
+ * Copy linking libraries libtensorflow[d].lib wherever you put libraries.
+
+To avoid conflicts between the MSVC debug and release runtime libraries, when
+compiling a debug build of your application you may need to link against the
+debug build, libtensorflowd.lib (note the "d" suffix). Similarly, release builds
+should link against the release libtensorflow.lib library.
+
+DLLs vs. static linking
+=======================
+
+Static linking is now the default for the TensorFlow libraries. Due to
+issues with Win32's use of a separate heap for each DLL, as well as binary
+compatibility issues between different versions of MSVC's STL library, it is
+recommended that you use static linkage only. However, it is possible to
+build libtensorflow as a DLL if you really want to. To do this:
+
+ * Add an additional flag `-Dtensorflow_BUILD_SHARED_LIBS=ON` when invoking
+ cmake
+ * Follow the same steps as described in the above section.
+ * When compiling your project, make sure to `#define TENSORFLOW_USE_DLLS`.
+
+When distributing your software to end users, we strongly recommend that you
+do NOT install libtensorflow.dll to any shared location.
+Instead, keep these libraries next to your binaries, in your application's
+own install directory. C++ makes it very difficult to maintain binary
+compatibility between releases, so it is likely that future versions of these
+libraries will *not* be usable as drop-in replacements.
+
+If your project is itself a DLL intended for use by third-party software, we
+recommend that you do NOT expose TensorFlow objects in your library's
+public interface, and that you statically link them into your library.
+
+Notes on Compiler Warnings
+==========================
+
+The following warnings have been disabled while building the TensorFlow
+libraries and binaries. You may have to disable some of them in your own
+project as well, or live with them.
+
+* [TODO]
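Taken together, the options documented in this README compose into one configure-build-install sequence. A condensed sketch using the same paths and generator as above, with the two optional flags mentioned in the text (`-Dtensorflow_BUILD_TESTS=OFF` and `-Dtensorflow_BUILD_SHARED_LIBS=ON`) shown purely for illustration:

    [...]\contrib\cmake\build>mkdir release & cd release
    [...]\contrib\cmake\build\release>cmake -G "NMake Makefiles" ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DCMAKE_INSTALL_PREFIX=../../../../../../install ^
    -Dtensorflow_BUILD_TESTS=OFF ^
    -Dtensorflow_BUILD_SHARED_LIBS=ON ^
    ../..
    [...]\contrib\cmake\build\release>nmake
    [...]\contrib\cmake\build\release>nmake install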
diff --git a/tensorflow/contrib/cmake/external/eigen.cmake b/tensorflow/contrib/cmake/external/eigen.cmake
new file mode 100644
index 0000000000..3dd29ca169
--- /dev/null
+++ b/tensorflow/contrib/cmake/external/eigen.cmake
@@ -0,0 +1,34 @@
+#new_http_archive(
+# name = "eigen_archive",
+# url = "https://bitbucket.org/eigen/eigen/get/...",
+# sha256 = "...",
+# build_file = "eigen.BUILD",
+#)
+
+include (ExternalProject)
+
+set(eigen_archive_hash "ed4c9730b545")
+
+set(eigen_INCLUDE_DIRS
+ ${CMAKE_CURRENT_BINARY_DIR}
+ ${CMAKE_CURRENT_BINARY_DIR}/external/eigen_archive
+ ${CMAKE_CURRENT_BINARY_DIR}/external/eigen_archive/eigen-eigen-${eigen_archive_hash}
+ ${tensorflow_source_dir}/third_party/eigen3
+)
+set(eigen_URL https://bitbucket.org/eigen/eigen/get/${eigen_archive_hash}.tar.gz)
+set(eigen_HASH SHA256=3d9eceb8a2add299e37b1f32759157cc2574f7684936c151552a5ae3f33aebd5)
+set(eigen_BUILD ${CMAKE_CURRENT_BINARY_DIR}/eigen/src/eigen)
+set(eigen_INSTALL ${CMAKE_CURRENT_BINARY_DIR}/eigen/install)
+
+ExternalProject_Add(eigen
+ PREFIX eigen
+ URL ${eigen_URL}
+ URL_HASH ${eigen_HASH}
+ DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
+ INSTALL_DIR "${eigen_INSTALL}"
+ CMAKE_CACHE_ARGS
+ -DCMAKE_BUILD_TYPE:STRING=Release
+ -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_INSTALL_PREFIX:STRING=${eigen_INSTALL}
+ -DINCLUDE_INSTALL_DIR:STRING=${CMAKE_CURRENT_BINARY_DIR}/external/eigen_archive/eigen-eigen-${eigen_archive_hash}
+)
diff --git a/tensorflow/contrib/cmake/external/jpeg.cmake b/tensorflow/contrib/cmake/external/jpeg.cmake
new file mode 100644
index 0000000000..4b6b648950
--- /dev/null
+++ b/tensorflow/contrib/cmake/external/jpeg.cmake
@@ -0,0 +1,75 @@
+include (ExternalProject)
+
+set(jpeg_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/jpeg_archive)
+set(jpeg_URL http://www.ijg.org/files/jpegsrc.v9a.tar.gz)
+set(jpeg_HASH SHA256=3a753ea48d917945dd54a2d97de388aa06ca2eb1066cbfdc6652036349fe05a7)
+set(jpeg_BUILD ${CMAKE_BINARY_DIR}/jpeg/src/jpeg)
+set(jpeg_INSTALL ${CMAKE_BINARY_DIR}/jpeg/install)
+set(jpeg_STATIC_LIBRARIES ${jpeg_INSTALL}/lib/libjpeg.a)
+
+set(jpeg_HEADERS
+ "${jpeg_INSTALL}/include/jconfig.h"
+ "${jpeg_INSTALL}/include/jerror.h"
+ "${jpeg_INSTALL}/include/jmorecfg.h"
+ "${jpeg_INSTALL}/include/jpeglib.h"
+ "${jpeg_BUILD}/cderror.h"
+ "${jpeg_BUILD}/cdjpeg.h"
+ "${jpeg_BUILD}/jdct.h"
+ "${jpeg_BUILD}/jinclude.h"
+ "${jpeg_BUILD}/jmemsys.h"
+ "${jpeg_BUILD}/jpegint.h"
+ "${jpeg_BUILD}/jversion.h"
+ "${jpeg_BUILD}/transupp.h"
+)
+
+if (WIN32)
+ ExternalProject_Add(jpeg
+ PREFIX jpeg
+ URL ${jpeg_URL}
+ URL_HASH ${jpeg_HASH}
+ PATCH_COMMAND ${CMAKE_COMMAND} -E copy ${CMAKE_SOURCE_DIR}/patches/jpeg/CMakeLists.txt ${jpeg_BUILD}
+ INSTALL_DIR ${jpeg_INSTALL}
+ DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
+ CMAKE_CACHE_ARGS
+ -DCMAKE_BUILD_TYPE:STRING=Release
+ -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_INSTALL_PREFIX:STRING=${jpeg_INSTALL}
+ )
+
+ ExternalProject_Add_Step(jpeg copy_jconfig
+ COMMAND ${CMAKE_COMMAND} -E copy
+ ${jpeg_BUILD}/jconfig.vc ${jpeg_BUILD}/jconfig.h
+ DEPENDEES patch
+ DEPENDERS build
+ )
+
+else()
+
+ ExternalProject_Add(jpeg
+ PREFIX jpeg
+ URL ${jpeg_URL}
+ URL_HASH ${jpeg_HASH}
+ INSTALL_DIR ${jpeg_INSTALL}
+ DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
+ BUILD_COMMAND $(MAKE)
+ INSTALL_COMMAND $(MAKE) install
+ CONFIGURE_COMMAND
+ ${jpeg_BUILD}/configure
+ --prefix=${jpeg_INSTALL}
+ --enable-shared=yes
+ )
+
+endif()
+
+# put jpeg includes in the directory where they are expected
+add_custom_target(jpeg_create_destination_dir
+ COMMAND ${CMAKE_COMMAND} -E make_directory ${jpeg_INCLUDE_DIR}/jpeg-9a
+ DEPENDS jpeg)
+
+add_custom_target(jpeg_copy_headers_to_destination
+ DEPENDS jpeg_create_destination_dir)
+
+foreach(header_file ${jpeg_HEADERS})
+ add_custom_command(TARGET jpeg_copy_headers_to_destination PRE_BUILD
+ COMMAND ${CMAKE_COMMAND} -E copy ${header_file} ${jpeg_INCLUDE_DIR}/jpeg-9a)
+endforeach()
diff --git a/tensorflow/contrib/cmake/external/png.cmake b/tensorflow/contrib/cmake/external/png.cmake
new file mode 100644
index 0000000000..ca3633430d
--- /dev/null
+++ b/tensorflow/contrib/cmake/external/png.cmake
@@ -0,0 +1,38 @@
+include (ExternalProject)
+
+set(png_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/png_archive)
+set(png_URL https://storage.googleapis.com/libpng-public-archive/libpng-1.2.53.tar.gz)
+set(png_HASH SHA256=e05c9056d7f323088fd7824d8c6acc03a4a758c4b4916715924edc5dd3223a72)
+set(png_BUILD ${CMAKE_BINARY_DIR}/png/src/png)
+set(png_INSTALL ${CMAKE_BINARY_DIR}/png/install)
+set(png_STATIC_LIBRARIES ${CMAKE_BINARY_DIR}/png/install/lib/libpng12.a)
+
+set(png_HEADERS
+ "${png_INSTALL}/include/libpng12/png.h"
+ "${png_INSTALL}/include/libpng12/pngconf.h"
+)
+
+ExternalProject_Add(png
+ PREFIX png
+ URL ${png_URL}
+ URL_HASH ${png_HASH}
+ INSTALL_DIR ${png_INSTALL}
+ DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
+ CMAKE_CACHE_ARGS
+ -DCMAKE_BUILD_TYPE:STRING=Release
+ -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_INSTALL_PREFIX:STRING=${png_INSTALL}
+)
+
+## put png includes in the directory where they are expected
+add_custom_target(png_create_destination_dir
+ COMMAND ${CMAKE_COMMAND} -E make_directory ${png_INCLUDE_DIR}/libpng-1.2.53
+ DEPENDS png)
+
+add_custom_target(png_copy_headers_to_destination
+ DEPENDS png_create_destination_dir)
+
+foreach(header_file ${png_HEADERS})
+ add_custom_command(TARGET png_copy_headers_to_destination PRE_BUILD
+ COMMAND ${CMAKE_COMMAND} -E copy ${header_file} ${png_INCLUDE_DIR}/libpng-1.2.53)
+endforeach()
diff --git a/tensorflow/contrib/cmake/external/re2.cmake b/tensorflow/contrib/cmake/external/re2.cmake
new file mode 100644
index 0000000000..b96d90533e
--- /dev/null
+++ b/tensorflow/contrib/cmake/external/re2.cmake
@@ -0,0 +1,46 @@
+include (ExternalProject)
+
+set(re2_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/re2/re2)
+set(re2_EXTRA_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/re2/src)
+set(re2_URL https://github.com/google/re2.git)
+set(re2_TAG 791beff)
+set(re2_BUILD ${CMAKE_BINARY_DIR}/re2/src/re2)
+set(re2_LIBRARIES ${re2_BUILD}/obj/so/libre2.so)
+get_filename_component(re2_STATIC_LIBRARIES ${re2_BUILD}/libre2.a ABSOLUTE)
+set(re2_INCLUDES ${re2_BUILD})
+
+# We only need re2.h in external/re2/re2/re2.h
+# For the rest, we'll just add the build dir as an include dir.
+set(re2_HEADERS
+ "${re2_BUILD}/re2/re2.h"
+)
+
+ExternalProject_Add(re2
+ PREFIX re2
+ GIT_REPOSITORY ${re2_URL}
+ GIT_TAG ${re2_TAG}
+ DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
+ BUILD_IN_SOURCE 1
+ INSTALL_COMMAND ""
+ CMAKE_CACHE_ARGS
+ -DCMAKE_BUILD_TYPE:STRING=Release
+ -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+)
+
+## put re2 includes in the directory where they are expected
+add_custom_target(re2_create_destination_dir
+ COMMAND ${CMAKE_COMMAND} -E make_directory ${re2_INCLUDE_DIR}
+ DEPENDS re2)
+
+add_custom_target(re2_copy_headers_to_destination
+ DEPENDS re2_create_destination_dir)
+
+foreach(header_file ${re2_HEADERS})
+ add_custom_command(TARGET re2_copy_headers_to_destination PRE_BUILD
+ COMMAND ${CMAKE_COMMAND} -E copy ${header_file} ${re2_INCLUDE_DIR})
+endforeach()
+
+ADD_LIBRARY(re2_lib STATIC IMPORTED
+ DEPENDS re2)
+SET_TARGET_PROPERTIES(re2_lib PROPERTIES
+ IMPORTED_LOCATION ${re2_STATIC_LIBRARIES})
diff --git a/tensorflow/contrib/cmake/install.cmake b/tensorflow/contrib/cmake/install.cmake
new file mode 100644
index 0000000000..a3fe2bcf06
--- /dev/null
+++ b/tensorflow/contrib/cmake/install.cmake
@@ -0,0 +1 @@
+# [TODO] \ No newline at end of file
diff --git a/tensorflow/contrib/cmake/patches/jpeg/CMakeLists.txt b/tensorflow/contrib/cmake/patches/jpeg/CMakeLists.txt
new file mode 100644
index 0000000000..782076ef74
--- /dev/null
+++ b/tensorflow/contrib/cmake/patches/jpeg/CMakeLists.txt
@@ -0,0 +1,76 @@
+cmake_minimum_required(VERSION 2.8.3)
+
+project(libjpeg)
+
+set(LIBJPEG_SRCS
+ "jaricom.c"
+ "jcapimin.c"
+ "jcapistd.c"
+ "jcarith.c"
+ "jccoefct.c"
+ "jccolor.c"
+ "jcdctmgr.c"
+ "jchuff.c"
+ "jcinit.c"
+ "jcmainct.c"
+ "jcmarker.c"
+ "jcmaster.c"
+ "jcomapi.c"
+ "jcparam.c"
+ "jcprepct.c"
+ "jcsample.c"
+ "jctrans.c"
+ "jdapimin.c"
+ "jdapistd.c"
+ "jdarith.c"
+ "jdatadst.c"
+ "jdatasrc.c"
+ "jdcoefct.c"
+ "jdcolor.c"
+ "jddctmgr.c"
+ "jdhuff.c"
+ "jdinput.c"
+ "jdmainct.c"
+ "jdmarker.c"
+ "jdmaster.c"
+ "jdmerge.c"
+ "jdpostct.c"
+ "jdsample.c"
+ "jdtrans.c"
+ "jerror.c"
+ "jfdctflt.c"
+ "jfdctfst.c"
+ "jfdctint.c"
+ "jidctflt.c"
+ "jidctfst.c"
+ "jidctint.c"
+ "jmemmgr.c"
+ "jmemnobs.c"
+ "jquant1.c"
+ "jquant2.c"
+ "jutils.c"
+)
+set(LIBJPEG_INCLUDES
+ "jconfig.h"
+ "jdct.h"
+ "jerror.h"
+ "jinclude.h"
+ "jmemsys.h"
+ "jmorecfg.h"
+ "jpegint.h"
+ "jpeglib.h"
+ "jversion.h"
+)
+
+include_directories("${CMAKE_CURRENT_SOURCE_DIR}")
+
+add_library(libjpeg ${LIBJPEG_SRCS})
+
+install(TARGETS libjpeg
+ RUNTIME DESTINATION bin COMPONENT RuntimeLibraries
+ LIBRARY DESTINATION lib COMPONENT RuntimeLibraries
+ ARCHIVE DESTINATION lib COMPONENT Development)
+
+foreach(LIBJPEG_INCLUDE ${LIBJPEG_INCLUDES})
+ install(FILES ${LIBJPEG_INCLUDE} DESTINATION include COMPONENT Development)
+endforeach()
diff --git a/tensorflow/contrib/cmake/tests.cmake b/tensorflow/contrib/cmake/tests.cmake
new file mode 100644
index 0000000000..a3fe2bcf06
--- /dev/null
+++ b/tensorflow/contrib/cmake/tests.cmake
@@ -0,0 +1 @@
+# [TODO] \ No newline at end of file
diff --git a/tensorflow/contrib/cmake/tf_cc_ops.cmake b/tensorflow/contrib/cmake/tf_cc_ops.cmake
new file mode 100644
index 0000000000..8a9b2083ae
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_cc_ops.cmake
@@ -0,0 +1,204 @@
+########################################################
+# tf_cc_op_gen_main library
+########################################################
+set(tf_cc_op_gen_main_srcs
+ "${tensorflow_source_dir}/tensorflow/cc/ops/cc_op_gen.cc"
+ "${tensorflow_source_dir}/tensorflow/cc/ops/cc_op_gen_main.cc"
+ "${tensorflow_source_dir}/tensorflow/cc/ops/cc_op_gen.h"
+)
+
+add_library(tf_cc_op_gen_main OBJECT ${tf_cc_op_gen_main_srcs})
+
+add_dependencies(tf_cc_op_gen_main tf_core_framework)
+
+target_include_directories(tf_cc_op_gen_main PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+)
+
+#target_link_libraries(tf_cc_op_gen_main
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_protos_cc
+# tf_core_lib
+# tf_core_framework
+#)
+
+target_compile_options(tf_cc_op_gen_main PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_cc_op_gen_main PRIVATE
+ cxx_rvalue_references
+)
+
+########################################################
+# tf_gen_op_wrapper_cc executables
+########################################################
+
+#
+# # Run the op generator.
+# if name == "sendrecv_ops":
+# include_internal = "1"
+# else:
+# include_internal = "0"
+# native.genrule(
+# name=name + "_genrule",
+# outs=[out_ops_file + ".h", out_ops_file + ".cc"],
+# tools=[":" + tool],
+# cmd=("$(location :" + tool + ") $(location :" + out_ops_file + ".h) " +
+# "$(location :" + out_ops_file + ".cc) " + include_internal))
+
+
+
+#def tf_gen_op_wrappers_cc(name,
+# op_lib_names=[],
+# other_srcs=[],
+# other_hdrs=[],
+# pkg=""):
+# subsrcs = other_srcs
+# subhdrs = other_hdrs
+# for n in op_lib_names:
+# tf_gen_op_wrapper_cc(n, "ops/" + n, pkg=pkg)
+# subsrcs += ["ops/" + n + ".cc"]
+# subhdrs += ["ops/" + n + ".h"]
+#
+# native.cc_library(name=name,
+# srcs=subsrcs,
+# hdrs=subhdrs,
+# deps=["//tensorflow/core:core_cpu"],
+# copts=tf_copts(),
+# alwayslink=1,)
+
+# create directory for ops generated files
+set(cc_ops_target_dir ${CMAKE_CURRENT_BINARY_DIR}/tensorflow/cc/ops)
+
+add_custom_target(create_cc_ops_header_dir
+ COMMAND ${CMAKE_COMMAND} -E make_directory ${cc_ops_target_dir}
+)
+
+set(tf_cc_ops_generated_files)
+
+set(tf_cc_op_lib_names
+ ${tf_op_lib_names}
+ "user_ops"
+)
+foreach(tf_cc_op_lib_name ${tf_cc_op_lib_names})
+ #tf_gen_op_wrapper_cc(name, out_ops_file, pkg=""):
+ # # Construct an op generator binary for these ops.
+ # tool = out_ops_file + "_gen_cc" #example ops/array_ops_gen_cc
+ # native.cc_binary(
+ # name = tool,
+ # copts = tf_copts(),
+ # linkopts = ["-lm"],
+ # linkstatic = 1, # Faster to link this one-time-use binary dynamically
+ # deps = (["//tensorflow/cc:cc_op_gen_main",
+ # pkg + ":" + name + "_op_lib"])
+ # )
+
+ # Using <TARGET_OBJECTS:...> to work around an issue where no ops were
+ # registered (static initializers dropped by the linker because the ops
+ # are not used explicitly in the *_gen_cc executables).
+ add_executable(${tf_cc_op_lib_name}_gen_cc
+ $<TARGET_OBJECTS:tf_cc_op_gen_main>
+ $<TARGET_OBJECTS:tf_${tf_cc_op_lib_name}>
+ $<TARGET_OBJECTS:tf_core_lib>
+ $<TARGET_OBJECTS:tf_core_framework>
+ )
+
+ target_include_directories(${tf_cc_op_lib_name}_gen_cc PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+ )
+
+ find_package(ZLIB REQUIRED)
+
+ target_link_libraries(${tf_cc_op_lib_name}_gen_cc PRIVATE
+ ${CMAKE_THREAD_LIBS_INIT}
+ ${PROTOBUF_LIBRARIES}
+ tf_protos_cc
+ re2_lib
+ ${jpeg_STATIC_LIBRARIES}
+ ${png_STATIC_LIBRARIES}
+ ${ZLIB_LIBRARIES}
+ )
+
+ target_compile_options(${tf_cc_op_lib_name}_gen_cc PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+ -lm
+ )
+
+ # C++11
+ target_compile_features(${tf_cc_op_lib_name}_gen_cc PRIVATE
+ cxx_rvalue_references
+ )
+
+ set(cc_ops_include_internal 0)
+ if(${tf_cc_op_lib_name} STREQUAL "sendrecv_ops")
+ set(cc_ops_include_internal 1)
+ endif()
+
+ add_custom_command(
+ OUTPUT ${cc_ops_target_dir}/${tf_cc_op_lib_name}.h
+ ${cc_ops_target_dir}/${tf_cc_op_lib_name}.cc
+ COMMAND ${tf_cc_op_lib_name}_gen_cc ${cc_ops_target_dir}/${tf_cc_op_lib_name}.h ${cc_ops_target_dir}/${tf_cc_op_lib_name}.cc ${cc_ops_include_internal}
+ DEPENDS ${tf_cc_op_lib_name}_gen_cc create_cc_ops_header_dir
+ )
+
+ list(APPEND tf_cc_ops_generated_files ${cc_ops_target_dir}/${tf_cc_op_lib_name}.h)
+ list(APPEND tf_cc_ops_generated_files ${cc_ops_target_dir}/${tf_cc_op_lib_name}.cc)
+endforeach()
+
+
+########################################################
+# tf_cc_ops library
+########################################################
+add_library(tf_cc_ops OBJECT
+ ${tf_cc_ops_generated_files}
+ "${tensorflow_source_dir}/tensorflow/cc/ops/const_op.h"
+ "${tensorflow_source_dir}/tensorflow/cc/ops/const_op.cc"
+ "${tensorflow_source_dir}/tensorflow/cc/ops/standard_ops.h"
+)
+
+target_include_directories(tf_cc_ops PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+)
+
+#target_link_libraries(tf_cc_ops
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_protos_cc
+# tf_core_lib
+# tf_core_cpu
+# tf_models_word2vec_ops
+#)
+
+target_compile_options(tf_cc_ops PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_cc_ops PRIVATE
+ cxx_rvalue_references
+)
+
+
+#tf_gen_op_wrappers_cc(
+# name = "cc_ops",
+# op_lib_names = [
+# ...
+# ],
+# other_hdrs = [
+# "ops/const_op.h",
+# "ops/standard_ops.h",
+# ],
+# other_srcs = [
+# "ops/const_op.cc",
+# ] + glob(["ops/*_grad.cc"]),
+# pkg = "//tensorflow/core",
+#) \ No newline at end of file
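The add_custom_command in the file above drives each generated *_gen_cc binary with three arguments: the output header path, the output source path, and the include-internal flag (1 only for sendrecv_ops). Run by hand, the array_ops generator would be invoked roughly like this (purely illustrative):

    # Illustrative manual run mirroring the add_custom_command arguments above.
    # Usage: <gen binary> <output .h> <output .cc> <include_internal: 0 or 1>
    ./array_ops_gen_cc tensorflow/cc/ops/array_ops.h tensorflow/cc/ops/array_ops.cc 0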
diff --git a/tensorflow/contrib/cmake/tf_core_cpu.cmake b/tensorflow/contrib/cmake/tf_core_cpu.cmake
new file mode 100644
index 0000000000..374096e942
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_core_cpu.cmake
@@ -0,0 +1,53 @@
+########################################################
+# tf_core_cpu library
+########################################################
+file(GLOB_RECURSE tf_core_cpu_srcs
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/client/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/graph/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/graph/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/public/*.h"
+)
+
+file(GLOB_RECURSE tf_core_cpu_exclude_srcs
+ "${tensorflow_source_dir}/tensorflow/core/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/*main.cc"
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu_device_factory.cc"
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.cc"
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.h"
+)
+
+list(REMOVE_ITEM tf_core_cpu_srcs ${tf_core_cpu_exclude_srcs})
+
+add_library(tf_core_cpu OBJECT ${tf_core_cpu_srcs})
+
+target_include_directories(tf_core_cpu PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+ ${re2_INCLUDES}
+)
+
+add_dependencies(tf_core_cpu
+ tf_core_framework
+)
+#target_link_libraries(tf_core_cpu
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_core_framework
+# tf_core_lib
+# tf_protos_cc
+#)
+
+target_compile_options(tf_core_cpu PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_core_cpu PRIVATE
+ cxx_rvalue_references
+)
+
diff --git a/tensorflow/contrib/cmake/tf_core_direct_session.cmake b/tensorflow/contrib/cmake/tf_core_direct_session.cmake
new file mode 100644
index 0000000000..bafc7e1e63
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_core_direct_session.cmake
@@ -0,0 +1,35 @@
+########################################################
+# tf_core_direct_session library
+########################################################
+file(GLOB tf_core_direct_session_srcs
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.cc"
+ "${tensorflow_source_dir}/tensorflow/core/common_runtime/direct_session.h"
+)
+
+add_library(tf_core_direct_session OBJECT ${tf_core_direct_session_srcs})
+
+add_dependencies(tf_core_direct_session tf_core_cpu)
+
+target_include_directories(tf_core_direct_session PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+)
+
+#target_link_libraries(tf_core_direct_session
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_core_cpu
+# tf_core_framework
+# tf_core_lib
+# tf_protos_cc
+#)
+
+target_compile_options(tf_core_direct_session PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_core_direct_session PRIVATE
+ cxx_rvalue_references
+)
diff --git a/tensorflow/contrib/cmake/tf_core_framework.cmake b/tensorflow/contrib/cmake/tf_core_framework.cmake
new file mode 100644
index 0000000000..989a176fd0
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_core_framework.cmake
@@ -0,0 +1,165 @@
+########################################################
+# RELATIVE_PROTOBUF_GENERATE_CPP function
+########################################################
+# A variant of PROTOBUF_GENERATE_CPP that keeps the directory hierarchy.
+# ROOT_DIR must be absolute, and proto paths must be relative to ROOT_DIR.
+function(RELATIVE_PROTOBUF_GENERATE_CPP SRCS HDRS ROOT_DIR)
+ if(NOT ARGN)
+ message(SEND_ERROR "Error: RELATIVE_PROTOBUF_GENERATE_CPP() called without any proto files")
+ return()
+ endif()
+
+ set(${SRCS})
+ set(${HDRS})
+ foreach(FIL ${ARGN})
+ set(ABS_FIL ${ROOT_DIR}/${FIL})
+ get_filename_component(FIL_WE ${FIL} NAME_WE)
+ get_filename_component(FIL_DIR ${ABS_FIL} PATH)
+ file(RELATIVE_PATH REL_DIR ${ROOT_DIR} ${FIL_DIR})
+
+ list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc")
+ list(APPEND ${HDRS} "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h")
+
+ add_custom_command(
+ OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.cc"
+ "${CMAKE_CURRENT_BINARY_DIR}/${REL_DIR}/${FIL_WE}.pb.h"
+ COMMAND ${PROTOBUF_PROTOC_EXECUTABLE}
+ ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR} -I ${ROOT_DIR} ${ABS_FIL}
+ DEPENDS ${ABS_FIL} ${PROTOBUF_PROTOC_EXECUTABLE}
+ COMMENT "Running C++ protocol buffer compiler on ${FIL}"
+ VERBATIM )
+ endforeach()
+
+ set_source_files_properties(${${SRCS}} ${${HDRS}} PROPERTIES GENERATED TRUE)
+ set(${SRCS} ${${SRCS}} PARENT_SCOPE)
+ set(${HDRS} ${${HDRS}} PARENT_SCOPE)
+endfunction()
+
+
+########################################################
+# tf_protos_cc library
+########################################################
+
+# Build proto library
+include(FindProtobuf)
+find_package(Protobuf REQUIRED)
+include_directories(${PROTOBUF_INCLUDE_DIRS})
+include_directories(${CMAKE_CURRENT_BINARY_DIR})
+file(GLOB_RECURSE tf_protos_cc_srcs RELATIVE ${tensorflow_source_dir}
+ "${tensorflow_source_dir}/tensorflow/*.proto"
+)
+RELATIVE_PROTOBUF_GENERATE_CPP(PROTO_SRCS PROTO_HDRS
+ ${tensorflow_source_dir} ${tf_protos_cc_srcs}
+)
+
+add_library(tf_protos_cc ${PROTO_SRCS} ${PROTO_HDRS})
+target_include_directories(tf_protos_cc PUBLIC
+ ${CMAKE_CURRENT_BINARY_DIR}
+)
+target_link_libraries(tf_protos_cc PUBLIC
+ ${PROTOBUF_LIBRARIES}
+)
+
+
+########################################################
+# tf_core_lib library
+########################################################
+file(GLOB_RECURSE tf_core_lib_srcs
+ "${tensorflow_source_dir}/tensorflow/core/lib/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/lib/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/platform/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/platform/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/public/*.h"
+)
+
+file(GLOB_RECURSE tf_core_lib_test_srcs
+ "${tensorflow_source_dir}/tensorflow/core/lib/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/lib/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/platform/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/platform/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/public/*test*.h"
+)
+
+list(REMOVE_ITEM tf_core_lib_srcs ${tf_core_lib_test_srcs})
+
+add_library(tf_core_lib OBJECT ${tf_core_lib_srcs})
+target_include_directories(tf_core_lib PUBLIC
+ ${tensorflow_source_dir}
+ ${jpeg_INCLUDE_DIR}
+ ${png_INCLUDE_DIR}
+)
+#target_link_libraries(tf_core_lib
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_protos_cc
+#)
+target_compile_options(tf_core_lib PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_core_lib PRIVATE
+ cxx_rvalue_references
+)
+
+add_dependencies(tf_core_lib
+ jpeg_copy_headers_to_destination
+ png_copy_headers_to_destination
+ re2_copy_headers_to_destination
+ eigen
+ tf_protos_cc
+)
+
+
+########################################################
+# tf_core_framework library
+########################################################
+file(GLOB_RECURSE tf_core_framework_srcs
+ "${tensorflow_source_dir}/tensorflow/core/framework/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/framework/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/util/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/util/*.cc"
+ "${tensorflow_source_dir}/public/*.h"
+)
+
+file(GLOB_RECURSE tf_core_framework_test_srcs
+ "${tensorflow_source_dir}/tensorflow/core/framework/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/framework/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/framework/*testutil.h"
+ "${tensorflow_source_dir}/tensorflow/core/framework/*testutil.cc"
+ "${tensorflow_source_dir}/tensorflow/core/framework/*main.cc"
+ "${tensorflow_source_dir}/tensorflow/core/util/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/util/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/util/*main.cc"
+)
+
+list(REMOVE_ITEM tf_core_framework_srcs ${tf_core_framework_test_srcs})
+
+add_library(tf_core_framework OBJECT ${tf_core_framework_srcs})
+target_include_directories(tf_core_framework PUBLIC
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+ ${re2_INCLUDES}
+)
+#target_link_libraries(tf_core_framework
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# #${re2_STATIC_LIBRARIES}
+# re2_lib
+# ${jpeg_STATIC_LIBRARIES}
+# ${png_STATIC_LIBRARIES}
+# tf_protos_cc
+# tf_core_lib
+#)
+add_dependencies(tf_core_framework
+ tf_core_lib
+)
+target_compile_options(tf_core_framework PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+# C++11
+target_compile_features(tf_core_framework PRIVATE
+ cxx_rvalue_references
+)
diff --git a/tensorflow/contrib/cmake/tf_core_kernels.cmake b/tensorflow/contrib/cmake/tf_core_kernels.cmake
new file mode 100644
index 0000000000..c8f77f625c
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_core_kernels.cmake
@@ -0,0 +1,53 @@
+########################################################
+# tf_core_kernels library
+########################################################
+file(GLOB_RECURSE tf_core_kernels_srcs
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*.cc"
+)
+
+file(GLOB_RECURSE tf_core_kernels_exclude_srcs
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*testutil.h"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*testutil.cc"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*main.cc"
+ "${tensorflow_source_dir}/tensorflow/core/kernels/*.cu.cc"
+)
+
+list(REMOVE_ITEM tf_core_kernels_srcs ${tf_core_kernels_exclude_srcs})
+
+add_library(tf_core_kernels OBJECT ${tf_core_kernels_srcs})
+
+add_dependencies(tf_core_kernels tf_core_cpu)
+
+target_include_directories(tf_core_kernels PRIVATE
+ ${tensorflow_source_dir}
+ ${png_INCLUDE_DIR}
+ ${eigen_INCLUDE_DIRS}
+)
+
+#target_link_libraries(tf_core_kernels
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_core_cpu
+# tf_core_framework
+# tf_core_lib
+# tf_protos_cc
+# tf_models_word2vec_kernels
+# tf_stream_executor
+# tf_core_ops
+# tf_core_cpu
+#)
+
+# "@gemmlowp//:eight_bit_int_gemm",
+
+target_compile_options(tf_core_kernels PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_core_kernels PRIVATE
+ cxx_rvalue_references
+)
diff --git a/tensorflow/contrib/cmake/tf_core_ops.cmake b/tensorflow/contrib/cmake/tf_core_ops.cmake
new file mode 100644
index 0000000000..c6124b689c
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_core_ops.cmake
@@ -0,0 +1,181 @@
+#def tf_gen_op_libs(op_lib_names):
+# # Make library out of each op so it can also be used to generate wrappers
+# # for various languages.
+# for n in op_lib_names:
+# native.cc_library(name=n + "_op_lib"
+# copts=tf_copts(),
+# srcs=["ops/" + n + ".cc"],
+# deps=(["//tensorflow/core:framework"]),
+# visibility=["//visibility:public"],
+# alwayslink=1,
+# linkstatic=1,)
+
+
+set(tf_op_lib_names
+ "array_ops"
+ "attention_ops"
+ "candidate_sampling_ops"
+ "control_flow_ops"
+ "data_flow_ops"
+ "image_ops"
+ "io_ops"
+ "linalg_ops"
+ "logging_ops"
+ "functional_ops"
+ "math_ops"
+ "nn_ops"
+ "no_op"
+ "parsing_ops"
+ "random_ops"
+ "script_ops"
+ "sendrecv_ops"
+ "sparse_ops"
+ "state_ops"
+ "string_ops"
+ "summary_ops"
+ "training_ops"
+)
+
+foreach(tf_op_lib_name ${tf_op_lib_names})
+ ########################################################
+ # tf_${tf_op_lib_name} library
+ ########################################################
+ file(GLOB tf_${tf_op_lib_name}_srcs
+ "${tensorflow_source_dir}/tensorflow/core/ops/${tf_op_lib_name}.cc"
+ )
+
+ add_library(tf_${tf_op_lib_name} OBJECT ${tf_${tf_op_lib_name}_srcs})
+
+ add_dependencies(tf_${tf_op_lib_name} tf_core_framework)
+
+ target_include_directories(tf_${tf_op_lib_name} PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+ )
+
+ target_compile_options(tf_${tf_op_lib_name} PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+ )
+
+ # C++11
+ target_compile_features(tf_${tf_op_lib_name} PRIVATE
+ cxx_rvalue_references
+ )
+endforeach()
+
+#cc_library(
+# name = "user_ops_op_lib"
+# srcs = glob(["user_ops/**/*.cc"]),
+# copts = tf_copts(),
+# linkstatic = 1,
+# visibility = ["//visibility:public"],
+# deps = [":framework"],
+# alwayslink = 1,
+#)
+########################################################
+# tf_user_ops library
+########################################################
+file(GLOB_RECURSE tf_user_ops_srcs
+ "${tensorflow_source_dir}/tensorflow/core/user_ops/*.cc"
+)
+
+add_library(tf_user_ops OBJECT ${tf_user_ops_srcs})
+
+add_dependencies(tf_user_ops tf_core_framework)
+
+target_include_directories(tf_user_ops PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+)
+
+target_compile_options(tf_user_ops PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_user_ops PRIVATE
+ cxx_rvalue_references
+)
+
+
+#tf_cuda_library(
+# name = "ops"
+# srcs = glob(
+# [
+# "ops/**/*.h"
+# "ops/**/*.cc"
+# "user_ops/**/*.h"
+# "user_ops/**/*.cc"
+# ],
+# exclude = [
+# "**/*test*"
+# "**/*main.cc"
+# "user_ops/**/*.cu.cc"
+# ],
+# ),
+# copts = tf_copts(),
+# linkstatic = 1,
+# visibility = ["//visibility:public"],
+# deps = [
+# ":core"
+# ":lib"
+# ":protos_cc"
+# "//tensorflow/models/embedding:word2vec_ops"
+# "//third_party/eigen3"
+# ],
+# alwayslink = 1,
+#)
+
+########################################################
+# tf_core_ops library
+########################################################
+file(GLOB_RECURSE tf_core_ops_srcs
+ "${tensorflow_source_dir}/tensorflow/core/ops/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/ops/*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/user_ops/*.h"
+ "${tensorflow_source_dir}/tensorflow/core/user_ops/*.cc"
+)
+
+file(GLOB_RECURSE tf_core_ops_exclude_srcs
+ "${tensorflow_source_dir}/tensorflow/core/ops/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/ops/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/ops/*main.cc"
+ "${tensorflow_source_dir}/tensorflow/core/user_ops/*test*.h"
+ "${tensorflow_source_dir}/tensorflow/core/user_ops/*test*.cc"
+ "${tensorflow_source_dir}/tensorflow/core/user_ops/*main.cc"
+ "${tensorflow_source_dir}/tensorflow/core/user_ops/*.cu.cc"
+)
+
+list(REMOVE_ITEM tf_core_ops_srcs ${tf_core_ops_exclude_srcs})
+
+add_library(tf_core_ops OBJECT ${tf_core_ops_srcs})
+
+add_dependencies(tf_core_ops tf_core_cpu)
+
+target_include_directories(tf_core_ops PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+)
+
+#target_link_libraries(tf_core_ops
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_protos_cc
+# tf_core_lib
+# tf_core_cpu
+# tf_models_word2vec_ops
+#)
+
+target_compile_options(tf_core_ops PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_core_ops PRIVATE
+ cxx_rvalue_references
+)
+
+
diff --git a/tensorflow/contrib/cmake/tf_models.cmake b/tensorflow/contrib/cmake/tf_models.cmake
new file mode 100644
index 0000000000..ff3f5afbba
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_models.cmake
@@ -0,0 +1,95 @@
+#cc_library(
+# name = "word2vec_ops",
+# srcs = [
+# "word2vec_ops.cc",
+# ],
+# visibility = ["//tensorflow:internal"],
+# deps = [
+# "//tensorflow/core:framework",
+# ],
+# alwayslink = 1,
+#)
+
+########################################################
+# tf_models_word2vec_ops library
+########################################################
+file(GLOB tf_models_word2vec_ops_srcs
+ "${tensorflow_source_dir}/tensorflow/models/embedding/word2vec_ops.cc"
+)
+
+add_library(tf_models_word2vec_ops OBJECT ${tf_models_word2vec_ops_srcs})
+
+target_include_directories(tf_models_word2vec_ops PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+)
+
+add_dependencies(tf_models_word2vec_ops
+ tf_core_framework
+)
+#target_link_libraries(tf_models_word2vec_ops
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_core_framework
+# tf_core_lib
+# tf_protos_cc
+#)
+
+target_compile_options(tf_models_word2vec_ops PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_models_word2vec_ops PRIVATE
+ cxx_rvalue_references
+)
+
+#cc_library(
+# name = "word2vec_kernels",
+# srcs = [
+# "word2vec_kernels.cc",
+# ],
+# visibility = ["//tensorflow:internal"],
+# deps = [
+# "//tensorflow/core",
+# ],
+# alwayslink = 1,
+#)
+########################################################
+# tf_models_word2vec_kernels library
+########################################################
+file(GLOB tf_models_word2vec_kernels_srcs
+ "${tensorflow_source_dir}/tensorflow/models/embedding/word2vec_kernels.cc"
+)
+
+add_library(tf_models_word2vec_kernels OBJECT ${tf_models_word2vec_kernels_srcs})
+
+target_include_directories(tf_models_word2vec_kernels PRIVATE
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+ ${re2_INCLUDES}
+)
+
+add_dependencies(tf_models_word2vec_kernels
+ tf_core_cpu
+)
+
+#target_link_libraries(tf_models_word2vec_kernels
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_core_framework
+# tf_core_lib
+# tf_protos_cc
+# tf_core_cpu
+#)
+
+target_compile_options(tf_models_word2vec_kernels PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_models_word2vec_kernels PRIVATE
+ cxx_rvalue_references
+)
diff --git a/tensorflow/contrib/cmake/tf_stream_executor.cmake b/tensorflow/contrib/cmake/tf_stream_executor.cmake
new file mode 100644
index 0000000000..0bc8dad0ab
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_stream_executor.cmake
@@ -0,0 +1,81 @@
+#cc_library(
+# name = "stream_executor",
+# srcs = glob(
+# [
+#XX "*.cc",
+# "lib/*.cc",
+# ],
+# exclude = [
+# "**/*_test.cc",
+# ],
+# ) + if_cuda(
+# glob([
+# "cuda/*.cc",
+# ]),
+# ),
+# hdrs = glob([
+# "*.h",
+# "cuda/*.h",
+# "lib/*.h",
+# "platform/**/*.h",
+# ]),
+# data = [
+# "//tensorflow/core:cuda",
+# "//third_party/gpus/cuda:cublas",
+# "//third_party/gpus/cuda:cudnn",
+# ],
+# linkopts = [
+# "-ldl",
+# ],
+# visibility = ["//visibility:public"],
+# deps = [
+# "//tensorflow/core:lib",
+# "//third_party/gpus/cuda:cuda_headers",
+# ],
+# alwayslink = 1,
+#)
+
+########################################################
+# tf_stream_executor library
+########################################################
+file(GLOB tf_stream_executor_srcs
+ "${tensorflow_source_dir}/tensorflow/stream_executor/*.cc"
+ "${tensorflow_source_dir}/tensorflow/stream_executor/*.h"
+ "${tensorflow_source_dir}/tensorflow/stream_executor/lib/*.cc"
+ "${tensorflow_source_dir}/tensorflow/stream_executor/lib/*.h"
+ "${tensorflow_source_dir}/tensorflow/stream_executor/platform/*.h"
+ "${tensorflow_source_dir}/tensorflow/stream_executor/platform/default/*.h"
+)
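+
+# Note: the cuda/ sources from the Bazel rule above are not globbed here; only
+# the host-side StreamExecutor code is compiled by this CMake build.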
+
+#file(GLOB_RECURSE tf_stream_executor_test_srcs
+# "${tensorflow_source_dir}/tensorflow/stream_executor/*_test.cc"
+# "${tensorflow_source_dir}/tensorflow/stream_executor/*_test.h"
+#)
+#
+#list(REMOVE_ITEM tf_stream_executor_srcs ${tf_stream_executor_test_srcs})
+
+add_library(tf_stream_executor OBJECT ${tf_stream_executor_srcs})
+
+target_include_directories(tf_stream_executor PRIVATE
+ ${tensorflow_source_dir}
+)
+add_dependencies(tf_stream_executor
+ tf_core_lib
+)
+#target_link_libraries(tf_stream_executor
+# ${CMAKE_THREAD_LIBS_INIT}
+# ${PROTOBUF_LIBRARIES}
+# tf_protos_cc
+# tf_core_lib
+#)
+
+target_compile_options(tf_stream_executor PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_stream_executor PRIVATE
+ cxx_rvalue_references
+)
+
diff --git a/tensorflow/contrib/cmake/tf_tutorials.cmake b/tensorflow/contrib/cmake/tf_tutorials.cmake
new file mode 100644
index 0000000000..3b41c0637c
--- /dev/null
+++ b/tensorflow/contrib/cmake/tf_tutorials.cmake
@@ -0,0 +1,54 @@
+#cc_binary(
+# name = "tutorials_example_trainer",
+# srcs = ["tutorials/example_trainer.cc"],
+# copts = tf_copts(),
+# linkopts = [
+# "-lpthread",
+# "-lm",
+# ],
+# deps = [
+# ":cc_ops",
+# "//tensorflow/core:kernels",
+# "//tensorflow/core:tensorflow",
+# ],
+#)
+
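+# Build the example trainer as a standalone executable, linking in the OBJECT
+# libraries assembled by the other tf_*.cmake files.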
+set(tf_tutorials_example_trainer_srcs
+ "${tensorflow_source_dir}/tensorflow/cc/tutorials/example_trainer.cc"
+)
+
+add_executable(tf_tutorials_example_trainer
+ ${tf_tutorials_example_trainer_srcs}
+ $<TARGET_OBJECTS:tf_core_lib>
+ $<TARGET_OBJECTS:tf_core_cpu>
+ $<TARGET_OBJECTS:tf_core_framework>
+ $<TARGET_OBJECTS:tf_core_kernels>
+ $<TARGET_OBJECTS:tf_cc_ops>
+ $<TARGET_OBJECTS:tf_core_ops>
+ $<TARGET_OBJECTS:tf_core_direct_session>
+)
+
+target_include_directories(tf_tutorials_example_trainer PUBLIC
+ ${tensorflow_source_dir}
+ ${eigen_INCLUDE_DIRS}
+)
+
+target_link_libraries(tf_tutorials_example_trainer PUBLIC
+ ${CMAKE_THREAD_LIBS_INIT}
+ ${PROTOBUF_LIBRARIES}
+ tf_protos_cc
+ re2_lib
+ ${jpeg_STATIC_LIBRARIES}
+ ${png_STATIC_LIBRARIES}
+ ${ZLIB_LIBRARIES}
+)
+
+target_compile_options(tf_tutorials_example_trainer PRIVATE
+ -fno-exceptions
+ -DEIGEN_AVOID_STL_ARRAY
+)
+
+# C++11
+target_compile_features(tf_tutorials_example_trainer PRIVATE
+ cxx_rvalue_references
+)
diff --git a/tensorflow/contrib/layers/python/ops/loss_ops.py b/tensorflow/contrib/layers/python/ops/loss_ops.py
index 0c2b2a0df5..ae3d6203fe 100644
--- a/tensorflow/contrib/layers/python/ops/loss_ops.py
+++ b/tensorflow/contrib/layers/python/ops/loss_ops.py
@@ -79,7 +79,7 @@ def _reduce_batch(x, reduce_fn, name=None):
elif ndims == 1:
return x # Don't include a useless reduction.
elif ndims:
- reduction_indices = range(1, ndims)
+ reduction_indices = list(range(1, ndims))
shape = [x.get_shape().dims[0]]
else:
reduction_indices = math_ops.range(1, array_ops.size(array_ops.shape(x)))
diff --git a/tensorflow/contrib/linear_optimizer/kernels/sdca_ops.cc b/tensorflow/contrib/linear_optimizer/kernels/sdca_ops.cc
index 1e9ca3d256..68146a3dff 100644
--- a/tensorflow/contrib/linear_optimizer/kernels/sdca_ops.cc
+++ b/tensorflow/contrib/linear_optimizer/kernels/sdca_ops.cc
@@ -73,11 +73,6 @@ struct Regularizations {
float symmetric_l2 = 0;
};
-struct RegularizationLoss {
- double l1_loss = 0;
- double l2_loss = 0;
-};
-
struct PerExampleData {
double wx = 0;
double norm = 0;
@@ -102,7 +97,7 @@ using DenseFeaturesByGroup = std::vector<TTypes<const float>::Vec>;
// indicates that the contents of sparse_examples_by_group cannot be trusted or
// used.
Status FillSparseExamplesByGroup(
- const int64 num_sparse_features, const int64 num_examples,
+ const int64 num_sparse_features, const int num_examples,
const OpInputList& sparse_features_indices_inputs,
const OpInputList& sparse_features_values_inputs,
const WeightsByGroup& sparse_weights_by_group,
@@ -127,7 +122,10 @@ Status FillSparseExamplesByGroup(
static const int64 kIndicesDims = 2;
gtl::InlinedVector<int64, 8> order(kIndicesDims);
std::iota(order.begin(), order.end(), 0);
- for (int64 i = begin; i < end; ++i) {
+
+  // The static_cast here is safe since begin and end can be at most
+  // num_examples, which is an int.
+ for (int i = static_cast<int>(begin); i < end; ++i) {
if (sparse_features_indices_inputs[i].shape().dims() != kIndicesDims) {
mutex_lock l(mu);
result = errors::InvalidArgument(strings::Printf(
@@ -147,7 +145,7 @@ Status FillSparseExamplesByGroup(
if (example_index < 0 || example_index >= num_examples) {
mutex_lock l(mu);
result = errors::Internal(strings::Printf(
- "Example indices should be in [0, %lld). Encountered: %lld",
+ "Example indices should be in [0, %d). Encountered: %lld",
num_examples, example_index));
return;
}
@@ -203,35 +201,6 @@ inline double Shrink(const double weight, const double shrink_by) {
return 0.0;
}
-// Compute L1 and L2 regularization loss.
-inline RegularizationLoss ComputeRegularizationLoss(
- const WeightsByGroup& sparse_weights_by_group,
- const WeightsByGroup& dense_weights_by_group,
- const Regularizations& regularizations) {
- RegularizationLoss result;
-
- const double shrink_by = ShrinkageFactor(regularizations);
- auto accumulate_regularization_loss = [&](const double w) {
- const double sw = std::abs(Shrink(w, shrink_by));
- result.l1_loss += sw;
- result.l2_loss += sw * sw;
- };
-
- for (const TTypes<float>::Vec weights : sparse_weights_by_group) {
- for (int64 i = 0; i < weights.size(); ++i) {
- accumulate_regularization_loss(weights(i));
- }
- }
-
- for (const TTypes<float>::Vec weights : dense_weights_by_group) {
- accumulate_regularization_loss(weights(0));
- }
-
- result.l1_loss *= regularizations.symmetric_l1;
- result.l2_loss *= regularizations.symmetric_l2;
- return result;
-}
-
// Compute PerExampleData which contains the logits, and weighted example norm
// for a given example_id. Norm is weighted by 1/(lambda*N).
inline PerExampleData ComputeWxAndWeightedExampleNorm(
@@ -380,7 +349,7 @@ WeightsByGroup MakeDeltaWeightsFrom(std::vector<Tensor>* const tensors) {
}
Status RunTrainStepsForMiniBatch(
- const int64 num_examples, const TTypes<const string>::Vec example_ids,
+ const int num_examples, const TTypes<const string>::Vec example_ids,
const TTypes<const float>::Vec example_labels,
const TTypes<const float>::Vec example_weights,
const DeviceBase::CpuWorkerThreads& worker_threads,
@@ -459,6 +428,13 @@ Status RunTrainStepsForMiniBatch(
return train_step_status;
}
+Status FillRegularizations(OpKernelConstruction* const context,
+ Regularizations* const regularizations) {
+ TF_RETURN_IF_ERROR(context->GetAttr("l1", &regularizations->symmetric_l1));
+ TF_RETURN_IF_ERROR(context->GetAttr("l2", &regularizations->symmetric_l2));
+ return Status::OK();
+}
+
} // namespace
class SdcaSolver : public OpKernel {
@@ -484,25 +460,9 @@ class SdcaSolver : public OpKernel {
OP_REQUIRES(
context, num_sparse_features_ + num_dense_features_ > 0,
errors::InvalidArgument("Requires at least one feature to train."));
-
- OP_REQUIRES_OK(context,
- context->GetAttr("l1", &regularizations_.symmetric_l1));
- OP_REQUIRES_OK(context,
- context->GetAttr("l2", &regularizations_.symmetric_l2));
- // We enforce a minimal l2, required by the algorithm.
- regularizations_.symmetric_l2 =
- std::max(regularizations_.symmetric_l2, 1.0f);
-
+ OP_REQUIRES_OK(context, FillRegularizations(context, &regularizations_));
OP_REQUIRES_OK(context, context->GetAttr("num_inner_iterations",
&num_inner_iterations_));
-
- // TODO(rohananil): Provide emperical evidence for this. It is better to run
- // more than one iteration on single mini-batch as we want to spend more
- // time in compute. SDCA works better with larger mini batches and there
- // is also recent work that shows its better to reuse old samples than train
- // on new samples. See: http://arxiv.org/abs/1602.02136.
- num_inner_iterations_ =
- std::max(num_inner_iterations_, static_cast<int64>(2));
OP_REQUIRES_OK(context, context->GetAttr("container", &container_));
OP_REQUIRES_OK(context, context->GetAttr("solver_uuid", &solver_uuid_));
}
@@ -533,21 +493,16 @@ class SdcaSolver : public OpKernel {
OP_REQUIRES(context, TensorShapeUtils::IsVector(example_weights_t->shape()),
errors::InvalidArgument("example_weights should be a vector."));
const auto example_weights = example_weights_t->vec<float>();
-
- Eigen::Tensor<float, 0, Eigen::RowMajor> example_weights_sum;
- example_weights_sum.device(context->eigen_cpu_device()) =
- example_weights.sum();
- const float weighted_examples = example_weights_sum();
- const int64 num_examples = example_weights.size();
-
- OP_REQUIRES(context, weighted_examples > 0,
- errors::InvalidArgument("No weighted examples in ",
- num_examples, " training examples"));
+ OP_REQUIRES(context,
+ example_weights.size() <= std::numeric_limits<int>::max(),
+ errors::InvalidArgument(strings::Printf(
+ "Too many examples in a mini-batch: %ld > %d",
+ example_weights.size(), std::numeric_limits<int>::max())));
+ const int num_examples = static_cast<int>(example_weights.size());
OpInputList dense_features_inputs;
OP_REQUIRES_OK(
context, context->input_list("dense_features", &dense_features_inputs));
-
DenseFeaturesByGroup dense_features_by_group;
for (const auto& dense_feature : dense_features_inputs) {
dense_features_by_group.emplace_back(dense_feature.vec<float>());
@@ -562,7 +517,7 @@ class SdcaSolver : public OpKernel {
OP_REQUIRES(context, example_labels.size() == num_examples,
errors::InvalidArgument(strings::Printf(
"The number of example labels (%ld) should match the "
- "number of example weights (%lld).",
+ "number of example weights (%d).",
example_labels.size(), num_examples)));
const Tensor* example_ids_t;
@@ -573,7 +528,7 @@ class SdcaSolver : public OpKernel {
OP_REQUIRES(context, example_labels.size() == num_examples,
errors::InvalidArgument(strings::Printf(
"The number of example ids (%ld) should match the number "
- "of example weights (%lld).",
+ "of example weights (%d).",
example_ids.size(), num_examples)));
const int64 num_duplicate_example_ids = [&] {
// TODO(katsiapis): Benchmark and/or optimize.
@@ -632,12 +587,7 @@ class SdcaSolver : public OpKernel {
SetZeroDeltaWeights(&sparse_delta_weights_by_group,
&dense_delta_weights_by_group);
- // TODO(rohananil): Provide emperical evidence for this. It is better to run
- // more than one iteration on single mini-batch as we want to spend more
- // time in compute. SDCA works better with larger mini batches and there
- // is also recent work that shows its better to reuse old samples than train
- // on new samples. See: http://arxiv.org/abs/1602.02136.
- for (int64 i = 0; i < num_inner_iterations_; ++i) {
+ for (int i = 0; i < num_inner_iterations_; ++i) {
OP_REQUIRES_OK(
context,
RunTrainStepsForMiniBatch(
@@ -669,7 +619,7 @@ class SdcaSolver : public OpKernel {
int64 num_sparse_features_;
int64 num_dense_features_;
Regularizations regularizations_;
- int64 num_inner_iterations_;
+ int num_inner_iterations_;
string container_;
string solver_uuid_;
};
@@ -678,13 +628,7 @@ REGISTER_KERNEL_BUILDER(Name("SdcaSolver").Device(DEVICE_CPU), SdcaSolver);
class SdcaShrinkL1 : public OpKernel {
public:
explicit SdcaShrinkL1(OpKernelConstruction* context) : OpKernel(context) {
- OP_REQUIRES_OK(context,
- context->GetAttr("l1", &regularizations_.symmetric_l1));
- OP_REQUIRES_OK(context,
- context->GetAttr("l2", &regularizations_.symmetric_l2));
- // We enforce a minimal l2, required by the algorithm.
- regularizations_.symmetric_l2 =
- std::max(regularizations_.symmetric_l2, 1.0f);
+ OP_REQUIRES_OK(context, FillRegularizations(context, &regularizations_));
}
void Compute(OpKernelContext* context) override {
@@ -709,19 +653,10 @@ class SdcaShrinkL1 : public OpKernel {
};
REGISTER_KERNEL_BUILDER(Name("SdcaShrinkL1").Device(DEVICE_CPU), SdcaShrinkL1);
-class ComputeDualityGap : public OpKernel {
+class SdcaTrainingStats : public OpKernel {
public:
- explicit ComputeDualityGap(OpKernelConstruction* context)
+ explicit SdcaTrainingStats(OpKernelConstruction* context)
: OpKernel(context) {
- // TODO(rohananil): Refactor grabbing common attributes across ops related
- // to sdca.
- OP_REQUIRES_OK(context,
- context->GetAttr("l1", &regularizations_.symmetric_l1));
- OP_REQUIRES_OK(context,
- context->GetAttr("l2", &regularizations_.symmetric_l2));
- // We enforce a minimal l2, required by the algorithm.
- regularizations_.symmetric_l2 =
- std::max(regularizations_.symmetric_l2, 1.0f);
OP_REQUIRES_OK(context, context->GetAttr("container", &container_));
OP_REQUIRES_OK(context, context->GetAttr("solver_uuid", &solver_uuid_));
}
@@ -734,45 +669,56 @@ class ComputeDualityGap : public OpKernel {
context, !data_by_example->RefCountIsOne(),
errors::Internal("Expected shared-ownership of data_by_example."));
- OpMutableInputList sparse_weights_inputs;
- OP_REQUIRES_OK(context, context->mutable_input_list(
- "sparse_weights", &sparse_weights_inputs));
- WeightsByGroup sparse_weights_by_group =
- MakeWeightsFrom(&sparse_weights_inputs);
-
- OpMutableInputList dense_weights_inputs;
- OP_REQUIRES_OK(context, context->mutable_input_list("dense_weights",
- &dense_weights_inputs));
- WeightsByGroup dense_weights_by_group =
- MakeWeightsFrom(&dense_weights_inputs);
-
- double example_weight_sum = 0;
- double total_duality_gap = 0;
+ double total_primal_loss = 0;
+ double total_dual_loss = 0;
+ double total_example_weight = 0;
OP_REQUIRES_OK(context,
data_by_example->Visit([&](const DataByExample::Data& data) {
- example_weight_sum += data.example_weight;
- total_duality_gap += data.primal_loss + data.dual_loss;
+ total_primal_loss += data.primal_loss;
+ total_dual_loss += data.dual_loss;
+ total_example_weight += data.example_weight;
}));
- const RegularizationLoss regularization_loss = ComputeRegularizationLoss(
- sparse_weights_by_group, dense_weights_by_group, regularizations_);
- total_duality_gap +=
- regularization_loss.l2_loss + regularization_loss.l1_loss;
+ // TODO(katsiapis): Think about the most arithmetically stable way of
+ // computing (dual + primal) loss (if it matters).
- Tensor* duality_gap_t = nullptr;
- OP_REQUIRES_OK(context,
- context->allocate_output("duality_gap", {}, &duality_gap_t));
- duality_gap_t->scalar<float>()() = total_duality_gap / example_weight_sum;
+ {
+ Tensor* tensor = nullptr;
+ OP_REQUIRES_OK(context,
+ context->allocate_output("primal_loss", {}, &tensor));
+ tensor->scalar<double>()() = total_primal_loss;
+ }
+
+ {
+ Tensor* tensor = nullptr;
+ OP_REQUIRES_OK(context,
+ context->allocate_output("dual_loss", {}, &tensor));
+ tensor->scalar<double>()() = total_dual_loss;
+ }
+
+ {
+ OP_REQUIRES(
+ context, total_example_weight > 0,
+ errors::FailedPrecondition(
+ "No examples found or all examples have zero weight. Either the "
+ "optimizer was trained with no instances or perhaps there is a "
+ "bug in the training data."));
+
+ Tensor* tensor = nullptr;
+ OP_REQUIRES_OK(context,
+ context->allocate_output("example_weights", {}, &tensor));
+ tensor->scalar<double>()() = total_example_weight;
+ }
// TODO(katsiapis): Use core::ScopedUnref once it's moved out of internal.
data_by_example->Unref();
}
private:
- Regularizations regularizations_;
string container_;
string solver_uuid_;
};
-REGISTER_KERNEL_BUILDER(Name("ComputeDualityGap").Device(DEVICE_CPU),
- ComputeDualityGap);
+REGISTER_KERNEL_BUILDER(Name("SdcaTrainingStats").Device(DEVICE_CPU),
+ SdcaTrainingStats);
+
} // namespace tensorflow
diff --git a/tensorflow/contrib/linear_optimizer/ops/sdca_ops.cc b/tensorflow/contrib/linear_optimizer/ops/sdca_ops.cc
index ff2bae8fea..7c0e27d19d 100644
--- a/tensorflow/contrib/linear_optimizer/ops/sdca_ops.cc
+++ b/tensorflow/contrib/linear_optimizer/ops/sdca_ops.cc
@@ -24,7 +24,7 @@ REGISTER_OP("SdcaSolver")
.Attr("num_dense_features: int >= 0")
.Attr("l1: float >= 0")
.Attr("l2: float >= 1")
- .Attr("num_inner_iterations: int >= 2")
+ .Attr("num_inner_iterations: int >= 1")
.Attr("container: string")
.Attr("solver_uuid: string")
.Input("sparse_features_indices: num_sparse_features * int64")
@@ -69,7 +69,7 @@ example_labels: a vector which contains the label/target associated with each
example_ids: a vector which contains the unique identifier associated with each
example.
sparse_weights: a list of vectors where each value is the weight associated with
- a feature index.
+ a feature group.
dense_weights: a list of vectors where the value is the weight associated with
a dense feature group.
)doc");
@@ -89,38 +89,28 @@ num_dense_features: Number of dense feature groups to train on.
l1: Symmetric l1 regularization strength.
l2: Symmetric l2 regularization strength.
sparse_weights: a list of vectors where each value is the weight associated with
- a feature index.
+ a feature group.
dense_weights: a list of vectors where the value is the weight associated with
a dense feature group.
)doc");
-// TODO(katsiapis): We should expand this scope of this op to compute other
-// statistics about the data.
-REGISTER_OP("ComputeDualityGap")
- .Attr("num_sparse_features: int >= 0")
- .Attr("num_dense_features: int >= 0")
- .Attr("l1: float >= 0")
- .Attr("l2: float >= 1")
+REGISTER_OP("SdcaTrainingStats")
.Attr("container: string")
.Attr("solver_uuid: string")
- .Input("sparse_weights: Ref(num_sparse_features * float)")
- .Input("dense_weights: Ref(num_dense_features * float)")
- .Output("duality_gap: float")
+ .Output("primal_loss: float64")
+ .Output("dual_loss: float64")
+ .Output("example_weights: float64")
.Doc(R"doc(
-Computes duality gap over all examples seen by the optimizer.
+Computes statistics over all examples seen by the optimizer.
-num_sparse_features: Number of sparse feature groups to train on.
-num_dense_features: Number of dense feature groups to train on.
-l1: Symmetric l1 regularization strength.
-l2: Symmetric l2 regularization strength.
container: Name of the Container that stores data across invocations of this
Kernel. Together with SolverUUID form an isolation unit for this solver.
solver_uuid: Universally Unique Identifier for this solver.
-sparse_weights: a list of vectors where each value is the weight associated with
- a feature index.
-dense_weights: a list of vectors where the value is the weight associated with
- a dense feature group.
-duality_gap: duality gap over all examples seen by the optimizer.
+primal_loss: total primal loss of all examples seen by the optimizer.
+dual_loss: total dual loss of all examples seen by the optimizer.
+example_weights: total example weights of all examples seen by the optimizer
+ (guaranteed to be positive; otherwise returns FAILED_PRECONDITION as it
+ probably indicates a bug in the training data).
)doc");
} // namespace tensorflow
diff --git a/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py b/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
index 13968457f7..27da0b4f55 100644
--- a/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
+++ b/tensorflow/contrib/linear_optimizer/python/kernel_tests/sdca_ops_test.py
@@ -92,6 +92,7 @@ def make_variable_dict(max_age, max_gender):
return dict(sparse_features_weights=[age_weights, gender_weights],
dense_features_weights=[])
+
def make_dense_variable_dict(num_dense_features, num_examples):
feature_weights = ([
tf.Variable(tf.zeros([1],
@@ -130,6 +131,7 @@ def tearDown():
pass
+# TODO(katsiapis): Add tests that exercise L1 and Shrinking.
class SdcaOptimizerTest(TensorFlowTestCase):
def _single_threaded_test_session(self):
@@ -180,6 +182,44 @@ class SdcaOptimizerTest(TensorFlowTestCase):
rtol=1e-2,
atol=1e-2)
+ def testSimpleLogisticNoL2(self):
+ # Same as test above (so comments from above apply) but without an L2.
+ # The algorithm should behave as if we have an L2 of 1 in optimization but
+ # 0 in regularized_loss.
+ example_protos = [
+ make_example_proto(
+ {'age': [0],
+ 'gender': [0]}, 0),
+ make_example_proto(
+ {'age': [1],
+ 'gender': [1]}, 1),
+ ]
+ example_weights = [1.0, 1.0]
+ with self._single_threaded_test_session():
+ examples = make_example_dict(example_protos, example_weights)
+ variables = make_variable_dict(1, 1)
+ options = dict(symmetric_l2_regularization=0,
+ symmetric_l1_regularization=0,
+ loss_type='logistic_loss')
+
+ lr = SdcaModel(CONTAINER, examples, variables, options)
+ tf.initialize_all_variables().run()
+ unregularized_loss = lr.unregularized_loss(examples)
+ loss = lr.regularized_loss(examples)
+ predictions = lr.predictions(examples)
+ self.assertAllClose(0.693147, unregularized_loss.eval())
+ self.assertAllClose(0.693147, loss.eval())
+ for _ in xrange(5):
+ lr.minimize().run()
+ self.assertAllClose(0.411608, unregularized_loss.eval(), rtol=0.11)
+ self.assertAllClose(0.371705, loss.eval(), atol=0.01)
+ predicted_labels = get_binary_predictions_for_logistic(predictions)
+ self.assertAllEqual([0, 1], predicted_labels.eval())
+ self.assertAllClose(0.01,
+ lr.approximate_duality_gap().eval(),
+ rtol=1e-2,
+ atol=1e-2)
+
def testSomeUnweightedExamples(self):
# Setup test data with 4 examples, but should produce the same
# results as testSimple.
@@ -272,10 +312,11 @@ class SdcaOptimizerTest(TensorFlowTestCase):
lr = SdcaModel(CONTAINER, examples, variables, options)
tf.initialize_all_variables().run()
self.assertAllClose([0.5, 0.5], lr.predictions(examples).eval())
- with self.assertRaisesOpError(
- 'No weighted examples in 2 training examples'):
- lr.minimize().run()
+ lr.minimize().run()
self.assertAllClose([0.5, 0.5], lr.predictions(examples).eval())
+ with self.assertRaisesOpError(
+ 'No examples found or all examples have zero weight.'):
+ lr.approximate_duality_gap().eval()
def testDuplicateExampleIds(self):
# Setup test data with 1 positive, and 1 negative example.
diff --git a/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py b/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
index 957a734b07..9d41f024ae 100644
--- a/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
+++ b/tensorflow/contrib/linear_optimizer/python/ops/sdca_ops.py
@@ -28,7 +28,6 @@ from tensorflow.python.framework.ops import name_scope
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import state_ops
from tensorflow.python.ops import variables as var_ops
from tensorflow.python.ops.nn import sigmoid_cross_entropy_with_logits
from tensorflow.python.platform import resource_loader
@@ -139,30 +138,35 @@ class SdcaModel(object):
['loss_type', 'symmetric_l2_regularization',
'symmetric_l1_regularization'], options)
+ for name in ['symmetric_l1_regularization', 'symmetric_l2_regularization']:
+ value = options[name]
+ if value < 0.0:
+ raise ValueError('%s should be non-negative. Found (%f)' %
+ (name, value))
+
self._container = container
self._examples = examples
self._variables = variables
self._options = options
self._solver_uuid = uuid.uuid4().hex
- self._create_slots(variables)
-
- # TODO(rohananil): Use optimizer interface to make use of slot creation
- # logic
- def _create_slots(self, variables):
- self._slots = {}
- # TODO(rohananil): Rename the slot keys to "unshrinked" weights.
- self._slots['sparse_features_weights'] = []
- self._slots['dense_features_weights'] = []
- self._assign_ops = []
- # Make an internal variable which has the updates before applying L1
+ self._create_slots()
+
+ def _symmetric_l2_regularization(self):
+    # Algorithmic requirement (for now) is to have a minimal l2 of 1.0.
+ return max(self._options['symmetric_l2_regularization'], 1.0)
+
+ # TODO(rohananil): Use optimizer interface to make use of slot creation logic.
+ def _create_slots(self):
+ # Make internal variables which have the updates before applying L1
# regularization.
- for var_type in ['sparse_features_weights', 'dense_features_weights']:
- for var in variables[var_type]:
- if var is not None:
- self._slots[var_type].append(var_ops.Variable(array_ops.zeros_like(
- var.initialized_value(), dtypes.float32)))
- self._assign_ops.append(state_ops.assign(var, self._slots[var_type][
- -1]))
+ self._slots = {
+ 'unshrinked_sparse_features_weights': [],
+ 'unshrinked_dense_features_weights': [],
+ }
+ for name in ['sparse_features_weights', 'dense_features_weights']:
+ for var in self._variables[name]:
+ self._slots['unshrinked_' + name].append(var_ops.Variable(
+ array_ops.zeros_like(var.initialized_value(), dtypes.float32)))
def _assertSpecified(self, items, check_in):
for x in items:
@@ -177,33 +181,22 @@ class SdcaModel(object):
def _l1_loss(self):
"""Computes the l1 loss of the model."""
with name_scope('l1_loss'):
- sparse_weights = self._convert_n_to_tensor(self._variables[
- 'sparse_features_weights'])
- dense_weights = self._convert_n_to_tensor(self._variables[
- 'dense_features_weights'])
- l1 = self._options['symmetric_l1_regularization']
- loss = 0.0
- for w in sparse_weights:
- loss += l1 * math_ops.reduce_sum(abs(w))
- for w in dense_weights:
- loss += l1 * math_ops.reduce_sum(abs(w))
- return loss
-
- def _l2_loss(self):
+ sum = 0.0
+ for name in ['sparse_features_weights', 'dense_features_weights']:
+ for weights in self._convert_n_to_tensor(self._variables[name]):
+ sum += math_ops.reduce_sum(math_ops.abs(weights))
+ # SDCA L1 regularization cost is: l1 * sum(|weights|)
+ return self._options['symmetric_l1_regularization'] * sum
+
+ def _l2_loss(self, l2):
"""Computes the l2 loss of the model."""
with name_scope('l2_loss'):
- sparse_weights = self._convert_n_to_tensor(self._variables[
- 'sparse_features_weights'])
- dense_weights = self._convert_n_to_tensor(self._variables[
- 'dense_features_weights'])
- l2 = self._options['symmetric_l2_regularization']
- loss = 0.0
- for w in sparse_weights:
- loss += l2 * math_ops.reduce_sum(math_ops.square(w))
- for w in dense_weights:
- loss += l2 * math_ops.reduce_sum(math_ops.square(w))
- # SDCA L2 regularization cost is 1/2 * l2 * sum(weights^2)
- return loss / 2.0
+ sum = 0.0
+ for name in ['sparse_features_weights', 'dense_features_weights']:
+ for weights in self._convert_n_to_tensor(self._variables[name]):
+ sum += math_ops.reduce_sum(math_ops.square(weights))
+ # SDCA L2 regularization cost is: l2 * sum(weights^2) / 2
+ return l2 * sum / 2
def _convert_n_to_tensor(self, input_list, as_ref=False):
"""Converts input list to a set of tensors."""
@@ -265,31 +258,44 @@ class SdcaModel(object):
"""
with name_scope('sdca/minimize'):
sparse_features_indices = []
- sparse_features_weights = []
+ sparse_features_values = []
for sf in self._examples['sparse_features']:
sparse_features_indices.append(convert_to_tensor(sf.indices))
- sparse_features_weights.append(convert_to_tensor(sf.values))
+ sparse_features_values.append(convert_to_tensor(sf.values))
step_op = _sdca_ops.sdca_solver(
sparse_features_indices,
- sparse_features_weights,
+ sparse_features_values,
self._convert_n_to_tensor(self._examples['dense_features']),
convert_to_tensor(self._examples['example_weights']),
convert_to_tensor(self._examples['example_labels']),
convert_to_tensor(self._examples['example_ids']),
- self._convert_n_to_tensor(self._slots['sparse_features_weights'],
- as_ref=True),
- self._convert_n_to_tensor(self._slots['dense_features_weights'],
- as_ref=True),
+ self._convert_n_to_tensor(
+ self._slots['unshrinked_sparse_features_weights'],
+ as_ref=True),
+ self._convert_n_to_tensor(
+ self._slots['unshrinked_dense_features_weights'],
+ as_ref=True),
l1=self._options['symmetric_l1_regularization'],
- l2=self._options['symmetric_l2_regularization'],
+ l2=self._symmetric_l2_regularization(),
+          # TODO(rohananil): Provide empirical evidence for this. It is better
+          # to run more than one iteration on a single mini-batch as we want
+          # to spend more time in compute. SDCA works better with larger
+          # mini-batches, and there is also recent work that shows it is
+          # better to reuse old samples than to train on new samples.
+          # See: http://arxiv.org/abs/1602.02136.
num_inner_iterations=2,
loss_type=self._options['loss_type'],
container=self._container,
solver_uuid=self._solver_uuid)
with ops.control_dependencies([step_op]):
- assign_ops = control_flow_ops.group(*self._assign_ops)
- with ops.control_dependencies([assign_ops]):
+ assign_ops = []
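+        # Copy the freshly updated, unshrinked slot weights back into the
+        # user-visible variables before applying the L1 shrinkage below.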
+ for name in ['sparse_features_weights', 'dense_features_weights']:
+ for var, slot_var in zip(self._variables[name],
+ self._slots['unshrinked_' + name]):
+ assign_ops.append(var.assign(slot_var))
+ assign_group = control_flow_ops.group(*assign_ops)
+ with ops.control_dependencies([assign_group]):
return _sdca_ops.sdca_shrink_l1(
self._convert_n_to_tensor(
self._variables['sparse_features_weights'],
@@ -298,7 +304,7 @@ class SdcaModel(object):
self._variables['dense_features_weights'],
as_ref=True),
l1=self._options['symmetric_l1_regularization'],
- l2=self._options['symmetric_l2_regularization'])
+ l2=self._symmetric_l2_regularization())
def approximate_duality_gap(self):
"""Add operations to compute the approximate duality gap.
@@ -307,15 +313,14 @@ class SdcaModel(object):
An Operation that computes the approximate duality gap over all
examples.
"""
- return _sdca_ops.compute_duality_gap(
- self._convert_n_to_tensor(self._slots['sparse_features_weights'],
- as_ref=True),
- self._convert_n_to_tensor(self._slots['dense_features_weights'],
- as_ref=True),
- l1=self._options['symmetric_l1_regularization'],
- l2=self._options['symmetric_l2_regularization'],
+ (primal_loss, dual_loss, example_weights) = _sdca_ops.sdca_training_stats(
container=self._container,
solver_uuid=self._solver_uuid)
+ # Note that example_weights is guaranteed to be positive by
+ # sdca_training_stats so dividing by it is safe.
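+    # The approximate duality gap is the total (primal + dual + regularization)
+    # loss, normalized by the total example weight seen by the optimizer.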
+ return (primal_loss + dual_loss + math_ops.to_double(self._l1_loss()) +
+ (2.0 * math_ops.to_double(self._l2_loss(
+ self._symmetric_l2_regularization())))) / example_weights
def unregularized_loss(self, examples):
"""Add operations to compute the loss (without the regularization loss).
@@ -384,6 +389,11 @@ class SdcaModel(object):
self._assertList(['sparse_features', 'dense_features'], examples)
with name_scope('sdca/regularized_loss'):
weights = convert_to_tensor(examples['example_weights'])
- return ((
- (self._l1_loss() + self._l2_loss()) / math_ops.reduce_sum(weights)) +
+ return (((
+ self._l1_loss() +
+ # Note that here we are using the raw regularization
+ # (as specified by the user) and *not*
+ # self._symmetric_l2_regularization().
+ self._l2_loss(self._options['symmetric_l2_regularization'])) /
+ math_ops.reduce_sum(weights)) +
self.unregularized_loss(examples))
diff --git a/tensorflow/core/distributed_runtime/README.md b/tensorflow/core/distributed_runtime/README.md
index 4d2a18ed33..918af2d2ba 100644
--- a/tensorflow/core/distributed_runtime/README.md
+++ b/tensorflow/core/distributed_runtime/README.md
@@ -127,7 +127,7 @@ replicated model. Possible approaches include:
* As above, but where the gradients from all workers are averaged. See the
[CIFAR-10 multi-GPU trainer](https://www.tensorflow.org/code/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py)
- for an example of this form of replication. The implements *synchronous* training
+  for an example of this form of replication. This implements *synchronous* training.
* The "distributed trainer" approach uses multiple graphs&mdash;one per
worker&mdash;where each graph contains one set of parameters (pinned to
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index f8b90ced75..cd96724980 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -1089,6 +1089,7 @@ filegroup(
"avgpooling_op.cc",
"batch_norm_op.cc",
"bcast_ops.cc",
+ "check_numerics_op.cc",
"control_flow_ops.cc",
"conv_2d.h",
"conv_ops.cc",
diff --git a/tensorflow/core/kernels/bounds_check.h b/tensorflow/core/kernels/bounds_check.h
index 9bfbde9bc7..db9345e965 100644
--- a/tensorflow/core/kernels/bounds_check.h
+++ b/tensorflow/core/kernels/bounds_check.h
@@ -26,26 +26,15 @@ namespace tensorflow {
// Check that 0 <= index < limit using a single comparison, assuming
// that 0 <= limit if Index is signed. Intended for use in performance
// critical contexts where 0 <= index < limit is almost always true.
-template <class Index>
-EIGEN_ALWAYS_INLINE bool FastBoundsCheck(Index index, Index limit) {
- typedef typename std::make_unsigned<Index>::type UIndex;
+template <typename Ta, typename Tb>
+EIGEN_ALWAYS_INLINE bool FastBoundsCheck(const Ta index, const Tb limit) {
+ static_assert(std::is_integral<Ta>::value && std::is_integral<Tb>::value,
+ "FastBoundsCheck can only be used on integer types.");
+ typedef typename std::make_unsigned<decltype(index + limit)>::type UIndex;
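+  // The unsigned comparison type is derived from decltype(index + limit), so
+  // mixed-width arguments (e.g. an int32 index with an int64 limit) are first
+  // promoted to the wider type; this subsumes the former explicit overloads.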
return TF_PREDICT_TRUE(static_cast<UIndex>(index) <
static_cast<UIndex>(limit));
}
-// Upcasting specializations when the index and bounds do not match;
-// always move to the larger type.
-
-EIGEN_ALWAYS_INLINE bool FastBoundsCheck(int64 index, int32 limit) {
- return TF_PREDICT_TRUE(static_cast<uint64>(index) <
- static_cast<uint64>(limit));
-}
-
-EIGEN_ALWAYS_INLINE bool FastBoundsCheck(int32 index, int64 limit) {
- return TF_PREDICT_TRUE(static_cast<uint64>(index) <
- static_cast<uint64>(limit));
-}
-
namespace internal {
// Ensure that the compiler cannot elide a copy into a local, for
// bounds checking on source tensors that might be updated asynchronously.
diff --git a/tensorflow/core/kernels/conv_grad_ops.cc b/tensorflow/core/kernels/conv_grad_ops.cc
index e0ddd1c0fd..819d16444f 100644
--- a/tensorflow/core/kernels/conv_grad_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_ops.cc
@@ -1398,7 +1398,7 @@ class Conv2DSlowBackpropFilterOp : public OpKernel {
// [filter_rows, filter_cols, in_depth, out_depth];
// And we need to reverse the filter backprops
// So we need to allocated (sigh) yet another piece of memory to hold the
- // ouptut.
+ // output.
TensorShape filter_shuffle_shape(
{out_depth, filter_rows, filter_cols, in_depth});
Tensor filter_shuffle;
diff --git a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
index b5776eb533..dbf096ac45 100644
--- a/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
+++ b/tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
@@ -246,7 +246,7 @@ __global__ void SwapDimension1And2InTensor3UsingTiles(const T* input,
}
}
-// A Cuda custom kernel that converst input to output, given proper padding on
+// A Cuda custom kernel that converts input to output, given proper padding on
// the left and the top. The padded value is zero.
template <typename T>
__global__ void PadInputCustomKernelNHWC(int nthreads, const T* input,
diff --git a/tensorflow/core/kernels/diag_op.cc b/tensorflow/core/kernels/diag_op.cc
index 319fb99d23..df738f00cb 100644
--- a/tensorflow/core/kernels/diag_op.cc
+++ b/tensorflow/core/kernels/diag_op.cc
@@ -45,6 +45,28 @@ class DiagonalGenerator {
private:
Tensor diagonal_;
};
+
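+// Functor that reads a "diagonal" entry of a rank 2 * NumDims input tensor:
+// for output coordinates (i_0, ..., i_{NumDims-1}) it returns the input value
+// at (i_0, ..., i_{NumDims-1}, i_0, ..., i_{NumDims-1}).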
+template <typename T, size_t NumDims>
+class DiagonalExtractor {
+ public:
+ explicit DiagonalExtractor(const Tensor& tensor) : tensor_(tensor) {
+ CHECK_EQ(tensor.dims(), 2 * NumDims);
+ }
+ T operator()(const Eigen::array<Eigen::Index, NumDims>& coordinates) const {
+ Eigen::array<Eigen::Index, 2 * NumDims> index;
+    for (size_t j = 0; j < NumDims; ++j) {
+      index[j] = coordinates[j];
+    }
+    for (size_t j = NumDims; j < 2 * NumDims; ++j) {
+      index[j] = index[j - NumDims];
+ }
+ return tensor_.tensor<T, 2 * NumDims>()(index);
+ }
+
+ private:
+ Tensor tensor_;
+};
+
} // namespace
// Generate the diagonal tensor with the diagonal set to the input tensor.
@@ -58,12 +80,9 @@ class DiagOp : public OpKernel {
void Compute(OpKernelContext* context) override {
const Tensor& diagonal = context->input(0);
const int num_dims = diagonal.dims();
- OP_REQUIRES(context, 1 <= num_dims,
- errors::InvalidArgument(
- "The rank of the diagonal should be between 1 and 3."));
- OP_REQUIRES(context, 3 >= num_dims,
- errors::InvalidArgument(
- "The rank of the diagonal should be between 1 and 3."));
+ OP_REQUIRES(context, 1 <= num_dims && num_dims <= 3,
+ errors::InvalidArgument("Expected 1 <= dims <= 3, got shape ",
+ diagonal.shape().DebugString()));
TensorShape out_shape;
for (int i = 0; i < num_dims; ++i) {
out_shape.AddDim(diagonal.dim_size(i));
@@ -105,4 +124,71 @@ REGISTER_DIAGOP(int32);
REGISTER_DIAGOP(int64);
#undef REGISTER_DIAGOP
+
+
+// Extract the diagonal from the input tensor. This op only allows rank 2, 4,
+// or 6 input tensors, so the output tensor is rank 1, 2, or 3.
+template <typename T>
+class DiagPartOp : public OpKernel {
+ public:
+ explicit DiagPartOp(OpKernelConstruction* context) : OpKernel(context) {}
+
+ void Compute(OpKernelContext* context) override {
+ const Tensor& tensor = context->input(0);
+ const int num_dims = tensor.dims();
+ const int out_dims = num_dims / 2;
+ OP_REQUIRES(context, 2 == num_dims || 4 == num_dims || 6 == num_dims,
+                errors::InvalidArgument("The rank of the tensor should be "
+                                        "2, 4, or 6, got shape ",
+                                        tensor.shape().DebugString()));
+    for (int i = 0; i < out_dims; ++i) {
+      OP_REQUIRES(context, tensor.dim_size(i) == tensor.dim_size(i + out_dims),
+                  errors::InvalidArgument(
+                      "Invalid shape ", tensor.shape().DebugString(),
+                      ": dimensions ", i, " and ", i + out_dims,
+                      " do not match."));
+ }
+
+ TensorShape out_shape;
+ for (int i = 0; i < out_dims; ++i) {
+ out_shape.AddDim(tensor.dim_size(i));
+ }
+
+ Tensor* output = nullptr;
+ OP_REQUIRES_OK(context,
+ context->allocate_output(0, out_shape, &output));
+
+ switch (num_dims) {
+ case 2:
+ output->tensor<T, 1>() = output->tensor<T, 1>().generate(
+ DiagonalExtractor<T, 1>(tensor));
+ break;
+ case 4:
+ output->tensor<T, 2>() = output->tensor<T, 2>().generate(
+ DiagonalExtractor<T, 2>(tensor));
+ break;
+ case 6:
+ output->tensor<T, 3>() = output->tensor<T, 3>().generate(
+ DiagonalExtractor<T, 3>(tensor));
+ break;
+ default:
+ context->SetStatus(errors::Unimplemented(
+ "Diagonal of rank ", num_dims, " tensor is not supported yet."));
+ return;
+ }
+ }
+};
+
+#define REGISTER_DIAGPARTOP(T) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("DiagPart").Device(DEVICE_CPU).TypeConstraint<T>("T"), DiagPartOp<T>)
+
+REGISTER_DIAGPARTOP(double);
+REGISTER_DIAGPARTOP(float);
+REGISTER_DIAGPARTOP(int32);
+REGISTER_DIAGPARTOP(int64);
+
+#undef REGISTER_DIAGPARTOP
+
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/matrix_solve_ls_op.cc b/tensorflow/core/kernels/matrix_solve_ls_op.cc
index 77060184fe..1b17b2db68 100644
--- a/tensorflow/core/kernels/matrix_solve_ls_op.cc
+++ b/tensorflow/core/kernels/matrix_solve_ls_op.cc
@@ -94,7 +94,7 @@ class MatrixSolveLsOp
}
if (fast_) {
// The fast branch assumes that matrix is not rank deficient and
- // not too ill-conditioned. Specifically, the reciprobal condition number
+ // not too ill-conditioned. Specifically, the reciprocal condition number
// should be greater than the square root of the machine precision, i.e.
// 1 / cond(matrix) > sqrt(std::numeric_limits<Scalar>::epsilon()).
// This branch solves over- or underdetermined least-squares problems
diff --git a/tensorflow/core/kernels/reduction_ops_gpu.cu.cc b/tensorflow/core/kernels/reduction_ops_gpu.cu.cc
index a740bf1df3..d5d9f44da2 100644
--- a/tensorflow/core/kernels/reduction_ops_gpu.cu.cc
+++ b/tensorflow/core/kernels/reduction_ops_gpu.cu.cc
@@ -84,6 +84,7 @@ struct ReduceFunctor<GPUDevice, Eigen::internal::MeanReducer<T> > {
DEFINE_FOR_TYPE_AND_R(T, Eigen::internal::ProdReducer<T>)
DEFINE_FOR_ALL_REDUCERS(float);
+DEFINE_FOR_ALL_REDUCERS(double);
#undef DEFINE_FOR_ALL_REDUCERS
DEFINE_FOR_TYPE_AND_R(complex64, Eigen::internal::SumReducer<complex64>);
diff --git a/tensorflow/core/kernels/reduction_ops_max.cc b/tensorflow/core/kernels/reduction_ops_max.cc
index 7569932125..2c694487ee 100644
--- a/tensorflow/core/kernels/reduction_ops_max.cc
+++ b/tensorflow/core/kernels/reduction_ops_max.cc
@@ -34,6 +34,7 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
.HostMemory("reduction_indices"), \
ReductionOp<GPUDevice, type, Eigen::internal::MaxReducer<type>>);
REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
#undef REGISTER_GPU_KERNELS
#endif
diff --git a/tensorflow/core/kernels/reduction_ops_min.cc b/tensorflow/core/kernels/reduction_ops_min.cc
index 2205a91769..be757282f8 100644
--- a/tensorflow/core/kernels/reduction_ops_min.cc
+++ b/tensorflow/core/kernels/reduction_ops_min.cc
@@ -34,6 +34,7 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
.HostMemory("reduction_indices"), \
ReductionOp<GPUDevice, type, Eigen::internal::MinReducer<type>>);
REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
#undef REGISTER_GPU_KERNELS
#endif
diff --git a/tensorflow/core/kernels/reduction_ops_prod.cc b/tensorflow/core/kernels/reduction_ops_prod.cc
index 2ebfef676f..d1396f4926 100644
--- a/tensorflow/core/kernels/reduction_ops_prod.cc
+++ b/tensorflow/core/kernels/reduction_ops_prod.cc
@@ -34,6 +34,7 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_CPU_KERNELS);
.HostMemory("reduction_indices"), \
ReductionOp<GPUDevice, type, Eigen::internal::ProdReducer<type>>);
REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
#undef REGISTER_GPU_KERNELS
#endif
diff --git a/tensorflow/core/kernels/reduction_ops_sum.cc b/tensorflow/core/kernels/reduction_ops_sum.cc
index 0cc287f102..6f8c3e7c6e 100644
--- a/tensorflow/core/kernels/reduction_ops_sum.cc
+++ b/tensorflow/core/kernels/reduction_ops_sum.cc
@@ -41,6 +41,7 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("reduction_indices"), \
ReductionOp<GPUDevice, type, Eigen::internal::SumReducer<type>>);
REGISTER_GPU_KERNELS(float);
+REGISTER_GPU_KERNELS(double);
#undef REGISTER_GPU_KERNELS
REGISTER_KERNEL_BUILDER(
diff --git a/tensorflow/core/kernels/resize_nearest_neighbor_op.cc b/tensorflow/core/kernels/resize_nearest_neighbor_op.cc
index c3ed9914c9..059ef83bb0 100644
--- a/tensorflow/core/kernels/resize_nearest_neighbor_op.cc
+++ b/tensorflow/core/kernels/resize_nearest_neighbor_op.cc
@@ -26,6 +26,10 @@ limitations under the License.
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/platform/logging.h"
+#if GOOGLE_CUDA
+#include "tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h"
+#endif // GOOGLE_CUDA
+
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
@@ -58,10 +62,10 @@ class ResizeNearestNeighborOp : public OpKernel {
// Initialize shape to the batch size of the input, then add
// the rest of the dimensions
Tensor* output = nullptr;
- OP_REQUIRES_OK(context, context->allocate_output(
- 0, TensorShape({input.dim_size(0), sizes(0),
- sizes(1), input.dim_size(3)}),
- &output));
+    OP_REQUIRES_OK(context,
+                   context->allocate_output(
+                       0, TensorShape({input.dim_size(0), sizes(0), sizes(1),
+                                       input.dim_size(3)}),
+                       &output));
const int64 batch_size = input.dim_size(0);
const int64 in_height = input.dim_size(1);
@@ -132,10 +136,10 @@ class ResizeNearestNeighborOpGrad : public OpKernel {
// Initialize shape to the batch size of the input, then add
// the rest of the dimensions
Tensor* output = nullptr;
- OP_REQUIRES_OK(context, context->allocate_output(
- 0, TensorShape({input.dim_size(0), sizes(0),
- sizes(1), input.dim_size(3)}),
- &output));
+    OP_REQUIRES_OK(context,
+                   context->allocate_output(
+                       0, TensorShape({input.dim_size(0), sizes(0), sizes(1),
+                                       input.dim_size(3)}),
+                       &output));
const int64 batch_size = input.dim_size(0);
const int64 in_height = input.dim_size(1);
@@ -204,4 +208,83 @@ TF_CALL_REAL_NUMBER_TYPES(REGISTER_KERNEL);
#undef REGISTER_KERNEL
+#if GOOGLE_CUDA
+
+template <typename T>
+class ResizeNearestNeighborGPUOp : public OpKernel {
+ public:
+ explicit ResizeNearestNeighborGPUOp(OpKernelConstruction* context)
+ : OpKernel(context) {
+ OP_REQUIRES_OK(context, context->GetAttr("align_corners", &align_corners_));
+ }
+
+ void Compute(OpKernelContext* context) override {
+ const Tensor& input = context->input(0);
+ OP_REQUIRES(context, input.dims() == 4,
+ errors::InvalidArgument("input must be 4-dimensional",
+ input.shape().DebugString()));
+ const Tensor& shape_t = context->input(1);
+ OP_REQUIRES(context, shape_t.dims() == 1,
+ errors::InvalidArgument("shape_t must be 1-dimensional",
+ shape_t.shape().DebugString()));
+ OP_REQUIRES(context, shape_t.NumElements() == 2,
+ errors::InvalidArgument("shape_t must have two elements",
+ shape_t.shape().DebugString()));
+
+ auto sizes = shape_t.vec<int32>();
+ OP_REQUIRES(context, sizes(0) > 0 && sizes(1) > 0,
+ errors::InvalidArgument("shape_t's elements must be positive"));
+
+ // Initialize shape to the batch size of the input, then add
+ // the rest of the dimensions
+ Tensor* output = nullptr;
+    OP_REQUIRES_OK(context,
+                   context->allocate_output(
+                       0, TensorShape({input.dim_size(0), sizes(0), sizes(1),
+                                       input.dim_size(3)}),
+                       &output));
+
+ const int64 batch_size = input.dim_size(0);
+ const int64 in_height = input.dim_size(1);
+ const int64 in_width = input.dim_size(2);
+ const int64 channels = input.dim_size(3);
+ const int64 out_height = output->dim_size(1);
+ const int64 out_width = output->dim_size(2);
+
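+    // With align_corners, the corner pixels of the input and output are
+    // aligned (scale is (in - 1) / (out - 1)); otherwise the plain in / out
+    // ratio is used.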
+ const float height_scale =
+ (align_corners_ && out_height > 1)
+ ? (in_height - 1) / static_cast<float>(out_height - 1)
+ : in_height / static_cast<float>(out_height);
+ const float width_scale =
+ (align_corners_ && out_width > 1)
+ ? (in_width - 1) / static_cast<float>(out_width - 1)
+ : in_width / static_cast<float>(out_width);
+
+ bool status = ResizeNearestNeighbor<T>(
+ input.flat<T>().data(), batch_size, in_height,
+ in_width, channels, out_height, out_width,
+ height_scale, width_scale, output->flat<T>().data(),
+ context->eigen_gpu_device());
+
+ if (!status) {
+ context->SetStatus(
+ errors::Internal("Failed launching ResizeNearestNeighbor"));
+ }
+ }
+ private:
+ bool align_corners_;
+};
+
+#define REGISTER_KERNEL(T) \
+ REGISTER_KERNEL_BUILDER(Name("ResizeNearestNeighbor") \
+ .Device(DEVICE_GPU) \
+ .TypeConstraint<T>("T") \
+ .HostMemory("size"), \
+ ResizeNearestNeighborGPUOp<T>);
+
+TF_CALL_GPU_NUMBER_TYPES(REGISTER_KERNEL);
+
+#undef REGISTER_KERNEL
+
+#endif // GOOGLE_CUDA
+
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/resize_nearest_neighbor_op_benchmark_test.cc b/tensorflow/core/kernels/resize_nearest_neighbor_op_benchmark_test.cc
new file mode 100644
index 0000000000..7784c69674
--- /dev/null
+++ b/tensorflow/core/kernels/resize_nearest_neighbor_op_benchmark_test.cc
@@ -0,0 +1,52 @@
+/* Copyright 2015 Google Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/graph/node_builder.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
+
+namespace tensorflow {
+
+static Graph* BM_ResizeNearestNeighbor(int batches, int width, int height) {
+ Graph* g = new Graph(OpRegistry::Global());
+ Tensor in(DT_FLOAT, TensorShape({batches, width, height, 3}));
+ in.flat<float>().setRandom();
+
+ Tensor out_size(DT_INT32, TensorShape({2}));
+ auto out_size_flat = out_size.flat<int32>();
+ out_size_flat(0) = width * 2;
+ out_size_flat(1) = height * 2;
+
+ Node* ret;
+ NodeBuilder(g->NewName("n"), "ResizeNearestNeighbor")
+ .Input(test::graph::Constant(g, in))
+ .Input(test::graph::Constant(g, out_size))
+ .Finalize(g, &ret);
+ return g;
+}
+
+#define BM_ResizeNearestNeighborDev(DEVICE, B, W, H) \
+ static void BM_ResizeNearestNeighbor_##DEVICE##_##B##_##W##_##H(int iters) { \
+ testing::ItemsProcessed(iters* B* W* H * 3); \
+ test::Benchmark(#DEVICE, BM_ResizeNearestNeighbor(B, W, H)).Run(iters); \
+ } \
+ BENCHMARK(BM_ResizeNearestNeighbor_##DEVICE##_##B##_##W##_##H)
+
+BM_ResizeNearestNeighborDev(cpu, 1, 499, 499);
+BM_ResizeNearestNeighborDev(gpu, 1, 499, 499);
+
+} // namespace tensorflow
diff --git a/tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.cu.cc b/tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.cu.cc
new file mode 100644
index 0000000000..bee24a5b02
--- /dev/null
+++ b/tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.cu.cc
@@ -0,0 +1,86 @@
+/* Copyright 2015 Google Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA
+
+#define EIGEN_USE_GPU
+
+#include <stdio.h>
+
+#include "tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h"
+
+#include "tensorflow/core/framework/register_types.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+
+namespace tensorflow {
+namespace {
+
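+// Each CUDA thread writes one output element: the flat output index is
+// decomposed in NHWC order into (n, out_y, out_x, c), and the source pixel is
+// chosen by scaling the output coordinates and flooring, clamped to the input
+// bounds.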
+template <typename T>
+__global__ void ResizeNearestNeighborNHWC(const int nthreads, const T* bottom_data,
+ const int in_height, const int in_width,
+ const int channels, const int out_height,
+ const int out_width, const float height_scale,
+ const float width_scale, T* top_data) {
+ CUDA_1D_KERNEL_LOOP(index, nthreads) {
+ int n = index;
+ int c = n % channels;
+ n /= channels;
+ int out_x = n % out_width;
+ n /= out_width;
+ int out_y = n % out_height;
+ n /= out_height;
+
+ const T* bottom_data_n = bottom_data + n * channels * in_height * in_width;
+ const int in_x = min(static_cast<int>(floorf(out_x * width_scale)), in_width - 1);
+ const int in_y = min(static_cast<int>(floorf(out_y * height_scale)), in_height - 1);
+ const int idx = (in_y * in_width + in_x) * channels + c;
+ top_data[index] = ldg(bottom_data_n + idx);
+ }
+}
+
+} // namespace
+
+template <typename T>
+bool ResizeNearestNeighbor(const T* bottom_data, const int batch,
+ const int in_height, const int in_width,
+ const int channels, const int out_height,
+ const int out_width, const float height_scale,
+ const float width_scale, T* top_data,
+ const Eigen::GpuDevice& d) {
+ const int output_size = batch * channels * out_height * out_width;
+ CudaLaunchConfig config = GetCudaLaunchConfig(output_size, d);
+
+ ResizeNearestNeighborNHWC<T>
+ <<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
+ output_size, bottom_data, in_height, in_width, channels, out_height,
+ out_width, height_scale, width_scale, top_data);
+ return d.ok();
+}
+
+#define DECLARE_GPU_SPEC(T) \
+ template bool ResizeNearestNeighbor(const T* bottom_data, const int batch, \
+ const int in_height, const int in_width, \
+ const int channels, const int out_height, \
+ const int out_width, const float height_scale, \
+ const float width_scale, T* top_data, \
+ const Eigen::GpuDevice& d);
+
+TF_CALL_GPU_NUMBER_TYPES(DECLARE_GPU_SPEC);
+
+#undef DECLARE_GPU_SPEC
+} // end namespace tensorflow
+
+#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h b/tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h
new file mode 100644
index 0000000000..65b4b331d9
--- /dev/null
+++ b/tensorflow/core/kernels/resize_nearest_neighbor_op_gpu.h
@@ -0,0 +1,37 @@
+/* Copyright 2015 Google Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if !GOOGLE_CUDA
+#error This file must only be included when building with Cuda support
+#endif
+
+#ifndef TENSORFLOW_CORE_KERNELS_RESIZE_NEAREST_NEIGHBOR_OP_GPU_H_
+#define TENSORFLOW_CORE_KERNELS_RESIZE_NEAREST_NEIGHBOR_OP_GPU_H_
+
+#include "third_party/eigen3/unsupported/Eigen/CXX11/NeuralNetworks"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+template <typename T>
+bool ResizeNearestNeighbor(const T* bottom_data, const int batch, const int in_height,
+ const int in_width, const int channels, const int out_height,
+ const int out_width, const float height_scale, const float width_scale,
+ T* top_data, const Eigen::GpuDevice& d);
+
+} // namespace tensorflow
+
+#endif // TENSORFLOW_CORE_KERNELS_RESIZE_NEAREST_NEIGHBOR_OP_GPU_H_
diff --git a/tensorflow/core/kernels/sparse_matmul_op.cc b/tensorflow/core/kernels/sparse_matmul_op.cc
index 42cadf848c..1e90616e36 100644
--- a/tensorflow/core/kernels/sparse_matmul_op.cc
+++ b/tensorflow/core/kernels/sparse_matmul_op.cc
@@ -524,7 +524,7 @@ class SparseMatMulOp : public OpKernel {
private:
// Perform matrix multiplication of "left" and "right", and store the result
- // in *"ouptut".
+ // in *"output".
static inline void SparseMatMul(
const ConstMatrixMap& left, const ConstMatrixMap& right,
bool transpose_left, const DeviceBase::CpuWorkerThreads* thread_pool,
@@ -858,7 +858,7 @@ inline void SparseMatMulOp::SparseMatMul(
const int right_dim0 = right.dimension(0);
const int right_dim1 = right.dimension(1);
// Allocate buffer for storing slices of right matrix.
- // Note buffer needs enough space to hold atmost a KR * NR matrix since that
+ // Note buffer needs enough space to hold at most a KR * NR matrix since that
// is the block size per iteration.
const int buffer_num_rows =
std::min(KR, right_dim0) * (std::min(NR, right_dim1) + N - 1) / N;
diff --git a/tensorflow/core/kernels/tensor_array_ops.cc b/tensorflow/core/kernels/tensor_array_ops.cc
index 044a93f552..70ef00292a 100644
--- a/tensorflow/core/kernels/tensor_array_ops.cc
+++ b/tensorflow/core/kernels/tensor_array_ops.cc
@@ -577,7 +577,7 @@ class TensorArrayConcatOp : public OpKernel {
ConstMatrixVector input_tensors_flat;
input_tensors_flat.reserve(values.size());
- for (int i = 0; i < values.size(); ++i) {
+ for (size_t i = 0; i < values.size(); ++i) {
const Tensor* value_t = value_tensors[i];
if (value_t->NumElements() > 0) {
input_tensors_flat.emplace_back(new ConstMatrix(
diff --git a/tensorflow/core/kernels/transpose_functor.h b/tensorflow/core/kernels/transpose_functor.h
index b3aa98d3bf..5e1d64a5c9 100644
--- a/tensorflow/core/kernels/transpose_functor.h
+++ b/tensorflow/core/kernels/transpose_functor.h
@@ -47,7 +47,7 @@ void ComputeStride(const TensorShape& shape, Index* strides) {
}
}
-// Device-specific naive implementation for tranpose.
+// Device-specific naive implementation for transpose.
template <typename Device, typename T>
void TransposeSimple(const Device& d, const Tensor& in,
const gtl::ArraySlice<int32> perm, Tensor* out);
diff --git a/tensorflow/core/ops/array_ops.cc b/tensorflow/core/ops/array_ops.cc
index 8afe80e4f1..589c40ca90 100644
--- a/tensorflow/core/ops/array_ops.cc
+++ b/tensorflow/core/ops/array_ops.cc
@@ -173,6 +173,38 @@ diagonal: Rank k tensor where k is at most 3.
)doc");
// --------------------------------------------------------------------------
+REGISTER_OP("DiagPart")
+ .Input("input: T")
+ .Output("diagonal: T")
+ .Attr("T: {float, double, int32, int64}")
+ .Doc(R"doc(
+Returns the diagonal part of the tensor.
+
+This operation returns a tensor with the `diagonal` part
+of the `input`. The `diagonal` part is computed as follows:
+
+Assume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a
+tensor of rank `k` with dimensions `[D1,..., Dk]` where:
+
+`diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.
+
+For example:
+
+```prettyprint
+# 'input' is [[1, 0, 0, 0]
+ [0, 2, 0, 0]
+ [0, 0, 3, 0]
+ [0, 0, 0, 4]]
+
+tf.diag_part(input) ==> [1, 2, 3, 4]
+```
+
+input: Rank k tensor where k is 2, 4, or 6.
+diagonal: The extracted diagonal.
+
+)doc");
+
+// --------------------------------------------------------------------------
REGISTER_OP("Reverse")
.Input("tensor: T")
.Input("dims: bool")
diff --git a/tensorflow/core/ops/compat/ops_history.v0.pbtxt b/tensorflow/core/ops/compat/ops_history.v0.pbtxt
index 5784207bbd..c6eda8fb4a 100644
--- a/tensorflow/core/ops/compat/ops_history.v0.pbtxt
+++ b/tensorflow/core/ops/compat/ops_history.v0.pbtxt
@@ -3483,6 +3483,29 @@ op {
}
}
op {
+ name: "DiagPart"
+ input_arg {
+ name: "input"
+ type_attr: "T"
+ }
+ output_arg {
+ name: "diagonal"
+ type_attr: "T"
+ }
+ attr {
+ name: "T"
+ type: "type"
+ allowed_values {
+ list {
+ type: DT_FLOAT
+ type: DT_DOUBLE
+ type: DT_INT32
+ type: DT_INT64
+ }
+ }
+ }
+}
+op {
name: "Digamma"
input_arg {
name: "x"
diff --git a/tensorflow/core/ops/ops.pbtxt b/tensorflow/core/ops/ops.pbtxt
index 8bce27c368..67dffcbc16 100644
--- a/tensorflow/core/ops/ops.pbtxt
+++ b/tensorflow/core/ops/ops.pbtxt
@@ -2859,6 +2859,33 @@ op {
description: "Given a `diagonal`, this operation returns a tensor with the `diagonal` and\neverything else padded with zeros. The diagonal is computed as follows:\n\nAssume `diagonal` has dimensions [D1,..., Dk], then the output is a tensor of\nrank 2k with dimensions [D1,..., Dk, D1,..., Dk] where:\n\n`output[i1,..., ik, i1,..., ik] = diagonal[i1, ..., ik]` and 0 everywhere else.\n\nFor example:\n\n```prettyprint\n# \'diagonal\' is [1, 2, 3, 4]\ntf.diag(diagonal) ==> [[1, 0, 0, 0]\n [0, 2, 0, 0]\n [0, 0, 3, 0]\n [0, 0, 0, 4]]\n```"
}
op {
+ name: "DiagPart"
+ input_arg {
+ name: "input"
+ description: "Rank k tensor where k is 2, 4, or 6."
+ type_attr: "T"
+ }
+ output_arg {
+ name: "diagonal"
+ description: "The extracted diagonal."
+ type_attr: "T"
+ }
+ attr {
+ name: "T"
+ type: "type"
+ allowed_values {
+ list {
+ type: DT_FLOAT
+ type: DT_DOUBLE
+ type: DT_INT32
+ type: DT_INT64
+ }
+ }
+ }
+ summary: "Returns the diagonal part of the tensor."
+ description: "This operation returns a tensor with the `diagonal` part\nof the `input`. The `diagonal` part is computed as follows:\n\nAssume `input` has dimensions `[D1,..., Dk, D1,..., Dk]`, then the output is a\ntensor of rank `k` with dimensions `[D1,..., Dk]` where:\n\n`diagonal[i1,..., ik] = input[i1, ..., ik, i1,..., ik]`.\n\nFor example:\n\n```prettyprint\n# \'input\' is [[1, 0, 0, 0]\n [0, 2, 0, 0]\n [0, 0, 3, 0]\n [0, 0, 0, 4]]\n\ntf.diag_part(input) ==> [1, 2, 3, 4]\n```"
+}
+op {
name: "Digamma"
input_arg {
name: "x"
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index cd8db972ae..10db90c206 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -20,7 +20,7 @@ limitations under the License.
#define TF_MAJOR_VERSION 0
#define TF_MINOR_VERSION 7
-#define TF_PATCH_VERSION 0
+#define TF_PATCH_VERSION 1
// TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
// "-beta", "-rc", "-rc.1")
diff --git a/tensorflow/core/util/test_log.proto b/tensorflow/core/util/test_log.proto
index 771a1087c1..82cf0433e6 100644
--- a/tensorflow/core/util/test_log.proto
+++ b/tensorflow/core/util/test_log.proto
@@ -63,34 +63,50 @@ message CommitId {
};
message CPUInfo {
+ int64 num_cores = 1;
+
+ int64 num_cores_allowed = 2;
+
// How fast are these cpus?
- double mhz_per_cpu = 1;
+ double mhz_per_cpu = 3;
// Additional cpu information. For example,
// Intel Ivybridge with HyperThreading (24 cores) dL1:32KB dL2:256KB dL3:30MB
- string cpu_info = 2;
+ string cpu_info = 4;
// What kind of cpu scaling is enabled on the host.
// Examples include "performance", "ondemand", "conservative", "mixed".
- string cpu_governor = 3;
+ string cpu_governor = 5;
// Cache sizes (in bytes), e.g. "L2": 262144 (for 256KB)
- map<string, int64> cache_size = 4;
+ map<string, int64> cache_size = 6;
};
+message MemoryInfo {
+ int64 total = 1; // Total virtual memory in bytes
+ int64 available = 2; // Immediately available memory in bytes
+}
+
message GPUInfo {
string model = 1; // e.g. "Tesla K40c"
string uuid = 2; // Final entry in output of "nvidia-smi -L"
+ string bus_id = 3; // e.g. "0000:04:00.0"
};
message PlatformInfo {
string bits = 1; // e.g. '64bit'
string linkage = 2; // e.g. 'ELF'
string machine = 3; // e.g. 'i386'
- string processor = 4; // e.g. 'amdk6' (the real processor name)
- string release = 5; // e.g. '3.13.0-76-generic'
- string system = 6; // e.g. 'Linux'
- string version = 7; // e.g. '#120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016'
+ string release = 4; // e.g. '3.13.0-76-generic'
+ string system = 5; // e.g. 'Linux'
+ string version = 6; // e.g. '#120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016'
+};
+
+message AvailableDeviceInfo { // Matches DeviceAttributes
+ string name = 1; // Device name.
+ string type = 2; // Device type, e.g. 'CPU' or 'GPU'.
+ int64 memory_limit = 3; // Memory capacity in bytes.
+ string physical_description = 4; // The physical description of this device.
};
message MachineConfiguration {
@@ -105,6 +121,11 @@ message MachineConfiguration {
// Other devices that are attached and relevant (e.g. GPUInfo).
repeated google.protobuf.Any device_info = 4;
+
+ // Devices accessible to the test (e.g. as given by list_local_devices).
+ repeated AvailableDeviceInfo available_device_info = 5;
+
+ MemoryInfo memory_info = 6;
};
// Run-specific items such as arguments to the test / benchmark.
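
For reference, this is roughly how the new fields would be populated from Python. The `test_log_pb2` import path is an assumption about the module protoc generates for this file, not something shown in the patch.

```python
# Sketch only: module name assumed to be the protoc-generated test_log_pb2.
from tensorflow.core.util import test_log_pb2

config = test_log_pb2.MachineConfiguration()
config.memory_info.total = 16 * 1024 ** 3      # total virtual memory, bytes
config.memory_info.available = 8 * 1024 ** 3   # immediately available, bytes

dev = config.available_device_info.add()       # repeated AvailableDeviceInfo
dev.name = "/cpu:0"
dev.type = "CPU"
dev.memory_limit = 256 * 1024 ** 2
dev.physical_description = "example CPU device"

print(config)
```

Note that renumbering the existing `CPUInfo` tags (e.g. `mhz_per_cpu` moving from 1 to 3) means previously serialized logs will not decode those fields under the new numbers, which is presumably acceptable for this benchmark-log format.
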
diff --git a/tensorflow/examples/how_tos/reading_data/convert_to_records.py b/tensorflow/examples/how_tos/reading_data/convert_to_records.py
index c8819e05ce..da42b5cc38 100644
--- a/tensorflow/examples/how_tos/reading_data/convert_to_records.py
+++ b/tensorflow/examples/how_tos/reading_data/convert_to_records.py
@@ -68,6 +68,7 @@ def convert_to(images, labels, name):
'label': _int64_feature(int(labels[index])),
'image_raw': _bytes_feature(image_raw)}))
writer.write(example.SerializeToString())
+ writer.close()
def main(argv):
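
The one-line `writer.close()` fix matters because the record writer buffers output; without the close, the tail of the file can be lost. A hedged sketch of the same pattern with the close guaranteed even on error, assuming `tf.python_io.TFRecordWriter` as used in this script:

```python
import tensorflow as tf

def write_examples(examples, filename):
    # Write serialized tf.train.Example protos, always closing the writer.
    writer = tf.python_io.TFRecordWriter(filename)
    try:
        for example in examples:
            writer.write(example.SerializeToString())
    finally:
        writer.close()  # flush buffered records even if a write() raised
```
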
diff --git a/tensorflow/examples/image_retraining/retrain.py b/tensorflow/examples/image_retraining/retrain.py
index 14aa35e530..93dffaad6b 100644
--- a/tensorflow/examples/image_retraining/retrain.py
+++ b/tensorflow/examples/image_retraining/retrain.py
@@ -219,8 +219,8 @@ def create_image_lists(image_dir, testing_percentage, validation_percentage):
# To do that, we need a stable way of deciding based on just the file name
# itself, so we do a hash of that and then use that to generate a
# probability value that we use to assign it.
- percentage_hash = (int(
- hashlib.sha1(hash_name).hexdigest(), 16) % (65536)) * (100 / 65535.0)
+ hash_name_hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
+ percentage_hash = (int(hash_name_hashed, 16) % (65536)) * (100 / 65535.0)
if percentage_hash < validation_percentage:
validation_images.append(base_name)
elif percentage_hash < (testing_percentage + validation_percentage):
@@ -295,8 +295,9 @@ def create_inception_graph():
Graph holding the trained Inception network.
"""
with tf.Session() as sess:
- with gfile.FastGFile(
- os.path.join(FLAGS.model_dir, 'classify_image_graph_def.pb'), 'r') as f:
+ model_filename = os.path.join(
+ FLAGS.model_dir, 'classify_image_graph_def.pb')
+ with gfile.FastGFile(model_filename, 'rb') as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
_ = tf.import_graph_def(graph_def, name='')
@@ -395,7 +396,7 @@ def get_or_create_bottleneck(sess, image_lists, label_name, index, image_dir,
category)
if not gfile.Exists(image_path):
tf.logging.fatal('File does not exist %s', image_path)
- image_data = gfile.FastGFile(image_path, 'r').read()
+ image_data = gfile.FastGFile(image_path, 'rb').read()
bottleneck_values = run_bottleneck_on_image(sess, image_data,
JPEG_DATA_TENSOR_NAME)
bottleneck_string = ','.join(str(x) for x in bottleneck_values)
@@ -430,7 +431,7 @@ def cache_bottlenecks(sess, image_lists, image_dir, bottleneck_dir):
"""
how_many_bottlenecks = 0
ensure_dir_exists(bottleneck_dir)
- for label_name, label_lists in image_lists.iteritems():
+ for label_name, label_lists in image_lists.items():
for category in ['training', 'testing', 'validation']:
category_list = label_lists[category]
for index, unused_base_name in enumerate(category_list):
@@ -467,7 +468,7 @@ def get_random_cached_bottlenecks(sess, image_lists, how_many, category,
ground_truthes = []
for unused_i in range(how_many):
label_index = random.randrange(class_count)
- label_name = image_lists.keys()[label_index]
+ label_name = list(image_lists.keys())[label_index]
image_index = random.randrange(65536)
bottleneck = get_or_create_bottleneck(sess, image_lists, label_name,
image_index, image_dir, category,
@@ -818,7 +819,7 @@ def main(_):
# Write out the trained graph and labels with the weights stored as constants.
output_graph_def = graph_util.convert_variables_to_constants(
sess, graph.as_graph_def(), [FLAGS.final_tensor_name])
- with gfile.FastGFile(FLAGS.output_graph, 'w') as f:
+ with gfile.FastGFile(FLAGS.output_graph, 'wb') as f:
f.write(output_graph_def.SerializeToString())
with gfile.FastGFile(FLAGS.output_labels, 'w') as f:
f.write('\n'.join(image_lists.keys()) + '\n')
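
The retrain.py changes above are Python 3 compatibility fixes: `hashlib` only accepts bytes, graph and image files must be opened in binary mode, and `iteritems()` / indexable `keys()` are gone. A small standalone illustration of the hashing and dict changes (the file name here is made up):

```python
import hashlib

# Python 3: hashlib only accepts bytes, so file names must be encoded first.
hash_name = 'daisy/img_001.jpg'
hash_name_hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
percentage_hash = (int(hash_name_hashed, 16) % 65536) * (100 / 65535.0)
print(round(percentage_hash, 2))  # stable value in [0, 100] for this file name

# Python 3: dict.iteritems() is gone and dict.keys() is a view, not a list.
image_lists = {'daisy': [], 'roses': []}
for label_name, label_lists in image_lists.items():
    pass
first_label = list(image_lists.keys())[0]
```
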
diff --git a/tensorflow/examples/tutorials/mnist/input_data.py b/tensorflow/examples/tutorials/mnist/input_data.py
index 20affa9dae..07ed2c4f1c 100644
--- a/tensorflow/examples/tutorials/mnist/input_data.py
+++ b/tensorflow/examples/tutorials/mnist/input_data.py
@@ -54,7 +54,7 @@ def _read32(bytestream):
def extract_images(filename):
"""Extract the images into a 4D uint8 numpy array [index, y, x, depth]."""
print('Extracting', filename)
- with tf.gfile.Open(filename) as f, gzip.GzipFile(fileobj=f) as bytestream:
+ with tf.gfile.Open(filename, 'rb') as f, gzip.GzipFile(fileobj=f) as bytestream:
magic = _read32(bytestream)
if magic != 2051:
raise ValueError(
@@ -81,7 +81,7 @@ def dense_to_one_hot(labels_dense, num_classes):
def extract_labels(filename, one_hot=False, num_classes=10):
"""Extract the labels into a 1D uint8 numpy array [index]."""
print('Extracting', filename)
- with tf.gfile.Open(filename) as f, gzip.GzipFile(fileobj=f) as bytestream:
+ with tf.gfile.Open(filename, 'rb') as f, gzip.GzipFile(fileobj=f) as bytestream:
magic = _read32(bytestream)
if magic != 2049:
raise ValueError(
diff --git a/tensorflow/examples/tutorials/mnist/mnist.py b/tensorflow/examples/tutorials/mnist/mnist.py
index 647b226afa..4720ad626c 100644
--- a/tensorflow/examples/tutorials/mnist/mnist.py
+++ b/tensorflow/examples/tutorials/mnist/mnist.py
@@ -143,7 +143,7 @@ def evaluation(logits, labels):
"""
# For a classifier model, we can use the in_top_k Op.
# It returns a bool tensor with shape [batch_size] that is true for
- # the examples where the label's is was in the top k (here k=1)
+ # the examples where the label is in the top k (here k=1)
# of all logits for that example.
correct = tf.nn.in_top_k(logits, labels, 1)
# Return the number of true entries.
diff --git a/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py b/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py
index 9a485e63bc..33dc13c813 100644
--- a/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py
+++ b/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py
@@ -54,23 +54,23 @@ def main(_):
# Create the model
x = tf.placeholder(tf.float32, [None, 784], name='x-input')
W = tf.Variable(tf.zeros([784, 10]), name='weights')
- b = tf.Variable(tf.zeros([10], name='bias'))
+ b = tf.Variable(tf.zeros([10]), name='bias')
# Use a name scope to organize nodes in the graph visualizer
with tf.name_scope('Wx_b'):
y = tf.nn.softmax(tf.matmul(x, W) + b)
# Add summary ops to collect data
- _ = tf.histogram_summary('weights', W)
- _ = tf.histogram_summary('biases', b)
- _ = tf.histogram_summary('y', y)
+ tf.histogram_summary('weights', W)
+ tf.histogram_summary('biases', b)
+ tf.histogram_summary('y', y)
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10], name='y-input')
# More name scopes will clean up the graph representation
with tf.name_scope('xent'):
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
- _ = tf.scalar_summary('cross entropy', cross_entropy)
+ tf.scalar_summary('cross entropy', cross_entropy)
with tf.name_scope('train'):
train_step = tf.train.GradientDescentOptimizer(
FLAGS.learning_rate).minimize(cross_entropy)
@@ -78,7 +78,7 @@ def main(_):
with tf.name_scope('test'):
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
- _ = tf.scalar_summary('accuracy', accuracy)
+ tf.scalar_summary('accuracy', accuracy)
# Merge all the summaries and write them out to /tmp/mnist_logs (by default)
merged = tf.merge_all_summaries()
diff --git a/tensorflow/examples/tutorials/word2vec/word2vec_basic.py b/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
index 09d0678d3e..e81df38343 100644
--- a/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
+++ b/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
@@ -128,7 +128,7 @@ num_skips = 2 # How many times to reuse an input to generate a label.
# construction are also the most frequent.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
-valid_examples = np.array(random.sample(np.arange(valid_window), valid_size))
+valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64 # Number of negative examples to sample.
graph = tf.Graph()
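
The `valid_examples` change avoids passing a NumPy array to `random.sample`, which Python 3 rejects because an ndarray is not a `Sequence`; `np.random.choice(..., replace=False)` draws the same kind of duplicate-free sample directly. A quick check:

```python
import numpy as np

valid_size = 16
valid_window = 100

# Sample without replacement, with no dependence on random.sample accepting
# a NumPy array (which Python 3's random.sample rejects with a TypeError).
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
print(sorted(valid_examples))
assert len(set(valid_examples)) == valid_size  # no duplicates drawn
```
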
diff --git a/tensorflow/examples/udacity/3_regularization.ipynb b/tensorflow/examples/udacity/3_regularization.ipynb
index 7c587a6512..5e1d30f54f 100644
--- a/tensorflow/examples/udacity/3_regularization.ipynb
+++ b/tensorflow/examples/udacity/3_regularization.ipynb
@@ -290,11 +290,11 @@
"Another one is to use learning rate decay:\n",
"\n",
" global_step = tf.Variable(0) # count the number of steps taken.\n",
- " learning_rate = tf.train.exponential_decay(0.5, step, ...)\n",
+ " learning_rate = tf.train.exponential_decay(0.5, global_step, ...)\n",
" optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)\n",
" \n",
" ---\n"
]
}
]
-} \ No newline at end of file
+}
diff --git a/tensorflow/examples/udacity/5_word2vec.ipynb b/tensorflow/examples/udacity/5_word2vec.ipynb
index 5bc8152b1c..c266488bde 100644
--- a/tensorflow/examples/udacity/5_word2vec.ipynb
+++ b/tensorflow/examples/udacity/5_word2vec.ipynb
@@ -421,7 +421,7 @@
"\n",
"graph = tf.Graph()\n",
"\n",
- "with graph.as_default():\n",
+ "with graph.as_default(), tf.device('/cpu:0'):\n",
"\n",
" # Input data.\n",
" train_dataset = tf.placeholder(tf.int32, shape=[batch_size])\n",
diff --git a/tensorflow/examples/udacity/Dockerfile b/tensorflow/examples/udacity/Dockerfile
index 59ae4abca8..9545c376b7 100644
--- a/tensorflow/examples/udacity/Dockerfile
+++ b/tensorflow/examples/udacity/Dockerfile
@@ -1,6 +1,7 @@
FROM b.gcr.io/tensorflow/tensorflow:latest
MAINTAINER Vincent Vanhoucke <vanhoucke@google.com>
RUN pip install scikit-learn
+RUN rm -rf /notebooks/*
ADD *.ipynb /notebooks/
WORKDIR /notebooks
CMD ["/run_jupyter.sh"]
diff --git a/tensorflow/g3doc/api_docs/python/nn.md b/tensorflow/g3doc/api_docs/python/nn.md
index 6540a614c1..8da65f848e 100644
--- a/tensorflow/g3doc/api_docs/python/nn.md
+++ b/tensorflow/g3doc/api_docs/python/nn.md
@@ -820,7 +820,7 @@ classes are mutually exclusive (each entry is in exactly one class). For
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.
-**NOTE:**: While the classes are mutually exclusive, their probabilities
+**NOTE:** While the classes are mutually exclusive, their probabilities
need not be. All that is required is that each row of `labels` is
a valid probability distribution. If using exclusive `labels`
(wherein one and only one class is true at a time), see
@@ -857,7 +857,7 @@ classes are mutually exclusive (each entry is in exactly one class). For
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.
-**NOTE:**: For this operation, the probability of a given label is considered
+**NOTE:** For this operation, the probability of a given label is considered
exclusive. That is, soft classes are not allowed, and the `labels` vector
must provide a single specific index for the true class for each row of
`logits` (each minibatch entry). For soft softmax classification with
diff --git a/tensorflow/g3doc/api_docs/python/train.md b/tensorflow/g3doc/api_docs/python/train.md
index 595c89ddc1..7a65f963fc 100644
--- a/tensorflow/g3doc/api_docs/python/train.md
+++ b/tensorflow/g3doc/api_docs/python/train.md
@@ -794,9 +794,11 @@ global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
100000, 0.96, staircase=True)
-optimizer = tf.GradientDescentOptimizer(learning_rate)
# Passing global_step to minimize() will increment it at each step.
-optimizer.minimize(...my loss..., global_step=global_step)
+learning_step = (
+ tf.GradientDescentOptimizer(learning_rate)
+ .minimize(...my loss..., global_step=global_step)
+)
```
##### Args:
@@ -2280,5 +2282,3 @@ device assignments have not changed.
##### Returns:
A saver constructed from `saver_def` in `MetaGraphDef`.
-
-
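
The corrected `exponential_decay` snippet above follows the documented schedule `decayed = starter * decay_rate ** (global_step / decay_steps)`, with integer division when `staircase=True`. A small numeric check of that formula:

```python
starter_learning_rate = 0.1
decay_steps = 100000
decay_rate = 0.96

def decayed(global_step, staircase=True):
    # staircase=True decays in discrete steps (integer division of the exponent).
    exponent = global_step // decay_steps if staircase else global_step / decay_steps
    return starter_learning_rate * decay_rate ** exponent

for step in (0, 50000, 100000, 250000):
    print(step, decayed(step))
# 0 -> 0.1, 50000 -> 0.1 (staircase), 100000 -> 0.096, 250000 -> 0.09216
```
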
diff --git a/tensorflow/g3doc/get_started/os_setup.md b/tensorflow/g3doc/get_started/os_setup.md
index 82d3edba10..9b1d561e29 100644
--- a/tensorflow/g3doc/get_started/os_setup.md
+++ b/tensorflow/g3doc/get_started/os_setup.md
@@ -53,28 +53,28 @@ Install TensorFlow:
```bash
# Ubuntu/Linux 64-bit, CPU only:
-$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
+$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl
# Ubuntu/Linux 64-bit, GPU enabled:
-$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
+$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl
# Mac OS X, CPU only:
$ sudo easy_install --upgrade six
-$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py2-none-any.whl
+$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp27-none-any.whl
```
For python3:
```bash
# Ubuntu/Linux 64-bit, CPU only:
-$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
+$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl
# Ubuntu/Linux 64-bit, GPU enabled:
-$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
+$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl
# Mac OS X, CPU only:
$ sudo easy_install --upgrade six
-$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py3-none-any.whl
+$ sudo pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp35-none-any.whl
```
NOTE: If you are upgrading from a previous installation of TensorFlow < 0.7.1,
@@ -126,13 +126,13 @@ $ source ~/tensorflow/bin/activate.csh # If using csh
(tensorflow)$ # Your prompt should change
# Ubuntu/Linux 64-bit, CPU only:
-(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
+(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl
# Ubuntu/Linux 64-bit, GPU enabled:
-(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py2-none-linux_x86_64.whl
+(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl
# Mac OS X, CPU only:
-(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py2-none-any.whl
+(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp27-none-any.whl
```
and again for python3:
@@ -143,13 +143,13 @@ $ source ~/tensorflow/bin/activate.csh # If using csh
(tensorflow)$ # Your prompt should change
# Ubuntu/Linux 64-bit, CPU only:
-(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
+(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl
# Ubuntu/Linux 64-bit, GPU enabled:
-(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.0-py3-none-linux_x86_64.whl
+(tensorflow)$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.7.1-cp34-none-linux_x86_64.whl
# Mac OS X, CPU only:
-(tensorflow)$ pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py3-none-any.whl
+(tensorflow)$ pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp35-none-any.whl
```
With the Virtualenv environment activated, you can now
@@ -191,7 +191,7 @@ code.
* `b.gcr.io/tensorflow/tensorflow:latest-devel-gpu`: GPU Binary image plus source
code.
-We also have tags with `latest` replaced by a released version (e.g., `0.7.0-gpu`).
+We also have tags with `latest` replaced by a released version (e.g., `0.7.1-gpu`).
With Docker the installation is as follows:
@@ -464,7 +464,7 @@ We recommend using [homebrew](http://brew.sh) to install the bazel and SWIG
dependencies, and installing python dependencies using easy_install or pip.
Of course you can also install Swig from source without using homebrew. In that
-case, be sure to install its dependency [PCRE](from www.pcre.org) and not PCRE2.
+case, be sure to install its dependency [PCRE](http://www.pcre.org) and not PCRE2.
#### Dependencies
@@ -517,7 +517,7 @@ $ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_pack
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# The name of the .whl file will depend on your platform.
-$ pip install /tmp/tensorflow_pkg/tensorflow-0.7.0-py2-none-linux_x86_64.whl
+$ pip install /tmp/tensorflow_pkg/tensorflow-0.7.1-py2-none-linux_x86_64.whl
```
## Setting up TensorFlow for Development
diff --git a/tensorflow/g3doc/how_tos/image_retraining/index.md b/tensorflow/g3doc/how_tos/image_retraining/index.md
index 9b3bdbf7c1..e84b2978c5 100644
--- a/tensorflow/g3doc/how_tos/image_retraining/index.md
+++ b/tensorflow/g3doc/how_tos/image_retraining/index.md
@@ -74,7 +74,7 @@ and compact summary of the images, since it has to contain enough information
for the classifier to make a good choice in a very small set of values. The
reason our final layer retraining can work on new classes is that it turns out
the kind of information needed to distinguish between all the 1,000 classes in
-ImageNet is often also useful to chose between new kinds of objects.
+ImageNet is often also useful to distinguish between new kinds of objects.
Because every image is reused multiple times during training and calculating
each bottleneck takes a significant amount of time, it speeds things up to
@@ -88,20 +88,20 @@ part again.
Once the bottlenecks are complete, the actual training of the top layer of the
network begins. You'll see a series of step outputs, each one showing training
accuracy, validation accuracy, and the cross entropy. The training accuracy
-shows how many of the images used in the current training batch were labeled
-with the correct class. The validation accuracy is the precision on a
+shows what percent of the images used in the current training batch were
+labeled with the correct class. The validation accuracy is the precision on a
randomly-selected group of images from a different set. The key difference is
that the training accuracy is based on images that the network has been able
to learn from so the network can overfit to the noise in the training data. A
true measure of the performance of the network is to measure its performance on
a data set not contained in the training data -- this is measured by the
-validation accuracy. If the training accuracy is high but the validation remains
-low, that means the network is overfitting and memorizing particular features
-in the training images that aren't helpful more generally. Cross entropy is a
-loss function which gives a glimpse into how well the learning process is
-progressing. The training's objective is to make the loss as small as possible,
-so you can tell if the learning is working by keeping an eye on whether the loss
-keeps trending downwards, ignoring the short-term noise.
+validation accuracy. If the train accuracy is high but the validation accuracy
+remains low, that means the network is overfitting and memorizing particular
+features in the training images that aren't helpful more generally. Cross
+entropy is a loss function which gives a glimpse into how well the learning
+process is progressing. The training's objective is to make the loss as small as
+possible, so you can tell if the learning is working by keeping an eye on
+whether the loss keeps trending downwards, ignoring the short-term noise.
By default this script will run 4,000 training steps. Each step chooses ten
images at random from the training set, finds their bottlenecks from the cache,
@@ -114,8 +114,8 @@ and validation pictures. This test evaluation is the best estimate of how the
trained model will perform on the classification task. You should see an
accuracy value of between 90% and 95%, though the exact value will vary from run
to run since there's randomness in the training process. This number is based on
-how many of the images in the test set are given the correct label after the
-model is fully trained.
+the percent of the images in the test set that are given the correct label
+after the model is fully trained.
## Using the Retrained Model
@@ -266,7 +266,7 @@ memorized unimportant details of the training images.
This problem is known as overfitting, and to avoid it we keep some of our data
out of the training process, so that the model can't memorize them. We then use
-those images as a check to make sure that overfitting isn't occuring, since if
+those images as a check to make sure that overfitting isn't occurring, since if
we see good accuracy on them it's a good sign the network isn't overfitting. The
usual split is to put 80% of the images into the main training set, keep 10%
aside to run as validation frequently during training, and then have a final 10%
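
The doc above leans on cross entropy as the signal that learning is progressing; a tiny numeric illustration of what that loss measures for a one-hot label (generic, not the retraining script's exact code):

```python
import numpy as np

def cross_entropy(predicted_probs, true_label_index):
    # Standard cross entropy for a one-hot label: -log(probability assigned to
    # the correct class). Lower is better; 0 means a confident, correct guess.
    return -np.log(predicted_probs[true_label_index])

print(cross_entropy(np.array([0.7, 0.2, 0.1]), 0))  # ~0.357, fairly confident
print(cross_entropy(np.array([0.1, 0.2, 0.7]), 0))  # ~2.303, badly wrong
```
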
diff --git a/tensorflow/g3doc/how_tos/summaries_and_tensorboard/index.md b/tensorflow/g3doc/how_tos/summaries_and_tensorboard/index.md
index c8ae21e3dd..5059f02a73 100644
--- a/tensorflow/g3doc/how_tos/summaries_and_tensorboard/index.md
+++ b/tensorflow/g3doc/how_tos/summaries_and_tensorboard/index.md
@@ -86,23 +86,23 @@ with tf.name_scope("Wx_b") as scope:
y = tf.nn.softmax(tf.matmul(x,W) + b)
# Add summary ops to collect data
-w_hist = tf.histogram_summary("weights", W)
-b_hist = tf.histogram_summary("biases", b)
-y_hist = tf.histogram_summary("y", y)
+tf.histogram_summary("weights", W)
+tf.histogram_summary("biases", b)
+tf.histogram_summary("y", y)
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None,10], name="y-input")
# More name scopes will clean up the graph representation
with tf.name_scope("xent") as scope:
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
- ce_summ = tf.scalar_summary("cross entropy", cross_entropy)
+ tf.scalar_summary("cross entropy", cross_entropy)
with tf.name_scope("train") as scope:
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
with tf.name_scope("test") as scope:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
- accuracy_summary = tf.scalar_summary("accuracy", accuracy)
+ tf.scalar_summary("accuracy", accuracy)
# Merge all the summaries and write them out to /tmp/mnist_logs
merged = tf.merge_all_summaries()
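
Dropping the `w_hist = ...` style assignments here (and in mnist_with_summaries.py above) works because each summary op also registers itself in the graph's summaries collection, which `tf.merge_all_summaries()` gathers; the returned handles do not need to be kept. A minimal sketch, assuming the 0.7-era summary API used in this doc:

```python
import tensorflow as tf

x = tf.constant(3.0, name="x")
# Return values can be ignored: each summary op is also added to the graph's
# SUMMARIES collection, which merge_all_summaries() reads.
tf.scalar_summary("x", x)
tf.histogram_summary("x_hist", tf.fill([10], x))

merged = tf.merge_all_summaries()
with tf.Session() as sess:
    summary_proto = sess.run(merged)  # one serialized Summary with both values
```
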
diff --git a/tensorflow/g3doc/how_tos/tool_developers/index.md b/tensorflow/g3doc/how_tos/tool_developers/index.md
index 89e0867e99..67aad20831 100644
--- a/tensorflow/g3doc/how_tos/tool_developers/index.md
+++ b/tensorflow/g3doc/how_tos/tool_developers/index.md
@@ -28,8 +28,7 @@ by calling `as_graph_def()`, which returns a `GraphDef` object.
The GraphDef class is an object created by the ProtoBuf library from the
definition in
-[tensorflow/core/framework/graph.proto](https://github.com/tensorflow/tensorflow
-/blob/master/tensorflow/core/framework/graph.proto). The protobuf tools parse
+[tensorflow/core/framework/graph.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/graph.proto). The protobuf tools parse
this text file, and generate the code to load, store, and manipulate graph
definitions. If you see a standalone TensorFlow file representing a model, it's
likely to contain a serialized version of one of these `GraphDef` objects
@@ -37,8 +36,7 @@ saved out by the protobuf code.
This generated code is used to save and load the GraphDef files from disk. A
good example to look at as we dig into this is
-[graph_metrics.py](https://github.com/tensorflow/tensorflow/blob/master/tensorfl
-ow/python/tools/graph_metrics.py). This Python script takes a saved graph
+[graph_metrics.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/graph_metrics.py). This Python script takes a saved graph
definition, and analyzes the model to estimate performance and resource
statistics. The code that actually loads the model looks like this:
@@ -69,16 +67,14 @@ There are actually two different formats that a ProtoBuf can be saved in.
TextFormat is a human-readable form, which makes it nice for debugging and
editing, but can get large when there's numerical data like weights stored in
it. You can see a small example of that in
-[poly5-graph.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorf
-low/tensorboard/components/tf-tensorboard/demo/data/poly5-graph.pbtxt).
+[poly5-graph.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/components/tf-tensorboard/demo/data/poly5-graph.pbtxt).
Binary format files are a lot smaller than their text equivalents, even though
they're not as readable for us. In this script, we ask the user to supply a
flag indicating whether the input file is binary or text, so we know the right
function to call. You can find an example of a large binary file inside the
[inception_dec_2015.zip
-archive](https://storage.googleapis.com/download.tensorflow.org/models/inception
-_dec_2015.zip), as `tensorflow_inception_graph.pb`.
+archive](https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015.zip), as `tensorflow_inception_graph.pb`.
The API itself can be a bit confusing - the binary call is actually
`ParseFromString()`, whereas you use a utility function from the `text_format`
@@ -104,7 +100,7 @@ single operation along with its input connections. Here are the members of a
Every node should have a unique identifier that's not used by any other nodes
in the graph. If you don't specify one as you're building a graph using the
Python API, one reflecting the name of operation, such as "MatMul",
-concatenated with a monotonically increasing number, such as "5", will be
+concatenated with a monotonically increasing number, such as "5", will be
picked for you. The name is used when
defining the connections between nodes, and when setting inputs and outputs for
the whole graph when it's run.
@@ -115,8 +111,7 @@ This defines what operation to run, for example `"Add"`, `"MatMul"`, or
`"Conv2D"`. When a graph is run, this op name is looked up in a registry to
find an implementation. The registry is populated by calls to the
`REGISTER_OP()` macro, like those in
-[tensorflow/core/ops/nn_ops.cc](https://github.com/tensorflow/tensorflow/blob/ma
-ster/tensorflow/core/ops/nn_ops.cc).
+[tensorflow/core/ops/nn_ops.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/nn_ops.cc).
### `input`
@@ -142,8 +137,7 @@ size of filters for convolutions, or the values of constant ops. Because there
can be so many different types of attribute values, from strings, to ints, to
arrays of tensor values, there's a separate protobuf file defining the data
structure that holds them, in
-[tensorflow/core/framework/attr_value.proto](https://github.com/tensorflow/tenso
-rflow/blob/master/tensorflow/core/framework/attr_value.proto).
+[tensorflow/core/framework/attr_value.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/attr_value.proto).
Each attribute has a unique name string, and the expected attributes are listed
when the operation is defined. If an attribute isn't present in a node, but it
@@ -161,8 +155,7 @@ the file format during training. Instead, they're held in separate checkpoint
files, and there are `Variable` ops in the graph that load the latest values
when they're initialized. It's often not very convenient to have separate files
when you're deploying to production, so there's the
-[freeze_graph.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflo
-w/python/tools/freeze_graph.py) script that takes a graph definition and a set
+[freeze_graph.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py) script that takes a graph definition and a set
of checkpoints and freezes them together into a single file.
What this does is load the `GraphDef`, pull in the values for all the variables
@@ -178,10 +171,9 @@ the most common problems is extracting and interpreting the weight values. A
common way to store them, for example in graphs created by the freeze_graph
script, is as `Const` ops containing the weights as `Tensors`. These are
defined in
-[tensorflow/core/framework/tensor.proto](https://github.com/tensorflow/tensorflo
-w/blob/master/tensorflow/core/framework/tensor.proto), and contain information
+[tensorflow/core/framework/tensor.proto](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/tensor.proto), and contain information
about the size and type of the data, as well as the values themselves. In
-Python, you get a `TensorProto` object from a `NodeDef` representing a `Const`
+Python, you get a `TensorProto` object from a `NodeDef` representing a `Const`
op by calling something like `some_node_def.attr['value'].tensor`.
This will give you an object representing the weights data. The data itself
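
Since this part of the doc walks through both ProtoBuf formats, here is a compact sketch of the loading logic it describes (binary via `ParseFromString()`, text via the `text_format` module); the helper name and flag are illustrative, not the actual graph_metrics.py code.

```python
import tensorflow as tf
from google.protobuf import text_format

def load_graph_def(path, is_binary):
    # Load a GraphDef saved in either binary or text ProtoBuf form.
    graph_def = tf.GraphDef()
    with open(path, 'rb' if is_binary else 'r') as f:
        if is_binary:
            graph_def.ParseFromString(f.read())      # binary .pb
        else:
            text_format.Merge(f.read(), graph_def)   # TextFormat .pbtxt
    return graph_def
```

From the returned `graph_def`, a `Const` node's weights are reached the way the doc describes, e.g. `graph_def.node[0].attr['value'].tensor`.
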
diff --git a/tensorflow/g3doc/resources/dims_types.md b/tensorflow/g3doc/resources/dims_types.md
index 18390724e3..8e55e609a0 100644
--- a/tensorflow/g3doc/resources/dims_types.md
+++ b/tensorflow/g3doc/resources/dims_types.md
@@ -16,7 +16,7 @@ Python list) has a rank of 2:
t = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
A rank two tensor is what we typically think of as a matrix, a rank one tensor
-is a vector. For a rank two tensor you can acccess any element with the syntax
+is a vector. For a rank two tensor you can access any element with the syntax
`t[i, j]`. For a rank three tensor you would need to address an element with
`t[i, j, k]`.
diff --git a/tensorflow/g3doc/resources/index.md b/tensorflow/g3doc/resources/index.md
index 5d14dcbdf6..2f0bbb455b 100644
--- a/tensorflow/g3doc/resources/index.md
+++ b/tensorflow/g3doc/resources/index.md
@@ -31,6 +31,11 @@ something amazing with TensorFlow, we'd like to hear about it!
## Community
+The TensorFlow community has created many great projects around TensorFlow, including:
+
+* [TensorFlow tutorials](https://github.com/pkmital/tensorflow_tutorials)
+* [Scikit Flow - Simplified Interface for TensorFlow](https://github.com/tensorflow/skflow)
+
### Development
The source code for TensorFlow is hosted on GitHub:
diff --git a/tensorflow/g3doc/tutorials/deep_cnn/index.md b/tensorflow/g3doc/tutorials/deep_cnn/index.md
index 1491c91bae..57722ed18a 100644
--- a/tensorflow/g3doc/tutorials/deep_cnn/index.md
+++ b/tensorflow/g3doc/tutorials/deep_cnn/index.md
@@ -9,8 +9,6 @@ CIFAR-10 classification is a common benchmark problem in machine learning. The
problem is to classify RGB 32x32 pixel images across 10 categories:
```airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.```
-![CIFAR-10 Samples](../../images/cifar_samples.png "CIFAR-10 Samples, from http://www.cs.toronto.edu/~kriz/cifar.html")
-
For more details refer to the [CIFAR-10 page](http://www.cs.toronto.edu/~kriz/cifar.html)
and a [Tech Report](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)
by Alex Krizhevsky.
@@ -117,7 +115,7 @@ learn more about how the `Reader` class works.
The images are processed as follows:
* They are cropped to 24 x 24 pixels, centrally for evaluation or
- [randomly](../../api_docs/python/image.md#random_crop) for training.
+ [randomly](../../api_docs/python/constant_op.md#random_crop) for training.
* They are [approximately whitened](../../api_docs/python/image.md#per_image_whitening)
to make the model insensitive to dynamic range.
@@ -168,7 +166,7 @@ Here is a graph generated from TensorBoard describing the inference operation:
</div>
> **EXERCISE**: The output of `inference` are un-normalized logits. Try editing
-the network architecture to return normalized predictions using [`tf.softmax()`]
+the network architecture to return normalized predictions using [`tf.nn.softmax()`]
(../../api_docs/python/nn.md#softmax).
The `inputs()` and `inference()` functions provide all the components
diff --git a/tensorflow/g3doc/tutorials/mnist/download/index.md b/tensorflow/g3doc/tutorials/mnist/download/index.md
index e9698d6248..16ff9e8422 100644
--- a/tensorflow/g3doc/tutorials/mnist/download/index.md
+++ b/tensorflow/g3doc/tutorials/mnist/download/index.md
@@ -50,7 +50,7 @@ unpacked (following the instructions available at the website) by the
The image data is extracted into a 2d tensor of: `[image index, pixel index]`
where each entry is the intensity value of a specific pixel in a specific
-image, rescaled from `[0, 255]` to `[-0.5, 0.5]`. The "image index" corresponds
+image, rescaled from `[0, 255]` to `[0, 1]`. The "image index" corresponds
to an image in the dataset, counting up from zero to the size of the dataset.
And the "pixel index" corresponds to a specific pixel in that image, ranging
from zero to the number of pixels in the image.
diff --git a/tensorflow/g3doc/tutorials/recurrent/index.md b/tensorflow/g3doc/tutorials/recurrent/index.md
index b39dedffb3..5de4c653ed 100644
--- a/tensorflow/g3doc/tutorials/recurrent/index.md
+++ b/tensorflow/g3doc/tutorials/recurrent/index.md
@@ -92,7 +92,7 @@ lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
initial_state = state = tf.zeros([batch_size, lstm.state_size])
-for i in range(len(num_steps)):
+for i in range(num_steps):
# The value of state is updated after processing each batch of words.
output, state = lstm(words[:, i], state)
@@ -159,7 +159,7 @@ lstm = rnn_cell.BasicLSTMCell(lstm_size)
stacked_lstm = rnn_cell.MultiRNNCell([lstm] * number_of_layers)
initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
-for i in range(len(num_steps)):
+for i in range(num_steps):
# The value of state is updated after processing each batch of words.
output, state = stacked_lstm(words[:, i], state)
diff --git a/tensorflow/g3doc/tutorials/seq2seq/index.md b/tensorflow/g3doc/tutorials/seq2seq/index.md
index 3d64bcc91b..c1673f474d 100644
--- a/tensorflow/g3doc/tutorials/seq2seq/index.md
+++ b/tensorflow/g3doc/tutorials/seq2seq/index.md
@@ -58,7 +58,7 @@ translation [Sutskever et al., 2014](http://arxiv.org/abs/1409.3215)
In the basic model depicted above, every input has to be encoded into
a fixed-size state vector, as that is the only thing passed to the decoder.
To allow the decoder more direct access to the input, an *attention* mechanism
-was introduced in [Bahdanu et al., 2014](http://arxiv.org/abs/1409.0473)
+was introduced in [Bahdanau et al., 2014](http://arxiv.org/abs/1409.0473)
([pdf](http://arxiv.org/pdf/1409.0473.pdf)).
We will not go into the details of the attention mechanism (see the paper),
suffice it to say that it allows the decoder to peek into the input at every
@@ -176,8 +176,8 @@ projections are constructed by the following code in `seq2seq_model.py`.
```
First, note that we only construct a sampled softmax if the number of samples
-(512 by default) is smaller that the target vocabulary size. For vocabularies
-smaller than 512 it might be a better idea to just use a standard softmax loss.
+(512 by default) is smaller than the target vocabulary size. For vocabularies
+smaller than 512, it might be a better idea to just use a standard softmax loss.
Then, as you can see, we construct an output projection. It is a pair,
consisting of a weight matrix and a bias vector. If used, the rnn cell
diff --git a/tensorflow/models/image/mnist/convolutional.py b/tensorflow/models/image/mnist/convolutional.py
index 4627082a28..aaba57b3b0 100644
--- a/tensorflow/models/image/mnist/convolutional.py
+++ b/tensorflow/models/image/mnist/convolutional.py
@@ -17,7 +17,7 @@
This should achieve a test error of 0.7%. Please keep this model as simple and
linear as possible, it is meant as a tutorial for simple convolutional models.
-Run with --self_test on the command line to exectute a short self-test.
+Run with --self_test on the command line to execute a short self-test.
"""
from __future__ import absolute_import
from __future__ import division
diff --git a/tensorflow/models/rnn/ptb/ptb_word_lm.py b/tensorflow/models/rnn/ptb/ptb_word_lm.py
index bac4c65491..c56f316b03 100644
--- a/tensorflow/models/rnn/ptb/ptb_word_lm.py
+++ b/tensorflow/models/rnn/ptb/ptb_word_lm.py
@@ -276,7 +276,7 @@ def get_config():
raise ValueError("Invalid model: %s", FLAGS.model)
-def main(unused_args):
+def main(_):
if not FLAGS.data_path:
raise ValueError("Must set --data_path to PTB data directory")
diff --git a/tensorflow/models/rnn/translate/data_utils.py b/tensorflow/models/rnn/translate/data_utils.py
index dab12ab928..48da4f065c 100644
--- a/tensorflow/models/rnn/translate/data_utils.py
+++ b/tensorflow/models/rnn/translate/data_utils.py
@@ -66,7 +66,7 @@ def gunzip_file(gz_path, new_path):
"""Unzips from gz_path into new_path."""
print("Unpacking %s to %s" % (gz_path, new_path))
with gzip.open(gz_path, "rb") as gz_file:
- with open(new_path, "w") as new_file:
+ with open(new_path, "wb") as new_file:
for line in gz_file:
new_file.write(line)
diff --git a/tensorflow/python/framework/importer.py b/tensorflow/python/framework/importer.py
index 33ead71e35..d40e0b33d2 100644
--- a/tensorflow/python/framework/importer.py
+++ b/tensorflow/python/framework/importer.py
@@ -251,8 +251,8 @@ def import_graph_def(graph_def, input_map=None, return_elements=None,
class_values = value.list
new_class_values = []
for class_value in class_values.s:
- if class_value.startswith('loc:@'):
- op_to_bind_to = class_value[5:]
+ if class_value.startswith(b'loc:@'):
+ op_to_bind_to = class_value[5:].decode()
# Find the op by its original name.
if op_to_bind_to not in name_to_op:
raise ValueError('Specified colocation to an op that '
diff --git a/tensorflow/python/framework/ops.py b/tensorflow/python/framework/ops.py
index e15299e519..b1c68f33fb 100644
--- a/tensorflow/python/framework/ops.py
+++ b/tensorflow/python/framework/ops.py
@@ -1041,7 +1041,7 @@ class Operation(object):
raise TypeError("node_def needs to be a NodeDef: %s" % node_def)
if node_def.ByteSize() >= (1 << 31) or node_def.ByteSize() < 0:
raise ValueError(
- "Cannot create an Operation with a NodeDef larger than 2GB.")
+ "Cannot create a tensor proto whose content is larger than 2GB.")
if not _VALID_OP_NAME_REGEX.match(node_def.name):
raise ValueError("'%s' is not a valid node name" % node_def.name)
if not isinstance(g, Graph):
diff --git a/tensorflow/python/framework/ops_test.py b/tensorflow/python/framework/ops_test.py
index 82e455c3ec..cfc96a0cc8 100644
--- a/tensorflow/python/framework/ops_test.py
+++ b/tensorflow/python/framework/ops_test.py
@@ -1228,8 +1228,8 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
with ops.colocate_with(a.op):
b = constant_op.constant(3.0)
c = constant_op.constant(4.0)
- self.assertEqual(["loc:@a"], a.op.colocation_groups())
- self.assertEqual(["loc:@a"], b.op.colocation_groups())
+ self.assertEqual([b"loc:@a"], a.op.colocation_groups())
+ self.assertEqual([b"loc:@a"], b.op.colocation_groups())
with self.assertRaises(ValueError):
c.op.get_attr("_class")
@@ -1242,7 +1242,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
# colocated with 'a', which is on '/gpu:0'. colocate_with
# overrides devices because it is a stronger constraint.
b = constant_op.constant(3.0)
- self.assertEqual(["loc:@a"], b.op.colocation_groups())
+ self.assertEqual([b"loc:@a"], b.op.colocation_groups())
self.assertEqual(a.op.device, b.op.device)
def testLocationOverrides(self):
@@ -1258,7 +1258,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
c = constant_op.constant(4.0)
d = constant_op.constant(5.0)
- self.assertEqual(["loc:@a"], b.op.colocation_groups())
+ self.assertEqual([b"loc:@a"], b.op.colocation_groups())
self.assertEqual("/device:GPU:0", a.op.device)
self.assertEqual(a.op.device, b.op.device)
@@ -1272,8 +1272,8 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
b = constant_op.constant(3.0)
with ops.colocate_with(b.op):
c = constant_op.constant(4.0)
- self.assertEqual(["loc:@a"], b.op.colocation_groups())
- self.assertEqual(["loc:@a"], c.op.colocation_groups())
+ self.assertEqual([b"loc:@a"], b.op.colocation_groups())
+ self.assertEqual([b"loc:@a"], c.op.colocation_groups())
def testMultiColocationGroups(self):
a = constant_op.constant([2.0], name="a")
@@ -1281,7 +1281,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
with ops.colocate_with(a.op):
with ops.colocate_with(b.op):
c = constant_op.constant(4.0)
- self.assertEqual(set(["loc:@a", "loc:@b"]), set(c.op.colocation_groups()))
+ self.assertEqual(set([b"loc:@a", b"loc:@b"]), set(c.op.colocation_groups()))
def testColocationIgnoreStack(self):
a = constant_op.constant([2.0], name="a")
@@ -1295,7 +1295,7 @@ class ColocationGroupTest(test_util.TensorFlowTestCase):
a = variables.Variable([2.0], name="a")
with ops.colocate_with(a.op):
b = variables.Variable([3.0], name="b")
- self.assertEqual(["loc:@a"], b.op.colocation_groups())
+ self.assertEqual([b"loc:@a"], b.op.colocation_groups())
def testInconsistentDeviceWithinColocate(self):
with ops.device("/gpu:0"):
diff --git a/tensorflow/python/framework/tensor_util.py b/tensorflow/python/framework/tensor_util.py
index d311cefcf4..ae4e73a363 100644
--- a/tensorflow/python/framework/tensor_util.py
+++ b/tensorflow/python/framework/tensor_util.py
@@ -361,6 +361,9 @@ def make_tensor_proto(values, dtype=None, shape=None):
tensor_shape=tensor_shape.as_shape(shape).as_proto())
if is_same_size and numpy_dtype in _TENSOR_CONTENT_TYPES and shape_size > 1:
+ if nparray.size * nparray.itemsize >= (1 << 31):
+ raise ValueError(
+ "Cannot create a tensor proto whose content is larger than 2GB.")
tensor_proto.tensor_content = nparray.tostring()
return tensor_proto
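
The new guard rejects tensor content of 2 GiB or more, matching the `NodeDef` size check in ops.py (protobuf messages top out around 2 GB). Checking the arithmetic against the shape used in the test below, without allocating anything:

```python
import numpy as np

limit = 1 << 31
itemsize = np.dtype(np.float32).itemsize          # 4 bytes per element

num_elements = 512 * 1024 * 1024                  # shape used by the test below
print(num_elements * itemsize)                    # 2147483648 == 2 GiB
print(num_elements * itemsize >= limit)           # True -> ValueError is raised

print((511 * 1024 * 1024) * itemsize >= limit)    # False -> just under, accepted
```
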
diff --git a/tensorflow/python/kernel_tests/constant_op_test.py b/tensorflow/python/kernel_tests/constant_op_test.py
index 9235be228f..d93020e825 100644
--- a/tensorflow/python/kernel_tests/constant_op_test.py
+++ b/tensorflow/python/kernel_tests/constant_op_test.py
@@ -155,7 +155,7 @@ class ConstantTest(tf.test.TestCase):
large_array = np.zeros((512, 1024, 1024), dtype=np.float32)
with self.assertRaisesRegexp(
ValueError,
- "Cannot create an Operation with a NodeDef larger than 2GB."):
+ "Cannot create a tensor proto whose content is larger than 2GB."):
c = tf.constant(large_array)
def testTooLargeGraph(self):
diff --git a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
index 19f41c49e5..9c25ea555c 100644
--- a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
+++ b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
@@ -1397,7 +1397,7 @@ class ControlFlowTest(tf.test.TestCase):
vdef)
# The device is empty, but the colocation constraint is set.
self.assertDeviceEqual("", with_vdef_dep.device)
- self.assertEqual(["loc:@vdef"],
+ self.assertEqual([b"loc:@vdef"],
with_vdef_dep.op.colocation_groups())
def testGroup(self):
diff --git a/tensorflow/python/kernel_tests/depthtospace_op_test.py b/tensorflow/python/kernel_tests/depthtospace_op_test.py
index 36b25d9656..8dda8832b3 100644
--- a/tensorflow/python/kernel_tests/depthtospace_op_test.py
+++ b/tensorflow/python/kernel_tests/depthtospace_op_test.py
@@ -156,7 +156,7 @@ class DepthToSpaceTest(tf.test.TestCase):
out_tf.eval()
def testBlockSizeNotDivisibleDepth(self):
- # The the depth is not divisible by the square of the block size.
+ # The depth is not divisible by the square of the block size.
x_np = [[[[1, 1, 1, 1],
[2, 2, 2, 2]],
[[3, 3, 3, 3],
diff --git a/tensorflow/python/kernel_tests/diag_op_test.py b/tensorflow/python/kernel_tests/diag_op_test.py
index 8cc4cbe8b2..73cad8d34f 100644
--- a/tensorflow/python/kernel_tests/diag_op_test.py
+++ b/tensorflow/python/kernel_tests/diag_op_test.py
@@ -23,18 +23,21 @@ import tensorflow as tf
class GenerateIdentityTensorTest(tf.test.TestCase):
- def _testDiagOp(self, diag, dtype, expected_ans, use_gpu=False,
- expected_err_re=None):
+ def diagOp(self, diag, dtype, expected_ans, use_gpu=False):
with self.test_session(use_gpu=use_gpu):
tf_ans = tf.diag(tf.convert_to_tensor(diag.astype(dtype)))
out = tf_ans.eval()
+ tf_ans_inv = tf.diag_part(expected_ans)
+ inv_out = tf_ans_inv.eval()
self.assertAllClose(out, expected_ans)
+ self.assertAllClose(inv_out, diag)
self.assertShapeEqual(expected_ans, tf_ans)
+ self.assertShapeEqual(diag, tf_ans_inv)
def testEmptyTensor(self):
x = numpy.array([])
expected_ans = numpy.empty([0, 0])
- self._testDiagOp(x, numpy.int32, expected_ans)
+ self.diagOp(x, numpy.int32, expected_ans)
def testRankOneIntTensor(self):
x = numpy.array([1, 2, 3])
@@ -42,8 +45,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
[[1, 0, 0],
[0, 2, 0],
[0, 0, 3]])
- self._testDiagOp(x, numpy.int32, expected_ans)
- self._testDiagOp(x, numpy.int64, expected_ans)
+ self.diagOp(x, numpy.int32, expected_ans)
+ self.diagOp(x, numpy.int64, expected_ans)
def testRankOneFloatTensor(self):
x = numpy.array([1.1, 2.2, 3.3])
@@ -51,8 +54,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
[[1.1, 0, 0],
[0, 2.2, 0],
[0, 0, 3.3]])
- self._testDiagOp(x, numpy.float32, expected_ans)
- self._testDiagOp(x, numpy.float64, expected_ans)
+ self.diagOp(x, numpy.float32, expected_ans)
+ self.diagOp(x, numpy.float64, expected_ans)
def testRankTwoIntTensor(self):
x = numpy.array([[1, 2, 3], [4, 5, 6]])
@@ -63,8 +66,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
[[[0, 0, 0], [4, 0, 0]],
[[0, 0, 0], [0, 5, 0]],
[[0, 0, 0], [0, 0, 6]]]])
- self._testDiagOp(x, numpy.int32, expected_ans)
- self._testDiagOp(x, numpy.int64, expected_ans)
+ self.diagOp(x, numpy.int32, expected_ans)
+ self.diagOp(x, numpy.int64, expected_ans)
def testRankTwoFloatTensor(self):
x = numpy.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
@@ -75,8 +78,8 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
[[[0, 0, 0], [4.4, 0, 0]],
[[0, 0, 0], [0, 5.5, 0]],
[[0, 0, 0], [0, 0, 6.6]]]])
- self._testDiagOp(x, numpy.float32, expected_ans)
- self._testDiagOp(x, numpy.float64, expected_ans)
+ self.diagOp(x, numpy.float32, expected_ans)
+ self.diagOp(x, numpy.float64, expected_ans)
def testRankThreeFloatTensor(self):
x = numpy.array([[[1.1, 2.2], [3.3, 4.4]],
@@ -90,8 +93,64 @@ class GenerateIdentityTensorTest(tf.test.TestCase):
[[[0, 0], [0, 0]], [[0, 6.6], [0, 0]]]],
[[[[0, 0], [0, 0]], [[0, 0], [7.7, 0]]],
[[[0, 0], [0, 0]], [[0, 0], [0, 8.8]]]]]])
- self._testDiagOp(x, numpy.float32, expected_ans)
- self._testDiagOp(x, numpy.float64, expected_ans)
+ self.diagOp(x, numpy.float32, expected_ans)
+ self.diagOp(x, numpy.float64, expected_ans)
+
+class DiagPartOpTest(tf.test.TestCase):
+
+ def setUp(self):
+    numpy.random.seed(0)
+
+  def diagPartOp(self, tensor, dtype, expected_ans, use_gpu=False):
+ with self.test_session(use_gpu=use_gpu):
+ tf_ans_inv = tf.diag_part(tensor)
+ inv_out = tf_ans_inv.eval()
+ self.assertAllClose(inv_out, expected_ans)
+ self.assertShapeEqual(expected_ans, tf_ans_inv)
+
+ def testRankTwoFloatTensor(self):
+ x = numpy.random.rand(3, 3)
+ i = numpy.arange(3)
+ expected_ans = x[i, i]
+ self.diagPartOp(x, numpy.float32, expected_ans)
+ self.diagPartOp(x, numpy.float64, expected_ans)
+
+ def testRankFourFloatTensor(self):
+ x = numpy.random.rand(2, 3, 2, 3)
+ i = numpy.arange(2)[:, None]
+ j = numpy.arange(3)
+ expected_ans = x[i, j, i, j]
+ self.diagPartOp(x, numpy.float32, expected_ans)
+ self.diagPartOp(x, numpy.float64, expected_ans)
+
+ def testRankSixFloatTensor(self):
+ x = numpy.random.rand(2, 2, 2, 2, 2, 2)
+ i = numpy.arange(2)[:, None, None]
+ j = numpy.arange(2)[:, None]
+ k = numpy.arange(2)
+ expected_ans = x[i, j, k, i, j, k]
+ self.diagPartOp(x, numpy.float32, expected_ans)
+ self.diagPartOp(x, numpy.float64, expected_ans)
+
+ def testOddRank(self):
+ w = numpy.random.rand(2)
+ x = numpy.random.rand(2, 2, 2)
+ y = numpy.random.rand(2, 2, 2, 2, 2)
+ z = numpy.random.rand(2, 2, 2, 2, 2, 2, 2)
+ self.assertRaises(ValueError, self.diagPartOp, w, numpy.float32, 0)
+ self.assertRaises(ValueError, self.diagPartOp, x, numpy.float32, 0)
+ self.assertRaises(ValueError, self.diagPartOp, y, numpy.float32, 0)
+ self.assertRaises(ValueError, self.diagPartOp, z, numpy.float32, 0)
+
+ def testUnevenDimensions(self):
+ w = numpy.random.rand(2, 5)
+ x = numpy.random.rand(2, 1, 2, 3)
+ y = numpy.random.rand(2, 1, 2, 1, 2, 5)
+ z = numpy.random.rand(2, 2, 2, 2, 2, 2, 2, 2)
+ self.assertRaises(ValueError, self.diagPartOp, w, numpy.float32, 0)
+ self.assertRaises(ValueError, self.diagPartOp, x, numpy.float32, 0)
+ self.assertRaises(ValueError, self.diagPartOp, y, numpy.float32, 0)
+ self.assertRaises(ValueError, self.diagPartOp, z, numpy.float32, 0)
if __name__ == "__main__":
tf.test.main()
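# A minimal illustrative sketch of the numpy indexing pattern the DiagPartOpTest
# cases above rely on: broadcasting the index arrays i and j selects the entries
# x[i, j, i, j], the rank-4 analogue of a matrix diagonal.
import numpy as np

x = np.random.rand(2, 3, 2, 3)
i = np.arange(2)[:, None]   # shape (2, 1), broadcasts against j
j = np.arange(3)            # shape (3,)
diagonal = x[i, j, i, j]    # shape (2, 3)
assert diagonal[1, 2] == x[1, 2, 1, 2]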
diff --git a/tensorflow/python/kernel_tests/init_ops_test.py b/tensorflow/python/kernel_tests/init_ops_test.py
index 816a159a6a..eb78cedfab 100644
--- a/tensorflow/python/kernel_tests/init_ops_test.py
+++ b/tensorflow/python/kernel_tests/init_ops_test.py
@@ -25,7 +25,7 @@ from tensorflow.python.framework import random_seed
from tensorflow.python.ops import init_ops
-# Returns true iff the two initalizers produce the same tensor to
+# Returns true iff the two initializers produce the same tensor to
# within a tiny tolerance.
def identicaltest(tc, init1, init2, use_gpu):
"""Tests if two initializations are identical to within tiny tolerances.
diff --git a/tensorflow/python/kernel_tests/matmul_op_test.py b/tensorflow/python/kernel_tests/matmul_op_test.py
index d0978d0adb..87ccc83d98 100644
--- a/tensorflow/python/kernel_tests/matmul_op_test.py
+++ b/tensorflow/python/kernel_tests/matmul_op_test.py
@@ -120,7 +120,7 @@ class MatMulTest(tf.test.TestCase):
self._testCpuMatmul(x, y, True, True)
self._testGpuMatmul(x, y, True, True)
- def testDoubleRandomTranposeBoth(self):
+ def testDoubleRandomTransposeBoth(self):
for _ in range(10):
n, k, m = np.random.randint(1, 100, size=3)
x = self._randMatrix(k, n, np.float64)
diff --git a/tensorflow/python/kernel_tests/reduction_ops_test.py b/tensorflow/python/kernel_tests/reduction_ops_test.py
index 0a55afb3ad..323bc43920 100644
--- a/tensorflow/python/kernel_tests/reduction_ops_test.py
+++ b/tensorflow/python/kernel_tests/reduction_ops_test.py
@@ -116,8 +116,8 @@ class SumReductionTest(tf.test.TestCase):
# Simple tests for various types.
def testDoubleReduce1D(self):
np_arr = np.arange(1, 6).reshape([5]).astype(np.float64)
- self._compare(np_arr, [], False)
- self._compare(np_arr, [0], False)
+ self._compareAll(np_arr, [])
+ self._compareAll(np_arr, [0])
def testInt32Reduce1D(self):
np_arr = np.arange(1, 6).reshape([5]).astype(np.int32)
@@ -230,6 +230,19 @@ class MeanReductionTest(tf.test.TestCase):
self._compareAll(np_arr, [0, 2])
self._compareAll(np_arr, [0, 1, 2])
+ def testDoubleReduce3D(self):
+ # Create a 3D array of doubles and reduce across all possible
+ # dimensions
+ np_arr = np.arange(0, 30).reshape([2, 3, 5]).astype(np.float64)
+ self._compareAll(np_arr, [])
+ self._compareAll(np_arr, [0])
+ self._compareAll(np_arr, [1])
+ self._compareAll(np_arr, [2])
+ self._compareAll(np_arr, [0, 1])
+ self._compareAll(np_arr, [1, 2])
+ self._compareAll(np_arr, [0, 2])
+ self._compareAll(np_arr, [0, 1, 2])
+
def testGradient(self):
s = [2, 3, 4, 2]
x = np.arange(1.0, 49.0).reshape(s).astype(np.float32)
@@ -383,6 +396,19 @@ class MinReductionTest(tf.test.TestCase):
self._compareAll(np_arr, [0, 2])
self._compareAll(np_arr, [0, 1, 2])
+ def testDoubleReduce3D(self):
+ # Create a 3D array of doubles and reduce across all possible
+ # dimensions
+ np_arr = np.arange(0, 30).reshape([2, 3, 5]).astype(np.float64)
+ self._compareAll(np_arr, [])
+ self._compareAll(np_arr, [0])
+ self._compareAll(np_arr, [1])
+ self._compareAll(np_arr, [2])
+ self._compareAll(np_arr, [0, 1])
+ self._compareAll(np_arr, [1, 2])
+ self._compareAll(np_arr, [0, 2])
+ self._compareAll(np_arr, [0, 1, 2])
+
def testGradient(self):
s = [2, 3, 4, 2]
x = np.arange(1.0, 49.0).reshape(s).astype(np.float64)
@@ -477,6 +503,20 @@ class MaxReductionTest(tf.test.TestCase):
self._compareAll(np_arr, [0, 2])
self._compareAll(np_arr, [0, 1, 2])
+ def testDoubleReduce3D(self):
+ # Create a 3D array of doubles and reduce across all possible
+ # dimensions
+ np_arr = np.arange(0, 30).reshape([2, 3, 5]).astype(np.float64)
+ self._compareAll(np_arr, None)
+ self._compareAll(np_arr, [])
+ self._compareAll(np_arr, [0])
+ self._compareAll(np_arr, [1])
+ self._compareAll(np_arr, [2])
+ self._compareAll(np_arr, [0, 1])
+ self._compareAll(np_arr, [1, 2])
+ self._compareAll(np_arr, [0, 2])
+ self._compareAll(np_arr, [0, 1, 2])
+
def testGradient(self):
s = [2, 3, 4, 2]
x = np.arange(1.0, 49.0).reshape(s).astype(np.float64)
diff --git a/tensorflow/python/kernel_tests/rnn_test.py b/tensorflow/python/kernel_tests/rnn_test.py
index 7d040c221e..be59ac08c2 100644
--- a/tensorflow/python/kernel_tests/rnn_test.py
+++ b/tensorflow/python/kernel_tests/rnn_test.py
@@ -782,11 +782,11 @@ class BidirectionalRNNTest(tf.test.TestCase):
tf.float32,
shape=(batch_size, input_size) if use_shape else (None, input_size))
]
- outputs = tf.nn.bidirectional_rnn(cell_fw,
- cell_bw,
- inputs,
- dtype=tf.float32,
- sequence_length=sequence_length)
+ outputs, state_fw, state_bw = tf.nn.bidirectional_rnn(cell_fw,
+ cell_bw,
+ inputs,
+ dtype=tf.float32,
+ sequence_length=sequence_length)
self.assertEqual(len(outputs), len(inputs))
for out in outputs:
self.assertEqual(
@@ -794,17 +794,19 @@ class BidirectionalRNNTest(tf.test.TestCase):
[batch_size if use_shape else None, 2 * num_units])
input_value = np.random.randn(batch_size, input_size)
+ outputs = tf.pack(outputs)
- return input_value, inputs, outputs, sequence_length
+ return input_value, inputs, outputs, state_fw, state_bw, sequence_length
def _testBidirectionalRNN(self, use_gpu, use_shape):
with self.test_session(use_gpu=use_gpu, graph=tf.Graph()) as sess:
- input_value, inputs, outputs, sequence_length = (
+ input_value, inputs, outputs, state_fw, state_bw, sequence_length = (
self._createBidirectionalRNN(use_gpu, use_shape, True))
tf.initialize_all_variables().run()
# Run with pre-specified sequence length of 2, 3
- out = sess.run(outputs, feed_dict={inputs[0]: input_value,
- sequence_length: [2, 3]})
+ out, s_fw, s_bw = sess.run([outputs, state_fw, state_bw],
+ feed_dict={inputs[0]: input_value,
+ sequence_length: [2, 3]})
# Since the forward and backward LSTM cells were initialized with the
# same parameters, the forward and backward output has to be the same,
@@ -836,13 +838,17 @@ class BidirectionalRNNTest(tf.test.TestCase):
self.assertEqual(out[2][1][0], out[0][1][3])
self.assertEqual(out[2][1][1], out[0][1][4])
self.assertEqual(out[2][1][2], out[0][1][5])
+ # Via the reasoning above, the forward and backward final state should be
+ # exactly the same
+ self.assertAllClose(s_fw, s_bw)
def _testBidirectionalRNNWithoutSequenceLength(self, use_gpu, use_shape):
with self.test_session(use_gpu=use_gpu, graph=tf.Graph()) as sess:
- input_value, inputs, outputs, _ = self._createBidirectionalRNN(
- use_gpu, use_shape, False)
+ input_value, inputs, outputs, state_fw, state_bw, _ = self._createBidirectionalRNN(
+ use_gpu, use_shape, False)
tf.initialize_all_variables().run()
- out = sess.run(outputs, feed_dict={inputs[0]: input_value})
+ out, s_fw, s_bw = sess.run([outputs, state_fw, state_bw],
+ feed_dict={inputs[0]: input_value})
# Since the forward and backward LSTM cells were initialized with the
# same parameters, the forward and backward output has to be the same,
@@ -861,6 +867,9 @@ class BidirectionalRNNTest(tf.test.TestCase):
self.assertEqual(out[i][1][0], out[8 - 1 - i][1][3])
self.assertEqual(out[i][1][1], out[8 - 1 - i][1][4])
self.assertEqual(out[i][1][2], out[8 - 1 - i][1][5])
+ # Via the reasoning above, the forward and backward final state should be
+ # exactly the same
+ self.assertAllClose(s_fw, s_bw)
def testBidirectionalRNN(self):
self._testBidirectionalRNN(use_gpu=False, use_shape=False)
diff --git a/tensorflow/python/kernel_tests/seq2seq_test.py b/tensorflow/python/kernel_tests/seq2seq_test.py
index 77ff0571b7..a6f017f22f 100644
--- a/tensorflow/python/kernel_tests/seq2seq_test.py
+++ b/tensorflow/python/kernel_tests/seq2seq_test.py
@@ -495,6 +495,105 @@ class Seq2SeqTest(tf.test.TestCase):
if len(perplexities[bucket]) > 1: # Assert that perplexity went down.
self.assertLess(perplexities[bucket][-1], perplexities[bucket][0])
+ def testModelWithBooleanFeedPrevious(self):
+ """Test the model behavior when feed_previous is True.
+
+ For example, the following two cases have the same effect:
+ - Train `embedding_rnn_seq2seq` with `feed_previous=True`, which contains
+      an `embedding_rnn_decoder` with `feed_previous=True` and
+ `update_embedding_for_previous=True`. The decoder is fed with "<Go>"
+ and outputs "A, B, C".
+ - Train `embedding_rnn_seq2seq` with `feed_previous=False`. The decoder
+ is fed with "<Go>, A, B".
+ """
+ num_encoder_symbols = 3
+ num_decoder_symbols = 5
+ batch_size = 2
+ num_enc_timesteps = 2
+ num_dec_timesteps = 3
+
+ def TestModel(seq2seq):
+ with self.test_session(graph=tf.Graph()) as sess:
+ tf.set_random_seed(111)
+ random.seed(111)
+ np.random.seed(111)
+
+ enc_inp = [tf.constant(i + 1, tf.int32, shape=[batch_size])
+ for i in range(num_enc_timesteps)]
+ dec_inp_fp_true = [tf.constant(i, tf.int32, shape=[batch_size])
+ for i in range(num_dec_timesteps)]
+ dec_inp_holder_fp_false = [tf.placeholder(tf.int32, shape=[batch_size])
+ for _ in range(num_dec_timesteps)]
+ targets = [tf.constant(i + 1, tf.int32, shape=[batch_size])
+ for i in range(num_dec_timesteps)]
+ weights = [tf.constant(1.0, shape=[batch_size])
+ for i in range(num_dec_timesteps)]
+
+ def ForwardBackward(enc_inp, dec_inp, feed_previous):
+ scope_name = "fp_{}".format(feed_previous)
+ with tf.variable_scope(scope_name):
+ dec_op, _ = seq2seq(enc_inp, dec_inp, feed_previous=feed_previous)
+ net_variables = tf.get_collection(tf.GraphKeys.VARIABLES,
+ scope_name)
+ optimizer = tf.train.AdamOptimizer(0.03, epsilon=1e-5)
+ update_op = optimizer.minimize(
+ tf.nn.seq2seq.sequence_loss(dec_op, targets, weights),
+ var_list=net_variables)
+ return dec_op, update_op, net_variables
+
+ dec_op_fp_true, update_fp_true, variables_fp_true = ForwardBackward(
+ enc_inp, dec_inp_fp_true, feed_previous=True)
+ dec_op_fp_false, update_fp_false, variables_fp_false = ForwardBackward(
+ enc_inp, dec_inp_holder_fp_false, feed_previous=False)
+
+ sess.run(tf.initialize_all_variables())
+
+ # We only check consistencies between the variables existing in both
+ # the models with True and False feed_previous. Variables created by
+ # the loop_function in the model with True feed_previous are ignored.
+ v_false_name_dict = {v.name.split('/', 1)[-1]: v
+ for v in variables_fp_false}
+ matched_variables = [(v, v_false_name_dict[v.name.split('/', 1)[-1]])
+ for v in variables_fp_true]
+ for v_true, v_false in matched_variables:
+ sess.run(tf.assign(v_false, v_true))
+
+ # Take the symbols generated by the decoder with feed_previous=True as
+ # the true input symbols for the decoder with feed_previous=False.
+ dec_fp_true = sess.run(dec_op_fp_true)
+ output_symbols_fp_true = np.argmax(dec_fp_true, axis=2)
+ dec_inp_fp_false = np.vstack((dec_inp_fp_true[0].eval(),
+ output_symbols_fp_true[:-1]))
+ sess.run(update_fp_true)
+ sess.run(update_fp_false,
+ {holder: inp for holder, inp in zip(dec_inp_holder_fp_false,
+ dec_inp_fp_false)})
+
+ for v_true, v_false in matched_variables:
+ self.assertAllClose(v_true.eval(), v_false.eval())
+
+ def EmbeddingRNNSeq2SeqF(enc_inp, dec_inp, feed_previous):
+ cell = tf.nn.rnn_cell.BasicLSTMCell(2)
+ return tf.nn.seq2seq.embedding_rnn_seq2seq(
+ enc_inp, dec_inp, cell, num_encoder_symbols,
+ num_decoder_symbols, feed_previous=feed_previous)
+
+ def EmbeddingTiedRNNSeq2Seq(enc_inp, dec_inp, feed_previous):
+ cell = tf.nn.rnn_cell.BasicLSTMCell(2)
+ return tf.nn.seq2seq.embedding_tied_rnn_seq2seq(
+ enc_inp, dec_inp, cell, num_decoder_symbols,
+ feed_previous=feed_previous)
+
+ def EmbeddingAttentionSeq2Seq(enc_inp, dec_inp, feed_previous):
+ cell = tf.nn.rnn_cell.BasicLSTMCell(2)
+ return tf.nn.seq2seq.embedding_attention_seq2seq(
+ enc_inp, dec_inp, cell, num_encoder_symbols,
+ num_decoder_symbols, feed_previous=feed_previous)
+
+ for model in (EmbeddingRNNSeq2SeqF, EmbeddingTiedRNNSeq2Seq,
+ EmbeddingAttentionSeq2Seq):
+ TestModel(model)
+
if __name__ == "__main__":
tf.test.main()
diff --git a/tensorflow/python/kernel_tests/trace_op_test.py b/tensorflow/python/kernel_tests/trace_op_test.py
new file mode 100644
index 0000000000..1263252bad
--- /dev/null
+++ b/tensorflow/python/kernel_tests/trace_op_test.py
@@ -0,0 +1,71 @@
+# Copyright 2015 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy
+import tensorflow as tf
+
+
+class TraceTest(tf.test.TestCase):
+
+ def setUp(self):
+    numpy.random.seed(0)
+
+ def traceOp(self, x, dtype, expected_ans, use_gpu=False):
+ with self.test_session(use_gpu=use_gpu):
+ tf_ans = tf.trace(x.astype(dtype))
+ out = tf_ans.eval()
+ self.assertAllClose(out, expected_ans)
+
+ def testEmptyTensor(self):
+ x = numpy.array([])
+ self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)
+
+ def testRankOneTensor(self):
+    x = numpy.array([1, 2, 3])
+ self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)
+
+ def testRankTwoIntTensor(self):
+ x = numpy.array(
+ [[1, 0, 0],
+ [0, 2, 0],
+ [0, 0, 3]])
+ expected_ans = 6
+ self.traceOp(x, numpy.int32, expected_ans)
+ self.traceOp(x, numpy.int64, expected_ans)
+
+ def testRankTwoFloatTensor(self):
+ x = numpy.array(
+ [[1.1, 0, 0],
+ [0, 2.2, 0],
+ [0, 0, 3.3]])
+ expected_ans = 6.6
+ self.traceOp(x, numpy.float32, expected_ans)
+ self.traceOp(x, numpy.float64, expected_ans)
+
+ def testRankThreeFloatTensor(self):
+ x = numpy.random.rand(2, 2, 2)
+ self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)
+
+ def testRankFourFloatTensor(self):
+ x = numpy.random.rand(2, 2, 2, 2)
+ self.assertRaises(ValueError, self.traceOp, x, numpy.float32, 0)
+
+
+if __name__ == "__main__":
+ tf.test.main()
diff --git a/tensorflow/python/ops/array_ops.py b/tensorflow/python/ops/array_ops.py
index 9c43d4fac3..44efdb538b 100644
--- a/tensorflow/python/ops/array_ops.py
+++ b/tensorflow/python/ops/array_ops.py
@@ -846,6 +846,35 @@ def _DiagShape(op):
input_shape = op.inputs[0].get_shape().with_rank_at_most(3)
return [input_shape.concatenate(input_shape)]
+@ops.RegisterShape("DiagPart")
+def _DiagPartShape(op):
+ """Shape function for array_ops.diag_part.
+
+  This op has one input (of rank k = 2, 4, or 6) and one output (of rank k/2),
+  whose shape is the first half of the input shape; the input shape must be
+  that half repeated twice.
+
+ Args:
+ op: A DiagPart Operation.
+
+ Returns:
+ A single-element list containing the shape of the output.
+
+ Raises:
+    ValueError: If the input has odd rank, or rank greater than 6.
+
+ """
+ shape = op.inputs[0].get_shape()
+ rank = len(shape)
+ mid = rank // 2
+ if rank % 2 or rank > 6:
+ raise ValueError("Input must have even rank <= 6, input rank is " +
+                     str(rank) + ".")
+ if shape[:mid] != shape[mid:]:
+ raise ValueError("Invalid shape, shape[:mid] " + str(shape[:mid]) +
+ " and shape[mid:] " + str(shape[mid:]) +
+                     " do not match.")
+ input_shape = shape.with_rank_at_most(6)
+ return [input_shape[:len(input_shape) // 2]]
@ops.RegisterShape("ExpandDims")
def _ExpandDimsShape(op):
@@ -1360,7 +1389,7 @@ def _SpaceToDepthShape(op):
* input: a tensor of shape like that [B, H, W, D]
* block_size: an int.
- Its output is the the same-rank tensor but with changed
+ Its output is the same-rank tensor but with changed
dimensions like that: [B, H/block_size, W/block_size, D*block_size*block_size]
Args:
@@ -1408,7 +1437,7 @@ def _DepthToSpaceShape(op):
* input: a tensor of shape like that [B, H, W, D]
* block_size: an int.
- Its output is the the same-rank tensor but with changed
+ Its output is the same-rank tensor but with changed
dimensions like that:
[B, H*block_size, W*block_size, D/(block_size*block_size)]
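# A minimal usage sketch for the DiagPart shape rule registered above: for a
# rank-2k input whose two halves of dimensions match, tf.diag_part returns the
# rank-k diagonal and inverts tf.diag. The session-style usage mirrors the tests.
import numpy as np
import tensorflow as tf

x = np.arange(6.0).reshape(2, 3)
with tf.Session() as sess:
  d = tf.diag(x)            # rank 4, shape (2, 3, 2, 3)
  back = tf.diag_part(d)    # rank 2, shape (2, 3); recovers x
  print(sess.run(back))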
diff --git a/tensorflow/python/ops/image_ops.py b/tensorflow/python/ops/image_ops.py
index 25048181fc..bebc011360 100644
--- a/tensorflow/python/ops/image_ops.py
+++ b/tensorflow/python/ops/image_ops.py
@@ -308,6 +308,7 @@ def flip_left_right(image):
Raises:
ValueError: if the shape of `image` not supported.
"""
+ image = ops.convert_to_tensor(image, name='image')
_Check3DImage(image, require_static=False)
return array_ops.reverse(image, [False, True, False])
@@ -329,6 +330,7 @@ def flip_up_down(image):
Raises:
ValueError: if the shape of `image` not supported.
"""
+ image = ops.convert_to_tensor(image, name='image')
_Check3DImage(image, require_static=False)
return array_ops.reverse(image, [True, False, False])
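# An illustrative sketch of what the convert_to_tensor calls added above permit
# (assuming that is their intent): the flip functions can now be fed a plain
# numpy array or nested list, not only a Tensor.
import numpy as np
import tensorflow as tf

img = np.arange(12, dtype=np.float32).reshape(2, 2, 3)  # height x width x channels
with tf.Session() as sess:
  flipped = sess.run(tf.image.flip_left_right(img))     # width axis reversed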
diff --git a/tensorflow/python/ops/image_ops_test.py b/tensorflow/python/ops/image_ops_test.py
index 51c49e9da3..b4f4f87d06 100644
--- a/tensorflow/python/ops/image_ops_test.py
+++ b/tensorflow/python/ops/image_ops_test.py
@@ -741,7 +741,14 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
image_ops.ResizeMethod.AREA]
TYPES = [np.uint8, np.int8, np.int16, np.int32, np.int64,
- np.float, np.double]
+ np.float32, np.float64]
+
+ def availableGPUModes(self, opt, nptype):
+    if (opt == image_ops.ResizeMethod.NEAREST_NEIGHBOR
+        and nptype in [np.float32, np.float64]):
+ return [True, False]
+ else:
+ return [False]
def testNoOp(self):
img_shape = [1, 6, 4, 1]
@@ -761,13 +768,14 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
img_np = np.array(data, dtype=nptype).reshape(img_shape)
for opt in self.OPTIONS:
- with self.test_session() as sess:
- image = constant_op.constant(img_np, shape=img_shape)
- y = image_ops.resize_images(image, target_height, target_width, opt)
- yshape = array_ops.shape(y)
- resized, newshape = sess.run([y, yshape])
- self.assertAllEqual(img_shape, newshape)
- self.assertAllClose(resized, img_np, atol=1e-5)
+ for use_gpu in self.availableGPUModes(opt, nptype):
+ with self.test_session(use_gpu=use_gpu) as sess:
+ image = constant_op.constant(img_np, shape=img_shape)
+ y = image_ops.resize_images(image, target_height, target_width, opt)
+ yshape = array_ops.shape(y)
+ resized, newshape = sess.run([y, yshape])
+ self.assertAllEqual(img_shape, newshape)
+ self.assertAllClose(resized, img_np, atol=1e-5)
# Resizing with a single image must leave the shape unchanged also.
with self.test_session():
@@ -857,12 +865,13 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
img_np = np.array(data, dtype=nptype).reshape(img_shape)
for opt in self.OPTIONS:
- with self.test_session():
- image = constant_op.constant(img_np, shape=img_shape)
- y = image_ops.resize_images(image, target_height, target_width, opt)
- expected = np.array(expected_data).reshape(target_shape)
- resized = y.eval()
- self.assertAllClose(resized, expected, atol=1e-5)
+ for use_gpu in self.availableGPUModes(opt, nptype):
+ with self.test_session(use_gpu=use_gpu):
+ image = constant_op.constant(img_np, shape=img_shape)
+ y = image_ops.resize_images(image, target_height, target_width, opt)
+ expected = np.array(expected_data).reshape(target_shape)
+ resized = y.eval()
+ self.assertAllClose(resized, expected, atol=1e-5)
def testResizeUp(self):
img_shape = [1, 3, 2, 1]
@@ -899,14 +908,15 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
image_ops.ResizeMethod.BILINEAR,
image_ops.ResizeMethod.NEAREST_NEIGHBOR,
image_ops.ResizeMethod.AREA]:
- with self.test_session():
- img_np = np.array(data, dtype=nptype).reshape(img_shape)
- image = constant_op.constant(img_np, shape=img_shape)
- y = image_ops.resize_images(image, target_height, target_width, opt)
- resized = y.eval()
- expected = np.array(expected_data[opt]).reshape(
- [1, target_height, target_width, 1])
- self.assertAllClose(resized, expected, atol=1e-05)
+ for use_gpu in self.availableGPUModes(opt, nptype):
+ with self.test_session(use_gpu=use_gpu):
+ img_np = np.array(data, dtype=nptype).reshape(img_shape)
+ image = constant_op.constant(img_np, shape=img_shape)
+ y = image_ops.resize_images(image, target_height, target_width, opt)
+ resized = y.eval()
+ expected = np.array(expected_data[opt]).reshape(
+ [1, target_height, target_width, 1])
+ self.assertAllClose(resized, expected, atol=1e-05)
def testResizeUpBicubic(self):
img_shape = [1, 6, 6, 1]
@@ -964,6 +974,28 @@ class ResizeImagesTest(test_util.TensorFlowTestCase):
self.assertAllClose(resized, expected, atol=1)
+ def testCompareNearestNeighbor(self):
+ input_shape = [1, 5, 6, 3]
+ target_height = 8
+ target_width = 12
+ for nptype in [np.float32, np.float64]:
+ for align_corners in [True, False]:
+ img_np = np.arange(0, np.prod(input_shape), dtype=nptype).reshape(input_shape)
+ with self.test_session(use_gpu=True):
+ image = constant_op.constant(img_np, shape=input_shape)
+ out_op = image_ops.resize_images(image, target_height, target_width,
+ image_ops.ResizeMethod.NEAREST_NEIGHBOR,
+ align_corners=align_corners)
+ gpu_val = out_op.eval()
+ with self.test_session(use_gpu=False):
+ image = constant_op.constant(img_np, shape=input_shape)
+ out_op = image_ops.resize_images(image, target_height, target_width,
+ image_ops.ResizeMethod.NEAREST_NEIGHBOR,
+ align_corners=align_corners)
+ cpu_val = out_op.eval()
+ self.assertAllClose(cpu_val, gpu_val, rtol=1e-5, atol=1e-5)
+
+
class ResizeImageWithCropOrPadTest(test_util.TensorFlowTestCase):
def _ResizeImageWithCropOrPad(self, original, original_shape,
diff --git a/tensorflow/python/ops/math_ops.py b/tensorflow/python/ops/math_ops.py
index 6c388ae9b2..3ba9d509fd 100644
--- a/tensorflow/python/ops/math_ops.py
+++ b/tensorflow/python/ops/math_ops.py
@@ -63,6 +63,8 @@ TensorFlow provides several operations that you can use to add basic
mathematical functions for matrices to your graph.
@@diag
+@@diag_part
+@@trace
@@transpose
@@matmul
@@ -921,6 +923,39 @@ def reduce_any(input_tensor, reduction_indices=None, keep_dims=False,
keep_dims, name=name)
+def trace(x, name=None):
+ """ Compute the trace of a tensor `x`.
+
+ `trace(x)` returns the sum of along the diagonal.
+
+ For example:
+
+ ```python
+ # 'x' is [[1, 1],
+ # [1, 1]]
+ tf.trace(x) ==> 2
+
+ # 'x' is [[1,2,3],
+ # [4,5,6],
+ # [7,8,9]]
+ tf.trace(x) ==> 15
+ ```
+
+ Args:
+    x: A 2-D tensor.
+    name: A name for the operation (optional).
+
+  Returns:
+    The trace of `x`.
+ """
+ with ops.op_scope([x], name, "Trace") as name:
+ x = ops.convert_to_tensor(x, name="x")
+ if len(x.get_shape()) != 2:
+ raise ValueError("Expected a tensor with rank 2, rank %d tensor received"
+ % len(x.get_shape()))
+ return reduce_sum(array_ops.diag_part(x), name=name)
+
+
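# A small sketch of the new trace op checked against numpy, using the same kind
# of session setup as the tests in this change; the values are illustrative.
import numpy as np
import tensorflow as tf

x = np.array([[1.1, 0.0, 0.0],
              [0.0, 2.2, 0.0],
              [0.0, 0.0, 3.3]])
with tf.Session() as sess:
  print(sess.run(tf.trace(x)), np.trace(x))  # both 6.6, up to float rounding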
def matmul(a, b,
transpose_a=False, transpose_b=False,
a_is_sparse=False, b_is_sparse=False,
diff --git a/tensorflow/python/ops/nn_ops.py b/tensorflow/python/ops/nn_ops.py
index 08f2a59b63..f7891bb2d0 100644
--- a/tensorflow/python/ops/nn_ops.py
+++ b/tensorflow/python/ops/nn_ops.py
@@ -194,7 +194,7 @@ def softmax_cross_entropy_with_logits(logits, labels, name=None):
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.
- **NOTE:**: While the classes are mutually exclusive, their probabilities
+ **NOTE:** While the classes are mutually exclusive, their probabilities
need not be. All that is required is that each row of `labels` is
a valid probability distribution. If using exclusive `labels`
(wherein one and only one class is true at a time), see
@@ -231,7 +231,7 @@ def sparse_softmax_cross_entropy_with_logits(logits, labels, name=None):
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.
- **NOTE:**: For this operation, the probability of a given label is considered
+ **NOTE:** For this operation, the probability of a given label is considered
exclusive. That is, soft classes are not allowed, and the `labels` vector
must provide a single specific index for the true class for each row of
`logits` (each minibatch entry). For soft softmax classification with
diff --git a/tensorflow/python/ops/rnn.py b/tensorflow/python/ops/rnn.py
index ad916b6b5f..611f5fa314 100644
--- a/tensorflow/python/ops/rnn.py
+++ b/tensorflow/python/ops/rnn.py
@@ -312,9 +312,11 @@ def bidirectional_rnn(cell_fw, cell_bw, inputs,
scope: VariableScope for the created subgraph; defaults to "BiRNN"
Returns:
- A set of output `Tensors` where:
+ A tuple (outputs, output_state_fw, output_state_bw) where:
outputs is a length T list of outputs (one for each input), which
are depth-concatenated forward and backward outputs
+ output_state_fw is the final state of the forward rnn
+ output_state_bw is the final state of the backward rnn
Raises:
TypeError: If "cell_fw" or "cell_bw" is not an instance of RNNCell.
@@ -333,19 +335,19 @@ def bidirectional_rnn(cell_fw, cell_bw, inputs,
name = scope or "BiRNN"
# Forward direction
with vs.variable_scope(name + "_FW") as fw_scope:
- output_fw, _ = rnn(cell_fw, inputs, initial_state_fw, dtype,
+ output_fw, output_state_fw = rnn(cell_fw, inputs, initial_state_fw, dtype,
sequence_length, scope=fw_scope)
# Backward direction
with vs.variable_scope(name + "_BW") as bw_scope:
- tmp, _ = rnn(cell_bw, _reverse_seq(inputs, sequence_length),
+ tmp, output_state_bw = rnn(cell_bw, _reverse_seq(inputs, sequence_length),
initial_state_bw, dtype, sequence_length, scope=bw_scope)
output_bw = _reverse_seq(tmp, sequence_length)
# Concat each of the forward/backward outputs
outputs = [array_ops.concat(1, [fw, bw])
for fw, bw in zip(output_fw, output_bw)]
- return outputs
+ return (outputs, output_state_fw, output_state_bw)
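# A minimal usage sketch of the new three-value return from bidirectional_rnn
# (outputs plus the final forward and backward states); the cell size, input
# depth, and number of steps below are illustrative assumptions.
import tensorflow as tf

cell_fw = tf.nn.rnn_cell.BasicLSTMCell(4)
cell_bw = tf.nn.rnn_cell.BasicLSTMCell(4)
inputs = [tf.placeholder(tf.float32, shape=[None, 3]) for _ in range(5)]
outputs, state_fw, state_bw = tf.nn.bidirectional_rnn(
    cell_fw, cell_bw, inputs, dtype=tf.float32)
# Each element of outputs has depth 2 * 4: forward and backward concatenated.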
def dynamic_rnn(cell, inputs, sequence_length=None, initial_state=None,
diff --git a/tensorflow/python/ops/seq2seq.py b/tensorflow/python/ops/seq2seq.py
index 2301679b2e..7df123ef70 100644
--- a/tensorflow/python/ops/seq2seq.py
+++ b/tensorflow/python/ops/seq2seq.py
@@ -73,6 +73,34 @@ from tensorflow.python.ops import rnn_cell
from tensorflow.python.ops import variable_scope
+def _extract_argmax_and_embed(embedding, output_projection=None,
+ update_embedding=True):
+ """Get a loop_function that extracts the previous symbol and embeds it.
+
+ Args:
+ embedding: embedding tensor for symbols.
+ output_projection: None or a pair (W, B). If provided, each fed previous
+      output will first be multiplied by W and then have B added.
+ update_embedding: Boolean; if False, the gradients will not propagate
+ through the embeddings.
+
+ Returns:
+ A loop function.
+ """
+ def loop_function(prev, _):
+ if output_projection is not None:
+ prev = nn_ops.xw_plus_b(
+ prev, output_projection[0], output_projection[1])
+ prev_symbol = math_ops.argmax(prev, 1)
+ # Note that gradients will not propagate through the second parameter of
+ # embedding_lookup.
+ emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
+ if not update_embedding:
+ emb_prev = array_ops.stop_gradient(emb_prev)
+ return emb_prev
+ return loop_function
+
+
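# A rough sketch of the loop_function contract the decoders below expect: given
# the previous output (logits) and the time step, return the next input tensor.
# The symbol count and embedding size here are illustrative assumptions.
import tensorflow as tf

embedding = tf.get_variable("sketch_embedding", [10, 8])  # 10 symbols, size 8

def greedy_loop_function(prev, _):
  prev_symbol = tf.argmax(prev, 1)                        # most likely symbol
  return tf.nn.embedding_lookup(embedding, prev_symbol)   # embed as next input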
def rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None,
scope=None):
"""RNN decoder for the sequence-to-sequence model.
@@ -107,14 +135,13 @@ def rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None,
for i, inp in enumerate(decoder_inputs):
if loop_function is not None and prev is not None:
with variable_scope.variable_scope("loop_function", reuse=True):
- # We do not propagate gradients over the loop function.
- inp = array_ops.stop_gradient(loop_function(prev, i))
+ inp = loop_function(prev, i)
if i > 0:
variable_scope.get_variable_scope().reuse_variables()
output, state = cell(inp, state)
outputs.append(output)
if loop_function is not None:
- prev = array_ops.stop_gradient(output)
+ prev = output
return outputs, state
@@ -182,7 +209,7 @@ def tied_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
def embedding_rnn_decoder(decoder_inputs, initial_state, cell, num_symbols,
output_projection=None, feed_previous=False,
- scope=None):
+ update_embedding_for_previous=True, scope=None):
"""RNN decoder with embedding and a pure-decoding option.
Args:
@@ -200,6 +227,11 @@ def embedding_rnn_decoder(decoder_inputs, initial_state, cell, num_symbols,
In effect, this implements a greedy decoder. It can also be used
during training to emulate http://arxiv.org/abs/1506.03099.
If False, decoder_inputs are used as given (the standard decoder case).
+ update_embedding_for_previous: Boolean; if False and feed_previous=True,
+ only the embedding for the first symbol of decoder_inputs (the "GO"
+ symbol) will be updated by back propagation. Embeddings for the symbols
+ generated from the decoder itself remain unchanged. This parameter has
+ no effect if feed_previous=False.
scope: VariableScope for the created subgraph; defaults to
"embedding_rnn_decoder".
@@ -227,16 +259,9 @@ def embedding_rnn_decoder(decoder_inputs, initial_state, cell, num_symbols,
with ops.device("/cpu:0"):
embedding = variable_scope.get_variable("embedding",
[num_symbols, cell.input_size])
-
- def extract_argmax_and_embed(prev, _):
- """Loop_function that extracts the symbol from prev and embeds it."""
- if output_projection is not None:
- prev = nn_ops.xw_plus_b(
- prev, output_projection[0], output_projection[1])
- prev_symbol = array_ops.stop_gradient(math_ops.argmax(prev, 1))
- return embedding_ops.embedding_lookup(embedding, prev_symbol)
-
- loop_function = extract_argmax_and_embed if feed_previous else None
+ loop_function = _extract_argmax_and_embed(
+ embedding, output_projection,
+ update_embedding_for_previous) if feed_previous else None
emb_inp = (
embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs)
return rnn_decoder(emb_inp, initial_state, cell,
@@ -306,7 +331,8 @@ def embedding_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
outputs, state = embedding_rnn_decoder(
decoder_inputs, encoder_state, cell, num_decoder_symbols,
output_projection=output_projection,
- feed_previous=feed_previous_bool)
+ feed_previous=feed_previous_bool,
+ update_embedding_for_previous=False)
return outputs + [state]
outputs_and_state = control_flow_ops.cond(feed_previous,
@@ -372,25 +398,19 @@ def embedding_tied_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
emb_decoder_inputs = [embedding_ops.embedding_lookup(embedding, x)
for x in decoder_inputs]
- def extract_argmax_and_embed(prev, _):
- """Loop_function that extracts the symbol from prev and embeds it."""
- if output_projection is not None:
- prev = nn_ops.xw_plus_b(
- prev, output_projection[0], output_projection[1])
- prev_symbol = array_ops.stop_gradient(math_ops.argmax(prev, 1))
- return embedding_ops.embedding_lookup(embedding, prev_symbol)
-
if output_projection is None:
cell = rnn_cell.OutputProjectionWrapper(cell, num_symbols)
if isinstance(feed_previous, bool):
- loop_function = extract_argmax_and_embed if feed_previous else None
+ loop_function = _extract_argmax_and_embed(
+ embedding, output_projection, True) if feed_previous else None
return tied_rnn_seq2seq(emb_encoder_inputs, emb_decoder_inputs, cell,
loop_function=loop_function, dtype=dtype)
# If feed_previous is a Tensor, we construct 2 graphs and use cond.
def decoder(feed_previous_bool):
- loop_function = extract_argmax_and_embed if feed_previous_bool else None
+ loop_function = _extract_argmax_and_embed(
+ embedding, output_projection, False) if feed_previous_bool else None
reuse = None if feed_previous_bool else True
with variable_scope.variable_scope(variable_scope.get_variable_scope(),
reuse=reuse):
@@ -523,7 +543,7 @@ def attention_decoder(decoder_inputs, initial_state, attention_states, cell,
# If loop_function is set, we use it instead of decoder_inputs.
if loop_function is not None and prev is not None:
with variable_scope.variable_scope("loop_function", reuse=True):
- inp = array_ops.stop_gradient(loop_function(prev, i))
+ inp = loop_function(prev, i)
# Merge input and previous attentions into one vector of the right size.
x = rnn_cell.linear([inp] + attns, cell.input_size, True)
# Run the RNN.
@@ -539,8 +559,7 @@ def attention_decoder(decoder_inputs, initial_state, attention_states, cell,
with variable_scope.variable_scope("AttnOutputProjection"):
output = rnn_cell.linear([cell_output] + attns, output_size, True)
if loop_function is not None:
- # We do not propagate gradients over the loop function.
- prev = array_ops.stop_gradient(output)
+ prev = output
outputs.append(output)
return outputs, state
@@ -549,8 +568,10 @@ def attention_decoder(decoder_inputs, initial_state, attention_states, cell,
def embedding_attention_decoder(decoder_inputs, initial_state, attention_states,
cell, num_symbols, num_heads=1,
output_size=None, output_projection=None,
- feed_previous=False, dtype=dtypes.float32,
- scope=None, initial_state_attention=False):
+ feed_previous=False,
+ update_embedding_for_previous=True,
+ dtype=dtypes.float32, scope=None,
+ initial_state_attention=False):
"""RNN decoder with embedding and attention and a pure-decoding option.
Args:
@@ -571,6 +592,11 @@ def embedding_attention_decoder(decoder_inputs, initial_state, attention_states,
In effect, this implements a greedy decoder. It can also be used
during training to emulate http://arxiv.org/abs/1506.03099.
If False, decoder_inputs are used as given (the standard decoder case).
+ update_embedding_for_previous: Boolean; if False and feed_previous=True,
+ only the embedding for the first symbol of decoder_inputs (the "GO"
+ symbol) will be updated by back propagation. Embeddings for the symbols
+ generated from the decoder itself remain unchanged. This parameter has
+ no effect if feed_previous=False.
dtype: The dtype to use for the RNN initial states (default: tf.float32).
scope: VariableScope for the created subgraph; defaults to
"embedding_attention_decoder".
@@ -602,17 +628,9 @@ def embedding_attention_decoder(decoder_inputs, initial_state, attention_states,
with ops.device("/cpu:0"):
embedding = variable_scope.get_variable("embedding",
[num_symbols, cell.input_size])
-
- def extract_argmax_and_embed(prev, _):
- """Loop_function that extracts the symbol from prev and embeds it."""
- if output_projection is not None:
- prev = nn_ops.xw_plus_b(
- prev, output_projection[0], output_projection[1])
- prev_symbol = array_ops.stop_gradient(math_ops.argmax(prev, 1))
- emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
- return emb_prev
-
- loop_function = extract_argmax_and_embed if feed_previous else None
+ loop_function = _extract_argmax_and_embed(
+ embedding, output_projection,
+ update_embedding_for_previous) if feed_previous else None
emb_inp = [
embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs]
return attention_decoder(
@@ -700,6 +718,7 @@ def embedding_attention_seq2seq(encoder_inputs, decoder_inputs, cell,
num_decoder_symbols, num_heads=num_heads, output_size=output_size,
output_projection=output_projection,
feed_previous=feed_previous_bool,
+ update_embedding_for_previous=False,
initial_state_attention=initial_state_attention)
return outputs + [state]
diff --git a/tensorflow/python/platform/default/_gfile.py b/tensorflow/python/platform/default/_gfile.py
index 5272f78617..f700d34978 100644
--- a/tensorflow/python/platform/default/_gfile.py
+++ b/tensorflow/python/platform/default/_gfile.py
@@ -248,7 +248,7 @@ class _Nulllocker(object):
def Exists(path): # pylint: disable=invalid-name
- """Retruns True iff "path" exists (as a dir, file, non-broken symlink)."""
+ """Returns True iff "path" exists (as a dir, file, non-broken symlink)."""
return os.path.exists(path)
diff --git a/tensorflow/python/training/learning_rate_decay.py b/tensorflow/python/training/learning_rate_decay.py
index 7d4999921f..ab48d34782 100644
--- a/tensorflow/python/training/learning_rate_decay.py
+++ b/tensorflow/python/training/learning_rate_decay.py
@@ -50,9 +50,11 @@ def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
100000, 0.96, staircase=True)
- optimizer = tf.GradientDescentOptimizer(learning_rate)
# Passing global_step to minimize() will increment it at each step.
- optimizer.minimize(...my loss..., global_step=global_step)
+ learning_step = (
+ tf.GradientDescentOptimizer(learning_rate)
+ .minimize(...my loss..., global_step=global_step)
+ )
```
Args:
diff --git a/tensorflow/python/training/moving_averages_test.py b/tensorflow/python/training/moving_averages_test.py
index 7ec6b2597e..fd9891853a 100644
--- a/tensorflow/python/training/moving_averages_test.py
+++ b/tensorflow/python/training/moving_averages_test.py
@@ -218,7 +218,7 @@ class ExponentialMovingAverageTest(tf.test.TestCase):
self.assertDeviceEqual("/job:dev_v0", ema.average(v0).device)
self.assertDeviceEqual("/job:dev_v1", ema.average(v1).device)
# However, the colocation property is maintained.
- self.assertEqual(["loc:@v1"],
+ self.assertEqual([b"loc:@v1"],
ema.average(v1).op.colocation_groups())
self.assertDeviceEqual("/job:default", ema.average(tensor2).device)
diff --git a/tensorflow/python/training/optimizer.py b/tensorflow/python/training/optimizer.py
index 1e8d6b0f12..1c3ac2d09d 100644
--- a/tensorflow/python/training/optimizer.py
+++ b/tensorflow/python/training/optimizer.py
@@ -75,7 +75,7 @@ class Optimizer(object):
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
- capped_grads_and_vars = [(MyCapper(gv[0]), gv[1])) for gv in grads_and_vars]
+ capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
diff --git a/tensorflow/python/training/saver.py b/tensorflow/python/training/saver.py
index 8694dd694f..3648e5a040 100644
--- a/tensorflow/python/training/saver.py
+++ b/tensorflow/python/training/saver.py
@@ -404,8 +404,8 @@ class BaseSaverBuilder(object):
if slice_name is None:
slice_name = variable._save_slice_info.full_name
elif slice_name != variable._save_slice_info.full_name:
- raise variable("Slices must all be from the same tensor: %s != %s"
- % (slice_name, variable._save_slice_info.full_name))
+ raise ValueError("Slices must all be from the same tensor: %s != %s"
+ % (slice_name, variable._save_slice_info.full_name))
self._AddVarToSave(vars_to_save, seen_variables,
variable, variable._save_slice_info.spec, name)
# pylint: enable=protected-access
diff --git a/tensorflow/tensorboard/backend/server_test.py b/tensorflow/tensorboard/backend/server_test.py
index aceaced26c..42c2aafe21 100644
--- a/tensorflow/tensorboard/backend/server_test.py
+++ b/tensorflow/tensorboard/backend/server_test.py
@@ -26,10 +26,10 @@ import gzip
import json
import os
import shutil
-import StringIO
import threading
import zlib
+from six import BytesIO
from six.moves import http_client
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
@@ -80,9 +80,9 @@ class TensorboardServerTest(tf.test.TestCase):
content = response.read()
if encoding in ('gzip', 'x-gzip', 'deflate'):
if encoding == 'deflate':
- data = StringIO.StringIO(zlib.decompress(content))
+ data = BytesIO(zlib.decompress(content))
else:
- data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
+ data = gzip.GzipFile('', 'rb', 9, BytesIO(content))
content = data.read()
return content
@@ -201,7 +201,7 @@ class TensorboardServerTest(tf.test.TestCase):
node1.name = 'a'
node2 = graph_def.node.add()
node2.name = 'b'
- node2.attr['very_large_attr'].s = 'a' * 2048 # 2 KB attribute
+ node2.attr['very_large_attr'].s = b'a' * 2048 # 2 KB attribute
writer.add_event(tf.Event(graph_def=graph_def.SerializeToString()))
# 1x1 transparent GIF.
diff --git a/tensorflow/tensorboard/components/tf-dashboard-common/tf-url-generator.html b/tensorflow/tensorboard/components/tf-dashboard-common/tf-url-generator.html
index 1207f40249..5f75958f25 100644
--- a/tensorflow/tensorboard/components/tf-dashboard-common/tf-url-generator.html
+++ b/tensorflow/tensorboard/components/tf-dashboard-common/tf-url-generator.html
@@ -31,7 +31,7 @@
},
};
TF.Urls.routes.forEach(function(route) {
- /* for each route (other than runs, handled seperately):
+ /* for each route (other than runs, handled separately):
* out`RouteName`: {
* type: Function,
* readOnly: true,
diff --git a/tensorflow/tensorboard/components/tf-graph-common/lib/render.ts b/tensorflow/tensorboard/components/tf-graph-common/lib/render.ts
index 96d485b0ed..fa0ee99d19 100644
--- a/tensorflow/tensorboard/components/tf-graph-common/lib/render.ts
+++ b/tensorflow/tensorboard/components/tf-graph-common/lib/render.ts
@@ -548,7 +548,7 @@ export class RenderGraphInfo {
// (which ignores control edges) and seeing that Z comes AFTER A.
//
// The property of being backwards is independent of whether the edge
- // is inbound or outbound. In the preceeding example, if we were building
+ // is inbound or outbound. In the preceding example, if we were building
// the subhierarchy for Z, we'd find bridge edge Z/Y=>A, walk to its
// topmost adjoining metaedge Z=>A and discover that it's backwards.
let backwards = false;
@@ -656,7 +656,7 @@ export class RenderGraphInfo {
// one edge in the bridgegraph from Z->A/C.
//
// At this point, we've added a container bridge node IN to house all
- // incoming bridge nodes. We'v alse added a bridge node Z' (with parent IN)
+ // incoming bridge nodes. We've also added a bridge node Z' (with parent IN)
// to A, and a bridge edge from Z'->C.
//
// +----------------------+
@@ -1059,7 +1059,7 @@ export class RenderMetaedgeInfo {
metaedge: Metaedge;
/**
- * Reference to the adjoining RenderMeteaedgeInfo from the parent's
+ * Reference to the adjoining RenderMetaedgeInfo from the parent's
* coreGraph. This is used during layout to determine the point at which this
* edge should touch the node's bounding box. This property will be null for
* edges which terminate at a node on both ends (all non-bridge edges).
@@ -1069,7 +1069,7 @@ export class RenderMetaedgeInfo {
/**
* Most of the time, a RenderMetaedgeInfo object represents a real
* edge between nodes in the underlying graph structure. But sometimes, an
- * edge only exsts for layout purposes. These structural edges are added
+ * edge only exists for layout purposes. These structural edges are added
* during buildSubhierarchy() to force dagre.layout() to put bridge nodes
* at the ends of the flow.
* @see buildSubhierarchy()
@@ -1291,7 +1291,7 @@ function hasTypeIn(node: Node, types: string[]): boolean {
return false;
}
-/** Move nodes that are speficied to be excluded out of the core graph. */
+/** Move nodes that are specified to be excluded out of the core graph. */
function extractSpecifiedNodes(renderNode: RenderGroupNodeInfo,
params: RenderGraphParams) {
let graph = renderNode.coreGraph;
diff --git a/tensorflow/tensorboard/components/tf-graph-common/lib/scene/annotation.ts b/tensorflow/tensorboard/components/tf-graph-common/lib/scene/annotation.ts
index b601ee84a9..b48d62c346 100644
--- a/tensorflow/tensorboard/components/tf-graph-common/lib/scene/annotation.ts
+++ b/tensorflow/tensorboard/components/tf-graph-common/lib/scene/annotation.ts
@@ -207,7 +207,7 @@ function update(aGroup, d: render.RenderNodeInfo, a: render.Annotation,
});
// Some annotations (such as summary) are represented using a 12x12 image tag.
- // Purposely ommited units (e.g. pixels) since the images are vector graphics.
+ // Purposely omitted units (e.g. pixels) since the images are vector graphics.
// If there is an image, we adjust the location of the image to be vertically
// centered with the node and horizontally centered between the arrow and the
// text label.
diff --git a/tensorflow/tensorboard/components/tf-graph-common/lib/scene/node.ts b/tensorflow/tensorboard/components/tf-graph-common/lib/scene/node.ts
index a08613d615..f2e73976ff 100644
--- a/tensorflow/tensorboard/components/tf-graph-common/lib/scene/node.ts
+++ b/tensorflow/tensorboard/components/tf-graph-common/lib/scene/node.ts
@@ -493,6 +493,7 @@ function position(nodeGroup, d: render.RenderNodeInfo) {
scene.positionRect(shape, cx, d.y, d.coreBox.width, d.coreBox.height);
labelPosition(nodeGroup, cx, d.y, d.labelOffset);
}
+ break;
}
case NodeType.BRIDGE: {
// position shape
diff --git a/tensorflow/tensorboard/components/tf-graph/tf-graph-icon.html b/tensorflow/tensorboard/components/tf-graph/tf-graph-icon.html
index 765803a6a9..180613b4e4 100644
--- a/tensorflow/tensorboard/components/tf-graph/tf-graph-icon.html
+++ b/tensorflow/tensorboard/components/tf-graph/tf-graph-icon.html
@@ -92,7 +92,7 @@
/**
* String indicating the type of coloring to use for this node, used
- * only for deterimining the fill.
+ * only for determining the fill.
*/
colorBy: {
type: Object,
@@ -100,7 +100,7 @@
},
/**
- * Function used by structural coloring algorithim to determine which
+ * Function used by structural coloring algorithm to determine which
* color to use based on the template ID of the node. Optional.
*/
templateIndex: {
diff --git a/tensorflow/tensorboard/dist/tf-tensorboard.html b/tensorflow/tensorboard/dist/tf-tensorboard.html
index a5d28c7c41..3fcd43c7d5 100644
--- a/tensorflow/tensorboard/dist/tf-tensorboard.html
+++ b/tensorflow/tensorboard/dist/tf-tensorboard.html
@@ -6872,6 +6872,7 @@ var tf;
scene.positionRect(shape, cx, d.y, d.coreBox.width, d.coreBox.height);
labelPosition(nodeGroup, cx, d.y, d.labelOffset);
}
+ break;
}
case graph.NodeType.BRIDGE: {
// position shape
diff --git a/tensorflow/tensorboard/lib/js/requestManager/requestManager.ts b/tensorflow/tensorboard/lib/js/requestManager/requestManager.ts
index f63d1900d1..fd0f7abfba 100644
--- a/tensorflow/tensorboard/lib/js/requestManager/requestManager.ts
+++ b/tensorflow/tensorboard/lib/js/requestManager/requestManager.ts
@@ -21,7 +21,7 @@ module TF.Backend {
* more urls are requested than can be handled at once. The queue can be cleared.
*
* When a request is made, a Promise is returned which resolves with the parsed
- * JSON rseult from the reqest.
+ * JSON result from the request.
*/
export class RequestCancellationError extends Error {
diff --git a/tensorflow/tensorboard/lib/js/requestManager/test/requestManagerTest.ts b/tensorflow/tensorboard/lib/js/requestManager/test/requestManagerTest.ts
index 04c2830f7a..aeef4f65a2 100644
--- a/tensorflow/tensorboard/lib/js/requestManager/test/requestManagerTest.ts
+++ b/tensorflow/tensorboard/lib/js/requestManager/test/requestManagerTest.ts
@@ -137,8 +137,8 @@ module TF.Backend {
/* This test is a bit tricky.
* We want to verify that the RequestManager queue has LIFO semantics.
* So we construct three requests off the bat: A, B, C.
- * So LIFO semantis ensure these will resolve in order A, C, B.
- * (Beacuse the A request launches immediately when we create it, it's not in queue)
+ * So LIFO semantics ensure these will resolve in order A, C, B.
+ * (Because the A request launches immediately when we create it, it's not in queue)
* Then after resolving A, C moves out of queue, and we create X.
* So expected final order is A, C, X, B.
* We verify this with an external var that counts how many requests were resolved.
diff --git a/tensorflow/tools/ci_build/Dockerfile.android b/tensorflow/tools/ci_build/Dockerfile.android
index 0bffe80fcb..444ce17d98 100644
--- a/tensorflow/tools/ci_build/Dockerfile.android
+++ b/tensorflow/tools/ci_build/Dockerfile.android
@@ -3,11 +3,10 @@ FROM ubuntu:14.04
MAINTAINER Jan Prach <jendap@google.com>
# Copy and run the install scripts.
-COPY install/install_deb_packages.sh /install/install_deb_packages.sh
+COPY install/*.sh /install/
+RUN /install/install_bootstrap_deb_packages.sh
+RUN add-apt-repository -y ppa:openjdk-r/ppa
RUN /install/install_deb_packages.sh
-COPY install/install_openjdk8_from_ppa.sh /install/install_openjdk8_from_ppa.sh
-RUN /install/install_openjdk8_from_ppa.sh
-COPY install/install_bazel.sh /install/install_bazel.sh
RUN /install/install_bazel.sh
# Set up bazelrc.
diff --git a/tensorflow/tools/ci_build/Dockerfile.cpu b/tensorflow/tools/ci_build/Dockerfile.cpu
index 7bef5e07fe..acc84f136a 100644
--- a/tensorflow/tools/ci_build/Dockerfile.cpu
+++ b/tensorflow/tools/ci_build/Dockerfile.cpu
@@ -3,11 +3,10 @@ FROM ubuntu:14.04
MAINTAINER Jan Prach <jendap@google.com>
# Copy and run the install scripts.
-COPY install/install_deb_packages.sh /install/install_deb_packages.sh
+COPY install/*.sh /install/
+RUN /install/install_bootstrap_deb_packages.sh
+RUN add-apt-repository -y ppa:openjdk-r/ppa
RUN /install/install_deb_packages.sh
-COPY install/install_openjdk8_from_ppa.sh /install/install_openjdk8_from_ppa.sh
-RUN /install/install_openjdk8_from_ppa.sh
-COPY install/install_bazel.sh /install/install_bazel.sh
RUN /install/install_bazel.sh
# Set up bazelrc.
diff --git a/tensorflow/tools/ci_build/Dockerfile.debian.jessie.cpu b/tensorflow/tools/ci_build/Dockerfile.debian.jessie.cpu
new file mode 100644
index 0000000000..fc37a5bb28
--- /dev/null
+++ b/tensorflow/tools/ci_build/Dockerfile.debian.jessie.cpu
@@ -0,0 +1,14 @@
+FROM debian:jessie
+
+MAINTAINER Jan Prach <jendap@google.com>
+
+# Copy and run the install scripts.
+COPY install/*.sh /install/
+RUN /install/install_bootstrap_deb_packages.sh
+RUN echo "deb http://http.debian.net/debian jessie-backports main" | tee -a /etc/apt/sources.list
+RUN /install/install_deb_packages.sh
+RUN /install/install_bazel.sh
+
+# Set up bazelrc.
+COPY install/.bazelrc /root/.bazelrc
+ENV BAZELRC /root/.bazelrc
diff --git a/tensorflow/tools/ci_build/Dockerfile.gpu b/tensorflow/tools/ci_build/Dockerfile.gpu
index b57d1d18c1..b4b0ccccf7 100644
--- a/tensorflow/tools/ci_build/Dockerfile.gpu
+++ b/tensorflow/tools/ci_build/Dockerfile.gpu
@@ -1,13 +1,12 @@
-FROM nvidia/cuda:7.0-cudnn2-devel
+FROM nvidia/cuda:7.5-cudnn4-devel
MAINTAINER Jan Prach <jendap@google.com>
# Copy and run the install scripts.
-COPY install/install_deb_packages.sh /install/install_deb_packages.sh
+COPY install/*.sh /install/
+RUN /install/install_bootstrap_deb_packages.sh
+RUN add-apt-repository -y ppa:openjdk-r/ppa
RUN /install/install_deb_packages.sh
-COPY install/install_openjdk8_from_ppa.sh /install/install_openjdk8_from_ppa.sh
-RUN /install/install_openjdk8_from_ppa.sh
-COPY install/install_bazel.sh /install/install_bazel.sh
RUN /install/install_bazel.sh
# Set up bazelrc.
diff --git a/tensorflow/tools/ci_build/README.md b/tensorflow/tools/ci_build/README.md
index 90ede0b60c..aca5829b3c 100644
--- a/tensorflow/tools/ci_build/README.md
+++ b/tensorflow/tools/ci_build/README.md
@@ -73,7 +73,7 @@ tensorflow/tools/ci_build/ci_build.sh CPU bazel test //tensorflow/...
tensorflow/tools/ci_build/ci_build.sh GPU bazel build -c opt --config=cuda //tensorflow/...
# build pip with gpu support
-tensorflow/tools/ci_build/ci_build.sh GPU tensorflow/tools/ci_build/builds/gpu_pip.sh
+tensorflow/tools/ci_build/ci_build.sh GPU tensorflow/tools/ci_build/builds/pip.sh GPU
# build android example app
tensorflow/tools/ci_build/ci_build.sh ANDROID tensorflow/tools/ci_build/builds/android.sh
diff --git a/tensorflow/tools/ci_build/builds/configured b/tensorflow/tools/ci_build/builds/configured
index d452eac65e..297937e24e 100755
--- a/tensorflow/tools/ci_build/builds/configured
+++ b/tensorflow/tools/ci_build/builds/configured
@@ -32,7 +32,9 @@ else
export TF_NEED_CUDA=0
fi
+pushd "${CI_TENSORFLOW_SUBMODULE_PATH:-.}"
./configure
+popd
# Gather and print build information
SCRIPT_DIR=$( cd ${0%/*} && pwd -P )
diff --git a/tensorflow/tools/ci_build/builds/docker_test.sh b/tensorflow/tools/ci_build/builds/docker_test.sh
new file mode 100755
index 0000000000..7a1af79c89
--- /dev/null
+++ b/tensorflow/tools/ci_build/builds/docker_test.sh
@@ -0,0 +1,127 @@
+#!/usr/bin/env bash
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# Build and test TensorFlow docker images.
+# The tests include Python unit tests-on-install and tutorial tests.
+#
+# Usage: docker_test.sh <IMAGE_TYPE> <TAG> <WHL_PATH>
+# Arguments:
+# IMAGE_TYPE : Type of the image: (CPU|GPU)
+# TAG : Docker image tag
+# WHL_PATH : Path to the whl file to be installed inside the docker image
+#
+# e.g.: docker_test.sh CPU someone/tensorflow:0.8.0 pip_test/whl/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
+#
+
+# Helper functions
+# Exit after a failure
+die() {
+ echo $@
+ exit 1
+}
+
+# Convert to lower case
+to_lower () {
+ echo "$1" | tr '[:upper:]' '[:lower:]'
+}
+
+
+# Helper function to traverse directories up until given file is found.
+function upsearch () {
+ test / == "$PWD" && return || \
+ test -e "$1" && echo "$PWD" && return || \
+ cd .. && upsearch "$1"
+}
+
+
+# Verify command line argument
+if [[ $# != "3" ]]; then
+ die "Usage: $(basename $0) <IMAGE_TYPE> <TAG> <WHL_PATH>"
+fi
+IMAGE_TYPE=$(to_lower "$1")
+DOCKER_IMG_TAG=$2
+WHL_PATH=$3
+
+# Verify image type
+if [[ "${IMAGE_TYPE}" == "cpu" ]]; then
+ DOCKERFILE="tensorflow/tools/docker/Dockerfile"
+elif [[ "${IMAGE_TYPE}" == "gpu" ]]; then
+ DOCKERFILE="tensorflow/tools/docker/Dockerfile.gpu"
+else
+ die "Unrecognized image type: $1"
+fi
+
+# Verify docker binary existence
+if [[ -z $(which docker) ]]; then
+ die "FAILED: docker binary unavailable"
+fi
+
+# Locate the base directory
+BASE_DIR=$(upsearch "${DOCKERFILE}")
+if [[ -z "${BASE_DIR}" ]]; then
+ die "FAILED: Unable to find the base directory where the dockerfile "\
+"${DOCKERFFILE} resides"
+fi
+echo "Base directory: ${BASE_DIR}"
+
+pushd ${BASE_DIR} > /dev/null
+
+# Build docker image
+DOCKERFILE_PATH="${BASE_DIR}/${DOCKERFILE}"
+DOCKERFILE_DIR="$(dirname ${DOCKERFILE_PATH})"
+
+# Check to make sure that the whl file exists
+test -f ${WHL_PATH} || \
+ die "whl file does not exist: ${WHL_PATH}"
+
+TMP_WHL_DIR="${DOCKERFILE_DIR}/whl"
+mkdir -p "${TMP_WHL_DIR}"
+cp "${WHL_PATH}" "${TMP_WHL_DIR}/" || \
+ die "FAILED to copy whl file from ${WHL_PATH} to ${TMP_WHL_DIR}/"
+
+docker build -t "${DOCKER_IMG_TAG}" -f "${DOCKERFILE_PATH}" \
+"${DOCKERFILE_DIR}" || \
+ die "FAILED to build docker image from Dockerfile ${DOCKERFILE_PATH}"
+
+# Clean up
+rm -rf "${TMP_WHL_DIR}" || \
+ die "Failed to remove temporary directory ${TMP_WHL_DIR}"
+
+
+# Add extra params for cuda devices and libraries for GPU container.
+if [ "${IMAGE_TYPE}" == "gpu" ]; then
+ devices=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
+ libs=$(\ls /usr/lib/x86_64-linux-gnu/libcuda.* | xargs -I{} echo '-v {}:{}')
+ GPU_EXTRA_PARAMS="${devices} ${libs}"
+else
+ GPU_EXTRA_PARAMS=""
+fi
+
+# Run docker image with source directory mapped
+docker run -v ${BASE_DIR}:/tensorflow-src -w /tensorflow-src \
+${GPU_EXTRA_PARAMS} \
+"${DOCKER_IMG_TAG}" \
+/bin/bash -c "tensorflow/tools/ci_build/builds/test_installation.sh && "\
+"tensorflow/tools/ci_build/builds/test_tutorials.sh"
+
+RESULT=$?
+
+popd > /dev/null
+if [[ ${RESULT} == 0 ]]; then
+ echo "SUCCESS: Built and tested docker image: ${DOCKER_IMG_TAG}"
+else
+ die "FAILED to build and test docker image: ${DOCKER_IMG_TAG}"
+fi
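For reference, a GPU run of the new docker_test.sh might look like the sketch below; the image tag and whl path are placeholders, not values from this change. The script builds the image from Dockerfile.gpu and maps the /dev/nvidia* devices plus the libcuda.* libraries into the container before running the test scripts.

    $ tensorflow/tools/ci_build/builds/docker_test.sh GPU \
        someone/tensorflow:latest-gpu \
        pip_test/whl/tensorflow-0.7.1-cp27-none-linux_x86_64.whl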
diff --git a/tensorflow/tools/ci_build/builds/pip.sh b/tensorflow/tools/ci_build/builds/pip.sh
index 66ebf13baa..16364fbf9e 100755
--- a/tensorflow/tools/ci_build/builds/pip.sh
+++ b/tensorflow/tools/ci_build/builds/pip.sh
@@ -13,55 +13,27 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
-
-# Build the Python PIP installation package for TensorFlow
-# and run the Python unit tests from the source code on the installation
+#
+# Build the Python PIP installation package for TensorFlow and install
+# the package.
+# The PIP installation is performed inside a dedicated virtualenv.
#
# Usage:
-# pip.sh CONTAINER_TYPE
+# pip.sh CONTAINER_TYPE [--test_tutorials]
#
# When executing the Python unit tests, the script obeys the shell
-# variables: PY_TEST_WHITELIST, PY_TEST_BLACKLIST, PY_TEST_GPU_BLACKLIST,
-# and NO_TEST_ON_INSTALL
-#
-# To select only a subset of the Python tests to run, set the environment
-# variable PY_TEST_WHITELIST, e.g.,
-# PY_TEST_WHITELIST="tensorflow/python/kernel_tests/shape_ops_test.py"
-# Separate the tests with a colon (:). Leave this environment variable empty
-# to disable the whitelist.
+# variables: TF_BUILD_BAZEL_CLEAN, NO_TEST_ON_INSTALL
#
-# You can also ignore a set of the tests by using the environment variable
-# PY_TEST_BLACKLIST. For example, you can include in PY_TEST_BLACKLIST the
-# tests that depend on Python modules in TensorFlow source that are not
-# exported publicly.
+# TF_BUILD_BAZEL_CLEAN, if set to any non-empty and non-0 value, directs the
+# script to perform bazel clean prior to main build and test steps.
#
-# In addition, you can put blacklist for only GPU build inthe environment
-# variable PY_TEST_GPU_BLACKLIST.
-#
-# If the environmental variable NO_TEST_ON_INSTALL is set to any non-empty
-# value, the script will exit after the pip install step.
-
-# =============================================================================
-# Test blacklist: General
+# If NO_TEST_ON_INSTALL has any non-empty and non-0 value, the test-on-install
+# part will be skipped.
#
-# tensorflow/python/framework/ops_test.py
-# depends on depends on "test_ops", which is defined in a C++ file wrapped as
-# a .py file through the Bazel rule “tf_gen_ops_wrapper_py”.
-# tensorflow/util/protobuf/compare_test.py:
-# depends on compare_test_pb2 defined outside Python
-# tensorflow/python/framework/device_test.py:
-# depends on CheckValid() and ToString(), both defined externally
+# If the --test_tutorials flag is set, the script will also run the
+# tutorial tests (see test_tutorials.sh) after the PIP
+# installation and the Python unit tests-on-install step.
#
-PY_TEST_BLACKLIST="${PY_TEST_BLACKLIST}:"\
-"tensorflow/python/framework/ops_test.py:"\
-"tensorflow/python/util/protobuf/compare_test.py:"\
-"tensorflow/python/framework/device_test.py"
-
-# Test blacklist: GPU-only
-PY_TEST_GPU_BLACKLIST="${PY_TEST_GPU_BLACKLIST}:"\
-"tensorflow/python/framework/function_test.py"
-
-# =============================================================================
# Helper functions
# Get the absolute path from a path
@@ -69,15 +41,30 @@ abs_path() {
[[ $1 = /* ]] && echo "$1" || echo "$PWD/${1#./}"
}
+
# Exit after a failure
die() {
echo $@
exit 1
}
+
# Get the command line arguments
CONTAINER_TYPE=$( echo "$1" | tr '[:upper:]' '[:lower:]' )
+if [[ ! -z "${TF_BUILD_BAZEL_CLEAN}" ]] && \
+ [[ "${TF_BUILD_BAZEL_CLEAN}" != "0" ]]; then
+ echo "TF_BUILD_BAZEL_CLEAN=${TF_BUILD_BAZEL_CLEAN}: Performing 'bazel clean'"
+ bazel clean
+fi
+
+DO_TEST_TUTORIALS=0
+for ARG in $@; do
+ if [[ "${ARG}" == "--test_tutorials" ]]; then
+ DO_TEST_TUTORIALS=1
+ fi
+done
+
PIP_BUILD_TARGET="//tensorflow/tools/pip_package:build_pip_package"
if [[ ${CONTAINER_TYPE} == "cpu" ]]; then
bazel build -c opt ${PIP_BUILD_TARGET} || die "Build failed."
@@ -96,6 +83,12 @@ if [[ ${CONTAINER_TYPE} == "gpu" ]]; then
PY_TEST_BLACKLIST="${PY_TEST_BLACKLIST}:${PY_TEST_GPU_BLACKLIST}"
fi
+# If still in a virtualenv, deactivate it first
+if [[ ! -z "$(which deactivate)" ]]; then
+ echo "It appears that we are already in a virtualenv. Deactivating..."
+ deactivate || die "FAILED: Unable to deactivate from existing virtualenv"
+fi
+
# Obtain the path to Python binary
source tools/python_bin_path.sh
@@ -109,18 +102,20 @@ fi
# installation of Python
PY_MAJOR_MINOR_VER=$(${PYTHON_BIN_PATH} -V 2>&1 | awk '{print $NF}' | cut -d. -f-2)
-echo "Python binary path to be used in PIP install-test: ${PYTHON_BIN_PATH} "\
+echo "Python binary path to be used in PIP install: ${PYTHON_BIN_PATH} "\
"(Major.Minor version: ${PY_MAJOR_MINOR_VER})"
# Build PIP Wheel file
-PIP_WHL_DIR="pip_test/whl"
-PIP_WHL_DIR=`abs_path ${PIP_WHL_DIR}` # Get absolute path
+PIP_TEST_ROOT="pip_test"
+PIP_WHL_DIR="${PIP_TEST_ROOT}/whl"
+PIP_WHL_DIR=$(abs_path ${PIP_WHL_DIR}) # Get absolute path
rm -rf ${PIP_WHL_DIR} && mkdir -p ${PIP_WHL_DIR}
-bazel-bin/tensorflow/tools/pip_package/build_pip_package ${PIP_WHL_DIR} &&
+bazel-bin/tensorflow/tools/pip_package/build_pip_package ${PIP_WHL_DIR} || \
+die "build_pip_package FAILED"
# Perform installation
-WHL_PATH=`ls ${PIP_WHL_DIR}/tensorflow*.whl`
-if [[ `echo ${WHL_PATH} | wc -w` -ne 1 ]]; then
+WHL_PATH=$(ls ${PIP_WHL_DIR}/tensorflow*.whl)
+if [[ $(echo ${WHL_PATH} | wc -w) -ne 1 ]]; then
die "ERROR: Failed to find exactly one built TensorFlow .whl file in "\
"directory: ${PIP_WHL_DIR}"
fi
@@ -130,180 +125,47 @@ echo "whl file path = ${WHL_PATH}"
# Install, in user's local home folder
echo "Installing pip whl file: ${WHL_PATH}"
-# Call pip install twice, first time with --upgrade and second time without it
-# This addresses the sporadic test failures related to protobuf version
-${PYTHON_BIN_PATH} -m pip install -v --user --upgrade ${WHL_PATH} numpy==1.8.2 &&
-${PYTHON_BIN_PATH} -m pip install -v --user ${WHL_PATH} &&
+# Create temporary directory for install test
+VENV_DIR="${PIP_TEST_ROOT}/venv"
+rm -rf "${VENV_DIR}" && mkdir -p "${VENV_DIR}"
+echo "Create directory for virtualenv: ${VENV_DIR}"
+
+# Verify that virtualenv exists
+if [[ -z $(which virtualenv) ]]; then
+ die "FAILED: virtualenv not available on path"
+fi
+
+virtualenv -p "${PYTHON_BIN_PATH}" "${VENV_DIR}" ||
+die "FAILED: Unable to create virtualenv"
+
+source "${VENV_DIR}/bin/activate" ||
+die "FAILED: Unable to activate virtualenv"
+
+# Install the pip file in virtual env
+pip install -v ${WHL_PATH} \
+&& echo "Successfully installed pip package ${WHL_PATH}" \
+|| die "pip install (without --upgrade) FAILED"
# If NO_TEST_ON_INSTALL is set to any non-empty value, skip all Python
# tests-on-install and exit right away
-if [[ ! -z ${NO_TEST_ON_INSTALL} ]]; then
+if [[ ! -z "${NO_TEST_ON_INSTALL}" ]] &&
+ [[ "${NO_TEST_ON_INSTALL}" != "0" ]]; then
echo "NO_TEST_ON_INSTALL=${NO_TEST_ON_INSTALL}:"
echo " Skipping ALL Python unit tests on install"
exit 0
fi
-# Directory from which the unit-test files will be run
-PY_TEST_DIR_REL="pip_test/tests"
-PY_TEST_DIR=`abs_path ${PY_TEST_DIR_REL}` # Get absolute path
-rm -rf ${PY_TEST_DIR} && mkdir -p ${PY_TEST_DIR}
-
-# Create test log directory
-PY_TEST_LOG_DIR_REL=${PY_TEST_DIR_REL}/logs
-PY_TEST_LOG_DIR=`abs_path ${PY_TEST_LOG_DIR_REL}` # Absolute path
-
-mkdir ${PY_TEST_LOG_DIR}
-
-# Copy source files that are required by the tests but are not included in the
-# PIP package
-
-# Look for local Python library directory
-LIB_PYTHON_DIR=""
+# Call test_installation.sh to perform test-on-install
+DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-# Candidate locations of the local Python library directory
-LIB_PYTHON_DIR_CANDS="${HOME}/.local/lib/python${PY_MAJOR_MINOR_VER}* "\
-"${HOME}/Library/Python/${PY_MAJOR_MINOR_VER}*/lib/python"
+"${DIR}/test_installation.sh" --virtualenv ||
+die "PIP tests-on-install FAILED"
-for CAND in ${LIB_PYTHON_DIR_CANDS}; do
- if [[ -d "${CAND}" ]]; then
- LIB_PYTHON_DIR="${CAND}"
- break
- fi
-done
-
-if [[ -z ${LIB_PYTHON_DIR} ]]; then
- die "Failed to find local Python library directory"
-else
- echo "Found local Python library directory at: ${LIB_PYTHON_DIR}"
+# Optional: Run the tutorial tests
+if [[ "${DO_TEST_TUTORIALS}" == "1" ]]; then
+ "${DIR}/test_tutorials.sh" --virtualenv ||
+die "PIP tutorial tests-on-install FAILED"
fi
-PACKAGES_DIR=`ls -d ${LIB_PYTHON_DIR}/*-packages | head -1`
-
-echo "Copying some source directories that are required by tests but are "\
-"not included in install to Python packages directory: ${PACKAGES_DIR}"
-
-# tensorflow.python.tools
-rm -rf ${PACKAGES_DIR}/tensorflow/python/tools
-cp -r tensorflow/python/tools \
- ${PACKAGES_DIR}/tensorflow/python/tools
-touch ${PACKAGES_DIR}/tensorflow/python/tools/__init__.py # Make module visible
-
-echo "Copying additional files required by tests to working directory "\
-"for test: ${PY_TEST_DIR}"
-
-# Image files required by some tests, e.g., images_ops_test.py
-mkdir -p ${PY_TEST_DIR}/tensorflow/core/lib
-rm -rf ${PY_TEST_DIR}/tensorflow/core/lib/jpeg
-cp -r tensorflow/core/lib/jpeg ${PY_TEST_DIR}/tensorflow/core/lib
-rm -rf ${PY_TEST_DIR}/tensorflow/core/lib/png
-cp -r tensorflow/core/lib/png ${PY_TEST_DIR}/tensorflow/core/lib
-
-# Run tests
-DIR0=`pwd`
-ALL_PY_TESTS=`find tensorflow/python -name "*_test.py"`
-# TODO(cais): Add tests in tensorflow/contrib
-
-PY_TEST_COUNT=`echo ${ALL_PY_TESTS} | wc -w`
-
-if [[ ${PY_TEST_COUNT} -eq 0 ]]; then
- die "ERROR: Cannot find any tensorflow Python unit tests to run on install"
-fi
-
-# Iterate through all the Python unit test files using the installation
-COUNTER=0
-PASS_COUNTER=0
-FAIL_COUNTER=0
-SKIP_COUNTER=0
-FAILED_TESTS=""
-FAILED_TEST_LOGS=""
-
-for TEST_FILE_PATH in ${ALL_PY_TESTS}; do
- ((COUNTER++))
-
- PROG_STR="(${COUNTER} / ${PY_TEST_COUNT})"
-
- # If PY_TEST_WHITELIST is not empty, only the white-listed tests will be run
- if [[ ! -z ${PY_TEST_WHITELIST} ]] && \
- [[ ! ${PY_TEST_WHITELIST} == *"${TEST_FILE_PATH}"* ]]; then
- ((SKIP_COUNTER++))
- echo "${PROG_STR} Non-whitelisted test SKIPPED: ${TEST_FILE_PATH}"
- continue
- fi
-
- # If the test is in the black list, skip it
- if [[ ${PY_TEST_BLACKLIST} == *"${TEST_FILE_PATH}"* ]]; then
- ((SKIP_COUNTER++))
- echo "${PROG_STR} Blacklisted test SKIPPED: ${TEST_FILE_PATH}"
- continue
- fi
-
- # Copy to a separate directory to guard against the possibility of picking up
- # modules in the source directory
- cp ${TEST_FILE_PATH} ${PY_TEST_DIR}/
-
- TEST_BASENAME=`basename "${TEST_FILE_PATH}"`
-
- # Relative path of the test log. Use long path in case there are duplicate
- # file names in the Python tests
- TEST_LOG_REL="${PY_TEST_LOG_DIR_REL}/${TEST_FILE_PATH}.log"
- mkdir -p `dirname ${TEST_LOG_REL}` # Create directory for log
-
- TEST_LOG=`abs_path ${TEST_LOG_REL}` # Absolute path
-
- # Before running the test, cd away from the Tensorflow source to
- # avoid the possibility of picking up dependencies from the
- # source directory
- cd ${PY_TEST_DIR}
- ${PYTHON_BIN_PATH} ${PY_TEST_DIR}/${TEST_BASENAME} >${TEST_LOG} 2>&1
-
- # Check for pass or failure status of the test outtput and exit
- if [[ $? -eq 0 ]]; then
- ((PASS_COUNTER++))
-
- echo "${PROG_STR} Python test-on-install PASSED: ${TEST_FILE_PATH}"
- else
- ((FAIL_COUNTER++))
-
- FAILED_TESTS="${FAILED_TESTS} ${TEST_FILE_PATH}"
-
- FAILED_TEST_LOGS="${FAILED_TEST_LOGS} ${TEST_LOG_REL}"
-
- echo "${PROG_STR} Python test-on-install FAILED: ${TEST_FILE_PATH}"
- echo " Log @: ${TEST_LOG_REL}"
- echo "============== BEGINS failure log content =============="
- cat ${TEST_LOG}
- echo "============== ENDS failure log content =============="
- echo ""
- fi
- cd ${DIR0}
-
- # Clean up files for this test
- rm -f ${PY_TEST_DIR}/${TEST_BASENAME}
-
-done
-
-echo ""
-echo "${PY_TEST_COUNT} Python test(s):" \
- "${PASS_COUNTER} passed;" \
- "${FAIL_COUNTER} failed; " \
- "${SKIP_COUNTER} skipped"
-echo "Test logs directory: ${PY_TEST_LOG_DIR_REL}"
-
-if [[ ${FAIL_COUNTER} -eq 0 ]]; then
- echo ""
- echo "Python test-on-install SUCCEEDED"
-
- exit 0
-else
- echo "FAILED test(s):"
- FAILED_TEST_LOGS=($FAILED_TEST_LOGS)
- FAIL_COUNTER=0
- for TEST_NAME in ${FAILED_TESTS}; do
- echo " ${TEST_NAME} (Log @: ${FAILED_TEST_LOGS[${FAIL_COUNTER}]})"
- ((FAIL_COUNTER++))
- done
-
- echo ""
- echo "Python test-on-install FAILED"
- exit 1
-fi
+deactivate ||
+die "FAILED: Unable to deactivate virtualenv"
diff --git a/tensorflow/tools/ci_build/builds/print_build_info.sh b/tensorflow/tools/ci_build/builds/print_build_info.sh
index f243c185c0..95b5eb8b83 100755
--- a/tensorflow/tools/ci_build/builds/print_build_info.sh
+++ b/tensorflow/tools/ci_build/builds/print_build_info.sh
@@ -63,7 +63,7 @@ if [[ ! -z $(which swig) ]]; then
fi
# Information about TensorFlow source
-TF_FETCH_URL=$(git remote show origin | grep "Fetch URL:" | awk '{print $3}')
+TF_FETCH_URL=$(git config --get remote.origin.url)
TF_HEAD=$(git rev-parse HEAD)
# NVIDIA & CUDA info
diff --git a/tensorflow/tools/ci_build/builds/test_installation.sh b/tensorflow/tools/ci_build/builds/test_installation.sh
new file mode 100755
index 0000000000..d2c8d21c5b
--- /dev/null
+++ b/tensorflow/tools/ci_build/builds/test_installation.sh
@@ -0,0 +1,292 @@
+#!/usr/bin/env bash
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# Build the Python PIP installation package for TensorFlow
+# and run the Python unit tests from the source code on the installation
+#
+# Usage:
+# test_installation.sh [--virtualenv]
+#
+# If the flag --virtualenv is set, the script will use "python" as the Python
+# binary path. Otherwise, it will use tools/python_bin_path.sh to determine
+# the Python binary path.
+#
+# When executing the Python unit tests, the script obeys the shell
+# variables: PY_TEST_WHITELIST, PY_TEST_BLACKLIST and PY_TEST_GPU_BLACKLIST.
+#
+# To select only a subset of the Python tests to run, set the environment
+# variable PY_TEST_WHITELIST, e.g.,
+# PY_TEST_WHITELIST="tensorflow/python/kernel_tests/shape_ops_test.py"
+# Separate the tests with a colon (:). Leave this environment variable empty
+# to disable the whitelist.
+#
+# You can also ignore a set of the tests by using the environment variable
+# PY_TEST_BLACKLIST. For example, you can include in PY_TEST_BLACKLIST the
+# tests that depend on Python modules in TensorFlow source that are not
+# exported publicly.
+#
+# In addition, you can put a blacklist for the GPU-only build in the environment
+# variable PY_TEST_GPU_BLACKLIST.
+#
+# TF_BUILD_BAZEL_CLEAN, if set to any non-empty and non-0 value, directs the
+# script to perform bazel clean prior to main build and test steps.
+#
+# If the environment variable NO_TEST_ON_INSTALL is set to any non-empty
+# value, the script will exit after the pip install step.
+
+# =============================================================================
+# Test blacklist: General
+#
+# tensorflow/python/framework/ops_test.py
+# depends on "test_ops", which is defined in a C++ file wrapped as
+# a .py file through the Bazel rule “tf_gen_ops_wrapper_py”.
+# tensorflow/util/protobuf/compare_test.py:
+# depends on compare_test_pb2 defined outside Python
+# tensorflow/python/framework/device_test.py:
+# depends on CheckValid() and ToString(), both defined externally
+#
+PY_TEST_BLACKLIST="${PY_TEST_BLACKLIST}:"\
+"tensorflow/python/framework/ops_test.py:"\
+"tensorflow/python/util/protobuf/compare_test.py:"\
+"tensorflow/python/framework/device_test.py"
+
+# Test blacklist: GPU-only
+PY_TEST_GPU_BLACKLIST="${PY_TEST_GPU_BLACKLIST}:"\
+"tensorflow/python/framework/function_test.py"
+
+# =============================================================================
+
+
+# Helper functions
+# Get the absolute path from a path
+abs_path() {
+ [[ $1 = /* ]] && echo "$1" || echo "$PWD/${1#./}"
+}
+
+
+die() {
+ echo $@
+ exit 1
+}
+
+
+# Obtain the path to Python binary
+# source tools/python_bin_path.sh
+if [[ "$1" == "--virtualenv" ]]; then
+ PYTHON_BIN_PATH="$(which python)"
+else
+ source tools/python_bin_path.sh
+ # Assume: PYTHON_BIN_PATH is exported by the script above
+fi
+
+if [[ -z "${PYTHON_BIN_PATH}" ]]; then
+ die "PYTHON_BIN_PATH was not provided. If this is not virtualenv, "\
+"did you run configure?"
+fi
+
+# Determine the major and minor versions of Python being used (e.g., 2.7)
+# This info will be useful for determining the directory of the local pip
+# installation of Python
+PY_MAJOR_MINOR_VER=$(${PYTHON_BIN_PATH} -V 2>&1 | awk '{print $NF}' | cut -d. -f-2)
+
+echo "Python binary path to be used in PIP install-test: ${PYTHON_BIN_PATH} "\
+"(Major.Minor version: ${PY_MAJOR_MINOR_VER})"
+
+# Avoid permission issues outside container
+umask 000
+
+# Directory from which the unit-test files will be run
+PY_TEST_DIR_REL="pip_test/tests"
+PY_TEST_DIR=$(abs_path ${PY_TEST_DIR_REL}) # Get absolute path
+rm -rf ${PY_TEST_DIR} && mkdir -p ${PY_TEST_DIR}
+
+# Create test log directory
+PY_TEST_LOG_DIR_REL=${PY_TEST_DIR_REL}/logs
+PY_TEST_LOG_DIR=$(abs_path ${PY_TEST_LOG_DIR_REL}) # Absolute path
+
+mkdir ${PY_TEST_LOG_DIR}
+
+
+# Copy source files that are required by the tests but are not included in the
+# PIP package
+
+# Look for local Python library directory
+# pushd/popd avoids importing TensorFlow from the source directory.
+pushd /tmp > /dev/null
+TF_INSTALL_PATH=$(dirname \
+ $("${PYTHON_BIN_PATH}" -c "import tensorflow as tf; print(tf.__file__)"))
+popd > /dev/null
+
+if [[ -z ${TF_INSTALL_PATH} ]]; then
+ die "Failed to find path where TensorFlow is installed."
+else
+ echo "Found TensorFlow install path: ${TF_INSTALL_PATH}"
+fi
+
+echo "Copying some source directories required by Python unit tests but "\
+"not included in install to TensorFlow install path: ${TF_INSTALL_PATH}"
+
+# Files for tensorflow.python.tools
+rm -rf ${TF_INSTALL_PATH}/python/tools
+cp -r tensorflow/python/tools \
+ ${TF_INSTALL_PATH}/python/tools
+touch ${TF_INSTALL_PATH}/python/tools/__init__.py # Make module visible
+
+# Files for tensorflow.examples
+rm -rf ${TF_INSTALL_PATH}/examples/image_retraining
+mkdir -p ${TF_INSTALL_PATH}/examples/image_retraining
+cp -r tensorflow/examples/image_retraining/retrain.py \
+ ${TF_INSTALL_PATH}/examples/image_retraining/retrain.py
+touch ${TF_INSTALL_PATH}/examples/__init__.py
+touch ${TF_INSTALL_PATH}/examples/image_retraining/__init__.py
+
+echo "Copying additional files required by tests to working directory "\
+"for test: ${PY_TEST_DIR}"
+
+# Image files required by some tests, e.g., images_ops_test.py
+
+mkdir -p ${PY_TEST_DIR}/tensorflow/core/lib
+rm -rf ${PY_TEST_DIR}/tensorflow/core/lib/jpeg
+cp -r tensorflow/core/lib/jpeg ${PY_TEST_DIR}/tensorflow/core/lib
+rm -rf ${PY_TEST_DIR}/tensorflow/core/lib/png
+cp -r tensorflow/core/lib/png ${PY_TEST_DIR}/tensorflow/core/lib
+
+# Run tests
+DIR0=$(pwd)
+ALL_PY_TESTS=$(find tensorflow/{contrib,examples,models,python,tensorboard} -name "*_test.py" | sort)
+# Note: tests under tensorflow/contrib are now included in the find above.
+
+PY_TEST_COUNT=$(echo ${ALL_PY_TESTS} | wc -w)
+
+if [[ ${PY_TEST_COUNT} -eq 0 ]]; then
+ die "ERROR: Cannot find any tensorflow Python unit tests to run on install"
+fi
+
+# Iterate through all the Python unit test files using the installation
+COUNTER=0
+PASS_COUNTER=0
+FAIL_COUNTER=0
+SKIP_COUNTER=0
+FAILED_TESTS=""
+FAILED_TEST_LOGS=""
+
+for TEST_FILE_PATH in ${ALL_PY_TESTS}; do
+ ((COUNTER++))
+
+ PROG_STR="(${COUNTER} / ${PY_TEST_COUNT})"
+
+ # If PY_TEST_WHITELIST is not empty, only the white-listed tests will be run
+ if [[ ! -z ${PY_TEST_WHITELIST} ]] && \
+ [[ ! ${PY_TEST_WHITELIST} == *"${TEST_FILE_PATH}"* ]]; then
+ ((SKIP_COUNTER++))
+ echo "${PROG_STR} Non-whitelisted test SKIPPED: ${TEST_FILE_PATH}"
+ continue
+ fi
+
+ # If the test is in the black list, skip it
+ if [[ ${PY_TEST_BLACKLIST} == *"${TEST_FILE_PATH}"* ]]; then
+ ((SKIP_COUNTER++))
+ echo "${PROG_STR} Blacklisted test SKIPPED: ${TEST_FILE_PATH}"
+ continue
+ fi
+
+ # Copy to a separate directory to guard against the possibility of picking up
+ # modules in the source directory
+ cp ${TEST_FILE_PATH} ${PY_TEST_DIR}/
+
+ TEST_BASENAME=$(basename "${TEST_FILE_PATH}")
+
+ # Relative path of the test log. Use long path in case there are duplicate
+ # file names in the Python tests
+ TEST_LOG_REL="${PY_TEST_LOG_DIR_REL}/${TEST_FILE_PATH}.log"
+ mkdir -p $(dirname ${TEST_LOG_REL}) # Create directory for log
+
+ TEST_LOG=$(abs_path ${TEST_LOG_REL}) # Absolute path
+
+ # Start the stopwatch for this test
+ START_TIME=$(date +'%s')
+
+ # Before running the test, cd away from the Tensorflow source to
+ # avoid the possibility of picking up dependencies from the
+ # source directory
+ cd ${PY_TEST_DIR}
+ ${PYTHON_BIN_PATH} ${PY_TEST_DIR}/${TEST_BASENAME} >${TEST_LOG} 2>&1
+
+ TEST_RESULT=$?
+
+ END_TIME=$(date +'%s')
+ ELAPSED_TIME="$((${END_TIME} - ${START_TIME})) s"
+
+ # Check for pass or failure status of the test output
+ if [[ ${TEST_RESULT} -eq 0 ]]; then
+ ((PASS_COUNTER++))
+
+ echo "${PROG_STR} Python test-on-install PASSED (${ELAPSED_TIME}): "\
+"${TEST_FILE_PATH}"
+ else
+ ((FAIL_COUNTER++))
+
+ FAILED_TESTS="${FAILED_TESTS} ${TEST_FILE_PATH}"
+
+ FAILED_TEST_LOGS="${FAILED_TEST_LOGS} ${TEST_LOG_REL}"
+
+ echo "${PROG_STR} Python test-on-install FAILED (${ELPASED_TIME}): "\
+"${TEST_FILE_PATH}"
+
+ echo " Log @: ${TEST_LOG_REL}"
+ echo "============== BEGINS failure log content =============="
+ cat ${TEST_LOG}
+ echo "============== ENDS failure log content =============="
+ echo ""
+ fi
+ cd ${DIR0}
+
+ # Clean up files for this test
+ rm -f ${PY_TEST_DIR}/${TEST_BASENAME}
+
+done
+
+# Clean up files copied for Python unit tests:
+rm -rf ${TF_INSTALL_PATH}/python/tools
+rm -rf ${TF_INSTALL_PATH}/examples/image_retraining
+rm -rf ${PY_TEST_DIR}/tensorflow/core/lib/jpeg
+rm -rf ${PY_TEST_DIR}/tensorflow/core/lib/png
+
+echo ""
+echo "${PY_TEST_COUNT} Python test(s):" \
+ "${PASS_COUNTER} passed;" \
+ "${FAIL_COUNTER} failed; " \
+ "${SKIP_COUNTER} skipped"
+echo "Test logs directory: ${PY_TEST_LOG_DIR_REL}"
+
+if [[ ${FAIL_COUNTER} -eq 0 ]]; then
+ echo ""
+ echo "Python test-on-install SUCCEEDED"
+
+ exit 0
+else
+ echo "FAILED test(s):"
+ FAILED_TEST_LOGS=($FAILED_TEST_LOGS)
+ FAIL_COUNTER=0
+ for TEST_NAME in ${FAILED_TESTS}; do
+ echo " ${TEST_NAME} (Log @: ${FAILED_TEST_LOGS[${FAIL_COUNTER}]})"
+ ((FAIL_COUNTER++))
+ done
+
+ echo ""
+ echo "Python test-on-install FAILED"
+ exit 1
+fi
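To run only a subset of the Python tests against the installed package, the whitelist described in the header above can be set when calling the new script; for example (the test path is the one used as an example in the comments):

    $ PY_TEST_WHITELIST="tensorflow/python/kernel_tests/shape_ops_test.py" \
      tensorflow/tools/ci_build/builds/test_installation.sh --virtualenv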
diff --git a/tensorflow/tools/ci_build/builds/test_tutorials.sh b/tensorflow/tools/ci_build/builds/test_tutorials.sh
index 13c26fd61f..bb65460186 100644
--- a/tensorflow/tools/ci_build/builds/test_tutorials.sh
+++ b/tensorflow/tools/ci_build/builds/test_tutorials.sh
@@ -21,7 +21,11 @@
# decrement of loss with training, and verifying the existence of saved
# checkpoints and summaries files.
#
-# Usage: test_tutorials.sh
+# Usage: test_tutorials.sh [--virtualenv]
+#
+# If the flag --virtualenv is set, the script will use "python" as the Python
+# binary path. Otherwise, it will use tools/python_bin_path.sh to determine
+# the Python binary path.
#
# This script obeys the following environment variables (if exists):
# TUT_TESTS_BLACKLIST: Force skipping of specified tutorial tests listed
@@ -104,42 +108,48 @@ if [[ -z "$(which ${TIMEOUT_BIN})" ]]; then
fi
echo "Binary path for timeout: \"$(which ${TIMEOUT_BIN})\""
+# Avoid permission issues outside Docker containers
+umask 000
+
mkdir -p "${LOGS_DIR}" || die "Failed to create logs directory"
mkdir -p "${TUT_TEST_ROOT}" || die "Failed to create test directory"
-source tools/python_bin_path.sh
-
-if [[ -z "$PYTHON_BIN_PATH" ]]; then
- die "PYTHON_BIN_PATH was not provided. Did you run configure?"
+if [[ "$1" == "--virtualenv" ]]; then
+ PYTHON_BIN_PATH="$(which python)"
+else
+ source tools/python_bin_path.sh
fi
-echo "Binary path for python: \"$PYTHON_BIN_PATH\""
+if [[ -z "${PYTHON_BIN_PATH}" ]]; then
+ die "PYTHON_BIN_PATH was not provided. If this is not virtualenv, "\
+"did you run configure?"
+else
+ echo "Binary path for python: \"$PYTHON_BIN_PATH\""
+fi
# Determine the TensorFlow installation path
+# pushd/popd avoids importing TensorFlow from the source directory.
pushd /tmp > /dev/null
-TF_INSTALL_PATH=$(dirname $(${PYTHON_BIN_PATH} -c "import tensorflow; print(tensorflow.__file__)"))
+TF_INSTALL_PATH=$(dirname \
+ $("${PYTHON_BIN_PATH}" -c "import tensorflow as tf; print(tf.__file__)"))
popd > /dev/null
echo "Detected TensorFlow installation path: ${TF_INSTALL_PATH}"
TEST_DIR="pip_test/tutorials"
-mkdir -p "${TEST_DIR}" ||
-die "Failed to create test directory: ${TEST_DIR}"
+mkdir -p "${TEST_DIR}" || \
+ die "Failed to create test directory: ${TEST_DIR}"
# Copy folders required by mnist tutorials
-if [[ ! -d "${TF_INSTALL_PATH}/examples/tutorials/mnist" ]]; then
- echo "Copying files required by MNIST tutorials..."
-
- mkdir -p "${TF_INSTALL_PATH}/examples/tutorials"
- cp tensorflow/examples/tutorials/__init__.py \
+mkdir -p "${TF_INSTALL_PATH}/examples/tutorials"
+cp tensorflow/examples/tutorials/__init__.py \
"${TF_INSTALL_PATH}/examples/tutorials/"
- cp -r tensorflow/examples/tutorials/mnist \
+cp -r tensorflow/examples/tutorials/mnist \
"${TF_INSTALL_PATH}/examples/tutorials/"
- if [[ ! -d "${TF_INSTALL_PATH}/examples/tutorials/mnist" ]]; then
- die "FAILED: Unable to copy directory required by MNIST tutorials: "\
+if [[ ! -d "${TF_INSTALL_PATH}/examples/tutorials/mnist" ]]; then
+ die "FAILED: Unable to copy directory required by MNIST tutorials: "\
"${TF_INSTALL_PATH}/examples/tutorials/mnist"
- fi
fi
# -----------------------------------------------------------
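The tutorial tests can also be invoked directly against a virtualenv install, mirroring what pip.sh now does when --test_tutorials is passed:

    $ tensorflow/tools/ci_build/builds/test_tutorials.sh --virtualenv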
diff --git a/tensorflow/tools/ci_build/builds/with_the_same_user b/tensorflow/tools/ci_build/builds/with_the_same_user
index bab6f14c10..43773f23ba 100755
--- a/tensorflow/tools/ci_build/builds/with_the_same_user
+++ b/tensorflow/tools/ci_build/builds/with_the_same_user
@@ -17,7 +17,7 @@
# This script is a wrapper creating the same user inside container as the one
# running the ci_build.sh outside the container. It also set the home directory
# for the user inside container to match the same absolute path as the workspace
-# outside of continer.
+# outside of container.
# We do this so that the bazel running inside container generate symbolic links
# and user permissions which makes sense outside of container.
# Do not run this manually. It does not make sense. It is intended to be called
diff --git a/tensorflow/tools/ci_build/ci_build.sh b/tensorflow/tools/ci_build/ci_build.sh
index 9525017793..24c14f2197 100755
--- a/tensorflow/tools/ci_build/ci_build.sh
+++ b/tensorflow/tools/ci_build/ci_build.sh
@@ -20,10 +20,15 @@ CONTAINER_TYPE=$( echo "$1" | tr '[:upper:]' '[:lower:]' )
shift 1
COMMAND=("$@")
+# Figure out the directory where this script is.
+SCRIPT_DIR=$( cd ${0%/*} && pwd -P )
+
# Validate command line arguments.
-if [ "$#" -lt 1 ] || [[ ! "${CONTAINER_TYPE}" =~ ^(cpu|gpu|android)$ ]]; then
+if [ "$#" -lt 1 ] || [ ! -e "${SCRIPT_DIR}/Dockerfile.${CONTAINER_TYPE}" ]; then
+ supported_container_types=$( ls -1 ${SCRIPT_DIR}/Dockerfile.* | \
+ sed -n 's/.*Dockerfile\.\([^\/]*\)/\1/p' | tr '\n' ' ' )
>&2 echo "Usage: $(basename $0) CONTAINER_TYPE COMMAND"
- >&2 echo " CONTAINER_TYPE can be 'CPU' or 'GPU'"
+ >&2 echo " CONTAINER_TYPE can be one of [ ${supported_container_types}]"
>&2 echo " COMMAND is a command (with arguments) to run inside"
>&2 echo " the container."
>&2 echo ""
@@ -38,12 +43,10 @@ fi
if [[ "${CI_DOCKER_EXTRA_PARAMS}" != *"--rm"* ]]; then
CI_DOCKER_EXTRA_PARAMS="--rm ${CI_DOCKER_EXTRA_PARAMS}"
fi
-CI_COMMAND_PREFIX=("${CI_COMMAND_PREFIX[@]:-tensorflow/tools/ci_build/builds/with_the_same_user tensorflow/tools/ci_build/builds/configured ${CONTAINER_TYPE}}")
+CI_TENSORFLOW_SUBMODULE_PATH="${CI_TENSORFLOW_SUBMODULE_PATH:-.}"
+CI_COMMAND_PREFIX=("${CI_COMMAND_PREFIX[@]:-${CI_TENSORFLOW_SUBMODULE_PATH}/tensorflow/tools/ci_build/builds/with_the_same_user ${CI_TENSORFLOW_SUBMODULE_PATH}/tensorflow/tools/ci_build/builds/configured ${CONTAINER_TYPE}}")
-# Figure out the directory where this script is.
-SCRIPT_DIR=$( cd ${0%/*} && pwd -P )
-
# Helper function to traverse directories up until given file is found.
function upsearch () {
test / == "$PWD" && return || \
@@ -60,7 +63,7 @@ BUILD_TAG="${BUILD_TAG:-tf_ci}"
# Add extra params for cuda devices and libraries for GPU container.
if [ "${CONTAINER_TYPE}" == "gpu" ]; then
devices=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
- libs=$(\ls /usr/lib/x86_64-linux-gnu/libcuda* | xargs -I{} echo '-v {}:{}')
+ libs=$(\ls /usr/lib/x86_64-linux-gnu/libcuda.* | xargs -I{} echo '-v {}:{}')
GPU_EXTRA_PARAMS="${devices} ${libs}"
else
GPU_EXTRA_PARAMS=""
@@ -98,12 +101,13 @@ mkdir -p ${WORKSPACE}/bazel-ci_build-cache
docker run \
-v ${WORKSPACE}/bazel-ci_build-cache:${WORKSPACE}/bazel-ci_build-cache \
-e "CI_BUILD_HOME=${WORKSPACE}/bazel-ci_build-cache" \
- -e "CI_BUILD_USER=${USER}" \
- -e "CI_BUILD_UID=$(id -u $USER)" \
- -e "CI_BUILD_GROUP=$(id -g --name $USER)" \
- -e "CI_BUILD_GID=$(id -g $USER)" \
- -v ${WORKSPACE}:/tensorflow \
- -w /tensorflow \
+ -e "CI_BUILD_USER=$(id -u --name)" \
+ -e "CI_BUILD_UID=$(id -u)" \
+ -e "CI_BUILD_GROUP=$(id -g --name)" \
+ -e "CI_BUILD_GID=$(id -g)" \
+ -e "CI_TENSORFLOW_SUBMODULE_PATH=${CI_TENSORFLOW_SUBMODULE_PATH}" \
+ -v ${WORKSPACE}:/workspace \
+ -w /workspace \
${GPU_EXTRA_PARAMS} \
${CI_DOCKER_EXTRA_PARAMS[@]} \
"${DOCKER_IMG_NAME}" \
diff --git a/tensorflow/tools/ci_build/ci_parameterized_build.sh b/tensorflow/tools/ci_build/ci_parameterized_build.sh
index 97b25f32a0..46c1740af6 100755
--- a/tensorflow/tools/ci_build/ci_parameterized_build.sh
+++ b/tensorflow/tools/ci_build/ci_parameterized_build.sh
@@ -21,7 +21,7 @@
# TF_BUILD_CONTAINER_TYPE: (CPU | GPU | ANDROID)
# TF_BUILD_PYTHON_VERSION: (PYTHON2 | PYTHON3)
# TF_BUILD_IS_OPT: (NO_OPT | OPT)
-# TF_BUILD_IS_PIP: (NO_PIP | PIP)
+# TF_BUILD_IS_PIP: (NO_PIP | PIP | BOTH)
#
# Note: certain combinations of parameter values are regarded
# as invalid and will cause the script to exit with code 0. For example:
@@ -49,6 +49,11 @@
# (i.e., bazel test --job=1), potentially useful for
# builds where the tests cannot be run in parallel due to
# resource contention (e.g., for GPU builds)
+# TF_BUILD_TEST_TUTORIALS:
+# If set to any non-empty and non-0 value, will perform
+# tutorial tests (applicable only if TF_BUILD_IS_PIP is
+# PIP or BOTH).
+# See builds/test_tutorials.sh
#
# This script can be used by Jenkins parameterized / matrix builds.
@@ -62,6 +67,12 @@ str_strip () {
echo -e "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
+# Helper function: Exit on failure
+die () {
+ echo $@
+ exit 1
+}
+
##########################################################
# Default configuration
@@ -84,11 +95,12 @@ BAZEL_CLEAN_CMD="bazel clean"
BAZEL_SERIAL_FLAG="--jobs=1"
PIP_CMD="${CI_BUILD_DIR}/builds/pip.sh"
+PIP_TEST_TUTORIALS_FLAG="--test_tutorials"
ANDROID_CMD="${CI_BUILD_DIR}/builds/android.sh"
BAZEL_TARGET="//tensorflow/..."
-
+TUT_TEST_DATA_DIR="/tmp/tf_tutorial_test_data"
##########################################################
@@ -116,6 +128,7 @@ echo " TF_BUILD_APPEND_ARGUMENTS=${TF_BUILD_APPEND_ARGUMENTS}"
echo " TF_BUILD_BAZEL_TARGET=${TF_BUILD_BAZEL_TARGET}"
echo " TF_BUILD_BAZEL_CLEAN=${TF_BUILD_BAZEL_CLEAN}"
echo " TF_BUILD_SERIAL_TESTS=${TF_BUILD_SERIAL_TESTS}"
+echo " TF_BUILD_TEST_TUTORIALS=${TF_BUILD_TEST_TUTORIALS}"
# Process container type
CTYPE=${TF_BUILD_CONTAINER_TYPE}
@@ -127,9 +140,8 @@ elif [[ ${CTYPE} == "gpu" ]]; then
elif [[ ${CTYPE} == "android" ]]; then
:
else
- echo "Unrecognized value in TF_BUILD_CONTAINER_TYPE: "\
+ die "Unrecognized value in TF_BUILD_CONTAINER_TYPE: "\
"\"${TF_BUILD_CONTAINER_TYPE}\""
- exit 1
fi
EXTRA_PARAMS=""
@@ -159,15 +171,15 @@ if [[ ${TF_BUILD_IS_OPT} == "no_opt" ]]; then
elif [[ ${TF_BUILD_IS_OPT} == "opt" ]]; then
OPT_FLAG="${OPT_FLAG} -c opt"
else
- echo "Unrecognized value in TF_BUILD_IS_OPT: \"${TF_BUILD_IS_OPT}\""
- exit 1
+ die "Unrecognized value in TF_BUILD_IS_OPT: \"${TF_BUILD_IS_OPT}\""
fi
# Strip whitespaces from OPT_FLAG
OPT_FLAG=$(str_strip "${OPT_FLAG}")
# Process PIP install-test option
-if [[ ${TF_BUILD_IS_PIP} == "no_pip" ]]; then
+if [[ ${TF_BUILD_IS_PIP} == "no_pip" ]] ||
+ [[ ${TF_BUILD_IS_PIP} == "both" ]]; then
# Process optional bazel target override
if [[ ! -z "${TF_BUILD_BAZEL_TARGET}" ]]; then
BAZEL_TARGET=${TF_BUILD_BAZEL_TARGET}
@@ -175,9 +187,9 @@ if [[ ${TF_BUILD_IS_PIP} == "no_pip" ]]; then
if [[ ${CTYPE} == "cpu" ]] || [[ ${CTYPE} == "gpu" ]]; then
# Run Bazel
- MAIN_CMD="${MAIN_CMD} ${BAZEL_CMD} ${OPT_FLAG} "\
+ NO_PIP_MAIN_CMD="${MAIN_CMD} ${BAZEL_CMD} ${OPT_FLAG} "\
"${TF_BUILD_APPEND_ARGUMENTS} ${BAZEL_TARGET}"
- MAIN_CMD=$(str_strip "${MAIN_CMD}")
+ NO_PIP_MAIN_CMD=$(str_strip "${NO_PIP_MAIN_CMD}")
if [[ ! -z "${TF_BUILD_SERIAL_TESTS}" ]] &&
[[ "${TF_BUILD_SERIAL_TESTS}" != "0" ]]; then
@@ -189,15 +201,19 @@ if [[ ${TF_BUILD_IS_PIP} == "no_pip" ]]; then
"${TF_BUILD_APPEND_ARGUMENTS} ${BAZEL_TARGET}"
echo "Build-only command: ${BUILD_ONLY_CMD}"
- MAIN_CMD="${BUILD_ONLY_CMD} && "\
+ NO_PIP_MAIN_CMD="${BUILD_ONLY_CMD} && "\
"${BAZEL_CMD} ${OPT_FLAG} ${BAZEL_SERIAL_FLAG} "\
"${TF_BUILD_APPEND_ARGUMENTS} ${BAZEL_TARGET}"
- echo "Parallel-build + serial-test command: ${MAIN_CMD}"
+ echo "Parallel-build + serial-test command: ${NO_PIP_MAIN_CMD}"
fi
elif [[ ${CTYPE} == "android" ]]; then
- MAIN_CMD="${ANDROID_CMD} ${OPT_FLAG} "
+ NO_PIP_MAIN_CMD="${ANDROID_CMD} ${OPT_FLAG} "
fi
-elif [[ ${TF_BUILD_IS_PIP} == "pip" ]]; then
+
+fi
+
+if [[ ${TF_BUILD_IS_PIP} == "pip" ]] ||
+ [[ ${TF_BUILD_IS_PIP} == "both" ]]; then
# Android builds conflict with PIP builds
if [[ ${CTYPE} == "android" ]]; then
echo "Skipping parameter combination: ${TF_BUILD_IS_PIP} & "\
@@ -205,13 +221,36 @@ elif [[ ${TF_BUILD_IS_PIP} == "pip" ]]; then
exit 0
fi
- MAIN_CMD="${MAIN_CMD} ${PIP_CMD} ${CTYPE} "\
+ PIP_MAIN_CMD="${MAIN_CMD} ${PIP_CMD} ${CTYPE} "\
"${TF_BUILD_APPEND_ARGUMENTS}"
+
+ # Add command for tutorial test
+ if [[ ! -z "${TF_BUILD_TEST_TUTORIALS}" ]] &&
+ [[ "${TF_BUILD_TEST_TUTORIALS}" != "0" ]]; then
+ PIP_MAIN_CMD="${PIP_MAIN_CMD} ${PIP_TEST_TUTORIALS_FLAG}"
+
+ # Prepare data directory for tutorial tests
+ mkdir -p "${TUT_TEST_DATA_DIR}" ||
+ die "FAILED to create data directory for tutorial tests: "\
+ "${TUT_TEST_DATA_DIR}"
+
+ if [[ "${DO_DOCKER}" == "1" ]]; then
+ EXTRA_PARAMS="${EXTRA_PARAMS} -v ${TUT_TEST_DATA_DIR}:${TUT_TEST_DATA_DIR}"
+ fi
+ fi
+fi
+
+if [[ ${TF_BUILD_IS_PIP} == "no_pip" ]]; then
+ MAIN_CMD="${NO_PIP_MAIN_CMD}"
+elif [[ ${TF_BUILD_IS_PIP} == "pip" ]]; then
+ MAIN_CMD="${PIP_MAIN_CMD}"
+elif [[ ${TF_BUILD_IS_PIP} == "both" ]]; then
+ MAIN_CMD="${NO_PIP_MAIN_CMD} && ${PIP_MAIN_CMD}"
else
- echo "Unrecognized value in TF_BUILD_IS_PIP: \"${TF_BUILD_IS_PIP}\""
- exit 1
+ die "Unrecognized value in TF_BUILD_IS_PIP: \"${TF_BUILD_IS_PIP}\""
fi
+
# Process Python version
if [[ ${TF_BUILD_PYTHON_VERSION} == "python2" ]]; then
:
@@ -223,8 +262,7 @@ elif [[ ${TF_BUILD_PYTHON_VERSION} == "python3" ]]; then
# Determine the path to python3
PYTHON3_PATH=$(which python3 | head -1)
if [[ -z "${PYTHON3_PATH}" ]]; then
- echo "ERROR: Failed to locate python3 binary on the system"
- exit 1
+ die "ERROR: Failed to locate python3 binary on the system"
else
echo "Found python3 binary at: ${PYTHON3_PATH}"
fi
@@ -233,9 +271,8 @@ elif [[ ${TF_BUILD_PYTHON_VERSION} == "python3" ]]; then
fi
else
- echo "Unrecognized value in TF_BUILD_PYTHON_VERSION: "\
+ die "Unrecognized value in TF_BUILD_PYTHON_VERSION: "\
"\"${TF_BUILD_PYTHON_VERSION}\""
- exit 1
fi
# Append additional Docker extra parameters
@@ -253,6 +290,15 @@ TMP_SCRIPT=/tmp/ci_parameterized_build_${RAND_STR}.sh
if [[ "${DO_DOCKER}" == "1" ]]; then
# Map the tmp script into the Docker container
EXTRA_PARAMS="${EXTRA_PARAMS} -v ${TMP_SCRIPT}:/tmp/tf_build.sh"
+
+ if [[ ! -z "${TF_BUILD_BAZEL_CLEAN}" ]] &&
+ [[ "${TF_BUILD_BAZEL_CLEAN}" != "0" ]] &&
+ [[ "${TF_BUILD_IS_PIP}" != "both" ]]; then
+ # For TF_BUILD_IS_PIP == both, "bazel clean" will have already
+ # been performed before the "bazel test" step
+ EXTRA_PARAMS="${EXTRA_PARAMS} -e TF_BUILD_BAZEL_CLEAN=1"
+ fi
+
EXTRA_PARAMS=$(str_strip "${EXTRA_PARAMS}")
echo "Exporting CI_DOCKER_EXTRA_PARAMS: ${EXTRA_PARAMS}"
@@ -275,6 +321,7 @@ echo ""
chmod +x ${TMP_SCRIPT}
+FAILURE=0
if [[ ! -z "${TF_BUILD_DRY_RUN}" ]] && [[ ${TF_BUILD_DRY_RUN} != "0" ]]; then
# Do a dry run: just print the final command
echo "*** This is a DRY RUN ***"
@@ -285,7 +332,12 @@ else
else
${TMP_SCRIPT}
fi
-fi && FAILURE=0 || FAILURE=1
+
+ if [[ $? != "0" ]]; then
+ FAILURE=1
+ fi
+fi
+
[[ ${FAILURE} == "0" ]] && RESULT="SUCCESS" || RESULT="FAILURE"
rm -f ${TMP_SCRIPT}
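The new BOTH mode can be exercised, for instance from a Jenkins matrix job, with an invocation along these lines; all variable values below are illustrative.

    $ TF_BUILD_CONTAINER_TYPE=CPU TF_BUILD_PYTHON_VERSION=PYTHON2 \
      TF_BUILD_IS_OPT=OPT TF_BUILD_IS_PIP=BOTH \
      TF_BUILD_TEST_TUTORIALS=1 \
      tensorflow/tools/ci_build/ci_parameterized_build.sh

This runs the plain bazel build/test first and then the PIP build, install and tests-on-install, with the tutorial tests appended to the PIP half.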
diff --git a/tensorflow/tools/ci_build/install/install_bazel.sh b/tensorflow/tools/ci_build/install/install_bazel.sh
index 8c3aa2b639..e6ac91e722 100755
--- a/tensorflow/tools/ci_build/install/install_bazel.sh
+++ b/tensorflow/tools/ci_build/install/install_bazel.sh
@@ -17,7 +17,7 @@
set -e
# Select bazel version.
-BAZEL_VERSION="0.1.4"
+BAZEL_VERSION="0.2.0"
# Install bazel.
mkdir /bazel
diff --git a/tensorflow/tools/ci_build/install/install_openjdk8_from_ppa.sh b/tensorflow/tools/ci_build/install/install_bootstrap_deb_packages.sh
index 7f2e8be8c8..3b574692a0 100755
--- a/tensorflow/tools/ci_build/install/install_openjdk8_from_ppa.sh
+++ b/tensorflow/tools/ci_build/install/install_bootstrap_deb_packages.sh
@@ -16,9 +16,9 @@
set -e
-# Install openjdk 8 for bazel from PPA (it is not available in 14.04).
-add-apt-repository -y ppa:openjdk-r/ppa
+# Install bootstrap dependencies from ubuntu deb repository.
apt-get update
-apt-get install -y openjdk-8-jdk openjdk-8-jre-headless
+apt-get install -y \
+ software-properties-common
apt-get clean
rm -rf /var/lib/apt/lists/*
diff --git a/tensorflow/tools/ci_build/install/install_deb_packages.sh b/tensorflow/tools/ci_build/install/install_deb_packages.sh
index 9fe28b7894..b752e86d69 100755
--- a/tensorflow/tools/ci_build/install/install_deb_packages.sh
+++ b/tensorflow/tools/ci_build/install/install_deb_packages.sh
@@ -23,14 +23,17 @@ apt-get install -y \
build-essential \
curl \
git \
+ openjdk-8-jdk \
+ openjdk-8-jre-headless \
pkg-config \
python-dev \
python-numpy \
python-pip \
+ python-virtualenv \
python3-dev \
python3-numpy \
python3-pip \
- software-properties-common \
+ sudo \
swig \
unzip \
wget \
diff --git a/tensorflow/tools/ci_build/update_version.sh b/tensorflow/tools/ci_build/update_version.sh
new file mode 100755
index 0000000000..36a1e39f3a
--- /dev/null
+++ b/tensorflow/tools/ci_build/update_version.sh
@@ -0,0 +1,134 @@
+#!/usr/bin/env bash
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# Automatically update TensorFlow version in source files
+#
+# Usage: update_version.sh <new_major_ver>.<new_minor_ver>.<new_patch_ver>
+# e.g.,
+# update_version.sh 0.7.2
+#
+
+# Helper functions
+die() {
+ echo $1
+ exit 1
+}
+
+check_existence() {
+ # Usage: check_existence (dir|file) <path>
+
+ if [[ "$1" == "dir" ]]; then
+ test -d "$2" ||
+ die "ERROR: Cannot find directory ${2}. "\
+"Are you under the TensorFlow source root directory?"
+ else
+ test -f "$2" ||
+ die "ERROR: Cannot find file ${2}. "\
+"Are you under the TensorFlow source root directory?"
+ fi
+}
+
+
+TF_SRC_DIR="tensorflow"
+check_existence dir "${TF_SRC_DIR}"
+
+# Process command-line arguments
+if [[ $# != 1 ]]; then
+ die "Usage: $(basename $0) <new_major_ver>.<new_minor_ver>.<new_patch_ver>"
+fi
+NEW_VER=$1
+
+# Check validity of new version string
+echo "${NEW_VER}" | grep -q -E "[0-9]+\.[0-9]+\.[0-9]+"
+if [[ $? != "0" ]]; then
+ die "ERROR: Invalid new version: \"${NEW_VER}\""
+fi
+
+# Extract major, minor and patch versions
+MAJOR=$(echo "${NEW_VER}" | cut -d \. -f 1)
+MINOR=$(echo "${NEW_VER}" | cut -d \. -f 2)
+PATCH=$(echo "${NEW_VER}" | cut -d \. -f 3)
+
+# Update tensorflow/core/public/version.h
+VERSION_H="${TF_SRC_DIR}/core/public/version.h"
+check_existence file "${VERSION_H}"
+
+OLD_MAJOR=$(cat ${VERSION_H} | grep -E "^#define TF_MAJOR_VERSION [0-9]+" | \
+cut -d ' ' -f 3)
+OLD_MINOR=$(cat ${VERSION_H} | grep -E "^#define TF_MINOR_VERSION [0-9]+" | \
+cut -d ' ' -f 3)
+OLD_PATCH=$(cat ${VERSION_H} | grep -E "^#define TF_PATCH_VERSION [0-9]+" | \
+cut -d ' ' -f 3)
+
+sed -i -e "s/^#define TF_MAJOR_VERSION ${OLD_MAJOR}/#define TF_MAJOR_VERSION ${MAJOR}/g" ${VERSION_H}
+sed -i -e "s/^#define TF_MINOR_VERSION ${OLD_MINOR}/#define TF_MINOR_VERSION ${MINOR}/g" ${VERSION_H}
+sed -i -e "s/^#define TF_PATCH_VERSION ${OLD_PATCH}/#define TF_PATCH_VERSION ${PATCH}/g" "${VERSION_H}"
+
+
+# Update setup.py
+SETUP_PY="${TF_SRC_DIR}/tools/pip_package/setup.py"
+check_existence file "${SETUP_PY}"
+
+sed -i -e "s/^\_VERSION = [\'\"].*[\'\"]/\_VERSION = \'${MAJOR}.${MINOR}.${PATCH}\'/g" "${SETUP_PY}"
+
+
+# Update Dockerfiles in tensorflow/tools/docker/
+TOOLS_DOCKER_DIR="${TF_SRC_DIR}/tools/docker"
+check_existence dir "${TOOLS_DOCKER_DIR}"
+
+# Determine the files that need to be modified
+DOCKERFILES=$(grep -lrE "^ENV TENSORFLOW_VERSION .+" ${TOOLS_DOCKER_DIR})
+for DOCKERF in ${DOCKERFILES}; do
+ sed -i -r -e "s/^ENV TENSORFLOW_VERSION .+/ENV TENSORFLOW_VERSION ${MAJOR}.${MINOR}.${PATCH}/g" "${DOCKERF}"
+done
+
+
+# Update os_setup.md
+OS_SETUP="${TF_SRC_DIR}/g3doc/get_started/os_setup.md"
+check_existence file "${OS_SETUP}"
+
+sed -i -r -e "s/(.*pip[0-9]* install .*tensorflow-)([0-9]+\.[0-9]+\.[0-9]+)(.*\.whl)/\1${MAJOR}.${MINOR}.${PATCH}\3/g" "${OS_SETUP}"
+
+sed -i -r -e "s/(.*\(e\.g\..*[^0-9])([0-9]+\.[0-9]+\.[0-9]+)([^0-9].*\).*)/\1${MAJOR}.${MINOR}.${PATCH}\3/g" "${OS_SETUP}"
+
+
+# Update README.md
+README_MD="./README.md"
+check_existence file "${README_MD}"
+
+sed -i -r -e "s/${OLD_MAJOR}\.${OLD_MINOR}\.${OLD_PATCH}/${MAJOR}.${MINOR}.${PATCH}/g" "${README_MD}"
+
+
+echo "Major: ${OLD_MAJOR} -> ${MAJOR}"
+echo "Minor: ${OLD_MINOR} -> ${MINOR}"
+echo "Patch: ${OLD_PATCH} -> ${PATCH}"
+echo ""
+
+# Look for potentially lingering old version strings in TensorFlow source files
+OLD_VER="${OLD_MAJOR}\.${OLD_MINOR}\.${OLD_PATCH}"
+LINGER_STRS=$(grep -rnoH "${OLD_VER}" "${TF_SRC_DIR}")
+
+if [[ ! -z "${LINGER_STRS}" ]]; then
+ echo "WARNING: Below are potentially instances of lingering old version "\
+"string (${OLD_VER}) in source directory \"${TF_SRC_DIR}/\" that are not "\
+"updated by this script. Please check them manually!"
+ for LINGER_STR in ${LINGER_STRS}; do
+ echo "${LINGER_STR}"
+ done
+else
+ echo "No lingering old version strings found in source directory "\
+"\"${TF_SRC_DIR}/\". Good."
+fi
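As a sketch, the version bump performed elsewhere in this change could have been driven from the source root with:

    $ tensorflow/tools/ci_build/update_version.sh 0.7.1

The script rewrites version.h, setup.py, the Dockerfiles and os_setup.md, then warns about any remaining occurrences of the old version string.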
diff --git a/tensorflow/tools/docker/Dockerfile b/tensorflow/tools/docker/Dockerfile
index 69e502d098..a8156559ed 100644
--- a/tensorflow/tools/docker/Dockerfile
+++ b/tensorflow/tools/docker/Dockerfile
@@ -4,6 +4,7 @@ MAINTAINER Craig Citro <craigcitro@google.com>
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y \
+ bc \
curl \
libfreetype6-dev \
libpng12-dev \
@@ -28,13 +29,16 @@ RUN pip --no-cache-dir install \
python -m ipykernel.kernelspec
# Install TensorFlow CPU version.
-ENV TENSORFLOW_VERSION 0.7.0
+ENV TENSORFLOW_VERSION 0.7.1
RUN pip --no-cache-dir install \
- http://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-${TENSORFLOW_VERSION}-py2-none-linux_x86_64.whl
+ http://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
# Set up our notebook config.
COPY jupyter_notebook_config.py /root/.jupyter/
+# Copy sample notebooks.
+COPY notebooks /notebooks
+
# Jupyter has issues with being run directly:
# https://github.com/ipython/ipython/issues/7062
# We just add a little wrapper script.
@@ -45,6 +49,6 @@ EXPOSE 6006
# IPython
EXPOSE 8888
-WORKDIR "/root"
+WORKDIR "/notebooks"
-CMD ["/bin/bash"]
+CMD ["/run_jupyter.sh"]
diff --git a/tensorflow/tools/docker/Dockerfile.devel b/tensorflow/tools/docker/Dockerfile.devel
index 1a30c7d700..ac8e885fd7 100644
--- a/tensorflow/tools/docker/Dockerfile.devel
+++ b/tensorflow/tools/docker/Dockerfile.devel
@@ -64,7 +64,7 @@ RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
>>/root/.bazelrc
ENV BAZELRC /root/.bazelrc
# Install the most recent bazel release.
-ENV BAZEL_VERSION 0.1.4
+ENV BAZEL_VERSION 0.2.0
WORKDIR /
RUN mkdir /bazel && \
cd /bazel && \
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index 56de5940ab..3c85d16a9d 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:7.0-cudnn2-devel
+FROM nvidia/cuda:7.5-cudnn4-devel
MAINTAINER Craig Citro <craigcitro@google.com>
@@ -64,7 +64,7 @@ RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
>>/root/.bazelrc
ENV BAZELRC /root/.bazelrc
# Install the most recent bazel release.
-ENV BAZEL_VERSION 0.1.4
+ENV BAZEL_VERSION 0.2.0
WORKDIR /
RUN mkdir /bazel && \
cd /bazel && \
@@ -96,7 +96,6 @@ WORKDIR /root
# Set up CUDA variables
ENV CUDA_PATH /usr/local/cuda
-ENV LD_LIBRARY_PATH /usr/local/cuda/lib64
# TensorBoard
EXPOSE 6006
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index 77699ebb42..8d2ede62b5 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -1,9 +1,10 @@
-FROM nvidia/cuda:7.0-cudnn2-runtime
+FROM nvidia/cuda:7.5-cudnn4-runtime
MAINTAINER Craig Citro <craigcitro@google.com>
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y \
+ bc \
curl \
libfreetype6-dev \
libpng12-dev \
@@ -28,13 +29,16 @@ RUN pip --no-cache-dir install \
python -m ipykernel.kernelspec
# Install TensorFlow GPU version.
-ENV TENSORFLOW_VERSION 0.7.0
+ENV TENSORFLOW_VERSION 0.7.1
RUN pip --no-cache-dir install \
- http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-py2-none-linux_x86_64.whl
+ http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
# Set up our notebook config.
COPY jupyter_notebook_config.py /root/.jupyter/
+# Copy sample notebooks.
+COPY notebooks /notebooks
+
# Jupyter has issues with being run directly:
# https://github.com/ipython/ipython/issues/7062
# We just add a little wrapper script.
@@ -45,6 +49,6 @@ EXPOSE 6006
# IPython
EXPOSE 8888
-WORKDIR "/root"
+WORKDIR "/notebooks"
-CMD ["/bin/bash"]
+CMD ["/run_jupyter.sh"]
diff --git a/tensorflow/tools/docker/README.md b/tensorflow/tools/docker/README.md
index e94b11e4f2..f7ec66d933 100644
--- a/tensorflow/tools/docker/README.md
+++ b/tensorflow/tools/docker/README.md
@@ -38,7 +38,7 @@ NVidia libraries available on their system, as well as providing mappings so
that the container can see the host's GPU. For most purposes, this can be
accomplished via
- $ export CUDA_SO=$(\ls /usr/lib/x86_64-linux-gnu/libcuda* | xargs -I{} echo '-v {}:{}')
+ $ export CUDA_SO=$(\ls /usr/lib/x86_64-linux-gnu/libcuda.* | xargs -I{} echo '-v {}:{}')
$ export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
$ docker run -it -p 8888:8888 $CUDA_SO $DEVICES b.gcr.io/tensorflow/tensorflow-devel-gpu
diff --git a/tensorflow/tools/docker/docker_run_gpu.sh b/tensorflow/tools/docker/docker_run_gpu.sh
index 699b39dae1..9ebfa701e4 100755
--- a/tensorflow/tools/docker/docker_run_gpu.sh
+++ b/tensorflow/tools/docker/docker_run_gpu.sh
@@ -24,7 +24,7 @@ if [ ! -d ${CUDA_HOME}/lib64 ]; then
exit 1
fi
-export CUDA_SO=$(\ls /usr/lib/x86_64-linux-gnu/libcuda* | \
+export CUDA_SO=$(\ls /usr/lib/x86_64-linux-gnu/libcuda.* | \
xargs -I{} echo '-v {}:{}')
export DEVICES=$(\ls /dev/nvidia* | \
xargs -I{} echo '--device {}:{}')
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index 7474757a6c..11b57134f6 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -19,6 +19,7 @@ from __future__ import print_function
import fnmatch
import os
+import platform
import re
import sys
@@ -26,10 +27,17 @@ from setuptools import find_packages, setup, Command, Extension
from setuptools.command.install import install as InstallCommandBase
from setuptools.dist import Distribution
-_VERSION = '0.7.0'
+_VERSION = '0.7.1'
+
+numpy_version = "1.8.2"
+if platform.system() == "Darwin":
+ # There are bugs with numpy pip installation on OS X prior to
+ # 1.10.1, so on mac we require a higher version than on other
+ # platforms.
+ numpy_version = "1.10.1"
REQUIRED_PACKAGES = [
- 'numpy >= 1.8.2',
+ 'numpy >= %s' % numpy_version,
'six >= 1.10.0',
'protobuf == 3.0.0b2',
]
@@ -43,7 +51,7 @@ else:
# pylint: disable=line-too-long
CONSOLE_SCRIPTS = [
- 'tensorboard = tensorflow.tensorboard.backend.tensorboard:main',
+ 'tensorboard = tensorflow.tensorboard.tensorboard:main',
]
# pylint: enable=line-too-long
diff --git a/tensorflow/tools/test/BUILD b/tensorflow/tools/test/BUILD
new file mode 100644
index 0000000000..4a8bb87a77
--- /dev/null
+++ b/tensorflow/tools/test/BUILD
@@ -0,0 +1,40 @@
+# Description:
+# Tools for testing
+
+package(default_visibility = ["//tensorflow:__subpackages__"])
+
+licenses(["notice"]) # Apache 2.0
+
+exports_files(["LICENSE"])
+
+py_library(
+ name = "system_info_lib",
+ srcs = [
+ "gpu_info_lib.py",
+ "system_info_lib.py",
+ ],
+ srcs_version = "PY2AND3",
+ deps = ["//tensorflow:tensorflow_py"],
+)
+
+py_binary(
+ name = "system_info",
+ srcs = ["system_info.py"],
+ srcs_version = "PY2AND3",
+ deps = [
+ ":system_info_lib",
+ "//tensorflow:tensorflow_py",
+ ],
+)
+
+filegroup(
+ name = "all_files",
+ srcs = glob(
+ ["**/*"],
+ exclude = [
+ "**/METADATA",
+ "**/OWNERS",
+ ],
+ ),
+ visibility = ["//tensorflow:__subpackages__"],
+)
diff --git a/tensorflow/tools/test/__init__.py b/tensorflow/tools/test/__init__.py
new file mode 100644
index 0000000000..0468856532
--- /dev/null
+++ b/tensorflow/tools/test/__init__.py
@@ -0,0 +1,20 @@
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Tools for testing."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
diff --git a/tensorflow/tools/test/gpu_info_lib.py b/tensorflow/tools/test/gpu_info_lib.py
new file mode 100644
index 0000000000..cfb7d89920
--- /dev/null
+++ b/tensorflow/tools/test/gpu_info_lib.py
@@ -0,0 +1,184 @@
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Library for getting system information during TensorFlow tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+import ctypes as ct
+import platform
+
+import tensorflow as tf
+
+from tensorflow.core.util import test_log_pb2
+
+
+def _gather_gpu_devices_proc():
+ """Try to gather NVidia GPU device information via /proc/driver."""
+ dev_info = []
+ for f in tf.gfile.Glob("/proc/driver/nvidia/gpus/*/information"):
+ bus_id = f.split("/")[5]
+ key_values = dict(
+ line.rstrip().replace("\t", "").split(":", 1)
+ for line in tf.gfile.GFile(f, "r"))
+ key_values = dict(
+ (k.lower(), v.strip(" ").rstrip(" "))
+ for (k, v) in key_values.items())
+ info = test_log_pb2.GPUInfo()
+ info.model = key_values.get("model", "Unknown")
+ info.uuid = key_values.get("gpu uuid", "Unknown")
+ info.bus_id = bus_id
+ dev_info.append(info)
+ return dev_info
+
+
+class CUDADeviceProperties(ct.Structure):
+ # See $CUDA_HOME/include/cuda_runtime_api.h for the definition of
+ # the cudaDeviceProp struct.
+ _fields_ = [
+ ("name", ct.c_char * 256),
+ ("totalGlobalMem", ct.c_size_t),
+ ("sharedMemPerBlock", ct.c_size_t),
+ ("regsPerBlock", ct.c_int),
+ ("warpSize", ct.c_int),
+ ("memPitch", ct.c_size_t),
+ ("maxThreadsPerBlock", ct.c_int),
+ ("maxThreadsDim", ct.c_int * 3),
+ ("maxGridSize", ct.c_int * 3),
+ ("clockRate", ct.c_int),
+ ("totalConstMem", ct.c_size_t),
+ ("major", ct.c_int),
+ ("minor", ct.c_int),
+ ("textureAlignment", ct.c_size_t),
+ ("texturePitchAlignment", ct.c_size_t),
+ ("deviceOverlap", ct.c_int),
+ ("multiProcessorCount", ct.c_int),
+ ("kernelExecTimeoutEnabled", ct.c_int),
+ ("integrated", ct.c_int),
+ ("canMapHostMemory", ct.c_int),
+ ("computeMode", ct.c_int),
+ ("maxTexture1D", ct.c_int),
+ ("maxTexture1DMipmap", ct.c_int),
+ ("maxTexture1DLinear", ct.c_int),
+ ("maxTexture2D", ct.c_int * 2),
+ ("maxTexture2DMipmap", ct.c_int * 2),
+ ("maxTexture2DLinear", ct.c_int * 3),
+ ("maxTexture2DGather", ct.c_int * 2),
+ ("maxTexture3D", ct.c_int * 3),
+ ("maxTexture3DAlt", ct.c_int * 3),
+ ("maxTextureCubemap", ct.c_int),
+ ("maxTexture1DLayered", ct.c_int * 2),
+ ("maxTexture2DLayered", ct.c_int * 3),
+ ("maxTextureCubemapLayered", ct.c_int * 2),
+ ("maxSurface1D", ct.c_int),
+ ("maxSurface2D", ct.c_int * 2),
+ ("maxSurface3D", ct.c_int * 3),
+ ("maxSurface1DLayered", ct.c_int * 2),
+ ("maxSurface2DLayered", ct.c_int * 3),
+ ("maxSurfaceCubemap", ct.c_int),
+ ("maxSurfaceCubemapLayered", ct.c_int * 2),
+ ("surfaceAlignment", ct.c_size_t),
+ ("concurrentKernels", ct.c_int),
+ ("ECCEnabled", ct.c_int),
+ ("pciBusID", ct.c_int),
+ ("pciDeviceID", ct.c_int),
+ ("pciDomainID", ct.c_int),
+ ("tccDriver", ct.c_int),
+ ("asyncEngineCount", ct.c_int),
+ ("unifiedAddressing", ct.c_int),
+ ("memoryClockRate", ct.c_int),
+ ("memoryBusWidth", ct.c_int),
+ ("l2CacheSize", ct.c_int),
+ ("maxThreadsPerMultiProcessor", ct.c_int),
+ ("streamPrioritiesSupported", ct.c_int),
+ ("globalL1CacheSupported", ct.c_int),
+ ("localL1CacheSupported", ct.c_int),
+ ("sharedMemPerMultiprocessor", ct.c_size_t),
+ ("regsPerMultiprocessor", ct.c_int),
+ ("managedMemSupported", ct.c_int),
+ ("isMultiGpuBoard", ct.c_int),
+ ("multiGpuBoardGroupID", ct.c_int),
+ # Pad with extra space to avoid dereference crashes if future
+ # versions of CUDA extend the size of this struct.
+ ("__future_buffer", ct.c_char * 4096)]
+
+
+def _gather_gpu_devices_cudart():
+ """Try to gather NVidia GPU device information via libcudart."""
+ dev_info = []
+
+ system = platform.system()
+ if system == "Linux":
+ libcudart = ct.cdll.LoadLibrary("libcudart.so")
+ elif system == "Darwin":
+ libcudart = ct.cdll.LoadLibrary("libcudart.dylib")
+ elif system == "Windows":
+ libcudart = ct.windll.LoadLibrary("libcudart.dll")
+ else:
+ raise NotImplementedError("Cannot identify system.")
+
+ version = ct.c_int()
+ rc = libcudart.cudaRuntimeGetVersion(ct.byref(version))
+ if rc != 0:
+ raise ValueError("Could not get version")
+ if version.value < 6050:
+ raise NotImplementedError("CUDA version must be between >= 6.5")
+
+ device_count = ct.c_int()
+ libcudart.cudaGetDeviceCount(ct.byref(device_count))
+
+ for i in range(device_count.value):
+ properties = CUDADeviceProperties()
+ rc = libcudart.cudaGetDeviceProperties(ct.byref(properties), i)
+ if rc != 0:
+ raise ValueError("Could not get device properties")
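+    # cudaDeviceGetPCIBusId writes a NUL-terminated string of the form
+    # "0000:00:00.0" (12 characters plus the terminator, hence 13 bytes).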
+ pci_bus_id = " " * 13
+ rc = libcudart.cudaDeviceGetPCIBusId(ct.c_char_p(pci_bus_id), 13, i)
+ if rc != 0:
+ raise ValueError("Could not get device PCI bus id")
+
+ info = test_log_pb2.GPUInfo() # No UUID available
+ info.model = properties.name
+    info.bus_id = pci_bus_id.value
+ dev_info.append(info)
+
+ del properties
+
+ return dev_info
+
+
+def gather_gpu_devices():
+ """Gather gpu device info.
+
+ Returns:
+ A list of test_log_pb2.GPUInfo messages.
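+
+  Example (illustrative only; values depend on the local machine):
+    for gpu in gather_gpu_devices():
+      print(gpu.model, gpu.bus_id, gpu.uuid)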
+ """
+ try:
+    # Prefer /proc if possible; it provides the GPU UUID.
+ dev_info = _gather_gpu_devices_proc()
+ if not dev_info:
+ raise ValueError("No devices found")
+ return dev_info
+ except (IOError, ValueError):
+ pass
+
+ try:
+ # Fall back on using libcudart
+ return _gather_gpu_devices_cudart()
+ except (OSError, ValueError, NotImplementedError):
+ return []
diff --git a/tensorflow/tools/test/system_info.py b/tensorflow/tools/test/system_info.py
new file mode 100644
index 0000000000..dcbbe1ce1a
--- /dev/null
+++ b/tensorflow/tools/test/system_info.py
@@ -0,0 +1,33 @@
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Library for getting system information during TensorFlow tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+from tensorflow.tools.test import system_info_lib
+
+
+def main(unused_args):
+ config = system_info_lib.gather_machine_configuration()
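+  # Printing the protobuf message emits its human-readable text format.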
+ print(config)
+
+
+if __name__ == "__main__":
+ tf.app.run()
diff --git a/tensorflow/tools/test/system_info_lib.py b/tensorflow/tools/test/system_info_lib.py
new file mode 100644
index 0000000000..c36a6c6b13
--- /dev/null
+++ b/tensorflow/tools/test/system_info_lib.py
@@ -0,0 +1,149 @@
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Library for getting system information during TensorFlow tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import multiprocessing
+import platform
+import re
+import socket
+
+import tensorflow as tf
+
+# pylint: disable=g-bad-import-order
+# Note: cpuinfo and psutil are not bundled with the TensorFlow OSS tree;
+# install them via pip ("pip install py-cpuinfo psutil").
+import cpuinfo
+import psutil
+# pylint: enable=g-bad-import-order
+
+from tensorflow.core.util import test_log_pb2
+from tensorflow.python.client import device_lib
+from tensorflow.tools.test import gpu_info_lib
+
+
+def gather_machine_configuration():
+ """Gather Machine Configuration. This is the top level fn of this library."""
+ config = test_log_pb2.MachineConfiguration()
+
+ config.cpu_info.CopyFrom(gather_cpu_info())
+ config.platform_info.CopyFrom(gather_platform_info())
+
+ # gather_available_device_info must come before gather_gpu_devices
+ # because the latter may access libcudart directly, which confuses
+ # TensorFlow StreamExecutor.
+ for d in gather_available_device_info():
+ config.available_device_info.add().CopyFrom(d)
+ for gpu in gpu_info_lib.gather_gpu_devices():
+ config.device_info.add().Pack(gpu)
+
+ config.memory_info.CopyFrom(gather_memory_info())
+
+ config.hostname = gather_hostname()
+
+ return config
+
+
+def gather_hostname():
+ return socket.gethostname()
+
+
+def gather_memory_info():
+ """Gather memory info."""
+ mem_info = test_log_pb2.MemoryInfo()
+ vmem = psutil.virtual_memory()
+ mem_info.total = vmem.total
+ mem_info.available = vmem.available
+ return mem_info
+
+
+def gather_cpu_info():
+ """Gather CPU Information. Assumes all CPUs are the same."""
+ cpu_info = test_log_pb2.CPUInfo()
+ cpu_info.num_cores = multiprocessing.cpu_count()
+
+ # Gather num_cores_allowed
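+  # /proc/self/status reports Cpus_allowed as a hexadecimal affinity mask;
+  # the number of set bits is the number of CPUs this process may use.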
+ try:
+ with tf.gfile.GFile('/proc/self/status') as fh:
+ nc = re.search(r'(?m)^Cpus_allowed:\s*(.*)$', fh.read())
+ if nc: # e.g. 'ff' => 8, 'fff' => 12
+ cpu_info.num_cores_allowed = (
+ bin(int(nc.group(1).replace(',', ''), 16)).count('1'))
+ except IOError:
+ pass
+ finally:
+ if cpu_info.num_cores_allowed == 0:
+ cpu_info.num_cores_allowed = cpu_info.num_cores
+
+ # Gather the rest
+ info = cpuinfo.get_cpu_info()
+ cpu_info.cpu_info = info['brand']
+ cpu_info.num_cores = info['count']
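+  # hz_advertised_raw[0] is the advertised clock rate in Hz; store it in MHz.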
+ cpu_info.mhz_per_cpu = info['hz_advertised_raw'][0] / 1.0e6
+ l2_cache_size = re.match(r'(\d+)', str(info['l2_cache_size']))
+ if l2_cache_size:
+ # If a value is returned, it's in KB
+ cpu_info.cache_size['L2'] = int(l2_cache_size.group(0)) * 1024
+
+ # Try to get the CPU governor
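+  # Each scaling_governor file contains a single word such as 'performance'
+  # or 'powersave'; report 'mixed' if not all cores agree.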
+ try:
+    cpu_governors = set(
+        tf.gfile.GFile(f, 'r').readline().rstrip()
+        for f in tf.gfile.Glob(
+            '/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'))
+ if cpu_governors:
+ if len(cpu_governors) > 1:
+ cpu_info.cpu_governor = 'mixed'
+ else:
+ cpu_info.cpu_governor = list(cpu_governors)[0]
+ except IOError:
+ pass
+
+ return cpu_info
+
+
+def gather_available_device_info():
+ """Gather list of devices available to TensorFlow.
+
+ Returns:
+ A list of test_log_pb2.AvailableDeviceInfo messages.
+ """
+ device_info_list = []
+ devices = device_lib.list_local_devices()
+
+ for d in devices:
+ device_info = test_log_pb2.AvailableDeviceInfo()
+ device_info.name = d.name
+ device_info.type = d.device_type
+ device_info.memory_limit = d.memory_limit
+ device_info.physical_description = d.physical_device_desc
+ device_info_list.append(device_info)
+
+ return device_info_list
+
+
+def gather_platform_info():
+ """Gather platform info."""
+ platform_info = test_log_pb2.PlatformInfo()
+ (platform_info.bits, platform_info.linkage) = platform.architecture()
+ platform_info.machine = platform.machine()
+ platform_info.release = platform.release()
+ platform_info.system = platform.system()
+ platform_info.version = platform.version()
+ return platform_info
diff --git a/third_party/gpus/cuda/cuda_config.sh b/third_party/gpus/cuda/cuda_config.sh
index 21f36d4416..42cd254644 100755
--- a/third_party/gpus/cuda/cuda_config.sh
+++ b/third_party/gpus/cuda/cuda_config.sh
@@ -133,8 +133,10 @@ if test -e ${CUDNN_INSTALL_PATH}/cudnn.h; then
CUDNN_HEADER_PATH=${CUDNN_INSTALL_PATH}
elif test -e ${CUDNN_INSTALL_PATH}/include/cudnn.h; then
CUDNN_HEADER_PATH=${CUDNN_INSTALL_PATH}/include
+elif test -e /usr/include/cudnn.h; then
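+  # Some cuDNN packages (e.g. the Debian/Ubuntu .deb) install cudnn.h
+  # directly into /usr/include.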
+ CUDNN_HEADER_PATH=/usr/include
else
- CudnnError "cannot find cudnn.h under: ${CUDNN_INSTALL_PATH}"
+ CudnnError "cannot find cudnn.h under: ${CUDNN_INSTALL_PATH} or /usr/include"
fi
# Locate libcudnn.so.${$TF_CUDNN_VERSION}
diff --git a/util/python/python_config.sh b/util/python/python_config.sh
index 9515ffa24e..a5666c2f7e 100755
--- a/util/python/python_config.sh
+++ b/util/python/python_config.sh
@@ -80,9 +80,9 @@ function setup_python {
fi
done
- ln -s "${python_include}" util/python/python_include
- ln -s "${python_lib}" util/python/python_lib
- ln -s "${numpy_include}" third_party/py/numpy/numpy_include
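+  # -f replaces any symlink left behind by a previous run, so re-running
+  # ./configure does not fail with "File exists".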
+ ln -sf "${python_include}" util/python/python_include
+ ln -sf "${python_lib}" util/python/python_lib
+ ln -sf "${numpy_include}" third_party/py/numpy/numpy_include
# Write tools/bazel.rc
echo "# Autogenerated by configure: DO NOT EDIT" > tools/bazel.rc