aboutsummaryrefslogtreecommitdiffhomepage
diff options
context:
space:
mode:
authorGravatar Jonathan Hseu <jhseu@google.com>2017-06-09 10:37:18 -0700
committerGravatar TensorFlower Gardener <gardener@tensorflow.org>2017-06-09 10:41:00 -0700
commit1b5235fd897f7ea5cffc715300f67b4dc852fa27 (patch)
treee2e26931aff0ff4b10174a430f816b5d31a4ab4b
parent98eb5270e2f9b61408f04035c7edde66c21e3fa7 (diff)
Merge changes from github.
END_PUBLIC --- Commit f0e185d1f authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Better handle nodes with a variable number of outputs PiperOrigin-RevId: 158435028 --- Commit bc3e20807 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unused BUILD dependencies PiperOrigin-RevId: 158431059 --- Commit a0c80e4d5 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Delete unnecessary (mistakenly duplicated) logging message. PiperOrigin-RevId: 158428506 --- Commit b6ad1d747 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds DNN-only tests for DNNLinearCombinedClassifier. PiperOrigin-RevId: 158423119 --- Commit ddbb58034 authored by Shanqing Cai<cais@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unnecessary pylint disable PiperOrigin-RevId: 158416140 --- Commit fcaa724e2 authored by Luke Iwanski<luke@codeplay.com> Committed by gunan<gunan@google.com>: [OpenCL] Cleans pack and unpack ops (#10336) * [OpenCL] Cleans pack op * [OpenCL] Cleans unpack op --- Commit 2f53cacb2 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix a test failure of quantization_utils_test on ASAN PiperOrigin-RevId: 158414538 --- Commit 50b2f951c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 158413455 --- Commit 1e90b78e9 authored by Brennan Saeta<saeta@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add CacheDataset ops. Some input pipelines may pull down data from remote webservers or perform expensive processing. 
In order to avoid extraneous work, we now support caching the dataset (e.g. on disk). PiperOrigin-RevId: 158411901 --- Commit e16cd2ede authored by Taehoon Lee<taehoonlee@snu.ac.kr> Committed by gunan<gunan@google.com>: Fix typos (#10533) --- Commit 50d80ddf9 authored by Jonathan Hseu<jhseu@google.com> Committed by Jonathan Hseu<jhseu@google.com>: Fix fft_ops_test.py for CPU --- Commit d35cbbb44 authored by Mustafa Ispir<ispir@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add weight-column support to the heads. PiperOrigin-RevId: 158409180 --- Commit 7fb52cd54 authored by Justin Lebar<jlebar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Don't crash when displaying XLA metrics if they happen to be negative. PiperOrigin-RevId: 158407664 --- Commit 12a7a752a authored by Jianfei Wang<me@thinxer.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Add a tip for tf.train.LoggingTensorHook (#10237) `INFO` logs are not printed by default unless in IPython. Add a friendly tip for newcomers. 
--- Commit 216dcbf1e authored by Luke Iwanski<luke@codeplay.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: [OpenCL] Cleans reduction ops (#10340) * [OpenCL] Cleans reduction_ops_max.cc * [OpenCL] Cleans reduction_ops_mean.cc * [OpenCL] Cleans reduction_ops_min.cc * [OpenCL] Cleans reduction_ops_prod.cc * [OpenCL] Cleans reduction_ops_sum.cc --- Commit 2b351062a authored by Androbin<robin.richtsfeld@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Improve docs for selective registration headers (#10351) * Improve docs for selective registration headers progressing #10299 * Update print_selective_registration_header.py * Mention both flags -DSELECTIVE_REGISTRATION and -DSUPPORT_SELECTIVE_REGISTRATION --- Commit ee919510f authored by Yun Peng<pcloudy@google.com> Committed by gunan<gunan@google.com>: Re-enable some python tests in Windows Bazel build (#10526) --- Commit b0e881457 authored by Androbin<robin.richtsfeld@gmail.com> Committed by gunan<gunan@google.com>: [Bash] Declare and assign separately (#10509) As proposed by static analysis tool: https://github.com/koalaman/shellcheck/wiki/SC2155 --- Commit 284901b08 authored by Androbin<robin.richtsfeld@gmail.com> Committed by gunan<gunan@google.com>: [Bash] Remove unquoting quotes (#10506) As proposed by static analysis tool: https://github.com/koalaman/shellcheck/wiki/SC2027 --- Commit 2a1f11556 authored by ksellesk<zhengdachuan200305@gmail.com> Committed by ksellesk<zhengdachuan200305@gmail.com>: Fix AttributeError in resnet.py There is no function tf.softmax() in Tensorflow 1.x. When running the old code, Python interpreter complains: File "resnet.py", line 152, in res_net_model prediction, loss = res_net(x, y) File "resnet.py", line 148, in res_net return tf.softmax(logits), loss AttributeError: 'module' object has no attribute 'softmax' --- Commit 1d68f729b authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unneeded BUILD dependency PiperOrigin-RevId: 158391996 --- Commit 08ed32dbb authored by Yun Peng<pcloudy@google.com> Committed by gunan<gunan@google.com>: Windows: Make TensorFlow build without --cpu=x64_windows_msvc (#10466) * Windows: Make TensorFlow build without --cpu=x64_windows_msvc Since from Bazel 0.5.0, MSVC toolchain became the default toolchain on Windows. So --cpu=x64_windows_msvc is not required as long as we adjust the BUILD files in TensorFlow. --cpu=x64_windows_msvc is also supported for now, but is depracated. The configuration for cpu value x64_windows_msvc is a duplicate of x64_windows, which should be removed in the future. * Fix breakage on macOS --- Commit 02dbe153a authored by Androbin<robin.richtsfeld@gmail.com> Committed by gunan<gunan@google.com>: [Bash] Simplify Conditional (#10503) --- Commit c07bc581f authored by Androbin<robin.richtsfeld@gmail.com> Committed by gunan<gunan@google.com>: [Bash] Prefer read -a to split path (#10508) As proposed by static analysis tool: https://github.com/koalaman/shellcheck/wiki/SC2207 --- Commit 0a389674d authored by Androbin<robin.richtsfeld@gmail.com> Committed by gunan<gunan@google.com>: [Bash] Prefer [ p ] && [ q ] over [ p -a q ] (#10507) As proposed by static analysis tool: https://github.com/koalaman/shellcheck/wiki/SC2166 --- Commit 87a008ec3 authored by Jonathan Hseu<vomjom@vomjom.net> Committed by gunan<gunan@google.com>: Delete non-deterministic testEmpty() test (#10512) --- Commit 3a2971bd8 authored by Frank Chen<frankchn@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds the base for ClusterResolvers, a new way of communicating with and retrieving cluster information for running distributed TensorFlow. 
Implementations of this class would eventually allow users to simply point TensorFlow at a cluster management endpoint, and TensorFlow will automatically retrieve the host names/IPs and port numbers of TensorFlow workers from the cluster management service. PiperOrigin-RevId: 158358761 --- Commit 28b4e7f04 authored by Jonathan Hseu<vomjom@vomjom.net> Committed by gunan<gunan@google.com>: Disable stage_op_test and map_stage_op_test (#10516) --- Commit 390e57a75 authored by Yan (Asta) Li<yanastali@users.noreply.github.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: Check EIGEN_MAX_ALIGN_BYTES to prevent mod-by-0 (#10380) * Check EIGEN_MAX_ALIGN_BYTES to prevent mod-by-0 If EIGEN_MAX_ALIGN_BYTES is set to 0, alignment checks that mod by EIGEN_MAX_ALIGN_BYTES fail at runtime. * Returns true, as in tensorflow/core/framework/tensor.h * Update unit tests * Enable tests only if EIGEN_MAX_ALIGN_BYTES > 0 --- Commit cd5ac40b3 authored by Peter Hawkins<phawkins@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Update LLVM to upstream revision r304927. Add LLVM build rules for the LLVM AMDGPU backend, commented out by default. Fixes issue #10437. PiperOrigin-RevId: 158351480 --- Commit 91cb809bd authored by David Norman<DavidNorman@users.noreply.github.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: [XLA] Add ability to run the XLA unit tests against a different device (#9759) * Add ability to run the XLA unit tests against a different device * Allow for multiple extra backend devices * Correct merge error * Include options for additional tags --- Commit aff4d124b authored by Yuxin Wu<ppwwyyxxc@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Compare base_dtype instead of dtype in piecewise_constant (#10280) * Compare base_dtype instead of dtype in piecewise_constant Compare base_dtype instead of dtype in piecewise_constant. 
Fix #10086 * add unit test * Small lint fix and comment --- Commit 845539f98 authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add evaluation test for linear classifier (n==2 or n >2). PiperOrigin-RevId: 158340296 --- Commit 7c46214ab authored by Jonathan Hseu<vomjom@vomjom.net> Committed by GitHub<noreply@github.com>: Fix numpy 1.13 incompatibilities (#10501) * Fix numpy 1.13 incompatibilities * Skip tests with numpy 1.13.0 --- Commit 4572c41df authored by gunan<gunan@google.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: A few changes to kernel_tests. (#10502) * Disable reader_ops_test on windows. * Run buildifier on kernel_tests/BUILD * Mark map_stage_op_test as large. * Set the size of stage_op_test to large --- Commit 892293d98 authored by Brennan Saeta<saeta@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Set a default for datasets end_of_sequence. While all datasets carefully set the end_of_sequence to true at the appropriate time, some datasets might forget to set it to false in the normal case. In order to avoid potential undefined behavior, we set the end_of_sequence variable to be false by default. PiperOrigin-RevId: 158337799 --- Commit 187404eac authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Setup the env to since ops such as MatchFileOp rely on it. PiperOrigin-RevId: 158336344 --- Commit 2741561c8 authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix up vz_projector script structure We now make sure scripts and HTML imports are declared in the correct places. In the future, pedantically listing script tags should not be necessary. PiperOrigin-RevId: 158334306 --- Commit beeaade46 authored by Kay Zhu<kayzhu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Resubmit a reverted change. 
Original description: [XLA] Enable HloEvaluator for constant folding, also merged a few operations from hlo_constant_folding to hlo_evaluator. Additionally: - In ShapeUtil::ForEachIndex: * fix a bug where visitor is called when the shape has zero elements (e.g., F32{1,0}) * added test case for ForEachIndex. - In HloEvaluator: * Instead of copying and caching a Constant instruction, return the literal directly if the instruction is constant. * Fix an issue where TUPLE and OPAQUE primitives are not keyed in the templated typed_visitor. * Use (fixed) LiteralUtil::Populate to populate resulting literal, fixes the preexisting bug in the evaluator where R0 and shape with zero size dimensions are not handled. * Refactor ElementWiseUnaryOp and HandleCompare to be templatized on the operand's type. * Refactor IsFinite to be top level since it is only applicable to floats and the return type is always boolean. * Change from std::remainder to std::fmod for kRemainder to be compliant with existing XLA behavior. * Change from std::max and std::min to std::fmax and std::fmin to handle NaNs. * Minor comments fix. PiperOrigin-RevId: 158330052 --- Commit b94540e6f authored by Toby Boyd<tobyboyd@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: tf.layers.conv2d use_bias=True to use nn.bias_add PiperOrigin-RevId: 158326493 --- Commit 379aa9911 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 158325855 --- Commit 4e529f0f1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. 
PiperOrigin-RevId: 158325293 --- Commit 0a9d2dac0 authored by Yuefeng Zhou<yuefengz@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add a util function in virtual placer to return canonicalized device string, which can be used to fix the node's device field before passing them to the maxcut algorithm. PiperOrigin-RevId: 158322753 --- Commit 2d8da1d9b authored by Daniel Ylitalo<daniel@blodan.se> Committed by gunan<gunan@google.com>: Recognize CPU core count in FreeBSD (#10490) --- Commit c19e6cac0 authored by Peter Hawkins<phawkins@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [TF:XLA] Initial implementation of TensorArray ops. The XLA implementation of TensorArrays is more restrictive than regular TensorArrays: * XLA TensorArrays must have dynamic_size=False. * all elements in an XLA TensorArray must have the same shape. * writes always add their values to any existing values; neither reads nor writes ever issue errors. Out-of-bounds writes currently wrap. Refactor Variable handling in the TF/XLA bridge. Use a XlaVariable* to refer to variables inside compilation rather than a numerical ID. Allow for variables that don't correspond to variables known to the user. Also use XlaVariable to handle TensorArrays. PiperOrigin-RevId: 158322041 --- Commit b5e8d3086 authored by Peter Hawkins<phawkins@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [TF:XLA] Refactor randomized tests to allow testing of larger inputs without running out of memory. PiperOrigin-RevId: 158321431 --- Commit 5d90bbaac authored by Kay Zhu<kayzhu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Disable constant_folding in test base, so that intended test code paths would not be elided by constant_folding pass. 
PiperOrigin-RevId: 158317641 --- Commit 036ce8ba6 authored by Luke Iwanski<luke@codeplay.com> Committed by gunan<gunan@google.com>: [OpenCL] Cleans dense_update_ops (#10335) * [OpenCL] Cleans dense_update_ops * Acts on feedback from: #10335#discussion_r120536460 --- Commit 85f968125 authored by Luke Iwanski<luke@codeplay.com> Committed by gunan<gunan@google.com>: [OpenCL] Cleans cast operation (#10330) * [OpenCL] Removes not needed typedef for SYCLDevice * [OpenCL] Fixes formatting * [OpenCL] use SYCLDevice for int32 cast case --- Commit bff5e72da authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix typo. PiperOrigin-RevId: 158310742 --- Commit 38249d6be authored by Shanqing Cai<cais@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Swap the order of NanTensorHook and custom hooks to ensure that when the training encounteres NaN's in the loss function, user-supplied hooks such as tf_debug.LocalCLIDebugHook can still be used to debug the root cause of the numeric issues. PiperOrigin-RevId: 158310249 --- Commit 599727c65 authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Propagate debug option flags to hlo_test_base. Specific HLO tests have to replace the generic test_main target with a manual main() that invokes RUN_ALL_TESTS. To get access to a module with debug options set up, a new convenience method is created on HloTestBase. Initially algebraic_simplifier_test is modified as a canary; in a followup we'll convert all HLO tests to this approach. PiperOrigin-RevId: 158309488 --- Commit 0770393e9 authored by Eric Liu<ioeric@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [Tensorboard] Add a trace viewer component to TensorBoard. We make the trace viewer a separate app; otherwise, there would be dependency conflicts (e.g. Polymer) between the trace viewer app and the tensorboard app. 
The trace viewer app would be served by a plugin, and Tensorboard dashboard will integrate trace viewer app using iframe in the future. This CL also added "mominify" support for link import HTML tags in the tensorboard home-grown java vulnizer; otherwise, the vulcanized trace viewer code would crash the java vulcanizer. For open-source build, we add a denpendency on the Catapult github repository (https://github.com/catapult-project/catapult/tree/master/tracing). We use a bazel genrule to vulcanize a trace viewer binary which is then used in the tf-trace-viewer component. PiperOrigin-RevId: 158309408 --- Commit 85e832201 authored by RJ Ryan<rjryan@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Support unknown emit shapes in tf.nn.raw_rnn. PiperOrigin-RevId: 158308002 --- Commit edb5fed7f authored by Mustafa Ispir<ispir@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add label-vocab support to binary logistic head. Add assertion that binary classifier label is in range [0., 1.] Fixed Classifier Integration tests. PiperOrigin-RevId: 158307521 --- Commit f8e1cf8fa authored by Justine Tunney<jart@google.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Open up visibility of tf_imports (#10500) This also fixes the definition of Clutz. 
--- Commit 9fd7cf054 authored by Luke Iwanski<luke@codeplay.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: [OpenCL] Cleans relu ops (#10343) * [OpenCL] register relu ops to gpu types (no half) * [OpenCL] Removes #undef EIGEN_USE_SYCL --- Commit 09c1455e3 authored by Luke Iwanski<luke@codeplay.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: [OpenCL] Cleans reverse_op.cc (#10346) --- Commit b7892a30f authored by orome<royl@aldaron.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Clarify tf.matmul documentation (#10381) * Update math_ops.py * Fix non-ascii character --- Commit 9786b7062 authored by Luke Iwanski<luke@codeplay.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [OpenCL] Cleans StridedSlice Op (#10314) * [OpenCL] Cleans StridedSlice Op * [OpenCL] Removes half from registred types --- Commit f105df047 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: In the CUDA path of depthwise_conv2d, optimize backward filter convolution for images 2 or 4 times smaller than 16x16. Also initialize in_cols from blockDim, to fix the regression caused in CL 157906773. PiperOrigin-RevId: 158296136 --- Commit 492afc2e3 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 158295169 --- Commit abe0877ef authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add bazel version check to .configure PiperOrigin-RevId: 158294569 --- Commit b702e7e79 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 158294289 --- Commit 94085bee7 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Replace std::function object with regular function. The function is called recursively, and the std::function object had only existed to allow recursion from within a lambda expression. A regular function should be cheaper than a polymorphic function wrapper. PiperOrigin-RevId: 158292415 --- Commit ba656b261 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use template specialization instead of overloaded methods. This is a more appropriate tool here. NFC PiperOrigin-RevId: 158292035 --- Commit 55f987692 authored by Yutaka Leon<yleon@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make tf.contrib.lookup python functions use the kernels v2 that uses the resource tensor as handler. PiperOrigin-RevId: 158291836 --- Commit ebae3deba authored by Wei Ho<weiho@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Switch back to max_num_rows_to_load instead of reading slice by slice due to performance regression from network overhead. Add check when using initializing values to avoid seg fault PiperOrigin-RevId: 158291218 --- Commit 7b4c01794 authored by RJ Ryan<rjryan@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Support numpy-style padding and slicing of tf.spectral.rfft/irfft to match the desired FFT length. Fixes incorrect RFFT/IRFFT results when fft_length does not match the input dimension. PiperOrigin-RevId: 158289991 --- Commit fdb8e2935 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update iOS examples to use CocoaPods, and moved to tensorflow/examples/ios PiperOrigin-RevId: 158289285 --- Commit d86167b5f authored by Amit Patankar<amitpatankar@google.com> Committed by Amit Patankar<amitpatankar@google.com>: Merging rc2 back into master. 
--- Commit dffea202a authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Clean up some code after previous CL PiperOrigin-RevId: 158282834 --- Commit 7b5302af0 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds ability to set a "family" attribute in Tensorflow summaries, which controls the "tab name" of the summary that is displayed. This solution keeps using name_scope to keep names unique, but then prefixes the tag with the family name if provided. PiperOrigin-RevId: 158278922 --- Commit 611c82b5b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds integration test for DNNLinearCombined((Classifier)|(Regressor)). PiperOrigin-RevId: 158278512 --- Commit cc6c91a9a authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove a further unused proto header inclusion PiperOrigin-RevId: 158278026 --- Commit 9f17c26ca authored by Mark Heffernan<meheff@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add HloLocation to dataflow analysis. Add an HloLocation abstraction to dataflow analysis which indicates where (in the output of what instruction and at which index) an HloValue may appear. Previously only uses were stored with an HLO value where a use is an edge in the HLO graph (instruction, operand number and ShapeIndex). Also, change the handling of tuple-shaped kSelect instructions when ssa_form is true. Previously a phi value would be created. With this change the the value set instead contains the union of it's inputs identical to the ssa_form=false case. PiperOrigin-RevId: 158276598 --- Commit b9d5e1441 authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Start collecting flags for debug options in a single place. 
ClientLibraryTestBase will now parse command-line flags for debug options automatically, permitting subclasses to override certain options by using mutable_debug_options. main() still has to call AppendDebugOptionsFlags() explicitly before running the TF flag parser. In the mean-time, this CL leaves flag handling to the current "legacy" approach. However, this is part of a larger plan to move *all* debugging flags for XLA into the DebugOptions message and expose them as flags from a single place. The other flags (which are not controlling debugging options) will have to be propagated more explicitly. PiperOrigin-RevId: 158276294 --- Commit 3b6fe94bb authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Properly handle shape nodes that have a preexisting control dependency PiperOrigin-RevId: 158274845 --- Commit 1d67379d5 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Minor cleanup PiperOrigin-RevId: 158268933 --- Commit 41997756c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Sort header inclusions; define EIGEN_USE_THREADS where headers depend on it. PiperOrigin-RevId: 158267803 --- Commit 85355f015 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add missing header inclusion PiperOrigin-RevId: 158265934 --- Commit 3cf88d390 authored by Gunhan Gulsoy<gunan@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: When GPU is configured, do not require --config=cuda. Also fix indentation in configure. 
PiperOrigin-RevId: 158232959 --- Commit f48673b50 authored by Luke Iwanski<luke@codeplay.com> Committed by gunan<gunan@google.com>: [OpenCL] Removes ReductionFunctor for SYCLDevice (#10326) We are using Eigen implementation --- Commit 1b6453bec authored by Joan Puigcerver<joapuipe@gmail.com> Committed by gunan<gunan@google.com>: Fixes issue #10258 (#10366) On CUDA versions previous to 8.0, only __shared__ variables could be declared as static in the device code. --- Commit cd56a638d authored by Beomsu Kim<123bskim@naver.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Fixed wrong range in docstring (#10272) --- Commit d13ae380c authored by Michał Jastrzębski<michal.jastrzebski@intel.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Fix CMD in Dockerfile (#10444) Currently Notebook fails execution because default user for this container is root, and unless explicitly allowed, jupyter notebook will not start. --- Commit 8118ab4ec authored by Simon Perkins<simon.perkins@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Support partial gets in MapStagingArea (#10276) * Modify map staging area tests - size from `small` to `medium` - introduce 2 shards * Add partial get support in MapStagingArea A partial list of tensors in a (key, value) map entry can now be requested. Once all tensors associated with the entry are removed, it is removed from the map. * Correct output/indices mismatch errors * Rename IncompleteTuple to OptionalTuple * Add partial get test with indices * Add some more index checks * Improve stage test case graph creation Test sessions (and default graphs) are reused by default. Create explicit, finalized graphs in each test to prevent possible interactions between stateful Staging Areas and others ops created in separate tests. * Make staging area tests small and remove shards They were originally made 'medium' to ameliorate timeouts in the test case, but they usually run in ~1s so they should be small. 
* Improve imports Avoid importing base tensorflow package * Support both python 2 and python 3 range. * Set map_stage_op_test to size=large * Convert the tests to size=medium --- Commit 0df102b0a authored by Androbin<robin.richtsfeld@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Update `configure` script sample (#10455) The `configure` script was changed regularly since the generation of the sample. This PR updates the sample to reflect those changes. --- Commit f6dc1ac61 authored by Earthson Lu<Earthson.Lu@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: MKL_INSTALL_PATH should not be ignore when given (#10180) * MKL_INSTALL_PATH should not be clear when given * fix overwrite by default --- Commit 8ad6a036e authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Java: Update Maven release to 1.2.0-rc2 PiperOrigin-RevId: 158212897 --- Commit 15eddf035 authored by Fritz Obermeyer<fritz.obermeyer@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Export C API symbols in _pywrap_tensorflow_internal.so (#10469) * Export C API symbols * Export C API symbols under config:default --- Commit 754e12668 authored by Luke Iwanski<luke@codeplay.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: [OpenCL] Removes half concat op registration (#10331) --- Commit cfdc22dee authored by Peng Yu<yupbank@users.noreply.github.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: fix the error (#10293) --- Commit 58747e357 authored by Joel Hestness<jthestness@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: PhiloxRandom: Fix race in GPU fill function (#10298) * PhiloxRandom: Fix race in GPU fill function The PhiloxRandom fill kernel for the GPU had race conditions that caused the outputs to be non-deterministic. In particular, the code previously executed with N GPU threads (# thread contexts per GPU), but it would only advance the fill addresses by N-1 stride in each step. 
This incorrect stride caused the 0th and N-1st threads to write to the same memory locations, racing for which was last to write their common locations. Make the stride equal to the number of threads to eliminate the race. BONUS: By fixing this race, PhiloxRandom constant-sized GPU initializers now match CPU initializers. * Update random_ops_test.py to find race conditions Increasing the size of arrays in the random_ops_test.py test to manifest the race conditions to be resolved. --- Commit 2cbcda08f authored by Androbin<robin.richtsfeld@gmail.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Fixed formatting in Linux install guide (#10353) Formatting issues were introduced in PR #8825, commit f30918b3694afe844990cbddc82e27e023d88856 --- Commit ab5f38560 authored by Lakshay Garg<lakshayg@outlook.in> Committed by Jonathan Hseu<vomjom@vomjom.net>: Fixed typos in documentation & READMEs (#10365) --- Commit 94dc1dbfa authored by Christos Nikolaou<cNikolaou@users.noreply.github.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Enable figures in the tfprof README.md (#10372) --- Commit 3018d4678 authored by Taehoon Lee<taehoonlee@snu.ac.kr> Committed by Jonathan Hseu<vomjom@vomjom.net>: Fix typos (#10386) --- Commit c5f3c6171 authored by Daniel Rasmussen<drasmuss@users.noreply.github.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Fix unbatch for Datasets with multiple elements (#10401) * Fix unbatch for datasets with multiple elements * fixup! pylint (indent two spaces instead of four) --- Commit 8b065bc10 authored by Yong Tang<yong.tang.github@outlook.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: Fix unaligned args in api_docs/python/tf/contrib/learn/Evaluable (#10423) This commit fixes unaligned args in api_docs/python/tf/contrib/learn/Evaluable Signed-off-by: Yong Tang <yong.tang.github@outlook.com> --- Commit 8f89b654f authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Profile memory usage in VirtualScheduler and report peak memory usage. To do so, NodeState now handles different output ports of a node (in case a node has multiple outputs). Also, VirtualScheduler code is cleaned up with more comments. PiperOrigin-RevId: 158209068 --- Commit 0ea0bf5aa authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add a frontend for viewing the first ops that exhibit bad values (NaN, +/- Inf). This helps the user identify problematic ops. Also moved the debugger data logic within tf-graph-info into a new tf-graph-debugger-data-card component. PiperOrigin-RevId: 158208679 --- Commit ed47ecf2d authored by Luke Iwanski<luke@codeplay.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [OpenCL] Cleans variable op (#10333) * [OpenCL] Cleans variable op * Fixes formatting and float / double -> GPU_NUMBER_TYPES_NO_HALF --- Commit 9b2c1af63 authored by Luke Iwanski<luke@codeplay.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [OpenCL] Improves device reporting (#10462) Prints: id, type, name, vendor and profile of the device --- Commit 7f5384dcc authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Making load() work for resource variables. PiperOrigin-RevId: 158205361 --- Commit 05412bd36 authored by Mark Heffernan<meheff@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Simplify Shape traversal visitors. Simplify shape traversal visitors in ShapeUtil and ShapeTree. Add a non-Status form because most uses of the traversal methods do not use it, and remove is_leaf parameter from ShapeTree.ForEach* as it is not frequently used. 
PiperOrigin-RevId: 158201574 --- Commit 69c9365b4 authored by Mustafa Ispir<ispir@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Extracted linear estimator testing utils to be reused by dnn-linear-combined. Added tests for linear part of dnn-linear-combined estimator. PiperOrigin-RevId: 158200827 --- Commit 65ce8c723 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add arrowheads to dataflow edges. Make reference edges orange. Remove animations from tooltips in the graph documentation. Previously, arrowheads were only added to reference edges (because we assumed users knew about the convention that arrowless edges flow upwards). That decision nicely reduces clutter. However, recently, some internal and external folks have expressed confusion, and so I want to try adding arrowheads to all data flow edges. And make the reference edges starkly different. See #10428 PiperOrigin-RevId: 158195388 --- Commit bf4c3dd6b authored by gunan<gunan@google.com> Committed by GitHub<noreply@github.com>: Revert "Fix patching issue on Windows" (#10472) This reverts commit 47e6785646a1266f01a1a570bd799f8518ee2997. --- Commit b49515539 authored by David Soergel<soergel@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add only string constants to ASSET_FILEPATHS collection. PiperOrigin-RevId: 158192152 --- Commit 51acad09c authored by Sergio Guadarrama<sguada@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add tests with different delta to huber_loss. PiperOrigin-RevId: 158191361 --- Commit a4e7b7add authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixes a bug in setting default optimizers for DNNLinearCombinedClassifier. 
PiperOrigin-RevId: 158190192 --- Commit ddd67e333 authored by Luke Iwanski<luke@codeplay.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: [OpenCL] Cleans reshape.cc (#10347) * [OpenCL] Cleans reshape.cc * Removes half and complex numbers. Half is extension and complex numbers needs implementation in Eigen first --- Commit 3ca653304 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 158186454 --- Commit 8cda8660e authored by Luke Iwanski<luke@codeplay.com> Committed by gunan<gunan@google.com>: [OpenCL] Cleans sendrecv_ops.cc (#10345) --- Commit 6915bb919 authored by Luke Iwanski<luke@codeplay.com> Committed by gunan<gunan@google.com>: [OpenCL] Cleans Slice op (#10341) --- Commit 54998b45d authored by Michele Colombo<m-colombo@users.noreply.github.com> Committed by Jonathan Hseu<vomjom@vomjom.net>: BasicRNNCell comment fix (#10467) --- Commit df5906fb7 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Mark saver/restore ops that depend on filesystem as stateful to disable them from being folded into a constant by graph optimizer. PiperOrigin-RevId: 158182282 --- Commit 96cb4d182 authored by Sergio Guadarrama<sguada@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add support of scale_l1 == 0. or scale_l2 == 0 to l1_l2_regularizer. Added tests. PiperOrigin-RevId: 158179790 --- Commit b65eb3f9b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Speed up atrous_convolution_test by combining evaluations. To make this test run faster (and prevent it from timing out under certain circumstances), this change combines all evaluations for each test method into a single call to Session.run, to eliminate overhead. This reduces the test time from about 40 seconds to 10 seconds. 
RELNOTES: n/a PiperOrigin-RevId: 158175227 --- Commit b440abce7 authored by Gao, Xiang<qasdfgtyuiop@gmail.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: add Cuda{2D,3D}LaunchConfig that maximizes occupancy (#10032) * add Cuda{2D,3D}LaunchConfig that max occupancy * remove default val, check input<=0 * add max size check * fix typo * tests, docs, and related changes * build the test * buildify * cudaOccupancy... call check success, and style fix --- Commit 81cf61fdb authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Initialize tensor in graph_properties_test, to avoid msan complaint. PiperOrigin-RevId: 158169374 --- Commit cabc5c35c authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add xla_disable_hlo_passes to DebugOptions Also add a SetDebugOptions method to ClientLibraryTestBas; this lets us set debug options in tests by calling it. As an example, this CL removes the current way of passing xla_disable_hlo_passes programmatically in tests - it used to employ a special constructor parameter which is no longer required. PiperOrigin-RevId: 158169006 --- Commit 187d23337 authored by Luke Iwanski<luke@codeplay.com> Committed by gunan<gunan@google.com>: [OpenCL] Cleans Pad op (#10339) --- Commit e8bc38ef6 authored by gunan<gunan@google.com> Committed by GitHub<noreply@github.com>: Fix test failures on windows. (#10470) --- Commit 2b3535c64 authored by David Soergel<soergel@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Minor docstring fix for build_parsing_serving_input_receiver_fn PiperOrigin-RevId: 158163615 --- Commit e55f2e036 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Propagates constants through switch nodes. 
PiperOrigin-RevId: 158163537 --- Commit b01d4b905 authored by Jacques Pienaar<jpienaar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Remove outdated todo. PiperOrigin-RevId: 158161411 --- Commit 7125733d7 authored by William Chargin<wchargin@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Create a set of sample data for the audio plugin This implements a simple tone generator, with sine waves, square waves, and triangle waves, plus two simple combinations of sine waves. The step value is used to control the frequency. PiperOrigin-RevId: 158160889 --- Commit dc81a2420 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Updates to the WALSMatrixFactorization estimator: - Add a completed_sweeps variable to keep track of sweeps that have been completed during training. - Add a StopAtSweepHook, which can request a stop after completing a specified number of sweeps. PiperOrigin-RevId: 158156347 --- Commit 74220616c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Set device cores and frequency in op_level_cost_estimator_test, to avoid asan error about assigning inf to int64 (this comes in from a divide-by-0). PiperOrigin-RevId: 158155488 --- Commit 47e678564 authored by Yun Peng<pcloudy@google.com> Committed by gunan<gunan@google.com>: Fix patching issue on Windows (#10452) --- Commit 6d54f09d9 authored by Yun Peng<pcloudy@google.com> Committed by gunan<gunan@google.com>: Fix linking errors of lmdb on Windows (#10457) --- Commit 61c8a745b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Minor cleanup: Add braces around if statement arms; remove redundant "return" and "static". PiperOrigin-RevId: 158143418 --- Commit e9a889c5e authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Pass int parameter by value, not by const reference PiperOrigin-RevId: 158142102 --- Commit 9184726ed authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Avoid unnecessary copying of map data during visitation PiperOrigin-RevId: 158141962 --- Commit 2e7e1d57b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Small fix for how std::move is used in constructors PiperOrigin-RevId: 158141564 --- Commit 2a61c1652 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: In cpu compiler's CompileAheadOfTime, pass ordering when compiling entry computation. PiperOrigin-RevId: 158140349 --- Commit f3f53e8b3 authored by Derek Murray<mrry@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf.contrib.data] Add support for dicts and remove lists from nested structures. This changes the behavior of constructors like `tf.contrib.data.Dataset.from_tensors()` when passed a list. Previously, the `nest` utility would recurse into each element of such a list and create a separate Dataset component. Now the list will be converted to a tensor, allowing code like: ```python dataset = tf.contrib.data.Dataset.from_tensor_slices(([1, 2, 3], [4, 5, 6])) ``` ...to define a dataset with two components (each of shape `()`). This change also adds support for dictionaries as nested structures, which simplifies integration with dictionary-returning ops like `tf.parse_example()`. Fixes #10151. RELNOTES: Breaking change to `tf.contrib.data.Dataset` APIs that expect a nested structure. Lists are now converted to tf.Tensor implicitly. You may need to change uses of lists to tuples in existing code. In addition, dicts are now supported as a nested structure. 
PiperOrigin-RevId: 158139467 --- Commit b6a8848c1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Enabling python configuration to use a remotely generated configuration that is located inside of the org_tensorflow repo (previously it *had* to be a remote repo declared in workspace file). PiperOrigin-RevId: 158138601 --- Commit 0fe0bfcc3 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unused protobuf header inclusions PiperOrigin-RevId: 158120864 --- Commit f0c4c6c3f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: In the CUDA path of depthwise_conv2d, add a fast NCHW backward filter convolution for images smaller than 16x16. PiperOrigin-RevId: 158111294 --- Commit 8dcf37b47 authored by Jon Malmaud<malmaud@gmail.com> Committed by gunan<gunan@google.com>: Fix typo (#10379) --- Commit 3039d7da2 authored by Androbin<robin.richtsfeld@gmail.com> Committed by gunan<gunan@google.com>: Remove "bazel clean" (#10318) Reverting #8880 (see #10236) unnecessary since bazelbuild/bazel#2759 was merged --- Commit ae1c16ae8 authored by Yifei Feng<fengyifei2026@gmail.com> Committed by gunan<gunan@google.com>: Update docker to cudnn6. (#10307) * Update docker to cudnn6. * Update Dockerfile.gpu * Add --expunge to bazel clean to make cuda_configure run again and update TF_CUDNN_VERSION. * Remove expunge and set CUDA and CUDNN version default in configure. * Update configure * Only set --action_env once * Update prints for default version. --- Commit 232e9d86d authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: tf_workspace() claims that the tf_repo_name argument is unused. temp_workaround_http_archive still requires it. This change silences the spurious message. 
PiperOrigin-RevId: 158089834 --- Commit cc1a02d37 authored by Francois Chollet<fchollet@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add fp16 support to convolutional layers that support it. PiperOrigin-RevId: 158086284 --- Commit 7d3fbba48 authored by Mustafa Ispir<ispir@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Extracted dnn estimator testing utils to be reused by dnn-linear-combined. Added tests for dnn part of dnn-linear-combined estimator. PiperOrigin-RevId: 158084898 --- Commit 9d12c629c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Refactor the document and some polishment PiperOrigin-RevId: 158083952 --- Commit 134138299 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Corrected comment: import_scoped_metagraph does not return a Saver. PiperOrigin-RevId: 158082288 --- Commit a58553e4d authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add function in shape inference to try to infer output tensor content based on the input shapes of the op. In some cases (E.g: shape), knowing the shapes of the input is all that is necessary to infer the content of the output tensor. This improves shape inference. PiperOrigin-RevId: 158079306 --- Commit 0cc851c08 authored by Yuefeng Zhou<yuefengz@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Call maxcut algorithm in the model_based_cost_estimator. PiperOrigin-RevId: 158078511 --- Commit 7d76a90be authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add question marks next to items in the graph legend. PiperOrigin-RevId: 158076005 --- Commit 68fdb7628 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add DNNLinearCombinedClassifier. PiperOrigin-RevId: 158075939 --- Commit 3d52e4cb9 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix create_meta_graph to respect an empty collection_list. PiperOrigin-RevId: 158073112 --- Commit 54ccc3e5a authored by Mark Heffernan<meheff@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add module-scoped HLO dataflow analysis. This is the first step to replacing TuplePointsToAnalysis with a global, module-scoped analysis. This dataflow analysis identifies all values and their defs and uses in the XLA graph. The analysis is currently unused. Follow up CLs will add buffer alias analysis using this dataflow analysis, and incrementally switch the transformation passes (for example, CopyInsertion) to use these new module-scoped analyses. PiperOrigin-RevId: 158067910 --- Commit 93c57c6e4 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Handle control flow logic properly: * Don't fold enter/exit nodes since that can interact badly with frames * Create proper control dependencies on switch nodes PiperOrigin-RevId: 158066691 --- Commit 9e6899720 authored by Jingyue Wu<jingyue@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [SE] Add cudnnTransformTensor to StreamExecutor. PiperOrigin-RevId: 158062553 --- Commit 827874c30 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: In the CUDA path of depthwise_conv2d, add a fast NCHW backward input convolution for images smaller than 16x16. PiperOrigin-RevId: 158061669 --- Commit bee26215c authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Speed up multinomial_op on CPU by using a vectorized Eigen expression and avoiding unnecessary casts. Benchmark with AVX+FMA enabled: Run on <redacted> (12 X 3492 MHz CPUs); 2017-06-05T12:54:07.881672447-07:00 CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB Benchmark Base (ns) New (ns) Improvement ------------------------------------------------------------------ BM_Multinomial_cpu_1_10000_4 250817 172953 +31.0% BM_Multinomial_cpu_1_10000_128 273834 187552 +31.5% BM_Multinomial_cpu_1_10000_10000 1174175 1130778 +3.7% BM_Multinomial_cpu_1_100000_4 2040741 1276761 +37.4% BM_Multinomial_cpu_32_10000_4 10221765 4498666 +56.0% BM_Multinomial_cpu_32_10000_128 10638159 4994754 +53.0% BM_Multinomial_cpu_32_100000_4 100790019 44193314 +56.2% BM_Multinomial_cpu_128_100000_1 431269640 182506078 +57.7% PiperOrigin-RevId: 158061480 --- Commit 515b3ac67 authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add Clutz to TensorBoard build This is so we can get JavaScript protobufs. This CL also improves the web_aspect and makes some peculiar Closure Compiler errors go away relating to externs. PiperOrigin-RevId: 158061198 --- Commit 0df6760fe authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Added a test to make sure that graph properties for variables are properly reported PiperOrigin-RevId: 158053084 --- Commit 2ccfe8e76 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Added a new method to extract the graph properties from a cost graph without having to run the model. 
This will simplify the process of creating regression tests PiperOrigin-RevId: 158050327 --- Commit 27f1b80c2 authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixes memory leak in py_func when functions return unwrapped strings. PiperOrigin-RevId: 158046530 --- Commit cf238e1f2 authored by Eugene Brevdo<ebrevdo@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix memory leak in python caused by @tf_should_use. The issue is that python's GC has trouble collecting objects with __del__ methods. The solution is two pronged: * Keep track of usage state outside of the class, via a dict mapping id(object) => state * Remove __del__ (this was the source: python's GC couldn't collect wrapped objects), and instead use weakref.finalize to emit warnings just as the object is being garbage collected. * Added tests for garbage collection [they were failing before I fixed the issue] PiperOrigin-RevId: 158042388 --- Commit e6f581863 authored by Bo Wang<david.b.wang@gmail.com> Committed by Rasmus Munk Larsen<rmlarsen@google.com>: New reader for LMDB databases (#9950) * Add LMDBReader op and test case * Add testcase to load LMDB from a folder * Add tensorflow/core/lib/lmdb/testdata/data.mdb * Add EOF test * Add license export * Blacklist the test data in pip_smoke_test.py * Address issues with respect to review * Add LICENSE to BUILD rules * Remove the prefix of LICENSE * Wrap key with compat.as_bytes() * Fixed a compilation flag * Improve BUILD rules * Support LMDB build in cmake * Fix BUILD file format with buildifier * Add fake unistd.h for lmdb to build on Windows * Avoid building lmdb tools which depend on unistd.h * Fix the string encoding issue in Python3 * Update lmdb library name in CMakeList.txt --- Commit cc411f938 authored by Yao Zhang<yaozhang@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: When converting the layout of Conv2DBackpropInput, we need to permute one 
of its inputs, which is a constant node. We permute a copy of this node, instead of the original node, because the original node may be used as input to other nodes. This kind of sharing of const node could arise if the graph is pre-optimized by common subexpression elimination, which is part of the L1 optimizations in TensorFlow. PiperOrigin-RevId: 158037552 --- Commit 88bdb6fca authored by Dandelion Man?<dandelion@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove all remaining references to non-public TF modules from TensorBoard. I deleted the PluginAssetUtil tests because that code is deprecated. I'll later add manual testing for backcompat in the text plugin. PiperOrigin-RevId: 158037466 --- Commit 6c531eb2f authored by Francois Chollet<fchollet@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add file hash to Keras Boston Housing dataset to force cache update. PiperOrigin-RevId: 158036587 --- Commit afdc38cd3 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove deprecated resource handle functions in InferenceContext. PiperOrigin-RevId: 158034419 --- Commit 9f932e6ce authored by Derek Murray<mrry@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Avoid parsing a rendezvous key for Send/Recv ops outside a loop. For such ops, the rendezvous key will be constant, because `ctx->frame_iter()` will always evaluate to `{0, 0}`. Benchmarking reveals that this can save between 1 and 2 microseconds per Send or Recv op execution. The optimization applies to all cross-process, inter-device, and intra-device (host-to/from-device memory) Send/Recv ops. 
PiperOrigin-RevId: 158032522 --- Commit cc2dd4ac8 authored by Shanqing Cai<cais@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: tfdbg: dump debug data from different devices in separate directories Fixes: #7051 wherein TFDBG failed to load the data dump from a Session.run() involving multiple GPUs. The root cause of the bug was that TFDBG previously assumed that node names are unique across all partition graphs. This is however not the case when multiple GPUs exist. The Send/Recv nodes in the partition graphs of the GPUs can have duplicate names. There will potentially be other cases like this in the future due to other reasons (e.g., distributed sessions and/or graph optimization). This CL relaxes this assumption, by dumping the GraphDef and tensor data from different devices into different sub-directories under the dump root directory. PiperOrigin-RevId: 158029814 --- Commit a5909d643 authored by Toby Boyd<tobyboyd@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixed triggering create device multiple times PiperOrigin-RevId: 158025196 --- Commit 504a307b7 authored by Martin Wicke<wicke@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make sure that Adam colocates ops with a consistent variable across workers. PiperOrigin-RevId: 158022292 --- Commit 69ba4d3d4 authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix #10371 cpuinfo.get_cpu_info() doesn't seem to include the l2_cache_size key on some architectures. PiperOrigin-RevId: 158021008 --- Commit a51a9846c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Performance-related tweaks: Don't copy loop variables; remove ineffective std::move casts. 
PiperOrigin-RevId: 158017670 --- Commit 009789f74 authored by Peter Hawkins<phawkins@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Allow 0-sized slices in DynamicSlice and DynamicUpdateSlice; add tests. PiperOrigin-RevId: 158015870 --- Commit 48a4853eb authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Miscellaneous cleanups PiperOrigin-RevId: 158012131 --- Commit 379ddde24 authored by Chris Song<sjhshy@gmail.com> Committed by Chris Song<sjhshy@gmail.com>: Fix misspells. --- Commit a0a76da97 authored by Lakshay Garg<lakshay.garg.1996@gmail.com> Committed by Lakshay Garg<lakshay.garg.1996@gmail.com>: Fixed typo in code --- Commit 7ffc35732 authored by Eugene Brevdo<ebrevdo@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add support for bools in matrix_diag, matrix_diag_part, matrix_set_diag, matrix_band_part. PiperOrigin-RevId: 157939272 --- Commit edf3d5dbe authored by Darren Garvey<darren.garvey@gmail.com> Committed by Darren Garvey<darren.garvey@gmail.com>: configure: Fix default path when enabling MPI. Correct showing what the default path is when mpi is installed. --- Commit aad2e3daf authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: In the CUDA path of depthwise_conv2d, add a fast NCHW forward convolution for images smaller than 16x16. PiperOrigin-RevId: 157915637 --- Commit 5cf08d9cb authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Drop blockDim.y for the equivalent in_cols, and slightly improve naming (use 'pixels' instead of 'size' for height*width numbers). PiperOrigin-RevId: 157906773 --- Commit 563f05ff6 authored by Eugene Brevdo<ebrevdo@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf contrib seq2seq] Expand tile_batch to handle nested structures. 
This allows it to properly tile the initial wrapper state when using BeamSearchDecoder with AttentionWrapper. Unit tests updated to show this use. PiperOrigin-RevId: 157903115 --- Commit 1234e2dda authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix Plottable definition On Mac OS the build directory in the Node package conflicts with BUILD. PiperOrigin-RevId: 157899970 --- Commit bb7a8d8e7 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Don't use the _output_shape attribute in the op_level_cost_estimator since there is no guarantee that it will be present or accurate. PiperOrigin-RevId: 157898989 --- Commit 6f4204c3d authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix TensorBoard SHA256 in cmake PiperOrigin-RevId: 157897958 --- Commit c9d2f432b authored by Justine Tunney<jart@google.com> Committed by Justine Tunney<jart@google.com>: Fix TensorBoard SHA256 in cmake --- Commit 1c70fb686 authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add training test for multi classes (n>2) linear classifier. PiperOrigin-RevId: 157896002 --- Commit 675d36be0 authored by Yao Zhang<yaozhang@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add fused batch norm to tf.layers. PiperOrigin-RevId: 157893874 --- Commit f37d0ea47 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Internal change -- first draft docs PiperOrigin-RevId: 157891937 --- Commit 9b8f6113b authored by Zongheng Yang<zongheng@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: tensor_bundle: fix that the read path forgets to cache file handles. In a case where a reader is geographically far from the file, this change achieves a speedup of end-to-end checkpoint restore by 5.8x. 
PiperOrigin-RevId: 157889659 --- Commit 0c92dada6 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use inplace Cholesky factorization and solves to speed up and reduce memory usage in matrix_solve_ls. Check success before copying outputs in cholesky_op. PiperOrigin-RevId: 157887564 --- Commit a4caeb2ea authored by William Chargin<wchargin@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Extract the graphs dashboard to a plugin This completes the great plugin migration! The graphs plugin is somewhat different from the plugins considered so far. First, it exposes two kinds of data: graph data and run metadata. We elect to put both sources of data under the domain of the graphs plugin for now, because it's not clear that the run metadata would be useful for anything else. Second, the graph data really has no use for "tags": a run either has an associated graph or it does not. Thus, we expose an endpoint /data/plugin/graphs/runs that is different in format from the /tags routes exposed by other plugins (it returns just a list instead of a run-to-tag mapping). This change removes a bunch of tests from application_test.py. The tests cover the compression behavior of the graph endpoint, but the graph endpoint doesn't have any special logic in the way of compression. Thus, the tests are, apparently, testing that werkzeug (or whatever is relevant here) provides good compression defaults. This isn't necessarily a bad idea, but it shouldn't be coupled to the graph tests. 
To get test data that includes run metadata, you can run this script: https://raw.githubusercontent.com/tensorflow/tensorflow/326942394e69074d50d5889218a24c9371eff259/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py PiperOrigin-RevId: 157884714 --- Commit 05a6a13f7 authored by Gunhan Gulsoy<gunan@google.com> Committed by gunan<gunan@google.com>: Make sure all writer caches are closed before deleting directories in dnn_test. --- Commit d0e761f8d authored by Gunhan Gulsoy<gunan@google.com> Committed by gunan<gunan@google.com>: Disable another test that uses matrix_set_diag on windows. --- Commit 8939b8562 authored by Derek Murray<mrry@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf.contrib.data] Re-implement IteratorGetNext as an AsyncOpKernel. This prevents the op from consuming an inter-op thread pool thread when blocked, and fixes a potential deadlock when many IteratorGetNext ops are blocked. Fixes #10369. PiperOrigin-RevId: 157878885 --- Commit 9e25c68ad authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add loss_only_head to hold additional loss terms for multi_head setup PiperOrigin-RevId: 157875934 --- Commit 7cdcd0cca authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Filter more op types that don't benefit from constant folding. PiperOrigin-RevId: 157875168 --- Commit 366990d92 authored by Kay Zhu<kayzhu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Fix a subtle issue in copy_insertion due the interaction between copy overriding logic and RecordIndicesToColocatingBuffers: - When building instructions ShapeTree to be copy overriden, it is possible that we create a single kCopy for two identical instructions. 
An example can be: %tuple.19 = tuple(%constant.4, %constant.1793, %constant.1793) where it is used in a while.init operand, and constant.1793 is read-only within the loop and also used by another while loop. The copy overriding pass will then create the following (logical, not finalized) tuple: %tuple.19 = tuple(%constant.4, %copy.5, %copy.5) - In the subsequent pass RecordAmbiguousOrNonDistinctIndices, to add copies to ensure point_to set is distinct, the duplicate %copy.5 are ignored because they are not yet finalized, and these indices (1 and 2 in the example) are still marked as to-be copied. Therefore distinctiveness is lost. This fix applies to the override building stage, to explicitly avoid creating shared copies for non-distinct buffers. PiperOrigin-RevId: 157872231 --- Commit f4b8d21b8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Change function parameters to references to avoid copying, or otherwise move from function parameters when moving reduces the amount of copying. PiperOrigin-RevId: 157867333 --- Commit 3eee61caa authored by Drew Hintz<pushespretn@gmail.com> Committed by GitHub<noreply@github.com>: fix quotes in example code from ? to " --- Commit 4905c0eae authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove TODO - the new tolerance is okay to keep. PiperOrigin-RevId: 157861020 --- Commit 55f6b6ff1 authored by David Soergel<soergel@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add explicit SparseTensor support to SignatureDef. PiperOrigin-RevId: 157860466 --- Commit 79099d677 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Removes default thresholds from BinaryLogisticHead and adds predict and evaluate tests for DNNClassifier. 
PiperOrigin-RevId: 157856471 --- Commit 54595f0f3 authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds the training test for LinearClassifier with n_classes=2. PiperOrigin-RevId: 157855473 --- Commit cd6c02985 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add 'streaming_curve_points' metric which returns curve [ROC, PR] approximation at specified number of points. PiperOrigin-RevId: 157851535 --- Commit 0f2db7391 authored by Peter Hawkins<phawkins@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [TF:XLA] Split union-find implementation in mark_for_compilation_pass.cc into a separate library, make it more generic. PiperOrigin-RevId: 157850985 --- Commit d5421cf58 authored by Justin Lebar<jlebar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add additional concat test. PiperOrigin-RevId: 157844113 --- Commit f661128db authored by Geoffrey Irving<geoffreyi@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unused overloads of SummarizeGraphDef and EqualGraphDef PiperOrigin-RevId: 157843404 --- Commit a56d59a84 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Set flow to a value during TensorArray creation, Re-enable tensor_array_ops_test in msan. PiperOrigin-RevId: 157841785 --- Commit edcc5cc13 authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add manual test runner for vz_sorting PiperOrigin-RevId: 157841098 --- Commit 3f6404f20 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Assign a max height of 800px to images in the image dashboard. The user could always expand to actual dimensions if need be. PiperOrigin-RevId: 157838046 --- Commit c6ea6972a authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove debugging LOG(INFO) from previous change. PiperOrigin-RevId: 157837305 --- Commit 07d39f28e authored by freedom" Koan-Sin Tan<koansin.tan@gmail.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: make gcc-5 on Ubuntu 16.04 happy (#10385) gcc-5 complains of ambiguity and refuses to go when doing something like 'bazel build -c opt tensorflow/...' --- Commit ac66be783 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Minor cleanup: Remove unused BUILD dependencies and unnecessary code. PiperOrigin-RevId: 157837211 --- Commit 4161ccc8e authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adjust tolerance on dirichlet_multinomial test. PiperOrigin-RevId: 157834660 --- Commit 43c0f52f1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix off-by-one error in BoolVector(begin, end) constructor. PiperOrigin-RevId: 157833086 --- Commit 419d437ba authored by Lakshay Garg<lakshay.garg.1996@gmail.com> Committed by Lakshay Garg<lakshay.garg.1996@gmail.com>: Fixed typo in code comment --- Commit 07710014d authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix device colocation for KMeans in case of multiple parameter servers. PiperOrigin-RevId: 157795360 --- Commit b659bc39f authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Simplify TensorBoard build - Remove tensorboard_typescript_genrule - Remove tensorboard_typescript_bundle - Introduce ts_web_library Skylark rule which supports seamless TypeScript compilation. - Use Closure Compiler in semi-advanced mode to compile JavaScript. 
This is done in a way that preserves <script> tag placement, which causes pages to load faster and avoid FOUC, thereby making it a better solution than the existing vulcanize. PiperOrigin-RevId: 157794795 --- Commit 0503ce09c authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Wipe out previous shape inference result when importing a grappler item Run graph optimizations last: since they can be expensive it's best to filter invalid items first. PiperOrigin-RevId: 157792834 --- Commit 9ae941c4a authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Turn reductions along an empty set of dimensions into identity nodes. PiperOrigin-RevId: 157792209 --- Commit 69075f354 authored by Yangzihao Wang<yangzihao@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add functional support for cudnnConvolutionBiasActivationForward(). PiperOrigin-RevId: 157788425 --- Commit 7d7a40309 authored by William Chargin<wchargin@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Extract the distributions dashboard to a plugin This continues the great plugin migration. The distributions plugin was similar to the histograms plugin, but it also purported to allow CSV download like the scalars plugin. However, the existing implementation of this was flawed, and would always yield a 500 on current prod [1] (unless there were actually no data). This indicates that no one is actually using it---probably because there isn't a relevant button on the frontend, anyway!---so I just removed it. This also changes most frontend occurrences of "compressedHistograms" to "distributions" while we're at it. [1]: Due to the reference `value.rank_in_bps` in the handler `_serve_compressed_histograms`; this field does not exist and throws an `AttributeError`. 
PiperOrigin-RevId: 157787156 --- Commit 23cdf96b8 authored by Brennan Saeta<saeta@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Re-enable session_test.py A number of CL's have split up session_test.py to be a bit smaller. As a result, this CL will re-enable the session_test to see if it remains flaky. PiperOrigin-RevId: 157786407 --- Commit d741d81c5 authored by Dandelion Mané<dandelion@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Expose tf.test.StubOutForTesting in the tf testing api Also redirect TensorBoard usage to use that endpoint. This is part of my ongoing effort to have TensorBoard only depend on TensorFlow via its public api, so that it can be split into a project with a fast external build. PiperOrigin-RevId: 157784552 --- Commit 40411cd5c authored by Dandelion Mané<dandelion@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Refactor projector plugin to only use tf public methods. Remove all reference to the PluginAsset system, which is deprecated. Part of an ongoing effort to have TensorBoard only consume the public TensorFlow api. PiperOrigin-RevId: 157784016 --- Commit a65a70ea5 authored by Gunhan Gulsoy<gunan@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix pip tests under contrib/text PiperOrigin-RevId: 157783952 --- Commit fb4bc806a authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix flakiness in GpuMultiSessionMemoryTest. PiperOrigin-RevId: 157781368 --- Commit f7de292df authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update placeholder nodes' shapes in the GraphDef to reflect manually specified values for incomplete placeholder shapes. Previously, these overrides were only specified in the feed nodes, which improves estimates when using dynamic shapes but not when using static shapes. 
With this change, static shapes also benefit. PiperOrigin-RevId: 157780800 --- Commit eebd44123 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add a frontend method for retrieving numeric alerts from the debugger plugin. This route responds with a list of alerts (occurrences of bad values) in ascending timestamp order. PiperOrigin-RevId: 157780270 --- Commit 5bc685d7f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] If an op has a single "large" operand, we want to fuse this op into some of its consumers, even if we can't fuse into all of them. PiperOrigin-RevId: 157779106 --- Commit 2ee09b873 authored by Mark Heffernan<meheff@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Various improvements to ShapeTree. Add support for holding non-copyable types, operator==, and a CopySubtreeFrom method for copying a subtree from one ShapeTree to another. PiperOrigin-RevId: 157777636 --- Commit 4f3ae7699 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add beam_search kernels used by BeamSearchDecoder to tensorflow.contrib. PiperOrigin-RevId: 157775011 --- Commit 6b16c33b3 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make audio-related logic use the audio plugin. Previously, fetching audio and related data from TensorBoard used handlers within application.py. We now remove those handlers in favor of routes offered by the audio plugin. ML Dash is updated as well. 
PiperOrigin-RevId: 157774953 --- Commit 8032e1f75 authored by Geoffrey Irving<geoffreyi@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make function instantiation use std::vector<NodeDef> instead of GraphDef It's about to turn into std::vector<NodeInfoPtr>; this change gets us partway there. RELNOTES: n/a PiperOrigin-RevId: 157771141 --- Commit 2e44be35d authored by Vinu Rajashekhar<vinuraja@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds a protected DeleteResourceMgr(...) method in Device. PiperOrigin-RevId: 157770378 --- Commit cc346e690 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Strip the :x suffix when generating control inputs from input names PiperOrigin-RevId: 157770257 --- Commit d6fe47af5 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use tensorflow::StringPiece in literal_util. Use template for RepeatedField assignment. PiperOrigin-RevId: 157765477 --- Commit 7866fa01b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: This change significantly reduces time and resources used to load large TensorFlow graphs. For a real-world large graph (13k nodes, 20k edges), this change: * reduces all heap allocations by 19% * reduces retained (final) heap allocations by 2.2% * reduces CPU time by 11.2% In most TF graphs, the set of unique values set to Node::assigned_device_name() is quite small. This change adds an interning table to the Graph object, which contains all of the unique values used for Node::set_assigned_device_name(), as well as a look-up table. This is the main source of the reduction in retained heap memory; nearly all nodes are assigned to just one or two unique devices. 
This change removes the "string assigned_device_name_" field from the Node class, and replaces it with "int assigned_device_name_index_". However, because you need both the index and the name table to get the actual value, the Node::assigned_device_name() accessor needs access to the parent Graph. This requires adding a "Graph* graph_" field to the Node class. In the future, if all users of this property are converted to use Graph::assigned_device_name(Node*), then the Node::graph_ field can be deleted, and the space reclaimed. However, doing so is out of the scope of this CL, and even with this new pointer field, the Node class is smaller than it was before, so this is still a net win. The placement algorithm in simple_placer.cc is one of the main accessors of the Node::assigned_device_name property. This CL contains significant changes to simple_placer.cc, which directly take advantage of the fact that the property is an index into a name table, rather than treating it simply as a string. Many temporary allocations are also removed, which is the main source of the reduction in total heap allocations. This CL also contains a few changes that remove short-lived allocations in unrelated code, such as the changes in op.cc/h, costmodel.cc, etc. It is extremely easy in C++ to accidentally allocate memory, especially when implicit conversions and copy constructors allocate memory. All of the changes in this CL were motivated by empirical measurement, using CPU profiling and heap profiling. PiperOrigin-RevId: 157762909 --- Commit fdffafbc1 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add QueueDequeueUpTo to the list of dequeue ops PiperOrigin-RevId: 157760201 --- Commit 7ad0d0698 authored by Mustafa Ispir<ispir@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add type error to start_queue_runners if given session is not a `tf.Session`. 
Due to semver, we suppress the error if a MonitoredSession is provided. PiperOrigin-RevId: 157748375 --- Commit 7106f9fac authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Implemented an initial version of virtual scheduler unit test. PiperOrigin-RevId: 157746305 --- Commit b020db0c6 authored by Andrew Harp<andrewharp@google.com> Committed by Andrew Harp<andrewharp@google.com>: revert public visibility --- Commit 5b05728c2 authored by Andrew Harp<andrewharp@google.com> Committed by Andrew Harp<andrewharp@google.com>: visibility workaround 3 --- Commit 15a740ebb authored by Mustafa Ispir<ispir@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update and Move DNNLinearCombinedRegressor to estimator/canned. PiperOrigin-RevId: 157744087 --- Commit d29bbeca3 authored by Dandelion Mané<dandelion@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix outdated code ref in TensorBoard README, add link to SO question. PiperOrigin-RevId: 157743374 --- Commit 9fc164225 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix index_table_from_file to allow vocabulary_file be a Tensor PiperOrigin-RevId: 157740677 --- Commit 0aa3e0194 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Internal change PiperOrigin-RevId: 157740660 --- Commit 02ac85399 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Introduce new class Literal to replace protobuf Literal. This renames the existing Literal message to LiteralProto and introduces a new C++ class named Literal to replace it. The LiteralProto is only used at RPC boundaries, or when protobuf-specific functionality is required. 
The Literal class offers a 'ToProto' function to generate a new LiteralProto message when necessary. Currently, all the static functions in class LiteralUtil, just forward to their counterparts in class Literal. This will change in a future CL. Class Literal implements all the buffers as std::vectors. The only exception is preds(), which given the std::vector<bool> representation, makes it unusable for the semantics we require (it's not possible to get the address of the underlying vector, for instance). The CL adds a BoolVector class to work around that issue. In future CLs, the std::vector representation may be changed to something more efficient, if needed. PiperOrigin-RevId: 157739125 --- Commit 207203253 authored by gunan<gunan@google.com> Committed by GitHub<noreply@github.com>: Python 3.6 support on windows. (#10356) * Python 3.6 support on windows. * Fix typo in README.md * Make environment configurable for windows gpu build. --- Commit 2b75a9a6e authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 157734029 --- Commit f60b6bdcb authored by Mustafa Ispir<ispir@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add a warning to documentation of MonitoredSession. PiperOrigin-RevId: 157728225 --- Commit eb10a4c49 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Preallocate vector storage when the ultimate vector size is known in advance PiperOrigin-RevId: 157724431 --- Commit ce32228c4 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add release notes for Intel MKL integration. PiperOrigin-RevId: 157722003 --- Commit a23255bc0 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds missing group OP to benchmark PiperOrigin-RevId: 157716500 --- Commit d3e840a6c authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Disable writing of compressed checkpoints. Snappy compression (and decompression) was enabled after the 1.1 release (in commit 63b2f999d3f22cfe915b89103faa1b0a1b1b7617). This means that checkpoints produced by the 1.2.0 release candidates will cause TensorFlow 1.1 (and prior) binaries to crash as they CHECK fail when trying to load snappy-compressed tables. To ease transition, disable writing of compressed checkpoints in 1.2.0 for now. Reconsider this in the next release. PiperOrigin-RevId: 157675189 --- Commit 6db400bbc authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Refactoring Python op code generation. PiperOrigin-RevId: 157675126 --- Commit d9620cab8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add flag to determine whether to do L1 optimizations and inline functions. Default is to do them. In tf_optimizer don't inline or do l1 optimizations. PiperOrigin-RevId: 157673614 --- Commit 25bb504cc authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make a plugin that serves data for the audio dashboard. Subsequent changes will make TensorBoard use this audio plugin instead of the previous handlers for audio-related data. 
PiperOrigin-RevId: 157673132 --- Commit 24623653b authored by James Qin<jamesqin@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix graph text format serialization PiperOrigin-RevId: 157669530 --- Commit 3aed1735c authored by Andrew Harp<andrewharp@google.com> Committed by Andrew Harp<andrewharp@google.com>: visibility workaround 2 --- Commit fea90f89d authored by Andrew Harp<andrewharp@google.com> Committed by Andrew Harp<andrewharp@google.com>: visibility workaround --- Commit 732a6b1ae authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Upgrade TypeScript to v2.3.4 PiperOrigin-RevId: 157667511 --- Commit 95d90ab2e authored by Luke Iwanski<luke@codeplay.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [OpenCL] Fixes Split op (#10322) * [OpenCL] Fixes Split op Split should always go through SYCL device * [OpenCL] Removes half from registered types --- Commit 963441400 authored by Luke Iwanski<luke@codeplay.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [OpenCL] Extends softmax op to cover double (#10323) --- Commit a702863e8 authored by Luke Iwanski<luke@codeplay.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [OpenCL] Extends tile ops to int16 and int32 (#10328) * [OpenCL] Extends tile ops to int16 and int32 * [OpenCL] Extends tile_ops to cover bool, uint8, int16, int64 --- Commit 75385814f authored by cxx<cxxgtxy@gmail.com> Committed by cxx<cxxgtxy@gmail.com>: Fix comments error in mnist_replica.py where only one ps is used with two workers by default. --- Commit 23364e2c6 authored by Andrew Harp<andrewharp@google.com> Committed by Andrew Harp<andrewharp@google.com>: buildifier fix --- Commit e5088cb82 authored by Yao Zhang<yaozhang@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix discrepancy between measured and analytical cost graph. Use tf_cuda_library for utils. 
PiperOrigin-RevId: 157660745 --- Commit 787381ca5 authored by Brennan Saeta<saeta@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Split up session_test.py -> session_clusterspec_prop_test.py session_test.py has gotten very large. Additionally, recently it has become flaky. In order to both (1) improve overall code health, and (2) to facilitate root-causing the test flakiness, this CL begins to split apart session_test into focused subsets. I've suffixed the scoping of the session_test in order to preserve filesystem sort-order grouping. PiperOrigin-RevId: 157658981 --- Commit b09932d74 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Added PlaceholderWithDefault to the list of known placeholder types Use PartialTensorShape instead of TensorShapes to better handle partially known shapes PiperOrigin-RevId: 157657664 --- Commit 0462416f6 authored by Dandelion Mané<dandelion@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add make_ndarray, tensor_proto, and MetaGraphDef to tf api. Since TensorProtos are part of the TensorFlow API, it makes sense to also include the methods that generate and parse them. Similarly, we write out MetaGraphDef protos in the summary writer, so we should provide the proto as well. This is part of an ongoing effort to have TensorBoard only consume TensorFlow methods through the public api. PiperOrigin-RevId: 157657564 --- Commit 458f94c12 authored by Wei Ho<weiho@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Open-source skip-gram ops PiperOrigin-RevId: 157655970 --- Commit faac0331c authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Introduce tensorboard_zip_file build rule This rule can depend on web_library or tensorboard_html_binary. In both cases it will create a .zip file containing all the transitive web server paths. 
This can be used to deploy static assets to web servers. A small change was also made to Vulcanize to support path overriding. PiperOrigin-RevId: 157655047 --- Commit 7ed44f4c9 authored by Brennan Saeta<saeta@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Split up session_test.py -> session_partial_run_test.py session_test.py has gotten very large. Additionally, recently it has become flaky. In order to both (1) improve overall code health, and (2) to facilitate root-causing the test flakiness, this CL begins to split apart session_test into focused subsets. I've suffixed the scoping of the session_test in order to preserve filesystem sort-order grouping. PiperOrigin-RevId: 157651813 --- Commit 3c7ac46ae authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Teach Executable to do its own profiling (patch 4/4). This CL removes the xla::Service stub for ExecuteOnStreamWrapper so the users call the xla::Executable version directly, and simplifies the function API to simply accept "arguments" as a parameter (with a templated type) rather than requiring the user to capture it into a lambda around the relevant Executable::ExecuteOnStream method. PiperOrigin-RevId: 157651740 --- Commit 626f95ab9 authored by Peter Hawkins<phawkins@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [TF:XLA] Don't enforce that all nodes in an encapsulated subgraph are on the same device. Use the assigned device rather than the user-requested device when converting a Graph to a FunctionDef. PiperOrigin-RevId: 157648977 --- Commit 414470329 authored by Jacques Pienaar<jpienaar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Guard stream pool with mutex. A data race can occur while populating the map. 
PiperOrigin-RevId: 157647183 --- Commit ccdb30763 authored by Eugene Brevdo<ebrevdo@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Additional colocation options and bugfixes for TensorArray * colocate_with is now set properly when a TensorArray is passed through a while_loop * added a new argument, "colocate_with_first_write" (default: True; this is the current behavior). If False, the TensorArray is simply placed on the device from the context it's constructed in, and no colocation constraints are added. PiperOrigin-RevId: 157643133 --- Commit 03fc7022b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 157642677 --- Commit 41b87d6ce authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add a new attribute narrow_range to FakeQuant* operations. It quantizes into range [1; 255] instead of [0; 255]. PiperOrigin-RevId: 157641054 --- Commit c048e2938 authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds support to non-placeholder inputs in _graph_to_function_def. Specifically, supports input ops with more than one output tensor. PiperOrigin-RevId: 157640908 --- Commit d310de4fa authored by Brennan Saeta<saeta@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Split up session_test.py -> session_list_devices_test.py session_test.py has gotten very large. Additionally, recently it has become flaky. In order to both (1) improve overall code health, and (2) to facilitate root-causing the test flakiness, this CL begins to split apart session_test into focused subsets. I've suffixed the scoping of the session_test in order to preserve filesystem sort-order grouping. PiperOrigin-RevId: 157640788 --- Commit 8e868cf6a authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unused arguments to call_cpp_shape_fn. PiperOrigin-RevId: 157640125 --- Commit 9ddbf31fe authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use unnamed namespace to effect internal linkage, replace string constructors with array-deducing helper function PiperOrigin-RevId: 157636308 --- Commit 88ffe6276 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Increase cholesky_op_test to medium, bump shard_count 1 more. PiperOrigin-RevId: 157635774 --- Commit bef563dc8 authored by Benjamin Kramer<kramerb@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Don't add constraints for computations we're not currently looking at. TuplePointsToAnalysis is computed globally per module, so we add all unconstrained buffers in that module, even if it's outside of the computation we're currently running on. Then we proceed to propagate default layouts to all those buffers and then throw the constraints away because they don't affect any instruction in the current computation. PiperOrigin-RevId: 157635564 --- Commit a980aead8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use test_adjusted_name when making the mangled_test_name in run_and_gather_logs_lib.py, to avoid duplicate file names when the same test is run on multiple GPUs. PiperOrigin-RevId: 157630193 --- Commit 0a84cfd58 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 157629497 --- Commit 6882effb8 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make single-parameter constructors explicit PiperOrigin-RevId: 157628970 --- Commit 0b8070253 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Support negative axis for Split op PiperOrigin-RevId: 157628162 --- Commit 289e7bf5b authored by gunan<gunan@google.com> Committed by GitHub<noreply@github.com>: Fixes and improvements to cmake windows build. (#10354) * Disable linalg ops tests on windows. * Do not print the full source code path for logs on windows. --- Commit bc236cfc3 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Passes classification head to LinearClassifier. PiperOrigin-RevId: 157624020 --- Commit cebd7e246 authored by Luke Iwanski<luke@codeplay.com> Committed by Shanqing Cai<cais@google.com>: [OpenCL] Cleans debug ops (#10334) * [OpenCL] Cleans debug ops * Acts on feedback from #10334#discussion_r119452513 * Acts on #10334#discussion_r119459463 --- Commit fd6c3c4f1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixes flaky test in dnn_linear_combined_test. PiperOrigin-RevId: 157622951 --- Commit c9cc388dc authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Avoid CHECKs in BundleReader, propagate errors instead. Motivation: We'd like to evolve the checkpoint format over time (e.g., enable different types of compression). Without this change, a TensorFlow version that encounters a format that it doesn't understand would CHECK fail with an unhelpful error message. With this, it propagates a clearer error message up, giving the user some hints about what could be wrong. 
I don't have a unittest for this - I thought about writing a bundle and then strategically corrupting the bytes on disk before reading it back, but that seems a bit much. The intention of this change is to enable graceful reporting of forward compatibility breakages. Ideas for an appropriate unittest are appreciated. PiperOrigin-RevId: 157620358 --- Commit ee05b8b69 authored by Wei Ho<weiho@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix to remove TF op usage outside of the initializer fn (due to deferred execution of initializer fn, this prevent issues with graph mismatch). PiperOrigin-RevId: 157620177 --- Commit e8d17ea8c authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Materialize shapes that are known at graph construction time into constants that can be folded PiperOrigin-RevId: 157619380 --- Commit dc0427d48 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Directly depend on the used libraries Do not rely on transitive dependencies. PiperOrigin-RevId: 157618184 --- Commit 964d1a509 authored by Yuan Yu<yuanbyu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix a bug that an erroneous control edge can be introduced when loops are nested in control dependency context. PiperOrigin-RevId: 157616919 --- Commit 2de94bbb8 authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add an option to set the "generate HLO graph" regex without a flag. Pipes the option through xla.proto ExecutionOptions, to HloModuleConfig, which can then be accessed throughout the compiler. 
PiperOrigin-RevId: 157615458 --- Commit d3c0482e6 authored by My name is<raviqqe@gmail.com> Committed by gunan<gunan@google.com>: Fix a typo in export_output.py (#9975) --- Commit 0c75d9f52 authored by ddurham2<ddurham@davyandbeth.com> Committed by gunan<gunan@google.com>: Adding lost documentation to tf.abs from the old tf.complex_abs when it learned how to work on complex data. (#9954) --- Commit 84661fa73 authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Propagate control dependencies during constant folding PiperOrigin-RevId: 157610040 --- Commit a3520340e authored by gunan<gunan@google.com> Committed by GitHub<noreply@github.com>: Improve windows bazel python test suite. (#10305) * Improve windows bazel python test suite. - Create new tags, no_windows and no_windows_gpu - Instead of a separate maintained list, use bazel tags to exclude tests. - Tag all the python tests that are known to have issues in windows. * Also blacklist neon_depthwise_conv_ops_test in windows. * Only build tests in CPU windows tests. * Only build tests in GPU windows tests. * Also disable session_test on windows. * Only run py tests on windows, and only build tests that are not disabled. --- Commit a6f284ca4 authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds integration tests for LinearRegressor. PiperOrigin-RevId: 157604107 --- Commit d21bf7d75 authored by Francois Chollet<fchollet@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Backport changes from Github master. PiperOrigin-RevId: 157603238 --- Commit 43bfc138c authored by Shanqing Cai<cais@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix OSS compilation error in tfprof_main.cc PiperOrigin-RevId: 157602449 --- Commit 904a3d075 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixing issue with cuda compilation related to missing include (exception is only thrown when running with sandboxing on) PiperOrigin-RevId: 157602401 --- Commit f59203c98 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Shard cholesky_op_test. PiperOrigin-RevId: 157601172 --- Commit 3fdbb5579 authored by Amit Patankar<amitpatankar@google.com> Committed by Amit Patankar<amitpatankar@google.com>: Merging rc1 back into master. --- Commit be5d98a8b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds integration tests for DNNClassifier. PiperOrigin-RevId: 157592010 --- Commit a05de6cd2 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Change reporting feature importances in RandomForestEstimator to run at the end of training, instead of part of the inference graph. PiperOrigin-RevId: 157591575 --- Commit e96f1142f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unnecessary casts PiperOrigin-RevId: 157591439 --- Commit 5f8571a6b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix missing namespace comments PiperOrigin-RevId: 157591364 --- Commit eeb0b4067 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 157573997 --- Commit 7f9674217 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 157573723 --- Commit 473a590c9 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Allow complex valued input for Cholesky decomposition. PiperOrigin-RevId: 157572536 --- Commit 2d1860859 authored by Blake Hechtman<blakehechtman@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix test name in array_elementwise_ops_test. PiperOrigin-RevId: 157552402 --- Commit a7fff05e0 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: tfprof multi-step profiling. This allows users to fill in RunMetadata across different steps. 1. It is useful for RL model which runs a subset of graph each step. 2. It also gets averages of multi-step stats. PiperOrigin-RevId: 157552388 --- Commit fe589d9e7 authored by Luke Iwanski<luke@codeplay.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [OpenCL] Implementation improvements (#9117) * OpenCL Improvements * Registers Scatter and ScatterNd Ops for SYCL * Registers Stack op for SYCL * Fixes No sycl buffer found error for debug ops * Registers MatMul and Transpose Ops to SYCL device for double * Extends analyzer_cli_test.py test to cover SYCL * Fixes Transpose Op for double when on SYCL * Bumps Eigen version to fix double precision issue on SYCL * Extends SessionDebugTestBase to cover SYCL * Register SYCL implementations for random ops * Avoid functions that might not be defined on SYCL device (#51) * Avoid functions that might not be defined on SYCL device * Simplify by using Eigen math functions * OpenCL improvements - Bumps Eigen Version - Refactors Ops registration - Introduces workaround for Const Op related to the difference between CUDA which uses pointers and OpenCL that uses buffers/accessors - Extends memory types to cover DEVICE_SYCL as well - Introduces GetSYCLDevice() method that returns list of supported devices with GPU device having the highest priority ( doesn't include blacklisted devices ) 
- ::internal::Transpose -> tensorflow::internal::Transpose in order to avoid compilation reported error - re-introduces fix for bugged string replacement causing a lot of compilation warnings -c -> --include - Adds sycl_runtime to bazels ARRAY_DEPS - Replicates TF_CALL_GPU_PROXY_TYPES for SYCL * [OpenCL] Fixes an issue caused by switch to aligned allocator for sycl buffer (#53) * [Build] Use gcc/g++ as a host compiler to avoid #8394 (#54) * [OpenCL] Fixes Scatter Op * Fix testSimple and testConst in stack_op_test (#3) * Fix testSimple and testConst in stack_op_test * Create a specialisation of DoParallelConcatUpdate for SyclDevice and register it * Guard all code in TENSORFLOW_USE_SYCL * Do not use sycl device for int32 * Registration of the Sycl version is now looking like the one for the GPU * Remove added empty line * Register batch normalization kernels for OpenCL (#61) * [OpenCL] RandomGamma has no GPU friendly implementation (#57) * [OpenCL] Compatibility fixes for TensorFlow 1.1.0-rc1 * [OpenCL] Implements BatchMatmul Op for SYCL * Lowercase the device name when GPU or SYCL returned * [OpenCL] kernel_estimator_test.py assertEqual-> assertAlmostEqual due to floating point representation on the device * [Eigen] Version bump * GPU device name string manipulation is not needed anymore * [OpenCL] Adds SYCL to device backwards compatibility * [OpenCL] Extends core_rnn_test.py to run for SYCL device * [OpenCL] Minor optimizations for build script * [OpenCL] Enables skip folder list in build script * [OpenCL] Fixes ApplyAdamOp for Sycl device * [OpenCL] SYCL device improvements * [OpenCL] Fixes debug_ops's SEGFAULT for SYCL device * [Build] Adds hexagon to skipped folders list * [OpenCL] Removes EnterLameDuckMode from SYCL device and allocator * [OpenCL] Registers Unique Op for SYCL device * [OpenCL][Temporary] Disables tests for SYCL target due to features not being implemented yet Tests affected: - 
tensorflow/contrib/memory_stats/python/kernel_tests/memory_stats_ops_test.py - tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py - tensorflow/python/kernel_tests/conv_ops_test.py - tensorflow/python/kernel_tests/depthwise_conv_op_test.py - tensorflow/python/kernel_tests/pooling_ops_3d_test.py - tensorflow/python/kernel_tests/pooling_ops_test.py - tensorflow/python/kernel_tests/scatter_nd_ops_test.py - tensorflow/python/training/adam_test.py - tensorflow/python/training/localhost_cluster_performance_test.py - tensorflow/python/training/training_ops_test.py * [OpenCL][Temporary] Disables failing tests for SYCL in order to establish regression baseline Tests affected: - tensorflow/python/debug/cli/analyzer_cli_test.py - tensorflow/python/debug/lib/session_debug_testlib.py - tensorflow/python/debug/lib/stepper_test.py - tensorflow/python/kernel_tests/unstack_op_test.py - tensorflow/python/ops/image_ops_test.py * [OpenCL] Take options.config.device_count() into consideration * [OpenCL] Fixes compilation warning * [OpenCL] device:SYCL:0 -> sycl:0 * [OpenCL] Removes unwanted flags in building script Removes flags given to computecpp that enable SIMD instructions Removes duplicate flags * bool -> const bool * [OpenCL] sycl in test_util.gpu_device_name() -> is_sycl_enabled() * [OpenCL][Temporary] Disables failing tests for SYCL in order to establish regression baseline Test affected: - tensorflow/contrib/stateless/python/kernel_tests/stateless_random_ops_test.py * Imports test_util from tensorflow.python.framework * [OpenCL] Fixes formatting in Python code * [OpenCL] Extends session_test.py to cover SYCL device * [OpenCL] Cleans singleton class * [OpenCL] Keeping CUDA happy * [OpenCL][Temporary] Disables failing tests for SYCL in order to establish regression baseline Test affected: - tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py - tensorflow/contrib/seq2seq/python/kernel_tests/beam_search_ops_test.py * Added support for building with SYCL on 
ARM. * Acts on the review feedback from: - #9117#discussion_r113608975 - #9117#discussion_r113609173 * [OpenCL] Fixes scatter_nd_op_test * Fixes auto-merge mistake * [OpenCL] struct SyclDevice -> class SyclDevice * Revert "[OpenCL] struct SyclDevice -> class SyclDevice" This reverts commit addd43348c374a5379f67bb1e5ad084715722fc2. * [OpenCL] Reverting refactoring commit. As requested in the review #9117#issuecomment-298454466 This change set will be re-introduced in smaller chunks. * Revert "[OpenCL] device:SYCL:0 -> sycl:0" This reverts commit cf16e60340b62d16c3764d71b716fe03d35f87a9. * Revert "[OpenCL] Adds SYCL to device backwards compatibility" This reverts commit b8401b5164199b7a169be1c1d8dea5001195c390. * Acts on the feedback from #9117#discussion_r115036905 * control_flow_ops_py_test.py expects device name to be lower cased * Acts on the feedback from #9117#discussion_r115037222 * Removes debug print * Removes not needed partial specialisation * [OpenCL] Registers ScatterNdFunctor for SYCL device * [OpenCL] Make it compile * [OpenCL] Follow gpu_device changes * [OpenCL] Adds cxx_builtin_include_directory for python lib Fixes bazels missing undeclared inclusions that appeared after merge with TensorFlow upstream * [OpenCL] Fixes Constant Op * [OpenCL] gXX-4.8 -> gXX * [OpenCL] Removes -D_GLIBCXX_USE_CXX11_ABI=0 as it breaks default compiler setup for Ubuntu 16.04 * Revert "[OpenCL] kernel_estimator_test.py assertEqual-> assertAlmostEqual due to floating point representation on the device" This reverts commit 06c50c0a485f40c30a436f02c3fa7794e370c49d. * [OpenCL] CPU allocator is a singleton we should not delete it --- Commit 7aac2395c authored by Blake Hechtman<blakehechtman@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Merge a copies of copies. PiperOrigin-RevId: 157549434 --- Commit 37d9d5f0e authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add some routines for managing summaries to slim. PiperOrigin-RevId: 157541902 --- Commit d58cd2962 authored by Justine Tunney<jart@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix weblas license mirror URL PiperOrigin-RevId: 157537115 --- Commit 5c13ee13b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make images-related logic use the images plugin. Previously, fetching images and related data from TensorBoard used handlers within application.py. We now remove those handlers in favor of routes offered by the images plugin. ML Dash is updated as well. PiperOrigin-RevId: 157536471 --- Commit 60394a3d1 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Reduce size of the no-winograd tests, but still large enough that ShouldIncludeWinogradNonfusedAlgo returns true. PiperOrigin-RevId: 157535386 --- Commit 9501c4104 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Replace protobuf CopyFrom with assignment PiperOrigin-RevId: 157534272 --- Commit 96698f7fd authored by Eugene Brevdo<ebrevdo@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf contrib seq2seq] Improve BeamSearchDecoder's ability to handle unknown shapes. Updated unit tests to contain inputs of unknown shape (at graph build time). Found an issue in the gather helper that stops it from properly propagating the batch size of the output shape. This caused problems with tf.while_loop. Fixed. 
PiperOrigin-RevId: 157533937 --- Commit 5c73d0102 authored by Neal Wu<wun@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Batch norm docs fix applied to _fused_batch_norm as well PiperOrigin-RevId: 157530527 --- Commit abd4aa49a authored by Jonathan Hseu<jhseu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix docs for tf.abs() and tf.pow(). PiperOrigin-RevId: 157528475 --- Commit dd5ad6917 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Declarations of operators to support batch norm in xla PiperOrigin-RevId: 157527596 --- Commit bbeaa1307 authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix the expand_dim for label and weight for classifier heads. PiperOrigin-RevId: 157524909 --- Commit 346021ab4 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Cleanup: Use C++ casts, remove redundant casts, use CHECK_OK PiperOrigin-RevId: 157522142 --- Commit e405b0f6b authored by Francois Chollet<fchollet@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Refactoring of layer name autogeneration, to remove a graph serialization warning. PiperOrigin-RevId: 157520123 --- Commit 5784e1e35 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add HasOutputProperties to check for pruned ops; Return device name instead of casting it to a short name (GPU:0/CPU:0); VLOG(2) when printing op device placement since it is a lot of output. PiperOrigin-RevId: 157519077 --- Commit 2994444bf authored by Peter Hawkins<phawkins@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Issue a more user-friendly error message if a variable's initializer is from inside a control-flow scope, such as tf.cond() or tf.while_loop(). Fixes #8604. 
PiperOrigin-RevId: 157516279 --- Commit da2daf068 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unused using declarations PiperOrigin-RevId: 157513772 --- Commit 8b2e8b566 authored by Derek Murray<derek.murray@gmail.com> Committed by gunan<gunan@google.com>: Exclude Python test files from CMake PIP package. (#10302) * Exclude *_test.py files from the CMake-built PIP package. * Add stray _test.py file to the PIP package. * Nit. Convert tabs to spaces in tf_python.cmake --- Commit 2249a4ea8 authored by Dan Ringwalt<ringwalt@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix control reaching the end of ProjectiveGenerator. PiperOrigin-RevId: 157510013 --- Commit 040e2e20f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unneeded check for has properties in grappler. PiperOrigin-RevId: 157507665 --- Commit 684006955 authored by Yun Peng<pcloudy@google.com> Committed by gunan<gunan@google.com>: Windows: Remove session_test from bazel_test_lib.sh (#10274) It was disabled in 49b17146d2e4f04192d16ed67574142de167f3a1 --- Commit 890a0a407 authored by Gunhan Gulsoy<gunan@google.com> Committed by Gunhan Gulsoy<gunan@google.com>: Upgrade TF ci build and docker files to use bazel 0.5.0 --- Commit 46db634e5 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Only run the no-winograd tests once each. Only run the no-winograd tests on GPU; this also fixes timeouts in asan and msan. PiperOrigin-RevId: 157505317 --- Commit a6cd4e735 authored by Dandelion Mané<dandelion@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove all TB build references that circumvent TF's public API. 
This doesn't actually remove all the code references, lots of code references continue to work despite the BUILD references being removed. I think this is because depending on the public api transitively makes all of TensorFlow's guts available too. PiperOrigin-RevId: 157502987 --- Commit dcc3cdce8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove redundant get() calls and string conversions PiperOrigin-RevId: 157497932 --- Commit af2b9d875 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix the trace inputs functionality of the graph explorer. After migrating to d3 v4, the graph can no longer directly index into d3.Selections to obtain elements. Instead, we must use the nodes method of d3.Selection to generate an array of selected elements. PiperOrigin-RevId: 157493509 --- Commit 5cf484584 authored by Jacques Pienaar<jpienaar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Small test that performs A*B+A and A*B+B. PiperOrigin-RevId: 157492992 --- Commit b2355913b authored by Androbin<robin.richtsfeld@gmail.com> Committed by drpngx<drpngx@users.noreply.github.com>: remove some invalid entries (#10294) I noticed that some entries don't exist (anymore). This seems to be some kind of a consistency issue. More specifically: `tensorflow/contrib/ios_examples/camera/data` `tensorflow/contrib/session_bundle/testdata/saved_model_half_plus_two` `tensorflow/contrib/session_bundle/testdata/saved_model_half_plus_two/variables` This is the continuation of PR #10264 --- Commit 367ec84f8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add SampleEmbeddingHelper to do sampling at inference time PiperOrigin-RevId: 157487623 --- Commit a3ba225d5 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add BatchMatMul execution cost prediction PiperOrigin-RevId: 157487507 --- Commit 34a29fc3b authored by Eric Liu<ioeric@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [TF:XLA] preserve metadata when replacing HLO instructions. The motivation is to add metadata for HLO instructions that are created to replace existing HLO instructions during optimizations. The assumption is that the old instruction and the new instruction would perform the same function, and that they would be correlated to the same TF op. This might not always be correct since HLO optimizations can cross TF op boundaries. But still this seems to be better than nothing. Note that this still doesn't fully resolve missing OpMetadata after HLO optimizations; new instructions might be added without using ReplaceInstruction. PiperOrigin-RevId: 157484394 --- Commit 092a7b6e6 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Disable keras lstm test in tsan. PiperOrigin-RevId: 157484268 --- Commit 7280dafca authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use "empty" member function to test for emptiness PiperOrigin-RevId: 157483181 --- Commit 6c3b15915 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Expands integration tests in dnn_test. PiperOrigin-RevId: 157476608 --- Commit 727193b1f authored by Androbin<robin.richtsfeld@gmail.com> Committed by drpngx<drpngx@users.noreply.github.com>: add missing import for `signal` package (#10264) * add missing import for `signal` package * add missing dependency for `signal` package * Update tf_python.cmake --- Commit 21461213d authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unused BUILD dependencies PiperOrigin-RevId: 157473460 --- Commit 4788ca2be authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix handling of Infinity/NaN in line chart domain Test Plan: - Use the script listed below to generate data that has enough infinities for these values to not be treated as outliers. - Load the data into TensorBoard (`--logdir /tmp/infbug`) and look at the scalars plot; also look at the console. - Before this change, the chart is completely blank, and there is a console warning: "QuantitativeScales cannot take NaN or Infinity as a domain value. Ignoring." - After this change, there is no console output, and the chart appears as intended: a reasonable domain is shown, and the infinities just shoot off the chart. Generating script: ```py import tensorflow as tf LOGDIR = '/tmp/infbug' STEPS = 134 def main(): x = tf.Variable(3.1415) y = x.assign_add(x) tf.summary.scalar('y', y) summ = tf.summary.merge_all() sess = tf.Session() writer = tf.summary.FileWriter(LOGDIR) writer.add_graph(sess.graph) sess.run(tf.global_variables_initializer()) for step in xrange(STEPS): writer.add_summary(sess.run(summ), step) writer.close() if __name__ == '__main__': main() ``` PiperOrigin-RevId: 157472340 --- Commit 49476a62c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unused namespace aliases PiperOrigin-RevId: 157468609 --- Commit d83074847 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use "nullptr" for null pointer values PiperOrigin-RevId: 157468186 --- Commit b73fea6e2 authored by Tim Harley<tharley@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Refactor `tf.Operation.traceback` implementation in to methods of tf.Graph. Adds an `_extract_frame_info` method to allow derived classes to extend the information available in each op traceback, if desired. The default result of `tf.Operation.traceback` is unchanged. Also fixes a poorly scoped `pylint disable=line-too-long`, so adds the necessary enable/disable blocks to silence pylint for the offending docstrings. PiperOrigin-RevId: 157466174 --- Commit f7ca8db7d authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Improve shape inference error messages for DynamicSlice/DynamicUpdateSlice. PiperOrigin-RevId: 157461335 --- Commit 8c2a079ec authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adding a slot / accumulator warmstart initializer that overrides the provided partitioner at call time with one passed at construction time. This is intended to be used for slot Variables (such as accumulators) associated with Optimizers, since these Variables are created in a fashion that relies on replicating the exact shape of the associated primary variables (see slot_creator). PiperOrigin-RevId: 157453498 --- Commit 73d10599f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Default CUDNN_HOME to CUDA_TOOLKIT_TARGET_DIR. The cuDNN distro is most naturally installed in the same directory as the CUDA SDK, so try to find it there if the user doesn't specify any other directory. PiperOrigin-RevId: 157436253 --- Commit eb7cf9331 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 157429266 --- Commit 346dcc0a4 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 157429078 --- Commit 3d5ede131 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update documentation for sparse_matmul op to reflect gradient calculation. PiperOrigin-RevId: 157428135 --- Commit 822d64f0c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix embedding_lookup() bug where normalization did not work with ids of rank != 1. PiperOrigin-RevId: 157422220 --- Commit 8cad6b824 authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Improve the error message for live set memory check. PiperOrigin-RevId: 157415647 --- Commit 34dcd5b49 authored by Eugene Brevdo<ebrevdo@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf contrib seq2seq] Bugfixes to BeamSearchDecoder Implementation by Cinjon Resnick. He can't push this since he's traveling. I just copied the fix and added some small syntax tweaks to make the unit tests pass. More comprehensive unit tests will come in the near future. Fixes at least part of #9904. BeamSearchDecoder: 1. Fix the bug where we don't pass the next cell state through. 2. Gather the cell state (and attention if that's a part of the model as an AttentionWrapper on the cell) according to the next_beam_ids. PiperOrigin-RevId: 157415564 --- Commit f7ae1461c authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix oversampling in the GPU version of multinomial due to an error in generating gumbel noise. -log(-log(U)) gives infinity if U draws a hard 0. Adds a tiny offset to U (2e-30) to avoid log(U) = -inf. The CPU sampling algorithm depends on the order of the logits which is undesirable and can also oversample the first logit if it is smaller than the smallest random float larger than 0 (~1e-7). Switching to double precision internally mitigates these problems, although it doesn't fix them. Slowdown is ~35% in the worst case. Also adds various tests that we would like the sampling to pass. CPU Benchmark before: 32 10000 1 0.060 0.069 0.87 32 10000 4 0.229 0.074 3.10 32 10000 32 2.180 0.059 37.09 32 100000 1 0.430 0.480 0.90 32 100000 4 2.322 0.449 5.17 32 100000 32 31.508 0.471 66.96 128 10000 1 0.168 0.235 0.71 128 10000 4 0.965 0.246 3.93 128 10000 32 7.989 0.225 35.51 128 100000 1 1.681 1.539 1.09 128 100000 4 9.012 1.57 35.73 128 100000 32 126.222 1.626 77.60 CPU Benchmark after: 32 10000 1 0.054 0.112 0.48 32 10000 4 0.206 0.093 2.21 32 10000 32 1.826 0.091 20.12 32 100000 1 0.292 0.636 0.46 32 100000 4 2.086 0.606 3.44 32 100000 32 28.496 0.633 45.03 128 10000 1 0.125 0.266 0.47 128 10000 4 0.759 0.258 2.94 128 10000 32 7.362 0.254 29.03 128 100000 1 1.550 2.18 10.71 128 100000 4 8.712 2.22 23.92 128 100000 32 122.585 2.213 55.39 PiperOrigin-RevId: 157414849 --- Commit 62cf561f1 authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add numpy_input_fn integration for LinearRegressor and fix the expand_dim for label and weight. PiperOrigin-RevId: 157405237 --- Commit 40c7e0dd7 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. 
PiperOrigin-RevId: 157402364 --- Commit 2726c00ce authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 157402063 --- Commit e9d2fba8f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix comment describing ignore_longer_outputs_than_inputs. PiperOrigin-RevId: 157400110 --- Commit 5f097217f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: An initial step of eliminating all implicit broadcast at the HLO level. Guard the shape inference for binary ops behind a flag. PiperOrigin-RevId: 157373647 --- Commit e78e5ec8a authored by Yangzihao Wang<yangzihao@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Set winograd nofused flag to be true by default. Disable winograd nonfused conv for certain input params to avoid a known bug in cuDNNv5 and cuDNNv6. PiperOrigin-RevId: 157352847 --- Commit 3f9b69a50 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: In the CUDA path of depthwise_conv2d, add a fast variant for forward convolution when the input images are smaller than 16x16. PiperOrigin-RevId: 157347823 --- Commit 848123e61 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix incorrect condition to instantiate depthwise_ops introduced in commit 15d9f00fa. The change should have excluded depthwise_conv2d for doubles on windows debug builds, but it excluded it for all windows and all debug builds. 
PiperOrigin-RevId: 157345929 --- Commit 060d67b34 authored by Taehoon Lee<taehoonlee@snu.ac.kr> Committed by Taehoon Lee<taehoonlee@snu.ac.kr>: Fix typos --- Commit 409419bcc authored by Mark Daoust<markdaoust@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: add closing code quotes PiperOrigin-RevId: 157339360 --- Commit d20d0a623 authored by Jonathan Hseu<jhseu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix the contrib estimator_test by updating the global step in all the appropriate spots. PiperOrigin-RevId: 157328239 --- Commit d1144d3a9 authored by Juang, Yi-Lin<b02901026@ntu.edu.tw> Committed by Juang, Yi-Lin<b02901026@ntu.edu.tw>: Fix typos --- Commit fa8bb43b1 authored by lanhin<lanhin1@gmail.com> Committed by lanhin<lanhin1@gmail.com>: Fixed a comment typo in GraphView:InitializeNode(), executor.cc. --- Commit 9f13ae93f authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Java: Update Maven release to 1.2.0-rc1 PiperOrigin-RevId: 157294719 --- Commit c8256769c authored by Gunhan Gulsoy<gunan@google.com> Committed by Gunhan Gulsoy<gunan@google.com>: Address comments and sanity check failures. --- Commit 344225a60 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 157292254 --- Commit eb2f6d041 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: VLOG(2) instead of VLOG(1) for detailed op printouts. PiperOrigin-RevId: 157291238 --- Commit b4466279a authored by Shanqing Cai<cais@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: tfdbg: add runtime shape and dtype info to DebugNumericSummary PiperOrigin-RevId: 157291215 --- Commit 4fb2425f8 authored by A. 
Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add GraphOptimizer to Grappler item builder to do L1 optimizations and inlining. Op Counts Comparison (BNMT) Counts: Profile vs Grappler Op: Add, 968 vs 965 Op: AddN, 2228 vs 2228 Op: ApplyGradientDescent, 84 vs 84 Op: BatchMatMul, 998 vs 998 Op: Identity, 142 vs 105 Op: MatMul, 63 vs 63 Op: Mul, 10318 vs 10306 Op: OneHot, 1 vs 1 Op: Reshape, 8421 vs 8422 Op: Select, 488 vs 488 Op: Shape, 8132 vs 8131 Op: Sigmoid, 942 vs 942 Op: Softmax, 19 vs 19 Op: StridedSlice, 58 vs 74 Op: Sub, 1398 vs 1394 Op: Tanh, 333 vs 333 Op: Tile, 21 vs 21 Op: Transpose, 39 vs 39 PiperOrigin-RevId: 157288420 --- Commit 8918fa9ef authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: BEGIN_PUBLIC Automated g4 rollback of changelist 157272843 PiperOrigin-RevId: 158534336
-rw-r--r--README.md12
-rw-r--r--RELEASE.md136
-rwxr-xr-xconfigure157
-rw-r--r--tensorflow/BUILD13
-rw-r--r--tensorflow/c/c_api_test.cc2
-rwxr-xr-xtensorflow/c/generate-pc.sh6
-rw-r--r--tensorflow/compiler/aot/tfcompile.bzl1
-rw-r--r--tensorflow/compiler/jit/BUILD18
-rw-r--r--tensorflow/compiler/plugin/BUILD36
-rw-r--r--tensorflow/compiler/tests/build_defs.bzl10
-rw-r--r--tensorflow/compiler/tests/plugin.bzl23
-rw-r--r--tensorflow/compiler/tests/slice_ops_test.py47
-rw-r--r--tensorflow/compiler/xla/BUILD19
-rw-r--r--tensorflow/compiler/xla/array4d.h2
-rw-r--r--tensorflow/compiler/xla/client/local_client.h2
-rw-r--r--tensorflow/compiler/xla/literal_util.h2
-rw-r--r--tensorflow/compiler/xla/service/algebraic_simplifier.cc2
-rw-r--r--tensorflow/compiler/xla/service/algebraic_simplifier.h2
-rw-r--r--tensorflow/compiler/xla/service/buffer_liveness_test.cc4
-rw-r--r--tensorflow/compiler/xla/service/copy_insertion.cc4
-rw-r--r--tensorflow/compiler/xla/service/gpu/convolution_thunk.cc2
-rw-r--r--tensorflow/compiler/xla/service/gpu/fusion_merger.h2
-rw-r--r--tensorflow/compiler/xla/service/gpu/gemm_thunk.cc2
-rw-r--r--tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc4
-rw-r--r--tensorflow/compiler/xla/service/gpu/partition_assignment.h2
-rw-r--r--tensorflow/compiler/xla/service/gpu/while_transformer.cc2
-rw-r--r--tensorflow/compiler/xla/service/hlo_constant_folding.h2
-rw-r--r--tensorflow/compiler/xla/service/hlo_cost_analysis.h2
-rw-r--r--tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc2
-rw-r--r--tensorflow/compiler/xla/service/hlo_evaluator.h2
-rw-r--r--tensorflow/compiler/xla/service/hlo_ordering.cc2
-rw-r--r--tensorflow/compiler/xla/service/llvm_ir/README.md2
-rw-r--r--tensorflow/compiler/xla/service/llvm_ir/llvm_util.h2
-rw-r--r--tensorflow/compiler/xla/tests/custom_call_test.cc1
-rw-r--r--tensorflow/compiler/xla/tests/hlo_test_base.h2
-rw-r--r--tensorflow/compiler/xla/tests/prng_test.cc2
-rw-r--r--tensorflow/compiler/xla/types.h2
-rwxr-xr-xtensorflow/contrib/BUILD1
-rw-r--r--tensorflow/contrib/__init__.py1
-rw-r--r--tensorflow/contrib/android/cmake/README.md2
-rw-r--r--tensorflow/contrib/batching/BUILD6
-rw-r--r--tensorflow/contrib/bayesflow/python/ops/monte_carlo_impl.py2
-rw-r--r--tensorflow/contrib/boosted_trees/lib/testutil/random_tree_gen.h2
-rw-r--r--tensorflow/contrib/cloud/kernels/bigquery_reader_ops.cc2
-rw-r--r--tensorflow/contrib/cloud/python/ops/bigquery_reader_ops_test.py2
-rw-r--r--tensorflow/contrib/cmake/CMakeLists.txt11
-rw-r--r--tensorflow/contrib/cmake/external/lmdb.cmake60
-rw-r--r--tensorflow/contrib/cmake/patches/lmdb/CMakeLists.txt26
-rwxr-xr-xtensorflow/contrib/cmake/tf_python.cmake25
-rw-r--r--tensorflow/contrib/cmake/tf_tests.cmake5
-rw-r--r--tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py2
-rw-r--r--tensorflow/contrib/data/README.md6
-rw-r--r--tensorflow/contrib/data/python/framework/function.py2
-rw-r--r--tensorflow/contrib/data/python/kernel_tests/BUILD5
-rw-r--r--tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py16
-rw-r--r--tensorflow/contrib/data/python/kernel_tests/resample_test.py2
-rw-r--r--tensorflow/contrib/data/python/ops/dataset_ops.py6
-rw-r--r--tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py2
-rw-r--r--tensorflow/contrib/distributions/python/kernel_tests/vector_student_t_test.py2
-rw-r--r--tensorflow/contrib/distributions/python/ops/binomial.py2
-rw-r--r--tensorflow/contrib/factorization/BUILD1
-rw-r--r--tensorflow/contrib/factorization/python/ops/clustering_ops.py2
-rw-r--r--tensorflow/contrib/framework/python/framework/checkpoint_utils.py2
-rw-r--r--tensorflow/contrib/graph_editor/reroute.py2
-rw-r--r--tensorflow/contrib/graph_editor/util.py2
-rw-r--r--tensorflow/contrib/hooks/README.md2
-rw-r--r--tensorflow/contrib/image/kernels/image_ops.h6
-rw-r--r--tensorflow/contrib/image/ops/image_ops.cc2
-rw-r--r--tensorflow/contrib/image/python/ops/image_ops.py2
-rw-r--r--tensorflow/contrib/keras/BUILD5
-rw-r--r--tensorflow/contrib/keras/api/keras/layers/__init__.py5
-rw-r--r--tensorflow/contrib/keras/python/keras/applications/resnet50.py4
-rw-r--r--tensorflow/contrib/keras/python/keras/backend.py6
-rw-r--r--tensorflow/contrib/keras/python/keras/engine/topology.py4
-rw-r--r--tensorflow/contrib/keras/python/keras/wrappers/scikit_learn.py2
-rw-r--r--tensorflow/contrib/kernel_methods/g3doc/tutorial.md2
-rw-r--r--tensorflow/contrib/kernel_methods/python/mappers/random_fourier_features_test.py2
-rw-r--r--tensorflow/contrib/labeled_tensor/python/ops/core.py4
-rw-r--r--tensorflow/contrib/layers/python/layers/target_column_test.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/dataframe/tensorflow_dataframe.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/datasets/mnist.py18
-rw-r--r--tensorflow/contrib/learn/python/learn/estimators/estimator.py12
-rw-r--r--tensorflow/contrib/learn/python/learn/estimators/estimator_test.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/estimators/head.py6
-rw-r--r--tensorflow/contrib/learn/python/learn/estimators/model_fn.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/estimators/rnn_common.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/estimators/run_config.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/evaluable.py23
-rw-r--r--tensorflow/contrib/learn/python/learn/experiment.py4
-rw-r--r--tensorflow/contrib/learn/python/learn/learn_runner.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py4
-rw-r--r--tensorflow/contrib/learn/python/learn/trainable.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/utils/export.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/utils/gc.py5
-rw-r--r--tensorflow/contrib/learn/python/learn/utils/gc_test.py43
-rw-r--r--tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py2
-rw-r--r--tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils_test.py2
-rw-r--r--tensorflow/contrib/legacy_seq2seq/python/kernel_tests/seq2seq_test.py2
-rw-r--r--tensorflow/contrib/linalg/python/ops/linear_operator_composition.py2
-rw-r--r--tensorflow/contrib/linear_optimizer/kernels/g3doc/readme.md2
-rw-r--r--tensorflow/contrib/losses/README.md4
-rw-r--r--tensorflow/contrib/memory_stats/python/kernel_tests/memory_stats_ops_test.py2
-rw-r--r--tensorflow/contrib/metrics/python/ops/metric_ops_test.py2
-rw-r--r--tensorflow/contrib/mpi/BUILD90
-rw-r--r--tensorflow/contrib/mpi/README.md94
-rw-r--r--tensorflow/contrib/mpi/mpi_msg.proto19
-rw-r--r--tensorflow/contrib/mpi/mpi_rendezvous_mgr.cc315
-rw-r--r--tensorflow/contrib/mpi/mpi_rendezvous_mgr.h260
-rw-r--r--tensorflow/contrib/mpi/mpi_server_lib.cc110
-rw-r--r--tensorflow/contrib/mpi/mpi_server_lib.h54
-rw-r--r--tensorflow/contrib/mpi/mpi_utils.cc72
-rw-r--r--tensorflow/contrib/mpi/mpi_utils.h60
-rw-r--r--tensorflow/contrib/opt/BUILD20
-rw-r--r--tensorflow/contrib/opt/__init__.py4
-rw-r--r--tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent.py256
-rw-r--r--tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent_test.py132
-rw-r--r--tensorflow/contrib/pi_examples/README.md2
-rw-r--r--tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py19
-rw-r--r--tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py67
-rw-r--r--tensorflow/contrib/rnn/python/ops/rnn_cell.py8
-rw-r--r--tensorflow/contrib/rnn/python/tools/checkpoint_convert.py2
-rw-r--r--tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py6
-rw-r--r--tensorflow/contrib/slim/README.md2
-rw-r--r--tensorflow/contrib/slim/python/slim/data/dataset_data_provider.py2
-rw-r--r--tensorflow/contrib/tensorboard/plugins/projector/__init__.py2
-rw-r--r--tensorflow/contrib/tensorboard/plugins/projector/projector_api_test.py2
-rw-r--r--tensorflow/contrib/tfprof/README.md2
-rw-r--r--tensorflow/contrib/training/python/training/evaluation.py2
-rw-r--r--tensorflow/contrib/training/python/training/hparam.py2
-rw-r--r--tensorflow/contrib/training/python/training/sequence_queueing_state_saver.py4
-rw-r--r--tensorflow/core/BUILD101
-rw-r--r--tensorflow/core/common_runtime/constant_folding.cc2
-rw-r--r--tensorflow/core/common_runtime/direct_session_test.cc3
-rw-r--r--tensorflow/core/common_runtime/direct_session_with_tracking_alloc_test.cc9
-rw-r--r--tensorflow/core/common_runtime/executor.cc2
-rw-r--r--tensorflow/core/common_runtime/executor.h4
-rw-r--r--tensorflow/core/common_runtime/memory_types.cc6
-rw-r--r--tensorflow/core/common_runtime/memory_types_test.cc18
-rw-r--r--tensorflow/core/common_runtime/session_factory.h2
-rw-r--r--tensorflow/core/common_runtime/simple_graph_execution_state.cc2
-rw-r--r--tensorflow/core/common_runtime/sycl/sycl_allocator.cc23
-rw-r--r--tensorflow/core/common_runtime/sycl/sycl_allocator.h10
-rw-r--r--tensorflow/core/common_runtime/sycl/sycl_device.cc58
-rw-r--r--tensorflow/core/common_runtime/sycl/sycl_device.h187
-rw-r--r--tensorflow/core/common_runtime/sycl/sycl_device_factory.cc18
-rw-r--r--tensorflow/core/common_runtime/sycl/sycl_util.h37
-rw-r--r--tensorflow/core/debug/debug_gateway.cc2
-rw-r--r--tensorflow/core/debug/debug_gateway_test.cc18
-rw-r--r--tensorflow/core/debug/debug_service.proto2
-rw-r--r--tensorflow/core/distributed_runtime/BUILD2
-rw-r--r--tensorflow/core/distributed_runtime/graph_mgr.h10
-rw-r--r--tensorflow/core/distributed_runtime/master.cc2
-rw-r--r--tensorflow/core/distributed_runtime/master_session.cc2
-rw-r--r--tensorflow/core/distributed_runtime/rpc/grpc_call.h4
-rw-r--r--tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc2
-rw-r--r--tensorflow/core/distributed_runtime/rpc/grpc_session_test.cc2
-rw-r--r--tensorflow/core/distributed_runtime/worker_cache_logger.cc2
-rw-r--r--tensorflow/core/framework/cancellation.h2
-rw-r--r--tensorflow/core/framework/function_test.cc4
-rw-r--r--tensorflow/core/framework/op_kernel.cc4
-rw-r--r--tensorflow/core/framework/resource_mgr.cc28
-rw-r--r--tensorflow/core/framework/resource_mgr.h32
-rw-r--r--tensorflow/core/framework/resource_op_kernel.h8
-rw-r--r--tensorflow/core/framework/shape_inference.cc4
-rw-r--r--tensorflow/core/framework/tensor.h4
-rw-r--r--tensorflow/core/graph/graph_constructor.cc4
-rw-r--r--tensorflow/core/graph/graph_constructor.h2
-rw-r--r--tensorflow/core/graph/testlib.cc4
-rw-r--r--tensorflow/core/grappler/costs/analytical_cost_estimator.h2
-rw-r--r--tensorflow/core/grappler/costs/cost_estimator.h2
-rw-r--r--tensorflow/core/grappler/costs/measuring_cost_estimator.h2
-rw-r--r--tensorflow/core/grappler/costs/op_level_cost_estimator.h2
-rw-r--r--tensorflow/core/grappler/optimizers/model_pruner.cc2
-rw-r--r--tensorflow/core/grappler/optimizers/model_pruner.h2
-rw-r--r--tensorflow/core/grappler/utils.h2
-rw-r--r--tensorflow/core/kernels/BUILD32
-rw-r--r--tensorflow/core/kernels/adjust_contrast_op_test.cc6
-rw-r--r--tensorflow/core/kernels/batch_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/batch_matmul_op_impl.h43
-rw-r--r--tensorflow/core/kernels/batch_matmul_op_real.cc4
-rw-r--r--tensorflow/core/kernels/batch_norm_op.cc28
-rw-r--r--tensorflow/core/kernels/cast_op.cc7
-rw-r--r--tensorflow/core/kernels/cast_op.h4
-rw-r--r--tensorflow/core/kernels/cast_op_impl_int32.cc3
-rw-r--r--tensorflow/core/kernels/cast_op_impl_int64.cc3
-rw-r--r--tensorflow/core/kernels/concat_lib_cpu.cc2
-rw-r--r--tensorflow/core/kernels/concat_op.cc4
-rw-r--r--tensorflow/core/kernels/constant_op.cc83
-rw-r--r--tensorflow/core/kernels/conv_grad_input_ops.cc2
-rw-r--r--tensorflow/core/kernels/conv_ops.cc2
-rw-r--r--tensorflow/core/kernels/conv_ops_fused.cc2
-rw-r--r--tensorflow/core/kernels/cuda_solvers.h2
-rw-r--r--tensorflow/core/kernels/cwise_ops_common.h2
-rw-r--r--tensorflow/core/kernels/debug_ops.cc25
-rw-r--r--tensorflow/core/kernels/debug_ops.h19
-rw-r--r--tensorflow/core/kernels/decode_raw_op.cc22
-rw-r--r--tensorflow/core/kernels/deep_conv2d.cc2
-rw-r--r--tensorflow/core/kernels/deep_conv2d.h2
-rw-r--r--tensorflow/core/kernels/dense_to_sparse_batch_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/dense_update_ops.cc45
-rw-r--r--tensorflow/core/kernels/fft_ops.cc63
-rw-r--r--tensorflow/core/kernels/fill_functor.cc11
-rw-r--r--tensorflow/core/kernels/filter_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/flat_map_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/group_by_window_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/hexagon/graph_transferer.h2
-rw-r--r--tensorflow/core/kernels/hinge-loss.h2
-rw-r--r--tensorflow/core/kernels/image_resizer_state.h2
-rw-r--r--tensorflow/core/kernels/inplace_ops.cc62
-rw-r--r--tensorflow/core/kernels/iterator_ops.cc2
-rwxr-xr-xtensorflow/core/kernels/lmdb_reader_op.cc134
-rw-r--r--tensorflow/core/kernels/map_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/map_stage_op.cc918
-rw-r--r--tensorflow/core/kernels/matmul_op.cc1
-rw-r--r--tensorflow/core/kernels/meta_support.h2
-rw-r--r--tensorflow/core/kernels/mfcc_test.cc2
-rw-r--r--tensorflow/core/kernels/mirror_pad_op.cc4
-rw-r--r--tensorflow/core/kernels/mkl_avgpooling_op.cc9
-rw-r--r--tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc2
-rw-r--r--tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc2
-rw-r--r--tensorflow/core/kernels/mkl_conv_grad_input_ops.cc6
-rw-r--r--tensorflow/core/kernels/mkl_conv_ops.cc38
-rw-r--r--tensorflow/core/kernels/mkl_lrn_op.cc2
-rw-r--r--tensorflow/core/kernels/mkl_maxpooling_op.cc24
-rw-r--r--tensorflow/core/kernels/mkl_pooling_ops_common.cc2
-rw-r--r--tensorflow/core/kernels/mkl_relu_op.cc6
-rw-r--r--tensorflow/core/kernels/non_max_suppression_op_test.cc8
-rw-r--r--tensorflow/core/kernels/ops_util.h8
-rw-r--r--tensorflow/core/kernels/ops_util_test.cc39
-rw-r--r--tensorflow/core/kernels/pack_op.cc18
-rw-r--r--tensorflow/core/kernels/pad_op.cc8
-rw-r--r--tensorflow/core/kernels/padded_batch_dataset_op.cc4
-rw-r--r--tensorflow/core/kernels/parallel_map_dataset_op.cc4
-rw-r--r--tensorflow/core/kernels/quantization_utils.h2
-rw-r--r--tensorflow/core/kernels/random_op.cc192
-rw-r--r--tensorflow/core/kernels/random_op.h12
-rw-r--r--tensorflow/core/kernels/random_op_gpu.cu.cc2
-rw-r--r--tensorflow/core/kernels/range_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/reader_dataset_ops.cc2
-rw-r--r--tensorflow/core/kernels/reduction_ops_common.h31
-rw-r--r--tensorflow/core/kernels/reduction_ops_max.cc3
-rw-r--r--tensorflow/core/kernels/reduction_ops_mean.cc1
-rw-r--r--tensorflow/core/kernels/reduction_ops_min.cc3
-rw-r--r--tensorflow/core/kernels/reduction_ops_prod.cc11
-rw-r--r--tensorflow/core/kernels/reduction_ops_sum.cc6
-rw-r--r--tensorflow/core/kernels/reference_gemm.h2
-rw-r--r--tensorflow/core/kernels/relu_op.cc2
-rw-r--r--tensorflow/core/kernels/relu_op.h4
-rw-r--r--tensorflow/core/kernels/repeat_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/reshape_op.cc9
-rw-r--r--tensorflow/core/kernels/reverse_op.cc10
-rw-r--r--tensorflow/core/kernels/scatter_functor.h94
-rw-r--r--tensorflow/core/kernels/scatter_nd_op.cc60
-rw-r--r--tensorflow/core/kernels/scatter_nd_op_cpu_impl.h89
-rw-r--r--tensorflow/core/kernels/scatter_op.cc77
-rw-r--r--tensorflow/core/kernels/sendrecv_ops.cc12
-rw-r--r--tensorflow/core/kernels/shuffle_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/skip_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/slice_op.cc9
-rw-r--r--tensorflow/core/kernels/smooth-hinge-loss.h2
-rw-r--r--tensorflow/core/kernels/softmax_op.cc4
-rw-r--r--tensorflow/core/kernels/sparse_matmul_op.h56
-rw-r--r--tensorflow/core/kernels/sparse_tensor_dense_add_op.h2
-rw-r--r--tensorflow/core/kernels/sparse_tensor_slice_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/split_lib_cpu.cc6
-rw-r--r--tensorflow/core/kernels/split_op.cc6
-rw-r--r--tensorflow/core/kernels/stage_op.cc278
-rw-r--r--tensorflow/core/kernels/strided_slice_op.cc6
-rw-r--r--tensorflow/core/kernels/strided_slice_op_impl.h2
-rw-r--r--tensorflow/core/kernels/take_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/tensor_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/tensor_slice_dataset_op.cc2
-rw-r--r--tensorflow/core/kernels/tile_ops.cc4
-rw-r--r--tensorflow/core/kernels/tile_ops_cpu_impl.h8
-rw-r--r--tensorflow/core/kernels/training_ops.cc282
-rw-r--r--tensorflow/core/kernels/training_ops.h9
-rw-r--r--tensorflow/core/kernels/transpose_functor_cpu.cc66
-rw-r--r--tensorflow/core/kernels/unique_op.cc19
-rw-r--r--tensorflow/core/kernels/unpack_op.cc9
-rw-r--r--tensorflow/core/kernels/variable_ops.cc46
-rw-r--r--tensorflow/core/kernels/variable_ops.h4
-rw-r--r--tensorflow/core/kernels/xsmm_conv2d.cc31
-rw-r--r--tensorflow/core/kernels/zip_dataset_op.cc2
-rw-r--r--tensorflow/core/lib/gtl/optional.h4
-rw-r--r--tensorflow/core/lib/jpeg/jpeg_mem.cc2
-rw-r--r--tensorflow/core/lib/lmdb/testdata/data.mdbbin0 -> 20480 bytes
-rw-r--r--tensorflow/core/lib/random/random_distributions.h27
-rw-r--r--tensorflow/core/ops/data_flow_ops.cc294
-rw-r--r--tensorflow/core/ops/image_ops.cc16
-rw-r--r--tensorflow/core/ops/io_ops.cc15
-rw-r--r--tensorflow/core/ops/ops.pbtxt35
-rw-r--r--tensorflow/core/ops/training_ops.cc22
-rw-r--r--tensorflow/core/platform/default/build_config.bzl40
-rw-r--r--tensorflow/core/platform/default/build_config_root.bzl13
-rw-r--r--tensorflow/core/platform/posix/port.cc4
-rw-r--r--tensorflow/core/platform/posix/subprocess.cc2
-rw-r--r--tensorflow/core/protobuf/master.proto2
-rw-r--r--tensorflow/core/public/session.h2
-rw-r--r--tensorflow/core/public/version.h2
-rw-r--r--tensorflow/core/util/ctc/ctc_beam_search_test.cc2
-rw-r--r--tensorflow/core/util/ctc/ctc_loss_calculator.h2
-rw-r--r--tensorflow/core/util/cuda_kernel_helper.h193
-rw-r--r--tensorflow/core/util/cuda_kernel_helper_test.cu.cc303
-rw-r--r--tensorflow/core/util/example_proto_fast_parsing.cc2
-rw-r--r--tensorflow/core/util/example_proto_fast_parsing.h4
-rw-r--r--tensorflow/core/util/tensor_bundle/tensor_bundle.h2
-rw-r--r--tensorflow/core/util/tensor_format.h3
-rw-r--r--tensorflow/docs_src/api_guides/cc/guide.md2
-rw-r--r--tensorflow/docs_src/api_guides/python/contrib.linalg.md2
-rw-r--r--tensorflow/docs_src/deploy/distributed.md2
-rw-r--r--tensorflow/docs_src/deploy/hadoop.md2
-rw-r--r--tensorflow/docs_src/extend/estimators.md4
-rw-r--r--tensorflow/docs_src/extend/language_bindings.md2
-rw-r--r--tensorflow/docs_src/extend/tool_developers/index.md2
-rw-r--r--tensorflow/docs_src/get_started/mnist/pros.md2
-rw-r--r--tensorflow/docs_src/install/install_c.md4
-rw-r--r--tensorflow/docs_src/install/install_go.md2
-rw-r--r--tensorflow/docs_src/install/install_java.md20
-rw-r--r--tensorflow/docs_src/install/install_linux.md62
-rw-r--r--tensorflow/docs_src/install/install_mac.md32
-rw-r--r--tensorflow/docs_src/install/install_sources.md45
-rw-r--r--tensorflow/docs_src/install/install_windows.md4
-rw-r--r--tensorflow/docs_src/performance/benchmarks.md18
-rw-r--r--tensorflow/docs_src/performance/performance_models.md2
-rw-r--r--tensorflow/docs_src/performance/quantization.md2
-rw-r--r--tensorflow/docs_src/performance/xla/index.md4
-rw-r--r--tensorflow/docs_src/programmers_guide/faq.md2
-rw-r--r--tensorflow/docs_src/programmers_guide/reading_data.md2
-rw-r--r--tensorflow/docs_src/programmers_guide/saved_model_cli.md2
-rw-r--r--tensorflow/docs_src/programmers_guide/supervisor.md14
-rw-r--r--tensorflow/docs_src/programmers_guide/tfdbg-tflearn.md2
-rw-r--r--tensorflow/docs_src/programmers_guide/threading_and_queues.md2
-rw-r--r--tensorflow/docs_src/programmers_guide/version_semantics.md2
-rw-r--r--tensorflow/docs_src/tutorials/layers.md4
-rw-r--r--tensorflow/docs_src/tutorials/recurrent.md19
-rw-r--r--tensorflow/docs_src/tutorials/wide.md4
-rw-r--r--tensorflow/examples/android/jni/object_tracking/image_utils.h2
-rw-r--r--tensorflow/examples/image_retraining/retrain.py2
-rw-r--r--tensorflow/examples/label_image/main.cc42
-rwxr-xr-xtensorflow/examples/learn/resnet.py2
-rw-r--r--tensorflow/examples/learn/text_classification_cnn.py9
-rw-r--r--tensorflow/examples/learn/wide_n_deep_tutorial.py4
-rw-r--r--tensorflow/examples/tutorials/deepdream/deepdream.ipynb2
-rw-r--r--tensorflow/examples/udacity/1_notmnist.ipynb6
-rw-r--r--tensorflow/examples/udacity/Dockerfile2
-rw-r--r--tensorflow/go/op/scope.go5
-rw-r--r--tensorflow/go/op/scope_test.go15
-rw-r--r--tensorflow/go/op/wrappers.go115
-rw-r--r--tensorflow/java/BUILD13
-rw-r--r--tensorflow/java/README.md2
-rw-r--r--tensorflow/java/build_defs.bzl154
-rw-r--r--tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java2
-rw-r--r--tensorflow/java/src/main/java/org/tensorflow/Tensor.java5
-rw-r--r--tensorflow/java/src/test/java/org/tensorflow/GraphTest.java2
-rw-r--r--tensorflow/java/src/test/java/org/tensorflow/OperationTest.java3
-rw-r--r--tensorflow/java/src/test/java/org/tensorflow/TensorFlowTest.java2
-rw-r--r--tensorflow/java/src/test/java/org/tensorflow/TensorTest.java3
-rw-r--r--tensorflow/java/src/test/java/org/tensorflow/TestUtil.java24
-rw-r--r--tensorflow/python/BUILD18
-rw-r--r--tensorflow/python/debug/BUILD1
-rw-r--r--tensorflow/python/debug/cli/analyzer_cli.py2
-rw-r--r--tensorflow/python/debug/cli/analyzer_cli_test.py6
-rw-r--r--tensorflow/python/debug/cli/debugger_cli_common.py2
-rw-r--r--tensorflow/python/debug/cli/profile_analyzer_cli.py2
-rw-r--r--tensorflow/python/debug/lib/debug_utils.py4
-rw-r--r--tensorflow/python/debug/lib/session_debug_testlib.py3
-rw-r--r--tensorflow/python/debug/lib/stepper_test.py2
-rw-r--r--tensorflow/python/debug/wrappers/dumping_wrapper.py2
-rw-r--r--tensorflow/python/debug/wrappers/framework.py2
-rw-r--r--tensorflow/python/estimator/BUILD3
-rw-r--r--tensorflow/python/estimator/canned/dnn_linear_combined_test.py48
-rw-r--r--tensorflow/python/estimator/canned/dnn_test.py331
-rw-r--r--tensorflow/python/estimator/canned/dnn_testing_utils.py297
-rw-r--r--tensorflow/python/estimator/canned/linear_testing_utils.py6
-rw-r--r--tensorflow/python/estimator/estimator.py2
-rw-r--r--tensorflow/python/estimator/export/export_output.py2
-rw-r--r--tensorflow/python/framework/file_system_test.py2
-rw-r--r--tensorflow/python/framework/importer.py1
-rw-r--r--tensorflow/python/framework/random_seed_test.py2
-rw-r--r--tensorflow/python/framework/sparse_tensor_test.py2
-rw-r--r--tensorflow/python/framework/tensor_shape.py2
-rw-r--r--tensorflow/python/kernel_tests/BUILD50
-rw-r--r--tensorflow/python/kernel_tests/barrier_ops_test.py4
-rw-r--r--tensorflow/python/kernel_tests/basic_gpu_test.py2
-rw-r--r--tensorflow/python/kernel_tests/control_flow_ops_py_test.py2
-rw-r--r--tensorflow/python/kernel_tests/conv_ops_3d_test.py2
-rw-r--r--tensorflow/python/kernel_tests/decode_bmp_op_test.py93
-rw-r--r--tensorflow/python/kernel_tests/decode_raw_op_test.py20
-rw-r--r--tensorflow/python/kernel_tests/distributions/categorical_test.py2
-rw-r--r--tensorflow/python/kernel_tests/fft_ops_test.py102
-rw-r--r--tensorflow/python/kernel_tests/map_stage_op_test.py556
-rw-r--r--tensorflow/python/kernel_tests/matrix_solve_op_test.py5
-rw-r--r--tensorflow/python/kernel_tests/metrics_test.py2
-rw-r--r--tensorflow/python/kernel_tests/random_ops_test.py7
-rw-r--r--tensorflow/python/kernel_tests/reader_ops_test.py44
-rw-r--r--tensorflow/python/kernel_tests/sparse_ops_test.py4
-rw-r--r--tensorflow/python/kernel_tests/stage_op_test.py215
-rw-r--r--tensorflow/python/kernel_tests/substr_op_test.py2
-rw-r--r--tensorflow/python/kernel_tests/variable_scope_test.py4
-rw-r--r--tensorflow/python/ops/candidate_sampling_ops.py2
-rw-r--r--tensorflow/python/ops/check_ops.py2
-rw-r--r--tensorflow/python/ops/control_flow_ops.py2
-rw-r--r--tensorflow/python/ops/ctc_ops.py2
-rw-r--r--tensorflow/python/ops/data_flow_ops.py743
-rw-r--r--tensorflow/python/ops/distributions/transformed_distribution.py4
-rw-r--r--tensorflow/python/ops/embedding_ops.py2
-rw-r--r--tensorflow/python/ops/hidden_ops.txt1
-rw-r--r--tensorflow/python/ops/image_ops.py1
-rw-r--r--tensorflow/python/ops/image_ops_impl.py3
-rw-r--r--tensorflow/python/ops/image_ops_test.py4
-rw-r--r--tensorflow/python/ops/io_ops.py19
-rw-r--r--tensorflow/python/ops/math_ops.py25
-rw-r--r--tensorflow/python/ops/math_ops_test.py4
-rw-r--r--tensorflow/python/ops/rnn_cell_impl.py2
-rw-r--r--tensorflow/python/ops/script_ops.py2
-rw-r--r--tensorflow/python/ops/session_ops.py2
-rw-r--r--tensorflow/python/ops/sparse_ops.py4
-rw-r--r--tensorflow/python/ops/special_math_ops.py2
-rw-r--r--tensorflow/python/ops/state_ops.py2
-rw-r--r--tensorflow/python/ops/variable_scope.py2
-rw-r--r--tensorflow/python/ops/variables.py4
-rw-r--r--tensorflow/python/saved_model/BUILD1
-rw-r--r--tensorflow/python/saved_model/README.md2
-rw-r--r--tensorflow/python/tools/BUILD1
-rw-r--r--tensorflow/python/tools/import_pb_to_tensorboard.py2
-rw-r--r--tensorflow/python/tools/print_selective_registration_header.py20
-rw-r--r--tensorflow/python/training/basic_session_run_hooks.py7
-rw-r--r--tensorflow/python/training/coordinator.py2
-rw-r--r--tensorflow/python/training/evaluation.py2
-rw-r--r--tensorflow/python/training/input.py6
-rw-r--r--tensorflow/python/training/learning_rate_decay.py8
-rw-r--r--tensorflow/python/training/learning_rate_decay_test.py5
-rw-r--r--tensorflow/python/training/moving_averages.py2
-rw-r--r--tensorflow/python/util/lazy_loader.py2
-rw-r--r--tensorflow/stream_executor/cuda/cuda_diagnostics.h2
-rw-r--r--tensorflow/stream_executor/cuda/cuda_driver.h2
-rw-r--r--tensorflow/stream_executor/cuda/cuda_event.h2
-rw-r--r--tensorflow/stream_executor/cuda/cuda_gpu_executor.cc2
-rw-r--r--tensorflow/stream_executor/lib/statusor.h2
-rw-r--r--tensorflow/stream_executor/plugin.h2
-rw-r--r--tensorflow/stream_executor/stream_executor_pimpl.h6
-rw-r--r--tensorflow/tensorboard/README.md2
-rw-r--r--tensorflow/tensorboard/components/tf_backend/behavior.ts2
-rw-r--r--tensorflow/tensorboard/components/tf_backend/requestManager.ts2
-rw-r--r--tensorflow/tensorboard/components/tf_backend/test/requestManagerTests.ts2
-rw-r--r--tensorflow/tensorboard/components/tf_dashboard_common/tf-chart-scaffold.html2
-rw-r--r--tensorflow/tensorboard/components/tf_graph/tf-graph-scene.html2
-rw-r--r--tensorflow/tensorboard/components/tf_graph_common/util.ts2
-rw-r--r--tensorflow/tensorboard/components/tf_imports/BUILD11
-rw-r--r--tensorflow/tensorboard/components/tf_storage/storage.ts6
-rw-r--r--tensorflow/tensorboard/components/vz_projector/scatterPlotVisualizer3DLabels.ts2
-rw-r--r--tensorflow/tensorboard/components/vz_projector/vz-projector-inspector-panel.html4
-rw-r--r--tensorflow/tensorboard/scripts/generate_testdata.py2
-rw-r--r--tensorflow/tensorflow.bzl51
-rw-r--r--tensorflow/tf_exported_symbols.lds1
-rw-r--r--tensorflow/tf_version_script.lds1
-rw-r--r--tensorflow/tools/api/golden/tensorflow.image.pbtxt4
-rw-r--r--tensorflow/tools/benchmark/README.md2
-rw-r--r--tensorflow/tools/ci_build/Dockerfile.gpu7
-rw-r--r--tensorflow/tools/ci_build/Dockerfile.gpu_clang2
-rwxr-xr-xtensorflow/tools/ci_build/install/install_bazel.sh2
-rw-r--r--tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh65
-rw-r--r--tensorflow/tools/ci_build/windows/bazel/common_env.sh4
-rw-r--r--tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh16
-rw-r--r--tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat14
-rw-r--r--tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat2
-rw-r--r--tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh18
-rw-r--r--tensorflow/tools/dist_test/python/census_widendeep.py2
-rw-r--r--tensorflow/tools/dist_test/python/mnist_replica.py6
-rw-r--r--tensorflow/tools/docker/Dockerfile.devel4
-rw-r--r--tensorflow/tools/docker/Dockerfile.devel-gpu6
-rw-r--r--tensorflow/tools/docker/Dockerfile.gpu2
-rw-r--r--tensorflow/tools/docker/jupyter_notebook_config.py7
-rwxr-xr-xtensorflow/tools/docker/parameterized_docker_build.sh2
-rw-r--r--tensorflow/tools/docs/parser.py2
-rw-r--r--tensorflow/tools/docs/pretty_docs.py1
-rw-r--r--tensorflow/tools/docs/py_guide_parser.py2
-rw-r--r--tensorflow/tools/gcs_test/python/gcs_smoke.py2
-rw-r--r--tensorflow/tools/graph_transforms/README.md8
-rw-r--r--tensorflow/tools/graph_transforms/transform_utils.h2
-rw-r--r--tensorflow/tools/lib_package/BUILD2
-rwxr-xr-xtensorflow/tools/lib_package/libtensorflow_java_test.sh2
-rw-r--r--tensorflow/tools/pip_package/BUILD7
-rw-r--r--tensorflow/tools/pip_package/pip_smoke_test.py1
-rw-r--r--tensorflow/tools/pip_package/setup.py2
-rw-r--r--tensorflow/tools/proto_text/BUILD1
-rw-r--r--tensorflow/tools/tfprof/README.md2
-rw-r--r--tensorflow/workspace.bzl13
-rw-r--r--third_party/curl.BUILD65
-rw-r--r--third_party/farmhash.BUILD10
-rw-r--r--third_party/gif.BUILD12
-rw-r--r--third_party/gpus/cuda_configure.bzl76
-rw-r--r--third_party/hadoop/hdfs.h4
-rw-r--r--third_party/jpeg/jpeg.BUILD20
-rw-r--r--third_party/libxsmm.BUILD15
-rw-r--r--third_party/lmdb.BUILD37
-rw-r--r--third_party/mpi/BUILD25
-rw-r--r--third_party/mpi/mpi.bzl17
-rw-r--r--third_party/nasm.BUILD13
-rw-r--r--third_party/py/BUILD.tpl36
-rw-r--r--third_party/py/python_configure.bzl100
-rw-r--r--third_party/snappy.BUILD1
-rw-r--r--third_party/swig.BUILD10
-rwxr-xr-xthird_party/sycl/crosstool/CROSSTOOL.tpl99
-rwxr-xr-xthird_party/sycl/crosstool/computecpp.tpl132
-rw-r--r--third_party/sycl/sycl/LICENSE.text4
-rw-r--r--third_party/sycl/sycl_configure.bzl16
508 files changed, 9662 insertions, 2188 deletions
diff --git a/README.md b/README.md
index 2878dab260..e7dbf57b25 100644
--- a/README.md
+++ b/README.md
@@ -34,12 +34,12 @@ and discussion.**
People who are a little more adventurous can also try our nightly binaries:
-* Linux CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.1.0-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.1.0-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.1.0-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/))
-* Linux GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.1.0-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.1.0-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.1.0-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/))
-* Mac CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.1.0-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.1.0-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/))
-* Mac GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.1.0-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.1.0-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/))
-* Windows CPU-only: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/DEVICE=cpu,OS=windows/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow-1.1.0-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/DEVICE=cpu,OS=windows/))
-* Windows GPU: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/DEVICE=gpu,OS=windows/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.1.0-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/DEVICE=gpu,OS=windows/))
+* Linux CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0rc2-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0rc2-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=cpu-slave/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0rc2-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-python35-linux-cpu/))
+* Linux GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0rc2-cp27-none-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-linux/)) / [Python 3.4](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0rc2-cp34-cp34m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-linux/)) / [Python 3.5](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0rc2-cp35-cp35m-linux_x86_64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-linux-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3.5,label=gpu-linux/))
+* Mac CPU-only: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0rc2-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=mac-slave/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-1.2.0rc2-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-cpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=mac-slave/))
+* Mac GPU: [Python 2](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0rc2-py2-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=gpu-mac/)) / [Python 3](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow_gpu-1.2.0rc2-py3-none-any.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/))
+* Windows CPU-only: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=35/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow-1.2.0rc2-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=35/)) / [Python 3.6 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=36/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow-1.2.0rc2-cp36-cp36m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows,PY=36/))
+* Windows GPU: [Python 3.5 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=35/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.2.0rc2-cp35-cp35m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=35/)) / [Python 3.6 64-bit](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=36/lastSuccessfulBuild/artifact/cmake_build/tf_python/dist/tensorflow_gpu-1.2.0rc2-cp36-cp36m-win_amd64.whl) ([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-win/M=windows-gpu,PY=36/))
* Android: [demo APK](https://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/tensorflow_demo.apk), [native libs](http://ci.tensorflow.org/view/Nightly/job/nightly-android/lastSuccessfulBuild/artifact/out/native/)
([build history](https://ci.tensorflow.org/view/Nightly/job/nightly-android/))
diff --git a/RELEASE.md b/RELEASE.md
index 1590aabfef..d22c5c62fe 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,9 +1,11 @@
-# Changes since the last release
+# Release 1.2.0
## Major Features and Improvements
+* Python 3.6 support on Windows.
* Added `tf.layers.conv3d_transpose` layer for spatio temporal deconvolution.
* Added `tf.Session.make_callable()`, which provides a lower overhead means of running a similar step multiple times.
-* Added ibverbs-based RDMA support to contrib (courtesy @junshi15 from Yahoo).
+* Added libverbs-based RDMA support to contrib (courtesy @junshi15 from Yahoo).
+* Bring `tf.feature_column.*` into the API. Non-deprecated functionality from `tf.contrib.layers.*` is moved to `tf.feature_column.*` with cosmetic changes.
* `RNNCell` objects now subclass `tf.layers.Layer`. The strictness described
in the TensorFlow 1.1 release is gone: The first time an RNNCell is used,
it caches its scope. All future uses of the RNNCell will reuse variables from
@@ -48,11 +50,141 @@
Activation: rectified linear unit (ReLU)
Data manipulation: multi-dimensional transposition (conversion), split,
concat, sum and scale.
+* TensorForest Estimator now supports SavedModel export for serving.
+* Support client-provided ClusterSpec's and propagate them to all workers to enable the creation of dynamic TensorFlow clusters.
+* TensorFlow C library now available for Windows.
+* We released a new open-source version of TensorBoard.
+* [`SavedModel CLI`](https://www.tensorflow.org/versions/master/programmers_guide/saved_model_cli) tool available to inspect and execute MetaGraph in SavedModel.
+* Android releases of TensorFlow are now pushed to jcenter for easier
+ integration into apps. See
+ https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/android/README.md
+ for more details.
+* RNNCells' variable names have been renamed for consistency with Keras layers.
+ Specifically, the previous variable names "weights" and "biases" have
+ been changed to "kernel" and "bias", respectively.
+ This may cause backward incompatibility with regard to your old
+ checkpoints containing such RNN cells, in which case you can use the tool
+ [checkpoint_convert script](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py)
+ to convert the variable names in your old checkpoints.
+* Many of the RNN functions and classes that were in the `tf.nn` namespace
+ before the 1.0 release and which were moved to `tf.contrib.rnn` have now
+ been moved back to the core namespace. This includes
+ `RNNCell`, `LSTMCell`, `GRUCell`, and a number of other cells. These
+ now reside in `tf.nn.rnn_cell` (with aliases in `tf.contrib.rnn` for backwards
+ compatibility). The original `tf.nn.rnn` function is now `tf.nn.static_rnn`,
+ and the bidirectional static and state saving static rnn functions are also
+ now back in the `tf.nn` namespace.
+
+ Notable exceptions are the `EmbeddingWrapper`, `InputProjectionWrapper` and
+ `OutputProjectionWrapper`, which will slowly be moved to deprecation
+ in `tf.contrib.rnn`. These are inefficient wrappers that should often
+ be replaced by calling `embedding_lookup` or `layers.dense` as pre- or post-
+ processing of the rnn. For RNN decoding, this functionality has been replaced
+ with an alternative API in `tf.contrib.seq2seq`.
+* Intel MKL Integration (https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture). Intel developed a number of
+ optimized deep learning primitives: In addition to matrix multiplication and
+ convolution, these building blocks include:
+ Direct batched convolution
+ Pooling: maximum, minimum, average
+ Normalization: LRN, batch normalization
+ Activation: rectified linear unit (ReLU)
+ Data manipulation: multi-dimensional transposition (conversion), split,
+ concat, sum and scale.
+
+## Deprecations
+
+* TensorFlow 1.2 may be the last time we build with cuDNN 5.1. Starting with
+ TensorFlow 1.3, we will try to build all our prebuilt binaries with cuDNN 6.0.
+ While we will try to keep our source code compatible with cuDNN 5.1, it will
+ be best effort.
+
+## Breaking Changes to the API
+* `org.tensorflow.contrib.android.TensorFlowInferenceInterface` now throws exceptions where possible and has simplified method signatures.
+
+## Changes to contrib APIs
+* Added `tf.contrib.util.create_example`.
+* Added bilinear interpolation to `tf.contrib.image`.
+* Add `tf.contrib.stateless` for random ops with custom seed control.
+* MultivariateNormalFullCovariance added to contrib/distributions/
+* tensorflow/contrib/rnn undergoes RNN cell variable renaming for
+ consistency with Keras layers. Specifically, the previous variable names
+ "weights" and "biases" are changed to "kernel" and "bias", respectively.
+ This may cause backward incompatibility with regard to your old
+ checkpoints containing such RNN cells, in which case you can use the
+ [checkpoint_convert script](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py)
+ to convert the variable names in your old checkpoints.
## Bug Fixes and Other Changes
* In python, `Operation.get_attr` on type attributes returns the Python DType
version of the type to match expected get_attr documentation rather than the
protobuf enum.
+* tensorflow/contrib/rnn undergoes RNN cell variable renaming for
+ consistency with Keras layers. Specifically, the previous variable names
+ "weights" and "biases" are changed to "kernel" and "bias", respectively.
+* Changed MIN_SDK version to 8.0 when building iOS libraries.
+* Fixed LIBXSMM integration.
+* Make decode_jpeg/decode_png/decode_gif handle all formats, since users frequently try to decode an image as the wrong type.
+* Improve implicit broadcasting lowering.
+* Improving stability of GCS/Bigquery clients by a faster retrying of stale transmissions.
+* Remove OpKernelConstruction::op_def() as part of minimizing proto dependencies.
+* VectorLaplaceDiag distribution added.
+* Android demo no longer requires libtensorflow_demo.so to run (libtensorflow_inference.so still required)
+* Added `categorical_column_with_vocabulary_file`.
+* Introduce ops for batching/unbatching tensors across Session::Run() calls.
+* Add tf.log_sigmoid(x) = tf.log(tf.sigmoid(x)) = -tf.nn.softplus(-x).
+* Changed hooks lists to immutable tuples, and now allow any iterable for the associated arguments.
+* Introduce TFDecorator.
+* Added an Mfcc op for speech feature generation.
+* Improved DirectSession::Run() overhead and error checking. Feeding a value of the wrong type will now synchronously raise an INVALID_ARGUMENT error instead of asynchronously raising an INTERNAL error. Code that depends on the (undefined) behavior when feeding a tensor of the wrong type may need to be updated.
+* Added unreduced NONE, and reduced MEAN options for losses. Removed "WEIGHTED_" prefix from other Reduction constants.
+* assertAllClose now handles dicts.
+* Added Gmock matcher for HloInstructions.
+* Add var name to errors on variable restore.
+* Added an AudioSpectrogram op for audio feature generation.
+* Added `reduction` arg to losses.
+* `tf.placeholder` can represent scalar and partially known shapes.
+* Remove estimator_spec(mode) argument.
+* Added an AudioSpectrogram op for audio feature generation.
+* TensorBoard disables all runs by default if there are more than 40 runs.
+* Removed old doc generator code.
+* GCS file system integration now supports domain buckets, e.g. gs://bucket.domain.com/path.
+* Add `tf.summary.text` for outputting text to TensorBoard.
+* The "run" command of tfdbg's command-line interface now supports filtering of tensors by node name, op type and tensor dtype.
+* `tf.string_to_number` now supports int64 and float64 outputs.
+
+## Thanks to our Contributors
+
+This release contains contributions from many people at Google, as well as:
+
+4F2E4A2E, Aaron Schumacher, Abhi Agg, admcrae, Adriano Carmezim, Adrià Arrufat,
+agramesh1, Akimitsu Seo, Alan Mosca, Alex Egg, Alex Rothberg, Alexander Heinecke,
+Alexander Matyasko, Alexandr Baranezky, Alexandre Caulier, Ali Siddiqui, Anand Venkat,
+Andrew Hundt, Androbin, Anmol Sharma, Arie, Arno Leist, Arron Cao, AuréLien Geron, Bairen Yi,
+Beomsu Kim, Carl Thomé, cfperez, Changming Sun, Corey Wharton, critiqjo, Dalei Li, Daniel
+Rasmussen, Daniel Trebbien, DaríO Hereñú, David Eng, David Norman, David Y. Zhang, Davy Song, ddurham2,
+Deepak Subburam, Dmytro Kyrychuk, Dominic Rossi, Dominik SchlöSser, Dustin Tran,
+Eduardo Pinho, Egil Martinsson, Elliot Saba, Eric Bigelow, Erik Smistad, Evan Klitzke,
+Fabrizio Milo, Falcon Dai, Fei Gao, FloopCZ, Fung Lam, Gautam, GBLin5566, Greg Peatfield,
+Gu Wang, Guenther Schmuelling, Hans Pabst, Harun Gunaydin, Huaizheng, Ido Shamay, Ikaro
+Silva, Ilya Edrenkin, Immexxx, James Mishra, Jamie Cooke, Jay Young, Jayaram Bobba,
+Jianfei Wang, jinghua2, Joey Meyer, John Maidens, Jonghoon Jin, Julian Villella,
+Jun Kim, Jun Shi, Junwei Pan, jyegerlehner, Karan Desai, Karel Van De Plassche,
+Kb Sriram, KhabarlakKonstantin, Koan-Sin Tan, krivard, Kwotsin, Leandro Gracia Gil,
+Li Chen, Liangliang He, Louie Helm, lspvic, Luiz Henrique Soares, LáSzló Csomor,
+Mark Wong, Mathew Wicks, Matthew Rahtz, Maxwell Paul Brickner, Michael Hofmann, Miguel
+Flores Ruiz De Eguino, MikeTam1021, Mortada Mehyar, Mycosynth, Namnamseo,
+Nate Harada, Neven Miculinic, Nghia Tran, Nick Lyu, Niranjan Hasabnis, Nishidha, Oleksii
+Kuchaiev, Oyesh Mann Singh, Panmari, Patrick, Paul Van Eck, Piyush Chaudhary, Quim Llimona,
+Raingo, Richard Davies, Ruben Vereecken, Sahit Chintalapudi, Sam Abrahams, Santiago Castro,
+Scott Sievert, Sean O'Keefe, Sebastian Schlecht, Shane, Shubhankar Deshpande, Spencer Schaber,
+Sunyeop Lee, t13m, td2014, Thomas H. P. Andersen, Toby Petty, Umang Mehta,
+Vadim Markovtsev, Valentin Iovene, Vincent Zhao, Vit Stepanovs, Vivek Rane, Vu Pham, wannabesrevenge,
+weipingpku, wuhaixutab, wydwww, Xiang Gao, Xiaolin Lin, xiaoyaozhuzi, Yaroslav Bulatov, Yi Liu,
+Yoshihiro Sugi, Yuan (Terry) Tang, Yuming Wang, Yuxin Wu, Zader Zheng, Zhaojun Zhang, zhengjiajin,
+ZhipengShen, Ziming Dong, zjj2wry
+
+We are also grateful to all who filed issues or helped resolve them, asked and
+answered questions, and were part of inspiring discussions.
# Release 1.1.0
diff --git a/configure b/configure
index 71c14345f5..e1aaddabda 100755
--- a/configure
+++ b/configure
@@ -13,28 +13,16 @@ popd > /dev/null
PLATFORM="$(uname -s | tr 'A-Z' 'a-z')"
function is_linux() {
- if [[ "${PLATFORM}" == "linux" ]]; then
- true
- else
- false
- fi
+ [[ "${PLATFORM}" == "linux" ]]
}
function is_macos() {
- if [[ "${PLATFORM}" == "darwin" ]]; then
- true
- else
- false
- fi
+ [[ "${PLATFORM}" == "darwin" ]]
}
function is_windows() {
# On windows, the shell script is actually running in msys
- if [[ "${PLATFORM}" =~ msys_nt*|mingw*|cygwin*|uwin* ]]; then
- true
- else
- false
- fi
+ [[ "${PLATFORM}" =~ msys_nt*|mingw*|cygwin*|uwin* ]]
}
function sed_in_place() {
@@ -105,9 +93,7 @@ function setup_python {
if [ -z "$PYTHON_LIB_PATH" ]; then
# Split python_path into an array of paths, this allows path containing spaces
- IFS=','
- python_lib_path=($(python_path))
- unset IFS
+ IFS=',' read -r -a python_lib_path <<< "$(python_path)"
if [ 1 = "$USE_DEFAULT_PYTHON_LIB_PATH" ]; then
PYTHON_LIB_PATH=${python_lib_path[0]}
@@ -119,7 +105,7 @@ function setup_python {
echo " $x"
done
set -- "${python_lib_path[@]}"
- echo "Please input the desired Python library path to use. Default is ["$1"]"
+ echo "Please input the desired Python library path to use. Default is [$1]"
read b || true
if [ "$b" == "" ]; then
PYTHON_LIB_PATH=${python_lib_path[0]}
@@ -135,8 +121,9 @@ function setup_python {
exit 1
fi
- local python_major_version=$("${PYTHON_BIN_PATH}" -c 'from __future__ import print_function; import sys; print(sys.version_info[0]);')
- if [ "$python_major_version" == "" ]; then
+ local python_major_version
+ python_major_version=$("${PYTHON_BIN_PATH}" -c 'from __future__ import print_function; import sys; print(sys.version_info[0]);' | head -c1)
+ if [ -z "$python_major_version" ]; then
echo -e "\n\nERROR: Problem getting python version. Is $PYTHON_BIN_PATH the correct python binary?"
exit 1
fi
@@ -144,6 +131,7 @@ function setup_python {
# Convert python path to Windows style before writing into bazel.rc
if is_windows; then
PYTHON_BIN_PATH="$(cygpath -m "$PYTHON_BIN_PATH")"
+ PYTHON_LIB_PATH="$(cygpath -m "$PYTHON_LIB_PATH")"
fi
# Set-up env variables used by python_configure.bzl
@@ -184,7 +172,13 @@ fi
# This file contains customized config settings.
rm -f .tf_configure.bazelrc
touch .tf_configure.bazelrc
-touch .bazelrc
+if [[ ! -e .bazelrc ]]; then
+ if [[ -e "${HOME}/.bazelrc" ]]; then
+ echo "import ${HOME}/.bazelrc" >.bazelrc
+ else
+ touch .bazelrc
+ fi
+fi
sed_in_place "/tf_configure/d" .bazelrc
echo "import %workspace%/.tf_configure.bazelrc" >> .bazelrc
@@ -241,12 +235,14 @@ if [ "$TF_NEED_MKL" == "1" ]; then # TF_NEED_MKL
else
default_mkl_path=/opt/intel/mklml
fromuser=""
- read -p "Please specify the location where MKL is installed. [Default is $default_mkl_path]: " MKL_INSTALL_PATH
- fromuser="1"
+ if [ -z "$MKL_INSTALL_PATH" ]; then
+ read -p "Please specify the location where MKL is installed. [Default is $default_mkl_path]: " MKL_INSTALL_PATH
+ fromuser="1"
+ fi
if [ -z "$MKL_INSTALL_PATH" ]; then
MKL_INSTALL_PATH=$default_mkl_path
fi
- # Result returned from "read" will be used unexpanded. That make "~" unuseable.
+ # Result returned from "read" will be used unexpanded. That make "~" unusable.
# Going through one more level of expansion to handle that.
MKL_INSTALL_PATH=`${PYTHON_BIN_PATH} -c "import os; print(os.path.realpath(os.path.expanduser('${MKL_INSTALL_PATH}')))"`
fi
@@ -481,7 +477,7 @@ done
while true; do
# Configure the Cuda SDK version to use.
if [ -z "$TF_CUDA_VERSION" ]; then
- read -p "Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: " TF_CUDA_VERSION
+ read -p "Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 8.0]: " TF_CUDA_VERSION
fi
fromuser=""
@@ -524,7 +520,6 @@ while true; do
export CUDA_TOOLKIT_PATH
write_action_env_to_bazelrc "CUDA_TOOLKIT_PATH" "$CUDA_TOOLKIT_PATH"
export TF_CUDA_VERSION
- write_action_env_to_bazelrc "TF_CUDA_VERSION" "$TF_CUDA_VERSION"
break
fi
echo "Invalid path to CUDA $TF_CUDA_VERSION toolkit. ${CUDA_TOOLKIT_PATH}/${CUDA_RT_LIB_PATH} cannot be found"
@@ -537,6 +532,13 @@ while true; do
CUDA_TOOLKIT_PATH=""
done
+# Set default CUDA version if not set
+if [ -z "$TF_CUDA_VERSION" ]; then
+ TF_CUDA_VERSION="8.0"
+ export TF_CUDA_VERSION
+fi
+write_action_env_to_bazelrc "TF_CUDA_VERSION" "$TF_CUDA_VERSION"
+
# Set up which gcc nvcc should use as the host compiler
# No need to set this on Windows
while [[ "$TF_CUDA_CLANG" != "1" ]] && ! is_windows && true; do
@@ -570,7 +572,7 @@ done
while true; do
# Configure the cuDNN version to use.
if [ -z "$TF_CUDNN_VERSION" ]; then
- read -p "Please specify the cuDNN version you want to use. [Leave empty to use system default]: " TF_CUDNN_VERSION
+ read -p "Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 6.0]: " TF_CUDNN_VERSION
fi
fromuser=""
@@ -581,7 +583,7 @@ while true; do
if [ -z "$CUDNN_INSTALL_PATH" ]; then
CUDNN_INSTALL_PATH=$default_cudnn_path
fi
- # Result returned from "read" will be used unexpanded. That make "~" unuseable.
+ # Result returned from "read" will be used unexpanded. That make "~" unusable.
# Going through one more level of expansion to handle that.
CUDNN_INSTALL_PATH=`"${PYTHON_BIN_PATH}" -c "import os; print(os.path.realpath(os.path.expanduser('${CUDNN_INSTALL_PATH}')))"`
fi
@@ -603,7 +605,7 @@ while true; do
CUDA_DNN_LIB_ALT_PATH="libcudnn${TF_CUDNN_EXT}.dylib"
fi
- if [ -e "$CUDNN_INSTALL_PATH/${CUDA_DNN_LIB_ALT_PATH}" -o -e "$CUDNN_INSTALL_PATH/${CUDA_DNN_LIB_PATH}" ]; then
+ if [ -e "$CUDNN_INSTALL_PATH/${CUDA_DNN_LIB_ALT_PATH}" ] || [ -e "$CUDNN_INSTALL_PATH/${CUDA_DNN_LIB_PATH}" ]; then
export TF_CUDNN_VERSION
write_action_env_to_bazelrc "TF_CUDNN_VERSION" "$TF_CUDNN_VERSION"
export CUDNN_INSTALL_PATH
@@ -620,8 +622,8 @@ while true; do
CUDNN_PATH_FROM_LDCONFIG="$($LDCONFIG_BIN -p | sed -n 's/.*libcudnn.so .* => \(.*\)/\1/p')"
if [ -e "${CUDNN_PATH_FROM_LDCONFIG}${TF_CUDNN_EXT}" ]; then
export TF_CUDNN_VERSION
- write_action_env_to_bazelrc "TF_CUDNN_VERSION" "$TF_CUDNN_VERSION"
- export CUDNN_INSTALL_PATH="$(dirname ${CUDNN_PATH_FROM_LDCONFIG})"
+ export CUDNN_INSTALL_PATH
+ CUDNN_INSTALL_PATH="$(dirname ${CUDNN_PATH_FROM_LDCONFIG})"
write_action_env_to_bazelrc "CUDNN_INSTALL_PATH" "$CUDNN_INSTALL_PATH"
break
fi
@@ -641,6 +643,13 @@ while true; do
CUDNN_INSTALL_PATH=""
done
+# Set default CUDNN version if not set
+if [ -z "$TF_CUDNN_VERSION" ]; then
+ TF_CUDNN_VERSION="6"
+ export TF_CUDNN_VERSION
+fi
+write_action_env_to_bazelrc "TF_CUDNN_VERSION" "$TF_CUDNN_VERSION"
+
# Configure the compute capabilities that TensorFlow builds for.
# Since Cuda toolkit is not backward-compatible, this is not guaranteed to work.
while true; do
@@ -707,7 +716,7 @@ if [ "$TF_NEED_OPENCL" == "1" ]; then
while true; do
fromuser=""
if [ -z "$HOST_CXX_COMPILER" ]; then
- default_cxx_host_compiler=$(which clang++-3.6 || true)
+ default_cxx_host_compiler=$(which g++ || true)
read -p "Please specify which C++ compiler should be used as the host C++ compiler. [Default is $default_cxx_host_compiler]: " HOST_CXX_COMPILER
fromuser="1"
if [ -z "$HOST_CXX_COMPILER" ]; then
@@ -731,7 +740,7 @@ done
while true; do
fromuser=""
if [ -z "$HOST_C_COMPILER" ]; then
- default_c_host_compiler=$(which clang-3.6 || true)
+ default_c_host_compiler=$(which gcc || true)
read -p "Please specify which C compiler should be used as the host C compiler. [Default is $default_c_host_compiler]: " HOST_C_COMPILER
fromuser="1"
if [ -z "$HOST_C_COMPILER" ]; then
@@ -787,6 +796,82 @@ done
# end of if "$TF_NEED_OPENCL" == "1"
fi
-# TODO(gunan): Remove once bazel correctly handles changes in remote repositories.
-bazel clean
+
+while [ "$TF_NEED_MPI" == "" ]; do
+ read -p "Do you wish to build TensorFlow with "\
+"MPI support? [y/N] " INPUT
+ case $INPUT in
+ [Yy]* ) echo "MPI support will be enabled for "\
+"TensorFlow"; TF_NEED_MPI=1;;
+ [Nn]* ) echo "MPI support will not be enabled for "\
+"TensorFlow"; TF_NEED_MPI=0;;
+ "" ) echo "MPI support will not be enabled for "\
+"TensorFlow"; TF_NEED_MPI=0;;
+ * ) echo "Invalid selection: " $INPUT;;
+ esac
+done
+
+# Find out where the MPI toolkit is installed
+while true; do
+ if [ "$TF_NEED_MPI" == "0" ]; then
+ break;
+ fi
+
+ fromuser=""
+ if [ -z "$MPI_HOME" ]; then
+ #Get the base folder by removing the bin path
+ default_mpi_path=$(dirname $(dirname $(which mpirun)) || dirname $(dirname $(which mpiexec)) || true)
+ read -p "Please specify the MPI toolkit folder. [Default is $default_mpi_path]: " MPI_HOME
+ fromuser="1"
+ if [ -z "$MPI_HOME" ]; then
+ MPI_HOME=$default_mpi_path
+ fi
+ fi
+
+ #Check that the include and library folders are where we expect them to be
+ if [ -e "$MPI_HOME/include" ] && [ -e "$MPI_HOME/lib" ]; then
+ break
+ fi
+
+ echo "Invalid path to the MPI Toolkit. ${MPI_HOME}/include or ${MPI_HOME}/lib cannot be found."
+ if [ -z "$fromuser" ]; then
+ exit 1
+ fi
+
+ # Retry
+ MPI_HOME=""
+done
+
+
+if [ "$TF_NEED_MPI" == "1" ]; then
+ write_to_bazelrc 'build --define with_mpi_support=true'
+
+ #Link the MPI header files
+ ln -sf "${MPI_HOME}/include/mpi.h" third_party/mpi/mpi.h
+
+
+ #Determine if we use OpenMPI or MVAPICH, these require different header files
+ #to be included here to make bazel dependency checker happy
+
+ if [ -e "${MPI_HOME}/include/mpi_portable_platform.h" ]; then
+ #OpenMPI
+ ln -sf "${MPI_HOME}/include/mpi_portable_platform.h" third_party/mpi/
+ sed -i -e "s/MPI_LIB_IS_OPENMPI=False/MPI_LIB_IS_OPENMPI=True/" third_party/mpi/mpi.bzl
+ else
+ #MVAPICH / MPICH
+ ln -sf "${MPI_HOME}/include/mpio.h" third_party/mpi/
+ ln -sf "${MPI_HOME}/include/mpicxx.h" third_party/mpi/
+ sed -i -e "s/MPI_LIB_IS_OPENMPI=True/MPI_LIB_IS_OPENMPI=False/" third_party/mpi/mpi.bzl
+ fi
+
+
+ if [ -e "${MPI_HOME}/lib/libmpi.so" ]; then
+ ln -sf "${MPI_HOME}/lib/libmpi.so" third_party/mpi/
+ else
+ echo "Cannot find the MPI library file in ${MPI_HOME}/lib "
+ exit 1
+ fi
+fi
+
+
echo "Configuration finished"
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index 54f2775a0b..3f9a911757 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -71,6 +71,12 @@ config_setting(
config_setting(
name = "windows",
+ values = {"cpu": "x64_windows"},
+ visibility = ["//visibility:public"],
+)
+
+config_setting(
+ name = "windows_msvc",
values = {"cpu": "x64_windows_msvc"},
visibility = ["//visibility:public"],
)
@@ -164,6 +170,12 @@ config_setting(
visibility = ["//visibility:public"],
)
+config_setting(
+ name = "with_mpi_support",
+ values = {"define": "with_mpi_support=true"},
+ visibility = ["//visibility:public"],
+)
+
package_group(
name = "internal",
packages = ["//tensorflow/..."],
@@ -447,6 +459,7 @@ cc_binary(
"//tensorflow/c:exported_symbols.lds",
],
"//tensorflow:windows": [],
+ "//tensorflow:windows_msvc": [],
"//conditions:default": [
"-z defs",
"-s",
diff --git a/tensorflow/c/c_api_test.cc b/tensorflow/c/c_api_test.cc
index f9e852d337..04540bd793 100644
--- a/tensorflow/c/c_api_test.cc
+++ b/tensorflow/c/c_api_test.cc
@@ -1345,7 +1345,7 @@ class CApiWhileLoopTest : public ::testing::Test {
EXPECT_EQ(expected_value, *data);
}
- // Create a valid conditonal graph. Useful for testing unrelated errors.
+ // Create a valid conditional graph. Useful for testing unrelated errors.
void CreateCondGraph() {
TF_Operation* one = ScalarConst(1, params_->cond_graph, s_);
TF_Operation* less_than =
diff --git a/tensorflow/c/generate-pc.sh b/tensorflow/c/generate-pc.sh
index 40b3a60be9..73d427d9b2 100755
--- a/tensorflow/c/generate-pc.sh
+++ b/tensorflow/c/generate-pc.sh
@@ -23,6 +23,8 @@ usage() {
echo -e "-h, --help\tdisplay this message"
}
+[ $# == 0 ] && usage && exit 0
+
# read the options
ARGS=`getopt -o p:v:h --long prefix:,version:,help -n $0 -- "$@"`
eval set -- "$ARGS"
@@ -41,11 +43,13 @@ while true ; do
"") shift 2 ;;
*) TF_VERSION=$2 ; shift 2 ;;
esac ;;
- --) shift ; echo "Try '$0 --help' for more information."; exit 1 ;;
+ --) shift ; break ;;
*) echo "Internal error! Try '$0 --help' for more information." ; exit 1 ;;
esac
done
+[ -z $TF_VERSION ] && echo "Specify a version using -v or --version" && exit 1
+
echo "Generating pkgconfig file for TensorFlow $TF_VERSION in $TF_PREFIX"
cat << EOF > tensorflow.pc
diff --git a/tensorflow/compiler/aot/tfcompile.bzl b/tensorflow/compiler/aot/tfcompile.bzl
index 0bd3694dda..4be4e0fbb3 100644
--- a/tensorflow/compiler/aot/tfcompile.bzl
+++ b/tensorflow/compiler/aot/tfcompile.bzl
@@ -284,5 +284,6 @@ def target_llvm_triple():
"//tensorflow:android_arm64": "aarch64-none-android",
"//tensorflow:android_x86": "i686-none-android",
"//tensorflow:linux_ppc64le": "ppc64le-ibm-linux-gnu",
+ "//tensorflow:darwin": "x86_64-none-darwin",
"//conditions:default": "x86_64-pc-linux",
})
diff --git a/tensorflow/compiler/jit/BUILD b/tensorflow/compiler/jit/BUILD
index 53ee3c8e3a..da1c50b0a7 100644
--- a/tensorflow/compiler/jit/BUILD
+++ b/tensorflow/compiler/jit/BUILD
@@ -18,9 +18,26 @@ package(
default_visibility = [":internal"],
)
+load("//tensorflow:tensorflow.bzl", "cc_header_only_library")
load("//tensorflow:tensorflow.bzl", "tf_kernel_library")
load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda")
+# TODO(jhseu): Fix this target.
+#
+# This target can be used by XLA device plugins to prevent circular
+# dependencies, and provides access to all of the required headers
+# for building a device library.
+#cc_header_only_library(
+# name = "xla_jit_headers_lib",
+# visibility = ["//visibility:public"],
+# deps = [
+# ":xla_cpu_device",
+# ":xla_cpu_jit",
+# ":xla_gpu_device",
+# ":xla_gpu_jit",
+# ],
+#)
+
# Target that bundles up the XLA CPU and GPU JIT devices.
cc_library(
name = "jit",
@@ -30,6 +47,7 @@ cc_library(
":xla_cpu_jit",
":xla_gpu_device",
":xla_gpu_jit",
+ "//tensorflow/compiler/plugin",
],
alwayslink = 1,
)
diff --git a/tensorflow/compiler/plugin/BUILD b/tensorflow/compiler/plugin/BUILD
new file mode 100644
index 0000000000..4badd3a589
--- /dev/null
+++ b/tensorflow/compiler/plugin/BUILD
@@ -0,0 +1,36 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""Configuration file for an XLA plugin.
+- please don't check in changes to this file
+- to prevent changes appearing in git status, use:
+ git update-index --assume-unchanged tensorflow/compiler/plugin/BUILD
+
+To add additional devices to the XLA subsystem, add targets to the
+dependency list in the 'plugin' target. For instance:
+
+ deps = ["//tensorflow/compiler/plugin/example:plugin_lib"],
+"""
+
+licenses(["notice"])
+
+package(
+ default_visibility = ["//visibility:public"],
+)
+
+cc_library(
+ name = "plugin",
+ deps = [],
+)
diff --git a/tensorflow/compiler/tests/build_defs.bzl b/tensorflow/compiler/tests/build_defs.bzl
index 820db13d0b..0bde616521 100644
--- a/tensorflow/compiler/tests/build_defs.bzl
+++ b/tensorflow/compiler/tests/build_defs.bzl
@@ -1,12 +1,14 @@
"""Build rules for Tensorflow/XLA testing."""
load("@local_config_cuda//cuda:build_defs.bzl", "cuda_is_configured")
+load("//tensorflow/compiler/tests:plugin.bzl", "plugins")
def all_backends():
+ b = ["cpu"] + plugins.keys()
if cuda_is_configured():
- return ["cpu", "gpu"]
+ return b + ["gpu"]
else:
- return ["cpu"]
+ return b
def tf_xla_py_test(name, srcs=[], deps=[], tags=[], data=[], main=None,
disabled_backends=None, **kwargs):
@@ -53,6 +55,10 @@ def tf_xla_py_test(name, srcs=[], deps=[], tags=[], data=[], main=None,
backend_args += ["--test_device=XLA_GPU",
"--types=DT_FLOAT,DT_DOUBLE,DT_INT32,DT_INT64,DT_BOOL"]
backend_tags += ["requires-gpu-sm35"]
+ elif backend in plugins:
+ backend_args += ["--test_device=" + plugins[backend]["device"],
+ "--types=" + plugins[backend]["types"]]
+ backend_tags += plugins[backend]["tags"]
else:
fail("Unknown backend {}".format(backend))
diff --git a/tensorflow/compiler/tests/plugin.bzl b/tensorflow/compiler/tests/plugin.bzl
new file mode 100644
index 0000000000..b6eb7a9e39
--- /dev/null
+++ b/tensorflow/compiler/tests/plugin.bzl
@@ -0,0 +1,23 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Additional XLA devices to be included in the unit test suite."""
+
+# If you wish to edit this file without checking it into the repo, consider:
+# git update-index --assume-unchanged tensorflow/compiler/tests/plugin.bzl
+
+plugins = {
+ #"poplar": {"device":"XLA_IPU", "types":"DT_FLOAT,DT_INT32", "tags":[]},
+}
+
diff --git a/tensorflow/compiler/tests/slice_ops_test.py b/tensorflow/compiler/tests/slice_ops_test.py
index de91b7b425..4ddf2ee0dc 100644
--- a/tensorflow/compiler/tests/slice_ops_test.py
+++ b/tensorflow/compiler/tests/slice_ops_test.py
@@ -26,6 +26,7 @@ from tensorflow.python.ops import array_ops
from tensorflow.python.platform import googletest
+
class SliceTest(XLATestCase):
def test1D(self):
@@ -48,11 +49,14 @@ class SliceTest(XLATestCase):
with self.test_scope():
o = array_ops.slice(i, [1, 2, 2], [1, 1, 4])
params = {
- i: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
+ i: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
+ [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
[5, 3, 1, 7, 9, 2, 4, 6, 8, 0]],
- [[5, 5, 5, 5, 5, 5, 5, 5, 5, 5], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
+ [[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
+ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[8, 7, 6, 5, 4, 3, 2, 1, 8, 7]],
- [[7, 5, 7, 5, 7, 5, 7, 5, 7, 5], [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
+ [[7, 5, 7, 5, 7, 5, 7, 5, 7, 5],
+ [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
[9, 8, 7, 9, 8, 7, 9, 8, 7, 9]]]
}
result = o.eval(feed_dict=params)
@@ -60,6 +64,7 @@ class SliceTest(XLATestCase):
self.assertAllEqual([[[6, 5, 4, 3]]], result)
+
class StridedSliceTest(XLATestCase):
def test1D(self):
@@ -95,11 +100,14 @@ class StridedSliceTest(XLATestCase):
with self.test_scope():
o = array_ops.strided_slice(i, [0, 2, 2], [2, 3, 6], [1, 1, 2])
params = {
- i: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
+ i: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
+ [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
[5, 3, 1, 7, 9, 2, 4, 6, 8, 0]],
- [[5, 5, 5, 5, 5, 5, 5, 5, 5, 5], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
+ [[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
+ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[8, 7, 6, 5, 4, 3, 2, 1, 8, 7]],
- [[7, 5, 7, 5, 7, 5, 7, 5, 7, 5], [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
+ [[7, 5, 7, 5, 7, 5, 7, 5, 7, 5],
+ [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
[9, 8, 7, 9, 8, 7, 9, 8, 7, 9]]]
}
result = o.eval(feed_dict=params)
@@ -113,20 +121,25 @@ class StridedSliceTest(XLATestCase):
with self.test_scope():
o = array_ops.strided_slice(i, [2, 2, 6], [0, 0, 2], [-1, -1, -2])
params = {
- i: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
- [5, 3, 1, 7, 9, 2, 4, 6, 8, 0], [4, 5, 2, 4, 3, 7, 6, 8, 9,
- 4]],
- [[5, 5, 5, 5, 5, 5, 5, 5, 5, 5], [4, 3, 4, 5, 7, 6, 5, 3, 4, 5],
- [8, 7, 6, 5, 4, 3, 2, 1, 8, 7], [7, 1, 7, 1, 8, 1, 8, 1, 3,
- 1]],
- [[7, 5, 7, 5, 7, 5, 7, 5, 7, 5], [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
- [9, 8, 7, 9, 8, 7, 9, 8, 7, 9], [9, 9, 5, 5, 6, 6, 3, 3, 6,
- 6]]]
+ i: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
+ [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
+ [5, 3, 1, 7, 9, 2, 4, 6, 8, 0],
+ [4, 5, 2, 4, 3, 7, 6, 8, 9, 4]],
+ [[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
+ [4, 3, 4, 5, 7, 6, 5, 3, 4, 5],
+ [8, 7, 6, 5, 4, 3, 2, 1, 8, 7],
+ [7, 1, 7, 1, 8, 1, 8, 1, 3, 1]],
+ [[7, 5, 7, 5, 7, 5, 7, 5, 7, 5],
+ [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
+ [9, 8, 7, 9, 8, 7, 9, 8, 7, 9],
+ [9, 9, 5, 5, 6, 6, 3, 3, 6, 6]]]
}
result = o.eval(feed_dict=params)
- self.assertAllEqual([[[9, 8], [1, 1]], [[2, 4], [5, 7]]], result)
-
+ self.assertAllEqual([[[9, 8],
+ [1, 1]],
+ [[2, 4],
+ [5, 7]]], result)
if __name__ == "__main__":
googletest.main()
diff --git a/tensorflow/compiler/xla/BUILD b/tensorflow/compiler/xla/BUILD
index bf114fdeff..7ac08cf3b5 100644
--- a/tensorflow/compiler/xla/BUILD
+++ b/tensorflow/compiler/xla/BUILD
@@ -18,6 +18,7 @@ package_group(
],
)
+load("//tensorflow:tensorflow.bzl", "cc_header_only_library")
load("//tensorflow/compiler/xla:xla.bzl", "xla_proto_library")
# Filegroup used to collect source files for dependency checking.
@@ -45,6 +46,24 @@ xla_proto_library(
],
)
+# TODO(jhseu): Restore
+# This is a headers target that extra XLA devices can use to prevent
+# circular dependencies. Devices that are compiled as separate shared
+# objects can also use it to prevent linking of library code.
+#cc_header_only_library(
+# name = "xla_headers_lib",
+# visibility = ["//visibility:public"],
+# deps = [
+# "//tensorflow/compiler/xla:xla_data_proto",
+# "//tensorflow/compiler/xla:xla_proto",
+# "//tensorflow/compiler/xla/client:client_library",
+# "//tensorflow/compiler/xla/legacy_flags:layout_util_flags",
+# "//tensorflow/compiler/xla/service:hlo",
+# "//tensorflow/core:framework_headers_lib",
+# "//tensorflow/core:stream_executor_headers_lib",
+# ],
+#)
+
cc_library(
name = "test",
testonly = 1,
diff --git a/tensorflow/compiler/xla/array4d.h b/tensorflow/compiler/xla/array4d.h
index c27d70b8a6..d93f968f4d 100644
--- a/tensorflow/compiler/xla/array4d.h
+++ b/tensorflow/compiler/xla/array4d.h
@@ -65,7 +65,7 @@ class Array4D {
Fill(T());
}
- // Creates a 4D array, initalized to value.
+ // Creates a 4D array, initialized to value.
Array4D(int64 planes, int64 depth, int64 height, int64 width, T value)
: Array4D(planes, depth, height, width) {
Fill(value);
diff --git a/tensorflow/compiler/xla/client/local_client.h b/tensorflow/compiler/xla/client/local_client.h
index 49ffed4dde..c903cd2711 100644
--- a/tensorflow/compiler/xla/client/local_client.h
+++ b/tensorflow/compiler/xla/client/local_client.h
@@ -56,7 +56,7 @@ class ExecutableBuildOptions {
// If set, this specifies the layout of the result of the computation. If not
// set, the service will chose the layout of the result. A Shape is used to
- // store the layout to accomodate tuple result shapes. A value of nullptr
+ // store the layout to accommodate tuple result shapes. A value of nullptr
// indicates the option has not been set.
ExecutableBuildOptions& set_result_layout(const Shape& shape_with_layout);
const Shape* result_layout() const;
diff --git a/tensorflow/compiler/xla/literal_util.h b/tensorflow/compiler/xla/literal_util.h
index 9a426ad195..64e58e32fb 100644
--- a/tensorflow/compiler/xla/literal_util.h
+++ b/tensorflow/compiler/xla/literal_util.h
@@ -763,7 +763,7 @@ class LiteralUtil {
// Creates a new value that has the equivalent value as literal, but conforms
// to new_layout; e.g. a literal matrix that was in {0, 1} minor-to-major
- // dimension layout can be re-layed-out as {1, 0} minor-to-major dimension
+ // dimension layout can be re-laid-out as {1, 0} minor-to-major dimension
// layout and the value in the cell at any given logical index (i0, i1) will
// be the same.
//
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier.cc b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
index 9605cf06a1..7f6737de4d 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
@@ -439,7 +439,7 @@ Status AlgebraicSimplifierVisitor::HandleDot(HloInstruction* dot,
dot, HloInstruction::CreateBroadcast(dot->shape(), zero, {}));
}
- // Simplify dot(transpose(a), transpose(b)) to tranpose(dot(b,a)).
+ // Simplify dot(transpose(a), transpose(b)) to transpose(dot(b,a)).
if (lhs->IsRank2Transpose() && rhs->IsRank2Transpose()) {
auto new_dot = computation_->AddInstruction(HloInstruction::CreateBinary(
ShapeUtil::PermuteDimensions({1, 0}, dot->shape()), HloOpcode::kDot,
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier.h b/tensorflow/compiler/xla/service/algebraic_simplifier.h
index 5d59a27c71..f8919f0caa 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier.h
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier.h
@@ -35,7 +35,7 @@ class AlgebraicSimplifier : public HloPassInterface {
// If is_layout_sensitive is true, then the simplifier preserves layout during
// transformation. Otherwise, layout is ignored. If valid_bitcast_callback
- // returns true, then the pass will replace reshapes and tranposes with
+ // returns true, then the pass will replace reshapes and transposes with
// bitcasts.
AlgebraicSimplifier(bool is_layout_sensitive,
ValidBitcastCallback valid_bitcast_callback,
diff --git a/tensorflow/compiler/xla/service/buffer_liveness_test.cc b/tensorflow/compiler/xla/service/buffer_liveness_test.cc
index 427e4e492c..c5c24e2d48 100644
--- a/tensorflow/compiler/xla/service/buffer_liveness_test.cc
+++ b/tensorflow/compiler/xla/service/buffer_liveness_test.cc
@@ -628,7 +628,7 @@ class FusedDynamicUpdateSliceLivenessTest : public BufferLivenessTest {
BufferLiveness::Run(module.get(),
MakeUnique<DependencyHloOrdering>(module.get()))
.ConsumeValueOrDie();
- // Return whether or not buffers interfernce is detected between
+ // Return whether or not buffers interference is detected between
// 'tuple_param0' and 'tuple_root' at shape index '{1}'.
return TupleElementsMayInterfere(*liveness, tuple_param0, tuple_root, {1});
}
@@ -740,7 +740,7 @@ class DynamicUpdateSliceLivenessTest : public BufferLivenessTest {
BufferLiveness::Run(module.get(),
MakeUnique<DependencyHloOrdering>(module.get()))
.ConsumeValueOrDie();
- // Return whether or not buffers interfernce is detected between
+ // Return whether or not buffers interference is detected between
// 'tuple_param0' and 'tuple_root' at shape index '{1}'.
return TupleElementsMayInterfere(*liveness, tuple_param0, tuple_root, {1});
}
diff --git a/tensorflow/compiler/xla/service/copy_insertion.cc b/tensorflow/compiler/xla/service/copy_insertion.cc
index 3c454a3dc4..dc45dd946b 100644
--- a/tensorflow/compiler/xla/service/copy_insertion.cc
+++ b/tensorflow/compiler/xla/service/copy_insertion.cc
@@ -141,7 +141,7 @@ class InstructionCopier {
Status RecordAmbiguousOrNonDistinctIndices(
const TuplePointsToAnalysis& points_to_analysis);
- // Records instruction buffer indices which have interferring live ranges
+ // Records instruction buffer indices which have interfering live ranges
// with 'other_instruction' buffers at same index.
Status RecordIndicesWhichInterfereWithOtherInstruction(
const BufferLiveness& liveness, const HloInstruction* other_instruction,
@@ -429,7 +429,7 @@ HloInstruction* InstructionCopier::Copy() {
return copy;
}
-// The 'read_only_indices' are initalized based on points-to analysis on the
+// The 'read_only_indices' are initialized based on points-to analysis on the
// while body corresponding to 'while_hlo'. If the init buffer corresponding to
// a read-only index aliases with an entry parameter (or constant), it cannot be
// considered read-only, and must be copied. This is necessary because some
diff --git a/tensorflow/compiler/xla/service/gpu/convolution_thunk.cc b/tensorflow/compiler/xla/service/gpu/convolution_thunk.cc
index b2197c6a1f..9a0b14eb73 100644
--- a/tensorflow/compiler/xla/service/gpu/convolution_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/convolution_thunk.cc
@@ -125,7 +125,7 @@ tensorflow::Status ConvolutionThunk::ExecuteOnStream(
CHECK_LE(num_dimensions, 3);
// cuDNN does not support 1D convolutions. We therefore express 1D
// convolutions as 2D convolutions where the first spatial dimension is 1.
- // This matches the behaviour of TF (see definition of conv1d in
+ // This matches the behavior of TF (see definition of conv1d in
// tensorflow/python/ops/nn_ops.py).
const int effective_num_dimensions = std::max(2, num_dimensions);
diff --git a/tensorflow/compiler/xla/service/gpu/fusion_merger.h b/tensorflow/compiler/xla/service/gpu/fusion_merger.h
index 9a989d26f9..bd720f8584 100644
--- a/tensorflow/compiler/xla/service/gpu/fusion_merger.h
+++ b/tensorflow/compiler/xla/service/gpu/fusion_merger.h
@@ -25,7 +25,7 @@ namespace gpu {
// An HLO pass that attempts to merge fusion instructions to reduce kernel
// launch overhead and improve data locality.
//
-// Fusion instructions are merged into their users if two conditons are met:
+// Fusion instructions are merged into their users if two conditions are met:
//
// 1) The flops_to_bytes ratio of the fusion instruction is below the threshold
// value of 1.0.
diff --git a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
index a80f969b9d..e784046450 100644
--- a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
@@ -245,7 +245,7 @@ tensorflow::Status GemmThunk::ExecuteOnStream(
// Therefore, we need to convert dot between row-major matrices to that
// between column-major matrices. The key insight for the conversion is that,
// in linear storage, matrix M in column-major order is identical to the
- // tranpose of M in row-major order. In other words,
+ // transpose of M in row-major order. In other words,
//
// column-major(M) = row-major(M^T).
//
diff --git a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc
index 01448ccab2..2c5900d697 100644
--- a/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc
+++ b/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc
@@ -407,9 +407,9 @@ StatusOr<string> CompileModuleToPtx(llvm::Module* module,
AddOptimizationPasses(flags->opt_level, /*size_level=*/0,
target_machine.get(), &module_passes, &function_passes);
- // Loop unrolling exposes more opportunites for SROA. Therefore, we run SROA
+ // Loop unrolling exposes more opportunities for SROA. Therefore, we run SROA
// again after the standard optimization passes [http://b/13329423].
- // TODO(jingyue): SROA may further expose more optimization opportunites, such
+ // TODO(jingyue): SROA may further expose more optimization opportunities, such
// as more precise alias analysis and more function inlining (SROA may change
// the inlining cost of a function). For now, running SROA already emits good
// enough code for the evaluated benchmarks. We may want to run more
diff --git a/tensorflow/compiler/xla/service/gpu/partition_assignment.h b/tensorflow/compiler/xla/service/gpu/partition_assignment.h
index 8ac4c59966..8f7fce884a 100644
--- a/tensorflow/compiler/xla/service/gpu/partition_assignment.h
+++ b/tensorflow/compiler/xla/service/gpu/partition_assignment.h
@@ -33,7 +33,7 @@ namespace gpu {
enum class PartitionStrategy {
// Optimized for latency by allowing maximum number of registers per thread.
kLatency,
- // Optimized for throughtput. This may limit registers per thread and cause
+ // Optimized for throughput. This may limit registers per thread and cause
// longer latency.
kThroughput
};
diff --git a/tensorflow/compiler/xla/service/gpu/while_transformer.cc b/tensorflow/compiler/xla/service/gpu/while_transformer.cc
index 0beaa2586b..06b01d311d 100644
--- a/tensorflow/compiler/xla/service/gpu/while_transformer.cc
+++ b/tensorflow/compiler/xla/service/gpu/while_transformer.cc
@@ -37,7 +37,7 @@ namespace {
// patterns to match.
//
// Each ExprTree node is comprised of an HloOpcode, and a set of operands (each
-// of type ExprTree). Operands can be added by specifing the index and HloOpcode
+// of type ExprTree). Operands can be added by specifying the index and HloOpcode
// of the operand.
//
// For example, the following computation:
diff --git a/tensorflow/compiler/xla/service/hlo_constant_folding.h b/tensorflow/compiler/xla/service/hlo_constant_folding.h
index f45eccf825..331480bd02 100644
--- a/tensorflow/compiler/xla/service/hlo_constant_folding.h
+++ b/tensorflow/compiler/xla/service/hlo_constant_folding.h
@@ -21,7 +21,7 @@ limitations under the License.
namespace xla {
-// A pass which performs constant folding in order to avoid unecessary
+// A pass which performs constant folding in order to avoid unnecessary
// computation on constants.
class HloConstantFolding : public HloPassInterface {
public:
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis.h b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
index 7d22548234..b2c40f75ca 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis.h
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
@@ -133,7 +133,7 @@ class HloCostAnalysis : public DfsHloVisitor {
int64 bytes_accessed() const { return bytes_accessed_; }
private:
- // An FMA counts as two floating point operations in these analyses.
+ // An FMA counts as two floating point operations in these analyses.
static constexpr int64 kFmaFlops = 2;
// Utility function to handle all element-wise operations.
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
index 3eb4250e3a..b74c7eb4e0 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis_test.cc
@@ -54,7 +54,7 @@ class HloCostAnalysisTest : public ::testing::Test {
HloCostAnalysisTest()
: client_(ClientLibrary::LocalClientOrDie()),
// Accessing service instance is required for the unit tests to enable
- // whitebox acccesses to the user computation built from the client,
+ // whitebox accesses to the user computation built from the client,
// as shown in the BuildHloGraph functions below.
service_(static_cast<Service*>(ClientLibrary::GetXlaService(
static_cast<LocalClient*>(client_)->platform()))),
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.h b/tensorflow/compiler/xla/service/hlo_evaluator.h
index e6798a35a0..91fd56f54c 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.h
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.h
@@ -138,7 +138,7 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
std::hash<int>>
typed_visitors_;
- // Tracks the HLO instruciton and its evaluated literal result.
+ // Tracks the HLO instruction and its evaluated literal result.
// TODO(b/35950897): have better memory management here to free instructions
// that are no longer a parent for any other subsequent instruction in
// post-orderring.
diff --git a/tensorflow/compiler/xla/service/hlo_ordering.cc b/tensorflow/compiler/xla/service/hlo_ordering.cc
index eab02866fa..72911ae9f9 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering.cc
+++ b/tensorflow/compiler/xla/service/hlo_ordering.cc
@@ -360,7 +360,7 @@ class ListScheduler {
return freed_bytes;
}
- // Construct the scheduling priority of the given instruciton.
+ // Construct the scheduling priority of the given instruction.
Priority GetPriority(const HloInstruction* instruction) {
return {BytesFreedIfScheduled(instruction), instruction->user_count()};
}
diff --git a/tensorflow/compiler/xla/service/llvm_ir/README.md b/tensorflow/compiler/xla/service/llvm_ir/README.md
index 9fe7152477..9e4cdd45dc 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/README.md
+++ b/tensorflow/compiler/xla/service/llvm_ir/README.md
@@ -1,2 +1,2 @@
-Common utilites and abstractions for handling and emitting LLVM IR for XLA
+Common utilities and abstractions for handling and emitting LLVM IR for XLA
backends.
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h
index d9a98ae5eb..7b09c1f831 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.h
@@ -129,7 +129,7 @@ llvm::AllocaInst* EmitAllocaAtFunctionEntryWithCount(
llvm::Type* type, llvm::Value* element_count, tensorflow::StringPiece name,
llvm::IRBuilder<>* ir_builder, int alignment = 0);
-// Creates a basic block with the same context and funtion as for the
+// Creates a basic block with the same context and function as for the
// builder. Inserts at the end of the function if insert_before is
// null.
llvm::BasicBlock* CreateBasicBlock(llvm::BasicBlock* insert_before,
diff --git a/tensorflow/compiler/xla/tests/custom_call_test.cc b/tensorflow/compiler/xla/tests/custom_call_test.cc
index 4b5c4ecdf7..32232acf6e 100644
--- a/tensorflow/compiler/xla/tests/custom_call_test.cc
+++ b/tensorflow/compiler/xla/tests/custom_call_test.cc
@@ -33,6 +33,7 @@ limitations under the License.
#include "tensorflow/core/platform/macros.h"
#include "tensorflow/core/platform/test.h"
+
extern "C" void TF_EXPORT R0F32Add2(float* out, float** in) {
TF_ANNOTATE_MEMORY_IS_INITIALIZED(in, sizeof(float*));
*out = **in + 2.0f;
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.h b/tensorflow/compiler/xla/tests/hlo_test_base.h
index 906551b530..98bc35ae52 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.h
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.h
@@ -61,7 +61,7 @@ class HloTestBase : public ::testing::Test {
perftools::gputools::DeviceMemoryBase TransferToDevice(
const Literal& literal);
- // Transfers the array refered to by the given handle from the device and
+ // Transfers the array referred to by the given handle from the device and
// returns as a Literal.
std::unique_ptr<Literal> TransferFromDevice(
const Shape& shape, perftools::gputools::DeviceMemoryBase device_base);
diff --git a/tensorflow/compiler/xla/tests/prng_test.cc b/tensorflow/compiler/xla/tests/prng_test.cc
index 5a6aa467e5..a0f98fcfef 100644
--- a/tensorflow/compiler/xla/tests/prng_test.cc
+++ b/tensorflow/compiler/xla/tests/prng_test.cc
@@ -194,7 +194,7 @@ XLA_TEST_F(PrngTest, MapUsingRng) {
}
}
-// This tests demonstrates the global seeding behaviour.
+// This tests demonstrates the global seeding behavior.
// * If a seed is passed in via Execute (ExecuteAndTransfer) then the output is
// fixed (i.e., there is a single output for a given seed);
// * If no seed is passed in then the output of every call can be different;
diff --git a/tensorflow/compiler/xla/types.h b/tensorflow/compiler/xla/types.h
index 4935648f98..ea8b4b7b98 100644
--- a/tensorflow/compiler/xla/types.h
+++ b/tensorflow/compiler/xla/types.h
@@ -19,6 +19,8 @@ limitations under the License.
#include "third_party/eigen3/Eigen/Core"
#include "tensorflow/core/platform/types.h"
+#include <Eigen/Core>
+
namespace xla {
using ::tensorflow::string;
diff --git a/tensorflow/contrib/BUILD b/tensorflow/contrib/BUILD
index a0b65bcdc5..d9bec0e19f 100755
--- a/tensorflow/contrib/BUILD
+++ b/tensorflow/contrib/BUILD
@@ -56,6 +56,7 @@ py_library(
"//tensorflow/contrib/rnn:rnn_py",
"//tensorflow/contrib/saved_model:saved_model_py",
"//tensorflow/contrib/seq2seq:seq2seq_py",
+ "//tensorflow/contrib/signal:signal_py",
"//tensorflow/contrib/slim",
"//tensorflow/contrib/slim:nets",
"//tensorflow/contrib/solvers:solvers_py",
diff --git a/tensorflow/contrib/__init__.py b/tensorflow/contrib/__init__.py
index 3908c37fcd..a94e809c13 100644
--- a/tensorflow/contrib/__init__.py
+++ b/tensorflow/contrib/__init__.py
@@ -54,6 +54,7 @@ from tensorflow.contrib import quantization
from tensorflow.contrib import rnn
from tensorflow.contrib import saved_model
from tensorflow.contrib import seq2seq
+from tensorflow.contrib import signal
from tensorflow.contrib import slim
from tensorflow.contrib import solvers
from tensorflow.contrib import sparsemax
diff --git a/tensorflow/contrib/android/cmake/README.md b/tensorflow/contrib/android/cmake/README.md
index 915319da55..6f19b657fe 100644
--- a/tensorflow/contrib/android/cmake/README.md
+++ b/tensorflow/contrib/android/cmake/README.md
@@ -43,6 +43,6 @@ Output
- TensorFlow-Inference-release.aar
File libtensorflow_inference.so should be packed under jni/${ANDROID_ABI}/
-in the above aar, and it is transparent to the app as it will acccess them via
+in the above aar, and it is transparent to the app as it will access them via
equivalent java APIs.
diff --git a/tensorflow/contrib/batching/BUILD b/tensorflow/contrib/batching/BUILD
index 800d55b127..b16fb9b5bb 100644
--- a/tensorflow/contrib/batching/BUILD
+++ b/tensorflow/contrib/batching/BUILD
@@ -179,7 +179,11 @@ py_test(
size = "small",
srcs = ["python/ops/batch_ops_test.py"],
srcs_version = "PY2AND3",
- tags = ["nomac"],
+ tags = [
+ "manual",
+ "no_pip",
+ "nomac",
+ ],
deps = [
":batch_py",
"//tensorflow/python:framework_test_lib",
diff --git a/tensorflow/contrib/bayesflow/python/ops/monte_carlo_impl.py b/tensorflow/contrib/bayesflow/python/ops/monte_carlo_impl.py
index 55e0e6d57b..3590f940ac 100644
--- a/tensorflow/contrib/bayesflow/python/ops/monte_carlo_impl.py
+++ b/tensorflow/contrib/bayesflow/python/ops/monte_carlo_impl.py
@@ -177,7 +177,7 @@ def _logspace_mean(log_values):
`Log[Mean[values]]`.
"""
# center = Max[Log[values]], with stop-gradient
- # The center hopefully keep the exponentiated term small. It is cancelled
+ # The center hopefully keep the exponentiated term small. It is canceled
# from the final result, so putting stop gradient on it will not change the
# final result. We put stop gradient on to eliminate unnecessary computation.
center = array_ops.stop_gradient(_sample_max(log_values))
diff --git a/tensorflow/contrib/boosted_trees/lib/testutil/random_tree_gen.h b/tensorflow/contrib/boosted_trees/lib/testutil/random_tree_gen.h
index dc584bbd3c..5e12429ba7 100644
--- a/tensorflow/contrib/boosted_trees/lib/testutil/random_tree_gen.h
+++ b/tensorflow/contrib/boosted_trees/lib/testutil/random_tree_gen.h
@@ -42,7 +42,7 @@ class RandomTreeGen {
boosted_trees::trees::DecisionTreeConfig Generate(
const boosted_trees::trees::DecisionTreeConfig& tree);
- // Requried: depth >= 1; tree_count >= 1.
+ // Required: depth >= 1; tree_count >= 1.
boosted_trees::trees::DecisionTreeEnsembleConfig GenerateEnsemble(
int dept, int tree_count);
diff --git a/tensorflow/contrib/cloud/kernels/bigquery_reader_ops.cc b/tensorflow/contrib/cloud/kernels/bigquery_reader_ops.cc
index 02a759eefd..093000559b 100644
--- a/tensorflow/contrib/cloud/kernels/bigquery_reader_ops.cc
+++ b/tensorflow/contrib/cloud/kernels/bigquery_reader_ops.cc
@@ -46,7 +46,7 @@ Status GetTableAttrs(OpKernelConstruction* context, string* project_id,
} // namespace
-// Note that overriden methods with names ending in "Locked" are called by
+// Note that overridden methods with names ending in "Locked" are called by
// ReaderBase while a mutex is held.
// See comments for ReaderBase.
class BigQueryReader : public ReaderBase {
diff --git a/tensorflow/contrib/cloud/python/ops/bigquery_reader_ops_test.py b/tensorflow/contrib/cloud/python/ops/bigquery_reader_ops_test.py
index 9acdb4b102..493b3c6f1b 100644
--- a/tensorflow/contrib/cloud/python/ops/bigquery_reader_ops_test.py
+++ b/tensorflow/contrib/cloud/python/ops/bigquery_reader_ops_test.py
@@ -46,7 +46,7 @@ _TABLE = "test-table"
# The values for rows are generated such that some columns have null values. The
# general formula here is:
# - The int64 column is present in every row.
-# - The string column is only avaiable in even rows.
+# - The string column is only available in even rows.
# - The float column is only available in every third row.
_ROWS = [[0, "s_0", 0.1], [1, None, None], [2, "s_2", None], [3, None, 3.1],
[4, "s_4", None], [5, None, None], [6, "s_6", 6.1], [7, None, None],
diff --git a/tensorflow/contrib/cmake/CMakeLists.txt b/tensorflow/contrib/cmake/CMakeLists.txt
index e010cdd823..9ffe08eded 100644
--- a/tensorflow/contrib/cmake/CMakeLists.txt
+++ b/tensorflow/contrib/cmake/CMakeLists.txt
@@ -63,11 +63,16 @@ if(WIN32)
add_definitions(-DWIN32 -DOS_WIN -D_MBCS -DWIN64 -DWIN32_LEAN_AND_MEAN -DNOGDI -DPLATFORM_WINDOWS)
add_definitions(-DTENSORFLOW_USE_EIGEN_THREADPOOL -DEIGEN_HAS_C99_MATH)
add_definitions(-DTF_COMPILE_LIBRARY)
- add_definitions(/bigobj /nologo /EHsc /GF /FC /MP /Gm-)
+ add_definitions(/bigobj /nologo /EHsc /GF /MP /Gm-)
# Suppress warnings to reduce build log size.
add_definitions(/wd4267 /wd4244 /wd4800 /wd4503 /wd4554 /wd4996 /wd4348 /wd4018)
add_definitions(/wd4099 /wd4146 /wd4267 /wd4305 /wd4307)
add_definitions(/wd4715 /wd4722 /wd4723 /wd4838 /wd4309 /wd4334)
+ add_definitions(/wd4003 /wd4244 /wd4267 /wd4503 /wd4506 /wd4800 /wd4996)
+ # Suppress linker warnings.
+ set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} /ignore:4049 /ignore:4197 /ignore:4217 /ignore:4221")
+ set(CMAKE_MODULE_LINKER_FLAGS "${CMAKE_MODULE_LINKER_FLAGS} /ignore:4049 /ignore:4197 /ignore:4217 /ignore:4221")
+ set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /ignore:4049 /ignore:4197 /ignore:4217 /ignore:4221")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /MP")
set(CMAKE_CXX_FLAGS_DEBUG "/D_DEBUG /MDd /Ob0")
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /D_ITERATOR_DEBUG_LEVEL=0")
@@ -108,6 +113,7 @@ include(zlib)
include(gif)
include(png)
include(jpeg)
+include(lmdb)
include(eigen)
include(gemmlowp)
include(jsoncpp)
@@ -124,6 +130,7 @@ set(tensorflow_EXTERNAL_LIBRARIES
${gif_STATIC_LIBRARIES}
${png_STATIC_LIBRARIES}
${jpeg_STATIC_LIBRARIES}
+ ${lmdb_STATIC_LIBRARIES}
${jsoncpp_STATIC_LIBRARIES}
${farmhash_STATIC_LIBRARIES}
${fft2d_STATIC_LIBRARIES}
@@ -135,6 +142,7 @@ set(tensorflow_EXTERNAL_DEPENDENCIES
gif_copy_headers_to_destination
png_copy_headers_to_destination
jpeg_copy_headers_to_destination
+ lmdb_copy_headers_to_destination
jsoncpp
farmhash_copy_headers_to_destination
highwayhash_copy_headers_to_destination
@@ -153,6 +161,7 @@ include_directories(
${gif_INCLUDE_DIR}
${png_INCLUDE_DIR}
${jpeg_INCLUDE_DIR}
+ ${lmdb_INCLUDE_DIR}
${eigen_INCLUDE_DIRS}
${gemmlowp_INCLUDE_DIR}
${jsoncpp_INCLUDE_DIR}
diff --git a/tensorflow/contrib/cmake/external/lmdb.cmake b/tensorflow/contrib/cmake/external/lmdb.cmake
new file mode 100644
index 0000000000..28ec833bab
--- /dev/null
+++ b/tensorflow/contrib/cmake/external/lmdb.cmake
@@ -0,0 +1,60 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+include (ExternalProject)
+
+set(lmdb_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/lmdb)
+set(lmdb_URL http://mirror.bazel.build/github.com/LMDB/lmdb/archive/LMDB_0.9.19.tar.gz)
+set(lmdb_HASH SHA256=108532fb94c6f227558d45be3f3347b52539f0f58290a7bb31ec06c462d05326)
+set(lmdb_BUILD ${CMAKE_BINARY_DIR}/lmdb/src/lmdb)
+set(lmdb_INSTALL ${CMAKE_BINARY_DIR}/lmdb/install)
+
+ExternalProject_Add(lmdb
+ PREFIX lmdb
+ URL ${lmdb_URL}
+ URL_HASH ${lmdb_HASH}
+ PATCH_COMMAND ${CMAKE_COMMAND} -E copy_if_different
+ ${CMAKE_CURRENT_SOURCE_DIR}/patches/lmdb/CMakeLists.txt ${lmdb_BUILD}
+ INSTALL_DIR ${lmdb_INSTALL}
+ DOWNLOAD_DIR "${DOWNLOAD_LOCATION}"
+ CMAKE_CACHE_ARGS
+ -DCMAKE_BUILD_TYPE:STRING=Release
+ -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF
+ -DCMAKE_INSTALL_PREFIX:STRING=${lmdb_INSTALL}
+ -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
+)
+
+if(WIN32)
+ set(lmdb_STATIC_LIBRARIES ${lmdb_INSTALL}/lib/lmdb.lib)
+else()
+ set(lmdb_STATIC_LIBRARIES ${lmdb_INSTALL}/lib/liblmdb.a)
+endif()
+
+set(lmdb_HEADERS
+ "${lmdb_INSTALL}/include/lmdb.h"
+ "${lmdb_INSTALL}/include/midl.h"
+)
+
+## put lmdb includes in the directory where they are expected
+add_custom_target(lmdb_create_destination_dir
+ COMMAND ${CMAKE_COMMAND} -E make_directory ${lmdb_INCLUDE_DIR}
+ DEPENDS lmdb)
+
+add_custom_target(lmdb_copy_headers_to_destination
+ DEPENDS lmdb_create_destination_dir)
+
+foreach(header_file ${lmdb_HEADERS})
+ add_custom_command(TARGET lmdb_copy_headers_to_destination PRE_BUILD
+ COMMAND ${CMAKE_COMMAND} -E copy_if_different ${header_file} ${lmdb_INCLUDE_DIR}/)
+endforeach()
diff --git a/tensorflow/contrib/cmake/patches/lmdb/CMakeLists.txt b/tensorflow/contrib/cmake/patches/lmdb/CMakeLists.txt
new file mode 100644
index 0000000000..19fa607a10
--- /dev/null
+++ b/tensorflow/contrib/cmake/patches/lmdb/CMakeLists.txt
@@ -0,0 +1,26 @@
+cmake_minimum_required(VERSION 2.8.3)
+
+project(liblmdb)
+
+set(LIBLMDB_SRCS
+ "libraries/liblmdb/mdb.c"
+ "libraries/liblmdb/midl.c"
+)
+
+set(LIBLMDB_INCLUDES
+ "libraries/liblmdb/lmdb.h"
+ "libraries/liblmdb/midl.h"
+)
+
+include_directories("${CMAKE_CURRENT_SOURCE_DIR}")
+
+add_library(lmdb ${LIBLMDB_SRCS})
+
+install(TARGETS lmdb
+ RUNTIME DESTINATION bin COMPONENT RuntimeLibraries
+ LIBRARY DESTINATION lib COMPONENT RuntimeLibraries
+ ARCHIVE DESTINATION lib COMPONENT Development)
+
+foreach(LIBLMDB_INCLUDE ${LIBLMDB_INCLUDES})
+ install(FILES ${LIBLMDB_INCLUDE} DESTINATION include COMPONENT Development)
+endforeach()
diff --git a/tensorflow/contrib/cmake/tf_python.cmake b/tensorflow/contrib/cmake/tf_python.cmake
index 124eab17cc..4e9f39648a 100755
--- a/tensorflow/contrib/cmake/tf_python.cmake
+++ b/tensorflow/contrib/cmake/tf_python.cmake
@@ -174,8 +174,15 @@ function(add_python_module MODULE_NAME)
if(NOT ${ADD_PYTHON_MODULE_DONTCOPY})
foreach(script ${module_python_srcs})
get_filename_component(REL_DIR ${script} DIRECTORY)
- add_custom_command(TARGET tf_python_copy_scripts_to_destination PRE_BUILD
- COMMAND ${CMAKE_COMMAND} -E copy ${tensorflow_source_dir}/${script} ${CMAKE_CURRENT_BINARY_DIR}/tf_python/${script})
+ # NOTE(mrry): This rule may exclude modules that should be part of
+ # the distributed PIP package
+ # (e.g. tensorflow/contrib/testing/python/framework/util_test.py),
+ # so we currently add explicit commands to include those files
+ # later on in this script.
+ if (NOT "${script}" MATCHES "_test\.py$")
+ add_custom_command(TARGET tf_python_copy_scripts_to_destination PRE_BUILD
+ COMMAND ${CMAKE_COMMAND} -E copy ${tensorflow_source_dir}/${script} ${CMAKE_CURRENT_BINARY_DIR}/tf_python/${script})
+ endif()
endforeach()
endif()
endfunction()
@@ -324,7 +331,6 @@ add_python_module("tensorflow/contrib/ios_examples/benchmark/benchmark.xcodeproj
add_python_module("tensorflow/contrib/ios_examples/benchmark/data")
add_python_module("tensorflow/contrib/ios_examples/camera")
add_python_module("tensorflow/contrib/ios_examples/camera/camera_example.xcodeproj")
-add_python_module("tensorflow/contrib/ios_examples/camera/data")
add_python_module("tensorflow/contrib/ios_examples/camera/en.lproj")
add_python_module("tensorflow/contrib/ios_examples/simple")
add_python_module("tensorflow/contrib/ios_examples/simple/data")
@@ -470,8 +476,9 @@ add_python_module("tensorflow/contrib/seq2seq/python/ops")
add_python_module("tensorflow/contrib/session_bundle")
add_python_module("tensorflow/contrib/session_bundle/example")
add_python_module("tensorflow/contrib/session_bundle/testdata")
-add_python_module("tensorflow/contrib/session_bundle/testdata/saved_model_half_plus_two")
-add_python_module("tensorflow/contrib/session_bundle/testdata/saved_model_half_plus_two/variables")
+add_python_module("tensorflow/contrib/signal")
+add_python_module("tensorflow/contrib/signal/python")
+add_python_module("tensorflow/contrib/signal/python/ops")
add_python_module("tensorflow/contrib/slim")
add_python_module("tensorflow/contrib/slim/python")
add_python_module("tensorflow/contrib/slim/python/slim")
@@ -875,9 +882,17 @@ add_dependencies(tf_python_build_pip_package
tf_python_touchup_modules
tf_python_ops
tf_extension_ops)
+
+# Fix-up Python files that were not included by the add_python_module() macros.
add_custom_command(TARGET tf_python_build_pip_package POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy ${tensorflow_source_dir}/tensorflow/tools/pip_package/setup.py
${CMAKE_CURRENT_BINARY_DIR}/tf_python/)
+# This file is unfortunately excluded by the regex that excludes *_test.py
+# files, but it is imported into tf.contrib, so we add it explicitly.
+add_custom_command(TARGET tf_python_copy_scripts_to_destination PRE_BUILD
+ COMMAND ${CMAKE_COMMAND} -E copy ${tensorflow_source_dir}/tensorflow/contrib/testing/python/framework/util_test.py
+ ${CMAKE_CURRENT_BINARY_DIR}/tf_python/tensorflow/contrib/testing/python/framework/)
+
if(WIN32)
add_custom_command(TARGET tf_python_build_pip_package POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy ${CMAKE_CURRENT_BINARY_DIR}/$(Configuration)/pywrap_tensorflow_internal.dll
diff --git a/tensorflow/contrib/cmake/tf_tests.cmake b/tensorflow/contrib/cmake/tf_tests.cmake
index 0eee80ccce..55e9e311f9 100644
--- a/tensorflow/contrib/cmake/tf_tests.cmake
+++ b/tensorflow/contrib/cmake/tf_tests.cmake
@@ -183,12 +183,15 @@ if (tensorflow_BUILD_PYTHON_TESTS)
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/string_to_number_op_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/clip_ops_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/tensor_array_ops_test.py" # Needs portpicker.
+ # Matrix_set_diag failing on GPU on windows.
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/cholesky_op_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/diag_op_test.py"
+ "${tensorflow_source_dir}/tensorflow/python/kernel_tests/linalg_ops_test.py"
# misc
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/variable_scope_test.py"
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/reshape_op_test.py"
"${tensorflow_source_dir}/tensorflow/python/training/evaluation_test.py"
"${tensorflow_source_dir}/tensorflow/tensorboard/backend/server_test.py"
- "${tensorflow_source_dir}/tensorflow/python/kernel_tests/diag_op_test.py" # Silently failing with GPU kernel disabled.
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/neon_depthwise_conv_op_test.py" # Depends on gemmlowp -> pthread.
# int32/int64 mixup
"${tensorflow_source_dir}/tensorflow/python/kernel_tests/functional_ops_test.py"
diff --git a/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py b/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
index 4f70b275e8..cc0c7b0829 100644
--- a/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
+++ b/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py
@@ -141,7 +141,7 @@ _cudnn_rnn_common_doc_string = """
* Once a while, the user saves the parameter buffer into model checkpoints
with Saver.save().
* When restoring, the user creates a RNNParamsSaveable object and uses
- Saver.restore() to restore the paramter buffer from the canonical format
+ Saver.restore() to restore the parameter buffer from the canonical format
to a user-defined format, as well as to restore other savable objects
in the checkpoint file.
"""
diff --git a/tensorflow/contrib/data/README.md b/tensorflow/contrib/data/README.md
index 42da544a30..9505f5c465 100644
--- a/tensorflow/contrib/data/README.md
+++ b/tensorflow/contrib/data/README.md
@@ -457,12 +457,12 @@ batched into a fixed size.
# to a fixed shape.
def _parse_function(filename, label):
image_string = tf.read_file(filename)
- image_decoded = tf.image.decode_image(filename)
+ image_decoded = tf.image.decode_image(image_string)
image_resized = tf.image.resize_images(image_decoded, [28, 28])
return image_resized, label
-filenames = ["/var/data/image1.jpg", "/var/data/image2.jpg", ...]
-labels = [0, 37, 29, 1, ...]
+filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])
+labels = tf.constant([0, 37, 29, 1, ...])
dataset = tf.contrib.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
diff --git a/tensorflow/contrib/data/python/framework/function.py b/tensorflow/contrib/data/python/framework/function.py
index 2839130ab7..8c6bcb858f 100644
--- a/tensorflow/contrib/data/python/framework/function.py
+++ b/tensorflow/contrib/data/python/framework/function.py
@@ -39,7 +39,7 @@ class _ExperimentalFuncGraph(function._FuncGraph):
_ExperimentalFuncGraph overrides ops.Graph's create_op() so that we can keep
track of every inputs into every op created inside the function. If
any input is from other graphs, we keep track of it in self.capture
- and substitue the input with a place holder.
+ and substitute the input with a place holder.
Each captured input's corresponding place holder is converted into a
function argument and the caller passes in the captured tensor.
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index db3e2c807c..ab4d80c327 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -52,7 +52,10 @@ py_test(
size = "small",
srcs = ["dataset_constructor_op_test.py"],
srcs_version = "PY2AND3",
- tags = ["nomac"], # b/62040583
+ tags = [
+ "manual",
+ "nomac", # b/62040583
+ ],
deps = [
"//tensorflow/contrib/data",
"//tensorflow/python:array_ops",
diff --git a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
index c9412d949c..1f87f14187 100644
--- a/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/batch_dataset_op_test.py
@@ -271,6 +271,22 @@ class BatchDatasetTest(test.TestCase):
"larger than the row shape"):
sess.run(get_next)
+ def testUnbatchDataset(self):
+ data = [math_ops.range(10) for _ in range(3)]
+ data = dataset_ops.Dataset.from_tensor_slices(data)
+ data = data.batch(2)
+ data = data.unbatch()
+
+ iter = data.make_one_shot_iterator()
+ op = iter.get_next()
+
+ with self.test_session() as sess:
+ for i in range(3):
+ self.assertAllClose([range(10)], sess.run(op))
+
+ with self.assertRaises(errors.OutOfRangeError):
+ sess.run(op)
+
if __name__ == "__main__":
test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/resample_test.py b/tensorflow/contrib/data/python/kernel_tests/resample_test.py
index f6ce77054b..fb66acdcac 100644
--- a/tensorflow/contrib/data/python/kernel_tests/resample_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/resample_test.py
@@ -65,7 +65,7 @@ class ResampleTest(test.TestCase):
self.assertAllEqual([compat.as_bytes(str(c))
for c in returned_classes], returned_data)
total_returned = len(returned_classes)
- # Subsampling rejects a large precentage of the initial data in
+ # Subsampling rejects a large percentage of the initial data in
# this case.
self.assertGreater(total_returned, 20000 * 0.2)
class_counts = np.array([
diff --git a/tensorflow/contrib/data/python/ops/dataset_ops.py b/tensorflow/contrib/data/python/ops/dataset_ops.py
index db9a32adac..cf8926c4e7 100644
--- a/tensorflow/contrib/data/python/ops/dataset_ops.py
+++ b/tensorflow/contrib/data/python/ops/dataset_ops.py
@@ -849,7 +849,8 @@ class Dataset(object):
Returns:
A `Dataset`.
"""
- return self.flat_map(map_func=Dataset.from_tensor_slices)
+ return self.flat_map(
+ map_func=lambda *args: Dataset.from_tensor_slices(args))
def filter(self, predicate):
"""Filters this dataset according to `predicate`.
@@ -1480,7 +1481,8 @@ class MapDataset(Dataset):
self._output_buffer_size = ops.convert_to_tensor(
output_buffer_size, dtype=dtypes.int64, name="output_buffer_size")
else:
- self._output_buffer_size = self._num_threads
+ self._output_buffer_size = ops.convert_to_tensor(
+ self._num_threads, dtype=dtypes.int64, name="output_buffer_size")
else:
self._num_threads = None
self._output_buffer_size = None
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py b/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py
index 58368d92c4..1c67a1b8f6 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/distribution_util_test.py
@@ -166,7 +166,7 @@ class ShapesFromLocAndScaleTest(test.TestCase):
batch_shape, event_shape = distribution_util.shapes_from_loc_and_scale(
loc, scale)
# batch_shape depends on both args, and so is dynamic. Since loc did not
- # have static shape, we infered event shape entirely from scale, and this
+ # have static shape, we inferred event shape entirely from scale, and this
# is available statically.
self.assertAllEqual(
[5, 2], batch_shape.eval(feed_dict={loc: np.zeros((2, 3))}))
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/vector_student_t_test.py b/tensorflow/contrib/distributions/python/kernel_tests/vector_student_t_test.py
index 9d0ffd6376..b8a3a262ce 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/vector_student_t_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/vector_student_t_test.py
@@ -38,7 +38,7 @@ class _FakeVectorStudentT(object):
Other `Vector*` implementations need only test new code. That we don't need
to test every Vector* distribution is good because there aren't SciPy
- analogues and reimplementing everything in NumPy sort of defeats the point of
+ analogs and reimplementing everything in NumPy sort of defeats the point of
having the `TransformedDistribution + Affine` API.
"""
diff --git a/tensorflow/contrib/distributions/python/ops/binomial.py b/tensorflow/contrib/distributions/python/ops/binomial.py
index ecf6a61156..9304a56491 100644
--- a/tensorflow/contrib/distributions/python/ops/binomial.py
+++ b/tensorflow/contrib/distributions/python/ops/binomial.py
@@ -269,7 +269,7 @@ class Binomial(distribution.Distribution):
message="total_count must be non-negative."),
distribution_util.assert_integer_form(
total_count,
- message="total_count cannot contain fractional componentes."),
+ message="total_count cannot contain fractional components."),
], total_count)
def _maybe_assert_valid_sample(self, counts, check_integer=True):
diff --git a/tensorflow/contrib/factorization/BUILD b/tensorflow/contrib/factorization/BUILD
index 60e7c8f160..0b4dc5667f 100644
--- a/tensorflow/contrib/factorization/BUILD
+++ b/tensorflow/contrib/factorization/BUILD
@@ -214,6 +214,7 @@ tf_py_test(
"//tensorflow/python:state_ops",
"//tensorflow/python:variables",
],
+ tags = ["manual"],
)
# Kernel tests
diff --git a/tensorflow/contrib/factorization/python/ops/clustering_ops.py b/tensorflow/contrib/factorization/python/ops/clustering_ops.py
index 42815664ad..2e9b5e22c7 100644
--- a/tensorflow/contrib/factorization/python/ops/clustering_ops.py
+++ b/tensorflow/contrib/factorization/python/ops/clustering_ops.py
@@ -222,7 +222,7 @@ class KMeans(object):
if (self._distance_metric == COSINE_DISTANCE and
not self._clusters_l2_normalized()):
# The cosine distance between normalized vectors x and y is the same as
- # 2 * squared_euclidian_distance. We are using this fact and reusing the
+ # 2 * squared_euclidean_distance. We are using this fact and reusing the
# nearest_neighbors op.
# TODO(ands): Support COSINE distance in nearest_neighbors and remove
# this.
diff --git a/tensorflow/contrib/framework/python/framework/checkpoint_utils.py b/tensorflow/contrib/framework/python/framework/checkpoint_utils.py
index 36de1e3f82..9e356dd965 100644
--- a/tensorflow/contrib/framework/python/framework/checkpoint_utils.py
+++ b/tensorflow/contrib/framework/python/framework/checkpoint_utils.py
@@ -184,7 +184,7 @@ def init_from_checkpoint(checkpoint_dir, assignment_map):
var3 = tf.get_variable(name="my1", shape=[100, 100],
partitioner=lambda shape, dtype: [5, 1])
...
- # Specify which variables to intialize from checkpoint.
+ # Specify which variables to initialize from checkpoint.
init_from_checkpoint(checkpoint_dir, {
'some_var': 'test/my_var',
'some_scope/': 'test2/'})
diff --git a/tensorflow/contrib/graph_editor/reroute.py b/tensorflow/contrib/graph_editor/reroute.py
index 386ce9eb06..42968ae63b 100644
--- a/tensorflow/contrib/graph_editor/reroute.py
+++ b/tensorflow/contrib/graph_editor/reroute.py
@@ -370,7 +370,7 @@ def _reroute_sgv_outputs(sgv0, sgv1, mode):
def _reroute_sgv(sgv0, sgv1, mode):
"""Re-route both the inputs and the outputs of the two subgraph views.
- This involves swapping all the inputs/ouputs of the two subgraph views.
+ This involves swapping all the inputs/outputs of the two subgraph views.
Args:
sgv0: the first subgraph to be swapped. This argument is converted to a
diff --git a/tensorflow/contrib/graph_editor/util.py b/tensorflow/contrib/graph_editor/util.py
index ec32beda5a..959905e982 100644
--- a/tensorflow/contrib/graph_editor/util.py
+++ b/tensorflow/contrib/graph_editor/util.py
@@ -130,7 +130,7 @@ def transform_tree(tree, fn, iterable_type=tuple):
tree: iterable or not. If iterable, its elements (child) can also be
iterable or not.
fn: function to apply to each leaves.
- iterable_type: type use to construct the resulting tree for unknwon
+ iterable_type: type use to construct the resulting tree for unknown
iterable, typically `list` or `tuple`.
Returns:
A tree whose leaves has been transformed by `fn`.
diff --git a/tensorflow/contrib/hooks/README.md b/tensorflow/contrib/hooks/README.md
index c7f88bb111..84dd6ac879 100644
--- a/tensorflow/contrib/hooks/README.md
+++ b/tensorflow/contrib/hooks/README.md
@@ -5,7 +5,7 @@ of `SessionRunHook` and are to be used with helpers like `MonitoredSession`
and `learn.Estimator` that wrap `tensorflow.Session`.
The hooks are called between invocations of `Session.run()` to perform custom
-behaviour.
+behavior.
For example the `ProfilerHook` periodically collects `RunMetadata` after
`Session.run()` and saves profiling information that can be viewed in a
diff --git a/tensorflow/contrib/image/kernels/image_ops.h b/tensorflow/contrib/image/kernels/image_ops.h
index b64fc9e0ec..ad50133061 100644
--- a/tensorflow/contrib/image/kernels/image_ops.h
+++ b/tensorflow/contrib/image/kernels/image_ops.h
@@ -66,7 +66,13 @@ class ProjectiveGenerator {
projection;
// TODO(ringwalt): Add a fill value input.
+#if (defined __CUDA_ARCH__) && (CUDART_VERSION < 8000)
+ // On CUDA versions previous to 8.0, only __shared__ variables
+ // could be declared as static in the device code.
+ const T fill_value = T(0);
+#else
static const T fill_value = T(0);
+#endif
switch (interpolation_) {
case INTERPOLATION_NEAREST:
// Switch the order of x and y again for indexing into the image.
diff --git a/tensorflow/contrib/image/ops/image_ops.cc b/tensorflow/contrib/image/ops/image_ops.cc
index 2fe65e011a..4527fdd87a 100644
--- a/tensorflow/contrib/image/ops/image_ops.cc
+++ b/tensorflow/contrib/image/ops/image_ops.cc
@@ -75,7 +75,7 @@ REGISTER_OP("BipartiteMatch")
.Doc(R"doc(
Find bipartite matching based on a given distance matrix.
-A greedy bi-partite matching alogrithm is used to obtain the matching with the
+A greedy bi-partite matching algorithm is used to obtain the matching with the
(greedy) minimum distance.
distance_mat: A 2-D float tensor of shape `[num_rows, num_columns]`. It is a
diff --git a/tensorflow/contrib/image/python/ops/image_ops.py b/tensorflow/contrib/image/python/ops/image_ops.py
index da374f8cef..b396dcea21 100644
--- a/tensorflow/contrib/image/python/ops/image_ops.py
+++ b/tensorflow/contrib/image/python/ops/image_ops.py
@@ -266,7 +266,7 @@ def bipartite_match(
top_k=-1):
"""Find bipartite matching based on a given distance matrix.
- A greedy bi-partite matching alogrithm is used to obtain the matching with
+ A greedy bi-partite matching algorithm is used to obtain the matching with
the (greedy) minimum distance.
Args:
diff --git a/tensorflow/contrib/keras/BUILD b/tensorflow/contrib/keras/BUILD
index f7f56f6fcf..71ce6540d6 100644
--- a/tensorflow/contrib/keras/BUILD
+++ b/tensorflow/contrib/keras/BUILD
@@ -311,7 +311,10 @@ py_test(
size = "medium",
srcs = ["python/keras/layers/convolutional_test.py"],
srcs_version = "PY2AND3",
- tags = ["notsan"],
+ tags = [
+ "manual",
+ "notsan",
+ ],
deps = [
":keras",
":testing_utils",
diff --git a/tensorflow/contrib/keras/api/keras/layers/__init__.py b/tensorflow/contrib/keras/api/keras/layers/__init__.py
index 8f266df0ad..3c6dce5ee8 100644
--- a/tensorflow/contrib/keras/api/keras/layers/__init__.py
+++ b/tensorflow/contrib/keras/api/keras/layers/__init__.py
@@ -135,6 +135,11 @@ from tensorflow.contrib.keras.python.keras.layers.recurrent import SimpleRNN
from tensorflow.contrib.keras.python.keras.layers.recurrent import GRU
from tensorflow.contrib.keras.python.keras.layers.recurrent import LSTM
+# Wrapper functions
+from tensorflow.contrib.keras.python.keras.layers.wrappers import Wrapper
+from tensorflow.contrib.keras.python.keras.layers.wrappers import Bidirectional
+from tensorflow.contrib.keras.python.keras.layers.wrappers import TimeDistributed
+
del absolute_import
del division
del print_function
diff --git a/tensorflow/contrib/keras/python/keras/applications/resnet50.py b/tensorflow/contrib/keras/python/keras/applications/resnet50.py
index ce7d0bb046..0de13c9592 100644
--- a/tensorflow/contrib/keras/python/keras/applications/resnet50.py
+++ b/tensorflow/contrib/keras/python/keras/applications/resnet50.py
@@ -56,7 +56,7 @@ def identity_block(input_tensor, kernel_size, filters, stage, block):
Arguments:
input_tensor: input tensor
- kernel_size: defualt 3, the kernel size of middle conv layer at main path
+ kernel_size: default 3, the kernel size of middle conv layer at main path
filters: list of integers, the filterss of 3 conv layer at main path
stage: integer, current stage label, used for generating layer names
block: 'a','b'..., current block label, used for generating layer names
@@ -95,7 +95,7 @@ def conv_block(input_tensor, kernel_size, filters, stage, block, strides=(2,
Arguments:
input_tensor: input tensor
- kernel_size: defualt 3, the kernel size of middle conv layer at main path
+ kernel_size: default 3, the kernel size of middle conv layer at main path
filters: list of integers, the filterss of 3 conv layer at main path
stage: integer, current stage label, used for generating layer names
block: 'a','b'..., current block label, used for generating layer names
diff --git a/tensorflow/contrib/keras/python/keras/backend.py b/tensorflow/contrib/keras/python/keras/backend.py
index 84d0dacce9..b7adf9461a 100644
--- a/tensorflow/contrib/keras/python/keras/backend.py
+++ b/tensorflow/contrib/keras/python/keras/backend.py
@@ -92,7 +92,7 @@ _IMAGE_DATA_FORMAT = 'channels_last'
def backend():
"""Publicly accessible method for determining the current backend.
- Only exists for API compatibily with multi-backend Keras.
+ Only exists for API compatibility with multi-backend Keras.
Returns:
The string "tensorflow".
@@ -2736,7 +2736,7 @@ def in_train_phase(x, alt, training=None):
(tensor or callable that returns a tensor).
training: Optional scalar tensor
(or Python boolean, or Python integer)
- specifing the learning phase.
+ specifying the learning phase.
Returns:
Either `x` or `alt` based on the `training` flag.
@@ -2779,7 +2779,7 @@ def in_test_phase(x, alt, training=None):
(tensor or callable that returns a tensor).
training: Optional scalar tensor
(or Python boolean, or Python integer)
- specifing the learning phase.
+ specifying the learning phase.
Returns:
Either `x` or `alt` based on `K.learning_phase`.
diff --git a/tensorflow/contrib/keras/python/keras/engine/topology.py b/tensorflow/contrib/keras/python/keras/engine/topology.py
index 7561ef78f3..07d708ada3 100644
--- a/tensorflow/contrib/keras/python/keras/engine/topology.py
+++ b/tensorflow/contrib/keras/python/keras/engine/topology.py
@@ -1544,7 +1544,7 @@ class Container(Layer):
"""Retrieve the model's updates.
Will only include updates that are either
- inconditional, or conditional on inputs to this model
+ unconditional, or conditional on inputs to this model
(e.g. will not include updates that depend on tensors
that aren't inputs to this model).
@@ -1571,7 +1571,7 @@ class Container(Layer):
"""Retrieve the model's losses.
Will only include losses that are either
- inconditional, or conditional on inputs to this model
+ unconditional, or conditional on inputs to this model
(e.g. will not include losses that depend on tensors
that aren't inputs to this model).
diff --git a/tensorflow/contrib/keras/python/keras/wrappers/scikit_learn.py b/tensorflow/contrib/keras/python/keras/wrappers/scikit_learn.py
index 9f8cea375b..0d04fc120f 100644
--- a/tensorflow/contrib/keras/python/keras/wrappers/scikit_learn.py
+++ b/tensorflow/contrib/keras/python/keras/wrappers/scikit_learn.py
@@ -109,7 +109,7 @@ class BaseWrapper(object):
"""Gets parameters for this estimator.
Arguments:
- **params: ignored (exists for API compatiblity).
+ **params: ignored (exists for API compatibility).
Returns:
Dictionary of parameter names mapped to their values.
diff --git a/tensorflow/contrib/kernel_methods/g3doc/tutorial.md b/tensorflow/contrib/kernel_methods/g3doc/tutorial.md
index 64c2adf9f3..9877375c2c 100644
--- a/tensorflow/contrib/kernel_methods/g3doc/tutorial.md
+++ b/tensorflow/contrib/kernel_methods/g3doc/tutorial.md
@@ -273,7 +273,7 @@ features.
* The parameters of the kernel mapping are often data-dependent. Model quality
can be very sensitive to these parameters. Use hyperparameter tuning to find the
optimal values.
-* If you have multiple numerical features, concatinate them into a single
+* If you have multiple numerical features, concatenate them into a single
multi-dimensional feature and apply the kernel mapping to the concatenated
vector.
diff --git a/tensorflow/contrib/kernel_methods/python/mappers/random_fourier_features_test.py b/tensorflow/contrib/kernel_methods/python/mappers/random_fourier_features_test.py
index 200d00b663..6f4a264485 100644
--- a/tensorflow/contrib/kernel_methods/python/mappers/random_fourier_features_test.py
+++ b/tensorflow/contrib/kernel_methods/python/mappers/random_fourier_features_test.py
@@ -85,7 +85,7 @@ class RandomFourierFeatureMapperTest(TensorFlowTestCase):
mapped_x = rffm.map(x)
mapped_x_copy = rffm.map(x)
# Two different evaluations of tensors output by map on the same input
- # are identical because the same paramaters are used for the mappings.
+ # are identical because the same parameters are used for the mappings.
self.assertAllClose(mapped_x.eval(), mapped_x_copy.eval(), atol=0.001)
def testTwoMapperObjects(self):
diff --git a/tensorflow/contrib/labeled_tensor/python/ops/core.py b/tensorflow/contrib/labeled_tensor/python/ops/core.py
index e6aded92ca..04bf26a5dd 100644
--- a/tensorflow/contrib/labeled_tensor/python/ops/core.py
+++ b/tensorflow/contrib/labeled_tensor/python/ops/core.py
@@ -618,7 +618,7 @@ def identity(labeled_tensor, name=None):
def slice_function(labeled_tensor, selection, name=None):
"""Slice out a subset of the tensor.
- This is an analogue of tf.slice.
+ This is an analog of tf.slice.
For example:
>>> tensor = tf.reshape(tf.range(0, 6), [3, 2])
>>> labeled_tensor = lt.LabeledTensor(tensor, ['a', ('b', ['foo', 'bar'])])
@@ -704,7 +704,7 @@ def transpose(labeled_tensor, axis_order=None, name=None):
axis_names = list(labeled_tensor.axes.keys())
permutation = [axis_names.index(n) for n in axis_order]
- # Note: TensorFlow doesn't copy data for the identity tranpose.
+ # Note: TensorFlow doesn't copy data for the identity transpose.
transpose_tensor = array_ops.transpose(
labeled_tensor.tensor, permutation, name=scope)
diff --git a/tensorflow/contrib/layers/python/layers/target_column_test.py b/tensorflow/contrib/layers/python/layers/target_column_test.py
index 1baa663151..d5d03fb1eb 100644
--- a/tensorflow/contrib/layers/python/layers/target_column_test.py
+++ b/tensorflow/contrib/layers/python/layers/target_column_test.py
@@ -28,7 +28,7 @@ from tensorflow.python.platform import test
class RegressionTargetColumnTest(test.TestCase):
- # TODO(zakaria): test multilabel regresssion.
+ # TODO(zakaria): test multilabel regression.
def testRegression(self):
target_column = target_column_lib.regression_target()
with ops.Graph().as_default(), session.Session() as sess:
diff --git a/tensorflow/contrib/learn/python/learn/dataframe/tensorflow_dataframe.py b/tensorflow/contrib/learn/python/learn/dataframe/tensorflow_dataframe.py
index b17a4b8d05..f316c5c980 100644
--- a/tensorflow/contrib/learn/python/learn/dataframe/tensorflow_dataframe.py
+++ b/tensorflow/contrib/learn/python/learn/dataframe/tensorflow_dataframe.py
@@ -97,7 +97,7 @@ class TensorFlowDataFrame(df.DataFrame):
graph: the `Graph` in which the `DataFrame` should be built.
session: the `Session` in which to run the columns of the `DataFrame`.
start_queues: if true, queues will be started before running and halted
- after producting `n` batches.
+ after producing `n` batches.
initialize_variables: if true, variables will be initialized.
**kwargs: Additional keyword arguments e.g. `num_epochs`.
diff --git a/tensorflow/contrib/learn/python/learn/datasets/mnist.py b/tensorflow/contrib/learn/python/learn/datasets/mnist.py
index 13f213c197..af4acccaec 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/mnist.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/mnist.py
@@ -261,17 +261,13 @@ def read_data_sets(train_dir,
train_images = train_images[validation_size:]
train_labels = train_labels[validation_size:]
- train = DataSet(
- train_images, train_labels, dtype=dtype, reshape=reshape, seed=seed)
- validation = DataSet(
- validation_images,
- validation_labels,
- dtype=dtype,
- reshape=reshape,
- seed=seed)
- test = DataSet(
- test_images, test_labels, dtype=dtype, reshape=reshape, seed=seed)
-
+
+ options = dict(dtype=dtype, reshape=reshape, seed=seed)
+
+ train = DataSet(train_images, train_labels, **options)
+ validation = DataSet(validation_images, validation_labels, **options)
+ test = DataSet(test_images, test_labels, **options)
+
return base.Datasets(train=train, validation=validation, test=test)
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator.py b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
index ac5ef565c8..b87b75d5c4 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
@@ -89,7 +89,7 @@ SCIKIT_DECOUPLE_INSTRUCTIONS = (
def _verify_input_args(x, y, input_fn, feed_fn, batch_size):
- """Verifies validity of co-existance of input arguments."""
+ """Verifies validity of co-existence of input arguments."""
if input_fn is None:
if x is None:
raise ValueError('Either x or input_fn must be provided.')
@@ -360,7 +360,7 @@ class BaseEstimator(
"""
__metaclass__ = abc.ABCMeta
- # Note that for Google users, this is overriden with
+ # Note that for Google users, this is overridden with
# learn_runner.EstimatorConfig.
# TODO(wicke): Remove this once launcher takes over config functionality
_Config = run_config.RunConfig # pylint: disable=invalid-name
@@ -705,7 +705,7 @@ class BaseEstimator(
def _get_eval_ops(self, features, labels, metrics):
"""Method that builds model graph and returns evaluation ops.
- Expected to be overriden by sub-classes that require custom support.
+ Expected to be overridden by sub-classes that require custom support.
Args:
features: `Tensor` or `dict` of `Tensor` objects.
@@ -1151,7 +1151,7 @@ class Estimator(BaseEstimator):
def _get_train_ops(self, features, labels):
"""Method that builds model graph and returns trainer ops.
- Expected to be overriden by sub-classes that require custom support.
+ Expected to be overridden by sub-classes that require custom support.
This implementation uses `model_fn` passed as parameter to constructor to
build model.
@@ -1167,7 +1167,7 @@ class Estimator(BaseEstimator):
def _get_eval_ops(self, features, labels, metrics):
"""Method that builds model graph and returns evaluation ops.
- Expected to be overriden by sub-classes that require custom support.
+ Expected to be overridden by sub-classes that require custom support.
This implementation uses `model_fn` passed as parameter to constructor to
build model.
@@ -1206,7 +1206,7 @@ class Estimator(BaseEstimator):
def _get_predict_ops(self, features):
"""Method that builds model graph and returns prediction ops.
- Expected to be overriden by sub-classes that require custom support.
+ Expected to be overridden by sub-classes that require custom support.
This implementation uses `model_fn` passed as parameter to constructor to
build model.
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator_test.py b/tensorflow/contrib/learn/python/learn/estimators/estimator_test.py
index 587eb48ed0..54e6595aa8 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator_test.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator_test.py
@@ -405,7 +405,7 @@ class EstimatorModelFnTest(test.TestCase):
return None, loss, None
est = estimator.Estimator(model_fn=_invalid_model_fn)
- with self.assertRaisesRegexp(ValueError, 'Missing training_op'):
+ with self.assertRaisesRegexp(ValueError, 'Missing train_op'):
est.fit(input_fn=boston_input_fn, steps=1)
def testInvalidModelFn_no_loss(self):
diff --git a/tensorflow/contrib/learn/python/learn/estimators/head.py b/tensorflow/contrib/learn/python/learn/estimators/head.py
index 22e89de4c2..6e15e7891e 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/head.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/head.py
@@ -637,7 +637,7 @@ def _create_model_fn_ops(features,
weight_tensor = _weight_tensor(features, weight_column_name)
loss, weighted_average_loss = loss_fn(labels, logits, weight_tensor)
# Uses the deprecated API to set the tag explicitly.
- # Without it, trianing and eval losses will show up in different graphs.
+ # Without it, training and eval losses will show up in different graphs.
logging_ops.scalar_summary(
_summary_key(head_name, mkey.LOSS), weighted_average_loss)
@@ -1158,7 +1158,7 @@ def _to_labels_tensor(labels, label_name):
"""Returns label as a tensor.
Args:
- labels: Label `Tensor` or `SparseTensor` or a dict containig labels.
+ labels: Label `Tensor` or `SparseTensor` or a dict containing labels.
label_name: Label name if labels is a dict.
Returns:
@@ -1669,7 +1669,7 @@ class _MultiHead(Head):
Args:
all_model_fn_ops: list of ModelFnOps for the individual heads.
train_op_fn: Function to create train op. See `create_model_fn_ops`
- documentaion for more details.
+ documentation for more details.
Returns:
ModelFnOps that merges all heads for TRAIN.
diff --git a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
index c56741a4d1..8a327ab01f 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
@@ -132,7 +132,7 @@ class ModelFnOps(
# Validate train_op.
if train_op is None:
if mode == ModeKeys.TRAIN:
- raise ValueError('Missing training_op.')
+ raise ValueError('Missing train_op.')
elif not isinstance(train_op, ops.Operation):
# TODO(ptucker): Should this be allowed? Consider raising error.
train_op = ops.convert_to_tensor(train_op).op
diff --git a/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py b/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
index 6bb2b8b2aa..0f09b111bd 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
@@ -119,7 +119,7 @@ def apply_dropout(cells, dropout_keep_probabilities, random_seed=None):
"""
if len(dropout_keep_probabilities) != len(cells) + 1:
raise ValueError(
- 'The number of dropout probabilites must be one greater than the '
+ 'The number of dropout probabilities must be one greater than the '
'number of cells. Got {} cells and {} dropout probabilities.'.format(
len(cells), len(dropout_keep_probabilities)))
wrapped_cells = [
diff --git a/tensorflow/contrib/learn/python/learn/estimators/run_config.py b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
index 7af1c541c6..3aaee5862d 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/run_config.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
@@ -309,7 +309,7 @@ class RunConfig(ClusterConfig, core_run_config.RunConfig):
Args:
whitelist: A list of the string names of the properties uid should not
include. If `None`, defaults to `_DEFAULT_UID_WHITE_LIST`, which
- includes most properites user allowes to change.
+ includes most properties user allows to change.
Returns:
A uid string.
diff --git a/tensorflow/contrib/learn/python/learn/evaluable.py b/tensorflow/contrib/learn/python/learn/evaluable.py
index 7960ef629d..66e1526517 100644
--- a/tensorflow/contrib/learn/python/learn/evaluable.py
+++ b/tensorflow/contrib/learn/python/learn/evaluable.py
@@ -60,19 +60,19 @@ class Evaluable(object):
Args:
x: Matrix of shape [n_samples, n_features...] or dictionary of many matrices
- containing the input samples for fitting the model. Can be iterator that returns
- arrays of features or dictionary of array of features. If set, `input_fn` must
- be `None`.
+ containing the input samples for fitting the model. Can be iterator that returns
+ arrays of features or dictionary of array of features. If set, `input_fn` must
+ be `None`.
y: Vector or matrix [n_samples] or [n_samples, n_outputs] containing the
- label values (class labels in classification, real numbers in
- regression) or dictionary of multiple vectors/matrices. Can be iterator
- that returns array of targets or dictionary of array of targets. If set,
- `input_fn` must be `None`. Note: For classification, label values must
- be integers representing the class index (i.e. values from 0 to
- n_classes-1).
+ label values (class labels in classification, real numbers in
+ regression) or dictionary of multiple vectors/matrices. Can be iterator
+ that returns array of targets or dictionary of array of targets. If set,
+ `input_fn` must be `None`. Note: For classification, label values must
+ be integers representing the class index (i.e. values from 0 to
+ n_classes-1).
input_fn: Input function returning a tuple of:
- features - Dictionary of string feature name to `Tensor` or `Tensor`.
- labels - `Tensor` or dictionary of `Tensor` with labels.
+ features - Dictionary of string feature name to `Tensor` or `Tensor`.
+ labels - `Tensor` or dictionary of `Tensor` with labels.
If input_fn is set, `x`, `y`, and `batch_size` must be `None`. If
`steps` is not provided, this should raise `OutOfRangeError` or
`StopIteration` after the desired amount of data (e.g., one epoch) has
@@ -90,7 +90,6 @@ class Evaluable(object):
friendly names for the metric to a `MetricSpec` object defining which
model outputs to evaluate against which labels with which metric
function.
-
Metric ops should support streaming, e.g., returning `update_op` and
`value` tensors. For example, see the options defined in
`../../../metrics/python/ops/metrics_ops.py`.
diff --git a/tensorflow/contrib/learn/python/learn/experiment.py b/tensorflow/contrib/learn/python/learn/experiment.py
index d82bc321e7..c60ecac5df 100644
--- a/tensorflow/contrib/learn/python/learn/experiment.py
+++ b/tensorflow/contrib/learn/python/learn/experiment.py
@@ -53,7 +53,7 @@ class Experiment(object):
"""
# TODO(ispir): remove delay_workers_by_global_step and make global step based
- # waiting as only behaviour.
+ # waiting as only behavior.
@deprecated_args(
"2016-10-23",
"local_eval_frequency is deprecated as local_run will be renamed to "
@@ -550,7 +550,7 @@ class Experiment(object):
eval_result = None
# Set the default value for train_steps_per_iteration, which will be
- # overriden by other settings.
+ # overridden by other settings.
train_steps_per_iteration = 1000
if self._train_steps_per_iteration is not None:
train_steps_per_iteration = self._train_steps_per_iteration
diff --git a/tensorflow/contrib/learn/python/learn/learn_runner.py b/tensorflow/contrib/learn/python/learn/learn_runner.py
index a3398a87e1..943c555314 100644
--- a/tensorflow/contrib/learn/python/learn/learn_runner.py
+++ b/tensorflow/contrib/learn/python/learn/learn_runner.py
@@ -155,7 +155,7 @@ def run(experiment_fn, output_dir=None, schedule=None, run_config=None,
to create the `Estimator` (passed as `model_dir` to its constructor). It
must return an `Experiment`. For this case, `run_config` and `hparams`
must be None.
- 2) It accpets two arguments `run_config` and `hparams`, which should be
+ 2) It accepts two arguments `run_config` and `hparams`, which should be
used to create the `Estimator` (`run_config` passed as `config` to its
constructor; `hparams` used as the hyper-parameters of the model).
It must return an `Experiment`. For this case, `output_dir` must be None.
diff --git a/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py b/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
index 0faba7cee5..45727faab4 100644
--- a/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
@@ -140,7 +140,7 @@ def rnn_seq2seq(encoder_inputs,
scope: Scope to use, if None new will be produced.
Returns:
- List of tensors for outputs and states for trianing and sampling sub-graphs.
+ List of tensors for outputs and states for training and sampling sub-graphs.
"""
with vs.variable_scope(scope or "rnn_seq2seq"):
_, last_enc_state = rnn.static_rnn(
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py b/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
index 9d4fed9998..5709955c49 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
@@ -128,9 +128,9 @@ class CategoricalVocabulary(object):
Class name.
Raises:
- ValueError: if this vocabulary wasn't initalized with support_reverse.
+ ValueError: if this vocabulary wasn't initialized with support_reverse.
"""
if not self._support_reverse:
- raise ValueError("This vocabulary wasn't initalized with "
+ raise ValueError("This vocabulary wasn't initialized with "
"support_reverse to support reverse() function.")
return self._reverse_mapping[class_id]
diff --git a/tensorflow/contrib/learn/python/learn/trainable.py b/tensorflow/contrib/learn/python/learn/trainable.py
index 2d1d460425..972fec026f 100644
--- a/tensorflow/contrib/learn/python/learn/trainable.py
+++ b/tensorflow/contrib/learn/python/learn/trainable.py
@@ -49,7 +49,7 @@ class Trainable(object):
steps: Number of steps for which to train model. If `None`, train forever.
'steps' works incrementally. If you call two times fit(steps=10) then
training occurs in total 20 steps. If you don't want to have incremental
- behaviour please set `max_steps` instead. If set, `max_steps` must be
+ behavior please set `max_steps` instead. If set, `max_steps` must be
`None`.
batch_size: minibatch size to use on the input, defaults to first
dimension of `x`. Must be `None` if `input_fn` is provided.
diff --git a/tensorflow/contrib/learn/python/learn/utils/export.py b/tensorflow/contrib/learn/python/learn/utils/export.py
index 36a1f5f60c..6af2287761 100644
--- a/tensorflow/contrib/learn/python/learn/utils/export.py
+++ b/tensorflow/contrib/learn/python/learn/utils/export.py
@@ -89,7 +89,7 @@ def _export_graph(graph, saver, checkpoint_path, export_dir,
def generic_signature_fn(examples, unused_features, predictions):
"""Creates generic signature from given examples and predictions.
- This is needed for backward compatibility with default behaviour of
+ This is needed for backward compatibility with default behavior of
export_estimator.
Args:
diff --git a/tensorflow/contrib/learn/python/learn/utils/gc.py b/tensorflow/contrib/learn/python/learn/utils/gc.py
index dd4376f051..5af9e8b9e2 100644
--- a/tensorflow/contrib/learn/python/learn/utils/gc.py
+++ b/tensorflow/contrib/learn/python/learn/utils/gc.py
@@ -71,6 +71,7 @@ import math
import os
from tensorflow.python.platform import gfile
+from tensorflow.python.util import compat
Path = collections.namedtuple('Path', 'path export_version')
@@ -199,7 +200,9 @@ def get_paths(base_dir, parser):
raw_paths = gfile.ListDirectory(base_dir)
paths = []
for r in raw_paths:
- p = parser(Path(os.path.join(base_dir, r), None))
+ p = parser(Path(os.path.join(compat.as_str_any(base_dir),
+ compat.as_str_any(r)),
+ None))
if p:
paths.append(p)
return sorted(paths)
diff --git a/tensorflow/contrib/learn/python/learn/utils/gc_test.py b/tensorflow/contrib/learn/python/learn/utils/gc_test.py
index 9c63096d0e..0c1a1f4327 100644
--- a/tensorflow/contrib/learn/python/learn/utils/gc_test.py
+++ b/tensorflow/contrib/learn/python/learn/utils/gc_test.py
@@ -27,6 +27,19 @@ from tensorflow.contrib.learn.python.learn.utils import gc
from tensorflow.python.framework import test_util
from tensorflow.python.platform import gfile
from tensorflow.python.platform import test
+from tensorflow.python.util import compat
+
+
+def _create_parser(base_dir):
+ # create a simple parser that pulls the export_version from the directory.
+ def parser(path):
+ match = re.match("^" + compat.as_str_any(base_dir) + "/(\\d+)$",
+ compat.as_str_any(path.path))
+ if not match:
+ return None
+ return path._replace(export_version=int(match.group(1)))
+
+ return parser
class GcTest(test_util.TensorFlowTestCase):
@@ -102,20 +115,24 @@ class GcTest(test_util.TensorFlowTestCase):
# add a base_directory to ignore
gfile.MakeDirs(os.path.join(base_dir, "ignore"))
- # create a simple parser that pulls the export_version from the directory.
- def parser(path):
- match = re.match("^" + base_dir + "/(\\d+)$", path.path)
- if not match:
- return None
- return path._replace(export_version=int(match.group(1)))
-
self.assertEquals(
- gc.get_paths(
- base_dir, parser=parser), [
- gc.Path(os.path.join(base_dir, "0"), 0),
- gc.Path(os.path.join(base_dir, "1"), 1),
- gc.Path(os.path.join(base_dir, "2"), 2)
- ])
+ gc.get_paths(base_dir, _create_parser(base_dir)),
+ [
+ gc.Path(os.path.join(base_dir, "0"), 0),
+ gc.Path(os.path.join(base_dir, "1"), 1),
+ gc.Path(os.path.join(base_dir, "2"), 2)
+ ])
+
+ def testMixedStrTypes(self):
+ temp_dir = compat.as_bytes(test.get_temp_dir())
+
+ for sub_dir in ['str', b'bytes', u'unicode']:
+ base_dir = os.path.join(
+ (temp_dir if isinstance(sub_dir, bytes) else temp_dir.decode()),
+ sub_dir)
+ self.assertFalse(gfile.Exists(base_dir))
+ gfile.MakeDirs(os.path.join(compat.as_str_any(base_dir), "42"))
+ gc.get_paths(base_dir, _create_parser(base_dir))
if __name__ == "__main__":
diff --git a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
index fa314e69c7..3f0f309253 100644
--- a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
+++ b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
@@ -309,7 +309,7 @@ def get_most_recent_export(export_dir_base):
directories.
Returns:
- A gc.Path, whith is just a namedtuple of (path, export_version).
+ A gc.Path, which is just a namedtuple of (path, export_version).
"""
select_filter = gc.largest_export_versions(1)
results = select_filter(gc.get_paths(export_dir_base,
diff --git a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils_test.py b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils_test.py
index 48222e9dd6..9e778ab72a 100644
--- a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils_test.py
+++ b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils_test.py
@@ -109,7 +109,7 @@ class SavedModelExportUtilsTest(test.TestCase):
self.assertEqual(actual_signature_def, expected_signature_def)
def test_build_standardized_signature_def_classification2(self):
- """Tests multiple output tensors that include classes and probabilites."""
+ """Tests multiple output tensors that include classes and probabilities."""
input_tensors = {
"input-1":
array_ops.placeholder(
diff --git a/tensorflow/contrib/legacy_seq2seq/python/kernel_tests/seq2seq_test.py b/tensorflow/contrib/legacy_seq2seq/python/kernel_tests/seq2seq_test.py
index 4395138e20..7ce5fb2da6 100644
--- a/tensorflow/contrib/legacy_seq2seq/python/kernel_tests/seq2seq_test.py
+++ b/tensorflow/contrib/legacy_seq2seq/python/kernel_tests/seq2seq_test.py
@@ -825,7 +825,7 @@ class Seq2SeqTest(test.TestCase):
# with variable_scope.variable_scope("new"):
# _, losses2 = SampleGRUSeq2Seq
# inp, out, weights, per_example_loss=True)
- # # First loss is scalar, the second one is a 1-dimensinal tensor.
+ # # First loss is scalar, the second one is a 1-dimensional tensor.
# self.assertEqual([], losses1[0].get_shape().as_list())
# self.assertEqual([None], losses2[0].get_shape().as_list())
diff --git a/tensorflow/contrib/linalg/python/ops/linear_operator_composition.py b/tensorflow/contrib/linalg/python/ops/linear_operator_composition.py
index 9dec621ab2..0853ea03af 100644
--- a/tensorflow/contrib/linalg/python/ops/linear_operator_composition.py
+++ b/tensorflow/contrib/linalg/python/ops/linear_operator_composition.py
@@ -79,7 +79,7 @@ class LinearOperatorComposition(linear_operator.LinearOperator):
operator_56 = LinearOperatorFullMatrix(matrix_56)
# Compose to create a [2, 3] batch of 4 x 6 operators.
- opeartor_46 = LinearOperatorComposition([operator_45, operator_56])
+ operator_46 = LinearOperatorComposition([operator_45, operator_56])
# Create a shape [2, 3, 6, 2] vector.
x = tf.random_normal(shape=[2, 3, 6, 2])
diff --git a/tensorflow/contrib/linear_optimizer/kernels/g3doc/readme.md b/tensorflow/contrib/linear_optimizer/kernels/g3doc/readme.md
index f5fc77b9c1..a4f5086dde 100644
--- a/tensorflow/contrib/linear_optimizer/kernels/g3doc/readme.md
+++ b/tensorflow/contrib/linear_optimizer/kernels/g3doc/readme.md
@@ -159,7 +159,7 @@ expected.
On criteo dataset, the usual Newton method goes out of range for a small (but
non negligible) fraction of the examples. The returned dual in these cases will
-be $$0$$ or $$\pm 1$$. The modified Newton algorihm always find the true zero
+be $$0$$ or $$\pm 1$$. The modified Newton algorithm always finds the true zero
and achieves a better log loss.
The blue lines represent the modified Newton (evaluation and training) and the
diff --git a/tensorflow/contrib/losses/README.md b/tensorflow/contrib/losses/README.md
index f373c94c1b..7b73c4483a 100644
--- a/tensorflow/contrib/losses/README.md
+++ b/tensorflow/contrib/losses/README.md
@@ -12,10 +12,10 @@ All loss functions take a pair of tensors, `predictions` and ground truth
`[batch_size, d1, ... dN]` where `batch_size` is the number
of samples in the batch and `d1` ... `dN` are the remaining dimensions.
-THe `weight` parameter can be used to adjust the relative weight samples within
+The `weight` parameter can be used to adjust the relative weight samples within
the batch. The result of each loss is a scalar average of all sample losses with
non-zero weights.
Any parameter named `logit` should be the raw model outputs, not a normalized
-probablility distribution (i.e., `[0.0, 1.0]`). `target` for losses taking
+probability distribution (i.e., `[0.0, 1.0]`). `target` for losses taking
`logit` _should_ be a normalized probability distribution.
diff --git a/tensorflow/contrib/memory_stats/python/kernel_tests/memory_stats_ops_test.py b/tensorflow/contrib/memory_stats/python/kernel_tests/memory_stats_ops_test.py
index 0f3a5f1313..ec25c032f0 100644
--- a/tensorflow/contrib/memory_stats/python/kernel_tests/memory_stats_ops_test.py
+++ b/tensorflow/contrib/memory_stats/python/kernel_tests/memory_stats_ops_test.py
@@ -49,7 +49,7 @@ class MemoryStatsOpsTest(test_util.TensorFlowTestCase):
# The memory for matrix "a" can be reused for matrix "d". Therefore, this
# computation needs space for only three matrix plus some small overhead.
def testChainOfMatmul(self):
- # MaxBytesInUse is registerd on GPU only. See kernels/memory_stats_ops.cc.
+ # MaxBytesInUse is registered on GPU only. See kernels/memory_stats_ops.cc.
if not test.is_gpu_available():
return
diff --git a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
index 54994ec617..f93b1945a6 100644
--- a/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
+++ b/tensorflow/contrib/metrics/python/ops/metric_ops_test.py
@@ -1600,7 +1600,7 @@ class StreamingAUCTest(test.TestCase):
self.assertAlmostEqual(1, auc.eval(), 6)
def np_auc(self, predictions, labels, weights):
- """Computes the AUC explicitely using Numpy.
+ """Computes the AUC explicitly using Numpy.
Args:
predictions: an ndarray with shape [N].
diff --git a/tensorflow/contrib/mpi/BUILD b/tensorflow/contrib/mpi/BUILD
new file mode 100644
index 0000000000..20ceef5004
--- /dev/null
+++ b/tensorflow/contrib/mpi/BUILD
@@ -0,0 +1,90 @@
+# Description:
+# MPI based communication interfaces and implementations for TensorFlow.
+
+package(default_visibility = [
+ "//tensorflow:__subpackages__",
+])
+
+licenses(["notice"]) # Apache 2.0
+
+exports_files(["LICENSE"])
+
+filegroup(
+ name = "all_files",
+ srcs = glob(
+ ["**/*"],
+ exclude = [
+ "**/METADATA",
+ "**/OWNERS",
+ ],
+ ),
+ visibility = ["//tensorflow:__subpackages__"],
+)
+
+filegroup(
+ name = "c_srcs",
+ data = glob([
+ "**/*.cc",
+ "**/*.h",
+ ]),
+)
+
+# For platform specific build config
+load(
+ "//tensorflow/core:platform/default/build_config.bzl",
+ "tf_proto_library_cc",
+)
+
+tf_proto_library_cc(
+ name = "mpi_msg_proto",
+ srcs = ["mpi_msg.proto"],
+ cc_api_version = 2,
+ protodeps = ["//tensorflow/core:worker_proto"],
+ visibility = [
+ "//tensorflow:__subpackages__",
+ ],
+)
+
+cc_library(
+ name = "mpi_utils",
+ srcs = ["mpi_utils.cc"],
+ hdrs = ["mpi_utils.h"],
+ deps = [
+ "//tensorflow/core:core_cpu_internal",
+ "//tensorflow/core:framework",
+ "//tensorflow/core:lib",
+ "//third_party/mpi",
+ ],
+)
+
+cc_library(
+ name = "mpi_rendezvous_mgr",
+ srcs = ["mpi_rendezvous_mgr.cc"],
+ hdrs = ["mpi_rendezvous_mgr.h"],
+ deps = [
+ ":mpi_msg_proto_cc",
+ ":mpi_utils",
+ "//tensorflow/core:core_cpu_internal",
+ "//tensorflow/core:framework",
+ "//tensorflow/core:gpu_runtime",
+ "//tensorflow/core:lib",
+ "//tensorflow/core:protos_cc",
+ "//tensorflow/core:worker_proto_cc",
+ "//tensorflow/core/distributed_runtime:base_rendezvous_mgr",
+ "//tensorflow/core/distributed_runtime:session_mgr",
+ "//tensorflow/core/distributed_runtime:worker_env",
+ "//third_party/mpi",
+ ],
+)
+
+cc_library(
+ name = "mpi_server_lib",
+ srcs = ["mpi_server_lib.cc"],
+ hdrs = ["mpi_server_lib.h"],
+ linkstatic = 1, # Seems to be needed since alwayslink is broken in bazel
+ deps = [
+ ":mpi_rendezvous_mgr",
+ "//tensorflow/core/distributed_runtime/rpc:grpc_server_lib",
+ ],
+ alwayslink = 1,
+)
diff --git a/tensorflow/contrib/mpi/README.md b/tensorflow/contrib/mpi/README.md
new file mode 100644
index 0000000000..b0d03d05a2
--- /dev/null
+++ b/tensorflow/contrib/mpi/README.md
@@ -0,0 +1,94 @@
+## How to compile and use MPI-enabled TensorFlow
+
+1. Follow the regular TF compilation instructions. During configure step, if you want MPI support, answer yes to this question:
+
+ ```Do you wish to build TensorFlow with MPI support [y/N]```
+
+2. To turn on the MPI connection, add the protocol "grpc+mpi" in the server definition:
+
+ ```server = tf.train.Server(cluster, job_name="local", task_index=0, protocol='grpc+mpi') # default protocol is 'grpc'```
+
+## Overview
+
+By using this protocol TensorFlow can take advantage of the high performance networking primitives that are offered via the MPI API. This enables TensorFlow to take advantage of high performance low latency networks such as Infiniband. These changes are largely transparent to the user who only has to change the offered protocol and launch the script using the 'mpirun' launcher. For example:
+ ```mpirun -np 2 python my_neuralnet.py ```
+
+
+
+
+
+## Runtime options
+
+The following environment variables can be set to modify the behavior at runtime:
+
+**MPI_DISABLED=[0,1]**
+
+This environment variable allows you to disable the MPI path before launch (e.g. for performance or correctness testing).
+
+**MPI_OPTIMAL_PATH=[0,1]**
+
+When set to 0 it will use the default path where tensors are encoded to ProtoText before being copied to a remote process. When set to 1 a more optimal path will be taken where only the tensor description is encoded while the actual tensor data is transferred directly from the source buffer to the destination buffer.
+This path is disabled by default as it requires that the MPI library can directly access the pointer to the data. For CPU backed buffers this is no problem, however for GPU backed buffers this requires MPI libraries that are built with CUDA support (CUDA Aware). When using non-CUDA aware MPI libraries and GPU buffers you will get segmentation faults.
+
+
+
+## Known problems
+
+For certain complex neural nets the implementation sometimes crashes inside the MPI libraries. This seems to be related to memory allocations/routines that register the memory for the Infiniband transfers. (The crashes do not happen when all MPI processes are within the same physical machine).
+
+**MVAPICH**
+- The problem manifests itself with a segmentation fault inside a memory copy routine and during startup you will get the following warning: "WARNING: Error in initializing MVAPICH2 ptmalloc library. Continuing without InfiniBand registration cache support."
+
+**OpenMPI**
+- With OpenMPI corrupt data will be received resulting in an assertion or the MPI library will print an error and exit. The error is "Attempt to free memory that is still in use by an ongoing MPI communication. MPI job will now abort."
+
+## Implementation details
+
+
+The implementation takes over the responsibility for sending and receiving tensors between separate processes. This is facilitated by TensorFlow's ability to support different protocols. In this particular implementation, the standard gRPC library is used for all administrative operations while the MPI functions take over the tensor exchanges. On the sending side the tensors are placed in the standard waiting tables and nothing is changed there. On the receiving side the RecvFromRemoteAsync function is newly implemented and instead of requesting the data via gRPC the data is now requested via MPI calls.
+
+To this end once the code is loaded a dedicated thread will be launched that handles all MPI operations. This thread will loop through a set of operations:
+
+* Send requests placed on the request queue to the sending process
+Once a request for a tensor is received two callbacks are created. The first one is to request the tensor and the second one is executed once the requested data has arrived. To this end the request is placed in a queue and will be sent once the MPI thread services the queue. This sending is done using non-blocking MPI_Isend operations.
+
+* Send tensor data in response to a request call
+Once a request has arrived from a remote process the request is forwarded to the original TensorFlow code which looks up the tensor in the waiting table. Once the tensor has been found a callback is executed which places the found tensor on the sendQueue for the MPI thread. Once the sendQueue is served the tensor data will be sent using non-blocking send operations (MPI_Isend) to the remote process.
+
+* Receive tensor request
+The MPI thread will check if there are any incoming tensor request messages on the communication lines using MPI_Iprobe. Once a request has been received it will be passed on to the standard TensorFlow code and eventually will be placed on the sendQueue.
+
+* Receive tensor
+At some point after a request has been sent the remote process will transmit the tensor. This tensor will be received and we look-up the callback that is associated with this tensor in our request table and execute the callback on the received data.
+
+
+In the implementation all send operations are non-blocking, all probe operations are non-blocking and all receive-operations are blocking. The receive-operations are only executed after the probe has determined that there is something to receive.
+The MPI processes identify each other using an MPI process ID. The TensorFlow gRPC processes identify each other using a name. During launch we create a mapping between the TensorFlow process name and the MPI process ID to allow the processes to communicate with the correct destinations when using MPI operations.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/tensorflow/contrib/mpi/mpi_msg.proto b/tensorflow/contrib/mpi/mpi_msg.proto
new file mode 100644
index 0000000000..36f1504901
--- /dev/null
+++ b/tensorflow/contrib/mpi/mpi_msg.proto
@@ -0,0 +1,19 @@
+
+syntax = "proto3";
+
+package tensorflow;
+option cc_enable_arenas = true;
+
+import "tensorflow/core/protobuf/worker.proto";
+
+
+message MPIRecvTensorResponse {
+ RecvTensorResponse response = 1;
+ bool singleSend = 2;
+ string key = 3;
+ int64 step_id = 4;
+ uint64 checksum = 5;
+}
+
+
+
diff --git a/tensorflow/contrib/mpi/mpi_rendezvous_mgr.cc b/tensorflow/contrib/mpi/mpi_rendezvous_mgr.cc
new file mode 100644
index 0000000000..e97e8d0163
--- /dev/null
+++ b/tensorflow/contrib/mpi/mpi_rendezvous_mgr.cc
@@ -0,0 +1,315 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef TENSORFLOW_USE_MPI
+
+#include "tensorflow/contrib/mpi/mpi_rendezvous_mgr.h"
+
+#include <chrono>
+#include <functional>
+#include <memory>
+#include <string>
+#include <utility>
+#include <vector>
+
+#include "tensorflow/core/distributed_runtime/tensor_coding.h"
+#include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
+#include "tensorflow/core/distributed_runtime/session_mgr.h"
+
+namespace tensorflow {
+
+MPIRendezvousMgr::MPIRendezvousMgr(const WorkerEnv* env)
+ : BaseRendezvousMgr(env), worker_env_2(env), use_optimal_transfer_(false) {
+
+ const char* mpienv = getenv("MPI_OPTIMAL_PATH");
+ if (mpienv && mpienv[0] == '1') {
+ LOG(INFO) << "MPI Optimal copy path enabled (Requires CUDA-Aware MPI when "
+ "using GPUs)\n";
+ use_optimal_transfer_ = true;
+ }
+
+ // extract worker-name
+ auto parsed = env->local_devices[0]->parsed_name();
+ const std::string task_id = strings::StrCat(parsed.job, ":", parsed.replica);
+
+ mpiutils_ = new MPIUtils(task_id);
+ background_thread_ =
+ std::thread(&MPIRendezvousMgr::MPIBackgroundThread, this);
+}
+
+BaseRemoteRendezvous* MPIRendezvousMgr::Create(int64 step_id,
+ const WorkerEnv* worker_env) {
+ return new MPIRemoteRendezvous(worker_env, step_id, mpiutils_, this);
+}
+
+void MPIRemoteRendezvous::RecvFromRemoteAsync(
+ const Rendezvous::ParsedKey& parsed, const Rendezvous::Args& recv_args,
+ DoneCallback done) {
+
+ Status s = Status::OK();
+ MPIRequestTensorCall* rendezvous_call = new MPIRequestTensorCall();
+
+ VLOG(2) << "MPI User requested " << parsed.FullKey()
+ << " @ step: " << step_id_;
+
+ std::string src_task =
+ strings::StrCat(parsed.src.job, ":", parsed.src.replica);
+ const int dst = mpiutils_->GetSourceID(src_task);
+
+ Device* dst_device;
+ if (s.ok()) {
+ s = env_->device_mgr->LookupDevice(parsed.dst_device, &dst_device);
+ CHECK(s.ok()) << "Device lookup failed";
+ } else {
+ done(s, Args(), recv_args, Tensor{}, false);
+ return;
+ }
+
+ // Set properties of the request object and create the request function
+ rendezvous_call->Init(parsed, step_id_);
+
+ std::function<void()> request_call = [parsed, dst, rendezvous_call]() {
+ // Use MPI_Alloc_mem here to force allocation inside MPI thread
+ // this is not optimal, but prevents memory corruption and segmentation
+ // faults during inter-server transfers...
+ MPI_CHECK(MPI_Alloc_mem(rendezvous_call->request_buffer_size_,
+ MPI_INFO_NULL, &rendezvous_call->request_buffer_));
+ rendezvous_call->req_.SerializeToArray(
+ rendezvous_call->request_buffer_,
+ rendezvous_call->request_buffer_size_);
+ MPI_CHECK(MPI_Isend(rendezvous_call->request_buffer_,
+ rendezvous_call->request_buffer_size_, MPI_CHAR, dst,
+ TAG_REQTENSOR, MPI_COMM_WORLD,
+ &rendezvous_call->mpi_request_));
+ };
+
+ // Create the function which is called when the Tensor is sent by the remote process
+ const int64 temp1 = step_id_;
+ rendezvous_call->recv_call_ =
+ [this, parsed, recv_args, done, dst, temp1, rendezvous_call](
+ MPIRecvTensorResponse mpi_response) {
+ Status s;
+ Device* dst_device;
+ if (s.ok()) {
+ s = env_->device_mgr->LookupDevice(parsed.dst_device, &dst_device);
+ CHECK(s.ok()) << "Device lookup failed";
+ }
+
+ VLOG(3) << "MPI Received tensor " << parsed.FullKey()
+ << " @ step: " << temp1
+ << " single-send: " << mpi_response.singlesend();
+
+ Tensor val;
+ if (mpi_response.singlesend()) {
+ dst_device->MakeTensorFromProto(mpi_response.response().tensor(),
+ recv_args.alloc_attrs, &val);
+ } else {
+ TensorResponse tr;
+ tr.InitAlloc(dst_device, recv_args.alloc_attrs);
+ tr.InitPartial(mpi_response.response());
+ const size_t nBytes = tr.tensor().TotalBytes();
+ void* data = const_cast<void*>(DMAHelper::base(&tr.tensor()));
+ MPI_Status status;
+ MPI_CHECK(MPI_Recv(data, static_cast<int>(nBytes), MPI_BYTE, dst,
+ TAG_SENDTENSOR2, MPI_COMM_WORLD, &status));
+ val = std::move(tr.tensor());
+ }
+
+ done(s, Args(), recv_args, val, mpi_response.response().is_dead());
+ };
+
+ MPIRendezvousMgr* mgr =
+ reinterpret_cast<MPIRendezvousMgr*>(this->rendezvous_mgr_);
+ mgr->QueueRequest(parsed.FullKey().ToString(), step_id_,
+ std::move(request_call), rendezvous_call);
+}
+
+MPIRemoteRendezvous::~MPIRemoteRendezvous() {
+ MPIRendezvousMgr* mgr =
+ reinterpret_cast<MPIRendezvousMgr*>(this->rendezvous_mgr_);
+ mgr->RemoveStepID(step_id_);
+}
+
+/*
+ * Add the request for one of our Tensors by a remote process
+ * to the local send table. The callback created here will
+ * be called once the Tensor data has arrived and is
+ * ready to be sent to the remote requester.
+ */
+void MPIRendezvousMgr::AddRequest(RecvTensorRequest request,
+ const int mpi_dst) {
+ const int64 step_id = request.step_id();
+ const std::string& key = request.rendezvous_key();
+ Rendezvous::ParsedKey parsed;
+ TF_CHECK_OK(Rendezvous::ParseKey(key, &parsed));
+
+ MPIRecvTensorCallBack send_cb = [this, mpi_dst, parsed](
+ const Status& status, const Rendezvous::Args& send_args,
+ const Rendezvous::Args& recv_args, const Tensor& val, bool is_dead,
+ MPISendTensorCall* mpi_send_call) {
+ // TODO(jbedorf) this should be a loop over max size
+ CHECK(mpi_send_call->mRes_.ByteSize() < INT_MAX)
+ << "Buffer too large for single transfer";
+ MPI_CHECK(MPI_Alloc_mem(mpi_send_call->mRes_.ByteSize(), MPI_INFO_NULL,
+ &mpi_send_call->send_buffer_));
+ mpi_send_call->mRes_.SerializeToArray(mpi_send_call->send_buffer_,
+ mpi_send_call->mRes_.ByteSize());
+
+ MPI_CHECK(MPI_Isend(mpi_send_call->send_buffer_,
+ static_cast<int>(mpi_send_call->mRes_.ByteSize()),
+ MPI_CHAR, mpi_dst, TAG_SENDTENSOR, MPI_COMM_WORLD,
+ &(mpi_send_call->msg1_)));
+ MPI_CHECK(MPI_Test(&mpi_send_call->msg1_, &mpi_send_call->done1_,
+ MPI_STATUS_IGNORE));
+
+ if (!mpi_send_call->mRes_.singlesend()) {
+ const int tensor_size = static_cast<int>(val.TotalBytes());
+ void* temp = const_cast<void*>(DMAHelper::base(&val));
+
+ // If the MPI library is not GPU aware there should be a data transfer
+ // here to get the data on the host.
+ // if(src_dev->tensorflow_gpu_device_info()) //memcpy to send_buffer2_
+
+ // TODO(jbedorf) this should be a loop over max size
+ MPI_CHECK(MPI_Isend(temp, tensor_size, MPI_CHAR, mpi_dst, TAG_SENDTENSOR2,
+ MPI_COMM_WORLD, &mpi_send_call->msg2_));
+ mpi_send_call->done2_ = 0;
+ }
+ return mpi_send_call;
+ };
+
+ // Wrapper around the read callback to place the callback on our queue
+ Rendezvous::DoneCallback done_cb = [this, parsed, step_id, send_cb](
+ const Status& status, const Rendezvous::Args& send_args,
+ const Rendezvous::Args& recv_args, const Tensor& val, bool is_dead) {
+ if (!status.ok()) {
+ CHECK(status.ok()) << "RecvLocalAsync was not ok, key: "
+ << parsed.FullKey() << " step: " << step_id
+ << " error message: " << status.error_message();
+ return;
+ }
+
+ VLOG(3) << "MPI Sending tensor " << parsed.FullKey()
+ << " @ step: " << step_id << std::endl;
+
+ auto mpi_send_call = new MPISendTensorCall();
+ mpi_send_call->Init(parsed, step_id, is_dead);
+
+ Device* src_dev = nullptr;
+ Status s = this->worker_env_2->device_mgr->LookupDevice(parsed.src_device,
+ &src_dev);
+ CHECK(s.ok()) << "src device not found";
+
+ // Control if shape and data should be sent together or if we can optimize
+ // it in two different transfers, thereby reducing memory copies
+ bool doOptimalTransfer = true;
+ if (!DataTypeCanUseMemcpy(val.dtype())) doOptimalTransfer = false;
+ if (val.TotalBytes() < 1024) doOptimalTransfer = false;
+
+ doOptimalTransfer = doOptimalTransfer && use_optimal_transfer_;
+
+ if (doOptimalTransfer) {
+ // First send the Tensor description and in a follow up transfer the data
+ mpi_send_call->mRes_.mutable_response()->mutable_tensor()->set_dtype(
+ val.dtype());
+ val.shape().AsProto(mpi_send_call->mRes_.mutable_response()
+ ->mutable_tensor()
+ ->mutable_tensor_shape());
+ mpi_send_call->mRes_.set_singlesend(false);
+ } else {
+ // Send the Tensor description and data in a single transfer
+ if (src_dev->tensorflow_gpu_device_info() &&
+ (!send_args.alloc_attrs.on_host())) {
+ Notification n;
+ GPUUtil::SetProtoFromGPU(
+ val, src_dev, send_args.device_context,
+ mpi_send_call->mRes_.mutable_response()->mutable_tensor(), is_dead,
+ [&n, &s](const Status& s_) {
+ s = s_;
+ n.Notify();
+ });
+ n.WaitForNotification();
+ } else {
+ val.AsProtoTensorContent(
+ mpi_send_call->mRes_.mutable_response()->mutable_tensor());
+ }
+ }
+
+ std::function<MPISendTensorCall*()> res = std::bind(
+ send_cb, status, send_args, recv_args, val, is_dead, mpi_send_call);
+
+ SendQueueEntry req(parsed.FullKey().ToString().c_str(), std::move(res));
+
+ this->QueueSendRequest(req);
+
+ // Wait for the notification that indicates the tensor has been
+ // successfully transmitted to the remote process. Only needed if we
+ // have not serialized the tensor into the proto
+ if (doOptimalTransfer) mpi_send_call->n_.WaitForNotification();
+ }; // done_cb
+
+ worker_env_2->compute_pool->Schedule([this, step_id, parsed, done_cb]() {
+ this->RecvLocalAsync(step_id, parsed, done_cb);
+ });
+}
+
+void MPIRendezvousMgr::MPIBackgroundThread() {
+ std::list<std::unique_ptr<MPISendTensorCall>> active_sends;
+
+ while (1) {
+ MPI_Status status;
+
+ // Check for incoming Tensor requests
+ RecvTensorRequest request;
+ if (ProbeForData(TAG_REQTENSOR, &status, &request)) {
+ this->AddRequest(request, status.MPI_SOURCE);
+ }
+
+ // Check for incoming Tensor reply
+ MPIRecvTensorResponse mRes;
+ if (ProbeForData(TAG_SENDTENSOR, &status, &mRes)) {
+ const int64 step_id = mRes.step_id();
+ std::string key = mRes.key();
+
+ std::shared_ptr<MPIRequestTensorCall> call;
+ GetRecvCall(step_id, key, &call);
+ call->recv_call_(mRes);
+ RemoveRecvCall(step_id, key);
+ }
+
+ // Remove sends that have been completed
+ active_sends.remove_if([](std::unique_ptr<MPISendTensorCall>& i) {
+ return i->IsFinished();
+ });
+
+ // send a Tensor request
+ RequestQueueEntry req;
+ if (GetRequest(&req)) req.second();
+
+ // Send a Tensor response
+ SendQueueEntry send;
+ if (GetResponse(&send)) {
+ std::unique_ptr<MPISendTensorCall> p(send.second());
+ active_sends.push_back(std::move(p));
+ }
+
+ // std::this_thread::sleep_for(std::chrono::microseconds(1));
+ }
+}
+
+} // namespace tensorflow
+#endif // TENSORFLOW_USE_MPI
diff --git a/tensorflow/contrib/mpi/mpi_rendezvous_mgr.h b/tensorflow/contrib/mpi/mpi_rendezvous_mgr.h
new file mode 100644
index 0000000000..50fc380496
--- /dev/null
+++ b/tensorflow/contrib/mpi/mpi_rendezvous_mgr.h
@@ -0,0 +1,260 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_MPI_MPI_RENDEZVOUS_MGR_H_
+#define TENSORFLOW_CONTRIB_MPI_MPI_RENDEZVOUS_MGR_H_
+
+#ifdef TENSORFLOW_USE_MPI
+
+#include <queue>
+#include <thread>
+#include <list>
+#include <string>
+#include <memory>
+#include <map>
+#include <unordered_map>
+#include <utility>
+#include <vector>
+
+#include <iostream>
+
+#include "tensorflow/contrib/mpi/mpi_utils.h"
+#include "tensorflow/core/distributed_runtime/base_rendezvous_mgr.h"
+#include "tensorflow/core/distributed_runtime/worker_env.h"
+#include "tensorflow/contrib/mpi/mpi_msg.pb.h"
+#include "tensorflow/core/protobuf/worker.pb.h"
+
+#define TAG_REQTENSOR 1010
+#define TAG_SENDTENSOR 2020
+#define TAG_SENDTENSOR2 3030
+
+namespace tensorflow {
+
+class MPISendTensorCall {
+ public:
+ char* send_buffer_;
+ char* send_buffer2_;
+
+ MPI_Request msg1_;
+ MPI_Request msg2_;
+ int done1_; // Int instead of bool for simpler IsFinished logic
+ int done2_;
+ MPIRecvTensorResponse mRes_;
+ Notification n_;
+
+ MPISendTensorCall()
+ : send_buffer_(nullptr), send_buffer2_(nullptr), done1_(1), done2_(1) {}
+
+ ~MPISendTensorCall() {
+ MPI_CHECK(MPI_Wait(&msg1_, MPI_STATUS_IGNORE));
+ n_.Notify();
+ MPI_CHECK(MPI_Free_mem(send_buffer_));
+ // delete[] send_buffer_;
+ delete[] send_buffer2_;
+ }
+
+ MPISendTensorCall(MPISendTensorCall&&) = delete;
+
+ void Init(const Rendezvous::ParsedKey& parsed, const int64 step_id,
+ const bool is_dead) {
+ mRes_.set_key(parsed.FullKey().ToString());
+ mRes_.set_step_id(step_id);
+ mRes_.mutable_response()->set_is_dead(is_dead);
+ mRes_.mutable_response()->set_send_start_micros(
+ Env::Default()->NowMicros());
+ mRes_.set_singlesend(true);
+ }
+
+ bool IsFinished() {
+ MPI_Status status;
+ if (!done1_) MPI_CHECK(MPI_Test(&msg1_, &done1_, &status));
+ if (!done2_) MPI_CHECK(MPI_Test(&msg2_, &done2_, &status));
+ return done1_ && done2_;
+ }
+};
+
+class MPIRequestTensorCall {
+ public:
+ Rendezvous::DoneCallback done_;
+ RecvTensorRequest req_;
+ MPI_Request mpi_request_;
+ char* request_buffer_;
+ size_t request_buffer_size_;
+ std::function<void(MPIRecvTensorResponse)> recv_call_;
+
+ MPIRequestTensorCall() : request_buffer_(nullptr) {}
+ ~MPIRequestTensorCall() {
+ MPI_CHECK(MPI_Wait(&mpi_request_, MPI_STATUS_IGNORE));
+ // delete[] request_buffer_;
+ MPI_CHECK(MPI_Free_mem(request_buffer_));
+ }
+
+ void Init(const Rendezvous::ParsedKey& parsed, const int64 step_id) {
+ req_.set_step_id(step_id);
+ req_.set_rendezvous_key(parsed.FullKey().data(), parsed.FullKey().size());
+ request_buffer_size_ = req_.ByteSize();
+ // request_buffer_ = new char[request_buffer_size_];
+ // req_.SerializeToArray(request_buffer_, request_buffer_size_);
+ }
+};
+
+class MPIRemoteRendezvous : public BaseRemoteRendezvous {
+ public:
+ MPIRemoteRendezvous(const WorkerEnv* env, int64 step_id, const MPIUtils* util,
+ BaseRendezvousMgr* mgr_)
+ : BaseRemoteRendezvous(env, step_id, false),
+ mpiutils_(util),
+ rendezvous_mgr_(mgr_) {}
+
+ protected:
+ void RecvFromRemoteAsync(const Rendezvous::ParsedKey& parsed,
+ const Rendezvous::Args& args,
+ DoneCallback done) override;
+
+ private:
+ ~MPIRemoteRendezvous() override;
+
+ const MPIUtils* mpiutils_;
+ BaseRendezvousMgr* rendezvous_mgr_;
+
+ TF_DISALLOW_COPY_AND_ASSIGN(MPIRemoteRendezvous);
+};
+
+class MPIRendezvousMgr : public BaseRendezvousMgr {
+ public:
+ explicit MPIRendezvousMgr(const WorkerEnv* env);
+ ~MPIRendezvousMgr() {
+ delete mpiutils_;
+ fprintf(stderr, "Delete MPIRendezvousMgr \n");
+ // TODO(jbedorf) stop background_thread_
+ MPI_CHECK(MPI_Finalize());
+ }
+
+ void QueueRequest(std::string key, int64 step_id,
+ std::function<void()> request_call,
+ MPIRequestTensorCall* rCall) {
+ mutex_lock l(mrq_);
+ request_queue_.push(RequestQueueEntry(key, std::move(request_call)));
+ recv_tensor_map_[step_id][key] =
+ std::shared_ptr<MPIRequestTensorCall>(rCall);
+ }
+
+ void RemoveStepID(const int64 step_id) {
+ mutex_lock l(mrq_);
+ CHECK(recv_tensor_map_[step_id].size() == 0) << "Removing unfinished step";
+ recv_tensor_map_.erase(step_id);
+ // TODO(jbedorf) Should we verify that the step_id is clear before remove?
+ }
+
+ protected:
+ BaseRemoteRendezvous* Create(int64 step_id,
+ const WorkerEnv* worker_env) override;
+
+ private:
+ typedef std::function<MPISendTensorCall*(
+ const Status&, const Rendezvous::Args&, const Rendezvous::Args&,
+ const Tensor&, const bool, MPISendTensorCall*)> MPIRecvTensorCallBack;
+
+ typedef std::pair<std::string, std::function<void()>> RequestQueueEntry;
+ typedef std::pair<std::string, std::function<MPISendTensorCall*()>>
+ SendQueueEntry;
+
+ const WorkerEnv* worker_env_2;
+ std::thread background_thread_;
+ MPIUtils* mpiutils_;
+ bool use_optimal_transfer_;
+
+ mutex msq_;
+ mutex mrq_;
+
+ std::queue<SendQueueEntry> send_queue_ GUARDED_BY(msq_);
+ std::queue<RequestQueueEntry> request_queue_ GUARDED_BY(mrq_);
+ std::map<int64, std::unordered_map<std::string,
+ std::shared_ptr<MPIRequestTensorCall>>>
+ recv_tensor_map_ GUARDED_BY(mrq_);
+
+ void AddRequest(RecvTensorRequest, const int);
+ void MPIBackgroundThread();
+
+ void QueueSendRequest(SendQueueEntry req) {
+ mutex_lock l(msq_);
+ send_queue_.push(req);
+ }
+
+ void GetRecvCall(const int64 step_id, const std::string& key,
+ std::shared_ptr<MPIRequestTensorCall>* call) {
+ mutex_lock l(mrq_);
+ if (recv_tensor_map_.find(step_id) == recv_tensor_map_.end()) {
+ LOG(FATAL) << "Step not found in recv_tensor_map_, step: " << step_id
+ << " key: " << key << std::endl;
+ }
+ if (recv_tensor_map_[step_id].find(key) !=
+ recv_tensor_map_[step_id].end()) {
+ *call = recv_tensor_map_[step_id][key];
+ } else {
+ LOG(FATAL) << "Key not found in recv_tensor_map_, step: " << step_id
+ << " key: " << key << std::endl;
+ }
+ }
+
+ void RemoveRecvCall(const int64 step_id, const std::string& key) {
+ mutex_lock l(mrq_);
+ recv_tensor_map_[step_id].erase(key);
+ }
+
+ bool GetRequest(RequestQueueEntry* req) {
+ mutex_lock l(mrq_);
+ if (!request_queue_.empty()) {
+ *req = request_queue_.front();
+ request_queue_.pop();
+ return true;
+ }
+ return false;
+ }
+
+ bool GetResponse(SendQueueEntry* send) {
+ mutex_lock l(msq_);
+ if (!send_queue_.empty()) {
+ *send = send_queue_.front();
+ send_queue_.pop();
+ return true;
+ }
+ return false;
+ }
+
+ template <typename T>
+ int ProbeForData(const int tag, MPI_Status* status, T* obj) {
+ int flag = 0, msg_size = 0;
+ MPI_Message msg;
+ // Receive the message, probe as size is variable
+ MPI_CHECK(
+ MPI_Improbe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &flag, &msg, status));
+ if (flag) {
+ MPI_CHECK(MPI_Get_count(status, MPI_CHAR, &msg_size));
+ MPI_Status stat2;
+ std::vector<char> request_buffer_(msg_size);
+ MPI_Mrecv(&request_buffer_[0], msg_size, MPI_CHAR, &msg, &stat2);
+ bool res = obj->ParseFromArray(&request_buffer_[0], msg_size);
+ CHECK(res) << "Failed to parse incomming message";
+ }
+ return flag;
+ }
+
+ TF_DISALLOW_COPY_AND_ASSIGN(MPIRendezvousMgr);
+}; // MPIRendezvousMgr
+} // namespace tensorflow
+
+#endif // TENSORFLOW_USE_MPI
+#endif // TENSORFLOW_CONTRIB_MPI_MPI_RENDEZVOUS_MGR_H_
diff --git a/tensorflow/contrib/mpi/mpi_server_lib.cc b/tensorflow/contrib/mpi/mpi_server_lib.cc
new file mode 100644
index 0000000000..3b2fba97a9
--- /dev/null
+++ b/tensorflow/contrib/mpi/mpi_server_lib.cc
@@ -0,0 +1,110 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef TENSORFLOW_USE_MPI
+
+#include "tensorflow/contrib/mpi/mpi_server_lib.h"
+
+#include <string>
+#include <utility>
+
+#include "tensorflow/core/distributed_runtime/server_lib.h"
+#include "tensorflow/core/distributed_runtime/rpc/rpc_rendezvous_mgr.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/env.h"
+
+namespace tensorflow {
+
+namespace {
+// static utility function
+RendezvousMgrInterface* NewMPIRendezvousMgr(const WorkerEnv* env) {
+ // Runtime check to disable the MPI path
+ const char* mpienv = getenv("MPI_DISABLED");
+ if (mpienv && mpienv[0] == '1') {
+ LOG(INFO) << "MPI path disabled by environment variable\n";
+ return new RpcRendezvousMgr(env);
+ } else {
+ return new MPIRendezvousMgr(env);
+ }
+}
+
+} // namespace
+
+MPIServer::MPIServer(const ServerDef& server_def, Env* env)
+ : GrpcServer(server_def, env) {}
+
+MPIServer::~MPIServer() {
+ TF_CHECK_OK(Stop());
+ TF_CHECK_OK(Join());
+}
+
+Status MPIServer::Init(ServiceInitFunction service_func,
+ RendezvousMgrCreationFunction rendezvous_mgr_func) {
+ Status s = GrpcServer::Init(service_func, rendezvous_mgr_func);
+ return s;
+}
+
+Status MPIServer::Start() {
+ Status s = GrpcServer::Start();
+ return s;
+}
+
+Status MPIServer::Join() {
+ Status s = GrpcServer::Join();
+ return s;
+}
+
+/* static */
+Status MPIServer::Create(const ServerDef& server_def, Env* env,
+ std::unique_ptr<ServerInterface>* out_server) {
+ std::unique_ptr<MPIServer> ret(new MPIServer(server_def, Env::Default()));
+ ServiceInitFunction service_func = nullptr;
+ TF_RETURN_IF_ERROR(ret->Init(service_func, NewMPIRendezvousMgr));
+ *out_server = std::move(ret);
+ return Status::OK();
+}
+
+namespace {
+
+class MPIServerFactory : public ServerFactory {
+ public:
+ bool AcceptsOptions(const ServerDef& server_def) override {
+ return server_def.protocol() == "grpc+mpi";
+ }
+
+ Status NewServer(const ServerDef& server_def,
+ std::unique_ptr<ServerInterface>* out_server) override {
+ return MPIServer::Create(server_def, Env::Default(), out_server);
+ }
+};
+
+// Registers a `ServerFactory` for `MPIServer` instances.
+class MPIServerRegistrar {
+ public:
+ MPIServerRegistrar() {
+ gpr_allocation_functions alloc_fns;
+ alloc_fns.malloc_fn = port::Malloc;
+ alloc_fns.realloc_fn = port::Realloc;
+ alloc_fns.free_fn = port::Free;
+ gpr_set_allocation_functions(alloc_fns);
+ ServerFactory::Register("MPI_SERVER", new MPIServerFactory());
+ }
+};
+static MPIServerRegistrar registrar;
+
+} // namespace
+} // namespace tensorflow
+
+#endif // TENSORFLOW_USE_MPI
diff --git a/tensorflow/contrib/mpi/mpi_server_lib.h b/tensorflow/contrib/mpi/mpi_server_lib.h
new file mode 100644
index 0000000000..736f6922a1
--- /dev/null
+++ b/tensorflow/contrib/mpi/mpi_server_lib.h
@@ -0,0 +1,54 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_MPI_MPI_SERVER_LIB_H_
+#define TENSORFLOW_CONTRIB_MPI_MPI_SERVER_LIB_H_
+
+#ifdef TENSORFLOW_USE_MPI
+
+#include <memory>
+
+#include "tensorflow/contrib/mpi/mpi_rendezvous_mgr.h"
+#include "tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h"
+
+namespace tensorflow {
+
+class MPIServer : public GrpcServer {
+ protected:
+ MPIServer(const ServerDef& server_def, Env* env);
+
+ public:
+ static Status Create(const ServerDef& server_def, Env* env,
+ std::unique_ptr<ServerInterface>* out_server);
+
+ // Destruction is only supported in the factory method. Clean
+ // shutdown is not currently implemented for this server type.
+ ~MPIServer() override;
+
+ // Implementations of ServerInterface methods.
+ Status Start() override;
+ Status Join() override;
+
+ protected:
+ Status Init(ServiceInitFunction service_func,
+ RendezvousMgrCreationFunction rendezvous_mgr_func);
+ Status ChannelCacheFactory(const ServerDef& server_def,
+ GrpcChannelCache** channel_cache);
+};
+
+} // namespace tensorflow
+
+#endif // TENSORFLOW_USE_MPI
+#endif // TENSORFLOW_CONTRIB_MPI_MPI_SERVER_LIB_H_
diff --git a/tensorflow/contrib/mpi/mpi_utils.cc b/tensorflow/contrib/mpi/mpi_utils.cc
new file mode 100644
index 0000000000..b8e7d1a274
--- /dev/null
+++ b/tensorflow/contrib/mpi/mpi_utils.cc
@@ -0,0 +1,72 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef TENSORFLOW_USE_MPI
+
+#include "tensorflow/contrib/mpi/mpi_utils.h"
+namespace tensorflow {
+
+#define max_worker_name_length 128
+
+MPIUtils::MPIUtils(const std::string& worker_name) {
+ InitMPI();
+ // Connect the MPI process IDs to the worker names that are used by TF.
+ // Gather the names of all the active processes (name can't be longer than
+ // 128 bytes)
+ int proc_id = 0, number_of_procs = 1;
+ char my_name[max_worker_name_length];
+ MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD, &proc_id));
+ MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD, &number_of_procs));
+
+ CHECK(worker_name.size() < max_worker_name_length)
+ << "Specified worker name is too long.";
+ snprintf(my_name, max_worker_name_length, worker_name.c_str());
+ std::vector<char> worker_names(number_of_procs * max_worker_name_length);
+ MPI_CHECK(MPI_Allgather(my_name, max_worker_name_length, MPI_CHAR,
+ &worker_names[0], max_worker_name_length, MPI_CHAR,
+ MPI_COMM_WORLD));
+
+ if (proc_id == 0) LOG(INFO) << "MPI process-ID to gRPC server name map: \n";
+ for (int i = 0; i < number_of_procs; i++) {
+ name_to_id_[std::string(&worker_names[i * 128])] = i;
+ if (proc_id == 0)
+ LOG(INFO) << "Process: " << i
+ << "\tgRPC-name: " << std::string(&worker_names[i * 128])
+ << std::endl;
+ }
+}
+
+void MPIUtils::InitMPI() {
+ // Initialize the MPI environment if that hasn't been done
+ int flag = 0;
+ MPI_CHECK(MPI_Initialized(&flag));
+ if (!flag) {
+ int proc_id = 0, number_of_procs = 1, len = -1;
+ char my_host_name[max_worker_name_length];
+ // MPI_CHECK(MPI_Init_thread(0, 0, MPI_THREAD_MULTIPLE, &flag));
+ MPI_CHECK(MPI_Init(0, 0));
+ MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD, &proc_id));
+ MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD, &number_of_procs));
+ MPI_CHECK(MPI_Get_processor_name(my_host_name, &len));
+ fprintf(stderr,
+ "MPI Environment initialised. Process id: %d Total processes: %d "
+ "|| Hostname: %s \n",
+ proc_id, number_of_procs, my_host_name);
+ }
+}
+
+} // namespace tensorflow
+
+#endif // TENSORFLOW_USE_MPI
diff --git a/tensorflow/contrib/mpi/mpi_utils.h b/tensorflow/contrib/mpi/mpi_utils.h
new file mode 100644
index 0000000000..45e21f2b25
--- /dev/null
+++ b/tensorflow/contrib/mpi/mpi_utils.h
@@ -0,0 +1,60 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_MPI_MPI_UTILS_H_
+#define TENSORFLOW_CONTRIB_MPI_MPI_UTILS_H_
+
+#ifdef TENSORFLOW_USE_MPI
+
+#include <string>
+#include <map>
+#include <vector>
+
+#include "tensorflow/core/lib/strings/str_util.h"
+
+#include "third_party/mpi/mpi.h"
+#define MPI_CHECK(cmd) \
+ do { \
+ int mpi_errno = cmd; \
+ if (MPI_SUCCESS != mpi_errno) { \
+ fprintf(stderr, "[%s:%d] MPI call failed with %d \n", __FILE__, \
+ __LINE__, mpi_errno); \
+ exit(EXIT_FAILURE); \
+ } \
+ assert(MPI_SUCCESS == mpi_errno); \
+ } while (false)
+
+namespace tensorflow {
+class MPIUtils {
+ public:
+ explicit MPIUtils(const std::string& worker_name);
+
+ const int GetSourceID(const std::string& task_id) const {
+ auto it = name_to_id_.find(task_id);
+ if (it == name_to_id_.end()) {
+ LOG(FATAL) << "Failed to convert worker name to MPI index: " << task_id;
+ }
+ return it->second;
+ }
+
+ private:
+ void InitMPI();
+
+ std::map<std::string, int> name_to_id_;
+};
+} // namespace tensorflow
+
+#endif // TENSORFLOW_USE_MPI
+#endif // TENSORFLOW_CONTRIB_MPI_MPI_UTILS_H_
diff --git a/tensorflow/contrib/opt/BUILD b/tensorflow/contrib/opt/BUILD
index a7e910975f..befd1b63c9 100644
--- a/tensorflow/contrib/opt/BUILD
+++ b/tensorflow/contrib/opt/BUILD
@@ -14,6 +14,7 @@ py_library(
name = "opt_py",
srcs = [
"__init__.py",
+ "python/training/delay_compensated_gradient_descent.py",
"python/training/drop_stale_gradient_optimizer.py",
"python/training/external_optimizer.py",
"python/training/lazy_adam_optimizer.py",
@@ -38,6 +39,25 @@ py_library(
)
py_test(
+ name = "delay_compensated_gradient_descent_test",
+ srcs = ["python/training/delay_compensated_gradient_descent_test.py"],
+ srcs_version = "PY2AND3",
+ tags = ["manual"],
+ deps = [
+ ":opt_py",
+ "//tensorflow/python:array_ops",
+ "//tensorflow/python:client_testlib",
+ "//tensorflow/python:extra_py_tests_deps",
+ "//tensorflow/python:framework_for_generated_wrappers",
+ "//tensorflow/python:math_ops",
+ "//tensorflow/python:random_ops",
+ "//tensorflow/python:resource_variable_ops",
+ "//tensorflow/python:variables",
+ "//third_party/py/numpy",
+ ],
+)
+
+py_test(
name = "external_optimizer_test",
srcs = ["python/training/external_optimizer_test.py"],
srcs_version = "PY2AND3",
diff --git a/tensorflow/contrib/opt/__init__.py b/tensorflow/contrib/opt/__init__.py
index 656a548cfd..f4cb7456cc 100644
--- a/tensorflow/contrib/opt/__init__.py
+++ b/tensorflow/contrib/opt/__init__.py
@@ -19,9 +19,11 @@ from __future__ import division
from __future__ import print_function
# pylint: disable=wildcard-import
+from tensorflow.contrib.opt.python.training.delay_compensated_gradient_descent import *
from tensorflow.contrib.opt.python.training.drop_stale_gradient_optimizer import *
from tensorflow.contrib.opt.python.training.external_optimizer import *
from tensorflow.contrib.opt.python.training.lazy_adam_optimizer import *
+from tensorflow.contrib.opt.python.training.nadam_optimizer import *
from tensorflow.contrib.opt.python.training.moving_average_optimizer import *
from tensorflow.contrib.opt.python.training.nadam_optimizer import *
from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import *
@@ -29,7 +31,9 @@ from tensorflow.contrib.opt.python.training.variable_clipping_optimizer import *
from tensorflow.python.util.all_util import remove_undocumented
+
_allowed_symbols = [
+ 'DelayCompensatedGradientDescentOptimizer',
'DropStaleGradientOptimizer', 'ExternalOptimizerInterface',
'LazyAdamOptimizer', 'NadamOptimizer', 'MovingAverageOptimizer',
'ScipyOptimizerInterface', 'VariableClippingOptimizer'
diff --git a/tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent.py b/tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent.py
new file mode 100644
index 0000000000..5a5e67ef68
--- /dev/null
+++ b/tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent.py
@@ -0,0 +1,256 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""DelayCompensatedGradientDescentOptimizer for TensorFlow."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.training import optimizer
+from tensorflow.python.training import training_ops
+
+
+class _RefVariableAsynchronousProcessor(optimizer._RefVariableProcessor):
+ """Processor for Variable."""
+ def update_op_asynchronous(self, optimizer, g, index):
+ if isinstance(g, ops.Tensor):
+ return optimizer._apply_dense(g, self._v, index)
+ else:
+ assert isinstance(g, ops.IndexedSlices), ("Gradient ", g, " is neither a "
+ "tensor nor IndexedSlices.")
+ # pylint: disable=protected-access
+ return optimizer._apply_sparse_duplicate_indices(g, self._v, index)
+
+
+class _DenseResourceVariableAsynchronousProcessor(optimizer._DenseResourceVariableProcessor):
+ """Processor for dense ResourceVariables."""
+ def update_op_asynchronous(self, optimizer, g, index):
+ # pylint: disable=protected-access
+ if isinstance(g, ops.IndexedSlices):
+ return optimizer._resource_apply_sparse_duplicate_indices(
+ g.values, self._v, g.indices, index)
+ return optimizer._resource_apply_dense(g, self._v, index)
+
+
+def _get_processor(v):
+ """The processor of v."""
+ if v.op.type == "VarHandleOp":
+ return _DenseResourceVariableAsynchronousProcessor(v)
+ if isinstance(v, variables.Variable):
+ return _RefVariableAsynchronousProcessor(v)
+ raise NotImplementedError("Trying to optimize unsupported type ", v)
+
+
+class DelayCompensatedGradientDescentOptimizer(optimizer.Optimizer):
+ """Optimizer that implements gradient descent with delay compensation.
+
+ See [Zheng, Shuxin, et al., 2016](https://arxiv.org/abs/1609.08326)
+ ([pdf](https://arxiv.org/pdf/1609.08326.pdf)).
+ """
+
+ def __init__(self, learning_rate, variance_parameter, num_workers=1,
+ use_locking=False, name="DelayCompensatedGradientDescent"):
+ """Construct a new gradient descent optimizer with delay compensation.
+
+ Args:
+ learning_rate: A Tensor or a floating point value. The learning
+ rate to use.
+ variance_parameter: A Tensor or a floating point value. The lambda
+ value to use.
+ num_workers: A value to indicate number of workers computing gradients
+ asynchronously.
+ use_locking: If True use locks for update operations.
+ name: Optional name prefix for the operations created when applying
+ gradients. Defaults to "DelayCompensatedGradientDescent".
+ """
+ if num_workers <= 0:
+ raise ValueError("num_workers must be positive: %s" % num_workers)
+ super(DelayCompensatedGradientDescentOptimizer, self).__init__(
+ use_locking, name)
+ self._learning_rate = learning_rate
+ self._lambda = variance_parameter
+ self._num_workers = num_workers
+
+ def minimize(self, loss, global_step=None, var_list=None,
+ gate_gradients=optimizer.Optimizer.GATE_OP, aggregation_method=None,
+ colocate_gradients_with_ops=False, name=None,
+ grad_loss=None, worker_index=None):
+ """Add operations to minimize `loss` by updating `var_list`.
+
+ This method simply combines calls `compute_gradients()` and
+ `apply_gradients()`. If you want to process the gradient before applying
+ them call `compute_gradients()` and `apply_gradients()` explicitly instead
+ of using this function.
+
+ Args:
+ loss: A `Tensor` containing the value to minimize.
+ global_step: Optional `Variable` to increment by one after the
+ variables have been updated.
+ var_list: Optional list or tuple of `Variable` objects to update to
+ minimize `loss`. Defaults to the list of variables collected in
+ the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
+ gate_gradients: How to gate the computation of gradients. Can be
+ `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
+ aggregation_method: Specifies the method used to combine gradient terms.
+ Valid values are defined in the class `AggregationMethod`.
+ colocate_gradients_with_ops: If True, try colocating gradients with
+ the corresponding op.
+ name: Optional name for the returned operation.
+ grad_loss: Optional. A `Tensor` holding the gradient computed for `loss`.
+ worker_index: Optional. A value to indicate the instance of worker
+ minimizing if computing asynchronously.
+
+ Returns:
+ An Operation that updates the variables in `var_list`. If `global_step`
+ was not `None`, that operation also increments `global_step`.
+
+ Raises:
+ ValueError: If some of the variables are not `Variable` objects.
+ """
+    if worker_index is not None and not 0 <= worker_index < self._num_workers:
+ raise ValueError("worker index must be in the range [0, num_workers): %s" %
+ worker_index)
+ grads_and_vars = self.compute_gradients(
+ loss, var_list=var_list, gate_gradients=gate_gradients,
+ aggregation_method=aggregation_method,
+ colocate_gradients_with_ops=colocate_gradients_with_ops,
+ grad_loss=grad_loss)
+
+ vars_with_grad = [v for g, v in grads_and_vars if g is not None]
+ if not vars_with_grad:
+ raise ValueError(
+ "No gradients provided for any variable, check your graph for ops"
+ " that do not support gradients, between variables %s and loss %s." %
+ ([str(v) for _, v in grads_and_vars], loss))
+
+ return self.apply_gradients(grads_and_vars, global_step=global_step,
+ name=name, worker_index=worker_index)
+
+ def apply_gradients(self,
+ grads_and_vars,
+ global_step=None,
+ name=None,
+ worker_index=None):
+ """Apply gradients to variables.
+
+ This is the second part of `minimize()`. It returns an `Operation` that
+ applies gradients.
+
+ Args:
+ grads_and_vars: List of (gradient, variable) pairs as returned by
+ `compute_gradients()`.
+ global_step: Optional `Variable` to increment by one after the
+ variables have been updated.
+ name: Optional name for the returned operation. Default to the
+ name passed to the `Optimizer` constructor.
+ worker_index: Optional value to indicate the instance of worker
+ minimizing if computing asynchronously.
+
+ Returns:
+ An `Operation` that applies the specified gradients. If `global_step`
+ was not None, that operation also increments `global_step`.
+
+ Raises:
+ TypeError: If `grads_and_vars` is malformed.
+ ValueError: If none of the variables have gradients.
+ """
+ # This is a default implementation of apply_gradients() that can be shared
+ # by most optimizers. It relies on the subclass implementing the following
+ # methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().
+
+ grads_and_vars = tuple(grads_and_vars) # Make sure repeat iteration works.
+ if not grads_and_vars:
+ raise ValueError("No variables provided.")
+ converted_grads_and_vars = []
+ for g, v in grads_and_vars:
+ if g is not None:
+ try:
+ # Convert the grad to Tensor or IndexedSlices if necessary.
+ g = ops.convert_to_tensor_or_indexed_slices(g)
+ except TypeError:
+ raise TypeError(
+ "Gradient must be convertible to a Tensor"
+ " or IndexedSlices, or None: %s" % g)
+ if not isinstance(g, (ops.Tensor, ops.IndexedSlices)):
+ raise TypeError(
+ "Gradient must be a Tensor, IndexedSlices, or None: %s" % g)
+ p = _get_processor(v)
+ converted_grads_and_vars.append((g, v, p))
+
+ converted_grads_and_vars = tuple(converted_grads_and_vars)
+ var_list = [v for g, v, _ in converted_grads_and_vars if g is not None]
+ if not var_list:
+ raise ValueError("No gradients provided for any variable: %s." %
+                       ([str(v) for _, v, _ in converted_grads_and_vars],))
+ with ops.control_dependencies(None):
+ self._create_slots([optimizer._get_variable_for(v) for v in var_list])
+ update_ops = []
+ with ops.name_scope(name, self._name) as name:
+ self._prepare()
+ for grad, var, processor in converted_grads_and_vars:
+ if grad is None:
+ continue
+ # We colocate all ops created in _apply_dense or _apply_sparse
+ # on the same device as the variable.
+ with ops.name_scope("update_" + var.op.name), ops.colocate_with(var):
+ if worker_index is None:
+ update_ops.append(processor.update_op(self, grad))
+ else:
+ update_ops.append(processor.update_op_asynchronous(self, grad,
+ worker_index))
+ if global_step is None:
+ apply_updates = self._finish(update_ops, name)
+ else:
+ with ops.control_dependencies([self._finish(update_ops, "update")]):
+ with ops.colocate_with(global_step):
+ apply_updates = state_ops.assign_add(global_step, 1, name=name).op
+
+ train_op = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
+ if apply_updates not in train_op:
+ train_op.append(apply_updates)
+
+ return apply_updates
+
+ def _create_slots(self, var_list):
+ """Initialize slots for all the vars of each worker to store
+ the previous values of it
+ """
+ for index in range(self._num_workers):
+ for v in var_list:
+ var2 = array_ops.identity(v.initialized_value())
+ self._get_or_make_slot(v, var2, "shadow_{0}".format(index),
+ self._name)
+
+ def _resource_apply_dense(self, grad, var, worker_index=0):
+ # Get previous value of the variable from the slot
+ shadow = self.get_slot(var, "shadow_{0}".format(worker_index))
+ return training_ops.apply_delay_compensated_gradient_descent(
+ var.handle,
+ math_ops.cast(self._learning_rate_tensor, grad.dtype.base_dtype),
+ grad,
+ math_ops.cast(self._lambda_tensor, grad.dtype.base_dtype),
+ shadow.handle,
+ use_locking=self._use_locking)
+
+ def _prepare(self):
+ self._learning_rate_tensor = ops.convert_to_tensor(self._learning_rate,
+ name="learning_rate")
+ self._lambda_tensor = ops.convert_to_tensor(self._lambda,
+ name="lambda")
diff --git a/tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent_test.py b/tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent_test.py
new file mode 100644
index 0000000000..1dbd8416a0
--- /dev/null
+++ b/tensorflow/contrib/opt/python/training/delay_compensated_gradient_descent_test.py
@@ -0,0 +1,132 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Functional test for DelayCompensatedGradientDescentOptimizer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import resource_variable_ops
+from tensorflow.python.ops import variables
+from tensorflow.python.platform import test
+from tensorflow.contrib.opt.python.training import delay_compensated_gradient_descent
+
+
+class DelayCompensatedGradientDescentOptimizerTest(test.TestCase):
+
+ def testBasic(self):
+ for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
+ with self.test_session():
+ var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+ var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
+ grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+ grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+ optimizer = (delay_compensated_gradient_descent.
+ DelayCompensatedGradientDescentOptimizer)(
+ learning_rate=3.0,
+ variance_parameter=2.0,
+ num_workers=1)
+ sgd_op = optimizer.apply_gradients(
+ zip([grads0, grads1], [var0, var1]), worker_index=0)
+ variables.global_variables_initializer().run()
+ # Fetch params to validate initial values
+ self.assertAllCloseAccordingToType([1.0, 2.0], var0.eval())
+ self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
+ # Run 1 step of sgd
+ sgd_op.run()
+ # Validate updated params
+ self.assertAllCloseAccordingToType(
+ [1.0 - 3.0 * 0.1, 2.0 - 3.0 * 0.1], var0.eval())
+ self.assertAllCloseAccordingToType(
+ [3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01], var1.eval())
+
+ def testTensorLearningRate(self):
+ for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
+ with self.test_session():
+ var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+ var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
+ grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+ grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+ lrate = constant_op.constant(3.0)
+ optimizer = (delay_compensated_gradient_descent.
+ DelayCompensatedGradientDescentOptimizer)(
+                         learning_rate=lrate,
+ variance_parameter=2.0,
+ num_workers=1)
+ sgd_op = optimizer.apply_gradients(
+ zip([grads0, grads1], [var0, var1]), worker_index=0)
+ variables.global_variables_initializer().run()
+ # Fetch params to validate initial values
+ self.assertAllCloseAccordingToType([1.0, 2.0], var0.eval())
+ self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
+ # Run 1 step of sgd
+ sgd_op.run()
+ # Validate updated params
+ self.assertAllCloseAccordingToType(
+ [1.0 - 3.0 * 0.1, 2.0 - 3.0 * 0.1], var0.eval())
+ self.assertAllCloseAccordingToType(
+ [3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01], var1.eval())
+
+ def testGradWrtRef(self):
+ for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
+ with self.test_session():
+ optimizer = (delay_compensated_gradient_descent.
+ DelayCompensatedGradientDescentOptimizer)(
+ learning_rate=3.0,
+ variance_parameter=2.0,
+ num_workers=1)
+ values = [1.0, 3.0]
+ vars_ = [variables.Variable([v], dtype=dtype) for v in values]
+ grads_and_vars = optimizer.compute_gradients(
+ vars_[0] + vars_[1], vars_)
+ variables.global_variables_initializer().run()
+ for grad, _ in grads_and_vars:
+ self.assertAllCloseAccordingToType([1.0], grad.eval())
+
+ def testWithGlobalStep(self):
+ for dtype in [dtypes.half, dtypes.float32, dtypes.float64]:
+ with self.test_session():
+ global_step = variables.Variable(0, trainable=False)
+ var0 = resource_variable_ops.ResourceVariable([1.0, 2.0], dtype=dtype)
+ var1 = resource_variable_ops.ResourceVariable([3.0, 4.0], dtype=dtype)
+ grads0 = constant_op.constant([0.1, 0.1], dtype=dtype)
+ grads1 = constant_op.constant([0.01, 0.01], dtype=dtype)
+ optimizer = (delay_compensated_gradient_descent.
+ DelayCompensatedGradientDescentOptimizer)(
+ learning_rate=3.0,
+ variance_parameter=2.0,
+ num_workers=1)
+ sgd_op = optimizer.apply_gradients(
+ zip([grads0, grads1], [var0, var1]),
+ global_step=global_step,
+ worker_index=0)
+ variables.global_variables_initializer().run()
+ # Fetch params to validate initial values
+ self.assertAllCloseAccordingToType([1.0, 2.0], var0.eval())
+ self.assertAllCloseAccordingToType([3.0, 4.0], var1.eval())
+ # Run 1 step of sgd
+ sgd_op.run()
+ # Validate updated params and global_step
+ self.assertAllCloseAccordingToType(
+ [1.0 - 3.0 * 0.1, 2.0 - 3.0 * 0.1], var0.eval())
+ self.assertAllCloseAccordingToType(
+ [3.0 - 3.0 * 0.01, 4.0 - 3.0 * 0.01], var1.eval())
+ self.assertAllCloseAccordingToType(1, global_step.eval())
+
+
+if __name__ == "__main__":
+ test.main()
diff --git a/tensorflow/contrib/pi_examples/README.md b/tensorflow/contrib/pi_examples/README.md
index 8dde63e4c6..f550228083 100644
--- a/tensorflow/contrib/pi_examples/README.md
+++ b/tensorflow/contrib/pi_examples/README.md
@@ -69,5 +69,5 @@ Flite package and then pipe the output of the binary you've built, like this:
```
sudo apt-get install flite
-tensorflow/contrib/pi_examples/camera/gen/bin/camera | xargs -n1 flite -t
+tensorflow/contrib/pi_examples/camera/gen/bin/camera | xargs -n 1 flite -t
```
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py
index d250af9037..09aa30a20b 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_test.py
@@ -42,7 +42,7 @@ from tensorflow.python.ops import variables as variables_lib
from tensorflow.python.platform import test
from tensorflow.python.platform import tf_logging
from tensorflow.python.util import nest
-
+from tensorflow.python.framework import test_util
class Plus1RNNCell(rnn_lib.RNNCell):
"""RNN Cell generating (output, new_state) = (input + 1, state + 1)."""
@@ -2209,9 +2209,10 @@ class TensorArrayOnCorrectDeviceTest(test.TestCase):
return # Test requires access to a GPU
run_metadata = self._execute_rnn_on(
- rnn_device="/cpu:0", cell_device="/gpu:0")
+ rnn_device="/cpu:0", cell_device=test_util.gpu_device_name())
step_stats = run_metadata.step_stats
- ix = 0 if "gpu" in step_stats.dev_stats[0].device else 1
+ ix = 0 if (("gpu" in step_stats.dev_stats[0].device) or
+ ("sycl" in step_stats.dev_stats[0].device)) else 1
gpu_stats = step_stats.dev_stats[ix].node_stats
cpu_stats = step_stats.dev_stats[1 - ix].node_stats
@@ -2233,9 +2234,11 @@ class TensorArrayOnCorrectDeviceTest(test.TestCase):
return # Test requires access to a GPU
run_metadata = self._execute_rnn_on(
- rnn_device="/cpu:0", cell_device="/cpu:0", input_device="/gpu:0")
+ rnn_device="/cpu:0", cell_device="/cpu:0",
+ input_device=test_util.gpu_device_name())
step_stats = run_metadata.step_stats
- ix = 0 if "gpu" in step_stats.dev_stats[0].device else 1
+ ix = 0 if (("gpu" in step_stats.dev_stats[0].device) or
+ ("sycl" in step_stats.dev_stats[0].device)) else 1
gpu_stats = step_stats.dev_stats[ix].node_stats
cpu_stats = step_stats.dev_stats[1 - ix].node_stats
@@ -2250,9 +2253,11 @@ class TensorArrayOnCorrectDeviceTest(test.TestCase):
if not test.is_gpu_available():
return # Test requires access to a GPU
- run_metadata = self._execute_rnn_on(input_device="/gpu:0")
+ run_metadata = self._execute_rnn_on(
+ input_device=test_util.gpu_device_name())
step_stats = run_metadata.step_stats
- ix = 0 if "gpu" in step_stats.dev_stats[0].device else 1
+ ix = 0 if (("gpu" in step_stats.dev_stats[0].device) or
+ ("sycl" in step_stats.dev_stats[0].device)) else 1
gpu_stats = step_stats.dev_stats[ix].node_stats
cpu_stats = step_stats.dev_stats[1 - ix].node_stats
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py b/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
index 04b0c5876b..fb91fe14f4 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/rnn_cell_test.py
@@ -1026,6 +1026,73 @@ class LayerNormBasicLSTMCellTest(test.TestCase):
self.assertAllClose(res[1].c, expected_c, 1e-5)
self.assertAllClose(res[1].h, expected_h, 1e-5)
+
+ def testBasicLSTMCellWithoutNorm(self):
+ """Tests that BasicLSTMCell with layer_norm=False."""
+ with self.test_session() as sess:
+ with variable_scope.variable_scope(
+ "root", initializer=init_ops.constant_initializer(0.5)):
+ x = array_ops.zeros([1, 2])
+ c0 = array_ops.zeros([1, 2])
+ h0 = array_ops.zeros([1, 2])
+ state0 = rnn_cell.LSTMStateTuple(c0, h0)
+ c1 = array_ops.zeros([1, 2])
+ h1 = array_ops.zeros([1, 2])
+ state1 = rnn_cell.LSTMStateTuple(c1, h1)
+ state = (state0, state1)
+ single_cell = lambda: contrib_rnn_cell.LayerNormBasicLSTMCell(2, layer_norm=False)
+ cell = rnn_cell.MultiRNNCell([single_cell() for _ in range(2)])
+ g, out_m = cell(x, state)
+ sess.run([variables.global_variables_initializer()])
+ res = sess.run([g, out_m], {
+ x.name: np.array([[1., 1.]]),
+ c0.name: 0.1 * np.asarray([[0, 1]]),
+ h0.name: 0.1 * np.asarray([[2, 3]]),
+ c1.name: 0.1 * np.asarray([[4, 5]]),
+ h1.name: 0.1 * np.asarray([[6, 7]]),
+ })
+
+ expected_h = np.array([[ 0.70230919, 0.72581059]])
+ expected_state0_c = np.array([[ 0.8020075, 0.89599884]])
+ expected_state0_h = np.array([[ 0.56668288, 0.60858738]])
+ expected_state1_c = np.array([[ 1.17500675, 1.26892781]])
+ expected_state1_h = np.array([[ 0.70230919, 0.72581059]])
+
+ actual_h = res[0]
+ actual_state0_c = res[1][0].c
+ actual_state0_h = res[1][0].h
+ actual_state1_c = res[1][1].c
+ actual_state1_h = res[1][1].h
+
+ self.assertAllClose(actual_h, expected_h, 1e-5)
+ self.assertAllClose(expected_state0_c, actual_state0_c, 1e-5)
+ self.assertAllClose(expected_state0_h, actual_state0_h, 1e-5)
+ self.assertAllClose(expected_state1_c, actual_state1_c, 1e-5)
+ self.assertAllClose(expected_state1_h, actual_state1_h, 1e-5)
+
+ with variable_scope.variable_scope(
+ "other", initializer=init_ops.constant_initializer(0.5)) as vs:
+ x = array_ops.zeros(
+ [1, 3]) # Test BasicLSTMCell with input_size != num_units.
+ c = array_ops.zeros([1, 2])
+ h = array_ops.zeros([1, 2])
+ state = rnn_cell.LSTMStateTuple(c, h)
+ cell = contrib_rnn_cell.LayerNormBasicLSTMCell(2, layer_norm=False)
+ g, out_m = cell(x, state)
+ sess.run([variables.global_variables_initializer()])
+ res = sess.run([g, out_m], {
+ x.name: np.array([[1., 1., 1.]]),
+ c.name: 0.1 * np.asarray([[0, 1]]),
+ h.name: 0.1 * np.asarray([[2, 3]]),
+ })
+
+ expected_h = np.array([[ 0.64121795, 0.68166804]])
+ expected_c = np.array([[ 0.88477188, 0.98103917]])
+ self.assertEqual(len(res), 2)
+ self.assertAllClose(res[0], expected_h, 1e-5)
+ self.assertAllClose(res[1].c, expected_c, 1e-5)
+ self.assertAllClose(res[1].h, expected_h, 1e-5)
+
def testBasicLSTMCellWithStateTuple(self):
with self.test_session() as sess:
with variable_scope.variable_scope(
diff --git a/tensorflow/contrib/rnn/python/ops/rnn_cell.py b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
index 3dc8abb8b8..9c5e9fec9d 100644
--- a/tensorflow/contrib/rnn/python/ops/rnn_cell.py
+++ b/tensorflow/contrib/rnn/python/ops/rnn_cell.py
@@ -462,7 +462,7 @@ class GridLSTMCell(rnn_cell_impl.RNNCell):
state is clipped by this value prior to the cell output activation.
initializer: (optional) The initializer to use for the weight and
projection matrices, default None.
- num_unit_shards: (optional) int, defualt 1, How to split the weight
+ num_unit_shards: (optional) int, default 1, How to split the weight
matrix. If > 1,the weight matrix is stored across num_unit_shards.
forget_bias: (optional) float, default 1.0, The initial bias of the
forget gates, used to reduce the scale of forgetting at the beginning
@@ -918,7 +918,7 @@ class BidirectionalGridLSTMCell(GridLSTMCell):
state is clipped by this value prior to the cell output activation.
initializer: (optional) The initializer to use for the weight and
projection matrices, default None.
- num_unit_shards: (optional) int, defualt 1, How to split the weight
+ num_unit_shards: (optional) int, default 1, How to split the weight
matrix. If > 1,the weight matrix is stored across num_unit_shards.
forget_bias: (optional) float, default 1.0, The initial bias of the
forget gates, used to reduce the scale of forgetting at the beginning
@@ -1805,12 +1805,12 @@ class PhasedLSTMCell(rnn_cell_impl.RNNCell):
period during which the gates are open.
trainable_ratio_on: bool, weather ratio_on is trainable.
period_init_min: float or scalar float Tensor. With value > 0.
- Minimum value of the initalized period.
+ Minimum value of the initialized period.
The period values are initialized by drawing from the distribution:
e^U(log(period_init_min), log(period_init_max))
Where U(.,.) is the uniform distribution.
period_init_max: float or scalar float Tensor.
- With value > period_init_min. Maximum value of the initalized period.
+ With value > period_init_min. Maximum value of the initialized period.
reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not `True`, and the existing scope already has
the given variables, an error is raised.
diff --git a/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py b/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py
index d9bb3bcccd..1cbd27a2e5 100644
--- a/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py
+++ b/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py
@@ -162,7 +162,7 @@ def _split_sharded_vars(name_shape_map):
Returns:
not_sharded: Names of the non-sharded variables.
- sharded: Names of the sharded varibales.
+ sharded: Names of the sharded variables.
"""
sharded = []
not_sharded = []
diff --git a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
index e9a808709b..642c7f1b54 100644
--- a/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
+++ b/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py
@@ -359,9 +359,9 @@ class LuongAttention(_BaseAttentionMechanism):
class BahdanauAttention(_BaseAttentionMechanism):
- """Implements Bhadanau-style (additive) attention.
+ """Implements Bahdanau-style (additive) attention.
- This attention has two forms. The first is Bhandanau attention,
+ This attention has two forms. The first is Bahdanau attention,
as described in:
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
@@ -502,7 +502,7 @@ class AttentionWrapperState(
Returns:
A new `AttentionWrapperState` whose properties are the same as
- this one, except any overriden properties as provided in `kwargs`.
+ this one, except any overridden properties as provided in `kwargs`.
"""
return super(AttentionWrapperState, self)._replace(**kwargs)
diff --git a/tensorflow/contrib/slim/README.md b/tensorflow/contrib/slim/README.md
index 61148c0b26..d37c632be7 100644
--- a/tensorflow/contrib/slim/README.md
+++ b/tensorflow/contrib/slim/README.md
@@ -352,7 +352,7 @@ we can both ensure that each layer uses the same values and simplify the code:
```
As the example illustrates, the use of arg_scope makes the code cleaner,
-simpler and easier to maintain. Notice that while argument values are specifed
+simpler and easier to maintain. Notice that while argument values are specified
in the arg_scope, they can be overwritten locally. In particular, while
the padding argument has been set to 'SAME', the second convolution overrides
it with the value of 'VALID'.
diff --git a/tensorflow/contrib/slim/python/slim/data/dataset_data_provider.py b/tensorflow/contrib/slim/python/slim/data/dataset_data_provider.py
index 3a78c0471d..82c6b5a619 100644
--- a/tensorflow/contrib/slim/python/slim/data/dataset_data_provider.py
+++ b/tensorflow/contrib/slim/python/slim/data/dataset_data_provider.py
@@ -33,7 +33,7 @@ To read data using multiple readers simultaneous with shuffling:
shuffle=True)
images, labels = pascal_voc_data_provider.get(['images', 'labels'])
-Equivalently, one may request different fields of the same sample seperately:
+Equivalently, one may request different fields of the same sample separately:
[images] = pascal_voc_data_provider.get(['images'])
[labels] = pascal_voc_data_provider.get(['labels'])
diff --git a/tensorflow/contrib/tensorboard/plugins/projector/__init__.py b/tensorflow/contrib/tensorboard/plugins/projector/__init__.py
index 771685229d..be2398cdc0 100644
--- a/tensorflow/contrib/tensorboard/plugins/projector/__init__.py
+++ b/tensorflow/contrib/tensorboard/plugins/projector/__init__.py
@@ -39,7 +39,7 @@ def visualize_embeddings(summary_writer, config):
"""Stores a config file used by the embedding projector.
Args:
- summary_writer: The summary writer used for writting events.
+ summary_writer: The summary writer used for writing events.
config: `tf.contrib.tensorboard.plugins.projector.ProjectorConfig`
proto that holds the configuration for the projector such as paths to
checkpoint files and metadata files for the embeddings. If
diff --git a/tensorflow/contrib/tensorboard/plugins/projector/projector_api_test.py b/tensorflow/contrib/tensorboard/plugins/projector/projector_api_test.py
index 91ea6bc753..5f86f57a1c 100644
--- a/tensorflow/contrib/tensorboard/plugins/projector/projector_api_test.py
+++ b/tensorflow/contrib/tensorboard/plugins/projector/projector_api_test.py
@@ -46,7 +46,7 @@ class ProjectorApiTest(test.TestCase):
writer = writer_lib.FileWriter(temp_dir)
projector.visualize_embeddings(writer, config)
- # Read the configuratin from disk and make sure it matches the original.
+ # Read the configurations from disk and make sure it matches the original.
with gfile.GFile(os.path.join(temp_dir, 'projector_config.pbtxt')) as f:
config2 = projector_config_pb2.ProjectorConfig()
text_format.Parse(f.read(), config2)
diff --git a/tensorflow/contrib/tfprof/README.md b/tensorflow/contrib/tfprof/README.md
index 5bfa0247a5..c01e5eb637 100644
--- a/tensorflow/contrib/tfprof/README.md
+++ b/tensorflow/contrib/tfprof/README.md
@@ -1,6 +1,6 @@
# tfprof: A Profiling Tool for TensorFlow Models
-# Full Docment in tensorflow/tools/tfprof/README.md
+# Full Document in tensorflow/tools/tfprof/README.md
Author: Xin Pan (xpan@google.com, github: panyx0718), Jon Shlens, Yao Zhang
diff --git a/tensorflow/contrib/training/python/training/evaluation.py b/tensorflow/contrib/training/python/training/evaluation.py
index bc0c60c85c..24b733dd29 100644
--- a/tensorflow/contrib/training/python/training/evaluation.py
+++ b/tensorflow/contrib/training/python/training/evaluation.py
@@ -370,7 +370,7 @@ def evaluate_repeatedly(checkpoint_dir,
One may also consider using a `tf.contrib.training.SummaryAtEndHook` to record
summaries after the `eval_ops` have run. If `eval_ops` is `None`, the
- summaries run immedietly after the model checkpoint has been restored.
+ summaries run immediately after the model checkpoint has been restored.
Note that `evaluate_once` creates a local variable used to track the number of
evaluations run via `tf.contrib.training.get_or_create_eval_step`.
diff --git a/tensorflow/contrib/training/python/training/hparam.py b/tensorflow/contrib/training/python/training/hparam.py
index 2e08593699..c19a36eabc 100644
--- a/tensorflow/contrib/training/python/training/hparam.py
+++ b/tensorflow/contrib/training/python/training/hparam.py
@@ -422,7 +422,7 @@ class HParams(object):
elif issubclass(param_type, float):
typename = 'float'
else:
- raise ValueError('Unsupported paramter type: %s' % str(param_type))
+ raise ValueError('Unsupported parameter type: %s' % str(param_type))
suffix = 'list' if is_list else 'value'
return '_'.join([typename, suffix])
diff --git a/tensorflow/contrib/training/python/training/sequence_queueing_state_saver.py b/tensorflow/contrib/training/python/training/sequence_queueing_state_saver.py
index 2c7c30911c..9312070e52 100644
--- a/tensorflow/contrib/training/python/training/sequence_queueing_state_saver.py
+++ b/tensorflow/contrib/training/python/training/sequence_queueing_state_saver.py
@@ -344,7 +344,7 @@ def _prepare_sequence_inputs(inputs, states):
key = _check_rank(inputs.key, 0)
if length.dtype != dtypes.int32:
- raise TypeError("length dtype must be int32, but recieved: %s" %
+ raise TypeError("length dtype must be int32, but received: %s" %
length.dtype)
if key.dtype != dtypes.string:
raise TypeError("key dtype must be string, but received: %s" % key.dtype)
@@ -1673,7 +1673,7 @@ def _move_sparse_tensor_out_context(input_context, input_sequences, num_unroll):
shape = array_ops.concat(
[array_ops.expand_dims(value_length, 0), sp_tensor.dense_shape], 0)
- # Construct new indices by mutliplying old ones and prepending [0, n).
+ # Construct new indices by multiplying old ones and prepending [0, n).
# First multiply indices n times along a newly created 0-dimension.
multiplied_indices = array_ops.tile(
array_ops.expand_dims(sp_tensor.indices, 0),
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index 2685acea4b..1f0b100bbb 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -82,6 +82,7 @@ load("//tensorflow:tensorflow.bzl", "tf_cc_test_mkl")
load("//tensorflow:tensorflow.bzl", "tf_cc_test_gpu")
load("//tensorflow:tensorflow.bzl", "tf_cc_tests_gpu")
load("//tensorflow:tensorflow.bzl", "tf_version_info_genrule")
+load("//tensorflow:tensorflow.bzl", "tf_cuda_only_cc_test")
# For platform specific build config
load(
@@ -110,6 +111,7 @@ load(
"tf_additional_cloud_kernel_deps",
"tf_lib_proto_parsing_deps",
"tf_additional_verbs_lib_defines",
+ "tf_additional_mpi_lib_defines",
)
load(
"//tensorflow/core:platform/default/build_config_root.bzl",
@@ -142,6 +144,7 @@ CORE_PROTO_SRCS = [
"framework/log_memory.proto",
"framework/node_def.proto",
"framework/op_def.proto",
+ "framework/reader_base.proto",
"framework/remote_fused_graph_execute_info.proto",
"framework/resource_handle.proto",
"framework/step_stats.proto",
@@ -183,15 +186,6 @@ ADDITIONAL_CORE_PROTO_SRCS = [
]
tf_proto_library(
- name = "reader_base_proto",
- srcs = ["framework/reader_base.proto"],
- cc_api_version = 2,
- go_api_version = 2,
- java_api_version = 2,
- visibility = ["//visibility:public"],
-)
-
-tf_proto_library(
name = "protos_all",
srcs = CORE_PROTO_SRCS + ADDITIONAL_CORE_PROTO_SRCS,
cc_api_version = 2,
@@ -411,6 +405,7 @@ tf_cuda_library(
"util/work_sharder.h",
] + select({
"//tensorflow:windows": [],
+ "//tensorflow:windows_msvc": [],
"//conditions:default": [
"util/memmapped_file_system.h",
"util/memmapped_file_system_writer.h",
@@ -438,7 +433,6 @@ cc_library(
deps = [
":framework",
":lib",
- ":reader_base_proto_cc",
],
)
@@ -884,7 +878,6 @@ filegroup(
"**/*main.cc",
"debug/**/*",
"framework/op_gen_*",
- "framework/reader_base.*",
"graph/dot.*",
"lib/jpeg/**/*",
"lib/png/**/*",
@@ -1206,32 +1199,35 @@ tf_proto_library_cc(
],
)
+LIB_INTERNAL_WINDOWS_DEPS = glob(
+ [
+ "lib/**/*.h",
+ "lib/**/*.cc",
+ "platform/*.h",
+ "platform/*.cc",
+ "platform/profile_utils/**/*.h",
+ "platform/profile_utils/**/*.cc",
+ ],
+ exclude = [
+ "**/*test*",
+ "lib/hash/crc32c_accelerate.cc",
+ "lib/gif/**/*",
+ "lib/jpeg/**/*",
+ "platform/gif.h",
+ "platform/jpeg.h",
+ "platform/**/env_time.cc",
+ "platform/**/cuda.h",
+ "platform/**/cuda_libdevice_path.cc",
+ "platform/**/stream_executor.h",
+ "platform/load_library.cc",
+ ],
+)
+
cc_library(
name = "lib_internal",
srcs = select({
- "//tensorflow:windows": glob(
- [
- "lib/**/*.h",
- "lib/**/*.cc",
- "platform/*.h",
- "platform/*.cc",
- "platform/profile_utils/**/*.h",
- "platform/profile_utils/**/*.cc",
- ],
- exclude = [
- "**/*test*",
- "lib/hash/crc32c_accelerate.cc",
- "lib/gif/**/*",
- "lib/jpeg/**/*",
- "platform/gif.h",
- "platform/jpeg.h",
- "platform/**/env_time.cc",
- "platform/**/cuda.h",
- "platform/**/cuda_libdevice_path.cc",
- "platform/**/stream_executor.h",
- "platform/load_library.cc",
- ],
- ),
+ "//tensorflow:windows": LIB_INTERNAL_WINDOWS_DEPS,
+ "//tensorflow:windows_msvc": LIB_INTERNAL_WINDOWS_DEPS,
"//conditions:default": glob(
[
"lib/**/*.h",
@@ -1309,8 +1305,9 @@ cc_library(
],
copts = tf_copts(),
defines = tf_additional_lib_defines() + [
- "SNAPPY",
- ] + tf_additional_verbs_lib_defines(),
+ "SNAPPY",
+ ] + tf_additional_verbs_lib_defines() +
+ tf_additional_mpi_lib_defines(),
linkopts = select({
"//tensorflow:freebsd": [],
"//conditions:default": [
@@ -1434,6 +1431,7 @@ tf_cuda_library(
],
) + select({
"//tensorflow:windows": [],
+ "//tensorflow:windows_msvc": [],
"//conditions:default": [
"util/memmapped_file_system.h",
"util/memmapped_file_system.cc",
@@ -1561,8 +1559,6 @@ tf_cuda_library(
"graph/graph_constructor.cc",
"graph/graph_def_builder.cc",
"graph/graph_partition.cc",
- "graph/mkl_layout_pass.cc",
- "graph/mkl_tfconversion_pass.cc",
"graph/node_builder.cc",
"graph/optimizer_cse.cc",
"graph/subgraph.cc",
@@ -1625,6 +1621,8 @@ tf_cuda_library(
"common_runtime/threadpool_device.cc",
"common_runtime/threadpool_device_factory.cc",
"graph/gradients.cc",
+ "graph/mkl_layout_pass.cc",
+ "graph/mkl_tfconversion_pass.cc",
"graph/quantize_training.cc",
"public/session.h",
"public/session_options.h",
@@ -1835,6 +1833,7 @@ cc_library(
hdrs = if_not_windows([
"common_runtime/sycl/sycl_allocator.h",
"common_runtime/sycl/sycl_device.h",
+ "common_runtime/sycl/sycl_util.h",
"common_runtime/sycl/sycl_device_context.h",
]),
copts = tf_copts(),
@@ -2322,6 +2321,18 @@ tf_cc_test_gpu(
],
)
+tf_cuda_only_cc_test(
+ name = "util_cuda_kernel_helper_test",
+ srcs = [
+ "util/cuda_kernel_helper_test.cu.cc",
+ ],
+ deps = [
+ ":test",
+ ":test_main",
+ "//third_party/eigen3",
+ ],
+)
+
tf_cc_test_gpu(
name = "memory_types_test",
size = "small",
@@ -2885,6 +2896,20 @@ filegroup(
)
filegroup(
+ name = "lmdb_testdata",
+ testonly = 1,
+ srcs = [
+ # A simple key-value store:
+ # 0 : 'a'
+ # 1 : 'b'
+ # ...
+ # 9 : 'j'
+ "lib/lmdb/testdata/data.mdb",
+ ],
+ visibility = ["//visibility:public"],
+)
+
+filegroup(
name = "example_parser_configuration_testdata",
srcs = [
"example/testdata/parse_example_graph_def.pbtxt",
diff --git a/tensorflow/core/common_runtime/constant_folding.cc b/tensorflow/core/common_runtime/constant_folding.cc
index 8fa61d098e..914683d9fa 100644
--- a/tensorflow/core/common_runtime/constant_folding.cc
+++ b/tensorflow/core/common_runtime/constant_folding.cc
@@ -83,7 +83,7 @@ bool IsConstantFoldable(const Node* n,
}
// Returns the constant foldable nodes in `nodes` in topological order.
-// Populates `constant_control_deps` with the non-constant control depedencies
+// Populates `constant_control_deps` with the non-constant control dependencies
// of each constant node.
void FindConstantFoldableNodes(
const Graph* graph, ConstantFoldingOptions opts, std::vector<Node*>* nodes,
diff --git a/tensorflow/core/common_runtime/direct_session_test.cc b/tensorflow/core/common_runtime/direct_session_test.cc
index f8deaaf222..3d06ca0ae4 100644
--- a/tensorflow/core/common_runtime/direct_session_test.cc
+++ b/tensorflow/core/common_runtime/direct_session_test.cc
@@ -877,8 +877,6 @@ class BlockingOp : public OpKernel {
REGISTER_KERNEL_BUILDER(Name("BlockingOp").Device(DEVICE_CPU), BlockingOp);
REGISTER_OP("BlockingOp").Input("x: float").Output("y: float").Doc("");
-REGISTER_KERNEL_BUILDER(Name("BlockingOp").Device(DEVICE_SYCL), BlockingOp);
-
static void TestSessionInterOpThreadsImpl(bool use_function_lib) {
FunctionDefLibrary library_graph_def;
if (use_function_lib) {
@@ -916,6 +914,7 @@ static void TestSessionInterOpThreadsImpl(bool use_function_lib) {
->set_opt_level(OptimizerOptions_Level_L0);
(*options.config.mutable_device_count())["CPU"] = 2;
(*options.config.mutable_device_count())["GPU"] = 0;
+ (*options.config.mutable_device_count())["SYCL"] = 0;
options.config.add_session_inter_op_thread_pool();
auto* p = options.config.add_session_inter_op_thread_pool();
diff --git a/tensorflow/core/common_runtime/direct_session_with_tracking_alloc_test.cc b/tensorflow/core/common_runtime/direct_session_with_tracking_alloc_test.cc
index 6f92cd09d3..0cfc289494 100644
--- a/tensorflow/core/common_runtime/direct_session_with_tracking_alloc_test.cc
+++ b/tensorflow/core/common_runtime/direct_session_with_tracking_alloc_test.cc
@@ -155,10 +155,16 @@ static void TestHWAccelerator(bool enableHWTrace) {
test::FillValues<float>(&x_tensor, {1, 1});
Node* x = test::graph::Constant(&graph, x_tensor);
x->set_assigned_device_name("/job:localhost/replica:0/task:0/gpu:0");
+#ifdef TENSORFLOW_USE_SYCL
+ x->set_assigned_device_name("/job:localhost/replica:0/task:0/device:SYCL:0");
+#endif // TENSORFLOW_USE_SYCL
// y = A * x
Node* y = test::graph::Matmul(&graph, a, x, false, false);
y->set_assigned_device_name("/job:localhost/replica:0/task:0/gpu:0");
+#ifdef TENSORFLOW_USE_SYCL
+y->set_assigned_device_name("/job:localhost/replica:0/task:0/device:SYCL:0");
+#endif // TENSORFLOW_USE_SYCL
Node* y_neg = test::graph::Unary(&graph, "Neg", y);
y_neg->set_assigned_device_name("/job:localhost/replica:0/task:0/cpu:0");
@@ -169,6 +175,9 @@ static void TestHWAccelerator(bool enableHWTrace) {
SessionOptions options;
(*options.config.mutable_device_count())["CPU"] = 1;
(*options.config.mutable_device_count())["GPU"] = 1;
+#ifdef TENSORFLOW_USE_SYCL
+ (*options.config.mutable_device_count())["SYCL"] = 1;
+#endif // TENSORFLOW_USE_SYCL
options.config.set_allow_soft_placement(true);
options.config.mutable_graph_options()->set_build_cost_model(1);
std::unique_ptr<Session> session(NewSession(options));
diff --git a/tensorflow/core/common_runtime/executor.cc b/tensorflow/core/common_runtime/executor.cc
index 2ca3c319ab..24b519fb07 100644
--- a/tensorflow/core/common_runtime/executor.cc
+++ b/tensorflow/core/common_runtime/executor.cc
@@ -514,7 +514,7 @@ char* GraphView::InitializeNode(char* ptr, const Node* n) {
item->num_output_edges = num_output_edges;
// Fill output edges.
- // Keep track of the last EdgeInfo in the EdngeInfo array that references
+ // Keep track of the last EdgeInfo in the EdgeInfo array that references
// a given output slot. For all but the last, we need to do a copy of the
// Tensor when propagating results downstream in the graph, but for the
// last one, we can just do a move of the Tensor object to propagate it.
diff --git a/tensorflow/core/common_runtime/executor.h b/tensorflow/core/common_runtime/executor.h
index 93b58906dd..e09dc4e346 100644
--- a/tensorflow/core/common_runtime/executor.h
+++ b/tensorflow/core/common_runtime/executor.h
@@ -74,8 +74,8 @@ class Executor {
//
// RunAsync() uses "cancellation_manager", if not nullptr, to
// register callbacks that should be called if the graph computation
- // is cancelled. Note that the callbacks merely unblock any
- // long-running computation, and a cancelled step will terminate by
+ // is canceled. Note that the callbacks merely unblock any
+ // long-running computation, and a canceled step will terminate by
// returning/calling the DoneCallback as usual.
//
// RunAsync() dispatches closures to "runner". Typically, "runner"
diff --git a/tensorflow/core/common_runtime/memory_types.cc b/tensorflow/core/common_runtime/memory_types.cc
index db053dd2fa..21ed73df77 100644
--- a/tensorflow/core/common_runtime/memory_types.cc
+++ b/tensorflow/core/common_runtime/memory_types.cc
@@ -47,12 +47,12 @@ struct EndpointEq {
static Status ProcessMemoryTypes(
const DeviceType& device_type, const Graph* g,
const std::function<Status(const Edge*, MemoryType, MemoryType)>& fn) {
- if (device_type != DEVICE_GPU) {
- // On non-GPU devices, HOST_MEMORY and DEVICE_MEMORY are always
+ if (device_type != DEVICE_GPU && device_type != DEVICE_SYCL ) {
+ // On non-GPU and non-SYCL devices, HOST_MEMORY and DEVICE_MEMORY are always
// compatible.
return Status::OK();
}
- // For GPU device, HOST_MEMORY and DEVICE_MEMORY is not
+ // For GPU and SYCL device, HOST_MEMORY and DEVICE_MEMORY is not
// compatible. I.e., a conversion/transfer must be done.
//
// {node id, slot id} -> memory type.
diff --git a/tensorflow/core/common_runtime/memory_types_test.cc b/tensorflow/core/common_runtime/memory_types_test.cc
index 088ba0cb45..b3a43d3504 100644
--- a/tensorflow/core/common_runtime/memory_types_test.cc
+++ b/tensorflow/core/common_runtime/memory_types_test.cc
@@ -34,6 +34,9 @@ TEST(MemoryTypeChecker, Int32OK) {
// There is a kernel for adding two int32s on host memory.
TF_EXPECT_OK(ValidateMemoryTypes(DEVICE_GPU, g));
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+ TF_EXPECT_OK(ValidateMemoryTypes(DEVICE_SYCL, g));
+#endif // TENSORFLOW_USE_SYCL
delete g;
}
@@ -53,6 +56,15 @@ TEST(MemoryTypeChecker, Int32NotOk) {
TF_EXPECT_OK(EnsureMemoryTypes(DEVICE_GPU, "/gpu:0", g));
TF_EXPECT_OK(ValidateMemoryTypes(DEVICE_GPU, g));
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+ // There is no kernel for casting int32/host memory to float/device
+ // memory.
+ EXPECT_TRUE(errors::IsInternal(ValidateMemoryTypes(DEVICE_SYCL, g)));
+
+ // But we can insert _HostSend/_HostRecv to ensure the invariant.
+ TF_EXPECT_OK(EnsureMemoryTypes(DEVICE_SYCL, "/device:SYCL:0", g));
+ TF_EXPECT_OK(ValidateMemoryTypes(DEVICE_SYCL, g));
+#endif // TENSORFLOW_USE_SYCL
delete g;
}
@@ -74,6 +86,12 @@ TEST(MemoryTypeChecker, MemoryTypeForOutput) {
// int Switch's output on GPU has HOST_MEMORY constraint.
EXPECT_EQ(memory_type, HOST_MEMORY);
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+ auto si = test::graph::Switch(g, test::graph::Constant(g, vi), pred);
+ TF_EXPECT_OK(MemoryTypeForOutput(DEVICE_SYCL, g, si, 0, &memory_type));
+ // int Switch's output on GPU has HOST_MEMORY constraint.
+ EXPECT_EQ(memory_type, HOST_MEMORY);
+#endif // TENSORFLOW_USE_SYCL
delete g;
}
diff --git a/tensorflow/core/common_runtime/session_factory.h b/tensorflow/core/common_runtime/session_factory.h
index 2a1632e035..df3198a70d 100644
--- a/tensorflow/core/common_runtime/session_factory.h
+++ b/tensorflow/core/common_runtime/session_factory.h
@@ -47,7 +47,7 @@ class SessionFactory {
// Old sessions may continue to have side-effects on resources not in
// containers listed in "containers", and thus may affect future
// sessions' results in ways that are hard to predict. Thus, if well-defined
- // behaviour is desired, is it recommended that all containers be listed in
+ // behavior is desired, is it recommended that all containers be listed in
// "containers".
//
// If the "containers" vector is empty, the default container is assumed.
diff --git a/tensorflow/core/common_runtime/simple_graph_execution_state.cc b/tensorflow/core/common_runtime/simple_graph_execution_state.cc
index 1a977c1460..8206a678b4 100644
--- a/tensorflow/core/common_runtime/simple_graph_execution_state.cc
+++ b/tensorflow/core/common_runtime/simple_graph_execution_state.cc
@@ -243,7 +243,7 @@ Status SimpleGraphExecutionState::InitBaseGraph(
session_options_->config.graph_options().rewrite_options();
if (grappler::MetaOptimizerEnabled(rewrite_options)) {
- // Adding this functionalty in steps. The first step is to make sure
+ // Adding this functionality in steps. The first step is to make sure
// we don't break dependencies. The second step will be to turn the
// functionality on by default.
grappler::GrapplerItem item;
diff --git a/tensorflow/core/common_runtime/sycl/sycl_allocator.cc b/tensorflow/core/common_runtime/sycl/sycl_allocator.cc
index b7ef9361e9..485e5397e8 100644
--- a/tensorflow/core/common_runtime/sycl/sycl_allocator.cc
+++ b/tensorflow/core/common_runtime/sycl/sycl_allocator.cc
@@ -19,29 +19,26 @@ limitations under the License.
namespace tensorflow {
-SYCLAllocator::~SYCLAllocator() {}
+SYCLAllocator::~SYCLAllocator() {
+ if(sycl_device_) {
+ delete sycl_device_;
+ }
+}
string SYCLAllocator::Name() { return "device:SYCL"; }
void *SYCLAllocator::AllocateRaw(size_t alignment, size_t num_bytes) {
- assert(device_);
+ assert(sycl_device_);
if (num_bytes == 0) {
- return device_->allocate(1);
+ return sycl_device_->allocate(1);
}
- auto p = device_->allocate(num_bytes);
+ auto p = sycl_device_->allocate(num_bytes);
return p;
}
void SYCLAllocator::DeallocateRaw(void *ptr) {
- if (device_) {
- device_->deallocate(ptr);
- }
-}
-
-void SYCLAllocator::EnterLameDuckMode() {
- if (device_) {
- device_->deallocate_all();
- device_ = nullptr;
+ if (sycl_device_) {
+ sycl_device_->deallocate(ptr);
}
}
diff --git a/tensorflow/core/common_runtime/sycl/sycl_allocator.h b/tensorflow/core/common_runtime/sycl/sycl_allocator.h
index 15d9ab41a4..8668cba06a 100644
--- a/tensorflow/core/common_runtime/sycl/sycl_allocator.h
+++ b/tensorflow/core/common_runtime/sycl/sycl_allocator.h
@@ -28,17 +28,19 @@ namespace tensorflow {
class SYCLAllocator : public Allocator {
public:
- SYCLAllocator(Eigen::QueueInterface *device) : device_(device) {}
+ SYCLAllocator(Eigen::QueueInterface *queue) : sycl_device_(new Eigen::SyclDevice(queue)) {}
virtual ~SYCLAllocator() override;
string Name() override;
void *AllocateRaw(size_t alignment, size_t num_bytes) override;
void DeallocateRaw(void *ptr) override;
- void EnterLameDuckMode();
virtual bool ShouldAllocateEmptyTensors() override final { return true; }
-
+ void Synchronize() { sycl_device_->synchronize(); }
+ bool Ok() { return sycl_device_->ok(); }
+ Eigen::SyclDevice* getSyclDevice() { return sycl_device_; }
private:
- Eigen::QueueInterface *device_; // not owned
+ Eigen::SyclDevice *sycl_device_; // owned
+
TF_DISALLOW_COPY_AND_ASSIGN(SYCLAllocator);
};
diff --git a/tensorflow/core/common_runtime/sycl/sycl_device.cc b/tensorflow/core/common_runtime/sycl/sycl_device.cc
index 2c2185b2c0..17f5edd572 100644
--- a/tensorflow/core/common_runtime/sycl/sycl_device.cc
+++ b/tensorflow/core/common_runtime/sycl/sycl_device.cc
@@ -22,50 +22,18 @@ limitations under the License.
#include "tensorflow/core/platform/tracing.h"
namespace tensorflow {
-
-static std::unordered_set<SYCLDevice *> live_devices;
-static bool first_time = true;
+std::mutex GSYCLInterface::mutex_;
+GSYCLInterface *GSYCLInterface::s_instance = 0;
void ShutdownSycl() {
- for (auto device : live_devices) {
- device->EnterLameDuckMode();
- }
- live_devices.clear();
+ GSYCLInterface::Reset();
}
void SYCLDevice::RegisterDevice() {
- if (first_time) {
- first_time = false;
atexit(ShutdownSycl);
- }
- live_devices.insert(this);
}
-SYCLDevice::~SYCLDevice() {
- device_context_->Unref();
- sycl_allocator_->EnterLameDuckMode();
- if (sycl_device_) {
- sycl_device_->synchronize();
- delete sycl_device_;
- }
- if (sycl_queue_) {
- delete sycl_queue_;
- }
- live_devices.erase(this);
-}
-
-void SYCLDevice::EnterLameDuckMode() {
- sycl_allocator_->EnterLameDuckMode();
- if (sycl_device_) {
- sycl_device_->synchronize();
- delete sycl_device_;
- sycl_device_ = nullptr;
- }
- if (sycl_queue_) {
- delete sycl_queue_;
- sycl_queue_ = nullptr;
- }
-}
+SYCLDevice::~SYCLDevice() {}
void SYCLDevice::Compute(OpKernel *op_kernel, OpKernelContext *context) {
assert(context);
@@ -88,8 +56,12 @@ Allocator *SYCLDevice::GetAllocator(AllocatorAttributes attr) {
Status SYCLDevice::MakeTensorFromProto(const TensorProto &tensor_proto,
const AllocatorAttributes alloc_attrs,
Tensor *tensor) {
+ AllocatorAttributes attr;
+ attr.set_on_host(true);
+ Allocator* host_alloc = GetAllocator(attr);
+
Tensor parsed(tensor_proto.dtype());
- if (!parsed.FromProto(cpu_allocator_, tensor_proto)) {
+ if (!parsed.FromProto(host_alloc, tensor_proto)) {
return errors::InvalidArgument("Cannot parse tensor from proto: ",
tensor_proto.DebugString());
}
@@ -98,6 +70,14 @@ Status SYCLDevice::MakeTensorFromProto(const TensorProto &tensor_proto,
*tensor = parsed;
} else {
Tensor copy(GetAllocator(alloc_attrs), parsed.dtype(), parsed.shape());
+
+ // If the tensor is not initialized, we likely ran out of memory.
+ if (!copy.IsInitialized()) {
+ return errors::ResourceExhausted(
+ "OOM when allocating tensor of shape ", parsed.shape().DebugString(),
+ " and type ", DataTypeString(parsed.dtype()));
+ }
+
device_context_->CopyCPUTensorToDevice(
&parsed, this, &copy, [&status](const Status &s) { status = s; });
*tensor = copy;
@@ -119,8 +99,8 @@ Status SYCLDevice::FillContextMap(const Graph *graph,
}
Status SYCLDevice::Sync() {
- sycl_device_->synchronize();
- if (sycl_device_->ok()) {
+ sycl_allocator_->Synchronize();
+ if (sycl_allocator_->Ok()) {
return Status::OK();
} else {
return errors::Internal("Unknown error detected on device ", name());
diff --git a/tensorflow/core/common_runtime/sycl/sycl_device.h b/tensorflow/core/common_runtime/sycl/sycl_device.h
index a5c7c5f0ec..b4123ca071 100644
--- a/tensorflow/core/common_runtime/sycl/sycl_device.h
+++ b/tensorflow/core/common_runtime/sycl/sycl_device.h
@@ -27,31 +27,184 @@ limitations under the License.
namespace tensorflow {
+
+class GSYCLInterface
+{
+ std::vector<Eigen::QueueInterface*> m_queue_interface_; // owned
+ std::vector<Allocator*> m_cpu_allocator_; // not owned
+ std::vector<SYCLAllocator*> m_sycl_allocator_; // owned
+ std::vector<SYCLDeviceContext*> m_sycl_context_; // owned
+
+ static std::mutex mutex_;
+ static GSYCLInterface* s_instance;
+ GSYCLInterface() {
+ bool found_device =false;
+ auto device_list = Eigen::get_sycl_supported_devices();
+ // Obtain list of supported devices from Eigen
+ for (const auto& device : device_list) {
+ if(device.is_gpu()) {
+ // returns first found GPU
+ AddDevice(device);
+ found_device = true;
+ }
+ }
+
+ if(!found_device) {
+ // Currently Intel GPU is not supported
+ LOG(WARNING) << "No OpenCL GPU found that is supported by ComputeCpp, trying OpenCL CPU";
+ }
+
+ for (const auto& device : device_list) {
+ if(device.is_cpu()) {
+ // returns first found CPU
+ AddDevice(device);
+ found_device = true;
+ }
+ }
+
+ if(!found_device) {
+ // Currently Intel GPU is not supported
+ LOG(FATAL) << "No OpenCL GPU nor CPU found that is supported by ComputeCpp";
+ } else {
+ LOG(INFO) << "Found following OpenCL devices:";
+ for (int i = 0; i < device_list.size(); i++) {
+ LOG(INFO) << GetShortDeviceDescription(i);
+ }
+ }
+ }
+
+ ~GSYCLInterface() {
+ m_cpu_allocator_.clear();
+
+ for (auto p : m_sycl_allocator_) {
+ p->Synchronize();
+ delete p;
+ }
+ m_sycl_allocator_.clear();
+
+ for(auto p : m_sycl_context_) {
+ p->Unref();
+ }
+ m_sycl_context_.clear();
+
+ for (auto p : m_queue_interface_) {
+ p->deallocate_all();
+ delete p;
+ p = nullptr;
+ }
+ m_queue_interface_.clear();
+ }
+
+ void AddDevice(const cl::sycl::device & d) {
+ m_queue_interface_.push_back(new Eigen::QueueInterface(d));
+ m_cpu_allocator_.push_back(cpu_allocator());
+ m_sycl_allocator_.push_back(new SYCLAllocator(m_queue_interface_.back()));
+ m_sycl_context_.push_back(new SYCLDeviceContext());
+ }
+
+ public:
+ static GSYCLInterface *instance()
+ {
+ std::lock_guard<std::mutex> lock(mutex_);
+ if (!s_instance) {
+ s_instance = new GSYCLInterface();
+ }
+ return s_instance;
+ }
+
+ static void Reset()
+ {
+ std::lock_guard<std::mutex> lock(mutex_);
+ if(s_instance) {
+ delete s_instance;
+ s_instance = NULL;
+ }
+ }
+
+ Eigen::QueueInterface * GetQueueInterface(size_t i = 0) {
+ if(!m_queue_interface_.empty()) {
+ return m_queue_interface_[i];
+ } else {
+ std::cerr << "No cl::sycl::device has been added" << std::endl;
+ return nullptr;
+ }
+ }
+
+ SYCLAllocator * GetSYCLAllocator(size_t i = 0) {
+ if(!m_sycl_allocator_.empty()) {
+ return m_sycl_allocator_[i];
+ } else {
+ std::cerr << "No cl::sycl::device has been added" << std::endl;
+ return nullptr;
+ }
+ }
+
+ Allocator * GetCPUAllocator(size_t i = 0) {
+ if(!m_cpu_allocator_.empty()) {
+ return m_cpu_allocator_[i];
+ } else {
+ std::cerr << "No cl::sycl::device has been added" << std::endl;
+ return nullptr;
+ }
+ }
+
+ SYCLDeviceContext * GetSYCLContext(size_t i = 0) {
+ if(!m_sycl_context_.empty()) {
+ return m_sycl_context_[i];
+ } else {
+ std::cerr << "No cl::sycl::device has been added" << std::endl;
+ return nullptr;
+ }
+ }
+
+ string GetShortDeviceDescription(int device_id = 0) {
+ auto _device = GetSYCLAllocator(device_id)
+ ->getSyclDevice()
+ ->sycl_queue()
+ .get_device();
+ auto _name = _device.get_info<cl::sycl::info::device::name>();
+ auto _vendor = _device.get_info<cl::sycl::info::device::vendor>();
+ auto _profile = _device.get_info<cl::sycl::info::device::profile>();
+
+ std::string _type;
+ if (_device.is_host()) {
+ _type = "Host";
+ } else if (_device.is_cpu()) {
+ _type = "CPU";
+ } else if (_device.is_gpu()) {
+ _type = "GPU";
+ } else if (_device.is_accelerator()) {
+ _type = "Accelerator";
+ } else {
+ _type = "Unknown";
+ }
+
+ return strings::StrCat("id: ", device_id, " ,type: ", _type, " ,name: ",
+ _name.c_str(), " ,vendor: ", _vendor.c_str(),
+ " ,profile: ", _profile.c_str());
+ }
+};
+
+
class SYCLDevice : public LocalDevice {
public:
- template <typename SYCLSelector>
SYCLDevice(const SessionOptions &options, const string &name,
Bytes memory_limit, const DeviceLocality &locality,
- const string &physical_device_desc, SYCLSelector sycl_selector,
- Allocator *cpu_allocator)
+ const string &physical_device_desc, SYCLAllocator * sycl_allocator,
+ Allocator *cpu_allocator, SYCLDeviceContext* ctx)
: LocalDevice(
options,
Device::BuildDeviceAttributes(name, DEVICE_SYCL, memory_limit,
- locality, physical_device_desc),
- nullptr),
+ locality, physical_device_desc)),
cpu_allocator_(cpu_allocator),
- sycl_queue_(new Eigen::QueueInterface(sycl_selector)),
- sycl_device_(new Eigen::SyclDevice(sycl_queue_)),
- sycl_allocator_(new SYCLAllocator(sycl_queue_)),
- device_context_(new SYCLDeviceContext()) {
- set_eigen_sycl_device(sycl_device_);
+ sycl_allocator_(sycl_allocator),
+ device_context_(ctx) {
RegisterDevice();
+ set_eigen_sycl_device(sycl_allocator->getSyclDevice());
}
~SYCLDevice() override;
- void EnterLameDuckMode();
-
void Compute(OpKernel *op_kernel, OpKernelContext *context) override;
Allocator *GetAllocator(AllocatorAttributes attr) override;
Status MakeTensorFromProto(const TensorProto &tensor_proto,
@@ -62,18 +215,12 @@ class SYCLDevice : public LocalDevice {
DeviceContextMap *device_context_map) override;
Status Sync() override;
- static string GetShortDeviceDescription(/*int device_id,
- const DeviceDescription& desc*/) {
- return strings::StrCat("device: 0, name SYCL, pci bus id: 0");
- }
private:
void RegisterDevice();
- Allocator *cpu_allocator_; // owned
- Eigen::QueueInterface *sycl_queue_; // owned
- Eigen::SyclDevice *sycl_device_; // owned
- SYCLAllocator *sycl_allocator_; // owned
+ Allocator *cpu_allocator_; // not owned
+ SYCLAllocator *sycl_allocator_; // not owned
SYCLDeviceContext *device_context_;
};
diff --git a/tensorflow/core/common_runtime/sycl/sycl_device_factory.cc b/tensorflow/core/common_runtime/sycl/sycl_device_factory.cc
index a643fc7258..19c14770dc 100644
--- a/tensorflow/core/common_runtime/sycl/sycl_device_factory.cc
+++ b/tensorflow/core/common_runtime/sycl/sycl_device_factory.cc
@@ -18,24 +18,34 @@ limitations under the License.
#include "tensorflow/core/common_runtime/device_factory.h"
#include "tensorflow/core/common_runtime/sycl/sycl_device.h"
+#include "tensorflow/core/common_runtime/sycl/sycl_util.h"
+
namespace tensorflow {
class SYCLDeviceFactory : public DeviceFactory {
public:
Status CreateDevices(const SessionOptions &options, const string &name_prefix,
std::vector<Device *> *devices) override {
- int n = 1;
+
+ auto syclInterface = GSYCLInterface::instance();
+
+ size_t n = 1;
auto iter = options.config.device_count().find("SYCL");
if (iter != options.config.device_count().end()) {
n = iter->second;
}
+
for (int i = 0; i < n; i++) {
string name = strings::StrCat(name_prefix, "/device:SYCL:", i);
devices->push_back(
- new SYCLDevice(options, name, Bytes(256 << 20), DeviceLocality(),
- SYCLDevice::GetShortDeviceDescription(),
- cl::sycl::gpu_selector(), cpu_allocator()));
+ new SYCLDevice(options, name, Bytes(256 << 20), DeviceLocality()
+ , syclInterface->GetShortDeviceDescription(i)
+ , syclInterface->GetSYCLAllocator(i)
+ , syclInterface->GetCPUAllocator(i)
+ , syclInterface->GetSYCLContext(i))
+ );
}
+
return Status::OK();
}
};
diff --git a/tensorflow/core/common_runtime/sycl/sycl_util.h b/tensorflow/core/common_runtime/sycl/sycl_util.h
new file mode 100644
index 0000000000..f58614c4ff
--- /dev/null
+++ b/tensorflow/core/common_runtime/sycl/sycl_util.h
@@ -0,0 +1,37 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if !TENSORFLOW_USE_SYCL
+#error This file must only be included when building TensorFlow with SYCL support
+#endif
+
+#ifndef TENSORFLOW_CORE_COMMON_RUNTIME_SYCL_SYCL_UTIL_H_
+#define TENSORFLOW_CORE_COMMON_RUNTIME_SYCL_SYCL_UTIL_H_
+
+#include "tensorflow/core/common_runtime/device.h"
+// For DMA helper
+#include "tensorflow/core/common_runtime/dma_helper.h"
+#include "tensorflow/core/framework/tensor.h"
+
+namespace tensorflow {
+ inline void* GetBase(const Tensor* src) {
+ return const_cast<void*>(DMAHelper::base(src));
+ }
+
+ inline void* GetBase(Tensor* dst) { return DMAHelper::base(dst); }
+
+}
+
+#endif // TENSORFLOW_CORE_COMMON_RUNTIME_SYCL_SYCL_UTIL_H_
diff --git a/tensorflow/core/debug/debug_gateway.cc b/tensorflow/core/debug/debug_gateway.cc
index 1031ea843e..2aaed9563a 100644
--- a/tensorflow/core/debug/debug_gateway.cc
+++ b/tensorflow/core/debug/debug_gateway.cc
@@ -86,7 +86,7 @@ void DebugGateway::CopyTensor(const string& node_name, const int output_slot,
// Determine if the tensor is on device (GPU) or host (CPU).
// The second part of the check is necessary because even an OpKernel on
// may have output tensors allocated on CPU.
- if (device->name().find("gpu:") != string::npos &&
+ if ((device->name().find("gpu:") != string::npos || device->name().find("SYCL:") != string::npos) &&
!ctx->output_alloc_attr(output_slot).on_host()) {
// GPU tensors: Copy it to host (CPU).
DeviceContext* device_ctxt = ctx->op_device_context();
diff --git a/tensorflow/core/debug/debug_gateway_test.cc b/tensorflow/core/debug/debug_gateway_test.cc
index 2911205db2..adbb1b2116 100644
--- a/tensorflow/core/debug/debug_gateway_test.cc
+++ b/tensorflow/core/debug/debug_gateway_test.cc
@@ -46,6 +46,8 @@ class SessionDebugMinusAXTest : public ::testing::Test {
#if GOOGLE_CUDA
const string kDeviceName = "/job:localhost/replica:0/task:0/gpu:0";
+#elif defined(TENSORFLOW_USE_SYCL)
+ const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
#else
const string kDeviceName = "/job:localhost/replica:0/task:0/cpu:0";
#endif
@@ -303,6 +305,8 @@ TEST_F(SessionDebugMinusAXTest, RunSimpleNetworkWithTwoDebugNodesInserted) {
// through RunMetadata, given whether GPU is involved.
#if GOOGLE_CUDA
ASSERT_EQ(2, run_metadata.partition_graphs().size());
+#elif defined(TENSORFLOW_USE_SYCL)
+ ASSERT_EQ(2, run_metadata.partition_graphs().size());
#else
ASSERT_EQ(1, run_metadata.partition_graphs().size());
#endif
@@ -337,7 +341,7 @@ TEST_F(SessionDebugMinusAXTest, RunSimpleNetworkWithTwoDebugNodesInserted) {
ASSERT_EQ(1, debug_nan_count_tensor_vals[0].scalar<int64>()());
}
-#ifndef GOOGLE_CUDA
+#if !defined(GOOGLE_CUDA) && !defined(TENSORFLOW_USE_SYCL)
// TODO(cais): Reinstate the following test for concurrent debugged runs on
// a GPU once the root cause of the ~0.5% flakiness has been addressed.
// (b/34081273)
@@ -500,6 +504,8 @@ class SessionDebugOutputSlotWithoutOngoingEdgeTest : public ::testing::Test {
#if GOOGLE_CUDA
const string kDeviceName = "/job:localhost/replica:0/task:0/gpu:0";
+#elif defined(TENSORFLOW_USE_SYCL)
+ const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
#else
const string kDeviceName = "/job:localhost/replica:0/task:0/cpu:0";
#endif
@@ -600,6 +606,8 @@ class SessionDebugVariableTest : public ::testing::Test {
#if GOOGLE_CUDA
const string kDeviceName = "/job:localhost/replica:0/task:0/gpu:0";
+#elif defined(TENSORFLOW_USE_SYCL)
+ const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
#else
const string kDeviceName = "/job:localhost/replica:0/task:0/cpu:0";
#endif
@@ -823,6 +831,8 @@ TEST_F(SessionDebugVariableTest, VariableAssignWithDebugOps) {
#if GOOGLE_CUDA
ASSERT_EQ(2, run_metadata.partition_graphs().size());
+#elif defined(TENSORFLOW_USE_SYCL)
+ ASSERT_EQ(2, run_metadata.partition_graphs().size());
#else
ASSERT_EQ(1, run_metadata.partition_graphs().size());
#endif
@@ -860,13 +870,17 @@ TEST_F(SessionDebugVariableTest, VariableAssignWithDebugOps) {
ASSERT_EQ(2, debug_nan_count_tensor_vals[0].scalar<int64>()());
}
-#if GOOGLE_CUDA
+#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_SYCL)
class SessionDebugGPUSwitchTest : public ::testing::Test {
public:
void Initialize() {
Graph graph(OpRegistry::Global());
+#ifdef GOOGLE_CUDA
const string kDeviceName = "/job:localhost/replica:0/task:0/gpu:0";
+#elif TENSORFLOW_USE_SYCL
+ const string kDeviceName = "/job:localhost/replica:0/task:0/device:SYCL:0";
+#endif
Tensor vb(DT_BOOL, TensorShape({}));
vb.scalar<bool>()() = true;
diff --git a/tensorflow/core/debug/debug_service.proto b/tensorflow/core/debug/debug_service.proto
index 1adba5d653..63d6668292 100644
--- a/tensorflow/core/debug/debug_service.proto
+++ b/tensorflow/core/debug/debug_service.proto
@@ -20,7 +20,7 @@ package tensorflow;
import "tensorflow/core/util/event.proto";
// Reply message from EventListener to the client, i.e., to the source of the
-// Event protocal buffers, e.g., debug ops inserted by a debugged runtime to a
+// Event protocol buffers, e.g., debug ops inserted by a debugged runtime to a
// TensorFlow graph being executed.
message EventReply {
message DebugOpStateChange {
diff --git a/tensorflow/core/distributed_runtime/BUILD b/tensorflow/core/distributed_runtime/BUILD
index b6c4d60a13..efc08e4c9d 100644
--- a/tensorflow/core/distributed_runtime/BUILD
+++ b/tensorflow/core/distributed_runtime/BUILD
@@ -29,6 +29,7 @@ filegroup(
load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_test")
load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_tests")
+load("//tensorflow:tensorflow.bzl", "tf_copts")
# For platform specific build config
load(
@@ -326,6 +327,7 @@ cc_library(
name = "base_rendezvous_mgr",
srcs = ["base_rendezvous_mgr.cc"],
hdrs = ["base_rendezvous_mgr.h"],
+ copts = tf_copts(),
deps = [
":rendezvous_mgr_interface",
":worker_cache",
diff --git a/tensorflow/core/distributed_runtime/graph_mgr.h b/tensorflow/core/distributed_runtime/graph_mgr.h
index 50391f47e4..4ee3711d02 100644
--- a/tensorflow/core/distributed_runtime/graph_mgr.h
+++ b/tensorflow/core/distributed_runtime/graph_mgr.h
@@ -108,9 +108,9 @@ class GraphMgr {
};
struct Item : public core::RefCounted {
- // TOOD(zhifengc): Keeps a copy of the original graph if the need arises.
- // TOOD(zhifengc): Stats, updated by multiple runs potentially.
- // TOOD(zhifengc): Dup-detection. Ensure step_id only run once.
+ // TODO(zhifengc): Keeps a copy of the original graph if the need arises.
+ // TODO(zhifengc): Stats, updated by multiple runs potentially.
+ // TODO(zhifengc): Dup-detection. Ensure step_id only run once.
~Item() override;
// Session handle.
@@ -126,7 +126,7 @@ class GraphMgr {
// has a root executor which may call into the runtime library.
std::vector<ExecutionUnit> units;
- // Used to deresgister a cost model when cost model is requried in graph
+ // Used to deresgister a cost model when cost model is required in graph
// manager.
GraphMgr* graph_mgr;
};
@@ -157,7 +157,7 @@ class GraphMgr {
CancellationManager* cancellation_manager,
StatusCallback done);
- // Don't attempt to process cost models unless explicitely requested for at
+ // Don't attempt to process cost models unless explicitly requested for at
// least one of the items.
bool skip_cost_models_ = true;
diff --git a/tensorflow/core/distributed_runtime/master.cc b/tensorflow/core/distributed_runtime/master.cc
index 1cbf30fe4b..e3f23ef0dd 100644
--- a/tensorflow/core/distributed_runtime/master.cc
+++ b/tensorflow/core/distributed_runtime/master.cc
@@ -25,7 +25,7 @@ limitations under the License.
// A Master discovers remote devices on-demand and keeps track of
// statistics of those remote devices.
//
-// Each session analyses the graph, places nodes across available
+// Each session analyzes the graph, places nodes across available
// devices, and ultimately drives the graph computation by initiating
// RunGraph on the workers.
diff --git a/tensorflow/core/distributed_runtime/master_session.cc b/tensorflow/core/distributed_runtime/master_session.cc
index a2160816fe..94fec4f6d0 100644
--- a/tensorflow/core/distributed_runtime/master_session.cc
+++ b/tensorflow/core/distributed_runtime/master_session.cc
@@ -1405,7 +1405,7 @@ Status MasterSession::DoPartialRun(CallOptions* opts,
run_state->rcg->CheckFetches(req, run_state, execution_state_.get()));
}
- // Determine if this partial run satisfies all the pending inputs and ouputs.
+ // Determine if this partial run satisfies all the pending inputs and outputs.
for (size_t i = 0; i < req.num_feeds(); ++i) {
auto it = run_state->pending_inputs.find(req.feed_name(i));
it->second = true;
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_call.h b/tensorflow/core/distributed_runtime/rpc/grpc_call.h
index 3b45e7e8a7..e85b8ccbd3 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_call.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_call.h
@@ -89,7 +89,7 @@ class UntypedCall : public core::RefCounted {
virtual void RequestReceived(Service* service, bool ok) = 0;
// This method will be called either (i) when the server is notified
- // that the request has been cancelled, or (ii) when the request completes
+ // that the request has been canceled, or (ii) when the request completes
// normally. The implementation should distinguish these cases by querying
// the `grpc::ServerContext` associated with the request.
virtual void RequestCancelled(Service* service, bool ok) = 0;
@@ -175,7 +175,7 @@ class Call : public UntypedCall<Service> {
}
// Registers `callback` as the function that should be called if and when this
- // call is cancelled by the client.
+ // call is canceled by the client.
void SetCancelCallback(std::function<void()> callback) {
mutex_lock l(mu_);
cancel_callback_ = std::move(callback);
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc b/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
index b9dd0d82c0..07205bb2c2 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_master_service.cc
@@ -25,7 +25,7 @@ limitations under the License.
// A GrpcMasterService discovers remote devices in the background and
// keeps track of statistics of those remote devices.
//
-// Each session analyses the graph, places nodes across available
+// Each session analyzes the graph, places nodes across available
// devices, and ultimately drives the graph computation by initiating
// RunGraph on workers.
#include "tensorflow/core/distributed_runtime/rpc/grpc_master_service.h"
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_session_test.cc b/tensorflow/core/distributed_runtime/rpc/grpc_session_test.cc
index ff9d12657e..405b2939eb 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_session_test.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_session_test.cc
@@ -516,7 +516,7 @@ TEST(GrpcSessionTest, Error) {
//
// Subgraph for "b" sleeps at the node "b_delay". When the sleep
// finishes, the subgraph "b" will continue execution till it
- // notices that it is cancelled. Meanwhile, subgraph's executor
+ // notices that it is canceled. Meanwhile, subgraph's executor
// and its related state (registered ops) should still be alive.
auto b = test::graph::Constant(&g, Tensor());
b->set_assigned_device_name(dev_b);
diff --git a/tensorflow/core/distributed_runtime/worker_cache_logger.cc b/tensorflow/core/distributed_runtime/worker_cache_logger.cc
index cf1d6f88e9..5ca1d92a81 100644
--- a/tensorflow/core/distributed_runtime/worker_cache_logger.cc
+++ b/tensorflow/core/distributed_runtime/worker_cache_logger.cc
@@ -37,7 +37,7 @@ void WorkerCacheLogger::SetLogging(bool v) {
++want_logging_count_;
} else {
--want_logging_count_;
- // If RPCs get cancelled, it may be possible for the count
+ // If RPCs get canceled, it may be possible for the count
// to go negative. This should not be a fatal error, since
// logging is non-critical.
if (want_logging_count_ < 0) want_logging_count_ = 0;
diff --git a/tensorflow/core/framework/cancellation.h b/tensorflow/core/framework/cancellation.h
index 4cc3f92353..651c054fe8 100644
--- a/tensorflow/core/framework/cancellation.h
+++ b/tensorflow/core/framework/cancellation.h
@@ -36,7 +36,7 @@ namespace tensorflow {
// CancellationManager::get_cancellation_token.
typedef int64 CancellationToken;
-// A callback that is invoked when a step is cancelled.
+// A callback that is invoked when a step is canceled.
//
// NOTE(mrry): See caveats about CancelCallback implementations in the
// comment for CancellationManager::RegisterCallback.
diff --git a/tensorflow/core/framework/function_test.cc b/tensorflow/core/framework/function_test.cc
index 251f11a826..2ecdc36c11 100644
--- a/tensorflow/core/framework/function_test.cc
+++ b/tensorflow/core/framework/function_test.cc
@@ -162,7 +162,7 @@ REGISTER_OP("HasDefaultType")
// This verifies that a function using an op before a type attr (with
// a default) is added, still works. This is important for backwards
-// compatibilty.
+// compatibility.
TEST(TFunc, MissingTypeAttr) {
auto fdef = FDH::Create(
// Name
@@ -1020,7 +1020,7 @@ TEST(FunctionLibraryDefinitionTest, AddLibrary) {
EXPECT_EQ(s.error_message(),
"Gradient for function 'XTimesTwo' already exists.");
- // No conflicing functions or gradients OK
+ // No conflicting functions or gradients OK
proto.Clear();
*proto.add_function() = test::function::XTimesFour();
grad.set_function_name(test::function::XTimes16().signature().name());
diff --git a/tensorflow/core/framework/op_kernel.cc b/tensorflow/core/framework/op_kernel.cc
index 6c3917c686..dec987e1ed 100644
--- a/tensorflow/core/framework/op_kernel.cc
+++ b/tensorflow/core/framework/op_kernel.cc
@@ -96,9 +96,9 @@ OpKernel::OpKernel(OpKernelConstruction* context)
OP_REQUIRES_OK(context, CheckOpDeprecation(*context->op_def_,
context->graph_def_version()));
- // Kernels executing on GPU tie very few resources on the CPU where the
+ // Kernels executing on GPU/SYCL tie very few resources on the CPU where the
// scheduler runs: we consider them as inexpensive.
- expensive_ = context->device_type() != DeviceType(DEVICE_GPU);
+ expensive_ = context->device_type() != DeviceType(DEVICE_GPU) && context->device_type() != DeviceType(DEVICE_SYCL);
}
OpKernel::~OpKernel() {}
diff --git a/tensorflow/core/framework/resource_mgr.cc b/tensorflow/core/framework/resource_mgr.cc
index c3666f7ab9..4365a861e5 100644
--- a/tensorflow/core/framework/resource_mgr.cc
+++ b/tensorflow/core/framework/resource_mgr.cc
@@ -24,6 +24,34 @@ limitations under the License.
#include "tensorflow/core/platform/demangle.h"
namespace tensorflow {
+ResourceHandle MakeResourceHandle(OpKernelContext* ctx, const string& container,
+ const string& name,
+ const TypeIndex& type_index) {
+ ResourceHandle result;
+ result.set_device(ctx->device()->attributes().name());
+ string actual_container;
+ if (!container.empty()) {
+ actual_container = container;
+ } else {
+ actual_container = ctx->resource_manager()->default_container();
+ }
+ result.set_container(actual_container);
+ result.set_name(name);
+ result.set_hash_code(type_index.hash_code());
+ result.set_maybe_type_name(type_index.name());
+ return result;
+}
+
+Status MakeResourceHandleToOutput(OpKernelContext* context, int output_index,
+ const string& container, const string& name,
+ const TypeIndex& type_index) {
+ Tensor* handle;
+ TF_RETURN_IF_ERROR(
+ context->allocate_output(output_index, TensorShape({}), &handle));
+ handle->scalar<ResourceHandle>()() =
+ MakeResourceHandle(context, container, name, type_index);
+ return Status::OK();
+}
namespace internal {
diff --git a/tensorflow/core/framework/resource_mgr.h b/tensorflow/core/framework/resource_mgr.h
index 26a5766569..0e1a5a82d3 100644
--- a/tensorflow/core/framework/resource_mgr.h
+++ b/tensorflow/core/framework/resource_mgr.h
@@ -202,9 +202,20 @@ class ResourceMgr {
// Makes a resource handle with the specified type for a given container /
// name.
+ResourceHandle MakeResourceHandle(OpKernelContext* ctx, const string& container,
+ const string& name,
+ const TypeIndex& type_index);
+
template <typename T>
ResourceHandle MakeResourceHandle(OpKernelContext* ctx, const string& container,
- const string& name);
+ const string& name) {
+ return MakeResourceHandle(ctx, container, name, MakeTypeIndex<T>());
+}
+
+Status MakeResourceHandleToOutput(OpKernelContext* context, int output_index,
+ const string& container, const string& name,
+ const TypeIndex& type_index);
+
template <typename T>
ResourceHandle MakePerStepResourceHandle(OpKernelContext* ctx,
const string& name);
@@ -424,25 +435,6 @@ Status GetResourceFromContext(OpKernelContext* ctx, const string& input_name,
}
template <typename T>
-ResourceHandle MakeResourceHandle(OpKernelContext* ctx, const string& container,
- const string& name) {
- ResourceHandle result;
- result.set_device(ctx->device()->attributes().name());
- string actual_container;
- if (!container.empty()) {
- actual_container = container;
- } else {
- actual_container = ctx->resource_manager()->default_container();
- }
- result.set_container(actual_container);
- result.set_name(name);
- auto type_index = MakeTypeIndex<T>();
- result.set_hash_code(type_index.hash_code());
- result.set_maybe_type_name(type_index.name());
- return result;
-}
-
-template <typename T>
ResourceHandle MakePerStepResourceHandle(OpKernelContext* ctx,
const string& name) {
return MakeResourceHandle<T>(ctx, ctx->step_container()->name(), name);
diff --git a/tensorflow/core/framework/resource_op_kernel.h b/tensorflow/core/framework/resource_op_kernel.h
index de65657a9e..813ec6eed5 100644
--- a/tensorflow/core/framework/resource_op_kernel.h
+++ b/tensorflow/core/framework/resource_op_kernel.h
@@ -95,11 +95,9 @@ class ResourceOpKernel : public OpKernel {
resource_ = resource;
}
if (context->expected_output_dtype(0) == DT_RESOURCE) {
- Tensor* handle;
- OP_REQUIRES_OK(context,
- context->allocate_output(0, TensorShape({}), &handle));
- handle->scalar<ResourceHandle>()() =
- MakeResourceHandle<T>(context, cinfo_.container(), cinfo_.name());
+ OP_REQUIRES_OK(context, MakeResourceHandleToOutput(
+ context, 0, cinfo_.container(), cinfo_.name(),
+ MakeTypeIndex<T>()));
} else {
context->set_output_ref(0, &mu_, handle_.AccessTensor(context));
}
diff --git a/tensorflow/core/framework/shape_inference.cc b/tensorflow/core/framework/shape_inference.cc
index e9ead47fce..1f9e98551f 100644
--- a/tensorflow/core/framework/shape_inference.cc
+++ b/tensorflow/core/framework/shape_inference.cc
@@ -519,6 +519,10 @@ ShapeHandle InferenceContext::UnknownShape() {
ShapeHandle InferenceContext::UnknownShapeOfRank(int64 rank) {
CHECK_LE(rank, kint32max) << "rank must be less than kint32max";
+ if(rank == kUnknownRank) {
+ return UnknownShape();
+ }
+ CHECK_GE(rank, 0) << "rank must not be negative";
std::vector<DimensionHandle> dims(rank);
for (int32 i = 0; i < rank; ++i) {
dims[i] = UnknownDim();
diff --git a/tensorflow/core/framework/tensor.h b/tensorflow/core/framework/tensor.h
index 5810970a38..49eecc0b08 100644
--- a/tensorflow/core/framework/tensor.h
+++ b/tensorflow/core/framework/tensor.h
@@ -396,7 +396,7 @@ class Tensor {
typename TTypes<T, NDIMS>::ConstTensor flat_outer_dims() const;
template <typename T, size_t NDIMS = 3>
- typename TTypes<T, NDIMS>::Tensor flat_inner_outer_dims(int64 begin) const;
+ typename TTypes<T, NDIMS>::ConstTensor flat_inner_outer_dims(int64 begin) const;
/// Render the first `max_entries` values in `*this` into a string.
string SummarizeValue(int64 max_entries) const;
@@ -673,7 +673,7 @@ typename TTypes<T, NDIMS>::ConstTensor Tensor::flat_outer_dims() const {
}
template <typename T, size_t NDIMS>
-typename TTypes<T, NDIMS>::Tensor Tensor::flat_inner_outer_dims(int64 begin) const {
+typename TTypes<T, NDIMS>::ConstTensor Tensor::flat_inner_outer_dims(int64 begin) const {
gtl::InlinedVector<int64,4> flat_outer = ComputeFlatOuterDims(
shape_.dim_sizes(), begin + NDIMS);
return shaped<T, NDIMS>(ComputeFlatInnerDims(flat_outer, NDIMS));
diff --git a/tensorflow/core/graph/graph_constructor.cc b/tensorflow/core/graph/graph_constructor.cc
index 318ad4c9ed..6ea0c9560f 100644
--- a/tensorflow/core/graph/graph_constructor.cc
+++ b/tensorflow/core/graph/graph_constructor.cc
@@ -435,7 +435,7 @@ Status GraphConstructor::MakeNode(const NodeDef& node_def, Node** node) {
Status GraphConstructor::ValidateShape(Node* node) {
if (!opts_.importing) return Status::OK();
TF_RETURN_IF_ERROR(refiner_->AddNode(node));
- // For nodes with the _output_shapes atttribute, override the shape.
+ // For nodes with the _output_shapes attribute, override the shape.
std::vector<TensorShapeProto> shape_attrs;
const char* kAttrName = "_output_shapes";
if (!GetNodeAttr(node->attrs(), kAttrName, &shape_attrs).ok()) {
@@ -481,7 +481,7 @@ Status GraphConstructor::ValidateShape(Node* node) {
"MutableHashTableOfTensors", "Mutex", "CuckooTable", "IndexTable",
"WholeFileReader", "TextLineReader", "FixedLengthRecordReader",
"TFRecordReader", "IdentityReader", "RefSwitch", "RefEnter",
- "RefNextIteration", "RefMerge", "RefIdentity",
+ "RefNextIteration", "RefMerge", "RefIdentity", "LMDBReader",
// To be removed after 2017/04/24.
"ConditionalAccumulator", "SparseConditionalAccumulator", "Table",
};
diff --git a/tensorflow/core/graph/graph_constructor.h b/tensorflow/core/graph/graph_constructor.h
index bc4f23ed2d..7c34dd536c 100644
--- a/tensorflow/core/graph/graph_constructor.h
+++ b/tensorflow/core/graph/graph_constructor.h
@@ -57,7 +57,7 @@ extern Status ConvertNodeDefsToGraph(const GraphConstructorOptions& opts,
// On error, returns non-OK and leaves *g unmodified.
//
// "shape_refiner" can be null. It should be non-null if the caller
-// intends to add additonal nodes to the graph after the import. This
+// intends to add additional nodes to the graph after the import. This
// allows the caller to validate shapes of those nodes (since
// ShapeRefiner::AddNode must be called in topological order).
//
diff --git a/tensorflow/core/graph/testlib.cc b/tensorflow/core/graph/testlib.cc
index c495b21812..c59c44c80e 100644
--- a/tensorflow/core/graph/testlib.cc
+++ b/tensorflow/core/graph/testlib.cc
@@ -36,6 +36,10 @@ namespace tensorflow {
REGISTER_KERNEL_BUILDER(Name("HostConst").Device(DEVICE_CPU), HostConstantOp);
REGISTER_KERNEL_BUILDER(
Name("HostConst").Device(DEVICE_GPU).HostMemory("output"), HostConstantOp);
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(
+ Name("HostConst").Device(DEVICE_SYCL).HostMemory("output"), HostConstantOp);
+#endif // TENSORFLOW_USE_SYCL
// Register the HostConst Op
// Returns a constant tensor on the host. Useful for writing C++ tests
diff --git a/tensorflow/core/grappler/costs/analytical_cost_estimator.h b/tensorflow/core/grappler/costs/analytical_cost_estimator.h
index ef186fc021..cf9163302c 100644
--- a/tensorflow/core/grappler/costs/analytical_cost_estimator.h
+++ b/tensorflow/core/grappler/costs/analytical_cost_estimator.h
@@ -45,7 +45,7 @@ class AnalyticalCostEstimator : public CostEstimator {
bool use_static_shapes);
~AnalyticalCostEstimator() override {}
- // Initalizes the estimator for the specified grappler item.
+ // Initializes the estimator for the specified grappler item.
// This implementation always returns OK.
Status Initialize(const GrapplerItem& item) override;
diff --git a/tensorflow/core/grappler/costs/cost_estimator.h b/tensorflow/core/grappler/costs/cost_estimator.h
index 758a29696d..868c4a9733 100644
--- a/tensorflow/core/grappler/costs/cost_estimator.h
+++ b/tensorflow/core/grappler/costs/cost_estimator.h
@@ -134,7 +134,7 @@ class CostEstimator {
public:
virtual ~CostEstimator() {}
- // Initalizes the estimator for the specified grappler item.
+ // Initializes the estimator for the specified grappler item.
// The estimator shouldn't be used if this function returns any status other
// that OK.
virtual Status Initialize(const GrapplerItem& item) = 0;
diff --git a/tensorflow/core/grappler/costs/measuring_cost_estimator.h b/tensorflow/core/grappler/costs/measuring_cost_estimator.h
index a84853f6c7..1b3edb4c27 100644
--- a/tensorflow/core/grappler/costs/measuring_cost_estimator.h
+++ b/tensorflow/core/grappler/costs/measuring_cost_estimator.h
@@ -50,7 +50,7 @@ class MeasuringCostEstimator : public CostEstimator {
int measurement_threads);
~MeasuringCostEstimator() override {}
- // Initalizes the estimator for the specified grappler item.
+ // Initializes the estimator for the specified grappler item.
// This implementation always returns OK.
Status Initialize(const GrapplerItem& item) override;
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator.h b/tensorflow/core/grappler/costs/op_level_cost_estimator.h
index d234880919..ec7f21622f 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator.h
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator.h
@@ -36,7 +36,7 @@ class OpLevelCostEstimator {
protected:
// Returns an estimate of device performance (in billions of operations
- // executed per second) and memory bandwith (in GigaBytes/second) for the
+ // executed per second) and memory bandwidth (in GigaBytes/second) for the
// specified device.
virtual std::pair<double, double> GetDeviceInfo(
const DeviceProperties& device) const;
diff --git a/tensorflow/core/grappler/optimizers/model_pruner.cc b/tensorflow/core/grappler/optimizers/model_pruner.cc
index 4707266572..efa2163836 100644
--- a/tensorflow/core/grappler/optimizers/model_pruner.cc
+++ b/tensorflow/core/grappler/optimizers/model_pruner.cc
@@ -46,7 +46,7 @@ Status ModelPruner::Optimize(Cluster* cluster, const GrapplerItem& item,
if (nodes_to_preserve.find(node.name()) != nodes_to_preserve.end()) {
continue;
}
- // Don't remove nodes that are explicitely placed.
+ // Don't remove nodes that are explicitly placed.
if (!node.device().empty()) {
continue;
}
diff --git a/tensorflow/core/grappler/optimizers/model_pruner.h b/tensorflow/core/grappler/optimizers/model_pruner.h
index 3956d33961..3d76aebef4 100644
--- a/tensorflow/core/grappler/optimizers/model_pruner.h
+++ b/tensorflow/core/grappler/optimizers/model_pruner.h
@@ -22,7 +22,7 @@ namespace tensorflow {
namespace grappler {
// Prune a model to make it more efficient:
-// * Remove unecessary operations.
+// * Remove unnecessary operations.
// * Optimize gradient computations.
class ModelPruner : public GraphOptimizer {
public:
diff --git a/tensorflow/core/grappler/utils.h b/tensorflow/core/grappler/utils.h
index bc8b0e562e..a49791bad8 100644
--- a/tensorflow/core/grappler/utils.h
+++ b/tensorflow/core/grappler/utils.h
@@ -34,7 +34,7 @@ class NodeMap {
NodeDef* GetNode(const string& name) const;
const std::set<NodeDef*>& GetOutputs(const string& node_name) const;
// This method doesn't record the outputs of the added node; the outputs need
- // to be explictly added by the AddOutput method.
+ // to be explicitly added by the AddOutput method.
void AddNode(const string& name, NodeDef* node);
void AddOutput(const string& node, const string& output);
void UpdateOutput(const string& node, const string& old_output,
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index 34e03bf2a6..214897b7fa 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -33,6 +33,7 @@ load(
"tf_mkl_kernel_library",
"cc_header_only_library",
)
+load("@local_config_sycl//sycl:build_defs.bzl", "if_sycl")
load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_test")
load("//tensorflow:tensorflow.bzl", "tf_cuda_cc_tests")
load(
@@ -285,6 +286,15 @@ tf_kernel_library(
],
)
+tf_kernel_library(
+ name = "map_stage_op",
+ srcs = ["map_stage_op.cc"],
+ deps = [
+ "//tensorflow/core:framework",
+ "//tensorflow/core:lib",
+ ],
+)
+
cc_library(
name = "queue_base",
srcs = ["queue_base.cc"],
@@ -479,7 +489,7 @@ ARRAY_DEPS = [
"//tensorflow/core:proto_text",
"//tensorflow/core:protos_all_cc",
"//third_party/eigen3",
-]
+] + if_sycl(["//tensorflow/core:sycl_runtime"])
cc_library(
name = "array_not_windows",
@@ -1309,6 +1319,7 @@ cc_library(
":fifo_queue_op",
":lookup_table_init_op",
":lookup_table_op",
+ ":map_stage_op",
":padding_fifo_queue_op",
":priority_queue_op",
":queue_ops",
@@ -1893,6 +1904,7 @@ cc_library(
deps = [
":fixed_length_record_reader_op",
":identity_reader_op",
+ #":lmdb_reader_op",
":matching_files_op",
":reader_ops",
":restore_op",
@@ -1927,6 +1939,15 @@ tf_kernel_library(
deps = IO_DEPS,
)
+# TODO(jhseu): Restore after merge.
+#tf_kernel_library(
+# name = "lmdb_reader_op",
+# prefix = "lmdb_reader_op",
+# deps = IO_DEPS + [
+# "@lmdb",
+# ],
+#)
+
tf_kernel_library(
name = "matching_files_op",
prefix = "matching_files_op",
@@ -3430,7 +3451,7 @@ STATE_DEPS = [
"//tensorflow/core:framework",
"//tensorflow/core:lib",
"//tensorflow/core:state_ops_op_lib",
-]
+] + if_sycl(["//tensorflow/core:sycl_runtime"])
tf_kernel_library(
name = "count_up_to_op",
@@ -4318,6 +4339,7 @@ filegroup(
# not used on Android. Those ops also do not compile if included,
# unless we add the additional deps they need.
"tf_record_reader_op.*",
+ "lmdb_reader_op.*",
"string_to_hash_bucket_op.*",
"sdca_ops.*",
"sdca_internal.*",
@@ -4356,6 +4378,12 @@ cc_library(
"//conditions:default": [],
}),
copts = tf_copts(),
+ linkopts = select({
+ "//tensorflow:android": [
+ "-ldl",
+ ],
+ "//conditions:default": [],
+ }),
tags = [
"manual",
"notap",
diff --git a/tensorflow/core/kernels/adjust_contrast_op_test.cc b/tensorflow/core/kernels/adjust_contrast_op_test.cc
index 53205a1b3d..0fc03b5a23 100644
--- a/tensorflow/core/kernels/adjust_contrast_op_test.cc
+++ b/tensorflow/core/kernels/adjust_contrast_op_test.cc
@@ -33,7 +33,7 @@ class AdjustContrastOpTest : public OpsTestBase {
};
TEST_F(AdjustContrastOpTest, Simple_1113) {
- TF_EXPECT_OK(NodeDefBuilder("adjust_constrast_op", "AdjustContrastv2")
+ TF_EXPECT_OK(NodeDefBuilder("adjust_contrast_op", "AdjustContrastv2")
.Input(FakeInput(DT_FLOAT))
.Input(FakeInput(DT_FLOAT))
.Finalize(node_def()));
@@ -48,7 +48,7 @@ TEST_F(AdjustContrastOpTest, Simple_1113) {
}
TEST_F(AdjustContrastOpTest, Simple_1223) {
- TF_EXPECT_OK(NodeDefBuilder("adjust_constrast_op", "AdjustContrastv2")
+ TF_EXPECT_OK(NodeDefBuilder("adjust_contrast_op", "AdjustContrastv2")
.Input(FakeInput(DT_FLOAT))
.Input(FakeInput(DT_FLOAT))
.Finalize(node_def()));
@@ -65,7 +65,7 @@ TEST_F(AdjustContrastOpTest, Simple_1223) {
}
TEST_F(AdjustContrastOpTest, Big_99x99x3) {
- TF_EXPECT_OK(NodeDefBuilder("adjust_constrast_op", "AdjustContrastv2")
+ TF_EXPECT_OK(NodeDefBuilder("adjust_contrast_op", "AdjustContrastv2")
.Input(FakeInput(DT_FLOAT))
.Input(FakeInput(DT_FLOAT))
.Finalize(node_def()));
diff --git a/tensorflow/core/kernels/batch_dataset_op.cc b/tensorflow/core/kernels/batch_dataset_op.cc
index 443859a95c..c8289eff2a 100644
--- a/tensorflow/core/kernels/batch_dataset_op.cc
+++ b/tensorflow/core/kernels/batch_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class BatchDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/batch_matmul_op_impl.h b/tensorflow/core/kernels/batch_matmul_op_impl.h
index dfc81a960e..b87c98c374 100644
--- a/tensorflow/core/kernels/batch_matmul_op_impl.h
+++ b/tensorflow/core/kernels/batch_matmul_op_impl.h
@@ -39,6 +39,9 @@ namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+#endif // TENSORFLOW_USE_SYCL
namespace {
@@ -413,6 +416,40 @@ struct LaunchBatchMatMul<GPUDevice, Scalar> {
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+template <typename Scalar>
+struct ParallelMatMulKernelSYCL {
+ static void Run(const OpKernelContext* context, const Tensor& in_x,
+ const Tensor& in_y, bool adj_x, bool adj_y, Tensor* out,
+ int start, int limit) {
+ auto Tx = in_x.tensor<Scalar, 3>();
+ auto Ty = in_y.tensor<Scalar, 3>();
+ auto Tz = out->tensor<Scalar, 3>();
+ Eigen::array<Eigen::IndexPair<Eigen::DenseIndex>, 1> contract_pairs;
+ contract_pairs[0] = ContractionDims(adj_x, adj_y);
+ auto d = context->eigen_sycl_device();
+ for (int i = start; i < limit; ++i) {
+ auto x = Tx.template chip<0>(i);
+ auto y = Ty.template chip<0>(i);
+ auto z = Tz.template chip<0>(i);
+ z.device(d) = x.contract(y, contract_pairs);
+ }
+ }
+};
+
+template <typename Scalar>
+struct LaunchBatchMatMul<SYCLDevice, Scalar> {
+ static void Launch(OpKernelContext* context, const Tensor& in_x,
+ const Tensor& in_y, bool adj_x, bool adj_y, Tensor* out) {
+
+ // Number of matrix multiplies i.e. size of the batch.
+ const int64 num_units = in_x.dim_size(0);
+ ParallelMatMulKernelSYCL<Scalar>::Run(context, in_x, in_y, adj_x, adj_y, out,
+ 0, num_units);
+ }
+};
+#endif // TENSORFLOW_USE_SYCL
+
template <typename Device, typename Scalar>
class BatchMatMul : public OpKernel {
public:
@@ -492,4 +529,10 @@ class BatchMatMul : public OpKernel {
Name("BatchMatMul").Device(DEVICE_GPU).TypeConstraint<TYPE>("T"), \
BatchMatMul<GPUDevice, TYPE>)
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_BATCH_MATMUL_SYCL(TYPE) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("BatchMatMul").Device(DEVICE_SYCL).TypeConstraint<TYPE>("T"), \
+ BatchMatMul<SYCLDevice, TYPE>)
+#endif // TENSORFLOW_USE_SYCL
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/batch_matmul_op_real.cc b/tensorflow/core/kernels/batch_matmul_op_real.cc
index c719e30c4d..1900ed8e31 100644
--- a/tensorflow/core/kernels/batch_matmul_op_real.cc
+++ b/tensorflow/core/kernels/batch_matmul_op_real.cc
@@ -30,4 +30,8 @@ TF_CALL_half(REGISTER_BATCH_MATMUL_GPU);
#endif
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+TF_CALL_float(REGISTER_BATCH_MATMUL_SYCL);
+TF_CALL_double(REGISTER_BATCH_MATMUL_SYCL);
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/batch_norm_op.cc b/tensorflow/core/kernels/batch_norm_op.cc
index 56f4e25fad..d3ed617f71 100644
--- a/tensorflow/core/kernels/batch_norm_op.cc
+++ b/tensorflow/core/kernels/batch_norm_op.cc
@@ -28,6 +28,9 @@ namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+#endif // TENSORFLOW_USE_SYCL
template <typename Device, typename T>
class BatchNormOp : public OpKernel {
@@ -201,6 +204,18 @@ TF_CALL_float(REGISTER_GPU_KERNEL);
#endif // GOOGLE_CUDA
+#if TENSORFLOW_USE_SYCL
+#define REGISTER_KERNEL(T) \
+ REGISTER_KERNEL_BUILDER(Name("BatchNormWithGlobalNormalization") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<T>("T"), \
+ BatchNormOp<SYCLDevice, T>);
+
+TF_CALL_float(REGISTER_KERNEL);
+TF_CALL_double(REGISTER_KERNEL);
+#undef REGISTER_KERNEL
+#endif // TENSORFLOW_USE_SYCL
+
#define REGISTER_KERNEL(T) \
REGISTER_KERNEL_BUILDER(Name("BatchNormWithGlobalNormalizationGrad") \
.Device(DEVICE_CPU) \
@@ -248,4 +263,17 @@ TF_CALL_float(REGISTER_GPU_KERNEL);
#endif // GOOGLE_CUDA
+#if TENSORFLOW_USE_SYCL
+#define REGISTER_KERNEL(T) \
+ REGISTER_KERNEL_BUILDER(Name("BatchNormWithGlobalNormalizationGrad") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<T>("T"), \
+ BatchNormGradOp<SYCLDevice, T>);
+
+TF_CALL_float(REGISTER_KERNEL);
+TF_CALL_double(REGISTER_KERNEL);
+#undef REGISTER_KERNEL
+
+#endif // TENSORFLOW_USE_SYCL
+
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/cast_op.cc b/tensorflow/core/kernels/cast_op.cc
index 562934ed63..8bad488482 100644
--- a/tensorflow/core/kernels/cast_op.cc
+++ b/tensorflow/core/kernels/cast_op.cc
@@ -239,12 +239,11 @@ class SyclCastOp : public CastOpBase {
};
#define REGISTER_CAST_SYCL(srctype, dsttype) \
- REGISTER_KERNEL_BUILDER(Name("Cast") \
- .TypeConstraint<srctype>("SrcT") \
- .TypeConstraint<dsttype>("DstT") \
+ REGISTER_KERNEL_BUILDER(Name("Cast") \
+ .TypeConstraint<srctype>("SrcT") \
+ .TypeConstraint<dsttype>("DstT") \
.Device(DEVICE_SYCL), \
SyclCastOp)
-
CURRY_TYPES2(REGISTER_CAST_SYCL, bool);
CURRY_TYPES2(REGISTER_CAST_SYCL, int32);
CURRY_TYPES2(REGISTER_CAST_SYCL, int64);
diff --git a/tensorflow/core/kernels/cast_op.h b/tensorflow/core/kernels/cast_op.h
index 0def600ac0..5c24f164a4 100644
--- a/tensorflow/core/kernels/cast_op.h
+++ b/tensorflow/core/kernels/cast_op.h
@@ -50,7 +50,7 @@ template <typename From, typename To>
struct scalar_cast_op<std::complex<From>, To> {
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE To
operator()(const std::complex<From>& a) const {
- // Replicate numpy behaviour of returning just the real part
+ // Replicate numpy behavior of returning just the real part
return static_cast<To>(a.real());
}
};
@@ -59,7 +59,7 @@ template <typename From, typename To>
struct scalar_cast_op<From, std::complex<To>> {
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE std::complex<To> operator()(
const From& a) const {
- // Replicate numpy behaviour of setting the imaginary part to 0
+ // Replicate numpy behavior of setting the imaginary part to 0
return std::complex<To>(static_cast<To>(a), To(0));
}
};
diff --git a/tensorflow/core/kernels/cast_op_impl_int32.cc b/tensorflow/core/kernels/cast_op_impl_int32.cc
index fca9cd60ec..69ed760455 100644
--- a/tensorflow/core/kernels/cast_op_impl_int32.cc
+++ b/tensorflow/core/kernels/cast_op_impl_int32.cc
@@ -38,10 +38,9 @@ GetGpuCastFromInt32(DataType dst_dtype) {
typedef Eigen::SyclDevice SYCLDevice;
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetSyclCastFromInt32(DataType dst_dtype) {
- CURRY_TYPES3(CAST_CASE, CPUDevice, int32);
+ CURRY_TYPES3(CAST_CASE, SYCLDevice, int32);
return nullptr;
}
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
-
diff --git a/tensorflow/core/kernels/cast_op_impl_int64.cc b/tensorflow/core/kernels/cast_op_impl_int64.cc
index c0a543708d..7a8363ca39 100644
--- a/tensorflow/core/kernels/cast_op_impl_int64.cc
+++ b/tensorflow/core/kernels/cast_op_impl_int64.cc
@@ -19,9 +19,6 @@ namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
-#ifdef TENSORFLOW_USE_SYCL
-typedef Eigen::SyclDevice SYCLDevice;
-#endif // TENSORFLOW_USE_SYCL
std::function<void(OpKernelContext*, const Tensor&, Tensor*)>
GetCpuCastFromInt64(DataType dst_dtype) {
diff --git a/tensorflow/core/kernels/concat_lib_cpu.cc b/tensorflow/core/kernels/concat_lib_cpu.cc
index 9ad1e60c6c..258ce15456 100644
--- a/tensorflow/core/kernels/concat_lib_cpu.cc
+++ b/tensorflow/core/kernels/concat_lib_cpu.cc
@@ -95,7 +95,7 @@ void ConcatSYCL(const Eigen::SyclDevice& d,
const std::vector<std::unique_ptr<typename TTypes<T, 2>::ConstMatrix>>&, \
typename TTypes<T, 2>::Matrix* output);
-TF_CALL_GPU_NUMBER_TYPES(REGISTER_SYCL)
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL)
#undef REGISTER_SYCL
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/concat_op.cc b/tensorflow/core/kernels/concat_op.cc
index 916bbc4996..e7848a7e26 100644
--- a/tensorflow/core/kernels/concat_op.cc
+++ b/tensorflow/core/kernels/concat_op.cc
@@ -232,7 +232,8 @@ REGISTER_KERNEL_BUILDER(Name("ConcatV2")
.HostMemory("axis"), \
ConcatV2Op<SYCLDevice, type>)
-TF_CALL_GPU_NUMBER_TYPES(REGISTER_SYCL);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
+
REGISTER_KERNEL_BUILDER(Name("Concat")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
@@ -248,6 +249,7 @@ REGISTER_KERNEL_BUILDER(Name("ConcatV2")
.HostMemory("axis")
.HostMemory("output"),
ConcatV2Op<CPUDevice, int32>);
+
#undef REGISTER_SYCL
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/constant_op.cc b/tensorflow/core/kernels/constant_op.cc
index 15fc086752..68e960d6b7 100644
--- a/tensorflow/core/kernels/constant_op.cc
+++ b/tensorflow/core/kernels/constant_op.cc
@@ -30,6 +30,10 @@ limitations under the License.
#include "tensorflow/core/kernels/fill_functor.h"
#include "tensorflow/core/platform/macros.h"
+#ifdef TENSORFLOW_USE_SYCL
+#include "tensorflow/core/common_runtime/sycl/sycl_util.h"
+#endif // TENSORFLOW_USE_SYCL
+
namespace tensorflow {
ConstantOp::ConstantOp(OpKernelConstruction* ctx)
@@ -52,18 +56,6 @@ ConstantOp::~ConstantOp() {}
REGISTER_KERNEL_BUILDER(Name("Const").Device(DEVICE_CPU), ConstantOp);
-#if TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNEL(TYPE) \
- REGISTER_KERNEL_BUILDER( \
- Name("Const").Device(DEVICE_SYCL).TypeConstraint<TYPE>("dtype"), \
- ConstantOp);
-REGISTER_SYCL_KERNEL(float);
-REGISTER_SYCL_KERNEL(double);
-REGISTER_SYCL_KERNEL(bool);
-REGISTER_SYCL_KERNEL(int64);
-#undef REGISTER_SYCL_KERNEL
-#endif
-
#if GOOGLE_CUDA
#define REGISTER_KERNEL(D, TYPE) \
REGISTER_KERNEL_BUILDER( \
@@ -85,6 +77,22 @@ REGISTER_KERNEL(GPU, bool);
#undef REGISTER_KERNEL
#endif
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNEL(D, TYPE) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Const").Device(DEVICE_##D).TypeConstraint<TYPE>("dtype"), \
+ ConstantOp);
+REGISTER_SYCL_KERNEL(SYCL, float);
+REGISTER_SYCL_KERNEL(SYCL, double);
+REGISTER_SYCL_KERNEL(SYCL, uint8);
+REGISTER_SYCL_KERNEL(SYCL, int8);
+REGISTER_SYCL_KERNEL(SYCL, uint16);
+REGISTER_SYCL_KERNEL(SYCL, int16);
+REGISTER_SYCL_KERNEL(SYCL, int64);
+REGISTER_SYCL_KERNEL(SYCL, bool);
+#undef REGISTER_SYCL_KERNEL
+#endif
+
HostConstantOp::HostConstantOp(OpKernelConstruction* ctx)
: OpKernel(ctx), tensor_(ctx->output_type(0)) {
const TensorProto* proto = nullptr;
@@ -116,9 +124,6 @@ REGISTER_KERNEL_BUILDER(Name("Const")
#endif
#ifdef TENSORFLOW_USE_SYCL
-// A special GPU kernel for int32.
-// TODO(b/25387198): Also enable int32 in device memory. This kernel
-// registration requires all int32 inputs and outputs to be in host memory.
REGISTER_KERNEL_BUILDER(Name("Const")
.Device(DEVICE_SYCL)
.HostMemory("output")
@@ -143,17 +148,6 @@ struct FillFunctor<CPUDevice, T> {
}
};
-#ifdef TENSORFLOW_USE_SYCL
-// Partial specialization of FillFunctor<Device=SYCLDevice, T>.
-template <typename T>
-struct FillFunctor<SYCLDevice, T> {
- void operator()(const SYCLDevice& d, typename TTypes<T>::Flat out,
- typename TTypes<T>::ConstScalar in) {
- To32Bit(out).device(d) = To32Bit(out).constant(in());
- }
-};
-#endif // TENSORFLOW_USE_SYCL
-
} // end namespace functor
template <typename Device, typename T>
@@ -184,6 +178,28 @@ class FillOp : public OpKernel {
}
};
+#ifdef TENSORFLOW_USE_SYCL
+
+namespace functor {
+// Partial specialization of FillFunctor<Device=SYCLDevice, T>.
+template <typename T>
+struct FillFunctor<SYCLDevice, T> {
+ void operator()(const SYCLDevice& d, typename TTypes<T>::Flat out,
+ typename TTypes<T>::ConstScalar in) {
+#if !defined(EIGEN_HAS_INDEX_LIST)
+ Eigen::array<int, 1> rank1{1};
+#else
+ Eigen::IndexList<Eigen::type2index<1>> rank1;
+#endif
+ const int size = out.dimension(0);
+ Eigen::array<int, 1> broadcast_dims{size};
+
+ To32Bit(out).device(d) = in.reshape(rank1).broadcast(broadcast_dims);
+ }
+};
+}
+#endif // TENSORFLOW_USE_SYCL
+
#define REGISTER_KERNEL(D, TYPE) \
REGISTER_KERNEL_BUILDER(Name("Fill") \
.Device(DEVICE_##D) \
@@ -199,8 +215,14 @@ REGISTER_KERNEL(CPU, quint8);
#undef REGISTER_CPU_KERNEL
#ifdef TENSORFLOW_USE_SYCL
-REGISTER_KERNEL(SYCL, float)
-REGISTER_KERNEL(SYCL, double)
+REGISTER_KERNEL(SYCL, float);
+REGISTER_KERNEL(SYCL, double);
+REGISTER_KERNEL(SYCL, uint8);
+REGISTER_KERNEL(SYCL, int8);
+REGISTER_KERNEL(SYCL, uint16);
+REGISTER_KERNEL(SYCL, int16);
+REGISTER_KERNEL(SYCL, int64);
+
REGISTER_KERNEL_BUILDER(Name("Fill")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
@@ -208,6 +230,7 @@ REGISTER_KERNEL_BUILDER(Name("Fill")
.HostMemory("value")
.HostMemory("output"),
FillOp<CPUDevice, int32>);
+#undef REGISTER_KERNEL_SYCL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
@@ -260,8 +283,10 @@ TF_CALL_POD_STRING_TYPES(REGISTER_CPU);
#undef REGISTER_CPU
#ifdef TENSORFLOW_USE_SYCL
-REGISTER_KERNEL(float, SYCL);
REGISTER_KERNEL(bool, SYCL);
+REGISTER_KERNEL(float, SYCL);
+REGISTER_KERNEL(double, SYCL);
+REGISTER_KERNEL(int64, SYCL);
REGISTER_KERNEL_BUILDER(Name("ZerosLike")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
diff --git a/tensorflow/core/kernels/conv_grad_input_ops.cc b/tensorflow/core/kernels/conv_grad_input_ops.cc
index a94b1bea4b..eb9a616966 100644
--- a/tensorflow/core/kernels/conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_input_ops.cc
@@ -176,7 +176,7 @@ struct LaunchXsmmBackwardInputConvolution<CPUDevice, float> {
desc.filter_format =
LIBXSMM_DNN_TENSOR_FORMAT_LIBXSMM; // LIBXSMM_DNN_TENSOR_FORMAT_RSCK;
desc.fuse_ops = LIBXSMM_DNN_CONV_FUSE_NONE;
- desc.options = LIBXSMM_DNN_CONV_OPTION_WU_EXT_FILTER_REDUCE;
+ desc.options = LIBXSMM_DNN_CONV_OPTION_WU_EXT_FILTER_REDUCE_OVERWRITE;
desc.datatype = LIBXSMM_DNN_DATATYPE_F32;
auto input_ptr = input_backward.data();
diff --git a/tensorflow/core/kernels/conv_ops.cc b/tensorflow/core/kernels/conv_ops.cc
index 8c75b312ef..f8eb9c555e 100644
--- a/tensorflow/core/kernels/conv_ops.cc
+++ b/tensorflow/core/kernels/conv_ops.cc
@@ -228,7 +228,7 @@ class LaunchXsmmConvOp<CPUDevice, float> {
desc.buffer_format = LIBXSMM_DNN_TENSOR_FORMAT_NHWC;
desc.filter_format = LIBXSMM_DNN_TENSOR_FORMAT_LIBXSMM;
desc.fuse_ops = LIBXSMM_DNN_CONV_FUSE_NONE;
- desc.options = LIBXSMM_DNN_CONV_OPTION_WU_EXT_FILTER_REDUCE;
+ desc.options = LIBXSMM_DNN_CONV_OPTION_WU_EXT_FILTER_REDUCE_OVERWRITE;
desc.datatype = LIBXSMM_DNN_DATATYPE_F32;
if (!CanUseXsmmConv2D(desc, data_format)) {
diff --git a/tensorflow/core/kernels/conv_ops_fused.cc b/tensorflow/core/kernels/conv_ops_fused.cc
index f7348f1077..291ebf2298 100644
--- a/tensorflow/core/kernels/conv_ops_fused.cc
+++ b/tensorflow/core/kernels/conv_ops_fused.cc
@@ -713,7 +713,7 @@ class FusedResizeConv2DUsingGemmOp : public OpKernel {
const int32 before =
paddings_matrix(d, 0); // Pad before existing elements.
const int32 after =
- paddings_matrix(d, 1); // Pad after exisitng elements.
+ paddings_matrix(d, 1); // Pad after existing elements.
OP_REQUIRES(context, before >= 0 && after >= 0,
errors::InvalidArgument("paddings must be non-negative: ",
before, " ", after));
diff --git a/tensorflow/core/kernels/cuda_solvers.h b/tensorflow/core/kernels/cuda_solvers.h
index 70ccbb90cc..5d1c807e66 100644
--- a/tensorflow/core/kernels/cuda_solvers.h
+++ b/tensorflow/core/kernels/cuda_solvers.h
@@ -116,7 +116,7 @@ class CudaSolver {
// Launches a memcpy of solver status data specified by dev_lapack_info from
// device to the host, and asynchronously invokes the given callback when the
// copy is complete. The first Status argument to the callback will be
- // Status::OK if all lapack infos retrived are zero, otherwise an error status
+ // Status::OK if all lapack infos retrieved are zero, otherwise an error status
// is given. The second argument contains a host-side copy of the entire set
// of infos retrieved, and can be used for generating detailed error messages.
Status CopyLapackInfoToHostAsync(
diff --git a/tensorflow/core/kernels/cwise_ops_common.h b/tensorflow/core/kernels/cwise_ops_common.h
index f30d889de2..b43370ee65 100644
--- a/tensorflow/core/kernels/cwise_ops_common.h
+++ b/tensorflow/core/kernels/cwise_ops_common.h
@@ -468,7 +468,7 @@ struct ApproximateEqual<CPUDevice, T> {
// Macros to register kernels for multiple types (T0, T1, etc.) on
// device type "D" (CPU or GPU) for operation "N" (e.g., sqrt) using
-// the functor "F" (e.g., functor:sqrt).
+// the functor "F" (e.g., functor::sqrt).
#if defined(__ANDROID_TYPES_SLIM__)
// Note that __ANDROID_TYPES_SLIM__ is also checked in the cwise_ops*.cc files.
diff --git a/tensorflow/core/kernels/debug_ops.cc b/tensorflow/core/kernels/debug_ops.cc
index 55a7657ea8..965a60c7e0 100644
--- a/tensorflow/core/kernels/debug_ops.cc
+++ b/tensorflow/core/kernels/debug_ops.cc
@@ -28,25 +28,25 @@ REGISTER_KERNEL_BUILDER(Name("Copy").Device(DEVICE_CPU), CopyOp);
REGISTER_KERNEL_BUILDER(Name("CopyHost").Device(DEVICE_CPU), CopyOp);
-#ifdef TENSORFLOW_USE_SYCL
-REGISTER_KERNEL_BUILDER(Name("Copy").Device(DEVICE_SYCL), CopyOp);
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("Copy").Device(DEVICE_GPU), CopyOp);
REGISTER_KERNEL_BUILDER(Name("CopyHost")
- .Device(DEVICE_SYCL)
+ .Device(DEVICE_GPU)
.HostMemory("input")
.HostMemory("output"),
CopyOp);
-#endif // TENSORFLOW_USE_SYCL
+#endif // GOOGLE_CUDA
-#if GOOGLE_CUDA
-REGISTER_KERNEL_BUILDER(Name("Copy").Device(DEVICE_GPU), CopyOp);
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("Copy").Device(DEVICE_SYCL), CopyOp);
REGISTER_KERNEL_BUILDER(Name("CopyHost")
- .Device(DEVICE_GPU)
+ .Device(DEVICE_SYCL)
.HostMemory("input")
.HostMemory("output"),
CopyOp);
-#endif
+#endif // TENSORFLOW_USE_SYCL
// Register debug identity (non-ref and ref) ops.
REGISTER_KERNEL_BUILDER(Name("DebugIdentity").Device(DEVICE_CPU),
@@ -126,15 +126,16 @@ TF_CALL_double(REGISTER_GPU_DEBUG_NUMERIC_SUMMARY_COUNT);
#endif // GOOGLE_CUDA
#if TENSORFLOW_USE_SYCL
-#define REGISTER_GPU_DEBUG_NUMERIC_SUMMARY_COUNT(type) \
+#define REGISTER_SYCL_DEBUG_NUMERIC_SUMMARY_COUNT(type) \
REGISTER_KERNEL_BUILDER(Name("DebugNumericSummary") \
.Device(DEVICE_SYCL) \
.HostMemory("input") \
.HostMemory("output") \
.TypeConstraint<type>("T"), \
DebugNumericSummaryOp<type>);
-REGISTER_GPU_DEBUG_NUMERIC_SUMMARY_COUNT(float);
-REGISTER_GPU_DEBUG_NUMERIC_SUMMARY_COUNT(double);
+TF_CALL_bool(REGISTER_SYCL_DEBUG_NUMERIC_SUMMARY_COUNT);
+TF_CALL_INTEGRAL_TYPES(REGISTER_SYCL_DEBUG_NUMERIC_SUMMARY_COUNT);
+TF_CALL_float(REGISTER_SYCL_DEBUG_NUMERIC_SUMMARY_COUNT);
+TF_CALL_double(REGISTER_SYCL_DEBUG_NUMERIC_SUMMARY_COUNT);
#endif // TENSORFLOW_USE_SYCL
-
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/debug_ops.h b/tensorflow/core/kernels/debug_ops.h
index a9fa59ab01..ef12e2e42c 100644
--- a/tensorflow/core/kernels/debug_ops.h
+++ b/tensorflow/core/kernels/debug_ops.h
@@ -19,6 +19,9 @@ limitations under the License.
#if GOOGLE_CUDA
#include "tensorflow/core/common_runtime/gpu/gpu_util.h"
#endif
+#ifdef TENSORFLOW_USE_SYCL
+#include "tensorflow/core/common_runtime/sycl/sycl_util.h"
+#endif // TENSORFLOW_USE_SYCL
#include "tensorflow/core/debug/debug_io_utils.h"
#include "tensorflow/core/framework/device_base.h"
#include "tensorflow/core/framework/op_kernel.h"
@@ -84,6 +87,22 @@ class CopyOp : public OpKernel {
// The input tensor is on the host (CPU): deep-copy from CPU to CPU.
*copied_tensor = tensor::DeepCopy(src_tensor);
}
+#elif defined(TENSORFLOW_USE_SYCL)
+ Device* device = static_cast<Device*>(context->device());
+ // Determine if the input tensor is not on CPU (e.g., on GPU).
+ const bool off_host_input = device->device_type() == DEVICE_SYCL &&
+ !context->input_alloc_attr(0).on_host();
+
+ if (off_host_input) {
+ auto size = src_tensor.NumElements() * sizeof(src_tensor.dtype());
+ auto dst_ptr = GetBase(copied_tensor);
+ auto src_ptr = GetBase(&src_tensor);
+ typedef decltype(src_tensor.dtype()) ttype;
+ context->eigen_sycl_device().memcpy(
+ dst_ptr, static_cast<const ttype*>(src_ptr), size);
+ } else {
+ *copied_tensor = tensor::DeepCopy(src_tensor);
+ }
#else
*copied_tensor = tensor::DeepCopy(src_tensor);
#endif
diff --git a/tensorflow/core/kernels/decode_raw_op.cc b/tensorflow/core/kernels/decode_raw_op.cc
index da247161f9..9492a4e26d 100644
--- a/tensorflow/core/kernels/decode_raw_op.cc
+++ b/tensorflow/core/kernels/decode_raw_op.cc
@@ -70,10 +70,24 @@ class DecodeRawOp : public OpKernel {
auto out = output_tensor->flat_inner_dims<T>();
DCHECK_EQ(flat_in.size(), out.dimensions()[0]);
T* out_data = out.data();
- for (int64 i = 0; i < flat_in.size(); ++i) {
- const T* in_data = reinterpret_cast<const T*>(flat_in(i).data());
- memcpy(out_data, in_data, str_size);
- out_data += added_dim;
+ if (port::kLittleEndian == little_endian_ || sizeof(T) == 1) {
+ for (int64 i = 0; i < flat_in.size(); ++i) {
+ const T* in_data = reinterpret_cast<const T*>(flat_in(i).data());
+ memcpy(out_data, in_data, str_size);
+ out_data += added_dim;
+ }
+ } else {
+ for (int64 i = 0; i < flat_in.size(); ++i) {
+ const char* in_data_bytes =
+ reinterpret_cast<const char*>(flat_in(i).data());
+ char* out_data_bytes = reinterpret_cast<char*>(out_data);
+ const char* p = in_data_bytes;
+ char* q = out_data_bytes;
+ for (; p < in_data_bytes + str_size; p += sizeof(T), q += sizeof(T)) {
+ std::reverse_copy(p, p + sizeof(T), q);
+ }
+ out_data += added_dim;
+ }
}
}
diff --git a/tensorflow/core/kernels/deep_conv2d.cc b/tensorflow/core/kernels/deep_conv2d.cc
index a481401479..8e9b8a7e2e 100644
--- a/tensorflow/core/kernels/deep_conv2d.cc
+++ b/tensorflow/core/kernels/deep_conv2d.cc
@@ -26,7 +26,7 @@ limitations under the License.
namespace tensorflow {
-// DeepConv2D is a Conv2D implementation specialzied for deep convolutions (i.e
+// DeepConv2D is a Conv2D implementation specialized for deep convolutions (i.e
// large 'in_depth' and 'out_depth' product. See cost models below for details).
//
// DeepConv2D is implemented by computing the following equation:
diff --git a/tensorflow/core/kernels/deep_conv2d.h b/tensorflow/core/kernels/deep_conv2d.h
index a9de20e7ae..c3f6f66dc9 100644
--- a/tensorflow/core/kernels/deep_conv2d.h
+++ b/tensorflow/core/kernels/deep_conv2d.h
@@ -22,7 +22,7 @@ namespace tensorflow {
class OpKernelContext;
-// DeepConv2D is a Conv2D implementation specialzied for deep (i.e. large
+// DeepConv2D is a Conv2D implementation specialized for deep (i.e. large
// in_depth * out_depth product) convolutions (see deep_conv2d.cc for details).
// DeepConv2DTransform is an interface for implementing transforms for
diff --git a/tensorflow/core/kernels/dense_to_sparse_batch_dataset_op.cc b/tensorflow/core/kernels/dense_to_sparse_batch_dataset_op.cc
index b93a3b2970..2c36093355 100644
--- a/tensorflow/core/kernels/dense_to_sparse_batch_dataset_op.cc
+++ b/tensorflow/core/kernels/dense_to_sparse_batch_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class DenseToSparseBatchDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/dense_update_ops.cc b/tensorflow/core/kernels/dense_update_ops.cc
index 767f143727..33991fa1f9 100644
--- a/tensorflow/core/kernels/dense_update_ops.cc
+++ b/tensorflow/core/kernels/dense_update_ops.cc
@@ -126,6 +126,9 @@ class DenseUpdateOp : public OpKernel {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+#endif // TENSORFLOW_USE_SYCL
#define REGISTER_KERNELS(type) \
REGISTER_KERNEL_BUILDER( \
@@ -136,26 +139,6 @@ TF_CALL_ALL_TYPES(REGISTER_KERNELS);
TF_CALL_QUANTIZED_TYPES(REGISTER_KERNELS);
#undef REGISTER_KERNELS
-#if TENSORFLOW_USE_SYCL
-typedef Eigen::SyclDevice SYCLDevice;
-#define REGISTER_SYCL_KERNEL(type) \
- REGISTER_KERNEL_BUILDER( \
- Name("Assign") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<type>("T"), \
- AssignOpT<SYCLDevice, type>); \
- REGISTER_KERNEL_BUILDER( \
- Name("AssignAdd").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
- DenseUpdateOp<SYCLDevice, type, DenseUpdateType::ADD>); \
- REGISTER_KERNEL_BUILDER( \
- Name("AssignSub").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
- DenseUpdateOp<SYCLDevice, type, DenseUpdateType::SUB>);
-
-REGISTER_SYCL_KERNEL(float);
-REGISTER_SYCL_KERNEL(double);
-#undef REGISTER_SYCL_KERNEL
-#endif
-
#if GOOGLE_CUDA
// Only register 'Assign' on GPU for the subset of types also supported by
// 'Variable' (see variable_ops.cc.)
@@ -175,6 +158,16 @@ TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
#undef REGISTER_GPU_KERNELS
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNELS(type) \
+REGISTER_KERNEL_BUILDER( \
+ Name("Assign").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
+ AssignOpT<SYCLDevice, type>);
+
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL_KERNELS);
+#undef REGISTER_SYCL_KERNELS
+#endif // TENSORFLOW_USE_SYCL
+
#define REGISTER_KERNELS(type) \
REGISTER_KERNEL_BUILDER( \
Name("AssignAdd").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
@@ -214,4 +207,16 @@ TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
#undef REGISTER_GPU_KERNELS
#endif // end GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNELS(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("AssignAdd").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
+ DenseUpdateOp<SYCLDevice, type, DenseUpdateType::ADD>); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("AssignSub").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
+ DenseUpdateOp<SYCLDevice, type, DenseUpdateType::SUB>);
+
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL_KERNELS);
+#undef REGISTER_SYCL_KERNELS
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/fft_ops.cc b/tensorflow/core/kernels/fft_ops.cc
index b479956632..593fa487c9 100644
--- a/tensorflow/core/kernels/fft_ops.cc
+++ b/tensorflow/core/kernels/fft_ops.cc
@@ -17,7 +17,6 @@ limitations under the License.
// See docs in ../ops/spectral_ops.cc.
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/tensor.h"
@@ -26,6 +25,7 @@ limitations under the License.
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/types.h"
#include "tensorflow/core/util/work_sharder.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#if GOOGLE_CUDA
#include "tensorflow/core/platform/stream_executor.h"
@@ -163,9 +163,58 @@ class FFTCPU : public FFTBase {
output.device(device) =
full_fft.slice(zero_start_indices, output.dimensions());
} else {
- // TODO: reconstruct the full fft and take the inverse.
- ctx->CtxFailureWithWarning(
- errors::Unimplemented("IRFFT is not implemented as a CPU kernel"));
+ // Reconstruct the full fft and take the inverse.
+ auto input = ((Tensor)in).flat_inner_dims<complex64, FFTRank + 1>();
+ auto output = out->flat_inner_dims<float, FFTRank + 1>();
+
+ auto sizes = input.dimensions();
+
+ // Calculate the shape of full-fft temporary tensor.
+ TensorShape fullShape;
+ fullShape.AddDim(sizes[0]);
+ for (auto i = 1; i <= FFTRank; i++) {
+ fullShape.AddDim(fft_shape[i - 1]);
+ }
+
+ Tensor temp;
+ OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<complex64>::v(),
+ fullShape, &temp));
+ auto full_fft = temp.flat_inner_dims<complex64, FFTRank + 1>();
+
+ // Calculate the starting point and range of the source of
+ // negative frequency part.
+ auto negSizes = input.dimensions();
+ negSizes[FFTRank] = fft_shape[FFTRank - 1] - sizes[FFTRank];
+ Eigen::DSizes<Eigen::DenseIndex, FFTRank + 1> negTargetIndices;
+ negTargetIndices[FFTRank] = sizes[FFTRank];
+
+ Eigen::DSizes<Eigen::DenseIndex, FFTRank + 1> startIndices,
+ negStartIndices;
+ negStartIndices[FFTRank] = 1;
+
+ full_fft.slice(startIndices, sizes) = input.slice(startIndices, sizes);
+
+ // First, conduct FFT on outer dimensions.
+ auto outerAxes = Eigen::ArrayXi::LinSpaced(FFTRank - 1, 1, FFTRank - 1);
+ full_fft = full_fft.template fft<Eigen::BothParts, Eigen::FFT_REVERSE>(
+ outerAxes);
+
+ // Reconstruct the full fft by appending reversed and conjugated
+ // spectrum as the negative frequency part.
+ Eigen::array<bool, FFTRank + 1> reversedAxis;
+ for (auto i = 0; i <= FFTRank; i++) {
+ reversedAxis[i] = i == FFTRank;
+ }
+
+ full_fft.slice(negTargetIndices, negSizes) =
+ full_fft.slice(negStartIndices, negSizes)
+ .reverse(reversedAxis)
+ .conjugate();
+
+ auto innerAxis = Eigen::array<int, 1>{FFTRank};
+ output.device(device) =
+ full_fft.template fft<Eigen::RealPart, Eigen::FFT_REVERSE>(
+ innerAxis);
}
}
}
@@ -194,10 +243,16 @@ REGISTER_KERNEL_BUILDER(Name("IFFT3D").Device(DEVICE_CPU).Label(FFT_LABEL),
REGISTER_KERNEL_BUILDER(Name("RFFT").Device(DEVICE_CPU).Label(FFT_LABEL),
FFTCPU<true, true, 1>);
+REGISTER_KERNEL_BUILDER(Name("IRFFT").Device(DEVICE_CPU).Label(FFT_LABEL),
+ FFTCPU<false, true, 1>);
REGISTER_KERNEL_BUILDER(Name("RFFT2D").Device(DEVICE_CPU).Label(FFT_LABEL),
FFTCPU<true, true, 2>);
+REGISTER_KERNEL_BUILDER(Name("IRFFT2D").Device(DEVICE_CPU).Label(FFT_LABEL),
+ FFTCPU<false, true, 2>);
REGISTER_KERNEL_BUILDER(Name("RFFT3D").Device(DEVICE_CPU).Label(FFT_LABEL),
FFTCPU<true, true, 3>);
+REGISTER_KERNEL_BUILDER(Name("IRFFT3D").Device(DEVICE_CPU).Label(FFT_LABEL),
+ FFTCPU<false, true, 3>);
#undef FFT_LABEL
diff --git a/tensorflow/core/kernels/fill_functor.cc b/tensorflow/core/kernels/fill_functor.cc
index af06e12a5e..8a0a558eef 100644
--- a/tensorflow/core/kernels/fill_functor.cc
+++ b/tensorflow/core/kernels/fill_functor.cc
@@ -56,17 +56,22 @@ DEFINE_SETZERO_CPU(complex128);
template <typename T>
void SetZeroFunctor<Eigen::SyclDevice, T>::operator()(
const Eigen::SyclDevice& d, typename TTypes<T>::Flat out) {
- out.device(d) = out.constant(T(0));
+ To32Bit(out).device(d) = To32Bit(out).constant(T(0));
}
#define DEFINE_SETZERO_SYCL(T) \
template struct SetZeroFunctor<Eigen::SyclDevice, T>;
-DEFINE_SETZERO_SYCL(float);
DEFINE_SETZERO_SYCL(bool);
+DEFINE_SETZERO_SYCL(float);
DEFINE_SETZERO_SYCL(double);
+DEFINE_SETZERO_SYCL(uint8);
+DEFINE_SETZERO_SYCL(int8);
+DEFINE_SETZERO_SYCL(uint16);
+DEFINE_SETZERO_SYCL(int16);
+DEFINE_SETZERO_SYCL(int32);
+DEFINE_SETZERO_SYCL(int64);
#undef DEFINE_SETZERO_SYCL
#endif // TENSORFLOW_USE_SYCL
-
template <typename T>
void SetOneFunctor<Eigen::ThreadPoolDevice, T>::operator()(
const Eigen::ThreadPoolDevice& d, typename TTypes<T>::Flat out) {
diff --git a/tensorflow/core/kernels/filter_dataset_op.cc b/tensorflow/core/kernels/filter_dataset_op.cc
index 7b2c0de97d..62ad921062 100644
--- a/tensorflow/core/kernels/filter_dataset_op.cc
+++ b/tensorflow/core/kernels/filter_dataset_op.cc
@@ -25,7 +25,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class FilterDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/flat_map_dataset_op.cc b/tensorflow/core/kernels/flat_map_dataset_op.cc
index 5b22a922b2..68a6cf1960 100644
--- a/tensorflow/core/kernels/flat_map_dataset_op.cc
+++ b/tensorflow/core/kernels/flat_map_dataset_op.cc
@@ -25,7 +25,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class FlatMapDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/group_by_window_dataset_op.cc b/tensorflow/core/kernels/group_by_window_dataset_op.cc
index cb8d566044..a58c15a097 100644
--- a/tensorflow/core/kernels/group_by_window_dataset_op.cc
+++ b/tensorflow/core/kernels/group_by_window_dataset_op.cc
@@ -27,7 +27,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class GroupByWindowDatasetOp : public OpKernel {
public:
diff --git a/tensorflow/core/kernels/hexagon/graph_transferer.h b/tensorflow/core/kernels/hexagon/graph_transferer.h
index 60b58fd500..fa12b22d75 100644
--- a/tensorflow/core/kernels/hexagon/graph_transferer.h
+++ b/tensorflow/core/kernels/hexagon/graph_transferer.h
@@ -57,7 +57,7 @@ class GraphTransferer {
const GraphDef& graph_def,
const std::vector<std::pair<string, Tensor>>& input_node_info_list,
const std::vector<string>& output_node_names,
- const bool shape_inference_for_unkown_shape);
+ const bool shape_inference_for_unknown_shape);
// Load graph structure into GraphTransferer from protobuf file
// TODO(satok): Pass a pair of TensorShape and DataType instead of
diff --git a/tensorflow/core/kernels/hinge-loss.h b/tensorflow/core/kernels/hinge-loss.h
index 36b02fcc5d..789a7ce7a3 100644
--- a/tensorflow/core/kernels/hinge-loss.h
+++ b/tensorflow/core/kernels/hinge-loss.h
@@ -44,7 +44,7 @@ class HingeLossUpdater : public DualLossUpdater {
const double current_dual, const double wx,
const double weighted_example_norm) const final {
// Intutitvely there are 3 cases:
- // a. new optimal value of the dual variable falls withing the admissible
+ // a. new optimal value of the dual variable falls within the admissible
// range [0, 1]. In this case we set new dual to this value.
// b. new optimal value is < 0. Then, because of convexity, the optimal
// valid value for new dual = 0
diff --git a/tensorflow/core/kernels/image_resizer_state.h b/tensorflow/core/kernels/image_resizer_state.h
index 9ef44a5782..f088315ff5 100644
--- a/tensorflow/core/kernels/image_resizer_state.h
+++ b/tensorflow/core/kernels/image_resizer_state.h
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
-// This is a helper struct to package up the input and ouput
+// This is a helper struct to package up the input and output
// parameters of an image resizer (the height, widths, etc.). To
// reduce code duplication and ensure consistency across the different
// resizers, it performs the input validation.
diff --git a/tensorflow/core/kernels/inplace_ops.cc b/tensorflow/core/kernels/inplace_ops.cc
index 4433b9eea9..67bec7d50e 100644
--- a/tensorflow/core/kernels/inplace_ops.cc
+++ b/tensorflow/core/kernels/inplace_ops.cc
@@ -25,11 +25,14 @@ limitations under the License.
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SyclDevice;
+#endif // TENSORFLOW_USE_SYCL
namespace functor {
-template <typename T>
-Status DoParallelConcatUpdate(const CPUDevice& d, const Tensor& value,
+template <typename Device, typename T>
+Status DoParallelConcatUpdate(const Device& d, const Tensor& value,
int32 loc, Tensor* output) {
auto Tvalue = value.flat_outer_dims<T>();
auto Toutput = output->flat_outer_dims<T>();
@@ -46,7 +49,7 @@ Status DoParallelConcat(const CPUDevice& d, const Tensor& value, int32 loc,
switch (value.dtype()) {
#define CASE(type) \
case DataTypeToEnum<type>::value: \
- return DoParallelConcatUpdate<type>(d, value, loc, output);
+ return DoParallelConcatUpdate<CPUDevice, type>(d, value, loc, output);
TF_CALL_NUMBER_TYPES(CASE);
TF_CALL_string(CASE);
#undef CASE
@@ -55,6 +58,23 @@ Status DoParallelConcat(const CPUDevice& d, const Tensor& value, int32 loc,
}
}
+#ifdef TENSORFLOW_USE_SYCL
+template <>
+Status DoParallelConcat(const SyclDevice& d, const Tensor& value, int32 loc,
+ Tensor* output) {
+ CHECK_EQ(value.dtype(), output->dtype());
+ switch (value.dtype()) {
+#define CASE(type) \
+ case DataTypeToEnum<type>::value: \
+ return DoParallelConcatUpdate<SyclDevice, type>(d, value, loc, output);
+ TF_CALL_GPU_NUMBER_TYPES_NO_HALF(CASE);
+#undef CASE
+ default:
+ return errors::InvalidArgument("Unsupported data type: ", value.dtype());
+ }
+}
+#endif // TENSORFLOW_USE_SYCL
+
} // end namespace functor
namespace {
@@ -152,6 +172,42 @@ TF_CALL_POD_STRING_TYPES(REGISTER_EMPTY)
TF_CALL_POD_STRING_TYPES(REGISTER_PARALLEL_CONCAT);
#undef REGISTER_PARALLEL_CONCAT
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_EMPTY(type) \
+ REGISTER_KERNEL_BUILDER(Name("_ParallelConcatStart") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("dtype"), \
+ ParallelConcatStart<SyclDevice, type>);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_EMPTY)
+#undef REGISTER_EMPTY
+
+#define REGISTER_PARALLEL_CONCAT(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("ParallelConcat").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
+ FailureKernel);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_PARALLEL_CONCAT);
+#undef REGISTER_PARALLEL_CONCAT
+
+#define REGISTER(type) \
+ REGISTER_KERNEL_BUILDER(Name("_ParallelConcatUpdate") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T"), \
+ ParallelConcatUpdate<SyclDevice>);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER)
+#undef REGISTER
+
+// Register versions that operate on int32 data on the CPU even though the op
+// has been placed on the SYCL device.
+
+REGISTER_KERNEL_BUILDER(Name("_ParallelConcatUpdate")
+ .Device(DEVICE_SYCL)
+ .HostMemory("value")
+ .HostMemory("update")
+ .HostMemory("output")
+ .TypeConstraint<int32>("T"),
+ ParallelConcatUpdate<CPUDevice>);
+#endif // TENSORFLOW_USE_SYCL
+
#if GOOGLE_CUDA
typedef Eigen::GpuDevice GPUDevice;
diff --git a/tensorflow/core/kernels/iterator_ops.cc b/tensorflow/core/kernels/iterator_ops.cc
index 0a82ff227e..ed350d9833 100644
--- a/tensorflow/core/kernels/iterator_ops.cc
+++ b/tensorflow/core/kernels/iterator_ops.cc
@@ -27,7 +27,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following ops.
Status VerifyTypesMatch(const DataTypeVector& expected,
diff --git a/tensorflow/core/kernels/lmdb_reader_op.cc b/tensorflow/core/kernels/lmdb_reader_op.cc
new file mode 100755
index 0000000000..23cabe7b54
--- /dev/null
+++ b/tensorflow/core/kernels/lmdb_reader_op.cc
@@ -0,0 +1,134 @@
+/* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "lmdb.h"
+#include "tensorflow/core/framework/reader_op_kernel.h"
+#include "tensorflow/core/framework/reader_base.h"
+#include "tensorflow/core/lib/core/errors.h"
+
+#include <sys/stat.h>
+
+namespace tensorflow {
+
+inline void MDB_CHECK(int mdb_status) {
+ CHECK_EQ(mdb_status, MDB_SUCCESS) << mdb_strerror(mdb_status);
+}
+
+class LMDBReader : public ReaderBase {
+ public:
+ LMDBReader(const string& node_name, Env* env)
+ : ReaderBase(strings::StrCat("LMDBReader '", node_name, "'")),
+ env_(env),
+ mdb_env_(nullptr),
+ mdb_dbi_(0),
+ mdb_txn_(nullptr),
+ mdb_cursor_(nullptr) {}
+
+ Status OnWorkStartedLocked() override {
+ MDB_CHECK(mdb_env_create(&mdb_env_));
+ int flags = MDB_RDONLY | MDB_NOTLS;
+
+ // Check if the LMDB filename is actually a file instead of a directory.
+ // If so, set appropriate flags so we can open it.
+ struct stat source_stat;
+ if (stat(current_work().c_str(), &source_stat) == 0 &&
+ (source_stat.st_mode & S_IFREG)) {
+ flags |= MDB_NOSUBDIR;
+ }
+
+ MDB_CHECK(mdb_env_open(mdb_env_, current_work().c_str(), flags, 0664));
+ MDB_CHECK(mdb_txn_begin(mdb_env_, nullptr, MDB_RDONLY, &mdb_txn_));
+ MDB_CHECK(mdb_dbi_open(mdb_txn_, nullptr, 0, &mdb_dbi_));
+
+ return Status::OK();
+ }
+
+ Status OnWorkFinishedLocked() override {
+ if (mdb_env_ != nullptr) {
+ if (mdb_cursor_) {
+ mdb_cursor_close(mdb_cursor_);
+ }
+ mdb_txn_abort(mdb_txn_);
+ mdb_dbi_close(mdb_env_, mdb_dbi_);
+ mdb_env_close(mdb_env_);
+ mdb_env_ = nullptr;
+ }
+ return Status::OK();
+ }
+
+ Status ReadLocked(string* key, string* value, bool* produced,
+ bool* at_end) override {
+ if (mdb_cursor_ == nullptr) {
+ MDB_CHECK(mdb_cursor_open(mdb_txn_, mdb_dbi_, &mdb_cursor_));
+ if (Seek(MDB_FIRST) == false) {
+ *at_end = true;
+ return Status::OK();
+ }
+ }
+ else {
+ if (Seek(MDB_NEXT) == false) {
+ *at_end = true;
+ return Status::OK();
+ }
+ }
+ *key = string(static_cast<const char*>(mdb_key_.mv_data),
+ mdb_key_.mv_size);
+ *value = string(static_cast<const char*>(mdb_value_.mv_data),
+ mdb_value_.mv_size);
+ *produced = true;
+ return Status::OK();
+ }
+
+ Status ResetLocked() override {
+ CHECK_EQ(Seek(MDB_FIRST), true);
+ return ReaderBase::ResetLocked();
+ }
+
+ private:
+ bool Seek(MDB_cursor_op op) {
+ CHECK_NOTNULL(mdb_cursor_);
+ int mdb_status = mdb_cursor_get(mdb_cursor_, &mdb_key_, &mdb_value_, op);
+ if (mdb_status == MDB_NOTFOUND) {
+ return false;
+ } else {
+ MDB_CHECK(mdb_status);
+ return true;
+ }
+ }
+
+ Env* const env_;
+ MDB_env* mdb_env_;
+ MDB_dbi mdb_dbi_;
+
+ MDB_txn* mdb_txn_;
+ MDB_cursor* mdb_cursor_;
+ MDB_val mdb_key_, mdb_value_;
+};
+
+class LMDBReaderOp : public ReaderOpKernel {
+ public:
+ explicit LMDBReaderOp(OpKernelConstruction* context)
+ : ReaderOpKernel(context) {
+ Env* env = context->env();
+ SetReaderFactory([this, env]() {
+ return new LMDBReader(name(), env);
+ });
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("LMDBReader").Device(DEVICE_CPU),
+ LMDBReaderOp);
+
+}
diff --git a/tensorflow/core/kernels/map_dataset_op.cc b/tensorflow/core/kernels/map_dataset_op.cc
index a4d38bbc08..08308d8557 100644
--- a/tensorflow/core/kernels/map_dataset_op.cc
+++ b/tensorflow/core/kernels/map_dataset_op.cc
@@ -25,7 +25,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class MapDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/map_stage_op.cc b/tensorflow/core/kernels/map_stage_op.cc
new file mode 100644
index 0000000000..832fa8102b
--- /dev/null
+++ b/tensorflow/core/kernels/map_stage_op.cc
@@ -0,0 +1,918 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <map>
+#include <mutex>
+#include <numeric>
+#include <unordered_map>
+#include <vector>
+
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/resource_mgr.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/lib/gtl/optional.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+
+namespace {
+
+// Partial Ordering Comparator for Tensor keys containing scalar int64's
+struct KeyTensorLess {
+ bool operator()(const Tensor & lhs, const Tensor & rhs) const {
+ return std::less<int64>{}(lhs.scalar<int64>()(),
+ rhs.scalar<int64>()());
+ }
+};
+
+// Key Equality operator for Tensor keys containing scalar int64's
+struct KeyTensorEqual {
+ bool operator()(const Tensor & lhs, const Tensor & rhs) const {
+ return std::equal_to<int64>{}(lhs.scalar<int64>()(),
+ rhs.scalar<int64>()());
+ }
+};
+
+// Hash for Tensor keys containing scalar int64's
+struct KeyTensorHash {
+ std::size_t operator()(const Tensor & key) const {
+ return std::hash<int64>{}(key.scalar<int64>()());
+ }
+};
+
+
+// General Template Definition
+template <bool Ordered, typename Data>
+struct MapTraits {};
+
+// Partially specialise for ordered
+template <typename Data>
+struct MapTraits<true, Data>
+{
+ typedef Tensor KeyType;
+ typedef Data DataType;
+ typedef std::map<KeyType, Data, KeyTensorLess> MapType;
+};
+
+// Partially specialise for unordered
+template <typename Data>
+struct MapTraits<false, Data>
+{
+ typedef Tensor KeyType;
+ typedef Data DataType;
+ typedef std::unordered_map<KeyType, Data,
+ KeyTensorHash, KeyTensorEqual> MapType;
+};
+
+// Wrapper around map/unordered_map
+template <bool Ordered>
+class StagingMap : public ResourceBase
+{
+public:
+ // Public typedefs
+ typedef std::vector<Tensor> Tuple;
+ typedef gtl::optional<Tensor> OptionalTensor;
+ typedef std::vector<OptionalTensor> OptionalTuple;
+
+ typedef MapTraits<Ordered, OptionalTuple> MapTraits_;
+ typedef typename MapTraits_::MapType MapType;
+ typedef typename MapTraits_::KeyType KeyType;
+
+ typedef MapTraits<false, OptionalTuple> IncompleteTraits;
+ typedef typename IncompleteTraits::MapType IncompleteType;
+
+private:
+ // Private variables
+ DataTypeVector dtypes_;
+ std::size_t capacity_;
+ std::size_t memory_limit_;
+ std::size_t current_bytes_;
+ std::mutex mu_;
+ std::condition_variable not_empty_;
+ std::condition_variable full_;
+ IncompleteType incomplete_;
+ MapType map_;
+
+private:
+ // private methods
+
+ // If map is configured for bounded capacity, notify
+ // waiting inserters that space is now available
+ void notify_inserters_if_bounded(std::unique_lock<std::mutex>& l)
+ {
+ if(has_capacity() || has_memory_limit())
+ {
+ l.unlock();
+ full_.notify_one();
+ }
+ }
+
+ // Notify any removers waiting to extract values
+ // that data is now available
+ void notify_removers(std::unique_lock<std::mutex>& l)
+ {
+ l.unlock();
+ not_empty_.notify_one();
+ }
+
+ inline bool has_capacity()
+ { return capacity_ > 0; }
+
+ inline bool has_memory_limit()
+ { return memory_limit_ > 0; }
+
+ inline bool would_exceed_memory_limit(std::size_t bytes)
+ { return bytes + current_bytes_ > memory_limit_; }
+
+ inline bool is_capacity_full()
+ { return map_.size() >= capacity_; }
+
+ // Get number of bytes in the tuple
+ inline std::size_t get_tuple_bytes(const Tuple & tuple)
+ {
+ return std::accumulate(tuple.begin(), tuple.end(), 0,
+ [](const std::size_t & lhs, const Tensor & rhs) {
+ return lhs + rhs.TotalBytes();
+ });
+ }
+
+ // Get number of bytes in the incomplete tuple
+ inline std::size_t get_tuple_bytes(const OptionalTuple & tuple)
+ {
+ return std::accumulate(tuple.begin(), tuple.end(), 0,
+ [](const std::size_t & lhs, const OptionalTensor & rhs) {
+ return (lhs + rhs.has_value()) ? rhs.value().TotalBytes() : 0;
+ });
+ }
+
+
+ // Check that the index is within bounds
+ inline Status check_index(const Tensor & key, std::size_t index)
+ {
+ if(index >= dtypes_.size())
+ {
+ return Status(errors::InvalidArgument("Index '",
+ index, "' for key '", key.scalar<int64>()(),
+ "' was out of bounds '", dtypes_.size(), "'."));
+ }
+
+ return Status::OK();
+ }
+
+ inline Status copy_or_move_tensors(OptionalTuple & map_tuple,
+ const Tensor & key,
+ const Tensor & indices,
+ Tuple * output,
+ bool copy=false)
+ {
+ auto findices = indices.flat<int>();
+
+ // Return values at specified indices
+ for(std::size_t i = 0; i < findices.dimension(0); ++i)
+ {
+ std::size_t index = findices(i);
+
+ TF_RETURN_IF_ERROR(check_index(key, index));
+
+ // Insist on a value present at the specified index
+ if(!map_tuple[index].has_value())
+ {
+ return Status(errors::InvalidArgument("Tensor at index '",
+ index, "' for key '", key.scalar<int64>()(),
+ "' has already been removed."));
+ }
+
+ // Copy the contained tensor and
+ // remove from the OptionalTuple
+ output->push_back(map_tuple[index].value());
+
+ // Clear out the entry if we're not copying (moving)
+ if(!copy) {
+ map_tuple[index].reset();
+ }
+ }
+
+ return Status::OK();
+ }
+
+ // Check that the optional value at the specified index
+ // is uninitialized
+ inline Status check_index_uninitialized(const Tensor & key,
+ std::size_t index,
+ const OptionalTuple & tuple)
+ {
+ if(tuple[index].has_value())
+ {
+ return Status(errors::InvalidArgument("The tensor for index '",
+ index, "' for key '", key.scalar<int64>()(),
+ "' was already initialized '", dtypes_.size(), "'."));
+ }
+
+ return Status::OK();
+ }
+
+ // Check that the indices are strictly ordered
+ inline Status check_index_ordering(const Tensor & indices)
+ {
+ auto findices = indices.flat<int>();
+
+ for(std::size_t i = 0; i < findices.dimension(0)-1; ++i)
+ {
+ if(findices(i) < findices(i+1))
+ { continue; }
+
+ return Status(errors::InvalidArgument("Indices are not "
+ "strictly ordered"));
+ }
+
+ return Status::OK();
+ }
+
+ // Check bytes are within memory limits
+ inline Status check_memory_limit(std::size_t bytes)
+ {
+ if(has_memory_limit() && bytes > memory_limit_) {
+ return Status(errors::ResourceExhausted("Attempted to insert "
+ "tensors with combined size of '", bytes, "' bytes into "
+ "Staging Area with a memory limit of '", memory_limit_, "'."));
+ }
+
+ return Status::OK();
+ }
+
+ // Insert incomplete data into the Barrier
+ Status put_incomplete(const KeyType & key,
+ const Tensor & indices,
+ OptionalTuple * tuple,
+ std::unique_lock<std::mutex>& l)
+ {
+ auto findices = indices.flat<int>();
+
+ // Search for the key in our incomplete set
+ auto it = incomplete_.find(key);
+
+ // Check that the tuple fits within the memory limit
+ std::size_t tuple_bytes = get_tuple_bytes(*tuple);
+ TF_RETURN_IF_ERROR(check_memory_limit(tuple_bytes));
+
+ if(has_memory_limit())
+ {
+ full_.wait(l, [tuple_bytes, this]() {
+ // Stop waiting if we don't exceed the memory limit
+ return !would_exceed_memory_limit(tuple_bytes);
+ });
+ }
+
+ // This key isn't present in the incomplete set
+ // Create OptionalTuple and insert
+ if(it == incomplete_.end())
+ {
+ OptionalTuple empty(dtypes_.size());
+
+ // Initialize empty tuple with given data
+ for(std::size_t i = 0; i < findices.dimension(0); ++i)
+ {
+ std::size_t index = findices(i);
+ TF_RETURN_IF_ERROR(check_index(key, index));
+
+ // Assign tuple at this index
+ empty[index] = std::move((*tuple)[i]);
+ }
+
+ // Insert into incomplete map
+ incomplete_.insert({key, std::move(empty)});
+
+ // Increment size
+ current_bytes_ += tuple_bytes;
+ }
+ // Found an entry in the incomplete index
+ // Update with given data and insert complete entries
+ // into the main map
+ else
+ {
+ // Reference existing incomplete tuple
+ OptionalTuple & present = it->second;
+
+ // Assign given data
+ for(std::size_t i = 0; i < findices.dimension(0); ++i)
+ {
+ std::size_t index = findices(i);
+ TF_RETURN_IF_ERROR(check_index(key, index));
+ TF_RETURN_IF_ERROR(check_index_uninitialized(key,
+ index, present));
+
+ // Assign tuple at this index
+ present[index] = std::move((*tuple)[i]);
+ }
+
+ // Increment size
+ current_bytes_ += tuple_bytes;
+
+ // Do we have values at all tuple elements?
+ bool complete = std::all_of(present.begin(), present.end(),
+ [](const OptionalTensor & v) { return v.has_value(); });
+
+ // If so, put the tuple in the actual map
+ if(complete)
+ {
+ OptionalTuple insert_tuple = std::move(it->second);
+
+ // Remove from incomplete
+ incomplete_.erase(it);
+
+ TF_RETURN_IF_ERROR(put_complete(key, &insert_tuple, l));
+ }
+ }
+
+ return Status::OK();
+ }
+
+ // Does the insertion into the actual staging area
+ Status put_complete(const KeyType & key, OptionalTuple * tuple,
+ std::unique_lock<std::mutex> & l)
+ {
+ // Insert key and tuples into the map
+ map_.insert({key, std::move(*tuple)});
+
+ notify_removers(l);
+
+ return Status::OK();
+ }
+
+public:
+ // public methods
+ explicit StagingMap(const DataTypeVector & dtypes,
+ std::size_t capacity, std::size_t memory_limit) :
+ dtypes_(dtypes),
+ capacity_(capacity),
+ memory_limit_(memory_limit),
+ current_bytes_(0) {}
+
+ Status put(KeyType* key, const Tensor * indices,
+ OptionalTuple* tuple)
+ {
+ std::unique_lock<std::mutex> l(mu_);
+
+ // Sanity check the indices
+ TF_RETURN_IF_ERROR(check_index_ordering(*indices));
+
+ // Handle incomplete inserts
+ if(indices->NumElements() != dtypes_.size())
+ {
+ return put_incomplete(*key, *indices, tuple, l);
+ }
+
+ std::size_t tuple_bytes = get_tuple_bytes(*tuple);
+ // Check that tuple_bytes fits within the memory limit
+ TF_RETURN_IF_ERROR(check_memory_limit(tuple_bytes));
+
+ // If map capacity is bounded wait until map is not full
+ if(has_capacity() || has_memory_limit()) {
+ full_.wait(l, [tuple_bytes, this]() {
+ // If there's a memory limit, check if there's space for insertion
+ bool memory_limit_valid = has_memory_limit() ?
+ !would_exceed_memory_limit(tuple_bytes) : true;
+ // If we're configured for capacity check if there's space for insertion
+ bool capacity_valid = has_capacity() ? !is_capacity_full() : true;
+
+ // Stop waiting upon success for both conditions
+ return memory_limit_valid && capacity_valid;
+ });
+ }
+
+ // Do the put operation
+ TF_RETURN_IF_ERROR(put_complete(*key, tuple, l));
+
+ // Update the current size
+ current_bytes_ += tuple_bytes;
+
+ return Status::OK();
+ }
+
+ Status get(const KeyType* key, const Tensor * indices,
+ Tuple* tuple)
+ {
+ std::unique_lock<std::mutex> l(mu_);
+
+ // Sanity check the indices
+ TF_RETURN_IF_ERROR(check_index_ordering(*indices));
+
+ typename MapType::iterator it;
+
+ // Wait until the element with the requested key is present
+ not_empty_.wait(l, [&, this]() {
+ it = map_.find(*key);
+ return it != map_.end();
+ });
+
+ TF_RETURN_IF_ERROR(copy_or_move_tensors(it->second, *key,
+ *indices, tuple,
+ true));
+
+ // Update bytes in the Staging Area
+ current_bytes_ -= get_tuple_bytes(*tuple);
+
+ return Status::OK();
+ }
+
+ Status pop(const KeyType* key, const Tensor * indices, Tuple* tuple)
+ {
+ std::unique_lock<std::mutex> l(mu_);
+
+ // Sanity check the indices
+ TF_RETURN_IF_ERROR(check_index_ordering(*indices));
+
+ typename MapType::iterator it;
+
+ // Wait until the element with the requested key is present
+ not_empty_.wait(l, [&, this]() {
+ it = map_.find(*key);
+ return it != this->map_.end();
+ });
+
+ TF_RETURN_IF_ERROR(copy_or_move_tensors(it->second, *key,
+ *indices, tuple));
+
+ // Remove entry if all the values have been consumed
+ bool any_left = std::any_of(it->second.begin(), it->second.end(),
+ [](const OptionalTensor & T) { return T.has_value(); });
+
+ if(!any_left) {
+ map_.erase(it);
+ }
+
+ // Update bytes in the Staging Area
+ current_bytes_ -= get_tuple_bytes(*tuple);
+
+ notify_inserters_if_bounded(l);
+
+ return Status::OK();
+ }
+
+ Status popitem(KeyType* key, const Tensor * indices, Tuple* tuple)
+ {
+ std::unique_lock<std::mutex> l(mu_);
+
+ // Sanity check the indices
+ TF_RETURN_IF_ERROR(check_index_ordering(*indices));
+
+ // Wait until map is not empty
+ not_empty_.wait(l, [this]() { return !this->map_.empty(); });
+
+ // Move from the first element and erase it
+
+ auto it = map_.begin();
+
+ TF_RETURN_IF_ERROR(copy_or_move_tensors(it->second, *key,
+ *indices, tuple));
+
+ *key = it->first;
+
+ // Remove entry if all the values have been consumed
+ bool any_left = std::any_of(it->second.begin(), it->second.end(),
+ [](const OptionalTensor & T) { return T.has_value(); });
+
+ if(!any_left) {
+ map_.erase(it);
+ }
+
+ // Update bytes in the Staging Area
+ current_bytes_ -= get_tuple_bytes(*tuple);
+
+ notify_inserters_if_bounded(l);
+
+ return Status::OK();
+ }
+
+ Status clear()
+ {
+ std::unique_lock<std::mutex> l(mu_);
+ map_.clear();
+ incomplete_.clear();
+ current_bytes_ = 0;
+
+ notify_inserters_if_bounded(l);
+
+ return Status::OK();
+ }
+
+ size_t incomplete_size()
+ {
+ std::unique_lock<std::mutex> l(mu_);
+ return incomplete_.size();
+ }
+
+ size_t size()
+ {
+ // Lock the map and return the size
+ std::unique_lock<std::mutex> l(mu_);
+ return map_.size();
+ }
+
+ string DebugString()
+ {
+ return "StagingMap";
+ }
+};
+
+template <bool Ordered>
+Status GetStagingMap(OpKernelContext* ctx,
+ const NodeDef& ndef,
+ StagingMap<Ordered>** map)
+{
+ auto rm = ctx->resource_manager();
+ ContainerInfo cinfo;
+
+ // Lambda for creating the Staging Area
+ auto create_fn = [&ndef](StagingMap<Ordered>** ret) -> Status
+ {
+ DataTypeVector dtypes;
+ int64 capacity;
+ int64 memory_limit;
+ TF_RETURN_IF_ERROR(GetNodeAttr(ndef, "dtypes", &dtypes));
+ TF_RETURN_IF_ERROR(GetNodeAttr(ndef, "capacity", &capacity));
+ TF_RETURN_IF_ERROR(GetNodeAttr(ndef, "memory_limit", &memory_limit));
+ *ret = new StagingMap<Ordered>(dtypes, capacity, memory_limit);
+ return Status::OK();
+ };
+
+ TF_RETURN_IF_ERROR(cinfo.Init(rm, ndef, true /* use name() */));
+ TF_RETURN_IF_ERROR(rm->LookupOrCreate<StagingMap<Ordered>>(
+ cinfo.container(), cinfo.name(),
+ map, create_fn));
+ return Status::OK();
+}
+
+template <bool Ordered>
+class MapStageOp : public OpKernel
+{
+ public:
+ explicit MapStageOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ void Compute(OpKernelContext* ctx) override {
+ StagingMap<Ordered>* map = nullptr;
+ OP_REQUIRES_OK(ctx, GetStagingMap(ctx, def(), &map));
+ core::ScopedUnref scope(map);
+ typename StagingMap<Ordered>::OptionalTuple tuple;
+
+ const Tensor * key_tensor;
+ const Tensor * indices_tensor;
+ OpInputList values_tensor;
+
+ OP_REQUIRES_OK(ctx, ctx->input("key", &key_tensor));
+ OP_REQUIRES_OK(ctx, ctx->input("indices", &indices_tensor));
+ OP_REQUIRES_OK(ctx, ctx->input_list("values", &values_tensor));
+
+ // Create copy for insertion into Staging Area
+ Tensor key(*key_tensor);
+
+ // Create the tuple to store
+ for (std::size_t i = 0; i < values_tensor.size(); ++i) {
+ tuple.push_back(values_tensor[i]);
+ }
+
+ // Store the tuple in the map
+ OP_REQUIRES_OK(ctx, map->put(&key, indices_tensor, &tuple));
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MapStage").Device(DEVICE_CPU),
+ MapStageOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapStage").Device(DEVICE_CPU),
+ MapStageOp<true>);
+
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("MapStage")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapStageOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapStage")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapStageOp<true>);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("MapStage").HostMemory("key")
+ .Device(DEVICE_SYCL), MapStageOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapStage").HostMemory("key")
+ .Device(DEVICE_SYCL), MapStageOp<true>);
+
+#endif // TENSORFLOW_USE_SYCL
+
+template <bool Ordered>
+class MapUnstageOp : public OpKernel
+{
+ public:
+ explicit MapUnstageOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ // Using this op in such a way that it blocks forever
+ // is an error. As such cancellation is not handled.
+ void Compute(OpKernelContext* ctx) override {
+ StagingMap<Ordered>* map = nullptr;
+ OP_REQUIRES_OK(ctx, GetStagingMap(ctx, def(), &map));
+ core::ScopedUnref scope(map);
+ typename StagingMap<Ordered>::Tuple tuple;
+
+ const Tensor * key_tensor;
+ const Tensor * indices_tensor;
+ OpInputList values_tensor;
+
+ OP_REQUIRES_OK(ctx, ctx->input("key", &key_tensor));
+ OP_REQUIRES_OK(ctx, ctx->input("indices", &indices_tensor));
+ OP_REQUIRES_OK(ctx, map->pop(key_tensor, indices_tensor, &tuple));
+
+ OP_REQUIRES(ctx,
+ tuple.size() == indices_tensor->NumElements(),
+ errors::InvalidArgument("output/indices size mismatch: ", tuple.size(),
+ " vs. ", indices_tensor->NumElements()));
+
+ for (size_t i = 0; i < tuple.size(); ++i) {
+ ctx->set_output(i, tuple[i]);
+ }
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MapUnstage").Device(DEVICE_CPU),
+ MapUnstageOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapUnstage").Device(DEVICE_CPU),
+ MapUnstageOp<true>);
+
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("MapUnstage")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapUnstageOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapUnstage")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapUnstageOp<true>);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("MapUnstage")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_SYCL), MapUnstageOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapUnstage")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_SYCL), MapUnstageOp<true>);
+#endif // TENSORFLOW_USE_SYCL
+
+template <bool Ordered>
+class MapPeekOp : public OpKernel
+{
+ public:
+ explicit MapPeekOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ // Using this op in such a way that it blocks forever
+ // is an error. As such cancellation is not handled.
+ void Compute(OpKernelContext* ctx) override {
+ StagingMap<Ordered>* map = nullptr;
+ OP_REQUIRES_OK(ctx, GetStagingMap(ctx, def(), &map));
+ core::ScopedUnref scope(map);
+ typename StagingMap<Ordered>::Tuple tuple;
+
+ const Tensor * key_tensor;
+ const Tensor * indices_tensor;
+ OpInputList values_tensor;
+
+ OP_REQUIRES_OK(ctx, ctx->input("key", &key_tensor));
+ OP_REQUIRES_OK(ctx, ctx->input("indices", &indices_tensor));
+ OP_REQUIRES_OK(ctx, map->get(key_tensor, indices_tensor, &tuple));
+
+ OP_REQUIRES(ctx,
+ tuple.size() == indices_tensor->NumElements(),
+ errors::InvalidArgument("output/indices size mismatch: ", tuple.size(),
+ " vs. ", indices_tensor->NumElements()));
+
+ for (size_t i = 0; i < tuple.size(); ++i) {
+ ctx->set_output(i, tuple[i]);
+ }
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MapPeek").Device(DEVICE_CPU),
+ MapPeekOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapPeek").Device(DEVICE_CPU),
+ MapPeekOp<true>);
+
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("MapPeek")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapPeekOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapPeek")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapPeekOp<true>);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("MapPeek")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_SYCL), MapPeekOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapPeek")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_SYCL), MapPeekOp<true>);
+#endif // TENSORFLOW_USE_SYCL
+
+
+
+template <bool Ordered>
+class MapUnstageNoKeyOp : public OpKernel
+{
+ public:
+ explicit MapUnstageNoKeyOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ // Using this op in such a way that it blocks forever
+ // is an error. As such cancellation is not handled.
+ void Compute(OpKernelContext* ctx) override {
+ StagingMap<Ordered>* map = nullptr;
+ OP_REQUIRES_OK(ctx, GetStagingMap(ctx, def(), &map));
+ core::ScopedUnref scope(map);
+
+ // Pop a random (key, value) off the map
+ typename StagingMap<Ordered>::KeyType key;
+ typename StagingMap<Ordered>::Tuple tuple;
+
+ const Tensor * indices_tensor;
+
+ OP_REQUIRES_OK(ctx, ctx->input("indices", &indices_tensor));
+ OP_REQUIRES_OK(ctx, map->popitem(&key, indices_tensor, &tuple));
+
+ // Allocate a key tensor and assign the key as the first output
+ ctx->set_output(0, key);
+
+ // Set the rest of the outputs to the tuple Tensors
+ OP_REQUIRES(ctx,
+ tuple.size() == indices_tensor->NumElements(),
+ errors::InvalidArgument("output/indices size mismatch: ", tuple.size(),
+ " vs. ", indices_tensor->NumElements()));
+
+ for (size_t i = 0; i < tuple.size(); ++i) {
+ ctx->set_output(i+1, tuple[i]);
+ }
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MapUnstageNoKey").Device(DEVICE_CPU),
+ MapUnstageNoKeyOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapUnstageNoKey").Device(DEVICE_CPU),
+ MapUnstageNoKeyOp<true>);
+
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("MapUnstageNoKey")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapUnstageNoKeyOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapUnstageNoKey")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_GPU), MapUnstageNoKeyOp<true>);
+
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("MapUnstageNoKey")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_SYCL), MapUnstageNoKeyOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapUnstageNoKey")
+ .HostMemory("key")
+ .HostMemory("indices")
+ .Device(DEVICE_SYCL), MapUnstageNoKeyOp<true>);
+#endif // TENSORFLOW_USE_SYCL
+
+
+template <bool Ordered>
+class MapSizeOp : public OpKernel
+{
+ public:
+ explicit MapSizeOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ void Compute(OpKernelContext* ctx) override
+ {
+ StagingMap<Ordered>* map = nullptr;
+ OP_REQUIRES_OK(ctx, GetStagingMap(ctx, def(), &map));
+ core::ScopedUnref scope(map);
+
+ // Allocate size output tensor
+ Tensor * size = nullptr;
+ OP_REQUIRES_OK(ctx, ctx->allocate_output(0, TensorShape({}),
+ &size));
+
+ // Set it to the actual size
+ size->scalar<int32>().setConstant(map->size());
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MapSize").Device(DEVICE_CPU),
+ MapSizeOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapSize").Device(DEVICE_CPU),
+ MapSizeOp<true>);
+
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("MapSize").Device(DEVICE_GPU)
+ .HostMemory("size"), MapSizeOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapSize").Device(DEVICE_GPU)
+ .HostMemory("size"), MapSizeOp<true>);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("MapSize").Device(DEVICE_SYCL)
+ .HostMemory("size"), MapSizeOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapSize").Device(DEVICE_SYCL)
+ .HostMemory("size"), MapSizeOp<true>);
+#endif // TENSORFLOW_USE_SYCL
+
+template <bool Ordered>
+class MapIncompleteSizeOp : public OpKernel
+{
+ public:
+ explicit MapIncompleteSizeOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ void Compute(OpKernelContext* ctx) override
+ {
+ StagingMap<Ordered>* map = nullptr;
+ OP_REQUIRES_OK(ctx, GetStagingMap(ctx, def(), &map));
+ core::ScopedUnref scope(map);
+
+ // Allocate size output tensor
+ Tensor * size = nullptr;
+ OP_REQUIRES_OK(ctx, ctx->allocate_output(0, TensorShape({}),
+ &size));
+
+ // Set it to the actual size
+ size->scalar<int32>().setConstant(map->incomplete_size());
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MapIncompleteSize").Device(DEVICE_CPU),
+ MapIncompleteSizeOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapIncompleteSize").Device(DEVICE_CPU),
+ MapIncompleteSizeOp<true>);
+
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("MapIncompleteSize").Device(DEVICE_GPU)
+ .HostMemory("size"), MapIncompleteSizeOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapIncompleteSize").Device(DEVICE_GPU)
+ .HostMemory("size"), MapIncompleteSizeOp<true>);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("MapIncompleteSize").Device(DEVICE_SYCL)
+ .HostMemory("size"), MapIncompleteSizeOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapIncompleteSize").Device(DEVICE_SYCL)
+ .HostMemory("size"), MapIncompleteSizeOp<true>);
+#endif // TENSORFLOW_USE_SYCL
+
+template <bool Ordered>
+class MapClearOp : public OpKernel
+{
+ public:
+ explicit MapClearOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ void Compute(OpKernelContext* ctx) override
+ {
+ StagingMap<Ordered>* map = nullptr;
+ OP_REQUIRES_OK(ctx, GetStagingMap(ctx, def(), &map));
+ core::ScopedUnref scope(map);
+
+ OP_REQUIRES_OK(ctx, map->clear());
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MapClear").Device(DEVICE_CPU),
+ MapClearOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapClear").Device(DEVICE_CPU),
+ MapClearOp<true>);
+
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("MapClear").Device(DEVICE_GPU),
+ MapClearOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapClear").Device(DEVICE_GPU),
+ MapClearOp<true>);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("MapClear").Device(DEVICE_SYCL),
+ MapClearOp<false>);
+REGISTER_KERNEL_BUILDER(Name("OrderedMapClear").Device(DEVICE_SYCL),
+ MapClearOp<true>);
+#endif // TENSORFLOW_USE_SYCL
+
+} // namespace
+
+} // namespace tensorflow
diff --git a/tensorflow/core/kernels/matmul_op.cc b/tensorflow/core/kernels/matmul_op.cc
index 199e160592..8003f7ff67 100644
--- a/tensorflow/core/kernels/matmul_op.cc
+++ b/tensorflow/core/kernels/matmul_op.cc
@@ -395,6 +395,7 @@ TF_CALL_half(REGISTER_GPU);
.Label("eigen"), \
MatMulOp<SYCLDevice, T, false /* xxblas */>)
TF_CALL_float(REGISTER_SYCL);
+TF_CALL_double(REGISTER_SYCL);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/meta_support.h b/tensorflow/core/kernels/meta_support.h
index 0d87baf034..53aece78e8 100644
--- a/tensorflow/core/kernels/meta_support.h
+++ b/tensorflow/core/kernels/meta_support.h
@@ -64,7 +64,7 @@ bool IsSupportedAndEnabled();
// sum((a_data[i, l] + offset_a) * (b_data[l, j] + offset_b)) : l in [0, k)
//
// If transpose_a is false the lhs operand has row major layout, otherwise
-// column major. Similarily transpose_b describes the layout of the rhs operand.
+// column major. Similarly transpose_b describes the layout of the rhs operand.
// lda, ldb, and ldc are the strides of the lhs operand, rhs operand and the
// result arrays.
void QuantizedGemm(OpKernelContext* context, bool transpose_a, bool transpose_b,
diff --git a/tensorflow/core/kernels/mfcc_test.cc b/tensorflow/core/kernels/mfcc_test.cc
index 07b94e2e6c..cb32df8811 100644
--- a/tensorflow/core/kernels/mfcc_test.cc
+++ b/tensorflow/core/kernels/mfcc_test.cc
@@ -65,7 +65,7 @@ TEST(MfccTest, AvoidsNansWithZeroInput) {
int expected_size = 13;
ASSERT_EQ(expected_size, output.size());
for (const double value : output) {
- EXPECT_FALSE(isnan(value));
+ EXPECT_FALSE(std::isnan(value));
}
}
diff --git a/tensorflow/core/kernels/mirror_pad_op.cc b/tensorflow/core/kernels/mirror_pad_op.cc
index f694198b6a..e3643f9447 100644
--- a/tensorflow/core/kernels/mirror_pad_op.cc
+++ b/tensorflow/core/kernels/mirror_pad_op.cc
@@ -85,7 +85,7 @@ class MirrorPadOp : public OpKernel {
TTypes<int32>::ConstMatrix paddings = in1.matrix<int32>();
for (int d = 0; d < dims; ++d) {
const int32 before = paddings(d, 0); // Pad before existing elements.
- const int32 after = paddings(d, 1); // Pad after exisitng elements.
+ const int32 after = paddings(d, 1); // Pad after existing elements.
OP_REQUIRES(context, before >= 0 && after >= 0,
errors::InvalidArgument("paddings must be non-negative: ",
before, " ", after));
@@ -272,7 +272,7 @@ class MirrorPadGradOp : public OpKernel {
TTypes<int32>::ConstMatrix paddings = in1.matrix<int32>();
for (int d = 0; d < dims; ++d) {
const int32 before = paddings(d, 0); // Pad before existing elements.
- const int32 after = paddings(d, 1); // Pad after exisitng elements.
+ const int32 after = paddings(d, 1); // Pad after existing elements.
OP_REQUIRES(context, before >= 0 && after >= 0,
errors::InvalidArgument("Paddings must be non-negative: ",
before, ", ", after));
diff --git a/tensorflow/core/kernels/mkl_avgpooling_op.cc b/tensorflow/core/kernels/mkl_avgpooling_op.cc
index 8bd1724e32..d90baee069 100644
--- a/tensorflow/core/kernels/mkl_avgpooling_op.cc
+++ b/tensorflow/core/kernels/mkl_avgpooling_op.cc
@@ -343,11 +343,10 @@ class MklAvgPoolingGradOp : public OpKernel {
if (!outbackprop_in_mkl_format) {
// For avgpooling, tensor_in_shape should have 1 dimension, and 4
// elements.
- OP_REQUIRES(
- context,
- tensor_in_shape.dims() == 1 && tensor_in_shape.NumElements() == 4,
- errors::InvalidArgument("original input shape must be "
- "1-dimensional and 4 elements"));
+ OP_REQUIRES(context, tensor_in_shape.dims() == 1 &&
+ tensor_in_shape.NumElements() == 4,
+ errors::InvalidArgument("original input shape must be "
+ "1-dimensional and 4 elements"));
// For avgpooling, out_backprop should have 4 dimensions.
OP_REQUIRES(context, out_backprop.dims() == 4,
diff --git a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
index 1831448df6..d4364d31e4 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
@@ -38,9 +38,9 @@ limitations under the License.
#include "tensorflow/core/util/use_cudnn.h"
#include "tensorflow/core/util/work_sharder.h"
+#include "tensorflow/core/util/mkl_util.h"
#include "third_party/mkl/include/mkl_dnn.h"
#include "third_party/mkl/include/mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
namespace tensorflow {
diff --git a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
index b60bcf70e7..dc6b88e953 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_filter_ops.cc
@@ -37,9 +37,9 @@ limitations under the License.
#include "tensorflow/core/util/use_cudnn.h"
#include "tensorflow/core/util/work_sharder.h"
+#include "tensorflow/core/util/mkl_util.h"
#include "third_party/mkl/include/mkl_dnn.h"
#include "third_party/mkl/include/mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
namespace tensorflow {
diff --git a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
index 5613b674e5..23827ceea5 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_input_ops.cc
@@ -23,8 +23,6 @@ limitations under the License.
#define EIGEN_USE_THREADS
#include <algorithm>
#include <vector>
-#include "third_party/mkl/include/mkl_dnn.h"
-#include "third_party/mkl/include/mkl_dnn_types.h"
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
@@ -42,6 +40,8 @@ limitations under the License.
#include "tensorflow/core/util/tensor_format.h"
#include "tensorflow/core/util/use_cudnn.h"
#include "tensorflow/core/util/work_sharder.h"
+#include "third_party/mkl/include/mkl_dnn.h"
+#include "third_party/mkl/include/mkl_dnn_types.h"
namespace tensorflow {
@@ -295,7 +295,7 @@ class MklConv2DCustomBackpropInputOp : public OpKernel {
dnnDelete_F32(mkl_convert_filter);
} else {
// If we do not need any layout conversion for filter, then
- // we direclty assign input filter to resources[].
+ // we directly assign input filter to resources[].
conv_res[dnnResourceFilter] =
static_cast<void*>(const_cast<T*>(filter.flat<T>().data()));
}
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index 0620e3fc91..76b9f1798d 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -36,9 +36,9 @@ limitations under the License.
#include "tensorflow/core/util/padding.h"
#include "tensorflow/core/util/tensor_format.h"
+#include "tensorflow/core/util/mkl_util.h"
#include "third_party/mkl/include/mkl_dnn.h"
#include "third_party/mkl/include/mkl_dnn_types.h"
-#include "tensorflow/core/util/mkl_util.h"
namespace tensorflow {
@@ -98,19 +98,18 @@ class MklConv2DOp : public OpKernel {
filter.shape().DebugString()));
for (int i = 0; i < 3; i++) {
- OP_REQUIRES(
- context,
- FastBoundsCheck(filter.dim_size(i), std::numeric_limits<int>::max()),
- errors::InvalidArgument("filter too large"));
+ OP_REQUIRES(context, FastBoundsCheck(filter.dim_size(i),
+ std::numeric_limits<int>::max()),
+ errors::InvalidArgument("filter too large"));
}
const int64 input_depth =
input_in_mkl_format ? GetMklTensorDim(mkl_context.input_shape, 'C')
: GetTensorDim(input, data_format_, 'C');
- OP_REQUIRES(context, input_depth == filter.dim_size(2),
- errors::InvalidArgument(
- "input and filter must have the same depth: ", input_depth,
- " vs ", filter.dim_size(2)));
+ OP_REQUIRES(
+ context, input_depth == filter.dim_size(2),
+ errors::InvalidArgument("input and filter must have the same depth: ",
+ input_depth, " vs ", filter.dim_size(2)));
// The last dimension for filter is out_depth.
const int out_depth = static_cast<int>(filter.dim_size(3));
@@ -119,10 +118,9 @@ class MklConv2DOp : public OpKernel {
const int64 input_rows_raw =
input_in_mkl_format ? GetMklTensorDim(mkl_context.input_shape, 'H')
: GetTensorDim(input, data_format_, 'H');
- OP_REQUIRES(
- context,
- FastBoundsCheck(input_rows_raw, std::numeric_limits<int>::max()),
- errors::InvalidArgument("Input rows too large"));
+ OP_REQUIRES(context, FastBoundsCheck(input_rows_raw,
+ std::numeric_limits<int>::max()),
+ errors::InvalidArgument("Input rows too large"));
const int input_rows = static_cast<int>(input_rows_raw);
const int filter_rows = static_cast<int>(filter.dim_size(0));
@@ -131,10 +129,9 @@ class MklConv2DOp : public OpKernel {
const int64 input_cols_raw =
input_in_mkl_format ? GetMklTensorDim(mkl_context.input_shape, 'W')
: GetTensorDim(input, data_format_, 'W');
- OP_REQUIRES(
- context,
- FastBoundsCheck(input_cols_raw, std::numeric_limits<int>::max()),
- errors::InvalidArgument("Input cols too large"));
+ OP_REQUIRES(context, FastBoundsCheck(input_cols_raw,
+ std::numeric_limits<int>::max()),
+ errors::InvalidArgument("Input cols too large"));
const int input_cols = static_cast<int>(input_cols_raw);
const int filter_cols = static_cast<int>(filter.dim_size(1));
@@ -142,10 +139,9 @@ class MklConv2DOp : public OpKernel {
const int64 input_batch_raw =
input_in_mkl_format ? GetMklTensorDim(mkl_context.input_shape, 'N')
: GetTensorDim(input, data_format_, 'N');
- OP_REQUIRES(
- context,
- FastBoundsCheck(input_batch_raw, std::numeric_limits<int>::max()),
- errors::InvalidArgument("batch is too large"));
+ OP_REQUIRES(context, FastBoundsCheck(input_batch_raw,
+ std::numeric_limits<int>::max()),
+ errors::InvalidArgument("batch is too large"));
const int batch = static_cast<int>(input_batch_raw);
// For now we take the stride from the second and third dimensions only (we
diff --git a/tensorflow/core/kernels/mkl_lrn_op.cc b/tensorflow/core/kernels/mkl_lrn_op.cc
index 8f6c12d0d1..070aeff49f 100644
--- a/tensorflow/core/kernels/mkl_lrn_op.cc
+++ b/tensorflow/core/kernels/mkl_lrn_op.cc
@@ -709,7 +709,7 @@ class MklLRNGradOp : public OpKernel {
Shard(worker_threads.num_threads, worker_threads.workers, nodes * batch,
depth * depth, shard);
}
-
+
// release mkl resources
void Mklcleanup() {
bool ingrad_in_mkl_format = ingrad_shape.IsMklTensor();
diff --git a/tensorflow/core/kernels/mkl_maxpooling_op.cc b/tensorflow/core/kernels/mkl_maxpooling_op.cc
index 44e1b93dba..846bb5710d 100644
--- a/tensorflow/core/kernels/mkl_maxpooling_op.cc
+++ b/tensorflow/core/kernels/mkl_maxpooling_op.cc
@@ -311,7 +311,7 @@ class MklMaxPoolingGradOp : public OpKernel {
lt_input_user = (dnnLayout_t)input_shape.GetCurLayout();
}
- // We dont care about the output layout for now as we can create it from
+ // We don't care about the output layout for now as we can create it from
// primitives for the max pooling fwd prop
if (outbackprop_in_mkl_format == false) {
CHECK_EQ(dnnLayoutCreate_F32(&lt_outbackprop_user, params.in_dim,
@@ -382,19 +382,18 @@ class MklMaxPoolingGradOp : public OpKernel {
if (workspace_enabled == false) {
if (convert_input != nullptr) {
if (input_in_mkl_format == false) {
- CHECK_EQ(dnnConversionExecute_F32(
- convert_input,
- const_cast<void*>(static_cast<const void*>(
- tensor_in.flat<T>().data())),
- input_buf),
- E_SUCCESS);
+ CHECK_EQ(
+ dnnConversionExecute_F32(
+ convert_input, const_cast<void*>(static_cast<const void*>(
+ tensor_in.flat<T>().data())),
+ input_buf),
+ E_SUCCESS);
CHECK_EQ(dnnDelete_F32(convert_input), E_SUCCESS);
convert_input = nullptr;
} else {
input_shape.GetConvertedFlatData(
- lt_input_prim,
- const_cast<void*>(
- static_cast<const void*>(tensor_in.flat<T>().data())),
+ lt_input_prim, const_cast<void*>(static_cast<const void*>(
+ tensor_in.flat<T>().data())),
input_buf);
}
pooling_resfwd[dnnResourceSrc] = input_buf;
@@ -439,9 +438,8 @@ class MklMaxPoolingGradOp : public OpKernel {
CHECK_EQ(dnnDelete_F32(convert_outbackprop), E_SUCCESS);
} else {
output_backprop_shape.GetConvertedFlatData(
- lt_outbackprop_prim,
- const_cast<void*>(
- static_cast<const void*>(out_backprop.flat<T>().data())),
+ lt_outbackprop_prim, const_cast<void*>(static_cast<const void*>(
+ out_backprop.flat<T>().data())),
outbackprop_buf);
}
pooling_res[dnnResourceDiffDst] = outbackprop_buf;
diff --git a/tensorflow/core/kernels/mkl_pooling_ops_common.cc b/tensorflow/core/kernels/mkl_pooling_ops_common.cc
index d88bd4c640..65e8852cfb 100644
--- a/tensorflow/core/kernels/mkl_pooling_ops_common.cc
+++ b/tensorflow/core/kernels/mkl_pooling_ops_common.cc
@@ -14,8 +14,8 @@ limitations under the License.
==============================================================================*/
#ifdef INTEL_MKL
-#include "tensorflow/core/kernels/mkl_pooling_ops_common.h"
#include <vector>
+#include "tensorflow/core/kernels/mkl_pooling_ops_common.h"
#include "tensorflow/core/common_runtime/device.h"
#include "tensorflow/core/framework/common_shape_fns.h"
diff --git a/tensorflow/core/kernels/mkl_relu_op.cc b/tensorflow/core/kernels/mkl_relu_op.cc
index f77bdf665e..10d2937584 100644
--- a/tensorflow/core/kernels/mkl_relu_op.cc
+++ b/tensorflow/core/kernels/mkl_relu_op.cc
@@ -16,17 +16,17 @@ limitations under the License.
// See docs in ../ops/nn_ops.cc.
#ifdef INTEL_MKL
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/lib/core/errors.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-#include "third_party/mkl/include/mkl_dnn.h"
-#include "third_party/mkl/include/mkl_dnn_types.h"
#include "tensorflow/core/platform/default/logging.h"
#include "tensorflow/core/util/mkl_util.h"
+#include "third_party/mkl/include/mkl_dnn.h"
+#include "third_party/mkl/include/mkl_dnn_types.h"
namespace tensorflow {
diff --git a/tensorflow/core/kernels/non_max_suppression_op_test.cc b/tensorflow/core/kernels/non_max_suppression_op_test.cc
index fdbcf05b89..e0e8c87f95 100644
--- a/tensorflow/core/kernels/non_max_suppression_op_test.cc
+++ b/tensorflow/core/kernels/non_max_suppression_op_test.cc
@@ -143,8 +143,8 @@ TEST_F(NonMaxSuppressionOpTest, TestInconsistentBoxAndScoreShapes) {
ASSERT_FALSE(s.ok());
EXPECT_TRUE(
- StringPiece(s.ToString()).contains("scores has incompatible shape"))
- << s;
+ StringPiece(s.ToString()).contains("scores has incompatible shape"))
+ << s;
}
TEST_F(NonMaxSuppressionOpTest, TestInvalidIOUThreshold) {
@@ -156,8 +156,8 @@ TEST_F(NonMaxSuppressionOpTest, TestInvalidIOUThreshold) {
ASSERT_FALSE(s.ok());
EXPECT_TRUE(
- StringPiece(s.ToString()).contains("iou_threshold must be in [0, 1]"))
- << s;
+ StringPiece(s.ToString()).contains("iou_threshold must be in [0, 1]"))
+ << s;
}
TEST_F(NonMaxSuppressionOpTest, TestEmptyInput) {
diff --git a/tensorflow/core/kernels/ops_util.h b/tensorflow/core/kernels/ops_util.h
index 2d81e682ca..68a9c37406 100644
--- a/tensorflow/core/kernels/ops_util.h
+++ b/tensorflow/core/kernels/ops_util.h
@@ -50,8 +50,12 @@ bool IsInnerDimsSizeAligned(const TensorShape& s) {
if (s.dims() == 0) return false;
const int64 dim0_size = s.dim_size(0);
if (dim0_size == 0) return false;
+#if EIGEN_MAX_ALIGN_BYTES == 0
+ return true;
+#else
const int64 bytes_per_dim0 = (s.num_elements() / dim0_size) * sizeof(T);
return bytes_per_dim0 % EIGEN_MAX_ALIGN_BYTES == 0;
+#endif
}
// Given a shape 's' of a tensor of type T and the `start` and `end` index of a
@@ -61,6 +65,9 @@ bool IsInnerDimsSizeAligned(const TensorShape& s) {
template <typename T>
bool IsDim0SliceAligned(const TensorShape& s, int64 start, int64 end_or_size) {
if (s.dims() == 1) {
+#if EIGEN_MAX_ALIGN_BYTES == 0
+ return true;
+#else
bool start_aligned = (start * sizeof(T)) % EIGEN_MAX_ALIGN_BYTES == 0;
// End is aligned if either the explicit end index is passed and is a
// a multiple of EIGEN_MAX_ALIGN_BYTES, or the start index is aligned and
@@ -68,6 +75,7 @@ bool IsDim0SliceAligned(const TensorShape& s, int64 start, int64 end_or_size) {
// index, or start and size.
bool end_aligned = (end_or_size * sizeof(T)) % EIGEN_MAX_ALIGN_BYTES == 0;
return start_aligned && end_aligned;
+#endif
} else {
return IsInnerDimsSizeAligned<T>(s);
}
diff --git a/tensorflow/core/kernels/ops_util_test.cc b/tensorflow/core/kernels/ops_util_test.cc
index 04a42a9921..42ffef6735 100644
--- a/tensorflow/core/kernels/ops_util_test.cc
+++ b/tensorflow/core/kernels/ops_util_test.cc
@@ -286,6 +286,14 @@ TEST_F(OpsUtilTest, SanitizeThreadSuffix) {
}
TEST_F(OpsUtilTest, Aligned1DSlice) {
+#if EIGEN_MAX_ALIGN_BYTES == 0
+ // When EIGEN_MAX_ALIGN_BYTES is 0, a 1D tensor is always aligned.
+ Tensor t(DT_FLOAT, TensorShape({3}));
+ int64 start = 0;
+ int64 end = 1;
+ bool output = IsDim0SliceAligned<float>(t.shape(), start, end);
+ EXPECT_EQ(output, true);
+#else
Tensor t(DT_FLOAT, TensorShape({EIGEN_MAX_ALIGN_BYTES * 2}));
int64 start = 0;
int64 end = EIGEN_MAX_ALIGN_BYTES;
@@ -295,8 +303,10 @@ TEST_F(OpsUtilTest, Aligned1DSlice) {
Tensor sliced;
CHECK(sliced.CopyFrom(t.Slice(start, end), TensorShape({end - start})));
EXPECT_EQ(sliced.IsAligned(), true);
+#endif
}
+#if EIGEN_MAX_ALIGN_BYTES > 0
TEST_F(OpsUtilTest, Misaligned1DSlice) {
Tensor t(DT_FLOAT, TensorShape({EIGEN_MAX_ALIGN_BYTES * 2}));
int64 start = 1;
@@ -308,8 +318,18 @@ TEST_F(OpsUtilTest, Misaligned1DSlice) {
CHECK(sliced.CopyFrom(t.Slice(start, end), TensorShape({end - start})));
EXPECT_EQ(sliced.IsAligned(), false);
}
+#endif
TEST_F(OpsUtilTest, Aligned2DSliceOfDim0) {
+#if EIGEN_MAX_ALIGN_BYTES == 0
+ // When EIGEN_MAX_ALIGN_BYTES is 0 and the size of the first dimension is nonzero,
+ // a multidimensional tensor is always aligned.
+ Tensor t(DT_FLOAT, TensorShape({3, 4}));
+ int64 start = 1;
+ int64 end = 2;
+ bool output = IsDim0SliceAligned<float>(t.shape(), start, end);
+ EXPECT_EQ(output, true);
+#else
// For multidimensional tensors, alignment is dictated by inner_dim_size.
int64 inner_dim_size = EIGEN_MAX_ALIGN_BYTES;
Tensor t(DT_FLOAT, TensorShape({3, inner_dim_size}));
@@ -321,8 +341,10 @@ TEST_F(OpsUtilTest, Aligned2DSliceOfDim0) {
Tensor sliced;
CHECK(sliced.CopyFrom(t.Slice(start, end), TensorShape({1, inner_dim_size})));
EXPECT_EQ(sliced.IsAligned(), true);
+#endif
}
+#if EIGEN_MAX_ALIGN_BYTES > 0
TEST_F(OpsUtilTest, Misaligned2DSliceOfDim0) {
// For multidimensional tensors, alignment is dictated by inner_dim_size.
int64 inner_dim_size = EIGEN_MAX_ALIGN_BYTES + 1;
@@ -336,6 +358,23 @@ TEST_F(OpsUtilTest, Misaligned2DSliceOfDim0) {
CHECK(sliced.CopyFrom(t.Slice(start, end), TensorShape({1, inner_dim_size})));
EXPECT_EQ(sliced.IsAligned(), false);
}
+#endif
+
+TEST_F(OpsUtilTest, MisalignedEmptyShape) {
+ TensorShape shape({});
+ int64 start = 1;
+ int64 end = 2;
+ bool output = IsDim0SliceAligned<float>(shape, start, end);
+ EXPECT_EQ(output, false);
+}
+
+TEST_F(OpsUtilTest, MisalignedEmptyDim0) {
+ TensorShape shape({0, 1, 2});
+ int64 start = 0;
+ int64 end = 1;
+ bool output = IsDim0SliceAligned<float>(shape, start, end);
+ EXPECT_EQ(output, false);
+}
} // namespace
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/pack_op.cc b/tensorflow/core/kernels/pack_op.cc
index cb78aacb0d..edaa10761e 100644
--- a/tensorflow/core/kernels/pack_op.cc
+++ b/tensorflow/core/kernels/pack_op.cc
@@ -118,6 +118,12 @@ class PackOp : public OpKernel {
return;
}
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+ if (std::is_same<Device, SYCLDevice>::value) {
+ ConcatSYCL<T>(c->eigen_sycl_device(), inputs_flat, &output_flat);
+ return;
+ }
+#endif // TENSORFLOW_USE_SYCL
ConcatCPU<T>(c->device(), inputs_flat, &output_flat);
}
}
@@ -166,26 +172,18 @@ REGISTER_KERNEL_BUILDER(Name("Pack")
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
-
#define REGISTER_SYCL(type) \
REGISTER_KERNEL_BUILDER( \
Name("Pack").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
PackOp<SYCLDevice, type>)
-REGISTER_SYCL(float);
-REGISTER_SYCL(double);
-#undef REGISTER_SYCL
-
-// A special GPU kernel for int32.
-// TODO(b/25387198): Also enable int32 in device memory. This kernel
-// registration requires all int32 inputs and outputs to be in host memory.
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
REGISTER_KERNEL_BUILDER(Name("Pack")
.Device(DEVICE_SYCL)
.HostMemory("values")
.HostMemory("output")
.TypeConstraint<int32>("T"),
PackOp<CPUDevice, int32>);
-
+#undef REGISTER_SYCL
#endif // TENSORFLOW_USE_SYCL
-
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/pad_op.cc b/tensorflow/core/kernels/pad_op.cc
index 91984319c6..4c43193579 100644
--- a/tensorflow/core/kernels/pad_op.cc
+++ b/tensorflow/core/kernels/pad_op.cc
@@ -212,12 +212,7 @@ REGISTER_KERNEL_BUILDER(Name("Pad")
.HostMemory("paddings"), \
PadOp<SYCLDevice, T>)
-REGISTER_SYCL_KERNEL(float);
-REGISTER_SYCL_KERNEL(double);
-
-// A special GPU kernel for int32.
-// TODO(b/25387198): Also enable int32 in device memory. This kernel
-// registration requires all int32 inputs and outputs to be in host memory.
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL_KERNEL);
REGISTER_KERNEL_BUILDER(Name("Pad")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
@@ -226,6 +221,7 @@ REGISTER_KERNEL_BUILDER(Name("Pad")
.HostMemory("paddings")
.HostMemory("output"),
PadOp<CPUDevice, int32>);
+#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/padded_batch_dataset_op.cc b/tensorflow/core/kernels/padded_batch_dataset_op.cc
index af2cfc09f0..b0c000dd25 100644
--- a/tensorflow/core/kernels/padded_batch_dataset_op.cc
+++ b/tensorflow/core/kernels/padded_batch_dataset_op.cc
@@ -23,7 +23,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
// The following five functions are copied from padding_fifo_queue.cc.
@@ -306,7 +306,7 @@ class PaddedBatchDatasetOp : public OpKernel {
const TensorShape& element_shape =
batch_elements[i][component_index].shape();
// TODO(mrry): Perform this check in the shape function if
- // enough static information is avaiable to do so.
+ // enough static information is available to do so.
if (element_shape.dims() != padded_shape.dims()) {
return errors::InvalidArgument(
"All elements in a batch must have the same rank as the "
diff --git a/tensorflow/core/kernels/parallel_map_dataset_op.cc b/tensorflow/core/kernels/parallel_map_dataset_op.cc
index 6b1214e660..93ed644d72 100644
--- a/tensorflow/core/kernels/parallel_map_dataset_op.cc
+++ b/tensorflow/core/kernels/parallel_map_dataset_op.cc
@@ -26,7 +26,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class ParallelMapDatasetOp : public OpKernel {
@@ -188,7 +188,7 @@ class ParallelMapDatasetOp : public OpKernel {
if (!output_buffer_.empty() && output_buffer_.front().is_produced) {
// A new output element is available. Forward the status
- // from computing it, and (if we sucessfully got an
+ // from computing it, and (if we successfully got an
// element) the output values.
Status s = output_buffer_.front().output_status;
if (s.ok()) {
diff --git a/tensorflow/core/kernels/quantization_utils.h b/tensorflow/core/kernels/quantization_utils.h
index e258efd545..e563f15b0b 100644
--- a/tensorflow/core/kernels/quantization_utils.h
+++ b/tensorflow/core/kernels/quantization_utils.h
@@ -87,7 +87,7 @@ float QuantizedToFloat(T input, float range_min, float range_max) {
static_cast<int64>(Eigen::NumTraits<T>::lowest());
const double offset_input = static_cast<double>(input) - lowest_quantized;
// For compatibility with DEQUANTIZE_WITH_EIGEN, we should convert
- // range_scale to a float, otherwise range_min_rounded might be slighly
+ // range_scale to a float, otherwise range_min_rounded might be slightly
// different.
const double range_min_rounded =
round(range_min / static_cast<float>(range_scale)) *
diff --git a/tensorflow/core/kernels/random_op.cc b/tensorflow/core/kernels/random_op.cc
index 80b1be8d4c..e78f8e2621 100644
--- a/tensorflow/core/kernels/random_op.cc
+++ b/tensorflow/core/kernels/random_op.cc
@@ -48,6 +48,9 @@ namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+#endif // TENSORFLOW_USE_SYCL
namespace functor {
using random::PhiloxRandom;
@@ -549,4 +552,193 @@ TF_CALL_int64(REGISTER_INT);
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+
+namespace functor {
+
+using namespace cl;
+
+template <class Distribution, bool VariableSamplesPerOutput>
+struct FillPhiloxRandomKernel;
+
+template <class Distribution>
+struct FillPhiloxRandomKernel<Distribution, false> {
+ typedef typename Distribution::ResultElementType T;
+ using write_accessor = sycl::accessor<uint8_t, 1, sycl::access::mode::write, sycl::access::target::global_buffer>;
+
+ FillPhiloxRandomKernel(write_accessor& data, random::PhiloxRandom& gen, Distribution& dist)
+ : data_(data),
+ gen_(gen),
+ dist_(dist) {
+ }
+
+ void operator()(sycl::nd_item<1> item) {
+ const size_t kGroupSize = Distribution::kResultElementCount;
+
+ const size_t item_id = item.get_global(0);
+ const size_t total_item_count = item.get_global_range(0);
+ size_t offset = item_id * kGroupSize;
+ gen_.Skip(item_id);
+
+ const size_t size = data_.get_size() / sizeof(T);
+ T* data = ConvertToActualTypeSycl(T, data_);
+
+ while (offset + kGroupSize <= size) {
+ const typename Distribution::ResultType samples = dist_(&gen_);
+ for (size_t i = 0; i < kGroupSize; ++i) {
+ data[offset + i] = samples[i];
+ }
+
+ offset += (total_item_count - 1) * kGroupSize;
+ gen_.Skip(total_item_count - 1);
+ }
+
+ const typename Distribution::ResultType samples = dist_(&gen_);
+ for (size_t i = 0; i < kGroupSize; ++i) {
+ if (offset >= size) {
+ return;
+ }
+ data[offset] = samples[i];
+ ++offset;
+ }
+ }
+
+ private:
+ write_accessor data_;
+ random::PhiloxRandom gen_;
+ Distribution dist_;
+};
+
+
+template <class Distribution>
+struct FillPhiloxRandomKernel<Distribution, true> {
+ typedef typename Distribution::ResultElementType T;
+ using write_accessor = sycl::accessor<uint8_t, 1, sycl::access::mode::write, sycl::access::target::global_buffer>;
+
+ FillPhiloxRandomKernel(write_accessor& data, random::PhiloxRandom& gen, Distribution& dist)
+ : data_(data),
+ gen_(gen),
+ dist_(dist) {
+ }
+
+ void operator()(sycl::nd_item<1> item) {
+ using random::PhiloxRandom;
+ using random::SingleSampleAdapter;
+
+ const size_t kReservedSamplesPerOutput = 256;
+ const size_t kGroupSize = Distribution::kResultElementCount;
+ const size_t kGeneratorSkipPerOutputGroup = kGroupSize *
+ kReservedSamplesPerOutput /
+ PhiloxRandom::kResultElementCount;
+
+ const size_t item_id = item.get_global(0);
+ const size_t total_item_count = item.get_global_range(0);
+ size_t group_index = item_id;
+ size_t offset = group_index * kGroupSize;
+
+ T* data = ConvertToActualTypeSycl(T, data_);
+ const size_t size = data_.get_size() / sizeof(T);
+
+ while (offset < size) {
+ // Since each output takes a variable number of samples, we need to
+ // realign the generator to the beginning for the current output group
+ PhiloxRandom gen = gen_;
+ gen.Skip(group_index * kGeneratorSkipPerOutputGroup);
+ SingleSampleAdapter<PhiloxRandom> single_samples(&gen);
+
+ const typename Distribution::ResultType samples = dist_(&single_samples);
+
+ for (size_t i = 0; i < kGroupSize; ++i) {
+ if (offset >= size) {
+ return;
+ }
+ data[offset] = samples[i];
+ ++offset;
+ }
+
+ offset += (total_item_count - 1) * kGroupSize;
+ group_index += total_item_count;
+ }
+ }
+
+ private:
+ write_accessor data_;
+ random::PhiloxRandom gen_;
+ Distribution dist_;
+};
+
+template <typename T>
+class FillRandomKernel;
+// Partial specialization for SYCL to fill the entire region with randoms
+// It splits the work into several tasks and run them in parallel
+template <class Distribution>
+void FillPhiloxRandom<SYCLDevice, Distribution>::operator()(
+ OpKernelContext* context, const SYCLDevice& device, random::PhiloxRandom gen,
+ typename Distribution::ResultElementType* data, int64 size,
+ Distribution dist) {
+
+ const size_t group_size = device.maxSyclThreadsPerBlock();
+ const size_t group_count = (size + group_size - 1) / group_size;
+
+ auto buffer = device.get_sycl_buffer(data);
+
+ device.sycl_queue().submit([&](sycl::handler& cgh) {
+ auto access = buffer.template get_access<sycl::access::mode::write>(cgh);
+
+ FillPhiloxRandomKernel<Distribution, Distribution::kVariableSamplesPerOutput> task(access, gen, dist);
+ cgh.parallel_for<class FillRandomKernel<Distribution>>(
+ sycl::nd_range<1>(sycl::range<1>(group_count * group_size), sycl::range<1>(group_size)),
+ task
+ );
+ });
+}
+
+}
+
+#define REGISTER(TYPE) \
+ template struct functor::FillPhiloxRandom< \
+ SYCLDevice, random::UniformDistribution<random::PhiloxRandom, TYPE> >; \
+ REGISTER_KERNEL_BUILDER( \
+ Name("RandomUniform") \
+ .Device(DEVICE_SYCL) \
+ .HostMemory("shape") \
+ .TypeConstraint<TYPE>("dtype"), \
+ PhiloxRandomOp<SYCLDevice, random::UniformDistribution< \
+ random::PhiloxRandom, TYPE> >); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("RandomStandardNormal") \
+ .Device(DEVICE_SYCL) \
+ .HostMemory("shape") \
+ .TypeConstraint<TYPE>("dtype"), \
+ PhiloxRandomOp<SYCLDevice, random::NormalDistribution< \
+ random::PhiloxRandom, TYPE> >); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("TruncatedNormal") \
+ .Device(DEVICE_SYCL) \
+ .HostMemory("shape") \
+ .TypeConstraint<TYPE>("dtype"), \
+ PhiloxRandomOp< \
+ SYCLDevice, \
+ random::TruncatedNormalDistribution< \
+ random::SingleSampleAdapter<random::PhiloxRandom>, TYPE> >);
+
+#define REGISTER_INT(IntType) \
+ REGISTER_KERNEL_BUILDER(Name("RandomUniformInt") \
+ .Device(DEVICE_SYCL) \
+ .HostMemory("shape") \
+ .HostMemory("minval") \
+ .HostMemory("maxval") \
+ .TypeConstraint<IntType>("Tout"), \
+ RandomUniformIntOp<SYCLDevice, IntType>);
+
+TF_CALL_float(REGISTER);
+TF_CALL_double(REGISTER);
+TF_CALL_int32(REGISTER_INT);
+TF_CALL_int64(REGISTER_INT);
+
+#undef REGISTER
+#undef REGISTER_INT
+
+#endif // TENSORFLOW_USE_SYCL
+
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/random_op.h b/tensorflow/core/kernels/random_op.h
index b52901c38e..97bcaf1a49 100644
--- a/tensorflow/core/kernels/random_op.h
+++ b/tensorflow/core/kernels/random_op.h
@@ -54,6 +54,18 @@ struct FillPhiloxRandom<GPUDevice, Distribution> {
};
#endif // GOOGLE_CUDA
+#if TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+// Declares the partially SYCL-specialized functor struct.
+template <class Distribution>
+struct FillPhiloxRandom<SYCLDevice, Distribution> {
+ void operator()(OpKernelContext* ctx, const SYCLDevice& d,
+ random::PhiloxRandom gen,
+ typename Distribution::ResultElementType* data, int64 size,
+ Distribution dist);
+};
+#endif // TENSORFLOW_USE_SYCL
+
} // namespace functor
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/random_op_gpu.cu.cc b/tensorflow/core/kernels/random_op_gpu.cu.cc
index 5f7d9b7dd6..7afa6974c6 100644
--- a/tensorflow/core/kernels/random_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/random_op_gpu.cu.cc
@@ -141,7 +141,7 @@ struct FillPhiloxRandomKernel<Distribution, false> {
const typename Distribution::ResultType samples = dist(&gen);
copier(&data[offset], samples);
- offset += (total_thread_count - 1) * kGroupSize;
+ offset += total_thread_count * kGroupSize;
gen.Skip(total_thread_count - 1);
}
diff --git a/tensorflow/core/kernels/range_dataset_op.cc b/tensorflow/core/kernels/range_dataset_op.cc
index 8cfd3a0a8f..c181f6e804 100644
--- a/tensorflow/core/kernels/range_dataset_op.cc
+++ b/tensorflow/core/kernels/range_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class RangeDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/reader_dataset_ops.cc b/tensorflow/core/kernels/reader_dataset_ops.cc
index e0e90d0cce..e7f65c39cb 100644
--- a/tensorflow/core/kernels/reader_dataset_ops.cc
+++ b/tensorflow/core/kernels/reader_dataset_ops.cc
@@ -23,7 +23,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following ops.
class TextLineDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/reduction_ops_common.h b/tensorflow/core/kernels/reduction_ops_common.h
index 0cd9c255bc..553f889523 100644
--- a/tensorflow/core/kernels/reduction_ops_common.h
+++ b/tensorflow/core/kernels/reduction_ops_common.h
@@ -238,9 +238,11 @@ class ReductionOp : public OpKernel {
if (ctx->track_allocations()) {
// The temporary memory becomes the output memory.
if (ctx->allocate_on_host(alloc_attr)) {
- ctx->record_host_temp_memory_size(-out.AllocatedBytes());
+ ctx->record_host_temp_memory_size(
+ -static_cast<int64>(out.AllocatedBytes()));
} else {
- ctx->record_device_temp_memory_size(-out.AllocatedBytes());
+ ctx->record_device_temp_memory_size(
+ -static_cast<int64>(out.AllocatedBytes()));
}
}
ctx->set_output(0, out);
@@ -276,31 +278,6 @@ struct ReduceFunctor<CPUDevice, Reducer>
template <typename Reducer>
struct ReduceFunctor<SYCLDevice, Reducer>
: ReduceFunctorBase<SYCLDevice, Reducer>{};
-
-template <typename T>
-struct ReduceFunctor<SYCLDevice, Eigen::internal::MeanReducer<T> > {
- template <typename OUT_T, typename IN_T, typename ReductionAxes>
- static void Reduce(const SYCLDevice& d, OUT_T out, IN_T in,
- const ReductionAxes& reduction_axes,
- const Eigen::internal::MeanReducer<T>& reducer) {
- typedef typename IN_T::Index Index;
- // Eigen sum reductions are much faster on GPU than mean reductions:
- // Simply trigger them by computing the sum of the weighted inputs.
- Index num_coeffs_to_reduce = 1;
- for (int i = 0; i < Eigen::internal::array_size<ReductionAxes>::value;
- ++i) {
- num_coeffs_to_reduce *= in.dimension(reduction_axes[i]);
- }
- T scale = T(1.0) / num_coeffs_to_reduce;
- out.device(d) = (in * scale).sum(reduction_axes);
- }
-
- template <typename OUT_T>
- static void FillIdentity(const SYCLDevice& d, OUT_T out,
- const Eigen::internal::MeanReducer<T>& reducer) {
- FillIdentityEigenImpl(d, out, reducer);
- }
-};
#endif // TENSORFLOW_USE_SYCL
} // namespace functor
diff --git a/tensorflow/core/kernels/reduction_ops_max.cc b/tensorflow/core/kernels/reduction_ops_max.cc
index 5ab97d1eee..d243e7c55f 100644
--- a/tensorflow/core/kernels/reduction_ops_max.cc
+++ b/tensorflow/core/kernels/reduction_ops_max.cc
@@ -67,7 +67,7 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("reduction_indices"), \
ReductionOp<SYCLDevice, type, Eigen::internal::MaxReducer<type>>);
REGISTER_SYCL_KERNELS(float);
-#undef REGISTER_SYCL_KERNELS
+REGISTER_SYCL_KERNELS(double);
REGISTER_KERNEL_BUILDER(
Name("Max")
@@ -78,6 +78,7 @@ REGISTER_KERNEL_BUILDER(
.TypeConstraint<int32>("T")
.TypeConstraint<int32>("Tidx"),
ReductionOp<CPUDevice, int32, Eigen::internal::MaxReducer<int32>>);
+#undef REGISTER_SYCL_KERNELS
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_mean.cc b/tensorflow/core/kernels/reduction_ops_mean.cc
index 03f737b4fa..5b01de8ddb 100644
--- a/tensorflow/core/kernels/reduction_ops_mean.cc
+++ b/tensorflow/core/kernels/reduction_ops_mean.cc
@@ -54,6 +54,7 @@ TF_CALL_complex128(REGISTER_GPU_KERNELS);
.HostMemory("reduction_indices"), \
ReductionOp<SYCLDevice, type, Eigen::internal::MeanReducer<type>>);
REGISTER_SYCL_KERNELS(float);
+REGISTER_SYCL_KERNELS(double);
#undef REGISTER_SYCL_KERNELS
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/reduction_ops_min.cc b/tensorflow/core/kernels/reduction_ops_min.cc
index ec240421b9..1e394bea41 100644
--- a/tensorflow/core/kernels/reduction_ops_min.cc
+++ b/tensorflow/core/kernels/reduction_ops_min.cc
@@ -67,7 +67,7 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("reduction_indices"), \
ReductionOp<SYCLDevice, type, Eigen::internal::MinReducer<type>>);
REGISTER_SYCL_KERNELS(float);
-#undef REGISTER_SYCL_KERNELS
+REGISTER_SYCL_KERNELS(double);
REGISTER_KERNEL_BUILDER(
Name("Min")
@@ -78,6 +78,7 @@ REGISTER_KERNEL_BUILDER(
.TypeConstraint<int32>("T")
.TypeConstraint<int32>("Tidx"),
ReductionOp<CPUDevice, int32, Eigen::internal::MinReducer<int32>>);
+#undef REGISTER_SYCL_KERNELS
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_prod.cc b/tensorflow/core/kernels/reduction_ops_prod.cc
index f841a981b4..33f6ae6bae 100644
--- a/tensorflow/core/kernels/reduction_ops_prod.cc
+++ b/tensorflow/core/kernels/reduction_ops_prod.cc
@@ -54,19 +54,10 @@ TF_CALL_complex128(REGISTER_GPU_KERNELS);
.TypeConstraint<int32>("Tidx") \
.HostMemory("reduction_indices"), \
ReductionOp<SYCLDevice, type, Eigen::internal::ProdReducer<type>>);
+REGISTER_SYCL_KERNELS(int32);
REGISTER_SYCL_KERNELS(float);
REGISTER_SYCL_KERNELS(double);
#undef REGISTER_SYCL_KERNELS
-
-REGISTER_KERNEL_BUILDER(
- Name("Prod")
- .Device(DEVICE_SYCL)
- .TypeConstraint<int32>("T")
- .TypeConstraint<int32>("Tidx")
- .HostMemory("input")
- .HostMemory("output")
- .HostMemory("reduction_indices"),
- ReductionOp<CPUDevice, int32, Eigen::internal::ProdReducer<int32>>);
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reduction_ops_sum.cc b/tensorflow/core/kernels/reduction_ops_sum.cc
index 828e1a588c..c1f4f3475a 100644
--- a/tensorflow/core/kernels/reduction_ops_sum.cc
+++ b/tensorflow/core/kernels/reduction_ops_sum.cc
@@ -67,11 +67,8 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("reduction_indices"), \
ReductionOp<SYCLDevice, type, Eigen::internal::SumReducer<type>>);
REGISTER_SYCL_KERNELS(float);
-#undef REGISTER_SYCL_KERNELS
+REGISTER_SYCL_KERNELS(double);
-// A special GPU kernel for int32.
-// TODO(b/25387198): Also enable int32 in device memory. This kernel
-// registration requires all int32 inputs and outputs to be in host memory.
REGISTER_KERNEL_BUILDER(
Name("Sum")
.Device(DEVICE_SYCL)
@@ -81,6 +78,7 @@ REGISTER_KERNEL_BUILDER(
.HostMemory("output")
.HostMemory("reduction_indices"),
ReductionOp<CPUDevice, int32, Eigen::internal::SumReducer<int32>>);
+#undef REGISTER_SYCL_KERNELS
#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/reference_gemm.h b/tensorflow/core/kernels/reference_gemm.h
index 5e4cde07d7..ad8f427429 100644
--- a/tensorflow/core/kernels/reference_gemm.h
+++ b/tensorflow/core/kernels/reference_gemm.h
@@ -21,7 +21,7 @@ limitations under the License.
// for bit depths or argument combinations that aren't supported by optimized
// code.
// It assumes the row-major convention used by TensorFlow, and implements
-// C = A * B, like the standard BLAS GEMM interface. If the tranpose flags are
+// C = A * B, like the standard BLAS GEMM interface. If the transpose flags are
// true, then the relevant matrix is treated as stored in column-major order.
namespace tensorflow {
diff --git a/tensorflow/core/kernels/relu_op.cc b/tensorflow/core/kernels/relu_op.cc
index d70398bea5..d8d30e87e2 100644
--- a/tensorflow/core/kernels/relu_op.cc
+++ b/tensorflow/core/kernels/relu_op.cc
@@ -156,7 +156,7 @@ TF_CALL_GPU_NUMBER_TYPES(REGISTER_GPU_KERNELS);
Name("EluGrad").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
EluGradOp<SYCLDevice, type>)
-REGISTER_SYCL_KERNELS(float);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL_KERNELS);
#undef REGISTER_SYCL_KERNELS
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/relu_op.h b/tensorflow/core/kernels/relu_op.h
index e2e0bd48dd..365c6201a5 100644
--- a/tensorflow/core/kernels/relu_op.h
+++ b/tensorflow/core/kernels/relu_op.h
@@ -175,10 +175,6 @@ void EluGradOp<Device, T>::OperateNoTemplate(OpKernelContext* context,
} // namespace tensorflow
-#ifdef TENSORFLOW_USE_SYCL
-#undef EIGEN_USE_SYCL
-#endif // TENSORFLOW_USE_SYCL
-
#undef EIGEN_USE_THREADS
#endif // TENSORFLOW_KERNELS_RELU_OP_H_
diff --git a/tensorflow/core/kernels/repeat_dataset_op.cc b/tensorflow/core/kernels/repeat_dataset_op.cc
index 9bf4039fb7..8fc59e1779 100644
--- a/tensorflow/core/kernels/repeat_dataset_op.cc
+++ b/tensorflow/core/kernels/repeat_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class RepeatDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/reshape_op.cc b/tensorflow/core/kernels/reshape_op.cc
index 6589a54624..04454b76c1 100644
--- a/tensorflow/core/kernels/reshape_op.cc
+++ b/tensorflow/core/kernels/reshape_op.cc
@@ -42,8 +42,12 @@ TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER_GPU_KERNEL);
.TypeConstraint<type>("T") \
.TypeConstraint<int32>("Tshape"), \
ReshapeOp);
-TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER_SYCL_KERNEL);
-#undef REGISTER_SYCL_KERNEL
+REGISTER_SYCL_KERNEL(float)
+REGISTER_SYCL_KERNEL(double)
+REGISTER_SYCL_KERNEL(uint8)
+REGISTER_SYCL_KERNEL(int8)
+REGISTER_SYCL_KERNEL(int64)
+REGISTER_SYCL_KERNEL(uint16)
REGISTER_KERNEL_BUILDER(Name("Reshape")
.Device(DEVICE_SYCL)
@@ -53,6 +57,7 @@ REGISTER_KERNEL_BUILDER(Name("Reshape")
.TypeConstraint<int32>("T")
.TypeConstraint<int32>("Tshape"),
ReshapeOp);
+#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
#if GOOGLE_CUDA
diff --git a/tensorflow/core/kernels/reverse_op.cc b/tensorflow/core/kernels/reverse_op.cc
index 6f7a0a4df5..4f2afa5257 100644
--- a/tensorflow/core/kernels/reverse_op.cc
+++ b/tensorflow/core/kernels/reverse_op.cc
@@ -140,9 +140,9 @@ class ReverseOp : public OpKernel {
OP_REQUIRES_OK(context,
context->allocate_output(0, input.shape(), &output));
-#define HANDLE_REVERSE(NDIMS) \
- case NDIMS: \
- HandleReverseCase<Device, T, NDIMS>(context, dims.vec<bool>(), output); \
+#define HANDLE_REVERSE(NDIMS) \
+ case NDIMS: \
+ HandleReverseCase<Device, T, NDIMS>(context, dims.vec<bool>(), output); \
return;
switch (input_dims) {
@@ -361,7 +361,10 @@ REGISTER_KERNEL_BUILDER(Name("ReverseV2")
.TypeConstraint<int32>("Tidx") \
.HostMemory("axis"), \
ReverseV2Op<SYCLDevice, T>)
+TF_CALL_uint8(REGISTER_SYCL_KERNELS);
+TF_CALL_int8(REGISTER_SYCL_KERNELS);
TF_CALL_float(REGISTER_SYCL_KERNELS);
+TF_CALL_double(REGISTER_SYCL_KERNELS);
REGISTER_KERNEL_BUILDER(Name("Reverse")
.Device(DEVICE_SYCL)
@@ -379,5 +382,4 @@ REGISTER_KERNEL_BUILDER(Name("ReverseV2")
.HostMemory("output"),
ReverseV2Op<CPUDevice, int32>);
#endif // TENSORFLOW_USE_SYCL
-
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/scatter_functor.h b/tensorflow/core/kernels/scatter_functor.h
index 63add61ba7..c6e35fe329 100644
--- a/tensorflow/core/kernels/scatter_functor.h
+++ b/tensorflow/core/kernels/scatter_functor.h
@@ -75,6 +75,50 @@ struct Assign<scatter_op::UpdateOp::DIV> {
}
};
+#ifdef TENSORFLOW_USE_SYCL
+template <scatter_op::UpdateOp Op>
+struct AssignSYCL {};
+template <>
+struct AssignSYCL<scatter_op::UpdateOp::ASSIGN> {
+ template <typename Device, typename Params, typename Update>
+ static void Run(Device d, Params p, Update u) {
+ p.device(d) = u;
+ }
+};
+
+template <>
+struct AssignSYCL<scatter_op::UpdateOp::ADD> {
+ template <typename Device, typename Params, typename Update>
+ static void Run(Device d, Params p, Update u) {
+ p.device(d) += u;
+ }
+};
+
+template <>
+struct AssignSYCL<scatter_op::UpdateOp::SUB> {
+ template <typename Device, typename Params, typename Update>
+ static void Run(Device d, Params p, Update u) {
+ p.device(d) -= u;
+ }
+};
+
+template <>
+struct AssignSYCL<scatter_op::UpdateOp::MUL> {
+ template <typename Device, typename Params, typename Update>
+ static void Run(Device d, Params p, Update u) {
+ p.device(d) = p * u;
+ }
+};
+
+template <>
+struct AssignSYCL<scatter_op::UpdateOp::DIV> {
+ template <typename Device, typename Params, typename Update>
+ static void Run(Device d, Params p, Update u) {
+ p.device(d) = p / u;
+ }
+};
+#endif // TENSORFLOW_USE_SYCL
+
} // namespace internal
} // namespace scatter_op
@@ -110,6 +154,31 @@ struct ScatterFunctorBase {
}
};
+#ifdef TENSORFLOW_USE_SYCL
+template <typename T, typename Index, scatter_op::UpdateOp op>
+struct ScatterFunctorBase <SYCLDevice, T, Index, op> {
+ Index operator()(OpKernelContext* c, const SYCLDevice& d,
+ typename TTypes<T>::Matrix params,
+ typename TTypes<T>::ConstMatrix updates,
+ typename TTypes<Index>::ConstFlat indices) {
+ // indices and params sizes were validated in DoCompute().
+ const Index N = static_cast<Index>(indices.size());
+ const Index limit = static_cast<Index>(params.dimension(0));
+ for (Index i = 0; i < N; i++) {
+ // Grab the index and check its validity. An earlier version of the
+ // code checked it and then grabbed it from memory a second time, which
+ // was a security risk since it could have changed in between.
+ const Index index = ::tensorflow::internal::SubtleMustCopy(indices(i));
+ if (!FastBoundsCheck(index, limit)) return i;
+ // Copy last Ndim-1 dimensions of updates[i] to params[index]
+ scatter_op::internal::AssignSYCL<op>::Run(d, params.template chip<0>(index),
+ updates.template chip<0>(i));
+ }
+ return -1;
+ }
+};
+#endif // TENSORFLOW_USE_SYCL
+
template <typename T, typename Index>
struct ScatterFunctorBase<CPUDevice, T, Index, scatter_op::UpdateOp::ASSIGN> {
Index operator()(OpKernelContext* c, const CPUDevice& d,
@@ -149,10 +218,27 @@ struct ScatterFunctorBase<CPUDevice, T, Index, scatter_op::UpdateOp::ASSIGN> {
template <typename T, typename Index, scatter_op::UpdateOp op>
struct ScatterFunctor<CPUDevice, T, Index, op>
: ScatterFunctorBase<CPUDevice, T, Index, op>{};
-#if TENSORFLOW_USE_SYCL
-template<typename T, typename Index, scatter_op::UpdateOp op>
-struct ScatterFunctor<SYCLDevice, T, Index, op>
- : ScatterFunctorBase<SYCLDevice, T, Index, op>{};
+
+#ifdef TENSORFLOW_USE_SYCL
+template <typename T, typename Index, scatter_op::UpdateOp op>
+struct ScatterFunctorSYCL {
+ Index operator()(OpKernelContext* c, const SYCLDevice& d,
+ typename TTypes<T>::Matrix params,
+ typename TTypes<T>::ConstMatrix updates,
+ typename TTypes<Index>::Flat indices) {
+ // indices and params sizes were validated in DoCompute().
+ const Index N = static_cast<Index>(indices.size());
+ const Index limit = static_cast<Index>(params.dimension(0));
+ for (Index i = 0; i < N; i++) {
+ const Index index = ::tensorflow::internal::SubtleMustCopy(indices(i));
+ if (!FastBoundsCheck(index, limit)) return i;
+ // Copy last Ndim-1 dimensions of updates[i] to params[index]
+ scatter_op::internal::AssignSYCL<op>::Run(
+ d, params.template chip<0>(index), updates.template chip<0>(i));
+ }
+ return -1;
+ }
+};
#endif // TENSORFLOW_USE_SYCL
} // namespace functor
diff --git a/tensorflow/core/kernels/scatter_nd_op.cc b/tensorflow/core/kernels/scatter_nd_op.cc
index 363de801ba..48565d8cb9 100644
--- a/tensorflow/core/kernels/scatter_nd_op.cc
+++ b/tensorflow/core/kernels/scatter_nd_op.cc
@@ -27,10 +27,17 @@ limitations under the License.
#include "tensorflow/core/platform/types.h"
#include "tensorflow/core/util/util.h"
+#ifdef TENSORFLOW_USE_SYCL
+#include "tensorflow/core/common_runtime/sycl/sycl_util.h"
+#endif // TENSORFLOW_USE_SYCL
+
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+#endif // TENSORFLOW_USE_SYCL
// Check whether updates.shape = indices.shape[:batch_dim] +
// params_shape[slice_dim:]
@@ -138,6 +145,40 @@ static void PrepareAndValidateInputs(OpKernelContext* c,
*num_updates = indices_shape.num_elements() / safe_slice_dim;
}
+template <typename Device, typename Index>
+class IndexFlattener {
+public:
+ inline typename TTypes<Index, 2>::ConstTensor
+ operator()(OpKernelContext*, const Tensor& indices) {
+ return indices.flat_inner_dims<Index>();
+ }
+};
+
+#ifdef TENSORFLOW_USE_SYCL
+template <typename Index>
+class IndexFlattener<SYCLDevice, Index> {
+public:
+ IndexFlattener() { indices_host_ = nullptr; }
+ ~IndexFlattener() { delete[] indices_host_; }
+
+ inline typename TTypes<Index, 2>::ConstTensor
+ operator()(OpKernelContext* c, const Tensor& indices) {
+ size_t num_indices = indices.NumElements();
+ indices_host_ = new Index[num_indices];
+ auto device = c->eigen_sycl_device();
+ auto size = sizeof(Index) * num_indices;
+ auto src_ptr = GetBase(&indices);
+ device.memcpyDeviceToHost(indices_host_, static_cast<const Index*>(src_ptr),
+ size);
+ return typename TTypes<Index, 2>::ConstTensor(indices_host_,
+ indices.shape().AsEigenDSizes<2>());
+ }
+
+private:
+ Index* indices_host_;
+};
+#endif
+
template <typename Device, typename T, typename Index>
class ScatterNdOp : public OpKernel {
public:
@@ -166,7 +207,8 @@ class ScatterNdOp : public OpKernel {
&num_updates, &slice_size);
if (!c->status().ok()) return;
- auto indices_flat = indices.flat_inner_dims<Index>();
+ IndexFlattener<Device, Index> index_flattener;
+ auto indices_flat = index_flattener(c, indices);
auto updates_flat = updates.shaped<T, 2>({num_updates, slice_size});
Tensor* out = nullptr;
@@ -262,7 +304,8 @@ class ScatterNdUpdateOp : public OpKernel {
&slice_dim, &num_updates, &slice_size);
if (!c->status().ok()) return;
- auto indices_flat = indices.flat_inner_dims<Index>();
+ IndexFlattener<Device, Index> index_flattener;
+ auto indices_flat = index_flattener(c, indices);
auto updates_flat = updates.shaped<T, 2>({num_updates, slice_size});
auto params_matrix = params.template shaped<T, 2>(
{params_shape.num_elements() / slice_size, slice_size});
@@ -419,6 +462,19 @@ TF_CALL_GPU_NUMBER_TYPES_NO_HALF(DECLARE_GPU_SPECS);
#endif // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_SCATTER_ND_ADD_SUB_SYCL(type) \
+ REGISTER_SCATTER_ND_ADD_SUB(type, SYCL);
+
+#define REGISTER_SCATTER_ND_UPDATE_SYCL(type) \
+ REGISTER_SCATTER_ND_UPDATE(type, SYCL);
+
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SCATTER_ND_ADD_SUB_SYCL);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SCATTER_ND_UPDATE_SYCL);
+#undef REGISTER_SCATTER_ND_ADD_SUB_SYCL
+#undef REGISTER_SCATTER_ND_UPDATE_SYCL
+#endif // TENSORFLOW_USE_SYCL
+
#undef REGISTER_SCATTER_ND_ADD
#undef REGISTER_SCATTER_ND_ADD_SUB
#undef REGISTER_SCATTER_ND_ADD_SUB_CPU
diff --git a/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h b/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h
index bbe2c6864f..788797b668 100644
--- a/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h
+++ b/tensorflow/core/kernels/scatter_nd_op_cpu_impl.h
@@ -38,6 +38,9 @@ limitations under the License.
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
+#ifdef TENSORFLOW_USE_SYCL
+typedef Eigen::SyclDevice SYCLDevice;
+#endif // TENSORFLOW_USE_SYCL
class OpKernelContext;
@@ -186,6 +189,92 @@ TF_CALL_NUMBER_TYPES(REGISTER_SCATTER_ND_MATH)
#undef REGISTER_SCATTER_ND_INDEX
#undef REGISTER_SCATTER_ND_FULL
+#ifdef TENSORFLOW_USE_SYCL
+// Implementation of update functor for SYCL.
+template <typename T, typename Index, scatter_nd_op::UpdateOp OP, int IXDIM>
+struct ScatterNdFunctor<SYCLDevice, T, Index, OP, IXDIM> {
+ Index operator()(
+ const SYCLDevice& d, const Index slice_size,
+ const Eigen::array<Eigen::DenseIndex, IXDIM> output_shape_prefix,
+ typename TTypes<T, 2>::Tensor Tparams,
+ typename TTypes<Index, 2>::ConstTensor Tindices,
+ typename TTypes<T, 2>::ConstTensor Tupdates,
+ typename TTypes<T, 2>::Tensor Toutput) {
+ // error_loc is -1 if there's no out-of-bounds index,
+ // otherwise it is the location of an OOB index in Tindices.
+ Index error_loc = -1;
+
+ const Eigen::DenseIndex batch_size = Tindices.dimension(0);
+
+ Index batch_strides[IXDIM];
+ for (int dim = IXDIM - 1; dim >= 0; --dim) {
+ if (dim == IXDIM - 1) {
+ batch_strides[dim] = 1;
+ } else {
+ batch_strides[dim] =
+ batch_strides[dim + 1] * output_shape_prefix[dim + 1];
+ }
+ }
+
+ for (Eigen::DenseIndex loc = 0; loc < batch_size; ++loc) {
+ Index i = 0;
+ bool out_of_bounds = false;
+ for (int dim = 0; dim < IXDIM; ++dim) {
+ const Index ix_d = internal::SubtleMustCopy(Tindices(loc, dim));
+ out_of_bounds |= !FastBoundsCheck(ix_d, output_shape_prefix[dim]);
+ i += ix_d * batch_strides[dim];
+ }
+ if (TF_PREDICT_FALSE(out_of_bounds)) {
+ error_loc = loc;
+ break;
+ } else {
+ auto input_chip = Toutput.template chip<0>(i);
+ auto output_chip = input_chip.device(d);
+ auto update_chip = Tupdates.template chip<0>(loc);
+ update_executor::UpdateExecutor<
+ decltype(input_chip), decltype(update_chip), decltype(output_chip),
+ OP>::Execute(input_chip, update_chip, output_chip);
+ }
+ }
+
+ return error_loc;
+ }
+};
+
+#define REGISTER_SCATTER_ND_FULL_SYCL(T, Index, op) \
+ template Index \
+ ScatterNdFunctor<SYCLDevice, T, Index, op, CPU_PROVIDED_IXDIM>::operator()( \
+ const SYCLDevice& d, const Index slice_size, \
+ const Eigen::array<Eigen::DenseIndex, CPU_PROVIDED_IXDIM> \
+ output_shape_prefix, \
+ typename TTypes<T, 2>::Tensor Tparams, \
+ typename TTypes<Index, 2>::ConstTensor Tindices, \
+ typename TTypes<T, 2>::ConstTensor Tupdates, \
+ typename TTypes<T, 2>::Tensor Toutput)
+
+#define REGISTER_SCATTER_ND_INDEX_SYCL(type, op) \
+ REGISTER_SCATTER_ND_FULL_SYCL(type, int32, op); \
+ REGISTER_SCATTER_ND_FULL_SYCL(type, int64, op)
+
+#define REGISTER_SCATTER_ND_UPDATE_SYCL(type) \
+ REGISTER_SCATTER_ND_INDEX_SYCL(type, scatter_nd_op::UpdateOp::ASSIGN);
+
+#define REGISTER_SCATTER_ND_MATH_SYCL(type) \
+ REGISTER_SCATTER_ND_INDEX_SYCL(type, scatter_nd_op::UpdateOp::ADD); \
+ REGISTER_SCATTER_ND_INDEX_SYCL(type, scatter_nd_op::UpdateOp::SUB);
+
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SCATTER_ND_UPDATE_SYCL)
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SCATTER_ND_MATH_SYCL)
+REGISTER_SCATTER_ND_UPDATE_SYCL(int32);
+REGISTER_SCATTER_ND_MATH_SYCL(int32);
+
+#undef REGISTER_SCATTER_ND_MATH_SYCL
+#undef REGISTER_SCATTER_ND_UPDATE_SYCL
+#undef REGISTER_SCATTER_ND_INDEX_SYCL
+#undef REGISTER_SCATTER_ND_FULL_SYCL
+
+#endif // TENSORFLOW_USE_SYCL
+
} // namespace functor
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/scatter_op.cc b/tensorflow/core/kernels/scatter_op.cc
index 51dad49cfe..8607c7f95a 100644
--- a/tensorflow/core/kernels/scatter_op.cc
+++ b/tensorflow/core/kernels/scatter_op.cc
@@ -23,6 +23,10 @@ limitations under the License.
#include "tensorflow/core/platform/types.h"
#include "tensorflow/core/util/util.h"
+#ifdef TENSORFLOW_USE_SYCL
+#include "tensorflow/core/common_runtime/sycl/sycl_util.h"
+#endif // TENSORFLOW_USE_SYCL
+
namespace tensorflow {
typedef Eigen::ThreadPoolDevice CPUDevice;
@@ -131,6 +135,79 @@ class ScatterUpdateOp : public OpKernel {
}
};
+#ifdef TENSORFLOW_USE_SYCL
+template <typename T, typename Index, scatter_op::UpdateOp op>
+class ScatterUpdateOp <SYCLDevice, T, Index, op> : public OpKernel {
+ public:
+ explicit ScatterUpdateOp(OpKernelConstruction* c) : OpKernel(c) {
+ OP_REQUIRES_OK(c, c->GetAttr("use_locking", &use_exclusive_lock_));
+ }
+
+ void Compute(OpKernelContext* c) override {
+ if (use_exclusive_lock_) {
+ // Hold mutex while we apply updates
+ mutex_lock l(*c->input_ref_mutex(0));
+ DoCompute(c);
+ } else {
+ DoCompute(c);
+ }
+ }
+
+ private:
+ bool use_exclusive_lock_;
+
+ void DoCompute(OpKernelContext* c) {
+ Tensor params = c->mutable_input(0, use_exclusive_lock_);
+ const Tensor& indices = c->input(1);
+ const Tensor& updates = c->input(2);
+ DoValidationChecking(c, params, indices, updates);
+ if (!c->status().ok()) return;
+
+ // Check that we have enough index space
+ const int64 N_big = indices.NumElements();
+ OP_REQUIRES(c, N_big <= std::numeric_limits<Index>::max(),
+ errors::InvalidArgument(
+ "indices has too many elements for ",
+ DataTypeString(DataTypeToEnum<Index>::v()), " indexing: ",
+ N_big, " > ", std::numeric_limits<Index>::max()));
+ const Index N = static_cast<Index>(indices.NumElements());
+ OP_REQUIRES(
+ c, params.dim_size(0) <= std::numeric_limits<Index>::max(),
+ errors::InvalidArgument("params.shape[0] too large for ",
+ DataTypeString(DataTypeToEnum<Index>::v()),
+ " indexing: ", params.dim_size(0), " > ",
+ std::numeric_limits<Index>::max()));
+
+ // We always return the input ref.
+ c->forward_ref_input_to_ref_output(0, 0);
+
+ if (N > 0) {
+ auto index_size = indices.NumElements() * sizeof(Index);
+ Tensor indices_host = Tensor(indices.dtype(), indices.shape());
+
+ auto src_ptr = GetBase(&indices);
+ auto dst_ptr = GetBase(&indices_host);
+
+ c->eigen_sycl_device().memcpyDeviceToHost(
+ dst_ptr, static_cast<const Index*>(src_ptr), index_size);
+
+ auto indices_flat = indices_host.flat<Index>();
+ auto params_flat = params.flat_outer_dims<T>();
+ auto updates_flat = updates.shaped<T, 2>({N, updates.NumElements() / N});
+
+ functor::ScatterFunctorSYCL<T, Index, op> functor;
+ const Index bad_i = functor(c, c->template eigen_device<SYCLDevice>(),
+ params_flat, updates_flat, indices_flat);
+ OP_REQUIRES(
+ c, bad_i < 0,
+ errors::InvalidArgument(
+ "indices", SliceDebugString(indices.shape(), bad_i), " = ",
+ indices_flat(bad_i), " is not in [0, ", params.dim_size(0), ")"));
+ }
+ }
+};
+#endif // TENSORFLOW_USE_SYCL
+
#define REGISTER_SCATTER_KERNEL_INDEX(type, index_type, dev, name, op) \
REGISTER_KERNEL_BUILDER(Name(name) \
.Device(DEVICE_##dev) \
diff --git a/tensorflow/core/kernels/sendrecv_ops.cc b/tensorflow/core/kernels/sendrecv_ops.cc
index 4c656bd74b..2a98a6530c 100644
--- a/tensorflow/core/kernels/sendrecv_ops.cc
+++ b/tensorflow/core/kernels/sendrecv_ops.cc
@@ -93,11 +93,11 @@ void SendOp::Compute(OpKernelContext* ctx) {
REGISTER_KERNEL_BUILDER(Name("_Send").Device(DEVICE_CPU), SendOp);
REGISTER_KERNEL_BUILDER(Name("_Send").Device(DEVICE_GPU), SendOp);
-#if TENSORFLOW_USE_SYCL
+#ifdef TENSORFLOW_USE_SYCL
REGISTER_KERNEL_BUILDER(Name("_Send").Device(DEVICE_SYCL), SendOp);
REGISTER_KERNEL_BUILDER(
Name("_HostSend").Device(DEVICE_SYCL).HostMemory("tensor"), SendOp);
-#endif
+#endif // TENSORFLOW_USE_SYCL
REGISTER_KERNEL_BUILDER(Name("_HostSend").Device(DEVICE_CPU), SendOp);
REGISTER_KERNEL_BUILDER(
@@ -168,17 +168,17 @@ void RecvOp::ComputeAsync(OpKernelContext* ctx, DoneCallback done) {
REGISTER_KERNEL_BUILDER(Name("_Recv").Device(DEVICE_CPU), RecvOp);
REGISTER_KERNEL_BUILDER(Name("_Recv").Device(DEVICE_GPU), RecvOp);
-#if TENSORFLOW_USE_SYCL
+#ifdef TENSORFLOW_USE_SYCL
REGISTER_KERNEL_BUILDER(Name("_Recv").Device(DEVICE_SYCL), RecvOp);
-#endif
+#endif // TENSORFLOW_USE_SYCL
REGISTER_KERNEL_BUILDER(Name("_HostRecv").Device(DEVICE_CPU), RecvOp);
REGISTER_KERNEL_BUILDER(
Name("_HostRecv").Device(DEVICE_GPU).HostMemory("tensor"), RecvOp);
-#if TENSORFLOW_USE_SYCL
+#ifdef TENSORFLOW_USE_SYCL
REGISTER_KERNEL_BUILDER(
Name("_HostRecv").Device(DEVICE_SYCL).HostMemory("tensor"), RecvOp);
-#endif
+#endif // TENSORFLOW_USE_SYCL
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/shuffle_dataset_op.cc b/tensorflow/core/kernels/shuffle_dataset_op.cc
index 7156e5155f..14e7d1bf97 100644
--- a/tensorflow/core/kernels/shuffle_dataset_op.cc
+++ b/tensorflow/core/kernels/shuffle_dataset_op.cc
@@ -24,7 +24,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class ShuffleDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/skip_dataset_op.cc b/tensorflow/core/kernels/skip_dataset_op.cc
index 06c9d8c6ec..1cff90a05e 100644
--- a/tensorflow/core/kernels/skip_dataset_op.cc
+++ b/tensorflow/core/kernels/skip_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class SkipDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/slice_op.cc b/tensorflow/core/kernels/slice_op.cc
index 2a9ff40f8c..ee6f9a28cd 100644
--- a/tensorflow/core/kernels/slice_op.cc
+++ b/tensorflow/core/kernels/slice_op.cc
@@ -328,8 +328,9 @@ namespace functor {
DECLARE_SYCL_SPEC(T, 6); \
DECLARE_SYCL_SPEC(T, 7);
-TF_CALL_GPU_NUMBER_TYPES(DECLARE_FOR_N);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(DECLARE_FOR_N);
DECLARE_FOR_N(int32);
+DECLARE_FOR_N(bool);
#undef DECLARE_FOR_N
#undef DECLARE_SYCL_SPEC
@@ -344,11 +345,8 @@ DECLARE_FOR_N(int32);
.TypeConstraint<int32>("Index"), \
SliceOp<SYCLDevice, type>)
-TF_CALL_GPU_NUMBER_TYPES(REGISTER_SYCL);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
-// A special GPU kernel for int32.
-// TODO(b/25387198): Also enable int32 in device memory. This kernel
-// registration requires all int32 inputs and outputs to be in host memory.
REGISTER_KERNEL_BUILDER(Name("Slice")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
@@ -358,7 +356,6 @@ REGISTER_KERNEL_BUILDER(Name("Slice")
.HostMemory("size")
.HostMemory("output"),
SliceOp<CPUDevice, int32>);
-
#undef REGISTER_SYCL
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/smooth-hinge-loss.h b/tensorflow/core/kernels/smooth-hinge-loss.h
index 45da0fb117..5074ad0795 100644
--- a/tensorflow/core/kernels/smooth-hinge-loss.h
+++ b/tensorflow/core/kernels/smooth-hinge-loss.h
@@ -35,7 +35,7 @@ class SmoothHingeLossUpdater : public DualLossUpdater {
const double current_dual, const double wx,
const double weighted_example_norm) const final {
// Intutitvely there are 3 cases:
- // a. new optimal value of the dual variable falls withing the admissible
+ // a. new optimal value of the dual variable falls within the admissible
// range [0, 1]. In this case we set new dual to this value.
// b. new optimal value is < 0. Then, because of convexity, the optimal
// valid value for new dual = 0
diff --git a/tensorflow/core/kernels/softmax_op.cc b/tensorflow/core/kernels/softmax_op.cc
index de11de32f1..8345a98a0d 100644
--- a/tensorflow/core/kernels/softmax_op.cc
+++ b/tensorflow/core/kernels/softmax_op.cc
@@ -90,6 +90,8 @@ REGISTER_KERNEL_BUILDER(
REGISTER_KERNEL_BUILDER(
Name("Softmax").Device(DEVICE_SYCL).TypeConstraint<float>("T"),
SoftmaxOp<SYCLDevice, float>);
+REGISTER_KERNEL_BUILDER(
+ Name("Softmax").Device(DEVICE_SYCL).TypeConstraint<double>("T"),
+ SoftmaxOp<SYCLDevice, double>);
#endif // TENSORFLOW_USE_SYCL
-
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/sparse_matmul_op.h b/tensorflow/core/kernels/sparse_matmul_op.h
index 61bd6593c3..098b2d6500 100644
--- a/tensorflow/core/kernels/sparse_matmul_op.h
+++ b/tensorflow/core/kernels/sparse_matmul_op.h
@@ -31,11 +31,11 @@ namespace internal {
// in the lower 16-bits of input
template <typename Packet>
EIGEN_DEVICE_FUNC inline Packet pexpand_bf16_l(const Packet& from) {
- tensorflow::uint32 tmp;
+ tensorflow::uint32 tmp;
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
- tmp = (reinterpret_cast<const tensorflow::uint32&>(from) ) & 0xffff0000;
-#else
- tmp = (reinterpret_cast<const tensorflow::uint32&>(from) << 16) & 0xffff0000;
+ tmp = (reinterpret_cast<const tensorflow::uint32&>(from)) & 0xffff0000;
+#else
+ tmp = (reinterpret_cast<const tensorflow::uint32&>(from) << 16) & 0xffff0000;
#endif
return reinterpret_cast<const float&>(tmp);
}
@@ -44,12 +44,12 @@ EIGEN_DEVICE_FUNC inline Packet pexpand_bf16_l(const Packet& from) {
// in the upper 16-bits of input
template <typename Packet>
EIGEN_DEVICE_FUNC inline Packet pexpand_bf16_u(const Packet& from) {
- tensorflow::uint32 tmp;
+ tensorflow::uint32 tmp;
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
- tmp = (reinterpret_cast<const tensorflow::uint32&>(from) << 16 ) & 0xffff0000;
+ tmp = (reinterpret_cast<const tensorflow::uint32&>(from) << 16) & 0xffff0000;
#else
- tmp = (reinterpret_cast<const tensorflow::uint32&>(from)) & 0xffff0000;
-#endif
+ tmp = (reinterpret_cast<const tensorflow::uint32&>(from)) & 0xffff0000;
+#endif
return reinterpret_cast<const float&>(tmp);
}
@@ -61,12 +61,12 @@ EIGEN_DEVICE_FUNC inline Packet4f pexpand_bf16_l(const Packet4f& from) {
float r[4];
tensorflow::uint32 p[4];
pstoreu(r, from);
- tensorflow::uint32 * ir = reinterpret_cast<tensorflow::uint32 *>(r);
+ tensorflow::uint32* ir = reinterpret_cast<tensorflow::uint32*>(r);
p[0] = (ir[0] << 16) & 0xffff0000;
- p[1] = ir[0]& 0xffff0000;
+ p[1] = ir[0] & 0xffff0000;
p[2] = (ir[1] << 16) & 0xffff0000;
p[3] = ir[1] & 0xffff0000;
- return ploadu<Packet4f>(reinterpret_cast<float *>(p));
+ return ploadu<Packet4f>(reinterpret_cast<float*>(p));
}
template <typename Packet>
@@ -74,12 +74,12 @@ EIGEN_DEVICE_FUNC inline Packet4f pexpand_bf16_u(const Packet4f& from) {
float r[4];
tensorflow::uint32 p[4];
pstoreu(r, from);
- tensorflow::uint32 * ir = reinterpret_cast<tensorflow::uint32 *>(r);
+ tensorflow::uint32* ir = reinterpret_cast<tensorflow::uint32*>(r);
p[0] = (ir[2] << 16) & 0xffff0000;
p[1] = ir[2] & 0xffff0000;
p[2] = (ir[3] << 16) & 0xffff0000;
p[3] = ir[3] & 0xffff0000;
- return ploadu<Packet4f>(reinterpret_cast<float *>(p));
+ return ploadu<Packet4f>(reinterpret_cast<float*>(p));
}
#endif
@@ -131,23 +131,25 @@ EIGEN_DEVICE_FUNC inline Packet pload2bf16(
template <>
EIGEN_STRONG_INLINE Packet4f pload4bf16<Packet4f>(const float* from) {
tensorflow::uint32 p[4];
- const tensorflow::uint32* ir = reinterpret_cast<const tensorflow::uint32 *>(from);
+ const tensorflow::uint32* ir =
+ reinterpret_cast<const tensorflow::uint32*>(from);
p[0] = (ir[0] << 16) & 0xffff0000;
- p[1] = ir[0]& 0xffff0000;
+ p[1] = ir[0] & 0xffff0000;
p[2] = (ir[1] << 16) & 0xffff0000;
p[3] = ir[1] & 0xffff0000;
- return ploadu<Packet4f>(reinterpret_cast<float *>(p));
+ return ploadu<Packet4f>(reinterpret_cast<float*>(p));
}
template <>
EIGEN_STRONG_INLINE Packet4f pload2bf16<Packet4f>(const float* from) {
tensorflow::uint32 p[4];
- const tensorflow::uint32* ir = reinterpret_cast<const tensorflow::uint32 *>(from);
+ const tensorflow::uint32* ir =
+ reinterpret_cast<const tensorflow::uint32*>(from);
p[0] = (ir[0] << 16) & 0xffff0000;
- p[1] = ir[0]& 0xffff0000;
+ p[1] = ir[0] & 0xffff0000;
p[2] = (ir[0] << 16) & 0xffff0000;
p[3] = ir[0] & 0xffff0000;
- return ploadu<Packet4f>(reinterpret_cast<float *>(p));
+ return ploadu<Packet4f>(reinterpret_cast<float*>(p));
}
#endif
@@ -255,12 +257,13 @@ EIGEN_STRONG_INLINE Packet8d pbroadcast_second<Packet8d>(const Packet8d& a_in) {
}
template <>
EIGEN_STRONG_INLINE Packet8d pbroadcast_third<Packet8d>(const Packet8d& a_in) {
- Packet2d a = _mm512_extractf32x4_ps(a_in, 1);
+ Packet2d a = _mm256_extractf128_pd(_mm512_castpd512_pd256(a_in), 1);
return _mm512_broadcastsd_pd(a);
}
template <>
EIGEN_STRONG_INLINE Packet8d pbroadcast_fourth<Packet8d>(const Packet8d& a_in) {
- Packet2d a = _mm_permute_pd(_mm512_extractf32x4_ps(a_in, 1), 3);
+ Packet2d a =
+ _mm_permute_pd(_mm256_extractf128_pd(_mm512_castpd512_pd256(a_in), 1), 3);
return _mm512_broadcastsd_pd(a);
}
template <>
@@ -417,14 +420,17 @@ EIGEN_STRONG_INLINE Packet8f pbroadcast_fourth<Packet8f>(const Packet8f& a) {
template <typename Packet>
EIGEN_DEVICE_FUNC inline Packet16f pexpand_bf16_l(const Packet16f& from) {
- return _mm512_slli_epi32(_mm512_cvtepu16_epi32(_mm512_castsi512_si256(from)),
- 16);
+ return _mm512_castsi512_ps(_mm512_slli_epi32(
+ _mm512_cvtepu16_epi32(_mm512_castsi512_si256(_mm512_castps_si512(from))),
+ 16));
}
template <typename Packet>
EIGEN_DEVICE_FUNC inline Packet16f pexpand_bf16_u(const Packet16f& from) {
- return _mm512_slli_epi32(
- _mm512_cvtepu16_epi32(_mm512_extractf64x4_pd(from, 1)), 16);
+ Packet16i tmp = _mm512_castps_si512(from);
+ Packet16i tmp2 = _mm512_alignr_epi32(tmp, tmp, 8);
+ return _mm512_castsi512_ps(_mm512_slli_epi32(
+ _mm512_cvtepu16_epi32(_mm512_castsi512_si256(tmp2)), 16));
}
#endif
diff --git a/tensorflow/core/kernels/sparse_tensor_dense_add_op.h b/tensorflow/core/kernels/sparse_tensor_dense_add_op.h
index b06dcf143e..353cf0e519 100644
--- a/tensorflow/core/kernels/sparse_tensor_dense_add_op.h
+++ b/tensorflow/core/kernels/sparse_tensor_dense_add_op.h
@@ -24,7 +24,7 @@ limitations under the License.
namespace tensorflow {
namespace functor {
-// TOOD(zongheng): this should be a general functor that powers SparseAdd and
+// TODO(zongheng): this should be a general functor that powers SparseAdd and
// ScatterNd ops. It should be moved to its own head file, once the other ops
// are implemented.
template <typename Device, typename T, typename Index, int NDIMS,
diff --git a/tensorflow/core/kernels/sparse_tensor_slice_dataset_op.cc b/tensorflow/core/kernels/sparse_tensor_slice_dataset_op.cc
index 98514f1b07..70cab66d64 100644
--- a/tensorflow/core/kernels/sparse_tensor_slice_dataset_op.cc
+++ b/tensorflow/core/kernels/sparse_tensor_slice_dataset_op.cc
@@ -25,7 +25,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
template <typename T>
diff --git a/tensorflow/core/kernels/split_lib_cpu.cc b/tensorflow/core/kernels/split_lib_cpu.cc
index e377e4d97a..6583f96a91 100644
--- a/tensorflow/core/kernels/split_lib_cpu.cc
+++ b/tensorflow/core/kernels/split_lib_cpu.cc
@@ -50,16 +50,12 @@ void Split<Eigen::SyclDevice, T>::operator()(
typename TTypes<T, 3>::ConstTensor input,
const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes) {
- if (output.size() < 131072) {
- output = input.slice(slice_indices, slice_sizes);
- } else {
output.device(d) = input.slice(slice_indices, slice_sizes);
- }
}
#define DEFINE_SYCL_KERNELS(T) template struct Split<Eigen::SyclDevice, T>;
-TF_CALL_GPU_NUMBER_TYPES(DEFINE_SYCL_KERNELS)
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(DEFINE_SYCL_KERNELS);
#endif // TENSORFLOW_USE_SYCL
} // namespace functor
diff --git a/tensorflow/core/kernels/split_op.cc b/tensorflow/core/kernels/split_op.cc
index 5051e736f1..c4f312f3f6 100644
--- a/tensorflow/core/kernels/split_op.cc
+++ b/tensorflow/core/kernels/split_op.cc
@@ -253,7 +253,6 @@ class SplitOpGPU : public SplitOpBase<GPUDevice, T> {
#endif // GOOGLE_CUDA
#ifdef TENSORFLOW_USE_SYCL
-
template <typename T>
class SplitOpSYCL : public SplitOpBase<SYCLDevice, T> {
public:
@@ -320,8 +319,7 @@ class SplitOpSYCL : public SplitOpBase<SYCLDevice, T> {
}
}
};
-
-#endif // TENSORFLOW_USE_SYCL
+#endif // TENSORFLOW_USE_SYCL
#define REGISTER_SPLIT(type) \
REGISTER_KERNEL_BUILDER(Name("Split") \
@@ -359,7 +357,7 @@ TF_CALL_complex128(REGISTER_GPU);
.HostMemory("split_dim"), \
SplitOpSYCL<type>)
-TF_CALL_GPU_NUMBER_TYPES(REGISTER_SYCL);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
#undef REGISTER_SYCL
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/stage_op.cc b/tensorflow/core/kernels/stage_op.cc
index 387c2471ce..49352ff4d1 100644
--- a/tensorflow/core/kernels/stage_op.cc
+++ b/tensorflow/core/kernels/stage_op.cc
@@ -1,4 +1,4 @@
-/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -14,6 +14,8 @@ limitations under the License.
==============================================================================*/
#include <deque>
+#include <mutex>
+#include <numeric>
#include <vector>
#include "tensorflow/core/framework/op_kernel.h"
@@ -30,50 +32,181 @@ namespace {
class Buffer : public ResourceBase {
public:
- explicit Buffer() {}
-
+ // public types
typedef std::vector<Tensor> Tuple;
+ private:
+ // private variables
+ std::size_t capacity_;
+ std::size_t memory_limit_;
+ std::size_t current_bytes_;
+ std::mutex mu_;
+ std::condition_variable non_empty_cond_var_;
+ std::condition_variable full_cond_var_;
+ std::deque<Tuple> buf_;
+
+
+ private:
+ // private methods
+
+ // If the buffer is configured for bounded capacity, notify
+ // waiting inserters that space is now available
+ void notify_inserters_if_bounded(std::unique_lock<std::mutex> & l)
+ {
+ if(IsBounded())
+ {
+ l.unlock();
+ full_cond_var_.notify_one();
+ }
+ }
+
+ // Are there a limit number of elements or a memory limit
+ // configued on this buffer?
+ bool IsBounded() {
+ return capacity_ > 0 || memory_limit_ > 0;
+ }
+
+ bool IsCapacityFull() {
+ return buf_.size() >= capacity_;
+ }
+
+ bool WouldExceedMemoryLimit(std::size_t bytes) {
+ return bytes + current_bytes_ > memory_limit_;
+ }
+
+ std::size_t GetTupleBytes(const Tuple & tuple)
+ {
+ return std::accumulate(tuple.begin(), tuple.end(), 0,
+ [](const std::size_t & lhs, const Tensor & rhs) {
+ return lhs + rhs.TotalBytes();
+ });
+ }
+
+ public:
+ // public methods
+ explicit Buffer(std::size_t capacity, std::size_t memory_limit) :
+ capacity_(capacity),
+ memory_limit_(memory_limit),
+ current_bytes_(0) {}
+
// the Buffer takes ownership of the Tuple
- void Put(Tuple* tuple) {
- mutex_lock l(mu_);
+ Status Put(Tuple* tuple) {
+ std::unique_lock<std::mutex> l(mu_);
+
+ std::size_t tuple_bytes = GetTupleBytes(*tuple);
+
+ // Sanity check so that we don't block for ever below
+ if(memory_limit_ > 0 && tuple_bytes > memory_limit_) {
+ return Status(errors::ResourceExhausted("Attempted to insert "
+ "tensors with combined size of '", tuple_bytes, "' bytes into "
+ "Staging Area with a memory limit of '", memory_limit_, "'."));
+ }
+
+
+ // If buffer capacity is bounded wait until elements have been removed
+ if(IsBounded()) {
+ full_cond_var_.wait(l, [tuple_bytes, this]() {
+ // If there's a memory limit, check if there's space for insertion
+ bool memory_limit_valid = memory_limit_ > 0 ?
+ !WouldExceedMemoryLimit(tuple_bytes) : true;
+ // If we're configured for capacity check if there's space for insertion
+ bool capacity_valid = capacity_ > 0 ? !IsCapacityFull() : true;
+
+ // Stop waiting upon success for both conditions
+ return capacity_valid && memory_limit_valid;
+ });
+ }
+
+ // Update bytes in the Staging Area
+ current_bytes_ += tuple_bytes;
+
+ // Store tuple
buf_.push_back(std::move(*tuple));
- non_empty_cond_var_.notify_one(); // maybe possible to optimize by reducing
- // how often this signal is sent
+
+ l.unlock();
+ // maybe possible to optimize by reducing
+ // how often this signal is sent
+ non_empty_cond_var_.notify_one();
+
+ return Status::OK();
}
+ // Get tuple at front of the buffer
void Get(Tuple* tuple) { // TODO(zhifengc): Support cancellation.
- mutex_lock l(mu_);
- while (buf_.empty()) {
- non_empty_cond_var_.wait(l);
- }
+ std::unique_lock<std::mutex> l(mu_);
+
+ // Wait for data if the buffer is empty
+ non_empty_cond_var_.wait(l, [this]() {
+ return !buf_.empty();
+ });
+ // Move data into the output tuple
*tuple = std::move(buf_.front());
buf_.pop_front();
+
+ // Update bytes in the Staging Area
+ current_bytes_ -= GetTupleBytes(*tuple);
+
+ notify_inserters_if_bounded(l);
+ }
+
+ // Return tuple at index
+ Status Peek(std::size_t index, Tuple* tuple) {
+ std::unique_lock<std::mutex> l(mu_);
+
+ // Wait if the requested index is not available
+ non_empty_cond_var_.wait(l, [index, this]() {
+ return index < this->buf_.size();
+ });
+
+ // Place tensors in the output tuple
+ for(const auto & tensor: buf_[index]) {
+ tuple->push_back(tensor);
+ }
+
+ return Status::OK();
+ }
+
+ // Buffer size
+ size_t Size() {
+ std::unique_lock<std::mutex> l(mu_);
+ return buf_.size();
+ }
+
+ void Clear() {
+ std::unique_lock<std::mutex> l(mu_);
+ buf_.clear();
+ current_bytes_ = 0;
+
+ notify_inserters_if_bounded(l);
}
string DebugString() override {
- mutex_lock l(mu_);
+ std::unique_lock<std::mutex> l(mu_);
return strings::StrCat("Staging size: ", buf_.size());
}
- private:
- mutex mu_;
- condition_variable non_empty_cond_var_;
- std::deque<Tuple> buf_ GUARDED_BY(mu_);
};
-Status CreateBuffer(Buffer** ret) {
- *ret = new Buffer;
- return Status::OK();
-}
-
Status GetBuffer(OpKernelContext* ctx, const NodeDef& ndef, Buffer** buf) {
auto rm = ctx->resource_manager();
ContainerInfo cinfo;
+
+ // Lambda for creating the Staging Area
+ auto create_fn = [&ndef](Buffer** ret) -> Status
+ {
+ int64 capacity;
+ int64 memory_limit;
+ TF_RETURN_IF_ERROR(GetNodeAttr(ndef, "capacity", &capacity));
+ TF_RETURN_IF_ERROR(GetNodeAttr(ndef, "memory_limit", &memory_limit));
+ *ret = new Buffer(capacity, memory_limit);
+ return Status::OK();
+ };
+
+
TF_RETURN_IF_ERROR(cinfo.Init(rm, ndef, true /* use name() */));
TF_RETURN_IF_ERROR(rm->LookupOrCreate<Buffer>(cinfo.container(), cinfo.name(),
- buf, CreateBuffer));
+ buf, create_fn));
return Status::OK();
}
@@ -92,7 +225,7 @@ class StageOp : public OpKernel {
for (int i = 0; i < ctx->num_inputs(); ++i) {
tuple.push_back(ctx->input(i));
}
- buf->Put(&tuple);
+ OP_REQUIRES_OK(ctx, buf->Put(&tuple));
}
};
@@ -115,11 +248,13 @@ class UnstageOp : public OpKernel {
OP_REQUIRES_OK(ctx, GetBuffer(ctx, def(), &buf));
core::ScopedUnref scope(buf);
Buffer::Tuple tuple;
+
buf->Get(&tuple);
- OP_REQUIRES(
- ctx, tuple.size() == (size_t)ctx->num_outputs(),
+
+ OP_REQUIRES(ctx, tuple.size() == (size_t)ctx->num_outputs(),
errors::InvalidArgument("Mismatch stage/unstage: ", tuple.size(),
" vs. ", ctx->num_outputs()));
+
for (size_t i = 0; i < tuple.size(); ++i) {
ctx->set_output(i, tuple[i]);
}
@@ -134,4 +269,97 @@ REGISTER_KERNEL_BUILDER(Name("Unstage").Device(DEVICE_GPU), UnstageOp);
REGISTER_KERNEL_BUILDER(Name("Unstage").Device(DEVICE_SYCL), UnstageOp);
#endif // TENSORFLOW_USE_SYCL
+class StagePeekOp : public OpKernel {
+ public:
+ explicit StagePeekOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ // Using this op in such a way that it blocks forever
+ // is an error. As such cancellation is not handled.
+ void Compute(OpKernelContext* ctx) override {
+ Buffer* buf = nullptr;
+ OP_REQUIRES_OK(ctx, GetBuffer(ctx, def(), &buf));
+ core::ScopedUnref scope(buf);
+ Buffer::Tuple tuple;
+
+ std::size_t index = ctx->input(0).scalar<int>()();
+
+ OP_REQUIRES_OK(ctx, buf->Peek(index, &tuple));
+
+ OP_REQUIRES(ctx, tuple.size() == (size_t)ctx->num_outputs(),
+ errors::InvalidArgument("Mismatch stage/unstage: ", tuple.size(),
+ " vs. ", ctx->num_outputs()));
+
+ for (size_t i = 0; i < tuple.size(); ++i) {
+ ctx->set_output(i, tuple[i]);
+ }
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("StagePeek").Device(DEVICE_CPU),
+ StagePeekOp);
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("StagePeek").HostMemory("index").
+ Device(DEVICE_GPU), StagePeekOp);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("StagePeek").HostMemory("index")
+ .Device(DEVICE_SYCL), StagePeekOp);
+#endif // TENSORFLOW_USE_SYCL
+
+
+class StageSizeOp : public OpKernel {
+ public:
+ explicit StageSizeOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ // Using this op in such a way that it blocks forever
+ // is an error. As such cancellation is not handled.
+ void Compute(OpKernelContext* ctx) override {
+ Buffer* buf = nullptr;
+ OP_REQUIRES_OK(ctx, GetBuffer(ctx, def(), &buf));
+ core::ScopedUnref scope(buf);
+
+ // Allocate size output tensor
+ Tensor * size = nullptr;
+ OP_REQUIRES_OK(ctx, ctx->allocate_output(0, TensorShape({}),
+ &size));
+
+ // Set it to the actual size
+ size->scalar<int32>().setConstant(buf->Size());
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("StageSize").Device(DEVICE_CPU), StageSizeOp);
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("StageSize").HostMemory("size")
+ .Device(DEVICE_GPU), StageSizeOp);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("StageSize").HostMemory("size")
+ .Device(DEVICE_SYCL), StageSizeOp);
+#endif // TENSORFLOW_USE_SYCL
+
+class StageClearOp : public OpKernel {
+ public:
+ explicit StageClearOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
+
+ // Using this op in such a way that it blocks forever
+ // is an error. As such cancellation is not handled.
+ void Compute(OpKernelContext* ctx) override {
+ Buffer* buf = nullptr;
+ OP_REQUIRES_OK(ctx, GetBuffer(ctx, def(), &buf));
+ core::ScopedUnref scope(buf);
+
+ buf->Clear();
+ }
+};
+
+REGISTER_KERNEL_BUILDER(Name("StageClear").Device(DEVICE_CPU), StageClearOp);
+#if GOOGLE_CUDA
+REGISTER_KERNEL_BUILDER(Name("StageClear").Device(DEVICE_GPU), StageClearOp);
+#endif
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("StageClear").Device(DEVICE_SYCL), StageClearOp);
+#endif // TENSORFLOW_USE_SYCL
+
+
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/strided_slice_op.cc b/tensorflow/core/kernels/strided_slice_op.cc
index 43a706bc23..47eb85999e 100644
--- a/tensorflow/core/kernels/strided_slice_op.cc
+++ b/tensorflow/core/kernels/strided_slice_op.cc
@@ -525,12 +525,8 @@ REGISTER_KERNEL_BUILDER(Name("ResourceStridedSliceAssign")
.TypeConstraint<int32>("Index"), \
StridedSliceAssignOp<SYCLDevice, type>)
-REGISTER_SYCL(float);
-REGISTER_SYCL(double);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
-// A special GPU kernel for int32.
-// TODO(b/25387198): Also enable int32 in device memory. This kernel
-// registration requires all int32 inputs and outputs to be in host memory.
REGISTER_KERNEL_BUILDER(Name("StridedSlice")
.Device(DEVICE_SYCL)
.TypeConstraint<int32>("T")
diff --git a/tensorflow/core/kernels/strided_slice_op_impl.h b/tensorflow/core/kernels/strided_slice_op_impl.h
index df7490486e..d0ccd5c652 100644
--- a/tensorflow/core/kernels/strided_slice_op_impl.h
+++ b/tensorflow/core/kernels/strided_slice_op_impl.h
@@ -297,7 +297,7 @@ DECLARE_FOR_N_CPU(bfloat16);
INSTANTIATE(SYCLDevice, T, STRIDED_SLICE_INSTANTIATE_DIM)
TF_CALL_SYCL_PROXY_TYPES(PREVENT_FOR_N_SYCL);
-TF_CALL_GPU_NUMBER_TYPES(DECLARE_FOR_N_SYCL);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(DECLARE_FOR_N_SYCL);
DECLARE_FOR_N_SYCL(int32);
#undef DECLARE_FOR_N_SYCL
diff --git a/tensorflow/core/kernels/take_dataset_op.cc b/tensorflow/core/kernels/take_dataset_op.cc
index 7e2eb6ae49..e27a36bc9b 100644
--- a/tensorflow/core/kernels/take_dataset_op.cc
+++ b/tensorflow/core/kernels/take_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class TakeDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/tensor_dataset_op.cc b/tensorflow/core/kernels/tensor_dataset_op.cc
index ee0e00ee59..6b6fcb1978 100644
--- a/tensorflow/core/kernels/tensor_dataset_op.cc
+++ b/tensorflow/core/kernels/tensor_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class TensorDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/tensor_slice_dataset_op.cc b/tensorflow/core/kernels/tensor_slice_dataset_op.cc
index 982ea44659..fc70d2ecc5 100644
--- a/tensorflow/core/kernels/tensor_slice_dataset_op.cc
+++ b/tensorflow/core/kernels/tensor_slice_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class TensorSliceDatasetOp : public OpKernel {
diff --git a/tensorflow/core/kernels/tile_ops.cc b/tensorflow/core/kernels/tile_ops.cc
index 06f20cd9ec..7c72487d3f 100644
--- a/tensorflow/core/kernels/tile_ops.cc
+++ b/tensorflow/core/kernels/tile_ops.cc
@@ -265,7 +265,9 @@ TF_CALL_complex128(HANDLE_TYPE_NAME_GPU);
#ifdef TENSORFLOW_USE_SYCL
TF_CALL_float(HANDLE_TYPE_NAME_SYCL);
TF_CALL_double(HANDLE_TYPE_NAME_SYCL);
+TF_CALL_int16(HANDLE_TYPE_NAME_SYCL);
TF_CALL_int32(HANDLE_TYPE_NAME_SYCL);
+TF_CALL_int64(HANDLE_TYPE_NAME_SYCL);
#endif // TENSORFLOW_USE_SYCL
#undef HANDLE_TYPE_NAME_CPU
@@ -522,7 +524,9 @@ TF_CALL_complex128(HANDLE_TYPE_NAME_GPU);
TF_CALL_float(HANDLE_TYPE_NAME_SYCL);
TF_CALL_double(HANDLE_TYPE_NAME_SYCL);
+TF_CALL_int16(HANDLE_TYPE_NAME_SYCL);
TF_CALL_int32(HANDLE_TYPE_NAME_SYCL);
+TF_CALL_int64(HANDLE_TYPE_NAME_SYCL);
#undef HANDLE_TYPE_NAME_SYCL
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/tile_ops_cpu_impl.h b/tensorflow/core/kernels/tile_ops_cpu_impl.h
index f06cc5514c..db3f046439 100644
--- a/tensorflow/core/kernels/tile_ops_cpu_impl.h
+++ b/tensorflow/core/kernels/tile_ops_cpu_impl.h
@@ -69,9 +69,13 @@ typedef Eigen::SyclDevice SYCLDevice;
#define DEFINE_DIM(T, NDIM) template struct Tile<SYCLDevice, T, NDIM>;
#define DEFINE_TYPE(T) DEFINE_DIM(T, CPU_PROVIDED_IXDIM)
+TF_CALL_bool(DEFINE_TYPE);
TF_CALL_float(DEFINE_TYPE);
TF_CALL_double(DEFINE_TYPE);
+TF_CALL_uint8(DEFINE_TYPE);
TF_CALL_int32(DEFINE_TYPE);
+TF_CALL_int16(DEFINE_TYPE);
+TF_CALL_int64(DEFINE_TYPE);
#undef DEFINE_DIM
#undef DEFINE_TYPE
@@ -82,9 +86,13 @@ TF_CALL_int32(DEFINE_TYPE);
template struct ReduceAndReshape<SYCLDevice, T, NDIM, 1>;
#define DEFINE_TYPE(T) DEFINE_DIM(T, CPU_PROVIDED_IXDIM)
+TF_CALL_bool(DEFINE_TYPE);
TF_CALL_float(DEFINE_TYPE);
TF_CALL_double(DEFINE_TYPE);
+TF_CALL_uint8(DEFINE_TYPE);
+TF_CALL_int16(DEFINE_TYPE);
TF_CALL_int32(DEFINE_TYPE);
+TF_CALL_int64(DEFINE_TYPE);
#undef DEFINE_DIM
#undef DEFINE_TYPE
diff --git a/tensorflow/core/kernels/training_ops.cc b/tensorflow/core/kernels/training_ops.cc
index d331a8debf..f6b6194f0a 100644
--- a/tensorflow/core/kernels/training_ops.cc
+++ b/tensorflow/core/kernels/training_ops.cc
@@ -23,6 +23,10 @@ limitations under the License.
#include "tensorflow/core/kernels/training_op_helpers.h"
#include "tensorflow/core/kernels/variable_ops.h"
+#ifdef TENSORFLOW_USE_SYCL
+#include "tensorflow/core/common_runtime/sycl/sycl_util.h"
+#endif // TENSORFLOW_USE_SYCL
+
namespace tensorflow {
using CPUDevice = Eigen::ThreadPoolDevice;
@@ -50,16 +54,27 @@ struct ApplyGradientDescent<CPUDevice, T> {
#ifdef TENSORFLOW_USE_SYCL
template <typename T>
-struct ApplyGradientDescent<SYCLDevice, T> {
+struct ApplyGradientDescentSYCL {
void operator()(const SYCLDevice& d, typename TTypes<T>::Flat var,
- typename TTypes<T>::ConstScalar lr,
- typename TTypes<T>::ConstFlat grad) {
- var.device(d) -= grad * lr();
+ T lr, typename TTypes<T>::ConstFlat grad) {
+ var.device(d) -= grad * lr;
}
};
#endif
template <typename T>
+struct ApplyDelayCompensatedGradientDescent<CPUDevice, T> {
+ void operator()(const CPUDevice& d, typename TTypes<T>::Flat var,
+ typename TTypes<T>::ConstScalar lr,
+ typename TTypes<T>::ConstFlat grad,
+ typename TTypes<T>::ConstScalar variance,
+ typename TTypes<T>::Flat shadow) {
+ var.device(d) -= lr() * (grad + variance() * grad * (var - shadow));
+ shadow.device(d) = var;
+ }
+};
+
+template <typename T>
struct ApplyAdadelta<CPUDevice, T> {
void operator()(const CPUDevice& d, typename TTypes<T>::Flat var,
typename TTypes<T>::Flat accum,
@@ -264,10 +279,24 @@ struct ApplyAdamNonCuda {
}
};
+#ifdef TENSORFLOW_USE_SYCL
template <typename T>
-struct ApplyAdam<CPUDevice, T> : ApplyAdamNonCuda<CPUDevice, T> {};
+struct ApplyAdamSYCL {
+ void operator()(const SYCLDevice& d, typename TTypes<T>::Flat var,
+ typename TTypes<T>::Flat m, typename TTypes<T>::Flat v,
+ T beta1_power, T beta2_power, T lr, T beta1, T beta2, T epsilon,
+ typename TTypes<T>::ConstFlat grad) {
+ const T alpha = lr * Eigen::numext::sqrt(T(1) - beta2_power) /
+ (T(1) - beta1_power);
+ m.device(d) += (grad - m) * (T(1) - beta1);
+ v.device(d) += (grad.square() - v) * (T(1) - beta2);
+ var.device(d) -= (m * alpha) / (v.sqrt() + epsilon);
+ }
+};
+#endif // TENSORFLOW_USE_SYCL
+
template <typename T>
-struct ApplyAdam<SYCLDevice, T> : ApplyAdamNonCuda<SYCLDevice, T> {};
+struct ApplyAdam<CPUDevice, T> : ApplyAdamNonCuda<CPUDevice, T> {};
template <typename T>
struct ApplyRMSProp<CPUDevice, T> {
@@ -346,6 +375,51 @@ class ApplyGradientDescentOp : public OpKernel {
bool use_exclusive_lock_;
};
+#ifdef TENSORFLOW_USE_SYCL
+template <typename T>
+class ApplyGradientDescentOp < SYCLDevice, T > : public OpKernel {
+ public:
+ explicit ApplyGradientDescentOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
+ OP_REQUIRES_OK(ctx, ctx->GetAttr("use_locking", &use_exclusive_lock_));
+ }
+
+ void Compute(OpKernelContext* ctx) override {
+ auto locks = MaybeLockVariableInputMutexesInOrder(ctx, use_exclusive_lock_, {0});
+ Tensor var;
+ OP_REQUIRES_OK(ctx, GetInputTensorFromVariable(ctx, 0, use_exclusive_lock_, &var));
+
+ OP_REQUIRES(
+ ctx, var.IsInitialized(),
+ errors::FailedPrecondition(
+ "Attempting to use uninitialized variables: ", def().input(0)));
+ const Tensor& alpha_dev = ctx->input(1);
+ OP_REQUIRES(ctx, IsLegacyScalar(alpha_dev.shape()),
+ errors::InvalidArgument("alpha is not a scalar: ",
+ alpha_dev.shape().DebugString()));
+ const Tensor& delta = ctx->input(2);
+ OP_REQUIRES(
+ ctx, var.shape().IsSameSize(delta.shape()),
+ errors::InvalidArgument("var and delta do not have the same shape",
+ var.shape().DebugString(), " ",
+ delta.shape().DebugString()));
+
+ auto device = ctx->eigen_sycl_device();
+ auto size = sizeof(T);
+ T alpha = T(0);
+ auto src_ptr = GetBase(&alpha_dev);
+ device.memcpyDeviceToHost(&alpha, static_cast<const T *>(src_ptr), size);
+
+ functor::ApplyGradientDescentSYCL<T>()(device, var.flat<T>(),
+ alpha, delta.flat<T>());
+
+ MaybeForwardRefInputToRefOutput(ctx, 0, 0);
+ }
+
+ private:
+ bool use_exclusive_lock_;
+};
+#endif // TENSORFLOW_USE_SYCL
+
#define REGISTER_KERNELS(D, T) \
REGISTER_KERNEL_BUILDER( \
Name("ApplyGradientDescent").Device(DEVICE_##D).TypeConstraint<T>("T"), \
@@ -361,13 +435,6 @@ TF_CALL_half(REGISTER_CPU_KERNELS);
TF_CALL_float(REGISTER_CPU_KERNELS);
TF_CALL_double(REGISTER_CPU_KERNELS);
-#ifdef TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNELS(T) REGISTER_KERNELS(SYCL, T);
-TF_CALL_float(REGISTER_SYCL_KERNELS);
-TF_CALL_double(REGISTER_SYCL_KERNELS);
-#undef REGISTER_SYCL_KERNELS
-#endif
-
#if GOOGLE_CUDA
// Forward declarations of the functor specializations for GPU.
namespace functor {
@@ -388,6 +455,81 @@ REGISTER_KERNELS(GPU, Eigen::half);
REGISTER_KERNELS(GPU, float);
REGISTER_KERNELS(GPU, double);
#endif
+
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNELS(T) REGISTER_KERNELS(SYCL, T);
+TF_CALL_float(REGISTER_SYCL_KERNELS);
+TF_CALL_double(REGISTER_SYCL_KERNELS);
+#undef REGISTER_SYCL_KERNELS
+#endif // TENSORFLOW_USE_SYCL
+
+#undef REGISTER_CPU_KERNELS
+#undef REGISTER_KERNELS
+
+template <typename Device, typename T>
+class ApplyDelayCompensatedGradientDescentOp : public OpKernel {
+ public:
+ explicit ApplyDelayCompensatedGradientDescentOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
+ OP_REQUIRES_OK(ctx, ctx->GetAttr("use_locking", &use_exclusive_lock_));
+ }
+
+ void Compute(OpKernelContext* ctx) override {
+ auto locks = MaybeLockVariableInputMutexesInOrder(ctx, use_exclusive_lock_, {0, 4});
+ Tensor var;
+ OP_REQUIRES_OK(ctx, GetInputTensorFromVariable(ctx, 0, use_exclusive_lock_, &var));
+ OP_REQUIRES(
+ ctx, var.IsInitialized(),
+ errors::FailedPrecondition(
+ "Attempting to use uninitialized variables: ", def().input(0)));
+ const Tensor& alpha = ctx->input(1);
+ OP_REQUIRES(ctx, IsLegacyScalar(alpha.shape()),
+ errors::InvalidArgument("alpha is not a scalar: ",
+ alpha.shape().DebugString()));
+ const Tensor& delta = ctx->input(2);
+ OP_REQUIRES(
+ ctx, var.shape().IsSameSize(delta.shape()),
+ errors::InvalidArgument("var and delta do not have the same shape",
+ var.shape().DebugString(), " ",
+ delta.shape().DebugString()));
+ const Tensor& lambda = ctx->input(3);
+ OP_REQUIRES(ctx, IsLegacyScalar(lambda.shape()),
+ errors::InvalidArgument("lambda is not a scalar: ",
+ lambda.shape().DebugString()));
+ Tensor shadow;
+ OP_REQUIRES_OK(ctx, GetInputTensorFromVariable(ctx, 4, use_exclusive_lock_, &shadow));
+ OP_REQUIRES(
+ ctx, shadow.shape().IsSameSize(var.shape()),
+ errors::InvalidArgument("shadow and var do not have the same shape",
+ shadow.shape().DebugString(), " ",
+ var.shape().DebugString()));
+
+ const Device& device = ctx->template eigen_device<Device>();
+ functor::ApplyDelayCompensatedGradientDescent<Device, T>()(
+ device, var.flat<T>(), alpha.scalar<T>(), delta.flat<T>(),
+ lambda.scalar<T>(), shadow.flat<T>()
+ );
+
+ MaybeForwardRefInputToRefOutput(ctx, 0, 0);
+ }
+
+ private:
+ bool use_exclusive_lock_;
+};
+
+#define REGISTER_KERNELS(D, T) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("ApplyDelayCompensatedGradientDescent") \
+ .Device(DEVICE_##D) \
+ .HostMemory("var") \
+ .HostMemory("shadow") \
+ .TypeConstraint<T>("T"), \
+ ApplyDelayCompensatedGradientDescentOp<D##Device, T>);
+#define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
+
+TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_float(REGISTER_CPU_KERNELS);
+TF_CALL_double(REGISTER_CPU_KERNELS);
+
#undef REGISTER_CPU_KERNELS
#undef REGISTER_KERNELS
@@ -2343,6 +2485,120 @@ class ApplyAdamOp : public OpKernel {
bool use_nesterov_;
};
+#ifdef TENSORFLOW_USE_SYCL
+template <typename T>
+class ApplyAdamOp < SYCLDevice, T> : public OpKernel {
+ public:
+ explicit ApplyAdamOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
+ OP_REQUIRES_OK(ctx, ctx->GetAttr("use_locking", &use_exclusive_lock_));
+ }
+
+ void Compute(OpKernelContext* ctx) override {
+ auto locks = MaybeLockVariableInputMutexesInOrder(ctx, use_exclusive_lock_, {0, 1, 2});
+
+ Tensor var;
+ OP_REQUIRES_OK(ctx, GetInputTensorFromVariable(ctx, 0, use_exclusive_lock_, &var));
+ Tensor m;
+ OP_REQUIRES_OK(ctx, GetInputTensorFromVariable(ctx, 1, use_exclusive_lock_, &m));
+ Tensor v;
+ OP_REQUIRES_OK(ctx, GetInputTensorFromVariable(ctx, 2, use_exclusive_lock_, &v));
+ OP_REQUIRES(
+ ctx, var.IsInitialized(),
+ errors::FailedPrecondition(
+ "Attempting to use uninitialized variables: ", def().input(0)));
+ OP_REQUIRES(
+ ctx, m.IsInitialized(),
+ errors::FailedPrecondition(
+ "Attempting to use uninitialized variables: ", def().input(1)));
+ OP_REQUIRES(
+ ctx, v.IsInitialized(),
+ errors::FailedPrecondition(
+ "Attempting to use uninitialized variables: ", def().input(2)));
+
+ const Tensor& beta1_power_dev = ctx->input(3);
+ const Tensor& beta2_power_dev = ctx->input(4);
+ const Tensor& lr_dev = ctx->input(5);
+ const Tensor& beta1_dev = ctx->input(6);
+ const Tensor& beta2_dev = ctx->input(7);
+ const Tensor& epsilon_dev = ctx->input(8);
+
+ T beta1_power = 0;
+ T beta2_power = 0;
+ T lr = 0;
+ T beta1 = 0;
+ T beta2 = 0;
+ T epsilon = 0;
+
+ auto device = ctx->eigen_sycl_device();
+ auto size = sizeof(T);
+ auto src_ptr = GetBase(&beta1_power_dev);
+ device.memcpyDeviceToHost(&beta1_power, static_cast<const T *>(src_ptr), size);
+
+ src_ptr = GetBase(&beta2_power_dev);
+ device.memcpyDeviceToHost(&beta2_power, static_cast<const T *>(src_ptr), size);
+
+ src_ptr = GetBase(&lr_dev);
+ device.memcpyDeviceToHost(&lr, static_cast<const T *>(src_ptr), size);
+
+ src_ptr = GetBase(&beta1_dev);
+ device.memcpyDeviceToHost(&beta1, static_cast<const T *>(src_ptr), size);
+
+ src_ptr = GetBase(&beta2_dev);
+ device.memcpyDeviceToHost(&beta2, static_cast<const T *>(src_ptr), size);
+
+ src_ptr = GetBase(&epsilon_dev);
+ device.memcpyDeviceToHost(&epsilon, static_cast<const T *>(src_ptr), size);
+
+
+ OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta1_power_dev.shape()),
+ errors::InvalidArgument("beta1_power is not a scalar: ",
+ beta1_power_dev.shape().DebugString()));
+ OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta2_power_dev.shape()),
+ errors::InvalidArgument("beta2_power is not a scalar: ",
+ beta2_power_dev.shape().DebugString()));
+ OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(lr_dev.shape()),
+ errors::InvalidArgument("lr is not a scalar : ",
+ lr_dev.shape().DebugString()));
+ OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta1_dev.shape()),
+ errors::InvalidArgument("beta1 is not a scalar: ",
+ beta1_dev.shape().DebugString()));
+ OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(beta2_dev.shape()),
+ errors::InvalidArgument("beta2 is not a scalar: ",
+ beta2_dev.shape().DebugString()));
+ OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(epsilon_dev.shape()),
+ errors::InvalidArgument("epsilon is not a scalar: ",
+ epsilon_dev.shape().DebugString()));
+
+ const Tensor& grad = ctx->input(9);
+
+ OP_REQUIRES(ctx, var.shape().IsSameSize(m.shape()),
+ errors::InvalidArgument("var and m do not have the same shape",
+ var.shape().DebugString(), " ",
+ m.shape().DebugString()));
+ OP_REQUIRES(ctx, var.shape().IsSameSize(v.shape()),
+ errors::InvalidArgument("var and v do not have the same shape",
+ var.shape().DebugString(), " ",
+ v.shape().DebugString()));
+ OP_REQUIRES(
+ ctx, var.shape().IsSameSize(grad.shape()),
+ errors::InvalidArgument("var and grad do not have the same shape",
+ var.shape().DebugString(), " ",
+ grad.shape().DebugString()));
+
+ functor::ApplyAdamSYCL<T>()(device, var.flat<T>(), m.flat<T>(),
+ v.flat<T>(), beta1_power,
+ beta2_power, lr,
+ beta1, beta2,
+ epsilon, grad.flat<T>());
+
+ MaybeForwardRefInputToRefOutput(ctx, 0, 0);
+ }
+
+ private:
+ bool use_exclusive_lock_;
+};
+#endif // TENSORFLOW_USE_SYCL
+
using CPUDevice = Eigen::ThreadPoolDevice;
using GPUDevice = Eigen::GpuDevice;
diff --git a/tensorflow/core/kernels/training_ops.h b/tensorflow/core/kernels/training_ops.h
index 11c9faa4ec..0a3c5d361e 100644
--- a/tensorflow/core/kernels/training_ops.h
+++ b/tensorflow/core/kernels/training_ops.h
@@ -35,6 +35,15 @@ struct ApplyGradientDescent {
};
template <typename Device, typename T>
+struct ApplyDelayCompensatedGradientDescent {
+ void operator()(const Device& d, typename TTypes<T>::Flat var,
+ typename TTypes<T>::ConstScalar alpha,
+ typename TTypes<T>::ConstFlat delta,
+ typename TTypes<T>::ConstScalar lambda,
+ typename TTypes<T>::Flat shadow);
+};
+
+template <typename Device, typename T>
struct ApplyAdadelta {
void operator()(const Device& d, typename TTypes<T>::Flat var,
typename TTypes<T>::Flat accum,
diff --git a/tensorflow/core/kernels/transpose_functor_cpu.cc b/tensorflow/core/kernels/transpose_functor_cpu.cc
index 97426efab9..248c11976e 100644
--- a/tensorflow/core/kernels/transpose_functor_cpu.cc
+++ b/tensorflow/core/kernels/transpose_functor_cpu.cc
@@ -144,8 +144,42 @@ Status DoTranspose<CPUDevice>(const CPUDevice& d, const Tensor& in,
#ifdef TENSORFLOW_USE_SYCL
typedef Eigen::SyclDevice SYCLDevice;
+template <typename Device, typename T>
+void TransposeSYCL(const Device& d, const Tensor& in,
+ const gtl::ArraySlice<int32> perm, Tensor* out) {
+ switch (in.dims()) {
+ case 1:
+ internal::TransposeUsingEigen<Device, T, 1>(d, in, perm, out);
+ break;
+ case 2:
+ internal::TransposeUsingEigen<Device, T, 2>(d, in, perm, out);
+ break;
+ case 3:
+ internal::TransposeUsingEigen<Device, T, 3>(d, in, perm, out);
+ break;
+ case 4:
+ internal::TransposeUsingEigen<Device, T, 4>(d, in, perm, out);
+ break;
+ case 5:
+ internal::TransposeUsingEigen<Device, T, 5>(d, in, perm, out);
+ break;
+ case 6:
+ internal::TransposeUsingEigen<Device, T, 6>(d, in, perm, out);
+ break;
+ case 7:
+ internal::TransposeUsingEigen<Device, T, 7>(d, in, perm, out);
+ break;
+ case 8:
+ internal::TransposeUsingEigen<Device, T, 8>(d, in, perm, out);
+ break;
+ default:
+ LOG(FATAL) << "Unsupported TransposeUsingEigen for: " << in.dims();
+ break;
+ }
+}
+
template <typename T>
-struct internal::Transpose<SYCLDevice, T> {
+struct Transpose<SYCLDevice, T> {
static void run(const SYCLDevice& d, const Tensor& in,
const gtl::ArraySlice<int32> perm, Tensor* out) {
// Should add a specialized implementation for SYCLDevice here.
@@ -160,10 +194,36 @@ Status DoTranspose<SYCLDevice>(const SYCLDevice& d, const Tensor& in,
CHECK_EQ(in.dims(), perm.size());
CHECK_EQ(in.dtype(), out->dtype());
switch (in.dtype()) {
+ case DT_BOOL:
+ case DT_INT8:
+ case DT_QINT8:
+ case DT_QUINT8:
+ case DT_UINT8:
+ TransposeSYCL<SYCLDevice, uint8>(d, in, perm, out);
+ break;
+
+ case DT_BFLOAT16:
+ case DT_HALF:
+ case DT_INT16:
+ case DT_QINT16:
+ case DT_QUINT16:
+ case DT_UINT16:
+ TransposeSYCL<SYCLDevice, uint16>(d, in, perm, out);
+ break;
case DT_FLOAT:
- case DT_DOUBLE:
case DT_INT32:
- internal::Transpose<SYCLDevice, uint32>::run(d, in, perm, out);
+ case DT_QINT32:
+ TransposeSYCL<SYCLDevice, uint32>(d, in, perm, out);
+ break;
+
+ case DT_COMPLEX64:
+ case DT_DOUBLE:
+ case DT_INT64:
+ TransposeSYCL<SYCLDevice, uint64>(d, in, perm, out);
+ break;
+
+ case DT_COMPLEX128:
+ TransposeSYCL<SYCLDevice, complex128>(d, in, perm, out);
break;
default:
diff --git a/tensorflow/core/kernels/unique_op.cc b/tensorflow/core/kernels/unique_op.cc
index d50e2060ac..b57e13a28c 100644
--- a/tensorflow/core/kernels/unique_op.cc
+++ b/tensorflow/core/kernels/unique_op.cc
@@ -115,4 +115,23 @@ REGISTER_KERNEL_BUILDER(Name("Unique")
.HostMemory("y")
.HostMemory("idx"),
UniqueOp<int64>);
+
+#ifdef TENSORFLOW_USE_SYCL
+REGISTER_KERNEL_BUILDER(Name("Unique")
+ .Device(DEVICE_SYCL)
+ .TypeConstraint<int32>("T")
+ .TypeConstraint<int32>("out_idx")
+ .HostMemory("x")
+ .HostMemory("y")
+ .HostMemory("idx"),
+ UniqueOp<int32>);
+REGISTER_KERNEL_BUILDER(Name("Unique")
+ .Device(DEVICE_SYCL)
+ .TypeConstraint<int64>("T")
+ .TypeConstraint<int32>("out_idx")
+ .HostMemory("x")
+ .HostMemory("y")
+ .HostMemory("idx"),
+ UniqueOp<int64>);
+#endif // TENSORFLOW_USE_SYCL
} // namespace tensorflow
diff --git a/tensorflow/core/kernels/unpack_op.cc b/tensorflow/core/kernels/unpack_op.cc
index e4c79ae17b..c3bebfcbf9 100644
--- a/tensorflow/core/kernels/unpack_op.cc
+++ b/tensorflow/core/kernels/unpack_op.cc
@@ -159,20 +159,15 @@ REGISTER_KERNEL_BUILDER(Name("Unpack")
Name("Unpack").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
UnpackOp<SYCLDevice, type>)
-REGISTER_SYCL(float);
-REGISTER_SYCL(double);
-#undef REGISTER_SYCL
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL);
-// A special SYCL kernel for int32.
-// TODO(b/25387198): Also enable int32 in device memory. This kernel
-// registration requires all int32 inputs and outputs to be in host memory.
REGISTER_KERNEL_BUILDER(Name("Unpack")
.Device(DEVICE_SYCL)
.HostMemory("value")
.HostMemory("output")
.TypeConstraint<int32>("T"),
UnpackOp<CPUDevice, int32>);
-
+#undef REGISTER_SYCL
#endif // TENSORFLOW_USE_SYCL
} // end namespace tensorflow
diff --git a/tensorflow/core/kernels/variable_ops.cc b/tensorflow/core/kernels/variable_ops.cc
index 7a4d9dc650..36b8ff09d7 100644
--- a/tensorflow/core/kernels/variable_ops.cc
+++ b/tensorflow/core/kernels/variable_ops.cc
@@ -32,33 +32,29 @@ REGISTER_KERNEL_BUILDER(Name("DestroyTemporaryVariable").Device(DEVICE_CPU),
REGISTER_KERNEL_BUILDER(Name("IsVariableInitialized").Device(DEVICE_CPU),
IsVariableInitializedOp);
-#if TENSORFLOW_USE_SYCL
-#define REGISTER_SYCL_KERNEL(TYPE) \
- REGISTER_KERNEL_BUILDER( \
- Name("Variable") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<TYPE>("dtype"), \
- VariableOp); \
- REGISTER_KERNEL_BUILDER(Name("VariableV2") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<TYPE>("dtype"), \
- VariableOp); \
- REGISTER_KERNEL_BUILDER(Name("TemporaryVariable") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<TYPE>("dtype"), \
- TemporaryVariableOp); \
- REGISTER_KERNEL_BUILDER(Name("DestroyTemporaryVariable") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<TYPE>("T"), \
- DestroyTemporaryVariableOp); \
- REGISTER_KERNEL_BUILDER(Name("IsVariableInitialized") \
- .Device(DEVICE_SYCL) \
- .TypeConstraint<TYPE>("dtype") \
- .HostMemory("is_initialized"), \
+#ifdef TENSORFLOW_USE_SYCL
+#define REGISTER_SYCL_KERNEL(type) \
+ REGISTER_KERNEL_BUILDER( \
+ Name("Variable").Device(DEVICE_SYCL).TypeConstraint<type>("dtype"), \
+ VariableOp); \
+ REGISTER_KERNEL_BUILDER( \
+ Name("VariableV2").Device(DEVICE_SYCL).TypeConstraint<type>("dtype"),\
+ VariableOp); \
+ REGISTER_KERNEL_BUILDER(Name("TemporaryVariable") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("dtype"), \
+ TemporaryVariableOp); \
+ REGISTER_KERNEL_BUILDER(Name("DestroyTemporaryVariable") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("T"), \
+ DestroyTemporaryVariableOp); \
+ REGISTER_KERNEL_BUILDER(Name("IsVariableInitialized") \
+ .Device(DEVICE_SYCL) \
+ .TypeConstraint<type>("dtype") \
+ .HostMemory("is_initialized"), \
IsVariableInitializedOp);
-REGISTER_SYCL_KERNEL(float);
-REGISTER_SYCL_KERNEL(double);
+TF_CALL_GPU_NUMBER_TYPES_NO_HALF(REGISTER_SYCL_KERNEL);
#undef REGISTER_SYCL_KERNEL
#endif // TENSORFLOW_USE_SYCL
diff --git a/tensorflow/core/kernels/variable_ops.h b/tensorflow/core/kernels/variable_ops.h
index f0b5796d04..355140d44c 100644
--- a/tensorflow/core/kernels/variable_ops.h
+++ b/tensorflow/core/kernels/variable_ops.h
@@ -180,10 +180,10 @@ class DestroyTemporaryVariableOp : public OpKernel {
if (context->track_allocations()) {
if (context->allocate_on_host(AllocatorAttributes())) {
context->record_host_persistent_memory_allocation(
- -tmpvar.AllocatedBytes());
+ -static_cast<int64>(tmpvar.AllocatedBytes()));
} else {
context->record_device_persistent_memory_allocation(
- -tmpvar.AllocatedBytes());
+ -static_cast<int64>(tmpvar.AllocatedBytes()));
}
}
}
diff --git a/tensorflow/core/kernels/xsmm_conv2d.cc b/tensorflow/core/kernels/xsmm_conv2d.cc
index c4690eb23e..601704c8a7 100644
--- a/tensorflow/core/kernels/xsmm_conv2d.cc
+++ b/tensorflow/core/kernels/xsmm_conv2d.cc
@@ -131,32 +131,7 @@ class libxsmm_dnn_conv_desc_wrap {
struct HashFunction {
std::size_t operator()(const libxsmm_dnn_conv_desc_wrap& w) const {
- // unsigned char ptr[sizeof(&w.d)];
-
- // memcpy(ptr, (unsigned char *)&w.d, sizeof(&w.d))
-
- //
- /*
- std::ostringstream N,C,H,W,K,R,S,u,v,padh,padw;
-
- N << w.d.N; C << w.d.C;
- H << w.d.H; W << w.d.W;
- K << w.d.K; R << w.d.R;
- S << w.d.S; u << w.d.u;
- v << w.d.v; padh << w.d.pad_h_in;
- padw << w.d.pad_w_in;
-
-
- std::string out_ = N.str() + C.str()\
- + H.str() + W.str()\
- + K.str() + R.str()\
- + S.str() + u.str()\
- + v.str() + padh.str()\
- + padw.str();
- //
- //
- */
- return (std::hash<unsigned long long>()((unsigned long long)&(w.d)));
+ return libxsmm_hash(&w.d, sizeof(w.d), 25071975);
}
};
@@ -221,8 +196,6 @@ static bool CallLibxsmmConvGeneric(OpKernelContext* ctx,
status = libxsmm_dnn_get_codegen_success(libxsmm_handle, kind);
if (status == LIBXSMM_DNN_WARN_FALLBACK) {
- chk_libxsmm_err(libxsmm_dnn_destroy_conv_layer(libxsmm_handle),
- "Destroy handle");
return false; // Use non-libxsmm code
}
chk_libxsmm_err(status, "Check codegen status");
@@ -324,8 +297,6 @@ static bool CallLibxsmmConvGeneric(OpKernelContext* ctx,
chk_libxsmm_err(status, "Link filter");
}
if (kind == LIBXSMM_DNN_COMPUTE_KIND_FWD) {
- chk_libxsmm_err(libxsmm_dnn_zero_buffer(libxsmm_output), "Zero output");
-
chk_libxsmm_err(libxsmm_dnn_bind_buffer(libxsmm_handle, libxsmm_input,
LIBXSMM_DNN_REGULAR_INPUT),
"Bind input forward");
diff --git a/tensorflow/core/kernels/zip_dataset_op.cc b/tensorflow/core/kernels/zip_dataset_op.cc
index 79f48cc820..e7fc9bc6b1 100644
--- a/tensorflow/core/kernels/zip_dataset_op.cc
+++ b/tensorflow/core/kernels/zip_dataset_op.cc
@@ -21,7 +21,7 @@ namespace tensorflow {
namespace {
-// See documentation in ../ops/iterator_ops.cc for a high-level
+// See documentation in ../ops/dataset_ops.cc for a high-level
// description of the following op.
class ZipDatasetOp : public OpKernel {
diff --git a/tensorflow/core/lib/gtl/optional.h b/tensorflow/core/lib/gtl/optional.h
index f80b5c113d..8ba4b09143 100644
--- a/tensorflow/core/lib/gtl/optional.h
+++ b/tensorflow/core/lib/gtl/optional.h
@@ -541,7 +541,7 @@ class optional : private internal_optional::optional_data<T>,
// opt.emplace(arg1,arg2,arg3); (Constructs Foo(arg1,arg2,arg3))
//
// If the optional is non-empty, and the `args` refer to subobjects of the
- // current object, then behaviour is undefined. This is because the current
+ // current object, then behavior is undefined. This is because the current
// object will be destructed before the new object is constructed with `args`.
//
template <typename... Args,
@@ -586,7 +586,7 @@ class optional : private internal_optional::optional_data<T>,
// [optional.observe], observers
// You may use `*opt`, and `opt->m`, to access the underlying T value and T's
- // member `m`, respectively. If the optional is empty, behaviour is
+ // member `m`, respectively. If the optional is empty, behavior is
// undefined.
constexpr const T* operator->() const { return this->pointer(); }
T* operator->() {
diff --git a/tensorflow/core/lib/jpeg/jpeg_mem.cc b/tensorflow/core/lib/jpeg/jpeg_mem.cc
index 3bd754cf76..258793aa1e 100644
--- a/tensorflow/core/lib/jpeg/jpeg_mem.cc
+++ b/tensorflow/core/lib/jpeg/jpeg_mem.cc
@@ -45,7 +45,7 @@ enum JPEGErrors {
JPEGERRORS_BAD_PARAM
};
-// Prevent bad compiler behaviour in ASAN mode by wrapping most of the
+// Prevent bad compiler behavior in ASAN mode by wrapping most of the
// arguments in a struct struct.
class FewerArgsForCompiler {
public:
diff --git a/tensorflow/core/lib/lmdb/testdata/data.mdb b/tensorflow/core/lib/lmdb/testdata/data.mdb
new file mode 100644
index 0000000000..3ea75699cb
--- /dev/null
+++ b/tensorflow/core/lib/lmdb/testdata/data.mdb
Binary files differ
diff --git a/tensorflow/core/lib/random/random_distributions.h b/tensorflow/core/lib/random/random_distributions.h
index 03b155344c..c15a6436d6 100644
--- a/tensorflow/core/lib/random/random_distributions.h
+++ b/tensorflow/core/lib/random/random_distributions.h
@@ -27,6 +27,7 @@ limitations under the License.
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "tensorflow/core/lib/random/philox_random.h"
+
namespace tensorflow {
namespace random {
@@ -373,7 +374,7 @@ class TruncatedNormalDistribution<SingleSampleGenerator, Eigen::half> {
BoxMullerFloat(x0, x1, &f[0], &f[1]);
for (int i = 0; i < 2; ++i) {
- if (fabs(f[i]) < kTruncateValue) {
+ if (Eigen::numext::abs(f[i]) < kTruncateValue) {
results[index++] = Eigen::half(f[i]);
if (index >= kResultElementCount) {
return results;
@@ -416,7 +417,7 @@ class TruncatedNormalDistribution<SingleSampleGenerator, float> {
BoxMullerFloat(x0, x1, &f[0], &f[1]);
for (int i = 0; i < 2; ++i) {
- if (fabs(f[i]) < kTruncateValue) {
+ if (Eigen::numext::abs(f[i]) < kTruncateValue) {
results[index++] = f[i];
if (index >= kResultElementCount) {
return results;
@@ -458,7 +459,7 @@ class TruncatedNormalDistribution<SingleSampleGenerator, double> {
BoxMullerDouble(x0, x1, x2, x3, &d[0], &d[1]);
for (int i = 0; i < 2; ++i) {
- if (fabs(d[i]) < kTruncateValue) {
+ if (Eigen::numext::abs(d[i]) < kTruncateValue) {
results[index++] = d[i];
if (index >= kResultElementCount) {
return results;
@@ -483,12 +484,12 @@ void BoxMullerFloat(uint32 x0, uint32 x1, float* f0, float* f1) {
u1 = epsilon;
}
const float v1 = 2.0f * M_PI * Uint32ToFloat(x1);
- const float u2 = sqrt(-2.0f * log(u1));
-#if defined(__linux__)
- sincosf(v1, f0, f1);
+ const float u2 = Eigen::numext::sqrt(-2.0f * Eigen::numext::log(u1));
+#if defined(TENSORFLOW_USE_SYCL) || !defined(__linux__)
+ *f0 = Eigen::numext::sin(v1);
+ *f1 = Eigen::numext::cos(v1);
#else
- *f0 = sinf(v1);
- *f1 = cosf(v1);
+ sincosf(v1, f0, f1);
#endif
*f0 *= u2;
*f1 *= u2;
@@ -509,12 +510,12 @@ void BoxMullerDouble(uint32 x0, uint32 x1, uint32 x2, uint32 x3, double* d0,
u1 = epsilon;
}
const double v1 = 2 * M_PI * Uint64ToDouble(x2, x3);
- const double u2 = sqrt(-2.0 * log(u1));
-#if defined(__linux__)
- sincos(v1, d0, d1);
+ const double u2 = Eigen::numext::sqrt(-2.0 * Eigen::numext::log(u1));
+#if defined(TENSORFLOW_USE_SYCL) || !defined(__linux__)
+ *d0 = Eigen::numext::sin(v1);
+ *d1 = Eigen::numext::cos(v1);
#else
- *d0 = sin(v1);
- *d1 = cos(v1);
+ sincos(v1, d0, d1);
#endif
*d0 *= u2;
*d1 *= u2;
diff --git a/tensorflow/core/ops/data_flow_ops.cc b/tensorflow/core/ops/data_flow_ops.cc
index 97d0800d12..f0fcd02835 100644
--- a/tensorflow/core/ops/data_flow_ops.cc
+++ b/tensorflow/core/ops/data_flow_ops.cc
@@ -827,7 +827,7 @@ operations that would block will fail immediately.
handle: The handle to a queue.
cancel_pending_enqueues: If true, all pending enqueue requests that are
- blocked on the given queue will be cancelled.
+ blocked on the given queue will be canceled.
)doc");
REGISTER_OP("QueueCloseV2")
@@ -845,7 +845,7 @@ operations that would block will fail immediately.
handle: The handle to a queue.
cancel_pending_enqueues: If true, all pending enqueue requests that are
- blocked on the given queue will be cancelled.
+ blocked on the given queue will be canceled.
)doc");
REGISTER_OP("QueueSize")
@@ -1879,7 +1879,7 @@ Subsequent TakeMany operations that would block will fail immediately.
handle: The handle to a barrier.
cancel_pending_enqueues: If true, all pending enqueue requests that are
- blocked on the barrier's queue will be cancelled. InsertMany will fail, even
+ blocked on the barrier's queue will be canceled. InsertMany will fail, even
if no new key is introduced.
)doc");
@@ -1967,6 +1967,8 @@ handle: The handle for a tensor stored in the session state.
REGISTER_OP("Stage")
.Input("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
.Attr("dtypes: list(type)")
.Attr("container: string = ''")
.Attr("shared_name: string = ''")
@@ -1979,6 +1981,11 @@ The basic functionality of this Op is similar to a queue with many
fewer capabilities and options. This Op is optimized for performance.
values: a list of tensors
+dtypes: A list of data types that inserted values should adhere to.
+capacity: Maximum number of elements in the Staging Area. If > 0, inserts
+ on the container will block when the capacity is reached.
+memory_limit: The maximum number of bytes allowed for Tensors in the Staging Area.
+ If > 0, inserts will block until sufficient space is available.
container: If non-empty, this queue is placed in the given container. Otherwise,
a default container is used.
shared_name: It is necessary to match this name to the matching Unstage Op.
@@ -1986,6 +1993,8 @@ shared_name: It is necessary to match this name to the matching Unstage Op.
REGISTER_OP("Unstage")
.Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
.Attr("dtypes: list(type)")
.Attr("container: string = ''")
.Attr("shared_name: string = ''")
@@ -1994,10 +2003,287 @@ REGISTER_OP("Unstage")
.Doc(R"doc(
Op is similar to a lightweight Dequeue.
-The basic funtionality is similar to dequeue with many fewer
+The basic functionality is similar to dequeue with many fewer
capabilities and options. This Op is optimized for performance.
)doc");
+REGISTER_OP("StagePeek")
+ .Input("index: int32")
+ .Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op peeks at the values at the specified index. If the
+underlying container does not contain sufficient elements
+this op will block until it does. This Op is optimized for
+performance.
+ )doc");
+
+
+REGISTER_OP("StageSize")
+ .Output("size: int32")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(shape_inference::ScalarShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op returns the number of elements in the underlying container.
+ )doc");
+
+REGISTER_OP("StageClear")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op removes all elements in the underlying container.
+ )doc");
+
+// UnorderedMap
+REGISTER_OP("MapStage")
+ .Input("key: int64")
+ .Input("indices: int32")
+ .Input("values: fake_dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("fake_dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::NoOutputs)
+ .SetIsStateful()
+ .Doc(R"doc(
+Stage (key, values) in the underlying container which behaves like a hashtable.
+
+key: int64
+values: a list of tensors
+dtypes: A list of data types that inserted values should adhere to.
+capacity: Maximum number of elements in the Staging Area. If > 0, inserts
+ on the container will block when the capacity is reached.
+container: If non-empty, this queue is placed in the given container. Otherwise,
+ a default container is used.
+shared_name: It is necessary to match this name to the matching Unstage Op.
+)doc");
+
+REGISTER_OP("MapPeek")
+ .Input("key: int64")
+ .Input("indices: int32")
+ .Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op peeks at the values at the specified key. If the
+underlying container does not contain this key
+this op will block until it does.
+ )doc");
+
+REGISTER_OP("MapUnstage")
+ .Input("key: int64")
+ .Input("indices: int32")
+ .Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op removes and returns the values associated with the key
+from the underlying container. If the underlying container
+does not contain this key, the op will block until it does.
+ )doc");
+
+REGISTER_OP("MapUnstageNoKey")
+ .Input("indices: int32")
+ .Output("key: int64")
+ .Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op removes and returns a random (key, value)
+from the underlying container. If the underlying container
+does not contain elements, the op will block until it does.
+ )doc");
+
+REGISTER_OP("MapSize")
+ .Output("size: int32")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::ScalarShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op returns the number of elements in the underlying container.
+ )doc");
+
+REGISTER_OP("MapIncompleteSize")
+ .Output("size: int32")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::ScalarShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op returns the number of incomplete elements in the underlying container.
+ )doc");
+
+
+REGISTER_OP("MapClear")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::NoOutputs)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op removes all elements in the underlying container.
+ )doc");
+
+
+// OrderedMap
+REGISTER_OP("OrderedMapStage")
+ .Input("key: int64")
+ .Input("indices: int32")
+ .Input("values: fake_dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("fake_dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::NoOutputs)
+ .SetIsStateful()
+ .Doc(R"doc(
+Stage (key, values) in the underlying container which behaves like an ordered
+associative container. Elements are ordered by key.
+
+key: int64
+values: a list of tensors
+dtypes: A list of data types that inserted values should adhere to.
+capacity: Maximum number of elements in the Staging Area. If > 0, inserts
+ on the container will block when the capacity is reached.
+container: If non-empty, this queue is placed in the given container. Otherwise,
+ a default container is used.
+shared_name: It is necessary to match this name to the matching Unstage Op.
+)doc");
+
+REGISTER_OP("OrderedMapPeek")
+ .Input("key: int64")
+ .Input("indices: int32")
+ .Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op peeks at the values at the specified key. If the
+underlying container does not contain this key
+this op will block until it does. This Op is optimized for
+performance.
+ )doc");
+
+REGISTER_OP("OrderedMapUnstage")
+ .Input("key: int64")
+ .Input("indices: int32")
+ .Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op removes and returns the values associated with the key
+from the underlying container. If the underlying container
+does not contain this key, the op will block until it does.
+ )doc");
+
+REGISTER_OP("OrderedMapUnstageNoKey")
+ .Input("indices: int32")
+ .Output("key: int64")
+ .Output("values: dtypes")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::UnknownShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op removes and returns the (key, value) element with the smallest
+key from the underlying container. If the underlying container
+does not contain elements, the op will block until it does.
+ )doc");
+
+REGISTER_OP("OrderedMapSize")
+ .Output("size: int32")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::ScalarShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op returns the number of elements in the underlying container.
+ )doc");
+
+REGISTER_OP("OrderedMapIncompleteSize")
+ .Output("size: int32")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::ScalarShape)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op returns the number of incomplete elements in the underlying container.
+ )doc");
+
+REGISTER_OP("OrderedMapClear")
+ .Attr("capacity: int >= 0 = 0")
+ .Attr("memory_limit: int >= 0 = 0")
+ .Attr("dtypes: list(type)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetShapeFn(tensorflow::shape_inference::NoOutputs)
+ .SetIsStateful()
+ .Doc(R"doc(
+Op removes all elements in the underlying container.
+ )doc");
+
REGISTER_OP("RecordInput")
.Output("records: string")
.Attr("file_pattern: string")
diff --git a/tensorflow/core/ops/image_ops.cc b/tensorflow/core/ops/image_ops.cc
index f2a8d8718d..3edae6f927 100644
--- a/tensorflow/core/ops/image_ops.cc
+++ b/tensorflow/core/ops/image_ops.cc
@@ -900,7 +900,7 @@ boxes: A 2-D tensor of shape `[num_boxes, 4]`. The `i`-th row of the tensor
in normalized coordinates `[y1, x1, y2, x2]`. A normalized coordinate value of
`y` is mapped to the image coordinate at `y * (image_height - 1)`, so as the
`[0, 1]` interval of normalized image height is mapped to
- `[0, image_height - 1] in image height coordinates. We do allow y1 > y2, in
+ `[0, image_height - 1]` in image height coordinates. We do allow `y1` > `y2`, in
which case the sampled crop is an up-down flipped version of the original
image. The width dimension is treated similarly. Normalized coordinates
outside the `[0, 1]` range are allowed, in which case we use
@@ -994,16 +994,16 @@ method: A string specifying the interpolation method. Only 'bilinear' is
// --------------------------------------------------------------------------
REGISTER_OP("NonMaxSuppression")
- .Input("boxes: float")
- .Input("scores: float")
- .Input("max_output_size: int32")
- .Output("selected_indices: int32")
- .Attr("iou_threshold: float = 0.5")
- .SetShapeFn([](InferenceContext* c) {
+ .Input("boxes: float")
+ .Input("scores: float")
+ .Input("max_output_size: int32")
+ .Output("selected_indices: int32")
+ .Attr("iou_threshold: float = 0.5")
+ .SetShapeFn([](InferenceContext* c) {
c->set_output(0, c->Vector(c->UnknownDim()));
return Status::OK();
})
- .Doc(R"doc(
+ .Doc(R"doc(
Greedily selects a subset of bounding boxes in descending order of score,
pruning away boxes that have high intersection-over-union (IOU) overlap
with previously selected boxes. Bounding boxes are supplied as
diff --git a/tensorflow/core/ops/io_ops.cc b/tensorflow/core/ops/io_ops.cc
index f55a2e8560..fa12816c92 100644
--- a/tensorflow/core/ops/io_ops.cc
+++ b/tensorflow/core/ops/io_ops.cc
@@ -527,6 +527,21 @@ shared_name: If non-empty, this reader is named in the given bucket
with this shared_name. Otherwise, the node name is used instead.
)doc");
+REGISTER_OP("LMDBReader")
+ .Output("reader_handle: Ref(string)")
+ .Attr("container: string = ''")
+ .Attr("shared_name: string = ''")
+ .SetIsStateful()
+ .SetShapeFn(TwoElementOutput)
+ .Doc(R"doc(
+A Reader that outputs the records from a LMDB file.
+reader_handle: The handle to reference the Reader.
+container: If non-empty, this reader is placed in the given container.
+ Otherwise, a default container is used.
+shared_name: If non-empty, this reader is named in the given bucket
+ with this shared_name. Otherwise, the node name is used instead.
+)doc");
+
// TODO(cwhipkey): mark this deprecated in favor of V2.
REGISTER_OP("IdentityReader")
.Output("reader_handle: Ref(string)")
diff --git a/tensorflow/core/ops/ops.pbtxt b/tensorflow/core/ops/ops.pbtxt
index f1cb0c7a0b..93599fa37c 100644
--- a/tensorflow/core/ops/ops.pbtxt
+++ b/tensorflow/core/ops/ops.pbtxt
@@ -2380,7 +2380,7 @@ op {
default_value {
b: false
}
- description: "If true, all pending enqueue requests that are\nblocked on the barrier\'s queue will be cancelled. InsertMany will fail, even\nif no new key is introduced."
+ description: "If true, all pending enqueue requests that are\nblocked on the barrier\'s queue will be canceled. InsertMany will fail, even\nif no new key is introduced."
}
summary: "Closes the given barrier."
description: "This operation signals that no more new elements will be inserted in the\ngiven barrier. Subsequent InsertMany that try to introduce a new key will fail.\nSubsequent InsertMany operations that just add missing components to already\nexisting elements will continue to succeed. Subsequent TakeMany operations will\ncontinue to succeed if sufficient completed elements remain in the barrier.\nSubsequent TakeMany operations that would block will fail immediately."
@@ -15813,7 +15813,7 @@ op {
default_value {
b: false
}
- description: "If true, all pending enqueue requests that are\nblocked on the given queue will be cancelled."
+ description: "If true, all pending enqueue requests that are\nblocked on the given queue will be canceled."
}
summary: "Closes the given queue."
description: "This operation signals that no more elements will be enqueued in the\ngiven queue. Subsequent Enqueue(Many) operations will fail.\nSubsequent Dequeue(Many) operations will continue to succeed if\nsufficient elements remain in the queue. Subsequent Dequeue(Many)\noperations that would block will fail immediately."
@@ -15831,7 +15831,7 @@ op {
default_value {
b: false
}
- description: "If true, all pending enqueue requests that are\nblocked on the given queue will be cancelled."
+ description: "If true, all pending enqueue requests that are\nblocked on the given queue will be canceled."
}
summary: "Closes the given queue."
description: "This operation signals that no more elements will be enqueued in the\ngiven queue. Subsequent Enqueue(Many) operations will fail.\nSubsequent Dequeue(Many) operations will continue to succeed if\nsufficient elements remain in the queue. Subsequent Dequeue(Many)\noperations that would block will fail immediately."
@@ -26088,6 +26088,33 @@ op {
is_stateful: true
}
op {
+ name: "LMDBReader"
+ output_arg {
+ name: "reader_handle"
+ description: "The handle to reference the Reader."
+ type: DT_STRING
+ is_ref: true
+ }
+ attr {
+ name: "container"
+ type: "string"
+ default_value {
+ s: ""
+ }
+ description: "If non-empty, this reader is placed in the given container.\nOtherwise, a default container is used."
+ }
+ attr {
+ name: "shared_name"
+ type: "string"
+ default_value {
+ s: ""
+ }
+ description: "If non-empty, this reader is named in the given bucket\nwith this shared_name. Otherwise, the node name is used instead."
+ }
+ summary: "A Reader that outputs the records from a LMDB database."
+ is_stateful: true
+}
+op {
name: "TakeDataset"
input_arg {
name: "input_dataset"
@@ -28118,7 +28145,7 @@ op {
}
}
summary: "Op is similar to a lightweight Dequeue."
- description: "The basic funtionality is similar to dequeue with many fewer\ncapabilities and options. This Op is optimized for performance."
+ description: "The basic functionality is similar to dequeue with many fewer\ncapabilities and options. This Op is optimized for performance."
is_stateful: true
}
op {
diff --git a/tensorflow/core/ops/training_ops.cc b/tensorflow/core/ops/training_ops.cc
index e6a9c0c018..1d24ea36a3 100644
--- a/tensorflow/core/ops/training_ops.cc
+++ b/tensorflow/core/ops/training_ops.cc
@@ -103,6 +103,28 @@ use_locking: If `True`, the subtraction will be protected by a lock;
otherwise the behavior is undefined, but may exhibit less contention.
)doc");
+REGISTER_OP("ApplyDelayCompensatedGradientDescent")
+ .Input("var: resource")
+ .Input("alpha: T")
+ .Input("delta: T")
+ .Input("lambda: T")
+ .Input("shadow: resource")
+ .Attr("T: numbertype")
+ .Attr("use_locking: bool = false")
+ .SetShapeFn(ApplyGradientDescentShapeFn)
+ .Doc(R"doc(
+var -= alpha * (delta + lambda * delta * (var - shadow))
+Update '*shadow' by changing it to the new value of 'var'
+
+var: Should be from a Variable().
+alpha: Scaling factor. Must be a scalar.
+delta: The change.
+lambda: The variance parameter.
+shadow: Same as "var".
+use_locking: If `True`, the subtraction will be protected by a lock;
+ otherwise the behavior is undefined, but may exhibit less contention.
+)doc");
+
static Status ApplyProximalGradientDescentShapeFn(InferenceContext* c,
bool sparse) {
ShapeHandle unused;
diff --git a/tensorflow/core/platform/default/build_config.bzl b/tensorflow/core/platform/default/build_config.bzl
index 10414cbca2..94f255663e 100644
--- a/tensorflow/core/platform/default/build_config.bzl
+++ b/tensorflow/core/platform/default/build_config.bzl
@@ -98,12 +98,14 @@ def tf_proto_library(name, srcs = [], has_services = None,
)
def tf_additional_lib_hdrs(exclude = []):
+ windows_hdrs = native.glob([
+ "platform/default/*.h",
+ "platform/windows/*.h",
+ "platform/posix/error.h",
+ ], exclude = exclude)
return select({
- "//tensorflow:windows" : native.glob([
- "platform/default/*.h",
- "platform/windows/*.h",
- "platform/posix/error.h",
- ], exclude = exclude),
+ "//tensorflow:windows" : windows_hdrs,
+ "//tensorflow:windows_msvc" : windows_hdrs,
"//conditions:default" : native.glob([
"platform/default/*.h",
"platform/posix/*.h",
@@ -111,12 +113,14 @@ def tf_additional_lib_hdrs(exclude = []):
})
def tf_additional_lib_srcs(exclude = []):
+ windows_srcs = native.glob([
+ "platform/default/*.cc",
+ "platform/windows/*.cc",
+ "platform/posix/error.cc",
+ ], exclude = exclude)
return select({
- "//tensorflow:windows" : native.glob([
- "platform/default/*.cc",
- "platform/windows/*.cc",
- "platform/posix/error.cc",
- ], exclude = exclude),
+ "//tensorflow:windows" : windows_srcs,
+ "//tensorflow:windows_msvc" : windows_srcs,
"//conditions:default" : native.glob([
"platform/default/*.cc",
"platform/posix/*.cc",
@@ -148,11 +152,13 @@ def tf_env_time_hdrs():
]
def tf_env_time_srcs():
+ win_env_time = native.glob([
+ "platform/windows/env_time.cc",
+ "platform/env_time.cc",
+ ], exclude = [])
return select({
- "//tensorflow:windows" : native.glob([
- "platform/windows/env_time.cc",
- "platform/env_time.cc",
- ], exclude = []),
+ "//tensorflow:windows" : win_env_time,
+ "//tensorflow:windows_msvc" : win_env_time,
"//conditions:default" : native.glob([
"platform/posix/env_time.cc",
"platform/env_time.cc",
@@ -254,3 +260,9 @@ def tf_additional_verbs_lib_defines():
"//tensorflow:with_verbs_support": ["TENSORFLOW_USE_VERBS"],
"//conditions:default": [],
})
+
+def tf_additional_mpi_lib_defines():
+ return select({
+ "//tensorflow:with_mpi_support": ["TENSORFLOW_USE_MPI"],
+ "//conditions:default": [],
+ })
diff --git a/tensorflow/core/platform/default/build_config_root.bzl b/tensorflow/core/platform/default/build_config_root.bzl
index eb804bfc78..fa4ac4ba73 100644
--- a/tensorflow/core/platform/default/build_config_root.bzl
+++ b/tensorflow/core/platform/default/build_config_root.bzl
@@ -26,7 +26,16 @@ def tf_additional_license_deps():
def tf_additional_verbs_deps():
return select({
"//tensorflow:with_verbs_support": [
- "//tensorflow/contrib/verbs:verbs_server_lib",
- "//tensorflow/contrib/verbs:grpc_verbs_client"],
+ "//tensorflow/contrib/verbs:verbs_server_lib",
+ "//tensorflow/contrib/verbs:grpc_verbs_client",
+ ],
+ "//conditions:default": [],
+ })
+
+def tf_additional_mpi_deps():
+ return select({
+ "//tensorflow:with_mpi_support": [
+ "//tensorflow/contrib/mpi:mpi_server_lib",
+ ],
"//conditions:default": [],
})
diff --git a/tensorflow/core/platform/posix/port.cc b/tensorflow/core/platform/posix/port.cc
index 6ee402594b..3b17bac808 100644
--- a/tensorflow/core/platform/posix/port.cc
+++ b/tensorflow/core/platform/posix/port.cc
@@ -32,7 +32,7 @@ limitations under the License.
#ifdef SNAPPY
#include "snappy.h"
#endif
-#if defined(__APPLE__) && defined(__MACH__)
+#if (defined(__APPLE__) && defined(__MACH__)) || defined(__FreeBSD__)
#include <thread>
#endif
@@ -56,7 +56,7 @@ int NumSchedulableCPUs() {
}
perror("sched_getaffinity");
#endif
-#if defined(__APPLE__) && defined(__MACH__)
+#if (defined(__APPLE__) && defined(__MACH__)) || defined(__FreeBSD__)
unsigned int count = std::thread::hardware_concurrency();
if (count > 0) return static_cast<int>(count);
#endif
diff --git a/tensorflow/core/platform/posix/subprocess.cc b/tensorflow/core/platform/posix/subprocess.cc
index fc511fdf72..cefc66831a 100644
--- a/tensorflow/core/platform/posix/subprocess.cc
+++ b/tensorflow/core/platform/posix/subprocess.cc
@@ -28,7 +28,7 @@ limitations under the License.
// A danger of calling fork() (as opposed to clone() or vfork()) is that if
// many people have used pthread_atfork() to acquire locks, fork() can deadlock,
// because it's unlikely that the locking order will be correct in a large
-// programme where different layers are unaware of one another and using
+// program where different layers are unaware of one another and using
// pthread_atfork() independently.
//
// The danger of not calling fork() is that if libc managed to use
diff --git a/tensorflow/core/protobuf/master.proto b/tensorflow/core/protobuf/master.proto
index 22bcdf0f0c..6b25a86ba4 100644
--- a/tensorflow/core/protobuf/master.proto
+++ b/tensorflow/core/protobuf/master.proto
@@ -202,7 +202,7 @@ message CloseSessionResponse {
// Old sessions may continue to have side-effects on resources not in
// containers listed in "containers", and thus may affect future
// sessions' results in ways that are hard to predict. Thus, if well-defined
-// behaviour is desired, is it recommended that all containers be listed in
+// behavior is desired, is it recommended that all containers be listed in
// "containers". Similarly, if a device_filter is specified, results may be
// hard to predict.
message ResetRequest {
diff --git a/tensorflow/core/public/session.h b/tensorflow/core/public/session.h
index acd1482418..c1f097c7c6 100644
--- a/tensorflow/core/public/session.h
+++ b/tensorflow/core/public/session.h
@@ -206,7 +206,7 @@ Status NewSession(const SessionOptions& options, Session** out_session);
/// Old sessions may continue to have side-effects on resources not in
/// containers listed in "containers", and thus may affect future
/// sessions' results in ways that are hard to predict. Thus, if well-defined
-/// behaviour is desired, it is recommended that all containers be listed in
+/// behavior is desired, it is recommended that all containers be listed in
/// "containers".
///
/// `containers` is a vector of string representation of resource container
diff --git a/tensorflow/core/public/version.h b/tensorflow/core/public/version.h
index 57ff12dcd7..d30d7819fc 100644
--- a/tensorflow/core/public/version.h
+++ b/tensorflow/core/public/version.h
@@ -19,7 +19,7 @@ limitations under the License.
// TensorFlow uses semantic versioning, see http://semver.org/.
#define TF_MAJOR_VERSION 1
-#define TF_MINOR_VERSION 1
+#define TF_MINOR_VERSION 2
#define TF_PATCH_VERSION 0
// TF_VERSION_SUFFIX is non-empty for pre-releases (e.g. "-alpha", "-alpha.1",
diff --git a/tensorflow/core/util/ctc/ctc_beam_search_test.cc b/tensorflow/core/util/ctc/ctc_beam_search_test.cc
index 217c7ce1f6..b2d5ef56ad 100644
--- a/tensorflow/core/util/ctc/ctc_beam_search_test.cc
+++ b/tensorflow/core/util/ctc/ctc_beam_search_test.cc
@@ -217,7 +217,7 @@ TEST(CtcBeamSearch, AllBeamElementsHaveFiniteScores) {
// Make sure all scores are finite.
for (int path = 0; path < top_paths; ++path) {
LOG(INFO) << "path " << path;
- EXPECT_FALSE(isinf(score[0][path]));
+ EXPECT_FALSE(std::isinf(score[0][path]));
}
}
diff --git a/tensorflow/core/util/ctc/ctc_loss_calculator.h b/tensorflow/core/util/ctc/ctc_loss_calculator.h
index 81a7033556..be00895b0d 100644
--- a/tensorflow/core/util/ctc/ctc_loss_calculator.h
+++ b/tensorflow/core/util/ctc/ctc_loss_calculator.h
@@ -46,7 +46,7 @@ class CTCLossCalculator {
// these examples.
//
// Reference materials:
- // GravesTh: Alex Graves, "Supervised Sequence Labelling with Recurrent
+ // GravesTh: Alex Graves, "Supervised Sequence Labeling with Recurrent
// Neural Networks" (PhD Thesis), Technische Universit¨at M¨unchen.
public:
typedef std::vector<std::vector<int>> LabelSequences;
diff --git a/tensorflow/core/util/cuda_kernel_helper.h b/tensorflow/core/util/cuda_kernel_helper.h
index ccee269eb3..c86c6e4a5d 100644
--- a/tensorflow/core/util/cuda_kernel_helper.h
+++ b/tensorflow/core/util/cuda_kernel_helper.h
@@ -20,13 +20,95 @@ limitations under the License.
#include <algorithm>
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/types.h"
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+
+// Usage of GetCudaLaunchConfig, GetCuda2DLaunchConfig, and
+// GetCuda3DLaunchConfig:
+//
+// There are two versions of GetCudaLaunchConfig and GetCuda2DLaunchConfig, one
+// version uses heuristics without any knowledge of the device kernel, the other
+// version uses cudaOccupancyMaxPotentialBlockSize to determine the theoretical
+// launch parameters that maximize occupancy. Currently, only the maximum
+// occupancy version of GetCuda3DLaunchConfig is available.
+//
+// For large number of work elements, the convention is that each kernel would
+// iterate through its assigned range. The return value of GetCudaLaunchConfig
+// is struct CudaLaunchConfig, which contains all the information needed for the
+// kernel launch, including: virtual number of threads, the number of threads
+// per block and number of threads per block used inside <<< >>> of a kernel
+// launch. GetCuda2DLaunchConfig and GetCuda3DLaunchConfig does the same thing
+// as CudaLaunchConfig. The only difference is the dimension. The macros
+// CUDA_1D_KERNEL_LOOP and CUDA_AXIS_KERNEL_LOOP might be used to do inner loop.
+//
+/* Sample code:
+
+__global__ void MyKernel1D(CudaLaunchConfig config, other_args...) {
+ CUDA_1D_KERNEL_LOOP(x, config.virtual_thread_count) {
+ do_your_job_here;
+ }
+}
+
+__global__ void MyKernel2D(Cuda2DLaunchConfig config, other_args...) {
+ CUDA_AXIS_KERNEL_LOOP(x, config.virtual_thread_count, x) {
+ CUDA_AXIS_KERNEL_LOOP(y, config.virtual_thread_count, y) {
+ do_your_job_here;
+ }
+ }
+}
+
+__global__ void MyKernel3D(Cuda3DLaunchConfig config, other_args...) {
+ CUDA_AXIS_KERNEL_LOOP(x, config.virtual_thread_count, x) {
+ CUDA_AXIS_KERNEL_LOOP(y, config.virtual_thread_count, y) {
+ CUDA_AXIS_KERNEL_LOOP(z, config.virtual_thread_count, z) {
+ do_your_job_here;
+ }
+ }
+ }
+}
+
+void MyDriverFunc(const GPUDevice &d) {
+ // use heuristics
+ CudaLaunchConfig cfg1 = GetCudaLaunchConfig(10240, d);
+ MyKernel1D <<<config.block_count,
+ config.thread_per_block, 0, d.stream()>>> (cfg1, other_args...);
+ Cuda2DLaunchConfig cfg2 = GetCuda2DLaunchConfig(10240, 10240, d);
+ MyKernel2D <<<config.block_count,
+ config.thread_per_block, 0, d.stream()>>> (cfg2, other_args...);
+ Cuda3DLaunchConfig cfg3 = GetCuda3DLaunchConfig(4096, 4096, 100, d);
+ MyKernel3D <<<config.block_count,
+ config.thread_per_block, 0, d.stream()>>> (cfg3, other_args...);
+
+ // maximize occupancy
+ CudaLaunchConfig cfg4 = GetCudaLaunchConfig(10240, d, MyKernel1D, 0, 0 );
+ MyKernel1D <<<config.block_count,
+ config.thread_per_block, 0, d.stream()>>> (cfg4, other_args...);
+ Cuda2DLaunchConfig cfg5 = GetCuda2DLaunchConfig(10240, 10240, d,
+ MyKernel1D, 0, 0);
+ MyKernel2D <<<config.block_count,
+ config.thread_per_block, 0, d.stream()>>> (cfg5, other_args...);
+ Cuda3DLaunchConfig cfg6 = GetCuda3DLaunchConfig(4096, 4096, 100, d,
+ MyKernel1D, 0, 0);
+ MyKernel3D <<<config.block_count,
+ config.thread_per_block, 0, d.stream()>>> (cfg6, other_args...);
+}
+
+// See the test for this for more example:
+// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/cuda_kernel_helper_test.cu.cc
+
+*/
#define CUDA_1D_KERNEL_LOOP(i, n) \
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \
i += blockDim.x * gridDim.x)
+#define CUDA_AXIS_KERNEL_LOOP(i, n, axis) \
+ for (int i = blockIdx.axis * blockDim.axis + threadIdx.axis; i < n.axis; \
+ i += blockDim.axis * gridDim.axis)
+
+#define DIV_UP(a, b) (((a) + (b) - 1) / (b))
+
namespace tensorflow {
typedef Eigen::GpuDevice GPUDevice;
@@ -47,16 +129,22 @@ struct CudaLaunchConfig {
// memory-limited.
inline CudaLaunchConfig GetCudaLaunchConfig(int work_element_count,
const GPUDevice& d) {
+ CudaLaunchConfig config;
+
+ // in case of invalid input, return the default value config, which has all -1
+ if (work_element_count <= 0) {
+ return config;
+ }
+
const int virtual_thread_count = work_element_count;
const int physical_thread_count = std::min(
d.getNumCudaMultiProcessors() * d.maxCudaThreadsPerMultiProcessor(),
virtual_thread_count);
const int thread_per_block = std::min(1024, d.maxCudaThreadsPerBlock());
- const int block_count = std::min(
- (physical_thread_count + thread_per_block - 1) / thread_per_block,
- d.getNumCudaMultiProcessors());
+ const int block_count =
+ std::min(DIV_UP(physical_thread_count, thread_per_block),
+ d.getNumCudaMultiProcessors());
- CudaLaunchConfig config;
config.virtual_thread_count = virtual_thread_count;
config.thread_per_block = thread_per_block;
config.block_count = block_count;
@@ -70,16 +158,23 @@ inline CudaLaunchConfig GetCudaLaunchConfig(int work_element_count,
const GPUDevice& d, DeviceFunc func,
size_t dynamic_shared_memory_size,
int block_size_limit) {
+ CudaLaunchConfig config;
+
+ if (work_element_count <= 0) {
+ return config;
+ }
+
int block_count = 0;
int thread_per_block = 0;
- cudaOccupancyMaxPotentialBlockSize(&block_count, &thread_per_block, func,
- dynamic_shared_memory_size,
- block_size_limit);
+
+ cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
+ &block_count, &thread_per_block, func, dynamic_shared_memory_size,
+ block_size_limit);
+ CHECK_EQ(err, cudaSuccess);
+
block_count =
- std::min(block_count,
- (work_element_count + thread_per_block - 1) / thread_per_block);
+ std::min(block_count, DIV_UP(work_element_count, thread_per_block));
- CudaLaunchConfig config;
config.virtual_thread_count = work_element_count;
config.thread_per_block = thread_per_block;
config.block_count = block_count;
@@ -87,16 +182,18 @@ inline CudaLaunchConfig GetCudaLaunchConfig(int work_element_count,
}
struct Cuda2DLaunchConfig {
- dim3 virtual_thread_count;
- dim3 thread_per_block;
- dim3 block_count;
+ dim3 virtual_thread_count = dim3(0, 0, 0);
+ dim3 thread_per_block = dim3(0, 0, 0);
+ dim3 block_count = dim3(0, 0, 0);
};
inline Cuda2DLaunchConfig GetCuda2DLaunchConfig(int xdim, int ydim,
const GPUDevice& d) {
Cuda2DLaunchConfig config;
- config.virtual_thread_count = dim3(xdim, ydim, 1);
+ if (xdim <= 0 || ydim <= 0) {
+ return config;
+ }
const int kThreadsPerBlock = 256;
int block_cols = std::min(xdim, kThreadsPerBlock);
@@ -108,16 +205,78 @@ inline Cuda2DLaunchConfig GetCuda2DLaunchConfig(int xdim, int ydim,
const int max_blocks = std::max(physical_thread_count / kThreadsPerBlock, 1);
+ config.virtual_thread_count = dim3(xdim, ydim, 1);
config.thread_per_block = dim3(block_cols, block_rows, 1);
- int grid_x = std::min((xdim + block_cols - 1) / block_cols, max_blocks);
+ int grid_x = std::min(DIV_UP(xdim, block_cols), max_blocks);
config.block_count = dim3(
grid_x, std::min(max_blocks / grid_x, std::max(ydim / block_rows, 1)), 1);
+ return config;
+}
+// Calculate the Cuda 2D and 3D launch config we should use for a kernel launch.
+// This variant takes the resource limits of func into account to maximize
+// occupancy.
+using Cuda3DLaunchConfig = Cuda2DLaunchConfig;
+
+template <typename DeviceFunc>
+inline Cuda3DLaunchConfig GetCuda3DLaunchConfig(
+ int xdim, int ydim, int zdim, const GPUDevice& d, DeviceFunc func,
+ size_t dynamic_shared_memory_size, int block_size_limit) {
+ Cuda3DLaunchConfig config;
+
+ if (xdim <= 0 || ydim <= 0 || zdim <= 0) {
+ return config;
+ }
+
+ int dev;
+ cudaGetDevice(&dev);
+ cudaDeviceProp deviceProp;
+ cudaGetDeviceProperties(&deviceProp, dev);
+ int xthreadlimit = deviceProp.maxThreadsDim[0];
+ int ythreadlimit = deviceProp.maxThreadsDim[1];
+ int zthreadlimit = deviceProp.maxThreadsDim[2];
+ int xgridlimit = deviceProp.maxGridSize[0];
+ int ygridlimit = deviceProp.maxGridSize[1];
+ int zgridlimit = deviceProp.maxGridSize[2];
+
+ int block_count = 0;
+ int thread_per_block = 0;
+ cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
+ &block_count, &thread_per_block, func, dynamic_shared_memory_size,
+ block_size_limit);
+ CHECK_EQ(err, cudaSuccess);
+
+#define MIN3(a, b, c) std::min((a), std::min((b), (c)))
+ int threadsx = MIN3(xdim, thread_per_block, xthreadlimit);
+ int threadsy =
+ MIN3(ydim, std::max(thread_per_block / threadsx, 1), ythreadlimit);
+ int threadsz =
+ MIN3(zdim, std::max(thread_per_block / (threadsx * threadsy), 1),
+ zthreadlimit);
+
+ int blocksx = MIN3(block_count, DIV_UP(xdim, threadsx), xgridlimit);
+ int blocksy =
+ MIN3(DIV_UP(block_count, blocksx), DIV_UP(ydim, threadsy), ygridlimit);
+ int blocksz = MIN3(DIV_UP(block_count, (blocksx * blocksy)),
+ DIV_UP(zdim, threadsz), zgridlimit);
+#undef MIN3
+
+ config.virtual_thread_count = dim3(xdim, ydim, zdim);
+ config.thread_per_block = dim3(threadsx, threadsy, threadsz);
+ config.block_count = dim3(blocksx, blocksy, blocksz);
return config;
}
+template <typename DeviceFunc>
+inline Cuda2DLaunchConfig GetCuda2DLaunchConfig(
+ int xdim, int ydim, const GPUDevice& d, DeviceFunc func,
+ size_t dynamic_shared_memory_size, int block_size_limit) {
+ return GetCuda3DLaunchConfig(xdim, ydim, 1, d, func,
+ dynamic_shared_memory_size, block_size_limit);
+}
+
namespace gpu {
template <typename IntType>
@@ -511,6 +670,8 @@ __device__ EIGEN_ALWAYS_INLINE double CudaShuffleXor(double value, int laneMask,
} // namespace tensorflow
+#undef DIV_UP
+
#endif // GOOGLE_CUDA
#endif // TENSORFLOW_CORE_UTIL_CUDA_KERNEL_HELPER_H_
diff --git a/tensorflow/core/util/cuda_kernel_helper_test.cu.cc b/tensorflow/core/util/cuda_kernel_helper_test.cu.cc
new file mode 100644
index 0000000000..abd72b7d77
--- /dev/null
+++ b/tensorflow/core/util/cuda_kernel_helper_test.cu.cc
@@ -0,0 +1,303 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA
+#define EIGEN_USE_GPU
+
+#include <numeric>
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+
+#define CUDA_EXPECT_SUCCESS \
+ { \
+ cudaDeviceSynchronize(); \
+ cudaError_t err = cudaGetLastError(); \
+ EXPECT_EQ(cudaSuccess, err) << cudaGetErrorString(err); \
+ }
+
+#define CUDA_ASSERT_SUCCESS \
+ { \
+ cudaDeviceSynchronize(); \
+ cudaError_t err = cudaGetLastError(); \
+ ASSERT_EQ(cudaSuccess, err) << cudaGetErrorString(err); \
+ }
+
+namespace tensorflow {
+
+namespace {
+
+__global__ void SetOutbufZero(CudaLaunchConfig config, int* outbuf) {
+ CUDA_1D_KERNEL_LOOP(x, config.virtual_thread_count) { outbuf[x] = 0; }
+}
+
+// counting number of jobs by using atomic +1
+__global__ void Count1D(CudaLaunchConfig config, int bufsize, int* outbuf) {
+ CUDA_1D_KERNEL_LOOP(x, config.virtual_thread_count) {
+ if (x < 0) { // x might overflow when testing extreme case
+ break;
+ }
+ atomicAdd(&outbuf[x % bufsize], 1);
+ }
+}
+__global__ void Count2D(Cuda2DLaunchConfig config, int bufsize, int* outbuf) {
+ CUDA_AXIS_KERNEL_LOOP(x, config.virtual_thread_count, x) {
+ if (x < 0) { // x might overflow when testing extreme case
+ break;
+ }
+ CUDA_AXIS_KERNEL_LOOP(y, config.virtual_thread_count, y) {
+ if (y < 0) { // y might overflow when testing extreme case
+ break;
+ }
+ int idx = x * config.virtual_thread_count.y + y;
+ atomicAdd(&outbuf[idx % bufsize], 1);
+ }
+ }
+}
+__global__ void Count3D(Cuda3DLaunchConfig config, int bufsize, int* outbuf) {
+ CUDA_AXIS_KERNEL_LOOP(x, config.virtual_thread_count, x) {
+ if (x < 0) { // x might overflow when testing extreme case
+ break;
+ }
+ CUDA_AXIS_KERNEL_LOOP(y, config.virtual_thread_count, y) {
+ if (y < 0) { // y might overflow when testing extreme case
+ break;
+ }
+ CUDA_AXIS_KERNEL_LOOP(z, config.virtual_thread_count, z) {
+ if (z < 0) { // z might overflow when testing extreme case
+ break;
+ }
+ int idx =
+ x * config.virtual_thread_count.y * config.virtual_thread_count.z +
+ y * config.virtual_thread_count.z + z;
+ atomicAdd(&outbuf[idx % bufsize], 1);
+ }
+ }
+ }
+}
+
+} // namespace
+
+class CudaLaunchConfigTest : public ::testing::Test {
+ protected:
+ const int bufsize = 1024;
+ int* outbuf = nullptr;
+ Eigen::CudaStreamDevice stream;
+ GPUDevice d = GPUDevice(&stream);
+
+ virtual void SetUp() {
+ cudaError_t err = cudaMallocManaged(&outbuf, sizeof(int) * bufsize);
+ ASSERT_EQ(cudaSuccess, err) << cudaGetErrorString(err);
+ }
+
+ virtual void TearDown() {
+ cudaDeviceSynchronize();
+ cudaFree(outbuf);
+ outbuf = nullptr;
+ }
+};
+
+TEST_F(CudaLaunchConfigTest, GetCudaLaunchConfig) {
+ CudaLaunchConfig cfg;
+
+ // test invalid inputs
+ CudaLaunchConfig default_value;
+ cfg = GetCudaLaunchConfig(0, d);
+ EXPECT_EQ(default_value.virtual_thread_count, cfg.virtual_thread_count);
+ EXPECT_EQ(default_value.block_count, cfg.block_count);
+ EXPECT_EQ(default_value.thread_per_block, cfg.thread_per_block);
+
+ cfg = GetCudaLaunchConfig(-1, d);
+ EXPECT_EQ(default_value.virtual_thread_count, cfg.virtual_thread_count);
+ EXPECT_EQ(default_value.block_count, cfg.block_count);
+ EXPECT_EQ(default_value.thread_per_block, cfg.thread_per_block);
+
+ cfg = GetCudaLaunchConfig(0, d, Count1D, 0, 0);
+ EXPECT_EQ(default_value.virtual_thread_count, cfg.virtual_thread_count);
+ EXPECT_EQ(default_value.block_count, cfg.block_count);
+ EXPECT_EQ(default_value.thread_per_block, cfg.thread_per_block);
+
+ cfg = GetCudaLaunchConfig(-1, d, Count1D, 0, 0);
+ EXPECT_EQ(default_value.virtual_thread_count, cfg.virtual_thread_count);
+ EXPECT_EQ(default_value.block_count, cfg.block_count);
+ EXPECT_EQ(default_value.thread_per_block, cfg.thread_per_block);
+
+ // test valid inputs
+ #define TEST_LAUNCH_PARAMETER(work_element_count) \
+ cfg = GetCudaLaunchConfig(bufsize, d); \
+ SetOutbufZero<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>> \
+ (cfg, outbuf); \
+ CUDA_ASSERT_SUCCESS \
+ cfg = GetCudaLaunchConfig(work_element_count, d); \
+ Count1D<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>> ( \
+ cfg, bufsize, outbuf); \
+ CUDA_EXPECT_SUCCESS \
+ EXPECT_EQ(work_element_count, std::accumulate(outbuf, outbuf + bufsize, 0));\
+ \
+ cfg = GetCudaLaunchConfig(bufsize, d, SetOutbufZero, 0, 0); \
+ SetOutbufZero<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>> \
+ (cfg, outbuf); \
+ CUDA_ASSERT_SUCCESS \
+ cfg = GetCudaLaunchConfig(work_element_count, d, Count1D, 0, 0); \
+ Count1D<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>> ( \
+ cfg, bufsize, outbuf); \
+ CUDA_EXPECT_SUCCESS \
+ EXPECT_EQ(work_element_count, std::accumulate(outbuf, outbuf + bufsize, 0))
+
+ TEST_LAUNCH_PARAMETER(128);
+ TEST_LAUNCH_PARAMETER(129);
+ TEST_LAUNCH_PARAMETER(511);
+ TEST_LAUNCH_PARAMETER(512);
+ TEST_LAUNCH_PARAMETER(2048);
+ TEST_LAUNCH_PARAMETER(2049);
+ TEST_LAUNCH_PARAMETER(8191);
+ TEST_LAUNCH_PARAMETER(8192);
+ TEST_LAUNCH_PARAMETER(123456);
+ TEST_LAUNCH_PARAMETER(1 << 31 - 1); // max value of int
+ #undef TEST_LAUNCH_PARAMETER
+}
+
+bool operator==(const Cuda2DLaunchConfig& a, const Cuda2DLaunchConfig& b) {
+ return a.thread_per_block.x == b.thread_per_block.x &&
+ a.thread_per_block.y == b.thread_per_block.y &&
+ a.thread_per_block.z == b.thread_per_block.z &&
+ a.block_count.x == b.block_count.x &&
+ a.block_count.y == b.block_count.y &&
+ a.block_count.z == b.block_count.z &&
+ a.thread_per_block.x == b.thread_per_block.x &&
+ a.thread_per_block.y == b.thread_per_block.y &&
+ a.thread_per_block.z == b.thread_per_block.z;
+}
+
+TEST_F(CudaLaunchConfigTest, GetCuda2DLaunchConfig) {
+ Cuda2DLaunchConfig cfg;
+ CudaLaunchConfig cfg1d;
+
+ // test invalid inputs
+ Cuda2DLaunchConfig default_value;
+ cfg = GetCuda2DLaunchConfig(1, 0, d);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(1, -1, d);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(-1, 1, d);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(-1, 1, d);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(0, -1, d);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(0, 0, d);
+ EXPECT_EQ(default_value, cfg);
+
+ cfg = GetCuda2DLaunchConfig(1, 0, d, Count2D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(1, -1, d, Count2D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(-1, 1, d, Count2D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(-1, 1, d, Count2D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(0, -1, d, Count2D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda2DLaunchConfig(0, 0, d, Count2D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+
+ // test valid inputs
+ #define TEST_LAUNCH_PARAMETER(dimx, dimy) \
+ cfg1d = GetCudaLaunchConfig(bufsize, d); \
+ SetOutbufZero<<<cfg1d.block_count, cfg1d.thread_per_block, 0, d.stream()>>> \
+ (cfg1d, outbuf);\
+ CUDA_ASSERT_SUCCESS \
+ cfg = GetCuda2DLaunchConfig(dimx, dimy, d); \
+ Count2D<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>> ( \
+ cfg, bufsize, outbuf); \
+ CUDA_EXPECT_SUCCESS \
+ EXPECT_EQ(dimx * dimy, std::accumulate(outbuf, outbuf + bufsize, 0)); \
+ \
+ cfg1d = GetCudaLaunchConfig(bufsize, d, SetOutbufZero, 0, 0); \
+ SetOutbufZero<<<cfg1d.block_count, cfg1d.thread_per_block, 0, d.stream()>>> \
+ (cfg1d, outbuf);\
+ CUDA_ASSERT_SUCCESS \
+ cfg = GetCuda2DLaunchConfig(dimx, dimy, d, Count2D, 0, 0); \
+ Count2D<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>> ( \
+ cfg, bufsize, outbuf); \
+ CUDA_EXPECT_SUCCESS \
+ EXPECT_EQ(dimx * dimy, std::accumulate(outbuf, outbuf + bufsize, 0))
+
+ TEST_LAUNCH_PARAMETER(128, 128);
+ TEST_LAUNCH_PARAMETER(129, 64);
+ TEST_LAUNCH_PARAMETER(511, 2048);
+ TEST_LAUNCH_PARAMETER(512, 512);
+ TEST_LAUNCH_PARAMETER(2048, 1024);
+ TEST_LAUNCH_PARAMETER(2049, 32);
+ TEST_LAUNCH_PARAMETER(8191, 1);
+ TEST_LAUNCH_PARAMETER(8192, 10);
+ TEST_LAUNCH_PARAMETER(123456, 12);
+ TEST_LAUNCH_PARAMETER(1, (1 << 31 - 1));
+ TEST_LAUNCH_PARAMETER((1 << 31 - 1), 1);
+ #undef TEST_LAUNCH_PARAMETER
+}
+
+TEST_F(CudaLaunchConfigTest, GetCuda3DLaunchConfig) {
+ Cuda3DLaunchConfig cfg;
+ CudaLaunchConfig cfg1d;
+
+ // test invalid inputs
+ Cuda3DLaunchConfig default_value;
+ cfg = GetCuda3DLaunchConfig(0, 1, 1, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda3DLaunchConfig(-1, 1, 1, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda3DLaunchConfig(1, 0, 1, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda3DLaunchConfig(1, -1, 1, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda3DLaunchConfig(1, 1, 0, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda3DLaunchConfig(1, 1, -1, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda3DLaunchConfig(0, 0, 0, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+ cfg = GetCuda3DLaunchConfig(-1, -1, -1, d, Count3D, 0, 0);
+ EXPECT_EQ(default_value, cfg);
+
+ // test valid inputs
+ #define TEST_LAUNCH_PARAMETER(dimx, dimy, dimz) \
+ cfg1d = GetCudaLaunchConfig(bufsize, d, SetOutbufZero, 0, 0); \
+ SetOutbufZero<<<cfg1d.block_count, cfg1d.thread_per_block, 0, d.stream()>>> \
+ (cfg1d, outbuf);\
+ CUDA_ASSERT_SUCCESS \
+ cfg = GetCuda3DLaunchConfig(dimx, dimy, dimz, d, Count3D, 0, 0); \
+ Count3D<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>> ( \
+ cfg, bufsize, outbuf); \
+ CUDA_EXPECT_SUCCESS \
+ EXPECT_EQ(dimx * dimy * dimz, std::accumulate(outbuf, outbuf + bufsize, 0))
+
+ TEST_LAUNCH_PARAMETER(128, 128, 128);
+ TEST_LAUNCH_PARAMETER(129, 64, 1024);
+ TEST_LAUNCH_PARAMETER(511, 2048, 128);
+ TEST_LAUNCH_PARAMETER(512, 512, 64);
+ TEST_LAUNCH_PARAMETER(2048, 1024, 128);
+ TEST_LAUNCH_PARAMETER(2049, 32, 1024);
+ TEST_LAUNCH_PARAMETER(8191, 1, 1024);
+ TEST_LAUNCH_PARAMETER(8192, 10, 32);
+ TEST_LAUNCH_PARAMETER(123456, 12, 21);
+ TEST_LAUNCH_PARAMETER(1, 1, (1 << 31 - 1));
+ TEST_LAUNCH_PARAMETER(1, (1 << 31 - 1), 1);
+ TEST_LAUNCH_PARAMETER((1 << 31 - 1), 1, 1);
+ #undef TEST_LAUNCH_PARAMETER
+}
+
+} // namespace tensorflow
+
+#endif // GOOGLE_CUDA
diff --git a/tensorflow/core/util/example_proto_fast_parsing.cc b/tensorflow/core/util/example_proto_fast_parsing.cc
index 8fd11642b0..bcf3512efc 100644
--- a/tensorflow/core/util/example_proto_fast_parsing.cc
+++ b/tensorflow/core/util/example_proto_fast_parsing.cc
@@ -89,7 +89,7 @@ class Feature {
default:
// Initialize variable to avoid compiler warning
*dtype = DT_INVALID;
- return errors::InvalidArgument("Unsuported datatype.");
+ return errors::InvalidArgument("Unsupported datatype.");
}
return Status::OK();
}
diff --git a/tensorflow/core/util/example_proto_fast_parsing.h b/tensorflow/core/util/example_proto_fast_parsing.h
index 5f8b4af5fe..20536cee16 100644
--- a/tensorflow/core/util/example_proto_fast_parsing.h
+++ b/tensorflow/core/util/example_proto_fast_parsing.h
@@ -45,7 +45,7 @@ struct FastParseExampleConfig {
DataType dtype;
// These 2 fields correspond exactly to dense_shapes and dense_defaults in
// ParseExample op.
- // Documentation is avaliable in: tensorflow/core/ops/parsing_ops.cc
+ // Documentation is available in: tensorflow/core/ops/parsing_ops.cc
PartialTensorShape shape;
Tensor default_value;
bool variable_length;
@@ -62,7 +62,7 @@ struct FastParseExampleConfig {
};
// This is exactly the output of TF's ParseExample Op.
-// Documentation is avaliable in: tensorflow/core/ops/parsing_ops.cc
+// Documentation is available in: tensorflow/core/ops/parsing_ops.cc
struct Result {
std::vector<Tensor> sparse_indices;
std::vector<Tensor> sparse_values;
diff --git a/tensorflow/core/util/tensor_bundle/tensor_bundle.h b/tensorflow/core/util/tensor_bundle/tensor_bundle.h
index 962df4373b..3571281820 100644
--- a/tensorflow/core/util/tensor_bundle/tensor_bundle.h
+++ b/tensorflow/core/util/tensor_bundle/tensor_bundle.h
@@ -31,7 +31,7 @@ limitations under the License.
// (tensorflow::table::Table). Each key is a name of a tensor and its value is
// a serialized BundleEntryProto. Each BundleEntryProto describes the metadata
// of a tensor: which of the "data" files contains the content of a tensor, the
-// offset into that file, checksum, some auxilary data, etc.
+// offset into that file, checksum, some auxiliary data, etc.
//
// A tensor bundle can be accessed randomly using a BundleReader. Usage:
//
diff --git a/tensorflow/core/util/tensor_format.h b/tensorflow/core/util/tensor_format.h
index 8c76f0f3c5..cb0f4f4b6a 100644
--- a/tensorflow/core/util/tensor_format.h
+++ b/tensorflow/core/util/tensor_format.h
@@ -43,6 +43,7 @@ inline int GetTensorBatchDimIndex(int num_dims, TensorFormat format) {
return 0;
} else {
LOG(FATAL) << "Unknown format " << format;
+ return -1; // Avoid compiler warning about missing return value
}
}
@@ -54,6 +55,7 @@ inline int GetTensorFeatureDimIndex(int num_dims, TensorFormat format) {
return 1;
} else {
LOG(FATAL) << "Unknown format " << format;
+ return -1; // Avoid compiler warning about missing return value
}
}
@@ -67,6 +69,7 @@ inline int GetTensorSpatialDimIndex(int num_dims, TensorFormat format,
return dim + 2;
} else {
LOG(FATAL) << "Unknown format " << format;
+ return -1; // Avoid compiler warning about missing return value
}
}
diff --git a/tensorflow/docs_src/api_guides/cc/guide.md b/tensorflow/docs_src/api_guides/cc/guide.md
index b5ec83f85b..c5473cad97 100644
--- a/tensorflow/docs_src/api_guides/cc/guide.md
+++ b/tensorflow/docs_src/api_guides/cc/guide.md
@@ -111,7 +111,7 @@ Here are some of the properties controlled by a `Scope` object:
Please refer to @{tensorflow::Scope} for the complete list of member functions
that let you create child scopes with new properties.
-### Operation Construtors
+### Operation Constructors
You can create graph operations with operation constructors, one C++ class per
TensorFlow operation. Unlike the Python API which uses snake-case to name the
diff --git a/tensorflow/docs_src/api_guides/python/contrib.linalg.md b/tensorflow/docs_src/api_guides/python/contrib.linalg.md
index b2c7fcf6bb..5f1db6c6af 100644
--- a/tensorflow/docs_src/api_guides/python/contrib.linalg.md
+++ b/tensorflow/docs_src/api_guides/python/contrib.linalg.md
@@ -9,7 +9,7 @@ Subclasses of `LinearOperator` provide a access to common methods on a
(batch) matrix, without the need to materialize the matrix. This allows:
* Matrix free computations
-* Different operators to take advantage of special strcture, while providing a
+* Different operators to take advantage of special structure, while providing a
consistent API to users.
### Base class
diff --git a/tensorflow/docs_src/deploy/distributed.md b/tensorflow/docs_src/deploy/distributed.md
index 99390f7416..f3e2fac49f 100644
--- a/tensorflow/docs_src/deploy/distributed.md
+++ b/tensorflow/docs_src/deploy/distributed.md
@@ -54,7 +54,7 @@ the following:
### Create a `tf.train.ClusterSpec` to describe the cluster
The cluster specification dictionary maps job names to lists of network
-adresses. Pass this dictionary to
+addresses. Pass this dictionary to
the @{tf.train.ClusterSpec}
constructor. For example:
diff --git a/tensorflow/docs_src/deploy/hadoop.md b/tensorflow/docs_src/deploy/hadoop.md
index 9493ad02c0..c50c1580a5 100644
--- a/tensorflow/docs_src/deploy/hadoop.md
+++ b/tensorflow/docs_src/deploy/hadoop.md
@@ -46,7 +46,7 @@ be set:
expanded as described in the libhdfs documentation:
```shell
- CLASSPATH=$($HADOOP_HDFS_HOME}/bin/hadoop classpath --glob) python your_script.py
+ CLASSPATH=$(${HADOOP_HDFS_HOME}/bin/hadoop classpath --glob) python your_script.py
```
For older version of Hadoop/libhdfs (older than 2.6.0), you have to expand the
classpath wildcard manually. For more details, see
diff --git a/tensorflow/docs_src/extend/estimators.md b/tensorflow/docs_src/extend/estimators.md
index f972ee5f50..6bd21be019 100644
--- a/tensorflow/docs_src/extend/estimators.md
+++ b/tensorflow/docs_src/extend/estimators.md
@@ -303,7 +303,7 @@ The `model_fn` must accept three arguments:
`model_fn` may also accept a `params` argument containing a dict of
hyperparameters used for training (as shown in the skeleton above).
-The body of the function perfoms the following tasks (described in detail in the
+The body of the function performs the following tasks (described in detail in the
sections that follow):
* Configuring the model—here, for the abalone predictor, this will be a neural
@@ -371,7 +371,7 @@ layer.
The input layer is a series of nodes (one for each feature in the model) that
will accept the feature data that is passed to the `model_fn` in the `features`
-argument. If `features` contains an n-dimenional `Tensor` with all your feature
+argument. If `features` contains an n-dimensional `Tensor` with all your feature
data (which is the case if `x` and `y` `Dataset`s are passed to `fit()`,
`evaluate()`, and `predict()` directly), then it can serve as the input layer.
If `features` contains a dict of @{$linear#feature-columns-and-transformations$feature columns} passed to
diff --git a/tensorflow/docs_src/extend/language_bindings.md b/tensorflow/docs_src/extend/language_bindings.md
index 84e0b03086..b9fd72978d 100644
--- a/tensorflow/docs_src/extend/language_bindings.md
+++ b/tensorflow/docs_src/extend/language_bindings.md
@@ -29,7 +29,7 @@ into broad categories:
are modified.
- *Gradients (AKA automatic differentiation)*: Given a graph and a list of
input and output operations, add operations to the graph that compute the
- partial deriviatives (gradients) of the inputs with respect to the outputs.
+ partial derivatives (gradients) of the inputs with respect to the outputs.
Allows for customization of the gradient function for a particular operation
in the graph.
- *Functions*: Define a subgraph that may be called in multiple places in the
diff --git a/tensorflow/docs_src/extend/tool_developers/index.md b/tensorflow/docs_src/extend/tool_developers/index.md
index 3705b310ed..06fc5e70dd 100644
--- a/tensorflow/docs_src/extend/tool_developers/index.md
+++ b/tensorflow/docs_src/extend/tool_developers/index.md
@@ -63,7 +63,7 @@ There are actually two different formats that a ProtoBuf can be saved in.
TextFormat is a human-readable form, which makes it nice for debugging and
editing, but can get large when there's numerical data like weights stored in
it. You can see a small example of that in
-[graph_run_run2.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/components/tf_tensorboard/test/data/graph_run_run2.pbtxt).
+[graph_run_run2.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/demo/data/graph_run_run2.pbtxt).
Binary format files are a lot smaller than their text equivalents, even though
they're not as readable for us. In this script, we ask the user to supply a
diff --git a/tensorflow/docs_src/get_started/mnist/pros.md b/tensorflow/docs_src/get_started/mnist/pros.md
index 5dbb00c0b5..d50e874d52 100644
--- a/tensorflow/docs_src/get_started/mnist/pros.md
+++ b/tensorflow/docs_src/get_started/mnist/pros.md
@@ -392,7 +392,7 @@ The differences are that:
- We will add logging to every 100th iteration in the training process.
We will also use tf.Session rather than tf.InteractiveSession. This better
-separates the process of creating the graph (model sepecification) and the
+separates the process of creating the graph (model specification) and the
process of evaluating the graph (model fitting). It generally makes for cleaner
code. The tf.Session is created within a [`with` block](https://docs.python.org/3/whatsnew/2.6.html#pep-343-the-with-statement)
so that it is automatically destroyed once the block is exited.
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index c1c7b66546..91189f199d 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -35,7 +35,7 @@ enable TensorFlow for C:
OS="linux" # Change to "darwin" for Mac OS
TARGET_DIRECTORY="/usr/local"
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.1.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-${OS}-x86_64-1.2.0-rc2.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
@@ -73,7 +73,7 @@ After installing TensorFlow for C, enter the following code into a file named
#include <tensorflow/c/c_api.h>
int main() {
- printf(“Hello from TensorFlow C library version %s\n”, TF_Version());
+ printf("Hello from TensorFlow C library version %s\n", TF_Version());
return 0;
}
```
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index c9abaf2aca..c9b8dffadb 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -35,7 +35,7 @@ steps to install this library and enable TensorFlow for Go:
TF_TYPE="cpu" # Change to "gpu" for GPU support
TARGET_DIRECTORY='/usr/local'
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.1.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-${TF_TYPE}-$(go env GOOS)-x86_64-1.2.0-rc2.tar.gz" |
sudo tar -C $TARGET_DIRECTORY -xz
The `tar` command extracts the TensorFlow C library into the `lib`
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 72d0c7b1ff..612c4c94f2 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -34,7 +34,7 @@ following to the project's `pom.xml` to use the TensorFlow Java APIs:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.1.0</version>
+ <version>1.2.0-rc2</version>
</dependency>
```
@@ -63,7 +63,7 @@ As an example, these steps will create a Maven project that uses TensorFlow:
<dependency>
<groupId>org.tensorflow</groupId>
<artifactId>tensorflow</artifactId>
- <version>1.1.0</version>
+ <version>1.2.0-rc2</version>
</dependency>
</dependencies>
</project>
@@ -122,7 +122,7 @@ refer to the simpler instructions above instead.
Take the following steps to install TensorFlow for Java on Linux or Mac OS:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.1.0.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.2.0-rc2.jar),
which is the TensorFlow Java Archive (JAR).
2. Decide whether you will run TensorFlow for Java on CPU(s) only or with
@@ -141,7 +141,7 @@ Take the following steps to install TensorFlow for Java on Linux or Mac OS:
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
mkdir -p ./jni
curl -L \
- "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.1.0.tar.gz" |
+ "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-${TF_TYPE}-${OS}-x86_64-1.2.0-rc2.tar.gz" |
tar -xz -C ./jni
### Install on Windows
@@ -149,10 +149,10 @@ Take the following steps to install TensorFlow for Java on Linux or Mac OS:
Take the following steps to install TensorFlow for Java on Windows:
1. Download
- [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.1.0.jar),
+ [libtensorflow.jar](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-1.2.0-rc2.jar),
which is the TensorFlow Java Archive (JAR).
2. Download the following Java Native Interface (JNI) file appropriate for
- [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.1.0.zip).
+ [TensorFlow for Java on Windows](https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow_jni-cpu-windows-x86_64-1.2.0-rc2.zip).
3. Extract this .zip file.
@@ -200,7 +200,7 @@ must be part of your `classpath`. For example, you can include the
downloaded `.jar` in your `classpath` by using the `-cp` compilation flag
as follows:
-<pre><b>javac -cp libtensorflow-1.1.0.jar HelloTF.java</b></pre>
+<pre><b>javac -cp libtensorflow-1.2.0-rc2.jar HelloTF.java</b></pre>
### Running
@@ -214,11 +214,11 @@ two files are available to the JVM:
For example, the following command line executes the `HelloTF` program on Linux
and Mac OS X:
-<pre><b>java -cp libtensorflow-1.1.0.jar:. -Djava.library.path=./jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.2.0-rc2.jar:. -Djava.library.path=./jni HelloTF</b></pre>
-And the following comand line executes the `HelloTF` program on Windows:
+And the following command line executes the `HelloTF` program on Windows:
-<pre><b>java -cp libtensorflow-1.1.0.jar;. -Djava.library.path=jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.2.0-rc2.jar;. -Djava.library.path=jni HelloTF</b></pre>
If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
installed TensorFlow for Java and are ready to use the API. If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index 47c8fc77ee..8ce4acda13 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -69,6 +69,8 @@ supported choices are as follows:
* ["native" pip](#InstallingNativePip)
* [Docker](#InstallingDocker)
* [Anaconda](#InstallingAnaconda)
+ * installing from sources, which is documented in
+ [a separate guide](https://www.tensorflow.org/install/install_sources).
**We recommend the virtualenv installation.**
[Virtualenv](https://virtualenv.pypa.io/en/stable/)
@@ -114,12 +116,12 @@ Take the following steps to install TensorFlow with Virtualenv:
1. Install pip and virtualenv by issuing one of the following commands:
<pre>$ <b>sudo apt-get install python-pip python-dev python-virtualenv</b> # for Python 2.7
- $ <b>sudo apt-get install python3-pip python3-dev python-virtualenv</b> # for Python 3.n</pre>
+ $ <b>sudo apt-get install python3-pip python3-dev python-virtualenv</b> # for Python 3.n</pre>
2. Create a virtualenv environment by issuing one of the following commands:
<pre>$ <b>virtualenv --system-site-packages</b> <i>targetDirectory</i> # for Python 2.7
- $ <b>virtualenv --system-site-packages -p python3</b> <i>targetDirectory</i> # for Python 3.n</pre>
+ $ <b>virtualenv --system-site-packages -p python3</b> <i>targetDirectory</i> # for Python 3.n</pre>
where <code><em>targetDirectory</em></code> specifies the top of the
virtualenv tree. Our instructions assume that
@@ -129,22 +131,22 @@ Take the following steps to install TensorFlow with Virtualenv:
3. Activate the virtualenv environment by issuing one of the following
commands:
- <pre> $ <b>source ~/tensorflow/bin/activate</b> # bash, sh, ksh, or zsh
+ <pre>$ <b>source ~/tensorflow/bin/activate</b> # bash, sh, ksh, or zsh
$ <b>source ~/tensorflow/bin/activate.csh</b> # csh or tcsh</pre>
The preceding <tt>source</tt> command should change your prompt
to the following:
- <pre> (tensorflow)$ </pre>
+ <pre>(tensorflow)$ </pre>
4. Ensure pip ≥8.1 is installed:
- <pre> (tensorflow)$ <b>easy_install -U pip</b></pre>
+ <pre>(tensorflow)$ <b>easy_install -U pip</b></pre>
5. Issue one of the following commands to install TensorFlow in the active
virtualenv environment:
- <pre> (tensorflow)$ <b>pip install --upgrade tensorflow</b> # for Python 2.7
+ <pre>(tensorflow)$ <b>pip install --upgrade tensorflow</b> # for Python 2.7
(tensorflow)$ <b>pip3 install --upgrade tensorflow</b> # for Python 3.n
(tensorflow)$ <b>pip install --upgrade tensorflow-gpu</b> # for Python 2.7 and GPU
(tensorflow)$ <b>pip3 install --upgrade tensorflow-gpu</b> # for Python 3.n and GPU</pre>
@@ -152,6 +154,26 @@ Take the following steps to install TensorFlow with Virtualenv:
If the preceding command succeeds, skip Step 5. If the preceding
command fails, perform Step 5.
+ 5. (Optional) If Step 4 failed (typically because you invoked a pip version
+ lower than 8.1), install TensorFlow in the active virtualenv environment
+ by issuing a command of the following format:
+
+ <pre>(tensorflow)$ <b>pip install --upgrade</b> <i>tfBinaryURL</i> # Python 2.7
+ (tensorflow)$ <b>pip3 install --upgrade</b> <i>tfBinaryURL</i> # Python 3.n </pre>
+
+ where <code><em>tfBinaryURL</em></code> identifies the URL of the
+ TensorFlow Python package. The appropriate value of
+      <code><em>tfBinaryURL</em></code> depends on the operating system,
+ Python version, and GPU support. Find the appropriate value for
+ <code><em>tfBinaryURL</em></code> for your system
+ [here](#the_url_of_the_tensorflow_python_package). For example, if you
+      are installing TensorFlow for Linux, Python 3.4, and CPU-only support,
+ issue the following command to install TensorFlow in the active
+ virtualenv environment:
+
+ <pre>(tensorflow)$ <b>pip3 install --upgrade \
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.0rc2-cp34-cp34m-linux_x86_64.whl</b></pre>
+
If you encounter installation problems, see
[Common Installation Problems](#common_installation_problems).
@@ -230,7 +252,7 @@ take the following steps:
1. Install TensorFlow by invoking **one** of the following commands:
- <pre> $ <b>pip install tensorflow</b> # Python 2.7; CPU support (no GPU support)
+ <pre>$ <b>pip install tensorflow</b> # Python 2.7; CPU support (no GPU support)
$ <b>pip3 install tensorflow</b> # Python 3.n; CPU support (no GPU support)
$ <b>pip install tensorflow-gpu</b> # Python 2.7; GPU support
$ <b>pip3 install tensorflow-gpu</b> # Python 3.n; GPU support </pre>
@@ -241,7 +263,7 @@ take the following steps:
2. (Optional.) If Step 1 failed, install the latest version of TensorFlow
by issuing a command of the following format:
- <pre> $ <b>sudo pip install --upgrade</b> <i>tfBinaryURL</i> # Python 2.7
+ <pre>$ <b>sudo pip install --upgrade</b> <i>tfBinaryURL</i> # Python 2.7
$ <b>sudo pip3 install --upgrade</b> <i>tfBinaryURL</i> # Python 3.n </pre>
where <code><em>tfBinaryURL</em></code> identifies the URL of the
@@ -255,7 +277,7 @@ take the following steps:
<pre>
$ <b>sudo pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.1.0-cp34-cp34m-linux_x86_64.whl</b>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.0rc2-cp34-cp34m-linux_x86_64.whl</b>
</pre>
If this step fails, see
@@ -427,13 +449,13 @@ Take the following steps to install TensorFlow in an Anaconda environment:
3. Activate the conda environment by issuing the following command:
- <pre> $ <b>source activate tensorflow</b>
+ <pre>$ <b>source activate tensorflow</b>
(tensorflow)$ # Your prompt should change </pre>
4. Issue a command of the following format to install
TensorFlow inside your conda environment:
- <pre> (tensorflow)$ <b>pip install --ignore-installed --upgrade</b> <i>tfBinaryURL</i></pre>
+ <pre>(tensorflow)$ <b>pip install --ignore-installed --upgrade</b> <i>tfBinaryURL</i></pre>
where <code><em>tfBinaryURL</em></code> is the
[URL of the TensorFlow Python package](#the_url_of_the_tensorflow_python_package).
@@ -442,7 +464,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
<pre>
(tensorflow)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.1.0-cp34-cp34m-linux_x86_64.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.0rc2-cp34-cp34m-linux_x86_64.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@@ -610,14 +632,14 @@ This section documents the relevant values for Linux installations.
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.1.0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.0rc2-cp27-none-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp27-none-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.0rc2-cp27-none-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -629,14 +651,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.1.0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.0rc2-cp34-cp34m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp34-cp34m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.0rc2-cp34-cp34m-linux_x86_64.whl
</pre>
Note that GPU support requires the NVIDIA hardware and software described in
@@ -648,14 +670,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.1.0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.0rc2-cp35-cp35m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp35-cp35m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.0rc2-cp35-cp35m-linux_x86_64.whl
</pre>
@@ -667,14 +689,14 @@ Note that GPU support requires the NVIDIA hardware and software described in
CPU only:
<pre>
-https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.1.0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.2.0rc2-cp36-cp36m-linux_x86_64.whl
</pre>
GPU support:
<pre>
-https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp36-cp36m-linux_x86_64.whl
+https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.2.0rc2-cp36-cp36m-linux_x86_64.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index 43ffc961aa..f85ecefb83 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -11,8 +11,8 @@ You must pick the mechanism by which you install TensorFlow. The supported choic
* virtualenv
* "native" pip
* Docker
- * installing from sources, which is for experts and is documented in
- a separate guide.
+ * installing from sources, which is documented in
+ [a separate guide](https://www.tensorflow.org/install/install_sources).
**We recommend the virtualenv installation.**
[Virtualenv](https://virtualenv.pypa.io/en/stable/)
@@ -91,6 +91,26 @@ Take the following steps to install TensorFlow with Virtualenv:
<pre> (tensorflow)$ <b>pip install --upgrade tensorflow</b> # for Python 2.7
(tensorflow)$ <b>pip3 install --upgrade tensorflow</b> # for Python 3.n
+ 7. Optional. If Step 6 failed (typically because you invoked a pip version
+ lower than 8.1), install TensorFlow in the active
+ virtualenv environment by issuing a command of the following format:
+
+ <pre> $ <b>pip install --upgrade</b> <i>tfBinaryURL</i> # Python 2.7
+ $ <b>pip3 install --upgrade</b> <i>tfBinaryURL</i> # Python 3.n </pre>
+
+ where <i>tfBinaryURL</i> identifies the URL
+ of the TensorFlow Python package. The appropriate value of
+ <i>tfBinaryURL</i> depends on the operating system and
+ Python version. Find the appropriate value for
+ <i>tfBinaryURL</i> for your system
+ [here](#the_url_of_the_tensorflow_python_package).
+ For example, if you are installing TensorFlow for Mac OS X,
+ Python 2.7, the command to install
+ TensorFlow in the active Virtualenv is as follows:
+
+     <pre> $ <b>pip install --upgrade \
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.2.0rc2-py2-none-any.whl</b></pre>
+
If you encounter installation problems, see
[Common Installation Problems](#common-installation-problems).
@@ -210,7 +230,7 @@ take the following steps:
issue the following command:
<pre> $ <b>sudo pip3 install --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.1.0rc2-py2-none-any.whl</b> </pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.2.0rc2-py2-none-any.whl</b> </pre>
If the preceding command fails, see
[installation problems](#common-installation-problems).
@@ -319,7 +339,7 @@ Take the following steps to install TensorFlow in an Anaconda environment:
TensorFlow for Python 2.7:
<pre> (tensorflow)$ <b>pip install --ignore-installed --upgrade \
- https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.1.0rc2-py2-none-any.whl</b></pre>
+ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.2.0rc2-py2-none-any.whl</b></pre>
<a name="ValidateYourInstallation"></a>
@@ -492,7 +512,7 @@ This section documents the relevant values for Mac OS installations.
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.1.0rc2-py2-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.2.0rc2-py2-none-any.whl
</pre>
@@ -500,7 +520,7 @@ https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.1.0rc2-py2-none-a
<pre>
-https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.1.0rc2-py3-none-any.whl
+https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.2.0rc2-py3-none-any.whl
</pre>
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index 8dd7870faa..6699c7069a 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -223,7 +223,7 @@ creating the pip package and installing TensorFlow.
If you wish to build TensorFlow with GPU, `configure` will ask
you to specify the version numbers of Cuda and cuDNN. If several
versions of Cuda or cuDNN are installed on your system, explicitly select
-the desired version instead of relying on the system default.
+the desired version instead of relying on the default.
Here is an example execution of the `configure` script. Note that your
own input will likely differ from our sample input:
@@ -233,6 +233,14 @@ own input will likely differ from our sample input:
$ <b>cd tensorflow</b> # cd to the top-level directory created
$ <b>./configure</b>
Please specify the location of python. [Default is /usr/bin/python]: <b>/usr/bin/python2.7</b>
+Found possible Python library paths:
+ /usr/local/lib/python2.7/dist-packages
+ /usr/lib/python2.7/dist-packages
+Please input the desired Python library path to use. Default is [/usr/lib/python2.7/dist-packages]
+
+Using python library path: /usr/local/lib/python2.7/dist-packages
+Do you wish to build TensorFlow with MKL support? [y/N]
+No MKL support will be enabled for TensorFlow
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Do you wish to use jemalloc as the malloc implementation? [Y/n]
jemalloc enabled
@@ -241,31 +249,26 @@ No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N]
No Hadoop File System support will be enabled for TensorFlow
Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N]
-No XLA JIT support will be enabled for TensorFlow
-Found possible Python library paths:
- /usr/local/lib/python2.7/dist-packages
- /usr/lib/python2.7/dist-packages
-Please input the desired Python library path to use. Default is [/usr/local/lib/python2.7/dist-packages]
-Using python library path: /usr/local/lib/python2.7/dist-packages
-Do you wish to build TensorFlow with OpenCL support? [y/N] N
+No XLA support will be enabled for TensorFlow
+Do you wish to build TensorFlow with VERBS support? [y/N]
+No VERBS support will be enabled for TensorFlow
+Do you wish to build TensorFlow with OpenCL support? [y/N]
No OpenCL support will be enabled for TensorFlow
-Do you wish to build TensorFlow with CUDA support? [y/N] Y
+Do you wish to build TensorFlow with CUDA support? [y/N] <b>Y</b>
CUDA support will be enabled for TensorFlow
-Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
-Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: <b>8.0</b>
+Do you want to use clang as CUDA compiler? [y/N]
+nvcc will be used as CUDA compiler
+Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 8.0]: <b>8.0</b>
Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
-Please specify the cuDNN version you want to use. [Leave empty to use system default]: <b>5</b>
-Please specify the location where cuDNN 5 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
+Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
+Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 6.0]: <b>6</b>
+Please specify the location where cuDNN 6 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: <b>3.0</b>
-Setting up Cuda include
-Setting up Cuda lib
-Setting up Cuda bin
-Setting up Cuda nvvm
-Setting up CUPTI include
-Setting up CUPTI lib64
+Do you wish to build TensorFlow with MPI support? [y/N]
+MPI support will not be enabled for TensorFlow
Configuration finished
</pre>
@@ -320,10 +323,10 @@ Invoke `pip install` to install that pip package.
The filename of the `.whl` file depends on your platform.
For example, the following command will install the pip package
-for TensorFlow 1.1.0 on Linux:
+for TensorFlow 1.2.0rc2 on Linux:
<pre>
-$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.1.0-py2-none-any.whl</b>
+$ <b>sudo pip install /tmp/tensorflow_pkg/tensorflow-1.2.0rc2-py2-none-any.whl</b>
</pre>
## Validate your installation
diff --git a/tensorflow/docs_src/install/install_windows.md b/tensorflow/docs_src/install/install_windows.md
index db7c661aa1..42820660ee 100644
--- a/tensorflow/docs_src/install/install_windows.md
+++ b/tensorflow/docs_src/install/install_windows.md
@@ -114,12 +114,12 @@ Take the following steps to install TensorFlow in an Anaconda environment:
environment. To install the CPU-only version of TensorFlow, enter the
following command:
- <pre>(tensorflow)C:\> <b>pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/cpu/tensorflow-1.1.0-cp35-cp35m-win_amd64.whl</b> </pre>
+ <pre>(tensorflow)C:\> <b>pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/cpu/tensorflow-1.2.0rc2-cp35-cp35m-win_amd64.whl</b> </pre>
To install the GPU version of TensorFlow, enter the following command
(on a single line):
- <pre>(tensorflow)C:\> <b>pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/gpu/tensorflow_gpu-1.1.0-cp35-cp35m-win_amd64.whl</b> </pre>
+ <pre>(tensorflow)C:\> <b>pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/gpu/tensorflow_gpu-1.2.0rc2-cp35-cp35m-win_amd64.whl</b> </pre>
## Validate your installation
diff --git a/tensorflow/docs_src/performance/benchmarks.md b/tensorflow/docs_src/performance/benchmarks.md
index 47ab028e20..20165a090e 100644
--- a/tensorflow/docs_src/performance/benchmarks.md
+++ b/tensorflow/docs_src/performance/benchmarks.md
@@ -92,7 +92,7 @@ addition to the batch sizes listed in the table, InceptionV3, ResNet-50,
ResNet-152, and VGG16 were tested with a batch size of 32. Those results are in
the *other results* section.
-Options | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+Options | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
------------------ | ----------- | --------- | ---------- | ------- | -----
Batch size per GPU | 64 | 64 | 64 | 512 | 64
Optimizer | sgd | sgd | sgd | sgd | sgd
@@ -120,7 +120,7 @@ VGG16 | replicated (with NCCL) | n/a
**Training synthetic data**
-GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1 | 142 | 219 | 91.8 | 2987 | 154
2 | 284 | 422 | 181 | 5658 | 295
@@ -129,7 +129,7 @@ GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
**Training real data**
-GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1 | 142 | 218 | 91.4 | 2890 | 154
2 | 278 | 425 | 179 | 4448 | 284
@@ -182,7 +182,7 @@ addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were
tested with a batch size of 32. Those results are in the *other results*
section.
-Options | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+Options | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
------------------ | ----------- | --------- | ---------- | ------- | -----
Batch size per GPU | 64 | 64 | 32 | 512 | 32
Optimizer | sgd | sgd | sgd | sgd | sgd
@@ -199,7 +199,7 @@ The configuration used for each model was `variable_update` equal to
**Training synthetic data**
-GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1 | 30.5 | 51.9 | 20.0 | 656 | 35.4
2 | 57.8 | 99.0 | 38.2 | 1209 | 64.8
@@ -208,7 +208,7 @@ GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
**Training real data**
-GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1 | 30.6 | 51.2 | 20.0 | 639 | 34.2
2 | 58.4 | 98.8 | 38.3 | 1136 | 62.9
@@ -257,7 +257,7 @@ addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were
tested with a batch size of 32. Those results are in the *other results*
section.
-Options | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+Options | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
------------------ | ----------- | --------- | ---------- | ------- | -----
Batch size per GPU | 64 | 64 | 32 | 512 | 32
Optimizer | sgd | sgd | sgd | sgd | sgd
@@ -281,7 +281,7 @@ VGG16 | parameter_server | gpu
**Training synthetic data**
-GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1 | 30.8 | 51.5 | 19.7 | 684 | 36.3
2 | 58.7 | 98.0 | 37.6 | 1244 | 69.4
@@ -290,7 +290,7 @@ GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
**Training real data**
-GPUs | InceptionV3 | ResNet-50 | ResNet-152 | Alexnet | VGG16
+GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16
---- | ----------- | --------- | ---------- | ------- | -----
1 | 30.5 | 51.3 | 19.7 | 674 | 36.3
2 | 59.0 | 94.9 | 38.2 | 1227 | 67.5
diff --git a/tensorflow/docs_src/performance/performance_models.md b/tensorflow/docs_src/performance/performance_models.md
index d48431eaa0..aa4261f545 100644
--- a/tensorflow/docs_src/performance/performance_models.md
+++ b/tensorflow/docs_src/performance/performance_models.md
@@ -133,7 +133,7 @@ Benefits of using this scheme:
## Best Practices in Building High-Performance Models
Collected below are a couple of additional best practices that can improve
-performance and increase the flexiblity of models.
+performance and increase the flexibility of models.
### Build the model with both NHWC and NCHW
diff --git a/tensorflow/docs_src/performance/quantization.md b/tensorflow/docs_src/performance/quantization.md
index 49c25027fc..4667b4cad7 100644
--- a/tensorflow/docs_src/performance/quantization.md
+++ b/tensorflow/docs_src/performance/quantization.md
@@ -153,7 +153,7 @@ bit.
The min and max operations actually look at the values in the input float
tensor, and then feeds them into the Dequantize operation that converts the
-tensor into eight-bits. There're more details on how the quantized representation
+tensor into eight-bits. There are more details on how the quantized representation
works later on.
Once the individual operations have been converted, the next stage is to remove
diff --git a/tensorflow/docs_src/performance/xla/index.md b/tensorflow/docs_src/performance/xla/index.md
index d2c1843327..19045b45d9 100644
--- a/tensorflow/docs_src/performance/xla/index.md
+++ b/tensorflow/docs_src/performance/xla/index.md
@@ -65,13 +65,13 @@ The following diagram shows the compilation process in XLA:
<img src="https://www.tensorflow.org/images/how-does-xla-work.png">
</div>
-XLA comes with several optimizations and analyses that are target-independent,
+XLA comes with several optimizations and analyses that are target-independent,
such as [CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination),
target-independent operation fusion, and buffer analysis for allocating runtime
memory for the computation.
After the target-independent step, XLA sends the HLO computation to a backend.
-The backend can perform further HLO-level analyses and optimizations, this time
+The backend can perform further HLO-level analyses and optimizations, this time
with target specific information and needs in mind. For example, the XLA GPU
backend may perform operation fusion beneficial specifically for the GPU
programming model and determine how to partition the computation into streams.
diff --git a/tensorflow/docs_src/programmers_guide/faq.md b/tensorflow/docs_src/programmers_guide/faq.md
index fa8b6fb7f1..e31d2717a6 100644
--- a/tensorflow/docs_src/programmers_guide/faq.md
+++ b/tensorflow/docs_src/programmers_guide/faq.md
@@ -189,7 +189,7 @@ operation for that variable in a session. It is destroyed when that
Variables allow concurrent read and write operations. The value read from a
variable may change if it is concurrently updated. By default, concurrent
-assigment operations to a variable are allowed to run with no mutual exclusion.
+assignment operations to a variable are allowed to run with no mutual exclusion.
To acquire a lock when assigning to a variable, pass `use_locking=True` to
@{tf.Variable.assign}.
diff --git a/tensorflow/docs_src/programmers_guide/reading_data.md b/tensorflow/docs_src/programmers_guide/reading_data.md
index 088724337e..3c31d3a1a7 100644
--- a/tensorflow/docs_src/programmers_guide/reading_data.md
+++ b/tensorflow/docs_src/programmers_guide/reading_data.md
@@ -332,7 +332,7 @@ limit has been reached and no more examples are available.
The last ingredient is the
@{tf.train.Coordinator}. This is responsible
-for letting all the threads know if anything has signalled a shut down. Most
+for letting all the threads know if anything has signaled a shut down. Most
commonly this would be because an exception was raised, for example one of the
threads got an error when running some operation (or an ordinary Python
exception).
diff --git a/tensorflow/docs_src/programmers_guide/saved_model_cli.md b/tensorflow/docs_src/programmers_guide/saved_model_cli.md
index eb9e60e42e..9851bd7251 100644
--- a/tensorflow/docs_src/programmers_guide/saved_model_cli.md
+++ b/tensorflow/docs_src/programmers_guide/saved_model_cli.md
@@ -189,7 +189,7 @@ inputs that match the dtype and shape of the model signature.
By default, SavedModel CLI will print outputs to console. If a directory is
passed to `--outdir` option, the outputs will be saved as npy files named after
-output tensor keys under the given directory. Use `--overwite` to overwrite
+output tensor keys under the given directory. Use `--overwrite` to overwrite
existing output files.
#### TensorFlow Debugger (tfdbg) Integration
diff --git a/tensorflow/docs_src/programmers_guide/supervisor.md b/tensorflow/docs_src/programmers_guide/supervisor.md
index 55a090df58..ec7c91b147 100644
--- a/tensorflow/docs_src/programmers_guide/supervisor.md
+++ b/tensorflow/docs_src/programmers_guide/supervisor.md
@@ -137,10 +137,10 @@ For example this code runs the summary op every 100 steps in the training loop:
if sv.should_stop():
break
if step % 100 == 0:
- _, summ = session.run([my_train_op, my_summary_op])
+ _, summ = sess.run([my_train_op, my_summary_op])
sv.summary_computed(sess, summ)
else:
- session.run(my_train_op)
+ sess.run(my_train_op)
```
## Pre-trained Model Scenario
@@ -203,15 +203,15 @@ Example: Call `my_additional_summaries()` every 20mn:
```python
-def my_additional_sumaries(sv, sess):
+def my_additional_summaries(sv, sess):
...fetch and write summaries, see below...
...
sv = tf.train.Supervisor(logdir="/my/training/directory")
with sv.managed_session() as sess:
- # Call my_additional_sumaries() every 1200s, or 20mn,
+ # Call my_additional_summaries() every 1200s, or 20mn,
# passing (sv, sess) as arguments.
- sv.loop(1200, my_additional_sumaries, args=(sv, sess))
+ sv.loop(1200, my_additional_summaries, args=(sv, sess))
...main training loop...
```
@@ -226,11 +226,11 @@ better when only one events file in a directory is being actively appended to.
The supervisor provides a helper function to append summaries:
@{tf.train.Supervisor.summary_computed}.
Just pass to the function the output returned by a summary op. Here is an
-example of using that function to implement `my_additional_sumaries()` from the
+example of using that function to implement `my_additional_summaries()` from the
previous example:
```python
-def my_additional_sumaries(sv, sess):
+def my_additional_summaries(sv, sess):
summaries = sess.run(my_additional_summary_op)
sv.summary_computed(sess, summaries)
```
diff --git a/tensorflow/docs_src/programmers_guide/tfdbg-tflearn.md b/tensorflow/docs_src/programmers_guide/tfdbg-tflearn.md
index 92f24f077a..f39465fb31 100644
--- a/tensorflow/docs_src/programmers_guide/tfdbg-tflearn.md
+++ b/tensorflow/docs_src/programmers_guide/tfdbg-tflearn.md
@@ -106,7 +106,7 @@ hooks = [tf_debug.DumpingDebugHook("/shared/storage/location/tfdbg_dumps_1")]
```
Then this `hook` can be used in the same way as the `LocalCLIDebugHook` examples
-above. As the training and/or evalution of `Estimator` or `Experiment`
+above. As the training and/or evaluation of `Estimator` or `Experiment`
happens, directories of the naming pattern
`/shared/storage/location/tfdbg_dumps_1/run_<epoch_timestamp_microsec>_<uuid>`
will appear. Each directory corresponds to a `Session.run()` call that underlies
diff --git a/tensorflow/docs_src/programmers_guide/threading_and_queues.md b/tensorflow/docs_src/programmers_guide/threading_and_queues.md
index 835e806046..7d3edb788e 100644
--- a/tensorflow/docs_src/programmers_guide/threading_and_queues.md
+++ b/tensorflow/docs_src/programmers_guide/threading_and_queues.md
@@ -121,7 +121,7 @@ example = ...ops to create one example...
# Create a queue, and an op that enqueues examples one at a time in the queue.
queue = tf.RandomShuffleQueue(...)
enqueue_op = queue.enqueue(example)
-# Create a training graph that starts by dequeuing a batch of examples.
+# Create a training graph that starts by dequeueing a batch of examples.
inputs = queue.dequeue_many(batch_size)
train_op = ...use 'inputs' to build the training part of the graph...
```
diff --git a/tensorflow/docs_src/programmers_guide/version_semantics.md b/tensorflow/docs_src/programmers_guide/version_semantics.md
index 47fc582387..cee3b105de 100644
--- a/tensorflow/docs_src/programmers_guide/version_semantics.md
+++ b/tensorflow/docs_src/programmers_guide/version_semantics.md
@@ -118,7 +118,7 @@ Many users of TensorFlow will be saving graphs and trained models to disk for
later evaluation or more training, often changing versions of TensorFlow in the
process. First, following semver, any graph or checkpoint written out with one
version of TensorFlow can be loaded and evaluated with a later version of
-TensorFlow with the same major release. However, we will endeavour to preserve
+TensorFlow with the same major release. However, we will endeavor to preserve
backwards compatibility even across major releases when possible, so that the
serialized files are usable over long periods of time.
diff --git a/tensorflow/docs_src/tutorials/layers.md b/tensorflow/docs_src/tutorials/layers.md
index aa8e2cc839..0fdfcf5d2a 100644
--- a/tensorflow/docs_src/tutorials/layers.md
+++ b/tensorflow/docs_src/tutorials/layers.md
@@ -341,7 +341,7 @@ pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
```
Note that convolutional layer #2 takes the output tensor of our first pooling
-layer (`pool1`) as input, and produces the tensor `h_conv2` as output. `conv2`
+layer (`pool1`) as input, and produces the tensor `conv2` as output. `conv2`
has a shape of <code>[<em>batch_size</em>, 14, 14, 64]</code>, the same width
and height as `pool1` (due to `padding="same"`), and 64 channels for the 64
filters applied.
@@ -585,7 +585,7 @@ hand-drawn digits) and training labels (the corresponding value from 0–9 for
each image) as [numpy
arrays](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
in `train_data` and `train_labels`, respectively. Similarly, we store the
-evalulation feature data (10,000 images) and evaluation labels in `eval_data`
+evaluation feature data (10,000 images) and evaluation labels in `eval_data`
and `eval_labels`, respectively.
### Create the Estimator {#create-the-estimator}
diff --git a/tensorflow/docs_src/tutorials/recurrent.md b/tensorflow/docs_src/tutorials/recurrent.md
index 12d6285147..708a9620dd 100644
--- a/tensorflow/docs_src/tutorials/recurrent.md
+++ b/tensorflow/docs_src/tutorials/recurrent.md
@@ -51,11 +51,28 @@ The core of the model consists of an LSTM cell that processes one word at a
time and computes probabilities of the possible values for the next word in the
sentence. The memory state of the network is initialized with a vector of zeros
and gets updated after reading each word. For computational reasons, we will
-process data in mini-batches of size `batch_size`.
+process data in mini-batches of size `batch_size`. In this example, it is important
+to note that `current_batch_of_words` does not correspond to a "sentence" of words.
+Every word in a batch should correspond to time t. TensorFlow will automatically sum
+the gradients of each batch for you.
+
+For example:
+```
+ t=0 t=1 t=2 t=3 t=4
+[The, brown, fox, is, quick]
+[The, red, fox, jumped, high]
+
+words_in_dataset[0] = [The, The]
+words_in_dataset[1] = [fox, fox]
+words_in_dataset[2] = [is, jumped]
+words_in_dataset[3] = [quick, high]
+num_batches = 4, batch_size = 2, time_steps = 5
+```
The basic pseudocode is as follows:
```python
+words_in_dataset = tf.placeholder(tf.float32, [num_batches, batch_size, num_features])
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
diff --git a/tensorflow/docs_src/tutorials/wide.md b/tensorflow/docs_src/tutorials/wide.md
index ce82009903..c2621026c7 100644
--- a/tensorflow/docs_src/tutorials/wide.md
+++ b/tensorflow/docs_src/tutorials/wide.md
@@ -56,8 +56,8 @@ import tempfile
import urllib
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
-urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
-urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)
+urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
+urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)
```
Once the CSV files are downloaded, let's read them into
diff --git a/tensorflow/examples/android/jni/object_tracking/image_utils.h b/tensorflow/examples/android/jni/object_tracking/image_utils.h
index 2d712e77f9..ac9ffd90f8 100644
--- a/tensorflow/examples/android/jni/object_tracking/image_utils.h
+++ b/tensorflow/examples/android/jni/object_tracking/image_utils.h
@@ -67,7 +67,7 @@ inline static void MarkImage(const int x, const int y, const int radius,
// reduce the number of iterations required as compared to starting from
// either 0 and counting up or radius and counting down.
for (int d_x = radius - d_y; d_x <= radius; ++d_x) {
- // The first time this critera is met, we know the width of the circle at
+ // The first time this criteria is met, we know the width of the circle at
// this row (without using sqrt).
if (squared_y_dist + Square(d_x) >= squared_radius) {
const int min_x = MAX(x - d_x, 0);
diff --git a/tensorflow/examples/image_retraining/retrain.py b/tensorflow/examples/image_retraining/retrain.py
index 6c1b40b442..8e3b1a3a36 100644
--- a/tensorflow/examples/image_retraining/retrain.py
+++ b/tensorflow/examples/image_retraining/retrain.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
-r"""Simple transfer learning with an Inception v3 architecture model.
+"""Simple transfer learning with an Inception v3 architecture model.
With support for TensorBoard.
diff --git a/tensorflow/examples/label_image/main.cc b/tensorflow/examples/label_image/main.cc
index 90454bd7ac..a98c0817e3 100644
--- a/tensorflow/examples/label_image/main.cc
+++ b/tensorflow/examples/label_image/main.cc
@@ -50,6 +50,7 @@ limitations under the License.
#include "tensorflow/core/lib/core/threadpool.h"
#include "tensorflow/core/lib/io/path.h"
#include "tensorflow/core/lib/strings/stringprintf.h"
+#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/platform/init_main.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/types.h"
@@ -86,6 +87,29 @@ Status ReadLabelsFile(const string& file_name, std::vector<string>* result,
return Status::OK();
}
+static Status ReadEntireFile(tensorflow::Env* env, const string& filename,
+ Tensor* output) {
+
+ tensorflow::uint64 file_size = 0;
+ TF_RETURN_IF_ERROR(env->GetFileSize(filename, &file_size));
+
+ string contents;
+ contents.resize(file_size);
+
+ std::unique_ptr<tensorflow::RandomAccessFile> file;
+ TF_RETURN_IF_ERROR(env->NewRandomAccessFile(filename, &file));
+
+ tensorflow::StringPiece data;
+ TF_RETURN_IF_ERROR(file->Read(0, file_size, &data, &(contents)[0]));
+ if (data.size() != file_size) {
+ return tensorflow::errors::DataLoss("Truncated read of '", filename,
+ "' expected ", file_size, " got ",
+ data.size());
+ }
+ output->scalar<string>()() = data.ToString();
+ return Status::OK();
+}
+
// Given an image file name, read in the data, try to decode it as an image,
// resize it to the requested size, and then scale the values as desired.
Status ReadTensorFromImageFile(const string& file_name, const int input_height,
@@ -97,8 +121,20 @@ Status ReadTensorFromImageFile(const string& file_name, const int input_height,
string input_name = "file_reader";
string output_name = "normalized";
- auto file_reader =
- tensorflow::ops::ReadFile(root.WithOpName(input_name), file_name);
+
+ // read file_name into a tensor named input
+ Tensor input(tensorflow::DT_STRING, tensorflow::TensorShape());
+ TF_RETURN_IF_ERROR(ReadEntireFile(tensorflow::Env::Default(), file_name,
+ &input));
+
+ // use a placeholder to read input data
+ auto file_reader = Placeholder(root.WithOpName("input"),
+ tensorflow::DataType::DT_STRING);
+
+ std::vector<std::pair<string, tensorflow::Tensor>> inputs = {
+ {"input", input},
+ };
+
// Now try to figure out what kind of file it is and decode it.
const int wanted_channels = 3;
tensorflow::Output image_reader;
@@ -141,7 +177,7 @@ Status ReadTensorFromImageFile(const string& file_name, const int input_height,
std::unique_ptr<tensorflow::Session> session(
tensorflow::NewSession(tensorflow::SessionOptions()));
TF_RETURN_IF_ERROR(session->Create(graph));
- TF_RETURN_IF_ERROR(session->Run({}, {output_name}, {}, out_tensors));
+ TF_RETURN_IF_ERROR(session->Run({inputs}, {output_name}, {}, out_tensors));
return Status::OK();
}
diff --git a/tensorflow/examples/learn/resnet.py b/tensorflow/examples/learn/resnet.py
index 7737f10495..881905fde8 100755
--- a/tensorflow/examples/learn/resnet.py
+++ b/tensorflow/examples/learn/resnet.py
@@ -145,7 +145,7 @@ def res_net(x, y, activation=tf.nn.relu):
target = tf.one_hot(y, depth=10, dtype=tf.float32)
logits = tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
loss = tf.losses.softmax_cross_entropy(target, logits)
- return tf.softmax(logits), loss
+ return tf.nn.softmax(logits), loss
def res_net_model(x, y):
diff --git a/tensorflow/examples/learn/text_classification_cnn.py b/tensorflow/examples/learn/text_classification_cnn.py
index 41fbdba1a7..468a96b58f 100644
--- a/tensorflow/examples/learn/text_classification_cnn.py
+++ b/tensorflow/examples/learn/text_classification_cnn.py
@@ -73,7 +73,7 @@ def cnn_model(features, target):
# Apply regular WX + B and classification.
logits = tf.contrib.layers.fully_connected(pool2, 15, activation_fn=None)
- loss = tf.contrib.losses.softmax_cross_entropy(logits, target)
+ loss = tf.losses.softmax_cross_entropy(target, logits)
train_op = tf.contrib.layers.optimize_loss(
loss,
@@ -105,14 +105,11 @@ def main(unused_argv):
print('Total words: %d' % n_words)
# Build model
- classifier = learn.Estimator(model_fn=cnn_model)
+ classifier = learn.SKCompat(learn.Estimator(model_fn=cnn_model))
# Train and predict
classifier.fit(x_train, y_train, steps=100)
- y_predicted = [
- p['class'] for p in classifier.predict(
- x_test, as_iterable=True)
- ]
+ y_predicted = classifier.predict(x_test)['class']
score = metrics.accuracy_score(y_test, y_predicted)
print('Accuracy: {0:f}'.format(score))
diff --git a/tensorflow/examples/learn/wide_n_deep_tutorial.py b/tensorflow/examples/learn/wide_n_deep_tutorial.py
index c275f53af7..a0c6df821a 100644
--- a/tensorflow/examples/learn/wide_n_deep_tutorial.py
+++ b/tensorflow/examples/learn/wide_n_deep_tutorial.py
@@ -44,7 +44,7 @@ def maybe_download(train_data, test_data):
train_file_name = train_data
else:
train_file = tempfile.NamedTemporaryFile(delete=False)
- urllib.request.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name) # pylint: disable=line-too-long
+ urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name) # pylint: disable=line-too-long
train_file_name = train_file.name
train_file.close()
print("Training data is downloaded to %s" % train_file_name)
@@ -53,7 +53,7 @@ def maybe_download(train_data, test_data):
test_file_name = test_data
else:
test_file = tempfile.NamedTemporaryFile(delete=False)
- urllib.request.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.test", test_file.name) # pylint: disable=line-too-long
+ urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name) # pylint: disable=line-too-long
test_file_name = test_file.name
test_file.close()
print("Test data is downloaded to %s" % test_file_name)
diff --git a/tensorflow/examples/tutorials/deepdream/deepdream.ipynb b/tensorflow/examples/tutorials/deepdream/deepdream.ipynb
index 4ff8e368c4..186c14b4fd 100644
--- a/tensorflow/examples/tutorials/deepdream/deepdream.ipynb
+++ b/tensorflow/examples/tutorials/deepdream/deepdream.ipynb
@@ -120,7 +120,7 @@
},
"outputs": [],
"source": [
- "#!wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip && unzip inception5h.zip"
+ "!wget -nc https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip && unzip -n inception5h.zip"
]
},
{
diff --git a/tensorflow/examples/udacity/1_notmnist.ipynb b/tensorflow/examples/udacity/1_notmnist.ipynb
index 521cbf3000..39674e1aa4 100644
--- a/tensorflow/examples/udacity/1_notmnist.ipynb
+++ b/tensorflow/examples/udacity/1_notmnist.ipynb
@@ -70,7 +70,7 @@
"colab_type": "text"
},
"source": [
- "First, we'll download the dataset to our local machine. The data consists of characters rendered in a variety of fonts on a 28x28 image. The labels are limited to 'A' through 'J' (10 classes). The training set has about 500k and the testset 19000 labelled examples. Given these sizes, it should be possible to train models quickly on any machine."
+ "First, we'll download the dataset to our local machine. The data consists of characters rendered in a variety of fonts on a 28x28 image. The labels are limited to 'A' through 'J' (10 classes). The training set has about 500k and the testset 19000 labeled examples. Given these sizes, it should be possible to train models quickly on any machine."
]
},
{
@@ -109,7 +109,7 @@
"outputId": "0d0f85df-155f-4a89-8e7e-ee32df36ec8d"
},
"source": [
- "url = 'http://commondatastorage.googleapis.com/books1000/'\n",
+ "url = 'https://commondatastorage.googleapis.com/books1000/'\n",
"last_percent_reported = None\n",
"data_root = '.' # Change me to store data elsewhere\n",
"\n",
@@ -168,7 +168,7 @@
},
"source": [
"Extract the dataset from the compressed .tar.gz file.\n",
- "This should give you a set of directories, labelled A through J."
+ "This should give you a set of directories, labeled A through J."
]
},
{
diff --git a/tensorflow/examples/udacity/Dockerfile b/tensorflow/examples/udacity/Dockerfile
index 9f5ef1aca3..3d48ced41b 100644
--- a/tensorflow/examples/udacity/Dockerfile
+++ b/tensorflow/examples/udacity/Dockerfile
@@ -12,4 +12,4 @@ RUN pip install scikit-learn pyreadline Pillow
RUN rm -rf /notebooks/*
ADD *.ipynb /notebooks/
WORKDIR /notebooks
-CMD ["/run_jupyter.sh"]
+CMD ["/run_jupyter.sh", "--allow-root"]
diff --git a/tensorflow/go/op/scope.go b/tensorflow/go/op/scope.go
index d87833f451..a9ec79463a 100644
--- a/tensorflow/go/op/scope.go
+++ b/tensorflow/go/op/scope.go
@@ -49,6 +49,11 @@ func NewScope() *Scope {
return &Scope{graph: tf.NewGraph(), namemap: make(map[string]int), err: new(scopeErr)}
}
+// NewScopeWithGraph creates a Scope initialized with the Graph that's passed in
+func NewScopeWithGraph(g *tf.Graph) *Scope {
+ return &Scope{graph: g, namemap: make(map[string]int), err: new(scopeErr)}
+}
+
// Finalize returns the Graph on which this scope operates on and renders s
// unusable. If there was an error during graph construction, that error is
// returned instead.
diff --git a/tensorflow/go/op/scope_test.go b/tensorflow/go/op/scope_test.go
index b74fd24b26..6fb5d32e50 100644
--- a/tensorflow/go/op/scope_test.go
+++ b/tensorflow/go/op/scope_test.go
@@ -95,6 +95,21 @@ func TestMultipleGeneratedOps(t *testing.T) {
}
}
+func TestScopeWithGraph(t *testing.T) {
+ s1 := NewScope()
+ Const(s1, "hello")
+ graph, err := s1.Finalize()
+ if err != nil {
+ t.Fatal(err)
+ }
+
+ s2 := NewScopeWithGraph(graph)
+ Const(s2.SubScope("addition"), "world")
+ if err := s2.Err(); err != nil {
+ t.Fatal(err)
+ }
+}
+
func Example() {
// This example creates a Graph that multiplies a constant matrix with
// a matrix to be provided during graph execution (via
diff --git a/tensorflow/go/op/wrappers.go b/tensorflow/go/op/wrappers.go
index c414255f93..9f048d3ea0 100644
--- a/tensorflow/go/op/wrappers.go
+++ b/tensorflow/go/op/wrappers.go
@@ -1337,6 +1337,47 @@ func PlaceholderV2(scope *Scope, dtype tf.DataType, shape tf.Shape) (output tf.O
return op.Output(0)
}
+// PlaceholderAttr is an optional argument to Placeholder.
+type PlaceholderAttr func(optionalAttr)
+
+// PlaceholderShape sets the optional shape attribute to value.
+//
+// value: (Optional) The shape of the tensor. If the shape has 0 dimensions, the
+// shape is unconstrained.
+// If not specified, defaults to <unknown_rank:true >
+func PlaceholderShape(value tf.Shape) PlaceholderAttr {
+ return func(m optionalAttr) {
+ m["shape"] = value
+ }
+}
+
+// A placeholder op for a value that will be fed into the computation.
+//
+// N.B. This operation will fail with an error if it is executed. It is
+// intended as a way to represent a value that will always be fed, and to
+// provide attrs that enable the fed value to be checked at runtime.
+//
+// Arguments:
+// dtype: The type of elements in the tensor.
+//
+// Returns A placeholder tensor that must be replaced using the feed mechanism.
+func Placeholder(scope *Scope, dtype tf.DataType, optional ...PlaceholderAttr) (output tf.Output) {
+ if scope.Err() != nil {
+ return
+ }
+ attrs := map[string]interface{}{"dtype": dtype}
+ for _, a := range optional {
+ a(attrs)
+ }
+ opspec := tf.OpSpec{
+ Type: "Placeholder",
+
+ Attrs: attrs,
+ }
+ op := scope.AddOperation(opspec)
+ return op.Output(0)
+}
+
// Pads a tensor with mirrored values.
//
// This operation pads a `input` with mirrored values according to the `paddings`
@@ -4153,7 +4194,7 @@ func UnstageSharedName(value string) UnstageAttr {
// Op is similar to a lightweight Dequeue.
//
-// The basic funtionality is similar to dequeue with many fewer
+// The basic functionality is similar to dequeue with many fewer
// capabilities and options. This Op is optimized for performance.
func Unstage(scope *Scope, dtypes []tf.DataType, optional ...UnstageAttr) (values []tf.Output) {
if scope.Err() != nil {
@@ -4724,7 +4765,7 @@ type QueueCloseV2Attr func(optionalAttr)
// QueueCloseV2CancelPendingEnqueues sets the optional cancel_pending_enqueues attribute to value.
//
// value: If true, all pending enqueue requests that are
-// blocked on the given queue will be cancelled.
+// blocked on the given queue will be canceled.
// If not specified, defaults to false
func QueueCloseV2CancelPendingEnqueues(value bool) QueueCloseV2Attr {
return func(m optionalAttr) {
@@ -4895,76 +4936,6 @@ func FixedLengthRecordDataset(scope *Scope, filenames tf.Output, header_bytes tf
return op.Output(0)
}
-// PlaceholderAttr is an optional argument to Placeholder.
-type PlaceholderAttr func(optionalAttr)
-
-// PlaceholderShape sets the optional shape attribute to value.
-//
-// value: (Optional) The shape of the tensor. If the shape has 0 dimensions, the
-// shape is unconstrained.
-// If not specified, defaults to <unknown_rank:true >
-func PlaceholderShape(value tf.Shape) PlaceholderAttr {
- return func(m optionalAttr) {
- m["shape"] = value
- }
-}
-
-// A placeholder op for a value that will be fed into the computation.
-//
-// N.B. This operation will fail with an error if it is executed. It is
-// intended as a way to represent a value that will always be fed, and to
-// provide attrs that enable the fed value to be checked at runtime.
-//
-// Arguments:
-// dtype: The type of elements in the tensor.
-//
-// Returns A placeholder tensor that must be replaced using the feed mechanism.
-func Placeholder(scope *Scope, dtype tf.DataType, optional ...PlaceholderAttr) (output tf.Output) {
- if scope.Err() != nil {
- return
- }
- attrs := map[string]interface{}{"dtype": dtype}
- for _, a := range optional {
- a(attrs)
- }
- opspec := tf.OpSpec{
- Type: "Placeholder",
-
- Attrs: attrs,
- }
- op := scope.AddOperation(opspec)
- return op.Output(0)
-}
-
-// Creates a dataset that caches elements from `input_dataset`.
-//
-// A CacheDataset will iterate over the input_dataset, and store tensors. If the
-// cache already exists, the cache will be used. If the cache is inappropriate
-// (e.g. cannot be opened, contains tensors of the wrong shape / size), an error
-// will the returned when used.
-//
-// Arguments:
-//
-// filename: A path on the filesystem where we should cache the dataset. Note: this
-// will be a directory.
-//
-//
-func CacheDataset(scope *Scope, input_dataset tf.Output, filename tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
- if scope.Err() != nil {
- return
- }
- attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
- opspec := tf.OpSpec{
- Type: "CacheDataset",
- Input: []tf.Input{
- input_dataset, filename,
- },
- Attrs: attrs,
- }
- op := scope.AddOperation(opspec)
- return op.Output(0)
-}
-
// Deprecated. Use TensorArrayGradV3
func TensorArrayGradV2(scope *Scope, handle tf.Output, flow_in tf.Output, source string) (grad_handle tf.Output) {
if scope.Err() != nil {
diff --git a/tensorflow/java/BUILD b/tensorflow/java/BUILD
index a8910248c1..9abb63c966 100644
--- a/tensorflow/java/BUILD
+++ b/tensorflow/java/BUILD
@@ -5,10 +5,13 @@ package(default_visibility = ["//visibility:private"])
licenses(["notice"]) # Apache 2.0
+load("build_defs", "JAVACOPTS")
+
java_library(
name = "tensorflow",
srcs = [":java_sources"],
data = [":libtensorflow_jni"],
+ javacopts = JAVACOPTS,
visibility = ["//visibility:public"],
)
@@ -27,6 +30,7 @@ java_library(
name = "testutil",
testonly = 1,
srcs = ["src/test/java/org/tensorflow/TestUtil.java"],
+ javacopts = JAVACOPTS,
deps = [":tensorflow"],
)
@@ -34,6 +38,7 @@ java_test(
name = "GraphTest",
size = "small",
srcs = ["src/test/java/org/tensorflow/GraphTest.java"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.GraphTest",
deps = [
":tensorflow",
@@ -46,6 +51,7 @@ java_test(
name = "OperationBuilderTest",
size = "small",
srcs = ["src/test/java/org/tensorflow/OperationBuilderTest.java"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.OperationBuilderTest",
deps = [
":tensorflow",
@@ -58,6 +64,7 @@ java_test(
name = "OperationTest",
size = "small",
srcs = ["src/test/java/org/tensorflow/OperationTest.java"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.OperationTest",
deps = [
":tensorflow",
@@ -71,6 +78,7 @@ java_test(
size = "small",
srcs = ["src/test/java/org/tensorflow/SavedModelBundleTest.java"],
data = ["//tensorflow/cc/saved_model:saved_model_half_plus_two"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.SavedModelBundleTest",
deps = [
":tensorflow",
@@ -83,6 +91,7 @@ java_test(
name = "SessionTest",
size = "small",
srcs = ["src/test/java/org/tensorflow/SessionTest.java"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.SessionTest",
deps = [
":tensorflow",
@@ -95,6 +104,7 @@ java_test(
name = "ShapeTest",
size = "small",
srcs = ["src/test/java/org/tensorflow/ShapeTest.java"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.ShapeTest",
deps = [
":tensorflow",
@@ -107,6 +117,7 @@ java_test(
name = "TensorFlowTest",
size = "small",
srcs = ["src/test/java/org/tensorflow/TensorFlowTest.java"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.TensorFlowTest",
deps = [
":tensorflow",
@@ -118,6 +129,7 @@ java_test(
name = "TensorTest",
size = "small",
srcs = ["src/test/java/org/tensorflow/TensorTest.java"],
+ javacopts = JAVACOPTS,
test_class = "org.tensorflow.TensorTest",
deps = [
":tensorflow",
@@ -151,6 +163,7 @@ cc_binary(
LINKER_EXPORTED_SYMBOLS,
],
"//tensorflow:windows": [],
+ "//tensorflow:windows_msvc": [],
"//conditions:default": [
"-z defs",
"-s",
diff --git a/tensorflow/java/README.md b/tensorflow/java/README.md
index 337b55bccf..2abee05f4e 100644
--- a/tensorflow/java/README.md
+++ b/tensorflow/java/README.md
@@ -21,7 +21,7 @@ native libraries will need to be built from source.
1. Install [bazel](https://www.bazel.build/versions/master/docs/install.html)
-2. Setup the environment to buile TensorFlow from source code
+2. Setup the environment to build TensorFlow from source code
([Linux](https://www.tensorflow.org/versions/master/get_started/os_setup.html#prepare-environment-for-linux)
or [Mac OS
X](https://www.tensorflow.org/versions/master/get_started/os_setup.html#prepare-environment-for-mac-os-x)).
diff --git a/tensorflow/java/build_defs.bzl b/tensorflow/java/build_defs.bzl
new file mode 100644
index 0000000000..750d76301e
--- /dev/null
+++ b/tensorflow/java/build_defs.bzl
@@ -0,0 +1,154 @@
+# -*- Python -*-
+
+# A more robust set of lint and errorprone checks when building
+# Java source to improve code consistency.
+
+XLINT_OPTS = [
+ "-Werror",
+ "-Xlint:all",
+ "-Xlint:-serial",
+ "-Xlint:-try",
+]
+
+# The bazel errorprone plugin currently only enables default errorChecks
+# https://github.com/bazelbuild/bazel/blob/97975603e5ff2247e6bb352e3afd27fea38f108d/src/java_tools/buildjar/java/com/google/devtools/build/buildjar/javac/plugins/errorprone/ErrorPronePlugin.java#L52
+#
+# Default errorChecks are errorprone checkers listed under ENABLED_ERRORS at
+# https://github.com/google/error-prone/blob/c6f24bc387989158d99af28e7ae86755e56c5f38/core/src/main/java/com/google/errorprone/scanner/BuiltInCheckerSuppliers.java#L273
+#
+# Here we enable all available errorprone checks to converge on a consistent
+# code style.
+# https://github.com/google/error-prone/blob/c6f24bc387989158d99af28e7ae86755e56c5f38/core/src/main/java/com/google/errorprone/scanner/BuiltInCheckerSuppliers.java#L260
+
+# This list is from ENABLED_WARNINGS in
+# com/google/errorprone/scanner/BuiltInCheckerSuppliers.java
+EP_ENABLED_WARNINGS = [
+ "-Xep:AmbiguousMethodReference:ERROR",
+ "-Xep:ArgumentSelectionDefectChecker:ERROR",
+ "-Xep:AssertEqualsArgumentOrderChecker:ERROR",
+ "-Xep:BadAnnotationImplementation:ERROR",
+ "-Xep:BadComparable:ERROR",
+ "-Xep:BoxedPrimitiveConstructor:ERROR",
+ "-Xep:CannotMockFinalClass:ERROR",
+ "-Xep:ClassCanBeStatic:ERROR",
+ "-Xep:ClassNewInstance:ERROR",
+ "-Xep:DefaultCharset:ERROR",
+ "-Xep:DoubleCheckedLocking:ERROR",
+ "-Xep:ElementsCountedInLoop:ERROR",
+ "-Xep:EqualsHashCode:ERROR",
+ "-Xep:EqualsIncompatibleType:ERROR",
+ "-Xep:Finally:ERROR",
+ "-Xep:FloatingPointLiteralPrecision:ERROR",
+ "-Xep:FragmentInjection:ERROR",
+ "-Xep:FragmentNotInstantiable:ERROR",
+ "-Xep:FunctionalInterfaceClash:ERROR",
+ "-Xep:FutureReturnValueIgnored:ERROR",
+ "-Xep:GetClassOnEnum:ERROR",
+ "-Xep:ImmutableAnnotationChecker:ERROR",
+ "-Xep:ImmutableEnumChecker:ERROR",
+ "-Xep:IncompatibleModifiers:ERROR",
+ "-Xep:InjectOnConstructorOfAbstractClass:ERROR",
+ "-Xep:InputStreamSlowMultibyteRead:ERROR",
+ "-Xep:IterableAndIterator:ERROR",
+ "-Xep:JavaLangClash:ERROR",
+ "-Xep:JUnit3FloatingPointComparisonWithoutDelta:ERROR",
+ "-Xep:JUnitAmbiguousTestClass:ERROR",
+ "-Xep:LiteralClassName:ERROR",
+ "-Xep:LogicalAssignment:ERROR",
+ "-Xep:MissingFail:ERROR",
+ "-Xep:MissingOverride:ERROR",
+ "-Xep:MutableConstantField:ERROR",
+ "-Xep:NamedParameters:ERROR",
+ "-Xep:NarrowingCompoundAssignment:ERROR",
+ "-Xep:NonAtomicVolatileUpdate:ERROR",
+ "-Xep:NonOverridingEquals:ERROR",
+ "-Xep:NullableConstructor:ERROR",
+ "-Xep:NullablePrimitive:ERROR",
+ "-Xep:NullableVoid:ERROR",
+ "-Xep:OperatorPrecedence:ERROR",
+ "-Xep:OverridesGuiceInjectableMethod:ERROR",
+ "-Xep:PreconditionsInvalidPlaceholder:ERROR",
+ "-Xep:ProtoFieldPreconditionsCheckNotNull:ERROR",
+ "-Xep:ReferenceEquality:ERROR",
+ "-Xep:RequiredModifiers:ERROR",
+ "-Xep:ShortCircuitBoolean:ERROR",
+ "-Xep:SimpleDateFormatConstant:ERROR",
+ "-Xep:StaticGuardedByInstance:ERROR",
+ "-Xep:SynchronizeOnNonFinalField:ERROR",
+ "-Xep:TruthConstantAsserts:ERROR",
+ "-Xep:TypeParameterShadowing:ERROR",
+ "-Xep:TypeParameterUnusedInFormals:ERROR",
+ "-Xep:UnsynchronizedOverridesSynchronized:ERROR",
+ "-Xep:URLEqualsHashCode:ERROR",
+ "-Xep:WaitNotInLoop:ERROR",
+]
+
+# This list is from DISABLED_CHECKS in
+# com/google/errorprone/scanner/BuiltInCheckerSuppliers.java
+EP_DISABLED_CHECKS = [
+ "-Xep:AutoFactoryAtInject:ERROR",
+ "-Xep:AssertFalse:ERROR",
+ "-Xep:AssistedInjectAndInjectOnConstructors:ERROR",
+ "-Xep:AssistedInjectAndInjectOnSameConstructor:ERROR",
+ "-Xep:BigDecimalLiteralDouble:ERROR",
+ "-Xep:BindingToUnqualifiedCommonType:ERROR",
+ "-Xep:ClassName:ERROR",
+ "-Xep:ComparisonContractViolated:ERROR",
+ "-Xep:ConstantField:ERROR",
+ "-Xep:ConstructorInvokesOverridable:ERROR",
+ # False positives, disabled
+ # "-Xep:ConstructorLeaksThis:ERROR",
+ "-Xep:DepAnn:ERROR",
+ "-Xep:DivZero:ERROR",
+ "-Xep:EmptyIfStatement:ERROR",
+ "-Xep:EmptySetMultibindingContributions:ERROR",
+ "-Xep:EmptyTopLevelDeclaration:ERROR",
+ "-Xep:ExpectedExceptionChecker:ERROR",
+ "-Xep:HardCodedSdCardPath:ERROR",
+ "-Xep:InjectedConstructorAnnotations:ERROR",
+ "-Xep:InsecureCipherMode:ERROR",
+ "-Xep:InvalidTargetingOnScopingAnnotation:ERROR",
+ "-Xep:IterablePathParameter:ERROR",
+ "-Xep:JMockTestWithoutRunWithOrRuleAnnotation:ERROR",
+ "-Xep:JavaxInjectOnFinalField:ERROR",
+ "-Xep:LockMethodChecker:ERROR",
+ "-Xep:LongLiteralLowerCaseSuffix:ERROR",
+ "-Xep:MethodCanBeStatic:ERROR",
+ "-Xep:MissingDefault:ERROR",
+ "-Xep:MixedArrayDimensions:ERROR",
+ "-Xep:MoreThanOneQualifier:ERROR",
+ "-Xep:MultiVariableDeclaration:ERROR",
+ "-Xep:MultipleTopLevelClasses:ERROR",
+ "-Xep:NoAllocationChecker:ERROR",
+ "-Xep:NonCanonicalStaticMemberImport:ERROR",
+ "-Xep:NumericEquality:ERROR",
+ "-Xep:PackageLocation:ERROR",
+ "-Xep:PrimitiveArrayPassedToVarargsMethod:ERROR",
+ "-Xep:PrivateConstructorForUtilityClass:ERROR",
+ "-Xep:PrivateConstructorForNoninstantiableModule:ERROR",
+ "-Xep:ProtoStringFieldReferenceEquality:ERROR",
+ "-Xep:QualifierOrScopeOnInjectMethod:ERROR",
+ "-Xep:QualifierWithTypeUse:ERROR",
+ "-Xep:RedundantThrows:ERROR",
+ "-Xep:RemoveUnusedImports:ERROR",
+ "-Xep:ScopeAnnotationOnInterfaceOrAbstractClass:ERROR",
+ "-Xep:ScopeOrQualifierAnnotationRetention:ERROR",
+ "-Xep:StaticQualifiedUsingExpression:ERROR",
+ "-Xep:StaticOrDefaultInterfaceMethod:ERROR",
+ "-Xep:StringEquality:ERROR",
+ "-Xep:TestExceptionChecker:ERROR",
+ # TODO: stylistic changes in code
+ # "-Xep:ThrowsUncheckedException:ERROR",
+ # "-Xep:UngroupedOverloads:ERROR",
+ "-Xep:UnlockMethodChecker:ERROR",
+ "-Xep:UnnecessaryDefaultInEnumSwitch:ERROR",
+ "-Xep:UnnecessaryStaticImport:ERROR",
+ "-Xep:UseBinds:ERROR",
+ "-Xep:VarChecker:ERROR",
+ "-Xep:WildcardImport:ERROR",
+ "-Xep:WrongParameterPackage:ERROR",
+]
+
+EP_OPTS = EP_ENABLED_WARNINGS + EP_DISABLED_CHECKS
+
+JAVACOPTS = XLINT_OPTS + EP_OPTS
diff --git a/tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java b/tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java
index 80a67c3491..d817239919 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/NativeLibrary.java
@@ -159,4 +159,6 @@ final class NativeLibrary {
src.close();
}
}
+
+ private NativeLibrary() {}
}
diff --git a/tensorflow/java/src/main/java/org/tensorflow/Tensor.java b/tensorflow/java/src/main/java/org/tensorflow/Tensor.java
index 692de2289d..f4f853f716 100644
--- a/tensorflow/java/src/main/java/org/tensorflow/Tensor.java
+++ b/tensorflow/java/src/main/java/org/tensorflow/Tensor.java
@@ -489,7 +489,7 @@ public final class Tensor implements AutoCloseable {
// assumes a fully-known shape
int n = 1;
for (int i = 0; i < shape.length; i++) {
- n *= shape[i];
+ n *= (int) shape[i];
}
return n;
}
@@ -508,9 +508,8 @@ public final class Tensor implements AutoCloseable {
return 1;
case STRING:
throw new IllegalArgumentException("STRING tensors do not have a fixed element size");
- default:
- throw new IllegalArgumentException("DataType " + dataType + " is not supported yet");
}
+ throw new IllegalArgumentException("DataType " + dataType + " is not supported yet");
}
private static DataType dataTypeOf(Object o) {
diff --git a/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java b/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java
index fa975e55cd..f6dc3ee1e9 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/GraphTest.java
@@ -47,7 +47,7 @@ public class GraphTest {
// Helper function whose implementation is based on knowledge of how
// TestUtil.transpose_A_times_X is implemented.
- private void validateImportedGraph(Graph g, String prefix) {
+ private static void validateImportedGraph(Graph g, String prefix) {
Operation op = g.operation(prefix + "A");
assertNotNull(op);
assertEquals(prefix + "A", op.name());
diff --git a/tensorflow/java/src/test/java/org/tensorflow/OperationTest.java b/tensorflow/java/src/test/java/org/tensorflow/OperationTest.java
index 101839e6d7..74fdcf484e 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/OperationTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/OperationTest.java
@@ -22,7 +22,6 @@ import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
-
/** Unit tests for {@link org.tensorflow.Operation}. */
@RunWith(JUnit4.class)
public class OperationTest {
@@ -53,7 +52,7 @@ public class OperationTest {
assertEquals(3, split(new int[] {0, 1, 2}, 3));
}
- private int split(int[] values, int num_split) {
+ private static int split(int[] values, int num_split) {
try (Graph g = new Graph()) {
return g.opBuilder("Split", "Split")
.addInput(TestUtil.constant(g, "split_dim", 0))
diff --git a/tensorflow/java/src/test/java/org/tensorflow/TensorFlowTest.java b/tensorflow/java/src/test/java/org/tensorflow/TensorFlowTest.java
index 27e2215f62..a31ea900d1 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/TensorFlowTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/TensorFlowTest.java
@@ -33,7 +33,7 @@ public class TensorFlowTest {
public void registeredOpList() {
// Would be nice to actually parse the output as a tensorflow.OpList protocol buffer message,
// but as of May 2017, bazel support for generating Java code from protocol buffer definitions
- // was not sorted out. Revisit? Till then, at least excercise the code.
+ // was not sorted out. Revisit? Till then, at least exercise the code.
assertTrue(TensorFlow.registeredOpList().length > 0);
}
}
diff --git a/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java b/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java
index e998843f05..44eecc1d1e 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java
@@ -15,6 +15,7 @@ limitations under the License.
package org.tensorflow;
+import static java.nio.charset.StandardCharsets.UTF_8;
import static org.junit.Assert.assertArrayEquals;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
@@ -43,7 +44,7 @@ public class TensorTest {
boolean[] bools = {true, false, true, false};
long[] bools_shape = {4};
byte[] bools_ = TestUtil.bool2byte(bools);
- byte[] strings = "test".getBytes();
+ byte[] strings = "test".getBytes(UTF_8);
long[] strings_shape = {};
byte[] strings_; // raw TF_STRING
try (Tensor t = Tensor.create(strings)) {
diff --git a/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java b/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java
index 265e21203b..6a3a16c2e1 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/TestUtil.java
@@ -54,6 +54,7 @@ public class TestUtil {
/**
* Counts the total number of elements in an ND array.
+ *
* @param array the array to count the elements of
* @return the number of elements
*/
@@ -61,10 +62,9 @@ public class TestUtil {
int count = 0;
for (int i = 0; i < Array.getLength(array); i++) {
Object e = Array.get(array, i);
- if(!e.getClass().isArray()) {
+ if (!e.getClass().isArray()) {
count += 1;
- }
- else {
+ } else {
count += flattenedNumElements(e);
}
}
@@ -73,6 +73,7 @@ public class TestUtil {
/**
* Flattens an ND-array into a 1D-array with the same elements.
+ *
* @param array the array to flatten
* @param elementType the element class (e.g. {@code Integer.TYPE} for an {@code int[]})
* @return a flattened array
@@ -86,10 +87,9 @@ public class TestUtil {
private static int flatten(Object array, Object out, int next) {
for (int i = 0; i < Array.getLength(array); i++) {
Object e = Array.get(array, i);
- if(!e.getClass().isArray()) {
+ if (!e.getClass().isArray()) {
Array.set(out, next++, e);
- }
- else {
+ } else {
next = flatten(e, out, next);
}
}
@@ -99,11 +99,12 @@ public class TestUtil {
/**
* Converts a {@code boolean[]} to a {@code byte[]}.
*
- * <p>Suitable for creating tensors of type {@link DataType#BOOL} using {@link java.nio.ByteBuffer}.
+ * <p>Suitable for creating tensors of type {@link DataType#BOOL} using {@link
+ * java.nio.ByteBuffer}.
*/
public static byte[] bool2byte(boolean[] array) {
byte[] out = new byte[array.length];
- for(int i = 0; i< array.length; i++) {
+ for (int i = 0; i < array.length; i++) {
out[i] = array[i] ? (byte) 1 : (byte) 0;
}
return out;
@@ -112,13 +113,16 @@ public class TestUtil {
/**
* Converts a {@code byte[]} to a {@code boolean[]}.
*
- * <p>Suitable for reading tensors of type {@link DataType#BOOL} using {@link java.nio.ByteBuffer}.
+ * <p>Suitable for reading tensors of type {@link DataType#BOOL} using {@link
+ * java.nio.ByteBuffer}.
*/
public static boolean[] byte2bool(byte[] array) {
boolean[] out = new boolean[array.length];
- for(int i = 0; i< array.length; i++) {
+ for (int i = 0; i < array.length; i++) {
out[i] = array[i] != 0;
}
return out;
}
+
+ private TestUtil() {}
}
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index 31778c2cfd..f8ab41079d 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -28,6 +28,7 @@ load("//tensorflow/core:platform/default/build_config.bzl", "tf_additional_lib_d
load("//tensorflow/core:platform/default/build_config_root.bzl", "tf_additional_plugin_deps")
load("//tensorflow/python:build_defs.bzl", "tf_gen_op_wrapper_private_py")
load("//tensorflow/core:platform/default/build_config_root.bzl", "tf_additional_verbs_deps")
+load("//tensorflow/core:platform/default/build_config_root.bzl", "tf_additional_mpi_deps")
py_library(
name = "python",
@@ -272,6 +273,7 @@ py_test(
data = [":framework/test_file_system.so"],
main = "framework/file_system_test.py",
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":client_testlib",
":data_flow_ops",
@@ -695,6 +697,7 @@ py_test(
srcs = ["framework/contrib_test.py"],
main = "framework/contrib_test.py",
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
"//tensorflow:tensorflow_py",
"//tensorflow/python:client_testlib",
@@ -972,6 +975,7 @@ py_test(
srcs = ["framework/tensor_util_test.py"],
main = "framework/tensor_util_test.py",
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":array_ops",
":client_testlib",
@@ -989,6 +993,7 @@ py_test(
srcs = ["framework/test_util_test.py"],
main = "framework/test_util_test.py",
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":control_flow_ops",
":errors",
@@ -2229,6 +2234,7 @@ cuda_py_test(
],
data = ["//tensorflow/core:image_testdata"],
shard_count = 5,
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -2276,6 +2282,7 @@ cuda_py_test(
":nn_ops_gen",
"//third_party/py/numpy",
],
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -2729,7 +2736,8 @@ tf_py_wrap_cc(
"//util/python:python_headers",
] + (tf_additional_lib_deps() +
tf_additional_plugin_deps() +
- tf_additional_verbs_deps()),
+ tf_additional_verbs_deps() +
+ tf_additional_mpi_deps()),
)
py_library(
@@ -2954,6 +2962,7 @@ py_test(
tags = [
"no_gpu",
"no_pip_gpu", # testInteractivePlacePrunedGraph fails on invalid assumption about GPU ops.
+ "no_windows",
],
deps = [
":array_ops",
@@ -3074,6 +3083,7 @@ py_test(
size = "small",
srcs = ["lib/io/file_io_test.py"],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":client_testlib",
":errors",
@@ -3231,6 +3241,7 @@ py_test(
srcs = ["training/saver_large_variable_test.py"],
srcs_version = "PY2AND3",
tags = [
+ "manual",
"noasan", # http://b/30379628
"notsan", # http://b/30379628
],
@@ -3286,6 +3297,7 @@ py_test(
size = "small",
srcs = ["training/supervisor_test.py"],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":array_ops",
":client_testlib",
@@ -3308,6 +3320,7 @@ py_test(
size = "small",
srcs = ["training/basic_session_run_hooks_test.py"],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":client",
":client_testlib",
@@ -3331,6 +3344,7 @@ py_test(
size = "small",
srcs = ["training/monitored_session_test.py"],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":array_ops",
":client",
@@ -3581,6 +3595,7 @@ py_test(
size = "small",
srcs = ["ops/dequantize_op_test.py"],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":array_ops",
":client_testlib",
@@ -3594,6 +3609,7 @@ py_test(
size = "small",
srcs = ["ops/quantized_conv_ops_test.py"],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":client_testlib",
":framework_for_generated_wrappers",
diff --git a/tensorflow/python/debug/BUILD b/tensorflow/python/debug/BUILD
index 07d0a9ec73..39446b6ca2 100644
--- a/tensorflow/python/debug/BUILD
+++ b/tensorflow/python/debug/BUILD
@@ -472,6 +472,7 @@ py_test(
"cli/curses_ui_test.py",
],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
deps = [
":curses_ui",
":debugger_cli_common",
diff --git a/tensorflow/python/debug/cli/analyzer_cli.py b/tensorflow/python/debug/cli/analyzer_cli.py
index 69b6d9ffdf..da27f4cebe 100644
--- a/tensorflow/python/debug/cli/analyzer_cli.py
+++ b/tensorflow/python/debug/cli/analyzer_cli.py
@@ -368,7 +368,7 @@ class DebugAnalyzer(object):
def add_tensor_filter(self, filter_name, filter_callable):
"""Add a tensor filter.
- A tensor filter is a named callable of the siganture:
+ A tensor filter is a named callable of the signature:
filter_callable(dump_datum, tensor),
wherein dump_datum is an instance of debug_data.DebugTensorDatum carrying
diff --git a/tensorflow/python/debug/cli/analyzer_cli_test.py b/tensorflow/python/debug/cli/analyzer_cli_test.py
index 8b191f332e..ce224fff20 100644
--- a/tensorflow/python/debug/cli/analyzer_cli_test.py
+++ b/tensorflow/python/debug/cli/analyzer_cli_test.py
@@ -498,7 +498,8 @@ class AnalyzerCLISimpleMulAddTest(test_util.TensorFlowTestCase):
cls._is_gpu_available = test.is_gpu_available()
if cls._is_gpu_available:
- cls._main_device = "/job:localhost/replica:0/task:0/gpu:0"
+ gpu_name = test_util.gpu_device_name()
+ cls._main_device = "/job:localhost/replica:0/task:0" + gpu_name
else:
cls._main_device = "/job:localhost/replica:0/task:0/cpu:0"
@@ -1461,7 +1462,8 @@ class AnalyzerCLIControlDepTest(test_util.TensorFlowTestCase):
cls._is_gpu_available = test.is_gpu_available()
if cls._is_gpu_available:
- cls._main_device = "/job:localhost/replica:0/task:0/gpu:0"
+ gpu_name = test_util.gpu_device_name()
+ cls._main_device = "/job:localhost/replica:0/task:0" + gpu_name
else:
cls._main_device = "/job:localhost/replica:0/task:0/cpu:0"
diff --git a/tensorflow/python/debug/cli/debugger_cli_common.py b/tensorflow/python/debug/cli/debugger_cli_common.py
index 9ad49771d1..12e79ab07a 100644
--- a/tensorflow/python/debug/cli/debugger_cli_common.py
+++ b/tensorflow/python/debug/cli/debugger_cli_common.py
@@ -840,7 +840,7 @@ class TabCompletionRegistry(object):
Args:
context_words: A list of context words belonging to the context being
- registerd. It is a list of str, instead of a single string, to support
+ registered. It is a list of str, instead of a single string, to support
synonym words triggering the same tab-completion context, e.g.,
both "drink" and the short-hand "dr" can trigger the same context.
comp_items: A list of completion items, as a list of str.
diff --git a/tensorflow/python/debug/cli/profile_analyzer_cli.py b/tensorflow/python/debug/cli/profile_analyzer_cli.py
index c08605b92b..3304194b1c 100644
--- a/tensorflow/python/debug/cli/profile_analyzer_cli.py
+++ b/tensorflow/python/debug/cli/profile_analyzer_cli.py
@@ -330,7 +330,7 @@ class ProfileAnalyzer(object):
self._arg_parsers["list_profile"] = ap
ap = argparse.ArgumentParser(
- description="Print a Python source file wiht line-level profile "
+ description="Print a Python source file with line-level profile "
"information",
usage=argparse.SUPPRESS)
ap.add_argument(
diff --git a/tensorflow/python/debug/lib/debug_utils.py b/tensorflow/python/debug/lib/debug_utils.py
index 9013cb096d..f1e972940b 100644
--- a/tensorflow/python/debug/lib/debug_utils.py
+++ b/tensorflow/python/debug/lib/debug_utils.py
@@ -121,7 +121,7 @@ def watch_graph(run_options,
are set, the two filtering operations will occur in a logical `AND`
relation. In other words, a node will be included if and only if it
hits both whitelists.
- tensor_dtype_regex_whitelist: Regular-experssion whitelist for Tensor
+ tensor_dtype_regex_whitelist: Regular-expression whitelist for Tensor
data type, e.g., `"^int.*"`.
This whitelist operates in logical `AND` relations to the two whitelists
above.
@@ -210,7 +210,7 @@ def watch_graph_with_blacklists(run_options,
relation. In other words, a node will be excluded if it hits either of
the two blacklists; a node will be included if and only if it hits
neither of the blacklists.
- tensor_dtype_regex_blacklist: Regular-experssion blacklist for Tensor
+ tensor_dtype_regex_blacklist: Regular-expression blacklist for Tensor
data type, e.g., `"^int.*"`.
This blacklist operates in logical `OR` relations to the two whitelists
above.
diff --git a/tensorflow/python/debug/lib/session_debug_testlib.py b/tensorflow/python/debug/lib/session_debug_testlib.py
index 19d4bcae1b..a219ddf9f2 100644
--- a/tensorflow/python/debug/lib/session_debug_testlib.py
+++ b/tensorflow/python/debug/lib/session_debug_testlib.py
@@ -81,7 +81,8 @@ class SessionDebugTestBase(test_util.TensorFlowTestCase):
if test.is_gpu_available():
cls._expected_partition_graph_count = 2
cls._expected_num_devices = 2
- cls._main_device = "/job:localhost/replica:0/task:0/gpu:0"
+ gpu_name = test_util.gpu_device_name()
+ cls._main_device = "/job:localhost/replica:0/task:0" + gpu_name
else:
cls._expected_partition_graph_count = 1
cls._expected_num_devices = 1
diff --git a/tensorflow/python/debug/lib/stepper_test.py b/tensorflow/python/debug/lib/stepper_test.py
index 825c559312..78e7b3b5eb 100644
--- a/tensorflow/python/debug/lib/stepper_test.py
+++ b/tensorflow/python/debug/lib/stepper_test.py
@@ -591,7 +591,7 @@ class StepperAssignAddTest(test_util.TensorFlowTestCase):
with NodeStepper(self.sess, [self.q, self.v_add]) as stepper:
self.assertIsNone(stepper.last_updated())
- def testContToUpdateInvalidatesDumpedIntermedates(self):
+ def testContToUpdateInvalidatesDumpedIntermediates(self):
with NodeStepper(self.sess, [self.q, self.v_add]) as stepper:
self.assertAllClose(400.0, stepper.cont("q:0"))
self.assertItemsEqual(["v/read:0", "p:0"],
diff --git a/tensorflow/python/debug/wrappers/dumping_wrapper.py b/tensorflow/python/debug/wrappers/dumping_wrapper.py
index 0d9b3cfa7e..63229a8539 100644
--- a/tensorflow/python/debug/wrappers/dumping_wrapper.py
+++ b/tensorflow/python/debug/wrappers/dumping_wrapper.py
@@ -86,7 +86,7 @@ class DumpingDebugWrapperSession(framework.NonInteractiveDebugWrapperSession):
"""Implementation of abstrat method in superclass.
See doc of `NonInteractiveDebugWrapperSession.prepare_run_debug_urls()`
- for details. This implentation creates a run-specific subdirectory under
+ for details. This implementation creates a run-specific subdirectory under
self._session_root and stores information regarding run `fetches` and
`feed_dict.keys()` in the subdirectory.
diff --git a/tensorflow/python/debug/wrappers/framework.py b/tensorflow/python/debug/wrappers/framework.py
index ea642adbd1..2c239038e4 100644
--- a/tensorflow/python/debug/wrappers/framework.py
+++ b/tensorflow/python/debug/wrappers/framework.py
@@ -666,7 +666,7 @@ class WatchOptions(object):
are set, the two filtering operations will occur in a logical `AND`
relation. In other words, a node will be included if and only if it
hits both whitelists.
- tensor_dtype_regex_whitelist: Regular-experssion whitelist for Tensor
+ tensor_dtype_regex_whitelist: Regular-expression whitelist for Tensor
data type, e.g., `"^int.*"`.
This whitelist operates in logical `AND` relations to the two whitelists
above.
diff --git a/tensorflow/python/estimator/BUILD b/tensorflow/python/estimator/BUILD
index 82ea1c5b20..8d02ed83af 100644
--- a/tensorflow/python/estimator/BUILD
+++ b/tensorflow/python/estimator/BUILD
@@ -147,6 +147,7 @@ py_test(
":dnn",
":dnn_testing_utils",
":export_export",
+ ":metric_keys",
":numpy_io",
":pandas_io",
":prediction_keys",
@@ -262,6 +263,7 @@ py_test(
"//tensorflow/python:saver_test_utils",
"//tensorflow/python:session",
"//tensorflow/python:state_ops",
+ "//tensorflow/python:summary",
"//tensorflow/python:training",
"//tensorflow/python:util",
"//tensorflow/python:variables",
@@ -492,6 +494,7 @@ py_test(
"//tensorflow/python:platform",
"//tensorflow/python:sparse_tensor",
"//tensorflow/python:state_ops",
+ "//tensorflow/python:summary",
"//tensorflow/python:training",
"//tensorflow/python:variable_scope",
"//tensorflow/python:variables",
diff --git a/tensorflow/python/estimator/canned/dnn_linear_combined_test.py b/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
index f12f080479..f93e422b70 100644
--- a/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
+++ b/tensorflow/python/estimator/canned/dnn_linear_combined_test.py
@@ -361,54 +361,6 @@ class DNNOnlyClassifierTrainTest(
self, _dnn_classifier_fn)
-# A function to mimic dnn-regressor init reuse same tests.
-def _dnn_regressor_fn(
- hidden_units,
- feature_columns,
- model_dir=None,
- label_dimension=1,
- weight_feature_key=None,
- optimizer='Adagrad',
- config=None,
- input_layer_partitioner=None):
- return dnn_linear_combined.DNNLinearCombinedRegressor(
- model_dir=model_dir,
- dnn_hidden_units=hidden_units,
- dnn_feature_columns=feature_columns,
- dnn_optimizer=optimizer,
- label_dimension=label_dimension,
- weight_feature_key=weight_feature_key,
- input_layer_partitioner=input_layer_partitioner,
- config=config)
-
-
-class DNNOnlyRegressorEvaluateTest(
- dnn_testing_utils.BaseDNNRegressorEvaluateTest, test.TestCase):
-
- def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
- test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNRegressorEvaluateTest.__init__(
- self, _dnn_regressor_fn)
-
-
-class DNNOnlyRegressorPredictTest(
- dnn_testing_utils.BaseDNNRegressorPredictTest, test.TestCase):
-
- def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
- test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNRegressorPredictTest.__init__(
- self, _dnn_regressor_fn)
-
-
-class DNNOnlyRegressorTrainTest(
- dnn_testing_utils.BaseDNNRegressorTrainTest, test.TestCase):
-
- def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
- test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNRegressorTrainTest.__init__(
- self, _dnn_regressor_fn)
-
-
class DNNLinearCombinedClassifierIntegrationTest(test.TestCase):
def setUp(self):
diff --git a/tensorflow/python/estimator/canned/dnn_test.py b/tensorflow/python/estimator/canned/dnn_test.py
index 9658596259..145d9471ea 100644
--- a/tensorflow/python/estimator/canned/dnn_test.py
+++ b/tensorflow/python/estimator/canned/dnn_test.py
@@ -28,6 +28,7 @@ from tensorflow.core.example import example_pb2
from tensorflow.core.example import feature_pb2
from tensorflow.python.estimator.canned import dnn
from tensorflow.python.estimator.canned import dnn_testing_utils
+from tensorflow.python.estimator.canned import metric_keys
from tensorflow.python.estimator.canned import prediction_keys
from tensorflow.python.estimator.export import export
from tensorflow.python.estimator.inputs import numpy_io
@@ -39,6 +40,7 @@ from tensorflow.python.ops import data_flow_ops
from tensorflow.python.ops import parsing_ops
from tensorflow.python.platform import gfile
from tensorflow.python.platform import test
+from tensorflow.python.summary.writer import writer_cache
from tensorflow.python.training import input as input_lib
from tensorflow.python.training import queue_runner
@@ -64,62 +66,152 @@ class DNNModelFnTest(dnn_testing_utils.BaseDNNModelFnTest, test.TestCase):
dnn_testing_utils.BaseDNNModelFnTest.__init__(self, dnn._dnn_model_fn)
-class DNNClassifierEvaluateTest(
- dnn_testing_utils.BaseDNNClassifierEvaluateTest, test.TestCase):
+class DNNRegressorEvaluateTest(test.TestCase):
- def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
- test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNClassifierEvaluateTest.__init__(
- self, _dnn_classifier_fn)
+ def setUp(self):
+ self._model_dir = tempfile.mkdtemp()
+ def tearDown(self):
+ if self._model_dir:
+ writer_cache.FileWriterCache.clear()
+ shutil.rmtree(self._model_dir)
-class DNNClassifierPredictTest(
- dnn_testing_utils.BaseDNNClassifierPredictTest, test.TestCase):
+ def test_one_dim(self):
+ """Asserts evaluation metrics for one-dimensional input and logits."""
+ # Create checkpoint: num_inputs=1, hidden_units=(2, 2), num_outputs=1.
+ global_step = 100
+ dnn_testing_utils.create_checkpoint(
+ (([[.6, .5]], [.1, -.1]), ([[1., .8], [-.8, -1.]], [.2, -.2]),
+ ([[-1.], [1.]], [.3]),), global_step, self._model_dir)
- def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
- test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNClassifierPredictTest.__init__(
- self, _dnn_classifier_fn)
+ # Create DNNRegressor and evaluate.
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=(2, 2),
+ feature_columns=[feature_column.numeric_column('age')],
+ model_dir=self._model_dir)
+ def _input_fn():
+ return {'age': [[10.]]}, [[1.]]
+ # Uses identical numbers as DNNModelTest.test_one_dim_logits.
+ # See that test for calculation of logits.
+ # logits = [[-2.08]] => predictions = [-2.08].
+ # loss = (1+2.08)^2 = 9.4864
+ expected_loss = 9.4864
+ self.assertAllClose({
+ metric_keys.MetricKeys.LOSS: expected_loss,
+ metric_keys.MetricKeys.LOSS_MEAN: expected_loss,
+ ops.GraphKeys.GLOBAL_STEP: global_step
+ }, dnn_regressor.evaluate(input_fn=_input_fn, steps=1))
+
+ def test_multi_dim(self):
+ """Asserts evaluation metrics for multi-dimensional input and logits."""
+ # Create checkpoint: num_inputs=2, hidden_units=(2, 2), num_outputs=3.
+ global_step = 100
+ dnn_testing_utils.create_checkpoint(
+ (([[.6, .5], [-.6, -.5]], [.1, -.1]), ([[1., .8], [-.8, -1.]],
+ [.2, -.2]),
+ ([[-1., 1., .5], [-1., 1., .5]], [.3, -.3,
+ .0]),), global_step, self._model_dir)
+ label_dimension = 3
+
+ # Create DNNRegressor and evaluate.
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=(2, 2),
+ feature_columns=[feature_column.numeric_column('age', shape=[2])],
+ label_dimension=label_dimension,
+ model_dir=self._model_dir)
+ def _input_fn():
+ return {'age': [[10., 8.]]}, [[1., -1., 0.5]]
+ # Uses identical numbers as
+ # DNNModelFnTest.test_multi_dim_input_multi_dim_logits.
+ # See that test for calculation of logits.
+ # logits = [[-0.48, 0.48, 0.39]]
+ # loss = (1+0.48)^2 + (-1-0.48)^2 + (0.5-0.39)^2 = 4.3929
+ expected_loss = 4.3929
+ self.assertAllClose({
+ metric_keys.MetricKeys.LOSS: expected_loss,
+ metric_keys.MetricKeys.LOSS_MEAN: expected_loss / label_dimension,
+ ops.GraphKeys.GLOBAL_STEP: global_step
+ }, dnn_regressor.evaluate(input_fn=_input_fn, steps=1))
-class DNNClassifierTrainTest(
- dnn_testing_utils.BaseDNNClassifierTrainTest, test.TestCase):
+class DNNClassifierEvaluateTest(
+ dnn_testing_utils.BaseDNNClassifierEvaluateTest, test.TestCase):
def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNClassifierTrainTest.__init__(
+ dnn_testing_utils.BaseDNNClassifierEvaluateTest.__init__(
self, _dnn_classifier_fn)
-def _dnn_regressor_fn(*args, **kwargs):
- return dnn.DNNRegressor(*args, **kwargs)
-
+class DNNRegressorPredictTest(test.TestCase):
-class DNNRegressorEvaluateTest(
- dnn_testing_utils.BaseDNNRegressorEvaluateTest, test.TestCase):
-
- def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
- test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNRegressorEvaluateTest.__init__(
- self, _dnn_regressor_fn)
+ def setUp(self):
+ self._model_dir = tempfile.mkdtemp()
+ def tearDown(self):
+ if self._model_dir:
+ writer_cache.FileWriterCache.clear()
+ shutil.rmtree(self._model_dir)
-class DNNRegressorPredictTest(
- dnn_testing_utils.BaseDNNRegressorPredictTest, test.TestCase):
+ def test_one_dim(self):
+ """Asserts predictions for one-dimensional input and logits."""
+ # Create checkpoint: num_inputs=1, hidden_units=(2, 2), num_outputs=1.
+ dnn_testing_utils.create_checkpoint(
+ (([[.6, .5]], [.1, -.1]), ([[1., .8], [-.8, -1.]], [.2, -.2]),
+ ([[-1.], [1.]], [.3]),),
+ global_step=0,
+ model_dir=self._model_dir)
- def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
- test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNRegressorPredictTest.__init__(
- self, _dnn_regressor_fn)
+ # Create DNNRegressor and predict.
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=(2, 2),
+ feature_columns=(feature_column.numeric_column('x'),),
+ model_dir=self._model_dir)
+ input_fn = numpy_io.numpy_input_fn(
+ x={'x': np.array([[10.]])}, batch_size=1, shuffle=False)
+ # Uses identical numbers as DNNModelTest.test_one_dim_logits.
+ # See that test for calculation of logits.
+ # logits = [[-2.08]] => predictions = [-2.08].
+ self.assertAllClose({
+ prediction_keys.PredictionKeys.PREDICTIONS: [-2.08],
+ }, next(dnn_regressor.predict(input_fn=input_fn)))
+
+ def test_multi_dim(self):
+ """Asserts predictions for multi-dimensional input and logits."""
+ # Create checkpoint: num_inputs=2, hidden_units=(2, 2), num_outputs=3.
+ dnn_testing_utils.create_checkpoint(
+ (([[.6, .5], [-.6, -.5]], [.1, -.1]),
+ ([[1., .8], [-.8, -1.]], [.2, -.2]), ([[-1., 1., .5], [-1., 1., .5]],
+ [.3, -.3,
+ .0]),), 100, self._model_dir)
+
+ # Create DNNRegressor and predict.
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=(2, 2),
+ feature_columns=(feature_column.numeric_column('x', shape=(2,)),),
+ label_dimension=3,
+ model_dir=self._model_dir)
+ input_fn = numpy_io.numpy_input_fn(
+ # Inputs shape is (batch_size, num_inputs).
+ x={'x': np.array([[10., 8.]])},
+ batch_size=1,
+ shuffle=False)
+ # Uses identical numbers as
+ # DNNModelFnTest.test_multi_dim_input_multi_dim_logits.
+ # See that test for calculation of logits.
+ # logits = [[-0.48, 0.48, 0.39]] => predictions = [-0.48, 0.48, 0.39]
+ self.assertAllClose({
+ prediction_keys.PredictionKeys.PREDICTIONS: [-0.48, 0.48, 0.39],
+ }, next(dnn_regressor.predict(input_fn=input_fn)))
-class DNNRegressorTrainTest(
- dnn_testing_utils.BaseDNNRegressorTrainTest, test.TestCase):
+class DNNClassifierPredictTest(
+ dnn_testing_utils.BaseDNNClassifierPredictTest, test.TestCase):
def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
test.TestCase.__init__(self, methodName)
- dnn_testing_utils.BaseDNNRegressorTrainTest.__init__(
- self, _dnn_regressor_fn)
+ dnn_testing_utils.BaseDNNClassifierPredictTest.__init__(
+ self, _dnn_classifier_fn)
def _queue_parsed_features(feature_map):
@@ -145,6 +237,7 @@ class DNNRegressorIntegrationTest(test.TestCase):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def _test_complete_flow(
@@ -304,6 +397,7 @@ class DNNClassifierIntegrationTest(test.TestCase):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def _as_label(self, data_in_float):
@@ -467,5 +561,172 @@ class DNNClassifierIntegrationTest(test.TestCase):
batch_size=batch_size)
+class DNNRegressorTrainTest(test.TestCase):
+
+ def setUp(self):
+ self._model_dir = tempfile.mkdtemp()
+
+ def tearDown(self):
+ if self._model_dir:
+ writer_cache.FileWriterCache.clear()
+ shutil.rmtree(self._model_dir)
+
+ def test_from_scratch_with_default_optimizer(self):
+ hidden_units = (2, 2)
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=hidden_units,
+ feature_columns=(feature_column.numeric_column('age'),),
+ model_dir=self._model_dir)
+
+ # Train for a few steps, then validate final checkpoint.
+ num_steps = 5
+ dnn_regressor.train(
+ input_fn=lambda: ({'age': ((1,),)}, ((10,),)), steps=num_steps)
+ dnn_testing_utils._assert_checkpoint(
+ self, num_steps, input_units=1, hidden_units=hidden_units,
+ output_units=1, model_dir=self._model_dir)
+
+ def test_from_scratch(self):
+ hidden_units = (2, 2)
+ mock_optimizer = dnn_testing_utils.mock_optimizer(
+ self, hidden_units=hidden_units)
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=hidden_units,
+ feature_columns=(feature_column.numeric_column('age'),),
+ optimizer=mock_optimizer,
+ model_dir=self._model_dir)
+ self.assertEqual(0, mock_optimizer.minimize.call_count)
+
+ # Train for a few steps, then validate optimizer, summaries, and
+ # checkpoint.
+ num_steps = 5
+ summary_hook = dnn_testing_utils._SummaryHook()
+ dnn_regressor.train(
+ input_fn=lambda: ({'age': ((1,),)}, ((5.,),)), steps=num_steps,
+ hooks=(summary_hook,))
+ self.assertEqual(1, mock_optimizer.minimize.call_count)
+ dnn_testing_utils._assert_checkpoint(
+ self, num_steps, input_units=1, hidden_units=hidden_units,
+ output_units=1, model_dir=self._model_dir)
+ summaries = summary_hook.summaries()
+ self.assertEqual(num_steps, len(summaries))
+ for summary in summaries:
+ summary_keys = [v.tag for v in summary.value]
+ self.assertIn(metric_keys.MetricKeys.LOSS, summary_keys)
+ self.assertIn(metric_keys.MetricKeys.LOSS_MEAN, summary_keys)
+
+ def test_one_dim(self):
+ """Asserts train loss for one-dimensional input and logits."""
+ base_global_step = 100
+ hidden_units = (2, 2)
+ dnn_testing_utils.create_checkpoint(
+ (([[.6, .5]], [.1, -.1]), ([[1., .8], [-.8, -1.]], [.2, -.2]),
+ ([[-1.], [1.]], [.3]),), base_global_step, self._model_dir)
+
+ # Uses identical numbers as DNNModelFnTest.test_one_dim_logits.
+ # See that test for calculation of logits.
+ # logits = [-2.08] => predictions = [-2.08]
+ # loss = (1 + 2.08)^2 = 9.4864
+ expected_loss = 9.4864
+ mock_optimizer = dnn_testing_utils.mock_optimizer(
+ self, hidden_units=hidden_units, expected_loss=expected_loss)
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=hidden_units,
+ feature_columns=(feature_column.numeric_column('age'),),
+ optimizer=mock_optimizer,
+ model_dir=self._model_dir)
+ self.assertEqual(0, mock_optimizer.minimize.call_count)
+
+ # Train for a few steps, then validate optimizer, summaries, and
+ # checkpoint.
+ num_steps = 5
+ summary_hook = dnn_testing_utils._SummaryHook()
+ dnn_regressor.train(
+ input_fn=lambda: ({'age': [[10.]]}, [[1.]]), steps=num_steps,
+ hooks=(summary_hook,))
+ self.assertEqual(1, mock_optimizer.minimize.call_count)
+ summaries = summary_hook.summaries()
+ self.assertEqual(num_steps, len(summaries))
+ for summary in summaries:
+ dnn_testing_utils._assert_simple_summary(
+ self,
+ {
+ metric_keys.MetricKeys.LOSS_MEAN: expected_loss,
+ 'dnn/dnn/hiddenlayer_0/fraction_of_zero_values': 0.,
+ 'dnn/dnn/hiddenlayer_1/fraction_of_zero_values': 0.5,
+ 'dnn/dnn/logits/fraction_of_zero_values': 0.,
+ metric_keys.MetricKeys.LOSS: expected_loss,
+ },
+ summary)
+ dnn_testing_utils._assert_checkpoint(
+ self, base_global_step + num_steps, input_units=1,
+ hidden_units=hidden_units, output_units=1, model_dir=self._model_dir)
+
+ def test_multi_dim(self):
+ """Asserts train loss for multi-dimensional input and logits."""
+ base_global_step = 100
+ hidden_units = (2, 2)
+ dnn_testing_utils.create_checkpoint(
+ (([[.6, .5], [-.6, -.5]], [.1, -.1]), ([[1., .8], [-.8, -1.]],
+ [.2, -.2]),
+ ([[-1., 1., .5], [-1., 1., .5]],
+ [.3, -.3, .0]),), base_global_step, self._model_dir)
+ input_dimension = 2
+ label_dimension = 3
+
+ # Uses identical numbers as
+ # DNNModelFnTest.test_multi_dim_input_multi_dim_logits.
+ # See that test for calculation of logits.
+ # logits = [[-0.48, 0.48, 0.39]]
+ # loss = (1+0.48)^2 + (-1-0.48)^2 + (0.5-0.39)^2 = 4.3929
+ expected_loss = 4.3929
+ mock_optimizer = dnn_testing_utils.mock_optimizer(
+ self, hidden_units=hidden_units, expected_loss=expected_loss)
+ dnn_regressor = dnn.DNNRegressor(
+ hidden_units=hidden_units,
+ feature_columns=[
+ feature_column.numeric_column('age', shape=[input_dimension])],
+ label_dimension=label_dimension,
+ optimizer=mock_optimizer,
+ model_dir=self._model_dir)
+ self.assertEqual(0, mock_optimizer.minimize.call_count)
+
+ # Train for a few steps, then validate optimizer, summaries, and
+ # checkpoint.
+ num_steps = 5
+ summary_hook = dnn_testing_utils._SummaryHook()
+ dnn_regressor.train(
+ input_fn=lambda: ({'age': [[10., 8.]]}, [[1., -1., 0.5]]),
+ steps=num_steps,
+ hooks=(summary_hook,))
+ self.assertEqual(1, mock_optimizer.minimize.call_count)
+ summaries = summary_hook.summaries()
+ self.assertEqual(num_steps, len(summaries))
+ for summary in summaries:
+ dnn_testing_utils._assert_simple_summary(
+ self,
+ {
+ metric_keys.MetricKeys.LOSS_MEAN: expected_loss / label_dimension,
+ 'dnn/dnn/hiddenlayer_0/fraction_of_zero_values': 0.,
+ 'dnn/dnn/hiddenlayer_1/fraction_of_zero_values': 0.5,
+ 'dnn/dnn/logits/fraction_of_zero_values': 0.,
+ metric_keys.MetricKeys.LOSS: expected_loss,
+ },
+ summary)
+ dnn_testing_utils._assert_checkpoint(
+ self, base_global_step + num_steps, input_units=input_dimension,
+ hidden_units=hidden_units, output_units=label_dimension,
+ model_dir=self._model_dir)
+
+
+class DNNClassifierTrainTest(
+ dnn_testing_utils.BaseDNNClassifierTrainTest, test.TestCase):
+
+ def __init__(self, methodName='runTest'): # pylint: disable=invalid-name
+ test.TestCase.__init__(self, methodName)
+ dnn_testing_utils.BaseDNNClassifierTrainTest.__init__(
+ self, _dnn_classifier_fn)
+
+
if __name__ == '__main__':
test.main()
diff --git a/tensorflow/python/estimator/canned/dnn_testing_utils.py b/tensorflow/python/estimator/canned/dnn_testing_utils.py
index 5c8ed45d8d..d92653c834 100644
--- a/tensorflow/python/estimator/canned/dnn_testing_utils.py
+++ b/tensorflow/python/estimator/canned/dnn_testing_utils.py
@@ -44,6 +44,7 @@ from tensorflow.python.ops import state_ops
from tensorflow.python.ops import variables as variables_lib
from tensorflow.python.platform import test
from tensorflow.python.summary import summary as summary_lib
+from tensorflow.python.summary.writer import writer_cache
from tensorflow.python.training import checkpoint_utils
from tensorflow.python.training import monitored_session
from tensorflow.python.training import optimizer
@@ -210,6 +211,7 @@ class BaseDNNModelFnTest(object):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def _test_logits(self, mode, hidden_units, logits_dimension, inputs,
@@ -435,7 +437,7 @@ class BaseDNNModelFnTest(object):
self.fail('Invalid mode: {}'.format(mode))
-class BaseDNNClassifierEvaluateTest(object):
+class BaseDNNClassifierEvaluateTest(test.TestCase):
def __init__(self, dnn_classifier_fn):
self._dnn_classifier_fn = dnn_classifier_fn
@@ -516,75 +518,7 @@ class BaseDNNClassifierEvaluateTest(object):
}, dnn_classifier.evaluate(input_fn=_input_fn, steps=1))
-class BaseDNNRegressorEvaluateTest(object):
-
- def __init__(self, dnn_regressor_fn):
- self._dnn_regressor_fn = dnn_regressor_fn
-
- def setUp(self):
- self._model_dir = tempfile.mkdtemp()
-
- def tearDown(self):
- if self._model_dir:
- shutil.rmtree(self._model_dir)
-
- def test_one_dim(self):
- """Asserts evaluation metrics for one-dimensional input and logits."""
- # Create checkpoint: num_inputs=1, hidden_units=(2, 2), num_outputs=1.
- global_step = 100
- create_checkpoint(
- (([[.6, .5]], [.1, -.1]), ([[1., .8], [-.8, -1.]], [.2, -.2]),
- ([[-1.], [1.]], [.3]),), global_step, self._model_dir)
-
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=(2, 2),
- feature_columns=[feature_column.numeric_column('age')],
- model_dir=self._model_dir)
- def _input_fn():
- return {'age': [[10.]]}, [[1.]]
- # Uses identical numbers as DNNModelTest.test_one_dim_logits.
- # See that test for calculation of logits.
- # logits = [[-2.08]] => predictions = [-2.08].
- # loss = (1+2.08)^2 = 9.4864
- expected_loss = 9.4864
- self.assertAllClose({
- metric_keys.MetricKeys.LOSS: expected_loss,
- metric_keys.MetricKeys.LOSS_MEAN: expected_loss,
- ops.GraphKeys.GLOBAL_STEP: global_step
- }, dnn_regressor.evaluate(input_fn=_input_fn, steps=1))
-
- def test_multi_dim(self):
- """Asserts evaluation metrics for multi-dimensional input and logits."""
- # Create checkpoint: num_inputs=2, hidden_units=(2, 2), num_outputs=3.
- global_step = 100
- create_checkpoint(
- (([[.6, .5], [-.6, -.5]], [.1, -.1]), ([[1., .8], [-.8, -1.]],
- [.2, -.2]),
- ([[-1., 1., .5], [-1., 1., .5]], [.3, -.3,
- .0]),), global_step, self._model_dir)
- label_dimension = 3
-
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=(2, 2),
- feature_columns=[feature_column.numeric_column('age', shape=[2])],
- label_dimension=label_dimension,
- model_dir=self._model_dir)
- def _input_fn():
- return {'age': [[10., 8.]]}, [[1., -1., 0.5]]
- # Uses identical numbers as
- # DNNModelFnTest.test_multi_dim_input_multi_dim_logits.
- # See that test for calculation of logits.
- # logits = [[-0.48, 0.48, 0.39]]
- # loss = (1+0.48)^2 + (-1-0.48)^2 + (0.5-0.39)^2 = 4.3929
- expected_loss = 4.3929
- self.assertAllClose({
- metric_keys.MetricKeys.LOSS: expected_loss,
- metric_keys.MetricKeys.LOSS_MEAN: expected_loss / label_dimension,
- ops.GraphKeys.GLOBAL_STEP: global_step
- }, dnn_regressor.evaluate(input_fn=_input_fn, steps=1))
-
-
-class BaseDNNClassifierPredictTest(object):
+class BaseDNNClassifierPredictTest(test.TestCase):
def __init__(self, dnn_classifier_fn):
self._dnn_classifier_fn = dnn_classifier_fn
@@ -673,68 +607,6 @@ class BaseDNNClassifierPredictTest(object):
[b'1'], predictions[prediction_keys.PredictionKeys.CLASSES])
-class BaseDNNRegressorPredictTest(object):
-
- def __init__(self, dnn_regressor_fn):
- self._dnn_regressor_fn = dnn_regressor_fn
-
- def setUp(self):
- self._model_dir = tempfile.mkdtemp()
-
- def tearDown(self):
- if self._model_dir:
- shutil.rmtree(self._model_dir)
-
- def test_one_dim(self):
- """Asserts predictions for one-dimensional input and logits."""
- # Create checkpoint: num_inputs=1, hidden_units=(2, 2), num_outputs=1.
- create_checkpoint(
- (([[.6, .5]], [.1, -.1]), ([[1., .8], [-.8, -1.]], [.2, -.2]),
- ([[-1.], [1.]], [.3]),),
- global_step=0,
- model_dir=self._model_dir)
-
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=(2, 2),
- feature_columns=(feature_column.numeric_column('x'),),
- model_dir=self._model_dir)
- input_fn = numpy_io.numpy_input_fn(
- x={'x': np.array([[10.]])}, batch_size=1, shuffle=False)
- # Uses identical numbers as DNNModelTest.test_one_dim_logits.
- # See that test for calculation of logits.
- # logits = [[-2.08]] => predictions = [-2.08].
- self.assertAllClose({
- prediction_keys.PredictionKeys.PREDICTIONS: [-2.08],
- }, next(dnn_regressor.predict(input_fn=input_fn)))
-
- def test_multi_dim(self):
- """Asserts predictions for multi-dimensional input and logits."""
- # Create checkpoint: num_inputs=2, hidden_units=(2, 2), num_outputs=3.
- create_checkpoint(
- (([[.6, .5], [-.6, -.5]], [.1, -.1]),
- ([[1., .8], [-.8, -1.]], [.2, -.2]), ([[-1., 1., .5], [-1., 1., .5]],
- [.3, -.3,
- .0]),), 100, self._model_dir)
-
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=(2, 2),
- feature_columns=(feature_column.numeric_column('x', shape=(2,)),),
- label_dimension=3,
- model_dir=self._model_dir)
- input_fn = numpy_io.numpy_input_fn(
- # Inputs shape is (batch_size, num_inputs).
- x={'x': np.array([[10., 8.]])},
- batch_size=1,
- shuffle=False)
- # Uses identical numbers as
- # DNNModelFnTest.test_multi_dim_input_multi_dim_logits.
- # See that test for calculation of logits.
- # logits = [[-0.48, 0.48, 0.39]] => predictions = [-0.48, 0.48, 0.39]
- self.assertAllClose({
- prediction_keys.PredictionKeys.PREDICTIONS: [-0.48, 0.48, 0.39],
- }, next(dnn_regressor.predict(input_fn=input_fn)))
-
-
class _SummaryHook(session_run_hook.SessionRunHook):
"""Saves summaries every N steps."""
@@ -813,7 +685,7 @@ def _assert_simple_summary(testcase, expected_values, actual_summary):
})
-class BaseDNNClassifierTrainTest(object):
+class BaseDNNClassifierTrainTest(test.TestCase):
def __init__(self, dnn_classifier_fn):
self._dnn_classifier_fn = dnn_classifier_fn
@@ -981,162 +853,3 @@ class BaseDNNClassifierTrainTest(object):
self, base_global_step + num_steps, input_units=1,
hidden_units=hidden_units, output_units=n_classes,
model_dir=self._model_dir)
-
-
-class BaseDNNRegressorTrainTest(object):
-
- def __init__(self, dnn_regressor_fn):
- self._dnn_regressor_fn = dnn_regressor_fn
-
- def setUp(self):
- self._model_dir = tempfile.mkdtemp()
-
- def tearDown(self):
- if self._model_dir:
- shutil.rmtree(self._model_dir)
-
- def test_from_scratch_with_default_optimizer(self):
- hidden_units = (2, 2)
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=hidden_units,
- feature_columns=(feature_column.numeric_column('age'),),
- model_dir=self._model_dir)
-
- # Train for a few steps, then validate final checkpoint.
- num_steps = 5
- dnn_regressor.train(
- input_fn=lambda: ({'age': ((1,),)}, ((10,),)), steps=num_steps)
- _assert_checkpoint(
- self, num_steps, input_units=1, hidden_units=hidden_units,
- output_units=1, model_dir=self._model_dir)
-
- def test_from_scratch(self):
- hidden_units = (2, 2)
- opt = mock_optimizer(self, hidden_units=hidden_units)
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=hidden_units,
- feature_columns=(feature_column.numeric_column('age'),),
- optimizer=opt,
- model_dir=self._model_dir)
- self.assertEqual(0, opt.minimize.call_count)
-
- # Train for a few steps, then validate optimizer, summaries, and
- # checkpoint.
- num_steps = 5
- summary_hook = _SummaryHook()
- dnn_regressor.train(
- input_fn=lambda: ({'age': ((1,),)}, ((5.,),)), steps=num_steps,
- hooks=(summary_hook,))
- self.assertEqual(1, opt.minimize.call_count)
- _assert_checkpoint(
- self, num_steps, input_units=1, hidden_units=hidden_units,
- output_units=1, model_dir=self._model_dir)
- summaries = summary_hook.summaries()
- self.assertEqual(num_steps, len(summaries))
- for summary in summaries:
- summary_keys = [v.tag for v in summary.value]
- self.assertIn(metric_keys.MetricKeys.LOSS, summary_keys)
- self.assertIn(metric_keys.MetricKeys.LOSS_MEAN, summary_keys)
-
- def test_one_dim(self):
- """Asserts train loss for one-dimensional input and logits."""
- base_global_step = 100
- hidden_units = (2, 2)
- create_checkpoint(
- (([[.6, .5]], [.1, -.1]), ([[1., .8], [-.8, -1.]], [.2, -.2]),
- ([[-1.], [1.]], [.3]),), base_global_step, self._model_dir)
-
- # Uses identical numbers as DNNModelFnTest.test_one_dim_logits.
- # See that test for calculation of logits.
- # logits = [-2.08] => predictions = [-2.08]
- # loss = (1 + 2.08)^2 = 9.4864
- expected_loss = 9.4864
- opt = mock_optimizer(
- self, hidden_units=hidden_units, expected_loss=expected_loss)
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=hidden_units,
- feature_columns=(feature_column.numeric_column('age'),),
- optimizer=opt,
- model_dir=self._model_dir)
- self.assertEqual(0, opt.minimize.call_count)
-
- # Train for a few steps, then validate optimizer, summaries, and
- # checkpoint.
- num_steps = 5
- summary_hook = _SummaryHook()
- dnn_regressor.train(
- input_fn=lambda: ({'age': [[10.]]}, [[1.]]), steps=num_steps,
- hooks=(summary_hook,))
- self.assertEqual(1, opt.minimize.call_count)
- summaries = summary_hook.summaries()
- self.assertEqual(num_steps, len(summaries))
- for summary in summaries:
- _assert_simple_summary(
- self,
- {
- metric_keys.MetricKeys.LOSS_MEAN: expected_loss,
- 'dnn/dnn/hiddenlayer_0/fraction_of_zero_values': 0.,
- 'dnn/dnn/hiddenlayer_1/fraction_of_zero_values': 0.5,
- 'dnn/dnn/logits/fraction_of_zero_values': 0.,
- metric_keys.MetricKeys.LOSS: expected_loss,
- },
- summary)
- _assert_checkpoint(
- self, base_global_step + num_steps, input_units=1,
- hidden_units=hidden_units, output_units=1, model_dir=self._model_dir)
-
- def test_multi_dim(self):
- """Asserts train loss for multi-dimensional input and logits."""
- base_global_step = 100
- hidden_units = (2, 2)
- create_checkpoint(
- (([[.6, .5], [-.6, -.5]], [.1, -.1]), ([[1., .8], [-.8, -1.]],
- [.2, -.2]),
- ([[-1., 1., .5], [-1., 1., .5]],
- [.3, -.3, .0]),), base_global_step, self._model_dir)
- input_dimension = 2
- label_dimension = 3
-
- # Uses identical numbers as
- # DNNModelFnTest.test_multi_dim_input_multi_dim_logits.
- # See that test for calculation of logits.
- # logits = [[-0.48, 0.48, 0.39]]
- # loss = (1+0.48)^2 + (-1-0.48)^2 + (0.5-0.39)^2 = 4.3929
- expected_loss = 4.3929
- opt = mock_optimizer(
- self, hidden_units=hidden_units, expected_loss=expected_loss)
- dnn_regressor = self._dnn_regressor_fn(
- hidden_units=hidden_units,
- feature_columns=[
- feature_column.numeric_column('age', shape=[input_dimension])],
- label_dimension=label_dimension,
- optimizer=opt,
- model_dir=self._model_dir)
- self.assertEqual(0, opt.minimize.call_count)
-
- # Train for a few steps, then validate optimizer, summaries, and
- # checkpoint.
- num_steps = 5
- summary_hook = _SummaryHook()
- dnn_regressor.train(
- input_fn=lambda: ({'age': [[10., 8.]]}, [[1., -1., 0.5]]),
- steps=num_steps,
- hooks=(summary_hook,))
- self.assertEqual(1, opt.minimize.call_count)
- summaries = summary_hook.summaries()
- self.assertEqual(num_steps, len(summaries))
- for summary in summaries:
- _assert_simple_summary(
- self,
- {
- metric_keys.MetricKeys.LOSS_MEAN: expected_loss / label_dimension,
- 'dnn/dnn/hiddenlayer_0/fraction_of_zero_values': 0.,
- 'dnn/dnn/hiddenlayer_1/fraction_of_zero_values': 0.5,
- 'dnn/dnn/logits/fraction_of_zero_values': 0.,
- metric_keys.MetricKeys.LOSS: expected_loss,
- },
- summary)
- _assert_checkpoint(
- self, base_global_step + num_steps, input_units=input_dimension,
- hidden_units=hidden_units, output_units=label_dimension,
- model_dir=self._model_dir)
diff --git a/tensorflow/python/estimator/canned/linear_testing_utils.py b/tensorflow/python/estimator/canned/linear_testing_utils.py
index fbad819e85..692bb09a02 100644
--- a/tensorflow/python/estimator/canned/linear_testing_utils.py
+++ b/tensorflow/python/estimator/canned/linear_testing_utils.py
@@ -50,6 +50,7 @@ from tensorflow.python.ops import variable_scope
from tensorflow.python.ops import variables
from tensorflow.python.platform import gfile
from tensorflow.python.platform import test
+from tensorflow.python.summary.writer import writer_cache
from tensorflow.python.training import checkpoint_utils
from tensorflow.python.training import input as input_lib
from tensorflow.python.training import optimizer
@@ -154,6 +155,7 @@ class BaseLinearRegressorPartitionerTest(object):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def testPartitioner(self):
@@ -233,6 +235,7 @@ class BaseLinearRegressorEvaluationTest(object):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def test_evaluation_for_simple_data(self):
@@ -392,6 +395,7 @@ class BaseLinearRegressorPredictTest(object):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def test_1d(self):
@@ -487,6 +491,7 @@ class BaseLinearRegressorIntegrationTest(object):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def _test_complete_flow(self, train_input_fn, eval_input_fn, predict_input_fn,
@@ -654,6 +659,7 @@ class BaseLinearRegressorTrainingTest(object):
def tearDown(self):
if self._model_dir:
+ writer_cache.FileWriterCache.clear()
shutil.rmtree(self._model_dir)
def _mock_optimizer(self, expected_loss=None):
diff --git a/tensorflow/python/estimator/estimator.py b/tensorflow/python/estimator/estimator.py
index 8e6edf6da7..293aa75253 100644
--- a/tensorflow/python/estimator/estimator.py
+++ b/tensorflow/python/estimator/estimator.py
@@ -200,7 +200,7 @@ class Estimator(object):
error. 'steps' works incrementally. If you call two times
train(steps=10) then training occurs in total 20 steps. If `OutOfRange`
or `StopIteration` error occurs in the middle, training stops before 20
- steps. If you don't want to have incremental behaviour please set
+ steps. If you don't want to have incremental behavior please set
`max_steps` instead. If set, `max_steps` must be `None`.
max_steps: Number of total steps for which to train model. If `None`,
train forever or train until input_fn generates the `OutOfRange` or
diff --git a/tensorflow/python/estimator/export/export_output.py b/tensorflow/python/estimator/export/export_output.py
index 49bcd06d50..7c7f92872e 100644
--- a/tensorflow/python/estimator/export/export_output.py
+++ b/tensorflow/python/estimator/export/export_output.py
@@ -69,7 +69,7 @@ class ClassificationOutput(ExportOutput):
"""
def __init__(self, scores=None, classes=None):
- """Constructor for `ClassifyOutput`.
+ """Constructor for `ClassificationOutput`.
Args:
scores: A float `Tensor` giving scores (sometimes but not always
diff --git a/tensorflow/python/framework/file_system_test.py b/tensorflow/python/framework/file_system_test.py
index 26b2a5b9b9..5eb59141a2 100644
--- a/tensorflow/python/framework/file_system_test.py
+++ b/tensorflow/python/framework/file_system_test.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# =============================================================================
-"""Tests for functions."""
+"""Tests for file_system."""
from __future__ import absolute_import
from __future__ import division
diff --git a/tensorflow/python/framework/importer.py b/tensorflow/python/framework/importer.py
index ffce6ce4c8..c2fc7e3af9 100644
--- a/tensorflow/python/framework/importer.py
+++ b/tensorflow/python/framework/importer.py
@@ -437,6 +437,7 @@ def import_graph_def(graph_def, input_map=None, return_elements=None,
'WholeFileReader', 'TextLineReader',
'FixedLengthRecordReader',
'TFRecordReader', 'IdentityReader',
+ 'LMDBReader',
'RefSwitch', 'RefEnter', 'RefNextIteration',
'RefMerge', 'RefIdentity']:
pass
diff --git a/tensorflow/python/framework/random_seed_test.py b/tensorflow/python/framework/random_seed_test.py
index d64500fbc9..c1d2b05b0b 100644
--- a/tensorflow/python/framework/random_seed_test.py
+++ b/tensorflow/python/framework/random_seed_test.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
-"""Tests for tensorflow.python.framework.ops."""
+"""Tests for tensorflow.python.framework.random_seed."""
from __future__ import absolute_import
from __future__ import division
diff --git a/tensorflow/python/framework/sparse_tensor_test.py b/tensorflow/python/framework/sparse_tensor_test.py
index 19a2b187b9..e709eaeda1 100644
--- a/tensorflow/python/framework/sparse_tensor_test.py
+++ b/tensorflow/python/framework/sparse_tensor_test.py
@@ -13,7 +13,7 @@
# limitations under the License.
# ==============================================================================
-"""Tests for tensorflow.python.framework.ops."""
+"""Tests for tensorflow.python.framework.sparse_tensor."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
diff --git a/tensorflow/python/framework/tensor_shape.py b/tensorflow/python/framework/tensor_shape.py
index d72700ea2c..3aedbfef0d 100644
--- a/tensorflow/python/framework/tensor_shape.py
+++ b/tensorflow/python/framework/tensor_shape.py
@@ -66,6 +66,8 @@ class Dimension(object):
def __int__(self):
return self._value
+ # This is needed for Windows.
+ # See https://github.com/tensorflow/tensorflow/pull/9780
def __long__(self):
return self._value
diff --git a/tensorflow/python/kernel_tests/BUILD b/tensorflow/python/kernel_tests/BUILD
index 82d64bf5ad..e150b385f2 100644
--- a/tensorflow/python/kernel_tests/BUILD
+++ b/tensorflow/python/kernel_tests/BUILD
@@ -29,6 +29,7 @@ tf_py_test(
"//tensorflow/python:framework_for_generated_wrappers",
"//tensorflow/python:string_ops",
],
+ tags = ["no_windows"],
)
tf_py_test(
@@ -93,6 +94,7 @@ cuda_py_test(
"//tensorflow/python:platform",
"//tensorflow/python:platform_benchmark",
],
+ tags = ["no_windows"],
)
tf_py_test(
@@ -145,6 +147,7 @@ tf_py_test(
"//tensorflow/python:clip_ops",
"//tensorflow/python:framework_for_generated_wrappers",
],
+ tags = ["no_windows"],
)
tf_py_test(
@@ -241,6 +244,7 @@ tf_py_test(
"//tensorflow/python:nn_grad",
],
data = ["//tensorflow/core:image_testdata"],
+ tags = ["no_windows"],
)
tf_py_test(
@@ -873,6 +877,7 @@ tf_py_test(
"//tensorflow/python:resource_variable_ops",
"//tensorflow/python:variables",
],
+ tags = ["no_windows"],
)
tf_py_test(
@@ -920,7 +925,10 @@ cuda_py_test(
"//tensorflow/python:math_ops",
"//tensorflow/python:variables",
],
- tags = ["noasan"],
+ tags = [
+ "no_windows",
+ "noasan",
+ ],
)
cuda_py_test(
@@ -951,6 +959,7 @@ cuda_py_test(
"//tensorflow/python:platform",
],
shard_count = 2,
+ tags = ["no_windows_gpu"],
)
tf_py_test(
@@ -969,6 +978,8 @@ tf_py_test(
"//tensorflow/python:util",
"//tensorflow/python:variables",
],
+ data = ["//tensorflow/core:lmdb_testdata"],
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -1250,6 +1261,7 @@ cuda_py_test(
"//tensorflow/python:client_testlib",
"//tensorflow/python:framework_for_generated_wrappers",
],
+ tags = ["manual"],
)
cuda_py_test(
@@ -1311,6 +1323,7 @@ cuda_py_test(
"//tensorflow/python:variable_scope",
"//tensorflow/python:variables",
],
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -1493,6 +1506,7 @@ cuda_py_test(
"//tensorflow/python:client_testlib",
"//tensorflow/python:framework_for_generated_wrappers",
],
+ tags = ["no_windows_gpu"],
)
cuda_py_test(
@@ -1548,6 +1562,7 @@ cuda_py_test(
"//tensorflow/python:framework_for_generated_wrappers",
"//tensorflow/python:script_ops",
],
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -1563,7 +1578,7 @@ cuda_py_test(
cuda_py_test(
name = "random_ops_test",
- size = "small",
+ size = "medium",
srcs = ["random_ops_test.py"],
additional_deps = [
"//third_party/py/numpy",
@@ -1888,6 +1903,7 @@ cuda_py_test(
"//tensorflow/python:framework_for_generated_wrappers",
"//tensorflow/python:parsing_ops",
],
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -1950,6 +1966,7 @@ cuda_py_test(
"//tensorflow/python:client_testlib",
"//tensorflow/python:math_ops",
],
+ tags = ["no_windows_gpu"],
)
cuda_py_test(
@@ -2046,6 +2063,7 @@ cuda_py_test(
"//tensorflow/python:nn_grad",
"//tensorflow/python:nn_ops",
],
+ tags = ["manual"],
)
cuda_py_test(
@@ -2109,6 +2127,7 @@ cuda_py_test(
"//tensorflow/python:variables",
],
shard_count = 4,
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -2139,6 +2158,7 @@ tf_py_test(
"//tensorflow/python:nn_grad",
"//tensorflow/python:nn_ops",
],
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -2150,6 +2170,7 @@ cuda_py_test(
"//tensorflow/python:client_testlib",
"//tensorflow/python:framework_for_generated_wrappers",
],
+ tags = ["manual"],
)
cuda_py_test(
@@ -2256,6 +2277,7 @@ cuda_py_test(
"//tensorflow/python:variables",
],
shard_count = 10,
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -2295,6 +2317,7 @@ cuda_py_test(
"//tensorflow/python:framework_for_generated_wrappers",
"//tensorflow/python:math_ops",
],
+ tags = ["no_windows"],
)
cuda_py_test(
@@ -2353,7 +2376,7 @@ cuda_py_test(
cuda_py_test(
name = "stage_op_test",
- size = "small",
+ size = "medium",
srcs = ["stage_op_test.py"],
additional_deps = [
"//tensorflow/python:array_ops",
@@ -2363,6 +2386,22 @@ cuda_py_test(
"//tensorflow/python:util",
"//tensorflow/python:data_flow_ops",
],
+ tags = ["manual"], # http://b/62429636
+)
+
+cuda_py_test(
+ name = "map_stage_op_test",
+ size = "medium",
+ srcs = ["map_stage_op_test.py"],
+ additional_deps = [
+ "//tensorflow/python:array_ops",
+ "//tensorflow/python:client_testlib",
+ "//tensorflow/python:framework_for_generated_wrappers",
+ "//tensorflow/python:math_ops",
+ "//tensorflow/python:util",
+ "//tensorflow/python:data_flow_ops",
+ ],
+ tags = ["manual"], # http://b/62429636
)
cuda_py_test(
@@ -2428,7 +2467,10 @@ cuda_py_test(
"//tensorflow/python:variables",
],
shard_count = 50,
- tags = ["notap"], # b/30226163
+ tags = [
+ "manual",
+ "notap", # b/30226163
+ ],
)
cuda_py_test(
diff --git a/tensorflow/python/kernel_tests/barrier_ops_test.py b/tensorflow/python/kernel_tests/barrier_ops_test.py
index e90543a44b..7f49c63957 100644
--- a/tensorflow/python/kernel_tests/barrier_ops_test.py
+++ b/tensorflow/python/kernel_tests/barrier_ops_test.py
@@ -402,7 +402,7 @@ class BarrierTest(test.TestCase):
with self.assertRaisesOpError("is closed"):
fail_insert_op.run()
- # This op should succeed because the barrier has not cancelled
+ # This op should succeed because the barrier has not canceled
# pending enqueues
insert_1_op.run()
self.assertEquals(size_t.eval(), [3])
@@ -461,7 +461,7 @@ class BarrierTest(test.TestCase):
with self.assertRaisesOpError("is closed"):
fail_insert_op.run()
- # This op should fail because the queue is cancelled.
+ # This op should fail because the queue is canceled.
with self.assertRaisesOpError("is closed"):
insert_2_op.run()
diff --git a/tensorflow/python/kernel_tests/basic_gpu_test.py b/tensorflow/python/kernel_tests/basic_gpu_test.py
index dbbc2de811..013aa1ba8a 100644
--- a/tensorflow/python/kernel_tests/basic_gpu_test.py
+++ b/tensorflow/python/kernel_tests/basic_gpu_test.py
@@ -129,7 +129,7 @@ class MathBuiltinUnaryTest(test.TestCase):
for dtype in [np.float32]:
self._testDtype(dtype, use_gpu=True)
- def testFloorDevide(self):
+ def testFloorDivide(self):
x = (1 + np.linspace(0, 5, np.prod([1, 3, 2]))).astype(np.float32).reshape(
[1, 3, 2])
y = (1 + np.linspace(0, 5, np.prod([1, 3, 2]))).astype(np.float32).reshape(
diff --git a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
index 77982654bd..91694cd0b2 100644
--- a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
+++ b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
@@ -1423,7 +1423,7 @@ class ControlFlowTest(test.TestCase):
self.assertEqual(45, rx.eval())
def _testWhileGrad_ColocateGradients(self, colocate):
- gpu_dev_name = test.gpu_device_name() if test.is_gpu_available(
+ gpu_dev_name = test.gpu_device_name().lower() if test.is_gpu_available(
) else "/gpu:0"
gpu_short_name = gpu_dev_name.split("/")[-1]
diff --git a/tensorflow/python/kernel_tests/conv_ops_3d_test.py b/tensorflow/python/kernel_tests/conv_ops_3d_test.py
index 04c43ef5fa..14622ab467 100644
--- a/tensorflow/python/kernel_tests/conv_ops_3d_test.py
+++ b/tensorflow/python/kernel_tests/conv_ops_3d_test.py
@@ -330,7 +330,7 @@ class Conv3DTest(test.TestCase):
if test.is_gpu_available() and use_gpu:
data_type = dtypes.float32
- # TOOD(mjanusz): Modify gradient_checker to also provide max relative
+ # TODO(mjanusz): Modify gradient_checker to also provide max relative
# error and synchronize the tolerance levels between the tests for forward
# and backward computations.
if test.is_gpu_available():
diff --git a/tensorflow/python/kernel_tests/decode_bmp_op_test.py b/tensorflow/python/kernel_tests/decode_bmp_op_test.py
index e7a8ac3af6..783492a6f2 100644
--- a/tensorflow/python/kernel_tests/decode_bmp_op_test.py
+++ b/tensorflow/python/kernel_tests/decode_bmp_op_test.py
@@ -25,82 +25,35 @@ from tensorflow.python.ops import image_ops
from tensorflow.python.platform import test
+
class DecodeBmpOpTest(test.TestCase):
def testex1(self):
img_bytes = [[[0, 0, 255], [0, 255, 0]], [[255, 0, 0], [255, 255, 255]]]
# Encoded BMP bytes from Wikipedia
encoded_bytes = [
- 0x42,
- 0x40,
- 0x46,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0x36,
- 0,
- 0,
- 0,
- 0x28,
- 0,
- 0,
- 0,
- 0x2,
- 0,
- 0,
- 0,
- 0x2,
- 0,
- 0,
- 0,
- 0x1,
- 0,
- 0x18,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0x10,
- 0,
- 0,
- 0,
- 0x13,
- 0xb,
- 0,
- 0,
- 0x13,
- 0xb,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0,
- 0xff,
- 0xff,
- 0xff,
- 0xff,
- 0,
- 0,
- 0xff,
- 0,
- 0,
- 0,
- 0xff,
- 0,
- 0,
- 0,
+ 0x42, 0x40,
+ 0x46, 0, 0, 0,
+ 0, 0,
+ 0, 0,
+ 0x36, 0, 0, 0,
+ 0x28, 0, 0, 0,
+ 0x2, 0, 0, 0,
+ 0x2, 0, 0, 0,
+ 0x1, 0,
+ 0x18, 0,
+ 0, 0, 0, 0,
+ 0x10, 0, 0, 0,
+ 0x13, 0xb, 0, 0,
+ 0x13, 0xb, 0, 0,
+ 0, 0, 0, 0,
+ 0, 0, 0, 0,
+ 0, 0, 0xff,
+ 0xff, 0xff, 0xff,
+ 0, 0,
+ 0xff, 0, 0,
+ 0, 0xff, 0,
+ 0, 0,
]
byte_string = bytes(bytearray(encoded_bytes))
diff --git a/tensorflow/python/kernel_tests/decode_raw_op_test.py b/tensorflow/python/kernel_tests/decode_raw_op_test.py
index fbaf335efb..e986b7ff2b 100644
--- a/tensorflow/python/kernel_tests/decode_raw_op_test.py
+++ b/tensorflow/python/kernel_tests/decode_raw_op_test.py
@@ -54,18 +54,26 @@ class DecodeRawOpTest(test.TestCase):
self.assertEqual([None, None], decode.get_shape().as_list())
result = decode.eval(feed_dict={in_bytes: ["AaBC"]})
- if sys.byteorder == "big":
- self.assertAllEqual(
- [[ord("A") * 256 + ord("a"), ord("B") * 256 + ord("C")]], result)
- else:
- self.assertAllEqual(
- [[ord("A") + ord("a") * 256, ord("B") + ord("C") * 256]], result)
+ self.assertAllEqual(
+ [[ord("A") + ord("a") * 256, ord("B") + ord("C") * 256]], result)
with self.assertRaisesOpError(
"Input to DecodeRaw has length 3 that is not a multiple of 2, the "
"size of int16"):
decode.eval(feed_dict={in_bytes: ["123", "456"]})
+ def testEndianness(self):
+ with self.test_session():
+ in_bytes = array_ops.placeholder(dtypes.string, shape=[None])
+ decode_le = parsing_ops.decode_raw(
+ in_bytes, out_type=dtypes.int32, little_endian=True)
+ decode_be = parsing_ops.decode_raw(
+ in_bytes, out_type=dtypes.int32, little_endian=False)
+ result = decode_le.eval(feed_dict={in_bytes: ["\x01\x02\x03\x04"]})
+ self.assertAllEqual([[0x04030201]], result)
+ result = decode_be.eval(feed_dict={in_bytes: ["\x01\x02\x03\x04"]})
+ self.assertAllEqual([[0x01020304]], result)
+
def testToFloat16(self):
with self.test_session():
in_bytes = array_ops.placeholder(dtypes.string, shape=[None])
diff --git a/tensorflow/python/kernel_tests/distributions/categorical_test.py b/tensorflow/python/kernel_tests/distributions/categorical_test.py
index 396de45cad..33db933e82 100644
--- a/tensorflow/python/kernel_tests/distributions/categorical_test.py
+++ b/tensorflow/python/kernel_tests/distributions/categorical_test.py
@@ -127,7 +127,7 @@ class CategoricalTest(test.TestCase):
self.assertAllClose(dist.prob(0).eval(), 0.2)
def testCDFWithDynamicEventShape(self):
- """Test that dynamically-sized events with unkown shape work."""
+ """Test that dynamically-sized events with unknown shape work."""
batch_size = 2
histograms = array_ops.placeholder(dtype=dtypes.float32,
shape=(batch_size, None))
diff --git a/tensorflow/python/kernel_tests/fft_ops_test.py b/tensorflow/python/kernel_tests/fft_ops_test.py
index 2f3c5a6c33..546e7a296d 100644
--- a/tensorflow/python/kernel_tests/fft_ops_test.py
+++ b/tensorflow/python/kernel_tests/fft_ops_test.py
@@ -212,9 +212,8 @@ class FFTOpsTest(BaseFFTOpsTest):
class RFFTOpsTest(BaseFFTOpsTest):
def _CompareBackward(self, x, rank, fft_length=None, use_placeholder=False):
- if test.is_gpu_available(cuda_only=True):
- super(RFFTOpsTest, self)._CompareBackward(x, rank, fft_length,
- use_placeholder)
+ super(RFFTOpsTest, self)._CompareBackward(x, rank, fft_length,
+ use_placeholder)
def _tfFFT(self, x, rank, fft_length=None, use_gpu=False, feed_dict=None):
with self.test_session(use_gpu=use_gpu):
@@ -270,8 +269,7 @@ class RFFTOpsTest(BaseFFTOpsTest):
x = np.zeros((0,) * dims).astype(np.float32)
self.assertEqual(x.shape, self._tfFFT(x, rank).shape)
x = np.zeros((0,) * dims).astype(np.complex64)
- if test.is_gpu_available(cuda_only=True):
- self.assertEqual(x.shape, self._tfIFFT(x, rank).shape)
+ self.assertEqual(x.shape, self._tfIFFT(x, rank).shape)
def testBasic(self):
for rank in VALID_FFT_RANKS:
@@ -300,36 +298,37 @@ class RFFTOpsTest(BaseFFTOpsTest):
use_placeholder=True)
def testFftLength(self):
- for rank in VALID_FFT_RANKS:
- for dims in xrange(rank, rank + 3):
- for size in (5, 6):
- inner_dim = size // 2 + 1
- r2c = np.mod(np.arange(np.power(size, dims)), 10).reshape(
- (size,) * dims)
- c2r = np.mod(np.arange(np.power(size, dims - 1) * inner_dim),
- 10).reshape((size,) * (dims - 1) + (inner_dim,))
-
- # Test truncation (FFT size < dimensions).
- fft_length = (size - 2,) * rank
- self._CompareForward(r2c.astype(np.float32), rank, fft_length)
- self._CompareBackward(c2r.astype(np.complex64), rank, fft_length)
-
- # Confirm it works with unknown shapes as well.
- self._CompareForward(r2c.astype(np.float32), rank, fft_length,
- use_placeholder=True)
- self._CompareBackward(c2r.astype(np.complex64), rank, fft_length,
- use_placeholder=True)
-
- # Test padding (FFT size > dimensions).
- fft_length = (size + 2,) * rank
- self._CompareForward(r2c.astype(np.float32), rank, fft_length)
- self._CompareBackward(c2r.astype(np.complex64), rank, fft_length)
-
- # Confirm it works with unknown shapes as well.
- self._CompareForward(r2c.astype(np.float32), rank, fft_length,
- use_placeholder=True)
- self._CompareBackward(c2r.astype(np.complex64), rank, fft_length,
- use_placeholder=True)
+ if test.is_gpu_available(cuda_only=True):
+ for rank in VALID_FFT_RANKS:
+ for dims in xrange(rank, rank + 3):
+ for size in (5, 6):
+ inner_dim = size // 2 + 1
+ r2c = np.mod(np.arange(np.power(size, dims)), 10).reshape(
+ (size,) * dims)
+ c2r = np.mod(np.arange(np.power(size, dims - 1) * inner_dim),
+ 10).reshape((size,) * (dims - 1) + (inner_dim,))
+
+ # Test truncation (FFT size < dimensions).
+ fft_length = (size - 2,) * rank
+ self._CompareForward(r2c.astype(np.float32), rank, fft_length)
+ self._CompareBackward(c2r.astype(np.complex64), rank, fft_length)
+
+ # Confirm it works with unknown shapes as well.
+ self._CompareForward(r2c.astype(np.float32), rank, fft_length,
+ use_placeholder=True)
+ self._CompareBackward(c2r.astype(np.complex64), rank, fft_length,
+ use_placeholder=True)
+
+ # Test padding (FFT size > dimensions).
+ fft_length = (size + 2,) * rank
+ self._CompareForward(r2c.astype(np.float32), rank, fft_length)
+ self._CompareBackward(c2r.astype(np.complex64), rank, fft_length)
+
+ # Confirm it works with unknown shapes as well.
+ self._CompareForward(r2c.astype(np.float32), rank, fft_length,
+ use_placeholder=True)
+ self._CompareBackward(c2r.astype(np.complex64), rank, fft_length,
+ use_placeholder=True)
def testRandom(self):
np.random.seed(12345)
@@ -428,23 +427,22 @@ class RFFTOpsTest(BaseFFTOpsTest):
use_gpu=True)
def testGrad_Random(self):
- if test.is_gpu_available(cuda_only=True):
- np.random.seed(54321)
- for rank in VALID_FFT_RANKS:
- # rfft3d/irfft3d do not have gradients yet.
- if rank == 3:
- continue
- for dims in xrange(rank, rank + 2):
- for size in (5, 6):
- re = np.random.rand(*((size,) * dims)).astype(np.float32) * 2 - 1
- im = np.random.rand(*((size,) * dims)).astype(np.float32) * 2 - 1
- self._checkGradReal(self._tfFFTForRank(rank), re, use_gpu=True)
- self._checkGradComplex(
- self._tfIFFTForRank(rank),
- re,
- im,
- result_is_complex=False,
- use_gpu=True)
+ np.random.seed(54321)
+ for rank in VALID_FFT_RANKS:
+ # rfft3d/irfft3d do not have gradients yet.
+ if rank == 3:
+ continue
+ for dims in xrange(rank, rank + 2):
+ for size in (5, 6):
+ re = np.random.rand(*((size,) * dims)).astype(np.float32) * 2 - 1
+ im = np.random.rand(*((size,) * dims)).astype(np.float32) * 2 - 1
+ self._checkGradReal(self._tfFFTForRank(rank), re, use_gpu=True)
+ self._checkGradComplex(
+ self._tfIFFTForRank(rank),
+ re,
+ im,
+ result_is_complex=False,
+ use_gpu=True)
if __name__ == "__main__":
diff --git a/tensorflow/python/kernel_tests/map_stage_op_test.py b/tensorflow/python/kernel_tests/map_stage_op_test.py
new file mode 100644
index 0000000000..2d2169c310
--- /dev/null
+++ b/tensorflow/python/kernel_tests/map_stage_op_test.py
@@ -0,0 +1,556 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import data_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import test
+
+
+class MapStageTest(test.TestCase):
+
+ def testSimple(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ pi = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ v = 2. * (array_ops.zeros([128, 128]) + x)
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea([dtypes.float32])
+ stage = stager.put(pi, [v], [0])
+ k, y = stager.get(gi)
+ y = math_ops.reduce_max(math_ops.matmul(y, y))
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ sess.run(stage, feed_dict={x: -1, pi: 0})
+ for i in range(10):
+ _, yval = sess.run([stage, y], feed_dict={x: i, pi: i+1, gi:i})
+ self.assertAllClose(4 * (i - 1) * (i - 1) * 128, yval, rtol=1e-4)
+
+ def testMultiple(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ pi = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ v = 2. * (array_ops.zeros([128, 128]) + x)
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea([dtypes.float32, dtypes.float32])
+ stage = stager.put(pi, [x, v], [0, 1])
+ k, (z, y) = stager.get(gi)
+ y = math_ops.reduce_max(z * math_ops.matmul(y, y))
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ sess.run(stage, feed_dict={x: -1, pi: 0})
+ for i in range(10):
+ _, yval = sess.run([stage, y], feed_dict={x: i, pi: i+1, gi:i})
+ self.assertAllClose(
+ 4 * (i - 1) * (i - 1) * (i - 1) * 128, yval, rtol=1e-4)
+
+ def testDictionary(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ pi = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ v = 2. * (array_ops.zeros([128, 128]) + x)
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea(
+ [dtypes.float32, dtypes.float32],
+ shapes=[[], [128, 128]],
+ names=['x', 'v'])
+ stage = stager.put(pi,{'x': x, 'v': v})
+ key, ret = stager.get(gi)
+ z = ret['x']
+ y = ret['v']
+ y = math_ops.reduce_max(z * math_ops.matmul(y, y))
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ sess.run(stage, feed_dict={x: -1, pi: 0})
+ for i in range(10):
+ _, yval = sess.run([stage, y], feed_dict={x: i, pi: i+1, gi:i})
+ self.assertAllClose(
+ 4 * (i - 1) * (i - 1) * (i - 1) * 128, yval, rtol=1e-4)
+
+ def testColocation(self):
+ gpu_dev = test.gpu_device_name()
+
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ v = 2. * (array_ops.zeros([128, 128]) + x)
+ with ops.device(gpu_dev):
+ stager = data_flow_ops.MapStagingArea([dtypes.float32])
+ y = stager.put(1, [v], [0])
+ self.assertEqual(y.device, '/device:GPU:0' if gpu_dev
+ else gpu_dev)
+ with ops.device('/cpu:0'):
+ _, x = stager.get(1)
+ y = stager.peek(1)
+ _, z = stager.get()
+ self.assertEqual(x.device, '/device:CPU:0')
+ self.assertEqual(y.device, '/device:CPU:0')
+ self.assertEqual(z.device, '/device:CPU:0')
+
+ G.finalize()
+
+ def testPeek(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.int32, name='x')
+ pi = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ p = array_ops.placeholder(dtypes.int32, name='p')
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea([dtypes.int32, ], shapes=[[]])
+ stage = stager.put(pi,[x], [0])
+ peek = stager.peek(gi)
+ size = stager.size()
+
+ G.finalize()
+
+ n = 10
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ for i in range(n):
+ sess.run(stage, feed_dict={x:i, pi:i})
+
+ for i in range(n):
+ self.assertTrue(sess.run(peek, feed_dict={gi: i}) == i)
+
+ self.assertTrue(sess.run(size) == 10)
+
+ def testSizeAndClear(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32, name='x')
+ pi = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ v = 2. * (array_ops.zeros([128, 128]) + x)
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea(
+ [dtypes.float32, dtypes.float32],
+ shapes=[[], [128, 128]],
+ names=['x', 'v'])
+ stage = stager.put(pi,{'x': x, 'v': v})
+ size = stager.size()
+ clear = stager.clear()
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ sess.run(stage, feed_dict={x: -1, pi: 3})
+ self.assertEqual(sess.run(size), 1)
+ sess.run(stage, feed_dict={x: -1, pi: 1})
+ self.assertEqual(sess.run(size), 2)
+ sess.run(clear)
+ self.assertEqual(sess.run(size), 0)
+
+
+ def testCapacity(self):
+ capacity = 3
+
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.int32, name='x')
+ pi = array_ops.placeholder(dtypes.int64, name='pi')
+ gi = array_ops.placeholder(dtypes.int64, name='gi')
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea([dtypes.int32, ],
+ capacity=capacity, shapes=[[]])
+
+ stage = stager.put(pi, [x], [0])
+ get = stager.get()
+ size = stager.size()
+
+ G.finalize()
+
+ from six.moves import queue as Queue
+ import threading
+
+ queue = Queue.Queue()
+ n = 5
+ missed = 0
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # Stage data in a separate thread which will block
+ # when it hits the staging area's capacity and thus
+ # not fill the queue with n tokens
+ def thread_run():
+ for i in range(n):
+ sess.run(stage, feed_dict={x: i, pi: i})
+ queue.put(0)
+
+ t = threading.Thread(target=thread_run)
+ t.start()
+
+ # Get tokens from the queue, making notes of when we timeout
+ for i in range(n):
+ try:
+ queue.get(timeout=0.05)
+ except Queue.Empty:
+ missed += 1
+
+ # We timed out n - capacity times waiting for queue puts
+ self.assertTrue(missed == n - capacity)
+
+ # Clear the staging area out a bit
+ for i in range(n - capacity):
+ sess.run(get)
+
+ # This should now succeed
+ t.join()
+
+ self.assertTrue(sess.run(size) == capacity)
+
+ # Clear out the staging area completely
+ for i in range(capacity):
+ sess.run(get)
+
+ def testMemoryLimit(self):
+ memory_limit = 512*1024 # 512K
+ chunk = 200*1024 # 200K
+ capacity = memory_limit // chunk
+
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.uint8, name='x')
+ pi = array_ops.placeholder(dtypes.int64, name='pi')
+ gi = array_ops.placeholder(dtypes.int64, name='gi')
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea([dtypes.uint8],
+ memory_limit=memory_limit, shapes=[[]])
+ stage = stager.put(pi, [x], [0])
+ get = stager.get()
+ size = stager.size()
+
+ G.finalize()
+
+ from six.moves import queue as Queue
+ import threading
+ import numpy as np
+
+ queue = Queue.Queue()
+ n = 5
+ missed = 0
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # Stage data in a separate thread which will block
+ # when it hits the staging area's memory limit and thus
+ # not fill the queue with n tokens
+ def thread_run():
+ for i in range(n):
+ sess.run(stage, feed_dict={x: np.full(chunk, i, dtype=np.uint8),
+ pi: i})
+ queue.put(0)
+
+ t = threading.Thread(target=thread_run)
+ t.start()
+
+ # Get tokens from the queue, making notes of when we timeout
+ for i in range(n):
+ try:
+ queue.get(timeout=0.05)
+ except Queue.Empty:
+ missed += 1
+
+ # We timed out n - capacity times waiting for queue puts
+ self.assertTrue(missed == n - capacity)
+
+ # Clear the staging area out a bit
+ for i in range(n - capacity):
+ sess.run(get)
+
+ # This should now succeed
+ t.join()
+
+ self.assertTrue(sess.run(size) == capacity)
+
+ # Clear out the staging area completely
+ for i in range(capacity):
+ sess.run(get)
+
+ def testOrdering(self):
+ import six
+ import random
+
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.int32, name='x')
+ pi = array_ops.placeholder(dtypes.int64, name='pi')
+ gi = array_ops.placeholder(dtypes.int64, name='gi')
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea([dtypes.int32, ],
+ shapes=[[]], ordered=True)
+ stage = stager.put(pi, [x], [0])
+ get = stager.get()
+ size = stager.size()
+
+ G.finalize()
+
+ n = 10
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # Keys n-1..0
+ keys = list(reversed(six.moves.range(n)))
+
+ for i in keys:
+ sess.run(stage, feed_dict={pi: i, x: i})
+
+ self.assertTrue(sess.run(size) == n)
+
+ # Check that key, values come out in ascending order
+ for i, k in enumerate(reversed(keys)):
+ get_key, values = sess.run(get)
+ self.assertTrue(i == k == get_key == values)
+
+ self.assertTrue(sess.run(size) == 0)
+
+ def testPartialDictInsert(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ f = array_ops.placeholder(dtypes.float32)
+ v = array_ops.placeholder(dtypes.float32)
+ pi = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ with ops.device(test.gpu_device_name()):
+ # Test barrier with dictionary
+ stager = data_flow_ops.MapStagingArea(
+ [dtypes.float32, dtypes.float32, dtypes.float32],
+ names=['x', 'v', 'f'])
+ stage_xf = stager.put(pi,{'x': x, 'f': f})
+ stage_v = stager.put(pi, {'v': v})
+ key, ret = stager.get(gi)
+ size = stager.size()
+ isize = stager.incomplete_size()
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # 0 complete and incomplete entries
+ self.assertTrue(sess.run([size, isize]) == [0, 0])
+ # Stage key 0, x and f tuple entries
+ sess.run(stage_xf, feed_dict={pi: 0, x: 1, f: 2})
+ self.assertTrue(sess.run([size, isize]) == [0, 1])
+ # Stage key 1, x and f tuple entries
+ sess.run(stage_xf, feed_dict={pi: 1, x: 1, f: 2})
+ self.assertTrue(sess.run([size, isize]) == [0, 2])
+
+ # Now complete key 0 with tuple entry v
+ sess.run(stage_v, feed_dict={pi: 0, v: 1})
+ # 1 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [1, 1])
+ # We can now obtain tuple associated with key 0
+ self.assertTrue(sess.run([key, ret], feed_dict={gi:0})
+ == [0, { 'x':1, 'f':2, 'v':1}])
+
+ # 0 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [0, 1])
+ # Now complete key 1 with tuple entry v
+ sess.run(stage_v, feed_dict={pi: 1, v: 3})
+ # We can now obtain tuple associated with key 1
+ self.assertTrue(sess.run([key, ret], feed_dict={gi:1})
+ == [1, { 'x':1, 'f':2, 'v':3}])
+
+ def testPartialIndexInsert(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ f = array_ops.placeholder(dtypes.float32)
+ v = array_ops.placeholder(dtypes.float32)
+ pi = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.MapStagingArea(
+ [dtypes.float32, dtypes.float32, dtypes.float32])
+ stage_xf = stager.put(pi, [x, f], [0, 2])
+ stage_v = stager.put(pi, [v], [1])
+ key, ret = stager.get(gi)
+ size = stager.size()
+ isize = stager.incomplete_size()
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # 0 complete and incomplete entries
+ self.assertTrue(sess.run([size, isize]) == [0, 0])
+ # Stage key 0, x and f tuple entries
+ sess.run(stage_xf, feed_dict={pi: 0, x: 1, f: 2})
+ self.assertTrue(sess.run([size, isize]) == [0, 1])
+ # Stage key 1, x and f tuple entries
+ sess.run(stage_xf, feed_dict={pi: 1, x: 1, f: 2})
+ self.assertTrue(sess.run([size, isize]) == [0, 2])
+
+ # Now complete key 0 with tuple entry v
+ sess.run(stage_v, feed_dict={pi: 0, v: 1})
+ # 1 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [1, 1])
+ # We can now obtain tuple associated with key 0
+ self.assertTrue(sess.run([key, ret], feed_dict={gi:0})
+ == [0, [1, 1, 2]])
+
+ # 0 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [0, 1])
+ # Now complete key 1 with tuple entry v
+ sess.run(stage_v, feed_dict={pi: 1, v: 3})
+ # We can now obtain tuple associated with key 1
+ self.assertTrue(sess.run([key, ret], feed_dict={gi:1})
+ == [1, [1,3, 2]])
+
+ def testPartialDictGetsAndPeeks(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ f = array_ops.placeholder(dtypes.float32)
+ v = array_ops.placeholder(dtypes.float32)
+ pi = array_ops.placeholder(dtypes.int64)
+ pei = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ with ops.device(test.gpu_device_name()):
+ # Test barrier with dictionary
+ stager = data_flow_ops.MapStagingArea(
+ [dtypes.float32, dtypes.float32, dtypes.float32],
+ names=['x', 'v', 'f'])
+ stage_xf = stager.put(pi,{'x': x, 'f': f})
+ stage_v = stager.put(pi, {'v': v})
+ peek_xf = stager.peek(pei, ['x', 'f'])
+ peek_v = stager.peek(pei, ['v'])
+ key_xf, get_xf = stager.get(gi, ['x', 'f'])
+ key_v, get_v = stager.get(gi, ['v'])
+ pop_key_xf, pop_xf = stager.get(indices=['x', 'f'])
+ pop_key_v, pop_v = stager.get(pi, ['v'])
+ size = stager.size()
+ isize = stager.incomplete_size()
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # 0 complete and incomplete entries
+ self.assertTrue(sess.run([size, isize]) == [0, 0])
+ # Stage key 0, x and f tuple entries
+ sess.run(stage_xf, feed_dict={pi: 0, x: 1, f: 2})
+ self.assertTrue(sess.run([size, isize]) == [0, 1])
+ # Stage key 1, x and f tuple entries
+ sess.run(stage_xf, feed_dict={pi: 1, x: 1, f: 2})
+ self.assertTrue(sess.run([size, isize]) == [0, 2])
+
+ # Now complete key 0 with tuple entry v
+ sess.run(stage_v, feed_dict={pi: 0, v: 1})
+ # 1 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [1, 1])
+
+ # We can now peek at 'x' and 'f' values associated with key 0
+ self.assertTrue(sess.run(peek_xf, feed_dict={pei:0})
+ == { 'x':1, 'f':2})
+ # Peek at 'v' value associated with key 0
+ self.assertTrue(sess.run(peek_v, feed_dict={pei:0})
+ == { 'v':1})
+ # 1 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [1, 1])
+
+ # We can now obtain 'x' and 'f' values associated with key 0
+ self.assertTrue(sess.run([key_xf, get_xf], feed_dict={gi:0})
+ == [0, { 'x':1, 'f':2}])
+ # Still have 1 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [1, 1])
+
+ # We can no longer get 'x' and 'f' from key 0
+ with self.assertRaises(errors.InvalidArgumentError) as cm:
+ sess.run([key_xf, get_xf], feed_dict={gi:0})
+
+ exc_str = ("Tensor at index '0' for key '0' "
+ "has already been removed.")
+
+ self.assertTrue(exc_str in cm.exception.message)
+
+ # Obtain 'v' value associated with key 0
+ self.assertTrue(sess.run([key_v, get_v], feed_dict={gi:0})
+ == [0, { 'v':1}])
+ # 0 complete and 1 incomplete entry
+ self.assertTrue(sess.run([size, isize]) == [0, 1])
+
+ # Now complete key 1 with tuple entry v
+ sess.run(stage_v, feed_dict={pi: 1, v: 1})
+ # 1 complete and 0 incomplete entries
+ self.assertTrue(sess.run([size, isize]) == [1, 0])
+
+ # Pop without key to obtain 'x' and 'f' values associated with key 1
+ self.assertTrue(sess.run([pop_key_xf, pop_xf])
+ == [1, { 'x':1, 'f':2}])
+ # still 1 complete and 0 incomplete entries
+ self.assertTrue(sess.run([size, isize]) == [1, 0])
+ # We can now obtain the 'v' value associated with key 1
+ self.assertTrue(sess.run([pop_key_v, pop_v], feed_dict={pi:1})
+ == [1, { 'v': 1 }])
+ # Nothing is left
+ self.assertTrue(sess.run([size, isize]) == [0, 0])
+
+ def testPartialIndexGets(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ f = array_ops.placeholder(dtypes.float32)
+ v = array_ops.placeholder(dtypes.float32)
+ pi = array_ops.placeholder(dtypes.int64)
+ pei = array_ops.placeholder(dtypes.int64)
+ gi = array_ops.placeholder(dtypes.int64)
+ with ops.device(test.gpu_device_name()):
+ # Test again with partial index gets
+ stager = data_flow_ops.MapStagingArea(
+ [dtypes.float32, dtypes.float32, dtypes.float32])
+ stage_xvf = stager.put(pi, [x, v, f], [0, 1, 2])
+ key_xf, get_xf = stager.get(gi, [0, 2])
+ key_v, get_v = stager.get(gi, [1])
+ size = stager.size()
+ isize = stager.incomplete_size()
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # Stage complete tuple
+ sess.run(stage_xvf, feed_dict={pi: 0, x: 1, f: 2, v: 3})
+
+ self.assertTrue(sess.run([size, isize]) == [1, 0])
+
+ # Partial get using indices
+ self.assertTrue(sess.run([key_xf, get_xf],
+ feed_dict={gi: 0}) == [0, [1, 2]])
+
+ # Still some of key 0 left
+ self.assertTrue(sess.run([size, isize]) == [1, 0])
+
+ # Partial get of remaining index
+ self.assertTrue(sess.run([key_v, get_v],
+ feed_dict={gi: 0}) == [0, [3]])
+
+ # All gone
+ self.assertTrue(sess.run([size, isize]) == [0, 0])
+
+if __name__ == '__main__':
+ test.main()
diff --git a/tensorflow/python/kernel_tests/matrix_solve_op_test.py b/tensorflow/python/kernel_tests/matrix_solve_op_test.py
index 07ff53cfe6..e7ae7f714f 100644
--- a/tensorflow/python/kernel_tests/matrix_solve_op_test.py
+++ b/tensorflow/python/kernel_tests/matrix_solve_op_test.py
@@ -96,11 +96,6 @@ class MatrixSolveOpTest(test.TestCase):
[[1., 0., -1.], [-1., 1., 0.], [0., -1., 1.]])
linalg_ops.matrix_solve(matrix, matrix).eval()
- def testEmpty(self):
- with self.test_session():
- self._verifySolve(np.empty([0, 0]), np.empty([0, 0]))
- self._verifySolve(np.empty([2, 2]), np.empty([2, 0]))
-
if __name__ == "__main__":
test.main()
diff --git a/tensorflow/python/kernel_tests/metrics_test.py b/tensorflow/python/kernel_tests/metrics_test.py
index cd5bee362d..543039bdd3 100644
--- a/tensorflow/python/kernel_tests/metrics_test.py
+++ b/tensorflow/python/kernel_tests/metrics_test.py
@@ -1169,7 +1169,7 @@ class AUCTest(test.TestCase):
self.assertAlmostEqual(1, auc.eval(), 6)
def np_auc(self, predictions, labels, weights):
- """Computes the AUC explicitely using Numpy.
+ """Computes the AUC explicitly using Numpy.
Args:
predictions: an ndarray with shape [N].
diff --git a/tensorflow/python/kernel_tests/random_ops_test.py b/tensorflow/python/kernel_tests/random_ops_test.py
index d44c0b3d9f..56aaa53b98 100644
--- a/tensorflow/python/kernel_tests/random_ops_test.py
+++ b/tensorflow/python/kernel_tests/random_ops_test.py
@@ -66,7 +66,8 @@ class RandomNormalTest(test.TestCase):
for dt in dtypes.float16, dtypes.float32, dtypes.float64:
results = {}
for use_gpu in [False, True]:
- sampler = self._Sampler(1000, 0.0, 1.0, dt, use_gpu=use_gpu, seed=12345)
+ sampler = self._Sampler(
+ 1000000, 0.0, 1.0, dt, use_gpu=use_gpu, seed=12345)
results[use_gpu] = sampler()
if dt == dtypes.float16:
self.assertAllClose(results[False], results[True], rtol=1e-3, atol=1e-3)
@@ -135,7 +136,7 @@ class TruncatedNormalTest(test.TestCase):
# We need a particular larger number of samples to test multiple rounds
# on GPU
sampler = self._Sampler(
- 200000, 0.0, 1.0, dt, use_gpu=use_gpu, seed=12345)
+ 1000000, 0.0, 1.0, dt, use_gpu=use_gpu, seed=12345)
results[use_gpu] = sampler()
if dt == dtypes.float16:
self.assertAllClose(results[False], results[True], rtol=1e-3, atol=1e-3)
@@ -243,7 +244,7 @@ class RandomUniformTest(test.TestCase):
results = {}
for use_gpu in False, True:
sampler = self._Sampler(
- 1000, minv=0, maxv=maxv, dtype=dt, use_gpu=use_gpu, seed=12345)
+ 1000000, minv=0, maxv=maxv, dtype=dt, use_gpu=use_gpu, seed=12345)
results[use_gpu] = sampler()
self.assertAllEqual(results[False], results[True])
diff --git a/tensorflow/python/kernel_tests/reader_ops_test.py b/tensorflow/python/kernel_tests/reader_ops_test.py
index 10f34751d0..12932219fc 100644
--- a/tensorflow/python/kernel_tests/reader_ops_test.py
+++ b/tensorflow/python/kernel_tests/reader_ops_test.py
@@ -858,5 +858,49 @@ class AsyncReaderTest(test.TestCase):
output.append(sess.run(args))
+# TODO(jhseu): Restore after fixing.
+#class LMDBReaderTest(test.TestCase):
+#
+# def setUp(self):
+# super(LMDBReaderTest, self).setUp()
+#
+# def testReadFromFile(self):
+# with self.test_session() as sess:
+# reader = io_ops.LMDBReader(name="test_read_from_file")
+# path = os.path.join("tensorflow", "core", "lib", "lmdb", "testdata",
+# "data.mdb")
+# queue = data_flow_ops.FIFOQueue(99, [dtypes.string], shapes=())
+# key, value = reader.read(queue)
+#
+# queue.enqueue([path]).run()
+# queue.close().run()
+# for i in range(10):
+# k, v = sess.run([key, value])
+# self.assertAllEqual(compat.as_bytes(k), compat.as_bytes(str(i)))
+# self.assertAllEqual(compat.as_bytes(v), compat.as_bytes(str(chr(ord('a') + i))))
+#
+# with self.assertRaisesOpError("is closed and has insufficient elements "
+# "\\(requested 1, current size 0\\)"):
+# k, v = sess.run([key, value])
+#
+# def testReadFromFolder(self):
+# with self.test_session() as sess:
+# reader = io_ops.LMDBReader(name="test_read_from_folder")
+# path = os.path.join("tensorflow", "core", "lib", "lmdb", "testdata")
+# queue = data_flow_ops.FIFOQueue(99, [dtypes.string], shapes=())
+# key, value = reader.read(queue)
+#
+# queue.enqueue([path]).run()
+# queue.close().run()
+# for i in range(10):
+# k, v = sess.run([key, value])
+# self.assertAllEqual(compat.as_bytes(k), compat.as_bytes(str(i)))
+# self.assertAllEqual(compat.as_bytes(v), compat.as_bytes(str(chr(ord('a') + i))))
+#
+# with self.assertRaisesOpError("is closed and has insufficient elements "
+# "\\(requested 1, current size 0\\)"):
+# k, v = sess.run([key, value])
+
+
if __name__ == "__main__":
test.main()
diff --git a/tensorflow/python/kernel_tests/sparse_ops_test.py b/tensorflow/python/kernel_tests/sparse_ops_test.py
index 766221a074..4bb9eeca6a 100644
--- a/tensorflow/python/kernel_tests/sparse_ops_test.py
+++ b/tensorflow/python/kernel_tests/sparse_ops_test.py
@@ -19,6 +19,7 @@ from __future__ import division
from __future__ import print_function
import numpy as np
+import unittest
from tensorflow.python.framework import constant_op
from tensorflow.python.framework import dtypes
@@ -587,6 +588,7 @@ class SparseReduceSumTest(test_util.TensorFlowTestCase):
self._compare(sp_t, reduction_axes, ndims, False)
self._compare(sp_t, reduction_axes, ndims, True)
+ @unittest.skipIf(np.__version__ == "1.13.0", "numpy 1.13 bug")
def testSimpleAndRandomInputs(self):
sp_t = sparse_tensor.SparseTensor(self.ind, self.vals, self.dense_shape)
@@ -619,6 +621,7 @@ class SparseReduceSumTest(test_util.TensorFlowTestCase):
with self.assertRaisesOpError("Invalid reduction dimension 2"):
sparse_ops.sparse_reduce_sum(sp_t, 2).eval()
+ @unittest.skipIf(np.__version__ == "1.13.0", "numpy 1.13 bug")
def testGradient(self):
np.random.seed(8161)
test_dims = [(11, 1, 5, 7, 1), (2, 2)]
@@ -859,6 +862,7 @@ class SparseMinimumMaximumTest(test_util.TensorFlowTestCase):
class SparseTransposeTest(test.TestCase):
+ @unittest.skipIf(np.__version__ == "1.13.0", "numpy 1.13 bug")
def testTranspose(self):
with self.test_session(use_gpu=False):
np.random.seed(1618)
diff --git a/tensorflow/python/kernel_tests/stage_op_test.py b/tensorflow/python/kernel_tests/stage_op_test.py
index 81eee48d2e..4a89fb64e3 100644
--- a/tensorflow/python/kernel_tests/stage_op_test.py
+++ b/tensorflow/python/kernel_tests/stage_op_test.py
@@ -1,4 +1,4 @@
-# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -27,22 +27,26 @@ from tensorflow.python.platform import test
class StageTest(test.TestCase):
def testSimple(self):
- with self.test_session(use_gpu=True) as sess:
+ with ops.Graph().as_default() as G:
with ops.device('/cpu:0'):
x = array_ops.placeholder(dtypes.float32)
v = 2. * (array_ops.zeros([128, 128]) + x)
- with ops.device('/gpu:0'):
+ with ops.device(test.gpu_device_name()):
stager = data_flow_ops.StagingArea([dtypes.float32])
stage = stager.put([v])
y = stager.get()
y = math_ops.reduce_max(math_ops.matmul(y, y))
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
sess.run(stage, feed_dict={x: -1})
for i in range(10):
_, yval = sess.run([stage, y], feed_dict={x: i})
self.assertAllClose(4 * (i - 1) * (i - 1) * 128, yval, rtol=1e-4)
def testMultiple(self):
- with self.test_session(use_gpu=True) as sess:
+ with ops.Graph().as_default() as G:
with ops.device('/cpu:0'):
x = array_ops.placeholder(dtypes.float32)
v = 2. * (array_ops.zeros([128, 128]) + x)
@@ -51,6 +55,10 @@ class StageTest(test.TestCase):
stage = stager.put([x, v])
z, y = stager.get()
y = math_ops.reduce_max(z * math_ops.matmul(y, y))
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
sess.run(stage, feed_dict={x: -1})
for i in range(10):
_, yval = sess.run([stage, y], feed_dict={x: i})
@@ -58,7 +66,7 @@ class StageTest(test.TestCase):
4 * (i - 1) * (i - 1) * (i - 1) * 128, yval, rtol=1e-4)
def testDictionary(self):
- with self.test_session(use_gpu=True) as sess:
+ with ops.Graph().as_default() as G:
with ops.device('/cpu:0'):
x = array_ops.placeholder(dtypes.float32)
v = 2. * (array_ops.zeros([128, 128]) + x)
@@ -72,24 +80,199 @@ class StageTest(test.TestCase):
z = ret['x']
y = ret['v']
y = math_ops.reduce_max(z * math_ops.matmul(y, y))
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
sess.run(stage, feed_dict={x: -1})
for i in range(10):
_, yval = sess.run([stage, y], feed_dict={x: i})
self.assertAllClose(
4 * (i - 1) * (i - 1) * (i - 1) * 128, yval, rtol=1e-4)
- def testColocation1(self):
- with ops.device('/cpu:0'):
- x = array_ops.placeholder(dtypes.float32)
- v = 2. * (array_ops.zeros([128, 128]) + x)
- with ops.device('/gpu:0'):
- stager = data_flow_ops.StagingArea([dtypes.float32])
- y = stager.put([v])
- self.assertEqual(y.device, '/device:GPU:0')
- with ops.device('/cpu:0'):
- x = stager.get()
- self.assertEqual(x.device, '/device:CPU:0')
+ def testColocation(self):
+ gpu_dev = test.gpu_device_name()
+
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32)
+ v = 2. * (array_ops.zeros([128, 128]) + x)
+ with ops.device(gpu_dev):
+ stager = data_flow_ops.StagingArea([dtypes.float32])
+ y = stager.put([v])
+ self.assertEqual(y.device, '/device:GPU:0' if gpu_dev
+ else gpu_dev)
+ with ops.device('/cpu:0'):
+ x = stager.get()
+ self.assertEqual(x.device, '/device:CPU:0')
+
+ G.finalize()
+
+ def testPeek(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.int32, name='x')
+ p = array_ops.placeholder(dtypes.int32, name='p')
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.StagingArea([dtypes.int32, ], shapes=[[]])
+ stage = stager.put([x])
+ peek = stager.peek(p)
+ ret = stager.get()
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ for i in range(10):
+ sess.run(stage, feed_dict={x:i})
+
+ for i in range(10):
+ self.assertTrue(sess.run(peek, feed_dict={p:i}) == i)
+
+ def testSizeAndClear(self):
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.float32, name='x')
+ v = 2. * (array_ops.zeros([128, 128]) + x)
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.StagingArea(
+ [dtypes.float32, dtypes.float32],
+ shapes=[[], [128, 128]],
+ names=['x', 'v'])
+ stage = stager.put({'x': x, 'v': v})
+ ret = stager.get()
+ size = stager.size()
+ clear = stager.clear()
+
+ G.finalize()
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ sess.run(stage, feed_dict={x: -1})
+ self.assertEqual(sess.run(size), 1)
+ sess.run(stage, feed_dict={x: -1})
+ self.assertEqual(sess.run(size), 2)
+ sess.run(clear)
+ self.assertEqual(sess.run(size), 0)
+
+ def testCapacity(self):
+ capacity = 3
+
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.int32, name='x')
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.StagingArea([dtypes.int32, ],
+ capacity=capacity, shapes=[[]])
+ stage = stager.put([x])
+ ret = stager.get()
+ size = stager.size()
+
+ G.finalize()
+
+ from six.moves import queue as Queue
+ import threading
+
+ queue = Queue.Queue()
+ n = 5
+ missed = 0
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # Stage data in a separate thread which will block
+ # when it hits the staging area's capacity and thus
+ # not fill the queue with n tokens
+ def thread_run():
+ for i in range(n):
+ sess.run(stage, feed_dict={x: i})
+ queue.put(0)
+
+ t = threading.Thread(target=thread_run)
+ t.start()
+
+ # Get tokens from the queue, making notes of when we timeout
+ for i in range(n):
+ try:
+ queue.get(timeout=0.05)
+ except Queue.Empty:
+ missed += 1
+
+ # We timed out n - capacity times waiting for queue puts
+ self.assertTrue(missed == n - capacity)
+
+ # Clear the staging area out a bit
+ for i in range(n - capacity):
+ self.assertTrue(sess.run(ret) == i)
+
+ # Thread should be able to join now
+ t.join()
+
+ self.assertTrue(sess.run(size) == capacity)
+
+ # Clear the staging area completely
+ for i in range(capacity):
+ self.assertTrue(sess.run(ret) == i+(n-capacity))
+
+ self.assertTrue(sess.run(size) == 0)
+
+ def testMemoryLimit(self):
+ memory_limit = 512*1024 # 512K
+    chunk = 200*1024 # 200K
+ capacity = memory_limit // chunk
+
+ with ops.Graph().as_default() as G:
+ with ops.device('/cpu:0'):
+ x = array_ops.placeholder(dtypes.uint8, name='x')
+ with ops.device(test.gpu_device_name()):
+ stager = data_flow_ops.StagingArea([dtypes.uint8, ],
+ memory_limit=memory_limit, shapes=[[]])
+ stage = stager.put([x])
+ ret = stager.get()
+ size = stager.size()
+
+ G.finalize()
+
+ from six.moves import queue as Queue
+ import threading
+ import numpy as np
+
+ queue = Queue.Queue()
+ n = 5
+ missed = 0
+
+ with self.test_session(use_gpu=True, graph=G) as sess:
+ # Stage data in a separate thread which will block
+ # when it hits the staging area's capacity and thus
+ # not fill the queue with n tokens
+ def thread_run():
+ for i in range(n):
+ sess.run(stage, feed_dict={x: np.full(chunk, i, dtype=np.uint8)})
+ queue.put(0)
+
+ t = threading.Thread(target=thread_run)
+ t.start()
+
+ # Get tokens from the queue, making notes of when we timeout
+ for i in range(n):
+ try:
+ queue.get(timeout=0.05)
+ except Queue.Empty:
+ missed += 1
+
+ # We timed out n - capacity times waiting for queue puts
+ self.assertTrue(missed == n - capacity)
+
+ # Clear the staging area out a bit
+ for i in range(n - capacity):
+ self.assertTrue(sess.run(ret)[0] == i)
+
+ # Thread should be able to join now
+ t.join()
+
+ self.assertTrue(sess.run(size) == capacity)
+
+ # Clear the staging area completely
+ for i in range(capacity):
+ self.assertTrue(sess.run(ret)[0] == i+(n-capacity))
+ self.assertTrue(sess.run(size) == 0)
if __name__ == '__main__':
test.main()
diff --git a/tensorflow/python/kernel_tests/substr_op_test.py b/tensorflow/python/kernel_tests/substr_op_test.py
index 0c0710fed4..854394b0dd 100644
--- a/tensorflow/python/kernel_tests/substr_op_test.py
+++ b/tensorflow/python/kernel_tests/substr_op_test.py
@@ -183,7 +183,7 @@ class SubstrOpTest(test.TestCase):
position = np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3]], dtype)
length = np.array([[2, 3, 4]], dtype)
- # Should fail: postion/length have different dimensionality
+ # Should fail: position/length have different dimensionality
with self.assertRaises(ValueError):
substr_op = string_ops.substr(test_string, position, length)
diff --git a/tensorflow/python/kernel_tests/variable_scope_test.py b/tensorflow/python/kernel_tests/variable_scope_test.py
index 245dcc96db..7108131d53 100644
--- a/tensorflow/python/kernel_tests/variable_scope_test.py
+++ b/tensorflow/python/kernel_tests/variable_scope_test.py
@@ -115,7 +115,7 @@ class VariableScopeTest(test.TestCase):
dtypes.int64, dtypes.bool
]
- # Use different varibale_name to distinguish various dtypes
+ # Use different variable_name to distinguish various dtypes
for (i, dtype) in enumerate(types):
x = variable_scope.get_variable(
name="x%d" % i, shape=(3, 4), dtype=dtype)
@@ -807,7 +807,7 @@ class VariableScopeWithPartitioningTest(test.TestCase):
dtypes.int64, dtypes.bool
]
- # Use different varibale_name to distinguish various dtypes
+ # Use different variable_name to distinguish various dtypes
for (i, dtype) in enumerate(types):
x = variable_scope.get_variable(
name="x%d" % i,
diff --git a/tensorflow/python/ops/candidate_sampling_ops.py b/tensorflow/python/ops/candidate_sampling_ops.py
index 3053a333bf..d6294c24f5 100644
--- a/tensorflow/python/ops/candidate_sampling_ops.py
+++ b/tensorflow/python/ops/candidate_sampling_ops.py
@@ -249,7 +249,7 @@ def fixed_unigram_candidate_sampler(true_classes,
`distortion = 1.0` gives regular unigram sampling (as defined by the vocab
file), and `distortion = 0.0` gives a uniform distribution.
num_reserved_ids: Optionally some reserved IDs can be added in the range
- `[0, num_reserved_ids]` by the users. One use case is that a special
+ `[0, num_reserved_ids)` by the users. One use case is that a special
unknown word token is used as ID 0. These IDs will have a sampling
probability of 0.
num_shards: A sampler can be used to sample from a subset of the original
diff --git a/tensorflow/python/ops/check_ops.py b/tensorflow/python/ops/check_ops.py
index 753999a672..1d853df86c 100644
--- a/tensorflow/python/ops/check_ops.py
+++ b/tensorflow/python/ops/check_ops.py
@@ -726,7 +726,7 @@ def _assert_ranks_condition(
# Attempt to statically defined rank.
ranks_static = tuple([tensor_util.constant_value(rank) for rank in ranks])
- if None not in ranks_static:
+ if not any(r is None for r in ranks_static):
for rank_static in ranks_static:
if rank_static.ndim != 0:
raise ValueError('Rank must be a scalar.')
diff --git a/tensorflow/python/ops/control_flow_ops.py b/tensorflow/python/ops/control_flow_ops.py
index f1d34cb0e8..478e0a9472 100644
--- a/tensorflow/python/ops/control_flow_ops.py
+++ b/tensorflow/python/ops/control_flow_ops.py
@@ -326,7 +326,7 @@ def switch(data, pred, dtype=None, name=None):
def _SwitchRefOrTensor(data, pred, name="Switch"):
"""Forwards `data` to an output determined by `pred`.
- If `pred` is false, the `data` input is forwared to the first output.
+ If `pred` is false, the `data` input is forwarded to the first output.
Otherwise, the data goes to the second output.
This op handles `Tensor`s and `IndexedSlices`.
diff --git a/tensorflow/python/ops/ctc_ops.py b/tensorflow/python/ops/ctc_ops.py
index 4ea4d9ed2d..477c0d1cb4 100644
--- a/tensorflow/python/ops/ctc_ops.py
+++ b/tensorflow/python/ops/ctc_ops.py
@@ -37,7 +37,7 @@ def ctc_loss(labels, inputs, sequence_length,
This op implements the CTC loss as presented in the article:
[A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber.
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data
+ Connectionist Temporal Classification: Labeling Unsegmented Sequence Data
with Recurrent Neural Networks. ICML 2006, Pittsburgh, USA, pp. 369-376.](http://www.cs.toronto.edu/~graves/icml_2006.pdf)
Input requirements:
diff --git a/tensorflow/python/ops/data_flow_ops.py b/tensorflow/python/ops/data_flow_ops.py
index c272a7115d..4eead79531 100644
--- a/tensorflow/python/ops/data_flow_ops.py
+++ b/tensorflow/python/ops/data_flow_ops.py
@@ -1,4 +1,4 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -516,7 +516,7 @@ class QueueBase(object):
that would block will fail immediately.
If `cancel_pending_enqueues` is `True`, all pending requests will also
- be cancelled.
+ be canceled.
Args:
cancel_pending_enqueues: (Optional.) A boolean, defaulting to
@@ -988,7 +988,7 @@ class Barrier(object):
TakeMany operations that would block will fail immediately.
If `cancel_pending_enqueues` is `True`, all pending requests to the
- underlying queue will also be cancelled, and completing of already
+ underlying queue will also be canceled, and completing of already
started values is also not acceptable anymore.
Args:
@@ -1344,72 +1344,30 @@ class SparseConditionalAccumulator(ConditionalAccumulatorBase):
dense_shape=return_val.shape)
-class StagingArea(object):
- """Class for staging inputs. No ordering guarantees.
-
- A `StagingArea` is a TensorFlow data structure that stores tensors across
- multiple steps, and exposes operations that can put and get tensors.
-
- Each `StagingArea` element is a tuple of one or more tensors, where each
- tuple component has a static dtype, and may have a static shape.
-
- The capacity of a `StagingArea` is unbounded and supports multiple
- concurrent producers and consumers; and provides exactly-once delivery.
-
- Each element of a `StagingArea` is a fixed-length tuple of tensors whose
- dtypes are described by `dtypes`, and whose shapes are optionally described
- by the `shapes` argument.
-
- If the `shapes` argument is specified, each component of a staging area
- element must have the respective fixed shape. If it is
- unspecified, different elements may have different shapes,
- """
-
+class BaseStagingArea(object):
+ """Base class for Staging Areas."""
_identifier = 0
_lock = threading.Lock()
- def __init__(self, dtypes, shapes=None, names=None, shared_name=None):
- """Constructs a staging area object.
-
- The two optional lists, `shapes` and `names`, must be of the same length
- as `dtypes` if provided. The values at a given index `i` indicate the
- shape and name to use for the corresponding queue component in `dtypes`.
-
- The device scope at the time of object creation determines where the
- storage for the `StagingArea` will reside. Calls to `put` will incur a copy
- to this memory space, if necessary. Tensors returned by `get` will be
- placed according to the device scope when `get` is called.
-
- Args:
- dtypes: A list of types. The length of dtypes must equal the number
- of tensors in each element.
- shapes: (Optional.) Constraints on the shapes of tensors in an element.
- A list of shape tuples or None. This list is the same length
- as dtypes. If the shape of any tensors in the element are constrained,
- all must be; shapes can be None if the shapes should not be constrained.
- names: (Optional.) If provided, the `get()` and
- `put()` methods will use dictionaries with these names as keys.
- Must be None or a list or tuple of the same length as `dtypes`.
- shared_name: (Optional.) A name to be used for the shared object. By
- passing the same name to two different python objects they will share
- the underlying staging area. Must be a string.
-
- Raises:
- ValueError: If one of the arguments is invalid.
- """
+ def __init__(self, dtypes, shapes=None, names=None, shared_name=None,
+ capacity=0, memory_limit=0):
if shared_name is None:
- self._name = ops.get_default_graph().unique_name("StagingArea")
+ self._name = (ops.get_default_graph()
+ .unique_name(self.__class__.__name__))
elif isinstance(shared_name, six.string_types):
self._name = shared_name
else:
raise ValueError("shared_name must be a string")
+
self._dtypes = dtypes
+
if shapes is not None:
if len(shapes) != len(dtypes):
raise ValueError("StagingArea shapes must be the same length as dtypes")
self._shapes = [tensor_shape.TensorShape(s) for s in shapes]
else:
self._shapes = [tensor_shape.unknown_shape() for _ in self._dtypes]
+
if names is not None:
if len(names) != len(dtypes):
raise ValueError("StagingArea names must be the same length as dtypes")
@@ -1417,6 +1375,9 @@ class StagingArea(object):
else:
self._names = None
+ self._capacity = capacity
+ self._memory_limit = memory_limit
+
# all get and put ops must colocate with this op
with ops.name_scope("%s_root" % self._name):
self._coloc_op = control_flow_ops.no_op()
@@ -1441,52 +1402,141 @@ class StagingArea(object):
"""The list of names for each component of a staging area element."""
return self._names
- def _check_put_dtypes(self, vals):
+ @property
+ def capacity(self):
+ """The maximum number of elements of this staging area."""
+ return self._capacity
+
+ @property
+ def memory_limit(self):
+ """The maximum number of bytes of this staging area."""
+ return self._memory_limit
+
+ def _check_put_dtypes(self, vals, indices=None):
"""Validate and convert `vals` to a list of `Tensor`s.
The `vals` argument can be a Tensor, a list or tuple of tensors, or a
dictionary with tensor values.
+ If `vals` is a list, then the appropriate indices associated with the
+ values must be provided.
+
If it is a dictionary, the staging area must have been constructed with a
`names` attribute and the dictionary keys must match the staging area names.
+ `indices` will be inferred from the dictionary keys.
If the staging area was constructed with a `names` attribute, `vals` must
be a dictionary.
+ Checks that the dtype and shape of each value matches that
+ of the staging area.
+
Args:
vals: A tensor, a list or tuple of tensors, or a dictionary..
Returns:
- A list of `Tensor` objects.
+ A (tensors, indices) tuple where `tensors` is a list of `Tensor` objects
+      and `indices` is a list of indices associated with the tensors.
Raises:
- ValueError: If `vals` is invalid.
+ ValueError: If `vals` or `indices` is invalid.
"""
if isinstance(vals, dict):
if not self._names:
raise ValueError(
"Staging areas must have names to enqueue a dictionary")
- if sorted(self._names) != sorted(vals.keys()):
+ if not set(vals.keys()).issubset(self._names):
raise ValueError("Keys in dictionary to put do not match names "
"of staging area. Dictionary: (%s), Queue: (%s)" %
(sorted(vals.keys()), sorted(self._names)))
# The order of values in `self._names` indicates the order in which the
# tensors in the dictionary `vals` must be listed.
- vals = [vals[k] for k in self._names]
+ vals, indices, n = zip(*[(vals[k], i, k) for i, k in enumerate(self._names)
+ if k in vals])
else:
if self._names:
raise ValueError("You must enqueue a dictionary in a staging area "
"with names")
+
+ if indices is None:
+ raise ValueError("Indices must be supplied when inserting a list "
+ "of tensors")
+
+ if len(indices) != len(vals):
+ raise ValueError("Number of indices '%s' doesn't match "
+ "number of values '%s'")
+
if not isinstance(vals, (list, tuple)):
vals = [vals]
+ indices = [0]
+
+ # Sanity check number of values
+ if not len(vals) <= len(self._dtypes):
+ raise ValueError("Unexpected number of inputs '%s' vs '%s'" % (
+ len(values), len(self._dtypes)))
tensors = []
- for i, (val, dtype) in enumerate(zip(vals, self._dtypes)):
- tensors.append(
- ops.convert_to_tensor(
- val, dtype=dtype, name="component_%d" % i))
+
+ for val, i in zip(vals, indices):
+ dtype, shape = self._dtypes[i], self._shapes[i]
+ # Check dtype
+ if not val.dtype == dtype:
+ raise ValueError("Datatypes do not match. '%s' != '%s'" %(
+ str(val.dtype), str(dtype)))
+
+ # Check shape
+ val.get_shape().assert_is_compatible_with(shape)
+
+ tensors.append(ops.convert_to_tensor(val, dtype=dtype,
+ name="component_%d" % i))
+
+ return tensors, indices
+
+ def _create_device_transfers(self, tensors):
+ """Encode inter-device transfers if the current device
+ is not the same as the Staging Area's device
+ """
+
+ if not isinstance(tensors, (tuple, list)):
+ tensors = [tensors]
+
+ curr_device_scope = control_flow_ops.no_op().device
+
+ if curr_device_scope != self._coloc_op.device:
+ tensors = [array_ops.identity(t) for t in tensors]
return tensors
+ def _get_return_value(self, tensors, indices):
+ """Return the value to return from a get op.
+
+ If the staging area has names, return a dictionary with the
+ names as keys. Otherwise return either a single tensor
+ or a list of tensors depending on the length of `tensors`.
+
+ Args:
+ tensors: List of tensors from the get op.
+ indices: Indices of associated names and shapes
+
+ Returns:
+ A single tensor, a list of tensors, or a dictionary
+ of tensors.
+ """
+
+ tensors = self._create_device_transfers(tensors)
+
+ # Sets shape
+ for output, i in zip(tensors, indices):
+ output.set_shape(self._shapes[i])
+
+ if self._names:
+ # The returned values in `tensors` are in the same order as
+ # the names in `self._names`.
+ return {self._names[i]: t for t, i in zip(tensors, indices)}
+ elif len(tensors) == 1:
+ return tensors[0]
+ else:
+ return tensors
+
def _scope_vals(self, vals):
"""Return a list of values to pass to `name_scope()`.
@@ -1503,9 +1553,86 @@ class StagingArea(object):
else:
return [vals]
+class StagingArea(BaseStagingArea):
+ """Class for staging inputs. No ordering guarantees.
+
+ A `StagingArea` is a TensorFlow data structure that stores tensors across
+ multiple steps, and exposes operations that can put and get tensors.
+
+ Each `StagingArea` element is a tuple of one or more tensors, where each
+ tuple component has a static dtype, and may have a static shape.
+
+ The capacity of a `StagingArea` may be bounded or unbounded.
+ It supports multiple concurrent producers and consumers; and
+ provides exactly-once delivery.
+
+ Each element of a `StagingArea` is a fixed-length tuple of tensors whose
+ dtypes are described by `dtypes`, and whose shapes are optionally described
+ by the `shapes` argument.
+
+ If the `shapes` argument is specified, each component of a staging area
+ element must have the respective fixed shape. If it is
+  unspecified, different elements may have different shapes.
+
+ It can be configured with a capacity in which case
+ put(values) will block until space becomes available.
+
+ Similarly, it can be configured with a memory limit which
+ will block put(values) until space is available.
+ This is mostly useful for limiting the number of tensors on
+ devices such as GPUs.
+
+  All get() and peek() commands block if the requested data
+ is not present in the Staging Area.
+
+ """
+
+ def __init__(self, dtypes, shapes=None, names=None, shared_name=None,
+ capacity=0, memory_limit=0):
+ """Constructs a staging area object.
+
+ The two optional lists, `shapes` and `names`, must be of the same length
+ as `dtypes` if provided. The values at a given index `i` indicate the
+ shape and name to use for the corresponding queue component in `dtypes`.
+
+ The device scope at the time of object creation determines where the
+ storage for the `StagingArea` will reside. Calls to `put` will incur a copy
+ to this memory space, if necessary. Tensors returned by `get` will be
+ placed according to the device scope when `get` is called.
+
+ Args:
+ dtypes: A list of types. The length of dtypes must equal the number
+ of tensors in each element.
+ capacity: (Optional.) Maximum number of elements.
+ An integer. If zero, the Staging Area is unbounded
+ memory_limit: (Optional.) Maximum number of bytes of all tensors
+ in the Staging Area.
+ An integer. If zero, the Staging Area is unbounded
+ shapes: (Optional.) Constraints on the shapes of tensors in an element.
+ A list of shape tuples or None. This list is the same length
+ as dtypes. If the shape of any tensors in the element are constrained,
+ all must be; shapes can be None if the shapes should not be constrained.
+ names: (Optional.) If provided, the `get()` and
+ `put()` methods will use dictionaries with these names as keys.
+ Must be None or a list or tuple of the same length as `dtypes`.
+ shared_name: (Optional.) A name to be used for the shared object. By
+ passing the same name to two different python objects they will share
+ the underlying staging area. Must be a string.
+
+ Raises:
+ ValueError: If one of the arguments is invalid.
+ """
+
+ super(StagingArea, self).__init__(dtypes, shapes,
+ names, shared_name,
+ capacity, memory_limit)
+
def put(self, values, name=None):
"""Create an op that places a value into the staging area.
+ This operation will block if the `StagingArea` has reached
+ its capacity.
+
Args:
values: Tensor (or a tuple of Tensors) to place into the staging area.
name: A name for the operation (optional).
@@ -1518,46 +1645,25 @@ class StagingArea(object):
"""
with ops.name_scope(name, "%s_put" % self._name,
self._scope_vals(values)) as scope:
- vals = self._check_put_dtypes(values)
- if len(values) != len(self._dtypes):
- raise ValueError("Unexpected number of inputs " + str(len(values)) +
- "vs " + str(len(self._dtypes)))
- for val, dtype in zip(vals, self._dtypes):
- if val.dtype != dtype:
- raise ValueError("Datatypes do not match. " + str(val.dtype) + " != "
- + str(dtype))
- for val, shape in zip(vals, self._shapes):
- val.get_shape().assert_is_compatible_with(shape)
+ # Hard-code indices for this staging area
+ indices = (list(six.moves.range(len(values)))
+ if isinstance(values, (list, tuple)) else None)
+ vals, _ = self._check_put_dtypes(values, indices)
with ops.colocate_with(self._coloc_op):
op = gen_data_flow_ops.stage(values=vals, shared_name=self._name,
- name=scope)
+ name=scope, capacity=self._capacity,
+ memory_limit=self._memory_limit)
return op
- def _get_return_value(self, tensors):
- """Return the value to return from a get op.
-
- If the staging area has names, return a dictionary with the
- names as keys. Otherwise return either a single tensor
- or a list of tensors depending on the length of `tensors`.
-
- Args:
- tensors: List of tensors from the get op.
+ def __internal_get(self, get_fn, name):
+ with ops.colocate_with(self._coloc_op):
+ ret = get_fn()
- Returns:
- A single tensor, a list of tensors, or a dictionary
- of tensors.
- """
- if self._names:
- # The returned values in `tensors` are in the same order as
- # the names in `self._names`.
- return {n: tensors[i] for i, n in enumerate(self._names)}
- elif len(tensors) == 1:
- return tensors[0]
- else:
- return tensors
+ indices = list(six.moves.range(len(self._dtypes))) # Hard coded
+ return self._get_return_value(ret, indices)
def get(self, name=None):
"""Gets one element from this staging area.
@@ -1584,19 +1690,448 @@ class StagingArea(object):
if name is None:
name = "%s_get" % self._name
+ fn = lambda: gen_data_flow_ops.unstage(dtypes=self._dtypes,
+ shared_name=self._name, name=name,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+ return self.__internal_get(fn, name)
+
+ def peek(self, index, name=None):
+ """Peeks at an element in the staging area.
+
+ If the staging area is too small to contain the element at
+ the specified index, it will block until enough elements
+ are inserted to complete the operation.
+
+ The placement of the returned tensor will be determined by
+ the current device scope when this function is called.
+
+ Args:
+ index: The index of the tensor within the staging area
+ to look up.
+ name: A name for the operation (optional).
+
+ Returns:
+ The tuple of tensors that was gotten.
+ """
+ if name is None:
+ name = "%s_peek" % self._name
+
+ fn = lambda: gen_data_flow_ops.stage_peek(index,
+ dtypes=self._dtypes, shared_name=self._name,
+ name=name, capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+ return self.__internal_get(fn, name)
+
+ def size(self, name=None):
+ """Returns the number of elements in the staging area.
+
+ Args:
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if name is None:
+ name = "%s_size" % self._name
+
+ return gen_data_flow_ops.stage_size(name=name, shared_name=self._name,
+ dtypes=self._dtypes, capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+ def clear(self, name=None):
+ """Clears the staging area.
+
+ Args:
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if name is None:
+ name = "%s_clear" % self._name
+
+ return gen_data_flow_ops.stage_clear(name=name, shared_name=self._name,
+ dtypes=self._dtypes, capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+class MapStagingArea(BaseStagingArea):
+ """
+ A `MapStagingArea` is a TensorFlow data structure that stores tensors across
+ multiple steps, and exposes operations that can put and get tensors.
+
+ Each `MapStagingArea` element is a (key, value) pair.
+ Only int64 keys are supported, other types should be
+ hashed to produce a key.
+ Values are a tuple of one or more tensors.
+ Each tuple component has a static dtype,
+ and may have a static shape.
+
+ The capacity of a `MapStagingArea` may be bounded or unbounded.
+ It supports multiple concurrent producers and consumers; and
+ provides exactly-once delivery.
+
+ Each value tuple of a `MapStagingArea` is a fixed-length tuple of tensors whose
+ dtypes are described by `dtypes`, and whose shapes are optionally described
+ by the `shapes` argument.
+
+ If the `shapes` argument is specified, each component of a staging area
+ element must have the respective fixed shape. If it is
+ unspecified, different elements may have different shapes.
+
+ It behaves like an associative container with support for:
+
+ - put(key, values)
+ - peek(key) like dict.get(key)
+ - get(key) like dict.pop(key)
+ - get(key=None) like dict.popitem()
+ - size()
+ - clear()
+
+ If ordered, a tree structure ordered by key will be used and
+ get(key=None) will remove (key, value) pairs in increasing key order.
+ Otherwise a hashtable will be used.
+
+ It can be configured with a capacity in which case
+ put(key, values) will block until space becomes available.
+
+ Similarly, it can be configured with a memory limit which
+ will block put(key, values) until space is available.
+ This is mostly useful for limiting the number of tensors on
+ devices such as GPUs.
+
+ All get() and peek() commands block if the requested
+ (key, value) pair is not present in the staging area.
+
+ Partial puts are supported and will be placed in an incomplete
+ map until such time as all values associated with the key have
+ been inserted. Once completed, this (key, value) pair will be
+ inserted into the map. Data in the incomplete map
+ counts towards the memory limit, but not towards capacity limit.
+
+ Partial gets from the map are also supported.
+ This removes the partially requested tensors from the entry,
+ but the entry is only removed from the map once all tensors
+ associated with it are removed.
+ """
+
+ def __init__(self, dtypes, shapes=None, names=None, shared_name=None,
+ ordered=False, capacity=0, memory_limit=0):
+ """
+ Args:
+ dtypes: A list of types. The length of dtypes must equal the number
+ of tensors in each element.
+ capacity: (Optional.) Maximum number of elements.
+ An integer. If zero, the Staging Area is unbounded
+ memory_limit: (Optional.) Maximum number of bytes of all tensors
+ in the Staging Area (excluding keys).
+ An integer. If zero, the Staging Area is unbounded
+ ordered: (Optional.) If True the underlying data structure
+ is a tree ordered on key. Otherwise assume a hashtable.
+ shapes: (Optional.) Constraints on the shapes of tensors in an element.
+ A list of shape tuples or None. This list is the same length
+ as dtypes. If the shape of any tensors in the element are constrained,
+ all must be; shapes can be None if the shapes should not be constrained.
+ names: (Optional.) If provided, the `get()` and
+ `put()` methods will use dictionaries with these names as keys.
+ Must be None or a list or tuple of the same length as `dtypes`.
+ shared_name: (Optional.) A name to be used for the shared object. By
+ passing the same name to two different python objects they will share
+ the underlying staging area. Must be a string.
+
+ Raises:
+ ValueError: If one of the arguments is invalid.
+
+ """
+
+ super(MapStagingArea, self).__init__(dtypes, shapes,
+ names, shared_name,
+ capacity, memory_limit)
+
+ # Defer to different methods depending if the map is ordered
+ self._ordered = ordered
+
+ if ordered:
+ self._put_fn = gen_data_flow_ops.ordered_map_stage
+ self._pop_fn = gen_data_flow_ops.ordered_map_unstage
+ self._popitem_fn = gen_data_flow_ops.ordered_map_unstage_no_key
+ self._peek_fn = gen_data_flow_ops.ordered_map_peek
+ self._size_fn = gen_data_flow_ops.ordered_map_size
+ self._incomplete_size_fn = gen_data_flow_ops.ordered_map_incomplete_size
+ self._clear_fn = gen_data_flow_ops.ordered_map_clear
+ else:
+ self._put_fn = gen_data_flow_ops.map_stage
+ self._pop_fn = gen_data_flow_ops.map_unstage
+ self._popitem_fn = gen_data_flow_ops.map_unstage_no_key
+ self._peek_fn = gen_data_flow_ops.map_peek
+ self._size_fn = gen_data_flow_ops.map_size
+ self._incomplete_size_fn = gen_data_flow_ops.map_incomplete_size
+ self._clear_fn = gen_data_flow_ops.map_clear
+
+ def put(self, key, vals, indices=None, name=None):
+ """
+ Create an op that stores the (key, vals) pair in the staging area.
+
+ Incomplete puts are possible, preferably using a dictionary for vals,
+ as the appropriate dtypes and shapes can be inferred from the names
+ in the vals dictionary. If vals is a list or tuple, indices must
+ also be specified so that the op knows at which element position
+ to perform the insert.
+
+ This operation will block if the capacity or memory limit of this
+ container is reached.
+
+ Args:
+ key: Key associated with the data
+ vals: Tensor (or a dict/tuple of Tensors) to place
+ into the staging area.
+ indices: (Optional) if vals is a tuple/list, this is required.
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+
+ Raises:
+ ValueError: If the number or type of inputs don't match the staging area.
+ """
+
+ with ops.name_scope(name, "%s_put" % self._name,
+ self._scope_vals(vals)) as scope:
+
+ vals, indices = self._check_put_dtypes(vals, indices)
+
+ with ops.colocate_with(self._coloc_op):
+ op = self._put_fn(key, indices, vals, dtypes=self._dtypes,
+ shared_name=self._name, name=scope,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
+ return op
+
+ def _get_indices_and_dtypes(self, indices=None):
+ if indices is None:
+ indices = list(six.moves.range(len(self._dtypes)))
+
+ if not isinstance(indices, (tuple, list)):
+ raise TypeError("Invalid indices type '%s'" % type(indices))
+
+ if len(indices) == 0:
+ raise ValueError("Empty indices")
+
+ if all(isinstance(i, str) for i in indices):
+ if self._names is None:
+ raise ValueError("String indices provided '%s', but this Staging Area "
+ "was not created with names." % indices)
+
+ try:
+ indices = [self._names.index(n) for n in indices]
+ except ValueError:
+ raise ValueError("Named index '%s' not in "
+ "Staging Area names '%s'" % (n, self._names))
+ elif all(isinstance(i, int) for i in indices):
+ pass
+ else:
+ raise TypeError("Mixed types in indices '%s'. "
+ "May only be str or int" % indices)
+
+ dtypes = [self._dtypes[i] for i in indices]
+
+ return indices, dtypes
+
+
+ def peek(self, key, indices=None, name=None):
+ """
+ Peeks at staging area data associated with the key.
+
+ If the key is not in the staging area, it will block
+ until the associated (key, value) is inserted.
+
+ Args:
+ key: Key associated with the required data
+ indices: Partial list of tensors to retrieve (optional).
+ A list of integer or string indices.
+ String indices are only valid if the Staging Area
+ has names associated with it.
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+
+ if name is None:
+ name = "%s_pop" % self._name
+
+ indices, dtypes = self._get_indices_and_dtypes(indices)
+
+ with ops.colocate_with(self._coloc_op):
+ result = self._peek_fn(key, shared_name=self._name,
+ indices=indices,
+ dtypes=dtypes,
+ name=name,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+ return self._get_return_value(result, indices)
+
+ def get(self, key=None, indices=None, name=None):
+ """
+ If the key is provided, the associated (key, value)
+ is returned from the staging area. If the key is not
+ in the staging area, this method will block until
+ the associated (key, value) is inserted.
+
+ If no key is provided and the staging area is ordered,
+ the (key, value) with the smallest key will be returned.
+ Otherwise, a random (key, value) will be returned.
+
+ If the staging area is empty when this operation executes,
+ it will block until there is an element to dequeue.
+
+ Args:
+ key: Key associated with the required data (Optional)
+ indices: Partial list of tensors to retrieve (optional).
+ A list of integer or string indices.
+ String indices are only valid if the Staging Area
+ has names associated with it.
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if key is None:
+ return self._popitem(indices=indices, name=name)
+ else:
+ return self._pop(key, indices=indices, name=name)
+
+ def _pop(self, key, indices=None, name=None):
+ """
+ Remove and return the associated (key, value)
+ is returned from the staging area. If the key is not
+ in the staging area, this method will block until
+ the associated (key, value) is inserted.
+
+ Args:
+ key: Key associated with the required data
+ indices: Partial list of tensors to retrieve (optional).
+ A list of integer or string indices.
+ String indices are only valid if the Staging Area
+ has names associated with it.
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if name is None:
+ name = "%s_get" % self._name
+
+ indices, dtypes = self._get_indices_and_dtypes(indices)
+
with ops.colocate_with(self._coloc_op):
- ret = gen_data_flow_ops.unstage(dtypes=self._dtypes,
- shared_name=self._name, name=name)
+ result = self._pop_fn(key, shared_name=self._name,
+ indices=indices,
+ dtypes=dtypes,
+ name=name,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
- curr_device_scope = control_flow_ops.no_op().device
- if curr_device_scope != self._coloc_op.device:
- for i in range(len(ret)):
- ret[i] = array_ops.identity(ret[i])
+ return key, self._get_return_value(result, indices)
- for output, shape in zip(ret, self._shapes):
- output.set_shape(shape)
+ def _popitem(self, indices=None, name=None):
+ """
+ If the staging area is ordered,
+ the (key, value) with the smallest key will be returned.
+ Otherwise, a random (key, value) will be returned.
+
+ If the staging area is empty when this operation executes,
+ it will block until there is an element to dequeue.
+
+ Args:
+ indices: Partial list of tensors to retrieve (optional).
+ A list of integer or string indices.
+ String indices are only valid if the Staging Area
+ has names associated with it.
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if name is None:
+ name = "%s_get_nokey" % self._name
+
+ indices, dtypes = self._get_indices_and_dtypes(indices)
+
+ with ops.colocate_with(self._coloc_op):
+ key, result = self._popitem_fn(shared_name=self._name,
+ indices=indices,
+ dtypes=dtypes,
+ name=name,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+ # Separate keys and results out from
+ # underlying namedtuple
+ key = self._create_device_transfers(key)[0]
+ result = self._get_return_value(result, indices)
+
+ return key, result
+
+ def size(self, name=None):
+ """
+ Returns the number of elements in the staging area.
+
+ Args:
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if name is None:
+ name = "%s_size" % self._name
+
+ return self._size_fn(shared_name=self._name,
+ name=name, dtypes=self._dtypes,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+ def incomplete_size(self, name=None):
+ """
+ Returns the number of incomplete elements in the staging area.
+
+ Args:
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if name is None:
+ name = "%s_incomplete_size" % self._name
+
+ return self._incomplete_size_fn(shared_name=self._name,
+ name=name, dtypes=self._dtypes,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
+
+
+
+ def clear(self, name=None):
+ """
+ Clears the staging area.
+
+ Args:
+ name: A name for the operation (optional)
+
+ Returns:
+ The created op
+ """
+ if name is None:
+ name = "%s_clear" % self._name
- return self._get_return_value(ret)
+ return self._clear_fn(shared_name=self._name,
+ name=name, dtypes=self._dtypes,
+ capacity=self._capacity,
+ memory_limit=self._memory_limit)
class RecordInput(object):
diff --git a/tensorflow/python/ops/distributions/transformed_distribution.py b/tensorflow/python/ops/distributions/transformed_distribution.py
index 09b26a9fb7..1be3819569 100644
--- a/tensorflow/python/ops/distributions/transformed_distribution.py
+++ b/tensorflow/python/ops/distributions/transformed_distribution.py
@@ -339,7 +339,7 @@ class TransformedDistribution(distribution_lib.Distribution):
self.distribution.event_shape_tensor()))
def _event_shape(self):
- # If there's a chance that the event_shape has been overriden, we return
+ # If there's a chance that the event_shape has been overridden, we return
# what we statically know about the `event_shape_override`. This works
# because: `_is_maybe_event_override` means `static_override` is `None` or a
# non-empty list, i.e., we don't statically know the `event_shape` or we do.
@@ -360,7 +360,7 @@ class TransformedDistribution(distribution_lib.Distribution):
self.distribution.batch_shape_tensor())
def _batch_shape(self):
- # If there's a chance that the batch_shape has been overriden, we return
+ # If there's a chance that the batch_shape has been overridden, we return
# what we statically know about the `batch_shape_override`. This works
# because: `_is_maybe_batch_override` means `static_override` is `None` or a
# non-empty list, i.e., we don't statically know the `batch_shape` or we do.
diff --git a/tensorflow/python/ops/embedding_ops.py b/tensorflow/python/ops/embedding_ops.py
index 6930f9af05..4c94f9e9b5 100644
--- a/tensorflow/python/ops/embedding_ops.py
+++ b/tensorflow/python/ops/embedding_ops.py
@@ -97,7 +97,7 @@ def embedding_lookup(params, ids, partition_strategy="mod", name=None,
Raises:
ValueError: If `params` is empty.
"""
- if params in (None, (), []):
+ if params is None or params in ((), []):
raise ValueError("Need at least one param")
if isinstance(params, variables.PartitionedVariable):
params = list(params) # Iterate to get the underlying Variables.
diff --git a/tensorflow/python/ops/hidden_ops.txt b/tensorflow/python/ops/hidden_ops.txt
index 06adfc5066..553e0dc135 100644
--- a/tensorflow/python/ops/hidden_ops.txt
+++ b/tensorflow/python/ops/hidden_ops.txt
@@ -191,6 +191,7 @@ WholeFileReader
TextLineReaderV2
TFRecordReaderV2
WholeFileReaderV2
+LMDBReader
# linalg_ops
BatchCholesky
diff --git a/tensorflow/python/ops/image_ops.py b/tensorflow/python/ops/image_ops.py
index 75c67dcb3c..51d0276140 100644
--- a/tensorflow/python/ops/image_ops.py
+++ b/tensorflow/python/ops/image_ops.py
@@ -60,6 +60,7 @@ See the @{$python/image} guide.
@@per_image_standardization
@@draw_bounding_boxes
@@non_max_suppression
+@@non_max_suppression_v2
@@sample_distorted_bounding_box
@@total_variation
"""
diff --git a/tensorflow/python/ops/image_ops_impl.py b/tensorflow/python/ops/image_ops_impl.py
index b16c1863dd..65a1399c5b 100644
--- a/tensorflow/python/ops/image_ops_impl.py
+++ b/tensorflow/python/ops/image_ops_impl.py
@@ -52,6 +52,7 @@ ops.NotDifferentiable('SampleDistortedBoundingBox')
# latent bugs here.
ops.NotDifferentiable('ExtractGlimpse')
ops.NotDifferentiable('NonMaxSuppression')
+ops.NotDifferentiable('NonMaxSuppressionV2')
def _assert(cond, ex_type, msg):
@@ -281,7 +282,7 @@ def flip_left_right(image):
def flip_up_down(image):
- """Flip an image horizontally (upside down).
+ """Flip an image vertically (upside down).
Outputs the contents of `image` flipped along the first dimension, which is
`height`.
diff --git a/tensorflow/python/ops/image_ops_test.py b/tensorflow/python/ops/image_ops_test.py
index 492dbe6d13..5588d18ef1 100644
--- a/tensorflow/python/ops/image_ops_test.py
+++ b/tensorflow/python/ops/image_ops_test.py
@@ -1449,7 +1449,7 @@ class PadToBoundingBoxTest(test_util.TensorFlowTestCase):
use_tensor_inputs_options=[False])
# The orignal error message does not contain back slashes. However, they
- # are added by either the assert op or the runtime. If this behaviour
+ # are added by either the assert op or the runtime. If this behavior
# changes in the future, the match string will also needs to be changed.
self._assertRaises(
x,
@@ -2281,7 +2281,7 @@ class ResizeImageWithCropOrPadTest(test_util.TensorFlowTestCase):
use_tensor_inputs_options=[False])
# The orignal error message does not contain back slashes. However, they
- # are added by either the assert op or the runtime. If this behaviour
+ # are added by either the assert op or the runtime. If this behavior
# changes in the future, the match string will also needs to be changed.
self._assertRaises(
x,
diff --git a/tensorflow/python/ops/io_ops.py b/tensorflow/python/ops/io_ops.py
index 68ecc219e4..0b1a77969a 100644
--- a/tensorflow/python/ops/io_ops.py
+++ b/tensorflow/python/ops/io_ops.py
@@ -443,6 +443,25 @@ class TFRecordReader(ReaderBase):
ops.NotDifferentiable("TFRecordReader")
+class LMDBReader(ReaderBase):
+ """A Reader that outputs the records from a LMDB file.
+
+ See ReaderBase for supported methods.
+ """
+ def __init__(self, name=None, options=None):
+ """Create a LMDBReader.
+
+ Args:
+ name: A name for the operation (optional).
+ options: A LMDBRecordOptions object (optional).
+ """
+ rr = gen_io_ops._lmdb_reader(name=name)
+ super(LMDBReader, self).__init__(rr)
+
+
+ops.NotDifferentiable("LMDBReader")
+
+
class IdentityReader(ReaderBase):
"""A Reader that outputs the queued work as both the key and value.
diff --git a/tensorflow/python/ops/math_ops.py b/tensorflow/python/ops/math_ops.py
index 3b7332e863..89b7746e71 100644
--- a/tensorflow/python/ops/math_ops.py
+++ b/tensorflow/python/ops/math_ops.py
@@ -208,19 +208,25 @@ argmin.__doc__ = (gen_math_ops.arg_min.__doc__.replace("dimensions",
def abs(x, name=None):
r"""Computes the absolute value of a tensor.
- Given a tensor of real numbers `x`, this operation returns a tensor
- containing the absolute value of each element in `x`. For example, if x is
- an input element and y is an output element, this operation computes
- \\(y = |x|\\).
+ Given a tensor `x` of complex numbers, this operation returns a tensor of type
+ `float32` or `float64` that is the absolute value of each element in `x`. All
+ elements in `x` must be complex numbers of the form \\(a + bj\\). The
+ absolute value is computed as \\( \sqrt{a^2 + b^2}\\). For example:
+ ```
+ # tensor 'x' is [[-2.25 + 4.75j], [-3.25 + 5.75j]]
+ tf.complex_abs(x) ==> [5.25594902, 6.60492229]
+ ```
Args:
- x: A `Tensor` or `SparseTensor` of type `float32`, `float64`, `int32`, or
- `int64`.
+ x: A `Tensor` or `SparseTensor` of type `float32`, `float64`, `int32`,
+ `int64`, `complex64` or `complex128`.
name: A name for the operation (optional).
Returns:
A `Tensor` or `SparseTensor` the same size and type as `x` with absolute
values.
+ Note, for `complex64` or `complex128' input, the returned `Tensor` will be
+ of type `float32` or `float64`, respectively.
"""
with ops.name_scope(name, "Abs", [x]) as name:
if isinstance(x, sparse_tensor.SparseTensor):
@@ -386,7 +392,7 @@ def sign(x, name=None):
A `Tensor` or `SparseTensor`, respectively. Has the same type as `x`.
@compatibility(numpy)
- Equivalent to numpy.sign except for the behaviour for input values of NaN.
+ Equivalent to numpy.sign except for the behavior for input values of NaN.
@end_compatibility
"""
with ops.name_scope(name, "Sign", [x]) as name:
@@ -1675,8 +1681,9 @@ def matmul(a,
name=None):
"""Multiplies matrix `a` by matrix `b`, producing `a` * `b`.
- The inputs must be matrices (or tensors of rank > 2, representing batches of
- matrices), with matching inner dimensions, possibly after transposition.
+ The inputs must, following any transpositions, be tensors of rank >= 2
+ where the inner 2 dimensions specify valid matrix multiplication arguments,
+ and any further outer dimensions match.
Both matrices must be of the same type. The supported types are:
`float16`, `float32`, `float64`, `int32`, `complex64`, `complex128`.
diff --git a/tensorflow/python/ops/math_ops_test.py b/tensorflow/python/ops/math_ops_test.py
index a9089d461f..9683603785 100644
--- a/tensorflow/python/ops/math_ops_test.py
+++ b/tensorflow/python/ops/math_ops_test.py
@@ -424,9 +424,9 @@ class DivAndModTest(test_util.TensorFlowTestCase):
tf_divs = array_ops.constant(divs)
tf2_result = (tf_nums // tf_divs * tf_divs + tf_nums % tf_divs).eval()
np_result = (nums // divs) * divs + (nums % divs)
- # consistentcy with numpy
+ # Consistent with numpy
self.assertAllEqual(tf_result, np_result)
- # consistentcy with two forms of divide
+ # Consistent with two forms of divide
self.assertAllEqual(tf_result, tf2_result)
# consistency for truncation form
tf3_result = (math_ops.truncatediv(nums, divs) * divs +
diff --git a/tensorflow/python/ops/rnn_cell_impl.py b/tensorflow/python/ops/rnn_cell_impl.py
index 500e3b7859..49a4aba473 100644
--- a/tensorflow/python/ops/rnn_cell_impl.py
+++ b/tensorflow/python/ops/rnn_cell_impl.py
@@ -233,7 +233,7 @@ class BasicRNNCell(RNNCell):
"""The most basic RNN cell.
Args:
- num_units: int, The number of units in the LSTM cell.
+ num_units: int, The number of units in the RNN cell.
activation: Nonlinearity to use. Default: `tanh`.
reuse: (optional) Python boolean describing whether to reuse variables
in an existing scope. If not `True`, and the existing scope already has
diff --git a/tensorflow/python/ops/script_ops.py b/tensorflow/python/ops/script_ops.py
index ebe1f5c0a4..fe532fa186 100644
--- a/tensorflow/python/ops/script_ops.py
+++ b/tensorflow/python/ops/script_ops.py
@@ -13,7 +13,7 @@
# limitations under the License.
# ==============================================================================
-"""Script Language Operators. See the @{$python/script_ops} guide.
+"""Script Language Operators. See the @{python/script_ops} guide.
@@py_func
"""
diff --git a/tensorflow/python/ops/session_ops.py b/tensorflow/python/ops/session_ops.py
index e74c52b8cf..de43b562f9 100644
--- a/tensorflow/python/ops/session_ops.py
+++ b/tensorflow/python/ops/session_ops.py
@@ -13,7 +13,7 @@
# limitations under the License.
# ==============================================================================
-"""Tensor Handle Operations. See the @{$python/session_ops} guide.
+"""Tensor Handle Operations. See the @{python/session_ops} guide.
@@get_session_handle
@@get_session_handle_v2
diff --git a/tensorflow/python/ops/sparse_ops.py b/tensorflow/python/ops/sparse_ops.py
index b196ed05b7..7079922736 100644
--- a/tensorflow/python/ops/sparse_ops.py
+++ b/tensorflow/python/ops/sparse_ops.py
@@ -14,7 +14,7 @@
# ==============================================================================
# pylint: disable=g-short-docstring-punctuation
-"""Sparse Tensor Representation. See the @{$python/sparse_ops} guide.
+"""Sparse Tensor Representation. See the @{python/sparse_ops} guide.
@@SparseTensor
@@SparseTensorValue
@@ -1478,7 +1478,7 @@ def sparse_tensor_dense_matmul(sp_a,
`sp_a.dense_shape` takes on large values.
Below is a rough speed comparison between `sparse_tensor_dense_matmul`,
- labelled 'sparse', and `matmul`(a_is_sparse=True), labelled 'dense'. For
+ labeled 'sparse', and `matmul`(a_is_sparse=True), labeled 'dense'. For
purposes of the comparison, the time spent converting from a `SparseTensor` to
a dense `Tensor` is not included, so it is overly conservative with respect to
the time ratio.
diff --git a/tensorflow/python/ops/special_math_ops.py b/tensorflow/python/ops/special_math_ops.py
index 851fba0beb..b561203bb4 100644
--- a/tensorflow/python/ops/special_math_ops.py
+++ b/tensorflow/python/ops/special_math_ops.py
@@ -424,7 +424,7 @@ def _exponential_space_einsum(equation, *inputs):
missing_idx = set(idx_out).difference(idx_all)
if missing_idx:
raise ValueError(
- 'Unknown ouput axes: %s' % missing_idx
+ 'Unknown output axes: %s' % missing_idx
)
axis_order = {}
diff --git a/tensorflow/python/ops/state_ops.py b/tensorflow/python/ops/state_ops.py
index dbc637975d..63394d5214 100644
--- a/tensorflow/python/ops/state_ops.py
+++ b/tensorflow/python/ops/state_ops.py
@@ -13,7 +13,7 @@
# limitations under the License.
# ==============================================================================
-"""Variables. See the @{$python/state_ops} guide.
+"""Variables. See the @{python/state_ops} guide.
@@Variable
@@global_variables
diff --git a/tensorflow/python/ops/variable_scope.py b/tensorflow/python/ops/variable_scope.py
index a29ddfa9f2..aceffd373a 100644
--- a/tensorflow/python/ops/variable_scope.py
+++ b/tensorflow/python/ops/variable_scope.py
@@ -282,7 +282,7 @@ class _VariableStore(object):
# If a *_ref type is passed in an error would be triggered further down the
# stack. We prevent this using base_dtype to get a non-ref version of the
- # type, before doing anything else. When _ref types are removed in favour of
+ # type, before doing anything else. When _ref types are removed in favor of
# resources, this line can be removed.
try:
dtype = dtype.base_dtype
diff --git a/tensorflow/python/ops/variables.py b/tensorflow/python/ops/variables.py
index 1797460a6d..5968f2684b 100644
--- a/tensorflow/python/ops/variables.py
+++ b/tensorflow/python/ops/variables.py
@@ -1196,7 +1196,7 @@ def initialize_variables(var_list, name="init"):
def global_variables_initializer():
"""Returns an Op that initializes global variables.
- This is just a shortcut for `variable_initializer(global_variables())`
+ This is just a shortcut for `variables_initializer(global_variables())`
Returns:
An Op that initializes global variables in the graph.
@@ -1214,7 +1214,7 @@ def initialize_all_variables():
def local_variables_initializer():
"""Returns an Op that initializes all local variables.
- This is just a shortcut for `variable_initializer(local_variables())`
+ This is just a shortcut for `variables_initializer(local_variables())`
Returns:
An Op that initializes all local variables in the graph.
diff --git a/tensorflow/python/saved_model/BUILD b/tensorflow/python/saved_model/BUILD
index 8301a73e87..775232c19f 100644
--- a/tensorflow/python/saved_model/BUILD
+++ b/tensorflow/python/saved_model/BUILD
@@ -105,6 +105,7 @@ py_test(
srcs = ["saved_model_test.py"],
data = ["//tensorflow/cc/saved_model:saved_model_half_plus_two"],
srcs_version = "PY2AND3",
+ tags = ["no_windows"],
visibility = ["//visibility:private"],
deps = [
":builder",
diff --git a/tensorflow/python/saved_model/README.md b/tensorflow/python/saved_model/README.md
index f19127ecd5..38203da5b6 100644
--- a/tensorflow/python/saved_model/README.md
+++ b/tensorflow/python/saved_model/README.md
@@ -102,7 +102,7 @@ The typical usage of `builder` is as follows:
~~~python
export_dir = ...
...
-builder = tf.saved_model_builder.SavedModelBuilder(export_dir)
+builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
with tf.Session(graph=tf.Graph()) as sess:
...
builder.add_meta_graph_and_variables(sess,
diff --git a/tensorflow/python/tools/BUILD b/tensorflow/python/tools/BUILD
index 48b84f9a96..1f9389a5e7 100644
--- a/tensorflow/python/tools/BUILD
+++ b/tensorflow/python/tools/BUILD
@@ -216,6 +216,7 @@ py_test(
"//tensorflow/cc/saved_model:saved_model_half_plus_two",
],
srcs_version = "PY2AND3",
+ tags = ["manual"],
deps = [
":saved_model_cli",
],
diff --git a/tensorflow/python/tools/import_pb_to_tensorboard.py b/tensorflow/python/tools/import_pb_to_tensorboard.py
index caeb04a24b..2bb055e978 100644
--- a/tensorflow/python/tools/import_pb_to_tensorboard.py
+++ b/tensorflow/python/tools/import_pb_to_tensorboard.py
@@ -31,7 +31,7 @@ def import_to_tensorboard(model_dir, log_dir):
Args:
model_dir: The location of the protobuf (`pb`) model to visualize
- log_dir: The location for the Tensorboard log to begin visualisation from.
+ log_dir: The location for the Tensorboard log to begin visualization from.
Usage:
Call this function with your model location and desired log directory.
diff --git a/tensorflow/python/tools/print_selective_registration_header.py b/tensorflow/python/tools/print_selective_registration_header.py
index 5da57241ee..3e2ab4695e 100644
--- a/tensorflow/python/tools/print_selective_registration_header.py
+++ b/tensorflow/python/tools/print_selective_registration_header.py
@@ -14,13 +14,21 @@
# ==============================================================================
r"""Prints a header file to be used with SELECTIVE_REGISTRATION.
-Example usage:
- print_selective_registration_header \
- --graphs=path/to/graph.pb > ops_to_register.h
+An example of command-line usage is:
+ bazel build tensorflow/python/tools:print_selective_registration_header && \
+ bazel-bin/tensorflow/python/tools:print_selective_registration_header \
+ --graphs=path/to/graph.pb > ops_to_register.h
- Then when compiling tensorflow, include ops_to_register.h in the include
- search path and pass -DSELECTIVE_REGISTRATION - see
- core/framework/selective_registration.h for more details.
+Then when compiling tensorflow, include ops_to_register.h in the include search
+path and pass -DSELECTIVE_REGISTRATION and -DSUPPORT_SELECTIVE_REGISTRATION
+ - see core/framework/selective_registration.h for more details.
+
+When compiling for Android:
+ bazel build -c opt --copt="-DSELECTIVE_REGISTRATION" \
+ --copt="-DSUPPORT_SELECTIVE_REGISTRATION" \
+ //tensorflow/contrib/android:libtensorflow_inference.so \
+ --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
+ --config=android_arm
"""
from __future__ import absolute_import
diff --git a/tensorflow/python/training/basic_session_run_hooks.py b/tensorflow/python/training/basic_session_run_hooks.py
index 1084acff65..bbd2e2e5ab 100644
--- a/tensorflow/python/training/basic_session_run_hooks.py
+++ b/tensorflow/python/training/basic_session_run_hooks.py
@@ -120,7 +120,12 @@ class SecondOrStepTimer(object):
class LoggingTensorHook(session_run_hook.SessionRunHook):
"""Prints the given tensors once every N local steps or once every N seconds.
- The tensors will be printed to the log, with `INFO` severity.
+ The tensors will be printed to the log, with `INFO` severity. If you are not
+ seeing the logs, you might want to add the following line after your imports:
+
+ ```python
+ tf.logging.set_verbosity(tf.logging.INFO)
+ ```
"""
def __init__(self, tensors, every_n_iter=None, every_n_secs=None,
diff --git a/tensorflow/python/training/coordinator.py b/tensorflow/python/training/coordinator.py
index d234df71c1..23e8638764 100644
--- a/tensorflow/python/training/coordinator.py
+++ b/tensorflow/python/training/coordinator.py
@@ -62,7 +62,7 @@ class Coordinator(object):
#### Exception handling:
A thread can report an exception to the coordinator as part of the
- `should_stop()` call. The exception will be re-raised from the
+ `request_stop()` call. The exception will be re-raised from the
`coord.join()` call.
Thread code:
diff --git a/tensorflow/python/training/evaluation.py b/tensorflow/python/training/evaluation.py
index 7c46591d07..bbaa3931c2 100644
--- a/tensorflow/python/training/evaluation.py
+++ b/tensorflow/python/training/evaluation.py
@@ -113,7 +113,7 @@ def _evaluate_once(checkpoint_path,
One may also consider using a `tf.contrib.training.SummaryAtEndHook` to record
summaries after the `eval_ops` have run. If `eval_ops` is `None`, the
- summaries run immedietly after the model checkpoint has been restored.
+ summaries run immediately after the model checkpoint has been restored.
Note that `evaluate_once` creates a local variable used to track the number of
evaluations run via `tf.contrib.training.get_or_create_eval_step`.
diff --git a/tensorflow/python/training/input.py b/tensorflow/python/training/input.py
index e9fe9215ae..1755167938 100644
--- a/tensorflow/python/training/input.py
+++ b/tensorflow/python/training/input.py
@@ -1085,7 +1085,7 @@ def maybe_batch_join(tensors_list, keep_input, batch_size, capacity=32,
added to the queue or not. If it is a scalar and evaluates `True`, then
`tensors` are all added to the queue. If it is a vector and `enqueue_many`
is `True`, then each example is added to the queue only if the
- corresonding value in `keep_input` is `True`. This tensor essentially acts
+ corresponding value in `keep_input` is `True`. This tensor essentially acts
as a filtering mechanism.
batch_size: An integer. The new batch size pulled from the queue.
capacity: An integer. The maximum number of elements in the queue.
@@ -1236,7 +1236,7 @@ def maybe_shuffle_batch(tensors, batch_size, capacity, min_after_dequeue,
added to the queue or not. If it is a scalar and evaluates `True`, then
`tensors` are all added to the queue. If it is a vector and `enqueue_many`
is `True`, then each example is added to the queue only if the
- corresonding value in `keep_input` is `True`. This tensor essentially acts
+ corresponding value in `keep_input` is `True`. This tensor essentially acts
as a filtering mechanism.
num_threads: The number of threads enqueuing `tensor_list`.
seed: Seed for the random shuffling within the queue.
@@ -1378,7 +1378,7 @@ def maybe_shuffle_batch_join(tensors_list, batch_size, capacity,
added to the queue or not. If it is a scalar and evaluates `True`, then
`tensors` are all added to the queue. If it is a vector and `enqueue_many`
is `True`, then each example is added to the queue only if the
- corresonding value in `keep_input` is `True`. This tensor essentially acts
+ corresponding value in `keep_input` is `True`. This tensor essentially acts
as a filtering mechanism.
seed: Seed for the random shuffling within the queue.
enqueue_many: Whether each tensor in `tensor_list_list` is a single
diff --git a/tensorflow/python/training/learning_rate_decay.py b/tensorflow/python/training/learning_rate_decay.py
index dbdde81726..6d7d5940fb 100644
--- a/tensorflow/python/training/learning_rate_decay.py
+++ b/tensorflow/python/training/learning_rate_decay.py
@@ -138,17 +138,17 @@ def piecewise_constant(x, boundaries, values, name=None):
# comparisons, for example if floats are converted to integers.
boundaries = ops.convert_n_to_tensor(boundaries)
for b in boundaries:
- if b.dtype != x.dtype:
+ if b.dtype.base_dtype != x.dtype.base_dtype:
raise ValueError(
"Boundaries (%s) must have the same dtype as x (%s)." % (
- b.dtype, x.dtype))
+ b.dtype.base_dtype, x.dtype.base_dtype))
# TODO(rdipietro): Ensure that boundaries' elements are strictly increasing.
values = ops.convert_n_to_tensor(values)
for v in values[1:]:
- if v.dtype != values[0].dtype:
+ if v.dtype.base_dtype != values[0].dtype.base_dtype:
raise ValueError(
"Values must have elements all with the same dtype (%s vs %s)." % (
- values[0].dtype, v.dtype))
+ values[0].dtype.base_dtype, v.dtype.base_dtype))
pred_fn_pairs = {}
pred_fn_pairs[x <= boundaries[0]] = lambda: values[0]
diff --git a/tensorflow/python/training/learning_rate_decay_test.py b/tensorflow/python/training/learning_rate_decay_test.py
index 8232882822..177a2356e4 100644
--- a/tensorflow/python/training/learning_rate_decay_test.py
+++ b/tensorflow/python/training/learning_rate_decay_test.py
@@ -113,6 +113,11 @@ class LRDecayTest(test_util.TensorFlowTestCase):
with self.assertRaises(ValueError):
learning_rate_decay.piecewise_constant(x, boundaries, values)
+ # Test that ref types are valid.
+ x_ref = x.op.outputs[0] # float32_ref tensor should be accepted
+ boundaries, values = [1.0, 2.0], [1, 2, 3]
+ learning_rate_decay.piecewise_constant(x_ref, boundaries, values)
+
class LinearDecayTest(test_util.TensorFlowTestCase):
diff --git a/tensorflow/python/training/moving_averages.py b/tensorflow/python/training/moving_averages.py
index eff765c387..b31027ca3c 100644
--- a/tensorflow/python/training/moving_averages.py
+++ b/tensorflow/python/training/moving_averages.py
@@ -51,7 +51,7 @@ def assign_moving_average(variable, value, decay, zero_debias=True, name=None):
variable: A Variable.
value: A tensor with the same shape as 'variable'.
decay: A float Tensor or float value. The moving average decay.
- zero_debias: A python bool. If true, assume the variable is 0-intialized and
+ zero_debias: A python bool. If true, assume the variable is 0-initialized and
unbias it, as in https://arxiv.org/abs/1412.6980. See docstring in
`_zero_debias` for more details.
name: Optional name of the returned operation.
diff --git a/tensorflow/python/util/lazy_loader.py b/tensorflow/python/util/lazy_loader.py
index 34308ff931..6d2622b1c0 100644
--- a/tensorflow/python/util/lazy_loader.py
+++ b/tensorflow/python/util/lazy_loader.py
@@ -24,7 +24,7 @@ import types
class LazyLoader(types.ModuleType):
- """Lazily import a module, mainly to avoid pulling in large dependancies.
+ """Lazily import a module, mainly to avoid pulling in large dependencies.
`contrib`, and `ffmpeg` are examples of modules that are large and not always
needed, and this allows them to only be loaded when they are used.
diff --git a/tensorflow/stream_executor/cuda/cuda_diagnostics.h b/tensorflow/stream_executor/cuda/cuda_diagnostics.h
index 5cce6b9365..aa68321acc 100644
--- a/tensorflow/stream_executor/cuda/cuda_diagnostics.h
+++ b/tensorflow/stream_executor/cuda/cuda_diagnostics.h
@@ -75,7 +75,7 @@ class Diagnostician {
// Given the DSO version number and the driver version file contents, extracts
// the driver version and compares, warning the user in the case of
- // incompatability.
+ // incompatibility.
//
// This is solely used for more informative log messages when the user is
// running on a machine that happens to have a libcuda/kernel driver mismatch.
diff --git a/tensorflow/stream_executor/cuda/cuda_driver.h b/tensorflow/stream_executor/cuda/cuda_driver.h
index c5d7d8b32f..68494aba65 100644
--- a/tensorflow/stream_executor/cuda/cuda_driver.h
+++ b/tensorflow/stream_executor/cuda/cuda_driver.h
@@ -77,7 +77,7 @@ class CUDADriver {
// Destroys a CUDA stream associated with the given context.
// stream is owned by the caller, must not be null, and *stream is set to null
- // if the stream is successfuly destroyed.
+ // if the stream is successfully destroyed.
// http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g244c8833de4596bcd31a06cdf21ee758
static void DestroyStream(CudaContext* context, CUstream *stream);
diff --git a/tensorflow/stream_executor/cuda/cuda_event.h b/tensorflow/stream_executor/cuda/cuda_event.h
index 46f0232b1d..56667e65d3 100644
--- a/tensorflow/stream_executor/cuda/cuda_event.h
+++ b/tensorflow/stream_executor/cuda/cuda_event.h
@@ -46,7 +46,7 @@ class CUDAEvent : public internal::EventInterface {
// Polls the CUDA platform for the event's current status.
Event::Status PollForStatus();
- // The underyling CUDA event element.
+ // The underlying CUDA event element.
const CUevent& cuda_event();
private:
diff --git a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
index c1e72bb565..43c707730a 100644
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -847,7 +847,7 @@ void *CUDAExecutor::CudaContextHack() { return context_; }
CudaContext* CUDAExecutor::cuda_context() { return context_; }
-// Attemps to read the NUMA node corresponding to the GPU device's PCI bus out
+// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot.
//
// For anything more complicated/prod-focused than this, you'll likely want to
diff --git a/tensorflow/stream_executor/lib/statusor.h b/tensorflow/stream_executor/lib/statusor.h
index bb423e390a..e06550009a 100644
--- a/tensorflow/stream_executor/lib/statusor.h
+++ b/tensorflow/stream_executor/lib/statusor.h
@@ -143,7 +143,7 @@ class StatusOr {
: status_(std::move(other.status_)),
value_(std::move(other.value_)) {}
- // Move assignment opeartor to avoid unnecessary copy.
+ // Move assignment operator to avoid unnecessary copy.
// T must be assignable from U
template <typename U>
StatusOr& operator=(StatusOr<U>&& other) {
diff --git a/tensorflow/stream_executor/plugin.h b/tensorflow/stream_executor/plugin.h
index b1db8b7cb8..0b88b86e2b 100644
--- a/tensorflow/stream_executor/plugin.h
+++ b/tensorflow/stream_executor/plugin.h
@@ -49,7 +49,7 @@ enum class PluginKind {
//
// A PluginConfig may be passed to the StreamExecutor constructor - the plugins
// described therein will be used to provide BLAS, DNN, FFT, and RNG
-// functionality. Platform-approprate defaults will be used for any un-set
+// functionality. Platform-appropriate defaults will be used for any un-set
// libraries. If a platform does not support a specified plugin (ex. cuBLAS on
// an OpenCL executor), then an error will be logged and no plugin operations
// will succeed.
diff --git a/tensorflow/stream_executor/stream_executor_pimpl.h b/tensorflow/stream_executor/stream_executor_pimpl.h
index 3dbeddd5d4..9814f1b960 100644
--- a/tensorflow/stream_executor/stream_executor_pimpl.h
+++ b/tensorflow/stream_executor/stream_executor_pimpl.h
@@ -205,7 +205,7 @@ class StreamExecutor {
// This should be done before deallocating the region with delete[]/free/etc.
bool HostMemoryUnregister(void *location) SE_MUST_USE_RESULT;
- // Synchronizes all activity occuring in the StreamExecutor's context (most
+ // Synchronizes all activity occurring in the StreamExecutor's context (most
// likely a whole device).
bool SynchronizeAllActivity() SE_MUST_USE_RESULT;
@@ -238,7 +238,7 @@ class StreamExecutor {
DeviceMemoryBase *gpu_dst);
// Alternative interface for memcpying from host to device that takes an
- // array slice. Checks that the destination size can accomodate the host
+ // array slice. Checks that the destination size can accommodate the host
// slice size.
template <class T>
port::Status SynchronousMemcpyH2D(port::ArraySlice<T> host_src,
@@ -253,7 +253,7 @@ class StreamExecutor {
void *host_dst);
// Alternative interface for memcpying from device to host that takes an
- // array slice. Checks that the destination size can accomodate the host
+ // array slice. Checks that the destination size can accommodate the host
// slice size.
template <typename T>
port::Status SynchronousMemcpyD2H(const DeviceMemory<T> &gpu_src,
diff --git a/tensorflow/tensorboard/README.md b/tensorflow/tensorboard/README.md
index 5aff57a241..a9ab4d3bd2 100644
--- a/tensorflow/tensorboard/README.md
+++ b/tensorflow/tensorboard/README.md
@@ -55,7 +55,7 @@ work, but there may be bugs or performance issues.
The first step in using TensorBoard is acquiring data from your TensorFlow run.
For this, you need [summary ops](https://www.tensorflow.org/api_docs/python/tf/summary).
Summary ops are ops, like
-[`tf.matmul`](https://www.tensorflow.org/versions/r1.1/api_docs/python/tf/matmul)
+[`tf.matmul`](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/matmul)
or
[`tf.nn.relu`](https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/relu),
which means they take in tensors, produce tensors, and are evaluated from within
diff --git a/tensorflow/tensorboard/components/tf_backend/behavior.ts b/tensorflow/tensorboard/components/tf_backend/behavior.ts
index 7ff9df8cf4..8df791efac 100644
--- a/tensorflow/tensorboard/components/tf_backend/behavior.ts
+++ b/tensorflow/tensorboard/components/tf_backend/behavior.ts
@@ -87,7 +87,7 @@ export const BackendBehavior = {
* Backend reload, which gets metadata on available runs, tags, etc from
* the backend.
* Frontend reload, which loads new data for each chart or visual display.
- * Backend reload logic is provided by this behaivor. The frontend reload
+ * Backend reload logic is provided by this behavior. The frontend reload
* logic should be provided elsewhere, since it is component-specific.
* To keep things simple and consistent, we do the backend reload first,
* and the frontend reload afterwards.
diff --git a/tensorflow/tensorboard/components/tf_backend/requestManager.ts b/tensorflow/tensorboard/components/tf_backend/requestManager.ts
index c943268cec..0fa198416e 100644
--- a/tensorflow/tensorboard/components/tf_backend/requestManager.ts
+++ b/tensorflow/tensorboard/components/tf_backend/requestManager.ts
@@ -76,7 +76,7 @@ export class RequestManager {
.then(
(response) => {
// Success - Let's free space for another active
- // reqest, and launch it
+ // request, and launch it
this._nActiveRequests--;
this.launchRequests();
return response;
diff --git a/tensorflow/tensorboard/components/tf_backend/test/requestManagerTests.ts b/tensorflow/tensorboard/components/tf_backend/test/requestManagerTests.ts
index 23a4e8f611..3800e6e402 100644
--- a/tensorflow/tensorboard/components/tf_backend/test/requestManagerTests.ts
+++ b/tensorflow/tensorboard/components/tf_backend/test/requestManagerTests.ts
@@ -151,7 +151,7 @@ describe('backend', () => {
});
r.then(
- (success) => done(new Error('The reqest should have failed')),
+ (success) => done(new Error('The request should have failed')),
(failure) => done());
});
diff --git a/tensorflow/tensorboard/components/tf_dashboard_common/tf-chart-scaffold.html b/tensorflow/tensorboard/components/tf_dashboard_common/tf-chart-scaffold.html
index 9cacb7f5c8..a39fb9462b 100644
--- a/tensorflow/tensorboard/components/tf_dashboard_common/tf-chart-scaffold.html
+++ b/tensorflow/tensorboard/components/tf_dashboard_common/tf-chart-scaffold.html
@@ -32,7 +32,7 @@ chart() - Returns the underlying chart element.
reload() - Reloads the data and sends it to the underlying chart.
This element should have a compatible chart plugin element as it's content. The
-plugin is requred to implement two functions:
+plugin is required to implement two functions:
- setVisibleSeries(names: string[]): a function that receives an array of series
names as the first parameter, responsible for changing the series currently
being displayed to only the series in this array.
diff --git a/tensorflow/tensorboard/components/tf_graph/tf-graph-scene.html b/tensorflow/tensorboard/components/tf_graph/tf-graph-scene.html
index 5ca5829f2b..fb2bc13f9a 100644
--- a/tensorflow/tensorboard/components/tf_graph/tf-graph-scene.html
+++ b/tensorflow/tensorboard/components/tf_graph/tf-graph-scene.html
@@ -328,7 +328,7 @@ limitations under the License.
/* --- Annotation --- */
/* only applied for annotations that are not summary or constant.
-(.summary, .constant gets overriden below) */
+(.summary, .constant gets overridden below) */
::content .annotation > .annotation-node > * {
stroke-width: 0.5;
stroke-dasharray: 1, 1;
diff --git a/tensorflow/tensorboard/components/tf_graph_common/util.ts b/tensorflow/tensorboard/components/tf_graph_common/util.ts
index bee40ae713..0b2df6545c 100644
--- a/tensorflow/tensorboard/components/tf_graph_common/util.ts
+++ b/tensorflow/tensorboard/components/tf_graph_common/util.ts
@@ -68,7 +68,7 @@ module tf.graph.util {
* progress
* of the subtask and the subtask message. The parent task should pass a
* subtracker to its subtasks. The subtask reports its own progress which
- * becames relative to the main task.
+ * becomes relative to the main task.
*/
export function getSubtaskTracker(
parentTracker: ProgressTracker, impactOnTotalProgress: number,
diff --git a/tensorflow/tensorboard/components/tf_imports/BUILD b/tensorflow/tensorboard/components/tf_imports/BUILD
index 7014643b03..5db9fddc3c 100644
--- a/tensorflow/tensorboard/components/tf_imports/BUILD
+++ b/tensorflow/tensorboard/components/tf_imports/BUILD
@@ -9,6 +9,7 @@ ts_web_library(
name = "webcomponentsjs",
srcs = ["@org_definitelytyped//:webcomponents.js.d.ts"],
path = "/webcomponentsjs",
+ visibility = ["//visibility:public"],
exports = ["@org_polymer_webcomponentsjs"],
)
@@ -16,6 +17,7 @@ ts_web_library(
name = "polymer",
srcs = ["@org_definitelytyped//:polymer.d.ts"],
path = "/polymer",
+ visibility = ["//visibility:public"],
exports = ["@org_polymer"],
deps = [":webcomponentsjs"],
)
@@ -27,6 +29,7 @@ ts_web_library(
"@org_definitelytyped//:lodash.d.ts",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
deps = ["@com_lodash"],
)
@@ -39,6 +42,7 @@ ts_web_library(
"@org_threejs//:three.js",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
)
ts_web_library(
@@ -48,6 +52,7 @@ ts_web_library(
"@com_numericjs//:numeric.js",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
)
ts_web_library(
@@ -57,6 +62,7 @@ ts_web_library(
"@io_github_waylonflinn_weblas//:weblas.js",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
)
ts_web_library(
@@ -66,6 +72,7 @@ ts_web_library(
"@io_github_cpettitt_graphlib//:graphlib.core.min.js",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
deps = [":lodash"],
)
@@ -76,6 +83,7 @@ ts_web_library(
"@io_github_cpettitt_dagre//:dagre.core.min.js",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
deps = [
":graphlib",
":lodash",
@@ -90,6 +98,7 @@ ts_web_library(
"@org_d3js//:d3.min.js",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
)
ts_web_library(
@@ -99,6 +108,7 @@ ts_web_library(
"plottable.html",
],
path = "/tf-imports",
+ visibility = ["//visibility:public"],
deps = [
":d3",
":plottable_js_css",
@@ -119,6 +129,7 @@ ts_web_library(
ts_web_library(
name = "web_component_tester",
testonly = 1,
+ visibility = ["//visibility:public"],
exports = [
":chai_typings",
":mocha_typings",
diff --git a/tensorflow/tensorboard/components/tf_storage/storage.ts b/tensorflow/tensorboard/components/tf_storage/storage.ts
index 573df1cb2b..873bc483a0 100644
--- a/tensorflow/tensorboard/components/tf_storage/storage.ts
+++ b/tensorflow/tensorboard/components/tf_storage/storage.ts
@@ -25,7 +25,7 @@ import {getFakeHash, setFakeHash, TABS, useHash} from '../tf-globals/globals';
* which TensorBoard uses after like localhost:8000/#events&runPrefix=train*
* to store state in the URI.
*
- * It also allows saving the values to localStorage for long-term persistance.
+ * It also allows saving the values to localStorage for long-term persistence.
*/
type StringDict = {[key: string]: string};
@@ -38,7 +38,7 @@ export let TAB = '__tab__';
/**
* The name of the property for users to set on a Polymer component
* in order for its stored properties to be stored in the URI unambiguously.
- * (No need to set this if you want mutliple instances of the component to
+ * (No need to set this if you want multiple instances of the component to
* share URI state)
*
* Example:
@@ -258,7 +258,7 @@ function _writeComponent(component: string) {
* Convert dictionary of strings into a URI Component.
* All key value entries get added as key value pairs in the component,
* with the exception of a key with the TAB value, which if present
- * gets prepended to the URI Component string for backwards comptability
+ * gets prepended to the URI Component string for backwards compatibility
* reasons.
*/
function _dictToComponent(items: StringDict): string {
diff --git a/tensorflow/tensorboard/components/vz_projector/scatterPlotVisualizer3DLabels.ts b/tensorflow/tensorboard/components/vz_projector/scatterPlotVisualizer3DLabels.ts
index cbd9785e2f..7820af0d48 100644
--- a/tensorflow/tensorboard/components/vz_projector/scatterPlotVisualizer3DLabels.ts
+++ b/tensorflow/tensorboard/components/vz_projector/scatterPlotVisualizer3DLabels.ts
@@ -38,7 +38,7 @@ const VERTICES_PER_GLYPH = 2 * 3; // 2 triangles, 3 verts per triangle
* bottom center of the word is positioned at (0, 0);
* position: The position of the label in worldspace.
* vUv: The (u, v) coordinates that index into the glyphs sheet (range 0, 1.)
- * color: The color of the label (matches the cooresponding point's color.)
+ * color: The color of the label (matches the corresponding point's color.)
* wordShown: Boolean. Whether or not the label is visible.
*/
diff --git a/tensorflow/tensorboard/components/vz_projector/vz-projector-inspector-panel.html b/tensorflow/tensorboard/components/vz_projector/vz-projector-inspector-panel.html
index 412bcbb480..1b81094776 100644
--- a/tensorflow/tensorboard/components/vz_projector/vz-projector-inspector-panel.html
+++ b/tensorflow/tensorboard/components/vz_projector/vz-projector-inspector-panel.html
@@ -163,7 +163,7 @@ limitations under the License.
margin: 0 -12px 0 10px;
}
-.euclidian {
+.euclidean {
margin-right: 10px;
}
@@ -223,7 +223,7 @@ limitations under the License.
<span class="option-label">distance</span>
<div class="options">
<a class="selected cosine" href="javascript:void(0);">COSINE</a>
- <a class="euclidean" href="javascript:void(0);">EUCLIDIAN</a>
+ <a class="euclidean" href="javascript:void(0);">EUCLIDEAN</a>
</div>
</div>
</div>
diff --git a/tensorflow/tensorboard/scripts/generate_testdata.py b/tensorflow/tensorboard/scripts/generate_testdata.py
index ffc0ea9734..f191d16a82 100644
--- a/tensorflow/tensorboard/scripts/generate_testdata.py
+++ b/tensorflow/tensorboard/scripts/generate_testdata.py
@@ -31,7 +31,7 @@ from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
-tf.flags.DEFINE_string("target", None, """The directoy where serialized data
+tf.flags.DEFINE_string("target", None, """The directory where serialized data
will be written""")
tf.flags.DEFINE_boolean("overwrite", False, """Whether to remove and overwrite
diff --git a/tensorflow/tensorflow.bzl b/tensorflow/tensorflow.bzl
index b0ed57996c..dba8d6de63 100644
--- a/tensorflow/tensorflow.bzl
+++ b/tensorflow/tensorflow.bzl
@@ -112,6 +112,7 @@ def if_not_mobile(a):
def if_not_windows(a):
return select({
clean_dep("//tensorflow:windows"): [],
+ clean_dep("//tensorflow:windows_msvc"): [],
"//conditions:default": a,
})
@@ -120,9 +121,24 @@ def if_x86(a):
return select({
clean_dep("//tensorflow:linux_x86_64"): a,
clean_dep("//tensorflow:windows"): a,
+ clean_dep("//tensorflow:windows_msvc"): a,
"//conditions:default": [],
})
+def if_darwin(a):
+ return select({
+ clean_dep("//tensorflow:darwin"): a,
+ "//conditions:default": [],
+ })
+
+WIN_COPTS = [
+ "/DLANG_CXX11",
+ "/D__VERSION__=\\\"MSVC\\\"",
+ "/DPLATFORM_WINDOWS",
+ "/DTF_COMPILE_LIBRARY",
+ "/DEIGEN_HAS_C99_MATH",
+ "/DTENSORFLOW_USE_EIGEN_THREADPOOL",
+]
# LINT.IfChange
def tf_copts():
@@ -139,14 +155,8 @@ def tf_copts():
"-O2",
],
clean_dep("//tensorflow:darwin"): [],
- clean_dep("//tensorflow:windows"): [
- "/DLANG_CXX11",
- "/D__VERSION__=\\\"MSVC\\\"",
- "/DPLATFORM_WINDOWS",
- "/DTF_COMPILE_LIBRARY",
- "/DEIGEN_HAS_C99_MATH",
- "/DTENSORFLOW_USE_EIGEN_THREADPOOL",
- ],
+ clean_dep("//tensorflow:windows"): WIN_COPTS,
+ clean_dep("//tensorflow:windows_msvc"): WIN_COPTS,
clean_dep("//tensorflow:ios"): ["-std=c++11"],
"//conditions:default": ["-pthread"]
}))
@@ -456,6 +466,29 @@ def tf_cuda_cc_test(name,
linkopts=linkopts,
args=args)
+def tf_cuda_only_cc_test(name,
+ srcs=[],
+ deps=[],
+ tags=[],
+ data=[],
+ size="medium",
+ linkstatic=0,
+ args=[],
+ linkopts=[]):
+ native.cc_test(
+ name="%s%s" % (name, "_gpu"),
+ srcs=srcs,
+ size=size,
+ args=args,
+ copts= _cuda_copts() + tf_copts(),
+ data=data,
+ deps=deps + if_cuda([
+ clean_dep("//tensorflow/core:cuda"),
+ clean_dep("//tensorflow/core:gpu_lib"),
+ ]),
+ linkopts=["-lpthread", "-lm"] + linkopts,
+ linkstatic=linkstatic,
+ tags=tags)
# Create a cc_test for each of the tensorflow tests listed in "tests"
def tf_cc_tests(srcs,
@@ -968,6 +1001,7 @@ def tf_py_wrap_cc(name,
clean_dep("//tensorflow:tf_exported_symbols.lds")
],
clean_dep("//tensorflow:windows"): [],
+ clean_dep("//tensorflow:windows_msvc"): [],
"//conditions:default": [
"-Wl,--version-script",
clean_dep("//tensorflow:tf_version_script.lds")
@@ -978,6 +1012,7 @@ def tf_py_wrap_cc(name,
clean_dep("//tensorflow:tf_exported_symbols.lds")
],
clean_dep("//tensorflow:windows"): [],
+ clean_dep("//tensorflow:windows_msvc"): [],
"//conditions:default": [
clean_dep("//tensorflow:tf_version_script.lds")
]
diff --git a/tensorflow/tf_exported_symbols.lds b/tensorflow/tf_exported_symbols.lds
index cb81e89922..1f4d900ec2 100644
--- a/tensorflow/tf_exported_symbols.lds
+++ b/tensorflow/tf_exported_symbols.lds
@@ -1,3 +1,4 @@
*tensorflow*
*perftools*gputools*
*tf_*
+TF_*
diff --git a/tensorflow/tf_version_script.lds b/tensorflow/tf_version_script.lds
index 8c8c8be5a9..b368f7cf21 100644
--- a/tensorflow/tf_version_script.lds
+++ b/tensorflow/tf_version_script.lds
@@ -2,6 +2,7 @@ tensorflow {
global:
*tensorflow*;
*perftools*gputools*;
+ TF_*;
local:
*;
};
diff --git a/tensorflow/tools/api/golden/tensorflow.image.pbtxt b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
index 8f7790f299..93257c84a1 100644
--- a/tensorflow/tools/api/golden/tensorflow.image.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.image.pbtxt
@@ -97,6 +97,10 @@ tf_module {
argspec: "args=[\'boxes\', \'scores\', \'max_output_size\', \'iou_threshold\', \'name\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
}
member_method {
+ name: "non_max_suppression_v2"
+ argspec: "args=[\'boxes\', \'scores\', \'max_output_size\', \'iou_threshold\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+ }
+ member_method {
name: "pad_to_bounding_box"
argspec: "args=[\'image\', \'offset_height\', \'offset_width\', \'target_height\', \'target_width\'], varargs=None, keywords=None, defaults=None"
}
diff --git a/tensorflow/tools/benchmark/README.md b/tensorflow/tools/benchmark/README.md
index 5cb1aa6cf8..fd1bebe835 100644
--- a/tensorflow/tools/benchmark/README.md
+++ b/tensorflow/tools/benchmark/README.md
@@ -9,6 +9,8 @@ both on desktop machines and on Android.
### On Android:
+(0) Refer to https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android to edit the `WORKSPACE` to configure the Android NDK/SDK.
+
(1) build for your specific platform, e.g.:
```bash
$bazel build -c opt \
diff --git a/tensorflow/tools/ci_build/Dockerfile.gpu b/tensorflow/tools/ci_build/Dockerfile.gpu
index 68493965fa..5d18295f68 100644
--- a/tensorflow/tools/ci_build/Dockerfile.gpu
+++ b/tensorflow/tools/ci_build/Dockerfile.gpu
@@ -1,14 +1,15 @@
-FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
+FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04
MAINTAINER Jan Prach <jendap@google.com>
# In the Ubuntu 14.04 images, cudnn is placed in system paths. Move them to
# /usr/local/cuda
-RUN cp /usr/include/cudnn.h /usr/local/cuda/include
-RUN cp /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda/lib64
+RUN cp -P /usr/include/cudnn.h /usr/local/cuda/include
+RUN cp -P /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda/lib64
# Copy and run the install scripts.
COPY install/*.sh /install/
+ARG DEBIAN_FRONTEND=noninteractive
RUN /install/install_bootstrap_deb_packages.sh
RUN add-apt-repository -y ppa:openjdk-r/ppa && \
add-apt-repository -y ppa:george-edison55/cmake-3.x
diff --git a/tensorflow/tools/ci_build/Dockerfile.gpu_clang b/tensorflow/tools/ci_build/Dockerfile.gpu_clang
index 00aaa9f760..c4342d17f5 100644
--- a/tensorflow/tools/ci_build/Dockerfile.gpu_clang
+++ b/tensorflow/tools/ci_build/Dockerfile.gpu_clang
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu14.04
+FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04
MAINTAINER Ilya Biryukov <ibiryukov@google.com>
diff --git a/tensorflow/tools/ci_build/install/install_bazel.sh b/tensorflow/tools/ci_build/install/install_bazel.sh
index 64d78f75b1..daba126f88 100755
--- a/tensorflow/tools/ci_build/install/install_bazel.sh
+++ b/tensorflow/tools/ci_build/install/install_bazel.sh
@@ -15,7 +15,7 @@
# ==============================================================================
# Select bazel version.
-BAZEL_VERSION="0.4.5"
+BAZEL_VERSION="0.5.0"
set +e
local_bazel_ver=$(bazel version 2>&1 | grep -i label | awk '{print $3}')
diff --git a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
index f76c1add24..5581023ad7 100644
--- a/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh
@@ -96,62 +96,6 @@ exclude_cpu_cc_tests="${failing_cpu_cc_tests} + ${broken_cpu_cc_tests}"
exclude_gpu_cc_tests="${extra_failing_gpu_cc_tests} + ${exclude_cpu_cc_tests}"
-# Python tests
-# The first argument is the name of the python test direcotry
-function get_failing_cpu_py_tests() {
- echo "
- //$1/tensorflow/python:basic_session_run_hooks_test + \
- //$1/tensorflow/python:contrib_test + \
- //$1/tensorflow/python:dequantize_op_test + \
- //$1/tensorflow/python:file_io_test + \
- //$1/tensorflow/python:file_system_test + \
- //$1/tensorflow/python:framework_meta_graph_test + \
- //$1/tensorflow/python:framework_ops_test + \
- //$1/tensorflow/python:framework_tensor_util_test + \
- //$1/tensorflow/python:framework_test_util_test + \
- //$1/tensorflow/python:gradients_test + \
- //$1/tensorflow/python:image_ops_test + \
- //$1/tensorflow/python:localhost_cluster_performance_test + \
- //$1/tensorflow/python:monitored_session_test + \
- //$1/tensorflow/python:nn_batchnorm_test + \
- //$1/tensorflow/python:protobuf_compare_test + \
- //$1/tensorflow/python:quantized_conv_ops_test + \
- //$1/tensorflow/python:saver_large_variable_test + \
- //$1/tensorflow/python:saver_test + \
- //$1/tensorflow/python:session_test + \
- //$1/tensorflow/python:supervisor_test + \
- //$1/tensorflow/python:sync_replicas_optimizer_test + \
- //$1/tensorflow/python/debug:curses_ui_test + \
- //$1/tensorflow/python/kernel_tests:as_string_op_test + \
- //$1/tensorflow/python/kernel_tests:benchmark_test + \
- //$1/tensorflow/python/kernel_tests:cast_op_test + \
- //$1/tensorflow/python/kernel_tests:clip_ops_test + \
- //$1/tensorflow/python/kernel_tests:conv_ops_test + \
- //$1/tensorflow/python/kernel_tests:decode_image_op_test + \
- //$1/tensorflow/python/kernel_tests:depthwise_conv_op_test + \
- //$1/tensorflow/python/kernel_tests:functional_ops_test + \
- //$1/tensorflow/python/kernel_tests:py_func_test + \
- //$1/tensorflow/python/kernel_tests:rnn_test + \
- //$1/tensorflow/python/kernel_tests:sets_test + \
- //$1/tensorflow/python/kernel_tests:sparse_matmul_op_test + \
- //$1/tensorflow/python/kernel_tests:string_to_number_op_test + \
- //$1/tensorflow/python/kernel_tests:summary_ops_test + \
- //$1/tensorflow/python/kernel_tests:variable_scope_test + \
- //$1/tensorflow/python/saved_model:saved_model_test \
- "
-}
-
-function get_failing_gpu_py_tests() {
- echo "
- //$1/tensorflow/python/kernel_tests:diag_op_test + \
- //$1/tensorflow/python/kernel_tests:one_hot_op_test + \
- //$1/tensorflow/python/kernel_tests:rnn_test + \
- //$1/tensorflow/python/kernel_tests:sets_test + \
- //$1/tensorflow/python/kernel_tests:trace_op_test + \
- $(get_failing_cpu_py_tests $1)
- "
-}
-
function clean_output_base() {
# TODO(pcloudy): bazel clean --expunge doesn't work on Windows yet.
# Clean the output base manually to ensure build correctness
@@ -178,6 +122,10 @@ function run_configure_for_cpu_build {
if [ -z "$TF_NEED_MKL" ]; then
export TF_NEED_MKL=0
fi
+ export TF_NEED_VERBS=0
+ export TF_NEED_GCP=0
+ export TF_NEED_HDFS=0
+ export TF_NEED_OPENCL=0
echo "" | ./configure
}
@@ -197,6 +145,11 @@ function run_configure_for_gpu_build {
if [ -z "$CC_OPT_FLAGS" ]; then
export CC_OPT_FLAGS="-march=native"
fi
+ export TF_NEED_VERBS=0
+ export TF_NEED_MKL=0
+ export TF_NEED_GCP=0
+ export TF_NEED_HDFS=0
+ export TF_NEED_OPENCL=0
echo "" | ./configure
}
diff --git a/tensorflow/tools/ci_build/windows/bazel/common_env.sh b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
index e4e3861710..8853dc53b1 100644
--- a/tensorflow/tools/ci_build/windows/bazel/common_env.sh
+++ b/tensorflow/tools/ci_build/windows/bazel/common_env.sh
@@ -30,7 +30,7 @@ export TMPDIR="C:/tmp"
mkdir -p "$TMPDIR"
# Set bash path
-export BAZEL_SH="C:/tools/msys64/usr/bin/bash"
+export BAZEL_SH=${BAZEL_SH:-"C:/tools/msys64/usr/bin/bash"}
# Set Python path for ./configure
export PYTHON_BIN_PATH="C:/Program Files/Anaconda3/python"
@@ -55,4 +55,4 @@ export PATH="/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/extras/CUPT
export PATH="/c/tools/cuda/bin:$PATH"
# Set the common build options on Windows
-export BUILD_OPTS='--cpu=x64_windows_msvc --host_cpu=x64_windows_msvc --copt=-w --host_copt=-w --verbose_failures --experimental_ui'
+export BUILD_OPTS='--copt=-w --host_copt=-w --verbose_failures --experimental_ui'
diff --git a/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh b/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
index 34844e60c8..61f5ed084c 100644
--- a/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
+++ b/tensorflow/tools/ci_build/windows/cpu/pip/build_tf_windows.sh
@@ -42,10 +42,10 @@ source "tensorflow/tools/ci_build/windows/bazel/common_env.sh" \
source "tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh" \
|| { echo "Failed to source bazel_test_lib.sh" >&2; exit 1; }
-clean_output_base
-
run_configure_for_cpu_build
+clean_output_base
+
bazel build -c opt $BUILD_OPTS tensorflow/tools/pip_package:build_pip_package || exit $?
# Create a python test directory to avoid package name conflict
@@ -58,12 +58,10 @@ create_python_test_dir "${PY_TEST_DIR}"
PIP_NAME=$(ls ${PY_TEST_DIR}/tensorflow-*.whl)
reinstall_tensorflow_pip ${PIP_NAME}
-failing_cpu_py_tests=$(get_failing_cpu_py_tests ${PY_TEST_DIR})
-
-passing_tests=$(bazel query "kind(py_test, //${PY_TEST_DIR}/tensorflow/python/...) - (${failing_cpu_py_tests})" |
- # We need to strip \r so that the result could be store into a variable under MSYS
- tr '\r' ' ')
-
# Define no_tensorflow_py_deps=true so that every py_test has no deps anymore,
# which will result testing system installed tensorflow
-bazel test -c opt $BUILD_OPTS -k $passing_tests --define=no_tensorflow_py_deps=true --test_output=errors
+bazel test -c opt $BUILD_OPTS -k --test_output=errors \
+ --define=no_tensorflow_py_deps=true --test_lang_filters=py \
+ --test_tag_filters=-no_pip,-no_windows \
+ --build_tag_filters=-no_pip,-no_windows --build_tests_only \
+ //${PY_TEST_DIR}/tensorflow/python/...
diff --git a/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat b/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
index f124012edc..b4f9cc8476 100644
--- a/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
+++ b/tensorflow/tools/ci_build/windows/gpu/cmake/run_build.bat
@@ -22,12 +22,14 @@ CALL "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat"
:: Turn echo back on, above script turns it off.
ECHO ON
-:: Some common variables to be shared between runs.
-SET CMAKE_EXE="C:\Program Files\cmake\bin\cmake.exe"
-SET SWIG_EXE="C:\swigwin-3.0.10\swig.exe"
-SET PY_EXE="C:\Program Files\Anaconda3\python.exe"
-SET PY_LIB="C:\Program Files\Anaconda3\libs\python35.lib"
-SET CUDNN_HOME="c:\tools\cuda"
+:: Set environment variables to be shared between runs. Do not override if they
+:: are set already.
+
+IF DEFINED CMAKE_EXE (ECHO CMAKE_EXE is set to %CMAKE_EXE%) ELSE (SET CMAKE_EXE="C:\Program Files\cmake\bin\cmake.exe")
+IF DEFINED SWIG_EXE (ECHO SWIG_EXE is set to %SWIG_EXE%) ELSE (SET SWIG_EXE="C:\swigwin-3.0.10\swig.exe")
+IF DEFINED PY_EXE (ECHO PY_EXE is set to %PY_EXE%) ELSE (SET PY_EXE="C:\Program Files\Anaconda3\python.exe")
+IF DEFINED PY_LIB (ECHO PY_LIB is set to %PY_LIB%) ELSE (SET PY_LIB="C:\Program Files\Anaconda3\libs\python35.lib")
+IF DEFINED CUDNN_HOME (ECHO CUDNN_HOME is set to %CUDNN_HOME%) ELSE (SET CUDNN_HOME="c:\tools\cuda")
SET CMAKE_DIR=%REPO_ROOT%\tensorflow\contrib\cmake
SET MSBUILD_EXE="C:\Program Files (x86)\MSBuild\14.0\Bin\msbuild.exe"
diff --git a/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat b/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat
index 9307ebb66b..ba2d939b5f 100644
--- a/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat
+++ b/tensorflow/tools/ci_build/windows/gpu/cmake/run_py.bat
@@ -22,7 +22,7 @@ CD %BUILD_DIR%
SET BUILD_CC_TESTS=OFF
SET BUILD_PYTHON_TESTS=ON
-SET PIP_EXE="C:\Program Files\Anaconda3\Scripts\pip.exe"
+IF DEFINED PIP_EXE (ECHO PIP_EXE is set to %PIP_EXE%) ELSE (SET PIP_EXE="C:\Program Files\Anaconda3\Scripts\pip.exe")
:: Run the CMAKE build to build the pip package.
CALL %REPO_ROOT%\tensorflow\tools\ci_build\windows\gpu\cmake\run_build.bat
diff --git a/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh b/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh
index eaf9ef8158..cc157c33f5 100644
--- a/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh
+++ b/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh
@@ -42,10 +42,10 @@ source "tensorflow/tools/ci_build/windows/bazel/common_env.sh" \
source "tensorflow/tools/ci_build/windows/bazel/bazel_test_lib.sh" \
|| { echo "Failed to source bazel_test_lib.sh" >&2; exit 1; }
-clean_output_base
-
run_configure_for_gpu_build
+clean_output_base
+
bazel build -c opt --config=win-cuda $BUILD_OPTS tensorflow/tools/pip_package:build_pip_package || exit $?
# Create a python test directory to avoid package name conflict
@@ -58,13 +58,11 @@ create_python_test_dir "${PY_TEST_DIR}"
PIP_NAME=$(ls ${PY_TEST_DIR}/tensorflow-*.whl)
reinstall_tensorflow_pip ${PIP_NAME}
-failing_gpu_py_tests=$(get_failing_gpu_py_tests ${PY_TEST_DIR})
-
-passing_tests=$(bazel query "kind(py_test, //${PY_TEST_DIR}/tensorflow/python/...) - (${failing_gpu_py_tests})" |
- # We need to strip \r so that the result could be store into a variable under MSYS
- tr '\r' ' ')
-
# Define no_tensorflow_py_deps=true so that every py_test has no deps anymore,
# which will result testing system installed tensorflow
-# GPU tests are very flaky when running concurently, so set local_test_jobs=5
-bazel test -c opt --config=win-cuda $BUILD_OPTS -k $passing_tests --define=no_tensorflow_py_deps=true --test_output=errors --local_test_jobs=5
+# GPU tests are very flaky when running concurently, so set local_test_jobs=1
+bazel test -c opt --config=win-cuda $BUILD_OPTS -k --test_output=errors \
+ --define=no_tensorflow_py_deps=true --test_lang_filters=py \
+ --test_tag_filters=-no_pip,-no_windows,-no_windows_gpu \
+ --build_tag_filters=-no_pip,-no_windows,-no_windows_gpu \
+ --local_test_jobs=1 --build_tests_only //${PY_TEST_DIR}/tensorflow/python/...
diff --git a/tensorflow/tools/dist_test/python/census_widendeep.py b/tensorflow/tools/dist_test/python/census_widendeep.py
index db56a687f6..3a55781496 100644
--- a/tensorflow/tools/dist_test/python/census_widendeep.py
+++ b/tensorflow/tools/dist_test/python/census_widendeep.py
@@ -133,7 +133,7 @@ class CensusDataSource(object):
columns: Columns to retrieve from the data files (A list of strings)
label_column: Name of the label column
categorical_columns: Names of the categorical columns (A list of strings)
- continuous_columns: Names of the continuous columsn (A list of strings)
+ continuous_columns: Names of the continuous columns (A list of strings)
"""
# Retrieve data from disk (if available) or download from the web.
diff --git a/tensorflow/tools/dist_test/python/mnist_replica.py b/tensorflow/tools/dist_test/python/mnist_replica.py
index 7e68258b0a..f7dbfea7fb 100644
--- a/tensorflow/tools/dist_test/python/mnist_replica.py
+++ b/tensorflow/tools/dist_test/python/mnist_replica.py
@@ -16,9 +16,9 @@
"""Distributed MNIST training and validation, with model replicas.
A simple softmax model with one hidden layer is defined. The parameters
-(weights and biases) are located on two parameter servers (ps), while the
-ops are defined on a worker node. The TF sessions also run on the worker
-node.
+(weights and biases) are located on one parameter server (ps), while the ops
+are executed on two worker nodes by default. The TF sessions also run on the
+worker node.
Multiple invocations of this script can be done in parallel, with different
values for --task_index. There should be exactly one invocation with
--task_index, which will create a master session that carries out variable
diff --git a/tensorflow/tools/docker/Dockerfile.devel b/tensorflow/tools/docker/Dockerfile.devel
index c801ceff93..38a67f80aa 100644
--- a/tensorflow/tools/docker/Dockerfile.devel
+++ b/tensorflow/tools/docker/Dockerfile.devel
@@ -57,7 +57,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
>>/etc/bazel.bazelrc
# Install the most recent bazel release.
-ENV BAZEL_VERSION 0.4.5
+ENV BAZEL_VERSION 0.5.0
WORKDIR /
RUN mkdir /bazel && \
cd /bazel && \
@@ -72,7 +72,7 @@ RUN mkdir /bazel && \
RUN git clone https://github.com/tensorflow/tensorflow.git && \
cd tensorflow && \
- git checkout r1.1
+ git checkout r1.2
WORKDIR /tensorflow
# TODO(craigcitro): Don't install the pip package, since it makes it
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index 24350c507e..d0a038a9db 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
+FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
MAINTAINER Craig Citro <craigcitro@google.com>
@@ -57,7 +57,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
>>/etc/bazel.bazelrc
# Install the most recent bazel release.
-ENV BAZEL_VERSION 0.4.5
+ENV BAZEL_VERSION 0.5.0
WORKDIR /
RUN mkdir /bazel && \
cd /bazel && \
@@ -72,7 +72,7 @@ RUN mkdir /bazel && \
RUN git clone https://github.com/tensorflow/tensorflow.git && \
cd tensorflow && \
- git checkout r1.1
+ git checkout r1.2
WORKDIR /tensorflow
# Configure the build for our CUDA configuration.
diff --git a/tensorflow/tools/docker/Dockerfile.gpu b/tensorflow/tools/docker/Dockerfile.gpu
index 88876421f5..3ba1e963f9 100644
--- a/tensorflow/tools/docker/Dockerfile.gpu
+++ b/tensorflow/tools/docker/Dockerfile.gpu
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
+FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
MAINTAINER Craig Citro <craigcitro@google.com>
diff --git a/tensorflow/tools/docker/jupyter_notebook_config.py b/tensorflow/tools/docker/jupyter_notebook_config.py
index 6b1ebc3ee0..747beb8251 100644
--- a/tensorflow/tools/docker/jupyter_notebook_config.py
+++ b/tensorflow/tools/docker/jupyter_notebook_config.py
@@ -22,5 +22,10 @@ c.MultiKernelManager.default_kernel_name = 'python2'
# sets a password if PASSWORD is set in the environment
if 'PASSWORD' in os.environ:
- c.NotebookApp.password = passwd(os.environ['PASSWORD'])
+ password = os.environ['PASSWORD']
+ if password:
+ c.NotebookApp.password = passwd(password)
+ else:
+ c.NotebookApp.password = ''
+ c.NotebookApp.token = ''
del os.environ['PASSWORD']
diff --git a/tensorflow/tools/docker/parameterized_docker_build.sh b/tensorflow/tools/docker/parameterized_docker_build.sh
index 886266caaf..f88af68cde 100755
--- a/tensorflow/tools/docker/parameterized_docker_build.sh
+++ b/tensorflow/tools/docker/parameterized_docker_build.sh
@@ -64,7 +64,7 @@
#
# TF_DOCKER_BUILD_OPTIONS
# (Optional)
-# Specifices the desired build options. Defaults to OPT.
+# Specifies the desired build options. Defaults to OPT.
# Script directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
diff --git a/tensorflow/tools/docs/parser.py b/tensorflow/tools/docs/parser.py
index 52a65e1c9d..7ae1d2abd9 100644
--- a/tensorflow/tools/docs/parser.py
+++ b/tensorflow/tools/docs/parser.py
@@ -897,7 +897,7 @@ class _ClassPageInfo(object):
@property
def guides(self):
- """Returns a markdown string containing backlinks to relevent api_guides."""
+ """Returns a markdown string containing backlinks to relevant api_guides."""
return self._guides
def set_guides(self, guides):
diff --git a/tensorflow/tools/docs/pretty_docs.py b/tensorflow/tools/docs/pretty_docs.py
index faea32cc42..6031a33d32 100644
--- a/tensorflow/tools/docs/pretty_docs.py
+++ b/tensorflow/tools/docs/pretty_docs.py
@@ -58,6 +58,7 @@ def build_md_page(page_info):
def _build_function_page(page_info):
"""Given a FunctionPageInfo object Return the page as an md string."""
parts = [_Metadata(page_info.full_name).build_html()]
+ parts.append('# %s\n\n' % page_info.full_name)
parts.append('# %s\n\n' % page_info.full_name)
diff --git a/tensorflow/tools/docs/py_guide_parser.py b/tensorflow/tools/docs/py_guide_parser.py
index 245643cb32..216353ecee 100644
--- a/tensorflow/tools/docs/py_guide_parser.py
+++ b/tensorflow/tools/docs/py_guide_parser.py
@@ -35,7 +35,7 @@ class PyGuideParser(object):
"""Simple parsing of a guide .md file.
Descendants can override the process_*() functions (called by process())
- to either record infromation from the guide, or call replace_line()
+ to either record information from the guide, or call replace_line()
to affect the return value of process().
"""
diff --git a/tensorflow/tools/gcs_test/python/gcs_smoke.py b/tensorflow/tools/gcs_test/python/gcs_smoke.py
index 615e142c47..51933a52a6 100644
--- a/tensorflow/tools/gcs_test/python/gcs_smoke.py
+++ b/tensorflow/tools/gcs_test/python/gcs_smoke.py
@@ -36,7 +36,7 @@ flags.DEFINE_integer("num_examples", 10, "Number of examples to generate")
FLAGS = flags.FLAGS
def create_examples(num_examples, input_mean):
- """Create ExampleProto's containg data."""
+ """Create ExampleProto's containing data."""
ids = np.arange(num_examples).reshape([num_examples, 1])
inputs = np.random.randn(num_examples, 1) + input_mean
target = inputs - input_mean
diff --git a/tensorflow/tools/graph_transforms/README.md b/tensorflow/tools/graph_transforms/README.md
index 7e8c51efe6..bfda55d3ad 100644
--- a/tensorflow/tools/graph_transforms/README.md
+++ b/tensorflow/tools/graph_transforms/README.md
@@ -586,7 +586,7 @@ equivalent, followed by a float conversion op so that the result is usable by
subsequent nodes. This is mostly useful for [shrinking file
sizes](#shrinking-file-size), but also helps with the more advanced
[quantize_nodes](#quantize_nodes) transform. Even though there are no
-prerequesites, it is advisable to run [fold_batch_norms](#fold_batch_norms) or
+prerequisites, it is advisable to run [fold_batch_norms](#fold_batch_norms) or
[fold_old_batch_norms](#fold_old_batch_norms), because rounding variances down
to zero may cause significant loss of precision.
@@ -674,7 +674,7 @@ number of steps. The unique values are chosen per buffer by linearly allocating
between the largest and smallest values present. This is useful when you'll be
deploying on mobile, and you want a model that will compress effectively. See
[shrinking file size](#shrinking-file-size) for more details. Even though there
-are no prerequesites, it is advisable to run
+are no prerequisites, it is advisable to run
[fold_batch_norms](#fold_batch_norms) or
[fold_old_batch_norms](#fold_old_batch_norms), because rounding variances down
to zero may cause significant loss of precision.
@@ -998,7 +998,7 @@ There are a few things to know about the `ReplaceMatchingOpTypes` function:
important nodes are listed in the `output_nodes` argument that's passed into
each replacement function call. You can disable this checking by setting
`allow_inconsistencies` to true in the options, but otherwise any
- replacements that break the graph constraints will be cancelled. If you do
+ replacements that break the graph constraints will be canceled. If you do
allow inconsistencies, it's your transform's responsibility to fix them up
before you return your final result. Functions like `RenameNodeInputs` can
be useful if you are doing wholesale node renaming for example.
@@ -1055,7 +1055,7 @@ in the future.
The Graph Transform Tool associates names of transforms with the code to
implement them using the `REGISTER_GRAPH_TRANSFORM()` macro. This takes a string
-and a function, and automagically registers the transform with the tool. You
+and a function, and automatically registers the transform with the tool. You
will need to watch out for a few things though:
* Because it's using global C++ objects in each file under the hood, the
diff --git a/tensorflow/tools/graph_transforms/transform_utils.h b/tensorflow/tools/graph_transforms/transform_utils.h
index 515adf6344..6ed549a958 100644
--- a/tensorflow/tools/graph_transforms/transform_utils.h
+++ b/tensorflow/tools/graph_transforms/transform_utils.h
@@ -107,7 +107,7 @@ void FilterGraphDef(const GraphDef& input_graph_def,
std::function<bool(const NodeDef&)> selector,
GraphDef* output_graph_def);
-// Creates a copy of the input graph, with all occurences of the attributes with
+// Creates a copy of the input graph, with all occurrences of the attributes with
// the names in the argument removed from the node defs.
void RemoveAttributes(const GraphDef& input_graph_def,
const std::vector<string>& attributes,
diff --git a/tensorflow/tools/lib_package/BUILD b/tensorflow/tools/lib_package/BUILD
index 1e36af93ea..cc8a6fb74e 100644
--- a/tensorflow/tools/lib_package/BUILD
+++ b/tensorflow/tools/lib_package/BUILD
@@ -97,6 +97,7 @@ genrule(
"@jemalloc//:COPYING",
"@jpeg//:LICENSE.md",
"@libxsmm_archive//:LICENSE",
+ #"@lmdb//:LICENSE",
"@local_config_sycl//sycl:LICENSE.text",
"@png_archive//:LICENSE",
"@protobuf//:LICENSE",
@@ -126,6 +127,7 @@ genrule(
"@jemalloc//:COPYING",
"@jpeg//:LICENSE.md",
"@libxsmm_archive//:LICENSE",
+ #"@lmdb//:LICENSE",
"@local_config_sycl//sycl:LICENSE.text",
"@png_archive//:LICENSE",
"@protobuf//:LICENSE",
diff --git a/tensorflow/tools/lib_package/libtensorflow_java_test.sh b/tensorflow/tools/lib_package/libtensorflow_java_test.sh
index 463990b79c..a44298e01a 100755
--- a/tensorflow/tools/lib_package/libtensorflow_java_test.sh
+++ b/tensorflow/tools/lib_package/libtensorflow_java_test.sh
@@ -18,7 +18,7 @@ set -ex
# Sanity test for the binary artifacts for the TensorFlow Java API.
# - Unarchive
-# - Compile a trivial Java file that excercises the Java API and underlying
+# - Compile a trivial Java file that exercises the Java API and underlying
# native library.
# - Run it
diff --git a/tensorflow/tools/pip_package/BUILD b/tensorflow/tools/pip_package/BUILD
index 742d7eaae7..85cb46bf54 100644
--- a/tensorflow/tools/pip_package/BUILD
+++ b/tensorflow/tools/pip_package/BUILD
@@ -74,8 +74,11 @@ py_binary(
"//tensorflow/python/debug:debug_pip",
"//tensorflow/python/saved_model",
"//tensorflow/python/tools:tools_pip",
- "//tensorflow/tensorboard",
# These targets don't build on Windows yet. Exclude them for now.
+ # rules_closure currently doesn't build on Windows due to
+ # https://github.com/bazelbuild/rules_closure/pull/206
+ # Since tensorboard dependes on rules_closure, exclude tensorboard until it's fixed.
+ # "//tensorflow/tensorboard",
# "//tensorflow/contrib/ndlstm",
# "//tensorflow/contrib/slim",
# "//tensorflow/contrib/slim/python/slim/nets:nets_pip",
@@ -107,6 +110,7 @@ filegroup(
"@jemalloc//:COPYING",
"@jpeg//:LICENSE.md",
"@libxsmm_archive//:LICENSE",
+ #"@lmdb//:LICENSE",
"@local_config_sycl//sycl:LICENSE.text",
"@nanopb_git//:LICENSE.txt",
"@org_html5lib//:LICENSE",
@@ -128,6 +132,7 @@ sh_binary(
srcs = ["build_pip_package.sh"],
data = select({
"//tensorflow:windows": [":simple_console_for_windows"],
+ "//tensorflow:windows_msvc": [":simple_console_for_windows"],
"//conditions:default": [
":licenses",
"MANIFEST.in",
diff --git a/tensorflow/tools/pip_package/pip_smoke_test.py b/tensorflow/tools/pip_package/pip_smoke_test.py
index 0524d2f1aa..58a80fd98a 100644
--- a/tensorflow/tools/pip_package/pip_smoke_test.py
+++ b/tensorflow/tools/pip_package/pip_smoke_test.py
@@ -46,6 +46,7 @@ BLACKLIST = [
"//tensorflow/python:tf_optimizer",
"//tensorflow/python:compare_test_proto_py",
"//tensorflow/core:image_testdata",
+ "//tensorflow/core:lmdb_testdata",
"//tensorflow/core/kernels/cloud:bigquery_reader_ops",
"//tensorflow/python/feature_column:vocabulary_testdata",
"//tensorflow/python:framework/test_file_system.so",
diff --git a/tensorflow/tools/pip_package/setup.py b/tensorflow/tools/pip_package/setup.py
index 0ce6d72906..3bbc900042 100644
--- a/tensorflow/tools/pip_package/setup.py
+++ b/tensorflow/tools/pip_package/setup.py
@@ -29,7 +29,7 @@ from setuptools.dist import Distribution
# This version string is semver compatible, but incompatible with pip.
# For pip, we will remove all '-' characters from this string, and use the
# result for pip.
-_VERSION = '1.1.0'
+_VERSION = '1.2.0-rc2'
REQUIRED_PACKAGES = [
'numpy >= 1.11.0',
diff --git a/tensorflow/tools/proto_text/BUILD b/tensorflow/tools/proto_text/BUILD
index 2d14538c8d..3a60c8c958 100644
--- a/tensorflow/tools/proto_text/BUILD
+++ b/tensorflow/tools/proto_text/BUILD
@@ -44,6 +44,7 @@ cc_library(
hdrs = ["gen_proto_text_functions_lib.h"],
linkopts = select({
"//tensorflow:windows": [],
+ "//tensorflow:windows_msvc": [],
"//tensorflow:darwin": [
"-lm",
"-lpthread",
diff --git a/tensorflow/tools/tfprof/README.md b/tensorflow/tools/tfprof/README.md
index 88f5501bc7..9ecebd994f 100644
--- a/tensorflow/tools/tfprof/README.md
+++ b/tensorflow/tools/tfprof/README.md
@@ -110,7 +110,7 @@ Sigmoid 152.57MB (85.28%, 0.21%), 96.66ms (23.46%,
### Visualize time and memory.
<left>
-![CodeTimeline](g3doc/graph_timeline.png)
+[CodeTimeline](g3doc/graph_timeline.png)
</left>
### Teams
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 57a096d993..888390764a 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -506,6 +506,17 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
)
native.new_http_archive(
+ name = "lmdb",
+ urls = [
+ "http://mirror.bazel.build/github.com/LMDB/lmdb/archive/LMDB_0.9.19.tar.gz",
+ "https://github.com/LMDB/lmdb/archive/LMDB_0.9.19.tar.gz",
+ ],
+ sha256 = "108532fb94c6f227558d45be3f3347b52539f0f58290a7bb31ec06c462d05326",
+ strip_prefix = "lmdb-LMDB_0.9.19/libraries/liblmdb",
+ build_file = str(Label("//third_party:lmdb.BUILD")),
+ )
+
+ native.new_http_archive(
name = "jsoncpp_git",
urls = [
"http://mirror.bazel.build/github.com/open-source-parsers/jsoncpp/archive/11086dd6a7eba04289944367ca82cea71299ed70.tar.gz",
@@ -741,7 +752,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
native.new_http_archive(
name = "io_angular_clutz",
- build_file = "//third_party:clutz.BUILD",
+ build_file = str(Label("//third_party:clutz.BUILD")),
sha256 = "2981de41d1ff4774b544423da9a2cd8beb3be649e95aef2ef2fd83957300b3fe",
strip_prefix = "clutz-b0db5ade9bb535d387f05292316c422790c9848e",
urls = [
diff --git a/third_party/curl.BUILD b/third_party/curl.BUILD
index 43f6599acc..882967df1c 100644
--- a/third_party/curl.BUILD
+++ b/third_party/curl.BUILD
@@ -5,6 +5,26 @@ licenses(["notice"]) # MIT/X derivative license
exports_files(["COPYING"])
+CURL_WIN_COPTS = [
+ "/I%prefix%/curl/lib",
+ "/DHAVE_CONFIG_H",
+ "/DCURL_DISABLE_FTP",
+ "/DCURL_DISABLE_NTLM",
+ "/DHAVE_LIBZ",
+ "/DHAVE_ZLIB_H",
+ # Defining _USING_V110_SDK71_ is hackery to defeat curl's incorrect
+ # detection of what OS releases we can build on with VC 2012. This
+ # may not be needed (or may have to change) if the WINVER setting
+ # changes in //third_party/msvc/vc_12_0/CROSSTOOL.
+ "/D_USING_V110_SDK71_",
+]
+
+CURL_WIN_SRCS = [
+ "lib/asyn-thread.c",
+ "lib/inet_ntop.c",
+ "lib/system_win32.c",
+]
+
cc_library(
name = "curl",
srcs = [
@@ -210,11 +230,8 @@ cc_library(
"@%ws%//tensorflow:ios": [
"lib/vtls/darwinssl.c",
],
- "@%ws%//tensorflow:windows": [
- "lib/asyn-thread.c",
- "lib/inet_ntop.c",
- "lib/system_win32.c",
- ],
+ "@%ws%//tensorflow:windows": CURL_WIN_SRCS,
+ "@%ws%//tensorflow:windows_msvc": CURL_WIN_SRCS,
"//conditions:default": [
"lib/vtls/openssl.c",
],
@@ -231,19 +248,8 @@ cc_library(
"include/curl/typecheck-gcc.h",
],
copts = select({
- "@%ws%//tensorflow:windows": [
- "/I%prefix%/curl/lib",
- "/DHAVE_CONFIG_H",
- "/DCURL_DISABLE_FTP",
- "/DCURL_DISABLE_NTLM",
- "/DHAVE_LIBZ",
- "/DHAVE_ZLIB_H",
- # Defining _USING_V110_SDK71_ is hackery to defeat curl's incorrect
- # detection of what OS releases we can build on with VC 2012. This
- # may not be needed (or may have to change) if the WINVER setting
- # changes in //third_party/msvc/vc_12_0/CROSSTOOL.
- "/D_USING_V110_SDK71_",
- ],
+ "@%ws%//tensorflow:windows": CURL_WIN_COPTS,
+ "@%ws%//tensorflow:windows_msvc": CURL_WIN_COPTS,
"//conditions:default": [
"-I%prefix%/curl/lib",
"-D_GNU_SOURCE",
@@ -262,6 +268,10 @@ cc_library(
# See curl.h for discussion of write size and Windows
"/DCURL_MAX_WRITE_SIZE=16384",
],
+ "@%ws%//tensorflow:windows_msvc": [
+ # See curl.h for discussion of write size and Windows
+ "/DCURL_MAX_WRITE_SIZE=16384",
+ ],
"//conditions:default": [
"-DCURL_MAX_WRITE_SIZE=65536",
],
@@ -279,7 +289,10 @@ cc_library(
],
"@%ws%//tensorflow:ios": [],
"@%ws%//tensorflow:windows": [
- "ws2_32.lib",
+ "-Wl,ws2_32.lib",
+ ],
+ "@%ws%//tensorflow:windows_msvc": [
+ "-Wl,ws2_32.lib",
],
"//conditions:default": [
"-lrt",
@@ -291,12 +304,19 @@ cc_library(
] + select({
"@%ws%//tensorflow:ios": [],
"@%ws%//tensorflow:windows": [],
+ "@%ws%//tensorflow:windows_msvc": [],
"//conditions:default": [
"@boringssl//:ssl",
],
}),
)
+CURL_BIN_WIN_COPTS = [
+ "/I%prefix%/curl/lib",
+ "/DHAVE_CONFIG_H",
+ "/DCURL_DISABLE_LIBCURL_OPTION",
+]
+
cc_binary(
name = "curl_bin",
srcs = [
@@ -386,11 +406,8 @@ cc_binary(
"src/tool_xattr.h",
],
copts = select({
- "@%ws%//tensorflow:windows": [
- "/I%prefix%/curl/lib",
- "/DHAVE_CONFIG_H",
- "/DCURL_DISABLE_LIBCURL_OPTION",
- ],
+ "@%ws%//tensorflow:windows": CURL_BIN_WIN_COPTS,
+ "@%ws%//tensorflow:windows_msvc": CURL_BIN_WIN_COPTS,
"//conditions:default": [
"-I%prefix%/curl/lib",
"-D_GNU_SOURCE",
diff --git a/third_party/farmhash.BUILD b/third_party/farmhash.BUILD
index 6a1d4da6e5..a51e1511c1 100644
--- a/third_party/farmhash.BUILD
+++ b/third_party/farmhash.BUILD
@@ -3,12 +3,19 @@ licenses(["notice"]) # MIT
exports_files(["COPYING"])
config_setting(
- name = "windows",
+ name = "windows_msvc",
values = {
"cpu": "x64_windows_msvc",
},
)
+config_setting(
+ name = "windows",
+ values = {
+ "cpu": "x64_windows",
+ },
+)
+
cc_library(
name = "farmhash",
srcs = ["src/farmhash.cc"],
@@ -16,6 +23,7 @@ cc_library(
# Disable __builtin_expect support on Windows
copts = select({
":windows": ["/DFARMHASH_OPTIONAL_BUILTIN_EXPECT"],
+ ":windows_msvc": ["/DFARMHASH_OPTIONAL_BUILTIN_EXPECT"],
"//conditions:default": [],
}),
includes = ["src/."],
diff --git a/third_party/gif.BUILD b/third_party/gif.BUILD
index fec7449130..ad6821af3c 100644
--- a/third_party/gif.BUILD
+++ b/third_party/gif.BUILD
@@ -24,6 +24,7 @@ cc_library(
visibility = ["//visibility:public"],
deps = select({
":windows": [":windows_polyfill"],
+ ":windows_msvc": [":windows_polyfill"],
"//conditions:default": [],
}),
)
@@ -41,6 +42,15 @@ genrule(
)
config_setting(
+ name = "windows_msvc",
+ values = {
+ "cpu": "x64_windows_msvc",
+ },
+)
+
+config_setting(
name = "windows",
- values = {"cpu": "x64_windows_msvc"},
+ values = {
+ "cpu": "x64_windows",
+ },
)
diff --git a/third_party/gpus/cuda_configure.bzl b/third_party/gpus/cuda_configure.bzl
index c6deae05b8..61932a8e6d 100644
--- a/third_party/gpus/cuda_configure.bzl
+++ b/third_party/gpus/cuda_configure.bzl
@@ -5,7 +5,7 @@
* `TF_NEED_CUDA`: Whether to enable building with CUDA.
* `GCC_HOST_COMPILER_PATH`: The GCC host compiler path
- * `TF_CUDA_CLANG`: Wheter to use clang as a cuda compiler.
+ * `TF_CUDA_CLANG`: Whether to use clang as a cuda compiler.
* `CLANG_CUDA_COMPILER_PATH`: The clang compiler path that will be used for
both host and device code compilation if TF_CUDA_CLANG is 1.
* `CUDA_TOOLKIT_PATH`: The path to the CUDA toolkit. Default is
@@ -41,8 +41,8 @@ def find_cc(repository_ctx):
"""Find the C++ compiler."""
# On Windows, we use Bazel's MSVC CROSSTOOL for GPU build
# Return a dummy value for GCC detection here to avoid error
- if _cpu_value(repository_ctx) == "Windows":
- return "/use/--config x64_windows_msvc/instead"
+ if _is_windows(repository_ctx):
+ return "/use/--config=win-cuda --cpu=x64_windows_msvc/instead"
if _use_cuda_clang(repository_ctx):
target_cc_name = "clang"
@@ -57,7 +57,7 @@ def find_cc(repository_ctx):
if cc_name_from_env:
cc_name = cc_name_from_env
if cc_name.startswith("/"):
- # Absolute path, maybe we should make this suported by our which function.
+ # Absolute path, maybe we should make this supported by our which function.
return cc_name
cc = repository_ctx.which(cc_name)
if cc == None:
@@ -122,10 +122,10 @@ def get_cxx_inc_directories(repository_ctx, cc):
def auto_configure_fail(msg):
- """Output failure message when auto configuration fails."""
+ """Output failure message when cuda configuration fails."""
red = "\033[0;31m"
no_color = "\033[0m"
- fail("\n%sAuto-Configuration Error:%s %s\n" % (red, no_color, msg))
+ fail("\n%sCuda Configuration Error:%s %s\n" % (red, no_color, msg))
# END cc_configure common functions (see TODO above).
@@ -421,6 +421,10 @@ def _cpu_value(repository_ctx):
return result.stdout.strip()
+def _is_windows(repository_ctx):
+ """Returns true if the host operating system is windows."""
+ return _cpu_value(repository_ctx) == "Windows"
+
def _lib_name(lib, cpu_value, version="", static=False):
"""Constructs the platform-specific name of a library.
@@ -769,14 +773,48 @@ def _create_dummy_repository(repository_ctx):
repository_ctx.file("crosstool/BUILD", _DUMMY_CROSSTOOL_BUILD_FILE)
+def _execute(repository_ctx, cmdline, error_msg=None, error_details=None,
+ empty_stdout_fine=False):
+ """Executes an arbitrary shell command.
+
+ Args:
+ repository_ctx: the repository_ctx object
+ cmdline: list of strings, the command to execute
+ error_msg: string, a summary of the error if the command fails
+ error_details: string, details about the error or steps to fix it
+ empty_stdout_fine: bool, if True, an empty stdout result is fine, otherwise
+ it's an error
+ Return:
+ the result of repository_ctx.execute(cmdline)
+ """
+ result = repository_ctx.execute(cmdline)
+ if result.stderr or not (empty_stdout_fine or result.stdout):
+ auto_configure_fail(
+ "\n".join([
+ error_msg.strip() if error_msg else "Repository command failed",
+ result.stderr.strip(),
+ error_details if error_details else ""]))
+ return result
+
+
+def _norm_path(path):
+ """Returns a path with '/' and remove the trailing slash."""
+ path = path.replace("\\", "/")
+ if path[-1] == "/":
+ path = path[:-1]
+ return path
+
+
def _symlink_genrule_for_dir(repository_ctx, src_dir, dest_dir, genrule_name,
src_files = [], dest_files = []):
- """Returns a genrule to symlink a set of files.
+ """Returns a genrule to symlink(or copy if on Windows) a set of files.
If src_dir is passed, files will be read from the given directory; otherwise
we assume files are in src_files and dest_files
"""
if src_dir != None:
+ src_dir = _norm_path(src_dir)
+ dest_dir = _norm_path(dest_dir)
files = _read_dir(repository_ctx, src_dir)
# Create a list with the src_dir stripped to use for outputs.
dest_files = files.replace(src_dir, '').splitlines()
@@ -787,8 +825,10 @@ def _symlink_genrule_for_dir(repository_ctx, src_dir, dest_dir, genrule_name,
if dest_files[i] != "":
# If we have only one file to link we do not want to use the dest_dir, as
# $(@D) will include the full path to the file.
- dest = ' $(@D)/' + dest_dir + dest_files[i] if len(dest_files) != 1 else ' $(@D)/' + dest_files[i]
- command.append('ln -s ' + src_files[i] + dest)
+ dest = '$(@D)/' + dest_dir + dest_files[i] if len(dest_files) != 1 else '$(@D)/' + dest_files[i]
+ # On Windows, symlink is not supported, so we just copy all the files.
+ cmd = 'cp -f' if _is_windows(repository_ctx) else 'ln -s'
+ command.append(cmd + ' "%s" "%s"' % (src_files[i] , dest))
outs.append(' "' + dest_dir + dest_files[i] + '",')
genrule = _genrule(src_dir, genrule_name, " && ".join(command),
"\n".join(outs))
@@ -821,10 +861,20 @@ def _read_dir(repository_ctx, src_dir):
symlinks. The returned string contains the full path of all files
separated by line breaks.
"""
- find_result = repository_ctx.execute([
- "find", src_dir, "-follow", "-type", "f"
- ])
- return find_result.stdout
+ if _is_windows(repository_ctx):
+ src_dir = src_dir.replace("/", "\\")
+ find_result = _execute(
+ repository_ctx, ["cmd.exe", "/c", "dir", src_dir, "/b", "/s", "/a-d"],
+ empty_stdout_fine=True)
+ # src_files will be used in genrule.outs where the paths must
+ # use forward slashes.
+ result = find_result.stdout.replace("\\", "/")
+ else:
+ find_result = _execute(
+ repository_ctx, ["find", src_dir, "-follow", "-type", "f"],
+ empty_stdout_fine=True)
+ result = find_result.stdout
+ return result
def _use_cuda_clang(repository_ctx):
diff --git a/third_party/hadoop/hdfs.h b/third_party/hadoop/hdfs.h
index 560d8bba0e..a664f3b50c 100644
--- a/third_party/hadoop/hdfs.h
+++ b/third_party/hadoop/hdfs.h
@@ -171,7 +171,7 @@ void hdfsFileFreeReadStatistics(struct hdfsReadStatistics *stats);
* Connect to the hdfs.
* @param nn The NameNode. See hdfsBuilderSetNameNode for details.
* @param port The port on which the server is listening.
- * @param user the user name (this is hadoop domain user). Or NULL is equivelant
+ * @param user the user name (this is hadoop domain user). Or NULL is equivalent
* to hhdfsConnect(host, port)
* @return Returns a handle to the filesystem or NULL on error.
* @deprecated Use hdfsBuilderConnect instead.
@@ -397,7 +397,7 @@ hdfsFile hdfsOpenFile(hdfsFS fs, const char *path, int flags, int bufferSize,
short replication, tSize blocksize);
/**
- * hdfsTruncateFile - Truncate a hdfs file to given lenght.
+ * hdfsTruncateFile - Truncate a hdfs file to given length.
* @param fs The configured filesystem handle.
* @param path The full path to the file.
* @param newlength The size the file is to be truncated to
diff --git a/third_party/jpeg/jpeg.BUILD b/third_party/jpeg/jpeg.BUILD
index 78e03eadcf..f6078052ec 100644
--- a/third_party/jpeg/jpeg.BUILD
+++ b/third_party/jpeg/jpeg.BUILD
@@ -9,17 +9,20 @@ load("@%ws%//third_party:common.bzl", "template_rule")
libjpegturbo_nocopts = "-[W]error"
+WIN_COPTS = [
+ "/Ox",
+ "/w14711", # function 'function' selected for inline expansion
+ "/w14710", # 'function' : function not inlined
+]
+
libjpegturbo_copts = select({
":android": [
"-O2",
"-fPIE",
"-w",
],
- ":windows": [
- "/Ox",
- "/w14711", # function 'function' selected for inline expansion
- "/w14710", # 'function' : function not inlined
- ],
+ ":windows": WIN_COPTS,
+ ":windows_msvc": WIN_COPTS,
"//conditions:default": [
"-O3",
"-w",
@@ -370,6 +373,7 @@ genrule(
outs = ["jconfig.h"],
cmd = select({
":windows": "cp $(location jconfig_win.h) $@",
+ ":windows_msvc": "cp $(location jconfig_win.h) $@",
":k8": "cp $(location jconfig_nowin_simd.h) $@",
":armeabi-v7a": "cp $(location jconfig_nowin_simd.h) $@",
":arm64-v8a": "cp $(location jconfig_nowin_simd.h) $@",
@@ -386,6 +390,7 @@ genrule(
outs = ["jconfigint.h"],
cmd = select({
":windows": "cp $(location jconfigint_win.h) $@",
+ ":windows_msvc": "cp $(location jconfigint_win.h) $@",
"//conditions:default": "cp $(location jconfigint_nowin.h) $@",
}),
)
@@ -482,5 +487,10 @@ config_setting(
config_setting(
name = "windows",
+ values = {"cpu": "x64_windows"},
+)
+
+config_setting(
+ name = "windows_msvc",
values = {"cpu": "x64_windows_msvc"},
)
diff --git a/third_party/libxsmm.BUILD b/third_party/libxsmm.BUILD
index f9f1ea1085..4124f2db63 100644
--- a/third_party/libxsmm.BUILD
+++ b/third_party/libxsmm.BUILD
@@ -11,19 +11,8 @@ exports_files(["LICENSE"])
libxsmm_interface_arguments = "0 1"
# Arguments to ./scripts/libxsmm_config.py, see that file for detailed description.
-# ilp64: no
-# big: no
-# offload: no
-# alignment [b]
-# prefetch: 1 (auto)
-# threshold: fallback to BLAS if n*m*k above this
-# synchronize: yes
-# jit: yes
-# flags
-# alpha = 1
-# beta = 1
-# gemm = 2
-libxsmm_config_arguments = "0 0 0 64 1 0 1 1 0 1 1 2"
+# rely on default arguments
+libxsmm_config_arguments = ""
# Arguments to ./scripts/libxsmm_dispatch.py, see that file for detailed description.
# (dummy argument)
diff --git a/third_party/lmdb.BUILD b/third_party/lmdb.BUILD
new file mode 100644
index 0000000000..7c6e3dc3f0
--- /dev/null
+++ b/third_party/lmdb.BUILD
@@ -0,0 +1,37 @@
+# Description:
+# LMDB is the Lightning Memory-mapped Database.
+
+licenses(["notice"]) # OpenLDAP Public License
+
+exports_files(["LICENSE"])
+
+cc_library(
+ name = "lmdb",
+ srcs = [
+ "mdb.c",
+ "midl.c",
+ ],
+ hdrs = [
+ "lmdb.h",
+ "midl.h",
+ ],
+ copts = [
+ "-w",
+ ],
+ linkopts = select({
+ ":windows": ["-Wl,advapi32.lib"], # InitializeSecurityDescriptor, SetSecurityDescriptorDacl
+ ":windows_msvc": ["-Wl,advapi32.lib"],
+ "//conditions:default": ["-lpthread"],
+ }),
+ visibility = ["//visibility:public"],
+)
+
+config_setting(
+ name = "windows",
+ values = {"cpu": "x64_windows"},
+)
+
+config_setting(
+ name = "windows_msvc",
+ values = {"cpu": "x64_windows_msvc"},
+)
diff --git a/third_party/mpi/BUILD b/third_party/mpi/BUILD
new file mode 100644
index 0000000000..ff3f437e92
--- /dev/null
+++ b/third_party/mpi/BUILD
@@ -0,0 +1,25 @@
+licenses(["restricted"])
+
+filegroup(
+ name = "all_files",
+ srcs = glob(
+ ["**/*"],
+ exclude = [
+ "**/METADATA",
+ "**/OWNERS",
+ ],
+ ),
+ visibility = ["//tensorflow:__subpackages__"],
+)
+
+load("//third_party/mpi:mpi.bzl", "mpi_hdr")
+load("//third_party/mpi:mpi.bzl", "if_mpi")
+
+cc_library(
+ name = "mpi",
+ srcs = if_mpi([
+ "libmpi.so",
+ ]),
+ hdrs = if_mpi(mpi_hdr()),
+ visibility = ["//visibility:public"],
+)
diff --git a/third_party/mpi/mpi.bzl b/third_party/mpi/mpi.bzl
new file mode 100644
index 0000000000..38ce91c4d0
--- /dev/null
+++ b/third_party/mpi/mpi.bzl
@@ -0,0 +1,17 @@
+#OpenMPI and Mvapich/mpich require different headers
+#based on the configuration options return one or the other
+
+def mpi_hdr():
+ MPI_LIB_IS_OPENMPI=True
+ hdrs = []
+ if MPI_LIB_IS_OPENMPI:
+ hdrs = ["mpi.h", "mpi_portable_platform.h"] #When using OpenMPI
+ else:
+ hdrs = ["mpi.h", "mpio.h", "mpicxx.h"] #When using MVAPICH
+ return hdrs
+
+def if_mpi(if_true, if_false = []):
+ return select({
+ "//tensorflow:with_mpi_support": if_true,
+ "//conditions:default": if_false
+ })
diff --git a/third_party/nasm.BUILD b/third_party/nasm.BUILD
index b3cf17a97e..341d58068b 100644
--- a/third_party/nasm.BUILD
+++ b/third_party/nasm.BUILD
@@ -101,6 +101,7 @@ cc_binary(
],
copts = select({
":windows": [],
+ ":windows_msvc": [],
"//conditions:default": [
"-w",
"-std=c99",
@@ -108,12 +109,22 @@ cc_binary(
}),
defines = select({
":windows": [],
+ ":windows_msvc": [],
"//conditions:default": ["HAVE_SNPRINTF"],
}),
visibility = ["@jpeg//:__pkg__"],
)
config_setting(
+ name = "windows_msvc",
+ values = {
+ "cpu": "x64_windows_msvc",
+ },
+)
+
+config_setting(
name = "windows",
- values = {"cpu": "x64_windows_msvc"},
+ values = {
+ "cpu": "x64_windows",
+ },
)
diff --git a/third_party/py/BUILD.tpl b/third_party/py/BUILD.tpl
index 157834df4b..1ee9c071ad 100644
--- a/third_party/py/BUILD.tpl
+++ b/third_party/py/BUILD.tpl
@@ -4,42 +4,14 @@ package(default_visibility = ["//visibility:public"])
cc_library(
name = "python_headers",
- hdrs = select({
- "windows" : [
- "python_include_windows",
- ],
- "//conditions:default" : [
- "python_include",
- ],
- }),
- includes = select({
- "windows" : [
- "python_include_windows",
- ],
- "//conditions:default" : [
- "python_include",
- ],
- }),
+ hdrs = [":python_include"],
+ includes = ["python_include"],
)
cc_library(
name = "numpy_headers",
- hdrs = select({
- "windows" : [
- "numpy_include_windows",
- ],
- "//conditions:default" : [
- "numpy_include",
- ],
- }),
- includes = select({
- "windows" : [
- "numpy_include_windows",
- ],
- "//conditions:default" : [
- "numpy_include",
- ],
- }),
+ hdrs = [":numpy_include"],
+ includes = ["numpy_include"],
)
config_setting(
diff --git a/third_party/py/python_configure.bzl b/third_party/py/python_configure.bzl
index b2d0e250e7..b4a98af7b6 100644
--- a/third_party/py/python_configure.bzl
+++ b/third_party/py/python_configure.bzl
@@ -28,14 +28,14 @@ def _python_configure_warning(msg):
"""Output warning message during auto configuration."""
yellow = "\033[1;33m"
no_color = "\033[0m"
- print("\n%sPython Configuration Warning:%s %s\n" % (yellow, no_color, msg))
+ print("%sPython Configuration Warning:%s %s" % (yellow, no_color, msg))
def _python_configure_fail(msg):
"""Output failure message when auto configuration fails."""
red = "\033[0;31m"
no_color = "\033[0m"
- fail("\n%sPython Configuration Error:%s %s\n" % (red, no_color, msg))
+ fail("%sPython Configuration Error:%s %s\n" % (red, no_color, msg))
def _get_env_var(repository_ctx, name, default = None, enable_warning = True):
@@ -82,51 +82,27 @@ def _execute(repository_ctx, cmdline, error_msg=None, error_details=None,
return result
-def _symlink_genrule_for_dir(repository_ctx, src_dir, dest_dir, genrule_name):
- """returns a genrule to symlink all files in a directory."""
- # Get the list of files under this directory
- find_result = None
+def _read_dir(repository_ctx, src_dir):
+ """Returns a string with all files in a directory.
+
+ Finds all files inside a directory, traversing subfolders and following
+ symlinks. The returned string contains the full path of all files
+ separated by line breaks.
+ """
if _is_windows(repository_ctx):
+ src_dir = src_dir.replace("/", "\\")
find_result = _execute(
- repository_ctx,
- ["cmd.exe", "/c", "dir", src_dir.replace("/", "\\"), "/b", "/s",
- "/a-d"],
+ repository_ctx, ["cmd.exe", "/c", "dir", src_dir, "/b", "/s", "/a-d"],
empty_stdout_fine=True)
- # src_files will be used to compute BUILD rules, where path must use
- # forward slashes.
- src_files = find_result.stdout.replace("\\", "/").splitlines()
- # Create a list with the src_dir stripped to use for outputs.
- fwdslashes_src_dir = src_dir.replace("\\", "/")
- dest_files = [e.replace(fwdslashes_src_dir, "") for e in src_files]
+ # src_files will be used in genrule.outs where the paths must
+ # use forward slashes.
+ result = find_result.stdout.replace("\\", "/")
else:
find_result = _execute(
repository_ctx, ["find", src_dir, "-follow", "-type", "f"],
empty_stdout_fine=True)
- # Create a list with the src_dir stripped to use for outputs.
- dest_files = find_result.stdout.replace(src_dir, '').splitlines()
- src_files = find_result.stdout.splitlines()
- command = []
- command_windows = []
- outs = []
- outs_windows = []
- for i in range(len(dest_files)):
- if dest_files[i] != "":
- command.append('ln -s ' + src_files[i] + ' $(@D)/' +
- dest_dir + dest_files[i])
- # ln -sf is actually implemented as copying in msys since creating
- # symbolic links is privileged on Windows. But copying is too slow, so
- # invoke mklink to create junctions on Windows.
- command_windows.append('mklink /J ' + src_files[i] + ' $(@D)/' +
- dest_dir + dest_files[i])
- outs.append(' "' + dest_dir + dest_files[i] + '",')
- outs_windows.append(' "' + dest_dir + '_windows' +
- dest_files[i] + '",')
- genrule = _genrule(src_dir, genrule_name, ' && '.join(command),
- '\n'.join(outs))
- genrule_windows = _genrule(src_dir, genrule_name + '_windows',
- "cmd /c \"" + ' && '.join(command_windows) + "\"",
- '\n'.join(outs_windows))
- return genrule + '\n' + genrule_windows
+ result = find_result.stdout
+ return result
def _genrule(src_dir, genrule_name, command, outs):
@@ -144,11 +120,43 @@ def _genrule(src_dir, genrule_name, command, outs):
' cmd = """\n' +
command +
' """,\n' +
- ' visibility = ["//visibility:private"],' +
- ')\n'
+ ')\n\n'
)
+def _norm_path(path):
+ """Returns a path with forward slashes and no trailing slash."""
+ path = path.replace("\\", "/")
+ if path[-1] == "/":
+ path = path[:-1]
+ return path
+
+
+def _symlink_genrule_for_dir(repository_ctx, src_dir, dest_dir, genrule_name):
+ """Returns a genrule to symlink (or copy, on Windows) a set of files.
+ """
+ src_dir = _norm_path(src_dir)
+ dest_dir = _norm_path(dest_dir)
+ files = _read_dir(repository_ctx, src_dir)
+ # Create a list with the src_dir stripped to use for outputs.
+ dest_files = files.replace(src_dir, '').splitlines()
+ src_files = files.splitlines()
+ command = []
+ outs = []
+ for i in range(len(dest_files)):
+ if dest_files[i] != "":
+ # If we have only one file to link we do not want to use the dest_dir, as
+ # $(@D) will include the full path to the file.
+ dest = '$(@D)/' + dest_dir + dest_files[i] if len(dest_files) != 1 else '$(@D)/' + dest_files[i]
+ # On Windows, symlink is not supported, so we just copy all the files.
+ cmd = 'cp -f' if _is_windows(repository_ctx) else 'ln -s'
+ command.append(cmd + ' "%s" "%s"' % (src_files[i] , dest))
+ outs.append(' "' + dest_dir + dest_files[i] + '",')
+ genrule = _genrule(src_dir, genrule_name, " && ".join(command),
+ "\n".join(outs))
+ return genrule
+
+
def _get_python_lib(repository_ctx, python_bin):
"""Gets the python lib path."""
print_lib = ("<<END\n" +
@@ -232,9 +240,13 @@ def _create_local_python_repository(repository_ctx):
# If local checks were requested, the python and numpy include will be auto
# detected on the host config (using _PYTHON_BIN_PATH).
if repository_ctx.attr.local_checks:
- python_bin = _get_env_var(repository_ctx, _PYTHON_BIN_PATH)
+ # TODO(nlopezgi): The default argument here is a workaround until
+ # bazelbuild/bazel#3057 is resolved.
+ python_bin = _get_env_var(repository_ctx, _PYTHON_BIN_PATH,
+ "/usr/bin/python")
_check_python_bin(repository_ctx, python_bin)
- python_lib = _get_env_var(repository_ctx, _PYTHON_LIB_PATH, _get_python_lib(repository_ctx, python_bin))
+ python_lib = _get_env_var(repository_ctx, _PYTHON_LIB_PATH,
+ _get_python_lib(repository_ctx, python_bin))
_check_python_lib(repository_ctx, python_lib)
python_include = _get_python_include(repository_ctx, python_bin)
numpy_include = _get_numpy_include(repository_ctx, python_bin) + '/numpy'
diff --git a/third_party/snappy.BUILD b/third_party/snappy.BUILD
index 37eebe291e..120028dc52 100644
--- a/third_party/snappy.BUILD
+++ b/third_party/snappy.BUILD
@@ -35,6 +35,7 @@ genrule(
"-e 's/@ac_cv_have_stdint_h@/1/g' " +
select({
"@%ws%//tensorflow:windows": "-e 's/@ac_cv_have_sys_uio_h@/0/g' ",
+ "@%ws%//tensorflow:windows_msvc": "-e 's/@ac_cv_have_sys_uio_h@/0/g' ",
"//conditions:default": "-e 's/@ac_cv_have_sys_uio_h@/1/g' ",
}) +
"-e 's/@SNAPPY_MAJOR@/1/g' " +
diff --git a/third_party/swig.BUILD b/third_party/swig.BUILD
index bea5d6b531..d698fa934b 100644
--- a/third_party/swig.BUILD
+++ b/third_party/swig.BUILD
@@ -70,7 +70,8 @@ cc_binary(
"Source/Swig/wrapfunc.c",
],
copts = ["$(STACK_FRAME_UNLIMITED)"] + select({
- ":x64_windows_msvc": [],
+ ":windows": [],
+ ":windows_msvc": [],
"//conditions:default": [
"-Wno-parentheses",
"-Wno-unused-variable",
@@ -331,6 +332,11 @@ genrule(
)
config_setting(
- name = "x64_windows_msvc",
+ name = "windows_msvc",
values = {"cpu": "x64_windows_msvc"},
)
+
+config_setting(
+ name = "windows",
+ values = {"cpu": "x64_windows"},
+)
diff --git a/third_party/sycl/crosstool/CROSSTOOL.tpl b/third_party/sycl/crosstool/CROSSTOOL.tpl
index 19b6f3ae32..2a96cdbf95 100755
--- a/third_party/sycl/crosstool/CROSSTOOL.tpl
+++ b/third_party/sycl/crosstool/CROSSTOOL.tpl
@@ -7,6 +7,11 @@ default_toolchain {
toolchain_identifier: "local_linux"
}
+default_toolchain {
+ cpu: "arm"
+ toolchain_identifier: "local_arm"
+}
+
toolchain {
abi_version: "local"
abi_libc_version: "local"
@@ -49,6 +54,7 @@ toolchain {
cxx_builtin_include_directory: "/usr/include"
cxx_builtin_include_directory: "%{computecpp_toolkit_path}"
+ cxx_builtin_include_directory: "%{python_lib_path}"
tool_path { name: "gcov" path: "/usr/bin/gcov" }
@@ -101,3 +107,96 @@ toolchain {
compiler_flag: "-DNDEBUG"
}
}
+
+toolchain {
+ abi_version: "local"
+ abi_libc_version: "local"
+ builtin_sysroot: ""
+ compiler: "compiler"
+ host_system_name: "local"
+ needsPic: true
+ supports_gold_linker: false
+ supports_incremental_linker: false
+ supports_fission: false
+ supports_interface_shared_objects: false
+ supports_normalizing_ar: false
+ supports_start_end_lib: false
+ supports_thin_archives: false
+ target_libc: "local"
+ target_cpu: "local"
+ target_system_name: "local"
+ toolchain_identifier: "local_arm"
+
+ tool_path { name: "ar" path: "/usr/bin/ar" }
+ tool_path { name: "compat-ld" path: "/usr/bin/ld" }
+ tool_path { name: "cpp" path: "/usr/bin/cpp" }
+ tool_path { name: "dwp" path: "/usr/bin/dwp" }
+ tool_path { name: "gcc" path: "computecpp" }
+ # Use "-std=c++11" for nvcc. For consistency, force both the host compiler
+ # and the device compiler to use "-std=c++11".
+ cxx_flag: "-std=c++11"
+ linker_flag: "-Wl,-no-as-needed"
+ linker_flag: "-lstdc++"
+ linker_flag: "-B/usr/bin/"
+
+ # TODO(bazel-team): In theory, the path here ought to exactly match the path
+ # used by gcc. That works because bazel currently doesn't track files at
+ # absolute locations and has no remote execution, yet. However, this will need
+ # to be fixed, maybe with auto-detection?
+ cxx_builtin_include_directory: "/usr/lib/gcc/"
+ cxx_builtin_include_directory: "/usr/lib"
+ cxx_builtin_include_directory: "/usr/lib64"
+ cxx_builtin_include_directory: "/usr/local/include"
+ cxx_builtin_include_directory: "/usr/include"
+
+ cxx_builtin_include_directory: "%{computecpp_toolkit_path}"
+ cxx_builtin_include_directory: "%{python_lib_path}"
+
+ tool_path { name: "gcov" path: "/usr/bin/gcov" }
+
+ # C(++) compiles invoke the compiler (as that is the one knowing where
+ # to find libraries), but we provide LD so other rules can invoke the linker.
+ tool_path { name: "ld" path: "/usr/bin/ld" }
+
+ tool_path { name: "nm" path: "/usr/bin/nm" }
+ tool_path { name: "objcopy" path: "/usr/bin/objcopy" }
+ objcopy_embed_flag: "-I"
+ objcopy_embed_flag: "binary"
+ tool_path { name: "objdump" path: "/usr/bin/objdump" }
+ tool_path { name: "strip" path: "/usr/bin/strip" }
+
+ # Make C++ compilation deterministic. Use linkstamping instead of these
+ # compiler symbols.
+ unfiltered_cxx_flag: "-Wno-builtin-macro-redefined"
+ unfiltered_cxx_flag: "-D__DATE__=\"redacted\""
+ unfiltered_cxx_flag: "-D__TIMESTAMP__=\"redacted\""
+ unfiltered_cxx_flag: "-D__TIME__=\"redacted\""
+
+ # All warnings are enabled. Maybe enable -Werror as well?
+ compiler_flag: "-Wall"
+
+ # Anticipated future default.
+ linker_flag: "-Wl,-no-as-needed"
+ # Stamp the binary with a unique identifier.
+ linker_flag: "-Wl,--build-id=md5"
+ linker_flag: "-Wl,--hash-style=gnu"
+
+ linking_mode_flags { mode: DYNAMIC }
+
+ compilation_mode_flags {
+ mode: FASTBUILD
+ compiler_flag: "-O0"
+ }
+
+ compilation_mode_flags {
+ mode: DBG
+ compiler_flag: "-g"
+ }
+
+ compilation_mode_flags {
+ mode: OPT
+ compiler_flag: "-g0"
+ compiler_flag: "-O2"
+ compiler_flag: "-DNDEBUG"
+ }
+}
diff --git a/third_party/sycl/crosstool/computecpp.tpl b/third_party/sycl/crosstool/computecpp.tpl
index 595e7136a6..94c5e6aaad 100755
--- a/third_party/sycl/crosstool/computecpp.tpl
+++ b/third_party/sycl/crosstool/computecpp.tpl
@@ -1,8 +1,9 @@
#!/usr/bin/env python
import os
-import subprocess
import sys
+import tempfile
+from subprocess import call, Popen, PIPE
CPU_CXX_COMPILER = ('%{host_cxx_compiler}')
CPU_C_COMPILER = ('%{host_c_compiler}')
@@ -13,76 +14,81 @@ COMPUTECPP_DRIVER= COMPUTECPP_ROOT + 'bin/compute++'
COMPUTECPP_INCLUDE = COMPUTECPP_ROOT + 'include'
def main():
- compiler_flags = []
-
- # remove -fsamotoze-coverage from string
- if CPU_CXX_COMPILER.find("g++") != -1:
- compiler_flags = [flag for flag in sys.argv[1:] if not flag.startswith(('-Wl,--no-undefined', '-fsanitize-coverage', '-Wno-unused-but-set-variable', '-Wignored-attributes'))]
- else:
- compiler_flags = [flag for flag in sys.argv[1:] if not flag.startswith(('-Wl,--no-undefined', '-Wno-unused-but-set-variable', '-Wignored-attributes'))]
+ remove_flags = ('-Wl,--no-undefined', '-Wno-unused-but-set-variable', '-Wignored-attributes')
+ # also remove -fsanitize-coverage flags when compiling with g++
+ if 'g++' in CPU_CXX_COMPILER:
+ remove_flags += ('-fsanitize-coverage',)
+ compiler_flags = [flag for flag in sys.argv[1:] if not flag.startswith(remove_flags)]
output_file_index = compiler_flags.index('-o') + 1
output_file_name = compiler_flags[output_file_index]
- if(output_file_index == 1):
+ if output_file_index == 1:
# we are linking
- return subprocess.call([CPU_CXX_COMPILER] + compiler_flags + ['-Wl,--no-undefined'])
+ return call([CPU_CXX_COMPILER] + compiler_flags + ['-Wl,--no-undefined'])
# find what we compile
- compiling_cpp = 0
- if('-c' in compiler_flags):
- compiled_file_index = compiler_flags.index('-c') + 1
- compited_file_name = compiler_flags[compiled_file_index]
- if(compited_file_name.endswith(('.cc', '.c++', '.cpp', '.CPP', '.C', '.cxx'))):
- compiling_cpp = 1;
-
- compiler_flags = compiler_flags + ['-D_GLIBCXX_USE_CXX11_ABI=0', '-DEIGEN_USE_SYCL=1', '-DTENSORFLOW_USE_SYCL', '-DEIGEN_HAS_C99_MATH']
-
- if(compiling_cpp == 1):
- # create a blacklist of folders that will be skipped when compiling with ComputeCpp
- _skip = ["external", "llvm", ".cu.cc"]
- # if compiling external project skip computecpp
- if any(_folder in _skip for _folder in output_file_name):
- return subprocess.call([CPU_CXX_COMPILER] + compiler_flags)
-
- if(compiling_cpp == 1):
- # this is an optimisation that will check if compiled file has to be compiled with ComputeCpp
-
- _tmp_flags = [flag for flag in compiler_flags if not flag.startswith(('-o', output_file_name))]
- # create preprocessed of the file
- _cmd = " ".join([CPU_CXX_COMPILER] + _tmp_flags + ["-E"])
- # check if it has parallel_for< in it
- _cmd += " | grep \".parallel_for\" > /dev/null"
- ps = subprocess.call(_cmd, shell=True)
- # if not call CXX compiler
- if(ps != 0):
- return subprocess.call([CPU_CXX_COMPILER] + compiler_flags)
-
- if(compiling_cpp == 1):
- filename, file_extension = os.path.splitext(output_file_name)
- bc_out = filename + '.sycl'
-
- # strip asan for the device
- computecpp_device_compiler_flags = ['-sycl-compress-name', '-DTENSORFLOW_USE_SYCL', '-Wno-unused-variable', '-I', COMPUTECPP_INCLUDE, '-isystem',
- COMPUTECPP_INCLUDE, '-std=c++11', '-sycl', '-emit-llvm', '-no-serial-memop', '-Xclang', '-cl-denorms-are-zero', '-Xclang', '-cl-fp32-correctly-rounded-divide-sqrt']
- computecpp_device_compiler_flags += [flag for flag in compiler_flags if not flag.startswith(('-fsanitize', '-march=native', '-mavx'))]
-
- x = subprocess.call([COMPUTECPP_DRIVER] + computecpp_device_compiler_flags )
- if(x == 0):
- # dont want that in case of compiling with computecpp first
- host_compiler_flags = [flag for flag in compiler_flags
- if not flag.startswith(('-MF', '-MD',))
- if not '.d' in flag
- ]
-
- host_compiler_flags[host_compiler_flags.index('-c')] = "--include"
-
- host_compiler_flags = ['-xc++', '-D_GLIBCXX_USE_CXX11_ABI=0', '-DTENSORFLOW_USE_SYCL', '-Wno-unused-variable', '-I', COMPUTECPP_INCLUDE, '-c', bc_out] + host_compiler_flags
- x = subprocess.call([CPU_CXX_COMPILER] + host_compiler_flags)
- return x
- else:
+ compiling_cpp = False
+ if '-c' in compiler_flags:
+ compiled_file_index = compiler_flags.index('-c') + 1
+ compiled_file_name = compiler_flags[compiled_file_index]
+ compiling_cpp = compiled_file_name.endswith(('.cc', '.c++', '.cpp', '.CPP', '.C', '.cxx'))
+
+ # add -D_GLIBCXX_USE_CXX11_ABI=0 to the command line if you have custom installation of GCC/Clang
+ compiler_flags = compiler_flags + ['-DEIGEN_USE_SYCL=1', '-DTENSORFLOW_USE_SYCL', '-DEIGEN_HAS_C99_MATH']
+
+ if not compiling_cpp:
# compile for C
- return subprocess.call([CPU_C_COMPILER] + compiler_flags)
+ return call([CPU_C_COMPILER] + compiler_flags)
+
+ # create a blacklist of folders that will be skipped when compiling with ComputeCpp
+ skip_extensions = [".cu.cc"]
+ skip_folders = ["tensorflow/compiler", "tensorflow/docs_src", "tensorflow/tensorboard", "third_party", "external", "hexagon"]
+ skip_folders = [(folder + '/') for folder in skip_folders]
+ # if compiling external project skip computecpp
+ if any(compiled_file_name.endswith(_ext) for _ext in skip_extensions) or any(_folder in output_file_name for _folder in skip_folders):
+ return call([CPU_CXX_COMPILER] + compiler_flags)
+
+ # this is an optimisation that will check if compiled file has to be compiled with ComputeCpp
+ flags_without_output = list(compiler_flags)
+ del flags_without_output[output_file_index] # remove output_file_name
+ del flags_without_output[output_file_index - 1] # remove '-o'
+ # create preprocessed of the file and store it for later use
+ pipe = Popen([CPU_CXX_COMPILER] + flags_without_output + ["-E"], stdout=PIPE)
+ preprocessed_file_str = pipe.communicate()[0]
+ if pipe.returncode != 0:
+ return pipe.returncode
+
+ # check if it has parallel_for in it
+ if not '.parallel_for' in preprocessed_file_str:
+ # call CXX compiler like usual
+ with tempfile.NamedTemporaryFile(suffix=".ii") as preprocessed_file: # Force '.ii' extension so that g++ does not preprocess the file again
+ preprocessed_file.write(preprocessed_file_str)
+ preprocessed_file.flush()
+ compiler_flags[compiled_file_index] = preprocessed_file.name
+ return call([CPU_CXX_COMPILER] + compiler_flags)
+ del preprocessed_file_str # save some memory as this string can be quite big
+
+ filename, file_extension = os.path.splitext(output_file_name)
+ bc_out = filename + '.sycl'
+
+ # strip asan for the device
+ computecpp_device_compiler_flags = ['-sycl-compress-name', '-Wno-unused-variable',
+ '-I', COMPUTECPP_INCLUDE, '-isystem', COMPUTECPP_INCLUDE,
+ '-std=c++11', '-sycl', '-emit-llvm', '-no-serial-memop',
+ '-Xclang', '-cl-denorms-are-zero', '-Xclang', '-cl-fp32-correctly-rounded-divide-sqrt']
+ # disable flags enabling SIMD instructions
+ computecpp_device_compiler_flags += [flag for flag in compiler_flags if \
+ not any(x in flag.lower() for x in ('-fsanitize', '=native', '=core2', 'msse', 'vectorize', 'mavx', 'mmmx', 'm3dnow', 'fma'))]
+
+ x = call([COMPUTECPP_DRIVER] + computecpp_device_compiler_flags)
+ if x == 0:
+ # dont want that in case of compiling with computecpp first
+ host_compiler_flags = [flag for flag in compiler_flags if (not flag.startswith(('-MF', '-MD',)) and not '.d' in flag)]
+ host_compiler_flags[host_compiler_flags.index('-c')] = "--include"
+ host_compiler_flags = ['-xc++', '-Wno-unused-variable', '-I', COMPUTECPP_INCLUDE, '-c', bc_out] + host_compiler_flags
+ x = call([CPU_CXX_COMPILER] + host_compiler_flags)
+ return x
if __name__ == '__main__':
sys.exit(main())
diff --git a/third_party/sycl/sycl/LICENSE.text b/third_party/sycl/sycl/LICENSE.text
index 0c2955c4d7..8d3f050b39 100644
--- a/third_party/sycl/sycl/LICENSE.text
+++ b/third_party/sycl/sycl/LICENSE.text
@@ -67,7 +67,7 @@ you; so please press the "CANCEL" button to cancel your download.
ComputeCpp within its marketing materials, without the
express prior written permission of Codeplay.
4. Support. Codeplay does not provide any guarantees of support for
- the Software to the user. Codeplay will use reasonable endeavours
+ the Software to the user. Codeplay will use reasonable endeavors
to respond to users' support requests, for the most recent
release only, via the community support website at https://
computecpp.codeplay.com.
@@ -78,7 +78,7 @@ you; so please press the "CANCEL" button to cancel your download.
copyrights, trade secrets and other proprietary rights in the
Software, including the rights to make and license the use of all
copies. To the extent that any patents owned by Codeplay or its
- licensors relate to any component of the Software, the licence
+ licensors relate to any component of the Software, the license
granted to the user in accordance with this Agreement allows for
the lawful use of such patents but only for the purposes of this
Agreement and not further or otherwise. Therefore, the user may
diff --git a/third_party/sycl/sycl_configure.bzl b/third_party/sycl/sycl_configure.bzl
index 6ad498487f..7af063178e 100644
--- a/third_party/sycl/sycl_configure.bzl
+++ b/third_party/sycl/sycl_configure.bzl
@@ -5,18 +5,20 @@
* HOST_CXX_COMPILER: The host C++ compiler
* HOST_C_COMPILER: The host C compiler
* COMPUTECPP_TOOLKIT_PATH: The path to the ComputeCpp toolkit.
+ * PYTHON_LIB_PATH: The path to the python lib
"""
_HOST_CXX_COMPILER = "HOST_CXX_COMPILER"
_HOST_C_COMPILER= "HOST_C_COMPILER"
_COMPUTECPP_TOOLKIT_PATH = "COMPUTECPP_TOOLKIT_PATH"
+_PYTHON_LIB_PATH = "PYTHON_LIB_PATH"
def _enable_sycl(repository_ctx):
if "TF_NEED_OPENCL" in repository_ctx.os.environ:
enable_sycl = repository_ctx.os.environ["TF_NEED_OPENCL"].strip()
return enable_sycl == "1"
return False
-
+
def auto_configure_fail(msg):
"""Output failure message when auto configuration fails."""
red = "\033[0;31m"
@@ -55,7 +57,14 @@ def find_computecpp_root(repository_ctx):
sycl_name = repository_ctx.os.environ[_COMPUTECPP_TOOLKIT_PATH].strip()
if sycl_name.startswith("/"):
return sycl_name
- fail( "Cannot find SYCL compiler, please correct your path")
+ fail("Cannot find SYCL compiler, please correct your path")
+
+def find_python_lib(repository_ctx):
+ """Returns python path."""
+ if _PYTHON_LIB_PATH in repository_ctx.os.environ:
+ return repository_ctx.os.environ[_PYTHON_LIB_PATH].strip()
+ fail("Environment variable PYTHON_LIB_PATH was not specified re-run ./configure")
+
def _check_lib(repository_ctx, toolkit_path, lib):
"""Checks if lib exists under sycl_toolkit_path or fail if it doesn't.
@@ -168,12 +177,13 @@ def _sycl_autoconf_imp(repository_ctx):
"%{host_c_compiler}" : find_c(repository_ctx),
})
- computecpp_root = find_computecpp_root(repository_ctx);
+ computecpp_root = find_computecpp_root(repository_ctx)
_check_dir(repository_ctx, computecpp_root)
_tpl(repository_ctx, "crosstool:CROSSTOOL",
{
"%{computecpp_toolkit_path}" : computecpp_root,
+ "%{python_lib_path}" : find_python_lib(repository_ctx),
})
# symlink libraries