author A. Unique TensorFlower <gardener@tensorflow.org> 2017-01-09 14:11:45 -0800
committer TensorFlower Gardener <gardener@tensorflow.org> 2017-01-09 14:24:47 -0800
commit 17fafb1f17843141fcd04b79c0e64996b508cb8b (patch)
tree 1f067a2e26497a731fa10d48aa5ec1e3eccce1e7 /tensorflow/g3doc
parent 8692d7de020fed91d19d42265c176954b83e83bb (diff)
XLA Documents for initial launch
Change: 144006194
Diffstat (limited to 'tensorflow/g3doc')
-rw-r--r-- tensorflow/g3doc/experimental/index.md                       13
-rw-r--r-- tensorflow/g3doc/experimental/leftnav_files                   8
-rw-r--r-- tensorflow/g3doc/experimental/xla/broadcasting.md           204
-rw-r--r-- tensorflow/g3doc/experimental/xla/developing_new_backend.md  77
-rw-r--r-- tensorflow/g3doc/experimental/xla/index.md                   95
-rw-r--r-- tensorflow/g3doc/experimental/xla/jit.md                    157
-rw-r--r-- tensorflow/g3doc/experimental/xla/operation_semantics.md   1468
-rw-r--r-- tensorflow/g3doc/experimental/xla/shapes.md                 150
-rw-r--r-- tensorflow/g3doc/experimental/xla/tfcompile.md              288
-rw-r--r-- tensorflow/g3doc/resources/xla_prerelease.md               2223
10 files changed, 2460 insertions, 2223 deletions
diff --git a/tensorflow/g3doc/experimental/index.md b/tensorflow/g3doc/experimental/index.md
new file mode 100644
index 0000000000..d941a336a8
--- /dev/null
+++ b/tensorflow/g3doc/experimental/index.md
@@ -0,0 +1,13 @@
+# Experimental
+
+Collected in this section is information about experimental features included in
+TensorFlow. These features are either not complete or not included as part of
+core TensorFlow.
+
+### XLA
+
+[XLA](./xla/) is a domain-specific compiler for linear algebra that optimizes
+TensorFlow computations. The results are improvements in speed, memory usage,
+and portability on server and mobile platforms.
+
+[Learn more about XLA](./xla/)
diff --git a/tensorflow/g3doc/experimental/leftnav_files b/tensorflow/g3doc/experimental/leftnav_files
new file mode 100644
index 0000000000..489f3a846a
--- /dev/null
+++ b/tensorflow/g3doc/experimental/leftnav_files
@@ -0,0 +1,8 @@
+### XLA
+xla/index.md
+xla/jit.md
+xla/tfcompile.md
+xla/developing_new_backend.md
+xla/operation_semantics.md
+xla/shapes.md
+xla/broadcasting.md \ No newline at end of file
diff --git a/tensorflow/g3doc/experimental/xla/broadcasting.md b/tensorflow/g3doc/experimental/xla/broadcasting.md
new file mode 100644
index 0000000000..8dbf0d0446
--- /dev/null
+++ b/tensorflow/g3doc/experimental/xla/broadcasting.md
@@ -0,0 +1,204 @@
+# Broadcasting semantics
+
+This document describes how the broadcasting semantics in XLA work.
+
+## What is broadcasting?
+
+Broadcasting is the process of making arrays with different shapes have
+compatible shapes for arithmetic operations. The terminology is borrowed from
+Numpy
+[(broadcasting)](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
+
+Broadcasting may be required for operations between multi-dimensional arrays of
+different ranks, or between multi-dimensional arrays with different but
+compatible shapes. Consider the addition `X+v` where `X` is a matrix (an array
+of rank 2) and `v` is a vector (an array of rank 1). To perform element-wise
+addition, XLA needs to "broadcast" the vector `v` to the same rank as the
+matrix `X`, by replicating `v` a certain number of times. The vector's length
+has to match at least one of the dimensions of the matrix.
+
+For example:
+
+ |1 2 3| + |7 8 9|
+ |4 5 6|
+
+The matrix's dimensions are (2,3), the vector's are (3). The vector is broadcast
+by replicating it over rows to get:
+
+ |1 2 3| + |7 8 9| = |8 10 12|
+ |4 5 6| |7 8 9| |11 13 15|
+
+In Numpy, this is called
+[broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
+
+## Principles
+
+XLA is a low-level infrastructure, and the XLA language is designed to be as
+strict and explicit as possible, avoiding implicit and "magical" features that
+may make some computations slightly easier to define at the cost of more
+assumptions baked into user code that will be difficult to change in the long
+term. If necessary, implicit and magical features can be added in client-level
+wrappers.
+
+With regard to broadcasting, XLA requires explicit broadcasting specifications
+on operations between arrays of different ranks. This is different from Numpy,
+which infers the specification when possible.
+
+## Broadcasting a lower-rank array onto a higher-rank array
+
+*Scalars* can always be broadcast over arrays without an explicit specification
+of broadcasting dimensions. An element-wise binary operation between a scalar
+and an array means applying the operation with the scalar for each element in
+the array. For example, adding a scalar to a matrix means producing a matrix
+each element of which is a sum of the scalar with the corresponding input
+matrix's element.
+
+ |1 2 3| + 7 = |8 9 10|
+ |4 5 6| |11 12 13|
+
+Most broadcasting needs can be captured by using a tuple of dimensions on a
+binary operation. When the inputs to the operation have different ranks, this
+broadcasting tuple specifies which dimension(s) in the **higher-rank** array to
+match with the **lower-rank** array.
+
+Consider the previous example, instead of adding a scalar to a (2,3) matrix, add
+a vector of dimension (3) to a matrix of dimensions (2,3). *Without specifying
+broadcasting, this operation is invalid.* To correctly request matrix-vector
+addition, specify the broadcasting dimension to be (1), meaning the vector's
+dimension is matched to dimension 1 of the matrix. In 2D, if dimension 0 is
+considered as rows and dimension 1 as columns, this means that each element of
+the vector becomes a column of a size matching the number of rows in the matrix:
+
+ |7 8 9| ==> |7 8 9|
+ |7 8 9|
+
+As a more complex example, consider adding a 3-element vector (dimension (3)) to
+a 3x3 matrix (dimensions (3,3)). There are two ways broadcasting can happen for
+this example:
+
+(1) A broadcasting dimension of 1 can be used. Each vector element becomes a
+column and the vector is duplicated for each row in the matrix.
+
+ |7 8 9| ==> |7 8 9|
+ |7 8 9|
+ |7 8 9|
+
+(2) A broadcasting dimension of 0 can be used. Each vector element becomes a row
+and the vector is duplicated for each column in the matrix.
+
+ |7| ==> |7 7 7|
+ |8| |8 8 8|
+ |9| |9 9 9|
+
+> Note: when adding a 2x3 matrix to a 3-element vector, a broadcasting dimension
+> of 0 is invalid.
+
+The broadcasting dimensions can be a tuple that describes how a smaller rank
+shape is broadcast into a larger rank shape. For example, given a 2x3x4 cuboid
+and a 3x4 matrix, a broadcasting tuple (1,2) means matching the matrix to
+dimensions 1 and 2 of the cuboid.
+
+This type of broadcast is used in the binary ops in `ComputationBuilder`, if the
+`broadcast_dimensions` argument is given. For example, see
+[ComputationBuilder::Add](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.cc).
+In the XLA source code, this type of broadcasting is sometimes called "InDim"
+broadcasting.
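+
+In NumPy terms (an analogy only, not the `ComputationBuilder` API), matching
+the lower-rank array to a given dimension is equivalent to reshaping it so that
+the matched dimensions keep their sizes and every other dimension has size 1:
+
+```python
+import numpy as np
+
+X = np.array([[1, 2, 3],
+              [4, 5, 6]])   # shape (2, 3)
+v = np.array([7, 8, 9])     # shape (3,)
+
+X + v.reshape(1, 3)  # broadcast dimension (1): v spans the columns
+
+M = np.zeros((3, 3))
+M + v.reshape(3, 1)  # broadcast dimension (0): v spans the rows
+```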
+
+### Formal definition
+
+The broadcasting attribute allows matching a lower-rank array to a higher-rank
+array, by specifying which dimensions of the higher-rank array to match. For
+example, for an array with dimensions MxNxPxQ, a vector with dimension T can be
+matched as follows:
+
+             MxNxPxQ
+
+    dim 3:         T
+    dim 2:       T
+    dim 1:     T
+    dim 0:   T
+
+In each case, T has to be equal to the matching dimension of the higher-rank
+array. The vector's values are then broadcast from the matched dimension to all
+the other dimensions.
+
+To match a TxV matrix onto the MxNxPxQ array, a pair of broadcasting dimensions
+is used:
+
+             MxNxPxQ
+    dim 2,3:     T V
+    dim 1,2:   T V
+    dim 0,3: T     V
+    etc...
+
+The order of dimensions in the broadcasting tuple has to be the order in which
+the lower-rank array's dimensions are expected to match the higher-rank array's
+dimensions. The first element in the tuple says which dimension in the
+higher-rank array has to match dimension 0 in the lower-rank array. The second
+element for dimension 1, and so on. The order of broadcast dimensions has to be
+strictly increasing. For example, in the previous example it is illegal to match
+V to N and T to P; it is also illegal to match V to both P and N.
+
+## Broadcasting similar-rank arrays with degenerate dimensions
+
+A related broadcasting problem is broadcasting two arrays that have the same
+rank but different dimension sizes. Similarly to Numpy's rules, this is only
+possible when the arrays are *compatible*. Two arrays are compatible when all
+their dimensions are compatible. Two dimensions are compatible if:
+
+* They are equal, or
+* One of them is 1 (a "degenerate" dimension)
+
+When two compatible arrays are encountered, the result shape has the maximum
+among the two inputs at every dimension index.
+
+Examples:
+
+1. (2,1) and (2,3) broadcast to (2,3).
+2. (1,2,5) and (7,2,5) broadcast to (7,2,5)
+3. (7,2,5) and (7,1,5) broadcast to (7,2,5)
+4. (7,2,5) and (7,2,6) are incompatible and cannot be broadcast.
+
+A special case arises, and is also supported, where each of the input arrays has
+a degenerate dimension at a different index. In this case, the result is an
+"outer operation": (2,1) and (1,3) broadcast to (2,3). For more examples,
+consult the [Numpy documentation on
+broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
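+
+Since these rules deliberately match Numpy's, they can be checked directly in
+NumPy (an analogy only):
+
+```python
+import numpy as np
+
+(np.ones((2, 1)) + np.ones((2, 3))).shape        # (2, 3)
+(np.ones((1, 2, 5)) + np.ones((7, 2, 5))).shape  # (7, 2, 5)
+(np.ones((2, 1)) + np.ones((1, 3))).shape        # "outer operation": (2, 3)
+```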
+
+## Broadcast composition
+
+Broadcasting of a lower-rank array to a higher-rank array **and** broadcasting
+using degenerate dimensions can both be performed in the same binary operation.
+For example, a vector of size 4 and a matrix of size 1x2 can be added together
+using a broadcast dimensions value of (0):
+
+ |1 2 3 4| + [5 6] // [5 6] is a 1x2 matrix, not a vector.
+
+First the vector is broadcast up to rank 2 (matrix) using the broadcast
+dimensions. The single value (0) in the broadcast dimensions indicates that
+dimension zero of the vector matches dimension zero of the matrix. This
+produces a matrix of size 4xM where the value M is chosen to match the
+corresponding dimension size in the 1x2 array. Therefore, a 4x2 matrix is
+produced:
+
+ |1 1| + [5 6]
+ |2 2|
+ |3 3|
+ |4 4|
+
+Then "degenerate dimension broadcasting" broadcasts dimension zero of the 1x2
+matrix to match the corresponding dimension size of the right hand side:
+
+ |1 1| + |5 6| |6 7|
+ |2 2| + |5 6| = |7 8|
+ |3 3| + |5 6| |8 9|
+ |4 4| + |5 6| |9 10|
+
+A more complicated example is a matrix of size 1x2 added to an array of size
+4x3x1 using broadcast dimensions of (1, 2). First the 1x2 matrix is broadcast up
+to rank 3 using the broadcast dimensions, producing an intermediate Mx1x2 array
+where the dimension size M is determined by the size of the larger operand (the
+4x3x1 array), here yielding a 4x1x2 intermediate array. The M is at dimension 0
+(left-most dimension) because dimensions 1 and 2 are mapped to the dimensions of
+the original 1x2 matrix, as the broadcast dimensions are (1, 2). This
+intermediate array can be added to the 4x3x1 matrix using broadcasting of
+degenerate dimensions to produce a 4x3x2 array result.
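+
+The whole composition can be reproduced in NumPy (an analogy only), writing the
+rank-broadcast step as an explicit reshape:
+
+```python
+import numpy as np
+
+v = np.array([1, 2, 3, 4])  # shape (4,)
+m = np.array([[5, 6]])      # shape (1, 2), not a vector
+
+# Broadcast dimension (0): v's single dimension maps to dimension 0, so v
+# becomes a (4, 1) column; degenerate-dimension broadcasting then stretches
+# (4, 1) + (1, 2) into (4, 2).
+v.reshape(4, 1) + m  # [[6, 7], [7, 8], [8, 9], [9, 10]]
+```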
diff --git a/tensorflow/g3doc/experimental/xla/developing_new_backend.md b/tensorflow/g3doc/experimental/xla/developing_new_backend.md
new file mode 100644
index 0000000000..28010ff1b7
--- /dev/null
+++ b/tensorflow/g3doc/experimental/xla/developing_new_backend.md
@@ -0,0 +1,77 @@
+# Developing a new backend for XLA
+
+This preliminary guide is for early adopters who want to retarget TensorFlow
+to their hardware in an efficient manner. The guide is not
+step-by-step and assumes knowledge of [LLVM](http://llvm.org),
+[Bazel](https://bazel.build/), and TensorFlow.
+
+XLA provides an abstract interface that a new architecture or accelerator can
+implement to create a backend to run TensorFlow graphs. Retargeting XLA should
+be significantly simpler and more scalable than implementing every existing
+TensorFlow Op for new hardware.
+
+Most implementations will fall into one of the following scenarios:
+
+1. Existing CPU architecture not yet officially supported by XLA, with or
+ without an existing [LLVM](http://llvm.org) backend.
+2. Non-CPU-like hardware with an existing LLVM backend.
+3. Non-CPU-like hardware without an existing LLVM backend.
+
+> Note: An LLVM backend can mean either one of the officially released LLVM
+> backends or a custom LLVM backend developed in-house.
+
+## Scenario 1: Existing CPU architecture not yet officially supported by XLA
+
+In this scenario, start by looking at the existing
+[XLA CPU backend](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/cpu/).
+XLA makes it easy to retarget TensorFlow to different CPUs by using LLVM, since
+the main difference between XLA backends for CPUs is the code generated by LLVM.
+Google tests XLA for x64 and ARM64 architectures.
+
+If the hardware vendor has an LLVM backend for their hardware, it is simple to
+link the backend with the LLVM built with XLA. In JIT mode, the XLA CPU backend
+emits code for the host CPU. For ahead-of-time compilation,
+[`xla::AotCompilationOptions`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/compiler.h)
+can provide an LLVM triple to configure the target architecture.
+
+If there is no existing LLVM backend but another kind of code generator exists,
+it should be possible to reuse most of the existing CPU backend.
+
+## Scenario 2: Non-CPU-like hardware with an existing LLVM backend
+
+It is possible to model a new
+[`xla::Compiler`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/compiler.h)
+implementation on the existing
+[`xla::CPUCompiler`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc)
+and
+[`xla::GPUCompiler`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc)
+classes, since these already emit LLVM IR. Depending on the nature of the
+hardware, it is possible that many of the LLVM IR generation aspects will have
+to be changed, but a lot of code can be shared with the existing backends.
+
+A good example to follow is the
+[GPU backend](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/gpu/)
+of XLA. The GPU backend targets a non-CPU-like ISA, and therefore some aspects
+of its code generation are unique to the GPU domain. Other kinds of hardware,
+e.g. DSPs like Hexagon (which has an upstream LLVM backend), can reuse parts of
+the LLVM IR emission logic, but other parts will be unique.
+
+## Scenario 3: Non-CPU-like hardware without an existing LLVM backend
+
+If it is not possible to utilize LLVM, then the best option is to implement a
+new backend for XLA for the desired hardware. This option requires the most
+effort. The classes that need to be implemented are as follows:
+
+* [`StreamExecutor`](https://www.tensorflow.org/code/tensorflow/stream_executor/stream_executor.h):
+  For many devices, not all methods of `StreamExecutor` are needed. See
+  existing `StreamExecutor` implementations for details.
+* [`xla::Compiler`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/compiler.h):
+  This class encapsulates the compilation of an HLO computation into an
+  `xla::Executable`.
+* [`xla::Executable`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/executable.h):
+ This class is used to launch a compiled computation on the platform.
+* [`xla::TransferManager`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/transfer_manager.h):
+ This class enables backends to provide platform-specific mechanisms for
+ constructing XLA literal data from given device memory handles. In other
+ words, it helps encapsulate the transfer of data from the host to the device
+ and back.
diff --git a/tensorflow/g3doc/experimental/xla/index.md b/tensorflow/g3doc/experimental/xla/index.md
new file mode 100644
index 0000000000..702e03ea8f
--- /dev/null
+++ b/tensorflow/g3doc/experimental/xla/index.md
@@ -0,0 +1,95 @@
+# XLA Overview
+
+> Note: XLA is experimental and considered alpha. Most use cases will not
+> see improvements in performance (speed or decreased memory usage). We have
+> released XLA early so the Open Source Community can contribute to its
+> development, as well as create a path for integration with hardware
+> accelerators.
+
+XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear
+algebra that optimizes TensorFlow computations. The results are improvements in
+speed, memory usage, and portability on server and mobile platforms. Initially,
+most users will not see large benefits from XLA, but are welcome to experiment
+by using XLA via [just-in-time (JIT) compilation](jit.md) or [ahead-of-time (AOT)
+compilation](tfcompile.md). Developers targeting new hardware accelerators are
+especially encouraged to try out XLA.
+
+The XLA framework is experimental and in active development. In particular,
+while it is unlikely that the semantics of existing operations will change, it
+is expected that more operations will be added to cover important use cases. The
+team welcomes feedback from the community about missing functionality and
+community contributions via GitHub.
+
+## Why did we build XLA?
+
+We had several objectives for XLA to work with TensorFlow:
+
+* *Improve execution speed.* Compile subgraphs to reduce the execution time of
+  short-lived Ops, eliminating overhead from the TensorFlow runtime; fuse
+  pipelined operations to reduce memory overhead; and specialize to known
+  tensor shapes to allow for more aggressive constant propagation.
+
+* *Improve memory usage.* Analyze and schedule memory usage, in principle
+ eliminating many intermediate storage buffers.
+
+* *Reduce reliance on custom Ops.* Remove the need for many custom Ops by
+ improving the performance of automatically fused low-level Ops to match the
+ performance of custom Ops that were fused by hand.
+
+* *Reduce mobile footprint.* Eliminate the TensorFlow runtime by ahead-of-time
+ compiling the subgraph and emitting an object/header file pair that can be
+ linked directly into another application. The results can reduce the
+ footprint for mobile inference by several orders of magnitude.
+
+* *Improve portability.* Make it relatively easy to write a new backend for
+ novel hardware, at which point a large fraction of TensorFlow programs will
+ run unmodified on that hardware. This is in contrast with the approach of
+ specializing individual monolithic Ops for new hardware, which requires
+ TensorFlow programs to be rewritten to make use of those Ops.
+
+## How does XLA work?
+
+The input language to XLA is called "HLO IR", or just HLO (High Level
+Optimizer). The semantics of HLO are described on the
+[Operation Semantics](operation_semantics.md) page. It
+is most convenient to think of HLO as a [compiler
+IR](https://en.wikipedia.org/wiki/Intermediate_representation).
+
+XLA takes graphs ("computations") defined in HLO and compiles them into machine
+instructions for various architectures. XLA is modular in the sense that it is
+easy to slot in an alternative backend to [target some novel HW
+architecture](developing_new_backend.md). The CPU backend for x64 and ARM64 as
+well as the NVIDIA GPU backend are in the TensorFlow source tree.
+
+The following diagram shows the compilation process in XLA:
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img src="../../images/how-does-xla-work.png">
+</div>
+
+XLA comes with several optimizations and analyses that are target-independent,
+such as [CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination),
+target-independent operation fusion, and buffer analysis for allocating runtime
+memory for the computation.
+
+After the target-independent step, XLA sends the HLO computation to a backend.
+The backend can perform further HLO-level analyses and optimizations, this time
+with target-specific information and needs in mind. For example, the XLA GPU
+backend may perform operation fusion beneficial specifically for the GPU
+programming model and determine how to partition the computation into streams.
+At this stage, backends may also pattern-match certain operations or
+combinations thereof to optimized library calls.
+
+The next step is target-specific code generation. The CPU and GPU backends
+included with XLA use [LLVM](http://llvm.org) for low-level IR, optimization,
+and code-generation. These backends emit the LLVM IR necessary to represent the
+XLA HLO computation in an efficient manner, and then invoke LLVM to emit native
+code from this LLVM IR.
+
+The GPU backend currently supports NVIDIA GPUs via the LLVM NVPTX backend; the
+CPU backend supports multiple CPU ISAs.
+
+## Supported Platforms
+
+XLA currently supports [JIT compilation](jit.md) on x86-64 and NVIDIA GPUs, and
+[AOT compilation](tfcompile.md) for x86-64 and ARM.
diff --git a/tensorflow/g3doc/experimental/xla/jit.md b/tensorflow/g3doc/experimental/xla/jit.md
new file mode 100644
index 0000000000..807dc57d40
--- /dev/null
+++ b/tensorflow/g3doc/experimental/xla/jit.md
@@ -0,0 +1,157 @@
+# Using JIT Compilation
+
+> Note: TensorFlow must be compiled from source to include XLA.
+
+## Why use just-in-time (JIT) compilation?
+
+The TensorFlow/XLA JIT compiler compiles and runs parts of TensorFlow graphs via
+XLA. The benefit of this over the standard TensorFlow implementation is that XLA
+can fuse multiple operators (kernel fusion) into a small number of compiled
+kernels. Fusing operators can reduce memory bandwidth requirements and improve
+performance compared to executing operators one-at-a-time, as the TensorFlow
+executor does.
+
+## Running TensorFlow graphs via XLA
+
+There are two ways to run TensorFlow computations via XLA, either by
+JIT-compiling operators placed on a CPU or GPU device, or by placing operators
+on the `XLA_CPU` or `XLA_GPU` TensorFlow devices. Placing operators directly on
+a TensorFlow XLA device forces the operator to run on that device and is mainly
+used for testing.
+
+> Note: The XLA CPU backend produces fast single-threaded code (in most cases),
+> but does not yet parallelize as well as the TensorFlow CPU backend. The XLA
+> GPU backend is competitive with the standard TensorFlow implementation,
+> sometimes faster, sometimes slower.
+
+### Turning on JIT compilation
+
+JIT compilation can be turned on at the session level or manually for select
+operations. Both of these approaches are zero-copy: data does not need to be
+copied when passing data between a compiled XLA kernel and a TensorFlow
+operator placed on the same device.
+
+#### Session
+
+Turning on JIT compilation at the session level will result in all possible
+operators being greedily compiled into XLA computations. Each XLA computation
+will be compiled into one or more kernels for the underlying device. (This does
+not mean everything will be fused into a single CUDA kernel, for example.)
+
+Subject to a few constraints, if there are two adjacent operators in the graph
+that both have XLA implementations, then they will be compiled into a single XLA
+computation.
+
+JIT compilation is turned on at the session level by setting the
+`global_jit_level` config to `tf.OptimizerOptions.ON_1` and passing the config
+during session initialization.
+
+```python
+# Config to turn on JIT compilation
+config = tf.ConfigProto()
+config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
+
+sess = tf.Session(config=config)
+```
+
+#### Manual
+
+JIT compilation can also be turned on manually for one or more operators. This
+is done by tagging the operators to compile with the attribute
+`_XlaCompile=true`. The simplest way to do this is via the
+`tf.contrib.compiler.jit.experimental_jit_scope()` scope defined in
+[`tensorflow/contrib/compiler/jit.py`](https://www.tensorflow.org/code/tensorflow/contrib/compiler/jit.py).
+Example usage:
+
+```python
+ jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
+
+ x = tf.placeholder(np.float32)
+ with jit_scope():
+ y = tf.add(x, x) # The "add" will be compiled with XLA.
+```
+
+The `_XlaCompile` attribute is currently supported on a best-effort basis. If an
+operator cannot be compiled, TensorFlow will silently fall back to the normal
+implementation.
+
+### Placing operators on XLA devices
+
+Another way to run computations via XLA is to place an operator on a specific
+XLA device. This method is normally only used for testing. Valid targets are
+`XLA_CPU` or `XLA_GPU`.
+
+```python
+with tf.device("/job:localhost/replica:0/task:0/device:XLA_GPU:0"):
+ output = tf.add(input1, input2)
+```
+
+Unlike JIT compilation on the standard CPU and GPU devices, these devices make a
+copy of data when it is transferred on and off the device. The extra copy makes
+it expensive to mix XLA and TensorFlow operators in the same graph.
+
+## Tutorial
+
+This tutorial covers training a simple version of MNIST softmax with JIT turned
+on. While the tutorial uses CPU only, the steps are the same and the artifacts
+are similar when running on GPU.
+
+### Step #1: Prepare sample script
+
+Download or move
+[mnist_softmax_xla.py](https://www.tensorflow.org/code/tensorflow/examples/tutorials/mnist/mnist_softmax_xla.py)
+into a folder outside of the TensorFlow source tree.
+
+### Step #2: Run without XLA
+
+Execute the python script to train the model without XLA.
+
+```shell
+python mnist_softmax_xla.py --xla=false
+```
+
+Using the Chrome Trace Event Profiler (browse to chrome://tracing),
+open the timeline file created when the script finishes: `timeline.ctf.json`. \
+The rendered timeline should look similar to the
+picture below with multiple green boxes labeled `MatMul`, possibly across
+multiple CPUs.
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:100%" src="../../images/jit_timeline_cpu.png">
+</div>
+
+### Step #3: Run with XLA
+
+Execute the python script to train the model with XLA and turn on a debugging
+feature of XLA via an environment variable that outputs the XLA graph.
+
+```shell
+TF_XLA_FLAGS=--xla_generate_hlo_graph=.* python mnist_softmax_xla.py
+```
+
+Open the timeline file created (`timeline.ctf.json`). The rendered timeline
+should look similar to the picture below with one long bar labeled `_XlaLaunch`.
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:100%" src="../../images/jit_timeline_cpu_xla.png">
+</div>
+
+To understand what is happening in `_XlaLaunch`, look at the console output for
+statements similar to the following:
+
+```shell
+computation cluster_0[_XlaCompiledKernel=true,_XlaNumConstantArgs=1].v82 [CPU:
+pipeline start, before inline]: /tmp/hlo_graph_0.dot
+
+```
+
+The debug statements point to the location of `hlo_graph_xx.dot` files that
+contain information about the graph created by XLA at runtime. To render the
+.dot file into a png, install [GraphViz](http://www.graphviz.org/Download..php)
+and run:
+
+```shell
+dot -Tpng hlo_graph_0.dot -o hlo_graph_0.png
+```
+
+The result will look like the following:
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:100%" src="../../images/jit_cpu_xla_graph.png">
+</div>
diff --git a/tensorflow/g3doc/experimental/xla/operation_semantics.md b/tensorflow/g3doc/experimental/xla/operation_semantics.md
new file mode 100644
index 0000000000..46ed6bd7c1
--- /dev/null
+++ b/tensorflow/g3doc/experimental/xla/operation_semantics.md
@@ -0,0 +1,1468 @@
+# Operation Semantics
+
+The following describes the semantics of operations defined in the
+[`ComputationBuilder`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h)
+interface. Typically, these operations map one-to-one to operations defined in
+the RPC interface in
+[`xla_data.proto`](https://www.tensorflow.org/code/tensorflow/compiler/xla/xla_data.proto).
+
+A note on nomenclature: the generalized data type XLA deals with is an
+N-dimensional array holding elements of some uniform type (such as 32-bit
+float). Throughout the documentation, *array* is used to denote an
+arbitrary-dimensional array. For convenience, special cases have more specific
+and familiar names; for example a *vector* is a 1-dimensional array and a
+*matrix* is a 2-dimensional array.
+
+## Broadcast
+
+See also
+[`ComputationBuilder::Broadcast`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Adds dimensions to an array by duplicating the data in the array.
+
+<b> `Broadcast(operand, broadcast_sizes)` </b>
+
+Arguments | Type | Semantics
+----------------- | ----------------------- | -------------------------------
+`operand` | `ComputationDataHandle` | The array to duplicate
+`broadcast_sizes` | `ArraySlice<int64>` | The sizes of the new dimensions
+
+The new dimensions are inserted on the left, i.e. if `broadcast_sizes` has
+values `{a0, ..., aN}` and the operand shape has dimensions `{b0, ..., bM}` then
+the shape of the output has dimensions `{a0, ..., aN, b0, ..., bM}`.
+
+The new dimensions index into copies of the operand, i.e.
+
+```
+output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM]
+```
+
+For example, if `operand` is a scalar `f32` with value `2.0f`, and
+`broadcast_sizes` is `{2, 3}`, then the result will be an array with shape
+`f32[2, 3]` and all the values in the result will be `2.0f`.
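+
+A NumPy analogue of this example (an analogy only; `Broadcast` itself is an XLA
+operation):
+
+```python
+import numpy as np
+
+# New dimensions are inserted on the left and index into copies of the operand.
+np.broadcast_to(np.float32(2.0), (2, 3))  # shape (2, 3), every element 2.0
+```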
+
+## Call
+
+See also
+[`ComputationBuilder::Call`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Invokes a computation with the given arguments.
+
+<b> `Call(computation, args...)` </b>
+
+| Arguments | Type | Semantics |
+| ------------- | ------------------------ | -------------------------------- |
+| `computation` | `Computation` | computation of type `T_0, T_1, |
+: : : ..., T_N -> S` with N parameters :
+: : : of arbitrary type :
+| `args` | sequence of N | N arguments of arbitrary type |
+: : `ComputationDataHandle`s : :
+
+The arity and types of the `args` must match the parameters of the
+`computation`. It is allowed to have no `args`.
+
+## Collapse
+
+See also
+[`ComputationBuilder::Collapse`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h)
+and the [`Reshape`](#reshape) operation.
+
+Collapses dimensions of an array into one dimension.
+
+<b> `Collapse(operand, dimensions)` </b>
+
+| Arguments | Type | Semantics |
+| ------------ | ----------------------- | ----------------------------------- |
+| `operand` | `ComputationDataHandle` | array of type T |
+| `dimensions` | `int64` vector | in-order, consecutive subset of T's |
+: : : dimensions. :
+
+Collapse replaces the given subset of the operand's dimensions by a single
+dimension. The input arguments are an arbitrary array of type T and a
+compile-time-constant vector of dimension indices. The dimension indices must be
+an in-order (low to high dimension numbers), consecutive subset of T's
+dimensions. Thus, {0, 1, 2}, {0, 1}, or {1, 2} are all valid dimension sets, but
+{1, 0} or {0, 2} are not. They are replaced by a single new dimension, in the
+same position in the dimension sequence as those they replace, with the new
+dimension size equal to the product of original dimension sizes. The lowest
+dimension number in `dimensions` is the slowest varying dimension (most major)
+in the loop nest which collapses these dimensions, and the highest dimension
+number is the fastest varying (most minor). See the [`Reshape`](#reshape) operator
+if more general collapse ordering is needed.
+
+For example, let v be an array of 24 elements:
+
+```
+let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}},
+ {{20, 21, 22}, {25, 26, 27}},
+ {{30, 31, 32}, {35, 36, 37}},
+ {{40, 41, 42}, {45, 46, 47}}};
+
+// Collapse to a single dimension, leaving one dimension.
+let v012 = Collapse(v, {0,1,2});
+then v012 == f32[24] {10, 11, 12, 15, 16, 17,
+ 20, 21, 22, 25, 26, 27,
+ 30, 31, 32, 35, 36, 37,
+ 40, 41, 42, 45, 46, 47};
+
+// Collapse the two lower dimensions, leaving two dimensions.
+let v01 = Collapse(v, {0,1});
+then v01 == f32[8x3] {{10, 11, 12},
+                      {15, 16, 17},
+                      {20, 21, 22},
+                      {25, 26, 27},
+                      {30, 31, 32},
+                      {35, 36, 37},
+                      {40, 41, 42},
+                      {45, 46, 47}};
+
+// Collapse the two higher dimensions, leaving two dimensions.
+let v12 = Collapse(v, {1,2});
+then v12 == f32[4x6] {{10, 11, 12, 15, 16, 17},
+                      {20, 21, 22, 25, 26, 27},
+                      {30, 31, 32, 35, 36, 37},
+                      {40, 41, 42, 45, 46, 47}};
+
+```
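+
+Because `dimensions` must be an in-order, consecutive subset, `Collapse` on a
+row-major layout behaves like a NumPy `reshape` whose new dimension is the
+product of the collapsed sizes (an analogy only):
+
+```python
+import numpy as np
+
+v = np.array([[[10, 11, 12], [15, 16, 17]],
+              [[20, 21, 22], [25, 26, 27]],
+              [[30, 31, 32], [35, 36, 37]],
+              [[40, 41, 42], [45, 46, 47]]], dtype=np.float32)
+
+v.reshape(24)    # like Collapse(v, {0,1,2})
+v.reshape(8, 3)  # like Collapse(v, {0,1})
+v.reshape(4, 6)  # like Collapse(v, {1,2})
+```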
+
+## Concatenate
+
+See also
+[`ComputationBuilder::ConcatInDim`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Concatenate composes an array from multiple array operands. The array is of the
+same rank as each of the input array operands (which must be of the same rank as
+each other) and contains the arguments in the order that they were specified.
+
+<b> `Concatenate(operands..., dimension)` </b>
+
+| Arguments | Type | Semantics |
+| ----------- | ----------------------- | ------------------------------------ |
+| `operands` | sequence of N | N arrays of type T with dimensions |
+: : `ComputationDataHandle` : [L0, L1, ...]. Requires N >= 1. :
+| `dimension` | `int64`                 | A value in the interval `[0, rank)`  |
+: : : that names the dimension to be :
+: : : concatenated between the `operands`. :
+
+With the exception of `dimension`, all dimensions must be the same. This is
+because XLA does not support "ragged" arrays. Also note that rank-0 values
+cannot be concatenated (as it's impossible to name the dimension along which the
+concatenation occurs).
+
+1-dimensional example:
+
+```
+Concat({{2, 3}, {4, 5}, {6, 7}}, 0)
+>>> {2, 3, 4, 5, 6, 7}
+```
+
+2-dimensional example:
+
+```
+let a = {
+ {1, 2},
+ {3, 4},
+ {5, 6},
+};
+let b = {
+ {7, 8},
+};
+Concat({a, b}, 0)
+>>> {
+ {1, 2},
+ {3, 4},
+ {5, 6},
+ {7, 8},
+}
+```
+
+Diagram:
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:100%" src="../../images/ops_concatenate.png">
+</div>
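+
+The 2-dimensional example above maps directly onto NumPy's `np.concatenate` (an
+analogy only):
+
+```python
+import numpy as np
+
+a = np.array([[1, 2], [3, 4], [5, 6]])
+b = np.array([[7, 8]])
+np.concatenate([a, b], axis=0)  # shape (4, 2), like Concat({a, b}, 0)
+```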
+
+## ConvertElementType
+
+See also
+[`ComputationBuilder::ConvertElementType`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Similar to an element-wise `static_cast` in C++, performs an element-wise
+conversion operation from a data shape to a target shape. The dimensions must
+match, and the conversion is an element-wise one; e.g. `s32` elements become
+`f32` elements via an `s32`-to-`f32` conversion routine.
+
+<b> `ConvertElementType(operand, new_element_type)` </b>
+
+Arguments | Type | Semantics
+------------------ | ----------------------- | ---------------------------
+`operand` | `ComputationDataHandle` | array of type T with dims D
+`new_element_type` | `PrimitiveType` | type U
+
+If the dimensions of the operand and the target shape do not match, or an
+invalid conversion is requested (e.g. to/from a tuple), an error will be
+produced.
+
+A conversion such as `T=s32` to `U=f32` will perform a normalizing int-to-float
+conversion routine such as round-to-nearest-even.
+
+> Note: The precise float-to-int and vice-versa conversions are currently
+> unspecified, but may become additional arguments to the convert operation in
+> the future. Not all possible conversions have been implemented for all
+> targets.
+
+```
+let a: s32[3] = {0, 1, 2};
+let b: f32[3] = convert(a, f32);
+then b == f32[3]{0.0, 1.0, 2.0}
+```
+
+## Conv (convolution)
+
+See also
+[`ComputationBuilder::Conv`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+As ConvWithGeneralPadding, but the padding is specified in a short-hand way as
+either SAME or VALID. SAME padding pads the input (`lhs`) with zeroes so that
+the output has the same shape as the input when not taking striding into
+account. VALID padding simply means no padding.
+
+## ConvWithGeneralPadding (convolution)
+
+See also
+[`ComputationBuilder::ConvWithGeneralPadding`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Computes a convolution of the kind used in neural networks. Here, a convolution
+can be thought of as an n-dimensional window moving across an n-dimensional
+base area, where a computation is performed for each possible position of the
+window.
+
+| Arguments | Type | Semantics |
+| ---------------- | ----------------------- | ----------------------------- |
+| `lhs` | `ComputationDataHandle` | rank n+2 array of inputs |
+| `rhs` | `ComputationDataHandle` | rank n+2 array of kernel |
+: : : weights :
+| `window_strides` | `ArraySlice<int64>` | n-d array of kernel strides |
+| `padding` | `ArraySlice<pair<int64, | n-d array of (low, high) |
+: : int64>>` : padding :
+| `lhs_dilation` | `ArraySlice<int64>` | n-d lhs dilation factor array |
+| `rhs_dilation` | `ArraySlice<int64>` | n-d rhs dilation factor array |
+
+Let n be the number of spatial dimensions. The `lhs` argument is a rank n+2
+array describing the base area. This is called the input, even though of course
+the rhs is also an input. In a neural network, these are the input activations.
+The n+2 dimensions are, in this order:
+
+* `batch`: Each coordinate in this dimension represents an independent input
+ for which convolution is carried out.
+* `z/depth/features`: Each (y,x) position in the base area has a vector
+ associated to it, which goes into this dimension.
+* `spatial_dims`: Describes the `n` spatial dimensions that define the base
+ area that the window moves across.
+
+The `rhs` argument is a rank n+2 array describing the convolutional
+filter/kernel/window. The dimensions are, in this order:
+
+* `output-z`: The `z` dimension of the output.
+* `input-z`: The size of this dimension should equal the size of the `z`
+ dimension in lhs.
+* `spatial_dims`: Describes the `n` spatial dimensions that define the n-d
+ window that moves across the base area.
+
+The `window_strides` argument specifies the stride of the convolutional window
+in the spatial dimensions. For example, if the stride in the first spatial
+dimension is 3, then the window can only be placed at coordinates where the
+first spatial index is divisible by 3.
+
+The `padding` argument specifies the amount of zero padding to be applied to the
+base area. The amount of padding can be negative -- the absolute value of
+negative padding indicates the number of elements to remove from the specified
+dimension before doing the convolution. `padding[0]` specifies the padding for
+dimension `y` and `padding[1]` specifies the padding for dimension `x`. Each
+pair has the low padding as the first element and the high padding as the second
+element. The low padding is applied in the direction of lower indices while the
+high padding is applied in the direction of higher indices. For example, if
+`padding[1]` is `(2,3)` then there will be a padding by 2 zeroes on the left and
+by 3 zeroes on the right in the second spatial dimension. Using padding is
+equivalent to inserting those same zero values into the input (`lhs`) before
+doing the convolution.
+
+The `lhs_dilation` and `rhs_dilation` arguments specify the dilation factor to
+be applied to the lhs and rhs, respectively, in each spatial dimension. If the
+dilation factor in a spatial dimension is d, then d-1 holes are implicitly
+placed between each of the entries in that dimension, increasing the size of the
+array. The holes are filled with a no-op value, which for convolution means
+zeroes.
+
+Dilation of the rhs is also called atrous convolution. For more details, see
+the
+[TensorFlow documentation on atrous convolution](https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#atrous_conv2d).
+Dilation of the lhs is also called deconvolution.
+
+The output shape has these dimensions, in this order:
+
+* `batch`: Same size as `batch` on the input (`lhs`).
+* `z`: Same size as `output-z` on the kernel (`rhs`).
+* `spatial_dims`: One value for each valid placement of the convolutional
+ window.
+
+The valid placements of the convolutional window are determined by the strides
+and the size of the base area after padding.
+
+To describe what a convolution does, consider a 2d convolution, and pick some
+fixed `batch`, `z`, `y`, `x` coordinates in the output. Then `(y,x)` is a
+position of a corner of the window within the base area (e.g. the upper left
+corner, depending on how you interpret the spatial dimensions). We now have a 2d
+window, taken from the base area, where each 2d point is associated to a 1d
+vector, so we get a 3d box. From the convolutional kernel, since we fixed the
+output coordinate `z`, we also have a 3d box. The two boxes have the same
+dimensions, so we can take the sum of the element-wise products between the two
+boxes (similar to a dot product). That is the output value.
+
+Note that if `output-z` is e.g., 5, then each position of the window produces 5
+values in the output into the `z` dimension of the output. These values differ
+in what part of the convolutional kernel is used - there is a separate 3d box of
+values used for each `output-z` coordinate. So you could think of it as 5
+separate convolutions with a different filter for each of them.
+
+Here is pseudo-code for a 2d convolution with padding and striding:
+
+```
+for (b, oz, oy, ox) { // output coordinates
+ value = 0;
+ for (iz, ky, kx) { // kernel coordinates and input z
+ iy = oy*stride_y + ky - pad_low_y;
+ ix = ox*stride_x + kx - pad_low_x;
+ if ((iy, ix) inside the base area considered without padding) {
+ value += input(b, iz, iy, ix) * kernel(oz, iz, ky, kx);
+ }
+ }
+ output(b, oz, oy, ox) = value;
+}
+```
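+
+The pseudo-code above can be transcribed into runnable NumPy (a sketch of our
+own, not an XLA API; the helper name `conv2d_reference` is hypothetical). It
+ignores `lhs_dilation`/`rhs_dilation` and assumes the dimension orders
+described earlier: `lhs` is (batch, z, y, x) and `rhs` is (output-z, input-z,
+ky, kx):
+
+```python
+import itertools
+
+import numpy as np
+
+def conv2d_reference(lhs, rhs, strides=(1, 1), padding=((0, 0), (0, 0))):
+    batch, in_z, in_y, in_x = lhs.shape
+    out_z, _, k_y, k_x = rhs.shape
+    (pad_lo_y, pad_hi_y), (pad_lo_x, pad_hi_x) = padding
+    stride_y, stride_x = strides
+    # One output value per valid placement of the window in the padded area.
+    out_y = (in_y + pad_lo_y + pad_hi_y - k_y) // stride_y + 1
+    out_x = (in_x + pad_lo_x + pad_hi_x - k_x) // stride_x + 1
+    out = np.zeros((batch, out_z, out_y, out_x), dtype=lhs.dtype)
+    for b, oz, oy, ox in itertools.product(
+            range(batch), range(out_z), range(out_y), range(out_x)):
+        value = 0
+        for iz, ky, kx in itertools.product(
+                range(in_z), range(k_y), range(k_x)):
+            iy = oy * stride_y + ky - pad_lo_y
+            ix = ox * stride_x + kx - pad_lo_x
+            # Zero padding: coordinates outside the base area contribute 0.
+            if 0 <= iy < in_y and 0 <= ix < in_x:
+                value += lhs[b, iz, iy, ix] * rhs[oz, iz, ky, kx]
+        out[b, oz, oy, ox] = value
+    return out
+```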
+
+## CrossReplicaSum
+
+See also
+[`ComputationBuilder::CrossReplicaSum`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Computes a sum across replicas.
+
+<b> `CrossReplicaSum(operand)` </b>
+
+| Arguments | Type | Semantics |
+| ------------ | ----------------------- | ---------------------------------- |
+| `operand` | `ComputationDataHandle` | Array to sum across replicas. |
+
+The output shape is the same as the input shape. For example, if there are two
+replicas and the operand has the value `(1.0, 2.5)` and `(3.0, 5.1)`
+respectively on the two replicas, then the output value from this op will be
+`(4.0, 7.6)` on both replicas.
+
+Computing the result of CrossReplicaSum requires having one input from each
+replica, so if one replica executes a CrossReplicaSum node more times than
+another, then the former replica will wait forever. Since the replicas are all
+running the same program, there are not a lot of ways for that to happen, but it
+is possible when a while loop's condition depends on data from infeed and the
+data that is fed in causes the while loop to iterate more times on one replica
+than on another.
+
+## CustomCall
+
+See also
+[`ComputationBuilder::CustomCall`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Call a user-provided function within a computation.
+
+<b> `CustomCall(target_name, args..., shape)` </b>
+
+| Arguments | Type | Semantics |
+| ------------- | ------------------------ | -------------------------------- |
+| `target_name` | `string` | Name of the function. A call |
+: : : instruction will be emitted :
+: : : which targets this symbol name. :
+| `args` | sequence of N | N arguments of arbitrary type, |
+: : `ComputationDataHandle`s : which will be passed to the :
+: : : function. :
+| `shape` | `Shape` | Output shape of the function |
+
+The function signature is the same, regardless of the arity or type of args:
+
+```
+extern "C" void target_name(void* out, void** in);
+```
+
+For example, if CustomCall is used as follows:
+
+```
+let x = f32[2] {1,2};
+let y = f32[2x3] {{10, 20, 30}, {40, 50, 60}};
+
+CustomCall("myfunc", {x, y}, f32[3x3])
+```
+
+Here is an example of an implementation of `myfunc`:
+
+```
+extern "C" void myfunc(void* out, void** in) {
+ float (&x)[2] = *static_cast<float(*)[2]>(in[0]);
+ float (&y)[2][3] = *static_cast<float(*)[2][3]>(in[1]);
+ EXPECT_EQ(1, x[0]);
+ EXPECT_EQ(2, x[1]);
+ EXPECT_EQ(10, y[0][0]);
+ EXPECT_EQ(20, y[0][1]);
+ EXPECT_EQ(30, y[0][2]);
+ EXPECT_EQ(40, y[1][0]);
+ EXPECT_EQ(50, y[1][1]);
+ EXPECT_EQ(60, y[1][2]);
+ float (&z)[3][3] = *static_cast<float(*)[3][3]>(out);
+ z[0][0] = x[1] + y[1][0];
+ // ...
+}
+```
+
+The user-provided function must not have side-effects and its execution must be
+idempotent.
+
+> Note: The opaque nature of the user-provided function restricts optimization
+> opportunities for the compiler. Try to express your computation in terms of
+> native XLA ops whenever possible; only use CustomCall as a last resort.
+
+## Dot
+
+See also
+[`ComputationBuilder::Dot`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+<b> `Dot(lhs, rhs)` </b>
+
+Arguments | Type | Semantics
+--------- | ----------------------- | ---------------
+`lhs` | `ComputationDataHandle` | array of type T
+`rhs` | `ComputationDataHandle` | array of type T
+
+The exact semantics of this operation depend on the ranks of the operands:
+
+| Input | Output | Semantics |
+| ----------------------- | --------------------- | ----------------------- |
+| vector [n] `dot` vector | scalar | vector dot product |
+: [n] : : :
+| matrix [m x k] `dot` | vector [m] | matrix-vector |
+: vector [k] : : multiplication :
+| matrix [m x k] `dot` | matrix [m x n] | matrix-matrix |
+: matrix [k x n] : : multiplication :
+
+The operation performs sum of products over the last dimension of `lhs` and the
+one-before-last dimension of `rhs`. These are the "contracted" dimensions. The
+contracted dimensions of `lhs` and `rhs` must be of the same size. In practice,
+it can be used to perform dot products between vectors, vector/matrix
+multiplications or matrix/matrix multiplications.
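+
+As a quick illustration, the three cases in the table correspond directly to
+NumPy's `np.dot` (an analogy only; XLA itself is driven through
+`ComputationBuilder`):
+
+```python
+import numpy as np
+
+v = np.array([1.0, 2.0, 3.0])  # vector [3]
+m = np.ones((2, 3))            # matrix [2 x 3]
+
+np.dot(v, v)    # scalar: vector dot product
+np.dot(m, v)    # vector [2]: matrix-vector multiplication
+np.dot(m, m.T)  # matrix [2 x 2]: matrix-matrix multiplication
+```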
+
+## Element-wise binary arithmetic operations
+
+See also
+[`ComputationBuilder::Add`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+A set of element-wise binary arithmetic operations is supported.
+
+<b> `Op(lhs, rhs)` </b>
+
+Where `Op` is one of `Add` (addition), `Sub` (subtraction), `Mul`
+(multiplication), `Div` (division), `Rem` (remainder), `Max` (maximum), `Min`
+(minimum), `LogicalAnd` (logical AND), or `LogicalOr` (logical OR).
+
+Arguments | Type | Semantics
+--------- | ----------------------- | ----------------------------------------
+`lhs` | `ComputationDataHandle` | left-hand-side operand: array of type T
+`rhs` | `ComputationDataHandle` | right-hand-side operand: array of type T
+
+The arguments' shapes have to be either similar or compatible. See the
+[broadcasting](broadcasting.md) documentation about what it means for shapes to
+be compatible. The result of an operation has a shape which is the result of
+broadcasting the two input arrays. In this variant, operations between arrays of
+different ranks are *not* supported, unless one of the operands is a scalar.
+
+When `Op` is `Rem`, the sign of the result is taken from the dividend, and the
+absolute value of the result is always less than the divisor's absolute value.
+
+An alternative variant with different-rank broadcasting support exists for these
+operations:
+
+<b> `Op(lhs, rhs, broadcast_dimensions)` </b>
+
+Where `Op` is the same as above. This variant of the operation should be used
+for arithmetic operations between arrays of different ranks (such as adding a
+matrix to a vector).
+
+The additional `broadcast_dimensions` operand is a slice of integers used to
+expand the rank of the lower-rank operand up to the rank of the higher-rank
+operand. `broadcast_dimensions` maps the dimensions of the lower-rank shape to
+the dimensions of the higher-rank shape. The unmapped dimensions of the expanded
+shape are filled with dimensions of size one. Degenerate-dimension broadcasting
+then broadcasts the shapes along these degenerate dimensions to equalize the
+shapes of both operands. The semantics are described in detail on the
+[broadcasting page](broadcasting.md).
+
+## Element-wise comparison operations
+
+See also
+[`ComputationBuilder::Eq`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+A set of standard element-wise binary comparison operations is supported. Note
+that standard IEEE 754 floating-point comparison semantics apply when comparing
+floating-point types.
+
+<b> `Op(lhs, rhs)` </b>
+
+Where `Op` is one of `Eq` (equal-to), `Ne` (not equal-to), `Ge`
+(greater-or-equal-than), `Gt` (greater-than), `Le` (less-or-equal-than), `Lt`
+(less-than).
+
+Arguments | Type | Semantics
+--------- | ----------------------- | ----------------------------------------
+`lhs` | `ComputationDataHandle` | left-hand-side operand: array of type T
+`rhs` | `ComputationDataHandle` | right-hand-side operand: array of type T
+
+The arguments' shapes have to be either similar or compatible. See the
+[broadcasting](broadcasting.md) documentation about what it means for shapes to
+be compatible. The result of an operation has a shape which is the result of
+broadcasting the two input arrays with the element type `PRED`. In this variant,
+operations between arrays of different ranks are *not* supported, unless one of
+the operands is a scalar.
+
+An alternative variant with different-rank broadcasting support exists for these
+operations:
+
+<b> `Op(lhs, rhs, broadcast_dimensions)` </b>
+
+Where `Op` is the same as above. This variant of the operation should be used
+for comparison operations between arrays of different ranks (such as comparing
+a matrix to a vector).
+
+The additional `broadcast_dimensions` operand is a slice of integers specifying
+the dimensions to use for broadcasting the operands. The semantics are described
+in detail on the [broadcasting page](broadcasting.md).
+
+## Element-wise unary functions
+
+ComputationBuilder supports these element-wise unary functions:
+
+<b>`Abs(operand)`</b> Element-wise abs `x -> |x|`.
+
+<b>`Ceil(operand)`</b> Element-wise ceil `x -> ⌈x⌉`.
+
+<b>`Exp(operand)`</b> Element-wise natural exponential `x -> e^x`.
+
+<b>`Floor(operand)`</b> Element-wise floor `x -> ⌊x⌋`.
+
+<b>`Log(operand)`</b> Element-wise natural logarithm `x -> ln(x)`.
+
+<b>`LogicalNot(operand)`</b> Element-wise logical not `x -> !(x)`.
+
+<b>`Neg(operand)`</b> Element-wise negation `x -> -x`.
+
+<b>`Sign(operand)`</b> Element-wise sign operation `x -> sgn(x)` where
+
+$$\text{sgn}(x) = \begin{cases} -1 & x < 0\\ 0 & x = 0\\ 1 & x > 0 \end{cases}$$
+
+using the comparison operator of the element type of `operand`.
+
+<b>`Tanh(operand)`</b> Element-wise hyperbolic tangent `x -> tanh(x)`.
+
+
+Arguments | Type | Semantics
+--------- | ----------------------- | ---------------------------
+`operand` | `ComputationDataHandle` | The operand to the function
+
+The function is applied to each element in the `operand` array, resulting in an
+array with the same shape. It is allowed for `operand` to be a scalar (rank 0).
+
+## GetTupleElement
+
+See also
+[`ComputationBuilder::GetTupleElement`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Indexes into a tuple with a compile-time-constant value.
+
+The value must be a compile-time-constant so that shape inference can determine
+the type of the resulting value.
+
+This is analogous to `std::get<int N>(t)` in C++. Conceptually:
+
+```
+let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
+let s: s32 = 5;
+let t: (f32[10], s32) = tuple(v, s);
+let element_1: s32 = gettupleelement(t, 1); // Inferred shape matches s32.
+```
+
+See also [`Tuple`](#tuple).
+
+## Infeed
+
+See also
+[`ComputationBuilder::Infeed`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+<b> `Infeed(shape)` </b>
+
+| Argument | Type | Semantics |
+| -------- | ------- | ----------------------------------------------------- |
+| `shape` | `Shape` | Shape of the data read from the Infeed interface. The |
+: : : layout field of the shape must be set to match the :
+: : : layout of the data sent to the device; otherwise its :
+: : : behavior is undefined. :
+
+Reads a single data item from the implicit Infeed streaming interface of the
+device, interpreting the data as the given shape and its layout, and returns a
+`ComputationDataHandle` of the data. Multiple Infeed operations are allowed in a
+computation, but there must be a total order among the Infeed operations. For
+example, two Infeeds in the code below have a total order since there is a
+dependency between the while loops. The compiler issues an error if there isn't
+a total order.
+
+```
+result1 = while (condition, init = init_value) {
+ Infeed(shape)
+}
+
+result2 = while (condition, init = result1) {
+ Infeed(shape)
+}
+```
+
+Nested tuple shapes are not supported. For an empty tuple shape, the Infeed
+operation is effectively a nop and proceeds without reading any data from the
+Infeed of the device.
+
+> Note: We plan to allow multiple Infeed operations without a total order, in
+> which case the compiler will provide information about how the Infeed
+> operations are serialized in the compiled program.
+
+## Map
+
+See also
+[`ComputationBuilder::Map`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+<b> `Map(operands..., computation, static_operands)` </b>
+
+| Arguments | Type | Semantics |
+| ----------------- | ------------------------ | ----------------------------- |
+| `operands` | sequence of N | N arrays of types T_0..T_{N-1}|
+: : `ComputationDataHandle`s : :
+| `computation` | `Computation` | computation of type `T_0, |
+: : : T_1, ..., T_{N + M -1} -> S` :
+: : : with N parameters of type T :
+: : : and M of arbitrary type :
+| `static_operands` | sequence of M | M arrays of arbitrary type |
+: : `ComputationDataHandle`s : :
+
+Applies a scalar function over the given `operands` arrays, producing an array
+of the same dimensions where each element is the result of the mapped function
+applied to the corresponding elements in the input arrays with `static_operands`
+given as additional input to `computation`.
+
+The mapped function is an arbitrary computation with the restriction that it has
+N inputs of scalar type `T` and a single output with type `S`. The output has
+the same dimensions as the operands except that the element type T is replaced
+with S.
+
+For example: `Map(op1, op2, op3, computation, par1)` maps `elem_out <-
+computation(elem1, elem2, elem3, par1)` at each (multi-dimensional) index in the
+input arrays to produce the output array.
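+
+A NumPy analogue of this behavior (an analogy only; in XLA the mapped function
+is itself a `Computation`) is mapping a scalar function over same-shaped
+arrays:
+
+```python
+import numpy as np
+
+# Hypothetical scalar computation: (a, b) -> a * b + 1.0
+f = np.vectorize(lambda a, b: a * b + 1.0)
+f(np.array([1.0, 2.0]), np.array([3.0, 4.0]))  # -> [4.0, 9.0]
+```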
+
+## Pad
+
+See also
+[`ComputationBuilder::Pad`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+<b> `Pad(operand, padding_value, padding_config)` </b>
+
+| Arguments | Type | Semantics |
+| ---------------- | ----------------------- | ----------------------------- |
+| `operand` | `ComputationDataHandle` | array of type `T` |
+| `padding_value` | `ComputationDataHandle` | scalar of type `T` to fill in |
+: : : the added padding :
+| `padding_config` | `PaddingConfig` | padding amount on both edges |
+: : : (low, high) and between the :
+: : : elements of each dimension :
+
+Expands the given `operand` array by padding around the array as well as between
+the elements of the array with the given `padding_value`. `padding_config`
+specifies the amount of edge padding and the interior padding for each
+dimension.
+
+`PaddingConfig` is a repeated field of `PaddingConfigDimension`, which contains
+three fields for each dimension: `edge_padding_low`, `edge_padding_high`, and
+`interior_padding`. `edge_padding_low` and `edge_padding_high` specify the
+amount of padding added at the low-end (next to index 0) and the high-end (next
+to the highest index) of each dimension respectively. `interior_padding`
+specifies the amount of padding added between any two elements in each
+dimension. This operation is a no-op if the edge padding pairs are all (0, 0)
+and the interior padding values are all 0. The figure below shows examples of
+different `edge_padding` and `interior_padding` values for a two-dimensional
+array.
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:100%" src="../../images/ops_pad.png">
+</div>
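+
+For the edge-padding part, NumPy's `np.pad` is a close analogue (an analogy
+only; it has no direct equivalent of `interior_padding`, so the strided
+assignment below is our own sketch of interior padding):
+
+```python
+import numpy as np
+
+x = np.array([[1, 2, 3],
+              [4, 5, 6]])
+
+# Edge padding (low, high) per dimension: one row at the low end of dimension
+# 0 and two columns at the high end of dimension 1, filled with 0.
+np.pad(x, ((1, 0), (0, 2)), mode="constant", constant_values=0)
+
+# interior_padding=1 along dimension 1: a length-3 dimension grows to
+# 3 + 2*1 = 5, with the original elements at even indices.
+y = np.zeros((2, 5), dtype=x.dtype)
+y[:, ::2] = x
+```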
+
+## Reduce
+
+See also
+[`ComputationBuilder::Reduce`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Applies a reduction function to an array.
+
+<b> `Reduce(operand, init_value, computation, dimensions)` </b>
+
+| Arguments | Type | Semantics |
+| ------------- | ----------------------- | -------------------------------- |
+| `operand` | `ComputationDataHandle` | array of type `T` |
+| `init_value` | `ComputationDataHandle` | scalar of type `T` |
+| `computation` | `Computation` | computation of type `T, T -> T` |
+| `dimensions` | `int64` array | unordered array of dimensions to |
+: : : reduce :
+
+Conceptually, this operation reduces one or more dimensions in the input array
+into scalars. The rank of the result array is `rank(operand) - len(dimensions)`.
+`init_value` is the initial value used for every reduction and may also be
+inserted anywhere during computation if the back-end chooses to do so. So in
+most cases `init_value` should be an identity of the reduction function (for
+example, 0 for addition).
+
+The evaluation order of the reduction function across the reduction dimensions
+is arbitrary and may be non-deterministic. Therefore, the reduction function
+should not be overly sensitive to reassociation[^1].
+
+As an example, when reducing across the single dimension of a 1D array with
+values `[10, 11, 12, 13]` using reduction function `f` (this is `computation`),
+the result could be computed as
+
+`f(10, f(11, f(12, f(init_value, 13))))`
+
+but there are also many other possibilities, e.g.
+
+`f(init_value, f(f(10, f(init_value, 11)), f(f(init_value, 12), f(13,
+init_value))))`
+
+The following is a rough pseudo-code example of how reduction could be
+implemented, using summation as the reduction computation with an initial value
+of 0.
+
+```python
+result_shape <- remove all dims in dimensions from operand_shape
+
+# Iterate over all elements in result_shape. The number of r's here is equal
+# to the rank of the result
+for r0 in range(result_shape[0]), r1 in range(result_shape[1]), ...:
+ # Initialize this result element
+ result[r0, r1...] <- 0
+
+ # Iterate over all the reduction dimensions
+ for d0 in range(dimensions[0]), d1 in range(dimensions[1]), ...:
+ # Increment the result element with the value of the operand's element.
+ # The index of the operand's element is constructed from all ri's and di's
+ # in the right order (by construction ri's and di's together index over the
+ # whole operand shape).
+ result[r0, r1...] += operand[ri... di]
+```
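+
+The same computation in runnable NumPy form, specialized to an add reduction (a
+reference sketch, not the XLA API):
+
+```python
+import numpy as np
+from itertools import product
+
+def reduce_add(operand, dimensions, init_value=0.0):
+    kept = [d for d in range(operand.ndim) if d not in dimensions]
+    result = np.full([operand.shape[d] for d in kept], init_value,
+                     dtype=operand.dtype)
+    # Visit every operand element and accumulate it into the result element
+    # addressed by the non-reduced dimensions of its index.
+    for idx in product(*(range(s) for s in operand.shape)):
+        result[tuple(idx[d] for d in kept)] += operand[idx]
+    return result
+
+x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
+print(reduce_add(x, {0}))     # [5. 7. 9.]
+print(reduce_add(x, {1}))     # [ 6. 15.]
+print(reduce_add(x, {0, 1}))  # 21.0
+```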
+
+Here's an example of reducing a 2D array (matrix). The shape has rank 2,
+dimension 0 of size 2 and dimension 1 of size 3:
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:35%" src="../../images/ops_2d_matrix.png">
+</div>
+
+Results of reducing dimensions 0 or 1 with an "add" function:
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:35%" src="../../images/ops_reduce_from_2d_matrix.png">
+</div>
+
+Note that both reduction results are 1D arrays. The diagram shows one as a
+column and another as a row just for visual convenience.
+
+For a more complex example, here is a 3D array. Its rank is 3, dimension 0 of
+size 4, dimension 1 of size 2 and dimension 2 of size 3. For simplicity, the
+values 1 to 6 are replicated across dimension 0.
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:35%" src="../../images/ops_reduce_from_3d_matrix.png">
+</div>
+
+Similarly to the 2D example, we can reduce just one dimension. If we reduce
+dimension 0, for example, we get a rank-2 array where all values across
+dimension 0 were folded into a scalar:
+
+```text
+|  4  8 12 |
+| 16 20 24 |
+```
+
+If we reduce dimension 2, we also get a rank-2 array where all values across
+dimension 2 were folded into a scalar:
+
+```text
+| 6 15 |
+| 6 15 |
+| 6 15 |
+| 6 15 |
+```
+
+Note that the relative order between the remaining dimensions in the input is
+preserved in the output, but some dimensions may get assigned new numbers (since
+the rank changes).
+
+We can also reduce multiple dimensions. Add-reducing dimensions 0 and 1 produces
+the 1D array `| 20 28 36 |`.
+
+Reducing the 3D array over all its dimensions produces the scalar `84`.
+
+## ReduceWindow
+
+See also
+[`ComputationBuilder::ReduceWindow`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Applies a reduction function to all elements in each window of the input
+multi-dimensional array, producing an output multi-dimensional array with the
+same number of elements as the number of valid positions of the window. A
+pooling layer can be expressed as a `ReduceWindow`.
+
+<b> `ReduceWindow(operand, init_value, computation, window_dimensions,
+window_strides, padding)` </b>
+
+| Arguments | Type | Semantics |
+| ------------------- | ----------------------- | ---------------------------- |
+| `operand` | `ComputationDataHandle` | N dimensional array |
+: : : containing elements of type :
+: : : T. This is the base area on :
+: : : which the window is placed. :
+| `init_value` | `ComputationDataHandle` | Starting value for the |
+: : : reduction. See [Reduce] :
+: : : (#reduce) for details. :
+| `computation` | `Computation` | Reduction function of type |
+: : : `T, T -> T`, to apply to all :
+: : : elements in each window :
+| `window_dimensions` | `ArraySlice<int64>` | array of integers for window |
+: : : dimension values :
+| `window_strides` | `ArraySlice<int64>` | array of integers for window |
+: : : stride values :
+| `padding` | `Padding` | padding type for window |
+: : : (Padding\:\:kSame or :
+: : : Padding\:\:kValid) :
+
+The code and figure below show an example of using `ReduceWindow`. The input is
+a matrix of size [4x6], and both `window_dimensions` and
+`window_stride_dimensions` are [2x3].
+
+```c++
+// Create a computation for the reduction (maximum).
+Computation max;
+{
+ ComputationBuilder builder(client_, "max");
+ auto y = builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "y");
+ auto x = builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "x");
+ builder.Max(y, x);
+ max = builder.Build().ConsumeValueOrDie();
+}
+
+// Create a ReduceWindow computation with the max reduction computation.
+ComputationBuilder builder(client_, "reduce_window_2x3");
+auto shape = ShapeUtil::MakeShape(F32, {4, 6});
+auto input = builder.Parameter(0, shape, "input");
+builder.ReduceWindow(
+ input, *max,
+ /*init_val=*/builder.ConstantLiteral(LiteralUtil::MinValue(F32)),
+ /*window_dimensions=*/{2, 3},
+ /*window_stride_dimensions=*/{2, 3},
+ Padding::kValid);
+```
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:35%" src="../../images/ops_reduce_window.png">
+</div>
+
+A stride of 1 in a dimension specifies that the position of a window in that
+dimension is 1 element away from its adjacent window. To specify that no
+windows overlap with each other, `window_stride_dimensions` should be equal to
+`window_dimensions`. The figure below illustrates the use of two different
+stride values. Padding is applied to each dimension of the input, and the
+calculations are the same as though the input came in with the dimensions it
+has after padding.
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:75%" src="../../images/ops_reduce_window_stride.png">
+</div>
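+
+A NumPy sketch of these semantics for `Padding::kValid` with a max reduction
+(the helper name is hypothetical):
+
+```python
+import numpy as np
+from itertools import product
+
+def reduce_window_max(operand, window_dimensions, window_strides):
+    # With valid padding, only windows that fit entirely inside the input
+    # contribute an output element.
+    out_shape = [(s - w) // st + 1 for s, w, st in
+                 zip(operand.shape, window_dimensions, window_strides)]
+    result = np.empty(out_shape, dtype=operand.dtype)
+    for out_idx in product(*(range(s) for s in out_shape)):
+        window = tuple(slice(o * st, o * st + w) for o, st, w in
+                       zip(out_idx, window_strides, window_dimensions))
+        result[out_idx] = operand[window].max()
+    return result
+
+x = np.arange(24, dtype=np.float32).reshape(4, 6)
+print(reduce_window_max(x, (2, 3), (2, 3)))  # [[ 8. 11.] [20. 23.]]
+```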
+
+## Reshape
+
+See also
+[`ComputationBuilder::Reshape`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h)
+and the [`Collapse`](#collapse) operation.
+
+Reshapes the dimensions of an array into a new configuration.
+
+<b> `Reshape(operand, new_sizes)` </b>
+<b> `Reshape(operand, dimensions, new_sizes)` </b>
+
+Arguments | Type | Semantics
+------------ | ----------------------- | ---------------------------------------
+`operand` | `ComputationDataHandle` | array of type T
+`dimensions` | `int64` vector | order in which dimensions are collapsed
+`new_sizes` | `int64` vector | vector of sizes of new dimensions
+
+Conceptually, reshape first flattens an array into a one-dimensional vector of
+data values, and then refines this vector into a new shape. The input arguments
+are an arbitrary array of type T, a compile-time-constant vector of dimension
+indices, and a compile-time-constant vector of dimension sizes for the result.
+The values in the `dimensions` vector, if given, must be a permutation of all of
+T's dimensions; the default if not given is `{0, ..., rank - 1}`. The order of
+the dimensions in `dimensions` is from slowest-varying dimension (most major) to
+fastest-varying dimension (most minor) in the loop nest which collapses the
+input array into a single dimension. The `new_sizes` vector determines the size
+of the output array. The value at index 0 in `new_sizes` is the size of
+dimension 0, the value at index 1 is the size of dimension 1, and so on. The
+product of the `new_sizes` dimensions must equal the product of the operand's
+dimension sizes. When refining the collapsed array into the multidimensional
+array defined by `new_sizes`, the dimensions in `new_sizes` are ordered from
+slowest varying (most major) to fastest varying (most minor).
+
+For example, let v be an array of 24 elements:
+
+```
+let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}},
+ {{20, 21, 22}, {25, 26, 27}},
+ {{30, 31, 32}, {35, 36, 37}},
+ {{40, 41, 42}, {45, 46, 47}}};
+
+In-order collapse:
+let v012_24 = Reshape(v, {0,1,2}, {24});
+then v012_24 == f32[24] {10, 11, 12, 15, 16, 17, 20, 21, 22, 25, 26, 27,
+ 30, 31, 32, 35, 36, 37, 40, 41, 42, 45, 46, 47};
+
+let v012_83 = Reshape(v, {0,1,2}, {8,3});
+then v012_83 == f32[8x3] {{10, 11, 12}, {15, 16, 17},
+ {20, 21, 22}, {25, 26, 27},
+ {30, 31, 32}, {35, 36, 37},
+ {40, 41, 42}, {45, 46, 47}};
+
+Out-of-order collapse:
+let v021_24 = Reshape(v, {1,2,0}, {24});
+then v021_24 == f32[24] {10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42,
+ 15, 25, 35, 45, 16, 26, 36, 46, 17, 27, 37, 47};
+
+let v021_83 = Reshape(v, {1,2,0}, {8,3});
+then v021_83 == f32[8x3] {{10, 20, 30}, {40, 11, 21},
+ {31, 41, 12}, {22, 32, 42},
+ {15, 25, 35}, {45, 16, 26},
+ {36, 46, 17}, {27, 37, 47}};
+
+
+let v021_262 = Reshape(v, {1,2,0}, {2,6,2});
+then v021_262 == f32[2x6x2] {{{10, 20}, {30, 40},
+ {11, 21}, {31, 41},
+ {12, 22}, {32, 42}},
+ {{15, 25}, {35, 45},
+ {16, 26}, {36, 46},
+ {17, 27}, {37, 47}}};
+```
+
+As a special case, reshape can transform a single-element array to a scalar and
+vice versa. For example,
+
+```
+Reshape(f32[1x1] {{5}}, {0,1}, {}) == 5;
+Reshape(5, {}, {1,1}) == f32[1x1] {{5}};
+```
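+
+Assuming the collapse-order interpretation above, these semantics can be
+reproduced in NumPy by transposing into the given `dimensions` order and then
+reshaping in row-major order (the helper name is hypothetical):
+
+```python
+import numpy as np
+
+def xla_reshape(operand, dimensions, new_sizes):
+    # Collapse in `dimensions` order (slowest-varying first), then refine
+    # the flattened data row-major into `new_sizes`.
+    return np.transpose(operand, dimensions).reshape(new_sizes)
+
+v = np.array([[[10, 11, 12], [15, 16, 17]],
+              [[20, 21, 22], [25, 26, 27]],
+              [[30, 31, 32], [35, 36, 37]],
+              [[40, 41, 42], [45, 46, 47]]], dtype=np.float32)
+print(xla_reshape(v, (1, 2, 0), (24,)))
+# [10. 20. 30. 40. 11. 21. 31. 41. 12. 22. 32. 42.
+#  15. 25. 35. 45. 16. 26. 36. 46. 17. 27. 37. 47.]
+```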
+
+## Rev (reverse)
+
+See also
+[`ComputationBuilder::Rev`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+<b>`Rev(operand, dimensions)`</b>
+
+Arguments | Type | Semantics
+------------ | ----------------------- | ---------------------
+`operand` | `ComputationDataHandle` | array of type T
+`dimensions` | `ArraySlice<int64>` | dimensions to reverse
+
+Reverses the order of elements in the `operand` array along the specified
+`dimensions`, generating an output array of the same shape. Each element of the
+operand array at a multidimensional index is stored into the output array at a
+transformed index. The multidimensional index is transformed by reversing the
+index in each dimension to be reversed (i.e., if a dimension of size N is one of
+the reversing dimensions, its index i is transformed into N - 1 - i).
+
+One use for the `Rev` operation is to reverse the convolution weight array along
+the two window dimensions during the gradient computation in neural networks.
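+
+In NumPy terms this corresponds to `np.flip` along the given dimensions, for
+example:
+
+```python
+import numpy as np
+
+x = np.array([[1, 2, 3],
+              [4, 5, 6]])
+# Reverse dimension 1: index i maps to N - 1 - i within each row.
+print(np.flip(x, axis=1))  # [[3 2 1]
+                           #  [6 5 4]]
+```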
+
+## RngBernoulli
+
+See also
+[`ComputationBuilder::RngBernoulli`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Constructs an output of a given shape with random numbers generated following
+the Bernoulli distribution. The parameter needs to be a scalar-valued F32
+operand, while the output shape needs to have elemental type U32.
+
+<b>`RngBernoulli(mean, shape)`</b>
+
+| Arguments | Type | Semantics |
+| --------- | ----------------------- | ------------------------------------- |
+| `mean` | `ComputationDataHandle` | Scalar of type F32 specifying mean of |
+: : : generated numbers :
+| `shape` | `Shape` | Output shape of type U32 |
+
+## RngNormal
+
+See also
+[`ComputationBuilder::RngNormal`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Constructs an output of a given shape with random numbers generated following
+the $$N(\mu, \sigma)$$ normal distribution. The parameters `mu` and `sigma`, and
+the output shape, have to have elemental type F32. The parameters furthermore
+have to be scalar valued.
+
+<b>`RngNormal(mu, sigma, shape)`</b>
+
+| Arguments | Type | Semantics |
+| --------- | ----------------------- | -------------------------------------- |
+| `mu` | `ComputationDataHandle` | Scalar of type F32 specifying mean of |
+: : : generated numbers :
+| `sigma` | `ComputationDataHandle` | Scalar of type F32 specifying standard |
+: : : deviation of generated numbers :
+| `shape` | `Shape` | Output shape of type F32 |
+
+## RngUniform
+
+See also
+[`ComputationBuilder::RngUniform`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Constructs an output of a given shape with random numbers generated following
+the uniform distribution over the interval $$[a,b]$$. The parameters and output
+shape may be either F32, S32 or U32, but the types have to be consistent.
+Furthermore, the parameters need to be scalar valued.
+
+<b>`RngUniform(a, b, shape)`</b>
+
+| Arguments | Type | Semantics |
+| --------- | ----------------------- | --------------------------------- |
+| `a` | `ComputationDataHandle` | Scalar of type T specifying lower |
+: : : limit of interval :
+| `b` | `ComputationDataHandle` | Scalar of type T specifying upper |
+: : : limit of interval :
+| `shape` | `Shape` | Output shape of type T |
+
+## SelectAndScatter
+
+See also
+[`ComputationBuilder::SelectAndScatter`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+This operation can be considered as a composite operation that first computes
+`ReduceWindow` on the `operand` array to select an element from each window, and
+then scatters the `source` array to the indices of the selected elements to
+construct an output array with the same shape as the operand array. The binary
+`select` function is used to select an element from each window by applying it
+across each window, and it is called with the property that the first
+parameter's index vector is lexicographically less than the second parameter's
+index vector. The `select` function returns `true` if the first parameter is
+selected and returns `false` if the second parameter is selected, and the
+function must satisfy transitivity (i.e., if `select(a, b)` and `select(b, c)` are
+`true`, then `select(a, c)` is also `true`) so that the selected element does
+not depend on the order of the elements traversed for a given window.
+
+The function `scatter` is applied at each selected index in the output array. It
+takes two scalar parameters:
+
+1. Current value at the selected index in the output array
+2. The scatter value from `source` that applies to the selected index
+
+It combines the two parameters and returns a scalar value that's used to update
+the value at the selected index in the output array. Initially, all indices of
+the output array are set to `init_value`.
+
+The output array has the same shape as the `operand` array and the `source`
+array must have the same shape as the result of applying a `ReduceWindow`
+operation on the `operand` array. `SelectAndScatter` can be used to
+backpropagate the gradient values for a pooling layer in a neural network.
+
+<b>`SelectAndScatter(operand, select, window_dimensions, window_strides,
+padding, source, init_value, scatter)`</b>
+
+| Arguments | Type | Semantics |
+| ------------------- | ----------------------- | ---------------------------- |
+| `operand` | `ComputationDataHandle` | array of type T over which |
+: : : the windows slide :
+| `select` | `Computation` | binary computation of type |
+: : : `T, T -> PRED`, to apply to :
+: : : all elements in each window; :
+: : : returns `true` if the first :
+: : : parameter is selected and :
+: : : returns `false` if the :
+: : : second parameter is selected :
+| `window_dimensions` | `ArraySlice<int64>` | array of integers for window |
+: : : dimension values :
+| `window_strides` | `ArraySlice<int64>` | array of integers for window |
+: : : stride values :
+| `padding` | `Padding` | padding type for window |
+: : : (Padding\:\:kSame or :
+: : : Padding\:\:kValid) :
+| `source` | `ComputationDataHandle` | array of type T with the |
+: : : values to scatter :
+| `init_value` | `ComputationDataHandle` | scalar value of type T for |
+: : : the initial value of the :
+: : : output array :
+| `scatter` | `Computation` | binary computation of type |
+: : : `T, T -> T`, to apply each :
+: : : scatter source element with :
+: : : its destination element :
+
+The figure below shows examples of using `SelectAndScatter`, with the `select`
+function computing the maximal value among its parameters. Note that when the
+windows overlap, as in the figure (2) below, an index of the `operand` array may
+be selected multiple times by different windows. In the figure, the element of
+value 9 is selected by both of the top windows (blue and red) and the binary
+addition `scatter` function produces the output element of value 8 (2 + 6).
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:100%"
+ src="../../images/ops_scatter_to_selected_window_element.png">
+</div>
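+
+A NumPy sketch of these semantics for `select` = max, `scatter` = add, and
+`Padding::kValid` (the helper name is hypothetical; ties go to the
+lexicographically first index, consistent with the `select` property above):
+
+```python
+import numpy as np
+from itertools import product
+
+def select_and_scatter_max_add(operand, source, window_dimensions,
+                               window_strides, init_value):
+    result = np.full(operand.shape, init_value, dtype=operand.dtype)
+    out_shape = [(s - w) // st + 1 for s, w, st in
+                 zip(operand.shape, window_dimensions, window_strides)]
+    for out_idx in product(*(range(s) for s in out_shape)):
+        window = tuple(slice(o * st, o * st + w) for o, st, w in
+                       zip(out_idx, window_strides, window_dimensions))
+        # Select the (first) maximal element of this window ...
+        local = np.unravel_index(np.argmax(operand[window]),
+                                 window_dimensions)
+        selected = tuple(sl.start + l for sl, l in zip(window, local))
+        # ... and scatter the corresponding source value onto it.
+        result[selected] += source[out_idx]
+    return result
+
+op = np.array([[1., 5.], [2., 9.]])
+print(select_and_scatter_max_add(op, np.array([[10.]]), (2, 2), (2, 2), 0.))
+# [[ 0.  0.]
+#  [ 0. 10.]]
+```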
+
+## Select
+
+See also
+[`ComputationBuilder::Select`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Constructs an output array from elements of two input arrays, based on the
+values of a predicate array.
+
+<b> `Select(pred, on_true, on_false)` </b>
+
+Arguments | Type | Semantics
+---------- | ----------------------- | ------------------
+`pred` | `ComputationDataHandle` | array of type PRED
+`on_true` | `ComputationDataHandle` | array of type T
+`on_false` | `ComputationDataHandle` | array of type T
+
+The arrays `on_true` and `on_false` must have the same shape. This is also the
+shape of the output array. The array `pred` must have the same dimensionality as
+`on_true` and `on_false`, with the `PRED` element type.
+
+For each element `P` of `pred`, the corresponding element of the output array is
+taken from `on_true` if the value of `P` is `true`, and from `on_false` if the
+value of `P` is `false`. As a restricted form of
+[broadcasting](broadcasting.md), `pred` can be a scalar of type `PRED`. In this
+case, the
+output array is taken wholly from `on_true` if `pred` is `true`, and from
+`on_false` if `pred` is `false`.
+
+Example with non-scalar `pred`:
+
+```
+let pred: PRED[4] = {true, false, false, true};
+let v1: s32[4] = {1, 2, 3, 4};
+let v2: s32[4] = {100, 200, 300, 400};
+==>
+Select(pred, v1, v2) = s32[4]{1, 200, 300, 4};
+```
+
+Example with scalar `pred`:
+
+```
+let pred: PRED = true;
+let v1: s32[4] = {1, 2, 3, 4};
+let v2: s32[4] = {100, 200, 300, 400};
+==>
+Select(pred, v1, v2) = s32[4]{1, 2, 3, 4};
+```
+
+Selections between tuples are supported. Tuples are considered to be scalar
+types for this purpose. If `on_true` and `on_false` are tuples (which must have
+the same shape!) then `pred` has to be a scalar of type `PRED`.
+
+## Slice
+
+See also
+[`ComputationBuilder::Slice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Slicing extracts a sub-array from the input array. The sub-array is of the same
+rank as the input and contains the values inside a bounding box within the input
+array where the dimensions and indices of the bounding box are given as
+arguments to the slice operation.
+
+<b> `Slice(operand, start_indices, limit_indices)` </b>
+
+| Arguments | Type | Semantics |
+| --------------- | ----------------------- | -------------------------------- |
+| `operand` | `ComputationDataHandle` | N dimensional array of type T |
+| `start_indices` | `ArraySlice<int64>` | List of N integers containing |
+: : : the starting indices of the :
+: : : slice for each dimension. Values :
+: : : must be greater than or equal to :
+: : : zero. :
+| `limit_indices` | `ArraySlice<int64>` | List of N integers containing |
+: : : the ending indices (exclusive) :
+: : : for the slice for each :
+: : : dimension. Each value must be :
+: : : strictly greater than the :
+: : : respective `start_indices` value :
+: : : for the dimension and less than :
+: : : or equal to the size of the :
+: : : dimension. :
+
+1-dimensional example:
+
+```
+let a = {0.0, 1.0, 2.0, 3.0, 4.0}
+Slice(a, {2}, {4}) produces:
+ {2.0, 3.0}
+```
+
+2-dimensional example:
+
+```
+let b =
+ { {0.0, 1.0, 2.0},
+ {3.0, 4.0, 5.0},
+ {6.0, 7.0, 8.0},
+ {9.0, 10.0, 11.0} }
+
+Slice(b, {2, 1}, {4, 3}) produces:
+ { { 7.0, 8.0},
+ {10.0, 11.0} }
+```
+
+## DynamicSlice
+
+See also
+[`ComputationBuilder::DynamicSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+DynamicSlice extracts a sub-array from the input array at dynamic
+`start_indices`. The size of the slice in each dimension is passed in
+`size_indices`, which specifies the end point of the exclusive slice interval
+in each dimension: [start, start + size). `start_indices` must be a rank 1
+array, with dimension size equal to the rank of `operand`.
+Note: handling of out-of-bounds slice indices (generated by incorrect runtime
+calculation of 'start_indices') is currently implementation-defined. Currently,
+slice indices are computed modulo input dimension sizes to prevent out-of-bound
+array accesses, but this behavior may change in future implementations.
+
+<b> `DynamicSlice(operand, start_indices, size_indices)` </b>
+
+| Arguments | Type | Semantics |
+| --------------- | ----------------------- | -------------------------------- |
+| `operand` | `ComputationDataHandle` | N dimensional array of type T |
+| `start_indices` | `ComputationDataHandle` | Rank 1 array of N integers |
+: : : containing the starting indices :
+: : : of the slice for each dimension. :
+: : : Value must be greater than or :
+: : : equal to zero. :
+| `size_indices` | `ArraySlice<int64>` | List of N integers containing |
+: : : the slice size for each :
+: : : dimension. Each value must be :
+: : : strictly greater than zero, and :
+: : : start + size must be less than :
+: : : or equal to the size of the :
+: : : dimension to avoid wrapping :
+: : : modulo dimension size. :
+
+1-dimensional example:
+
+```
+let a = {0.0, 1.0, 2.0, 3.0, 4.0}
+let s = {2}
+
+DynamicSlice(a, s, {2}) produces:
+ {2.0, 3.0}
+```
+
+2-dimensional example:
+
+```
+let b =
+ { {0.0, 1.0, 2.0},
+ {3.0, 4.0, 5.0},
+ {6.0, 7.0, 8.0},
+ {9.0, 10.0, 11.0} }
+let s = {2, 1}
+
+DynamicSlice(b, s, {2, 2}) produces:
+ { { 7.0, 8.0},
+ {10.0, 11.0} }
+```
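+
+A NumPy sketch of these semantics, including the modulo behavior described in
+the note above (the helper name `dynamic_slice` here is illustrative, not the
+XLA API):
+
+```python
+import numpy as np
+
+def dynamic_slice(operand, start_indices, size_indices):
+    # Starts are reduced modulo the dimension sizes, mirroring the current
+    # implementation-defined handling of out-of-bounds start indices.
+    starts = [s % d for s, d in zip(start_indices, operand.shape)]
+    return operand[tuple(slice(st, st + sz)
+                         for st, sz in zip(starts, size_indices))]
+
+b = np.array([[0., 1., 2.], [3., 4., 5.], [6., 7., 8.], [9., 10., 11.]])
+print(dynamic_slice(b, [2, 1], [2, 2]))  # [[ 7.  8.] [10. 11.]]
+```
+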
+## DynamicUpdateSlice
+
+See also
+[`ComputationBuilder::DynamicUpdateSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+DynamicUpdateSlice generates a result which is the value of the input array
+`operand`, with a slice `update` overwritten at `start_indices`.
+The shape of `update` determines the shape of the sub-array of the result which
+is updated.
+`start_indices` must be a rank 1 array, with dimension size equal to the rank
+of `operand`.
+Note: handling of out-of-bounds slice indices (generated by incorrect runtime
+calculation of 'start_indices') is currently implementation-defined. Currently,
+slice indices are computed modulo update dimension sizes to prevent out-of-bound
+array accesses, but this behavior may change in future implementations.
+
+<b> `DynamicUpdateSlice(operand, update, start_indices)` </b>
+
+| Arguments | Type | Semantics |
+| --------------- | ----------------------- | -------------------------------- |
+| `operand` | `ComputationDataHandle` | N dimensional array of type T |
+| `update` | `ComputationDataHandle` | N dimensional array of type T |
+: : : containing the slice update. :
+: : : Each dimension of update shape :
+: : : must be strictly greater than :
+: : : zero, and start + update must be :
+: : : less than or equal to the operand :
+: : : size in each dimension to avoid :
+: : : generating :
+: : : out-of-bounds update indices. :
+| `start_indices` | `ComputationDataHandle` | Rank 1 array of N integers |
+: : : containing the starting indices :
+: : : of the slice for each dimension. :
+: : : Value must be greater than or :
+: : : equal to zero. :
+
+1-dimensional example:
+
+```
+let a = {0.0, 1.0, 2.0, 3.0, 4.0}
+let u = {5.0, 6.0}
+let s = {2}
+
+DynamicUpdateSlice(a, u, s) produces:
+ {0.0, 1.0, 5.0, 6.0, 4.0}
+```
+
+2-dimensional example:
+
+```
+let b =
+ { {0.0, 1.0, 2.0},
+ {3.0, 4.0, 5.0},
+ {6.0, 7.0, 8.0},
+ {9.0, 10.0, 11.0} }
+let u =
+ { {12.0, 13.0},
+ {14.0, 15.0},
+ {16.0, 17.0} }
+
+let s = {1, 1}
+
+DynamicUpdateSlice(b, u, s) produces:
+ { {0.0, 1.0, 2.0},
+ {3.0, 12.0, 13.0},
+ {6.0, 14.0, 15.0},
+ {9.0, 16.0, 17.0} }
+```
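+
+The corresponding NumPy sketch (the helper name is hypothetical):
+
+```python
+import numpy as np
+
+def dynamic_update_slice(operand, update, start_indices):
+    # Copy the operand and overwrite the sub-array at start_indices.
+    result = operand.copy()
+    result[tuple(slice(s, s + u)
+                 for s, u in zip(start_indices, update.shape))] = update
+    return result
+
+a = np.array([0., 1., 2., 3., 4.])
+print(dynamic_update_slice(a, np.array([5., 6.]), [2]))  # [0. 1. 5. 6. 4.]
+```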
+
+## Sort
+
+See also
+[`ComputationBuilder::Sort`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+Sorts the elements in the operand.
+
+<b>`Sort(operand)`</b>
+
+Arguments | Type | Semantics
+--------- | ----------------------- | -------------------
+`operand` | `ComputationDataHandle` | The operand to sort
+
+## Transpose
+
+See also the [`Reshape`](#reshape) operation.
+
+<b>`Transpose(operand, permutation)`</b>
+
+Arguments | Type | Semantics
+--------- | ----------------------- | -------------------------
+`operand` | `ComputationDataHandle` | The operand to transpose.
+`permutation` | `ArraySlice<int64>` | How to permute the dimensions.
+
+
+Permutes the operand dimensions with the given permutation, so
+`∀ i . 0 ≤ i < rank ⇒ input_dimensions[permutation[i]] = output_dimensions[i]`.
+
+This is the same as
+`Reshape(operand, permutation, Permute(permutation, operand.shape.dimensions))`.
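+
+In NumPy this matches `np.transpose` with `axes=permutation`, for example:
+
+```python
+import numpy as np
+
+x = np.zeros((2, 3, 4))
+# output_dimensions[i] = input_dimensions[permutation[i]]
+print(np.transpose(x, axes=(1, 2, 0)).shape)  # (3, 4, 2)
+```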
+
+## Tuple
+
+See also
+[`ComputationBuilder::Tuple`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+A tuple containing a variable number of data handles, each of which has its own
+shape.
+
+This is analogous to `std::tuple` in C++. Conceptually:
+
+```
+let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
+let s: s32 = 5;
+let t: (f32[10], s32) = tuple(v, s);
+```
+
+Tuples can be deconstructed (accessed) via the
+[`GetTupleElement`](#gettupleelement) operation.
+
+## While
+
+See also
+[`ComputationBuilder::While`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+
+<b> `While(condition, body, init)` </b>
+
+| Arguments | Type | Semantics |
+| ----------- | ------------- | ---------------------------------------------- |
+| `condition` | `Computation` | Computation of type `T -> PRED` which defines |
+: : : the termination condition of the loop. :
+| `body` | `Computation` | Computation of type `T -> T` which defines the |
+: : : body of the loop. :
+| `init` | `T` | Initial value for the parameter of `condition` |
+: : : and `body`. :
+
+Sequentially executes the `body` until the `condition` fails. This is similar to
+a typical while loop in many other languages except for the differences and
+restrictions listed below.
+
+* A `While` node returns a value of type `T`, which is the result from the
+ last execution of the `body`.
+* The shape of the type `T` is statically determined and must be the same
+ across all iterations.
+* `While` nodes are not allowed to be nested. (This restriction may be lifted
+ in the future on some targets.)
+
+The T parameters of the computations are initialized with the `init` value in
+the first iteration and are automatically updated to the new result from `body`
+in each subsequent iteration.
+
+One main use case of the `While` node is to implement repeated execution of a
+training step in neural networks. Simplified pseudocode is shown below with a
+graph
+that represents the computation. The code can be found in
+[`while_test.cc`](https://www.tensorflow.org/code/tensorflow/compiler/xla/tests/while_test.cc).
+The type `T` in this example is a `Tuple` consisting of an `int32` for the
+iteration count and a `vector[10]` for the accumulator. For 1000 iterations, the
+loop keeps adding a constant vector to the accumulator.
+
+```
+// Pseudocode for the computation.
+init = {0, zero_vector[10]} // Tuple of int32 and float[10].
+result = init;
+while (result(0) < 1000) {
+ iteration = result(0) + 1;
+ new_vector = result(1) + constant_vector[10];
+ result = {iteration, new_vector};
+}
+```
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+ <img style="width:100%" src="../../images/ops_while.png">
+</div>
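+
+The pseudocode translates directly into runnable Python; here `constant_vector`
+is assumed to be all ones:
+
+```python
+import numpy as np
+
+constant_vector = np.ones(10, dtype=np.float32)
+result = (0, np.zeros(10, dtype=np.float32))  # init: (int32, float[10]) tuple
+while result[0] < 1000:                       # `condition`: T -> PRED
+    result = (result[0] + 1,                  # `body`: T -> T
+              result[1] + constant_vector)
+print(result[0], result[1][0])                # 1000 1000.0
+```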
+
+[^1]: Some obvious reductions like "add reduction" are not strictly associative
+ for floats. However, if the range of the data is limited, floating-point
+ addition is close enough to being associative for most practical uses. It
+    is possible to conceive of some completely non-associative reductions,
+    however, and these will produce incorrect results in XLA reductions.
diff --git a/tensorflow/g3doc/experimental/xla/shapes.md b/tensorflow/g3doc/experimental/xla/shapes.md
new file mode 100644
index 0000000000..39e74ff307
--- /dev/null
+++ b/tensorflow/g3doc/experimental/xla/shapes.md
@@ -0,0 +1,150 @@
+# Shapes and Layout
+
+The XLA `Shape` proto
+([xla_data.proto](https://www.tensorflow.org/code/tensorflow/compiler/xla/xla_data.proto))
+describes the rank, size, and data type of an N-dimensional array (*array* for
+short).
+
+## Terminology, Notation, and Conventions
+
+* The rank of an array is equal to the number of dimensions. The *true rank*
+ of an array is the number of dimensions which have a size greater than 1.
+
+* Dimensions are numbered from `0` up to `N-1` for an `N` dimensional array.
+ The dimension numbers are arbitrary labels for convenience. The order of
+ these dimension numbers does not imply a particular minor/major ordering in
+ the layout of the shape. The layout is determined by the `Layout` proto.
+
+* By convention, dimensions are listed in increasing order of dimension
+ number. For example, for a 3-dimensional array of size `[A x B x C]`,
+ dimension 0 has size `A`, dimension 1 has size `B` and dimension 2 has size
+ `C`.
+
+ Some utilities in XLA also support negative indexing, similarly to Python;
+ dimension -1 is the last dimension (equivalent to `N-1` for an `N`
+ dimensional array). For example, for the 3-dimensional array described
+ above, dimension -1 has size `C`, dimension -2 has size `B` and so on.
+
+* Two, three, and four dimensional arrays often have specific letters
+ associated with dimensions. For example, for a 2D array:
+
+ * dimension 0: `y`
+ * dimension 1: `x`
+
+ For a 3D array:
+
+ * dimension 0: `z`
+ * dimension 1: `y`
+ * dimension 2: `x`
+
+ For a 4D array:
+
+ * dimension 0: `p`
+ * dimension 1: `z`
+ * dimension 2: `y`
+ * dimension 3: `x`
+
+* Functions in the XLA API which take dimensions do so in increasing order of
+ dimension number. This matches the ordering used when passing dimensions as
+ an `initializer_list`; e.g.
+
+ `ShapeUtil::MakeShape(F32, {A, B, C, D})`
+
+    will create a shape whose dimension size array consists of the sequence
+ `[A, B, C, D]`.
+
+## Layout
+
+The `Layout` proto describes how an array is represented in memory. The `Layout`
+proto includes the following fields:
+
+```
+message Layout {
+ repeated int64 minor_to_major = 1;
+ repeated int64 padded_dimensions = 2;
+ optional PaddingValue padding_value = 3;
+}
+```
+
+### Minor-to-major dimension ordering
+
+The only required field is `minor_to_major`. This field describes the
+minor-to-major ordering of the dimensions within a shape. Values in
+`minor_to_major` are an ordering of the dimensions of the array (`0` to `N-1`
+for an `N` dimensional array) with the first value being the most-minor
+dimension up to the last value which is the most-major dimension. The most-minor
+dimension is the dimension which changes most rapidly when stepping through the
+elements of the array laid out in linear memory.
+
+For example, consider the following 2D array of size `[2 x 3]`:
+
+```
+a b c
+d e f
+```
+
+Here dimension `0` is size 2, and dimension `1` is size 3. If the
+`minor_to_major` field in the layout is `[0, 1]` then dimension `0` is the
+most-minor dimension and dimension `1` is the most-major dimension. This
+corresponds to the following layout in linear memory:
+
+```
+a d b e c f
+```
+
+This minor-to-major dimension order of `0` up to `N-1` is akin to *column-major*
+(at rank 2). Assuming a monotonic ordering of dimensions, another name we may
+use to refer to this layout in the code is simply "dim 0 is minor".
+
+On the other hand, if the `minor_to_major` field in the layout is `[1, 0]` then
+the layout in linear memory is:
+
+```
+a b c d e f
+```
+
+A minor-to-major dimension order of `N-1` down to `0` for an `N` dimensional
+array is akin to *row-major* (at rank 2). Assuming a monotonic ordering of
+dimensions, another name we may use to refer to this layout in the code is
+simply "dim 0 is major".
+
+#### Default minor-to-major ordering
+
+The default layout for newly created Shapes is "dimension order is
+major-to-minor" (akin to row-major at rank 2).
+
+### Padding
+
+Padding is defined in the optional `padded_dimensions` and `padding_value`
+fields. The field `padded_dimensions` describes the sizes (widths) to which each
+dimension is padded. If present, the number of elements in `padded_dimensions`
+must equal the rank of the shape.
+
+For example, given the `[2 x 3]` array defined above, if `padded_dimensions` is
+`[3, 5]` then dimension 0 is padded to a width of 3 and dimension 1 is padded to
+a width of 5. The layout in linear memory (assuming a padding value of 0 and
+column-major layout) is:
+
+```
+a d 0 b e 0 c f 0 0 0 0 0 0 0
+```
+
+This is equivalent to the layout of the following array with the same
+minor-to-major dimension order:
+
+```
+a b c 0 0
+d e f 0 0
+0 0 0 0 0
+```
+
+### Indexing into arrays
+
+The class `IndexUtil` in
+[index_util.h](https://www.tensorflow.org/code/tensorflow/compiler/xla/index_util.h)
+provides utilities for converting between multidimensional indices and linear
+indices given a shape and layout. A multidimensional index includes an `int64`
+value for each dimension. A linear index is a single `int64` value which
+indexes into the buffer holding the array. See `shape_util.h` and
+`layout_util.h` in the same directory for utilities that simplify creation and
+manipulation of shapes and layouts.
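+
+For example, a minimal Python sketch of the multidimensional-to-linear
+conversion (the C++ `IndexUtil` class provides this and more):
+
+```python
+def multi_to_linear(index, dims, minor_to_major):
+    # Accumulate strides walking from the most-minor to the most-major
+    # dimension.
+    linear, stride = 0, 1
+    for dim in minor_to_major:
+        linear += index[dim] * stride
+        stride *= dims[dim]
+    return linear
+
+# The [2 x 3] example above; element 'e' is at multidimensional index (1, 1).
+print(multi_to_linear((1, 1), (2, 3), (0, 1)))  # 3, layout "a d b e c f"
+print(multi_to_linear((1, 1), (2, 3), (1, 0)))  # 4, layout "a b c d e f"
+```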
diff --git a/tensorflow/g3doc/experimental/xla/tfcompile.md b/tensorflow/g3doc/experimental/xla/tfcompile.md
new file mode 100644
index 0000000000..c99a337a10
--- /dev/null
+++ b/tensorflow/g3doc/experimental/xla/tfcompile.md
@@ -0,0 +1,288 @@
+# Using AOT compilation
+
+## What is tfcompile?
+
+`tfcompile` is a standalone tool that ahead-of-time (AOT) compiles TensorFlow
+graphs into executable code. It can reduce total binary size, and also avoid
+some runtime overheads. A typical use-case of `tfcompile` is to compile an
+inference graph into executable code for mobile devices.
+
+The TensorFlow graph is normally executed by the TensorFlow runtime. This incurs
+some runtime overhead for execution of each node in the graph. This also leads
+to a larger total binary size, since the code for the TensorFlow runtime needs
+to be available, in addition to the graph itself. The executable code produced
+by `tfcompile` does not use the TensorFlow runtime, and only has dependencies on
+kernels that are actually used in the computation.
+
+The compiler is built on top of the XLA framework. The code bridging TensorFlow
+to the XLA framework resides under
+[tensorflow/compiler](https://www.tensorflow.org/code/tensorflow/compiler/),
+which also includes support for [just-in-time (JIT) compilation](jit.md) of
+TensorFlow graphs.
+
+## What does tfcompile do?
+
+`tfcompile` takes a subgraph, identified by the TensorFlow concepts of
+[`feeds`](../../get_started/basic_usage.md#feeds) and
+[`fetches`](../../get_started/basic_usage.md#fetches), and generates a function
+that implements that subgraph. The feeds are the input arguments for the
+function, and the fetches are the output arguments for the function. All inputs
+must be fully specified by the feeds; the resulting pruned subgraph cannot
+contain Placeholder or Variable nodes. It is common to specify all Placeholders
+and Variables as feeds, which ensures the resulting subgraph no longer contains
+these nodes. The generated function is packaged as a `cc_library`, with a header
+file exporting the function signature, and an object file containing the
+implementation. The user writes code to invoke the generated function as
+appropriate.
+
+## Using tfcompile
+
+This section details high level steps for generating an executable binary with
+`tfcompile` from a TensorFlow subgraph. The steps are:
+
+* Step 1: Configure the subgraph to compile
+* Step 2: Use the `tf_library` build macro to compile the subgraph
+* Step 3: Write code to invoke the subgraph
+* Step 4: Create the final binary
+
+### Step 1: Configure the subgraph to compile
+
+Identify the [`feeds`](../../get_started/basic_usage.md#feeds) and
+[`fetches`](../../get_started/basic_usage.md#fetches) of the graph, which
+correspond to the input and output arguments for the generated function. Then
+configure the feeds and fetches in a
+[`tensorflow.tfcompile.Config`](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile.proto)
+proto.
+
+```textproto
+# Each feed is a positional input argument for the generated function. The order
+# of each entry matches the order of each input argument. Here “x_hold” and “y_hold”
+# refer to the names of placeholder nodes defined in the graph.
+feed {
+ id { node_name: "x_hold" }
+ shape {
+ dim { size: 2 }
+ dim { size: 3 }
+ }
+}
+feed {
+ id { node_name: "y_hold" }
+ shape {
+ dim { size: 3 }
+ dim { size: 2 }
+ }
+}
+
+# Each fetch is a positional output argument for the generated function. The order
+# of each entry matches the order of each output argument. Here “x_y_prod”
+# refers to the name of a matmul node defined in the graph.
+fetch {
+ id { node_name: "x_y_prod" }
+}
+```
+
+### Step 2: Use tf_library build macro to compile the subgraph
+
+This step converts the graph into a `cc_library` using the `tf_library` build
+macro. The `cc_library` consists of an object file containing the code generated
+from the graph, along with a header file that gives access to the generated
+code. `tf_library` utilizes `tfcompile` to compile the TensorFlow graph into
+executable code.
+
+```build
+load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
+
+# Use the tf_library macro to compile your graph into executable code.
+tf_library(
+ # name is used to generate the following underlying build rules:
+ # <name> : cc_library packaging the generated header and object files
+ # <name>_test : cc_test containing a simple test and benchmark
+ # <name>_benchmark : cc_binary containing a stand-alone benchmark with minimal deps;
+ # can be run on a mobile device
+ name = "test_graph_tfmatmul",
+ # cpp_class specifies the name of the generated C++ class, with namespaces allowed.
+ # The class will be generated in the given namespace(s), or if no namespaces are
+ # given, within the global namespace.
+ cpp_class = "foo::bar::MatMulComp",
+ # graph is the input GraphDef proto, by default expected in binary format. To
+ # use the text format instead, just use the ‘.pbtxt’ suffix. A subgraph will be
+ # created from this input graph, with feeds as inputs and fetches as outputs.
+ # No Placeholder or Variable ops may exist in this subgraph.
+ graph = "test_graph_tfmatmul.pb",
+ # config is the input Config proto, by default expected in binary format. To
+ # use the text format instead, use the ‘.pbtxt’ suffix. This is where the
+ # feeds and fetches were specified above, in the previous step.
+ config = "test_graph_tfmatmul.config.pbtxt",
+)
+```
+
+> To generate the GraphDef proto (test_graph_tfmatmul.pb) for this example, run
+> [make_test_graphs.py](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/make_test_graphs.py)
+> and specify the output location with the `--out_dir` flag.
+
+Typical graphs contain [`Variables`](../../api_docs/python/state_ops.md)
+representing the weights that are learned via training, but `tfcompile` cannot
+compile a subgraph that contains `Variables`. The
+[freeze_graph.py](https://www.tensorflow.org/code/tensorflow/python/tools/freeze_graph.py)
+tool converts variables into constants, using values stored in a checkpoint
+file. As a convenience, the `tf_library` macro supports the `freeze_checkpoint`
+argument, which runs the tool. For more examples see
+[tensorflow/compiler/aot/tests/BUILD](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/BUILD).
+
+> Constants that show up in the compiled subgraph are compiled directly into the
+> generated code. To pass the constants into the generated function, rather than
+> having them compiled-in, simply pass them in as feeds.
+
+For details on the `tf_library` build macro, see
+[tfcompile.bzl](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile.bzl).
+
+For details on the underlying `tfcompile` tool, see
+[tfcompile_main.cc](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile_main.cc).
+
+### Step 3: Write code to invoke the subgraph
+
+This step uses the header file (`test_graph_tfmatmul.h`) generated by the
+`tf_library` build macro in the previous step to invoke the generated code. The
+header file is located in the `bazel-genfiles` directory corresponding to the
+build package, and is named based on the name attribute set on the `tf_library`
+build macro. For example, the header generated for `test_graph_tfmatmul` would
+be `test_graph_tfmatmul.h`. Below is an abbreviated version of what is
+generated. The generated file, in `bazel-genfiles`, contains additional useful
+comments.
+
+```c++
+namespace foo {
+namespace bar {
+
+// MatMulComp represents a computation previously specified in a
+// TensorFlow graph, now compiled into executable code.
+class MatMulComp {
+ public:
+ // AllocMode controls the buffer allocation mode.
+ enum class AllocMode {
+ ARGS_RESULTS_AND_TEMPS, // Allocate arg, result and temp buffers
+ RESULTS_AND_TEMPS_ONLY, // Only allocate result and temp buffers
+ };
+
+ MatMulComp(AllocMode mode = AllocMode::ARGS_RESULTS_AND_TEMPS);
+ ~MatMulComp();
+
+ // Runs the computation, with inputs read from arg buffers, and outputs
+ // written to result buffers. Returns true on success and false on failure.
+ bool Run();
+
+ // Arg methods for managing input buffers. Buffers are in row-major order.
+ // There is a set of methods for each positional argument.
+ void** args();
+
+ void set_arg0_data(float* data);
+ float* arg0_data();
+ float& arg0(size_t dim0, size_t dim1);
+
+ void set_arg1_data(float* data);
+ float* arg1_data();
+ float& arg1(size_t dim0, size_t dim1);
+
+ // Result methods for managing output buffers. Buffers are in row-major order.
+ // Must only be called after a successful Run call. There is a set of methods
+ // for each positional result.
+ void** results();
+
+ float* result0_data();
+ float& result0(size_t dim0, size_t dim1);
+};
+
+} // end namespace bar
+} // end namespace foo
+```
+
+The generated C++ class is called `MatMulComp` in the `foo::bar` namespace,
+because that was the `cpp_class` specified in the `tf_library` macro. All
+generated classes have a similar API, with the only difference being the methods
+to handle arg and result buffers. Those methods differ based on the number and
+types of the buffers, which were specified by the `feed` and `fetch` arguments
+to the `tf_library` macro.
+
+There are three types of buffers managed within the generated class: `args`
+representing the inputs, `results` representing the outputs, and `temps`
+representing temporary buffers used internally to perform the computation. By
+default, each instance of the generated class allocates and manages all of these
+buffers for you. The `AllocMode` constructor argument may be used to change this
+behavior. A convenience library is provided in
+[`tensorflow/compiler/aot/runtime.h`](https://www.tensorflow.org/code/tensorflow/compiler/aot/runtime.h)
+to help with manual buffer allocation; usage of this library is optional. All
+buffers should be aligned to 32-byte boundaries.
+
+The generated C++ class is just a wrapper around the low-level code generated by
+XLA.
+
+Example of invoking the generated function based on
+[`tfcompile_test.cc`](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/tfcompile_test.cc):
+
+```c++
+#define EIGEN_USE_THREADS
+#define EIGEN_USE_CUSTOM_THREAD_POOL
+
+#include <iostream>
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h" // generated
+
+int main(int argc, char** argv) {
+ Eigen::ThreadPool tp(2); // Size the thread pool as appropriate.
+ Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());
+
+ foo::bar::MatMulComp matmul;
+ matmul.set_thread_pool(&device);
+
+ // Set up args and run the computation.
+ const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
+ std::copy(args + 0, args + 6, matmul.arg0_data());
+ std::copy(args + 6, args + 12, matmul.arg1_data());
+ matmul.Run();
+
+ // Check result
+ if (matmul.result0(0, 0) == 58) {
+ std::cout << "Success" << std::endl;
+ } else {
+ std::cout << "Failed. Expected value 58 at 0,0. Got:"
+ << matmul.result0(0, 0) << std::endl;
+ }
+
+ return 0;
+}
+```
+
+### Step 4: Create the final binary
+
+This step combines the library generated by `tf_library` in step 2 and the code
+written in step 3 to create a final binary. Below is an example `bazel` BUILD
+file.
+
+```build
+# Example of linking your binary
+# Also see //third_party/tensorflow/compiler/aot/tests/BUILD
+load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
+
+# The same tf_library call from step 2 above.
+tf_library(
+ name = "test_graph_tfmatmul",
+ ...
+)
+
+# The executable code generated by tf_library can then be linked into your code.
+cc_binary(
+ name = "my_binary",
+ srcs = [
+ "my_code.cc", # include test_graph_tfmatmul.h to access the generated header
+ ],
+ deps = [
+ ":test_graph_tfmatmul", # link in the generated object file
+ "//third_party/eigen3",
+ ],
+ linkopts = [
+ "-lpthread",
+ ]
+)
+```
diff --git a/tensorflow/g3doc/resources/xla_prerelease.md b/tensorflow/g3doc/resources/xla_prerelease.md
deleted file mode 100644
index 1a6cf1d5a4..0000000000
--- a/tensorflow/g3doc/resources/xla_prerelease.md
+++ /dev/null
@@ -1,2223 +0,0 @@
-# XLA: The TensorFlow compiler framework
-
-This document describes a compiler framework for linear algebra called *XLA*
-that will be released as part of TensorFlow. Most users of TensorFlow will not
-invoke XLA directly, but will benefit from it through improvements in speed,
-memory usage, and portability.
-
-We are providing this preview for parties who are interested in details of
-TensorFlow compilation and may want to provide feedback. We will provide more
-documentation with the code release.
-
-[TOC]
-
-## Compiling TensorFlow
-
-The XLA compilation framework is invoked on subgraphs of TensorFlow
-computations. The framework requires all tensor shapes to be fixed, so compiled
-code is specialized to concrete shapes. This means, for example, that the
-compiler may be invoked multiple times for the same subgraph if it is executed
-on batches of different sizes. We had several goals in mind when designing the
-TensorFlow compilation strategy:
-
-* *Improved execution speed.* Compiling subgraphs reduces the execution time
- of short-lived Ops by eliminating overhead from the TensorFlow runtime. The
- framework also fuses pipelined operations, reducing memory
- overheads. Specializing to known tensor shapes improves performance by
- allowing more aggressive constant propagation.
-* *Improved tensor buffer memory usage.* The compiler framework has an
- opportunity to analyze and schedule memory usage, in principle eliminating
- many intermediate storage buffers.
-* *Reduce reliance on custom Ops.* Many TensorFlow custom Ops are equivalent
- to subgraphs of existing lower-level Ops. By focusing on the preceding two
- goals we aim as far as possible to make the performance of low-level Ops be
- the same as that of hand-written fused implementations, removing the need
- for many custom Ops.
-* *Much improved mobile footprint.* When the compiled subgraph is an entire
- TensorFlow computation, it is possible to eliminate the TensorFlow runtime
- altogether and simply emit an object/header file pair that can be linked
- directly into another application. This is particularly useful for mobile
- inference, and can reduce the footprint of a TensorFlow computation by
- several orders of magnitude.
-* *Improved portability.* The compiler-based framework is designed to target
- different back-end hardware, including a variety of CPUs, GPUs, and custom
- accelerator hardware such as TPUs. The CPU and GPU back-ends currently use
- LLVM, while the internal Google TPU back-end (which will not be open-sourced
- at this time) uses custom code generation. The goal for this and other
- accelerators is that it should be relatively easy to write a new back-end
- for novel hardware, at which point a large fraction of TensorFlow programs
- will run unmodified on that hardware. This is in contrast with the approach
- of specializing individual monolithic Ops for new hardware, which requires
- TensorFlow programs to be rewritten to make use of those Ops.
-
-## XLA: Accelerated Linear Algebra
-
-XLA is a domain-specific compiler for linear algebra. The semantics of
-operations are high level, e.g., arbitrary sized vector and matrix
-operations. This makes the compiler easy to target from TensorFlow, and
-preserves enough information to allow sophisticated scheduling and optimization.
-The following tutorial provides introductory information about XLA. More details
-follow in the [Operation Semantics](#operation_semantics) section.
-
-It is important to note that the XLA framework is not set in stone. In
-particular, while it is unlikely that the semantics of existing operations will
-be changed, it is expected that more operations will be added as necessary to
-cover important use cases, and we welcome feedback from the community about
-missing functionality.
-
-### Getting started - basic example
-
-The following code sample shows how to use XLA to compute a simple vector
-expression: $$\alpha x+y$$ ("axpy").
-
-This sample presents a self-contained function - `ComputeAxpyParameters`, that
-takes data as input, uses XLA to build a graph to compute the expression and
-returns the resulting data.
-
-This is done in several steps:
-
-1. Construct an XLA graph that encodes the expression we want to compute.
- The graph's nodes are XLA operations (sometimes called "ops" or HLOs for
- "high-level operations"), and its edges represent the data flow between
- operations.
-2. Ask XLA to create a "computation" based on this graph. This process
- JIT-compiles the graph into optimized native code for the chosen platform
- and returns a handle.
-3. Use the computation handle and the input data to calculate the result.
-
-The XLA graph we construct for axpy is:
-<div style="text-align:center">
-![axpy params graph](../images/xla-axpy-tutorial-params.svg)
-</div>
-
-Note that all operations have predefined [shapes](#shapes_and_layout). A shape
-describes the rank of the array, the size of each dimension and the primitive
-element type. For example, `f32[10]` is a rank-1 array of single-precision
-floats. `f32[]` is a single-precision float scalar.
-
-In XLA, shapes are statically determined, including the size of each
-dimension in an array. This permits the XLA compiler to produce very
-efficient code for all backends. When constructing the graph, only the shapes of
-input nodes (parameters or constants) have to be provided explicitly - the rest
-is automatically inferred by XLA; therefore, the burden on the developer is
-minimal.
-
-Here is the part of the axpy sample code that constructs the graph (step 1):
-
-```c++
-std::unique_ptr<xla::Literal> ComputeAxpyParameters(
- const xla::Literal& alpha, const xla::Literal& x,
- const xla::Literal& y) {
- // Get the singleton handle for an XLA client library and create a new
- // computation builder.
- xla::Client* client(xla::ClientLibrary::ClientLibraryOrDie());
- xla::ComputationBuilder builder(client, "axpy");
-
- // Build the actual XLA computation graph. It's a function taking
- // three parameters and computing a single output.
- auto param_alpha = builder.Parameter(0, alpha.shape(), "alpha");
- auto param_x = builder.Parameter(1, x.shape(), "x");
- auto param_y = builder.Parameter(2, y.shape(), "y");
- auto axpy = builder.Add(builder.Mul(param_alpha, param_x), param_y);
-```
-
-XLA features a client-server design. `xla::ClientLibrary` provides a
-simple way to instantiate an XLA server in the backend and connect to it with
-an `xla::Client` object.
-
-The `ComputationBuilder` class provides a convenient programming interface to
-construct XLA computations. The semantics of XLA operations with links
-to `ComputationBuilder` methods are documented in
-[Operation Semantics](#operation_semantics).
-
-Here is the part that JIT-compiles the graph (step 2):
-
-```c++
- // We're done building the graph. Create a computation on the server.
- util::StatusOr<std::unique_ptr<xla::Computation>> computation_status =
- builder.Build();
- std::unique_ptr<xla::Computation> computation =
- computation_status.ConsumeValueOrDie();
-```
-
-Here is the part that runs the compiled code on the input (step 3):
-
-```c++
- // Transfer the parameters to the server and get data handles that refer to
- // them.
- std::unique_ptr<xla::GlobalData> alpha_data =
- client->TransferToServer(alpha).ConsumeValueOrDie();
- std::unique_ptr<xla::GlobalData> x_data =
- client->TransferToServer(x).ConsumeValueOrDie();
- std::unique_ptr<xla::GlobalData> y_data =
- client->TransferToServer(y).ConsumeValueOrDie();
-
- // Now we have all we need to execute the computation on the device. We get
- // the result back in the form of a Literal.
- util::StatusOr<std::unique_ptr<xla::Literal>> result_status =
- client->ExecuteAndTransfer(
- *computation, {alpha_data.get(), x_data.get(), y_data.get()});
- return result_status.ConsumeValueOrDie();
-}
-```
-
-There is one thing noticeably absent from the above code: no specification of
-the device to use. The choice of device is orthogonal to the computation
-specified and can be selected by choosing the appropriate service plugin.
-
-### Moving data into and out of XLA
-
-The main way to move data into and out of XLA is by populating
-`xla::Literal` objects. This enables maximal generality for the XLA
-client-server model of computation. When the service is running in the same
-process as the client, the `xla::Client::TransferInProcess` method may be
-used to transfer arrays to and from the service more efficiently.
-
-### Constants vs. parameters
-
-For the simple axpy computation we've seen earlier, we can construct an
-alternative XLA graph:
-<div style="text-align:center">
-![axpy constants graph](../images/xla-axpy-tutorial-constants.svg)
-</div>
-
-The code to construct and run this computation is:
-
-```c++
-std::unique_ptr<xla::Literal> ComputeAxpyConstants(
- float alpha, gtl::ArraySlice<float> x,
- gtl::ArraySlice<float> y) {
- // Get the singleton handle for an XLA client library and create a new
- // computation builder.
- xla::Client* client(xla::ClientLibrary::ClientLibraryOrDie());
- xla::ComputationBuilder builder(client, "axpy");
-
- auto constant_alpha = builder.ConstantR0<float>(alpha);
- auto constant_x = builder.ConstantR1<float>(x);
- auto constant_y = builder.ConstantR1<float>(y);
- auto axpy = builder.Add(builder.Mul(constant_alpha, constant_x), constant_y);
-
- // We're done building the graph. Tell the server to create a Computation from
- // it, and then execute this computation on the device, transferring the
- // result back as a literal.
- util::StatusOr<std::unique_ptr<xla::Computation>> computation_status =
- builder.Build();
- std::unique_ptr<xla::Computation> computation =
- computation_status.ConsumeValueOrDie();
- // No need to pass arguments into the computation since it accepts no
- // parameters.
- util::StatusOr<std::unique_ptr<xla::Literal>> result_status =
- client->ExecuteAndTransfer(*computation, {});
- return result_status.ConsumeValueOrDie();
-}
-```
-
-This computation has no user-provided inputs - the inputs are constants that are
-embedded into the graph itself. It highlights an important design tradeoff that
-should be considered when using XLA.
-
-XLA is a JIT compiler. An XLA graph is created during the runtime of the host
-program, and JIT-compiled to native code for the desired backend(s). This
-compilation may take a non-trivial amount of time, which presents a tradeoff.
-
-Many uses will want to compile a single graph and then run it repeatedly with
-different inputs. This is what `parameter` ops are most suitable for. Re-running
-the computation with different data doesn't require recompiling the graph.
-Sometimes, however, some of the inputs may be constant (or at least constant
-throughout some subset of the host program's runtime). In those cases, it makes
-sense to create an XLA graph where these inputs are `constant` ops instead of
-parameters. This will permit the XLA compiler to perform constant folding
-and other advanced optimizations that may result in significantly more efficient
-code. On the other hand, this means a computation needs to be recompiled every
-time the "constant" value actually needs to change.
-
-## Shapes and Layout
-
-The XLA `Shape` proto describes the rank, size, and data type of an
-N-dimensional array (*array* for short).
-
-### Terminology, Notation, and Conventions
-
-* The rank of an array is equal to the number of dimensions. The *true rank*
- of an array is the number of dimensions which have a size greater than 1.
-
-* Dimensions are numbered from `0` up to `N-1` for an `N` dimensional array.
-The dimension numbers are simply convenient labels. The order of these
- dimension numbers does not imply a particular minor/major ordering in the
- layout of the shape. The layout is determined by the `Layout` proto.
-
-* By convention, dimensions are listed in increasing order of dimension
- number. For example, for a 3-dimensional array of size `[A x B x C]`,
- dimension 0 has size `A`, dimension 1 has size `B` and dimension 2 has size
- `C`.
-
-* Two, three, and four dimensional arrays often have specific letters
- associated with dimensions. For example, for a 2D array:
-
- * dimension 0: `y`
- * dimension 1: `x`
-
- For a 3D array:
-
- * dimension 0: `z`
- * dimension 1: `y`
- * dimension 2: `x`
-
- For a 4D array:
-
- * dimension 0: `p`
- * dimension 1: `z`
- * dimension 2: `y`
- * dimension 3: `x`
-
-* Functions in the XLA API which take dimensions do so in increasing order
- of dimension number. This matches the ordering used when passing dimensions
- as an `initializer_list`; e.g.
-
- `ShapeUtil::MakeShape(F32, {A, B, C, D})`
-
- will create a shape whose dimension array consists of the sequence `[A, B,
- C, D]`.
-
-### Layout
-
-The `Layout` proto describes how an array is represented in memory. The `Layout`
-proto includes the following fields:
-
-```
-message Layout {
- repeated int64 minor_to_major = 1;
- repeated int64 padded_dimensions = 2;
- optional PaddingValue padding_value = 3;
-}
-```
-
-#### Minor-to-major dimension ordering
-
-The only required field is `minor_to_major`. This field describes the
-minor-to-major ordering of the dimensions within a shape. Values in
-`minor_to_major` are an ordering of the dimensions of the array (`0` to `N-1`
-for an `N` dimensional array) with the first value being the most-minor
-dimension up to the last value which is the most-major dimension. The most-minor
-dimension is the dimension which changes most rapidly when stepping through the
-elements of the array laid out in linear memory.
-
-For example, consider the following 2D array of size `[2 x 3]`:
-
-```
-a b c
-d e f
-```
-
-Here dimension `0` is size 2, and dimension `1` is size 3. If the
-`minor_to_major` field in the layout is `[0, 1]` then dimension `0` is the
-most-minor dimension and dimension `1` is the most-major dimension. This
-corresponds to the following layout in linear memory:
-
-```
-a d b e c f
-```
-
-This minor-to-major dimension order of `0` up to `N-1` is akin to *column-major*
-(at rank 2). Assuming a monotonic ordering of dimensions, another name we may
-use to refer to this layout in the code is simply "dim 0 is minor".
-
-On the other hand, if the `minor_to_major` field in the layout is `[1, 0]` then
-the layout in linear memory is:
-
-```
-a b c d e f
-```
-
-A minor-to-major dimension order of `N-1` down to `0` for an `N` dimensional
-array is akin to *row-major* (at rank 2). Assuming a monotonic ordering of
-dimensions, another name we may use to refer to this layout in the code is
-simply "dim 0 is major".
-
-#### Padding
-
-Padding is defined in the optional `padded_dimensions` and `padding_value`
-fields. The field `padded_dimensions` describes the sizes (widths) to which each
-dimension is padded. If present, the number of elements in `padded_dimensions`
-must equal the rank of the shape.
-
-For example, given the `[2 x 3]` array defined above, if `padded_dimensions` is
-`[3, 5]` then dimension 0 is padded to a width of 3 and dimension 1 is padded to
-a width of 5. The layout in linear memory (assuming a padding value of 0 and
-column-major layout) is:
-
-```
-a d 0 b e 0 c f 0 0 0 0 0 0 0
-```
-
-This is equivalent to the layout of the following array with the same
-minor-to-major dimension order:
-
-```
-a b c 0 0
-d e f 0 0
-0 0 0 0 0
-```
-
-## Operation Semantics
-
-The following describes the semantics of operations defined in the
-`ComputationBuilder` interface.
-
-A note on nomenclature: the generalized data type XLA deals with is an
-N-dimensional array holding elements of some uniform type (such as 32-bit
-float). Throughout the documentation, we use *array* to denote an
-arbitrary-dimensional array. For convenience, special cases have more specific
-and familiar names; for example a *vector* is a 1-dimensional array and a
-*matrix* is a 2-dimensional array.
-
-### Broadcast
-
-Adds dimensions to an array by duplicating the data in the array.
-
-<b> `Broadcast(operand, broadcast_sizes)` </b>
-
-Arguments | Type | Semantics
------------------ | ----------------------- | -------------------------------
-`operand` | `ComputationDataHandle` | The array to duplicate
-`broadcast_sizes` | `ArraySlice<int64>` | The sizes of the new dimensions
-
-The new dimensions are inserted on the left, i.e. if `broadcast_sizes` has
-values `{a0, ..., aN}` and the operand shape has dimensions `{b0, ..., bM}` then
-the shape of the output has dimensions `{a0, ..., aN, b0, ..., bM}`.
-
-The new dimensions index into copies of the operand, i.e.
-
-```
-output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM]
-```
-
-For example, if `operand` is a scalar `f32` with value `2.0f`, and
-`broadcast_sizes` is `{2, 3}`, then the result will be an array with shape
-`f32[2, 3]` and all the values in the result will be `2.0f`.
-
-### Collapse
-
-See also `ComputationBuilder::Collapse` and the [`Reshape`](#reshape) operation.
-
-Collapses dimensions of an array into one dimension.
-
-<b> `Collapse(operand, dimensions)` </b>
-
-| Arguments | Type | Semantics |
-| ------------ | ----------------------- | ----------------------------------- |
-| `operand` | `ComputationDataHandle` | array of type T |
-| `dimensions` | `int64` vector | in-order, consecutive subset of T's |
-: : : dimensions. :
-
-Collapse replaces the given subset of the operand's dimensions by a single
-dimension. The input arguments are an arbitrary array of type T and a
-compile-time-constant vector of dimension indices. The dimension indices must be
-an in-order (low to high dimension numbers), consecutive subset of T's
-dimensions. Thus, {0, 1, 2}, {0, 1}, or {1, 2} are all valid dimension sets, but
-{1, 0} or {0, 2} are not. They are replaced by a single new dimension, in the
-same position in the dimension sequence as those they replace, with the new
-dimension size equal to the product of original dimension sizes. The lowest
-dimension number in `dimensions` is the slowest varying dimension (most major)
-in the loop nest which collapses these dimensions, and the highest dimension
-number is fastest varying (most minor). See the [`Reshape`](#reshape) operator
-if more general collapse ordering is needed.
-
-For example, let v be an array of 24 elements:
-
-```
-let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}},
- {{20, 21, 22}, {25, 26, 27}},
- {{30, 31, 32}, {35, 36, 37}},
- {{40, 41, 42}, {45, 46, 47}}};
-
-// Collapse to a single dimension, leaving one dimension.
-let v012 = Collapse(v, {0,1,2});
-then v012 == f32[24] {10, 11, 12, 15, 16, 17,
- 20, 21, 22, 25, 26, 27,
- 30, 31, 32, 35, 36, 37,
- 40, 41, 42, 45, 46, 47};
-
-// Collapse the two lower dimensions, leaving two dimensions.
-let v01 = Collapse(v, {0,1});
-then v01 == f32[4x6] {{10, 11, 12, 15, 16, 17},
- {20, 21, 22, 25, 26, 27},
- {30, 31, 32, 35, 36, 37},
- {40, 41, 42, 45, 46, 47}};
-
-// Collapse the two higher dimensions, leaving two dimensions.
-let v12 = Collapse(v, {1,2});
-then v12 == f32[8x3] {{10, 11, 12},
- {15, 16, 17},
- {20, 21, 22},
- {25, 26, 27},
- {30, 31, 32},
- {35, 36, 37},
- {40, 41, 42},
- {45, 46, 47}};
-
-```
-
-### Concatenate
-
-See also `ComputationBuilder::ConcatInDim`
-
-Concatenate composes an array from multiple array operands. The array is of the
-same rank as each of the input array operands (which must be of the same rank as
-each other) and contains the arguments in the order that they were specified.
-
-<b> `Concatenate(operands..., dimension)` </b>
-
-| Arguments | Type | Semantics |
-| ----------- | ----------------------- | ------------------------------------ |
-| `operands` | sequence of N | N arrays of type T with dimensions |
-: : `ComputationDataHandle` : [L0, L1, ...] :
-| `dimension` | `int64` | A value in the interval `[0, N)` |
-: : : that names the dimension to be :
-: : : concatenated between the `operands`. :
-
-With the exception of `dimension`, all dimensions must be the same. This is
-because XLA does not support "ragged" arrays -- the dimension which is being
-concatenated must be the only one that differs between the operands. Also note
-that rank-0 values cannot be concatenated (as it's impossible to name the
-dimension along which the concatenation occurs).
-
-1-dimensional example:
-
-```
-Concat({{2, 3}, {4, 5}, {6, 7}}, 0)
->>> {2, 3, 4, 5, 6, 7}
-```
-
-2-dimensional example:
-
-```
-let a = {
- {1, 2},
- {3, 4},
- {5, 6},
-};
-let b = {
- {7, 8},
-};
-Concat({a, b}, 0)
->>> {
- {1, 2},
- {3, 4},
- {5, 6},
- {7, 8},
-}
-```
-
-Diagram:
-
-<center><iframe src="../images/xla-concatenate.svg" width="920" height="400" title="Concatenate Diagram" frameborder="0"></iframe></center>
-
-### ConvertElementType
-
-See `ComputationBuilder::ConvertElementType`
-
-Similar to an element-wise `static_cast` in C++, performs an element-wise
-conversion operation from a data shape to a target shape. The dimensions must
-match, and the conversion is an element-wise one; e.g. `s32` elements become
-`f32` elements via an `s32`-to-`f32` conversion routine.
-
-<b> `ConvertElementType(operand, new_element_type)` </b>
-
-Arguments | Type | Semantics
------------------- | ----------------------- | ---------------------------
-`operand` | `ComputationDataHandle` | array of type T with dims D
-`new_element_type` | `PrimitiveType` | type U
-
-If the dimensions of the operand and the target shape do not match, or an
-invalid conversion is requested (e.g. to/from a tuple), an error will be
-produced.
-
-A conversion such as `T=s32` to `U=f32` will perform a normalizing int-to-float
-conversion routine such as round-to-nearest-even.
-
-> Note: The precise float-to-int and vice-versa conversions are currently
-> unspecified, but may become additional arguments to the convert operation in
-> the future.
-
-```
-let a: s32[3] = {0, 1, 2};
-let b: f32[3] = convert(a, f32);
-then b == f32[3]{0.0, 1.0, 2.0}
-```
-
-### Conv (convolution)
-
-See `ComputationBuilder::Conv`
-
-As ConvWithGeneralPadding, but the padding is specified in a short-hand way as
-either SAME or VALID. SAME padding pads the input (`lhs`) with zeroes so that
-the output has the same shape as the input when not taking striding into
-account. VALID padding simply means no padding.
-
-### ConvWithGeneralPadding (convolution)
-
-See `ComputationBuilder::ConvWithGeneralPadding`
-
-Computes a convolution of the kind used in neural networks. Here, a convolution
-can be thought of as a 2d window moving across a 2d base area and a computation
-is performed for each possible position of the window.
-
-| Arguments | Type | Semantics |
-| ---------------- | ----------------------- | ------------------------------- |
-| `lhs` | `ComputationDataHandle` | rank-4 array of inputs |
-| `rhs` | `ComputationDataHandle` | rank-4 array of kernel weights |
-| `window_strides` | `ArraySlice<int64>` | 2d array of kernel strides |
-| `padding` | `ArraySlice<pair<int64, | 2d array of (low, high) padding |
-: : int64>>` : :
-
-The `lhs` argument is a rank 4 array describing the base area. We will call this
-the input, even though of course the rhs is also an input. In a neural network,
-these are the input activations. The 4 dimensions are, in this order:
-
-* `batch`: Each coordinate in this dimension represents an independent input
- for which convolution is carried out.
-* `z/depth/features`: Each (y,x) position in the base area has a vector
- associated with it, which goes into this dimension.
-* `y` and `x`: Describes the two spatial dimensions that define the 2d base
- area that the window moves across.
-
-The `rhs` argument is a rank 4 array describing the convolutional
-filter/kernel/window. The dimensions are, in this order:
-
-* `output-z`: The `z` dimension of the output.
-* `input-z`: The size of this dimension should equal the size of the `z`
- dimension in lhs.
-* `y` and `x`: Describes the two spatial dimensions that define the 2d window
- that moves across the base area.
-
-The `window_strides` argument specifies the stride of the convolutional window
-in the `y` and `x` dimensions. For example, if the stride in dimension `y` is 3,
-then the window can only be placed at coordinates where the `y` index is
-divisible by 3.
-
-The `padding` argument specifies the amount of zero padding to be applied to the
-base area. `padding[0]` specifies the padding for dimension `y` and `padding[1]`
-specifies the padding for dimension `x`. Each pair has the low padding as the
-first element and the high padding as the second element. The low padding is
-applied in the direction of lower indices while the high padding is applied in
-the direction of higher indices. For example, if `padding[1]` is `(2,3)` then
-there will be a padding by 2 zeroes on the left and by 3 zeroes on the right in
-the `x` dimension. Using padding is equivalent to inserting those same zero
-values into the input (`lhs`) before doing the convolution.
-
-The output shape has these dimensions, in this order:
-
-* `batch`: Same size as `batch` on the input (`lhs`).
-* `z`: Same size as `output-z` on the kernel (`rhs`).
-* `y` and `x`: One value for each valid placement of the convolutional window.
-
-The valid placements of the convolutional window are determined by the strides
-and the size of the base area after padding.
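-
-For a single spatial dimension, the number of valid placements follows the
-usual formula (a sketch using integer division, not quoted from the XLA
-implementation):
-
-```
-padded_size = input_size + pad_low + pad_high
-output_size = (padded_size - kernel_size) / stride + 1
-```
-
-For example, an input of size 5 with padding (1, 1), a kernel of size 3, and a
-stride of 2 yields (5 + 1 + 1 - 3) / 2 + 1 = 3 output positions.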
-
-To describe what a convolution does, pick some fixed `batch`, `z`, `y`, `x`
-coordinates in the output. Then `(y,x)` is a position of a corner of the window
-within the base area (e.g. the upper left corner, depending on how you interpret
-the spatial dimensions). We now have a 2d window, taken from the base area,
-where each 2d point is associated to a 1d vector, so we get a 3d box. From the
-convolutional kernel, since we fixed the output coordinate `z`, we also have a
-3d box. The two boxes have the same dimensions, so we can take the sum of the
-element-wise products between the two boxes (similar to a dot product). That is
-the output value.
-
-Note that if `output-z` is e.g. 5, then each position of the window produces 5
-values in the output into the `z` dimension of the output. These values differ
-in what part of the convolutional kernel is used - there is a separate 3d box of
-values used for each `output-z` coordinate. So you could think of it as 5
-separate convolutions with a different filter for each of them.
-
-Here is pseudo-code for a convolution with padding and striding:
-
-```
-for (b, oz, oy, ox) { // output coordinates
- value = 0;
- for (iz, ky, kx) { // kernel coordinates and input z
- iy = oy*stride_y + ky - pad_low_y;
- ix = ox*stride_x + kx - pad_low_x;
- if ((iy, ix) inside the base area considered without padding) {
- value += input(b, iz, iy, ix) * kernel(oz, iz, ky, kx);
- }
- }
- output(b, oz, oy, ox) = value;
-}
-```
-
-### Dot
-
-See also `ComputationBuilder::Dot`
-
-<b> `Dot(lhs, rhs)` </b>
-
-Arguments | Type | Semantics
---------- | ----------------------- | ---------------
-`lhs` | `ComputationDataHandle` | array of type T
-`rhs` | `ComputationDataHandle` | array of type T
-
-The exact semantics of this operation depend on the ranks of the operands:
-
-| Input | Output | Semantics |
-| ----------------------- | --------------------- | ----------------------- |
-| scalar `dot` scalar | scalar | scalar multiplication |
-| vector [n] `dot` vector | scalar | vector dot product |
-: [n] : : :
-| matrix [m x k] `dot` | vector [m] | matrix-vector |
-: vector [k] : : multiplication :
-| matrix [m x k] `dot` | matrix [m x n] | matrix-matrix |
-: matrix [k x n] : : multiplication :
-| array [p x q x r] `dot` | array [p x q x s x t] | array dot product (read |
-: array [s x r x t] : : below) :
-
-The operation performs a sum of products over the last dimension of `lhs` and
-the second-to-last dimension of `rhs`. These are the "contracted" dimensions.
-If the dimension to contract does not exist, the last dimension is contracted
-instead; this happens when the `rhs` operand is a vector. A scalar operand
-simply scales the other operand, with no contraction. The contracted
-dimensions of `lhs` and `rhs` must be of the same size.
-
-The rank of the result array is `max(rank(lhs) - 1, 0) + max(rank(rhs) - 1, 0)`.
-The result dimensions are ordered in the original order within each operand,
-with the non-contracted `lhs` dimensions followed by the non-contracted `rhs`
-dimensions. For example, a dot product of two arrays `[p x q x r]` and `[s x r x
-t]` produces a 4 dimensional array of `[p x q x s x t]` by contracting the
-dimension of size `r`.
-
-Notes:
-
-1. This follows the typical definition of a dot operator, as in other numeric
- libraries such as [numpy](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html).
-2. There is currently no support for the general tensor dot operator
- [numpy.tensordot](http://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot).
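-
-As a worked rank-2 illustration (values chosen here for exposition):
-
-```
-let a = f32[2x3] {{1, 2, 3},
-                  {4, 5, 6}};
-let b = f32[3x2] {{7, 8},
-                  {9, 10},
-                  {11, 12}};
-then Dot(a, b) == f32[2x2] {{58, 64},
-                            {139, 154}};
-```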
-
-### Element-wise binary arithmetic operations
-
-See also `ComputationBuilder::Add`
-
-A set of element-wise binary arithmetic operations is supported.
-
-<b> `Op(lhs, rhs)` </b>
-
-Where `Op` is one of `Add` (addition), `Sub` (subtraction), `Mul`
-(multiplication), `Div` (division), `Rem` (remainder), `Max` (maximum), `Min`
-(minimum).
-
-Arguments | Type | Semantics
---------- | ----------------------- | ----------------------------------------
-`lhs` | `ComputationDataHandle` | left-hand-side operand: array of type T
-`rhs` | `ComputationDataHandle` | right-hand-side operand: array of type T
-
-The arguments' shapes have to be either similar or compatible. See the
-[broadcasting](#broadcasting_semantics) documentation about what it means for
-shapes to be compatible. The result of an operation has a shape which is the
-result of broadcasting the two input arrays. In this variant, operations between
-arrays of different ranks are *not* supported, unless one of the operands is a
-scalar.
-
-When `Op` is `Rem`, the sign of the result is taken from the dividend.
-
-An alternative variant with different-rank broadcasting support exists for these
-operations:
-
-<b> `Op(lhs, rhs, broadcast_dimensions)` </b>
-
-Where `Op` is the same as above. This variant of the operation should be used
-for arithmetic operations between arrays of different ranks (such as adding a
-matrix to a vector).
-
-The additional `broadcast_dimensions` operand is a slice of integers used to
-expand the rank of the lower-rank operand up to the rank of the higher-rank
-operand. `broadcast_dimensions` maps the dimensions of the lower-rank shape to
-the dimensions of the higher-rank shape. The unmapped dimensions of the expanded
-shape are filled with dimensions of size one. Degenerate-dimension broadcasting
-then broadcasts the shapes along these degenerate dimensions to equalize the
-shapes of both operands. The semantics are described in detail in the
-[broadcasting](#broadcasting_semantics) documentation.
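-
-For instance, here is a sketch of adding a vector to each row of a matrix
-(assuming a `ComputationBuilder` named `builder` and a `ConstantR2` method
-analogous to the `ConstantR0` and `ConstantR1` methods used elsewhere in this
-document):
-
-```c++
-// Sketch: add a f32[3] vector to each row of a f32[2x3] matrix.
-// broadcast_dimensions = {1} maps the vector's only dimension to
-// dimension 1 (the columns) of the matrix.
-auto m = builder.ConstantR2<float>({{1, 2, 3}, {4, 5, 6}});
-auto v = builder.ConstantR1<float>({10, 20, 30});
-auto sum = builder.Add(m, v, /*broadcast_dimensions=*/{1});
-// sum == f32[2x3] {{11, 22, 33}, {14, 25, 36}}
-```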
-
-### Element-wise comparison operations
-
-See also `ComputationBuilder::Eq`
-
-A set of standard element-wise binary comparison operations is supported. Note
-that standard IEEE 754 floating-point comparison semantics apply when comparing
-floating-point types.
-
-<b> `Op(lhs, rhs)` </b>
-
-Where `Op` is one of `Eq` (equal-to), `Ne` (not equal-to), `Ge`
-(greater-or-equal-than), `Gt` (greater-than), `Le` (less-or-equal-than), `Lt`
-(less-than).
-
-Arguments | Type | Semantics
---------- | ----------------------- | ----------------------------------------
-`lhs` | `ComputationDataHandle` | left-hand-side operand: array of type T
-`rhs` | `ComputationDataHandle` | right-hand-side operand: array of type T
-
-The arguments' shapes have to be either similar or compatible. See the
-[broadcasting](#broadcasting_semantics) documentation about what it means for
-shapes to be compatible. The result of an operation has a shape which is the
-result of broadcasting the two input arrays with the element type `PRED`. In
-this variant, operations between arrays of different ranks are *not* supported,
-unless one of the operands is a scalar.
-
-An alternative variant with different-rank broadcasting support exists for these
-operations:
-
-<b> `Op(lhs, rhs, broadcast_dimensions)` </b>
-
-Where `Op` is the same as above. This variant of the operation should be used
-for comparison operations between arrays of different ranks (such as comparing
-a matrix to a vector).
-
-The additional `broadcast_dimensions` operand is a slice of integers specifying
-the dimensions to use for broadcasting the operands. The semantics are described
-in detail in the [broadcasting](#broadcasting_semantics) documentation.
-
-### Element-wise unary functions
-
-ComputationBuilder supports these element-wise unary functions:
-
-<b>`Exp(operand)`</b> Element-wise natural exponential `x -> e^x`.
-
-<b>`Log(operand)`</b> Element-wise natural logarithm `x -> ln(x)`.
-
-<b>`Neg(operand)`</b> Element-wise negation `x -> -x`.
-
-<b>`Floor(operand)`</b> Element-wise floor `x -> ⌊x⌋`.
-
-<b>`Ceil(operand)`</b> Element-wise ceil `x -> ⌈x⌉`.
-
-<b>`Tanh(operand)`</b> Element-wise hyperbolic tangent `x -> tanh(x)`.
-
-Arguments | Type | Semantics
---------- | ----------------------- | ---------------------------
-`operand` | `ComputationDataHandle` | The operand to the function
-
-The function is applied to each element in the `operand` array, resulting in an
-array with the same shape. It is allowed for `operand` to be a scalar (rank 0).
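-
-For example (illustrative values):
-
-```
-let a = f32[4] {-2.5, -1.0, 1.0, 2.5};
-then Neg(a) == f32[4] {2.5, 1.0, -1.0, -2.5};
-then Floor(a) == f32[4] {-3.0, -1.0, 1.0, 2.0};
-```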
-
-### GetTupleElement
-
-See also `ComputationBuilder::GetTupleElement`
-
-Indexes into a tuple with a compile-time-constant value.
-
-The value must be a compile-time-constant so that shape inference can determine
-the type of the resulting value.
-
-This is analogous to `std::get<int N>(t)` in C++. Conceptually:
-
-```
-let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
-let s: s32 = 5;
-let t: (f32[10], s32) = tuple(v, s);
-let element_1: s32 = gettupleelement(t, 1); // Inferred shape matches s32.
-```
-
-See also [`Tuple`](#tuple).
-
-### Infeed
-
-See also `ComputationBuilder::Infeed`
-
-<b> `Infeed(shape)` </b>
-
-| Argument | Type | Semantics |
-| -------- | ------- | ----------------------------------------------------- |
-| `shape` | `Shape` | Shape of the data read from the Infeed interface. The |
-: : : layout field of the shape must be set to match the :
-: : : layout of the data sent to the device; otherwise its :
-: : : behavior is undefined. :
-
-Devices have an abstraction for feeding data to long-running computations, e.g.,
-feeding inputs to be consumed within the body of a [`While`](#while) loop.
-`Infeed` reads a single data item from the implicit Infeed streaming interface
-of the device, interpreting the data as the given shape and its layout, and
-returns a `ComputationDataHandle` of the data. Multiple Infeed operations are
-allowed in a computation, but there must be a total order among the Infeed
-operations. For example, two Infeeds in the code below have a total order since
-there is a dependency between the while loops. The compiler issues an error if
-there isn't a total order.
-
-```
-result1 = while (condition, init = init_value) {
- Infeed(shape)
-}
-
-result2 = while (condition, init = result1) {
- Infeed(shape)
-}
-```
-
-Nested tuple shapes are not supported. For an empty tuple shape, the Infeed
-operation is effectively a nop and proceeds without reading any data from the
-Infeed of the device.
-
-> Note: We plan to allow multiple Infeed operations without a total order, in
-> which case the compiler will provide information about how the Infeed
-> operations are serialized in the compiled program.
-
-### Map
-
-See also `ComputationBuilder::Map`
-
-<b> `Map(operands..., computation)` </b>
-
-| Arguments | Type | Semantics |
-| ----------------- | ------------------------ | ----------------------------- |
-| `operands` | sequence of N | N arrays of type T |
-: : `ComputationDataHandle`s : :
-| `computation` | `Computation` | computation of type `T_0, |
-: : : T_1, ..., T_{N + M - 1} -> S` :
-: : : with N parameters of type T :
-: : : and M of arbitrary type :
-| `static_operands` | sequence of M | M arrays of arbitrary type |
-: : `ComputationDataHandle`s : :
-
-Applies a scalar function over the given `operands` arrays, producing an array
-of the same dimensions where each element is the result of the mapped function
-applied to the corresponding elements in the input arrays with `static_operands`
-given as additional input to `computation`.
-
-The mapped function is an arbitrary computation with the restriction that it has
-N inputs of scalar type `T` and a single output with type `S`. The output has
-the same dimensions as the operands except that the element type T is replaced
-with S.
-
-For example: `Map(op1, op2, op3, computation, par1)` maps `elem_out <-
-computation(elem1, elem2, elem3, par1)` at each (multi-dimensional) index in the
-input arrays to produce the output array.
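-
-As a sketch (assuming the same client and builder setup as in the earlier
-examples), here is a scalar `Add` computation mapped over two vectors:
-
-```c++
-// Build the scalar computation to map: (f32, f32) -> f32.
-std::unique_ptr<xla::Computation> add;
-{
-  xla::ComputationBuilder b(client, "add");
-  auto x = b.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x");
-  auto y = b.Parameter(1, ShapeUtil::MakeShape(F32, {}), "y");
-  b.Add(x, y);
-  add = b.Build().ConsumeValueOrDie();
-}
-
-// Map it element-wise over two rank-1 operands.
-xla::ComputationBuilder builder(client, "map_add");
-auto a = builder.ConstantR1<float>({1, 2, 3});
-auto c = builder.ConstantR1<float>({10, 20, 30});
-builder.Map({a, c}, *add);  // == f32[3] {11, 22, 33}
-```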
-
-### Pad
-
-See also `ComputationBuilder::Pad`
-
-<b> `Pad(operand, padding_value, padding_config)` </b>
-
-| Arguments | Type | Semantics |
-| ---------------- | ----------------------- | ----------------------------- |
-| `operand` | `ComputationDataHandle` | array of type `T` |
-| `padding_value` | `ComputationDataHandle` | scalar of type `T` to fill in |
-: : : the added padding :
-| `padding_config` | `PaddingConfig` | padding amount on both edges |
-: : : (low, high) and between the :
-: : : elements of each dimension :
-
-Expands the given `operand` array by padding around the array as well as between
-the elements of the array with the given `padding_value`. `padding_config`
-specifies the amount of edge padding and the interior padding for each
-dimension.
-
-`PaddingConfig` is a repeated field of `PaddingConfigDimension`, which contains
-three fields for each dimension: `edge_padding_low`, `edge_padding_high`, and
-`interior_padding`. `edge_padding_low` and `edge_padding_high` specify the
-amount of padding added at the low-end (next to index 0) and the high-end (next
-to the highest index) of each dimension respectively. `interior_padding`
-specifies the amount of padding added between any two elements in each
-dimension. This operation is a no-op if the edge padding pairs are all (0, 0)
-and the interior padding values are all 0. The figure below shows examples of
-different `edge_padding` and `interior_padding` values for a two dimensional
-array.
-
-<center><iframe src="../images/xla-pad.svg" width="920" height="600" title="Pad Diagram" frameborder="0"></iframe></center>
-
-### Reduce
-
-See also `ComputationBuilder::Reduce`
-
-Applies a reduction function to an array.
-
-<b> `Reduce(operand, init_value, computation, dimensions)` </b>
-
-| Arguments | Type | Semantics |
-| ------------- | ----------------------- | -------------------------------- |
-| `operand` | `ComputationDataHandle` | array of type `T` |
-| `init_value` | `ComputationDataHandle` | scalar of type `T` |
-| `computation` | `Computation` | computation of type `T, T -> T` |
-| `dimensions` | `int64` array | unordered array of dimensions to |
-: : : reduce :
-
-Conceptually, this operation reduces one or more dimensions in the input array
-into scalars. The rank of the result array is `rank(operand) - len(dimensions)`.
-`init_value` is the initial value used for every reduction and may also be
-inserted anywhere during computation if the back-end chooses to do so. So in
-most cases `init_value` should be an identity of the reduction function (for
-example, 0 for addition).
-
-The evaluation order of the reduction function across the reduction dimensions
-is arbitrary and may be non-deterministic. Therefore, the reduction function
-should not be overly sensitive to reassociation[^1].
-
-As an example, when reducing across the one dimension in a 1D array with values
-[10, 11, 12, 13], with reduction function `f` (this is `computation`) then that
-could be computed as
-
-`f(10, f(11, f(12, f(init_value, 13))))`
-
-but there are also many other possibilities, e.g.
-
-`f(init_value, f(f(10, f(init_value, 11)), f(f(init_value, 12), f(13,
-init_value))))`
-
-The following is a rough pseudo-code example of how reduction could be
-implemented, using summation as the reduction computation with an initial value
-of 0.
-
-```python
-result_shape <- remove all dims in dimensions from operand_shape
-
-# Iterate over all elements in result_shape. The number of r's here is equal
-# to the rank of the result
-for r0 in range(result_shape[0]), r1 in range(result_shape[1]), ...:
- # Initialize this result element
- result[r0, r1...] <- 0
-
- # Iterate over all the reduction dimensions
- for d0 in range(dimensions[0]), d1 in range(dimensions[1]), ...:
- # Increment the result element with the value of the operand's element.
- # The index of the operand's element is constructed from all ri's and di's
- # in the right order (by construction ri's and di's together index over the
- # whole operand shape).
- result[r0, r1...] += operand[ri... di]
-```
-
-Here's an example of reducing a 2D array (matrix). The shape has rank 2,
-dimension 0 of size 2 and dimension 1 of size 3:
-
-<div/>
-<div style="text-align:center">
-![2d array](../images/xla-2d-matrix.svg)
-</div>
-
-Results of reducing dimensions 0 or 1 with an "add" function:
-
-<div/>
-<div style="text-align:center">
-![reducing 2d array](../images/xla-reduce-from-2d.svg)
-</div>
-
-Note that both reduction results are 1D arrays. The diagram shows one as a
-column and the other as a row just for visual convenience.
-
-For a more complex example, here is a 3D array. Its rank is 3, dimension 0 of
-size 4, dimension 1 of size 2 and dimension 2 of size 3. For simplicity, the
-values 1 to 6 are replicated across dimension 0.
-
-<div/>
-<div style="text-align:center">
-![3d array](../images/xla-reduce-from-3d.svg)
-</div>
-
-Similarly to the 2D example, we can reduce just one dimension. If we reduce
-dimension 0, for example, we get a rank-2 array where all values across
-dimension 0 were folded into a scalar:
-
-```text
-| 4 8 12 |
-| 4 8 12 |
-```
-
-If we reduce dimension 2, we also get a rank-2 array where all values across
-dimension 2 were folded into a scalar:
-
-```text
-| 6 15 |
-| 6 15 |
-| 6 15 |
-| 6 15 |
-```
-
-Note that the relative order between the remaining dimensions in the input is
-preserved in the output, but some dimensions may get assigned new numbers (since
-the rank changes).
-
-We can also reduce multiple dimensions. Add-reducing dimensions 0 and 1 produces
-the 1D array `| 20 28 36 |`.
-
-Reducing the 3D array over all its dimensions produces the scalar `84`.
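-
-Expressed with the builder API, an add-reduction over dimension 1 of a
-`f32[2x3]` matrix might look like the following sketch (assuming the client
-setup from the earlier examples):
-
-```c++
-// Build the scalar reduction computation: (f32, f32) -> f32.
-std::unique_ptr<xla::Computation> add;
-{
-  xla::ComputationBuilder b(client, "add");
-  auto x = b.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x");
-  auto y = b.Parameter(1, ShapeUtil::MakeShape(F32, {}), "y");
-  b.Add(x, y);
-  add = b.Build().ConsumeValueOrDie();
-}
-
-// Reduce dimension 1 of a f32[2x3] parameter with initial value 0.
-xla::ComputationBuilder builder(client, "reduce_dim1");
-auto input = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2, 3}), "input");
-builder.Reduce(input, builder.ConstantR0<float>(0.0f), *add,
-               /*dimensions_to_reduce=*/{1});
-// For input {{1, 2, 3}, {4, 5, 6}} the result is f32[2] {6, 15}.
-```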
-
-### ReduceWindow
-
-See also `ComputationBuilder::ReduceWindow`
-
-Applies a reduction function to all elements in each window of the input
-multi-dimensional array, producing an output multi-dimensional array with the
-same number of elements as the number of valid positions of the window. A
-pooling layer can be expressed as a `ReduceWindow`.
-
-<b> `ReduceWindow(operand, init_value, computation, window_dimensions,
-window_strides, padding)` </b>
-
-| Arguments | Type | Semantics |
-| ------------------- | ----------------------- | ---------------------------- |
-| `operand` | `ComputationDataHandle` | N dimensional array |
-: : : containing elements of type :
-: : : T. This is the base area on :
-: : : which the window is placed. :
-| `init_value` | `ComputationDataHandle` | Starting value for the |
-: : : reduction. See [Reduce] :
-: : : (#reduce) for details. :
-| `computation` | `Computation` | Reduction function of type |
-: : : `T, T -> T`, to apply to all :
-: : : elements in each window :
-| `window_dimensions` | `ArraySlice<int64>` | array of integers for window |
-: : : dimension values :
-| `window_strides` | `ArraySlice<int64>` | array of integers for window |
-: : : stride values :
-| `padding` | `Padding` | padding type for window |
-: : : (Padding\:\:kSame or :
-: : : Padding\:\:kValid) :
-
-The code and figure below show an example of using `ReduceWindow`. The input
-is a matrix of size [4x6], and both window_dimensions and
-window_stride_dimensions are [2x3].
-
-```
-// Create a computation for the reduction (maximum).
-std::unique_ptr<Computation> max;
-{
- ComputationBuilder builder(client_, "max");
- auto y = builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "y");
- auto x = builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "x");
- builder.Max(y, x);
- max = builder.Build().ConsumeValueOrDie();
-}
-
-// Create a ReduceWindow computation with the max reduction computation.
-ComputationBuilder builder(client_, "reduce_window_2x3");
-auto shape = ShapeUtil::MakeShape(F32, {4, 6});
-auto input = builder.Parameter(0, shape, "input");
-builder.ReduceWindow(
- input, *max,
-    /*init_val=*/builder.ConstantR0<float>(std::numeric_limits<float>::lowest()),
- /*window_dimensions=*/{2, 3},
- /*window_stride_dimensions=*/{2, 3},
- Padding::kValid);
-```
-
-<center><iframe src="../images/xla-reduce-window.svg" width="920" height="300" title="ReduceWindow Diagram" frameborder="0"></iframe></center>
-
-A stride of 1 in a dimension specifies that the position of a window in the
-dimension is 1 element away from its adjacent window. In order to specify that
-no windows overlap with each other, window_stride_dimensions should be equal to
-window_dimensions. The figure below illustrates the use of two different stride
-values. Padding is applied to each dimension of the input and the calculations
-are the same as though the input came in with the dimensions it has after
-padding.
-
-<center><iframe src="../images/xla-reduce-window-stride.svg" width="920" height="300" title="ReduceWindow Stride Diagram" frameborder="0"></iframe></center>
-
-### Reshape
-
-See also `ComputationBuilder::Reshape` and the [`Collapse`](#collapse)
-operation.
-
-Reshapes the dimensions of an array into a new configuration.
-
-<b> `Reshape(operand, dimensions, new_sizes)` </b>
-
-Arguments | Type | Semantics
------------- | ----------------------- | ---------------------------------------
-`operand` | `ComputationDataHandle` | array of type T
-`dimensions` | `int64` vector | order in which dimensions are collapsed
-`new_sizes` | `int64` vector | vector of sizes of new dimensions
-
-Conceptually, reshape first flattens an array into a one-dimensional vector of
-data values, and then refines this vector into a new shape. The input arguments
-are an arbitrary array of type T, a compile-time-constant vector of dimension
-indices, and a compile-time-constant vector of dimension sizes for the result.
-The values in the `dimensions` vector must be a permutation of all of T's
-dimensions. The order of the dimensions in `dimensions` is from slowest-varying
-dimension (most major) to fastest-varying dimension (most minor) in the loop
-nest which collapses the input array into a single dimension. The `new_sizes`
-vector determines the size of the output array. The value at index 0 in
-`new_sizes` is the size of dimension 0, the value at index 1 is the size of
-dimension 1, and so on. The product of the `new_sizes` dimensions must equal the
-product of the operand's dimension sizes. When refining the collapsed array into
-the multidimensional array defined by `new_sizes`, the dimensions in `new_sizes`
-are ordered from slowest varying (most major) to fastest varying (most
-minor).
-
-For example, let v be an array of 24 elements:
-
-```
-let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}},
- {{20, 21, 22}, {25, 26, 27}},
- {{30, 31, 32}, {35, 36, 37}},
- {{40, 41, 42}, {45, 46, 47}}};
-
-In-order collapse:
-let v012_24 = Reshape(v, {0,1,2}, {24});
-then v012_24 == f32[24] {10, 11, 12, 15, 16, 17,
- 20, 21, 22, 25, 26, 27,
- 30, 31, 32, 35, 36, 37,
- 40, 41, 42, 45, 46, 47};
-
-let v012_83 = Reshape(v, {0,1,2}, {8,3});
-then v012_83 == f32[8x3] {{10, 11, 12}, {15, 16, 17},
- {20, 21, 22}, {25, 26, 27},
- {30, 31, 32}, {35, 36, 37},
- {40, 41, 42}, {45, 46, 47}};
-
-Out-of-order collapse:
-let v021_24 = Reshape(v, {1,2,0}, {24});
-then v021_24 == f32[24] {10, 20, 30, 40, 11, 21, 31, 41,
-                         12, 22, 32, 42, 15, 25, 35, 45,
-                         16, 26, 36, 46, 17, 27, 37, 47};
-
-let v021_83 = Reshape(v, {1,2,0}, {8,3});
-then v021_83 == f32[8x3] {{10, 20, 30}, {40, 11, 21},
-                          {31, 41, 12}, {22, 32, 42},
-                          {15, 25, 35}, {45, 16, 26},
-                          {36, 46, 17}, {27, 37, 47}};
-
-let v021_262 = Reshape(v, {1,2,0}, {2,6,2});
-then v021_262 == f32[2x6x2] {{{10, 20}, {30, 40}, {11, 21},
-                              {31, 41}, {12, 22}, {32, 42}},
-                             {{15, 25}, {35, 45}, {16, 26},
-                              {36, 46}, {17, 27}, {37, 47}}};
-
-```
-
-As a special case, reshape can transform a single-element array to a scalar and
-vice versa. For example, `Reshape(f32[1x1] {{5}}, {0,1}, {}) == 5; Reshape(5,
-{}, {1,1}) == f32[1x1] {{5}};`
-
-### Rev (reverse)
-
-See `ComputationBuilder::Rev`
-
-<b>`Rev(operand, dimensions)`</b>
-
-Arguments | Type | Semantics
------------- | ----------------------- | ---------------------
-`operand` | `ComputationDataHandle` | array of type T
-`dimensions` | `ArraySlice<int64>` | dimensions to reverse
-
-Reverses the order of elements in the `operand` array along the specified
-`dimensions`, generating an output array of the same shape. Each element of the
-operand array at a multidimensional index is stored into the output array at a
-transformed index. The multidimensional index is transformed by reversing the
-index in each dimension to be reversed (i.e., if a dimension of size N is one of
-the reversing dimensions, its index i is transformed into N - 1 - i).
-
-One use for the `Rev` operation is to reverse the convolution weight array along
-the two window dimensions during the gradient computation in neural networks.
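-
-For example (illustrative values):
-
-```
-let v = f32[2x3] {{1, 2, 3},
-                  {4, 5, 6}};
-then Rev(v, {1}) == f32[2x3] {{3, 2, 1},
-                              {6, 5, 4}};
-then Rev(v, {0, 1}) == f32[2x3] {{6, 5, 4},
-                                 {3, 2, 1}};
-```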
-
-### RngBernoulli
-
-See also `ComputationBuilder::RngBernoulli`
-
-Constructs an output of a given shape with random numbers generated following
-the Bernoulli distribution. The parameter needs to be a scalar valued F32
-operand while the output shape needs to have elemental type U32.
-
-<b>`RngBernoulli(mean, shape)`</b>
-
-| Arguments | Type | Semantics |
-| --------- | ----------------------- | ------------------------------------- |
-| `mean` | `ComputationDataHandle` | Scalar of type F32 specifying mean of |
-: : : generated numbers :
-| `shape` | `Shape` | Output shape of type U32 |
-
-### RngNormal
-
-See also `ComputationBuilder::RngNormal`
-
-Constructs an output of a given shape with random numbers generated following
-the $$N(\mu, \sigma)$$ normal distribution. The parameters `mu` and `sigma`, and
-output shape have to have elemental type F32. The parameters furthermore have to
-be scalar valued.
-
-<b>`RngNormal(mu, sigma, shape)`</b>
-
-| Arguments | Type | Semantics |
-| --------- | ----------------------- | -------------------------------------- |
-| `mu` | `ComputationDataHandle` | Scalar of type F32 specifying mean of |
-: : : generated numbers :
-| `sigma` | `ComputationDataHandle` | Scalar of type F32 specifying standard |
-: : : deviation of generated numbers :
-| `shape` | `Shape` | Output shape of type F32 |
-
-### RngUniform
-
-See also `ComputationBuilder::RngUniform`
-
-Constructs an output of a given shape with random numbers generated following
-the uniform distribution over the interval $$[a,b]$$. The parameters and output
-shape may be either F32, S32 or U32, but the types have to be consistent.
-Furthermore, the parameters need to be scalar valued.
-
-<b>`RngUniform(a, b, shape)`</b>
-
-| Arguments | Type | Semantics |
-| --------- | ----------------------- | --------------------------------- |
-| `a` | `ComputationDataHandle` | Scalar of type T specifying lower |
-: : : limit of interval :
-| `b` | `ComputationDataHandle` | Scalar of type T specifying upper |
-: : : limit of interval :
-| `shape` | `Shape` | Output shape of type T |
-
-### SelectAndScatter
-
-See also `ComputationBuilder::SelectAndScatter`
-
-This operation can be considered as a composite operation that first computes
-`ReduceWindow` on the `operand` array to select an element from each window, and
-then scatters the `source` array to the indices of the selected elements to
-construct an output array with the same shape as the operand array. The binary
-`select` function is used to select an element from each window by applying it
-across each window, and it is called with the property that the first
-parameter's index vector is lexicographically less than the second parameter's
-index vector. The `select` function returns `true` if the first parameter is
-selected and returns `false` if the second parameter is selected, and the
-function must hold transitivity (i.e., if `select(a, b)` and `select(b, c)` are
-`true`, then `select(a, c)` is also `true`) so that the selected element does
-not depend on the order of the elements traversed for a given window.
-
-The function `scatter` is applied at each selected index in the output array. It
-takes two scalar parameters:
-
-1. Current value at the selected index in the output array
-2. The scatter value from `source` that applies to the selected index
-
-It combines the two parameters and returns a scalar value that's used to update
-the value at the selected index in the output array. Initially, all indices of
-the output array are set to `init_value`.
-
-The output array has the same shape as the `operand` array and the `source`
-array must have the same shape as the result of applying a `ReduceWindow`
-operation on the `operand` array. `SelectAndScatter` can be used to
-backpropagate the gradient values for a pooling layer in a neural network.
-
-<b>`SelectAndScatter(operand, select, window_dimensions, window_strides,
-padding, source, init_value, scatter)`</b>
-
-| Arguments | Type | Semantics |
-| ------------------- | ----------------------- | ---------------------------- |
-| `operand` | `ComputationDataHandle` | array of type T over which |
-: : : the windows slide :
-| `select` | `Computation` | binary computation of type |
-: : : `T, T -> PRED`, to apply to :
-: : : all elements in each window; :
-: : : returns `true` if the first :
-: : : parameter is selected and :
-: : : returns `false` if the :
-: : : second parameter is selected :
-| `window_dimensions` | `ArraySlice<int64>` | array of integers for window |
-: : : dimension values :
-| `window_strides` | `ArraySlice<int64>` | array of integers for window |
-: : : stride values :
-| `padding` | `Padding` | padding type for window |
-: : : (Padding\:\:kSame or :
-: : : Padding\:\:kValid) :
-| `source` | `ComputationDataHandle` | array of type T with the |
-: : : values to scatter :
-| `init_value` | `ComputationDataHandle` | scalar value of type T for |
-: : : the initial value of the :
-: : : output array :
-| `scatter` | `Computation` | binary computation of type |
-: : : `T, T -> T`, to apply each :
-: : : scatter source element with :
-: : : its destination element :
-
-The figure below shows examples of using `SelectAndScatter`, with the `select`
-function computing the maximal value among its parameters. Note that when the
-windows overlap, as in the figure (2) below, an index of the `operand` array may
-be selected multiple times by different windows. In the figure, the element of
-value 9 is selected by both of the top windows (blue and red) and the binary
-addition `scatter` function produces the output element of value 8 (2 + 6).
-
-<center><iframe src="../images/xla-scatter-to-selected-window-element.svg" width="1000" height="700" title="SelectAndScatter Diagram" frameborder="0"></iframe></center>
-
-### Select
-
-See also `ComputationBuilder::Select`
-
-Constructs an output array from elements of two input arrays, based on the
-values of a predicate array.
-
-<b> `Select(pred, on_true, on_false)` </b>
-
-Arguments | Type | Semantics
----------- | ----------------------- | ------------------
-`pred` | `ComputationDataHandle` | array of type PRED
-`on_true` | `ComputationDataHandle` | array of type T
-`on_false` | `ComputationDataHandle` | array of type T
-
-The arrays `on_true` and `on_false` must have the same shape. This is also the
-shape of the output array. The array `pred` must have the same dimensionality as
-`on_true` and `on_false`, with the `PRED` element type.
-
-For each element `P` of `pred`, the corresponding element of the output array is
-taken from `on_true` if the value of `P` is `true`, and from `on_false` if the
-value of `P` is `false`. As a restricted form of
-[broadcasting](#broadcasting_semantics), `pred` can be a scalar of type `PRED`.
-In this case, the output array is taken wholly from `on_true` if `pred` is
-`true`, and from `on_false` if `pred` is `false`.
-
-Example with non-scalar `pred`:
-
-```
-let pred: PRED[4] = {true, false, false, true};
-let v1: s32[4] = {1, 2, 3, 4};
-let v2: s32[4] = {100, 200, 300, 400};
-==>
-Select(pred, v1, v2) = s32[4]{1, 200, 300, 4};
-```
-
-Example with scalar `pred`:
-
-```
-let pred: PRED = true;
-let v1: s32[4] = {1, 2, 3, 4};
-let v2: s32[4] = {100, 200, 300, 400};
-==>
-Select(pred, v1, v2) = s32[4]{1, 2, 3, 4};
-```
-
-Selections between tuples are supported. Tuples are considered to be scalar
-types for this purpose. If `on_true` and `on_false` are tuples (which must have
-the same shape!) then `pred` has to be a scalar of type `PRED`.
-
-### Slice
-
-See also `ComputationBuilder::Slice`
-
-Slicing extracts a sub-array from the input array. The sub-array is of the same
-rank as the input and contains the values inside a bounding box within the input
-array where the dimensions and indices of the bounding box are given as
-arguments to the slice operation.
-
-<b> `Slice(operand, start_indices, limit_indices)` </b>
-
-| Arguments | Type | Semantics |
-| --------------- | ----------------------- | -------------------------------- |
-| `operand` | `ComputationDataHandle` | N dimensional array of type T |
-| `start_indices` | `ArraySlice<int64>` | List of N integers containing |
-: : : the starting indices of the :
-: : : slice for each dimension. Values :
-: : : must be greater than or equal to :
-: : : zero. :
-| `limit_indices` | `ArraySlice<int64>` | List of N integers containing |
-: : : the ending indices (exclusive) :
-: : : for the slice for each :
-: : : dimension. Each value must be :
-: : : strictly greater than the :
-: : : respective `start_indices` value :
-: : : for the dimension and less than :
-: : : or equal to the size of the :
-: : : dimension. :
-
-1-dimensional example:
-
-```
-let a = {0.0, 1.0, 2.0, 3.0, 4.0}
-Slice(a, {2}, {4}) produces:
- {2.0, 3.0}
-```
-
-2-dimensional example:
-
-```
-let b =
- { {0.0, 1.0, 2.0},
- {3.0, 4.0, 5.0},
- {6.0, 7.0, 8.0},
- {9.0, 10.0, 11.0} }
-
-Slice(b, {2, 1}, {4, 3}) produces:
- { { 7.0, 8.0},
- {10.0, 11.0} }
-```
-
-### Sort
-
-See `ComputationBuilder::Sort`
-
-Sorts the elements in the operand.
-
-<b>`Sort(operand)`</b>
-
-Arguments | Type | Semantics
---------- | ----------------------- | -------------------
-`operand` | `ComputationDataHandle` | The operand to sort
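-
-For example, assuming elements are sorted into ascending order:
-
-```
-let v = s32[5] {3, 1, 4, 1, 5};
-then Sort(v) == s32[5] {1, 1, 3, 4, 5};
-```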
-
-### Trans (transpose)
-
-See also the [`Reshape`](#reshape) operation.
-
-<b>`Trans(operand)`</b>
-
-Arguments | Type | Semantics
---------- | ----------------------- | -------------------------
-`operand` | `ComputationDataHandle` | The operand to transpose.
-
-Returns the transpose of `operand`. `operand` must have rank 2.
-
-This is the same as `Reshape(operand, {1, 0}, {operand.shape.dimensions[1],
-operand.shape.dimensions[0]})`.
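-
-For example:
-
-```
-let m = f32[2x3] {{1, 2, 3},
-                  {4, 5, 6}};
-then Trans(m) == f32[3x2] {{1, 4},
-                           {2, 5},
-                           {3, 6}};
-```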
-
-### Tuple
-
-See also `ComputationBuilder::Tuple`
-
-A tuple containing a variable number of data handles, each of which has its own
-shape.
-
-This is analogous to `std::tuple` in C++. Conceptually:
-
-```
-let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
-let s: s32 = 5;
-let t: (f32[10], s32) = tuple(v, s);
-```
-
-Tuples can be deconstructed (accessed) via the
-[`GetTupleElement`](#gettupleelement) operation.
-
-### While
-
-See also `ComputationBuilder::While`
-
-<b> `While(condition, body, init)` </b>
-
-| Arguments | Type | Semantics |
-| ----------- | ------------- | ---------------------------------------------- |
-| `condition` | `Computation` | Computation of type `T -> PRED` which defines |
-: : : the termination condition of the loop. :
-| `body` | `Computation` | Computation of type `T -> T` which defines the |
-: : : body of the loop. :
-| `init` | `T` | Initial value for the parameter of `condition` |
-: : : and `body`. :
-
-Sequentially executes the `body` until the `condition` fails. This is similar to
-a typical while loop in many other languages except for the differences and
-restrictions listed below.
-
-* A `While` node returns a value of type `T`, which is the result from the
- last execution of the `body`.
-* The shape of the type `T` is statically determined and must be the same
- across all iterations.
-* `While` nodes are not allowed to be nested. (This restriction may be lifted
- in the future on some targets.)
-
-The T parameters of the computations are initialized with the `init` value in
-the first iteration and are automatically updated to the new result from `body`
-in each subsequent iteration.
-
-One main use case of the `While` node is to implement the repeated execution of
-training in neural networks. Simplified pseudocode is shown below with a graph
-that represents the computation. The type `T` in this example is a `Tuple`
-consisting of an `int32` for the iteration count and a `vector[10]` for the
-accumulator. For 1000 iterations, the loop keeps adding a constant vector to the
-accumulator.
-
-```
-// Pseudocode for the computation.
-init = {0, zero_vector[10]} // Tuple of int32 and float[10].
-result = init;
-while (result(0) < 1000) {
- iteration = result(0) + 1;
- new_vector = result(1) + constant_vector[10];
- result = {iteration, new_vector};
-}
-```
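-
-A sketch of how this loop might be constructed with the builder API follows
-(hypothetical variable names; assumes the client setup from the earlier axpy
-examples):
-
-```c++
-auto state_shape = ShapeUtil::MakeTupleShape(
-    {ShapeUtil::MakeShape(S32, {}), ShapeUtil::MakeShape(F32, {10})});
-
-// condition: (s32, f32[10]) -> PRED; true while the counter is below 1000.
-std::unique_ptr<xla::Computation> condition;
-{
-  xla::ComputationBuilder b(client, "condition");
-  auto state = b.Parameter(0, state_shape, "state");
-  b.Lt(b.GetTupleElement(state, 0), b.ConstantR0<int32>(1000));
-  condition = b.Build().ConsumeValueOrDie();
-}
-
-// body: (s32, f32[10]) -> (s32, f32[10]); increments the counter and adds
-// a constant vector to the accumulator.
-std::unique_ptr<xla::Computation> body;
-{
-  xla::ComputationBuilder b(client, "body");
-  auto state = b.Parameter(0, state_shape, "state");
-  auto counter = b.Add(b.GetTupleElement(state, 0), b.ConstantR0<int32>(1));
-  auto accum = b.Add(b.GetTupleElement(state, 1),
-                     b.ConstantR1<float>(std::vector<float>(10, 1.0f)));
-  b.Tuple({counter, accum});
-  body = b.Build().ConsumeValueOrDie();
-}
-
-xla::ComputationBuilder builder(client, "while_example");
-auto init = builder.Tuple(
-    {builder.ConstantR0<int32>(0),
-     builder.ConstantR1<float>(std::vector<float>(10, 0.0f))});
-builder.While(*condition, *body, init);
-```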
-
-## Broadcasting semantics
-
-This section describes how the broadcasting semantics in XLA work.
-
-### What is broadcasting
-
-Broadcasting may be required for operations between multi-dimensional arrays of
-different ranks, or between multi-dimensional arrays with different but
-compatible shapes. Consider the addition `X+v` where `X` is a matrix (an array
-of rank 2) and `v` is a vector (an array of rank 1). To perform element-wise
-addition, XLA needs to "broadcast" the vector `v` to the same rank as the
-matrix `X`, by replicating `v` a certain number of times. The vector's length
-has to match at least one of the dimensions of the matrix.
-
-For example:
-
- |1 2 3| + |7 8 9|
- |4 5 6|
-
-The matrix's dimensions are (2,3), the vector's are (3). We broadcast the vector
-by replicating it over rows to get:
-
- |1 2 3| + |7 8 9| = |8 10 12|
- |4 5 6| |7 8 9| |11 13 15|
-
-In Numpy, this is called
-[broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
-
-### Principles
-
-We see XLA as a low-level infrastructure. Therefore, we want to make the
-XLA language as strict and explicit as possible, avoiding implicit and
-"magical" features that may make some computations slightly easier to define, at
-the cost of more assumptions baked into user code that will be difficult to
-change in the long term. If necessary, implicit and magical features can be
-added in client-level wrappers.
-
-Specifically w.r.t. broadcasting, we will require explicit broadcasting
-specifications on operations between arrays of different ranks, instead of
-inferring a possible broadcasting like Numpy does.
-
-### Broadcasting a lower-rank array onto a higher-rank array
-
-*Scalars* can always be broadcast over arrays without an explicit specification
-of broadcasting dimensions. An element-wise binary operation between a scalar
-and an array means applying the operation with the scalar for each element in
-the array. For example, adding a scalar to a matrix means producing a matrix
-each element of which is a sum of the scalar with the corresponding input
-matrix's element.
-
- |1 2 3| + 7 = |8 9 10|
- |4 5 6| |11 12 13|
-
-Most broadcasting needs can be captured by using a tuple of dimensions on a
-binary operation. When the inputs to the operation have different ranks, this
-broadcasting tuple specifies which dimension(s) in the **higher-rank** array to
-match with the **lower-rank** array.
-
-Consider the previous example of adding a matrix with dimensions (2,3) to a
-vector with dimension (3). *Without specifying broadcasting, this operation is
-invalid.* Based on XLA convention, the left-most dimension is 0, and the
-number grows as we walk the dimensions right-wards. For a (2,3) matrix we'd
-index into it with `matrix[i,j]`, with `i` running from 0 to 1 and `j` from 0
-to 2. `i` indexes over dimension 0 and `j` indexes over dimension 1.
-
-To correctly request our matrix-vector addition the user will specify the
-broadcasting dimension to be (1), meaning that the vector's dimension is matched
-to dimension 1 of the matrix. In 2D, if we consider dimension 0 as rows and
-dimension 1 as columns, this means that each element of the vector becomes a
-column of a size matching the number of rows in the matrix:
-
- |7 8 9| ==> |7 8 9|
- |7 8 9|
-
-As a more complex example, consider adding a 3-element vector (dimension (3)) to
-a 3x3 matrix (dimensions (3,3)). There are two ways broadcasting can happen
-here:
-
-Broadcasting dimension is 1, as before. Each vector element becomes a column -
-the vector is duplicated for each row in the matrix.
-
- |7 8 9| ==> |7 8 9|
- |7 8 9|
- |7 8 9|
-
-Broadcasting dimension is 0. Each vector element becomes a row - the vector is
-duplicated for each column in the matrix.
-
- |7| ==> |7 7 7|
- |8| |8 8 8|
- |9| |9 9 9|
-
-Note: when adding a 2x3 matrix to a 3-element vector, a broadcasting dimension
-of 0 is invalid.
-
-The broadcasting dimensions can be a tuple that describes how a smaller rank
-shape is broadcast into a larger rank shape. For example, given a 2x3x4 cuboid
-and a 3x4 matrix, a broadcasting tuple (1,2) means matching the matrix to
-dimensions 1 and 2 of the cuboid.
-
-This type of broadcast is used in the binary ops in `ComputationBuilder`, if the
-`broadcast_dimensions` argument is given. In the XLA source code, this type
-of broadcasting is sometimes called "InDim" broadcasting.
-
-#### Formal definition
-
-The broadcasting attribute allows matching a lower-rank array to a higher-rank
-array, by specifying which dimensions of the higher-rank array to match. For
-example, for an array with dimensions MxNxPxQ, we can match a vector with
-dimension T as follows:
-
- MxNxPxQ
-
- dim 3: T
- dim 2: T
- dim 1: T
- dim 0: T
-
-In each case, T has to be equal to the matching dimension of the higher-rank
-array. The vector's values are then broadcast from the matched dimension to all
-the other dimensions.
-
-If we want to match a TxV matrix onto the MxNxPxQ array, we have to use a pair
-of broadcasting dimensions:
-
- MxNxPxQ
- dim 2,3: T V
- dim 1,2: T V
- dim 0,3: T V
- etc...
-
-The order of dimensions in the broadcasting tuple has to be the order in which
-the lower-rank array's dimensions are expected to match the higher-rank array's
-dimensions. The first element in the tuple says which dimension in the
-higher-rank array has to match dimension 0 in the lower-rank array, the second
-element says which dimension has to match dimension 1, and so on. The order of
-broadcast dimensions has to be strictly increasing. E.g. in the previous
-example, it's illegal to match V to N
-and T to P; also, it's illegal to match V to both P and N.
-
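-The following sketch illustrates the ordering rule, assuming handles `mnpq`
-(shape MxNxPxQ) and `tv` (shape TxV) exist:
-
-```c++
-// Legal: T matches P (dim 2) and V matches Q (dim 3); strictly increasing.
-auto ok = builder.Add(mnpq, tv, /*broadcast_dimensions=*/{2, 3});
-// Illegal: {2, 1} would match T to P and V to N, but the broadcast
-// dimensions are not strictly increasing, so this is rejected.
-// auto bad = builder.Add(mnpq, tv, /*broadcast_dimensions=*/{2, 1});
-```
-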
-### Broadcasting similar-rank arrays with degenerate dimensions
-
-A related broadcasting problem is broadcasting two arrays that have the same
-rank but different dimension sizes. Similarly to Numpy's rules, this is only
-possible when the arrays are *compatible*. Two arrays are compatible when all
-their dimensions are compatible. Two dimensions are compatible if:
-
-* They are equal, or
-* One of them is 1 (a "degenerate" dimension)
-
-When two compatible arrays are broadcast, the result shape has, at every
-dimension index, the maximum of the two input dimension sizes.
-
-Examples:
-
-1. (2,1) and (2,3) broadcast to (2,3).
-2. (1,2,5) and (7,2,5) broadcast to (7,2,5).
-3. (7,2,5) and (7,1,5) broadcast to (7,2,5).
-4. (7,2,5) and (7,2,6) are incompatible and cannot be broadcast.
-
-A special case arises, and is also supported, where each of the input arrays has
-a degenerate dimension at a different index. In this case, we get an "outer
-operation": (2,1) and (1,3) broadcast to (2,3). For more examples, consult the
-[Numpy documentation on broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
-
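-A hypothetical helper (not part of the XLA API) makes these rules concrete;
-`int64_t` stands in for the `int64` type used elsewhere in this document:
-
-```c++
-#include <algorithm>
-#include <cassert>
-#include <cstdint>
-#include <vector>
-
-// Computes the result shape of broadcasting two same-rank shapes, following
-// the compatibility rules above: dimensions must be equal or one must be 1,
-// and the result takes the maximum size at each dimension index.
-std::vector<int64_t> BroadcastResultShape(const std::vector<int64_t>& lhs,
-                                          const std::vector<int64_t>& rhs) {
-  assert(lhs.size() == rhs.size());
-  std::vector<int64_t> result(lhs.size());
-  for (size_t i = 0; i < lhs.size(); ++i) {
-    assert(lhs[i] == rhs[i] || lhs[i] == 1 || rhs[i] == 1);
-    result[i] = std::max(lhs[i], rhs[i]);
-  }
-  return result;
-}
-```
-
-For example, `BroadcastResultShape({2, 1}, {1, 3})` returns `{2, 3}`, matching
-the "outer operation" case above.
-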
-### Broadcast composition
-
-Broadcasting of a lower-rank array to a higher-rank array **and** broadcasting
-using degenerate dimensions can both be performed in the same binary operation.
-For example, a vector of size 4 and a matrix of size 1x2 can be added together
-using a broadcast dimensions value of (0):
-
- |1 2 3 4| + [5 6] // [5 6] is a 1x2 matrix, not a vector.
-
-First the vector is broadcast up to rank 2 (matrix) using the broadcast
-dimensions. The single value (0) in the broadcast dimensions indicates that
-dimension zero of the vector matches dimension zero of the matrix. This
-produces a matrix of size 4xM where the value M is chosen to match the
-corresponding dimension size in the 1x2 array. Therefore, a 4x2 matrix is
-produced:
-
- |1 1| + [5 6]
- |2 2|
- |3 3|
- |4 4|
-
-Then "degenerate dimension broadcasting" broadcasts dimension zero of the 1x2
-matrix to match the corresponding dimension size of the right hand side:
-
- |1 1| + |5 6| |6 7|
- |2 2| + |5 6| = |7 8|
- |3 3| + |5 6| |8 9|
- |4 4| + |5 6| |9 10|
-
-A more complicated example is a matrix of size 1x2 added to an array of size
-4x3x1 using broadcast dimensions of (1, 2). First the 1x2 matrix is broadcast
-up to rank 3 using the broadcast dimensions, producing an intermediate Mx1x2
-array where the dimension size M is determined by the size of the larger
-operand (the 4x3x1 array); this yields a 4x1x2 intermediate array. The M is at
-dimension 0 (the left-most dimension) because dimensions 1 and 2 are mapped to
-the dimensions of the original 1x2 matrix by the broadcast dimensions (1, 2).
-This intermediate array can then be added to the 4x3x1 array using broadcasting
-of degenerate dimensions to produce a 4x3x2 array result.
-
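-In builder terms, both compositions above are single `Add` calls; as a sketch,
-assuming the operand handles exist with the stated shapes:
-
-```c++
-// |1 2 3 4| + [5 6]: vector4 has shape (4), m12 has shape (1,2).
-auto r1 = builder.Add(vector4, m12, /*broadcast_dimensions=*/{0});   // shape (4,2)
-// 4x3x1 array plus 1x2 matrix: a431 has shape (4,3,1).
-auto r2 = builder.Add(a431, m12, /*broadcast_dimensions=*/{1, 2});   // shape (4,3,2)
-```
-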
-[^1]: Some obvious reductions like "add reduction" are not strictly associative
- for floats. However, if the range of the data is limited, floating-point
- addition is close enough to being associative for most practical uses. It
-    is possible to conceive of completely non-associative reductions, however,
-    and these will produce incorrect results in XLA reductions.
-
-## C++ interface
-
-The following is a fragment of the class definition for the client
-`ComputationBuilder` interface, for reference:
-
-```c++
-class ComputationBuilder {
- public:
- // client: client in which to build the computation.
- // computation_name: name to use for the built computation.
- ComputationBuilder(Client* client, const string& computation_name);
-
- ~ComputationBuilder();
-
- // Returns the client the builder was initialized with.
- Client* client() { return client_; }
-
- // Returns the computation name.
- const string& name() { return name_; }
-
- // Sets the builder to a mode where it will die immediately when an error is
- // encountered, rather than producing it in a deferred fashion when Build() is
- // called (which is the default).
- void set_die_immediately_on_error(bool enabled) {
- die_immediately_on_error_ = enabled;
- }
-
- // Enqueues a "retrieve parameter value" instruction for a parameter that was
- // passed to the computation.
- ComputationDataHandle Parameter(int64 parameter_number, const Shape& shape,
- const string& name);
-
- // Retrieves the (inferred) shape of the operand in the computation.
- util::StatusOr<std::unique_ptr<Shape>> GetShape(
- const ComputationDataHandle& operand);
-
- // Checks that the operand has the given expected shape. Returns the operand
- // if yes, fails with a CHECK error if no.
- ComputationDataHandle CheckShape(const ComputationDataHandle& operand,
- const Shape& expected_shape);
-
- // Checks that the lhs and rhs results have the same shape.
- void CheckSameShape(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs);
-
- // Enqueues a constant with the value of the given literal onto the
- // computation.
- ComputationDataHandle ConstantLiteral(const Literal& literal);
-
- // Enqueues a constant onto the computation. Methods are templated on the
- // native host type (NativeT) which corresponds to a specific XLA
- // PrimitiveType as given in the following table:
- //
- // Native Type PrimitiveType
- // -----------------------------
- // bool PRED
- // int32 S32
- // int64 S64
- // uint32 U32
- // uint64 U64
- // float F32
- // double F64
- //
- // Note: not all primitive types defined in xla.proto have a corresponding
- // native type yet.
- template <typename NativeT>
- ComputationDataHandle ConstantR0(NativeT value);
- template <typename NativeT>
- ComputationDataHandle ConstantR1(gtl::ArraySlice<NativeT> values);
- template <typename NativeT>
- ComputationDataHandle ConstantR2(
- std::initializer_list<std::initializer_list<NativeT>> values);
- template <typename NativeT>
- ComputationDataHandle ConstantR2FromArray2D(const Array2D<NativeT>& values);
- template <typename NativeT>
- ComputationDataHandle ConstantR3FromArray3D(const Array3D<NativeT>& values);
- template <typename NativeT>
- ComputationDataHandle ConstantR4FromArray4D(const Array4D<NativeT>& values);
-
- // Enqueues a rank one constant (vector) onto the computation. The
- // vector has size 'length' and every element has the value 'value'.
- template <typename NativeT>
- ComputationDataHandle ConstantR1(int64 length, NativeT value);
-
- // Adds dimensions to an array by duplicating the data in the array.
- //
- // The new dimensions are inserted on the left, i.e. if
- // broadcast_sizes has values {a0, ..., aN} and the operand shape
- // has dimensions {b0, ..., bM} then the shape of the output has
- // dimensions {a0, ..., aN, b0, ..., bM}.
- //
- // The new dimensions index into copies of the operand, i.e.
- //
- // output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM]
- ComputationDataHandle Broadcast(const ComputationDataHandle& operand,
- gtl::ArraySlice<int64> broadcast_sizes);
-
- // Enqueues a pad operation onto the computation that pads the given value on
- // the edges as well as between the elements of the input. padding_config
- // specifies the padding amount for each dimension.
- ComputationDataHandle Pad(const ComputationDataHandle& operand,
- const ComputationDataHandle& padding_value,
- const PaddingConfig& padding_config);
-
- // Enqueues an operation onto the computation that flattens the operand based
- // on the dimension order (major/slowest-varying to minor/fastest-varying)
- // given, followed by reshaping it into the shape with the given dimension
- // sizes (also major to minor). Conceptually, this is a limited form of
- // "shape casting".
- ComputationDataHandle Reshape(const ComputationDataHandle& operand,
- gtl::ArraySlice<int64> dimensions,
- gtl::ArraySlice<int64> new_sizes);
-
- // Wrapper for Reshape.
- // Enqueues an operation to collapse the provided dimensions; e.g. an
- // operand with dimensions {x=256, y=2, z=2, p=32} can be collapsed to
- // {x=1024, y=32} by collapsing dims {0, 1, 2}. The collapsed dimensions must
- // be a consecutive, in-order subsequence of the operand dimensions.
- //
- // This could potentially cause data to be moved -- it provides a more
- // structured form of reshaping than an arbitrary Reshape operation.
- ComputationDataHandle Collapse(const ComputationDataHandle& operand,
- gtl::ArraySlice<int64> dimensions);
-
- // Enqueues a slice operation onto the computation that slices the operand
- // from the start indices to the limit indices; e.g.
- //
- // x
- // [ 0 1 2 3 ]
- // y [ 4 5 6 7 ] => slice(start={1, 1}, limit={2, 3}) => [ 5 6 ]
- // [ 8 9 a b ]
- //
- // Note that "limit" means up-to-but-not-including; i.e. [start, limit) in 1D
- // range notation.
- ComputationDataHandle Slice(const ComputationDataHandle& operand,
- gtl::ArraySlice<int64> start_indices,
- gtl::ArraySlice<int64> limit_indices);
-
- // Enqueues a concatenate instruction onto the computation.
- ComputationDataHandle ConcatInDim(
- gtl::ArraySlice<ComputationDataHandle> operands, int64 dimension);
-
- // Enqueues a tracing operation onto the computation; the computation will emit
- // a logging message with the operand.
- void Trace(const string& tag, const ComputationDataHandle& operand);
-
- // Enqueues a conditional-move-like select operation onto the computation;
- // predicated on pred, selects between on_true and on_false.
- ComputationDataHandle Select(const ComputationDataHandle& pred,
- const ComputationDataHandle& on_true,
- const ComputationDataHandle& on_false);
-
- // Enqueues a tuple-creation instruction onto the computation.
- ComputationDataHandle Tuple(gtl::ArraySlice<ComputationDataHandle> elements);
-
- // Enqueues a tuple-element-get instruction onto the computation.
- ComputationDataHandle GetTupleElement(const ComputationDataHandle& tuple_data,
- int64 index);
-
- // Enqueues an equal-to comparison instruction onto the computation.
- ComputationDataHandle Eq(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a not-equal comparison instruction onto the computation.
- ComputationDataHandle Ne(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a greater-or-equal comparison instruction onto the computation.
- ComputationDataHandle Ge(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a greater-than comparison instruction onto the computation.
- ComputationDataHandle Gt(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a less-than comparison instruction onto the computation.
- ComputationDataHandle Lt(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a less-or-equal comparison instruction onto the computation.
- ComputationDataHandle Le(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a dot instruction onto the computation.
- ComputationDataHandle Dot(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs);
-
- // Default dimension numbers used for a convolution.
- static constexpr int64 kConvBatchDimension = 0;
- static constexpr int64 kConvFeatureDimension = 1;
- static constexpr int64 kConvFirstSpatialDimension = 2;
- static constexpr int64 kConvSecondSpatialDimension = 3;
- static constexpr int64 kConvKernelOutputDimension = 0;
- static constexpr int64 kConvKernelInputDimension = 1;
- static constexpr int64 kConvKernelFirstSpatialDimension = 2;
- static constexpr int64 kConvKernelSecondSpatialDimension = 3;
-
- // Creates a default ConvolutionDimensionNumbers. For the input operand
- // {batch, feature, height, width} = {0, 1, 2, 3} and for the weight operand
- // {kernel_output_feature, kernel_input_feature, height, width} = {0, 1, 2, 3}.
- static ConvolutionDimensionNumbers CreateDefaultConvDimensionNumbers();
-
- // Creates a ConvolutionDimensionNumbers with the given arguments. Returns an
- // error if either the input or the weight dimension numbers have conflicts.
- static util::StatusOr<ConvolutionDimensionNumbers> CreateConvDimensionNumbers(
- int64 batch, int64 feature, int64 first_spatial, int64 second_spatial,
- int64 kernel_output_feature, int64 kernel_input_feature,
- int64 kernel_first_spatial, int64 kernel_second_spatial);
-
- // Enqueues a convolution instruction onto the computation, which uses the
- // default convolution dimension numbers.
- ComputationDataHandle Conv(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> window_strides,
- Padding padding);
-
- // Enqueues a convolution instruction onto the computation, with the caller
- // provided padding configuration in the format returned by MakePadding().
- ComputationDataHandle ConvWithGeneralPadding(
- const ComputationDataHandle& lhs, const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> window_strides,
- gtl::ArraySlice<std::pair<int64, int64>> padding);
-
- // Enqueues a convolution instruction onto the computation, with the caller
- // provided dimension numbers configuration.
- ComputationDataHandle ConvWithGeneralDimensions(
- const ComputationDataHandle& lhs, const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> window_strides, Padding padding,
- const ConvolutionDimensionNumbers& dimension_numbers);
-
- // Enqueues a convolution instruction onto the computation, with the caller
- // provided padding configuration as well as the dimension numbers.
- ComputationDataHandle ConvGeneral(
- const ComputationDataHandle& lhs, const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> window_strides,
- gtl::ArraySlice<std::pair<int64, int64>> padding,
- const ConvolutionDimensionNumbers& dimension_numbers);
-
- // Enqueues an infeed instruction onto the computation, which reads data of
- // the given shape from the infeed buffer of the device.
- ComputationDataHandle Infeed(const Shape& shape);
-
- // Enqueues a custom call instruction onto the computation.
- // During code generation, a call instruction is emitted which targets a
- // symbol with the name |call_target_name|. The |operands| are passed to the
- // call instruction. |shape| is the resultant shape.
- ComputationDataHandle CustomCallOp(
- tensorflow::StringPiece call_target_name,
- gtl::ArraySlice<ComputationDataHandle> operands, const Shape& shape);
-
- // The following methods enqueue element-wise binary arithmetic operations
- // onto the computation. The shapes of the operands have to match unless one
- // of the operands is a scalar, or an explicit broadcast dimension is given
- // (see g3doc for more details).
-
- // Enqueues an add instruction onto the computation.
- ComputationDataHandle Add(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a subtract instruction onto the computation.
- ComputationDataHandle Sub(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a multiply instruction onto the computation.
- ComputationDataHandle Mul(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a divide instruction onto the computation.
- ComputationDataHandle Div(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a remainder instruction onto the computation.
- ComputationDataHandle Rem(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a max instruction onto the computation.
- ComputationDataHandle Max(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Enqueues a min instruction onto the computation.
- ComputationDataHandle Min(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs,
- gtl::ArraySlice<int64> broadcast_dimensions = {});
-
- // Reduces an array among the provided dimensions, given "computation" as a
- // reduction operator.
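- // For example, reducing a 2x3 array over dimension 0 with an "add"
- // computation and an init_value of 0 produces a 3-element vector of column
- // sums.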
- ComputationDataHandle Reduce(const ComputationDataHandle& operand,
- const ComputationDataHandle& init_value,
- const Computation& computation,
- gtl::ArraySlice<int64> dimensions_to_reduce);
-
- // Enqueues a windowed reduce instruction onto the computation.
- ComputationDataHandle ReduceWindow(const ComputationDataHandle& operand,
- const ComputationDataHandle& init_value,
- const Computation& computation,
- gtl::ArraySlice<int64> window_dimensions,
- gtl::ArraySlice<int64> window_strides,
- Padding padding);
-
- // As ReduceWindow(), but the padding is given in the format
- // returned by MakePadding().
- ComputationDataHandle ReduceWindowWithGeneralPadding(
- const ComputationDataHandle& operand,
- const ComputationDataHandle& init_value, const Computation& computation,
- gtl::ArraySlice<int64> window_dimensions,
- gtl::ArraySlice<int64> window_strides,
- gtl::ArraySlice<std::pair<int64, int64>> padding);
-
- // Enqueues an operation that scatters the `source` array to the selected
- // indices of each window.
- ComputationDataHandle SelectAndScatter(
- const ComputationDataHandle& operand, const Computation& select,
- gtl::ArraySlice<int64> window_dimensions,
- gtl::ArraySlice<int64> window_strides, Padding padding,
- const ComputationDataHandle& source,
- const ComputationDataHandle& init_value, const Computation& scatter);
-
- // As SelectAndScatter(), but the padding is given in the format
- // returned by MakePadding().
- ComputationDataHandle SelectAndScatterWithGeneralPadding(
- const ComputationDataHandle& operand, const Computation& select,
- gtl::ArraySlice<int64> window_dimensions,
- gtl::ArraySlice<int64> window_strides,
- gtl::ArraySlice<std::pair<int64, int64>> padding,
- const ComputationDataHandle& source,
- const ComputationDataHandle& init_value, const Computation& scatter);
-
- // Enqueues an exp instruction onto the computation.
- ComputationDataHandle Exp(const ComputationDataHandle& operand);
-
- // Enqueues a floor instruction onto the computation.
- ComputationDataHandle Floor(const ComputationDataHandle& operand);
-
- // Enqueues a ceil instruction onto the computation.
- ComputationDataHandle Ceil(const ComputationDataHandle& operand);
-
- // Enqueues a log instruction (natural logarithm) onto the computation.
- ComputationDataHandle Log(const ComputationDataHandle& operand);
-
- // Enqueues a tanh instruction onto the computation.
- ComputationDataHandle Tanh(const ComputationDataHandle& operand);
-
- // Enqueues a float32 sqrt instruction onto the computation.
- // (float32 is specified as there is an implicit float32 0.5f constant
- // exponent).
- ComputationDataHandle SqrtF32(const ComputationDataHandle& operand);
-
- // Enqueues a float32 square instruction onto the computation.
- // (float32 is specified as there is an implicit float32 2.0f constant
- // exponent).
- ComputationDataHandle SquareF32(const ComputationDataHandle& operand);
-
- // Enqueues a lhs^rhs computation onto the computation.
- ComputationDataHandle Pow(const ComputationDataHandle& lhs,
- const ComputationDataHandle& rhs);
-
- // Enqueues a convert instruction onto the computation that changes the
- // element type of the operand array to primitive_type.
- ComputationDataHandle ConvertElementType(const ComputationDataHandle& operand,
- PrimitiveType new_element_type);
-
- // Enqueues a float32 reciprocal instruction onto the computation.
- // (float32 is specified as there is an implicit float32 -1.0f constant
- // exponent).
- //
- // TODO(leary) axe F32 suffix, can be determined by reflecting on the shape of
- // the operand.
- ComputationDataHandle ReciprocalF32(const ComputationDataHandle& operand);
-
- // Enqueues a negate instruction onto the computation.
- ComputationDataHandle Neg(const ComputationDataHandle& operand);
-
- // Enqueues a transpose instruction onto the computation.
- ComputationDataHandle Trans(const ComputationDataHandle& operand);
-
- // Enqueues a reverse instruction onto the computation. The order of the
- // elements in the given dimensions is reversed (i.e., the element at index i
- // is moved to index dimension_size - 1 - i).
- ComputationDataHandle Rev(const ComputationDataHandle& operand,
- gtl::ArraySlice<int64> dimensions);
-
- // Enqueues a sort (as increasing order) instruction onto the computation.
- ComputationDataHandle Sort(const ComputationDataHandle& operand);
-
- // Enqueues a clamp instruction onto the computation.
- ComputationDataHandle Clamp(const ComputationDataHandle& min,
- const ComputationDataHandle& operand,
- const ComputationDataHandle& max);
-
- // Enqueues a map instruction onto the computation.
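- // The computation is applied element-wise to the operands (which must all
- // have the same dimensions), producing a result of the same shape.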
- ComputationDataHandle Map(
- gtl::ArraySlice<ComputationDataHandle> operands,
- const Computation& computation,
- gtl::ArraySlice<ComputationDataHandle> static_operands = {});
-
- // Enqueues a N(mu, sigma) random number generation instruction onto the
- // computation.
- ComputationDataHandle RngNormal(const ComputationDataHandle& mu,
- const ComputationDataHandle& sigma,
- const Shape& shape);
-
- // Enqueues a U(a, b) random number generation instruction onto the
- // computation.
- ComputationDataHandle RngUniform(const ComputationDataHandle& a,
- const ComputationDataHandle& b,
- const Shape& shape);
-
- // Enqueues a B(1, p) random number generation instruction onto the
- // computation.
- ComputationDataHandle RngBernoulli(const ComputationDataHandle& mean,
- const Shape& shape);
-
- // Enqueues a while node onto the computation.
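- // The loop state starts as `init`; `body` is applied repeatedly while
- // `condition` evaluates to true on the current state, and the final state
- // is the result of the instruction.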
- ComputationDataHandle While(const Computation& condition,
- const Computation& body,
- const ComputationDataHandle& init);
-
- // Computes the value of a constant indicated by a
- // ComputationDataHandle.
- //
- // The handle must be from the computation currently being built -
- // i.e. returned from this builder with no intervening call to
- // Build(). This happens to currently work regardless of that, but
- // that may stop working at any time.
- //
- // The handle must represent a constant value, which in this case
- // means that it must not statically depend on a parameter to the
- // computation that is being built. Note this allows the output of
- // an Rng() node to count as constant - in that case you may receive
- // different values if you call this method several times. Let us
- // know if you have a use-case where that is a problem.
- //
- // This functionality can be useful when translating a computation
- // into XLA where something that looked dynamic is required by XLA
- // to be specified as a constant. E.g. the source computation
- // (outside of XLA) may include a dynamic computation of the shape
- // of something and ComputeConstant lets you determine what the
- // value of that computation is in the case where the value can be
- // determined at compile time.
- //
- // If output_layout is non-null, then the output of the computation
- // will be stored using that layout.
- util::StatusOr<std::unique_ptr<GlobalData>> ComputeConstant(
- const ComputationDataHandle& handle,
- const Layout* output_layout = nullptr);
-
- // Returns a new ComputationBuilder whose resultant Computation is used only
- // by this ComputationBuilder. The sub-ComputationBuilder has the same
- // die_immediately_on_error behavior as the parent.
- std::unique_ptr<ComputationBuilder> CreateSubBuilder(
- const string& computation_name);
-
- // Modifies the computation being built so that executions of it
- // will return the value associated with operand, rather than the
- // last expression enqueued on the ComputationBuilder. Any subsequent
- // operations added to the ComputationBuilder will not have any effect unless
- // SetReturnValue is called again.
- util::Status SetReturnValue(const ComputationDataHandle& operand);
-
- // Builds the computation with the requested operations, or returns a non-ok
- // status.
- util::StatusOr<std::unique_ptr<Computation>> Build();
-
- // Builds the computation with the requested operations, or notes an error in
- // the parent ComputationBuilder and returns an empty computation if building
- // failed. This function is intended to be used where the returned
- // Computation is only used by the parent ComputationBuilder and hence further
- // operation on the returned Computation will simply be error'ed out if an
- // error occurred while building this computation. If the built computation is
- // to be used by a ComputationBuilder other than the parent ComputationBuilder
- // then Build() should be used instead.
- std::unique_ptr<Computation> BuildAndNoteError();
-};
-```
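-
-As a minimal end-to-end sketch of using this interface (assuming a connected
-`Client*` named `client`, and that `util::StatusOr` exposes a
-`ConsumeValueOrDie()` accessor, which is an assumption here):
-
-```c++
-ComputationBuilder builder(client, "vector_add");
-auto x = builder.ConstantR1<float>({1, 2, 3, 4});
-auto y = builder.ConstantR1<float>({5, 6, 7, 8});
-builder.Add(x, y);  // By default, the last enqueued operation is the result.
-// Build() returns util::StatusOr<std::unique_ptr<Computation>>.
-std::unique_ptr<Computation> computation = builder.Build().ConsumeValueOrDie();
-```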