diff options
author | A. Unique TensorFlower <gardener@tensorflow.org> | 2017-01-09 14:11:45 -0800 |
---|---|---|
committer | TensorFlower Gardener <gardener@tensorflow.org> | 2017-01-09 14:24:47 -0800 |
commit | 17fafb1f17843141fcd04b79c0e64996b508cb8b (patch) | |
tree | 1f067a2e26497a731fa10d48aa5ec1e3eccce1e7 /tensorflow/g3doc | |
parent | 8692d7de020fed91d19d42265c176954b83e83bb (diff) |
XLA Documents for initial launch
Change: 144006194
Diffstat (limited to 'tensorflow/g3doc')
-rw-r--r-- | tensorflow/g3doc/experimental/index.md | 13 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/leftnav_files | 8 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/xla/broadcasting.md | 204 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/xla/developing_new_backend.md | 77 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/xla/index.md | 95 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/xla/jit.md | 157 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/xla/operation_semantics.md | 1468 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/xla/shapes.md | 150 | ||||
-rw-r--r-- | tensorflow/g3doc/experimental/xla/tfcompile.md | 288 | ||||
-rw-r--r-- | tensorflow/g3doc/resources/xla_prerelease.md | 2223 |
10 files changed, 2460 insertions, 2223 deletions
diff --git a/tensorflow/g3doc/experimental/index.md b/tensorflow/g3doc/experimental/index.md new file mode 100644 index 0000000000..d941a336a8 --- /dev/null +++ b/tensorflow/g3doc/experimental/index.md @@ -0,0 +1,13 @@

# Experimental

Collected in this section is information about experimental features included in TensorFlow. These features are either not complete or not yet included as part of core TensorFlow.

### XLA

[XLA](./xla/) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations. The results are improvements in speed, memory usage, and portability on server and mobile platforms.

[Learn more about XLA](./xla/)

diff --git a/tensorflow/g3doc/experimental/leftnav_files b/tensorflow/g3doc/experimental/leftnav_files new file mode 100644 index 0000000000..489f3a846a --- /dev/null +++ b/tensorflow/g3doc/experimental/leftnav_files @@ -0,0 +1,8 @@

### XLA
xla/index.md
xla/jit.md
xla/tfcompile.md
xla/developing_new_backend.md
xla/operation_semantics.md
xla/shapes.md
xla/broadcasting.md
\ No newline at end of file

diff --git a/tensorflow/g3doc/experimental/xla/broadcasting.md b/tensorflow/g3doc/experimental/xla/broadcasting.md new file mode 100644 index 0000000000..8dbf0d0446 --- /dev/null +++ b/tensorflow/g3doc/experimental/xla/broadcasting.md @@ -0,0 +1,204 @@

# Broadcasting semantics

This document describes how the broadcasting semantics in XLA work.

## What is broadcasting?

Broadcasting is the process of making arrays with different shapes have compatible shapes for arithmetic operations. The terminology is borrowed from Numpy [(broadcasting)](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).

Broadcasting may be required for operations between multi-dimensional arrays of different ranks, or between multi-dimensional arrays with different but compatible shapes. Consider the addition `X+v` where `X` is a matrix (an array of rank 2) and `v` is a vector (an array of rank 1). To perform element-wise addition, XLA needs to "broadcast" the vector `v` to the same rank as the matrix `X`, by replicating `v` a certain number of times. The vector's length has to match at least one of the dimensions of the matrix.

For example:

    |1 2 3| + |7 8 9|
    |4 5 6|

The matrix's dimensions are (2,3), the vector's are (3). The vector is broadcast by replicating it over rows to get:

    |1 2 3| + |7 8 9| = |8  10 12|
    |4 5 6|   |7 8 9|   |11 13 15|

In Numpy, this is called [broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).

## Principles

XLA is a low-level infrastructure with an XLA language that is as strict and explicit as possible, avoiding implicit and "magical" features that may make some computations slightly easier to define, at the cost of more assumptions baked into user code that will be difficult to change in the long term. If necessary, implicit and magical features can be added in client-level wrappers.

With regard to broadcasting, XLA requires an explicit broadcasting specification on operations between arrays of different ranks. This differs from Numpy, which infers the specification when possible.

## Broadcasting a lower-rank array onto a higher-rank array

*Scalars* can always be broadcast over arrays without an explicit specification of broadcasting dimensions. An element-wise binary operation between a scalar and an array means applying the operation with the scalar to each element in the array. For example, adding a scalar to a matrix means producing a matrix in which each element is the sum of the scalar and the corresponding element of the input matrix.

    |1 2 3| + 7 = |8  9  10|
    |4 5 6|       |11 12 13|

Most broadcasting needs can be captured by using a tuple of dimensions on a binary operation. When the inputs to the operation have different ranks, this broadcasting tuple specifies which dimension(s) in the **higher-rank** array to match with the **lower-rank** array.

Consider the previous example: instead of adding a scalar to a (2,3) matrix, add a vector of dimension (3) to a matrix of dimensions (2,3). *Without specifying broadcasting, this operation is invalid.* To correctly request matrix-vector addition, specify the broadcasting dimension to be (1), meaning the vector's dimension is matched to dimension 1 of the matrix.
In 2D, if dimension 0 is +considered as rows and dimension 1 as columns, this means that each element of +the vector becomes a column of a size matching the number of rows in the matrix: + + |7 8 9| ==> |7 8 9| + |7 8 9| + +As a more complex example, consider adding a 3-element vector (dimension (3)) to +a 3x3 matrix (dimensions (3,3)). There are two ways broadcasting can happen for +this example: + +(1) A broadcasting dimension of 1 can be used. Each vector element becomes a +column and the vector is duplicated for each row in the matrix. + + |7 8 9| ==> |7 8 9| + |7 8 9| + |7 8 9| + +(2) A broadcasting dimension of 0 can be used. Each vector element becomes a row +and the vector is duplicated for each column in the matrix. + + |7| ==> |7 7 7| + |8| |8 8 8| + |9| |9 9 9| + +> Note: when adding a 2x3 matrix to a 3-element vector, a broadcasting dimension +> of 0 is invalid. + +The broadcasting dimensions can be a tuple that describes how a smaller rank +shape is broadcast into a larger rank shape. For example, given a 2x3x4 cuboid +and a 3x4 matrix, a broadcasting tuple (1,2) means matching the matrix to +dimensions 1 and 2 of the cuboid. + +This type of broadcast is used in the binary ops in `ComputationBuilder`, if the +`broadcast_dimensions` argument is given. For example, see +[ComputationBuilder::Add](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.cc). +In the XLA source code, this type of broadcasting is sometimes called "InDim" +broadcasting. + +### Formal definition + +The broadcasting attribute allows matching a lower-rank array to a higher-rank +array, by specifying which dimensions of the higher-rank array to match. For +example, for an array with dimensions MxNxPxQ, a vector with dimension T can be +matched as follows: + + MxNxPxQ + + dim 3: T + dim 2: T + dim 1: T + dim 0: T + +In each case, T has to be equal to the matching dimension of the higher-rank +array. The vector's values are then broadcast from the matched dimension to all +the other dimensions. + +To match a TxV matrix onto the MxNxPxQ array, a pair of broadcasting dimensions +are used: + + MxNxPxQ + dim 2,3: T V + dim 1,2: T V + dim 0,3: T V + etc... + +The order of dimensions in the broadcasting tuple has to be the order in which +the lower-rank array's dimensions are expected to match the higher-rank array's +dimensions. The first element in the tuple says which dimension in the +higher-rank array has to match dimension 0 in the lower-rank array. The second +element for dimension 1, and so on. The order of broadcast dimensions has to be +strictly increasing. For example, in the previous example it is illegal to match +V to N and T to P; it is also illegal to match V to both P and N. + +## Broadcasting similar-rank arrays with degenerate dimensions + +A related broadcasting problem is broadcasting two arrays that have the same +rank but different dimension sizes. Similarly to Numpy's rules, this is only +possible when the arrays are *compatible*. Two arrays are compatible when all +their dimensions are compatible. Two dimensions are compatible if: + +* They are equal, or +* One of them is 1 (a "degenerate" dimension) + +When two compatible arrays are encountered, the result shape has the maximum +among the two inputs at every dimension index. + +Examples: + +1. (2,1) and (2,3) broadcast to (2,3). +2. (1,2,5) and (7,2,5) broadcast to (7,2,5) +3. (7,2,5) and (7,1,5) broadcast to (7,2,5) +4. (7,2,5) and (7,2,6) are incompatible and cannot be broadcast. 
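Because XLA borrows these compatibility rules from Numpy, a short NumPy session is a convenient way to sanity-check the result shapes in the examples above before writing the equivalent XLA computation. The sketch below is illustrative only: it relies on NumPy's *implicit* broadcasting, whereas XLA requires the dimensions to be spelled out, and the final reshape merely emulates the explicit broadcasting-dimension attribute described earlier.

```python
# Illustrative NumPy sketch (not the XLA API): the degenerate-dimension rules
# above match NumPy's implicit broadcasting, so NumPy predicts result shapes.
import numpy as np

print((np.zeros((2, 1)) + np.zeros((2, 3))).shape)        # (2, 3)
print((np.zeros((1, 2, 5)) + np.zeros((7, 2, 5))).shape)  # (7, 2, 5)
print((np.zeros((7, 2, 5)) + np.zeros((7, 1, 5))).shape)  # (7, 2, 5)

try:
    np.zeros((7, 2, 5)) + np.zeros((7, 2, 6))  # incompatible dimensions
except ValueError as err:
    print("cannot broadcast:", err)

# XLA's explicit broadcasting dimensions can be emulated by inserting the
# degenerate dimensions by hand. Matching a 3-element vector to dimension 0
# of a 3x3 matrix (broadcasting dimension (0)) corresponds to:
m = np.arange(1, 10).reshape(3, 3)
v = np.array([7, 8, 9])
print(m + v.reshape(3, 1))  # each vector element is added along one row
```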
A special case arises, and is also supported, where each of the input arrays has a degenerate dimension at a different index. In this case, the result is an "outer operation": (2,1) and (1,3) broadcast to (2,3). For more examples, consult the [Numpy documentation on broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).

## Broadcast composition

Broadcasting of a lower-rank array to a higher-rank array **and** broadcasting using degenerate dimensions can both be performed in the same binary operation. For example, a vector of size 4 and a matrix of size 1x2 can be added together using a broadcast dimensions value of (0):

    |1 2 3 4| + [5 6]    // [5 6] is a 1x2 matrix, not a vector.

First the vector is broadcast up to rank 2 (matrix) using the broadcast dimensions. The single value (0) in the broadcast dimensions indicates that dimension zero of the vector matches dimension zero of the matrix. This produces a matrix of size 4xM where the value M is chosen to match the corresponding dimension size in the 1x2 array. Therefore, a 4x2 matrix is produced:

    |1 1| + [5 6]
    |2 2|
    |3 3|
    |4 4|

Then "degenerate dimension broadcasting" broadcasts dimension zero of the 1x2 matrix to match the corresponding dimension size of the right hand side:

    |1 1| + |5 6|     |6 7|
    |2 2| + |5 6|  =  |7 8|
    |3 3| + |5 6|     |8 9|
    |4 4| + |5 6|     |9 10|

A more complicated example is a matrix of size 1x2 added to an array of size 4x3x1 using broadcast dimensions of (1, 2). First the 1x2 matrix is broadcast up to rank 3 using the broadcast dimensions to produce an intermediate Mx1x2 array, where the dimension size M is determined by the size of the larger operand (the 4x3x1 array), producing a 4x1x2 intermediate array. The M is at dimension 0 (left-most dimension) because dimensions 1 and 2 are mapped to the dimensions of the original 1x2 matrix, as the broadcast dimensions are (1, 2). This intermediate array can be added to the 4x3x1 array using broadcasting of degenerate dimensions to produce a 4x3x2 array result.

diff --git a/tensorflow/g3doc/experimental/xla/developing_new_backend.md b/tensorflow/g3doc/experimental/xla/developing_new_backend.md new file mode 100644 index 0000000000..28010ff1b7 --- /dev/null +++ b/tensorflow/g3doc/experimental/xla/developing_new_backend.md @@ -0,0 +1,77 @@

# Developing a new backend for XLA

This preliminary guide is for early adopters who want to retarget TensorFlow to their hardware in an efficient manner. The guide is not step-by-step and assumes knowledge of [LLVM](http://llvm.org), [Bazel](https://bazel.build/), and TensorFlow.

XLA provides an abstract interface that a new architecture or accelerator can implement to create a backend to run TensorFlow graphs. Retargeting XLA should be significantly simpler and more scalable than implementing every existing TensorFlow Op for new hardware.

Most implementations will fall into one of the following scenarios:

1. Existing CPU architecture not yet officially supported by XLA, with or
   without an existing [LLVM](http://llvm.org) backend.
2. Non-CPU-like hardware with an existing LLVM backend.
3. Non-CPU-like hardware without an existing LLVM backend.

> Note: An LLVM backend can mean either one of the officially released LLVM
> backends or a custom LLVM backend developed in-house.
+ +## Scenario 1: Existing CPU architecture not yet officially supported by XLA + +In this scenario, start by looking at the existing [XLA CPU backend] +(https://www.tensorflow.org/code/tensorflow/compiler/xla/service/cpu/). +XLA makes it easy to retarget TensorFlow to different CPUs by using LLVM, since +the main difference between XLA backends for CPUs is the code generated by LLVM. +Google tests XLA for x64 and ARM64 architectures. + +If the hardware vendor has an LLVM backend for their hardware, it is simple to +link the backend with the LLVM built with XLA. In JIT mode, the XLA CPU backend +emits code for the host CPU. For ahead-of-time compilation, +[`xla::AotCompilationOptions`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/compiler.h) +can provide an LLVM triple to configure the target architecture. + +If there is no existing LLVM backend but another kind of code generator exists, +it should be possible to reuse most of the existing CPU backend. + +## Scenario 2: Non-CPU-like hardware with an existing LLVM backend + +It is possible to model a new +[`xla::Compiler`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/compiler.h) +implementation on the existing [`xla::CPUCompiler`] +(https://www.tensorflow.org/code/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc) +and [`xla::GPUCompiler`] +(https://www.tensorflow.org/code/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc) +classes, since these already emit LLVM IR. Depending on the nature of the +hardware, it is possible that many of the LLVM IR generation aspects will have +to be changed, but a lot of code can be shared with the existing backends. + +A good example to follow is the [GPU backend] +(https://www.tensorflow.org/code/tensorflow/compiler/xla/service/gpu/) +of XLA. The GPU backend targets a non-CPU-like ISA, and therefore some aspects +of its code generation are unique to the GPU domain. Other kinds of hardware, +e.g. DSPs like Hexagon (which has an upstream LLVM backend), can reuse parts of +the LLVM IR emission logic, but other parts will be unique. + +## Scenario 3: Non-CPU-like hardware without an existing LLVM backend + +If it is not possible to utilize LLVM, then the best option is to implement a +new backend for XLA for the desired hardware. This option requires the most +effort. The classes that need to be implemented are as follows: + +* [StreamExecutor](https://www.tensorflow.org/code/tensorflow/stream_executor/stream_executor.h): + For many devices not all methods of `StreamExecutor` are needed. See + existing `StreamExecutor` implementations for details. +* [xla::Compiler](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/compiler.h): + This class encapsulates the compilation of a HLO computation into an + `xla::Executable`. +* [`xla::Executable`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/executable.h): + This class is used to launch a compiled computation on the platform. +* [`xla::TransferManager`](https://www.tensorflow.org/code/tensorflow/compiler/xla/service/transfer_manager.h): + This class enables backends to provide platform-specific mechanisms for + constructing XLA literal data from given device memory handles. In other + words, it helps encapsulate the transfer of data from the host to the device + and back. 
diff --git a/tensorflow/g3doc/experimental/xla/index.md b/tensorflow/g3doc/experimental/xla/index.md new file mode 100644 index 0000000000..702e03ea8f --- /dev/null +++ b/tensorflow/g3doc/experimental/xla/index.md @@ -0,0 +1,95 @@ +# XLA Overview + +> Note: XLA is experimental and considered alpha. Most use cases will not +> see improvements in performance (speed or decreased memory usage). We have +> released XLA early so the Open Source Community can contribute to its +> development, as well as create a path for integration with hardware +> accelerators. + +XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear +algebra that optimizes TensorFlow computations. The results are improvements in +speed, memory usage, and portability on server and mobile platforms. Initially, +most users will not see large benefits from XLA, but are welcome to experiment +by using XLA via [just-in-time (JIT) compilaton](jit.md) or [ahead-of-time (AOT) +compilation](tfcompile.md). Developers targeting new hardware accelerators are +especially encouraged to try out XLA. + +The XLA framework is experimental and in active development. In particular, +while it is unlikely that the semantics of existing operations will change, it +is expected that more operations will be added to cover important use cases. The +team welcomes feedback from the community about missing functionality and +community contributions via GitHub. + +## Why did we build XLA? + +We had several objectives for XLA to work with TensorFlow: + +* *Improve execution speed.* Compile subgraphs to reduce the execution time of + short-lived Ops to eliminate overhead from the TensorFlow runtime, fuse + pipelined operations to reduce memory overhead, and specialize to known + tensor shapes to allow for more aggressive constant propagation. + +* *Improve memory usage.* Analyze and schedule memory usage, in principle + eliminating many intermediate storage buffers. + +* *Reduce reliance on custom Ops.* Remove the need for many custom Ops by + improving the performance of automatically fused low-level Ops to match the + performance of custom Ops that were fused by hand. + +* *Reduce mobile footprint.* Eliminate the TensorFlow runtime by ahead-of-time + compiling the subgraph and emitting an object/header file pair that can be + linked directly into another application. The results can reduce the + footprint for mobile inference by several orders of magnitude. + +* *Improve portability.* Make it relatively easy to write a new backend for + novel hardware, at which point a large fraction of TensorFlow programs will + run unmodified on that hardware. This is in contrast with the approach of + specializing individual monolithic Ops for new hardware, which requires + TensorFlow programs to be rewritten to make use of those Ops. + +## How does XLA work? + +The input language to XLA is called "HLO IR", or just HLO (High Level +Optimizer). The semantics of HLO are described on the +[Operation Semantics](operation_semantics.md) page. It +is most convenient to think of HLO as a [compiler +IR](https://en.wikipedia.org/wiki/Intermediate_representation). + +XLA takes graphs ("computations") defined in HLO and compiles them into machine +instructions for various architectures. XLA is modular in the sense that it is +easy to slot in an alternative backend to [target some novel HW +architecture](developing_new_backend.md). The CPU backend for x64 and ARM64 as +well as the NVIDIA GPU backend are in the TensorFlow source tree. 
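As a concrete starting point, the JIT path described on the [JIT page](jit.md) is driven from Python by a single session-level option. The snippet below is a minimal sketch rather than a recommended setup; it assumes a TensorFlow build with XLA compiled in and simply repeats the `global_jit_level` setting documented there.

```python
# Minimal sketch: ask TensorFlow to JIT-compile eligible subgraphs via XLA.
# Assumes a TensorFlow build with XLA enabled; see jit.md for details.
import numpy as np
import tensorflow as tf

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.nn.relu(tf.matmul(x, tf.ones([10, 10])))

with tf.Session(config=config) as sess:
    print(sess.run(y, feed_dict={x: np.ones((4, 10), dtype=np.float32)}))
```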
+ +The following diagram shows the compilation process in XLA: + +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img src="../../images/how-does-xla-work.png"> +</div> + +XLA comes with several optimizations and analyses that are target-independent, +such as [CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination), +target-independent operation fusion, and buffer analysis for allocating runtime +memory for the computation. + +After the target-independent step, XLA sends the HLO computation to a backend. +The backend can perform further HLO-level analyses and optimizations, this time +with target specific information and needs in mind. For example, the XLA GPU +backend may perform operation fusion beneficial specifically for the GPU +programming model and determine how to partition the computation into streams. +At this stage, backends may also pattern-match certain operations or +combinations thereof to optimized library calls. + +The next step is target-specific code generation. The CPU and GPU backends +included with XLA use [LLVM](http://llvm.org) for low-level IR, optimization, +and code-generation. These backends emit the LLVM IR necessary to represent the +XLA HLO computation in an efficient manner, and then invoke LLVM to emit native +code from this LLVM IR. + +The GPU backend currently supports NVIDIA GPUs via the LLVM NVPTX backend; the +CPU backend supports multiple CPU ISAs. + +## Supported Platforms + +XLA currently supports [JIT compilation](jit.md) on x86-64 and NVIDIA GPUs; and +[AOT compilation](tfcompile.md) for x86-64 and ARM. diff --git a/tensorflow/g3doc/experimental/xla/jit.md b/tensorflow/g3doc/experimental/xla/jit.md new file mode 100644 index 0000000000..807dc57d40 --- /dev/null +++ b/tensorflow/g3doc/experimental/xla/jit.md @@ -0,0 +1,157 @@ +# Using JIT Compilation + +> Note: TensorFlow must be compiled from source to include XLA. + +## Why use just-in-time (JIT) compilation? + +The TensorFlow/XLA JIT compiler compiles and runs parts of TensorFlow graphs via +XLA. The benefit of this over the standard TensorFlow implementation is that XLA +can fuse multiple operators (kernel fusion) into a small number of compiled +kernels. Fusing operators can reduce memory bandwidth requirements and improve +performance compared to executing operators one-at-a-time, as the TensorFlow +executor does. + +## Running TensorFlow graphs via XLA + +There are two ways to run TensorFlow computations via XLA, either by +JIT-compiling operators placed on a CPU or GPU device, or by placing operators +on the `XLA_CPU` or `XLA_GPU` TensorFlow devices. Placing operators directly on +a TensorFlow XLA device forces the operator to run on that device and is mainly +used for testing. + +> Note: The XLA CPU backend produces fast single-threaded code (in most cases), +> but does not yet parallelize as well as the TensorFlow CPU backend. The XLA +> GPU backend is competitive with the standard TensorFlow implementation, +> sometimes faster, sometimes slower. + +### Turning on JIT compilation + +JIT compilation can be turned on at the session level or manually for select +operations. Both of these approaches are zero-copy --- data does not need to be +copied when passing data between a compiled XLA kernel and a TensorFlow operator +placed on the same device. + +#### Session + +Turning on JIT compilation at the session level will result in all possible +operators being greedily compiled into XLA computations. 
Each XLA computation +will be compiled into one or more kernels for the underlying device. (This does +not mean everything will be fused into a single CUDA kernel, for example.) + +Subject to a few constraints, if there are two adjacent operators in the graph +that both have XLA implementations, then they will be compiled into a single XLA +computation. + +JIT compilation is turned on at the session level by setting the +`global_jit_level` config to `tf.OptimizerOptions.ON_1` and passing the config +during session initialization. + +```python +# Config to turn on JIT compilation +config = tf.ConfigProto() +config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1 + +sess = tf.Session(config=config) +``` + +#### Manual + +JIT compilation can also be turned on manually for one or more operators. This +is done by tagging the operators to compile with the attribute +`_XlaCompile=true`. The simplest way to do this is via the +`tf.contrib.compiler.jit.experimental_jit_scope()` scope defined in +[`tensorflow/contrib/compiler/jit.py`](https://www.tensorflow.org/code/tensorflow/contrib/compiler/jit.py). +Example usage: + +```python + jit_scope = tf.contrib.compiler.jit.experimental_jit_scope + + x = tf.placeholder(np.float32) + with jit_scope(): + y = tf.add(x, x) # The "add" will be compiled with XLA. +``` + +The `_XlaCompile` attribute is currently supported on a best-effort basis. If an +operator cannot be compiled, TensorFlow will silently fall back to the normal +implementation. + +### Placing operators on XLA devices + +Another way to run computations via XLA is to place an operator on a specific +XLA device. This method is normally only used for testing. Valid targets are +`XLA_CPU` or `XLA_GPU`. + +```python +with tf.device("/job:localhost/replica:0/task:0/device:XLA_GPU:0"): + output = tf.add(input1, input2) +``` + +Unlike JIT compilation on the standard CPU and GPU devices, these devices make a +copy of data when it is transferred on and off the device. The extra copy makes +it expensive to mix XLA and TensorFlow operators in the same graph. + +## Tutorial + +This tutorial covers training a simple version of MNIST softmax with JIT turned +on. While, the tutorial was created using CPU only, the steps are the same and +the artifacts are similar to running on GPU. + +### Step #1: Prepare sample script + +Download or move +[mnist_softmax_xla.py](https://www.tensorflow.org/code/tensorflow/examples/tutorials/mnist/mnist_softmax_xla.py) +into a folder outside of the TensorFlow source tree. + +### Step #2: Run without XLA + +Execute the python script to train the model without XLA. + +```shell +python mnist_softmax_xla.py --xla=false +``` + +Using the Chrome Trace Event Profiler (browse to chrome://tracing), +open the timeline file created when the script finishes: `timeline.ctf.json`. \ +The rendered timeline should look similar to the +picture below with multiple green boxes labeled `MatMul`, possibly across +multiple CPUs. +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img style="width:100%" src="../../images/jit_timeline_cpu.png"> +</div> + +### Step #3 Run with XLA + +Execute the python script to train the model with XLA and turn on a debugging +feature of XLA via an environmental variable that outputs the XLA graph. + +```shell +TF_XLA_FLAGS=--xla_generate_hlo_graph=.* python mnist_softmax_xla.py +``` + +Open the timeline file created (`timeline.ctf.json`). 
The rendered timeline +should look similar to the picture below with one long bar labeled `_XlaLaunch`. +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img style="width:100%" src="../../images/jit_timeline_cpu_xla.png"> +</div> + +To understand what is happening in `_XlaLaunch`, look at the console output for +statements similar to the following: + +```shell +computation cluster_0[_XlaCompiledKernel=true,_XlaNumConstantArgs=1].v82 [CPU: +pipeline start, before inline]: /tmp/hlo_graph_0.dot + +``` + +The debug statements point to the location of `hlo_graph_xx.dot` files that +contain info about the graph created by XLA at runtime. To Render the .dot file +into a png, install [GraphViz](http://www.graphviz.org/Download..php) and run: + +```shell +dot -Tpng hlo_graph_0.dot -o hlo_graph_0.png +``` + +The result will look like the following: +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img style="width:100%" src="../../images/jit_cpu_xla_graph.png"> +</div> diff --git a/tensorflow/g3doc/experimental/xla/operation_semantics.md b/tensorflow/g3doc/experimental/xla/operation_semantics.md new file mode 100644 index 0000000000..46ed6bd7c1 --- /dev/null +++ b/tensorflow/g3doc/experimental/xla/operation_semantics.md @@ -0,0 +1,1468 @@ +# Operation Semantics + +The following describes the semantics of operations defined in the +[`ComputationBuilder`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h) +interface. Typically, these operations map one-to-one to operations defined in +the RPC interface in +[`xla_data.proto`](https://www.tensorflow.org/code/tensorflow/compiler/xla/xla_data.proto). + +A note on nomenclature: the generalized data type XLA deals with is an +N-dimensional array holding elements of some uniform type (such as 32-bit +float). Throughout the documentation, *array* is used to denote an +arbitrary-dimensional array. For convenience, special cases have more specific +and familiar names; for example a *vector* is a 1-dimensional array and a +*matrix* is a 2-dimensional array. + +## Broadcast + +See also +[`ComputationBuilder::Broadcast`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Adds dimensions to an array by duplicating the data in the array. + +<b> `Broadcast(operand, broadcast_sizes)` </b> + +Arguments | Type | Semantics +----------------- | ----------------------- | ------------------------------- +`operand` | `ComputationDataHandle` | The array to duplicate +`broadcast_sizes` | `ArraySlice<int64>` | The sizes of the new dimensions + +The new dimensions are inserted on the left, i.e. if `broadcast_sizes` has +values `{a0, ..., aN}` and the operand shape has dimensions `{b0, ..., bM}` then +the shape of the output has dimensions `{a0, ..., aN, b0, ..., bM}`. + +The new dimensions index into copies of the operand, i.e. + +``` +output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM] +``` + +For example, if `operand` is a scalar `f32` with value `2.0f`, and +`broadcast_sizes` is `{2, 3}`, then the result will be an array with shape +`f32[2, 3]` and all the values in the result will be `2.0f`. + +## Call + +See also +[`ComputationBuilder::Call`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Invokes a computation with the given arguments. 
+ +<b> `Call(computation, args...)` </b> + +| Arguments | Type | Semantics | +| ------------- | ------------------------ | -------------------------------- | +| `computation` | `Computation` | computation of type `T_0, T_1, | +: : : ..., T_N -> S` with N parameters : +: : : of arbitrary type : +| `args` | sequence of N | N arguments of arbitrary type | +: : `ComputationDataHandle`s : : + +The arity and types of the `args` must match the parameters of the +`computation`. It is allowed to have no `args`. + +## Collapse + +See also +[`ComputationBuilder::Collapse`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h) +and the [`Reshape`](#reshape) operation. + +Collapses dimensions of an array into one dimension. + +<b> `Collapse(operand, dimensions)` </b> + +| Arguments | Type | Semantics | +| ------------ | ----------------------- | ----------------------------------- | +| `operand` | `ComputationDataHandle` | array of type T | +| `dimensions` | `int64` vector | in-order, consecutive subset of T's | +: : : dimensions. : + +Collapse replaces the given subset of the operand's dimensions by a single +dimension. The input arguments are an arbitrary array of type T and a +compile-time-constant vector of dimension indices. The dimension indices must be +an in-order (low to high dimension numbers), consecutive subset of T's +dimensions. Thus, {0, 1, 2}, {0, 1}, or {1, 2} are all valid dimension sets, but +{1, 0} or {0, 2} are not. They are replaced by a single new dimension, in the +same position in the dimension sequence as those they replace, with the new +dimension size equal to the product of original dimension sizes. The lowest +dimension number in `dimensions` is the slowest varying dimension (most major) +in the loop nest which collapses these dimension, and the highest dimension +number is fastest varying (most minor). See the [`Reshape`](#reshape) operator +if more general collapse ordering is needed. + +For example, let v be an array of 24 elements: + +``` +let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}}, + {{20, 21, 22}, {25, 26, 27}}, + {{30, 31, 32}, {35, 36, 37}}, + {{40, 41, 42}, {45, 46, 47}}}; + +// Collapse to a single dimension, leaving one dimension. +let v012 = Collapse(v, {0,1,2}); +then v012 == f32[24] {10, 11, 12, 15, 16, 17, + 20, 21, 22, 25, 26, 27, + 30, 31, 32, 35, 36, 37, + 40, 41, 42, 45, 46, 47}; + +// Collapse the two lower dimensions, leaving two dimensions. +let v01 = Collapse(v, {0,1}); +then v01 == f32[4x6] {{10, 11, 12, 15, 16, 17}, + {20, 21, 22, 25, 26, 27}, + {30, 31, 32, 35, 36, 37}, + {40, 41, 42, 45, 46, 47}}; + +// Collapse the two higher dimensions, leaving two dimensions. +let v12 = Collapse(v, {1,2}); +then v12 == f32[8x3] {{10, 11, 12}, + {15, 16, 17}, + {20, 21, 22}, + {25, 26, 27}, + {30, 31, 32}, + {35, 36, 37}, + {40, 41, 42}, + {45, 46, 47}}; + +``` + +## Concatenate + +See also +[`ComputationBuilder::ConcatInDim`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Concatenate composes an array from multiple array operands. The array is of the +same rank as each of the input array operands (which must be of the same rank as +each other) and contains the arguments in the order that they were specified. 
+ +<b> `Concatenate(operands..., dimension)` </b> + +| Arguments | Type | Semantics | +| ----------- | ----------------------- | ------------------------------------ | +| `operands` | sequence of N | N arrays of type T with dimensions | +: : `ComputationDataHandle` : [L0, L1, ...]. Requires N >= 1. : +| `dimension` | `int64` | A value in the interval `[0, N)` | +: : : that names the dimension to be : +: : : concatenated between the `operands`. : + +With the exception of `dimension` all dimensions must be the same. This is +because XLA does not support "ragged" arrays Also note that rank-0 values +cannot be concatenated (as it's impossible to name the dimension along which the +concatenation occurs). + +1-dimensional example: + +``` +Concat({{2, 3}, {4, 5}, {6, 7}}, 0) +>>> {2, 3, 4, 5, 6, 7} +``` + +2-dimensional example: + +``` +let a = { + {1, 2}, + {3, 4}, + {5, 6}, +}; +let b = { + {7, 8}, +}; +Concat({a, b}, 0) +>>> { + {1, 2}, + {3, 4}, + {5, 6}, + {7, 8}, +} +``` + +Diagram: +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img style="width:100%" src="../../images/ops_concatenate.png"> +</div> + +## ConvertElementType + +See also +[`ComputationBuilder::ConvertElementType`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Similar to an element-wise `static_cast` in C++, performs an element-wise +conversion operation from a data shape to a target shape. The dimensions must +match, and the conversion is an element-wise one; e.g. `s32` elements become +`f32` elements via an `s32`-to-`f32` conversion routine. + +<b> `ConvertElementType(operand, new_element_type)` </b> + +Arguments | Type | Semantics +------------------ | ----------------------- | --------------------------- +`operand` | `ComputationDataHandle` | array of type T with dims D +`new_element_type` | `PrimitiveType` | type U + +If the dimensions of the operand and the target shape do not match, or an +invalid conversion is requested (e.g. to/from a tuple) an error will be +produced. + +A conversion such as `T=s32` to `U=f32` will perform a normalizing int-to-float +conversion routine such as round-to-nearest-even. + +> Note: The precise float-to-int and visa-versa conversions are currently +> unspecified, but may become additional arguments to the convert operation in +> the future. Not all possible conversions have been implemented for all +>targets. + +``` +let a: s32[3] = {0, 1, 2}; +let b: f32[3] = convert(a, f32); +then b == f32[3]{0.0, 1.0, 2.0} +``` + +## Conv (convolution) + +See also +[`ComputationBuilder::Conv`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +As ConvWithGeneralPadding, but the padding is specified in a short-hand way as +either SAME or VALID. SAME padding pads the input (`lhs`) with zeroes so that +the output has the same shape as the input when not taking striding into +account. VALID padding simply means no padding. + +## ConvWithGeneralPadding (convolution) + +See also +[`ComputationBuilder::ConvWithGeneralPadding`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Computes a convolution of the kind used in neural networks. Here, a convolution +can be thought of as a n-dimensional window moving across a n-dimensional base +area and a computation is performed for each possible position of the window. 
+ +| Arguments | Type | Semantics | +| ---------------- | ----------------------- | ----------------------------- | +| `lhs` | `ComputationDataHandle` | rank n+2 array of inputs | +| `rhs` | `ComputationDataHandle` | rank n+2 array of kernel | +: : : weights : +| `window_strides` | `ArraySlice<int64>` | n-d array of kernel strides | +| `padding` | `ArraySlice<pair<int64, | n-d array of (low, high) | +: : int64>>` : padding : +| `lhs_dilation` | `ArraySlice<int64>` | n-d lhs dilation factor array | +| `rhs_dilation` | `ArraySlice<int64>` | n-d rhs dilation factor array | + +Let n be the number of spatial dimensions. The `lhs` argument is a rank n+2 +array describing the base area. This is called the input, even though of course +the rhs is also an input. In a neural network, these are the input activations. +The n+2 dimensions are, in this order: + +* `batch`: Each coordinate in this dimension represents an independent input + for which convolution is carried out. +* `z/depth/features`: Each (y,x) position in the base area has a vector + associated to it, which goes into this dimension. +* `spatial_dims`: Describes the `n` spatial dimensions that define the base + area that the window moves across. + +The `rhs` argument is a rank n+2 array describing the convolutional +filter/kernel/window. The dimensions are, in this order: + +* `output-z`: The `z` dimension of the output. +* `input-z`: The size of this dimension should equal the size of the `z` + dimension in lhs. +* `spatial_dims`: Describes the `n` spatial dimensions that define the n-d + window that moves across the base area. + +The `window_strides` argument specifies the stride of the convolutional window +in the spatial dimensions. For example, if the stride in a the first spatial +dimension is 3, then the window can only be placed at coordinates where the +first spatial index is divisible by 3. + +The `padding` argument specifies the amount of zero padding to be applied to the +base area. The amount of padding can be negative -- the absolute value of +negative padding indicates the number of elements to remove from the specified +dimension before doing the convolution. `padding[0]` specifies the padding for +dimension `y` and `padding[1]` specifies the padding for dimension `x`. Each +pair has the low padding as the first element and the high padding as the second +element. The low padding is applied in the direction of lower indices while the +high padding is applied in the direction of higher indices. For example, if +`padding[1]` is `(2,3)` then there will be a padding by 2 zeroes on the left and +by 3 zeroes on the right in the second spatial dimension. Using padding is +equivalent to inserting those same zero values into the input (`lhs`) before +doing the convolution. + +The `lhs_dilation` and `rhs_dilation` arguments specify the dilation factor to +be applied to the lhs and rhs, respectively, in each spatial dimension. If the +dilation factor in a spatial dimension is d, then d-1 holes are implicitly +placed between each of the entries in that dimension, increasing the size of the +array. The holes are filled with a no-op value, which for convolution means +zeroes. + +Dilation of the rhs is also called atrous convolution. For more details, see the +[TensorFlow documentation on atrous convolution](https://www.tensorflow.org/ +versions/r0.11/api_docs/python/nn.html#atrous_conv2d). Dilation of the lhs is +also called deconvolution. 
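Taken together, padding, striding, and dilation determine how many valid window placements, and therefore how many output elements, each spatial dimension has. The helper below is a sketch of the usual convolution output-size arithmetic implied by the description above; it is not part of the XLA API, and edge cases such as negative padding should be checked against the implementation.

```python
# Sketch (not the XLA API): size of one output spatial dimension under the
# padding/striding/dilation semantics described above.
def conv_output_dim(input_size, kernel_size, stride, pad_low, pad_high,
                    lhs_dilation=1, rhs_dilation=1):
    # lhs dilation inserts (lhs_dilation - 1) holes between input elements.
    dilated_input = (input_size - 1) * lhs_dilation + 1
    # rhs dilation inserts (rhs_dilation - 1) holes between kernel elements.
    dilated_kernel = (kernel_size - 1) * rhs_dilation + 1
    padded_input = dilated_input + pad_low + pad_high
    # Count the stride-aligned placements of the window in the padded area.
    return (padded_input - dilated_kernel) // stride + 1

# 10-wide input, 3-wide kernel, stride 3, no padding -> 3 window placements.
print(conv_output_dim(10, 3, stride=3, pad_low=0, pad_high=0))  # 3
# Same, with (2, 3) padding as in the example above -> 5 window placements.
print(conv_output_dim(10, 3, stride=3, pad_low=2, pad_high=3))  # 5
```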
+ +The output shape has these dimensions, in this order: + +* `batch`: Same size as `batch` on the input (`lhs`). +* `z`: Same size as `output-z` on the kernel (`rhs`). +* `spatial_dims`: One value for each valid placement of the convolutional + window. + +The valid placements of the convolutional window are determined by the strides +and the size of the base area after padding. + +To describe what a convolution does, consider a 2d convolution, and pick some +fixed `batch`, `z`, `y`, `x` coordinates in the output. Then `(y,x)` is a +position of a corner of the window within the base area (e.g. the upper left +corner, depending on how you interpret the spatial dimensions). We now have a 2d +window, taken from the base area, where each 2d point is associated to a 1d +vector, so we get a 3d box. From the convolutional kernel, since we fixed the +output coordinate `z`, we also have a 3d box. The two boxes have the same +dimensions, so we can take the sum of the element-wise products between the two +boxes (similar to a dot product). That is the output value. + +Note that if `output-z` is e.g., 5, then each position of the window produces 5 +values in the output into the `z` dimension of the output. These values differ +in what part of the convolutional kernel is used - there is a separate 3d box of +values used for each `output-z` coordinate. So you could think of it as 5 +separate convolutions with a different filter for each of them. + +Here is pseudo-code for a 2d convolution with padding and striding: + +``` +for (b, oz, oy, ox) { // output coordinates + value = 0; + for (iz, ky, kx) { // kernel coordinates and input z + iy = oy*stride_y + ky - pad_low_y; + ix = ox*stride_x + kx - pad_low_x; + if ((iy, ix) inside the base area considered without padding) { + value += input(b, iz, iy, ix) * kernel(oz, iz, ky, kx); + } + } + output(b, oz, oy, ox) = value; +} +``` + +## CrossReplicaSum + +See also +[`ComputationBuilder::CrossReplicaSum`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Computes a sum across replicas. + +<b> `CrossReplicaSum(operand)` </b> + +| Arguments | Type | Semantics | +| ------------ | ----------------------- | ---------------------------------- | +| `operand` | `ComputationDataHandle` | Array to sum across replicas. | + +The output shape is the same as the input shape. For example, if there are two +replicas and the operand has the value `(1.0, 2.5)` and `(3.0, 5.1)` +respectively on the two replicas, then the output value from this op will be +`(4.0, 7.6)` on both replicas. + +Computing the result of CrossReplicaSum requires having one input from each +replica, so if one replica executes a CrossReplicaSum node more times than +another, then the former replica will wait forever. Since the replicas are all +running the same program, there are not a lot of ways for that to happen, but it +is possible when a while loop's condition depends on data from infeed and the +data that is infed causes the while loop to iterate more times on one replica +than another. + +## CustomCall + +See also +[`ComputationBuilder::CustomCall`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Call a user-provided function within a computation. + +<b> `CustomCall(target_name, args..., shape)` </b> + +| Arguments | Type | Semantics | +| ------------- | ------------------------ | -------------------------------- | +| `target_name` | `string` | Name of the function. 
A call | +: : : instruction will be emitted : +: : : which targets this symbol name. : +| `args` | sequence of N | N arguments of arbitrary type, | +: : `ComputationDataHandle`s : which will be passed to the : +: : : function. : +| `shape` | `Shape` | Output shape of the function | + +The function signature is the same, regardless of the arity or type of args: + +``` +extern "C" void target_name(void* out, void** in); +``` + +For example, if CustomCall is used as follows: + +``` +let x = f32[2] {1,2}; +let y = f32[2x3] {{10, 20, 30}, {40, 50, 60}}; + +CustomCall("myfunc", {x, y}, f32[3x3]) +``` + +Here is an example of an implementation of `myfunc`: + +``` +extern "C" void myfunc(void* out, void** in) { + float (&x)[2] = *static_cast<float(*)[2]>(in[0]); + float (&y)[2][3] = *static_cast<float(*)[2][3]>(in[1]); + EXPECT_EQ(1, x[0]); + EXPECT_EQ(2, x[1]); + EXPECT_EQ(10, y[0][0]); + EXPECT_EQ(20, y[0][1]); + EXPECT_EQ(30, y[0][2]); + EXPECT_EQ(40, y[1][0]); + EXPECT_EQ(50, y[1][1]); + EXPECT_EQ(60, y[1][2]); + float (&z)[3][3] = *static_cast<float(*)[3][3]>(out); + z[0][0] = x[1] + y[1][0]; + // ... +} +``` + +The user-provided function must not have side-effects and its execution must be +idempotent. + +> Note: The opaque nature of the user-provided function restricts optimization +> opportunities for the compiler. Try to express your computation in terms of +> native XLA ops whenever possible; only use CustomCall as a last resort. + +## Dot + +See also +[`ComputationBuilder::Dot`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +<b> `Dot(lhs, rhs)` </b> + +Arguments | Type | Semantics +--------- | ----------------------- | --------------- +`lhs` | `ComputationDataHandle` | array of type T +`rhs` | `ComputationDataHandle` | array of type T + +The exact semantics of this operation depend on the ranks of the operands: + +| Input | Output | Semantics | +| ----------------------- | --------------------- | ----------------------- | +| vector [n] `dot` vector | scalar | vector dot product | +: [n] : : : +| matrix [m x k] `dot` | vector [m] | matrix-vector | +: vector [k] : : multiplication : +| matrix [m x k] `dot` | matrix [m x n] | matrix-matrix | +: matrix [k x n] : : multiplication : + +The operation performs sum of products over the last dimension of `lhs` and the +one-before-last dimension of `rhs`. These are the "contracted" dimensions. The +contracted dimensions of `lhs` and `rhs` must be of the same size. In practice, +it can be used to perform dot products between vectors, vector/matrix +multiplications or matrix/matrix multiplications. + +## Element-wise binary arithmetic operations + +See also +[`ComputationBuilder::Add`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +A set of element-wise binary arithmetic operations is supported. + +<b> `Op(lhs, rhs)` </b> + +Where `Op` is one of `Add` (addition), `Sub` (subtraction), `Mul` +(multiplication), `Div` (division), `Rem` (remainder), `Max` (maximum), `Min` +(minimum), `LogicalAnd` (logical AND), or `LogicalOr` (logical OR). + +Arguments | Type | Semantics +--------- | ----------------------- | ---------------------------------------- +`lhs` | `ComputationDataHandle` | left-hand-side operand: array of type T +`rhs` | `ComputationDataHandle` | right-hand-side operand: array of type T + +The arguments' shapes have to be either similar or compatible. See the +[broadcasting](broadcasting.md) documentation about what it means for shapes to +be compatible. 
The result of an operation has a shape which is the result of broadcasting the two input arrays. In this variant, operations between arrays of different ranks are *not* supported, unless one of the operands is a scalar.

When `Op` is `Rem`, the sign of the result is taken from the dividend, and the absolute value of the result is always less than the divisor's absolute value.

An alternative variant with different-rank broadcasting support exists for these operations:

<b> `Op(lhs, rhs, broadcast_dimensions)` </b>

Where `Op` is the same as above. This variant of the operation should be used for arithmetic operations between arrays of different ranks (such as adding a matrix to a vector).

The additional `broadcast_dimensions` operand is a slice of integers used to expand the rank of the lower-rank operand up to the rank of the higher-rank operand. `broadcast_dimensions` maps the dimensions of the lower-rank shape to the dimensions of the higher-rank shape. The unmapped dimensions of the expanded shape are filled with dimensions of size one. Degenerate-dimension broadcasting then broadcasts the shapes along these degenerate dimensions to equalize the shapes of both operands. The semantics are described in detail on the [broadcasting page](broadcasting.md).

## Element-wise comparison operations

See also [`ComputationBuilder::Eq`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).

A set of standard element-wise binary comparison operations is supported. Note that standard IEEE 754 floating-point comparison semantics apply when comparing floating-point types.

<b> `Op(lhs, rhs)` </b>

Where `Op` is one of `Eq` (equal-to), `Ne` (not-equal-to), `Ge` (greater-than-or-equal-to), `Gt` (greater-than), `Le` (less-than-or-equal-to), or `Lt` (less-than).

Arguments | Type | Semantics
--------- | ----------------------- | ----------------------------------------
`lhs` | `ComputationDataHandle` | left-hand-side operand: array of type T
`rhs` | `ComputationDataHandle` | right-hand-side operand: array of type T

The arguments' shapes have to be either similar or compatible. See the [broadcasting](broadcasting.md) documentation about what it means for shapes to be compatible. The result of an operation has a shape which is the result of broadcasting the two input arrays with the element type `PRED`. In this variant, operations between arrays of different ranks are *not* supported, unless one of the operands is a scalar.

An alternative variant with different-rank broadcasting support exists for these operations:

<b> `Op(lhs, rhs, broadcast_dimensions)` </b>

Where `Op` is the same as above. This variant of the operation should be used for comparison operations between arrays of different ranks (such as comparing a matrix against a vector).

The additional `broadcast_dimensions` operand is a slice of integers specifying the dimensions to use for broadcasting the operands. The semantics are described in detail on the [broadcasting page](broadcasting.md).

## Element-wise unary functions

ComputationBuilder supports these element-wise unary functions:

<b>`Abs(operand)`</b> Element-wise abs `x -> |x|`.

<b>`Ceil(operand)`</b> Element-wise ceil `x -> ⌈x⌉`.

<b>`Exp(operand)`</b> Element-wise natural exponential `x -> e^x`.

<b>`Floor(operand)`</b> Element-wise floor `x -> ⌊x⌋`.

<b>`Log(operand)`</b> Element-wise natural logarithm `x -> ln(x)`.

<b>`LogicalNot(operand)`</b> Element-wise logical not `x -> !(x)`.
+ +<b>`Neg(operand)`</b> Element-wise negation `x -> -x`. + +<b>`Sign(operand)`</b> Element-wise sign operation `x -> sgn(x)` where + +$$\text{sgn}(x) = \begin{cases} -1 & x < 0\\ 0 & x = 0\\ 1 & x > 0 \end{cases}$$ + +using the comparison operator of the element type of `operand`. + +<b>`Tanh(operand)`</b> Element-wise hyperbolic tangent `x -> tanh(x)`. + + +Arguments | Type | Semantics +--------- | ----------------------- | --------------------------- +`operand` | `ComputationDataHandle` | The operand to the function + +The function is applied to each element in the `operand` array, resulting in an +array with the same shape. It is allowed for `operand` to be a scalar (rank 0). + +## GetTupleElement + +See also +[`ComputationBuilder::GetTupleElement`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Indexes into a tuple with a compile-time-constant value. + +The value must be a compile-time-constant so that shape inference can determine +the type of the resulting value. + +This is analogous to `std::get<int N>(t)` in C++. Conceptually: + +``` +let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; +let s: s32 = 5; +let t: (f32[10], s32) = tuple(v, s); +let element_1: s32 = gettupleelement(t, 1); // Inferred shape matches s32. +``` + +See also [`Tuple`](#tuple). + +## Infeed + +See also +[`ComputationBuilder::Infeed`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +<b> `Infeed(shape)` </b> + +| Argument | Type | Semantics | +| -------- | ------- | ----------------------------------------------------- | +| `shape` | `Shape` | Shape of the data read from the Infeed interface. The | +: : : layout field of the shape must be set to match the : +: : : layout of the data sent to the device; otherwise its : +: : : behavior is undefined. : + +Reads a single data item from the implicit Infeed streaming interface of the +device, interpreting the data as the given shape and its layout, and returns a +`ComputationDataHandle` of the data. Multiple Infeed operations are allowed in a +computation, but there must be a total order among the Infeed operations. For +example, two Infeeds in the code below have a total order since there is a +dependency between the while loops. The compiler issues an error if there isn't +a total order. + +``` +result1 = while (condition, init = init_value) { + Infeed(shape) +} + +result2 = while (condition, init = result1) { + Infeed(shape) +} +``` + +Nested tuple shapes are not supported. For an empty tuple shape, the Infeed +operation is effectively a nop and proceeds without reading any data from the +Infeed of the device. + +> Note: We plan to allow multiple Infeed operations without a total order, in +> which case the compiler will provide information about how the Infeed +> operations are serialized in the compiled program. + +## Map + +See also +[`ComputationBuilder::Map`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). 
+ +<b> `Map(operands..., computation)` </b> + +| Arguments | Type | Semantics | +| ----------------- | ------------------------ | ----------------------------- | +| `operands` | sequence of N | N arrays of types T_0..T_{N-1}| +: : `ComputationDataHandle`s : : +| `computation` | `Computation` | computation of type `T_0, | +: : : T_1, ..., T_{N + M -1} -> S` : +: : : with N parameters of type T : +: : : and M of arbitrary type : +| `static_operands` | sequence of M | M arrays of arbitrary type | +: : `ComputationDataHandle`s : : + +Applies a scalar function over the given `operands` arrays, producing an array +of the same dimensions where each element is the result of the mapped function +applied to the corresponding elements in the input arrays with `static_operands` +given as additional input to `computation`. + +The mapped function is an arbitrary computation with the restriction that it has +N inputs of scalar type `T` and a single output with type `S`. The output has +the same dimensions as the operands except that the element type T is replaced +with S. + +For example: `Map(op1, op2, op3, computation, par1)` maps `elem_out <- +computation(elem1, elem2, elem3, par1)` at each (multi-dimensional) index in the +input arrays to produce the output array. + +## Pad + +See also +[`ComputationBuilder::Pad`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +<b> `Pad(operand, padding_value, padding_config)` </b> + +| Arguments | Type | Semantics | +| ---------------- | ----------------------- | ----------------------------- | +| `operand` | `ComputationDataHandle` | array of type `T` | +| `padding_value` | `ComputationDataHandle` | scalar of type `T` to fill in | +: : : the added padding : +| `padding_config` | `PaddingConfig` | padding amount on both edges | +: : : (low, high) and between the : +: : : elements of each dimension : + +Expands the given `operand` array by padding around the array as well as between +the elements of the array with the given `padding_value`. `padding_config` +specifies the amount of edge padding and the interior padding for each +dimension. + +`PaddingConfig` is a repeated field of `PaddingConfigDimension`, which contains +three fields for each dimension: `edge_padding_low`, `edge_padding_high`, and +`interior_padding`. `edge_padding_low` and `edge_padding_high` specifies the +amount of padding added at the low-end (next to index 0) and the high-end (next +to the highest index) of each dimension respectively. `interior_padding` +specifies the amount of padding added between any two elements in each +dimension. This operation is a no-op if the edge padding pairs are all (0, 0) +and the interior padding values are all 0. Figure below shows examples of +different `edge_padding` and `interior_padding` values for a two dimensional +array. + +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img style="width:100%" src="../../images/ops_pad.png"> +</div> + +## Reduce + +See also +[`ComputationBuilder::Reduce`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Applies a reduction function to an array. 
<b> `Reduce(operand, init_value, computation, dimensions)` </b>

| Arguments     | Type                    | Semantics                               |
| ------------- | ----------------------- | --------------------------------------- |
| `operand`     | `ComputationDataHandle` | array of type `T`                        |
| `init_value`  | `ComputationDataHandle` | scalar of type `T`                       |
| `computation` | `Computation`           | computation of type `T, T -> T`          |
| `dimensions`  | `int64` array           | unordered array of dimensions to reduce  |

Conceptually, this operation reduces one or more dimensions in the input array into scalars. The rank of the result array is `rank(operand) - len(dimensions)`. `init_value` is the initial value used for every reduction and may also be inserted anywhere during computation if the back-end chooses to do so. So in most cases `init_value` should be an identity of the reduction function (for example, 0 for addition).

The evaluation order of the reduction function across the reduction dimensions is arbitrary and may be non-deterministic. Therefore, the reduction function should not be overly sensitive to reassociation[^1].

As an example, when reducing across the single dimension of a 1D array with values [10, 11, 12, 13], with reduction function `f` (this is `computation`), the result could be computed as

`f(10, f(11, f(12, f(init_value, 13))))`

but there are also many other possibilities, e.g.

`f(init_value, f(f(10, f(init_value, 11)), f(f(init_value, 12), f(13, init_value))))`

The following is a rough pseudo-code example of how reduction could be implemented, using summation as the reduction computation with an initial value of 0.

```python
result_shape <- remove all dims in dimensions from operand_shape

# Iterate over all elements in result_shape. The number of r's here is equal
# to the rank of the result
for r0 in range(result_shape[0]), r1 in range(result_shape[1]), ...:
  # Initialize this result element
  result[r0, r1...] <- 0

  # Iterate over all the reduction dimensions
  for d0 in range(dimensions[0]), d1 in range(dimensions[1]), ...:
    # Increment the result element with the value of the operand's element.
    # The index of the operand's element is constructed from all ri's and di's
    # in the right order (by construction ri's and di's together index over the
    # whole operand shape).
    result[r0, r1...] += operand[ri... di]
```

Here's an example of reducing a 2D array (matrix). The shape has rank 2, dimension 0 of size 2 and dimension 1 of size 3:

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../../images/ops_2d_matrix.png">
</div>

Results of reducing dimensions 0 or 1 with an "add" function:

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../../images/ops_reduce_from_2d_matrix.png">
</div>

Note that both reduction results are 1D arrays. The diagram shows one as a column and the other as a row just for visual convenience.

For a more complex example, here is a 3D array. Its rank is 3, dimension 0 of size 4, dimension 1 of size 2 and dimension 2 of size 3. For simplicity, the values 1 to 6 are replicated across dimension 0.

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:35%" src="../../images/ops_reduce_from_3d_matrix.png">
</div>

Similarly to the 2D example, we can reduce just one dimension.
If we reduce +dimension 0, for example, we get a rank-2 array where all values across +dimension 0 were folded into a scalar: + +```text +| 4 8 12 | +| 4 8 12 | +``` + +If we reduce dimension 2, we also get a rank-2 array where all values across +dimension 2 were folded into a scalar: + +```text +| 6 15 | +| 6 15 | +| 6 15 | +| 6 15 | +``` + +Note that the relative order between the remaining dimensions in the input is +preserved in the output, but some dimensions may get assigned new numbers (since +the rank changes). + +We can also reduce multiple dimensions. Add-reducing dimensions 0 and 1 produces +the 1D array `| 20 28 36 |`. + +Reducing the 3D array over all its dimensions produces the scalar `84`. + +## ReduceWindow + +See also +[`ComputationBuilder::ReduceWindow`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Applies a reduction function to all elements in each window of the input +multi-dimensional array, producing an output multi-dimensional array with the +same number of elements as the number of valid positions of the window. A +pooling layer can be expressed as a `ReduceWindow`. + +<b> `ReduceWindow(operand, init_value, computation, window_dimensions, +window_strides, padding)` </b> + +| Arguments | Type | Semantics | +| ------------------- | ----------------------- | ---------------------------- | +| `operand` | `ComputationDataHandle` | N dimensional array | +: : : containing elements of type : +: : : T. This is the base area on : +: : : which the window is placed. : +| `init_value` | `ComputationDataHandle` | Starting value for the | +: : : reduction. See [Reduce] : +: : : (#reduce) for details. : +| `computation` | `Computation` | Reduction function of type | +: : : `T, T -> T`, to apply to all : +: : : elements in each window : +| `window_dimensions` | `ArraySlice<int64>` | array of integers for window | +: : : dimension values : +| `window_strides` | `ArraySlice<int64>` | array of integers for window | +: : : stride values : +| `padding` | `Padding` | padding type for window | +: : : (Padding\:\:kSame or : +: : : Padding\:\:kValid) : + +Below code and figure shows an example of using `ReduceWindow`. Input is a +matrix of size [4x6] and both window_dimensions and window_stride_dimensions are +[2x3]. + +``` +// Create a computation for the reduction (maximum). +Computation max; +{ + ComputationBuilder builder(client_, "max"); + auto y = builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "y"); + auto x = builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "x"); + builder.Max(y, x); + max = builder.Build().ConsumeValueOrDie(); +} + +// Create a ReduceWindow computation with the max reduction computation. +ComputationBuilder builder(client_, "reduce_window_2x3"); +auto shape = ShapeUtil::MakeShape(F32, {4, 6}); +auto input = builder.Parameter(0, shape, "input"); +builder.ReduceWindow( + input, *max, + /*init_val=*/builder.ConstantLiteral(LiteralUtil::MinValue(F32)), + /*window_dimensions=*/{2, 3}, + /*window_stride_dimensions=*/{2, 3}, + Padding::kValid); +``` + +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img style="width:35%" src="../../images/ops_reduce_window.png"> +</div> + +Stride of 1 in a dimension specifies that the position of a window in the +dimension is 1 element away from its adjacent window. In order to specify that +no windows overlap with each other, window_stride_dimensions should be equal to +window_dimensions. The figure below illustrates the use of two different stride +values. 
Padding is applied to each dimension of the input, and the calculations
are the same as though the input had come in with the dimensions it has after
padding.

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:75%" src="../../images/ops_reduce_window_stride.png">
</div>

## Reshape

See also
[`ComputationBuilder::Reshape`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h)
and the [`Collapse`](#collapse) operation.

Reshapes the dimensions of an array into a new configuration.

<b> `Reshape(operand, new_sizes)` </b>
<b> `Reshape(operand, dimensions, new_sizes)` </b>

Arguments    | Type                    | Semantics
------------ | ----------------------- | ---------------------------------------
`operand`    | `ComputationDataHandle` | array of type T
`dimensions` | `int64` vector          | order in which dimensions are collapsed
`new_sizes`  | `int64` vector          | vector of sizes of new dimensions

Conceptually, reshape first flattens an array into a one-dimensional vector of
data values, and then refines this vector into a new shape. The input arguments
are an arbitrary array of type T, a compile-time-constant vector of dimension
indices, and a compile-time-constant vector of dimension sizes for the result.
The values in the `dimensions` vector, if given, must be a permutation of all of
T's dimensions; the default if not given is `{0, ..., rank - 1}`. The order of
the dimensions in `dimensions` is from slowest-varying dimension (most major) to
fastest-varying dimension (most minor) in the loop nest which collapses the
input array into a single dimension. The `new_sizes` vector determines the size
of the output array. The value at index 0 in `new_sizes` is the size of
dimension 0, the value at index 1 is the size of dimension 1, and so on. The
product of the `new_sizes` dimensions must equal the product of the operand's
dimension sizes. When refining the collapsed array into the multidimensional
array defined by `new_sizes`, the dimensions in `new_sizes` are ordered from
slowest varying (most major) to fastest varying (most minor).

For example, let v be an array of 24 elements:

```
let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}},
                    {{20, 21, 22}, {25, 26, 27}},
                    {{30, 31, 32}, {35, 36, 37}},
                    {{40, 41, 42}, {45, 46, 47}}};

In-order collapse:
let v012_24 = Reshape(v, {0,1,2}, {24});
then v012_24 == f32[24] {10, 11, 12, 15, 16, 17, 20, 21, 22, 25, 26, 27,
                         30, 31, 32, 35, 36, 37, 40, 41, 42, 45, 46, 47};

let v012_83 = Reshape(v, {0,1,2}, {8,3});
then v012_83 == f32[8x3] {{10, 11, 12}, {15, 16, 17},
                          {20, 21, 22}, {25, 26, 27},
                          {30, 31, 32}, {35, 36, 37},
                          {40, 41, 42}, {45, 46, 47}};

Out-of-order collapse:
let v021_24 = Reshape(v, {1,2,0}, {24});
then v021_24 == f32[24] {10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42,
                         15, 25, 35, 45, 16, 26, 36, 46, 17, 27, 37, 47};

let v021_83 = Reshape(v, {1,2,0}, {8,3});
then v021_83 == f32[8x3] {{10, 20, 30}, {40, 11, 21},
                          {31, 41, 12}, {22, 32, 42},
                          {15, 25, 35}, {45, 16, 26},
                          {36, 46, 17}, {27, 37, 47}};

let v021_262 = Reshape(v, {1,2,0}, {2,6,2});
then v021_262 == f32[2x6x2] {{{10, 20}, {30, 40},
                              {11, 21}, {31, 41},
                              {12, 22}, {32, 42}},
                             {{15, 25}, {35, 45},
                              {16, 26}, {36, 46},
                              {17, 27}, {37, 47}}};
```

As a special case, reshape can transform a single-element array to a scalar and
vice versa.
For example, + +``` +Reshape(f32[1x1] {{5}}, {0,1}, {}) == 5; +Reshape(5, {}, {1,1}) == f32[1x1] {{5}}; +``` + +## Rev (reverse) + +See also +[`ComputationBuilder::Rev`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +<b>`Rev(operand, dimensions)`</b> + +Arguments | Type | Semantics +------------ | ----------------------- | --------------------- +`operand` | `ComputationDataHandle` | array of type T +`dimensions` | `ArraySlice<int64>` | dimensions to reverse + +Reverses the order of elements in the `operand` array along the specified +`dimensions`, generating an output array of the same shape. Each element of the +operand array at a multidimensional index is stored into the output array at a +transformed index. The multidimensional index is transformed by reversing the +index in each dimension to be reversed (i.e., if a dimension of size N is one of +the reversing dimensions, its index i is transformed into N - 1 - i). + +One use for the `Rev` operation is to reverse the convolution weight array along +the two window dimensions during the gradient computation in neural networks. + +## RngBernoulli + +See also +[`ComputationBuilder::RngBernoulli`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Constructs an output of a given shape with random numbers generated following +the Bernoulli distribution. The parameter needs to be a scalar valued F32 +operand while the output shape needs to have elemental type U32. + +<b>`RngBernoulli(mean, shape)`</b> + +| Arguments | Type | Semantics | +| --------- | ----------------------- | ------------------------------------- | +| `mean` | `ComputationDataHandle` | Scalar of type F32 specifying mean of | +: : : generated numbers : +| `shape` | `Shape` | Output shape of type U32 | + +## RngNormal + +See also +[`ComputationBuilder::RngNormal`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Constructs an output of a given shape with random numbers generated following +the $$N(\mu, \sigma)$$ normal distribution. The parameters `mu` and `sigma`, and +output shape have to have elemental type F32. The parameters furthermore have to +be scalar valued. + +<b>`RngNormal(mean, sigma, shape)`</b> + +| Arguments | Type | Semantics | +| --------- | ----------------------- | -------------------------------------- | +| `mu` | `ComputationDataHandle` | Scalar of type F32 specifying mean of | +: : : generated numbers : +| `sigma` | `ComputationDataHandle` | Scalar of type F32 specifying standard | +: : : deviation of generated numbers : +| `shape` | `Shape` | Output shape of type F32 | + +## RngUniform + +See also +[`ComputationBuilder::RngUniform`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Constructs an output of a given shape with random numbers generated following +the uniform distribution over the interval $$[a,b]$$. The parameters and output +shape may be either F32, S32 or U32, but the types have to be consistent. +Furthermore, the parameters need to be scalar valued. 
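
As an illustrative sketch, filling a `2x2` F32 array with draws from the
uniform distribution on `[0.0, 1.0]` might look as follows, assuming a
`ComputationBuilder` named `builder` and that the builder method follows the
argument order in the table below:

```
// a and b are the scalar lower and upper limits of the interval.
auto uniform = builder.RngUniform(builder.ConstantR0<float>(0.0f),
                                  builder.ConstantR0<float>(1.0f),
                                  ShapeUtil::MakeShape(F32, {2, 2}));
```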
+ +<b>`RngUniform(a, b, shape)`</b> + +| Arguments | Type | Semantics | +| --------- | ----------------------- | --------------------------------- | +| `a` | `ComputationDataHandle` | Scalar of type T specifying lower | +: : : limit of interval : +| `b` | `ComputationDataHandle` | Scalar of type T specifying upper | +: : : limit of interval : +| `shape` | `Shape` | Output shape of type T | + +## SelectAndScatter + +See also +[`ComputationBuilder::SelectAndScatter`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +This operation can be considered as a composite operation that first computes +`ReduceWindow` on the `operand` array to select an element from each window, and +then scatters the `source` array to the indices of the selected elements to +construct an output array with the same shape as the operand array. The binary +`select` function is used to select an element from each window by applying it +across each window, and it is called with the property that the first +parameter's index vector is lexicographically less than the second parameter's +index vector. The `select` function returns `true` if the first parameter is +selected and returns `false` if the second parameter is selected, and the +function must hold transitivity (i.e., if `select(a, b)` and `select(b, c)` are +`true`, then `select(a, c)` is also `true`) so that the selected element does +not depend on the order of the elements traversed for a given window. + +The function `scatter` is applied at each selected index in the output array. It +takes two scalar parameters: + +1. Current value at the selected index in the output array +2. The scatter value from `source` that applies to the selected index + +It combines the two parameters and returns a scalar value that's used to update +the value at the selected index in the output array. Initially, all indices of +the output array are set to `init_value`. + +The output array has the same shape as the `operand` array and the `source` +array must have the same shape as the result of applying a `ReduceWindow` +operation on the `operand` array. `SelectAndScatter` can be used to +backpropagate the gradient values for a pooling layer in a neural network. 
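
For concreteness, here is an illustrative sketch that pairs the `ReduceWindow`
example above with `SelectAndScatter`, scattering the gradient values in
`source` back to the positions selected by a 2x3 max-pooling window. It assumes
a `client_` handle and that `ComputationBuilder::SelectAndScatter` mirrors the
argument order in the table below.

```
// select: keep the larger element of each pair, so each window picks its max.
Computation ge;
{
  ComputationBuilder builder(client_, "ge");
  auto lhs = builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "lhs");
  auto rhs = builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "rhs");
  builder.Ge(lhs, rhs);
  ge = builder.Build().ConsumeValueOrDie();
}

// scatter: accumulate gradients that land on the same selected index.
Computation add;
{
  ComputationBuilder builder(client_, "add");
  auto lhs = builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "lhs");
  auto rhs = builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "rhs");
  builder.Add(lhs, rhs);
  add = builder.Build().ConsumeValueOrDie();
}

// With a [4x6] operand, 2x3 windows and 2x3 strides (valid padding), the
// ReduceWindow output -- and therefore `source` -- has shape [2x2].
ComputationBuilder builder(client_, "select_and_scatter_2x3");
auto operand = builder.Parameter(0, ShapeUtil::MakeShape(F32, {4, 6}), "operand");
auto source = builder.Parameter(1, ShapeUtil::MakeShape(F32, {2, 2}), "source");
builder.SelectAndScatter(operand, ge,
                         /*window_dimensions=*/{2, 3},
                         /*window_strides=*/{2, 3}, Padding::kValid, source,
                         /*init_value=*/builder.ConstantR0<float>(0.0f), add);
```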
+ +<b>`SelectAndScatter(operand, select, window_dimensions, window_strides, +padding, source, init_value, scatter)`</b> + +| Arguments | Type | Semantics | +| ------------------- | ----------------------- | ---------------------------- | +| `operand` | `ComputationDataHandle` | array of type T over which | +: : : the windows slide : +| `select` | `Computation` | binary computation of type | +: : : `T, T -> PRED`, to apply to : +: : : all elements in each window; : +: : : returns `true` if the first : +: : : parameter is selected and : +: : : returns `false` if the : +: : : second parameter is selected : +| `window_dimensions` | `ArraySlice<int64>` | array of integers for window | +: : : dimension values : +| `window_strides` | `ArraySlice<int64>` | array of integers for window | +: : : stride values : +| `padding` | `Padding` | padding type for window | +: : : (Padding\:\:kSame or : +: : : Padding\:\:kValid) : +| `source` | `ComputationDataHandle` | array of type T with the | +: : : values to scatter : +| `init_value` | `ComputationDataHandle` | scalar value of type T for | +: : : the inital value of the : +: : : output array : +| `scatter` | `Computation` | binary computation of type | +: : : `T, T -> T`, to apply each : +: : : scatter source element with : +: : : its destination element : + +The figure below shows examples of using `SelectAndScatter`, with the `select` +function computing the maximal value among its parameters. Note that when the +windows overlap, as in the figure (2) below, an index of the `operand` array may +be selected multiple times by different windows. In the figure, the element of +value 9 is selected by both of the top windows (blue and red) and the binary +addition `scatter` function produces the output element of value 8 (2 + 6). + +<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;"> + <img style="width:100%" + src="../../images/ops_scatter_to_selected_window_element.png"> +</div> + +## Select + +See also +[`ComputationBuilder::Select`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Constructs an output array from elements of two input arrays, based on the +values of a predicate array. + +<b> `Select(pred, on_true, on_false)` </b> + +Arguments | Type | Semantics +---------- | ----------------------- | ------------------ +`pred` | `ComputationDataHandle` | array of type PRED +`on_true` | `ComputationDataHandle` | array of type T +`on_false` | `ComputationDataHandle` | array of type T + +The arrays `on_true` and `on_false` must have the same shape. This is also the +shape of the output array. The array `pred` must have the same dimensionality as +`on_true` and `on_false`, with the `PRED` element type. + +For each element `P` of `pred`, the corresponding element of the output array is +taken from `on_true` if the value of `P` is `true`, and from `on_false` if the +value of `P` is `false`. As a restricted form of [broadcasting] +(broadcasting.md), `pred` can be a scalar of type `PRED`. In this case, the +output array is taken wholly from `on_true` if `pred` is `true`, and from +`on_false` if `pred` is `false`. 
+ +Example with non-scalar `pred`: + +``` +let pred: PRED[4] = {true, false, false, true}; +let v1: s32[4] = {1, 2, 3, 4}; +let v2: s32[4] = {100, 200, 300, 400}; +==> +Select(pred, v1, v2) = s32[4]{1, 200, 300, 4}; +``` + +Example with scalar `pred`: + +``` +let pred: PRED = true; +let v1: s32[4] = {1, 2, 3, 4}; +let v2: s32[4] = {100, 200, 300, 400}; +==> +Select(pred, v1, v2) = s32[4]{1, 2, 3, 4}; +``` + +Selections between tuples are supported. Tuples are considered to be scalar +types for this purpose. If `on_true` and `on_false` are tuples (which must have +the same shape!) then `pred` has to be a scalar of type `PRED`. + +## Slice + +See also +[`ComputationBuilder::Slice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Slicing extracts a sub-array from the input array. The sub-array is of the same +rank as the input and contains the values inside a bounding box within the input +array where the dimensions and indices of the bounding box are given as +arguments to the slice operation. + +<b> `Slice(operand, start_indices, limit_indices)` </b> + +| Arguments | Type | Semantics | +| --------------- | ----------------------- | -------------------------------- | +| `operand` | `ComputationDataHandle` | N dimensional array of type T | +| `start_indices` | `ArraySlice<int64>` | List of N integers containing | +: : : the starting indices of the : +: : : slice for each dimension. Values : +: : : must be greater than or equal to : +: : : zero. : +| `limit_indices` | `ArraySlice<int64>` | List of N integers containing | +: : : the ending indices (exclusive) : +: : : for the slice for each : +: : : dimension. Each value must be : +: : : strictly greater than the : +: : : respective `start_indices` value : +: : : for the dimension and less than : +: : : or equal to the size of the : +: : : dimension. : + +1-dimensional example: + +``` +let a = {0.0, 1.0, 2.0, 3.0, 4.0} +Slice(a, {2}, {4}) produces: + {2.0, 3.0} +``` + +2-dimensional example: + +``` +let b = + { {0.0, 1.0, 2.0}, + {3.0, 4.0, 5.0}, + {6.0, 7.0, 8.0}, + {9.0, 10.0, 11.0} } + +Slice(b, {2, 1}, {4, 3}) produces: + { { 7.0, 8.0}, + {10.0, 11.0} } +``` + +## DynamicSlice + +See also +[`ComputationBuilder::DynamicSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +DynamicSlice extracts a sub-array from the input array at dynamic +`start_indices`. The size of the slice in each dimension is passed in +`size_indices`, which specify the end point of exclusive slice intervals in each +dimension: [start, start + size). The shape of `start_indices` must be rank == +1, with dimension size equal to the rank of `operand`. +Note: handling of out-of-bounds slice indices (generated by incorrect runtime +calculation of 'start_indices') is currently implementation-defined. Currently, +slice indices are computed modulo input dimension sizes to prevent out-of-bound +array accesses, but this behavior may change in future implementations. + +<b> `DynamicSlice(operand, start_indices, size_indices)` </b> + +| Arguments | Type | Semantics | +| --------------- | ----------------------- | -------------------------------- | +| `operand` | `ComputationDataHandle` | N dimensional array of type T | +| `start_indices` | `ComputationDataHandle` | Rank 1 array of N integers | +: : : containing the starting indices : +: : : of the slice for each dimension. : +: : : Value must be greater than or : +: : : equal to zero. 
: +| `size_indices` | `ArraySlice<int64>` | List of N integers containing | +: : : the slice size for each : +: : : dimension. Each value must be : +: : : strictly greater than zero, and : +: : : start + size must be less than : +: : : or equal to the size of the : +: : : dimension to avoid wrapping : +: : : modulo dimension size. : + +1-dimensional example: + +``` +let a = {0.0, 1.0, 2.0, 3.0, 4.0} +let s = {2} + +DynamicSlice(a, s, {2}) produces: + {2.0, 3.0} +``` + +2-dimensional example: + +``` +let b = + { {0.0, 1.0, 2.0}, + {3.0, 4.0, 5.0}, + {6.0, 7.0, 8.0}, + {9.0, 10.0, 11.0} } +let s = {2, 1} + +DynamicSlice(b, s, {2, 2}) produces: + { { 7.0, 8.0}, + {10.0, 11.0} } +``` +## DynamicUpdateSlice + +See also +[`ComputationBuilder::DynamicUpdateSlice`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +DynamicUpdateSlice generates a result which is the value of the input array +`operand`, with a slice `update` overwritten at `start_indices`. +The shape of `update` determines the shape of the sub-array of the result which +is updated. +The shape of `start_indices` must be rank == 1, with dimension size equal to +the rank of `operand`. +Note: handling of out-of-bounds slice indices (generated by incorrect runtime +calculation of 'start_indices') is currently implementation-defined. Currently, +slice indices are computed modulo update dimension sizes to prevent out-of-bound +array accesses, but this behavior may change in future implementations. + +<b> `DynamicUpdateSlice(operand, update, start_indices)` </b> + +| Arguments | Type | Semantics | +| --------------- | ----------------------- | -------------------------------- | +| `operand` | `ComputationDataHandle` | N dimensional array of type T | +| `update` | `ComputationDataHandle` | N dimensional array of type T | +: : : containing the slice update. : +: : : Each dimension of update shape : +: : : must be strictly greater than : +: : : zero, and start + update must be : +: : : less than operand size for each : +: : : dimension to avoid generating : +: : : out-of-bounds update indices. : +| `start_indices` | `ComputationDataHandle` | Rank 1 array of N integers | +: : : containing the starting indices : +: : : of the slice for each dimension. : +: : : Value must be greater than or : +: : : equal to zero. : + +1-dimensional example: + +``` +let a = {0.0, 1.0, 2.0, 3.0, 4.0} +let u = {5.0, 6.0} +let s = {2} + +DynamicUpdateSlice(a, u, s) produces: + {0.0, 1.0, 5.0, 6.0, 4.0} +``` + +2-dimensional example: + +``` +let b = + { {0.0, 1.0, 2.0}, + {3.0, 4.0, 5.0}, + {6.0, 7.0, 8.0}, + {9.0, 10.0, 11.0} } +let u = + { {12.0, 13.0}, + {14.0, 15.0}, + {16.0, 17.0} } + +let s = {1, 1} + +DynamicUpdateSlice(b, u, s) produces: + { {0.0, 1.0, 2.0}, + {3.0, 12.0, 13.0}, + {6.0, 14.0, 15.0}, + {9.0, 16.0, 17.0} } +``` + +## Sort + +See also +[`ComputationBuilder::Sort`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h). + +Sorts the elements in the operand. + +<b>`Sort(operand)`</b> + +Arguments | Type | Semantics +--------- | ----------------------- | ------------------- +`operand` | `ComputationDataHandle` | The operand to sort + +## Transpose + +See also the [`Reshape`](#reshape) operation. + +<b>`Transpose(operand)`</b> + +Arguments | Type | Semantics +--------- | ----------------------- | ------------------------- +`operand` | `ComputationDataHandle` | The operand to transpose. +`permutation` | `ArraySlice<int64>` | How to permute the dimensions. 
+

Permutes the operand dimensions with the given permutation, so
`∀ i . 0 ≤ i < rank ⇒ input_dimensions[permutation[i]] = output_dimensions[i]`.

This is the same as `Reshape(operand, permutation,
Permute(permutation, operand.shape.dimensions))`.

## Tuple

See also
[`ComputationBuilder::Tuple`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).

A tuple containing a variable number of data handles, each of which has its own
shape.

This is analogous to `std::tuple` in C++. Conceptually:

```
let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
let s: s32 = 5;
let t: (f32[10], s32) = tuple(v, s);
```

Tuples can be deconstructed (accessed) via the
[`GetTupleElement`](#gettupleelement) operation.

## While

See also
[`ComputationBuilder::While`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).

<b> `While(condition, body, init)` </b>

| Arguments   | Type          | Semantics                                      |
| ----------- | ------------- | ---------------------------------------------- |
| `condition` | `Computation` | Computation of type `T -> PRED` which defines  |
:             :               : the termination condition of the loop.         :
| `body`      | `Computation` | Computation of type `T -> T` which defines the |
:             :               : body of the loop.                              :
| `init`      | `T`           | Initial value for the parameter of `condition` |
:             :               : and `body`.                                    :

Sequentially executes the `body` until the `condition` fails. This is similar to
a typical while loop in many other languages, except for the differences and
restrictions listed below.

*   A `While` node returns a value of type `T`, which is the result from the
    last execution of the `body`.
*   The shape of the type `T` is statically determined and must be the same
    across all iterations.
*   `While` nodes are not allowed to be nested. (This restriction may be lifted
    in the future on some targets.)

The `T` parameters of the computations are initialized with the `init` value in
the first iteration and are automatically updated to the new result from `body`
in each subsequent iteration.

One main use case of the `While` node is to implement the repeated execution of
training in neural networks. Simplified pseudocode is shown below with a graph
that represents the computation. The code can be found in
[`while_test.cc`](https://www.tensorflow.org/code/tensorflow/compiler/xla/tests/while_test.cc).
The type `T` in this example is a `Tuple` consisting of an `int32` for the
iteration count and a `vector[10]` for the accumulator. For 1000 iterations, the
loop keeps adding a constant vector to the accumulator.

```
// Pseudocode for the computation.
init = {0, zero_vector[10]} // Tuple of int32 and float[10].
result = init;
while (result(0) < 1000) {
  iteration = result(0) + 1;
  new_vector = result(1) + constant_vector[10];
  result = {iteration, new_vector};
}
```

<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="../../images/ops_while.png">
</div>

[^1]: Some obvious reductions like "add reduction" are not strictly associative
    for floats. However, if the range of the data is limited, floating-point
    addition is close enough to being associative for most practical uses. It
    is possible, though, to construct completely non-associative reductions,
    and these will produce wrong results in XLA reductions.
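
Returning to the `While` example above, an illustrative builder-level sketch of
the same loop (the authoritative version lives in `while_test.cc`; this sketch
assumes a `client_` handle, as in the earlier examples, and that
`ComputationBuilder::While` takes `(condition, body, init)` as documented)
could look like:

```
Shape state_shape = ShapeUtil::MakeTupleShape(
    {ShapeUtil::MakeShape(S32, {}), ShapeUtil::MakeShape(F32, {10})});

// condition: T -> PRED, keep looping while the iteration count is below 1000.
Computation condition;
{
  ComputationBuilder builder(client_, "condition");
  auto state = builder.Parameter(0, state_shape, "state");
  builder.Lt(builder.GetTupleElement(state, 0), builder.ConstantR0<int32>(1000));
  condition = builder.Build().ConsumeValueOrDie();
}

// body: T -> T, bump the counter and add a constant vector to the accumulator.
Computation body;
{
  ComputationBuilder builder(client_, "body");
  auto state = builder.Parameter(0, state_shape, "state");
  auto counter = builder.Add(builder.GetTupleElement(state, 0),
                             builder.ConstantR0<int32>(1));
  auto accum = builder.Add(builder.GetTupleElement(state, 1),
                           builder.ConstantR1<float>(std::vector<float>(10, 1.0f)));
  builder.Tuple({counter, accum});
  body = builder.Build().ConsumeValueOrDie();
}

ComputationBuilder builder(client_, "while_example");
auto init = builder.Tuple(
    {builder.ConstantR0<int32>(0),
     builder.ConstantR1<float>(std::vector<float>(10, 0.0f))});
builder.While(condition, body, init);
```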
diff --git a/tensorflow/g3doc/experimental/xla/shapes.md b/tensorflow/g3doc/experimental/xla/shapes.md new file mode 100644 index 0000000000..39e74ff307 --- /dev/null +++ b/tensorflow/g3doc/experimental/xla/shapes.md @@ -0,0 +1,150 @@ +# Shapes and Layout + +The XLA `Shape` proto +([xla_data.proto](https://www.tensorflow.org/code/tensorflow/compiler/xla/xla_data.proto)) +describes the rank, size, and data type of an N-dimensional array (*array* in +short). + +## Terminology, Notation, and Conventions + +* The rank of an array is equal to the number of dimensions. The *true rank* + of an array is the number of dimensions which have a size greater than 1. + +* Dimensions are numbered from `0` up to `N-1` for an `N` dimensional array. + The dimension numbers are arbitrary labels for convenience. The order of + these dimension numbers does not imply a particular minor/major ordering in + the layout of the shape. The layout is determined by the `Layout` proto. + +* By convention, dimensions are listed in increasing order of dimension + number. For example, for a 3-dimensional array of size `[A x B x C]`, + dimension 0 has size `A`, dimension 1 has size `B` and dimension 2 has size + `C`. + + Some utilities in XLA also support negative indexing, similarly to Python; + dimension -1 is the last dimension (equivalent to `N-1` for an `N` + dimensional array). For example, for the 3-dimensional array described + above, dimension -1 has size `C`, dimension -2 has size `B` and so on. + +* Two, three, and four dimensional arrays often have specific letters + associated with dimensions. For example, for a 2D array: + + * dimension 0: `y` + * dimension 1: `x` + + For a 3D array: + + * dimension 0: `z` + * dimension 1: `y` + * dimension 2: `x` + + For a 4D array: + + * dimension 0: `p` + * dimension 1: `z` + * dimension 2: `y` + * dimension 3: `x` + +* Functions in the XLA API which take dimensions do so in increasing order of + dimension number. This matches the ordering used when passing dimensions as + an `initializer_list`; e.g. + + `ShapeUtil::MakeShape(F32, {A, B, C, D})` + + Will create a shape whose dimension size array consists of the sequence + `[A, B, C, D]`. + +## Layout + +The `Layout` proto describes how an array is represented in memory. The `Layout` +proto includes the following fields: + +``` +message Layout { + repeated int64 minor_to_major = 1; + repeated int64 padded_dimensions = 2; + optional PaddingValue padding_value = 3; +} +``` + +### Minor-to-major dimension ordering + +The only required field is `minor_to_major`. This field describes the +minor-to-major ordering of the dimensions within a shape. Values in +`minor_to_major` are an ordering of the dimensions of the array (`0` to `N-1` +for an `N` dimensional array) with the first value being the most-minor +dimension up to the last value which is the most-major dimension. The most-minor +dimension is the dimension which changes most rapidly when stepping through the +elements of the array laid out in linear memory. + +For example, consider the following 2D array of size `[2 x 3]`: + +``` +a b c +d e f +``` + +Here dimension `0` is size 2, and dimension `1` is size 3. If the +`minor_to_major` field in the layout is `[0, 1]` then dimension `0` is the +most-minor dimension and dimension `1` is the most-major dimension. This +corresponds to the following layout in linear memory: + +``` +a d b e c f +``` + +This minor-to-major dimension order of `0` up to `N-1` is akin to *column-major* +(at rank 2). 
Assuming a monotonic ordering of dimensions, another name we may +use to refer to this layout in the code is simply "dim 0 is minor". + +On the other hand, if the `minor_to_major` field in the layout is `[1, 0]` then +the layout in linear memory is: + +``` +a b c d e f +``` + +A minor-to-major dimension order of `N-1` down to `0` for an `N` dimensional +array is akin to *row-major* (at rank 2). Assuming a monotonic ordering of +dimensions, another name we may use to refer to this layout in the code is +simply "dim 0 is major". + +#### Default minor-to-major ordering + +The default layout for newly created Shapes is "dimension order is +major-to-minor" (akin to row-major at rank 2). + +### Padding + +Padding is defined in the optional `padded_dimensions` and `padding_value` +fields. The field `padded_dimensions` describes the sizes (widths) to which each +dimension is padded. If present, the number of elements in `padded_dimensions` +must equal the rank of the shape. + +For example, given the `[2 x 3]` array defined above, if `padded_dimension` is +`[3, 5]` then dimension 0 is padded to a width of 3 and dimension 1 is padded to +a width of 5. The layout in linear memory (assuming a padding value of 0 and +column-major layout) is: + +``` +a d 0 b e 0 c f 0 0 0 0 0 0 0 +``` + +This is equivalent to the layout of the following array with the same +minor-to-major dimension order: + +``` +a b c 0 0 +d e f 0 0 +0 0 0 0 0 +``` + +### Indexing into arrays + +The class `IndexUtil` in +[index_util.h](https://www.tensorflow.org/code/tensorflow/compiler/xla/index_util.h) +provides utilities for converting between multidimensional indices and linear +indices given a shape and layout. Multidimensional indices include a `int64` +index for each dimension. Linear indices are a single `int64` value which +indexes into the buffer holding the array. See `shape_util.h` and +`layout_util.h` in the same directory for utilities that simplify creation and +manipulation of shapes and layouts. diff --git a/tensorflow/g3doc/experimental/xla/tfcompile.md b/tensorflow/g3doc/experimental/xla/tfcompile.md new file mode 100644 index 0000000000..c99a337a10 --- /dev/null +++ b/tensorflow/g3doc/experimental/xla/tfcompile.md @@ -0,0 +1,288 @@ +# Using AOT compilation + +## What is tfcompile? + +`tfcompile` is a standalone tool that ahead-of-time (AOT) compiles TensorFlow +graphs into executable code. It can reduce total binary size, and also avoid +some runtime overheads. A typical use-case of `tfcompile` is to compile an +inference graph into executable code for mobile devices. + +The TensorFlow graph is normally executed by the TensorFlow runtime. This incurs +some runtime overhead for execution of each node in the graph. This also leads +to a larger total binary size, since the code for the TensorFlow runtime needs +to be available, in addition to the graph itself. The executable code produced +by `tfcompile` does not use the TensorFlow runtime, and only has dependencies on +kernels that are actually used in the computation. + +The compiler is built on top of the XLA framework. The code bridging TensorFlow +to the XLA framework resides under +[tensorflow/compiler](https://www.tensorflow.org/code/tensorflow/compiler/), +which also includes support for [just-in-time (JIT) compilation](jit.md) of +TensorFlow graphs. + +## What does tfcompile do? 
+ +`tfcompile` takes a subgraph, identified by the TensorFlow concepts of +[`feeds`](../../get_started/basic_usage.md#feeds) and +[`fetches`](../../get_started/basic_usage.md#fetches), and generates a function +that implements that subgraph. The feeds are the input arguments for the +function, and the fetches are the output arguments for the function. All inputs +must be fully specified by the feeds; the resulting pruned subgraph cannot +contain Placeholder or Variable nodes. It is common to specify all Placeholders +and Variables as feeds, which ensures the resulting subgraph no longer contains +these nodes. The generated function is packaged as a `cc_library`, with a header +file exporting the function signature, and an object file containing the +implementation. The user writes code to invoke the generated function as +appropriate. + +## Using tfcompile + +This section details high level steps for generating an executable binary with +`tfcompile` from a TensorFlow subgraph. The steps are: + +* Step 1: Configure the subgraph to compile +* Step 2: Use the `tf_library` build macro to compile the subgraph +* Step 3: Write code to invoke the subgraph +* Step 4: Create the final binary + +### Step 1: Configure the subgraph to compile + +Identify the [`feeds`](../../get_started/basic_usage.md#feeds) and +[`fetches`](../../get_started/basic_usage.md#fetches) of the graph, which +correspond to the input and output arguments for the generated function. Then +configure the feeds and fetches in a +[`tensorflow.tfcompile.Config`](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile.proto) +proto. + +```textproto +# Each feed is a positional input argument for the generated function. The order +# of each entry matches the order of each input argument. Here “x_hold” and “y_hold” +# refer to the names of placeholder nodes defined in the graph. +feed { + id { node_name: "x_hold" } + shape { + dim { size: 2 } + dim { size: 3 } + } +} +feed { + id { node_name: "y_hold" } + shape { + dim { size: 3 } + dim { size: 2 } + } +} + +# Each fetch is a positional output argument for the generated function. The order +# of each entry matches the order of each output argument. Here “x_y_prod” +# refers to the name of a matmul node defined in the graph. +fetch { + id { node_name: "x_y_prod" } +} +``` + +### Step 2: Use tf_library build macro to compile the subgraph + +This step converts the graph into a `cc_library` using the `tf_library` build +macro. The `cc_library` consists of an object file containing the code generated +from the graph, along with a header file that gives access to the generated +code. `tf_library` utilizes `tfcompile` to compile the TensorFlow graph into +executable code. + +```build +load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library") + +# Use the tf_library macro to compile your graph into executable code. +tf_library( + # name is used to generate the following underlying build rules: + # <name> : cc_library packaging the generated header and object files + # <name>_test : cc_test containing a simple test and benchmark + # <name>_benchmark : cc_binary containing a stand-alone benchmark with minimal deps; + # can be run on a mobile device + name = "test_graph_tfmatmul", + # cpp_class specifies the name of the generated C++ class, with namespaces allowed. + # The class will be generated in the given namespace(s), or if no namespaces are + # given, within the global namespace. 
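    # For example, "foo::bar::MatMulComp" below produces a class named
    # MatMulComp inside namespace foo::bar; step 3 includes it by that name.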
+ cpp_class = "foo::bar::MatMulComp", + # graph is the input GraphDef proto, by default expected in binary format. To + # use the text format instead, just use the ‘.pbtxt’ suffix. A subgraph will be + # created from this input graph, with feeds as inputs and fetches as outputs. + # No Placeholder or Variable ops may exist in this subgraph. + graph = "test_graph_tfmatmul.pb", + # config is the input Config proto, by default expected in binary format. To + # use the text format instead, use the ‘.pbtxt’ suffix. This is where the + # feeds and fetches were specified above, in the previous step. + config = "test_graph_tfmatmul.config.pbtxt", +) +``` + +> To generate the GraphDef proto (test_graph_tfmatmul.pb) for this example, run +> [make_test_graphs.py]("https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/make_test_graphs.py") +> and specify the output location with the --out_dir flag. + +Typical graphs contain [`Variables`](../../api_docs/python/state_ops.md) +representing the weights that are learned via training, but `tfcompile` cannot +compile a subgraph that contain `Variables`. The +[freeze_graph.py](https://www.tensorflow.org/code/tensorflow/python/tools/freeze_graph.py) +tool converts variables into constants, using values stored in a checkpoint +file. As a convenience, the `tf_library` macro supports the `freeze_checkpoint` +argument, which runs the tool. For more examples see +[tensorflow/compiler/aot/tests/BUILD](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/BUILD). + +> Constants that show up in the compiled subgraph are compiled directly into the +> generated code. To pass the constants into the generated function, rather than +> having them compiled-in, simply pass them in as feeds. + +For details on the `tf_library` build macro, see +[tfcompile.bzl](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile.bzl). + +For details on the underlying `tfcompile` tool, see +[tfcompile_main.cc](https://www.tensorflow.org/code/tensorflow/compiler/aot/tfcompile_main.cc). + +### Step 3: Write code to invoke the subgraph + +This step uses the header file (`test_graph_tfmatmul.h`) generated by the +`tf_library` build macro in the previous step to invoke the generated code. The +header file is located in the `bazel-genfiles` directory corresponding to the +build package, and is named based on the name attribute set on the `tf_library` +build macro. For example, the header generated for `test_graph_tfmatmul` would +be `test_graph_tfmatmul.h`. Below is an abbreviated version of what is +generated. The generated file, in `bazel-genfiles`, contains additional useful +comments. + +```c++ +namespace foo { +namespace bar { + +// MatMulComp represents a computation previously specified in a +// TensorFlow graph, now compiled into executable code. +class MatMulComp { + public: + // AllocMode controls the buffer allocation mode. + enum class AllocMode { + ARGS_RESULTS_AND_TEMPS, // Allocate arg, result and temp buffers + RESULTS_AND_TEMPS_ONLY, // Only allocate result and temp buffers + }; + + MatMulComp(AllocMode mode = AllocMode::ARGS_RESULTS_AND_TEMPS); + ~MatMulComp(); + + // Runs the computation, with inputs read from arg buffers, and outputs + // written to result buffers. Returns true on success and false on failure. + bool Run(); + + // Arg methods for managing input buffers. Buffers are in row-major order. + // There is a set of methods for each positional argument. 
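  // In this example, arg0 corresponds to the "x_hold" feed (a 2x3 float
  // buffer) and arg1 to the "y_hold" feed (a 3x2 float buffer).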
+ void** args(); + + void set_arg0_data(float* data); + float* arg0_data(); + float& arg0(size_t dim0, size_t dim1); + + void set_arg1_data(float* data); + float* arg1_data(); + float& arg1(size_t dim0, size_t dim1); + + // Result methods for managing output buffers. Buffers are in row-major order. + // Must only be called after a successful Run call. There is a set of methods + // for each positional result. + void** results(); + + + float* result0_data(); + float& result0(size_t dim0, size_t dim1); +}; + +} // end namespace bar +} // end namespace foo +``` + +The generated C++ class is called `MatMulComp` in the `foo::bar` namespace, +because that was the `cpp_class` specified in the `tf_library` macro. All +generated classes have a similar API, with the only difference being the methods +to handle arg and result buffers. Those methods differ based on the number and +types of the buffers, which were specified by the `feed` and `fetch` arguments +to the `tf_library` macro. + +There are three types of buffers managed within the generated class: `args` +representing the inputs, `results` representing the outputs, and `temps` +representing temporary buffers used internally to perform the computation. By +default, each instance of the generated class allocates and manages all of these +buffers for you. The `AllocMode` constructor argument may be used to change this +behavior. A convenience library is provided in +[`tensorflow/compiler/aot/runtime.h`](https://www.tensorflow.org/code/tensorflow/compiler/aot/runtime.h) +to help with manual buffer allocation; usage of this library is optional. All +buffers should be aligned to 32-byte boundaries. + +The generated C++ class is just a wrapper around the low-level code generated by +XLA. + +Example of invoking the generated function based on +[`tfcompile_test.cc`](https://www.tensorflow.org/code/tensorflow/compiler/aot/tests/tfcompile_test.cc): + +```c++ +#define EIGEN_USE_THREADS +#define EIGEN_USE_CUSTOM_THREAD_POOL + +#include <iostream> +#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor" +#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h" // generated + +int main(int argc, char** argv) { + Eigen::ThreadPool tp(2); // Size the thread pool as appropriate. + Eigen::ThreadPoolDevice device(&tp, tp.NumThreads()); + + + foo::bar::MatMulComp matmul; + matmul.set_thread_pool(&device); + + // Set up args and run the computation. + const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}; + std::copy(args + 0, args + 6, matmul.arg0_data()); + std::copy(args + 6, args + 12, matmul.arg1_data()); + matmul.Run(); + + // Check result + if (matmul.result0(0, 0) == 58) { + std::cout << "Success" << std::endl; + } else { + std::cout << "Failed. Expected value 58 at 0,0. Got:" + << matmul.result0(0, 0) << std::endl; + } + + return 0; +} +``` + +### Step 4: Create the final binary + +This step combines the library generated by `tf_library` in step 2 and the code +written in step 3 to create a final binary. Below is an example `bazel` BUILD +file. + +```build +# Example of linking your binary +# Also see //third_party/tensorflow/compiler/aot/tests/BUILD +load("//third_party/tensorflow/compiler/aot:tfcompile.bzl", "tf_library") + +# The same tf_library call from step 2 above. +tf_library( + name = "test_graph_tfmatmul", + ... +) + +# The executable code generated by tf_library can then be linked into your code. 
+cc_binary( + name = "my_binary", + srcs = [ + "my_code.cc", # include test_graph_tfmatmul.h to access the generated header + ], + deps = [ + ":test_graph_tfmatmul", # link in the generated object file + "//third_party/eigen3", + ], + linkopts = [ + "-lpthread", + ] +) +``` diff --git a/tensorflow/g3doc/resources/xla_prerelease.md b/tensorflow/g3doc/resources/xla_prerelease.md deleted file mode 100644 index 1a6cf1d5a4..0000000000 --- a/tensorflow/g3doc/resources/xla_prerelease.md +++ /dev/null @@ -1,2223 +0,0 @@ -# XLA: The TensorFlow compiler framework - -This document describes a compiler framework for linear algebra called *XLA* -that will be released as part of TensorFlow. Most users of TensorFlow will not -invoke XLA directly, but will benefit from it through improvements in speed, -memory usage, and portability. - -We are providing this preview for parties who are interested in details of -TensorFlow compilation and may want to provide feedback. We will provide more -documentation with the code release. - -[TOC] - -## Compiling TensorFlow - -The XLA compilation framework is invoked on subgraphs of TensorFlow -computations. The framework requires all tensor shapes to be fixed, so compiled -code is specialized to concrete shapes. This means, for example, that the -compiler may be invoked multiple times for the same subgraph if it is executed -on batches of different sizes. We had several goals in mind when designing the -TensorFlow compilation strategy: - -* *Improved execution speed.* Compiling subgraphs reduces the execution time - of short-lived Ops by eliminating overhead from the TensorFlow runtime. The - framework also fuses pipelined operations, reducing memory - overheads. Specializing to known tensor shapes improves performance by - allowing more aggressive constant propagation. -* *Improved tensor buffer memory usage.* The compiler framework has an - opportunity to analyze and schedule memory usage, in principle eliminating - many intermediate storage buffers. -* *Reduce reliance on custom Ops.* Many TensorFlow custom Ops are equivalent - to subgraphs of existing lower-level Ops. By focusing on the preceding two - goals we aim as far as possible to make the performance of low-level Ops be - the same as that of hand-written fused implementations, removing the need - for many custom Ops. -* *Much improved mobile footprint.* When the compiled subgraph is an entire - TensorFlow computation, it is possible to eliminate the TensorFlow runtime - altogether and simply emit an object/header file pair that can be linked - directly into another application. This is particularly useful for mobile - inference, and can reduce the footprint of a TensorFlow computation by - several orders of magnitude. -* *Improved portability.* The compiler-based framework is designed to target - different back-end hardware, including a variety of CPUs, GPUs, and custom - accelerator hardware such as TPUs. The CPU and GPU back-ends currently use - LLVM, while the internal Google TPU back-end (which will not be open-sourced - at this time) uses custom code generation. The goal for this and other - accelerators is that it should be relatively easy to write a new back-end - for novel hardware, at which point a large fraction of TensorFlow programs - will run unmodified on that hardware. This is in contrast with the approach - of specializing individual monolithic Ops for new hardware, which requires - TensorFlow programs to be rewritten to make use of those Ops. 
- -## XLA: Accelerated Linear Algebra - -XLA is a domain-specific compiler for linear algebra. The semantics of -operations are high level, e.g., arbitrary sized vector and matrix -operations. This makes the compiler easy to target from TensorFlow, and -preserves enough information to allow sophisticated scheduling and optimization. -The following tutorial provides introductory information about XLA. More details -follow in the [Operation Semantics](#operation_semantics) section. - -It is important to note that the XLA framework is not set in stone. In -particular, while it is unlikely that the semantics of existing operations will -be changed, it is expected that more operations will be added as necessary to -cover important use cases, and we welcome feedback from the community about -missing functionality. - -### Getting started - basic example - -The following code sample shows how to use XLA to compute a simple vector -expression: $$\alpha x+y$$ ("axpy"). - -This sample presents a self-contained function - `ComputeAxpyParameters`, that -takes data as input, uses XLA to build a graph to compute the expression and -returns the resulting data. - -This is done in several steps: - -1. Construct an XLA graph that encodes the expression we want to compute. - The graph's nodes are XLA operations (sometimes called "ops" or HLOs for - "high-level operations"), and its edges represent the data flow between - operations. -2. Ask XLA to create a "computation" based on this graph. This process - JIT-compiles the graph into optimized native code for the chosen platform - and returns a handle. -3. Use the computation handle and the input data to calculate the result. - -The XLA graph we construct for axpy is: -<div style="text-align:center"> -![axpy params graph](../images/xla-axpy-tutorial-params.svg) -</div> - -Note that all operations have predefined [shapes](#shapes_and_layout). A shape -describes the rank of the array, the size of each dimension and the primitive -element type. For example, `f32[10]` is a rank-1 array of single-precision -floats. `f32[]` is a single-precision float scalar. - -In XLA, shapes are statically determined, including the size of each -dimension in an array. This permits the XLA compiler to produce very -efficient code for all backends. When constructing the graph, only the shapes of -input nodes (parameters or constants) have to be provided explicitly - the rest -is automatically inferred by XLA; therefore, the burden on the developer is -minimal. - -Here is the part of the axpy sample code that constructs the graph (step 1): - -```c++ -std::unique_ptr<xla::Literal> ComputeAxpyParameters( - const xla::Literal& alpha, const xla::Literal& x, - const xla::Literal& y) { - // Get the singleton handle for an XLA client library and create a new - // computation builder. - xla::Client* client(xla::ClientLibrary::ClientLibraryOrDie()); - xla::ComputationBuilder builder(client, "axpy"); - - // Build the actual XLA computation graph. It's a function taking - // three parameters and computing a single output. - auto param_alpha = builder.Parameter(0, alpha.shape(), "alpha"); - auto param_x = builder.Parameter(1, x.shape(), "x"); - auto param_y = builder.Parameter(2, y.shape(), "y"); - auto axpy = builder.Add(builder.Mul(param_alpha, param_x), param_y); -``` - -XLA features a client-server design. `xla::ClientLibrary` provides a -simple way to instantiate an XLA server in the backend and connect to it with -an `xla::Client` object. 
- -The `ComputationBuilder` class provides a convenient programming interface to -construct XLA computations. The semantics of XLA operations with links -to `ComputationBuilder` methods are documented in -[Operation Semantics](#operation_semantics). - -Here is the part that JIT-compiles the graph (step 2): - -```c++ - // We're done building the graph. Create a computation on the server. - util::StatusOr<std::unique_ptr<xla::Computation>> computation_status = - builder.Build(); - std::unique_ptr<xla::Computation> computation = - computation_status.ConsumeValueOrDie(); -``` - -Here is the part that runs the compiled code on the input (step 3): - -```c++ - // Transfer the parameters to the server and get data handles that refer to - // them. - std::unique_ptr<xla::GlobalData> alpha_data = - client->TransferToServer(alpha).ConsumeValueOrDie(); - std::unique_ptr<xla::GlobalData> x_data = - client->TransferToServer(x).ConsumeValueOrDie(); - std::unique_ptr<xla::GlobalData> y_data = - client->TransferToServer(y).ConsumeValueOrDie(); - - // Now we have all we need to execute the computation on the device. We get - // the result back in the form of a Literal. - util::StatusOr<std::unique_ptr<xla::Literal>> result_status = - client->ExecuteAndTransfer( - *computation, {alpha_data.get(), x_data.get(), y_data.get()}); - return result_status.ConsumeValueOrDie(); -} -``` - -There is one thing noticeably absent from the above code: no specification of -the device to use. The choice of device is orthogonal to the computation -specified and can be selected by choosing the appropriate service plugin. - -### Moving data into and out of XLA - -The main way to move data into and out of XLA is by populating -`xla::Literal` objects. This enables maximal generality for the XLA -client-server model of computation. When the service is running in the same -process as the client, the `xla::Client::TransferInProcess` method may be -used to transfer arrays to and from the service more efficiently. - -### Constants vs. parameters - -For the simple axpy computation we've seen earlier, we can construct an -alternative XLA graph: -<div style="text-align:center"> -![axpy constants graph](../images/xla-axpy-tutorial-constants.svg) -</div> - -The code to construct and run this computation is: - -```c++ -std::unique_ptr<xla::Literal> ComputeAxpyConstants( - float alpha, gtl::ArraySlice<float> x, - gtl::ArraySlice<float> y) { - // Get the singleton handle for an XLA client library and create a new - // computation builder. - xla::Client* client(xla::ClientLibrary::ClientLibraryOrDie()); - xla::ComputationBuilder builder(client, "axpy"); - - auto constant_alpha = builder.ConstantR0<float>(alpha); - auto constant_x = builder.ConstantR1<float>(x); - auto constant_y = builder.ConstantR1<float>(y); - auto axpy = builder.Add(builder.Mul(constant_alpha, constant_x), constant_y); - - // We're done building the graph. Tell the server to create a Computation from - // it, and then execute this computation on the device, transferring the - // result back as a literal. - util::StatusOr<std::unique_ptr<xla::Computation>> computation_status = - builder.Build(); - std::unique_ptr<xla::Computation> computation = - computation_status.ConsumeValueOrDie(); - // No need to pass arguments into the computation since it accepts no - // parameters. 
- util::StatusOr<std::unique_ptr<xla::Literal>> result_status = - client->ExecuteAndTransfer(*computation, {}); - return result_status.ConsumeValueOrDie(); -} -``` - -This computation has no user-provided inputs - the inputs are constants that are -embedded into the graph itself. It highlights an important design tradeoff that -should be considered when using XLA. - -XLA is a JIT compiler. An XLA graph is created during the runtime of the host -program, and JIT-compiled to native code for the desired backend(s). This -compilation may take a non-trivial amount of time, which presents a tradeoff. - -Many uses will want to compile a single graph and then run it repeatedly with -different inputs. This is what `parameter` ops are most suitable for. Re-running -the computation with different data doesn't require recompiling the graph. -Sometimes, however, some of the inputs may be constant (or at least constant -throughout some subset of the host program's runtime). In those cases, it makes -sense to create an XLA graph where these inputs are `constant` ops instead of -parameters. This will permit the XLA compiler to perform constant folding -and other advanced optimizations that may result in significantly more efficient -code. On the other hand, this means a computation needs to be recompiled every -time the "constant" value actually needs to change. - -## Shapes and Layout - -The XLA `Shape` proto describes the rank, size, and data type of an -N-dimensional array (*array* in short). - -### Terminology, Notation, and Conventions - -* The rank of an array is equal to the number of dimensions. The *true rank* - of an array is the number of dimensions which have a size greater than 1. - -* Dimensions are numbered from `0` up to `N-1` for an `N` dimensional array. - The dimensions numbers are simply convenient labels. The order of these - dimension numbers does not imply a particular minor/major ordering in the - layout of the shape. The layout is determined by the `Layout` proto. - -* By convention, dimensions are listed in increasing order of dimension - number. For example, for a 3-dimensional array of size `[A x B x C]`, - dimension 0 has size `A`, dimension 1 has size `B` and dimension 2 has size - `C`. - -* Two, three, and four dimensional arrays often have specific letters - associated with dimensions. For example, for a 2D array: - - * dimension 0: `y` - * dimension 1: `x` - - For a 3D array: - - * dimension 0: `z` - * dimension 1: `y` - * dimension 2: `x` - - For a 4D array: - - * dimension 0: `p` - * dimension 1: `z` - * dimension 2: `y` - * dimension 3: `x` - -* Functions in the XLA API which take dimensions do so in increasing order - of dimension number. This matches the ordering used when passing dimensions - as an `initializer_list`; e.g. - - `ShapeUtil::MakeShape(F32, {A, B, C, D})` - - Will create a shape whose dimension array consists of the sequence `[A, B, - C, D]`. - -### Layout - -The `Layout` proto describes how an array is represented in memory. The `Layout` -proto includes the following fields: - -``` -message Layout { - repeated int64 minor_to_major = 1; - repeated int64 padded_dimensions = 2; - optional PaddingValue padding_value = 3; -} -``` - -#### Minor-to-major dimension ordering - -The only required field is `minor_to_major`. This field describes the -minor-to-major ordering of the dimensions within a shape. 
Values in -`minor_to_major` are an ordering of the dimensions of the array (`0` to `N-1` -for an `N` dimensional array) with the first value being the most-minor -dimension up to the last value which is the most-major dimension. The most-minor -dimension is the dimension which changes most rapidly when stepping through the -elements of the array laid out in linear memory. - -For example, consider the following 2D array of size `[2 x 3]`: - -``` -a b c -d e f -``` - -Here dimension `0` is size 2, and dimension `1` is size 3. If the -`minor_to_major` field in the layout is `[0, 1]` then dimension `0` is the -most-minor dimension and dimension `1` is the most-major dimension. This -corresponds to the following layout in linear memory: - -``` -a d b e c f -``` - -This minor-to-major dimension order of `0` up to `N-1` is akin to *column-major* -(at rank 2). Assuming a monotonic ordering of dimensions, another name we may -use to refer to this layout in the code is simply "dim 0 is minor". - -On the other hand, if the `minor_to_major` field in the layout is `[1, 0]` then -the layout in linear memory is: - -``` -a b c d e f -``` - -A minor-to-major dimension order of `N-1` down to `0` for an `N` dimensional -array is akin to *row-major* (at rank 2). Assuming a monotonic ordering of -dimensions, another name we may use to refer to this layout in the code is -simply "dim 0 is major". - -#### Padding - -Padding is defined in the optional `padded_dimensions` and `padding_value` -fields. The field `padded_dimensions` describes the sizes (widths) to which each -dimension is padded. If present, the number of elements in `padded_dimensions` -must equal the rank of the shape. - -For example, given the `[2 x 3]` array defined above, if `padded_dimension` is -`[3, 5]` then dimension 0 is padded to a width of 3 and dimension 1 is padded to -a width of 5. The layout in linear memory (assuming a padding value of 0 and -column-major layout) is: - -``` -a d 0 b e 0 c f 0 0 0 0 0 0 0 -``` - -This is equivalent to the layout of the following array with the same -minor-to-major dimension order: - -``` -a b c 0 0 -d e f 0 0 -0 0 0 0 0 -``` - -## Operation Semantics - -The following describes the semantics of operations defined in the -`ComputationBuilder` interface. - -A note on nomenclature: the generalized data type XLA deals with is an -N-dimensional array holding elements of some uniform type (such as 32-bit -float). Throughout the documentation, we use *array* to denote an -arbitrary-dimensional array. For convenience, special cases have more specific -and familiar names; for example a *vector* is a 1-dimensional array and a -*matrix* is a 2-dimensional array. - -### Broadcast - -Adds dimensions to an array by duplicating the data in the array. - -<b> `Broadcast(operand, broadcast_sizes)` </b> - -Arguments | Type | Semantics ------------------ | ----------------------- | ------------------------------- -`operand` | `ComputationDataHandle` | The array to duplicate -`broadcast_sizes` | `ArraySlice<int64>` | The sizes of the new dimensions - -The new dimensions are inserted on the left, i.e. if `broadcast_sizes` has -values `{a0, ..., aN}` and the operand shape has dimensions `{b0, ..., bM}` then -the shape of the output has dimensions `{a0, ..., aN, b0, ..., bM}`. - -The new dimensions index into copies of the operand, i.e. 
- -``` -output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM] -``` - -For example, if `operand` is a scalar `f32` with value `2.0f`, and -`broadcast_sizes` is `{2, 3}`, then the result will be an array with shape -`f32[2, 3]` and all the values in the result will be `2.0f`. - -### Collapse - -See also `ComputationBuilder::Collapse` and the [`Reshape`](#reshape) operation. - -Collapses dimensions of an array into one dimension. - -<b> `Collapse(operand, dimensions)` </b> - -| Arguments | Type | Semantics | -| ------------ | ----------------------- | ----------------------------------- | -| `operand` | `ComputationDataHandle` | array of type T | -| `dimensions` | `int64` vector | in-order, consecutive subset of T's | -: : : dimensions. : - -Collapse replaces the given subset of the operand's dimensions by a single -dimension. The input arguments are an arbitrary array of type T and a -compile-time-constant vector of dimension indices. The dimension indices must be -an in-order (low to high dimension numbers), consecutive subset of T's -dimensions. Thus, {0, 1, 2}, {0, 1}, or {1, 2} are all valid dimension sets, but -{1, 0} or {0, 2} are not. They are replaced by a single new dimension, in the -same position in the dimension sequence as those they replace, with the new -dimension size equal to the product of original dimension sizes. The lowest -dimension number in `dimensions` is the slowest varying dimension (most major) -in the loop nest which collapses these dimension, and the highest dimension -number is fastest varying (most minor). See the [`Reshape`](#reshape) operator -if more general collapse ordering is needed. - -For example, let v be an array of 24 elements: - -``` -let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}}, - {{20, 21, 22}, {25, 26, 27}}, - {{30, 31, 32}, {35, 36, 37}}, - {{40, 41, 42}, {45, 46, 47}}}; - -// Collapse to a single dimension, leaving one dimension. -let v012 = Collapse(v, {0,1,2}); -then v012 == f32[24] {10, 11, 12, 15, 16, 17, - 20, 21, 22, 25, 26, 27, - 30, 31, 32, 35, 36, 37, - 40, 41, 42, 45, 46, 47}; - -// Collapse the two lower dimensions, leaving two dimensions. -let v01 = Collapse(v, {0,1}); -then v01 == f32[4x6] {{10, 11, 12, 15, 16, 17}, - {20, 21, 22, 25, 26, 27}, - {30, 31, 32, 35, 36, 37}, - {40, 41, 42, 45, 46, 47}}; - -// Collapse the two higher dimensions, leaving two dimensions. -let v12 = Collapse(v, {1,2}); -then v12 == f32[8x3] {{10, 11, 12}, - {15, 16, 17}, - {20, 21, 22}, - {25, 26, 27}, - {30, 31, 32}, - {35, 36, 37}, - {40, 41, 42}, - {45, 46, 47}}; - -``` - -### Concatenate - -See also `ComputationBuilder::ConcatInDim` - -Concatenate composes an array from multiple array operands. The array is of the -same rank as each of the input array operands (which must be of the same rank as -each other) and contains the arguments in the order that they were specified. - -<b> `Concatenate(operands..., dimension)` </b> - -| Arguments | Type | Semantics | -| ----------- | ----------------------- | ------------------------------------ | -| `operands` | sequence of N | N arrays of type T with dimensions | -: : `ComputationDataHandle` : [L0, L1, ...] : -| `dimension` | `int64` | A value in the interval `[0, N)` | -: : : that names the dimension to be : -: : : concatenated between the `operands`. : - -With the exception of `dimension` all dimensions must be the same. This is -because XLA does not support "ragged" arrays -- the dimension which is being -concatenated must be the only one that differs between the operands. 
Also note -that rank-0 values cannot be concatenated (as it's impossible to name the -dimension along which the concatenation occurs). - -1-dimensional example: - -``` -Concat({{2, 3}, {4, 5}, {6, 7}}, 0) ->>> {2, 3, 4, 5, 6, 7} -``` - -2-dimensional example: - -``` -let a = { - {1, 2}, - {3, 4}, - {5, 6}, -}; -let b = { - {7, 8}, -}; -Concat({a, b}, 0) ->>> { - {1, 2}, - {3, 4}, - {5, 6}, - {7, 8}, -} -``` - -Diagram: - -<center><iframe src="../images/xla-concatenate.svg" width="920" height="400" title="Concatenate Diagram" frameborder="0"></iframe></center> - -### ConvertElementType - -See `ComputationBuilder::ConvertElementType` - -Similar to an element-wise `static_cast` in C++, performs an element-wise -conversion operation from a data shape to a target shape. The dimensions must -match, and the conversion is an element-wise one; e.g. `s32` elements become -`f32` elements via an `s32`-to-`f32` conversion routine. - -<b> `ConvertElementType(operand, new_element_type)` </b> - -Arguments | Type | Semantics ------------------- | ----------------------- | --------------------------- -`operand` | `ComputationDataHandle` | array of type T with dims D -`new_element_type` | `PrimitiveType` | type U - -If the dimensions of the operand and the target shape do not match, or an -invalid conversion is requested (e.g. to/from a tuple) an error will be -produced. - -A conversion such as `T=s32` to `U=f32` will perform a normalizing int-to-float -conversion routine such as round-to-nearest-even. - -> Note: The precise float-to-int and visa-versa conversions are currently -> unspecified, but may become additional arguments to the convert operation in -> the future. - -``` -let a: s32[3] = {0, 1, 2}; -let b: f32[3] = convert(a, f32); -then b == f32[3]{0.0, 1.0, 2.0} -``` - -### Conv (convolution) - -See `ComputationBuilder::Conv` - -As ConvWithGeneralPadding, but the padding is specified in a short-hand way as -either SAME or VALID. SAME padding pads the input (`lhs`) with zeroes so that -the output has the same shape as the input when not taking striding into -account. VALID padding simply means no padding. - -### ConvWithGeneralPadding (convolution) - -See `ComputationBuilder::ConvWithGeneralPadding` - -Computes a convolution of the kind used in neural networks. Here, a convolution -can be thought of as a 2d window moving across a 2d base area and a computation -is performed for each possible position of the window. - -| Arguments | Type | Semantics | -| ---------------- | ----------------------- | ------------------------------- | -| `lhs` | `ComputationDataHandle` | rank-4 array of inputs | -| `rhs` | `ComputationDataHandle` | rank-4 array of kernel weights | -| `window_strides` | `ArraySlice<int64>` | 2d array of kernel strides | -| `padding` | `ArraySlice<pair<int64, | 2d array of (low, high) padding | -: : int64>>` : : - -The `lhs` argument is a rank 4 array describing the base area. We will call this -the input, even though of course the rhs is also an input. In a neural network, -these are the input activations. The 4 dimensions are, in this order: - -* `batch`: Each coordinate in this dimension represents an independent input - for which convolution is carried out. -* `z/depth/features`: Each (y,x) position in the base area has a vector - associated to it, which goes into this dimension. -* `y` and `x`: Describes the two spatial dimensions that define the 2d base - area that the window moves across. - -The `rhs` argument is a rank 4 array describing the convolutional -filter/kernel/window. 
The dimensions are, in this order: - -* `output-z`: The `z` dimension of the output. -* `input-z`: The size of this dimension should equal the size of the `z` - dimension in lhs. -* `y` and `x`: Describes the two spatial dimensions that define the 2d window - that moves across the base area. - -The `window_strides` argument specifies the stride of the convolutional window -in the `y` and `x` dimensions. For example, if the stride in dimension `y` is 3, -then the window can only be placed at coordinates where the `y` index is -divisible by 3. - -The `padding` argument specifies the amount of zero padding to be applied to the -base area. `padding[0]` specifies the padding for dimension `y` and `padding[1]` -specifies the padding for dimension `x`. Each pair has the low padding as the -first element and the high padding as the second element. The low padding is -applied in the direction of lower indices while the high padding is applied in -the direction of higher indices. For example, if `padding[1]` is `(2,3)` then -there will be a padding by 2 zeroes on the left and by 3 zeroes on the right in -the `x` dimension. Using padding is equivalent to inserting those same zero -values into the input (`lhs`) before doing the convolution. - -The output shape has these dimensions, in this order: - -* `batch`: Same size as `batch` on the input (`lhs`). -* `z`: Same size as `output-z` on the kernel (`rhs`). -* `y` and `x`: One value for each valid placement of the convolutional window. - -The valid placements of the convolutional window are determined by the strides -and the size of the base area after padding. - -To describe what a convolution does, pick some fixed `batch`, `z`, `y`, `x` -coordinates in the output. Then `(y,x)` is a position of a corner of the window -within the base area (e.g. the upper left corner, depending on how you interpret -the spatial dimensions). We now have a 2d window, taken from the base area, -where each 2d point is associated to a 1d vector, so we get a 3d box. From the -convolutional kernel, since we fixed the output coordinate `z`, we also have a -3d box. The two boxes have the same dimensions, so we can take the sum of the -element-wise products between the two boxes (similar to a dot product). That is -the output value. - -Note that if `output-z` is e.g. 5, then each position of the window produces 5 -values in the output into the `z` dimension of the output. These values differ -in what part of the convolutional kernel is used - there is a separate 3d box of -values used for each `output-z` coordinate. So you could think of it as 5 -separate convolutions with a different filter for each of them. 
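-
-As a concrete illustration of these arguments, here is a minimal sketch of
-requesting such a convolution through `ComputationBuilder`. The shapes, sizes,
-and variable names below are purely illustrative, and the exact overload may
-differ:
-
-```c++
-// Hypothetical shapes: a batch of 1 input with 8 features over a 16x16 area,
-// and a kernel producing 32 output features from a 2x2 window.
-xla::ComputationBuilder builder(client, "conv_example");
-auto input = builder.Parameter(
-    0, ShapeUtil::MakeShape(F32, {1, 8, 16, 16}), "input");  // batch, z, y, x
-auto kernel = builder.Parameter(
-    1, ShapeUtil::MakeShape(F32, {32, 8, 2, 2}), "kernel");  // output-z, input-z, y, x
-auto conv = builder.ConvWithGeneralPadding(
-    input, kernel,
-    /*window_strides=*/{2, 2},        // place the window every 2 elements
-    /*padding=*/{{1, 1}, {1, 1}});    // (low, high) zero padding for y and x
-```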
-
-Here is pseudo-code for a convolution with padding and striding:
-
-```
-for (b, oz, oy, ox) {  // output coordinates
-  value = 0;
-  for (iz, ky, kx) {  // kernel coordinates and input z
-    iy = oy*stride_y + ky - pad_low_y;
-    ix = ox*stride_x + kx - pad_low_x;
-    if ((iy, ix) inside the base area considered without padding) {
-      value += input(b, iz, iy, ix) * kernel(oz, iz, ky, kx);
-    }
-  }
-  output(b, oz, oy, ox) = value;
-}
-```
-
-### Dot
-
-See also `ComputationBuilder::Dot`
-
-<b> `Dot(lhs, rhs)` </b>
-
-Arguments | Type                    | Semantics
---------- | ----------------------- | ---------------
-`lhs`     | `ComputationDataHandle` | array of type T
-`rhs`     | `ComputationDataHandle` | array of type T
-
-The exact semantics of this operation depend on the ranks of the operands:
-
-| Input                   | Output                | Semantics               |
-| ----------------------- | --------------------- | ----------------------- |
-| scalar `dot` scalar     | scalar                | scalar multiplication   |
-| vector [n] `dot` vector | scalar                | vector dot product      |
-: [n]                     :                       :                         :
-| matrix [m x k] `dot`    | vector [m]            | matrix-vector           |
-: vector [k]              :                       : multiplication          :
-| matrix [m x k] `dot`    | matrix [m x n]        | matrix-matrix           |
-: matrix [k x n]          :                       : multiplication          :
-| array [p x q x r] `dot` | array [p x q x s x t] | array dot product (read |
-: array [s x r x t]       :                       : below)                  :
-
-The operation performs a sum of products over the last dimension of `lhs` and
-the second-to-last dimension of `rhs` (or the last dimension of `rhs` when
-`rhs` is a scalar or a vector). These are the "contracted" dimensions. The
-contracted dimensions of `lhs` and `rhs` must be of the same size.
-
-The rank of the result array is `max(rank(lhs) - 1, 0) + max(rank(rhs) - 1, 0)`.
-The result dimensions are ordered in the original order within each operand,
-with the `lhs` dimensions followed by the `rhs` dimensions, excluding the
-contracted dimensions. For example, a dot product of two arrays `[p x q x r]`
-and `[s x r x t]` produces a 4-dimensional array of `[p x q x s x t]` by
-contracting the dimension of size `r`.
-
-Notes:
-
-1.  This follows the typical definition of a dot operator, as in other numeric
-    libraries such as [numpy](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html).
-2.  There is currently no support for the general tensor dot operator
-    [numpy.tensordot](http://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot).
-
-### Element-wise binary arithmetic operations
-
-See also `ComputationBuilder::Add`
-
-A set of element-wise binary arithmetic operations is supported.
-
-<b> `Op(lhs, rhs)` </b>
-
-Where `Op` is one of `Add` (addition), `Sub` (subtraction), `Mul`
-(multiplication), `Div` (division), `Rem` (remainder), `Max` (maximum), `Min`
-(minimum).
-
-Arguments | Type                    | Semantics
---------- | ----------------------- | ----------------------------------------
-`lhs`     | `ComputationDataHandle` | left-hand-side operand: array of type T
-`rhs`     | `ComputationDataHandle` | right-hand-side operand: array of type T
-
-The arguments' shapes have to be either similar or compatible. See the
-[broadcasting](#broadcasting_semantics) documentation about what it means for
-shapes to be compatible. The result of an operation has a shape which is the
-result of broadcasting the two input arrays. In this variant, operations between
-arrays of different ranks are *not* supported, unless one of the operands is a
-scalar.
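-
-For example, a minimal sketch (with hypothetical variable names, reusing a
-builder constructed as in the earlier examples) of element-wise operations
-between two arrays of identical shape `f32[2, 3]`:
-
-```c++
-// Both operands have shape f32[2, 3], so no broadcasting is involved.
-auto lhs = builder.ConstantR2<float>({{1, 2, 3}, {4, 5, 6}});
-auto rhs = builder.ConstantR2<float>({{10, 20, 30}, {40, 50, 60}});
-auto sum = builder.Add(lhs, rhs);   // f32[2, 3] {{11, 22, 33}, {44, 55, 66}}
-auto prod = builder.Mul(lhs, rhs);  // f32[2, 3] {{10, 40, 90}, {160, 250, 360}}
-```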
-
-When `Op` is `Rem`, the sign of the result is taken from the dividend.
-
-An alternative variant with different-rank broadcasting support exists for these
-operations:
-
-<b> `Op(lhs, rhs, broadcast_dimensions)` </b>
-
-Where `Op` is the same as above. This variant of the operation should be used
-for arithmetic operations between arrays of different ranks (such as adding a
-matrix to a vector).
-
-The additional `broadcast_dimensions` operand is a slice of integers used to
-expand the rank of the lower-rank operand up to the rank of the higher-rank
-operand. `broadcast_dimensions` maps the dimensions of the lower-rank shape to
-the dimensions of the higher-rank shape. The unmapped dimensions of the expanded
-shape are filled with dimensions of size one. Degenerate-dimension broadcasting
-then broadcasts the shapes along these degenerate dimensions to equalize the
-shapes of both operands. The semantics are described in detail in the
-[broadcasting](#broadcasting_semantics) documentation.
-
-### Element-wise comparison operations
-
-See also `ComputationBuilder::Eq`
-
-A set of standard element-wise binary comparison operations is supported. Note
-that standard IEEE 754 floating-point comparison semantics apply when comparing
-floating-point types.
-
-<b> `Op(lhs, rhs)` </b>
-
-Where `Op` is one of `Eq` (equal-to), `Ne` (not equal-to), `Ge`
-(greater-or-equal-than), `Gt` (greater-than), `Le` (less-or-equal-than), `Lt`
-(less-than).
-
-Arguments | Type                    | Semantics
---------- | ----------------------- | ----------------------------------------
-`lhs`     | `ComputationDataHandle` | left-hand-side operand: array of type T
-`rhs`     | `ComputationDataHandle` | right-hand-side operand: array of type T
-
-The arguments' shapes have to be either similar or compatible. See the
-[broadcasting](#broadcasting_semantics) documentation about what it means for
-shapes to be compatible. The result of an operation has a shape which is the
-result of broadcasting the two input arrays with the element type `PRED`. In
-this variant, operations between arrays of different ranks are *not* supported,
-unless one of the operands is a scalar.
-
-An alternative variant with different-rank broadcasting support exists for these
-operations:
-
-<b> `Op(lhs, rhs, broadcast_dimensions)` </b>
-
-Where `Op` is the same as above. This variant of the operation should be used
-for comparison operations between arrays of different ranks (such as comparing
-a matrix to a vector).
-
-The additional `broadcast_dimensions` operand is a slice of integers specifying
-the dimensions to use for broadcasting the operands. The semantics are described
-in detail in the [broadcasting](#broadcasting_semantics) documentation.
-
-### Element-wise unary functions
-
-ComputationBuilder supports these element-wise unary functions:
-
-<b>`Exp(operand)`</b> Element-wise natural exponential `x -> e^x`.
-
-<b>`Log(operand)`</b> Element-wise natural logarithm `x -> ln(x)`.
-
-<b>`Neg(operand)`</b> Element-wise negation `x -> -x`.
-
-<b>`Floor(operand)`</b> Element-wise floor `x -> ⌊x⌋`.
-
-<b>`Ceil(operand)`</b> Element-wise ceil `x -> ⌈x⌉`.
-
-<b>`Tanh(operand)`</b> Element-wise hyperbolic tangent `x -> tanh(x)`.
-
-Arguments | Type                    | Semantics
---------- | ----------------------- | ---------------------------
-`operand` | `ComputationDataHandle` | The operand to the function
-
-The function is applied to each element in the `operand` array, resulting in an
-array with the same shape. It is allowed for `operand` to be a scalar (rank 0).
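-
-For example, a minimal sketch (with hypothetical variable names, reusing a
-builder constructed as in the earlier examples) applying a few of these unary
-functions to a rank-1 constant:
-
-```c++
-auto x = builder.ConstantR1<float>({-1.5f, 0.0f, 2.5f});
-auto negated = builder.Neg(x);    // {1.5, -0.0, -2.5}
-auto floored = builder.Floor(x);  // {-2.0, 0.0, 2.0}
-auto squashed = builder.Tanh(x);  // {tanh(-1.5), tanh(0.0), tanh(2.5)}
-```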
- -### GetTupleElement - -See also `ComputationBuilder::GetTupleElement` - -Indexes into a tuple with a compile-time-constant value. - -The value must be a compile-time-constant so that shape inference can determine -the type of the resulting value. - -This is analogous to `std::get<int N>(t)` in C++. Conceptually: - -``` -let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; -let s: s32 = 5; -let t: (f32[10], s32) = tuple(v, s); -let element_1: s32 = gettupleelement(t, 1); // Inferred shape matches s32. -``` - -See also [`Tuple`](#tuple). - -### Infeed - -See also `ComputationBuilder::Infeed` - -<b> `Infeed(shape)` </b> - -| Argument | Type | Semantics | -| -------- | ------- | ----------------------------------------------------- | -| `shape` | `Shape` | Shape of the data read from the Infeed interface. The | -: : : layout field of the shape must be set to match the : -: : : layout of the data sent to the device; otherwise its : -: : : behavior is undefined. : - -Devices have an abstraction for feeding data to long-running computations, e.g., -feeding inputs to be consumed within the body of a [`While`](#while) loop. -`Infeed` reads a single data item from the implicit Infeed streaming interface -of the device, interpreting the data as the given shape and its layout, and -returns a `ComputationDataHandle` of the data. Multiple Infeed operations are -allowed in a computation, but there must be a total order among the Infeed -operations. For example, two Infeeds in the code below have a total order since -there is a dependency between the while loops. The compiler issues an error if -there isn't a total order. - -``` -result1 = while (condition, init = init_value) { - Infeed(shape) -} - -result2 = while (condition, init = result1) { - Infeed(shape) -} -``` - -Nested tuple shapes are not supported. For an empty tuple shape, the Infeed -operation is effectively a nop and proceeds without reading any data from the -Infeed of the device. - -> Note: We plan to allow multiple Infeed operations without a total order, in -> which case the compiler will provide information about how the Infeed -> operations are serialized in the compiled program. - -### Map - -See also `ComputationBuilder::Map` - -<b> `Map(operands..., computation)` </b> - -| Arguments | Type | Semantics | -| ----------------- | ------------------------ | ----------------------------- | -| `operands` | sequence of N | N arrays of type T | -: : `ComputationDataHandle`s : : -| `computation` | `Computation` | computation of type `T_0, | -: : : T_1, ..., T_{N + M -1}` -> S` : -: : : with N parameters of type T : -: : : and M of arbitrary type : -| `static_operands` | sequence of M | M arrays of arbitrary type | -: : `ComputationDataHandle`s : : - -Applies a scalar function over the given `operands` arrays, producing an array -of the same dimensions where each element is the result of the mapped function -applied to the corresponding elements in the input arrays with `static_operands` -given as additional input to `computation`. - -The mapped function is an arbitrary computation with the restriction that it has -N inputs of scalar type `T` and a single output with type `S`. The output has -the same dimensions as the operands except that the element type T is replaced -with S. - -For example: `Map(op1, op2, op3, computation, par1)` maps `elem_out <- -computation(elem1, elem2, elem3, par1)` at each (multi-dimensional) index in the -input arrays to produce the output array. 
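-
-As an illustration, here is a minimal sketch (with hypothetical names; the
-exact `Map` overload may differ) that maps a scalar "add" computation over two
-rank-1 operands, following the same sub-builder pattern used for the
-`ReduceWindow` example later in this document:
-
-```c++
-// Build a scalar computation of type (F32, F32) -> F32.
-std::unique_ptr<Computation> add;
-{
-  ComputationBuilder add_builder(client, "add");
-  auto x = add_builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x");
-  auto y = add_builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "y");
-  add_builder.Add(x, y);
-  add = add_builder.Build().ConsumeValueOrDie();
-}
-
-// Map the scalar computation over two arrays of the same shape.
-ComputationBuilder builder(client, "map_example");
-auto a = builder.ConstantR1<float>({1, 2, 3});
-auto b = builder.ConstantR1<float>({10, 20, 30});
-auto mapped = builder.Map({a, b}, *add);  // f32[3] {11, 22, 33}
-```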
- -### Pad - -See also `ComputationBuilder::Pad` - -<b> `Pad(operand, padding_value, padding_config)` </b> - -| Arguments | Type | Semantics | -| ---------------- | ----------------------- | ----------------------------- | -| `operand` | `ComputationDataHandle` | array of type `T` | -| `padding_value` | `ComputationDataHandle` | scalar of type `T` to fill in | -: : : the added padding : -| `padding_config` | `PaddingConfig` | padding amount on both edges | -: : : (low, high) and between the : -: : : elements of each dimension : - -Expands the given `operand` array by padding around the array as well as between -the elements of the array with the given `padding_value`. `padding_config` -specifies the amount of edge padding and the interior padding for each -dimension. - -`PaddingConfig` is a repeated field of `PaddingConfigDimension`, which contains -three fields for each dimension: `edge_padding_low`, `edge_padding_high`, and -`interior_padding`. `edge_padding_low` and `edge_padding_high` specifies the -amount of padding added at the low-end (next to index 0) and the high-end (next -to the highest index) of each dimension respectively. `interior_padding` -specifies the amount of padding added between any two elements in each -dimension. This operation is a no-op if the edge padding pairs are all (0, 0) -and the interior padding values are all 0. Figure below shows examples of -different `edge_padding` and `interior_padding` values for a two dimensional -array. - -<center><iframe src="../images/xla-pad.svg" width="920" height="600" title="Pad Diagram" frameborder="0"></iframe></center> - -### Reduce - -See also `ComputationBuilder::Reduce` - -Applies a reduction function to an array. - -<b> `Reduce(operand, init_value, computation, dimensions)` </b> - -| Arguments | Type | Semantics | -| ------------- | ----------------------- | -------------------------------- | -| `operand` | `ComputationDataHandle` | array of type `T` | -| `init_value` | `ComputationDataHandle` | scalar of type `T` | -| `computation` | `Computation` | computation of type `T, T -> T` | -| `dimensions` | `int64` array | unordered array of dimensions to | -: : : reduce : - -Conceptually, this operation reduces one or more dimensions in the input array -into scalars. The rank of the result array is `rank(operand) - len(dimensions)`. -`init_value` is the initial value used for every reduction and may also be -inserted anywhere during computation if the back-end chooses to do so. So in -most cases `init_value` should be an identity of the reduction function (for -example, 0 for addition). - -The evaluation order of the reduction function across the reduction dimensions -is arbitrary and may be non-deterministic. Therefore, the reduction function -should not be overly sensitive to reassociation[^1]. - -As an example, when reducing across the one dimension in a 1D array with values -[10, 11, 12, 13], with reduction function `f` (this is `computation`) then that -could be computed as - -`f(10, f(11, f(12, f(init_value, 13)))` - -but there are also many other possibilities, e.g. - -`f(init_value, f(f(10, f(init_value, 11)), f(f(init_value, 12), f(13, -init_value))))` - -The following is a rough pseudo-code example of how reduction could be -implemented, using summation as the reduction computation with an initial value -of 0. - -```python -result_shape <- remove all dims in dimensions from operand_shape - -# Iterate over all elements in result_shape. 
The number of r's here is equal -# to the rank of the result -for r0 in range(result_shape[0]), r1 in range(result_shape[1]), ...: - # Initialize this result element - result[r0, r1...] <- 0 - - # Iterate over all the reduction dimensions - for d0 in range(dimensions[0]), d1 in range(dimensions[1]), ...: - # Increment the result element with the value of the operand's element. - # The index of the operand's element is constructed from all ri's and di's - # in the right order (by construction ri's and di's together index over the - # whole operand shape). - result[r0, r1...] += operand[ri... di] -``` - -Here's an example of reducing a 2D array (matrix). The shape has rank 2, -dimension 0 of size 2 and dimension 1 of size 3: - -<div/> -<div style="text-align:center"> -![2d array](../images/xla-2d-matrix.svg) -</div> - -Results of reducing dimensions 0 or 1 with an "add" function: - -<div/> -<div style="text-align:center"> -![reducing 2d array](../images/xla-reduce-from-2d.svg) -</div> - -Note that both reduction results are 1D arrays. The diagram shows one as column -and another as row just for visual convenience. - -For a more complex example, here is a 3D array. Its rank is 3, dimension 0 of -size 4, dimension 1 of size 2 and dimension 2 of size 3. For simplicity, the -values 1 to 6 are replicated across dimension 0. - -<div/> -<div style="text-align:center"> -![3d array](../images/xla-reduce-from-3d.svg) -</div> - -Similarly to the 2D example, we can reduce just one dimension. If we reduce -dimension 0, for example, we get a rank-2 array where all values across -dimension 0 were folded into a scalar: - -```text -| 4 8 12 | -| 4 8 12 | -``` - -If we reduce dimension 2, we also get a rank-2 array where all values across -dimension 2 were folded into a scalar: - -```text -| 6 15 | -| 6 15 | -| 6 15 | -| 6 15 | -``` - -Note that the relative order between the remaining dimensions in the input is -preserved in the output, but some dimensions may get assigned new numbers (since -the rank changes). - -We can also reduce multiple dimensions. Add-reducing dimensions 0 and 1 produces -the 1D array `| 20 28 36 |`. - -Reducing the 3D array over all its dimensions produces the scalar `84`. - -### ReduceWindow - -See also `ComputationBuilder::ReduceWindow` - -Applies a reduction function to all elements in each window of the input -multi-dimensional array, producing an output multi-dimensional array with the -same number of elements as the number of valid positions of the window. A -pooling layer can be expressed as a `ReduceWindow`. - -<b> `ReduceWindow(operand, computation, window, init_value)` </b> - -| Arguments | Type | Semantics | -| ------------------- | ----------------------- | ---------------------------- | -| `operand` | `ComputationDataHandle` | N dimensional array | -: : : containing elements of type : -: : : T. This is the base area on : -: : : which the window is placed. : -| `init_value` | `ComputationDataHandle` | Starting value for the | -: : : reduction. See [Reduce] : -: : : (#reduce) for details. 
:
-| `computation`       | `Computation`           | Reduction function of type   |
-:                     :                         : `T, T -> T`, to apply to all :
-:                     :                         : elements in each window      :
-| `window_dimensions` | `ArraySlice<int64>`     | array of integers for window |
-:                     :                         : dimension values             :
-| `window_strides`    | `ArraySlice<int64>`     | array of integers for window |
-:                     :                         : stride values                :
-| `padding`           | `Padding`               | padding type for window      |
-:                     :                         : (Padding\:\:kSame or         :
-:                     :                         : Padding\:\:kValid)           :
-
-The code and figure below show an example of using `ReduceWindow`. The input is
-a matrix of size [4x6] and both window_dimensions and window_stride_dimensions
-are [2x3].
-
-```
-// Create a computation for the reduction (maximum).
-std::unique_ptr<Computation> max;
-{
-  ComputationBuilder builder(client_, "max");
-  auto y = builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "y");
-  auto x = builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "x");
-  builder.Max(y, x);
-  max = builder.Build().ConsumeValueOrDie();
-}
-
-// Create a ReduceWindow computation with the max reduction computation.
-ComputationBuilder builder(client_, "reduce_window_2x3");
-auto shape = ShapeUtil::MakeShape(F32, {4, 6});
-auto input = builder.Parameter(0, shape, "input");
-builder.ReduceWindow(
-    input, *max,
-    /*init_val=*/builder.ConstantR0<float>(std::numeric_limits<float>::lowest()),
-    /*window_dimensions=*/{2, 3},
-    /*window_stride_dimensions=*/{2, 3},
-    Padding::kValid);
-```
-
-<center><iframe src="../images/xla-reduce-window.svg" width="920" height="300" title="ReduceWindow Diagram" frameborder="0"></iframe></center>
-
-Stride of 1 in a dimension specifies that the position of a window in the
-dimension is 1 element away from its adjacent window. In order to specify that
-no windows overlap with each other, window_stride_dimensions should be equal to
-window_dimensions. The figure below illustrates the use of two different stride
-values. Padding is applied to each dimension of the input and the calculations
-are the same as though the input came in with the dimensions it has after
-padding.
-
-<center><iframe src="../images/xla-reduce-window-stride.svg" width="920" height="300" title="ReduceWindow Stride Diagram" frameborder="0"></iframe></center>
-
-### Reshape
-
-See also `ComputationBuilder::Reshape` and the [`Collapse`](#collapse)
-operation.
-
-Reshapes the dimensions of an array into a new configuration.
-
-<b> `Reshape(operand, dimensions, new_sizes)` </b>
-
-Arguments    | Type                    | Semantics
------------- | ----------------------- | ---------------------------------------
-`operand`    | `ComputationDataHandle` | array of type T
-`dimensions` | `int64` vector          | order in which dimensions are collapsed
-`new_sizes`  | `int64` vector          | vector of sizes of new dimensions
-
-Conceptually, reshape first flattens an array into a one-dimensional vector of
-data values, and then refines this vector into a new shape. The input arguments
-are an arbitrary array of type T, a compile-time-constant vector of dimension
-indices, and a compile-time-constant vector of dimension sizes for the result.
-The values in the `dimensions` vector must be a permutation of all of T's
-dimensions. The order of the dimensions in `dimensions` is from slowest-varying
-dimension (most major) to fastest-varying dimension (most minor) in the loop
-nest which collapses the input array into a single dimension. The `new_sizes`
-vector determines the size of the output array. The value at index 0 in
-`new_sizes` is the size of dimension 0, the value at index 1 is the size of
-dimension 1, and so on.
The product of the `new_size` dimensions must equal the -product of the operand's dimension sizes. When refining the collapsed array into -the multidimensional array defined by `new_sizes`, the dimensions in `new_sizes` -are ordered from slowest varying (most major) and to fastest varying (most -minor). - -For example, let v be an array of 24 elements: - -``` -let v = f32[4x2x3] {{{10, 11, 12}, {15, 16, 17}}, - {{20, 21, 22}, {25, 26, 27}}, - {{30, 31, 32}, {35, 36, 37}}, - {{40, 41, 42}, {45, 46, 47}}}; - -In-order collapse: -let v012_24 = Reshape(v, {0,1,2}, {24}); -then v012_24 == f32[24] {10, 11, 12, 15, 16, 17, - 20, 21, 22, 25, 26, 27, - 30, 31, 32, 35, 36, 37, - 40, 41, 42, 45, 46, 47}; - -let v012_83 = Reshape(v, {0,1,2}, {8,3}); -then v012_83 == f32[8x3] {{10, 11, 12}, {15, 16, 17}, - {20, 21, 22}, {25, 26, 27}, - {30, 31, 32}, {35, 36, 37}, - {40, 41, 42}, {45, 46, 47}}; - -Out-of-order collapse: -let v021_24 = Reshape(v, {1,2,0}, {24}); -then v012_24 == f32[24] {10, 11, 12, 20, 21, 22, - 30, 31, 32, 40, 41, 42, - 15, 16, 17, 25, 26, 27, - 35, 36, 37, 45, 46, 47}; - -let v021_83 = Reshape(v, {1,2,0}, {8,3}); -then v021_83 == f32[8x3] {{10, 11, 12}, {20, 21, 22}, - {30, 31, 32}, {40, 41, 42}, - {15, 16, 17}, {25, 26, 27}, - {35, 36, 37}, {45, 46, 47}}; - -let v021_262 = Reshape(v, {1,2,0}, {2,6,2}); -then v021_262 == f32[2x6x2] {{{10, 11}, {12, 20}, {21, 22}, - {30, 31}, {32, 40}, {41, 42}}, - {{15, 16}, {17, 25}, {26, 27}, - {35, 36}, {37, 45}, {46, 47}}}; - -``` - -As a special case, reshape can transform a single-element array to a scalar and -vice versa. For example, `Reshape(f32[1x1] {{5}}, {0,1}, {}) == 5; Reshape(5, -{}, {1,1}) == f32[1x1] {{5}};` - -### Rev (reverse) - -See `ComputationBuilder::Rev` - -<b>`Rev(operand, dimensions)`</b> - -Arguments | Type | Semantics ------------- | ----------------------- | --------------------- -`operand` | `ComputationDataHandle` | array of type T -`dimensions` | `ArraySlice<int64>` | dimensions to reverse - -Reverses the order of elements in the `operand` array along the specified -`dimensions`, generating an output array of the same shape. Each element of the -operand array at a multidimensional index is stored into the output array at a -transformed index. The multidimensional index is transformed by reversing the -index in each dimension to be reversed (i.e., if a dimension of size N is one of -the reversing dimensions, its index i is transformed into N - 1 - i). - -One use for the `Rev` operation is to reverse the convolution weight array along -the two window dimensions during the gradient computation in neural networks. - -### RngBernoulli - -See also `ComputationBuilder::RngBernoulli` - -Constructs an output of a given shape with random numbers generated following -the Bernoulli distribution. The parameter needs to be a scalar valued F32 -operand while the output shape needs to have elemental type U32. - -<b>`RngBernoulli(mean, shape)`</b> - -| Arguments | Type | Semantics | -| --------- | ----------------------- | ------------------------------------- | -| `mean` | `ComputationDataHandle` | Scalar of type F32 specifying mean of | -: : : generated numbers : -| `shape` | `Shape` | Output shape of type U32 | - -### RngNormal - -See also `ComputationBuilder::RngNormal` - -Constructs an output of a given shape with random numbers generated following -the $$N(\mu, \sigma)$$ normal distribution. The parameters `mu` and `sigma`, and -output shape have to have elemental type F32. The parameters furthermore have to -be scalar valued. 
- -<b>`RngNormal(mean, sigma, shape)`</b> - -| Arguments | Type | Semantics | -| --------- | ----------------------- | -------------------------------------- | -| `mu` | `ComputationDataHandle` | Scalar of type F32 specifying mean of | -: : : generated numbers : -| `sigma` | `ComputationDataHandle` | Scalar of type F32 specifying standard | -: : : deviation of generated numbers : -| `shape` | `Shape` | Output shape of type F32 | - -### RngUniform - -See also `ComputationBuilder::RngUniform` - -Constructs an output of a given shape with random numbers generated following -the uniform distribution over the interval $$[a,b]$$. The parameters and output -shape may be either F32, S32 or U32, but the types have to be consistent. -Furthermore, the parameters need to be scalar valued. - -<b>`RngUniform(a, b, shape)`</b> - -| Arguments | Type | Semantics | -| --------- | ----------------------- | --------------------------------- | -| `a` | `ComputationDataHandle` | Scalar of type T specifying lower | -: : : limit of interval : -| `b` | `ComputationDataHandle` | Scalar of type T specifying upper | -: : : limit of interval : -| `shape` | `Shape` | Output shape of type T | - -### SelectAndScatter - -See also `ComputationBuilder::SelectAndScatter` - -This operation can be considered as a composite operation that first computes -`ReduceWindow` on the `operand` array to select an element from each window, and -then scatters the `source` array to the indices of the selected elements to -construct an output array with the same shape as the operand array. The binary -`select` function is used to select an element from each window by applying it -across each window, and it is called with the property that the first -parameter's index vector is lexicographically less than the second parameter's -index vector. The `select` function returns `true` if the first parameter is -selected and returns `false` if the second parameter is selected, and the -function must hold transitivity (i.e., if `select(a, b)` and `select(b, c)` are -`true`, then `select(a, c)` is also `true`) so that the selected element does -not depend on the order of the elements traversed for a given window. - -The function `scatter` is applied at each selected index in the output array. It -takes two scalar parameters: - -1. Current value at the selected index in the output array -2. The scatter value from `source` that applies to the selected index - -It combines the two parameters and returns a scalar value that's used to update -the value at the selected index in the output array. Initially, all indices of -the output array are set to `init_value`. - -The output array has the same shape as the `operand` array and the `source` -array must have the same shape as the result of applying a `ReduceWindow` -operation on the `operand` array. `SelectAndScatter` can be used to -backpropagate the gradient values for a pooling layer in a neural network. 
- -<b>`SelectAndScatter(operand, select, window_dimensions, window_strides, -padding, source, init_value, scatter)`</b> - -| Arguments | Type | Semantics | -| ------------------- | ----------------------- | ---------------------------- | -| `operand` | `ComputationDataHandle` | array of type T over which | -: : : the windows slide : -| `select` | `Computation` | binary computation of type | -: : : `T, T -> PRED`, to apply to : -: : : all elements in each window; : -: : : returns `true` if the first : -: : : parameter is selected and : -: : : returns `false` if the : -: : : second parameter is selected : -| `window_dimensions` | `ArraySlice<int64>` | array of integers for window | -: : : dimension values : -| `window_strides` | `ArraySlice<int64>` | array of integers for window | -: : : stride values : -| `padding` | `Padding` | padding type for window | -: : : (Padding\:\:kSame or : -: : : Padding\:\:kValid) : -| `source` | `ComputationDataHandle` | array of type T with the | -: : : values to scatter : -| `init_value` | `ComputationDataHandle` | scalar value of type T for | -: : : the inital value of the : -: : : output array : -| `scatter` | `Computation` | binary computation of type | -: : : `T, T -> T`, to apply each : -: : : scatter source element with : -: : : its destination element : - -The figure below shows examples of using `SelectAndScatter`, with the `select` -function computing the maximal value among its parameters. Note that when the -windows overlap, as in the figure (2) below, an index of the `operand` array may -be selected multiple times by different windows. In the figure, the element of -value 9 is selected by both of the top windows (blue and red) and the binary -addition `scatter` function produces the output element of value 8 (2 + 6). - -<center><iframe src="../images/xla-scatter-to-selected-window-element.svg" width="1000" height="700" title="SelectAndScatter Diagram" frameborder="0"></iframe></center> - -### Select - -See also `ComputationBuilder::Select` - -Constructs an output array from elements of two input arrays, based on the -values of a predicate array. - -<b> `Select(pred, on_true, on_false)` </b> - -Arguments | Type | Semantics ----------- | ----------------------- | ------------------ -`pred` | `ComputationDataHandle` | array of type PRED -`on_true` | `ComputationDataHandle` | array of type T -`on_false` | `ComputationDataHandle` | array of type T - -The arrays `on_true` and `on_false` must have the same shape. This is also the -shape of the output array. The array `pred` must have the same dimensionality as -`on_true` and `on_false`, with the `PRED` element type. - -For each element `P` of `pred`, the corresponding element of the output array is -taken from `on_true` if the value of `P` is `true`, and from `on_false` if the -value of `P` is `false`. As a restricted form of -[broadcasting](#broadcasting_semantics), `pred` can be a scalar of type `PRED`. -In this case, the output array is taken wholly from `on_true` if `pred` is -`true`, and from `on_false` if `pred` is `false`. - -Example with non-scalar `pred`: - -``` -let pred: PRED[4] = {true, false, false, true}; -let v1: s32[4] = {1, 2, 3, 4}; -let v2: s32[4] = {100, 200, 300, 400}; -==> -Select(pred, v1, v2) = s32[4]{1, 200, 300, 4}; -``` - -Example with scalar `pred`: - -``` -let pred: PRED = true; -let v1: s32[4] = {1, 2, 3, 4}; -let v2: s32[4] = {100, 200, 300, 400}; -==> -Select(pred, v1, v2) = s32[4]{1, 2, 3, 4}; -``` - -Selections between tuples are supported. 
Tuples are considered to be scalar -types for this purpose. If `on_true` and `on_false` are tuples (which must have -the same shape!) then `pred` has to be a scalar of type `PRED`. - -### Slice - -See also `ComputationBuilder::Slice` - -Slicing extracts a sub-array from the input array. The sub-array is of the same -rank as the input and contains the values inside a bounding box within the input -array where the dimensions and indices of the bounding box are given as -arguments to the slice operation. - -<b> `Slice(operand, start_indices, limit_indices)` </b> - -| Arguments | Type | Semantics | -| --------------- | ----------------------- | -------------------------------- | -| `operand` | `ComputationDataHandle` | N dimensional array of type T | -| `start_indices` | `ArraySlice<int64>` | List of N integers containing | -: : : the starting indices of the : -: : : slice for each dimension. Values : -: : : must be greater than or equal to : -: : : zero. : -| `limit_indices` | `ArraySlice<int64>` | List of N integers containing | -: : : the ending indices (exclusive) : -: : : for the slice for each : -: : : dimension. Each value must be : -: : : strictly greater than the : -: : : respective `start_indices` value : -: : : for the dimension and less than : -: : : or equal to the size of the : -: : : dimension. : - -1-dimensional example: - -``` -let a = {0.0, 1.0, 2.0, 3.0, 4.0} -Slice(a, {2}, {4}) produces: - {2.0, 3.0} -``` - -2-dimensional example: - -``` -let b = - { {0.0, 1.0, 2.0}, - {3.0, 4.0, 5.0}, - {6.0, 7.0, 8.0}, - {9.0, 10.0, 11.0} } - -Slice(b, {2, 1}, {4, 3}) produces: - { { 7.0, 8.0}, - {10.0, 11.0} } -``` - -### Sort - -See `ComputationBuilder::Sort` - -Sorts the elements in the operand. - -<b>`Sort(operand)`</b> - -Arguments | Type | Semantics ---------- | ----------------------- | ------------------- -`operand` | `ComputationDataHandle` | The operand to sort - -### Trans (transpose) - -See also the [`Reshape`](#reshape) operation. - -<b>`Trans(operand)`</b> - -Arguments | Type | Semantics ---------- | ----------------------- | ------------------------- -`operand` | `ComputationDataHandle` | The operand to transpose. - -Returns the transpose of `operand`. `operand` must have rank 2. - -This is the same as Reshape(operand, {1, 0}, {operand.shape.dimensions[1], -operand.shape,dimensions[0]}). - -### Tuple - -See also `ComputationBuilder::Tuple` - -A tuple containing a variable number of data handles, each of which has its own -shape. - -This is analogous to `std::tuple` in C++. Conceptually: - -``` -let v: f32[10] = f32[10]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; -let s: s32 = 5; -let t: (f32[10], s32) = tuple(v, s); -``` - -Tuples can be deconstructed (accessed) via the -[`GetTupleElement`](#gettupleelement) operation. - -### While - -See also `ComputationBuilder::While` - -<b> `While(condition, body, init)` </b> - -| Arguments | Type | Semantics | -| ----------- | ------------- | ---------------------------------------------- | -| `condition` | `Computation` | Computation of type `T -> PRED` which defines | -: : : the termination condition of the loop. : -| `body` | `Computation` | Computation of type `T -> T` which defines the | -: : : body of the loop. : -| `init` | `T` | Initial value for the parameter of `condition` | -: : : and `body`. : - -Sequentially executes the `body` until the `condition` fails. This is similar to -a typical while loop in many other languages except for the differences and -restrictions listed below. 
- -* A `While` node returns a value of type `T`, which is the result from the - last execution of the `body`. -* The shape of the type `T` is statically determined and must be the same - across all iterations. -* `While` nodes are not allowed to be nested. (This restriction may be lifted - in the future on some targets.) - -The T parameters of the computations are initialized with the `init` value in -the first iteration and are automatically updated to the new result from `body` -in each subsequent iteration. - -One main use case of the `While` node is to implement the repeated execution of -training in neural networks. Simplified pseudocode is shown below with an graph -that represents the computation. The type `T` in this example is a `Tuple` -consisting of an `int32` for the iteration count and a `vector[10]` for the -accumulator. For 1000 iterations, the loop keeps adding a constant vector to the -accumulator. - -``` -// Pseudocode for the computation. -init = {0, zero_vector[10]} // Tuple of int32 and float[10]. -result = init; -while (result(0) < 1000) { - iteration = result(0) + 1; - new_vector = result(1) + constant_vector[10]; - result = {iteration, new_vector}; -} -``` - -## Broadcasting semantics - -This section describes how the broadcasting semantics in XLA work. - -### What is broadcasting - -Broadcasting may be required for operations between multi-dimensional arrays of -different ranks, or between multi-dimensional arrays with different but -compatible shapes. Consider the addition `X+v` where `X` is a matrix (an array -of rank 2) and `v` is a vector (an array of rank 1). To perform element-wise -addition, XLA needs to "broadcast" the vector `v` to the same rank as the -matrix `X`, by replicating `v` a certain number of times. The vector's length -has to match at least one of the dimensions of the matrix. - -For example: - - |1 2 3| + |7 8 9| - |4 5 6| - -The matrix's dimensions are (2,3), the vector's are (3). We broadcast the vector -by replicating it over rows to get: - - |1 2 3| + |7 8 9| = |8 10 12| - |4 5 6| |7 8 9| |11 13 15| - -In Numpy, this is called -[broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html). - -### Principles - -We see XLA as a low-level infrastructure. Therefore, we want to make the -XLA language as strict and explicit as possible, avoiding implicit and -"magical" features that may make some computations slightly easier to define, at -the cost of more assumptions baked into user code that will be difficult to -change in the long term. If necessary, implicit and magical features can be -added in client-level wrappers. - -Specifically w.r.t. broadcasting, we will require explicit broadcasting -specifications on operations between arrays of different ranks, instead of -inferring a possible broadcasting like Numpy does. - -### Broadcasting a lower-rank array onto a higher-rank array - -*Scalars* can always be broadcast over arrays without an explicit specification -of broadcasting dimensions. An element-wise binary operation between a scalar -and an array means applying the operation with the scalar for each element in -the array. For example, adding a scalar to a matrix means producing a matrix -each element of which is a sum of the scalar with the corresponding input -matrix's element. - - |1 2 3| + 7 = |8 9 10| - |4 5 6| |11 12 13| - -Most broadcasting needs can be captured by using a tuple of dimensions on a -binary operation. 
When the inputs to the operation have different ranks, this -broadcasting tuple specifies which dimension(s) in the **higher-rank** array to -match with the **lower-rank** array. - -Consider the previous example of adding a matrix with dimensions (2,3) to a -vector with dimension (3). *Without specifying broadcasting, this operation is -invalid.* Based on XLA convention, the left-most dimension is 0, and the -number grows as we walk the dimensions right-wards. For a (2,3) matrix we'd -index into it with `matrix[i,j]` with `i` running to 2 and `j` running to 3. `i` -indexes over dimension 0 and `j` indexes over dimension 1. - -To correctly request our matrix-vector addition the user will specify the -broadcasting dimension to be (1), meaning that the vector's dimension is matched -to dimension 1 of the matrix. In 2D, if we consider dimension 0 as rows and -dimension 1 as columns, this means that each element of the vector becomes a -column of a size matching the number of rows in the matrix: - - |7 8 9| ==> |7 8 9| - |7 8 9| - -As a more complex example, consider adding a 3-element vector (dimension (3)) to -a 3x3 matrix (dimensions (3,3)). There are two ways broadcasting can happen -here: - -Broadcasting dimension is 1, as before. Each vector element becomes a column - -the vector is duplicated for each row in the matrix. - - |7 8 9| ==> |7 8 9| - |7 8 9| - |7 8 9| - -Broadcasting dimension is 0. Each vector element becomes a row - the vector is -duplicated for each column in the matrix. - - |7| ==> |7 7 7| - |8| |8 8 8| - |9| |9 9 9| - -Note: when adding a 2x3 matrix to a 3-element vector, a broadcasting dimension -of 0 is invalid. - -The broadcasting dimensions can be a tuple that describes how a smaller rank -shape is broadcast into a larger rank shape. For example, given a 2x3x4 cuboid -and a 3x4 matrix, a broadcasting tuple (1,2) means matching the matrix to -dimensions 1 and 2 of the cuboid. - -This type of broadcast is used in the binary ops in `ComputationBuilder`, if the -`broadcast_dimensions` argument is given. In the XLA source code, this type -of broadcasting is sometimes called "InDim" broadcasting. - -#### Formal definition - -The broadcasting attribute allows matching a lower-rank array to a higher-rank -array, by specifying which dimensions of the higher-rank array to match. For -example, for an array with dimensions MxNxPxQ, we can match a vector with -dimension T as follows: - - MxNxPxQ - - dim 3: T - dim 2: T - dim 1: T - dim 0: T - -In each case, T has to be equal to the matching dimension of the higher-rank -array. The vector's values are then broadcast from the matched dimension to all -the other dimensions. - -If we want to match a TxV matrix onto the MxNxPxQ array, we have to use a pair -of broadcasting dimensions: - - MxNxPxQ - dim 2,3: T V - dim 1,2: T V - dim 0,3: T V - etc... - -The order of dimensions in the broadcasting tuple has to be the order in which -the lower-rank array's dimensions are expected to match the higher-rank array's -dimensions. The first element in the tuple says which dimension in the -higher-rank array has to match dimension 0 in the lower-rank array. The second -element for dimension 1, and so on. The order of broadcast dimensions has to be -strictly increasing. E.g. in the previous example, it's illegal to match V to N -and T to P; also, it's illegal to match V to both P and N. 
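-
-To make this concrete, here is a minimal sketch (with hypothetical variable
-names, reusing a builder constructed as in the earlier examples) of the
-matrix-plus-vector addition described above, using the `broadcast_dimensions`
-argument of the builder's binary operations:
-
-```c++
-// Add a vector of dimension (3) to a (2,3) matrix by matching the vector to
-// dimension 1 (the columns) of the matrix.
-auto matrix = builder.ConstantR2<float>({{1, 2, 3}, {4, 5, 6}});
-auto vec = builder.ConstantR1<float>({7, 8, 9});
-auto sum = builder.Add(matrix, vec, /*broadcast_dimensions=*/{1});
-// sum == f32[2, 3] {{8, 10, 12}, {11, 13, 15}}
-```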
- -### Broadcasting similar-rank arrays with degenerate dimensions - -A related broadcasting problem is broadcasting two arrays that have the same -rank but different dimension sizes. Similarly to Numpy's rules, this is only -possible when the arrays are *compatible*. Two arrays are compatible when all -their dimensions are compatible. Two dimensions are compatible if: - -* They are equal, or -* One of them is 1 (a "degenerate" dimension) - -When we encounter two compatible arrays, the result shape has the maximum among -the two inputs at every dimension index. - -Examples: - -1. (2,1) and (2,3) broadcast to (2,3). -2. (1,2,5) and (7,2,5) broadcast to (7,2,5) -3. (7,2,5) and (7,1,5) broadcast to (7,2,5) -4. (7,2,5) and (7,2,6) are incompatible and cannot be broadcast. - -A special case arises, and is also supported, where each of the input arrays has -a degenerate dimension at a different index. In this case, we get an "outer -operation": (2,1) and (1,3) broadcast to (2,3). For more examples, consult the -[Numpy documentation on broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html). - -### Broadcast composition - -Broadcasting of a lower-rank array to a higher-rank array **and** broadcasting -using degenerate dimensions can both be performed in the same binary operation. -For example, a vector of size 4 and an matrix of size 1x2 can be added together -using broadcast dimensions value of (0): - - |1 2 3 4| + [5 6] // [5 6] is a 1x2 matrix, not a vector. - -First the vector is broadcast up to rank 2 (matrix) using the broadcast -dimensions. The single value (0) in the broadcast dimensions indicates that -dimension zero of the vector matches to dimension zero of the matrix. This -produces an matrix of size 4xM where the value M is chosen to match the -corresponding dimension size in the 1x2 array. Therefore, a 4x2 matrix is -produced: - - |1 1| + [5 6] - |2 2| - |3 3| - |4 4| - -Then "degenerate dimension broadcasting" broadcasts dimension zero of the 1x2 -matrix to match the corresponding dimension size of the right hand side: - - |1 1| + |5 6| |6 7| - |2 2| + |5 6| = |7 8| - |3 3| + |5 6| |8 9| - |4 4| + |5 6| |9 10| - -A more complicated example is a matrix of size 1x2 added to an array of size -4x3x1 using broadcast dimensions of (1, 2). First the 1x2 matrix is broadcast up -to rank 3 using the broadcast dimensions to produces an intermediate Mx1x2 array -where the dimension size M is determined by the size of the larger operand (the -4x3x1 array) producing a 4x1x2 intermediate array. The M is at dimension 0 -(left-most dimension) because the dimensions 1 and 2 are mapped to the -dimensions of the original 1x2 matrix as the broadcast dimension are (1, 2). -This intermediate array can be added to the 4x3x1 matrix using broadcasting of -degenerate dimensions to produce a 4x3x2 array result. - -[^1]: Some obvious reductions like "add reduction" are not strictly associative - for floats. However, if the range of the data is limited, floating-point - addition is close enough to being associative for most practical uses. It - is possible to conceive some complete un-associative reductions, however, - and these will produce wrong results in XLA reductions. - -## C++ interface - -The following is a fragment of the class definition for the client -`ComputationBuilder` interface, for reference: - -```c++ -class ComputationBuilder { - public: - // client: client in which to build the computation. - // computation_name: name to use for the built computation. 
- ComputationBuilder(Client* client, const string& computation_name); - - ~ComputationBuilder(); - - // Returns the client the builder was initialized with. - Client* client() { return client_; } - - // Returns the computation name. - const string& name() { return name_; } - - // Sets the builder to a mode where it will die immediately when an error is - // encountered, rather than producing it in a deferred fashion when Build() is - // called (which is the default). - void set_die_immediately_on_error(bool enabled) { - die_immediately_on_error_ = enabled; - } - - // Enqueues a "retrieve parameter value" instruction for a parameter that was - // passed to the computation. - ComputationDataHandle Parameter(int64 parameter_number, const Shape& shape, - const string& name); - - // Retrieves the (inferred) shape of the operand in the computation. - util::StatusOr<std::unique_ptr<Shape>> GetShape( - const ComputationDataHandle& operand); - - // Checks that the operand has the given expected shape. Returns the operand - // if yes, fails with a CHECK error if no. - ComputationDataHandle CheckShape(const ComputationDataHandle& operand, - const Shape& expected_shape); - - // Checks that the lhs and rhs results have the same shape. - void CheckSameShape(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs); - - // Enqueues a constant with the value of the given literal onto the - // computation. - ComputationDataHandle ConstantLiteral(const Literal& literal); - - // Enqueues a constant onto the computation. Methods are templated on the - // native host type (NativeT) which corresponds to a specific XLA - // PrimitiveType as given in the following table: - // - // Native Type PrimitiveType - // ----------------------------- - // bool PRED - // int32 S32 - // int64 S64 - // uint32 U32 - // uint64 U64 - // float F32 - // double F64 - // - // Note: not all primitive types defined in xla.proto have a corresponding - // native type yet. - template <typename NativeT> - ComputationDataHandle ConstantR0(NativeT value); - template <typename NativeT> - ComputationDataHandle ConstantR1(gtl::ArraySlice<NativeT> values); - template <typename NativeT> - ComputationDataHandle ConstantR2( - std::initializer_list<std::initializer_list<NativeT>> values); - template <typename NativeT> - ComputationDataHandle ConstantR2FromArray2D(const Array2D<NativeT>& values); - template <typename NativeT> - ComputationDataHandle ConstantR3FromArray3D(const Array3D<NativeT>& values); - template <typename NativeT> - ComputationDataHandle ConstantR4FromArray4D(const Array4D<NativeT>& values); - - // Enqueues a rank one constant (vector) onto the computation. The - // vector has size 'length' and every element has the value 'value'. - template <typename NativeT> - ComputationDataHandle ConstantR1(int64 length, NativeT value); - - // Adds dimensions to an array by duplicating the data in the array. - // - // The new dimensions are inserted on the left, i.e. if - // broadcast_sizes has values {a0, ..., aN} and the operand shape - // has dimensions {b0, ..., bM} then the shape of the output has - // dimensions {a0, ..., aN, b0, ..., bM}. - // - // The new dimensions index into copies of the operand, i.e. - // - // output[i0, ..., iN, j0, ..., jM] = operand[j0, ..., jM] - ComputationDataHandle Broadcast(const ComputationDataHandle& operand, - gtl::ArraySlice<int64> broadcast_sizes); - - // Enqueues a pad operation onto the computation that pads the given value on - // the edges as well as between the elements of the input. 
padding_config - // specifies the padding amount for each dimension. - ComputationDataHandle Pad(const ComputationDataHandle& operand, - const ComputationDataHandle& padding_value, - const PaddingConfig& padding_config); - - // Enqueues an operation onto the computation that flattens the operand based - // on the dimension order (major/slowest-varying to minor/fastest-varying) - // given, followed by reshaping it into the shape with the given dimension - // sizes (also major to minor). Conceptually, this is a limited form of - // "shape casting". - ComputationDataHandle Reshape(const ComputationDataHandle& operand, - gtl::ArraySlice<int64> dimensions, - gtl::ArraySlice<int64> new_sizes); - - // Wrapper for Reshape. - // Enqueues an operation to collapse the provided dimensions; e.g. an - // operand with dimensions {x=256, y=2, z=2, p=32} can be collapsed to - // {x=1024, y=32} by collapsing dims {0, 1, 2}. Collapsing dimensions must - // be a consecutive, in-order subsequence of the operand dimensions. - // - // This could potentially cause data to be moved -- it provides a more - // structured form of reshaping than an arbitrary Reshape operation. - ComputationDataHandle Collapse(const ComputationDataHandle& operand, - gtl::ArraySlice<int64> dimensions); - - // Enqueues a slice operation onto the computation that slices the operand - // from the start indices to the limit indices; e.g. - // - // x - // [ 0 1 2 3 ] - // y [ 4 5 6 7 ] => slice(start={1, 1}, limit={2, 3}) => [ 5 6 ] - // [ 8 9 a b ] - // - // Note that "limit" means up-to-but-not-including; i.e. [start, limit) in 1D - // range notation. - ComputationDataHandle Slice(const ComputationDataHandle& operand, - gtl::ArraySlice<int64> start_indices, - gtl::ArraySlice<int64> limit_indices); - - // Enqueues a concatenate instruction onto the computation. - ComputationDataHandle ConcatInDim( - gtl::ArraySlice<ComputationDataHandle> operands, int64 dimension); - - // Enqueue a tracing operation onto the computation; the computation will emit - // a logging message with the operand. - void Trace(const string& tag, const ComputationDataHandle& operand); - - // Enqueues a conditional-move-like select operation onto the computation; - // predicated on pred, selects between on_true and on_false. - ComputationDataHandle Select(const ComputationDataHandle& pred, - const ComputationDataHandle& on_true, - const ComputationDataHandle& on_false); - - // Enqueues a tuple-creation instruction onto the computation. - ComputationDataHandle Tuple(gtl::ArraySlice<ComputationDataHandle> elements); - - // Enqueues a tuple-element-get instruction onto the computation. - ComputationDataHandle GetTupleElement(const ComputationDataHandle& tuple_data, - int64 index); - - // Enqueues an equal-to comparison instruction onto the computation. - ComputationDataHandle Eq(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a not-equal comparison instruction onto the computation. - ComputationDataHandle Ne(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a greater-or-equal comparison instruction onto the computation. - ComputationDataHandle Ge(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a greater-than comparison instruction onto the computation. 
- ComputationDataHandle Gt(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a less-than comparison instruction onto the computation. - ComputationDataHandle Lt(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a less-or-equal comparison instruction onto the computation. - ComputationDataHandle Le(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a dot instruction onto the computation. - ComputationDataHandle Dot(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs); - - // Default dimension numbers used for a convolution. - static constexpr int64 kConvBatchDimension = 0; - static constexpr int64 kConvFeatureDimension = 1; - static constexpr int64 kConvFirstSpatialDimension = 2; - static constexpr int64 kConvSecondSpatialDimension = 3; - static constexpr int64 kConvKernelOutputDimension = 0; - static constexpr int64 kConvKernelInputDimension = 1; - static constexpr int64 kConvKernelFirstSpatialDimension = 2; - static constexpr int64 kConvKernelSecondSpatialDimension = 3; - - // Creates a default ConvolutionDimensionNumbers. For the input operand - // {batch, feature, height, width} = {0, 1, 2, 3} and for the weight operand - // {kernel_output_feature, kernel_input_feature, height, width = {0, 1, 2, 3}. - static ConvolutionDimensionNumbers CreateDefaultConvDimensionNumbers(); - - // Creates a ConvolutionDimensionNumbers with the given arguments. Returns an - // error if either the input or the weight dimension numbers have conflicts. - static util::StatusOr<ConvolutionDimensionNumbers> CreateConvDimensionNumbers( - int64 batch, int64 feature, int64 first_spatial, int64 second_spatial, - int64 kernel_output_feature, int64 kernel_input_feature, - int64 kernel_first_spatial, int64 kernel_second_spatial); - - // Enqueues a convolution instruction onto the computation, which uses the - // default convolution dimension numbers. - ComputationDataHandle Conv(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> window_strides, - Padding padding); - - // Enqueues a convolution instruction onto the computation, with the caller - // provided padding configuration in the format returned by MakePadding(). - ComputationDataHandle ConvWithGeneralPadding( - const ComputationDataHandle& lhs, const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> window_strides, - gtl::ArraySlice<std::pair<int64, int64>> padding); - - // Enqueues a convolution instruction onto the computation, with the caller - // provided dimension numbers configuration. - ComputationDataHandle ConvWithGeneralDimensions( - const ComputationDataHandle& lhs, const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> window_strides, Padding padding, - const ConvolutionDimensionNumbers& dimension_numbers); - - // Enqueues a convolution instruction onto the computation, with the caller - // provided padding configuration as well as the dimension numbers. 
- ComputationDataHandle ConvGeneral( - const ComputationDataHandle& lhs, const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> window_strides, - gtl::ArraySlice<std::pair<int64, int64>> padding, - const ConvolutionDimensionNumbers& dimension_numbers); - - // Enqueues an infeed instruction onto the computation, which reads data of - // the given shape from the infeed buffer of the device. - ComputationDataHandle Infeed(const Shape& shape); - - // Enqueues a custom call instruction onto the computation. - // During code generation, a call instruction is emitted which targets a - // symbol with the name |call_target_name|. The |operands| are passed to the - // call instruction. |shape| is the resultant shape. - ComputationDataHandle CustomCallOp( - tensorflow::StringPiece call_target_name, - gtl::ArraySlice<ComputationDataHandle> operands, const Shape& shape); - - // The following methods enqueue element-wise binary arithmetic operations - // onto the computation. The shapes of the operands have to match unless one - // of the operands is a scalar, or an explicit broadcast dimension is given - // (see g3doc for more details). - - // Enqueues an add instruction onto the computation. - ComputationDataHandle Add(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a subtract instruction onto the computation. - ComputationDataHandle Sub(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a multiply instruction onto the computation. - ComputationDataHandle Mul(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a divide instruction onto the computation. - ComputationDataHandle Div(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a remainder instruction onto the computation. - ComputationDataHandle Rem(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a max instruction onto the computation. - ComputationDataHandle Max(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Enqueues a min instruction onto the computation. - ComputationDataHandle Min(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs, - gtl::ArraySlice<int64> broadcast_dimensions = {}); - - // Reduces an array among the provided dimensions, given "computation" as a - // reduction operator. - ComputationDataHandle Reduce(const ComputationDataHandle& operand, - const ComputationDataHandle& init_value, - const Computation& computation, - gtl::ArraySlice<int64> dimensions_to_reduce); - - // Enqueues a windowed reduce instruction onto the computation. - ComputationDataHandle ReduceWindow(const ComputationDataHandle& operand, - const ComputationDataHandle& init_value, - const Computation& computation, - gtl::ArraySlice<int64> window_dimensions, - gtl::ArraySlice<int64> window_strides, - Padding padding); - - // As ReduceWindow(), but the padding is given in the format - // returned by MakePadding(). 
- ComputationDataHandle ReduceWindowWithGeneralPadding( - const ComputationDataHandle& operand, - const ComputationDataHandle& init_value, const Computation& computation, - gtl::ArraySlice<int64> window_dimensions, - gtl::ArraySlice<int64> window_strides, - gtl::ArraySlice<std::pair<int64, int64>> padding); - - // Enqueues an operation that scatters the `source` array to the selected - // indices of each window. - ComputationDataHandle SelectAndScatter( - const ComputationDataHandle& operand, const Computation& select, - gtl::ArraySlice<int64> window_dimensions, - gtl::ArraySlice<int64> window_strides, Padding padding, - const ComputationDataHandle& source, - const ComputationDataHandle& init_value, const Computation& scatter); - - // As SelectAndScatter(), but the padding is given in the format - // returned by MakePadding(). - ComputationDataHandle SelectAndScatterWithGeneralPadding( - const ComputationDataHandle& operand, const Computation& select, - gtl::ArraySlice<int64> window_dimensions, - gtl::ArraySlice<int64> window_strides, - gtl::ArraySlice<std::pair<int64, int64>> padding, - const ComputationDataHandle& source, - const ComputationDataHandle& init_value, const Computation& scatter); - - // Enqueues an exp instruction onto the computation. - ComputationDataHandle Exp(const ComputationDataHandle& operand); - - // Enqueues a floor instruction onto the computation. - ComputationDataHandle Floor(const ComputationDataHandle& operand); - - // Enqueues a ceil instruction onto the computation. - ComputationDataHandle Ceil(const ComputationDataHandle& operand); - - // Enqueues an log instruction (natural logarithm) onto the computation. - ComputationDataHandle Log(const ComputationDataHandle& operand); - - // Enqueues a tanh instruction onto the computation. - ComputationDataHandle Tanh(const ComputationDataHandle& operand); - - // Enqueues a float32 sqrt instruction onto the computation. - // (float32 is specified as there is an implicit float32 0.5f constant - // exponent). - ComputationDataHandle SqrtF32(const ComputationDataHandle& operand); - - // Enqueues a float32 square instruction onto the computation. - // (float32 is specified as there is an implicit float32 2.0f constant - // exponent). - ComputationDataHandle SquareF32(const ComputationDataHandle& operand); - - // Enqueues a lhs^rhs computation onto the computation. - ComputationDataHandle Pow(const ComputationDataHandle& lhs, - const ComputationDataHandle& rhs); - - // Enqueues a convert instruction onto the computation that changes the - // element type of the operand array to primitive_type. - ComputationDataHandle ConvertElementType(const ComputationDataHandle& operand, - PrimitiveType new_element_type); - - // Enqueues a float32 reciprocal instruction onto the computation. - // (float32 is specified as there is an implicit float32 -1.0f constant - // exponent). - // - // TODO(leary) axe F32 suffix, can be determined by reflecting on the shape of - // the operand. - ComputationDataHandle ReciprocalF32(const ComputationDataHandle& operand); - - // Enqueues a negate instruction onto the computation. - ComputationDataHandle Neg(const ComputationDataHandle& operand); - - // Enqueues a transpose instruction onto the computation. - ComputationDataHandle Trans(const ComputationDataHandle& operand); - - // Enqueues a reverse instruction onto the computation. The order of the - // elements in the given dimensions is reversed (i.e., the element at index i - // is moved to index dimension_size - 1 - i). 
- ComputationDataHandle Rev(const ComputationDataHandle& operand, - gtl::ArraySlice<int64> dimensions); - - // Enqueues a sort (as increasing order) instruction onto the computation. - ComputationDataHandle Sort(const ComputationDataHandle& operand); - - // Enqueues a clamp instruction onto the computation. - ComputationDataHandle Clamp(const ComputationDataHandle& min, - const ComputationDataHandle& operand, - const ComputationDataHandle& max); - - // Enqueues a map instruction onto the computation. - ComputationDataHandle Map( - gtl::ArraySlice<ComputationDataHandle> operands, - const Computation& computation, - gtl::ArraySlice<ComputationDataHandle> static_operands = {}); - - // Enqueues a N(mu, sigma) random number generation instruction onto the - // computation. - ComputationDataHandle RngNormal(const ComputationDataHandle& mu, - const ComputationDataHandle& sigma, - const Shape& shape); - - // Enqueues a U(a, b) random number generation instruction onto the - // computation. - ComputationDataHandle RngUniform(const ComputationDataHandle& a, - const ComputationDataHandle& b, - const Shape& shape); - - // Enqueues a B(1, p) random number generation instruction onto the - // computation. - ComputationDataHandle RngBernoulli(const ComputationDataHandle& mean, - const Shape& shape); - - // Enqueues a while node onto the computation. - ComputationDataHandle While(const Computation& condition, - const Computation& body, - const ComputationDataHandle& init); - - // Computes the value of a constant indicated by a - // ComputationDataHandle. - // - // The handle must be from the computation currently being built - - // i.e. returned from this builder with no intervening call to - // Build(). This happens to currently work regardless of that, but - // that may stop working at any time. - // - // The handle must represent a constant value, which in this case - // means that it must not statically depend on a parameter to the - // computation that is being built. Note this allows the output of - // an Rng() node to count as constant - in that case you may receive - // different values if you call this method several times. Let us - // know if you have a use-case where that is a problem. - // - // This functionality can be useful when translating a computation - // into XLA where something that looked dynamic is required by XLA - // to be specified as a constant. E.g. the source computation - // (outside of XLA) may include a dynamic computation of the shape - // of something and ComputeConstant lets you determine what the - // value of that computation is in the case where the value can be - // determined at compile time. - // - // If output_layout is non-null, then the output of the computation - // will be stored using that layout. - util::StatusOr<std::unique_ptr<GlobalData>> ComputeConstant( - const ComputationDataHandle& handle, - const Layout* output_layout = nullptr); - - // Returns a new ComputationBuilder whose resultant Computation is used only - // by this ComputationBuilder. The sub-ComputationBuilder has the same - // die_immediately_on_error behavior as the parent. - std::unique_ptr<ComputationBuilder> CreateSubBuilder( - const string& computation_name); - - // Modifies the computation being built so that executions of it - // will return the value associated with operand, rather than the - // last expression enqueued on the ComputationBuilder. Any subsequent - // operations added to the ComputationBuilder will not have any effect unless - // SetReturnValue is called again. 
-  util::Status SetReturnValue(const ComputationDataHandle& operand);
-
-  // Builds the computation with the requested operations, or returns a non-ok
-  // status.
-  util::StatusOr<std::unique_ptr<Computation>> Build();
-
-  // Builds the computation with the requested operations, or notes an error in
-  // the parent ComputationBuilder and returns an empty computation if building
-  // failed. This function is intended to be used where the returned
-  // Computation is only used by the parent ComputationBuilder and hence
-  // further operations on the returned Computation will simply be errored out
-  // if an error occurred while building this computation. If the built
-  // computation is to be used by a ComputationBuilder other than the parent
-  // ComputationBuilder, then Build() should be used instead.
-  std::unique_ptr<Computation> BuildAndNoteError();
-};
-```
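-
-As a usage illustration (a sketch only, not part of the interface above), the
-broadcast composition example from earlier, in which a size-4 vector is added
-to a 1x2 matrix with broadcast dimensions of (0), might be expressed roughly
-as follows. A valid `Client*` named `client` is assumed; all other names are
-hypothetical:
-
-```c++
-ComputationBuilder builder(client, "broadcast_composition_example");
-
-// |1 2 3 4| : a rank-1 operand of size 4.
-ComputationDataHandle v = builder.ConstantR1<float>({1.0f, 2.0f, 3.0f, 4.0f});
-// [5 6] : a 1x2 matrix, not a vector.
-ComputationDataHandle m = builder.ConstantR2<float>({{5.0f, 6.0f}});
-
-// Broadcast dimension (0) maps dimension 0 of the vector to dimension 0 of
-// the matrix; degenerate-dimension broadcasting then expands the 1x2 operand,
-// yielding a 4x2 result.
-ComputationDataHandle result = builder.Add(v, m, /*broadcast_dimensions=*/{0});
-
-// Build() returns the finished Computation, or a non-ok status on error.
-util::StatusOr<std::unique_ptr<Computation>> computation = builder.Build();
-```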