# Using JIT Compilation

> Note: TensorFlow must be compiled from source to include XLA.

## Why use just-in-time (JIT) compilation?

The TensorFlow/XLA JIT compiler compiles and runs parts of TensorFlow graphs via
XLA. The benefit of this over the standard TensorFlow implementation is that XLA
can fuse multiple operators (kernel fusion) into a small number of compiled
kernels. Fusing operators can reduce memory bandwidth requirements and improve
performance compared to executing operators one at a time, as the TensorFlow
executor does.

## Running TensorFlow graphs via XLA

There are two ways to run TensorFlow computations via XLA, either by
JIT-compiling operators placed on a CPU or GPU device, or by placing operators
on the `XLA_CPU` or `XLA_GPU` TensorFlow devices. Placing operators directly on
a TensorFlow XLA device forces the operator to run on that device and is mainly
used for testing.

> Note: The XLA CPU backend supports intra-op parallelism (i.e. it can shard a
> single operation across multiple cores) but it does not support inter-op
> parallelism (i.e. it cannot execute independent operations concurrently across
> multiple cores).  The XLA GPU backend is competitive with the standard
> TensorFlow implementation, sometimes faster, sometimes slower.

### Turning on JIT compilation

JIT compilation can be turned on at the session level or manually for select
operations. Both of these approaches are zero-copy: data does not need to be
copied when it passes between a compiled XLA kernel and a TensorFlow operator
placed on the same device.

#### Session

Turning on JIT compilation at the session level will result in all possible
operators being greedily compiled into XLA computations. Each XLA computation
will be compiled into one or more kernels for the underlying device.

Subject to a few constraints, if there are two adjacent operators in the graph
that both have XLA implementations, then they will be compiled into a single XLA
computation.

JIT compilation is turned on at the session level by setting the
`global_jit_level` config to `tf.OptimizerOptions.ON_1` and passing the config
during session initialization.

```python
# Config to turn on JIT compilation
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

sess = tf.Session(config=config)
```

> Note: Turning on JIT at the session level will not result in operations being
> compiled for the CPU. JIT compilation for CPU operations must be done via
> the manual method documented below.

#### Manual

JIT compilation can also be turned on manually for one or more operators. This
is done by tagging the operators to compile with the attribute
`_XlaCompile=true`. The simplest way to do this is via the
`tf.contrib.compiler.jit.experimental_jit_scope()` scope defined in
[`tensorflow/contrib/compiler/jit.py`](https://www.tensorflow.org/code/tensorflow/contrib/compiler/jit.py).
Example usage:

```python
import numpy as np
import tensorflow as tf

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(np.float32)
with jit_scope():
  y = tf.add(x, x)  # The "add" will be compiled with XLA.
```

The `_XlaCompile` attribute is currently supported on a best-effort basis. If an
operator cannot be compiled, TensorFlow will silently fall back to the normal
implementation.
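
The scope in `jit.py` also takes a `compile_ops` argument. Assuming your
TensorFlow version exposes it (check the `experimental_jit_scope` signature in
your checkout), nested scopes can exclude specific operators from compilation;
a minimal sketch:

```python
import numpy as np
import tensorflow as tf

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(np.float32)
with jit_scope():
  y = tf.add(x, x)  # Compiled with XLA when an XLA kernel is available.
  # Assumption: this version of jit.py supports compile_ops=False.
  with jit_scope(compile_ops=False):
    z = tf.reduce_sum(y)  # Left to the standard TensorFlow executor.
```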

### Placing operators on XLA devices

Another way to run computations via XLA is to place an operator on a specific
XLA device. This method is normally only used for testing. Valid targets are
`XLA_CPU` or `XLA_GPU`.

```python
with tf.device("/job:localhost/replica:0/task:0/device:XLA_GPU:0"):
  output = tf.add(input1, input2)
```
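
The snippet above assumes `input1` and `input2` are defined elsewhere. A
self-contained sketch, assuming a TensorFlow build with XLA enabled
(substitute `XLA_GPU` for `XLA_CPU` on a GPU machine):

```python
import tensorflow as tf

input1 = tf.constant([1.0, 2.0], dtype=tf.float32)
input2 = tf.constant([3.0, 4.0], dtype=tf.float32)

# Only the add is pinned to the XLA device; the constants stay on the default
# device, so data is copied on and off the XLA device as described below.
with tf.device("/device:XLA_CPU:0"):
  output = tf.add(input1, input2)

with tf.Session() as sess:
  print(sess.run(output))  # => [4. 6.]
```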

Unlike JIT compilation on the standard CPU and GPU devices, these devices make a
copy of data when it is transferred on and off the device. The extra copy makes
it expensive to mix XLA and TensorFlow operators in the same graph.

## Tutorial

This tutorial covers training a simple version of MNIST softmax with JIT turned
on. Currently session-level JIT, which is what the tutorial uses, only supports
GPU.

Before starting the tutorial, verify that the `LD_LIBRARY_PATH` environment
variable or the ldconfig cache contains `$CUDA_ROOT/extras/CUPTI/lib64`, which
holds libraries for the CUDA Profiling Tools Interface
([CUPTI](http://docs.nvidia.com/cuda/cupti/index.html)).
TensorFlow uses CUPTI to pull tracing information from the GPU.
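
A quick way to confirm the path is visible to the dynamic loader (this only
checks `LD_LIBRARY_PATH`, not the ldconfig cache, and assumes `CUDA_ROOT` is
set or defaults to `/usr/local/cuda`):

```python
import os

# Check whether the CUPTI library directory appears on LD_LIBRARY_PATH.
cuda_root = os.environ.get("CUDA_ROOT", "/usr/local/cuda")
cupti_dir = os.path.join(cuda_root, "extras", "CUPTI", "lib64")
ld_paths = os.environ.get("LD_LIBRARY_PATH", "").split(":")
print("CUPTI dir on LD_LIBRARY_PATH:", cupti_dir in ld_paths)
```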

### Step #1: Prepare sample script

Download or move
[mnist_softmax_xla.py](https://www.tensorflow.org/code/tensorflow/examples/tutorials/mnist/mnist_softmax_xla.py)
into a folder outside of the TensorFlow source tree.

### Step #2: Run without XLA

Execute the Python script to train the model without XLA.

```shell
python mnist_softmax_xla.py --xla=''
```

Using the Chrome Trace Event Profiler (browse to chrome://tracing),
open the timeline file created when the script finishes: `timeline.ctf.json`.
The rendered timeline should look similar to the picture below with multiple
green boxes labeled `MatMul`, possibly across multiple CPUs.
<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="https://www.tensorflow.org/images/jit_timeline_gpu.png">
</div>

### Step #3: Run with XLA

Execute the Python script to train the model with XLA, and turn on an XLA
debugging feature via an environment variable that outputs the XLA graph.

```shell
TF_XLA_FLAGS=--xla_generate_hlo_graph=.* python mnist_softmax_xla.py
```

Open the timeline file created (`timeline.ctf.json`).  The rendered timeline
should look similar to the picture below with one long bar labeled `XlaLaunch`.
<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="https://www.tensorflow.org/images/jit_timeline_gpu_xla.png">
</div>

To understand what is happening in `XlaLaunch`, look at the console output for
statements similar to the following:

```shell
computation cluster_0[_XlaCompiledKernel=true,_XlaNumConstantArgs=1].v82 [CPU:
pipeline start, before inline]: /tmp/hlo_graph_0.dot
```

The console statements point to the location of `hlo_graph_xx.dot` files that
contain information about the graph created by XLA. The process XLA uses to
fuse ops can be seen by starting at `hlo_graph_0.dot` and viewing each diagram
in succession.

To render the `.dot` file into a PNG, install
[Graphviz](https://www.graphviz.org/download/) and run:

```shell
dot -Tpng hlo_graph_80.dot -o hlo_graph_80.png
```

The result will look like the following:
<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="https://www.tensorflow.org/images/jit_gpu_xla_graph.png">
</div>