author     Mark Daoust <markdaoust@google.com>              2018-08-07 14:28:32 -0700
committer  TensorFlower Gardener <gardener@tensorflow.org>  2018-08-07 14:32:57 -0700
commit     02df0f46d562a0c48b6f24803eba6330d13d7213 (patch)
tree       3a39933e12b7300ddcefb5afb90db053f295d824 /tensorflow/docs_src/performance
parent     452f995e2c23cbd67c14b15b678bb3a352212633 (diff)
Remove usage of magic-api-link syntax from docs.
Back-ticks are now converted to links in the api_docs generator. With the new docs repo we're moving to simplify the docs pipeline and make everything more readable. By doing this we no longer get test failures for symbols that don't exist (`tf.does_not_exist` will not get a link). There is also no way to set custom link text now. That's okay.

This is the result of the following regex replacement (+ a couple of manual edits):

re:  @\{([^$].*?)(\$.+?)?}
sub: `\1`

Which does the following replacements:

"@{tf.symbol}"           --> "`tf.symbol`"
"@{tf.symbol$link_text}" --> "`tf.symbol`"

PiperOrigin-RevId: 207780049
Diffstat (limited to 'tensorflow/docs_src/performance')
-rw-r--r--  tensorflow/docs_src/performance/datasets_performance.md      22
-rw-r--r--  tensorflow/docs_src/performance/performance_guide.md         42
-rw-r--r--  tensorflow/docs_src/performance/performance_models.md        18
-rw-r--r--  tensorflow/docs_src/performance/quantization.md               2
-rw-r--r--  tensorflow/docs_src/performance/xla/operation_semantics.md   12
5 files changed, 48 insertions, 48 deletions
diff --git a/tensorflow/docs_src/performance/datasets_performance.md b/tensorflow/docs_src/performance/datasets_performance.md
index 46b43b7673..5d9e4ba392 100644
--- a/tensorflow/docs_src/performance/datasets_performance.md
+++ b/tensorflow/docs_src/performance/datasets_performance.md
@@ -38,9 +38,9 @@ the heavy lifting of training your model. In addition, viewing input pipelines
as an ETL process provides structure that facilitates the application of
performance optimizations.
-When using the @{tf.estimator.Estimator} API, the first two phases (Extract and
+When using the `tf.estimator.Estimator` API, the first two phases (Extract and
Transform) are captured in the `input_fn` passed to
-@{tf.estimator.Estimator.train}. In code, this might look like the following
+`tf.estimator.Estimator.train`. In code, this might look like the following
(naive, sequential) implementation:
```
@@ -99,7 +99,7 @@ With pipelining, idle time diminishes significantly:
![with pipelining](/images/datasets_with_pipelining.png)
The `tf.data` API provides a software pipelining mechanism through the
-@{tf.data.Dataset.prefetch} transformation, which can be used to decouple the
+`tf.data.Dataset.prefetch` transformation, which can be used to decouple the
time data is produced from the time it is consumed. In particular, the
transformation uses a background thread and an internal buffer to prefetch
elements from the input dataset ahead of the time they are requested. Thus, to
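For illustration, a minimal prefetching sketch (the file names, the `parse_fn` body, and the `buffer_size=1` value are assumptions for this sketch, not taken from the doc):

```python
import tensorflow as tf

filenames = ["train-00000-of-00010.tfrecord"]  # hypothetical input files

def parse_fn(record):
    # Hypothetical parser: one serialized tf.Example -> a flat float vector.
    features = tf.parse_single_example(
        record, {"x": tf.FixedLenFeature([784], tf.float32)})
    return features["x"]

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_fn)
dataset = dataset.batch(32)
# Overlap producer and consumer: keep one batch ready ahead of consumption.
dataset = dataset.prefetch(buffer_size=1)
```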
@@ -130,7 +130,7 @@ The preceding recommendation is simply the most common application.
### Parallelize Data Transformation
When preparing a batch, input elements may need to be pre-processed. To this
-end, the `tf.data` API offers the @{tf.data.Dataset.map} transformation, which
+end, the `tf.data` API offers the `tf.data.Dataset.map` transformation, which
applies a user-defined function (for example, `parse_fn` from the running
example) to each element of the input dataset. Because input elements are
independent of one another, the pre-processing can be parallelized across
@@ -164,7 +164,7 @@ dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_c
Furthermore, if your batch size is in the hundreds or thousands, your pipeline
will likely additionally benefit from parallelizing the batch creation. To this
-end, the `tf.data` API provides the @{tf.contrib.data.map_and_batch}
+end, the `tf.data` API provides the `tf.contrib.data.map_and_batch`
transformation, which effectively "fuses" the map and batch transformations.
To apply this change to our running example, change:
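Roughly, the fused form looks like the following sketch (it reuses `parse_fn` and `FLAGS.batch_size` from the running example; treat it as an illustration rather than the doc's exact snippet):

```python
dataset = dataset.apply(tf.contrib.data.map_and_batch(
    map_func=parse_fn, batch_size=FLAGS.batch_size))
```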
@@ -205,7 +205,7 @@ is stored locally or remotely, but can be worse in the remote case if data is
not prefetched effectively.
To mitigate the impact of the various data extraction overheads, the `tf.data`
-API offers the @{tf.contrib.data.parallel_interleave} transformation. Use this
+API offers the `tf.contrib.data.parallel_interleave` transformation. Use this
transformation to parallelize the execution of and interleave the contents of
other datasets (such as data file readers). The
number of datasets to overlap can be specified by the `cycle_length` argument.
@@ -232,7 +232,7 @@ dataset = files.apply(tf.contrib.data.parallel_interleave(
The throughput of remote storage systems can vary over time due to load or
network events. To account for this variance, the `parallel_interleave`
transformation can optionally use prefetching. (See
-@{tf.contrib.data.parallel_interleave} for details).
+`tf.contrib.data.parallel_interleave` for details).
By default, the `parallel_interleave` transformation provides a deterministic
ordering of elements to aid reproducibility. As an alternative to prefetching
@@ -261,7 +261,7 @@ function (that is, have it operate over a batch of inputs at once) and apply the
### Map and Cache
-The @{tf.data.Dataset.cache} transformation can cache a dataset, either in
+The `tf.data.Dataset.cache` transformation can cache a dataset, either in
memory or on local storage. If the user-defined function passed into the `map`
transformation is expensive, apply the cache transformation after the map
transformation as long as the resulting dataset can still fit into memory or
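A hedged sketch of that ordering, reusing `parse_fn` from the running example (calling `cache()` with no argument keeps the parsed elements in memory; pass a filename to cache on local storage instead):

```python
dataset = dataset.map(map_func=parse_fn)  # expensive user-defined function
dataset = dataset.cache()                 # parse only during the first epoch
```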
@@ -281,9 +281,9 @@ performance (for example, to enable fusing of the map and batch transformations)
### Repeat and Shuffle
-The @{tf.data.Dataset.repeat} transformation repeats the input data a finite (or
+The `tf.data.Dataset.repeat` transformation repeats the input data a finite (or
infinite) number of times; each repetition of the data is typically referred to
-as an _epoch_. The @{tf.data.Dataset.shuffle} transformation randomizes the
+as an _epoch_. The `tf.data.Dataset.shuffle` transformation randomizes the
order of the dataset's examples.
If the `repeat` transformation is applied before the `shuffle` transformation,
@@ -296,7 +296,7 @@ internal state of the `shuffle` transformation. In other words, the former
(`shuffle` before `repeat`) provides stronger ordering guarantees.
When possible, we recommend using the fused
-@{tf.contrib.data.shuffle_and_repeat} transformation, which combines the best of
+`tf.contrib.data.shuffle_and_repeat` transformation, which combines the best of
both worlds (good performance and strong ordering guarantees). Otherwise, we
recommend shuffling before repeating.
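For example, a minimal sketch of the fused transformation (the `buffer_size` and `NUM_EPOCHS` values are illustrative assumptions):

```python
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(
    buffer_size=10000, count=NUM_EPOCHS))
```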
diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md
index dafacbe379..df70309568 100644
--- a/tensorflow/docs_src/performance/performance_guide.md
+++ b/tensorflow/docs_src/performance/performance_guide.md
@@ -94,7 +94,7 @@ sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
#### Fused decode and crop
If inputs are JPEG images that also require cropping, use fused
-@{tf.image.decode_and_crop_jpeg} to speed up preprocessing.
+`tf.image.decode_and_crop_jpeg` to speed up preprocessing.
`tf.image.decode_and_crop_jpeg` only decodes the part of
the image within the crop window. This significantly speeds up the process if
the crop window is much smaller than the full image. For imagenet data, this
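A minimal sketch of the fused op (the input file and the crop offsets/size are illustrative assumptions):

```python
import tensorflow as tf

image_bytes = tf.read_file("example.jpg")  # hypothetical input file
# crop_window is [crop_y, crop_x, crop_height, crop_width].
crop_window = tf.constant([30, 30, 224, 224], dtype=tf.int32)
# Only the pixels inside the crop window are decoded.
image = tf.image.decode_and_crop_jpeg(image_bytes, crop_window, channels=3)
```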
@@ -187,14 +187,14 @@ some models makes up a large percentage of the operation time. Using fused batch
norm can result in a 12%-30% speedup.
There are two commonly used batch norms and both support fusing. The core
-@{tf.layers.batch_normalization} added fused starting in TensorFlow 1.3.
+`tf.layers.batch_normalization` added fused starting in TensorFlow 1.3.
```python
bn = tf.layers.batch_normalization(
input_layer, fused=True, data_format='NCHW')
```
-The contrib @{tf.contrib.layers.batch_norm} method has had fused as an option
+The contrib `tf.contrib.layers.batch_norm` method has had fused as an option
since before TensorFlow 1.0.
```python
@@ -205,43 +205,43 @@ bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
There are many ways to specify an RNN computation in TensorFlow and they have
trade-offs with respect to model flexibility and performance. The
-@{tf.nn.rnn_cell.BasicLSTMCell} should be considered a reference implementation
+`tf.nn.rnn_cell.BasicLSTMCell` should be considered a reference implementation
and used only as a last resort when no other options will work.
When using one of the cells, rather than the fully fused RNN layers, you have a
-choice of whether to use @{tf.nn.static_rnn} or @{tf.nn.dynamic_rnn}. There
+choice of whether to use `tf.nn.static_rnn` or `tf.nn.dynamic_rnn`. There
shouldn't generally be a performance difference at runtime, but large unroll
-amounts can increase the graph size of the @{tf.nn.static_rnn} and cause long
-compile times. An additional advantage of @{tf.nn.dynamic_rnn} is that it can
+amounts can increase the graph size of the `tf.nn.static_rnn` and cause long
+compile times. An additional advantage of `tf.nn.dynamic_rnn` is that it can
optionally swap memory from the GPU to the CPU to enable training of very long
sequences. Depending on the model and hardware configuration, this can come at
a performance cost. It is also possible to run multiple iterations of
-@{tf.nn.dynamic_rnn} and the underlying @{tf.while_loop} construct in parallel,
+`tf.nn.dynamic_rnn` and the underlying `tf.while_loop` construct in parallel,
although this is rarely useful with RNN models as they are inherently
sequential.
-On NVIDIA GPUs, the use of @{tf.contrib.cudnn_rnn} should always be preferred
+On NVIDIA GPUs, the use of `tf.contrib.cudnn_rnn` should always be preferred
unless you want layer normalization, which it doesn't support. It is often at
-least an order of magnitude faster than @{tf.contrib.rnn.BasicLSTMCell} and
-@{tf.contrib.rnn.LSTMBlockCell} and uses 3-4x less memory than
-@{tf.contrib.rnn.BasicLSTMCell}.
+least an order of magnitude faster than `tf.contrib.rnn.BasicLSTMCell` and
+`tf.contrib.rnn.LSTMBlockCell` and uses 3-4x less memory than
+`tf.contrib.rnn.BasicLSTMCell`.
If you need to run one step of the RNN at a time, as might be the case in
reinforcement learning with a recurrent policy, then you should use the
-@{tf.contrib.rnn.LSTMBlockCell} with your own environment interaction loop
-inside a @{tf.while_loop} construct. Running one step of the RNN at a time and
+`tf.contrib.rnn.LSTMBlockCell` with your own environment interaction loop
+inside a `tf.while_loop` construct. Running one step of the RNN at a time and
returning to Python is possible, but it will be slower.
-On CPUs, mobile devices, and if @{tf.contrib.cudnn_rnn} is not available on
+On CPUs, mobile devices, and if `tf.contrib.cudnn_rnn` is not available on
your GPU, the fastest and most memory efficient option is
-@{tf.contrib.rnn.LSTMBlockFusedCell}.
+`tf.contrib.rnn.LSTMBlockFusedCell`.
-For all of the less common cell types like @{tf.contrib.rnn.NASCell},
-@{tf.contrib.rnn.PhasedLSTMCell}, @{tf.contrib.rnn.UGRNNCell},
-@{tf.contrib.rnn.GLSTMCell}, @{tf.contrib.rnn.Conv1DLSTMCell},
-@{tf.contrib.rnn.Conv2DLSTMCell}, @{tf.contrib.rnn.LayerNormBasicLSTMCell},
+For all of the less common cell types like `tf.contrib.rnn.NASCell`,
+`tf.contrib.rnn.PhasedLSTMCell`, `tf.contrib.rnn.UGRNNCell`,
+`tf.contrib.rnn.GLSTMCell`, `tf.contrib.rnn.Conv1DLSTMCell`,
+`tf.contrib.rnn.Conv2DLSTMCell`, `tf.contrib.rnn.LayerNormBasicLSTMCell`,
etc., one should be aware that they are implemented in the graph like
-@{tf.contrib.rnn.BasicLSTMCell} and as such will suffer from the same poor
+`tf.contrib.rnn.BasicLSTMCell` and as such will suffer from the same poor
performance and high memory usage. One should consider whether or not those
trade-offs are worth it before using these cells. For example, while layer
normalization can speed up convergence, because cuDNN is 20x faster the fastest
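As a rough illustration of the cuDNN path (the layer sizes and the time-major input shape are assumptions, not taken from the guide):

```python
import tensorflow as tf

# Time-major input: [max_time, batch_size, input_size].
inputs = tf.random_uniform([50, 32, 128])
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=2, num_units=256)
outputs, _ = lstm(inputs)  # outputs: [50, 32, 256]
```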
diff --git a/tensorflow/docs_src/performance/performance_models.md b/tensorflow/docs_src/performance/performance_models.md
index 359b0e904d..66bf684d5b 100644
--- a/tensorflow/docs_src/performance/performance_models.md
+++ b/tensorflow/docs_src/performance/performance_models.md
@@ -10,8 +10,8 @@ incorporated into high-level APIs.
## Input Pipeline
The @{$performance_guide$Performance Guide} explains how to identify possible
-input pipeline issues and best practices. We found that using @{tf.FIFOQueue}
-and @{tf.train.queue_runner} could not saturate multiple current generation GPUs
+input pipeline issues and best practices. We found that using `tf.FIFOQueue`
+and `tf.train.queue_runner` could not saturate multiple current generation GPUs
when using large inputs and processing with higher samples per second, such
as training ImageNet with [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf).
This is due to the use of Python threads as its underlying implementation. The
@@ -29,7 +29,7 @@ implementation is made up of 3 stages:
The dominant part of each stage is executed in parallel with the other stages
using `data_flow_ops.StagingArea`. `StagingArea` is a queue-like operator
-similar to @{tf.FIFOQueue}. The difference is that `StagingArea` does not
+similar to `tf.FIFOQueue`. The difference is that `StagingArea` does not
guarantee FIFO ordering, but offers simpler functionality and can be executed
on both CPU and GPU in parallel with other stages. Breaking the input pipeline
into 3 stages that operate independently in parallel is scalable and takes full
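A minimal sketch of staging a batch between two stages (the shapes and dtypes are illustrative assumptions; `StagingArea` lives in `tensorflow.python.ops.data_flow_ops`):

```python
import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

images = tf.random_uniform([32, 224, 224, 3])                  # hypothetical batch
labels = tf.random_uniform([32], maxval=1000, dtype=tf.int32)  # hypothetical labels

stage = data_flow_ops.StagingArea(dtypes=[tf.float32, tf.int32])
put_op = stage.put([images, labels])        # run by the producing stage
staged_images, staged_labels = stage.get()  # run by the consuming stage
```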
@@ -62,10 +62,10 @@ and executed in parallel. The image preprocessing ops include operations such as
image decoding, distortion, and resizing.
Once the images are through preprocessing, they are concatenated together into 8
-tensors each with a batch-size of 32. Rather than using @{tf.concat} for this
+tensors each with a batch-size of 32. Rather than using `tf.concat` for this
purpose, which is implemented as a single op that waits for all the inputs to be
-ready before concatenating them together, @{tf.parallel_stack} is used.
-@{tf.parallel_stack} allocates an uninitialized tensor as an output, and each
+ready before concatenating them together, `tf.parallel_stack` is used.
+`tf.parallel_stack` allocates an uninitialized tensor as an output, and each
input tensor is written to its designated portion of the output tensor as soon
as the input is available.
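For instance, a hedged sketch of the difference (the list of 32 preprocessed images is an illustrative assumption):

```python
import tensorflow as tf

# Hypothetical list of 32 preprocessed images, each [224, 224, 3].
images = [tf.random_uniform([224, 224, 3]) for _ in range(32)]

# tf.concat waits for every input before producing its output.
batch_concat = tf.concat([tf.expand_dims(i, 0) for i in images], axis=0)
# tf.parallel_stack writes each input into its slice of the pre-allocated
# output as soon as that input is ready.
batch_stacked = tf.parallel_stack(images)  # shape: [32, 224, 224, 3]
```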
@@ -94,7 +94,7 @@ the GPU, all the tensors are already available.
With all the stages capable of being driven by different processors,
`data_flow_ops.StagingArea` is used between them so they run in parallel.
-`StagingArea` is a queue-like operator similar to @{tf.FIFOQueue} that offers
+`StagingArea` is a queue-like operator similar to `tf.FIFOQueue` that offers
simpler functionalities that can be executed on both CPU and GPU.
Before the model starts running all the stages, the input pipeline stages are
@@ -153,7 +153,7 @@ weights obtained from training.
The default batch-normalization in TensorFlow is implemented as composite
operations. This is very general, but often leads to suboptimal performance. An
alternative is to use fused batch-normalization which often has much better
-performance on GPU. Below is an example of using @{tf.contrib.layers.batch_norm}
+performance on GPU. Below is an example of using `tf.contrib.layers.batch_norm`
to implement fused batch-normalization.
```python
@@ -301,7 +301,7 @@ In order to broadcast variables and aggregate gradients across different GPUs
within the same host machine, we can use the default TensorFlow implicit copy
mechanism.
-However, we can instead use the optional NCCL (@{tf.contrib.nccl}) support. NCCL
+However, we can instead use the optional NCCL (`tf.contrib.nccl`) support. NCCL
is an NVIDIA® library that can efficiently broadcast and aggregate data across
different GPUs. It schedules a cooperating kernel on each GPU that knows how to
best utilize the underlying hardware topology; this kernel uses a single SM of
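A minimal sketch of NCCL aggregation (the device count and per-GPU tensors are illustrative assumptions):

```python
import tensorflow as tf

# One gradient tensor per GPU, each placed on its own device.
grads = []
for i in range(2):
    with tf.device("/gpu:%d" % i):
        grads.append(tf.random_uniform([1000]))

# all_sum returns one tensor per device, each holding the element-wise sum.
summed = tf.contrib.nccl.all_sum(grads)
```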
diff --git a/tensorflow/docs_src/performance/quantization.md b/tensorflow/docs_src/performance/quantization.md
index c97f74139c..4499f5715c 100644
--- a/tensorflow/docs_src/performance/quantization.md
+++ b/tensorflow/docs_src/performance/quantization.md
@@ -163,7 +163,7 @@ bazel build tensorflow/contrib/lite/toco:toco && \
--std_value=127.5 --mean_value=127.5
```
-See the documentation for @{tf.contrib.quantize} and
+See the documentation for `tf.contrib.quantize` and
[TensorFlow Lite](/mobile/tflite/).
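As a rough sketch of the graph-rewriting step (the `quant_delay` value is an illustrative assumption):

```python
import tensorflow as tf

g = tf.get_default_graph()
# Insert fake-quantization ops into the training graph; the quantized
# weight/activation ranges start being applied after `quant_delay` steps.
tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=2000000)
```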
## Quantized accuracy
diff --git a/tensorflow/docs_src/performance/xla/operation_semantics.md b/tensorflow/docs_src/performance/xla/operation_semantics.md
index edc777a3c7..8726fdb67a 100644
--- a/tensorflow/docs_src/performance/xla/operation_semantics.md
+++ b/tensorflow/docs_src/performance/xla/operation_semantics.md
@@ -270,7 +270,7 @@ Clamp(min, operand, max) = s32[3]{0, 5, 6};
See also
[`XlaBuilder::Collapse`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
-and the @{tf.reshape} operation.
+and the `tf.reshape` operation.
Collapses dimensions of an array into one dimension.
@@ -291,7 +291,7 @@ same position in the dimension sequence as those they replace, with the new
dimension size equal to the product of original dimension sizes. The lowest
dimension number in `dimensions` is the slowest varying dimension (most major)
in the loop nest which collapses these dimension, and the highest dimension
-number is fastest varying (most minor). See the @{tf.reshape} operator
+number is fastest varying (most minor). See the `tf.reshape` operator
if more general collapse ordering is needed.
For example, let v be an array of 24 elements:
@@ -490,8 +490,8 @@ array. The holes are filled with a no-op value, which for convolution means
zeroes.
Dilation of the rhs is also called atrous convolution. For more details, see
-@{tf.nn.atrous_conv2d}. Dilation of the lhs is also called transposed
-convolution. For more details, see @{tf.nn.conv2d_transpose}.
+`tf.nn.atrous_conv2d`. Dilation of the lhs is also called transposed
+convolution. For more details, see `tf.nn.conv2d_transpose`.
The output shape has these dimensions, in this order:
@@ -1270,7 +1270,7 @@ let t: (f32[10], s32) = tuple(v, s);
let element_1: s32 = gettupleelement(t, 1); // Inferred shape matches s32.
```
-See also @{tf.tuple}.
+See also `tf.tuple`.
## Infeed
@@ -2250,7 +2250,7 @@ element types.
## Transpose
-See also the @{tf.reshape} operation.
+See also the `tf.reshape` operation.
<b>`Transpose(operand)`</b>