author     2018-08-07 14:28:32 -0700
committer  2018-08-07 14:32:57 -0700
commit     02df0f46d562a0c48b6f24803eba6330d13d7213 (patch)
tree       3a39933e12b7300ddcefb5afb90db053f295d824 /tensorflow/docs_src/performance
parent     452f995e2c23cbd67c14b15b678bb3a352212633 (diff)
Remove usage of magic-api-link syntax from docs.
Back-ticks are now converted to links in the api_docs generator. With the new docs repo we're moving to, this simplifies the docs pipeline and makes everything more readable.
By doing this we no longer get test failures for symbols that don't exist (`tf.does_not_exist` will not get a link).
There is also no way to set custom link text now. That's okay.
This is the result of the following regex replacement (plus a couple of manual edits):
re: @\{([^$].*?)(\$.+?)?}
sub: `\1`
This makes the following replacements:
"@{tf.symbol}" --> "`tf.symbol`"
"@{tf.symbol$link_text}" --> "`tf.symbol`"
PiperOrigin-RevId: 207780049
Diffstat (limited to 'tensorflow/docs_src/performance')
5 files changed, 48 insertions, 48 deletions
diff --git a/tensorflow/docs_src/performance/datasets_performance.md b/tensorflow/docs_src/performance/datasets_performance.md
index 46b43b7673..5d9e4ba392 100644
--- a/tensorflow/docs_src/performance/datasets_performance.md
+++ b/tensorflow/docs_src/performance/datasets_performance.md
@@ -38,9 +38,9 @@ the heavy lifting of training your model. In addition, viewing input
 pipelines as an ETL process provides structure that facilitates the
 application of performance optimizations.

-When using the @{tf.estimator.Estimator} API, the first two phases (Extract and
+When using the `tf.estimator.Estimator` API, the first two phases (Extract and
 Transform) are captured in the `input_fn` passed to
-@{tf.estimator.Estimator.train}. In code, this might look like the following
+`tf.estimator.Estimator.train`. In code, this might look like the following
 (naive, sequential) implementation:

 ```
@@ -99,7 +99,7 @@ With pipelining, idle time diminishes significantly:

 ![with pipelining](/images/datasets_with_pipelining.png)

 The `tf.data` API provides a software pipelining mechanism through the
-@{tf.data.Dataset.prefetch} transformation, which can be used to decouple the
+`tf.data.Dataset.prefetch` transformation, which can be used to decouple the
 time data is produced from the time it is consumed. In particular, the
 transformation uses a background thread and an internal buffer to prefetch
 elements from the input dataset ahead of the time they are requested. Thus, to
@@ -130,7 +130,7 @@ The preceding recommendation is simply the most common application.

 ### Parallelize Data Transformation

 When preparing a batch, input elements may need to be pre-processed. To this
-end, the `tf.data` API offers the @{tf.data.Dataset.map} transformation, which
+end, the `tf.data` API offers the `tf.data.Dataset.map` transformation, which
 applies a user-defined function (for example, `parse_fn` from the running
 example) to each element of the input dataset. Because input elements are
 independent of one another, the pre-processing can be parallelized across
@@ -164,7 +164,7 @@ dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_c

 Furthermore, if your batch size is in the hundreds or thousands, your pipeline
 will likely additionally benefit from parallelizing the batch creation. To this
-end, the `tf.data` API provides the @{tf.contrib.data.map_and_batch}
+end, the `tf.data` API provides the `tf.contrib.data.map_and_batch`
 transformation, which effectively "fuses" the map and batch transformations.

 To apply this change to our running example, change:
@@ -205,7 +205,7 @@ is stored locally or remotely, but can be worse in the remote case if data is
 not prefetched effectively.

 To mitigate the impact of the various data extraction overheads, the `tf.data`
-API offers the @{tf.contrib.data.parallel_interleave} transformation. Use this
+API offers the `tf.contrib.data.parallel_interleave` transformation. Use this
 transformation to parallelize the execution of and interleave the contents of
 other datasets (such as data file readers). The number of datasets to overlap
 can be specified by the `cycle_length` argument.
@@ -232,7 +232,7 @@ dataset = files.apply(tf.contrib.data.parallel_interleave(
 The throughput of remote storage systems can vary over time due to load or
 network events. To account for this variance, the `parallel_interleave`
 transformation can optionally use prefetching. (See
-@{tf.contrib.data.parallel_interleave} for details).
+`tf.contrib.data.parallel_interleave` for details).

 By default, the `parallel_interleave` transformation provides a deterministic
 ordering of elements to aid reproducibility. As an alternative to prefetching
@@ -261,7 +261,7 @@ function (that is, have it operate over a batch of inputs at once) and apply the

 ### Map and Cache

-The @{tf.data.Dataset.cache} transformation can cache a dataset, either in
+The `tf.data.Dataset.cache` transformation can cache a dataset, either in
 memory or on local storage. If the user-defined function passed into the `map`
 transformation is expensive, apply the cache transformation after the map
 transformation as long as the resulting dataset can still fit into memory or
@@ -281,9 +281,9 @@ performance (for example, to enable fusing of the map and batch transformations)

 ### Repeat and Shuffle

-The @{tf.data.Dataset.repeat} transformation repeats the input data a finite (or
+The `tf.data.Dataset.repeat` transformation repeats the input data a finite (or
 infinite) number of times; each repetition of the data is typically referred to
-as an _epoch_. The @{tf.data.Dataset.shuffle} transformation randomizes the
+as an _epoch_. The `tf.data.Dataset.shuffle` transformation randomizes the
 order of the dataset's examples.

 If the `repeat` transformation is applied before the `shuffle` transformation,
@@ -296,7 +296,7 @@ internal state of the `shuffle` transformation. In other words, the former
 (`shuffle` before `repeat`) provides stronger ordering guarantees.

 When possible, we recommend using the fused
-@{tf.contrib.data.shuffle_and_repeat} transformation, which combines the best of
+`tf.contrib.data.shuffle_and_repeat` transformation, which combines the best of
 both worlds (good performance and strong ordering guarantees). Otherwise, we
 recommend shuffling before repeating.
diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md
index dafacbe379..df70309568 100644
--- a/tensorflow/docs_src/performance/performance_guide.md
+++ b/tensorflow/docs_src/performance/performance_guide.md
@@ -94,7 +94,7 @@ sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

 #### Fused decode and crop

 If inputs are JPEG images that also require cropping, use fused
-@{tf.image.decode_and_crop_jpeg} to speed up preprocessing.
+`tf.image.decode_and_crop_jpeg` to speed up preprocessing.
 `tf.image.decode_and_crop_jpeg` only decodes the part of the image within the
 crop window. This significantly speeds up the process if the crop window is
 much smaller than the full image. For imagenet data, this
@@ -187,14 +187,14 @@ some models makes up a large percentage of the operation time. Using fused batch
 norm can result in a 12%-30% speedup.

 There are two commonly used batch norms and both support fusing. The core
-@{tf.layers.batch_normalization} added fused starting in TensorFlow 1.3.
+`tf.layers.batch_normalization` added fused starting in TensorFlow 1.3.

 ```python
 bn = tf.layers.batch_normalization(
     input_layer, fused=True, data_format='NCHW')
 ```

-The contrib @{tf.contrib.layers.batch_norm} method has had fused as an option
+The contrib `tf.contrib.layers.batch_norm` method has had fused as an option
 since before TensorFlow 1.0.

 ```python
@@ -205,43 +205,43 @@ bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')

 There are many ways to specify an RNN computation in TensorFlow and they have
 trade-offs with respect to model flexibility and performance. The
-@{tf.nn.rnn_cell.BasicLSTMCell} should be considered a reference implementation
+`tf.nn.rnn_cell.BasicLSTMCell` should be considered a reference implementation
 and used only as a last resort when no other options will work.

 When using one of the cells, rather than the fully fused RNN layers, you have a
-choice of whether to use @{tf.nn.static_rnn} or @{tf.nn.dynamic_rnn}. There
+choice of whether to use `tf.nn.static_rnn` or `tf.nn.dynamic_rnn`. There
 shouldn't generally be a performance difference at runtime, but large unroll
-amounts can increase the graph size of the @{tf.nn.static_rnn} and cause long
-compile times. An additional advantage of @{tf.nn.dynamic_rnn} is that it can
+amounts can increase the graph size of the `tf.nn.static_rnn` and cause long
+compile times. An additional advantage of `tf.nn.dynamic_rnn` is that it can
 optionally swap memory from the GPU to the CPU to enable training of very long
 sequences. Depending on the model and hardware configuration, this can come at
 a performance cost. It is also possible to run multiple iterations of
-@{tf.nn.dynamic_rnn} and the underlying @{tf.while_loop} construct in parallel,
+`tf.nn.dynamic_rnn` and the underlying `tf.while_loop` construct in parallel,
 although this is rarely useful with RNN models as they are inherently
 sequential.

-On NVIDIA GPUs, the use of @{tf.contrib.cudnn_rnn} should always be preferred
+On NVIDIA GPUs, the use of `tf.contrib.cudnn_rnn` should always be preferred
 unless you want layer normalization, which it doesn't support. It is often at
-least an order of magnitude faster than @{tf.contrib.rnn.BasicLSTMCell} and
-@{tf.contrib.rnn.LSTMBlockCell} and uses 3-4x less memory than
-@{tf.contrib.rnn.BasicLSTMCell}.
+least an order of magnitude faster than `tf.contrib.rnn.BasicLSTMCell` and
+`tf.contrib.rnn.LSTMBlockCell` and uses 3-4x less memory than
+`tf.contrib.rnn.BasicLSTMCell`.

 If you need to run one step of the RNN at a time, as might be the case in
 reinforcement learning with a recurrent policy, then you should use the
-@{tf.contrib.rnn.LSTMBlockCell} with your own environment interaction loop
-inside a @{tf.while_loop} construct. Running one step of the RNN at a time and
+`tf.contrib.rnn.LSTMBlockCell` with your own environment interaction loop
+inside a `tf.while_loop` construct. Running one step of the RNN at a time and
 returning to Python is possible, but it will be slower.

-On CPUs, mobile devices, and if @{tf.contrib.cudnn_rnn} is not available on
+On CPUs, mobile devices, and if `tf.contrib.cudnn_rnn` is not available on
 your GPU, the fastest and most memory efficient option is
-@{tf.contrib.rnn.LSTMBlockFusedCell}.
+`tf.contrib.rnn.LSTMBlockFusedCell`.

-For all of the less common cell types like @{tf.contrib.rnn.NASCell},
-@{tf.contrib.rnn.PhasedLSTMCell}, @{tf.contrib.rnn.UGRNNCell},
-@{tf.contrib.rnn.GLSTMCell}, @{tf.contrib.rnn.Conv1DLSTMCell},
-@{tf.contrib.rnn.Conv2DLSTMCell}, @{tf.contrib.rnn.LayerNormBasicLSTMCell},
+For all of the less common cell types like `tf.contrib.rnn.NASCell`,
+`tf.contrib.rnn.PhasedLSTMCell`, `tf.contrib.rnn.UGRNNCell`,
+`tf.contrib.rnn.GLSTMCell`, `tf.contrib.rnn.Conv1DLSTMCell`,
+`tf.contrib.rnn.Conv2DLSTMCell`, `tf.contrib.rnn.LayerNormBasicLSTMCell`,
 etc., one should be aware that they are implemented in the graph like
-@{tf.contrib.rnn.BasicLSTMCell} and as such will suffer from the same poor
+`tf.contrib.rnn.BasicLSTMCell` and as such will suffer from the same poor
 performance and high memory usage.
 One should consider whether or not those trade-offs are worth it before using
 these cells. For example, while layer normalization can speed up convergence,
 because cuDNN is 20x faster the fastest
diff --git a/tensorflow/docs_src/performance/performance_models.md b/tensorflow/docs_src/performance/performance_models.md
index 359b0e904d..66bf684d5b 100644
--- a/tensorflow/docs_src/performance/performance_models.md
+++ b/tensorflow/docs_src/performance/performance_models.md
@@ -10,8 +10,8 @@ incorporated into high-level APIs.

 ## Input Pipeline

 The @{$performance_guide$Performance Guide} explains how to identify possible
-input pipeline issues and best practices. We found that using @{tf.FIFOQueue}
-and @{tf.train.queue_runner} could not saturate multiple current generation GPUs
+input pipeline issues and best practices. We found that using `tf.FIFOQueue`
+and `tf.train.queue_runner` could not saturate multiple current generation GPUs
 when using large inputs and processing with higher samples per second, such as
 training ImageNet with
 [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf).
 This is due to the use of Python threads as its underlying implementation. The
@@ -29,7 +29,7 @@ implementation is made up of 3 stages:

 The dominant part of each stage is executed in parallel with the other stages
 using `data_flow_ops.StagingArea`. `StagingArea` is a queue-like operator
-similar to @{tf.FIFOQueue}. The difference is that `StagingArea` does not
+similar to `tf.FIFOQueue`. The difference is that `StagingArea` does not
 guarantee FIFO ordering, but offers simpler functionality and can be executed
 on both CPU and GPU in parallel with other stages. Breaking the input pipeline
 into 3 stages that operate independently in parallel is scalable and takes full
@@ -62,10 +62,10 @@ and executed in parallel. The image preprocessing ops include operations such
 as image decoding, distortion, and resizing.

 Once the images are through preprocessing, they are concatenated together into 8
-tensors each with a batch-size of 32. Rather than using @{tf.concat} for this
+tensors each with a batch-size of 32. Rather than using `tf.concat` for this
 purpose, which is implemented as a single op that waits for all the inputs to be
-ready before concatenating them together, @{tf.parallel_stack} is used.
-@{tf.parallel_stack} allocates an uninitialized tensor as an output, and each
+ready before concatenating them together, `tf.parallel_stack` is used.
+`tf.parallel_stack` allocates an uninitialized tensor as an output, and each
 input tensor is written to its designated portion of the output tensor as soon
 as the input is available.
@@ -94,7 +94,7 @@ the GPU, all the tensors are already available.

 With all the stages capable of being driven by different processors,
 `data_flow_ops.StagingArea` is used between them so they run in parallel.
-`StagingArea` is a queue-like operator similar to @{tf.FIFOQueue} that offers
+`StagingArea` is a queue-like operator similar to `tf.FIFOQueue` that offers
 simpler functionalities that can be executed on both CPU and GPU.

 Before the model starts running all the stages, the input pipeline stages are
@@ -153,7 +153,7 @@ weights obtained from training.

 The default batch-normalization in TensorFlow is implemented as composite
 operations. This is very general, but often leads to suboptimal performance. An
 alternative is to use fused batch-normalization which often has much better
-performance on GPU. Below is an example of using @{tf.contrib.layers.batch_norm}
+performance on GPU. Below is an example of using `tf.contrib.layers.batch_norm`
 to implement fused batch-normalization.

 ```python
@@ -301,7 +301,7 @@ In order to broadcast variables and aggregate gradients across different GPUs
 within the same host machine, we can use the default TensorFlow implicit copy
 mechanism.

-However, we can instead use the optional NCCL (@{tf.contrib.nccl}) support. NCCL
+However, we can instead use the optional NCCL (`tf.contrib.nccl`) support. NCCL
 is an NVIDIA® library that can efficiently broadcast and aggregate data across
 different GPUs. It schedules a cooperating kernel on each GPU that knows how to
 best utilize the underlying hardware topology; this kernel uses a single SM of
diff --git a/tensorflow/docs_src/performance/quantization.md b/tensorflow/docs_src/performance/quantization.md
index c97f74139c..4499f5715c 100644
--- a/tensorflow/docs_src/performance/quantization.md
+++ b/tensorflow/docs_src/performance/quantization.md
@@ -163,7 +163,7 @@ bazel build tensorflow/contrib/lite/toco:toco && \
   --std_value=127.5 --mean_value=127.5
 ```

-See the documentation for @{tf.contrib.quantize} and
+See the documentation for `tf.contrib.quantize` and
 [TensorFlow Lite](/mobile/tflite/).

 ## Quantized accuracy
diff --git a/tensorflow/docs_src/performance/xla/operation_semantics.md b/tensorflow/docs_src/performance/xla/operation_semantics.md
index edc777a3c7..8726fdb67a 100644
--- a/tensorflow/docs_src/performance/xla/operation_semantics.md
+++ b/tensorflow/docs_src/performance/xla/operation_semantics.md
@@ -270,7 +270,7 @@ Clamp(min, operand, max) = s32[3]{0, 5, 6};

 See also
 [`XlaBuilder::Collapse`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/xla_builder.h)
-and the @{tf.reshape} operation.
+and the `tf.reshape` operation.

 Collapses dimensions of an array into one dimension.
@@ -291,7 +291,7 @@ same position in the dimension sequence as those they replace, with the new
 dimension size equal to the product of original dimension sizes. The lowest
 dimension number in `dimensions` is the slowest varying dimension (most major)
 in the loop nest which collapses these dimension, and the highest dimension
-number is fastest varying (most minor). See the @{tf.reshape} operator
+number is fastest varying (most minor). See the `tf.reshape` operator
 if more general collapse ordering is needed.

 For example, let v be an array of 24 elements:
@@ -490,8 +490,8 @@ array. The holes are filled with a no-op value, which for convolution means
 zeroes.

 Dilation of the rhs is also called atrous convolution. For more details, see
-@{tf.nn.atrous_conv2d}. Dilation of the lhs is also called transposed
-convolution. For more details, see @{tf.nn.conv2d_transpose}.
+`tf.nn.atrous_conv2d`. Dilation of the lhs is also called transposed
+convolution. For more details, see `tf.nn.conv2d_transpose`.

 The output shape has these dimensions, in this order:
@@ -1270,7 +1270,7 @@ let t: (f32[10], s32) = tuple(v, s);
 let element_1: s32 = gettupleelement(t, 1); // Inferred shape matches s32.
 ```

-See also @{tf.tuple}.
+See also `tf.tuple`.

 ## Infeed
@@ -2250,7 +2250,7 @@ element types.

 ## Transpose

-See also the @{tf.reshape} operation.
+See also the `tf.reshape` operation.

 <b>`Transpose(operand)`</b>
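
As background for the transformations referenced throughout the datasets_performance.md hunks above, a minimal sketch of an input pipeline combining them might look like the following. The file pattern, feature keys, and `parse_fn` are illustrative placeholders rather than code from the diff, and the `tf.contrib.data` symbols reflect the TensorFlow 1.x API these docs describe:

```python
import tensorflow as tf

def parse_fn(serialized_example):
    # Placeholder parser: decode a single tf.Example record into an image/label pair.
    features = tf.parse_single_example(
        serialized_example,
        features={"image": tf.FixedLenFeature([], tf.string),
                  "label": tf.FixedLenFeature([], tf.int64)})
    image = tf.image.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize_images(image, [224, 224])  # uniform shape so batching works
    return image, features["label"]

files = tf.data.Dataset.list_files("/path/to/data/train-*.tfrecord")

# Extract: overlap reads from several files.
dataset = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=4))

# Shuffle and repeat as one fused transformation.
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=10000))

# Transform: fused map + batch.
dataset = dataset.apply(tf.contrib.data.map_and_batch(
    map_func=parse_fn, batch_size=32))

# Load: overlap preprocessing with training on the accelerator.
dataset = dataset.prefetch(buffer_size=1)

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
```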