diff options
Diffstat (limited to 'tensorflow/docs_src/performance/performance_guide.md')
-rw-r--r--  tensorflow/docs_src/performance/performance_guide.md | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/tensorflow/docs_src/performance/performance_guide.md b/tensorflow/docs_src/performance/performance_guide.md
index df70309568..9ea1d6a705 100644
--- a/tensorflow/docs_src/performance/performance_guide.md
+++ b/tensorflow/docs_src/performance/performance_guide.md
@@ -41,7 +41,7 @@ approaches to identifying issues:
     utilization is not approaching 80-100%, then the input pipeline may be the
     bottleneck.
 *   Generate a timeline and look for large blocks of white space (waiting). An
-    example of generating a timeline exists as part of the @{$jit$XLA JIT}
+    example of generating a timeline exists as part of the [XLA JIT](../performance/xla/jit.md)
     tutorial.
 *   Check CPU usage. It is possible to have an optimized input pipeline and lack
     the CPU cycles to process the pipeline.
@@ -68,7 +68,7 @@ the CPU.
 
 #### Using the tf.data API
 
-The @{$datasets$tf.data API} is replacing `queue_runner` as the recommended API
+The [tf.data API](../guide/datasets.md) is replacing `queue_runner` as the recommended API
 for building input pipelines. This
 [ResNet example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/cifar10_main.py)
 ([arXiv:1512.03385](https://arxiv.org/abs/1512.03385))
@@ -78,7 +78,7 @@ training CIFAR-10 illustrates the use of the `tf.data` API along with
 The `tf.data` API utilizes C++ multi-threading and has a much lower overhead
 than the Python-based `queue_runner` that is limited by Python's multi-threading
 performance. A detailed performance guide for the `tf.data` API can be found
-@{$datasets_performance$here}.
+[here](../performance/datasets_performance.md).
 
 While feeding data using a `feed_dict` offers a high level of flexibility, in
 general `feed_dict` does not provide a scalable solution. If only a single GPU
@@ -174,7 +174,7 @@ faster using `NHWC` than the normally most efficient `NCHW`.
 ### Common fused Ops
 
 Fused Ops combine multiple operations into a single kernel for improved
-performance. There are many fused Ops within TensorFlow and @{$xla$XLA} will
+performance. There are many fused Ops within TensorFlow and [XLA](../performance/xla/index.md) will
 create fused Ops when possible to automatically improve performance. Collected
 below are select fused Ops that can greatly improve performance and may be
 overlooked.
@@ -257,7 +257,7 @@ the CPU in use. Speedups for training and inference on CPU are documented below
 in [Comparing compiler optimizations](#comparing-compiler-optimizations).
 
 To install the most optimized version of TensorFlow,
-@{$install_sources$build and install} from source. If there is a need to build
+[build and install](../install/install_sources.md) from source. If there is a need to build
 TensorFlow on a platform that has different hardware than the target, then
 cross-compile with the highest optimizations for the target platform. The
 following command is an example of using `bazel` to compile for a specific
@@ -298,7 +298,7 @@ each of the towers. How each tower gets the updated variables and how the
 gradients are applied has an impact on the performance, scaling, and convergence
 of the model. The rest of this section provides an overview of variable
 placement and the towering of a model on multiple GPUs.
-@{$performance_models$High-Performance Models} gets into more details regarding
+[High-Performance Models](../performance/performance_models.md) gets into more details regarding
 more complex methods that can be used to share and update variables between
 towers.
@@ -307,7 +307,7 @@ and even how the hardware has been configured. An example of this, is that two
 systems can be built with NVIDIA Tesla P100s but one may be using PCIe and the
 other [NVLink](http://www.nvidia.com/object/nvlink.html). In that scenario, the
 optimal solution for each system may be different. For real world examples, read
-the @{$performance/benchmarks$benchmark} page which details the settings that
+the [benchmark](../performance/benchmarks.md) page which details the settings that
 were optimal for a variety of platforms. Below is a summary of what was learned
 from benchmarking various platforms and configurations:
@@ -433,7 +433,7 @@ scenarios.
 
 ## Optimizing for CPU
 
 CPUs, which includes Intel® Xeon Phi™, achieve optimal performance when
-TensorFlow is @{$install_sources$built from source} with all of the instructions
+TensorFlow is [built from source](../install/install_sources.md) with all of the instructions
 supported by the target CPU. Beyond using the latest instruction sets, Intel®
 has added support for the