# Performance Guide

This guide contains a collection of best practices for optimizing your TensorFlow code. The best practices apply to both new and experienced TensorFlow users. As a complement to the best practices in this document, the @{$performance_models$High-Performance Models} document links to example code and details for creating models that scale on a variety of hardware.

## Best Practices

Although the optimal implementation differs between types of models, the topics below cover best practices for getting the most performance from TensorFlow. These suggestions currently focus on image-based models, and we will regularly add tips for all kinds of models. The following list highlights key best practices:

*   Build and install from source
*   Utilize queues for reading data
*   Preprocessing on the CPU
*   Use `NCHW` image data format
*   Place shared parameters on the GPU
*   Use fused batch norm

The following sections detail the preceding suggestions.

### Build and install from source

To install the most optimized version of TensorFlow, build and install TensorFlow from source by following [Installing TensorFlow from Source](../install/install_sources). Building from source with compiler optimizations for the target hardware, and ensuring that the latest CUDA platform and cuDNN libraries are installed, results in the highest performing installs.

For the most stable experience, build from the [latest release](https://github.com/tensorflow/tensorflow/releases) branch. To get the latest performance changes and accept some stability risk, build from [master](https://github.com/tensorflow/tensorflow).

If you need to build TensorFlow on a platform that has different hardware than the target, cross-compile with the highest optimizations for the target platform.
The following command is an example of telling `bazel` to compile for a specific platform:

```bash
# This command optimizes for Intel’s Broadwell processor
bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
```

#### Environment, build, and install tips

*   Compile with the highest level of compute the [GPU supports](http://developer.nvidia.com/cuda-gpus), e.g. P100: 6.0, Titan X (Pascal): 6.1, Titan X (Maxwell): 5.2, and K80: 3.7.
*   Install the latest CUDA platform and cuDNN libraries.
*   Make sure to use a version of gcc that supports all of the optimizations of the target CPU. The recommended minimum gcc version is 4.8.3. On OS X, upgrade to the latest Xcode version and use the version of clang that comes with Xcode.
*   TensorFlow checks on startup whether it has been compiled with the optimizations available on the CPU. If the optimizations are not included, TensorFlow emits warnings, e.g. AVX, AVX2, and FMA instructions not included.

### Utilize queues for reading data

One common cause of poor performance is underutilizing GPUs, or essentially "starving" them of data by not setting up an efficient pipeline. Make sure to set up an input pipeline that utilizes queues and streams data effectively. Review the @{$reading_data#reading_from_files$Reading Data guide} for implementation details.

One way to identify a "starved" GPU is to generate and review timelines. A detailed tutorial for timelines does not exist, but a quick example of generating a timeline exists as part of the @{$jit$XLA JIT} tutorial. Another simple way to check whether a GPU is underutilized is to run `watch nvidia-smi`; if GPU utilization is not approaching 100%, the GPU is not getting data fast enough.

Except in special circumstances or in example code, do not feed data into the session from Python variables, e.g. `dictionary`.

```python
# Using feed_dict often results in suboptimal performance when using large inputs.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```

### Preprocessing on the CPU

Placing preprocessing operations on the CPU can significantly improve performance. When preprocessing occurs on the GPU, the flow of data is CPU -> GPU (preprocessing) -> CPU -> GPU (training), so the data is bounced back and forth between the CPU and GPU. When preprocessing is placed on the CPU, the data flow is CPU (preprocessing) -> GPU (training). Another benefit is that preprocessing on the CPU frees GPU time to focus on training.

Placing preprocessing on the CPU can result in a 6X+ increase in samples/sec processed, which could lead to training in 1/6th of the time. To ensure preprocessing is on the CPU, wrap the preprocessing operations as shown below:

```python
with tf.device('/cpu:0'):
  # function to get and process images or data.
  distorted_inputs = load_and_distort_images()
```

### Use large files

Under some circumstances, both the CPU and GPU can be starved for data by the I/O system. If you are using many small files to form your input data set, you may be limited by the speed of your filesystem. If your training loop runs faster when using SSDs vs HDDs for storing your input data, you could be I/O bottlenecked.

If this is the case, you should pre-process your input data, creating a few large TFRecord files.

### Use NCHW image data format

Image data format refers to the representation of batches of images. TensorFlow supports `NHWC` (TensorFlow default) and `NCHW` (cuDNN default). N refers to the number of images in a batch, H refers to the number of pixels in the vertical dimension, W refers to the number of pixels in the horizontal dimension, and C refers to the channels (e.g. 1 for black and white, 3 for RGB, etc.). Although cuDNN can operate on both formats, it is faster to operate in its default format.

The best practice is to build models that work with both `NCHW` and `NHWC`, as it is common to train using `NCHW` on GPU and then do inference with `NHWC` on CPU.
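
To make the difference between the two layouts concrete, the following minimal sketch uses plain Python with no TensorFlow calls. It computes where the same pixel value lands in a densely packed, row-major buffer under each format, and lists the axis permutations that convert one shape into the other. The batch dimensions, the helper names `flat_offset` and `permute_shape`, and the chosen index are all hypothetical and picked only for illustration.

```python
def flat_offset(index, shape):
    """Return the row-major offset of a multi-dimensional index."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


def permute_shape(shape, perm):
    """Reorder the axes of a shape tuple according to a permutation."""
    return tuple(shape[axis] for axis in perm)


# Hypothetical batch: 2 images of 4x4 pixels with 3 channels.
N, H, W, C = 2, 4, 4, 3

# One element: image 1, row 2, column 3, channel 0.
n, h, w, c = 1, 2, 3, 0

# The same logical element sits at different offsets in each layout.
nhwc_offset = flat_offset((n, h, w, c), (N, H, W, C))
nchw_offset = flat_offset((n, c, h, w), (N, C, H, W))

# Axis permutations that map one layout onto the other.
NHWC_TO_NCHW = (0, 3, 1, 2)
NCHW_TO_NHWC = (0, 2, 3, 1)

nchw_shape = permute_shape((N, H, W, C), NHWC_TO_NCHW)
nhwc_shape = permute_shape(nchw_shape, NCHW_TO_NHWC)

print(nhwc_offset, nchw_offset, nchw_shape, nhwc_shape)
```

In `NHWC` the channel values of one pixel are adjacent in memory, while in `NCHW` each channel forms a contiguous H x W plane. A model that keeps its data format configurable can apply permutations like these at its input boundary rather than hard-coding one layout.
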
There are edge cases where `NCHW` can be slower on GPU than `NHWC`. One [case](https://github.com/tensorflow/tensorflow/issues/7551#issuecomment-280421351) is using non-fused batch norm on WRN-16-4 without dropout. In that case, using fused batch norm, which is also recommended, is the optimal solution.

The very brief history of these two formats is that TensorFlow started by using `NHWC` because it was a little faster on CPUs. Then the TensorFlow team discovered that `NCHW` performs better when using the NVIDIA cuDNN library. The current recommendation is that users support both formats in their models. In the long term, we plan to rewrite graphs to make switching between the formats transparent.

### Use fused batch norm

When using batch norm @{tf.contrib.layers.batch_norm}, set the attribute `fused=True`:

```python
# input_layer, scope, and kwargs are defined by the surrounding model code.
bn = tf.contrib.layers.batch_norm(
    input_layer,
    fused=True,
    data_format='NCHW',
    scope=scope,
    **kwargs)
```

The non-fused batch norm does computations using several individual Ops. Fused batch norm combines the individual operations into a single kernel, which runs faster.
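
As a rough illustration of why combining the individual operations helps, the sketch below implements batch norm for a single channel in plain Python, first as a chain of separate element-wise steps and then as one multiply-add per element. This is not TensorFlow's kernel. The function names, the `eps` default, and the sample values are assumptions made for the example, and it covers only the training-mode arithmetic using batch statistics.

```python
import math


def batch_norm_unfused(xs, gamma, beta, eps=1e-3):
    """Batch norm as a chain of separate element-wise steps."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    inv_std = 1.0 / math.sqrt(var + eps)
    centered = [x - mean for x in xs]             # step 1: subtract mean
    normalized = [v * inv_std for v in centered]  # step 2: divide by std
    scaled = [v * gamma for v in normalized]      # step 3: apply gamma
    return [v + beta for v in scaled]             # step 4: apply beta


def batch_norm_fused(xs, gamma, beta, eps=1e-3):
    """Same result with one multiply-add per element."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    scale = gamma / math.sqrt(var + eps)  # folds normalization and gamma
    shift = beta - mean * scale           # folds mean subtraction and beta
    return [x * scale + shift for x in xs]


# Hypothetical activations for one channel of a small batch.
activations = [0.5, 1.5, -2.0, 4.0]
unfused = batch_norm_unfused(activations, gamma=2.0, beta=0.1)
fused = batch_norm_fused(activations, gamma=2.0, beta=0.1)

# Largest element-wise difference between the two versions.
max_diff = max(abs(a - b) for a, b in zip(unfused, fused))
print(max_diff)
```

The unfused version materializes three intermediate lists and walks the data four times after computing the statistics, whereas the fused version walks it once. On a GPU, the analogous saving comes from launching one kernel and avoiding intermediate tensors instead of running several individual Ops.
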