# Performance Guide

This guide contains a collection of best practices for optimizing TensorFlow
code. The guide is divided into a few sections:

* [General best practices](#general_best_practices) covers topics that are
  common across a variety of model types and hardware.
* [Optimizing for GPU](#optimizing_for_gpu) details tips specifically relevant
  to GPUs.
* [Optimizing for CPU](#optimizing_for_cpu) details CPU specific information.

## General best practices

The sections below cover best practices that are relevant to a variety of
hardware and models. The best practices section is broken down into the
following sections:

* [Input pipeline optimizations](#input-pipeline-optimization)
* [Data formats](#data-formats)
* [Common fused Ops](#common-fused-ops)
* [RNN Performance](#rnn-performance)
* [Building and installing from source](#building-and-installing-from-source)

### Input pipeline optimization

Typical models retrieve data from disk and preprocess it before sending the
data through the network. For example, models that process JPEG images follow
this flow: load the image from disk, decode the JPEG into a tensor, crop and
pad, possibly flip and distort, and then batch. This flow is referred to as the
input pipeline. As GPUs and other hardware accelerators get faster,
preprocessing of data can become a bottleneck.

Determining whether the input pipeline is the bottleneck can be complicated.
One of the most straightforward methods is to reduce the model to a single
operation (trivial model) after the input pipeline and measure the examples per
second. If the difference in examples per second between the full model and the
trivial model is minimal, then the input pipeline is likely the bottleneck.
Below are some other approaches to identifying issues:

* Check if a GPU is underutilized by running `nvidia-smi -l 2`. If GPU
  utilization is not approaching 80-100%, then the input pipeline may be the
  bottleneck.
* Generate a timeline and look for large blocks of white space (waiting). An
  example of generating a timeline exists as part of the
  [XLA JIT](../performance/xla/jit.md) tutorial.
* Check CPU usage. It is possible to have an optimized input pipeline but lack
  the CPU cycles to process the pipeline.
* Estimate the throughput needed and verify the disk used is capable of that
  level of throughput. Some cloud solutions have network attached disks that
  start as low as 50 MB/sec, which is slower than spinning disks (150 MB/sec),
  SATA SSDs (500 MB/sec), and PCIe SSDs (2,000+ MB/sec).

#### Preprocessing on the CPU

Placing input pipeline operations on the CPU can significantly improve
performance. Utilizing the CPU for the input pipeline frees the GPU to focus on
training. To ensure preprocessing is on the CPU, wrap the preprocessing
operations as shown below:

```python
with tf.device('/cpu:0'):
  # function to get and process images or data.
  distorted_inputs = load_and_distort_images()
```

If using `tf.estimator.Estimator`, the input function is automatically placed
on the CPU.
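To make this concrete, below is a minimal, hedged sketch of an input function
that keeps preprocessing on the CPU; the file name, parsing logic, and image
sizes are illustrative placeholders, and it uses the `tf.data` API covered in
the next section:

```python
import tensorflow as tf

def _parse_and_distort(serialized_example):
  # Hypothetical parse/augment step; replace with the real preprocessing.
  features = tf.parse_single_example(
      serialized_example,
      {'image/encoded': tf.FixedLenFeature([], tf.string)})
  image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
  return tf.image.resize_images(image, [224, 224])

def input_fn():
  # Wrapping the pipeline in tf.device keeps preprocessing on the CPU.
  with tf.device('/cpu:0'):
    dataset = tf.data.TFRecordDataset(['train.tfrecords'])  # illustrative file
    dataset = dataset.map(_parse_and_distort, num_parallel_calls=4)
    dataset = dataset.batch(32)
    return dataset.make_one_shot_iterator().get_next()
```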
#### Using the tf.data API

The [tf.data API](../guide/datasets.md) is replacing `queue_runner` as the
recommended API for building input pipelines. This
[ResNet example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/cifar10_main.py)
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) training CIFAR-10
illustrates the use of the `tf.data` API along with `tf.estimator.Estimator`.

The `tf.data` API utilizes C++ multi-threading and has a much lower overhead
than the Python-based `queue_runner`, which is limited by Python's
multi-threading performance. A detailed performance guide for the `tf.data` API
can be found [here](../performance/datasets_performance.md).

While feeding data using a `feed_dict` offers a high level of flexibility, in
general `feed_dict` does not provide a scalable solution. If only a single GPU
is used, the difference between the `tf.data` API and `feed_dict` performance
may be negligible. Our recommendation is to avoid using `feed_dict` for all but
trivial examples. In particular, avoid using `feed_dict` with large inputs:

```python
# feed_dict often results in suboptimal performance when using large inputs.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```

#### Fused decode and crop

If inputs are JPEG images that also require cropping, use the fused
`tf.image.decode_and_crop_jpeg` op to speed up preprocessing.
`tf.image.decode_and_crop_jpeg` only decodes the part of the image within the
crop window. This significantly speeds up the process if the crop window is
much smaller than the full image. For ImageNet data, this approach could speed
up the input pipeline by up to 30%.

Example usage:

```python
def _image_preprocess_fn(image_buffer):
  # image_buffer: 1-D string Tensor representing the raw JPEG image buffer.

  # Extract image shape from raw JPEG image buffer.
  image_shape = tf.image.extract_jpeg_shape(image_buffer)

  # Get a crop window with distorted bounding box.
  sample_distorted_bounding_box = tf.image.sample_distorted_bounding_box(
      image_shape, ...)
  bbox_begin, bbox_size, distort_bbox = sample_distorted_bounding_box

  # Decode and crop image.
  offset_y, offset_x, _ = tf.unstack(bbox_begin)
  target_height, target_width, _ = tf.unstack(bbox_size)
  crop_window = tf.stack([offset_y, offset_x, target_height, target_width])
  cropped_image = tf.image.decode_and_crop_jpeg(image_buffer, crop_window)
```

`tf.image.decode_and_crop_jpeg` is available on all platforms. There is no
speedup on Windows, which uses `libjpeg`, while other platforms use the faster
`libjpeg-turbo`.

#### Use large files

Reading large numbers of small files significantly impacts I/O performance.
One approach to get maximum I/O throughput is to preprocess input data into
larger (~100MB) `TFRecord` files. For smaller data sets (200MB-1GB), the best
approach is often to load the entire data set into memory. The document
[Downloading and converting to TFRecord format](https://github.com/tensorflow/models/tree/master/research/slim#downloading-and-converting-to-tfrecord-format)
includes information and scripts for creating `TFRecords`, and this
[script](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py)
converts the CIFAR-10 data set into `TFRecords`.
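As a minimal, hedged sketch of that conversion step (the feature keys and the
`examples` iterable are illustrative, not part of the scripts linked above):

```python
import tensorflow as tf

def write_tfrecord(examples, path):
  """Packs many small (image_bytes, label) examples into one TFRecord file."""
  with tf.python_io.TFRecordWriter(path) as writer:
    for image_bytes, label in examples:
      example = tf.train.Example(features=tf.train.Features(feature={
          'image/encoded': tf.train.Feature(
              bytes_list=tf.train.BytesList(value=[image_bytes])),
          'label': tf.train.Feature(
              int64_list=tf.train.Int64List(value=[label])),
      }))
      writer.write(example.SerializeToString())
```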
### Data formats

Data format refers to the structure of the Tensor passed to a given Op. The
discussion below is specifically about 4D Tensors representing images. In
TensorFlow the parts of the 4D tensor are often referred to by the following
letters:

* N refers to the number of images in a batch.
* H refers to the number of pixels in the vertical (height) dimension.
* W refers to the number of pixels in the horizontal (width) dimension.
* C refers to the channels. For example, 1 for black and white or grayscale
  and 3 for RGB.

Within TensorFlow, there are two naming conventions representing the two most
common data formats:

* `NCHW` or `channels_first`
* `NHWC` or `channels_last`

`NHWC` is the TensorFlow default and `NCHW` is the optimal format to use when
training on NVIDIA GPUs using [cuDNN](https://developer.nvidia.com/cudnn).

The best practice is to build models that work with both data formats. This
simplifies training on GPUs and then running inference on CPUs. If TensorFlow
is compiled with the [Intel MKL](#tensorflow-with-intel-mkl-dnn) optimizations,
many operations, especially those related to CNN based models, will be
optimized and support `NCHW`. If not using the MKL, some operations are not
supported on CPU when using `NCHW`.

The brief history of these two formats is that TensorFlow started by using
`NHWC` because it was a little faster on CPUs. In the long term, we are working
on tools to automatically rewrite graphs to make switching between the formats
transparent, and to take advantage of micro-optimizations where a GPU Op may be
faster using `NHWC` than the normally most efficient `NCHW`.
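Until such tools exist, one common pattern for supporting both formats is to
transpose once at the model boundary. The helper below is an illustrative
sketch, not a TensorFlow API:

```python
import tensorflow as tf

def to_data_format(images, data_format):
  """Transposes NHWC `images` to NCHW when `data_format` is 'channels_first'."""
  if data_format == 'channels_first':
    # NHWC -> NCHW: move channels in front of the spatial dimensions.
    return tf.transpose(images, [0, 3, 1, 2])
  return images
```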
### Common fused Ops

Fused Ops combine multiple operations into a single kernel for improved
performance. There are many fused Ops within TensorFlow and
[XLA](../performance/xla/index.md) will create fused Ops when possible to
automatically improve performance. Collected below are select fused Ops that
can greatly improve performance and may be overlooked.

#### Fused batch norm

Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process that for
some models makes up a large percentage of the operation time. Using fused
batch norm can result in a 12%-30% speedup.

There are two commonly used batch norms and both support fusing. The core
`tf.layers.batch_normalization` added the `fused` option starting in
TensorFlow 1.3.

```python
bn = tf.layers.batch_normalization(
    input_layer, fused=True, data_format='NCHW')
```

The contrib `tf.contrib.layers.batch_norm` method has had `fused` as an option
since before TensorFlow 1.0.

```python
bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW')
```

### RNN Performance

There are many ways to specify an RNN computation in TensorFlow, and they have
trade-offs with respect to model flexibility and performance. The
`tf.nn.rnn_cell.BasicLSTMCell` should be considered a reference implementation
and used only as a last resort when no other options will work.

When using one of the cells, rather than the fully fused RNN layers, you have a
choice of whether to use `tf.nn.static_rnn` or `tf.nn.dynamic_rnn`. There
shouldn't generally be a performance difference at runtime, but large unroll
amounts can increase the graph size of `tf.nn.static_rnn` and cause long
compile times. An additional advantage of `tf.nn.dynamic_rnn` is that it can
optionally swap memory from the GPU to the CPU to enable training of very long
sequences. Depending on the model and hardware configuration, this can come at
a performance cost. It is also possible to run multiple iterations of
`tf.nn.dynamic_rnn` and the underlying `tf.while_loop` construct in parallel,
although this is rarely useful with RNN models as they are inherently
sequential.

On NVIDIA GPUs, the use of `tf.contrib.cudnn_rnn` should always be preferred
unless you want layer normalization, which it doesn't support. It is often at
least an order of magnitude faster than `tf.contrib.rnn.BasicLSTMCell` and
`tf.contrib.rnn.LSTMBlockCell` and uses 3-4x less memory than
`tf.contrib.rnn.BasicLSTMCell`.

If you need to run one step of the RNN at a time, as might be the case in
reinforcement learning with a recurrent policy, then you should use
`tf.contrib.rnn.LSTMBlockCell` with your own environment interaction loop
inside a `tf.while_loop` construct. Running one step of the RNN at a time and
returning to Python is possible, but it will be slower.

On CPUs and mobile devices, and if `tf.contrib.cudnn_rnn` is not available on
your GPU, the fastest and most memory efficient option is
`tf.contrib.rnn.LSTMBlockFusedCell`; a minimal sketch appears at the end of
this section.

For all of the less common cell types like `tf.contrib.rnn.NASCell`,
`tf.contrib.rnn.PhasedLSTMCell`, `tf.contrib.rnn.UGRNNCell`,
`tf.contrib.rnn.GLSTMCell`, `tf.contrib.rnn.Conv1DLSTMCell`,
`tf.contrib.rnn.Conv2DLSTMCell`, and `tf.contrib.rnn.LayerNormBasicLSTMCell`,
be aware that they are implemented in the graph like
`tf.contrib.rnn.BasicLSTMCell` and as such will suffer from the same poor
performance and high memory usage. Consider whether those trade-offs are worth
it before using these cells. For example, while layer normalization can speed
up convergence, because cuDNN is 20x faster, the fastest wall clock time to
convergence is usually obtained without it.
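To make the `tf.contrib.rnn.LSTMBlockFusedCell` recommendation concrete, here
is a minimal, hedged sketch; the shapes and unit counts are illustrative, and a
TensorFlow 1.x build with contrib is assumed:

```python
import tensorflow as tf

# Fused LSTM layers expect time-major input: [time, batch, features].
inputs = tf.placeholder(tf.float32, [100, 32, 128])
lstm = tf.contrib.rnn.LSTMBlockFusedCell(num_units=256)

# A single call runs the whole sequence through one fused kernel, rather than
# building one graph op per time step.
outputs, final_state = lstm(inputs, dtype=tf.float32)
```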
### Building and installing from source

The default TensorFlow binaries target the broadest range of hardware to make
TensorFlow accessible to everyone. If using CPUs for training or inference, it
is recommended to compile TensorFlow with all of the optimizations available
for the CPU in use. Speedups for training and inference on CPU are documented
below in [Comparing compiler optimizations](#comparing-compiler-optimizations).

To install the most optimized version of TensorFlow,
[build and install](../install/install_sources.md) from source. If there is a
need to build TensorFlow on a platform that has different hardware than the
target, then cross-compile with the highest optimizations for the target
platform. The following command is an example of using `bazel` to compile for a
specific platform:

```bash
# This command optimizes for Intel's Broadwell processor
bazel build -c opt --copt=-march="broadwell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
```

#### Environment, build, and install tips

* `./configure` asks which compute capability to include in the build. This
  does not impact overall performance but does impact initial startup. After
  running TensorFlow once, the compiled kernels are cached by CUDA. If using
  a docker container, the data is not cached and the penalty is paid each time
  TensorFlow starts. The best practice is to include the
  [compute capabilities](http://developer.nvidia.com/cuda-gpus)
  of the GPUs that will be used, e.g. P100: 6.0, Titan X (Pascal): 6.1, Titan
  X (Maxwell): 5.2, and K80: 3.7.
* Use a version of gcc that supports all of the optimizations of the target
  CPU. The recommended minimum gcc version is 4.8.3. On OS X, upgrade to the
  latest Xcode version and use the version of clang that comes with Xcode.
* Install the latest stable CUDA platform and cuDNN libraries supported by
  TensorFlow.
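After installing a custom build, a quick sanity check (sketched below with
TensorFlow 1.x APIs) can confirm the binary is the one you intended:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.VERSION)                    # installed version string
print(tf.test.is_built_with_cuda())  # True when built with --config=cuda
print([d.name for d in device_lib.list_local_devices()])  # visible devices
```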
## Optimizing for GPU

This section contains GPU-specific tips that are not covered in
[General best practices](#general-best-practices). Obtaining optimal
performance on multiple GPUs is a challenge. A common approach is to use data
parallelism. Scaling through the use of data parallelism involves making
multiple copies of the model, referred to as "towers", and then placing one
tower on each of the GPUs. Each tower operates on a different mini-batch of
data and then updates variables, also known as parameters, that need to be
shared between each of the towers. How each tower gets the updated variables
and how the gradients are applied has an impact on the performance, scaling,
and convergence of the model. The rest of this section provides an overview of
variable placement and the towering of a model on multiple GPUs.
[High-Performance Models](../performance/performance_models.md) gets into more
detail regarding more complex methods that can be used to share and update
variables between towers.

The best approach to handling variable updates depends on the model, the
hardware, and even how the hardware has been configured. For example, two
systems can be built with NVIDIA Tesla P100s, but one may be using PCIe and the
other [NVLink](http://www.nvidia.com/object/nvlink.html). In that scenario, the
optimal solution for each system may be different. For real world examples,
read the [benchmark](../performance/benchmarks.md) page, which details the
settings that were optimal for a variety of platforms. Below is a summary of
what was learned from benchmarking various platforms and configurations:

* **Tesla K80**: If the GPUs are on the same PCI Express root complex and are
  able to use [NVIDIA GPUDirect](https://developer.nvidia.com/gpudirect) Peer
  to Peer, then placing the variables equally across the GPUs used for
  training is the best approach. If the GPUs cannot use GPUDirect, then
  placing the variables on the CPU is the best option.

* **Titan X (Maxwell and Pascal), M40, P100, and similar**: For models like
  ResNet and InceptionV3, placing variables on the CPU is the optimal setting,
  but for models with a lot of variables like AlexNet and VGG, using GPUs with
  `NCCL` is better.

A common approach to managing where variables are placed is to create a method
to determine where each Op is to be placed and use that method in place of a
specific device name when calling `with tf.device():`. Consider a scenario
where a model is being trained on 2 GPUs and the variables are to be placed on
the CPU. There would be a loop for creating and placing the "towers" on each of
the 2 GPUs. A custom device placement method would be created that watches for
Ops of type `Variable`, `VariableV2`, and `VarHandleOp` and indicates that they
are to be placed on the CPU. All other Ops would be placed on the target GPU.
The building of the graph would proceed as follows:

* On the first loop a "tower" of the model would be created for `gpu:0`.
  During the placement of the Ops, the custom device placement method would
  indicate that variables are to be placed on `cpu:0` and all other Ops on
  `gpu:0`.

* On the second loop, `reuse` is set to `True` to indicate that variables are
  to be reused and then the "tower" is created on `gpu:1`. During the
  placement of the Ops associated with the "tower", the variables that were
  placed on `cpu:0` are reused and all other Ops are created and placed on
  `gpu:1`.

The final result is that all of the variables are placed on the CPU, with each
GPU having a copy of all of the computational Ops associated with the model.

The code snippet below illustrates two different approaches for variable
placement: one is placing variables on the CPU; the other is placing variables
equally across the GPUs.

```python
class GpuParamServerDeviceSetter(object):
  """Used with tf.device() to place variables on the least loaded GPU.

  A common use for this class is to pass a list of GPU devices, e.g. ['gpu:0',
  'gpu:1','gpu:2'], as ps_devices. When each variable is placed, it will be
  placed on the least loaded gpu. All other Ops, which will be the computation
  Ops, will be placed on the worker_device.
  """

  def __init__(self, worker_device, ps_devices):
    """Initializer for GpuParamServerDeviceSetter.

    Args:
      worker_device: the device to use for computation Ops.
      ps_devices: a list of devices to use for Variable Ops. Each variable is
        assigned to the least loaded device.
    """
    self.ps_devices = ps_devices
    self.worker_device = worker_device
    self.ps_sizes = [0] * len(self.ps_devices)

  def __call__(self, op):
    if op.device:
      return op.device
    if op.type not in ['Variable', 'VariableV2', 'VarHandleOp']:
      return self.worker_device

    # Gets the least loaded ps_device
    device_index, _ = min(enumerate(self.ps_sizes), key=operator.itemgetter(1))
    device_name = self.ps_devices[device_index]
    var_size = op.outputs[0].get_shape().num_elements()
    self.ps_sizes[device_index] += var_size

    return device_name

def _create_device_setter(is_cpu_ps, worker, num_gpus):
  """Create device setter object."""
  if is_cpu_ps:
    # tf.train.replica_device_setter supports placing variables on the CPU,
    # all on one GPU, or on ps_servers defined in a cluster_spec.
    return tf.train.replica_device_setter(
        worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
  else:
    gpus = ['/gpu:%d' % i for i in range(num_gpus)]
    return GpuParamServerDeviceSetter(worker, gpus)

# The method below is a modified snippet from the full example.
def _resnet_model_fn():
  # When set to False, variables are placed on the least loaded GPU. If set
  # to True, the variables will be placed on the CPU.
  is_cpu_ps = False

  # Loops over the number of GPUs and creates a copy ("tower") of the model on
  # each GPU.
  for i in range(num_gpus):
    worker = '/gpu:%d' % i
    # Creates a device setter used to determine where Ops are to be placed.
    device_setter = _create_device_setter(is_cpu_ps, worker, FLAGS.num_gpus)
    # Creates variables on the first loop. On subsequent loops reuse is set
    # to True, which results in the "towers" sharing variables.
    with tf.variable_scope('resnet', reuse=bool(i != 0)):
      with tf.name_scope('tower_%d' % i) as name_scope:
        # tf.device calls the device_setter for each Op that is created.
        # device_setter returns the device the Op is to be placed on.
        with tf.device(device_setter):
          # Creates the "tower".
          _tower_fn(is_training, weight_decay, tower_features[i],
                    tower_labels[i], tower_losses, tower_gradvars,
                    tower_preds, False)
```
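For orientation, here is a small, hedged usage sketch of the custom setter
outside of the full example; the variable name and shapes are illustrative:

```python
import operator  # needed by GpuParamServerDeviceSetter above

import tensorflow as tf

# Assumes GpuParamServerDeviceSetter from the snippet above is in scope.
setter = GpuParamServerDeviceSetter(
    worker_device='/gpu:0', ps_devices=['/gpu:0', '/gpu:1'])

with tf.device(setter):
  # Variable Ops are routed to the least loaded ps_device; the matmul, a
  # computation Op, stays on the worker device.
  w = tf.get_variable('w', shape=[1024, 1024])
  y = tf.matmul(w, w)
```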
In the near future the code in this section will be for illustration purposes
only, as there will be easy-to-use, high-level methods supporting a wide range
of popular approaches. This
[example](https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
will continue to be updated as the API expands and evolves to address multi-GPU
scenarios.

## Optimizing for CPU

CPUs, which include Intel® Xeon Phi™, achieve optimal performance when
TensorFlow is [built from source](../install/install_sources.md) with all of
the instructions supported by the target CPU.

Beyond using the latest instruction sets, Intel® has added support for the
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to
TensorFlow. While the name is not completely accurate, these optimizations are
often simply referred to as 'MKL' or 'TensorFlow with MKL'. [TensorFlow
with Intel® MKL-DNN](#tensorflow-with-intel-mkl-dnn) contains details on the
MKL optimizations.

The two configurations listed below are used to optimize CPU performance by
adjusting the thread pools:

* `intra_op_parallelism_threads`: Nodes that can use multiple threads to
  parallelize their execution will schedule the individual pieces into this
  pool.
* `inter_op_parallelism_threads`: All ready nodes are scheduled in this pool.

These configurations are set via `tf.ConfigProto` and passed to `tf.Session`
in the `config` attribute as shown in the snippet below. If either option is
unset or set to 0, it defaults to the number of logical CPU cores. Testing has
shown that the default is effective for systems ranging from one CPU with 4
cores to multiple CPUs with 70+ combined logical cores. A common alternative
optimization is to set the number of threads in both pools equal to the number
of physical cores rather than logical cores; a sketch of that alternative
follows the snippet.

```python
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44
config.inter_op_parallelism_threads = 44
tf.Session(config=config)
```
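A hedged sketch of that physical-core alternative, assuming the third-party
`psutil` package is available to count physical cores:

```python
import psutil
import tensorflow as tf

# Size both thread pools by physical cores instead of the logical-core default.
physical_cores = psutil.cpu_count(logical=False)
config = tf.ConfigProto(
    intra_op_parallelism_threads=physical_cores,
    inter_op_parallelism_threads=physical_cores)
sess = tf.Session(config=config)
```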
The [Comparing compiler optimizations](#comparing-compiler-optimizations)
section contains the results of tests that used different compiler
optimizations.

### TensorFlow with Intel® MKL DNN

Intel® has added optimizations to TensorFlow for Intel® Xeon® and Intel® Xeon
Phi™ through the use of the Intel® Math Kernel Library for Deep Neural Networks
(Intel® MKL-DNN) optimized primitives. The optimizations also provide speedups
for the consumer line of processors, e.g. i5 and i7 Intel processors. The Intel
published paper
[TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
contains additional details on the implementation.

> Note: MKL was added as of TensorFlow 1.2 and currently only works on Linux.
> It also does not work when also using `--config=cuda`.

In addition to providing significant performance improvements for training CNN
based models, compiling with the MKL creates a binary that is optimized for AVX
and AVX2. The result is a single binary that is optimized and compatible with
most modern (post-2011) processors.

TensorFlow can be compiled with the MKL optimizations using the following
commands, depending on the version of the TensorFlow source used.

For TensorFlow source versions after 1.3.0:

```bash
./configure
# Pick the desired options
bazel build --config=mkl --config=opt //tensorflow/tools/pip_package:build_pip_package
```

For TensorFlow versions 1.2.0 through 1.3.0:

```bash
./configure
Do you wish to build TensorFlow with MKL support? [y/N] Y
Do you wish to download MKL LIB from the web? [Y/n] Y
# Select the defaults for the rest of the options.

bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
```

#### Tuning MKL for the best performance

This section details the different configurations and environment variables
that can be used to tune the MKL to get optimal performance. Before tweaking
various environment variables, make sure the model is using the `NCHW`
(`channels_first`) [data format](#data-formats). The MKL is optimized for
`NCHW` and Intel is working to get near performance parity when using `NHWC`.

MKL uses the following environment variables to tune performance:

* KMP_BLOCKTIME - Sets the time, in milliseconds, that a thread should wait,
  after completing the execution of a parallel region, before sleeping.
* KMP_AFFINITY - Enables the run-time library to bind threads to physical
  processing units.
* KMP_SETTINGS - Enables (true) or disables (false) the printing of OpenMP*
  run-time library environment variables during program execution.
* OMP_NUM_THREADS - Specifies the number of threads to use.

More details on the KMP variables are on
[Intel's](https://software.intel.com/en-us/node/522775) site and on the OMP
variables on
[gnu.org](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html).

While there can be substantial gains from adjusting the environment variables,
which is discussed below, the simplified advice is to set
`inter_op_parallelism_threads` equal to the number of physical CPUs and to set
the following environment variables:

* KMP_BLOCKTIME=0
* KMP_AFFINITY=granularity=fine,verbose,compact,1,0

Example setting MKL variables with command-line arguments:

```bash
KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 \
KMP_SETTINGS=1 python your_python_script.py
```

Example setting MKL variables with python `os.environ`:

```python
os.environ["KMP_BLOCKTIME"] = str(FLAGS.kmp_blocktime)
os.environ["KMP_SETTINGS"] = str(FLAGS.kmp_settings)
os.environ["KMP_AFFINITY"] = FLAGS.kmp_affinity
if FLAGS.num_intra_threads > 0:
  os.environ["OMP_NUM_THREADS"] = str(FLAGS.num_intra_threads)
```

There are models and hardware platforms that benefit from different settings.
Each variable that impacts performance is discussed below.

* **KMP_BLOCKTIME**: The MKL default is 200ms, which was not optimal in our
  testing. 0 (0ms) was a good default for the CNN based models that were
  tested. The best performance for AlexNet was achieved at 30ms, and both
  GoogleNet and VGG11 performed best set at 1ms.

* **KMP_AFFINITY**: The recommended setting is
  `granularity=fine,verbose,compact,1,0`.

* **OMP_NUM_THREADS**: This defaults to the number of physical cores.
  Adjusting this parameter beyond matching the number of cores can have an
  impact when using Intel® Xeon Phi™ (Knights Landing) for some models. See
  [TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture)
  for optimal settings.

* **intra_op_parallelism_threads**: Setting this equal to the number of
  physical cores is recommended. Setting the value to 0, which is the default,
  results in the value being set to the number of logical cores; this is an
  alternate option to try for some architectures. This value and
  `OMP_NUM_THREADS` should be equal.

* **inter_op_parallelism_threads**: Setting this equal to the number of
  sockets is recommended. Setting the value to 0, which is the default,
  results in the value being set to the number of logical cores.
### Comparing compiler optimizations

Collected below are performance results from running training and inference on
different types of CPUs on different platforms with various compiler
optimizations. The models used were ResNet-50
([arXiv:1512.03385](https://arxiv.org/abs/1512.03385)) and InceptionV3
([arXiv:1512.00567](https://arxiv.org/abs/1512.00567)).

For each test, when the MKL optimization was used, the environment variable
KMP_BLOCKTIME was set to 0 (0ms) and KMP_AFFINITY to
`granularity=fine,verbose,compact,1,0`.

#### Inference InceptionV3

**Environment**

* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

**Batch Size: 1**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 7.0 (142ms)            | 4             | 0             |
| MKL          | NCHW        | 6.6 (152ms)            | 4             | 1             |
| AVX          | NHWC        | 5.0 (202ms)            | 4             | 0             |
| SSE3         | NHWC        | 2.8 (361ms)            | 4             | 0             |

**Batch Size: 32**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=inception3 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 10.3 (3,104ms)         | 4             | 1             |
| AVX2         | NHWC        | 7.5 (4,255ms)          | 4             | 0             |
| AVX          | NHWC        | 5.1 (6,275ms)          | 4             | 0             |
| SSE3         | NHWC        | 2.8 (11,428ms)         | 4             | 0             |

#### Inference ResNet-50

**Environment**

* Instance Type: AWS EC2 m4.xlarge
* CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (Broadwell)
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

**Batch Size: 1**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=1 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2         | NHWC        | 8.8 (113ms)            | 4             | 0             |
| MKL          | NCHW        | 8.5 (120ms)            | 4             | 1             |
| AVX          | NHWC        | 6.4 (157ms)            | 4             | 0             |
| SSE3         | NHWC        | 3.7 (270ms)            | 4             | 0             |
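The step times in these tables follow directly from the throughput,
`batch_size / images_per_sec`. A small check against the AVX2 row above (not
part of the benchmark script):

```python
# Step time is batch_size / images_per_sec, reported here in milliseconds.
batch_size = 1
images_per_sec = 8.8  # AVX2 row of the batch-1 ResNet-50 table
step_time_ms = batch_size / images_per_sec * 1000.0
print('%.0f ms' % step_time_ms)  # ~114 ms, in line with the reported 113ms
```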
**Batch Size: 32**

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --forward_only=True --device=cpu --mkl=True \
--kmp_blocktime=0 --nodistortions --model=resnet50 --data_format=NCHW \
--batch_size=32 --num_inter_threads=1 --num_intra_threads=4 \
--data_dir=<path to ImageNet TFRecords>
```

| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter Threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| MKL          | NCHW        | 12.4 (2,590ms)         | 4             | 1             |
| AVX2         | NHWC        | 10.4 (3,079ms)         | 4             | 0             |
| AVX          | NHWC        | 7.3 (4,416ms)          | 4             | 0             |
| SSE3         | NHWC        | 4.0 (8,054ms)          | 4             | 0             |

#### Training InceptionV3

**Environment**

* Instance Type: Dedicated AWS EC2 r4.16xlarge (Broadwell)
* CPU: Intel Xeon E5-2686 v4 (Broadwell) Processors
* Dataset: ImageNet
* TensorFlow Version: 1.2.0 RC2
* Test Script: [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/mkl_experiment/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)

Command executed for the MKL test:

```bash
python tf_cnn_benchmarks.py --device=cpu --mkl=True --kmp_blocktime=0 \
--nodistortions --model=resnet50 --data_format=NCHW --batch_size=32 \
--num_inter_threads=2 --num_intra_threads=36 \
--data_dir=<path to ImageNet TFRecords>
```

Optimization | Data Format | Images/Sec | Intra threads | Inter Threads
------------ | ----------- | ---------- | ------------- | -------------
MKL          | NCHW        | 20.8       | 36            | 2
AVX2         | NHWC        | 6.2        | 36            | 0
AVX          | NHWC        | 5.7        | 36            | 0
SSE3         | NHWC        | 4.3        | 36            | 0

ResNet and [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
were also run on this configuration, but in an ad hoc manner. There were not
enough runs executed to publish a coherent table of results. The incomplete
results strongly indicated the final result would be similar to the table
above, with MKL providing significant 3x+ gains over AVX2.