# Convolutional Neural Networks for Object Recognition

**NOTE:** This tutorial is intended for *advanced* users of TensorFlow
and assumes expertise and experience in machine learning.

## Overview

CIFAR-10 classification is a common benchmark problem in machine learning. The
problem is to classify RGB 32x32 pixel images across 10 categories:
```airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.```

![CIFAR-10 Samples](./cifar_samples.png "CIFAR-10 Samples, from http://www.cs.toronto.edu/~kriz/cifar.html")

For more details refer to the [CIFAR-10 page](http://www.cs.toronto.edu/~kriz/cifar.html)
and a [Tech Report](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)
by Alex Krizhevsky.

### Goals

The goal of this tutorial is to build a relatively small convolutional neural
network (CNN) for recognizing images. In the process this tutorial:

1. Highlights a canonical organization for network architecture,
training and evaluation.
2. Provides a template for constructing larger and more sophisticated models.
CIFAR-10 was selected because it is complex enough to exercise much of
TensorFlow's ability to scale to large models. At the same time, the model is
small enough to train quickly, which makes it ideal for trying out new ideas
and experimenting with new techniques.

### Highlights of the Tutorial

The CIFAR-10 tutorial demonstrates several important constructs for
designing larger and more sophisticated models in TensorFlow:

* Core mathematical components including
[convolution](../../api_docs/python/nn.md#conv2d),
[rectified linear activations](../../api_docs/python/nn.md#relu),
[max pooling](../../api_docs/python/nn.md#max_pool) and
[local response normalization](../../api_docs/python/nn.md#local_response_normalization).
* [Visualization](../../how_tos/summaries_and_tensorboard/index.md)
of network activity during training, including input images,
losses and distributions of activations and gradients.
* Routines for calculating the
[moving average](../../api_docs/python/train.md#ExponentialMovingAverage)
of learned parameters and using these averages
during evaluation to boost predictive performance.
* Implementation of a
[learning rate schedule](../../api_docs/python/train.md#exponential_decay)
that systematically decreases over time.
* Prefetching [queues](../../api_docs/python/io_ops.md#shuffle_batch)
for input data to isolate the model from disk latency and expensive
image pre-processing.

We also provide a multi-GPU version of the model which demonstrates:

* Configuring a model to train across multiple GPU cards in parallel.
* Sharing and updating variables between multiple GPUs.

We hope that this tutorial provides a launch point for building larger CNNs for
vision tasks on TensorFlow.

### Model Architecture

The model in this CIFAR-10 tutorial is a multi-layer architecture consisting of
alternating convolutions and nonlinearities. These layers are followed by fully
connected layers leading into a softmax classifier. The model follows the
architecture described by
[Alex Krizhevsky](https://code.google.com/p/cuda-convnet/), with a few
differences in the top few layers.

This model achieves a peak performance of about 86% accuracy within a few hours
of training time on a GPU. Please see [below](#evaluating-a-model) and the code
for details. It consists of 1,068,298 learnable parameters and requires about
19.5M multiply-add operations to compute inference on a single image.

## Code Organization

The code for this tutorial resides in
[`tensorflow/models/image/cifar10/`](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/image/cifar10/).

File | Purpose
--- | ---
[`cifar10_input.py`](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/image/cifar10/cifar10_input.py) | Read the native CIFAR-10 binary file format.
[`cifar10.py`](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/image/cifar10/cifar10.py) | Build the CIFAR-10 model.
[`cifar10_train.py`](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/image/cifar10/cifar10_train.py) | Train a CIFAR-10 model on a single machine.
[`cifar10_multi_gpu_train.py`](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py) | Train a CIFAR-10 model on multiple GPUs.
[`cifar10_eval.py`](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/image/cifar10/cifar10_eval.py) | Evaluate the predictive performance of a CIFAR-10 model.


## CIFAR-10 Model

The CIFAR-10 network is largely contained in
[`cifar10.py`](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/image/cifar10/cifar10.py).
The complete training
graph contains roughly 765 operations. We find that we can make the code most
reusable by constructing the graph with the following modules:

1. [**Model inputs:**](#model-inputs) `inputs()` and `distorted_inputs()` add
operations that read and preprocess CIFAR images for evaluation and training,
respectively.
1. [**Model prediction:**](#model-prediction) `inference()`
adds operations that perform inference, i.e. classification, on supplied images.
1. [**Model training:**](#model-training) `loss()` and `train()`
add operations that compute the loss,
gradients, variable updates and visualization summaries.

### Model Inputs

The input part of the model is built by the functions `inputs()` and
`distorted_inputs()`, which read images from the CIFAR-10 binary data files.
These files contain fixed-length records, so we use
[`tf.FixedLengthRecordReader`](../../api_docs/python/io_ops.md#FixedLengthRecordReader).
See [`Reading Data`](../../how_tos/reading_data/index.md#reading-from-files) to
learn more about how the `Reader` class works.
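
To make this concrete, here is a minimal sketch of a fixed-length record
pipeline. This is not the exact code in `cifar10_input.py`; the file path is
illustrative and the calls are written against a TF 1.x-style API, but the
record layout (1 label byte followed by 3072 image bytes) is the real CIFAR-10
binary format.

```python
import tensorflow as tf

# CIFAR-10 binary format: 1 label byte followed by a 32x32x3 image.
label_bytes = 1
image_bytes = 32 * 32 * 3

# A queue of input filenames; the reader consumes one file at a time.
filename_queue = tf.train.string_input_producer(
    ['/tmp/cifar10_data/data_batch_1.bin'])

# Each read() returns one fixed-length record as a raw byte string.
reader = tf.FixedLengthRecordReader(record_bytes=label_bytes + image_bytes)
_, value = reader.read(filename_queue)
record = tf.decode_raw(value, tf.uint8)

# Split the record into a label and a [depth, height, width] image,
# then transpose to the conventional [height, width, depth] layout.
label = tf.cast(tf.slice(record, [0], [label_bytes]), tf.int32)
image = tf.transpose(
    tf.reshape(tf.slice(record, [label_bytes], [image_bytes]), [3, 32, 32]),
    [1, 2, 0])
```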

The images are processed as follows:

* They are cropped to 24 x 24 pixels, centrally for evaluation or
  [randomly](../../api_docs/python/image.md#random_crop) for training.
* They are [approximately whitened](../../api_docs/python/image.md#per_image_whitening)
  to make the model insensitive to dynamic range.

For training, we additionally apply a series of random distortions to
artificially increase the data set size:

* [Randomly flip](../../api_docs/python/image.md#random_flip_left_right) the image from left to right.
* Randomly distort the [image brightness](../../api_docs/python/image.md#random_brightness).
* Randomly distort the [image contrast](../../api_docs/python/image.md#random_contrast).

Please see the [`Images`](../../api_docs/python/image.md) page for the list of
available distortions. We also attach an
[`image_summary`](../../api_docs/python/train.md#image_summary) to the images
so that we may visualize them in TensorBoard. This is a good practice to verify
that inputs are built correctly.
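
Concretely, the per-image processing amounts to a short chain of image ops. The
sketch below is an outline of what `distorted_inputs()` does rather than a
verbatim excerpt; it uses TF 1.x-style names (`per_image_standardization` is
the later name for the whitening op referenced above), and the distortion
parameters are illustrative.

```python
# `image` is a [32, 32, 3] uint8 tensor from the reader sketched earlier.
image = tf.cast(image, tf.float32)

# Randomly crop a 24x24 section of the image (training only; evaluation
# uses a central crop instead).
distorted = tf.random_crop(image, [24, 24, 3])

# Randomly flip the image horizontally.
distorted = tf.image.random_flip_left_right(distorted)

# Randomly perturb the brightness and contrast.
distorted = tf.image.random_brightness(distorted, max_delta=63)
distorted = tf.image.random_contrast(distorted, lower=0.2, upper=1.8)

# Whiten the image: rescale to zero mean and unit norm.
float_image = tf.image.per_image_standardization(distorted)

# Attach a summary so the distorted inputs appear in TensorBoard.
tf.summary.image('distorted_images', tf.expand_dims(float_image, 0))
```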

<div style="width:50%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:70%" src="./cifar_image_summary.png">
</div>

Reading images from disk and distorting them can use a non-trivial amount of
processing time. To prevent these operations from slowing down training, we run
them inside 16 separate threads which continuously fill a TensorFlow
[queue](../../api_docs/python/io_ops.md#shuffle_batch).
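
A minimal sketch of that prefetching queue, again assuming a TF 1.x-style
queue-runner API (the capacity and buffer values are illustrative):

```python
batch_size = 128
min_queue_examples = 20000  # examples buffered before training starts

# The reader above emits a [1]-shaped label; make it a scalar per example.
label = tf.reshape(label, [])

# 16 preprocessing threads fill a shuffling queue in the background, so the
# training step dequeues ready-made batches and never waits on disk I/O or
# image preprocessing.
images, labels = tf.train.shuffle_batch(
    [float_image, label],
    batch_size=batch_size,
    num_threads=16,
    capacity=min_queue_examples + 3 * batch_size,
    min_after_dequeue=min_queue_examples)
```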

### Model Prediction

The prediction part of the model is constructed by the `inference()` function
which adds operations to compute the *logits* of the predictions. That part of
the model is organized as follows (a sketch of the first block follows the
table):

Layer Name | Description
--- | ---
`conv1` | [convolution](../../api_docs/python/nn.md#conv2d) and [rectified linear](../../api_docs/python/nn.md#relu) activation.
`pool1` | [max pooling](../../api_docs/python/nn.md#max_pool).
`norm1` | [local response normalization](../../api_docs/python/nn.md#local_response_normalization).
`conv2` | [convolution](../../api_docs/python/nn.md#conv2d) and [rectified linear](../../api_docs/python/nn.md#relu) activation.
`norm2` | [local response normalization](../../api_docs/python/nn.md#local_response_normalization).
`pool2` | [max pooling](../../api_docs/python/nn.md#max_pool).
`local3` | [fully connected layer with rectified linear activation](../../api_docs/python/nn.md).
`local4` | [fully connected layer with rectified linear activation](../../api_docs/python/nn.md).
`softmax_linear` | linear transformation to produce logits.
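
Here is roughly what the first convolution/pooling/normalization block looks
like. This is a hand-written sketch with illustrative filter sizes and
TF 1.x-style ops; the real `inference()` wraps variable creation in helper
functions that also attach weight decay.

```python
with tf.variable_scope('conv1'):
    # 5x5 convolution mapping 3 input channels to 64 feature maps.
    kernel = tf.get_variable(
        'weights', [5, 5, 3, 64],
        initializer=tf.truncated_normal_initializer(stddev=1e-4))
    biases = tf.get_variable(
        'biases', [64], initializer=tf.constant_initializer(0.0))
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    conv1 = tf.nn.relu(tf.nn.bias_add(conv, biases))

# Overlapping 3x3 max pooling with stride 2.
pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                       padding='SAME', name='pool1')

# Local response normalization, as in Krizhevsky et al.
norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                  name='norm1')
```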

Here is a graph generated from TensorBoard describing the inference operation:

<div style="width:15%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="./cifar_graph.png">
</div>

> **EXERCISE**: The outputs of `inference` are un-normalized logits. Try editing
the network architecture to return normalized predictions using
[`tf.nn.softmax()`](../../api_docs/python/nn.md#softmax).

The `inputs()` and `inference()` functions provide all of the components
necessary to perform evaluation on a model. We now shift our focus towards
building operations for training a model.

> **EXERCISE:** The model architecture in `inference()` differs slightly from
the CIFAR-10 model specified in
[cuda-convnet](https://code.google.com/p/cuda-convnet/). In particular, the top
layers of Alex's original model are locally connected rather than fully
connected. Try editing the architecture to exactly replicate the locally
connected architecture in the top layers.

### Model Training

The usual method for training a network to perform N-way classification is
[multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression),
also known as *softmax regression*. Softmax regression applies a
[softmax](../../api_docs/python/nn.md#softmax) nonlinearity to the
output of the network and calculates the
[cross-entropy](../../api_docs/python/nn.md#softmax_cross_entropy_with_logits)
between the normalized predictions and a
[1-hot encoding](../../api_docs/python/sparse_ops.md#sparse_to_dense) of the label.
For regularization, we also apply the usual
[weight decay](../../api_docs/python/nn.md#l2_loss) losses to all learned
variables. The objective function for the model is the sum of the cross entropy
loss and all these weight decay terms, as returned by the `loss()` function.
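
In outline, `loss()` does something like the following. The sketch uses a
TF 1.x-style API, with a dense 1-hot conversion via `tf.one_hot` standing in
for the `sparse_to_dense` call in the real code; `logits` is the output of
`inference()` and `labels` is the batch of integer labels from above.

```python
NUM_CLASSES = 10

# Dense 1-hot encoding of the integer labels.
dense_labels = tf.one_hot(labels, NUM_CLASSES)

# Cross entropy between the softmax of the logits and the 1-hot labels,
# averaged over the batch.
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
    labels=dense_labels, logits=logits)
cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')

# Weight decay terms are added to the 'losses' collection as variables
# are created; the objective is their sum plus the cross entropy.
tf.add_to_collection('losses', cross_entropy_mean)
total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')
```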

We visualize it in TensorBoard with a [scalar_summary](../../api_docs/python/train.md#scalar_summary):

![CIFAR-10 Loss](./cifar_loss.png "CIFAR-10 Total Loss")

We train the model using the standard
[gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)
algorithm (see [Training](../../api_docs/python/train.md) for other methods)
with a learning rate that
[exponentially decays](../../api_docs/python/train.md#exponential_decay)
over time.

![CIFAR-10 Learning Rate Decay](./cifar_lr_decay.png "CIFAR-10 Learning Rate Decay")

The `train()` function adds the operations needed to minimize the objective by
calculating the gradient and updating the learned variables (see
[`GradientDescentOptimizer`](../../api_docs/python/train.md#GradientDescentOptimizer)
for details). It returns an operation that executes all of the calculations
needed to train and update the model for one batch of images.
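
A condensed sketch of what `train()` sets up, with illustrative hyperparameter
values (the real script derives the decay step count from the epoch size):

```python
global_step = tf.Variable(0, trainable=False)

# Decay the learning rate by a factor of 0.1 every 100,000 steps.
lr = tf.train.exponential_decay(learning_rate=0.1,
                                global_step=global_step,
                                decay_steps=100000,
                                decay_rate=0.1,
                                staircase=True)

# One training step: compute gradients of the total loss and apply them,
# incrementing global_step so the decay schedule advances.
opt = tf.train.GradientDescentOptimizer(lr)
train_op = opt.minimize(total_loss, global_step=global_step)
```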

## Launching and Training the Model

Now that we have built the model, let's launch it and run the training
operation with the script `cifar10_train.py`:

```shell
python cifar10_train.py
```

**NOTE:** The first time you run any target in the CIFAR-10 tutorial,
the CIFAR-10 dataset is automatically downloaded. The data set is ~160MB,
so you may want to grab a quick cup of coffee for your first run.

You should see output like the following:

```shell
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2015-11-04 11:45:45.927302: step 0, loss = 4.68 (2.0 examples/sec; 64.221 sec/batch)
2015-11-04 11:45:49.133065: step 10, loss = 4.66 (533.8 examples/sec; 0.240 sec/batch)
2015-11-04 11:45:51.397710: step 20, loss = 4.64 (597.4 examples/sec; 0.214 sec/batch)
2015-11-04 11:45:54.446850: step 30, loss = 4.62 (391.0 examples/sec; 0.327 sec/batch)
2015-11-04 11:45:57.152676: step 40, loss = 4.61 (430.2 examples/sec; 0.298 sec/batch)
2015-11-04 11:46:00.437717: step 50, loss = 4.59 (406.4 examples/sec; 0.315 sec/batch)
...
```

The script reports the total loss every 10 steps as well as the speed at which
the last batch of data was processed. A few comments:

* The first batch of data can be inordinately slow (e.g. several minutes) as the
preprocessing threads fill up the shuffling queue with 20,000 processed CIFAR
images.

* The reported loss is the average loss of the most recent batch. Remember that
this loss is the sum of the cross entropy and all weight decay terms.

* Keep an eye on the processing speed of a batch. The numbers shown above were
obtained on a Tesla K40c. If you are running on a CPU, expect slower performance.


> **EXERCISE:** When experimenting, it is sometimes annoying that the first
training step can take so long. Try decreasing the number of images that
initially fill up the queue. Search for `NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN`
in `cifar10.py`.

`cifar10_train.py` periodically [saves](../../api_docs/python/state_ops.md#Saver)
all model parameters in
[checkpoint files](../../how_tos/variables.md#saving-and-restoring),
but it does *not* evaluate the model. The checkpoint file
will be used by `cifar10_eval.py` to measure the predictive
performance (see [Evaluating a Model](#evaluating-a-model) below).
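
The checkpointing itself is only a few lines. A sketch, assuming a TF 1.x-style
session loop and omitting queue-runner startup for brevity (the path and step
counts are illustrative):

```python
max_steps = 1000000
saver = tf.train.Saver(tf.global_variables())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(max_steps):
        sess.run(train_op)
        # Periodically write every model parameter to a checkpoint file.
        if step % 1000 == 0:
            saver.save(sess, '/tmp/cifar10_train/model.ckpt',
                       global_step=step)
```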


If you followed the previous steps, then you have now started training
a CIFAR-10 model. [Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0)

The terminal text returned from `cifar10_train.py` provides minimal insight into
how the model is training. We want more insight into the model during training:

* Is the loss *really* decreasing or is that just noise?
* Is the model being provided appropriate images?
* Are the gradients, activations and weights reasonable?
* What is the current learning rate?

[TensorBoard](../../how_tos/summaries_and_tensorboard/index.md) provides this
functionality, displaying data exported periodically from `cifar10_train.py` via
a [`SummaryWriter`](../../api_docs/python/train.md#SummaryWriter).

For instance, we can watch how the distribution of activations and degree of
sparsity in `local3` features evolve during training:

<div style="width:100%; margin:auto; margin-bottom:10px; margin-top:20px; display: flex; flex-direction: row">
  <img style="flex-grow:1; flex-shrink:1;" src="./cifar_sparsity.png">
  <img style="flex-grow:1; flex-shrink:1;" src="./cifar_activations.png">
</div>

Individual loss functions, as well as the total loss, are particularly
interesting to track over time. However, the loss exhibits a considerable amount
of noise due to the small batch size employed by training. In practice we find
it extremely useful to visualize their moving averages in addition to their raw
values. See how the scripts use
[ExponentialMovingAverage](../../api_docs/python/train.md#ExponentialMovingAverage)
for this purpose.
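
For example, a shadow moving average can be maintained for each loss and
exported alongside the raw value (a sketch using TF 1.x-style summary ops):

```python
# Maintain exponential moving averages of all losses with decay 0.9.
loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
losses = tf.get_collection('losses')
loss_averages_op = loss_averages.apply(losses + [total_loss])

# Export both the raw loss and its smoothed shadow to TensorBoard.
for l in losses + [total_loss]:
    tf.summary.scalar(l.op.name + '_raw', l)
    tf.summary.scalar(l.op.name + '_avg', loss_averages.average(l))
```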

## Evaluating a Model

Let us now evaluate how well the trained model performs on a hold-out data set.
The model is evaluated by the script `cifar10_eval.py`. It constructs the model
with the `inference()` function and uses all 10,000 images in the evaluation set
of CIFAR-10. It calculates the *precision @ 1*: how often the top prediction
matches the true label of the image.
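
Precision @ 1 can be computed directly from the logits with
[`tf.nn.in_top_k`](../../api_docs/python/nn.md). A sketch, assuming a live
session and a built evaluation graph:

```python
# For each image, check whether the true label is the top prediction.
top_k_op = tf.nn.in_top_k(logits, labels, 1)

# Precision @ 1 is the fraction of images predicted correctly.
predictions = sess.run(top_k_op)   # boolean vector, one entry per image
precision = predictions.sum() / float(predictions.size)
```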

To monitor how the model improves during training, the evaluation script runs
periodically on the latest checkpoint files created by `cifar10_train.py`:

```shell
python cifar10_eval.py
```

> Be careful not to run the evaluation and training binary on the same GPU or
else you might run out of memory. Consider running the evaluation on
a separate GPU if available, or suspending the training binary while running
the evaluation on the same GPU.

You should see the output:

```shell
2015-11-06 08:30:44.391206: precision @ 1 = 0.860
...
```

The script reports precision @ 1 periodically -- in this case
86% accuracy. `cifar10_eval.py` also
exports summaries that may be visualized in TensorBoard. These summaries
provide additional insight into the model during evaluation.

The training script calculates the
[moving average](../../api_docs/python/train.md#ExponentialMovingAverage)
version of all learned variables. The evaluation script substitutes
all learned model parameters with the moving average version. This
substitution boosts model performance at evaluation time.
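
The substitution happens at restore time: rather than restoring each variable
from its own checkpoint entry, the evaluation graph restores it from the
shadow (moving-average) entry. A sketch, with an illustrative decay value and
checkpoint path:

```python
# Use the same decay the training script used for its parameter averages.
variable_averages = tf.train.ExponentialMovingAverage(0.9999)
variables_to_restore = variable_averages.variables_to_restore()

# This Saver maps each variable to its moving-average name in the
# checkpoint, so restoring loads the averaged parameters.
saver = tf.train.Saver(variables_to_restore)
saver.restore(sess, '/tmp/cifar10_train/model.ckpt-999999')
```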

> **EXERCISE:** Employing averaged parameters may boost predictive performance
by about 3% as measured by precision @ 1. Edit `cifar10_eval.py` to not employ
the averaged parameters for the model and verify that the predictive
performance drops.


## Training a Model Using Multiple GPU Cards

Modern workstations may contain multiple GPUs for scientific computation.
TensorFlow can leverage this environment to run the training operation
concurrently across multiple cards.

Training a model in a parallel, distributed fashion requires coordinating the
training processes. In what follows, we use the term *model replica* to mean
one copy of the model training on a subset of the data.

Naively employing asynchronous updates of model parameters
leads to sub-optimal training performance
because an individual model replica might be trained on a stale
copy of the model parameters. Conversely, employing fully synchronous
updates will be as slow as the slowest model replica.

In a workstation with multiple GPU cards, each GPU will have similar speed
and contain enough memory to run an entire CIFAR-10 model. Thus, we opt to
design our training system in the following manner:

* Place an individual model replica on each GPU.
* Update model parameters synchronously by waiting for all GPUs to finish
processing a batch of data.

Here is a diagram of this model:

<div style="width:40%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="./Parallelism.png">
</div>

Note that each GPU computes inference as well as the gradients for a unique
batch of data. This setup effectively permits dividing up a larger batch
of data across the GPUs.

This setup requires that all GPUs share the model parameters. A well-known
fact is that transferring data to and from GPUs is quite slow. For this
reason, we decide to store and update all model parameters on the CPU (see
green box). A fresh set of model parameters is transferred to the GPUs
when a new batch of data has been processed by all of them.

The GPUs are synchronized in operation. All gradients are accumulated from
the GPUs and averaged (see green box). The model parameters are updated with
the gradients averaged across all model replicas.
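
The averaging operates on the `(gradient, variable)` pairs returned by
`Optimizer.compute_gradients()` in each tower. The multi-GPU script implements
this in an `average_gradients()` helper; the simplified sketch below assumes
every tower produced a gradient for every variable:

```python
def average_gradients(tower_grads):
    """Average (gradient, variable) pairs across all towers.

    Args:
      tower_grads: a list with one entry per tower, each a list of
        (gradient, variable) pairs from compute_gradients().
    Returns:
      A single list of (gradient, variable) pairs, averaged over towers.
    """
    average_grads = []
    # zip(*...) regroups the pairs so that we see every tower's gradient
    # for one variable at a time.
    for grad_and_vars in zip(*tower_grads):
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        # The variable object is shared across towers; keep the first.
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads
```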

### Placing Variables and Operations on Devices

Placing operations and variables on devices requires some special
abstractions.

The first abstraction we require is a function for computing inference and
gradients for a single model replica. In the code we term this abstraction
a *tower*. We must set two attributes for each tower:

* A unique name for all operations within a tower.
[`tf.name_scope()`](../../api_docs/python/framework.md#name_scope) provides
this unique name by prepending a scope. For instance, all operations in
the first tower are prepended with `tower_0`, e.g. `tower_0/conv1/Conv2D`.

* A preferred hardware device to run the operation within a tower.
[`tf.device()`](../../api_docs/python/framework.md#device) specifies this. For
instance, all operations in the first tower reside within the `device('/gpu:0')`
scope, indicating that they should be run on the first GPU.

All variables are pinned to the CPU and accessed via
[`tf.get_variable()`](../../api_docs/python/state_ops.md#get_variable)
in order to share them in a multi-GPU version.
See the how-to on [Sharing Variables](../../how_tos/variable_scope/index.md).
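
Putting the pieces together, the tower loop looks roughly like the following
sketch. Here `tower_loss()` is a stand-in for building one complete replica
(inputs, inference and loss) whose variables are created through
`tf.get_variable`, and `num_gpus`, `lr` and `global_step` are assumed to be
defined as above:

```python
tower_grads = []
opt = tf.train.GradientDescentOptimizer(lr)

with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        # Pin this tower's ops to one GPU and give them a unique prefix.
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i):
                loss = tower_loss()
                # Reuse the shared variables for all subsequent towers.
                tf.get_variable_scope().reuse_variables()
                tower_grads.append(opt.compute_gradients(loss))

# Synchronization point: average gradients across towers and apply them
# to the shared parameters stored on the CPU.
grads = average_gradients(tower_grads)
train_op = opt.apply_gradients(grads, global_step=global_step)
```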

### Launching and Training the Model on Multiple GPU Cards

If you have several GPU cards installed on your machine, you can use them to
train the model faster with the `cifar10_multi_gpu_train.py` script. It is a
variation of the training script that parallelizes the model across multiple
GPU cards.

```shell
python cifar10_multi_gpu_train.py --num_gpus=2
```

The training script should output:

```shell
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2015-11-04 11:45:45.927302: step 0, loss = 4.68 (2.0 examples/sec; 64.221 sec/batch)
2015-11-04 11:45:49.133065: step 10, loss = 4.66 (533.8 examples/sec; 0.240 sec/batch)
2015-11-04 11:45:51.397710: step 20, loss = 4.64 (597.4 examples/sec; 0.214 sec/batch)
2015-11-04 11:45:54.446850: step 30, loss = 4.62 (391.0 examples/sec; 0.327 sec/batch)
2015-11-04 11:45:57.152676: step 40, loss = 4.61 (430.2 examples/sec; 0.298 sec/batch)
2015-11-04 11:46:00.437717: step 50, loss = 4.59 (406.4 examples/sec; 0.315 sec/batch)
...
```

Note that the number of GPU cards used defaults to 1. Additionally, if only 1
GPU is available on your machine, all computations will be placed on it, even if
you ask for more.

> **EXERCISE:** The default setting for `cifar10_train.py` is to
run with a batch size of 128. Try running `cifar10_multi_gpu_train.py` on 2 GPUs
with a batch size of 64 and compare the training speed.

## Next Steps

[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You have
completed the CIFAR-10 tutorial.

If you are now interested in developing and training your own image
classification system, we recommend forking this tutorial and replacing
components to address your image classification problem.

> **EXERCISE:** Download the
[Street View House Numbers (SVHN)](http://ufldl.stanford.edu/housenumbers/) data set.
Fork the CIFAR-10 tutorial and swap in SVHN as the input data. Try adapting
the network architecture to improve predictive performance.