# Sequence-to-Sequence Models: Learning to Translate

Recurrent neural networks can learn to model language, as already discussed
in the [RNN Tutorial](../recurrent/index.md)
(if you did not read it, please go through it before proceeding with this one).
This raises an interesting question: could we condition the generated words on
some input and generate a meaningful response? For example, could we train
a neural network to translate from English to French? It turns out that
the answer is *yes*.

This tutorial will show you how to build and train such a system end-to-end.
You can start by running this binary.

```
bazel run -c opt <...>/models/rnn/translate/translate.py
  --data_dir [your_data_directory]
```

It will download English-to-French translation data from the
[WMT'15 Website](http://www.statmt.org/wmt15/translation-task.html),
prepare it for training, and train. It takes about 20GB of disk space,
and a while to download and prepare (see [later](#run_it) for details),
so you can start it and leave it running while reading this tutorial.

This tutorial references the following files from `models/rnn`.

File | What's in it?
--- | ---
`seq2seq.py` | Library for building sequence-to-sequence models.
`translate/seq2seq_model.py` | Neural translation sequence-to-sequence model.
`translate/data_utils.py` | Helper functions for preparing translation data.
`translate/translate.py` | Binary that trains and runs the translation model.


## Sequence-to-Sequence Basics

A basic sequence-to-sequence model, as introduced in
[Cho et al., 2014](http://arxiv.org/pdf/1406.1078v3.pdf),
consists of two recurrent neural networks (RNNs): an *encoder* that
processes the input and a *decoder* that generates the output.
This basic architecture is depicted below.

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="basic_seq2seq.png" />
</div>

Each box in the picture above represents a cell of the RNN, most commonly
a GRU cell or an LSTM cell (see the [RNN Tutorial](../recurrent/index.md)
for an explanation of those). The encoder and decoder can share weights or,
as is more common, use two different sets of parameters. Multi-layer cells
have been used successfully in sequence-to-sequence models too, e.g. for
translation in [Sutskever et al., 2014](http://arxiv.org/abs/1409.3215);
a short sketch of constructing such a cell appears at the end of this section.

In the basic model depicted above, every input has to be encoded into
a fixed-size state vector, as that is the only thing passed to the decoder.
To allow the decoder more direct access to the input, an *attention* mechanism
was introduced in [Bahdanau et al., 2014](http://arxiv.org/abs/1409.0473).
We will not go into the details of the attention mechanism (see the paper);
suffice it to say that it allows the decoder to peek into the input at every
decoding step. A multi-layer sequence-to-sequence network with LSTM cells and
attention mechanism in the decoder looks like this.

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="attention_seq2seq.png" />
</div>
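As promised above, here is a minimal sketch of how such a multi-layer cell
could be assembled from the cell classes and wrappers in
`models/rnn/rnn_cell.py` (described further in the next section). The layer
size, depth, and keep probability below are illustrative assumptions, not
the tutorial's defaults.

```python
from tensorflow.models.rnn import rnn_cell

size = 512        # number of units per layer (assumed for illustration)
num_layers = 3    # depth of the stacked cell (assumed for illustration)

single_cell = rnn_cell.GRUCell(size)
# Optionally apply dropout to each layer's outputs.
single_cell = rnn_cell.DropoutWrapper(single_cell, output_keep_prob=0.9)
# Stack identical layers into one multi-layer cell.
cell = rnn_cell.MultiRNNCell([single_cell] * num_layers)
```

The resulting `cell` can be passed to any of the sequence-to-sequence
functions discussed next, exactly like a single-layer cell.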
## TensorFlow seq2seq Library

As you can see above, there are many different sequence-to-sequence
models. Each of these models can use different RNN cells, but all
of them accept encoder inputs and decoder inputs. This motivates
the interfaces in the TensorFlow seq2seq library (`models/rnn/seq2seq.py`).
The basic RNN encoder-decoder sequence-to-sequence model works as follows.

```python
outputs, states = basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell)
```

In the above call, `encoder_inputs` are a list of tensors representing inputs
to the encoder, i.e., corresponding to the letters *A, B, C* in the first
picture above. Similarly, `decoder_inputs` are tensors representing inputs
to the decoder, *GO, W, X, Y, Z* in the first picture.

The `cell` argument is an instance of the `models.rnn.rnn_cell.RNNCell` class
that determines which cell will be used inside the model. You can use
an existing cell, such as `GRUCell` or `LSTMCell`, or you can write your own.
Moreover, `rnn_cell` provides wrappers to construct multi-layer cells,
add dropout to cell inputs or outputs, or do other transformations; see
the [RNN Tutorial](../recurrent/index.md) for examples and the sketch at
the end of the previous section.

The call to `basic_rnn_seq2seq` returns two values: `outputs` and `states`.
Both of them are lists of tensors of the same length as `decoder_inputs`.
Naturally, `outputs` correspond to the outputs of the decoder at each
time-step; in the first picture above these would be *W, X, Y, Z, EOS*. The
returned `states` represent the internal state of the decoder at every
time-step.

In many applications of sequence-to-sequence models, the output of the decoder
at time t is fed back and becomes the input of the decoder at time t+1. At test
time, when decoding a sequence, this is how the sequence is constructed.
During training, on the other hand, it is common to provide the correct input
to the decoder at every time-step, even if the decoder made a mistake before.
Functions in `seq2seq.py` support both modes using the `feed_previous` argument.
For example, let's analyze the following use of an embedding RNN model.

```python
outputs, states = embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols, num_decoder_symbols,
    output_projection=None, feed_previous=False)
```

In the `embedding_rnn_seq2seq` model, all inputs (both `encoder_inputs` and
`decoder_inputs`) are integer tensors that represent discrete values.
They will be embedded into a dense representation (see the
[Vector Representations Tutorial](../word2vec/index.md) for more details
on embeddings), but to construct these embeddings we need to specify
the maximum number of discrete symbols that will appear: `num_encoder_symbols`
on the encoder side, and `num_decoder_symbols` on the decoder side.

In the above invocation, we set `feed_previous` to False. This means that the
decoder will use the `decoder_inputs` tensors as provided. If we set
`feed_previous` to True, the decoder would only use the first element of
`decoder_inputs`. All other tensors from this list would be ignored, and
instead the previous output of the decoder would be used. This is how we
decode translations in our translation model, but it can also be used during
training, to make the model more robust to its own mistakes, similar
to [Bengio et al., 2015](http://arxiv.org/pdf/1506.03099v2.pdf).
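To make the two modes concrete, here is a sketch of the same call configured
once for training and once for decoding. Everything except `feed_previous`
matches the invocation above; the vocabulary sizes are assumptions (they match
the 40K vocabularies used later in this tutorial), and in a real program you
would build one of these graphs, not both.

```python
# Training-style graph: the decoder is fed the ground-truth
# decoder_inputs at every step, even after it makes a mistake.
outputs, states = embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=40000, num_decoder_symbols=40000,
    output_projection=None, feed_previous=False)

# Decoding-style graph: only the first element of decoder_inputs
# (the GO symbol) is read; every later step is fed the previous
# step's own output.
outputs, states = embedding_rnn_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=40000, num_decoder_symbols=40000,
    output_projection=None, feed_previous=True)
```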
One more important argument used above is `output_projection`. If not
specified, the outputs of the embedding model will be tensors of shape
batch-size by `num_decoder_symbols`, as they represent the logits for each
generated symbol. When training models with large output vocabularies, i.e.,
when `num_decoder_symbols` is large, it is not practical to store these large
tensors. Instead, it is better to return smaller output tensors, which will
later be projected onto a large output tensor using `output_projection`.
This makes it possible to use our seq2seq models with a sampled softmax loss,
as described in [Jean et al., 2015](http://arxiv.org/pdf/1412.2007v2.pdf).

In addition to `basic_rnn_seq2seq` and `embedding_rnn_seq2seq` there are a few
more sequence-to-sequence models in `seq2seq.py`; take a look there. They all
have similar interfaces, so we will not describe them in detail. We will use
`embedding_attention_seq2seq` for our translation model below.

## Neural Translation Model

While the core of the sequence-to-sequence model is constructed by
the functions in `models/rnn/seq2seq.py`, there are still a few tricks
worth mentioning that are used in our translation model in
`models/rnn/translate/seq2seq_model.py`.

### Sampled softmax and output projection

For one, as already mentioned above, we want to use sampled softmax to
handle a large output vocabulary. To decode from it, we need to keep track
of the output projection. Both the sampled softmax loss and the output
projection are constructed by the following code in `seq2seq_model.py`.

```python
  if num_samples > 0 and num_samples < self.target_vocab_size:
    w = tf.get_variable("proj_w", [size, self.target_vocab_size])
    w_t = tf.transpose(w)
    b = tf.get_variable("proj_b", [self.target_vocab_size])
    output_projection = (w, b)

    def sampled_loss(inputs, labels):
      labels = tf.reshape(labels, [-1, 1])
      return tf.nn.sampled_softmax_loss(w_t, b, inputs, labels, num_samples,
                                        self.target_vocab_size)
```

First, note that we only construct a sampled softmax if the number of samples
(512 by default) is smaller than the target vocabulary size. For vocabularies
smaller than 512 it might be a better idea to just use a standard softmax loss.

Then, as you can see, we construct an output projection. It is a pair,
consisting of a weight matrix and a bias vector. If used, the RNN cell
will return vectors of shape batch-size by `size`, rather than batch-size
by `target_vocab_size`. To recover the logits, we need to multiply by the
weight matrix and add the biases, as is done in lines 124-126 in
`seq2seq_model.py`.

```python
if output_projection is not None:
  self.outputs[b] = [tf.matmul(output, output_projection[0]) +
                     output_projection[1] for output in self.outputs[b]]
```
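To see what this projection means at decode time, here is a toy,
self-contained illustration. The shapes and sizes are made up;
`tf.nn.xw_plus_b` is just the matmul-plus-bias from the snippet above, and
the real model applies this per bucket.

```python
import tensorflow as tf

batch_size, size, target_vocab_size = 32, 1024, 40000

# Stand-ins for one decoder output and the projection pair (w, b) above.
output = tf.placeholder(tf.float32, [batch_size, size])
w = tf.get_variable("proj_w", [size, target_vocab_size])
b = tf.get_variable("proj_b", [target_vocab_size])

logits = tf.nn.xw_plus_b(output, w, b)  # shape [batch_size, target_vocab_size]
best_symbols = tf.argmax(logits, 1)     # most likely output token per sentence
```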
### Bucketing and padding

In addition to sampled softmax, our translation model also makes use
of *bucketing*, which is a method to efficiently handle sentences of
different lengths. Let us first clarify the problem. When translating
English to French, we will have English sentences of different lengths L1
on input, and French sentences of different lengths L2 on output. Since
the English sentence is passed as `encoder_inputs`, and the French sentence
comes as `decoder_inputs` (prefixed by a GO symbol), we should in principle
create a seq2seq model for every pair (L1, L2+1) of lengths of an English
and French sentence. This would result in an enormous graph consisting of
many very similar subgraphs. On the other hand, we could just pad every
sentence with a special PAD symbol. Then we'd need only one seq2seq model,
for the padded lengths. But on shorter sentences our model would be
inefficient, encoding and decoding many PAD symbols that are useless.

As a compromise between constructing a graph for every pair of lengths and
padding to a single length, we use a number of *buckets* and pad each sentence
to the length of the bucket above it. In `translate.py` we use the following
default buckets.

```python
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
```

This means that if the input is an English sentence with 3 tokens,
and the corresponding output is a French sentence with 6 tokens,
then they will be put in the first bucket and padded to length 5 for
encoder inputs, and length 10 for decoder inputs. If we have an English
sentence with 8 tokens and the corresponding French sentence has 18 tokens,
then they will not fit into the (10, 15) bucket, and so the (20, 25) bucket
will be used, i.e. the English sentence will be padded to 20, and the French
one to 25.

Remember that when constructing decoder inputs we prepend the special `GO`
symbol to the input data. This is done in the `get_batch()` function in
`seq2seq_model.py`, which also reverses the input English sentence.
Reversing the inputs was shown to improve results for the neural translation
model in [Sutskever et al., 2014](http://arxiv.org/abs/1409.3215).
To put it all together, imagine we have the sentence "I go.", tokenized
as `["I", "go", "."]` as input and the sentence "Je vais." as output,
tokenized as `["Je", "vais", "."]`. It will be put in the (5, 10) bucket,
with encoder inputs representing `[PAD PAD "." "go" "I"]` and decoder
inputs `[GO "Je" "vais" "." EOS PAD PAD PAD PAD PAD]`.


## Let's Run It {#run_it}

To train the model described above, we need a large English-French corpus.
We will use the *10^9-French-English corpus* from the
[WMT'15 Website](http://www.statmt.org/wmt15/translation-task.html)
for training, and the 2013 news test from the same site as a development set.
When this command is run, both data sets will be downloaded to `data_dir`,
and training will start, saving checkpoints in `train_dir`.

```
bazel run -c opt <...>/models/rnn/translate:translate
  --data_dir [your_data_directory] --train_dir [checkpoints_directory]
  --en_vocab_size=40000 --fr_vocab_size=40000
```

It takes about 18GB of disk space and several hours to prepare the training
corpus. It is unpacked, vocabulary files are created in `data_dir`, and then
the corpus is tokenized and converted to integer ids. Note the parameters
that determine the vocabulary sizes. In the example above, all words outside
the 40K most common ones will be converted to an `UNK` token representing
unknown words (a toy illustration of this mapping appears below). If you
change the vocabulary sizes, the binary will re-map the corpus to token
ids again.

After the data is prepared, training starts. Default parameters in `translate`
are set to quite large values. Large models trained over a long time give good
results, but training them might take too long or use too much memory on your
GPU. You can request to train a smaller model as in the following example.

```
bazel run -c opt <...>/models/rnn/translate:translate
  --data_dir [your_data_directory] --train_dir [checkpoints_directory]
  --size=256 --num_layers=2 --steps_per_checkpoint=50
```

The above command will train a model with 2 layers (the default is 3),
each layer with 256 units (the default is 1024), and will save a checkpoint
every 50 steps (the default is 200). You can play with these parameters
to find out how large a model can be while still fitting into the memory
of your GPU.
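Before looking at training output, here is the toy illustration of the
`UNK` mapping mentioned above. The reserved ids for the special symbols
follow the convention in `data_utils.py`, but the helper and the tiny
vocabulary are made up for this sketch.

```python
# data_utils.py reserves the first ids for the special symbols.
PAD_ID, GO_ID, EOS_ID, UNK_ID = 0, 1, 2, 3

def to_token_ids(words, vocab):
  # Any word missing from the vocabulary becomes UNK.
  return [vocab.get(word, UNK_ID) for word in words]

vocab = {"I": 4, "go": 5, ".": 6}              # made-up tiny vocabulary
print(to_token_ids(["I", "go", "."], vocab))   # [4, 5, 6]
print(to_token_ids(["I", "fly", "."], vocab))  # [4, 3, 6] -- "fly" maps to UNK
```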
During training, every `steps_per_checkpoint` steps the binary will print
out statistics from recent steps. With the default parameters (3 layers
of size 1024), the first messages look like this.

```
global step 200 learning rate 0.5000 step-time 1.39 perplexity 1720.62
  eval: bucket 0 perplexity 184.97
  eval: bucket 1 perplexity 248.81
  eval: bucket 2 perplexity 341.64
  eval: bucket 3 perplexity 469.04
global step 400 learning rate 0.5000 step-time 1.38 perplexity 379.89
  eval: bucket 0 perplexity 151.32
  eval: bucket 1 perplexity 190.36
  eval: bucket 2 perplexity 227.46
  eval: bucket 3 perplexity 238.66
```

You can see that each step takes just under 1.4 seconds, and that both the
perplexity on the training set and the per-bucket perplexities on the
development set are reported. After about 30K steps, we see perplexities on
short sentences (buckets 0 and 1) going into single digits.
Since the training corpus contains ~22M sentences, one epoch (going through
the training data once) takes about 340K steps with a batch size of 64. At
this point the model can be used for translating English sentences to French
using the `--decode` option.

```
bazel run -c opt <...>/models/rnn/translate:translate --decode
  --data_dir [your_data_directory] --train_dir [checkpoints_directory]

Reading model parameters from /tmp/translate.ckpt-340000
> Who is the president of the United States?
 Qui est le président des États-Unis ?
```

## What Next?

The example above shows how you can build your own English-to-French
translator, end-to-end. Run it and see how the model performs for yourself.
While it has reasonable quality, the default parameters will not give you
the best translation model. Here are a few things you can improve.

First of all, we use a very primitive tokenizer, the `basic_tokenizer` function
in `data_utils`. A better tokenizer can be found on the
[WMT'15 Website](http://www.statmt.org/wmt15/translation-task.html).
Using that tokenizer, and a larger vocabulary, should improve your translations.

Also, the default parameters of the translation model are not tuned.
You can try changing the learning rate, decay, or initializing the weights
of your model in a different way. You can also change the default
`GradientDescentOptimizer` in `seq2seq_model.py` to a more advanced one, such
as `AdagradOptimizer` (a one-line sketch of this swap closes the tutorial).
Try these things and see how they improve your results!

Finally, the model presented above can be used for any sequence-to-sequence
task, not only for translation. Even if you want to transform a sequence to
a tree, for example to generate a parse tree, the same model as above can
give state-of-the-art results, as demonstrated in
[Vinyals & Kaiser et al., 2015](http://arxiv.org/abs/1412.7449).
So you can not only build your own translator, but also a parser,
a chat-bot, or any program that comes to your mind. Experiment!
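As a parting example, here is the optimizer swap suggested above. It is a
sketch that assumes the variable names used in `seq2seq_model.py`; the
surrounding gradient-clipping code should keep working unchanged.

```python
# seq2seq_model.py currently constructs plain gradient descent:
opt = tf.train.GradientDescentOptimizer(self.learning_rate)

# A drop-in alternative to experiment with:
opt = tf.train.AdagradOptimizer(self.learning_rate)
```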