diff --git a/tensorflow/g3doc/tutorials/recurrent/index.md b/tensorflow/g3doc/tutorials/recurrent/index.md
new file mode 100644
index 0000000000..29d058cd5d
--- /dev/null
+++ b/tensorflow/g3doc/tutorials/recurrent/index.md
@@ -0,0 +1,209 @@
+# Recurrent Neural Networks
+
+## Introduction
+
+Take a look at
+[this great article](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
+for an introduction to recurrent neural networks and LSTMs in particular.
+
+## Language Modeling
+
+In this tutorial we will show how to train a recurrent neural network on the
+challenging task of language modeling. The goal is to fit a probabilistic
+model which assigns probabilities to sentences. It does so by predicting the
+next word in a text given a history of previous words. For this purpose we
+will use the Penn Tree Bank (PTB) dataset, which is a popular benchmark for
+measuring the quality of these models, whilst being small and relatively fast
+to train.
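+
+In other words, the model factorizes the probability of a sentence into a
+product of next-word predictions:
+
+$$ P(w_1, \ldots, w_N) = \prod_{t=1}^{N} P(w_t \mid w_1, \ldots, w_{t-1}) $$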
+
+Language modeling is key to many interesting problems such as speech
+recognition, machine translation, or image captioning. It is also fun --
+take a look [here](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
+
+For the purpose of this tutorial, we will reproduce the results from
+[Zaremba et al., 2014](http://arxiv.org/abs/1409.2329), which achieves very
+good results on the PTB dataset.
+
+## Tutorial Files
+
+This tutorial references the following files from `models/rnn/ptb`:
+
+File | Purpose
+--- | ---
+`ptb_word_lm.py` | The code to train a language model on the PTB dataset.
+`reader.py` | The code to read the dataset.
+
+## Download and Prepare the Data
+
+The data required for this tutorial is in the `data/` directory of the
+PTB dataset from Tomas Mikolov's webpage:
+http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
+
+The dataset is already preprocessed and contains 10,000 different words in
+total, including the end-of-sentence marker and a special symbol (\<unk\>) for
+rare words. In `reader.py`, we convert each word to a unique integer
+identifier to make it easy for the neural network to process.
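+
+As a rough illustration (a sketch, not the actual code in `reader.py`, and with
+function names of our own choosing), the conversion boils down to building a
+word-to-id dictionary and mapping every token through it:
+
+```python
+import collections
+
+def build_vocab(words):
+    # Map each distinct word to an integer id, most frequent words first.
+    counter = collections.Counter(words)
+    sorted_words = sorted(counter, key=lambda w: (-counter[w], w))
+    return {word: i for i, word in enumerate(sorted_words)}
+
+def words_to_ids(words, word_to_id):
+    # Replace every token (the end-of-sentence marker and <unk> included)
+    # with its integer identifier.
+    return [word_to_id[word] for word in words]
+```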
+
+## The Model
+
+### LSTM
+
+The core of the model consists of an LSTM cell that processes one word at a
+time and computes probabilities of the possible continuations of the sentence.
+The memory state of the network is initialized with a vector of zeros and gets
+updated after reading each word. For computational reasons, we will process
+data in mini-batches of size `batch_size`.
+
+The basic pseudocode looks as follows:
+
+```python
+lstm = rnn_cell.BasicLSTMCell(lstm_size)
+# Initial state of the LSTM memory.
+state = tf.zeros([batch_size, lstm.state_size])
+
+loss = 0.0
+for current_batch_of_words in words_in_dataset:
+    # The value of state is updated after processing each batch of words.
+    output, state = lstm(current_batch_of_words, state)
+
+    # The LSTM output can be used to make next word predictions.
+    logits = tf.matmul(output, softmax_w) + softmax_b
+    probabilities = tf.nn.softmax(logits)
+    loss += loss_function(probabilities, target_words)
+```
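+
+`softmax_w` and `softmax_b` above are the output projection that maps the LSTM
+output back onto the vocabulary. A minimal way to create them (the variable
+names and the default initializer are illustrative, not the exact code from
+`ptb_word_lm.py`) is:
+
+```python
+softmax_w = tf.get_variable("softmax_w", [lstm_size, vocabulary_size])
+softmax_b = tf.get_variable("softmax_b", [vocabulary_size])
+```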
+
+### Truncated Backpropagation
+
+In order to make the learning process tractable, it is common practice to
+truncate the gradients for backpropagation to a fixed number (`num_steps`)
+of unrolled steps. This is easy to implement by feeding inputs of length
+`num_steps` at a time and performing a backward pass after each such block.
+
+A simplified version of the code that creates the graph for truncated
+backpropagation:
+
+```python
+# Placeholder for the inputs in a given iteration.
+words = tf.placeholder(tf.int32, [batch_size, num_steps])
+
+lstm = rnn_cell.BasicLSTMCell(lstm_size)
+# Initial state of the LSTM memory.
+initial_state = state = tf.zeros([batch_size, lstm.state_size])
+
+for i in range(num_steps):
+    # The value of state is updated after processing each batch of words.
+    output, state = lstm(words[:, i], state)
+
+    # The rest of the code.
+    # ...
+
+final_state = state
+```
+
+And this is how to implement an iteration over the whole dataset:
+
+```python
+# A numpy array holding the state of the LSTM after each batch of words.
+numpy_state = initial_state.eval()
+total_loss = 0.0
+for current_batch_of_words in words_in_dataset:
+    numpy_state, current_loss = session.run([final_state, loss],
+        # Initialize the LSTM state from the previous iteration.
+        feed_dict={initial_state: numpy_state, words: current_batch_of_words})
+    total_loss += current_loss
+```
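+
+In the full code, `reader.py` is responsible for producing `words_in_dataset`.
+As a rough sketch (not the actual reader implementation), cutting one long
+list of word ids into `[batch_size, num_steps]` blocks can look like this:
+
+```python
+import numpy as np
+
+def batch_iterator(word_ids, batch_size, num_steps):
+    # Lay the ids out as batch_size parallel streams, then yield blocks of
+    # num_steps consecutive words from each stream.
+    data = np.array(word_ids, dtype=np.int32)
+    stream_len = len(data) // batch_size
+    data = data[:batch_size * stream_len].reshape(batch_size, stream_len)
+    for i in range(stream_len // num_steps):
+        yield data[:, i * num_steps:(i + 1) * num_steps]
+```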
+
+### Inputs
+
+The word IDs will be embedded into a dense representation (see the
+[Vector Representations Tutorial](../word2vec/index.md)) before being fed to
+the LSTM. This allows the model to efficiently represent its knowledge about
+particular words. It is also easy to write:
+
+```python
+# embedding_matrix is a tensor of shape [vocabulary_size, embedding_size]
+word_embeddings = tf.nn.embedding_lookup(embedding_matrix, word_ids)
+```
+
+The embedding matrix will be initialized randomly and the model will learn to
+differentiate the meaning of words just by looking at the data.
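+
+For example, the embedding matrix can simply be a trainable variable with a
+random initializer (the exact initializer used by the model may differ; this
+one is illustrative):
+
+```python
+embedding_matrix = tf.get_variable(
+    "embedding", [vocabulary_size, embedding_size],
+    initializer=tf.random_uniform_initializer(-1.0, 1.0))
+```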
+
+### Loss Function
+
+We want to minimize the average negative log probability of the target words:
+
+$$ \text{loss} = -\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i} $$
+
+It is not very difficult to implement, but the function
+`sequence_loss_by_example` is already available, so we can simply use it here.
+
+The typical measure reported in the papers is average per-word perplexity (often
+just called perplexity), which is equal to
+
+$$e^{-\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i}} = e^{\text{loss}} $$
+
+and we will monitor its value throughout the training process.
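+
+In practice this means exponentiating the average loss accumulated during the
+epoch, for example (assuming we also count the processed batches in
+`num_batches`, which the loop above does not show):
+
+```python
+import numpy as np
+
+# Each element added to total_loss is a per-batch average loss, so dividing
+# by the number of batches approximates the average per-word loss.
+perplexity = np.exp(total_loss / num_batches)
+```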
+
+### Stacking Multiple LSTMs
+
+To give the model more expressive power, we can add multiple layers of LSTMs
+to process the data. The output of the first layer will become the input of
+the second and so on.
+
+We have a class called `MultiRNNCell` that makes the implementation seamless:
+
+```python
+lstm = rnn_cell.BasicLSTMCell(lstm_size)
+stacked_lstm = rnn_cell.MultiRNNCell([lstm] * number_of_layers)
+
+initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
+for i in range(num_steps):
+    # The value of state is updated after processing each batch of words.
+    output, state = stacked_lstm(words[:, i], state)
+
+    # The rest of the code.
+    # ...
+
+final_state = state
+```
+
+## Compile and Run the Code
+
+First, the library needs to be built. To compile it on CPU:
+
+```
+bazel build -c opt tensorflow/models/rnn/ptb:ptb_word_lm
+```
+
+And if you have a fast GPU, run the following:
+
+```
+bazel build -c opt --config=cuda \
+  tensorflow/models/rnn/ptb:ptb_word_lm
+```
+
+Now we can run the model:
+
+```
+bazel-bin/.../ptb_word_lm \
+ --data_path=/tmp/simple-examples/data/ --alsologtostderr --model small
+```
+
+There are 3 supported model configurations in the tutorial code: "small",
+"medium" and "large". The difference between them is in the size of the LSTMs
+and the set of hyperparameters used for training.
+
+The larger the model, the better the results it should get. The `small` model
+should be able to reach a perplexity below 120 on the test set and the `large`
+one below 80, though it might take several hours to train.
+
+## What Next?
+
+There are several tricks that we haven't mentioned that make the model better,
+including:
+
+* a decreasing learning rate schedule,
+* dropout between the LSTM layers (see the sketch below).
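+
+As a rough sketch of how both tricks can be wired in (`keep_prob`,
+`initial_learning_rate` and `lr_decay` are illustrative hyperparameter names,
+not necessarily the ones used in `ptb_word_lm.py`):
+
+```python
+# Dropout on the non-recurrent connections between the stacked LSTM layers.
+lstm = rnn_cell.BasicLSTMCell(lstm_size)
+lstm = rnn_cell.DropoutWrapper(lstm, output_keep_prob=keep_prob)
+stacked_lstm = rnn_cell.MultiRNNCell([lstm] * number_of_layers)
+
+# A learning rate that the training loop can shrink, e.g. once per epoch,
+# by running lr_decay_op in the session.
+learning_rate = tf.Variable(initial_learning_rate, trainable=False)
+lr_decay_op = learning_rate.assign(learning_rate * lr_decay)
+```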
+
+Study the code and modify it to improve the model even further.