Diffstat (limited to 'tensorflow/g3doc/tutorials/word2vec/index.md')
-rw-r--r--  tensorflow/g3doc/tutorials/word2vec/index.md  396
1 file changed, 396 insertions, 0 deletions
diff --git a/tensorflow/g3doc/tutorials/word2vec/index.md b/tensorflow/g3doc/tutorials/word2vec/index.md
new file mode 100644
index 0000000000..8779f33ad7
--- /dev/null
+++ b/tensorflow/g3doc/tutorials/word2vec/index.md
@@ -0,0 +1,396 @@
+# Learning Vector Representations of Words
+
+In this tutorial we look at the word2vec model by
+[Mikolov et al.](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).
+This model is used for learning vector representations of words, called *word
+embeddings*.
+
+## Highlights
+
+This tutorial is meant to highlight the interesting, substantive parts of
+building a word2vec model in TensorFlow.
+
+* We start by giving the motivation for why we would want to
+represent words as vectors.
+* We look at the intuition behind the model and how it is trained
+(with a splash of math for good measure).
+* We also show a simple implementation of the model in TensorFlow.
+* Finally, we look at ways to make the naive version scale better.
+
+We walk through the code later during the tutorial, but if you'd prefer to
+dive straight in, feel free to look at the minimalistic implementation in
+[tensorflow/g3doc/tutorials/word2vec/word2vec_basic.py](./word2vec_basic.py).
+This basic example contains the code needed to download some data, train on it
+a bit and visualize the result. Once you get
+comfortable with reading and running the basic version, you can graduate to
+[tensorflow/models/embedding/word2vec.py](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/embedding/word2vec.py),
+which is a more serious implementation that showcases more advanced TensorFlow
+principles, such as how to efficiently use threads to move data into a text
+model, how to checkpoint during training, and so on.
+
+But first, let's look at why we would want to learn word embeddings in the first
+place. Feel free to skip this section if you're an Embedding Pro and you'd just
+like to get your hands dirty with the details.
+
+## Motivation: Why Learn Word Embeddings?
+
+Image and audio processing systems work with rich, high-dimensional datasets
+encoded as vectors of the individual raw pixel-intensities for image data, or
+e.g. power spectral density coefficients for audio data. For tasks like object
+or speech recognition we know that all the information required to successfully
+perform the task is encoded in the data (because humans can perform these tasks
+from the raw data). However, natural language processing systems traditionally
+treat words as discrete atomic symbols, and therefore 'cat' may be represented
+as `Id537` and 'dog' as `Id143`. These encodings are arbitrary, and provide
+no useful information to the system regarding the relationships that may exist
+between the individual symbols. This means that the model can leverage
+very little of what it has learned about 'cats' when it is processing data about
+'dogs' (for example, that they are both animals, four-legged, pets, etc.). Representing
+words as unique, discrete ids furthermore leads to data sparsity, and usually
+means that we need more data in order to successfully train statistical
+models. Using vector representations can overcome some of these obstacles.
+
+<div style="width:100%; margin:auto; margin-bottom:10px; margin-top:20px;">
+<img style="width:100%" src="img/audio-image-text.png" alt>
+</div>
+
+[Vector space models](https://en.wikipedia.org/wiki/Vector_space_model) (VSMs)
+represent (embed) words in a continuous vector space where semantically
+similar words are mapped to nearby points ('are embedded nearby each other').
+VSMs have a long, rich history in NLP, but all methods depend in some way or
+another on the
+[Distributional Hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_Hypothesis),
+which states that words that appear in the same contexts share
+semantic meaning. The different approaches that leverage this principle can be
+divided into two categories: *count-based methods* (e.g.
+[Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis)),
+and *predictive methods* (e.g.
+[neural probabilistic language models](http://www.scholarpedia.org/article/Neural_net_language_models)).
+
+This distinction is elaborated in much more detail by
+[Baroni et al.](http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf),
+but in a nutshell: Count-based methods compute the statistics of
+how often some word co-occurs with its neighbor words in a large text corpus,
+and then map these count-statistics down to a small, dense vector for each word.
+Predictive models directly try to predict a word from its neighbors in terms of
+learned small, dense *embedding vectors* (considered parameters of the
+model).
+
+Word2vec is a particularly computationally-efficient predictive model for
+learning word embeddings from raw text. It comes in two flavors, the Continuous
+Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically, these
+models are similar, except that CBOW predicts target words (e.g. 'mat') from
+source context words ('the cat sits on the'), while the skip-gram does the
+inverse and predicts source context-words from the target words. This inversion
+might seem like an arbitrary choice, but statistically it has the effect that
+CBOW smoothes over a lot of the distributional information (by treating an
+entire context as one observation). For the most part, this turns out to be a
+useful thing for smaller datasets. However, skip-gram treats each context-target
+pair as a new observation, and this tends to do better when we have larger
+datasets. We will focus on the skip-gram model in the rest of this tutorial.
+
+
+## Scaling up with Noise-Contrastive Training
+
+Neural probabilistic language models are traditionally trained using the
+[maximum likelihood](https://en.wikipedia.org/wiki/Maximum_likelihood) (ML)
+principle to maximize the probability of the next word $$w_t$$ (for 'target')
+given the previous words $$h$$ (for 'history') in terms of a
+[*softmax* function](https://en.wikipedia.org/wiki/Softmax_function),
+
+$$
+\begin{align}
+P(w_t | h) &= \text{softmax}(\text{score}(w_t, h)) \\
+ &= \frac{\exp \{ \text{score}(w_t, h) \} }
+ {\sum_\text{Word w' in Vocab} \exp \{ \text{score}(w', h) \} }.
+\end{align}
+$$
+
+where $$\text{score}(w_t, h)$$ computes the compatibility of word $$w_t$$ with
+the context $$h$$ (a dot product is commonly used). We train this model by
+maximizing its log-likelihood on the training set, i.e. by maximizing
+
+$$
+\begin{align}
+ J_\text{ML} &= \log P(w_t | h) \\
+ &= \text{score}(w_t, h) -
+ \log \left( \sum_\text{Word w' in Vocab} \exp \{ \text{score}(w', h) \} \right)
+\end{align}
+$$
+
+This yields a properly normalized probabilistic model for language modeling.
+However, this is very expensive, because we need to compute and normalize each
+probability using the score for all other $$V$$ words $$w'$$ in the current
+context $$h$$, *at every training step*.
+
+<div style="width:60%; margin:auto; margin-bottom:10px; margin-top:20px;">
+<img style="width:100%" src="img/softmax-nplm.png" alt>
+</div>
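+
+To make the cost concrete, here is a minimal NumPy sketch (illustrative only,
+not part of the tutorial code) of the full-softmax probability for a single
+context; note that the normalization in the last line touches every one of the
+$$V$$ vocabulary words:
+
+```python
+import numpy as np
+
+# Illustrative full-softmax probability for one (target, context) pair.
+# `output_weights` has one row per vocabulary word, so computing the
+# denominator costs O(V) work for every single training example.
+def full_softmax_prob(target_id, context_vec, output_weights):
+  scores = output_weights.dot(context_vec)  # score(w', h) for every word w'
+  scores -= scores.max()                    # for numerical stability
+  exp_scores = np.exp(scores)
+  return exp_scores[target_id] / exp_scores.sum()
+```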
+
+On the other hand, for feature learning in word2vec we do not need a full
+probabilistic model. The CBOW and skip-gram models are instead trained using a
+binary classification objective (logistic regression) to discriminate the real
+target words $$w_t$$ from $$k$$ imaginary (noise) words $$\tilde w$$, in the
+same context. We illustrate this below for a CBOW model. For skip-gram the
+direction is simply inverted.
+
+<div style="width:60%; margin:auto; margin-bottom:10px; margin-top:20px;">
+<img style="width:100%" src="img/nce-nplm.png" alt>
+</div>
+
+Mathematically, the objective (for each example) is to maximize
+
+$$J_\text{NEG} = \log Q_\theta(D=1 |w_t, h) +
+ k \mathop{\mathbb{E}}_{\tilde w \sim P_\text{noise}}
+ \left[ \log Q_\theta(D = 0 |\tilde w, h) \right]$$,
+
+where $$Q_\theta(D=1 | w, h)$$ is the binary logistic regression probability
+under the model of seeing the word $$w$$ in the context $$h$$ in the dataset
+$$D$$, calculated in terms of the learned embedding vectors $$\theta$$. In
+practice we approximate the expectation by drawing $$k$$ contrastive words
+from the noise distribution (i.e. we compute a
+[Monte Carlo average](https://en.wikipedia.org/wiki/Monte_Carlo_integration)).
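+
+As a rough sketch (again illustrative, not the tutorial code), the
+per-example objective can be computed with a sigmoid over the score of the
+true pair and the scores of the $$k$$ sampled noise pairs, where
+$$Q_\theta(D=1 | w, h)$$ is modeled as the logistic sigmoid of the score:
+
+```python
+import numpy as np
+
+def sigmoid(x):
+  return 1.0 / (1.0 + np.exp(-x))
+
+# J_NEG for one example: `true_score` is score(w_t, h); `noise_scores` holds
+# score(w~, h) for the k words drawn from the noise distribution. The second
+# term is the Monte Carlo estimate of the expectation above.
+def neg_objective(true_score, noise_scores):
+  return np.log(sigmoid(true_score)) + np.sum(np.log(sigmoid(-noise_scores)))
+```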
+
+This objective is maximized when the model assigns high probabilities
+to the real words, and low probabilities to noise words. Technically, this is
+called
+[Negative Sampling](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf),
+and there is good mathematical motivation for using this loss function:
+The updates it proposes approximate the updates of the softmax function in the
+limit. But computationally it is especially appealing because computing the
+loss function now scales only with the number of *noise words* that we
+select ($$k$$), and not *all words* in the vocabulary ($$V$$). This makes it
+much faster to train. We will actually make use of the very similar
+[noise-contrastive estimation (NCE)](http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf)
+loss, for which TensorFlow has a handy helper function `tf.nn.nce_loss()`.
+
+Let's get an intuitive feel for how this would work in practice!
+
+## The Skip-gram Model
+
+As an example, let's consider the dataset
+
+`the quick brown fox jumped over the lazy dog`
+
+We first form a dataset of words and the contexts in which they appear. We
+could define 'context' in any way that makes sense, and in fact people have
+looked at syntactic contexts (i.e. the syntactic dependents of the current
+target word, see e.g.
+[Levy et al.](https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf)),
+words-to-the-left of the target, words-to-the-right of the target, etc. For now,
+let's stick to the vanilla definition and define 'context' as the window
+of words to the left and to the right of a target word. Using a window
+size of 1, we then have the dataset
+
+`([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...`
+
+of `(context, target)` pairs. Recall that skip-gram inverts contexts and
+targets, and tries to predict each context word from its target word, so the
+task becomes to predict 'the' and 'brown' from 'quick', 'quick' and 'fox' from
+'brown', etc. Therefore our dataset becomes
+
+`(quick, the), (quick, brown), (brown, quick), (brown, fox), ...`
+
+of `(input, output)` pairs. The objective function is defined over the entire
+dataset, but we typically optimize this with
+[stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
+(SGD) using one example at a time (or a 'minibatch' of `batch_size` examples,
+where typically `16 <= batch_size <= 512`). So let's look at one step of
+this process.
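+
+Before we walk through a single step, here is a minimal sketch of how such
+`(input, output)` pairs could be generated for a window size of 1 (the
+tutorial's actual batching lives in `generate_batch()` in word2vec_basic.py):
+
+```python
+words = "the quick brown fox jumped over the lazy dog".split()
+
+# Each word predicts its immediate left and right neighbors (window size 1).
+skip_gram_pairs = []
+for i, target in enumerate(words):
+  for j in (i - 1, i + 1):
+    if 0 <= j < len(words):
+      skip_gram_pairs.append((target, words[j]))
+
+print(skip_gram_pairs[:4])
+# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
+```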
+
+Let's imagine at training step $$t$$ we observe the first training case above,
+where the goal is to predict `the` from `quick`. We select `num_noise` noisy
+(contrastive) examples by drawing from some noise distribution,
+typically the unigram distribution, $$P(w)$$. For simplicity let's say
+`num_noise=1` and we select `sheep` as a noisy example. Next we compute the
+loss for this pair of observed and noisy examples, i.e. the objective at time
+step $$t$$ becomes
+
+$$J^{(t)}_\text{NEG} = \log Q_\theta(D=1 | \text{the, quick}) +
+ \log(Q_\theta(D=0 | \text{sheep, quick}))$$.
+
+The goal is to make an update to the embedding parameters $$\theta$$ to improve
+(in this case, maximize) this objective function. We do this by deriving the
+gradient of the loss with respect to the embedding parameters $$\theta$$, i.e.
+$$\frac{\partial}{\partial \theta} J_\text{NEG}$$ (luckily TensorFlow provides
+easy helper functions for doing this!). We then perform an update to the
+embeddings by taking a small step in the direction of the gradient. When this
+process is repeated over the entire training set, this has the effect of
+'moving' the embedding vectors around for each word until the model is
+successful at discriminating real words from noise words.
+
+We can visualize the learned vectors by projecting them down to 2 dimensions
+using for instance something like the
+[t-SNE dimensionality reduction technique](http://lvdmaaten.github.io/tsne/).
+When we inspect these visualizations it becomes apparent that the vectors
+capture some general, and in fact quite useful, semantic information about
+words and their relationships to one another. It was very interesting when we
+first discovered that certain directions in the induced vector space specialize
+towards certain semantic relationships, e.g. *male-female*, *verb tense* and
+even *country-capital* relationships between words, as illustrated in the figure
+below (see also for example
+[Mikolov et al., 2013](http://www.aclweb.org/anthology/N13-1090)).
+
+<div style="width:100%; margin:auto; margin-bottom:10px; margin-top:20px;">
+<img style="width:100%" src="img/linear-relationships.png" alt>
+</div>
+
+This explains why these vectors are also useful as features for many canonical
+NLP prediction tasks, such as part-of-speech tagging or named entity recognition
+(see for example the original work by
+[Collobert et al.](http://arxiv.org/pdf/1103.0398v1.pdf), or follow-up work by
+[Turian et al.](http://www.aclweb.org/anthology/P10-1040)).
+
+But for now, let's just use them to draw pretty pictures!
+
+## Building the Graph
+
+This is all about embeddings, so let's define our embedding matrix.
+This is just a big random matrix to start. We'll initialize the values to be
+uniform in the unit cube.
+
+```python
+embeddings = tf.Variable(
+ tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
+```
+
+The noise-contrastive estimation loss is defined in terms of a logistic regression
+model. For this, we need to define the weights and biases for each word in the
+vocabulary (also called the `output weights` as opposed to the `input
+embeddings`). So let's define that.
+
+```python
+nce_weights = tf.Variable(
+ tf.truncated_normal([vocabulary_size, embedding_size],
+ stddev=1.0 / math.sqrt(embedding_size)))
+nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
+```
+
+Now that we have the parameters in place, we can define our skip-gram model
+graph. For simplicity, let's suppose we've already integerized our text corpus
+with a vocabulary so that each word is represented as an integer (see
+[tensorflow/g3doc/tutorials/word2vec/word2vec_basic.py](./word2vec_basic.py) for
+the details). The skip-gram model takes two inputs. One is a batch full of
+integers representing the source context words, the other is for the target
+words. Let's create placeholder nodes for these inputs, so that we can feed in
+data later.
+
+```python
+# Placeholders for inputs
+train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
+train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
+```
+
+Now what we need to do is look up the vector for each of the source words in
+the batch. TensorFlow has handy helpers that make this easy.
+
+```python
+embed = tf.nn.embedding_lookup(embeddings, train_inputs)
+```
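+
+Conceptually, this lookup just gathers the rows of the embedding matrix
+indexed by the batch, roughly equivalent to the following:
+
+```python
+# Roughly equivalent: select one row of `embeddings` per input word id.
+# The result has shape [batch_size, embedding_size].
+embed = tf.gather(embeddings, train_inputs)
+```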
+
+Ok, now that we have the embeddings for each word, we'd like to try to predict
+the target word using the noise-contrastive training objective.
+
+```python
+# Compute the NCE loss, using a sample of the negative labels each time.
+loss = tf.reduce_mean(
+ tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
+ num_sampled, vocabulary_size))
+```
+
+Now that we have a loss node, we need to add the nodes required to compute
+gradients and update the parameters, etc. For this we will use stochastic
+gradient descent, and TensorFlow has handy helpers to make this easy.
+
+```python
+# We use the SGD optimizer.
+optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
+```
+
+## Training the Model
+
+Training the model is then as simple as using a `feed_dict` to push data into
+the placeholders and calling `session.run` with this new data in a loop.
+
+```python
+for inputs, labels in generate_batch(...):
+  feed_dict = {train_inputs: inputs, train_labels: labels}
+ _, cur_loss = session.run([optimizer, loss], feed_dict=feed_dict)
+```
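+
+For completeness, a minimal sketch of the surrounding session setup (see
+word2vec_basic.py for the full version):
+
+```python
+with tf.Session() as session:
+  # All variables (embeddings, NCE weights and biases) must be initialized
+  # before training starts.
+  tf.initialize_all_variables().run()
+  # ... run the training loop shown above inside this block ...
+```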
+
+See the full example code in
+[tensorflow/g3doc/tutorials/word2vec/word2vec_basic.py](./word2vec_basic.py).
+
+## Visualizing the Learned Embeddings
+
+After training has finished we can visualize the learned embeddings using
+t-SNE.
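+
+A minimal sketch of such a plot, assuming scikit-learn and matplotlib are
+installed and using names like `final_embeddings` and `reverse_dictionary`
+from the basic example:
+
+```python
+import matplotlib.pyplot as plt
+from sklearn.manifold import TSNE
+
+# Project the first 500 embeddings down to 2-D and label each point with its
+# word. `final_embeddings` and `reverse_dictionary` come from word2vec_basic.py.
+tsne = TSNE(perplexity=30, n_components=2, init='pca')
+plot_only = 500
+low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
+
+plt.figure(figsize=(18, 18))
+for i in range(plot_only):
+  x, y = low_dim_embs[i, :]
+  plt.scatter(x, y)
+  plt.annotate(reverse_dictionary[i], xy=(x, y))
+plt.savefig('tsne.png')
+```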
+
+<div style="width:100%; margin:auto; margin-bottom:10px; margin-top:20px;">
+<img style="width:100%" src="img/tsne.png" alt>
+</div>
+
+Et voila! As expected, words that are similar end up clustering near each
+other. For a more heavyweight implementation of word2vec that showcases more of
+the advanced features of TensorFlow, see the implementation in
+[tensorflow/models/embedding/word2vec.py](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/embedding/word2vec.py).
+
+## Evaluating Embeddings: Analogical Reasoning
+
+Embeddings are useful for a wide variety of prediction tasks in NLP. Short of
+training a full-blown part-of-speech model or named-entity model, one simple way
+to evaluate embeddings is to directly use them to predict syntactic and semantic
+relationships like `king is to queen as father is to ?`. This is called
+*analogical reasoning*; the task was introduced by
+[Mikolov and colleagues](http://msr-waypoint.com/en-us/um/people/gzweig/Pubs/NAACL2013Regularities.pdf),
+and the dataset can be downloaded from here:
+https://word2vec.googlecode.com/svn/trunk/questions-words.txt.
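+
+The basic idea is to predict the fourth word from the other three by doing
+vector arithmetic on length-normalized embeddings. A rough sketch of that
+prediction (illustrative only; `embeddings`, `dictionary` and
+`reverse_dictionary` are assumed to come from the basic example):
+
+```python
+import numpy as np
+
+def analogy(a, b, c, embeddings, dictionary, reverse_dictionary):
+  """Answers 'a is to b as c is to ?' by nearest cosine neighbor."""
+  # Normalize rows so that a dot product is a cosine similarity.
+  norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
+  query = norm[dictionary[b]] - norm[dictionary[a]] + norm[dictionary[c]]
+  scores = norm.dot(query)
+  # Exclude the three query words themselves.
+  for w in (a, b, c):
+    scores[dictionary[w]] = -np.inf
+  return reverse_dictionary[int(np.argmax(scores))]
+
+# analogy('king', 'queen', 'father', ...) should come out as 'mother'
+# for well-trained embeddings.
+```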
+
+To see how we do this evaluation, have a look at the `build_eval_graph()` and
+`eval()` functions in
+[tensorflow/models/embedding/word2vec.py](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/embedding/word2vec.py).
+
+The choice of hyperparameters can strongly influence the accuracy on this task.
+Achieving state-of-the-art performance requires training over a very large
+dataset, carefully tuning the hyperparameters, and making use of tricks like
+subsampling the data, which is out of the scope of this tutorial.
+
+
+## Optimizing the Implementation
+
+Our vanilla implementation showcases the flexibility of TensorFlow. For
+example, changing the training objective is as simple as swapping out the call
+to `tf.nn.nce_loss()` for an off-the-shelf alternative such as
+`tf.nn.sampled_softmax_loss()`. If you have a new idea for a loss function, you
+can manually write an expression for the new objective in TensorFlow and let
+the optimizer compute its derivatives. This flexibility is invaluable in the
+exploratory phase of machine learning model development, where we are trying
+out several different ideas and iterating quickly.
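+
+For instance, under the graph defined earlier, the swap might look like the
+following (the argument order mirrors the `tf.nn.nce_loss` call above; check
+the API documentation of your TensorFlow version for the exact signature):
+
+```python
+# Same graph as before, but trained with a sampled-softmax objective
+# instead of the NCE loss.
+loss = tf.reduce_mean(
+    tf.nn.sampled_softmax_loss(nce_weights, nce_biases, embed, train_labels,
+                               num_sampled, vocabulary_size))
+```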
+
+Once you have a model structure you're satisfied with, it may be worth
+optimizing your implementation to run more efficiently (and cover more data in
+less time). For example, the naive code we used in this tutorial would suffer
+compromised speed because we use Python for reading and feeding data items --
+each of which requires very little work on the TensorFlow back-end. If you find
+your model is seriously bottlenecked on input data, you may want to implement a
+custom data reader for your problem, as described in [New Data
+Formats](../how_tos/new_data_formats/index.md). For the case of Skip-Gram
+modeling, we've actually already done this for you as an example in
+[tensorflow/models/embedding/word2vec.py](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/embedding/word2vec.py).
+
+If your model is no longer I/O bound but you want still more performance, you
+can take things further by writing your own TensorFlow Ops, as described in
+[Adding a New Op](../how_tos/adding_an_op/index.md). Again, we've provided an
+example of this for the Skip-Gram case in
+[tensorflow/models/embedding/word2vec_optimized.py](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/embedding/word2vec_optimized.py).
+Feel free to benchmark these against each other to measure performance
+improvements at each stage.
+
+## Conclusion
+
+In this tutorial we covered the word2vec model, a computationally efficient
+model for learning word embeddings. We motivated why embeddings are useful,
+discussed efficient training techniques and showed how to implement all of this
+in TensorFlow. Overall, we hope that this has showcased how TensorFlow affords
+you the flexibility you need for early experimentation, and the control you
+later need for bespoke, optimized implementations.