aboutsummaryrefslogtreecommitdiffhomepage
path: root/tensorflow/docs_src/programmers_guide/embedding.md
blob: 975850349f0e29f698dd36060ecdf51cf29e87bb (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
# Embeddings

[TOC]

## Introduction

An embedding is a mapping from discrete objects, such as words, to vectors of
real numbers. For example, a 300-dimensional embedding for English words could
include:

```
blue:  (0.01359, 0.00075997, 0.24608, ..., -0.2524, 1.0048, 0.06259)
blues:  (0.01396, 0.11887, -0.48963, ..., 0.033483, -0.10007, 0.1158)
orange:  (-0.24776, -0.12359, 0.20986, ..., 0.079717, 0.23865, -0.014213)
oranges:  (-0.35609, 0.21854, 0.080944, ..., -0.35413, 0.38511, -0.070976)
```

Embeddings let you apply machine learning to discrete inputs. Classifiers, and
neural networks more generally, are designed to work with dense continuous
vectors, where all values contribute to define what an object is.  If discrete
objects are naively encoded as discrete atoms, e.g., unique id numbers, they
hinder learning and generalization. One way to think of embeddings is as a way
to transform non-vector objects into useful inputs for machine learning.

Embeddings are also useful as outputs of machine learning. Because embeddings
map objects to vectors, applications can use similarity in vector space (e.g.,
Euclidean distance or the angle between vectors) as a robust and flexible
measure of object similarity. One common use is to find nearest neighbors.
Using the same word embeddings above, for instance, here are the three nearest
neighbors for each word and the corresponding angles (in degrees):

```
blue:  (red, 47.6°), (yellow, 51.9°), (purple, 52.4°)
blues:  (jazz, 53.3°), (folk, 59.1°), (bluegrass, 60.6°)
orange:  (yellow, 53.5°), (colored, 58.0°), (bright, 59.9°)
oranges:  (apples, 45.3°), (lemons, 48.3°), (mangoes, 50.4°)
```

This would tell an application that apples and oranges are in some way more
similar (45.3° apart) than lemons and oranges (48.3° apart).

## Training an Embedding

To train word embeddings in TensorFlow, we first need to split the text into
words and assign an integer to every word in the vocabulary. Let us assume that
this has already been done, and that `word_ids` is a vector of these integers.
For example, the sentence “I have a cat.” could be split into
`[“I”, “have”, “a”, “cat”, “.”]` and then the corresponding `word_ids` tensor
would have shape `[5]` and consist of 5 integers. To get these word ids
embedded, we need to create the embedding variable and use the `tf.gather`
function as follows:

```
word_embeddings = tf.get_variable(“word_embeddings”,
    [vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
```

After this, the tensor `embedded_word_ids` will have shape `[5, embedding_size]`
in our example and contain the embeddings (dense vectors) for each of the 5
words. The variable `word_embeddings` will be learned and at the end of the
training it will contain the embeddings for all words in the vocabulary.
The embeddings can be trained in many ways, depending on the data available.
For example, one could use a recurrent neural network to predict the next word
from the previous one given a large corpus of sentences, or one could train
two networks to do multi-lingual translation. These methods are described in
[Vector Representations of Words](../tutorials/word2vec.md) tutorial, but in
all cases there is an embedding variable like above and words are embedded
using `tf.gather`, as shown.

## Visualizing Embeddings

TensorBoard has a built-in visualizer, called the <i>Embedding Projector</i>,
for interactive visualization of embeddings. The embedding projector will read
the embeddings from your checkpoint file and project them into 3 dimensions using
[principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
For a visual explanation of PCA, see
[this article](http://setosa.io/ev/principal-component-analysis/). Another
very useful projection you can use is
[t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).

If you are working with an embedding, you'll probably want to attach
labels/images to the data points. You can do this by generating a
[metadata file](#metadata) containing the labels for each point and configuring
the projector either by using our Python API, or manually constructing and
saving a
<code>[projector_config.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/plugins/projector/projector_config.proto)</code>
in the same directory as your checkpoint file.

### Setup

For in depth information on how to run TensorBoard and make sure you are
logging all the necessary information, see
[TensorBoard: Visualizing Learning](../get_started/summaries_and_tensorboard.md).

To visualize your embeddings, there are 3 things you need to do:

1) Setup a 2D tensor that holds your embedding(s).

```python
embedding_var = tf.get_variable(....)
```

2) Periodically save your model variables in a checkpoint in
<code>LOG_DIR</code>.

```python
saver = tf.train.Saver()
saver.save(session, os.path.join(LOG_DIR, "model.ckpt"), step)
```

3) (Optional) Associate metadata with your embedding.

If you have any metadata (labels, images) associated with your embedding, you
can tell TensorBoard about it either by directly storing a
<code>[projector_config.pbtxt](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorboard/plugins/projector/projector_config.proto)</code>
in the <code>LOG_DIR</code>, or use our python API.

For instance, the following <code>projector_config.ptxt</code> associates the
<code>word_embedding</code> tensor with metadata stored in <code>$LOG_DIR/metadata.tsv</code>:

```
embeddings {
  tensor_name: 'word_embedding'
  metadata_path: '$LOG_DIR/metadata.tsv'
}
```

The same config can be produced programmatically using the following code snippet:

```python
from tensorflow.contrib.tensorboard.plugins import projector

# Create randomly initialized embedding weights which will be trained.
vocabulary_size = 10000
embedding_size = 200
embedding_var = tf.get_variable('word_embedding', [vocabulary_size, embedding_size])

# Format: tensorflow/tensorboard/plugins/projector/projector_config.proto
config = projector.ProjectorConfig()

# You can add multiple embeddings. Here we add only one.
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name
# Link this tensor to its metadata file (e.g. labels).
embedding.metadata_path = os.path.join(LOG_DIR, 'metadata.tsv')

# Use the same LOG_DIR where you stored your checkpoint.
summary_writer = tf.summary.FileWriter(LOG_DIR)

# The next line writes a projector_config.pbtxt in the LOG_DIR. TensorBoard will
# read this file during startup.
projector.visualize_embeddings(summary_writer, config)
```

After running your model and training your embeddings, run TensorBoard and point
it to the <code>LOG_DIR</code> of the job.

```python
tensorboard --logdir=LOG_DIR
```

Then click on the *Embeddings* tab on the top pane
and select the appropriate run (if there are more than one run).


### Metadata
Usually embeddings have metadata associated with it (e.g. labels, images). The
metadata should be stored in a separate file outside of the model checkpoint
since the metadata is not a trainable parameter of the model. The format should
be a [TSV file](https://en.wikipedia.org/wiki/Tab-separated_values)
(tab characters shown in red) with the first line containing column headers
(shown in bold) and subsequent lines contain the metadata values:

<code>
<b>Word<span style="color:#800;">\t</span>Frequency</b><br/>
  Airplane<span style="color:#800;">\t</span>345<br/>
  Car<span style="color:#800;">\t</span>241<br/>
  ...
</code>

There is no explicit key shared with the main data file; instead, the order in
the metadata file is assumed to match the order in the embedding tensor. In
other words, the first line is the header information and the (i+1)-th line in
the metadata file corresponds to the i-th row of the embedding tensor stored in
the checkpoint.

Note: If the TSV metadata file has only a single column, then we don’t expect a
header row, and assume each row is the label of the embedding. We include this
exception because it matches the commonly-used "vocab file" format.

### Images
If you have images associated with your embeddings, you will need to
produce a single image consisting of small thumbnails of each data point.
This is known as the
[sprite image](https://www.google.com/webhp#q=what+is+a+sprite+image).
The sprite should have the same number of rows and columns with thumbnails
stored in row-first order: the first data point placed in the top left and the
last data point in the bottom right:

<table style="border: none;">
<tr style="background-color: transparent;">
  <td style="border: 1px solid black">0</td>
  <td style="border: 1px solid black">1</td>
  <td style="border: 1px solid black">2</td>
</tr>
<tr style="background-color: transparent;">
  <td style="border: 1px solid black">3</td>
  <td style="border: 1px solid black">4</td>
  <td style="border: 1px solid black">5</td>
</tr>
<tr style="background-color: transparent;">
  <td style="border: 1px solid black">6</td>
  <td style="border: 1px solid black">7</td>
  <td style="border: 1px solid black"></td>
</tr>
</table>

Note in the example above that the last row doesn't have to be filled. For a
concrete example of a sprite, see
[this sprite image](https://www.tensorflow.org/images/mnist_10k_sprite.png) of 10,000 MNIST digits
(100x100).

Note: We currently support sprites up to 8192px X 8192px.

After constructing the sprite, you need to tell the Embedding Projector where
to find it:


```python
embedding.sprite.image_path = PATH_TO_SPRITE_IMAGE
# Specify the width and height of a single thumbnail.
embedding.sprite.single_image_dim.extend([w, h])
```

### Interaction

The Embedding Projector has three panels:

1. *Data panel* on the top left, where you can choose the run, the embedding
   tensor and data columns to color and label points by.
2. *Projections panel* on the bottom left, where you choose the type of
    projection (e.g. PCA, t-SNE).
3. *Inspector panel* on the right side, where you can search for particular
   points and see a list of nearest neighbors.

### Projections
The Embedding Projector has three methods of reducing the dimensionality of a
data set: two linear and one nonlinear. Each method can be used to create either
a two- or three-dimensional view.

**Principal Component Analysis** A straightforward technique for reducing
dimensions is Principal Component Analysis (PCA). The Embedding Projector
computes the top 10 principal components. The menu lets you project those
components onto any combination of two or three. PCA is a linear projection,
often effective at examining global geometry.

**t-SNE** A popular non-linear dimensionality reduction technique is t-SNE.
The Embedding Projector offers both two- and three-dimensional t-SNE views.
Layout is performed client-side animating every step of the algorithm. Because
t-SNE often preserves some local structure, it is useful for exploring local
neighborhoods and finding clusters. Although extremely useful for visualizing
high-dimensional data, t-SNE plots can sometimes be mysterious or misleading.
See this [great article](http://distill.pub/2016/misread-tsne/) for how to use
t-SNE effectively.

**Custom** You can also construct specialized linear projections based on text
searches for finding meaningful directions in space. To define a projection
axis, enter two search strings or regular expressions. The program computes the
centroids of the sets of points whose labels match these searches, and uses the
difference vector between centroids as a projection axis.

### Navigation

To explore a data set, you can navigate the views in either a 2D or a 3D mode,
zooming, rotating, and panning using natural click-and-drag gestures.
Clicking on a point causes the right pane to show an explicit textual list of
nearest neighbors, along with distances to the current point. The
nearest-neighbor points themselves are highlighted on the projection.

Zooming into the cluster gives some information, but it is sometimes more
helpful to restrict the view to a subset of points and perform projections only
on those points. To do so, you can select points in multiple ways:

1. After clicking on a point, its nearest neighbors are also selected.
2. After a search, the points matching the query are selected.
3. Enabling selection, clicking on a point and dragging defines a selection
   sphere.

After selecting a set of points, you can isolate those points for
further analysis on their own with the "Isolate Points" button in the Inspector
pane on the right hand side.


![Selection of nearest neighbors](https://www.tensorflow.org/images/embedding-nearest-points.png "Selection of nearest neighbors")
*Selection of the nearest neighbors of “important” in a word embedding dataset.*

The combination of filtering with custom projection can be powerful. Below, we filtered
the 100 nearest neighbors of “politics” and projected them onto the
“best” - “worst” vector as an x axis. The y axis is random.

You can see that on the right side we have “ideas”, “science”, “perspective”,
“journalism” while on the left we have “crisis”, “violence” and “conflict”.

<table width="100%;">
  <tr>
    <td style="width: 30%;">
      <img src="https://www.tensorflow.org/images/embedding-custom-controls.png" alt="Custom controls panel" title="Custom controls panel" />
    </td>
    <td style="width: 70%;">
      <img src="https://www.tensorflow.org/images/embedding-custom-projection.png" alt="Custom projection" title="Custom projection" />
    </td>
  </tr>
  <tr>
    <td style="width: 30%;">
      Custom projection controls.
    </td>
    <td style="width: 70%;">
      Custom projection of neighbors of "politics" onto "best" - "worst" vector.
    </td>
  </tr>
</table>

### Collaborative Features

To share your findings, you can use the bookmark panel in the bottom right
corner and save the current state (including computed coordinates of any
projection) as a small file. The Projector can then be pointed to a set of one
or more of these files, producing the panel below. Other users can then walk
through a sequence of bookmarks.

<img src="https://www.tensorflow.org/images/embedding-bookmark.png" alt="Bookmark panel" style="width:300px;">


## Mini-FAQ

**Is "embedding" an action or a thing?**
Both. People talk about embedding words in a vector space (action) and about
producing word embeddings (things).  Common to both is the notion of embedding
as a mapping from discrete objects to vectors. Creating or applying that
mapping is an action, but the mapping itself is a thing.

**Are embeddings high-dimensional or low-dimensional?**
It depends. A 300-dimensional vector space of words and phrases, for instance,
is often called low-dimensional (and dense) when compared to the millions of
words and phrases it can contain. But mathematically it is high-dimensional,
displaying many properties that are dramatically different from what our human
intuition has learned about 2- and 3-dimensional spaces.

**Is an embedding the same as an embedding layer?**
No; an embedding layer is a part of neural network, but an embedding is a more
general concept.