# Feature Columns

This document details feature columns. Think of **feature columns** as the
intermediaries between raw data and Estimators. Feature columns are very rich,
enabling you to transform a diverse range of raw data into formats that
Estimators can use, allowing easy experimentation.

In @{$premade_estimators$Premade Estimators}, we used the premade
Estimator `tf.estimator.DNNClassifier` to train a model to
predict different types of Iris flowers from four input features. That example
created only numerical feature columns (of type
`tf.feature_column.numeric_column`). Although numerical feature columns model
the lengths of petals and sepals effectively, real world data sets contain all
kinds of features, many of which are non-numerical.

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/feature_cloud.jpg">
</div>
<div style="text-align: center">
Some real-world features (such as longitude) are numerical, but many are not.
</div>

## Input to a Deep Neural Network

What kind of data can a deep neural network operate on? The answer
is, of course, numbers (for example, `tf.float32`). After all, every neuron in
a neural network performs multiplication and addition operations on weights and
input data. Real-life input data, however, often contains non-numerical
(categorical) data. For example, consider a `product_class` feature that can
contain the following three non-numerical values:

* `kitchenware`
* `electronics`
* `sports`

ML models generally represent categorical values as simple vectors in which a
1 represents the presence of a value and a 0 represents the absence of a value.
For example, when `product_class` is set to `sports`, an ML model would usually
represent `product_class` as  `[0, 0, 1]`, meaning:

* `0`: `kitchenware` is absent
* `0`: `electronics` is absent
* `1`: `sports` is present

So, although raw data can be numerical or categorical, an ML model represents
all features as numbers.
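
For instance, here is a minimal sketch of that one-hot representation (the
integer index 2 standing for `sports` is an assumption made only for this
illustration):

```python
import tensorflow as tf

# Assume the categories are indexed 0=kitchenware, 1=electronics, 2=sports.
product_class_index = 2  # "sports"

# One-hot encode: a 1 in the position of the present category, 0 elsewhere.
one_hot_product_class = tf.one_hot(product_class_index, depth=3)  # -> [0., 0., 1.]
```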

## Feature Columns

As the following figure suggests, you specify the input to a model through the
`feature_columns` argument of an Estimator (`DNNClassifier` for Iris).
Feature columns bridge input data (as returned by `input_fn`) with your model.

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/inputs_to_model_bridge.jpg">
</div>
<div style="text-align: center">
Feature columns bridge raw data with the data your model needs.
</div>

To create feature columns, call functions from the
`tf.feature_column` module. This document explains nine of the functions in
that module. As the following figure shows, all nine functions return either a
Categorical-Column or a Dense-Column object, except `bucketized_column`, which
inherits from both classes:

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/some_constructors.jpg">
</div>
<div style="text-align: center">
Feature column methods fall into two main categories and one hybrid category.
</div>

Let's look at these functions in more detail.

### Numeric column

The Iris classifier calls the `tf.feature_column.numeric_column` function for
all input features:

  * `SepalLength`
  * `SepalWidth`
  * `PetalLength`
  * `PetalWidth`

Although `tf.feature_column.numeric_column` provides optional arguments, calling
it without any arguments, as follows, is a fine way to specify a numerical value
with the default data type (`tf.float32`) as input to your model:

```python
# Defaults to a tf.float32 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")
```

To specify a non-default numerical data type, use the `dtype` argument. For
example:

``` python
# Represent a tf.float64 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength",
                                                          dtype=tf.float64)
```

By default, a numeric column creates a single value (scalar). Use the `shape`
argument to specify another shape. For example:

<!--TODO(markdaoust) link to full example-->
```python
# Represent a 10-element vector in which each cell contains a tf.float32.
vector_feature_column = tf.feature_column.numeric_column(key="Bowling",
                                                         shape=10)

# Represent a 10x5 matrix in which each cell contains a tf.float32.
matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix",
                                                         shape=[10,5])
```
### Bucketized column

Often, you don't want to feed a number directly into the model, but instead
split its value into different categories based on numerical ranges.  To do so,
create a `tf.feature_column.bucketized_column`. For
example, consider raw data that represents the year a house was built. Instead
of representing that year as a scalar numeric column, we could split the year
into the following four buckets:

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/bucketized_column.jpg">
</div>
<div style="text-align: center">
Dividing year data into four buckets.
</div>

The model will represent the buckets as follows:

|Date Range |Represented as... |
|:----------|:-----------------|
|< 1960               | [1, 0, 0, 0] |
|>= 1960 but < 1980   | [0, 1, 0, 0] |
|>= 1980 but < 2000   | [0, 0, 1, 0] |
|>= 2000              | [0, 0, 0, 1] |

Why would you want to split a number—a perfectly valid input to your
model—into a categorical value? Well, notice that the categorization splits a
single input number into a four-element vector. Therefore, the model now can
learn _four individual weights_ rather than just one; four weights create a
richer model than one weight. More importantly, bucketizing enables the model
to clearly distinguish between different year categories, since only one of the
elements is set (1) and the other three elements are cleared (0). In contrast,
when we use just a single number (a year) as input, a linear model can only
learn a linear relationship between that number and the output. So, bucketing
provides the model with additional flexibility that the model can use to learn.

The following code demonstrates how to create a bucketized feature:

<!--TODO(markdaoust) link to full example - housing price grid?-->
```python
# First, convert the raw input to a numeric column.
numeric_feature_column = tf.feature_column.numeric_column("Year")

# Then, bucketize the numeric column on the years 1960, 1980, and 2000.
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column = numeric_feature_column,
    boundaries = [1960, 1980, 2000])
```
Note that specifying a _three_-element boundaries vector creates a
_four_-element bucketized vector.


### Categorical identity column

**Categorical identity columns** can be seen as a special case of bucketized
columns. In traditional bucketized columns, each bucket represents a range of
values (for example, from 1960 to 1979). In a categorical identity column, each
bucket represents a single, unique integer. For example, let's say you want to
represent the integer range `[0, 4)`.  That is, you want to represent the
integers 0, 1, 2, or 3. In this case, the categorical identity mapping looks
like this:

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/categorical_column_with_identity.jpg">
</div>
<div style="text-align: center">
A categorical identity column mapping. Note that this is a one-hot
encoding, not a binary numerical encoding.
</div>

As with bucketized columns, a model can learn a separate weight for each class
in a categorical identity column. For example, instead of using a string to
represent the `product_class`, let's represent each class with a unique integer
value. That is:

* `0="kitchenware"`
* `1="electronics"`
* `2="sport"`

Call `tf.feature_column.categorical_column_with_identity` to implement a
categorical identity column. For example:

``` python
# Create categorical output for an integer feature named "my_feature_b".
# The values of my_feature_b must be >= 0 and < num_buckets.
identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key='my_feature_b',
    num_buckets=4) # Values [0, 4)

# In order for the preceding call to work, the input_fn() must return
# a dictionary containing 'my_feature_b' as a key. Furthermore, the values
# assigned to 'my_feature_b' must belong to the set [0, 4).
def input_fn():
    ...
    return ({ 'my_feature_a':[7, 9, 5, 2], 'my_feature_b':[3, 1, 2, 2] },
            [Label_values])
```

### Categorical vocabulary column

We cannot input strings directly to a model. Instead, we must first map strings
to numeric or categorical values. Categorical vocabulary columns provide a good
way to represent strings as a one-hot vector. For example:

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/categorical_column_with_vocabulary.jpg">
</div>
<div style="text-align: center">
Mapping string values to vocabulary columns.
</div>

As you can see, categorical vocabulary columns are kind of an enum version of
categorical identity columns. TensorFlow provides two different functions to
create categorical vocabulary columns:

* `tf.feature_column.categorical_column_with_vocabulary_list`
* `tf.feature_column.categorical_column_with_vocabulary_file`

`categorical_column_with_vocabulary_list` maps each string to an integer based
on an explicit vocabulary list. For example:

```python
# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature by mapping the input to one of
# the elements in the vocabulary list.
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key=feature_name_from_input_fn,
    vocabulary_list=["kitchenware", "electronics", "sports"])
```

The preceding function is pretty straightforward, but it has a significant
drawback: typing out the full vocabulary becomes impractical when the list is
long. For those cases, call
`tf.feature_column.categorical_column_with_vocabulary_file` instead, which lets
you place the vocabulary words in a separate file. For example:

```python
# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature for our model by mapping the input to one of
# the elements in the vocabulary file.
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_file(
    key=feature_name_from_input_fn,
    vocabulary_file="product_class.txt",
    vocabulary_size=3)
```

`product_class.txt` should contain one line for each vocabulary element. In our
case:

```None
kitchenware
electronics
sports
```

### Hashed Column

So far, we've worked with a naively small number of categories. For example,
our `product_class` example has only 3 categories. Often though, the number of
categories can be so big that it's not possible to have individual categories
for each vocabulary word or integer because that would consume too much memory.
For these cases, we can instead turn the question around and ask, "How many
categories am I willing to have for my input?"  In fact, the
`tf.feature_column.categorical_column_with_hash_bucket` function enables you
to specify the number of categories. For this type of feature column the model
calculates a hash value of the input, then puts it into one of
the `hash_bucket_size` categories using the modulo operator, as in the following
pseudocode:

```python
# pseudocode
feature_id = hash(raw_feature) % hash_bucket_size
```

The code to create the `feature_column` might look something like this:

``` python
hashed_feature_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="some_feature",
    hash_bucket_size=100)  # The number of categories
```
At this point, you might rightfully think: "This is crazy!" After all, we are
forcing the different input values into a smaller set of categories. This means
that two probably unrelated inputs will be mapped to the same category, and
consequently will mean the same thing to the neural network. The following
figure illustrates this dilemma, showing that kitchenware and sports both get
assigned to category (hash bucket) 12:

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/hashed_column.jpg">
</div>
<div style="text-align: center">
Representing data with hash buckets.
</div>

As with many counterintuitive phenomena in machine learning, it turns out that
hashing often works well in practice. That's because hash categories provide
the model with some separation. The model can use additional features to further
separate kitchenware from sports.

### Crossed column

Combining features into a single feature, better known as
[feature crosses](https://developers.google.com/machine-learning/glossary/#feature_cross),
enables the model to learn separate weights for each combination of
features.

More concretely, suppose we want our model to calculate real estate prices in
Atlanta, GA. Real-estate prices within this city vary greatly depending on
location. Representing latitude and longitude as separate features isn't very
useful in identifying real-estate location dependencies; however, crossing
latitude and longitude into a single feature can pinpoint locations. Suppose we
represent Atlanta as a grid of 100x100 rectangular sections, identifying each
of the 10,000 sections by a feature cross of latitude and longitude. This
feature cross enables the model to train on pricing conditions related to each
individual section, which is a much stronger signal than latitude and longitude
alone.

The following figure shows our plan, with the latitude & longitude values for
the corners of the city in red text:

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/Atlanta.jpg">
</div>
<div style="text-align: center">
Map of Atlanta. Imagine this map divided into 10,000 sections of
equal size.
</div>

For the solution, we combine the `bucketized_column` we looked at earlier with
the `tf.feature_column.crossed_column` function.

<!--TODO(markdaoust) link to full example-->

``` python
def make_dataset(latitude, longitude, labels):
    assert latitude.shape == longitude.shape == labels.shape

    features = {'latitude': latitude.flatten(),
                'longitude': longitude.flatten()}
    labels = labels.flatten()

    return tf.data.Dataset.from_tensor_slices((features, labels))


# Bucketize the latitude and longitude using the `edges`
latitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    list(atlanta.latitude.edges))

longitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'),
    list(atlanta.longitude.edges))

# Cross the bucketized columns, using 5000 hash bins.
crossed_lat_lon_fc = tf.feature_column.crossed_column(
    [latitude_bucket_fc, longitude_bucket_fc], 5000)

fc = [
    latitude_bucket_fc,
    longitude_bucket_fc,
    crossed_lat_lon_fc]

# Build and train the Estimator.
est = tf.estimator.LinearRegressor(fc, ...)
```

You may create a feature cross from either of the following:

* Feature names; that is, names from the `dict` returned from `input_fn`.
* Any categorical column, except `categorical_column_with_hash_bucket`
  (since `crossed_column` hashes the input).

When the feature columns `latitude_bucket_fc` and `longitude_bucket_fc` are
crossed, TensorFlow creates a `(latitude_bucket, longitude_bucket)` pair for
each example. This produces a full grid of possibilities as follows:

``` None
 (0,0),  (0,1)...  (0,99)
 (1,0),  (1,1)...  (1,99)
   ...     ...       ...
(99,0), (99,1)...(99, 99)
```

A full grid, however, would only be tractable for inputs with limited
vocabularies. Instead of building this potentially huge table of inputs,
`crossed_column` only builds the number of categories requested by the
`hash_bucket_size` argument. The feature column assigns an example to an index
by running a hash function on the tuple of inputs, followed by a modulo
operation with `hash_bucket_size`.
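
In pseudocode, bucketing a crossed feature therefore looks roughly like this (a
sketch only; the exact hash function is an implementation detail):

```python
# pseudocode
feature_cross_id = hash((latitude_bucket, longitude_bucket)) % hash_bucket_size
```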

As discussed earlier, the hash and modulo operations limit the number of
categories, but can cause category collisions; that is, multiple
(latitude, longitude) feature crosses will end up in the same hash bucket. In
practice though, performing feature crosses still adds significant value to the
learning capability of your models.

Somewhat counterintuitively, when creating feature crosses, you typically still
should include the original (uncrossed) features in your model (as in the
preceding code snippet). The independent latitude and longitude features help the
model distinguish between examples where a hash collision has occurred in the
crossed feature.

## Indicator and embedding columns

Indicator columns and embedding columns never work on features directly, but
instead take categorical columns as input.

When using an indicator column, we're telling TensorFlow to do exactly what
we've seen in our categorical `product_class` example. That is, an
**indicator column** treats each category as an element in a one-hot vector,
where the matching category has value 1 and the rest have 0s:

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/categorical_column_with_identity.jpg">
</div>
<div style="text-align: center">
Representing data in indicator columns.
</div>

Here's how you create an indicator column by calling
`tf.feature_column.indicator_column`:

``` python
categorical_column = ... # Create any type of categorical column.

# Represent the categorical column as an indicator column.
indicator_column = tf.feature_column.indicator_column(categorical_column)
```

Now, suppose instead of having just three possible classes, we have a million.
Or maybe a billion. For a number of reasons, as the number of categories grows
large, it becomes infeasible to train a neural network using indicator columns.

We can use an embedding column to overcome this limitation. Instead of
representing the data as a one-hot vector of many dimensions, an
**embedding column** represents that data as a lower-dimensional, ordinary
vector in which each cell can contain any number, not just 0 or 1. By
permitting a richer palette of numbers for every cell, an embedding column
contains far fewer cells than an indicator column.

Let's look at an example comparing indicator and embedding columns. Suppose our
input examples consist of different words from a limited palette of only 81
words. Further suppose that the data set provides the following input
words in 4 separate examples:

* `"dog"`
* `"spoon"`
* `"scissors"`
* `"guitar"`

In that case, the following figure illustrates the processing path for
embedding columns or indicator columns.

<div style="width:80%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%" src="../images/feature_columns/embedding_vs_indicator.jpg">
</div>
<div style="text-align: center">
An embedding column stores categorical data in a lower-dimensional
vector than an indicator column. (We just placed random numbers into the
embedding vectors; training determines the actual numbers.)
</div>

When an example is processed, one of the `categorical_column_with...` functions
maps the example string to a numerical categorical value. For example, a
function maps "spoon" to `[32]`. (The 32 comes from our imagination—the actual
values depend on the mapping function.) You may then represent these numerical
categorical values in either of the following two ways:

* As an indicator column. A function converts each numeric categorical value
  into an 81-element vector (because our palette consists of 81 words), placing
  a 1 in the index of the categorical value (0, 32, 79, 80) and a 0 in all the
  other positions.

* As an embedding column. A function uses the numerical categorical values
  `(0, 32, 79, 80)` as indices to a lookup table. Each slot in that lookup table
  contains a 3-element vector. (See the sketch after this list.)
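
For instance, the lookup an embedding column performs can be sketched as plain
array indexing (the table values below are random placeholders; in a real model
the table is a trained variable):

```python
import numpy as np

# A hypothetical, untrained lookup table: 81 categories x 3 embedding dimensions.
embedding_table = np.random.uniform(-1, 1, size=(81, 3))

# Categorical ids produced by one of the categorical_column_with_... functions.
category_ids = [0, 32, 79, 80]

# An embedding column effectively performs this lookup, one row per example.
embedded_examples = embedding_table[category_ids]  # shape (4, 3)
```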

How do the values in the embedding vectors magically get assigned? Actually,
the assignments happen during training. That is, the model learns the best way
to map your input numeric categorical values to the embedding vector values in
order to solve your problem. Embedding columns increase your model's
capabilities, since an embedding vector learns new relationships between
categories from the training data.

Why is the embedding vector size 3 in our example? Well, the following "formula"
provides a general rule of thumb about the number of embedding dimensions:

```python
embedding_dimensions =  number_of_categories**0.25
```

That is, the embedding vector dimension should be the 4th root of the number of
categories. Since our vocabulary size in this example is 81, the recommended
number of dimensions is 3:

``` python
3 =  81**0.25
```
Note that this is just a general guideline; you can set the number of embedding
dimensions as you please.

Call `tf.feature_column.embedding_column` to create an `embedding_column` as
suggested by the following snippet:

``` python
categorical_column = ... # Create any categorical column

# Represent the categorical column as an embedding column.
# This means creating an embedding vector lookup table with one element for each category.
embedding_column = tf.feature_column.embedding_column(
    categorical_column=categorical_column,
    dimension=embedding_dimensions)
```

@{$guide/embedding$Embeddings} are a significant topic within machine
learning. This information was just to get you started using them as feature
columns.

## Passing feature columns to Estimators

As the following list indicates, not all Estimators permit all types of
feature column in their `feature_columns` argument(s):

* `tf.estimator.LinearClassifier` and
  `tf.estimator.LinearRegressor`: Accept all types of
  feature column.
* `tf.estimator.DNNClassifier` and
  `tf.estimator.DNNRegressor`: Only accept dense columns. Other
  column types must be wrapped in either an `indicator_column` or
  `embedding_column`, as shown in the sketch after this list.
* `tf.estimator.DNNLinearCombinedClassifier` and
  `tf.estimator.DNNLinearCombinedRegressor`:
    * The `linear_feature_columns` argument accepts any feature column type.
    * The `dnn_feature_columns` argument only accepts dense columns.
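
For example, here is a minimal sketch of wrapping a categorical column before
passing it to a `DNNClassifier` (the `product_class` key and the `hidden_units`
and `n_classes` values are illustrative assumptions, not part of the Iris
example):

```python
# A categorical column cannot be passed to a DNN estimator directly.
product_class_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="product_class",
    vocabulary_list=["kitchenware", "electronics", "sports"])

# Wrap it in an indicator_column (an embedding_column would also work).
product_class_indicator = tf.feature_column.indicator_column(product_class_column)

classifier = tf.estimator.DNNClassifier(
    feature_columns=[product_class_indicator],
    hidden_units=[10, 10],
    n_classes=3)
```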

## Other Sources

For more examples on feature columns, view the following:

* The @{$low_level_intro#feature_columns$Low Level Introduction} demonstrates how
  to experiment directly with `feature_columns` using TensorFlow's low-level APIs.
* The [Estimator wide and deep learning tutorial](https://github.com/tensorflow/models/tree/master/official/wide_deep)
  solves a binary classification problem using `feature_columns` on a variety of
  input data types.

To learn more about embeddings, see the following:

* [Deep Learning, NLP, and representations](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)
  (Chris Olah's blog)
* The TensorFlow [Embedding Projector](http://projector.tensorflow.org)