From ef4169f362a7fc0dadc5d2d3c232829d3da3c64b Mon Sep 17 00:00:00 2001
From: Petros Mol
Date: Wed, 24 May 2017 14:51:32 -0700
Subject: Tutorial for tf.contrib.kernel_methods module.

PiperOrigin-RevId: 157040972
---
 tensorflow/docs_src/tutorials/index.md          |   4 +-
 tensorflow/docs_src/tutorials/kernel_methods.md | 299 ++++++++++++++++++++++++
 tensorflow/docs_src/tutorials/leftnav_files     |   1 +
 3 files changed, 303 insertions(+), 1 deletion(-)
 create mode 100644 tensorflow/docs_src/tutorials/kernel_methods.md

diff --git a/tensorflow/docs_src/tutorials/index.md b/tensorflow/docs_src/tutorials/index.md
index f52a368681..b9c18f3288 100644
--- a/tensorflow/docs_src/tutorials/index.md
+++ b/tensorflow/docs_src/tutorials/index.md
@@ -40,10 +40,12 @@ The following tutorials focus on linear models:
   * @{$wide_and_deep$TensorFlow Wide & Deep Learning Tutorial}, which explains
     how to use the high-level API to jointly train both a wide linear model
    and a deep feed-forward neural network.
+  * @{$kernel_methods$Improving Linear Models Using Explicit Kernel Methods},
+    which shows how to improve the quality of a linear model by using explicit
+    kernel mappings.
 
 Although TensorFlow specializes in machine learning, you may also use
 TensorFlow to solve other kinds of math problems. For example:
 
   * @{$mandelbrot$Mandelbrot Set}
   * @{$pdes$Partial Differential Equations}
-
diff --git a/tensorflow/docs_src/tutorials/kernel_methods.md b/tensorflow/docs_src/tutorials/kernel_methods.md
new file mode 100644
index 0000000000..ee052db5d5
--- /dev/null
+++ b/tensorflow/docs_src/tutorials/kernel_methods.md
@@ -0,0 +1,299 @@
+# Improving Linear Models Using Explicit Kernel Methods
+
+In this tutorial, we demonstrate how combining (explicit) kernel methods with
+linear models can drastically increase the prediction quality of the latter
+without significantly increasing training and inference times. Unlike dual
+kernel methods, explicit (primal) kernel methods scale well with the size of the
+training dataset, both in terms of training/inference times and in terms of
+memory requirements.
+
+**Intended audience:** Even though we provide a high-level overview of concepts
+related to explicit kernel methods, this tutorial primarily targets readers who
+already have at least basic knowledge of kernel methods and Support Vector
+Machines (SVMs). If you are new to kernel methods, refer to either of the
+following sources for an introduction:
+
+* If you have a strong mathematical background:
+[Kernel Methods in Machine Learning](http://www.kernel-machines.org/publications/pdfs/0701907.pdf)
+* [Kernel method wikipedia page](https://en.wikipedia.org/wiki/Kernel_method)
+
+Currently, TensorFlow supports explicit kernel mappings for dense features only;
+TensorFlow will provide support for sparse features in a later release.
+
+This tutorial uses [tf.contrib.learn](https://www.tensorflow.org/code/tensorflow/contrib/learn/python/learn)
+(TensorFlow's high-level Machine Learning API) Estimators for our ML models.
+If you are not familiar with this API, the [tf.contrib.learn Quickstart](https://www.tensorflow.org/get_started/tflearn)
+is a good place to start. We will use the MNIST dataset. The tutorial consists
+of the following steps:
+
+* Load and prepare MNIST data for classification.
+* Construct a simple linear model, train it, and evaluate it on the eval data.
+* Replace the linear model with a kernelized linear model, re-train, and
+re-evaluate.
+
+## Load and prepare MNIST data for classification
+Run the following utility method to load the MNIST dataset:
+
+```python
+data = tf.contrib.learn.datasets.mnist.load_mnist()
+```
+The preceding method loads the entire MNIST dataset (containing 70K samples) and
+splits it into train, validation, and test data with 55K, 5K, and 10K samples,
+respectively. Each split contains one numpy array for images (with shape
+[sample_size, 784]) and one for labels (with shape [sample_size, 1]). In this
+tutorial, we only use the train and validation splits to train and evaluate our
+models, respectively.
+
+In order to feed data to a tf.contrib.learn Estimator, it is helpful to convert
+the data to Tensors. For this, we will use an `input function`, which adds Ops to
+the TensorFlow graph that, when executed, create mini-batches of Tensors to be
+used downstream. For more background on input functions, see
+@{$get_started/input_fn$Building Input Functions with tf.contrib.learn}. In this
+example, we will use the `tf.train.shuffle_batch` Op, which, besides converting
+numpy arrays to Tensors, allows us to specify the batch size and whether to
+randomize the input every time the `input_fn` Ops are executed (randomization
+typically expedites convergence during training). The full code for loading and
+preparing the data is shown in the snippet below. In this example, we use
+mini-batches of size 256 for training and the entire sample (5K entries) for
+evaluation. Feel free to experiment with different batch sizes. (An optional
+sketch right after the linear model's evaluation results below shows how to run
+these input Ops directly and inspect a single batch.)
+
+```python
+import numpy as np
+import tensorflow as tf
+
+def get_input_fn(dataset_split, batch_size, capacity=10000, min_after_dequeue=3000):
+
+  def _input_fn():
+    images_batch, labels_batch = tf.train.shuffle_batch(
+        tensors=[dataset_split.images, dataset_split.labels.astype(np.int32)],
+        batch_size=batch_size,
+        capacity=capacity,
+        min_after_dequeue=min_after_dequeue,
+        enqueue_many=True,
+        num_threads=4)
+    features_map = {'images': images_batch}
+    return features_map, labels_batch
+
+  return _input_fn
+
+data = tf.contrib.learn.datasets.mnist.load_mnist()
+
+train_input_fn = get_input_fn(data.train, batch_size=256)
+eval_input_fn = get_input_fn(data.validation, batch_size=5000)
+```
+
+## Training a simple linear model
+We can now train a linear model over the MNIST dataset. We will use the
+@{tf.contrib.learn.LinearClassifier} estimator with 10 classes representing the
+10 digits. The input features form a 784-dimensional dense vector which can
+be specified as follows:
+
+```python
+image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
+```
+
+The full code for constructing, training, and evaluating a LinearClassifier
+estimator is as follows:
+
+```python
+import time
+
+# Specify the feature(s) to be used by the estimator.
+image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
+estimator = tf.contrib.learn.LinearClassifier(feature_columns=[image_column], n_classes=10)
+
+# Train.
+start = time.time()
+estimator.fit(input_fn=train_input_fn, steps=2000)
+end = time.time()
+print('Elapsed time: {} seconds'.format(end - start))
+
+# Evaluate and report metrics.
+eval_metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
+print(eval_metrics)
+```
+The following table summarizes the results on the eval data.
+
+metric        | value
+:------------ | :------------
+loss          | 0.25 to 0.30
+accuracy      | 92.5%
+training time | ~25 seconds on my machine
+
+Note: Metrics will vary depending on factors such as the batch size, the number
+of training steps, and the hardware used.
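+
+As promised above, you can optionally sanity-check the input pipeline outside of
+an Estimator. The following sketch is an addition to this tutorial (not part of
+the original text); it assumes the `get_input_fn` code above has already been
+run, starts the queue runners that feed `tf.train.shuffle_batch`, and pulls a
+single mini-batch so that its shapes can be inspected:
+
+```python
+# Build the input Ops defined by train_input_fn and fetch one batch of data.
+features, labels = train_input_fn()
+
+with tf.Session() as sess:
+  # shuffle_batch relies on queue runners; when running outside an Estimator,
+  # they must be started explicitly.
+  coord = tf.train.Coordinator()
+  threads = tf.train.start_queue_runners(sess=sess, coord=coord)
+
+  images_batch, labels_batch = sess.run([features['images'], labels])
+  print(images_batch.shape)  # (256, 784)
+  print(labels_batch.shape)  # (256,) or (256, 1), depending on the label array
+
+  coord.request_stop()
+  coord.join(threads)
+```
+Estimators run these Ops (and the required queue runners) for you inside `fit`
+and `evaluate`, so this step is purely for inspection.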
+
+In addition to experimenting with the (training) batch size and the number of
+training steps, there are a couple of other parameters that can be tuned as well.
+For instance, you can change the optimization method used to minimize the loss
+by explicitly selecting another optimizer from the collection of
+[available optimizers](https://www.tensorflow.org/code/tensorflow/python/training).
+As an example, the following code constructs a LinearClassifier estimator that
+uses the Follow-The-Regularized-Leader (FTRL) optimization strategy with a
+specific learning rate and L2-regularization.
+
+```python
+optimizer = tf.train.FtrlOptimizer(learning_rate=5.0, l2_regularization_strength=1.0)
+estimator = tf.contrib.learn.LinearClassifier(
+    feature_columns=[image_column], n_classes=10, optimizer=optimizer)
+```
+
+Regardless of the values of the parameters, the maximum accuracy a linear model
+can achieve on this dataset caps at around **93%**.
+
+## Using explicit kernel mappings with the linear model
+The relatively high error (~7%) of the linear model over MNIST indicates that
+the input data is not linearly separable. We will use explicit kernel mappings
+to reduce the classification error.
+
+**Intuition:** The high-level idea is to use a non-linear map to transform the
+input space to another feature space (of possibly higher dimension) where the
+(transformed) features are (almost) linearly separable and then apply a linear
+model on the mapped features. This is shown in the following figure:
+
+*(Figure: a non-linear mapping transforms the input features into a space in
+which the transformed features become (almost) linearly separable.)*
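+
+To make this intuition concrete, here is a tiny, self-contained example added to
+this tutorial (plain numpy, not part of the TensorFlow API). A 1-D dataset whose
+classes cannot be separated by any threshold on \\(x\\) becomes linearly
+separable after the explicit non-linear mapping \\(x \mapsto (x, x^2)\\):
+
+```python
+import numpy as np
+
+# Toy 1-D data: class 1 lies inside the interval (-1, 1), class 0 lies outside.
+# No single threshold on x separates the classes, so a linear model on the raw
+# feature fails.
+x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
+y = (np.abs(x) < 1.0).astype(np.int32)
+
+# Explicit non-linear mapping phi(x) = (x, x**2). In the mapped 2-D space the
+# classes are separated by the linear rule "second coordinate < 1".
+phi = np.stack([x, x ** 2], axis=1)
+linear_rule = (phi[:, 1] < 1.0).astype(np.int32)
+
+print(np.array_equal(linear_rule, y))  # True: a linear separator now works.
+```
+The kernel mappings used in the rest of this tutorial play the same role, but
+they map the 784-dimensional MNIST images into a much higher-dimensional space.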
+
+### Technical details
+In this example we will use **Random Fourier Features**, introduced in the
+["Random Features for Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf)
+paper by Rahimi and Recht, to map the input data. Random Fourier Features map a
+vector \\(\mathbf{x} \in \mathbb{R}^d\\) to \\(\mathbf{x'} \in \mathbb{R}^D\\)
+via the following mapping:
+
+$$
+RFFM(\cdot): \mathbb{R}^d \to \mathbb{R}^D, \quad
+RFFM(\mathbf{x}) = \cos(\mathbf{\Omega} \cdot \mathbf{x} + \mathbf{b})
+$$
+
+where \\(\mathbf{\Omega} \in \mathbb{R}^{D \times d}\\),
+\\(\mathbf{x} \in \mathbb{R}^d,\\) \\(\mathbf{b} \in \mathbb{R}^D\\) and the
+cosine is applied element-wise.
+
+In this example, the entries of \\(\mathbf{\Omega}\\) and \\(\mathbf{b}\\) are
+sampled from distributions such that the mapping satisfies the following
+property:
+
+$$
+RFFM(\mathbf{x})^T \cdot RFFM(\mathbf{y}) \approx
+e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2 \sigma^2}}
+$$
+
+The right-hand-side quantity of the expression above is known as the RBF (or
+Gaussian) kernel function. This function is one of the most widely used kernel
+functions in Machine Learning and implicitly measures similarity in a different,
+much higher-dimensional space than the original one. See
+[Radial basis function kernel](https://en.wikipedia.org/wiki/Radial_basis_function_kernel)
+for more details. (A short numpy sketch that checks this approximation
+numerically appears after the first `KernelLinearClassifier` snippet below.)
+
+### Kernel classifier
+@{tf.contrib.kernel_methods.KernelLinearClassifier} is a pre-packaged
+`tf.contrib.learn` estimator that combines the power of explicit kernel mappings
+with linear models. Its constructor is almost identical to that of the
+LinearClassifier estimator, with the additional option to specify a list of
+explicit kernel mappings to be applied to each feature the classifier uses. The
+following code snippet demonstrates how to replace LinearClassifier with
+KernelLinearClassifier.
+
+```python
+# Specify the feature(s) to be used by the estimator. This is identical to the
+# code used for the LinearClassifier.
+image_column = tf.contrib.layers.real_valued_column('images', dimension=784)
+optimizer = tf.train.FtrlOptimizer(
+    learning_rate=50.0, l2_regularization_strength=0.001)
+
+kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
+    input_dim=784, output_dim=2000, stddev=5.0, name='rffm')
+kernel_mappers = {image_column: [kernel_mapper]}
+estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
+    n_classes=10, optimizer=optimizer, kernel_mappers=kernel_mappers)
+
+# Train.
+start = time.time()
+estimator.fit(input_fn=train_input_fn, steps=2000)
+end = time.time()
+print('Elapsed time: {} seconds'.format(end - start))
+
+# Evaluate and report metrics.
+eval_metrics = estimator.evaluate(input_fn=eval_input_fn, steps=1)
+print(eval_metrics)
+```
+The only additional parameter passed to `KernelLinearClassifier` is a dictionary
+from feature_columns to a list of kernel mappings to be applied to the
+corresponding feature column.
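+
+As noted in the Technical details section, the quality of the kernel mapping
+rests on the approximation property above. The following sketch (an addition to
+this tutorial, written in plain numpy rather than with the TensorFlow mapper)
+checks it numerically. The sampling choices are assumptions made for this
+illustration: \\(\mathbf{\Omega}\\) has i.i.d. \\(N(0, 1/\sigma^2)\\) entries,
+\\(\mathbf{b}\\) is uniform on \\([0, 2\pi)\\), and a \\(\sqrt{2/D}\\)
+normalization factor (left implicit in the formula above) is written out
+explicitly so that the inner product matches the RBF kernel value:
+
+```python
+import numpy as np
+
+np.random.seed(0)
+d, D, sigma = 10, 5000, 5.0  # input dim, output dim, RBF kernel stddev
+
+# Sample the mapping parameters: Omega ~ N(0, 1/sigma^2) entry-wise,
+# b ~ Uniform[0, 2*pi).
+omega = np.random.randn(d, D) / sigma
+b = np.random.uniform(0.0, 2.0 * np.pi, size=D)
+
+def rffm(x):
+  # sqrt(2/D) normalizes the features so that inner products of mapped vectors
+  # approximate the RBF kernel.
+  return np.sqrt(2.0 / D) * np.cos(x.dot(omega) + b)
+
+x = np.random.randn(d)
+y = np.random.randn(d)
+
+approximate = rffm(x).dot(rffm(y))
+exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
+print(approximate, exact)  # The two values should be close for large D.
+```
+Increasing `D` tightens the approximation; this is the same trade-off as the
+`output_dim` parameter of `RandomFourierFeatureMapper`, discussed in the
+"Output dimension" section below.
+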
+The following lines instruct the classifier to
+first map the initial 784-dimensional images to 2000-dimensional vectors using
+random Fourier features and then learn a linear model on the transformed
+vectors:
+
+```python
+kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(
+    input_dim=784, output_dim=2000, stddev=5.0, name='rffm')
+kernel_mappers = {image_column: [kernel_mapper]}
+estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
+    n_classes=10, optimizer=optimizer, kernel_mappers=kernel_mappers)
+```
+Notice the `stddev` parameter. This is the standard deviation (\\(\sigma\\)) of
+the approximated RBF kernel and controls the similarity measure used in
+classification. `stddev` is typically determined via hyperparameter tuning.
+
+The results of running the preceding code are summarized in the following table.
+We can further increase the accuracy by increasing the output dimension of the
+mapping and tuning the standard deviation.
+
+metric        | value
+:------------ | :------------
+loss          | 0.10
+accuracy      | 97%
+training time | ~35 seconds on my machine
+
+### stddev
+The classification quality is very sensitive to the value of `stddev`. The
+following table shows the accuracy of the classifier on the eval data for
+different values of `stddev`. The optimal value is `stddev=5.0`. Notice how too
+small or too high `stddev` values can dramatically decrease the accuracy of the
+classification.
+
+stddev | eval accuracy
+:----- | :------------
+1.0    | 0.1362
+2.0    | 0.4764
+4.0    | 0.9654
+5.0    | 0.9766
+8.0    | 0.9714
+16.0   | 0.8878
+
+### Output dimension
+Intuitively, the larger the output dimension of the mapping, the more closely
+the inner product of two mapped vectors approximates the kernel, which typically
+translates to better classification accuracy. Another way to think about this is
+that the output dimension equals the number of weights of the linear model; the
+larger this dimension, the larger the "degrees of freedom" of the model.
+However, after a certain threshold, higher output dimensions increase the
+accuracy by very little, while making training take more time. This is shown in
+the following two figures, which depict the eval accuracy as a function of the
+output dimension and the training time, respectively.
+
+![image](../images/acc_vs_outdim.png)
+![image](../images/acc-vs-trn_time.png)
+
+## Summary
+Explicit kernel mappings combine the predictive power of nonlinear models with
+the scalability of linear models. Unlike traditional dual kernel methods,
+explicit kernel methods can scale to millions or hundreds of millions of
+samples. When using explicit kernel mappings, consider the following tips:
+
+* Random Fourier Features can be particularly effective for datasets with dense
+features.
+* The parameters of the kernel mapping are often data-dependent. Model quality
+can be very sensitive to these parameters. Use hyperparameter tuning to find the
+optimal values.
+* If you have multiple numerical features, concatenate them into a single
+multi-dimensional feature and apply the kernel mapping to the concatenated
+vector.
diff --git a/tensorflow/docs_src/tutorials/leftnav_files b/tensorflow/docs_src/tutorials/leftnav_files
index c98e65f70f..fe71d3c408 100644
--- a/tensorflow/docs_src/tutorials/leftnav_files
+++ b/tensorflow/docs_src/tutorials/leftnav_files
@@ -12,3 +12,4 @@ wide.md
 wide_and_deep.md
 mandelbrot.md
 pdes.md
+kernel_methods.md
\ No newline at end of file
-- 
cgit v1.2.3