Diffstat (limited to 'tensorflow/python/ops/nn.py')
-rw-r--r-- | tensorflow/python/ops/nn.py | 816
1 file changed, 816 insertions, 0 deletions
diff --git a/tensorflow/python/ops/nn.py b/tensorflow/python/ops/nn.py
new file mode 100644
index 0000000000..7a4dc25e8b
--- /dev/null
+++ b/tensorflow/python/ops/nn.py
@@ -0,0 +1,816 @@
+# pylint: disable=wildcard-import,unused-import,g-bad-import-order
+"""## Activation Functions
+
+The activation ops provide different types of nonlinearities for use in
+neural networks. These include smooth nonlinearities (`sigmoid`,
+`tanh`, and `softplus`), continuous but not everywhere differentiable
+functions (`relu` and `relu6`), and random regularization (`dropout`).
+
+All activation ops apply componentwise, and produce a tensor of the same
+shape as the input tensor.
+
+@@relu
+@@relu6
+@@softplus
+@@dropout
+@@bias_add
+@@sigmoid
+@@tanh
+
+## Convolution
+
+The convolution ops sweep a 2-D filter over a batch of images, applying the
+filter to each window of each image of the appropriate size. The different
+ops trade off between generic vs. specific filters:
+
+* `conv2d`: Arbitrary filters that can mix channels together.
+* `depthwise_conv2d`: Filters that operate on each channel independently.
+* `separable_conv2d`: A depthwise spatial filter followed by a pointwise
+  filter.
+
+Note that although these ops are called "convolution", they are strictly
+speaking "cross-correlation" since the filter is combined with an input window
+without reversing the filter. For details, see [the properties of
+cross-correlation](https://en.wikipedia.org/wiki/Cross-correlation#Properties).
+
+The filter is applied to image patches of the same size as the filter and
+strided according to the `strides` argument. `strides = [1, 1, 1, 1]` applies
+the filter to a patch at every offset, `strides = [1, 2, 2, 1]` applies the
+filter to every other image patch in each dimension, etc.
+
+Ignoring channels for the moment, the spatial semantics of the convolution ops
+are as follows. If the 4-D `input` has shape
+`[batch, in_height, in_width, ...]` and the 4-D `filter` has shape
+`[filter_height, filter_width, ...]`, then
+
+    output.shape = [batch,
+                    (in_height - filter_height + 1) / strides[1],
+                    (in_width - filter_width + 1) / strides[2],
+                    ...]
+
+    output[b, i, j, :] =
+        sum_{di, dj} input[b, strides[1] * i + di, strides[2] * j + dj, ...] *
+                     filter[di, dj, ...]
+
+Since `input` is 4-D, each `input[b, i, j, :]` is a vector. For `conv2d`, these
+vectors are multiplied by the `filter[di, dj, :, :]` matrices to produce new
+vectors. For `depthwise_conv2d`, each scalar component `input[b, i, j, k]`
+is multiplied by a vector `filter[di, dj, k]`, and all the vectors are
+concatenated.
+
+In the formula for `output.shape`, the rounding direction depends on padding:
+
+* `padding = 'VALID'`: Round up (only full size windows are considered, so the
+  formula above applies as written).
+* `padding = 'SAME'`: Partial windows are included by zero-padding the input,
+  and the output shape becomes `ceil(in_height / strides[1])` by
+  `ceil(in_width / strides[2])`.
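+
+For example, with `in_height = 7`, `filter_height = 3`, and `strides[1] = 2`
+(an illustrative case, not taken from the code below), `'VALID'` padding gives
+`ceil((7 - 3 + 1) / 2) = 3` output rows, while `'SAME'` padding gives
+`ceil(7 / 2) = 4` output rows.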
+
+@@conv2d
+@@depthwise_conv2d
+@@separable_conv2d
+
+## Pooling
+
+The pooling ops sweep a rectangular window over the input tensor, computing a
+reduction operation for each window (average, max, or max with argmax). Each
+pooling op uses rectangular windows of size `ksize` separated by offset
+`strides`. For example, if `strides` is all ones every window is used, if
+`strides` is all twos every other window is used in each dimension, etc.
+
+In detail, the output is
+
+    output[i] = reduce(value[strides * i:strides * i + ksize])
+
+for each tuple of indices `i`. The output shape is
+
+    output.shape = (value.shape - ksize + 1) / strides
+
+where the rounding direction depends on padding:
+
+* `padding = 'VALID'`: Round up (only full size windows are considered, so the
+  formula above applies as written).
+* `padding = 'SAME'`: Partial windows are included by zero-padding the input,
+  and the output shape becomes `ceil(value.shape / strides)`.
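+
+For example, pooling a length-5 dimension with `ksize = 3` and `strides = 2`
+(an illustrative case, not taken from the code below) produces
+`ceil((5 - 3 + 1) / 2) = 2` outputs under `'VALID'` padding and
+`ceil(5 / 2) = 3` outputs under `'SAME'` padding.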
+
+@@avg_pool
+@@max_pool
+@@max_pool_with_argmax
+
+## Normalization
+
+Normalization is useful to prevent neurons from saturating when inputs may
+have varying scale, and to aid generalization.
+
+@@l2_normalize
+@@local_response_normalization
+@@moments
+
+## Losses
+
+The loss ops measure error between two tensors, or between a tensor and zero.
+These can be used for measuring accuracy of a network in a regression task
+or for regularization purposes (weight decay).
+
+@@l2_loss
+
+## Classification
+
+TensorFlow provides several operations that help you perform classification.
+
+@@sigmoid_cross_entropy_with_logits
+@@softmax
+@@softmax_cross_entropy_with_logits
+
+## Embeddings
+
+TensorFlow provides several operations that help you compute embeddings.
+
+@@embedding_lookup
+@@embedding_lookup_sparse
+
+## Evaluation
+
+The evaluation ops are useful for measuring the performance of a network.
+Since they are nondifferentiable, they are typically used at evaluation time.
+
+@@top_k
+@@in_top_k
+
+## Candidate Sampling
+
+Do you want to train a multiclass or multilabel model with thousands
+or millions of output classes (for example, a language model with a
+large vocabulary)? Training with a full Softmax is slow in this case,
+since all of the classes are evaluated for every training example.
+Candidate Sampling training algorithms can speed up your step times by
+only considering a small randomly-chosen subset of contrastive classes
+(called candidates) for each batch of training examples.
+
+See our [Candidate Sampling Algorithms
+Reference](http://www.tensorflow.org/extras/candidate_sampling.pdf).
+
+### Sampled Loss Functions
+
+TensorFlow provides the following sampled loss functions for faster training.
+
+@@nce_loss
+@@sampled_softmax_loss
+
+### Candidate Samplers
+
+TensorFlow provides the following samplers for randomly sampling candidate
+classes when using one of the sampled loss functions above.
+
+@@uniform_candidate_sampler
+@@log_uniform_candidate_sampler
+@@learned_unigram_candidate_sampler
+@@fixed_unigram_candidate_sampler
+
+### Miscellaneous candidate sampling utilities
+
+@@compute_accidental_hits
+
+"""
+
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import types
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import candidate_sampling_ops
+from tensorflow.python.ops import constant_op
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import embedding_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_grad
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import numerics
+from tensorflow.python.ops import random_ops
+from tensorflow.python.ops import sparse_ops
+from tensorflow.python.ops.math_ops import sigmoid
+from tensorflow.python.ops.math_ops import tanh
+
+# Bring more nn-associated functionality into this package.
+from tensorflow.python.ops.nn_ops import *
+from tensorflow.python.ops.candidate_sampling_ops import *
+from tensorflow.python.ops.embedding_ops import *
+
+
+def sigmoid_cross_entropy_with_logits(logits, targets, name=None):
+  """Computes sigmoid cross entropy given `logits`.
+
+  Measures the probability error in discrete classification tasks in which
+  each class is independent and not mutually exclusive. For instance, one
+  could perform multilabel classification where a picture can contain both an
+  elephant and a dog at the same time.
+
+  For brevity, let `x = logits`, `z = targets`. The logistic loss is
+
+      x - x * z + log(1 + exp(-x))
+
+  To ensure stability and avoid overflow, the implementation uses
+
+      max(x, 0) - x * z + log(1 + exp(-abs(x)))
+
+  `logits` and `targets` must have the same type and shape.
+
+  Args:
+    logits: A `Tensor` of type `float32` or `float64`.
+    targets: A `Tensor` of the same type and shape as `logits`.
+    name: A name for the operation (optional).
+
+  Returns:
+    A `Tensor` of the same shape as `logits` with the componentwise
+    logistic losses.
+  """
+  with ops.op_scope([logits, targets], name, "logistic_loss") as name:
+    logits = ops.convert_to_tensor(logits, name="logits")
+    targets = ops.convert_to_tensor(targets, name="targets")
+    # The logistic loss formula from above is
+    #   x - x * z + log(1 + exp(-x))
+    # For x < 0, a more numerically stable formula is
+    #   -x * z + log(1 + exp(x))
+    # To avoid branching, we use the combined version
+    #   max(x, 0) - x * z + log(1 + exp(-abs(x)))
+    return math_ops.add(nn_ops.relu(logits) - logits * targets,
+                        math_ops.log(1 + math_ops.exp(-math_ops.abs(logits))),
+                        name=name)
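+
+
+# A quick numerical check of the stable formula above (illustrative comment
+# only, not part of the original module): for x = -3, z = 1, the naive form
+#   x - x * z + log(1 + exp(-x)) = -3 + 3 + log(1 + e**3)          ~= 3.0486
+# and the stable form
+#   max(x, 0) - x * z + log(1 + exp(-abs(x))) = 3 + log(1 + e**-3) ~= 3.0486
+# agree, but for large negative x (say x = -1000) the naive form overflows in
+# exp(-x), while the stable form computes exp(-abs(x)) -> 0.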
+
+
+def xw_plus_b(x, weights, biases, name=None):
+  """Computes matmul(x, weights) + biases.
+
+  Args:
+    x: a 2D tensor.  Dimensions typically: batch, in_units
+    weights: a 2D tensor.  Dimensions typically: in_units, out_units
+    biases: a 1D tensor.  Dimensions: out_units
+    name: A name for the operation (optional).  If not specified
+      "xw_plus_b" is used.
+
+  Returns:
+    A 2-D Tensor computing matmul(x, weights) + biases.
+    Dimensions typically: batch, out_units.
+  """
+  with ops.op_scope([x, weights, biases], name, "xw_plus_b") as name:
+    x = ops.convert_to_tensor(x, name="x")
+    weights = ops.convert_to_tensor(weights, name="weights")
+    biases = ops.convert_to_tensor(biases, name="biases")
+    mm = math_ops.matmul(x, weights)
+    return nn_ops.bias_add(mm, biases, name=name)
+
+
+def relu_layer(x, weights, biases, name=None):
+  """Computes relu(matmul(x, weights) + biases).
+
+  Args:
+    x: a 2D tensor.  Dimensions typically: batch, in_units
+    weights: a 2D tensor.  Dimensions typically: in_units, out_units
+    biases: a 1D tensor.  Dimensions: out_units
+    name: A name for the operation (optional).  If not specified
+      "relu_layer" is used.
+
+  Returns:
+    A 2-D Tensor computing relu(matmul(x, weights) + biases).
+    Dimensions typically: batch, out_units.
+  """
+  with ops.op_scope([x, weights, biases], name, "relu_layer") as name:
+    x = ops.convert_to_tensor(x, name="x")
+    weights = ops.convert_to_tensor(weights, name="weights")
+    biases = ops.convert_to_tensor(biases, name="biases")
+    xw_plus_b = nn_ops.bias_add(math_ops.matmul(x, weights), biases)
+    return nn_ops.relu(xw_plus_b, name=name)
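+
+
+# Illustrative shapes for the two helpers above (values are hypothetical):
+#   x:       [batch, in_units],     e.g. [32, 784]
+#   weights: [in_units, out_units], e.g. [784, 100]
+#   biases:  [out_units],           e.g. [100]
+# relu_layer(x, weights, biases) returns a [32, 100] tensor, equivalent to
+# nn_ops.relu(xw_plus_b(x, weights, biases)).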
+
+
+def l2_normalize(x, dim, epsilon=1e-12, name=None):
+  """Normalizes along dimension `dim` using an L2 norm.
+
+  For a 1-D tensor with `dim = 0`, computes
+
+      output = x / sqrt(max(sum(x**2), epsilon))
+
+  For `x` with more dimensions, independently normalizes each 1-D slice along
+  dimension `dim`.
+
+  Args:
+    x: A `Tensor`.
+    dim: Dimension along which to normalize.
+    epsilon: A lower bound value for the norm. Will use `sqrt(epsilon)` as the
+      divisor if `norm < sqrt(epsilon)`.
+    name: A name for this operation (optional).
+
+  Returns:
+    A `Tensor` with the same shape as `x`.
+  """
+  with ops.op_scope([x], name, "l2_normalize") as name:
+    x = ops.convert_to_tensor(x, name="x")
+    square_sum = math_ops.reduce_sum(math_ops.square(x), [dim], keep_dims=True)
+    x_inv_norm = math_ops.rsqrt(math_ops.maximum(square_sum, epsilon))
+    return math_ops.mul(x, x_inv_norm, name=name)
+
+
+def zero_fraction(value, name=None):
+  """Returns the fraction of zeros in `value`.
+
+  If `value` is empty, the result is `nan`.
+
+  This is useful in summaries to measure and report sparsity. For example,
+
+      z = tf.nn.relu(...)
+      summ = tf.scalar_summary('sparsity', tf.nn.zero_fraction(z))
+
+  Args:
+    value: A tensor of numeric type.
+    name: A name for the operation (optional).
+
+  Returns:
+    The fraction of zeros in `value`, with type `float32`.
+  """
+  with ops.op_scope([value], name, "zero_fraction"):
+    value = ops.convert_to_tensor(value, name="value")
+    zero = constant_op.constant(0, dtype=value.dtype, name="zero")
+    return math_ops.reduce_mean(math_ops.cast(math_ops.equal(value, zero),
+                                              types.float32))
+
+
+def dropout(x, keep_prob, noise_shape=None, seed=None, name=None):
+  """Computes dropout.
+
+  With probability `keep_prob`, outputs the input element scaled up by
+  `1 / keep_prob`, otherwise outputs `0`. The scaling is so that the expected
+  sum is unchanged.
+
+  By default, each element is kept or dropped independently. If `noise_shape`
+  is specified, it must be
+  [broadcastable](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+  to the shape of `x`, and only dimensions with `noise_shape[i] == x.shape[i]`
+  will make independent decisions. For example, if `x.shape = [b, x, y, c]` and
+  `noise_shape = [b, 1, 1, c]`, each batch and channel component will be
+  kept independently and each row and column will be kept or not kept together.
+
+  Args:
+    x: A tensor.
+    keep_prob: Float probability that each element is kept.
+    noise_shape: Shape for randomly generated keep/drop flags.
+    seed: A Python integer. Used to create a random seed.
+      See [`set_random_seed`](constant_op.md#set_random_seed) for behavior.
+    name: A name for this operation (optional).
+
+  Returns:
+    A Tensor of the same shape as `x`.
+
+  Raises:
+    ValueError: If `keep_prob` is not in `(0, 1]`.
+  """
+  if not (0 < keep_prob <= 1):
+    raise ValueError("Expected keep_prob in (0, 1], got %g" % keep_prob)
+  with ops.op_scope([x], name, "dropout") as name:
+    x = ops.convert_to_tensor(x, name="x")
+    noise_shape = noise_shape or array_ops.shape(x)
+    # random_tensor is uniform in [keep_prob, 1.0 + keep_prob).
+    random_tensor = keep_prob
+    random_tensor += random_ops.random_uniform(
+        noise_shape, seed=seed, dtype=x.dtype)
+    # floor yields 0. for values in [keep_prob, 1.0) (dropped) and 1. for
+    # values in [1.0, 1.0 + keep_prob) (kept).
+    binary_tensor = math_ops.floor(random_tensor)
+    return x * (1.0 / keep_prob) * binary_tensor
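+
+
+# Illustrative dropout behavior (values are hypothetical): with
+# keep_prob = 0.5, roughly half of the elements are zeroed and the survivors
+# are scaled by 1 / 0.5 = 2, so each output element equals its input in
+# expectation. With x of shape [b, h, w, c] and noise_shape = [b, 1, 1, c],
+# one keep/drop decision is broadcast across all (h, w) positions of a given
+# example and channel.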
+
+
+def depthwise_conv2d(input, filter, strides, padding, name=None):
+  """Depthwise 2-D convolution.
+
+  Given an input tensor of shape `[batch, in_height, in_width, in_channels]`
+  and a filter tensor of shape
+  `[filter_height, filter_width, in_channels, channel_multiplier]`
+  containing `in_channels` convolutional filters of depth 1, `depthwise_conv2d`
+  applies a different filter to each input channel (expanding from 1 channel
+  to `channel_multiplier` channels for each), then concatenates the results
+  together. The output has `in_channels * channel_multiplier` channels.
+
+  In detail,
+
+      output[b, i, j, k * channel_multiplier + q] =
+          sum_{di, dj} input[b, strides[1] * i + di, strides[2] * j + dj, k] *
+                       filter[di, dj, k, q]
+
+  Must have `strides[0] = strides[3] = 1`. For the most common case of the
+  same horizontal and vertical strides, `strides = [1, stride, stride, 1]`.
+
+  Args:
+    input: 4-D with shape `[batch, in_height, in_width, in_channels]`.
+    filter: 4-D with shape
+      `[filter_height, filter_width, in_channels, channel_multiplier]`.
+    strides: 1-D of size 4. The stride of the sliding window for each
+      dimension of `input`.
+    padding: A string, either `'VALID'` or `'SAME'`. The padding algorithm.
+    name: A name for this operation (optional).
+
+  Returns:
+    A 4-D `Tensor` of shape
+    `[batch, out_height, out_width, in_channels * channel_multiplier]`.
+  """
+  with ops.op_scope([input, filter], name, "depthwise") as name:
+    input = ops.convert_to_tensor(input, name="tensor_in")
+    filter = ops.convert_to_tensor(filter, name="filter_in")
+    # A static shape is required to compute the number of per-channel filters.
+    if filter.get_shape().ndims is not None:
+      assert len(filter.get_shape()) == 4
+      in_channels = filter.get_shape()[2]
+      # Sanity checks, if shape information is available for the inputs.
+      if input.get_shape().ndims is not None:
+        assert len(input.get_shape()) == 4
+        assert input.get_shape()[3] == in_channels, (
+            "Mismatched input depth %d and number of depthwise filters %d." % (
+                input.get_shape()[3].value, in_channels))
+    else:
+      assert input.get_shape().ndims is not None, (
+          "Either `input` or `filter` must provide static shape information.")
+      assert input.get_shape().ndims == 4
+      in_channels = input.get_shape()[3]
+
+    if in_channels == 1:
+      return nn_ops.conv2d(input, filter, strides, padding, name=name)
+    else:
+      # Create one separate convolution per channel.
+      convs = []
+      for channel in xrange(in_channels):
+        with ops.name_scope("depth%d" % channel) as channel_scope:
+          t_in = array_ops.slice(input, [0, 0, 0, channel], [-1, -1, -1, 1],
+                                 name="slice_inputs")
+          f_in = array_ops.slice(filter, [0, 0, channel, 0], [-1, -1, 1, -1],
+                                 name="slice_params")
+          convs.append(nn_ops.conv2d(t_in, f_in,
+                                     strides, padding, name=channel_scope))
+      # Concatenate the per-channel convolutions along the channel dimension.
+      return array_ops.concat(3, convs, name=name)
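+
+
+# Illustrative depthwise shapes (values are hypothetical): an input of shape
+# [32, 28, 28, 3] convolved with a filter of shape [3, 3, 3, 2] applies two
+# depth-1 filters to each of the three input channels, producing
+# 3 * 2 = 6 output channels, i.e. [32, 28, 28, 6] with 'SAME' padding and
+# strides = [1, 1, 1, 1].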
+
+
+def separable_conv2d(input, depthwise_filter, pointwise_filter, strides,
+                     padding,
+                     name=None):
+  """2-D convolution with separable filters.
+
+  Performs a depthwise convolution that acts separately on channels followed by
+  a pointwise convolution that mixes channels. Note that this is separability
+  between dimensions `[1, 2]` and `3`, not spatial separability between
+  dimensions `1` and `2`.
+
+  In detail,
+
+      output[b, i, j, k] = sum_{di, dj, q, r}
+          input[b, strides[1] * i + di, strides[2] * j + dj, q] *
+          depthwise_filter[di, dj, q, r] *
+          pointwise_filter[0, 0, q * channel_multiplier + r, k]
+
+  `strides` controls the strides for the depthwise convolution only, since
+  the pointwise convolution has implicit strides of `[1, 1, 1, 1]`. Must have
+  `strides[0] = strides[3] = 1`. For the most common case of the same
+  horizontal and vertical strides, `strides = [1, stride, stride, 1]`.
+
+  Args:
+    input: 4-D `Tensor` with shape `[batch, in_height, in_width, in_channels]`.
+    depthwise_filter: 4-D `Tensor` with shape
+      `[filter_height, filter_width, in_channels, channel_multiplier]`.
+      Contains `in_channels` convolutional filters of depth 1.
+    pointwise_filter: 4-D `Tensor` with shape
+      `[1, 1, channel_multiplier * in_channels, out_channels]`. Pointwise
+      filter to mix channels after `depthwise_filter` has convolved spatially.
+    strides: 1-D of size 4. The strides for the depthwise convolution for
+      each dimension of `input`.
+    padding: A string, either `'VALID'` or `'SAME'`. The padding algorithm.
+    name: A name for this operation (optional).
+
+  Returns:
+    A 4-D `Tensor` of shape `[batch, out_height, out_width, out_channels]`.
+  """
+  with ops.op_scope([input, depthwise_filter, pointwise_filter],
+                    name, "separable_conv2d") as name:
+    input = ops.convert_to_tensor(input, name="tensor_in")
+    depthwise_filter = ops.convert_to_tensor(depthwise_filter,
+                                             name="depthwise_filter")
+    pointwise_filter = ops.convert_to_tensor(pointwise_filter,
+                                             name="pointwise_filter")
+
+    if pointwise_filter.get_shape().ndims is not None:
+      assert len(pointwise_filter.get_shape()) == 4
+      assert pointwise_filter.get_shape()[0] == 1
+      assert pointwise_filter.get_shape()[1] == 1
+      if depthwise_filter.get_shape().ndims and input.get_shape().ndims:
+        channel_multiplier = depthwise_filter.get_shape()[3]
+        in_channels = input.get_shape()[3]
+        out_channels = pointwise_filter.get_shape()[3]
+        # If channel_multiplier * in_channels >= out_channels, the separable
+        # convolution is over-parametrized.
+        assert channel_multiplier * in_channels < out_channels
+    # The layout of the ops in the graph is expected to be as follows:
+    # separable_conv2d  // Conv2D op corresponding to the pointwise conv.
+    # separable_conv2d/depthwise  // Concat op for the depthwise outputs.
+    # separable_conv2d/depthwise/depth0  // Conv2D op for depth 0
+    # separable_conv2d/depthwise/depth1  // Conv2D op for depth 1
+    # separable_conv2d/depthwise/depth2  // Conv2D op for depth 2
+    depthwise = depthwise_conv2d(input, depthwise_filter, strides,
+                                 padding, name="depthwise")
+    return nn_ops.conv2d(depthwise, pointwise_filter, [1, 1, 1, 1],
+                         padding="VALID", name=name)
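+
+
+# Illustrative parameter count (values are hypothetical): with 3x3 filters,
+# in_channels = 16, channel_multiplier = 1, and out_channels = 32, the
+# depthwise filter holds 3 * 3 * 16 * 1 = 144 weights and the pointwise
+# filter holds 1 * 1 * 16 * 32 = 512, for 656 in total, versus
+# 3 * 3 * 16 * 32 = 4608 weights for a full conv2d with the same receptive
+# field.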
+
+
+def moments(x, axes, name=None):
+  """Calculate the mean and variance of `x`.
+
+  The mean and variance are calculated by aggregating the contents of `x`
+  across `axes`. If `x` is 1-D and `axes = [0]` this is just the mean
+  and variance of a vector.
+
+  For so-called "global normalization" needed for convolutional filters pass
+  `axes=[0, 1, 2]` (batch, height, width). For batch normalization pass
+  `axes=[0]` (batch).
+
+  Args:
+    x: A `Tensor`.
+    axes: array of ints. Axes along which to compute mean and
+      variance.
+    name: Name used to scope the operations that compute the moments.
+
+  Returns:
+    Two `Tensors`: `mean` and `variance`.
+  """
+  with ops.op_scope([x, axes], name, "moments"):
+    x = ops.convert_to_tensor(x, name="x")
+    divisor = 1.0
+    for d in xrange(len(x.get_shape())):
+      if d in axes:
+        divisor *= x.get_shape()[d].value
+    divisor = constant_op.constant(1.0 / divisor, x.dtype, name="divisor")
+    axes = constant_op.constant(axes, name="axes")
+    # Note: We do not use Mean here because it is very slow on GPU.
+    # Note 2: The expression below is potentially more stable.
+    # It is however a bit slower and stability doesn't appear to be an issue.
+    # mean = math_ops.reduce_sum(math_ops.mul(x, divisor), axes, name="mean")
+    # var = math_ops.reduce_sum(math_ops.mul(math_ops.square(x - mean),
+    #                                        divisor), axes,
+    #                           name="variance")
+    mean = math_ops.mul(math_ops.reduce_sum(x, axes), divisor, name="mean")
+    var = math_ops.mul(math_ops.reduce_sum(math_ops.square(x - mean), axes),
+                       divisor, name="variance")
+    return mean, var
+
+
+def _sum_rows(x):
+  """Returns a vector summing up each row of the matrix x."""
+  # _sum_rows(x) is equivalent to math_ops.reduce_sum(x, 1) when x is
+  # a matrix. The gradient of _sum_rows(x) is more efficient than
+  # reduce_sum(x, 1)'s gradient in today's implementation. Therefore,
+  # we use _sum_rows(x) in the nce_loss() computation since the loss
+  # is mostly used for training.
+  cols = array_ops.shape(x)[1]
+  ones_shape = array_ops.pack([cols, 1])
+  ones = array_ops.ones(ones_shape, x.dtype)
+  return array_ops.reshape(math_ops.matmul(x, ones), [-1])
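+
+
+# Illustrative moments shapes (values are hypothetical): for activations x of
+# shape [batch, height, width, depth], axes=[0, 1, 2] yields a per-channel
+# mean and variance of shape [depth] each, while axes=[0] yields statistics
+# of shape [height, width, depth]. _sum_rows on an [m, n] matrix is the
+# matrix-vector product x * ones([n, 1]) reshaped to a length-m vector.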
+
+
+def _compute_sampled_logits(weights, biases, inputs, labels, num_sampled,
+                            num_classes, num_true=1,
+                            sampled_values=None,
+                            subtract_log_q=True,
+                            remove_accidental_hits=False,
+                            name=None):
+  """Helper function for nce_loss and sampled_softmax_loss functions.
+
+  Computes sampled output training logits and labels suitable for implementing
+  e.g. noise-contrastive estimation (see nce_loss) or sampled softmax (see
+  sampled_softmax_loss).
+
+  Note: In the case where num_true > 1, we assign to each target class
+  the target probability 1 / num_true so that the target probabilities
+  sum to 1 per-example.
+
+  Args:
+    weights: tensor of label embeddings with shape = [num_classes, dim]
+    biases: tensor of num_classes label biases
+    inputs: tensor with shape = [batch_size, dim] corresponding to forward
+      activations of the input network
+    labels: int tensor with shape [batch_size, num_true]
+    num_sampled: number of label classes to sample per batch
+    num_classes: number of possible label classes in the data (e.g. vocab size)
+    num_true: number of target classes per example (default: 1)
+    sampled_values: a tuple of (sampled_candidates, true_expected_count,
+      sampled_expected_count) returned by a *_candidate_sampler function to
+      use (if None, we default to log_uniform_candidate_sampler)
+    subtract_log_q: subtract the log expected count of the labels in the sample
+      to get the logits of the true labels (default: True). Turn off for
+      Negative Sampling.
+    remove_accidental_hits: whether to remove "accidental hits" where a sampled
+      label equals one of the true labels (bool, default: False)
+    name: name for this op
+
+  Returns:
+    out_logits, out_labels: tensors with shape [batch_size, num_true +
+      num_sampled] for passing to either sigmoid_cross_entropy_with_logits
+      (NCE) or softmax_cross_entropy_with_logits (sampled softmax).
+  """
+  with ops.op_scope(
+      [weights, biases, inputs, labels], name, "compute_sampled_logits"):
+    if labels.dtype != types.int64:
+      labels = math_ops.cast(labels, types.int64)
+    labels_flat = array_ops.reshape(labels, [-1])
+
+    # Sample the negative labels.
+    #   sampled is a [num_sampled] vector
+    #   true_expected_count shape is [batch_size, 1]
+    #   sampled_expected_count is a [num_sampled] vector
+    if sampled_values is None:
+      sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
+          true_classes=labels,
+          num_true=num_true,
+          num_sampled=num_sampled,
+          unique=True,
+          range_max=num_classes)
+    # NOTE: pylint cannot tell that 'sampled_values' is a sequence
+    # pylint: disable=unpacking-non-sequence
+    sampled, true_expected_count, sampled_expected_count = sampled_values
+    # pylint: enable=unpacking-non-sequence
+
+    # weights shape is [num_classes, dim]
+    # labels_flat is a [batch_size * num_true] vector
+    # true_w shape is [batch_size * num_true, dim]
+    # true_b is a [batch_size * num_true] vector
+    true_w = embedding_ops.embedding_lookup(weights, labels_flat)
+    true_b = embedding_ops.embedding_lookup(biases, labels_flat)
+
+    # inputs shape is [batch_size, dim]
+    # true_w shape is [batch_size * num_true, dim]
+    # row_wise_dots is [batch_size, num_true, dim]
+    dim = array_ops.shape(true_w)[1:2]
+    new_true_w_shape = array_ops.concat(0, [[-1, num_true], dim])
+    row_wise_dots = math_ops.mul(
+        array_ops.expand_dims(inputs, 1),
+        array_ops.reshape(true_w, new_true_w_shape))
+    # We want the row-wise dot plus biases which yields a
+    # [batch_size, num_true] tensor of true_logits.
+    dots_as_matrix = array_ops.reshape(row_wise_dots,
+                                       array_ops.concat(0, [[-1], dim]))
+    true_logits = array_ops.reshape(_sum_rows(dots_as_matrix), [-1, num_true])
+    true_b = array_ops.reshape(true_b, [-1, num_true])
+    true_logits += true_b
+
+    # Lookup weights and biases for sampled labels.
+    #   sampled is a [num_sampled] int vector
+    #   sampled_w shape is [num_sampled, dim]
+    #   sampled_b is a [num_sampled] float vector
+    sampled_w = embedding_ops.embedding_lookup(weights, sampled)
+    sampled_b = embedding_ops.embedding_lookup(biases, sampled)
+
+    # inputs has shape [batch_size, dim]
+    # sampled_w has shape [num_sampled, dim]
+    # sampled_b has shape [num_sampled]
+    # Apply X*W'+B, which yields [batch_size, num_sampled]
+    sampled_logits = math_ops.matmul(inputs,
+                                     sampled_w,
+                                     transpose_b=True) + sampled_b
+
+    if remove_accidental_hits:
+      acc_hits = candidate_sampling_ops.compute_accidental_hits(
+          labels, sampled, num_true=num_true)
+      acc_indices, acc_ids, acc_weights = acc_hits
+
+      # This is how SparseToDense expects the indices.
+      acc_indices_2d = array_ops.reshape(acc_indices, [-1, 1])
+      acc_ids_2d_int32 = array_ops.reshape(math_ops.cast(
+          acc_ids, types.int32), [-1, 1])
+      sparse_indices = array_ops.concat(
+          1, [acc_indices_2d, acc_ids_2d_int32], "sparse_indices")
+      # Create sampled_logits_shape = [batch_size, num_sampled]
+      sampled_logits_shape = array_ops.concat(
+          0,
+          [array_ops.shape(labels)[:1], array_ops.expand_dims(num_sampled, 0)])
+      sampled_logits += sparse_ops.sparse_to_dense(
+          sparse_indices, sampled_logits_shape, acc_weights, 0.0)
+
+    if subtract_log_q:
+      # Subtract log of Q(l), prior probability that l appears in sampled.
+      true_logits -= math_ops.log(true_expected_count)
+      sampled_logits -= math_ops.log(sampled_expected_count)
+
+    # Construct output logits and labels. The true labels/logits start at col 0.
+    out_logits = array_ops.concat(1, [true_logits, sampled_logits])
+    # true_logits is a float tensor, ones_like(true_logits) is a float tensor
+    # of ones. We then divide by num_true to ensure the per-example labels sum
+    # to 1.0, i.e. form a proper probability distribution.
+    out_labels = array_ops.concat(
+        1, [array_ops.ones_like(true_logits) / num_true,
+            array_ops.zeros_like(sampled_logits)])
+
+    return out_logits, out_labels
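+
+
+# Illustrative shape flow (values are hypothetical): with batch_size = 32,
+# num_true = 1, and num_sampled = 64, true_logits is [32, 1], sampled_logits
+# is [32, 64], and the returned out_logits and out_labels are both [32, 65],
+# with each row of out_labels summing to 1.0.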
+
+
+def nce_loss(weights, biases, inputs, labels, num_sampled, num_classes,
+             num_true=1,
+             sampled_values=None,
+             remove_accidental_hits=False,
+             name="nce_loss"):
+  """Computes and returns the noise-contrastive estimation training loss.
+
+  See [Noise-contrastive estimation: A new estimation principle for
+  unnormalized statistical
+  models](http://www.jmlr.org/proceedings/papers/v9/gutmann10a/gutmann10a.pdf).
+  Also see our [Candidate Sampling Algorithms
+  Reference](http://www.tensorflow.org/extras/candidate_sampling.pdf).
+
+  Note: In the case where num_true > 1, we assign to each target class
+  the target probability 1 / num_true so that the target probabilities
+  sum to 1 per-example.
+
+  Note: It would be useful to allow a variable number of target classes per
+  example. We hope to provide this functionality in a future release.
+  For now, if you have a variable number of target classes, you can pad them
+  out to a constant number by either repeating them or by padding
+  with an otherwise unused class.
+
+  Args:
+    weights: A `Tensor` of shape [num_classes, dim]. The class embeddings.
+    biases: A `Tensor` of shape [num_classes]. The class biases.
+    inputs: A `Tensor` of shape [batch_size, dim]. The forward
+      activations of the input network.
+    labels: A `Tensor` of type `int64` and shape `[batch_size,
+      num_true]`. The target classes.
+    num_sampled: An `int`. The number of classes to randomly sample per batch.
+    num_classes: An `int`. The number of possible classes.
+    num_true: An `int`. The number of target classes per training example.
+    sampled_values: a tuple of `(sampled_candidates, true_expected_count,
+      sampled_expected_count)` returned by a *_candidate_sampler function.
+      (if None, we default to log_uniform_candidate_sampler)
+    remove_accidental_hits: A `bool`. Whether to remove "accidental hits"
+      where a sampled class equals one of the target classes. If set to
+      `True`, this is a "Sampled Logistic" loss instead of NCE, and we are
+      learning to generate log-odds instead of log probabilities. See
+      our [Candidate Sampling Algorithms
+      Reference](http://www.tensorflow.org/extras/candidate_sampling.pdf).
+      Default is False.
+    name: A name for the operation (optional).
+
+  Returns:
+    A batch_size 1-D tensor of per-example NCE losses.
+  """
+  logits, labels = _compute_sampled_logits(
+      weights, biases, inputs, labels, num_sampled, num_classes,
+      num_true=num_true,
+      sampled_values=sampled_values,
+      subtract_log_q=True,
+      remove_accidental_hits=remove_accidental_hits,
+      name=name)
+  sampled_losses = sigmoid_cross_entropy_with_logits(logits,
+                                                     labels,
+                                                     name="sampled_losses")
+  # sampled_losses is batch_size x {true_loss, sampled_losses...}
+  # We sum out true and sampled losses.
+  return _sum_rows(sampled_losses)
+
+
+def sampled_softmax_loss(weights, biases, inputs, labels, num_sampled,
+                         num_classes, num_true=1,
+                         sampled_values=None,
+                         remove_accidental_hits=True,
+                         name="sampled_softmax_loss"):
+  """Computes and returns the sampled softmax training loss.
+
+  This is a faster way to train a softmax classifier over a huge number of
+  classes.
+
+  This operation is for training only. It is generally an underestimate of
+  the full softmax loss.
+
+  At inference time, you can compute full softmax probabilities with the
+  expression `tf.nn.softmax(tf.matmul(inputs, weights, transpose_b=True) +
+  biases)` (the transpose is needed since `weights` has shape
+  `[num_classes, dim]`).
+
+  See our [Candidate Sampling Algorithms
+  Reference](http://www.tensorflow.org/extras/candidate_sampling.pdf).
+
+  Also see Section 3 of http://arxiv.org/abs/1412.2007 for the math.
+
+  Args:
+    weights: A `Tensor` of shape [num_classes, dim]. The class embeddings.
+    biases: A `Tensor` of shape [num_classes]. The class biases.
+    inputs: A `Tensor` of shape [batch_size, dim]. The forward
+      activations of the input network.
+    labels: A `Tensor` of type `int64` and shape `[batch_size,
+      num_true]`. The target classes. Note that this format differs from
+      the `labels` argument of `nn.softmax_cross_entropy_with_logits`.
+    num_sampled: An `int`. The number of classes to randomly sample per batch.
+    num_classes: An `int`. The number of possible classes.
+    num_true: An `int`. The number of target classes per training example.
+    sampled_values: a tuple of `(sampled_candidates, true_expected_count,
+      sampled_expected_count)` returned by a *_candidate_sampler function.
+      (if None, we default to log_uniform_candidate_sampler)
+    remove_accidental_hits: A `bool`. Whether to remove "accidental hits"
+      where a sampled class equals one of the target classes. Default is
+      True.
+    name: A name for the operation (optional).
+
+  Returns:
+    A batch_size 1-D tensor of per-example sampled softmax losses.
+  """
+  logits, labels = _compute_sampled_logits(
+      weights, biases, inputs, labels, num_sampled, num_classes,
+      num_true=num_true,
+      sampled_values=sampled_values,
+      subtract_log_q=True,
+      remove_accidental_hits=remove_accidental_hits,
+      name=name)
+  sampled_losses = nn_ops.softmax_cross_entropy_with_logits(logits, labels)
+  # sampled_losses is a batch_size vector.
+  return sampled_losses
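+
+
+# Illustrative training/inference split for the sampled losses above (names
+# are hypothetical): during training,
+#   loss = sampled_softmax_loss(weights, biases, hidden, labels,
+#                               num_sampled=64, num_classes=vocab_size)
+# while at inference time full probabilities can be computed as
+#   softmax(math_ops.matmul(hidden, weights, transpose_b=True) + biases)
+# since weights has shape [num_classes, dim].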