#include "tensorflow/core/framework/op.h"

namespace tensorflow {

REGISTER_OP("UniformCandidateSampler")
    .Input("true_classes: int64")
    .Output("sampled_candidates: int64")
    .Output("true_expected_count: float")
    .Output("sampled_expected_count: float")
    .Attr("num_true: int >= 1")
    .Attr("num_sampled: int >= 1")
    .Attr("unique: bool")
    .Attr("range_max: int >= 1")
    .Attr("seed: int = 0")
    .Attr("seed2: int = 0")
    .Doc(R"doc(
Generates labels for candidate sampling with a uniform distribution.

See explanations of candidate sampling and the data formats at
go/candidate-sampling.

For each batch, this op picks a single set of sampled candidate labels.

The advantages of sampling candidates per-batch are simplicity and the
possibility of efficient dense matrix multiplication. The disadvantage is that
the sampled candidates must be chosen independently of the context and of the
true labels.

true_classes: A batch_size * num_true matrix, in which each row contains the
  IDs of the num_true target_classes in the corresponding original label.
sampled_candidates: A vector of length num_sampled, in which each element is
  the ID of a sampled candidate.
true_expected_count: A batch_size * num_true matrix, representing
  the number of times each candidate is expected to occur in a batch
  of sampled candidates. If unique=true, then this is a probability.
sampled_expected_count: A vector of length num_sampled, for each sampled
  candidate representing the number of times the candidate is expected
  to occur in a batch of sampled candidates. If unique=true, then this is a
  probability.
num_true: Number of true labels per context.
num_sampled: Number of candidates to randomly sample per batch.
unique: If unique is true, we sample with rejection, so that all sampled
  candidates in a batch are unique. This requires some approximation to
  estimate the post-rejection sampling probabilities.
range_max: The sampler will sample integers from the interval [0, range_max).
seed: If either seed or seed2 is set to be non-zero, the random number
  generator is seeded by the given seed. Otherwise, it is seeded by a
  random seed.
seed2: A second seed to avoid seed collision.
)doc");

REGISTER_OP("LogUniformCandidateSampler")
    .Input("true_classes: int64")
    .Output("sampled_candidates: int64")
    .Output("true_expected_count: float")
    .Output("sampled_expected_count: float")
    .Attr("num_true: int >= 1")
    .Attr("num_sampled: int >= 1")
    .Attr("unique: bool")
    .Attr("range_max: int >= 1")
    .Attr("seed: int = 0")
    .Attr("seed2: int = 0")
    .Doc(R"doc(
Generates labels for candidate sampling with a log-uniform distribution.

See explanations of candidate sampling and the data formats at
go/candidate-sampling.

For each batch, this op picks a single set of sampled candidate labels.

The advantages of sampling candidates per-batch are simplicity and the
possibility of efficient dense matrix multiplication. The disadvantage is that
the sampled candidates must be chosen independently of the context and of the
true labels.

true_classes: A batch_size * num_true matrix, in which each row contains the
  IDs of the num_true target_classes in the corresponding original label.
sampled_candidates: A vector of length num_sampled, in which each element is
  the ID of a sampled candidate.
true_expected_count: A batch_size * num_true matrix, representing
  the number of times each candidate is expected to occur in a batch
  of sampled candidates. If unique=true, then this is a probability.
sampled_expected_count: A vector of length num_sampled, for each sampled
  candidate representing the number of times the candidate is expected
  to occur in a batch of sampled candidates. If unique=true, then this is a
  probability.
num_true: Number of true labels per context.
num_sampled: Number of candidates to randomly sample per batch.
unique: If unique is true, we sample with rejection, so that all sampled
  candidates in a batch are unique. This requires some approximation to
  estimate the post-rejection sampling probabilities.
range_max: The sampler will sample integers from the interval [0, range_max).
seed: If either seed or seed2 is set to be non-zero, the random number
  generator is seeded by the given seed. Otherwise, it is seeded by a
  random seed.
seed2: A second seed to avoid seed collision.
)doc");

REGISTER_OP("LearnedUnigramCandidateSampler")
    .Input("true_classes: int64")
    .Output("sampled_candidates: int64")
    .Output("true_expected_count: float")
    .Output("sampled_expected_count: float")
    .Attr("num_true: int >= 1")
    .Attr("num_sampled: int >= 1")
    .Attr("unique: bool")
    .Attr("range_max: int >= 1")
    .Attr("seed: int = 0")
    .Attr("seed2: int = 0")
    .Doc(R"doc(
Generates labels for candidate sampling with a learned unigram distribution.

See explanations of candidate sampling and the data formats at
go/candidate-sampling.

For each batch, this op picks a single set of sampled candidate labels.

The advantages of sampling candidates per-batch are simplicity and the
possibility of efficient dense matrix multiplication. The disadvantage is that
the sampled candidates must be chosen independently of the context and of the
true labels.

true_classes: A batch_size * num_true matrix, in which each row contains the
  IDs of the num_true target_classes in the corresponding original label.
sampled_candidates: A vector of length num_sampled, in which each element is
  the ID of a sampled candidate.
true_expected_count: A batch_size * num_true matrix, representing
  the number of times each candidate is expected to occur in a batch
  of sampled candidates. If unique=true, then this is a probability.
sampled_expected_count: A vector of length num_sampled, for each sampled
  candidate representing the number of times the candidate is expected
  to occur in a batch of sampled candidates. If unique=true, then this is a
  probability.
num_true: Number of true labels per context.
num_sampled: Number of candidates to randomly sample per batch.
unique: If unique is true, we sample with rejection, so that all sampled
  candidates in a batch are unique. This requires some approximation to
  estimate the post-rejection sampling probabilities.
range_max: The sampler will sample integers from the interval [0, range_max).
seed: If either seed or seed2 is set to be non-zero, the random number
  generator is seeded by the given seed. Otherwise, it is seeded by a
  random seed.
seed2: A second seed to avoid seed collision.
)doc");

REGISTER_OP("ThreadUnsafeUnigramCandidateSampler")
    .Input("true_classes: int64")
    .Output("sampled_candidates: int64")
    .Output("true_expected_count: float")
    .Output("sampled_expected_count: float")
    .Attr("num_true: int >= 1")
    .Attr("num_sampled: int >= 1")
    .Attr("unique: bool")
    .Attr("range_max: int >= 1")
    .Attr("seed: int = 0")
    .Attr("seed2: int = 0")
    .Doc(R"doc(
Generates labels for candidate sampling with a learned unigram distribution.

See explanations of candidate sampling and the data formats at
go/candidate-sampling.

For each batch, this op picks a single set of sampled candidate labels.

The advantages of sampling candidates per-batch are simplicity and the
possibility of efficient dense matrix multiplication. The disadvantage is that
the sampled candidates must be chosen independently of the context and of the
true labels.

true_classes: A batch_size * num_true matrix, in which each row contains the
  IDs of the num_true target_classes in the corresponding original label.
sampled_candidates: A vector of length num_sampled, in which each element is
  the ID of a sampled candidate.
true_expected_count: A batch_size * num_true matrix, representing
  the number of times each candidate is expected to occur in a batch
  of sampled candidates. If unique=true, then this is a probability.
sampled_expected_count: A vector of length num_sampled, for each sampled
  candidate representing the number of times the candidate is expected
  to occur in a batch of sampled candidates. If unique=true, then this is a
  probability.
num_true: Number of true labels per context.
num_sampled: Number of candidates to randomly sample per batch.
unique: If unique is true, we sample with rejection, so that all sampled
  candidates in a batch are unique. This requires some approximation to
  estimate the post-rejection sampling probabilities.
range_max: The sampler will sample integers from the interval [0, range_max).
seed: If either seed or seed2 is set to be non-zero, the random number
  generator is seeded by the given seed. Otherwise, it is seeded by a
  random seed.
seed2: A second seed to avoid seed collision.
)doc");

REGISTER_OP("FixedUnigramCandidateSampler")
    .Input("true_classes: int64")
    .Output("sampled_candidates: int64")
    .Output("true_expected_count: float")
    .Output("sampled_expected_count: float")
    .Attr("num_true: int >= 1")
    .Attr("num_sampled: int >= 1")
    .Attr("unique: bool")
    .Attr("range_max: int >= 1")
    .Attr("vocab_file: string = ''")
    .Attr("distortion: float = 1.0")
    .Attr("num_reserved_ids: int = 0")
    .Attr("num_shards: int >= 1 = 1")
    .Attr("shard: int >= 0 = 0")
    .Attr("unigrams: list(float) = []")
    .Attr("seed: int = 0")
    .Attr("seed2: int = 0")
    .Doc(R"doc(
Generates labels for candidate sampling with a fixed unigram distribution.

A unigram sampler could use a fixed unigram distribution read from a
file or passed in as an in-memory array instead of building up the distribution
from data on the fly. There is also an option to skew the distribution by
applying a distortion power to the weights.

The vocabulary file should be in CSV-like format, with the last field
being the weight associated with the word.

For each batch, this op picks a single set of sampled candidate labels.

The advantages of sampling candidates per-batch are simplicity and the
possibility of efficient dense matrix multiplication. The disadvantage is that
the sampled candidates must be chosen independently of the context and of the
true labels.

true_classes: A batch_size * num_true matrix, in which each row contains the
  IDs of the num_true target_classes in the corresponding original label.
sampled_candidates: A vector of length num_sampled, in which each element is
  the ID of a sampled candidate.
true_expected_count: A batch_size * num_true matrix, representing
  the number of times each candidate is expected to occur in a batch
  of sampled candidates. If unique=true, then this is a probability.
sampled_expected_count: A vector of length num_sampled, for each sampled
  candidate representing the number of times the candidate is expected
  to occur in a batch of sampled candidates. If unique=true, then this is a
  probability.
num_true: Number of true labels per context.
num_sampled: Number of candidates to randomly sample per batch.
unique: If unique is true, we sample with rejection, so that all sampled
  candidates in a batch are unique. This requires some approximation to
  estimate the post-rejection sampling probabilities.
range_max: The sampler will sample integers from the interval [0, range_max).
vocab_file: Each valid line in this file (which should have a CSV-like format)
  corresponds to a valid word ID. IDs are in sequential order, starting from
  num_reserved_ids. The last entry in each line is expected to be a value
  corresponding to the count or relative probability. Exactly one of vocab_file
  and unigrams needs to be passed to this op.
distortion: The distortion is used to skew the unigram probability distribution.
  Each weight is first raised to the distortion's power before adding to the
  internal unigram distribution. As a result, distortion = 1.0 gives regular
  unigram sampling (as defined by the vocab file), and distortion = 0.0 gives
  a uniform distribution.
num_reserved_ids: Optionally, users can reserve IDs in the range
  [0, num_reserved_ids). One use case is that a special unknown
  word token is used as ID 0. These IDs will have a sampling probability of 0.
num_shards: A sampler can be used to sample from a subset of the original range
  in order to speed up the whole computation through parallelism. This parameter
  (together with 'shard') indicates the number of partitions that are being
  used in the overall computation.
shard: A sampler can be used to sample from a subset of the original range
  in order to speed up the whole computation through parallelism. This parameter
  (together with 'num_shards') indicates the particular partition number of a
  sampler op, when partitioning is being used.
unigrams: A list of unigram counts or probabilities, one per ID in sequential
  order. Exactly one of vocab_file and unigrams should be passed to this op.
seed: If either seed or seed2 is set to be non-zero, the random number
  generator is seeded by the given seed. Otherwise, it is seeded by a
  random seed.
seed2: A second seed to avoid seed collision.
)doc");

REGISTER_OP("AllCandidateSampler")
    .Input("true_classes: int64")
    .Output("sampled_candidates: int64")
    .Output("true_expected_count: float")
    .Output("sampled_expected_count: float")
    .Attr("num_true: int >= 1")
    .Attr("num_sampled: int >= 1")
    .Attr("unique: bool")
    .Attr("seed: int = 0")
    .Attr("seed2: int = 0")
    .Doc(R"doc(
Generates labels for candidate sampling by returning all candidate classes.

See explanations of candidate sampling and the data formats at
go/candidate-sampling.

For each batch, this op picks a single set of sampled candidate labels.

The advantages of sampling candidates per-batch are simplicity and the
possibility of efficient dense matrix multiplication. The disadvantage is that
the sampled candidates must be chosen independently of the context and of the
true labels.

true_classes: A batch_size * num_true matrix, in which each row contains the
  IDs of the num_true target_classes in the corresponding original label.
sampled_candidates: A vector of length num_sampled, in which each element is
  the ID of a sampled candidate.
true_expected_count: A batch_size * num_true matrix, representing
  the number of times each candidate is expected to occur in a batch
  of sampled candidates. If unique=true, then this is a probability.
sampled_expected_count: A vector of length num_sampled, for each sampled
  candidate representing the number of times the candidate is expected
  to occur in a batch of sampled candidates. If unique=true, then this is a
  probability.
num_true: Number of true labels per context.
num_sampled: Number of candidates to produce per batch.
unique: If unique is true, we sample with rejection, so that all sampled
  candidates in a batch are unique. This requires some approximation to
  estimate the post-rejection sampling probabilities.
seed: If either seed or seed2 is set to be non-zero, the random number
  generator is seeded by the given seed. Otherwise, it is seeded by a
  random seed.
seed2: A second seed to avoid seed collision.
)doc");

REGISTER_OP("ComputeAccidentalHits")
    .Input("true_classes: int64")
    .Input("sampled_candidates: int64")
    .Output("indices: int32")
    .Output("ids: int64")
    .Output("weights: float")
    .Attr("num_true: int")
    .Attr("seed: int = 0")
    .Attr("seed2: int = 0")
    .Doc(R"doc(
Computes the ids of the positions in sampled_candidates that match true_classes.

When doing log-odds NCE, the result of this op should be passed through a
SparseToDense op, then added to the logits of the sampled candidates. This has
the effect of 'removing' the sampled labels that match the true labels by
making the classifier sure that they are sampled labels.

true_classes: The true_classes output of UnpackSparseLabels.
sampled_candidates: The sampled_candidates output of CandidateSampler.
indices: A vector of indices corresponding to rows of true_classes.
ids: A vector of IDs of positions in sampled_candidates that match a true class
  for the row with the corresponding index in indices.
weights: A vector of the same length as indices and ids, in which each element
  is -FLOAT_MAX.
num_true: Number of true labels per context.
seed: If either seed or seed2 is set to be non-zero, the random number
  generator is seeded by the given seed. Otherwise, it is seeded by a
  random seed.
seed2: A second seed to avoid seed collision.
)doc");

}  // namespace tensorflow