Change: 146677928
Change: 123900938
class and sample counts by calculating a CDF over the data first (a minimal
sketch of the CDF approach follows the benchmark table below).
Benchmark:
Benchmark                          Time(ns), with CL   Time(ns), without CL   Speedup
--------------------------------------------------------------------------------------
BM_Multinomial_1_10000_4                      222430                1891139       8.5
BM_Multinomial_1_10000_128                    251088               51780050     206.2
BM_Multinomial_1_10000_10000                 1178162               ~Forever
BM_Multinomial_1_100000_4                    1613117               17625439      10.9
BM_Multinomial_1_100000_128                  1494122              521734326     349.2
BM_Multinomial_32_10000_4                    1011524                8253509       8.2
BM_Multinomial_32_10000_128                   966981              209806476     217.0
BM_Multinomial_32_100000_4                   7700229               68502921       8.8
BM_Multinomial_32_100000_128                 7699189             2075835399     269.6
BM_Multinomial_128_100000_1                 25459698               71740408       2.8
BM_Multinomial_128_100000_128               25733778             7614311500     295.8
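Since the kernel change itself is not reproduced in this log, here is a rough
NumPy sketch of the CDF-based sampling described above; the helper name and
signature are illustrative only, not part of the change.

import numpy as np

def multinomial_via_cdf(probs, num_samples, rng=None):
    # Hypothetical helper illustrating the idea: build the CDF once, then map
    # each uniform draw to a class with a binary search, so each sample costs
    # O(log num_classes) instead of a linear scan over the probabilities.
    rng = rng or np.random.default_rng()
    cdf = np.cumsum(np.asarray(probs, dtype=np.float64))
    cdf /= cdf[-1]                        # normalize; probs need not sum to 1
    u = rng.random(num_samples)           # uniform draws in [0, 1)
    return np.searchsorted(cdf, u, side="right")

# Example: a skewed 4-class distribution, ten samples.
print(multinomial_via_cdf([0.7, 0.1, 0.1, 0.1], 10))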
Change: 123491123
** Performance compared to GPU baseline (pinning the composed ops to GPU):
Composition of existing ops vs. Native Multinomial op [50 iters]
BatchSize  NumClasses  NumSamples  sec(composed)  sec(native)  speedup
       32       10000           1          1.949        0.286     6.81
       32       10000           4          0.141        0.015     9.36
       32       10000          32          0.269        0.072     3.76
       32      100000           1          0.809        0.046    17.63
       32      100000           4          1.342        0.104    12.92
       32      100000          32          2.651        0.675     3.93
      128       10000           1          0.102        0.039     2.64
      128       10000           4          0.200        0.068     2.93
      128       10000          32          0.684        0.292     2.34
      128      100000           1          0.965        0.191     5.04
      128      100000           4          2.231        0.445     5.01
      128      100000          32          6.873        2.800     2.45
The native GPU kernel achieves speedups of up to ~17x over the composed-op baseline.
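For context on "pinning the composed ops to GPU" above: in the TF 1.x graph
API this is done with tf.device. The sketch below is illustrative only; the
Gumbel-max composition is an assumption standing in for the benchmark's actual
composed expression, which is not shown in this log.

import tensorflow as tf  # TF 1.x graph-mode API, current when this change landed

logits = tf.log([[0.5, 0.3, 0.2]])   # [batch=1, num_classes=3]
num_samples = 10

with tf.device("/gpu:0"):
    # "Composed" baseline: per-sample Gumbel noise from existing ops, then argmax.
    uniform = tf.random_uniform([1, num_samples, 3], minval=1e-20, maxval=1.0)
    composed = tf.argmax(
        tf.expand_dims(logits, 1) - tf.log(-tf.log(uniform)), axis=2)

    # Native op; this change adds its GPU kernel.
    native = tf.multinomial(logits, num_samples)

# allow_soft_placement falls back to CPU kernels if no GPU is present.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run([composed, native]))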
** Performance compared to CPU kernel:
Benchmark                            Time(ns)      CPU(ns)  Iterations  Throughput
-----------------------------------------------------------------------------------
BM_Multinomial_cpu_1_10000_4 1675838 1791808 390 21.3M items/s
BM_Multinomial_gpu_1_10000_4 36189 109382 6203 348.8M items/s
BM_Multinomial_cpu_1_10000_128 53309987 53398179 100 22.9M items/s
BM_Multinomial_gpu_1_10000_128 183619 295815 2399 4.0G items/s
BM_Multinomial_cpu_1_100000_4 16400318 16513301 100 23.1M items/s
BM_Multinomial_gpu_1_100000_4 74213 179964 3939 2.1G items/s
BM_Multinomial_cpu_1_100000_128 525420335 524868031 100 23.3M items/s
BM_Multinomial_gpu_1_100000_128 1556520 1725302 379 6.9G items/s
BM_Multinomial_cpu_32_10000_4 10004679 66266210 100 18.4M items/s
BM_Multinomial_gpu_32_10000_4 188598 295679 2352 4.0G items/s
BM_Multinomial_cpu_32_10000_128 207084457 2088864630 100 18.7M items/s
BM_Multinomial_gpu_32_10000_128 4925820 5134462 100 7.4G items/s
BM_Multinomial_cpu_32_100000_4 67593470 666348728 100 18.3M items/s
BM_Multinomial_gpu_32_100000_4 1691660 1867438 338 6.4G items/s
BM_Multinomial_cpu_32_100000_128 2123027247 20865462918 100 18.7M items/s
BM_Multinomial_gpu_32_100000_128 52065179 52579436 100 7.3G items/s
BM_Multinomial_cpu_128_100000_1 69011329 756379313 100 16.1M items/s
BM_Multinomial_gpu_128_100000_1 1844878 2039173 284 5.8G items/s
The native GPU kernel achieves speedups of up to ~500x over the CPU kernel. For
realistic input sizes -- for instance, (128, 100k, 1) -- it sees a ~38x speedup.
Change: 123432401
Example usage:
samples = tf.multinomial(tf.log([[0.5, 0.5]]), 10)
# samples has shape [1, 10], where each value is either 0 or 1 (equal prob.).
samples = tf.multinomial([[1, -1, -1]], 10)
# samples is strongly biased toward class 0, which has the largest logit.
The implementation uses the Gumbel noise trick (a minimal sketch of the trick
follows the benchmark table below). To validate the worthiness of adding a
native op, we benchmark against the one-liner approach of composing existing
TF ops to compute the same thing. From
"third_party/tensorflow/python:multinomial_op_test" built with "-c opt --copt=-mavx":
("sec" is wall time in seconds, aggregated over 5 iters.)
Composition of existing ops vs. Native Multinomial op [5 iters]
BatchSize NumClasses NumSamples sec(composed) sec(native) speedup
1 10000 1 0.069 0.040 1.74
1 10000 4 0.006 0.004 1.54
1 10000 128 0.056 0.063 0.89
1 100000 1 0.009 0.008 1.16
1 100000 4 0.017 0.022 0.77
1 100000 128 0.328 0.600 0.55
32 10000 1 0.019 0.007 2.86
32 10000 4 0.048 0.009 5.56
32 10000 128 0.847 0.091 9.31
32 100000 1 0.102 0.027 3.74
32 100000 4 0.274 0.064 4.28
32 100000 128 10.579 0.880 12.02
128 10000 1 0.050 0.036 1.39
128 10000 4 0.135 0.048 2.84
128 10000 128 3.071 0.377 8.15
128 100000 1 0.352 0.133 2.65
128 100000 4 0.995 0.260 3.82
128 100000 128 40.455 3.574 11.32
The speedup is up to 12x.
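As an illustration of the Gumbel noise (Gumbel-max) trick mentioned above,
here is a small NumPy sketch; gumbel_max_sample is a hypothetical helper, not
the native op's implementation.

import numpy as np

def gumbel_max_sample(logits, num_samples, rng=None):
    # argmax(logits + Gumbel noise) is distributed according to softmax(logits),
    # so adding independent Gumbel noise per sample and taking the argmax yields
    # categorical samples without ever normalizing the distribution.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)            # [batch, num_classes]
    u = rng.random((logits.shape[0], num_samples, logits.shape[1]))
    gumbel = -np.log(-np.log(u))                             # standard Gumbel noise
    return np.argmax(logits[:, None, :] + gumbel, axis=-1)   # [batch, num_samples]

# Matches the usage example above: two equally likely classes, ten samples.
print(gumbel_max_sample(np.log([[0.5, 0.5]]), 10))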
Change: 121593174