Change: 146677928
Change: 123900938
class and sample counts by calculating a CDF over the data first (a minimal
sketch of the CDF approach follows the benchmark table below).
Benchmark:
Benchmark                          Time(ns), with CL   Time(ns), without CL   Speedup
--------------------------------------------------------------------------------------
BM_Multinomial_1_10000_4                      222430                1891139       8.5
BM_Multinomial_1_10000_128                    251088               51780050     206.2
BM_Multinomial_1_10000_10000                 1178162               ~Forever
BM_Multinomial_1_100000_4                    1613117               17625439      10.9
BM_Multinomial_1_100000_128                  1494122              521734326     349.2
BM_Multinomial_32_10000_4                    1011524                8253509       8.2
BM_Multinomial_32_10000_128                   966981              209806476     217.0
BM_Multinomial_32_100000_4                   7700229               68502921       8.8
BM_Multinomial_32_100000_128                 7699189             2075835399     269.6
BM_Multinomial_128_100000_1                 25459698               71740408       2.8
BM_Multinomial_128_100000_128               25733778             7614311500     295.8
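Since the kernel change itself is not reproduced in this log, here is a rough
NumPy sketch of the CDF-based sampling described above; the helper name and
signature are illustrative only, not part of the change.

import numpy as np

def multinomial_via_cdf(probs, num_samples, rng=None):
    # Hypothetical helper illustrating the idea: build the CDF once, then map
    # each uniform draw to a class with a binary search, so each sample costs
    # O(log num_classes) instead of a linear scan over the probabilities.
    rng = rng or np.random.default_rng()
    cdf = np.cumsum(np.asarray(probs, dtype=np.float64))
    cdf /= cdf[-1]                        # normalize; probs need not sum to 1
    u = rng.random(num_samples)           # uniform draws in [0, 1)
    return np.searchsorted(cdf, u, side="right")

# Example: a skewed 4-class distribution, ten samples.
print(multinomial_via_cdf([0.7, 0.1, 0.1, 0.1], 10))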
Change: 123491123
** Performance compared to GPU baseline (pinning the composed ops to GPU):
Composition of existing ops vs. Native Multinomial op [50 iters]
BatchSize  NumClasses  NumSamples  sec(composed)  sec(native)  speedup
       32       10000           1          1.949        0.286     6.81
       32       10000           4          0.141        0.015     9.36
       32       10000          32          0.269        0.072     3.76
       32      100000           1          0.809        0.046    17.63
       32      100000           4          1.342        0.104    12.92
       32      100000          32          2.651        0.675     3.93
      128       10000           1          0.102        0.039     2.64
      128       10000           4          0.200        0.068     2.93
      128       10000          32          0.684        0.292     2.34
      128      100000           1          0.965        0.191     5.04
      128      100000           4          2.231        0.445     5.01
      128      100000          32          6.873        2.800     2.45
The native GPU kernel achieves speedups of up to ~17x over the composed-op baseline.
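For context on "pinning the composed ops to GPU" above: in the TF 1.x graph
API this is done with tf.device. The sketch below is illustrative only; the
Gumbel-max composition is an assumption standing in for the benchmark's actual
composed expression, which is not shown in this log.

import tensorflow as tf  # TF 1.x graph-mode API, current when this change landed

logits = tf.log([[0.5, 0.3, 0.2]])   # [batch=1, num_classes=3]
num_samples = 10

with tf.device("/gpu:0"):
    # "Composed" baseline: per-sample Gumbel noise from existing ops, then argmax.
    uniform = tf.random_uniform([1, num_samples, 3], minval=1e-20, maxval=1.0)
    composed = tf.argmax(
        tf.expand_dims(logits, 1) - tf.log(-tf.log(uniform)), axis=2)

    # Native op; this change adds its GPU kernel.
    native = tf.multinomial(logits, num_samples)

# allow_soft_placement falls back to CPU kernels if no GPU is present.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run([composed, native]))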
** Performance compared to CPU kernel:
Benchmark                            Time(ns)      CPU(ns)  Iterations  Throughput
-----------------------------------------------------------------------------------
BM_Multinomial_cpu_1_10000_4 1675838 1791808 390 21.3M items/s
BM_Multinomial_gpu_1_10000_4 36189 109382 6203 348.8M items/s
BM_Multinomial_cpu_1_10000_128 53309987 53398179 100 22.9M items/s
BM_Multinomial_gpu_1_10000_128 183619 295815 2399 4.0G items/s
BM_Multinomial_cpu_1_100000_4 16400318 16513301 100 23.1M items/s
BM_Multinomial_gpu_1_100000_4 74213 179964 3939 2.1G items/s
BM_Multinomial_cpu_1_100000_128 525420335 524868031 100 23.3M items/s
BM_Multinomial_gpu_1_100000_128 1556520 1725302 379 6.9G items/s
BM_Multinomial_cpu_32_10000_4 10004679 66266210 100 18.4M items/s
BM_Multinomial_gpu_32_10000_4 188598 295679 2352 4.0G items/s
BM_Multinomial_cpu_32_10000_128 207084457 2088864630 100 18.7M items/s
BM_Multinomial_gpu_32_10000_128 4925820 5134462 100 7.4G items/s
BM_Multinomial_cpu_32_100000_4 67593470 666348728 100 18.3M items/s
BM_Multinomial_gpu_32_100000_4 1691660 1867438 338 6.4G items/s
BM_Multinomial_cpu_32_100000_128 2123027247 20865462918 100 18.7M items/s
BM_Multinomial_gpu_32_100000_128 52065179 52579436 100 7.3G items/s
BM_Multinomial_cpu_128_100000_1 69011329 756379313 100 16.1M items/s
BM_Multinomial_gpu_128_100000_1 1844878 2039173 284 5.8G items/s
The native GPU kernel achieves speedups of up to ~500x over the CPU kernel. For
realistic input sizes -- for instance, (128, 100k, 1) -- it sees a ~38x speedup.
Change: 123432401
Example usage:
samples = tf.multinomial(tf.log([[0.5, 0.5]]), 10)
# samples has shape [1, 10], where each value is either 0 or 1 (equal prob.).
samples = tf.multinomial([[1, -1, -1]], 10)
# samples is strongly biased toward class 0, which has the largest logit.
The implementation uses the Gumbel noise trick (a minimal sketch of the trick
follows the benchmark table below). To validate the worthiness of adding a
native op, we benchmark against the one-liner approach of composing existing
TF ops to compute the same thing. From
"third_party/tensorflow/python:multinomial_op_test" built with "-c opt --copt=-mavx":
("sec" is wall time in seconds, aggregated over 5 iters.)
Composition of existing ops vs. Native Multinomial op [5 iters]
BatchSize NumClasses NumSamples sec(composed) sec(native) speedup
1 10000 1 0.069 0.040 1.74
1 10000 4 0.006 0.004 1.54
1 10000 128 0.056 0.063 0.89
1 100000 1 0.009 0.008 1.16
1 100000 4 0.017 0.022 0.77
1 100000 128 0.328 0.600 0.55
32 10000 1 0.019 0.007 2.86
32 10000 4 0.048 0.009 5.56
32 10000 128 0.847 0.091 9.31
32 100000 1 0.102 0.027 3.74
32 100000 4 0.274 0.064 4.28
32 100000 128 10.579 0.880 12.02
128 10000 1 0.050 0.036 1.39
128 10000 4 0.135 0.048 2.84
128 10000 128 3.071 0.377 8.15
128 100000 1 0.352 0.133 2.65
128 100000 4 0.995 0.260 3.82
128 100000 128 40.455 3.574 11.32
The speedup is up to 12x.
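As an illustration of the Gumbel noise (Gumbel-max) trick mentioned above,
here is a small NumPy sketch; gumbel_max_sample is a hypothetical helper, not
the native op's implementation.

import numpy as np

def gumbel_max_sample(logits, num_samples, rng=None):
    # argmax(logits + Gumbel noise) is distributed according to softmax(logits),
    # so adding independent Gumbel noise per sample and taking the argmax yields
    # categorical samples without ever normalizing the distribution.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)            # [batch, num_classes]
    u = rng.random((logits.shape[0], num_samples, logits.shape[1]))
    gumbel = -np.log(-np.log(u))                             # standard Gumbel noise
    return np.argmax(logits[:, None, :] + gumbel, axis=-1)   # [batch, num_samples]

# Matches the usage example above: two equally likely classes, ten samples.
print(gumbel_max_sample(np.log([[0.5, 0.5]]), 10))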
Change: 121593174