diff options
author | Eugene Brevdo <ebrevdo@google.com> | 2017-07-06 13:41:47 -0700 |
---|---|---|
committer | TensorFlower Gardener <gardener@tensorflow.org> | 2017-07-06 13:46:16 -0700 |
commit | 0b5cce367cf95a5d7fbb6b037e7a8646f6c47b70 (patch) | |
tree | 9f8b5791f631f35ae41402053e7c5c12b5431d13 /tensorflow/python/kernel_tests/where_op_test.py | |
parent | 0597c4189a07e4ba1141a90ead76c9900a99d773 (diff) |
Get TopK op working on GPU again. Extend using cub's radix sort.
1. Undo rollback of Andreas Kirsch's initial implementation.
2. Use cub segmented radix sort if Andreas' heap-based impl
for large k and small num_cols (thresholds of k=100, n=1000
determined empirically).
3. Use cub segmented radix sort if k == num_cols (this case is always faster).
4. Added benchmarks.
Benchmarks show that the GPU implementation is up to 3x slower for small k but
can be 10x faster for large num_cols and k.
Benchmarks:
Benchmark: m_128_n_10_k_5_use_gpu_False wall_time: 0.000166 s Throughput: 0.0077 GB/s
Benchmark: m_128_n_10_k_5_use_gpu_True wall_time: 0.000796 s Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_False wall_time: 0.00017 s Throughput: 0.00751 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_True wall_time: 0.000796 s Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_False wall_time: 0.00017 s Throughput: 0.00753 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_True wall_time: 0.000775 s Throughput: 0.00165 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_False wall_time: 0.000155 s Throughput: 0.0826 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_True wall_time: 0.000796 s Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_False wall_time: 0.000247 s Throughput: 0.0519 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_True wall_time: 0.0008 s Throughput: 0.016 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_False wall_time: 0.000261 s Throughput: 0.049 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_True wall_time: 0.000794 s Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_False wall_time: 0.000239 s Throughput: 0.0536 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_True wall_time: 0.000777 s Throughput: 0.0165 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_False wall_time: 0.000324 s Throughput: 0.395 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_True wall_time: 0.000916 s Throughput: 0.14 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_False wall_time: 0.00042 s Throughput: 0.305 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_True wall_time: 0.000902 s Throughput: 0.142 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_False wall_time: 0.0011 s Throughput: 0.116 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_True wall_time: 0.00097 s Throughput: 0.132 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_False wall_time: 0.00133 s Throughput: 0.0962 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_True wall_time: 0.000993 s Throughput: 0.129 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_False wall_time: 0.00102 s Throughput: 0.126 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_True wall_time: 0.000964 s Throughput: 0.133 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_False wall_time: 0.002 s Throughput: 0.64 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_True wall_time: 0.00288 s Throughput: 0.445 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_False wall_time: 0.00233 s Throughput: 0.549 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_True wall_time: 0.00325 s Throughput: 0.394 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_False wall_time: 0.0127 s Throughput: 0.101 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_True wall_time: 0.00381 s Throughput: 0.336 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_False wall_time: 0.015 s Throughput: 0.0853 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_True wall_time: 0.00438 s Throughput: 0.292 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_False wall_time: 0.0104 s Throughput: 0.123 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_True wall_time: 0.00427 s Throughput: 0.3 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_False wall_time: 0.0148 s Throughput: 0.865 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_True wall_time: 0.0262 s Throughput: 0.488 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_False wall_time: 0.0201 s Throughput: 0.636 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_True wall_time: 0.0263 s Throughput: 0.486 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_False wall_time: 0.214 s Throughput: 0.0599 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_True wall_time: 0.0322 s Throughput: 0.398 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_False wall_time: 0.262 s Throughput: 0.0489 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_True wall_time: 0.0377 s Throughput: 0.34 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_False wall_time: 0.118 s Throughput: 0.108 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_True wall_time: 0.0365 s Throughput: 0.351 GB/s
END_PUBLIC
BEGIN_PUBLIC
Automated g4 rollback of changelist 157169178
PiperOrigin-RevId: 161124193
Diffstat (limited to 'tensorflow/python/kernel_tests/where_op_test.py')
-rw-r--r-- | tensorflow/python/kernel_tests/where_op_test.py | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/tensorflow/python/kernel_tests/where_op_test.py b/tensorflow/python/kernel_tests/where_op_test.py index a428d26996..3e1fa0a287 100644 --- a/tensorflow/python/kernel_tests/where_op_test.py +++ b/tensorflow/python/kernel_tests/where_op_test.py @@ -100,7 +100,7 @@ class WhereOpTest(test.TestCase): class WhereBenchmark(test.Benchmark): - def benchmarkWhereCPU(self): + def benchmarkWhere(self): for (m, n, p, use_gpu) in itertools.product( [10], [10, 100, 1000, 10000, 100000, 1000000], |