GPU-enabled WhereOp using CUB.

* Import CUB.
* Add GPU-enabled async WhereOp.
* Add benchmarks.
* Add support for bool ResourceVariables on GPU.

Benchmark results on a machine with a single Tesla K40 GPU:

Results for Where on a bool matrix of shape [m x n], with fraction p of the values true, are listed below.

For small to medium sizes, running WhereOp on GPU is ~2-4x slower. For realistic large problem sizes, it's 2-5x faster. This timing ignores the time spent copying a tensor from GPU -> CPU and back from CPU -> GPU when the WhereOp sits between GPU computations, so the end-to-end benefit in that setting should be larger than these numbers suggest.

Benchmark: m_10_n_10_p_0.01_use_gpu_False wall_time: 9.01e-05 s Throughput: 0.00129 GB/s
Benchmark: m_10_n_10_p_0.01_use_gpu_True wall_time: 0.000187 s Throughput: 0.000621 GB/s
Benchmark: m_10_n_10_p_0.5_use_gpu_False wall_time: 9.3e-05 s Throughput: 0.00968 GB/s
Benchmark: m_10_n_10_p_0.5_use_gpu_True wall_time: 0.000252 s Throughput: 0.00357 GB/s
Benchmark: m_10_n_10_p_0.99_use_gpu_False wall_time: 0.000152 s Throughput: 0.0111 GB/s
Benchmark: m_10_n_10_p_0.99_use_gpu_True wall_time: 0.000245 s Throughput: 0.00687 GB/s
Benchmark: m_10_n_100_p_0.01_use_gpu_False wall_time: 9.3e-05 s Throughput: 0.0125 GB/s
Benchmark: m_10_n_100_p_0.01_use_gpu_True wall_time: 0.000253 s Throughput: 0.00458 GB/s
Benchmark: m_10_n_100_p_0.5_use_gpu_False wall_time: 9.8e-05 s Throughput: 0.0918 GB/s
Benchmark: m_10_n_100_p_0.5_use_gpu_True wall_time: 0.00026 s Throughput: 0.0346 GB/s
Benchmark: m_10_n_100_p_0.99_use_gpu_False wall_time: 0.000104 s Throughput: 0.162 GB/s
Benchmark: m_10_n_100_p_0.99_use_gpu_True wall_time: 0.000288 s Throughput: 0.0586 GB/s
Benchmark: m_10_n_1000_p_0.01_use_gpu_False wall_time: 0.000105 s Throughput: 0.111 GB/s
Benchmark: m_10_n_1000_p_0.01_use_gpu_True wall_time: 0.000283 s Throughput: 0.041 GB/s
Benchmark: m_10_n_1000_p_0.5_use_gpu_False wall_time: 0.000185 s Throughput: 0.486 GB/s
Benchmark: m_10_n_1000_p_0.5_use_gpu_True wall_time: 0.000335 s Throughput: 0.269 GB/s
Benchmark: m_10_n_1000_p_0.99_use_gpu_False wall_time: 0.000203 s Throughput: 0.83 GB/s
Benchmark: m_10_n_1000_p_0.99_use_gpu_True wall_time: 0.000346 s Throughput: 0.486 GB/s
Benchmark: m_10_n_10000_p_0.01_use_gpu_False wall_time: 0.00019 s Throughput: 0.609 GB/s
Benchmark: m_10_n_10000_p_0.01_use_gpu_True wall_time: 0.00028 s Throughput: 0.414 GB/s
Benchmark: m_10_n_10000_p_0.5_use_gpu_False wall_time: 0.00117 s Throughput: 0.771 GB/s
Benchmark: m_10_n_10000_p_0.5_use_gpu_True wall_time: 0.000426 s Throughput: 2.11 GB/s
Benchmark: m_10_n_10000_p_0.99_use_gpu_False wall_time: 0.0014 s Throughput: 1.2 GB/s
Benchmark: m_10_n_10000_p_0.99_use_gpu_True wall_time: 0.000482 s Throughput: 3.5 GB/s
Benchmark: m_10_n_100000_p_0.01_use_gpu_False wall_time: 0.00129 s Throughput: 0.899 GB/s
Benchmark: m_10_n_100000_p_0.01_use_gpu_True wall_time: 0.000336 s Throughput: 3.45 GB/s
Benchmark: m_10_n_100000_p_0.5_use_gpu_False wall_time: 0.0102 s Throughput: 0.885 GB/s
Benchmark: m_10_n_100000_p_0.5_use_gpu_True wall_time: 0.00136 s Throughput: 6.6 GB/s
Benchmark: m_10_n_100000_p_0.99_use_gpu_False wall_time: 0.0116 s Throughput: 1.45 GB/s
Benchmark: m_10_n_100000_p_0.99_use_gpu_True wall_time: 0.00233 s Throughput: 7.23 GB/s
Benchmark: m_10_n_1000000_p_0.01_use_gpu_False wall_time: 0.0111 s Throughput: 1.04 GB/s
Benchmark: m_10_n_1000000_p_0.01_use_gpu_True wall_time: 0.00109 s Throughput: 10.6 GB/s
Benchmark: m_10_n_1000000_p_0.5_use_gpu_False wall_time: 0.0895 s Throughput: 1.01 GB/s
Benchmark: m_10_n_1000000_p_0.5_use_gpu_True wall_time: 0.0103 s Throughput: 8.7 GB/s
Benchmark: m_10_n_1000000_p_0.99_use_gpu_False wall_time: 0.107 s Throughput: 1.58 GB/s
Benchmark: m_10_n_1000000_p_0.99_use_gpu_True wall_time: 0.0201 s Throughput: 8.39 GB/s

PiperOrigin-RevId: 160582709
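
For readers unfamiliar with the op being benchmarked, here is a minimal CPU sketch of what Where computes on a bool matrix: the (row, col) coordinates of every true element, in row-major order. `WhereReference` is a hypothetical name used only for illustration; it is not a TensorFlow symbol and is not the CUB-based kernel this commit adds. The count-then-fill structure is an assumption chosen because the output size depends on the number of true values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative CPU reference for Where on a row-major bool matrix.
// WhereReference is a hypothetical helper, not part of TensorFlow.
std::vector<std::array<int64_t, 2>> WhereReference(
    const std::vector<bool>& data, int64_t rows, int64_t cols) {
  // Pass 1: count true elements so the output can be sized up front.
  int64_t num_true = 0;
  for (bool v : data) num_true += v ? 1 : 0;

  std::vector<std::array<int64_t, 2>> out;
  out.reserve(static_cast<std::size_t>(num_true));

  // Pass 2: emit the coordinates of each true element in row-major order.
  for (int64_t r = 0; r < rows; ++r) {
    for (int64_t c = 0; c < cols; ++c) {
      if (data[static_cast<std::size_t>(r * cols + c)]) {
        out.push_back(std::array<int64_t, 2>{r, c});
      }
    }
  }
  return out;
}
```

The Throughput figures in the benchmark output appear consistent with counting m*n input bytes plus 16 bytes (two int64 coordinates) per true element, divided by wall_time. For example, the first entry gives (100 + 1*16) bytes / 9.01e-05 s, or about 0.00129 GB/s. That reading is inferred from the numbers and is not stated in the commit.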
author: Eugene Brevdo <ebrevdo@google.com> 2017-06-29 15:33:13 -0700
committer: TensorFlower Gardener <gardener@tensorflow.org> 2017-06-29 15:37:15 -0700
commit: 8280e0ae9083a65b23608b34723f07e028a56dc8 (patch)
tree: 0f2df282cfd5cd712920e440cea88a093668cbf2 /tensorflow/core/framework/register_types.h
parent: 4aa7c4d2330ce110b5be348144ee67143841272c (diff)
1 files changed, 4 insertions, 1 deletions
diff --git a/tensorflow/core/framework/register_types.h b/tensorflow/core/framework/register_types.h
index 2f7b140295..b62fe647e2 100644
--- a/tensorflow/core/framework/register_types.h
+++ b/tensorflow/core/framework/register_types.h
@@ -167,10 +167,13 @@ limitations under the License.
 // Call "m" on POD and string types.
 #define TF_CALL_POD_STRING_TYPES(m) TF_CALL_POD_TYPES(m) TF_CALL_string(m)
 
-// Call "m" on all types supported on GPU.
+// Call "m" on all number types supported on GPU.
 #define TF_CALL_GPU_NUMBER_TYPES(m) \
   TF_CALL_half(m) TF_CALL_float(m) TF_CALL_double(m)
 
+// Call "m" on all types supported on GPU.
+#define TF_CALL_GPU_ALL_TYPES(m) TF_CALL_GPU_NUMBER_TYPES(m) TF_CALL_bool(m)
+
 #define TF_CALL_GPU_NUMBER_TYPES_NO_HALF(m) TF_CALL_float(m) TF_CALL_double(m)
 
 // Call "m" on all quantized types.
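
The new TF_CALL_GPU_ALL_TYPES macro follows the same list-macro pattern as the existing ones: it applies a caller-supplied macro "m" once per type. The sketch below illustrates that expansion in isolation. The DEMO_* macros and `RegisteredGpuTypes` are hypothetical stand-ins written for this example, not the definitions in register_types.h, and the half type is omitted so the example compiles without Eigen.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the per-type macros (assumed form: each one
// simply applies "m" to a single type).
#define DEMO_CALL_float(m) m(float)
#define DEMO_CALL_double(m) m(double)
#define DEMO_CALL_bool(m) m(bool)

// Number types only, mirroring TF_CALL_GPU_NUMBER_TYPES (minus half).
#define DEMO_CALL_GPU_NUMBER_TYPES(m) DEMO_CALL_float(m) DEMO_CALL_double(m)

// Number types plus bool, mirroring the shape of TF_CALL_GPU_ALL_TYPES.
#define DEMO_CALL_GPU_ALL_TYPES(m) \
  DEMO_CALL_GPU_NUMBER_TYPES(m) DEMO_CALL_bool(m)

// Collects the name of every type the list macro expands to.
inline std::vector<std::string> RegisteredGpuTypes() {
  std::vector<std::string> names;
#define RECORD_TYPE(T) names.push_back(#T);
  DEMO_CALL_GPU_ALL_TYPES(RECORD_TYPE)
#undef RECORD_TYPE
  return names;
}
```

In this sketch, `RegisteredGpuTypes()` yields "float", "double", "bool", in that order. In real code the per-type macro would typically register a kernel or instantiate a template for each type rather than record a name.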