tensorflow - machine learning framework

	Commit message	Author	Age
*	Add depthwise ops for NAS cell in nn_ops_test to improve the inference time on	A. Unique TensorFlower	2017-11-28
	the particular depthwise ops.
	PiperOrigin-RevId: 177235744
*	Switch the softmax to use the new deterministic reductions on the GPU,	A. Unique TensorFlower	2017-09-13
	which results in a speedup of 10-40x on the existing ImageNet benchmarks and 2-3x on the newly added transformer benchmarks. Update the benchmark to also run on the GPU. Remove duplicate CPU tests.
	PiperOrigin-RevId: 168596693
*	Speed up TopK op and add a benchmark.	Eugene Brevdo	2017-06-08
	1. Special-case the k=1 version.
	2. Special-case the k=num_cols version (use in-place stable_sort).
	3. Add multithreading in several places, especially the index sort across rows and the final value shuffle.

	Real-time (wall time) speedup is significant in interesting cases, i.e., realistic beam search scenarios:

	before:

	CPU: Intel Ivybridge with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:12MB
	Benchmark Time(ns) CPU(ns) Iterations
	-------------------------------------------------------------------------
	BM_TopK_CPU_1_100_1_16 9259 25679 70793 10.300M items/s topk_r_1_c_100_k_1_th_16
	BM_TopK_CPU_1_100_2_16 9276 25803 74858 10.281M items/s topk_r_1_c_100_k_2_th_16
	BM_TopK_CPU_1_100_10_16 9183 25089 71922 10.385M items/s topk_r_1_c_100_k_10_th_16
	BM_TopK_CPU_1_100_50_16 10487 27793 67717 9.094M items/s topk_r_1_c_100_k_50_th_16
	BM_TopK_CPU_1_100_100_16 10064 27144 68466 9.476M items/s topk_r_1_c_100_k_100_th_16
	BM_TopK_CPU_32_100_1_16 16832 40640 43761 181.306M items/s topk_r_32_c_100_k_1_th_16
	BM_TopK_CPU_32_100_2_16 20329 47194 34889 150.117M items/s topk_r_32_c_100_k_2_th_16
	BM_TopK_CPU_32_100_10_16 52341 95654 10000 58.305M items/s topk_r_32_c_100_k_10_th_16
	BM_TopK_CPU_32_100_50_16 134493 172223 5155 22.691M items/s topk_r_32_c_100_k_50_th_16
	BM_TopK_CPU_32_100_100_16 112498 151952 6110 27.127M items/s topk_r_32_c_100_k_100_th_16
	BM_TopK_CPU_128_100_1_16 45214 84196 15854 269.981M items/s topk_r_128_c_100_k_1_th_16
	BM_TopK_CPU_128_100_2_16 63425 101001 10000 192.464M items/s topk_r_128_c_100_k_2_th_16
	BM_TopK_CPU_128_100_10_16 178288 216585 3906 68.468M items/s topk_r_128_c_100_k_10_th_16
	BM_TopK_CPU_128_100_50_16 566432 649432 1000 21.551M items/s topk_r_128_c_100_k_50_th_16
	BM_TopK_CPU_128_100_100_16 469575 555467 1500 25.996M items/s topk_r_128_c_100_k_100_th_16
	BM_TopK_CPU_128_1000_1_16 213300 253660 3284 572.293M items/s topk_r_128_c_1000_k_1_th_16
	BM_TopK_CPU_128_1000_2_16 257206 304476 2881 474.601M items/s topk_r_128_c_1000_k_2_th_16
	BM_TopK_CPU_128_1000_10_16 497052 577491 1418 245.588M items/s topk_r_128_c_1000_k_10_th_16
	BM_TopK_CPU_128_1000_50_16 1515879 1607193 459 80.528M items/s topk_r_128_c_1000_k_50_th_16
	BM_TopK_CPU_128_1000_100_16 2571640 2658854 272 47.468M items/s topk_r_128_c_1000_k_100_th_16
	BM_TopK_CPU_128_1000_500_16 7333097 7423285 94 16.646M items/s topk_r_128_c_1000_k_500_th_16
	BM_TopK_CPU_128_1000_1000_16 5770553 5840202 100 21.154M items/s topk_r_128_c_1000_k_1000_th_16
	BM_TopK_CPU_16_10000_10000_16 9166191 9232878 74 16.647M items/s topk_nmt_r_16_c_10000_k_10000_th_16
	BM_TopK_CPU_16_20000_20000_16 19449875 19519678 35 15.690M items/s topk_nmt_r_16_c_20000_k_20000_th_16
	BM_TopK_CPU_16_50000_50000_16 52296451 52302305 10 14.589M items/s topk_nmt_r_16_c_50000_k_50000_th_16
	BM_TopK_CPU_16_100000_100000_16 112297965 112270944 6 13.588M items/s topk_nmt_r_16_c_100000_k_100000_th_16
	BM_TopK_CPU_16_35000_35000_16 35879266 35913330 19 14.885M items/s topk_nmt_r_16_c_35000_k_35000_th_16
	BM_TopK_CPU_16_70000_70000_16 76116905 76111531 9 14.033M items/s topk_nmt_r_16_c_70000_k_70000_th_16
	BM_TopK_CPU_16_175000_175000_16 201008026 200863079 3 13.284M items/s topk_nmt_r_16_c_175000_k_175000_th_16
	BM_TopK_CPU_16_350000_350000_16 433559602 433161430 2 12.318M items/s topk_nmt_r_16_c_350000_k_350000_th_16
	BM_TopK_CPU_128_10000_10000_16 72610283 72609110 9 16.812M items/s topk_nmt_r_128_c_10000_k_10000_th_16
	BM_TopK_CPU_128_20000_20000_16 158373008 158279209 5 15.416M items/s topk_nmt_r_128_c_20000_k_20000_th_16
	BM_TopK_CPU_128_50000_50000_16 417896471 417215294 2 14.605M items/s topk_nmt_r_128_c_50000_k_50000_th_16
	BM_TopK_CPU_128_100000_100000_16 884346025 883177699 1 13.803M items/s topk_nmt_r_128_c_100000_k_100000_th_16
	BM_TopK_CPU_128_35000_35000_16 286974608 286727426 2 14.888M items/s topk_nmt_r_128_c_35000_k_35000_th_16
	BM_TopK_CPU_128_70000_70000_16 614528753 614007815 1 13.905M items/s topk_nmt_r_128_c_70000_k_70000_th_16
	BM_TopK_CPU_128_175000_175000_16 1607903552 1606329364 1 13.286M items/s topk_nmt_r_128_c_175000_k_175000_th_16
	BM_TopK_CPU_128_350000_350000_16 3402143043 3398494095 1 12.558M items/s topk_nmt_r_128_c_350000_k_350000_th_16

	after:

	CPU: Intel Ivybridge with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:12MB
	Benchmark Time(ns) CPU(ns) Iterations
	-------------------------------------------------------------------------
	BM_TopK_CPU_1_100_1_16 9018 24839 79347 10.575M items/s topk_r_1_c_100_k_1_th_16
	BM_TopK_CPU_1_100_2_16 8950 24456 76591 10.656M items/s topk_r_1_c_100_k_2_th_16
	BM_TopK_CPU_1_100_10_16 9427 25658 74100 10.117M items/s topk_r_1_c_100_k_10_th_16
	BM_TopK_CPU_1_100_50_16 11148 28933 62073 8.555M items/s topk_r_1_c_100_k_50_th_16
	BM_TopK_CPU_1_100_100_16 9590 26127 73189 9.944M items/s topk_r_1_c_100_k_100_th_16
	BM_TopK_CPU_32_100_1_16 10467 27561 64591 291.556M items/s topk_r_32_c_100_k_1_th_16
	BM_TopK_CPU_32_100_2_16 19883 46413 35023 153.488M items/s topk_r_32_c_100_k_2_th_16
	BM_TopK_CPU_32_100_10_16 50567 88639 10000 60.351M items/s topk_r_32_c_100_k_10_th_16
	BM_TopK_CPU_32_100_50_16 63118 347897 10000 48.350M items/s topk_r_32_c_100_k_50_th_16
	BM_TopK_CPU_32_100_100_16 88105 126842 7796 34.638M items/s topk_r_32_c_100_k_100_th_16
	BM_TopK_CPU_128_100_1_16 16760 40292 41596 728.325M items/s topk_r_128_c_100_k_1_th_16
	BM_TopK_CPU_128_100_2_16 64006 101836 10000 190.718M items/s topk_r_128_c_100_k_2_th_16
	BM_TopK_CPU_128_100_10_16 68867 464997 9190 177.256M items/s topk_r_128_c_100_k_10_th_16
	BM_TopK_CPU_128_100_50_16 144858 1155994 5231 84.269M items/s topk_r_128_c_100_k_50_th_16
	BM_TopK_CPU_128_100_100_16 93782 622829 7509 130.164M items/s topk_r_128_c_100_k_100_th_16
	BM_TopK_CPU_128_1000_1_16 96098 210082 7428 1.240G items/s topk_r_128_c_1000_k_1_th_16
	BM_TopK_CPU_128_1000_2_16 90252 709497 7554 1.321G items/s topk_r_128_c_1000_k_2_th_16
	BM_TopK_CPU_128_1000_10_16 124348 1086216 5626 981.684M items/s topk_r_128_c_1000_k_10_th_16
	BM_TopK_CPU_128_1000_50_16 324603 3245178 2151 376.060M items/s topk_r_128_c_1000_k_50_th_16
	BM_TopK_CPU_128_1000_100_16 455413 4106649 1684 268.043M items/s topk_r_128_c_1000_k_100_th_16
	BM_TopK_CPU_128_1000_500_16 904824 8810352 597 134.911M items/s topk_r_128_c_1000_k_500_th_16
	BM_TopK_CPU_128_1000_1000_16 753409 7232945 886 162.024M items/s topk_r_128_c_1000_k_1000_th_16
	BM_TopK_CPU_16_10000_10000_16 1579482 11781021 435 96.606M items/s topk_nmt_r_16_c_10000_k_10000_th_16
	BM_TopK_CPU_16_20000_20000_16 3326291 25598536 212 91.747M items/s topk_nmt_r_16_c_20000_k_20000_th_16
	BM_TopK_CPU_16_50000_50000_16 9192127 72737661 81 82.999M items/s topk_nmt_r_16_c_50000_k_50000_th_16
	BM_TopK_CPU_16_100000_100000_16 20328234 163896476 35 75.062M items/s topk_nmt_r_16_c_100000_k_100000_th_16
	BM_TopK_CPU_16_35000_35000_16 6120448 47771027 100 87.258M items/s topk_nmt_r_16_c_35000_k_35000_th_16
	BM_TopK_CPU_16_70000_70000_16 15198457 108215957 53 70.278M items/s topk_nmt_r_16_c_70000_k_70000_th_16
	BM_TopK_CPU_16_175000_175000_16 36581899 318660494 19 72.995M items/s topk_nmt_r_16_c_175000_k_175000_th_16
	BM_TopK_CPU_16_350000_350000_16 86169153 834154721 8 61.978M items/s topk_nmt_r_16_c_350000_k_350000_th_16
	BM_TopK_CPU_128_10000_10000_16 9022381 95945196 73 135.297M items/s topk_nmt_r_128_c_10000_k_10000_th_16
	BM_TopK_CPU_128_20000_20000_16 20012433 209172356 32 121.994M items/s topk_nmt_r_128_c_20000_k_20000_th_16
	BM_TopK_CPU_128_50000_50000_16 59536858 606791128 10 102.517M items/s topk_nmt_r_128_c_50000_k_50000_th_16
	BM_TopK_CPU_128_100000_100000_16 119065841 1375709415 6 102.523M items/s topk_nmt_r_128_c_100000_k_100000_th_16
	BM_TopK_CPU_128_35000_35000_16 34995900 399661847 20 122.085M items/s topk_nmt_r_128_c_35000_k_35000_th_16
	BM_TopK_CPU_128_70000_70000_16 82103990 904735845 9 104.074M items/s topk_nmt_r_128_c_70000_k_70000_th_16
	BM_TopK_CPU_128_175000_175000_16 230992936 2675073107 3 92.480M items/s topk_nmt_r_128_c_175000_k_175000_th_16
	BM_TopK_CPU_128_350000_350000_16 616369221 7200013156 1 69.317M items/s topk_nmt_r_128_c_350000_k_350000_th_16

	relative throughput difference (new - old)/old:

	$ paste /tmp/OLD /tmp/NEW | perl -ne '@r = $_ =~ /([\d\.]+[MG]) it/g; if ($r[0] =~ /G/) { $r[0] = 1000*$r[0] }; if ($r[1] =~ /G/) { $r[1] = 1000*$r[1] }; if (@r) {printf("%s\t\trelative throughput difference: %.2f%%\n", (split(" ",$_))[-1], ($r[1] - $r[0])/$r[0] * 100)}'
	topk_r_1_c_100_k_1_th_16 relative throughput difference: 2.67%
	topk_r_1_c_100_k_2_th_16 relative throughput difference: 3.65%
	topk_r_1_c_100_k_10_th_16 relative throughput difference: -2.58%
	topk_r_1_c_100_k_50_th_16 relative throughput difference: -5.93%
	topk_r_1_c_100_k_100_th_16 relative throughput difference: 4.94%
	topk_r_32_c_100_k_1_th_16 relative throughput difference: 60.81%
	topk_r_32_c_100_k_2_th_16 relative throughput difference: 2.25%
	topk_r_32_c_100_k_10_th_16 relative throughput difference: 3.51%
	topk_r_32_c_100_k_50_th_16 relative throughput difference: 113.08%
	topk_r_32_c_100_k_100_th_16 relative throughput difference: 27.69%
	topk_r_128_c_100_k_1_th_16 relative throughput difference: 169.77%
	topk_r_128_c_100_k_2_th_16 relative throughput difference: -0.91%
	topk_r_128_c_100_k_10_th_16 relative throughput difference: 158.89%
	topk_r_128_c_100_k_50_th_16 relative throughput difference: 291.02%
	topk_r_128_c_100_k_100_th_16 relative throughput difference: 400.71%
	topk_r_128_c_1000_k_1_th_16 relative throughput difference: 116.67%
	topk_r_128_c_1000_k_2_th_16 relative throughput difference: 178.34%
	topk_r_128_c_1000_k_10_th_16 relative throughput difference: 299.73%
	topk_r_128_c_1000_k_50_th_16 relative throughput difference: 366.99%
	topk_r_128_c_1000_k_100_th_16 relative throughput difference: 464.68%
	topk_r_128_c_1000_k_500_th_16 relative throughput difference: 710.47%
	topk_r_128_c_1000_k_1000_th_16 relative throughput difference: 665.93%
	topk_nmt_r_16_c_10000_k_10000_th_16 relative throughput difference: 480.32%
	topk_nmt_r_16_c_20000_k_20000_th_16 relative throughput difference: 484.75%
	topk_nmt_r_16_c_50000_k_50000_th_16 relative throughput difference: 468.91%
	topk_nmt_r_16_c_100000_k_100000_th_16 relative throughput difference: 452.41%
	topk_nmt_r_16_c_35000_k_35000_th_16 relative throughput difference: 486.21%
	topk_nmt_r_16_c_70000_k_70000_th_16 relative throughput difference: 400.81%
	topk_nmt_r_16_c_175000_k_175000_th_16 relative throughput difference: 449.50%
	topk_nmt_r_16_c_350000_k_350000_th_16 relative throughput difference: 403.15%
	topk_nmt_r_128_c_10000_k_10000_th_16 relative throughput difference: 704.76%
	topk_nmt_r_128_c_20000_k_20000_th_16 relative throughput difference: 691.35%
	topk_nmt_r_128_c_50000_k_50000_th_16 relative throughput difference: 601.93%
	topk_nmt_r_128_c_100000_k_100000_th_16 relative throughput difference: 642.76%
	topk_nmt_r_128_c_35000_k_35000_th_16 relative throughput difference: 720.02%
	topk_nmt_r_128_c_70000_k_70000_th_16 relative throughput difference: 648.46%
	topk_nmt_r_128_c_175000_k_175000_th_16 relative throughput difference: 596.07%
	topk_nmt_r_128_c_350000_k_350000_th_16 relative throughput difference: 451.97%

	PiperOrigin-RevId: 158472620
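The relative throughput figures above are (new - old) / old applied to the items/s column of the two benchmark runs. A minimal Python sketch of that arithmetic follows; the helper names `parse_throughput` and `relative_difference_percent` are hypothetical and not part of the commit, which used a Perl one-liner instead.

```python
import re

# Multipliers for the unit suffixes that appear in the items/s column.
_UNITS = {"": 1.0, "M": 1e6, "G": 1e9}


def parse_throughput(text):
    """Convert a string such as '10.300M' or '1.240G' to items per second."""
    match = re.fullmatch(r"([\d.]+)([MG]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized throughput: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def relative_difference_percent(old, new):
    """Return (new - old) / old as a percentage."""
    old_value = parse_throughput(old)
    new_value = parse_throughput(new)
    return (new_value - old_value) / old_value * 100.0


# Example row: topk_r_128_c_1000_k_1_th_16, before vs. after.
print(f"{relative_difference_percent('572.293M', '1.240G'):.2f}%")
```

Normalizing both values to plain items per second before dividing avoids mixing M and G figures, which is the case the Perl script handles by scaling G values by 1000.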
*	Improve speed of depthwise conv backward on GPU.	A. Unique TensorFlower	2017-03-02
	Change: 149047908
*	Update generated C++ API. This is a roll-forward past CLs that were	A. Unique TensorFlower	2017-02-01
	commented out plus some new overrides.
	Change: 146330232
*	Benchmark fused convolution ops, and use multi-threaded EigenTensor for conv2d	Pete Warden	2016-11-15
	Change: 139215742
*	Make it possible for each LocalDevice to own a separate Eigen Threadpool.	A. Unique TensorFlower	2016-10-12
	This is controlled by a private interface, currently only accessible by tensorflow::test::Benchmark to allow benchmarks with different numbers of threads to be run in the same invocation. (See b/30009830, b/29000403).

	Before:

	Benchmark Time(ns) CPU(ns) Iterations
	-------------------------------------------------------------------
	BM_ConvFloatFwdCPU1_conv0 9252919 9409887 100 25.726G items/s 32_5_5_1248_128_1_1_1_2_f_cpu1
	BM_ConvFloatFwdCPU4_conv0 9236290 9396430 100 25.772G items/s 32_5_5_1248_128_1_1_1_2_f_cpu4
	BM_ConvFloatDepthwiseFwdCPU1_conv0 65055411 65452691 100 2.482G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseFwdCPU4_conv0 63588193 63981662 100 2.540G items/s 32_112_112_3_8_24_3_3_1_2_cpu4

	After:

	Benchmark Time(ns) CPU(ns) Iterations
	-------------------------------------------------------------------
	BM_ConvFloatFwdCPU1_conv0 9231144 9371349 100 25.786G items/s 32_5_5_1248_128_1_1_1_2_f_cpu1
	BM_ConvFloatFwdCPU4_conv0 2911355 11476373 270 81.762G items/s 32_5_5_1248_128_1_1_1_2_f_cpu4
	BM_ConvFloatDepthwiseFwdCPU1_conv0 64183629 64580719 100 2.516G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseFwdCPU4_conv0 20300639 75878738 100 7.955G items/s 32_112_112_3_8_24_3_3_1_2_cpu4

	Change: 135971493
*	Fixed a Graph ownership bug. test::Benchmark takes ownership of the graph,	Manjunath Kudlur	2016-10-07
	so pass it a heap-constructed graph instead of a local object.
	Change: 135509506
*	Improvements to the C++ graph building API.	Manjunath Kudlur	2016-07-15
	TESTED:
	- passed opensource_build: http://ci.tensorflow.org/job/tensorflow-cl-presubmit-multijob/2780/
	Change: 127585603
*	Refactor Get2dOutputSizes/Get2dOutputSizesVerbose/Get3dOutputSizes to share	A. Unique TensorFlower	2016-06-20
	a common 1-dimensional GetWindowedOutputSize/GetWindowedOutputSizeVerbose.

	The output sizes and padding of each dimension of a windowed operation (such as convolution or pooling) are orthogonal and can be computed independently. We can simplify the code by providing a 1D size computation and calling it for each dimension.

	Also remove special cases for 1x1 spatial convolutions in dimension calculations; they add complexity and are a case that the general code handles correctly.

	In general, 2D convolutions and their gradients have a lot of shape calculation code that is duplicated for each spatial dimension. This CL is a step in the direction of treating spatial dimensions the same so we can share more code.

	Change: 125360639
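The per-dimension computation described in this commit can be pictured with a short Python sketch. It assumes the usual VALID and SAME padding conventions; the names `windowed_output_size` and `output_sizes_2d` are hypothetical stand-ins for illustration and do not reproduce the C++ signatures of GetWindowedOutputSize or GetWindowedOutputSizeVerbose.

```python
def windowed_output_size(input_size, filter_size, stride, padding):
    """Return (output_size, pad_before, pad_after) for one dimension."""
    if padding == "VALID":
        # No padding: only windows that fit entirely inside the input count.
        output_size = (input_size - filter_size + stride) // stride
        pad_before = pad_after = 0
    elif padding == "SAME":
        # Output covers every stride step; pad as evenly as possible,
        # putting any odd extra element at the end.
        output_size = (input_size + stride - 1) // stride
        pad_needed = max(0, (output_size - 1) * stride + filter_size - input_size)
        pad_before = pad_needed // 2
        pad_after = pad_needed - pad_before
    else:
        raise ValueError(f"unknown padding: {padding!r}")
    return output_size, pad_before, pad_after


def output_sizes_2d(in_rows, in_cols, filter_rows, filter_cols, stride, padding):
    """Compute a 2D result by calling the 1D helper once per dimension."""
    rows = windowed_output_size(in_rows, filter_rows, stride, padding)
    cols = windowed_output_size(in_cols, filter_cols, stride, padding)
    return rows, cols


print(output_sizes_2d(112, 56, 3, 3, 2, "SAME"))
```

A 1x1 filter needs no special case in this sketch: with `filter_size=1` and `stride=1` the general formulas give an output equal to the input and zero padding.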
*	Update copyright for 3p/tf/core.	A. Unique TensorFlower	2016-06-02
	Change: 123900938
*	Enable fp16 for convolution operations, gated on CUDA 7.5. (The fp16 tests	A. Unique TensorFlower	2016-05-27
	will not be run under 7.0.)

	This is GPU-only for now; there are still bugs in Eigen that block fp16 convolutions on CPU, but this should hopefully not last for long.
	Change: 123410990
*	Parallelize MaxPool across batch dimension.	A. Unique TensorFlower	2016-04-18
	Benchmark Base (ns) New (ns) Improvement
	------------------------------------------------------------------
	BM_MaxPool_32_112_112_64_3_3_2_VALID_1 28173747 28956041 -2.8%
	BM_MaxPool_32_56_56_192_3_3_2_VALID_1 14467716 14581478 -0.8%
	BM_MaxPool_32_28_28_352_3_3_2_VALID_1 5318842 5367336 -0.9%
	BM_MaxPool_32_14_14_576_3_3_2_VALID_1 1331917 1351642 -1.5%
	BM_MaxPool_32_112_112_64_3_3_2_SAME_1 28757024 29005280 -0.9%
	BM_MaxPool_32_56_56_192_3_3_2_SAME_1 15119295 15478783 -2.4%
	BM_MaxPool_32_28_28_352_3_3_2_SAME_1 5802450 5871220 -1.2%
	BM_MaxPool_32_14_14_576_3_3_2_SAME_1 1632582 1662128 -1.8%
	BM_MaxPool_32_112_112_64_3_3_2_VALID_4 28579650 8240771 +71.2%
	BM_MaxPool_32_56_56_192_3_3_2_VALID_4 14621344 4373595 +70.1%
	BM_MaxPool_32_28_28_352_3_3_2_VALID_4 5404303 1571711 +70.9%
	BM_MaxPool_32_14_14_576_3_3_2_VALID_4 1343607 427873 +68.2%
	BM_MaxPool_32_112_112_64_3_3_2_SAME_4 29195151 8204002 +71.9%
	BM_MaxPool_32_56_56_192_3_3_2_SAME_4 15314088 4642979 +69.7%
	BM_MaxPool_32_28_28_352_3_3_2_SAME_4 6094918 1777112 +70.8%
	BM_MaxPool_32_14_14_576_3_3_2_SAME_4 1643584 544554 +66.9%

	TESTED:
	- passed opensource_build
	- passed unit tests
	Change: 120128184
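The log does not show how the kernel splits its work, so the following is only a sketch of the general idea: each batch element can be pooled independently, which lets the batch be handed out across worker threads. The function names `max_pool_2d_valid` and `max_pool_batch` are hypothetical, the pooling handles only VALID padding on a single channel, and Python threads illustrate the partitioning rather than the speedup (the interpreter lock prevents real parallelism here).

```python
from concurrent.futures import ThreadPoolExecutor


def max_pool_2d_valid(image, window, stride):
    """Max-pool one single-channel image (a list of rows), VALID padding."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - window + 1, stride):
        pooled_row = []
        for c in range(0, cols - window + 1, stride):
            # Take the maximum over the window anchored at (r, c).
            pooled_row.append(max(image[r + dr][c + dc]
                                  for dr in range(window)
                                  for dc in range(window)))
        pooled.append(pooled_row)
    return pooled


def max_pool_batch(batch, window, stride, num_threads):
    """Pool every image in the batch, one task per batch element."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves batch order regardless of which thread finishes first.
        return list(pool.map(
            lambda image: max_pool_2d_valid(image, window, stride), batch))


batch = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
print(max_pool_batch(batch, window=2, stride=1, num_threads=4))
```

Because no batch element reads or writes another element's data, the result is the same for any thread count; only the wall time changes, which is consistent with the single-thread rows in the table staying roughly flat while the 4-thread rows improve.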
*	Optimized implementation of depthwise conv backprop filter for CPU.	A. Unique TensorFlower	2016-04-18
	// OLD
	Benchmark Time(ns) CPU(ns) Iterations
	------------------------------------------------------------------------
	BM_ConvFloatDepthwiseBkFilterCPU1_conv0 281152179 280588497 100 588.2M items/s 32_112_112_3_8_24_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv1 760242956 758694909 100 580.1M items/s 32_112_112_64_1_64_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv2 383554418 382741182 100 574.9M items/s 32_56_56_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv3 98924384 98665676 100 557.2M items/s 32_56_56_128_1_128_3_3_2_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv4 94237506 94005920 100 585.0M items/s 32_28_28_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv5 106895864 106648144 100 515.7M items/s 32_14_14_512_1_512_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv6 69247718 69078442 100 398.0M items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv7 70304661 70126053 100 588.1M items/s 32_112_112_3_8_24_3_3_2_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv8 67619710 67447142 100 611.4M items/s 32_112_112_3_8_24_3_3_2_1_cpu1

	// NEW 1-thread
	Benchmark Time(ns) CPU(ns) Iterations
	------------------------------------------------------------------------
	BM_ConvFloatDepthwiseBkFilterCPU1_conv0 59981294 59569328 100 2.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv1 165631344 165250674 100 2.6G items/s 32_112_112_64_1_64_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv2 76910026 76705735 100 2.8G items/s 32_56_56_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv3 21491439 21375872 100 2.5G items/s 32_56_56_128_1_128_3_3_2_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv4 18677714 18587209 100 2.9G items/s 32_28_28_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv5 23474236 23377934 100 2.3G items/s 32_14_14_512_1_512_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv6 17066829 16982791 100 1.6G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv7 14822571 14744419 100 2.7G items/s 32_112_112_3_8_24_3_3_2_2_cpu1
	BM_ConvFloatDepthwiseBkFilterCPU1_conv8 14325480 14254559 100 2.8G items/s 32_112_112_3_8_24_3_3_2_1_cpu1

	// NEW 4-threads
	Benchmark Time(ns) CPU(ns) Iterations
	------------------------------------------------------------------------
	BM_ConvFloatDepthwiseBkFilterCPU4_conv0 21809044 69141049 100 7.4G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv1 57704422 192333505 100 7.5G items/s 32_112_112_64_1_64_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv2 29761264 91848609 100 7.2G items/s 32_56_56_128_1_128_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv3 9075773 26429821 100 5.9G items/s 32_56_56_128_1_128_3_3_2_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv4 7276754 22100190 100 7.4G items/s 32_28_28_128_1_128_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv5 6756189 24510067 100 8.0G items/s 32_14_14_512_1_512_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv6 4837993 17723279 142 5.6G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv7 6676347 19935585 100 6.0G items/s 32_112_112_3_8_24_3_3_2_2_cpu4
	BM_ConvFloatDepthwiseBkFilterCPU4_conv8 5951583 17181079 100 6.8G items/s 32_112_112_3_8_24_3_3_2_1_cpu4

	TESTED:
	- passed opensource_build
	- passed unit tests
	Change: 120125325
*	Optimized DepthwiseConvBackpropInputOp for CPU.	A. Unique TensorFlower	2016-03-29
	// OLD
	Benchmark Time(ns) CPU(ns) Iterations
	--------------------------------------------------------------------
	BM_ConvFloatDepthwiseBkInCPU1_conv0 207770233 207338129 100 796.0M items/s 32_112_112_3_8_24_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv1 715403538 713939287 100 616.4M items/s 32_112_112_64_1_64_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv2 357349749 356594057 100 617.0M items/s 32_56_56_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv3 274697435 274160117 100 802.7M items/s 32_56_56_128_1_128_3_3_2_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv4 87072020 86874244 100 633.1M items/s 32_28_28_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv5 87172482 86948501 100 632.4M items/s 32_14_14_512_1_512_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv6 46763611 46620163 100 589.4M items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1

	// NEW 1-thread
	Benchmark Time(ns) CPU(ns) Iterations
	--------------------------------------------------------------------
	BM_ConvFloatDepthwiseBkInCPU1_conv0 60173061 59839526 100 2.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv1 99396102 99143542 100 4.3G items/s 32_112_112_64_1_64_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv2 39376616 39226953 100 5.5G items/s 32_56_56_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv3 35987577 35843443 100 6.0G items/s 32_56_56_128_1_128_3_3_2_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv4 9665813 9600518 100 5.6G items/s 32_28_28_128_1_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv5 12498989 12427035 100 4.3G items/s 32_14_14_512_1_512_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseBkInCPU1_conv6 8459759 8397047 100 3.2G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1

	// NEW 4-threads
	Benchmark Time(ns) CPU(ns) Iterations
	--------------------------------------------------------------------
	BM_ConvFloatDepthwiseBkInCPU4_conv0 30696635 101663830 100 5.3G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkInCPU4_conv1 68884630 198616710 100 6.3G items/s 32_112_112_64_1_64_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkInCPU4_conv2 16948037 50360587 100 12.7G items/s 32_56_56_128_1_128_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkInCPU4_conv3 15834408 46873689 100 13.6G items/s 32_56_56_128_1_128_3_3_2_2_cpu4
	BM_ConvFloatDepthwiseBkInCPU4_conv4 3904734 11659079 167 13.8G items/s 32_28_28_128_1_128_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkInCPU4_conv5 3482083 12555105 188 15.5G items/s 32_14_14_512_1_512_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseBkInCPU4_conv6 2330680 8593020 281 11.5G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu4

	Change: 118514706
*	Add benchmark tests for depthwise conv forward gpu kernels	Jianmin Chen	2016-03-18
	Benchmark Time(ns) CPU(ns) Iterations
	BM_ConvFloatDepthwiseFwdGPU_conv0 4800416 4937895 141 32.7G items/s 32_112_112_3_8_24_3_3_1_2_gpu
	BM_ConvFloatDepthwiseFwdGPU_conv1 13550072 13922813 100 30.9G items/s 32_112_112_64_1_64_3_3_1_2_gpu
	BM_ConvFloatDepthwiseFwdGPU_conv2 7032385 7324553 100 29.4G items/s 32_56_56_128_1_128_3_3_1_2_gpu
	BM_ConvFloatDepthwiseFwdGPU_conv3 2285033 2425335 228 22.2G items/s 32_56_56_128_1_128_3_3_2_2_gpu
	BM_ConvFloatDepthwiseFwdGPU_conv4 1743948 1858093 359 29.0G items/s 32_28_28_128_1_128_3_3_1_2_gpu
	BM_ConvFloatDepthwiseFwdGPU_conv5 1784560 1897147 320 28.4G items/s 32_14_14_512_1_512_3_3_1_2_gpu
	BM_ConvFloatDepthwiseFwdGPU_conv6 971179 1044185 562 25.8G items/s 32_7_7_1024_1_1024_3_3_1_2_gpu

	Change: 117553964
*	First attempt at an optimized implementation of DepthwiseConv2D for CPU.	A. Unique TensorFlower	2016-03-07
	// OLD
	Benchmark Time(ns) CPU(ns) Iterations
	-------------------------------------------------------------------
	BM_ConvFloatDepthwiseFwdCPU1_conv0 247698841 247715520 100 667.6M items/s 32_112_112_3_8_128_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseFwdCPU1_conv1 662664406 662723089 100 665.5M items/s 32_112_112_64_1_128_3_3_1_2_cpu1

	// NEW
	Benchmark Time(ns) CPU(ns) Iterations
	-------------------------------------------------------------------
	BM_ConvFloatDepthwiseFwdCPU1_conv0 60316894 60215905 100 2.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
	BM_ConvFloatDepthwiseFwdCPU1_conv1 158600898 158571194 100 2.7G items/s 32_112_112_64_1_64_3_3_1_2_cpu1

	// NEW 4-THREADS
	Benchmark Time(ns) CPU(ns) Iterations
	-------------------------------------------------------------------
	BM_ConvFloatDepthwiseFwdCPU4_conv0 16703436 64535709 100 9.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
	BM_ConvFloatDepthwiseFwdCPU4_conv1 51874080 182896805 100 8.3G items/s 32_112_112_64_1_64_3_3_1_2_cpu4

	Change: 116555067
*	Give tensorflow/core/kernels/ its own BUILD file.	Josh Levenberg	2016-02-24
	Change: 115379524