| Commit message (Collapse) | Author | Age |
| |
the particular depthwise ops.
PiperOrigin-RevId: 177235744
| |
results in a speedup of 10-40x on the existing ImageNet benchmarks and 2-3x
on the newly added transformer benchmarks.
Update the benchmark to also run on the GPU.
Remove duplicate CPU tests.
PiperOrigin-RevId: 168596693
| |
1. Special-case the k=1 version.
2. Special-case the k=num_cols version (use in-place stable_sort).
3. Add multithreading in several places, especially the index sort across rows and the final value shuffle.
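The three changes listed above can be illustrated with a small Python sketch. This is not the TensorFlow kernel: the names `top_k_row` and `top_k`, the `num_threads` parameter, and the tie-breaking rule (lower index first) are assumptions made for illustration, and Python threads only show the structure of the per-row parallelism rather than the speedup measured below.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def top_k_row(row, k):
    """Return (values, indices) of the k largest entries of one row.

    Results are ordered largest first; ties go to the lower index
    (an assumption of this sketch).
    """
    n = len(row)
    if k <= 0:
        return [], []
    if k == 1:
        # Special case k=1: a single linear scan, no heap or sort needed.
        best = 0
        for i in range(1, n):
            if row[i] > row[best]:
                best = i
        return [row[best]], [best]
    if k >= n:
        # Special case k=num_cols: every element is selected, so a stable
        # sort of the indices by descending value is all that is required.
        idx = sorted(range(n), key=lambda i: row[i], reverse=True)
    else:
        # General case: partial selection of the k best indices.
        idx = heapq.nsmallest(k, range(n), key=lambda i: (-row[i], i))
    return [row[i] for i in idx], idx


def top_k(matrix, k, num_threads=4):
    """Apply top_k_row to every row, distributing rows over a thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda row: top_k_row(row, k), matrix))
    values = [v for v, _ in results]
    indices = [i for _, i in results]
    return values, indices
```

For example, `top_k([[1, 2, 3], [9, 8, 7]], 2)` returns `([[3, 2], [9, 8]], [[2, 1], [0, 1]])`. Rows are independent, which is why the work splits cleanly across threads.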
The real-time (wall time) speedup is significant in the interesting cases, i.e., realistic beam search scenarios:
before:
CPU: Intel Ivybridge with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:12MB
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------------
BM_TopK_CPU_1_100_1_16 9259 25679 70793 10.300M items/s topk_r_1_c_100_k_1_th_16
BM_TopK_CPU_1_100_2_16 9276 25803 74858 10.281M items/s topk_r_1_c_100_k_2_th_16
BM_TopK_CPU_1_100_10_16 9183 25089 71922 10.385M items/s topk_r_1_c_100_k_10_th_16
BM_TopK_CPU_1_100_50_16 10487 27793 67717 9.094M items/s topk_r_1_c_100_k_50_th_16
BM_TopK_CPU_1_100_100_16 10064 27144 68466 9.476M items/s topk_r_1_c_100_k_100_th_16
BM_TopK_CPU_32_100_1_16 16832 40640 43761 181.306M items/s topk_r_32_c_100_k_1_th_16
BM_TopK_CPU_32_100_2_16 20329 47194 34889 150.117M items/s topk_r_32_c_100_k_2_th_16
BM_TopK_CPU_32_100_10_16 52341 95654 10000 58.305M items/s topk_r_32_c_100_k_10_th_16
BM_TopK_CPU_32_100_50_16 134493 172223 5155 22.691M items/s topk_r_32_c_100_k_50_th_16
BM_TopK_CPU_32_100_100_16 112498 151952 6110 27.127M items/s topk_r_32_c_100_k_100_th_16
BM_TopK_CPU_128_100_1_16 45214 84196 15854 269.981M items/s topk_r_128_c_100_k_1_th_16
BM_TopK_CPU_128_100_2_16 63425 101001 10000 192.464M items/s topk_r_128_c_100_k_2_th_16
BM_TopK_CPU_128_100_10_16 178288 216585 3906 68.468M items/s topk_r_128_c_100_k_10_th_16
BM_TopK_CPU_128_100_50_16 566432 649432 1000 21.551M items/s topk_r_128_c_100_k_50_th_16
BM_TopK_CPU_128_100_100_16 469575 555467 1500 25.996M items/s topk_r_128_c_100_k_100_th_16
BM_TopK_CPU_128_1000_1_16 213300 253660 3284 572.293M items/s topk_r_128_c_1000_k_1_th_16
BM_TopK_CPU_128_1000_2_16 257206 304476 2881 474.601M items/s topk_r_128_c_1000_k_2_th_16
BM_TopK_CPU_128_1000_10_16 497052 577491 1418 245.588M items/s topk_r_128_c_1000_k_10_th_16
BM_TopK_CPU_128_1000_50_16 1515879 1607193 459 80.528M items/s topk_r_128_c_1000_k_50_th_16
BM_TopK_CPU_128_1000_100_16 2571640 2658854 272 47.468M items/s topk_r_128_c_1000_k_100_th_16
BM_TopK_CPU_128_1000_500_16 7333097 7423285 94 16.646M items/s topk_r_128_c_1000_k_500_th_16
BM_TopK_CPU_128_1000_1000_16 5770553 5840202 100 21.154M items/s topk_r_128_c_1000_k_1000_th_16
BM_TopK_CPU_16_10000_10000_16 9166191 9232878 74 16.647M items/s topk_nmt_r_16_c_10000_k_10000_th_16
BM_TopK_CPU_16_20000_20000_16 19449875 19519678 35 15.690M items/s topk_nmt_r_16_c_20000_k_20000_th_16
BM_TopK_CPU_16_50000_50000_16 52296451 52302305 10 14.589M items/s topk_nmt_r_16_c_50000_k_50000_th_16
BM_TopK_CPU_16_100000_100000_16 112297965 112270944 6 13.588M items/s topk_nmt_r_16_c_100000_k_100000_th_16
BM_TopK_CPU_16_35000_35000_16 35879266 35913330 19 14.885M items/s topk_nmt_r_16_c_35000_k_35000_th_16
BM_TopK_CPU_16_70000_70000_16 76116905 76111531 9 14.033M items/s topk_nmt_r_16_c_70000_k_70000_th_16
BM_TopK_CPU_16_175000_175000_16 201008026 200863079 3 13.284M items/s topk_nmt_r_16_c_175000_k_175000_th_16
BM_TopK_CPU_16_350000_350000_16 433559602 433161430 2 12.318M items/s topk_nmt_r_16_c_350000_k_350000_th_16
BM_TopK_CPU_128_10000_10000_16 72610283 72609110 9 16.812M items/s topk_nmt_r_128_c_10000_k_10000_th_16
BM_TopK_CPU_128_20000_20000_16 158373008 158279209 5 15.416M items/s topk_nmt_r_128_c_20000_k_20000_th_16
BM_TopK_CPU_128_50000_50000_16 417896471 417215294 2 14.605M items/s topk_nmt_r_128_c_50000_k_50000_th_16
BM_TopK_CPU_128_100000_100000_16 884346025 883177699 1 13.803M items/s topk_nmt_r_128_c_100000_k_100000_th_16
BM_TopK_CPU_128_35000_35000_16 286974608 286727426 2 14.888M items/s topk_nmt_r_128_c_35000_k_35000_th_16
BM_TopK_CPU_128_70000_70000_16 614528753 614007815 1 13.905M items/s topk_nmt_r_128_c_70000_k_70000_th_16
BM_TopK_CPU_128_175000_175000_16 1607903552 1606329364 1 13.286M items/s topk_nmt_r_128_c_175000_k_175000_th_16
BM_TopK_CPU_128_350000_350000_16 3402143043 3398494095 1 12.558M items/s topk_nmt_r_128_c_350000_k_350000_th_16
after:
CPU: Intel Ivybridge with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:12MB
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------------
BM_TopK_CPU_1_100_1_16 9018 24839 79347 10.575M items/s topk_r_1_c_100_k_1_th_16
BM_TopK_CPU_1_100_2_16 8950 24456 76591 10.656M items/s topk_r_1_c_100_k_2_th_16
BM_TopK_CPU_1_100_10_16 9427 25658 74100 10.117M items/s topk_r_1_c_100_k_10_th_16
BM_TopK_CPU_1_100_50_16 11148 28933 62073 8.555M items/s topk_r_1_c_100_k_50_th_16
BM_TopK_CPU_1_100_100_16 9590 26127 73189 9.944M items/s topk_r_1_c_100_k_100_th_16
BM_TopK_CPU_32_100_1_16 10467 27561 64591 291.556M items/s topk_r_32_c_100_k_1_th_16
BM_TopK_CPU_32_100_2_16 19883 46413 35023 153.488M items/s topk_r_32_c_100_k_2_th_16
BM_TopK_CPU_32_100_10_16 50567 88639 10000 60.351M items/s topk_r_32_c_100_k_10_th_16
BM_TopK_CPU_32_100_50_16 63118 347897 10000 48.350M items/s topk_r_32_c_100_k_50_th_16
BM_TopK_CPU_32_100_100_16 88105 126842 7796 34.638M items/s topk_r_32_c_100_k_100_th_16
BM_TopK_CPU_128_100_1_16 16760 40292 41596 728.325M items/s topk_r_128_c_100_k_1_th_16
BM_TopK_CPU_128_100_2_16 64006 101836 10000 190.718M items/s topk_r_128_c_100_k_2_th_16
BM_TopK_CPU_128_100_10_16 68867 464997 9190 177.256M items/s topk_r_128_c_100_k_10_th_16
BM_TopK_CPU_128_100_50_16 144858 1155994 5231 84.269M items/s topk_r_128_c_100_k_50_th_16
BM_TopK_CPU_128_100_100_16 93782 622829 7509 130.164M items/s topk_r_128_c_100_k_100_th_16
BM_TopK_CPU_128_1000_1_16 96098 210082 7428 1.240G items/s topk_r_128_c_1000_k_1_th_16
BM_TopK_CPU_128_1000_2_16 90252 709497 7554 1.321G items/s topk_r_128_c_1000_k_2_th_16
BM_TopK_CPU_128_1000_10_16 124348 1086216 5626 981.684M items/s topk_r_128_c_1000_k_10_th_16
BM_TopK_CPU_128_1000_50_16 324603 3245178 2151 376.060M items/s topk_r_128_c_1000_k_50_th_16
BM_TopK_CPU_128_1000_100_16 455413 4106649 1684 268.043M items/s topk_r_128_c_1000_k_100_th_16
BM_TopK_CPU_128_1000_500_16 904824 8810352 597 134.911M items/s topk_r_128_c_1000_k_500_th_16
BM_TopK_CPU_128_1000_1000_16 753409 7232945 886 162.024M items/s topk_r_128_c_1000_k_1000_th_16
BM_TopK_CPU_16_10000_10000_16 1579482 11781021 435 96.606M items/s topk_nmt_r_16_c_10000_k_10000_th_16
BM_TopK_CPU_16_20000_20000_16 3326291 25598536 212 91.747M items/s topk_nmt_r_16_c_20000_k_20000_th_16
BM_TopK_CPU_16_50000_50000_16 9192127 72737661 81 82.999M items/s topk_nmt_r_16_c_50000_k_50000_th_16
BM_TopK_CPU_16_100000_100000_16 20328234 163896476 35 75.062M items/s topk_nmt_r_16_c_100000_k_100000_th_16
BM_TopK_CPU_16_35000_35000_16 6120448 47771027 100 87.258M items/s topk_nmt_r_16_c_35000_k_35000_th_16
BM_TopK_CPU_16_70000_70000_16 15198457 108215957 53 70.278M items/s topk_nmt_r_16_c_70000_k_70000_th_16
BM_TopK_CPU_16_175000_175000_16 36581899 318660494 19 72.995M items/s topk_nmt_r_16_c_175000_k_175000_th_16
BM_TopK_CPU_16_350000_350000_16 86169153 834154721 8 61.978M items/s topk_nmt_r_16_c_350000_k_350000_th_16
BM_TopK_CPU_128_10000_10000_16 9022381 95945196 73 135.297M items/s topk_nmt_r_128_c_10000_k_10000_th_16
BM_TopK_CPU_128_20000_20000_16 20012433 209172356 32 121.994M items/s topk_nmt_r_128_c_20000_k_20000_th_16
BM_TopK_CPU_128_50000_50000_16 59536858 606791128 10 102.517M items/s topk_nmt_r_128_c_50000_k_50000_th_16
BM_TopK_CPU_128_100000_100000_16 119065841 1375709415 6 102.523M items/s topk_nmt_r_128_c_100000_k_100000_th_16
BM_TopK_CPU_128_35000_35000_16 34995900 399661847 20 122.085M items/s topk_nmt_r_128_c_35000_k_35000_th_16
BM_TopK_CPU_128_70000_70000_16 82103990 904735845 9 104.074M items/s topk_nmt_r_128_c_70000_k_70000_th_16
BM_TopK_CPU_128_175000_175000_16 230992936 2675073107 3 92.480M items/s topk_nmt_r_128_c_175000_k_175000_th_16
BM_TopK_CPU_128_350000_350000_16 616369221 7200013156 1 69.317M items/s topk_nmt_r_128_c_350000_k_350000_th_16
relative throughput difference (new - old)/old:
$ paste /tmp/OLD /tmp/NEW | perl -ne '@r = $_ =~ /([\d\.]+[MG]) it/g; if ($r[0] =~ /G/) { $r[0] = 1000*$r[0] }; if ($r[1] =~ /G/) { $r[1] = 1000*$r[1]}; if (@r) {printf("%s\t\trelative throughput difference: %.2f%%\n", (split(" ",$_))[-1], ($r[1] - $r[0])/$r[0] * 100)}'
topk_r_1_c_100_k_1_th_16 relative throughput difference: 2.67%
topk_r_1_c_100_k_2_th_16 relative throughput difference: 3.65%
topk_r_1_c_100_k_10_th_16 relative throughput difference: -2.58%
topk_r_1_c_100_k_50_th_16 relative throughput difference: -5.93%
topk_r_1_c_100_k_100_th_16 relative throughput difference: 4.94%
topk_r_32_c_100_k_1_th_16 relative throughput difference: 60.81%
topk_r_32_c_100_k_2_th_16 relative throughput difference: 2.25%
topk_r_32_c_100_k_10_th_16 relative throughput difference: 3.51%
topk_r_32_c_100_k_50_th_16 relative throughput difference: 113.08%
topk_r_32_c_100_k_100_th_16 relative throughput difference: 27.69%
topk_r_128_c_100_k_1_th_16 relative throughput difference: 169.77%
topk_r_128_c_100_k_2_th_16 relative throughput difference: -0.91%
topk_r_128_c_100_k_10_th_16 relative throughput difference: 158.89%
topk_r_128_c_100_k_50_th_16 relative throughput difference: 291.02%
topk_r_128_c_100_k_100_th_16 relative throughput difference: 400.71%
topk_r_128_c_1000_k_1_th_16 relative throughput difference: 116.67%
topk_r_128_c_1000_k_2_th_16 relative throughput difference: 178.34%
topk_r_128_c_1000_k_10_th_16 relative throughput difference: 299.73%
topk_r_128_c_1000_k_50_th_16 relative throughput difference: 366.99%
topk_r_128_c_1000_k_100_th_16 relative throughput difference: 464.68%
topk_r_128_c_1000_k_500_th_16 relative throughput difference: 710.47%
topk_r_128_c_1000_k_1000_th_16 relative throughput difference: 665.93%
topk_nmt_r_16_c_10000_k_10000_th_16 relative throughput difference: 480.32%
topk_nmt_r_16_c_20000_k_20000_th_16 relative throughput difference: 484.75%
topk_nmt_r_16_c_50000_k_50000_th_16 relative throughput difference: 468.91%
topk_nmt_r_16_c_100000_k_100000_th_16 relative throughput difference: 452.41%
topk_nmt_r_16_c_35000_k_35000_th_16 relative throughput difference: 486.21%
topk_nmt_r_16_c_70000_k_70000_th_16 relative throughput difference: 400.81%
topk_nmt_r_16_c_175000_k_175000_th_16 relative throughput difference: 449.50%
topk_nmt_r_16_c_350000_k_350000_th_16 relative throughput difference: 403.15%
topk_nmt_r_128_c_10000_k_10000_th_16 relative throughput difference: 704.76%
topk_nmt_r_128_c_20000_k_20000_th_16 relative throughput difference: 691.35%
topk_nmt_r_128_c_50000_k_50000_th_16 relative throughput difference: 601.93%
topk_nmt_r_128_c_100000_k_100000_th_16 relative throughput difference: 642.76%
topk_nmt_r_128_c_35000_k_35000_th_16 relative throughput difference: 720.02%
topk_nmt_r_128_c_70000_k_70000_th_16 relative throughput difference: 648.46%
topk_nmt_r_128_c_175000_k_175000_th_16 relative throughput difference: 596.07%
topk_nmt_r_128_c_350000_k_350000_th_16 relative throughput difference: 451.97%
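For readers who prefer not to parse the Perl one-liner, the same (new - old)/old computation can be sketched in Python. The function names `parse_throughput` and `relative_difference` are hypothetical; the sketch assumes each benchmark line carries a throughput of the form `<number>M items/s` or `<number>G items/s`, as in the output above.

```python
import re

# Scale factors that normalize throughput to millions of items per second.
_UNIT_TO_MILLIONS = {"M": 1.0, "G": 1000.0}


def parse_throughput(line):
    """Extract throughput in M items/s from one benchmark line, or None."""
    match = re.search(r"([\d.]+)([MG]) items/s", line)
    if match is None:
        return None
    return float(match.group(1)) * _UNIT_TO_MILLIONS[match.group(2)]


def relative_difference(old_line, new_line):
    """Return (new - old) / old as a percentage."""
    old = parse_throughput(old_line)
    new = parse_throughput(new_line)
    return (new - old) / old * 100.0


old = "BM_TopK_CPU_128_1000_1_16 213300 253660 3284 572.293M items/s topk_r_128_c_1000_k_1_th_16"
new = "BM_TopK_CPU_128_1000_1_16 96098 210082 7428 1.240G items/s topk_r_128_c_1000_k_1_th_16"
print(f"{relative_difference(old, new):.2f}%")  # prints 116.67%
```

The printed value agrees with the `topk_r_128_c_1000_k_1_th_16` row in the listing above.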
PiperOrigin-RevId: 158472620
| |
Change: 149047908
| |
commented out plus some new overrides.
Change: 146330232
| |
Change: 139215742
| |
This is controlled by a private interface that is currently accessible only to tensorflow::test::Benchmark; it allows benchmarks with different numbers of threads to be run in the same invocation (see b/30009830, b/29000403).
Before:
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatFwdCPU1_conv0 9252919 9409887 100 25.726G items/s 32_5_5_1248_128_1_1_1_2_f_cpu1
BM_ConvFloatFwdCPU4_conv0 9236290 9396430 100 25.772G items/s 32_5_5_1248_128_1_1_1_2_f_cpu4
BM_ConvFloatDepthwiseFwdCPU1_conv0 65055411 65452691 100 2.482G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseFwdCPU4_conv0 63588193 63981662 100 2.540G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
After:
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatFwdCPU1_conv0 9231144 9371349 100 25.786G items/s 32_5_5_1248_128_1_1_1_2_f_cpu1
BM_ConvFloatFwdCPU4_conv0 2911355 11476373 270 81.762G items/s 32_5_5_1248_128_1_1_1_2_f_cpu4
BM_ConvFloatDepthwiseFwdCPU1_conv0 64183629 64580719 100 2.516G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseFwdCPU4_conv0 20300639 75878738 100 7.955G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
Change: 135971493
| |
so passing it a heap-constructed graph instead of a local object.
Change: 135509506
| |
TESTED:
- passed opensource_build: http://ci.tensorflow.org/job/tensorflow-cl-presubmit-multijob/2780/
Change: 127585603
| |
a common 1-dimensional GetWindowedOutputSize/GetWindowedOutputSizeVerbose.
The output sizes and padding of each dimension of a windowed operation (such as convolution or pooling) are orthogonal and can be computed independently. We can simplify the code by providing a 1D size computation and calling it for each dimension.
Also remove special cases for 1x1 spatial convolutions in dimension calculations; they add complexity and are a case that the general code handles correctly.
In general, 2D convolutions and their gradients have a lot of shape calculation code that is duplicated for each spatial dimension. This CL is a step in the direction of treating spatial dimensions the same so we can share more code.
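The idea of computing each dimension independently can be sketched as follows. This is a Python illustration, not the C++ code from the change: the names `windowed_output_size_1d` and `windowed_output_shape` are hypothetical, and the formulas assume the usual VALID and SAME padding conventions. As a simplification, the sketch rejects a filter larger than the input under VALID padding.

```python
def windowed_output_size_1d(input_size, filter_size, stride, padding):
    """Return (output_size, pad_before, pad_after) for one dimension."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if padding == "VALID":
        # No padding: only windows that fit entirely inside the input count.
        output_size = (input_size - filter_size + stride) // stride
        pad_before = pad_after = 0
    elif padding == "SAME":
        # Pad so that output_size == ceil(input_size / stride).
        output_size = (input_size + stride - 1) // stride
        pad_needed = max(0, (output_size - 1) * stride + filter_size - input_size)
        pad_before = pad_needed // 2
        pad_after = pad_needed - pad_before
    else:
        raise ValueError("unknown padding: %r" % (padding,))
    if output_size < 0:
        raise ValueError("filter does not fit in the input")
    return output_size, pad_before, pad_after


def windowed_output_shape(input_shape, filter_shape, strides, padding):
    """Apply the 1D computation to every spatial dimension in turn."""
    return [
        windowed_output_size_1d(i, f, s, padding)
        for i, f, s in zip(input_shape, filter_shape, strides)
    ]


print(windowed_output_size_1d(112, 3, 2, "VALID"))  # prints (55, 0, 0)
print(windowed_output_size_1d(112, 3, 2, "SAME"))   # prints (56, 0, 1)
```

A 1x1 filter with stride 1 needs no special case here: `windowed_output_size_1d(5, 1, 1, "SAME")` yields `(5, 0, 0)` from the general formula.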
Change: 125360639
| |
Change: 123900938
| |
will not be run under 7.0.)
This is GPU-only for now; there are still bugs in Eigen that block fp16
convolutions on CPU, but this should hopefully not last for long.
Change: 123410990
| |
Benchmark Base (ns) New (ns) Improvement
------------------------------------------------------------------
BM_MaxPool_32_112_112_64_3_3_2_VALID_1 28173747 28956041 -2.8%
BM_MaxPool_32_56_56_192_3_3_2_VALID_1 14467716 14581478 -0.8%
BM_MaxPool_32_28_28_352_3_3_2_VALID_1 5318842 5367336 -0.9%
BM_MaxPool_32_14_14_576_3_3_2_VALID_1 1331917 1351642 -1.5%
BM_MaxPool_32_112_112_64_3_3_2_SAME_1 28757024 29005280 -0.9%
BM_MaxPool_32_56_56_192_3_3_2_SAME_1 15119295 15478783 -2.4%
BM_MaxPool_32_28_28_352_3_3_2_SAME_1 5802450 5871220 -1.2%
BM_MaxPool_32_14_14_576_3_3_2_SAME_1 1632582 1662128 -1.8%
BM_MaxPool_32_112_112_64_3_3_2_VALID_4 28579650 8240771 +71.2%
BM_MaxPool_32_56_56_192_3_3_2_VALID_4 14621344 4373595 +70.1%
BM_MaxPool_32_28_28_352_3_3_2_VALID_4 5404303 1571711 +70.9%
BM_MaxPool_32_14_14_576_3_3_2_VALID_4 1343607 427873 +68.2%
BM_MaxPool_32_112_112_64_3_3_2_SAME_4 29195151 8204002 +71.9%
BM_MaxPool_32_56_56_192_3_3_2_SAME_4 15314088 4642979 +69.7%
BM_MaxPool_32_28_28_352_3_3_2_SAME_4 6094918 1777112 +70.8%
BM_MaxPool_32_14_14_576_3_3_2_SAME_4 1643584 544554 +66.9%
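The Improvement column above appears to be the relative reduction in time, (base - new)/base, so a negative value means the new code is slower. A minimal Python sketch under that assumption (the function name `improvement_percent` is hypothetical):

```python
def improvement_percent(base_ns, new_ns):
    """Relative time reduction in percent; negative means a slowdown."""
    return (base_ns - new_ns) / base_ns * 100.0


# Rows taken from the table above.
print(f"{improvement_percent(28173747, 28956041):+.1f}%")  # prints -2.8%
print(f"{improvement_percent(28579650, 8240771):+.1f}%")   # prints +71.2%
```

Both values agree with the corresponding single-thread and 4-thread rows of the table.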
TESTED:
- passed opensource_build
- passed unit tests
Change: 120128184
| |
// OLD
Benchmark Time(ns) CPU(ns) Iterations
------------------------------------------------------------------------
BM_ConvFloatDepthwiseBkFilterCPU1_conv0 281152179 280588497 100 588.2M items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv1 760242956 758694909 100 580.1M items/s 32_112_112_64_1_64_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv2 383554418 382741182 100 574.9M items/s 32_56_56_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv3 98924384 98665676 100 557.2M items/s 32_56_56_128_1_128_3_3_2_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv4 94237506 94005920 100 585.0M items/s 32_28_28_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv5 106895864 106648144 100 515.7M items/s 32_14_14_512_1_512_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv6 69247718 69078442 100 398.0M items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv7 70304661 70126053 100 588.1M items/s 32_112_112_3_8_24_3_3_2_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv8 67619710 67447142 100 611.4M items/s 32_112_112_3_8_24_3_3_2_1_cpu1
// NEW 1-thread
Benchmark Time(ns) CPU(ns) Iterations
------------------------------------------------------------------------
BM_ConvFloatDepthwiseBkFilterCPU1_conv0 59981294 59569328 100 2.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv1 165631344 165250674 100 2.6G items/s 32_112_112_64_1_64_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv2 76910026 76705735 100 2.8G items/s 32_56_56_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv3 21491439 21375872 100 2.5G items/s 32_56_56_128_1_128_3_3_2_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv4 18677714 18587209 100 2.9G items/s 32_28_28_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv5 23474236 23377934 100 2.3G items/s 32_14_14_512_1_512_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv6 17066829 16982791 100 1.6G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv7 14822571 14744419 100 2.7G items/s 32_112_112_3_8_24_3_3_2_2_cpu1
BM_ConvFloatDepthwiseBkFilterCPU1_conv8 14325480 14254559 100 2.8G items/s 32_112_112_3_8_24_3_3_2_1_cpu1
// NEW 4-threads
Benchmark Time(ns) CPU(ns) Iterations
------------------------------------------------------------------------
BM_ConvFloatDepthwiseBkFilterCPU4_conv0 21809044 69141049 100 7.4G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv1 57704422 192333505 100 7.5G items/s 32_112_112_64_1_64_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv2 29761264 91848609 100 7.2G items/s 32_56_56_128_1_128_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv3 9075773 26429821 100 5.9G items/s 32_56_56_128_1_128_3_3_2_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv4 7276754 22100190 100 7.4G items/s 32_28_28_128_1_128_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv5 6756189 24510067 100 8.0G items/s 32_14_14_512_1_512_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv6 4837993 17723279 142 5.6G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv7 6676347 19935585 100 6.0G items/s 32_112_112_3_8_24_3_3_2_2_cpu4
BM_ConvFloatDepthwiseBkFilterCPU4_conv8 5951583 17181079 100 6.8G items/s 32_112_112_3_8_24_3_3_2_1_cpu4
TESTED:
- passed opensource_build
- passed unit tests
Change: 120125325
| |
// OLD
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------------------------------
BM_ConvFloatDepthwiseBkInCPU1_conv0 207770233 207338129 100 796.0M items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv1 715403538 713939287 100 616.4M items/s 32_112_112_64_1_64_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv2 357349749 356594057 100 617.0M items/s 32_56_56_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv3 274697435 274160117 100 802.7M items/s 32_56_56_128_1_128_3_3_2_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv4 87072020 86874244 100 633.1M items/s 32_28_28_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv5 87172482 86948501 100 632.4M items/s 32_14_14_512_1_512_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv6 46763611 46620163 100 589.4M items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1
// NEW 1-thread
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------------------------------
BM_ConvFloatDepthwiseBkInCPU1_conv0 60173061 59839526 100 2.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv1 99396102 99143542 100 4.3G items/s 32_112_112_64_1_64_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv2 39376616 39226953 100 5.5G items/s 32_56_56_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv3 35987577 35843443 100 6.0G items/s 32_56_56_128_1_128_3_3_2_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv4 9665813 9600518 100 5.6G items/s 32_28_28_128_1_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv5 12498989 12427035 100 4.3G items/s 32_14_14_512_1_512_3_3_1_2_cpu1
BM_ConvFloatDepthwiseBkInCPU1_conv6 8459759 8397047 100 3.2G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu1
// NEW 4-threads
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------------------------------
BM_ConvFloatDepthwiseBkInCPU4_conv0 30696635 101663830 100 5.3G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkInCPU4_conv1 68884630 198616710 100 6.3G items/s 32_112_112_64_1_64_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkInCPU4_conv2 16948037 50360587 100 12.7G items/s 32_56_56_128_1_128_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkInCPU4_conv3 15834408 46873689 100 13.6G items/s 32_56_56_128_1_128_3_3_2_2_cpu4
BM_ConvFloatDepthwiseBkInCPU4_conv4 3904734 11659079 167 13.8G items/s 32_28_28_128_1_128_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkInCPU4_conv5 3482083 12555105 188 15.5G items/s 32_14_14_512_1_512_3_3_1_2_cpu4
BM_ConvFloatDepthwiseBkInCPU4_conv6 2330680 8593020 281 11.5G items/s 32_7_7_1024_1_1024_3_3_1_2_cpu4
Change: 118514706
| |
Benchmark Time(ns) CPU(ns) Iterations
BM_ConvFloatDepthwiseFwdGPU_conv0 4800416 4937895 141 32.7G items/s 32_112_112_3_8_24_3_3_1_2_gpu
BM_ConvFloatDepthwiseFwdGPU_conv1 13550072 13922813 100 30.9G items/s 32_112_112_64_1_64_3_3_1_2_gpu
BM_ConvFloatDepthwiseFwdGPU_conv2 7032385 7324553 100 29.4G items/s 32_56_56_128_1_128_3_3_1_2_gpu
BM_ConvFloatDepthwiseFwdGPU_conv3 2285033 2425335 228 22.2G items/s 32_56_56_128_1_128_3_3_2_2_gpu
BM_ConvFloatDepthwiseFwdGPU_conv4 1743948 1858093 359 29.0G items/s 32_28_28_128_1_128_3_3_1_2_gpu
BM_ConvFloatDepthwiseFwdGPU_conv5 1784560 1897147 320 28.4G items/s 32_14_14_512_1_512_3_3_1_2_gpu
BM_ConvFloatDepthwiseFwdGPU_conv6 971179 1044185 562 25.8G items/s 32_7_7_1024_1_1024_3_3_1_2_gpu
Change: 117553964
| |
// OLD
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatDepthwiseFwdCPU1_conv0 247698841 247715520 100 667.6M items/s 32_112_112_3_8_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseFwdCPU1_conv1 662664406 662723089 100 665.5M items/s 32_112_112_64_1_128_3_3_1_2_cpu1
// NEW
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatDepthwiseFwdCPU1_conv0 60316894 60215905 100 2.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseFwdCPU1_conv1 158600898 158571194 100 2.7G items/s 32_112_112_64_1_64_3_3_1_2_cpu1
// NEW 4-THREADS
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatDepthwiseFwdCPU4_conv0 16703436 64535709 100 9.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
BM_ConvFloatDepthwiseFwdCPU4_conv1 51874080 182896805 100 8.3G items/s 32_112_112_64_1_64_3_3_1_2_cpu4
Change: 116555067
|
|
Change: 115379524
|