aboutsummaryrefslogtreecommitdiffhomepage
path: root/bench
diff options
context:
space:
mode:
authorGravatar Mark D Ryan <mark.d.ryan@intel.com>2018-10-12 15:20:21 +0200
committerGravatar Mark D Ryan <mark.d.ryan@intel.com>2018-10-12 15:20:21 +0200
commitaa110e681b8b2237757a652ba47da49e1fbd2cd6 (patch)
tree3e5e0d1481045c9bdc64d6c4d700ea13a0c7b2f4 /bench
parentd9392f9e557950e0160b6d4ba1c44035649d8e41 (diff)
PR 526: Speed up multiplication of small, dynamically sized matrices
The Packet16f, Packet8f and Packet8d types are too large to use with dynamically sized matrices typically processed by the SliceVectorizedTraversal specialization of the dense_assignment_loop. Using these types is likely to lead to little or no vectorization. Significant slowdown in the multiplication of these small matrices can be observed when building with AVX and AVX512 enabled. This patch introduces a new dense_assignment_kernel that is used when computing small products whose operands have dynamic dimensions. It ensures that the PacketSize used is no larger than 4, thereby increasing the chance that vectorized instructions will be used when computing the product. I tested all 969 possible combinations of M, K, and N that are handled by the dense_assignment_loop on x86 builds. Although a few combinations are slowed down by this patch they are far outnumbered by the cases that are sped up, as the following results demonstrate. Disabling Packed8d on AVX512 builds: Total Cases: 969 Better: 511 Worse: 85 Same: 373 Max Improvement: 169.00% (4 8 6) Max Degradation: 36.50% (8 5 3) Median Improvement: 35.46% Median Degradation: 17.41% Total FLOPs Improvement: 19.42% Disabling Packet16f and Packed8f on AVX512 builds: Total Cases: 969 Better: 658 Worse: 5 Same: 306 Max Improvement: 214.05% (8 6 5) Max Degradation: 22.26% (16 2 1) Median Improvement: 60.05% Median Degradation: 13.32% Total FLOPs Improvement: 59.58% Disabling Packed8f on AVX builds: Total Cases: 969 Better: 663 Worse: 96 Same: 210 Max Improvement: 155.29% (4 10 5) Max Degradation: 35.12% (8 3 2) Median Improvement: 34.28% Median Degradation: 15.05% Total FLOPs Improvement: 26.02%
Diffstat (limited to 'bench')
0 files changed, 0 insertions, 0 deletions