path: root/unsupported/Eigen/CXX11/src/Tensor/TensorCustomOp.h
author    Sameer Agarwal <sameeragarwal@google.com>  2019-02-01 15:23:53 -0800
committer Sameer Agarwal <sameeragarwal@google.com>  2019-02-01 15:23:53 -0800
commit b55b5c7280a0481f01fe5ec764d55c443a8b6496 (patch)
tree   cd324f6a7c070c2359b403f8d4867fd86b65a99b /unsupported/Eigen/CXX11/src/Tensor/TensorCustomOp.h
parent 7ef879f6bfa465a80109216e6d0b18266ef97321 (diff)
Speed up row-major matrix-vector product on ARM
The row-major matrix-vector multiplication code uses a threshold to check whether processing 8 rows at a time would thrash the cache. This change introduces two modifications to this logic.

1. A smaller threshold for ARM and ARM64 devices. The value of this threshold was determined empirically using a Pixel 2 phone, by benchmarking a large number of matrix-vector products in the range [1..4096]x[1..4096] and measuring performance separately on big and little cores with frequency pinning. On big (out-of-order) cores, this change has little to no impact. But on the little (in-order) cores, the matrix-vector products are up to 700% faster, especially on large matrices. The motivation for this change was some internal code at Google which was using hand-written NEON to implement similar functionality, processing the matrix one row at a time, and which exhibited substantially better performance than Eigen. With the current change, Eigen handily beats that code.

2. Make the logic for choosing the number of simultaneous rows apply uniformly to 8, 4, and 2 rows, instead of just 8 rows.

Since the default threshold for non-ARM devices is essentially unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM performance. This was verified by running the same set of benchmarks on a Xeon desktop.
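The row-blocking heuristic described above can be sketched as follows. This is a minimal illustration, not Eigen's actual implementation: the function name, signature, and the way the threshold is passed in are assumptions made for the example; only the threshold values (32 * 1024 bytes by default, smaller on ARM) and the uniform 8/4/2-row test come from the commit message.

```cpp
#include <cstddef>

// Sketch of the heuristic: choose how many rows of a row-major matrix to
// process simultaneously (8, 4, 2, or 1) so that the working set of that
// many rows stays under a cache-size threshold. The same test is applied
// uniformly to 8, 4, and 2 rows, as the commit describes.
std::size_t rows_per_pass(std::size_t cols, std::size_t elem_size,
                          std::size_t threshold) {
  for (std::size_t rows = 8; rows > 1; rows /= 2) {
    // Bytes touched per pass: `rows` full rows of the matrix.
    if (rows * cols * elem_size <= threshold) return rows;
  }
  return 1;  // fall back to processing one row at a time
}
```

With the non-ARM default of 32 * 1024 bytes, a float matrix with 1024 columns would be processed 8 rows at a time, while one with 8192 columns would fall back to a single row per pass; a smaller ARM threshold shifts these breakpoints toward fewer simultaneous rows.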
Diffstat (limited to 'unsupported/Eigen/CXX11/src/Tensor/TensorCustomOp.h')
0 files changed, 0 insertions, 0 deletions