Complete rewrite of the column-major-matrix * vector product to deliver higher performance on modern CPUs.

The previous code was optimized for Intel Core 2, for which unaligned loads/stores were prohibitively expensive. This new version exhibits much higher instruction independence (better pipelining) and explicitly leverages FMA. According to my benchmark, on Haswell this new kernel is always faster than the previous one, and sometimes even twice as fast. Even higher performance could be achieved with a better blocking size heuristic and, perhaps, with explicit prefetching. We should also check the triangular product/solve to optimally exploit this new kernel (working on vertical panels of 4 columns is probably not optimal anymore).
author: Gael Guennebaud <g.gael@free.fr> 2016-12-03 21:14:14 +0100
committer: Gael Guennebaud <g.gael@free.fr> 2016-12-03 21:14:14 +0100
commit: 6a5fe860985311cc275c4bb7000e0d261822c756 (patch)
tree: 27f6c9826b284cf7dbadc0030a892ed53b6ca385 /Eigen/src/Core/util/BlasUtil.h
parent: 2bfece5cd1b13c14471177d13f24b46a28638d27 (diff)
1 files changed, 6 insertions, 0 deletions
diff --git a/Eigen/src/Core/util/BlasUtil.h b/Eigen/src/Core/util/BlasUtil.h
index 6e6ee119b..8deacb894 100755
--- a/Eigen/src/Core/util/BlasUtil.h
+++ b/Eigen/src/Core/util/BlasUtil.h
@@ -222,6 +222,12 @@ class blas_data_mapper {
     return ploadt<Packet, AlignmentType>(&operator()(i, j));
   }
 
+  template <typename PacketT, int AlignmentT>
+  EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE PacketT load(Index i, Index j) const {
+    //return ploadt<PacketT, AlignmentT>(&operator()(i, j));
+    return ploadu<PacketT>(m_data+j*m_stride+i);
+  }
+
   EIGEN_DEVICE_FUNC EIGEN_ALWAYS_INLINE HalfPacket loadHalfPacket(Index i, Index j) const {
     return ploadt<HalfPacket, AlignmentType>(&operator()(i, j));
   }
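For readers unfamiliar with the ideas named in the commit message, here is a minimal scalar sketch of a column-major matrix * vector product that keeps several independent accumulators and uses fused multiply-add. It is not the kernel from this commit: `colmajor_gemv` and its parameters are hypothetical names, the block of four rows is an arbitrary choice, and scalar `std::fma` stands in for the packet operations Eigen actually uses. The only detail taken from the patch is the addressing, where element (i, j) sits at `data + j*stride + i`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Eigen: computes y += A * x for a
// column-major matrix A whose element (i, j) is stored at a[j*stride + i]
// (the same addressing as m_data + j*m_stride + i in the patch).
inline void colmajor_gemv(const double* a, std::size_t rows, std::size_t cols,
                          std::size_t stride, const double* x, double* y) {
  std::size_t i = 0;
  // Four rows at a time: the four accumulators do not depend on each other,
  // so their multiply-adds can overlap in the pipeline.
  for (; i + 4 <= rows; i += 4) {
    double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
    for (std::size_t j = 0; j < cols; ++j) {
      const double* col = a + j * stride + i;  // start of this row block in column j
      const double xj = x[j];
      c0 = std::fma(col[0], xj, c0);
      c1 = std::fma(col[1], xj, c1);
      c2 = std::fma(col[2], xj, c2);
      c3 = std::fma(col[3], xj, c3);
    }
    y[i + 0] += c0;
    y[i + 1] += c1;
    y[i + 2] += c2;
    y[i + 3] += c3;
  }
  // Remaining rows, one at a time.
  for (; i < rows; ++i) {
    double c = 0.0;
    for (std::size_t j = 0; j < cols; ++j) {
      c = std::fma(a[j * stride + i], x[j], c);
    }
    y[i] += c;
  }
}
```

A vectorized version would replace each scalar accumulator with a SIMD register and load the column data with unaligned loads, which is presumably why the new `load` accessor in the diff calls `ploadu`.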