| Commit message | Author | Age |
|
Use numext::mini and numext::maxi instead of std::min/std::max to compute blocking sizes.
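A minimal sketch of the substitution (the function and variable names here are illustrative, not the actual Eigen blocking code):

  #include <Eigen/Core>
  #include <cstddef>

  // Clamp a requested depth to [min_kc, max_kc] using Eigen's own helpers
  // instead of std::min/std::max.
  std::ptrdiff_t clamp_blocking(std::ptrdiff_t depth,
                                std::ptrdiff_t min_kc,
                                std::ptrdiff_t max_kc)
  {
    std::ptrdiff_t k = Eigen::numext::mini(depth, max_kc);  // k = min(depth, max_kc)
    return Eigen::numext::maxi(k, min_kc);                  // never below min_kc
  }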
|
power of 2 (e.g. on Haswell CPUs).
|
Using a static class instance to initialize the CPU cache size values
guarantees thread-safe initialization of those values when using C++11.
Therefore, under C++11 it is no longer necessary to call
Eigen::initParallel() before calling any Eigen functions from
different threads.
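The guarantee relies on C++11's thread-safe initialization of function-local statics ("magic statics"). A minimal sketch of the idiom, with hypothetical names and placeholder sizes (the real cache query lives inside Eigen):

  #include <cstddef>

  struct CacheSizes {
    std::ptrdiff_t l1, l2, l3;
    CacheSizes()
    {
      // Query CPUID / the OS here; placeholder constants for the sketch.
      l1 = 32 * 1024; l2 = 256 * 1024; l3 = 6 * 1024 * 1024;
    }
  };

  const CacheSizes& cacheSizes()
  {
    // C++11 runs this constructor exactly once, even if several threads
    // reach this line concurrently, so no prior initParallel()-style
    // call is required.
    static CacheSizes sizes;
    return sizes;
  }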
|
world as in microbenchmark.
|
this hits unsupported paths in symm/triangular products code
|
, also in 2px4 kernel: actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size, see comment
|
consistent with the known L1 cache size, see comment

using lookup tables
|
outside of x86 (10% faster on Nexus 5)
|
in L1 (allows keeping the packed rhs in L1)

by a more general one:
It consists of increasing the actual number of rows of the lhs's micro
horizontal panel for small depths, so that the L1 cache is fully exploited.
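Roughly, the idea: when the depth (k) is small, one register block of lhs rows occupies only a fraction of L1, so more rows can be swept per pass. A conceptual sketch, not Eigen's actual formula; the names, the clamp, and the rounding to a multiple of mr are illustrative:

  #include <algorithm>
  #include <cstddef>

  // How many lhs rows of a depth-k micro panel fit in L1, at least one
  // register block (mr rows) and rounded down to a multiple of mr.
  std::ptrdiff_t actual_panel_rows(std::ptrdiff_t depth, std::ptrdiff_t mr,
                                   std::ptrdiff_t scalar_size,
                                   std::ptrdiff_t l1_bytes)
  {
    std::ptrdiff_t bytes_per_row = std::max<std::ptrdiff_t>(1, depth * scalar_size);
    std::ptrdiff_t rows = (l1_bytes / bytes_per_row) / mr * mr;
    // Stay resilient to parameters that are not consistent with the known
    // L1 size: never return fewer than mr rows.
    return std::max(rows, mr);
  }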
|
|
|
|
memory accesses to the destination matrix in the case of K-rank-update like products, i.e., for products of the kind: "large x small" * "small x large"
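For reference, a product of exactly that shape in Eigen (dimensions are arbitrary, chosen only to illustrate "large x small" times "small x large"):

  #include <Eigen/Dense>

  Eigen::MatrixXf rank_k_update_like()
  {
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(2000, 8);  // large x small
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(8, 2000);  // small x large
    return A * B;  // the shared dimension k = 8 is tiny, the result is large
  }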
|
loop within product kernel.
|
mpreal_support unit test.

using templates instead of macros and if()'s.
That was needed to fix the build of unit tests on ARM, which I had
broken. My bad for not testing earlier.
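A generic illustration of the technique, not the actual Eigen code: the variant is chosen through a template parameter and a specialization, so the unused branch is never even compiled; kHasFma is hard-coded only to keep the sketch self-contained.

  #include <cmath>

  // Primary template: generic fallback, separate mul and add.
  template <bool UseFma>
  struct MulAdd {
    static float run(float a, float b, float c) { return a * b + c; }
  };

  // Specialization: only instantiated when UseFma is true.
  template <>
  struct MulAdd<true> {
    static float run(float a, float b, float c) { return std::fma(a, b, c); }
  };

  // Would normally come from platform detection; hard-coded here.
  constexpr bool kHasFma = false;

  float kernel(float a, float b, float c)
  {
    // The choice is made at compile time: no macros, no runtime if().
    return MulAdd<kHasFma>::run(a, b, c);
  }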
|
inlines.
It performs extremely well on Haswell. The main issue is to reliably and quickly find the
actual cache size to be used for our 2nd level of blocking, that is: max(l2,l3/nb_core_sharing_l3)
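A sketch of that cache budget; the names are descriptive rather than Eigen's internals, and the example figures in the comment reuse the Haswell numbers mentioned elsewhere in this log with an assumed core count:

  #include <algorithm>
  #include <cstddef>

  // Effective per-core cache for the 2nd level of blocking: the private L2,
  // or this core's fair share of a shared L3, whichever is larger.
  // For instance, with a 6M L3 shared by 4 cores and a 256K L2:
  // max(256K, 6M/4) = 1.5M.
  std::ptrdiff_t blocking_cache_budget(std::ptrdiff_t l2_bytes,
                                       std::ptrdiff_t l3_bytes,
                                       std::ptrdiff_t cores_sharing_l3)
  {
    return std::max(l2_bytes, l3_bytes / cores_sharing_l3);
  }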
|
offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes).
On x86, I tested a Sandy Bridge with AVX with 12M cache and a Haswell with AVX+FMA with 6M cache on MatrixXf sizes up to 2400.
I could not see any significant impact of this offset.
On Nexus 5, the offset has a slight effect: values around 32 (times sizeof float) are the worst. Anything else is the same: the current 64 (8*pk), or... 0.
So let's just go with 0!
Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether!
|
- the first prefetch is actually harmful on Haswell with FMA,
but it is the most beneficial on ARM.
- the second prefetch... I was very stupid and multiplied by sizeof(scalar)
an offset of a scalar* pointer. The old offset was 64; pk = 8, so 64 = pk*8.
So this effectively restores the older offset. Actually, there were
two prefetches here, one with offset 48 and one with offset 64. I could not
confirm any benefit from this strange 48 offset on either the Haswell or
my ARM device.
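To make the units pitfall concrete: arithmetic on a scalar* is already in units of scalars, so also multiplying the offset by sizeof(scalar) over-shoots by that factor. A small sketch using the GCC/Clang builtin (Eigen has its own internal prefetch wrapper; the offset of 64 floats is taken from the message above):

  // 'ptr' is a float*, so ptr + 64 is 64 floats (256 bytes) ahead.
  void prefetch_ahead(const float* ptr)
  {
    __builtin_prefetch(ptr + 64);  // intended: 64 elements ahead
    // Buggy variant: the offset ends up scaled by sizeof(float) twice,
    // prefetching 1024 bytes ahead instead of 256:
    // __builtin_prefetch(ptr + 64 * sizeof(float));
  }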
|
This is substantially faster on ARM, where it's important to minimize the number of loads.
This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome.
Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (even about 1% slower).
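The load-saving idea in rough outline: instead of broadcasting each of the four rhs coefficients with its own load, load the 4-wide rhs packet once and rotate it between the four multiply-accumulate steps, so every accumulator lane eventually meets every coefficient. A purely conceptual scalar sketch; the real kernel does this with SIMD packets (e.g. NEON vext) and de-rotates the accumulators at the end:

  #include <algorithm>
  #include <array>

  // One depth-step of a 4x4 outer-product update c(i,j) += lhs(i) * rhs(j),
  // using a single rhs load plus rotations instead of four broadcasts.
  // acc[r][lane] holds c(lane, (lane + r) % 4); a final de-rotation pass
  // (not shown) moves the accumulators back into column order.
  void rotated_update(std::array<std::array<float, 4>, 4>& acc,
                      const std::array<float, 4>& lhs,
                      std::array<float, 4> rhs)
  {
    for (int r = 0; r < 4; ++r) {
      for (int lane = 0; lane < 4; ++lane)
        acc[r][lane] += lhs[lane] * rhs[lane];  // rhs currently rotated by r
      std::rotate(rhs.begin(), rhs.begin() + 1, rhs.end());  // vext-style rotate
    }
  }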
|
This is only a debugging/testing patch. It allows testing specific
product blocking sizes, typically to study the impact on performance.
Example usage:
int testk, testm, testn;
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
#include <Eigen/Core>
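Fleshing that out into a minimal self-contained harness; the macro names come from the message above, while the matrix sizes and the particular k/m/n values are arbitrary, and timing would be done externally:

  int testk, testm, testn;
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
  #include <Eigen/Core>
  #include <iostream>

  int main()
  {
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(1024, 1024);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(1024, 1024);
    testk = 256; testm = 64; testn = 1024;  // blocking sizes under test
    Eigen::MatrixXf c = a * b;
    std::cout << c.norm() << std::endl;     // keep the product alive
    return 0;
  }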
|
This gives a 10% speedup on Nexus 4 and on Nexus 5.

because this is what they are about. "Fused" means "no intermediate rounding
between the mul and the add, only one rounding at the end". Instead,
what we are concerned about here is whether a temporary register is needed,
i.e. whether the MUL and ADD are separate instructions.
Concretely, on ARM NEON, a single-instruction mul-add is always available: VMLA.
But a true fused mul-add is only available on VFPv4: VFMA.
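The rounding distinction is easy to see in plain C++: std::fma rounds once, while a*b + c rounds the product first. The values below are chosen only so the difference is visible in double precision; compile with -ffp-contract=off so the compiler does not itself fuse the first expression:

  #include <cfloat>
  #include <cmath>
  #include <iostream>

  int main()
  {
    double x = 1.0 + DBL_EPSILON;           // 1 + 2^-52
    double y = -(1.0 + 2.0 * DBL_EPSILON);  // -(1 + 2^-51)

    double separate = x * x + y;            // product rounded first -> exactly 0
    double fused    = std::fma(x, x, y);    // one final rounding     -> 2^-104

    std::cout << separate << " vs " << fused << std::endl;
    return 0;
  }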
|
in both GCC and Clang on ARM/NEON, whereby they spill registers,
severely harming performance. The reason why the asm comments
make a difference is that they prevent the compiler from
reordering code across these boundaries; such reordering would
extend the lifetime of local variables and increase register
pressure on this register-tight code.
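The "asm comment" trick is just an inline asm statement that the compiler cannot analyze, so in practice it will not move instructions across it. A minimal sketch; Eigen has a macro of this kind (EIGEN_ASM_COMMENT), but the name and the dot-product body below are only illustrative:

  // Emits a comment into the generated assembly; the statement is opaque to
  // the compiler, which keeps it from reordering surrounding code across it.
  #define ASM_SCHEDULING_BARRIER(TEXT) __asm__("#" TEXT)

  float dot4(const float* a, const float* b)
  {
    float acc0 = a[0] * b[0];
    float acc1 = a[1] * b[1];
    ASM_SCHEDULING_BARRIER("gebp: end of first half");
    acc0 += a[2] * b[2];
    acc1 += a[3] * b[3];
    return acc0 + acc1;
  }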

Also optimized the blocking parameters to take into account the number of threads used for a computation.
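For context, Eigen exposes the thread count through setNbThreads/nbThreads (effective when compiled with OpenMP), and the blocking can then be derived per thread. A small usage sketch; the matrix sizes and the thread count are arbitrary:

  #include <Eigen/Dense>
  #include <iostream>

  int main()
  {
    Eigen::setNbThreads(4);  // blocking parameters can account for 4 threads
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(3000, 3000);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(3000, 3000);
    Eigen::MatrixXf c = a * b;
    std::cout << Eigen::nbThreads() << " threads, norm = " << c.norm() << std::endl;
    return 0;
  }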