Commit message | Author | Age
5d51a7f12c69138ed2a43df240bdf27a5313f7ce
e56aabf205a1e8f581dd8a46d7d46ce79c45e158
Register blocking sizes are better handled by the cache size heuristics.
The current code introduced very small blocks, for instance for a 9x9 matrix,
thus killing performance.
- Replace internal::scalar_product_traits<A,B> by Eigen::ScalarBinaryOpTraits<A,B,OP>
- Remove the "functor_is_product_like" helper (it was pretty ugly)
- Currently, OP is not used, but it is available to the user for fine-grained tuning
- Currently, only the following operators have been generalized: *, /, +, -, =, *=, /=, +=, -=
- TODO: generalize all other binary operators (comparisons, pow, etc.)
- TODO: handle "scalar op array" operators (currently only * is handled)
- TODO: move the handling of the "void" scalar type to ScalarBinaryOpTraits
Krait) that are not as ubiquitous today as they were when I introduced it.
cases that violate the assumptions made by the optimized code path.
|
|\| |
|
| |
| |
| |
| | |
Use numext::mini and numext::maxi instead of std::min/std::max to compute blocking sizes.
power of 2 (e.g. on Haswell CPUs).
bit registers
Using a static instance of a class to initialize the values for
the CPU cache sizes guarantees thread-safe initialization of those
values when using C++11. Therefore, under C++11 it is no longer
necessary to call Eigen::initParallel() before calling any Eigen
functions on different threads.
world as in microbenchmark.
this hits unsupported paths in symm/triangular products code
, also in 2px4 kernel: actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size, see comment
consistent with the known L1 cache size, see comment
using lookup tables
outside of x86 (10% faster on Nexus 5)
in L1 (allows keeping the packed rhs in L1)
by a more general one:
It consists of increasing the actual number of rows of the lhs's micro horizontal panel for small depths, so that the L1 cache is fully exploited.
memory accesses to the destination matrix in the case of rank-K-update-like products, i.e., for products of the kind: "large x small" * "small x large"
loop within product kernel.
mpreal_support unit test.
using templates instead of macros and if()'s.
That was needed to fix the build of unit tests on ARM, which I had
broken. My bad for not testing earlier.
inlines.
It performs extremely well on Haswell. The main issue is to reliably and quickly find the
actual cache size to be used for our second level of blocking, that is: max(l2, l3/nb_core_sharing_l3)
offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes).
On x86, I tested a Sandy Bridge with AVX and 12M of cache and a Haswell with AVX+FMA and 6M of cache, on MatrixXf sizes up to 2400.
I could not see any significant impact of this offset.
On the Nexus 5, the offset has a slight effect: values around 32 (times sizeof(float)) are the worst. Anything else is the same: the current 64 (8*pk), or... 0.
So let's just go with 0!
Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether!
- the first prefetch is actually harmful on Haswell with FMA,
but it is the most beneficial on ARM.
- the second prefetch... I was very stupid and multiplied an offset
of a scalar* pointer by sizeof(scalar). The old offset was 64; pk = 8, so 64 = pk*8.
So this effectively restores the older offset. Actually, there were
two prefetches here, one with offset 48 and one with offset 64. I could not
confirm any benefit from this strange 48 offset on either the Haswell or
my ARM device.
This is substantially faster on ARM, where it's important to minimize the number of loads.
This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome.
Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (it's even about 1% slower).