| Commit message | Author | Age |
... | |
| |
inlines.
It performs extremely well on Haswell. The main issue is to reliably and quickly find the
actual cache size to be used for our 2nd level of blocking, that is: max(l2,l3/nb_core_sharing_l3)
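The formula above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name and parameters are hypothetical (not Eigen's API), sizes are in bytes, and how the cache sizes and the sharing count are actually queried is left out, since that is the hard part the message refers to.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, not part of Eigen: cache budget for the second
// level of blocking, computed as max(l2, l3 / nb_core_sharing_l3).
inline std::size_t second_level_cache_bytes(std::size_t l2_bytes,
                                            std::size_t l3_bytes,
                                            std::size_t nb_core_sharing_l3) {
  // Treat a reported count of 0 as 1 to avoid dividing by zero
  // when the hardware query fails (an assumption of this sketch).
  const std::size_t sharers = std::max<std::size_t>(nb_core_sharing_l3, 1);
  return std::max(l2_bytes, l3_bytes / sharers);
}
```

For example, with a 256 KB L2 and a 6 MB L3 shared by 4 cores, the per-core L3 share (1.5 MB) is the larger value and would be the one used.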
| |
offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes).
On x86, I tested a Sandy Bridge (AVX, 12 MB cache) and a Haswell (AVX+FMA, 6 MB cache) on MatrixXf sizes up to 2400.
I could not see any significant impact of this offset.
On Nexus 5, the offset has a slight effect: values around 32 (times sizeof float) are worst. Anything else is the same: the current 64 (8*pk), or... 0.
So let's just go with 0!
Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether!
| |
- the first prefetch is actually harmful on Haswell with FMA,
but it is the most beneficial on ARM.
- the second prefetch... I was very stupid and multiplied by sizeof(scalar)
an offset applied to a scalar* pointer. The old offset was 64; pk = 8, so 64=pk*8.
So this effectively restores the older offset. Actually, there were
two prefetches here, one with offset 48 and one with offset 64. I could not
confirm any benefit from this strange 48 offset on either the Haswell or
my ARM device.
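The slip described in the second item comes from how pointer arithmetic works: adding n to a scalar* already advances by n*sizeof(scalar) bytes, so multiplying the offset by sizeof(scalar) scales it twice. A minimal sketch, assuming a 4-byte float; byte_distance is a hypothetical helper written for this illustration and is not taken from the kernel:

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes between p and p + element_offset
// for a pointer to Scalar. A static buffer keeps the arithmetic inside
// one array, so element_offset must lie in [0, 1024].
template <typename Scalar>
std::ptrdiff_t byte_distance(std::ptrdiff_t element_offset) {
  static Scalar buffer[1024] = {};
  const Scalar* p = buffer;
  const char* from = reinterpret_cast<const char*>(p);
  const char* to = reinterpret_cast<const char*>(p + element_offset);
  return to - from;
}
```

With float, byte_distance<float>(64) is 256 bytes, while passing 64 * sizeof(float) as the element offset lands 1024 bytes away, four times farther than intended.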
| |
This is substantially faster on ARM, where it's important to minimize the number of loads.
This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome.
Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (it is even about 1% slower).
| |
This is only a debugging/testing patch. It allows testing specific
product blocking sizes, typically to study the impact on performance.
Example usage:
int testk, testm, testn;
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
#include <Eigen/Core>
| |
might be lower than the number of requested ones
| | |
bug #877, bug #572: Get rid of Index conversion warnings, summary of changes:
- Introduce a global typedef Eigen::Index making Eigen::DenseIndex and AnyExpr<>::Index deprecated (default is std::ptrdiff_t).
- Eigen::Index is used throughout the API to represent indices, offsets, and sizes.
- Classes storing an array of indices use the type StorageIndex to store them. This is a template parameter of the class. Default is int.
- Methods that *explicitly* set or return an element of such an array take or return a StorageIndex type. In all other cases, the Index type is used.
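The split between the two index types can be illustrated with a toy container. This is a sketch under assumptions, not Eigen code: the namespace sketch and the class IndexList are hypothetical, and only the two conventions from the summary are mirrored (a ptrdiff_t-based Index for sizes and positions, a StorageIndex template parameter defaulting to int for the stored values).

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

namespace sketch {  // hypothetical namespace, for illustration only

// Mirrors the default stated in the summary: Index is std::ptrdiff_t.
using Index = std::ptrdiff_t;

// Toy class storing an array of indices. The stored element type is a
// template parameter (default int); sizes and positions use Index.
template <typename StorageIndex_ = int>
class IndexList {
 public:
  using StorageIndex = StorageIndex_;

  explicit IndexList(Index size) : data_(static_cast<std::size_t>(size)) {}

  // Size of the container: a general quantity, so it is an Index.
  Index size() const { return static_cast<Index>(data_.size()); }

  // Explicitly sets or returns a stored element: StorageIndex.
  StorageIndex& operator[](Index i) { return data_[static_cast<std::size_t>(i)]; }
  const StorageIndex& operator[](Index i) const { return data_[static_cast<std::size_t>(i)]; }

 private:
  std::vector<StorageIndex> data_;
};

static_assert(std::is_same<IndexList<>::StorageIndex, int>::value,
              "default storage index type is int");

}  // namespace sketch
```

The design choice this shows is that a compact StorageIndex (for instance short or int) keeps stored index arrays small, while the wider Index avoids conversion warnings in the general-purpose API.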
| |
This gives a 10% speedup on Nexus 4 and on Nexus 5.
| | |
because this is what they are about. "Fused" means "no intermediate rounding
between the mul and the add, only one rounding at the end". Instead,
what we are concerned about here is whether a temporary register is needed,
i.e. whether the MUL and ADD are separate instructions.
Concretely, on ARM NEON, a single-instruction mul-add is always available: VMLA.
But a true fused mul-add is only available on VFPv4: VFMA.
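The rounding meaning of "fused" can be seen in portable C++ with std::fma. This sketch only illustrates the numerical distinction drawn above; it says nothing about which instruction a given compiler emits, and the two function names are hypothetical. The volatile intermediate is there to force the product to be rounded before the addition, so the compiler cannot contract the two operations into one fused instruction.

```cpp
#include <cmath>

// Separate multiply and add: the product is rounded to double first,
// then the sum is rounded again (two roundings).
inline double mul_then_add(double a, double b, double c) {
  volatile double product = a * b;  // forces the intermediate rounding
  return product + c;
}

// Fused multiply-add: a*b + c is computed exactly and rounded once.
inline double fused_mul_add(double a, double b, double c) {
  return std::fma(a, b, c);
}
```

Taking a = b = 1 + 2^-27 and c = -(1 + 2^-26), the exact product is 1 + 2^-26 + 2^-54. The separate version rounds the 2^-54 term away and yields 0, while the fused version keeps it and yields 2^-54.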
| | |
in both GCC and Clang on ARM/NEON, whereby they spill registers,
severely harming performance. The reason why the asm comments
make a difference is that they prevent the compiler from
reordering code across these boundaries, which has the effect
of extending the lifetime of local variables and increasing
register pressure on this register-tight code.
| | |
work correctly even when the input coefficients aren't aligned.
| | |
Also optimized the blocking parameters to take into account the number of threads used for a computation
| | |
manual new[]/delete[] pairs in AMD and Parallelizer
| | |
integer conversion
| | | |
EIGEN_USE_BLAS is defined
|
|/ / |
|
| | |
|
| |
| |
| |
| | |
glu_shape<S1,S2> helper to assemble sparse/dense shapes with triangular/selfadjoint views.
| | |
coefficient-wise operations.
| | |
expressions to ease specializing them.
2- Remove a lot of code which should not be there with evaluators, in particular coeff/packet methods implemented in the expressions.
| | |
This caused redefinition warnings if IACA headers were included from elsewhere. For a clean solution, we should define our own EIGEN_IACA_* macros.
|