| Commit message | Author | Age |
---
This is only a debugging/testing patch. It allows testing specific
product blocking sizes, typically to study the impact on performance.
Example usage:

    int testk, testm, testn;
    #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
    #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
    #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
    #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
    #include <Eigen/Core>
---
This gives a 10% speedup on Nexus 4 and Nexus 5.
---
[...] because this is what they are about. "Fused" means "no intermediate
rounding between the mul and the add, only one rounding at the end".
What we are concerned with here, instead, is whether a temporary register
is needed, i.e. whether the MUL and ADD are separate instructions.
Concretely, on ARM NEON, a single-instruction mul-add is always
available (VMLA), but a true fused mul-add is only available on
VFPv4 (VFMA).
---
[...] in both GCC and Clang on ARM/NEON, whereby they spill registers,
severely harming performance. The asm comments make a difference because
they prevent the compiler from reordering code across these boundaries;
such reordering would extend the lifetime of local variables and increase
register pressure in this register-tight code.
---
Also optimized the blocking parameters to take into account the number
of threads used for a computation.
---
This caused redefinition warnings if IACA headers were included from
elsewhere. For a clean solution we should define our own EIGEN_IACA_*
macros.
---
[...] version seems significantly slower.
---
- use pbroadcast4 (helpful when AVX is not available)
- process all remaining rows at once (a significant speedup for small
  matrices)
---
Rename and generalize Kernel<*> to PacketBlock<*,N>.
---
[...] speedup on Haswell.
This changeset also introduces new vector functions: ploadquad and
predux4.
---
1 - increase the peeling level along the depth dimension (+5% for large
    matrices, i.e. larger than 1000)
2 - improve pipelining when dealing with the last rows of the lhs
---
[...] multiplication code. This speeds up the multiplication of matrices
whose size is not a multiple of the packet size.
---
[...] argument in a matrix-matrix product when AVX instructions are used.
No vectorization takes place when SSE instructions are used; however,
this doesn't seem to impact performance.
---
[...] argument in a matrix-matrix product.
---
[...] matrices is vectorized when nr == 2*PacketSize (which is the case
for SSE when compiling in 64-bit mode).
|
|\ \ |
|
| |/
| |
| |
| | |
ones.
---
[...] change reduces the pressure on the L1 cache by removing the calls
to gebp_traits::unpackRhs(). Instead, the packetization of the rhs blocks
is done on the fly in gebp_traits::loadRhs(). This adds numerous calls to
pset1<ResPacket> (since we are packetizing on the fly in the inner loop),
but this is more than compensated by the fact that memory transfers are
reduced by a factor of RhsPacketSize.
---
[...] warnings. This also fixes the issue of the previous changeset in a
much nicer way.
|
|/ |
|
| |
|
| |
|
|
|
|
| |
(after some benchmarking, it was not useful anymore)
---
[...] initParallel() function, which must be called at initialization
time by any multi-threaded application calling Eigen from multiple
threads.
---
After all, the solution based on threadprivate is not that costly.