path: root/Eigen/src/Core/products/GeneralBlockPanelKernel.h
Commit message (Author, Age)
...
| * bug #958 - Allow testing specific blocking sizes  (Benoit Jacob, 2015-02-18)
|/
|     This is only a debugging/testing patch. It allows testing specific product blocking sizes, typically to study the impact on performance.
|
|     Example usage:
|
|       int testk, testm, testn;
|       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
|       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
|       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
|       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
|       #include <Eigen/Core>
* Fix bug #945: workaround MSVC warning  (Gael Guennebaud, 2015-02-18)
|
* bug #953 - Fix prefetches in 3px4 product kernel  (Benoit Jacob, 2015-02-13)
|     This gives a 10% speedup on Nexus 4 and on Nexus 5.
* Pulled the latest changes from the trunk  (Benoit Steiner, 2015-02-06)
|\
| * bug #936, patch 1.5/3: rename _FUSED_ macros to _SINGLE_INSTRUCTION_,  (Benoit Jacob, 2015-01-31)
| |     because this is what they are about.
| |
| |     "Fused" means "no intermediate rounding between the mul and the add, only one rounding at the end". Instead, what we are concerned about here is whether a temporary register is needed, i.e. whether the MUL and ADD are separate instructions.
| |
| |     Concretely, on ARM NEON, a single-instruction mul-add is always available: VMLA. But a true fused mul-add is only available on VFPv4: VFMA.
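The definition of "fused" quoted in the commit message above (one rounding at the end instead of two) can be observed with standard C++ alone. The following is only an illustrative sketch: the function names are hypothetical and are not part of Eigen, and it does not use the NEON VMLA/VFMA instructions or the renamed macros. It assumes IEEE 754 double precision with round-to-nearest-even.

```cpp
#include <cassert>
#include <cmath>

// Unfused multiply-add: the product is rounded to double before the add.
// The volatile temporary keeps the compiler from contracting the two
// operations into one fused instruction.
double mul_add_two_roundings(double a, double b, double c) {
    volatile double product = a * b;  // first rounding
    return product + c;               // second rounding
}

// Fused multiply-add: std::fma computes a*b + c with a single rounding.
double mul_add_one_rounding(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Inputs chosen so that the exact product, 1 - 2^-54, is not representable
// as a double and rounds to 1.0 when stored on its own.
void check_fused_vs_unfused() {
    const double eps = std::ldexp(1.0, -27);  // 2^-27
    const double a = 1.0 + eps;
    const double b = 1.0 - eps;
    // Two roundings: the product becomes 1.0, so the small term is lost.
    assert(mul_add_two_roundings(a, b, -1.0) == 0.0);
    // One rounding: the exact result -2^-54 survives.
    assert(mul_add_one_rounding(a, b, -1.0) == -std::ldexp(1.0, -54));
}
```

Calling check_fused_vs_unfused() exercises both paths. Whether an instruction is "single-instruction" in the sense of the commit (no temporary register) is a separate property that this sketch does not measure.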
| * bug #936, patch 1/3: some cleanup and renaming for consistency.  (Benoit Jacob, 2015-01-30)
| |
| * bug #935: Add asm comments in GEBP kernels to work around a bug  (Benoit Jacob, 2015-01-30)
| |     in both GCC and Clang on ARM/NEON, whereby they spill registers, severely harming performance.
| |
| |     The reason why the asm comments make a difference is that they prevent the compiler from reordering code across these boundaries. Such reordering has the effect of extending the lifetime of local variables and increasing register pressure on this register-tight code.
* | Made the blocking computation aware of the L3 cache  (Benoit Steiner, 2014-10-15)
| |     Also optimized the blocking parameters to take into account the number of threads used for a computation.
* | Generalized the gebp APIs  (Benoit Steiner, 2014-10-02)
| |
| * Initial VSX commit  (Konstantinos Margaritis, 2014-08-29)
|/
* Forgot to remove IACA_END in previous commit  (Christoph Hertzberg, 2014-05-05)
|
* Removed IACA-defines  (Christoph Hertzberg, 2014-05-05)
|     This caused redefinition warnings if IACA headers were included from elsewhere. For a clean solution we should define our own EIGEN_IACA_* macros.
* Product kernel: skip loop on columns if there are no remaining rows  (Gael Guennebaud, 2014-04-25)
|
* Fix for mixed products  (Gael Guennebaud, 2014-04-25)
|
* Disable 3pX4 kernel on Altivec: although this platform has 32 registers, this version seems significantly slower.  (Gael Guennebaud, 2014-04-25)
* Minor optimizations in product kernel:  (Gael Guennebaud, 2014-04-25)
|     - use pbroadcast4 (helpful when AVX is not available)
|     - process all remaining rows at once (significant speedup for small matrices)
* Enable vectorization of pack_rhs with a column-major RHS.  (Gael Guennebaud, 2014-04-25)
|     Rename and generalize Kernel<*> to PacketBlock<*,N>.
* Enable fused madd for Altivec  (Gael Guennebaud, 2014-04-24)
|
* Smarter block size computation  (Gael Guennebaud, 2014-04-18)
|
* Fix and optimize mixed products  (Gael Guennebaud, 2014-04-17)
|
* New gebp kernel handling up to 3 packets x 4 register-level blocks. Huge speedup on Haswell.  (Gael Guennebaud, 2014-04-16)
|     This changeset also introduces new vector functions: ploadquad and predux4.
* Finally, prefetching seems to help get more stable performance  (Gael Guennebaud, 2014-03-31)
|
* Optimize gebp kernel:  (Gael Guennebaud, 2014-03-30)
|     1 - increase peeling level along the depth dimension (+5% for large matrices, i.e., >1000)
|     2 - improve pipelining when dealing with the last rows of the lhs
* Vectorized the loop peeling of the inner loop of the block-panel matrix multiplication code.  (Benoit Steiner, 2014-03-28)
|     This speeds up the multiplication of matrices whose size is not a multiple of the packet size.
* merge with default branch  (Gael Guennebaud, 2014-03-28)
|\
* | Fixed compilation error when FMA instructions are enabled.  (Benoit Steiner, 2014-03-27)
| |
* | Silenced "unused variable" warnings when compiling with FMA.  (Benoit Steiner, 2014-03-27)
| |
* | Vectorized the packing of a col-major matrix used as the right hand side argument in a matrix-matrix product when AVX instructions are used.  (Benoit Steiner, 2014-03-27)
| |     No vectorization takes place when SSE instructions are used; however, this doesn't seem to impact performance.
* | Vectorized the packing of a row-major matrix used as the left hand side argument in a matrix-matrix product.  (Benoit Steiner, 2014-03-27)
| * Fix warning  (Gael Guennebaud, 2014-03-27)
| |
* | Made sure that the version of gemm_pack_rhs specialized for row major matrices is vectorized when nr == 2*PacketSize (which is the case for SSE when compiling in 64bit mode).  (Benoit Steiner, 2014-03-26)
* | Merged latest updates from the parent branch  (Benoit Steiner, 2014-03-26)
|\ \
| | * Update gebp kernel to process a panel of 4 columns at once for the remaining ones.  (Gael Guennebaud, 2014-03-26)
| |/
| * Implement new 1 packet x 8 gebp kernel  (Gael Guennebaud, 2014-03-26)
| |
* | Merged latest changes from the parent  (Benoit Steiner, 2014-03-18)
|\ \
* \ \ Merged latest changes from the main trunk  (Benoit Steiner, 2014-02-24)
|\ \ \
* | | | Added support for FMA instructions  (Benoit Steiner, 2014-02-24)
| | | |
| | | * Improved the efficiency of the block-panel matrix multiplication code:  (Benoit Steiner, 2014-01-02)
| | |/    the change reduces the pressure on the L1 cache by removing the calls to gebp_traits::unpackRhs(). Instead the packetization of the rhs blocks is done on the fly in gebp_traits::loadRhs(). This adds numerous calls to pset1<ResPacket> (since we're packetizing on the fly in the inner loop) but this is more than compensated by the fact that we're decreasing the memory transfers by a factor of RhsPacketSize.
| | * Use vectorization when packing row-major rhs matrices. (bug #717)  (Benoit Steiner, 2013-12-17)
| |/
| * Implement bug #317: use a template function call to suppress unused variable warnings.  (Gael Guennebaud, 2014-02-24)
| |     This also fixes the issue of the previous changeset in a much nicer way.
| * Work around clang ABI change with unused arguments (ugly fix)  (Gael Guennebaud, 2014-02-24)
|/
* Fix "routine is both "inline" and "noinline"" warnings  (Gael Guennebaud, 2013-02-28)
|
* Fix bug #551: compilation issue when using EIGEN_DEFAULT_DENSE_INDEX_TYPE  (Gael Guennebaud, 2013-02-09)
|
* fix bug #495: remove too aggressive EIGEN_FLATTEN_ATTRIB attribute  (Gael Guennebaud, 2012-08-02)
|     (after some benchmarking, it was not useful anymore)
* Automatic relicensing to MPL2 using Keirs script. Manual fixup follows.  (Benoit Jacob, 2012-07-13)
|
* bug #466: better fix for the race condition: this new patch adds an initParallel() function which must be called at the initialization time of any multi-threaded application calling Eigen from multiple threads.  (Gael Guennebaud, 2012-06-14)
* Fix bug #466: race condition detected by helgrind in manage_caching_sizes.  (Gael Guennebaud, 2012-06-08)
|     After all, the solution based on threadprivate is not that costly.
* Get rid of include directives inside namespace blocks (bug #339).  (Jitse Niesen, 2012-04-15)
|
* fix conjugation in packet_lhs  (Gael Guennebaud, 2012-02-05)
|
* add missing inline keyword (linking issue)  (Gael Guennebaud, 2012-01-26)
|