path: root/Eigen/src/Core/products/GeneralBlockPanelKernel.h
Commit history (commit message, author, date):
* Made the index type a template parameter to evaluateProductBlockingSizes (Benoit Steiner, 2016-04-27)
  Use numext::mini and numext::maxi instead of std::min/std::max to compute blocking sizes. (See the sketch below.)
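  A minimal sketch of the numext::mini/numext::maxi pattern this refers to; the clamping helper and its parameters are illustrative, not the actual blocking code:

      #include <Eigen/Core>

      template <typename Index>
      Index clampBlockSize(Index wanted, Index lowerBound, Index upperBound) {
        // numext::mini/maxi are Eigen's portable stand-ins for std::min/std::max;
        // they also work in device code and with a templated Index type.
        return Eigen::numext::maxi(lowerBound, Eigen::numext::mini(wanted, upperBound));
      }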
* Deleted extraneous comma. (Benoit Steiner, 2016-04-15)
* Improved the matrix multiplication blocking in the case where mr is not a power of 2 (e.g. on Haswell CPUs). (Benoit Steiner, 2016-04-15)
* Added ability to access the cache sizes from the tensor devices (Benoit Steiner, 2016-04-14)
* bug #1161: fix division by zero for huge scalar types (Gael Guennebaud, 2016-02-03)
* Make sure that block sizes are smaller than input matrix sizes. (Gael Guennebaud, 2016-01-26)
* Fix degenerate cases in syrk and trsm (Gael Guennebaud, 2015-11-30)
* Use a class constructor to initialize CPU cache sizes (Chris Jones, 2015-11-20)
  Using a static instance of a class to initialize the CPU cache sizes guarantees thread-safe initialization of those values under C++11. Therefore, under C++11 it is no longer necessary to call Eigen::initParallel() before calling any Eigen functions on different threads. (See the sketch below.)
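  A minimal sketch of the pattern described above, with hypothetical names (CacheSizes, cacheSizes) rather than Eigen's internals; under C++11 the initialization of a function-local static is guaranteed to happen exactly once, even with concurrent callers:

      #include <cstddef>

      struct CacheSizes {
        std::size_t l1, l2, l3;
        CacheSizes() {
          // Hypothetical stand-in for the real CPUID-based cache query.
          l1 = 32 * 1024;
          l2 = 256 * 1024;
          l3 = 8 * 1024 * 1024;
        }
      };

      inline const CacheSizes& cacheSizes() {
        // "Magic static": the constructor runs once, thread-safely, on first use.
        static const CacheSizes sizes;
        return sizes;
      }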
* bug #1043: Avoid integer conversion sign warning (Christoph Hertzberg, 2015-08-19)
* Enable runtime stack alignment in gemm_blocking_space. (Gael Guennebaud, 2015-08-06)
* Abandon the blocking-size lookup table approach; it does not perform as well in the real world as in microbenchmarks. (Benoit Jacob, 2015-05-19)
* Improved the blocking strategy to speed up multithreaded tensor contractions. (Benoit Steiner, 2015-04-09)
* Add a note on bug #992 (Gael Guennebaud, 2015-04-08)
* bug #992: don't select a 3p GEMM path with non-vectorizable scalar types; this hits unsupported paths in the symm/triangular products code. (Benoit Jacob, 2015-04-07)
* Fix computeProductBlockingSizes with m==0, and add the corresponding unit test. (Gael Guennebaud, 2015-03-31)
* Similar to changeset 3589a9c115a892ea3ca5dac74d71a1526764cb38, also in the 2px4 kernel: the actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size; see the comment. (Benoit Jacob, 2015-03-16)
* Fix bug in case where EIGEN_TEST_SPECIFIC_BLOCKING_SIZE is defined but false (Benoit Jacob, 2015-03-15)
* The actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size; see the comment. (Benoit Jacob, 2015-03-15)
* Refactor computeProductBlockingSizes to make room for the possibility of using lookup tables. (Benoit Jacob, 2015-03-15)
* Organize our default cache sizes a little, and use a saner default L1 outside of x86 (10% faster on a Nexus 5). (Benoit Jacob, 2015-03-13)
* Avoid underflow when blocking sizes are tuned manually. (Gael Guennebaud, 2015-03-06)
* Improve blocking heuristic: if the lhs fits within L1, then block on the rhs in L1 (this allows keeping the packed rhs in L1). (Gael Guennebaud, 2015-03-06) (See the sketch below.)
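  A hypothetical sketch of this heuristic; the function name, parameters, and sizing logic are illustrative only, not the actual computeProductBlockingSizes code:

      #include <algorithm>
      #include <cstddef>

      // If the whole packed lhs (m x depth) already fits in L1, stop blocking the
      // lhs and instead size the rhs panel (depth x nc) so that it stays in L1 too.
      void chooseBlockingIfLhsFitsL1(std::ptrdiff_t m, std::ptrdiff_t n, std::ptrdiff_t depth,
                                     std::ptrdiff_t l1, std::size_t scalarSize,
                                     std::ptrdiff_t& kc, std::ptrdiff_t& mc, std::ptrdiff_t& nc) {
        depth = std::max<std::ptrdiff_t>(depth, 1);  // guard against degenerate inputs
        if (m * depth * (std::ptrdiff_t)scalarSize <= l1) {
          kc = depth;  // no blocking on the depth dimension
          mc = m;      // no blocking on the rows of the lhs
          nc = std::min(n, std::max<std::ptrdiff_t>(1, l1 / (depth * (std::ptrdiff_t)scalarSize)));
        }
      }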
* Improve product kernel: replace the previous dynamic loop-swapping strategy by a more general one: increase the actual number of rows of the lhs's micro horizontal panel when the depth is small, so that the L1 cache is fully exploited. (Gael Guennebaud, 2015-03-06) (See the sketch below.)
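  A rough sketch of that idea, assuming illustrative names and a simplified L1 budget (the real kernel's formula includes further corrections for the rhs panel and register blocking):

      #include <algorithm>
      #include <cstddef>

      // For small depth, one mr-row horizontal panel of the packed lhs is cheap
      // (mr * depth scalars), so several such panels can be processed per traversal
      // of the rhs while everything stays in L1.
      std::ptrdiff_t actualPanelRows(std::ptrdiff_t depth, std::ptrdiff_t l1,
                                     std::ptrdiff_t mr, std::size_t lhsScalarSize) {
        const std::ptrdiff_t panelBytes = mr * depth * (std::ptrdiff_t)lhsScalarSize;
        // At least one panel, to stay resilient to inconsistent cache-size parameters.
        const std::ptrdiff_t panelsInL1 =
            std::max<std::ptrdiff_t>(1, l1 / std::max<std::ptrdiff_t>(1, panelBytes));
        return panelsInL1 * mr;
      }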
* Product optimization: implement a dynamic loop-swapping strategy to improve memory accesses to the destination matrix for rank-K-update-like products, i.e., products of the kind "large x small" * "small x large". (Gael Guennebaud, 2015-03-05)
* Fix asm comments in 1px1 kernel (Benoit Jacob, 2015-03-03)
* Add a benchmark-default-sizes action to benchmark-blocking-sizes.cpp (Benoit Jacob, 2015-03-03)
* Increase the unit-test L1 cache size to ensure we do at least 2 peeled loop iterations within the product kernel. (Gael Guennebaud, 2015-02-27)
* Re-enable detection of min/max parentheses protection, and re-enable the mpreal_support unit test. (Gael Guennebaud, 2015-02-27)
* Reimplement the selection between rotating and non-rotating kernels using templates instead of macros and if()'s. (Benoit Jacob, 2015-02-27)
  This was needed to fix the build of the unit tests on ARM, which I had broken; my bad for not testing earlier.
* Make sure that the block size computation is tested by our unit test. (Gael Guennebaud, 2015-02-26)
* Implement a more generic blocking-size selection algorithm; see the explanations inline. (Gael Guennebaud, 2015-02-26)
  It performs extremely well on Haswell. The main issue is to reliably and quickly find the actual cache size to use for our second level of blocking, namely max(l2, l3/nb_core_sharing_l3). (See the sketch below.)
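  A minimal sketch of that second-level cache budget, assuming hypothetical inputs for the per-level cache sizes and the number of cores sharing the L3:

      #include <algorithm>
      #include <cstddef>

      // Budget for the second level of blocking: the larger of the private L2 and
      // the calling core's fair share of the shared L3, i.e. max(l2, l3/nb_core_sharing_l3).
      std::ptrdiff_t secondLevelCacheBudget(std::ptrdiff_t l2, std::ptrdiff_t l3,
                                            int nbCoresSharingL3) {
        const std::ptrdiff_t l3Share = l3 / std::max(1, nbCoresSharingL3);
        return std::max(l2, l3Share);
      }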
* Fix typos in block-size testing code, and set peeling on k to 8. (Gael Guennebaud, 2015-02-26)
* Extensively measured the impact of the offset in this prefetch (Benoit Jacob, 2015-02-25)
  I tried offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes). On x86, I tested a Sandy Bridge with AVX and 12M of cache and a Haswell with AVX+FMA and 6M of cache, on MatrixXf sizes up to 2400, and could not see any significant impact of this offset. On a Nexus 5, the offset has a slight effect: values around 32 (times sizeof(float)) are the worst; anything else is equivalent, whether the current 64 (8*pk) or 0. So let's just go with 0. Note that we needed a fix anyway for not accounting for the value of RhsProgress; 0 nicely avoids the issue altogether.
* Fix my recent prefetch changes (Benoit Jacob, 2015-02-23)
  - The first prefetch is actually harmful on Haswell with FMA, but it is the most beneficial on ARM.
  - For the second prefetch, I mistakenly multiplied an offset of a scalar* pointer by sizeof(scalar). The old offset was 64; pk = 8, so 64 = pk*8, and this effectively restores the older offset. Actually, there were two prefetches here, one with offset 48 and one with offset 64; I could not confirm any benefit from the strange 48 offset on either the Haswell or my ARM device.
* Rotating kernel: avoid compiling anything outside of ARM (Benoit Jacob, 2015-02-18)
* Remove a newly introduced redundant typedef; sorry. (Benoit Jacob, 2015-02-18)
* bug #955: Implement a rotating kernel alternative in the 3px4 gebp path (Benoit Jacob, 2015-02-18)
  This is substantially faster on ARM, where it is important to minimize the number of loads. It is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is; opinions welcome. Eventually one could have a generic rotating kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it is not beneficial (it is even about 1% slower).
* Fixed template parameter. (Hauke Heibel, 2015-02-18)
* Merge (Gael Guennebaud, 2015-02-18)
* Clean up computeProductBlockingSizes a bit (use the Index type, remove the CEIL macro) (Gael Guennebaud, 2015-02-18)
* bug #958: Allow testing specific blocking sizes (Benoit Jacob, 2015-02-18)
  This is only a debugging/testing patch. It allows testing specific product blocking sizes, typically to study the impact on performance. Example usage:

      int testk, testm, testn;
      #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
      #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
      #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
      #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
      #include <Eigen/Core>
* Fix bug #945: work around an MSVC warning (Gael Guennebaud, 2015-02-18)
* bug #953: Fix prefetches in the 3px4 product kernel (Benoit Jacob, 2015-02-13)
  This gives a 10% speedup on the Nexus 4 and the Nexus 5.
* Pulled the latest changes from the trunk (Benoit Steiner, 2015-02-06)
* bug #936, patch 1.5/3: Rename the _FUSED_ macros to _SINGLE_INSTRUCTION_, because that is what they are about (Benoit Jacob, 2015-01-31)
  "Fused" means "no intermediate rounding between the mul and the add, only one rounding at the end". What we are concerned about here is instead whether a temporary register is needed, i.e. whether the MUL and the ADD are separate instructions. Concretely, on ARM NEON, a single-instruction mul-add is always available (VMLA), but a true fused mul-add is only available with VFPv4 (VFMA).
* bug #936, patch 1/3: Some cleanup and renaming for consistency. (Benoit Jacob, 2015-01-30)
* bug #935: Add asm comments in the GEBP kernels to work around a bug in both GCC and Clang on ARM/NEON, whereby they spill registers and severely harm performance (Benoit Jacob, 2015-01-30)
  The asm comments make a difference because they prevent the compiler from reordering code across these boundaries; such reordering extends the lifetime of local variables and increases register pressure in this register-tight code. (See the sketch below.)
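  A minimal sketch of the technique for GCC/Clang, using a hypothetical ASM_COMMENT macro rather than Eigen's actual helper; an inline asm statement that only emits a comment is opaque to the optimizer, so code cannot be reordered across it:

      // Hypothetical macro; Eigen's own helper may differ in name and definition.
      #define ASM_COMMENT(text) __asm__("#" text)

      float dot4accum(const float* a, const float* b, int n) {
        float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
        for (int i = 0; i + 4 <= n; i += 4) {
          ASM_COMMENT("begin unrolled body");  // barrier: keeps the compiler from
          acc0 += a[i + 0] * b[i + 0];         // hoisting/sinking work across the
          acc1 += a[i + 1] * b[i + 1];         // unrolled body, which would lengthen
          acc2 += a[i + 2] * b[i + 2];         // live ranges and force spills
          acc3 += a[i + 3] * b[i + 3];
          ASM_COMMENT("end unrolled body");
        }
        return acc0 + acc1 + acc2 + acc3;
      }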
* Made the blocking computation aware of the L3 cache (Benoit Steiner, 2014-10-15)
  Also optimized the blocking parameters to take into account the number of threads used for a computation.
* Generalized the gebp APIs (Benoit Steiner, 2014-10-02)
* Initial VSX commit (Konstantinos Margaritis, 2014-08-29)