path: root/Eigen/src/Core/products/GeneralBlockPanelKernel.h
Commit message (Author, Date)
...
| * Fix product for custom complex type. (conjugation was ignored) (Gael Guennebaud, 2016-09-14)
| |
| * Fix performance regression in dgemm introduced by changeset 5d51a7f12c69138ed2a43df240bdf27a5313f7ce (Gael Guennebaud, 2016-07-02)
| * Fix performance regression introduced in changeset e56aabf205a1e8f581dd8a46d7d46ce79c45e158. Register blocking sizes are better handled by the cache size heuristics. The current code introduced very small blocks, for instance for a 9x9 matrix, thus killing performance. (Gael Guennebaud, 2016-07-02)
| * Relax mixing-type constraints for binary coefficient-wise operators: (Gael Guennebaud, 2016-06-06)
| |   - Replace internal::scalar_product_traits<A,B> by Eigen::ScalarBinaryOpTraits<A,B,OP>
| |   - Remove the "functor_is_product_like" helper (it was pretty ugly)
| |   - Currently, OP is not used, but it is available to the user for fine-grained tuning
| |   - Currently, only the following operators have been generalized: *, /, +, -, =, *=, /=, +=, -=
| |   - TODO: generalize all other binary operators (comparisons, pow, etc.)
| |   - TODO: handle "scalar op array" operators (currently only * is handled)
| |   - TODO: move the handling of the "void" scalar type to ScalarBinaryOpTraits
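
The changeset above moves mixed-scalar result-type selection to Eigen::ScalarBinaryOpTraits. A minimal sketch of how a user specialization might look, assuming a hypothetical custom type MyScalar and the documented ReturnType member of Eigen 3.3's trait; treat the exact spelling as an assumption to be checked against the header:

    // Hypothetical example: allow MyScalar to be mixed with double in
    // coefficient-wise products by telling Eigen which result type to use.
    #include <Eigen/Core>
    #include <type_traits>

    struct MyScalar { double v; };
    inline MyScalar operator*(const MyScalar& a, double b) { return MyScalar{a.v * b}; }
    inline MyScalar operator*(double a, const MyScalar& b) { return MyScalar{a * b.v}; }

    namespace Eigen {
    // For any binary functor OP, MyScalar op double (and double op MyScalar) yields MyScalar.
    template<typename OP> struct ScalarBinaryOpTraits<MyScalar, double, OP> { typedef MyScalar ReturnType; };
    template<typename OP> struct ScalarBinaryOpTraits<double, MyScalar, OP> { typedef MyScalar ReturnType; };
    }

    int main() {
      // The default OP is the product functor, so this queries the product result type.
      typedef Eigen::ScalarBinaryOpTraits<MyScalar, double>::ReturnType R;
      static_assert(std::is_same<R, MyScalar>::value, "mixed product yields MyScalar");
    }
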
| * Remove the rotating kernel. It was only useful on some ARM CPUs (Qualcomm Krait) that are not as ubiquitous today as they were when I introduced it. (Benoit Jacob, 2016-05-24)
| * Don't optimize the processing of the last rows of a matrix matrix product in cases that violate the assumptions made by the optimized code path. (Benoit Steiner, 2016-05-23)
* | Pulled latest updates from upstream (Benoit Steiner, 2016-04-29)
|\|
| * Made the index type a template parameter to evaluateProductBlockingSizes. Use numext::mini and numext::maxi instead of std::min/std::max to compute blocking sizes. (Benoit Steiner, 2016-04-27)
| * Deleted extraneous comma. (Benoit Steiner, 2016-04-15)
| |
| * Improved the matrix multiplication blocking in the case where mr is not a power of 2 (e.g. on Haswell CPUs). (Benoit Steiner, 2016-04-15)
| * Added ability to access the cache sizes from the tensor devices (Benoit Steiner, 2016-04-14)
| |
* | Pull latest updates from upstream (Benoit Steiner, 2016-04-11)
|\|
| * bug #1161: fix division by zero for huge scalar types (Gael Guennebaud, 2016-02-03)
| |
* | Updated the matrix multiplication code to make it compile with AVX512 enabled. (Benoit Steiner, 2016-02-01)
| |
| * Make sure that block sizes are smaller than input matrix sizes. (Gael Guennebaud, 2016-01-26)
| |
* | Disabled part of the matrix matrix peeling code that's incompatible with 512-bit registers. (Benoit Steiner, 2015-12-21)
|/
* Fix degenerate cases in syrk and trsm (Gael Guennebaud, 2015-11-30)
|
* Use a class constructor to initialize CPU cache sizes (Chris Jones, 2015-11-20)
|   Using a static instance of a class to initialize the values for the CPU cache sizes guarantees thread-safe initialization of the values when using C++11. Therefore, under C++11 it is no longer necessary to call Eigen::initParallel() before calling any Eigen functions on different threads.
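
A minimal sketch (not Eigen's actual code) of the pattern the commit above relies on: under C++11, a function-local static is constructed exactly once even when several threads reach it concurrently, so a class constructor can safely populate the cache-size values on first use. The names and placeholder values below are assumptions for illustration:

    #include <cstddef>

    struct CacheSizes {
      std::size_t l1, l2, l3;
      CacheSizes() {
        // Placeholder values; real code would query CPUID or the OS.
        l1 = 32 * 1024;
        l2 = 256 * 1024;
        l3 = 8 * 1024 * 1024;
      }
    };

    inline const CacheSizes& cacheSizes() {
      // Constructed once, thread-safely, on first use (C++11 "magic statics").
      static const CacheSizes sizes;
      return sizes;
    }
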
* bug #1043: Avoid integer conversion sign warning (Christoph Hertzberg, 2015-08-19)
|
* Enable runtime stack alignment in gemm_blocking_space. (Gael Guennebaud, 2015-08-06)
|
* Abandon the blocking size lookup table approach: it was not performing as well in the real world as in microbenchmarks. (Benoit Jacob, 2015-05-19)
* Improved the blocking strategy to speedup multithreaded tensor contractions. (Benoit Steiner, 2015-04-09)
|
* add a note on bug #992 (Gael Guennebaud, 2015-04-08)
|
* bug #992: don't select a 3p GEMM path with non-vectorizable scalar types, this hits unsupported paths in symm/triangular products code (Benoit Jacob, 2015-04-07)
* Fix computeProductBlockingSizes with m==0, and add respective unit test. (Gael Guennebaud, 2015-03-31)
|
* Similar to cset 3589a9c115a892ea3ca5dac74d71a1526764cb38, also in 2px4 kernel: actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size, see comment. (Benoit Jacob, 2015-03-16)
* Fix bug in case where EIGEN_TEST_SPECIFIC_BLOCKING_SIZE is defined but false (Benoit Jacob, 2015-03-15)
|
* actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size, see comment. (Benoit Jacob, 2015-03-15)
* Refactor computeProductBlockingSizes to make room for the possibility of using lookup tables. (Benoit Jacob, 2015-03-15)
* organize a little our default cache sizes, and use a saner default L1 outside of x86 (10% faster on Nexus 5). (Benoit Jacob, 2015-03-13)
* Avoid underflow when blocking sizes are tuned manually. (Gael Guennebaud, 2015-03-06)
|
* Improve blocking heuristic: if the lhs fits within L1, then block on the rhs in L1 (allows keeping the packed rhs in L1). (Gael Guennebaud, 2015-03-06)
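
A rough sketch, under assumed names and a simplified cost model (this is not the actual Eigen heuristic), of the decision described above: when the packed lhs panel fits in L1, the remaining L1 capacity is spent on an rhs block so that the packed rhs also stays resident:

    #include <cstddef>

    // Hypothetical helper: pick the rhs blocking width for an m x depth lhs panel.
    std::size_t chooseRhsBlockCols(std::size_t m, std::size_t depth,
                                   std::size_t sizeofScalar, std::size_t l1) {
      if (depth == 0 || sizeofScalar == 0) return 1;
      const std::size_t lhsBytes = m * depth * sizeofScalar;
      if (lhsBytes < l1) {
        // Lhs fits in L1: keep an rhs block of nc columns resident in the leftover space.
        const std::size_t nc = (l1 - lhsBytes) / (depth * sizeofScalar);
        return nc > 0 ? nc : 1;
      }
      return 1;  // otherwise a different blocking strategy applies (not shown)
    }
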
* Improve product kernel: replace the previous dynamic loop-swapping strategy by a more general one: it consists in increasing the actual number of rows of the lhs's micro horizontal panel for small depth, such that the L1 cache is fully exploited. (Gael Guennebaud, 2015-03-06)
* Product optimization: implement a dynamic loop-swapping strategy to improve memory accesses to the destination matrix in the case of K-rank-update-like products, i.e., for products of the kind: "large x small" * "small x large". (Gael Guennebaud, 2015-03-05)
* Fix asm comments in 1px1 kernel (Benoit Jacob, 2015-03-03)
|
* Add a benchmark-default-sizes action to benchmark-blocking-sizes.cpp (Benoit Jacob, 2015-03-03)
|
* Increase the unit-test L1 cache size to ensure we are doing at least 2 peeled loops within the product kernel. (Gael Guennebaud, 2015-02-27)
* Re-enable detection of min/max parentheses protection, and re-enable the mpreal_support unit test. (Gael Guennebaud, 2015-02-27)
* Reimplement the selection between rotating and non-rotating kernels using templates instead of macros and if()'s. That was needed to fix the build of unit tests on ARM, which I had broken. My bad for not testing earlier. (Benoit Jacob, 2015-02-27)
* Make sure that the block size computation is tested by our unit test. (Gael Guennebaud, 2015-02-26)
|
* Implement a more generic blocking-size selection algorithm. See explanations inline. It performs extremely well on Haswell. The main issue is to reliably and quickly find the actual cache size to be used for our 2nd level of blocking, that is: max(l2, l3/nb_core_sharing_l3). (Gael Guennebaud, 2015-02-26)
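
The max(l2, l3/nb_core_sharing_l3) rule above is easy to state in code; a minimal sketch with assumed names:

    #include <algorithm>
    #include <cstddef>

    // Effective per-core cache available for the 2nd level of blocking:
    // the private L2, or this core's share of an L3 split among the cores
    // that share it, whichever is larger (assumes nbCoresSharingL3 >= 1).
    std::size_t effectiveL2ForBlocking(std::size_t l2, std::size_t l3,
                                       std::size_t nbCoresSharingL3) {
      return std::max(l2, l3 / nbCoresSharingL3);
    }

    // Example: l2 = 256 KB and l3 = 8 MB shared by 4 cores gives 2 MB.
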
* Fix typos in block-size testing code, and set peeling on k to 8. (Gael Guennebaud, 2015-02-26)
|
* So I extensively measured the impact of the offset in this prefetch. (Benoit Jacob, 2015-02-25)
|   I tried offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes). On x86, I tested a Sandy Bridge with AVX with 12M cache and a Haswell with AVX+FMA with 6M cache on MatrixXf sizes up to 2400. I could not see any significant impact of this offset. On Nexus 5, the offset has a slight effect: values around 32 (times sizeof float) are worst. Anything else is the same: the current 64 (8*pk), or... 0. So let's just go with 0! Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether!
* Fix my recent prefetch changes: (Benoit Jacob, 2015-02-23)
|   - The first prefetch is actually harmful on Haswell with FMA, but it is the most beneficial on ARM.
|   - The second prefetch... I was very stupid and multiplied by sizeof(scalar) an offset of a scalar* pointer. The old offset was 64; pk = 8, so 64 = pk*8. So this effectively restores the older offset. Actually, there were two prefetches here, one with offset 48 and one with offset 64. I could not confirm any benefit from this strange 48 offset on either the Haswell or my ARM device.
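
A hypothetical illustration (names and offsets made up, x86 only) of the pointer-arithmetic slip described above: with a Scalar* pointer, adding an offset already advances in units of Scalar, so additionally multiplying the offset by sizeof(Scalar) prefetches a much farther address than intended:

    #include <xmmintrin.h>  // _mm_prefetch (x86)

    typedef float Scalar;

    inline void prefetchAhead(const Scalar* p, int offset) {
      // Wrong: the offset is scaled twice, once by pointer arithmetic and once by sizeof(Scalar).
      // _mm_prefetch(reinterpret_cast<const char*>(p + offset * sizeof(Scalar)), _MM_HINT_T0);

      // Intended: advance by `offset` scalars, i.e. offset * sizeof(Scalar) bytes.
      _mm_prefetch(reinterpret_cast<const char*>(p + offset), _MM_HINT_T0);
    }
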
* rotating kernel: avoid compiling anything outside of ARM (Benoit Jacob, 2015-02-18)
|
* remove a newly introduced redundant typedef - sorry. (Benoit Jacob, 2015-02-18)
|
* bug #955 - Implement a rotating kernel alternative in the 3px4 gebp path (Benoit Jacob, 2015-02-18)
|   This is substantially faster on ARM, where it's important to minimize the number of loads. This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome. Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (even about 1% slower).
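
A scalar sketch of the rotating idea (not the actual SIMD kernel; the lane bookkeeping is only illustrative): for a 4x4 outer-product accumulation c[i][j] += a[i]*b[j], instead of broadcasting each b[j] separately, b is loaded once and rotated by one lane between the four multiply-add rounds, and the accumulators are de-rotated at the end:

    #include <cstdio>

    int main() {
      const float a[4] = {1, 2, 3, 4};
      const float b[4] = {5, 6, 7, 8};
      float acc[4][4] = {};  // acc[r][i] accumulates a[i] * b[(i + r) % 4]

      for (int r = 0; r < 4; ++r)      // 4 rounds; b is rotated by one lane per round
        for (int i = 0; i < 4; ++i)    // one lane-wise multiply-add per round
          acc[r][i] += a[i] * b[(i + r) % 4];

      // De-rotate: c[i][j] = a[i] * b[j] ends up in acc[(j - i + 4) % 4][i].
      for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
          std::printf("c[%d][%d] = %g\n", i, j, acc[(j - i + 4) % 4][i]);
      return 0;
    }
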
* Fixed template parameter. (Hauke Heibel, 2015-02-18)
|
* merge (Gael Guennebaud, 2015-02-18)
|\
* | Clean a bit computeProductBlockingSizes (use Index type, remove CEIL macro) (Gael Guennebaud, 2015-02-18)
| |