Commit message | Author | Age
5d51a7f12c69138ed2a43df240bdf27a5313f7ce
e56aabf205a1e8f581dd8a46d7d46ce79c45e158
Register blocking sizes are better handled by the cache size heuristics.
The current code introduced very small blocks, for instance for a 9x9 matrix,
thus killing performance.
- Replace internal::scalar_product_traits<A,B> by Eigen::ScalarBinaryOpTraits<A,B,OP>
- Remove the "functor_is_product_like" helper (it was pretty ugly)
- Currently, OP is not used, but it is available to the user for fine-grained tuning
- Currently, only the following operators have been generalized: *, /, +, -, =, *=, /=, +=, -=
- TODO: generalize all other binary operators (comparisons, pow, etc.)
- TODO: handle "scalar op array" operators (currently only * is handled)
- TODO: move the handling of the "void" scalar type to ScalarBinaryOpTraits
Krait) that are not as ubiquitous today as they were when I introduced it.
cases that violate the assumptions made by the optimized code path.
|
|\| |
|
| |
| |
| |
| | |
Use numext::mini and numext::maxi instead of std::min/std::max to compute blocking sizes.
power of 2 (e.g. on Haswell CPUs).
bit registers
Using a static instance of a class to initialize the values for
the CPU cache sizes guarantees thread-safe initialization of those
values when using C++11. Therefore, under C++11 it is no longer
necessary to call Eigen::initParallel() before calling any Eigen
functions on different threads.
world as in microbenchmark.
this hits unsupported paths in symm/triangular products code
, also in 2px4 kernel: actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size, see comment
consistent with the known L1 cache size, see comment
using lookup tables
outside of x86 (10% faster on Nexus 5)
in L1 (allows keeping the packed rhs in L1)
by a more general one:
It consists of increasing the actual number of rows of the lhs's micro horizontal panel for small depths, so that the L1 cache is fully exploited.
memory accesses to the destination matrix in the case of rank-K-update-like products, i.e., for products of the kind: "large x small" * "small x large"
loop within product kernel.
mpreal_support unit test.
using templates instead of macros and if()'s.
That was needed to fix the build of unit tests on ARM, which I had
broken. My bad for not testing earlier.
inlines.
It performs extremely well on Haswell. The main issue is to reliably and quickly find the
actual cache size to be used for our second level of blocking, that is: max(l2, l3/nb_core_sharing_l3)
offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes).
On x86, I tested a Sandy Bridge with AVX and 12M of cache and a Haswell with AVX+FMA and 6M of cache, on MatrixXf sizes up to 2400.
I could not see any significant impact of this offset.
On the Nexus 5, the offset has a slight effect: values around 32 (times sizeof(float)) are the worst. Anything else is the same: the current 64 (8*pk), or... 0.
So let's just go with 0!
Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether!
- the first prefetch is actually harmful on Haswell with FMA,
but it is the most beneficial on ARM.
- the second prefetch... I was very stupid and multiplied an offset
of a scalar* pointer by sizeof(scalar). The old offset was 64; pk = 8, so 64 = pk*8.
So this effectively restores the older offset. Actually, there were
two prefetches here, one with offset 48 and one with offset 64. I could not
confirm any benefit from this strange 48 offset on either the Haswell or
my ARM device.
This is substantially faster on ARM, where it's important to minimize the number of loads.
This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome.
Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (it's even about 1% slower).