| Commit message | Author | Age |
|
Use numext::mini and numext::maxi instead of std::min/std::max to compute blocking sizes.
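A minimal sketch of the substitution (the function and variable names here are illustrative, not the actual Eigen blocking code):

  #include <Eigen/Core>
  #include <cstddef>

  // Clamp a requested depth to [min_kc, max_kc] using Eigen's own helpers
  // instead of std::min/std::max.
  std::ptrdiff_t clamp_blocking(std::ptrdiff_t depth,
                                std::ptrdiff_t min_kc,
                                std::ptrdiff_t max_kc)
  {
    std::ptrdiff_t k = Eigen::numext::mini(depth, max_kc);  // k = min(depth, max_kc)
    return Eigen::numext::maxi(k, min_kc);                  // never below min_kc
  }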
|
power of 2 (e.g. on Haswell CPUs).
|
Using a static class instance to initialize the CPU cache size values
guarantees thread-safe initialization of those values when using C++11.
Therefore, under C++11 it is no longer necessary to call
Eigen::initParallel() before calling any Eigen functions from
different threads.
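The guarantee relies on C++11's thread-safe initialization of function-local statics ("magic statics"). A minimal sketch of the idiom, with hypothetical names and placeholder sizes (the real cache query lives inside Eigen):

  #include <cstddef>

  struct CacheSizes {
    std::ptrdiff_t l1, l2, l3;
    CacheSizes()
    {
      // Query CPUID / the OS here; placeholder constants for the sketch.
      l1 = 32 * 1024; l2 = 256 * 1024; l3 = 6 * 1024 * 1024;
    }
  };

  const CacheSizes& cacheSizes()
  {
    // C++11 runs this constructor exactly once, even if several threads
    // reach this line concurrently, so no prior initParallel()-style
    // call is required.
    static CacheSizes sizes;
    return sizes;
  }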
|
world as in microbenchmark.
|
this hits unsupported paths in symm/triangular products code
|
, also in 2px4 kernel: actual_panel_rows computation should always be resilient to parameters not consistent with the known L1 cache size, see comment
|
consistent with the known L1 cache size, see comment

using lookup tables
|
outside of x86 (10% faster on Nexus 5)
|
in L1 (allows keeping the packed rhs in L1)

by a more general one:
It consists of increasing the actual number of rows of the lhs's micro
horizontal panel for small depths, so that the L1 cache is fully exploited.
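Roughly, the idea: when the depth (k) is small, one register block of lhs rows occupies only a fraction of L1, so more rows can be swept per pass. A conceptual sketch, not Eigen's actual formula; the names, the clamp, and the rounding to a multiple of mr are illustrative:

  #include <algorithm>
  #include <cstddef>

  // How many lhs rows of a depth-k micro panel fit in L1, at least one
  // register block (mr rows) and rounded down to a multiple of mr.
  std::ptrdiff_t actual_panel_rows(std::ptrdiff_t depth, std::ptrdiff_t mr,
                                   std::ptrdiff_t scalar_size,
                                   std::ptrdiff_t l1_bytes)
  {
    std::ptrdiff_t bytes_per_row = std::max<std::ptrdiff_t>(1, depth * scalar_size);
    std::ptrdiff_t rows = (l1_bytes / bytes_per_row) / mr * mr;
    // Stay resilient to parameters that are not consistent with the known
    // L1 size: never return fewer than mr rows.
    return std::max(rows, mr);
  }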
|
|
|
|
memory accesses to the destination matrix in the case of K-rank-update like products, i.e., for products of the kind: "large x small" * "small x large"
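For reference, a product of exactly that shape in Eigen (dimensions are arbitrary, chosen only to illustrate "large x small" times "small x large"):

  #include <Eigen/Dense>

  Eigen::MatrixXf rank_k_update_like()
  {
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(2000, 8);  // large x small
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(8, 2000);  // small x large
    return A * B;  // the shared dimension k = 8 is tiny, the result is large
  }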
|
loop within product kernel.
|
mpreal_support unit test.

using templates instead of macros and if()'s.
That was needed to fix the build of unit tests on ARM, which I had
broken. My bad for not testing earlier.
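A generic illustration of the technique, not the actual Eigen code: the variant is chosen through a template parameter and a specialization, so the unused branch is never even compiled; kHasFma is hard-coded only to keep the sketch self-contained.

  #include <cmath>

  // Primary template: generic fallback, separate mul and add.
  template <bool UseFma>
  struct MulAdd {
    static float run(float a, float b, float c) { return a * b + c; }
  };

  // Specialization: only instantiated when UseFma is true.
  template <>
  struct MulAdd<true> {
    static float run(float a, float b, float c) { return std::fma(a, b, c); }
  };

  // Would normally come from platform detection; hard-coded here.
  constexpr bool kHasFma = false;

  float kernel(float a, float b, float c)
  {
    // The choice is made at compile time: no macros, no runtime if().
    return MulAdd<kHasFma>::run(a, b, c);
  }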
|
inlines.
It performs extremely well on Haswell. The main issue is to reliably and quickly find the
actual cache size to be used for our 2nd level of blocking, that is: max(l2,l3/nb_core_sharing_l3)
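A sketch of that cache budget; the names are descriptive rather than Eigen's internals, and the example figures in the comment reuse the Haswell numbers mentioned elsewhere in this log with an assumed core count:

  #include <algorithm>
  #include <cstddef>

  // Effective per-core cache for the 2nd level of blocking: the private L2,
  // or this core's fair share of a shared L3, whichever is larger.
  // For instance, with a 6M L3 shared by 4 cores and a 256K L2:
  // max(256K, 6M/4) = 1.5M.
  std::ptrdiff_t blocking_cache_budget(std::ptrdiff_t l2_bytes,
                                       std::ptrdiff_t l3_bytes,
                                       std::ptrdiff_t cores_sharing_l3)
  {
    return std::max(l2_bytes, l3_bytes / cores_sharing_l3);
  }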
|
offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes).
On x86, I tested a Sandy Bridge with AVX with 12M cache and a Haswell with AVX+FMA with 6M cache on MatrixXf sizes up to 2400.
I could not see any significant impact of this offset.
On Nexus 5, the offset has a slight effect: values around 32 (times sizeof float) are the worst. Anything else is the same: the current 64 (8*pk), or... 0.
So let's just go with 0!
Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether!
|
- the first prefetch is actually harmful on Haswell with FMA,
but it is the most beneficial on ARM.
- the second prefetch... I was very stupid and multiplied by sizeof(scalar)
an offset of a scalar* pointer. The old offset was 64; pk = 8, so 64 = pk*8.
So this effectively restores the older offset. Actually, there were
two prefetches here, one with offset 48 and one with offset 64. I could not
confirm any benefit from this strange 48 offset on either the Haswell or
my ARM device.
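To make the units pitfall concrete: arithmetic on a scalar* is already in units of scalars, so also multiplying the offset by sizeof(scalar) over-shoots by that factor. A small sketch using the GCC/Clang builtin (Eigen has its own internal prefetch wrapper; the offset of 64 floats is taken from the message above):

  // 'ptr' is a float*, so ptr + 64 is 64 floats (256 bytes) ahead.
  void prefetch_ahead(const float* ptr)
  {
    __builtin_prefetch(ptr + 64);  // intended: 64 elements ahead
    // Buggy variant: the offset ends up scaled by sizeof(float) twice,
    // prefetching 1024 bytes ahead instead of 256:
    // __builtin_prefetch(ptr + 64 * sizeof(float));
  }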
|
This is substantially faster on ARM, where it's important to minimize the number of loads.
This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome.
Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (even about 1% slower).
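The load-saving idea in rough outline: instead of broadcasting each of the four rhs coefficients with its own load, load the 4-wide rhs packet once and rotate it between the four multiply-accumulate steps, so every accumulator lane eventually meets every coefficient. A purely conceptual scalar sketch; the real kernel does this with SIMD packets (e.g. NEON vext) and de-rotates the accumulators at the end:

  #include <algorithm>
  #include <array>

  // One depth-step of a 4x4 outer-product update c(i,j) += lhs(i) * rhs(j),
  // using a single rhs load plus rotations instead of four broadcasts.
  // acc[r][lane] holds c(lane, (lane + r) % 4); a final de-rotation pass
  // (not shown) moves the accumulators back into column order.
  void rotated_update(std::array<std::array<float, 4>, 4>& acc,
                      const std::array<float, 4>& lhs,
                      std::array<float, 4> rhs)
  {
    for (int r = 0; r < 4; ++r) {
      for (int lane = 0; lane < 4; ++lane)
        acc[r][lane] += lhs[lane] * rhs[lane];  // rhs currently rotated by r
      std::rotate(rhs.begin(), rhs.begin() + 1, rhs.end());  // vext-style rotate
    }
  }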
|
This is only a debugging/testing patch. It allows testing specific
product blocking sizes, typically to study the impact on performance.
Example usage:
int testk, testm, testn;
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
#include <Eigen/Core>
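Fleshing that out into a minimal self-contained harness; the macro names come from the message above, while the matrix sizes and the particular k/m/n values are arbitrary, and timing would be done externally:

  int testk, testm, testn;
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
  #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
  #include <Eigen/Core>
  #include <iostream>

  int main()
  {
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(1024, 1024);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(1024, 1024);
    testk = 256; testm = 64; testn = 1024;  // blocking sizes under test
    Eigen::MatrixXf c = a * b;
    std::cout << c.norm() << std::endl;     // keep the product alive
    return 0;
  }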
|
This gives a 10% speedup on Nexus 4 and on Nexus 5.

because this is what they are about. "Fused" means "no intermediate rounding
between the mul and the add, only one rounding at the end". Instead,
what we are concerned about here is whether a temporary register is needed,
i.e. whether the MUL and ADD are separate instructions.
Concretely, on ARM NEON, a single-instruction mul-add is always available: VMLA.
But a true fused mul-add is only available on VFPv4: VFMA.
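The rounding distinction is easy to see in plain C++: std::fma rounds once, while a*b + c rounds the product first. The values below are chosen only so the difference is visible in double precision; compile with -ffp-contract=off so the compiler does not itself fuse the first expression:

  #include <cfloat>
  #include <cmath>
  #include <iostream>

  int main()
  {
    double x = 1.0 + DBL_EPSILON;           // 1 + 2^-52
    double y = -(1.0 + 2.0 * DBL_EPSILON);  // -(1 + 2^-51)

    double separate = x * x + y;            // product rounded first -> exactly 0
    double fused    = std::fma(x, x, y);    // one final rounding     -> 2^-104

    std::cout << separate << " vs " << fused << std::endl;
    return 0;
  }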
|
in both GCC and Clang on ARM/NEON, whereby they spill registers,
severely harming performance. The reason why the asm comments
make a difference is that they prevent the compiler from
reordering code across these boundaries; such reordering would
extend the lifetime of local variables and increase register
pressure on this register-tight code.
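The "asm comment" trick is just an inline asm statement that the compiler cannot analyze, so in practice it will not move instructions across it. A minimal sketch; Eigen has a macro of this kind (EIGEN_ASM_COMMENT), but the name and the dot-product body below are only illustrative:

  // Emits a comment into the generated assembly; the statement is opaque to
  // the compiler, which keeps it from reordering surrounding code across it.
  #define ASM_SCHEDULING_BARRIER(TEXT) __asm__("#" TEXT)

  float dot4(const float* a, const float* b)
  {
    float acc0 = a[0] * b[0];
    float acc1 = a[1] * b[1];
    ASM_SCHEDULING_BARRIER("gebp: end of first half");
    acc0 += a[2] * b[2];
    acc1 += a[3] * b[3];
    return acc0 + acc1;
  }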

Also optimized the blocking parameters to take into account the number of threads used for a computation.
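For context, Eigen exposes the thread count through setNbThreads/nbThreads (effective when compiled with OpenMP), and the blocking can then be derived per thread. A small usage sketch; the matrix sizes and the thread count are arbitrary:

  #include <Eigen/Dense>
  #include <iostream>

  int main()
  {
    Eigen::setNbThreads(4);  // blocking parameters can account for 4 threads
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(3000, 3000);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(3000, 3000);
    Eigen::MatrixXf c = a * b;
    std::cout << Eigen::nbThreads() << " threads, norm = " << c.norm() << std::endl;
    return 0;
  }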