eigen - C++ library for linear algebra

Commit message	Author	Age
*	Implement AVX512 vectorization of std::complex<float/double>	Gael Guennebaud	2018-12-06
\|
*	do not read buffers out of bounds -- load only the 4 bytes we know exist ↵	Benoit Jacob	2018-11-27
\| \| \| \|	here. Could also have done a vld1_lane_f32 but doing so here, without the overhead of initializing the unused lane, would have triggered use-of-uninitialized-value errors in tools such as ASan. Note that this code is sub-optimal before or after this change: we should be reading either 2 or 4 float32 values per load-instruction (2 for ARM in-order cores with an affinity for 8-byte loads; 4 for ARM out-of-order cores able to dual-issue 16-byte load instructions with arithmetic instructions). Before or after this patch, we are only loading 4 bytes of useful data here (even if before this patch, we were technically loading 8, only to use the first 4).
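The two constraints above (touch only the 4 bytes that are known to exist, and leave no lane uninitialized) can be illustrated with a portable sketch. It assumes a hypothetical `Float2` type standing in for a NEON `float32x2_t` and a hypothetical `load_one_broadcast` helper. Broadcasting the single value into both lanes is one way to meet both requirements; whether the patch does exactly this is not stated in the message.

```cpp
#include <array>
#include <cassert>

// Hypothetical portable stand-in for a 2 x float32 NEON register (float32x2_t).
using Float2 = std::array<float, 2>;

// Reads a single float (4 bytes on ARM) and duplicates it into both lanes:
// no memory past *p is read, and neither lane is left uninitialized.
Float2 load_one_broadcast(const float* p) {
  assert(p != nullptr);
  const float v = *p;
  return Float2{{v, v}};
}
```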
*	fix the build on 64-bit ARM when NEON is disabled	Benoit Jacob	2018-11-27
\|
*	bug #1624: improve matrix-matrix product on ARM 64, 20% speedup	Gael Guennebaud	2018-11-23
\|
*	Vectorize row-by-row gebp loop iterations on 16 packets as well	Gustavo Lima Chaves	2018-11-06
\| \| \| \| \|	Signed-off-by: Gustavo Lima Chaves <gustavo.lima.chaves@intel.com> Signed-off-by: Mark D. Ryan <mark.d.ryan@intel.com>
*	Fix regression introduced by the previous fix for AVX512.	Gael Guennebaud	2018-09-20
\| \| \| \|	It broke the complex-complex case on SSE.
*	Fix gebp kernel for real+complex in case only reals are vectorized (e.g., ↵	Gael Guennebaud	2018-09-20
\| \| \| \| \| \|	AVX512). This commit also removes "half-packet" from data-mappers: it was not used and conceptually broken anyways.
*	bug #1578: Improve prefetching in matrix multiplication on MIPS.	Alexey Frunze	2018-07-24
\|
*	bug #1572: use c++11 atomic instead of volatile if c++11 is available, and ↵	Gael Guennebaud	2018-07-17
\| \| \| \|	disable multi-threaded GEMM on non-x86 without c++11.
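As a rough illustration of why `std::atomic` is preferred over `volatile` for synchronizing threads, here is a minimal publish/consume sketch in the spirit of a shared packed block in a parallel GEMM. The function `sum_after_publish` and its data are hypothetical; Eigen's actual synchronization code is not reproduced here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical publish/consume pattern: the calling thread fills ("packs") a
// shared buffer and then publishes it; worker threads wait for the flag, then read.
// A `volatile bool` flag gives no inter-thread ordering guarantee in C++11;
// an atomic flag with release/acquire ordering does.
long sum_after_publish(int num_workers) {
  assert(num_workers >= 0);
  std::vector<long> packed(64, 0);
  std::atomic<bool> ready{false};
  std::atomic<long> total{0};

  std::vector<std::thread> workers;
  for (int t = 0; t < num_workers; ++t) {
    workers.emplace_back([&] {
      // Acquire load: once it returns true, every write made before the
      // matching release store is visible, so reading `packed` is race-free.
      while (!ready.load(std::memory_order_acquire)) std::this_thread::yield();
      long local = 0;
      for (long v : packed) local += v;
      total.fetch_add(local, std::memory_order_relaxed);
    });
  }
  for (std::size_t i = 0; i < packed.size(); ++i) packed[i] = static_cast<long>(i);
  ready.store(true, std::memory_order_release);  // publish the packed block
  for (std::thread& w : workers) w.join();
  return total.load();  // join() makes all worker updates visible here
}
```

Each worker sums 0 + 1 + ... + 63, so the result is 2016 times the number of workers.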
*	Updates corresponding to the latest round of PR feedback	Deven Desai	2018-07-11
\|	The major changes are:
\|	1. Moving CUDA/PacketMath.h to GPU/PacketMath.h
\|	2. Moving CUDA/MathFunctions.h to GPU/MathFunction.h
\|	3. Moving CUDA/CudaSpecialFunctions.h to GPU/GpuSpecialFunctions.h
\|	The above three changes effectively enable the Eigen "Packet" layer for the HIP platform.
\|	4. Merging the "hip_basic" and "cuda_basic" unit tests into one ("gpu_basic")
\|	5. Updating the "EIGEN_DEVICE_FUNC" marking in some places
\|	The change has been tested on the HIP and CUDA platforms.
*	Skip null numerators in triangular-vector-solve (as in BLAS TRSV).	Gael Guennebaud	2018-07-09
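The BLAS TRSV behavior referred to above can be sketched with a column-oriented forward substitution: when the current right-hand-side entry is zero, both the division by the diagonal and the update of the remaining entries can be skipped, since (for a nonzero, finite diagonal) they would only subtract zeros. This is a plain C++ sketch with a hypothetical `lower_trsv_colmajor` function, not Eigen's kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves L * x = b in place (x holds b on entry) for a lower-triangular,
// column-major n x n matrix L (element (i,j) stored at L[i + j*n]).
// As in reference BLAS TRSV, a zero "numerator" x[j] skips both the
// division and the column update.
void lower_trsv_colmajor(const std::vector<double>& L, std::vector<double>& x) {
  const std::size_t n = x.size();
  assert(L.size() == n * n);
  for (std::size_t j = 0; j < n; ++j) {
    if (x[j] == 0.0) continue;        // nothing to propagate from column j
    x[j] /= L[j + j * n];             // divide by the diagonal entry L(j,j)
    const double xj = x[j];
    for (std::size_t i = j + 1; i < n; ++i)
      x[i] -= xj * L[i + j * n];      // x(i) -= x(j) * L(i,j)
  }
}
```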
\|
*	Fix legitimate "declaration shadows a typedef" warning	Gael Guennebaud	2018-07-09
\|
*	Extend CUDA support to matrix inversion and selfadjointeigensolver	Andrea Bocci	2018-06-11
\|
*	bug #1562: optimize evaluation of small products of the form s*A*B by ↵	Gael Guennebaud	2018-07-02
\| \| \| \|	rewriting them as: s*(A.lazyProduct(B)) to save a costly temporary. Measured speedup from 2x to 5x...
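The rewrite can be illustrated without Eigen: evaluating `s*A*B` left to right first materializes the temporary `s*A`, while `s*(A*B)` computed coefficient by coefficient needs no buffer beyond the result. The sketch below uses plain row-major `std::vector<double>` storage and hypothetical function names; in Eigen itself the rewritten form is `s*(A.lazyProduct(B))`, as the commit says.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major storage; dimensions passed separately

// Naive left-to-right form: materializes T = s*A (an extra m*k buffer), then T*B.
Mat scaled_product_with_temp(double s, const Mat& A, const Mat& B,
                             std::size_t m, std::size_t k, std::size_t n) {
  assert(A.size() == m * k && B.size() == k * n);
  Mat T(A.size());
  for (std::size_t i = 0; i < A.size(); ++i) T[i] = s * A[i];
  Mat C(m * n, 0.0);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t p = 0; p < k; ++p) C[i * n + j] += T[i * k + p] * B[p * n + j];
  return C;
}

// Rewritten form s*(A*B), evaluated coefficient by coefficient ("lazily"):
// no temporary for s*A; the scalar is applied once per output coefficient.
Mat scaled_product_lazy(double s, const Mat& A, const Mat& B,
                        std::size_t m, std::size_t k, std::size_t n) {
  assert(A.size() == m * k && B.size() == k * n);
  Mat C(m * n);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      double acc = 0.0;
      for (std::size_t p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = s * acc;
    }
  return C;
}
```

For small operands, skipping the allocation and the extra pass over `s*A` is what the lazy form saves.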
*	Fix typos found using codespell	Gael Guennebaud	2018-06-07
\|
*	Fix "suggest parentheses around comparison" warning	Christoph Hertzberg	2018-05-15
\|
*	Rename predux_downto4 to be more accurate on its semantic.	Gael Guennebaud	2018-04-03
\|
*	Misc. source and comment typos	luz.paz	2018-03-11
\| \| \| \|	Found using `codespell` and `grep` from downstream FreeCAD
*	bug #1517: fix triangular product with unit diagonal and nested scaling ↵	Gael Guennebaud	2018-02-09
\| \| \| \|	factor: (s*A).triangularView<UpperUnit>()*B
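The subtlety in `(s*A).triangularView<UpperUnit>()*B` is that the unit diagonal is implicit and therefore not scaled by `s`, so `s` cannot simply be factored out of the whole product. A plain C++ sketch of the matrix-vector case (hypothetical functions, row-major storage) shows the difference; that this is precisely the mistake fixed for bug #1517 is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = T * x, where T is the unit-upper-triangular view of (s*A): the strictly
// upper part is scaled by s, but the implicit unit diagonal is not.
// A is n x n, row-major; its diagonal and lower part are never read.
std::vector<double> unit_upper_scaled_times(double s, const std::vector<double>& A,
                                            const std::vector<double>& x) {
  const std::size_t n = x.size();
  assert(A.size() == n * n);
  std::vector<double> y(n);
  for (std::size_t i = 0; i < n; ++i) {
    double acc = 0.0;
    for (std::size_t j = i + 1; j < n; ++j) acc += A[i * n + j] * x[j];
    y[i] = x[i] + s * acc;  // unit diagonal contributes x[i], unscaled
  }
  return y;
}

// Incorrect shortcut: pulling s out of the whole product also scales the unit diagonal.
std::vector<double> unit_upper_scaled_times_wrong(double s, const std::vector<double>& A,
                                                  const std::vector<double>& x) {
  std::vector<double> y = unit_upper_scaled_times(1.0, A, x);
  for (double& v : y) v *= s;
  return y;
}
```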
*	Make the threshold from gemm to coeff-based-product configurable, and add ↵	Gael Guennebaud	2017-08-24
\| \| \| \|	some explanations.
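A sketch of such a size-based dispatch, assuming the heuristic compares the sum of the product dimensions against a compile-time threshold that users can override. The macro `MY_GEMM_TO_COEFFBASED_THRESHOLD`, its default of 20, and the function `choose_kernel` are hypothetical and not taken from Eigen.

```cpp
#include <cstddef>

// Hypothetical compile-time knob, overridable with e.g.
// -DMY_GEMM_TO_COEFFBASED_THRESHOLD=32. Name and default are illustrative.
#ifndef MY_GEMM_TO_COEFFBASED_THRESHOLD
#define MY_GEMM_TO_COEFFBASED_THRESHOLD 20
#endif

enum class ProductKernel { CoeffBased, Gemm };

// Chooses a kernel for an (m x k) * (k x n) product: very small products use a
// simple coefficient-based triple loop, while larger ones go to a blocked GEMM
// whose packing overhead is amortized over enough work.
ProductKernel choose_kernel(std::size_t m, std::size_t n, std::size_t k) {
  return (m + n + k) < MY_GEMM_TO_COEFFBASED_THRESHOLD ? ProductKernel::CoeffBased
                                                       : ProductKernel::Gemm;
}
```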
*	Merged in dtrebbien/eigen/patch-1 (pull request PR-312)	Gael Guennebaud	2017-08-22
\| \| \| \|	Work around a compilation error seen with nvcc V8.0.61
*	Fix support for MKL's BLAS when using MKL_DIRECT_CALL.	Gael Guennebaud	2017-08-17
\|
*	bug #1405: enable StrictlyLower/StrictlyUpper triangularView as the ↵	Gael Guennebaud	2017-06-09
\| \| \| \|	destination of matrix*matrix products.
*	Adjusted the EIGEN_DEVICE_FUNC qualifiers to make sure that:	Benoit Steiner	2017-03-01
\|	* they're used consistently between the declaration and the definition of a function
\|	* we avoid calling host only methods from host device methods.
*	Fix tracking of temporaries in unit tests	Gael Guennebaud	2017-02-19
\|
*	Improve multi-threading heuristic for matrix products with a small number of ↵	Gael Guennebaud	2017-02-07
\| \| \| \|	columns.
*	Use Index instead of size_t	Gael Guennebaud	2017-01-23
\|
*	Defer set-to-zero in triangular = product so that no aliasing issue occurs in ↵	Gael Guennebaud	2017-01-17
\| \| \| \| \| \| \|	the common case A.triangularView() = B*A.selfadjointView()*B.adjoint(), which used to work in 3.2.
*	bug #1365: fix another type mismatch warning	Gael Guennebaud	2016-12-28
\| \| \| \|	(sync is set from and compared to an Index)
*	bug #1369: fix type mismatch warning.	Gael Guennebaud	2016-12-28
\| \| \| \| \|	The values returned for the OpenMP thread id and number of threads are int, so let's use int instead of Index here.
*	Revert vec/y to vec*(1/y) in row-major TRSM:	Gael Guennebaud	2016-12-06
\|	- div is extremely costly
\|	- this is consistent with the column-major case
\|	- this is consistent with all other BLAS implementations
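The first argument (division is much more expensive than multiplication) amounts to computing the reciprocal once and multiplying by it. A minimal sketch with a hypothetical helper:

```cpp
#include <cassert>
#include <vector>

// Scales a row by 1/y: one division up front, then one multiplication per
// element instead of one division per element. The result can differ from
// e / y in the last bit, so this trades exact rounding for speed.
void scale_by_reciprocal(std::vector<double>& v, double y) {
  assert(y != 0.0);
  const double inv = 1.0 / y;
  for (double& e : v) e *= inv;
}
```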
*	Fix BLAS backend for symmetric rank K updates.	Gael Guennebaud	2016-12-06
\|
*	typo	Gael Guennebaud	2016-12-05
\|
*	Improve performance of row-major-dense-matrix * vector products for recent CPUs.	Gael Guennebaud	2016-12-05
\| \| \| \| \|	This revised version does not bother with aligned loads/stores, and rather processes 8 rows at once for better instruction pipelining.
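A scalar sketch of the row-major strategy: eight rows are processed per pass with independent accumulators, so the eight dot products do not depend on one another and each `x[j]` is loaded once for all of them; alignment is ignored. The function name and layout are assumptions, and the real kernel works on SIMD packets rather than scalars.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y += A * x for a row-major m x n matrix A (element (i,j) at A[i*n + j]).
// Eight rows per pass, each with its own accumulator; leftover rows one by one.
void gemv_rowmajor(const std::vector<double>& A, std::size_t m, std::size_t n,
                   const std::vector<double>& x, std::vector<double>& y) {
  assert(A.size() == m * n && x.size() == n && y.size() == m);
  std::size_t i = 0;
  for (; i + 8 <= m; i += 8) {
    double acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    for (std::size_t j = 0; j < n; ++j) {
      const double xj = x[j];  // loaded once, reused by all eight rows
      for (std::size_t r = 0; r < 8; ++r) acc[r] += A[(i + r) * n + j] * xj;
    }
    for (std::size_t r = 0; r < 8; ++r) y[i + r] += acc[r];
  }
  for (; i < m; ++i) {  // remaining rows
    double acc = 0.0;
    for (std::size_t j = 0; j < n; ++j) acc += A[i * n + j] * x[j];
    y[i] += acc;
  }
}
```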
*	Complete rewrite of column-major-matrix * vector product to deliver higher ↵	Gael Guennebaud	2016-12-03
\| \| \| \| \| \| \| \| \| \|	performance on modern CPUs. The previous code had been optimized for Intel Core2, for which unaligned loads/stores were prohibitively expensive. This new version exhibits much higher instruction independence (better pipelining) and explicitly leverages FMA. According to my benchmark, on Haswell this new kernel is always faster than the previous one, and sometimes even twice as fast. Even higher performance could be achieved with a better blocking size heuristic and, perhaps, with explicit prefetching. We should also check triangular product/solve to optimally exploit this new kernel (working on vertical panels of 4 columns is probably not optimal anymore).
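In the same spirit, a scalar sketch of a column-major kernel that consumes four columns per pass and uses `std::fma`. The four-column panel width echoes the remark about vertical panels of 4 columns, but the function is hypothetical and not vectorized.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y += A * x for a column-major m x n matrix A (element (i,j) at A[i + j*m]).
// Each pass loads and stores y[i] once per four columns, applies four fused
// multiply-adds, and iterations over i are independent of one another.
void gemv_colmajor(const std::vector<double>& A, std::size_t m, std::size_t n,
                   const std::vector<double>& x, std::vector<double>& y) {
  assert(A.size() == m * n && x.size() == n && y.size() == m);
  std::size_t j = 0;
  for (; j + 4 <= n; j += 4) {
    const double* c0 = A.data() + (j + 0) * m;
    const double* c1 = A.data() + (j + 1) * m;
    const double* c2 = A.data() + (j + 2) * m;
    const double* c3 = A.data() + (j + 3) * m;
    const double x0 = x[j], x1 = x[j + 1], x2 = x[j + 2], x3 = x[j + 3];
    for (std::size_t i = 0; i < m; ++i) {
      double t = y[i];
      t = std::fma(c0[i], x0, t);
      t = std::fma(c1[i], x1, t);
      t = std::fma(c2[i], x2, t);
      t = std::fma(c3[i], x3, t);
      y[i] = t;
    }
  }
  for (; j < n; ++j) {  // leftover columns, one at a time
    const double* c = A.data() + j * m;
    for (std::size_t i = 0; i < m; ++i) y[i] = std::fma(c[i], x[j], y[i]);
  }
}
```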
*	Fix misleading-indentation warnings.	Gael Guennebaud	2016-12-01
\|
*	Merged eigen/eigen into default	Benoit Steiner	2016-11-03
\|\
\| *	Fix previous merge.	Gael Guennebaud	2016-10-14
\| \|
* \|	Renamed predux_half into predux_downto4	Benoit Steiner	2016-10-06
\| \|
\| *	Add a simple cost model to prevent Eigen's parallel GEMM from using too many ↵	Rasmus Munk Larsen	2016-10-06
\| \|	threads when the inner dimension is small. Timing for square matrices is unchanged, but both CPU and Wall time are significantly improved for skinny matrices. The benchmarks below are for multiplying NxK * KxN matrices, with test names of the form BM_OuterishProd/N/K.
\| \|
\| \|	Improvements in Wall time:
\| \|	Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
\| \|	CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
\| \|	Benchmark                 Base (ns)   New (ns)  Improvement
\| \|	------------------------------------------------------------------
\| \|	BM_OuterishProd/64/1           3088       1610       +47.9%
\| \|	BM_OuterishProd/64/4           3562       2414       +32.2%
\| \|	BM_OuterishProd/64/32          8861       7815       +11.8%
\| \|	BM_OuterishProd/128/1         11363       6504       +42.8%
\| \|	BM_OuterishProd/128/4         11128       9794       +12.0%
\| \|	BM_OuterishProd/128/64        27691      27396        +1.1%
\| \|	BM_OuterishProd/256/1         33214      28123       +15.3%
\| \|	BM_OuterishProd/256/4         34312      36818        -7.3%
\| \|	BM_OuterishProd/256/128      174866     176398        -0.9%
\| \|	BM_OuterishProd/512/1       7963684     104224       +98.7%
\| \|	BM_OuterishProd/512/4       7987913     112867       +98.6%
\| \|	BM_OuterishProd/512/256     8198378    1306500       +84.1%
\| \|	BM_OuterishProd/1k/1        7356256     324432       +95.6%
\| \|	BM_OuterishProd/1k/4        8129616     331621       +95.9%
\| \|	BM_OuterishProd/1k/512     27265418    7517538       +72.4%
\| \|
\| \|	Improvements in CPU time:
\| \|	Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
\| \|	CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
\| \|	Benchmark                 Base (ns)   New (ns)  Improvement
\| \|	------------------------------------------------------------------
\| \|	BM_OuterishProd/64/1           6169       1608       +73.9%
\| \|	BM_OuterishProd/64/4           7117       2412       +66.1%
\| \|	BM_OuterishProd/64/32         17702      15616       +11.8%
\| \|	BM_OuterishProd/128/1         45415       6498       +85.7%
\| \|	BM_OuterishProd/128/4         44459       9786       +78.0%
\| \|	BM_OuterishProd/128/64       110657     109489        +1.1%
\| \|	BM_OuterishProd/256/1        265158      28101       +89.4%
\| \|	BM_OuterishProd/256/4        274234     183885       +32.9%
\| \|	BM_OuterishProd/256/128     1397160    1408776        -0.8%
\| \|	BM_OuterishProd/512/1      78947048     520703       +99.3%
\| \|	BM_OuterishProd/512/4      86955578    1349742       +98.4%
\| \|	BM_OuterishProd/512/256    74701613   15584661       +79.1%
\| \|	BM_OuterishProd/1k/1       78352601    3877911       +95.1%
\| \|	BM_OuterishProd/1k/4       78521643    3966221       +94.9%
\| \|	BM_OuterishProd/1k/512    258104736   89480530       +65.3%
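The idea of such a cost model (cap the thread count by the total amount of work, so that skinny products are not split into tiny tasks) can be sketched as follows. The work estimate `m*n*k`, the per-thread minimum of 50000, and the name `max_gemm_threads` are illustrative assumptions, not the committed code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Caps the number of threads for an (m x k) * (k x n) product by the total
// amount of work, so that each thread gets at least kMinWorkPerThread
// multiply-adds. The constant is illustrative, not a tuned value.
int max_gemm_threads(std::int64_t m, std::int64_t n, std::int64_t k, int available) {
  assert(available >= 1);
  const double work = static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
  const double kMinWorkPerThread = 50000.0;
  const std::int64_t by_work =
      std::max<std::int64_t>(1, static_cast<std::int64_t>(work / kMinWorkPerThread));
  return static_cast<int>(std::min<std::int64_t>(available, by_work));
}
```

Under these assumptions, with 12 available threads a 512x1 * 1x512 product gets 5 threads, while a 1024x1024 * 1024x1024 product still uses all 12.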
* \|	Merged latest updates from trunk	Benoit Steiner	2016-10-05
\|\\|
\| *	Fix alignment of statically allocated temporaries in symv and trmv.	Gael Guennebaud	2016-09-21
\| \|
\| *	Fix product for custom complex type. (conjugation was ignored)	Gael Guennebaud	2016-09-14
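What "conjugation was ignored" means for a user-defined complex type can be shown with a conjugated dot product that calls `conj` unqualified, so that the user's overload is found by argument-dependent lookup. The `mytypes::Cplx` type and `conj_dot` are hypothetical; this is not the code touched by the fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace mytypes {
// Hypothetical user-defined complex scalar (deliberately not std::complex).
struct Cplx {
  double re, im;
};
inline Cplx operator+(Cplx a, Cplx b) { return {a.re + b.re, a.im + b.im}; }
inline Cplx operator*(Cplx a, Cplx b) {
  return {a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}
inline Cplx conj(Cplx a) { return {a.re, -a.im}; }
}  // namespace mytypes

// Computes sum_i conj(a[i]) * b[i]. The unqualified conj() call is resolved by
// argument-dependent lookup, so it finds mytypes::conj for Cplx. Dropping the
// conj() call would silently compute sum_i a[i] * b[i] instead.
template <typename T>
T conj_dot(const std::vector<T>& a, const std::vector<T>& b) {
  assert(a.size() == b.size());
  T acc{};  // value-initialized: zero for Cplx and for built-in types
  for (std::size_t i = 0; i < a.size(); ++i) acc = acc + conj(a[i]) * b[i];
  return acc;
}
```

With a = b = {i}, the conjugated result is 1, whereas the unconjugated sum would be -1.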
\| \|
\| *	bug #1167: simplify installation of header files using cmake's ↵	Gael Guennebaud	2016-08-29
\| \| \| \| \| \| \| \|	install(DIRECTORY ...) command.
\| *	bug #1278: ease parsing	Gael Guennebaud	2016-08-22
\| \|
\| *	Fix performance regression in dgemm introduced by changeset ↵	Gael Guennebaud	2016-07-02
\| \| \| \| \| \| \| \|	5d51a7f12c69138ed2a43df240bdf27a5313f7ce
\| *	Fix performance regression introduced in changeset ↵	Gael Guennebaud	2016-07-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	e56aabf205a1e8f581dd8a46d7d46ce79c45e158. Register blocking sizes are better handled by the cache size heuristics. The current code introduced very small blocks, for instance for a 9x9 matrix, thus killing performance.
\| *	Relax mixing-type constraints for binary coefficient-wise operators:	Gael Guennebaud	2016-06-06
\| \|	- Replace internal::scalar_product_traits<A,B> by Eigen::ScalarBinaryOpTraits<A,B,OP>
\| \|	- Remove the "functor_is_product_like" helper (was pretty ugly)
\| \|	- Currently, OP is not used, but it is available to the user for fine grained tuning
\| \|	- Currently, only the following operators have been generalized: *,/,+,-,=,*=,/=,+=,-=
\| \|	- TODO: generalize all other binary operators (comparisons,pow,etc.)
\| \|	- TODO: handle "scalar op array" operators (currently only * is handled)
\| \|	- TODO: move the handling of the "void" scalar type to ScalarBinaryOpTraits
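A rough sketch of the traits idea: a template keyed on the two scalar types and the operator, left undefined by default so that only explicitly allowed combinations compile. `MixedOpTraits`, `ProductOp` and `mixed_mul` are hypothetical names; this does not reproduce Eigen's `ScalarBinaryOpTraits` definition.

```cpp
#include <complex>
#include <type_traits>

struct ProductOp {};  // hypothetical operator tag; unused by the traits, like OP in the commit

// Primary template intentionally left undefined: mixing two scalar types
// compiles only if a specialization below allows it.
template <typename A, typename B, typename Op>
struct MixedOpTraits;

// Same scalar type on both sides: always allowed.
template <typename T, typename Op>
struct MixedOpTraits<T, T, Op> { using ReturnType = T; };

// Real with complex (either order): allowed, the result is complex.
template <typename T, typename Op>
struct MixedOpTraits<T, std::complex<T>, Op> { using ReturnType = std::complex<T>; };
template <typename T, typename Op>
struct MixedOpTraits<std::complex<T>, T, Op> { using ReturnType = std::complex<T>; };

// Mixed-type multiplication whose return type comes from the traits. Calling it
// with, say, (float, double) fails to compile because no specialization matches.
template <typename A, typename B>
typename MixedOpTraits<A, B, ProductOp>::ReturnType mixed_mul(const A& a, const B& b) {
  return a * b;
}

static_assert(std::is_same<MixedOpTraits<double, std::complex<double>, ProductOp>::ReturnType,
                           std::complex<double>>::value,
              "real * complex yields complex");
```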
\| *	Handle some Index to int conversions in BLAS/LAPACK support.	Gael Guennebaud	2016-05-26
\| \|
\| *	Introduce internal's UIntPtr and IntPtr types for pointer to integer ↵	Gael Guennebaud	2016-05-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	conversions. This fixes "conversion from pointer to same-sized integral type" warnings by ICC. Ideally, we would use the std::[u]intptr_t types all the time, but since they are C99/C++11 only, let's be safe.
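For comparison, the standard-library route mentioned above is `std::uintptr_t` from `<cstdint>` (C++11). A typical pointer-to-integer use is an alignment check; the helper below is a hypothetical example, not Eigen's `UIntPtr` code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts the pointer to std::uintptr_t, an unsigned integer type that can hold
// any void pointer (formally optional, but provided on mainstream platforms),
// rather than an arbitrary integer type such as std::size_t, which some
// compilers (the commit mentions ICC) warn about even when it is the same size.
bool is_aligned(const void* p, std::size_t alignment) {
  assert(alignment != 0);
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```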