* Add CUDA complex sqrt. (Antonio Sanchez, 2020-12-22)
    This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
    device, requested by the Tensorflow folks. Technically `std::complex` is not
    supported by NVCC on device (though it is by clang), so the default
    `sqrt(std::complex<T>)` function only works on the host. Here we create an
    overload to add back the functionality. Also modified the CMake file to add
    the `--expt-relaxed-constexpr` (or equivalent) flag for NVCC to allow calling
    constexpr functions from device functions, and added support for specifying
    the compute architecture for NVCC (already available for clang).
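    For illustration, a minimal sketch of such a device-callable overload
    (hypothetical helper name, not Eigen's actual implementation; it computes the
    square root in polar form, and calling the constexpr `std::complex` accessors
    on device is exactly what needs the relaxed-constexpr flag):

    ```
    #include <cmath>
    #include <complex>

    #if defined(__CUDACC__)
    #define DEVICE_FUNC __host__ __device__
    #else
    #define DEVICE_FUNC
    #endif

    // sqrt(z) via |w| = sqrt(|z|) and arg(w) = arg(z)/2 (illustrative only;
    // a production version needs special handling of infinities and NaNs).
    template <typename T>
    DEVICE_FUNC std::complex<T> device_complex_sqrt(const std::complex<T>& z) {
      const T x = z.real();
      const T y = z.imag();
      const T mod = std::sqrt(std::hypot(x, y));  // sqrt(|z|)
      const T arg = std::atan2(y, x) / T(2);      // half the argument of z
      return std::complex<T>(mod * std::cos(arg), mod * std::sin(arg));
    }
    ```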
* Fix missing EIGEN_DEVICE_FUNC (rgreenblatt, 2020-12-20)
* Add iterative psqrt<double> for AVX and SSE when FMA is available. (Rasmus Munk Larsen, 2020-12-16)
    * Add iterative psqrt<double> for AVX and SSE when FMA is available. This
      provides a ~10% speedup.
    * Write iterative sqrt explicitly in terms of pmadd. This gives up to 7%
      speedup for psqrt<float> with AVX & SSE with FMA.
    * Remove iterative psqrt<double> for NEON, because the initial rsqrt
      approximation is not accurate enough for convergence in 2 Newton-Raphson
      steps, and with 3 steps just calling the built-in sqrt instruction is
      faster.

    The following benchmarks were compiled with clang "-O2 -ffast-math -mfma",
    with and without -mavx.

    AVX+FMA (float)
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 1%  ~
    BM_eigen_sqrt_float/8     2.07ns ± 0%  2.08ns ± 1%  ~
    BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%  ~
    BM_eigen_sqrt_float/512   95.7ns ± 0%  95.5ns ± 0%  ~
    BM_eigen_sqrt_float/4k     776ns ± 0%   763ns ± 0%  -1.67%
    BM_eigen_sqrt_float/32k   6.57µs ± 1%  6.13µs ± 0%  -6.69%
    BM_eigen_sqrt_float/256k  83.7µs ± 3%  83.3µs ± 2%  ~
    BM_eigen_sqrt_float/1M     335µs ± 2%   332µs ± 2%  ~

    SSE+FMA (float)
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 0%  ~
    BM_eigen_sqrt_float/8     2.07ns ± 0%  2.06ns ± 0%  ~
    BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%  ~
    BM_eigen_sqrt_float/512   95.7ns ± 0%  96.3ns ± 4%  ~
    BM_eigen_sqrt_float/4k     774ns ± 0%   763ns ± 0%  -1.50%
    BM_eigen_sqrt_float/32k   6.58µs ± 2%  6.11µs ± 0%  -7.06%
    BM_eigen_sqrt_float/256k  82.7µs ± 1%  82.6µs ± 1%  ~
    BM_eigen_sqrt_float/1M     330µs ± 1%   329µs ± 2%  ~

    SSE+FMA (double)
    BM_eigen_sqrt_double/1    1.63ns ± 0%  1.63ns ± 0%  ~
    BM_eigen_sqrt_double/8    6.51ns ± 0%  6.08ns ± 0%  -6.68%
    BM_eigen_sqrt_double/64   52.1ns ± 0%  46.5ns ± 1%  -10.65%
    BM_eigen_sqrt_double/512   417ns ± 0%   374ns ± 1%  -10.29%
    BM_eigen_sqrt_double/4k   3.33µs ± 0%  2.97µs ± 1%  -11.00%
    BM_eigen_sqrt_double/32k  26.7µs ± 0%  23.7µs ± 0%  -11.07%
    BM_eigen_sqrt_double/256k  213µs ± 0%   206µs ± 1%  -3.31%
    BM_eigen_sqrt_double/1M    862µs ± 0%   870µs ± 2%  +0.96%

    AVX+FMA (double)
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_double/1    1.63ns ± 0%  1.63ns ± 0%  ~
    BM_eigen_sqrt_double/8    6.51ns ± 0%  6.06ns ± 0%  -6.95%
    BM_eigen_sqrt_double/64   52.1ns ± 0%  46.5ns ± 1%  -10.80%
    BM_eigen_sqrt_double/512   417ns ± 0%   373ns ± 1%  -10.59%
    BM_eigen_sqrt_double/4k   3.33µs ± 0%  2.97µs ± 1%  -10.79%
    BM_eigen_sqrt_double/32k  26.7µs ± 0%  23.8µs ± 0%  -10.94%
    BM_eigen_sqrt_double/256k  214µs ± 0%   208µs ± 2%  -2.76%
    BM_eigen_sqrt_double/1M    866µs ± 3%   923µs ± 7%  ~
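    As a rough illustration of the "sqrt in terms of pmadd" idea, here is a
    scalar sketch with `std::fma` standing in for `pmadd` (illustrative only;
    the actual packet code, step counts, and edge-case handling differ):

    ```
    #include <cmath>

    // Refine an initial estimate y ~= 1/sqrt(x) with Newton-Raphson steps
    // written as fused multiply-adds, then form sqrt(x) = x * rsqrt(x).
    float iterative_sqrt_sketch(float x, float y_rsqrt_estimate) {
      const float minus_half_x = -0.5f * x;
      float y = y_rsqrt_estimate;
      for (int step = 0; step < 2; ++step) {
        // y <- y * (1.5 - 0.5 * x * y * y), one Newton-Raphson iteration.
        y = y * std::fma(minus_half_x * y, y, 1.5f);
      }
      return x * y;  // sqrt(x) = x * (1/sqrt(x))
    }
    ```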
* Merge branch 'lambdaknight/eigen-master' (Turing Eret, 2020-12-16)
* Replace call to FixedDimensions() with a singleton instance of FixedDimensions. (Turing Eret, 2020-12-16)
* Add an additional step of Newton-Raphson for `psqrt<double>` on Arm, which otherwise has an error of ~1000 ulps. (Rasmus Munk Larsen, 2020-12-15)
* TensorStorage with FixedDimensions now has zero instance memory overhead. (Turing Eret, 2020-12-14)
    Removed m_dimension as an instance member of TensorStorage with
    FixedDimensions and instead use the template parameter. This means that the
    size of a pure fixed-size storage is exactly equal to the data it is storing.
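    The gist, as a simplified sketch (names and layout are illustrative, not
    Eigen's actual TensorStorage):

    ```
    #include <cstddef>

    // When the dimensions are compile-time template parameters, no dimensions
    // member is needed, so the storage is exactly the size of its payload.
    template <typename T, std::size_t Rows, std::size_t Cols>
    struct FixedStorageSketch {
      T data[Rows * Cols];
      static constexpr std::size_t dimension(std::size_t i) {
        return i == 0 ? Rows : Cols;  // recovered from the type, not a member
      }
    };

    static_assert(sizeof(FixedStorageSketch<float, 2, 3>) == 6 * sizeof(float),
                  "zero instance memory overhead");
    ```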
* Remove code checking for CMake < 3.5 (Alexander Grund, 2020-12-14)
    As the CMake version is at least 3.5, the code checking for earlier versions
    can be removed.
* Remove comma at the end of enumeration list to silence C++03 warnings (David Tellenbach, 2020-12-13)
* Fix implicit cast to double. (Antonio Sanchez, 2020-12-12)
    Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors in
    Google due to `-Wall`.
* Fix NEON pmax<PropagateNumbers,Packet4bf>. (Antonio Sanchez, 2020-12-11)
    Simple typo: the max implementation called pmin instead of pmax for floats.
* Fix typo in AVX512 packet math. (Antonio Sanchez, 2020-12-11)
* Remove unused macro in Half.h (David Tellenbach, 2020-12-12)
* Fix more SSE/AVX packet conversions for peven. (Antonio Sanchez, 2020-12-11)
    MSVC doesn't like function-style casts and forces us to use intrinsics.
* Replace M_LOG2E and M_LN2 with custom macros. (Antonio Sanchez, 2020-12-11)
    For these to exist we would need to define `_USE_MATH_DEFINES` before `cmath`
    or `math.h` is first included. However, we don't control the include order
    for projects outside Eigen, so even defining the macro in `Eigen/Core` does
    not fix the issue for projects that end up including `<cmath>` before Eigen
    does (explicitly or transitively). To fix this, we define `EIGEN_LOG2E` and
    `EIGEN_LN2` ourselves.
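    A hedged sketch of the idea (the macro names come from the commit message;
    the literals are the standard values of log2(e) and ln(2), truncated here,
    and the exact definitions used in Eigen may differ):

    ```
    // Define the constants unconditionally instead of relying on
    // _USE_MATH_DEFINES having been set before the first include of <cmath>.
    #ifndef EIGEN_LOG2E
    #define EIGEN_LOG2E 1.442695040888963407
    #endif
    #ifndef EIGEN_LN2
    #define EIGEN_LN2 0.693147180559945309
    #endif
    ```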
* Fix MSVC SSE casts. (Antonio Sanchez, 2020-12-11)
    MSVC doesn't like `__m128(__m128i)` C-style casts, so packets need to be
    converted using intrinsics.
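    For example (an illustrative snippet, not the exact Eigen change):

    ```
    #include <emmintrin.h>  // SSE2

    // MSVC rejects casting between __m128i and __m128 directly, so use the
    // bit-reinterpreting cast intrinsic, which compiles to no instructions.
    static inline __m128 int_bits_as_float(__m128i a) {
      // return __m128(a);          // accepted by gcc/clang, rejected by MSVC
      return _mm_castsi128_ps(a);   // portable equivalent
    }
    ```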
* Fix for broken ROCm/HIP Support (Deven Desai, 2020-12-11)
    The following commit introduced a breakage in ROCm/HIP support for Eigen.

    https://gitlab.com/libeigen/eigen/-/commit/5ec4907434742d4555df4aa708b665868b88f3b4#1958e65719641efe5483abc4ce0b61806270f6f3_525_517

    ```
    Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o
    In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20:
    In file included from /home/rocm-user/eigen/test/main.h:356:
    In file included from /home/rocm-user/eigen/Eigen/QR:11:
    In file included from /home/rocm-user/eigen/Eigen/Core:222:
    /home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'?
        return half2half2(from);
               ^~~~~~~~~~
               __half2half2
    /opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here
    __half2 __half2half2(__half x)
            ^
    1 error generated when compiling for gfx900.
    ```

    The cause seems to be a copy-paste error, and the fix is trivial.
* Don't guard psqrt for std::complex<float> with EIGEN_ARCH_ARM64 (David Tellenbach, 2020-12-11)
* Add Armv8 guard on PropagateNumbers implementation. (Everton Constantino, 2020-12-10)
* Remove private access of std::deque::_M_impl. (Antonio Sanchez, 2020-12-10)
    This no longer works on gcc or clang, so we should just remove the hack.
    The default should compile to similar code anyway.
* Fix vectorization of complex sqrt on NEON (David Tellenbach, 2020-12-10)
* Remove comma at end of enumerator list in NEON PacketMath (David Tellenbach, 2020-12-10)
* Fix a typo in SparseMatrix documentation. (David Tellenbach, 2020-12-09)
    This fixes issue #2091.
* Implement vectorized complex square root. (Rasmus Munk Larsen, 2020-12-08)
    Closes #1905.

    Measured speedup for sqrt of `complex<float>` on Skylake:

    SSE:
    ```
    name                      old time/op  new time/op  delta
    BM_eigen_sqrt_ctype/1     49.4ns ± 0%  54.3ns ± 0%  +10.01%
    BM_eigen_sqrt_ctype/8      332ns ± 0%    50ns ± 1%  -84.97%
    BM_eigen_sqrt_ctype/64    2.81µs ± 1%  0.38µs ± 0%  -86.49%
    BM_eigen_sqrt_ctype/512   23.8µs ± 0%   3.0µs ± 0%  -87.32%
    BM_eigen_sqrt_ctype/4k     202µs ± 0%    24µs ± 2%  -88.03%
    BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.19ms ± 0%  -88.18%
    BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   1.5ms ± 1%  -88.20%
    BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   6.2ms ± 0%  -88.18%
    ```

    AVX2:
    ```
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_ctype/1     53.6ns ± 0%  55.6ns ± 0%  +3.71%
    BM_eigen_sqrt_ctype/8      334ns ± 0%    27ns ± 0%  -91.86%
    BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.22µs ± 2%  -92.28%
    BM_eigen_sqrt_ctype/512   23.8µs ± 1%   1.7µs ± 1%  -92.81%
    BM_eigen_sqrt_ctype/4k     201µs ± 0%    14µs ± 1%  -93.24%
    BM_eigen_sqrt_ctype/32k   1.62ms ± 0%  0.11ms ± 1%  -93.29%
    BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.9ms ± 1%  -93.31%
    BM_eigen_sqrt_ctype/1M    52.0ms ± 0%   3.5ms ± 1%  -93.31%
    ```

    AVX512:
    ```
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_ctype/1     53.7ns ± 0%  56.2ns ± 1%  +4.75%
    BM_eigen_sqrt_ctype/8      334ns ± 0%    18ns ± 2%  -94.63%
    BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.12µs ± 1%  -95.54%
    BM_eigen_sqrt_ctype/512   23.9µs ± 1%   1.0µs ± 1%  -95.89%
    BM_eigen_sqrt_ctype/4k     202µs ± 0%     8µs ± 1%  -96.13%
    BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.06ms ± 1%  -96.15%
    BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.5ms ± 4%  -96.11%
    BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   2.0ms ± 1%  -96.13%
    ```
* Fix host/device calls for __half. (Antonio Sanchez, 2020-12-08)
    The previous code had `__host__ __device__` functions calling `__device__`
    functions (e.g. `__low2half`), which caused build failures in tensorflow.
    Also tried to simplify the `#ifdef` guards to make them clearer.
* Enabling PropagateNaN and PropagateNumbers for NEON. (Everton Constantino, 2020-12-08)
    - Adding propagate tests to bfloat16.
* Fix unused warning on new `dense_assignment_loop` impl. (Antonio Sanchez, 2020-12-07)
* Add specialization for compile-time zero-sized dense assignment. (Antonio Sanchez, 2020-12-07)
    In the current `dense_assignment_loop` implementations, if the destination's
    inner or outer size is zero at compile time and if the kernel involves a
    product, we currently get a compile error (#2080). This is triggered by
    attempting to multiply a non-existent row by a column (or vice-versa).

    To address this, we add a specialization for zero-sized assignments
    (`AllAtOnceTraversal`) which evaluates to a no-op. We also add a static check
    to ensure the size is in fact zero. This now seems to be the only existing
    use of `AllAtOnceTraversal`.

    Fixes #2080.
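    Schematically, the specialization is a compile-time no-op plus a static
    check, along these lines (a simplified sketch; the compile-time size is
    passed explicitly here for illustration, whereas Eigen obtains it from the
    kernel's destination expression):

    ```
    // Selected when the destination size is known to be zero at compile time.
    template <typename Kernel, int SizeAtCompileTime>
    struct zero_sized_assignment_loop_sketch {
      static_assert(SizeAtCompileTime == 0,
                    "this specialization is only valid for zero-sized assignments");
      static void run(Kernel& /*kernel*/) {
        // Nothing to copy: the destination has no coefficients at compile time.
      }
    };
    ```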
* Clean up `#if`s in GPU PacketMath. (Antonio Sanchez, 2020-12-04)
    Removed redundant checks and redundant code for CUDA/HIP.

    Note: there are several issues here of calling `__device__` functions from
    `__host__ __device__` functions, in particular `__low2half`. We do not
    address that here -- only modifying this file enough to get our current
    tests to compile.

    Fixed: #1847
* Add log2() to Eigen. (Rasmus Munk Larsen, 2020-12-04)
* Fix bad NEON fp16 check (Antonio Sanchez, 2020-12-04)
* Special function implementations for half/bfloat16 packets. (Antonio Sanchez, 2020-12-04)
    Current implementations fail to consider half-float packets, only half-float
    scalars. Added specializations for packets on AVX, AVX512 and NEON. Added
    tests to `special_packetmath`.

    The current `special_functions` tests would fail for half and bfloat16 due to
    lack of precision. The NEON tests also fail with precision issues and due to
    different handling of `sqrt(inf)`, so the special functions bessel and ndtri
    have been disabled. Tested with AVX, AVX512.
* Remove duplicate #if clause (David Tellenbach, 2020-12-04)
* Fix shfl* macros for CUDA/HIP (Antonio Sanchez, 2020-12-04)
    The `shfl*` functions are `__device__` only; the `#ifdef`s are adjusted so
    they are defined whenever the corresponding CUDA/HIP ones are.

    Also changed the HIP/CUDA<9.0 versions to cast to int instead of doing the
    conversion `half` <-> `float`.

    Fixes #2083
* The function 'prefetch' did not work correctly on the win64 platform (shrek1402, 2020-12-04)
* Revert "Add log2() operator to Eigen" (Rasmus Munk Larsen, 2020-12-03)
    This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
* Add log2() operator to Eigen (Rasmus Munk Larsen, 2020-12-03)
* Small cleanup of generic plog implementations: (Rasmus Munk Larsen, 2020-12-03)
    Adding the term e*ln(2) is split into two steps for no obvious reason. This
    dates back to the original Cephes code from which the algorithm is adapted.
    It appears that this was done in Cephes to prevent the compiler from
    reordering the addition of the 3 terms in the approximation

        log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x)

    which must be added in reverse order since |x| < (sqrt(2)-1).

    This allows rewriting the code to just 2 pmadd and 1 padd instructions, which
    on a Skylake processor speeds up the code by 5-7%.
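    In scalar form, the rearrangement amounts to something like the following
    sketch, with `std::fma` standing in for `pmadd` (`R` denotes the value of the
    rational term P(x)/Q(x) and `e` the extracted exponent; illustrative only,
    not the exact Eigen code):

    ```
    #include <cmath>

    // log(z) ~= e*ln(2) + x - 0.5*x^2 + x^3*R, with the polynomial part
    // evaluated as x2*(x*R - 0.5) + x, i.e. two madds and one add.
    float plog_tail_sketch(float x, float R, float e) {
      const float ln2 = 0.6931471805599453f;
      const float x2 = x * x;
      return std::fma(x2, std::fma(x, R, -0.5f), x) + e * ln2;
    }
    ```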
* Include chrono in main for c++11. (Antonio Sanchez, 2020-12-03)
    Hack to fix tensor tests, since min/max are overridden by `main.h`.
* Clean up the Tensor header and get rid of the EIGEN_SLEEP macro. (Rasmus Munk Larsen, 2020-12-02)
* Fix typo in `F32MaskToBf16Mask`. (Antonio Sanchez, 2020-12-02)
* Fix neon cmp* functions for bf16. (Antonio Sanchez, 2020-12-02)
    The current implementation corrupts the comparison masks when converting from
    float back to bfloat16. The resulting masks are then no longer all zeros or
    all ones, which breaks when used with `pselect` (e.g. in
    `pmin<PropagateNumbers>`). This was causing `packetmath_15` to fail on arm.

    Introducing a simple `F32MaskToBf16Mask` corrects this (takes the lower
    16 bits for each float mask).
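    One lane of that conversion, as a scalar sketch (illustrative only; the
    actual NEON code operates on whole vectors):

    ```
    #include <cstdint>
    #include <cstring>

    // A float comparison mask lane is either all zeros or all ones, so keeping
    // the lower 16 bits of the 32-bit pattern yields a valid bfloat16 mask lane.
    // A value conversion (float -> bfloat16) would instead corrupt the pattern.
    static inline std::uint16_t f32_mask_to_bf16_mask_lane(float mask_lane) {
      std::uint32_t bits;
      std::memcpy(&bits, &mask_lane, sizeof(bits));
      return static_cast<std::uint16_t>(bits & 0xFFFFu);
    }
    ```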
* Implement CUDA __shfl* for Eigen::half (Antonio Sanchez, 2020-12-01)
    Prior to this fix, `TensorContractionGpu` and the
    `cxx11_tensor_of_float16_gpu` test are broken, as well as several ops in
    Tensorflow. The gpu functions `__shfl*` became ambiguous now that
    `Eigen::half` implicitly converts to float. Here we add the required
    specializations.
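    A rough sketch of the kind of overload meant here (hypothetical helper name,
    not Eigen's exact code; assumes CUDA 9+ for `__shfl_sync` and shuffles the
    raw 16-bit payload as an unsigned int so the call no longer relies on the
    ambiguous implicit conversion to float):

    ```
    #include <cstring>
    #include <Eigen/Core>

    __device__ inline Eigen::half shfl_half_sync(unsigned mask, Eigen::half var,
                                                 int srcLane) {
      unsigned short bits;
      std::memcpy(&bits, &var, sizeof(bits));  // reinterpret half as raw bits
      const unsigned shuffled =
          __shfl_sync(mask, static_cast<unsigned>(bits), srcLane);
      const unsigned short out_bits = static_cast<unsigned short>(shuffled);
      Eigen::half out;
      std::memcpy(&out, &out_bits, sizeof(out_bits));  // bits back to half
      return out;
    }
    ```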
* Fix a few issues for AVX512. This change enables vectorized versions of log, exp, log1p, expm1 when AVX512DQ is not available. (Rasmus Munk Larsen, 2020-12-01)
* Fix #2077, `EIGEN_CONSTEXPR` in `Half`. (Antonio Sanchez, 2020-12-01)
    `bit_cast` cannot be `constexpr`, so we need to remove `EIGEN_CONSTEXPR` from
    `raw_half_as_uint16(...)`. This shouldn't affect anything else, since it is
    only used in a `bit_cast<uint16_t,half>()` call which is not itself
    `constexpr`.

    Fixes #2077.
* add EIGEN_DEVICE_FUNC to methods (acxz, 2020-12-01)
* AVX512 missing ops. (Antonio Sanchez, 2020-11-30)
    This allows the `packetmath` tests to pass for AVX512 on Skylake. Made `half`
    and `bfloat16` consistent in terms of the ops they support.

    Note: the `log` tests are currently disabled for `bfloat16` since they fail
    due to poor precision (they were previously disabled for `Packet8bf` via test
    function specialization -- I just removed that specialization and disabled it
    in the generic test).
* Fix typo in doc (Florian Maurin, 2020-11-30)
* Workaround for doxygen class template titles in which the template part of the class signature is lost due to a problem with forward declarations. (Jim Lersch, 2020-11-27)
    The problem is probably caused by doxygen bug #7689. It is confirmed to be
    fixed in doxygen >= 1.8.19.
* Fix doxygen class blocks that were not associated with the correct classes. (Jim Lersch, 2020-11-27)