eigen - C++ library for linear algebra

	Commit message	Author	Age
*	Implement a generic vectorized version of Smith's algorithms for complex ↵	Rasmus Munk Larsen	2021-07-01
\| \| \| \|	division.
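For readers unfamiliar with the technique named in this commit, here is a minimal scalar sketch of Smith's algorithm for complex division. It is not the vectorized Eigen code; the function name `smith_div` is hypothetical. The idea is to divide through by the larger component of the denominator, so the intermediate `c*c + d*d` of the textbook formula, which can overflow or underflow, is never formed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical scalar helper (not Eigen's implementation).
// Computes (a + bi) / (c + di) by scaling with the larger of |c| and |d|.
inline std::complex<double> smith_div(std::complex<double> x,
                                      std::complex<double> y) {
  const double a = x.real(), b = x.imag();
  const double c = y.real(), d = y.imag();
  if (std::fabs(c) >= std::fabs(d)) {
    const double r = d / c;        // |r| <= 1
    const double den = c + d * r;  // equals (c*c + d*d) / c
    return {(a + b * r) / den, (b - a * r) / den};
  }
  const double r = c / d;          // |r| < 1
  const double den = c * r + d;    // equals (c*c + d*d) / d
  return {(a * r + b) / den, (b * r - a) / den};
}
```

With operands around 1e200 the textbook formula would overflow in `c*c`, while this form stays finite.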
*	Correct declarations for aarch64-pc-windows-msvc	大河メタル	2021-06-30
\|
*	Small cleanup: Get rid of the macros EIGEN_HAS_SINGLE_INSTRUCTION_CJMADD and ↵	Rasmus Munk Larsen	2021-06-24
\| \| \| \|	CJMADD, which were effectively unused, apart from on x86, where the change results in identically performing code.
*	Get rid of code duplication for conj_helper. For packets where ↵	Rasmus Munk Larsen	2021-06-24
\| \| \| \|	LhsType=RhsType a single generic implementation suffices. For scalars, the generic implementation of pconj automatically forwards to numext::conj, so much of the existing specialization can be avoided. For mixed types we still need specializations.
*	Use bit_cast to create -0.0 for floating point types to avoid compiler ↵	Rasmus Munk Larsen	2021-06-11
\| \| \| \|	optimization changing sign with -ffast-math enabled.
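A minimal sketch of the idea in this commit, assuming IEEE-754 binary32 floats: build -0.0 from its bit pattern instead of writing the literal. The helper name `make_negative_zero` is hypothetical, and `std::memcpy` is used here as a portable stand-in for a bit cast; Eigen's own `bit_cast` helper is not reproduced.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns -0.0f built from its raw bit pattern.
inline float make_negative_zero() {
  const std::uint32_t sign_bit = 0x80000000u;  // only the sign bit set
  float result;
  static_assert(sizeof(result) == sizeof(sign_bit), "float must be 32 bits");
  std::memcpy(&result, &sign_bit, sizeof(result));  // copy bits, no arithmetic
  return result;
}
```

Because no floating-point arithmetic is involved, there is no expression on the sign of zero for value-changing optimizations to rewrite.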
*	Add missing NEON ptranspose implementations.	Antonio Sanchez	2021-05-25
\| \| \| \|	Unified implementation using only `vzip`.
*	Revert addition of unused `paddsub<Packet2cf>`. This fixes #2242	Christoph Hertzberg	2021-05-06
\|
*	Add missing pcmp_lt_or_nan for NEON Packet4bf.	Antonio Sanchez	2021-04-27
\|
*	Remove yet another comma at end of enum	David Tellenbach	2021-03-18
\|
*	Fix rint SSE/NEON again, using optimization barrier.	Antonio Sanchez	2021-03-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a new version of !423, which failed for MSVC. Defined `EIGEN_OPTIMIZATION_BARRIER(X)` that uses inline assembly to prevent operations involving `X` from crossing that barrier. Should work on most `GNUC` compatible compilers (MSVC doesn't seem to need this). This is a modified version adapted from what was used in `psincos_float` and tested on more platforms (see #1674, https://godbolt.org/z/73ezTG). Modified `rint` to use the barrier to prevent the add/subtract rounding trick from being optimized away. Also fixed an edge case for large inputs that get bumped up a power of two and ends up rounding away more than just the fractional part. If we are over `2^digits` then just return the input. This edge case was missed in the test since the test was comparing approximate equality, which was still satisfied. Adding a strict equality option catches it.
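The add/subtract rounding trick and the large-input edge case described above can be sketched for a single scalar as follows. This is an illustration under stated assumptions, not Eigen's packet code: the name `rint_add_sub` is hypothetical, `volatile` stands in for `EIGEN_OPTIMIZATION_BARRIER` (the earlier commit message notes that `volatile` costs extra loads), the cutoff is taken as 2^23 for `float`, and the default round-to-nearest-even mode is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical scalar helper (not Eigen's implementation).
inline float rint_add_sub(float a) {
  // 2^23 for float: from here on every representable value is an integer.
  const float limit =
      std::ldexp(1.0f, std::numeric_limits<float>::digits - 1);
  const float abs_a = std::fabs(a);
  // Large inputs, infinities and NaN are returned unchanged.
  if (!(abs_a < limit)) return a;
  // Adding `limit` pushes the fractional bits out of the mantissa, so the
  // hardware rounds them away using the current rounding mode.
  // `volatile` forces the sum to be materialized so the compiler cannot
  // simplify (abs_a + limit) - limit back to abs_a.
  volatile float shifted = abs_a + limit;
  const float rounded = shifted - limit;
  return std::copysign(rounded, a);  // restore the sign
}
```

Working on the absolute value and restoring the sign afterwards keeps the sum inside [2^23, 2^24), where the spacing between floats is exactly 1.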
*	Revert "Fix rint for SSE/NEON."	Antonio Sánchez	2021-03-03
\| \| \|	This reverts commit e72dfeb8b9fa5662831b5d0bb9d132521f9173dd
*	Fix rint for SSE/NEON.	Antonio Sanchez	2021-03-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	It seems sometimes with aggressive optimizations the combination `psub(padd(a, b), b)` trick to force rounding is compiled away. Here we replace with inline assembly to prevent this (I tried `volatile`, but that leads to additional loads from memory). Also fixed an edge case for large inputs `a` where adding `b` bumps the value up a power of two and ends up rounding away more than just the fractional part. If we are over `2^digits` then just return the input. This edge case was missed in the test since the test was comparing approximate equality, which was still satisfied. Adding a strict equality option catches it.
*	Add print for SSE/NEON, use NEON rounding intrinsics if available.	Antonio Sanchez	2021-02-27
\| \| \| \| \| \| \| \| \| \|	In SSE, by adding/subtracting 2^MantissaBits, we force rounding according to the current rounding mode. For NEON, we use the provided intrinsics for rint/floor/ceil if available (armv8). Related to #1969.
*	Fix NEON sqrt for 32-bit, add prsqrt.	Antonio Sanchez	2021-02-26
\| \| \| \| \| \| \| \| \| \| \| \|	With !406, we accidentally broke arm 32-bit NEON builds, since `vsqrt_f32` is only available for 64-bit. Here we add back the `rsqrt` implementation for 32-bit, relying on a `prsqrt` implementation with better handling of edge cases. Note that several of the 32-bit NEON packet tests are currently failing - either due to denormal handling (NEON versions flush to zero, but scalar paths don't) or due to accuracy (e.g. sin/cos).
*	Fix floor/ceil for NEON fp16.	Antonio Sanchez	2021-02-25
\| \| \| \|	Forgot to test this. Fixes bug introduced in !416.
*	Fix SSE/NEON pfloor/pceil for saturated values.	Antonio Sanchez	2021-02-25
\| \| \| \| \| \| \| \| \| \|	The original will saturate if the input does not fit into an integer type. Here we fix this, returning the input if it doesn't have enough precision to have a fractional part. Also added `pceil` for NEON. Fixes #1969.
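A scalar sketch of the fix described above, assuming the floor is computed by a round trip through a 32-bit integer (an assumption for illustration; the helper name `floor_via_int` is hypothetical and this is not Eigen's packet code). Inputs too large to carry a fractional part are returned as-is, so the integer conversion never sees a value it cannot represent.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical scalar helper (not Eigen's implementation).
inline float floor_via_int(float a) {
  const float limit = 8388608.0f;  // 2^23: floats this large are integers
  // Also catches infinities and NaN, which fail the comparison.
  if (!(std::fabs(a) < limit)) return a;
  // The conversion truncates toward zero and is safe because |a| < 2^23.
  const float truncated = static_cast<float>(static_cast<std::int32_t>(a));
  // Truncation rounds negative non-integers up, so step down by one.
  return truncated > a ? truncated - 1.0f : truncated;
}
```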
*	Disable fast psqrt for NEON.	Antonio Sanchez	2021-02-23
\| \| \| \| \| \| \|	Accuracy is too poor - requires at least two Newton iterations, but then it is no longer significantly faster than `vsqrt`. Fixes #2094.
*	Updated pfrexp implementation.	Antonio Sanchez	2021-02-17
\| \| \| \| \| \|	The original implementation fails for 0, denormals, inf, and NaN. See #2150
*	missing method in packetmath.h void ptranspose(PacketBlock<Packet16uc, 4>& ↵	Ashutosh Sharma	2021-02-16
\| \| \| \|	kernel)
*	Use vrsqrts for rsqrt Newton iterations.	Antonio Sanchez	2021-02-11
\| \| \| \| \|	It's slightly faster and slightly more accurate, allowing our current packetmath tests to pass for sqrt with a single iteration.
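For context, a scalar sketch of the Newton iteration for 1/sqrt(x) written around a `(3 - a*b) / 2` step, which is the arithmetic the NEON reciprocal-square-root-step instruction is documented to perform. The function names are hypothetical and the starting estimate is passed in by the caller; on NEON it would come from a hardware estimate instruction.

```cpp
#include <cassert>
#include <cmath>

// Scalar model of the step: (3 - a*b) / 2.
inline float rsqrt_step(float a, float b) { return (3.0f - a * b) * 0.5f; }

// Hypothetical helper: refines an estimate y of 1/sqrt(x).
inline float rsqrt_refine(float x, float estimate, int iterations) {
  assert(iterations >= 0);
  float y = estimate;
  for (int i = 0; i < iterations; ++i) {
    // Newton update: y <- y * (3 - x*y*y) / 2
    y = y * rsqrt_step(x * y, y);
  }
  return y;
}
```

Each iteration roughly squares the relative error, which is why the quality of the initial estimate decides how many iterations are needed.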
*	loop less ptranspose	Ashutosh Sharma	2021-02-10
\|
*	Fix excessive GEBP register spilling for 32-bit NEON.	Antonio Sanchez	2021-02-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Clang does a poor job of optimizing the GEBP microkernel on 32-bit ARM, leading to excessive 16-byte register spills, slowing down basic f32 matrix multiplication by approx 50%. By specializing `gebp_traits`, we can eliminate the register spills. Volatile inline ASM both acts as a barrier to prevent reordering and enforces strict register use. In a simple f32 matrix multiply example, this modification reduces 16-byte spills from 109 instances to zero, leading to a 1.5x speed increase (search for `16-byte Spill` in the assembly in https://godbolt.org/z/chsPbE). This is a replacement of !379. See there for further discussion. Also moved `gebp_traits` specializations for NEON to `Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h` to be alongside other NEON-specific code. Fixes #2138.
*	Fix pfrexp/pldexp for half.	Antonio Sanchez	2021-01-21
\| \| \| \| \| \| \| \| \| \|	The recent addition of vectorized pow (!330) relies on `pfrexp` and `pldexp`. This was missing for `Eigen::half` and `Eigen::bfloat16`. Adding tests for these packet ops also exposed an issue with handling negative values in `pfrexp`, returning an incorrect exponent. Added the missing implementations, corrected the exponent in `pfrexp1`, and added `packetmath` tests.
*	1)provide a better generic paddsub op implementation	Guoqiang QI	2021-01-13
\| \| \| \| \|	2)make paddsub op support the Packet2cf/Packet4f/Packet2f in NEON 3)make paddsub op support the Packet2cf/Packet4f in SSE
*	* Add iterative psqrt<double> for AVX and SSE when FMA is available. This ↵	Rasmus Munk Larsen	2020-12-16
\|	provides a ~10% speedup.
\|	* Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA.
\|	* Remove iterative psqrt<double> for NEON, because the initial rsqrt approximation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster.
\|	The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx.
\|
\|	AVX+FMA (float)
\|	name                         old cpu/op     new cpu/op     delta
\|	BM_eigen_sqrt_float/1        1.08ns ± 0%    1.09ns ± 1%    ~
\|	BM_eigen_sqrt_float/8        2.07ns ± 0%    2.08ns ± 1%    ~
\|	BM_eigen_sqrt_float/64       12.4ns ± 0%    12.4ns ± 1%    ~
\|	BM_eigen_sqrt_float/512      95.7ns ± 0%    95.5ns ± 0%    ~
\|	BM_eigen_sqrt_float/4k       776ns ± 0%     763ns ± 0%     -1.67%
\|	BM_eigen_sqrt_float/32k      6.57µs ± 1%    6.13µs ± 0%    -6.69%
\|	BM_eigen_sqrt_float/256k     83.7µs ± 3%    83.3µs ± 2%    ~
\|	BM_eigen_sqrt_float/1M       335µs ± 2%     332µs ± 2%     ~
\|
\|	SSE+FMA (float)
\|	name                         old cpu/op     new cpu/op     delta
\|	BM_eigen_sqrt_float/1        1.08ns ± 0%    1.09ns ± 0%    ~
\|	BM_eigen_sqrt_float/8        2.07ns ± 0%    2.06ns ± 0%    ~
\|	BM_eigen_sqrt_float/64       12.4ns ± 0%    12.4ns ± 1%    ~
\|	BM_eigen_sqrt_float/512      95.7ns ± 0%    96.3ns ± 4%    ~
\|	BM_eigen_sqrt_float/4k       774ns ± 0%     763ns ± 0%     -1.50%
\|	BM_eigen_sqrt_float/32k      6.58µs ± 2%    6.11µs ± 0%    -7.06%
\|	BM_eigen_sqrt_float/256k     82.7µs ± 1%    82.6µs ± 1%    ~
\|	BM_eigen_sqrt_float/1M       330µs ± 1%     329µs ± 2%     ~
\|
\|	SSE+FMA (double)
\|	name                         old cpu/op     new cpu/op     delta
\|	BM_eigen_sqrt_double/1       1.63ns ± 0%    1.63ns ± 0%    ~
\|	BM_eigen_sqrt_double/8       6.51ns ± 0%    6.08ns ± 0%    -6.68%
\|	BM_eigen_sqrt_double/64      52.1ns ± 0%    46.5ns ± 1%    -10.65%
\|	BM_eigen_sqrt_double/512     417ns ± 0%     374ns ± 1%     -10.29%
\|	BM_eigen_sqrt_double/4k      3.33µs ± 0%    2.97µs ± 1%    -11.00%
\|	BM_eigen_sqrt_double/32k     26.7µs ± 0%    23.7µs ± 0%    -11.07%
\|	BM_eigen_sqrt_double/256k    213µs ± 0%     206µs ± 1%     -3.31%
\|	BM_eigen_sqrt_double/1M      862µs ± 0%     870µs ± 2%     +0.96%
\|
\|	AVX+FMA (double)
\|	name                         old cpu/op     new cpu/op     delta
\|	BM_eigen_sqrt_double/1       1.63ns ± 0%    1.63ns ± 0%    ~
\|	BM_eigen_sqrt_double/8       6.51ns ± 0%    6.06ns ± 0%    -6.95%
\|	BM_eigen_sqrt_double/64      52.1ns ± 0%    46.5ns ± 1%    -10.80%
\|	BM_eigen_sqrt_double/512     417ns ± 0%     373ns ± 1%     -10.59%
\|	BM_eigen_sqrt_double/4k      3.33µs ± 0%    2.97µs ± 1%    -10.79%
\|	BM_eigen_sqrt_double/32k     26.7µs ± 0%    23.8µs ± 0%    -10.94%
\|	BM_eigen_sqrt_double/256k    214µs ± 0%     208µs ± 2%     -2.76%
\|	BM_eigen_sqrt_double/1M      866µs ± 3%     923µs ± 7%     ~
*	Add an additional step of Newton-Raphson for `psqrt<double>` on Arm, which ↵	Rasmus Munk Larsen	2020-12-15
\| \| \| \|	otherwise has an error of ~1000 ulps.
*	Fix NEON pmax<PropagateNumbers,Packet4bf>.	Antonio Sanchez	2020-12-11
\| \| \| \|	Simple typo, the max impl called pmin instead of pmax for floats.
*	Don't guard psqrt for std::complex<float> with EIGEN_ARCH_ARM64	David Tellenbach	2020-12-11
\|
*	Add Armv8 guard on PropagateNumbers implementation.	Everton Constantino	2020-12-10
\|
*	Fix vectorization of complex sqrt on NEON	David Tellenbach	2020-12-10
\|
*	Remove comma at end of enumerator list in NEON PacketMath	David Tellenbach	2020-12-10
\|
*	- Enabling PropagateNaN and PropagateNumbers for NEON.	Everton Constantino	2020-12-08
\| \| \| \|	- Adding propagate tests to bfloat16.
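As a reminder of what the two propagation modes mean, here is a scalar sketch for `max`. The function names are hypothetical and this is not Eigen's packet code. With number propagation a NaN operand is ignored in favour of the other operand; with NaN propagation a NaN operand always wins.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar model of max with number propagation.
inline float max_propagate_numbers(float a, float b) {
  if (std::isnan(a)) return b;  // prefer the non-NaN operand
  if (std::isnan(b)) return a;
  return a > b ? a : b;
}

// Hypothetical scalar model of max with NaN propagation.
inline float max_propagate_nan(float a, float b) {
  if (std::isnan(a)) return a;  // any NaN operand is returned
  if (std::isnan(b)) return b;
  return a > b ? a : b;
}
```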
*	Special function implementations for half/bfloat16 packets.	Antonio Sanchez	2020-12-04
\| \| \| \| \| \| \| \| \| \| \| \| \|	Current implementations fail to consider half-float packets, only half-float scalars. Added specializations for packets on AVX, AVX512 and NEON. Added tests to `special_packetmath`. The current `special_functions` tests would fail for half and bfloat16 due to lack of precision. The NEON tests also fail with precision issues and due to different handling of `sqrt(inf)`, so special functions bessel, ndtri have been disabled. Tested with AVX, AVX512.
*	Fix typo in `F32MaskToBf16Mask`.	Antonio Sanchez	2020-12-02
\|
*	Fix neon cmp* functions for bf16.	Antonio Sanchez	2020-12-02
\| \| \| \| \| \| \| \| \| \| \|	The current impl corrupts the comparison masks when converting from float back to bfloat16. The resulting masks are then no longer all zeros or all ones, which breaks when used with `pselect` (e.g. in `pmin<PropagateNumbers>`). This was causing `packetmath_15` to fail on arm. Introducing a simple `F32MaskToBf16Mask` corrects this (takes the lower 16-bits for each float mask).
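A per-lane sketch of the mask narrowing described above, using plain integers in place of NEON registers. The names are hypothetical and the bitwise `select_bits` is only a model of a `pselect`-style blend. A comparison mask is a bit pattern rather than a number, so it has to be narrowed by copying bits instead of going through a float-to-bfloat16 value conversion.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: keep the lower 16 bits of one 32-bit lane mask.
// An all-ones or all-zeros input stays all-ones or all-zeros.
inline std::uint16_t f32_mask_to_bf16_mask(std::uint32_t mask) {
  return static_cast<std::uint16_t>(mask & 0xFFFFu);
}

// Bitwise blend: takes bits of `a` where mask is 1, bits of `b` where it is 0.
// Only correct per element if the mask is all ones or all zeros.
inline std::uint16_t select_bits(std::uint16_t mask, std::uint16_t a,
                                 std::uint16_t b) {
  return static_cast<std::uint16_t>((mask & a) | (~mask & b));
}
```

If the narrowed mask were anything other than all ones or all zeros, the blend would mix bits from both inputs, which matches the failure mode described in the commit message.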
*	Fixes duplicate symbol when building blas	Antonio Sanchez	2020-11-20
\| \| \| \| \| \| \|	A missing inline breaks blas, since the symbol is generated in `complex_single.cpp`, `complex_double.cpp`, `single.cpp`, and `double.cpp`. Changed the rest of the inlines to `EIGEN_STRONG_INLINE`.
*	Re-enable Arm Neon Eigen::half packets of size 8	David Tellenbach	2020-11-18
\|	- Add predux_half_dowto4
\|	- Remove explicit casts in Half.h to match the behaviour of BFloat16.h
\|	- Enable more packetmath tests for Eigen::half
*	Avoid promotion of Arm __fp16 to float in Neon PacketMath	David Tellenbach	2020-11-17
\| \| \| \| \| \|	Using overloaded arithmetic operators for Arm __fp16 always causes a promotion to float. We replace operator* by vmulh_f16 to avoid this.
*	Unify Inverse_SSE.h and Inverse_NEON.h into a single generic implementation ↵	Guoqiang QI	2020-11-17
\| \| \| \|	using PacketMath.
*	Fix typo in NEON/PacketMath.h	guoqiangqi	2020-11-13
\|
*	Add support for Armv8.2-a __fp16	David Tellenbach	2020-10-28
\|	Armv8.2-a provides a native half-precision floating point (__fp16 aka. float16_t). This patch introduces
\|	* __fp16 as underlying type of Eigen::half if this type is available
\|	* the packet types Packet4hf and Packet8hf representing float16x4_t and float16x8_t respectively
\|	* packet-math for the above packets with corresponding scalar type Eigen::half
\|	The packet-math functionality has been implemented by Ashutosh Sharma <ashutosh.sharma@amperecomputing.com>. This closes #1940.
*	Clean up packetmath tests and fix various bugs to make bfloat16 pass ↵	Rasmus Munk Larsen	2020-10-09
\| \| \| \|	(almost) all packetmath tests with SSE, AVX, and AVX512.
*	Fix undefined reference to pset1frombits bug on different platforms	guoqiangqi	2020-09-19
\|
*	Fix more mildly embarrassing typos in ARM intrinsics in PacketMath.h.	Rasmus Munk Larsen	2020-09-18
\| \| \|	'vmvnq_u64' does not exist for some reason.
*	Fix typo in PacketMath.h	Rasmus Munk Larsen	2020-09-18
\|
*	Add missing packet op pcmp_lt_or_nan for Packet2d on ARM.	Rasmus Munk Larsen	2020-09-18
\|
*	Add support for CastXML on ARM aarch64	Brad King	2020-09-16
\| \| \| \| \| \| \| \|	CastXML simulates the preprocessors of other compilers, but actually parses the translation unit with an internal Clang compiler. Use the same `vld1q_u64` workaround that we do for Clang. Fixes: #1979
*	Remove old Clang compiler bug work-arounds. The two LLVM bugs referenced in ↵	Benoit Jacob	2020-09-15
\| \| \| \|	the comments here have long been fixed. The workarounds were now detrimental because (1) they prevented using fused mul-add on Clang/ARM32 and (2) the unnecessary 'volatile' in 'asm volatile' prevented legitimate reordering by the compiler.
*	Add plog ops support packet2d for NEON	Guoqiang QI	2020-09-15
\|
*	Add Neon psqrt<Packet2d> and pexp<Packet2d>	Guoqiang QI	2020-09-08
\|