eigen - C++ library for linear algebra

	Commit message (Collapse)	Author	Age
*	Fix pfrexp/pldexp for half.	Antonio Sanchez	2021-01-21
\| \| \| \| \| \| \| \| \| \|	The recent addition of vectorized pow (!330) relies on `pfrexp` and `pldexp`. This was missing for `Eigen::half` and `Eigen::bfloat16`. Adding tests for these packet ops also exposed an issue with handling negative values in `pfrexp`, returning an incorrect exponent. Added the missing implementations, corrected the exponent in `pfrexp1`, and added `packetmath` tests.
*	* Add iterative psqrt<double> for AVX and SSE when FMA is available. This ↵	Rasmus Munk Larsen	2020-12-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	provides a ~10% speedup. * Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA. * Remove iterative psqrt<double> for NEON, because the initial rsqrt apprimation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster. The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx. AVX+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 1% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.08ns ± 1% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 95.5ns ± 0% ~ BM_eigen_sqrt_float/4k 776ns ± 0% 763ns ± 0% -1.67% BM_eigen_sqrt_float/32k 6.57µs ± 1% 6.13µs ± 0% -6.69% BM_eigen_sqrt_float/256k 83.7µs ± 3% 83.3µs ± 2% ~ BM_eigen_sqrt_float/1M 335µs ± 2% 332µs ± 2% ~ SSE+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 0% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.06ns ± 0% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 96.3ns ± 4% ~ BM_eigen_sqrt_float/4k 774ns ± 0% 763ns ± 0% -1.50% BM_eigen_sqrt_float/32k 6.58µs ± 2% 6.11µs ± 0% -7.06% BM_eigen_sqrt_float/256k 82.7µs ± 1% 82.6µs ± 1% ~ BM_eigen_sqrt_float/1M 330µs ± 1% 329µs ± 2% ~ SSE+FMA (double) BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.08ns ± 0% -6.68% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.65% BM_eigen_sqrt_double/512 417ns ± 0% 374ns ± 1% -10.29% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -11.00% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.7µs ± 0% -11.07% BM_eigen_sqrt_double/256k 213µs ± 0% 206µs ± 1% -3.31% BM_eigen_sqrt_double/1M 862µs ± 0% 870µs ± 2% +0.96% AVX+FMA (double) name old cpu/op new cpu/op delta BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.06ns ± 0% -6.95% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.80% BM_eigen_sqrt_double/512 417ns ± 0% 373ns ± 1% -10.59% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -10.79% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.8µs ± 0% -10.94% BM_eigen_sqrt_double/256k 214µs ± 0% 208µs ± 2% -2.76% BM_eigen_sqrt_double/1M 866µs ± 3% 923µs ± 7% ~
*	Add log2() to Eigen.	Rasmus Munk Larsen	2020-12-04
\|
*	Revert "Add log2() operator to Eigen"	Rasmus Munk Larsen	2020-12-03
\| \| \| \|	This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
*	Add log2() operator to Eigen	Rasmus Munk Larsen	2020-12-03
\|
*	Implement missing AVX half ops.	Antonio Sanchez	2020-11-24
\| \| \| \| \| \| \| \|	Minimal implementation of AVX `Eigen::half` ops to bring in line with `bfloat16`. Allows `packetmath_13` to pass. Also adjusted `bfloat16` packet traits to match the supported set of ops (e.g. Bessel is not actually implemented).
*	Add AVX plog<Packet4d> and AVX512 plog<Packet8d> ops,also unified AVX512 ↵	Guoqiang QI	2020-10-15
\| \| \| \|	plog<Packet16f> op with generic api
*	AVX path for BF16	Sheng Yang	2020-07-14
\|
*	1. Fix a bug in psqrt and make it return 0 for +inf arguments.	Rasmus Munk Larsen	2019-11-15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	2. Simplify handling of special cases by taking advantage of the fact that the builtin vrsqrt approximation handles negative, zero and +inf arguments correctly. This speeds up the SSE and AVX implementations by ~20%. 3. Make the Newton-Raphson formula used for rsqrt more numerically robust: Before: y = y * (1.5 - x/2 * y^2) After: y = y * (1.5 - y * (x/2) * y) Forming y^2 can overflow for very large or very small (denormalized) values of x, while x*y ~= 1. For AVX512, this makes it possible to compute accurate results for denormal inputs down to ~1e-42 in single precision. 4. Add a faster double precision implementation for Knights Landing using the vrsqrt28 instruction and a single Newton-Raphson iteration. Benchmark results: https://bitbucket.org/snippets/rmlarsen/5LBq9o
*	Move implementation of vectorized error function erf() to ↵	Rasmus Munk Larsen	2019-09-27
\| \| \| \|	SpecialFunctionsImpl.h.
*	Add generic PacketMath implementation of the Error Function (erf).	Rasmus Munk Larsen	2019-09-19
\|
*	Implement vectorized versions of log1p and expm1 in Eigen using Kahan's ↵	Rasmus Munk Larsen	2019-08-12
\| \| \| \| \| \| \| \| \| \| \| \|	formulas, and change the scalar implementations to properly handle infinite arguments. Depending on instruction set, significant speedups are observed for the vectorized path: log1p wall time is reduced 60-93% (2.5x - 15x speedup) expm1 wall time is reduced 0-85% (1x - 7x speedup) The scalar path is slower by 20-30% due to the extra branch needed to handle +infinity correctly. Full benchmarks measured on Intel(R) Xeon(R) Gold 6154 here: https://bitbucket.org/snippets/rmlarsen/MXBkpM
*	Cleanup	Gael Guennebaud	2018-11-30
\|
*	Extend the generic psin_float code to handle cosine and make SSE and AVX use ↵	Gael Guennebaud	2018-11-30
\| \| \| \|	it (-> this adds pcos for AVX)
*	Unify SSE/AVX psin functions.	Gael Guennebaud	2018-11-27
\| \| \| \| \| \| \| \|	It is based on the SSE version which is much more accurate, though very slightly slower. This changeset also includes the following required changes: - add packet-float to packet-int type traits - add packet float<->int reinterpret casts - add faster pselect for AVX based on blendv
*	cleanup	Gael Guennebaud	2018-11-26
\|
*	Unify SSE and AVX pexp for double.	Gael Guennebaud	2018-11-26
\|
*	Unify SSE and AVX implementation of pexp	Gael Guennebaud	2018-11-26
\|
*	First step toward a unification of packet log implementation, currently only ↵	Gael Guennebaud	2018-11-26
\| \| \| \| \| \|	SSE and AVX are unified. To this end, I added the following functions: pzero, pcmp_*, pfrexp, pset1frombits functions.
*	Fix copy-paste error: Must use _mm256_cmp_ps for AVX.	Rasmus Munk Larsen	2016-10-12
\|
*	Update comment for fast sqrt.	Rasmus Munk Larsen	2016-10-04
\|
*	Fix a bug in the implementation of Carmack's fast sqrt algorithm in Eigen ↵	Rasmus Munk Larsen	2016-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	(enabled by EIGEN_FAST_MATH), which causes the vectorized parts of the computation to return -0.0 instead of NaN for negative arguments. Benchmark speed in Giga-sqrts/s Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz ----------------------------------------- SSE AVX Fast=1 2.529G 4.380G Fast=0 1.944G 1.898G Fast=1 fixed 2.214G 3.739G This table illustrates the worst case in terms speed impact: It was measured by repeatedly computing the sqrt of an n=4096 float vector that fits in L1 cache. For large vectors the operation becomes memory bound and the differences between the different versions almost negligible.
*	Factorize the 4 copies of tanh implementations, make numext::tanh consistent ↵	Gael Guennebaud	2016-08-23
\| \| \| \|	with array::tanh, enable fast tanh in fast-math mode only.
*	Improved implementation of ptanh for SSE and AVX	Benoit Steiner	2016-02-18
\|
*	Avoid implicit cast from double to float.	Benoit Steiner	2016-02-10
\|
*	Optimized implementation of the hyperbolic tangent function for AVX	Benoit Steiner	2016-02-10
\|
*	Workaround compilers that do not even define _mm256_set_m128.	Gael Guennebaud	2015-12-24
\|
*	Fixed a typo.	Benoit Steiner	2015-12-18
\|
*	bug #1140: remove custom definition and use of _mm256_setr_m128	Gael Guennebaud	2015-12-18
\|
*	bug #1069: fix AVX support on MSVC (use of non portable C-style cast)	Gael Guennebaud	2015-09-28
\|
*	Added a double-precision implementation of the exp() function for AVX.	Benoit Steiner	2015-05-04
\|
*	Fixed the optimized AVX implementation of the fast rsqrt function	Benoit Steiner	2015-03-02
\|
*	Added an optimized version of rsqrt for SSE and AVX that is used when ↵	Benoit Steiner	2015-03-02
\| \| \| \|	EIGEN_FAST_MATH is defined.
*	Added support for fast reciprocal square root computation.	Benoit Steiner	2015-02-26
\|
*	Optimized version of the sin(), exp(), log() and sqrt() function for AVX	Benoit Steiner	2015-02-13