path: root/Eigen/src/Core/arch/SSE/MathFunctions.h
Commit message [author, date]
* Improved std::complex sqrt and rsqrt. [Antonio Sanchez, 2021-01-17]
|
| Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously
| `complex_sqrt` was only used for CUDA and MSVC), and implements custom
| `complex_rsqrt`.
|
| Also introduces `numext::rsqrt` to simplify implementation, and modified
| `numext::hypot` to adhere to IEEE/IEC 60559 for special cases.
|
| The `complex_sqrt` and `complex_rsqrt` implementations were found to be
| significantly faster than `std::sqrt<std::complex<T>>` and
| `1/numext::sqrt<std::complex<T>>`.
|
| Benchmark file attached.
| ```
| GCC 10, Intel Xeon, x86_64:
| ---------------------------------------------------------------------------
| Benchmark                                 Time             CPU   Iterations
| ---------------------------------------------------------------------------
| BM_Sqrt<std::complex<float>>           9.21 ns         9.21 ns     73225448
| BM_StdSqrt<std::complex<float>>        17.1 ns         17.1 ns     40966545
| BM_Sqrt<std::complex<double>>          8.53 ns         8.53 ns     81111062
| BM_StdSqrt<std::complex<double>>       21.5 ns         21.5 ns     32757248
| BM_Rsqrt<std::complex<float>>          10.3 ns         10.3 ns     68047474
| BM_DivSqrt<std::complex<float>>        16.3 ns         16.3 ns     42770127
| BM_Rsqrt<std::complex<double>>         11.3 ns         11.3 ns     61322028
| BM_DivSqrt<std::complex<double>>       16.5 ns         16.5 ns     42200711
|
| Clang 11, Intel Xeon, x86_64:
| ---------------------------------------------------------------------------
| Benchmark                                 Time             CPU   Iterations
| ---------------------------------------------------------------------------
| BM_Sqrt<std::complex<float>>           7.46 ns         7.45 ns     90742042
| BM_StdSqrt<std::complex<float>>        16.6 ns         16.6 ns     42369878
| BM_Sqrt<std::complex<double>>          8.49 ns         8.49 ns     81629030
| BM_StdSqrt<std::complex<double>>       21.8 ns         21.7 ns     31809588
| BM_Rsqrt<std::complex<float>>          8.39 ns         8.39 ns     82933666
| BM_DivSqrt<std::complex<float>>        14.4 ns         14.4 ns     48638676
| BM_Rsqrt<std::complex<double>>         9.83 ns         9.82 ns     70068956
| BM_DivSqrt<std::complex<double>>       15.7 ns         15.7 ns     44487798
|
| Clang 9, Pixel 2, aarch64:
| ---------------------------------------------------------------------------
| Benchmark                                 Time             CPU   Iterations
| ---------------------------------------------------------------------------
| BM_Sqrt<std::complex<float>>           24.2 ns         24.1 ns     28616031
| BM_StdSqrt<std::complex<float>>         104 ns          103 ns      6826926
| BM_Sqrt<std::complex<double>>          31.8 ns         31.8 ns     22157591
| BM_StdSqrt<std::complex<double>>        128 ns          128 ns      5437375
| BM_Rsqrt<std::complex<float>>          31.9 ns         31.8 ns     22384383
| BM_DivSqrt<std::complex<float>>        99.2 ns         98.9 ns      7250438
| BM_Rsqrt<std::complex<double>>         46.0 ns         45.8 ns     15338689
| BM_DivSqrt<std::complex<double>>        119 ns          119 ns      5898944
| ```
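For readers unfamiliar with how a hand-written complex square root can avoid the generality of `std::sqrt`, the following is a minimal sketch of one common formulation. It is not taken from the commit and should not be read as Eigen's `complex_sqrt`; the function name `complex_sqrt_sketch` is hypothetical, and the special-case handling is deliberately simplified.

```cpp
#include <cmath>
#include <complex>
#include <limits>

// Hypothetical illustration, not Eigen's implementation.
// For z = x + iy, compute w = sqrt((|x| + |z|) / 2) and derive the other
// component from it, so that only one real sqrt and one hypot are needed.
template <typename T>
std::complex<T> complex_sqrt_sketch(const std::complex<T>& z) {
  const T x = z.real();
  const T y = z.imag();
  // An infinite imaginary part dominates: the result is (+inf, y).
  if (std::isinf(y)) {
    return std::complex<T>(std::numeric_limits<T>::infinity(), y);
  }
  const T w = std::sqrt(T(0.5) * (std::abs(x) + std::hypot(x, y)));
  if (x == T(0)) {
    // Purely imaginary input: both components have magnitude w.
    return std::complex<T>(w, y < T(0) ? -w : w);
  }
  if (x > T(0)) {
    // Real part is w; the imaginary part follows from 2 * re * im = y.
    return std::complex<T>(w, y / (T(2) * w));
  }
  // x < 0 (or NaN): the imaginary part has magnitude w and the sign of y.
  return std::complex<T>(std::abs(y) / (T(2) * w), y < T(0) ? -w : w);
}
```

As a usage example, `complex_sqrt_sketch(std::complex<double>(3.0, 4.0))` yields a value close to `2 + 1i`, and an input of `-4 + 0i` yields a value close to `0 + 2i`.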
* * Add iterative psqrt<double> for AVX and SSE when FMA is available. This provides a ~10% speedup. [Rasmus Munk Larsen, 2020-12-16]
| * Write iterative sqrt explicitly in terms of pmadd. This gives up to 7%
|   speedup for psqrt<float> with AVX & SSE with FMA.
| * Remove iterative psqrt<double> for NEON, because the initial rsqrt
|   approximation is not accurate enough for convergence in 2 Newton-Raphson
|   steps and with 3 steps, just calling the builtin sqrt insn is faster.
|
| The following benchmarks were compiled with clang "-O2 -fast-math -mfma"
| and with and without -mavx.
|
| AVX+FMA (float)
|
| name                      old cpu/op   new cpu/op   delta
| BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 1%     ~
| BM_eigen_sqrt_float/8     2.07ns ± 0%  2.08ns ± 1%     ~
| BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%     ~
| BM_eigen_sqrt_float/512   95.7ns ± 0%  95.5ns ± 0%     ~
| BM_eigen_sqrt_float/4k     776ns ± 0%   763ns ± 0%   -1.67%
| BM_eigen_sqrt_float/32k   6.57µs ± 1%  6.13µs ± 0%   -6.69%
| BM_eigen_sqrt_float/256k  83.7µs ± 3%  83.3µs ± 2%     ~
| BM_eigen_sqrt_float/1M     335µs ± 2%   332µs ± 2%     ~
|
| SSE+FMA (float)
|
| name                      old cpu/op   new cpu/op   delta
| BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 0%     ~
| BM_eigen_sqrt_float/8     2.07ns ± 0%  2.06ns ± 0%     ~
| BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%     ~
| BM_eigen_sqrt_float/512   95.7ns ± 0%  96.3ns ± 4%     ~
| BM_eigen_sqrt_float/4k     774ns ± 0%   763ns ± 0%   -1.50%
| BM_eigen_sqrt_float/32k   6.58µs ± 2%  6.11µs ± 0%   -7.06%
| BM_eigen_sqrt_float/256k  82.7µs ± 1%  82.6µs ± 1%     ~
| BM_eigen_sqrt_float/1M     330µs ± 1%   329µs ± 2%     ~
|
| SSE+FMA (double)
|
| BM_eigen_sqrt_double/1    1.63ns ± 0%  1.63ns ± 0%     ~
| BM_eigen_sqrt_double/8    6.51ns ± 0%  6.08ns ± 0%   -6.68%
| BM_eigen_sqrt_double/64   52.1ns ± 0%  46.5ns ± 1%  -10.65%
| BM_eigen_sqrt_double/512   417ns ± 0%   374ns ± 1%  -10.29%
| BM_eigen_sqrt_double/4k   3.33µs ± 0%  2.97µs ± 1%  -11.00%
| BM_eigen_sqrt_double/32k  26.7µs ± 0%  23.7µs ± 0%  -11.07%
| BM_eigen_sqrt_double/256k  213µs ± 0%   206µs ± 1%   -3.31%
| BM_eigen_sqrt_double/1M    862µs ± 0%   870µs ± 2%   +0.96%
|
| AVX+FMA (double)
|
| name                      old cpu/op   new cpu/op   delta
| BM_eigen_sqrt_double/1    1.63ns ± 0%  1.63ns ± 0%     ~
| BM_eigen_sqrt_double/8    6.51ns ± 0%  6.06ns ± 0%   -6.95%
| BM_eigen_sqrt_double/64   52.1ns ± 0%  46.5ns ± 1%  -10.80%
| BM_eigen_sqrt_double/512   417ns ± 0%   373ns ± 1%  -10.59%
| BM_eigen_sqrt_double/4k   3.33µs ± 0%  2.97µs ± 1%  -10.79%
| BM_eigen_sqrt_double/32k  26.7µs ± 0%  23.8µs ± 0%  -10.94%
| BM_eigen_sqrt_double/256k  214µs ± 0%   208µs ± 2%   -2.76%
| BM_eigen_sqrt_double/1M    866µs ± 3%   923µs ± 7%     ~
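The idea of an iterative sqrt built from a reciprocal-sqrt estimate and fused multiply-adds can be illustrated in scalar code. The sketch below is an assumption-laden stand-in, not the packet code from the commit: the name `sqrt_via_rsqrt_sketch` is hypothetical, a single-precision `1/sqrt(x)` plays the role of the hardware rsqrt estimate, and inputs outside a conservative range are simply forwarded to `std::sqrt`.

```cpp
#include <cmath>

// Hypothetical scalar illustration, not Eigen's psqrt<double>.
// Computes sqrt(x) as x * y, where y approximates 1/sqrt(x) and is refined
// by two Newton-Raphson steps written with fused multiply-add.
inline double sqrt_via_rsqrt_sketch(double x) {
  // Zero, negative, NaN, infinite, and very large or small inputs are
  // deferred to the library routine to keep the sketch simple.
  if (!(x > 1e-30 && x < 1e30)) return std::sqrt(x);

  // Seed with roughly 24 bits of accuracy (assumption: this stands in for
  // a hardware reciprocal-sqrt estimate instruction).
  double y = 1.0 / static_cast<double>(std::sqrt(static_cast<float>(x)));

  const double half_x = 0.5 * x;
  for (int i = 0; i < 2; ++i) {
    // t = 1.5 - (half_x * y) * y, evaluated with a single rounding by fma.
    const double t = std::fma(-(half_x * y), y, 1.5);
    y *= t;
  }
  return x * y;
}
```

Each Newton-Raphson step roughly doubles the number of correct bits, which is why a seed of about 24 bits needs two steps to approach double precision, and why a less accurate seed (as described for NEON above) would need a third.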
* Add log2() to Eigen. [Rasmus Munk Larsen, 2020-12-04]
|
* Revert "Add log2() operator to Eigen" [Rasmus Munk Larsen, 2020-12-03]
|
| This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
* Add log2() operator to Eigen [Rasmus Munk Larsen, 2020-12-03]
|
* Fix boolean float conversion and product warnings. [Antonio Sanchez, 2020-11-24]
|
| This fixes some gcc warnings such as:
| ```
| Eigen/src/Core/GenericPacketMath.h:655:63: warning: implicit conversion turns floating-point number into bool: 'typename __gnu_cxx::__enable_if<__is_integer<bool>::__value, double>::__type' (aka 'double') to 'bool' [-Wimplicit-conversion-floating-point-to-bool]
| Packet psqrt(const Packet& a) { EIGEN_USING_STD(sqrt); return sqrt(a); }
| ```
|
| Details:
| - Added `scalar_sqrt_op<bool>` (`-Wimplicit-conversion-floating-point-to-bool`).
| - Added `scalar_square_op<bool>` and `scalar_cube_op<bool>`
|   specializations (`-Wint-in-bool-context`)
| - Deprecated above specialized ops for bool.
| - Modified `cxx11_tensor_block_eval` to specialize generator for
|   booleans (`-Wint-in-bool-context`) and to use `abs` instead of `square` to
|   avoid deprecated bool ops.
* Add plog ops support packet2d for NEON [Guoqiang QI, 2020-09-15]
|
* Update old links to bitbucket to point to gitlab.com [Gael Guennebaud, 2019-12-04]
|
* 1. Fix a bug in psqrt and make it return 0 for +inf arguments. [Rasmus Munk Larsen, 2019-11-15]
| 2. Simplify handling of special cases by taking advantage of the fact that the
|    builtin vrsqrt approximation handles negative, zero and +inf arguments correctly.
|    This speeds up the SSE and AVX implementations by ~20%.
| 3. Make the Newton-Raphson formula used for rsqrt more numerically robust:
|
|    Before: y = y * (1.5 - x/2 * y^2)
|    After:  y = y * (1.5 - y * (x/2) * y)
|
|    Forming y^2 can overflow for very large or very small (denormalized) values of x,
|    while x*y ~= 1. For AVX512, this makes it possible to compute accurate results
|    for denormal inputs down to ~1e-42 in single precision.
| 4. Add a faster double precision implementation for Knights Landing using the
|    vrsqrt28 instruction and a single Newton-Raphson iteration.
|
| Benchmark results: https://bitbucket.org/snippets/rmlarsen/5LBq9o
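The difference between the "Before" and "After" Newton-Raphson formulas can be reproduced in scalar single precision. The sketch below is illustrative only: the function names are hypothetical, the code is scalar rather than packet-based, and it assumes `float` arithmetic is evaluated in single precision with denormals enabled (the default on x86-64 and AArch64 without fast-math flags).

```cpp
#include <cmath>

// Hypothetical illustration of one rsqrt refinement step, y ~ 1/sqrt(x).

// "Before": forms y*y first. For a denormal x, y is around 1e21 and
// y*y exceeds the largest finite float, so the result is not finite.
inline float rsqrt_step_naive(float x, float y) {
  const float y2 = y * y;
  return y * (1.5f - (0.5f * x) * y2);
}

// "After": multiplies y by x/2 first, which gives a small finite value,
// and only then multiplies by y again; the product stays near 0.5.
inline float rsqrt_step_robust(float x, float y) {
  const float t = y * (0.5f * x);
  return y * (1.5f - t * y);
}
```

For example, with `x = 1e-42f` and a starting estimate within about 1% of `1/sqrt(x)`, the naive step produces an infinity while the robust step returns a finite value that is closer to `1/sqrt(x)` than the starting estimate.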
* Move implementation of vectorized error function erf() to SpecialFunctionsImpl.h. [Rasmus Munk Larsen, 2019-09-27]
|
* Add generic PacketMath implementation of the Error Function (erf). [Rasmus Munk Larsen, 2019-09-19]
|
* Fix compilation without vector engine available (e.g., x86 with SSE disabled): [Gael Guennebaud, 2019-09-05]
|
| -> ppolevl is required by ndtri even for the scalar path
* Implement vectorized versions of log1p and expm1 in Eigen using Kahan's formulas, and change the scalar implementations to properly handle infinite arguments. [Rasmus Munk Larsen, 2019-08-12]
|
| Depending on instruction set, significant speedups are observed for the
| vectorized path:
|
| log1p wall time is reduced 60-93% (2.5x - 15x speedup)
| expm1 wall time is reduced 0-85% (1x - 7x speedup)
|
| The scalar path is slower by 20-30% due to the extra branch needed to handle
| +infinity correctly.
|
| Full benchmarks measured on Intel(R) Xeon(R) Gold 6154 here:
| https://bitbucket.org/snippets/rmlarsen/MXBkpM
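Kahan's formulas for log1p and expm1, including the extra branch for infinite arguments that the message mentions, can be sketched in scalar form as follows. This is a hedged illustration rather than the Eigen code: the names `log1p_kahan_sketch` and `expm1_kahan_sketch` are hypothetical, and the vectorized versions in the commit would use packet selects instead of branches.

```cpp
#include <cmath>

// Hypothetical scalar illustration of Kahan's log1p formula.
inline double log1p_kahan_sketch(double x) {
  const double u = 1.0 + x;
  if (u == 1.0) return x;       // x is too small to change 1.0
  if (std::isinf(u)) return u;  // extra branch: avoids inf/inf = NaN
  // The rounding error in u cancels between log(u) and (u - 1).
  return x * (std::log(u) / (u - 1.0));
}

// Hypothetical scalar illustration of Kahan's expm1 formula.
inline double expm1_kahan_sketch(double x) {
  const double u = std::exp(x);
  if (u == 1.0) return x;        // exp(x) rounded to exactly 1
  const double um1 = u - 1.0;
  if (um1 == -1.0) return -1.0;  // exp(x) underflowed to 0
  if (std::isinf(u)) return u;   // extra branch for overflow and +inf
  return um1 * (x / std::log(u));
}
```

Without the `isinf` branches, an argument of +infinity would produce `inf / inf` and return NaN, which is the case the scalar-path change in this commit is described as fixing at the cost of an extra branch.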
* Extend the generic psin_float code to handle cosine and make SSE and AVX use it (-> this adds pcos for AVX) [Gael Guennebaud, 2018-11-30]
|
* Unify SSE/AVX psin functions. [Gael Guennebaud, 2018-11-27]
|
| It is based on the SSE version which is much more accurate, though very slightly slower.
| This changeset also includes the following required changes:
| - add packet-float to packet-int type traits
| - add packet float<->int reinterpret casts
| - add faster pselect for AVX based on blendv
* cleanup [Gael Guennebaud, 2018-11-26]
|
* Unify SSE and AVX pexp for double. [Gael Guennebaud, 2018-11-26]
|
* Unify SSE and AVX implementation of pexp [Gael Guennebaud, 2018-11-26]
|
* First step toward a unification of packet log implementation, currently only SSE and AVX are unified. [Gael Guennebaud, 2018-11-26]
|
| To this end, I added the following functions: pzero, pcmp_*, pfrexp, pset1frombits functions.
* Misc. source and comment typos [luz.paz, 2018-03-11]
|
| Found using `codespell` and `grep` from downstream FreeCAD
* Update comment for fast sqrt. [Rasmus Munk Larsen, 2016-10-04]
|
* Fix a bug in the implementation of Carmack's fast sqrt algorithm in Eigen (enabled by EIGEN_FAST_MATH), which causes the vectorized parts of the computation to return -0.0 instead of NaN for negative arguments. [Rasmus Munk Larsen, 2016-10-04]
|
| Benchmark speed in Giga-sqrts/s
| Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
| -----------------------------------------
|                SSE      AVX
| Fast=1         2.529G   4.380G
| Fast=0         1.944G   1.898G
| Fast=1 fixed   2.214G   3.739G
|
| This table illustrates the worst case in terms of speed impact: It was measured by
| repeatedly computing the sqrt of an n=4096 float vector that fits in L1 cache.
| For large vectors the operation becomes memory bound and the differences between
| the different versions almost negligible.
* Factorize the 4 copies of tanh implementations, make numext::tanh consistent with array::tanh, enable fast tanh in fast-math mode only. [Gael Guennebaud, 2016-08-23]
|
* Improved implementation of ptanh for SSE and AVX [Benoit Steiner, 2016-02-18]
|
* Avoid implicit cast from double to float. [Benoit Steiner, 2016-02-10]
|
* Optimized implementation of the tanh function for SSE [Benoit Steiner, 2016-02-10]
|
* Make the GCC workaround for sqrt GCC-only; detect Emscripten as non-GCC [Benoit Jacob, 2016-02-10]
|
* Work around Emscripten bug - https://github.com/kripken/emscripten/issues/4088 [Benoit Jacob, 2016-02-10]
|
* Fix compilation on old gcc+AVX [Gael Guennebaud, 2016-01-21]
|
* Add numext::sqrt function to enable custom optimized implementation. [Gael Guennebaud, 2016-01-21]
|
| This changeset adds two specializations for float/double on SSE. Those
| are mostly useful with GCC for which std::sqrt adds an extra and costly
| check on the result of _mm_sqrt_*. Clang does not add this burden.
|
| In this changeset, only DenseBase::norm() makes use of it.
* Added an optimized version of rsqrt for SSE and AVX that is used when EIGEN_FAST_MATH is defined. [Benoit Steiner, 2015-03-02]
|
* Added support for fast reciprocal square root computation. [Benoit Steiner, 2015-02-26]
|
* Remove some dead stores. [Gael Guennebaud, 2015-02-18]
|
* Addendum to bug #859: pexp(NaN) for double did not return NaN, also, plog(NaN) did not return NaN. [Christoph Hertzberg, 2014-10-20]
|
| psqrt(NaN) and psqrt(-1) shall return NaN if EIGEN_FAST_MATH==0
* Fix bug #859: pexp(NaN) returned Inf instead of NaN [Gael Guennebaud, 2014-10-20]
|
* Workaround gcc's default ABI not being able to distinguish between vector types of different sizes. [Gael Guennebaud, 2014-04-22]
|
* fix a few "dead stores" warnings [Gael Guennebaud, 2013-10-26]
|
* typo [Gael Guennebaud, 2013-08-19]
|
* Fix bug #642: add vectorization of sqrt for doubles, and make sqrt really safe if EIGEN_FAST_MATH is disabled [Gael Guennebaud, 2013-08-19]
|
* Make psqrt work with numeric_limits<float>::min [Gael Guennebaud, 2013-06-14]
|
* Fix bug #613: psqrt was incorrect for small numbers [Jeff Dean, 2013-06-13]
|
* Fix SSE plog<float> to return -INF on 0 [Gael Guennebaud, 2013-02-14]
|
* fix warning [Gael Guennebaud, 2012-08-01]
|
* fix lower acceptable bound of SSE pexp for double [Gael Guennebaud, 2012-07-31]
|
* add SSE pexp function for double, make use of _mm_floor_p* for pexp with SSE4.1 [Gael Guennebaud, 2012-07-27]
|
* Automatic relicensing to MPL2 using Keir's script. Manual fixup follows. [Benoit Jacob, 2012-07-13]
|
* fix bug #475: .exp() now returns +inf when overflow occurs (SSE) [Gael Guennebaud, 2012-06-14]
|
* Get rid of include directives inside namespace blocks (bug #339). [Jitse Niesen, 2012-04-15]
|
* bug #86 : use internal:: namespace instead of ei_ prefix [Benoit Jacob, 2010-10-25]
|
* allow vectorization of mat44.col() by adding an InnerPanel boolean template parameter to Block [Gael Guennebaud, 2010-07-23]
|