diff options
author | Rasmus Munk Larsen <rmlarsen@google.com> | 2019-12-16 21:33:42 +0000 |
---|---|---|
committer | Rasmus Munk Larsen <rmlarsen@google.com> | 2019-12-16 21:33:42 +0000 |
commit | a5660744801116ad2f9ab4e9e389f194ba307a35 (patch) | |
tree | a7150bf1b016baab78b82ab8d2195b5e91916126 /Eigen/src/Core/arch/AVX512 | |
parent | 8e5da71466591cc24352782b08dc78ddb94f0717 (diff) |
Improve accuracy of fast approximate tanh and the logistic functions in Eigen, such that they preserve relative accuracy to within a few ULPs where their function values tend to zero (around x=0 for tanh, and for large negative x for the logistic function).
This change re-instates the fast rational approximation of the logistic function for float32 in Eigen (removed in https://gitlab.com/libeigen/eigen/commit/66f07efeaed39d6a67005343d7e0caf7d9eeacdb), but uses the more accurate approximation 1/(1+exp(-1)) ~= exp(x) below -9. The exponential is only calculated on the vectorized path if at least one element in the SIMD input vector is less than -9.
This change also contains a few improvements to speed up the original float specialization of logistic:
- Introduce EIGEN_PREDICT_{FALSE,TRUE} for __builtin_predict and use it to predict that the logistic-only path is most likely (~2-3% speedup for the common case).
- Carefully set the upper clipping point to the smallest x where the approximation evaluates to exactly 1. This saves the explicit clamping of the output (~7% speedup).
The increased accuracy for tanh comes at a cost of 10-20% depending on instruction set.
The benchmarks below repeated calls
u = v.logistic() (u = v.tanh(), respectively)
where u and v are of type Eigen::ArrayXf, have length 8k, and v contains random numbers in [-1,1].
Benchmark numbers for logistic:
Before:
Benchmark Time(ns) CPU(ns) Iterations
-----------------------------------------------------------------
SSE
BM_eigen_logistic_float 4467 4468 155835 model_time: 4827
AVX
BM_eigen_logistic_float 2347 2347 299135 model_time: 2926
AVX+FMA
BM_eigen_logistic_float 1467 1467 476143 model_time: 2926
AVX512
BM_eigen_logistic_float 805 805 858696 model_time: 1463
After:
Benchmark Time(ns) CPU(ns) Iterations
-----------------------------------------------------------------
SSE
BM_eigen_logistic_float 2589 2590 270264 model_time: 4827
AVX
BM_eigen_logistic_float 1428 1428 489265 model_time: 2926
AVX+FMA
BM_eigen_logistic_float 1059 1059 662255 model_time: 2926
AVX512
BM_eigen_logistic_float 673 673 1000000 model_time: 1463
Benchmark numbers for tanh:
Before:
Benchmark Time(ns) CPU(ns) Iterations
-----------------------------------------------------------------
SSE
BM_eigen_tanh_float 2391 2391 292624 model_time: 4242
AVX
BM_eigen_tanh_float 1256 1256 554662 model_time: 2633
AVX+FMA
BM_eigen_tanh_float 823 823 866267 model_time: 1609
AVX512
BM_eigen_tanh_float 443 443 1578999 model_time: 805
After:
Benchmark Time(ns) CPU(ns) Iterations
-----------------------------------------------------------------
SSE
BM_eigen_tanh_float 2588 2588 273531 model_time: 4242
AVX
BM_eigen_tanh_float 1536 1536 452321 model_time: 2633
AVX+FMA
BM_eigen_tanh_float 1007 1007 694681 model_time: 1609
AVX512
BM_eigen_tanh_float 471 471 1472178 model_time: 805
Diffstat (limited to 'Eigen/src/Core/arch/AVX512')
-rw-r--r-- | Eigen/src/Core/arch/AVX512/PacketMath.h | 30 |
1 files changed, 24 insertions, 6 deletions
diff --git a/Eigen/src/Core/arch/AVX512/PacketMath.h b/Eigen/src/Core/arch/AVX512/PacketMath.h index 19c03cfe2..a53f9bc6a 100644 --- a/Eigen/src/Core/arch/AVX512/PacketMath.h +++ b/Eigen/src/Core/arch/AVX512/PacketMath.h @@ -365,6 +365,12 @@ EIGEN_STRONG_INLINE Packet16f cat256(Packet8f a, Packet8f b) { } #endif +template <> +EIGEN_STRONG_INLINE Packet16f pcmp_eq(const Packet16f& a, const Packet16f& b) { + __mmask16 mask = _mm512_cmp_ps_mask(a, b, _CMP_EQ_OQ); + return _mm512_castsi512_ps( + _mm512_mask_set1_epi32(_mm512_set1_epi32(0), mask, 0xffffffffu)); +} template<> EIGEN_STRONG_INLINE Packet16f pcmp_le(const Packet16f& a, const Packet16f& b) { __mmask16 mask = _mm512_cmp_ps_mask(a, b, _CMP_LE_OQ); return _mm512_castsi512_ps( @@ -388,12 +394,6 @@ template<> EIGEN_STRONG_INLINE Packet16i pcmp_eq(const Packet16i& a, const Packe return _mm512_mask_set1_epi32(_mm512_set1_epi32(0), mask, 0xffffffffu); } -template <> -EIGEN_STRONG_INLINE Packet16f pcmp_eq(const Packet16f& a, const Packet16f& b) { - __mmask16 mask = _mm512_cmp_ps_mask(a, b, _CMP_EQ_OQ); - return _mm512_castsi512_ps( - _mm512_mask_set1_epi32(_mm512_set1_epi32(0), mask, 0xffffffffu)); -} template <> EIGEN_STRONG_INLINE Packet8d pcmp_eq(const Packet8d& a, const Packet8d& b) { @@ -401,6 +401,24 @@ EIGEN_STRONG_INLINE Packet8d pcmp_eq(const Packet8d& a, const Packet8d& b) { return _mm512_castsi512_pd( _mm512_mask_set1_epi64(_mm512_set1_epi64(0), mask, 0xffffffffffffffffu)); } +template <> +EIGEN_STRONG_INLINE Packet8d pcmp_le(const Packet8d& a, const Packet8d& b) { + __mmask8 mask = _mm512_cmp_pd_mask(a, b, _CMP_LE_OQ); + return _mm512_castsi512_pd( + _mm512_mask_set1_epi64(_mm512_set1_epi64(0), mask, 0xffffffffffffffffu)); +} +template <> +EIGEN_STRONG_INLINE Packet8d pcmp_lt(const Packet8d& a, const Packet8d& b) { + __mmask8 mask = _mm512_cmp_pd_mask(a, b, _CMP_LT_OQ); + return _mm512_castsi512_pd( + _mm512_mask_set1_epi64(_mm512_set1_epi64(0), mask, 0xffffffffffffffffu)); +} +template <> +EIGEN_STRONG_INLINE Packet8d pcmp_lt_or_nan(const Packet8d& a, const Packet8d& b) { + __mmask8 mask = _mm512_cmp_pd_mask(a, b, _CMP_NGT_UQ); + return _mm512_castsi512_pd( + _mm512_mask_set1_epi64(_mm512_set1_epi64(0), mask, 0xffffffffffffffffu)); +} template <> EIGEN_STRONG_INLINE Packet16i ptrue<Packet16i>(const Packet16i& /*a*/) { |