path: root/Eigen/src/Core/arch/Default
* Get rid of redundant `pabs` instruction in complex square root. (Rasmus Munk Larsen, 2021-06-29)
* Commit 52a5f982 broke conj_helper functionality for HIP GPUs. (Rohit Santhanam, 2021-06-25)
  This commit addresses this.
* Get rid of code duplication for conj_helper. (Rasmus Munk Larsen, 2021-06-24)
  For packets where LhsType=RhsType a single generic implementation suffices. For scalars, the
  generic implementation of pconj automatically forwards to numext::conj, so much of the existing
  specialization can be avoided. For mixed types we still need specializations.
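  A minimal sketch of the single-implementation idea described above, using `std::complex`
  scalars rather than Eigen's packet ops (the names here are illustrative, not Eigen's actual
  `conj_helper`):
  ```cpp
  #include <complex>

  // One template covers all four conjugation combinations when both operands share
  // the same type, by conditionally conjugating each side before the multiply.
  template <typename T, bool ConjLhs, bool ConjRhs>
  struct generic_conj_helper_sketch {
    static std::complex<T> pmul(const std::complex<T>& a, const std::complex<T>& b) {
      return (ConjLhs ? std::conj(a) : a) * (ConjRhs ? std::conj(b) : b);
    }
    static std::complex<T> pmadd(const std::complex<T>& a, const std::complex<T>& b,
                                 const std::complex<T>& c) {
      return c + pmul(a, b);
    }
  };

  // Usage: generic_conj_helper_sketch<float, true, false>::pmul(x, y) == conj(x) * y
  ```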
* Remove pset, replace with ploadu. (Antonio Sanchez, 2021-06-16)
  We can't make guarantees on alignment for existing calls to `pset`, so we should default to
  loading unaligned. But in that case, we should just use `ploadu` directly. For loading
  constants, this load should hopefully get optimized away. This is causing segfaults in
  Google Maps.
* Use bit_cast to create -0.0 for floating point types to avoid compiler optimization changing sign with -ffast-math enabled. (Rasmus Munk Larsen, 2021-06-11)
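  A sketch of the trick: Eigen's helper is `numext::bit_cast`; the `memcpy`-based cast below is a
  portable stand-in. Building `-0.0f` from its bit pattern keeps `-ffast-math` from folding the
  sign away.
  ```cpp
  #include <cstdint>
  #include <cstring>

  template <typename To, typename From>
  To bit_cast_sketch(const From& src) {
    static_assert(sizeof(To) == sizeof(From), "bit_cast requires equal sizes");
    To dst;
    std::memcpy(&dst, &src, sizeof(dst));
    return dst;
  }

  // Sign bit only: yields -0.0f without writing a literal the optimizer can rewrite.
  const float neg_zero = bit_cast_sketch<float>(uint32_t(0x80000000u));
  ```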
* Eliminate `round_impl` double-promotion warnings for c++03. (Antonio Sanchez, 2021-03-25)
* Augment NumTraits with min/max_exponent() again. (Antonio Sanchez, 2021-03-16)
  Replace usage of `std::numeric_limits<...>::min/max_exponent` in the codebase where possible.
  Also replaced some other `numeric_limits` usages in affected tests with the `NumTraits`
  equivalent. The previous MR !443 failed for c++03 due to lack of `constexpr`. Because of this,
  we need to keep the `std::numeric_limits` version around in enum expressions until the switch
  to c++11. Fixes #2148.
* Revert "Augment NumTraits with min/max_exponent()."Gravatar David Tellenbach2021-03-17
| | | | This reverts commit 75ce9cd2a7aefaaea8543e2db14ce4dc149eeb03.
* Augment NumTraits with min/max_exponent(). (Antonio Sanchez, 2021-03-17)
  Replace usage of `std::numeric_limits<...>::min/max_exponent` in the codebase. Also replaced
  some other `numeric_limits` usages in affected tests with the `NumTraits` equivalent.
  Fixes #2148.
* Add fmod(half, half). (Antonio Sanchez, 2021-03-15)
  This is to support TensorFlow's `tf.math.floormod` for half.
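  A plausible shape for such an overload (not the verbatim Eigen source, which lives in Half.h
  and carries device annotations): promote to `float`, defer to the standard `fmod`, and convert
  back.
  ```cpp
  #include <cmath>
  #include <Eigen/Core>

  // Hypothetical free-function sketch of fmod for Eigen::half.
  inline Eigen::half fmod_sketch(const Eigen::half& a, const Eigen::half& b) {
    return Eigen::half(std::fmod(static_cast<float>(a), static_cast<float>(b)));
  }
  ```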
* Add increment/decrement operators to Eigen::half. (Antonio Sanchez, 2021-03-15)
  This is for consistency with bfloat16, and to support initialization with `std::iota`.
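  Roughly what the added operators look like (exact placement and qualifiers in Half.h may
  differ); this is what lets `std::iota` fill a range of `Eigen::half`:
  ```cpp
  #include <Eigen/Core>

  // Pre/post increment and decrement expressed via the existing arithmetic operators.
  inline Eigen::half& operator++(Eigen::half& a) { a = a + Eigen::half(1.f); return a; }
  inline Eigen::half& operator--(Eigen::half& a) { a = a - Eigen::half(1.f); return a; }
  inline Eigen::half operator++(Eigen::half& a, int) { Eigen::half t = a; ++a; return t; }
  inline Eigen::half operator--(Eigen::half& a, int) { Eigen::half t = a; --a; return t; }
  ```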
* Fix ambiguous call to CUDA __half constructor. (Antonio Sanchez, 2021-03-08)
* Fix typo: DEVICE -> GPU (Antonio Sanchez, 2021-03-08)
* Fix non-trivial Half constructor for CUDA. (Antonio Sanchez, 2021-03-08)
  Both CUDA and HIP require trivial default constructors for types used in shared memory.
  Otherwise compilation fails with
  ```
  error: initialization is not supported for __shared__ variables.
  ```
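  An illustration of the constraint with a stand-in struct (not the real `Eigen::half`):
  ```cpp
  // CUDA/HIP require types placed in __shared__ memory to be trivially default
  // constructible, so the default constructor must not touch the payload.
  struct half_sketch {
    half_sketch() = default;    // trivial: `__shared__ half_sketch tile[64];` compiles
    // half_sketch() : x(0) {}  // non-trivial: "initialization is not supported for __shared__ variables"
    unsigned short x;
  };
  ```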
* Changing the Eigen::half implementation for HIP (Deven Desai, 2021-03-05)
  Currently, when compiling with HIP, Eigen::half is derived from the `__half_raw` struct that is
  defined within the hip_fp16.h header file. This is true for both the "host" compile phase and
  the "device" compile phase, and it was causing a very hard to detect bug in the ROCm TensorFlow
  build.

  In the ROCm TensorFlow build,
  * files that do not contain any GPU code get compiled via gcc, and
  * files that contain GPU code get compiled via hipcc.

  In certain cases, a function defined in a file compiled by hipcc is called from a file compiled
  by gcc. If such a function took an Eigen::half as a pass-by-value argument, its value was
  getting corrupted when received by the function.

  The reason seems to be that for the gcc compile, Eigen::half is derived from a `__half_raw`
  struct that has `uint16_t` as the data store, while for hipcc the `__half_raw` implementation
  uses `_Float16` as the data store. There is some ABI incompatibility between gcc and hipcc
  (which is essentially latest clang), which results in the Eigen::half value (correct at the
  call site) getting randomly corrupted when passed to the function.

  Changing the Eigen::half argument to be pass-by-reference works around the error. To fix it so
  that we do not run into it again in TF, this commit changes the Eigen::half implementation to
  use the same `__half_raw` implementation as the non-GPU compile during the host compile phase
  of the hipcc compile.
* Fix rint SSE/NEON again, using optimization barrier. (Antonio Sanchez, 2021-03-05)
  This is a new version of !423, which failed for MSVC.

  Defined `EIGEN_OPTIMIZATION_BARRIER(X)`, which uses inline assembly to prevent operations
  involving `X` from crossing that barrier. It should work on most `GNUC`-compatible compilers
  (MSVC doesn't seem to need this). This is a modified version adapted from what was used in
  `psincos_float` and tested on more platforms (see #1674, https://godbolt.org/z/73ezTG).

  Modified `rint` to use the barrier to prevent the add/subtract rounding trick from being
  optimized away.

  Also fixed an edge case for large inputs that get bumped up a power of two and end up rounding
  away more than just the fractional part. If we are over `2^digits` then just return the input.
  This edge case was missed in the test since the test was comparing approximate equality, which
  was still satisfied. Adding a strict equality option catches it.
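  A scalar sketch of both ideas, assuming an x86 GNUC-style compiler (the macro name, the
  register constraint, and the threshold choice below are illustrative; Eigen's real barrier is
  defined per architecture):
  ```cpp
  #include <cmath>

  #if defined(__GNUC__) && defined(__SSE2__)
  // Inline asm that "uses and redefines" X, so the compiler cannot move or fold
  // arithmetic across it.
  #define OPT_BARRIER_SKETCH(X) __asm__ volatile("" : "+x"(X))
  #else
  #define OPT_BARRIER_SKETCH(X)
  #endif

  float rint_sketch(float x) {
    const float two_p23 = 8388608.0f;   // 2^23: at or above this, every float is an integer
    const float ax = std::fabs(x);
    if (!(ax < two_p23)) return x;      // large values, inf, and NaN pass through unchanged
    float r = ax + two_p23;             // rounds the fraction away (round-to-nearest-even)
    OPT_BARRIER_SKETCH(r);              // keep (ax + c) - c from being folded back to ax
    r -= two_p23;
    return std::copysign(r, x);         // restore the sign, including -0.0
  }
  ```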
* Make half/bfloat16 constructor take inputs by value, fix powerpc test. (Antonio Sanchez, 2021-02-27)
  Since `numeric_limits<half>::max_exponent` is a static inline constant, it cannot be directly
  passed by reference. This triggers a linker error in recent versions of `g++-powerpc64le`.

  Changing `half` to take inputs by value fixes this. Wrapping `max_exponent` with `int(...)` to
  make an addressable integer also fixes this and may help with other custom `Scalar` types
  down the road.

  Also eliminated some compile warnings for powerpc.
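  A small illustration of the `int(...)` workaround (the trait struct and called function are
  hypothetical stand-ins): binding a static constant member to a reference ODR-uses it and can
  demand an out-of-line definition, whereas the cast passes a temporary.
  ```cpp
  struct Traits {                         // stand-in for a numeric_limits<half>-style specialization
    static const int max_exponent = 16;   // in-class declaration only; no out-of-line definition
  };

  void consume(const int& v);             // hypothetical function taking its argument by reference

  void caller() {
    // consume(Traits::max_exponent);     // ODR-uses the member; can fail to link without a definition
    consume(int(Traits::max_exponent));   // materializes a temporary; no definition needed
  }
  ```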
* Fix double-promotion warnings (Christoph Hertzberg, 2021-02-27)
  (cherry picked from commit c22c103e932e511e96645186831363585a44b7a3)
* Fix some CUDA warnings. (Antonio Sanchez, 2021-02-24)
  Added `EIGEN_HAS_STD_HASH` macro, checking for C++11 support and not running on GPU.
  `std::hash<float>` is not a device function, so it cannot be used by `std::hash<bfloat16>`.
  Removed `EIGEN_DEVICE_FUNC` and only define the hash if `EIGEN_HAS_STD_HASH`. Same for `half`.

  Added `EIGEN_CUDA_HAS_FP16_ARITHMETIC` to improve readability and eliminate warnings about
  `EIGEN_CUDA_ARCH` not being defined.

  Replaced a couple of C-style casts with `reinterpret_cast` for aligned loading of `half*` to
  `half2*`. This eliminates `-Wcast-align` warnings in clang. Although not ideal due to potential
  type aliasing, this is how CUDA handles these conversions internally.
* Accurate pow, part 2. (Rasmus Munk Larsen, 2021-02-23)
  This change adds specializations of log2 and exp2 for double that make pow<double> accurate to
  1 ULP. Speed for AVX-512 is within 0.5% of the current implementation.
* Use the Cephes double subtraction trick in pexp<float> even when FMA is available. (Rasmus Munk Larsen, 2021-02-18)
  Otherwise the accuracy drops from 1 ulp to 3 ulp.
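  A sketch of the trick with the classic Cephes constants: `ln(2)` is split into a short, exactly
  representable high part and a small correction, so the argument reduction `x - m*ln(2)` stays
  accurate even for large `m`.
  ```cpp
  // m is round(x / ln(2)); the two-step subtraction keeps the reduced argument accurate.
  float exp_arg_reduce_sketch(float x, float m) {
    const float cephes_exp_C1 = 0.693359375f;     // high bits of ln(2), exactly representable
    const float cephes_exp_C2 = -2.12194440e-4f;  // ln(2) - C1
    float r = x - m * cephes_exp_C1;
    r = r - m * cephes_exp_C2;
    return r;
  }
  ```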
* New accurate algorithm for pow(x,y). (Rasmus Munk Larsen, 2021-02-17)
  This version is accurate to 1.4 ulps for float, while still being 10x faster than std::pow for
  AVX512. A future change will introduce a specialization for double.
* Updated pfrexp implementation. (Antonio Sanchez, 2021-02-17)
  The original implementation fails for 0, denormals, inf, and NaN. See #2150.
* Adjust bounds for pexp_float/double (Antonio Sanchez, 2021-02-10)
  The original clamping bounds on `_x` actually produce finite values:
  ```
  exp(88.3762626647950) = 2.40614e+38  < 3.40282e+38
  exp(709.437)          = 1.27226e+308 < 1.79769e+308
  ```
  so with an accurate `ldexp` implementation, `pexp` fails for large inputs, producing finite
  values instead of `inf`. This adjusts the bounds slightly outside the finite range so that the
  output will overflow to +/- `inf` as expected.
* Fix ldexp implementations. (Antonio Sanchez, 2021-02-10)
  The previous implementations produced garbage values if the exponent did not fit within the
  exponent bits. See #2131 for a complete discussion, and !375 for other possible implementations.

  Here we implement the 4-factor version. See `pldexp_impl` in `GenericPacketMathFunctions.h` for
  a full description.

  The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>` requires `por`.

  Left as a "TODO" is to delegate to a faster version if we know the exponent does fit within the
  exponent bits.

  Fixes #2131.
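  A scalar sketch of the 4-factor idea (the packet code in `GenericPacketMathFunctions.h` works on
  the integer exponent fields directly; the clamp value below is chosen so the smallest subnormal
  can still reach the largest finite value and vice versa):
  ```cpp
  #include <algorithm>
  #include <cstdint>
  #include <cstring>

  // Multiply a by 2^exponent without ever building an out-of-range power of two:
  // split the (clamped) exponent into four parts and apply them as four exact
  // power-of-two multiplies, so overflow/underflow saturates the way ldexp should.
  float ldexp_sketch(float a, int exponent) {
    const int kClamp = 278;                      // > 127 + 149 + 1: covers any meaningful shift for float
    const int e = std::min(std::max(exponent, -kClamp), kClamp);
    const int b = e / 4;                         // each factor stays well inside the normal range
    const auto pow2 = [](int n) {                // exact 2^n built from its bit pattern
      uint32_t bits = static_cast<uint32_t>(n + 127) << 23;
      float f;
      std::memcpy(&f, &bits, sizeof(f));
      return f;
    };
    const float c = pow2(b);
    return ((a * c) * c) * c * pow2(e - 3 * b);  // a * 2^(3b) * 2^(e - 3b)
  }
  ```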
* Add more tests for pow and fix a corner case for huge exponent where the result is always zero or infinite unless x is one. (Rasmus Munk Larsen, 2021-02-05)
* Fix pow and other cwise ops for half/bfloat16. (Antonio Sanchez, 2021-01-22)
  The new `generic_pow` implementation was failing for half/bfloat16 since their construction
  from int/float is not `constexpr`. Modified `GenericPacketMathFunctions` to remove `constexpr`.

  While adding tests for half/bfloat16, found other issues related to implicit conversions.

  Also needed to implement `numext::arg` for non-integer, non-complex, non-float/double/long
  double types. These seem to be implicitly converted to `std::complex<T>`, which then fails for
  half/bfloat16.
* Fix pfrexp/pldexp for half. (Antonio Sanchez, 2021-01-21)
  The recent addition of vectorized pow (!330) relies on `pfrexp` and `pldexp`. These were missing
  for `Eigen::half` and `Eigen::bfloat16`. Adding tests for these packet ops also exposed an issue
  with handling negative values in `pfrexp`, returning an incorrect exponent.

  Added the missing implementations, corrected the exponent in `pfrexp1`, and added `packetmath`
  tests.
* Vectorize `pow(x, y)`. (Rasmus Munk Larsen, 2021-01-18)
  This closes https://gitlab.com/libeigen/eigen/-/issues/2085, which also contains a description
  of the algorithm.

  I ran some testing (comparing to `std::pow(double(x), double(y))`) for `x` in the set of all
  (positive) floats in the interval
  `[std::sqrt(std::numeric_limits<float>::min()), std::sqrt(std::numeric_limits<float>::max())]`,
  and `y` in `{2, sqrt(2), -sqrt(2)}`, and I get the following error statistics:
  ```
  max_rel_error = 8.34405e-07
  rms_rel_error = 2.76654e-07
  ```
  If I widen the range to all normal floats I see lower accuracy for arguments where the result
  is subnormal, e.g. for `y = sqrt(2)`:
  ```
  max_rel_error = 0.666667
  rms           = 6.8727e-05
  count         = 1335165689
  argmax        = 2.56049e-32, 2.10195e-45 != 1.4013e-45
  ```
  which seems reasonable, since these results are subnormals with only a couple of significant
  bits left.
* 1) Provide a better generic paddsub op implementation (Guoqiang QI, 2021-01-13)
  2) Make paddsub op support Packet2cf/Packet4f/Packet2f in NEON
  3) Make paddsub op support Packet2cf/Packet4f in SSE
* Add CUDA complex sqrt. (Antonio Sanchez, 2020-12-22)
  This is to support scalar `sqrt` of complex numbers `std::complex<T>` on device, requested by
  Tensorflow folks.

  Technically `std::complex` is not supported by NVCC on device (though it is by clang), so the
  default `sqrt(std::complex<T>)` function only works on the host. Here we create an overload to
  add back the functionality.

  Also modified the CMake file to add `--relaxed-constexpr` (or equivalent) flag for NVCC to
  allow calling constexpr functions from device functions, and added support for specifying
  compute architecture for NVCC (was already available for clang).
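  A host-side sketch of the principal complex square root that such an overload makes callable in
  kernels (special cases for inf/NaN components are omitted; the real overload additionally
  carries the device annotations):
  ```cpp
  #include <cmath>
  #include <complex>

  template <typename T>
  std::complex<T> complex_sqrt_sketch(const std::complex<T>& z) {
    const T x = z.real();
    const T y = z.imag();
    if (x == T(0) && y == T(0)) return std::complex<T>(T(0), y);  // keep the sign of a zero imaginary part
    const T mag = std::hypot(x, y);
    if (x >= T(0)) {
      const T re = std::sqrt(T(0.5) * (mag + x));
      return std::complex<T>(re, y / (T(2) * re));
    }
    const T im = std::sqrt(T(0.5) * (mag - x));
    return std::complex<T>(std::abs(y) / (T(2) * im), y < T(0) ? -im : im);
  }
  ```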
* Fix implicit cast to double. (Antonio Sanchez, 2020-12-12)
  Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors in Google due to `-Wall`.
* Remove unused macro in Half.h (David Tellenbach, 2020-12-12)
* Replace M_LOG2E and M_LN2 with custom macros. (Antonio Sanchez, 2020-12-11)
  For these to exist we would need to define `_USE_MATH_DEFINES` before `cmath` or `math.h` is
  first included. However, we don't control the include order for projects outside Eigen, so even
  defining the macro in `Eigen/Core` does not fix the issue for projects that end up including
  `<cmath>` before Eigen does (explicitly or transitively). To fix this, we define `EIGEN_LOG2E`
  and `EIGEN_LN2` ourselves.
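  The replacement amounts to self-contained constants; the literals in Eigen may carry more
  digits, but the idea is just this:
  ```cpp
  // No dependence on _USE_MATH_DEFINES or on include order.
  #define EIGEN_LOG2E 1.442695040888963407359924681001892137L  // log2(e)
  #define EIGEN_LN2   0.693147180559945309417232121458176568L  // ln(2)
  ```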
* Implement vectorized complex square root. (Rasmus Munk Larsen, 2020-12-08)
  Closes #1905.

  Measured speedup for sqrt of `complex<float>` on Skylake:

  SSE:
  ```
  name                      old time/op  new time/op  delta
  BM_eigen_sqrt_ctype/1     49.4ns ± 0%  54.3ns ± 0%  +10.01%
  BM_eigen_sqrt_ctype/8      332ns ± 0%    50ns ± 1%  -84.97%
  BM_eigen_sqrt_ctype/64    2.81µs ± 1%  0.38µs ± 0%  -86.49%
  BM_eigen_sqrt_ctype/512   23.8µs ± 0%   3.0µs ± 0%  -87.32%
  BM_eigen_sqrt_ctype/4k     202µs ± 0%    24µs ± 2%  -88.03%
  BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.19ms ± 0%  -88.18%
  BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   1.5ms ± 1%  -88.20%
  BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   6.2ms ± 0%  -88.18%
  ```

  AVX2:
  ```
  name                      old cpu/op   new cpu/op   delta
  BM_eigen_sqrt_ctype/1     53.6ns ± 0%  55.6ns ± 0%   +3.71%
  BM_eigen_sqrt_ctype/8      334ns ± 0%    27ns ± 0%  -91.86%
  BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.22µs ± 2%  -92.28%
  BM_eigen_sqrt_ctype/512   23.8µs ± 1%   1.7µs ± 1%  -92.81%
  BM_eigen_sqrt_ctype/4k     201µs ± 0%    14µs ± 1%  -93.24%
  BM_eigen_sqrt_ctype/32k   1.62ms ± 0%  0.11ms ± 1%  -93.29%
  BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.9ms ± 1%  -93.31%
  BM_eigen_sqrt_ctype/1M    52.0ms ± 0%   3.5ms ± 1%  -93.31%
  ```

  AVX512:
  ```
  name                      old cpu/op   new cpu/op   delta
  BM_eigen_sqrt_ctype/1     53.7ns ± 0%  56.2ns ± 1%   +4.75%
  BM_eigen_sqrt_ctype/8      334ns ± 0%    18ns ± 2%  -94.63%
  BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.12µs ± 1%  -95.54%
  BM_eigen_sqrt_ctype/512   23.9µs ± 1%   1.0µs ± 1%  -95.89%
  BM_eigen_sqrt_ctype/4k     202µs ± 0%     8µs ± 1%  -96.13%
  BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.06ms ± 1%  -96.15%
  BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.5ms ± 4%  -96.11%
  BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   2.0ms ± 1%  -96.13%
  ```
* Add log2() to Eigen. (Rasmus Munk Larsen, 2020-12-04)
* Special function implementations for half/bfloat16 packets. (Antonio Sanchez, 2020-12-04)
  Current implementations fail to consider half-float packets, only half-float scalars. Added
  specializations for packets on AVX, AVX512 and NEON, and added tests to `special_packetmath`.

  The current `special_functions` tests would fail for half and bfloat16 due to lack of
  precision. The NEON tests also fail with precision issues and due to different handling of
  `sqrt(inf)`, so the special functions bessel and ndtri have been disabled there.

  Tested with AVX, AVX512.
* Fix shfl* macros for CUDA/HIP (Antonio Sanchez, 2020-12-04)
  The `shfl*` functions are `__device__` only, so the `#ifdef`s were adjusted so that they are
  defined whenever the corresponding CUDA/HIP ones are.

  Also changed the HIP/CUDA<9.0 versions to cast to int instead of doing the conversion
  `half` <-> `float`.

  Fixes #2083.
* Revert "Add log2() operator to Eigen"Gravatar Rasmus Munk Larsen2020-12-03
| | | | This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
* Add log2() operator to Eigen (Rasmus Munk Larsen, 2020-12-03)
* Small cleanup of generic plog implementations. (Rasmus Munk Larsen, 2020-12-03)
  Adding the term e*ln(2) was split into two steps for no obvious reason. This dates back to the
  original Cephes code from which the algorithm is adapted. It appears that this was done in
  Cephes to prevent the compiler from reordering the addition of the 3 terms in the approximation

    log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x)

  which must be added in reverse order since |x| < (sqrt(2)-1).

  This allows rewriting the code to just 2 pmadd and 1 padd instructions, which on a Skylake
  processor speeds up the code by 5-7%.
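  A scalar sketch of the simplified tail (the real code uses the packet ops `pmadd`/`padd`): with
  `x = m - 1`, `x2 = x*x`, `y = x^3*P(x)/Q(x)` already computed, and `e` the extracted exponent,
  the remaining terms collapse to two fused multiply-adds and one add.
  ```cpp
  #include <cmath>

  float plog_tail_sketch(float x, float x2, float y, float e) {
    const float ln2 = 0.6931471805599453f;
    y = std::fma(-0.5f, x2, y);  // y = -0.5*x^2 + x^3*P(x)/Q(x)      (pmadd)
    x = x + y;                   // x = x - 0.5*x^2 + x^3*P(x)/Q(x)   (padd)
    return std::fma(e, ln2, x);  // log(m) + e*ln(2)                  (pmadd)
  }
  ```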
* Implement CUDA __shfl* for Eigen::half (Antonio Sanchez, 2020-12-01)
  Prior to this fix, `TensorContractionGpu` and the `cxx11_tensor_of_float16_gpu` test are broken,
  as well as several ops in Tensorflow. The gpu functions `__shfl*` became ambiguous now that
  `Eigen::half` implicitly converts to float. Here we add the required specializations.
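  One of the added specializations might look roughly like this (the function name is a
  placeholder and the exact Eigen code differs per CUDA version): shuffle the 16 raw bits instead
  of letting the argument decay to `float`.
  ```cpp
  #include <Eigen/Core>

  #if defined(__CUDACC__) && defined(__CUDA_ARCH__)
  __device__ inline Eigen::half shfl_sync_half_sketch(unsigned mask, Eigen::half var,
                                                      int srcLane, int width = warpSize) {
    // Round-trip through the raw 16-bit pattern so the unsigned __shfl_sync overload is used.
    const unsigned bits = Eigen::numext::bit_cast<unsigned short>(var);
    const unsigned shuffled = __shfl_sync(mask, bits, srcLane, width);
    return Eigen::numext::bit_cast<Eigen::half>(static_cast<unsigned short>(shuffled));
  }
  #endif
  ```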
* Fix #2077, `EIGEN_CONSTEXPR` in `Half`. (Antonio Sanchez, 2020-12-01)
  `bit_cast` cannot be `constexpr`, so we need to remove `EIGEN_CONSTEXPR` from
  `raw_half_as_uint16(...)`. This shouldn't affect anything else, since it is only used in a
  `bit_cast<uint16_t,half>()` which is not itself `constexpr`. Fixes #2077.
* Revert "Fix Half NaN definition and test."Gravatar Rasmus Munk Larsen2020-11-24
| | | | This reverts commit c770746d709686ef2b8b652616d9232f9b028e78.
* Fix Half NaN definition and test. (Rasmus Munk Larsen, 2020-11-24)
  The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due to a bad NaN
  bit-pattern comparison (in the case of casting a float to `__fp16`, the signaling `NaN` is
  quieted). There was also an inconsistency between `numeric_limits<half>::quiet_NaN()` and
  `NumTraits::quiet_NaN()`.

  Here we correct the inconsistency and compare NaNs according to the IEEE 754 definition.

  Also modified the `bfloat16_float` test to match.

  Tested with `cortex-a53` and `cortex-a55`.
* Implement missing AVX half ops. (Antonio Sanchez, 2020-11-24)
  Minimal implementation of AVX `Eigen::half` ops to bring it in line with `bfloat16`. Allows
  `packetmath_13` to pass.

  Also adjusted `bfloat16` packet traits to match the supported set of ops (e.g. Bessel is not
  actually implemented).
* Fix Half NaN definition and test. (Antonio Sanchez, 2020-11-23)
  The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due to a bad NaN
  bit-pattern comparison (in the case of casting a float to `__fp16`, the signaling `NaN` is
  quieted). There was also an inconsistency between `numeric_limits<half>::quiet_NaN()` and
  `NumTraits::quiet_NaN()`.

  Here we correct the inconsistency and compare NaNs according to the IEEE 754 definition.

  Also modified the `bfloat16_float` test to match.

  Tested with `cortex-a53` and `cortex-a55`.
* Remove explicit casts from Eigen::half and Eigen::bfloat16 to bool (David Tellenbach, 2020-11-19)
  Both Eigen::half and Eigen::bfloat16 are implicitly convertible to float and can hence be
  converted to bool via the conversion chain

    Eigen::{half,bfloat16} -> float -> bool

  We thus remove the explicit cast operator to bool.
* Re-enable Arm Neon Eigen::half packets of size 8 (David Tellenbach, 2020-11-18)
  - Add predux_half_dowto4
  - Remove explicit casts in Half.h to match the behaviour of BFloat16.h
  - Enable more packetmath tests for Eigen::half
* Add bit_cast for half/bfloat to/from uint16_t, fix TensorRandom (Antonio Sanchez, 2020-11-18)
  The existing `TensorRandom.h` implementation makes the assumption that `half` (`bfloat16`) has
  a `uint16_t` member `x` (`value`), which is not always true. This currently fails on arm64,
  where `x` has type `__fp16`. Added `bit_cast` specializations to allow casting to/from
  `uint16_t` for both `half` and `bfloat16`.

  Also added tests in `half_float`, `bfloat16_float`, and `cxx11_tensor_random` to catch these
  errors in the future.
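  A sketch of the consuming fix (the helper function is hypothetical; the commented-out line is
  the old pattern that breaks on arm64):
  ```cpp
  #include <cstdint>
  #include <Eigen/Core>

  uint16_t half_to_bits(const Eigen::half& h) {
    // return h.x;                                // assumes a uint16_t member; it is __fp16 on arm64
    return Eigen::numext::bit_cast<uint16_t>(h);  // works regardless of the underlying storage
  }
  ```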