path: root/Eigen/src/Core/arch
Commit message | Author | Age
* Changing the storage of the SSE complex packets to that of the wrapper. | guoqiangqi | 2021-05-10
| This should fix #2242.
* Revert addition of unused `paddsub<Packet2cf>`. This fixes #2242. | Christoph Hertzberg | 2021-05-06
|
* Better CUDA complex division. | Antonio Sanchez | 2021-04-29
| | | | | | The original produced NaNs when dividing 0/b for subnormal b. The `complex_divide_stable` was changed to use the more common Smith's algorithm.
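For readers unfamiliar with Smith's algorithm, here is a minimal scalar sketch. The helper name `smith_divide` is hypothetical and this is not Eigen's `complex_divide_stable`; it only illustrates the idea of scaling by the ratio of the smaller to the larger divisor component so that `c*c + d*d` is never formed directly.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper (not Eigen's implementation): Smith's algorithm for
// (a + bi) / (c + di). The divisor is scaled by r = smaller/larger component,
// so the squared magnitude c*c + d*d, which can underflow or overflow,
// is never computed.
template <typename T>
std::complex<T> smith_divide(const std::complex<T>& x, const std::complex<T>& y) {
  const T a = x.real(), b = x.imag();
  const T c = y.real(), d = y.imag();
  if (std::abs(c) >= std::abs(d)) {
    const T r = d / c;        // |r| <= 1
    const T den = c + d * r;  // equals (c*c + d*d) / c
    return std::complex<T>((a + b * r) / den, (b - a * r) / den);
  } else {
    const T r = c / d;        // |r| < 1
    const T den = c * r + d;  // equals (c*c + d*d) / d
    return std::complex<T>((a * r + b) / den, (b * r - a) / den);
  }
}
```

With a subnormal divisor such as `1e-40f`, the naive formula would square the divisor and underflow to zero, giving `0/0`; the scaled form divides by the divisor itself and returns zero for a zero numerator (assuming subnormals are not flushed to zero).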
* Add missing pcmp_lt_or_nan for NEON Packet4bf. | Antonio Sanchez | 2021-04-27
|
* Tests added and AVX512 bug fixed for pcmp_lt_or_nan | Jakub Lichman | 2021-04-25
|
* Fix taking address of rvalue compiler issue with TensorFlow (plus other warnings). | Chip-Kerchner | 2021-04-21
|
* HasExp added for AVX512 Packet8d | Jakub Lichman | 2021-04-20
|
* Fix ldexp for AVX512 (#2215) | Antonio Sanchez | 2021-04-20
| | | | | | | Wrong shuffle was used. Need to interleave low/high halves with a `permute` instruction. Fixes #2215.
* Avoid using uninitialized inputs and, if available, use the slightly more efficient `movsd` instruction for `pset1<Packet2cf>`. | Christoph Hertzberg | 2021-04-13
|
* Fix address of temporary object errors in clang11. | Chip Kerchner | 2021-04-02
| | | | This fixes the problem with taking the address of temporary objects which clang11 treats as errors.
* Eliminate `round_impl` double-promotion warnings for c++03. | Antonio Sanchez | 2021-03-25
|
* Fixed performance issues for complex VSX and P10 MMA in gebp_kernel (level 3). | Chip Kerchner | 2021-03-25
|
* Revert "Uses _mm512_abs_pd for Packet8d pabs" | Christoph Hertzberg | 2021-03-23
| | | This reverts commit f019b97aca82071f35726b1aaebf1c598770f0f5
* Remove yet another comma at end of enum | David Tellenbach | 2021-03-18
|
* Uses _mm512_abs_pd for Packet8d pabs | Steve Bronder | 2021-03-18
|
* Augment NumTraits with min/max_exponent() again. | Antonio Sanchez | 2021-03-16
| Replace usage of `std::numeric_limits<...>::min/max_exponent` in codebase where possible. Also replaced some other `numeric_limits` usages in affected tests with the `NumTraits` equivalent.
|
| The previous MR !443 failed for c++03 due to lack of `constexpr`. Because of this, we need to keep around the `std::numeric_limits` version in enum expressions until the switch to c++11.
|
| Fixes #2148
* Fix another warning on missing commas | David Tellenbach | 2021-03-17
|
* Revert "Augment NumTraits with min/max_exponent()." | David Tellenbach | 2021-03-17
| | | | This reverts commit 75ce9cd2a7aefaaea8543e2db14ce4dc149eeb03.
* Augment NumTraits with min/max_exponent(). | Antonio Sanchez | 2021-03-17
| | | | | | | | Replace usage of `std::numeric_limits<...>::min/max_exponent` in codebase. Also replaced some other `numeric_limits` usages in affected tests with the `NumTraits` equivalent. Fixes #2148
* Silence warning on comma at end of enumerator list | David Tellenbach | 2021-03-17
|
* Add fmod(half, half). | Antonio Sanchez | 2021-03-15
| | | | This is to support TensorFlow's `tf.math.floormod` for half.
* Fix pround and add print | Chip Kerchner | 2021-03-15
|
* Fix NVCC+ICC issues. | Antonio Sanchez | 2021-03-15
| NVCC does not understand `__forceinline`, so we need to use `inline` when compiling for GPU.
|
| ICC specializes `std::complex` operators for `float` and `double` by default, which cannot be used on device and conflict with Eigen's workaround in CUDA/Complex.h. This can be prevented by defining `_OVERRIDE_COMPLEX_SPECIALIZATION_` before including `<complex>`. Added this define to the tests and to `Eigen/Core`, but this will not work if the user includes `<complex>` before `<Eigen/Core>`.
|
| ICC also seems to generate a duplicate `Map` symbol in `PlainObjectBase`:
| ```
| error: "Map" has already been declared in the current scope
| static ConstMapType Map(const Scalar *data)
| ```
| I tracked this down to `friend class Eigen::Map`. Putting the `friend` statements at the bottom of the class seems to resolve this issue.
|
| Fixes #2180
* Add increment/decrement operators to Eigen::half. | Antonio Sanchez | 2021-03-15
| | | | | This is for consistency with bfloat16, and to support initialization with `std::iota`.
* Fix ambiguous call to CUDA __half constructor. | Antonio Sanchez | 2021-03-08
|
* Fix typo: DEVICE -> GPU | Antonio Sanchez | 2021-03-08
|
* Fix non-trivial Half constructor for CUDA. | Antonio Sanchez | 2021-03-08
| Both CUDA and HIP require trivial default constructors for types used in shared memory. Otherwise, compilation fails with
| ```
| error: initialization is not supported for __shared__ variables.
| ```
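The trivial-default-constructor requirement can be checked on the host with a type trait. The sketch below uses two hypothetical structs (neither is Eigen's actual `half` storage type) to show which kind of constructor counts as trivial; it does not reproduce the CUDA/HIP diagnostic itself.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for a 16-bit storage type; not Eigen's definitions.
struct half_like_trivial {
  std::uint16_t x;
  half_like_trivial() = default;  // defaulted: trivial, leaves x uninitialized
};

struct half_like_nontrivial {
  std::uint16_t x;
  half_like_nontrivial() : x(0) {}  // user-provided: non-trivial
};

// Only the first type satisfies the "trivial default constructor" condition
// that the commit message says CUDA and HIP impose on __shared__ variables.
static_assert(std::is_trivially_default_constructible<half_like_trivial>::value,
              "defaulted constructor is trivial");
static_assert(!std::is_trivially_default_constructible<half_like_nontrivial>::value,
              "user-provided constructor is not trivial");
```

The trade-off is that a trivially constructed value is uninitialized until assigned, so callers must set it explicitly.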
* Changing the Eigen::half implementation for HIP | Deven Desai | 2021-03-05
| Currently, when compiling with HIP, Eigen::half is derived from the `__half_raw` struct that is defined within the hip_fp16.h header file. This is true for both the "host" compile phase and the "device" compile phase. This was causing a very hard to detect bug in the ROCm TensorFlow build.
|
| In the ROCm TensorFlow build,
| * files that do not contain any GPU code get compiled via gcc, and
| * files that contain GPU code get compiled via hipcc.
|
| In certain cases, we have a function that is defined in a file that is compiled by hipcc, and is called in a file that is compiled by gcc. If such a function had Eigen::half as a "pass-by-value" argument, its value was getting corrupted when received by the function.
|
| The reason for this seems to be that for the gcc compile, Eigen::half is derived from a `__half_raw` struct that has `uint16_t` as the data store, and for hipcc the `__half_raw` implementation uses `_Float16` as the data store. There is some ABI incompatibility between gcc / hipcc (which is essentially latest clang), which results in the Eigen::half value (which is correct at the call site) getting randomly corrupted when passed to the function.
|
| Changing the Eigen::half argument to be "pass by reference" seems to work around the error.
|
| In order to fix it such that we do not run into it again in TF, this commit changes the Eigen::half implementation to use the same `__half_raw` implementation as the non-GPU compile, during the host compile phase of the hipcc compile.
* Fix rint SSE/NEON again, using optimization barrier. | Antonio Sanchez | 2021-03-05
| This is a new version of !423, which failed for MSVC.
|
| Defined `EIGEN_OPTIMIZATION_BARRIER(X)` that uses inline assembly to prevent operations involving `X` from crossing that barrier. Should work on most `GNUC` compatible compilers (MSVC doesn't seem to need this). This is a modified version adapted from what was used in `psincos_float` and tested on more platforms (see #1674, https://godbolt.org/z/73ezTG).
|
| Modified `rint` to use the barrier to prevent the add/subtract rounding trick from being optimized away.
|
| Also fixed an edge case for large inputs that get bumped up a power of two and end up rounding away more than just the fractional part. If we are over `2^digits` then just return the input. This edge case was missed in the test since the test was comparing approximate equality, which was still satisfied. Adding a strict equality option catches it.
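A scalar sketch of the add/subtract rounding trick with a barrier and the large-input guard follows. The function name `rint_sketch`, the inline `asm` form, and the threshold of `2^23` (the number of explicit mantissa bits of `float`) are assumptions for illustration; Eigen's packet code and its `EIGEN_OPTIMIZATION_BARRIER` macro differ in detail.

```cpp
#include <cmath>
#include <limits>

// Hypothetical scalar illustration, not Eigen's packet implementation.
inline float rint_sketch(float a) {
  // 2^23: at or above this magnitude every float is already an integer.
  const float limit =
      static_cast<float>(1 << (std::numeric_limits<float>::digits - 1));
  const float abs_a = std::fabs(a);
  // Large inputs (and NaN/inf, for which the comparison is false) are
  // returned unchanged instead of being bumped into the next binade.
  if (!(abs_a < limit)) return a;
  // In [2^23, 2^24) the spacing between floats is 1, so this addition rounds
  // the fractional part away according to the current rounding mode.
  float r = abs_a + limit;
#if defined(__GNUC__)
  // Empty asm that claims to read and write r: the compiler cannot fold
  // (abs_a + limit) - limit back into abs_a. The "+m" constraint is a
  // portable but memory-bound choice made for this sketch.
  __asm__ volatile("" : "+m"(r));
#endif
  r -= limit;
  return std::copysign(r, a);  // restore the sign, including -0.0f
}
```

Under the default round-to-nearest-even mode, `rint_sketch(2.5f)` gives `2.0f` and `rint_sketch(3.5f)` gives `4.0f`, while `rint_sketch(8388609.0f)` returns its input exactly, which is the case a strict-equality test would catch.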
* Revert "Fix rint for SSE/NEON." | Antonio Sánchez | 2021-03-03
| | | This reverts commit e72dfeb8b9fa5662831b5d0bb9d132521f9173dd
* Fix rint for SSE/NEON. | Antonio Sanchez | 2021-03-03
| It seems *sometimes* with aggressive optimizations the combination `psub(padd(a, b), b)` trick to force rounding is compiled away. Here we replace with inline assembly to prevent this (I tried `volatile`, but that leads to additional loads from memory).
|
| Also fixed an edge case for large inputs `a` where adding `b` bumps the value up a power of two and ends up rounding away more than just the fractional part. If we are over `2^digits` then just return the input. This edge case was missed in the test since the test was comparing approximate equality, which was still satisfied. Adding a strict equality option catches it.
* Add print for SSE/NEON, use NEON rounding intrinsics if available. | Antonio Sanchez | 2021-02-27
| | | | | | | | | | In SSE, by adding/subtracting 2^MantissaBits, we force rounding according to the current rounding mode. For NEON, we use the provided intrinsics for rint/floor/ceil if available (armv8). Related to #1969.
* Make half/bfloat16 constructor take inputs by value, fix powerpc test. | Antonio Sanchez | 2021-02-27
| Since `numeric_limits<half>::max_exponent` is a static inline constant, it cannot be directly passed by reference. This triggers a linker error in recent versions of `g++-powerpc64le`.
|
| Changing `half` to take inputs by value fixes this. Wrapping `max_exponent` with `int(...)` to make an addressable integer also fixes this and may help with other custom `Scalar` types down-the-road.
|
| Also eliminated some compile warnings for powerpc.
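The `int(...)` wrapping mentioned above can be illustrated with a small sketch. `fake_limits` and `clamp_exponent` are hypothetical names and the value `16` is only a placeholder; the point is that binding a static constant to a `const int&` parameter odr-uses it, which before C++17 requires an out-of-class definition, whereas `int(...)` passes a temporary copy.

```cpp
#include <algorithm>

// Hypothetical stand-in for numeric_limits<half>; not Eigen's definition.
struct fake_limits {
  // Initialized in-class with no out-of-class definition. Before C++17 this
  // is only a declaration, so code that needs its address may fail to link.
  static const int max_exponent = 16;
};

inline int clamp_exponent(int e) {
  // std::min takes both arguments by const reference. Passing
  // fake_limits::max_exponent directly would bind a reference to the member;
  // int(...) creates a temporary with the same value instead.
  return std::min(e, int(fake_limits::max_exponent));
}
```

Taking the argument by value, as the commit does for the `half` constructor, avoids the reference binding in the same way.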
* Fix double-promotion warnings | Christoph Hertzberg | 2021-02-27
| | | | (cherry picked from commit c22c103e932e511e96645186831363585a44b7a3)
* Fix NEON sqrt for 32-bit, add prsqrt. | Antonio Sanchez | 2021-02-26
| With !406, we accidentally broke arm 32-bit NEON builds, since `vsqrt_f32` is only available for 64-bit.
|
| Here we add back the `rsqrt` implementation for 32-bit, relying on a `prsqrt` implementation with better handling of edge cases.
|
| Note that several of the 32-bit NEON packet tests are currently failing - either due to denormal handling (NEON versions flush to zero, but scalar paths don't) or due to accuracy (e.g. sin/cos).
* Fix floor/ceil for NEON fp16. | Antonio Sanchez | 2021-02-25
| | | | Forgot to test this. Fixes bug introduced in !416.
* Fix SSE/NEON pfloor/pceil for saturated values. | Antonio Sanchez | 2021-02-25
| The original will saturate if the input does not fit into an integer type. Here we fix this, returning the input if it doesn't have enough precision to have a fractional part.
|
| Also added `pceil` for NEON.
|
| Fixes #1969.
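As a rough scalar illustration of the fix, assume `floor` is built on a float-to-int conversion (an assumption for this sketch; `floor_sketch` is a hypothetical name, not Eigen's `pfloor`). The guard returns inputs that are too large to carry a fractional part before the conversion can go out of range.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical scalar illustration, not Eigen's packet implementation.
inline float floor_sketch(float a) {
  // 2^23: floats at or above this magnitude have no fractional bits.
  const float limit =
      static_cast<float>(1 << (std::numeric_limits<float>::digits - 1));
  // Return such inputs (and NaN/inf) unchanged rather than converting them
  // to an integer type they may not fit into.
  if (!(std::fabs(a) < limit)) return a;
  // Safe now: |a| < 2^23 fits in int32_t. The conversion truncates toward zero.
  const float t = static_cast<float>(static_cast<std::int32_t>(a));
  // Truncation rounds negative non-integers up, so step down by one.
  return (t > a) ? t - 1.0f : t;
}
```

For example, `floor_sketch(3e9f)` returns `3e9f` unchanged, whereas converting it to a 32-bit integer first would not preserve the value. This sketch does not preserve the sign of `-0.0f`.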
* Fix clang compile when no MMA flags are set. Simplify MMA compiler detection. | Chip-Kerchner | 2021-02-24
|
* Having forward template function declarations in a P10 file causes bad code in certain situations. | Chip-Kerchner | 2021-02-24
|
* Fixes to support old and new versions of the compilers for built-ins. | Chip-Kerchner | 2021-02-24
| Cast to non-const when using vector_pair with certain built-ins.
* Disable fast psqrt for NEON. | Antonio Sanchez | 2021-02-23
| | | | | | | Accuracy is too poor - requires at least two Newton iterations, but then it is no longer significantly faster than `vsqrt`. Fixes #2094.
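To make the accuracy argument concrete, here is a scalar sketch of Newton-Raphson refinement of a reciprocal square root estimate. The function names and the way the initial estimate `y0` is supplied are assumptions for illustration; on NEON the estimate would typically come from a hardware instruction, which this sketch does not call.

```cpp
#include <cmath>

// One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
// Each step roughly squares the relative error of the estimate.
inline float rsqrt_newton_step(float x, float y) {
  return y * (1.5f - 0.5f * x * y * y);
}

// Hypothetical helper: refine a crude estimate y0 of 1/sqrt(x) with `iters`
// steps, then recover sqrt(x) as x * (1/sqrt(x)).
inline float sqrt_via_rsqrt(float x, float y0, int iters) {
  float y = y0;
  for (int i = 0; i < iters; ++i) y = rsqrt_newton_step(x, y);
  return x * y;
}
```

Starting from `y0 = 0.7f` for `x = 2.0f` (about 1% off), one step leaves an error near `2e-4` in the square root, and a second step brings it close to single-precision rounding error. Each step costs several multiplies, which is consistent with the commit's observation that two iterations erase most of the speed advantage.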
* Fix some CUDA warnings. | Antonio Sanchez | 2021-02-24
| Added `EIGEN_HAS_STD_HASH` macro, checking for C++11 support and not running on GPU.
|
| `std::hash<float>` is not a device function, so cannot be used by `std::hash<bfloat16>`. Removed `EIGEN_DEVICE_FUNC` and only define if `EIGEN_HAS_STD_HASH`. Same for `half`.
|
| Added `EIGEN_CUDA_HAS_FP16_ARITHMETIC` to improve readability, eliminate warnings about `EIGEN_CUDA_ARCH` not being defined.
|
| Replaced a couple C-style casts with `reinterpret_cast` for aligned loading of `half*` to `half2*`. This eliminates `-Wcast-align` warnings in clang. Although not ideal due to potential type aliasing, this is how CUDA handles these conversions internally.
* Accurate pow, part 2. | Rasmus Munk Larsen | 2021-02-23
| This change adds specializations of log2 and exp2 for double that make pow<double> accurate to 1 ULP. Speed for AVX-512 is within 0.5% of the current implementation.
* Fix compilation errors with later versions of GCC and use of MMA. | Chip-Kerchner | 2021-02-22
|
* Fixes Bug #1925. | Christoph Hertzberg | 2021-02-20
| Packets should be passed by const reference, even to inline functions.
* Use the Cephes double subtraction trick in pexp<float> even when FMA is available. | Rasmus Munk Larsen | 2021-02-18
| Otherwise the accuracy drops from 1 ulp to 3 ulp.
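The "double subtraction" is the range-reduction step in which `ln(2)` is split into a high and a low part and subtracted in two stages. The sketch below is a scalar illustration using the split constants from Cephes' single-precision `expf`; the function names are hypothetical and this is not Eigen's `pexp<float>`.

```cpp
#include <cmath>

// n = round(x / ln(2)), so that exp(x) = 2^n * exp(r) with a small r.
inline float exp_reduction_n(float x) {
  const float kLog2e = 1.44269504088896341f;
  return std::floor(x * kLog2e + 0.5f);
}

// Hypothetical helper: r = x - n*ln(2), computed in two subtractions.
inline float exp_reduced_arg(float x) {
  // kC1 has few significant bits, so n * kC1 is exact for moderate n.
  // kC2 is the small correction: kC1 + kC2 approximates ln(2) to more
  // precision than a single float constant could hold.
  const float kC1 = 0.693359375f;
  const float kC2 = -2.12194440e-4f;
  const float n = exp_reduction_n(x);
  float r = x - n * kC1;  // exact product, so only the subtraction can round
  r = r - n * kC2;        // fold in the low-order bits of n*ln(2)
  return r;
}
```

For `x = 10.0f`, `n` is 14 and the reduced argument is about `0.2959395`. Replacing the two stages with a single subtraction of `n * ln(2)` rounded to float would lose low-order bits of the product, which is the kind of accuracy loss the commit message describes.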
* Fix uninitialized warning on AVX. | Antonio Sanchez | 2021-02-17
|
* Fixed performance issues for VSX and P10 MMA in general_matrix_matrix_product | Chip Kerchner | 2021-02-17
|
* New accurate algorithm for pow(x,y). | Rasmus Munk Larsen | 2021-02-17
| This version is accurate to 1.4 ulps for float, while still being 10x faster than std::pow for AVX512. A future change will introduce a specialization for double.
* Updated pfrexp implementation. | Antonio Sanchez | 2021-02-17
| | | | | | The original implementation fails for 0, denormals, inf, and NaN. See #2150