* Add CUDA complex sqrt. (Antonio Sanchez, 2020-12-22)
    This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
    device, requested by the Tensorflow folks. Technically `std::complex` is not
    supported by NVCC on device (though it is by clang), so the default
    `sqrt(std::complex<T>)` function only works on the host. Here we create an
    overload to add back the functionality. Also modified the CMake file to add
    the `--expt-relaxed-constexpr` (or equivalent) flag for NVCC to allow calling
    constexpr functions from device functions, and added support for specifying
    the compute architecture for NVCC (already available for clang).
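    For illustration, a minimal sketch of such a device-callable overload
    (hypothetical helper name, not Eigen's actual implementation; it computes the
    square root in polar form, and calling the constexpr `std::complex` accessors
    on device is exactly what needs the relaxed-constexpr flag):

    ```
    #include <cmath>
    #include <complex>

    #if defined(__CUDACC__)
    #define DEVICE_FUNC __host__ __device__
    #else
    #define DEVICE_FUNC
    #endif

    // sqrt(z) via |w| = sqrt(|z|) and arg(w) = arg(z)/2 (illustrative only;
    // a production version needs special handling of infinities and NaNs).
    template <typename T>
    DEVICE_FUNC std::complex<T> device_complex_sqrt(const std::complex<T>& z) {
      const T x = z.real();
      const T y = z.imag();
      const T mod = std::sqrt(std::hypot(x, y));  // sqrt(|z|)
      const T arg = std::atan2(y, x) / T(2);      // half the argument of z
      return std::complex<T>(mod * std::cos(arg), mod * std::sin(arg));
    }
    ```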
* Fix missing EIGEN_DEVICE_FUNC (rgreenblatt, 2020-12-20)
* Add iterative psqrt<double> for AVX and SSE when FMA is available. (Rasmus Munk Larsen, 2020-12-16)
    * Add iterative psqrt<double> for AVX and SSE when FMA is available. This
      provides a ~10% speedup.
    * Write iterative sqrt explicitly in terms of pmadd. This gives up to 7%
      speedup for psqrt<float> with AVX & SSE with FMA.
    * Remove iterative psqrt<double> for NEON, because the initial rsqrt
      approximation is not accurate enough for convergence in 2 Newton-Raphson
      steps, and with 3 steps just calling the built-in sqrt instruction is
      faster.

    The following benchmarks were compiled with clang "-O2 -ffast-math -mfma",
    with and without -mavx.

    AVX+FMA (float)
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 1%  ~
    BM_eigen_sqrt_float/8     2.07ns ± 0%  2.08ns ± 1%  ~
    BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%  ~
    BM_eigen_sqrt_float/512   95.7ns ± 0%  95.5ns ± 0%  ~
    BM_eigen_sqrt_float/4k     776ns ± 0%   763ns ± 0%  -1.67%
    BM_eigen_sqrt_float/32k   6.57µs ± 1%  6.13µs ± 0%  -6.69%
    BM_eigen_sqrt_float/256k  83.7µs ± 3%  83.3µs ± 2%  ~
    BM_eigen_sqrt_float/1M     335µs ± 2%   332µs ± 2%  ~

    SSE+FMA (float)
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 0%  ~
    BM_eigen_sqrt_float/8     2.07ns ± 0%  2.06ns ± 0%  ~
    BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%  ~
    BM_eigen_sqrt_float/512   95.7ns ± 0%  96.3ns ± 4%  ~
    BM_eigen_sqrt_float/4k     774ns ± 0%   763ns ± 0%  -1.50%
    BM_eigen_sqrt_float/32k   6.58µs ± 2%  6.11µs ± 0%  -7.06%
    BM_eigen_sqrt_float/256k  82.7µs ± 1%  82.6µs ± 1%  ~
    BM_eigen_sqrt_float/1M     330µs ± 1%   329µs ± 2%  ~

    SSE+FMA (double)
    BM_eigen_sqrt_double/1    1.63ns ± 0%  1.63ns ± 0%  ~
    BM_eigen_sqrt_double/8    6.51ns ± 0%  6.08ns ± 0%  -6.68%
    BM_eigen_sqrt_double/64   52.1ns ± 0%  46.5ns ± 1%  -10.65%
    BM_eigen_sqrt_double/512   417ns ± 0%   374ns ± 1%  -10.29%
    BM_eigen_sqrt_double/4k   3.33µs ± 0%  2.97µs ± 1%  -11.00%
    BM_eigen_sqrt_double/32k  26.7µs ± 0%  23.7µs ± 0%  -11.07%
    BM_eigen_sqrt_double/256k  213µs ± 0%   206µs ± 1%  -3.31%
    BM_eigen_sqrt_double/1M    862µs ± 0%   870µs ± 2%  +0.96%

    AVX+FMA (double)
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_double/1    1.63ns ± 0%  1.63ns ± 0%  ~
    BM_eigen_sqrt_double/8    6.51ns ± 0%  6.06ns ± 0%  -6.95%
    BM_eigen_sqrt_double/64   52.1ns ± 0%  46.5ns ± 1%  -10.80%
    BM_eigen_sqrt_double/512   417ns ± 0%   373ns ± 1%  -10.59%
    BM_eigen_sqrt_double/4k   3.33µs ± 0%  2.97µs ± 1%  -10.79%
    BM_eigen_sqrt_double/32k  26.7µs ± 0%  23.8µs ± 0%  -10.94%
    BM_eigen_sqrt_double/256k  214µs ± 0%   208µs ± 2%  -2.76%
    BM_eigen_sqrt_double/1M    866µs ± 3%   923µs ± 7%  ~
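    As a rough illustration of the "sqrt in terms of pmadd" idea, here is a
    scalar sketch with `std::fma` standing in for `pmadd` (illustrative only;
    the actual packet code, step counts, and edge-case handling differ):

    ```
    #include <cmath>

    // Refine an initial estimate y ~= 1/sqrt(x) with Newton-Raphson steps
    // written as fused multiply-adds, then form sqrt(x) = x * rsqrt(x).
    float iterative_sqrt_sketch(float x, float y_rsqrt_estimate) {
      const float minus_half_x = -0.5f * x;
      float y = y_rsqrt_estimate;
      for (int step = 0; step < 2; ++step) {
        // y <- y * (1.5 - 0.5 * x * y * y), one Newton-Raphson iteration.
        y = y * std::fma(minus_half_x * y, y, 1.5f);
      }
      return x * y;  // sqrt(x) = x * (1/sqrt(x))
    }
    ```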
* Merge branch 'lambdaknight/eigen-master' (Turing Eret, 2020-12-16)
* Replace call to FixedDimensions() with a singleton instance of FixedDimensions. (Turing Eret, 2020-12-16)
* Add an additional step of Newton-Raphson for `psqrt<double>` on Arm, which otherwise has an error of ~1000 ulps. (Rasmus Munk Larsen, 2020-12-15)
* TensorStorage with FixedDimensions now has zero instance memory overhead. (Turing Eret, 2020-12-14)
    Removed m_dimension as an instance member of TensorStorage with
    FixedDimensions and instead use the template parameter. This means that the
    size of a pure fixed-size storage is exactly equal to the data it is storing.
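    The gist, as a simplified sketch (names and layout are illustrative, not
    Eigen's actual TensorStorage):

    ```
    #include <cstddef>

    // When the dimensions are compile-time template parameters, no dimensions
    // member is needed, so the storage is exactly the size of its payload.
    template <typename T, std::size_t Rows, std::size_t Cols>
    struct FixedStorageSketch {
      T data[Rows * Cols];
      static constexpr std::size_t dimension(std::size_t i) {
        return i == 0 ? Rows : Cols;  // recovered from the type, not a member
      }
    };

    static_assert(sizeof(FixedStorageSketch<float, 2, 3>) == 6 * sizeof(float),
                  "zero instance memory overhead");
    ```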
* Remove code checking for CMake < 3.5 (Alexander Grund, 2020-12-14)
    As the CMake version is at least 3.5, the code checking for earlier versions
    can be removed.
* Remove comma at the end of enumeration list to silence C++03 warnings (David Tellenbach, 2020-12-13)
* Fix implicit cast to double. (Antonio Sanchez, 2020-12-12)
    Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors in
    Google due to `-Wall`.
* Fix NEON pmax<PropagateNumbers,Packet4bf>. (Antonio Sanchez, 2020-12-11)
    Simple typo: the max implementation called pmin instead of pmax for floats.
* Fix typo in AVX512 packet math. (Antonio Sanchez, 2020-12-11)
* Remove unused macro in Half.h (David Tellenbach, 2020-12-12)
* Fix more SSE/AVX packet conversions for peven. (Antonio Sanchez, 2020-12-11)
    MSVC doesn't like function-style casts and forces us to use intrinsics.
* Replace M_LOG2E and M_LN2 with custom macros. (Antonio Sanchez, 2020-12-11)
    For these to exist we would need to define `_USE_MATH_DEFINES` before `cmath`
    or `math.h` is first included. However, we don't control the include order
    for projects outside Eigen, so even defining the macro in `Eigen/Core` does
    not fix the issue for projects that end up including `<cmath>` before Eigen
    does (explicitly or transitively). To fix this, we define `EIGEN_LOG2E` and
    `EIGEN_LN2` ourselves.
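    A hedged sketch of the idea (the macro names come from the commit message;
    the literals are the standard values of log2(e) and ln(2), truncated here,
    and the exact definitions used in Eigen may differ):

    ```
    // Define the constants unconditionally instead of relying on
    // _USE_MATH_DEFINES having been set before the first include of <cmath>.
    #ifndef EIGEN_LOG2E
    #define EIGEN_LOG2E 1.442695040888963407
    #endif
    #ifndef EIGEN_LN2
    #define EIGEN_LN2 0.693147180559945309
    #endif
    ```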
* Fix MSVC SSE casts. (Antonio Sanchez, 2020-12-11)
    MSVC doesn't like `__m128(__m128i)` C-style casts, so packets need to be
    converted using intrinsics.
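    For example (an illustrative snippet, not the exact Eigen change):

    ```
    #include <emmintrin.h>  // SSE2

    // MSVC rejects casting between __m128i and __m128 directly, so use the
    // bit-reinterpreting cast intrinsic, which compiles to no instructions.
    static inline __m128 int_bits_as_float(__m128i a) {
      // return __m128(a);          // accepted by gcc/clang, rejected by MSVC
      return _mm_castsi128_ps(a);   // portable equivalent
    }
    ```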
* Fix for broken ROCm/HIP Support (Deven Desai, 2020-12-11)
    The following commit introduced a breakage in ROCm/HIP support for Eigen.

    https://gitlab.com/libeigen/eigen/-/commit/5ec4907434742d4555df4aa708b665868b88f3b4#1958e65719641efe5483abc4ce0b61806270f6f3_525_517

    ```
    Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o
    In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20:
    In file included from /home/rocm-user/eigen/test/main.h:356:
    In file included from /home/rocm-user/eigen/Eigen/QR:11:
    In file included from /home/rocm-user/eigen/Eigen/Core:222:
    /home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'?
        return half2half2(from);
               ^~~~~~~~~~
               __half2half2
    /opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here
    __half2 __half2half2(__half x)
            ^
    1 error generated when compiling for gfx900.
    ```

    The cause seems to be a copy-paste error, and the fix is trivial.
* Don't guard psqrt for std::complex<float> with EIGEN_ARCH_ARM64 (David Tellenbach, 2020-12-11)
* Add Armv8 guard on PropagateNumbers implementation. (Everton Constantino, 2020-12-10)
* Remove private access of std::deque::_M_impl. (Antonio Sanchez, 2020-12-10)
    This no longer works on gcc or clang, so we should just remove the hack.
    The default should compile to similar code anyway.
* Fix vectorization of complex sqrt on NEON (David Tellenbach, 2020-12-10)
* Remove comma at end of enumerator list in NEON PacketMath (David Tellenbach, 2020-12-10)
* Fix a typo in SparseMatrix documentation. (David Tellenbach, 2020-12-09)
    This fixes issue #2091.
* Implement vectorized complex square root. (Rasmus Munk Larsen, 2020-12-08)
    Closes #1905.

    Measured speedup for sqrt of `complex<float>` on Skylake:

    SSE:
    ```
    name                      old time/op  new time/op  delta
    BM_eigen_sqrt_ctype/1     49.4ns ± 0%  54.3ns ± 0%  +10.01%
    BM_eigen_sqrt_ctype/8      332ns ± 0%    50ns ± 1%  -84.97%
    BM_eigen_sqrt_ctype/64    2.81µs ± 1%  0.38µs ± 0%  -86.49%
    BM_eigen_sqrt_ctype/512   23.8µs ± 0%   3.0µs ± 0%  -87.32%
    BM_eigen_sqrt_ctype/4k     202µs ± 0%    24µs ± 2%  -88.03%
    BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.19ms ± 0%  -88.18%
    BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   1.5ms ± 1%  -88.20%
    BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   6.2ms ± 0%  -88.18%
    ```

    AVX2:
    ```
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_ctype/1     53.6ns ± 0%  55.6ns ± 0%  +3.71%
    BM_eigen_sqrt_ctype/8      334ns ± 0%    27ns ± 0%  -91.86%
    BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.22µs ± 2%  -92.28%
    BM_eigen_sqrt_ctype/512   23.8µs ± 1%   1.7µs ± 1%  -92.81%
    BM_eigen_sqrt_ctype/4k     201µs ± 0%    14µs ± 1%  -93.24%
    BM_eigen_sqrt_ctype/32k   1.62ms ± 0%  0.11ms ± 1%  -93.29%
    BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.9ms ± 1%  -93.31%
    BM_eigen_sqrt_ctype/1M    52.0ms ± 0%   3.5ms ± 1%  -93.31%
    ```

    AVX512:
    ```
    name                      old cpu/op   new cpu/op   delta
    BM_eigen_sqrt_ctype/1     53.7ns ± 0%  56.2ns ± 1%  +4.75%
    BM_eigen_sqrt_ctype/8      334ns ± 0%    18ns ± 2%  -94.63%
    BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.12µs ± 1%  -95.54%
    BM_eigen_sqrt_ctype/512   23.9µs ± 1%   1.0µs ± 1%  -95.89%
    BM_eigen_sqrt_ctype/4k     202µs ± 0%     8µs ± 1%  -96.13%
    BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.06ms ± 1%  -96.15%
    BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.5ms ± 4%  -96.11%
    BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   2.0ms ± 1%  -96.13%
    ```
* Fix host/device calls for __half. (Antonio Sanchez, 2020-12-08)
    The previous code had `__host__ __device__` functions calling `__device__`
    functions (e.g. `__low2half`), which caused build failures in tensorflow.
    Also tried to simplify the `#ifdef` guards to make them clearer.
* Enabling PropagateNaN and PropagateNumbers for NEON. (Everton Constantino, 2020-12-08)
    - Adding propagate tests to bfloat16.
* Fix unused warning on new `dense_assignment_loop` impl. (Antonio Sanchez, 2020-12-07)
* Add specialization for compile-time zero-sized dense assignment. (Antonio Sanchez, 2020-12-07)
    In the current `dense_assignment_loop` implementations, if the destination's
    inner or outer size is zero at compile time and if the kernel involves a
    product, we currently get a compile error (#2080). This is triggered by
    attempting to multiply a non-existent row by a column (or vice-versa).

    To address this, we add a specialization for zero-sized assignments
    (`AllAtOnceTraversal`) which evaluates to a no-op. We also add a static check
    to ensure the size is in fact zero. This now seems to be the only existing
    use of `AllAtOnceTraversal`.

    Fixes #2080.
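    Schematically, the specialization is a compile-time no-op plus a static
    check, along these lines (a simplified sketch; the compile-time size is
    passed explicitly here for illustration, whereas Eigen obtains it from the
    kernel's destination expression):

    ```
    // Selected when the destination size is known to be zero at compile time.
    template <typename Kernel, int SizeAtCompileTime>
    struct zero_sized_assignment_loop_sketch {
      static_assert(SizeAtCompileTime == 0,
                    "this specialization is only valid for zero-sized assignments");
      static void run(Kernel& /*kernel*/) {
        // Nothing to copy: the destination has no coefficients at compile time.
      }
    };
    ```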
* Clean up `#if`s in GPU PacketMath. (Antonio Sanchez, 2020-12-04)
    Removed redundant checks and redundant code for CUDA/HIP.

    Note: there are several issues here of calling `__device__` functions from
    `__host__ __device__` functions, in particular `__low2half`. We do not
    address that here -- only modifying this file enough to get our current
    tests to compile.

    Fixed: #1847
* Add log2() to Eigen. (Rasmus Munk Larsen, 2020-12-04)
* Fix bad NEON fp16 check (Antonio Sanchez, 2020-12-04)
* Special function implementations for half/bfloat16 packets. (Antonio Sanchez, 2020-12-04)
    Current implementations fail to consider half-float packets, only half-float
    scalars. Added specializations for packets on AVX, AVX512 and NEON. Added
    tests to `special_packetmath`.

    The current `special_functions` tests would fail for half and bfloat16 due to
    lack of precision. The NEON tests also fail with precision issues and due to
    different handling of `sqrt(inf)`, so the special functions bessel and ndtri
    have been disabled. Tested with AVX, AVX512.
* Remove duplicate #if clause (David Tellenbach, 2020-12-04)
* Fix shfl* macros for CUDA/HIP (Antonio Sanchez, 2020-12-04)
    The `shfl*` functions are `__device__` only; the `#ifdef`s are adjusted so
    they are defined whenever the corresponding CUDA/HIP ones are.

    Also changed the HIP/CUDA<9.0 versions to cast to int instead of doing the
    conversion `half` <-> `float`.

    Fixes #2083
* The function 'prefetch' did not work correctly on the win64 platform (shrek1402, 2020-12-04)
* Revert "Add log2() operator to Eigen" (Rasmus Munk Larsen, 2020-12-03)
    This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
* Add log2() operator to Eigen (Rasmus Munk Larsen, 2020-12-03)
* Small cleanup of generic plog implementations: (Rasmus Munk Larsen, 2020-12-03)
    Adding the term e*ln(2) is split into two steps for no obvious reason. This
    dates back to the original Cephes code from which the algorithm is adapted.
    It appears that this was done in Cephes to prevent the compiler from
    reordering the addition of the 3 terms in the approximation

        log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x)

    which must be added in reverse order since |x| < (sqrt(2)-1).

    This allows rewriting the code to just 2 pmadd and 1 padd instructions, which
    on a Skylake processor speeds up the code by 5-7%.
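    In scalar form, the rearrangement amounts to something like the following
    sketch, with `std::fma` standing in for `pmadd` (`R` denotes the value of the
    rational term P(x)/Q(x) and `e` the extracted exponent; illustrative only,
    not the exact Eigen code):

    ```
    #include <cmath>

    // log(z) ~= e*ln(2) + x - 0.5*x^2 + x^3*R, with the polynomial part
    // evaluated as x2*(x*R - 0.5) + x, i.e. two madds and one add.
    float plog_tail_sketch(float x, float R, float e) {
      const float ln2 = 0.6931471805599453f;
      const float x2 = x * x;
      return std::fma(x2, std::fma(x, R, -0.5f), x) + e * ln2;
    }
    ```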
* Include chrono in main for c++11. (Antonio Sanchez, 2020-12-03)
    Hack to fix tensor tests, since min/max are overridden by `main.h`.
* Clean up the Tensor header and get rid of the EIGEN_SLEEP macro. (Rasmus Munk Larsen, 2020-12-02)
* Fix typo in `F32MaskToBf16Mask`. (Antonio Sanchez, 2020-12-02)
* Fix neon cmp* functions for bf16. (Antonio Sanchez, 2020-12-02)
    The current implementation corrupts the comparison masks when converting from
    float back to bfloat16. The resulting masks are then no longer all zeros or
    all ones, which breaks when used with `pselect` (e.g. in
    `pmin<PropagateNumbers>`). This was causing `packetmath_15` to fail on arm.

    Introducing a simple `F32MaskToBf16Mask` corrects this (takes the lower
    16 bits for each float mask).
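    One lane of that conversion, as a scalar sketch (illustrative only; the
    actual NEON code operates on whole vectors):

    ```
    #include <cstdint>
    #include <cstring>

    // A float comparison mask lane is either all zeros or all ones, so keeping
    // the lower 16 bits of the 32-bit pattern yields a valid bfloat16 mask lane.
    // A value conversion (float -> bfloat16) would instead corrupt the pattern.
    static inline std::uint16_t f32_mask_to_bf16_mask_lane(float mask_lane) {
      std::uint32_t bits;
      std::memcpy(&bits, &mask_lane, sizeof(bits));
      return static_cast<std::uint16_t>(bits & 0xFFFFu);
    }
    ```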
* Implement CUDA __shfl* for Eigen::half (Antonio Sanchez, 2020-12-01)
    Prior to this fix, `TensorContractionGpu` and the
    `cxx11_tensor_of_float16_gpu` test are broken, as well as several ops in
    Tensorflow. The gpu functions `__shfl*` became ambiguous now that
    `Eigen::half` implicitly converts to float. Here we add the required
    specializations.
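    A rough sketch of the kind of overload meant here (hypothetical helper name,
    not Eigen's exact code; assumes CUDA 9+ for `__shfl_sync` and shuffles the
    raw 16-bit payload as an unsigned int so the call no longer relies on the
    ambiguous implicit conversion to float):

    ```
    #include <cstring>
    #include <Eigen/Core>

    __device__ inline Eigen::half shfl_half_sync(unsigned mask, Eigen::half var,
                                                 int srcLane) {
      unsigned short bits;
      std::memcpy(&bits, &var, sizeof(bits));  // reinterpret half as raw bits
      const unsigned shuffled =
          __shfl_sync(mask, static_cast<unsigned>(bits), srcLane);
      const unsigned short out_bits = static_cast<unsigned short>(shuffled);
      Eigen::half out;
      std::memcpy(&out, &out_bits, sizeof(out_bits));  // bits back to half
      return out;
    }
    ```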
* Fix a few issues for AVX512. This change enables vectorized versions of log, exp, log1p, expm1 when AVX512DQ is not available. (Rasmus Munk Larsen, 2020-12-01)
* Fix #2077, `EIGEN_CONSTEXPR` in `Half`. (Antonio Sanchez, 2020-12-01)
    `bit_cast` cannot be `constexpr`, so we need to remove `EIGEN_CONSTEXPR` from
    `raw_half_as_uint16(...)`. This shouldn't affect anything else, since it is
    only used in a `bit_cast<uint16_t,half>()` call which is not itself
    `constexpr`.

    Fixes #2077.
* add EIGEN_DEVICE_FUNC to methods (acxz, 2020-12-01)
* AVX512 missing ops. (Antonio Sanchez, 2020-11-30)
    This allows the `packetmath` tests to pass for AVX512 on Skylake. Made `half`
    and `bfloat16` consistent in terms of the ops they support.

    Note: the `log` tests are currently disabled for `bfloat16` since they fail
    due to poor precision (they were previously disabled for `Packet8bf` via test
    function specialization -- I just removed that specialization and disabled it
    in the generic test).
* Fix typo in doc (Florian Maurin, 2020-11-30)
* Workaround for doxygen class template titles in which the template part of the class signature is lost due to a problem with forward declarations. (Jim Lersch, 2020-11-27)
    The problem is probably caused by doxygen bug #7689. It is confirmed to be
    fixed in doxygen >= 1.8.19.
* Fix doxygen class blocks that were not associated with the correct classes. (Jim Lersch, 2020-11-27)