path: root/Eigen/src/Core/arch/CUDA
* Better CUDA complex division. (Antonio Sanchez, 2021-04-29)

  The original implementation produced NaNs when dividing 0/b for
  subnormal b. `complex_divide_stable` was changed to use the more
  common Smith's algorithm.
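
  For reference, a minimal sketch of Smith's algorithm for complex
  division; the function name is illustrative, not Eigen's exact
  `complex_divide_stable`:
  ```
  #include <cmath>
  #include <complex>

  // Smith's algorithm: scale by the ratio of the divisor's components
  // so |b|^2 is never formed directly, avoiding overflow/underflow.
  template <typename T>
  std::complex<T> smith_divide(const std::complex<T>& a,
                               const std::complex<T>& b) {
    const T ar = a.real(), ai = a.imag();
    const T br = b.real(), bi = b.imag();
    if (std::abs(br) >= std::abs(bi)) {
      const T r = bi / br;
      const T d = br + bi * r;
      return std::complex<T>((ar + ai * r) / d, (ai - ar * r) / d);
    } else {
      const T r = br / bi;
      const T d = bi + br * r;
      return std::complex<T>((ar * r + ai) / d, (ai * r - ar) / d);
    }
  }
  ```
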
* Fix NVCC+ICC issues. (Antonio Sanchez, 2021-03-15)

  NVCC does not understand `__forceinline`, so we need to use `inline`
  when compiling for GPU.

  ICC specializes `std::complex` operators for `float` and `double` by
  default, which cannot be used on device and conflict with Eigen's
  workaround in CUDA/Complex.h. This can be prevented by defining
  `_OVERRIDE_COMPLEX_SPECIALIZATION_` before including `<complex>`.
  Added this define to the tests and to `Eigen/Core`, but this will not
  work if the user includes `<complex>` before `<Eigen/Core>`.

  ICC also seems to generate a duplicate `Map` symbol in
  `PlainObjectBase`:
  ```
  error: "Map" has already been declared in the current scope
  static ConstMapType Map(const Scalar *data)
  ```
  I tracked this down to `friend class Eigen::Map`. Putting the `friend`
  statements at the bottom of the class seems to resolve this issue.

  Fixes #2180
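
  A hedged sketch of the keyword selection this implies; the macro name
  is hypothetical, not Eigen's actual `EIGEN_STRONG_INLINE` definition:
  ```
  // Hypothetical macro: pick a forced-inline spelling each compiler accepts.
  #if defined(__CUDACC__)
    #define STRONG_INLINE_SKETCH inline          // NVCC: no __forceinline
  #elif defined(_MSC_VER) || defined(__INTEL_COMPILER)
    #define STRONG_INLINE_SKETCH __forceinline   // MSVC/ICC spelling
  #else
    #define STRONG_INLINE_SKETCH inline __attribute__((always_inline))
  #endif
  ```
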
* Specialize std::complex operators for use on GPU device. (Antonio Sanchez, 2021-01-22)

  NVCC and older versions of clang do not fully support `std::complex`
  on device, leading to either compile errors (cannot call a `__host__`
  function) or, worse, runtime errors (illegal instruction). For most
  functions we can implement specialized `numext` versions. Here we
  specialize the standard operators (with the exception of stream
  operators and member-function operators with a scalar that are already
  specialized in `<complex>`) so they can be used in device code as
  well.

  To import these operators into the current scope, use
  `EIGEN_USING_STD_COMPLEX_OPERATORS`. By default, these are imported
  into the `Eigen`, `Eigen::internal`, and `Eigen::numext` namespaces.

  This allows us to remove specializations of the
  sum/difference/product/quotient ops, and to treat complex numbers like
  most other scalars (e.g. in tests).
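
  A minimal sketch of the pattern, assuming NVCC's `__host__ __device__`
  qualifiers; the namespace and macro names are illustrative, not
  Eigen's exact code:
  ```
  #include <complex>

  #if defined(__CUDACC__)
  #define DEVICE_FUNC __host__ __device__
  #else
  #define DEVICE_FUNC
  #endif

  namespace sketch {
  // Defined in our own namespace so device code can pick it up with a
  // using-declaration, mirroring EIGEN_USING_STD_COMPLEX_OPERATORS.
  template <typename T>
  DEVICE_FUNC std::complex<T> operator+(const std::complex<T>& a,
                                        const std::complex<T>& b) {
    // Built from real/imag parts only, so nothing __host__-only is called.
    return std::complex<T>(a.real() + b.real(), a.imag() + b.imag());
  }
  }  // namespace sketch

  // Analogue of EIGEN_USING_STD_COMPLEX_OPERATORS for this sketch:
  #define USING_SKETCH_COMPLEX_OPERATORS using sketch::operator+;
  ```
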
* Improved std::complex sqrt and rsqrt. (Antonio Sanchez, 2021-01-17)

  Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously
  `complex_sqrt` was only used for CUDA and MSVC), and implements a
  custom `complex_rsqrt`. Also introduces `numext::rsqrt` to simplify
  the implementation, and modifies `numext::hypot` to adhere to IEEE/IEC
  60559 for special cases.

  The `complex_sqrt` and `complex_rsqrt` implementations were found to
  be significantly faster than `std::sqrt<std::complex<T>>` and
  `1/numext::sqrt<std::complex<T>>`. Benchmark file attached.
  ```
  GCC 10, Intel Xeon, x86_64:
  ---------------------------------------------------------------------------
  Benchmark                                 Time             CPU   Iterations
  ---------------------------------------------------------------------------
  BM_Sqrt<std::complex<float>>           9.21 ns         9.21 ns     73225448
  BM_StdSqrt<std::complex<float>>        17.1 ns         17.1 ns     40966545
  BM_Sqrt<std::complex<double>>          8.53 ns         8.53 ns     81111062
  BM_StdSqrt<std::complex<double>>       21.5 ns         21.5 ns     32757248
  BM_Rsqrt<std::complex<float>>          10.3 ns         10.3 ns     68047474
  BM_DivSqrt<std::complex<float>>        16.3 ns         16.3 ns     42770127
  BM_Rsqrt<std::complex<double>>         11.3 ns         11.3 ns     61322028
  BM_DivSqrt<std::complex<double>>       16.5 ns         16.5 ns     42200711

  Clang 11, Intel Xeon, x86_64:
  ---------------------------------------------------------------------------
  Benchmark                                 Time             CPU   Iterations
  ---------------------------------------------------------------------------
  BM_Sqrt<std::complex<float>>           7.46 ns         7.45 ns     90742042
  BM_StdSqrt<std::complex<float>>        16.6 ns         16.6 ns     42369878
  BM_Sqrt<std::complex<double>>          8.49 ns         8.49 ns     81629030
  BM_StdSqrt<std::complex<double>>       21.8 ns         21.7 ns     31809588
  BM_Rsqrt<std::complex<float>>          8.39 ns         8.39 ns     82933666
  BM_DivSqrt<std::complex<float>>        14.4 ns         14.4 ns     48638676
  BM_Rsqrt<std::complex<double>>         9.83 ns         9.82 ns     70068956
  BM_DivSqrt<std::complex<double>>       15.7 ns         15.7 ns     44487798

  Clang 9, Pixel 2, aarch64:
  ---------------------------------------------------------------------------
  Benchmark                                 Time             CPU   Iterations
  ---------------------------------------------------------------------------
  BM_Sqrt<std::complex<float>>           24.2 ns         24.1 ns     28616031
  BM_StdSqrt<std::complex<float>>         104 ns          103 ns      6826926
  BM_Sqrt<std::complex<double>>          31.8 ns         31.8 ns     22157591
  BM_StdSqrt<std::complex<double>>        128 ns          128 ns      5437375
  BM_Rsqrt<std::complex<float>>          31.9 ns         31.8 ns     22384383
  BM_DivSqrt<std::complex<float>>        99.2 ns         98.9 ns      7250438
  BM_Rsqrt<std::complex<double>>         46.0 ns         45.8 ns     15338689
  BM_DivSqrt<std::complex<double>>        119 ns          119 ns      5898944
  ```
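
  For intuition, a minimal sketch of the textbook stable complex square
  root, assuming a finite nonzero input; Eigen's actual `complex_sqrt`
  additionally handles the IEEE special cases (`inf`, `NaN`, zero) this
  sketch ignores:
  ```
  #include <cmath>
  #include <complex>

  // sqrt(x + iy): branch on the sign of x so the subtraction that
  // would cancel catastrophically is never formed.
  template <typename T>
  std::complex<T> complex_sqrt_sketch(const std::complex<T>& z) {
    const T x = z.real(), y = z.imag();
    // w = sqrt((|x| + |z|)/2) is well conditioned for either sign of x.
    const T w = std::sqrt(T(0.5) * (std::abs(x) + std::hypot(x, y)));
    return x >= T(0)
               ? std::complex<T>(w, y / (2 * w))
               : std::complex<T>(std::abs(y) / (2 * w), std::copysign(w, y));
  }
  ```
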
* Only specialize complex `sqrt_impl` for CUDA if not MSVC. (Antonio Sanchez, 2021-01-11)

  We already specialize `sqrt_impl` on Windows due to MSVC's mishandling
  of `inf` (!355).

* Fix MSVC complex sqrt and packetmath test. (Antonio Sanchez, 2021-01-08)

  MSVC incorrectly handles `inf` cases for `std::sqrt<std::complex<T>>`.
  Here we replace it with a custom version (currently used on GPU).

  Also fixed the `packetmath` test, which previously skipped several
  corner cases since `CHECK_CWISE1` only tests the first `PacketSize`
  elements.

* Add CUDA complex sqrt. (Antonio Sanchez, 2020-12-22)

  This is to support scalar `sqrt` of complex numbers `std::complex<T>`
  on device, requested by the TensorFlow folks.

  Technically `std::complex` is not supported by NVCC on device (though
  it is by clang), so the default `sqrt(std::complex<T>)` function only
  works on the host. Here we create an overload to add back the
  functionality.

  Also modified the CMake file to add the `--expt-relaxed-constexpr` (or
  equivalent) flag for NVCC, to allow calling constexpr functions from
  device functions, and added support for specifying the compute
  architecture for NVCC (this was already available for clang).
* Renaming CUDA* to GPU* for some header files. (Deven Desai, 2018-07-11)

* Moving Half headers from CUDA dir to GPU dir, removing the HIP versions. (Deven Desai, 2018-06-13)

* Derivative of the incomplete Gamma function and the sample of a Gamma random variable. (Michael Figurnov, 2018-06-06)

  In addition to igamma(a, x), this code implements:

  * igamma_der_a(a, x) = d igamma(a, x) / da -- derivative of igamma
    with respect to the parameter
  * gamma_sample_der_alpha(alpha, sample) -- reparameterization
    derivative of a Gamma(alpha, 1) random variable sample with respect
    to the alpha parameter

  The derivatives are computed by forward-mode differentiation of the
  igamma(a, x) code. Although gamma_sample_der_alpha can be implemented
  via igamma_der_a, a separate function is more accurate and efficient
  due to analytical cancellation of some terms. All three functions are
  implemented by a method parameterized with "mode" that always computes
  the derivatives but does not return them unless required by the mode.
  The compiler is expected to (and, based on benchmarks, does) skip the
  unnecessary computations depending on the mode.
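
  A toy sketch of that "mode" pattern, assuming a compile-time enum
  selects the return value; the function body is a stand-in, not the
  actual igamma series:
  ```
  #include <cmath>

  enum class Mode { Value, DerivativeA };

  // Forward mode: carry (f, df/da) through every step together; the
  // template parameter is a compile-time constant, so the optimizer
  // can drop whichever result is not returned.
  template <Mode mode>
  double toy_igamma_like(double a, double x) {
    double f = std::exp(-x) * std::pow(x, a);  // value
    double df_da = f * std::log(x);            // d/da of the same step
    return mode == Mode::Value ? f : df_da;
  }

  // Usage: toy_igamma_like<Mode::Value>(a, x) or
  //        toy_igamma_like<Mode::DerivativeA>(a, x).
  ```
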
* Exponentially scaled modified Bessel functions of order zero and one. (Michael Figurnov, 2018-05-31)

  The functions are conventionally called i0e and i1e. The exponentially
  scaled version is more numerically stable. The standard Bessel
  functions can be obtained as i0(x) = exp(|x|) i0e(x).

  The code is ported from Cephes and tested against SciPy.
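
  A minimal usage sketch of that relation; `i0e` here is assumed to be
  any routine with the Cephes-style contract the commit describes:
  ```
  #include <cmath>

  // i0(x) = exp(|x|) * i0e(x). exp(|x|) overflows long before i0e
  // does, which is why the scaled version is the stable primitive.
  double i0_from_i0e(double x, double (*i0e)(double)) {
    return std::exp(std::abs(x)) * i0e(x);
  }
  ```
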
* bug #1520: Workaround some -Wfloat-equal warnings by calling std::equal_to. (Gael Guennebaud, 2018-04-11)
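
  A minimal sketch of the workaround: the comparison keeps exact `==`
  semantics but goes through `std::equal_to`, which `-Wfloat-equal`
  does not flag:
  ```
  #include <functional>

  template <typename Scalar>
  bool exact_eq(const Scalar& x, const Scalar& y) {
    return std::equal_to<Scalar>()(x, y);  // same semantics as x == y
  }
  ```
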
* Move up the specialization of std::numeric_limits. (Daniel Trebbien, 2018-02-18)

  This fixes a compilation error seen when building TensorFlow on macOS:
  https://github.com/tensorflow/tensorflow/issues/17067

* Replace __float2half_rn with __float2half. (nluehr, 2017-11-28)

  The latter provides a consistent definition for CUDA 8.0 and 9.0.
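
  A minimal usage sketch; both intrinsics perform the same
  round-to-nearest float-to-fp16 conversion, and the switch is about
  having one definition that is consistent across CUDA 8.0 and 9.0:
  ```
  #include <cuda_fp16.h>

  // Prefer __float2half over __float2half_rn: same rounding, but
  // defined consistently in the CUDA 8.0 and 9.0 headers.
  __device__ __half to_half_sketch(float f) {
    return __float2half(f);
  }
  ```
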
* Fix incorrect integer cast in predux<half2>(). (nluehr, 2017-11-21)

  The bug corrupts results on Maxwell and earlier GPU architectures.

* Restore `__device__`. (Henry Schreiner, 2017-10-21)

* Fixing missing inlines on device functions for newer CUDA cards. (Henry Schreiner, 2017-10-20)

* Add C++11 max_digits10 for half. (Gael Guennebaud, 2017-09-06)

* Added support for CUDA 9.0. (Benoit Steiner, 2017-08-31)

* bug #1462: Remove all occurrences of the deprecated __CUDACC_VER__ macro by introducing EIGEN_CUDACC_VER. (Gael Guennebaud, 2017-08-24)
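
  A hedged sketch of the replacement: build one version number from the
  per-component macros NVCC still defines; this mirrors the idea, not
  necessarily Eigen's exact definition:
  ```
  // __CUDACC_VER__ is deprecated in CUDA 9; derive a comparable value
  // from __CUDACC_VER_MAJOR__/__CUDACC_VER_MINOR__ instead.
  #if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 9)
    #define EIGEN_CUDACC_VER (__CUDACC_VER_MAJOR__ * 10000 + __CUDACC_VER_MINOR__ * 100)
  #elif defined(__CUDACC_VER__)
    #define EIGEN_CUDACC_VER __CUDACC_VER__
  #else
    #define EIGEN_CUDACC_VER 0
  #endif
  ```
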
* Add an EIGEN_NO_CUDA option, and introduce EIGEN_CUDACC and EIGEN_CUDA_ARCH aliases. (Gael Guennebaud, 2017-07-17)
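
  A hedged sketch of those aliases; the conditions mirror the idea
  (respect EIGEN_NO_CUDA, otherwise forward the compiler's own macros),
  not necessarily the exact lines:
  ```
  // EIGEN_CUDACC: defined when compiling CUDA sources, unless disabled.
  #if defined(__CUDACC__) && !defined(EIGEN_NO_CUDA)
    #define EIGEN_CUDACC __CUDACC__
  #endif

  // EIGEN_CUDA_ARCH: defined when compiling device code, unless disabled.
  #if defined(__CUDA_ARCH__) && !defined(EIGEN_NO_CUDA)
    #define EIGEN_CUDA_ARCH __CUDA_ARCH__
  #endif
  ```
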
* Added missing __device__ qualifier. (Benoit Steiner, 2017-06-13)

* Added missing __device__ qualifier. (Benoit Steiner, 2017-06-13)

* Fix compilation of Half in C++98 (issue introduced in previous commit). (Gael Guennebaud, 2017-06-09)

* Add missing std::numeric_limits specialization for half, and complete NumTraits<half>. (Gael Guennebaud, 2017-06-09)
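
  A hedged sketch of such a specialization; `half_sketch` stands in for
  Eigen::half, and only the digit-count members relevant to
  `max_digits10` are shown:
  ```
  #include <limits>

  struct half_sketch { unsigned short x; };  // stand-in for Eigen::half

  namespace std {
  template <>
  struct numeric_limits<half_sketch> {
    static const bool is_specialized = true;
    static const int digits = 11;       // fp16 significand bits (incl. implicit bit)
    static const int digits10 = 3;      // floor((digits - 1) * log10(2))
    static const int max_digits10 = 5;  // C++11: decimal digits that round-trip
    // The remaining members mirror numeric_limits<float>, with fp16 values.
  };
  }  // namespace std
  ```
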
* Fixed nested angle brackets `>>` issue when compiling with CUDA 8. (Abhijit Kundu, 2017-04-27)
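
  A minimal illustration of the parse issue: with CUDA 8's C++98-style
  front end, adjacent `>>` closing two template argument lists must be
  written `> >`:
  ```
  #include <vector>

  // C++98 / CUDA 8: `>>` here would lex as a right-shift token.
  std::vector<std::vector<float> > rows;  // note the `> >`
  ```
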
* Remove UTF8 symbols. (Gael Guennebaud, 2017-03-07)

* Made most of the packet math primitives usable within CUDA kernels when compiling with clang. (Benoit Steiner, 2017-02-28)

* Avoid unnecessary float-to-double conversions. (Benoit Steiner, 2017-02-27)

* Fix expm1 CUDA implementation (do not shadow the exp CUDA implementation). (Srinivas Vasudevan, 2016-12-05)

* Fix small nit where I changed the name of plog1p to pexpm1. (Srinivas Vasudevan, 2016-12-02)

* Added support for expm1 in Eigen. (Srinivas Vasudevan, 2016-12-02)

* Add a default constructor for the "fake" __half class when not using the __half class provided by CUDA. (Rasmus Munk Larsen, 2016-11-29)
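
  A hedged sketch of such a fallback type; the name and guard are
  illustrative, not Eigen's exact code:
  ```
  // When CUDA's __half is unavailable, a plain struct carrying the raw
  // fp16 bits stands in; the default constructor keeps it usable
  // wherever a default-constructed scalar is expected.
  #if !defined(EIGEN_HAS_CUDA_FP16)
  struct fake_half {
    fake_half() : x(0) {}
    explicit fake_half(unsigned short raw) : x(raw) {}
    unsigned short x;  // raw fp16 bit pattern
  };
  #endif
  ```
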
* Optimized the computation of exp, sqrt, ceil, and floor for fp16 on Pascal GPUs. (Benoit Steiner, 2016-11-16)

* Merged eigen/eigen into default. (Benoit Steiner, 2016-11-03)

* Gate the code that refers to CUDA fp16 primitives more thoroughly. (Benoit Steiner, 2016-11-01)

* Deleted redundant implementation of predux. (Benoit Steiner, 2016-10-12)

* Merged eigen/eigen into default. (Benoit Steiner, 2016-10-12)

* Take advantage of AVX512 instructions whenever possible to speed up the processing of 16-bit floats. (Benoit Steiner, 2016-10-12)

* Added missing AVX intrinsics for fp16: in particular, implemented predux, which is required by the matrix-vector code. (Benoit Steiner, 2016-10-06)

* Properly characterize the CUDA packet primitives for fp16 as device-only. (Benoit Steiner, 2016-10-04)
* Added support for constant std::complex numbers on GPU. (Benoit Steiner, 2016-10-03)

* Added missing typedefs. (Benoit Steiner, 2016-09-20)

* Add CUDA-specific std::complex<T> specializations for scalar_sum_op, scalar_difference_op, scalar_product_op, and scalar_quotient_op. (RJ Ryan, 2016-09-20)

* Use `::hlog` if available. (Gael Guennebaud, 2016-08-29)

* bug #1167: Simplify installation of header files using CMake's install(DIRECTORY ...) command. (Gael Guennebaud, 2016-08-29)

* Fix compilation with MSVC by using our portable numext::log1p implementation. (Gael Guennebaud, 2016-08-22)
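
  A hedged sketch of a portable log1p in this spirit (the classic
  compensation trick), not necessarily Eigen's exact `numext::log1p`:
  ```
  #include <cmath>

  template <typename T>
  T log1p_sketch(T x) {
    const T u = T(1) + x;
    // If 1 + x rounded to exactly 1, x itself is the best answer;
    // otherwise rescale log(u) by x / (u - 1) to compensate for the
    // rounding in u.
    return (u == T(1)) ? x : x * (std::log(u) / (u - T(1)));
  }
  ```
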
* Fix compilation on CUDA 8 by removing call to h2log1p. (Igor Babuschkin, 2016-08-15)

* Add log1p support for CUDA and half floats. (Igor Babuschkin, 2016-08-08)

* Fixed the constructors of the new half_base class. (Benoit Steiner, 2016-08-04)