aboutsummaryrefslogtreecommitdiffhomepage
path: root/Eigen
Commit message (Collapse)AuthorAge
* Include `<cstdint>` in one place, remove custom typedefsGravatar Antonio Sanchez2021-01-26
| | | | | | | | | | | | | | Originating from [this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case), some win32 compilers define `__int32` as a `long`, but MinGW defines `std::int32_t` as an `int`, leading to a type conflict. To avoid this, we remove the custom `typedef` definitions for win32. The Tensor module requires C++11 anyways, so we are guaranteed to have included `<cstdint>` already in `Eigen/Core`. Also re-arranged the headers to only include `<cstdint>` in one place to avoid this type of error again.
* Fix sqrt, ldexp and frexp compilation errors.Gravatar Chip Kerchner2021-01-25
|
* Remove unused variable in SparseLU.hGravatar Florian Maurin2021-01-22
|
* Fix pow and other cwise ops for half/bfloat16.Gravatar Antonio Sanchez2021-01-22
| | | | | | | | | | | | | The new `generic_pow` implementation was failing for half/bfloat16 since their construction from int/float is not `constexpr`. Modified in `GenericPacketMathFunctions` to remove `constexpr`. While adding tests for half/bfloat16, found other issues related to implicit conversions. Also needed to implement `numext::arg` for non-integer, non-complex, non-float/double/long double types. These seem to be implicitly converted to `std::complex<T>`, which then fails for half/bfloat16.
* Specialize std::complex operators for use on GPU device.Gravatar Antonio Sanchez2021-01-22
| | | | | | | | | | | | | | | | | | NVCC and older versions of clang do not fully support `std::complex` on device, leading to either compile errors (Cannot call `__host__` function) or worse, runtime errors (Illegal instruction). For most functions, we can implement specialized `numext` versions. Here we specialize the standard operators (with the exception of stream operators and member function operators with a scalar that are already specialized in `<complex>`) so they can be used in device code as well. To import these operators into the current scope, use `EIGEN_USING_STD_COMPLEX_OPERATORS`. By default, these are imported into the `Eigen`, `Eigen:internal`, and `Eigen::numext` namespaces. This allow us to remove specializations of the sum/difference/product/quotient ops, and allow us to treat complex numbers like most other scalars (e.g. in tests).
* Add support for Arm SVEGravatar David Tellenbach2021-01-21
| | | | | | | | | | | | This patch adds support for Arm's new vector extension SVE (Scalable Vector Extension). In contrast to other vector extensions that are supported by Eigen, SVE types are inherently *sizeless*. For the use in Eigen we fix their size at compile-time (note that this is not necessary in general, SVE is *length agnostic*). During compilation the flag `-msve-vector-bits=N` has to be set where `N` is a power of two in the range of `128`to `2048`, indicating the length of an SVE vector. Since SVE is rather young, we decided to disable it by default even if it would be available. A user has to enable it explicitly by defining `EIGEN_ARM64_USE_SVE`. This patch introduces the packet types `PacketXf` and `PacketXi` for packets of `float` and `int32_t` respectively. The size of these packets depends on the SVE vector length. E.g. if `-msve-vector-bits=512` is set, `PacketXf` will contain `512/32 = 16` elements. This MR is joint work with Miguel Tairum <miguel.tairum@arm.com>.
* Fix pfrexp/pldexp for half.Gravatar Antonio Sanchez2021-01-21
| | | | | | | | | | The recent addition of vectorized pow (!330) relies on `pfrexp` and `pldexp`. This was missing for `Eigen::half` and `Eigen::bfloat16`. Adding tests for these packet ops also exposed an issue with handling negative values in `pfrexp`, returning an incorrect exponent. Added the missing implementations, corrected the exponent in `pfrexp1`, and added `packetmath` tests.
* Fix signed-unsigned comparison.Gravatar Antonio Sanchez2021-01-20
| | | | | | | | Hex literals are interpreted as unsigned, leading to a comparison between signed max supported function `abcd[0]` (which was negative) to the unsigned literal `0x80000006`. Should not change result since signed is implicitly converted to unsigned for the comparison, but eliminates the warning.
* Proper CPUIDGravatar Ivan Popivanov2021-01-18
|
* Vectorize `pow(x, y)`. This closes ↵Gravatar Rasmus Munk Larsen2021-01-18
| | | | | | | | | | | | | | | | | | | | | | https://gitlab.com/libeigen/eigen/-/issues/2085, which also contains a description of the algorithm. I ran some testing (comparing to `std::pow(double(x), double(y)))` for `x` in the set of all (positive) floats in the interval `[std::sqrt(std::numeric_limits<float>::min()), std::sqrt(std::numeric_limits<float>::max())]`, and `y` in `{2, sqrt(2), -sqrt(2)}` I get the following error statistics: ``` max_rel_error = 8.34405e-07 rms_rel_error = 2.76654e-07 ``` If I widen the range to all normal float I see lower accuracy for arguments where the result is subnormal, e.g. for `y = sqrt(2)`: ``` max_rel_error = 0.666667 rms = 6.8727e-05 count = 1335165689 argmax = 2.56049e-32, 2.10195e-45 != 1.4013e-45 ``` which seems reasonable, since these results are subnormals with only couple of significant bits left.
* Improved std::complex sqrt and rsqrt.Gravatar Antonio Sanchez2021-01-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously `complex_sqrt` was only used for CUDA and MSVC), and implements custom `complex_rsqrt`. Also introduces `numext::rsqrt` to simplify implementation, and modified `numext::hypot` to adhere to IEEE IEC 6059 for special cases. The `complex_sqrt` and `complex_rsqrt` implementations were found to be significantly faster than `std::sqrt<std::complex<T>>` and `1/numext::sqrt<std::complex<T>>`. Benchmark file attached. ``` GCC 10, Intel Xeon, x86_64: --------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_Sqrt<std::complex<float>> 9.21 ns 9.21 ns 73225448 BM_StdSqrt<std::complex<float>> 17.1 ns 17.1 ns 40966545 BM_Sqrt<std::complex<double>> 8.53 ns 8.53 ns 81111062 BM_StdSqrt<std::complex<double>> 21.5 ns 21.5 ns 32757248 BM_Rsqrt<std::complex<float>> 10.3 ns 10.3 ns 68047474 BM_DivSqrt<std::complex<float>> 16.3 ns 16.3 ns 42770127 BM_Rsqrt<std::complex<double>> 11.3 ns 11.3 ns 61322028 BM_DivSqrt<std::complex<double>> 16.5 ns 16.5 ns 42200711 Clang 11, Intel Xeon, x86_64: --------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_Sqrt<std::complex<float>> 7.46 ns 7.45 ns 90742042 BM_StdSqrt<std::complex<float>> 16.6 ns 16.6 ns 42369878 BM_Sqrt<std::complex<double>> 8.49 ns 8.49 ns 81629030 BM_StdSqrt<std::complex<double>> 21.8 ns 21.7 ns 31809588 BM_Rsqrt<std::complex<float>> 8.39 ns 8.39 ns 82933666 BM_DivSqrt<std::complex<float>> 14.4 ns 14.4 ns 48638676 BM_Rsqrt<std::complex<double>> 9.83 ns 9.82 ns 70068956 BM_DivSqrt<std::complex<double>> 15.7 ns 15.7 ns 44487798 Clang 9, Pixel 2, aarch64: --------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_Sqrt<std::complex<float>> 24.2 ns 24.1 ns 28616031 BM_StdSqrt<std::complex<float>> 104 ns 103 ns 6826926 BM_Sqrt<std::complex<double>> 31.8 ns 31.8 ns 22157591 BM_StdSqrt<std::complex<double>> 128 ns 128 ns 5437375 BM_Rsqrt<std::complex<float>> 31.9 ns 31.8 ns 22384383 BM_DivSqrt<std::complex<float>> 99.2 ns 98.9 ns 7250438 BM_Rsqrt<std::complex<double>> 46.0 ns 45.8 ns 15338689 BM_DivSqrt<std::complex<double>> 119 ns 119 ns 5898944 ```
* 1)provide a better generic paddsub op implementationGravatar Guoqiang QI2021-01-13
| | | | | 2)make paddsub op support the Packet2cf/Packet4f/Packet2f in NEON 3)make paddsub op support the Packet2cf/Packet4f in SSE
* Remove `inf` local variable.Gravatar Antonio Sanchez2021-01-12
| | | | | | Apparently `inf` is a macro on iOS for `std::numeric_limits<T>::infinity()`, causing a compile error here. We don't need the local anyways since it's only used in one spot.
* Remove TODO from Transform::computeScaleRotation()Gravatar Antonio Sanchez2021-01-11
| | | | | | | Upon investigation, `JacobiSVD` is significantly faster than `BDCSVD` for small matrices (twice as fast for 2x2, 20% faster for 3x3, 1% faster for 10x10). Since the majority of cases will be small, let's stick with `JacobiSVD`. See !361.
* Transform::computeScalingRotation flush determinant to +/- 1.Gravatar Antonio Sanchez2021-01-11
| | | | | | | | In the previous code, in attempting to correct for a negative determinant, we end up multiplying and dividing by a number that is often very near, but not exactly +/-1. By flushing to +/-1, we can replace a division with a multiplication, and results are more numerically consistent.
* Only specialize complex `sqrt_impl` for CUDA if not MSVC.Gravatar Antonio Sanchez2021-01-11
| | | | | We already specialize `sqrt_impl` on windows due to MSVC's mishandling of `inf` (!355).
* Fix for breakage in ROCm support - 210108Gravatar Deven Desai2021-01-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following commit breaks ROCm support for Eigen https://gitlab.com/libeigen/eigen/-/commit/f149e0ebc3d3d5ca63234e58ca72690caf07e3b5 All unit tests fail with the following error ``` Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o In file included from /home/rocm-user/eigen/test/gpu_basic.cu:19: In file included from /home/rocm-user/eigen/test/main.h:356: In file included from /home/rocm-user/eigen/Eigen/QR:11: In file included from /home/rocm-user/eigen/Eigen/Core:166: /home/rocm-user/eigen/Eigen/src/Core/MathFunctionsImpl.h:105:35: error: __host__ __device__ function 'complex_sqrt' cannot overload __host__ function 'complex_sqrt' EIGEN_DEVICE_FUNC std::complex<T> complex_sqrt(const std::complex<T>& z) { ^ /home/rocm-user/eigen/Eigen/src/Core/MathFunctions.h:342:38: note: previous declaration is here template<typename T> std::complex<T> complex_sqrt(const std::complex<T>& a_x); ^ 1 error generated when compiling for gfx900. CMake Error at gpu_basic_generated_gpu_basic.cu.o.cmake:192 (message): Error generating file /home/rocm-user/eigen/build/test/CMakeFiles/gpu_basic.dir//./gpu_basic_generated_gpu_basic.cu.o test/CMakeFiles/gpu_basic.dir/build.make:63: recipe for target 'test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o' failed make[3]: *** [test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o] Error 1 CMakeFiles/Makefile2:16618: recipe for target 'test/CMakeFiles/gpu_basic.dir/all' failed make[2]: *** [test/CMakeFiles/gpu_basic.dir/all] Error 2 CMakeFiles/Makefile2:16625: recipe for target 'test/CMakeFiles/gpu_basic.dir/rule' failed make[1]: *** [test/CMakeFiles/gpu_basic.dir/rule] Error 2 Makefile:5401: recipe for target 'gpu_basic' failed make: *** [gpu_basic] Error 2 ``` The error message is accurate, and the fix (provided in thsi commit) is trivial.
* Fix MSVC complex sqrt and packetmath test.Gravatar Antonio Sanchez2021-01-08
| | | | | | | | | MSVC incorrectly handles `inf` cases for `std::sqrt<std::complex<T>>`. Here we replace it with a custom version (currently used on GPU). Also fixed the `packetmath` test, which previously skipped several corner cases since `CHECK_CWISE1` only tests the first `PacketSize` elements.
* Make Transform::computeRotationScaling(0,&S) continuousGravatar Essex Edwards2021-01-07
|
* Add missing #endif directive in Macros.hGravatar David Tellenbach2021-01-07
|
* #define was defined incorrectly because the result_of function was ↵Gravatar shrek14022021-01-07
| | | | deprecated in c++17 and removed in c++20. Also, EIGEN_COMP_MSVC (which is _MSC_VER) only affects result_of indirectly, which can cause errors.
* Fix Ref initialization.Gravatar Antonio Sanchez2021-01-06
| | | | | | | | | Since `eigen_assert` is a macro, the statements can become noops (e.g. when compiling for GPU), so they may not execute the contained logic -- which in this case is the entire `Ref` construction. We need to separate the assert from statements which have consequences. Fixes #2113
* Allow CwiseUnaryView to be used on device.Gravatar Antonio Sanchez2021-01-06
| | | | Added `EIGEN_DEVICE_FUNC` to methods.
* Fix Ref Stride checks.Gravatar Antonio Sanchez2021-01-05
| | | | | | | | | | | | | | | | | | | | The existing `Ref` class failed to consider cases where the Ref's `Stride` setting *could* match the underlying referred object's stride, but **didn't** at runtime. This led to trying to set invalid stride values, causing runtime failures in some cases, and garbage due to mismatched strides in others. Here we add the missing runtime checks. This involves computing the strides necessary to align with the referred object's storage, and verifying we can actually set those strides at runtime. In the `const` case, if it *may* be possible to refer to the original storage at compile-time but fails at runtime, then we defer to the `construct(...)` method that makes a copy. Added more tests to check these cases. Fixes #2093.
* Eliminate boolean product warnings by factoring out aGravatar Christoph Hertzberg2021-01-05
| | | `combine_scalar_factors` helper function.
* Add CUDA complex sqrt.Gravatar Antonio Sanchez2020-12-22
| | | | | | | | | | | | | | | This is to support scalar `sqrt` of complex numbers `std::complex<T>` on device, requested by Tensorflow folks. Technically `std::complex` is not supported by NVCC on device (though it is by clang), so the default `sqrt(std::complex<T>)` function only works on the host. Here we create an overload to add back the functionality. Also modified the CMake file to add `--relaxed-constexpr` (or equivalent) flag for NVCC to allow calling constexpr functions from device functions, and added support for specifying compute architecture for NVCC (was already available for clang).
* Fix missing EIGEN_DEVICE_FUNCGravatar rgreenblatt2020-12-20
|
* * Add iterative psqrt<double> for AVX and SSE when FMA is available. This ↵Gravatar Rasmus Munk Larsen2020-12-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | provides a ~10% speedup. * Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA. * Remove iterative psqrt<double> for NEON, because the initial rsqrt apprimation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster. The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx. AVX+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 1% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.08ns ± 1% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 95.5ns ± 0% ~ BM_eigen_sqrt_float/4k 776ns ± 0% 763ns ± 0% -1.67% BM_eigen_sqrt_float/32k 6.57µs ± 1% 6.13µs ± 0% -6.69% BM_eigen_sqrt_float/256k 83.7µs ± 3% 83.3µs ± 2% ~ BM_eigen_sqrt_float/1M 335µs ± 2% 332µs ± 2% ~ SSE+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 0% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.06ns ± 0% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 96.3ns ± 4% ~ BM_eigen_sqrt_float/4k 774ns ± 0% 763ns ± 0% -1.50% BM_eigen_sqrt_float/32k 6.58µs ± 2% 6.11µs ± 0% -7.06% BM_eigen_sqrt_float/256k 82.7µs ± 1% 82.6µs ± 1% ~ BM_eigen_sqrt_float/1M 330µs ± 1% 329µs ± 2% ~ SSE+FMA (double) BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.08ns ± 0% -6.68% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.65% BM_eigen_sqrt_double/512 417ns ± 0% 374ns ± 1% -10.29% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -11.00% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.7µs ± 0% -11.07% BM_eigen_sqrt_double/256k 213µs ± 0% 206µs ± 1% -3.31% BM_eigen_sqrt_double/1M 862µs ± 0% 870µs ± 2% +0.96% AVX+FMA (double) name old cpu/op new cpu/op delta BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.06ns ± 0% -6.95% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.80% BM_eigen_sqrt_double/512 417ns ± 0% 373ns ± 1% -10.59% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -10.79% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.8µs ± 0% -10.94% BM_eigen_sqrt_double/256k 214µs ± 0% 208µs ± 2% -2.76% BM_eigen_sqrt_double/1M 866µs ± 3% 923µs ± 7% ~
* Add an additional step of Newton-Raphson for `psqrt<double>` on Arm, which ↵Gravatar Rasmus Munk Larsen2020-12-15
| | | | otherwise has an error of ~1000 ulps.
* Remove comma at the end of enumeration list to silence C++03 warningsGravatar David Tellenbach2020-12-13
|
* Fix implicit cast to double.Gravatar Antonio Sanchez2020-12-12
| | | | | Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors in Google due to `-Wall`.
* Fix NEON pmax<PropagateNumbers,Packet4bf>.Gravatar Antonio Sanchez2020-12-11
| | | | Simple typo, the max impl called pmin instead of pmax for floats.
* Fix typo in AVX512 packet math.Gravatar Antonio Sanchez2020-12-11
|
* Remove unused macro in Half.hGravatar David Tellenbach2020-12-12
|
* Fix more SSE/AVX packet conversions for peven.Gravatar Antonio Sanchez2020-12-11
| | | | MSVC doesn't like function-style casts and forces us to use intrinsics.
* Replace M_LOG2E and M_LN2 with custom macros.Gravatar Antonio Sanchez2020-12-11
| | | | | | | | | | For these to exist we would need to define `_USE_MATH_DEFINES` before `cmath` or `math.h` is first included. However, we don't control the include order for projects outside Eigen, so even defining the macro in `Eigen/Core` does not fix the issue for projects that end up including `<cmath>` before Eigen does (explicitly or transitively). To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.
* Fix MSVC SSE casts.Gravatar Antonio Sanchez2020-12-11
| | | | | MSVC doesn't like __m128(__m128i) c-style casts, so packets need to be converted using intrinsic methods.
* Fix for broken ROCm/HIP SupportGravatar Deven Desai2020-12-11
| | | | | | | | | | | | | | | | | | | | | | | | | The following commit introduced a breakage in ROCm/HIP support for Eigen. https://gitlab.com/libeigen/eigen/-/commit/5ec4907434742d4555df4aa708b665868b88f3b4#1958e65719641efe5483abc4ce0b61806270f6f3_525_517 ``` Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20: In file included from /home/rocm-user/eigen/test/main.h:356: In file included from /home/rocm-user/eigen/Eigen/QR:11: In file included from /home/rocm-user/eigen/Eigen/Core:222: /home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'? return half2half2(from); ^~~~~~~~~~ __half2half2 /opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here __half2 __half2half2(__half x) ^ 1 error generated when compiling for gfx900. ``` The cause seems to be a copy-paster error, and the fix is trivial
* Don't guard psqrt for std::complex<float> with EIGEN_ARCH_ARM64Gravatar David Tellenbach2020-12-11
|
* Add Armv8 guard on PropagateNumbers implementation.Gravatar Everton Constantino2020-12-10
|
* Remove private access of std::deque::_M_impl.Gravatar Antonio Sanchez2020-12-10
| | | | | This no longer works on gcc or clang, so we should just remove the hack. The default should compile to similar code anyways.
* Fix vectorization of complex sqrt on NEONGravatar David Tellenbach2020-12-10
|
* Remove comma at end of enumerator list in NEON PacketMathGravatar David Tellenbach2020-12-10
|
* Fix a typo in SparseMatrix documentation.Gravatar David Tellenbach2020-12-09
| | | | This fixes issue #2091.
* Implement vectorized complex square root.Gravatar Rasmus Munk Larsen2020-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Closes #1905 Measured speedup for sqrt of `complex<float>` on Skylake: SSE: ``` name old time/op new time/op delta BM_eigen_sqrt_ctype/1 49.4ns ± 0% 54.3ns ± 0% +10.01% BM_eigen_sqrt_ctype/8 332ns ± 0% 50ns ± 1% -84.97% BM_eigen_sqrt_ctype/64 2.81µs ± 1% 0.38µs ± 0% -86.49% BM_eigen_sqrt_ctype/512 23.8µs ± 0% 3.0µs ± 0% -87.32% BM_eigen_sqrt_ctype/4k 202µs ± 0% 24µs ± 2% -88.03% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.19ms ± 0% -88.18% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 1.5ms ± 1% -88.20% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 6.2ms ± 0% -88.18% ``` AVX2: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.6ns ± 0% 55.6ns ± 0% +3.71% BM_eigen_sqrt_ctype/8 334ns ± 0% 27ns ± 0% -91.86% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.22µs ± 2% -92.28% BM_eigen_sqrt_ctype/512 23.8µs ± 1% 1.7µs ± 1% -92.81% BM_eigen_sqrt_ctype/4k 201µs ± 0% 14µs ± 1% -93.24% BM_eigen_sqrt_ctype/32k 1.62ms ± 0% 0.11ms ± 1% -93.29% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.9ms ± 1% -93.31% BM_eigen_sqrt_ctype/1M 52.0ms ± 0% 3.5ms ± 1% -93.31% ``` AVX512: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.7ns ± 0% 56.2ns ± 1% +4.75% BM_eigen_sqrt_ctype/8 334ns ± 0% 18ns ± 2% -94.63% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.12µs ± 1% -95.54% BM_eigen_sqrt_ctype/512 23.9µs ± 1% 1.0µs ± 1% -95.89% BM_eigen_sqrt_ctype/4k 202µs ± 0% 8µs ± 1% -96.13% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.06ms ± 1% -96.15% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.5ms ± 4% -96.11% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 2.0ms ± 1% -96.13% ```
* Fix host/device calls for __half.Gravatar Antonio Sanchez2020-12-08
| | | | | | The previous code had `__host__ __device__` functions calling `__device__` functions (e.g. `__low2half`) which caused build failures in tensorflow. Also tried to simplify the `#ifdef` guards to make them more clear.
* - Enabling PropagateNaN and PropagateNumbers for NEON.Gravatar Everton Constantino2020-12-08
| | | | - Adding propagate tests to bfloat16.
* Fix unused warning on new `dense_assignment_loop` impl.Gravatar Antonio Sanchez2020-12-07
|
* Add specialization for compile-time zero-sized dense assignment.Gravatar Antonio Sanchez2020-12-07
| | | | | | | | | | | | | | | In the current `dense_assignment_loop` implementations, if the destination's inner or outer size is zero at compile time and if the kernel involves a product, we currently get a compile error (#2080). This is triggered by attempting to multiply a non-existent row by a column (or vice-versa). To address this, we add a specialization for zero-sized assignments (`AllAtOnceTraversal`) which evaluates to a no-op. We also add a static check to ensure the size is in-fact zero. This now seems to be the only existing use of `AllAtOnceTraversal`. Fixes #2080.
* Clean up `#if`s in GPU PacketPath.Gravatar Antonio Sanchez2020-12-04
| | | | | | | | | | | Removed redundant checks and redundant code for CUDA/HIP. Note: there are several issues here of calling `__device__` functions from `__host__ __device__` functions, in particular `__low2half`. We do not address that here -- only modifying this file enough to get our current tests to compile. Fixed: #1847