| Commit message (Collapse) | Author | Age |
... | |
|
|
|
| |
Forgot to test this. Fixes bug introduced in !416.
|
|
|
|
|
|
|
|
|
|
| |
The original will saturate if the input does not fit into an integer
type. Here we fix this, returning the input if it doesn't have
enough precision to have a fractional part.
Also added `pceil` for NEON.
Fixes #1969.
|
| |
|
|
|
|
| |
in certain situations.
|
|
|
|
| |
to non-const when using vector_pair with certain built-ins.
|
|
|
|
|
|
|
| |
Accuracy is too poor - requires at least two Newton iterations, but then
it is no longer significantly faster than `vsqrt`.
Fixes #2094.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added `EIGEN_HAS_STD_HASH` macro, checking for C++11 support and not
running on GPU.
`std::hash<float>` is not a device function, so cannot be used by
`std::hash<bfloat16>`. Removed `EIGEN_DEVICE_FUNC` and only
define if `EIGEN_HAS_STD_HASH`. Same for `half`.
Added `EIGEN_CUDA_HAS_FP16_ARITHMETIC` to improve readability,
eliminate warnings about `EIGEN_CUDA_ARCH` not being defined.
Replaced a couple C-style casts with `reinterpret_cast` for aligned
loading of `half*` to `half2*`. This eliminates `-Wcast-align`
warnings in clang. Although not ideal due to potential type aliasing,
this is how CUDA handles these conversions internally.
|
|
|
|
|
|
|
| |
double that
make pow<double> accurate the 1 ULP. Speed for AVX-512 is within 0.5% of the currect
implementation.
|
| |
|
|
|
|
| |
functions.
|
|
|
|
| |
available. Otherwise the accuracy drops from 1 ulp to 3 ulp.
|
| |
|
| |
|
|
|
|
| |
for float, while still being 10x faster than std::pow for AVX512. A future change will introduce a specialization for double.
|
|
|
|
|
|
| |
The original implementation fails for 0, denormals, inf, and NaN.
See #2150
|
|
|
|
| |
kernel)
|
|
|
|
|
| |
It's slightly faster and slightly more accurate, allowing our current
packetmath tests to pass for sqrt with a single iteration.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original clamping bounds on `_x` actually produce finite values:
```
exp(88.3762626647950) = 2.40614e+38 < 3.40282e+38
exp(709.437) = 1.27226e+308 < 1.79769e+308
```
so with an accurate `ldexp` implementation, `pexp` fails for large
inputs, producing finite values instead of `inf`.
This adjusts the bounds slightly outside the finite range so that
the output will overflow to +/- `inf` as expected.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous implementations produced garbage values if the exponent did
not fit within the exponent bits. See #2131 for a complete discussion,
and !375 for other possible implementations.
Here we implement the 4-factor version. See `pldexp_impl` in
`GenericPacketMathFunctions.h` for a full description.
The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>`
requires `por`.
Left as a "TODO" is to delegate to a faster version if we know the
exponent does fit within the exponent bits.
Fixes #2131.
|
| |
|
|
|
|
| |
result is always zero or infinite unless x is one.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clang does a poor job of optimizing the GEBP microkernel on 32-bit ARM,
leading to excessive 16-byte register spills, slowing down basic f32
matrix multiplication by approx 50%.
By specializing `gebp_traits`, we can eliminate the register spills.
Volatile inline ASM both acts as a barrier to prevent reordering and
enforces strict register use. In a simple f32 matrix multiply example,
this modification reduces 16-byte spills from 109 instances to zero,
leading to a 1.5x speed increase (search for `16-byte Spill` in the
assembly in https://godbolt.org/z/chsPbE).
This is a replacement of !379. See there for further discussion.
Also moved `gebp_traits` specializations for NEON to
`Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h` to be alongside
other NEON-specific code.
Fixes #2138.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Allows the altivec packetmath tests to pass. There were a few issues:
- `pstoreu` was missing MSQ on `_BIG_ENDIAN` systems
- `cmp_*` didn't properly handle conversion of bool flags (0x7FC instead
of 0xFFFF)
- `pfrexp` needed to set the `exponent` argument.
Related to !370, #2128
cc: @ChipKerchner @pdrocaldeira
Tested on `_BIG_ENDIAN` running on QEMU with VSX. Couldn't figure out build
flags to get it to work for little endian.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originating from
[this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case),
some win32 compilers define `__int32` as a `long`, but MinGW defines
`std::int32_t` as an `int`, leading to a type conflict.
To avoid this, we remove the custom `typedef` definitions for win32. The
Tensor module requires C++11 anyways, so we are guaranteed to have
included `<cstdint>` already in `Eigen/Core`.
Also re-arranged the headers to only include `<cstdint>` in one place to
avoid this type of error again.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new `generic_pow` implementation was failing for half/bfloat16 since
their construction from int/float is not `constexpr`. Modified
in `GenericPacketMathFunctions` to remove `constexpr`.
While adding tests for half/bfloat16, found other issues related to
implicit conversions.
Also needed to implement `numext::arg` for non-integer, non-complex,
non-float/double/long double types. These seem to be implicitly
converted to `std::complex<T>`, which then fails for half/bfloat16.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
NVCC and older versions of clang do not fully support `std::complex` on device,
leading to either compile errors (Cannot call `__host__` function) or worse,
runtime errors (Illegal instruction). For most functions, we can
implement specialized `numext` versions. Here we specialize the standard
operators (with the exception of stream operators and member function operators
with a scalar that are already specialized in `<complex>`) so they can be used
in device code as well.
To import these operators into the current scope, use
`EIGEN_USING_STD_COMPLEX_OPERATORS`. By default, these are imported into
the `Eigen`, `Eigen:internal`, and `Eigen::numext` namespaces.
This allow us to remove specializations of the
sum/difference/product/quotient ops, and allow us to treat complex
numbers like most other scalars (e.g. in tests).
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds support for Arm's new vector extension SVE (Scalable Vector Extension). In contrast to other vector extensions that are supported by Eigen, SVE types are inherently *sizeless*. For the use in Eigen we fix their size at compile-time (note that this is not necessary in general, SVE is *length agnostic*).
During compilation the flag `-msve-vector-bits=N` has to be set where `N` is a power of two in the range of `128`to `2048`, indicating the length of an SVE vector.
Since SVE is rather young, we decided to disable it by default even if it would be available. A user has to enable it explicitly by defining `EIGEN_ARM64_USE_SVE`.
This patch introduces the packet types `PacketXf` and `PacketXi` for packets of `float` and `int32_t` respectively. The size of these packets depends on the SVE vector length. E.g. if `-msve-vector-bits=512` is set, `PacketXf` will contain `512/32 = 16` elements.
This MR is joint work with Miguel Tairum <miguel.tairum@arm.com>.
|
|
|
|
|
|
|
|
|
|
| |
The recent addition of vectorized pow (!330) relies on `pfrexp` and
`pldexp`. This was missing for `Eigen::half` and `Eigen::bfloat16`.
Adding tests for these packet ops also exposed an issue with handling
negative values in `pfrexp`, returning an incorrect exponent.
Added the missing implementations, corrected the exponent in `pfrexp1`,
and added `packetmath` tests.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://gitlab.com/libeigen/eigen/-/issues/2085, which also contains a description of the algorithm.
I ran some testing (comparing to `std::pow(double(x), double(y)))` for `x` in the set of all (positive) floats in the interval `[std::sqrt(std::numeric_limits<float>::min()), std::sqrt(std::numeric_limits<float>::max())]`, and `y` in `{2, sqrt(2), -sqrt(2)}` I get the following error statistics:
```
max_rel_error = 8.34405e-07
rms_rel_error = 2.76654e-07
```
If I widen the range to all normal float I see lower accuracy for arguments where the result is subnormal, e.g. for `y = sqrt(2)`:
```
max_rel_error = 0.666667
rms = 6.8727e-05
count = 1335165689
argmax = 2.56049e-32, 2.10195e-45 != 1.4013e-45
```
which seems reasonable, since these results are subnormals with only couple of significant bits left.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously
`complex_sqrt` was only used for CUDA and MSVC), and implements
custom `complex_rsqrt`.
Also introduces `numext::rsqrt` to simplify implementation, and modified
`numext::hypot` to adhere to IEEE IEC 6059 for special cases.
The `complex_sqrt` and `complex_rsqrt` implementations were found to be
significantly faster than `std::sqrt<std::complex<T>>` and
`1/numext::sqrt<std::complex<T>>`.
Benchmark file attached.
```
GCC 10, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>> 9.21 ns 9.21 ns 73225448
BM_StdSqrt<std::complex<float>> 17.1 ns 17.1 ns 40966545
BM_Sqrt<std::complex<double>> 8.53 ns 8.53 ns 81111062
BM_StdSqrt<std::complex<double>> 21.5 ns 21.5 ns 32757248
BM_Rsqrt<std::complex<float>> 10.3 ns 10.3 ns 68047474
BM_DivSqrt<std::complex<float>> 16.3 ns 16.3 ns 42770127
BM_Rsqrt<std::complex<double>> 11.3 ns 11.3 ns 61322028
BM_DivSqrt<std::complex<double>> 16.5 ns 16.5 ns 42200711
Clang 11, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>> 7.46 ns 7.45 ns 90742042
BM_StdSqrt<std::complex<float>> 16.6 ns 16.6 ns 42369878
BM_Sqrt<std::complex<double>> 8.49 ns 8.49 ns 81629030
BM_StdSqrt<std::complex<double>> 21.8 ns 21.7 ns 31809588
BM_Rsqrt<std::complex<float>> 8.39 ns 8.39 ns 82933666
BM_DivSqrt<std::complex<float>> 14.4 ns 14.4 ns 48638676
BM_Rsqrt<std::complex<double>> 9.83 ns 9.82 ns 70068956
BM_DivSqrt<std::complex<double>> 15.7 ns 15.7 ns 44487798
Clang 9, Pixel 2, aarch64:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>> 24.2 ns 24.1 ns 28616031
BM_StdSqrt<std::complex<float>> 104 ns 103 ns 6826926
BM_Sqrt<std::complex<double>> 31.8 ns 31.8 ns 22157591
BM_StdSqrt<std::complex<double>> 128 ns 128 ns 5437375
BM_Rsqrt<std::complex<float>> 31.9 ns 31.8 ns 22384383
BM_DivSqrt<std::complex<float>> 99.2 ns 98.9 ns 7250438
BM_Rsqrt<std::complex<double>> 46.0 ns 45.8 ns 15338689
BM_DivSqrt<std::complex<double>> 119 ns 119 ns 5898944
```
|
|
|
|
|
| |
2)make paddsub op support the Packet2cf/Packet4f/Packet2f in NEON
3)make paddsub op support the Packet2cf/Packet4f in SSE
|
|
|
|
|
| |
We already specialize `sqrt_impl` on windows due to MSVC's mishandling
of `inf` (!355).
|
|
|
|
|
|
|
|
|
| |
MSVC incorrectly handles `inf` cases for `std::sqrt<std::complex<T>>`.
Here we replace it with a custom version (currently used on GPU).
Also fixed the `packetmath` test, which previously skipped several
corner cases since `CHECK_CWISE1` only tests the first `PacketSize`
elements.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
device, requested by Tensorflow folks.
Technically `std::complex` is not supported by NVCC on device
(though it is by clang), so the default `sqrt(std::complex<T>)` function only
works on the host. Here we create an overload to add back the
functionality.
Also modified the CMake file to add `--relaxed-constexpr` (or
equivalent) flag for NVCC to allow calling constexpr functions from
device functions, and added support for specifying compute architecture for
NVCC (was already available for clang).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
provides a ~10% speedup.
* Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA.
* Remove iterative psqrt<double> for NEON, because the initial rsqrt apprimation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster.
The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx.
AVX+FMA (float)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 1% ~
BM_eigen_sqrt_float/8 2.07ns ± 0% 2.08ns ± 1% ~
BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~
BM_eigen_sqrt_float/512 95.7ns ± 0% 95.5ns ± 0% ~
BM_eigen_sqrt_float/4k 776ns ± 0% 763ns ± 0% -1.67%
BM_eigen_sqrt_float/32k 6.57µs ± 1% 6.13µs ± 0% -6.69%
BM_eigen_sqrt_float/256k 83.7µs ± 3% 83.3µs ± 2% ~
BM_eigen_sqrt_float/1M 335µs ± 2% 332µs ± 2% ~
SSE+FMA (float)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 0% ~
BM_eigen_sqrt_float/8 2.07ns ± 0% 2.06ns ± 0% ~
BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~
BM_eigen_sqrt_float/512 95.7ns ± 0% 96.3ns ± 4% ~
BM_eigen_sqrt_float/4k 774ns ± 0% 763ns ± 0% -1.50%
BM_eigen_sqrt_float/32k 6.58µs ± 2% 6.11µs ± 0% -7.06%
BM_eigen_sqrt_float/256k 82.7µs ± 1% 82.6µs ± 1% ~
BM_eigen_sqrt_float/1M 330µs ± 1% 329µs ± 2% ~
SSE+FMA (double)
BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~
BM_eigen_sqrt_double/8 6.51ns ± 0% 6.08ns ± 0% -6.68%
BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.65%
BM_eigen_sqrt_double/512 417ns ± 0% 374ns ± 1% -10.29%
BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -11.00%
BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.7µs ± 0% -11.07%
BM_eigen_sqrt_double/256k 213µs ± 0% 206µs ± 1% -3.31%
BM_eigen_sqrt_double/1M 862µs ± 0% 870µs ± 2% +0.96%
AVX+FMA (double)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~
BM_eigen_sqrt_double/8 6.51ns ± 0% 6.06ns ± 0% -6.95%
BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.80%
BM_eigen_sqrt_double/512 417ns ± 0% 373ns ± 1% -10.59%
BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -10.79%
BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.8µs ± 0% -10.94%
BM_eigen_sqrt_double/256k 214µs ± 0% 208µs ± 2% -2.76%
BM_eigen_sqrt_double/1M 866µs ± 3% 923µs ± 7% ~
|
|
|
|
| |
otherwise has an error of ~1000 ulps.
|
| |
|
|
|
|
|
| |
Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors
in Google due to `-Wall`.
|
|
|
|
| |
Simple typo, the max impl called pmin instead of pmax for floats.
|
| |
|
| |
|
|
|
|
| |
MSVC doesn't like function-style casts and forces us to use intrinsics.
|
|
|
|
|
|
|
|
|
|
| |
For these to exist we would need to define `_USE_MATH_DEFINES` before
`cmath` or `math.h` is first included. However, we don't
control the include order for projects outside Eigen, so even defining
the macro in `Eigen/Core` does not fix the issue for projects that
end up including `<cmath>` before Eigen does (explicitly or transitively).
To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.
|
|
|
|
|
| |
MSVC doesn't like __m128(__m128i) c-style casts, so packets need to be
converted using intrinsic methods.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following commit introduced a breakage in ROCm/HIP support for Eigen.
https://gitlab.com/libeigen/eigen/-/commit/5ec4907434742d4555df4aa708b665868b88f3b4#1958e65719641efe5483abc4ce0b61806270f6f3_525_517
```
Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o
In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20:
In file included from /home/rocm-user/eigen/test/main.h:356:
In file included from /home/rocm-user/eigen/Eigen/QR:11:
In file included from /home/rocm-user/eigen/Eigen/Core:222:
/home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'?
return half2half2(from);
^~~~~~~~~~
__half2half2
/opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here
__half2 __half2half2(__half x)
^
1 error generated when compiling for gfx900.
```
The cause seems to be a copy-paster error, and the fix is trivial
|
| |
|
| |
|