| Commit message | Author | Age |
| |
|
|
|
|
| |
CJMADD, which were effectively unused, apart from on x86, where the change results in identically performing code.
|
|
|
|
| |
optimization changing sign with --ffast-math enabled.
|
|
|
|
| |
Unified implementation using only `vzip`.
|
| |
This is a new version of !423, which failed for MSVC.
Defined `EIGEN_OPTIMIZATION_BARRIER(X)`, which uses inline assembly to
prevent operations involving `X` from being moved across the barrier. It
should work on most GNUC-compatible compilers (MSVC doesn't seem to need
this). This is a modified version of what was used in `psincos_float`,
tested on more platforms (see #1674, https://godbolt.org/z/73ezTG).
Modified `rint` to use the barrier to prevent the add/subtract rounding
trick from being optimized away.
Also fixed an edge case for large inputs that get bumped up a power of
two and end up rounding away more than just the fractional part. If we
are over `2^digits`, just return the input. This edge case was missed in
the test since the test was comparing approximate equality, which was
still satisfied. Adding a strict equality option catches it.
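As a rough scalar illustration (not Eigen's actual macro or kernel; the
names `OPT_BARRIER` and `rint_sketch` and the `"+m"` constraint are
stand-ins chosen for portability of the sketch), the barrier plus the
add/subtract rounding trick look like this on a GNUC-compatible compiler:

```cpp
// Stand-in for an optimization barrier: empty inline asm with an
// in/out operand, so the compiler must materialize x at this point and
// cannot fold the surrounding arithmetic away across it.
#define OPT_BARRIER(x) __asm__ volatile("" : "+m"(x))

// Round-to-nearest via the add/subtract trick, protected by the barrier.
float rint_sketch(float a) {
  const float k = 8388608.0f;       // 2^23: above this, float has no fraction
  if (a >= k || a <= -k) return a;  // over 2^digits: just return the input
  const float s = (a >= 0.0f) ? k : -k;
  float r = a + s;                  // the fraction rounds off here...
  OPT_BARRIER(r);                   // ...and must not be optimized back in
  r -= s;
  return r;
}
```

The large-input early return is exactly the edge case described above:
adding `2^23` to such a value would round away integer bits as well.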
|
|
|
| |
This reverts commit e72dfeb8b9fa5662831b5d0bb9d132521f9173dd
|
| |
It seems that *sometimes*, with aggressive optimizations, the
`psub(padd(a, b), b)` trick used to force rounding is compiled away.
Here we replace it with inline assembly to prevent this (I tried
`volatile`, but that leads to additional loads from memory).
Also fixed an edge case for large inputs `a` where adding `b` bumps the
value up a power of two and ends up rounding away more than just the
fractional part. If we are over `2^digits`, just return the input. This
edge case was missed in the test since the test was comparing
approximate equality, which was still satisfied. Adding a strict
equality option catches it.
|
| |
In SSE, by adding/subtracting 2^MantissaBits, we force rounding according to the
current rounding mode.
For NEON, we use the provided intrinsics for rint/floor/ceil if
available (armv8).
Related to #1969.
|
| |
With !406, we accidentally broke 32-bit arm NEON builds, since
`vsqrt_f32` is only available on 64-bit. Here we add back the `rsqrt`
implementation for 32-bit, relying on a `prsqrt` implementation with
better handling of edge cases.
Note that several of the 32-bit NEON packet tests are currently failing,
either due to denormal handling (the NEON versions flush to zero, but
the scalar paths don't) or due to accuracy (e.g. sin/cos).
|
|
|
|
| |
Forgot to test this. Fixes bug introduced in !416.
|
| |
The original will saturate if the input does not fit into an integer
type. Here we fix this, returning the input if it doesn't have
enough precision to have a fractional part.
Also added `pceil` for NEON.
Fixes #1969.
|
| |
Accuracy is too poor - requires at least two Newton iterations, but then
it is no longer significantly faster than `vsqrt`.
Fixes #2094.
|
| |
The original implementation fails for 0, denormals, inf, and NaN.
See #2150.
|
|
|
|
| |
kernel)
|
|
|
|
|
| |
It's slightly faster and slightly more accurate, allowing our current
packetmath tests to pass for sqrt with a single iteration.
|
| |
2) make the paddsub op support Packet2cf/Packet4f/Packet2f in NEON
3) make the paddsub op support Packet2cf/Packet4f in SSE
|
| |
provides a ~10% speedup.
* Write the iterative sqrt explicitly in terms of pmadd. This gives up to a 7% speedup for psqrt<float> with AVX & SSE with FMA.
* Remove the iterative psqrt<double> for NEON, because the initial rsqrt approximation is not accurate enough to converge in 2 Newton-Raphson steps, and with 3 steps just calling the builtin sqrt instruction is faster.
The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx.
AVX+FMA (float)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 1% ~
BM_eigen_sqrt_float/8 2.07ns ± 0% 2.08ns ± 1% ~
BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~
BM_eigen_sqrt_float/512 95.7ns ± 0% 95.5ns ± 0% ~
BM_eigen_sqrt_float/4k 776ns ± 0% 763ns ± 0% -1.67%
BM_eigen_sqrt_float/32k 6.57µs ± 1% 6.13µs ± 0% -6.69%
BM_eigen_sqrt_float/256k 83.7µs ± 3% 83.3µs ± 2% ~
BM_eigen_sqrt_float/1M 335µs ± 2% 332µs ± 2% ~
SSE+FMA (float)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 0% ~
BM_eigen_sqrt_float/8 2.07ns ± 0% 2.06ns ± 0% ~
BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~
BM_eigen_sqrt_float/512 95.7ns ± 0% 96.3ns ± 4% ~
BM_eigen_sqrt_float/4k 774ns ± 0% 763ns ± 0% -1.50%
BM_eigen_sqrt_float/32k 6.58µs ± 2% 6.11µs ± 0% -7.06%
BM_eigen_sqrt_float/256k 82.7µs ± 1% 82.6µs ± 1% ~
BM_eigen_sqrt_float/1M 330µs ± 1% 329µs ± 2% ~
SSE+FMA (double)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~
BM_eigen_sqrt_double/8 6.51ns ± 0% 6.08ns ± 0% -6.68%
BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.65%
BM_eigen_sqrt_double/512 417ns ± 0% 374ns ± 1% -10.29%
BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -11.00%
BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.7µs ± 0% -11.07%
BM_eigen_sqrt_double/256k 213µs ± 0% 206µs ± 1% -3.31%
BM_eigen_sqrt_double/1M 862µs ± 0% 870µs ± 2% +0.96%
AVX+FMA (double)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~
BM_eigen_sqrt_double/8 6.51ns ± 0% 6.06ns ± 0% -6.95%
BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.80%
BM_eigen_sqrt_double/512 417ns ± 0% 373ns ± 1% -10.59%
BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -10.79%
BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.8µs ± 0% -10.94%
BM_eigen_sqrt_double/256k 214µs ± 0% 208µs ± 2% -2.76%
BM_eigen_sqrt_double/1M 866µs ± 3% 923µs ± 7% ~
|
|
|
|
| |
otherwise has an error of ~1000 ulps.
|
|
|
|
| |
Simple typo: the max implementation called `pmin` instead of `pmax` for floats.
|
| |
- Adding propagate tests to bfloat16.
|
| |
Current implementations fail to consider half-float packets, only
half-float scalars. Added specializations for packets on AVX, AVX512 and
NEON, and added tests to `special_packetmath`.
The current `special_functions` tests would fail for half and bfloat16
due to lack of precision. The NEON tests also fail with precision issues
and due to different handling of `sqrt(inf)`, so the special functions
`bessel` and `ndtri` have been disabled.
Tested with AVX, AVX512.
|
| |
The current implementation corrupts the comparison masks when converting
from float back to bfloat16: the resulting masks are no longer all zeros
or all ones, which breaks when they are used with `pselect` (e.g. in
`pmin<PropagateNumbers>`). This was causing `packetmath_15` to fail on
arm.
Introducing a simple `F32MaskToBf16Mask` corrects this (it takes the
lower 16 bits of each float mask).
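The underlying idea, sketched per lane in scalar form (illustrative
name, not Eigen's NEON code): a float comparison yields a 32-bit lane of
all ones or all zeros, and truncating to the low 16 bits preserves that
property, whereas a numeric float-to-bfloat16 conversion of the mask's
bit pattern would not.

```cpp
#include <cstdint>

// Narrow a 32-bit float-lane comparison mask to a 16-bit bfloat16 mask
// by keeping the low half: all-ones stays all-ones, all-zeros stays
// all-zeros, so pselect-style blending keeps working.
uint16_t f32_mask_to_bf16_mask(uint32_t float_lane_mask) {
  return static_cast<uint16_t>(float_lane_mask & 0xFFFFu);
}
```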
|
| |
A missing inline breaks blas, since the symbol is generated in
`complex_single.cpp`, `complex_double.cpp`, `single.cpp`, and
`double.cpp`.
Changed the rest of the inlines to `EIGEN_STRONG_INLINE`.
|
| |
- Add predux_half_dowto4
- Remove explicit casts in Half.h to match the behaviour of BFloat16.h
- Enable more packetmath tests for Eigen::half
|
| |
Using overloaded arithmetic operators for Arm `__fp16` always causes a
promotion to float. We replace `operator*` with `vmulh_f16` to avoid
this.
|
|
|
|
| |
using PacketMath.
|
| |
Armv8.2-a provides a native half-precision floating point (__fp16 aka.
float16_t). This patch introduces
* __fp16 as underlying type of Eigen::half if this type is available
* the packet types Packet4hf and Packet8hf representing float16x4_t and
float16x8_t respectively
* packet-math for the above packets with corresponding scalar type Eigen::half
The packet-math functionality has been implemented by Ashutosh Sharma
<ashutosh.sharma@amperecomputing.com>.
This closes #1940.
|
|
|
|
| |
(almost) all packetmath tests with SSE, AVX, and AVX512.
|
| |
|
|
|
| |
'vmvnq_u64' does not exist for some reason.
|
| |
the comments here have long been fixed. The workarounds had become
detrimental because (1) they prevented using fused multiply-add on
Clang/ARM32 and (2) the unnecessary `volatile` in `asm volatile`
prevented legitimate reordering by the compiler.
|
| |
for large values.
The NEON implementation mimics the SSE implementation, but did not
mention the caveat that, due to the conversions between unsigned and
signed integers, not all values representable in the original
floating-point type are supported.
|
| |
Added missing `pmadd<Packet2f>` for NEON. This gives significantly
better precision than the previous `pmul+padd`, which was causing the
`pcos` tests to fail. Also added an approximation test against
`std::sin`/`std::cos`, since otherwise any result satisfying
`a^2+b^2=1` would pass.
Modified the `log(denorm)` tests. Denormals are not supported by all
systems (some return `::min`), are always flushed to zero on 32-bit arm,
and are configurably flushed to zero on sse/avx/aarch64. This leads to
inconsistent results across different systems (i.e. `-inf` vs `nan`).
Added a check for denormal support and excluded ARM.
Removed the logistic exactness test, since the scalar and vectorized
versions follow different code paths due to differences in `pexp` and
`pmadd`, which result in slightly different values. For example,
exactness always fails on arm, aarch64, and altivec.
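Why a fused `pmadd` and a separate `pmul`+`padd` can differ in the last
bits: the fused form rounds once, the split form twice. A minimal scalar
demonstration (the input value is an assumption chosen to expose the
double rounding; `volatile` blocks compile-time contraction of the split
form into an fma):

```cpp
#include <cmath>

// Difference between fma(a, a, c) (one rounding) and (a*a) + c (two
// roundings) in single precision.
float fused_minus_split(float a, float c) {
  volatile float sq = a * a;                  // rounded product, not fusable
  const float split = sq + c;                 // second rounding
  const float fused = std::fma(a, a, c);      // single rounding at the end
  return fused - split;
}
```

With `a = 1 + 2^-12` and `c = -1`, the exact product `1 + 2^-11 + 2^-24`
needs 25 significand bits, so the standalone multiply loses the `2^-24`
term that the fused form keeps.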
|
| |
The multiply (`pmul`) and comparison operators (`pcmp_lt`, `pcmp_le`,
`pcmp_eq`) are currently missing for the packets `Packet2l` and
`Packet2ul`. This leads to compile errors for the `packetmath.cpp` tests
with clang. Here we add and test the missing ops.
Tested:
```
$ aarch64-linux-gnu-g++ -static -I./ '-DEIGEN_TEST_PART_9=1' '-DEIGEN_TEST_PART_10=1' test/packetmath.cpp -o packetmath
$ adb push packetmath /data/local/tmp/
$ adb shell "/data/local/tmp/packetmath"
$ arm-linux-gnueabihf-g++ -mfpu=neon -static -I./ '-DEIGEN_TEST_PART_9=1' '-DEIGEN_TEST_PART_10=1' test/packetmath.cpp -o packetmath
$ adb push packetmath /data/local/tmp/
$ adb shell "/data/local/tmp/packetmath"
$ clang++ -target aarch64-linux-android21 -static -I./ '-DEIGEN_TEST_PART_9=1' '-DEIGEN_TEST_PART_10=1' test/packetmath.cpp -o packetmath
$ adb push packetmath /data/local/tmp/
$ adb shell "/data/local/tmp/packetmath"
$ clang++ -target armv7-linux-android21 -static -mfpu=neon -I./ '-DEIGEN_TEST_PART_9=1' '-DEIGEN_TEST_PART_10=1' test/packetmath.cpp -o packetmath
$ adb push packetmath /data/local/tmp/
$ adb shell "/data/local/tmp/packetmath"
```
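The per-lane semantics these comparison ops follow, in scalar form
(an illustrative stand-in, not the NEON intrinsic sequence): each 64-bit
lane becomes all ones when the predicate holds and all zeros otherwise,
so the result can feed `pselect`-style blends.

```cpp
#include <cstdint>

// pcmp_lt-style result for one int64 lane: all-ones mask if a < b,
// all-zeros otherwise.
uint64_t cmp_lt_lane(int64_t a, int64_t b) {
  return (a < b) ? ~uint64_t{0} : uint64_t{0};
}
```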
|
| |
The NEON `pcast` operators are all implemented and tested for existing
packets. This required adding a `pcast(a,b,c,d,e,f,g,h)` overload for
casting between `int64_t` and `int8_t` in `GenericPacketMath.h`.
Removed the incorrect `HasHalfPacket` definition for NEON's
`Packet2l`/`Packet2ul`.
Adjustments were also made to the `packetmath` tests. These include
- minor bug fixes for the cast tests (i.e. 4:1 casts, and only casting
  for packets that are vectorizable)
- added 8:1 cast tests
- improved random number generation: the original had uninteresting
  0-to-0 casts for many casts between floating-point and integer types,
  and exhibited signed-overflow undefined behavior
Tested:
```
$ aarch64-linux-gnu-g++ -static -I./ '-DEIGEN_TEST_PART_ALL=1' test/packetmath.cpp -o packetmath
$ adb push packetmath /data/local/tmp/
$ adb shell "/data/local/tmp/packetmath"
```
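The shape of the 8:1 cast can be sketched in scalar form (packet widths
and the `_sketch` names are illustrative stand-ins, not the generic
Eigen implementation): eight source packets of `int64_t` narrow into one
packet of `int8_t`, lane by lane.

```cpp
#include <array>
#include <cstdint>

using Packet2l  = std::array<int64_t, 2>;  // stand-in for a 2-lane packet
using Packet16c = std::array<int8_t, 16>;  // stand-in for a 16-lane packet

// An 8:1 pcast(a,...,h): eight 2-lane int64 packets produce one 16-lane
// int8 packet via per-lane converting casts.
Packet16c pcast_sketch(const Packet2l& a, const Packet2l& b,
                       const Packet2l& c, const Packet2l& d,
                       const Packet2l& e, const Packet2l& f,
                       const Packet2l& g, const Packet2l& h) {
  const Packet2l* src[8] = {&a, &b, &c, &d, &e, &f, &g, &h};
  Packet16c out{};
  for (int p = 0; p < 8; ++p)
    for (int l = 0; l < 2; ++l)
      out[p * 2 + l] = static_cast<int8_t>((*src[p])[l]);
  return out;
}
```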
|
| |
The use of the `packet_traits<>::HasCast` field is currently
inconsistent with `type_casting_traits<>`, and it is unused apart from
within `test/packetmath.cpp`. In addition, those packetmath cast tests
do not currently reflect how casts are performed in practice: they
ignore the `SrcCoeffRatio` and `TgtCoeffRatio` fields, assuming a 1:1
ratio.
Here we remove the unused `HasCast` and modify the packet cast tests to
better reflect their usage.
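For reference, a sketch of what the ratio fields express (the field
names follow Eigen's `type_casting_traits`, but this struct and the
float-to-double 1:2 reading, where one source packet of floats fills two
target packets of doubles, are assumptions for illustration, not Eigen's
code):

```cpp
// SrcCoeffRatio : TgtCoeffRatio expresses how many source packets a
// vectorized cast consumes per target packets produced; the packetmath
// cast tests should honor this instead of assuming 1:1.
struct CastTraitsFloatToDouble {
  enum { VectorizedCast = 1, SrcCoeffRatio = 1, TgtCoeffRatio = 2 };
};
```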
|
|
|
|
| |
ptranspose on NEON
|