| Commit message (Collapse) | Author | Age |
|
|
|
| |
LhsType=RhsType a single generic implementation suffices. For scalars, the generic implementation of pconj automatically forwards to numext::conj, so much of the existing specialization can be avoided. For mixed types we still need specializations.
|
|
|
|
| |
should fix #2242 .
|
| |
|
|
|
|
| |
efficient `movsd` instruction for `pset1<Packet2cf>`.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a new version of !423, which failed for MSVC.
Defined `EIGEN_OPTIMIZATION_BARRIER(X)` that uses inline assembly to
prevent operations involving `X` from crossing that barrier. Should
work on most `GNUC` compatible compilers (MSVC doesn't seem to need
this). This is a modified version adapted from what was used in
`psincos_float` and tested on more platforms
(see #1674, https://godbolt.org/z/73ezTG).
Modified `rint` to use the barrier to prevent the add/subtract rounding
trick from being optimized away.
Also fixed an edge case for large inputs that get bumped up a power of two
and ends up rounding away more than just the fractional part. If we are
over `2^digits` then just return the input. This edge case was missed in
the test since the test was comparing approximate equality, which was still
satisfied. Adding a strict equality option catches it.
|
|
|
| |
This reverts commit e72dfeb8b9fa5662831b5d0bb9d132521f9173dd
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It seems *sometimes* with aggressive optimizations the combination
`psub(padd(a, b), b)` trick to force rounding is compiled away. Here
we replace with inline assembly to prevent this (I tried `volatile`,
but that leads to additional loads from memory).
Also fixed an edge case for large inputs `a` where adding `b` bumps
the value up a power of two and ends up rounding away more than
just the fractional part. If we are over `2^digits` then just return
the input. This edge case was missed in the test since the test was
comparing approximate equality, which was still satisfied. Adding
a strict equality option catches it.
|
|
|
|
|
|
|
|
|
|
| |
In SSE, by adding/subtracting 2^MantissaBits, we force rounding according to the
current rounding mode.
For NEON, we use the provided intrinsics for rint/floor/ceil if
available (armv8).
Related to #1969.
|
|
|
|
|
|
|
|
|
|
| |
The original will saturate if the input does not fit into an integer
type. Here we fix this, returning the input if it doesn't have
enough precision to have a fractional part.
Also added `pceil` for NEON.
Fixes #1969.
|
|
|
|
| |
functions.
|
|
|
|
|
|
| |
The original implementation fails for 0, denormals, inf, and NaN.
See #2150
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous implementations produced garbage values if the exponent did
not fit within the exponent bits. See #2131 for a complete discussion,
and !375 for other possible implementations.
Here we implement the 4-factor version. See `pldexp_impl` in
`GenericPacketMathFunctions.h` for a full description.
The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>`
requires `por`.
Left as a "TODO" is to delegate to a faster version if we know the
exponent does fit within the exponent bits.
Fixes #2131.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://gitlab.com/libeigen/eigen/-/issues/2085, which also contains a description of the algorithm.
I ran some testing (comparing to `std::pow(double(x), double(y)))` for `x` in the set of all (positive) floats in the interval `[std::sqrt(std::numeric_limits<float>::min()), std::sqrt(std::numeric_limits<float>::max())]`, and `y` in `{2, sqrt(2), -sqrt(2)}` I get the following error statistics:
```
max_rel_error = 8.34405e-07
rms_rel_error = 2.76654e-07
```
If I widen the range to all normal float I see lower accuracy for arguments where the result is subnormal, e.g. for `y = sqrt(2)`:
```
max_rel_error = 0.666667
rms = 6.8727e-05
count = 1335165689
argmax = 2.56049e-32, 2.10195e-45 != 1.4013e-45
```
which seems reasonable, since these results are subnormals with only couple of significant bits left.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously
`complex_sqrt` was only used for CUDA and MSVC), and implements
custom `complex_rsqrt`.
Also introduces `numext::rsqrt` to simplify implementation, and modified
`numext::hypot` to adhere to IEEE IEC 6059 for special cases.
The `complex_sqrt` and `complex_rsqrt` implementations were found to be
significantly faster than `std::sqrt<std::complex<T>>` and
`1/numext::sqrt<std::complex<T>>`.
Benchmark file attached.
```
GCC 10, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>> 9.21 ns 9.21 ns 73225448
BM_StdSqrt<std::complex<float>> 17.1 ns 17.1 ns 40966545
BM_Sqrt<std::complex<double>> 8.53 ns 8.53 ns 81111062
BM_StdSqrt<std::complex<double>> 21.5 ns 21.5 ns 32757248
BM_Rsqrt<std::complex<float>> 10.3 ns 10.3 ns 68047474
BM_DivSqrt<std::complex<float>> 16.3 ns 16.3 ns 42770127
BM_Rsqrt<std::complex<double>> 11.3 ns 11.3 ns 61322028
BM_DivSqrt<std::complex<double>> 16.5 ns 16.5 ns 42200711
Clang 11, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>> 7.46 ns 7.45 ns 90742042
BM_StdSqrt<std::complex<float>> 16.6 ns 16.6 ns 42369878
BM_Sqrt<std::complex<double>> 8.49 ns 8.49 ns 81629030
BM_StdSqrt<std::complex<double>> 21.8 ns 21.7 ns 31809588
BM_Rsqrt<std::complex<float>> 8.39 ns 8.39 ns 82933666
BM_DivSqrt<std::complex<float>> 14.4 ns 14.4 ns 48638676
BM_Rsqrt<std::complex<double>> 9.83 ns 9.82 ns 70068956
BM_DivSqrt<std::complex<double>> 15.7 ns 15.7 ns 44487798
Clang 9, Pixel 2, aarch64:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>> 24.2 ns 24.1 ns 28616031
BM_StdSqrt<std::complex<float>> 104 ns 103 ns 6826926
BM_Sqrt<std::complex<double>> 31.8 ns 31.8 ns 22157591
BM_StdSqrt<std::complex<double>> 128 ns 128 ns 5437375
BM_Rsqrt<std::complex<float>> 31.9 ns 31.8 ns 22384383
BM_DivSqrt<std::complex<float>> 99.2 ns 98.9 ns 7250438
BM_Rsqrt<std::complex<double>> 46.0 ns 45.8 ns 15338689
BM_DivSqrt<std::complex<double>> 119 ns 119 ns 5898944
```
|
|
|
|
|
| |
2)make paddsub op support the Packet2cf/Packet4f/Packet2f in NEON
3)make paddsub op support the Packet2cf/Packet4f in SSE
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
provides a ~10% speedup.
* Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA.
* Remove iterative psqrt<double> for NEON, because the initial rsqrt apprimation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster.
The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx.
AVX+FMA (float)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 1% ~
BM_eigen_sqrt_float/8 2.07ns ± 0% 2.08ns ± 1% ~
BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~
BM_eigen_sqrt_float/512 95.7ns ± 0% 95.5ns ± 0% ~
BM_eigen_sqrt_float/4k 776ns ± 0% 763ns ± 0% -1.67%
BM_eigen_sqrt_float/32k 6.57µs ± 1% 6.13µs ± 0% -6.69%
BM_eigen_sqrt_float/256k 83.7µs ± 3% 83.3µs ± 2% ~
BM_eigen_sqrt_float/1M 335µs ± 2% 332µs ± 2% ~
SSE+FMA (float)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 0% ~
BM_eigen_sqrt_float/8 2.07ns ± 0% 2.06ns ± 0% ~
BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~
BM_eigen_sqrt_float/512 95.7ns ± 0% 96.3ns ± 4% ~
BM_eigen_sqrt_float/4k 774ns ± 0% 763ns ± 0% -1.50%
BM_eigen_sqrt_float/32k 6.58µs ± 2% 6.11µs ± 0% -7.06%
BM_eigen_sqrt_float/256k 82.7µs ± 1% 82.6µs ± 1% ~
BM_eigen_sqrt_float/1M 330µs ± 1% 329µs ± 2% ~
SSE+FMA (double)
BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~
BM_eigen_sqrt_double/8 6.51ns ± 0% 6.08ns ± 0% -6.68%
BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.65%
BM_eigen_sqrt_double/512 417ns ± 0% 374ns ± 1% -10.29%
BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -11.00%
BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.7µs ± 0% -11.07%
BM_eigen_sqrt_double/256k 213µs ± 0% 206µs ± 1% -3.31%
BM_eigen_sqrt_double/1M 862µs ± 0% 870µs ± 2% +0.96%
AVX+FMA (double)
name old cpu/op new cpu/op delta
BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~
BM_eigen_sqrt_double/8 6.51ns ± 0% 6.06ns ± 0% -6.95%
BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.80%
BM_eigen_sqrt_double/512 417ns ± 0% 373ns ± 1% -10.59%
BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -10.79%
BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.8µs ± 0% -10.94%
BM_eigen_sqrt_double/256k 214µs ± 0% 208µs ± 2% -2.76%
BM_eigen_sqrt_double/1M 866µs ± 3% 923µs ± 7% ~
|
|
|
|
|
| |
MSVC doesn't like __m128(__m128i) c-style casts, so packets need to be
converted using intrinsic methods.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Closes #1905
Measured speedup for sqrt of `complex<float>` on Skylake:
SSE:
```
name old time/op new time/op delta
BM_eigen_sqrt_ctype/1 49.4ns ± 0% 54.3ns ± 0% +10.01%
BM_eigen_sqrt_ctype/8 332ns ± 0% 50ns ± 1% -84.97%
BM_eigen_sqrt_ctype/64 2.81µs ± 1% 0.38µs ± 0% -86.49%
BM_eigen_sqrt_ctype/512 23.8µs ± 0% 3.0µs ± 0% -87.32%
BM_eigen_sqrt_ctype/4k 202µs ± 0% 24µs ± 2% -88.03%
BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.19ms ± 0% -88.18%
BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 1.5ms ± 1% -88.20%
BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 6.2ms ± 0% -88.18%
```
AVX2:
```
name old cpu/op new cpu/op delta
BM_eigen_sqrt_ctype/1 53.6ns ± 0% 55.6ns ± 0% +3.71%
BM_eigen_sqrt_ctype/8 334ns ± 0% 27ns ± 0% -91.86%
BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.22µs ± 2% -92.28%
BM_eigen_sqrt_ctype/512 23.8µs ± 1% 1.7µs ± 1% -92.81%
BM_eigen_sqrt_ctype/4k 201µs ± 0% 14µs ± 1% -93.24%
BM_eigen_sqrt_ctype/32k 1.62ms ± 0% 0.11ms ± 1% -93.29%
BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.9ms ± 1% -93.31%
BM_eigen_sqrt_ctype/1M 52.0ms ± 0% 3.5ms ± 1% -93.31%
```
AVX512:
```
name old cpu/op new cpu/op delta
BM_eigen_sqrt_ctype/1 53.7ns ± 0% 56.2ns ± 1% +4.75%
BM_eigen_sqrt_ctype/8 334ns ± 0% 18ns ± 2% -94.63%
BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.12µs ± 1% -95.54%
BM_eigen_sqrt_ctype/512 23.9µs ± 1% 1.0µs ± 1% -95.89%
BM_eigen_sqrt_ctype/4k 202µs ± 0% 8µs ± 1% -96.13%
BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.06ms ± 1% -96.15%
BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.5ms ± 4% -96.11%
BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 2.0ms ± 1% -96.13%
```
|
| |
|
|
|
|
| |
This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
|
| |
|
|
|
|
| |
This reverts commit c770746d709686ef2b8b652616d9232f9b028e78.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due
to a bad NaN bit-pattern comparison (in the case of casting a float to `__fp16`,
the signaling `NaN` is quieted). There was also an inconsistency between
`numeric_limits<half>::quiet_NaN()` and `NumTraits::quiet_NaN()`. Here we
correct the inconsistency and compare NaNs according to the IEEE 754
definition.
Also modified the `bfloat16_float` test to match.
Tested with `cortex-a53` and `cortex-a55`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes some gcc warnings such as:
```
Eigen/src/Core/GenericPacketMath.h:655:63: warning: implicit conversion turns floating-point number into bool: 'typename __gnu_cxx::__enable_if<__is_integer<bool>::__value, double>::__type' (aka 'double') to 'bool' [-Wimplicit-conversion-floating-point-to-bool]
Packet psqrt(const Packet& a) { EIGEN_USING_STD(sqrt); return sqrt(a); }
```
Details:
- Added `scalar_sqrt_op<bool>` (`-Wimplicit-conversion-floating-point-to-bool`).
- Added `scalar_square_op<bool>` and `scalar_cube_op<bool>`
specializations (`-Wint-in-bool-context`)
- Deprecated above specialized ops for bool.
- Modified `cxx11_tensor_block_eval` to specialize generator for
booleans (`-Wint-in-bool-context`) and to use `abs` instead of `square` to
avoid deprecated bool ops.
|
|
|
|
| |
using PacketMath.
|
|
|
|
|
| |
It was only defined under one `#ifdef` case. This fixes the `packetmath_14`
test for MSVC.
|
|
|
|
| |
not available, and avoid undefined behavior in C++. Also mask off the sign bit when extracting the exponent.
|
|
|
|
| |
for SSE/AVX/AVX512.
|
|
|
|
|
|
|
|
| |
64 bit builds, see:
https://stackoverflow.com/questions/60933486/mmx-intrinsics-like-mm-cvtpd-pi32-not-found-with-msvc-2019-for-64bit-targets-c
Instead use the equivalent SSE2 intrinsics.
|
|
|
|
| |
definition for SSE. SSE does not support conversion between 64 bit integers and double and the existing implementation of casting between Packet2d and Packer2l results in undefined behavior when casting NaN to int. Since pldexp and pfdexp only manipulate exponent fields that fit in 32 bit, this change provides specializations that use existing instructions _mm_cvtpd_pi32 and _mm_cvtsi32_pd instead.
|
|
|
|
| |
TypeCasting.h on platforms where uint64_t != unsigned long.
|
| |
|
|
|
|
| |
a: __i28d) ops with MSVC compiler
|
|
|
|
|
|
| |
available on 32 bit x86.
If SSE 4.1 is available use the faster _mm_extract_epi64 intrinsic.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tests as other arithmetic types.
This change also contains a few minor cleanups:
1. Remove packet op pnot, which is not needed for anything other than pcmp_le_or_nan,
which can be done in other ways.
2. Remove the "HasInsert" enum, which is no longer needed since we removed the
corresponding packet ops.
3. Add faster pselect op for Packet4i when SSE4.1 is supported.
Among other things, this makes the fast transposeInPlace() method available for Matrix<bool>.
Run on ************** (72 X 2994 MHz CPUs); 2020-05-09T10:51:02.372347913-07:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark Time(ns) CPU(ns) Iterations
-----------------------------------------------------------------------
BM_TransposeInPlace<float>/4 9.77 9.77 71670320
BM_TransposeInPlace<float>/8 21.9 21.9 31929525
BM_TransposeInPlace<float>/16 66.6 66.6 10000000
BM_TransposeInPlace<float>/32 243 243 2879561
BM_TransposeInPlace<float>/59 844 844 829767
BM_TransposeInPlace<float>/64 933 933 750567
BM_TransposeInPlace<float>/128 3944 3945 177405
BM_TransposeInPlace<float>/256 16853 16853 41457
BM_TransposeInPlace<float>/512 204952 204968 3448
BM_TransposeInPlace<float>/1k 1053889 1053861 664
BM_TransposeInPlace<bool>/4 14.4 14.4 48637301
BM_TransposeInPlace<bool>/8 36.0 36.0 19370222
BM_TransposeInPlace<bool>/16 31.5 31.5 22178902
BM_TransposeInPlace<bool>/32 111 111 6272048
BM_TransposeInPlace<bool>/59 626 626 1000000
BM_TransposeInPlace<bool>/64 428 428 1632689
BM_TransposeInPlace<bool>/128 1677 1677 417377
BM_TransposeInPlace<bool>/256 7126 7126 96264
BM_TransposeInPlace<bool>/512 29021 29024 24165
BM_TransposeInPlace<bool>/1k 116321 116330 6068
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
single place, and can be replaced by other ops when constructing the first/final packet in linspaced_op_impl::packetOp.
I cannot measure any performance changes for SSE, AVX, or AVX512.
name old time/op new time/op delta
BM_LinSpace<float>/1 1.63ns ± 0% 1.63ns ± 0% ~ (p=0.762 n=5+5)
BM_LinSpace<float>/8 4.92ns ± 3% 4.89ns ± 3% ~ (p=0.421 n=5+5)
BM_LinSpace<float>/64 34.6ns ± 0% 34.6ns ± 0% ~ (p=0.841 n=5+5)
BM_LinSpace<float>/512 217ns ± 0% 217ns ± 0% ~ (p=0.421 n=5+5)
BM_LinSpace<float>/4k 1.68µs ± 0% 1.68µs ± 0% ~ (p=1.000 n=5+5)
BM_LinSpace<float>/32k 13.3µs ± 0% 13.3µs ± 0% ~ (p=0.905 n=5+4)
BM_LinSpace<float>/256k 107µs ± 0% 107µs ± 0% ~ (p=0.841 n=5+5)
BM_LinSpace<float>/1M 427µs ± 0% 427µs ± 0% ~ (p=0.690 n=5+5)
|
|
|
|
| |
Clean up a compiler warning in c++03 mode in AVX512/Complex.h.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add ptranspose<*,4> to support matmul and add unit test for Matrix<bool> * Matrix<bool>
* work around a bug in slicing of Tensor<bool>.
* Add tensor tests
This speeds up matmul for boolean matrices by about 10x
name old time/op new time/op delta
BM_MatMul<bool>/8 267ns ± 0% 479ns ± 0% +79.25% (p=0.008 n=5+5)
BM_MatMul<bool>/32 6.42µs ± 0% 0.87µs ± 0% -86.50% (p=0.008 n=5+5)
BM_MatMul<bool>/64 43.3µs ± 0% 5.9µs ± 0% -86.42% (p=0.008 n=5+5)
BM_MatMul<bool>/128 315µs ± 0% 44µs ± 0% -85.98% (p=0.008 n=5+5)
BM_MatMul<bool>/256 2.41ms ± 0% 0.34ms ± 0% -85.68% (p=0.008 n=5+5)
BM_MatMul<bool>/512 18.8ms ± 0% 2.7ms ± 0% -85.53% (p=0.008 n=5+5)
BM_MatMul<bool>/1k 149ms ± 0% 22ms ± 0% -85.40% (p=0.008 n=5+5)
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
boolean operations on Tensors by up to 25x.
Benchmark numbers for the logical and of two NxN tensors:
name old time/op new time/op delta
BM_booleanAnd_1T/3 [using 1 threads] 14.6ns ± 0% 14.4ns ± 0% -0.96%
BM_booleanAnd_1T/4 [using 1 threads] 20.5ns ±12% 9.0ns ± 0% -56.07%
BM_booleanAnd_1T/7 [using 1 threads] 41.7ns ± 0% 10.5ns ± 0% -74.87%
BM_booleanAnd_1T/8 [using 1 threads] 52.1ns ± 0% 10.1ns ± 0% -80.59%
BM_booleanAnd_1T/10 [using 1 threads] 76.3ns ± 0% 13.8ns ± 0% -81.87%
BM_booleanAnd_1T/15 [using 1 threads] 167ns ± 0% 16ns ± 0% -90.45%
BM_booleanAnd_1T/16 [using 1 threads] 188ns ± 0% 16ns ± 0% -91.57%
BM_booleanAnd_1T/31 [using 1 threads] 667ns ± 0% 34ns ± 0% -94.83%
BM_booleanAnd_1T/32 [using 1 threads] 710ns ± 0% 35ns ± 0% -95.01%
BM_booleanAnd_1T/64 [using 1 threads] 2.80µs ± 0% 0.11µs ± 0% -95.93%
BM_booleanAnd_1T/128 [using 1 threads] 11.2µs ± 0% 0.4µs ± 0% -96.11%
BM_booleanAnd_1T/256 [using 1 threads] 44.6µs ± 0% 2.5µs ± 0% -94.31%
BM_booleanAnd_1T/512 [using 1 threads] 178µs ± 0% 10µs ± 0% -94.35%
BM_booleanAnd_1T/1k [using 1 threads] 717µs ± 0% 78µs ± 1% -89.07%
BM_booleanAnd_1T/2k [using 1 threads] 2.87ms ± 0% 0.31ms ± 1% -89.08%
BM_booleanAnd_1T/4k [using 1 threads] 11.7ms ± 0% 1.9ms ± 4% -83.55%
BM_booleanAnd_1T/10k [using 1 threads] 70.3ms ± 0% 17.2ms ± 4% -75.48%
|
|
|
|
|
|
|
| |
SSE/AVX/AVX512 as it is already used for NEON.
This will allow us to define multiple packet types backed by the same vector type, e.g., __m128i.
Use this machanism to define packets for half and clean up the packet op implementations.
|
| |
|