| Commit message (Collapse) | Author | Age |
... | |
| |
|
|
|
|
|
|
|
| |
This commit applies the following changes:
- Moving the `ScanLauncher` specialization inside the internal namespace to fix a compiler crash on TensorScan for the SYCL backend.
- Replacing `SYCL/sycl.hpp` with `CL/sycl.hpp` in order to follow the SYCL 1.2.1 standard.
- Minor fixes: commenting out an unused variable to avoid compiler warnings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
single place, and can be replaced by other ops when constructing the first/final packet in linspaced_op_impl::packetOp.
I cannot measure any performance changes for SSE, AVX, or AVX512.
name                     old time/op  new time/op  delta
BM_LinSpace<float>/1     1.63ns ± 0%  1.63ns ± 0%  ~  (p=0.762 n=5+5)
BM_LinSpace<float>/8     4.92ns ± 3%  4.89ns ± 3%  ~  (p=0.421 n=5+5)
BM_LinSpace<float>/64    34.6ns ± 0%  34.6ns ± 0%  ~  (p=0.841 n=5+5)
BM_LinSpace<float>/512    217ns ± 0%   217ns ± 0%  ~  (p=0.421 n=5+5)
BM_LinSpace<float>/4k    1.68µs ± 0%  1.68µs ± 0%  ~  (p=1.000 n=5+5)
BM_LinSpace<float>/32k   13.3µs ± 0%  13.3µs ± 0%  ~  (p=0.905 n=5+4)
BM_LinSpace<float>/256k   107µs ± 0%   107µs ± 0%  ~  (p=0.841 n=5+5)
BM_LinSpace<float>/1M     427µs ± 0%   427µs ± 0%  ~  (p=0.690 n=5+5)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some architectures have no convenient way to determine cache sizes at
runtime. In this case Eigen's GEBP kernel falls back to default cache
values, which might not be correct in all situations.
This patch introduces three preprocessor directives
`EIGEN_DEFAULT_L1_CACHE_SIZE`
`EIGEN_DEFAULT_L2_CACHE_SIZE`
`EIGEN_DEFAULT_L3_CACHE_SIZE`
to give users the possibility to set these default values explicitly.
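A minimal sketch of how such overrides are typically wired up, written without Eigen itself so it stands alone. The three macro names come from the commit message; the `CacheSizes` struct, the `fallback_cache_sizes` helper, and every numeric value are hypothetical and only illustrate the `#ifndef` fallback pattern, not Eigen's actual internals.

```cpp
#include <cassert>
#include <cstddef>

// User side: normally passed on the command line, e.g.
// -DEIGEN_DEFAULT_L1_CACHE_SIZE=65536, or defined before including any
// Eigen header. The value here is an arbitrary example.
#define EIGEN_DEFAULT_L1_CACHE_SIZE (64 * 1024)

// Library side (illustrative only): supply a default for each macro the
// user has not defined. These fallback numbers are made up.
#ifndef EIGEN_DEFAULT_L1_CACHE_SIZE
#define EIGEN_DEFAULT_L1_CACHE_SIZE (32 * 1024)
#endif
#ifndef EIGEN_DEFAULT_L2_CACHE_SIZE
#define EIGEN_DEFAULT_L2_CACHE_SIZE (256 * 1024)
#endif
#ifndef EIGEN_DEFAULT_L3_CACHE_SIZE
#define EIGEN_DEFAULT_L3_CACHE_SIZE (2 * 1024 * 1024)
#endif

struct CacheSizes {
  std::ptrdiff_t l1, l2, l3;
};

// Hypothetical helper: the sizes used when runtime detection is unavailable.
inline CacheSizes fallback_cache_sizes() {
  return {EIGEN_DEFAULT_L1_CACHE_SIZE, EIGEN_DEFAULT_L2_CACHE_SIZE,
          EIGEN_DEFAULT_L3_CACHE_SIZE};
}
```

In this sketch only L1 is overridden, so `fallback_cache_sizes()` reports the user's L1 value together with the built-in L2 and L3 defaults.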
|
|
|
|
| |
Clean up a compiler warning in c++03 mode in AVX512/Complex.h.
|
|
|
|
| |
packet op implementations.
|
| |
|
|
|
|
| |
transpose.
|
| |
|
|
|
|
| |
debug mode.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add ptranspose<*,4> to support matmul and add unit test for Matrix<bool> * Matrix<bool>
* work around a bug in slicing of Tensor<bool>.
* Add tensor tests
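This speeds up matmul for boolean matrices by about 7x for sizes 32 and larger (roughly an 86% reduction in time per op); the 8x8 case becomes about 79% slower.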
name                 old time/op  new time/op  delta
BM_MatMul<bool>/8     267ns ± 0%   479ns ± 0%  +79.25%  (p=0.008 n=5+5)
BM_MatMul<bool>/32   6.42µs ± 0%  0.87µs ± 0%  -86.50%  (p=0.008 n=5+5)
BM_MatMul<bool>/64   43.3µs ± 0%   5.9µs ± 0%  -86.42%  (p=0.008 n=5+5)
BM_MatMul<bool>/128   315µs ± 0%    44µs ± 0%  -85.98%  (p=0.008 n=5+5)
BM_MatMul<bool>/256  2.41ms ± 0%  0.34ms ± 0%  -85.68%  (p=0.008 n=5+5)
BM_MatMul<bool>/512  18.8ms ± 0%   2.7ms ± 0%  -85.53%  (p=0.008 n=5+5)
BM_MatMul<bool>/1k    149ms ± 0%    22ms ± 0%  -85.40%  (p=0.008 n=5+5)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
large speedup because we transpose in registers (or L1 if we spill), instead of one packet at a time, which in the worst case makes the code write to the same cache line PacketSize times instead of once.
rmlarsen@rmlarsen4:.../eigen_bench/google3$ benchy --benchmarks=.*TransposeInPlace.*float.* --reference=srcfs experimental/users/rmlarsen/bench:matmul_bench
(Generated by http://go/benchy. Settings: --runs 5 --benchtime 1s --reference "srcfs" --benchmarks ".*TransposeInPlace.*float.*" experimental/users/rmlarsen/bench:matmul_bench)
name                            old time/op  new time/op  delta
BM_TransposeInPlace<float>/4    9.84ns ± 0%  6.51ns ± 0%  -33.80%  (p=0.008 n=5+5)
BM_TransposeInPlace<float>/8    23.6ns ± 1%  17.6ns ± 0%  -25.26%  (p=0.016 n=5+4)
BM_TransposeInPlace<float>/16   78.8ns ± 0%  60.3ns ± 0%  -23.50%  (p=0.029 n=4+4)
BM_TransposeInPlace<float>/32    302ns ± 0%   229ns ± 0%  -24.40%  (p=0.008 n=5+5)
BM_TransposeInPlace<float>/59   1.03µs ± 0%  0.84µs ± 1%  -17.87%  (p=0.016 n=5+4)
BM_TransposeInPlace<float>/64   1.20µs ± 0%  0.89µs ± 1%  -25.81%  (p=0.008 n=5+5)
BM_TransposeInPlace<float>/128  8.96µs ± 0%  3.82µs ± 2%  -57.33%  (p=0.008 n=5+5)
BM_TransposeInPlace<float>/256   152µs ± 3%    17µs ± 2%  -89.06%  (p=0.008 n=5+5)
BM_TransposeInPlace<float>/512   837µs ± 1%   208µs ± 0%  -75.15%  (p=0.008 n=5+5)
BM_TransposeInPlace<float>/1k   4.28ms ± 2%  1.08ms ± 2%  -74.72%  (p=0.008 n=5+5)
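The idea of transposing whole tiles rather than one packet at a time can be sketched in scalar C++. This is not the code from the commit: `transpose_in_place_blocked` and its tile size `B` are hypothetical, and a vectorized version would load `B` packets into registers and transpose them there, whereas this sketch only shows the tile-by-tile traversal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Transposes a square n x n row-major matrix in place, one B x B tile at a
// time. Each tile above the diagonal is swapped with its mirror tile, so
// every cache line in a tile is touched while the tile is being worked on.
template <typename T, std::size_t B = 4>
void transpose_in_place_blocked(std::vector<T>& m, std::size_t n) {
  assert(m.size() == n * n);
  for (std::size_t i0 = 0; i0 < n; i0 += B) {
    const std::size_t i1 = std::min(i0 + B, n);
    // Diagonal tile: swap the elements above the diagonal with those below.
    for (std::size_t i = i0; i < i1; ++i)
      for (std::size_t j = i + 1; j < i1; ++j)
        std::swap(m[i * n + j], m[j * n + i]);
    // Off-diagonal tiles to the right: swap each with its mirror tile.
    for (std::size_t j0 = i1; j0 < n; j0 += B) {
      const std::size_t j1 = std::min(j0 + B, n);
      for (std::size_t i = i0; i < i1; ++i)
        for (std::size_t j = j0; j < j1; ++j)
          std::swap(m[i * n + j], m[j * n + i]);
    }
  }
}
```

Sizes that are not a multiple of `B` (such as the 59 case in the benchmark) are handled by clamping the tile bounds with `std::min`.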
|
| |
|
| |
|
|
|
|
| |
This enables operator== on Eigen matrices in device code.
|
|
|
|
| |
vector operations
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
boolean operations on Tensors by up to 25x.
Benchmark numbers for the logical AND of two NxN tensors:
name                                    old time/op  new time/op  delta
BM_booleanAnd_1T/3 [using 1 threads]    14.6ns ± 0%  14.4ns ± 0%   -0.96%
BM_booleanAnd_1T/4 [using 1 threads]    20.5ns ±12%   9.0ns ± 0%  -56.07%
BM_booleanAnd_1T/7 [using 1 threads]    41.7ns ± 0%  10.5ns ± 0%  -74.87%
BM_booleanAnd_1T/8 [using 1 threads]    52.1ns ± 0%  10.1ns ± 0%  -80.59%
BM_booleanAnd_1T/10 [using 1 threads]   76.3ns ± 0%  13.8ns ± 0%  -81.87%
BM_booleanAnd_1T/15 [using 1 threads]    167ns ± 0%    16ns ± 0%  -90.45%
BM_booleanAnd_1T/16 [using 1 threads]    188ns ± 0%    16ns ± 0%  -91.57%
BM_booleanAnd_1T/31 [using 1 threads]    667ns ± 0%    34ns ± 0%  -94.83%
BM_booleanAnd_1T/32 [using 1 threads]    710ns ± 0%    35ns ± 0%  -95.01%
BM_booleanAnd_1T/64 [using 1 threads]   2.80µs ± 0%  0.11µs ± 0%  -95.93%
BM_booleanAnd_1T/128 [using 1 threads]  11.2µs ± 0%   0.4µs ± 0%  -96.11%
BM_booleanAnd_1T/256 [using 1 threads]  44.6µs ± 0%   2.5µs ± 0%  -94.31%
BM_booleanAnd_1T/512 [using 1 threads]   178µs ± 0%    10µs ± 0%  -94.35%
BM_booleanAnd_1T/1k [using 1 threads]    717µs ± 0%    78µs ± 1%  -89.07%
BM_booleanAnd_1T/2k [using 1 threads]   2.87ms ± 0%  0.31ms ± 1%  -89.08%
BM_booleanAnd_1T/4k [using 1 threads]   11.7ms ± 0%   1.9ms ± 4%  -83.55%
BM_booleanAnd_1T/10k [using 1 threads]  70.3ms ± 0%  17.2ms ± 4%  -75.48%
|
|
|
|
|
|
|
| |
SSE/AVX/AVX512 as it is already used for NEON.
This will allow us to define multiple packet types backed by the same vector type, e.g., __m128i.
Use this mechanism to define packets for half and clean up the packet op implementations.
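A rough sketch of why a wrapper makes this possible, assuming nothing about Eigen's real wrapper beyond what the message says. `RawVec128` stands in for an intrinsic type such as `__m128i`, and `PacketWrapper`, `Packet4i`, `Packet8h`, and `lanes` are illustrative names rather than Eigen's definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-in for a raw SIMD register type such as __m128i (hypothetical).
struct RawVec128 {
  std::uint64_t lo, hi;
};

// Thin wrapper: identical storage, but every Tag value produces a distinct
// C++ type, so overloads and template specializations can be told apart.
template <typename Vec, int Tag>
struct PacketWrapper {
  Vec v;
  PacketWrapper() = default;
  PacketWrapper(const Vec& x) : v(x) {}
  operator Vec&() { return v; }              // use as the raw vector
  operator const Vec&() const { return v; }
};

using Packet4i = PacketWrapper<RawVec128, 0>;  // e.g. four 32-bit integers
using Packet8h = PacketWrapper<RawVec128, 1>;  // e.g. eight 16-bit halves

static_assert(!std::is_same<Packet4i, Packet8h>::value,
              "same storage, different packet types");

// Two overloads that could not coexist if both packets were plain RawVec128.
inline int lanes(const Packet4i&) { return 4; }
inline int lanes(const Packet8h&) { return 8; }
```

Because the wrapper converts implicitly to and from the underlying vector, code written against the raw type can keep working while packet ops dispatch on the wrapper type.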
|
| |
|
|
|
|
|
|
| |
{uint8, int8} -> {int16, uint16, int32, uint32, float}
{uint16, int16} -> {int32, uint32, int64, uint64, float}
for NEON. These conversions were advertised as vectorized, but not actually implemented.
|
|
|
|
| |
commainitializer unit test never actually called `test_block_recursion`, which was also not correctly implemented and would have caused excessively deep template recursion.
|
|
|
|
|
|
| |
The removed `finished()` call was responsible for enforcing that the
initializer was provided the correct number of values. Putting it back in
to restore previous behavior.
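To illustrate the role such a `finished()` check plays, here is a toy comma initializer in plain C++. `ToyCommaInit` and `fill` are hypothetical and far simpler than Eigen's `CommaInitializer`; the sketch only shows that, without the final check, supplying too few values would go unnoticed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy stand-in for a comma initializer (not Eigen's class).
class ToyCommaInit {
 public:
  ToyCommaInit(std::vector<double>& target, double first)
      : target_(target), count_(0) {
    append(first);
  }

  // Each comma stores one more coefficient.
  ToyCommaInit& operator,(double v) {
    append(v);
    return *this;
  }

  // Final check: exactly target.size() values must have been supplied.
  std::vector<double>& finished() {
    if (count_ != target_.size())
      throw std::runtime_error("too few coefficients passed to initializer");
    return target_;
  }

 private:
  void append(double v) {
    // Overflow is treated as a programming error in this sketch.
    assert(count_ < target_.size());
    target_[count_++] = v;
  }

  std::vector<double>& target_;
  std::size_t count_;
};

// Hypothetical entry point playing the role of `mat << first`.
inline ToyCommaInit fill(std::vector<double>& target, double first) {
  return ToyCommaInit(target, first);
}
```

With a three-element target, `(fill(v, 1.0), 2.0, 3.0).finished()` succeeds, while `(fill(v, 1.0), 2.0).finished()` throws because one value is missing.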
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
half- or quarter-packet vectorized loads in gemm_pack_rhs if they have size 4, instead of dropping down to the scalar path.
Benchmark measurements below are for computing `c.noalias() = a.transpose() * b;` for square RowMajor matrices of varying size.
Measured improvement with AVX+FMA:
name               old time/op  new time/op  delta
BM_MatMul_ATB/8     139ns ± 1%   129ns ± 1%   -7.49%  (p=0.008 n=5+5)
BM_MatMul_ATB/32   1.46µs ± 1%  1.22µs ± 0%  -16.72%  (p=0.008 n=5+5)
BM_MatMul_ATB/64   8.43µs ± 1%  7.41µs ± 0%  -12.04%  (p=0.008 n=5+5)
BM_MatMul_ATB/128  56.8µs ± 1%  52.9µs ± 1%   -6.83%  (p=0.008 n=5+5)
BM_MatMul_ATB/256   407µs ± 1%   395µs ± 3%   -2.94%  (p=0.032 n=5+5)
BM_MatMul_ATB/512  3.27ms ± 3%  3.18ms ± 1%  ~        (p=0.056 n=5+5)
Measured improvement for AVX512:
name               old time/op  new time/op  delta
BM_MatMul_ATB/8     167ns ± 1%   154ns ± 1%   -7.63%  (p=0.008 n=5+5)
BM_MatMul_ATB/32   1.08µs ± 1%  0.83µs ± 3%  -23.58%  (p=0.008 n=5+5)
BM_MatMul_ATB/64   6.21µs ± 1%  5.06µs ± 1%  -18.47%  (p=0.008 n=5+5)
BM_MatMul_ATB/128  36.1µs ± 2%  31.3µs ± 1%  -13.32%  (p=0.008 n=5+5)
BM_MatMul_ATB/256   263µs ± 2%   242µs ± 2%   -7.92%  (p=0.008 n=5+5)
BM_MatMul_ATB/512  1.95ms ± 2%  1.91ms ± 2%  ~        (p=0.095 n=5+5)
BM_MatMul_ATB/1k   15.4ms ± 4%  14.8ms ± 2%  ~        (p=0.095 n=5+5)
|
| |
|
|
|
|
| |
This will allow (among other things) computation of argmax and argmin of bool tensors
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The error generated by the compiler was:
no matching function for call to 'maxi'
RealScalar threshold = numext::maxi(tol*tol*rhsNorm2,considerAsZero);
The important part in the following notes was:
candidate template ignored: deduced conflicting
types for parameter 'T'
('codi::Multiply11<...>' vs. 'codi::ActiveReal<...>')
EIGEN_ALWAYS_INLINE T maxi(const T& x, const T& y)
I am using CoDiPack to provide the RealScalar type.
This bug was introduced in bc000deaa ("Fix conjugate-gradient for very small rhs").
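The deduction conflict can be reproduced in miniature without CoDiPack. In the sketch below, `Real` and `MulExpr` are hypothetical stand-ins for an active scalar type and the expression-template type its `operator*` returns, `maxi` only mirrors the shape of the quoted signature, and the explicit `maxi<Real>` call is one possible workaround, not necessarily the change made in the commit.

```cpp
#include <cassert>

// Stand-in for an active scalar type from an AD library (hypothetical).
struct Real {
  double v;
};

// Stand-in for the expression-template type returned by multiplication.
struct MulExpr {
  double a, b;
  operator Real() const { return Real{a * b}; }  // evaluates the product
};

inline MulExpr operator*(const Real& x, const Real& y) {
  return MulExpr{x.v, y.v};
}
inline bool operator<(const Real& x, const Real& y) { return x.v < y.v; }

// Same shape as the signature in the error: one T for both arguments.
template <typename T>
inline T maxi(const T& x, const T& y) {
  return x < y ? y : x;
}

inline Real threshold(const Real& tol, const Real& rhsNorm2,
                      const Real& considerAsZero) {
  // maxi(tol * tol * rhsNorm2, considerAsZero) does not compile here:
  // T is deduced as MulExpr from the first argument and as Real from the
  // second. Naming T explicitly lets the expression convert to Real.
  return maxi<Real>(tol * tol * rhsNorm2, considerAsZero);
}
```

Wrapping the product in an explicit `Real(...)` conversion would resolve the ambiguity equally well in this toy setting.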
|
|
|
|
| |
https://gitlab.com/libeigen/eigen/-/commit/52d54278beefee8b2f19dcca4fd900916154e174
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
- access violation when initializing 0x0 matrices
- exception can be thrown during stack unwinding while comma-initializing a matrix if eigen_assert is configured to throw
|
| |
|
|
|
|
| |
EIGEN_DEVICE_FUNC to diagonal_product_evaluator_base.
|
| |
|
| |
|
| |
|
|
|
|
| |
types.
|
| |
|
| |
|
| |
|
|
|
|
| |
the Eigen::Half packet type
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|