| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
specialization of ScanLauncher.
The issue was discovered when the GPU scan unit test was run and resulted in a segmentation fault.
The segmantation fault occurred because the unit test allocated GPU memory and passed a pointer to that memory to the computation that it presumed would execute on the GPU.
But because of the issue, the computation was scheduled to execute on the CPU so a situation was constructed where the CPU attempted to access a GPU memory location.
The fix expands the GPU specific ScanLauncher specialization to handle cases where vectorization is enabled.
Previously, the GPU specialization is chosen only if Vectorization is not used.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original swap approach leads to potential undefined behavior (reading
uninitialized memory) and results in unnecessary copying of data for static
storage.
Here we pass down the move assignment to the underlying storage. Static
storage does a one-way copy, dynamic storage does a swap.
Modified the tests to no longer read from the moved-from matrix/tensor,
since that can lead to UB. Added a test to ensure we do not access
uninitialized memory in a move.
Fixes: #2119
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The macro `__cplusplus` is not defined correctly in MSVC unless building
with the the `/Zc:__cplusplus` flag. Instead, it defines `_MSVC_LANG` to the
specified c++ standard version number.
Here we introduce `EIGEN_CPLUSPLUS` which will contain the c++ version
number both for MSVC and otherwise. This simplifies checks for supported
features.
Also replaced most instances of standard version checking via `__cplusplus`
with the existing `EIGEN_COMP_CXXVER` macro for better clarity.
Fixes: #2170
|
| |
|
|
|
|
|
|
| |
warnings
(cherry picked from commit 9bbb7ea4b54b1f307863be4ed8d105c38cdefe50)
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originating from
[this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case),
some win32 compilers define `__int32` as a `long`, but MinGW defines
`std::int32_t` as an `int`, leading to a type conflict.
To avoid this, we remove the custom `typedef` definitions for win32. The
Tensor module requires C++11 anyways, so we are guaranteed to have
included `<cstdint>` already in `Eigen/Core`.
Also re-arranged the headers to only include `<cstdint>` in one place to
avoid this type of error again.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
device, requested by Tensorflow folks.
Technically `std::complex` is not supported by NVCC on device
(though it is by clang), so the default `sqrt(std::complex<T>)` function only
works on the host. Here we create an overload to add back the
functionality.
Also modified the CMake file to add `--relaxed-constexpr` (or
equivalent) flag for NVCC to allow calling constexpr functions from
device functions, and added support for specifying compute architecture for
NVCC (was already available for clang).
|
|
|
|
| |
FixedDimensions.
|
|
|
|
|
|
|
| |
Removed m_dimension as instance member of TensorStorage with
FixedDimensions and instead use the template parameter. This
means that the sizeof a pure fixed-size storage is exactly
equal to the data it is storing.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
The existing `TensorRandom.h` implementation makes the assumption that
`half` (`bfloat16`) has a `uint16_t` member `x` (`value`), which is not
always true. This currently fails on arm64, where `x` has type `__fp16`.
Added `bit_cast` specializations to allow casting to/from `uint16_t`
for both `half` and `bfloat16`. Also added tests in
`half_float`, `bfloat16_float`, and `cxx11_tensor_random` to catch
these errors in the future.
|
|
|
|
|
|
|
| |
Adds copy constructors to Tensor ops, inherits assignment operators from
`TensorBase`.
Addresses #1863
|
|
|
|
| |
causing issues in embeded systems
|
|
|
|
| |
broken by c6953f799b01d36f4236b64f351cc1446e0abe17.
|
|
|
|
| |
`predux_fmax_nan` that implement reductions with `PropagateNaN`, and `PropagateNumbers` semantics. Add (slow) generic implementations for most reductions.
|
|
|
|
|
|
| |
across platforms.
Change test to only test for NaN-propagation for pfmin/pfmax.
|
|
|
|
| |
EIGEN_AVOID_THREAD_LOCAL and NDEBUG are defined
|
|
|
|
|
|
| |
constants static const or constexpr.
Move macro definition EIGEN_CONSTEXPR to Core and make all methods in NumTraits constexpr when EIGEN_HASH_CONSTEXPR is 1.
|
|
|
|
|
|
| |
PR 181 ( https://gitlab.com/libeigen/eigen/-/merge_requests/181 ) adds `__launch_bounds__(1024)` attribute to GPU kernels, that did not have that attribute explicitly specified.
That PR seems to cause regressions on the CUDA platform. This PR/commit makes the changes in PR 181, to be applicable for HIP only
|
|
|
|
|
|
|
|
|
|
| |
Starting with ROCm 3.5, the HIP compiler will change from HCC to hip-clang.
This compiler change introduce a change in the default value of the `__launch_bounds__` attribute associated with a GPU kernel. (default value means the value assumed by the compiler as the `__launch_bounds attribute__` value, when it is not explicitly specified by the user)
Currently (i.e. for HIP with ROCm 3.3 and older), the default value is 1024. That changes to 256 with ROCm 3.5 (i.e. hip-clang compiler). As a consequence of this change, if a GPU kernel with a `__luanch_bounds__` attribute of 256 is launched at runtime with a threads_per_block value > 256, it leads to a runtime error. This is leading to a couple of Eigen unit test failures with ROCm 3.5.
This commit adds an explicit `__launch_bounds(1024)__` attribute to every GPU kernel that currently does not have it explicitly specified (and hence will end up getting the default value of 256 with the change to hip-clang)
|
|
|
|
| |
segfault when the argument is unaligned.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original tensor casts were only defined for
`SrcCoeffRatio`:`TgtCoeffRatio` 1:1, 1:2, 2:1, 4:1. Here we add the
missing 1:N and 8:1.
We also add casting `Eigen::half` to/from `std::complex<T>`, which
was missing to make it consistent with `Eigen:bfloat16`, and
generalize the overload to work for any complex type.
Tests were added to `basicstuff`, `packetmath`, and
`cxx11_tensor_casts` to test all cast configurations.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Running two chains exposes more instruction level parallelism,
by allowing to execute both chains at the same time.
Results are a bit noisy, but for medium length we almost hit
theoretical upper bound of 2x.
BM_fullReduction_16T/3 [using 16 threads] 17.3ns ±11% 17.4ns ± 9% ~ (p=0.178 n=18+19)
BM_fullReduction_16T/4 [using 16 threads] 17.6ns ±17% 17.0ns ±18% ~ (p=0.835 n=20+19)
BM_fullReduction_16T/7 [using 16 threads] 18.9ns ±12% 18.2ns ±10% ~ (p=0.756 n=20+18)
BM_fullReduction_16T/8 [using 16 threads] 19.8ns ±13% 19.4ns ±21% ~ (p=0.512 n=20+20)
BM_fullReduction_16T/10 [using 16 threads] 23.5ns ±15% 20.8ns ±24% -11.37% (p=0.000 n=20+19)
BM_fullReduction_16T/15 [using 16 threads] 35.8ns ±21% 26.9ns ±17% -24.76% (p=0.000 n=20+19)
BM_fullReduction_16T/16 [using 16 threads] 38.7ns ±22% 27.7ns ±18% -28.40% (p=0.000 n=20+19)
BM_fullReduction_16T/31 [using 16 threads] 146ns ±17% 74ns ±11% -49.05% (p=0.000 n=20+18)
BM_fullReduction_16T/32 [using 16 threads] 154ns ±19% 84ns ±30% -45.79% (p=0.000 n=20+19)
BM_fullReduction_16T/64 [using 16 threads] 603ns ± 8% 308ns ±12% -48.94% (p=0.000 n=17+17)
BM_fullReduction_16T/128 [using 16 threads] 2.44µs ±13% 1.22µs ± 1% -50.29% (p=0.000 n=17+17)
BM_fullReduction_16T/256 [using 16 threads] 9.84µs ±14% 5.13µs ±30% -47.82% (p=0.000 n=19+19)
BM_fullReduction_16T/512 [using 16 threads] 78.0µs ± 9% 56.1µs ±17% -28.02% (p=0.000 n=18+20)
BM_fullReduction_16T/1k [using 16 threads] 325µs ± 5% 263µs ± 4% -19.00% (p=0.000 n=20+16)
BM_fullReduction_16T/2k [using 16 threads] 1.09ms ± 3% 0.99ms ± 1% -9.04% (p=0.000 n=20+20)
BM_fullReduction_16T/4k [using 16 threads] 7.66ms ± 3% 7.57ms ± 3% -1.24% (p=0.017 n=20+20)
BM_fullReduction_16T/10k [using 16 threads] 65.3ms ± 4% 65.0ms ± 3% ~ (p=0.718 n=20+20)
|
|
|
|
|
|
|
| |
This commit applies the following changes:
- Moving the `scamLauncher` specialization inside internal namespace to fix compiler crash on TensorScan for SYCL backend.
- Replacing `SYCL/sycl.hpp` to `CL/sycl.hpp` in order to follow SYCL 1.2.1 standard.
- minor fixes: commenting out an unused variable to avoid compiler warnings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clean up the code a bit and do a few micro-optimizations to improve performance for small tensors.
Benchmark numbers for Tensor<uint32_t>:
name old time/op new time/op delta
BM_cumSumRowReduction_1T/8 [using 1 threads] 76.5ns ± 0% 61.3ns ± 4% -19.80% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/64 [using 1 threads] 2.47µs ± 1% 2.40µs ± 1% -2.77% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/256 [using 1 threads] 39.8µs ± 0% 39.6µs ± 0% -0.60% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/4k [using 1 threads] 13.9ms ± 0% 13.4ms ± 1% -4.19% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/8 [using 2 threads] 76.8ns ± 0% 59.1ns ± 0% -23.09% (p=0.016 n=5+4)
BM_cumSumRowReduction_2T/64 [using 2 threads] 2.47µs ± 1% 2.41µs ± 1% -2.53% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/256 [using 2 threads] 39.8µs ± 0% 34.7µs ± 6% -12.74% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/4k [using 2 threads] 13.8ms ± 1% 7.2ms ± 6% -47.74% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/8 [using 8 threads] 76.4ns ± 0% 61.8ns ± 3% -19.02% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/64 [using 8 threads] 2.47µs ± 1% 2.40µs ± 1% -2.84% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/256 [using 8 threads] 39.8µs ± 0% 28.3µs ±11% -28.75% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/4k [using 8 threads] 13.8ms ± 0% 2.7ms ± 5% -80.39% (p=0.008 n=5+5)
BM_cumSumColReduction_1T/8 [using 1 threads] 59.1ns ± 0% 80.3ns ± 0% +35.94% (p=0.029 n=4+4)
BM_cumSumColReduction_1T/64 [using 1 threads] 3.06µs ± 0% 3.08µs ± 1% ~ (p=0.114 n=4+4)
BM_cumSumColReduction_1T/256 [using 1 threads] 175µs ± 0% 176µs ± 0% ~ (p=0.190 n=4+5)
BM_cumSumColReduction_1T/4k [using 1 threads] 824ms ± 1% 844ms ± 1% +2.37% (p=0.008 n=5+5)
BM_cumSumColReduction_2T/8 [using 2 threads] 59.0ns ± 0% 90.7ns ± 0% +53.74% (p=0.029 n=4+4)
BM_cumSumColReduction_2T/64 [using 2 threads] 3.06µs ± 0% 3.10µs ± 0% +1.08% (p=0.016 n=4+5)
BM_cumSumColReduction_2T/256 [using 2 threads] 176µs ± 0% 189µs ±18% ~ (p=0.151 n=5+5)
BM_cumSumColReduction_2T/4k [using 2 threads] 836ms ± 2% 611ms ±14% -26.92% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/8 [using 8 threads] 59.3ns ± 2% 90.6ns ± 0% +52.79% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/64 [using 8 threads] 3.07µs ± 0% 3.10µs ± 0% +0.99% (p=0.016 n=5+4)
BM_cumSumColReduction_8T/256 [using 8 threads] 176µs ± 0% 80µs ±19% -54.51% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/4k [using 8 threads] 827ms ± 2% 180ms ±14% -78.24% (p=0.008 n=5+5)
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions.
The benchmarks numbers below are for cumulative row- and column reductions of NxN matrices.
name old time/op new time/op delta
BM_cumSumRowReduction_1T/4 [using 1 threads ] 25.1ns ± 1% 35.2ns ± 1% +40.45%
BM_cumSumRowReduction_1T/8 [using 1 threads ] 73.4ns ± 0% 82.7ns ± 3% +12.74%
BM_cumSumRowReduction_1T/32 [using 1 threads ] 988ns ± 0% 832ns ± 0% -15.77%
BM_cumSumRowReduction_1T/64 [using 1 threads ] 4.07µs ± 2% 3.47µs ± 0% -14.70%
BM_cumSumRowReduction_1T/128 [using 1 threads ] 18.0µs ± 0% 16.8µs ± 0% -6.58%
BM_cumSumRowReduction_1T/512 [using 1 threads ] 287µs ± 0% 281µs ± 0% -2.22%
BM_cumSumRowReduction_1T/2k [using 1 threads ] 4.78ms ± 1% 4.78ms ± 2% ~
BM_cumSumRowReduction_1T/10k [using 1 threads ] 117ms ± 1% 117ms ± 1% ~
BM_cumSumRowReduction_8T/4 [using 8 threads ] 25.0ns ± 0% 35.2ns ± 0% +40.82%
BM_cumSumRowReduction_8T/8 [using 8 threads ] 77.2ns ±16% 81.3ns ± 0% ~
BM_cumSumRowReduction_8T/32 [using 8 threads ] 988ns ± 0% 833ns ± 0% -15.67%
BM_cumSumRowReduction_8T/64 [using 8 threads ] 4.08µs ± 2% 3.47µs ± 0% -14.95%
BM_cumSumRowReduction_8T/128 [using 8 threads ] 18.0µs ± 0% 17.3µs ±10% ~
BM_cumSumRowReduction_8T/512 [using 8 threads ] 287µs ± 0% 58µs ± 6% -79.92%
BM_cumSumRowReduction_8T/2k [using 8 threads ] 4.79ms ± 1% 0.64ms ± 1% -86.58%
BM_cumSumRowReduction_8T/10k [using 8 threads ] 117ms ± 1% 18ms ± 6% -84.50%
BM_cumSumColReduction_1T/4 [using 1 threads ] 23.9ns ± 0% 33.4ns ± 1% +39.68%
BM_cumSumColReduction_1T/8 [using 1 threads ] 71.6ns ± 1% 49.1ns ± 3% -31.40%
BM_cumSumColReduction_1T/32 [using 1 threads ] 973ns ± 0% 165ns ± 2% -83.10%
BM_cumSumColReduction_1T/64 [using 1 threads ] 4.06µs ± 1% 0.57µs ± 1% -85.94%
BM_cumSumColReduction_1T/128 [using 1 threads ] 33.4µs ± 1% 4.1µs ± 1% -87.67%
BM_cumSumColReduction_1T/512 [using 1 threads ] 1.72ms ± 4% 0.21ms ± 5% -87.91%
BM_cumSumColReduction_1T/2k [using 1 threads ] 119ms ±53% 11ms ±35% -90.42%
BM_cumSumColReduction_1T/10k [using 1 threads ] 1.59s ±67% 0.35s ±49% -77.96%
BM_cumSumColReduction_8T/4 [using 8 threads ] 23.8ns ± 0% 33.3ns ± 0% +40.06%
BM_cumSumColReduction_8T/8 [using 8 threads ] 71.6ns ± 1% 49.2ns ± 5% -31.33%
BM_cumSumColReduction_8T/32 [using 8 threads ] 1.01µs ±12% 0.17µs ± 3% -82.93%
BM_cumSumColReduction_8T/64 [using 8 threads ] 4.15µs ± 4% 0.58µs ± 1% -86.09%
BM_cumSumColReduction_8T/128 [using 8 threads ] 33.5µs ± 0% 4.1µs ± 4% -87.65%
BM_cumSumColReduction_8T/512 [using 8 threads ] 1.71ms ± 3% 0.06ms ±16% -96.21%
BM_cumSumColReduction_8T/2k [using 8 threads ] 97.1ms ±14% 3.0ms ±23% -96.88%
BM_cumSumColReduction_8T/10k [using 8 threads ] 1.97s ± 8% 0.06s ± 2% -96.74%
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Add ptranspose<*,4> to support matmul and add unit test for Matrix<bool> * Matrix<bool>
* work around a bug in slicing of Tensor<bool>.
* Add tensor tests
This speeds up matmul for boolean matrices by about 10x
name old time/op new time/op delta
BM_MatMul<bool>/8 267ns ± 0% 479ns ± 0% +79.25% (p=0.008 n=5+5)
BM_MatMul<bool>/32 6.42µs ± 0% 0.87µs ± 0% -86.50% (p=0.008 n=5+5)
BM_MatMul<bool>/64 43.3µs ± 0% 5.9µs ± 0% -86.42% (p=0.008 n=5+5)
BM_MatMul<bool>/128 315µs ± 0% 44µs ± 0% -85.98% (p=0.008 n=5+5)
BM_MatMul<bool>/256 2.41ms ± 0% 0.34ms ± 0% -85.68% (p=0.008 n=5+5)
BM_MatMul<bool>/512 18.8ms ± 0% 2.7ms ± 0% -85.53% (p=0.008 n=5+5)
BM_MatMul<bool>/1k 149ms ± 0% 22ms ± 0% -85.40% (p=0.008 n=5+5)
|
|
|
| |
Device::memcpy is not async-safe and might lead to deadlocks. Always evaluate slice expression in async mode.
|
| |
|
| |
|
|
|
|
| |
the Eigen::Half packet type
|
|
|
|
| |
expose pmul/add/div/min/max on host
|
|
|
|
|
|
|
| |
Looking at profiles we spend ~10-20% of Steal on simply computing
random % size. We can reduce random 32-bit int into [0, size) range with
a single multiplication and shift. This transformation is described in
https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
This provides a new op that matches std::rint and previous behavior of
pround. Also adds corresponding unsupported/../Tensor op.
Performance is the same as e. g. floor (tested SSE/AVX).
|
|
|
|
|
|
|
| |
* Adding Missing operations for vector comparison in SYCL. This caused compiler error for vector comparison when compiling SYCL
* Fixing the compiler error for placement new in TensorForcedEval.h This caused compiler error when compiling SYCL backend
* Reducing the SYCL warning by removing the abort function inside the kernel
* Adding Strong inline to functions inside SYCL interop.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The breakage was introduced by the following commit :
https://gitlab.com/libeigen/eigen/commit/ae07801dd8d295657f28b006e1e4999edf835052
After the commit, HIPCC errors out on some tests with the following error
```
Building HIPCC object unsupported/test/CMakeFiles/cxx11_tensor_device_1.dir/cxx11_tensor_device_1_generated_cxx11_tensor_device.cu.o
In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_device.cu:17:
In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:100:
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:129:12: error: no matching constructor for initialization of 'Eigen::internal::TensorBlockResourceRequirements'
return {merge(lhs.shape_type, rhs.shape_type), // shape_type
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided
struct TensorBlockResourceRequirements {
^
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 5 arguments, but 3 were provided
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 3 were provided
...
...
```
The fix is to explicitly decalre the (implicitly called) constructor as a device func
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|