path: root/unsupported
Commit message (author, date)
* Eliminate CMake FindPackageHandleStandardArgs warnings. (Antonio Sanchez, 2021-02-24)
  CMake complains that the package name does not match when the case differs, e.g.:
  ```
  CMake Warning (dev) at /usr/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:273 (message):
    The package name passed to `find_package_handle_standard_args` (UMFPACK) does
    not match the name of the calling package (Umfpack). This can lead to problems
    in calling code that expects `find_package` result variables (e.g., `_FOUND`)
    to follow a certain pattern.
  Call Stack (most recent call first):
    cmake/FindUmfpack.cmake:50 (find_package_handle_standard_args)
    bench/spbench/CMakeLists.txt:24 (find_package)
  This warning is for project developers. Use -Wno-dev to suppress it.
  ```
  Here we rename the libraries to match their true cases.
* Add missing adolc isinf/isnan. (Antonio Sanchez, 2021-02-19)
  Also modified cmake/FindAdolc.cmake to eliminate warnings, and added search paths
  to match the install layout.
  Fixed: #2157
* Return NaN at poles of polygamma, digamma, and zeta if the limit is not defined. (frgossen, 2021-02-19)
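  A minimal sketch of the behavior described above, assuming the array-level `digamma()`
  exposed through the unsupported SpecialFunctions module; digamma has poles at the
  non-positive integers, where the two-sided limit is not defined, so those entries
  evaluate to NaN:
  ```cpp
  #include <unsupported/Eigen/SpecialFunctions>
  #include <iostream>

  int main() {
    Eigen::ArrayXd x(3);
    x << 0.0, -1.0, 0.5;                 // 0 and -1 are poles of digamma
    Eigen::ArrayXd y = x.digamma();      // pole entries evaluate to NaN
    std::cout << y.transpose() << "\n";  // expected: nan nan <finite value>
    return 0;
  }
  ```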
* Remove vim-specific comments to recognize the correct file type. (David Tellenbach, 2021-02-09)
  As discussed in #2143, we remove editor-specific comments.
* Add specialization of check_sparse_solving() for the SuperLU solver, in order to test adjoint and transpose solves. (Ralf Hannemann-Tamas, 2021-02-08)
* Include `<cstdint>` in one place, remove custom typedefs. (Antonio Sanchez, 2021-01-26)
  Originating from [this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case),
  some win32 compilers define `__int32` as a `long`, but MinGW defines `std::int32_t`
  as an `int`, leading to a type conflict.
  To avoid this, we remove the custom `typedef` definitions for win32. The Tensor
  module requires C++11 anyway, so we are guaranteed to have included `<cstdint>`
  already in `Eigen/Core`.
  Also re-arranged the headers to only include `<cstdint>` in one place to avoid
  this type of error again.
* Fix test of ExtractVolumePatchesOp. (Gmc2, 2021-01-25)
* Remove std::cerr in iterative solver since we don't have iostream. (David Tellenbach, 2021-01-21)
  This fixes #2123.
* Fix paddings of TensorVolumePatchOp. (Maozhou, Ge, 2021-01-15)
* Add CUDA complex sqrt. (Antonio Sanchez, 2020-12-22)
  This is to support scalar `sqrt` of complex numbers `std::complex<T>` on device,
  requested by Tensorflow folks.
  Technically `std::complex` is not supported by NVCC on device (though it is by
  clang), so the default `sqrt(std::complex<T>)` function only works on the host.
  Here we create an overload to add back the functionality.
  Also modified the CMake file to add the `--relaxed-constexpr` (or equivalent) flag
  for NVCC to allow calling constexpr functions from device functions, and added
  support for specifying the compute architecture for NVCC (was already available
  for clang).
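  A host-side sketch of the call this change makes usable inside device code, assuming
  `Eigen::numext::sqrt` is the dispatch point that picks up the complex overload:
  ```cpp
  #include <Eigen/Core>
  #include <complex>
  #include <iostream>

  int main() {
    std::complex<float> z(-4.0f, 0.0f);
    // On the host this falls through to std::sqrt; the commit adds an overload so
    // the same call also works when compiled for the device with NVCC.
    std::complex<float> r = Eigen::numext::sqrt(z);
    std::cout << r << "\n";  // expected: (0,2)
    return 0;
  }
  ```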
* Replace call to FixedDimensions() with a singleton instance of FixedDimensions. (Turing Eret, 2020-12-16)
* TensorStorage with FixedDimensions now has zero instance memory overhead. (Turing Eret, 2020-12-14)
  Removed m_dimension as an instance member of TensorStorage with FixedDimensions and
  instead use the template parameter. This means that the size of a pure fixed-size
  storage is exactly equal to the data it stores.
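  A minimal sketch of what zero instance overhead means for a user-visible fixed-size
  tensor, assuming `TensorFixedSize` is backed by this storage:
  ```cpp
  #include <unsupported/Eigen/CXX11/Tensor>
  #include <iostream>

  int main() {
    using Fixed2x3 = Eigen::TensorFixedSize<float, Eigen::Sizes<2, 3>>;
    // With the dimensions encoded purely in the type, the object should be no
    // larger than the coefficients it stores.
    std::cout << sizeof(Fixed2x3) << " bytes vs "
              << 2 * 3 * sizeof(float) << " bytes of data\n";
    return 0;
  }
  ```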
* Remove code checking for CMake < 3.5. (Alexander Grund, 2020-12-14)
  As the CMake version is at least 3.5, the code checking for earlier versions can be removed.
* Fix bad NEON fp16 check. (Antonio Sanchez, 2020-12-04)
* Special function implementations for half/bfloat16 packets. (Antonio Sanchez, 2020-12-04)
  Current implementations fail to consider half-float packets, only half-float scalars.
  Added specializations for packets on AVX, AVX512 and NEON. Added tests to
  `special_packetmath`.
  The current `special_functions` tests would fail for half and bfloat16 due to lack of
  precision. The NEON tests also fail with precision issues and due to different handling
  of `sqrt(inf)`, so the special functions bessel and ndtri have been disabled.
  Tested with AVX, AVX512.
* Clean up the Tensor header and get rid of the EIGEN_SLEEP macro. (Rasmus Munk Larsen, 2020-12-02)
* Make inclusion of doc sub-directory optional by adjusting options. (Bowie Owens, 2020-11-27)
  Allows exclusion of doc and related targets to help when using eigen via add_subdirectory().
  Requested by: https://gitlab.com/libeigen/eigen/-/issues/1842
  Also required making EIGEN_TEST_BUILD_DOCUMENTATION a dependent option on EIGEN_BUILD_DOC.
  This ensures documentation targets are properly defined when EIGEN_TEST_BUILD_DOCUMENTATION is ON.
* Fix boolean float conversion and product warnings. (Antonio Sanchez, 2020-11-24)
  This fixes some gcc warnings such as:
  ```
  Eigen/src/Core/GenericPacketMath.h:655:63: warning: implicit conversion turns floating-point number into bool: 'typename __gnu_cxx::__enable_if<__is_integer<bool>::__value, double>::__type' (aka 'double') to 'bool' [-Wimplicit-conversion-floating-point-to-bool]
    Packet psqrt(const Packet& a) { EIGEN_USING_STD(sqrt); return sqrt(a); }
  ```
  Details:
  - Added `scalar_sqrt_op<bool>` (`-Wimplicit-conversion-floating-point-to-bool`).
  - Added `scalar_square_op<bool>` and `scalar_cube_op<bool>` specializations (`-Wint-in-bool-context`).
  - Deprecated the above specialized ops for bool.
  - Modified `cxx11_tensor_block_eval` to specialize its generator for booleans
    (`-Wint-in-bool-context`) and to use `abs` instead of `square` to avoid the
    deprecated bool ops.
* Fix sparse_extra_3, disable counting temporaries for testing DynamicSparseMatrix. (Antonio Sanchez, 2020-11-18)
  Multiplication of column-major `DynamicSparseMatrix`es involves three temporaries:
  - two for transposing twice to sort the coefficients (`ConservativeSparseSparseProduct.h`, L160-161)
  - one for a final copy assignment (`SparseAssign.h`, L108)
  The latter is avoided in an optimization for `SparseMatrix`. Since `DynamicSparseMatrix`
  is deprecated in favor of `SparseMatrix`, it's not worth the effort to optimize further,
  so I simply disabled counting temporaries via a macro.
  Note that due to the inclusion of `sparse_product.cpp`, the `sparse_extra` tests actually
  re-run all the original `sparse_product` tests as well. We may want to simply drop the
  `DynamicSparseMatrix` tests altogether, which would eliminate the test duplication.
  Related to #2048
* Add bit_cast for half/bfloat to/from uint16_t, fix TensorRandom. (Antonio Sanchez, 2020-11-18)
  The existing `TensorRandom.h` implementation makes the assumption that `half`
  (`bfloat16`) has a `uint16_t` member `x` (`value`), which is not always true. This
  currently fails on arm64, where `x` has type `__fp16`. Added `bit_cast` specializations
  to allow casting to/from `uint16_t` for both `half` and `bfloat16`.
  Also added tests in `half_float`, `bfloat16_float`, and `cxx11_tensor_random` to catch
  these errors in the future.
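  A minimal sketch of the round trip the new specializations enable, assuming
  `Eigen::numext::bit_cast` is the entry point they extend:
  ```cpp
  #include <Eigen/Core>
  #include <cstdint>
  #include <iostream>

  int main() {
    Eigen::half h(1.0f);
    // Reinterpret the 16 bits of a half without relying on a particular member layout.
    std::uint16_t bits = Eigen::numext::bit_cast<std::uint16_t>(h);
    Eigen::half back   = Eigen::numext::bit_cast<Eigen::half>(bits);
    std::cout << std::hex << bits << " -> " << static_cast<float>(back) << "\n";  // 3c00 -> 1
    return 0;
  }
  ```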
* Fix rule-of-3 for the Tensor module. (Antonio Sanchez, 2020-11-18)
  Adds copy constructors to Tensor ops, inherits assignment operators from `TensorBase`.
  Addresses #1863
* Disable testing of OpenGL by default. (Antonio Sanchez, 2020-11-12)
  The `OpenGLSupport` module contains mostly deprecated features, and the test is highly
  GL-context-dependent, relies on deprecated GLUT, and requires a display. Until the
  module is updated to support modern OpenGL and the test to use newer windowing
  frameworks (e.g. GLFW), it's probably best to disable the test by default.
  The test can be enabled with `cmake -DEIGEN_TEST_OPENGL=ON`.
  See #2053 for more details.
* Address issues with `openglsupport` test. (Antonio Sanchez, 2020-11-11)
  The existing test fails on several systems due to GL runtime version mismatches, the
  use of deprecated features, and memory errors due to improper use of GLUT.
  The test was modified to:
  - Run within a display function, allowing proper GLUT cleanup.
  - Generate dynamic shaders with a supported GLSL version string and output variables.
  - Report shader compilation errors.
  - Check the GL context version before launching version-specific tests.
  Note that most of the existing `OpenGLSupport` module and tests rely on deprecated
  features (e.g. the fixed-function pipeline). The test was modified to allow it to pass
  on various systems. We might want to consider removing the module or re-writing it
  entirely to support modern OpenGL. This is beyond the scope of this patch.
  Testing of legacy GL (for platforms that support it) can be enabled by defining
  `EIGEN_LEGACY_OPENGL`. Otherwise, the test will try to create a modern context.
  Tested on:
  - MacBook Air (2019), macOS Catalina 10.15.7 (OpenGL 2.1, 4.1)
  - Debian 10.6, NVidia Quadro K1200 (OpenGL 3.1, 3.3)
* CMakefile update for ROCm 4.0. (Deven Desai, 2020-10-29)
  Starting with ROCm 4.0, the `hipconfig --platform` command will return `amd` (the prior
  return value was `hcc`). Updating the CMakeLists.txt files in the test dirs to account
  for this change.
* [SYCL] Clean up the code: remove the extra `#pragma unroll` in SYCL, which was causing issues on embedded systems. (mehdi-goli, 2020-10-28)
* Remove leftover debug print statement in cxx11_tensor_expr.cpp. (Rasmus Munk Larsen, 2020-10-14)
* Get rid of nested template specialization in TensorReductionGpu.h, which was broken by c6953f799b01d36f4236b64f351cc1446e0abe17. (Rasmus Munk Larsen, 2020-10-13)
* Add packet generic ops `predux_fmin`, `predux_fmin_nan`, `predux_fmax`, and `predux_fmax_nan` that implement reductions with `PropagateNaN` and `PropagateNumbers` semantics. Add (slow) generic implementations for most reductions. (Rasmus Munk Larsen, 2020-10-13)
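  A scalar sketch (not Eigen's packet code) of the two reduction semantics named above:
  `PropagateNumbers` skips NaNs, while `PropagateNaN` returns NaN as soon as one is present:
  ```cpp
  #include <algorithm>
  #include <cmath>
  #include <iostream>
  #include <limits>

  float reduce_min_propagate_numbers(const float* v, int n) {
    float m = std::numeric_limits<float>::infinity();
    for (int i = 0; i < n; ++i)
      if (!std::isnan(v[i])) m = std::min(m, v[i]);  // NaNs are ignored
    return m;
  }

  float reduce_min_propagate_nan(const float* v, int n) {
    float m = std::numeric_limits<float>::infinity();
    for (int i = 0; i < n; ++i) {
      if (std::isnan(v[i])) return v[i];             // NaNs win
      m = std::min(m, v[i]);
    }
    return m;
  }

  int main() {
    const float v[] = {3.0f, std::nanf(""), 1.0f};
    std::cout << reduce_min_propagate_numbers(v, 3) << " "
              << reduce_min_propagate_nan(v, 3) << "\n";  // expected: 1 nan
    return 0;
  }
  ```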
* Add EIGEN prefix for HAS_LGAMMA_R. (David Tellenbach, 2020-10-08)
* Use lgamma_r if it is available (update check for glibc 2.19+). (Eugene Zhulenev, 2020-10-08)
* Don't make assumptions about NaN propagation for pmin/pmax; it varies across platforms. Change the test to only check NaN propagation for pfmin/pfmax. (Rasmus Munk Larsen, 2020-10-07)
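  A small illustration of why no assumption is safe: even `std::max` is order-dependent
  in the presence of NaN (it returns its first argument when the comparison is false),
  and SIMD min/max instructions differ across platforms in a similar way:
  ```cpp
  #include <algorithm>
  #include <cmath>
  #include <iostream>

  int main() {
    const float nan = std::nanf("");
    // Comparisons against NaN are false, so std::max keeps the first argument:
    std::cout << std::max(nan, 1.0f) << " "    // nan
              << std::max(1.0f, nan) << "\n";  // 1
    return 0;
  }
  ```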
* Fix Eigen::ThreadPool::CurrentThreadId returning the wrong thread id when EIGEN_AVOID_THREAD_LOCAL and NDEBUG are defined. (Zhuyie, 2020-09-25)
* Get rid of initialization logic for blueNorm by making the computed constants static const or constexpr. (Rasmus Munk Larsen, 2020-09-18)
  Move the macro definition EIGEN_CONSTEXPR to Core and make all methods in NumTraits
  constexpr when EIGEN_HAS_CONSTEXPR is 1.
* Fixing a CUDA / P100 regression introduced by PR 181. (Deven Desai, 2020-08-20)
  PR 181 (https://gitlab.com/libeigen/eigen/-/merge_requests/181) adds a
  `__launch_bounds__(1024)` attribute to GPU kernels that did not have that attribute
  explicitly specified. That PR seems to cause regressions on the CUDA platform.
  This PR/commit makes the changes in PR 181 applicable to HIP only.
* Disable min/max NaN propagation in test cxx11_tensor_expr. (David Tellenbach, 2020-08-14)
  The current pmin/pmax implementation for Arm Neon propagates NaNs differently than
  std::min/std::max. See issue https://gitlab.com/libeigen/eigen/-/issues/1937
* Adding an explicit launch_bounds(1024) attribute for GPU kernels. (Deven Desai, 2020-08-05)
  Starting with ROCm 3.5, the HIP compiler will change from HCC to hip-clang. This
  compiler change introduces a change in the default value of the `__launch_bounds__`
  attribute associated with a GPU kernel. (The default value is the value assumed by the
  compiler for the `__launch_bounds__` attribute when it is not explicitly specified by
  the user.)
  Currently (i.e. for HIP with ROCm 3.3 and older), the default value is 1024. That
  changes to 256 with ROCm 3.5 (i.e. the hip-clang compiler). As a consequence of this
  change, if a GPU kernel with a `__launch_bounds__` attribute of 256 is launched at
  runtime with a threads_per_block value > 256, it leads to a runtime error. This is
  leading to a couple of Eigen unit test failures with ROCm 3.5.
  This commit adds an explicit `__launch_bounds__(1024)` attribute to every GPU kernel
  that currently does not have it explicitly specified (and hence would end up getting
  the default value of 256 with the change to hip-clang).
* Inherit alignment trait from argument in TensorBroadcasting to avoid segfault when the argument is unaligned. (Rasmus Munk Larsen, 2020-07-28)
* Update tensor reduction test to avoid undefined division of bfloat16 by int. (Rasmus Munk Larsen, 2020-07-22)
* Fix tensor casts for large packets and casts to/from std::complex. (Antonio Sanchez, 2020-06-30)
  The original tensor casts were only defined for `SrcCoeffRatio`:`TgtCoeffRatio` 1:1,
  1:2, 2:1, 4:1. Here we add the missing 1:N and 8:1.
  We also add casting `Eigen::half` to/from `std::complex<T>`, which was missing to make
  it consistent with `Eigen::bfloat16`, and generalize the overload to work for any
  complex type.
  Tests were added to `basicstuff`, `packetmath`, and `cxx11_tensor_casts` to test all
  cast configurations.
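  A minimal sketch of one of the newly supported conversions, `Eigen::half` to
  `std::complex<float>`, through the Tensor `cast<>()` API:
  ```cpp
  #include <unsupported/Eigen/CXX11/Tensor>
  #include <complex>
  #include <iostream>

  int main() {
    Eigen::Tensor<Eigen::half, 1> t(4);
    t.setValues({Eigen::half(1), Eigen::half(2), Eigen::half(3), Eigen::half(4)});
    // half -> complex<float> is one of the cast configurations this change adds.
    Eigen::Tensor<std::complex<float>, 1> c = t.cast<std::complex<float>>();
    std::cout << c(2) << "\n";  // expected: (3,0)
    return 0;
  }
  ```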
* Support BFloat16 in Eigen. (Teng Lu, 2020-06-20)
* Run two independent chains, when reducing tensors. (Ilya Tokar, 2020-06-16)
  Running two chains exposes more instruction-level parallelism by allowing both chains
  to execute at the same time. Results are a bit noisy, but for medium lengths we almost
  hit the theoretical upper bound of 2x.
  BM_fullReduction_16T/3 [using 16 threads] 17.3ns ±11% 17.4ns ± 9% ~ (p=0.178 n=18+19)
  BM_fullReduction_16T/4 [using 16 threads] 17.6ns ±17% 17.0ns ±18% ~ (p=0.835 n=20+19)
  BM_fullReduction_16T/7 [using 16 threads] 18.9ns ±12% 18.2ns ±10% ~ (p=0.756 n=20+18)
  BM_fullReduction_16T/8 [using 16 threads] 19.8ns ±13% 19.4ns ±21% ~ (p=0.512 n=20+20)
  BM_fullReduction_16T/10 [using 16 threads] 23.5ns ±15% 20.8ns ±24% -11.37% (p=0.000 n=20+19)
  BM_fullReduction_16T/15 [using 16 threads] 35.8ns ±21% 26.9ns ±17% -24.76% (p=0.000 n=20+19)
  BM_fullReduction_16T/16 [using 16 threads] 38.7ns ±22% 27.7ns ±18% -28.40% (p=0.000 n=20+19)
  BM_fullReduction_16T/31 [using 16 threads] 146ns ±17% 74ns ±11% -49.05% (p=0.000 n=20+18)
  BM_fullReduction_16T/32 [using 16 threads] 154ns ±19% 84ns ±30% -45.79% (p=0.000 n=20+19)
  BM_fullReduction_16T/64 [using 16 threads] 603ns ± 8% 308ns ±12% -48.94% (p=0.000 n=17+17)
  BM_fullReduction_16T/128 [using 16 threads] 2.44µs ±13% 1.22µs ± 1% -50.29% (p=0.000 n=17+17)
  BM_fullReduction_16T/256 [using 16 threads] 9.84µs ±14% 5.13µs ±30% -47.82% (p=0.000 n=19+19)
  BM_fullReduction_16T/512 [using 16 threads] 78.0µs ± 9% 56.1µs ±17% -28.02% (p=0.000 n=18+20)
  BM_fullReduction_16T/1k [using 16 threads] 325µs ± 5% 263µs ± 4% -19.00% (p=0.000 n=20+16)
  BM_fullReduction_16T/2k [using 16 threads] 1.09ms ± 3% 0.99ms ± 1% -9.04% (p=0.000 n=20+20)
  BM_fullReduction_16T/4k [using 16 threads] 7.66ms ± 3% 7.57ms ± 3% -1.24% (p=0.017 n=20+20)
  BM_fullReduction_16T/10k [using 16 threads] 65.3ms ± 4% 65.0ms ± 3% ~ (p=0.718 n=20+20)
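  A scalar sketch (not the actual Eigen kernel) of the idea: splitting one dependent
  accumulation chain into two independent chains lets the two additions per iteration
  execute in parallel:
  ```cpp
  #include <cstddef>
  #include <iostream>
  #include <vector>

  float sum_two_chains(const float* data, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f;  // two independent dependency chains
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
      acc0 += data[i];      // these two additions do not depend on each other,
      acc1 += data[i + 1];  // so the CPU can overlap them
    }
    if (i < n) acc0 += data[i];
    return acc0 + acc1;
  }

  int main() {
    std::vector<float> v(1000, 1.0f);
    std::cout << sum_two_chains(v.data(), v.size()) << "\n";  // 1000
    return 0;
  }
  ```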
* Remove HasCast and fix packetmath cast tests. (Antonio Sanchez, 2020-06-11)
  The use of the `packet_traits<>::HasCast` field is currently inconsistent with
  `type_casting_traits<>`, and it is unused apart from within `test/packetmath.cpp`.
  In addition, those packetmath cast tests do not currently reflect how casts are
  performed in practice: they ignore the `SrcCoeffRatio` and `TgtCoeffRatio` fields,
  assuming a 1:1 ratio.
  Here we remove the unused `HasCast` and modify the packet cast tests to better reflect
  their usage.
* Update FindComputeCpp.cmake to fix build problems on Windows. (Thales Sabino, 2020-06-05)
  - Use standard types in SYCL/PacketMath.h to avoid compilation problems on Windows.
  - Add EIGEN_HAS_CONSTEXPR to cxx11_tensor_argmax_sycl.cpp to fix build problems on Windows.
* Disable test for 32-bit systems (e.g. ARM, i386). (Antonio Sánchez, 2020-05-28)
  Both i386 and 32-bit ARM do not define __uint128_t. On most systems, if __uint128_t is
  defined, then so is the macro __SIZEOF_INT128__.
  https://stackoverflow.com/questions/18531782/how-to-know-if-uint128-t-is-defined1
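  A sketch of the guard implied by the note above: rely on `__uint128_t` only when the
  compiler advertises it via `__SIZEOF_INT128__` (which 32-bit ARM and i386 do not):
  ```cpp
  #include <iostream>

  int main() {
  #ifdef __SIZEOF_INT128__
    __uint128_t x = static_cast<__uint128_t>(1) << 100;  // needs 128-bit support
    std::cout << "128-bit integers available; x >> 64 = "
              << static_cast<unsigned long long>(x >> 64) << "\n";  // 68719476736
  #else
    std::cout << "no 128-bit integer support on this target\n";
  #endif
    return 0;
  }
  ```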
* Move the `ScanLauncher` function inside the internal namespace. (mehdi-goli, 2020-05-11)
  This commit applies the following changes:
  - Moving the `ScanLauncher` specialization inside the internal namespace to fix a
    compiler crash on TensorScan for the SYCL backend.
  - Replacing `SYCL/sycl.hpp` with `CL/sycl.hpp` in order to follow the SYCL 1.2.1 standard.
  - Minor fixes: commenting out an unused variable to avoid compiler warnings.
* Add parallelization of TensorScanOp for types without packet ops. (Rasmus Munk Larsen, 2020-05-06)
  Clean up the code a bit and do a few micro-optimizations to improve performance for
  small tensors. Benchmark numbers for Tensor<uint32_t>:
  name old time/op new time/op delta
  BM_cumSumRowReduction_1T/8 [using 1 threads] 76.5ns ± 0% 61.3ns ± 4% -19.80% (p=0.008 n=5+5)
  BM_cumSumRowReduction_1T/64 [using 1 threads] 2.47µs ± 1% 2.40µs ± 1% -2.77% (p=0.008 n=5+5)
  BM_cumSumRowReduction_1T/256 [using 1 threads] 39.8µs ± 0% 39.6µs ± 0% -0.60% (p=0.008 n=5+5)
  BM_cumSumRowReduction_1T/4k [using 1 threads] 13.9ms ± 0% 13.4ms ± 1% -4.19% (p=0.008 n=5+5)
  BM_cumSumRowReduction_2T/8 [using 2 threads] 76.8ns ± 0% 59.1ns ± 0% -23.09% (p=0.016 n=5+4)
  BM_cumSumRowReduction_2T/64 [using 2 threads] 2.47µs ± 1% 2.41µs ± 1% -2.53% (p=0.008 n=5+5)
  BM_cumSumRowReduction_2T/256 [using 2 threads] 39.8µs ± 0% 34.7µs ± 6% -12.74% (p=0.008 n=5+5)
  BM_cumSumRowReduction_2T/4k [using 2 threads] 13.8ms ± 1% 7.2ms ± 6% -47.74% (p=0.008 n=5+5)
  BM_cumSumRowReduction_8T/8 [using 8 threads] 76.4ns ± 0% 61.8ns ± 3% -19.02% (p=0.008 n=5+5)
  BM_cumSumRowReduction_8T/64 [using 8 threads] 2.47µs ± 1% 2.40µs ± 1% -2.84% (p=0.008 n=5+5)
  BM_cumSumRowReduction_8T/256 [using 8 threads] 39.8µs ± 0% 28.3µs ±11% -28.75% (p=0.008 n=5+5)
  BM_cumSumRowReduction_8T/4k [using 8 threads] 13.8ms ± 0% 2.7ms ± 5% -80.39% (p=0.008 n=5+5)
  BM_cumSumColReduction_1T/8 [using 1 threads] 59.1ns ± 0% 80.3ns ± 0% +35.94% (p=0.029 n=4+4)
  BM_cumSumColReduction_1T/64 [using 1 threads] 3.06µs ± 0% 3.08µs ± 1% ~ (p=0.114 n=4+4)
  BM_cumSumColReduction_1T/256 [using 1 threads] 175µs ± 0% 176µs ± 0% ~ (p=0.190 n=4+5)
  BM_cumSumColReduction_1T/4k [using 1 threads] 824ms ± 1% 844ms ± 1% +2.37% (p=0.008 n=5+5)
  BM_cumSumColReduction_2T/8 [using 2 threads] 59.0ns ± 0% 90.7ns ± 0% +53.74% (p=0.029 n=4+4)
  BM_cumSumColReduction_2T/64 [using 2 threads] 3.06µs ± 0% 3.10µs ± 0% +1.08% (p=0.016 n=4+5)
  BM_cumSumColReduction_2T/256 [using 2 threads] 176µs ± 0% 189µs ±18% ~ (p=0.151 n=5+5)
  BM_cumSumColReduction_2T/4k [using 2 threads] 836ms ± 2% 611ms ±14% -26.92% (p=0.008 n=5+5)
  BM_cumSumColReduction_8T/8 [using 8 threads] 59.3ns ± 2% 90.6ns ± 0% +52.79% (p=0.008 n=5+5)
  BM_cumSumColReduction_8T/64 [using 8 threads] 3.07µs ± 0% 3.10µs ± 0% +0.99% (p=0.016 n=5+4)
  BM_cumSumColReduction_8T/256 [using 8 threads] 176µs ± 0% 80µs ±19% -54.51% (p=0.008 n=5+5)
  BM_cumSumColReduction_8T/4k [using 8 threads] 827ms ± 2% 180ms ±14% -78.24% (p=0.008 n=5+5)
* Fix accidental copy of loop variable. (Rasmus Munk Larsen, 2020-05-05)
* Vectorize and parallelize TensorScanOp. (Rasmus Munk Larsen, 2020-05-05)
  TensorScanOp is used in TensorFlow for a number of operations, such as cumulative
  logexp reduction and cumulative sum and product reductions. The benchmark numbers
  below are for cumulative row and column reductions of NxN matrices.
  name old time/op new time/op delta
  BM_cumSumRowReduction_1T/4 [using 1 threads] 25.1ns ± 1% 35.2ns ± 1% +40.45%
  BM_cumSumRowReduction_1T/8 [using 1 threads] 73.4ns ± 0% 82.7ns ± 3% +12.74%
  BM_cumSumRowReduction_1T/32 [using 1 threads] 988ns ± 0% 832ns ± 0% -15.77%
  BM_cumSumRowReduction_1T/64 [using 1 threads] 4.07µs ± 2% 3.47µs ± 0% -14.70%
  BM_cumSumRowReduction_1T/128 [using 1 threads] 18.0µs ± 0% 16.8µs ± 0% -6.58%
  BM_cumSumRowReduction_1T/512 [using 1 threads] 287µs ± 0% 281µs ± 0% -2.22%
  BM_cumSumRowReduction_1T/2k [using 1 threads] 4.78ms ± 1% 4.78ms ± 2% ~
  BM_cumSumRowReduction_1T/10k [using 1 threads] 117ms ± 1% 117ms ± 1% ~
  BM_cumSumRowReduction_8T/4 [using 8 threads] 25.0ns ± 0% 35.2ns ± 0% +40.82%
  BM_cumSumRowReduction_8T/8 [using 8 threads] 77.2ns ±16% 81.3ns ± 0% ~
  BM_cumSumRowReduction_8T/32 [using 8 threads] 988ns ± 0% 833ns ± 0% -15.67%
  BM_cumSumRowReduction_8T/64 [using 8 threads] 4.08µs ± 2% 3.47µs ± 0% -14.95%
  BM_cumSumRowReduction_8T/128 [using 8 threads] 18.0µs ± 0% 17.3µs ±10% ~
  BM_cumSumRowReduction_8T/512 [using 8 threads] 287µs ± 0% 58µs ± 6% -79.92%
  BM_cumSumRowReduction_8T/2k [using 8 threads] 4.79ms ± 1% 0.64ms ± 1% -86.58%
  BM_cumSumRowReduction_8T/10k [using 8 threads] 117ms ± 1% 18ms ± 6% -84.50%
  BM_cumSumColReduction_1T/4 [using 1 threads] 23.9ns ± 0% 33.4ns ± 1% +39.68%
  BM_cumSumColReduction_1T/8 [using 1 threads] 71.6ns ± 1% 49.1ns ± 3% -31.40%
  BM_cumSumColReduction_1T/32 [using 1 threads] 973ns ± 0% 165ns ± 2% -83.10%
  BM_cumSumColReduction_1T/64 [using 1 threads] 4.06µs ± 1% 0.57µs ± 1% -85.94%
  BM_cumSumColReduction_1T/128 [using 1 threads] 33.4µs ± 1% 4.1µs ± 1% -87.67%
  BM_cumSumColReduction_1T/512 [using 1 threads] 1.72ms ± 4% 0.21ms ± 5% -87.91%
  BM_cumSumColReduction_1T/2k [using 1 threads] 119ms ±53% 11ms ±35% -90.42%
  BM_cumSumColReduction_1T/10k [using 1 threads] 1.59s ±67% 0.35s ±49% -77.96%
  BM_cumSumColReduction_8T/4 [using 8 threads] 23.8ns ± 0% 33.3ns ± 0% +40.06%
  BM_cumSumColReduction_8T/8 [using 8 threads] 71.6ns ± 1% 49.2ns ± 5% -31.33%
  BM_cumSumColReduction_8T/32 [using 8 threads] 1.01µs ±12% 0.17µs ± 3% -82.93%
  BM_cumSumColReduction_8T/64 [using 8 threads] 4.15µs ± 4% 0.58µs ± 1% -86.09%
  BM_cumSumColReduction_8T/128 [using 8 threads] 33.5µs ± 0% 4.1µs ± 4% -87.65%
  BM_cumSumColReduction_8T/512 [using 8 threads] 1.71ms ± 3% 0.06ms ±16% -96.21%
  BM_cumSumColReduction_8T/2k [using 8 threads] 97.1ms ±14% 3.0ms ±23% -96.88%
  BM_cumSumColReduction_8T/10k [using 8 threads] 1.97s ± 8% 0.06s ± 2% -96.74%
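  A minimal usage sketch of the kind of scan that benefits from this change, a cumulative
  sum along one tensor dimension:
  ```cpp
  #include <unsupported/Eigen/CXX11/Tensor>
  #include <iostream>

  int main() {
    Eigen::Tensor<float, 2> t(2, 3);
    t.setValues({{1.f, 2.f, 3.f}, {4.f, 5.f, 6.f}});
    // TensorScanOp underlies cumulative reductions such as cumsum along an axis.
    Eigen::Tensor<float, 2> cs = t.cumsum(1);  // scan across each row
    std::cout << cs << "\n";  // rows become 1 3 6 and 4 9 15
    return 0;
  }
  ```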
* Extend support for Packet16b. (Rasmus Munk Larsen, 2020-04-28)
  * Add ptranspose<*,4> to support matmul and add a unit test for Matrix<bool> * Matrix<bool>.
  * Work around a bug in slicing of Tensor<bool>.
  * Add tensor tests.
  This speeds up matmul for boolean matrices by about 10x.
  name old time/op new time/op delta
  BM_MatMul<bool>/8 267ns ± 0% 479ns ± 0% +79.25% (p=0.008 n=5+5)
  BM_MatMul<bool>/32 6.42µs ± 0% 0.87µs ± 0% -86.50% (p=0.008 n=5+5)
  BM_MatMul<bool>/64 43.3µs ± 0% 5.9µs ± 0% -86.42% (p=0.008 n=5+5)
  BM_MatMul<bool>/128 315µs ± 0% 44µs ± 0% -85.98% (p=0.008 n=5+5)
  BM_MatMul<bool>/256 2.41ms ± 0% 0.34ms ± 0% -85.68% (p=0.008 n=5+5)
  BM_MatMul<bool>/512 18.8ms ± 0% 2.7ms ± 0% -85.53% (p=0.008 n=5+5)
  BM_MatMul<bool>/1k 149ms ± 0% 22ms ± 0% -85.40% (p=0.008 n=5+5)
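  A minimal sketch of the user-visible operation this speeds up, a boolean matrix
  product (the expected count assumes the usual and/or semantics of a bool product):
  ```cpp
  #include <Eigen/Dense>
  #include <iostream>

  int main() {
    using BoolMatrix = Eigen::Matrix<bool, Eigen::Dynamic, Eigen::Dynamic>;
    BoolMatrix a = BoolMatrix::Identity(4, 4);
    BoolMatrix b = BoolMatrix::Constant(4, 4, true);
    // Matrix<bool> * Matrix<bool> is the product vectorized with Packet16b.
    BoolMatrix c = a * b;
    std::cout << c.count() << " true entries\n";  // expected: 16
    return 0;
  }
  ```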
* Add async evaluation support to TensorSlicingOp. (Eugene Zhulenev, 2020-04-22)
  Device::memcpy is not async-safe and might lead to deadlocks. Always evaluate the slice
  expression in async mode.