path: root/Eigen/src/Core/products
* Small cleanup: Get rid of the macros EIGEN_HAS_SINGLE_INSTRUCTION_CJMADD and CJMADD, which were effectively unused, apart from on x86, where the change results in identically performing code. (Rasmus Munk Larsen, 2021-06-24)
* Fix more enum arithmetic. (Rasmus Munk Larsen, 2021-06-15)
* Fix C++20 warnings about using enums in arithmetic expressions. (Rasmus Munk Larsen, 2021-06-10)
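  A minimal illustration of the warning class being fixed (hypothetical enum names, not Eigen's): C++20 deprecates implicit arithmetic between different enumeration types, so mixed-enum expressions must go through an integer type.

  ```cpp
  // Sketch of the C++20 -Wdeprecated-enum-enum-conversion issue, with
  // made-up enum names. Arithmetic mixing two distinct enum types warns;
  // casting to int first is the usual fix.
  enum PacketTraits { PacketSize = 4 };
  enum KernelTraits { PeelCount = 2 };

  // int bad = PacketSize * PeelCount;          // deprecated in C++20
  int good = int(PacketSize) * int(PeelCount);  // warning-free
  ```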
* Fix excessive GEBP register spilling for 32-bit NEON. (Antonio Sanchez, 2021-02-03)
  Clang does a poor job of optimizing the GEBP microkernel on 32-bit ARM, leading to excessive 16-byte register spills, slowing down basic f32 matrix multiplication by approx 50%. By specializing `gebp_traits`, we can eliminate the register spills. Volatile inline ASM both acts as a barrier to prevent reordering and enforces strict register use. In a simple f32 matrix multiply example, this modification reduces 16-byte spills from 109 instances to zero, leading to a 1.5x speed increase (search for `16-byte Spill` in the assembly at https://godbolt.org/z/chsPbE).
  This is a replacement of !379; see there for further discussion. Also moved the `gebp_traits` specializations for NEON to `Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h` to sit alongside other NEON-specific code. Fixes #2138.
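  The core trick, sketched below for a GCC/Clang toolchain targeting ARMv7 with NEON-VFPv4 (an illustration of the technique, not Eigen's actual `madd` specialization): routing the fused multiply-add through volatile inline asm pins the accumulator in a register and stops the compiler from reordering or spilling it.

  ```cpp
  #include <arm_neon.h>

  // Hypothetical helper showing the barrier idea. The "+w" constraint keeps
  // the accumulator in a NEON quad register, and `volatile` forbids the
  // compiler from reordering, duplicating, or spilling the operation.
  inline void madd_barrier(float32x4_t& acc, float32x4_t a, float32x4_t b) {
    asm volatile("vfma.f32 %q0, %q1, %q2" : "+w"(acc) : "w"(a), "w"(b));
  }
  ```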
* Eliminate boolean product warnings by factoring out a `combine_scalar_factors` helper function. (Christoph Hertzberg, 2021-01-05)
* Remove redundant branch for handling dynamic vector*vector. This will be handled by the equivalent branch in the specialization for GemvProduct. (Rasmus Munk Larsen, 2020-11-12)
* Optimize matrix*matrix and matrix*vector products when they correspond to inner products at runtime. (Rasmus Munk Larsen, 2020-11-12)
  This speeds up inner products where one or both arguments is dynamic, for small and medium-sized vectors (up to 32k).

  name                           old time/op   new time/op   delta
  BM_VecVecStatStat<float>/1     1.64ns ± 0%   1.64ns ± 0%   ~
  BM_VecVecStatStat<float>/8     2.99ns ± 0%   2.99ns ± 0%   ~
  BM_VecVecStatStat<float>/64    7.00ns ± 1%   7.04ns ± 0%   +0.66%
  BM_VecVecStatStat<float>/512   61.6ns ± 0%   61.6ns ± 0%   ~
  BM_VecVecStatStat<float>/4k     551ns ± 0%    553ns ± 1%   +0.26%
  BM_VecVecStatStat<float>/32k   4.45µs ± 0%   4.45µs ± 0%   ~
  BM_VecVecStatStat<float>/256k  77.9µs ± 0%   78.1µs ± 1%   ~
  BM_VecVecStatStat<float>/1M     312µs ± 0%    312µs ± 1%   ~
  BM_VecVecDynStat<float>/1      13.3ns ± 1%    4.6ns ± 0%   -65.35%
  BM_VecVecDynStat<float>/8      14.4ns ± 0%    6.2ns ± 0%   -57.00%
  BM_VecVecDynStat<float>/64     24.0ns ± 0%   10.2ns ± 3%   -57.57%
  BM_VecVecDynStat<float>/512     138ns ± 0%     68ns ± 0%   -50.52%
  BM_VecVecDynStat<float>/4k     1.11µs ± 0%   0.56µs ± 0%   -49.72%
  BM_VecVecDynStat<float>/32k    8.89µs ± 0%   4.46µs ± 0%   -49.89%
  BM_VecVecDynStat<float>/256k   78.2µs ± 0%   78.1µs ± 1%   ~
  BM_VecVecDynStat<float>/1M      313µs ± 0%    312µs ± 1%   ~
  BM_VecVecDynDyn<float>/1       10.4ns ± 0%   10.5ns ± 0%   +0.91%
  BM_VecVecDynDyn<float>/8       12.0ns ± 3%   11.9ns ± 0%   ~
  BM_VecVecDynDyn<float>/64      37.4ns ± 0%   19.6ns ± 1%   -47.57%
  BM_VecVecDynDyn<float>/512      159ns ± 0%     81ns ± 0%   -49.07%
  BM_VecVecDynDyn<float>/4k      1.13µs ± 0%   0.58µs ± 1%   -49.11%
  BM_VecVecDynDyn<float>/32k     8.91µs ± 0%   5.06µs ±12%   -43.23%
  BM_VecVecDynDyn<float>/256k    78.2µs ± 0%   78.2µs ± 1%   ~
  BM_VecVecDynDyn<float>/1M       313µs ± 0%    312µs ± 1%   ~
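  The shape of the optimization, sketched with Eigen's public API (the real change performs an equivalent check inside the product evaluator; `multiply` below is a hypothetical wrapper, not Eigen code):

  ```cpp
  #include <Eigen/Dense>

  // Illustrative only: detect at runtime that a dynamically-sized product
  // is really a 1xN * Nx1 inner product and bypass the GEMM/GEMV machinery.
  void multiply(const Eigen::MatrixXf& a, const Eigen::MatrixXf& b,
                Eigen::MatrixXf& c) {
    if (a.rows() == 1 && b.cols() == 1) {
      c.resize(1, 1);
      c(0, 0) = a.row(0).dot(b.col(0));  // plain dot product, no GEMM
    } else {
      c.noalias() = a * b;               // general product path
    }
  }
  ```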
* Don't use `*=`, as the product might not return a Scalar. (janos, 2020-10-02)
* Fix failure in GEBP kernel when compiling with OpenMP and FMA. (David Tellenbach, 2020-09-30)
  Fixes #1995.
* Remove semicolon triggering -Wextra-semi-stmt. (Alexander Neumann, 2020-09-07)
* Fix unused variable warning on Arm. (David Tellenbach, 2020-06-15)
* Fix static analyzer warning in SelfadjointProduct.h; fix compiler warnings in GeneralBlockPanelKernel.h. (Rasmus Munk Larsen, 2020-06-08)
* Fix #1874: it works on both MSVC 2017 and other platforms. (Kan Chen, 2020-05-21)
* Fix #1874: work around MSVC 2017 compilation issue. (Gael Guennebaud, 2020-05-15)
* Possibility to specify user-defined default cache sizes for the GEBP kernel. (David Tellenbach, 2020-05-08)
  Some architectures have no convenient way to determine cache sizes at runtime. Eigen's GEBP kernel falls back to default cache values in this case, which might not be correct in all situations. This patch introduces three preprocessor macros, `EIGEN_DEFAULT_L1_CACHE_SIZE`, `EIGEN_DEFAULT_L2_CACHE_SIZE`, and `EIGEN_DEFAULT_L3_CACHE_SIZE`, to give users the possibility to set these default values explicitly.
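  Typical usage, going by the macro names (a minimal sketch; the byte values are placeholders for a hypothetical target, not recommendations): define the macros before including any Eigen header.

  ```cpp
  // Placeholder cache sizes for a hypothetical target that cannot report
  // them at runtime; these must be defined before any Eigen include.
  #define EIGEN_DEFAULT_L1_CACHE_SIZE (32 * 1024)
  #define EIGEN_DEFAULT_L2_CACHE_SIZE (512 * 1024)
  #define EIGEN_DEFAULT_L3_CACHE_SIZE (4 * 1024 * 1024)

  #include <Eigen/Dense>
  ```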
* Speed up matrix multiplication for small to medium size matrices by using half- or quarter-packet vectorized loads in gemm_pack_rhs if they have size 4, instead of dropping down to the scalar path. (Rasmus Munk Larsen, 2020-04-07)
  Benchmark measurements below are for computing ```c.noalias() = a.transpose() * b;``` for square RowMajor matrices of varying size.

  Measured improvement with AVX+FMA:

  name               old time/op   new time/op   delta
  BM_MatMul_ATB/8     139ns ± 1%    129ns ± 1%   -7.49%   (p=0.008 n=5+5)
  BM_MatMul_ATB/32   1.46µs ± 1%   1.22µs ± 0%   -16.72%  (p=0.008 n=5+5)
  BM_MatMul_ATB/64   8.43µs ± 1%   7.41µs ± 0%   -12.04%  (p=0.008 n=5+5)
  BM_MatMul_ATB/128  56.8µs ± 1%   52.9µs ± 1%   -6.83%   (p=0.008 n=5+5)
  BM_MatMul_ATB/256   407µs ± 1%    395µs ± 3%   -2.94%   (p=0.032 n=5+5)
  BM_MatMul_ATB/512  3.27ms ± 3%   3.18ms ± 1%   ~        (p=0.056 n=5+5)

  Measured improvement for AVX512:

  name               old time/op   new time/op   delta
  BM_MatMul_ATB/8     167ns ± 1%    154ns ± 1%   -7.63%   (p=0.008 n=5+5)
  BM_MatMul_ATB/32   1.08µs ± 1%   0.83µs ± 3%   -23.58%  (p=0.008 n=5+5)
  BM_MatMul_ATB/64   6.21µs ± 1%   5.06µs ± 1%   -18.47%  (p=0.008 n=5+5)
  BM_MatMul_ATB/128  36.1µs ± 2%   31.3µs ± 1%   -13.32%  (p=0.008 n=5+5)
  BM_MatMul_ATB/256   263µs ± 2%    242µs ± 2%   -7.92%   (p=0.008 n=5+5)
  BM_MatMul_ATB/512  1.95ms ± 2%   1.91ms ± 2%   ~        (p=0.095 n=5+5)
  BM_MatMul_ATB/1k   15.4ms ± 4%   14.8ms ± 2%   ~        (p=0.095 n=5+5)
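  A rough sketch of the packing idea with generic SSE intrinsics (illustration only, hypothetical helper, not Eigen's `gemm_pack_rhs`): a panel that is exactly 4 floats wide can be copied with one quarter-width (128-bit) load per row instead of four scalar loads.

  ```cpp
  #include <immintrin.h>

  // Hypothetical routine: pack a 4-column float panel using 128-bit loads
  // on an AVX build rather than dropping to scalar copies. Eigen's real
  // gemm_pack_rhs additionally handles strides, conjugation and alignment.
  void pack_rhs_panel4(float* dst, const float* src, int rows, int src_stride) {
    for (int r = 0; r < rows; ++r) {
      __m128 v = _mm_loadu_ps(src + r * src_stride);  // 4 floats at once
      _mm_storeu_ps(dst + 4 * r, v);
    }
  }
  ```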
* Adding correct cache sizes for PPC architecture. (Everton Constantino, 2020-01-13)
* Fix -Werror -Wfloat-conversion warning. (Janek Kozicki, 2019-12-23)
* PR 719: fix real/imag namespace conflict. (Gael Guennebaud, 2019-10-08)
* bug #1741: fix self-adjoint*matrix, triangular*matrix, and triangular^-1*matrix with a destination having a non-trivial inner stride. (Gael Guennebaud, 2019-09-11)
* Fix compilation of BLAS backend and frontend. (Gael Guennebaud, 2019-09-11)
* bug #1741: fix SelfAdjointView::rankUpdate and product to triangular part for destinations with a non-trivial inner stride. (Gael Guennebaud, 2019-09-10)
* bug #1741: fix `C.noalias() = A*C;` with `C.innerStride()!=1`. (Gael Guennebaud, 2019-09-10)
* GEMV: remove double declaration of constant. (Gustavo Lima Chaves, 2019-05-23)
  The duplicate was hurting users whose compilers refused to proceed, e.g.:

    ./Eigen/src/Core/products/GeneralMatrixVector.h:356:10: error: declaration shadows a static data member of 'general_matrix_vector_product<type-parameter-0-0, type-parameter-0-1, type-parameter-0-2, 1, ConjugateLhs, type-parameter-0-4, type-parameter-0-5, ConjugateRhs, Version>' [-Werror,-Wshadow]
      LhsPacketSize = Traits::LhsPacketSize,
      ^
    ./Eigen/src/Core/products/GeneralMatrixVector.h:307:22: note: previous declaration is here
      static const Index LhsPacketSize = Traits::LhsPacketSize;
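  A stripped-down reproduction of the diagnosed pattern (hypothetical names): an enum constant declared inside a member function re-declares a name that already exists as a static data member.

  ```cpp
  // Minimal reproduction of the -Wshadow error above, with made-up names.
  struct Kernel {
    static const int LhsPacketSize = 4;  // the original declaration

    int run() const {
      // This local enum shadows the static member and triggers -Wshadow;
      // the fix is to drop it and use the member directly.
      enum { LhsPacketSize = 4 };
      return LhsPacketSize;
    }
  };
  ```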
* Speed up GEMV on AVX-512 builds, just as done for GEBP previously. (Gustavo Lima Chaves, 2019-04-26)
  We take advantage of smaller SIMD registers as well, in that case. Gains of up to 3x for select input sizes.
* bug #1689: fix used-but-marked-unused warning. (Gael Guennebaud, 2019-03-05)
* Revert https://bitbucket.org/eigen/eigen/commits/b55b5c7280a0481f01fe5ec764d55c443a8b6496. (Rasmus Munk Larsen, 2019-02-14)
* Make GEMM fall back to GEMV for runtime vectors. (Gael Guennebaud, 2019-02-07)
  This is a more general and simpler version of changeset 4c0fa6ce0f81ce67dd6723528ddf72f66ae92ba2.
* Backed out changeset 4c0fa6ce0f81ce67dd6723528ddf72f66ae92ba2. (Gael Guennebaud, 2019-02-07)
* Remove duplicated comment line. (Eugene Zhulenev, 2019-02-04)
* Fix GeneralBlockPanelKernel Android compilation. (Eugene Zhulenev, 2019-02-04)
* Merged in rmlarsen/eigen (pull request PR-578). (Rasmus Larsen, 2019-02-02)
  Speed up Eigen matrix*vector and vector*matrix multiplication.
  Approved-by: Eugene Zhulenev <ezhulenev@google.com>
* Speed up row-major matrix-vector product on ARM. (Sameer Agarwal, 2019-02-01)
  The row-major matrix-vector multiplication code uses a threshold to check if processing 8 rows at a time would thrash the cache. This change introduces two modifications to this logic (see the sketch after this entry).

  1. A smaller threshold for ARM and ARM64 devices. The value of this threshold was determined empirically using a Pixel 2 phone, by benchmarking a large number of matrix-vector products in the range [1..4096]x[1..4096] and measuring performance separately on big and little cores with frequency pinning. On big (out-of-order) cores, this change has little to no impact, but on the little (in-order) cores the matrix-vector products are up to 700% faster, especially on large matrices. The motivation for this change was some internal code at Google which used hand-written NEON to implement similar functionality, processing the matrix one row at a time, and which exhibited substantially better performance than Eigen. With the current change, Eigen handily beats that code.

  2. Make the logic for choosing the number of simultaneous rows apply uniformly to 8, 4 and 2 rows instead of just 8 rows.

  Since the default threshold for non-ARM devices is essentially unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM performance. This was verified by running the same set of benchmarks on a Xeon desktop.
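  A schematic of the logic being tuned (names, structure, and the ARM budget below are invented for illustration; they are not Eigen's actual constants): pick the largest row count whose working set fits the per-architecture threshold, testing 8, 4 and 2 rows uniformly.

  ```cpp
  // Illustrative sketch only; the threshold values are placeholders.
  #if defined(__ARM_NEON) || defined(__aarch64__)
  static const long kMatVecThreshold = 4 * 1024;   // smaller budget on ARM
  #else
  static const long kMatVecThreshold = 32 * 1024;  // ~ the old 32000 default
  #endif

  // Apply the same cache-thrash test uniformly to 8, 4 and 2 rows.
  inline int simultaneous_rows(long cols, long scalar_bytes) {
    for (int rows = 8; rows >= 2; rows /= 2)
      if (rows * cols * scalar_bytes <= kMatVecThreshold) return rows;
    return 1;  // process one row at a time
  }
  ```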
* Speed up Eigen matrix*vector and vector*matrix multiplication. (Rasmus Munk Larsen, 2019-01-31)
  This change speeds up Eigen matrix * vector and vector * matrix multiplication for dynamic matrices when it is known at runtime that one of the factors is a vector. The benchmarks below test

    c.noalias() = n_by_n_matrix * n_by_1_matrix;
    c.noalias() = 1_by_n_matrix * n_by_n_matrix;

  respectively. Benchmark measurements:

  SSE:
  Run on *** (72 X 2992 MHz CPUs); 2019-01-28T17:51:44.452697457-08:00
  CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB

  Benchmark       Base (ns)  New (ns)  Improvement
  ------------------------------------------------
  BM_MatVec/64         1096       312       +71.5%
  BM_MatVec/128        4581      1464       +68.0%
  BM_MatVec/256       18534      5710       +69.2%
  BM_MatVec/512      118083     24162       +79.5%
  BM_MatVec/1k       704106    173346       +75.4%
  BM_MatVec/2k      3080828    742728       +75.9%
  BM_MatVec/4k     25421512   4530117       +82.2%
  BM_VecMat/32          352       130       +63.1%
  BM_VecMat/64         1213       425       +65.0%
  BM_VecMat/128        4640      1564       +66.3%
  BM_VecMat/256       17902      5884       +67.1%
  BM_VecMat/512       70466     24000       +65.9%
  BM_VecMat/1k       340150    161263       +52.6%
  BM_VecMat/2k      1420590    645576       +54.6%
  BM_VecMat/4k      8083859   4364327       +46.0%

  AVX2:
  Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:45:11.508545307-08:00
  CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB

  Benchmark       Base (ns)  New (ns)  Improvement
  ------------------------------------------------
  BM_MatVec/64          619       120       +80.6%
  BM_MatVec/128        9693       752       +92.2%
  BM_MatVec/256       38356      2773       +92.8%
  BM_MatVec/512       69006     12803       +81.4%
  BM_MatVec/1k       443810    160378       +63.9%
  BM_MatVec/2k      2633553    646594       +75.4%
  BM_MatVec/4k     16211095   4327148       +73.3%
  BM_VecMat/64          925       227       +75.5%
  BM_VecMat/128        3438       830       +75.9%
  BM_VecMat/256       13427      2936       +78.1%
  BM_VecMat/512       53944     12473       +76.9%
  BM_VecMat/1k       302264    157076       +48.0%
  BM_VecMat/2k      1396811    675778       +51.6%
  BM_VecMat/4k      8962246   4459010       +50.2%

  AVX512:
  Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:35:17.239329863-08:00
  CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB

  Benchmark       Base (ns)  New (ns)  Improvement
  ------------------------------------------------
  BM_MatVec/64          401       111       +72.3%
  BM_MatVec/128        1846       513       +72.2%
  BM_MatVec/256       36739      1927       +94.8%
  BM_MatVec/512       54490      9227       +83.1%
  BM_MatVec/1k       487374    161457       +66.9%
  BM_MatVec/2k      2016270    643824       +68.1%
  BM_MatVec/4k     13204300   4077412       +69.1%
  BM_VecMat/32          324       106       +67.3%
  BM_VecMat/64         1034       246       +76.2%
  BM_VecMat/128        3576       802       +77.6%
  BM_VecMat/256       13411      2561       +80.9%
  BM_VecMat/512       58686     10037       +82.9%
  BM_VecMat/1k       320862    163750       +49.0%
  BM_VecMat/2k      1406719    651397       +53.7%
  BM_VecMat/4k      7785179   4124677       +47.0%
* GEBP: improves pipelining in the 1pX4 path with FMA. (Gael Guennebaud, 2019-01-30)
  Prior to this change, a product with an LHS having 8 rows was faster with AVX-only than with AVX+FMA. With AVX+FMA I measured a speedup of about 1.25x in such cases.
* Fix compilation with ARM64. (Gael Guennebaud, 2019-01-30)
* Fix conflicts and merge. (Gael Guennebaud, 2019-01-30)
* According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101, the previous GCC issue is fixed in GCC trunk (will be GCC 9). (Gael Guennebaud, 2019-01-30)
* ARM64 & GEBP: add specialization for double, +30% speedup. (Gael Guennebaud, 2019-01-30)
* ARM64 & GEBP: make use of vfmaq_laneq_f32 and work around GCC's issue in generating good ASM. (Gael Guennebaud, 2019-01-30)
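  For context, `vfmaq_laneq_f32` is the AArch64 NEON intrinsic that fuses a multiply by a single lane of a vector into an accumulate; a small usage sketch (not the Eigen kernel itself):

  ```cpp
  #include <arm_neon.h>

  // vfmaq_laneq_f32(acc, b, v, lane) returns acc + b * v[lane] as a fused
  // multiply-add, so GEBP can consume one RHS coefficient per lane of an
  // already-loaded quad register instead of broadcasting it first.
  float32x4_t madd_lane0(float32x4_t acc, float32x4_t b, float32x4_t v) {
    return vfmaq_laneq_f32(acc, b, v, 0);  // acc += b * v[0]
  }
  ```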
* Fix compilation error in NEON GEBP specialization of madd. (Rasmus Munk Larsen, 2019-01-25)
* GEBP: fix swapped kernel mode with AVX512 and complex scalars. (Gael Guennebaud, 2019-01-16)
* GEBP: clean up the logic to choose between 4 packets or 1 packet. (Gael Guennebaud, 2019-01-16)
* bug #1661: fix regression in GEBP and AVX512. (Gael Guennebaud, 2019-01-16)
* bug #1633: use proper type for madd temporaries, factorize RhsPacketx4. (Gael Guennebaud, 2019-01-16)
* bug #1633: refactor GEBP kernel and optimize for NEON. (Renjie Liu, 2019-01-16)
* Make code compile again for older compilers. (Christoph Hertzberg, 2018-12-22)
  See https://stackoverflow.com/questions/7411515/
* GEBP: add new ½- and ¼-packet rows per (peeling) round on the LHS. (Gustavo Lima Chaves, 2018-12-21)
  The patch works by altering the GEBP LHS packing routines to also consider ½- and ¼-packet-length rows when packing, besides the original whole-packet and row-by-row attempts. Finally, GEBP itself will try to fit a fraction of a packet at a time if:

  i) ½ and/or ¼ packets are available in the current context (e.g. AVX2- and SSE-sized SIMD registers on x86);

  ii) the matrix's height is favorable to it (it may be too small in that dimension to take full advantage of the current/maximum packet width, or it may be the case that the last rows can take advantage of smaller packets before GEBP goes row-by-row).

  This helps mitigate the huge slowdowns one had on AVX512 builds compared to AVX2 ones for some dimensions. Gains top out at an extra 1x in throughput. This patch is a complement to changeset 4ad359237aeb519dbd4b55eba43057b37988838c. Since packing is changed, Eigen users that go for very low-level API usage, like TensorFlow, will have to be adapted to work with the changes.
* Artificially increase L1 blocking size for AVX512: +10% speedup with current kernels. With a 6pX4 kernel (not committed yet), this provides a +20% speedup. (Gael Guennebaud, 2018-12-11)
* bug #1643: fix compilation issue with GCC and no optimization. (Gael Guennebaud, 2018-12-11)