| Commit message (Collapse) | Author | Age |
|
|
|
| |
CJMADD, which were effectively unused, apart from on x86, where the change results in identically performing code.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clang does a poor job of optimizing the GEBP microkernel on 32-bit ARM,
leading to excessive 16-byte register spills, slowing down basic f32
matrix multiplication by approx 50%.
By specializing `gebp_traits`, we can eliminate the register spills.
Volatile inline ASM both acts as a barrier to prevent reordering and
enforces strict register use. In a simple f32 matrix multiply example,
this modification reduces 16-byte spills from 109 instances to zero,
leading to a 1.5x speed increase (search for `16-byte Spill` in the
assembly in https://godbolt.org/z/chsPbE).
This is a replacement of !379. See there for further discussion.
Also moved `gebp_traits` specializations for NEON to
`Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h` to be alongside
other NEON-specific code.
Fixes #2138.
|
|
|
| |
`combine_scalar_factors` helper function.
|
|
|
|
| |
handled by the equivalent branch in the specialization for GemvProduct.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
inner products at runtime.
This speeds up inner products where one or both arguments are dynamic for small and medium-sized vectors (up to 32k).
name                           old time/op  new time/op  delta
BM_VecVecStatStat<float>/1     1.64ns ± 0%  1.64ns ± 0%  ~
BM_VecVecStatStat<float>/8     2.99ns ± 0%  2.99ns ± 0%  ~
BM_VecVecStatStat<float>/64    7.00ns ± 1%  7.04ns ± 0%  +0.66%
BM_VecVecStatStat<float>/512   61.6ns ± 0%  61.6ns ± 0%  ~
BM_VecVecStatStat<float>/4k    551ns ± 0%   553ns ± 1%   +0.26%
BM_VecVecStatStat<float>/32k   4.45µs ± 0%  4.45µs ± 0%  ~
BM_VecVecStatStat<float>/256k  77.9µs ± 0%  78.1µs ± 1%  ~
BM_VecVecStatStat<float>/1M    312µs ± 0%   312µs ± 1%   ~
BM_VecVecDynStat<float>/1      13.3ns ± 1%  4.6ns ± 0%   -65.35%
BM_VecVecDynStat<float>/8      14.4ns ± 0%  6.2ns ± 0%   -57.00%
BM_VecVecDynStat<float>/64     24.0ns ± 0%  10.2ns ± 3%  -57.57%
BM_VecVecDynStat<float>/512    138ns ± 0%   68ns ± 0%    -50.52%
BM_VecVecDynStat<float>/4k     1.11µs ± 0%  0.56µs ± 0%  -49.72%
BM_VecVecDynStat<float>/32k    8.89µs ± 0%  4.46µs ± 0%  -49.89%
BM_VecVecDynStat<float>/256k   78.2µs ± 0%  78.1µs ± 1%  ~
BM_VecVecDynStat<float>/1M     313µs ± 0%   312µs ± 1%   ~
BM_VecVecDynDyn<float>/1       10.4ns ± 0%  10.5ns ± 0%  +0.91%
BM_VecVecDynDyn<float>/8       12.0ns ± 3%  11.9ns ± 0%  ~
BM_VecVecDynDyn<float>/64      37.4ns ± 0%  19.6ns ± 1%  -47.57%
BM_VecVecDynDyn<float>/512     159ns ± 0%   81ns ± 0%    -49.07%
BM_VecVecDynDyn<float>/4k      1.13µs ± 0%  0.58µs ± 1%  -49.11%
BM_VecVecDynDyn<float>/32k     8.91µs ± 0%  5.06µs ±12%  -43.23%
BM_VecVecDynDyn<float>/256k    78.2µs ± 0%  78.2µs ± 1%  ~
BM_VecVecDynDyn<float>/1M      313µs ± 0%   312µs ± 1%   ~
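The kind of runtime dispatch this message refers to can be sketched as follows. This is a minimal, self-contained illustration and not Eigen's implementation: `Mat`, `dot`, `generic_product` and `product` are hypothetical names, and the matrix type is a bare row-major buffer. A product whose left factor turns out to have one row and whose right factor has one column is routed to a plain dot product instead of the general kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal dynamic matrix (row-major); not an Eigen type.
struct Mat {
  std::size_t rows, cols;
  std::vector<float> data;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  float& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Specialized path: dot product over two contiguous buffers.
inline float dot(const float* a, const float* b, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t k = 0; k < n; ++k) acc += a[k] * b[k];
  return acc;
}

// General path: naive triple loop, assumes a.cols == b.rows.
inline Mat generic_product(const Mat& a, const Mat& b) {
  Mat c(a.rows, b.cols);
  for (std::size_t i = 0; i < a.rows; ++i)
    for (std::size_t j = 0; j < b.cols; ++j)
      for (std::size_t k = 0; k < a.cols; ++k)
        c(i, j) += a(i, k) * b(k, j);
  return c;
}

// Dispatch on sizes that are only known at runtime.
inline Mat product(const Mat& a, const Mat& b) {
  if (a.rows == 1 && b.cols == 1) {
    // 1 x n times n x 1: both buffers are contiguous, so this is an inner product.
    Mat c(1, 1);
    c(0, 0) = dot(a.data.data(), b.data.data(), a.cols);
    return c;
  }
  return generic_product(a, b);
}
```

Calling `product` on a 1 x n and an n x 1 operand takes the `dot` branch; every other shape falls through to the general loop.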
|
| |
|
|
|
|
| |
Fixes #1995
|
| |
|
| |
|
|
|
|
| |
Fix compiler warnings in GeneralBlockPanelKernel.h.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some architectures have no convenient way to determine cache sizes at
runtime. Eigen's GEBP kernel falls back to default cache values in this
case, which might not be correct in all situations.
This patch introduces three preprocessor directives
`EIGEN_DEFAULT_L1_CACHE_SIZE`
`EIGEN_DEFAULT_L2_CACHE_SIZE`
`EIGEN_DEFAULT_L3_CACHE_SIZE`
to give users the possibility to set these default values explicitly.
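A rough sketch of how such an override interacts with a fallback is shown here. Only the three macro names come from this message; the `#ifndef` fallback values, `FallbackCacheSizes` and `fallback_cache_sizes` are placeholders invented for the illustration and do not reflect Eigen's built-in defaults.

```cpp
// Normally supplied on the compiler command line before any Eigen header
// is seen, e.g. -DEIGEN_DEFAULT_L1_CACHE_SIZE=65536 (value chosen arbitrarily).
#define EIGEN_DEFAULT_L1_CACHE_SIZE (64 * 1024)

#include <cstddef>

// Placeholder fallbacks for this sketch only; these are NOT Eigen's
// built-in defaults, which the commit message does not state.
#ifndef EIGEN_DEFAULT_L1_CACHE_SIZE
#define EIGEN_DEFAULT_L1_CACHE_SIZE (16 * 1024)
#endif
#ifndef EIGEN_DEFAULT_L2_CACHE_SIZE
#define EIGEN_DEFAULT_L2_CACHE_SIZE (512 * 1024)
#endif
#ifndef EIGEN_DEFAULT_L3_CACHE_SIZE
#define EIGEN_DEFAULT_L3_CACHE_SIZE (512 * 1024)
#endif

// Hypothetical holder for the fallback values, in bytes.
struct FallbackCacheSizes {
  std::ptrdiff_t l1, l2, l3;
};

// Returns whatever the macros resolved to: the user's value where one was
// defined, the placeholder otherwise.
inline FallbackCacheSizes fallback_cache_sizes() {
  return {EIGEN_DEFAULT_L1_CACHE_SIZE, EIGEN_DEFAULT_L2_CACHE_SIZE,
          EIGEN_DEFAULT_L3_CACHE_SIZE};
}
```

Here the L1 value comes from the user definition while L2 and L3 keep the placeholders. With Eigen itself, only the definition (or the equivalent `-D` flag) should be needed, placed before the first Eigen include.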
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
half- or quarter-packet vectorized loads in gemm_pack_rhs if they have size 4, instead of dropping down to the scalar path.
Benchmark measurements below are for computing `c.noalias() = a.transpose() * b;` for square RowMajor matrices of varying size.
Measured improvement with AVX+FMA:
name               old time/op  new time/op  delta
BM_MatMul_ATB/8    139ns ± 1%   129ns ± 1%   -7.49%   (p=0.008 n=5+5)
BM_MatMul_ATB/32   1.46µs ± 1%  1.22µs ± 0%  -16.72%  (p=0.008 n=5+5)
BM_MatMul_ATB/64   8.43µs ± 1%  7.41µs ± 0%  -12.04%  (p=0.008 n=5+5)
BM_MatMul_ATB/128  56.8µs ± 1%  52.9µs ± 1%  -6.83%   (p=0.008 n=5+5)
BM_MatMul_ATB/256  407µs ± 1%   395µs ± 3%   -2.94%   (p=0.032 n=5+5)
BM_MatMul_ATB/512  3.27ms ± 3%  3.18ms ± 1%  ~        (p=0.056 n=5+5)
Measured improvement for AVX512:
name               old time/op  new time/op  delta
BM_MatMul_ATB/8    167ns ± 1%   154ns ± 1%   -7.63%   (p=0.008 n=5+5)
BM_MatMul_ATB/32   1.08µs ± 1%  0.83µs ± 3%  -23.58%  (p=0.008 n=5+5)
BM_MatMul_ATB/64   6.21µs ± 1%  5.06µs ± 1%  -18.47%  (p=0.008 n=5+5)
BM_MatMul_ATB/128  36.1µs ± 2%  31.3µs ± 1%  -13.32%  (p=0.008 n=5+5)
BM_MatMul_ATB/256  263µs ± 2%   242µs ± 2%   -7.92%   (p=0.008 n=5+5)
BM_MatMul_ATB/512  1.95ms ± 2%  1.91ms ± 2%  ~        (p=0.095 n=5+5)
BM_MatMul_ATB/1k   15.4ms ± 4%  14.8ms ± 2%  ~        (p=0.095 n=5+5)
|
| |
|
| |
|
| |
|
|
|
|
| |
triangular^1*matrix with a destination having a non-trivial inner-stride
|
| |
|
|
|
|
| |
for destination with non-trivial inner stride
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
That was hurting users whose compilers reject such shadowing
declarations:
"""
./Eigen/src/Core/products/GeneralMatrixVector.h:356:10: error: declaration shadows a static data member of 'general_matrix_vector_product<type-parameter-0-0, type-parameter-0-1, type-parameter-0-2, 1, ConjugateLhs, type-parameter-0-4, type-parameter-0-5, ConjugateRhs, Version>' [-Werror,-Wshadow]
LhsPacketSize = Traits::LhsPacketSize,
^
./Eigen/src/Core/products/GeneralMatrixVector.h:307:22: note: previous declaration is here
static const Index LhsPacketSize = Traits::LhsPacketSize;
"""
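The diagnostic above comes from a local declaration reusing the name of a class-level static constant. A minimal sketch of the pattern and one way to avoid it follows; `DemoTraits` and `DemoGemvKernel` are hypothetical stand-ins, the value 4 is arbitrary, and the actual fix applied in this commit is not shown in the message.

```cpp
// Hypothetical stand-in for the traits class; the value is arbitrary.
struct DemoTraits {
  static const int LhsPacketSize = 4;
};

// Hypothetical stand-in for the kernel class named in the diagnostic.
struct DemoGemvKernel {
  // Class-level constant, as in the "previous declaration" note.
  static const int LhsPacketSize = DemoTraits::LhsPacketSize;

  static int run() {
    // A local `enum { LhsPacketSize = DemoTraits::LhsPacketSize };` here
    // would redeclare the name and trip -Wshadow (fatal with -Werror).
    // Referring to the class-level constant directly avoids the clash.
    return LhsPacketSize;
  }
};
```

Both spellings yield the same value, so dropping or renaming the inner declaration does not change behavior.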
|
|
|
|
|
|
| |
We take advantage of smaller SIMD registers as well, in that case.
Gains up to 3x for select input sizes.
|
| |
|
|
|
|
|
|
| |
https://bitbucket.org/eigen/eigen/commits/b55b5c7280a0481f01fe5ec764d55c443a8b6496
.
|
|
|
|
|
| |
This is a more general and simpler version of changeset 4c0fa6ce0f81ce67dd6723528ddf72f66ae92ba2
|
| |
|
| |
|
| |
|
|\
| |
| |
| |
| |
| | |
Speed up Eigen matrix*vector and vector*matrix multiplication.
Approved-by: Eugene Zhulenev <ezhulenev@google.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The row-major matrix-vector multiplication code uses a threshold to
check if processing 8 rows at a time would thrash the cache.
This change introduces two modifications to this logic.
1. A smaller threshold for ARM and ARM64 devices.
The value of this threshold was determined empirically using a Pixel2
phone, by benchmarking a large number of matrix-vector products in the
range [1..4096]x[1..4096] and measuring performance separately on
big and little cores with frequency pinning.
On big (out-of-order) cores, this change has little to no impact. But
on the small (in-order) cores, the matrix-vector products are up to
700% faster, especially on large matrices.
The motivation for this change was some internal code at Google which
was using hand-written NEON for implementing similar functionality,
processing the matrix one row at a time, which exhibited substantially
better performance than Eigen.
With the current change, Eigen handily beats that code.
2. Make the logic for choosing number of simultaneous rows apply
uniformly to 8, 4 and 2 rows instead of just 8 rows.
Since the default threshold for non-ARM devices is essentially
unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM
performance. This was verified by running the same set of benchmarks
on a Xeon desktop.
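One possible reading of the row-count heuristic is sketched here. The default threshold of 32 * 1024 is taken from this message; everything else is an assumption: `rows_per_pass` is a hypothetical helper, the comparison against the row stride in bytes is a guess at the form of the test, and the ARM threshold is left as a parameter because its value is not stated above.

```cpp
#include <cstddef>

// Default threshold from the commit message (32 * 1024). The smaller ARM
// value is not stated there, so callers pass it in explicitly.
constexpr std::size_t kDefaultThresholdBytes = 32 * 1024;

// Hypothetical helper: how many rows to process in the next pass.
// Assumption: when consecutive rows are further apart than the threshold,
// every multi-row width (8, 4 and 2) is rejected, not only 8.
inline std::size_t rows_per_pass(std::size_t remaining_rows,
                                 std::size_t row_stride_bytes,
                                 std::size_t threshold_bytes = kDefaultThresholdBytes) {
  if (row_stride_bytes > threshold_bytes) return 1;  // go row by row
  const std::size_t widths[] = {8, 4, 2};
  for (std::size_t w : widths) {
    if (remaining_rows >= w) return w;  // widest block that still fits
  }
  return 1;
}
```

Under these assumptions, lowering `threshold_bytes` only changes behavior for matrices whose rows are far apart in memory, which is consistent with the large-matrix gains reported for the in-order cores.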
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change speeds up Eigen matrix * vector and vector * matrix multiplication for dynamic matrices when it is known at runtime that one of the factors is a vector.
The benchmarks below test
c.noalias() = n_by_n_matrix * n_by_1_matrix;
c.noalias() = 1_by_n_matrix * n_by_n_matrix;
respectively.
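The selection step can be pictured roughly like this. It is a sketch under assumptions, not Eigen code: the `ProductPath` labels and `select_path` are hypothetical, and only the operand sizes are modelled.

```cpp
#include <cstddef>

// Hypothetical labels for the code paths; not Eigen identifiers.
enum class ProductPath { MatVec, VecMat, General };

// Chooses a path for C = A * B from sizes known only at runtime.
// A has a_rows rows, B has b_cols columns; the inner dimension is irrelevant here.
inline ProductPath select_path(std::size_t a_rows, std::size_t b_cols) {
  if (b_cols == 1) return ProductPath::MatVec;  // n_by_n_matrix * n_by_1_matrix
  if (a_rows == 1) return ProductPath::VecMat;  // 1_by_n_matrix * n_by_n_matrix
  return ProductPath::General;                  // full matrix-matrix product
}
```

The check costs two integer comparisons per product, which is negligible next to the work saved when a vector-shaped operand would otherwise go through the general matrix-matrix kernel.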
Benchmark measurements:
SSE:
Run on *** (72 X 2992 MHz CPUs); 2019-01-28T17:51:44.452697457-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark       Base (ns)  New (ns)  Improvement
------------------------------------------------
BM_MatVec/64         1096       312       +71.5%
BM_MatVec/128        4581      1464       +68.0%
BM_MatVec/256       18534      5710       +69.2%
BM_MatVec/512      118083     24162       +79.5%
BM_MatVec/1k       704106    173346       +75.4%
BM_MatVec/2k      3080828    742728       +75.9%
BM_MatVec/4k     25421512   4530117       +82.2%
BM_VecMat/32          352       130       +63.1%
BM_VecMat/64         1213       425       +65.0%
BM_VecMat/128        4640      1564       +66.3%
BM_VecMat/256       17902      5884       +67.1%
BM_VecMat/512       70466     24000       +65.9%
BM_VecMat/1k       340150    161263       +52.6%
BM_VecMat/2k      1420590    645576       +54.6%
BM_VecMat/4k      8083859   4364327       +46.0%
AVX2:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:45:11.508545307-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark       Base (ns)  New (ns)  Improvement
------------------------------------------------
BM_MatVec/64          619       120       +80.6%
BM_MatVec/128        9693       752       +92.2%
BM_MatVec/256       38356      2773       +92.8%
BM_MatVec/512       69006     12803       +81.4%
BM_MatVec/1k       443810    160378       +63.9%
BM_MatVec/2k      2633553    646594       +75.4%
BM_MatVec/4k     16211095   4327148       +73.3%
BM_VecMat/64          925       227       +75.5%
BM_VecMat/128        3438       830       +75.9%
BM_VecMat/256       13427      2936       +78.1%
BM_VecMat/512       53944     12473       +76.9%
BM_VecMat/1k       302264    157076       +48.0%
BM_VecMat/2k      1396811    675778       +51.6%
BM_VecMat/4k      8962246   4459010       +50.2%
AVX512:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:35:17.239329863-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark       Base (ns)  New (ns)  Improvement
------------------------------------------------
BM_MatVec/64          401       111       +72.3%
BM_MatVec/128        1846       513       +72.2%
BM_MatVec/256       36739      1927       +94.8%
BM_MatVec/512       54490      9227       +83.1%
BM_MatVec/1k       487374    161457       +66.9%
BM_MatVec/2k      2016270    643824       +68.1%
BM_MatVec/4k     13204300   4077412       +69.1%
BM_VecMat/32          324       106       +67.3%
BM_VecMat/64         1034       246       +76.2%
BM_VecMat/128        3576       802       +77.6%
BM_VecMat/256       13411      2561       +80.9%
BM_VecMat/512       58686     10037       +82.9%
BM_VecMat/1k       320862    163750       +49.0%
BM_VecMat/2k      1406719    651397       +53.7%
BM_VecMat/4k      7785179   4124677       +47.0%
|
|
|
|
|
| |
Prior to this change, a product with a LHS having 8 rows was faster with AVX-only than with AVX+FMA.
With AVX+FMA I measured a speedup of about 1.25x in such cases.
|
| |
|
|\ |
|
| |
| |
| |
| | |
previous GCC issue is fixed in GCC trunk (will be gcc 9).
|
| | |
|
| |
| |
| |
| | |
generating good ASM
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| |
| |
| |
| | |
See https://stackoverflow.com/questions/7411515/
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The patch works by altering the gebp lhs packing routines to also
consider ½ and ¼ packet length rows when packing, besides the original
whole packet and row-by-row attempts. Finally, gebp itself will try
to fit a fraction of a packet at a time if:
i) ½ and/or ¼ packets are available for the current context (e.g. AVX2
and SSE-sized SIMD register for x86)
ii) The matrix's height is favorable to it (it may be it's too small
in that dimension to take full advantage of the current/maximum
packet width or it may be the case that last rows may take
advantage of smaller packets before gebp goes row-by-row)
This helps mitigate huge slowdowns one had on AVX512 builds when
compared to AVX2 ones, for some dimensions. Gains top out at an extra 1x
in throughput. This patch is a complement to changeset 4ad359237aeb519dbd4b55eba43057b37988838c.
Since packing is changed, Eigen users that rely on very low-level API
usage, such as TensorFlow, will have to be adapted to work with the
changes.
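The fractional-packet idea can be illustrated with a small row-splitting sketch. It is an assumption-laden model rather than the packing code itself: `RowSplit` and `split_rows` are hypothetical names, `packet_size` is assumed to be a power of two, and real packing also depends on which packet types the target actually provides.

```cpp
#include <cstddef>

// Hypothetical result type: number of blocks of each kind covering the rows.
struct RowSplit {
  std::size_t full;     // whole-packet blocks
  std::size_t half;     // half-packet blocks
  std::size_t quarter;  // quarter-packet blocks
  std::size_t scalar;   // rows left for the row-by-row path
};

// Greedily covers `rows` with full, then half, then quarter packets and
// leaves the remainder to the scalar path. Fractions that would hold a
// single element are skipped, since they are no better than scalar code.
inline RowSplit split_rows(std::size_t rows, std::size_t packet_size) {
  RowSplit s{0, 0, 0, 0};
  const std::size_t half = packet_size / 2;
  const std::size_t quarter = packet_size / 4;

  s.full = rows / packet_size;
  rows %= packet_size;
  if (half > 1) {
    s.half = rows / half;
    rows %= half;
  }
  if (quarter > 1) {
    s.quarter = rows / quarter;
    rows %= quarter;
  }
  s.scalar = rows;
  return s;
}
```

For 23 rows and a 16-element packet, this covers 16 rows with one full packet and 4 with one quarter packet, leaving 3 rows for the scalar path instead of 7.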
|
|
|
|
|
|
| |
kernels.
With a 6pX4 kernel (not committed yet), this provides a +20% speedup.
|
| |
|