| Commit message (Collapse) | Author | Age |
guarantee an even spacing when possible.
Otherwise, the "high" bound is implicitly lowered to the largest value allowing for an even distribution.
This changeset also disables vectorization for this integer path.
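The spacing rule described above can be sketched in plain C++. This is an illustration only, not Eigen's implementation: the helper name `int_linspaced` is hypothetical, and the sketch assumes `size >= 2` and `high - low >= size - 1`, so that the integer step is at least 1.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Eigen): `size` integers starting at `low`,
// separated by a constant integer step.
// Assumption: size >= 2 and high - low >= size - 1, so the step is >= 1.
std::vector<int> int_linspaced(int size, int low, int high) {
    assert(size >= 2 && high - low >= size - 1);
    // Integer division rounds down, so the step never overshoots `high`.
    const int step = (high - low) / (size - 1);
    // If (high - low) is not a multiple of (size - 1), the last value ends up
    // below `high`: the upper bound is implicitly lowered.
    const int effective_high = low + step * (size - 1);
    std::vector<int> out(size);
    for (int i = 0; i < size; ++i) out[i] = low + i * step;
    assert(out.back() == effective_high);
    return out;
}
```

With these assumptions, `int_linspaced(4, 0, 9)` gives 0, 3, 6, 9, and `int_linspaced(4, 0, 10)` gives the same sequence because the upper bound is lowered from 10 to 9.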
|
Improve performance of parallelized matrix multiply for rectangular matrices
|
Enabling CUDA in Geometry
|
whether to limit the check to this compiler combination
(` || (EIGEN_COMP_MSVC == 1900 && __CUDACC_VER__) `)
or to leave it as it is. I also don't know if this will have any effect on
including Eigen in device code (I don't do that in my current project).
|
Fix a bug in psqrt for SSE and AVX when EIGEN_FAST_MATH=1
|
Additional fixes necessary for CUDA in Core (mostly usage of
EIGEN_USING_STD_MATH).
|
version (i.e. JetPack 2.3) is used.
|
ICC to find the right overload)
|
which is required by the matrix-vector code.
|
threads when the inner dimension is small.
Timing for square matrices is unchanged, but both CPU and Wall time are significantly improved for skinny matrices. The benchmarks below are for multiplying NxK * KxN matrices with test names of the form BM_OuterishProd/N/K.
Improvements in Wall time:
Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                 Base (ns)    New (ns)  Improvement
------------------------------------------------------------
BM_OuterishProd/64/1           3088        1610       +47.9%
BM_OuterishProd/64/4           3562        2414       +32.2%
BM_OuterishProd/64/32          8861        7815       +11.8%
BM_OuterishProd/128/1         11363        6504       +42.8%
BM_OuterishProd/128/4         11128        9794       +12.0%
BM_OuterishProd/128/64        27691       27396        +1.1%
BM_OuterishProd/256/1         33214       28123       +15.3%
BM_OuterishProd/256/4         34312       36818        -7.3%
BM_OuterishProd/256/128      174866      176398        -0.9%
BM_OuterishProd/512/1       7963684      104224       +98.7%
BM_OuterishProd/512/4       7987913      112867       +98.6%
BM_OuterishProd/512/256     8198378     1306500       +84.1%
BM_OuterishProd/1k/1        7356256      324432       +95.6%
BM_OuterishProd/1k/4        8129616      331621       +95.9%
BM_OuterishProd/1k/512     27265418     7517538       +72.4%
Improvements in CPU time:
Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                 Base (ns)    New (ns)  Improvement
------------------------------------------------------------
BM_OuterishProd/64/1           6169        1608       +73.9%
BM_OuterishProd/64/4           7117        2412       +66.1%
BM_OuterishProd/64/32         17702       15616       +11.8%
BM_OuterishProd/128/1         45415        6498       +85.7%
BM_OuterishProd/128/4         44459        9786       +78.0%
BM_OuterishProd/128/64       110657      109489        +1.1%
BM_OuterishProd/256/1        265158       28101       +89.4%
BM_OuterishProd/256/4        274234      183885       +32.9%
BM_OuterishProd/256/128     1397160     1408776        -0.8%
BM_OuterishProd/512/1      78947048      520703       +99.3%
BM_OuterishProd/512/4      86955578     1349742       +98.4%
BM_OuterishProd/512/256    74701613    15584661       +79.1%
BM_OuterishProd/1k/1       78352601     3877911       +95.1%
BM_OuterishProd/1k/4       78521643     3966221       +94.9%
BM_OuterishProd/1k/512    258104736    89480530       +65.3%
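The commit message does not state how the Improvement column is computed. The figures are consistent with the relative reduction (Base - New) / Base, so a positive value means the new code is faster. A minimal sketch under that assumption, with the hypothetical helper name `improvement_percent`:

```cpp
#include <cmath>

// Assumed formula (inferred from the table, not stated in the commit):
// relative reduction of the new time with respect to the base time, in percent.
// Positive means faster, negative means a regression.
double improvement_percent(double base_ns, double new_ns) {
    return (base_ns - new_ns) / base_ns * 100.0;
}
```

For the first wall-time row, `improvement_percent(3088, 1610)` is about 47.9, and for the regressing row BM_OuterishProd/256/4, `improvement_percent(34312, 36818)` is about -7.3.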
|
(enabled by EIGEN_FAST_MATH), which causes the vectorized parts of the computation to return -0.0 instead of NaN for negative arguments.
Benchmark speed in Giga-sqrts/s
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
-----------------------------------------
                  SSE       AVX
Fast=1         2.529G    4.380G
Fast=0         1.944G    1.898G
Fast=1 fixed   2.214G    3.739G
This table illustrates the worst case in terms of speed impact: it was measured by repeatedly computing the sqrt of an n=4096 float vector that fits in L1 cache. For large vectors the operation becomes memory bound and the differences between the versions become almost negligible.
|
by helping it to remove dead code.
The trick is to get rid of the nested expression in the evaluator by copying only the required information (here, the strides).
|
* homogeneous
|
in the range [-pi,pi]. This also increases accuracy when q.w is negative.
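One way to obtain a bounded angle from a unit quaternion is to take the arctangent against |w| and flip the axis when w is negative, since q and -q encode the same rotation. The following is a standalone sketch of that idea, not Eigen's code: the struct `AngleAxisSketch` and the function `to_angle_axis` are hypothetical names, and the input is assumed to be normalized.

```cpp
#include <array>
#include <cmath>

// Hypothetical result type for this sketch (not Eigen's AngleAxis).
struct AngleAxisSketch {
    double angle;
    std::array<double, 3> axis;
};

// Assumption: (w, x, y, z) is a unit quaternion.
AngleAxisSketch to_angle_axis(double w, double x, double y, double z) {
    double n = std::sqrt(x * x + y * y + z * z);  // norm of the vector part
    if (n == 0.0) return {0.0, {1.0, 0.0, 0.0}};  // identity: axis is arbitrary
    // Using |w| keeps the angle in [0, pi], which lies inside [-pi, pi].
    // atan2 also stays accurate where acos(w) loses precision near |w| = 1.
    const double angle = 2.0 * std::atan2(n, std::fabs(w));
    if (w < 0.0) n = -n;  // q and -q are the same rotation: flip the axis
    return {angle, {x / n, y / n, z / n}};
}
```

For example, the quaternion (-cos(0.5), 0, 0, -sin(0.5)) yields an angle of 1 radian about +z, whereas a plain 2*acos(w) would give roughly 5.28 radians.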
|
cuda 8.0
|
Disabled MSVC level 4 warning C4714
|
(scalar*small).lazyProduct(small)
|
The level 4 warning (/W4) warns about functions marked as __forceinline not
inlined, and generates a lot of noise.
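A warning like this is typically silenced with a compiler-specific pragma. The sketch below shows the general pattern only; it is not the exact change made in this commit, and the macro `SKETCH_FORCE_INLINE` and the function `sketch_square` are hypothetical names used for illustration.

```cpp
// Sketch: disable MSVC warning C4714 (function marked as __forceinline not
// inlined) for the rest of this translation unit. Other compilers ignore
// this block because it is guarded by _MSC_VER.
#if defined(_MSC_VER)
  #pragma warning(disable : 4714)
  #define SKETCH_FORCE_INLINE __forceinline
#else
  #define SKETCH_FORCE_INLINE inline
#endif

// Hypothetical function; with the pragma above, MSVC no longer reports
// C4714 if it decides not to inline it.
SKETCH_FORCE_INLINE int sketch_square(int x) { return x * x; }
```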
|
std::complex<T> to be used when compiling a CUDA kernel. This is unfortunately necessary to be able to process complex numbers from a CUDA kernel on macOS.
|
is a constexpr while the latter isn't. This fixes compilation errors triggered by nvcc on Mac.
|
The index of the highest value in a LinSpace is size-1.
|