path: root/Eigen/src/Core
Commit message | Author | Age
* Cleanup | Gael Guennebaud | 2018-11-30
* Fix pandnot order in AVX512 | Gael Guennebaud | 2018-11-30
* Extend the generic psin_float code to handle cosine and make SSE and AVX use it (-> this adds pcos for AVX) | Gael Guennebaud | 2018-11-30
* Disable gcc's fma workaround for gcc >= 8 (based on GEMM benchmarks) | Gael Guennebaud | 2018-11-28
* Same for pmax | Gael Guennebaud | 2018-11-28
* pmin/pmax on SSE: make sure to use AVX instructions when AVX is enabled, and disable the gcc workaround for fixed gcc versions | Gael Guennebaud | 2018-11-28
* bug #1630: fix linspaced when requesting a smaller packet size than the default one | Gael Guennebaud | 2018-11-28
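For context, "linspaced" here is the expression behind Eigen's user-facing LinSpaced API; a minimal usage sketch (the size and bounds below are assumed, not taken from the bug report):

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
  // 5 evenly spaced values from 0 to 1: 0, 0.25, 0.5, 0.75, 1
  Eigen::VectorXf v = Eigen::VectorXf::LinSpaced(5, 0.0f, 1.0f);
  std::cout << v.transpose() << std::endl;
  return 0;
}
```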
* Use explicit packet type in SSE/PacketMath pldexp | Eugene Zhulenev | 2018-11-27
* Do not read buffers out of bounds -- load only the 4 bytes we know exist here | Benoit Jacob | 2018-11-27
    Could also have done a vld1_lane_f32, but doing so here without the overhead of initializing the unused lane would have triggered use-of-uninitialized-value errors in tools such as ASan. Note that this code is sub-optimal before and after this change: we should be reading either 2 or 4 float32 values per load instruction (2 for ARM in-order cores with an affinity for 8-byte loads; 4 for ARM out-of-order cores able to dual-issue 16-byte load instructions with arithmetic instructions). Before and after this patch, we only load 4 bytes of useful data here (even if, before this patch, we were technically loading 8 and using only the first 4).
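The out-of-bounds concern above can be illustrated with a small stand-alone sketch; this is not the actual patch, and the helper name is hypothetical. The point is simply that when only one float is known to be valid, loading that scalar and broadcasting it keeps every lane initialized without reading past the buffer:

```cpp
#include <arm_neon.h>

// Reads exactly 4 bytes from 'ptr'; both lanes of the result hold the same
// well-defined value, so sanitizers see no uninitialized or out-of-bounds data.
static inline float32x2_t load_single_float_safely(const float* ptr) {
  return vdup_n_f32(*ptr);
}
```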
* bug #1631: fix compilation with ARM NEON and clang, and clean up the weird pshiftright_and_cast and pcast_and_shiftleft functions | Gael Guennebaud | 2018-11-27
* Update pshiftleft to pass the shift as a true compile-time integer | Gael Guennebaud | 2018-11-27
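As a rough illustration of what passing the shift as a true compile-time integer enables (hypothetical helper, not Eigen's pshiftleft signature): when the shift amount is a template parameter, SIMD back-ends can use immediate-operand shift instructions.

```cpp
#include <immintrin.h>

// Shift each 32-bit lane left by N, where N is known at compile time.
template <int N>
__m128i shift_left_epi32(__m128i a) {
  return _mm_slli_epi32(a, N);  // N is a constant expression here
}
```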
* Unify SSE/AVX psin functions | Gael Guennebaud | 2018-11-27
    It is based on the SSE version, which is much more accurate, though very slightly slower. This changeset also includes the following required changes:
    - add packet-float to packet-int type traits
    - add packet float<->int reinterpret casts
    - add faster pselect for AVX based on blendv
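A minimal sketch of a blendv-based select on AVX (free-standing helper with an assumed mask convention, not Eigen's exact pselect):

```cpp
#include <immintrin.h>

// Returns a[i] where mask[i] has its sign bit set, b[i] otherwise.
static inline __m256 select_ps(__m256 mask, __m256 a, __m256 b) {
  return _mm256_blendv_ps(b, a, mask);
}
```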
* Fix the build on 64-bit ARM when NEON is disabled | Benoit Jacob | 2018-11-27
* Unify Altivec/VSX pexp(double) with the default implementation | Gael Guennebaud | 2018-11-27
* Cleanup | Gael Guennebaud | 2018-11-26
* Unify SSE and AVX pexp for double | Gael Guennebaud | 2018-11-26
* Unify NEON's pexp with the generic implementation | Gael Guennebaud | 2018-11-26
* Unify Altivec/VSX's pexp with the generic implementation | Gael Guennebaud | 2018-11-26
* Unify SSE and AVX implementations of pexp | Gael Guennebaud | 2018-11-26
* Unify Altivec/VSX's plog with the generic implementation, and enable it! | Gael Guennebaud | 2018-11-26
* Unify NEON's plog with the generic implementation | Gael Guennebaud | 2018-11-26
* First step toward a unification of packet log implementations; currently only SSE and AVX are unified | Gael Guennebaud | 2018-11-26
    To this end, the following functions were added: pzero, pcmp_*, pfrexp, pset1frombits.
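As an illustration of the kind of primitive added here, a "set1 from raw bits" helper on SSE might look like the sketch below (hypothetical name and signature, assuming a 32-bit bit pattern):

```cpp
#include <immintrin.h>
#include <cstdint>

// Broadcast a raw 32-bit pattern to all lanes, then reinterpret it as floats.
static inline __m128 set1_from_bits(uint32_t bits) {
  return _mm_castsi128_ps(_mm_set1_epi32(static_cast<int>(bits)));
}
```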
* Make SSE/AVX pandnot(A,B) consistent with the generic version, i.e., "A and not B" | Gael Guennebaud | 2018-11-26
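The subtlety behind this consistency fix is worth a sketch (illustrative helper, not Eigen's code): SSE's andnot intrinsic computes (~a) & b, so the operands must be swapped to obtain the generic convention pandnot(A,B) = A & ~B.

```cpp
#include <immintrin.h>

// Generic convention: pandnot(A, B) == A & ~B.
static inline __m128 pandnot_sketch(__m128 A, __m128 B) {
  return _mm_andnot_ps(B, A);  // _mm_andnot_ps(x, y) computes (~x) & y
}
```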
* bug #1611: fix plog(0) on NEON | Gael Guennebaud | 2018-11-26
* Fix typos | Patrik Huber | 2018-11-23
* Fix reserved usage of double __ in macro names | Gael Guennebaud | 2018-11-23
* bug #1624: improve matrix-matrix product on ARM 64, 20% speedup | Gael Guennebaud | 2018-11-23
* Make MaxPacketSize a true upper bound, even for fixed-size inputs | Gael Guennebaud | 2018-11-16
* PR 544: Set requestedAlignment correctly for SliceVectorizedTraversals | Mark D Ryan | 2018-11-13
    Commit aa110e681b8b2237757a652ba47da49e1fbd2cd6 optimised the multiplication of small dynamically sized matrices by restricting the packet size to a maximum of 4, increasing the chances that SIMD instructions are used in the computation. However, it introduced a mismatch between the packet size and the requestedAlignment. This mismatch can lead to crashes when the destination is not aligned. This patch fixes the issue by ensuring that the AssignmentTraits are correctly computed when using a restricted packet size.
    * * *
    Bind LinearPacketType to MaxPacketSize: this commit applies any packet size limit specified when instantiating copy_using_evaluator_traits to the LinearPacketType, provided that the size of the destination is not known at compile time.
    * * *
    Add unit test for restricted packet assignment: a new unit test is added to check that multiplication of small dynamically sized matrices works correctly when the packet size is restricted to 4 and the destination is unaligned.
* Fix typo in comment on EIGEN_MAX_STATIC_ALIGN_BYTES | Nikolaus Demmel | 2018-11-14
* Typo | Gael Guennebaud | 2018-11-14
* Help doxygen link to DenseBase::NullaryExpr | Gael Guennebaud | 2018-11-14
* [PATCH 1/2] Misc. typos | luz.paz | 2018-09-18
    From 68d431b4c14ad60a778ee93c1f59ecc4b931950e Mon Sep 17 00:00:00 2001
    Found via `codespell -q 3 -I ../eigen-word-whitelist.txt` where the whitelist consists of:
        als ans cas dum lastr lowd nd overfl pres preverse substraction te uint whch
    ---
     CMakeLists.txt                                 | 26 +++++++++----------
     Eigen/src/Core/GenericPacketMath.h             |  2 +-
     Eigen/src/SparseLU/SparseLU.h                  |  2 +-
     bench/bench_norm.cpp                           |  2 +-
     doc/HiPerformance.dox                          |  2 +-
     doc/QuickStartGuide.dox                        |  2 +-
     .../Eigen/CXX11/src/Tensor/TensorChipping.h    |  6 ++---
     .../Eigen/CXX11/src/Tensor/TensorDeviceGpu.h   |  2 +-
     .../src/Tensor/TensorForwardDeclarations.h     |  4 +--
     .../src/Tensor/TensorGpuHipCudaDefines.h       |  2 +-
     .../Eigen/CXX11/src/Tensor/TensorReduction.h   |  2 +-
     .../CXX11/src/Tensor/TensorReductionGpu.h      |  2 +-
     .../test/cxx11_tensor_concatenation.cpp        |  2 +-
     unsupported/test/cxx11_tensor_executor.cpp     |  2 +-
     14 files changed, 29 insertions(+), 29 deletions(-)
* Add optimized version of logistic function for float | Rasmus Munk Larsen | 2018-11-12
    As an example, this is about 50% faster than the existing version on Haswell using AVX.
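For reference, the scalar function being vectorized here is the logistic (sigmoid) function f(x) = 1 / (1 + exp(-x)); the sketch below is a plain scalar reference, not the optimized packet implementation:

```cpp
#include <cmath>

float logistic_ref(float x) {
  return 1.0f / (1.0f + std::exp(-x));
}
```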
* Fix warning in C++03 | Gael Guennebaud | 2018-11-10
* bug #1619: fix mixing of const and non-const generic iterators | Gael Guennebaud | 2018-11-09
* bug #1619: make const and non-const iterators compatible | Gael Guennebaud | 2018-11-09
* Let doxygen see lastN | Gael Guennebaud | 2018-11-09
* Recent xcode versions do support EIGEN_HAS_STATIC_ARRAY_TEMPLATE | Gael Guennebaud | 2018-11-09
* Fix max-size in indexed-view | Gael Guennebaud | 2018-11-08
* Merged in glchaves/eigen (pull request PR-539) | Gael Guennebaud | 2018-11-07
    Vectorize row-by-row gebp loop iterations on 16 packets as well
* Vectorize row-by-row gebp loop iterations on 16 packets as well | Gustavo Lima Chaves | 2018-11-06
    Signed-off-by: Gustavo Lima Chaves <gustavo.lima.chaves@intel.com>
    Signed-off-by: Mark D. Ryan <mark.d.ryan@intel.com>
* PR 526: Speed up multiplication of small, dynamically sized matrices | Mark D Ryan | 2018-10-12
    The Packet16f, Packet8f and Packet8d types are too large to use with dynamically sized matrices typically processed by the SliceVectorizedTraversal specialization of the dense_assignment_loop. Using these types is likely to lead to little or no vectorization. Significant slowdown in the multiplication of these small matrices can be observed when building with AVX and AVX512 enabled.
    This patch introduces a new dense_assignment_kernel that is used when computing small products whose operands have dynamic dimensions. It ensures that the PacketSize used is no larger than 4, thereby increasing the chance that vectorized instructions will be used when computing the product.
    I tested all 969 possible combinations of M, K, and N that are handled by the dense_assignment_loop on x86 builds. Although a few combinations are slowed down by this patch, they are far outnumbered by the cases that are sped up, as the following results demonstrate.
    Disabling Packet8d on AVX512 builds:
        Total cases: 969 (better: 511, worse: 85, same: 373)
        Max improvement: 169.00% (4 8 6); max degradation: 36.50% (8 5 3)
        Median improvement: 35.46%; median degradation: 17.41%
        Total FLOPs improvement: 19.42%
    Disabling Packet16f and Packet8f on AVX512 builds:
        Total cases: 969 (better: 658, worse: 5, same: 306)
        Max improvement: 214.05% (8 6 5); max degradation: 22.26% (16 2 1)
        Median improvement: 60.05%; median degradation: 13.32%
        Total FLOPs improvement: 59.58%
    Disabling Packet8f on AVX builds:
        Total cases: 969 (better: 663, worse: 96, same: 210)
        Max improvement: 155.29% (4 10 5); max degradation: 35.12% (8 3 2)
        Median improvement: 34.28%; median degradation: 15.05%
        Total FLOPs improvement: 26.02%
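The scenario this change targets can be reproduced with a product of small, dynamically sized matrices; the sizes below are assumed for illustration (the benchmark swept all M, K, N combinations):

```cpp
#include <Eigen/Dense>

int main() {
  // Dimensions are only known at run time, so SliceVectorizedTraversal applies.
  Eigen::MatrixXf a = Eigen::MatrixXf::Random(4, 8);
  Eigen::MatrixXf b = Eigen::MatrixXf::Random(8, 6);
  Eigen::MatrixXf c = a * b;
  return c.size() == 4 * 6 ? 0 : 1;
}
```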
* Fix code format | Eugene Zhulenev | 2018-11-02
* Workaround nvcc+msvc compiler bug | Eugene Zhulenev | 2018-11-02
* bug #1617: Fix SolveTriangular.solveInPlace crashing for empty matrix | Matthieu Vigne | 2018-10-31
    This made FullPivLU.kernel() crash when used on the zero matrix. Add unit test for FullPivLU.kernel() on the zero matrix.
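A sketch of the scenario covered by the new unit test (the 3x3 size is assumed): the kernel of the zero matrix spans the whole space, and computing it must not crash on the empty triangular solve performed internally.

```cpp
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd zero = Eigen::MatrixXd::Zero(3, 3);
  Eigen::FullPivLU<Eigen::MatrixXd> lu(zero);
  Eigen::MatrixXd k = lu.kernel();  // previously crashed via solveInPlace
  return (k.cols() == 3) ? 0 : 1;   // rank 0, so the kernel has full dimension
}
```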
* bug #1618: Use different power-of-2 check to avoid MSVC warning | Christoph Hertzberg | 2018-11-01
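For context, the usual branch-free power-of-2 test looks like the sketch below; this shows the general technique only, not necessarily the exact expression the commit switched to:

```cpp
#include <cstddef>

constexpr bool is_power_of_two(std::size_t n) {
  return n != 0 && (n & (n - 1)) == 0;
}
```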
* Collapsed revision (based on pull request PR-325) | Christian von Schultz | 2018-10-22
    Support compiling without IO streams: add the preprocessor definition EIGEN_NO_IO which, if defined, disables all use of the IO streams part of the standard library.
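A hedged usage sketch: defining EIGEN_NO_IO before including any Eigen header (or passing -DEIGEN_NO_IO on the compiler command line) should make Eigen avoid its IO-stream-based code paths.

```cpp
#define EIGEN_NO_IO
#include <Eigen/Core>

int main() {
  Eigen::Matrix3f m = Eigen::Matrix3f::Identity();
  return static_cast<int>(m(0, 0)) - 1;  // no stream-based printing used
}
```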
* Do not rely on the compiler generating __device__ functions for constexpr in Cuda (via EIGEN_CONSTEXPR_ARE_DEVICE_FUNC) | Rasmus Munk Larsen | 2018-10-22
    This breaks several targets in the TensorFlow Cuda build, e.g.:
    INFO: From Compiling tensorflow/core/kernels/maxpooling_op_gpu.cu.cc:
    /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float>::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNHWC< ::Eigen::half>") is not allowed
    /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float>::operator () const" is undefined in device code
    /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float>::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNCHW< ::Eigen::half>") is not allowed
    /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float>::operator () const" is undefined in device code
    4 errors detected in the compilation of "/tmp/tmpxft_00000011_00000000-6_maxpooling_op_gpu.cu.cpp1.ii".
    ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: output 'tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o' was not created
    ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: Couldn't build file tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o: not all outputs were created or valid
* Merged in rmlarsen/eigen (pull request PR-532) | Rasmus Munk Larsen | 2018-10-19
    Only set EIGEN_CONSTEXPR_ARE_DEVICE_FUNC for clang++ if cxx_relaxed_constexpr is available.