Commit message | Author | Age | |
---|---|---|---|
* | Create the ability to disable the specialized gemm_pack_rhs in Eigen (only PPC) for TensorFlow | Chip Kerchner | 2021-06-30 |
| |||
* | Small cleanup: Get rid of the macros EIGEN_HAS_SINGLE_INSTRUCTION_CJMADD and CJMADD, which were effectively unused, apart from on x86, where the change results in identically performing code. | Rasmus Munk Larsen | 2021-06-24 |
| |||
* | Get rid of code duplication for conj_helper. For packets where LhsType=RhsType a single generic implementation suffices. For scalars, the generic implementation of pconj automatically forwards to numext::conj, so much of the existing specialization can be avoided. For mixed types we still need specializations. | Rasmus Munk Larsen | 2021-06-24 |
| |||
* | EIGEN_STRONG_INLINE was NOT inlining in some critical needed areas (6.6X slowdown) when used with TensorFlow. Changing to EIGEN_ALWAYS_INLINE where appropriate. | Chip-Kerchner | 2021-06-16 |
| |||
* | Add missing ppc pcmp_lt_or_nan<Packet8bf> | Antonio Sanchez | 2021-06-15 |
| | |||
* | Use bit_cast to create -0.0 for floating point types to avoid compiler optimization changing sign with -ffast-math enabled. | Rasmus Munk Larsen | 2021-06-11 |
| |||
* | Fix taking address of rvalue compiler issue with TensorFlow (plus other warnings). | Chip-Kerchner | 2021-04-21 |
| |||
* | Fix address of temporary object errors in clang11. | Chip Kerchner | 2021-04-02 |
| | This fixes the problem with taking the address of temporary objects which clang11 treats as errors. | ||
* | Fixed performance issues for complex VSX and P10 MMA in gebp_kernel (level 3). | Chip Kerchner | 2021-03-25 |
| | |||
* | Fix pround and add print | Chip Kerchner | 2021-03-15 |
| | |||
* | Make half/bfloat16 constructor take inputs by value, fix powerpc test. | Antonio Sanchez | 2021-02-27 |
| | Since `numeric_limits<half>::max_exponent` is a static inline constant, it cannot be directly passed by reference. This triggers a linker error in recent versions of `g++-powerpc64le`. Changing `half` to take inputs by value fixes this. Wrapping `max_exponent` with `int(...)` to make an addressable integer also fixes this and may help with other custom `Scalar` types down the road. Also eliminated some compile warnings for powerpc. | ||
* | Fix clang compile when no MMA flags are set. Simplify MMA compiler detection. | Chip-Kerchner | 2021-02-24 |
| | |||
* | Having forward template function declarations in a P10 file causes bad code in certain situations. | Chip-Kerchner | 2021-02-24 |
| |||
* | Fixes to support old and new versions of the compilers for built-ins. Cast to non-const when using vector_pair with certain built-ins. | Chip-Kerchner | 2021-02-24 |
| |||
* | Fix compilation errors with later versions of GCC and use of MMA. | Chip-Kerchner | 2021-02-22 |
| | |||
* | Fixed performance issues for VSX and P10 MMA in general_matrix_matrix_product | Chip Kerchner | 2021-02-17 |
| | |||
* | Updated pfrexp implementation. | Antonio Sanchez | 2021-02-17 |
| | The original implementation fails for 0, denormals, inf, and NaN. See #2150. | ||
* | Fix ldexp implementations. | Antonio Sanchez | 2021-02-10 |
| | The previous implementations produced garbage values if the exponent did not fit within the exponent bits. See #2131 for a complete discussion, and !375 for other possible implementations. Here we implement the 4-factor version. See `pldexp_impl` in `GenericPacketMathFunctions.h` for a full description. The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>` requires `por`. Left as a "TODO" is to delegate to a faster version if we know the exponent does fit within the exponent bits. Fixes #2131. | ||
* | Eliminate implicit conversions from float to double. | Antonio Sanchez | 2021-02-01 |
| | |||
* | Fix altivec packetmath. | Antonio Sanchez | 2021-01-28 |
| | Allows the altivec packetmath tests to pass. There were a few issues: `pstoreu` was missing MSQ on `_BIG_ENDIAN` systems; `cmp_*` didn't properly handle conversion of bool flags (0x7FC instead of 0xFFFF); `pfrexp` needed to set the `exponent` argument. Related to !370, #2128. cc: @ChipKerchner @pdrocaldeira. Tested on `_BIG_ENDIAN` running on QEMU with VSX. Couldn't figure out build flags to get it to work for little endian. | ||
* | Fix clang compilation for AltiVec from previous check-in | Chip Kerchner | 2021-01-28 |
| | |||
* | Fix sqrt, ldexp and frexp compilation errors. | Chip Kerchner | 2021-01-25 |
| | |||
* | Add support for dynamic dispatch of MMA instructions for POWER 10 | Pedro Caldeira | 2020-11-12 |
| | |||
* | Add missing functions for Packet8bf in Altivec architecture. | Pedro Caldeira | 2020-09-08 |
| | Including new tests for bfloat16 Packets. Fix prsqrt on GenericPacketMath. | ||
* | MatrixProduct enhancements: | Everton Constantino | 2020-09-02 |
| | Changes to Altivec/MatrixProduct: adapting code to gcc 10; generic code style and performance enhancements; adding PanelMode support; adding stride/offset support; enabling float64, `std::complex<float>` and `std::complex<double>`; fixing lack of symm_pack; enabling mixedtypes. Adding std::complex tests to blasutil. Adding an implementation of storePacketBlock when Incr != 1. | ||
* | Changing u/int8_t to un/signed char because clang does not understand it. | Everton Constantino | 2020-09-02 |
| | Implementing pcmp_eq to Packet8 and Packet16. | ||
* | Change Packet8s and Packet8us to use vector commands on Power for pmadd, pmul and psub. | Chip Kerchner | 2020-08-28 |
| |||
* | Add support for Bfloat16 to use vector instructions on Altivec architecture | Pedro Caldeira | 2020-08-10 |
| |||
* | Fix pscatter and pgather for Altivec Complex double | Pedro Caldeira | 2020-06-16 |
| | |||
* | Add pscatter for Packet16{u}c (int8) | Pedro Caldeira | 2020-05-20 |
| | |||
* | Vectorizing MMA packing. | Everton Constantino | 2020-05-19 |
| | Optimizing MMA kernel. Adding PacketBlock store to blas_data_mapper. | ||
* | Altivec template functions to better code reusability | Pedro Caldeira | 2020-05-11 |
| | |||
* | Remove unused packet op "palign". | Rasmus Munk Larsen | 2020-05-07 |
| | Clean up a compiler warning in c++03 mode in AVX512/Complex.h. | ||
* | Add support to vector instructions to Packet16uc and Packet16c | Pedro Caldeira | 2020-04-27 |
| | |||
* | Remove unused packet op "preduxp". | Rasmus Munk Larsen | 2020-04-23 |
| | |||
* | Add Packet8s and Packet8us to support signed/unsigned int16/short Altivec vector operations | Pedro Caldeira | 2020-04-21 |
| |||
* | Adhere to recommended load/store intrinsics for ppc64le | Everton Constantino | 2020-03-23 |
| | |||
* | Fixing float32's pround halfway criteria to match STL's criteria. | Everton Constantino | 2020-03-21 |
| | |||
* | Add shift_left<N> and shift_right<N> coefficient-wise unary Array functions | Joel Holdsworth | 2020-03-19 |
| | |||
* | Switching unpacket_traits<Packet4i> to vectorizable=true. | Everton Constantino | 2020-01-13 |
| | |||
* | Move implementation of vectorized error function erf() to SpecialFunctionsImpl.h. | Rasmus Munk Larsen | 2019-09-27 |
| |||
* | Add generic PacketMath implementation of the Error Function (erf). | Rasmus Munk Larsen | 2019-09-19 |
| | |||
* | Fix compilation without vector engine available (e.g., x86 with SSE disabled): | Gael Guennebaud | 2019-09-05 |
| | ppolevl is required by ndtri even for the scalar path. | ||
* | Fix debug macros in p{load,store}u | João P. L. de Carvalho | 2019-08-14 |
| | |||
* | Add missing pcmp_XX methods for double/Packet2d | João P. L. de Carvalho | 2019-08-14 |
| | This fixes an issue in unit-test packetmath_2 with pcmp_eq when it is compiled with clang. When pcmp_eq(Packet4f,Packet4f) is used instead of pcmp_eq(Packet2d,Packet2d), the unit-test does not pass due to NaN in the ref vector. | ||
* | Fix packed load/store for PowerPC's VSX | João P. L. de Carvalho | 2019-08-09 |
| | The vec_vsx_ld/vec_vsx_st builtins were wrongly used for aligned load/store. In fact, they perform unaligned memory access and, even when the address is 16-byte aligned, they are much slower (at least 2x) than their aligned counterparts. For double/Packet2d, vec_xl/vec_xst should be preferred over vec_ld/vec_st, although the latter works when cast to float/Packet4f. Silencing some weird warnings thrown by some GCC versions. Such warnings are not thrown by Clang. | ||
* | Fix offset argument of ploadu/pstoreu for Altivec | João P. L. de Carvalho | 2019-08-09 |
| | If no offset is given, then it should be zero. Also passes full address to vec_vsx_ld/st builtins. Removes useless _EIGEN_ALIGNED_PTR & _EIGEN_MASK_ALIGNMENT. Removes unnecessary casts. | ||
* | bug #1718: Add cast to successfully compile with clang on PowerPC | João P. L. de Carvalho | 2019-08-09 |
| | Ignoring -Wc11-extensions warnings thrown by clang at Altivec/PacketMath.h | ||
* | Add masked_store_available to unpacket_traits | Eugene Zhulenev | 2019-05-02 |
| | |||
* | Adding low-level APIs for optimized RHS packet load in TensorFlow SpatialConvolution | Anuj Rawat | 2019-04-20 |

Low-level APIs are added to optimize packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization targets the scenario where a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also.

This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below. Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim. Speedup numbers are specified for number of threads 1, 2, 4, 8, 16.

AVX512:

Parameters | 1 thread | 2 threads | 4 threads | 8 threads | 16 threads
---|---|---|---|---|---
128, 24x24, 3, 64, 5x5 | 2.18X | 2.13X | 1.73X | 1.64X | 1.66X
128, 24x24, 1, 64, 8x8 | 2.00X | 1.98X | 1.93X | 1.91X | 1.91X
32, 24x24, 3, 64, 5x5 | 2.26X | 2.14X | 2.17X | 2.22X | 2.33X
128, 24x24, 3, 64, 3x3 | 1.51X | 1.45X | 1.45X | 1.67X | 1.57X
32, 14x14, 24, 64, 5x5 | 1.21X | 1.19X | 1.16X | 1.70X | 1.17X
128, 128x128, 3, 96, 11x11 | 2.17X | 2.18X | 2.19X | 2.20X | 2.18X

AVX2:

Parameters | 1 thread | 2 threads | 4 threads | 8 threads | 16 threads
---|---|---|---|---|---
128, 24x24, 3, 64, 5x5 | 1.66X | 1.65X | 1.61X | 1.56X | 1.49X
32, 24x24, 3, 64, 5x5 | 1.71X | 1.63X | 1.77X | 1.58X | 1.68X
128, 24x24, 1, 64, 5x5 | 1.44X | 1.40X | 1.38X | 1.37X | 1.33X
128, 24x24, 3, 64, 3x3 | 1.68X | 1.63X | 1.58X | 1.56X | 1.62X
128, 128x128, 3, 96, 11x11 | 1.36X | 1.36X | 1.37X | 1.37X | 1.37X

In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores).

On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers:

AVX512:

Parameters | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---|---|---|---
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) | 41350 | 15073 | 2.74X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) | 7277 | 7341 | 0.99X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) | 8675 | 8681 | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) | 24155 | 16079 | 1.50X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) | 25052 | 17152 | 1.46X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 18269 | 18345 | 1.00X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 19468 | 19872 | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432) | 156060 | 42432 | 3.68X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) | 132701 | 36944 | 3.59X

AVX2:

Parameters | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---|---|---|---
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) | 26233 | 12393 | 2.12X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) | 6091 | 6062 | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) | 7427 | 7408 | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) | 23453 | 20826 | 1.13X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) | 23167 | 22091 | 1.09X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 23422 | 23682 | 0.99X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 23165 | 23663 | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432) | 72689 | 44969 | 1.62X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) | 61732 | 39779 | 1.55X

All benchmarks on Intel Skylake server with 8 cores.
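
The partial-packet idea in the last commit message (a packet that straddles two adjacent columns is read as two partial packets, which are then merged) can be illustrated with a scalar sketch. This is not Eigen's or TensorFlow's code: `Packet`, `kPacketSize`, `load_partial`, `merge_partial` and `load_split_packet` are hypothetical names made up for this illustration, a plain `std::array` stands in for a SIMD register such as Packet8f, and the matrix is assumed to be column-major with a fixed column stride.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-in for an 8-lane float packet (think Packet8f).
constexpr std::size_t kPacketSize = 8;
using Packet = std::array<float, kPacketSize>;

// Hypothetical "partial load": read the first n lanes from src and
// zero-fill the remaining lanes.
Packet load_partial(const float* src, std::size_t n) {
  assert(n <= kPacketSize);
  Packet p{};  // all lanes start at zero
  for (std::size_t i = 0; i < n; ++i) p[i] = src[i];
  return p;
}

// Merge two partial packets: keep the first n lanes of lo and fill the
// remaining lanes with the leading lanes of hi.
Packet merge_partial(const Packet& lo, const Packet& hi, std::size_t n) {
  assert(n <= kPacketSize);
  Packet out = lo;
  for (std::size_t i = n; i < kPacketSize; ++i) out[i] = hi[i - n];
  return out;
}

// Load one packet starting at (row, col) of a column-major matrix with
// `rows` valid rows per column and `stride` floats between column starts.
// If the packet runs past the end of column `col`, the rest is taken from
// the top of column `col + 1` (assumed to exist).
Packet load_split_packet(const float* data, std::size_t rows,
                         std::size_t stride, std::size_t row,
                         std::size_t col) {
  assert(row < rows);
  const std::size_t in_first = rows - row;  // lanes available in column col
  const float* first = data + col * stride + row;
  if (in_first >= kPacketSize) {
    return load_partial(first, kPacketSize);  // no split needed
  }
  const std::size_t in_second = kPacketSize - in_first;
  assert(in_second <= rows);  // this sketch handles a split across 2 columns only
  const float* second = data + (col + 1) * stride;
  const Packet lo = load_partial(first, in_first);
  const Packet hi = load_partial(second, in_second);
  return merge_partial(lo, hi, in_first);
}
```

For example, with 6 valid rows per column and a stride of 8, a packet starting at row 4 of column 0 takes its first 2 lanes from column 0 and its remaining 6 lanes from the top of column 1, skipping the 2 padding entries in between. A vectorized version would presumably replace the two loops with masked loads and a blend, but which instructions the actual patch uses is not stated in the commit message.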