Commit message (Collapse) | Author | Age | |
---|---|---|---|
* | Fix some CUDA warnings. | Antonio Sanchez | 2021-02-24 |
| | Added `EIGEN_HAS_STD_HASH` macro, checking for C++11 support and not running on GPU. `std::hash<float>` is not a device function, so cannot be used by `std::hash<bfloat16>`. Removed `EIGEN_DEVICE_FUNC` and only define if `EIGEN_HAS_STD_HASH`. Same for `half`. Added `EIGEN_CUDA_HAS_FP16_ARITHMETIC` to improve readability, eliminate warnings about `EIGEN_CUDA_ARCH` not being defined. Replaced a couple C-style casts with `reinterpret_cast` for aligned loading of `half*` to `half2*`. This eliminates `-Wcast-align` warnings in clang. Although not ideal due to potential type aliasing, this is how CUDA handles these conversions internally. | ||
* | Fix for broken ROCm/HIP Support | Deven Desai | 2020-12-11 |
| | The following commit introduced a breakage in ROCm/HIP support for Eigen. https://gitlab.com/libeigen/eigen/-/commit/5ec4907434742d4555df4aa708b665868b88f3b4#1958e65719641efe5483abc4ce0b61806270f6f3_525_517 ``` Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20: In file included from /home/rocm-user/eigen/test/main.h:356: In file included from /home/rocm-user/eigen/Eigen/QR:11: In file included from /home/rocm-user/eigen/Eigen/Core:222: /home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'? return half2half2(from); ^~~~~~~~~~ __half2half2 /opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here __half2 __half2half2(__half x) ^ 1 error generated when compiling for gfx900. ``` The cause seems to be a copy-paste error, and the fix is trivial. | ||
* | Fix host/device calls for __half. | Antonio Sanchez | 2020-12-08 |
| | The previous code had `__host__ __device__` functions calling `__device__` functions (e.g. `__low2half`), which caused build failures in TensorFlow. Also tried to simplify the `#ifdef` guards to make them clearer. | ||
* | Clean up `#if`s in GPU PacketMath. | Antonio Sanchez | 2020-12-04 |
| | Removed redundant checks and redundant code for CUDA/HIP. Note: there are several issues here of calling `__device__` functions from `__host__ __device__` functions, in particular `__low2half`. We do not address that here -- only modifying this file enough to get our current tests to compile. Fixed: #1847 | ||
* | Fixing HIP breakage caused by the recent commit that introduces Packet4h2 as the Eigen::Half packet type | Deven Desai | 2020-03-12 |
| | |||
* | Remove duplicate pset1 for half and add some comments about why we need to expose pmul/add/div/min/max on host | Sami Kama | 2020-03-10 |
| | |||
* | Improve accuracy of fast approximate tanh and the logistic functions in ↵ | Rasmus Munk Larsen | 2019-12-16 |
| | Eigen, such that they preserve relative accuracy to within a few ULPs where their function values tend to zero (around x=0 for tanh, and for large negative x for the logistic function). This change re-instates the fast rational approximation of the logistic function for float32 in Eigen (removed in https://gitlab.com/libeigen/eigen/commit/66f07efeaed39d6a67005343d7e0caf7d9eeacdb), but uses the more accurate approximation 1/(1+exp(-x)) ~= exp(x) below -9. The exponential is only calculated on the vectorized path if at least one element in the SIMD input vector is less than -9. This change also contains a few improvements to speed up the original float specialization of logistic: - Introduce EIGEN_PREDICT_{FALSE,TRUE} for __builtin_expect and use it to predict that the logistic-only path is most likely (~2-3% speedup for the common case). - Carefully set the upper clipping point to the smallest x where the approximation evaluates to exactly 1. This saves the explicit clamping of the output (~7% speedup). The increased accuracy for tanh comes at a cost of 10-20% depending on instruction set. The benchmarks below repeat calls to u = v.logistic() (u = v.tanh(), respectively), where u and v are of type Eigen::ArrayXf, have length 8k, and v contains random numbers in [-1,1].
Benchmark numbers for logistic: Before: Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- SSE BM_eigen_logistic_float 4467 4468 155835 model_time: 4827 AVX BM_eigen_logistic_float 2347 2347 299135 model_time: 2926 AVX+FMA BM_eigen_logistic_float 1467 1467 476143 model_time: 2926 AVX512 BM_eigen_logistic_float 805 805 858696 model_time: 1463 After: Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- SSE BM_eigen_logistic_float 2589 2590 270264 model_time: 4827 AVX BM_eigen_logistic_float 1428 1428 489265 model_time: 2926 AVX+FMA BM_eigen_logistic_float 1059 1059 662255 model_time: 2926 AVX512 BM_eigen_logistic_float 673 673 1000000 model_time: 1463 Benchmark numbers for tanh: Before: Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- SSE BM_eigen_tanh_float 2391 2391 292624 model_time: 4242 AVX BM_eigen_tanh_float 1256 1256 554662 model_time: 2633 AVX+FMA BM_eigen_tanh_float 823 823 866267 model_time: 1609 AVX512 BM_eigen_tanh_float 443 443 1578999 model_time: 805 After: Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- SSE BM_eigen_tanh_float 2588 2588 273531 model_time: 4242 AVX BM_eigen_tanh_float 1536 1536 452321 model_time: 2633 AVX+FMA BM_eigen_tanh_float 1007 1007 694681 model_time: 1609 AVX512 BM_eigen_tanh_float 471 471 1472178 model_time: 805 | ||
* | Fix for HIP breakage detected on 191210 | Deven Desai | 2019-12-10 |
| | The following commit introduces compile errors when running eigen with hipcc https://gitlab.com/libeigen/eigen/commit/2918f85ba976dbfbf72f7d4c1961a577f5850148 hipcc errors out because it requires the device attribute on the methods within the TensorBlockV2ResourceRequirements struct introduced by the commit above. The fix is to add the device attribute to those methods. | ||
* | Add Bessel functions to SpecialFunctions. | Srinivas Vasudevan | 2019-09-14 |
| | - Split SpecialFunctions files into a separate BesselFunctions file. In particular add: - Modified Bessel functions of the second kind k0, k1, k0e, k1e - Bessel functions of the first kind j0, j1 - Bessel functions of the second kind y0, y1 | ||
* | Merging from eigen/eigen. | Srinivas Vasudevan | 2019-09-03 |
|\ | |||
* | | Add ndtri function, the inverse of the normal distribution function. | Srinivas Vasudevan | 2019-08-12 |
| | | |||
| * | Clean up float16 a.k.a. Eigen::half support in Eigen. Move the definition of ↵ | Rasmus Munk Larsen | 2019-08-27 |
|/ | | | | half to Core/arch/Default and move arch-specific packet ops to their respective sub-directories. | ||
* | Fix CUDA compilation error for pselect<half>. | Rasmus Munk Larsen | 2019-06-28 |
| | |||
* | [SYCL] This PR adds the minimum modifications to Eigen core required to run ↵ | Mehdi Goli | 2019-06-27 |
| | Eigen unsupported modules on devices supporting SYCL. * Adding SYCL memory model * Enabling/Disabling SYCL backend in Core * Supporting Vectorization | ||
* | fix for a ROCm/HIP specific compile error introduced by a recent commit. | Deven Desai | 2019-06-22 |
| | |||
* | Remove extra "one" in comment. | Rasmus Munk Larsen | 2019-06-20 |
| | |||
* | Update comment as suggested by tra@google.com. | Rasmus Munk Larsen | 2019-06-20 |
| | |||
* | Fix grammar. | Rasmus Munk Larsen | 2019-06-20 |
| | |||
* | Added comment explaining the surprising EIGEN_COMP_CLANG && !EIGEN_COMP_NVCC clause. | Rasmus Munk Larsen | 2019-06-20 |
| | |||
* | Fix CUDA build on Mac. | Rasmus Munk Larsen | 2019-06-20 |
| | |||
* | Various fixes for packet ops. | Rasmus Munk Larsen | 2019-06-20 |
| | 1. Fix buggy pcmp_eq and unit test for half types. 2. Add unit test for pselect and add specializations for SSE 4.1, AVX512, and half types. 3. Get rid of FIXME: Implement faster pnegate for half by XOR'ing with a sign bit mask. | ||
* | Clean up CUDA/NVCC version macros and their use in Eigen, and a few other CUDA build failures. | Rasmus Munk Larsen | 2019-05-31 |
| | |||
* | fix for HIP build errors that were introduced by a commit earlier this week | Deven Desai | 2019-05-24 |
| | |||
* | Make Eigen build with cuda 10 and clang. | Rasmus Munk Larsen | 2019-05-15 |
| | |||
* | Removing unused API to fix compile error in TensorFlow due to AVX512VL, AVX512BW usage | Anuj Rawat | 2019-05-12 |
| | |||
* | Fix AVX512 & GCC 6.3 compilation | Eugene Zhulenev | 2019-05-07 |
| | |||
* | Add masked_store_available to unpacket_traits | Eugene Zhulenev | 2019-05-02 |
| | |||
* | Add masked pstoreu for Packet16h | Eugene Zhulenev | 2019-05-02 |
| | |||
* | Adding lowlevel APIs for optimized RHS packet load in TensorFlow | Anuj Rawat | 2019-04-20 |
| | SpatialConvolution Low-level APIs are added in order to optimize packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for the scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also. This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below. Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim Speedup numbers are specified for number of threads 1, 2, 4, 8, 16. AVX512: Parameters | Speedup (Num of threads: 1, 2, 4, 8, 16) ----------------------------|------------------------------------------ 128, 24x24, 3, 64, 5x5 |2.18X, 2.13X, 1.73X, 1.64X, 1.66X 128, 24x24, 1, 64, 8x8 |2.00X, 1.98X, 1.93X, 1.91X, 1.91X 32, 24x24, 3, 64, 5x5 |2.26X, 2.14X, 2.17X, 2.22X, 2.33X 128, 24x24, 3, 64, 3x3 |1.51X, 1.45X, 1.45X, 1.67X, 1.57X 32, 14x14, 24, 64, 5x5 |1.21X, 1.19X, 1.16X, 1.70X, 1.17X 128, 128x128, 3, 96, 11x11 |2.17X, 2.18X, 2.19X, 2.20X, 2.18X AVX2: Parameters | Speedup (Num of threads: 1, 2, 4, 8, 16) ----------------------------|------------------------------------------ 128, 24x24, 3, 64, 5x5 | 1.66X, 1.65X, 1.61X, 1.56X, 1.49X 32, 24x24, 3, 64, 5x5 | 1.71X, 1.63X, 1.77X, 1.58X, 1.68X 128, 24x24, 1, 64, 5x5 | 1.44X, 1.40X, 1.38X, 1.37X, 1.33X 128, 24x24, 3, 64, 3x3 | 1.68X, 1.63X, 1.58X, 1.56X, 1.62X 128, 128x128, 3, 96, 11x11 | 1.36X, 1.36X, 1.37X, 1.37X, 1.37X In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores).
On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers: AVX512: Parameters | Runtime without patch (ns) | Runtime with patch (ns) | Speedup ---------------------------------------------------------------|----------------------------|-------------------------|--------- BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) | 41350 | 15073 | 2.74X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) | 7277 | 7341 | 0.99X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) | 8675 | 8681 | 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) | 24155 | 16079 | 1.50X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) | 25052 | 17152 | 1.46X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 18269 | 18345 | 1.00X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 19468 | 19872 | 0.98X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432) | 156060 | 42432 | 3.68X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) | 132701 | 36944 | 3.59X AVX2: Parameters | Runtime without patch (ns) | Runtime with patch (ns) | Speedup ---------------------------------------------------------------|----------------------------|-------------------------|--------- BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) | 26233 | 12393 | 2.12X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) | 6091 | 6062 | 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) | 7427 | 7408 | 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) | 23453 | 20826 | 1.13X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) | 23167 | 22091 | 1.09X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 23422 | 23682 | 0.99X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 23165 | 23663 | 0.98X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 
16, 3, 3, 1, 1, 36, 432) | 72689 | 44969 | 1.62X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) | 61732 | 39779 | 1.55X All benchmarks on Intel Skylake server with 8 cores. | ||
* | Updates requested in the PR feedback. Also dropping code within #ifdef EIGEN_HAS_OLD_HIP_FP16 | Deven Desai | 2019-03-19 |
| | |||
* | Merged eigen/eigen into default | Deven Desai | 2019-03-19 |
|\ | |||
| * | Clean up half packet traits and add a few more missing packet ops. | Rasmus Munk Larsen | 2019-03-14 |
| | | |||
| * | Clean up PacketMathHalf.h and add a few missing logical packet ops. | Rasmus Munk Larsen | 2019-03-11 |
| | | |||
| * | Add a few missing packet ops: cmp_eq for NEON. pfloor for GPU. | Rasmus Munk Larsen | 2019-02-21 |
| | | |||
| * | Fix conflicts and merge | Gael Guennebaud | 2019-01-30 |
| |\ | |||
| * | | Add missing logical packet ops for GPU and NEON. | Rasmus Munk Larsen | 2019-01-17 |
| | | | |||
| * | | Fix compilation error for logical packet ops with older compilers. | Rasmus Munk Larsen | 2019-01-16 |
| | | | |||
| * | | Fix warnings in ptrue for complex and half types. | Rasmus Munk Larsen | 2019-01-11 |
| | | | |||
| * | | Fix merge. | Rasmus Munk Larsen | 2019-01-11 |
| | | | |||
| * | | Merge. | Rasmus Munk Larsen | 2019-01-11 |
| |\ \ | |||
| * \ \ | Rename pones -> ptrue. Use _CMP_TRUE_UQ where appropriate. | Rasmus Munk Larsen | 2019-01-09 |
| |\ \ \ | |||
| | | * | | Collapsed revision | Rasmus Munk Larsen | 2019-01-09 |
| | | | | | * Add packet op "pones". Write pnot(a) as pxor(pones(a), a). * Collapsed revision * Simplify a bit. * Undo useless diffs. * Fix typo. | ||
| * | | | | Collapsed revision | Rasmus Munk Larsen | 2019-01-09 |
| | |/ / | |/| | | * Collapsed revision * Add packet op "pones". Write pnot(a) as pxor(pones(a), a). * Collapsed revision * Simplify a bit. * Undo useless diffs. * Fix typo. | ||
| | * | | Add packet op "pones". Write pnot(a) as pxor(pones(a), a). | Rasmus Munk Larsen | 2019-01-09 |
| |/ / | |||
| * | | Add support for pcmp_eq and pnot, including for complex types. | Rasmus Munk Larsen | 2019-01-07 |
| | | | |||
| | * | Introducing "vectorized" byte on unpacket_traits structs | Gustavo Lima Chaves | 2018-12-19 |
| |/ | This prepares for a change to gebp_traits, where a new template argument will be introduced to dictate the packet size, so it is no longer bound to the current/max packet size only. With packet types defined early in gebp_traits, the enum values defined in that class now have to act on packet types, not scalars. One approach for reaching the vectorizable/size properties needed there could be to get the packet's scalar again with unpacket_traits<>, then the size/Vectorizable enum entries from packet_traits<>. It turns out that guards like "#ifndef EIGEN_VECTORIZE_AVX512" in AVX/PacketMath.h hide smaller packet variations of packet_traits<> for some types (and it makes sense to keep that). In other words, one can't go back to the scalar and create a new PacketType, as this will always lead to the maximum packet type for the architecture. The least costly/invasive solution is thus to add the vectorizable info to every unpacket_traits struct as well. | ||
| * | bug #1636: fix compilation with some ABI versions. | Gael Guennebaud | 2018-12-06 |
| | | |||
| * | Several improvements regarding packet-bitwise operations: | Gael Guennebaud | 2018-11-30 |
| | | - add unit tests - optimize their AVX512f implementation - add missing implementations (half, Packet4f, ...) | ||
* | | ROCm/HIP specific fixes + updates | Deven Desai | 2018-11-19 |
|/ | 1. Eigen/src/Core/arch/GPU/Half.h Updating the HIPCC implementation of half so that it can be declared as a __shared__ variable 2. Eigen/src/Core/util/Macros.h, Eigen/src/Core/util/Memory.h introducing an EIGEN_USE_STD(func) macro that calls - std::func by default - ::func when eigen is being compiled with HIPCC This change was requested in the previous HIP PR (https://bitbucket.org/eigen/eigen/pull-requests/518/pr-with-hip-specific-fixes-for-the-eigen/diff) 3. unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h Removing EIGEN_DEVICE_FUNC attribute from pure virtual methods as it is not supported by HIPCC 4. unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h Disabling the template specializations of InnerMostDimReducer as they run into HIPCC link errors | ||
* | Collapsed revision (based on pull request PR-325) | Christian von Schultz | 2018-10-22 |
| | * Support compiling without IO streams. Add the preprocessor definition EIGEN_NO_IO, which, if defined, disables all use of the IO streams part of the standard library. |
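
The last entry above describes a preprocessor switch (EIGEN_NO_IO) that disables all use of IO streams. As a rough illustration of that kind of guard, here is a minimal sketch of a header-style snippet. It is not Eigen's implementation: the macro `DEMO_NO_IO`, the namespace `demo`, the type `Vec3`, and the helpers `sum` and `to_string` are all hypothetical names invented for this example.

```cpp
// Hypothetical sketch: DEMO_NO_IO stands in for a switch like EIGEN_NO_IO.
// None of these names come from Eigen.
#include <cassert>
#include <cstddef>

#ifndef DEMO_NO_IO
// Stream headers are pulled in only when IO support is allowed.
#include <ostream>
#include <sstream>
#include <string>
#endif

namespace demo {

// A tiny fixed-size vector whose core functionality needs no IO at all.
struct Vec3 {
  float data[3];

  float operator[](std::size_t i) const {
    assert(i < 3);  // bounds check in debug builds
    return data[i];
  }
};

// Core arithmetic: always available, with or without DEMO_NO_IO.
inline float sum(const Vec3& v) { return v[0] + v[1] + v[2]; }

#ifndef DEMO_NO_IO
// Stream insertion is compiled only when IO streams are permitted.
inline std::ostream& operator<<(std::ostream& os, const Vec3& v) {
  return os << v[0] << ' ' << v[1] << ' ' << v[2];
}

// Convenience helper built on top of the stream operator.
inline std::string to_string(const Vec3& v) {
  std::ostringstream oss;
  oss << v;
  return oss.str();
}
#endif  // DEMO_NO_IO

}  // namespace demo
```

Compiling with `-DDEMO_NO_IO` would leave `sum` usable while removing `operator<<` and `to_string` along with the stream includes, which is the general effect the commit message describes for targets where the IO streams library is unavailable or undesirable.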