path: root/unsupported/Eigen/CXX11/src/Tensor/TensorContractionThreadPool.h
Commit message / Author / Age
* Add recursive work splitting to EvalShardedByInnerDimContext (Eugene Zhulenev, 2019-12-05)
|
* Add beta to TensorContractionKernel and make memset optional (Eugene Zhulenev, 2019-10-02)
|
* Fix a bug in a packed block type in TensorContractionThreadPool (Eugene Zhulenev, 2019-09-24)
|
* Fix (or mask away) conversion warnings introduced in 553caeb6a3bb545aef895f8fc9f219be44679017 (Christoph Hertzberg, 2019-09-23)
|
* Fix maybe-uninitialized warnings in TensorContractionThreadPool (Eugene Zhulenev, 2019-09-13)
|
* Use ThreadLocal container in TensorContractionThreadPool (Eugene Zhulenev, 2019-09-13)
|
* Fix shadow warnings in TensorContractionThreadPool (Eugene Zhulenev, 2019-08-30)
|
* evalSubExprsIfNeededAsync + async TensorContractionThreadPool (Eugene Zhulenev, 2019-08-30)
|
* Remove XSMM support from Tensor module (Eugene Zhulenev, 2019-08-19)
|
* Remove deprecation annotation from typedef Eigen::Index Index, as it would generate too many build warnings (Rasmus Munk Larsen, 2019-04-24)
|
* Tweak cost model for tensor contraction when parallelizing over the inner dimension: https://bitbucket.org/snippets/rmlarsen/MexxLo (Rasmus Munk Larsen, 2019-04-12)
|
* Add support for custom packed Lhs/Rhs blocks in tensor contractions (Eugene Zhulenev, 2019-04-01)
|
* Tune tensor contraction threadpool heuristics (Eugene Zhulenev, 2019-03-05)
|
* Don't do parallel_pack if we can use thread_local memory in tensor contractions (Eugene Zhulenev, 2019-02-07)
|
* Do not reduce parallelism too much in contractions with small number of threads (Eugene Zhulenev, 2019-02-04)
|
* Parallelize tensor contraction only by sharding dimension and use 'thread-local' memory for packing (Eugene Zhulenev, 2019-02-04)
|
* Fix shorten-64-to-32 warning in TensorContractionThreadPool (Eugene Zhulenev, 2019-01-11)
|
* Fix shorten-64-to-32 warning in TensorContractionThreadPool (Eugene Zhulenev, 2019-01-10)
|
* Optimize evalShardedByInnerDim (Eugene Zhulenev, 2019-01-08)
|
* Fix evalShardedByInnerDim for AVX512 builds (Mark D Ryan, 2018-12-05)
|     evalShardedByInnerDim ensures that the values it passes for start_k and end_k to
|     evalGemmPartialWithoutOutputKernel are multiples of 8, as the kernel does not work
|     correctly when the values of k are not multiples of the packet_size. While this
|     precaution works for AVX builds, it is insufficient for AVX512 builds, where the
|     maximum packet size is 16. The result is slightly incorrect float32 contractions
|     on AVX512 builds. This commit fixes the problem by ensuring that k is always a
|     multiple of the packet_size if the packet_size is > 8.
|
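The fix described above amounts to rounding k split points down to a multiple of the SIMD packet size. A minimal sketch of that rounding (function name and values are illustrative, not Eigen's actual code):

```python
def round_down_to_packet_multiple(k, packet_size):
    """Round a contraction split point k down to a multiple of the packet size."""
    return (k // packet_size) * packet_size

# k = 104 is already a multiple of 8, so rounding for AVX (8 floats per
# packet) leaves it unchanged, but AVX512 (16 floats per packet) must
# round it down to 96 - exactly the case the original code missed.
```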
* Fix conversion warning ... again (Christoph Hertzberg, 2018-10-02)
|
* Fix a few warnings and rename a variable to not shadow "last" (Rasmus Munk Larsen, 2018-09-28)
|
* Merged in ezhulenev/eigen-01 (pull request PR-514): add tests for evalShardedByInnerDim contraction + fix bugs (Rasmus Munk Larsen, 2018-09-28)
|\
| * Add tests for evalShardedByInnerDim contraction + fix bugs (Eugene Zhulenev, 2018-09-28)
| |
* | Fix integer conversion warnings (Christoph Hertzberg, 2018-09-28)
|/
* Merge with eigen/eigen default (Eugene Zhulenev, 2018-09-27)
|\
* | Remove explicit mkldnn support and redundant TensorContractionKernelBlocking (Eugene Zhulenev, 2018-09-27)
| |
| * Remove "false &&" left over from test (Rasmus Munk Larsen, 2018-09-26)
| |
| * Parallelize tensor contraction over the inner dimension in cases where one or both of the outer dimensions (m and n) are small but k is large (Rasmus Munk Larsen, 2018-09-26)
| |     This speeds up individual matmul microbenchmarks by up to 85%. Naming below is
| |     BM_Matmul_M_K_N_THREADS, measured on a 2-socket Intel Broadwell-based server.
| |
| |     Benchmark                  Base (ns)  New (ns)  Improvement
| |     ------------------------------------------------------------------
| |     BM_Matmul_1_80_13522_1        387457    396013        -2.2%
| |     BM_Matmul_1_80_13522_2        406487    230789       +43.2%
| |     BM_Matmul_1_80_13522_4        395821    123211       +68.9%
| |     BM_Matmul_1_80_13522_6        391625     97002       +75.2%
| |     BM_Matmul_1_80_13522_8        408986    113828       +72.2%
| |     BM_Matmul_1_80_13522_16       399988     67600       +83.1%
| |     BM_Matmul_1_80_13522_22       411546     60044       +85.4%
| |     BM_Matmul_1_80_13522_32       393528     57312       +85.4%
| |     BM_Matmul_1_80_13522_44       390047     63525       +83.7%
| |     BM_Matmul_1_80_13522_88       387876     63592       +83.6%
| |     BM_Matmul_1_1500_500_1        245359    248119        -1.1%
| |     BM_Matmul_1_1500_500_2        401833    143271       +64.3%
| |     BM_Matmul_1_1500_500_4        210519    100231       +52.4%
| |     BM_Matmul_1_1500_500_6        251582     86575       +65.6%
| |     BM_Matmul_1_1500_500_8        211499     80444       +62.0%
| |     BM_Matmul_3_250_512_1          70297     68551        +2.5%
| |     BM_Matmul_3_250_512_2          70141     52450       +25.2%
| |     BM_Matmul_3_250_512_4          67872     58204       +14.2%
| |     BM_Matmul_3_250_512_6          71378     63340       +11.3%
| |     BM_Matmul_3_250_512_8          69595     41652       +40.2%
| |     BM_Matmul_3_250_512_16         72055     42549       +40.9%
| |     BM_Matmul_3_250_512_22         70158     54023       +23.0%
| |     BM_Matmul_3_250_512_32         71541     56042       +21.7%
| |     BM_Matmul_3_250_512_44         71843     57019       +20.6%
| |     BM_Matmul_3_250_512_88         69951     54045       +22.7%
| |     BM_Matmul_3_1500_512_1        369328    374284        -1.4%
| |     BM_Matmul_3_1500_512_2        428656    223603       +47.8%
| |     BM_Matmul_3_1500_512_4        205599    139508       +32.1%
| |     BM_Matmul_3_1500_512_6        214278    139071       +35.1%
| |     BM_Matmul_3_1500_512_8        184149    142338       +22.7%
| |     BM_Matmul_3_1500_512_16       156462    156983        -0.3%
| |     BM_Matmul_3_1500_512_22       163905    158259        +3.4%
| |     BM_Matmul_3_1500_512_32       155314    157662        -1.5%
| |     BM_Matmul_3_1500_512_44       235434    158657       +32.6%
| |     BM_Matmul_3_1500_512_88       156779    160275        -2.2%
| |     BM_Matmul_1500_4_512_1        363358    349528        +3.8%
| |     BM_Matmul_1500_4_512_2        303134    263319       +13.1%
| |     BM_Matmul_1500_4_512_4        176208    130086       +26.2%
| |     BM_Matmul_1500_4_512_6        148026    115449       +22.0%
| |     BM_Matmul_1500_4_512_8        131656     98421       +25.2%
| |     BM_Matmul_1500_4_512_16       134011     82861       +38.2%
| |     BM_Matmul_1500_4_512_22       134950     85685       +36.5%
| |     BM_Matmul_1500_4_512_32       133165     90081       +32.4%
| |     BM_Matmul_1500_4_512_44       133203     90644       +32.0%
| |     BM_Matmul_1500_4_512_88       134106    100566       +25.0%
| |     BM_Matmul_4_1500_512_1        439243    435058        +1.0%
| |     BM_Matmul_4_1500_512_2        451830    257032       +43.1%
| |     BM_Matmul_4_1500_512_4        276434    164513       +40.5%
| |     BM_Matmul_4_1500_512_6        182542    144827       +20.7%
| |     BM_Matmul_4_1500_512_8        179411    166256        +7.3%
| |     BM_Matmul_4_1500_512_16       158101    155560        +1.6%
| |     BM_Matmul_4_1500_512_22       152435    155448        -1.9%
| |     BM_Matmul_4_1500_512_32       155150    149538        +3.6%
| |     BM_Matmul_4_1500_512_44       193842    149777       +22.7%
| |     BM_Matmul_4_1500_512_88       149544    154468        -3.3%
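The Improvement column in the table above is the relative time reduction, (Base - New) / Base. A quick sketch for checking a row (illustrative helper, not part of the benchmark harness):

```python
def improvement_pct(base_ns, new_ns):
    """Relative improvement as in the table: (Base - New) / Base, in percent."""
    return round((base_ns - new_ns) / base_ns * 100, 1)

improvement_pct(406487, 230789)  # BM_Matmul_1_80_13522_2: +43.2
```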
* | Support multiple contraction kernel types in TensorContractionThreadPool (Eugene Zhulenev, 2018-09-26)
|/
* Fix regression introduced by the previous fix for AVX512 (Gael Guennebaud, 2018-09-20)
|     It broke the complex-complex case on SSE.
|
* Merged in yuefengz/eigen (pull request PR-370): use device's allocate function instead of internal::aligned_malloc (Benoit Steiner, 2018-07-31)
|\
* | Reduce the number of template specializations of classes related to tensor contraction to reduce binary size (Rasmus Munk Larsen, 2018-07-27)
| |
* | Reduce number of allocations in TensorContractionThreadPool (Eugene Zhulenev, 2018-07-16)
| |
| * Use device's allocate function instead of internal::aligned_malloc. This would make it easier to track memory usage in device instances. (Yuefeng Zhou, 2018-02-20)
| |
* | Remove SimpleThreadPool and always use {NonBlocking}ThreadPool (Eugene Zhulenev, 2018-07-16)
| |
* | Fuse computations into the Tensor contractions using output kernel (Eugene Zhulenev, 2018-07-10)
| |
* | Fix typos found using codespell (Gael Guennebaud, 2018-06-07)
|/
* Added support for libxsmm kernel in multithreaded contractions (Benoit Steiner, 2016-12-21)
|
* Use signed integers more consistently to encode the number of threads to use to evaluate a tensor expression (Benoit Steiner, 2016-06-09)
|
* Fixed some compilation warnings (Benoit Steiner, 2016-05-26)
|
* Merged in rmlarsen/eigen (pull request PR-188): minor cleanups, getting rid of a few unused variables and the last uses of EIGEN_USE_COST_MODEL (Benoit Steiner, 2016-05-23)
|\
* | Fix some sign-compare warnings (Christoph Hertzberg, 2016-05-22)
| |
| * Minor cleanups: 1. Get rid of unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL. (Rasmus Munk Larsen, 2016-05-18)
|/
* Turn on the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define. (Benoit Steiner, 2016-05-13)
|
* New multithreaded contraction that doesn't rely on the thread pool running the closures in the order in which they are enqueued. This is needed in order to switch to the new non-blocking thread pool, since that pool can execute the closures in any order. (Benoit Steiner, 2016-05-13)
|
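The constraint described in the entry above, that correctness must not depend on the order in which the pool runs the closures, can be illustrated with a small sketch (Python stand-in, not Eigen's actual code): partial results over shards of the inner dimension are combined under a lock with a commutative reduction, so any execution order gives the same total.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

def sharded_dot(a, b, num_shards=4):
    """Dot product sharded over the inner dimension; tasks may run in any order."""
    n = len(a)
    total = 0.0
    lock = Lock()

    def work(shard):
        nonlocal total
        start = shard * n // num_shards
        end = (shard + 1) * n // num_shards
        partial = sum(a[i] * b[i] for i in range(start, end))
        with lock:
            total += partial  # commutative combine: order of arrival is irrelevant

    # The executor makes no ordering guarantee, matching the constraint above.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        for s in range(num_shards):
            pool.submit(work, s)
    return total
```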
* Replace implicit cast with an explicit one (Benoit Steiner, 2016-05-12)
|
* Added tests for full contractions using thread pools and gpu devices. Fixed a couple of issues in the corresponding code. (Benoit Steiner, 2016-05-05)
|
* Replace std::vector with our own implementation, as using the stl when compiling with nvcc and avx enabled leads to many issues (Benoit Steiner, 2016-03-08)
|
* Fixed the tensor chipping code (Benoit Steiner, 2016-03-08)
|