path: root/unsupported/Eigen/CXX11/src/Tensor/TensorContraction.h
Commit message                                                Author                Age
...
| * Adds a fast memcpy function to Eigen. This takes advantage of the following:  Rasmus Munk Larsen  2017-01-24
| |   1. For small fixed sizes, the compiler generates inline code for memcpy, which is much faster.
| |   2. My colleague eriche at googl dot com discovered that for large sizes, memmove is significantly faster than memcpy (at least on Linux with GCC or Clang). See benchmark numbers measured on a Haswell (HP Z440) workstation here: https://docs.google.com/a/google.com/spreadsheets/d/1jLs5bKzXwhpTySw65MhG1pZpsIwkszZqQTjwrd_n0ic/pubhtml
| |
| |   This is of course surprising, since memcpy is a less constrained version of memmove. This stackoverflow thread contains some speculation as to the causes: http://stackoverflow.com/questions/22793669/poor-memcpy-performance-on-linux
| |
| |   Below are numbers for copying and slicing tensors using the multithreaded TensorDevice. The numbers show significant improvements for memcpy of very small blocks and for memcpy of large blocks single threaded (we were already able to saturate memory bandwidth for >1 threads before on large blocks). The "slicingSmallPieces" benchmark also shows small consistent improvements, since memcpy cost is a fair portion of that particular computation. The benchmarks operate on NxN matrices, and the names are of the form BM_$OP_${NUMTHREADS}T/${N}.
| |   Measured improvements in wall clock time:
| |
| |   Run on rmlarsen3.mtv (12 X 3501 MHz CPUs); 2017-01-20T11:26:31.493023454-08:00
| |   CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
| |
| |   Benchmark                        Base (ns)   New (ns)  Improvement
| |   ------------------------------------------------------------------
| |   BM_memcpy_1T/2                        3.48       2.39       +31.3%
| |   BM_memcpy_1T/8                        12.3       6.51       +47.0%
| |   BM_memcpy_1T/64                        371        383        -3.2%
| |   BM_memcpy_1T/512                     66922      66720        +0.3%
| |   BM_memcpy_1T/4k                    9892867    6849682       +30.8%
| |   BM_memcpy_1T/5k                   14951099   10332856       +30.9%
| |   BM_memcpy_2T/2                        3.50       2.46       +29.7%
| |   BM_memcpy_2T/8                        12.3       7.66       +37.7%
| |   BM_memcpy_2T/64                        371        376        -1.3%
| |   BM_memcpy_2T/512                     66652      66788        -0.2%
| |   BM_memcpy_2T/4k                    6145012    6117776        +0.4%
| |   BM_memcpy_2T/5k                    9181478    9010942        +1.9%
| |   BM_memcpy_4T/2                        3.47       2.47       +31.0%
| |   BM_memcpy_4T/8                        12.3       6.67       +45.8%
| |   BM_memcpy_4T/64                        374        376        -0.5%
| |   BM_memcpy_4T/512                     67833      68019        -0.3%
| |   BM_memcpy_4T/4k                    5057425    5188253        -2.6%
| |   BM_memcpy_4T/5k                    7555638    7779468        -3.0%
| |   BM_memcpy_6T/2                        3.51       2.50       +28.8%
| |   BM_memcpy_6T/8                        12.3       7.61       +38.1%
| |   BM_memcpy_6T/64                        373        378        -1.3%
| |   BM_memcpy_6T/512                     66871      66774        +0.1%
| |   BM_memcpy_6T/4k                    5112975    5233502        -2.4%
| |   BM_memcpy_6T/5k                    7614180    7772246        -2.1%
| |   BM_memcpy_8T/2                        3.47       2.41       +30.5%
| |   BM_memcpy_8T/8                        12.4       10.5       +15.3%
| |   BM_memcpy_8T/64                        372        388        -4.3%
| |   BM_memcpy_8T/512                     67373      66588        +1.2%
| |   BM_memcpy_8T/4k                    5148462    5254897        -2.1%
| |   BM_memcpy_8T/5k                    7660989    7799058        -1.8%
| |   BM_memcpy_12T/2                       3.50       2.40       +31.4%
| |   BM_memcpy_12T/8                       12.4       7.55       +39.1%
| |   BM_memcpy_12T/64                       374        378        -1.1%
| |   BM_memcpy_12T/512                    67132      66683        +0.7%
| |   BM_memcpy_12T/4k                   5185125    5292920        -2.1%
| |   BM_memcpy_12T/5k                   7717284    7942684        -2.9%
| |   BM_slicingSmallPieces_1T/2            47.3       47.5        +0.4%
| |   BM_slicingSmallPieces_1T/8            53.6       52.3        +2.4%
| |   BM_slicingSmallPieces_1T/64            491        476        +3.1%
| |   BM_slicingSmallPieces_1T/512         21734      18814       +13.4%
| |   BM_slicingSmallPieces_1T/4k         394660     396760        -0.5%
| |   BM_slicingSmallPieces_1T/5k         218722     209244        +4.3%
| |   BM_slicingSmallPieces_2T/2            80.7       79.9        +1.0%
| |   BM_slicingSmallPieces_2T/8            54.2       53.1        +2.0%
| |   BM_slicingSmallPieces_2T/64            497        477        +4.0%
| |   BM_slicingSmallPieces_2T/512         21732      18822       +13.4%
| |   BM_slicingSmallPieces_2T/4k         392885     390490        +0.6%
| |   BM_slicingSmallPieces_2T/5k         221988     208678        +6.0%
| |   BM_slicingSmallPieces_4T/2            80.8       80.1        +0.9%
| |   BM_slicingSmallPieces_4T/8            54.1       53.2        +1.7%
| |   BM_slicingSmallPieces_4T/64            493        476        +3.4%
| |   BM_slicingSmallPieces_4T/512         21702      18758       +13.6%
| |   BM_slicingSmallPieces_4T/4k         393962     404023        -2.6%
| |   BM_slicingSmallPieces_4T/5k         249667     211732       +15.2%
| |   BM_slicingSmallPieces_6T/2            80.5       80.1        +0.5%
| |   BM_slicingSmallPieces_6T/8            54.4       53.4        +1.8%
| |   BM_slicingSmallPieces_6T/64            488        478        +2.0%
| |   BM_slicingSmallPieces_6T/512         21719      18841       +13.3%
| |   BM_slicingSmallPieces_6T/4k         394950     397583        -0.7%
| |   BM_slicingSmallPieces_6T/5k         223080     210148        +5.8%
| |   BM_slicingSmallPieces_8T/2            81.2       80.4        +1.0%
| |   BM_slicingSmallPieces_8T/8            58.1       53.5        +7.9%
| |   BM_slicingSmallPieces_8T/64            489        480        +1.8%
| |   BM_slicingSmallPieces_8T/512         21586      18798       +12.9%
| |   BM_slicingSmallPieces_8T/4k         394592     400165        -1.4%
| |   BM_slicingSmallPieces_8T/5k         219688     208301        +5.2%
| |   BM_slicingSmallPieces_12T/2           80.2       79.8        +0.7%
| |   BM_slicingSmallPieces_12T/8           54.4       53.4        +1.8%
| |   BM_slicingSmallPieces_12T/64           488        476        +2.5%
| |   BM_slicingSmallPieces_12T/512        21931      18831       +14.1%
| |   BM_slicingSmallPieces_12T/4k        393962     396541        -0.7%
| |   BM_slicingSmallPieces_12T/5k        218803     207965        +5.0%
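The small/large dispatch described in that commit can be sketched as follows. This is a minimal illustration of the idea, not Eigen's actual implementation; the function name and the 128-byte threshold are assumptions, not tuned values.

```cpp
#include <cstddef>
#include <cstring>

// Sketch: small copies go through memcpy, which compilers can inline for
// small sizes; large copies go through memmove, which the commit reports
// as faster than memcpy for large blocks on Linux with GCC/Clang.
// The threshold is a placeholder, not a measured crossover point.
template <typename T>
void fast_memcpy(T* dst, const T* src, std::size_t count) {
  const std::size_t bytes = count * sizeof(T);
  if (bytes <= 128) {
    std::memcpy(dst, src, bytes);
  } else {
    std::memmove(dst, src, bytes);
  }
}
```

In Eigen itself the dispatch would sit behind the device interface, so that the same tensor-assignment code picks up the faster path transparently.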
* | Adding Tensor ReverseOp; TensorStriding; TensorConversionOp; Modifying ↵  Mehdi Goli  2017-01-16
| | Tensor Contractsycl to be located in any place in the expression tree.
| * Simplified the contraction code  Benoit Steiner  2016-12-21
| |
| * Leverage libxsmm kernels within single threaded contractions  Benoit Steiner  2016-12-21
|/
* Adding tensor contraction operation backend for Sycl; adding test for ↵  Mehdi Goli  2016-12-14
| contractionOp sycl backend; adding temporary solution to prevent memory leak in buffer; cleaning up cxx11_tensor_buildins_sycl.h
* Updated the contraction code to support constant inputs.  Benoit Steiner  2016-09-01
|
* Properly detect the type of the result of a contraction.  Benoit Steiner  2016-08-16
|
* Replace implicit cast with an explicit one  Benoit Steiner  2016-05-12
|
* Added tests for full contractions using thread pools and gpu devices.  Benoit Steiner  2016-05-05
| Fixed a couple of issues in the corresponding code.
* Updated the contraction code to ensure that full contractions return a tensor ↵  Benoit Steiner  2016-05-05
| of rank 0
* Deleted trailing commas  Benoit Steiner  2016-04-29
|
* Deleted useless trailing commas  Benoit Steiner  2016-04-29
|
* Move the evalGemm method into the TensorContractionEvaluatorBase class to ↵  Benoit Steiner  2016-04-15
| make it accessible from both the single and multithreaded contraction evaluators.
* Fixed a few compilation warnings  Benoit Steiner  2016-04-15
|
* Eigen cost model part 1. This implements a basic recursive framework to ↵  Rasmus Munk Larsen  2016-04-14
| estimate the cost of evaluating tensor expressions.
* Fix bug in tensor contraction. The code assumes that contraction axis ↵  Benoit Steiner  2016-03-17
| indices for the LHS (after possibly swapping to ColMajor!) are increasing. Explicitly sort the contraction axis pairs to make it so.
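The fix above amounts to sorting the contraction axis pairs by their LHS index before the contraction kernel runs, so downstream code can rely on increasing order. A sketch of that step, with an illustrative function name (not Eigen's internal one):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sort contraction axis pairs so the LHS axis of each pair appears in
// increasing order; the RHS axis travels with its partner. Later code
// can then safely assume monotonically increasing LHS contraction axes.
void sortContractionPairs(std::vector<std::pair<int, int>>& pairs) {
  std::sort(pairs.begin(), pairs.end(),
            [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
              return a.first < b.first;
            });
}
```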
* Fixed the tensor chipping code.  Benoit Steiner  2016-03-08
|
* Decoupled the packet type definition from the definition of the tensor ops. ↵  Benoit Steiner  2016-03-08
| All the vectorization is now defined in the tensor evaluators. This will make it possible to reliably support devices with different packet types in the same compilation unit.
* Marked several methods EIGEN_DEVICE_FUNC  Benoit Steiner  2016-01-28
|
* Leverage the new blocking code in the tensor contraction code.  Benoit Steiner  2016-01-22
|
* Small cleanup and small fix to the contraction of row major tensors  Benoit Steiner  2016-01-20
|
* Reduce the register pressure exerted by the tensor mappers whenever ↵  Benoit Steiner  2016-01-20
| possible. This improves the performance of the contraction of a matrix with a vector by about 35%.
* Moved the contraction mapping code to its own file to make the code more ↵  Benoit Steiner  2016-01-19
| manageable.
* Trigger the optimized matrix vector path more conservatively.  Benoit Steiner  2016-01-12
|
* Improved the performance of the contraction of a 2d tensor with a 1d tensor ↵  Benoit Steiner  2016-01-12
| by a factor of 3 or more. This helps speedup LSTM neural networks.
* Fixed another compilation warning  Benoit Steiner  2015-12-07
|
* Fixed compilation warnings  Benoit Steiner  2015-12-07
|
* Use signed integers instead of unsigned ones more consistently in the codebase.  Benoit Steiner  2015-12-04
|
* Fixed a compilation warning  Benoit Steiner  2015-10-29
|
* Reworked the tensor contraction mapper code to make it compile on Android  Benoit Steiner  2015-10-23
|
* Use numext::mini/numext::maxi instead of std::min/std::max in the tensor code  Benoit Steiner  2015-08-28
|
* Many files were missing in previous changeset.  Gael Guennebaud  2015-07-29
|
* Allowed tensor contraction operation with an empty array of dimension pairs, ↵  Godeffroy Valet  2015-07-25
| which performs a tensor product.
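Contracting over an empty set of dimension pairs leaves no indices to sum over, so every output element is simply a product of one element from each input, i.e. the outer (tensor) product. A plain-C++ sketch of that semantics for two rank-1 tensors (the helper name is illustrative, not Eigen API):

```cpp
#include <cstddef>
#include <vector>

// With zero contraction pairs, the "contraction" degenerates to the outer
// product: out(i, j) = a(i) * b(j), stored here in row-major order.
std::vector<float> outerProduct(const std::vector<float>& a,
                                const std::vector<float>& b) {
  std::vector<float> out(a.size() * b.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < b.size(); ++j)
      out[i * b.size() + j] = a[i] * b[j];
  return out;
}
```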
* Use numext::swap instead of std::swap  Benoit Steiner  2015-07-06
|
* Moved some utilities to TensorMeta.h to make it easier to reuse them across ↵  Benoit Steiner  2015-06-29
| several tensor operations. Created the TensorDimensionList class to encode the list of all the dimensions of a tensor of rank n. This could be done using TensorIndexList, however TensorIndexList requires cxx11 which isn't yet supported as widely as we'd like.
* Fixed compilation warning triggered by gcc 4.7  Benoit Steiner  2015-04-18
|
* Fixed another batch of compilation warnings  Benoit Steiner  2015-02-28
|
* Silenced a few compilation warnings  Benoit Steiner  2015-02-10
|
* Silenced several compilation warnings  Benoit Steiner  2015-02-10
|
* Marked the contraction operation as read only, since its result can't be ↵  Benoit Steiner  2015-01-29
| assigned.
* Added support for RowMajor inputs to the contraction code.  Benoit Steiner  2015-01-14
|
* Fixed compilation errors with clang.  Benoit Steiner  2014-11-13
|
* Improved handling of 1d tensors  Benoit Steiner  2014-11-03
|
* Silenced one last warning  Benoit Steiner  2014-10-16
|
* Silenced a few compilation warnings  Benoit Steiner  2014-10-16
| Generalized a TensorMap constructor
* Made the blocking computation aware of the l3 cache  Benoit Steiner  2014-10-15
| Also optimized the blocking parameters to take into account the number of threads used for a computation
* Created the IndexPair type to store pairs of tensor indices. CUDA doesn't ↵  Benoit Steiner  2014-10-03
| support std::pair so we can't use them when targeting GPUs. Improved the performance on tensor contractions
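Since std::pair cannot be used in CUDA device code, a trivial aggregate with the same shape works on both host and device. A minimal sketch in the spirit of that change (Eigen's real IndexPair additionally carries device-function annotations, omitted here):

```cpp
// A bare pair-of-indices type with no std:: dependency, usable where
// std::pair is unavailable (e.g. CUDA device code). Sketch only.
template <typename Index>
struct IndexPair {
  IndexPair() : first(0), second(0) {}
  IndexPair(Index f, Index s) : first(f), second(s) {}
  Index first;
  Index second;
};
```

Contraction dimensions are then expressed as arrays of such pairs, one pair per contracted axis.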
* Fixed a typo in the contraction code  Benoit Steiner  2014-09-06
|
* Fixed misc typos.  Benoit Steiner  2014-08-13
|
* Added missing APIs.  Benoit Steiner  2014-08-13
|