path: root/unsupported/Eigen/CXX11/src/Tensor/TensorContraction.h
Commit message                                                Author                Age
...
| * Adds a fast memcpy function to Eigen. This takes advantage of the following:  Rasmus Munk Larsen  2017-01-24
| |   1. For small fixed sizes, the compiler generates inline code for memcpy, which is much faster.
| |   2. My colleague eriche at googl dot com discovered that for large sizes, memmove is significantly faster than memcpy (at least on Linux with GCC or Clang). See benchmark numbers measured on a Haswell (HP Z440) workstation here: https://docs.google.com/a/google.com/spreadsheets/d/1jLs5bKzXwhpTySw65MhG1pZpsIwkszZqQTjwrd_n0ic/pubhtml
| |
| |   This is of course surprising, since memcpy is a less constrained version of memmove. This stackoverflow thread contains some speculation as to the causes: http://stackoverflow.com/questions/22793669/poor-memcpy-performance-on-linux
| |
| |   Below are numbers for copying and slicing tensors using the multithreaded TensorDevice. The numbers show significant improvements for memcpy of very small blocks and for memcpy of large blocks single threaded (we were already able to saturate memory bandwidth for >1 threads before on large blocks). The "slicingSmallPieces" benchmark also shows small consistent improvements, since memcpy cost is a fair portion of that particular computation. The benchmarks operate on NxN matrices, and the names are of the form BM_$OP_${NUMTHREADS}T/${N}.
| |   Measured improvements in wall clock time:
| |
| |   Run on rmlarsen3.mtv (12 X 3501 MHz CPUs); 2017-01-20T11:26:31.493023454-08:00
| |   CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
| |
| |   Benchmark                        Base (ns)   New (ns)  Improvement
| |   ------------------------------------------------------------------
| |   BM_memcpy_1T/2                        3.48       2.39       +31.3%
| |   BM_memcpy_1T/8                        12.3       6.51       +47.0%
| |   BM_memcpy_1T/64                        371        383        -3.2%
| |   BM_memcpy_1T/512                     66922      66720        +0.3%
| |   BM_memcpy_1T/4k                    9892867    6849682       +30.8%
| |   BM_memcpy_1T/5k                   14951099   10332856       +30.9%
| |   BM_memcpy_2T/2                        3.50       2.46       +29.7%
| |   BM_memcpy_2T/8                        12.3       7.66       +37.7%
| |   BM_memcpy_2T/64                        371        376        -1.3%
| |   BM_memcpy_2T/512                     66652      66788        -0.2%
| |   BM_memcpy_2T/4k                    6145012    6117776        +0.4%
| |   BM_memcpy_2T/5k                    9181478    9010942        +1.9%
| |   BM_memcpy_4T/2                        3.47       2.47       +31.0%
| |   BM_memcpy_4T/8                        12.3       6.67       +45.8%
| |   BM_memcpy_4T/64                        374        376        -0.5%
| |   BM_memcpy_4T/512                     67833      68019        -0.3%
| |   BM_memcpy_4T/4k                    5057425    5188253        -2.6%
| |   BM_memcpy_4T/5k                    7555638    7779468        -3.0%
| |   BM_memcpy_6T/2                        3.51       2.50       +28.8%
| |   BM_memcpy_6T/8                        12.3       7.61       +38.1%
| |   BM_memcpy_6T/64                        373        378        -1.3%
| |   BM_memcpy_6T/512                     66871      66774        +0.1%
| |   BM_memcpy_6T/4k                    5112975    5233502        -2.4%
| |   BM_memcpy_6T/5k                    7614180    7772246        -2.1%
| |   BM_memcpy_8T/2                        3.47       2.41       +30.5%
| |   BM_memcpy_8T/8                        12.4       10.5       +15.3%
| |   BM_memcpy_8T/64                        372        388        -4.3%
| |   BM_memcpy_8T/512                     67373      66588        +1.2%
| |   BM_memcpy_8T/4k                    5148462    5254897        -2.1%
| |   BM_memcpy_8T/5k                    7660989    7799058        -1.8%
| |   BM_memcpy_12T/2                       3.50       2.40       +31.4%
| |   BM_memcpy_12T/8                       12.4       7.55       +39.1%
| |   BM_memcpy_12T/64                       374        378        -1.1%
| |   BM_memcpy_12T/512                    67132      66683        +0.7%
| |   BM_memcpy_12T/4k                   5185125    5292920        -2.1%
| |   BM_memcpy_12T/5k                   7717284    7942684        -2.9%
| |   BM_slicingSmallPieces_1T/2            47.3       47.5        +0.4%
| |   BM_slicingSmallPieces_1T/8            53.6       52.3        +2.4%
| |   BM_slicingSmallPieces_1T/64            491        476        +3.1%
| |   BM_slicingSmallPieces_1T/512         21734      18814       +13.4%
| |   BM_slicingSmallPieces_1T/4k         394660     396760        -0.5%
| |   BM_slicingSmallPieces_1T/5k         218722     209244        +4.3%
| |   BM_slicingSmallPieces_2T/2            80.7       79.9        +1.0%
| |   BM_slicingSmallPieces_2T/8            54.2       53.1        +2.0%
| |   BM_slicingSmallPieces_2T/64            497        477        +4.0%
| |   BM_slicingSmallPieces_2T/512         21732      18822       +13.4%
| |   BM_slicingSmallPieces_2T/4k         392885     390490        +0.6%
| |   BM_slicingSmallPieces_2T/5k         221988     208678        +6.0%
| |   BM_slicingSmallPieces_4T/2            80.8       80.1        +0.9%
| |   BM_slicingSmallPieces_4T/8            54.1       53.2        +1.7%
| |   BM_slicingSmallPieces_4T/64            493        476        +3.4%
| |   BM_slicingSmallPieces_4T/512         21702      18758       +13.6%
| |   BM_slicingSmallPieces_4T/4k         393962     404023        -2.6%
| |   BM_slicingSmallPieces_4T/5k         249667     211732       +15.2%
| |   BM_slicingSmallPieces_6T/2            80.5       80.1        +0.5%
| |   BM_slicingSmallPieces_6T/8            54.4       53.4        +1.8%
| |   BM_slicingSmallPieces_6T/64            488        478        +2.0%
| |   BM_slicingSmallPieces_6T/512         21719      18841       +13.3%
| |   BM_slicingSmallPieces_6T/4k         394950     397583        -0.7%
| |   BM_slicingSmallPieces_6T/5k         223080     210148        +5.8%
| |   BM_slicingSmallPieces_8T/2            81.2       80.4        +1.0%
| |   BM_slicingSmallPieces_8T/8            58.1       53.5        +7.9%
| |   BM_slicingSmallPieces_8T/64            489        480        +1.8%
| |   BM_slicingSmallPieces_8T/512         21586      18798       +12.9%
| |   BM_slicingSmallPieces_8T/4k         394592     400165        -1.4%
| |   BM_slicingSmallPieces_8T/5k         219688     208301        +5.2%
| |   BM_slicingSmallPieces_12T/2           80.2       79.8        +0.7%
| |   BM_slicingSmallPieces_12T/8           54.4       53.4        +1.8%
| |   BM_slicingSmallPieces_12T/64           488        476        +2.5%
| |   BM_slicingSmallPieces_12T/512        21931      18831       +14.1%
| |   BM_slicingSmallPieces_12T/4k        393962     396541        -0.7%
| |   BM_slicingSmallPieces_12T/5k        218803     207965        +5.0%
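The small/large dispatch described in that commit can be sketched as follows. This is a minimal illustration of the idea, not Eigen's actual implementation; the function name and the 128-byte threshold are assumptions, not tuned values.

```cpp
#include <cstddef>
#include <cstring>

// Sketch: small copies go through memcpy, which compilers can inline for
// small sizes; large copies go through memmove, which the commit reports
// as faster than memcpy for large blocks on Linux with GCC/Clang.
// The threshold is a placeholder, not a measured crossover point.
template <typename T>
void fast_memcpy(T* dst, const T* src, std::size_t count) {
  const std::size_t bytes = count * sizeof(T);
  if (bytes <= 128) {
    std::memcpy(dst, src, bytes);
  } else {
    std::memmove(dst, src, bytes);
  }
}
```

In Eigen itself the dispatch would sit behind the device interface, so that the same tensor-assignment code picks up the faster path transparently.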
* | Adding Tensor ReverseOp; TensorStriding; TensorConversionOp; Modifying ↵  Mehdi Goli  2017-01-16
| | Tensor Contractsycl to be located in any place in the expression tree.
| * Simplified the contraction code  Benoit Steiner  2016-12-21
| |
| * Leverage libxsmm kernels within single threaded contractions  Benoit Steiner  2016-12-21
|/
* Adding tensor contraction operation backend for Sycl; adding test for ↵  Mehdi Goli  2016-12-14
| contractionOp sycl backend; adding temporary solution to prevent memory leak in buffer; cleaning up cxx11_tensor_buildins_sycl.h
* Updated the contraction code to support constant inputs.  Benoit Steiner  2016-09-01
|
* Properly detect the type of the result of a contraction.  Benoit Steiner  2016-08-16
|
* Replace implicit cast with an explicit one  Benoit Steiner  2016-05-12
|
* Added tests for full contractions using thread pools and gpu devices.  Benoit Steiner  2016-05-05
| Fixed a couple of issues in the corresponding code.
* Updated the contraction code to ensure that full contractions return a tensor ↵  Benoit Steiner  2016-05-05
| of rank 0
* Deleted trailing commas  Benoit Steiner  2016-04-29
|
* Deleted useless trailing commas  Benoit Steiner  2016-04-29
|
* Move the evalGemm method into the TensorContractionEvaluatorBase class to ↵  Benoit Steiner  2016-04-15
| make it accessible from both the single and multithreaded contraction evaluators.
* Fixed a few compilation warnings  Benoit Steiner  2016-04-15
|
* Eigen cost model part 1. This implements a basic recursive framework to ↵  Rasmus Munk Larsen  2016-04-14
| estimate the cost of evaluating tensor expressions.
* Fix bug in tensor contraction. The code assumes that contraction axis ↵  Benoit Steiner  2016-03-17
| indices for the LHS (after possibly swapping to ColMajor!) are increasing. Explicitly sort the contraction axis pairs to make it so.
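The fix above amounts to sorting the contraction axis pairs by their LHS index before the contraction kernel runs, so downstream code can rely on increasing order. A sketch of that step, with an illustrative function name (not Eigen's internal one):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sort contraction axis pairs so the LHS axis of each pair appears in
// increasing order; the RHS axis travels with its partner. Later code
// can then safely assume monotonically increasing LHS contraction axes.
void sortContractionPairs(std::vector<std::pair<int, int>>& pairs) {
  std::sort(pairs.begin(), pairs.end(),
            [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
              return a.first < b.first;
            });
}
```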
* Fixed the tensor chipping code.  Benoit Steiner  2016-03-08
|
* Decoupled the packet type definition from the definition of the tensor ops. ↵  Benoit Steiner  2016-03-08
| All the vectorization is now defined in the tensor evaluators. This will make it possible to reliably support devices with different packet types in the same compilation unit.
* Marked several methods EIGEN_DEVICE_FUNC  Benoit Steiner  2016-01-28
|
* Leverage the new blocking code in the tensor contraction code.  Benoit Steiner  2016-01-22
|
* Small cleanup and small fix to the contraction of row major tensors  Benoit Steiner  2016-01-20
|
* Reduce the register pressure exerted by the tensor mappers whenever ↵  Benoit Steiner  2016-01-20
| possible. This improves the performance of the contraction of a matrix with a vector by about 35%.
* Moved the contraction mapping code to its own file to make the code more ↵  Benoit Steiner  2016-01-19
| manageable.
* Trigger the optimized matrix vector path more conservatively.  Benoit Steiner  2016-01-12
|
* Improved the performance of the contraction of a 2d tensor with a 1d tensor ↵  Benoit Steiner  2016-01-12
| by a factor of 3 or more. This helps speedup LSTM neural networks.
* Fixed another compilation warning  Benoit Steiner  2015-12-07
|
* Fixed compilation warnings  Benoit Steiner  2015-12-07
|
* Use signed integers instead of unsigned ones more consistently in the codebase.  Benoit Steiner  2015-12-04
|
* Fixed a compilation warning  Benoit Steiner  2015-10-29
|
* Reworked the tensor contraction mapper code to make it compile on Android  Benoit Steiner  2015-10-23
|
* Use numext::mini/numext::maxi instead of std::min/std::max in the tensor code  Benoit Steiner  2015-08-28
|
* Many files were missing in previous changeset.  Gael Guennebaud  2015-07-29
|
* Allowed tensor contraction operation with an empty array of dimension pairs, ↵  Godeffroy Valet  2015-07-25
| which performs a tensor product.
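Contracting over an empty set of dimension pairs leaves no indices to sum over, so every output element is simply a product of one element from each input, i.e. the outer (tensor) product. A plain-C++ sketch of that semantics for two rank-1 tensors (the helper name is illustrative, not Eigen API):

```cpp
#include <cstddef>
#include <vector>

// With zero contraction pairs, the "contraction" degenerates to the outer
// product: out(i, j) = a(i) * b(j), stored here in row-major order.
std::vector<float> outerProduct(const std::vector<float>& a,
                                const std::vector<float>& b) {
  std::vector<float> out(a.size() * b.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < b.size(); ++j)
      out[i * b.size() + j] = a[i] * b[j];
  return out;
}
```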
* Use numext::swap instead of std::swap  Benoit Steiner  2015-07-06
|
* Moved some utilities to TensorMeta.h to make it easier to reuse them across ↵  Benoit Steiner  2015-06-29
| several tensor operations. Created the TensorDimensionList class to encode the list of all the dimensions of a tensor of rank n. This could be done using TensorIndexList, however TensorIndexList requires cxx11 which isn't yet supported as widely as we'd like.
* Fixed compilation warning triggered by gcc 4.7  Benoit Steiner  2015-04-18
|
* Fixed another batch of compilation warnings  Benoit Steiner  2015-02-28
|
* Silenced a few compilation warnings  Benoit Steiner  2015-02-10
|
* Silenced several compilation warnings  Benoit Steiner  2015-02-10
|
* Marked the contraction operation as read only, since its result can't be ↵  Benoit Steiner  2015-01-29
| assigned.
* Added support for RowMajor inputs to the contraction code.  Benoit Steiner  2015-01-14
|
* Fixed compilation errors with clang.  Benoit Steiner  2014-11-13
|
* Improved handling of 1d tensors  Benoit Steiner  2014-11-03
|
* Silenced one last warning  Benoit Steiner  2014-10-16
|
* Silenced a few compilation warnings  Benoit Steiner  2014-10-16
| Generalized a TensorMap constructor
* Made the blocking computation aware of the l3 cache  Benoit Steiner  2014-10-15
| Also optimized the blocking parameters to take into account the number of threads used for a computation
* Created the IndexPair type to store pairs of tensor indices. CUDA doesn't ↵  Benoit Steiner  2014-10-03
| support std::pair so we can't use them when targeting GPUs. Improved the performance on tensor contractions
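Since std::pair cannot be used in CUDA device code, a trivial aggregate with the same shape works on both host and device. A minimal sketch in the spirit of that change (Eigen's real IndexPair additionally carries device-function annotations, omitted here):

```cpp
// A bare pair-of-indices type with no std:: dependency, usable where
// std::pair is unavailable (e.g. CUDA device code). Sketch only.
template <typename Index>
struct IndexPair {
  IndexPair() : first(0), second(0) {}
  IndexPair(Index f, Index s) : first(f), second(s) {}
  Index first;
  Index second;
};
```

Contraction dimensions are then expressed as arrays of such pairs, one pair per contracted axis.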
* Fixed a typo in the contraction code  Benoit Steiner  2014-09-06
|
* Fixed misc typos.  Benoit Steiner  2014-08-13
|
* Added missing APIs.  Benoit Steiner  2014-08-13
|