path: root/unsupported
Commit message  Author  Age
* Advertise the packet API of the tensor reducers iff the corresponding packet primitives are available.  Benoit Steiner  2016-05-18
|
* bug #1229: bypass usage of Derived::Options, which is available for plain matrix types only. Better to use column-major storage anyway.  Gael Guennebaud  2016-05-18
|
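As an aside, a minimal sketch of the pattern the fix favors: requesting column-major storage explicitly instead of querying Derived::Options. The helper name is hypothetical; only Eigen::Matrix, MatrixBase, and ColMajor come from Eigen itself.

    #include <Eigen/Core>

    // Hypothetical helper: evaluate any expression into a plain matrix with
    // explicit column-major storage, rather than inheriting Derived::Options,
    // which only plain matrix types provide.
    template <typename Derived>
    Eigen::Matrix<typename Derived::Scalar,
                  Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>
    to_col_major(const Eigen::MatrixBase<Derived>& x) {
      return x;  // implicit evaluation into a column-major copy
    }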
* Pass argument by const ref instead of by value in pow(AutoDiffScalar...)  Gael Guennebaud  2016-05-18
|
* bug #1223: fix compilation of AutoDiffScalar's min/max operators, and add a regression unit test.  Gael Guennebaud  2016-05-18
|
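For context, a small usage sketch of the kind of expression bug #1223 concerns; this is my own example, not the regression test from the commit, and it assumes the usual AutoDiffScalar constructors.

    #include <unsupported/Eigen/AutoDiff>
    #include <Eigen/Core>

    int main() {
      typedef Eigen::AutoDiffScalar<Eigen::VectorXd> ADScalar;
      ADScalar x(2.0, 2, 0);  // value 2.0, two derivatives, active index 0
      ADScalar y(3.0, 2, 1);

      // min/max mixing AutoDiffScalar operands (and a plain double) is the
      // sort of expression whose overloads the fix addresses; the calls
      // resolve to Eigen's overloads via argument-dependent lookup.
      ADScalar lo = min(x, y);
      ADScalar hi = max(x, 1.5);
      return hi.value() > lo.value() ? 0 : 1;
    }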
* bug #1222: fix compilation in AutoDiffScalar and add the corresponding unit test.  Gael Guennebaud  2016-05-18
|
* #if defined(EIGEN_USE_NONBLOCKING_THREAD_POOL) is now #if !defined(EIGEN_USE_SIMPLE_THREAD_POOL): the non-blocking thread pool is the default since it is more scalable, and one needs to request the old thread pool explicitly.  Benoit Steiner  2016-05-17
|
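For reference, a minimal sketch of how a user would opt back into the old pool under this change. The EIGEN_USE_SIMPLE_THREAD_POOL macro comes from the commit message; the rest of the setup is illustrative and assumes the usual ThreadPool/ThreadPoolDevice idiom.

    // Request the old (simple) thread pool explicitly; without this define,
    // the non-blocking pool is now the default.
    #define EIGEN_USE_SIMPLE_THREAD_POOL
    #define EIGEN_USE_THREADS
    #include <unsupported/Eigen/CXX11/Tensor>

    int main() {
      Eigen::ThreadPool pool(4);                 // 4 worker threads
      Eigen::ThreadPoolDevice device(&pool, 4);
      Eigen::Tensor<float, 1> a(1000), b(1000), c(1000);
      a.setRandom();
      b.setRandom();
      c.device(device) = a + b;                  // evaluated on the pool
      return 0;
    }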
* Fixed compilation error.  Benoit Steiner  2016-05-17
|
* Fixed compilation error in the tensor thread pool.  Benoit Steiner  2016-05-17
|
* Merge upstream.  Rasmus Munk Larsen  2016-05-17
|\
* | Roll back changes to core. Move include of TensorFunctors.h up to satisfy the dependency in TensorCostModel.h.  Rasmus Munk Larsen  2016-05-17
| |
| * Merged eigen/eigen into default.  Rasmus Larsen  2016-05-17
|/|
| * Enable the use of the packet API to evaluate tensor broadcasts. This speeds things up quite a bit.  Benoit Steiner  2016-05-17
| |   Before:
| |     BM_broadcasting/10    500000    3690    27.10 MFlops/s
| |     BM_broadcasting/80    500000    4014    1594.24 MFlops/s
| |     BM_broadcasting/640   100000   14770    27731.35 MFlops/s
| |     BM_broadcasting/4K      5000  632711    39512.48 MFlops/s
| |   After:
| |     BM_broadcasting/10    500000    4287    23.33 MFlops/s
| |     BM_broadcasting/80    500000    4455    1436.41 MFlops/s
| |     BM_broadcasting/640   200000   10195    40173.01 MFlops/s
| |     BM_broadcasting/4K      5000  423746    58997.57 MFlops/s
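As a usage reference (my own sketch, not code from the commit), this is the kind of broadcast expression the packet path now vectorizes:

    #include <unsupported/Eigen/CXX11/Tensor>

    int main() {
      Eigen::Tensor<float, 2> in(8, 8);
      in.setRandom();

      // Replicate the tensor 10 times along each dimension; assigning the
      // broadcast expression evaluates it, now through the packet path
      // when the scalar type supports it.
      Eigen::array<Eigen::Index, 2> bcast = {{10, 10}};
      Eigen::Tensor<float, 2> out = in.broadcast(bcast);
      return out.dimension(0) == 80 ? 0 : 1;
    }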
| * Allow vectorized padding on GPU. This helps speed things up a little.  Benoit Steiner  2016-05-17
| |   Before:
| |     BM_padding/10    5000000   460   217.03 MFlops/s
| |     BM_padding/80    5000000   460   13899.40 MFlops/s
| |     BM_padding/640   5000000   461   888421.17 MFlops/s
| |     BM_padding/4K    5000000   460   54316322.55 MFlops/s
| |   After:
| |     BM_padding/10    5000000   454   220.20 MFlops/s
| |     BM_padding/80    5000000   455   14039.86 MFlops/s
| |     BM_padding/640   5000000   452   904968.83 MFlops/s
| |     BM_padding/4K    5000000   411   60750049.21 MFlops/s
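Likewise, a small sketch (mine, not from the commit) of the padding operation being benchmarked, shown on the CPU; the same expression runs through a GpuDevice on the GPU.

    #include <unsupported/Eigen/CXX11/Tensor>
    #include <utility>

    int main() {
      Eigen::Tensor<float, 2> in(4, 4);
      in.setConstant(1.0f);

      // Pad two zero rows/columns before and after each dimension: 4x4 -> 8x8.
      Eigen::array<std::pair<Eigen::Index, Eigen::Index>, 2> paddings;
      paddings[0] = std::make_pair(2, 2);
      paddings[1] = std::make_pair(2, 2);
      Eigen::Tensor<float, 2> out = in.pad(paddings);
      return out.dimension(0) == 8 ? 0 : 1;
    }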
| * Pulled latest updates from trunk.  Benoit Steiner  2016-05-17
| |\
| * | Don't rely on a C++11 extension when we don't have to.  Benoit Steiner  2016-05-17
| | |
| | * Added missing costPerCoeff method.  Benoit Steiner  2016-05-16
| | |
| | * Turn on the cost model by default. This results in some significant speedups for smaller tensors.  Benoit Steiner  2016-05-16
| | |   For example, below are the results for the various tensor reductions.
| | |   Before:
| | |     BM_colReduction_12T/10    1000000     1949    51.29 MFlops/s
| | |     BM_colReduction_12T/80     100000    15636    409.29 MFlops/s
| | |     BM_colReduction_12T/640     20000    95100    4307.01 MFlops/s
| | |     BM_colReduction_12T/4K        500  4573423    5466.36 MFlops/s
| | |     BM_colReduction_4T/10     1000000     1867    53.56 MFlops/s
| | |     BM_colReduction_4T/80      500000     5288    1210.11 MFlops/s
| | |     BM_colReduction_4T/640      10000   106924    3830.75 MFlops/s
| | |     BM_colReduction_4T/4K         500  9946374    2513.48 MFlops/s
| | |     BM_colReduction_8T/10     1000000     1912    52.30 MFlops/s
| | |     BM_colReduction_8T/80      200000     8354    766.09 MFlops/s
| | |     BM_colReduction_8T/640      20000    85063    4815.22 MFlops/s
| | |     BM_colReduction_8T/4K         500  5445216    4591.19 MFlops/s
| | |     BM_rowReduction_12T/10    1000000     2041    48.99 MFlops/s
| | |     BM_rowReduction_12T/80     100000    15426    414.87 MFlops/s
| | |     BM_rowReduction_12T/640     50000    39117    10470.98 MFlops/s
| | |     BM_rowReduction_12T/4K        500  3034298    8239.14 MFlops/s
| | |     BM_rowReduction_4T/10     1000000     1834    54.51 MFlops/s
| | |     BM_rowReduction_4T/80      500000     5406    1183.81 MFlops/s
| | |     BM_rowReduction_4T/640      50000    35017    11697.16 MFlops/s
| | |     BM_rowReduction_4T/4K         500  3428527    7291.76 MFlops/s
| | |     BM_rowReduction_8T/10     1000000     1925    51.95 MFlops/s
| | |     BM_rowReduction_8T/80      200000     8519    751.23 MFlops/s
| | |     BM_rowReduction_8T/640      50000    33441    12248.42 MFlops/s
| | |     BM_rowReduction_8T/4K        1000  2852841    8763.19 MFlops/s
| | |   After:
| | |     BM_colReduction_12T/10   50000000       59    1678.30 MFlops/s
| | |     BM_colReduction_12T/80    5000000      725    8822.71 MFlops/s
| | |     BM_colReduction_12T/640     20000    90882    4506.93 MFlops/s
| | |     BM_colReduction_12T/4K        500  4668855    5354.63 MFlops/s
| | |     BM_colReduction_4T/10    50000000       59    1687.37 MFlops/s
| | |     BM_colReduction_4T/80     5000000      737    8681.24 MFlops/s
| | |     BM_colReduction_4T/640      50000   108637    3770.34 MFlops/s
| | |     BM_colReduction_4T/4K         500  7912954    3159.38 MFlops/s
| | |     BM_colReduction_8T/10    50000000       60    1657.21 MFlops/s
| | |     BM_colReduction_8T/80     5000000      726    8812.48 MFlops/s
| | |     BM_colReduction_8T/640      20000    91451    4478.90 MFlops/s
| | |     BM_colReduction_8T/4K         500  5441692    4594.16 MFlops/s
| | |     BM_rowReduction_12T/10   20000000       93    1065.28 MFlops/s
| | |     BM_rowReduction_12T/80    2000000      950    6730.96 MFlops/s
| | |     BM_rowReduction_12T/640     50000    38196    10723.48 MFlops/s
| | |     BM_rowReduction_12T/4K        500  3019217    8280.29 MFlops/s
| | |     BM_rowReduction_4T/10    20000000       93    1064.30 MFlops/s
| | |     BM_rowReduction_4T/80     2000000      959    6667.71 MFlops/s
| | |     BM_rowReduction_4T/640      50000    37433    10941.96 MFlops/s
| | |     BM_rowReduction_4T/4K         500  3036476    8233.23 MFlops/s
| | |     BM_rowReduction_8T/10    20000000       93    1072.47 MFlops/s
| | |     BM_rowReduction_8T/80     2000000      959    6670.04 MFlops/s
| | |     BM_rowReduction_8T/640      50000    38069    10759.37 MFlops/s
| | |     BM_rowReduction_8T/4K        1000  2758988    9061.29 MFlops/s
| | * Fixed syntax error.  Benoit Steiner  2016-05-16
| | |
| | * Turn on the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.  Benoit Steiner  2016-05-13
| | |
| | * New multithreaded contraction that doesn't rely on the thread pool running closures in the order in which they are enqueued. This is needed in order to switch to the new non-blocking thread pool, since that pool can execute closures in any order.  Benoit Steiner  2016-05-13
| | |
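To illustrate why order-independence matters, here is a generic synchronization pattern (not Eigen's contraction code): completion is tracked with an atomic counter, so the work blocks may finish in whatever order the pool picks.

    #include <atomic>
    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
      const int kBlocks = 16;
      std::vector<double> partial(kBlocks, 0.0);
      std::atomic<int> pending(kBlocks);
      std::mutex mu;
      std::condition_variable done;

      std::vector<std::thread> workers;
      for (int b = 0; b < kBlocks; ++b) {
        workers.emplace_back([&, b] {
          partial[b] = b * 1.0;             // stand-in for one block of work
          if (pending.fetch_sub(1) == 1) {  // last block, in whatever order
            std::lock_guard<std::mutex> lock(mu);
            done.notify_one();
          }
        });
      }

      std::unique_lock<std::mutex> lock(mu);
      done.wait(lock, [&] { return pending.load() == 0; });
      for (auto& w : workers) w.join();
      return 0;
    }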
| | * Removed unnecessary thread synchronization.  Benoit Steiner  2016-05-13
| |/
| * Fixed compilation errors triggered by old versions of gcc.  Benoit Steiner  2016-05-12
| |
* | Disabled cost model by accident. Revert.  Rasmus Munk Larsen  2016-05-12
| |
* | Address comments by bsteiner.  Rasmus Munk Larsen  2016-05-12
| |
* | Improvements to parallelFor. Move some scalar functors from TensorFunctors.h to Eigen core.  Rasmus Munk Larsen  2016-05-12
|/
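For orientation, a sketch of how the thread pool device's parallelFor is typically invoked. The cost-model overload shown here (total count, per-element TensorOpCost, range functor) is my assumption about the API shape, not text from this commit.

    #define EIGEN_USE_THREADS
    #include <unsupported/Eigen/CXX11/Tensor>
    #include <vector>

    int main() {
      Eigen::ThreadPool pool(4);
      Eigen::ThreadPoolDevice device(&pool, 4);

      std::vector<float> data(1 << 20, 1.0f);
      // Split [0, data.size()) into ranges and run the body on the pool,
      // with the block size guided by the per-element cost estimate.
      device.parallelFor(
          data.size(),
          Eigen::TensorOpCost(sizeof(float), sizeof(float), /*compute_cycles=*/1),
          [&](Eigen::Index first, Eigen::Index last) {
            for (Eigen::Index i = first; i < last; ++i) data[i] *= 2.0f;
          });
      return 0;
    }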
* Worked around a compilation error triggered by nvcc when compiling a tensor concatenation kernel.  Benoit Steiner  2016-05-12
|
* Fixed potential race condition in the non-blocking thread pool.  Benoit Steiner  2016-05-12
|
* Replace implicit cast with an explicit one.  Benoit Steiner  2016-05-12
|
* Worked around compilation errors with older versions of gcc.  Benoit Steiner  2016-05-11
|
* Improved the portability of the tensor code.  Benoit Steiner  2016-05-11
|
* Fixed a couple of bugs related to the Pascal family of GPUs.  Benoit Steiner  2016-05-11
|
* Avoid unnecessary conversions between floats and doubles.  Benoit Steiner  2016-05-11
|
* Added more tests for half floats.  Benoit Steiner  2016-05-11
|
* Added the ability to load fp16 using the texture path. Improved the performance of some reductions on fp16.  Benoit Steiner  2016-05-11
|
* Removed deprecated flag (which apparently was ignored anyway).  Christoph Hertzberg  2016-05-11
|
* Fixed some double-promotion and sign-compare warnings.  Christoph Hertzberg  2016-05-11
|
* Fixed a typo in my previous commit.  Benoit Steiner  2016-05-11
|
* Fix potential race condition in the CUDA reduction code.  Benoit Steiner  2016-05-11
|
* Added a few tests to validate the generation of random tensors on GPU.  Benoit Steiner  2016-05-11
|
* Explicitly initialize all the atomic variables.  Benoit Steiner  2016-05-11
|
* Properly gate the use of half2.  Benoit Steiner  2016-05-10
|
* Added support for fp16 to the sigmoid functor.  Benoit Steiner  2016-05-10
|
* Small improvement to the full reduction of fp16.  Benoit Steiner  2016-05-10
|
* Added a test to validate the new non-blocking thread pool.  Benoit Steiner  2016-05-10
|
* Simplified the reduction code a little.  Benoit Steiner  2016-05-10
|
* Fixed compilation warning.  Benoit Steiner  2016-05-09
|
* Improved the performance of full reductions on GPU:  Benoit Steiner  2016-05-09
|   Before:
|     BM_fullReduction/10      200000       11751    8.51 MFlops/s
|     BM_fullReduction/80        5000      523385    12.23 MFlops/s
|     BM_fullReduction/640         50    36179326    11.32 MFlops/s
|     BM_fullReduction/4K           1  2173517195    11.50 MFlops/s
|   After:
|     BM_fullReduction/10      500000        5987    16.70 MFlops/s
|     BM_fullReduction/80      200000       10636    601.73 MFlops/s
|     BM_fullReduction/640      50000       58428    7010.31 MFlops/s
|     BM_fullReduction/4K        1000     2006106    12461.95 MFlops/s
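For reference, a minimal sketch (not from the commit) of the full-reduction expression being benchmarked, shown on the CPU for brevity; on the GPU the same expression is assigned through a GpuDevice.

    #include <unsupported/Eigen/CXX11/Tensor>

    int main() {
      Eigen::Tensor<float, 2> t(640, 640);
      t.setConstant(1.0f);

      // A full reduction collapses every dimension into a rank-0 tensor.
      Eigen::Tensor<float, 0> total = t.sum();
      return total() == 640.0f * 640.0f ? 0 : 1;
    }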
* Added the ability to use a scratch buffer in CUDA kernels.  Benoit Steiner  2016-05-09
|
* Added a new parallelFor API to the thread pool device.  Benoit Steiner  2016-05-09
|
* Optimized the non-blocking thread pool:  Benoit Steiner  2016-05-09
|   * Use a pseudo-random permutation of queue indices during random stealing. This ensures that all the queues are considered.
|   * Directly pop from a non-empty queue when we are waiting for work, instead of first noticing that there is a non-empty queue and then doing another round of random stealing to re-discover the non-empty queue.
|   * Steal only one task from a remote queue instead of half of the tasks.
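To illustrate the first bullet, a generic sketch (not Eigen's actual implementation) of visiting every worker queue in a pseudo-random permutation while stealing at most one task:

    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <numeric>
    #include <optional>
    #include <vector>

    // Minimal stand-in for a per-thread task queue (hypothetical).
    struct WorkQueue {
      std::mutex mu;
      std::deque<std::function<void()>> tasks;

      std::optional<std::function<void()>> PopBack() {  // the "steal" end
        std::lock_guard<std::mutex> lock(mu);
        if (tasks.empty()) return std::nullopt;
        auto task = std::move(tasks.back());
        tasks.pop_back();
        return task;
      }
    };

    // Walk all queues once in a pseudo-random permutation: start at a random
    // index and advance with a stride coprime to the queue count, so every
    // queue is considered, and take at most one task.
    std::optional<std::function<void()>> StealOnce(std::vector<WorkQueue>& queues,
                                                   uint64_t seed) {
      const size_t n = queues.size();
      if (n == 0) return std::nullopt;
      size_t idx = seed % n;
      size_t step = (n > 1) ? 1 + seed % (n - 1) : 1;
      while (std::gcd(step, n) != 1) ++step;  // guarantee full coverage
      for (size_t i = 0; i < n; ++i, idx = (idx + step) % n) {
        if (auto task = queues[idx].PopBack()) return task;
      }
      return std::nullopt;
    }

    int main() {
      std::vector<WorkQueue> queues(4);
      queues[2].tasks.push_back([] {});
      auto task = StealOnce(queues, /*seed=*/12345);
      if (task) (*task)();
      return task ? 0 : 1;
    }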