eigen - C++ library for linear algebra

	Commit message	Author	Age
*	Vectorize and parallelize TensorScanOp.	Rasmus Munk Larsen	2020-05-05
\|	TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions. The benchmark numbers below are for cumulative row and column reductions of NxN matrices.
\|
\|	name                                              old time/op    new time/op    delta
\|	BM_cumSumRowReduction_1T/4    [using 1 threads ]  25.1ns ± 1%    35.2ns ± 1%    +40.45%
\|	BM_cumSumRowReduction_1T/8    [using 1 threads ]  73.4ns ± 0%    82.7ns ± 3%    +12.74%
\|	BM_cumSumRowReduction_1T/32   [using 1 threads ]  988ns ± 0%     832ns ± 0%     -15.77%
\|	BM_cumSumRowReduction_1T/64   [using 1 threads ]  4.07µs ± 2%    3.47µs ± 0%    -14.70%
\|	BM_cumSumRowReduction_1T/128  [using 1 threads ]  18.0µs ± 0%    16.8µs ± 0%    -6.58%
\|	BM_cumSumRowReduction_1T/512  [using 1 threads ]  287µs ± 0%     281µs ± 0%     -2.22%
\|	BM_cumSumRowReduction_1T/2k   [using 1 threads ]  4.78ms ± 1%    4.78ms ± 2%    ~
\|	BM_cumSumRowReduction_1T/10k  [using 1 threads ]  117ms ± 1%     117ms ± 1%     ~
\|	BM_cumSumRowReduction_8T/4    [using 8 threads ]  25.0ns ± 0%    35.2ns ± 0%    +40.82%
\|	BM_cumSumRowReduction_8T/8    [using 8 threads ]  77.2ns ±16%    81.3ns ± 0%    ~
\|	BM_cumSumRowReduction_8T/32   [using 8 threads ]  988ns ± 0%     833ns ± 0%     -15.67%
\|	BM_cumSumRowReduction_8T/64   [using 8 threads ]  4.08µs ± 2%    3.47µs ± 0%    -14.95%
\|	BM_cumSumRowReduction_8T/128  [using 8 threads ]  18.0µs ± 0%    17.3µs ±10%    ~
\|	BM_cumSumRowReduction_8T/512  [using 8 threads ]  287µs ± 0%     58µs ± 6%      -79.92%
\|	BM_cumSumRowReduction_8T/2k   [using 8 threads ]  4.79ms ± 1%    0.64ms ± 1%    -86.58%
\|	BM_cumSumRowReduction_8T/10k  [using 8 threads ]  117ms ± 1%     18ms ± 6%      -84.50%
\|	BM_cumSumColReduction_1T/4    [using 1 threads ]  23.9ns ± 0%    33.4ns ± 1%    +39.68%
\|	BM_cumSumColReduction_1T/8    [using 1 threads ]  71.6ns ± 1%    49.1ns ± 3%    -31.40%
\|	BM_cumSumColReduction_1T/32   [using 1 threads ]  973ns ± 0%     165ns ± 2%     -83.10%
\|	BM_cumSumColReduction_1T/64   [using 1 threads ]  4.06µs ± 1%    0.57µs ± 1%    -85.94%
\|	BM_cumSumColReduction_1T/128  [using 1 threads ]  33.4µs ± 1%    4.1µs ± 1%     -87.67%
\|	BM_cumSumColReduction_1T/512  [using 1 threads ]  1.72ms ± 4%    0.21ms ± 5%    -87.91%
\|	BM_cumSumColReduction_1T/2k   [using 1 threads ]  119ms ±53%     11ms ±35%      -90.42%
\|	BM_cumSumColReduction_1T/10k  [using 1 threads ]  1.59s ±67%     0.35s ±49%     -77.96%
\|	BM_cumSumColReduction_8T/4    [using 8 threads ]  23.8ns ± 0%    33.3ns ± 0%    +40.06%
\|	BM_cumSumColReduction_8T/8    [using 8 threads ]  71.6ns ± 1%    49.2ns ± 5%    -31.33%
\|	BM_cumSumColReduction_8T/32   [using 8 threads ]  1.01µs ±12%    0.17µs ± 3%    -82.93%
\|	BM_cumSumColReduction_8T/64   [using 8 threads ]  4.15µs ± 4%    0.58µs ± 1%    -86.09%
\|	BM_cumSumColReduction_8T/128  [using 8 threads ]  33.5µs ± 0%    4.1µs ± 4%     -87.65%
\|	BM_cumSumColReduction_8T/512  [using 8 threads ]  1.71ms ± 3%    0.06ms ±16%    -96.21%
\|	BM_cumSumColReduction_8T/2k   [using 8 threads ]  97.1ms ±14%    3.0ms ±23%     -96.88%
\|	BM_cumSumColReduction_8T/10k  [using 8 threads ]  1.97s ± 8%     0.06s ± 2%     -96.74%
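For readers unfamiliar with the operation being benchmarked, here is a minimal sketch of a cumulative sum along the rows or columns of an NxN matrix. It is not Eigen's TensorScanOp: the helper name `cumsum`, the `axis` parameter, and the row-major layout are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Eigen): in-place inclusive cumulative sum
// of an n x n matrix stored row-major in `m`.
//   axis == 1: running sum along each row.
//   axis == 0: running sum down each column.
inline void cumsum(std::vector<double>& m, std::size_t n, int axis) {
  assert(m.size() == n * n && (axis == 0 || axis == 1));
  if (axis == 1) {
    for (std::size_t r = 0; r < n; ++r) {
      // Each element depends on its left neighbour, so this inner loop is
      // inherently sequential; separate rows are independent of each other.
      for (std::size_t c = 1; c < n; ++c) {
        m[r * n + c] += m[r * n + c - 1];
      }
    }
  } else {
    for (std::size_t r = 1; r < n; ++r) {
      // Iterations over c are independent of each other, so a compiler may
      // vectorize this inner loop.
      for (std::size_t c = 0; c < n; ++c) {
        m[r * n + c] += m[(r - 1) * n + c];
      }
    }
  }
}
```

For the 2x2 matrix {1, 2, 3, 4}, `cumsum(m, 2, 1)` yields {1, 3, 3, 7} and `cumsum(m, 2, 0)` yields {1, 2, 4, 6}.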
*	evalSubExprsIfNeededAsync + async TensorContractionThreadPool	Eugene Zhulenev	2019-08-30
\|
*	Remove shadow warnings in TensorDeviceThreadPool	Eugene Zhulenev	2019-08-28
\|
*	Asynchronous parallelFor in Eigen ThreadPoolDevice	Eugene Zhulenev	2019-08-22
\|
*	[SYCL] This PR adds the minimum modifications to the Eigen unsupported ↵	Mehdi Goli	2019-06-28
\|	module required to run it on devices supporting SYCL.
\|	* Abstracting the pointer type so that both SYCL memory and pointer can be captured.
\|	* Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class.
\|	* Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node.
\|	* Adding SYCL macro for controlling loop unrolling.
\|	* Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.
*	Merged eigen/eigen into default	Deven Desai	2019-03-19
\|\
\| *	Parallelize tensor contraction only by sharding dimension and use ↵	Eugene Zhulenev	2019-02-04
\| \| \| \| \| \| \| \|	'thread-local' memory for packing
\| *	Fix shorten-64-to-32 warning. Use regular memcpy if num_threads==0.	Rasmus Munk Larsen	2018-12-12
\| \|
* \|	ROCm/HIP specific fixes + updates	Deven Desai	2018-11-19
\|/	1. Eigen/src/Core/arch/GPU/Half.h
\|	   Updating the HIPCC implementation of half so that it can be declared as a __shared__ variable.
\|	2. Eigen/src/Core/util/Macros.h, Eigen/src/Core/util/Memory.h
\|	   Introducing an EIGEN_USE_STD(func) macro that calls std::func by default, and ::func when Eigen is being compiled with HIPCC. This change was requested in the previous HIP PR (https://bitbucket.org/eigen/eigen/pull-requests/518/pr-with-hip-specific-fixes-for-the-eigen/diff).
\|	3. unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h
\|	   Removing the EIGEN_DEVICE_FUNC attribute from pure virtual methods, as it is not supported by HIPCC.
\|	4. unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h
\|	   Disabling the template specializations of InnerMostDimReducer, as they run into HIPCC link errors.
*	Remove accidental changes.	Rasmus Munk Larsen	2018-11-12
\|
*	Add parallel memcpy to TensorThreadPoolDevice in Eigen, but limit the number ↵	Rasmus Munk Larsen	2018-11-12
\| \| \| \|	of threads to 4, beyond which we just seem to be wasting CPU cycles as the threads contend for memory bandwidth.
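A minimal sketch of the idea in this commit message: split a copy into chunks and cap the worker count at 4. This is not the code in TensorDeviceThreadPool.h. The function name `parallel_memcpy` and the use of `std::thread` in place of Eigen's thread pool are assumptions made to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not Eigen's API): copy n bytes using at most 4 threads.
inline void parallel_memcpy(void* dst, const void* src, std::size_t n,
                            std::size_t num_threads) {
  assert(dst != nullptr && src != nullptr);
  const std::size_t kMaxThreads = 4;  // cap taken from the commit message
  const std::size_t threads = std::min(num_threads, kMaxThreads);
  if (threads <= 1 || n < threads) {
    std::memcpy(dst, src, n);  // too few threads or bytes to be worth splitting
    return;
  }
  char* d = static_cast<char*>(dst);
  const char* s = static_cast<const char*>(src);
  const std::size_t chunk = (n + threads - 1) / threads;  // ceil(n / threads)
  std::vector<std::thread> workers;
  for (std::size_t t = 1; t < threads; ++t) {
    const std::size_t begin = t * chunk;
    if (begin >= n) break;
    const std::size_t len = std::min(chunk, n - begin);
    workers.emplace_back([=] { std::memcpy(d + begin, s + begin, len); });
  }
  std::memcpy(d, s, std::min(chunk, n));  // first chunk on the calling thread
  for (std::thread& w : workers) w.join();
}
```

Calling `parallel_memcpy(dst, src, n, 8)` would therefore use 4 threads in total: three spawned workers plus the caller.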
*	Move from rvalue arguments in ThreadPool enqueue* methods	Eugene Zhulenev	2018-10-16
\|
*	Reduce thread scheduling overhead in parallelFor	Eugene Zhulenev	2018-10-16
\|
*	Fix shadowing of last and all	Gael Guennebaud	2018-09-21
\|
*	Add support for thread local support on platforms that do not support it ↵	Rasmus Munk Larsen	2018-08-13
\| \| \| \|	through emulation using a hash map.
*	Use NULL instead of nullptr to avoid adding a cxx11 requirement.	Benoit Steiner	2018-08-13
\|
*	Merged in paultucker/eigen (pull request PR-431)	Benoit Steiner	2018-08-01
\|\ \| \| \| \| \| \| \| \| \| \|	Optional ThreadPoolDevice allocator Approved-by: Benoit Steiner <benoit.steiner.goog@gmail.com>
* \|	Distinguishing between internal memory allocation/deallocation from explicit ↵	Mehdi Goli	2018-08-01
\| \| \| \| \| \| \| \|	user memory allocation/deallocation.
\| *	Change getAllocator() to allocator() in ThreadPoolDevice.	Paul Tucker	2018-07-31
\| \|
\| *	Add test coverage for ThreadPoolDevice optional allocator.	Paul Tucker	2018-07-19
\| \|
\| *	Actually add optional Allocator* arg to ThreadPoolDevice().	Paul Tucker	2018-07-16
\| \|
\| *	Add optional Allocator argument to ThreadPoolDevice constructor.	Paul Tucker	2018-07-16
\|/ \| \| \| \| \|	When supplied, this allocator will be used in place of internal::aligned_malloc. This permits e.g. use of a NUMA-node specific allocator where the thread-pool is also restricted to a single NUMA node.
*	Fix oversharding bug in parallelFor.	Rasmus Munk Larsen	2018-06-20
\|
*	Add a ThreadPoolInterface* getter for ThreadPoolDevice.	Penporn Koanantakool	2018-06-02
\|
*	Specialize ThreadPoolDevice::enqueueNotification for the case with no args. ↵	Rasmus Munk Larsen	2017-10-13
\| \| \| \|	As an example this reduces binary size of a TensorFlow demo app for Android by about 2.5%.
*	Moved the choice of ThreadPool to unsupported/Eigen/CXX11/ThreadPool	Benoit Steiner	2016-12-12
\|
*	Reduce dispatch overhead in parallelFor by only calling ↵	Rasmus Munk Larsen	2016-11-14
\| \| \| \|	thread_pool.Schedule() for one of the two recursive calls in handleRange. This avoids going through the schedule path to push both recursive calls onto another thread-queue in the binary tree, but instead executes one of them on the main thread. At the leaf level this will still activate a full complement of threads, but will save up to 50% of the overhead in Schedule (random number generation, insertion in queue which includes signaling via atomics).
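The splitting strategy described in this commit message can be sketched as follows. This is an illustration, not Eigen's handleRange: the name `handle_range` is hypothetical, and spawning a `std::thread` stands in for thread_pool.Schedule(). A real pool would enqueue the task instead of creating a thread, and would not block on a join.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical helper: recursively halve [first, last) until a piece is at
// most block_size long, then run f on that piece.
inline void handle_range(
    std::ptrdiff_t first, std::ptrdiff_t last, std::ptrdiff_t block_size,
    const std::function<void(std::ptrdiff_t, std::ptrdiff_t)>& f) {
  assert(block_size > 0 && first <= last);
  if (last - first <= block_size) {
    f(first, last);  // leaf: do the actual work
    return;
  }
  const std::ptrdiff_t mid = first + (last - first) / 2;
  // Only the upper half is handed to another thread (the Schedule() stand-in).
  std::thread upper([=, &f] { handle_range(mid, last, block_size, f); });
  // The lower half runs on the current thread, avoiding a second dispatch.
  handle_range(first, mid, block_size, f);
  upper.join();
}
```

With this shape, every internal node of the binary tree pays for one dispatch rather than two, while the leaves still end up spread across threads.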
*	Avoid unnecessary object copies	Benoit Steiner	2016-08-01
\|
*	Return -1 from CurrentThreadId when called by thread outside the pool.	Rasmus Munk Larsen	2016-06-23
\|
*	Resolve merge.	Rasmus Munk Larsen	2016-06-23
\|\
* \|	size_t -> int	Rasmus Munk Larsen	2016-06-03
\| \|
* \|	Add CurrentThreadId and NumThreads methods to Eigen threadpools and ↵	Rasmus Munk Larsen	2016-06-03
\| \| \| \| \| \| \| \|	TensorDeviceThreadPool.
\| *	Use signed integers more consistently to encode the number of threads to use ↵	Benoit Steiner	2016-06-09
\|/ \| \| \|	to evaluate a tensor expression.
*	Fixed compilation error in the tensor thread pool	Benoit Steiner	2016-05-17
\|
*	Merged eigen/eigen into default	Rasmus Larsen	2016-05-17
\|\
\| *	Turn on the new thread pool by default since it scales much better over ↵	Benoit Steiner	2016-05-13
\| \| \| \| \| \| \| \|	multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.
* \|	Address comments by bsteiner.	Rasmus Munk Larsen	2016-05-12
\| \|
* \|	Improvements to parallelFor.	Rasmus Munk Larsen	2016-05-12
\|/ \| \| \|	Move some scalar functors from TensorFunctors. to Eigen core.
*	Added a new parallelFor api to the thread pool device.	Benoit Steiner	2016-05-09
\|
*	Provide access to the base threadpool classes	Benoit Steiner	2016-04-21
\|
*	Added the ability to switch to the new thread pool with a #define	Benoit Steiner	2016-04-21
\|
*	Added ability to access the cache sizes from the tensor devices	Benoit Steiner	2016-04-14
\|
*	Prepared the migration to the new non blocking thread pool	Benoit Steiner	2016-04-14
\|
*	Made it possible to customize the threadpool	Benoit Steiner	2016-03-28
\|
*	Fixed compilation error	Benoit Steiner	2016-03-22
\|
*	Pulled latest updates from trunk	Benoit Steiner	2016-03-22
\|\
* \|	Use a single Barrier instead of a collection of Notifications to reduce the ↵	Benoit Steiner	2016-03-22
\| \| \| \| \| \| \| \|	thread synchronization overhead
\| *	Fixed a couple of typos	Benoit Steiner	2016-03-22
\| \|
\| *	Avoid using std::vector whenever possible	Benoit Steiner	2016-03-22
\|/
*	Split TensorDeviceType.h in 3 files to make it more manageable	Benoit Steiner	2015-11-20