path: root/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h
Commit message (Author, Age)
* Vectorize and parallelize TensorScanOp. (Rasmus Munk Larsen, 2020-05-05)
|   TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions.
|
|   The benchmark numbers below are for cumulative row- and column reductions of NxN matrices.
|
|   name                          threads            old time/op   new time/op   delta
|   BM_cumSumRowReduction_1T/4    [using 1 threads]  25.1ns ± 1%   35.2ns ± 1%   +40.45%
|   BM_cumSumRowReduction_1T/8    [using 1 threads]  73.4ns ± 0%   82.7ns ± 3%   +12.74%
|   BM_cumSumRowReduction_1T/32   [using 1 threads]  988ns ± 0%    832ns ± 0%    -15.77%
|   BM_cumSumRowReduction_1T/64   [using 1 threads]  4.07µs ± 2%   3.47µs ± 0%   -14.70%
|   BM_cumSumRowReduction_1T/128  [using 1 threads]  18.0µs ± 0%   16.8µs ± 0%   -6.58%
|   BM_cumSumRowReduction_1T/512  [using 1 threads]  287µs ± 0%    281µs ± 0%    -2.22%
|   BM_cumSumRowReduction_1T/2k   [using 1 threads]  4.78ms ± 1%   4.78ms ± 2%   ~
|   BM_cumSumRowReduction_1T/10k  [using 1 threads]  117ms ± 1%    117ms ± 1%    ~
|   BM_cumSumRowReduction_8T/4    [using 8 threads]  25.0ns ± 0%   35.2ns ± 0%   +40.82%
|   BM_cumSumRowReduction_8T/8    [using 8 threads]  77.2ns ±16%   81.3ns ± 0%   ~
|   BM_cumSumRowReduction_8T/32   [using 8 threads]  988ns ± 0%    833ns ± 0%    -15.67%
|   BM_cumSumRowReduction_8T/64   [using 8 threads]  4.08µs ± 2%   3.47µs ± 0%   -14.95%
|   BM_cumSumRowReduction_8T/128  [using 8 threads]  18.0µs ± 0%   17.3µs ±10%   ~
|   BM_cumSumRowReduction_8T/512  [using 8 threads]  287µs ± 0%    58µs ± 6%     -79.92%
|   BM_cumSumRowReduction_8T/2k   [using 8 threads]  4.79ms ± 1%   0.64ms ± 1%   -86.58%
|   BM_cumSumRowReduction_8T/10k  [using 8 threads]  117ms ± 1%    18ms ± 6%     -84.50%
|   BM_cumSumColReduction_1T/4    [using 1 threads]  23.9ns ± 0%   33.4ns ± 1%   +39.68%
|   BM_cumSumColReduction_1T/8    [using 1 threads]  71.6ns ± 1%   49.1ns ± 3%   -31.40%
|   BM_cumSumColReduction_1T/32   [using 1 threads]  973ns ± 0%    165ns ± 2%    -83.10%
|   BM_cumSumColReduction_1T/64   [using 1 threads]  4.06µs ± 1%   0.57µs ± 1%   -85.94%
|   BM_cumSumColReduction_1T/128  [using 1 threads]  33.4µs ± 1%   4.1µs ± 1%    -87.67%
|   BM_cumSumColReduction_1T/512  [using 1 threads]  1.72ms ± 4%   0.21ms ± 5%   -87.91%
|   BM_cumSumColReduction_1T/2k   [using 1 threads]  119ms ±53%    11ms ±35%     -90.42%
|   BM_cumSumColReduction_1T/10k  [using 1 threads]  1.59s ±67%    0.35s ±49%    -77.96%
|   BM_cumSumColReduction_8T/4    [using 8 threads]  23.8ns ± 0%   33.3ns ± 0%   +40.06%
|   BM_cumSumColReduction_8T/8    [using 8 threads]  71.6ns ± 1%   49.2ns ± 5%   -31.33%
|   BM_cumSumColReduction_8T/32   [using 8 threads]  1.01µs ±12%   0.17µs ± 3%   -82.93%
|   BM_cumSumColReduction_8T/64   [using 8 threads]  4.15µs ± 4%   0.58µs ± 1%   -86.09%
|   BM_cumSumColReduction_8T/128  [using 8 threads]  33.5µs ± 0%   4.1µs ± 4%    -87.65%
|   BM_cumSumColReduction_8T/512  [using 8 threads]  1.71ms ± 3%   0.06ms ±16%   -96.21%
|   BM_cumSumColReduction_8T/2k   [using 8 threads]  97.1ms ±14%   3.0ms ±23%    -96.88%
|   BM_cumSumColReduction_8T/10k  [using 8 threads]  1.97s ± 8%    0.06s ± 2%    -96.74%
|
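
For readers unfamiliar with the operation benchmarked in the commit above, here is a minimal sketch of an inclusive cumulative sum along each direction of a matrix. It is not Eigen's TensorScanOp implementation. The helper names `cumsum_along_rows` and `cumsum_along_cols` and the flat row-major storage are assumptions made for illustration, and no mapping to the benchmark names is implied.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helpers (not Eigen API): in-place inclusive cumulative sums
// over a rows x cols matrix stored row-major in a flat vector.

// Scan along each row: element (r, c) becomes the sum of elements (r, 0..c).
// Each step depends on the previous element of the same row.
inline void cumsum_along_rows(std::vector<double>& m, std::size_t rows,
                              std::size_t cols) {
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 1; c < cols; ++c) {
      m[r * cols + c] += m[r * cols + c - 1];
    }
  }
}

// Scan along each column: element (r, c) becomes the sum of elements (0..r, c).
// The inner loop walks contiguous memory and its iterations are independent
// of each other, since each one updates a different column's running sum.
inline void cumsum_along_cols(std::vector<double>& m, std::size_t rows,
                              std::size_t cols) {
  for (std::size_t r = 1; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      m[r * cols + c] += m[(r - 1) * cols + c];
    }
  }
}
```

For the 2x2 matrix {1, 2, 3, 4}, the row scan yields {1, 3, 3, 7} and the column scan yields {1, 2, 4, 6}.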
* evalSubExprsIfNeededAsync + async TensorContractionThreadPool (Eugene Zhulenev, 2019-08-30)
|
* Remove shadow warnings in TensorDeviceThreadPool (Eugene Zhulenev, 2019-08-28)
|
* Asynchronous parallelFor in Eigen ThreadPoolDevice (Eugene Zhulenev, 2019-08-22)
|
* [SYCL] This PR adds the minimum modifications to the Eigen unsupported module required to run it on devices supporting SYCL. (Mehdi Goli, 2019-06-28)
|   - Abstracting the pointer type so that both SYCL memory and pointer can be captured.
|   - Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class.
|   - Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node.
|   - Adding SYCL macro for controlling loop unrolling.
|   - Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.
|
* Merged eigen/eigen into default (Deven Desai, 2019-03-19)
|\
| * Parallelize tensor contraction only by sharding dimension and use 'thread-local' memory for packing (Eugene Zhulenev, 2019-02-04)
| |
| * Fix shorten-64-to-32 warning. Use regular memcpy if num_threads==0. (Rasmus Munk Larsen, 2018-12-12)
| |
* | ROCm/HIP specific fixes + updates (Deven Desai, 2018-11-19)
| |   1. Eigen/src/Core/arch/GPU/Half.h
| |      Updating the HIPCC implementation of half so that it can be declared as a __shared__ variable
| |   2. Eigen/src/Core/util/Macros.h, Eigen/src/Core/util/Memory.h
| |      Introducing an EIGEN_USE_STD(func) macro that calls
| |      - std::func by default
| |      - ::func when Eigen is being compiled with HIPCC
| |      This change was requested in the previous HIP PR (https://bitbucket.org/eigen/eigen/pull-requests/518/pr-with-hip-specific-fixes-for-the-eigen/diff)
| |   3. unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h
| |      Removing EIGEN_DEVICE_FUNC attribute from pure virtual methods as it is not supported by HIPCC
| |   4. unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h
| |      Disabling the template specializations of InnerMostDimReducer as they run into HIPCC link errors
|/
* Remove accidental changes. (Rasmus Munk Larsen, 2018-11-12)
|
* Add parallel memcpy to TensorThreadPoolDevice in Eigen, but limit the number of threads to 4, beyond which we just seem to be wasting CPU cycles as the threads contend for memory bandwidth. (Rasmus Munk Larsen, 2018-11-12)
|
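
The idea in the commit above (split a copy across a few threads, but cap the count because memory bandwidth saturates) can be sketched with the standard library alone. This is not Eigen's code: `parallel_memcpy`, `kMaxCopyThreads`, and the `kMinBytesPerThread` threshold are hypothetical names and values chosen for illustration, and `std::thread` stands in for the thread pool.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not Eigen API): copy n bytes using at most 4 threads.
inline void parallel_memcpy(void* dst, const void* src, std::size_t n,
                            std::size_t available_threads) {
  const std::size_t kMaxCopyThreads = 4;            // cap taken from the commit message
  const std::size_t kMinBytesPerThread = 32 * 1024; // assumed threshold, not from the source

  std::size_t num_threads = std::min(available_threads, kMaxCopyThreads);
  num_threads = std::min(num_threads, n / kMinBytesPerThread);

  // Small copies, or no usable threads: fall back to a plain memcpy.
  if (num_threads <= 1) {
    std::memcpy(dst, src, n);
    return;
  }

  char* d = static_cast<char*>(dst);
  const char* s = static_cast<const char*>(src);
  const std::size_t chunk = (n + num_threads - 1) / num_threads;

  // Chunks 1..num_threads-1 go to worker threads.
  std::vector<std::thread> workers;
  for (std::size_t t = 1; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    if (begin >= n) break;
    const std::size_t len = std::min(chunk, n - begin);
    workers.emplace_back([=] { std::memcpy(d + begin, s + begin, len); });
  }

  // Chunk 0 is copied on the calling thread.
  std::memcpy(d, s, std::min(chunk, n));
  for (std::thread& w : workers) w.join();
}
```

Passing `available_threads == 0` takes the plain `memcpy` path, which mirrors the later "Use regular memcpy if num_threads==0" fix in this log.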
* Move from rvalue arguments in ThreadPool enqueue* methods (Eugene Zhulenev, 2018-10-16)
|
* Reduce thread scheduling overhead in parallelFor (Eugene Zhulenev, 2018-10-16)
|
* Fix shadowing of last and all (Gael Guennebaud, 2018-09-21)
|
* Add thread-local support on platforms that do not support it, through emulation using a hash map. (Rasmus Munk Larsen, 2018-08-13)
|
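
One way to emulate thread-local storage with a hash map, as the commit above describes in general terms, is to key a shared map by thread id. The sketch below is an assumption about the technique, not Eigen's implementation; `EmulatedThreadLocal` is a hypothetical class, and the mutex-protected `std::unordered_map` is a simplification.

```cpp
#include <mutex>
#include <thread>
#include <unordered_map>

// Hypothetical class (not Eigen API): a per-thread value emulated with a
// mutex-protected hash map keyed by the calling thread's id.
template <typename T>
class EmulatedThreadLocal {
 public:
  // Returns the calling thread's own instance, value-initializing it on
  // first use. References into std::unordered_map stay valid across
  // rehashing, so the returned reference can be used after unlocking.
  T& local() {
    std::lock_guard<std::mutex> lock(mu_);
    return values_[std::this_thread::get_id()];
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::thread::id, T> values_;
};
```

Each thread that calls `local()` sees only its own value. The lookup costs a lock and a hash per call, which is the price of not having compiler-supported thread-local storage.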
* Use NULL instead of nullptr to avoid adding a cxx11 requirement. (Benoit Steiner, 2018-08-13)
|
* Merged in paultucker/eigen (pull request PR-431) (Benoit Steiner, 2018-08-01)
|\
| |   Optional ThreadPoolDevice allocator
| |
| |   Approved-by: Benoit Steiner <benoit.steiner.goog@gmail.com>
| |
* | Distinguishing internal memory allocation/deallocation from explicit user memory allocation/deallocation. (Mehdi Goli, 2018-08-01)
| |
| * Change getAllocator() to allocator() in ThreadPoolDevice. (Paul Tucker, 2018-07-31)
| |
| * Add test coverage for ThreadPoolDevice optional allocator. (Paul Tucker, 2018-07-19)
| |
| * Actually add optional Allocator* arg to ThreadPoolDevice(). (Paul Tucker, 2018-07-16)
| |
| * Add optional Allocator argument to ThreadPoolDevice constructor. (Paul Tucker, 2018-07-16)
| |   When supplied, this allocator will be used in place of internal::aligned_malloc. This permits e.g. use of a NUMA-node specific allocator where the thread-pool is also restricted to a single NUMA-node.
|/
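
The optional-allocator pattern described in the commits above can be sketched as a device that defers to a user-supplied allocator when one is given and to a default otherwise. Everything here is illustrative: `Allocator`, `Device`, and `CountingAllocator` are hypothetical stand-ins, and `std::malloc` takes the place of `internal::aligned_malloc`.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical interface modeled on the commit description.
class Allocator {
 public:
  virtual ~Allocator() {}
  virtual void* allocate(std::size_t num_bytes) const = 0;
  virtual void deallocate(void* buffer) const = 0;
};

// Hypothetical device: uses the supplied allocator if present, otherwise
// falls back to the default allocation routine (std::malloc in this sketch).
class Device {
 public:
  explicit Device(const Allocator* allocator = nullptr)
      : allocator_(allocator) {}

  void* allocate(std::size_t num_bytes) const {
    return allocator_ ? allocator_->allocate(num_bytes)
                      : std::malloc(num_bytes);
  }

  void deallocate(void* buffer) const {
    if (allocator_) {
      allocator_->deallocate(buffer);
    } else {
      std::free(buffer);
    }
  }

 private:
  const Allocator* allocator_;  // not owned; may be null
};

// Example user allocator that tracks the number of live allocations.
// A NUMA-aware allocator would plug in the same way.
class CountingAllocator : public Allocator {
 public:
  void* allocate(std::size_t num_bytes) const override {
    ++live_;
    return std::malloc(num_bytes);
  }
  void deallocate(void* buffer) const override {
    --live_;
    std::free(buffer);
  }
  int live() const { return live_; }

 private:
  mutable int live_ = 0;
};
```

Keeping the allocator pointer optional means existing callers that construct the device without one are unaffected.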
* Fix oversharding bug in parallelFor. (Rasmus Munk Larsen, 2018-06-20)
|
* Add a ThreadPoolInterface* getter for ThreadPoolDevice. (Penporn Koanantakool, 2018-06-02)
|
* Specialize ThreadPoolDevice::enqueueNotification for the case with no args. (Rasmus Munk Larsen, 2017-10-13)
|   As an example this reduces binary size of a TensorFlow demo app for Android by about 2.5%.
|
* Moved the choice of ThreadPool to unsupported/Eigen/CXX11/ThreadPool (Benoit Steiner, 2016-12-12)
|
* Reduce dispatch overhead in parallelFor by only calling thread_pool.Schedule() for one of the two recursive calls in handleRange. (Rasmus Munk Larsen, 2016-11-14)
|   This avoids going through the schedule path to push both recursive calls onto another thread-queue in the binary tree, but instead executes one of them on the main thread. At the leaf level this will still activate a full complement of threads, but will save up to 50% of the overhead in Schedule (random number generation, insertion in queue which includes signaling via atomics).
|
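
The scheduling change described in the commit above can be illustrated with a recursive range split in which only one half is handed to another thread and the other half runs on the caller. This is a sketch under stated assumptions, not Eigen's `handleRange`: `handle_range` is a hypothetical function, and `std::async` stands in for `thread_pool.Schedule()`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>

// Hypothetical sketch (not Eigen code): process [first, last) in blocks of at
// most block_size elements by splitting the range in a binary tree.
inline void handle_range(
    std::size_t first, std::size_t last, std::size_t block_size,
    const std::function<void(std::size_t, std::size_t)>& f) {
  assert(block_size > 0 && first <= last);

  // Leaf: the range is small enough to run directly.
  if (last - first <= block_size) {
    f(first, last);
    return;
  }

  const std::size_t mid = first + (last - first) / 2;

  // Schedule only the upper half on another thread...
  std::future<void> upper = std::async(std::launch::async, [=, &f] {
    handle_range(mid, last, block_size, f);
  });

  // ...and execute the lower half on the current thread, which avoids one
  // scheduling round trip per split.
  handle_range(first, mid, block_size, f);

  upper.get();  // wait for the scheduled half before returning
}
```

With this shape, every split pays the scheduling cost once instead of twice, while the leaves still run on as many threads as there are splits.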
* Avoid unnecessary object copies (Benoit Steiner, 2016-08-01)
|
* Return -1 from CurrentThreadId when called by thread outside the pool. (Rasmus Munk Larsen, 2016-06-23)
|
* Resolve merge. (Rasmus Munk Larsen, 2016-06-23)
|\
* | size_t -> int (Rasmus Munk Larsen, 2016-06-03)
| |
* | Add CurrentThreadId and NumThreads methods to Eigen threadpools and TensorDeviceThreadPool. (Rasmus Munk Larsen, 2016-06-03)
| |
| * Use signed integers more consistently to encode the number of threads to use to evaluate a tensor expression. (Benoit Steiner, 2016-06-09)
|/
* Fixed compilation error in the tensor thread pool (Benoit Steiner, 2016-05-17)
|
* Merged eigen/eigen into default (Rasmus Larsen, 2016-05-17)
|\
| * Turn on the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define. (Benoit Steiner, 2016-05-13)
| |
* | Address comments by bsteiner. (Rasmus Munk Larsen, 2016-05-12)
| |
* | Improvements to parallelFor. (Rasmus Munk Larsen, 2016-05-12)
| |   Move some scalar functors from TensorFunctors. to Eigen core.
|/
* Added a new parallelFor api to the thread pool device. (Benoit Steiner, 2016-05-09)
|
* Provide access to the base threadpool classes (Benoit Steiner, 2016-04-21)
|
* Added the ability to switch to the new thread pool with a #define (Benoit Steiner, 2016-04-21)
|
* Added ability to access the cache sizes from the tensor devices (Benoit Steiner, 2016-04-14)
|
* Prepared the migration to the new non blocking thread pool (Benoit Steiner, 2016-04-14)
|
* Made it possible to customize the threadpool (Benoit Steiner, 2016-03-28)
|
* Fixed compilation error (Benoit Steiner, 2016-03-22)
|
* Pulled latest updates from trunk (Benoit Steiner, 2016-03-22)
|\
* | Use a single Barrier instead of a collection of Notifications to reduce the thread synchronization overhead (Benoit Steiner, 2016-03-22)
| |
| * Fixed a couple of typos (Benoit Steiner, 2016-03-22)
| |
| * Avoid using std::vector whenever possible (Benoit Steiner, 2016-03-22)
|/
* Split TensorDeviceType.h into 3 files to make it more manageable (Benoit Steiner, 2015-11-20)