Commit message | Author | Age | ||
---|---|---|---|---|
... | ||||
* | Converting all sycl buffers to uninitialised device only buffers; adding ↵ | Mehdi Goli | 2016-11-08 | |
| | | | | memcpyHostToDevice and memcpyDeviceToHost on syclDevice; modifying all examples to obey the new rules; moving sycl queue creation to the device based on Benoit's suggestion; removing the sycl-specific condition for returning m_result in TensorReduction.h according to Benoit's suggestion. |||
* | Removed the sycl include from Eigen/Core and moved it to ↵ | Mehdi Goli | 2016-11-04 | |
| | | | | Unsupported/Eigen/CXX11/Tensor; added TensorReduction for sycl (full reduction and partial reduction); added TensorReduction test case for sycl (full reduction and partial reduction); fixed the tile size in TensorSyclRun.h based on the device's max work group size; |||
* | Fixing the code indentation in the TensorReduction.h file. | Mehdi Goli | 2016-10-14 | |
| | ||||
* | Reducing the code by generalising sycl backend functions/structs. | Mehdi Goli | 2016-10-14 | |
| | ||||
* | Fixed a bug impacting some outer reductions on GPU | Benoit Steiner | 2016-09-12 | |
| | ||||
* | Don't attempt to optimize partial reductions when the optimized ↵ | Benoit Steiner | 2016-08-08 | |
| | | | | implementation doesn't buy anything. | |||
* | Improved partial reductions in more cases | Benoit Steiner | 2016-07-22 | |
| | ||||
* | Fix warnings | Gael Guennebaud | 2016-07-08 | |
| | ||||
* | Fix warning | Gael Guennebaud | 2016-07-07 | |
| | ||||
* | Use array_prod to compute the number of elements contained in the input ↵ | Benoit Steiner | 2016-06-04 | |
| | | | | tensor expression | |||
* | Improved the performance of full reductions. | Benoit Steiner | 2016-06-03 | |
| | AFTER:
| | BM_fullReduction/10       4541      4543    154017   21.0M items/s
| | BM_fullReduction/64       5191      5193    100000  752.5M items/s
| | BM_fullReduction/512      9588      9588     71361   25.5G items/s
| | BM_fullReduction/4k     244314    244281      2863   64.0G items/s
| | BM_fullReduction/5k     359382    359363      1946   64.8G items/s
| | BEFORE:
| | BM_fullReduction/10       9085      9087     74395   10.5M items/s
| | BM_fullReduction/64       9478      9478     72014  412.1M items/s
| | BM_fullReduction/512     14643     14646     46902   16.7G items/s
| | BM_fullReduction/4k     260338    260384      2678   60.0G items/s
| | BM_fullReduction/5k     385076    385178      1818   60.5G items/s |||
* | Resolved merge conflicts | Benoit Steiner | 2016-05-26 | |
| | ||||
* | Merged latest reduction improvements | Benoit Steiner | 2016-05-26 | |
|\ | ||||
* | | Improved the performance of inner reductions. | Benoit Steiner | 2016-05-26 | |
| | | ||||
* | | Merged in rmlarsen/eigen (pull request PR-188) | Benoit Steiner | 2016-05-23 | |
|\ \
| | | Minor cleanups: 1. Get rid of a few unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL. |||
* | | | Make EIGEN_HAS_CONSTEXPR user configurable | Gael Guennebaud | 2016-05-20 | |
| | | | ||||
* | | | Make EIGEN_HAS_VARIADIC_TEMPLATES user configurable | Gael Guennebaud | 2016-05-20 | |
| | | | ||||
| * | | Minor cleanups: 1. Get rid of unused variables. 2. Get rid of last uses of ↵ | Rasmus Munk Larsen | 2016-05-18 | |
|/ / | | | | | | | EIGEN_USE_COST_MODEL. | |||
| * | Allow vectorized padding on GPU. This helps speed things up a little. | Benoit Steiner | 2016-05-17 | |
|/
| Before:
| BM_padding/10    5000000   460        217.03 MFlops/s
| BM_padding/80    5000000   460      13899.40 MFlops/s
| BM_padding/640   5000000   461     888421.17 MFlops/s
| BM_padding/4K    5000000   460   54316322.55 MFlops/s
| After:
| BM_padding/10    5000000   454        220.20 MFlops/s
| BM_padding/80    5000000   455      14039.86 MFlops/s
| BM_padding/640   5000000   452     904968.83 MFlops/s
| BM_padding/4K    5000000   411   60750049.21 MFlops/s |||
* | Improved the portability of the tensor code | Benoit Steiner | 2016-05-11 | |
| | ||||
* | Properly gate the use of half2. | Benoit Steiner | 2016-05-10 | |
| | ||||
* | Improved the performance of full reductions on GPU: | Benoit Steiner | 2016-05-09 | |
| | Before:
| | BM_fullReduction/10    200000        11751       8.51 MFlops/s
| | BM_fullReduction/80      5000       523385      12.23 MFlops/s
| | BM_fullReduction/640       50     36179326      11.32 MFlops/s
| | BM_fullReduction/4K         1   2173517195      11.50 MFlops/s
| | After:
| | BM_fullReduction/10    500000         5987      16.70 MFlops/s
| | BM_fullReduction/80    200000        10636     601.73 MFlops/s
| | BM_fullReduction/640    50000        58428    7010.31 MFlops/s
| | BM_fullReduction/4K      1000      2006106   12461.95 MFlops/s |||
* | Eigen Tensor cost model part 2: Thread scheduling for standard evaluators ↵ | Rasmus Munk Larsen | 2016-04-14 | |
| | | | | and reductions. The cost model is turned off by default. | |||
* | Eigen cost model part 1. This implements a basic recursive framework to ↵ | Rasmus Munk Larsen | 2016-04-14 | |
| | | | | estimate the cost of evaluating tensor expressions. | |||
* | Fixed compilation warnings on arm | Benoit Steiner | 2016-03-28 | |
| | ||||
* | Avoid unnecessary conversions | Benoit Steiner | 2016-03-23 | |
| | ||||
* | Fixed compilation warning | Benoit Steiner | 2016-03-23 | |
| | ||||
* | Use a single Barrier instead of a collection of Notifications to reduce the ↵ | Benoit Steiner | 2016-03-22 | |
| | | | | thread synchronization overhead | |||
* | Avoid implicit cast | Benoit Steiner | 2016-03-09 | |
| | ||||
* | Avoid unnecessary conversion from 32-bit int to 64-bit unsigned int | Benoit Steiner | 2016-03-09 | |
| | ||||
* | Replace std::vector with our own implementation, as using the stl when ↵ | Benoit Steiner | 2016-03-08 | |
| | | | | compiling with nvcc and avx enabled leads to many issues. | |||
* | Simplified the full reduction code | Benoit Steiner | 2016-03-08 | |
| | ||||
* | Decoupled the packet type definition from the definition of the tensor ops. ↵ | Benoit Steiner | 2016-03-08 | |
| | | | | All the vectorization is now defined in the tensor evaluators. This will make it possible to reliably support devices with different packet types in the same compilation unit. |||
* | Made the signature of the inner and outer reducers consistent | Benoit Steiner | 2016-02-29 | |
| | ||||
* | Optimized the performance of narrow reductions on CUDA devices | Benoit Steiner | 2016-02-29 | |
| | ||||
* | Fixed a typo in the reduction code that could prevent large full reductions ↵ | Benoit Steiner | 2016-02-24 | |
| | | | | from running properly on old cuda devices. | |||
* | Fixed a number of compilation warnings generated by the cuda tests | Benoit Steiner | 2016-01-31 | |
| | ||||
* | Fixed a couple of compilation warnings. | Benoit Steiner | 2016-01-28 | |
| | ||||
* | Fixed some compilation problems with nvcc + clang | Benoit Steiner | 2016-01-27 | |
| | ||||
* | Record whether the underlying tensor storage can be accessed directly during ↵ | Benoit Steiner | 2016-01-19 | |
| | | | | the evaluation of an expression. | |||
* | Properly record the rank of reduced tensors in the tensor traits. | Benoit Steiner | 2016-01-13 | |
| | ||||
* | Merged in jeremy_barnes/eigen/shader-model-3.0 (pull request PR-152) | Benoit Steiner | 2016-01-11 | |
|\ | | | | | | | Alternative way of forcing instantiation of device kernels without causing warnings or requiring device to device kernel invocations. | |||
* | | Fixed a bug in the dispatch of optimized reduction kernels. | Benoit Steiner | 2016-01-11 | |
| | | ||||
* | | Re-enabled the optimized reduction CUDA code. | Benoit Steiner | 2016-01-11 | |
| | | ||||
| * | Alternative way of forcing instantiation of device kernels without | Jeremy Barnes | 2016-01-10 | |
|/ | | | | | | causing warnings or requiring device to device kernel invocations. This allows Tensorflow to work on SM 3.0 (ie, Amazon EC2) machines. | |||
* | Simplified the dispatch code. | Benoit Steiner | 2016-01-08 | |
| | ||||
* | Reworked the dispatch of optimized cuda reduction kernels to work around a ↵ | Benoit Steiner | 2016-01-08 | |
| | | | | nvcc bug that prevented the code from compiling in optimized mode in some cases. |||
* | Improved the performance of reductions on CUDA devices | Benoit Steiner | 2016-01-04 | |
| | ||||
* | Optimized outer reduction on GPUs. | Benoit Steiner | 2015-12-22 | |
| | ||||
* | Silenced some compilation warnings triggered by nvcc | Benoit Steiner | 2015-12-17 | |
| |