* Address comments by bsteiner. (2016-05-12)
* Improvements to parallelFor. (2016-05-12)
  Move some scalar functors from TensorFunctors.h to Eigen core.
* Worked around a compilation error triggered by nvcc when compiling a tensor concatenation kernel. (2016-05-12)
* Fixed a potential race condition in the non-blocking thread pool. (2016-05-12)
* Replace an implicit cast with an explicit one. (2016-05-12)
* Worked around compilation errors with older versions of gcc. (2016-05-11)
* Improved the portability of the tensor code. (2016-05-11)
* Added the ability to load fp16 using the texture path. (2016-05-11)
  Improved the performance of some reductions on fp16.
* Removed a deprecated flag (which apparently was ignored anyway). (2016-05-11)
* Fixed some double-promotion and sign-compare warnings. (2016-05-11)
* Fixed a typo in my previous commit. (2016-05-11)
* Fixed a potential race condition in the CUDA reduction code. (2016-05-11)
* Explicitly initialize all the atomic variables. (2016-05-11)
* Properly gate the use of half2. (2016-05-10)
* Added support for fp16 to the sigmoid functor. (2016-05-10)
* Small improvement to the full reduction of fp16. (2016-05-10)
* Simplified the reduction code a little. (2016-05-10)
* Improved the performance of full reductions on GPU. (2016-05-09)
  Before:
    BM_fullReduction/10    200000       11751        8.51 MFlops/s
    BM_fullReduction/80      5000      523385       12.23 MFlops/s
    BM_fullReduction/640       50    36179326       11.32 MFlops/s
    BM_fullReduction/4K         1  2173517195       11.50 MFlops/s
  After:
    BM_fullReduction/10    500000        5987       16.70 MFlops/s
    BM_fullReduction/80    200000       10636      601.73 MFlops/s
    BM_fullReduction/640    50000       58428     7010.31 MFlops/s
    BM_fullReduction/4K      1000     2006106    12461.95 MFlops/s
* Added the ability to use a scratch buffer in CUDA kernels. (2016-05-09)
* Added a new parallelFor API to the thread pool device. (2016-05-09)
* Optimized the non-blocking thread pool: (2016-05-09)
  * Use a pseudo-random permutation of queue indices during random stealing. This ensures that all the queues are considered.
  * Directly pop from a non-empty queue when we are waiting for work, instead of first noticing that there is a non-empty queue and then doing another round of random stealing to re-discover the non-empty queue.
  * Steal only 1 task from a remote queue, instead of half of its tasks.
* Marked a few tensor operations as read-only. (2016-05-05)
* Relaxed an assertion that was tighter than necessary. (2016-05-05)
* Fixed some incorrect assertions. (2016-05-05)
* Strongly hint, but don't force, the compiler to unroll some loops in the tensor executor. This results in up to 27% faster code. (2016-05-05)
* Added tests for full contractions using thread pools and GPU devices. (2016-05-05)
  Fixed a couple of issues in the corresponding code.
* Updated the contraction code to ensure that full contractions return a tensor of rank 0. (2016-05-05)
* Enable and fix -Wdouble-conversion warnings. (2016-05-05)
* Removed extraneous 'explicit' keywords. (2016-05-04)
* Use numext::isfinite instead of std::isfinite. (2016-05-03)
* Deleted a superfluous explicit keyword. (2016-05-03)
* Fixed a compilation error. (2016-05-01)
* Added missing accessors to fixed-size tensors. (2016-04-29)
* Deleted trailing commas. (2016-04-29)
* Deleted useless trailing commas. (2016-04-29)
* Deleted unnecessary trailing commas. (2016-04-29)
* Return the proper size (i.e. 1) for tensors of rank 0. (2016-04-29)
* Deleted unused default values for template parameters. (2016-04-29)
* Restore Tensor support for non-C++11 compilers. (2016-04-29)
* Fixed an include path. (2016-04-29)
* Fix missing inclusion of Eigen/Core. (2016-04-27)
* Use computeProductBlockingSizes to compute blocking for both the ShardByCol and ShardByRow cases. (2016-04-27)
* Refactor the unsupported CXX11/Core module to internal headers only. (2016-04-26)
* Fixed the partial evaluation of non-vectorizable tensor subexpressions. (2016-04-25)
* Refined the cost of the striding operation. (2016-04-25)
* Provide access to the base thread pool classes. (2016-04-21)
* Added the ability to switch to the new thread pool with a #define. (2016-04-21)
* Fixed several compilation warnings. (2016-04-21)
* Don't crash when attempting to reduce empty tensors. (2016-04-20)
* Started to implement a portable way to yield. (2016-04-19)