Commit message | Author | Date |
---|---|---|
Merged in ibab/eigen (pull request PR-192): add generic scan method | Benoit Steiner | 2016-06-03 |
Improved the performance of full reductions.<br>Before:<br>BM_fullReduction/10 9085 9087 74395 10.5M items/s<br>BM_fullReduction/64 9478 9478 72014 412.1M items/s<br>BM_fullReduction/512 14643 14646 46902 16.7G items/s<br>BM_fullReduction/4k 260338 260384 2678 60.0G items/s<br>BM_fullReduction/5k 385076 385178 1818 60.5G items/s<br>After:<br>BM_fullReduction/10 4541 4543 154017 21.0M items/s<br>BM_fullReduction/64 5191 5193 100000 752.5M items/s<br>BM_fullReduction/512 9588 9588 71361 25.5G items/s<br>BM_fullReduction/4k 244314 244281 2863 64.0G items/s<br>BM_fullReduction/5k 359382 359363 1946 64.8G items/s | Benoit Steiner | 2016-06-03 |
Add generic scan method | Igor Babuschkin | 2016-06-03 |
Align the first element of the Waiter struct instead of padding it. This reduces its memory footprint a bit while still preventing false sharing. | Benoit Steiner | 2016-06-02 |
Add syntactic sugar to Eigen tensors to allow more natural syntax. Specifically, this enables expressions involving scalar + tensor, scalar * tensor, scalar / tensor, and scalar - tensor. | Rasmus Munk Larsen | 2016-06-02 |
Add tensor scan op. This is the initial implementation of a generic scan operation. Based on it, cumsum and cumprod methods have been added to TensorBase. | Igor Babuschkin | 2016-06-02 |
Use a single PacketSize variable | Benoit Steiner | 2016-06-01 |
Fixed compilation warning | Benoit Steiner | 2016-06-01 |
Silenced compilation warning generated by nvcc. | Benoit Steiner | 2016-06-01 |
Added support for mean reductions on fp16 | Benoit Steiner | 2016-06-01 |
Only enable optimized reductions of fp16 if the reduction functor supports them | Benoit Steiner | 2016-05-31 |
Reimplement clamp as a static function. | Benoit Steiner | 2016-05-27 |
Use NULL instead of nullptr to preserve the compatibility with cxx03 | Benoit Steiner | 2016-05-27 |
Added a new operation to enable more powerful tensor indexing. | Benoit Steiner | 2016-05-27 |
Fixed some compilation warnings | Benoit Steiner | 2016-05-26 |
Preserve the ability to vectorize the evaluation of an expression even when it involves a cast that isn't vectorized (e.g. fp16 to float) | Benoit Steiner | 2016-05-26 |
Resolved merge conflicts | Benoit Steiner | 2016-05-26 |
Merged latest reduction improvements | Benoit Steiner | 2016-05-26 |
Improved the performance of inner reductions. | Benoit Steiner | 2016-05-26 |
Code cleanup. | Benoit Steiner | 2016-05-26 |
Made the static storage class qualifier come first. | Benoit Steiner | 2016-05-25 |
Deleted unnecessary explicit qualifiers. | Benoit Steiner | 2016-05-25 |
Don't mark inline functions as static since it confuses the ICC compiler | Benoit Steiner | 2016-05-25 |
Marked unused variables as such | Benoit Steiner | 2016-05-25 |
Made the IndexPair code compile in non-cxx11 mode | Benoit Steiner | 2016-05-25 |
Made the index pair list code more portable across various compilers | Benoit Steiner | 2016-05-25 |
Improved the performance of tensor padding | Benoit Steiner | 2016-05-25 |
Added support for statically known lists of pairs of indices | Benoit Steiner | 2016-05-25 |
There is no need to make the fp16 full reduction kernel a static function. | Benoit Steiner | 2016-05-24 |
Fixed compilation warning | Benoit Steiner | 2016-05-24 |
Merged in rmlarsen/eigen (pull request PR-188). Minor cleanups: 1. Get rid of a few unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL. | Benoit Steiner | 2016-05-23 |
Fix some sign-compare warnings | Christoph Hertzberg | 2016-05-22 |
Make EIGEN_HAS_CONSTEXPR user configurable | Gael Guennebaud | 2016-05-20 |
Make EIGEN_HAS_VARIADIC_TEMPLATES user configurable | Gael Guennebaud | 2016-05-20 |
Make EIGEN_HAS_RVALUE_REFERENCES user configurable | Gael Guennebaud | 2016-05-20 |
Rename EIGEN_HAVE_RVALUE_REFERENCES to EIGEN_HAS_RVALUE_REFERENCES | Gael Guennebaud | 2016-05-20 |
Merged eigen/eigen into default | Rasmus Larsen | 2016-05-18 |
Merge. | Rasmus Munk Larsen | 2016-05-18 |
Minor cleanups: 1. Get rid of unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL. | Rasmus Munk Larsen | 2016-05-18 |
Reduce overhead for small tensors and cheap ops by short-circuiting the cost computation and block size calculation in parallelFor. | Rasmus Munk Larsen | 2016-05-17 |
Allow vectorized padding on GPU. This helps speed things up a little.<br>Before:<br>BM_padding/10 5000000 460 217.03 MFlops/s<br>BM_padding/80 5000000 460 13899.40 MFlops/s<br>BM_padding/640 5000000 461 888421.17 MFlops/s<br>BM_padding/4K 5000000 460 54316322.55 MFlops/s<br>After:<br>BM_padding/10 5000000 454 220.20 MFlops/s<br>BM_padding/80 5000000 455 14039.86 MFlops/s<br>BM_padding/640 5000000 452 904968.83 MFlops/s<br>BM_padding/4K 5000000 411 60750049.21 MFlops/s | Benoit Steiner | 2016-05-17 |
Advertise the packet api of the tensor reducers iff the corresponding packet primitives are available. | Benoit Steiner | 2016-05-18 |
#if defined(EIGEN_USE_NONBLOCKING_THREAD_POOL) is now #if !defined(EIGEN_USE_SIMPLE_THREAD_POOL): the non-blocking thread pool is the default since it's more scalable, and one needs to request the old thread pool explicitly. | Benoit Steiner | 2016-05-17 |
Fixed compilation error | Benoit Steiner | 2016-05-17 |
Fixed compilation error in the tensor thread pool | Benoit Steiner | 2016-05-17 |
Merge upstream. | Rasmus Munk Larsen | 2016-05-17 |
Roll back changes to core. Move include of TensorFunctors.h up to satisfy dependence in TensorCostModel.h. | Rasmus Munk Larsen | 2016-05-17 |
Merged eigen/eigen into default | Rasmus Larsen | 2016-05-17 |
Enable the use of the packet api to evaluate tensor broadcasts. This speeds things up quite a bit.<br>Before:<br>BM_broadcasting/10 500000 3690 27.10 MFlops/s<br>BM_broadcasting/80 500000 4014 1594.24 MFlops/s<br>BM_broadcasting/640 100000 14770 27731.35 MFlops/s<br>BM_broadcasting/4K 5000 632711 39512.48 MFlops/s<br>After:<br>BM_broadcasting/10 500000 4287 23.33 MFlops/s<br>BM_broadcasting/80 500000 4455 1436.41 MFlops/s<br>BM_broadcasting/640 200000 10195 40173.01 MFlops/s<br>BM_broadcasting/4K 5000 423746 58997.57 MFlops/s | Benoit Steiner | 2016-05-17 |
Allow vectorized padding on GPU. This helps speed things up a little.<br>Before:<br>BM_padding/10 5000000 460 217.03 MFlops/s<br>BM_padding/80 5000000 460 13899.40 MFlops/s<br>BM_padding/640 5000000 461 888421.17 MFlops/s<br>BM_padding/4K 5000000 460 54316322.55 MFlops/s<br>After:<br>BM_padding/10 5000000 454 220.20 MFlops/s<br>BM_padding/80 5000000 455 14039.86 MFlops/s<br>BM_padding/640 5000000 452 904968.83 MFlops/s<br>BM_padding/4K 5000000 411 60750049.21 MFlops/s | Benoit Steiner | 2016-05-17 |