Commit message | Author | Age | ||
---|---|---|---|---|
... | ||||
* | | Pulled latest updates from trunk | Benoit Steiner | 2015-02-27 | |
|\ \ | ||||
* | | | Added support for 32bit index on a per tensor/tensor expression. This ↵ | Benoit Steiner | 2015-02-27 | |
| | | | | | | | | | | | | enables us to use 32bit indices to evaluate expressions on GPU faster while keeping the ability to use 64 bit indices to manipulate large tensors on CPU in the same binary. | |||
* | | | Switch to truncated casting when converting floating point types to integer. ↵ | Benoit Steiner | 2015-02-27 | |
| | | | | | | | | | | | | This ensures that vectorized casts are consistent with scalar casts | |||
* | | | Added support for vectorized type casting of tensors | Benoit Steiner | 2015-02-27 | |
| | | | ||||
* | | | Added support for fast reciprocal square root computation. | Benoit Steiner | 2015-02-26 | |
| | | | ||||
| | * | Really use zero guess in ConjugateGradients::solve as documented | Jan Blechta | 2015-02-18 | |
| | | | | | | | | | | | | and expected for consistency with other methods. | |||
| | * | merge | Gael Guennebaud | 2015-03-04 | |
| | |\ | ||||
| | * | | Check for no-reallocation in SparseMatrix::insert (bug #974) | Gael Guennebaud | 2015-03-04 | |
| | | | | ||||
| | * | | Improve efficiency of SparseMatrix::insert/coeffRef for sequential ↵ | Gael Guennebaud | 2015-03-04 | |
| | | | | | | | | | | | | | | | | outer-index insertion strategies (bug #974) | |||
| | * | | Update manual wrt new LSCG solver. | Gael Guennebaud | 2015-03-04 | |
| | | | | ||||
| | * | | Add a CG-based solver for rectangular least-square problems (bug #975). | Gael Guennebaud | 2015-03-04 | |
| | | | | ||||
| | | * | Fix asm comments in 1px1 kernel | Benoit Jacob | 2015-03-03 | |
| | | | | ||||
| | | * | Add a benchmark-default-sizes action to benchmark-blocking-sizes.cpp | Benoit Jacob | 2015-03-03 | |
| | | | | ||||
| | | * | New scoring functor to select the pivot. | Marc Glisse | 2015-03-03 | |
| | | | | | | | | | | | | | | | | This can be useful for non-floating point scalars, where choosing the biggest element is generally not the best choice. | |||
| | | * | must also disable complex<double> when disabling double vectorization | Benoit Jacob | 2015-03-03 | |
| | |/ | ||||
| | * | Work around an ICE in Clang 3.5 in the iOS toolchain with double NEON ↵ | Benoit Jacob | 2015-03-03 | |
| | | | | | | | | | | | | intrinsics. | |||
| | * | HalfPacket also needed to be disabled for double, on ARMv8. | Benoit Jacob | 2015-03-02 | |
| | | | ||||
| | * | Add SSE vectorization of Quaternion::conjugate. Significant speed-up when ↵ | Gael Guennebaud | 2015-03-02 | |
| | | | | | | | | | | | | combined with products like q1*q2.conjugate() | |||
| | * | Increase unit-test L1 cache size to ensure we are doing at least 2 peeled ↵ | Gael Guennebaud | 2015-02-27 | |
| | | | | | | | | | | | | loops within product kernel. | |||
| | * | Re-enable detection of min/max parentheses protection, and re-enable ↵ | Gael Guennebaud | 2015-02-27 | |
| |/ | | | | | | | mpreal_support unit test. | |||
| * | Reimplement the selection between rotating and non-rotating kernels | Benoit Jacob | 2015-02-27 | |
| | | | | | | | | | | | | using templates instead of macros and if()'s. That was needed to fix the build of unit tests on ARM, which I had broken. My bad for not testing earlier. | |||
| * | remove trailing comma | Benoit Jacob | 2015-02-27 | |
| | | ||||
| * | Disable Packet2f/2i halfpacket support in NEON. | Benoit Jacob | 2015-02-27 | |
| | | | | | | | | | | | | I believe that it was erroneously turned on, since Packet2f/2i intrinsics are unimplemented, and code trying to use halfpackets just fails to compile on NEON, as it tries to use the default implementation of pload/pstore and the types don't match. | |||
| * | Replace a static assert by a runtime one, fixes the build of unit tests on ARM | Benoit Jacob | 2015-02-27 | |
| | | | | | | | | | | Also safely assert in the non-implemented path that should never be taken in practice, and would return wrong results. | |||
| * | Avoid packing rhs multiple-times when blocking on the lhs only. | Gael Guennebaud | 2015-02-26 | |
| | | ||||
| * | Make sure that the block size computation is tested by our unit test. | Gael Guennebaud | 2015-02-26 | |
| | | ||||
| * | Implement a more generic blocking-size selection algorithm. See explanations ↵ | Gael Guennebaud | 2015-02-26 | |
| | | | | | | | | | | | | | | inline. It performs extremely well on Haswell. The main issue is to reliably and quickly find the actual cache size to be used for our 2nd level of blocking, that is: max(l2,l3/nb_core_sharing_l3) | |||
| * | Fix typos in block-size testing code, and set peeling on k to 8. | Gael Guennebaud | 2015-02-26 | |
|/ | ||||
* | So I extensively measured the impact of the offset in this prefetch. I tried ↵ | Benoit Jacob | 2015-02-25 | |
| | | | | | | | | | | | | | | offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes). On x86, I tested a Sandy Bridge with AVX with 12M cache and a Haswell with AVX+FMA with 6M cache on MatrixXf sizes up to 2400. I could not see any significant impact of this offset. On Nexus 5, the offset has a slight effect: values around 32 (times sizeof float) are worst. Anything else is the same: the current 64 (8*pk), or... 0. So let's just go with 0! Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether! | |||
* | bug #970: Add EIGEN_DEVICE_FUNC to RValue functions, in case Cuda supports ↵ | Christoph Hertzberg | 2015-02-24 | |
| | | | | RValue-references. | |||
* | Fix my recent prefetch changes: | Benoit Jacob | 2015-02-23 | |
| | | | | | | | | | | | - the first prefetch is actually harmful on Haswell with FMA, but it is the most beneficial on ARM. - the second prefetch... I was very stupid and multiplied by sizeof(scalar) an offset of a scalar* pointer. The old offset was 64; pk = 8, so 64=pk*8. So this effectively restores the older offset. Actually, there were two prefetches here, one with offset 48 and one with offset 64. I could not confirm any benefit from this strange 48 offset on either the Haswell or my ARM device. | |||
* | Fix two trivial warnings | Christoph Hertzberg | 2015-02-22 | |
| | ||||
* | log1p is defined only for real Scalars in C++11 | Christoph Hertzberg | 2015-02-21 | |
| | ||||
* | Fix compilation of unit tests disabling assertion checking | Gael Guennebaud | 2015-02-21 | |
| | ||||
* | Fix doc of Ref<> | Gael Guennebaud | 2015-02-20 | |
| | ||||
* | In C++11 destructors do not throw by default (fix CommaInitializer unit test) | Gael Guennebaud | 2015-02-20 | |
| | ||||
* | Pulled latest changes from trunk | Benoit Steiner | 2015-02-19 | |
|\ | ||||
* | | Marked the CUDA packet primitives as EIGEN_DEVICE_FUNC since they'll end up ↵ | Benoit Steiner | 2015-02-19 | |
| | | | | | | | | being executed on the GPU device. | |||
| * | Fix regression with C++11 support of lambda: now internal::result_of falls ↵ | Gael Guennebaud | 2015-02-19 | |
| | | | | | | | | back to std::result_of in C++11. | |||
| * | Fix some calls to result_of on binary functors as unary ones. | Gael Guennebaud | 2015-02-19 | |
| | | ||||
| * | Declare const some const variables | Gael Guennebaud | 2015-02-19 | |
|/ | ||||
* | Add support for C++11 result_of/lambdas | Gael Guennebaud | 2015-02-19 | |
| | ||||
* | rotating kernel: avoid compiling anything outside of ARM | Benoit Jacob | 2015-02-18 | |
| | ||||
* | remove a newly introduced redundant typedef - sorry. | Benoit Jacob | 2015-02-18 | |
| | ||||
* | bug #955 - Implement a rotating kernel alternative in the 3px4 gebp path | Benoit Jacob | 2015-02-18 | |
| | | | | | | | | This is substantially faster on ARM, where it's important to minimize the number of loads. This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome. Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (even about 1% slower). | |||
* | Fixed template parameter. | Hauke Heibel | 2015-02-18 | |
| | ||||
* | merge | Gael Guennebaud | 2015-02-18 | |
|\ | ||||
* | | Clean a bit computeProductBlockingSizes (use Index type, remove CEIL macro) | Gael Guennebaud | 2015-02-18 | |
| | | ||||
| * | bug #958 - Allow testing specific blocking sizes | Benoit Jacob | 2015-02-18 | |
|/ | | | | | | | | | | | | | | This is only a debugging/testing patch. It allows testing specific product blocking sizes, typically to study the impact on performance. Example usage: int testk, testm, testn; #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn #include <Eigen/Core> | |||
* | Fix a regression when using OpenMP, and fix bug #714: the number of threads ↵ | Gael Guennebaud | 2015-02-18 | |
| | | | | might be lower than the number of requested ones |
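
The blocking-size commit in the log above names the cache size used for the second level of blocking as max(l2,l3/nb_core_sharing_l3). A minimal sketch of that rule follows. It is not Eigen's implementation: the function name `secondLevelCacheBytes` and its parameters are hypothetical, it assumes the L2 size, L3 size and number of cores sharing the L3 are already known, and it assumes a C++14 compiler so that `std::max` is usable in a constant expression.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not part of Eigen): effective cache size, in bytes,
// for the second level of blocking, following the rule quoted in the log:
// max(l2, l3 / nb_core_sharing_l3).
constexpr std::size_t secondLevelCacheBytes(std::size_t l2Bytes,
                                            std::size_t l3Bytes,
                                            std::size_t coresSharingL3) {
  // Assumption of this sketch: a reported core count of 0 is treated as 1
  // to avoid a division by zero.
  return std::max(l2Bytes,
                  l3Bytes / std::max<std::size_t>(coresSharingL3, 1));
}

// Example: 256 KiB of L2 and 6 MiB of L3 shared by 4 cores gives a
// per-core L3 share of 1.5 MiB, which exceeds the L2 size.
static_assert(secondLevelCacheBytes(256u * 1024u, 6u * 1024u * 1024u, 4) ==
                  1536u * 1024u,
              "per-core L3 share dominates");
```

The commit message itself notes that the hard part is finding these cache sizes reliably and quickly; the sketch leaves that detection out entirely.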