path: root/Eigen
Commit message    Author    Age
...
* | Pulled latest updates from trunk    Benoit Steiner    2015-02-27
|\ \
* | | Added support for 32bit index on a per tensor/tensor expression basis.    Benoit Steiner    2015-02-27
| | |   This enables us to use 32bit indices to evaluate expressions on GPU faster while keeping the ability to use 64 bit indices to manipulate large tensors on CPU in the same binary.
* | | Switch to truncated casting when converting floating point types to integer.    Benoit Steiner    2015-02-27
| | |   This ensures that vectorized casts are consistent with scalar casts.
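For context on truncated casting: a scalar static_cast<int>(float) rounds toward zero, while SSE's default packed conversion rounds to nearest, so a vectorized cast must use the truncating variant to match. A minimal sketch with plain SSE2 intrinsics (illustrative; not Eigen's actual implementation):

    #include <emmintrin.h>  // SSE2
    #include <cstdio>

    int main() {
        __m128 v = _mm_set1_ps(1.7f);
        // _mm_cvtps_epi32 uses the current rounding mode (default: to nearest): 1.7f -> 2
        int rounded   = _mm_cvtsi128_si32(_mm_cvtps_epi32(v));
        // _mm_cvttps_epi32 truncates toward zero: 1.7f -> 1, matching static_cast<int>
        int truncated = _mm_cvtsi128_si32(_mm_cvttps_epi32(v));
        std::printf("%d %d %d\n", rounded, truncated, static_cast<int>(1.7f));
    }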
* | | Added support for vectorized type casting of tensors    Benoit Steiner    2015-02-27
| | |
* | | Added support for fast reciprocal square root computation.    Benoit Steiner    2015-02-26
| | |
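Fast reciprocal square root is commonly built from the hardware estimate plus one Newton-Raphson refinement step; a hedged sketch of that general technique in raw SSE (not necessarily the code added by this commit):

    #include <xmmintrin.h>  // SSE

    // Approximate 1/sqrt(x): _mm_rsqrt_ps gives roughly 12 bits of precision;
    // one Newton-Raphson step (y' = y*(1.5 - 0.5*x*y*y)) roughly doubles that.
    static inline __m128 fast_rsqrt(__m128 x) {
        __m128 y      = _mm_rsqrt_ps(x);
        __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
        __m128 y2     = _mm_mul_ps(y, y);
        return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(half_x, y2)));
    }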
| | * Really use zero guess in ConjugateGradients::solve as documented and expected for consistency with other methods.    Jan Blechta    2015-02-18
| | |
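For reference, Eigen's iterative solvers document solve() as starting from a zero guess, while solveWithGuess() takes an explicit starting point; a small usage sketch (matrix and vector types chosen for illustration):

    #include <Eigen/Sparse>
    #include <Eigen/IterativeLinearSolvers>

    Eigen::VectorXd solve_spd(const Eigen::SparseMatrix<double>& A,
                              const Eigen::VectorXd& b,
                              const Eigen::VectorXd& x0) {
        Eigen::ConjugateGradient<Eigen::SparseMatrix<double> > cg;
        cg.compute(A);
        Eigen::VectorXd x1 = cg.solve(b);               // starts from x = 0, as documented
        Eigen::VectorXd x2 = cg.solveWithGuess(b, x0);  // starts from the user-provided x0
        return x2;
    }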
| | * merge    Gael Guennebaud    2015-03-04
| | |\
| | * | Check for no-reallocation in SparseMatrix::insert (bug #974)    Gael Guennebaud    2015-03-04
| | | |
| | * | Improve efficiency of SparseMatrix::insert/coeffRef for sequential outer-index insertion strategies (bug #974)    Gael Guennebaud    2015-03-04
| | | |
| | * | Update manual wrt new LSCG solver.    Gael Guennebaud    2015-03-04
| | | |
| | * | Add a CG-based solver for rectangular least-square problems (bug #975).    Gael Guennebaud    2015-03-04
| | | |
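The solver added here is exposed as LeastSquaresConjugateGradient; it applies CG to the normal equations implicitly, so A may be rectangular. A minimal usage sketch:

    #include <Eigen/Sparse>
    #include <Eigen/IterativeLinearSolvers>

    // Solve min_x ||Ax - b||^2 for a rectangular (m x n) sparse A.
    Eigen::VectorXd least_squares(const Eigen::SparseMatrix<double>& A,
                                  const Eigen::VectorXd& b) {
        Eigen::LeastSquaresConjugateGradient<Eigen::SparseMatrix<double> > lscg;
        lscg.compute(A);
        return lscg.solve(b);
    }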
| | | * Fix asm comments in 1px1 kernel    Benoit Jacob    2015-03-03
| | | |
| | | * Add a benchmark-default-sizes action to benchmark-blocking-sizes.cpp    Benoit Jacob    2015-03-03
| | | |
| | | * New scoring functor to select the pivot.    Marc Glisse    2015-03-03
| | | |   This can be useful for non-floating-point scalars, where choosing the biggest element is generally not the best choice.
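The point of a scoring functor is to separate "how good is this element as a pivot" from raw magnitude: abs() is the right score for floating point, while for exact scalar types any nonzero pivot may do. A hypothetical sketch of the concept (names invented for illustration; not Eigen's internal interface):

    #include <cmath>

    // Hypothetical scoring functors: the pivot search picks the element
    // with the highest score rather than the largest absolute value.
    struct AbsScore {                 // sensible default for floating point
        double operator()(double x) const { return std::abs(x); }
    };
    struct NonZeroScore {             // for exact types: any nonzero pivot is fine,
        template <typename T>         // so no need to hunt for the biggest element
        double operator()(const T& x) const { return x != T(0) ? 1.0 : 0.0; }
    };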
| | | * must also disable complex<double> when disabling double vectorization    Benoit Jacob    2015-03-03
| | |/
| | * Work around an ICE in Clang 3.5 in the iOS toolchain with double NEON intrinsics.    Benoit Jacob    2015-03-03
| | |
| | * HalfPacket also needed to be disabled for double, on ARMv8.    Benoit Jacob    2015-03-02
| | |
| | * Add SSE vectorization of Quaternion::conjugate. Significant speed-up when combined with products like q1*q2.conjugate().    Gael Guennebaud    2015-03-02
| | |
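Quaternion conjugation negates the vector part (x, y, z) and keeps w; with Eigen's (x, y, z, w) storage order that is a single XOR against a sign mask, which is presumably the trick behind the speed-up. A hedged sketch with raw SSE intrinsics (illustrative, not Eigen's exact code):

    #include <xmmintrin.h>  // SSE

    // q stored as (x, y, z, w): conjugate = (-x, -y, -z, w).
    // XOR with -0.0f flips the sign bit; XOR with +0.0f is a no-op.
    static inline __m128 quat_conjugate(__m128 q) {
        const __m128 mask = _mm_setr_ps(-0.0f, -0.0f, -0.0f, 0.0f);
        return _mm_xor_ps(q, mask);
    }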
| | * Increase unit-test L1 cache size to ensure we are doing at least 2 peeled loops within the product kernel.    Gael Guennebaud    2015-02-27
| | |
| | * Re-enable detection of min/max parentheses protection, and re-enable the mpreal_support unit test.    Gael Guennebaud    2015-02-27
| |/
| * Reimplement the selection between rotating and non-rotating kernels using templates instead of macros and if()'s.    Benoit Jacob    2015-02-27
| |   That was needed to fix the build of unit tests on ARM, which I had broken. My bad for not testing earlier.
| * remove trailing comma    Benoit Jacob    2015-02-27
| |
| * Disable Packet2f/2i halfpacket support in NEON.    Benoit Jacob    2015-02-27
| |   I believe that it was erroneously turned on, since Packet2f/2i intrinsics are unimplemented, and code trying to use halfpackets just fails to compile on NEON, as it tries to use the default implementation of pload/pstore and the types don't match.
| * Replace a static assert by a runtime one; fixes the build of unit tests on ARM.    Benoit Jacob    2015-02-27
| |   Also safely assert in the non-implemented path that should never be taken in practice and would return wrong results.
| * Avoid packing rhs multiple times when blocking on the lhs only.    Gael Guennebaud    2015-02-26
| |
| * Make sure that the block size computation is tested by our unit test.    Gael Guennebaud    2015-02-26
| |
| * Implement a more generic blocking-size selection algorithm. See explanations inline.    Gael Guennebaud    2015-02-26
| |   It performs extremely well on Haswell. The main issue is to reliably and quickly find the actual cache size to be used for our 2nd level of blocking, that is: max(l2, l3/nb_core_sharing_l3).
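In other words, the cache budget for the second level of blocking is the private L2 size or each core's share of the L3, whichever is larger; a toy sketch of that formula (the cache sizes in the comment are assumptions for illustration):

    #include <algorithm>
    #include <cstddef>

    // Effective per-core cache for the 2nd level of blocking:
    // max(l2, l3 / nb_core_sharing_l3), per the commit message.
    std::size_t effective_l2_budget(std::size_t l2, std::size_t l3,
                                    int nb_core_sharing_l3) {
        return std::max(l2, l3 / static_cast<std::size_t>(nb_core_sharing_l3));
    }

    // Example: a Haswell-like 256KB L2 with a 6MB L3 shared by 4 cores
    // gives max(256KB, 1.5MB) = 1.5MB available for blocking.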
| * Fix typos in block-size testing code, and set peeling on k to 8.    Gael Guennebaud    2015-02-26
|/
* So I extensively measured the impact of the offset in this prefetch.    Benoit Jacob    2015-02-25
|   I tried offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes). On x86, I tested a Sandy Bridge with AVX and 12M cache and a Haswell with AVX+FMA and 6M cache, on MatrixXf sizes up to 2400. I could not see any significant impact of this offset. On Nexus 5, the offset has a slight effect: values around 32 (times sizeof(float)) are worst; anything else is the same, whether the current 64 (8*pk) or... 0. So let's just go with 0! Note that we needed a fix anyway for not accounting for the value of RhsProgress; 0 nicely avoids the issue altogether.
* bug #970: Add EIGEN_DEVICE_FUNC to RValue functions, in case Cuda supports RValue-references.    Christoph Hertzberg    2015-02-24
|
* Fix my recent prefetch changes:    Benoit Jacob    2015-02-23
|   - the first prefetch is actually harmful on Haswell with FMA, but it is the most beneficial on ARM.
|   - the second prefetch... I was very stupid and multiplied an offset on a scalar* pointer by sizeof(scalar). The old offset was 64; pk = 8, so 64 = pk*8, and this effectively restores the older offset. Actually, there were two prefetches here, one with offset 48 and one with offset 64. I could not confirm any benefit from this strange 48 offset on either the Haswell or my ARM device.
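The second bug above is classic pointer arithmetic: offsets on a scalar* are already scaled by sizeof(scalar), so multiplying by sizeof(scalar) again prefetches far past the intended address. A minimal illustration using GCC/Clang's __builtin_prefetch (names are illustrative):

    void prefetch_example(const float* p) {
        const int pk = 8;  // peeling factor, per the commit message
        // Wrong: p is a float*, so "+ 64*sizeof(float)" advances 64*4 = 256 floats
        // (1KB), not 64 floats -- the double-scaling bug described above.
        __builtin_prefetch(p + 64 * sizeof(float));
        // Right: offsets on a typed pointer are already counted in elements.
        __builtin_prefetch(p + 8 * pk);  // 64 floats ahead
    }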
* Fix two trivial warnings    Christoph Hertzberg    2015-02-22
|
* log1p is defined only for real Scalars in C++11    Christoph Hertzberg    2015-02-21
|
* Fix compilation of unit tests disabling assertion checking    Gael Guennebaud    2015-02-21
|
* Fix doc of Ref<>    Gael Guennebaud    2015-02-20
|
* In C++11 destructors do not throw by default (fix CommaInitializer unit test)    Gael Guennebaud    2015-02-20
|
* Pulled latest changes from trunk    Benoit Steiner    2015-02-19
|\
* | Marked the CUDA packet primitives as EIGEN_DEVICE_FUNC since they'll end up being executed on the GPU device.    Benoit Steiner    2015-02-19
| |
| * Fix regression with C++11 support of lambdas: now internal::result_of falls back to std::result_of in C++11.    Gael Guennebaud    2015-02-19
| |
| * Fix some calls to result_of on binary functors as unary ones.    Gael Guennebaud    2015-02-19
| |
| * Declare const some variables    Gael Guennebaud    2015-02-19
|/
* Add support for C++11 result_of/lambdas    Gael Guennebaud    2015-02-19
|
* rotating kernel: avoid compiling anything outside of ARM    Benoit Jacob    2015-02-18
|
* remove a newly introduced redundant typedef - sorry.    Benoit Jacob    2015-02-18
|
* bug #955 - Implement a rotating kernel alternative in the 3px4 gebp path    Benoit Jacob    2015-02-18
|   This is substantially faster on ARM, where it's important to minimize the number of loads. This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome. Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (even about 1% slower).
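On ARM, the rotating kernel's core trick is to load one rhs packet and rotate its lanes between multiply-accumulates, instead of issuing a fresh duplicating load per lane. A hedged sketch of the rotation itself using NEON's vext (illustrative; not the actual gebp code):

    #include <arm_neon.h>

    // Rotate a packet of 4 floats left by one lane: (a0,a1,a2,a3) -> (a1,a2,a3,a0).
    // Cycling through 4 rotations lets one loaded packet meet every lhs lane,
    // replacing four per-lane duplicating loads with one load plus three vexts.
    static inline float32x4_t rotate_left1(float32x4_t a) {
        return vextq_f32(a, a, 1);
    }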
* Fixed template parameter.    Hauke Heibel    2015-02-18
|
* merge    Gael Guennebaud    2015-02-18
|\
* | Clean a bit computeProductBlockingSizes (use Index type, remove CEIL macro)    Gael Guennebaud    2015-02-18
| |
| * bug #958 - Allow testing specific blocking sizes    Benoit Jacob    2015-02-18
| |   This is only a debugging/testing patch. It allows testing specific product blocking sizes, typically to study the impact on performance. Example usage:
| |       int testk, testm, testn;
| |       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
| |       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
| |       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
| |       #define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
| |       #include <Eigen/Core>
|/
* Fix a regression when using OpenMP, and fix bug #714: the number of threads might be lower than the number of requested ones.    Gael Guennebaud    2015-02-18