eigen - C++ library for linear algebra

	Commit message	Author	Age
*	Tensor block evaluation cost model	Eugene Zhulenev	2019-12-18
\|
*	Remove V2 suffix from TensorBlock	Eugene Zhulenev	2019-12-10
\|
*	Remove TensorBlock.h and old TensorBlock/BlockMapper	Eugene Zhulenev	2019-12-10
\|
*	Do not use std::vector in getResourceRequirements	Eugene Zhulenev	2019-12-09
\|
*	Add async evaluation support to TensorSelectOp	Eugene Zhulenev	2019-12-09
\|
*	Remove legacy block evaluation support	Eugene Zhulenev	2019-11-12
\|
*	Propagate block evaluation preference through rvalue tensor expressions	Eugene Zhulenev	2019-10-17
\|
*	Block evaluation for TensorGenerator/TensorReverse/TensorShuffling	Eugene Zhulenev	2019-10-14
\|
*	Block evaluation for TensorChipping + fixed bugs in TensorPadding and TensorSlicing	Eugene Zhulenev	2019-10-09
\|
*	Add block evaluation to TensorReshaping/TensorCasting/TensorPadding/TensorSelect	Eugene Zhulenev	2019-10-02
\|
*	Tensor block evaluation V2 support for unary/binary/broadcasting	Eugene Zhulenev	2019-09-24
\|
*	evalSubExprsIfNeededAsync + async TensorContractionThreadPool	Eugene Zhulenev	2019-08-30
\|
*	Fix performance regressions due to https://bitbucket.org/eigen/eigen/pull-requests/662.	Rasmus Munk Larsen	2019-08-02
\|	The change caused the device struct to be copied for each expression evaluation, and caused, e.g., a 10% regression in the TensorFlow multinomial op on GPU:
\|
\|	Benchmark                      Time(ns)  CPU(ns)  Iterations
\|	----------------------------------------------------------------------
\|	BM_Multinomial_gpu_1_100000_4    128173   231326        2922  1.610G items/s
\|
\|	VS
\|
\|	Benchmark                      Time(ns)  CPU(ns)  Iterations
\|	----------------------------------------------------------------------
\|	BM_Multinomial_gpu_1_100000_4    146683   246914        2719  1.509G items/s
\|
*	[SYCL] This PR adds the minimum modifications to the Eigen unsupported module required to run it on devices supporting SYCL.	Mehdi Goli	2019-06-28
\|	* Abstracting the pointer type so that both SYCL memory and pointer can be captured.
\|	* Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class.
\|	* Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node.
\|	* Adding SYCL macro for controlling loop unrolling.
\|	* Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.
\|
*	Restore C++03 compatibility	Christoph Hertzberg	2019-05-07
\|
*	Adding lowlevel APIs for optimized RHS packet load in TensorFlow SpatialConvolution	Anuj Rawat	2019-04-20
\|	Low-level APIs are added in order to optimized packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also.
\|
\|	This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below.
\|
\|	Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim
\|	Speedup numbers are specified for number of threads 1, 2, 4, 8, 16.
\|
\|	AVX512:
\|
\|	Parameters                  \| Speedup (Num of threads: 1, 2, 4, 8, 16)
\|	----------------------------\|------------------------------------------
\|	128, 24x24, 3, 64, 5x5      \| 2.18X, 2.13X, 1.73X, 1.64X, 1.66X
\|	128, 24x24, 1, 64, 8x8      \| 2.00X, 1.98X, 1.93X, 1.91X, 1.91X
\|	32, 24x24, 3, 64, 5x5       \| 2.26X, 2.14X, 2.17X, 2.22X, 2.33X
\|	128, 24x24, 3, 64, 3x3      \| 1.51X, 1.45X, 1.45X, 1.67X, 1.57X
\|	32, 14x14, 24, 64, 5x5      \| 1.21X, 1.19X, 1.16X, 1.70X, 1.17X
\|	128, 128x128, 3, 96, 11x11  \| 2.17X, 2.18X, 2.19X, 2.20X, 2.18X
\|
\|	AVX2:
\|
\|	Parameters                  \| Speedup (Num of threads: 1, 2, 4, 8, 16)
\|	----------------------------\|------------------------------------------
\|	128, 24x24, 3, 64, 5x5      \| 1.66X, 1.65X, 1.61X, 1.56X, 1.49X
\|	32, 24x24, 3, 64, 5x5       \| 1.71X, 1.63X, 1.77X, 1.58X, 1.68X
\|	128, 24x24, 1, 64, 5x5      \| 1.44X, 1.40X, 1.38X, 1.37X, 1.33X
\|	128, 24x24, 3, 64, 3x3      \| 1.68X, 1.63X, 1.58X, 1.56X, 1.62X
\|	128, 128x128, 3, 96, 11x11  \| 1.36X, 1.36X, 1.37X, 1.37X, 1.37X
\|
\|	In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores).
\|
\|	On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers:
\|
\|	AVX512:
\|
\|	Parameters                                                     \| Runtime without patch (ns) \| Runtime with patch (ns) \| Speedup
\|	---------------------------------------------------------------\|----------------------------\|-------------------------\|---------
\|	BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  \| 41350                      \| 15073                   \| 2.74X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  \| 7277                       \| 7341                    \| 0.99X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  \| 8675                       \| 8681                    \| 1.00X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  \| 24155                      \| 16079                   \| 1.50X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  \| 25052                      \| 17152                   \| 1.46X
\|	BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) \| 18269                      \| 18345                   \| 1.00X
\|	BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) \| 19468                      \| 19872                   \| 0.98X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   \| 156060                     \| 42432                   \| 3.68X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   \| 132701                     \| 36944                   \| 3.59X
\|
\|	AVX2:
\|
\|	Parameters                                                     \| Runtime without patch (ns) \| Runtime with patch (ns) \| Speedup
\|	---------------------------------------------------------------\|----------------------------\|-------------------------\|---------
\|	BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  \| 26233                      \| 12393                   \| 2.12X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  \| 6091                       \| 6062                    \| 1.00X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  \| 7427                       \| 7408                    \| 1.00X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  \| 23453                      \| 20826                   \| 1.13X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  \| 23167                      \| 22091                   \| 1.09X
\|	BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) \| 23422                      \| 23682                   \| 0.99X
\|	BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) \| 23165                      \| 23663                   \| 0.98X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   \| 72689                      \| 44969                   \| 1.62X
\|	BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   \| 61732                      \| 39779                   \| 1.55X
\|
\|	All benchmarks on Intel Skylake server with 8 cores.
\|
*	Fix BlockAccess enum in CwiseUnaryOp evaluator	Eugene Zhulenev	2018-08-10
\|
*	Add block evaluation to CwiseUnaryOp and add PreferBlockAccess enum to all evaluators	Eugene Zhulenev	2018-08-10
\|
*	Fixed compilation errors.	Benoit Steiner	2018-08-06
\|
*	Enabling per device specialisation of packetsize.	Mehdi Goli	2018-08-01
\|
*	Rename Index to StorageIndex + use Eigen::Array and Eigen::Map when possible	Eugene Zhulenev	2018-07-27
\|
*	Add tiled evaluation support to TensorExecutor	Eugene Zhulenev	2018-07-25
\|
*	Avoid using memcpy for non-POD elements	Weiming Zhao	2018-04-11
\|
*	Add a EIGEN_NO_CUDA option, and introduce EIGEN_CUDACC and EIGEN_CUDA_ARCH aliases	Gael Guennebaud	2017-07-17
\|
*	Merged in mehdi_goli/opencl/DataDependancy (pull request PR-10)	Benoit Steiner	2017-06-28
\|	DataDependancy
\|	* Wrapping data type to the pointer class for sycl in non-terminal nodes; not having that breaks Tensorflow Conv2d code.
\|	* Applying Ronnan's Comments.
\|	* Applying benoit's comments
\|
*	Adding TensorIndexTuple and TensorTupleReduceOP backend (ArgMax/Min) for sycl; fixing the address space issue for const TensorMap; converting all discard_write to write due to data mismatch.	Mehdi Goli	2017-03-07
\|
*	Converting all parallel for lambda to functor in order to prevent kernel duplication name error; adding tensorConcatinationOp backend for sycl.	Mehdi Goli	2016-12-16
\|
*	Adding tensor contraction operation backend for Sycl; adding test for contractionOp sycl backend; adding temporary solution to prevent memory leak in buffer; cleaning up cxx11_tensor_buildins_sycl.h	Mehdi Goli	2016-12-14
\|
*	Merged with default.	Luke Iwanski	2016-09-19
\|\
* \|	Partial OpenCL support via SYCL compatible with ComputeCpp CE.	Luke Iwanski	2016-09-19
\| \|
\| *	Made the index type an explicit template parameter to help some compilers compile the code.	Benoit Steiner	2016-09-02
\| \|
\| *	Adjust Tensor module wrt recent change in nullary functor	Gael Guennebaud	2016-09-01
\| \|
\| *	Force the inlining of a simple accessor.	Benoit Steiner	2016-08-18
\|/
*	bug #1266: half implementation has been moved to half_impl namespace	Benoit Steiner	2016-07-29
\|
*	Moved assertions to the constructor to make the code more portable	Benoit Steiner	2016-06-06
\|
*	Add TernaryFunctors and the betainc SpecialFunction.	Eugene Brevdo	2016-06-02
\|	TernaryFunctors and their executors allow operations on 3-tuples of inputs. API fully implemented for Arrays and Tensors based on binary functors.
\|
\|	Ported the cephes betainc function (regularized incomplete beta integral) to Eigen, with support for CPU and GPU, floats, doubles, and half types.
\|
\|	Added unit tests in array.cpp and cxx11_tensor_cuda.cu
\|
\|	Collapsed revision
\|	* Merged helper methods for betainc across floats and doubles.
\|	* Added TensorGlobalFunctions with betainc(). Removed betainc() from TensorBase.
\|	* Clean up CwiseTernaryOp checks, change igamma_helper to cephes_helper.
\|	* betainc: merge incbcf and incbd into incbeta_cfe. and more cleanup.
\|	* Update TernaryOp and SpecialFunctions (betainc) based on review comments.
\|
*	Added the ability to load fp16 using the texture path.	Benoit Steiner	2016-05-11
\|	Improved the performance of some reductions on fp16
\|
*	Deleted unnecessary variable	Benoit Steiner	2016-04-15
\|
*	Eigen Tensor cost model part 2: Thread scheduling for standard evaluators and reductions.	Rasmus Munk Larsen	2016-04-14
\|	The cost model is turned off by default.
\|
*	Eigen cost model part 1.	Rasmus Munk Larsen	2016-04-14
\|	This implements a basic recursive framework to estimate the cost of evaluating tensor expressions.
\|
*	Fixed the tensor chipping code.	Benoit Steiner	2016-03-08
\|
*	Decoupled the packet type definition from the definition of the tensor ops.	Benoit Steiner	2016-03-08
\|	All the vectorization is now defined in the tensor evaluators. This will make it possible to reliably support devices with different packet types in the same compilation unit.
\|
*	Record whether the underlying tensor storage can be accessed directly during the evaluation of an expression.	Benoit Steiner	2016-01-19
\|
*	Added support for rank-0 tensors	Benoit Steiner	2015-10-29
\|
*	Fix Tensor module wrt nullary functor recent change	Gael Guennebaud	2015-08-09
\|
*	Use NumTraits<T>::RequireInitialization instead of internal::is_arithmetic<T>::value to check whether it's possible to bypass the type constructor in the tensor code.	Benoit Steiner	2015-07-07
\|
*	Only attempt to use the texture path on GPUs when it's supported by CUDA	Benoit Steiner	2015-07-06
\|
*	Sped up the assignment of a tensor to a tensor slice, as well as the assignment of a constant slice to a tensor	Benoit Steiner	2015-04-20
\|
*	Fixed the vectorized implementation of the Tensor select() method	Benoit Steiner	2015-03-25
\|
*	Silenced more compilation warnings	Benoit Steiner	2015-02-10
\|