aboutsummaryrefslogtreecommitdiffhomepage
path: root/unsupported/Eigen/CXX11/src/Tensor/TensorEvaluator.h
Commit message (Collapse)AuthorAge
* Fix more enum arithmetic.Gravatar Rasmus Munk Larsen2021-06-15
|
* Fix calls to device functions from host codeGravatar Nathan Luehr2021-05-11
|
* Tensor block evaluation cost modelGravatar Eugene Zhulenev2019-12-18
|
* Remove V2 suffix from TensorBlockGravatar Eugene Zhulenev2019-12-10
|
* Remove TensorBlock.h and old TensorBlock/BlockMapperGravatar Eugene Zhulenev2019-12-10
|
* Do not use std::vector in getResourceRequirementsGravatar Eugene Zhulenev2019-12-09
|
* Add async evaluation support to TensorSelectOpGravatar Eugene Zhulenev2019-12-09
|
* Remove legacy block evaluation supportGravatar Eugene Zhulenev2019-11-12
|
* Propagate block evaluation preference through rvalue tensor expressionsGravatar Eugene Zhulenev2019-10-17
|
* Block evaluation for TensorGenerator/TensorReverse/TensorShufflingGravatar Eugene Zhulenev2019-10-14
|
* Block evaluation for TensorChipping + fixed bugs in TensorPadding and ↵Gravatar Eugene Zhulenev2019-10-09
| | | | TensorSlicing
* Add block evaluation to TensorReshaping/TensorCasting/TensorPadding/TensorSelectGravatar Eugene Zhulenev2019-10-02
|
* Tensor block evaluation V2 support for unary/binary/broadcstingGravatar Eugene Zhulenev2019-09-24
|
* evalSubExprsIfNeededAsync + async TensorContractionThreadPoolGravatar Eugene Zhulenev2019-08-30
|
* Fix performance regressions due to ↵Gravatar Rasmus Munk Larsen2019-08-02
| | | | | | | | | | | | | | | | | https://bitbucket.org/eigen/eigen/pull-requests/662. The change caused the device struct to be copied for each expression evaluation, and caused, e.g., a 10% regression in the TensorFlow multinomial op on GPU: Benchmark Time(ns) CPU(ns) Iterations ---------------------------------------------------------------------- BM_Multinomial_gpu_1_100000_4 128173 231326 2922 1.610G items/s VS Benchmark Time(ns) CPU(ns) Iterations ---------------------------------------------------------------------- BM_Multinomial_gpu_1_100000_4 146683 246914 2719 1.509G items/s
* [SYCL] This PR adds the minimum modifications to the Eigen unsupported ↵Gravatar Mehdi Goli2019-06-28
| | | | | | | | | | module required to run it on devices supporting SYCL. * Abstracting the pointer type so that both SYCL memory and pointer can be captured. * Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class. * Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node. * Adding SYCL macro for controlling loop unrolling. * Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.
* Restore C++03 compatibilityGravatar Christoph Hertzberg2019-05-07
|
* Adding lowlevel APIs for optimized RHS packet load in TensorFlowGravatar Anuj Rawat2019-04-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SpatialConvolution Low-level APIs are added in order to optimized packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also. This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below. Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim Speedup numbers are specified for number of threads 1, 2, 4, 8, 16. AVX512: Parameters | Speedup (Num of threads: 1, 2, 4, 8, 16) ----------------------------|------------------------------------------ 128, 24x24, 3, 64, 5x5 |2.18X, 2.13X, 1.73X, 1.64X, 1.66X 128, 24x24, 1, 64, 8x8 |2.00X, 1.98X, 1.93X, 1.91X, 1.91X 32, 24x24, 3, 64, 5x5 |2.26X, 2.14X, 2.17X, 2.22X, 2.33X 128, 24x24, 3, 64, 3x3 |1.51X, 1.45X, 1.45X, 1.67X, 1.57X 32, 14x14, 24, 64, 5x5 |1.21X, 1.19X, 1.16X, 1.70X, 1.17X 128, 128x128, 3, 96, 11x11 |2.17X, 2.18X, 2.19X, 2.20X, 2.18X AVX2: Parameters | Speedup (Num of threads: 1, 2, 4, 8, 16) ----------------------------|------------------------------------------ 128, 24x24, 3, 64, 5x5 | 1.66X, 1.65X, 1.61X, 1.56X, 1.49X 32, 24x24, 3, 64, 5x5 | 1.71X, 1.63X, 1.77X, 1.58X, 1.68X 128, 24x24, 1, 64, 5x5 | 1.44X, 1.40X, 1.38X, 1.37X, 1.33X 128, 24x24, 3, 64, 3x3 | 1.68X, 1.63X, 1.58X, 1.56X, 1.62X 128, 128x128, 3, 96, 11x11 | 1.36X, 1.36X, 1.37X, 1.37X, 1.37X In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores). On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers: AVX512: Parameters | Runtime without patch (ns) | Runtime with patch (ns) | Speedup ---------------------------------------------------------------|----------------------------|-------------------------|--------- BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) | 41350 | 15073 | 2.74X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) | 7277 | 7341 | 0.99X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) | 8675 | 8681 | 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) | 24155 | 16079 | 1.50X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) | 25052 | 17152 | 1.46X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 18269 | 18345 | 1.00X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 19468 | 19872 | 0.98X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432) | 156060 | 42432 | 3.68X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) | 132701 | 36944 | 3.59X AVX2: Parameters | Runtime without patch (ns) | Runtime with patch (ns) | Speedup ---------------------------------------------------------------|----------------------------|-------------------------|--------- BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) | 26233 | 12393 | 2.12X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) | 6091 | 6062 | 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) | 7427 | 7408 | 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) | 23453 | 20826 | 1.13X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) | 23167 | 22091 | 1.09X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 23422 | 23682 | 0.99X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 23165 | 23663 | 0.98X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432) | 72689 | 44969 | 1.62X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) | 61732 | 39779 | 1.55X All benchmarks on Intel Skylake server with 8 cores.
* Fix BlockAccess enum in CwiseUnaryOp evaluatorGravatar Eugene Zhulenev2018-08-10
|
* Add block evaluationto CwiseUnaryOp and add PreferBlockAccess enum to all ↵Gravatar Eugene Zhulenev2018-08-10
| | | | evaluators
* Fixed compilation errors.Gravatar Benoit Steiner2018-08-06
|
* Enabling per device specialisation of packetsize.Gravatar Mehdi Goli2018-08-01
|
* Rename Index to StorageIndex + use Eigen::Array and Eigen::Map when possibleGravatar Eugene Zhulenev2018-07-27
|
* Add tiled evaluation support to TensorExecutorGravatar Eugene Zhulenev2018-07-25
|
* Avoid using memcpy for non-POD elementsGravatar Weiming Zhao2018-04-11
|
* Add a EIGEN_NO_CUDA option, and introduce EIGEN_CUDACC and EIGEN_CUDA_ARCH ↵Gravatar Gael Guennebaud2017-07-17
| | | | aliases
* Merged in mehdi_goli/opencl/DataDependancy (pull request PR-10)Gravatar Benoit Steiner2017-06-28
| | | | | | | | | | DataDependancy * Wrapping data type to the pointer class for sycl in non-terminal nodes; not having that breaks Tensorflow Conv2d code. * Applying Ronnan's Comments. * Applying benoit's comments
* Adding TensorIndexTuple and TensorTupleReduceOP backend (ArgMax/Min) for ↵Gravatar Mehdi Goli2017-03-07
| | | | sycl; fixing the address space issue for const TensorMap; converting all discard_write to write due to data missmatch.
* Converting all parallel for lambda to functor in order to prevent kernel ↵Gravatar Mehdi Goli2016-12-16
| | | | duplication name error; adding tensorConcatinationOp backend for sycl.
* Adding tensor contraction operation backend for Sycl; adding test for ↵Gravatar Mehdi Goli2016-12-14
| | | | contractionOp sycl backend; adding temporary solution to prevent memory leak in buffer; cleaning up cxx11_tensor_buildins_sycl.h
* Merged with default.Gravatar Luke Iwanski2016-09-19
|\
* | Partial OpenCL support via SYCL compatible with ComputeCpp CE.Gravatar Luke Iwanski2016-09-19
| |
| * Made the index type an explicit template parameter to help some compilers ↵Gravatar Benoit Steiner2016-09-02
| | | | | | | | compile the code.
| * Adjust Tensor module wrt recent change in nullary functorGravatar Gael Guennebaud2016-09-01
| |
| * Force the inlining of a simple accessor.Gravatar Benoit Steiner2016-08-18
|/
* bug #1266: half implementation has been moved to half_impl namespaceGravatar Benoit Steiner2016-07-29
|
* Moved assertions to the constructor to make the code more portableGravatar Benoit Steiner2016-06-06
|
* Add TernaryFunctors and the betainc SpecialFunction.Gravatar Eugene Brevdo2016-06-02
| | | | | | | | | | | | | | | | | | | TernaryFunctors and their executors allow operations on 3-tuples of inputs. API fully implemented for Arrays and Tensors based on binary functors. Ported the cephes betainc function (regularized incomplete beta integral) to Eigen, with support for CPU and GPU, floats, doubles, and half types. Added unit tests in array.cpp and cxx11_tensor_cuda.cu Collapsed revision * Merged helper methods for betainc across floats and doubles. * Added TensorGlobalFunctions with betainc(). Removed betainc() from TensorBase. * Clean up CwiseTernaryOp checks, change igamma_helper to cephes_helper. * betainc: merge incbcf and incbd into incbeta_cfe. and more cleanup. * Update TernaryOp and SpecialFunctions (betainc) based on review comments.
* Added the ability to load fp16 using the texture path.Gravatar Benoit Steiner2016-05-11
| | | | Improved the performance of some reductions on fp16
* Deleted unnecessary variableGravatar Benoit Steiner2016-04-15
|
* Eigen Tensor cost model part 2: Thread scheduling for standard evaluators ↵Gravatar Rasmus Munk Larsen2016-04-14
| | | | and reductions. The cost model is turned off by default.
* Eigen cost model part 1. This implements a basic recursive framework to ↵Gravatar Rasmus Munk Larsen2016-04-14
| | | | estimate the cost of evaluating tensor expressions.
* Fixed the tensor chipping code.Gravatar Benoit Steiner2016-03-08
|
* Decoupled the packet type definition from the definition of the tensor ops. ↵Gravatar Benoit Steiner2016-03-08
| | | | All the vectorization is now defined in the tensor evaluators. This will make it possible to relialably support devices with different packet types in the same compilation unit.
* Record whether the underlying tensor storage can be accessed directly during ↵Gravatar Benoit Steiner2016-01-19
| | | | the evaluation of an expression.
* Added support for rank-0 tensorsGravatar Benoit Steiner2015-10-29
|
* Fix Tensor module wrt nullary functor recent changeGravatar Gael Guennebaud2015-08-09
|
* Use NumTraits<T>::RequireInitialization instead of ↵Gravatar Benoit Steiner2015-07-07
| | | | internal::is_arithmetic<T>::value to check whether it's possible to bypass the type constructor in the tensor code.
* Only attempt to use the texture path on GPUs when it's supported by CUDAGravatar Benoit Steiner2015-07-06
|
* Sped up the assignment of a tensor to a tensor slice, as well as the ↵Gravatar Benoit Steiner2015-04-20
| | | | assigment of a constant slice to a tensor