Commit message | Author | Age | |
---|---|---|---|
* | Tensor block evaluation cost model | Eugene Zhulenev | 2019-12-18 |
| | |||
* | Remove V2 suffix from TensorBlock | Eugene Zhulenev | 2019-12-10 |
| | |||
* | Remove TensorBlock.h and old TensorBlock/BlockMapper | Eugene Zhulenev | 2019-12-10 |
| | |||
* | Do not use std::vector in getResourceRequirements | Eugene Zhulenev | 2019-12-09 |
| | |||
* | Add async evaluation support to TensorSelectOp | Eugene Zhulenev | 2019-12-09 |
| | |||
* | Remove legacy block evaluation support | Eugene Zhulenev | 2019-11-12 |
| | |||
* | Propagate block evaluation preference through rvalue tensor expressions | Eugene Zhulenev | 2019-10-17 |
| | |||
* | Block evaluation for TensorGenerator/TensorReverse/TensorShuffling | Eugene Zhulenev | 2019-10-14 |
| | |||
* | Block evaluation for TensorChipping + fixed bugs in TensorPadding and ↵ | Eugene Zhulenev | 2019-10-09 |
| | | | | TensorSlicing | ||
* | Add block evaluation to TensorReshaping/TensorCasting/TensorPadding/TensorSelect | Eugene Zhulenev | 2019-10-02 |
| | |||
* | Tensor block evaluation V2 support for unary/binary/broadcasting | Eugene Zhulenev | 2019-09-24 |
| | |||
* | evalSubExprsIfNeededAsync + async TensorContractionThreadPool | Eugene Zhulenev | 2019-08-30 |
| | |||
* | Fix performance regressions due to ↵ | Rasmus Munk Larsen | 2019-08-02 |
| | | | | https://bitbucket.org/eigen/eigen/pull-requests/662. The change caused the device struct to be copied for each expression evaluation, and caused, e.g., a 10% regression in the TensorFlow multinomial op on GPU: | ||
| | | | | Benchmark: Time(ns), CPU(ns), Iterations | ||
| | | | | BM_Multinomial_gpu_1_100000_4: 128173, 231326, 2922, 1.610G items/s | ||
| | | | | VS | ||
| | | | | BM_Multinomial_gpu_1_100000_4: 146683, 246914, 2719, 1.509G items/s | ||
* | [SYCL] This PR adds the minimum modifications to the Eigen unsupported ↵ | Mehdi Goli | 2019-06-28 |
| | | | | module required to run it on devices supporting SYCL. | ||
| | | | | * Abstracting the pointer type so that both SYCL memory and pointer can be captured. | ||
| | | | | * Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class. | ||
| | | | | * Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node. | ||
| | | | | * Adding SYCL macro for controlling loop unrolling. | ||
| | | | | * Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes. | ||
* | Restore C++03 compatibility | Christoph Hertzberg | 2019-05-07 |
| | |||
* | Adding lowlevel APIs for optimized RHS packet load in TensorFlow | Anuj Rawat | 2019-04-20 |
| | | | | SpatialConvolution | ||
| | | | | Low-level APIs are added in order to optimize packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for the scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also. | ||
| | | | | This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below. Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim. Speedup numbers are specified for number of threads 1, 2, 4, 8, 16. | ||
| | | | | AVX512, Parameters: Speedup (Num of threads: 1, 2, 4, 8, 16) | ||
| | | | | 128, 24x24, 3, 64, 5x5: 2.18X, 2.13X, 1.73X, 1.64X, 1.66X | ||
| | | | | 128, 24x24, 1, 64, 8x8: 2.00X, 1.98X, 1.93X, 1.91X, 1.91X | ||
| | | | | 32, 24x24, 3, 64, 5x5: 2.26X, 2.14X, 2.17X, 2.22X, 2.33X | ||
| | | | | 128, 24x24, 3, 64, 3x3: 1.51X, 1.45X, 1.45X, 1.67X, 1.57X | ||
| | | | | 32, 14x14, 24, 64, 5x5: 1.21X, 1.19X, 1.16X, 1.70X, 1.17X | ||
| | | | | 128, 128x128, 3, 96, 11x11: 2.17X, 2.18X, 2.19X, 2.20X, 2.18X | ||
| | | | | AVX2, Parameters: Speedup (Num of threads: 1, 2, 4, 8, 16) | ||
| | | | | 128, 24x24, 3, 64, 5x5: 1.66X, 1.65X, 1.61X, 1.56X, 1.49X | ||
| | | | | 32, 24x24, 3, 64, 5x5: 1.71X, 1.63X, 1.77X, 1.58X, 1.68X | ||
| | | | | 128, 24x24, 1, 64, 5x5: 1.44X, 1.40X, 1.38X, 1.37X, 1.33X | ||
| | | | | 128, 24x24, 3, 64, 3x3: 1.68X, 1.63X, 1.58X, 1.56X, 1.62X | ||
| | | | | 128, 128x128, 3, 96, 11x11: 1.36X, 1.36X, 1.37X, 1.37X, 1.37X | ||
| | | | | In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores). | ||
| | | | | On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers: | ||
| | | | | AVX512, Parameters: Runtime without patch (ns), Runtime with patch (ns), Speedup | ||
| | | | | BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56): 41350, 15073, 2.74X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56): 7277, 7341, 0.99X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56): 8675, 8681, 1.00X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56): 24155, 16079, 1.50X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56): 25052, 17152, 1.46X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56): 18269, 18345, 1.00X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56): 19468, 19872, 0.98X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432): 156060, 42432, 3.68X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432): 132701, 36944, 3.59X | ||
| | | | | AVX2, Parameters: Runtime without patch (ns), Runtime with patch (ns), Speedup | ||
| | | | | BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56): 26233, 12393, 2.12X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56): 6091, 6062, 1.00X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56): 7427, 7408, 1.00X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56): 23453, 20826, 1.13X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56): 23167, 22091, 1.09X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56): 23422, 23682, 0.99X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56): 23165, 23663, 0.98X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432): 72689, 44969, 1.62X | ||
| | | | | BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432): 61732, 39779, 1.55X | ||
| | | | | All benchmarks on Intel Skylake server with 8 cores. | ||
* | Fix BlockAccess enum in CwiseUnaryOp evaluator | Eugene Zhulenev | 2018-08-10 |
| | |||
* | Add block evaluation to CwiseUnaryOp and add PreferBlockAccess enum to all ↵ | Eugene Zhulenev | 2018-08-10 |
| | | | | evaluators | ||
* | Fixed compilation errors. | Benoit Steiner | 2018-08-06 |
| | |||
* | Enabling per device specialisation of packetsize. | Mehdi Goli | 2018-08-01 |
| | |||
* | Rename Index to StorageIndex + use Eigen::Array and Eigen::Map when possible | Eugene Zhulenev | 2018-07-27 |
| | |||
* | Add tiled evaluation support to TensorExecutor | Eugene Zhulenev | 2018-07-25 |
| | |||
* | Avoid using memcpy for non-POD elements | Weiming Zhao | 2018-04-11 |
| | |||
* | Add a EIGEN_NO_CUDA option, and introduce EIGEN_CUDACC and EIGEN_CUDA_ARCH ↵ | Gael Guennebaud | 2017-07-17 |
| | | | | aliases | ||
* | Merged in mehdi_goli/opencl/DataDependancy (pull request PR-10) | Benoit Steiner | 2017-06-28 |
| | | | | DataDependancy | ||
| | | | | * Wrapping data type to the pointer class for sycl in non-terminal nodes; not having that breaks Tensorflow Conv2d code. | ||
| | | | | * Applying Ronnan's Comments. | ||
| | | | | * Applying benoit's comments | ||
* | Adding TensorIndexTuple and TensorTupleReduceOP backend (ArgMax/Min) for ↵ | Mehdi Goli | 2017-03-07 |
| | | | | sycl; fixing the address space issue for const TensorMap; converting all discard_write to write due to data mismatch. | ||
* | Converting all parallel for lambda to functor in order to prevent kernel ↵ | Mehdi Goli | 2016-12-16 |
| | | | | duplication name error; adding tensorConcatinationOp backend for sycl. | ||
* | Adding tensor contraction operation backend for Sycl; adding test for ↵ | Mehdi Goli | 2016-12-14 |
| | | | | contractionOp sycl backend; adding temporary solution to prevent memory leak in buffer; cleaning up cxx11_tensor_buildins_sycl.h | ||
* | Merged with default. | Luke Iwanski | 2016-09-19 |
|\ | |||
* | | Partial OpenCL support via SYCL compatible with ComputeCpp CE. | Luke Iwanski | 2016-09-19 |
| | | |||
| * | Made the index type an explicit template parameter to help some compilers ↵ | Benoit Steiner | 2016-09-02 |
| | | | | | | | | compile the code. | ||
| * | Adjust Tensor module wrt recent change in nullary functor | Gael Guennebaud | 2016-09-01 |
| | | |||
| * | Force the inlining of a simple accessor. | Benoit Steiner | 2016-08-18 |
|/ | |||
* | bug #1266: half implementation has been moved to half_impl namespace | Benoit Steiner | 2016-07-29 |
| | |||
* | Moved assertions to the constructor to make the code more portable | Benoit Steiner | 2016-06-06 |
| | |||
* | Add TernaryFunctors and the betainc SpecialFunction. | Eugene Brevdo | 2016-06-02 |
| | | | | TernaryFunctors and their executors allow operations on 3-tuples of inputs. API fully implemented for Arrays and Tensors based on binary functors. | ||
| | | | | Ported the cephes betainc function (regularized incomplete beta integral) to Eigen, with support for CPU and GPU, floats, doubles, and half types. | ||
| | | | | Added unit tests in array.cpp and cxx11_tensor_cuda.cu | ||
| | | | | Collapsed revision | ||
| | | | | * Merged helper methods for betainc across floats and doubles. | ||
| | | | | * Added TensorGlobalFunctions with betainc(). Removed betainc() from TensorBase. | ||
| | | | | * Clean up CwiseTernaryOp checks, change igamma_helper to cephes_helper. | ||
| | | | | * betainc: merge incbcf and incbd into incbeta_cfe. and more cleanup. | ||
| | | | | * Update TernaryOp and SpecialFunctions (betainc) based on review comments. | ||
* | Added the ability to load fp16 using the texture path. | Benoit Steiner | 2016-05-11 |
| | | | | Improved the performance of some reductions on fp16 | ||
* | Deleted unnecessary variable | Benoit Steiner | 2016-04-15 |
| | |||
* | Eigen Tensor cost model part 2: Thread scheduling for standard evaluators ↵ | Rasmus Munk Larsen | 2016-04-14 |
| | | | | and reductions. The cost model is turned off by default. | ||
* | Eigen cost model part 1. This implements a basic recursive framework to ↵ | Rasmus Munk Larsen | 2016-04-14 |
| | | | | estimate the cost of evaluating tensor expressions. | ||
* | Fixed the tensor chipping code. | Benoit Steiner | 2016-03-08 |
| | |||
* | Decoupled the packet type definition from the definition of the tensor ops. ↵ | Benoit Steiner | 2016-03-08 |
| | | | | All the vectorization is now defined in the tensor evaluators. This will make it possible to reliably support devices with different packet types in the same compilation unit. | ||
* | Record whether the underlying tensor storage can be accessed directly during ↵ | Benoit Steiner | 2016-01-19 |
| | | | | the evaluation of an expression. | ||
* | Added support for rank-0 tensors | Benoit Steiner | 2015-10-29 |
| | |||
* | Fix Tensor module wrt nullary functor recent change | Gael Guennebaud | 2015-08-09 |
| | |||
* | Use NumTraits<T>::RequireInitialization instead of ↵ | Benoit Steiner | 2015-07-07 |
| | | | | internal::is_arithmetic<T>::value to check whether it's possible to bypass the type constructor in the tensor code. | ||
* | Only attempt to use the texture path on GPUs when it's supported by CUDA | Benoit Steiner | 2015-07-06 |
| | |||
* | Sped up the assignment of a tensor to a tensor slice, as well as the ↵ | Benoit Steiner | 2015-04-20 |
| | | | | assignment of a constant slice to a tensor | ||
* | Fixed the vectorized implementation of the Tensor select() method | Benoit Steiner | 2015-03-25 |
| | |||
* | Silenced more compilation warnings | Benoit Steiner | 2015-02-10 |
| |