aboutsummaryrefslogtreecommitdiffhomepage
path: root/unsupported/Eigen/CXX11/src/Tensor/TensorFFT.h
Commit message (Collapse)AuthorAge
* Fix calls to device functions from host codeGravatar Nathan Luehr2021-05-11
|
* Remove V2 suffix from TensorBlockGravatar Eugene Zhulenev2019-12-10
|
* Remove legacy block evaluation supportGravatar Eugene Zhulenev2019-11-12
|
* Drop support for c++03 in Eigen tensor. Get rid of some code used to emulate ↵Gravatar Rasmus Munk Larsen2019-10-18
| | | | c++11 functionality with older compilers.
* Tensor block evaluation V2 support for unary/binary/broadcstingGravatar Eugene Zhulenev2019-09-24
|
* [SYCL] This PR adds the minimum modifications to the Eigen unsupported ↵Gravatar Mehdi Goli2019-06-28
| | | | | | | | | | module required to run it on devices supporting SYCL. * Abstracting the pointer type so that both SYCL memory and pointer can be captured. * Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class. * Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node. * Adding SYCL macro for controlling loop unrolling. * Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.
* Add block evaluationto CwiseUnaryOp and add PreferBlockAccess enum to all ↵Gravatar Eugene Zhulenev2018-08-10
| | | | evaluators
* Fix typos found using codespellGravatar Gael Guennebaud2018-06-07
|
* Disable use of recurrence for computing twiddle factors. Fixes FFT precision ↵Gravatar RJ Ryan2017-12-31
| | | | issues for large FFTs. https://github.com/tensorflow/tensorflow/issues/10749#issuecomment-354557689
* Merged in mehdi_goli/opencl/DataDependancy (pull request PR-10)Gravatar Benoit Steiner2017-06-28
| | | | | | | | | | DataDependancy * Wrapping data type to the pointer class for sycl in non-terminal nodes; not having that breaks Tensorflow Conv2d code. * Applying Ronnan's Comments. * Applying benoit's comments
* Adds a fast memcpy function to Eigen. This takes advantage of the following:Gravatar Rasmus Munk Larsen2017-01-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. For small fixed sizes, the compiler generates inline code for memcpy, which is much faster. 2. My colleague eriche at googl dot com discovered that for large sizes, memmove is significantly faster than memcpy (at least on Linux with GCC or Clang). See benchmark numbers measured on a Haswell (HP Z440) workstation here: https://docs.google.com/a/google.com/spreadsheets/d/1jLs5bKzXwhpTySw65MhG1pZpsIwkszZqQTjwrd_n0ic/pubhtml This is of course surprising since memcpy is a less constrained version of memmove. This stackoverflow thread contains some speculation as to the causes: http://stackoverflow.com/questions/22793669/poor-memcpy-performance-on-linux Below are numbers for copying and slicing tensors using the multithreaded TensorDevice. The numbers show significant improvements for memcpy of very small blocks and for memcpy of large blocks single threaded (we were already able to saturate memory bandwidth for >1 threads before on large blocks). The "slicingSmallPieces" benchmark also shows small consistent improvements, since memcpy cost is a fair portion of that particular computation. The benchmarks operate on NxN matrices, and the names are of the form BM_$OP_${NUMTHREADS}T/${N}. Measured improvements in wall clock time: Run on rmlarsen3.mtv (12 X 3501 MHz CPUs); 2017-01-20T11:26:31.493023454-08:00 CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB Benchmark Base (ns) New (ns) Improvement ------------------------------------------------------------------ BM_memcpy_1T/2 3.48 2.39 +31.3% BM_memcpy_1T/8 12.3 6.51 +47.0% BM_memcpy_1T/64 371 383 -3.2% BM_memcpy_1T/512 66922 66720 +0.3% BM_memcpy_1T/4k 9892867 6849682 +30.8% BM_memcpy_1T/5k 14951099 10332856 +30.9% BM_memcpy_2T/2 3.50 2.46 +29.7% BM_memcpy_2T/8 12.3 7.66 +37.7% BM_memcpy_2T/64 371 376 -1.3% BM_memcpy_2T/512 66652 66788 -0.2% BM_memcpy_2T/4k 6145012 6117776 +0.4% BM_memcpy_2T/5k 9181478 9010942 +1.9% BM_memcpy_4T/2 3.47 2.47 +31.0% BM_memcpy_4T/8 12.3 6.67 +45.8 BM_memcpy_4T/64 374 376 -0.5% BM_memcpy_4T/512 67833 68019 -0.3% BM_memcpy_4T/4k 5057425 5188253 -2.6% BM_memcpy_4T/5k 7555638 7779468 -3.0% BM_memcpy_6T/2 3.51 2.50 +28.8% BM_memcpy_6T/8 12.3 7.61 +38.1% BM_memcpy_6T/64 373 378 -1.3% BM_memcpy_6T/512 66871 66774 +0.1% BM_memcpy_6T/4k 5112975 5233502 -2.4% BM_memcpy_6T/5k 7614180 7772246 -2.1% BM_memcpy_8T/2 3.47 2.41 +30.5% BM_memcpy_8T/8 12.4 10.5 +15.3% BM_memcpy_8T/64 372 388 -4.3% BM_memcpy_8T/512 67373 66588 +1.2% BM_memcpy_8T/4k 5148462 5254897 -2.1% BM_memcpy_8T/5k 7660989 7799058 -1.8% BM_memcpy_12T/2 3.50 2.40 +31.4% BM_memcpy_12T/8 12.4 7.55 +39.1 BM_memcpy_12T/64 374 378 -1.1% BM_memcpy_12T/512 67132 66683 +0.7% BM_memcpy_12T/4k 5185125 5292920 -2.1% BM_memcpy_12T/5k 7717284 7942684 -2.9% BM_slicingSmallPieces_1T/2 47.3 47.5 +0.4% BM_slicingSmallPieces_1T/8 53.6 52.3 +2.4% BM_slicingSmallPieces_1T/64 491 476 +3.1% BM_slicingSmallPieces_1T/512 21734 18814 +13.4% BM_slicingSmallPieces_1T/4k 394660 396760 -0.5% BM_slicingSmallPieces_1T/5k 218722 209244 +4.3% BM_slicingSmallPieces_2T/2 80.7 79.9 +1.0% BM_slicingSmallPieces_2T/8 54.2 53.1 +2.0 BM_slicingSmallPieces_2T/64 497 477 +4.0% BM_slicingSmallPieces_2T/512 21732 18822 +13.4% BM_slicingSmallPieces_2T/4k 392885 390490 +0.6% BM_slicingSmallPieces_2T/5k 221988 208678 +6.0% BM_slicingSmallPieces_4T/2 80.8 80.1 +0.9% BM_slicingSmallPieces_4T/8 54.1 53.2 +1.7% BM_slicingSmallPieces_4T/64 493 476 +3.4% BM_slicingSmallPieces_4T/512 21702 18758 +13.6% BM_slicingSmallPieces_4T/4k 393962 404023 -2.6% BM_slicingSmallPieces_4T/5k 249667 211732 +15.2% BM_slicingSmallPieces_6T/2 80.5 80.1 +0.5% BM_slicingSmallPieces_6T/8 54.4 53.4 +1.8% BM_slicingSmallPieces_6T/64 488 478 +2.0% BM_slicingSmallPieces_6T/512 21719 18841 +13.3% BM_slicingSmallPieces_6T/4k 394950 397583 -0.7% BM_slicingSmallPieces_6T/5k 223080 210148 +5.8% BM_slicingSmallPieces_8T/2 81.2 80.4 +1.0% BM_slicingSmallPieces_8T/8 58.1 53.5 +7.9% BM_slicingSmallPieces_8T/64 489 480 +1.8% BM_slicingSmallPieces_8T/512 21586 18798 +12.9% BM_slicingSmallPieces_8T/4k 394592 400165 -1.4% BM_slicingSmallPieces_8T/5k 219688 208301 +5.2% BM_slicingSmallPieces_12T/2 80.2 79.8 +0.7% BM_slicingSmallPieces_12T/8 54.4 53.4 +1.8 BM_slicingSmallPieces_12T/64 488 476 +2.5% BM_slicingSmallPieces_12T/512 21931 18831 +14.1% BM_slicingSmallPieces_12T/4k 393962 396541 -0.7% BM_slicingSmallPieces_12T/5k 218803 207965 +5.0%
* Use numext::conj instead of std::conjGravatar Benoit Steiner2016-08-01
|
* Eigen cost model part 1. This implements a basic recursive framework to ↵Gravatar Rasmus Munk Larsen2016-04-14
| | | | estimate the cost of evaluating tensor expressions.
* Relaxed the condition used to gate the fft code.Gravatar Benoit Steiner2016-03-31
|
* Properly gate the fft codeGravatar Benoit Steiner2016-03-31
|
* Use EIGEN_PI instead of redefining our own constant PIGravatar Benoit Steiner2016-03-05
|
* Don't rely on the M_PI constant since not all compilers provide it.Gravatar Benoit Steiner2016-03-04
|
* Don't use implicit type conversions in initializer lists since not all ↵Gravatar Benoit Steiner2016-03-04
| | | | compilers support them.
* Prevent unecessary Index to int conversionsGravatar Benoit Steiner2016-02-21
|
* Get rid of duplicate code.Gravatar Rasmus Munk Larsen2016-02-19
|
* Speed up tensor FFT by up ~25-50%.Gravatar Rasmus Munk Larsen2016-02-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Benchmark Base (ns) New (ns) Improvement ------------------------------------------------------------------ BM_tensor_fft_single_1D_cpu/8 132 134 -1.5% BM_tensor_fft_single_1D_cpu/9 1162 1229 -5.8% BM_tensor_fft_single_1D_cpu/16 199 195 +2.0% BM_tensor_fft_single_1D_cpu/17 2587 2267 +12.4% BM_tensor_fft_single_1D_cpu/32 373 341 +8.6% BM_tensor_fft_single_1D_cpu/33 5922 4879 +17.6% BM_tensor_fft_single_1D_cpu/64 797 675 +15.3% BM_tensor_fft_single_1D_cpu/65 13580 10481 +22.8% BM_tensor_fft_single_1D_cpu/128 1753 1375 +21.6% BM_tensor_fft_single_1D_cpu/129 31426 22789 +27.5% BM_tensor_fft_single_1D_cpu/256 4005 3008 +24.9% BM_tensor_fft_single_1D_cpu/257 70910 49549 +30.1% BM_tensor_fft_single_1D_cpu/512 8989 6524 +27.4% BM_tensor_fft_single_1D_cpu/513 165402 107751 +34.9% BM_tensor_fft_single_1D_cpu/999 198293 115909 +41.5% BM_tensor_fft_single_1D_cpu/1ki 21289 14143 +33.6% BM_tensor_fft_single_1D_cpu/1k 361980 233355 +35.5% BM_tensor_fft_double_1D_cpu/8 138 131 +5.1% BM_tensor_fft_double_1D_cpu/9 1253 1133 +9.6% BM_tensor_fft_double_1D_cpu/16 218 200 +8.3% BM_tensor_fft_double_1D_cpu/17 2770 2392 +13.6% BM_tensor_fft_double_1D_cpu/32 406 368 +9.4% BM_tensor_fft_double_1D_cpu/33 6418 5153 +19.7% BM_tensor_fft_double_1D_cpu/64 856 728 +15.0% BM_tensor_fft_double_1D_cpu/65 14666 11148 +24.0% BM_tensor_fft_double_1D_cpu/128 1913 1502 +21.5% BM_tensor_fft_double_1D_cpu/129 36414 24072 +33.9% BM_tensor_fft_double_1D_cpu/256 4226 3216 +23.9% BM_tensor_fft_double_1D_cpu/257 86638 52059 +39.9% BM_tensor_fft_double_1D_cpu/512 9397 6939 +26.2% BM_tensor_fft_double_1D_cpu/513 203208 114090 +43.9% BM_tensor_fft_double_1D_cpu/999 237841 125583 +47.2% BM_tensor_fft_double_1D_cpu/1ki 20921 15392 +26.4% BM_tensor_fft_double_1D_cpu/1k 455183 250763 +44.9% BM_tensor_fft_single_2D_cpu/8 1051 1005 +4.4% BM_tensor_fft_single_2D_cpu/9 16784 14837 +11.6% BM_tensor_fft_single_2D_cpu/16 4074 3772 +7.4% BM_tensor_fft_single_2D_cpu/17 75802 63884 +15.7% BM_tensor_fft_single_2D_cpu/32 20580 16931 +17.7% BM_tensor_fft_single_2D_cpu/33 345798 278579 +19.4% BM_tensor_fft_single_2D_cpu/64 97548 81237 +16.7% BM_tensor_fft_single_2D_cpu/65 1592701 1227048 +23.0% BM_tensor_fft_single_2D_cpu/128 472318 384303 +18.6% BM_tensor_fft_single_2D_cpu/129 7038351 5445308 +22.6% BM_tensor_fft_single_2D_cpu/256 2309474 1850969 +19.9% BM_tensor_fft_single_2D_cpu/257 31849182 23797538 +25.3% BM_tensor_fft_single_2D_cpu/512 10395194 8077499 +22.3% BM_tensor_fft_single_2D_cpu/513 144053843 104242541 +27.6% BM_tensor_fft_single_2D_cpu/999 279885833 208389718 +25.5% BM_tensor_fft_single_2D_cpu/1ki 45967677 36070985 +21.5% BM_tensor_fft_single_2D_cpu/1k 619727095 456489500 +26.3% BM_tensor_fft_double_2D_cpu/8 1110 1016 +8.5% BM_tensor_fft_double_2D_cpu/9 17957 15768 +12.2% BM_tensor_fft_double_2D_cpu/16 4558 4000 +12.2% BM_tensor_fft_double_2D_cpu/17 79237 66901 +15.6% BM_tensor_fft_double_2D_cpu/32 21494 17699 +17.7% BM_tensor_fft_double_2D_cpu/33 357962 290357 +18.9% BM_tensor_fft_double_2D_cpu/64 105179 87435 +16.9% BM_tensor_fft_double_2D_cpu/65 1617143 1288006 +20.4% BM_tensor_fft_double_2D_cpu/128 512848 419397 +18.2% BM_tensor_fft_double_2D_cpu/129 7271322 5636884 +22.5% BM_tensor_fft_double_2D_cpu/256 2415529 1922032 +20.4% BM_tensor_fft_double_2D_cpu/257 32517952 24462177 +24.8% BM_tensor_fft_double_2D_cpu/512 10724898 8287617 +22.7% BM_tensor_fft_double_2D_cpu/513 146007419 108603266 +25.6% BM_tensor_fft_double_2D_cpu/999 296351330 221885776 +25.1% BM_tensor_fft_double_2D_cpu/1ki 59334166 48357539 +18.5% BM_tensor_fft_double_2D_cpu/1k 666660132 483840349 +27.4%
* Avoid unecessary type conversionsGravatar Benoit Steiner2016-02-05
|
* Record whether the underlying tensor storage can be accessed directly during ↵Gravatar Benoit Steiner2016-01-19
| | | | the evaluation of an expression.
* Consistently use the same index type in the fft codebase.Gravatar Benoit Steiner2015-10-29
|
* Silenced a few more compilation warningsGravatar Benoit Steiner2015-10-29
|
* Fixed compilation warningGravatar Benoit Steiner2015-10-26
|
* Reordered the code of fft constructor to prevent compilation warningsGravatar Benoit Steiner2015-10-23
|
* Added support for fourier transforms (code courtesy of thucjw@gmail.com)Gravatar Benoit Steiner2015-10-22