Commit message | Author | Age
batches of dense matrices. This calls Eigen::JacobiSVD<Matrix, Eigen::HouseholderQRPreconditioner> which is known to be rather slow. This change is primarily intended to get the TensorFlow interfaces and functionality in place. We intend to swap out the "backend" with a higher performance algorithm implementation in the future.
This CL also contains a small refactoring of the LinearAlgebraOp base class:
1. I moved the initial processing of inputs and outputs into separate helper functions so Compute() is not so long.
2. The derived classes are now allowed to return fewer output matrix shapes (n) than the number of op outputs (m), in which case empty (shape[0]) tensors are returned for the last m-n outputs.
Fixed a few Python linter errors that were blocking presubmit.
Change: 128990912
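The batched-SVD semantics described above can be sketched with NumPy (an illustration only: `batch_svd` is a hypothetical helper, not the TensorFlow op, and NumPy stands in for the Eigen::JacobiSVD backend):

```python
import numpy as np

def batch_svd(a):
    """SVD of every matrix in a [..., M, N] batch of dense matrices.

    Illustrative stand-in: the real kernel loops over the batch and
    computes each SVD with Eigen::JacobiSVD.
    """
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return s, u, np.swapaxes(vt, -1, -2)

batch = np.random.rand(4, 3, 3)
s, u, v = batch_svd(batch)
# Each matrix reconstructs as u @ diag(s) @ v^T.
recon = (u * s[..., None, :]) @ np.swapaxes(v, -1, -2)
```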
Change: 128401884
* Simplify Eigen package config
* Add missing unsupported/Eigen/*
* Fix pip setup.py
* Adjust new eigen header
* Fix bazel include dependency error
* Adjust Makefile to work with Eigen changes
* Remove nvcc workaround for CUDA <= 6.0
  CUDA versions prior to 6.5 gave the error
  "kernel launches from templates are not allowed in system files"
  when using gcc v4.8 and including code that uses templated
  kernel launches via `-isystem`.
In order to work around this, the GPU crosstool converted `-isystem`
arguments containing the cuda headers into `-iquote` arguments.
This workaround has now been removed.
* Configure cmake and make to get eigen version from tensorflow/workspace.bzl
improvements for fp16
Added SpecialFunctions to the list of eigen headers TensorFlow depends on
Change: 127264575
Change: 127253427
improvements for fp16
Change: 127233960
handle per-thread buffer allocation for the tileable executor without resorting to thread_local, which is not fully supported on Android.
Change: 126009029
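The pattern above — giving each worker thread its own scratch buffer without thread_local storage — can be sketched in Python (illustrative only; `PerThreadBuffer` is a hypothetical name, and the real change lives in the C++ tileable executor):

```python
import threading

class PerThreadBuffer:
    """Lazily hands each worker thread its own scratch buffer.

    A dict keyed by thread id stands in for thread_local storage,
    which the commit notes is not fully supported on Android.
    """

    def __init__(self, size):
        self._size = size
        self._buffers = {}
        self._lock = threading.Lock()

    def get(self):
        tid = threading.get_ident()
        with self._lock:
            if tid not in self._buffers:
                self._buffers[tid] = bytearray(self._size)
            return self._buffers[tid]

pool = PerThreadBuffer(1024)
results = {}
barrier = threading.Barrier(2)  # keep both workers alive at once

def worker(name):
    barrier.wait()
    results[name] = id(pool.get())

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Repeated calls from the same thread return the same buffer, while distinct live threads get distinct buffers.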
will enable the implementation of the cumsum operation in TensorFlow
Change: 125697517
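The cumsum behavior this scan primitive enables can be sketched with NumPy (an illustration, not the TensorFlow implementation; the `exclusive`/`reverse` flags mirror the usual cumsum variants):

```python
import numpy as np

def cumsum(x, exclusive=False, reverse=False):
    """Inclusive prefix sum by default; flags select the common variants."""
    if reverse:
        x = x[::-1]
    out = np.cumsum(x)
    if exclusive:
        # Shift right and seed with the identity element of addition.
        out = np.concatenate(([0], out[:-1]))
    if reverse:
        out = out[::-1]
    return out
```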
performance of the toy mnist training by 1 order of magnitude
Change: 124374286
Benchmark               Time(ns)  CPU(ns)  Iterations  Throughput
NEW
BM_fullReduction/10         4591      4595      153149   20.8M items/s
BM_fullReduction/64         5073      5075      100000  770.0M items/s
BM_fullReduction/512        9067      9070       75263   26.9G items/s
BM_fullReduction/4k       243984    244125        2868   64.0G items/s
BM_fullReduction/5k       359125    359273        1951   64.8G items/s
OLD
BM_fullReduction/10         9085      9087       74395   10.5M items/s
BM_fullReduction/64         9478      9478       72014  412.1M items/s
BM_fullReduction/512       14643     14646       46902   16.7G items/s
BM_fullReduction/4k       260338    260384        2678   60.0G items/s
BM_fullReduction/5k       385076    385178        1818   60.5G items/s
Change: 124290852
gradients, some variants etc.).
Change: 124197406
gradients, some variants etc.).
Change: 123967787
gradients, some variants etc.).
Change: 123967117
Change: 123659102
Change: 123238579
with many cpu cores
For example, the wall time for the following tutorial went down from 13m35 to 5m27:
bazel run -c opt --copt=-mavx tensorflow/examples/tutorials/word2vec/word2vec_basic
Change: 122462177
Change: 122192081
by about 3 orders of magnitude, as well as some partial reductions by 30%, when using CUDA 7.5 or above
Change: 122191448
gpus
Updated the check numerics code to make it compatible with fp16
Change: 120980302
Change: 120739269
tensorflow: switch to eigen thread pool
This is the first step of switching TensorFlow to the new
non-blocking thread pool in Eigen.
Change: 120510292
on GPU
Change: 120505517
offered by AWS
Change: 120369420
sigmoid of fp16 and introduces a condition estimator.
Change: 119907721
Change: 119850987
improvements for fp16
Change: 119771118
Change: 119458778
as well as fp16
Change: 119398881
the zeta
and polygamma functions, as well as improved support for float16.
Change: 119279101
and fixes the computation of absolute values on gpu.
Change: 119001808
Change: 118414762