matrix_inverse. Cholesky is not sufficiently faster than PartialPivLU in Eigen to be worth the extra pass over the data to check for symmetry, even for relatively large matrices. For (n=2k) SPD matrices Cholesky does give a ~5% speedup over LU, but for other matrix types we see an equivalent slowdown.
Change: 115276994
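For context, a minimal Eigen-only sketch contrasting the two code paths weighed above; the matrix size and setup are illustrative, not the TensorFlow kernel:

```cpp
// Illustrative sketch: the general LU inverse vs. the Cholesky (LLT)
// inverse, which is only valid for symmetric positive-definite (SPD) input.
#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 256;  // illustrative size
  // A = B * B^T + n * I is symmetric positive definite by construction.
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd a =
      b * b.transpose() + double(n) * Eigen::MatrixXd::Identity(n, n);

  // General path: PartialPivLU, which is what Eigen's inverse() uses for
  // square matrices. No symmetry check required.
  Eigen::MatrixXd inv_lu = a.partialPivLu().inverse();

  // Cholesky path: valid only for SPD input, so using it safely costs an
  // extra O(n^2) pass over the data just to test symmetry.
  if (a.isApprox(a.transpose())) {
    Eigen::MatrixXd inv_chol = a.llt().solve(Eigen::MatrixXd::Identity(n, n));
    std::cout << "max |LU - Cholesky| = "
              << (inv_lu - inv_chol).cwiseAbs().maxCoeff() << "\n";
  }
  return 0;
}
```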
on tensorflow.org.
Change: 115270889
Change: 115269320
Change: 115268843
if the underlying
allocator doesn't already do it.
Change: 115263741
Change: 115261957
it returns metadata at a single point of control.
Change: 115255052
floats in TensorFlow. The code was tested on Tegra X1.
Change: 115253733
This is useful when you produce tensors using Split, but don't want to copy
them unnecessarily if you need them aligned. Without this, you need to
pessimistically assume that they are all unaligned and copy all of them.
Change: 115251844
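A hedged sketch of the kind of alignment test this enables; the 64-byte requirement and the helper names are illustrative, not TensorFlow's actual API:

```cpp
// Inspect the actual base pointer of each slice instead of assuming the
// worst and copying everything.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRequiredAlignment = 64;  // illustrative requirement

inline bool IsAligned(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % kRequiredAlignment == 0;
}

// Caller-side pattern: copy only the slices that actually need it, e.g.
//   if (!IsAligned(slice.data())) slice = CopyToAlignedBuffer(slice);
// where CopyToAlignedBuffer is a hypothetical helper.
```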
call changes state. If it is set to False, the name that would be used is
returned without actually marking the name as used.
Change: 115249981
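A hypothetical C++ sketch of the described peek-versus-consume semantics, using a plain counter map; the class and method names are made up (the real logic lives in TensorFlow's Python name scoping):

```cpp
#include <string>
#include <unordered_map>

class NameRegistry {
 public:
  // mark_as_used=true consumes the name (state changes); false only
  // reports the name that would be used, leaving the registry untouched.
  std::string UniqueName(const std::string& name, bool mark_as_used = true) {
    auto it = counts_.find(name);
    const int count = (it == counts_.end()) ? 0 : it->second;
    const std::string result =
        (count == 0) ? name : name + "_" + std::to_string(count);
    if (mark_as_used) counts_[name] = count + 1;
    return result;
  }

 private:
  std::unordered_map<std::string, int> counts_;
};
```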
Change: 115249194
The absence of a shape function makes import_graph_def() fail when RandomCrop
is present in the GraphDef.
Change: 115243268
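For illustration, a sketch of giving an op a shape function in today's C++ style (the 2016 fix used the then-current Python shape registry); the op name and signature below are illustrative, not the real RandomCrop registration:

```cpp
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

using tensorflow::shape_inference::InferenceContext;
using tensorflow::shape_inference::ShapeHandle;

REGISTER_OP("ExampleRandomCrop")
    .Input("image: uint8")
    .Input("size: int64")
    .Output("output: uint8")
    .SetShapeFn([](InferenceContext* c) {
      ShapeHandle image;
      // Require a rank-3 (height, width, channels) input.
      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 3, &image));
      // Height and width come from the runtime `size` input, so they are
      // unknown at graph-construction time; channels carry through.
      c->set_output(0, c->MakeShape({c->UnknownDim(), c->UnknownDim(),
                                     c->Dim(image, 2)}));
      return tensorflow::Status::OK();
    });
```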
Change: 115243253
var as a list.
Change: 115243053
The current implementations are very simplistic: they spawn a thread
in response to each call. This is needed to deal with the fact that
some users of SchedClosure rely on the ability to spawn blocking
closures, and we lack an unbounded threadpool.
TODO(mrry): Replace the currently-blocking users of this API with
asynchronous implementations.
Change: 115239594
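A minimal sketch of the thread-per-call strategy described above, assuming a free function named SchedClosure; the real implementation sits behind TensorFlow's Env interface:

```cpp
#include <functional>
#include <thread>

void SchedClosure(std::function<void()> closure) {
  // A fresh detached thread per closure is wasteful but safe for callers
  // that block inside the closure: a bounded threadpool could deadlock if
  // every worker were occupied by a blocking closure.
  std::thread(std::move(closure)).detach();
}
```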
The testing::SrcDir() function supports data dependencies in cc_test
targets.
Change: 115239392
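A hypothetical usage sketch, assuming testing::SrcDir() is exposed through TensorFlow's test header; the data-file path is illustrative:

```cpp
#include <string>
#include "tensorflow/core/platform/test.h"

// Resolve a data dependency of a cc_test target relative to the source tree.
std::string TestDataPath() {
  return tensorflow::testing::SrcDir() +
         "/tensorflow/core/lib/testdata/example.txt";
}
```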
Change: 115220444
This brings the performance of the dynamic RNN into a reasonable range when compared with cond-based static unrolling.
Before:
Graph Creation: Static Unroll vs. Dynamic Unroll LSTM
max_t dt(static) dt(dynamic) dt(dynamic)/dt(static)
1 0.973783 1.598944 1.641992
25 13.994146 1.849802 0.132184
50 27.849715 2.052574 0.073702
Calculation: Static Unroll with Dynamic Flow LSTM vs. Dynamic Unroll LSTM
batch max_t units gpu dt(static) dt(dynamic) dt(dynamic)/dt(static)
256 50 512 False 1.262335 1.349654 1.069172
256 50 256 False 0.720269 0.742385 1.030706
256 50 128 False 0.342915 0.360554 1.051439
256 100 512 False 2.512101 2.592826 1.032134
256 100 256 False 1.398599 1.449359 1.036294
256 100 128 False 0.688278 0.723332 1.050930
512 50 512 False 1.777011 2.040112 1.148058
512 50 256 False 0.854183 0.915705 1.072024
512 50 128 False 0.609203 0.624703 1.025443
512 100 512 False 3.731255 4.289601 1.149640
512 100 256 False 1.763375 1.867427 1.059007
512 100 128 False 1.226971 1.274628 1.038841
256 50 512 True 0.190479 0.217636 1.142570
256 50 256 True 0.086440 0.119876 1.386814
256 50 128 True 0.061334 0.097079 1.582790
256 100 512 True 0.381617 0.432454 1.133215
256 100 256 True 0.174479 0.239955 1.375264
256 100 128 True 0.122436 0.190479 1.555740
512 50 512 True 0.322039 0.355348 1.103433
512 50 256 True 0.129060 0.163209 1.264603
512 50 128 True 0.073067 0.106976 1.464091
512 100 512 True 0.653037 0.719606 1.101936
512 100 256 True 0.259759 0.323882 1.246856
512 100 128 True 0.147856 0.215792 1.459475
After:
Graph Creation: Static Unroll vs. Dynamic Unroll LSTM
max_t dt(static) dt(dynamic) dt(dynamic)/dt(static)
1 0.945166 1.643999 1.739376
25 13.471901 1.787826 0.132708
50 26.668288 2.041938 0.076568
Calculation: Static Unroll with Dynamic Flow LSTM vs. Dynamic Unroll LSTM
batch max_t units gpu dt(static) dt(dynamic) dt(dynamic)/dt(static)
256 50 512 False 1.282594 1.293548 1.008540
256 50 256 False 0.707062 0.738919 1.045055
256 50 128 False 0.353723 0.365117 1.032211
256 100 512 False 2.573490 2.579687 1.002408
256 100 256 False 1.397638 1.448193 1.036172
256 100 128 False 0.699666 0.727913 1.040371
512 50 512 False 1.755335 1.849683 1.053749
512 50 256 False 0.857895 0.917298 1.069242
512 50 128 False 0.606808 0.625990 1.031610
512 100 512 False 3.608412 3.964380 1.098649
512 100 256 False 1.744636 1.862331 1.067461
512 100 128 False 1.221435 1.277420 1.045835
256 50 512 True 0.191454 0.204069 1.065890
256 50 256 True 0.083181 0.092068 1.106844
256 50 128 True 0.055699 0.064500 1.158020
256 100 512 True 0.377481 0.403046 1.067727
256 100 256 True 0.171492 0.189591 1.105542
256 100 128 True 0.112558 0.135522 1.204021
512 50 512 True 0.324426 0.348642 1.074641
512 50 256 True 0.125665 0.143196 1.139510
512 50 128 True 0.069971 0.077949 1.114019
512 100 512 True 0.670467 0.704176 1.050278
512 100 256 True 0.256430 0.300047 1.170094
512 100 128 True 0.142816 0.161151 1.128383
Change: 115179042
This is necessary for, e.g., RNN where one wants to cache Variables
locally even when they are accessed through a conditional like cond.
Without the local caching, each cond creates a Switch that bypasses
the current Variable copy deduplication code and forces a (possibly slow)
copy for each iteration. With local caching, the Variable is copied
once to the local device and then that local copy is accessed at each iteration.
Change: 115151788
The test is breaking for OSS, and I can't look at it in detail right now. The
fact that it breaks is harmless (it's equivalent to yesterday), so disable the
test in the open source case for the moment.
Change: 115128265
Change: 115121755
Two different mechanisms are required. On the CPU, we push and pop the
appropriate processor flags in the executor (for the master thread) *and*
in each threadpool thread, since the processor flags are thread-local. On
the GPU, we set -ftz=true for both nvcc and gcudacc so that kernels that we
build flush denormals to zero using instruction flags.
Caveat: On GPU, only single precision denormals are flushed to zero; double
precision is unchanged.
Change: 115114845
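A sketch of the CPU-side push/pop as a scoped guard over the SSE control/status register, using the standard intrinsics; the class name is illustrative, not TensorFlow's:

```cpp
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE

class ScopedFlushDenormals {
 public:
  ScopedFlushDenormals()
      : saved_ftz_(_MM_GET_FLUSH_ZERO_MODE()),
        saved_daz_(_MM_GET_DENORMALS_ZERO_MODE()) {
    // These flags are per-thread, which is why each threadpool thread
    // (not just the master thread) must set them.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
  }
  ~ScopedFlushDenormals() {  // pop: restore whatever the caller had
    _MM_SET_FLUSH_ZERO_MODE(saved_ftz_);
    _MM_SET_DENORMALS_ZERO_MODE(saved_daz_);
  }

 private:
  unsigned int saved_ftz_, saved_daz_;
};
```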
Change: 115113975
Support for variable regularization was recently added to tf.get_variable(),
and I've modified layers.py to take advantage of it. This CL changes the
semantics of tf.fully_connected() to match TensorFlow's standard conventions:
a regularization function will only be applied at the time that a variable is
first created.
Change: 115111632
Change: 115111581
Change: 115111428
Change: 115102931
Test both layouts in tests.
Change: 115096872
Change: 115036211
support it in pip-installed form.
Change: 115034582
Change: 115027725
since that's roughly what we package too.
Change: 115027171
image.
Change: 115023941
Change: 115018272
Change: 115017744
Remove the correct file in the case of sharded checkpoints.
Change: 115015294
Change: 115013578
Change: 115010103
This CL changes how the session factory is chosen. Previously, an empty
target would always use DIRECT_SESSION, and a non-empty target would
always use REMOTE_SESSION. In preparation for multiple distributed session
implementations (issue #23), we now delegate to the SessionFactory to
see whether it accepts a given SessionOptions. Existing programs
should continue to work unmodified.
NOTE: This CL assumes that the domains of the registered session factories
do not overlap. We may need to revisit this in the future.
Change: 115008046
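A simplified sketch of the delegation pattern described above; the types are pared down from the real session_factory interface:

```cpp
#include <string>
#include <vector>

struct SessionOptions {
  std::string target;  // "" for direct; e.g. "grpc://..." for distributed
};

class SessionFactory {
 public:
  virtual ~SessionFactory() = default;
  // Each factory decides for itself whether it can serve these options,
  // instead of callers hard-coding "empty target => direct session".
  virtual bool AcceptsOptions(const SessionOptions& options) = 0;
};

// Registry lookup; assumes factory domains do not overlap, per the NOTE.
SessionFactory* FindFactory(const std::vector<SessionFactory*>& factories,
                            const SessionOptions& options) {
  for (SessionFactory* factory : factories) {
    if (factory->AcceptsOptions(options)) return factory;
  }
  return nullptr;  // no registered factory accepts these options
}
```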
Change: 115005379
Basic stats collection is essentially free in the GPU allocator.
CPU stats collection can optionally be turned on.
Change: 115000479
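A hedged sketch of optional stats collection that keeps the default path cheap; the names and structure are illustrative (not the actual allocator), and deallocation tracking is omitted for brevity:

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

struct AllocatorStats {
  std::size_t num_allocs = 0;
  std::size_t bytes_in_use = 0;
  std::size_t max_bytes_in_use = 0;
};

class TrackingAllocator {
 public:
  explicit TrackingAllocator(bool collect_stats)
      : collect_stats_(collect_stats) {}

  void* Allocate(std::size_t num_bytes) {
    void* ptr = std::malloc(num_bytes);
    if (collect_stats_ && ptr != nullptr) {
      // Pay for the lock and bookkeeping only when stats are enabled.
      std::lock_guard<std::mutex> lock(mu_);
      stats_.num_allocs++;
      stats_.bytes_in_use += num_bytes;
      if (stats_.bytes_in_use > stats_.max_bytes_in_use) {
        stats_.max_bytes_in_use = stats_.bytes_in_use;
      }
    }
    return ptr;
  }

 private:
  const bool collect_stats_;
  std::mutex mu_;
  AllocatorStats stats_;
};
```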
Update protobuf commit
Change: 114990608
Change: 114990321
used by the tf-regex-group, tf-categorizer, and tf-collapsable-pane components. Also make the version number exact instead of ^1.0.0, and make it match the version used inside Google.
Change: 114987970
Change: 114985229
Speeds up allocation microbenchmarks by 8% to 15%
Run on REDACTED (40 X 2801 MHz CPUs); 2016/02/17-16:56:24
CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB
Benchmark Base (ns) New (ns) Improvement
------------------------------------------------------------------
BM_Allocation 184 164 +10.9%
BM_AllocationThreaded/1 185 169 +8.6%
BM_AllocationThreaded/4 1966 1771 +9.9%
BM_AllocationThreaded/16 9989 9197 +7.9%
BM_AllocationDelayed/1 204 183 +10.3%
BM_AllocationDelayed/10 171 146 +14.6%
BM_AllocationDelayed/100 152 130 +14.5%
BM_AllocationDelayed/1000 155 131 +15.5%
Change: 114984794
to TensorBoard.
It loads the data from the TensorBoard backend and presents a slightly
cleaner abstraction. It's typed and tested.
Change: 114984720
Change: 114983764
and change thread_annotations.h to prefer default for android
Change: 114980803
Change: 114975142