Commit message | Author | Age
Change: 116595261
Change: 116594819
Adds support for binding a TensorFlow server to any port, to support
single-process testing.
This interface is a work in progress. In particular, it supports
launching a server, but the support for clean shutdown is incomplete.
Change: 116593644
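The "bind to any port" idea can be sketched with a plain stdlib socket: binding to port 0 asks the OS for a free ephemeral port, which is what lets a single-process test start a server without hard-coding one. This is only an illustration of the pattern, not TensorFlow's server API; `bind_any_port` is a hypothetical helper.

```python
import socket

def bind_any_port():
    """Bind a listening socket to an OS-assigned free port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))   # port 0 = "any free port"
    s.listen(1)
    return s, s.getsockname()[1]

server, port = bind_any_port()
# A single-process test can now connect to 127.0.0.1:<port>.
assert 0 < port < 65536
server.close()
```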
Checking for a specific crosstool_top directory doesn't work when TensorFlow
is a sub-module of a different project.
Change: 116592676
can propagate empty tensors.
Change: 116588842
This is a necessary change to ensure that we can remove input_size from
the RNNCell objects.
Change: 116586469
Change: 116586466
Change: 116584285
Change: 116583469
(SDCA) optimizer.
Change: 116582004
cause a crash.
Change: 116574701
Change: 116574344
// OLD
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatDepthwiseFwdCPU1_conv0 247698841 247715520 100 667.6M items/s 32_112_112_3_8_128_3_3_1_2_cpu1
BM_ConvFloatDepthwiseFwdCPU1_conv1 662664406 662723089 100 665.5M items/s 32_112_112_64_1_128_3_3_1_2_cpu1
// NEW
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatDepthwiseFwdCPU1_conv0 60316894 60215905 100 2.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu1
BM_ConvFloatDepthwiseFwdCPU1_conv1 158600898 158571194 100 2.7G items/s 32_112_112_64_1_64_3_3_1_2_cpu1
// NEW 4-THREADS
Benchmark Time(ns) CPU(ns) Iterations
-------------------------------------------------------------------
BM_ConvFloatDepthwiseFwdCPU4_conv0 16703436 64535709 100 9.7G items/s 32_112_112_3_8_24_3_3_1_2_cpu4
BM_ConvFloatDepthwiseFwdCPU4_conv1 51874080 182896805 100 8.3G items/s 32_112_112_64_1_64_3_3_1_2_cpu4
Change: 116555067
Fixes #1409.
Change: 116554875
Change: 116478086
to be linked into CPU-only binaries.
Change: 116473970
buffers when copying to the CPU device.
Re-arranges some of the internal GPU libraries to be library- vs.
runtime-specific.
Change: 116472314
Change: 116471151
conflict.
Change: 116468477
Change: 116456020
Change: 116453666
Change: 116434899
information needed for streamed-data duality-gap computation.
Change: 116434080
constraint rather than ops.device, since colocation is more portable.
Change: 116431514
Adds test that failed before and passes after.
Change: 116431333
Change: 116429368
Change: 116421402
Change: 116413358
Change: 116411097
GPUs
Change: 116409601
This replaces the confusing /gs/path/on/gcs syntax.
Change: 116407064
If the log path looks like /gs/foo/bar/baz, we assume it's a reference to a GCS
object named bar/baz in bucket foo. This requires the gsutil binary to be
available somewhere on $PATH.
Change: 116401931
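The path convention described above can be sketched in a few lines of plain Python: `/gs/foo/bar/baz` names object `bar/baz` in bucket `foo`. The helper name `parse_gs_path` is hypothetical, for illustration only.

```python
def parse_gs_path(path):
    """Split a /gs/<bucket>/<object> path into (bucket, object)."""
    if not path.startswith("/gs/"):
        raise ValueError("not a /gs/ path: %r" % path)
    bucket, _, obj = path[len("/gs/"):].partition("/")
    if not bucket or not obj:
        raise ValueError("expected /gs/<bucket>/<object>: %r" % path)
    return bucket, obj

# /gs/foo/bar/baz -> object "bar/baz" in bucket "foo"
assert parse_gs_path("/gs/foo/bar/baz") == ("foo", "bar/baz")
```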
Switch BiasGrad to use shared atomics.
Improve OSS CNN model benchmark performance.
OSS CNN benchmark comparison, one forward+backward pass in ms:
              Before  After  Improvement
AlexNet          146    104       40.38%
Overfeat         338    303       11.55%
OxfordNet        983    678       44.99%
GoogleNet V1     943    634       48.74%
Change: 116401874
the "in-use" memory reaches a given threshold. Added more benchmarks.
batch  seqlen  units  dynamic  elapsed_t/seqlen
  512     100    512     True         0.006626
  512     500    512     True         0.007760
  512     800    512     True         0.008175
Line 1 has no memory swapping; line 2 has started memory swapping; and line 3 spills a lot of tensors to the CPU. More improvements are possible, but it is in a good enough state for people to use, in particular for long sequences.
Change: 116396980
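The swapping policy described above can be sketched in plain Python: when "in-use" device memory passes a threshold, the oldest resident tensors are spilled to host memory. Every name here is illustrative; this is not TensorFlow's implementation.

```python
from collections import OrderedDict

class SwappingAllocator:
    """Toy allocator that spills oldest tensors past a byte threshold."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.device = OrderedDict()   # name -> size, in allocation order
        self.host = {}                # tensors spilled to host memory
        self.in_use = 0

    def allocate(self, name, size):
        self.device[name] = size
        self.in_use += size
        # Spill oldest tensors until we are back under the threshold.
        while self.in_use > self.threshold and len(self.device) > 1:
            old_name, old_size = self.device.popitem(last=False)
            self.host[old_name] = old_size
            self.in_use -= old_size

alloc = SwappingAllocator(threshold_bytes=100)
for step in range(4):
    alloc.allocate("activations_%d" % step, 40)
# 4 * 40 = 160 bytes requested; the two oldest tensors were spilled to host.
assert alloc.in_use <= 100 and len(alloc.host) == 2
```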
Change: 116396958
Change: 116396154
depthwise convolution op is completed for GPU runs. There is still room for
more optimization, which will be done in future CLs.
Note that the current CPU kernels in this CL are just reference kernels and
are not optimized at all.
The old depthwise_conv is very inefficient: it calls slice() on each input
channel of the input and filters, followed by a conv() per input channel,
and finally a concat().
Change: 116383911
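The semantics of the fused op can be sketched as a reference kernel in plain Python (VALID padding, stride 1): each input channel is convolved with its own filters and the per-channel outputs are interleaved, with no intermediate slice/concat. This is a hypothetical illustration, not TensorFlow's kernel.

```python
def depthwise_conv2d(inp, filt):
    """Reference depthwise conv, VALID padding, stride 1.

    inp:  H x W x C nested lists.
    filt: KH x KW x C x M nested lists (M = channel multiplier).
    Returns (H-KH+1) x (W-KW+1) x (C*M) nested lists.
    """
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    KH, KW, M = len(filt), len(filt[0]), len(filt[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0.0] * (C * M) for _ in range(OW)] for _ in range(OH)]
    for oh in range(OH):
        for ow in range(OW):
            for c in range(C):          # each channel uses only its own filters
                for m in range(M):
                    s = 0.0
                    for i in range(KH):
                        for j in range(KW):
                            s += inp[oh + i][ow + j][c] * filt[i][j][c][m]
                    out[oh][ow][c * M + m] = s
    return out

img = [[[1.0] for _ in range(3)] for _ in range(3)]   # 3x3, 1 channel, all ones
k = [[[[1.0]], [[1.0]]], [[[1.0]], [[1.0]]]]          # 2x2x1x1, all ones
out = depthwise_conv2d(img, k)
assert out[0][0][0] == 4.0   # each 2x2 window sums four ones
```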
Change: 116379470
Change: 116368992
Usage of a TensorArray with a hardcoded name causes errors when 'dynamic_rnn'
is called multiple times. The suggested solution is to use the name of the
current variable scope as part of the TensorArray's name.
Change: 116362122
Depending on the test environment, opening a session against
example.org might result in an UNAVAILABLE or DEADLINE_EXCEEDED error.
Change: 116362101
Change: 116322420
gcc 4.8 compatibility. Fixes #1362.
Change: 116317708
Queues have a parameter `shared_name` that is required if multiple workers want to access a
queue. This is required, for instance, in the recommended use of the filename queue
in a multi-worker training run: a shared filename queue should yield the shards of the training data, and different workers should work on different shards.
By passing the shared_name parameter through the various wrapper functions (string_input_producer, range_input_producer, etc.) we can use these wrappers also in the case where this resource is shared by multiple workers.
Change: 116312445
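The shared_name mechanism can be illustrated with a plain-Python registry: two workers asking for a queue under the same shared name receive the same underlying queue object, while unnamed queues stay private. The registry and helper below are hypothetical, for illustration only.

```python
import queue

_shared_queues = {}

def get_queue(shared_name=None):
    """Return a per-call queue, or a process-wide shared one if named."""
    if shared_name is None:
        return queue.Queue()
    return _shared_queues.setdefault(shared_name, queue.Queue())

# Two "workers" naming the same queue see each other's shards.
q1 = get_queue(shared_name="filename_queue")
q2 = get_queue(shared_name="filename_queue")
q1.put("shard-00000")
assert q2.get() == "shard-00000"       # same queue under the hood
assert get_queue() is not get_queue()  # unnamed queues are independent
```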
Change: 116310892
Change: 116303083
Change: 116302210
Change: 116299181
Change: 116295883
Change: 116290685