[...] doesn't filter op registrations, this saves >185k. Also, this may save a few cycles during startup on mobile (untested), since the doc string won't be parsed.
This introduces a TF_LEAN_BINARY macro that we can use to control other such options.
Change: 116889235
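The idea behind such a lean-binary macro can be sketched as a compile-time switch that drops doc strings before they ever reach the binary. OP_DOC, OpDef, and RegisterOp below are illustrative stand-ins, not TensorFlow's actual registration API:

```cpp
#include <cassert>
#include <string>

// Compile-time switch: with TF_LEAN_BINARY defined, doc strings are
// replaced by "" so the text never lands in the binary (and never needs
// to be parsed at startup).
#ifdef TF_LEAN_BINARY
#define OP_DOC(text) ""
#else
#define OP_DOC(text) (text)
#endif

// Illustrative stand-ins for an op registry entry and its registration.
struct OpDef {
  std::string name;
  std::string doc;
};

inline OpDef RegisterOp(const std::string& name, const char* doc) {
  return OpDef{name, doc};
}
```

In a full build the doc string survives; compiling with -DTF_LEAN_BINARY turns every OP_DOC into the empty string, so the linker can discard the text.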
[...] reflect and symmetric modes of Numpy pad."
Change: 116836742

[...] symmetric modes of Numpy pad.
Change: 116828726
[...]
This CL also adds the Scanner class to do simple scans over strings, to mimic regexp behavior like [a-zA-Z][a-zA-Z0-9]* with:
    Scanner scan(s);
    scan.One(Scanner::LETTER);
    scan.Any(Scanner::LETTER_DIGIT);
    bool matched = scan.GetResult();
Change: 116803757
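A minimal sketch of how such a Scanner might be implemented. The real class in TensorFlow is richer (more character classes, chaining, capture of matched substrings), so treat the details here as assumptions; only the One/Any/GetResult usage pattern comes from the commit:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Minimal Scanner mimicking [a-zA-Z][a-zA-Z0-9]*-style scans.
class Scanner {
 public:
  enum CharClass { LETTER, LETTER_DIGIT };

  explicit Scanner(const std::string& s) : s_(s) {}

  // Consume exactly one character of the given class; fail otherwise.
  Scanner& One(CharClass c) {
    if (pos_ >= s_.size() || !Matches(c, s_[pos_])) ok_ = false;
    else ++pos_;
    return *this;
  }

  // Consume zero or more characters of the given class (never fails).
  Scanner& Any(CharClass c) {
    while (pos_ < s_.size() && Matches(c, s_[pos_])) ++pos_;
    return *this;
  }

  // True iff every preceding step matched.
  bool GetResult() const { return ok_; }

 private:
  static bool Matches(CharClass c, char ch) {
    switch (c) {
      case LETTER: return std::isalpha(static_cast<unsigned char>(ch)) != 0;
      case LETTER_DIGIT: return std::isalnum(static_cast<unsigned char>(ch)) != 0;
    }
    return false;
  }
  std::string s_;
  size_t pos_ = 0;
  bool ok_ = true;
};
```

With this sketch, "abc123" matches the identifier-like pattern while "1abc" does not, since One(LETTER) fails on the leading digit.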
[...] use filegroups from tensorflow/core/kernels/BUILD for Android targets.
Change: 116688861

[...]
Change: 116619279

[...] potentially linked into the binary. This makes sure that the :core_cpu target will never have any GPU code linked in, which was confusing and weird.
Change: 116618884

[...] Fixes #1409.
Change: 116554875

[...] to be linked into CPU-only binaries.
Change: 116473970

[...] buffers when copying to the CPU device. Rearranges some of the internal GPU libraries to be library- vs. runtime-specific.
Change: 116472314

[...]
Change: 116256253

[...]
Change: 116199874

[...]
Change: 116188107
[...]
When enabled, the following events are recorded:
- The start of a step, with the numerical step_id and a textual handle describing the step.
- A Tensor allocation, including the step_id, the name of the OpKernel, the data type, shape, allocation size, allocation_id, data pointer location, and allocator used (the allocation_id is local to an allocator).
- A Tensor deallocation, including the allocation_id and allocator used.
- A raw memory allocation, including the step_id, the name of the component (e.g. Eigen), the number of bytes, data pointer location, allocation_id and allocator used.
- A raw memory deallocation, including the step_id, the name of the component (e.g. Eigen), allocation_id and allocator used.
For now many Tensor allocations show 'unknown' for the kernel and step_id. These mostly come from Tensors allocated by the system from protocol buffers, and from Tensors allocated by Ops using the Tensor constructor directly instead of calling OpKernelContext::allocate_temp. The latter can in principle be cleaned up one by one as necessary; the former would require some plumbing to associate an allocation with the appropriate step_id.
With this CL, memory logging is enabled by raising the VLOG level to 1. Once there is an ability to set process-wide options programmatically, it would make sense to update the machinery to do that. Currently recorded events are logged as INFO, and they can all be retrieved by filtering the log for lines including __LOG_MEMORY__.
Some example lines are as follows:
I0301 13:38:55.797563 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown (from Proto)" tensor { dtype: DT_FLOAT shape { } allocation_description { requested_bytes: 4 allocated_bytes: 4 allocator_name: "cuda_host" allocation_id: 2 has_single_reference: true ptr: 8717861408 } } }
I0301 13:38:55.802245 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown" tensor { dtype: DT_FLOAT shape { } allocation_description { requested_bytes: 4 allocated_bytes: 256 allocator_name: "gpu_bfc" allocation_id: 1 has_single_reference: true ptr: 47378989056 } } }
I0301 13:38:55.802347 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 2 allocator_name: "cuda_host" }
[...]
I0301 13:38:55.806454 81179 log_memory.cc:18] __LOG_MEMORY__ MemoryLogStep { step_id: 1 handle: "->/init;0" }
I0301 13:38:55.806659 81220 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 1 kernel_name: "random_normal/shape" tensor { dtype: DT_INT32 shape { dim { size: 4 } } allocation_description { requested_bytes: 16 allocated_bytes: 16 allocator_name: "cuda_host" allocation_id: 1 ptr: 8717860896 } } }
[...]
I0301 13:38:56.362898 81218 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: 1 kernel_name: "conv1/truncated_normal" tensor { dtype: DT_FLOAT shape { dim { size: 11 } dim { size: 11 } dim { size: 3 } dim { size: 96 } } allocation_description { requested_bytes: 139392 allocated_bytes: 139520 allocator_name: "gpu_bfc" allocation_id: 36 has_single_reference: true ptr: 47379030016 } } }
I0301 13:38:56.362894 81217 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 24 allocator_name: "gpu_bfc" }
I0301 13:38:56.362903 81213 log_memory.cc:18] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 1 kernel_name: "conv5/truncated_normal/mul" tensor { dtype: DT_FLOAT shape { dim { size: 3 } dim { size: 3 } dim { size: 1024 } dim { size: 1024 } } allocation_description { requested_bytes: 37748736 allocated_bytes: 37748736 allocator_name: "gpu_bfc" allocation_id: 34 ptr: 48512711168 } } }
[...]
I0229 16:39:57.482980 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawAllocation { step_id: 13 operation: "xentropy/EigenAllocator" num_bytes: 64 ptr: 47386857472 allocation_id: 625 allocator_name: "gpu_bfc" }
I0229 16:39:57.483147 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawDeallocation { step_id: 13 operation: "xentropy/EigenAllocator" allocation_id: 625 allocator_name: "gpu_bfc" deferred: true }
I0229 16:39:57.483197 76558 log_memory.cc:18] __LOG_MEMORY__ MemoryLogRawDeallocation { step_id: 13 operation: "xentropy/EigenAllocator" allocation_id: 625 allocator_name: "gpu_bfc" }
Change: 116065112
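Since every recorded event carries the __LOG_MEMORY__ marker, pulling entries out of a log reduces to simple string matching. A sketch (the marker and event-type names are taken from the sample lines above; the helper function itself is hypothetical):

```cpp
#include <cassert>
#include <string>

// Returns the event type (e.g. "MemoryLogTensorAllocation") from a log
// line containing the __LOG_MEMORY__ marker, or "" for other lines.
inline std::string MemoryLogEventType(const std::string& line) {
  const std::string kMarker = "__LOG_MEMORY__ ";
  const size_t pos = line.find(kMarker);
  if (pos == std::string::npos) return "";
  const size_t start = pos + kMarker.size();
  const size_t end = line.find(' ', start);
  return line.substr(
      start, end == std::string::npos ? std::string::npos : end - start);
}
```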
[...]
For the large model, the overhead is roughly 35%. More improvements over this baseline implementation are coming, but if you have long sequence models that are currently running out of memory, I encourage you to try this out.
Calculation: dynamic LSTM, no memory swap vs. memory swap
batch  max_t  units  no_swap    swap       swap/no_swap
512    100    512    0.702892   0.946286   1.346275
512    100    256    0.292875   0.451330   1.541033
512    100    128    0.162116   0.257621   1.589119
Change: 115912325
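As a sanity check on the table above, the swap/no_swap column is just the ratio of the two timing columns; for 512 units, 0.946286 / 0.702892 ≈ 1.346, i.e. the roughly 35% overhead quoted. The helper name below is illustrative:

```cpp
#include <cassert>
#include <cmath>

// Relative cost of memory swapping: wall time with swap divided by
// wall time without. A value of 1.35 means 35% overhead.
inline double SwapOverhead(double swap_seconds, double no_swap_seconds) {
  return swap_seconds / no_swap_seconds;
}
```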
[...]
Change: 115879158

[...]
Change: 115740568

[...] longer referenced.
Change: 115711086
[...]
This includes:
* ctc_loss
* ctc_greedy_decoder
* ctc_beam_search_decoder
Change: 115683564

[...]
Change: 115675044

[...] core:gpu_lib, out of the non-GPU core:core_cpu* targets.
Change: 115641392
[...]
This includes a gRPC server (grpc_tensorflow_server) that can serve as both the master of a distributed TensorFlow computation and an individual worker in the computation. The GrpcSession class is included to allow client programs (including Python clients) to interact with a server.
See tensorflow/core/distributed_runtime/README.md for usage instructions.
This change partially addresses issue #23.
Change: 115634191
[...] Test fails.
Change: 115602477

[...]
Change: 115598732

[...] hopefully reduce confusion, since io.* is not the implementation of the ".../kernels:io" build target.
Change: 115593814

[...]
Change: 115589642
[...] return true. Add a unittest to catch this type of regression in the future.
Change: 115573280
[...]
These tools are meant to allow recording of structured benchmark and unit-test output to pbtxt files in a directory, but only when the environment variable TEST_REPORT_FILE_PREFIX is set. For now, only saving of C++ microbenchmark output is supported.
Change: 115518303
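The gating described above amounts to: no prefix, no report. A sketch with the prefix passed in explicitly; the function name and the ".pbtxt" suffix handling are assumptions, while the TEST_REPORT_FILE_PREFIX variable name comes from the commit:

```cpp
#include <cassert>
#include <string>

// Returns the output path for a test's structured results, or "" when no
// prefix is configured (i.e. reporting is disabled). A caller would read
// the prefix from the TEST_REPORT_FILE_PREFIX environment variable.
inline std::string ReportPath(const char* prefix,
                              const std::string& test_name) {
  if (prefix == nullptr || *prefix == '\0') return "";  // reporting off
  return std::string(prefix) + test_name + ".pbtxt";
}
```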
[...]
Change: 115384748

[...]
Change: 115379524
[...] This will be necessary for testing the distributed runtime (issue #23).
Change: 115339579
[...]
Two different mechanisms are required. On the CPU, we push and pop the appropriate processor flags in the executor (for the master thread) *and* in each threadpool thread, since the processor flags are thread-local. On the GPU, we set -ftz=true for both nvcc and gcudacc so that the kernels we build flush denormals to zero using instruction flags.
Caveat: on the GPU, only single-precision denormals are flushed to zero; double precision is unchanged.
Change: 115114845
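On x86, the push/pop of thread-local processor flags described above can be sketched as an RAII guard around the SSE FTZ (flush-to-zero) and DAZ (denormals-are-zero) mode bits. The class name and exact mechanism are assumptions; only the push/pop-per-thread pattern is from the commit, and this sketch is x86-only:

```cpp
#include <cassert>
#include <pmmintrin.h>  // _MM_*_DENORMALS_ZERO_MODE (SSE3)
#include <xmmintrin.h>  // _MM_*_FLUSH_ZERO_MODE (SSE)

// Scoped push/pop of the thread-local FTZ/DAZ flags: the constructor
// saves the current modes and enables flushing, the destructor restores
// them. Each threadpool thread would create its own guard, since the
// flags are per-thread state.
class ScopedFlushDenormal {
 public:
  ScopedFlushDenormal()
      : saved_ftz_(_MM_GET_FLUSH_ZERO_MODE()),
        saved_daz_(_MM_GET_DENORMALS_ZERO_MODE()) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
  }
  ~ScopedFlushDenormal() {
    _MM_SET_FLUSH_ZERO_MODE(saved_ftz_);
    _MM_SET_DENORMALS_ZERO_MODE(saved_daz_);
  }

 private:
  unsigned int saved_ftz_;
  unsigned int saved_daz_;
};
```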
[...] Test both layouts in tests.
Change: 115096872

[...]
Change: 114834125

[...]
Change: 114831071

[...] extended_ops_headers.
Change: 114784491

[...] to package custom operator sets with the core binary.
Change: 114781921

[...] also fixes the problem where users are asked which version of Polymer to install when they run `bower install`.
Change: 114774859
[...] approach:
1. Do not instantiate templates for all TF types. Instead, the various types are cast to one of uint8/uint16/uint32/uint64/string.
2. Use eigen3 for rank-2/3/4 tensor transposes, and fall back to a naive routine that is templatized only on the type T, not on NDIMS.
Change: 114763098
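The naive fallback in point 2 can be sketched for the rank-2 case; this illustrates being templatized only on the element type T and is not TensorFlow's actual kernel:

```cpp
#include <cassert>
#include <vector>

// Naive rank-2 transpose of a row-major rows x cols buffer into a
// row-major cols x rows buffer. Templatized on T only, as in the
// fallback path described above.
template <typename T>
std::vector<T> Transpose2D(const std::vector<T>& in, int rows, int cols) {
  std::vector<T> out(in.size());
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      out[c * rows + r] = in[r * cols + c];
    }
  }
  return out;
}
```

The dispatch-by-byte-size trick in point 1 means this one template covers every 1-, 2-, 4-, and 8-byte dtype without a separate instantiation per logical type.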
[...] This allows one group to include headers from the other.
Change: 114578983

[...]
Change: 114565136

[...] Framework calls ceil somewhere deep inside.
Change: 114539499

[...]
Change: 114470777

[...]
Change: 114448861

[...]
Change: 114378906
[...]
Checkpoints now have a version scheme analogous to that for GraphDefs. We have no plans ever to deprecate a checkpoint version, but it's good to have the scheme in place in case we need to.
Change: 114364388
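The GraphDef-style scheme referenced above boils down to recording a producer version plus a minimum-consumer bound in the data, checked against the reader's own version and minimum-producer bound. A hypothetical compatibility check; the field and function names here are illustrative, not TensorFlow's:

```cpp
#include <cassert>

// A data file records the version that wrote it (producer) and the
// oldest reader it supports (min_consumer).
struct Versions {
  int producer;
  int min_consumer;
};

// A reader with version reader_version, which accepts writers no older
// than reader_min_producer, can read the data iff both bounds hold.
inline bool CanRead(const Versions& data, int reader_version,
                    int reader_min_producer) {
  return data.producer >= reader_min_producer &&
         reader_version >= data.min_consumer;
}
```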
[...] separate compilation.
Change: 114356795

[...]
Change: 114273085

[...] section of the BUILD file.
Change: 114255368

[...] more careful about re-enabling wildcard-import where appropriate.
Change: 114167131