| Commit message | Author | Age |
PiperOrigin-RevId: 216400726
PiperOrigin-RevId: 216354906
PiperOrigin-RevId: 215935319
PiperOrigin-RevId: 215607769
PiperOrigin-RevId: 215263951
PiperOrigin-RevId: 215073641
the duration of a single RunInternal() call from RunHandlerPool. It is used for
running inter-op closures with a global scheduler, which in the future will
improve both median and tail latency (for use cases like CPU inference).
If global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214992852
NOTE: All ops and kernels previously defined in
tensorflow/contrib/data have had their names prefixed with
"Experimental" to indicate that they are not (yet) stable, and thus
not subject to backwards or forwards compatibility guarantees.
PiperOrigin-RevId: 214940819
PiperOrigin-RevId: 214853846
the duration of a single RunInternal() call from RunHandlerPool.
We want to leverage this abstraction to improve cross-session inter-op
parallelism for lower-latency inference in the future.
If global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214818187
PiperOrigin-RevId: 214793113
PiperOrigin-RevId: 214726180
the same source dependency twice.
PiperOrigin-RevId: 214704620
The purpose of these ops is to fix a latency problem observed for an inference benchmark. Often an inference step starts by reading the values of many (hundreds of) weights. For a resource variable, this requires a VarHandleOp and a ReadVariableOp per variable. Running hundreds of trivial ops can add hundreds of microseconds of latency to the critical path of an inference step: the inter-op latency of the executor can be hundreds of nanoseconds, which rapidly adds up.
This change introduces two fused ops _VarHandlesOp and _ReadVariablesOp that allow us to read many variables in a pair of larger ops, rather than many tiny ops.
PiperOrigin-RevId: 214662338
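A back-of-the-envelope sketch of the latency argument above; the variable count and per-op overhead below are illustrative assumptions consistent with the commit's "hundreds of weights" and "hundreds of nanoseconds", not figures taken from it:

```python
# Hypothetical numbers illustrating the dispatch-overhead math; the commit
# itself does not state these exact values.
num_variables = 300        # "hundreds" of weights read per inference step
ops_per_variable = 2       # one VarHandleOp + one ReadVariableOp each
inter_op_latency_ns = 500  # "hundreds of nanoseconds" of executor overhead

unfused_ops = num_variables * ops_per_variable
unfused_overhead_us = unfused_ops * inter_op_latency_ns / 1000

# With fused _VarHandlesOp/_ReadVariablesOp, all reads collapse into 2 ops.
fused_overhead_us = 2 * inter_op_latency_ns / 1000

print(unfused_ops)          # 600 trivial ops on the critical path
print(unfused_overhead_us)  # 300.0 microseconds of pure dispatch overhead
print(fused_overhead_us)    # 1.0 microsecond
```

Under these assumptions the fused ops remove roughly 300 microseconds of pure executor dispatch from the critical path.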
Bazel does not allow Python rules to directly depend on C++ rules,
so the static dependencies have to be managed separately, unfortunately
avoiding the "kernels" option for now.
PiperOrigin-RevId: 214532631
PiperOrigin-RevId: 214366272
PiperOrigin-RevId: 214354104
All devices implement the same tracing logic in an override of `Device::Compute()`. However, that logic does not have access to the cached `NodeItem::kernel_is_expensive` bit for the kernel, so it must make a virtual call to `OpKernel::IsExpensive()`. By inlining the logic into `ExecutorState::Process()`, we avoid making an unnecessary virtual call on each kernel invocation (when a trace controller is attached).
PiperOrigin-RevId: 214332492
cc_header_only_library.
This allows TensorFlow to be built from another Bazel repo.
PiperOrigin-RevId: 214091199
PiperOrigin-RevId: 213875284
PiperOrigin-RevId: 213863392
standard python `print` method, and deprecates the old `tf.Print` operator (to be removed in v2.0).
It follows the design doc specified in https://github.com/tensorflow/community/pull/14 and additionally incorporates the community feedback and design review decisions.
This CL adds two new internal graph operators: a StringFormat operator that formats a template string with a list of input tensors to insert into the string and outputs a string scalar containing the result, and a PrintV2 operator that prints a string scalar to a specified output stream or logging level.
The formatting op is exposed at `tf.strings.Format`. A new python method is exposed at `tf.print` that takes a list of inputs that may be nested structures and may contain tensors, formats them nicely using the formatting op, and returns a PrintV2 operator that prints them. In Eager mode and inside defuns this PrintV2 operator will automatically be executed, but in graph mode it will need to be either added to `sess.run`, or used as a control dependency for other operators being executed.
As compared to the previous print function, the new print function:
- Has an API that more closely aligns with the standard python3 print
- Supports changing the print logging level/output stream
- Allows printing arbitrary (optionally nested) data structures, as opposed to just flat lists of tensors
- Supports printing sparse tensors
- Changes the printed tensor format to show a more meaningful summary (recursively print the first and last elements of each tensor dimension, instead of just the first few elements of the tensor regardless of dimension).
PiperOrigin-RevId: 213709924
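The "recursively print the first and last elements of each dimension" format can be illustrated with a minimal pure-Python sketch; this is an assumption-laden toy over nested lists, not TensorFlow's actual printing code:

```python
def summarize(tensor, edge_items=3):
    """Keep the first and last `edge_items` entries of each dimension,
    eliding the middle -- a toy sketch of the summary format described
    above (not TensorFlow's implementation)."""
    if not isinstance(tensor, list):
        return tensor  # scalar leaf: nothing to elide
    if len(tensor) <= 2 * edge_items:
        return [summarize(t, edge_items) for t in tensor]
    head = [summarize(t, edge_items) for t in tensor[:edge_items]]
    tail = [summarize(t, edge_items) for t in tensor[-edge_items:]]
    return head + ["..."] + tail

print(summarize(list(range(10))))
# [0, 1, 2, '...', 7, 8, 9]
```

The same rule applies at every nesting level, so large matrices show their corner entries rather than just a prefix of the flattened data.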
PiperOrigin-RevId: 213693027
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU i/o and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA specific CPU devices. It also should fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
hence can be detected at that time).
PiperOrigin-RevId: 213505655
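The constructor-registration invariant described above can be sketched in a few lines of Python; the names and shape here are illustrative, not TensorFlow's actual SubAllocator API:

```python
class SubAllocator:
    """Sketch: visitors must be supplied at construction time, i.e. before
    any allocation -- the invariant this refactoring enforces."""

    def __init__(self, alloc_visitors, free_visitors):
        self._alloc_visitors = list(alloc_visitors)
        self._free_visitors = list(free_visitors)

    def alloc(self, num_bytes):
        region = bytearray(num_bytes)  # stand-in for the real allocation
        for visit in self._alloc_visitors:
            visit(region, num_bytes)   # e.g. register memory for DMA/RDMA
        return region

    def free(self, region, num_bytes):
        for visit in self._free_visitors:
            visit(region, num_bytes)   # e.g. unregister the memory

registered = []
sub = SubAllocator(
    alloc_visitors=[lambda r, n: registered.append(n)],
    free_visitors=[lambda r, n: registered.remove(n)],
)
region = sub.alloc(1 << 20)   # one large region, visited once
print(registered)             # [1048576]
sub.free(region, 1 << 20)
print(registered)             # []
```

Because visitors exist before the first `alloc`, registration happens once per large region rather than per small dynamically allocated piece, matching the usage pattern the commit describes.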
PiperOrigin-RevId: 213394522
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU i/o and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA specific CPU devices. It also should fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
hence can be detected at that time).
PiperOrigin-RevId: 213371553
PiperOrigin-RevId: 213343364
`num_parallel_calls` argument of `tf.data.Dataset.map()`, `tf.data.Dataset.interleave()`, and `tf.contrib.data.map_and_batch()`.
When `tf.data.AUTOTUNE` is specified, the level of parallelism is determined at runtime. The underlying mechanism instruments the input pipeline to build a performance model and then uses the model to find the optimal values for the parallelism knobs.
PiperOrigin-RevId: 213283297
Stateless MapDatasets can be parallelized by switching to ParallelMapDataset. We set `num_parallel_calls` to 2 for now, but in the future a special value will be used that results in the optimal value being selected dynamically at runtime.
This patch also exposed a memory leak, which has been fixed.
PiperOrigin-RevId: 213015223
PiperOrigin-RevId: 212920113
PiperOrigin-RevId: 212736286
PiperOrigin-RevId: 212684548
performance.
PiperOrigin-RevId: 212557406
1. Change Variant Decode to accept VariantTensorData (non-ref).
This should allow some optimization in the future.
In the meantime it means removing the variant.h include from tensor.h, since
variant_encode_decode.h now relies on tensor.h and variant.h now relies on that.
It also means we found a bunch of places where tensor.proto.h, variant.h, and
mutex.h were being imported through tensor.h (along with a number of other
headers); those are now imported directly so that everything compiles.
2. Move Variant registry to use TypeIndex instead of a TypeName string; this should
speed up registry lookups.
PiperOrigin-RevId: 212478896
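The registry change in (2) — keying on a type's identity rather than its name string — can be mimicked in Python; this is an illustrative analogy to the C++ TypeIndex idea, not TensorFlow's registry code:

```python
# Key the decoder registry by the type object itself (analogous to a C++
# TypeIndex) instead of the type-name string: the dict lookup then hashes
# a single object identity rather than every character of the name.
decoders_by_type = {}

def register_decoder(cls, fn):
    decoders_by_type[cls] = fn

def decode(value):
    # One hash of the type object, versus hashing/comparing a name string.
    return decoders_by_type[type(value)](value)

register_decoder(int, lambda v: f"int:{v}")
register_decoder(str, lambda v: f"str:{v}")

print(decode(7))      # int:7
print(decode("hi"))   # str:hi
```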
tensorflow/core:common_runtime/mkl_threadpool_device_test.
PiperOrigin-RevId: 212060726
There is no reason for outside dependents to make a distinction between the
Eigen or MKL transpose operation, as the substitution is transparent. There is
also no need for transpose_op.cc itself to be compiled differently based on
whether MKL is in use or not. Therefore we remove external dependencies on
:mkl_transpose_op and make :transpose_op depend on it if needed (i.e., if
using MKL). This is consistent with how other transparent MKL operations (e.g.
matmul) are built.
PiperOrigin-RevId: 211874336
lightweight way
PiperOrigin-RevId: 211833556
This commit contains the following components to support TensorFlow on the ROCm platform:
- bazel build system
- continuous integration logic
Authors:
- Jack Chung: jack.chung@amd.com
- Jeffrey Poznanovic: Jeffrey.Poznanovic@amd.com
- Peng Sun: Peng.Sun@amd.com
PiperOrigin-RevId: 211639440
Used to prepare all the header files so they can easily be installed
into /usr/include when packaging TF.
Signed-off-by: Jason Zaman <jason@perfinion.com>
Rollback of rollback. Fix: make access to collective_graph_key thread-safe.
The original change introduced a collective_graph_key_ integer to DirectSession, but it did not protect accesses to this integer. This change protects access with a mutex.
END_PUBLIC
Automated rollback of commit cb9443831283c2366e3dd91001db6362d6594f66
PiperOrigin-RevId: 211161961
is the first of a series of CLs to merge these into one. In this change, we remove the format tag from the errors.
PiperOrigin-RevId: 211146036
PiperOrigin-RevId: 211110958
PiperOrigin-RevId: 211037202
Before this CL, for collective_ops to work, the client had to specify a
collective_graph_key in the RunOptions of a session.Run call.
After this change, if a client does not specify a collective_graph_key for a
graph that contains collective ops, a graph key is generated automatically as a
hash of the set of keys of collective instances in the placed graph.
PiperOrigin-RevId: 211024617
PiperOrigin-RevId: 211020126
This will allow the functional tf.while_loop proposed in https://github.com/tensorflow/community/pull/13 to achieve feature parity with the current implementation.
Lowering is performed only when the "_lower_using_switch_merge" attr is set to True.
PiperOrigin-RevId: 210956432
There are several API migrations happening:
* ArraySlice's sub-slice constructor => .subspan
* MutableArraySlice's container pointer constructor => absl::MakeSpan
PiperOrigin-RevId: 210946124
PiperOrigin-RevId: 210929192