This change complements the existing `InstantiateOptions::executor_type`
option, which takes precedence over the attr if both are provided. It
enables the choice of executor to be separated from both the calling
op implementation and the function definition, which simplifies the
use of custom executors in operations that take a function as an attr
(e.g., `tf.data` and the functional control-flow ops).
PiperOrigin-RevId: 216532778
PiperOrigin-RevId: 216443201
PiperOrigin-RevId: 216395709
Doesn't attempt to handle cases where we might have already generated
the FunctionDef for the parent function, since in that case we cannot easily
modify the forward pass.
PiperOrigin-RevId: 216243224
`set_stats_aggregator`. `tag` would get prepended to all the statistics recorded as summaries, and `counter_prefix` would set the prefix for the statistics recorded as counters.
Note: `counter_prefix` defaults to `\tensorflow`, and `tag` and `counter_prefix` get associated with the dataset (not the stats_aggregator).
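A hypothetical usage sketch (the `tag` and `counter_prefix` arguments follow the description above; the exact module path, `tf.contrib.data` here, is an assumption for this TF 1.x-era API):

```python
import tensorflow as tf

# Hypothetical sketch: attach a StatsAggregator to a dataset and set the
# summary tag and counter prefix described above.
aggregator = tf.contrib.data.StatsAggregator()
dataset = tf.data.Dataset.range(100).apply(
    tf.contrib.data.latency_stats("record_latency"))
dataset = dataset.apply(
    tf.contrib.data.set_stats_aggregator(
        aggregator, tag="train", counter_prefix="dataset_counters"))
```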
PiperOrigin-RevId: 215609159
coordination.
PiperOrigin-RevId: 215309735
PiperOrigin-RevId: 215018984
{Lookup,Create,LookupOrCreate}Resource().
PiperOrigin-RevId: 215008650
PiperOrigin-RevId: 215003704
the duration of a single RunInternal() call from RunHandlerPool. It is used for
running inter-op closures with a global scheduler, which will in the future
improve both median and tail latency (for use cases like CPU inference).
In the case that global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214992852
PiperOrigin-RevId: 214853846
the duration of a single RunInternal() call from RunHandlerPool.
We want to leverage this abstraction for improving the cross-session inter-op
parallelism for lower latency inference in the future.
In the case that global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214818187
PiperOrigin-RevId: 214726180
The purpose of these ops is to fix a latency problem observed for an inference benchmark. Often an inference step starts by reading the values of many (hundreds of) weights. For a resource variable, this requires a VarHandleOp and a ReadVariableOp per variable. Running hundreds of trivial ops can add hundreds of microseconds of latency to the critical path of an inference step. The inter-op latency of the executor can be hundreds of nanoseconds, which rapidly adds up.
This change introduces two fused ops _VarHandlesOp and _ReadVariablesOp that allow us to read many variables in a pair of larger ops, rather than many tiny ops.
PiperOrigin-RevId: 214662338
This patch introduces an optimization that hoists RandomUniform ops out of map functions.
By doing so, we make the map function stateless, which is crucial for parallelization and vectorization.
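An illustrative sketch of the kind of pipeline this optimization targets (assumed example; the rewrite itself happens inside the tf.data graph optimizer):

```python
import tensorflow as tf

# Calling tf.random_uniform inside the map function makes it stateful, which
# blocks parallelization and vectorization. The hoisting optimization pulls
# the RandomUniform op out so the remaining map function is stateless.
dataset = tf.data.Dataset.range(10).map(
    lambda x: tf.cast(x, tf.float32) + tf.random_uniform([]))
```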
PiperOrigin-RevId: 214623178
PiperOrigin-RevId: 214553359
PiperOrigin-RevId: 214295534
PiperOrigin-RevId: 213990950
refactoring the API for exposing tunable parameters, and removing `model::Node` from the public API.
PiperOrigin-RevId: 213907565
PiperOrigin-RevId: 213886813
PiperOrigin-RevId: 213770000
standard Python `print` function, and deprecates the old `tf.Print` operator (to be removed in v2.0).
It follows the design doc specified in https://github.com/tensorflow/community/pull/14 and additionally incorporates the community feedback and design review decisions.
This CL adds two new internal graph operators: a StringFormat operator that formats a template string with a list of input tensors to insert into the string and outputs a string scalar containing the result, and a PrintV2 operator that prints a string scalar to a specified output stream or logging level.
The formatting op is exposed as `tf.strings.Format`. A new Python method is exposed as `tf.print` that takes a list of inputs that may be nested structures and may contain tensors, formats them nicely using the formatting op, and returns a PrintV2 operator that prints them. In eager mode and inside defuns this PrintV2 operator will automatically be executed, but in graph mode it will need to be either passed to `sess.run` or used as a control dependency for other operators being executed.
As compared to the previous print function, the new print function:
- Has an API that more closely aligns with the standard Python 3 `print`
- Supports changing the print logging level/output stream
- Allows printing arbitrary (optionally nested) data structures as opposed to just flat lists of tensors
- Supports printing sparse tensors
- Changes the printed tensor format to show a more meaningful summary (recursively prints the first and last elements of each tensor dimension, instead of just the first few elements of the tensor regardless of dimension).
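A minimal usage sketch of the behavior described above (illustrative only):

```python
import sys
import tensorflow as tf

# tf.print accepts arbitrary (optionally nested) structures of tensors and
# Python values and returns a print op.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print_op = tf.print("x is:", x, [x, 2 * x], output_stream=sys.stderr)

# In graph mode the print op must be passed to sess.run or used as a control
# dependency; in eager mode and inside defuns it executes automatically.
with tf.control_dependencies([print_op]):
    y = x * 2.0
```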
PiperOrigin-RevId: 213709924
PiperOrigin-RevId: 213693027
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU I/O and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA-specific CPU devices. It should also fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
hence can be detected at that time).
PiperOrigin-RevId: 213505655
PiperOrigin-RevId: 213394522
PiperOrigin-RevId: 213386401
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU I/O and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA-specific CPU devices. It should also fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
hence can be detected at that time).
PiperOrigin-RevId: 213371553
of the fact in the tf.data kernels.
PiperOrigin-RevId: 213361953
Prior to this change,
GraphConstructor::PopulateMissingUnusedInputMapKey() didn't correctly
compute the number of outputs for ops with variadic outputs. This
meant that missing_unused_input_map_keys could contain spurious
entries for unused variadic outputs, which could trigger a ValueError
in import_graph_def.
This also adds a new util method in node_def_util.h, NumOutputsForNode().
PiperOrigin-RevId: 213353158
`num_parallel_calls` argument of `tf.data.Dataset.map()`, `tf.data.Dataset.interleave()`, and `tf.contrib.data.map_and_batch()`.
When `tf.data.AUTOTUNE` is specified, the level of parallelism is determined at runtime. The underlying mechanism instruments the input pipeline to build a performance model and then uses the model to find the optimal values for the parallelism knobs.
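An illustrative sketch of requesting autotuned parallelism (the commit text refers to `tf.data.AUTOTUNE`; in other releases the constant has lived at `tf.data.experimental.AUTOTUNE`, so treat the exact symbol as an assumption):

```python
import tensorflow as tf

# Ask the runtime to pick the degree of parallelism for the map
# transformation instead of hard-coding num_parallel_calls.
dataset = tf.data.Dataset.range(1000).map(
    lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
```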
PiperOrigin-RevId: 213283297
Stateless MapDatasets can be parallelized by switching to ParallelMapDataset. We set `num_parallel_calls` to 2 for now, but in the future a special value will be used that results in the optimal value being selected dynamically at runtime.
This patch also exposed a memory leak, which has been fixed.
PiperOrigin-RevId: 213015223
Previously, we would schedule a closure for each ResourceHandleOp, because it is erroneously considered to be "expensive". This would cost several microseconds per op, whereas the execution cost of this kernel is as little as 100ns. This change causes these kernels to execute inline at the beginning of a step.
PiperOrigin-RevId: 212712378
optimization pass, instead of a step in XlaCompiler.".
PiperOrigin-RevId: 212657932
copying that type.
This avoids unnecessary string copies and deallocations in the ReadVariableOp and similar ops.
PiperOrigin-RevId: 212652588
performance.
PiperOrigin-RevId: 212557406
1. Change Variant Decode to accept VariantTensorData (non-ref).
This should allow some optimization in the future.
In the meantime it means removing the variant.h include from tensor.h, since
variant_encode_decode.h now relies on tensor.h and variant.h now relies on that.
It also means we found a bunch of places where tensor.proto.h, variant.h, and
mutex.h were being imported transitively through tensor.h (along with a bunch of
other headers), so now we import them directly in order to compile.
2. Move Variant registry to use TypeIndex instead of a TypeName string; this should
speed up registry lookups.
PiperOrigin-RevId: 212478896
time.
PiperOrigin-RevId: 212321238
PiperOrigin-RevId: 212182923
a step in XlaCompiler.
PiperOrigin-RevId: 212164482
StringPiece has been changed to string to avoid the static destruction order fiasco (we store pointers that might have a shorter lifetime) and also to use unordered_set (there is hash specialization for StringPiece).
PiperOrigin-RevId: 212059185
PiperOrigin-RevId: 211733735
optimization.
PiperOrigin-RevId: 211179990
This will allow the functional tf.while_loop proposed in https://github.com/tensorflow/community/pull/13 to achieve feature parity with the current implementation.
Lowering is performed only when the "_lower_using_switch_merge" attr is set to True.
PiperOrigin-RevId: 210956432
There are several API migrations happening:
* ArraySlice's sub-slice constructor => .subspan
* MutableArraySlice's container pointer constructor => absl::MakeSpan
PiperOrigin-RevId: 210946124
|
PiperOrigin-RevId: 210565027
PiperOrigin-RevId: 210559796
Before this change, introducing a new collective algorithm required touching
multiple files. CollectiveParams setup was in common_runtime/collective_param_resolver_local,
and the data movement was in common_runtime/reducer and common_runtime/broadcaster.
This change introduces CollectiveImplementationInterface.
CollectiveImplementationInterface brings together param initialization and data
movement for a collective algorithm. Every collective implementation will
implement this interface and override the virtual methods. This should
hopefully reduce obscurity and lead to code with fewer dependencies.
PiperOrigin-RevId: 210430157