| Commit message | Author | Age |
This change complements the existing `InstantiateOptions::executor_type`
option, which takes precedence over the attr if both are provided. It
enables the choice of executor to be separated from both the calling
op implementation and the function definition, which simplifies the
use of custom executors in operations that take a function as an attr
(e.g., `tf.data` and the functional control-flow ops).
PiperOrigin-RevId: 216532778
call for better xprof tracing. Also annotate synchronous op execution with the session-run id (or step_id) as metadata, leveraging the support introduced in cl/215985561.
This should enable highlighting the duration of a Session::Run and all the ops that ran in it, for visualizing latency regressions in the case of CPU inference.
PiperOrigin-RevId: 216284682
Doesn't attempt to deal with cases where we might have already generated
the FunctionDef for the parent function, since in that case we cannot
easily modify the forward pass.
PiperOrigin-RevId: 216243224
PiperOrigin-RevId: 216187878
Enable GPU tests for cond_v2.
PiperOrigin-RevId: 215956220
PiperOrigin-RevId: 215946205
attr values that are not overridden (e.g., transpose_a in the matmul op).
This is required for backward compatibility (a binary built via an older version
of TF should still run on a newer version of TF, where some ops may have added
attrs).
For non-eager graph building, the default attr values of graph ops are added by
tensorflow::AddDefaultsToNodeDef().
We ran into this issue when running the same S4TF test cases via eager APIs --
some tests failed due to "missing attrs", but they are fixed by this patch.
PiperOrigin-RevId: 215927271
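The default-filling behavior above can be sketched in Python. This is an illustrative stand-in, not the real `tensorflow::AddDefaultsToNodeDef()` signature; the `add_defaults` helper and the attr dictionaries are hypothetical.

```python
# Hypothetical illustration of default-attr filling, mirroring what
# tensorflow::AddDefaultsToNodeDef() does for graph ops: any attr the
# caller leaves unset is populated from the op definition's default.

def add_defaults(node_attrs, op_def_defaults):
    """Return node attrs with missing entries filled from op-def defaults."""
    merged = dict(op_def_defaults)  # start from the defaults
    merged.update(node_attrs)       # caller-provided values win
    return merged

# An older binary may omit newly added attrs such as transpose_a; the
# runtime fills them in so the op still validates on a newer TF.
matmul_defaults = {"transpose_a": False, "transpose_b": False}
attrs = add_defaults({"transpose_b": True}, matmul_defaults)
```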
An environment variable (TF_EAGER_ENABLE_SMALL_TENSOR_CPU_PINNING) is provided to turn this off if necessary (it is on by default).
PiperOrigin-RevId: 215821915
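Given the default-on behavior, the pinning heuristic can be disabled with the environment variable named above. A minimal sketch; the accepted value string `"false"` is an assumption (the change does not show the parsing), and the variable must be set before TensorFlow initializes its eager context.

```python
# Sketch: disabling small-tensor CPU pinning via the environment
# variable introduced in this change. The value "false" is an assumed
# boolean-env spelling; set it before importing TensorFlow.
import os

os.environ["TF_EAGER_ENABLE_SMALL_TENSOR_CPU_PINNING"] = "false"
# import tensorflow as tf  # import only after the variable is set
```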
Switch or Merge node.".
PiperOrigin-RevId: 215772272
in a lambda
UNLOCK_FUNCTION(ir->out_mu) annotates that the lock is held on entry,
so try_lock() should not be called.
PiperOrigin-RevId: 215769341
Avoids LOG(ERROR) spam when the Executor is unable to find a CPU kernel.
PiperOrigin-RevId: 215738481
In the process, properly place nodes on devices in the collective graph key
test.
PiperOrigin-RevId: 215616146
PiperOrigin-RevId: 215560522
PiperOrigin-RevId: 215292521
Prior to this change, the lowering pass assumed that the If op
functions would be available in the If op's graph. If the If op is
defined in a defun and then called via eager execution, the functions
will be in the eager context, but not in the defun's graph. This
change makes the lowering pass correctly use the function library
passed in by the caller via GraphOptimizationPassOptions.
PiperOrigin-RevId: 215271990
MKL is disabled, and with some minor changes
the duration of a single RunInternal() call from RunHandlerPool. It is used for
running inter-op closures with a global scheduler (to be added in the future) to
improve both median and tail latency (for use cases like CPU inference).
In the case that global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214992852
variable TF_DISABLE_MKL=1
PiperOrigin-RevId: 214853860
PiperOrigin-RevId: 214853846
PiperOrigin-RevId: 214821528
the duration of a single RunInternal() call from RunHandlerPool.
We want to leverage this abstraction to improve cross-session inter-op
parallelism for lower-latency inference in the future.
In the case that global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214818187
Before this change, a CollectiveOp user was required to specify subdiv_offsets
for the RingReduce algorithm. During ring reduction, we created chunks of the
tensor to exchange between devices. If the chunks were too large, or if the
hardware supported multiple data exchanges in parallel, the user could further
subdivide the chunk by specifying more than one subdiv offset. Each subdiv
offset corresponded to another subdivision of the chunk, so effectively the
total number of tensor chunks is the number of devices times the number of
subdivs.
After this change, we can dynamically infer the number of subdivisions based on
a target chunk size. In ring_reducer.cc, we start with 1 subdiv, and keep
increasing until the chunk size is less than MAX_CHUNK_SIZE. Currently,
MAX_CHUNK_SIZE is set at 4 MB, although it may make sense to change this based
on specific hardware.
As a part of this change, a user can now provide an empty subdiv_offsets list.
If empty, we dynamically add subdivisions based on the above algorithm. If
non-empty, we take the user-specified subdivisions.
PiperOrigin-RevId: 214815959
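The inference loop described above can be sketched as follows. This is an illustrative Python version, not the actual ring_reducer.cc code; the real implementation may step through candidate subdiv counts differently.

```python
# Illustrative sketch of inferring the number of subdivisions: start at
# 1 and keep increasing until the per-chunk size drops below
# MAX_CHUNK_SIZE (4 MB in this change).
MAX_CHUNK_SIZE = 4 * 1024 * 1024  # bytes

def infer_num_subdivs(tensor_bytes, num_devices):
    """Smallest subdiv count whose chunk size is under MAX_CHUNK_SIZE."""
    num_subdivs = 1
    # Total chunks = num_devices * num_subdivs, so each chunk holds
    # tensor_bytes / (num_devices * num_subdivs) bytes.
    while tensor_bytes / (num_devices * num_subdivs) >= MAX_CHUNK_SIZE:
        num_subdivs += 1
    return num_subdivs

# A 64 MB tensor reduced across 4 devices needs 5 subdivisions here:
# 64 MB / (4 * 5) = 3.2 MB per chunk, the first value under 4 MB.
```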
PiperOrigin-RevId: 214802032
PiperOrigin-RevId: 214723970
functionalization.
If we want to evaluate the SymbolicGradient op in constant folding, we need to construct a Device object and attach it to the FunctionLibraryRuntime. In the graph rewriting pass, we do not have a Device object created yet; it will only be created in XlaCompiler.
PiperOrigin-RevId: 214702943
Make shape inference lazy in optimizers that may not trigger.
PiperOrigin-RevId: 214669034
PiperOrigin-RevId: 214557082
PiperOrigin-RevId: 214380876
All devices implement the same tracing logic in an override of `Device::Compute()`. However, that logic does not have access to the cached `NodeItem::kernel_is_expensive` bit for the kernel, so it must make a virtual call to `OpKernel::IsExpensive()`. By inlining the logic into `ExecutorState::Process()`, we avoid making an unnecessary virtual call on each kernel invocation (when a trace controller is attached).
PiperOrigin-RevId: 214332492
This change switches `tf.contrib.data.Optional` to use a `Structure` class to represent
the structure of its value, instead of `output_types`, `output_shapes`, and `output_classes` properties. It adds support for nesting `Optional` objects and representing their structure.
This change also modifies the `Structure` class: `Structure.is_compatible_with(x)` now takes another `Structure` as the `x` argument, instead of a value. This makes it easier to work with nested structures (where we might not have a value readily available), and better matches the interface of other `is_compatible_with()` methods (e.g. in `tf.TensorShape` and `tf.DType`).
Finally, in the process of making this change, I observed possible crash failures when a DT_VARIANT tensor containing another DT_VARIANT tensor is copied between CPU and GPU. This change "fixes" the immediate problem by raising an UnimplementedError, but more work will be necessary to support the full range of use cases.
PiperOrigin-RevId: 214198993
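The structure-to-structure compatibility check described above can be illustrated with a minimal sketch. The `TensorStructure` class below is hypothetical, not the actual `tf.data` `Structure` class (which also handles nesting, sparse tensors, and more); it only shows the shape of the new API, in the spirit of `tf.TensorShape.is_compatible_with`.

```python
# Hypothetical minimal structure class illustrating the new
# Structure-vs-Structure compatibility check.

class TensorStructure:
    def __init__(self, dtype, shape):
        self.dtype = dtype
        self.shape = shape  # list of dims; None means unknown

    def is_compatible_with(self, other):
        """Takes another TensorStructure, not a value."""
        if not isinstance(other, TensorStructure):
            return False
        if self.dtype != other.dtype:
            return False
        if len(self.shape) != len(other.shape):
            return False
        # None acts as a wildcard dimension, as in tf.TensorShape.
        return all(a is None or b is None or a == b
                   for a, b in zip(self.shape, other.shape))

s1 = TensorStructure("float32", [None, 3])
s2 = TensorStructure("float32", [5, 3])
```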
GPU). This avoids many unnecessary CPU<->GPU memcpys and syncs.
PiperOrigin-RevId: 214108484
In `ExecutorState::PropagateOutputs()`, each time a loop enter node is
processed, the node's attrs are consulted to determine if it is a
"constant" or "non-constant" enter node. This entails a call to the
protobuf library, followed by multiple string comparisons to find the
attribute in the Node's NodeDef's attr map. The value of this property
never changes after the executor is first constructed, so in this
change we move it to a cached field on the `NodeItem` struct, and use
that value.
PiperOrigin-RevId: 214047449
Thanks @alextp for finding the bug!
PiperOrigin-RevId: 213999971
PiperOrigin-RevId: 213906379
PiperOrigin-RevId: 213875284
PiperOrigin-RevId: 213844688
PiperOrigin-RevId: 213770000
TF_FORCE_GPU_ALLOW_GROWTH environment variable.
PiperOrigin-RevId: 213728460
ROCmSoftwarePlatform:upstream-staging-gpu-common-runtime-1
PiperOrigin-RevId: 213653830
lightweight statistics collector for tf.data performance modeling.
PiperOrigin-RevId: 213566889
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU I/O and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA-specific CPU devices. It also should fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
and hence can be detected at that time).
PiperOrigin-RevId: 213505655
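The constructor-time visitor requirement can be sketched as follows. These are hypothetical Python classes, not the actual C++ SubAllocator interface; the point is only the design: visitors are fixed at construction, so every region allocated afterwards is guaranteed to be visited.

```python
# Hypothetical sketch of the SubAllocator design described above:
# alloc visitors are supplied at construction time, so none can be
# added after the first region has been handed out.

class SubAllocator:
    def __init__(self, alloc_visitors):
        # Visitors are fixed here, before any allocation can occur.
        self.alloc_visitors = list(alloc_visitors)

    def alloc(self, num_bytes):
        region = bytearray(num_bytes)  # stand-in for a real reservation
        # Each visitor sees every large region, e.g. to register it
        # for DMA or RDMA access.
        for visit in self.alloc_visitors:
            visit(region, num_bytes)
        return region

registered = []
sub = SubAllocator([lambda ptr, n: registered.append(n)])
buf = sub.alloc(1 << 20)  # one large region, visited once
```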
This fixes #22274.
Signed-off-by: Bairen Yi <byi@connect.ust.hk>
PiperOrigin-RevId: 213394522
PiperOrigin-RevId: 213377426
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU I/O and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA-specific CPU devices. It also should fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
and hence can be detected at that time).
PiperOrigin-RevId: 213371553