Instead, return a friendlier failed Status from the following two methods, which
used to CHECK-fail: GetIncomingPreds and FindUniqueBackedge.
While at it, also rename GetIncomingPreds to GetInputPreds to be consistent with
the variable names.
PiperOrigin-RevId: 215758757
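The pattern above — replacing a CHECK-fail with a Status the caller can handle — can be sketched as follows. This is a minimal illustration, not the TensorFlow code: the simplified `Status` and `Node` types and the negative-index error condition are hypothetical stand-ins.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for tensorflow::Status and Node (hypothetical,
// for illustration only).
struct Status {
  bool ok_ = true;
  std::string message;
  static Status OK() { return {}; }
  static Status InvalidArgument(std::string m) { return {false, std::move(m)}; }
  bool ok() const { return ok_; }
};

struct Node {
  std::vector<int> in_edges;
};

// Before: a malformed input would CHECK-fail, crashing the process.
// After: the problem surfaces as a failed Status the caller can inspect.
Status GetInputPreds(const Node& node, std::vector<int>* preds) {
  for (int e : node.in_edges) {
    if (e < 0) {
      return Status::InvalidArgument("unexpected negative edge index");
    }
    preds->push_back(e);
  }
  return Status::OK();
}
```

A caller checks `status.ok()` and propagates the error instead of dying mid-pass.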
PiperOrigin-RevId: 215757701
PiperOrigin-RevId: 215624875
can use to accelerate transfers.
PiperOrigin-RevId: 215362667
PiperOrigin-RevId: 215324035
The previous version was hitting a very slow path in `GetNodeAttr()`, which is expensive when the named attr is not found. This change inlines the logic of finding the two relevant attrs inside `GetFunctionNameAttr()` and avoids constructing a status object with a serialized `NodeDef` when the attr can't be found.
PiperOrigin-RevId: 215298411
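The idea — probe the attr map directly and treat "not found" as a cheap, expected outcome rather than building a rich error object — can be sketched like this. This is a hypothetical illustration: the `AttrMap` alias and the `"f"`/`"function_name"` attr names are stand-ins, not the actual TensorFlow lookup code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a NodeDef's attr map.
using AttrMap = std::map<std::string, std::string>;

// Fast pattern: a direct probe that signals "missing" with a null pointer.
// A general-purpose lookup that serializes the whole NodeDef into an error
// message on every miss would be far more expensive on this path.
const std::string* FindAttr(const AttrMap& attrs, const std::string& name) {
  auto it = attrs.find(name);
  return it == attrs.end() ? nullptr : &it->second;
}

// GetFunctionNameAttr-style helper: check the two relevant attrs inline.
std::string GetFunctionName(const AttrMap& attrs) {
  if (const std::string* f = FindAttr(attrs, "f")) return *f;
  if (const std::string* fn = FindAttr(attrs, "function_name")) return *fn;
  return "";  // A miss is common here, so no error object is constructed.
}
```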
This cleanup will make the future CL implementing lazy compilation simpler.
Includes some supporting changes:
- Teach NewInternalScope to create a scope that doesn't do shape inference. We
need this because we don't have a ShapeRefiner that has been run over the
entire graph available in the build_xla_ops pass.
- Add a WithAssignedDevice modifier to tensorflow::Scope.
- Make cc_op_gen write out an Operation field for nodes which may not
necessarily have any outputs. We already did this in most cases, but we
weren't doing it for nodes that have possibly-empty list outputs.
- Minor change renaming ops/xla_jit_op.cc to ops/xla_jit_ops.cc, now that we
have more than one XLA JIT op.
PiperOrigin-RevId: 215293817
PiperOrigin-RevId: 215272497
requested device placement of the XlaLaunch op must be derived from the subgraph.
PiperOrigin-RevId: 215239672
PiperOrigin-RevId: 215183847
Even with this bug we were accidentally doing the right thing (so the test case
doesn't actually fail without the fix): deleting an Edge sets its input and
output indices to kControlSlot-1 so we'd normally expect to fail when there is a
control edge out of the TF cluster (because a control edge would be recognized
as a data edge). But AddEdge(x, -1, y, -1) seems to do the right thing for both
control and data edges.
PiperOrigin-RevId: 214831204
The purpose of these ops is to fix a latency problem observed for an inference benchmark. Often an inference step starts by reading the values of many (hundreds of) weights. For a resource variable, this requires a VarHandleOp and a ReadVariableOp per variable. Running hundreds of trivial ops can add hundreds of microseconds of latency to the critical path of an inference step. The inter-op latency of the executor can be hundreds of nanoseconds, which rapidly adds up.
This change introduces two fused ops _VarHandlesOp and _ReadVariablesOp that allow us to read many variables in a pair of larger ops, rather than many tiny ops.
PiperOrigin-RevId: 214662338
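The fusion idea can be sketched as follows: amortize per-op dispatch overhead by reading a whole batch of variables in one call. This is a minimal, hypothetical sketch (a plain map stands in for resource variables), not the actual `_ReadVariablesOp` kernel.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical variable store standing in for resource variables.
using VarStore = std::unordered_map<std::string, float>;

// Unfused pattern: one dispatch per variable. Hundreds of weights mean
// hundreds of tiny ops, each paying the executor's per-op overhead.
float ReadVariable(const VarStore& store, const std::string& name) {
  return store.at(name);
}

// Fused pattern (in the spirit of _ReadVariablesOp): one dispatch reads a
// whole batch, amortizing the per-op overhead across all variables.
std::vector<float> ReadVariables(const VarStore& store,
                                 const std::vector<std::string>& names) {
  std::vector<float> values;
  values.reserve(names.size());
  for (const auto& name : names) values.push_back(store.at(name));
  return values;
}
```

With per-op overhead on the order of hundreds of nanoseconds, collapsing N tiny reads into one batched op removes roughly (N-1) dispatches from the critical path.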
SnapshotResourceVariables function.
PiperOrigin-RevId: 214488033
It wasn't actually needed.
PiperOrigin-RevId: 214346217
All devices implement the same tracing logic in an override of `Device::Compute()`. However, that logic does not have access to the cached `NodeItem::kernel_is_expensive` bit for the kernel, so it must make a virtual call to `OpKernel::IsExpensive()`. By inlining the logic into `ExecutorState::Process()`, we avoid making an unnecessary virtual call on each kernel invocation (when a trace controller is attached).
PiperOrigin-RevId: 214332492
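The optimization — consulting a bit cached at graph-setup time instead of making a virtual call on every kernel invocation — can be sketched like this. The types below are simplified, hypothetical stand-ins for `OpKernel` and `NodeItem`, not the executor's actual data structures.

```cpp
#include <cassert>

// Hypothetical kernel with a virtual IsExpensive(), as OpKernel has.
struct Kernel {
  virtual ~Kernel() = default;
  virtual bool IsExpensive() const { return false; }
};

// Per-node executor state; the answer is cached once when the graph is set
// up, mirroring NodeItem::kernel_is_expensive.
struct NodeItem {
  const Kernel* kernel = nullptr;
  bool kernel_is_expensive = false;  // Cached at graph-setup time.
};

// Hot path: read the cached bit. No virtual dispatch per invocation.
bool ShouldTreatAsExpensive(const NodeItem& item) {
  return item.kernel_is_expensive;
}
```

The cached bit is filled in once from `kernel->IsExpensive()` during setup, so the steady-state cost per invocation is a plain load rather than a vtable call.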
So far, just the clustered graph is dumped.
PiperOrigin-RevId: 213994376
This CL splits the functionality in XlaLaunch into two separate operations:
- XlaCompile, responsible for compiling a TF function into a LocalExecutable
- XlaRun, responsible for executing a LocalExecutable created by XlaCompile
This CL is a stepping stone towards implementing lazy compilation for TF/XLA.
The XlaCompile op is spec'ed to return a boolean indicating whether the
compilation was successful. Right now that boolean is always set to true by
XlaCompile and its value is otherwise ignored, but in the future it will be used
to indicate whether the TF function was compiled or not, and thus whether we
should execute XlaRun or just directly call the TF function.
XlaLaunch still exists, and will be created by create_xla_launch_op.cc. In the
future we may consider removing it altogether. build_xla_launch_ops.cc, now
renamed to build_xla_ops.cc, creates a XlaCompile/XlaRun pair instead of
XlaLaunch.
This CL is organized as follows:
- jit/ops/xla_ops.cc gets two new XLA-specific operations, XlaCompile and
XlaRun, described above. XlaRun redundantly takes the must-be-constant
inputs to the TensorFlow cluster to keep the implementation simple (simple in
the sense of similar to XlaLaunch), but I will remove this in a subsequent
cleanup CL.
- jit/kernels/xla_ops.cc implements XlaCompile and XlaRun in a fairly
straightforward manner. XlaCompile compiles the TF function, puts it in a
process-global storage, XlaExecutableClosureStore, and produces an int64 key.
XlaRun uses the key to read out the LocalExecutable and execute it. I'm not
sure if XlaExecutableClosureStore should be a resource like
XlaCompilationCache; I did not immediately see any reason to make it so.
- There are changes to the various _device files to register XlaCompile and
XlaRun for the XLA_* devices.
- Finally, I had to fix some tests that were expecting XlaLaunch in the
execution timeline.
PiperOrigin-RevId: 213895405
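The XlaCompile/XlaRun handoff described above can be sketched as a process-global keyed store: the compile side inserts an artifact and gets back an int64 key, and the run side redeems the key. This is a minimal sketch in the spirit of XlaExecutableClosureStore; a plain string stands in for the LocalExecutable, and the exact API is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Process-global store mapping int64 keys to compiled artifacts.
class ClosureStore {
 public:
  static ClosureStore& Global() {
    static ClosureStore* store = new ClosureStore;  // Never destroyed.
    return *store;
  }

  // XlaCompile side: stash the executable, hand back a key.
  int64_t Produce(std::string executable) {
    std::lock_guard<std::mutex> lock(mu_);
    int64_t key = next_key_++;
    closures_.emplace(key, std::move(executable));
    return key;
  }

  // XlaRun side: redeem the key exactly once.
  std::string Consume(int64_t key) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = closures_.find(key);
    if (it == closures_.end()) return "";  // Invalid or already-consumed key.
    std::string result = std::move(it->second);
    closures_.erase(it);
    return result;
  }

 private:
  std::mutex mu_;
  int64_t next_key_ = 0;
  std::unordered_map<int64_t, std::string> closures_;
};
```

Keeping the store process-global (rather than a per-session resource) is the simpler design the commit mentions; a resource-manager-backed version would tie closure lifetime to a session instead.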
These have the same behavior as unquantized types so we can just pass them
through to XLA (which converts them to unquantized types). They're supposed to
be used with special ops, none of which are currently implemented by XLA.
Casting (without quantization) and basic math works fine though.
These do not have a corresponding numpy type, so only tests using TF types will
see them.
PiperOrigin-RevId: 213781650
PiperOrigin-RevId: 213770000
PiperOrigin-RevId: 213653853
PiperOrigin-RevId: 213574904
I first tried to fix this issue in cr/209996730 but didn't quite fix the problem
for XLA_* devices. A node assigned to an XLA_* device must be compiled, so
the cr/209996730 fix of simply not compiling the nodes doesn't generalize to
XLA_* devices. Instead we now "isolate" these nodes, only putting them in a
trivial one-node cluster. For non-XLA devices even this trivial cluster is
ignored because of flags->tf_xla_min_cluster_size.
I was initially considering a more principled data-flow-analysis based solution
but then decided the upfront work isn't worth it until I see a clear motivating
example.
PiperOrigin-RevId: 213531437
Before this CL the PartiallyDeclusterPassTest.DontDuplicateResourceVarOps test
was buggy, in that it wasn't testing what it was supposed to test.
PiperOrigin-RevId: 213501558
The test changes are awkward. None of these are XLA bugs, it's just that the op
definitions in tensorflow are really inconsistent. I tried to infer whether the
limitation is on signed types, index types, or just arbitrary. In the latter
case just int8/uint8 is blacklisted; we should probably lift that requirement
at some point.
PiperOrigin-RevId: 213243906
I need these to write readable unit tests for TF graph transformations. All of
my use cases will live inside tensorflow/compiler so putting it in
tensorflow/compiler/jit for now; but we can move these out if other users are
interested.
In the future we may want to auto-generate type safe versions of these from the
op registrations like we generate C++ wrappers today.
PiperOrigin-RevId: 213186810
PiperOrigin-RevId: 212896336
optimization pass, instead of a step in XlaCompiler.".
PiperOrigin-RevId: 212657932
PiperOrigin-RevId: 212465918
have been explicitly marked to be compiled via xla.compile()
PiperOrigin-RevId: 212407112
This is needed when the graph contains custom call ops. These functions are found only in the graph's registry and not the default one.
PiperOrigin-RevId: 212297305
PiperOrigin-RevId: 212289067
PiperOrigin-RevId: 212182923
a step in XlaCompiler.
PiperOrigin-RevId: 212164482
The CL is organized as follows:
- The main change is in jit/partially_decluster_pass.
- tf2xla/const_analysis now takes an "edge_filter" to facilitate use by
jit/partially_decluster_pass.
- tests/dense_layer_test.py was using the execution of ListDiff as what I
assume is a sanity check to see that the XLA cluster ran. With this CL the
ListDiff op gets declustered so we now check for "MatMult" for the sanity
check.
- Some tests were dropping TF_XLA_FLAGS; fixed them to not do so.
PiperOrigin-RevId: 212071118
PiperOrigin-RevId: 212002568
PiperOrigin-RevId: 211895566
PiperOrigin-RevId: 211733735
consistently
StringPiece is an alias for absl::string_view, and InlinedVector is aliased to absl::InlinedVector. StrCat is compatible, so swapping it out is safe.
PiperOrigin-RevId: 211691840
I want --vmodule=xla_compilation_cache=1 to print only the most essential
things.
PiperOrigin-RevId: 211676846
PiperOrigin-RevId: 210998142
XlaLocalLaunchBase was modifying platform_id_ without a lock, which is racy
because the same OpKernel can be executed concurrently. Fix this by inferring
platform_id_ in the kernel constructor.
While at it, also make use_multiple_streams_ and xla_device_metadata_ member
variables.
PiperOrigin-RevId: 210751494
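The race fix follows a standard pattern: compute the value once in the constructor and make it immutable afterward, so concurrent Compute() calls only ever read it. A minimal sketch, with hypothetical simplified types (not the real OpKernel machinery):

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for OpKernelConstruction.
struct KernelConstruction {
  std::string device_type;
};

// Before (racy): a member written lazily inside Compute(), which may run
// concurrently on the same kernel instance.
// After (fixed): the value is derived once in the constructor; state that is
// immutable after construction needs no lock.
class LaunchKernel {
 public:
  explicit LaunchKernel(const KernelConstruction& ctx)
      : platform_id_(ctx.device_type == "GPU" ? 1 : 0) {}

  // Safe to call concurrently: platform_id_ is never written after
  // construction.
  int platform_id() const { return platform_id_; }

 private:
  const int platform_id_;  // Inferred once, in the constructor.
};
```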
PiperOrigin-RevId: 210467779
There are a couple of reasons to do this:
- Resource handles are regular tensors that are part of a public API and
can potentially be returned from a function.
- When tfe.defun is executed under GradientTape, it generates a
function returning resource handles in certain cases.
This CL adds support for returning resource handles from an XLA
compiled function. These resource handles must have been passed as
arguments to the function. In other words, we don't yet support
returning resources created inside the function. tfe.defun never
makes functions that create resources.
PiperOrigin-RevId: 210442856
Of {fusable, fusile, fusible} my dictionary only knows about fusible.
PiperOrigin-RevId: 210373347
PiperOrigin-RevId: 210317627
PiperOrigin-RevId: 210130976
executor until transfers from host to device are complete.
PiperOrigin-RevId: 210098914
This is a cleanup on cr/208763036. Instead of spreading information about
resource ops between jit/mark_for_compilation_pass and
jit/resource_operation_safety_analysis we now have
tf2xla/resource_operation_table own it.
PiperOrigin-RevId: 210044178
tensor.h soon)
We plan to remove the import of variant.h from tensor.h; variant.h brings in a lot
of transitive imports (including protos like tensor.proto.h). To prepare, we're
updating code that this will break.
PiperOrigin-RevId: 210043667
PiperOrigin-RevId: 210042392