| Commit message | Author | Age |
---
Enable collective op execution in distributed mode:
Pass collective_graph_key into the graph-building and
step-execution contexts (MasterSession), where it triggers
allocation of an RpcCollectiveExecutorMgr that becomes
accessible via the WorkerEnv and MasterEnv.
The collective_graph_key is used to synchronize step_ids
(which are otherwise random) across independent
graph executions that contain collective ops that need
to rendezvous.
All APIs for using collectives are still non-public and
experimental.
PiperOrigin-RevId: 199879087
---
Distributed-mode implementations of CollectiveRemoteAccess.
Extend Worker interface with corresponding new methods.
This change is part of a series of changes introducing infrastructure
for collective ops and initial implementations of reduction and broadcast.
PiperOrigin-RevId: 196010718
---
Distributed-mode implementations of DeviceResolverInterface
and ParamResolverInterface. Extend Worker interface with
new methods in support of these interfaces.
This change is part of a series of changes introducing infrastructure
for collective ops and initial implementations of reduction and broadcast.
PiperOrigin-RevId: 194984585
---
Previously, `MasterSession::Close()` did not block on the cleanup RPCs
to the individual workers, which could cause problems in deployments
where the remote workers might be shut down (e.g. by an external
mechanism) before the session was closed. In order to switch over to using
DeleteWorkerSession for all sessions, and preserve backwards
compatibility, we need to permit this behavior. Therefore, this CL
adds a 10-second timeout on the requests to workers, and logs an error
if the request does not succeed in that time period.
PiperOrigin-RevId: 193441618
---
Previously, if the session handle was unrecognized by the worker, it
would default to using the LegacySession. This prevents us from
noticing that a server has been restarted.
To address the problem in a backwards-compatible way, we add a bit to
each session-handle-carrying worker request, indicating whether the
master believes that CreateWorkerSession has been called. If this bit
is set and the handle is unrecognized, the worker will raise an
AbortedError, which can be caught by high-level frameworks such as
`tf.estimator`.
Note that CreateWorkerSession is not yet used by default, and a
follow-up change will add that.
PiperOrigin-RevId: 193427057
---
PiperOrigin-RevId: 188747641
---
Previously, DeleteWorkerSession was responsible for freeing the
WorkerSession owned by the SessionMgr. However, it is possible for
other requests to be in-flight on the same session, and requests from
the master to be delivered out of order, which leads to the potential
for a request to use a WorkerSession after it has been freed. Revise
the SessionMgr interface to handle std::shared_ptr<WorkerSession>
instead of raw pointers to avoid this risk.
PiperOrigin-RevId: 181975078
---
body for RunGraph and RunStep RPCs, to work around the fact that the
RPC subsystem truncates long metadata messages.
PiperOrigin-RevId: 180203356
---
Setting this option to true when creating a session ensures that no
stateful resources (variables, queues, iterators, etc.) will be
visible to any other session running on the same server, and those
resources will be deleted when the session is closed.
The default behavior, namely that all `tf.Variable` objects are shared by
default and most other resources are shared when their `shared_name` attr is
non-empty, is preserved.
This change augments the semantics of the WorkerService.CreateWorkerSession
RPC. Now, if the server_def in the request is empty, it implies that
the worker should use its default ClusterSpec. Note that clusters created
using ClusterSpec propagation always have isolated session state, and are
unaffected by this change.
PiperOrigin-RevId: 177173545
---
The new method is the counterpart to `WorkerService.CreateWorkerSession`, and
is called in all cases where worker sessions have been explicitly created (i.e.
when using ClusterSpec propagation).
PiperOrigin-RevId: 175877407
---
PiperOrigin-RevId: 175637128
---
PiperOrigin-RevId: 171239477
---
only works for processes running on CPUs.
PiperOrigin-RevId: 170725482
---
Commit b30ce4714 authored by James Qin<jamesqin@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Revamp CudnnRNN Saveables
1. Use a lossy way to save/restore cudnn biases during checkpointing.
Cudnn uses two biases per gate for all RNNs, while tf uses one. To keep cudnn checkpoints
compatible with both the Cudnn and platform-independent implementations, previously both the
individual biases and the summed bias for each gate were stored.
The new way stores only the bias sum for each gate, and splits it in half when
restoring into a cudnn graph. This does not cause problems, since RNNs do not use
weight decay for regularization.
2. Use inheritance instead of branching
* Split RNNParamsSaveable into 1 base class and 4 subclasses.
* Extract common routines and override only the rnn-type-specific pieces in subclasses.
PiperOrigin-RevId: 166413989
---
Commit ebc421daf authored by Alan Yee<alyee@ucsd.edu>
Committed by Jonathan Hseu<vomjom@vomjom.net>:
Update documentation for contrib (#12424)
* Update __init__.py
Remove ## for standardization of api docs
* Create README.md
Add README to define this directory's purpose
* Update __init.py
Markdown styling does not show up well in api docs
* Update README.md
Add short mention of describing what to deprecate
* Update README.md
Capitalize title
* Update README.md
Revert README change
* Delete README.md
---
Commit fd295394d authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Use latest version of nsync library, which now allows use of cmake on MacOS.
PiperOrigin-RevId: 166411437
---
Commit 587d728e0 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA] Refactor reduce-precision-insertion filters, add several more options.
In particular, this adds the ability to add reduce-precision operations after fusion nodes based on the contents of those fusion nodes, and the ability to filter operations based on the "op_name" metadata.
PiperOrigin-RevId: 166408392
---
Commit 3142f8ef5 authored by Ali Yahya<alive@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Steps toward making ResourceVariables compatible with Eager.
This change forces the value of the reuse flag in variable scopes to be tf.AUTO_REUSE when in Eager mode.
This change also adds comprehensive Eager tests for ResourceVariable.
PiperOrigin-RevId: 166408161
---
Commit b2ce45150 authored by Igor Ganichev<iga@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Make Graph::IsValidNode public
It can be reimplemented with existing public APIs, but instead of doing so,
making this one public seems better.
PiperOrigin-RevId: 166407897
---
Commit 0a2f40e92 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
[XLA::CPU] Fix HLO profiling in parallel CPU backend.
PiperOrigin-RevId: 166400211
---
Commit c4a58e3fd authored by Yao Zhang<yaozhang@google.com>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Identify frame ids for all nodes in a graph.
PiperOrigin-RevId: 166397615
---
Commit 989713f26 authored by A. Unique TensorFlower<gardener@tensorflow.org>
Committed by TensorFlower Gardener<gardener@tensorflow.org>:
Automated g4 rollback of changelist 166294015
PiperOrigin-RevId: 166521502
---
PiperOrigin-RevId: 162456565
---
std::move casts.
PiperOrigin-RevId: 158017670
---
PiperOrigin-RevId: 156529141
---
ClusterSpec propagation is a capability upgrade for TensorFlow that should make
it much easier to (1) build distributed TensorFlow clusters, and (2) handle
node failures. The ClusterSpec propagation capability allows TensorFlow workers
to be booted independently of each other, and with no knowledge about others.
The client can then construct a ClusterDef (ClusterSpec), and then send it
to the TF master at session creation. The master in turn then propagates the
ClusterDef along to all of the workers.
Change: 155159972
---
* Along the way, unify the way the debugger works in DirectSession (for non-distributed Sessions) and MasterSession (for distributed Sessions).
* The SummarizeDebugTensorWatches method is invoked in DirectSession::GetOrCreateExecutors() and MasterSession::HashBuildGraphOptions() to generate keys for partition graphs and executors.
* The DebugStateInterface::PublishDebugMetadata() method is used to send metadata about the debugged Session::Run() call to debug URLs. This happens in DirectSession::Run() and MasterSession::DoRunWithLocalExecution(), respectively.
* The DebugGraphDecoratorInterface::DecorateGraph() and DebugGraphDecoratorInterface::PublishGraph() methods are used to insert debug ops into the debugged graph and send the modified graph to debug URLs. This happens in DirectSession::GetOrCreateExecutors() and GraphMgr::InitItem(), respectively.
Change: 154631802
---
Change: 151899404
---
State in workers is currently splayed across graph_mgr, rendezvous_mgr, and
additional components. This has resulted in it being difficult to ensure proper
cleanup and shut down of the worker components.
In addition to paving the way for a more reliable shut down, this CL also sets
up the beginnings of ClusterSpec propagation.
ClusterSpec propagation is a capability upgrade for TensorFlow that should make
it much easier to (1) build distributed TensorFlow clusters, and (2) handle
node failures. After the ClusterSpec propagation capability is fully
implemented, the TensorFlow workers can be booted independently of each other,
and with no knowledge about others. A client can then query a central cluster
scheduler or other API to find all of the workers, and then send the
ClusterDef (ClusterSpec) to the TF master, which then propagates that along to
all of the workers.
This change is only the first of a sequence to fully implement ClusterSpec
propagation in TensorFlow.
Change: 151229111
---
Change: 150265300
---
Add a new tensorflow::Status::IgnoreError() method to mark call sites where a Status has been intentionally ignored.
Change: 147402405
---
This CL includes wrapper classes for the protocol buffer messages
`tensorflow::RunStepResponse` and `tensorflow::RunGraphResponse` (to
complement the corresponding request message wrappers that were added recently).
This change makes the backend code deal with abstract
`tensorflow::MutableRunStepResponseWrapper` and
`tensorflow::MutableRunGraphResponseWrapper` interfaces and adds three
concrete implementations of each interface:
* A mutable in-memory wrapper, which maintains the tensor data in
`tensorflow::Tensor` objects, and provides the most efficient
implementation when the client and master (or master and worker)
are in the same address space.
* A mutable, owned protobuf wrapper, which has a similar implementation
to today's client code.
* A mutable, non-owned protobuf wrapper, which has a similar
implementation to today's server code (where the protobuf message is
owned by the RPC subsystem).
This is another improvement for issue #6256.
Change: 144481118
---
This CL includes wrapper classes for the protocol buffer messages
`tensorflow::RunStepRequest` and `tensorflow::RunGraphRequest`.
Previously the service arguments were always protocol buffer messages,
which can entail copying large tensor values into and out of the
request message. This change makes the backend code deal with abstract
`tensorflow::RunStepRequestWrapper` and
`tensorflow::RunGraphRequestWrapper` interfaces and adds three
concrete implementations of each interface:
* A mutable in-memory wrapper, which maintains the tensor data in
`tensorflow::Tensor` objects, and provides the most efficient
implementation when the client and master are in the same address
space.
* A mutable protobuf wrapper, which has a similar implementation to
today's client code.
* A const wrapper around a const protobuf, which has a similar
implementation to today's server code.
This is another improvement for issue #6256.
Change: 143620823
---
Also use `Tensor::AsProtoTensorContent()` when populating the fetched
values from a gRPC worker service, as this is more efficient for larger
values. This should improve #6256 slightly.
Change: 142813084
---
This change modifies the GrpcWorkerCache so that, when a master
attempts to communicate with the worker in the same process, it does
so by direct method calls on a `WorkerInterface*`, without making a
loopback RPC call.
This change is another incremental step towards addressing issue
#6256. There are further improvements possible, and we will continue
to investigate them, including:
* Avoiding the protobuf encoding/decoding for request/response objects
where this affects performance. The zero-copy `TensorResponse` class
is an example of how we could improve performance here, for
`RunGraphRequest` and `RunGraphResponse` objects.
* Profiling the closure creation/context switch overhead for
interactions with the local worker.
Change: 142793965