path: root/tensorflow/core/distributed_runtime/worker.cc
* Collective Ops Part 8 (A. Unique TensorFlower, 2018-06-08)

  Enable collective op execution in distributed mode: pass collective_graph_key into graph building and step execution contexts (MasterSession), where it triggers allocation of an RpcCollectiveExecutorMgr that becomes accessible via the WorkerEnv and MasterEnv. The collective_graph_key is used to synchronize step_ids (which are otherwise random) between otherwise independent graph executions that contain collective ops that need to rendezvous. All APIs for using collectives are still non-public and experimental.

  PiperOrigin-RevId: 199879087
* Collective Ops Part 6 (A. Unique TensorFlower, 2018-05-09)

  Distributed-mode implementations of CollectiveRemoteAccess. Extend Worker interface with corresponding new methods. This change is part of a series of changes introducing infrastructure for collective ops and initial implementations of reduction and broadcast.

  PiperOrigin-RevId: 196010718
* Collective Ops Part 5 (A. Unique TensorFlower, 2018-05-01)

  Distributed-mode implementations of DeviceResolverInterface and ParamResolverInterface. Extend Worker interface with new methods in support of these interfaces. This change is part of a series of changes introducing infrastructure for collective ops and initial implementations of reduction and broadcast.

  PiperOrigin-RevId: 194984585
* Add a ten-second timeout to the DeleteWorkerSession call. (Derek Murray, 2018-04-18)

  Previously, `MasterSession::Close()` did not block on the cleanup RPCs to the individual workers, leading to deployments where the remote workers might be shut down (e.g. by an external mechanism) before the session was closed. In order to switch over to using DeleteWorkerSession for all sessions, and preserve backwards compatibility, we need to permit this behavior. Therefore, this CL adds a 10-second timeout on the requests to workers, and logs an error if the request does not succeed in that time period.

  PiperOrigin-RevId: 193441618
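  The pattern behind this change can be shown without any TensorFlow internals. A minimal, self-contained sketch (the simulated RPC and names below are illustrative assumptions, not the actual worker code): issue the asynchronous cleanup call, wait on it for a bounded time, and log instead of blocking forever.

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <memory>
    #include <thread>

    // Stand-in for an asynchronous DeleteWorkerSession-style RPC: runs on a
    // background thread and fulfills the returned future when it completes.
    std::future<void> DeleteWorkerSessionAsync() {
      auto done = std::make_shared<std::promise<void>>();
      std::future<void> result = done->get_future();
      std::thread([done] {
        std::this_thread::sleep_for(std::chrono::seconds(2));  // simulated work
        done->set_value();
      }).detach();
      return result;
    }

    int main() {
      std::future<void> pending = DeleteWorkerSessionAsync();
      // Block session close for at most 10 seconds; if the worker never
      // answers, log the problem and move on instead of hanging the master.
      if (pending.wait_for(std::chrono::seconds(10)) != std::future_status::ready) {
        std::cerr << "DeleteWorkerSession did not complete within 10 seconds\n";
      }
      return 0;
    }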
* Never use the LegacySession when a Master explicitly calls CreateWorkerSession. (Derek Murray, 2018-04-18)

  Previously, if the session handle was unrecognized by the worker, it would default to using the LegacySession. This prevents us from noticing that a server has been restarted. To address the problem in a backwards-compatible way, we add a bit to each session-handle-carrying worker request, indicating whether the master believes that CreateWorkerSession has been called. If this bit is set and the handle is unrecognized, the worker will raise an AbortedError, which can be caught by high-level frameworks such as `tf.estimator`. Note that CreateWorkerSession is not yet used by default, and a follow-up change will add that.

  PiperOrigin-RevId: 193427057
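  A condensed sketch of the lookup logic described above (the struct, field, and result names are assumptions chosen for illustration, not the exact worker code):

    #include <string>
    #include <unordered_set>

    enum class LookupResult { kFound, kUseLegacySession, kAborted };

    struct SessionRequest {
      std::string session_handle;
      bool create_worker_session_called = false;  // the new bit set by the master
    };

    LookupResult ResolveSession(const std::unordered_set<std::string>& known_handles,
                                const SessionRequest& req) {
      if (known_handles.count(req.session_handle) > 0) return LookupResult::kFound;
      // Unknown handle: if the master says CreateWorkerSession was called, this
      // worker must have restarted and lost its state, so fail loudly with an
      // Aborted-style error instead of silently falling back.
      if (req.create_worker_session_called) return LookupResult::kAborted;
      // Old masters never call CreateWorkerSession; keep the legacy behavior.
      return LookupResult::kUseLegacySession;
    }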
* Avoid capturing unused variables in lambda functions (Benoit Steiner, 2018-03-12)

  PiperOrigin-RevId: 188747641
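  The kind of cleanup this refers to, in a self-contained toy example (the names are hypothetical, not code from worker.cc):

    #include <functional>

    void Schedule(const std::function<void()>& fn) { fn(); }

    void Example() {
      int needed = 1;
      int extra = 2;
      // Before: `extra` appears in the capture list but is never used in the
      // body, so it is copied into the closure for nothing.
      Schedule([needed, extra] { (void)needed; });
      // After: capture only what the closure actually uses.
      Schedule([needed] { (void)needed; });
      (void)extra;
    }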
* Fix potential use-after-free bugs in the worker with DeleteWorkerSession. (Derek Murray, 2018-01-15)

  Previously, DeleteWorkerSession was responsible for freeing the WorkerSession owned by the SessionMgr. However, it is possible for other requests to be in-flight on the same session, and requests from the master to be delivered out of order, which leads to the potential for a request to use a WorkerSession after it has been freed. Revise the SessionMgr interface to handle std::shared_ptr<WorkerSession> instead of raw pointers to avoid this risk.

  PiperOrigin-RevId: 181975078
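  A minimal sketch of the ownership model the commit describes (class and method names below are simplified stand-ins, not the real SessionMgr API):

    #include <memory>
    #include <string>
    #include <unordered_map>

    struct WorkerSessionState { /* graph mgr, rendezvous mgr, devices, ... */ };

    class SessionRegistry {
     public:
      std::shared_ptr<WorkerSessionState> Find(const std::string& handle) const {
        auto it = sessions_.find(handle);
        return it == sessions_.end() ? nullptr : it->second;
      }
      void Delete(const std::string& handle) {
        // Erasing drops only the registry's reference; an in-flight request
        // that already holds a shared_ptr keeps the session alive until it
        // finishes, which is what closes the use-after-free window.
        sessions_.erase(handle);
      }

     private:
      std::unordered_map<std::string, std::shared_ptr<WorkerSessionState>> sessions_;
    };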
* Optionally store the status code/message in the response body for RunGraph and RunStep RPCs, to work around the fact that the RPC subsystem truncates long metadata messages. (A. Unique TensorFlower, 2017-12-27)

  PiperOrigin-RevId: 180203356
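  The idea in isolation (the struct and field names here are illustrative, not the actual RunGraphResponse proto): copy the failing status into the response body itself, where long error strings are not subject to the metadata size limit.

    #include <string>

    struct StatusLike {
      int code = 0;  // 0 means OK
      std::string message;
      bool ok() const { return code == 0; }
    };

    struct RunGraphResponseLike {
      int status_code = 0;
      std::string status_error_message;
    };

    void RecordStatusInBody(const StatusLike& s, RunGraphResponseLike* response) {
      if (!s.ok()) {
        response->status_code = s.code;
        // Carried in the response payload, so it is not truncated the way
        // long RPC metadata messages can be.
        response->status_error_message = s.message;
      }
    }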
* Add `ConfigProto.isolate_session_state` option for the distributed runtime. (Derek Murray, 2017-11-28)

  Setting this option to true when creating a session ensures that no stateful resources (variables, queues, iterators, etc.) will be visible to any other session running on the same server, and those resources will be deleted when the session is closed. The default behavior, namely that all `tf.Variable` objects are shared by default and most other resources are shared when their `shared_name` attr is non-empty, is preserved.

  This change augments the semantics of the WorkerService.CreateWorkerSession RPC. Now, if the server_def in the request is empty, it implies that the worker should use its default ClusterSpec. Note that clusters created using ClusterSpec propagation always have isolated session state, and are unaffected by this change.

  PiperOrigin-RevId: 177173545
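  A sketch of opting in from a C++ client, assuming the TensorFlow C++ client headers are available; the target string is a placeholder:

    #include "tensorflow/core/public/session.h"
    #include "tensorflow/core/public/session_options.h"

    tensorflow::Session* NewIsolatedSession() {
      tensorflow::SessionOptions options;
      options.target = "grpc://worker0.example:2222";  // placeholder server address
      // Stateful resources (variables, queues, iterators) created by this
      // session stay invisible to other sessions on the same server and are
      // deleted when the session is closed.
      options.config.set_isolate_session_state(true);
      tensorflow::Session* session = nullptr;
      TF_CHECK_OK(tensorflow::NewSession(options, &session));
      return session;
    }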
* Add `WorkerService.DeleteWorkerSession` method to fix a memory leak. (Derek Murray, 2017-11-15)

  The new method is the counterpart to `WorkerService.CreateWorkerSession`, and is called in all cases where worker sessions have been explicitly created (i.e. when using ClusterSpec propagation).

  PiperOrigin-RevId: 175877407
* OOM error with allocation information. (A. Unique TensorFlower, 2017-11-13)

  PiperOrigin-RevId: 175637128
* Track memory allocation/deallocation history. (A. Unique TensorFlower, 2017-10-05)

  PiperOrigin-RevId: 171239477
* Allowing functions to run across processes using RPCs. Currently this only works for processes running on CPUs. (Rohan Jain, 2017-10-02)

  PiperOrigin-RevId: 170725482
* Merge changes from github. (Jonathan Hseu, 2017-08-25)

  END_PUBLIC

  ---
  Commit b30ce4714 authored by James Qin <jamesqin@google.com>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  Revamp CudnnRNN Saveables

  1. Use a lossy way to save/restore cudnn biases during checkpointing. Cudnn uses 2 biases per gate for all RNNs while tf uses one. To allow cudnn checkpoints to be compatible with both Cudnn and platform-independent impls, previously both the individual biases and the summed biases for each gate were stored. The new way only stores the bias sum for each gate, and splits it half-half when restoring from a cudnn graph. Doing this does not cause problems since RNNs do not use weight-decay to regularize.

  2. Use inheritance instead of branching
     * Split RNNParamsSaveable into 1 base class and 4 subclasses.
     * Extract common routines and only overwrite rnn-type-specific pieces in subclasses.

  PiperOrigin-RevId: 166413989

  ---
  Commit ebc421daf authored by Alan Yee <alyee@ucsd.edu>, committed by Jonathan Hseu <vomjom@vomjom.net>:

  Update documentation for contrib (#12424)

  * Update __init__.py: Remove ## for standardization of api docs
  * Create README.md: Add README to define this directory's purpose
  * Update __init__.py: Markdown styling does not show up well in api docs
  * Update README.md: Add short mention of describing what to deprecate
  * Update README.md: Capitalize title
  * Update README.md: Revert README change
  * Delete README.md

  ---
  Commit fd295394d authored by A. Unique TensorFlower <gardener@tensorflow.org>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  Use latest version of nsync library, which now allows use of cmake on MacOS.

  PiperOrigin-RevId: 166411437

  ---
  Commit 587d728e0 authored by A. Unique TensorFlower <gardener@tensorflow.org>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  [XLA] Refactor reduce-precision-insertion filters, add several more options. In particular, this adds the ability to add reduce-precision operations after fusion nodes based on the contents of those fusion nodes, and the ability to filter operations based on the "op_name" metadata.

  PiperOrigin-RevId: 166408392

  ---
  Commit 3142f8ef5 authored by Ali Yahya <alive@google.com>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  Steps toward making ResourceVariables compatible with Eager. This change forces the value of the reuse flag in variable scopes to be tf.AUTO_REUSE when in Eager mode. This change also adds comprehensive Eager tests for ResourceVariable.

  PiperOrigin-RevId: 166408161

  ---
  Commit b2ce45150 authored by Igor Ganichev <iga@google.com>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  Make Graph::IsValidNode public. It can be reimplemented with existing public APIs, but instead of doing so, making this one public seems better.

  PiperOrigin-RevId: 166407897

  ---
  Commit 0a2f40e92 authored by A. Unique TensorFlower <gardener@tensorflow.org>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  [XLA::CPU] Fix HLO profiling in parallel CPU backend.

  PiperOrigin-RevId: 166400211

  ---
  Commit c4a58e3fd authored by Yao Zhang <yaozhang@google.com>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  Identify frame ids for all nodes in a graph.

  PiperOrigin-RevId: 166397615

  ---
  Commit 989713f26 authored by A. Unique TensorFlower <gardener@tensorflow.org>, committed by TensorFlower Gardener <gardener@tensorflow.org>:

  BEGIN_PUBLIC
  Automated g4 rollback of changelist 166294015

  PiperOrigin-RevId: 166521502
* Add output_partitions support in distributed runtime. (Suharsh Sivakumar, 2017-07-19)

  PiperOrigin-RevId: 162456565
* Performance-related tweaks: Don't copy loop variables; remove ineffective std::move casts. (A. Unique TensorFlower, 2017-06-05)

  PiperOrigin-RevId: 158017670
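  Both patterns, shown in a standalone toy example (the function names are hypothetical):

    #include <string>
    #include <vector>

    std::vector<std::string> MakeNames() { return {"ps", "worker"}; }
    void Consume(const std::string&) {}

    void IterateWithoutCopies() {
      // Iterate by const reference instead of copying each element
      // (was: for (std::string name : MakeNames())).
      for (const std::string& name : MakeNames()) {
        Consume(name);
      }
    }

    std::string BuildName() {
      std::string result = "worker";
      // Returning a local by value already moves (or elides the copy
      // entirely); wrapping it in std::move gains nothing and can inhibit
      // copy elision.
      return result;  // was: return std::move(result);
    }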
* Refactor partial run state handling into partial_run_mgr. (Suharsh Sivakumar, 2017-05-19)

  PiperOrigin-RevId: 156529141
* Implement ClusterSpec Propagation in TF Master (Brennan Saeta, 2017-05-04)

  ClusterSpec propagation is a capability upgrade for TensorFlow that should make it much easier to (1) build distributed TensorFlow clusters, and (2) handle node failures. The ClusterSpec propagation capability allows TensorFlow workers to be booted independently of each other, and with no knowledge about others. The client can then construct a ClusterDef (ClusterSpec), and then send it to the TF master at session creation. The master in turn then propagates the ClusterDef along to all of the workers.

  Change: 155159972
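  A sketch of the client side of that flow, assuming the TensorFlow C++ client headers; all addresses and job names are placeholders:

    #include "tensorflow/core/protobuf/cluster.pb.h"
    #include "tensorflow/core/public/session.h"
    #include "tensorflow/core/public/session_options.h"

    tensorflow::Status NewSessionWithPropagatedCluster(tensorflow::Session** session) {
      tensorflow::SessionOptions options;
      options.target = "grpc://master.example:2222";  // placeholder master address

      // Describe the cluster in the session's ConfigProto; the master receives
      // it at session creation and propagates it to each worker.
      tensorflow::ClusterDef* cluster = options.config.mutable_cluster_def();
      tensorflow::JobDef* job = cluster->add_job();
      job->set_name("worker");
      (*job->mutable_tasks())[0] = "worker0.example:2222";  // placeholder
      (*job->mutable_tasks())[1] = "worker1.example:2222";  // placeholder

      return tensorflow::NewSession(options, session);
    }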
* Add TFDBG support to GrpcSession (Shanqing Cai, 2017-04-29)

  * Along the way, unify the way the debugger works in DirectSession (non-distributed Sessions) and MasterSession (for distributed Sessions).
  * The SummarizeDebugTensorWatches method is invoked in DirectSession::GetOrCreateExecutors() and MasterSession::HashBuildGraphOptions() to generate keys for partition graphs and executors.
  * The DebugStateInterface::PublishDebugMetadata() method is used to send metadata about the debugged Session::Run() call to debug URLs. This happens in DirectSession::Run() and MasterSession::DoRunWithLocalExecution(), respectively.
  * The DebugGraphDecoratorInterface::DecorateGraph() and DebugGraphDecoratorInterface::PublishGraph() methods are used to insert debug ops into the debugged graph and send the modified graph to debug URLs. This happens in DirectSession::GetOrCreateExecutors() and GraphMgr::InitItem(), respectively.

  Change: 154631802
* Change calls to use status.Update. (Suharsh Sivakumar, 2017-03-31)

  Change: 151899404
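  For reference, the Status::Update idiom being adopted, in a small standalone example (the step functions are hypothetical):

    #include "tensorflow/core/lib/core/errors.h"
    #include "tensorflow/core/lib/core/status.h"

    tensorflow::Status StepOne() { return tensorflow::Status::OK(); }
    tensorflow::Status StepTwo() {
      return tensorflow::errors::Internal("simulated failure");
    }

    tensorflow::Status RunAll() {
      tensorflow::Status s;   // starts out OK
      s.Update(StepOne());
      s.Update(StepTwo());    // Update keeps the first non-OK status it sees
      return s;               // reports the StepTwo failure
    }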
* Consolidate worker state behind a session-centric abstraction. (Brennan Saeta, 2017-03-25)

  State in workers is currently splayed across graph_mgr, rendezvous_mgr, and additional components. This has resulted in it being difficult to ensure proper cleanup and shut down of the worker components. In addition to paving the way for a more reliable shut down, this CL also sets up the beginnings of ClusterSpec propagation.

  ClusterSpec propagation is a capability upgrade for TensorFlow that should make it much easier to (1) build distributed TensorFlow clusters, and (2) handle node failures. After the ClusterSpec propagation capability is fully implemented, the TensorFlow workers can be booted independently of each other, and with no knowledge about others. A client can then query a central cluster scheduler or other API to find all of the workers, and then send the ClusterDef (ClusterSpec) to the TF master, which then propagates that along to all of the workers.

  This change is only the first of a sequence to fully implement ClusterSpec propagation in TensorFlow.

  Change: 151229111
* Ensure that partial run doesn't block any threads on the worker compute_pool. (Suharsh Sivakumar, 2017-03-15)

  Change: 150265300
* Fix code that ignores tensorflow::Status. (Peter Hawkins, 2017-02-13)

  Add a new tensorflow::Status::IgnoreError() method to mark call sites where a Status has been intentionally ignored.

  Change: 147402405
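  The new marker in use, in a minimal example (the cleanup function is hypothetical):

    #include "tensorflow/core/lib/core/errors.h"
    #include "tensorflow/core/lib/core/status.h"

    // Hypothetical best-effort cleanup whose failure is tolerable.
    tensorflow::Status TryRemoveTempFiles() {
      return tensorflow::errors::NotFound("nothing to remove");  // simulated outcome
    }

    void Shutdown() {
      // IgnoreError() documents that dropping this Status is intentional
      // rather than an oversight.
      TryRemoveTempFiles().IgnoreError();
    }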
* Provide multiple implementations of RPC responses on the fetch path. (Derek Murray, 2017-01-13)

  This CL includes wrapper classes for the protocol buffer messages `tensorflow::RunStepResponse` and `tensorflow::RunGraphResponse` (to complement the corresponding request message wrappers that were added recently). This change makes the backend code deal with abstract `tensorflow::MutableRunStepResponseWrapper` and `tensorflow::MutableRunGraphResponseWrapper` interfaces and adds three concrete implementations of each interface:

  * A mutable in-memory wrapper, which maintains the tensor data in `tensorflow::Tensor` objects, and provides the most efficient implementation when the client and master (or master and worker) are in the same address space.
  * A mutable, owned protobuf wrapper, which has a similar implementation to today's client code.
  * A mutable, non-owned protobuf wrapper, which has a similar implementation to today's server code (where the protobuf message is owned by the RPC subsystem).

  This is another improvement for issue #6256.

  Change: 144481118
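  A heavily simplified sketch of the wrapper idea (names and methods below are stand-ins, not the real MutableRunStepResponseWrapper API): callers program against one abstract interface while the storage behind it differs.

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    struct TensorLike { std::string bytes; };  // stand-in for tensorflow::Tensor

    class MutableFetchResponse {
     public:
      virtual ~MutableFetchResponse() = default;
      virtual void AddFetch(const std::string& name, TensorLike value) = 0;
      virtual std::size_t num_fetches() const = 0;
    };

    // In-memory variant: keeps fetched values as objects, which is cheapest
    // when client and master share an address space and nothing needs to be
    // serialized.
    class InMemoryFetchResponse : public MutableFetchResponse {
     public:
      void AddFetch(const std::string& name, TensorLike value) override {
        fetches_.emplace_back(name, std::move(value));
      }
      std::size_t num_fetches() const override { return fetches_.size(); }

     private:
      std::vector<std::pair<std::string, TensorLike>> fetches_;
    };
    // Proto-backed variants (owned and non-owned) would implement the same
    // interface on top of a protobuf response message instead.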
* Provide multiple implementations of RPC requests on the feed path. (Derek Murray, 2017-01-04)

  This CL includes wrapper classes for the protocol buffer messages `tensorflow::RunStepRequest` and `tensorflow::RunGraphRequest`. Previously the service arguments were always protocol buffer messages, which can entail copying large tensor values into and out of the request message. This change makes the backend code deal with abstract `tensorflow::RunStepRequestWrapper` and `tensorflow::RunGraphRequestWrapper` interfaces and adds three concrete implementations of each interface:

  * A mutable in-memory wrapper, which maintains the tensor data in `tensorflow::Tensor` objects, and provides the most efficient implementation when the client and master are in the same address space.
  * A mutable protobuf wrapper, which has a similar implementation to today's client code.
  * A const wrapper around a const protobuf, which has a similar implementation to today's server code.

  This is another improvement for issue #6256.

  Change: 143620823
* Combine NamedTensorProto and NamedTensor into a single proto. (Derek Murray, 2016-12-22)

  Also use `Tensor::AsProtoTensorContent()` when populating the fetched values from a gRPC worker service, as this is more efficient for larger values. This should improve #6256 slightly.

  Change: 142813084
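  Why AsProtoTensorContent helps, in a short sketch (assumes the TensorFlow core headers): it writes the tensor buffer into the compact tensor_content bytes field instead of the per-element repeated fields produced by AsProtoField.

    #include "tensorflow/core/framework/tensor.h"
    #include "tensorflow/core/framework/tensor.pb.h"

    tensorflow::TensorProto SerializeFetchedValue(const tensorflow::Tensor& fetched) {
      tensorflow::TensorProto proto;
      // Encodes dtype, shape, and the raw buffer as one contiguous byte
      // string, which is cheaper to serialize and parse for large tensors.
      fetched.AsProtoTensorContent(&proto);
      return proto;
    }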
* Optimize the case of a master communicating with an in-process worker. (Derek Murray, 2016-12-22)

  This change modifies the GrpcWorkerCache so that, when a master attempts to communicate with the worker in the same process, it does so by direct method calls on a `WorkerInterface*`, without making a loopback RPC call. This change is another incremental step towards addressing issue #6256.

  There are further improvements possible, and we will continue to investigate them, including:

  * Avoiding the protobuf encoding/decoding for request/response objects where this affects performance. The zero-copy `TensorResponse` class is an example of how we could improve performance here, for `RunGraphRequest` and `RunGraphResponse` objects.
  * Profiling the closure creation/context switch overhead for interactions with the local worker.

  Change: 142793965
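  A simplified sketch of the dispatch decision (this is not the real GrpcWorkerCache; class and method names are illustrative):

    #include <string>
    #include <utility>

    class WorkerIface {
     public:
      virtual ~WorkerIface() = default;
      virtual void RunGraphAsync(/* request, response, done callback */) = 0;
    };

    class WorkerCacheSketch {
     public:
      WorkerCacheSketch(std::string local_target, WorkerIface* local_worker)
          : local_target_(std::move(local_target)), local_worker_(local_worker) {}

      // Returns the worker to use for `target`. For the in-process worker this
      // is the local object itself, so calls become plain virtual method calls
      // rather than loopback RPCs with their serialization and scheduling cost.
      WorkerIface* GetOrCreateWorker(const std::string& target) {
        if (target == local_target_) return local_worker_;
        return CreateRemoteStub(target);  // gRPC-backed stub for remote workers
      }

     private:
      WorkerIface* CreateRemoteStub(const std::string& target);  // not shown

      std::string local_target_;
      WorkerIface* local_worker_;  // not owned
    };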