path: root/tensorflow/core/distributed_runtime
Commit message | Author | Age
* Set step_id in Executor Args to the step_id generated in MasterSession. (Ayush Dubey, 2018-09-25)
  PiperOrigin-RevId: 214542049
* [tf.data] Introducing an optimization that parallelizes map transformations. (Piotr Padlewski, 2018-09-14)
  Stateless MapDatasets can be parallelized by switching to ParallelMapDataset. We set `num_parallel_calls` to 2 for now, but in the future a special value will be used that results in the optimal value being selected dynamically at runtime. This patch also exposed a memory leak, which was fixed.
  PiperOrigin-RevId: 213015223
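Why statelessness is what makes this rewrite safe can be shown with plain Python (an analogy only; the real optimization rewrites MapDataset to ParallelMapDataset inside the tf.data graph, and `stateless_fn` here is an invented stand-in for a user map function):

```python
from concurrent.futures import ThreadPoolExecutor

def stateless_fn(x):
    return x * x                    # no shared state: safe to run in parallel

data = list(range(10))
serial = [stateless_fn(x) for x in data]
with ThreadPoolExecutor(max_workers=2) as pool:   # num_parallel_calls = 2
    parallel = list(pool.map(stateless_fn, data))
assert parallel == serial           # map preserves element order
```

A stateful function (one that mutates a counter, say) would not commute across workers, which is why only stateless maps qualify for the rewrite.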
* Fix: make access to collective_graph_key thread-safe (rollback of rollback). (Ayush Dubey, 2018-08-31)
  The original change introduced a collective_graph_key_ integer to DirectSession, but it did not protect accesses to this integer. This change protects access with a mutex.
  Automated rollback of commit cb9443831283c2366e3dd91001db6362d6594f66
  PiperOrigin-RevId: 211161961
* Automated rollback of commit 73a3477356990f2451e220f553c9d7782df836ac (Ayush Dubey, 2018-08-30)
  PiperOrigin-RevId: 211037202
* Initialize collective_graph_key based on the graph if unspecified in RunOptions. (Ayush Dubey, 2018-08-30)
  Before this CL, for collective_ops to work, the client had to specify a collective_graph_key in the RunOptions of a session.Run call. After this change, if a client does not specify a collective_graph_key for a graph that contains collective ops, a graph key is generated automatically as a hash of the set of keys of collective instances in the placed graph.
  PiperOrigin-RevId: 211024617
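The "hash of the set of collective instance keys" idea can be sketched in a few lines (an illustration only, not the actual TF hashing code; `derive_graph_key` is a hypothetical name):

```python
import hashlib

def derive_graph_key(instance_keys):
    """Hash the set of collective instance keys into one 64-bit graph key."""
    digest = hashlib.sha256()
    for key in sorted(set(instance_keys)):    # set: order/duplicates don't matter
        digest.update(str(key).encode("utf-8"))
        digest.update(b",")                   # separator between keys
    return int.from_bytes(digest.digest()[:8], "big")

# Graphs with the same collective instances map to the same key.
assert derive_graph_key([3, 1, 2]) == derive_graph_key([1, 2, 3, 3])
```

Deriving the key from the placed graph's instances is what lets clients that never set collective_graph_key still agree on a key, as long as they run the same graph.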
* Improve the GPU memory use discipline of CollectiveReduce. (A. Unique TensorFlower, 2018-08-29)
  GPU memory allocation can be done in one of two modes: efficient (but complex and therefore somewhat risky) or conservative (simpler, but less efficient). The main difference is that 'efficient' allocation allows the same memory area to be allocated to multiple independent uses simultaneously, when it should be the case that those uses will in fact be serial and thus temporally disjoint, while 'conservative' allocation will always obey the invariant that one piece of memory is allocated to at most one use at any point in time.
  If GPUDevice::RequiresRecordingAccessedTensors() returns false, then the TF runtime uses efficient memory allocation for GPU ops. That is, GPU ops are nominally synchronous and their tensor Refs are deleted immediately after the op returns, although really the corresponding GPU kernel is only guaranteed to have been enqueued on the compute stream and may not yet have begun execution. If RequiresRecordingAccessedTensors() returns true, then conservative memory allocation is used, i.e. Refs on the tensors accessed by a GPU op are held until the corresponding kernel is guaranteed to have completed execution and no part of the op will touch them again.
  Efficient GPU memory allocation should be safe when the following criteria are all met:
  1. All GPU kernels are executed serially on a single compute stream.
  2. All GPU kernel outputs and temp buffers are allocated by the GPU op in the executor thread in which it is originally called.
  3. Any read of a GPU tensor computed by a GPU kernel that is not by another kernel on that same GPU first synchronizes on the compute stream that produced it.
  4. Any read by a GPU kernel of a value that was not produced by another GPU kernel first synchronizes on the entity that produced it, e.g. a copy stream.
  5. All direct allocations of GPU memory that are not for kernel outputs or temp buffers are conservative in duration.
  6. Any use of directly allocated GPU memory that is not part of a kernel execution first synchronizes on the compute stream to ensure that any prior granted uses of the same region have expired before this new use.
  These conditions together should be sufficient for safety, and correspond to established practice, though it may be possible to contrive other sets of rules that are also sufficient. Collective ops for GPUs are unusual in that they are async (as TF ops) and they can directly allocate GPU memory in CPU threads that are asynchronous to the launching executor thread. This CL corrects a couple of subtle misuse errors related to conditions 2 and 6.
  PiperOrigin-RevId: 210841522
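The efficient/conservative distinction can be modeled in a toy simulation (not TF code; `Buffer`, `launch_kernel`, and the list-based "stream" are invented for illustration):

```python
# Toy model of the two GPU memory disciplines described above.
class Buffer:
    def __init__(self):
        self.refs = 0
    def ref(self):
        self.refs += 1
    def unref(self):
        self.refs -= 1

def launch_kernel(buf, stream, conservative):
    """Enqueue a pretend kernel that reads `buf` on `stream`."""
    buf.ref()                       # the pending kernel holds its input
    if conservative:
        stream.append(buf.unref)    # release only once the stream drains
    else:
        buf.unref()                 # "efficient": release at enqueue time

stream = []                         # pretend compute stream: FIFO of callbacks
buf = Buffer()
launch_kernel(buf, stream, conservative=True)
assert buf.refs == 1                # Ref still held while the kernel is queued
for done in stream:                 # stream drains: the kernel "completes"
    done()
assert buf.refs == 0                # only now may the memory be reused
```

In the efficient mode the Ref drops to zero at enqueue time, so the allocator may hand the region to a later use that is only correct if the stream serializes the two kernels, which is exactly what criteria 1-6 are guarding.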
* Removed redundant std::string -> string conversions. (A. Unique TensorFlower, 2018-08-28)
  PiperOrigin-RevId: 210596417
* Allow child class of Server to supply custom ChannelArguments. (Noah Eisen, 2018-08-21)
  PiperOrigin-RevId: 209685137
* Fix C++ header guards. (A. Unique TensorFlower, 2018-08-21)
  PiperOrigin-RevId: 209679086
* [Distributed] Add methods to WorkerCache that selectively list workers by job name. (Derek Murray, 2018-08-21)
  PiperOrigin-RevId: 209597829
* Merge pull request #20549 from naurril:bug-fix-grpc-server (TensorFlower Gardener, 2018-08-10)
  PiperOrigin-RevId: 208266944
* Fix keep alive stale condition. (Akshay Modi, 2018-08-10)
  PiperOrigin-RevId: 208254124
* Support keep alive so we can reclaim memory in the remote case. (Akshay Modi, 2018-08-08)
  PiperOrigin-RevId: 207971672
* Add duplicate detection to RecvBuf requests. (A. Unique TensorFlower, 2018-08-04)
  PiperOrigin-RevId: 207394440
* In grpc_server_lib.cc initialize master_env_.collective_executor_mgr from the worker_env_ value. (A. Unique TensorFlower, 2018-07-25)
  PiperOrigin-RevId: 205987011
* Push tensors from client to workers. (Akshay Modi, 2018-07-24)
  At times, a server cannot open a reverse connection to the client. Such a connection is required when using the _Send/_Recv ops and the client needs to send a tensor to the server, since tensors are pulled. Instead, this adds a way to push the tensors directly from the client. Currently, pushing tensors always happens in sync mode.
  PiperOrigin-RevId: 205888825
* Remove unnecessary thread pool and use the worker env's compute pool directly. (Akshay Modi, 2018-07-23)
  PiperOrigin-RevId: 205756865
* Automated rollback of commit 2936833c7e22c102ff2b82e3f4e261b94602fbcc (Reed Wanderman-Milne, 2018-07-17)
  PiperOrigin-RevId: 204981602
* Automated rollback of commit d98b99d1cd4337ee11e7cbc4c9b6324f0e381502 (Reed Wanderman-Milne, 2018-07-13)
  PiperOrigin-RevId: 204544587
* tfdbg: remove Experimental tags and obsolete library. (Shanqing Cai, 2018-07-13)
  debug_gateway and the related node_outputs_callback are not used and hence are removed in this CL.
  PiperOrigin-RevId: 204519574
* Add version of SessionFactory::NewSession that returns Status. (Reed Wanderman-Milne, 2018-07-13)
  This causes DirectSession to report a better error message if there is an error initializing GPUs.
  PiperOrigin-RevId: 204498143
* Automated rollback of commit 19a98bf9054d9be58a3293b0390b18288a65a25c (Noah Eisen, 2018-07-09)
  PiperOrigin-RevId: 203872748
* Allow passing in an IPv6 address in server def. (Akshay Modi, 2018-07-09)
  I believe this will be required if (when?) the TPUClusterResolver returns IPv6 addresses.
  PiperOrigin-RevId: 203842540
* Merge changes from github. (Yifei Feng, 2018-07-06)
  PiperOrigin-RevId: 203518000
* Add comments. (naurril, 2018-07-05)
* Check parameters before initializing GrpcServer. (naurril, 2018-07-05)
* Add distributed-mode GetStepSequenceAsync implementation to distributed_runtime/RpcCollectiveExecutorMgr. (A. Unique TensorFlower, 2018-07-03)
  In a distributed environment, WorkerInterface is going to call this method at the group leader when fielding a GetStepSequence request from one of the other workers.
  PiperOrigin-RevId: 203196543
* Make functions defined with tfe.defun respect devices when executing. (Akshay Agrawal, 2018-07-03)
  Modifies GraphModeFunction to emit PartitionedCall ops instead of Call ops so that the created functions can execute across devices. This should strictly increase the set of functions that tfe.defun can faithfully execute. Prior to this change, functions executed through tfe.defun would ignore device annotations and only run on a single device. It is not yet possible to execute a function across multiple processes.
  Specifically, this CL:
  (1) adds a stateful version of PartitionedCall,
  (2) modifies `defun` to emit PartitionedCall or StatefulPartitionedCall by default,
  (3) makes `tf.gradients` aware of the existence of `(Stateful)PartitionedCall`,
  (4) fixes bugs in PartitionedCallOp related to the placement of resource-touching ops and to which args and retvals are always in host memory, and also removes the requirement for args/retvals to be passed through the host.
  PiperOrigin-RevId: 203164388
* Fix incorrect merge of grpc_server_lib.h. (Michael Case, 2018-06-29)
* Merge commit for internal changes. (Michael Case, 2018-06-29)
* Do not capture variables that may be destroyed before callback finishes. (Ayush Dubey, 2018-06-28)
  PiperOrigin-RevId: 202370201
* Add GPUOptions::num_dev_to_dev_copy_streams to allow creation of more than one device-to-device copy stream per GPU device. (A. Unique TensorFlower, 2018-06-28)
  This is an experimental feature that will have no effect unless copy operations explicitly request a stream other than 0, which currently does not occur anywhere in a standard build. Eventually it may be of benefit in the presence of multiple bidirectional concurrent data copies.
  PiperOrigin-RevId: 202354513
* Fix synchronization across callbacks in collective params initialization. (Ayush Dubey, 2018-06-28)
  During initialization of local collective params, we may issue RPCs to other workers in order to obtain device localities. Currently, we hold a mutex across these RPCs, but we do not ensure that the thread that unlocks the mutex is the same as the one that locked it. This change releases the mutex (InstanceRec::out_mu) before calling GetDeviceLocalitiesAsync. Before releasing out_mu, it marks the mutex unavailable. Any thread that wishes to acquire out_mu must wait on a condition variable if the mutex is unavailable. The callback for GetDeviceLocalitiesAsync marks the mutex as available again and notifies the condition variable.
  PiperOrigin-RevId: 202346357
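The pattern described here (mark out_mu logically unavailable, drop the real mutex across the async RPC, restore it in the callback) is a standard condition-variable idiom; a minimal Python sketch, with invented names standing in for the C++ InstanceRec:

```python
import threading

class InstanceRec:
    """Toy stand-in for the C++ InstanceRec with a hand-over-able lock."""
    def __init__(self):
        self.cv = threading.Condition()
        self.out_mu_available = True

    def acquire_out_mu(self):
        # Block until the logical lock is available, then take it.
        with self.cv:
            while not self.out_mu_available:
                self.cv.wait()
            self.out_mu_available = False

    def release_out_mu(self):
        # May be called from any thread, e.g. the RPC callback.
        with self.cv:
            self.out_mu_available = True
            self.cv.notify_all()

rec = InstanceRec()
rec.acquire_out_mu()                 # mark out_mu unavailable; the real
                                     # mutex is not held across the RPC

def on_localities_done():            # simulated GetDeviceLocalitiesAsync callback
    rec.release_out_mu()

rpc = threading.Thread(target=on_localities_done)
rpc.start()
rec.acquire_out_mu()                 # other threads block until the callback runs
rpc.join()
rec.release_out_mu()
```

Unlike a plain mutex, this logical lock may be released by a different thread than the one that acquired it, which is exactly the property the RPC callback needs.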
* [C++] Ability to feed and fetch tensors while keeping them in device memory when using Session::RunCallable(). (Asim Shankar, 2018-06-28)
  PiperOrigin-RevId: 202234757
* Support shapes for remote eager tensor handles. (Akshay Modi, 2018-06-28)
  Since we respond with the shape, all RPCs will happen sync (note that we may still hide the Python overhead, since the op is still scheduled for execution via the eager executor).
  PiperOrigin-RevId: 202207324
* Merge changes from github. (Mingxing Tan, 2018-06-28)
  PiperOrigin-RevId: 202585094
* Rewrite master_service in terms of more up-to-date gRPC APIs. (Noah Eisen, 2018-06-28)
  PiperOrigin-RevId: 202544091
* Do not capture variables that may be destroyed before callback finishes. (Ayush Dubey, 2018-06-27)
  PiperOrigin-RevId: 202370201
* Add GPUOptions::num_dev_to_dev_copy_streams to allow creation of more than one device-to-device copy stream per GPU device. (A. Unique TensorFlower, 2018-06-27)
  This is an experimental feature that will have no effect unless copy operations explicitly request a stream other than 0, which currently does not occur anywhere in a standard build. Eventually it may be of benefit in the presence of multiple bidirectional concurrent data copies.
  PiperOrigin-RevId: 202354513
* Fix synchronization across callbacks in collective params initialization. (Ayush Dubey, 2018-06-27)
  During initialization of local collective params, we may issue RPCs to other workers in order to obtain device localities. Currently, we hold a mutex across these RPCs, but we do not ensure that the thread that unlocks the mutex is the same as the one that locked it. This change releases the mutex (InstanceRec::out_mu) before calling GetDeviceLocalitiesAsync. Before releasing out_mu, it marks the mutex unavailable. Any thread that wishes to acquire out_mu must wait on a condition variable if the mutex is unavailable. The callback for GetDeviceLocalitiesAsync marks the mutex as available again and notifies the condition variable.
  PiperOrigin-RevId: 202346357
* [C++] Ability to feed and fetch tensors while keeping them in device memory when using Session::RunCallable(). (Asim Shankar, 2018-06-26)
  PiperOrigin-RevId: 202234757
* Support shapes for remote eager tensor handles. (Akshay Modi, 2018-06-26)
  Since we respond with the shape, all RPCs will happen sync (note that we may still hide the Python overhead, since the op is still scheduled for execution via the eager executor).
  PiperOrigin-RevId: 202207324
* Merge commit for internal changes. (Mingxing Tan, 2018-06-22)
* Allow dynamic specification of clusters for eager remote execution. (Akshay Modi, 2018-06-21)
  PiperOrigin-RevId: 201586130
* Merge commit for internal changes. (Mingxing Tan, 2018-06-21)
* Rename tensor_data_is_large to share_tensor_slice_memory. (Noah Eisen, 2018-06-20)
  PiperOrigin-RevId: 201422113
* Merge commit for internal changes. (Mingxing Tan, 2018-06-20)
* Allow setting server def on the eager context, and add the eager service to the grpc_tensorflow_server. (Akshay Modi, 2018-06-19)
  PiperOrigin-RevId: 201198350
* Merge changes from github. (Akshay Modi, 2018-06-18)
  PiperOrigin-RevId: 201110240
* Automated g4 rollback of changelist 201011811. (Akshay Modi, 2018-06-18)
  PiperOrigin-RevId: 201033171