| Commit message | Author | Age |
|
|
|
| |
PiperOrigin-RevId: 214542049
|
|
|
|
|
|
|
|
| |
Stateless MapDatasets can be parallelized by switching to ParallelMapDataset. We set `num_parallel_calls` to 2 for now, but in the future a special value will be used that results in the optimal value being selected dynamically at runtime.
This patch also exposed a memory leak, which has been fixed.
PiperOrigin-RevId: 213015223
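A minimal sketch of the difference, in plain Python rather than the actual tf.data C++ implementation (function names hypothetical): a stateless per-element function is safe to apply from a small worker pool while preserving output order.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_map(fn, items):
    # Stateless map applied one element at a time (the old MapDataset).
    return [fn(x) for x in items]

def parallel_map(fn, items, num_parallel_calls=2):
    # A stateless map is safe to parallelize: each call is independent,
    # so a fixed pool of workers (2 here, mirroring the commit) can
    # overlap per-element work while pool.map preserves element order.
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        return list(pool.map(fn, items))

square = lambda x: x * x
assert serial_map(square, range(5)) == parallel_map(square, range(5))
```

Because the mapped function is stateless, the only observable difference is latency, not results.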
|
|
|
|
|
|
|
|
|
|
|
| |
Rollback of rollback. Fix: make access to collective_graph_key thread-safe.
The original change introduced a collective_graph_key_ integer to DirectSession, but it did not protect accesses to this integer. This change protects access with a mutex.
Automated rollback of commit cb9443831283c2366e3dd91001db6362d6594f66
PiperOrigin-RevId: 211161961
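The fix described above can be sketched in plain Python (hypothetical names; the real DirectSession is C++ and uses a `mutex`): every read and write of the shared integer goes through the same lock.

```python
import threading

class DirectSessionSketch:
    """Toy stand-in for the real C++ DirectSession, illustrating the fix:
    guard all access to collective_graph_key with one mutex."""
    def __init__(self):
        self._mu = threading.Lock()
        self._collective_graph_key = 0  # previously accessed unprotected

    def set_collective_graph_key(self, key):
        with self._mu:  # writes take the lock
            self._collective_graph_key = key

    def collective_graph_key(self):
        with self._mu:  # reads take the same lock, so no torn or stale reads
            return self._collective_graph_key
```

Guarding both the read and the write with the same mutex is what makes the access thread-safe; protecting only one side would not.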
|
|
|
|
| |
PiperOrigin-RevId: 211037202
|
|
|
|
|
|
|
|
|
|
|
| |
Before this CL, for collective_ops to work, the client had to specify a
collective_graph_key in the RunOptions of a session.Run call.
After this change, if a client does not specify a collective_graph_key for a
graph that contains collective ops, a graph key is generated automatically as a
hash of the set of keys of collective instances in the placed graph.
PiperOrigin-RevId: 211024617
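The derivation described above can be sketched as follows (plain Python, hypothetical function name and hash choice; the actual TensorFlow hash differs): hash the set of collective instance keys, sorting first so the result is independent of iteration order.

```python
import hashlib

def derive_graph_key(instance_keys):
    # Hypothetical sketch: derive a graph key from the set of collective
    # instance keys in the placed graph. Sorting makes the key canonical,
    # so every worker derives the same value for the same set of instances.
    canonical = ",".join(str(k) for k in sorted(instance_keys))
    digest = hashlib.sha256(canonical.encode())
    return int.from_bytes(digest.digest()[:8], "big")

# Order-independent: the same set of instances yields the same graph key.
assert derive_graph_key({3, 1, 2}) == derive_graph_key([2, 3, 1])
```

Determinism across workers is the point: the key replaces a value the client previously had to agree on out of band via RunOptions.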
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GPU memory allocation can be done in one of two modes: efficient (but
complex and therefore somewhat risky) or conservative (simpler, but less
efficient). The main difference is that 'efficient' allocation allows
the same memory area to be allocated to multiple independent uses
simultaneously, when it should be the case that those uses will in
fact be serial and thus temporally disjoint, while 'conservative'
allocation will always obey the invariant that one piece of memory is
allocated to at most one use at any point in time.
If GPUDevice::RequiresRecordingAccessedTensors() returns false, then
the TF runtime uses efficient memory allocation for GPU ops. That is, GPU
ops are nominally synchronous and their tensor Refs are deleted
immediately after the op returns, although in reality the corresponding GPU
kernel is only guaranteed to have been enqueued on the compute stream
and may not yet have begun execution.
If RequiresRecordingAccessedTensors() returns true, then conservative
memory allocation is used, i.e. Refs on the tensors accessed by a GPU op
are held until the corresponding kernel is guaranteed to have completed
execution and no part of the op will touch them again.
Efficient GPU memory allocation should be safe when the following criteria
are all met:
1. All GPU kernels are executed serially on a single compute stream.
2. All GPU kernel outputs and temp buffers are allocated by
the GPU Op in the executor thread in which it is originally called.
3. Any read of a GPU tensor computed by a GPU kernel that is not
by another kernel on that same GPU first synchronizes on
the compute stream that produced it.
4. Any read by a GPU kernel of a value that was not produced by another
GPU kernel first synchronizes on the entity that produced it,
e.g. a copy stream.
5. All direct allocations of GPU memory that are not for kernel outputs
or temp buffers are conservative in duration.
6. Any use of directly allocated GPU memory that is not part of a kernel
execution first synchronizes on the compute stream to ensure that
any prior granted uses of the same region have expired before this new use.
These conditions together should be sufficient for safety, and
correspond to established practice, though it may be possible to
contrive other sets of rules that are also sufficient.
Collective Ops for GPUs are unusual in that they are async (as TF
Ops) and they can directly allocate GPU memory in CPU threads that are
asynchronous to the launching executor thread. This CL corrects a
couple of subtle misuse errors related to conditions 2 and 6.
PiperOrigin-RevId: 210841522
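The two ref-lifetime policies above can be contrasted with a toy sketch (plain Python, hypothetical names; the real runtime works on C++ tensor references and streams):

```python
class Tensor:
    """Toy ref-counted tensor, for illustration only."""
    def __init__(self):
        self.refs = 0
    def ref(self):
        self.refs += 1
    def unref(self):
        self.refs -= 1

def run_op_efficient(inputs, enqueue_kernel):
    # Efficient mode: refs are dropped as soon as the op returns, even
    # though the kernel has only been *enqueued* on the compute stream.
    for t in inputs:
        t.ref()
    enqueue_kernel()
    for t in inputs:
        t.unref()  # memory may now be granted to an independent later use

def run_op_conservative(inputs, enqueue_kernel, on_kernel_done):
    # Conservative mode: refs are held until the kernel is guaranteed
    # to have completed execution.
    for t in inputs:
        t.ref()
    enqueue_kernel()
    def drop_refs():
        for t in inputs:
            t.unref()
    on_kernel_done(drop_refs)  # refs released only at completion time
```

In the efficient sketch the refs reach zero while the kernel may still be pending, which is exactly why conditions 1-6 above are needed to make that mode safe.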
|
|
|
|
| |
PiperOrigin-RevId: 210596417
|
|
|
|
| |
PiperOrigin-RevId: 209685137
|
|
|
|
| |
PiperOrigin-RevId: 209679086
|
|
|
|
|
|
| |
job name.
PiperOrigin-RevId: 209597829
|
|\
| |
| |
| | |
PiperOrigin-RevId: 208266944
|
| |
| |
| |
| | |
PiperOrigin-RevId: 208254124
|
| |
| |
| |
| | |
PiperOrigin-RevId: 207971672
|
| |
| |
| |
| | |
PiperOrigin-RevId: 207394440
|
| |
| |
| |
| |
| |
| | |
from the worker_env_ value.
PiperOrigin-RevId: 205987011
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
At times, a server cannot open a reverse connection to the client. Such a connection is
required when using the _Send/_Recv ops and the client needs to send a tensor
to the server (tensors are pulled). Instead, this change adds a way to push
tensors directly from the client.
Currently, pushing tensors always happens in sync mode.
PiperOrigin-RevId: 205888825
|
| |
| |
| |
| | |
PiperOrigin-RevId: 205756865
|
| |
| |
| |
| | |
PiperOrigin-RevId: 204981602
|
| |
| |
| |
| | |
PiperOrigin-RevId: 204544587
|
| |
| |
| |
| |
| |
| | |
* debug_gateway and the related node_outputs_callback are not used and hence are removed in this CL.
PiperOrigin-RevId: 204519574
|
| |
| |
| |
| |
| |
| | |
This causes DirectSession to report a better error message if there is an error initializing GPUs.
PiperOrigin-RevId: 204498143
|
| |
| |
| |
| | |
PiperOrigin-RevId: 203872748
|
| |
| |
| |
| |
| |
| | |
I believe this will be required if (when?) the TPUClusterResolver returns IPv6 addresses.
PiperOrigin-RevId: 203842540
|
| |
| |
| |
| | |
PiperOrigin-RevId: 203518000
|
| | |
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
distributed_runtime/RpcCollectiveExecutorMgr.
In a distributed environment WorkerInterface is going to call this
method at the group leader when fielding a GetStepSequence request
from one of the other workers.
PiperOrigin-RevId: 203196543
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Modifies GraphModeFunction to emit PartitionedCall ops instead of Call ops
so that the created functions can execute across devices. This should strictly
increase the set of functions that tfe.defun can faithfully execute.
Prior to this change, functions executed through tfe.defun would ignore
device annotations and only run on a single device. It is not yet possible to execute
a function across multiple processes.
Specifically, this CL:
(1) Adds a stateful version of PartitionedCall,
(2) Modifies `defun` to emit PartitionedCall or StatefulPartitionedCall by default,
(3) Makes `tf.gradients` aware of the existence of `(Stateful)PartitionedCall`,
(4) Fixes bugs in PartitionedCallOp related to the placement of
resource-touching ops / which args and retvals are always on host memory, and
also removes the requirement for args/retvals to be passed through the host.
PiperOrigin-RevId: 203164388
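The core idea can be sketched in plain Python with hypothetical names (the real PartitionedCallOp partitions a placed graph, not a list of dicts): group a function's ops by device annotation so each device executes its own partition instead of everything landing on one device.

```python
def partition_by_device(ops):
    # Hypothetical sketch: split a function body into per-device partitions
    # keyed by each op's device annotation, preserving op order within a
    # partition. A partitioned call then runs one partition per device.
    partitions = {}
    for op in ops:
        partitions.setdefault(op["device"], []).append(op)
    return partitions

ops = [
    {"name": "matmul", "device": "GPU:0"},
    {"name": "embed_lookup", "device": "CPU:0"},
    {"name": "relu", "device": "GPU:0"},
]
parts = partition_by_device(ops)
assert sorted(parts) == ["CPU:0", "GPU:0"]
assert [op["name"] for op in parts["GPU:0"]] == ["matmul", "relu"]
```

The hard parts the commit mentions (resource-touching ops, host-memory args/retvals, cross-partition edges) are precisely what this toy grouping omits.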
|
| | |
|
| |\
| |/
|/| |
|
| |
| |
| |
| | |
PiperOrigin-RevId: 202370201
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
more than one device-to-device copy stream per GPU device.
This is an experimental feature that will have no effect unless
copy operations explicitly request a stream other than 0, which
currently does not occur anywhere in a standard build.
Eventually it may be of benefit in the presence of multiple
bi-directional concurrent data copies.
PiperOrigin-RevId: 202354513
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
During initialization of local collective params, we may issue RPCs to other
workers in order to obtain device localities. Currently, we hold a mutex
across these RPCs, but we do not ensure that the thread that unlocks the mutex
is the same as the one that locked it.
This change releases the mutex (InstanceRec::out_mu) before calling
GetDeviceLocalitiesAsync. Before releasing out_mu, it marks the mutex
unavailable. Any thread that wishes to acquire out_mu must wait on a condition
variable if the mutex is unavailable. The callback for
GetDeviceLocalitiesAsync marks the mutex as available again and notifies the
condition variable.
PiperOrigin-RevId: 202346357
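The locking protocol described above can be sketched in plain Python (hypothetical names; the real code is C++ with mutex and condition_variable): the mutex becomes a logical "available" flag guarded by a condition variable, so the thread that releases it need not be the thread that acquired it.

```python
import threading

class InstanceRecSketch:
    """Toy sketch of the out_mu protocol: mark the mutex unavailable
    before the async RPC, then let the RPC's callback (possibly on a
    different thread) mark it available again and notify waiters."""
    def __init__(self):
        self._cv = threading.Condition()
        self._available = True

    def acquire_out_mu(self):
        with self._cv:
            while not self._available:
                self._cv.wait()      # wait while another thread logically holds it
            self._available = False  # logically held from here on

    def release_out_mu(self):
        with self._cv:
            self._available = True   # e.g. called from the RPC callback
            self._cv.notify_all()    # wake any thread blocked in acquire_out_mu
```

Unlike a raw mutex, this hand-rolled flag is legal to "unlock" from a different thread, which is exactly the property the async GetDeviceLocalitiesAsync callback needs.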
|
| |
| |
| |
| |
| |
| | |
when using Session::RunCallable().
PiperOrigin-RevId: 202234757
|
| |
| |
| |
| |
| |
| |
| |
| | |
Since we respond with the shape, all RPCs will happen synchronously (note
that we may still hide the Python overhead, since the op is still scheduled for
execution via the eager executor).
PiperOrigin-RevId: 202207324
|
| |
| |
| |
| | |
PiperOrigin-RevId: 202585094
|
| |
| |
| |
| | |
PiperOrigin-RevId: 202544091
|
|
| |\
| |/
|/| |
|
| |
| |
| |
| | |
PiperOrigin-RevId: 201586130
|
| |\
| |/
|/| |
|
| |
| |
| |
| | |
PiperOrigin-RevId: 201422113
|
| |\
| |/
|/| |
|
| |
| |
| |
| |
| |
| | |
the grpc_tensorflow_server.
PiperOrigin-RevId: 201198350
|
| |
| |
| |
| | |
PiperOrigin-RevId: 201110240
|
| |
| |
| |
| | |
PiperOrigin-RevId: 201033171
|