diff options
author | 2018-08-29 20:23:07 -0700 | |
---|---|---|
committer | 2018-08-29 20:27:29 -0700 | |
commit | 729e39b1a4f0f7a6b3e35a04bf8bbba5e921862b (patch) | |
tree | 16022c5e31fee45a7d525e4fdb321659a36f54e3 /tensorflow/core/distributed_runtime | |
parent | b7c2e7872c737dd87e48469fc977237819cb8809 (diff) |
Improve the GPU memory use discipline of CollectiveReduce.
GPU memory allocation can be done in one of two modes: efficient (but
complex and therefore somewhat risky) or conservative (simpler, but less
efficient). The main difference is that 'efficient' allocation allows
the same memory area to be allocated to multiple independent uses
simultaneously, when it should be the case that those uses will in
fact be serial and thus temporally disjoint, while 'conservative'
allocation always obeys the invariant that one piece of memory is
allocated to at most one use at any point in time.
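The distinction can be illustrated with a toy model (this is an illustrative sketch, not TF code; `Region`, `GrantUse`, and `ReleaseUse` are hypothetical names). A conservative allocator refuses to grant a region that already has an outstanding use, while an efficient allocator will grant it on the promise that the uses are serialized:

```cpp
#include <cassert>

// Hypothetical toy model: each region tracks how many uses it has been
// granted. A conservative allocator enforces the invariant "at most one
// use at a time"; an efficient allocator may grant the same region to
// several uses that are promised to be temporally disjoint (e.g.
// serialized on a single compute stream).
struct Region {
  int granted_uses = 0;
};

bool GrantUse(Region& r, bool conservative) {
  if (conservative && r.granted_uses > 0) return false;  // refuse overlap
  ++r.granted_uses;  // efficient mode trusts the serial-use promise
  return true;
}

void ReleaseUse(Region& r) { --r.granted_uses; }
```

The efficient mode is safe only if the serial-use promise actually holds; the conditions under which it holds are enumerated below.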
If GPUDevice::RequiresRecordingAccessedTensors() returns false, then
the TF runtime uses efficient memory allocation for GPU ops. That is, GPU
ops are nominally synchronous and their tensor Refs are deleted
immediately after the op returns, although in reality the corresponding GPU
kernel is only guaranteed to have been enqueued on the compute stream
and may not yet have begun execution.
If RequiresRecordingAccessedTensors() returns true, then conservative
memory allocation is used, i.e. Refs on the tensors accessed by a GPU op
are held until the corresponding kernel is guaranteed to have completed
execution and no part of the op will touch them again.
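The two ref-lifetime policies can be sketched as follows (a hypothetical model, not the actual TF classes): conservative mode parks a reference on the stream until the kernel is known to have finished, while efficient mode drops it at enqueue time:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: conservative mode holds a reference to every
// tensor a kernel accesses until the stream reports completion;
// efficient mode releases the reference as soon as the kernel is
// enqueued, trusting the stream-ordering discipline.
struct RefCounted {
  int refs = 1;
  void Ref() { ++refs; }
  void Unref() { --refs; }
};

struct Stream {
  std::vector<RefCounted*> pending;  // refs held until completion
  void EnqueueKernel(RefCounted* t, bool record_accessed_tensors) {
    t->Ref();
    if (record_accessed_tensors) {
      pending.push_back(t);  // conservative: hold until kernel done
    } else {
      t->Unref();            // efficient: release at enqueue time
    }
  }
  void MarkCompleted() {     // kernel guaranteed to have finished
    for (auto* t : pending) t->Unref();
    pending.clear();
  }
};
```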
Efficient GPU memory allocation should be safe when the following criteria
are all met:
1. All GPU kernels are executed serially on a single compute stream.
2. All GPU kernel outputs and temp buffers are allocated by
the GPU Op in the executor thread in which it is originally called.
3. Any read of a GPU tensor computed by a GPU kernel, other than a read
by another kernel on that same GPU, first synchronizes on
the compute stream that produced it.
4. Any read by a GPU kernel of a value that was not produced by another
GPU kernel first synchronizes on the entity that produced it,
e.g. a copy stream.
5. All direct allocations of GPU memory that are not for kernel outputs
or temp buffers are conservative in duration.
6. Any use of directly allocated GPU memory that is not part of a kernel
execution first synchronizes on the compute stream to ensure that
any prior granted uses of the same region have expired before this new use begins.
These conditions together should be sufficient for safety, and
correspond to established practice, though it may be possible to
contrive other sets of rules that are also sufficient.
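Condition 6 in particular can be modeled with a toy stream that tracks how much enqueued work has drained (again an illustrative sketch; `ComputeStream` and `SafeDirectUse` are invented names). A direct, non-kernel use of a region last touched by the kernel enqueued at tick T is safe only once the stream has completed past T:

```cpp
#include <cassert>

// Toy model of condition 6: the compute stream counts enqueued and
// completed kernels. Synchronize() blocks until all enqueued work has
// drained, so any prior granted use of a memory region has expired.
struct ComputeStream {
  long enqueued = 0, completed = 0;
  long EnqueueKernel() { return ++enqueued; }   // tick of this kernel
  void Synchronize() { completed = enqueued; }  // wait for drain
  bool DonePast(long tick) const { return completed >= tick; }
};

// A direct (non-kernel) use first synchronizes on the compute stream,
// then may safely touch memory last used by the kernel at `last_use_tick`.
bool SafeDirectUse(ComputeStream& s, long last_use_tick) {
  s.Synchronize();
  return s.DonePast(last_use_tick);
}
```

In real CUDA terms the Synchronize() step corresponds to waiting on the compute stream (or an event recorded on it) before touching the region from a copy stream or host thread.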
Collective Ops for GPUs are unusual in that they are async (as TF
Ops) and they can directly allocate GPU memory in CPU threads that are
asynchronous to the launching executor thread. This CL corrects a
couple of subtle misuse errors related to conditions 2 and 6.
PiperOrigin-RevId: 210841522
Diffstat (limited to 'tensorflow/core/distributed_runtime')
-rw-r--r-- | tensorflow/core/distributed_runtime/tensor_coding.h | 3 |
1 file changed, 3 insertions, 0 deletions
diff --git a/tensorflow/core/distributed_runtime/tensor_coding.h b/tensorflow/core/distributed_runtime/tensor_coding.h
index bae4ec794c..4c34297990 100644
--- a/tensorflow/core/distributed_runtime/tensor_coding.h
+++ b/tensorflow/core/distributed_runtime/tensor_coding.h
@@ -87,6 +87,9 @@ class TensorResponse {
   // modified.
   const RecvTensorResponse& metadata() const { return meta_; }
 
+  // Return pointer to the device hosting the tensor.
+  DeviceBase* device() const { return device_; }
+
  private:
   bool ParseTensorSubmessage(protobuf::io::CodedInputStream* input,
                              TensorProto* tensor_meta);