[TF:XLA] Reduce sequential memory usage via better ordering and simulated heap.

The choice of instruction ordering, and the minimization of fragmentation once we've chosen an order, are two large inter-related factors wrt overall memory usage. The approach in this CL uses heuristics to do better on both, but neither problem is completely solved. To pick a better an ordering (the larger factor), the approach is to try the original list-scheduler based ordering, and to also try a DFS based ordering. We pick the ordering that yields a smaller minimum memory, computed with the simulated heap, ignoring fragmentation. Note that this is the absolute minimum memory for a given ordering. To minimize fragmentation, the approach is to run a heap simulation on temporary buffers. We still try to re-use existing allocations when possible, but instead of creating new allocations for temp buffers, we collect all the leftovers and use a heap to pack them. The heap algorithm that gave the best results is "lazy best-fit"; a variant of traditional best-fit that sometimes delays offset assignment until Free is called, in the hopes of yielding larger free chunks. Here's some measurements of the temp buffer sizes for GNMT encoder training (a stacked LSTM). Lower is better. I've tried various combinations of instruction ordering and heap simulation, to show the joint impact of these two factors. List-scheduler order, no heap simulation 33.33GiB List-scheduler order, with heap simulation 25.09GiB Minimized DFS order, no heap simulation 16.59GiB Arbitrary DFS order, no heap simulation 15.05GiB (old) Arbitrary DFS order, with heap simulation 12.57GiB Minimized DFS order, with heap simulation 11.71GiB (new) Note that the original list scheduler order is much worse than DFS on stacked LSTMs, but (not shown here) is much better than DFS on convolutions like Inception. Also note that heap simulation packs things tighter for all instruction orders in this example, but to varying degrees. Change: 149049028
author: A. Unique TensorFlower <gardener@tensorflow.org> 2017-03-02 14:38:39 -0800
committer: TensorFlower Gardener <gardener@tensorflow.org> 2017-03-02 14:55:25 -0800
commit: 379560be32c3910593e94aa6e91277fc3df3fc98 (patch)
tree: e62bdd8d78b677dce9bcd5a0727c8c18ff79482d /tensorflow/compiler/xla/service/buffer_assignment.h
parent: fee34c128479c5b1549ae2635b3f034bda526441 (diff)
1 files changed, 51 insertions, 21 deletions
diff --git a/tensorflow/compiler/xla/service/buffer_assignment.h b/tensorflow/compiler/xla/service/buffer_assignment.h
index 914553f354..82b9bf49ec 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment.h
+++ b/tensorflow/compiler/xla/service/buffer_assignment.h
@@ -95,9 +95,6 @@ class BufferAllocation {
   // large as any LogicalBuffer assigned to this allocation.
   int64 size() const { return size_; }
 
-  // Returns the maximum alignment of all buffers assigned to the allocation.
-  int64 max_alignment() const { return max_alignment_; }
-
   struct OffsetSize {
     int64 offset = 0;
     int64 size = 0;
@@ -195,8 +192,7 @@ class BufferAllocation {
   friend class BufferAssignment;
 
   // Adds a LogicalBuffer to the set assigned to this buffer.
-  void AddAssignment(const LogicalBuffer& buffer, int64 offset, int64 size,
-                     int64 alignment);
+  void AddAssignment(const LogicalBuffer& buffer, int64 offset, int64 size);
 
   void set_entry_computation_parameter(int64 parameter_number) {
     is_entry_computation_parameter_ = true;
@@ -212,9 +208,6 @@ class BufferAllocation {
   // Size of the allocation in bytes.
   int64 size_;
 
-  // Maximum alignment of all assigned LogicalBuffers.
-  int64 max_alignment_ = 1;
-
   // Whether this buffer needs to be thread-local.
   bool is_thread_local_;
 
@@ -308,25 +301,49 @@ class BufferAssignment {
 
   string ToString() const;
 
+  // Statistics for the assignment.  Values initialized to -1 are not always
+  // collected; fragmentation is only collected for instructions that have a
+  // sequential total ordering.
+  struct Stats {
+    int64 parameter_allocation_count = 0;
+    int64 parameter_allocation_bytes = 0;
+    int64 maybe_live_out_allocation_count = 0;
+    int64 maybe_live_out_allocation_bytes = 0;
+    int64 preallocated_temp_allocation_count = 0;
+    int64 preallocated_temp_allocation_bytes = 0;
+    int64 preallocated_temp_fragmentation_bytes = -1;
+    int64 total_allocation_count = 0;
+    int64 total_allocation_bytes = 0;
+    int64 total_fragmentation_bytes = -1;
+
+    string ToString() const;
+  };
+  const Stats& GetStats() const { return stats_; }
+
  private:
   // Only BufferAssigner can build or modify BufferAssignments.
   friend class BufferAssigner;
 
   explicit BufferAssignment(const HloModule* module,
-                            std::unique_ptr<BufferLiveness> liveness)
-      : module_(module), liveness_(std::move(liveness)) {}
-
-  // Creates and returns a new BufferAllocation. Ownership is maintained
-  // internally. The allocation initially has only the given LogicalBuffer
-  // assigned to it. `is_thread_local` indicates whether this buffer needs to be
-  // thread-local.
+                            std::unique_ptr<BufferLiveness> liveness,
+                            int64 alignment)
+      : module_(module),
+        liveness_(std::move(liveness)),
+        alignment_(alignment) {}
+
+  // Creates and returns a new BufferAllocation, with no assigned
+  // LogicalBuffers. Ownership is maintained internally.
+  BufferAllocation* NewEmptyAllocation(int64 size, bool is_thread_local,
+                                       bool is_reusable);
+
+  // Helper that calls NewEmptyAllocation and AddAssignment in one call,
+  // creating an allocation containing a single LogicalBuffer.
   BufferAllocation* NewAllocation(const LogicalBuffer& buffer, int64 size,
-                                  int64 alignment, bool is_thread_local,
-                                  bool is_reusable);
+                                  bool is_thread_local, bool is_reusable);
 
   // Adds a LogicalBuffer to the set assigned to the given allocation.
-  void AddAssignment(const LogicalBuffer& buffer, BufferAllocation* allocation,
-                     int64 size, int64 alignment);
+  void AddAssignment(BufferAllocation* allocation, const LogicalBuffer& buffer,
+                     int64 offset, int64 size);
 
   // Returns the BufferLiveness object used to construct this assignment.
   const BufferLiveness& liveness() { return *liveness_; }
@@ -342,6 +359,9 @@ class BufferAssignment {
   // Combines allocations of temporary buffers into one big BufferAllocation.
   void CombineTempAllocations();
 
+  // Computes stats for the assignment, to be retrieved by GetStats.
+  Status ComputeSummaryStats(const LogicalBuffer::SizeFunction& buffer_size);
+
   // The vector of buffer allocations. Indexed by BufferAllocation::Index.
   std::vector<BufferAllocation> allocations_;
 
@@ -353,7 +373,10 @@ class BufferAssignment {
       allocation_index_for_buffer_;
 
   const HloModule* module_;
-  std::unique_ptr<BufferLiveness> liveness_;
+  const std::unique_ptr<BufferLiveness> liveness_;
+  const int64 alignment_;
+
+  Stats stats_;
 
   TF_DISALLOW_COPY_AND_ASSIGN(BufferAssignment);
 };
@@ -402,7 +425,7 @@ class BufferAssigner {
   // true. If hlos_to_allocate is not null it indicates which HLOs to include in
   // buffer assignment. If null, all instructions in the computation are
   // included.
-  tensorflow::Status AssignBuffersForComputation(
+  Status AssignBuffersForComputation(
       const HloComputation* computation, bool is_thread_local,
       const tensorflow::gtl::FlatSet<const HloInstruction*>* hlos_to_allocate,
       const tensorflow::gtl::FlatSet<const LogicalBuffer*>& colocated_buffers,
@@ -410,6 +433,13 @@ class BufferAssigner {
           colocated_allocations,
       BufferAssignment* assignment);
 
+  // Assigns 'buffers_to_assign' assuming the HLO instructions will be executed
+  // in the given 'sequential_order'.
+  Status AssignBuffersWithSequentialOrdering(
+      const std::vector<const HloInstruction*>& sequential_order,
+      const tensorflow::gtl::FlatSet<const LogicalBuffer*>& buffers_to_assign,
+      const HloComputation& computation, BufferAssignment* assignment);
+
   // Tries to assign the given instruction to the given buffer. Returns if the
   // assignment was successful.
   bool MaybeAssignBuffer(BufferAllocation* allocation,
author	A. Unique TensorFlower <gardener@tensorflow.org>	2017-03-02 14:38:39 -0800
committer	TensorFlower Gardener <gardener@tensorflow.org>	2017-03-02 14:55:25 -0800
commit	379560be32c3910593e94aa6e91277fc3df3fc98 (patch)
tree	e62bdd8d78b677dce9bcd5a0727c8c18ff79482d /tensorflow/compiler/xla/service/buffer_assignment.h
parent	fee34c128479c5b1549ae2635b3f034bda526441 (diff)