author    | 2017-03-02 14:38:39 -0800
committer | 2017-03-02 14:55:25 -0800
commit    | 379560be32c3910593e94aa6e91277fc3df3fc98 (patch)
tree      | e62bdd8d78b677dce9bcd5a0727c8c18ff79482d /tensorflow/compiler/xla/service/tuple_points_to_analysis.h
parent    | fee34c128479c5b1549ae2635b3f034bda526441 (diff)
[TF:XLA] Reduce sequential memory usage via better ordering and simulated heap.
The choice of instruction ordering, and the minimization of fragmentation once
we've chosen an order, are two large inter-related factors in overall memory
usage. The approach in this CL uses heuristics to do better on both, but
neither problem is completely solved.
To pick a better ordering (the larger factor), the approach is to try the
original list-scheduler-based ordering, and to also try a DFS-based ordering.
We pick the ordering that yields the smaller minimum memory, computed with the
simulated heap while ignoring fragmentation. Note that this is the absolute
minimum memory for a given ordering.
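The ordering comparison above can be sketched as follows. This is a hedged illustration, not the actual XLA code: `Instr`, `MinMemoryForOrder`, and the buffer model (each instruction produces one buffer, freed immediately after its last use) are hypothetical simplifications.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical instruction: produces one buffer of `size` bytes and reads
// the buffers produced by the instructions named in `operands`.
struct Instr {
  std::string name;
  int64_t size;
  std::vector<std::string> operands;
};

// Returns the peak sum of live buffer sizes when executing `order`
// sequentially, freeing each buffer right after its last use. This is the
// "absolute minimum memory for a given ordering": fragmentation is ignored.
int64_t MinMemoryForOrder(const std::vector<Instr>& order) {
  // Record the position of the last use of each buffer.
  std::unordered_map<std::string, size_t> last_use;
  for (size_t i = 0; i < order.size(); ++i) {
    last_use[order[i].name] = i;  // the producer itself counts as a use
    for (const auto& op : order[i].operands) last_use[op] = i;
  }
  int64_t live = 0, peak = 0;
  std::unordered_map<std::string, int64_t> live_size;
  for (size_t i = 0; i < order.size(); ++i) {
    live += order[i].size;
    live_size[order[i].name] = order[i].size;
    peak = std::max(peak, live);
    // Free every buffer whose last use is this instruction.
    for (auto it = live_size.begin(); it != live_size.end();) {
      if (last_use[it->first] == i) {
        live -= it->second;
        it = live_size.erase(it);
      } else {
        ++it;
      }
    }
  }
  return peak;
}
```

Given two candidate orderings, one would call `MinMemoryForOrder` on each and keep the ordering with the smaller result.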
To minimize fragmentation, the approach is to run a heap simulation on temporary
buffers. We still try to re-use existing allocations when possible, but instead
of creating new allocations for temp buffers, we collect all the leftovers and
use a heap to pack them. The heap algorithm that gave the best results is "lazy
best-fit": a variant of traditional best-fit that sometimes delays offset
assignment until Free is called, in the hopes of yielding larger free chunks.
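For intuition, here is a sketch of a plain best-fit heap simulator: Alloc places a buffer in the smallest free chunk that fits, and Free coalesces adjacent chunks. This is an assumed simplification for illustration; the "lazy best-fit" variant described above additionally delays offset assignment for some buffers until Free, which this sketch omits.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Simplified best-fit heap simulator (illustrative, not the XLA class).
// Offsets are assigned eagerly; free chunks are kept sorted by offset.
class BestFitHeap {
 public:
  // Places a buffer of `size` bytes in the smallest free chunk that fits
  // (best fit), extending the heap only when no chunk is large enough.
  int64_t Alloc(int64_t size) {
    auto best = free_.end();
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second >= size &&
          (best == free_.end() || it->second < best->second)) {
        best = it;
      }
    }
    if (best == free_.end()) {
      int64_t offset = heap_size_;  // no chunk fits: extend the heap
      heap_size_ += size;
      return offset;
    }
    int64_t offset = best->first;
    int64_t chunk = best->second;
    free_.erase(best);
    if (chunk > size) free_[offset + size] = chunk - size;  // split leftover
    return offset;
  }

  // Returns a chunk to the free list, merging with adjacent free chunks so
  // that later Allocs see the largest possible chunks.
  void Free(int64_t offset, int64_t size) {
    auto it = free_.emplace(offset, size).first;
    auto next = std::next(it);
    if (next != free_.end() && it->first + it->second == next->first) {
      it->second += next->second;  // coalesce with the following chunk
      free_.erase(next);
    }
    if (it != free_.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second == it->first) {
        prev->second += it->second;  // coalesce with the preceding chunk
        free_.erase(it);
      }
    }
  }

  int64_t heap_size() const { return heap_size_; }

 private:
  std::map<int64_t, int64_t> free_;  // offset -> size of each free chunk
  int64_t heap_size_ = 0;
};
```

The best-fit choice keeps large free chunks intact for large future requests; delaying offset assignment until Free (the "lazy" twist) goes further by letting the simulator see which chunks actually become free before committing.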
Here are some measurements of the temp buffer sizes for GNMT encoder training (a
stacked LSTM). Lower is better. I've tried various combinations of instruction
ordering and heap simulation, to show the joint impact of these two factors.
List-scheduler order, no heap simulation 33.33GiB
List-scheduler order, with heap simulation 25.09GiB
Minimized DFS order, no heap simulation 16.59GiB
Arbitrary DFS order, no heap simulation 15.05GiB (old)
Arbitrary DFS order, with heap simulation 12.57GiB
Minimized DFS order, with heap simulation 11.71GiB (new)
Note that the original list scheduler order is much worse than DFS on stacked
LSTMs, but (not shown here) is much better than DFS on convolutions like
Inception. Also note that heap simulation packs things tighter for all
instruction orders in this example, but to varying degrees.
Change: 149049028
Diffstat (limited to 'tensorflow/compiler/xla/service/tuple_points_to_analysis.h')
-rw-r--r-- | tensorflow/compiler/xla/service/tuple_points_to_analysis.h | 3 |
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tensorflow/compiler/xla/service/tuple_points_to_analysis.h b/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
index 8c73e2cbad..a384529171 100644
--- a/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
+++ b/tensorflow/compiler/xla/service/tuple_points_to_analysis.h
@@ -34,6 +34,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/gtl/array_slice.h"
 #include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/types.h"
@@ -65,7 +66,7 @@ class PointsToSet : public ShapeTree<std::vector<const LogicalBuffer*>> {
   // Creates a set containing the union of all LogicalBuffers contained in the
   // PointsToSet.
-  std::set<const LogicalBuffer*> CreateFlattenedSet() const;
+  tensorflow::gtl::FlatSet<const LogicalBuffer*> CreateFlattenedSet() const;

   // Returns true if the given buffer is in the points-to set at the given
   // index.