Subtract the size of the aliased buffers from the subcomputation estimate instead of from the current computation. This way, the memory estimate for the current computation is more accurate.
For the newly added test, the heap simulation at HEAD computes 48 bytes instead of the correct 64 bytes.
PiperOrigin-RevId: 215653047
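The adjustment described above amounts to simple arithmetic, sketched below with hypothetical names (this is not the actual XLA API): bytes for buffers aliased with the caller are charged once, to the caller, so they are removed from the callee's peak estimate.

```python
def adjusted_subcomputation_bytes(subcomputation_peak, aliased_buffer_bytes):
    """Charge aliased buffers to the caller: subtract their sizes from the
    subcomputation's peak estimate (never going below zero)."""
    return max(0, subcomputation_peak - sum(aliased_buffer_bytes))

# A 64-byte subcomputation peak with one 16-byte aliased buffer contributes
# only 48 bytes on top of the caller's own live set.
print(adjusted_subcomputation_bytes(64, [16]))  # -> 48
```

The parent's estimate then adds this adjusted child peak on top of the buffers live at the call site, instead of double-counting the aliased bytes.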
PiperOrigin-RevId: 215324035
PiperOrigin-RevId: 215272497
which sorts buffers by size regardless of their alloc/free times. It uses an interval tree to avoid conflicting allocations.
Also changed the code to choose the best result from the new algorithm and the old one.
PiperOrigin-RevId: 214032637
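A minimal sketch of the size-sorted placement idea (hypothetical names; a linear scan over placed buffers stands in for the interval tree): buffers are placed largest-first, and two buffers may share address space only when their live intervals do not overlap.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Buffer:
    name: str
    size: int
    start: int  # first program point where the buffer is live
    end: int    # last program point where the buffer is live (inclusive)

def assign_offsets_by_size(buffers):
    """Place buffers in decreasing size order; bump a buffer's offset past any
    already-placed buffer that overlaps it both in time and in address space.
    (A real implementation would query an interval tree instead of scanning.)"""
    placed = []  # (offset, Buffer) pairs
    for buf in sorted(buffers, key=lambda b: -b.size):
        offset, moved = 0, True
        while moved:
            moved = False
            for o, p in placed:
                time_overlap = not (buf.end < p.start or p.end < buf.start)
                addr_overlap = offset < o + p.size and o < offset + buf.size
                if time_overlap and addr_overlap:
                    offset, moved = o + p.size, True  # bump past it and rescan
        placed.append((offset, buf))
    heap_size = max(o + b.size for o, b in placed)
    return {b.name: o for o, b in placed}, heap_size
```

For example, a small buffer whose lifetime starts after a large buffer dies can reuse the large buffer's offset, keeping the overall heap extent down.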
*** Original change description ***
Add HloSchedule class representing a sequential order of an HloModule.
Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class which results in better encap...
***
PiperOrigin-RevId: 211726890
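The encapsulation win can be illustrated with a toy schedule class (a Python sketch with hypothetical names, not the C++ HloSchedule API): the bare map is hidden behind accessors, and invariants are checked in one place rather than at every use site.

```python
class Schedule:
    """Toy stand-in for a schedule class wrapping {computation: [instructions]}."""

    def __init__(self):
        self._sequences = {}

    def set_sequence(self, computation, instructions):
        self._sequences[computation] = list(instructions)

    def sequence(self, computation):
        return self._sequences[computation]

    def verify(self, module):
        """Check invariants a bare map cannot enforce: every scheduled
        computation exists in the module, and each sequence is a permutation
        of exactly that computation's instructions."""
        for comp, seq in self._sequences.items():
            if comp not in module:
                raise ValueError(f"{comp} not in module")
            if sorted(seq) != sorted(module[comp]):
                raise ValueError(f"sequence for {comp} out of sync")
```

With a bare map, nothing stops a pass from leaving a stale sequence behind after rewriting a computation; a `verify()` entry point makes that a loud failure.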
Automated rollback of commit 7fa693209fe238478739b3982f652a7e35be91f3
PiperOrigin-RevId: 211681957
Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class which results in better encapsulation of code which deals with schedules and better enforcement of invariants.
This CL also fixes a corner-case bug in dataflow analysis, where values of instructions which are live out of the computation erroneously did not interfere with the values of instructions scheduled after the root instruction.
PiperOrigin-RevId: 211656888
HeapSimulator.
The new modeling of subcomputations is still not entirely accurate, but probably not worth putting more work into, since TuplePointsToAnalysis will be removed from HeapSimulator soon.
PiperOrigin-RevId: 209646234
PiperOrigin-RevId: 200646674
These methods have nothing to do with scheduling.
Also, rename the CreateMemoryMinimizingSequence methods in hlo_scheduling.
PiperOrigin-RevId: 200254100
PiperOrigin-RevId: 196293610
issue Alloc/Free for constants, and enable buffer sharing.
PiperOrigin-RevId: 183836922
PiperOrigin-RevId: 179953488
Also add flags to dump this data in JSON format, for each backend.
This is useful for upcoming debugging tools.
PiperOrigin-RevId: 157178357
Previously the HeapSimulator was only run on a per-computation basis. This meant
that if you had many sub-computations in your module (e.g. many While loops),
the temporary buffers inside the conditions and bodies of the loops were placed
in distinct memory ranges. This is overly pessimistic if all computations in the
module are sequential.
This CL changes the HeapSimulator to also run whole-module simulation, calling
Alloc and Free on sub-computation buffers at the appropriate nested spot, right
next to the calling instruction. The BufferAssigner is updated to take advantage
of this when possible, as is MinimumMemoryForSequence.
Change: 154908856
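The nesting can be sketched with a toy simulator (hypothetical data model, not the real HeapSimulator API): a `('call', name)` entry in a sequence recurses into the callee, so the callee's temporaries are allocated and freed inside the caller's footprint rather than in a disjoint range.

```python
def whole_module_peak(schedule, computation, sizes, state=None):
    """Return peak bytes when sub-computation buffers share the caller's heap.
    For simplicity every buffer is freed when its computation finishes; the
    real simulator frees each buffer at its last use."""
    state = state if state is not None else {"current": 0, "peak": 0}
    local = []
    for instr in schedule[computation]:
        if isinstance(instr, tuple) and instr[0] == "call":
            whole_module_peak(schedule, instr[1], sizes, state)  # nested Alloc/Free
        else:
            state["current"] += sizes[instr]
            state["peak"] = max(state["peak"], state["current"])
            local.append(instr)
    for instr in local:  # Free this computation's temporaries on exit
        state["current"] -= sizes[instr]
    return state["peak"]
```

For a main sequence `["a", ("call", "body"), "b"]` with `a`/`b` at 4 bytes each and body temporaries `t1`/`t2` at 8 bytes each, the whole-module peak is 20 bytes, whereas reserving disjoint per-computation ranges would need 8 + 16 = 24.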
The choice of instruction ordering, and the minimization of fragmentation once
we've chosen an order, are two large inter-related factors wrt overall memory
usage. The approach in this CL uses heuristics to do better on both, but
neither problem is completely solved.
To pick a better ordering (the larger factor), the approach is to try the
original list-scheduler based ordering, and to also try a DFS based ordering.
We pick the ordering that yields a smaller minimum memory, computed with the
simulated heap, ignoring fragmentation. Note that this is the absolute minimum
memory for a given ordering.
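Under simplifying assumptions (each instruction defines one buffer, freed right after its last consumer runs; names hypothetical), the "absolute minimum memory for a given ordering" and the choice between candidate orderings can be sketched as:

```python
def peak_memory(order, sizes, users):
    """Minimum bytes for this order, ignoring fragmentation: allocate a
    buffer at its definition, free it after its last consumer executes."""
    pos = {instr: i for i, instr in enumerate(order)}
    free_at = {instr: max([pos[u] for u in users.get(instr, [])] + [pos[instr]])
               for instr in order}
    current = peak = 0
    for i, instr in enumerate(order):
        current += sizes[instr]
        peak = max(peak, current)
        current -= sum(sizes[d] for d in order if free_at[d] == i)
    return peak

def pick_ordering(candidates, sizes, users):
    """Keep the candidate (e.g. list-scheduler order vs. DFS order) whose
    simulated-heap minimum is smaller."""
    return min(candidates, key=lambda o: peak_memory(o, sizes, users))
```

For instance, with two 10-byte producers each feeding a 1-byte consumer, the order that finishes one producer/consumer pair before starting the next peaks at 11 bytes, while interleaving the producers peaks at 21.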
To minimize fragmentation, the approach is to run a heap simulation on temporary
buffers. We still try to re-use existing allocations when possible, but instead
of creating new allocations for temp buffers, we collect all the leftovers and
use a heap to pack them. The heap algorithm that gave the best results is "lazy
best-fit"; a variant of traditional best-fit that sometimes delays offset
assignment until Free is called, in the hopes of yielding larger free chunks.
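For contrast with the lazy variant, a minimal eager best-fit heap is easy to sketch (hypothetical API, not the actual implementation): Alloc takes the tightest free chunk that fits, or grows the heap; Free returns the chunk and coalesces adjacent free space. Lazy best-fit additionally postpones some offset decisions until Free, by which time coalescing may have produced larger chunks.

```python
class BestFitHeap:
    """Toy eager best-fit heap over a single address range."""

    def __init__(self):
        self.free_chunks = []  # (offset, size) tuples of free space
        self.heap_end = 0
        self.offsets = {}

    def alloc(self, buf, size):
        fits = [c for c in self.free_chunks if c[1] >= size]
        if fits:
            off, sz = min(fits, key=lambda c: c[1])  # best fit: tightest chunk
            self.free_chunks.remove((off, sz))
            if sz > size:
                self.free_chunks.append((off + size, sz - size))
        else:
            off = self.heap_end  # nothing fits: grow the heap
            self.heap_end += size
        self.offsets[buf] = off
        return off

    def free(self, buf, size):
        self.free_chunks.append((self.offsets.pop(buf), size))
        self._coalesce()

    def _coalesce(self):
        merged = []
        for off, sz in sorted(self.free_chunks):
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1][1] += sz  # adjacent: extend the previous chunk
            else:
                merged.append([off, sz])
        self.free_chunks = [(o, s) for o, s in merged]
```

Eager assignment commits to an offset at Alloc time even when the current free chunks are a poor match; delaying that commitment is exactly the degree of freedom the lazy variant exploits.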
Here's some measurements of the temp buffer sizes for GNMT encoder training (a
stacked LSTM). Lower is better. I've tried various combinations of instruction
ordering and heap simulation, to show the joint impact of these two factors.
List-scheduler order, no heap simulation 33.33GiB
List-scheduler order, with heap simulation 25.09GiB
Minimized DFS order, no heap simulation 16.59GiB
Arbitrary DFS order, no heap simulation 15.05GiB (old)
Arbitrary DFS order, with heap simulation 12.57GiB
Minimized DFS order, with heap simulation 11.71GiB (new)
Note that the original list scheduler order is much worse than DFS on stacked
LSTMs, but (not shown here) is much better than DFS on convolutions like
Inception. Also note that heap simulation packs things tighter for all
instruction orders in this example, but to varying degrees.
Change: 149049028