Subtract the size of the aliased buffers from the subcomputation estimate instead of from the current computation. This way, the memory estimate for the current computation is more accurate.
For the newly added test, the heap simulation at HEAD calculates 48 bytes instead of the correct 64 bytes.
PiperOrigin-RevId: 215653047
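A hypothetical model of the fix described above (all function names and numbers are invented for illustration): because the memory estimate is a maximum over program points, it matters whether the aliased bytes are subtracted before or after that maximum is taken.

```python
def old_estimate(caller_peak, live_at_call, sub_estimate, aliased_bytes):
    # Old behaviour: subtract the aliased buffers from the current
    # computation's estimate, i.e. after taking the max.
    return max(caller_peak, live_at_call + sub_estimate) - aliased_bytes

def new_estimate(caller_peak, live_at_call, sub_estimate, aliased_bytes):
    # New behaviour: subtract the aliased buffers from the subcomputation's
    # estimate, leaving the caller's own peak intact.
    return max(caller_peak, live_at_call + (sub_estimate - aliased_bytes))
```

With caller_peak=64, live_at_call=32, sub_estimate=32 and aliased_bytes=16, the old formula yields 48 while the new one yields the correct 64, mirroring the 48-vs-64 discrepancy mentioned above (these concrete numbers are invented, not taken from the test).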
PiperOrigin-RevId: 215272497
which sorts buffers by size regardless of their alloc/free time. It uses an interval tree to avoid conflicting allocations.
The pass now chooses the best result from the new algorithm and the old one.
PiperOrigin-RevId: 214032637
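A hedged sketch of the idea above (all names invented): assign offsets to buffers in decreasing-size order, treating two buffers as conflicting only when their live intervals overlap. A real implementation would query an interval tree; this sketch uses a linear scan for brevity.

```python
def place_buffers(buffers):
    """buffers: list of (size, start, end), live over the interval [start, end).
    Returns ({buffer index: offset}, total heap size)."""
    placed = []   # (offset, size, start, end) of already-assigned buffers
    offsets = {}
    # Largest buffers first, regardless of when they are alive.
    order = sorted(range(len(buffers)), key=lambda i: -buffers[i][0])
    for i in order:
        size, start, end = buffers[i]
        offset = 0
        # Scan assigned buffers by increasing offset, bumping past any that
        # overlap both in time and in the candidate address range.
        for o, s, ps, pe in sorted(placed):
            if ps < end and start < pe and o < offset + size and offset < o + s:
                offset = o + s
        placed.append((offset, size, start, end))
        offsets[i] = offset
    return offsets, max((o + s for o, s, _, _ in placed), default=0)
```

For example, two equal-sized buffers with overlapping lifetimes get stacked, while a third buffer that is live later can reuse offset 0.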
conversion is mostly automated).
PiperOrigin-RevId: 212303594
*** Original change description ***
Add HloSchedule class representing a sequential order of an HloModule.
Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class which results in better encap...
***
PiperOrigin-RevId: 211726890
Automated rollback of commit 7fa693209fe238478739b3982f652a7e35be91f3
PiperOrigin-RevId: 211681957
duplicate definitions.
PiperOrigin-RevId: 211662523
Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class which results in better encapsulation of code which deals with schedules and better enforcement of invariants.
This CL also fixes a corner-case bug in dataflow analysis, where values of instructions which are live out of the computation erroneously did not interfere with the values of instructions scheduled after the root instruction.
PiperOrigin-RevId: 211656888
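A minimal sketch in the spirit of the change above (names invented): a schedule class that owns the computation-to-instruction-sequence map and can check its invariants, instead of passing a bare map around.

```python
class Schedule:
    def __init__(self):
        self._sequences = {}  # computation name -> ordered instruction names

    def set_sequence(self, computation, instructions):
        self._sequences[computation] = list(instructions)

    def sequence(self, computation):
        return self._sequences[computation]

    def verify(self, module):
        """module: computation name -> set of its instruction names.
        Invariant: every instruction of every computation is scheduled
        exactly once, and nothing else is."""
        for computation, instructions in module.items():
            seq = self._sequences.get(computation, [])
            if len(seq) != len(set(seq)) or set(seq) != set(instructions):
                return False
        return True
```

Wrapping the map this way gives one place to enforce invariants that a bare `map<HloComputation*, vector<HloInstruction*>>` cannot.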
HLO transformations would forget to propagate the feature depth attribute.
Making these attributes mandatory, while slightly less convenient for tests,
makes HLO transformations more robust.
PiperOrigin-RevId: 211490160
Same for WrapUnique.
PiperOrigin-RevId: 209531124
Currently the Literal class sits in literal_util.{h,cc} instead of literal.{h,cc}.
It also contains helper functions that are a better fit as their own separate
class/namespace. This change starts that process by moving most static factory
methods to the LiteralUtil namespace.
PiperOrigin-RevId: 203217065
- Enable multi-output fusion operand buffer reuse.
- Fix a bug in the heap simulator where a buffer could be reused twice.
- Add a unit test.
PiperOrigin-RevId: 201567720
PiperOrigin-RevId: 200646674
These methods have nothing to do with scheduling.
Also rename the CreateMemoryMinimizingSequence methods in hlo_scheduling.
PiperOrigin-RevId: 200254100
PiperOrigin-RevId: 196293610
easier to migrate from TuplePointsToAnalysis/LogicalBuffer to
HloDataflowAnalysis/HloValue. No functional changes.
PiperOrigin-RevId: 195179676
- Require a module config when creating an HloModule.
- All tests using HloTestBase create a module using CreateNewModule.
PiperOrigin-RevId: 194684585
elements alive.
This achieves two things:
1. Heap simulation runtime is no longer quadratic in the number of tuple elements (as we don't add each GetTupleElement to the liveset of each buffer defined by the tuple).
2. A reduction in the heap memory footprint.
PiperOrigin-RevId: 187079787
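A hedged model of the complexity claim above (the representation is invented): a tuple defines one buffer per element, and gte_i reads element i. Adding every GetTupleElement user to every element buffer's live set costs O(n^2) entries in total; crediting each GTE only to the element it actually reads is O(n).

```python
def total_live_entries(num_elements, per_element_users):
    """Count total live-set entries for a tuple with num_elements elements,
    each read by exactly one GetTupleElement."""
    live = {i: set() for i in range(num_elements)}
    for i in range(num_elements):
        if per_element_users:
            live[i].add(f"gte{i}")            # new: linear bookkeeping
        else:
            for j in range(num_elements):
                live[j].add(f"gte{i}")        # old: quadratic bookkeeping
    return sum(len(users) for users in live.values())
```

For 50 tuple elements the old scheme tracks 2500 live-set entries versus 50 for the new one, which is the runtime win point 1 above describes.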
contracting and batch dimensions.
PiperOrigin-RevId: 177481231
and remove any main() definitions in tests. This enables use of flags
in all tests.
PiperOrigin-RevId: 168424796
HLO graphs. This enables, for example, dumping of the graphs with
--xla_generate_hlo_graph. Also remove some superfluous tensorflow test_main
dependencies.
PiperOrigin-RevId: 168406746
* Added CompactPointerSet<T>, which is optimized for set size <= 1.
* Changed expensive CHECKs to DCHECKs in buffer_assignment.cc
* Reserve space in DFS state array before starting DFS.
* Use unsigned arithmetic in DFS state maintenance.
* HloInstruction:
- Moved frequently used fields to start for better cache locality.
- Use InlinedVector instead of vector for operand array.
- Use InlinedVector instead of vector for DFS stack.
* Pre-compute "is array" and "is tuple" for LogicalBuffer.
* PointsToSet:
- Combine two ShapeTrees into one.
- Use CompactPointerSet instead of std::set to hold sources.
- Use CompactPointerSet instead of std::set to hold flattened buffers.
* ShapeTree: use unique_ptr instead of optional for shape storage
(reduces size and destruction overhead).
* Add proper const qualifiers to some FlatSet iterator methods.
Co-author=jeff
PiperOrigin-RevId: 165759117
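A sketch of the CompactPointerSet idea from the first bullet (the Python class below is an invented analogue, not the actual C++ type): a single element is stored inline, and a real set is only materialized on the first insertion of a second distinct element.

```python
class CompactSet:
    """Set optimized for the common size <= 1 case."""
    _EMPTY = object()  # sentinel distinct from any user value

    def __init__(self):
        self._single = CompactSet._EMPTY  # inline slot for a lone element
        self._many = None                 # real set, allocated lazily

    def add(self, item):
        if self._many is not None:
            self._many.add(item)
        elif self._single is CompactSet._EMPTY:
            self._single = item
        elif self._single != item:
            # Second distinct element: fall back to a real set.
            self._many = {self._single, item}
            self._single = CompactSet._EMPTY

    def __len__(self):
        if self._many is not None:
            return len(self._many)
        return 0 if self._single is CompactSet._EMPTY else 1

    def __contains__(self, item):
        if self._many is not None:
            return item in self._many
        return self._single is not CompactSet._EMPTY and self._single == item
```

The payoff is that the overwhelmingly common zero- and one-element sets never pay for a heap-allocated container.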
PiperOrigin-RevId: 160354095
This patch removes class xla::LiteralUtil and rewrites every call to use class
xla::Literal instead.
PiperOrigin-RevId: 159446373
Buffers can now be assigned "color" tags. Buffers that have different colors must live in separate allocations.
PiperOrigin-RevId: 158575828
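A small sketch of the constraint above (names invented): buffers carry a color tag, and buffers with different colors must never share an allocation, so allocations can simply be formed per color.

```python
from collections import defaultdict

def allocations_by_color(buffers):
    """buffers: list of (buffer_name, color).
    Returns color -> list of buffer names sharing that allocation."""
    allocations = defaultdict(list)
    for name, color in buffers:
        allocations[color].append(name)
    return dict(allocations)
```

Any further packing (offsets, reuse) then happens independently within each color group.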
Previously the HeapSimulator was only run on a per-computation basis. This meant
that if you had many sub-computations in your module (e.g. many While loops),
the space for all of the temporary buffers inside the conditions and bodies of
the loops was in distinct memory ranges. This is overly pessimistic if all
computations in the module are sequential.
This CL changes the HeapSimulator to also run whole-module simulation, calling
Alloc and Free on sub-computation buffers at the appropriate nested spot, right
next to the calling instruction. The BufferAssigner is updated to take advantage
of this when possible, as is MinimumMemoryForSequence.
Change: 154908856
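A hedged model contrasting the two modes (the event format is invented): in whole-module simulation a 'call' recursively simulates the callee at the call site, so callee temporaries share the caller's address range; in per-computation mode each computation's range is disjoint, so sizes add even when lifetimes never overlap.

```python
def peak_whole_module(computation):
    """computation: list of ('alloc', n), ('free', n) or ('call', sub)."""
    cur = peak = 0
    for op, arg in computation:
        if op == 'alloc':
            cur += arg
        elif op == 'free':
            cur -= arg
        else:  # Alloc/Free of callee buffers happen right at the call site.
            peak = max(peak, cur + peak_whole_module(arg))
        peak = max(peak, cur)
    return peak

def total_per_computation(computation):
    cur = peak = subs = 0
    for op, arg in computation:
        if op == 'alloc':
            cur += arg
        elif op == 'free':
            cur -= arg
        else:  # Callee lives in its own distinct memory range.
            subs += total_per_computation(arg)
        peak = max(peak, cur)
    return peak + subs
```

A caller holding 8 bytes that calls a 16-byte-peak body twice needs 24 bytes in whole-module mode, but 40 bytes when each computation gets its own range.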
The choice of instruction ordering, and the minimization of fragmentation once
we've chosen an order, are two large inter-related factors wrt overall memory
usage. The approach in this CL uses heuristics to do better on both, but
neither problem is completely solved.
To pick a better ordering (the larger factor), the approach is to try the
original list-scheduler based ordering, and to also try a DFS based ordering.
We pick the ordering that yields a smaller minimum memory, computed with the
simulated heap, ignoring fragmentation. Note that this is the absolute minimum
memory for a given ordering.
To minimize fragmentation, the approach is to run a heap simulation on temporary
buffers. We still try to re-use existing allocations when possible, but instead
of creating new allocations for temp buffers, we collect all the leftovers and
use a heap to pack them. The heap algorithm that gave the best results is "lazy
best-fit"; a variant of traditional best-fit that sometimes delays offset
assignment until Free is called, in the hopes of yielding larger free chunks.
Here are some measurements of the temp buffer sizes for GNMT encoder training (a
stacked LSTM). Lower is better. I've tried various combinations of instruction
ordering and heap simulation, to show the joint impact of these two factors.
List-scheduler order, no heap simulation     33.33GiB
List-scheduler order, with heap simulation   25.09GiB
Minimized DFS order, no heap simulation      16.59GiB
Arbitrary DFS order, no heap simulation      15.05GiB (old)
Arbitrary DFS order, with heap simulation    12.57GiB
Minimized DFS order, with heap simulation    11.71GiB (new)
Note that the original list scheduler order is much worse than DFS on stacked
LSTMs, but (not shown here) is much better than DFS on convolutions like
Inception. Also note that heap simulation packs things tighter for all
instruction orders in this example, but to varying degrees.
Change: 149049028
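The ordering heuristic above can be sketched as follows (the event format is invented): simulate each candidate order with a fragmentation-free heap and keep the order whose absolute-minimum memory is smaller.

```python
def minimum_memory(order):
    """order: list of ('alloc'|'free', size) events. Returns peak live
    bytes, i.e. the fragmentation-free lower bound for this ordering."""
    cur = peak = 0
    for op, size in order:
        cur += size if op == 'alloc' else -size
        peak = max(peak, cur)
    return peak

def pick_order(candidate_orders):
    # E.g. the list-scheduler order and a DFS order would both be candidates.
    return min(candidate_orders, key=minimum_memory)
```

An order that allocates two 4-byte buffers back to back needs 8 bytes at minimum, while an order that frees the first before allocating the second needs only 4, so the latter wins.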