PiperOrigin-RevId: 215324035
PiperOrigin-RevId: 215272497
These are type aliases of one another; we'd like to be consistent and
use the shorter one.
PiperOrigin-RevId: 196322955
analysis.
PiperOrigin-RevId: 165963591
* Added CompactPointerSet<T>, which is optimized for set size <= 1 (see the sketch after this message).
* Changed expensive CHECKs to DCHECKs in buffer_assignment.cc
* Reserve space in DFS state array before starting DFS.
* Use unsigned arithmetic in DFS state maintenance.
* HloInstruction:
- Moved frequently used fields to start for better cache locality.
- Use InlinedVector instead of vector for operand array.
- Use InlinedVector instead of vector for DFS stack.
* Pre-compute "is array" and "is tuple" for LogicalBuffer.
* PointsToSet:
- Combine two ShapeTrees into one.
- Use CompactPointerSet instead of std::set to hold sources.
- Use CompactPointerSet instead of std::set to hold flattened buffers.
* ShapeTree: use unique_ptr instead of optional for shape storage
(reduces size and destruction overhead).
* Add proper const qualifiers to some FlatSet iterator methods.
Co-author=jeff
PiperOrigin-RevId: 165759117
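
A minimal sketch of a pointer set optimized for size <= 1, in the spirit of
the CompactPointerSet<T> item above. The class name comes from the commit;
the layout and methods here are assumptions for illustration, not
necessarily XLA's actual implementation:

    #include <cstddef>
    #include <set>

    // Holds at most one element inline; only falls back to a std::set once
    // a second distinct (non-null) element is inserted.
    template <typename T>
    class CompactPointerSet {
     public:
      bool insert(T* value) {
        if (single_ == value) return false;        // Already present inline.
        if (single_ == nullptr && rep_.empty()) {  // Common case: size 0 -> 1.
          single_ = value;
          return true;
        }
        if (single_ != nullptr) {                  // Spill the inline element.
          rep_.insert(single_);
          single_ = nullptr;
        }
        return rep_.insert(value).second;
      }

      bool contains(T* value) const {
        return single_ == value || rep_.count(value) > 0;
      }

      size_t size() const { return single_ != nullptr ? 1 : rep_.size(); }

     private:
      T* single_ = nullptr;  // Inline storage for the size <= 1 fast path.
      std::set<T*> rep_;     // Heap-backed storage for larger sets.
    };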
tables and a vector into two vectors. One holds per-instruction
information (TuplePointsToAnalysis::PerInstruction) and is indexed by
HloInstruction::unique_id(), and the other
(TuplePointsToAnalysis::PerLogicalBuffer) holds per-logical-buffer
information, indexed by LogicalBuffer::Id.
This speeds up compilation of a convolutional image model from 44.4s
to 37.39s (a 15.8% improvement), and reduces the profiling time spent
in TuplePointsToAnalysis-related code from 17.38s to 11.87s (a 31.7%
improvement).
Changed the interface for accessing the logical buffers collected by
the TuplePointsToAnalysis module. Instead of a logical_buffers()
accessor that exposes a whole vector of buffers, we now expose a
num_logical_buffers() method and require clients to iterate
sequentially over the buffer id space, calling the logical_buffer(id)
accessor to get each individual buffer. This is necessary because we
no longer keep the buffers in a flat std::vector object, and we'd
rather not leak that kind of representation detail through the
interface anyway.
PiperOrigin-RevId: 164199505
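
A sketch of the two-vector layout and the iteration pattern the new
interface implies. Only the PerInstruction/PerLogicalBuffer names,
num_logical_buffers(), and logical_buffer(id) come from the message above;
the struct contents and the surrounding class are placeholders:

    #include <cstdint>
    #include <vector>

    struct PerInstruction { /* per-instruction points-to data */ };
    struct PerLogicalBuffer { /* per-logical-buffer data */ };

    class TuplePointsToAnalysisSketch {
     public:
      int64_t num_logical_buffers() const { return per_buffer_.size(); }
      const PerLogicalBuffer& logical_buffer(int64_t id) const {
        return per_buffer_[id];  // Buffer ids are dense, so a vector works.
      }

     private:
      // Indexed by HloInstruction::unique_id() and LogicalBuffer::Id,
      // replacing the previous hash tables.
      std::vector<PerInstruction> per_instruction_;
      std::vector<PerLogicalBuffer> per_buffer_;
    };

    // Clients now iterate over the id space instead of taking a vector:
    void VisitAllBuffers(const TuplePointsToAnalysisSketch& analysis) {
      for (int64_t id = 0; id < analysis.num_logical_buffers(); ++id) {
        const PerLogicalBuffer& buffer = analysis.logical_buffer(id);
        (void)buffer;  // ... examine the buffer ...
      }
    }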
PiperOrigin-RevId: 162690202
PiperOrigin-RevId: 162275084
PiperOrigin-RevId: 160354095
Buffers can now be assigned "color" tags. Buffers that have different colors must live in separate allocations.
PiperOrigin-RevId: 158575828
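
A tiny sketch of the coloring constraint described above. The types and
fields are hypothetical; only the rule that differently-colored buffers
must not share an allocation comes from the commit:

    #include <cstdint>

    struct BufferSketch {
      int64_t id;
      int64_t color;  // "Color" tag assigned to this buffer.
    };

    struct AllocationSketch {
      int64_t color;  // Every buffer placed here must carry this color.

      // True iff `buffer` may live in this allocation. A real assigner
      // would also check liveness and size; those checks are elided.
      bool CanHold(const BufferSketch& buffer) const {
        return buffer.color == color;
      }
    };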
The choice of instruction ordering, and the minimization of fragmentation
once we've chosen an order, are two large, inter-related factors in overall
memory usage. The approach in this CL uses heuristics to do better on both,
but neither problem is completely solved.
To pick a better ordering (the larger factor), the approach is to try the
original list-scheduler based ordering, and to also try a DFS based
ordering. We pick the ordering that yields the smaller minimum memory,
computed with the simulated heap and ignoring fragmentation. Note that this
is the absolute minimum memory for a given ordering.
To minimize fragmentation, the approach is to run a heap simulation on
temporary buffers. We still try to re-use existing allocations when
possible, but instead of creating new allocations for temp buffers, we
collect all the leftovers and use a heap to pack them. The heap algorithm
that gave the best results is "lazy best-fit", a variant of traditional
best-fit that sometimes delays offset assignment until Free is called, in
the hope of yielding larger free chunks.
Here are some measurements of the temp buffer sizes for GNMT encoder
training (a stacked LSTM); lower is better. I've tried various combinations
of instruction ordering and heap simulation to show the joint impact of
these two factors:
  List-scheduler order, no heap simulation     33.33 GiB
  List-scheduler order, with heap simulation   25.09 GiB
  Minimized DFS order,  no heap simulation     16.59 GiB
  Arbitrary DFS order,  no heap simulation     15.05 GiB  (old)
  Arbitrary DFS order,  with heap simulation   12.57 GiB
  Minimized DFS order,  with heap simulation   11.71 GiB  (new)
Note that the original list scheduler order is much worse than DFS on stacked
LSTMs, but (not shown here) is much better than DFS on convolutions like
Inception. Also note that heap simulation packs things tighter for all
instruction orders in this example, but to varying degrees.
Change: 149049028
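
A sketch of plain best-fit placement over a free list, the baseline that
the "lazy best-fit" variant above extends by deferring some offset
assignments until Free. The interface is illustrative, not XLA's heap
simulator, and coalescing of freed chunks is elided:

    #include <cstdint>
    #include <list>

    struct Chunk { int64_t offset; int64_t size; };

    class BestFitHeap {
     public:
      explicit BestFitHeap(int64_t capacity) { free_.push_back({0, capacity}); }

      // Returns the offset of the smallest free chunk that fits, or -1.
      int64_t Alloc(int64_t size) {
        auto best = free_.end();
        for (auto it = free_.begin(); it != free_.end(); ++it) {
          if (it->size >= size &&
              (best == free_.end() || it->size < best->size)) {
            best = it;  // Smallest chunk so far that still fits.
          }
        }
        if (best == free_.end()) return -1;
        const int64_t offset = best->offset;
        best->offset += size;  // Carve the request off the chunk's front.
        best->size -= size;
        if (best->size == 0) free_.erase(best);
        return offset;
      }

      void Free(int64_t offset, int64_t size) {
        // A full implementation would coalesce adjacent chunks here.
        free_.push_back({offset, size});
      }

     private:
      std::list<Chunk> free_;  // Unordered free list.
    };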
| |
Add CreateMemoryMinimizingSequence, which constructs a sequence of
the instructions in an HLO module that heuristically minimizes the
total size of live buffers containing HLO outputs.
Change: 145599747
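
A greedy sketch of one heuristic in this spirit: among ready instructions,
always schedule the one that minimizes the bytes left live (operand buffers
freed minus output bytes produced). The commit does not specify
CreateMemoryMinimizingSequence's actual algorithm, so this is purely
illustrative:

    #include <cstdint>
    #include <vector>

    struct Instr {
      int64_t output_bytes;       // Size of the buffer this instr defines.
      std::vector<int> operands;  // Indices of producer instructions (a DAG).
    };

    std::vector<int> MemoryMinimizingSequence(const std::vector<Instr>& g) {
      const int n = static_cast<int>(g.size());
      std::vector<int> unscheduled_operands(n), pending_users(n, 0);
      for (int i = 0; i < n; ++i) {
        unscheduled_operands[i] = static_cast<int>(g[i].operands.size());
        for (int op : g[i].operands) ++pending_users[op];
      }
      std::vector<int> order;
      std::vector<bool> done(n, false);
      while (static_cast<int>(order.size()) < n) {  // DAG => a ready node exists.
        int best = -1;
        int64_t best_gain = INT64_MIN;
        for (int i = 0; i < n; ++i) {
          if (done[i] || unscheduled_operands[i] != 0) continue;  // Not ready.
          int64_t freed = 0;  // Operand buffers for which i is the last user.
          for (int op : g[i].operands) {
            if (pending_users[op] == 1) freed += g[op].output_bytes;
          }
          const int64_t gain = freed - g[i].output_bytes;
          if (gain > best_gain) { best_gain = gain; best = i; }
        }
        done[best] = true;
        order.push_back(best);
        for (int op : g[best].operands) --pending_users[op];
        for (int i = 0; i < n; ++i) {  // Unlock newly-ready consumers of best.
          if (done[i]) continue;
          for (int op : g[i].operands) {
            if (op == best) --unscheduled_operands[i];
          }
        }
      }
      return order;
    }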
XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators.
XLA is still experimental; we are releasing it early to get the community involved.
Change: 143990941