PiperOrigin-RevId: 216350134
- This CL introduces an input/output alias config in the HLO module that any HLO pass can configure. Once the alias config is set, each backend needs to follow the contract at execution time to make sure the input and output are indeed aliased.
- Copy insertion, buffer assignment, and alias analysis have been updated to correctly honor the config and avoid any possible liveness interference.
PiperOrigin-RevId: 216299501
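The contract can be pictured with a small sketch. This is a hypothetical model of such an alias table, written in Python for brevity; the class and method names are invented and do not correspond to the real XLA C++ API. It only illustrates the bookkeeping: each output index maps to the parameter buffer the backend must reuse, and a parameter buffer can back at most one output.

```python
class AliasConfig:
    """Toy sketch of an input/output alias table (names are hypothetical,
    not the actual XLA classes). Output shape indices map to the
    (parameter number, parameter shape index) whose buffer the backend
    must reuse for that output at execution time."""

    _EMPTY = object()

    def __init__(self):
        self._aliases = {}       # output_index -> (param_number, param_index)
        self._used_params = set()

    def setup_alias(self, output_index, param_number, param_index):
        key = (param_number, param_index)
        if output_index in self._aliases:
            raise ValueError("output %r already aliased" % (output_index,))
        if key in self._used_params:
            raise ValueError("parameter %r already backs another output" % (key,))
        self._aliases[output_index] = key
        self._used_params.add(key)

    def get_aliased_parameter(self, output_index):
        # Returns None when the output is not aliased to any parameter.
        return self._aliases.get(output_index)
```

A backend honoring the contract would consult `get_aliased_parameter` for every output buffer before allocating a fresh one.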
During copy elision, copies that were previously added during copy insertion might end up with the same source and destination buffer. The previous logic did not remove them; this CL handles that corner case.
PiperOrigin-RevId: 208509171
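The corner case amounts to a filter over the inserted copies. A minimal sketch, where the hypothetical `buffer_of` callable stands in for the backend's buffer-assignment lookup:

```python
def elide_trivial_copies(copies, buffer_of):
    """Drop copy instructions whose source and destination resolve to the
    same buffer. `copies` is a list of (source, destination) pairs and
    `buffer_of` maps an instruction to its assigned buffer (both are
    illustrative stand-ins, not XLA types)."""
    kept = []
    for src, dst in copies:
        if buffer_of(src) != buffer_of(dst):
            kept.append((src, dst))
        # A copy whose source and destination share a buffer is a no-op
        # and is elided.
    return kept
```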
Currently the Literal classes sit in literal_util.{h,cc} instead of literal.{h,cc}.
That file also contains helper functions that are a better fit for their own
separate class/namespace. This change starts that process by moving most static
factory methods into the LiteralUtil namespace.
PiperOrigin-RevId: 203217065
Array select and tuple select are already handled separately in all backends and HLO passes: array select is an elementwise operation, and the shapes of its two operands have the same dimensions. Tuple select does not define its own output; instead it forwards the true or false operand based on a scalar predicate operand.
This CL reflects this by adding a new kTupleSelect HLO. The XLA builder interface stays the same and dispatches based on the operand shapes.
There is no change in the operation semantics; this CL just splits the existing select operation into two opcodes while preserving the existing semantics.
HLO cost analysis is fixed to handle the two ops appropriately.
PiperOrigin-RevId: 203180342
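The builder-side dispatch can be sketched as follows. Shapes are modeled here as simple tagged tuples, and the names are illustrative rather than the actual XLA builder API:

```python
def select_opcode(on_true_shape, on_false_shape):
    """Sketch of the dispatch described above: the client-facing select
    stays one API call, but lowers to one of two opcodes depending on
    the operand shapes. A shape is modeled as ('tuple', elements) or
    ('array', dimensions)."""
    if on_true_shape != on_false_shape:
        raise ValueError("select operands must have the same shape")
    if on_true_shape[0] == "tuple":
        # Tuple select forwards one whole operand based on a scalar
        # predicate; it does not define its own output buffer.
        return "kTupleSelect"
    # Array select is an elementwise operation.
    return "kSelect"
```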
It already detects layout-changing copies, and those are already left unchanged
by copy elision. Special-case copies are also skipped because they are tagged
separately (SetCopyElisionAllowed).
PiperOrigin-RevId: 202574858
Long term, I think we want to require kAfterAll to take at least one token as an operand so that it cannot generate a token out of thin air; kGenerateToken is therefore no longer an appropriate name. Instead, a primordial token would be supplied somehow in the entry computation, perhaps as a parameter, and then threaded to any side-effecting ops.
NFC.
PiperOrigin-RevId: 202079040
Tokens will be used for ordering side-effecting operations. They are not materialized, but they can be contained in tuples and flow into and out of computations. This CL adds a trivial representation for the CPU and GPU backends to support tokens and modifies copy insertion to avoid making copies of tokens.
This also adds a TOKEN Literal, which is required for the interpreter backend.
PiperOrigin-RevId: 200623120
PiperOrigin-RevId: 199673573
HloDataflowAnalysis::UpdateTupleValueSet, then optimize that method.
HloDataflowAnalysis::UpdateTupleValueSet starts to show up in profiles for while bodies that have many GetTupleElement nodes.
Use a set to keep track of which HloInstructions we need to propagate dataflow analysis for.
PiperOrigin-RevId: 185739365
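The worklist-plus-membership-set idea is a standard propagation pattern: an instruction is queued at most once, no matter how many of its operands changed. A minimal sketch (not the actual HloDataflowAnalysis code; `users` and `update` are hypothetical stand-ins for the HLO use graph and the per-instruction value-set update):

```python
from collections import deque

def propagate(start, users, update):
    """Worklist propagation with a membership set. `start` is the initial
    list of nodes, `users` maps a node to the nodes that consume it, and
    `update(node)` returns True if the node's value set changed and its
    users therefore need revisiting."""
    worklist = deque(start)
    in_worklist = set(start)
    while worklist:
        node = worklist.popleft()
        in_worklist.discard(node)
        if update(node):
            for user in users.get(node, []):
                # The membership set prevents queueing a node twice,
                # even when several of its operands changed.
                if user not in in_worklist:
                    in_worklist.add(user)
                    worklist.append(user)
```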
Merge input values at phi nodes correctly: if a phi operand is the phi itself
and all other operands are the same, then the phi node is redundant.
PiperOrigin-RevId: 181521522
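The redundancy rule can be stated compactly. A sketch, with values modeled as opaque identifiers:

```python
def resolve_phi(phi_id, operands):
    """Phi simplification rule described above: ignoring operands that
    are the phi itself, if all remaining operands agree, the phi is
    redundant and resolves to that single value; otherwise it stays a
    phi."""
    others = {op for op in operands if op != phi_id}
    if len(others) == 1:
        return others.pop()
    return phi_id
```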
PiperOrigin-RevId: 179953488
PiperOrigin-RevId: 179260538
PiperOrigin-RevId: 179258973
Original cl: cl/174423881; rollback cl: cl/174505237.
This roll forward includes the following changes from the original to address various issues uncovered after the rollback:
(1) A fix for a problem with fusion instruction serialization was broken out and submitted separately (cl/176035108).
(2) A dataflow analysis fix was broken out and submitted separately (cl/176035108).
(3) Adding RunBenchmarks to our unit test main was broken out. A fix for an uncovered segv in the while_test_cpu benchmark is pending in cl/176068232.
(4) Moved a cpu-specific copy-insertion pass into its own file, and added tests.
(5) Renamed gpu/copy_insertion.* to gpu/gpu_copy_insertion.* to match the cpu side.
PiperOrigin-RevId: 176658339
internal model.
Automated g4 rollback of changelist 174423881
PiperOrigin-RevId: 174505237
(number of copies inserted) is roughly similar to the existing implementation, but the new implementation is much more general. For example, the new implementation can handle entry argument buffer reuse with minimal modification.
Some unnecessary copies are still added due to deficiencies in buffer assignment (b/62548313), but these can be removed once buffer assignment also uses HloAliasAnalysis.
Also address a few issues uncovered with this cl:
(1) For in-place dynamic slice in the llvm backends, truncate rather than wrap the slice. This matches the behavior of the non-in-place variant.
(2) Disable the SelectBetweenPredTuples test on GPU. The test introduces top-level buffer ambiguity, which is not tolerated by the gpu backend.
(3) When deserializing HLO from a proto, do not uniquify instruction names in fused computations.
(4) In dataflow analysis, don't deallocate deleted HloValues during propagation.
(5) In dataflow analysis, fix an issue with the live_out_of_computation property.
PiperOrigin-RevId: 174423881
and remove any main() definitions in tests. This enables use of flags
in all tests.
PiperOrigin-RevId: 168424796
* Added CompactPointerSet<T>, which is optimized for set size <= 1.
* Changed expensive CHECKs to DCHECKs in buffer_assignment.cc
* Reserve space in DFS state array before starting DFS.
* Use unsigned arithmetic in DFS state maintenance.
* HloInstruction:
- Moved frequently used fields to start for better cache locality.
- Use InlinedVector instead of vector for operand array.
- Use InlinedVector instead of vector for DFS stack.
* Pre-compute "is array" and "is tuple" for LogicalBuffer.
* PointsToSet:
- Combine two ShapeTrees into one.
- Use CompactPointerSet instead of std::set to hold sources.
- Use CompactPointerSet instead of std::set to hold flattened buffers.
* ShapeTree: use unique_ptr instead of optional for shape storage
(reduces size and destruction overhead).
* Add proper const qualifiers to some FlatSet iterator methods.
Co-author=jeff
PiperOrigin-RevId: 165759117
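The idea behind CompactPointerSet&lt;T&gt; is a small-size optimization: keep a single inline slot and fall back to a real set only when a second distinct element appears. A Python sketch of that behavior (the real class is a C++ template; this only mirrors the size &lt;= 1 fast path):

```python
class CompactPointerSet:
    """Toy model of a set optimized for the common |set| <= 1 case.
    One inline slot is used until a second distinct element arrives,
    at which point the set falls back to a built-in set."""

    _EMPTY = object()   # sentinel distinct from any user value

    def __init__(self):
        self._single = self._EMPTY
        self._rest = None

    def insert(self, item):
        if self._rest is not None:
            self._rest.add(item)
        elif self._single is self._EMPTY:
            self._single = item               # first element: stay inline
        elif self._single != item:
            # Second distinct element: spill to a real set.
            self._rest = {self._single, item}
            self._single = self._EMPTY

    def __contains__(self, item):
        if self._rest is not None:
            return item in self._rest
        return self._single is not self._EMPTY and self._single == item

    def __len__(self):
        if self._rest is not None:
            return len(self._rest)
        return 0 if self._single is self._EMPTY else 1
```

The inline path avoids a heap allocation for the overwhelmingly common singleton case, which is the point of the optimization described above.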
This patch removes class xla::LiteralUtil and rewrites every call to use class
xla::Literal instead.
PiperOrigin-RevId: 159446373
instruction uses (eliminates large (typically R3) entry parameter copies that are used read-only in while body computations).
PiperOrigin-RevId: 158855437
PiperOrigin-RevId: 158522608
overriding logic and RecordIndicesToColocatingBuffers:
- When building the ShapeTree of instructions to be copy-overridden, it is
possible that we create a single kCopy for two identical instructions. An
example:
%tuple.19 = tuple(%constant.4, %constant.1793, %constant.1793)
where it is used as a while.init operand, and constant.1793 is read-only within
the loop and also used by another while loop. The copy-overriding pass will then
create the following (logical, not finalized) tuple:
%tuple.19 = tuple(%constant.4, %copy.5, %copy.5)
- In the subsequent pass, RecordAmbiguousOrNonDistinctIndices, which adds copies
to ensure the points-to set is distinct, the duplicate %copy.5 instances are
ignored because they are not yet finalized, and these indices (1 and 2 in the
example) are still marked as to-be-copied. Distinctness is therefore lost.
This fix applies to the override-building stage, explicitly avoiding shared
copies for non-distinct buffers.
PiperOrigin-RevId: 157872231
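The fix amounts to creating one copy per tuple index rather than one copy per distinct operand value. A toy sketch contrasting the two behaviors (function names invented; strings stand in for HLO instructions):

```python
def shared_copies(operands):
    """Buggy behavior described above: one copy per *value*, so identical
    operands end up sharing a copy and distinctness is lost."""
    cache = {}
    out = []
    for op in operands:
        if op not in cache:
            cache[op] = "copy.%d" % len(cache)
        out.append(cache[op])
    return out

def distinct_copies(operands):
    """Fixed behavior: one copy per tuple index, even when the same
    operand appears at several indices."""
    return ["copy.%d" % i for i, _ in enumerate(operands)]
```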
CopyInsertion adds copies to while init operand elements to maintain correctness
for in-place while loops; e.g. if an element is updated in the loop, it must be
copied before entering the loop to avoid corrupting the state of other users of
the same buffer. However, these copies are unnecessary if the element is
read-only in the while body. That is the general idea behind this CL: to remove
copies of read-only elements.
But there are some details. E.g. if any of these read-only elements are entry
parameters, they still must be copied (at least once). The problem here is that
entry parameter buffers are managed by the caller and cannot (currently) share
the same allocation with other buffers. We add an optimization such that if the
same entry parameter is used by multiple while loops in a read-only fashion, it
is only copied once.
Also, the way the original code added the copies was suboptimal. We'd end up
with this type of accordion pattern:
tuple -> (gte, gte, gte) -> tuple
This CL also removes many of the extra gte+tuple ops.
Change: 154082222
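The copy-counting rule above can be sketched as follows; `uses` is a hypothetical flattened view of while-init elements across all loops, not an actual XLA data structure:

```python
def count_copies(uses):
    """Sketch of the rule described above. `uses` is a list of
    (value, read_only, is_entry_param) triples, one per while-init
    element use. A mutated element always needs its own copy; a
    read-only entry parameter is copied exactly once, shared by all
    loops that use it; any other read-only element needs no copy."""
    shared = set()
    copies = 0
    for value, read_only, is_entry in uses:
        if not read_only:
            copies += 1                  # mutated in the loop body
        elif is_entry and value not in shared:
            copies += 1                  # first read-only use of an entry param
            shared.add(value)
    return copies
```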
Change: 153159175
Change: 152823724
std::set is slow and its iteration order is unstable. A couple of other opportunistic changes are included: all called computations of an instruction are consolidated in a single vector, which facilitates fast access to all called computations. Also, AddControlSuccessor/Predecessor are replaced with Add/RemoveControlDependencyTo, which is less error-prone because you can't create a half-connected control edge.
Change: 149810889
The choice of instruction ordering, and the minimization of fragmentation once
we've chosen an order, are two large inter-related factors wrt overall memory
usage. The approach in this CL uses heuristics to do better on both, but
neither problem is completely solved.
To pick a better ordering (the larger factor), the approach is to try the
original list-scheduler based ordering, and to also try a DFS based ordering.
We pick the ordering that yields a smaller minimum memory, computed with the
simulated heap, ignoring fragmentation. Note that this is the absolute minimum
memory for a given ordering.
To minimize fragmentation, the approach is to run a heap simulation on temporary
buffers. We still try to re-use existing allocations when possible, but instead
of creating new allocations for temp buffers, we collect all the leftovers and
use a heap to pack them. The heap algorithm that gave the best results is "lazy
best-fit"; a variant of traditional best-fit that sometimes delays offset
assignment until Free is called, in the hopes of yielding larger free chunks.
Here are some measurements of the temp buffer sizes for GNMT encoder training (a
stacked LSTM). Lower is better. I've tried various combinations of instruction
ordering and heap simulation, to show the joint impact of these two factors.
List-scheduler order, no heap simulation 33.33GiB
List-scheduler order, with heap simulation 25.09GiB
Minimized DFS order, no heap simulation 16.59GiB
Arbitrary DFS order, no heap simulation 15.05GiB (old)
Arbitrary DFS order, with heap simulation 12.57GiB
Minimized DFS order, with heap simulation 11.71GiB (new)
Note that the original list scheduler order is much worse than DFS on stacked
LSTMs, but (not shown here) is much better than DFS on convolutions like
Inception. Also note that heap simulation packs things tighter for all
instruction orders in this example, but to varying degrees.
Change: 149049028
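The heap-simulation step can be illustrated with a toy best-fit packer. This is plain best-fit over alloc/free events, not the "lazy best-fit" variant described above (which additionally defers some offset choices until Free), and not the actual XLA heap simulator:

```python
def simulate_heap(events):
    """Toy best-fit heap simulation. `events` is a list of
    ('alloc', id, size) / ('free', id) tuples in program order.
    Returns (offsets, heap_size): each allocation is placed at the
    smallest free gap that fits, else appended at the end."""
    free = []                        # (offset, size) gaps, sorted by offset
    offsets, sizes = {}, {}
    end = 0
    for ev in events:
        if ev[0] == "alloc":
            _, bid, size = ev
            best = None
            for i, (off, gap) in enumerate(free):
                if gap >= size and (best is None or gap < free[best][1]):
                    best = i             # smallest gap that fits
            if best is not None:
                off, gap = free.pop(best)
                if gap > size:
                    free.append((off + size, gap - size))
                    free.sort()
                offsets[bid] = off
            else:
                offsets[bid] = end       # no gap fits: grow the heap
                end += size
            sizes[bid] = size
        else:
            _, bid = ev
            free.append((offsets[bid], sizes[bid]))
            free.sort()
            merged = []                  # coalesce adjacent gaps
            for off, size in free:
                if merged and merged[-1][0] + merged[-1][1] == off:
                    merged[-1][1] += size
                else:
                    merged.append([off, size])
            free = [(o, s) for o, s in merged]
    return offsets, end
```

Packing leftovers this way is what shrinks the "with heap simulation" rows in the table above relative to assigning each temp buffer its own allocation.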
XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators.
XLA is still experimental; we are releasing it early to get the community involved.
Change: 143990941