aboutsummaryrefslogtreecommitdiffhomepage
path: root/tensorflow/compiler/xla/service/copy_insertion_test.cc
Commit message (Collapse)AuthorAge
* Automated rollback of commit 375c109659d2d0e6265447dffdeb460693b3cccfGravatar A. Unique TensorFlower2018-10-09
| | | | PiperOrigin-RevId: 216350134
* [XLA] Introduce input/output alias config.Gravatar Yunxing Dai2018-10-08
| | | | | | | | - This CL intruduces input/output alias config in HLO module that allows any HLO pass to configure it. Once the alias_config is set, each backend needs to follow the contract during execution time to make sure the input and output are indeed aliased. - Copy insertion / buffer assignment and alias analysis has been updated to correctly honor the config and avoid any possible liveness interference. PiperOrigin-RevId: 216299501
* Elide copies for which the source and destination buffer is the same.Gravatar Mark Heffernan2018-08-13
| | | | | | During copy elision, copies which were previously added during copy insertion might end up with the same source and destination buffer. Previous logic did not remove them. This CL handles this corner case. PiperOrigin-RevId: 208509171
* [TF:XLA] Split literal_util into {literal, literal_util}.Gravatar Kay Zhu2018-07-03
| | | | | | | | | Currently Literal classes sits in literal_util.{h,cc} instead of literal.{h,cc}. It also contains helper functions that are better fit to be their own separate class/namespace. This change starts this process by moving most static factory methods to LiteralUtil namespace. PiperOrigin-RevId: 203217065
* [TF:XLA] Split select HLO into array- and tuple-select.Gravatar A. Unique TensorFlower2018-07-03
| | | | | | | | | | | | Array select and tuple-select already are handled separately in all backends and HLO passes: Array select is an elementwise operation. The shapes of the to operands have the same dimensions. Tuple select does not define its own output, but instead forwards the true- or false- operand based on a scalar predicate operand. This CL reflects this by adding a new kTupleSelect HLO. The XLA builder interface stays the same and dispatches based on the operand shapes. No change in the operation semantics. This CL just splits the existing select operation into two opcodes and preserves the existing semantics. HLO cost analysis is fixed to handle the two ops appropriately. PiperOrigin-RevId: 203180342
* [TF:XLA] Copy elision does not need to know about existing copies.Gravatar A. Unique TensorFlower2018-06-28
| | | | | | | | It already detects layout-changing copies and those are already left unchanged by copy elision. Special case copies are also skipped because they are tagged separately (SetCopyElisionAllowed) PiperOrigin-RevId: 202574858
* Rename HLO opcode kGenerateToken to kAfterAll.Gravatar Mark Heffernan2018-06-25
| | | | | | | | Long term I think we want to require kAfterAll to take at least one token as operand so it cannot generate a token out of thin air, so kGenerateToken is no longer an appropriate name. Instead, a primordial token would be supplied some how in the entry computation, perhaps as a parameter, and then threaded to any side-effecting ops. NFC. PiperOrigin-RevId: 202079040
* Add support for TOKEN type to CPU/GPU backends.Gravatar Mark Heffernan2018-06-14
| | | | | | | | TOKENs will be used for ordering side-effecting operations. They are not materialized but can be contained in tuples and flow into and out of computations. This CL adds a trivial representation for the cpu and gpu backends to support TOKENs and modifies copy insertion to avoid making copies of tokens. This also adds a Literal TOKEN which is required for the interpreter backend. PiperOrigin-RevId: 200623120
* [XLA] Redesign: delete versioned_computation_handle and compilation_cache.Gravatar A. Unique TensorFlower2018-06-07
| | | | PiperOrigin-RevId: 199673573
* [XLA] Add reproducer that shows perf issues in ↵Gravatar Nick Desaulniers2018-02-14
| | | | | | | | | | HloDataflowAnalysis::UpdateTupleValueSet, then optimize that method. HloDataflowAnalysis::UpdateTupleValueSet starts to show up in profiles for while bodies that have many GetTupleElement nodes. Use a set to keep track of which HloInstructions we need to propagate DFA for. PiperOrigin-RevId: 185739365
* [TF:XLA] Fix infinite loop in HLO data flow analysis.Gravatar A. Unique TensorFlower2018-01-10
| | | | | | | Merge input values at phi nodes correctly: If a phi operand is the phi itself, and the other operands are all the same, then the phi node is redundant. PiperOrigin-RevId: 181521522
* Merge changes from github.Gravatar A. Unique TensorFlower2017-12-22
| | | | PiperOrigin-RevId: 179953488
* Automated g4 rollback of changelist 179258973Gravatar A. Unique TensorFlower2017-12-15
| | | | PiperOrigin-RevId: 179260538
* Merge changes from github.Gravatar Dandelion Man?2017-12-15
| | | | PiperOrigin-RevId: 179258973
* Roll forward new copy insertion pass.Gravatar Mark Heffernan2017-11-22
| | | | | | | | | | | | | | | | | | Original cl: cl/174423881, rollback cl: cl/174505237. This roll forward includes the following changes from the original to address various issues uncovered with the rollback: (1) A fix for a problem with fusion instruction serialization was broken out and submitted separately (cl/176035108). (2) A dataflow analysis fix was broken out and submitted separately (cl/176035108) (3) Adding RunBenchmarks to our unit test main was broken out. Fix for uncovered segv in while_test_cpu benchmark in pending cl/176068232. (4) Moved a cpu-specific copy-insertion pass into it's own file, and added tests. (5) Renamed gpu/copy_insertion.* to gpu/gpu_copy_insertion.* to match cpu side. PiperOrigin-RevId: 176658339
* Rollback copy insertion change because it results in a DCHECK with an ↵Gravatar Mark Heffernan2017-11-03
| | | | | | | | | | | internal model. END_PUBLIC BEGIN_PUBLIC Automated g4 rollback of changelist 174423881 PiperOrigin-RevId: 174505237
* Rewrite CopyInsertion to use module-scoped HloAliasAnalysis. The net effect ↵Gravatar Mark Heffernan2017-11-02
| | | | | | | | | | | | | | | | | | | | (number of copies inserted) is roughly similar to the existing implementation, but the new implementation is much more general. The new implementation can handle entry argument buffer reuse with minimal modification, for example. Some unnecessary copies are still added due to deficiencies in buffer assignment (b/62548313), but these can be removed when buffer assignment also uses HloAliasAnalysis. Also address a few issues uncovered with this cl: (1) For inplace dynamic slice in llvm backends, truncate do not wrap the slice. This matches the behavior of the non-inplace variant. (2) Disable SelectBetweenPredTuples test on GPU. The test introduces top-level buffer ambiguity which is not tolerated by the gpu backend. (3) When deserializing HLO form a proto, do not uniquify instruction names in fused computations. (4) In dataflow analysis, don't deallocate deleted HloValues during propagation. (5) In dataflow analysis, fix issue with live_out_of_computation property. PiperOrigin-RevId: 174423881
* Use xla/tests:xla_internal_test_main for all tests under tf/compiler/xlaGravatar Mark Heffernan2017-09-12
| | | | | | | and remove any main() definitions in tests. This enables use of flags in all tests. PiperOrigin-RevId: 168424796
* Reduce XLA compile time by ~7% for a convolutional image model:Gravatar A. Unique TensorFlower2017-08-18
| | | | | | | | | | | | | | | | | | | | | | * Added CompactPointerSet<T>, which is optimized for set size <= 1. * Changed expensive CHECKs to DCHECKS in buffer_assignment.cc * Reserve space in DFS state array before starting DFS. * Use unsigned arithmetic in DFS state maintenance. * HloInstruction: - Moved frequently used fields to start for better cache locality. - Use InlinedVector instead of vector for operand array. - Use InlinedVector instead of vector for DFS stack. * Pre-compute "is array" and "is tuple" for LogicalBuffer. * PointsToSet: - Combine two ShapeTrees into one. - Use CompactPointerSet instead of std::set to hold sources. - Use CompactPointerSet instead of std::set to hold flattened buffers. * ShapeTree: use unique_ptr instead of optional for shape storage (reduces size and destruction overhead). * Add proper const qualifiers to some FlatSet iterator methods. Co-author=jeff PiperOrigin-RevId: 165759117
* Remove class xla::LiteralUtil. NFC (mind-numbingly so).Gravatar A. Unique TensorFlower2017-06-19
| | | | | | This patch removes class xla::LiteralUtil and rewrites every call to use class xla::Literal instead. PiperOrigin-RevId: 159446373
* [XLA:HLO] Allow entry parameter buffer aliasing for read-only while ↵Gravatar A. Unique TensorFlower2017-06-13
| | | | | | instruction uses (eliminates large (typically R3) entry parameter copies which are used as read-only in while body computations). PiperOrigin-RevId: 158855437
* [XLA] Switch HloTestBase-based tests to use new debug options flag.Gravatar Eli Bendersky2017-06-09
| | | | PiperOrigin-RevId: 158522608
* [XLA] Fix a subtle issue in copy_insertion due the interaction between copyGravatar Kay Zhu2017-06-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | overriding logic and RecordIndicesToColocatingBuffers: - When building instructions ShapeTree to be copy overriden, it is possible that we create a single kCopy for two identical instructions. An example can be: %tuple.19 = tuple(%constant.4, %constant.1793, %constant.1793) where it is used in a while.init operand, and constant.1793 is read-only within the loop and also used by another while loop. The copy overriding pass will then create the following (logical, not finalized) tuple: %tuple.19 = tuple(%constant.4, %copy.5, %copy.5) - In the subsequent pass RecordAmbiguousOrNonDistinctIndices, to add copies to ensure point_to set is distinct, the duplicate %copy.5 are ignored because they are not yet finalized, and these indices (1 and 2 in the example) are still marked as to-be copied. Therefore distinctiveness is lost. This fix applies to the override building stage, to explicitly avoid creating shared copies for non-distinct buffers. PiperOrigin-RevId: 157872231
* [XLA:HLO] Reduce copy and get-tuple-element instructions added by CopyInsertion.Gravatar A. Unique TensorFlower2017-04-24
| | | | | | | | | | | | | | | | | | | | | | CopyInsertion adds copies to while init operand elements to maintain correctness for in-place while loops; e.g. if an element is updated in the loop, it must be copied before entering the loop to avoid corrupting the state of other users of the same buffer. However these copies are unnecessary if the element is read-only in the while body. That is the general idea behind this CL; to remove copies of read-only elements. But there are some details. E.g. if any of these read-only elements are entry parameters, they still must be copied (at least once). The problem here is that entry parameter buffers are managed by the caller, and cannot (currently) share the same allocation with other buffers. We add an optimization such that if the same entry parameter is used by multiple while loops in a read-only fashion, it is only copied once. Also, the way the original code was adding the copies was sub-optimal. We'd end up with this type of accordian pattern: tuple -> (gte, gte, gte) -> tuple This CL also removes many of the extra gte+tuple ops. Change: 154082222
* Added and using GMock matcher for HloInstructionGravatar A. Unique TensorFlower2017-04-14
| | | | Change: 153159175
* Using GMock matchers in XLA tests.Gravatar A. Unique TensorFlower2017-04-11
| | | | Change: 152823724
* [XLA] Replace uses of std::set with std::vector.Gravatar Mark Heffernan2017-03-10
| | | | | std::set is slow and the iteration order is unstable. A couple other opportunistic changes include consolidating all called computations of an instruction in a single vector. This faciliates fast access to all called computations. Also, replace AddControlSuccessor/Predecessor with Add/RemoveControlDepedencyTo which is less error prone as you can't create a half connected control edge. Change: 149810889
* [TF:XLA] Reduce sequential memory usage via better ordering and simulated heap.Gravatar A. Unique TensorFlower2017-03-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The choice of instruction ordering, and the minimization of fragmentation once we've chosen an order, are two large inter-related factors wrt overall memory usage. The approach in this CL uses heuristics to do better on both, but neither problem is completely solved. To pick a better an ordering (the larger factor), the approach is to try the original list-scheduler based ordering, and to also try a DFS based ordering. We pick the ordering that yields a smaller minimum memory, computed with the simulated heap, ignoring fragmentation. Note that this is the absolute minimum memory for a given ordering. To minimize fragmentation, the approach is to run a heap simulation on temporary buffers. We still try to re-use existing allocations when possible, but instead of creating new allocations for temp buffers, we collect all the leftovers and use a heap to pack them. The heap algorithm that gave the best results is "lazy best-fit"; a variant of traditional best-fit that sometimes delays offset assignment until Free is called, in the hopes of yielding larger free chunks. Here's some measurements of the temp buffer sizes for GNMT encoder training (a stacked LSTM). Lower is better. I've tried various combinations of instruction ordering and heap simulation, to show the joint impact of these two factors. List-scheduler order, no heap simulation 33.33GiB List-scheduler order, with heap simulation 25.09GiB Minimized DFS order, no heap simulation 16.59GiB Arbitrary DFS order, no heap simulation 15.05GiB (old) Arbitrary DFS order, with heap simulation 12.57GiB Minimized DFS order, with heap simulation 11.71GiB (new) Note that the original list scheduler order is much worse than DFS on stacked LSTMs, but (not shown here) is much better than DFS on convolutions like Inception. Also note that heap simulation packs things tighter for all instruction orders in this example, but to varying degrees. Change: 149049028
* Initial open-source release of XLA: Accelerated Linear Algebra.Gravatar Peter Hawkins2017-01-09
XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators. XLA is still experimental; we are releasing it early to get the community involved. Change: 143990941