path: root/tensorflow/compiler/xla/service/cpu/ir_emitter.h
* [XLA] Migrate from gtl::FlatMap to absl::flat_hash_map (Benjamin Kramer, 2018-10-01)
  PiperOrigin-RevId: 215272497
* Implement sort op for CPU. (Adrian Kuegel, 2018-09-19)
  Also don't allow parallelization for the sort op in parallel_task_assignment.
  PiperOrigin-RevId: 213592046
* Rollforward of cl/211656888 after fixing failing unit test. (Mark Heffernan, 2018-09-05)
  *** Original change description ***
  Add HloSchedule class representing a sequential order of an HloModule.
  Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class which results in better encap...
  ***
  PiperOrigin-RevId: 211726890
* BEGIN_PUBLIC (Mark Heffernan, 2018-09-05)
  Automated rollback of commit 7fa693209fe238478739b3982f652a7e35be91f3
  PiperOrigin-RevId: 211681957
* Add HloSchedule class representing a sequential order of an HloModule. (Mark Heffernan, 2018-09-05)
  Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class which results in better encapsulation of code which deals with schedules and better enforcement of invariants.
  This CL also fixes a corner-case bug in dataflow analysis, where values of instructions which are live out of the computation erroneously did not interfere with the values of instructions scheduled after the root instruction.
  PiperOrigin-RevId: 211656888
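The motivation for replacing a bare map with a class can be sketched in plain C++. This is a loose, hypothetical analogy (the class, method names, and use of instruction names as strings are all invented for illustration, not XLA's actual HloSchedule API): a class can enforce invariants, such as duplicate-free sequences, that a bare map cannot.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: a schedule that owns a total order of instruction
// names per computation, instead of exposing a bare map.
class Schedule {
 public:
  // Replaces the sequence for one computation.
  void SetSequence(const std::string& computation,
                   std::vector<std::string> instruction_order) {
    sequences_[computation] = std::move(instruction_order);
  }

  const std::vector<std::string>& Sequence(const std::string& computation) const {
    return sequences_.at(computation);
  }

  // An invariant check a bare map cannot enforce: every scheduled
  // sequence must be duplicate-free.
  bool Verify() const {
    for (const auto& [comp, order] : sequences_) {
      std::vector<std::string> seen;
      for (const auto& name : order) {
        for (const auto& s : seen) {
          if (s == name) return false;
        }
        seen.push_back(name);
      }
    }
    return true;
  }

 private:
  std::unordered_map<std::string, std::vector<std::string>> sequences_;
};
```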
* [XLA:CPU] Don't use "temps" to refer to the table of buffer allocations (Sanjoy Das, 2018-08-31)
  Instead, call it the "buffer table"; it now contains both entry computation parameters and temporaries.
  PiperOrigin-RevId: 211171651
* CHECK that the thread locality of the call matches thread locality of the callee (Sanjoy Das, 2018-08-31)
  PiperOrigin-RevId: 211162384
* Change headers to directly include absl::Span, and clean up the build dependencies as well (Tim Shen, 2018-08-30)
  PiperOrigin-RevId: 211038094
* [XLA] Rename all (Mutable)ArraySlice to absl::Span. (Tim Shen, 2018-08-30)
  PiperOrigin-RevId: 210998142
* [XLA] Implement kIota for CPU & GPU, extend it w/ broadcast semantics (David Majnemer, 2018-08-28)
  This extends the Iota HLO to have a broadcast field. This allows for higher-rank kIota operations.
  PiperOrigin-RevId: 210600435
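The semantics of iota with a broadcast dimension can be sketched for the rank-2 case: each element equals its index along the chosen iota dimension, broadcast over the other dimension. A minimal sketch, assuming row-major layout (the function name is illustrative, not XLA's emitter code):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of iota with broadcast semantics: fill a rank-2
// row-major array so each element equals its index along `iota_dimension`,
// broadcast over the other dimension.
std::vector<int64_t> Iota2D(int64_t rows, int64_t cols, int iota_dimension) {
  std::vector<int64_t> out(rows * cols);
  for (int64_t r = 0; r < rows; ++r) {
    for (int64_t c = 0; c < cols; ++c) {
      out[r * cols + c] = (iota_dimension == 0) ? r : c;
    }
  }
  return out;
}
```

With `iota_dimension == 1` this reproduces the classic rank-1 iota in every row; with `iota_dimension == 0` every column counts upward instead.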
* Use a mixin to reduce llvm::IRBuilder<> related boilerplate. (Sanjoy Das, 2018-08-27)
  PiperOrigin-RevId: 210472260
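The general mixin technique referenced here can be sketched with CRTP in plain C++. Everything below is a stand-in: `FakeBuilder` only mimics the shape of an IRBuilder-like API, and the mixin is not XLA's actual class; it just shows how forwarding methods remove the repeated `builder()->Create...` boilerplate at call sites:

```cpp
#include <string>

// Hypothetical stand-in for llvm::IRBuilder<>; the mixin pattern is what
// this sketches, not LLVM's real API.
struct FakeBuilder {
  std::string log;
  int CreateAdd(int a, int b) { log += "add;"; return a + b; }
  int CreateMul(int a, int b) { log += "mul;"; return a * b; }
};

// CRTP mixin: forwards calls so emitters can write Add(x, y) instead of
// builder()->CreateAdd(x, y) everywhere. Derived must define builder().
template <typename Derived>
class BuilderMixin {
 public:
  int Add(int a, int b) { return derived()->builder()->CreateAdd(a, b); }
  int Mul(int a, int b) { return derived()->builder()->CreateMul(a, b); }

 private:
  Derived* derived() { return static_cast<Derived*>(this); }
};

class Emitter : public BuilderMixin<Emitter> {
 public:
  explicit Emitter(FakeBuilder* b) : b_(b) {}
  FakeBuilder* builder() { return b_; }
  // Emits a*x + b with no direct mention of the builder at the call site.
  int EmitAxPlusB(int a, int x, int b) { return Add(Mul(a, x), b); }

 private:
  FakeBuilder* b_;
};
```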
* [XLA] Use absl string types and functions instead of the TF versions. (Justin Lebar, 2018-08-23)
  Unfortunately this has to be one big patch, because e.g. absl::StrCat doesn't accept a TF StringPiece, but as soon as we switch to absl::string_view, we have to switch away from all of the TF functions.
  PiperOrigin-RevId: 209957896
* [XLA] Add Scatter HLO. (A. Unique TensorFlower, 2018-08-01)
  PiperOrigin-RevId: 207045468
* Reland "Overhaul XLA:CPU's calling convention." (Sanjoy Das, 2018-08-01)
  aligned_buffer_bytes in compiler/aot/runtime.cc was checking sizes[i] == -1 (as opposed to checking sizes[i] < 0) to decide whether sizes[i] should count towards the total size.
  Original CL description:
  Overhaul XLA:CPU's calling convention. This CL introduces a clean separation between calls to "thread local" and "global" computations in XLA:CPU.
  Global computations are:
  - kWhile body and condition computations
  - kConditional true and false computations
  - kCall callees
  Parameter and result buffers for these calls are assigned a static BufferAllocation::Slice by buffer assignment, so they don't require pointers to result buffers and parameters to be explicitly passed in. In fact, passing in result and parameter buffers is actively misleading, because in cases like:
    while_condition {
      val = (s32[], pred[]) infeed()
      ROOT result = get-tuple-element(val), index=0
    }
  there is no instruction explicitly copying the result of the computation into the result buffer. Instead, it is up to the caller to pick up the correct result buffer by asking buffer assignment (which would be the buffer where infeed wrote its second tuple component).
  Thread local computations are all the other nested computations except fusion, e.g. computations used by kMap and kReduce. Parameter and result buffers for these calls are assigned a "thread local" BufferAllocation::Slice, which in XLA:CPU are mapped to allocas. Since these are not static addresses, we *do* need to pass in parameter and result buffers. The output is written to the result buffer by "allocating" the storage for the root into the result buffer passed in by the caller.
  There are two cleanup items that I kept off this CL to make reviews easier:
  - We should rename "temps" to something more generic, like "buffer_table". I'll do that in a followup CL.
  - We should use GatherComputationsByAllocationType from buffer_assignment.cc to CHECK that we use thread local calls for thread local callees and global calls for global callees.
  PiperOrigin-RevId: 206980796
* Automated rollback of commit fba2d773f45f10882aa475ac75cbf9884995d626 (Sanjoy Das, 2018-07-31)
  PiperOrigin-RevId: 206855848
* Overhaul XLA:CPU's calling convention. (Sanjoy Das, 2018-07-31)
  This CL introduces a clean separation between calls to "thread local" and "global" computations in XLA:CPU.
  Global computations are:
  - kWhile body and condition computations
  - kConditional true and false computations
  - kCall callees
  Parameter and result buffers for these calls are assigned a static BufferAllocation::Slice by buffer assignment, so they don't require pointers to result buffers and parameters to be explicitly passed in. In fact, passing in result and parameter buffers is actively misleading, because in cases like:
    while_condition {
      val = (s32[], pred[]) infeed()
      ROOT result = get-tuple-element(val), index=0
    }
  there is no instruction explicitly copying the result of the computation into the result buffer. Instead, it is up to the caller to pick up the correct result buffer by asking buffer assignment (which would be the buffer where infeed wrote its second tuple component).
  Thread local computations are all the other nested computations except fusion, e.g. computations used by kMap and kReduce. Parameter and result buffers for these calls are assigned a "thread local" BufferAllocation::Slice, which in XLA:CPU are mapped to allocas. Since these are not static addresses, we *do* need to pass in parameter and result buffers. The output is written to the result buffer by "allocating" the storage for the root into the result buffer passed in by the caller.
  There are two cleanup items that I kept off this CL to make reviews easier:
  - We should rename "temps" to something more generic, like "buffer_table". I'll do that in a followup CL.
  - We should use GatherComputationsByAllocationType from buffer_assignment.cc to CHECK that we use thread local calls for thread local callees and global calls for global callees.
  PiperOrigin-RevId: 206843794
* Use constant buffer allocations for XLA:CPU (Sanjoy Das, 2018-07-27)
  This is simpler than the corresponding change to XLA:GPU because on XLA:CPU all instructions are codegened, so we can always embed a pointer to the constant global variable directly in the generated LLVM IR.
  PiperOrigin-RevId: 206363887
* [XLA:CPU/GPU] Implement the parallel Philox random number generation algorithm. (Bixia Zheng, 2018-07-25)
  Implement the RNG elemental IR generator using the Philox algorithm. To ensure multiple executions of the same RNG HLO instruction rarely produce the same result, we increment a global variable with the number of random numbers generated by the RNG HLO each time the HLO is executed, and use the value of the global variable to construct the seed for the RNG algorithm.
  Modify the GPU backend to generate a parallel loop to execute the Philox algorithm. The CPU backend still uses a sequential loop to perform Philox random number generation, and we will need to enhance the ParallelTaskAssignment pass to change this.
  Remove the old PCG RNG algorithm for the CPU and GPU backends.
  PiperOrigin-RevId: 206069733
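The counter-based scheme described here (a global offset advanced by the number of values each execution draws, with each output depending only on seed plus counter) can be sketched in plain C++. This sketch uses splitmix64 as a stand-in mixing function instead of real Philox, and the names are invented for illustration:

```cpp
#include <cstdint>
#include <vector>

// splitmix64: a simple bijective mixing function used here as a stand-in
// for Philox; each output depends only on (seed + counter).
uint64_t Mix(uint64_t x) {
  x += 0x9E3779B97F4A7C15ULL;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
  return x ^ (x >> 31);
}

// Global offset advanced by the number of values each execution draws, so
// repeated executions of the same RNG op rarely repeat results.
uint64_t g_rng_offset = 0;

std::vector<uint64_t> RunRngOp(uint64_t seed, int64_t n) {
  std::vector<uint64_t> out(n);
  // Counter-based: each element's value depends only on its own counter,
  // so this loop could run in parallel (as the GPU backend's loop does).
  for (int64_t i = 0; i < n; ++i) {
    out[i] = Mix(seed + g_rng_offset + static_cast<uint64_t>(i));
  }
  g_rng_offset += static_cast<uint64_t>(n);
  return out;
}
```

Because the mixing function is a bijection, two executions with different offsets are guaranteed to produce different streams, while a fixed (seed, offset) pair is fully deterministic.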
* [XLA] s/ir_builder/b/ (Justin Lebar, 2018-07-20)
  Brevity.
  PiperOrigin-RevId: 205454869
* Start implementation of Iota HLO. (Nick Desaulniers, 2018-07-20)
  PiperOrigin-RevId: 205447892
* Avoid huge lambda functions for EmitTargetElementLoop bodies; NFC (Sanjoy Das, 2018-07-10)
  PiperOrigin-RevId: 204042666
* [TF:XLA] Split select HLO into array- and tuple-select. (A. Unique TensorFlower, 2018-07-03)
  Array select and tuple select are already handled separately in all backends and HLO passes:
  - Array select is an elementwise operation. The shapes of the two operands have the same dimensions.
  - Tuple select does not define its own output, but instead forwards the true- or false-operand based on a scalar predicate operand.
  This CL reflects this by adding a new kTupleSelect HLO. The XLA builder interface stays the same and dispatches based on the operand shapes. No change in the operation semantics; this CL just splits the existing select operation into two opcodes and preserves the existing semantics. HLO cost analysis is fixed to handle the two ops appropriately.
  PiperOrigin-RevId: 203180342
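The distinction between the two select flavors can be sketched in a few lines of C++ (function names are illustrative, not XLA's API): array select produces a new elementwise result, while tuple select merely forwards one of its operands.

```cpp
#include <cstddef>
#include <vector>

// Array select: elementwise; predicate and both operands share dimensions,
// and a new output array is defined.
std::vector<float> SelectArray(const std::vector<bool>& pred,
                               const std::vector<float>& on_true,
                               const std::vector<float>& on_false) {
  std::vector<float> out(pred.size());
  for (size_t i = 0; i < pred.size(); ++i) {
    out[i] = pred[i] ? on_true[i] : on_false[i];
  }
  return out;
}

// Tuple select: a scalar predicate forwards one operand wholesale; no new
// output is defined elementwise.
const std::vector<float>& SelectTuple(bool pred,
                                      const std::vector<float>& on_true,
                                      const std::vector<float>& on_false) {
  return pred ? on_true : on_false;
}
```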
* Delete ExternalConstantPool. (Adrian Kuegel, 2018-06-26)
  PiperOrigin-RevId: 202090038
* Rename HLO opcode kGenerateToken to kAfterAll. (Mark Heffernan, 2018-06-25)
  Long term I think we want to require kAfterAll to take at least one token as operand so it cannot generate a token out of thin air, so kGenerateToken is no longer an appropriate name. Instead, a primordial token would be supplied somehow in the entry computation, perhaps as a parameter, and then threaded to any side-effecting ops. NFC.
  PiperOrigin-RevId: 202079040
* Add support for TOKEN type to CPU/GPU backends. (Mark Heffernan, 2018-06-14)
  TOKENs will be used for ordering side-effecting operations. They are not materialized but can be contained in tuples and flow into and out of computations. This CL adds a trivial representation for the CPU and GPU backends to support TOKENs and modifies copy insertion to avoid making copies of tokens. This also adds a Literal TOKEN, which is required for the interpreter backend.
  PiperOrigin-RevId: 200623120
* Use ConstantDataArray to lower arrays of constants. (A. Unique TensorFlower, 2018-06-01)
  For large constants, creating an llvm::Constant for each element can lead to prohibitively long compile times.
  PiperOrigin-RevId: 198843141
* Don't call into Eigen unless the input and output tensors are aligned (Sanjoy Das, 2018-05-09)
  We teach TargetMachineFeatures about the alignment required for Eigen GEMM and Conv, and then pipe TargetMachineFeatures through the places that need to decide whether a dot or a conv needs to be lowered to a call to Eigen. I also had to fix a minor bug in our LLVM IR implementation for convolution.
  PiperOrigin-RevId: 196065557
* [XLA:CPU] Re-use the same llvm::GlobalVariable for identical literals (Sanjoy Das, 2018-05-01)
  This isn't necessary today, but it will be after an optimization change I'm about to make. LLVM has a constant merging pass too, but one of the motivations here is to avoid the LLVM compile-time overhead of having many large arrays in the IR.
  PiperOrigin-RevId: 195032900
* Remove the parallel cpu backend (Sanjoy Das, 2018-04-25)
  PiperOrigin-RevId: 194299356
* [BatchNorm] Remove CPU implementation (Yunxing Dai, 2018-01-24)
  We now use the batchnorm rewriter (tensorflow/compiler/xla/service/batchnorm_rewriter.h) to expand batch norm into smaller ops. A specific implementation should not be needed anymore (for CPU).
  RELNOTES: n/a
  PiperOrigin-RevId: 183117252
* [XLA:CPU] Count cycles in non-entry computations (Sanjoy Das, 2018-01-03)
  This change teaches XLA to maintain cycle counters specifically for non-entry computations, like computations representing the body of a While. Without this change, instructions in such non-entry computations are noted as taking 0.00% of their execution time, which isn't ideal.
  Implementation-wise, this just falls out of uniformly using a std::unordered_map<T, int64> for both the HloInstruction->ProfileIndex and the HloComputation->ProfileIndex mappings.
  PiperOrigin-RevId: 180750463
* [XLA:CPU] Cleanups to VectorSupportLibrary, TargetMachineFeatures and DotOpEmitter (Sanjoy Das, 2018-01-03)
  - Move VectorSupportLibrary to under service/cpu since it is specific to the CPU backend.
  - Use TargetMachineFeatures to infer the vector width in DotOpEmitter.
  - Move the kAvxVectorSize magic constant into TargetMachineFeatures.
  PiperOrigin-RevId: 180740693
* Automated g4 rollback of changelist 180622078 (Sanjoy Das, 2018-01-02)
  PiperOrigin-RevId: 180628481
* [XLA:CPU] Cleanups to VectorSupportLibrary, TargetMachineFeatures and DotOpEmitter (Sanjoy Das, 2018-01-02)
  - Move VectorSupportLibrary to under service/cpu since it is specific to the CPU backend.
  - Use TargetMachineFeatures to infer the vector width in DotOpEmitter.
  - Move the kAvxVectorSize magic constant into TargetMachineFeatures.
  PiperOrigin-RevId: 180622078
* Automated g4 rollback of changelist 180000981 (A. Unique TensorFlower, 2018-01-02)
  PiperOrigin-RevId: 180581912
* Automated g4 rollback of changelist 179983419 (A. Unique TensorFlower, 2017-12-23)
  PiperOrigin-RevId: 180000981
* Adds FFT for XLA: CPU via Eigen, GPU via cuFFT. (A. Unique TensorFlower, 2017-12-22)
  GPU support includes plan reuse with a new scratch allocator per execution in fft_thunk.
  PiperOrigin-RevId: 179983419
* [XLA:CPU] Use LLVM's TargetTransformInfo in TargetMachineFeatures (Sanjoy Das, 2017-12-15)
  I'll add more uses of TargetMachineFeatures in subsequent CLs.
  PiperOrigin-RevId: 179211454
* [XLA] Implement Conditional in XLA service, client ComputationBuilder, and CPU backend. (A. Unique TensorFlower, 2017-12-07)
  PiperOrigin-RevId: 178322445
* [XLA:CPU] Factor out parallel function call logic into IrFunction (so it can be called from other emitters). (A. Unique TensorFlower, 2017-12-06)
  Just code movement (no functional change).
  PiperOrigin-RevId: 178158853
* [XLA:CPU] Factor out parallel loop emission into its own file so it can be called by other emitters. (A. Unique TensorFlower, 2017-11-29)
  No functional change, just code movement.
  PiperOrigin-RevId: 177317764
* [XLA:CPU] Factor IR function building logic out of IrEmitter into its own file. (A. Unique TensorFlower, 2017-11-29)
  No functional changes, just code movement. This will enable building parallel IR functions from other emitters, and remove the requirement that parallel IR functions are associated with a sub-computation.
  PiperOrigin-RevId: 177309875
* [XLA:CPU] Add an explicit code path for the entry computation cycle count (Sanjoy Das, 2017-11-16)
  While this does change the profile counter entry for the entry computation during AOT compiles (earlier it would always be some non-null llvm::Value, but now it can be null), it does not change any observable behavior, since RecordCompleteComputation is a no-op for an empty hlo_to_profile_idx_ map.
  PiperOrigin-RevId: 176022629
* Make CPU's IrEmitter::hlo_to_profile_idx_ a value (Sanjoy Das, 2017-11-16)
  I think the performance advantages of keeping it as a maybe-null pointer are minimal, and it instead complicates the signature generation code. For example, the code to generate calls to __xla_cpu_runtime_ParallelForkJoin is buggy when hlo_to_profile_idx_ is nullptr today. This bug isn't visible today because we always have hlo_to_profile_idx_ as nullptr in JIT mode, and in AOT mode we don't parallelize HLO operations.
  PiperOrigin-RevId: 175993645
* Change for asynchronous Send and Recv by splitting Send into {Send, SendDone} and Recv into {Recv, RecvDone}. (HyoukJoong Lee, 2017-11-10)
  See operation_semantics.md for the updated semantics.
  PiperOrigin-RevId: 175216012
* [TF:XLA] Reduce boilerplate code in HLO visitors. (A. Unique TensorFlower, 2017-10-30)
  Only pass the HloInstruction into visitor methods. This makes changing instructions and visitors easier.
  PiperOrigin-RevId: 173983398
* Automated g4 rollback of changelist 171877766 (A. Unique TensorFlower, 2017-10-16)
  PiperOrigin-RevId: 172325692
* Automated g4 rollback of changelist 171877766 (Anna R, 2017-10-11)
  PiperOrigin-RevId: 171915087
* [XLA:CPU] Adds intra-op parallelism to the "sequential" CPU backend (which already has intra-op parallelism for library calls). (A. Unique TensorFlower, 2017-10-11)
  Adds support for parallel task assignment to instructions in entry (or embedded) computations. Adds code to emit calls to a new runtime parallel fork/join function for instructions which have been assigned parallel tasks. Adds a simple cost model for I/O bound instructions.
  *) Translation (deleuze model) wall time (seconds):
                   large_model   small_model   small_model_small_attn
     sequential:   0.00556       0.00484       0.00155
     parallel:     0.00263       0.00163       0.00106
  *) Wavenet:
     sequential: Avg. latency (30 runs): 1026.13ms, min/max: 988/1108ms
     parallel:   Avg. latency (30 runs): 800.633ms, min/max: 785/818ms
  *) ParallelFusion benchmark:
     Benchmark                           Time(ns)   CPU(ns)   Iterations
     -----------------------------------------------------------------
     sequential cpu backend (at head)      610584    611467        1000
     parallel cpu backend                  153241    836097        4528
     sequential cpu backend (this CL)      113482    679535        6017
  PiperOrigin-RevId: 171877766
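The shape of a fork/join runtime call can be sketched with std::thread: partition an instruction's element count into contiguous shards, run one worker per shard, and join before returning. The function name and signature here are illustrative, not XLA's actual runtime ABI:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Sketch of a fork/join runtime entry point: split [0, total) into
// contiguous shards, fork one worker per shard, join before returning.
void ParallelForkJoin(int64_t total, int num_workers,
                      const std::function<void(int64_t, int64_t)>& task) {
  std::vector<std::thread> workers;
  int64_t chunk = (total + num_workers - 1) / num_workers;
  for (int w = 0; w < num_workers; ++w) {
    int64_t begin = w * chunk;
    int64_t end = std::min(total, begin + chunk);
    if (begin >= end) break;
    workers.emplace_back(task, begin, end);
  }
  // Join: every shard must complete before the emitted code continues.
  for (auto& t : workers) t.join();
}
```

Each worker writes a disjoint range, so no synchronization beyond the final join is needed; this mirrors why only embarrassingly parallel element loops are assigned parallel tasks.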
* [XLA:CPU] Add an in-place implementation of fused dynamic-update-slice. (Justin Lebar, 2017-10-11)
  This implementation, which applies when a loop-fusion node's root is a dynamic-update-slice whose input operand and output share the same buffer slice, is much faster than the out-of-place implementation. This patch also unifies the implementation of the CPU and GPU versions of this algorithm.
  PiperOrigin-RevId: 171863142
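The in-place idea can be sketched for the rank-1 case: when input and output alias the same buffer, only the updated window is written and nothing else is copied. Start-index clamping follows HLO's dynamic-update-slice semantics; the function name is illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// In-place fused dynamic-update-slice sketch (rank 1): since the fusion's
// input and output share one buffer, only the updated window is written;
// the rest of the operand is left untouched instead of being copied.
void DynamicUpdateSliceInPlace(std::vector<int>& buffer,
                               const std::vector<int>& update,
                               int64_t start) {
  // Clamp the start index so the update window stays in bounds, matching
  // the HLO semantics for dynamic-update-slice.
  int64_t max_start = static_cast<int64_t>(buffer.size() - update.size());
  start = std::max<int64_t>(0, std::min<int64_t>(start, max_start));
  std::copy(update.begin(), update.end(), buffer.begin() + start);
}
```

The out-of-place version would first copy the whole operand to the output buffer and then apply the update, which is the cost this commit avoids.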