PiperOrigin-RevId: 215272497

Also don't allow parallelization for the sort op in parallel_task_assignment.
PiperOrigin-RevId: 213592046

*** Original change description ***
Add HloSchedule class representing a sequential order of an HloModule.
Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence, which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class, which results in better encapsulation of code which deals with schedules and better enforcement of invariants.
***
PiperOrigin-RevId: 211726890

Automated rollback of commit 7fa693209fe238478739b3982f652a7e35be91f3
PiperOrigin-RevId: 211681957

Add HloSchedule class representing a sequential order of an HloModule.
Currently we represent a sequential schedule of a module using a SequentialHloOrdering::HloModuleSequence, which is a type alias of a bare map from HloComputation* to std::vector<HloInstruction*>. This CL replaces this with a proper class, which results in better encapsulation of code which deals with schedules and better enforcement of invariants.
This CL also fixes a corner-case bug in dataflow analysis, where values of instructions which are live out of the computation erroneously did not interfere with the values of instructions scheduled after the root instruction.
PiperOrigin-RevId: 211656888
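
As an illustration of the encapsulation point, here is a minimal sketch of what a schedule class wrapping that bare map might look like; the class and method names are hypothetical, not the real HloSchedule API.

  #include <map>
  #include <utility>
  #include <vector>

  class HloComputation;   // forward declarations standing in for the real XLA types
  class HloInstruction;

  // Hypothetical stand-in for HloSchedule: wraps the bare map so that
  // invariants (e.g. "every scheduled instruction belongs to its computation")
  // can be checked in one place instead of at every use site.
  class ModuleSchedule {
   public:
    void set_sequence(const HloComputation* computation,
                      std::vector<HloInstruction*> sequence) {
      sequences_[computation] = std::move(sequence);
    }

    const std::vector<HloInstruction*>& sequence(
        const HloComputation* computation) const {
      return sequences_.at(computation);
    }

    // A proper class gives a natural home for verification, which a bare
    // type alias cannot enforce.
    bool Verify() const;

   private:
    std::map<const HloComputation*, std::vector<HloInstruction*>> sequences_;
  };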

Instead call it "buffer table"; it now contains both entry computation parameters and temporaries.
PiperOrigin-RevId: 211171651

PiperOrigin-RevId: 211162384

dependencies as well.
PiperOrigin-RevId: 211038094

PiperOrigin-RevId: 210998142

This extends the Iota HLO to have a broadcast field. This allows for higher-rank kIota operations.
PiperOrigin-RevId: 210600435
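
To make the higher-rank iota concrete, here is a small host-side sketch (my own example, not from the CL): with an iota dimension, each output element equals its index along that dimension.

  #include <cstdint>
  #include <iostream>
  #include <vector>

  // Illustrative only: fills a rank-2 "array" so that out[i][j] equals the
  // index along iota_dimension (0 for rows, 1 for columns), mimicking what a
  // higher-rank kIota produces.
  std::vector<std::vector<int64_t>> Iota2D(int64_t rows, int64_t cols,
                                           int iota_dimension) {
    std::vector<std::vector<int64_t>> out(rows, std::vector<int64_t>(cols));
    for (int64_t i = 0; i < rows; ++i)
      for (int64_t j = 0; j < cols; ++j)
        out[i][j] = (iota_dimension == 0) ? i : j;
    return out;
  }

  int main() {
    // iota_dimension = 1: every row is [0, 1, 2].
    for (const auto& row : Iota2D(2, 3, /*iota_dimension=*/1)) {
      for (int64_t v : row) std::cout << v << ' ';
      std::cout << '\n';
    }
  }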

PiperOrigin-RevId: 210472260

Unfortunately this has to be one big patch, because e.g. absl::StrCat doesn't accept a TF StringPiece, but as soon as we switch to absl::string_view, we have to switch away from all of the TF functions.
PiperOrigin-RevId: 209957896
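
For context, the sketch below shows the target state after the migration; MakeBufferName is an invented example function. absl::StrCat accepts absl::string_view, whereas it does not accept a TF StringPiece, which is why call sites and their consumers have to flip together.

  #include <string>

  #include "absl/strings/str_cat.h"
  #include "absl/strings/string_view.h"

  // absl::StrCat accepts absl::string_view (among other types), so functions
  // that used to take a TF StringPiece must be migrated together with the
  // call sites that feed them.
  std::string MakeBufferName(absl::string_view prefix, int index) {
    return absl::StrCat(prefix, ".buffer.", index);
  }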

PiperOrigin-RevId: 207045468

aligned_buffer_bytes in compiler/aot/runtime.cc was checking sizes[i] == -1 (as opposed to checking sizes[i] < 0) to decide whether sizes[i] should count towards the total size.
Original CL description:
Overhaul XLA:CPU's calling convention.
This CL introduces a clean separation between calls to "thread local" and "global" computations in XLA:CPU.
Global computations are:
- kWhile body and condition computations
- kConditional true and false computations
- kCall callees
Parameter and result buffers for these calls are assigned a static BufferAllocation::Slice by buffer assignment, so they don't require pointers to result buffers and parameters to be explicitly passed in. In fact, passing in result and parameter buffers is actively misleading, because in cases like:
  while_condition {
    val = (s32[], pred[]) infeed()
    ROOT result = get-tuple-element(val), index=0
  }
there is no instruction explicitly copying the result of the computation into the result buffer. Instead, it is up to the caller to pick up the correct result buffer by asking buffer assignment (which would be the buffer where infeed wrote its second tuple component).
Thread local computations are all the other nested computations except fusion, e.g. computations used by kMap and kReduce.
Parameter and result buffers for these calls are assigned a "thread local" BufferAllocation::Slice, which in XLA:CPU is mapped to allocas. Since these are not static addresses, we *do* need to pass in parameter and result buffers. The output is written to the result buffer by "allocating" the storage for the root into the result buffer passed in by the caller.
There are two cleanup items that I kept off this CL to make reviews easier:
- We should rename "temps" to something more generic, like "buffer_table". I'll do that in a followup CL.
- We should use GatherComputationsByAllocationType from buffer_assignment.cc to CHECK that we use thread local calls for thread local callees and global calls for global callees.
PiperOrigin-RevId: 206980796
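
A sketch of the two call shapes this separation implies, under a deliberately simplified ABI; the names and signatures below are illustrative assumptions, not the actual XLA:CPU generated-function signatures.

  // Illustrative C-level view of the two calling conventions described above.

  // "Global" computation (kWhile / kConditional / kCall callee): parameter and
  // result buffers have statically assigned slices in the buffer table, so the
  // call needs only the run-time context and the buffer table itself.
  void global_computation(void* run_options, void* const* buffer_table);

  // "Thread local" computation (e.g. a kMap or kReduce subcomputation):
  // parameters and the result live in the caller's allocas, so their addresses
  // must be passed explicitly.
  void thread_local_computation(void* run_options, float* result,
                                float* const* params);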

PiperOrigin-RevId: 206855848

This CL introduces a clean separation between calls to "thread local" and "global" computations in XLA:CPU.
Global computations are:
- kWhile body and condition computations
- kConditional true and false computations
- kCall callees
Parameter and result buffers for these calls are assigned a static BufferAllocation::Slice by buffer assignment, so they don't require pointers to result buffers and parameters to be explicitly passed in. In fact, passing in result and parameter buffers is actively misleading, because in cases like:
  while_condition {
    val = (s32[], pred[]) infeed()
    ROOT result = get-tuple-element(val), index=0
  }
there is no instruction explicitly copying the result of the computation into the result buffer. Instead, it is up to the caller to pick up the correct result buffer by asking buffer assignment (which would be the buffer where infeed wrote its second tuple component).
Thread local computations are all the other nested computations except fusion, e.g. computations used by kMap and kReduce.
Parameter and result buffers for these calls are assigned a "thread local" BufferAllocation::Slice, which in XLA:CPU is mapped to allocas. Since these are not static addresses, we *do* need to pass in parameter and result buffers. The output is written to the result buffer by "allocating" the storage for the root into the result buffer passed in by the caller.
There are two cleanup items that I kept off this CL to make reviews easier:
- We should rename "temps" to something more generic, like "buffer_table". I'll do that in a followup CL.
- We should use GatherComputationsByAllocationType from buffer_assignment.cc to CHECK that we use thread local calls for thread local callees and global calls for global callees.
PiperOrigin-RevId: 206843794

This is simpler than the corresponding change to XLA:GPU because on XLA:CPU all instructions are codegened, so we can always embed a pointer to the constant global variable directly in the generated LLVM IR.
PiperOrigin-RevId: 206363887

Implement the RNG elemental IR generator using the Philox algorithm. To ensure multiple executions of the same RNG HLO instruction rarely produce the same result, we increment a global variable by the number of random numbers generated by the RNG HLO each time the HLO is executed, and use the value of the global variable to construct the seed for the RNG algorithm.
Modify the GPU backend to generate a parallel loop to execute the Philox algorithm. The CPU backend still uses a sequential loop to perform Philox random number generation, and we will need to enhance the ParallelTaskAssignment pass to change this.
Remove the old PCG RNG algorithm for the CPU and GPU backends.
PiperOrigin-RevId: 206069733
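
A rough host-side sketch of the seeding scheme described above; all names are hypothetical, and std::mt19937_64 stands in for Philox, since the real change emits IR for the Philox algorithm itself.

  #include <cstdint>
  #include <random>

  // Global counter standing in for the module-level variable the CL describes:
  // it advances by the number of values drawn, so repeated executions of the
  // same RNG HLO start from different points in the random stream.
  static uint64_t rng_offset = 0;

  // Hypothetical helper: draw `n` values, seeding from a fixed per-instruction
  // seed combined with the advancing offset.
  void DrawRandom(uint64_t instruction_seed, uint64_t n, double* out) {
    std::mt19937_64 gen(instruction_seed ^ rng_offset);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (uint64_t i = 0; i < n; ++i) out[i] = dist(gen);
    rng_offset += n;  // the next execution of this HLO sees a different seed
  }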

Brevity.
PiperOrigin-RevId: 205454869

PiperOrigin-RevId: 205447892

PiperOrigin-RevId: 204042666

Array select and tuple-select are already handled separately in all backends and HLO passes: array select is an elementwise operation, and the shapes of the two operands have the same dimensions; tuple select does not define its own output, but instead forwards the true or false operand based on a scalar predicate operand.
This CL reflects this by adding a new kTupleSelect HLO. The XLA builder interface stays the same and dispatches based on the operand shapes.
There is no change in operation semantics; this CL just splits the existing select operation into two opcodes and preserves the existing semantics.
HLO cost analysis is fixed to handle the two ops appropriately.
PiperOrigin-RevId: 203180342
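
To make the distinction concrete, a pseudo-implementation of the two semantics (my own sketch, not XLA code): elementwise select picks per element, tuple select forwards one whole operand.

  #include <array>
  #include <cstddef>

  // Elementwise (array) select: output element i comes from on_true[i] or
  // on_false[i] depending on pred[i]. Both operands have the same shape.
  template <typename T, std::size_t N>
  std::array<T, N> Select(const std::array<bool, N>& pred,
                          const std::array<T, N>& on_true,
                          const std::array<T, N>& on_false) {
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i)
      out[i] = pred[i] ? on_true[i] : on_false[i];
    return out;
  }

  // Tuple select: a single scalar predicate forwards one operand unchanged;
  // no per-element output is defined.
  template <typename Tuple>
  const Tuple& TupleSelect(bool pred, const Tuple& on_true,
                           const Tuple& on_false) {
    return pred ? on_true : on_false;
  }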

PiperOrigin-RevId: 202090038

Long term I think we want to require kAfterAll to take at least one token as operand so it cannot generate a token out of thin air, so kGenerateToken is no longer an appropriate name. Instead, a primordial token would be supplied somehow in the entry computation, perhaps as a parameter, and then threaded to any side-effecting ops.
NFC.
PiperOrigin-RevId: 202079040

TOKENs will be used for ordering side-effecting operations. They are not materialized, but they can be contained in tuples and flow into and out of computations. This CL adds a trivial representation for the CPU and GPU backends to support TOKENs and modifies copy insertion to avoid making copies of tokens.
This also adds a Literal TOKEN, which is required for the interpreter backend.
PiperOrigin-RevId: 200623120

For large constants, creating an llvm::Constant for each element can lead to prohibitively long compile times.
PiperOrigin-RevId: 198843141
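
The likely shape of the remedy, sketched under my own assumptions rather than taken from the CL: emit one llvm::ConstantDataArray over the raw values instead of building one llvm::Constant per element.

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/LLVMContext.h"

  // Sketch: for a large float literal, a single ConstantDataArray over the raw
  // payload is far cheaper to build (and for LLVM to process) than creating an
  // llvm::ConstantFP per element and wrapping them all in a ConstantArray.
  llvm::Constant* EmitFloatLiteral(llvm::LLVMContext& context,
                                   llvm::ArrayRef<float> values) {
    return llvm::ConstantDataArray::get(context, values);
  }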

We teach TargetMachineFeatures about the alignment required for Eigen GEMM and Conv, and then pipe TargetMachineFeatures through the places that need to decide whether a dot or a conv needs to be lowered to a call to Eigen.
I also had to fix a minor bug in our LLVM IR implementation for convolution.
PiperOrigin-RevId: 196065557

This isn't necessary today, but it will be after an optimization change I'm about to make.
LLVM has a constant merging pass too, but one of the motivations here is to avoid the LLVM compile-time overhead of having many large arrays in the IR.
PiperOrigin-RevId: 195032900

PiperOrigin-RevId: 194299356

We now use the batchnorm rewriter (tensorflow/compiler/xla/service/batchnorm_rewriter.h) to expand batch norm into smaller ops, so a specific implementation should not be needed anymore (for CPU).
RELNOTES: n/a
PiperOrigin-RevId: 183117252

This change teaches XLA to maintain cycle counters specifically for non-entry computations, like computations representing the body of a While. Without this change, instructions in such non-entry computations are noted as taking 0.00% of their execution time, which isn't ideal.
Implementation-wise, this just falls out of uniformly using a std::unordered_map<T, int64> for both the HloInstruction->ProfileIndex and the HloComputation->ProfileIndex mappings.
PiperOrigin-RevId: 180750463
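
A minimal sketch of such a uniform mapping (hypothetical struct and field names; int64 in the message is TensorFlow's alias for int64_t):

  #include <cstdint>
  #include <unordered_map>

  class HloInstruction;  // stand-ins for the real XLA types
  class HloComputation;

  // Hypothetical holder: both maps are keyed by pointer and valued by a slot
  // index into the profile counter array. Giving computations the same map
  // shape as instructions is what lets non-entry computations get their own
  // cycle counters.
  struct ProfileIndexMap {
    std::unordered_map<const HloInstruction*, int64_t> instruction_index;
    std::unordered_map<const HloComputation*, int64_t> computation_index;
  };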

DotOpEmitter
- Move VectorSupportLibrary under service/cpu, since it is specific to the CPU backend.
- Use TargetMachineFeatures to infer the vector width in DotOpEmitter.
- Move the kAvxVectorSize magic constant into TargetMachineFeatures.
PiperOrigin-RevId: 180740693

PiperOrigin-RevId: 180628481

DotOpEmitter
- Move VectorSupportLibrary under service/cpu, since it is specific to the CPU backend.
- Use TargetMachineFeatures to infer the vector width in DotOpEmitter.
- Move the kAvxVectorSize magic constant into TargetMachineFeatures.
PiperOrigin-RevId: 180622078

PiperOrigin-RevId: 180581912

PiperOrigin-RevId: 180000981

GPU support includes plan reuse with a new scratch allocator per execution in fft_thunk.
PiperOrigin-RevId: 179983419

I'll add more uses of TargetMachineFeatures in subsequent CLs.
PiperOrigin-RevId: 179211454

CPU backend.
PiperOrigin-RevId: 178322445

be called from other emitters). Just code movement (no functional change).
PiperOrigin-RevId: 178158853

called by other emitters (no functional change, just code movement).
PiperOrigin-RevId: 177317764

file (no functional changes, just code movement). This will enable building parallel IR functions from other emitters, and remove the requirement that parallel IR functions are associated with a sub-computation.
PiperOrigin-RevId: 177309875

While this does change the profile counter entry for the entry computation during AOT compiles (earlier it would always be some non-null llvm::Value, but now it can be null), it does not change any observable behavior, since RecordCompleteComputation is a no-op for an empty hlo_to_profile_idx_ map.
PiperOrigin-RevId: 176022629

I think the performance advantages of keeping it as a maybe-null pointer are minimal, and it instead complicates the signature generation code. For example, the code to generate calls to __xla_cpu_runtime_ParallelForkJoin is buggy when hlo_to_profile_idx_ is nullptr today. This bug isn't visible today because we always have hlo_to_profile_idx_ as nullptr in JIT mode, and in AOT mode we don't parallelize HLO operations.
PiperOrigin-RevId: 175993645

and Recv into {Recv, RecvDone}. See operation_semantics.md for the updated semantics.
PiperOrigin-RevId: 175216012

Only pass the HloInstruction into visitor methods. This makes changing instructions and visitors easier.
PiperOrigin-RevId: 173983398

PiperOrigin-RevId: 172325692

PiperOrigin-RevId: 171915087

already has intra-op parallelism for library calls).
Adds support for parallel task assignment to instructions in entry (or embedded) computations.
Adds code to emit calls to a new runtime parallel fork/join function for instructions which have been assigned parallel tasks.
Adds a simple cost model for I/O bound instructions.
*) Translation (deleuze model) wall time (seconds).
                 large_model  small_model  small_model_small_attn
   sequential:   0.00556      0.00484      0.00155
   parallel:     0.00263      0.00163      0.00106
*) Wavenet
   sequential: Avg. latency (30 runs): 1026.13ms, min/max: 988/1108ms
   parallel:   Avg. latency (30 runs): 800.633ms, min/max: 785/818ms
*) ParallelFusion benchmark.
   Benchmark                          Time(ns)  CPU(ns)  Iterations
   ----------------------------------------------------------------
   sequential cpu backend (at head)     610584   611467        1000
   parallel cpu backend                 153241   836097        4528
   sequential cpu backend (this CL)     113482   679535        6017
PiperOrigin-RevId: 171877766
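
A hedged sketch of the fork/join pattern being described; the function below is invented for illustration and is not the signature of __xla_cpu_runtime_ParallelForkJoin.

  #include <algorithm>
  #include <cstdint>
  #include <functional>
  #include <thread>
  #include <vector>

  // Illustrative fork/join: split [0, total) into `num_tasks` contiguous
  // partitions, run the compiled kernel body on each partition in its own
  // thread, and join before returning, so the op completes before its users
  // run.
  void ParallelForkJoin(
      int64_t total, int num_tasks,
      const std::function<void(int64_t begin, int64_t end)>& body) {
    std::vector<std::thread> workers;
    const int64_t chunk = (total + num_tasks - 1) / num_tasks;
    for (int t = 0; t < num_tasks; ++t) {
      const int64_t begin = t * chunk;
      const int64_t end = std::min(total, begin + chunk);
      if (begin >= end) break;
      workers.emplace_back(body, begin, end);
    }
    for (std::thread& w : workers) w.join();
  }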

This implementation, which applies when a loop-fusion node's root is a dynamic-update-slice whose input operand and output share the same buffer slice, is much faster than the out-of-place implementation.
This patch also unifies the implementation of the CPU and GPU versions of this algorithm.
PiperOrigin-RevId: 171863142
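
A sketch of why the in-place form wins (my own 1-D example): when the fusion output aliases the input buffer, only the update window is written instead of copying the whole operand.

  #include <cassert>
  #include <cstdint>
  #include <vector>

  // Illustrative in-place dynamic-update-slice on a 1-D buffer: because the
  // fusion output aliases the input buffer, we write just the update window
  // rather than producing a full copy of `operand`.
  void DynamicUpdateSliceInPlace(std::vector<float>& operand,
                                 const std::vector<float>& update,
                                 int64_t start) {
    assert(update.size() <= operand.size());
    // Clamp the start index so the update window stays in bounds, mirroring
    // XLA's dynamic-update-slice semantics.
    const int64_t max_start = static_cast<int64_t>(operand.size()) -
                              static_cast<int64_t>(update.size());
    if (start < 0) start = 0;
    if (start > max_start) start = max_start;
    for (std::size_t i = 0; i < update.size(); ++i)
      operand[start + i] = update[i];
  }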