| Commit message | Author | Age |
| |
We have a 1-element thunk sequence if we're not copying. That's still two
thunks and hlo profiling gets confused if it sees two thunks for the same
instruction and one of them claims to be the whole instruction.
PiperOrigin-RevId: 216448063
|
| |
This simply emits a kernel that runs on every element of the updates tensor,
figures out the right indices at which to perform the update, and applies it
with an atomic operation.
Currently we emit a CAS for plain (i.e. non-add) updates, which is inefficient.
Also TuplePointsToAnalysis doesn't know that it should alias the operand and
output buffers of a scatter, which would avoid a copy.
PiperOrigin-RevId: 216412467
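The CAS path mentioned above can be sketched on the host side as follows. This is a minimal sketch with hypothetical names, not the actual emitted IR: it shows how an arbitrary (non-add) combiner gets applied atomically by looping on a compare-and-swap.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hedged sketch (hypothetical names): applying an arbitrary combiner with a
// CAS loop, as a scatter kernel must when the update is not a plain atomic
// add. Floats round-trip through their bit pattern because CAS operates on
// integer words.
float AtomicApply(std::atomic<uint32_t>* cell, float update,
                  float (*combine)(float, float)) {
  uint32_t old_bits = cell->load();
  for (;;) {
    float old_val;
    std::memcpy(&old_val, &old_bits, sizeof(float));
    float new_val = combine(old_val, update);
    uint32_t new_bits;
    std::memcpy(&new_bits, &new_val, sizeof(float));
    // On failure, compare_exchange_weak reloads the current cell value into
    // old_bits and the loop retries with the fresh value.
    if (cell->compare_exchange_weak(old_bits, new_bits)) return new_val;
  }
}
```

An atomic add can use the hardware's native atomic directly; only other combiners need this retry loop, which is why the message above calls the CAS path inefficient.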
|
|
|
|
|
|
|
| |
custom call and try to understand what's inside. convolution_thunk does
it anyway.
PiperOrigin-RevId: 213676051
|
|
|
|
| |
PiperOrigin-RevId: 212370999
|
|
|
|
|
|
|
|
|
| |
I put this in as a quick hack because init_value is usually a constant, but
it's really easy to construct a case where it's not. The code also became more
complex because of the constant buffer work; sharing that with the fused IR
emitter is a good thing.
PiperOrigin-RevId: 211936337
|
| |
Move routine ConstructIrArrayForOutputs to class IrEmitter so that it can be
used in classes IrEmitterNested and IrEmitterUnnested.
Move the code that stores the address of each individual output of a
multiple-output fusion to the tuple buffer of the fusion into an overloaded
version of routine llvm_ir::EmitTuple so that we can reduce code duplication.
PiperOrigin-RevId: 211884483
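Conceptually, the tuple buffer the overload fills is just a table of pointers to the per-output buffers. A minimal sketch with hypothetical names (this is not the llvm_ir::EmitTuple signature):

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch (hypothetical names, not the actual llvm_ir::EmitTuple
// signature): a tuple buffer is an array of pointers, one per output of the
// multi-output fusion; "emitting the tuple" stores each output's address.
void EmitTupleTable(void** tuple_buffer, const std::vector<void*>& outputs) {
  for (size_t i = 0; i < outputs.size(); ++i) {
    tuple_buffer[i] = outputs[i];
  }
}
```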
|
|
|
|
|
|
|
|
| |
Cudnn supports grouped convolutions, so we don't need the
ConvolutionFeatureGroupConverter pass and can instead set the group_count
parameter on the cudnn custom calls.
PiperOrigin-RevId: 211339551
|
|
|
|
|
|
| |
dependencies as well.
PiperOrigin-RevId: 211038094
|
|
|
|
| |
PiperOrigin-RevId: 210998142
|
|
|
|
|
|
| |
The replacement for the initializer_list overload is a bit sad because MakeSpan doesn't understand initializer_list (and we don't have CTAD yet).
PiperOrigin-RevId: 210974939
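The deduction problem can be shown with a toy stand-in (this is not the absl implementation; absl::MakeSpan just behaves analogously):

```cpp
#include <cstddef>

// Toy stand-in to illustrate the deduction problem (absl::MakeSpan behaves
// analogously; this is not the absl implementation).
template <typename T>
struct Span {
  const T* data;
  size_t size;
};

template <typename T, size_t N>
Span<T> MakeSpan(const T (&arr)[N]) {
  return {arr, N};
}

int Sum(Span<int> s) {
  int total = 0;
  for (size_t i = 0; i < s.size; ++i) total += s.data[i];
  return total;
}

// Sum(MakeSpan({1, 2, 3})) does not compile: a braced list has no type, so T
// cannot be deduced (and without CTAD there is no Span{...} shortcut either).
// Callers must first name a typed array:
int Caller() {
  const int dims[] = {1, 2, 3};
  return Sum(MakeSpan(dims));
}
```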
|
|
|
|
| |
PiperOrigin-RevId: 210472260
|
|
|
|
|
|
|
|
| |
Unlike Printf, StrFormat does not require type-length qualifiers, e.g.
%z, %ll. Nor does it require that you call c_str() to print strings.
So these are fixed up here as well.
PiperOrigin-RevId: 210435915
|
|
|
|
|
|
| |
Of {fusable, fusile, fusible} my dictionary only knows about fusible.
PiperOrigin-RevId: 210373347
|
|
|
|
|
|
|
| |
Unfortunately this has to be one big patch, because e.g. absl::StrCat
doesn't accept a TF StringPiece, but as soon as we switch to
absl::string_view, we have to switch away from all of the TF functions.
PiperOrigin-RevId: 209957896
|
|
|
|
| |
PiperOrigin-RevId: 209686671
|
| |
209663919 by yifeif<yifeif@google.com>:
Internal change.
--
209663914 by amitpatankar<amitpatankar@google.com>:
Fix the topk_op_test for numpy>1.15.
--
209660476 by jdduke<jdduke@google.com>:
Fix model lifetime for TensorFlow Lite C# bindings
Ensure the model's existence for the duration of the interpreter,
as per API requirements.
--
209655960 by scottzhu<scottzhu@google.com>:
Unify RNN Cell interface between TF and Keras.
--
209655731 by A. Unique TensorFlower<gardener@tensorflow.org>:
Added tests for PredictionOps and PartitionExamplesOps
--
209655291 by nolivia<nolivia@google.com>:
Adding a Rate class so that we can save global_step/sec using tf.contrib.summary. The function takes the rate in relation to any tensors, provided that the numerator and denominator are broadcastable and have dtypes that can be cast to float64.
--
209654655 by kramerb<kramerb@google.com>:
[XLA] Switch from tensorflow::gtl::InlinedVector to absl::InlinedVector
This one comes with extra goodies like a move constructor.
--
209653851 by A. Unique TensorFlower<gardener@tensorflow.org>:
Internal build specification change
--
PiperOrigin-RevId: 209663919
|
|
|
|
|
|
| |
Same for WrapUnique.
PiperOrigin-RevId: 209531124
|
|
|
|
| |
PiperOrigin-RevId: 209502513
|
|
|
|
| |
PiperOrigin-RevId: 209248552
|
|
|
|
| |
PiperOrigin-RevId: 209247783
|
|
|
|
| |
PiperOrigin-RevId: 208106767
|
| |
This simply brute-forces the trip count by repeatedly evaluating the loop
condition. It is a simple extension of the code in while_loop_simplifier.
Make while_loop_simplifier use it.
The GPU backend has a WhileTransformer, which tries to pattern-match loops with
a constant trip count. This stopped working a long time ago. Just replace
it with the common trip count finder. The longer-term goal is to move the
transformation before fusion and copy insertion so it's less fragile.
The tests that cover this are while_transformer's tests at the moment.
PiperOrigin-RevId: 207901341
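The brute-force idea can be sketched as follows. The signature is hypothetical; the real code runs the HLO evaluator on the while condition and body rather than plain callables:

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Hedged sketch (hypothetical signature): brute-force the trip count by
// running the loop state forward until the condition fails or a budget is
// exhausted. Returning nullopt means "not provable within the budget".
std::optional<int64_t> BruteForceTripCount(
    int64_t init, const std::function<bool(int64_t)>& condition,
    const std::function<int64_t(int64_t)>& body, int64_t max_trips) {
  int64_t state = init;
  for (int64_t trips = 0; trips <= max_trips; ++trips) {
    if (!condition(state)) return trips;  // loop exits after `trips` iterations
    state = body(state);
  }
  return std::nullopt;
}
```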
|
|
|
|
|
|
| |
This gives a huge speedup for users of batchdot. This is a minimal implementation without autotuning and without support for strided batch gemm.
PiperOrigin-RevId: 207247740
|
|
|
|
|
|
|
| |
This became unnecessary with cl/206243319 "Implement constant buffer allocation
for XLA:GPU".
PiperOrigin-RevId: 207204478
|
| |
The BitonicSort algorithm works in-place, so we can make use of that.
On GPU, so far we copied the operand to the output and then performed the algorithm in-place.
Now, we may not need to do this anymore if we see that the buffer is shared.
Also, we now only need device-to-device copies when the buffer is not shared,
because constants are now also assigned a buffer.
PiperOrigin-RevId: 206745686
|
|
|
|
|
|
|
|
| |
This is simpler than the corresponding change to XLA:GPU because on XLA:CPU all
instructions are codegened so we can always embed a pointer to the constant
global variable directly in the generated LLVM IR.
PiperOrigin-RevId: 206363887
|
| |
This CL teaches XLA:GPU to use "normal" buffer assignment for constant
instructions. Constant instructions are mapped to a BufferAllocation, like all
other instructions, except the storage for this buffer is allocated statically
as a global in the generated PTX.
This CL does not change how we access the constants -- in
IrEmitterUnnested::BuildKernelThunk (used for top level computations) and in
HloToIrBindings::EmitBasePointersForHlos (used for nested computations) we bind
the kConstant instructions to the llvm::GlobalVariable backing them. So users
of constant instructions still access the globals corresponding to the constants
directly.
However, we no longer emit the constant literals inline. Instead we emit a
constant with a zero initializer and then memcpy in the contents of the literal
when we load the CUBIN/PTX. This works around compile time issues in LLVM and
ptxas caused by large constants.
We also populate `BufferAllocations` with the device pointers for the constant
globals. This is at least needed for TupleThunk today because TupleThunk wants
the addresses for the sub-buffers on the host. I'm not sure if there are other
places in XLA:GPU that rely on there being an entry in BufferAllocations for
every BufferAllocation.
PiperOrigin-RevId: 206243319
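The load-time patching described above can be sketched like this. The types and names are hypothetical stand-ins, not the XLA:GPU API; the point is that the module text carries only zero initializers and the literal bytes are copied in once the binary is loaded:

```cpp
#include <cstddef>
#include <cstring>

// Hedged sketch (hypothetical types): each constant gets a zero-initialized
// global in the PTX, and the literal bytes are copied in when the CUBIN/PTX
// is loaded, instead of embedding large initializers in the module text.
struct ConstantPatch {
  void* global_storage;  // stand-in for the global's address after load
  const void* literal;   // the constant's bytes held by the compiler
  size_t size;
};

void PatchConstantsOnLoad(const ConstantPatch* patches, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    std::memcpy(patches[i].global_storage, patches[i].literal,
                patches[i].size);
  }
}
```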
|
|
|
|
|
|
|
|
| |
This requires a slight modification in the emitted compare loop: now, we use
another if to check if we need to swap instead of two selects. Speed is mostly
the same, possibly even a little bit faster.
PiperOrigin-RevId: 206148647
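The two emission strategies for one compare-swap step can be contrasted in a C++ stand-in (this sketches the shape of the emitted code, not the IR itself):

```cpp
#include <utility>

// Before: two selects compute the min and max unconditionally.
void CompareSwapWithSelects(float& a, float& b) {
  float lo = a < b ? a : b;  // select #1
  float hi = a < b ? b : a;  // select #2
  a = lo;
  b = hi;
}

// After: a single conditional swap, as described in the message above.
void CompareSwapWithBranch(float& a, float& b) {
  if (a > b) std::swap(a, b);
}
```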
|
|
|
|
|
|
| |
computations
PiperOrigin-RevId: 206075141
|
| |
Implement the RNG elemental ir generator using the Philox algorithm. To ensure
multiple execution of the same RNG hlo instruction rarely produce the same
result, we increment a global variable with the number of random numbers
generated by the RNG hlo each time the hlo is executed and use the value of the
global variable to construct the seed for the RNG algorithm.
Modify the GPU backend to generate a parallel loop to execute the Philox
algorithm. The CPU backend still uses a sequential loop to perform Philox
random number generation, and we will need to enhance the
ParallelTaskAssignment pass to change this.
Remove the old PCG RNG algorithm for the CPU and GPU backends.
PiperOrigin-RevId: 206069733
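The seeding scheme described above can be sketched as a counter that advances by the number of values each execution consumes; the names here are hypothetical:

```cpp
#include <cstdint>

// Hedged sketch (hypothetical names): a counter advances by the number of
// random values each execution consumes, so repeated executions of the same
// RNG instruction start from distinct positions in the Philox stream.
struct RngSeedState {
  uint64_t counter = 0;

  // Returns the seed offset for this execution and advances the counter.
  uint64_t NextOffset(uint64_t values_needed) {
    uint64_t offset = counter;
    counter += values_needed;
    return offset;
  }
};
```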
|
|
|
|
| |
PiperOrigin-RevId: 205832336
|
|
|
|
|
|
|
|
| |
We now emit O(log^2 n) kernel thunks. Each thunk is responsible for looping over
the other dimensions, and then doing a comparison loop through the dimension
that should be sorted.
PiperOrigin-RevId: 205791397
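The pass structure can be sketched on the host: a bitonic network over n = 2^k elements runs k*(k+1)/2 passes, matching the O(log^2 n) thunk count, and each inner pass below corresponds to one kernel launch comparing independent pairs:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side sketch of the bitonic pass structure (not the emitted kernels):
// each inner `step` pass is independent work over all pairs, which is what
// maps to one kernel thunk per pass on the GPU.
void BitonicSort(std::vector<float>& v) {
  const size_t n = v.size();  // assumed to be a power of two
  for (size_t stage = 2; stage <= n; stage *= 2) {
    for (size_t step = stage / 2; step > 0; step /= 2) {
      for (size_t i = 0; i < n; ++i) {
        size_t j = i ^ step;
        if (j <= i) continue;  // each pair is handled once
        bool ascending = (i & stage) == 0;
        if (ascending ? v[i] > v[j] : v[i] < v[j]) std::swap(v[i], v[j]);
      }
    }
  }
}
```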
|
|
|
|
|
|
| |
This allows the use of CHECK_EQ with Thunk::Kind values.
PiperOrigin-RevId: 205775065
|
|
|
|
|
|
| |
There are edge cases where a top-level allocation exists but it's ambiguous.
PiperOrigin-RevId: 205665320
|
|
|
|
|
|
| |
There's a limit to how much shared memory we can use.
PiperOrigin-RevId: 205465441
|
|
|
|
|
|
| |
Brevity.
PiperOrigin-RevId: 205454869
|
|
|
|
|
|
|
|
| |
For unsigned indices, we need to use unsigned comparisons when clamping the
start_indices.
Also rename the files from ops.* to dynamic_update_slice_util.*
PiperOrigin-RevId: 205072344
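The clamp in question can be sketched as follows. With a signed comparison, a start index whose top bit is set reads as negative and clamps to 0; treating it as unsigned clamps it to the legal upper bound instead:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of clamping a dynamic-update-slice start index (hypothetical helper,
// not the emitter's actual code). The unsigned std::min is the point: a
// signed compare would treat 0x80000000 as negative and clamp to 0.
uint32_t ClampStartIndex(uint32_t start, uint32_t update_dim_size,
                         uint32_t operand_dim_size) {
  uint32_t max_start = operand_dim_size - update_dim_size;
  return std::min(start, max_start);  // unsigned comparison
}
```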
|
|
|
|
|
|
|
|
|
| |
This is a first version, several things are still missing:
- Support for key/value sorting.
- Support for other types than F32, S32 and U32.
- Parallelization of the inner loop.
PiperOrigin-RevId: 205052657
|
|
|
|
|
|
|
|
| |
greater than 1.
Tiles of width 1 result in poor memory bandwidth for 16b inputs.
PiperOrigin-RevId: 205033124
|
|
|
|
|
|
|
|
|
| |
shfl.down is deprecated and ptxas 9.2 emits a loud warning when you use
it. Convert XLA to shfl.sync.down.
This change makes XLA:GPU require CUDA 9.
PiperOrigin-RevId: 204546742
|
|
|
|
|
|
| |
Less visual noise.
PiperOrigin-RevId: 204139183
|
| |
- Fix typos.
- Clarify comments.
- Reduce nesting in a few places.
- Add asserts that this code is dealing with specifically a loop fusion.
- Rename some functions. In particular, it's confusing to have a
function with a generic name like EmitCodeWithBoundCheck that actually
is specialized to a tiled implementation.
- Remove statement expression (GCC language extension), replacing it
with an IIFE.
- Don't refer to shared-memory tile space as "buffer" without other
qualifying words, since that's ambiguous with what XLA refers to as a
"buffer".
- Use llvm::cast instead of static_cast.
- Comply with style guide naming rules for compile-time constants
(kFoo).
- Use c_accumulate instead of std::accumulate.
- Put std::function parameter at the end of the param list. This lets
us cleanly embed the lambda into the call because of how clang-format
formats such calls. (I think this one is possibly the most helpful
change in this patch, as it suddenly makes clear to me the way that we
use two calls to emit_tiled_elemental_code_with_bounds_check to emit
the code.)
PiperOrigin-RevId: 204134102
|
|
|
|
|
|
|
|
|
| |
The infeed and outfeed managers are really similar but not quite the same; I'm open
to ideas on how to factor them better. This has a much cleaner design for
OutfeedManager than we have for InfeedManager; I'll look into cleaning up
InfeedManager in a follow-up.
PiperOrigin-RevId: 204012304
|
| |
This function was trying to annotate loads/stores to shared memory with
metadata copied from another IRArray. But it was over-eager and copied
more than we wanted.
In addition, it CHECKed that the source buffer was invariant, and that's
not necessarily true.
For now, remove this function, which is just a performance optimization.
PiperOrigin-RevId: 204005618
|
| |
Add class TiledParameterInfo to provide information for FusedIrEmitter to read
the content of a tiled parameter from the tile buffer instead of the original
input memory.
Reimplement the tiled 0-2-1 transpose algorithm for copy instructions only in a
more general way so that it can handle both fusion instructions and copy
instructions.
The original tiled 0-2-1 transpose implementation incorrectly used
(tile_size+1) rows for a tile buffer to reduce shared memory bank conflicts,
while it should be (tile_size+1) columns instead. This is a performance issue
and is fixed in the new implementation.
The original tiled 0-2-1 transpose implementation did not generate LLVM alias
metadata for the loads and stores of the tensors. This was due to a bug where
function IrArray::CastToShape missed copying metadata to the new IrArray
object. This is also a performance issue and is fixed in this change.
Modified KernelSupportLibrary to support emitting an if-stmt with a given
branch name prefix.
Add test cases to test the new implementation.
PiperOrigin-RevId: 203310403
|
| |
SequentialThunks are used in two different ways: sometimes as a sequence of
individual thunks for different HloInstructions, and sometimes for one
HloInstruction which consists of several thunks. For the latter, we want
to measure the total time taken by the HloInstruction. Previously, we would
instead measure the time of the last thunk from the SequentialThunk.
PiperOrigin-RevId: 203258617
|
|
|
|
|
|
|
|
|
| |
Currently the Literal class sits in literal_util.{h,cc} instead of literal.{h,cc}.
It also contains helper functions that are a better fit to be their own separate
class/namespace. This change starts that process by moving most static factory
methods to the LiteralUtil namespace.
PiperOrigin-RevId: 203217065
|
| |
Array select and tuple-select are already handled separately in all backends and HLO passes: array select is an elementwise operation, and the shapes of the two operands have the same dimensions. Tuple select does not define its own output, but instead forwards the true- or false-operand based on a scalar predicate operand.
This CL reflects this by adding a new kTupleSelect HLO. The XLA builder interface stays the same and dispatches based on the operand shapes.
No change in the operation semantics. This CL just splits the existing select operation into two opcodes and preserves the existing semantics.
HLO cost analysis is fixed to handle the two ops appropriately.
PiperOrigin-RevId: 203180342
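The "builder stays the same" part amounts to a shape-based dispatch. A toy sketch with hypothetical names (the real XlaBuilder keys off the operand shapes, not a boolean):

```cpp
// Toy sketch (hypothetical names): the client still calls a single Select;
// the opcode is chosen from whether the operands are tuple-shaped.
enum class Opcode { kSelect, kTupleSelect };

Opcode SelectOpcodeFor(bool operands_are_tuples) {
  return operands_are_tuples ? Opcode::kTupleSelect : Opcode::kSelect;
}
```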
|
|
|
|
|
|
|
|
| |
The previous check was checking the number of elements in a tile against
the number of elements in the input shape. This doesn't work if one dimension
of the tile is bigger than the input dimension, but the other dimension is smaller.
PiperOrigin-RevId: 202326635
|
| |
There's a bunch of things going on here:
- BuildInitializerThunk threw away half of 16 bit init values. Fix that.
- Make HandleFusion verify that it gets input-fusible reduces
- Fuse BF16 again in multi-output fusion. This was a workaround for the initializer bug
- Drop the 32 bit requirement from unfused reduce emission. It is really confusing to have different code paths for fused and unfused reduces
- Emit 8/16 integer bit add/min/max as CAS.
This is somewhat covered by existing tests.
PiperOrigin-RevId: 202125572
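The 8/16-bit CAS emission in the last bullet can be sketched on the host with std::atomic standing in for the device word (hypothetical names, not the emitted IR): the CAS operates on the containing 32-bit word, updates only the addressed half, and preserves the other half verbatim.

```cpp
#include <atomic>
#include <cstdint>

// Hedged sketch: an atomic 16-bit max emulated with a 32-bit CAS on the
// containing word. Only the addressed half-word is combined; the other half
// is carried through unchanged.
uint16_t AtomicMax16ViaCas(std::atomic<uint32_t>* word, bool high_half,
                           uint16_t value) {
  const int shift = high_half ? 16 : 0;
  uint32_t old_word = word->load();
  for (;;) {
    uint16_t current = static_cast<uint16_t>(old_word >> shift);
    uint16_t updated = current > value ? current : value;
    uint32_t new_word = (old_word & ~(0xFFFFu << shift)) |
                        (static_cast<uint32_t>(updated) << shift);
    // On failure, old_word is refreshed and the loop retries.
    if (word->compare_exchange_weak(old_word, new_word)) return updated;
  }
}
```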
|