path: root/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
* [XLA:GPU] Elide the SequentialThunk when emitting scatter with no copy (Benjamin Kramer, 2018-10-09)
  We have a 1-element thunk sequence if we're not copying. That's still two
  thunks (the SequentialThunk plus its single element), and HLO profiling gets
  confused if it sees two thunks for the same instruction and one of them
  claims to be the whole instruction.
  PiperOrigin-RevId: 216448063
* [XLA:GPU] Add an implementation of scatter for GPU (Benjamin Kramer, 2018-10-09)
  This simply emits a kernel that runs on every element of the updates tensor,
  figures out the right indices to perform the update, and applies it with an
  atomic operation.

  Currently we emit a CAS for plain (i.e. non-add) updates, which is
  inefficient. Also, TuplePointsToAnalysis doesn't know that it should alias
  the operand and output buffers of a scatter, which would avoid a copy.
  PiperOrigin-RevId: 216412467
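  [Illustration, not part of the commit] The CAS fallback mentioned above can
  be pictured as the following CUDA sketch; the real code is LLVM IR emitted
  by ir_emitter_unnested, and the names and the combiner here are hypothetical:

      // Sketch of an atomic scatter update via compare-and-swap, assuming a
      // float operand and an arbitrary (non-add) combiner.
      __device__ float combiner(float old_val, float update) {
        return old_val * update;  // any non-add combiner forces the CAS path
      }

      __global__ void scatter_update(float* operand, const int* scatter_indices,
                                     const float* updates, int num_updates) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_updates) return;
        float* addr = &operand[scatter_indices[i]];
        unsigned int* addr_as_u32 = reinterpret_cast<unsigned int*>(addr);
        unsigned int old = *addr_as_u32, assumed;
        do {
          assumed = old;
          float updated = combiner(__uint_as_float(assumed), updates[i]);
          old = atomicCAS(addr_as_u32, assumed, __float_as_uint(updated));
        } while (old != assumed);  // retry if another thread raced us
      }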
* Simplify ir_emitter_unnested so that it doesn't look inside the conv custom call and try to understand it (Tim Shen, 2018-09-19)
  convolution_thunk does it anyway.
  PiperOrigin-RevId: 213676051
* Simplify convolution_thunk's interface. (Tim Shen, 2018-09-10)
  PiperOrigin-RevId: 212370999
* [XLA:GPU] Clean up init thunk handling to handle arbitrary fused init values (Benjamin Kramer, 2018-09-07)
  I put this in as a quick hack because init_value is usually a constant, but
  it's really easy to construct a case where it's not. The code also became
  more complex because of the constant buffer work; sharing that with the
  fused IR emitter is a good thing.
  PiperOrigin-RevId: 211936337
* [XLA:GPU] Refactor some code for fusion output handling. (Bixia Zheng, 2018-09-06)
  Move routine ConstructIrArrayForOutputs to class IrEmitter so that it can be
  used in classes IrEmitterNested and IrEmitterUnnested.

  Move the code that stores the address of each individual output of a
  multiple-output fusion into the fusion's tuple buffer to an overloaded
  version of routine llvm_ir::EmitTuple, so that we can reduce code
  duplication.
  PiperOrigin-RevId: 211884483
* Call Cudnn also for grouped convolutions. (Adrian Kuegel, 2018-09-03)
  Cudnn supports grouped convolutions, so we don't need the
  ConvolutionFeatureGroupConverter pass and can instead set the group_count
  parameter on the cudnn custom calls.
  PiperOrigin-RevId: 211339551
* Change headers to directly include absl::Span, and clean up the build dependencies as well (Tim Shen, 2018-08-30)
  PiperOrigin-RevId: 211038094
* [XLA] Rename all (Mutable)ArraySlice to absl::Span. (Tim Shen, 2018-08-30)
  PiperOrigin-RevId: 210998142
* [XLA] xla::ContainersEqual -> absl::c_equal (Benjamin Kramer, 2018-08-30)
  The replacement for the initializer_list overload is a bit sad because
  MakeSpan doesn't understand initializer_list (and we don't have CTAD yet).
  PiperOrigin-RevId: 210974939
* Use a mixin to reduce llvm::IRBuilder<> related boilerplate. (Sanjoy Das, 2018-08-27)
  PiperOrigin-RevId: 210472260
* [XLA] Switch to absl::StrFormat. (Justin Lebar, 2018-08-27)
  Unlike Printf, StrFormat does not require type-length qualifiers, e.g. %z,
  %ll. Nor does it require that you call c_str() to print strings. So these
  are fixed up here as well.
  PiperOrigin-RevId: 210435915
* [XLA] Unify spelling of 'fusible' (Benjamin Kramer, 2018-08-27)
  Of {fusable, fusile, fusible} my dictionary only knows about fusible.
  PiperOrigin-RevId: 210373347
* [XLA] Use absl string types and functions instead of the TF versions. (Justin Lebar, 2018-08-23)
  Unfortunately this has to be one big patch, because e.g. absl::StrCat
  doesn't accept a TF StringPiece, but as soon as we switch to
  absl::string_view, we have to switch away from all of the TF functions.
  PiperOrigin-RevId: 209957896
* [XLA] gtl::optional -> absl::optional (Yunxing Dai, 2018-08-21)
  PiperOrigin-RevId: 209686671
* Merged commit includes the following changes: (Yifei Feng, 2018-08-21)
  209663919 by yifeif <yifeif@google.com>: Internal change.
  209663914 by amitpatankar <amitpatankar@google.com>: Fix the topk_op_test
    for numpy>1.15.
  209660476 by jdduke <jdduke@google.com>: Fix model lifetime for TensorFlow
    Lite C# bindings. Ensure the model's existence for the duration of the
    interpreter, as per API requirements.
  209655960 by scottzhu <scottzhu@google.com>: Unify RNN Cell interface
    between TF and Keras.
  209655731 by A. Unique TensorFlower <gardener@tensorflow.org>: Added tests
    for PredictionOps and PartitionExamplesOps.
  209655291 by nolivia <nolivia@google.com>: Adding rate class so that we can
    save global_step/sec using tf.contrib.summary. The function takes the rate
    in relation to any tensors provided that the numerator and denominator are
    broadcastable and have dtypes that can be cast to float64.
  209654655 by kramerb <kramerb@google.com>: [XLA] Switch from
    tensorflow::gtl::InlinedVector to absl::InlinedVector. This one comes with
    extra goodies like a move constructor.
  209653851 by A. Unique TensorFlower <gardener@tensorflow.org>: Internal
    build specification change.
  PiperOrigin-RevId: 209663919
* [XLA] Use absl::make_unique instead of xla::MakeUnique. (Justin Lebar, 2018-08-20)
  Same for WrapUnique.
  PiperOrigin-RevId: 209531124
* [XLA] Switch to absl versions of the c_foo functions. (Justin Lebar, 2018-08-20)
  PiperOrigin-RevId: 209502513
* Automated rollback of commit 4a41f50648929197954d892559587cb76458d306 (A. Unique TensorFlower, 2018-08-17)
  PiperOrigin-RevId: 209248552
* [XLA] Switch to absl versions of the c_foo functions. (Justin Lebar, 2018-08-17)
  PiperOrigin-RevId: 209247783
* [XLA] Make sure backends that don't support variadic reduce reject it. (Michael Kuperstein, 2018-08-09)
  PiperOrigin-RevId: 208106767
* [XLA:GPU] Add a generic trip count analysis based on HloEvaluator (Benjamin Kramer, 2018-08-08)
  This simply brute-forces the trip count by repeatedly evaluating the loop
  condition and body. It is a simple extension of the code in
  while_loop_simplifier; make while_loop_simplifier use it.

  The GPU backend has a WhileTransformer, which tries to pattern-match loops
  with a constant trip count. This stopped working a long time ago, so just
  replace it with the common trip count finder. The longer-term goal is to
  move the transformation before fusion and copy insertion so it's less
  fragile.

  The tests that cover this are while_transformer's tests at the moment.
  PiperOrigin-RevId: 207901341
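  [Illustration, not part of the commit] The brute-force idea is roughly the
  following standalone sketch; in the real analysis, `cond` and `body` would
  be HloEvaluator runs of the while instruction's computations, and all the
  names here are hypothetical:

      #include <cstdint>
      #include <functional>
      #include <optional>

      // Sketch: repeatedly evaluate the loop condition on an evolving state
      // until it becomes false or we hit a step limit.
      std::optional<int64_t> ComputeTripCount(
          int64_t state, const std::function<bool(int64_t)>& cond,
          const std::function<int64_t(int64_t)>& body,
          int64_t max_steps = 128) {
        for (int64_t trip_count = 0; trip_count <= max_steps; ++trip_count) {
          if (!cond(state)) return trip_count;  // loop would exit here
          state = body(state);
        }
        return std::nullopt;  // gave up: trip count unknown or too large
      }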
* [XLA:GPU] Forward batched dot to cublas instead of expanding it (Benjamin Kramer, 2018-08-03)
  This gives a huge speedup for users of batchdot. This is a minimal
  implementation without autotuning and without support for strided batch
  gemm.
  PiperOrigin-RevId: 207247740
* [XLA:GPU] Don't emit HostToDevice copies (Sanjoy Das, 2018-08-02)
  This became unnecessary with cl/206243319 "Implement constant buffer
  allocation for XLA:GPU".
  PiperOrigin-RevId: 207204478
* Allow Sort to share the buffer with the operand if it is the only user. (Adrian Kuegel, 2018-07-31)
  The BitonicSort algorithm works in-place, so we can make use of that. On
  GPU, so far we copied the operand to the output and then performed the
  algorithm in-place. Now, we may not need to do this anymore if we see that
  the buffer is shared. Also, we now only need device-to-device copies in case
  the buffer is not shared, because constants are now also assigned a buffer.
  PiperOrigin-RevId: 206745686
* Use constant buffer allocations for XLA:CPU (Sanjoy Das, 2018-07-27)
  This is simpler than the corresponding change to XLA:GPU because on XLA:CPU
  all instructions are codegened, so we can always embed a pointer to the
  constant global variable directly in the generated LLVM IR.
  PiperOrigin-RevId: 206363887
* Implement constant buffer allocation for XLA:GPU (Sanjoy Das, 2018-07-26)
  This CL teaches XLA:GPU to use "normal" buffer assignment for constant
  instructions. Constant instructions are mapped to a BufferAllocation, like
  all other instructions, except the storage for this buffer is allocated
  statically as a global in the generated PTX.

  This CL does not change how we access the constants -- in
  IrEmitterUnnested::BuildKernelThunk (used for top level computations) and in
  HloToIrBindings::EmitBasePointersForHlos (used for nested computations) we
  bind the kConstant instructions to the llvm::GlobalVariable backing them. So
  users of constant instructions still access the globals corresponding to the
  constants directly.

  However, we no longer emit the constant literals inline. Instead we emit a
  constant with a zero initializer and then memcpy in the contents of the
  literal when we load the CUBIN/PTX. This works around compile time issues in
  LLVM and ptxas caused by large constants.

  We also populate `BufferAllocations` with the device pointers for the
  constant globals. This is at least needed for TupleThunk today because
  TupleThunk wants the addresses for the sub-buffers on the host. I'm not sure
  if there are other places in XLA:GPU that rely on there being an entry in
  BufferAllocations for every BufferAllocation.
  PiperOrigin-RevId: 206243319
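  [Illustration, not part of the commit] In LLVM API terms, the
  zero-initialized global described above can be created roughly like this;
  the function name is hypothetical and error handling is omitted:

      #include "llvm/IR/Constants.h"
      #include "llvm/IR/GlobalVariable.h"
      #include "llvm/IR/Module.h"

      // Sketch: emit a zero-initialized global standing in for a constant's
      // buffer. The literal bytes are memcpy'd into the device copy of this
      // global when the CUBIN/PTX is loaded, so they never appear in the PTX
      // text itself; for that reason the global is not marked constant here.
      llvm::GlobalVariable* EmitConstantGlobal(llvm::Module& module,
                                               llvm::Type* literal_type,
                                               llvm::StringRef name) {
        return new llvm::GlobalVariable(
            module, literal_type,
            /*isConstant=*/false, llvm::GlobalValue::ExternalLinkage,
            llvm::ConstantAggregateZero::get(literal_type), name);
      }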
* Support sorting of key/value pairs on GPU. (Adrian Kuegel, 2018-07-26)
  This requires a slight modification of the emitted compare loop: now we use
  another if to check whether we need to swap, instead of two selects. Speed
  is mostly the same, possibly even a little bit faster.
  PiperOrigin-RevId: 206148647
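  [Illustration, not part of the commit] At the source level the two lowering
  strategies look roughly like this; the real code is LLVM IR emitted by the
  sort emitter, and the names here are hypothetical:

      // Two-selects form: each output is picked with a select, so a key/value
      // pair needs four selects and always writes every slot:
      //   new_a = should_swap ? b : a;
      //   new_b = should_swap ? a : b;
      // If-form used after this change: one branch guards the swap of both
      // the key and its companion value.
      __device__ void compare_and_swap(float* keys, int* values, int i, int j) {
        if (keys[i] > keys[j]) {  // one comparison guards the whole swap
          float k = keys[i];   keys[i] = keys[j];     keys[j] = k;
          int v = values[i];   values[i] = values[j]; values[j] = v;
        }
      }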
* [XLA:GPU] Remember to execute non-root outfeed instructions in nested computations (Sanjoy Das, 2018-07-25)
  PiperOrigin-RevId: 206075141
* [XLA:CPU/GPU] Implement the parallel Philox random number generation algorithm. (Bixia Zheng, 2018-07-25)
  Implement the RNG elemental IR generator using the Philox algorithm. To
  ensure that multiple executions of the same RNG HLO instruction rarely
  produce the same result, each time the HLO is executed we increment a global
  variable by the number of random numbers it generates, and use the value of
  the global variable to construct the seed for the RNG algorithm.

  Modify the GPU backend to generate a parallel loop to execute the Philox
  algorithm. The CPU backend still uses a sequential loop to perform Philox
  random number generation; we will need to enhance the
  ParallelTaskAssignment pass to change this.

  Remove the old PCG RNG algorithm for the CPU and GPU backends.
  PiperOrigin-RevId: 206069733
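  [Illustration, not part of the commit] The seeding scheme can be pictured as
  the following simplified sketch; in the real implementation the counter is a
  global in the generated module updated by emitted IR, and these names are
  hypothetical:

      #include <atomic>
      #include <cstdint>

      // Sketch: a counter advanced by the number of random values produced
      // per execution, so repeated runs of the same RNG HLO start their
      // Philox streams from fresh offsets.
      std::atomic<uint64_t> rng_counter{0};

      uint64_t NextSeedOffset(uint64_t values_generated_by_this_hlo) {
        // fetch_add returns the old value: this execution's starting offset.
        return rng_counter.fetch_add(values_generated_by_this_hlo);
      }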
* [XLA:GPU] Don't lie about buffer alignment to LLVM (Sanjoy Das, 2018-07-24)
  PiperOrigin-RevId: 205832336
* Parallelize BitonicSort on GPU. (Adrian Kuegel, 2018-07-24)
  We now emit O(log^2 n) kernel thunks. Each thunk is responsible for looping
  over the other dimensions, and then doing a comparison loop through the
  dimension that should be sorted.
  PiperOrigin-RevId: 205791397
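  [Illustration, not part of the commit] The O(log^2 n) count comes from the
  classic bitonic network: for n = 2^m there are m stages of up to m passes
  each, and each pass here becomes one kernel thunk. A standalone CUDA sketch
  of one pass and the host-side pass loop (a flat 1-D case, ignoring the
  "other dimensions" the commit mentions):

      // One bitonic pass: each thread compares element i with partner i^j and
      // swaps so the direction selected by (i & k) is respected.
      __global__ void bitonic_pass(float* data, unsigned j, unsigned k) {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned partner = i ^ j;
        if (partner > i) {
          bool ascending = (i & k) == 0;
          if ((data[i] > data[partner]) == ascending) {
            float t = data[i]; data[i] = data[partner]; data[partner] = t;
          }
        }
      }

      // Host side: O(log^2 n) launches for n a power of two -- each (k, j)
      // pair corresponds to one kernel thunk in the emitted thunk sequence.
      void bitonic_sort(float* data, unsigned n) {
        for (unsigned k = 2; k <= n; k <<= 1)
          for (unsigned j = k >> 1; j > 0; j >>= 1)
            bitonic_pass<<<(n + 255) / 256, 256>>>(data, j, k);
      }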
* [XLA:GPU] Add an operator<< to Thunk::Kind. (Bixia Zheng, 2018-07-23)
  This allows the use of CHECK_EQ with Thunk::Kind values.
  PiperOrigin-RevId: 205775065
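  [Illustration, not part of the commit] A minimal sketch of such an operator;
  the stub type and enumerator names are illustrative, not the full
  Thunk::Kind list:

      #include <ostream>

      // Stand-in for the real Thunk class; only Kind matters here.
      struct Thunk {
        enum class Kind { kCopy, kGemm, kKernel };
      };

      // Streaming the enum lets CHECK_EQ print readable names on failure.
      std::ostream& operator<<(std::ostream& os, Thunk::Kind kind) {
        switch (kind) {
          case Thunk::Kind::kCopy:   return os << "kCopy";
          case Thunk::Kind::kGemm:   return os << "kGemm";
          case Thunk::Kind::kKernel: return os << "kKernel";
        }
        return os << "<unknown Thunk::Kind>";
      }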
* [XLA:GPU] Make sure that buffers for tuple() have a unique top-level allocation (Benjamin Kramer, 2018-07-23)
  There are edge cases where a top-level allocation exists but it's ambiguous.
  PiperOrigin-RevId: 205665320
* [XLA:GPU] Limit the number of shmem tiles XLA:GPU will use for 021 transposes. (Justin Lebar, 2018-07-20)
  There's a limit to how much shared memory we can use.
  PiperOrigin-RevId: 205465441
* [XLA] s/ir_builder/b/ (Justin Lebar, 2018-07-20)
  Brevity.
  PiperOrigin-RevId: 205454869
* Support unsigned indices for in-place DynamicUpdateSlice. (Adrian Kuegel, 2018-07-18)
  For unsigned indices, we need to use unsigned comparisons when clamping the
  start_indices. Also rename the files from ops.* to
  dynamic_update_slice_util.*
  PiperOrigin-RevId: 205072344
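  [Illustration, not part of the commit] Why the comparison signedness
  matters: with an unsigned index type a "negative" start wraps to a huge
  value, and a signed clamp would wrongly pass it through. A sketch with
  hypothetical helper names, assuming the update fits in the dimension:

      #include <algorithm>
      #include <cstdint>

      // Clamp a dynamic-update-slice start index to [0, dim - upd].
      // Unsigned case: the max-with-0 is a no-op and the min must be an
      // *unsigned* comparison, or wrapped-around values slip through.
      uint32_t clamp_start_unsigned(uint32_t start, uint32_t dim, uint32_t upd) {
        return std::min(start, dim - upd);  // unsigned compare
      }

      int32_t clamp_start_signed(int32_t start, int32_t dim, int32_t upd) {
        return std::min(std::max(start, 0), dim - upd);  // signed compare
      }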
* Implement BitonicSort for GPU. (Adrian Kuegel, 2018-07-18)
  This is a first version; several things are still missing:
  - Support for key/value sorting.
  - Support for other types than F32, S32 and U32.
  - Parallelization of the inner loop.
  PiperOrigin-RevId: 205052657
* [XLA:GPU] Generalize the column reduction algorithm to handle tile widths greater than 1 (Thomas Joerg, 2018-07-17)
  Tiles of width 1 result in poor memory bandwidth for 16b inputs.
  PiperOrigin-RevId: 205033124
* [XLA] Use shfl.sync.down instead of shfl.down. (Justin Lebar, 2018-07-13)
  shfl.down is deprecated and ptxas 9.2 emits a loud warning when you use it.
  Convert XLA to shfl.sync.down. This change makes XLA:GPU require CUDA 9.
  PiperOrigin-RevId: 204546742
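  [Illustration, not part of the commit] At the CUDA source level these PTX
  instructions correspond to __shfl_down vs __shfl_down_sync. A typical warp
  reduction with the synchronizing intrinsic, which lowers to shfl.sync.down:

      // Warp-level sum reduction using the CUDA 9 synchronizing shuffle.
      // 0xffffffff = all 32 lanes participate.
      __device__ float warp_reduce_sum(float val) {
        for (int offset = 16; offset > 0; offset /= 2)
          val += __shfl_down_sync(0xffffffffu, val, offset);
        return val;  // lane 0 holds the full warp sum
      }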
* [XLA:GPU] s/llvm_ir::IrArray/IrArray/ in ir_emitter_unnested. (Justin Lebar, 2018-07-11)
  Less visual noise.
  PiperOrigin-RevId: 204139183
* [XLA:GPU] Cleanups to fused 021 transpose implementation. (Justin Lebar, 2018-07-11)
  - Fix typos.
  - Clarify comments.
  - Reduce nesting in a few places.
  - Add asserts that this code is dealing with specifically a loop fusion.
  - Rename some functions. In particular, it's confusing to have a function
    with a generic name like EmitCodeWithBoundCheck that actually is
    specialized to a tiled implementation.
  - Remove statement expression (GCC language extension), replacing it with an
    IIFE.
  - Don't refer to shared-memory tile space as "buffer" without other
    qualifying words, since that's ambiguous with what XLA refers to as a
    "buffer".
  - Use llvm::cast instead of static_cast.
  - Comply with style guide naming rules for compile-time constants (kFoo).
  - Use c_accumulate instead of std::accumulate.
  - Put std::function parameter at the end of the param list. This lets us
    cleanly embed the lambda into the call because of how clang-format formats
    such calls. (I think this one is possibly the most helpful change in this
    patch, as it suddenly makes clear to me the way that we use two calls to
    emit_tiled_elemental_code_with_bounds_check to emit the code.)
  PiperOrigin-RevId: 204134102
* [XLA:GPU] Implement outfeed (Benjamin Kramer, 2018-07-10)
  Infeed and outfeed manager are really similar but not quite the same; I'm
  open to ideas on how to factor them better. This has a much cleaner design
  for OutfeedManager than we have for InfeedManager; I'll look into cleaning
  up InfeedManager in a follow-up.
  PiperOrigin-RevId: 204012304
* [XLA:GPU] Delete AnnotateBufferLoadStoreInstructionWithMetadata. (Justin Lebar, 2018-07-10)
  This function was trying to annotate loads/stores to shared memory with
  metadata copied from another IrArray. But it was over-eager and copied more
  than we wanted. In addition, it CHECKed that the source buffer was
  invariant, and that's not necessarily true.

  For now, remove this function, which is just a performance optimization.
  PiperOrigin-RevId: 204005618
* [XLA:GPU] Enhance the tiled 0-2-1 transpose algorithm to handle fusion. (Bixia Zheng, 2018-07-04)
  Add class TiledParameterInfo to provide information for FusedIrEmitter to
  read the content of a tiled parameter from the tile buffer instead of the
  original input memory.

  Reimplement the tiled 0-2-1 transpose algorithm, previously used for copy
  instructions only, in a more general way so that it can handle both fusion
  instructions and copy instructions.

  The original tiled 0-2-1 transpose implementation incorrectly used
  (tile_size+1) rows for a tile buffer to reduce shared memory bank conflicts,
  while it should be (tile_size+1) columns instead. This is a performance
  issue and is fixed in the new implementation.

  The original tiled 0-2-1 transpose implementation did not generate LLVM
  alias metadata for the loads and stores of the tensors. This was due to a
  bug where function IrArray::CastToShape missed copying metadata to the new
  IrArray object. This is also a performance issue and is fixed in this
  change.

  Modified KernelSupportLibrary to support emitting an if-stmt with a given
  branch name prefix.

  Add test cases to test the new implementation.
  PiperOrigin-RevId: 203310403
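  [Illustration, not part of the commit] The (tile_size+1) padding trick in
  CUDA source form, as a standalone sketch of the general idea rather than the
  emitted IR: padding each shared-memory row by one column makes the row
  stride 33 floats, so a column-wise walk touches 32 distinct banks instead of
  hammering one.

      constexpr int kTile = 32;

      // Tiled transpose; launch with blockDim = dim3(kTile, kTile).
      // The +1 column of padding changes the row stride so column-wise
      // accesses hit different shared memory banks.
      __global__ void transpose021(const float* in, float* out,
                                   int rows, int cols) {
        __shared__ float tile[kTile][kTile + 1];  // +1: no bank conflicts
        int x = blockIdx.x * kTile + threadIdx.x;
        int y = blockIdx.y * kTile + threadIdx.y;
        if (x < cols && y < rows)
          tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
        __syncthreads();
        int tx = blockIdx.y * kTile + threadIdx.x;  // transposed coordinates
        int ty = blockIdx.x * kTile + threadIdx.y;
        if (tx < rows && ty < cols)
          out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
      }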
* Profile SequentialThunks if they represent one HloInstruction. (Adrian Kuegel, 2018-07-04)
  SequentialThunks are used in two different ways: sometimes as a sequence of
  individual thunks for different HloInstructions, and sometimes for one
  HloInstruction which consists of several thunks. For the latter, we want to
  measure the total time taken by the HloInstruction. Previously, we would
  instead measure the time of the last thunk from the SequentialThunk.
  PiperOrigin-RevId: 203258617
* [TF:XLA] Split literal_util into {literal, literal_util}. (Kay Zhu, 2018-07-03)
  Currently the Literal classes sit in literal_util.{h,cc} instead of
  literal.{h,cc}. literal_util also contains helper functions that are a
  better fit as their own separate class/namespace. This change starts this
  process by moving most static factory methods to the LiteralUtil namespace.
  PiperOrigin-RevId: 203217065
* [TF:XLA] Split select HLO into array- and tuple-select. (A. Unique TensorFlower, 2018-07-03)
  Array select and tuple-select are already handled separately in all backends
  and HLO passes: array select is an elementwise operation, where the shapes
  of the two operands have the same dimensions; tuple select does not define
  its own output, but instead forwards the true- or false-operand based on a
  scalar predicate operand.

  This CL reflects this by adding a new kTupleSelect HLO. The XLA builder
  interface stays the same and dispatches based on the operand shapes. No
  change in the operation semantics; this CL just splits the existing select
  operation into two opcodes and preserves the existing semantics. HLO cost
  analysis is fixed to handle the two ops appropriately.
  PiperOrigin-RevId: 203180342
* Fix check whether there is more than one tile. (Adrian Kuegel, 2018-06-27)
  The previous check was checking the number of elements in a tile against the
  number of elements in the input shape. This doesn't work if one dimension of
  the tile is bigger than the input dimension, but the other dimension is
  smaller.
  PiperOrigin-RevId: 202326635
* [XLA:GPU] Make the input-fused reduce emitter work on 16-bit types (Benjamin Kramer, 2018-06-26)
  There's a bunch of things going on here:
  - BuildInitializerThunk threw away half of 16-bit init values. Fix that.
  - Make HandleFusion verify that it gets input-fusible reduces.
  - Fuse BF16 again in multi-output fusion. This was a workaround for the
    initializer bug.
  - Drop the 32-bit requirement from unfused reduce emission. It is really
    confusing to have different code paths for fused and unfused reduces.
  - Emit 8/16-bit integer add/min/max as CAS.
  This is somewhat covered by existing tests.
  PiperOrigin-RevId: 202125572
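  [Illustration, not part of the commit] The 8/16-bit CAS emulation in the
  last bullet works by read-modify-writing the containing aligned 32-bit word,
  since hardware atomics don't cover these narrow types. A CUDA sketch for a
  16-bit atomic add (the helper name is hypothetical):

      #include <cstdint>

      // Emulate a 16-bit atomic add with a 32-bit atomicCAS by updating the
      // half of the aligned 32-bit word that contains the target element.
      __device__ void atomic_add_u16(uint16_t* addr, uint16_t val) {
        auto* word = reinterpret_cast<uint32_t*>(
            reinterpret_cast<uintptr_t>(addr) & ~uintptr_t{3});
        uint32_t shift = (reinterpret_cast<uintptr_t>(addr) & 2u) * 8u;
        uint32_t mask = 0xFFFFu << shift;
        uint32_t old = *word, assumed;
        do {
          assumed = old;
          uint16_t cur = static_cast<uint16_t>((assumed & mask) >> shift);
          uint32_t updated = (assumed & ~mask) |
                             (uint32_t(uint16_t(cur + val)) << shift);
          old = atomicCAS(word, assumed, updated);  // retry on racing writers
        } while (old != assumed);
      }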