path: root/tensorflow/compiler/xla/service/gpu/gpu_executable.cc
* [XLA] Migrate from gtl::FlatMap to absl::flat_hash_map (Benjamin Kramer, 2018-10-01)
    PiperOrigin-RevId: 215272497
* [XLA] Rename all (Mutable)ArraySlice to absl::Span. (Tim Shen, 2018-08-30)
    PiperOrigin-RevId: 210998142
* [XLA] Switch to absl::StrFormat. (Justin Lebar, 2018-08-27)
    Unlike Printf, StrFormat does not require type-length qualifiers, e.g. %z, %ll. Nor does it require that you call c_str() to print strings. So these are fixed up here as well.
    PiperOrigin-RevId: 210435915
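    The snippet below is not from this commit; it is a minimal sketch of what the switch looks like at a call site (the function name and format string are made up for illustration):

        #include <cstddef>
        #include <string>
        #include "absl/strings/str_format.h"

        // Printf-style formatting needs exact length modifiers and an explicit
        // c_str() for std::string arguments:
        //   Printf("allocated %zu bytes for %s", size, name.c_str());
        // absl::StrFormat deduces argument types, so plain %d/%s are enough.
        std::string DescribeAllocation(size_t size, const std::string& name) {
          return absl::StrFormat("allocated %d bytes for %s", size, name);
        }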
* [XLA] gtl::optional -> absl::optional (Yunxing Dai, 2018-08-21)
    PiperOrigin-RevId: 209686671
* [XLA] Use absl::make_unique instead of xla::MakeUnique. (Justin Lebar, 2018-08-20)
    Same for WrapUnique.
    PiperOrigin-RevId: 209531124
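    The replacement is mechanical; a small sketch (the function and element type are chosen only for the example):

        #include <memory>
        #include <vector>
        #include "absl/memory/memory.h"

        std::unique_ptr<std::vector<int>> MakeScratchBuffer() {
          // Previously: xla::MakeUnique<std::vector<int>>(16, 0);
          return absl::make_unique<std::vector<int>>(16, 0);
          // Likewise, xla::WrapUnique(raw_ptr) becomes absl::WrapUnique(raw_ptr).
        }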
* Make GemmThunkShouldHaltAllActivityBeforeRunning easier to understand; NFC (Sanjoy Das, 2018-08-09)
    PiperOrigin-RevId: 208110715
* [XLA] Clean up clang tidy readability warnings in compiler/xla (Benjamin Kramer, 2018-08-06)
    * lambda capture 'builder' is not used
    * using decl 'Printf' is unused
    * lambda capture 'this' is not used (17 times)
    * lambda capture 'buffer_liveness' is not used
    * lambda capture 'computation' is not used
    * lambda capture 'operand_to_generator' is not used
    * lambda capture 'M' is not used
    * using decl 'InvalidParameterArgument' is unused
    * lambda capture 'sum' is not used
    * lambda capture 's' is not used
    * lambda capture 'epsilon' is not used
    PiperOrigin-RevId: 207542895
* Use constant buffer allocations for XLA:CPU (Sanjoy Das, 2018-07-27)
    This is simpler than the corresponding change to XLA:GPU because on XLA:CPU all instructions are codegened, so we can always embed a pointer to the constant global variable directly in the generated LLVM IR.
    PiperOrigin-RevId: 206363887
* [XLA:GPU] Only add the cubin if it is available (Benjamin Kramer, 2018-07-27)
    It's only non-empty if we were able to run ptxas. If the PTX is going to be JIT'ed by the driver it won't be around. Loading an empty cubin will result in a fatal error.
    PiperOrigin-RevId: 206341931
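    The guard is conceptually simple; the sketch below is illustrative only, with stand-in helpers rather than the real StreamExecutor/GpuExecutable API:

        #include <string>

        // Stand-ins for the two loading paths; declared here only to make the
        // sketch self-contained.
        void LoadCompiledCubin(const std::string& cubin);
        void LoadPtxForDriverJit(const std::string& ptx);

        void LoadGpuBinary(const std::string& cubin, const std::string& ptx) {
          if (!cubin.empty()) {
            LoadCompiledCubin(cubin);   // ptxas ran; load the compiled binary.
          } else {
            LoadPtxForDriverJit(ptx);   // no cubin; the driver JITs the PTX.
          }
        }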
* Implement constant buffer allocation for XLA:GPU (Sanjoy Das, 2018-07-26)
    This CL teaches XLA:GPU to use "normal" buffer assignment for constant instructions. Constant instructions are mapped to a BufferAllocation, like all other instructions, except the storage for this buffer is allocated statically as a global in the generated PTX.

    This CL does not change how we access the constants -- in IrEmitterUnnested::BuildKernelThunk (used for top level computations) and in HloToIrBindings::EmitBasePointersForHlos (used for nested computations) we bind the kConstant instructions to the llvm::GlobalVariable backing them. So users of constant instructions still access the globals corresponding to the constants directly.

    However, we no longer emit the constant literals inline. Instead we emit a constant with a zero initializer and then memcpy in the contents of the literal when we load the CUBIN/PTX. This works around compile time issues in LLVM and ptxas caused by large constants.

    We also populate `BufferAllocations` with the device pointers for the constant globals. This is at least needed for TupleThunk today because TupleThunk wants the addresses for the sub-buffers on the host. I'm not sure if there are other places in XLA:GPU that rely on there being an entry in BufferAllocations for every BufferAllocation.

    PiperOrigin-RevId: 206243319
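    A rough sketch of the "zero initializer, then memcpy at load time" idea. ConstantInfo and ResolveGlobal are invented for the illustration; Stream::ThenMemcpy is the real StreamExecutor host-to-device copy (se is the stream_executor alias introduced by a commit further down this log):

        #include <cstdint>
        #include <string>
        #include <vector>

        // Hypothetical bookkeeping for one constant emitted as a zero-initialized
        // global in the generated PTX.
        struct ConstantInfo {
          std::string symbol_name;       // name of the global in the module
          std::vector<uint8_t> literal;  // host copy of the literal's bytes
        };

        // Hypothetical lookup of a global's device address after module load.
        se::DeviceMemoryBase ResolveGlobal(const std::string& symbol_name);

        // Once the CUBIN/PTX is loaded, copy each literal into its global.
        void PopulateConstantGlobals(se::Stream* stream,
                                     const std::vector<ConstantInfo>& constants) {
          for (const ConstantInfo& c : constants) {
            se::DeviceMemoryBase global = ResolveGlobal(c.symbol_name);
            stream->ThenMemcpy(&global, c.literal.data(), c.literal.size());
          }
        }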
* Replace generic Pool with StreamPool, and discard failed streams. (Todd Wang, 2018-07-25)
    We have a Pool in XLA that maintains a freelist of Streams, to avoid the overhead of repeatedly allocating new Streams. Streams have a monotonic state machine; if a stream encounters any error, it will remain in an error state forever.

    The functional change in this CL is to ensure that streams which have encountered an error are deleted, rather than being put back on the pool. Without this change, a previously failed stream will be put back on the pool, only to cause the next usage of the stream to trivially fail.

    I've chosen to replace the generic templatized Pool with a concrete StreamPool, since this makes the logic more straightforward to reason about. Also note that the only existing usage of Pool is to hold streams.

    The functional change is in stream_pool.cc; most of everything else is mechanical updates.

    PiperOrigin-RevId: 206100631
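    The heart of the behavioral change can be pictured as below. This is an illustration of the idea rather than the actual stream_pool.cc code; Stream::ok() is the real StreamExecutor check for whether a stream has entered its error state:

        #include <memory>
        #include <utility>
        #include <vector>

        // Simplified pool: a stream that has hit an error is destroyed instead of
        // being recycled, since every later use of it would trivially fail.
        class StreamPoolSketch {
         public:
          void ReturnStream(std::unique_ptr<se::Stream> stream) {
            if (stream->ok()) {
              freelist_.push_back(std::move(stream));
            }
            // Otherwise the unique_ptr destroys the failed stream here.
          }

         private:
          std::vector<std::unique_ptr<se::Stream>> freelist_;
        };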
* Do profiling inside while thunks and conditionals. (Adrian Kuegel, 2018-07-02)
    We now look into the computations of kWhile and kConditional ops when profiling. This still does not help regarding the statistics of the estimated optimum, but at least we can see the relative performance of the ops within a subcomputation.
    PiperOrigin-RevId: 202916616
* Extract HloExecutionProfiler into its own file. (Adrian Kuegel, 2018-06-14)
    This is in preparation of passing it on to the Thunks, so that we can profile HloInstructions within a while loop.
    PiperOrigin-RevId: 200532394
* [XLA:GPU] Add op-tracing to XLA:GPU. (Justin Lebar, 2018-05-16)
    PiperOrigin-RevId: 196912575
* [XLA:GPU] Load kernel thunks' kernels before running them. (Justin Lebar, 2018-05-11)
    The motivation here is that with --xla_hlo_profile, we count the time spent in Thunk::ExecuteOnStream, but we don't want to count the time spent loading the CUDA code into the GPU as time spent in the first kernel thunk we try to run.
    PiperOrigin-RevId: 196314733
* [XLA:GPU] Remove unused Thunk::ShouldBlockFutureThunks function. (Justin Lebar, 2018-05-10)
    PiperOrigin-RevId: 196206896
* [XLA] Make XLA's memory allocator return an owning smart pointer. (Justin Lebar, 2018-05-09)
    Previously, xla::DeviceMemoryAllocator::Allocate returned a stream_executor::DeviceMemoryBase. This is morally equivalent to a raw pointer: it's on you, the user, to call Deallocate().

    Unfortunately we ~never got this right. Essentially all users of Allocate() call it in a loop, and TF_RETURN_IF_ERROR within the loop. If any of these allocations fails (most commonly, due to OOM), we leak everything we've allocated up until then.

    This patch changes our API so that it returns an owning pointer. Now things mostly Just Work.

    Also worth calling out: the lambda in CpuExecutable::ExecuteOnStream passed to ExecuteComputeFunction almost certainly had multithreaded use-after-free bugs. This patch fixes them.

    PiperOrigin-RevId: 196000535
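    The leak pattern and the fix can be sketched as follows. OwningBuffer and TryAllocate are stand-ins, not the real XLA allocator types; the point is only that early returns no longer leak the allocations made so far:

        #include <cstddef>
        #include <memory>
        #include <optional>
        #include <vector>

        // Stand-in for an owning device allocation: memory is freed on destruction.
        using OwningBuffer = std::unique_ptr<char[]>;

        // Hypothetical allocation call that can fail (e.g. OOM).
        std::optional<OwningBuffer> TryAllocate(size_t size);

        // Allocation in a loop with an early return on failure. Because each
        // element owns its memory, the early return destroys `buffers` and
        // releases the partial allocations instead of leaking them.
        std::optional<std::vector<OwningBuffer>> AllocateAll(
            const std::vector<size_t>& sizes) {
          std::vector<OwningBuffer> buffers;
          for (size_t size : sizes) {
            std::optional<OwningBuffer> buf = TryAllocate(size);
            if (!buf.has_value()) return std::nullopt;  // early return, no leak
            buffers.push_back(std::move(*buf));
          }
          return buffers;
        }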
* [XLA] Make Executable return a ScopedShapedBuffer. (Justin Lebar, 2018-04-22)
    Previously, we returned a plain ShapedBuffer. But this doesn't capture our semantics: It's up to the callee to free this ShapedBuffer.
    PiperOrigin-RevId: 193854051
* [XLA] De-unique_ptr-ify ShapedBuffer and ScopedShapedBuffer. (Justin Lebar, 2018-04-19)
    These are already notionally equivalent to T* and unique_ptr<T>, so having a unique_ptr of a {Scoped,}ShapedBuffer is pretty redundant. Also clean up the ScopedShapedBuffer API a bit.
    PiperOrigin-RevId: 193599773
* [XLA] Convert XLA to use xla::se as a namespace alias for ::stream_executor. (Justin Lebar, 2018-04-17)
    PiperOrigin-RevId: 193301997
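    For reference, such an alias is a one-liner; a sketch of what it presumably looks like (the header it actually lives in is not shown here):

        namespace xla {
        // Inside namespace xla, se:: now refers to ::stream_executor, so code can
        // write se::Stream, se::DeviceMemoryBase, and so on.
        namespace se = ::stream_executor;
        }  // namespace xla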
* [XLA] Assert that all buffers and sub-buffers passed to XLA have an explicit pointer. (Justin Lebar, 2018-03-27)
    In the past, we allowed sub-buffers to be null if the top-level tuple was non-null. This doesn't actually work well on the GPU: For ops that are implemented using cudnn or cublas, we have to have a pointer to the sub-buffer on the host in order to make the call. Retrieving it from the GPU in an efficient manner is complicated, and the best we can come up with isn't all that efficient (fundamentally having to pull data down from the GPU blocks the ability of the CPU to "run ahead" of the GPU).

    Since TF wasn't making use of our flexibility *anyway*, we add the requirement that XLA be given non-null pointers to all sub-buffers. Changes to the XLA:GPU backend to take advantage of this will come separately.

    PiperOrigin-RevId: 190700021
* [XLA:GPU] Fix HLO profiling when multiple streams are involved. (Justin Lebar, 2018-02-26)
    We were enqueueing the timer on the main stream, but not blocking the substreams, so the results were nonsensical.
    PiperOrigin-RevId: 187032412
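    One way to picture the fix; this is an illustration rather than the actual HloExecutionProfiler code (the function name is invented), though Stream::ThenWaitFor is the real StreamExecutor way to order work across streams:

        #include <vector>

        // Start the substreams after the timer starts on the main stream, and make
        // the main stream wait for the substreams before the timer stops, so the
        // measured interval covers work enqueued on every stream.
        void ProfileAcrossStreams(se::Stream* main_stream,
                                  const std::vector<se::Stream*>& sub_streams) {
          // ... timer start enqueued on main_stream ...
          for (se::Stream* sub : sub_streams) {
            sub->ThenWaitFor(main_stream);
          }
          // ... thunks enqueue work on main_stream and the substreams ...
          for (se::Stream* sub : sub_streams) {
            main_stream->ThenWaitFor(sub);
          }
          // ... timer stop enqueued on main_stream ...
        }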
* [XLA:GPU] Don't crash when the root instruction of a computation is a multi-output fusion node, and avoid some pointer chasing with tuples. (Justin Lebar, 2018-02-13)
    Previously, the kernels we generated would have one argument per *top-level* buffer of the input/output. This was fine for inputs. But it doesn't work for outputs: Imagine you're a node that returns a tuple -- e.g. multi-output fusion -- if all you get is a pointer to the top-level buffer of your output (which should contain pointers to the lower-level buffers at some point, but at the moment is just empty), how are you supposed to figure out where to write your output?

    (This usually worked because most of the time your output would live inside of the big XLA temp buffer, and kernels always get a pointer to that.)

    Now we pass all the buffers, top-level and otherwise, to our kernel. In addition, we're now willing to dereference statically tuples that live entirely in XLA's temp buffer. Pointers in input tuples must still be dereferenced dynamically, because the caller has the option of giving us these values or not when invoking XLA.

    This change makes some parts of BufferAssignment/BufferAllocations more truthful. Previously, if you passed a tuple-shaped input to XLA, we'd say in BufferAllocations that the pointer for some subshape of the param was the *top-level tuple pointer*. XLA then knew that this was a lie and would dereference it accordingly. Now we have an explicit notion of a BufferAllocation pointing to a subshape of an input parameter.

    PiperOrigin-RevId: 185614060
* [TF:XLA] Replace most of HloProfilePrinter by a protocol buffer (Sanjoy Das, 2018-01-24)
    This change replaces the meat of HloProfilePrinter with a protobuf HloProfilePrinterData. The original plan was to serialize HloProfilePrinter into C++ source code and put that in a .cc file along with the string for the xla::ProgramShape. However, since we now directly serialize xla::ProgramShape into a .o file, for consistency I think we should do the same thing for HloProfilePrinter (instead of adding yet another output file to tfcompile).

    The change itself is fairly simple; it is large mostly due to the mass renaming I had to do.

    PiperOrigin-RevId: 183158192
* [XLA] Clean up our handling of ExecutionProfile and add a test case (Sanjoy Das, 2018-01-10)
    ExecutionProfile::compute_cycle_count never worked for CPU and GPU with Hlo profiling disabled, as far as I can tell.
    PiperOrigin-RevId: 181517824
* [XLA:GPU] Fix implementation of Thunk::ShouldBlockFutureThunks. (Justin Lebar, 2017-12-22)
    The old implementation did not do what it was intended to do. Instead of making the next thunk on each other stream wait for this thunk to complete, it made all following thunks on this stream wait for this thunk to complete. (Which is redundant, since streams are serialized anyway.)
    PiperOrigin-RevId: 179938424
* [XLA:GPU] Make the use of scratch allocator in convolution_thunk safe. (A. Unique TensorFlower, 2017-12-20)
    Add member function Thunk::ShouldFutureScheduledThunksDependOn for convolution_thunk to tell the thunk executor that all future scheduled thunks should wait for convolution_thunk. This can ensure that the use of the scratch allocator in convolution_thunk is safe.
    PiperOrigin-RevId: 179628764
* Merged commit includes the following changes: (A. Unique TensorFlower, 2017-12-18)
    179277894 by gunan:
        Run buildifier on build file.

    179275101 by meheff:
        Replace DeviceMemoryBase with ShapedBuffer in XLA interfaces. Executable, TransferManager, and AllocationTracker now use ShapedBuffer to hold device memory addresses holding XLA data. Most of the change is straight-forward with the exception of AllocationTracker, which was mostly rewritten (and simplified), and some refactoring in the CPU executable.
        Also, have ShapedBuffer hold on-host and on-device Shapes, which are the shapes of the representation of the data on the host and device, respectively. This is necessary because with cl/178624364 the on-host and on-device shape may no longer be equal.

    179265385 by A. Unique TensorFlower:
        Return error rather than CHECK fail in Executable::ExecuteOnStreamWrapper.

    179264551 by dandelion:
        Internal fixes.

    PiperOrigin-RevId: 179277894
* Rename Stream::BlockHostUntilDoneWithStatus to BlockHostUntilDone. (A. Unique TensorFlower, 2017-12-13)
    PiperOrigin-RevId: 178951330
* Use BlockHostUntilDoneWithStatus in various places. (A. Unique TensorFlower, 2017-12-11)
    PiperOrigin-RevId: 178723711
* [XLA:GPU] Don't autotune while other kernels are running. (Justin Lebar, 2017-12-05)
    XLA:GPU autotunes gemm and conv thunks, trying multiple algorithms in sequence and picking the fastest one. If other work is running concurrently with our autotuning, this can mess up the results. In particular, even if the GPU is totally deterministic, the concurrent work may finish before we finish autotuning, giving an unfair advantage to the later algorithms.

    To address this, we modify GpuExecutable to wait until the GPU is quiescent before executing a thunk which performs autotuning. We then cross our fingers and hope that whatever is fastest while the GPU is quiescent will also be fastest in the "real world", with (potentially) concurrent work going on.

    PiperOrigin-RevId: 178041481
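    A rough sketch of the wait-for-quiescence idea in the thunk loop. The ShouldHaltAllActivityBeforeRunning name echoes the GemmThunk method mentioned near the top of this log; the surrounding loop is illustrative, not the real GpuExecutable::ExecuteThunks:

        // Before running a thunk that autotunes (gemm/conv), drain the GPU so that
        // no concurrent work skews the timing of the candidate algorithms.
        for (Thunk* thunk : thunk_sequence) {
          if (thunk->ShouldHaltAllActivityBeforeRunning(main_stream)) {
            TF_RETURN_IF_ERROR(main_stream->BlockHostUntilDone());  // GPU quiescent
          }
          TF_RETURN_IF_ERROR(thunk->ExecuteOnStream(buffer_allocations, main_stream));
        }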
* Place HloProfilePrinter and HloProfileIndexMap in Executable (Sanjoy Das, 2017-11-22)
    This refactoring will later allow XlaCompiledCpuFunction to pull out the HloProfilePrinter from Executable and use that to display the hlo execution profile. A de/serialized HloProfilePrinter will let AOT compiled binaries display their Hlo execution profile.
    PiperOrigin-RevId: 176689528
* GPU JIT improvements. (Artem Belevich, 2017-11-07)
    * Use ptxas to compile generated PTX.
    * Run PTX compilations in parallel.
    * Cache results of PTX compilation.
    PiperOrigin-RevId: 174921332
* Rename (Add|Get)ProfileResult to something more specific; NFC (Sanjoy Das, 2017-10-31)
    PiperOrigin-RevId: 174084570
* Remove "hybrid" HloModuleConfig option. The option was used to generate ↵Gravatar Mark Heffernan2017-10-04
| | | | | | | | executables which only generated the array values of tuple-shaped outputs, not the tuple index tables.. With cl/170133015, ShapedBuffers which hold the computation output now have materialized tuples with these index tables so this option is no longer desired or necessary. No functional change. Just cleanup. PiperOrigin-RevId: 171035738
* Add more `const`s to xla::Executable. No functional change. (A. Unique TensorFlower, 2017-09-27)
    PiperOrigin-RevId: 170252047
* For tuple-shaped data, change ShapedBuffer (an abstraction holding on-device data of a given shape) to also hold an array of pointers representing the tuple structure in the device memory. (Mark Heffernan, 2017-09-26)
    Previously ShapedBuffer only held array-shaped data at the leaves of the tuple shape. Construction of these arrays-of-pointers is handled by TransferManager, which has to construct arrays-of-pointers anyway to transfer literals to the device. This change makes ShapedBuffer match the native representation of tuple-shaped data passed into XLA computations. This is the first step to migrating XLA interfaces away from using naked device memory pointers (DeviceMemoryBase) to using more expressive ShapedBuffers instead.

    This change enables tuple-shaped parameters in computations run through the LocalClient interface. Also, change LocalClient interfaces to return ScopedShapedBuffers, as these are generally easier to deal with ownership-wise than ShapedBuffers. They are analogous to std::unique_ptr, while ShapedBuffers are analogous to bare pointers.

    This change includes a couple other cleanups found along the way:
    * Move cpu/gpu/interpreter transfer managers into their respective directories under xla/service.
    * Make the generic transfer manager take a pointer size. Previously it would just use sizeof(void*), which might not be exactly what is needed.

    PiperOrigin-RevId: 170133015
* Reduce XLA compile time by ~7% for a convolutional image model: (A. Unique TensorFlower, 2017-08-18)
    * Added CompactPointerSet<T>, which is optimized for set size <= 1.
    * Changed expensive CHECKs to DCHECKs in buffer_assignment.cc.
    * Reserve space in DFS state array before starting DFS.
    * Use unsigned arithmetic in DFS state maintenance.
    * HloInstruction:
      - Moved frequently used fields to start for better cache locality.
      - Use InlinedVector instead of vector for operand array.
      - Use InlinedVector instead of vector for DFS stack.
    * Pre-compute "is array" and "is tuple" for LogicalBuffer.
    * PointsToSet:
      - Combine two ShapeTrees into one.
      - Use CompactPointerSet instead of std::set to hold sources.
      - Use CompactPointerSet instead of std::set to hold flattened buffers.
    * ShapeTree: use unique_ptr instead of optional for shape storage (reduces size and destruction overhead).
    * Add proper const qualifiers to some FlatSet iterator methods.

    Co-author=jeff
    PiperOrigin-RevId: 165759117
* Allow cost estimates to differ per backend and include the estimates into the HLO profile. (Bjarke Hammersholt Roune, 2017-07-31)
    Add a summary table for what categories have the most opportunity for optimization left in them.
    PiperOrigin-RevId: 163780413
* [XLA] Simplify Shape traversal visitors. (Mark Heffernan, 2017-06-06)
    Simplify shape traversal visitors in ShapeUtil and ShapeTree. Add a non-Status form because most uses of the traversal methods do not use it, and remove the is_leaf parameter from ShapeTree::ForEach* as it is not frequently used.
    PiperOrigin-RevId: 158201574
* [XLA] Teach Executable to do its own profiling (patch 1/4). (A. Unique TensorFlower, 2017-05-22)
    Presently, ExecuteOnStreamWrapper is a method on xla::Service, where it doesn't really conceptually belong -- note that it doesn't use anything from the containing Service object, but it does have an Executable object as its first parameter that it could easily be a method on instead.

    The only reason that it needs to be on Service is that it needs to access a Backend object in order to call backend->compiler()->shape_size_function(), and simply moving that into Executable would introduce a dependency cycle. Thus, this patch (the first part of a sequence to address this) teaches Executable and its derivatives to compute shape_size_function. In the CPU cases, this is simply a static function. However, in the GPU case, we need to pass in the shape_size_function to the constructor, since it depends on a pointer size computed in the GpuCompiler.

    PiperOrigin-RevId: 156807318
* [XLA] Attach an HloModuleConfig to HloModule, obviating the need to pass them around as a pair. (Eli Bendersky, 2017-05-18)
    This cuts through a bunch of critical XLA APIs, but it's time... The background for this change is to make flags/options more easily pipe-able from the TF/XLA boundary deep into the XLA compiler and other components.

    The situation after this CL is still not perfect; there are a number of places with chicken-egg scenarios when a module has to be constructed before a config (to register the result shape), but the situation is strictly better than before. Future CLs will clean things up even more.

    PiperOrigin-RevId: 156469639
* Automated g4 rollback of changelist 155305963 (Peter Hawkins, 2017-05-14)
    PiperOrigin-RevId: 156011931
* [TF:XLA] Instead of dumping ptxas output, write it to a temp file and use it to build the GPU executable. (A. Unique TensorFlower, 2017-05-06)
    Switching to ptxas to avoid relying on the JIT in the nvidia driver for compilation.
    Change: 155305963
* [TF:XLA:GPU] Avoid a CPU-GPU synchronization when running XLA GPU computations from Tensorflow. (Peter Hawkins, 2017-03-08)
    Change: 149550275
* [TF:XLA] Remove support for client-allocated result buffers. (Peter Hawkins, 2017-03-07)
    This code path is unused; Tensorflow ended up settling on having XLA allocate result buffers using Tensorflow's allocator. Remove it to reduce the proliferation of ExecuteXYZ() methods.
    Change: 149423775
* [XLA:GPU] Cache GPU substreams across executions (A. Unique TensorFlower, 2017-03-02)
    Change: 149063035
* [XLA] Add support for profiling multiple computations (David Majnemer, 2017-03-01)
    While we are here, add support for getting the cost analysis for call HLOs.
    Change: 148952748
* [TF:XLA] Change buffer assignment to combine temp buffers into one allocation. (A. Unique TensorFlower, 2017-02-07)
    This lays the groundwork for future CLs to reduce overall memory usage, but doesn't accomplish that goal yet. I.e. this is step 1.

    The main change is in the semantics of BufferAllocation. Previously we'd only assign non-interfering (i.e. disjoint in liveness) LogicalBuffers to a single BufferAllocation. This meant that each BufferAllocation represented a unique address range in the working memory of the compiled program.

    Now we allow assignment of LogicalBuffers that overlap in liveness to the same BufferAllocation, by ensuring they occupy disjoint address ranges within the allocation. Bookkeeping of each address range is accomplished by associating each LogicalBuffer with an offset and size.

    We take advantage of these new semantics to combine all temp buffers into a single BufferAllocation, by laying them end-to-end in a postprocessing step -- see BufferAssigner::CombineTempAllocations. This is the same logic that TempBufferOffsets used on the GPU side; that class has been removed.

    Entry parameters (inputs) and maybe_live_out (outputs) are unchanged, and may still occupy multiple BufferAllocations.

    The rest of the CL deals with the consequences of these changes.

    Change: 146800348
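    The offset/size bookkeeping boils down to pointer arithmetic against the combined allocation's base; a small sketch (the function name is made up, but se::DeviceMemoryBase is the real StreamExecutor type):

        #include <cstdint>

        // Device address of a logical buffer living at [offset, offset + size)
        // inside a combined temp allocation.
        se::DeviceMemoryBase GetSliceAddress(const se::DeviceMemoryBase& base,
                                             uint64_t offset, uint64_t size) {
          char* start =
              static_cast<char*>(const_cast<void*>(base.opaque())) + offset;
          return se::DeviceMemoryBase(start, size);
        }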
* Initial open-source release of XLA: Accelerated Linear Algebra. (Peter Hawkins, 2017-01-09)
    XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators. XLA is still experimental; we are releasing it early to get the community involved.
    Change: 143990941