Commit message
PiperOrigin-RevId: 215272497
PiperOrigin-RevId: 210998142
Unlike Printf, StrFormat does not require type-length qualifiers, e.g.
%z, %ll. Nor does it require that you call c_str() to print strings.
So these are fixed up here as well.
PiperOrigin-RevId: 210435915
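
The mechanical difference can be sketched as follows. Since absl::StrFormat itself isn't quoted here, this uses an iostream-based helper as a stand-in for a type-driven formatter; the names WithPrintf and TypeDriven are illustrative, not from the commit:

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <string>

// Printf-style formatting needs a length qualifier (%zu) for size_t and
// an explicit c_str() for std::string arguments.
std::string WithPrintf(const std::string& name, std::size_t count) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%s: %zu", name.c_str(), count);
  return buf;
}

// A type-driven formatter (iostreams here, standing in for StrFormat)
// infers the conversion from the argument type: no qualifier, no c_str().
template <typename T>
std::string TypeDriven(const std::string& name, T count) {
  std::ostringstream os;
  os << name << ": " << count;
  return os.str();
}
```
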
PiperOrigin-RevId: 209686671
Same for WrapUnique.
PiperOrigin-RevId: 209531124
PiperOrigin-RevId: 208110715
* lambda capture 'builder' is not used
* using decl 'Printf' is unused
* lambda capture 'this' is not used (17 times)
* lambda capture 'buffer_liveness' is not used
* lambda capture 'computation' is not used
* lambda capture 'operand_to_generator' is not used
* lambda capture 'M' is not used
* using decl 'InvalidParameterArgument' is unused
* lambda capture 'sum' is not used
* lambda capture 's' is not used
* lambda capture 'epsilon' is not used
PiperOrigin-RevId: 207542895
This is simpler than the corresponding change to XLA:GPU because on XLA:CPU all
instructions are codegened so we can always embed a pointer to the constant
global variable directly in the generated LLVM IR.
PiperOrigin-RevId: 206363887
It's only non-empty if we were able to run ptxas. If the PTX is going to be
JIT'ed by the driver, it won't be around. Loading an empty cubin will result in
a fatal error.
PiperOrigin-RevId: 206341931
This CL teaches XLA:GPU to use "normal" buffer assignment for constant
instructions. Constant instructions are mapped to a BufferAllocation, like all
other instructions, except the storage for this buffer is allocated statically
as a global in the generated PTX.
This CL does not change how we access the constants -- in
IrEmitterUnnested::BuildKernelThunk (used for top level computations) and in
HloToIrBindings::EmitBasePointersForHlos (used for nested computations) we bind
the kConstant instructions to the llvm::GlobalVariable backing them. So users
of constant instructions still access the globals corresponding to the constants
directly.
However, we no longer emit the constant literals inline. Instead we emit a
constant with a zero initializer and then memcpy in the contents of the literal
when we load the CUBIN/PTX. This works around compile time issues in LLVM and
ptxas caused by large constants.
We also populate `BufferAllocations` with the device pointers for the constant
globals. This is at least needed for TupleThunk today because TupleThunk wants
the addresses for the sub-buffers on the host. I'm not sure if there are other
places in XLA:GPU that rely on there being an entry in BufferAllocations for
every BufferAllocation.
PiperOrigin-RevId: 206243319
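
The zero-initializer-plus-memcpy idea can be sketched in host code. ConstantGlobal, EmitZeroInitialized, and PopulateAtLoadTime are made-up names for illustration, not XLA APIs:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Instead of baking a large literal into the module text (slow for LLVM
// and ptxas), reserve zeroed storage for the global at emit time...
struct ConstantGlobal {
  std::vector<unsigned char> storage;  // zero-initialized placeholder
};

ConstantGlobal EmitZeroInitialized(std::size_t bytes) {
  return ConstantGlobal{std::vector<unsigned char>(bytes, 0)};
}

// ...and memcpy the literal's contents in when the CUBIN/PTX is loaded.
void PopulateAtLoadTime(ConstantGlobal& g, const void* literal,
                        std::size_t n) {
  std::memcpy(g.storage.data(), literal, n);
}
```
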
We have a Pool in XLA that maintains a freelist of Streams, to avoid
the overhead of repeatedly allocating new Streams. Streams have a
monotonic state machine; if a stream encounters any error, it will
remain in an error state forever.
The functional change in this CL is to ensure that streams which have
encountered an error are deleted, rather than being put back on the
pool. Without this change, a previously failed stream will be put back
on the pool, only to cause the next usage of the stream to trivially
fail.
I've chosen to replace the generic templatized Pool with a concrete
StreamPool, since this makes the logic more straightforward to reason
about. Also note that the only existing usage of Pool is to hold
streams.
The functional change is in stream_pool.cc; nearly everything else is a
mechanical update.
PiperOrigin-RevId: 206100631
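
The pool-return logic described above can be sketched as follows; Stream here is a trivial stand-in for se::Stream, and the real StreamPool differs in detail:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for se::Stream: once a stream errors, it stays errored.
struct Stream {
  bool ok = true;
};

class StreamPool {
 public:
  // Borrow a stream, reusing a pooled one when available.
  std::unique_ptr<Stream> BorrowStream() {
    if (!streams_.empty()) {
      std::unique_ptr<Stream> s = std::move(streams_.back());
      streams_.pop_back();
      return s;
    }
    return std::make_unique<Stream>();
  }

  // Return a stream to the pool. Errored streams are destroyed instead
  // of pooled, so the next borrower never starts from a failed stream.
  void ReturnStream(std::unique_ptr<Stream> s) {
    if (s->ok) {
      streams_.push_back(std::move(s));
    }
    // else: s is destroyed here.
  }

  std::size_t size() const { return streams_.size(); }

 private:
  std::vector<std::unique_ptr<Stream>> streams_;  // freelist
};
```
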
We now look into the computations of kWhile and kConditional ops when profiling.
This still does not help regarding the statistics of the estimated optimum,
but at least we can see the relative performance of the ops within a
subcomputation.
PiperOrigin-RevId: 202916616
This is in preparation of passing it on to the Thunks, so that we can profile
HloInstructions within a while loop.
PiperOrigin-RevId: 200532394
PiperOrigin-RevId: 196912575
The motivation here is that with --xla_hlo_profile, we count the time
spent in Thunk::ExecuteOnStream, but we don't want to count the time
spent loading the CUDA code into the GPU as time spent in the first
kernel thunk we try to run.
PiperOrigin-RevId: 196314733
PiperOrigin-RevId: 196206896
Previously, xla::DeviceMemoryAllocator::Allocate returned a
stream_executor::DeviceMemoryBase. This is morally equivalent to a raw
pointer: it is up to the user to call Deallocate().
Unfortunately we ~never got this right. Essentially all users of
Allocate() call it in a loop, and TF_RETURN_IF_ERROR within the loop.
If any of these allocations fails (most commonly due to OOM), we leak
everything we've allocated up until then.
This patch changes our API so that it returns an owning pointer. Now
things mostly Just Work.
Also worth calling out: The lambda in CpuExecutable::ExecuteOnStream
passed to ExecuteComputeFunction almost certainly had multithreaded
use-after-free bugs. This patch fixes them.
PiperOrigin-RevId: 196000535
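
The leak pattern and the fix can be sketched with standard smart pointers. DeviceMemory, OwningDeviceMemory, Allocate, and AllocateMany below are illustrative stand-ins, not the real XLA types:

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <vector>

static int live_allocations = 0;  // tracks outstanding allocations

struct DeviceMemory {
  void* ptr;
};

// Custom deleter: deallocation happens automatically on destruction,
// which is the point of returning an owning handle from Allocate().
struct Deallocator {
  void operator()(DeviceMemory* m) const {
    --live_allocations;
    std::free(m->ptr);
    delete m;
  }
};

using OwningDeviceMemory = std::unique_ptr<DeviceMemory, Deallocator>;

OwningDeviceMemory Allocate(std::size_t bytes) {
  ++live_allocations;
  return OwningDeviceMemory(new DeviceMemory{std::malloc(bytes)});
}

// The failure-prone pattern: allocate in a loop, bail out midway (as
// TF_RETURN_IF_ERROR would). With owning pointers, everything allocated
// so far is freed when `buffers` goes out of scope.
bool AllocateMany(int n, int fail_at) {
  std::vector<OwningDeviceMemory> buffers;
  for (int i = 0; i < n; ++i) {
    if (i == fail_at) return false;  // simulated OOM
    buffers.push_back(Allocate(64));
  }
  return true;
}
```
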
Previously, we returned a plain ShapedBuffer. But this doesn't capture
our semantics: it's up to the caller to free this ShapedBuffer.
PiperOrigin-RevId: 193854051
These are already notionally equivalent to T* and unique_ptr<T>, so
having a unique_ptr of a {Scoped,}ShapedBuffer is pretty redundant.
Also clean up the ScopedShapedBuffer API a bit.
PiperOrigin-RevId: 193599773
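
The relationship described above can be sketched like this; both types below are minimal stand-ins for the real ShapedBuffer and ScopedShapedBuffer, with a counter in place of actual device deallocation:

```cpp
#include <cassert>

static int frees = 0;  // counts "device memory freed" events

// Non-owning view of device memory: analogous to a raw T*.
struct ShapedBuffer {
  int id;
};

// Owning wrapper: analogous to std::unique_ptr<T>, frees on destruction.
// Wrapping this in *another* unique_ptr would be redundant.
class ScopedShapedBuffer {
 public:
  explicit ScopedShapedBuffer(ShapedBuffer b) : buffer_(b) {}
  ~ScopedShapedBuffer() { ++frees; }  // release device memory here
  ScopedShapedBuffer(const ScopedShapedBuffer&) = delete;
  ScopedShapedBuffer& operator=(const ScopedShapedBuffer&) = delete;
  const ShapedBuffer& get() const { return buffer_; }

 private:
  ShapedBuffer buffer_;
};
```
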
PiperOrigin-RevId: 193301997
pointer.
In the past, we allowed sub-buffers to be null if the top-level tuple
was non-null.
This doesn't actually work well on the GPU: For ops that are implemented
using cudnn or cublas, we have to have a pointer to the sub-buffer on
the host in order to make the call. Retrieving it from the GPU in an
efficient manner is complicated, and the best we can come up with isn't
all that efficient (fundamentally having to pull data down from the GPU
blocks the ability of the CPU to "run ahead" of the GPU).
Since TF wasn't making use of our flexibility *anyway*, we add the
requirement that XLA be given non-null pointers to all sub-buffers.
Changes to the XLA:GPU backend to take advantage of this will come
separately.
PiperOrigin-RevId: 190700021
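
The new requirement amounts to a precondition check over the flattened buffer tree; this helper is purely illustrative, not the actual XLA validation code:

```cpp
#include <cassert>
#include <vector>

// Under the new contract, the top-level tuple pointer *and* every
// sub-buffer pointer handed to XLA must be non-null.
bool AllBuffersNonNull(const std::vector<const void*>& flattened) {
  for (const void* p : flattened) {
    if (p == nullptr) return false;
  }
  return true;
}
```
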
We were enqueueing the timer on the main stream, but not blocking the
substreams, so the results were nonsensical.
PiperOrigin-RevId: 187032412
multi-output fusion node, and avoid some pointer chasing with tuples.
Previously, the kernels we generated would have one argument per
*top-level* buffer of the input/output. This was fine for inputs. But
it doesn't work for outputs: Imagine you're a node that returns a tuple
-- e.g. multi-output fusion -- if all you get is a pointer to the
top-level buffer of your output (which should contain pointers to the
lower-level buffers at some point, but at the moment is just empty), how
are you supposed to figure out where to write your output?
(This usually worked because most of the time your output would live
inside of the big XLA temp buffer, and kernels always get a pointer to
that.)
Now we pass all the buffers, top-level and otherwise, to our kernel. In
addition, we're now willing to statically dereference tuples that live
entirely in XLA's temp buffer. Pointers in input tuples must still be
dereferenced dynamically, because the caller has the option of giving us
these values or not when invoking XLA.
This change makes some parts of BufferAssignment/BufferAllocations more
truthful. Previously, if you passed a tuple-shaped input to XLA, we'd
say in BufferAllocations that the pointer for some subshape of the param
was the *top-level tuple pointer*. XLA then knew that this was a lie
and would dereference it accordingly. Now we have an explicit notion of
a BufferAllocation pointing to a subshape of an input parameter.
PiperOrigin-RevId: 185614060
This change replaces the meat of HloProfilePrinter with a protobuf
HloProfilePrinterData. The original plan was to serialize HloProfilePrinter
into C++ source code and put that in a .cc file along with the string for the
xla::ProgramShape. However, since we now directly serialize xla::ProgramShape
into a .o file, for consistency I think we should do the same thing for
HloProfilePrinter (instead of adding yet another output file to tfcompile).
The change itself is fairly simple; it is large mostly due to the mass
renaming I had to do.
PiperOrigin-RevId: 183158192
ExecutionProfile::compute_cycle_count never worked for CPU and GPU with Hlo
profiling disabled, as far as I can tell.
PiperOrigin-RevId: 181517824
The old implementation did not do what it was intended to do. Instead
of making the next thunk on each other stream wait for this thunk to
complete, it made all following thunks on this stream wait for this
thunk to complete. (Which is redundant, since streams are serialized
anyway.)
PiperOrigin-RevId: 179938424
Add member function Thunk::ShouldFutureScheduledThunksDependOn, which lets
convolution_thunk tell the thunk executor that all future scheduled thunks
should wait for it. This ensures that the use of the scratch allocator in
convolution_thunk is safe.
PiperOrigin-RevId: 179628764
179277894 by gunan:
Run buildifier on build file.
--
179275101 by meheff:
Replace DeviceMemoryBase with ShapedBuffer in XLA interfaces.
Executable, TransferManager, and AllocationTracker now use ShapedBuffer to hold device memory addresses holding XLA data. Most of the change is straightforward, with the exception of AllocationTracker, which was mostly rewritten (and simplified), and some refactoring in the CPU executable.
Also, have ShapedBuffer hold on-host and on-device Shapes which are the shapes of the representation of the data on the host and device, respectively. This is necessary because with cl/178624364 the on-host and on-device shape may no longer be equal.
--
179265385 by A. Unique TensorFlower:
Return error rather than CHECK fail in Executable::ExecuteOnStreamWrapper
--
179264551 by dandelion:
Internal fixes.
--
PiperOrigin-RevId: 179277894
PiperOrigin-RevId: 178951330
PiperOrigin-RevId: 178723711
XLA:GPU autotunes gemm and conv thunks, trying multiple algorithms in
sequence and picking the fastest one.
If other work is running concurrently with our autotuning, this can mess
up the results. In particular, even if the GPU is totally
deterministic, the concurrent work may finish before we finish
autotuning, giving an unfair advantage to the later algorithms.
To address this, we modify GpuExecutable to wait until the GPU is
quiescent before executing a thunk which performs autotuning. We then
cross our fingers and hope that whatever is fastest while the GPU is
quiescent will also be fastest in the "real world", with (potentially)
concurrent work going on.
PiperOrigin-RevId: 178041481
This refactoring will later allow XlaCompiledCpuFunction to pull out the
HloProfilePrinter from Executable and use that to display the hlo execution
profile. A de/serialized HloProfilePrinter will let AOT compiled binaries
display their Hlo execution profile.
PiperOrigin-RevId: 176689528
* Use ptxas to compile generated PTX.
* Run PTX compilations in parallel.
* Cache results of PTX compilation.
PiperOrigin-RevId: 174921332
PiperOrigin-RevId: 174084570
executables which only generated the array values of tuple-shaped outputs, not the tuple index tables. With cl/170133015, ShapedBuffers which hold the computation output now have materialized tuples with these index tables, so this option is no longer desired or necessary.
No functional change. Just cleanup.
PiperOrigin-RevId: 171035738
PiperOrigin-RevId: 170252047
data of a given shape) to also hold an array of pointers representing the tuple structure in the device memory.
Previously ShapedBuffer only held array-shaped data at the leaves of the tuple shape. Construction of these
arrays-of-pointers is handled by TransferManager, which has to construct arrays-of-pointers anyway to transfer
literals to the device. This change makes ShapedBuffer match the native representation of tuple-shaped data
passed into XLA computations. This is the first step in migrating XLA interfaces away from using naked device
memory pointers (DeviceMemoryBase) to using more expressive ShapedBuffers instead.
This change enables tuple-shaped parameters in computations run through the LocalClient interface.
Also, change LocalClient interfaces to return ScopedShapedBuffers, as these are generally easier to deal with ownership-wise than ShapedBuffers. They are analogous to std::unique_ptr, while ShapedBuffers are analogous to bare pointers.
This change includes a couple other cleanups found along the way:
* move cpu/gpu/interpreter transfer managers into their respective directories under xla/service.
* Make the generic transfer manager take a pointer size. Previously it would just use sizeof(void*) which might not be exactly what is needed.
PiperOrigin-RevId: 170133015
* Added CompactPointerSet<T>, which is optimized for set size <= 1.
* Changed expensive CHECKs to DCHECKS in buffer_assignment.cc
* Reserve space in DFS state array before starting DFS.
* Use unsigned arithmetic in DFS state maintenance.
* HloInstruction:
- Moved frequently used fields to start for better cache locality.
- Use InlinedVector instead of vector for operand array.
- Use InlinedVector instead of vector for DFS stack.
* Pre-compute "is array" and "is tuple" for LogicalBuffer.
* PointsToSet:
- Combine two ShapeTrees into one.
- Use CompactPointerSet instead of std::set to hold sources.
- Use CompactPointerSet instead of std::set to hold flattened buffers.
* ShapeTree: use unique_ptr instead of optional for shape storage
(reduces size and destruction overhead).
* Add proper const qualifiers to some FlatSet iterator methods.
Co-author=jeff
PiperOrigin-RevId: 165759117
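
The size <= 1 optimization behind CompactPointerSet can be sketched like this; the real class is more elaborate, and this version exists only to illustrate the idea of inline storage for the common singleton case:

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// A set of pointers optimized for the common case of size <= 1: a single
// inline pointer avoids any heap allocation until a second element
// is inserted.
template <typename T>
class CompactPointerSet {
 public:
  void insert(T* p) {
    if (overflow_.empty() && (single_ == nullptr || single_ == p)) {
      single_ = p;  // stay in the inline representation
      return;
    }
    if (single_ != nullptr) {
      // Growing past one element: migrate to the overflow set.
      overflow_.insert(single_);
      single_ = nullptr;
    }
    overflow_.insert(p);
  }

  bool contains(T* p) const {
    return p == single_ || overflow_.count(p) > 0;
  }

  std::size_t size() const {
    return overflow_.empty() ? (single_ != nullptr ? 1 : 0)
                             : overflow_.size();
  }

 private:
  T* single_ = nullptr;    // inline storage for the size <= 1 case
  std::set<T*> overflow_;  // used only once the set grows past one
};
```
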
the HLO profile. Add a summary table for what categories have the most opportunity for optimization left in them.
PiperOrigin-RevId: 163780413
Simplify shape traversal visitors in ShapeUtil and ShapeTree. Add a non-Status form because most uses of the traversal methods do not use it, and remove is_leaf parameter from ShapeTree.ForEach* as it is not frequently used.
PiperOrigin-RevId: 158201574
Presently, ExecuteOnStreamWrapper is a method on xla::Service, where it doesn't really conceptually belong -- note that it doesn't use anything from the containing Service object, but it does have an Executable object as its first parameter that it could easily be a method on instead. The only reason that it needs to be on Service is that it needs to access a Backend object in order to call backend->compiler()->shape_size_function(), and simply moving that into Executable would introduce a dependency cycle.
Thus, this patch (the first part of a sequence to address this) teaches Executable and its derivatives to compute shape_size_function. In the CPU cases, this is simply a static function. However, in the GPU case, we need to pass in the shape_size_function to the constructor, since it depends on a pointer size computed in the GpuCompiler.
PiperOrigin-RevId: 156807318
them around as a pair.
This cuts through a bunch of critical XLA APIs, but it's time... The background for this change is to make flags/options more easily pipe-able from the TF/XLA boundary deep into the XLA compiler and other components.
The situation after this CL is still not perfect; there are a number of places with chicken-egg scenarios when a module has to be constructed before a config (to register the result shape), but the situation is strictly better than before. Future CLs will clean things up even more.
PiperOrigin-RevId: 156469639
PiperOrigin-RevId: 156011931
build gpu executable. Switching to ptxas to avoid relying on the JIT in the nvidia driver for compilation.
Change: 155305963
computations from Tensorflow.
Change: 149550275
This code path is unused; Tensorflow ended up settling on having XLA allocate result buffers using Tensorflow's allocator. Remove it to reduce the proliferation of ExecuteXYZ() methods.
Change: 149423775
Change: 149063035
While we are here, add support for getting the cost analysis for call HLOs.
Change: 148952748
This lays the groundwork for future CLs to reduce overall memory usage, but
doesn't accomplish that goal yet. I.e. this is step 1.
The main change is in the semantics of BufferAllocation. Previously we'd only
assign non-interfering (i.e. disjoint in liveness) LogicalBuffers to a single
BufferAllocation. This meant that each BufferAllocation represented a unique
address range in the working memory of the compiled program.
Now we allow assignment of LogicalBuffers that overlap in liveness to the same
BufferAllocation, by ensuring they occupy disjoint address ranges within the
allocation. Bookkeeping of each address range is accomplished by associating
each LogicalBuffer with an offset and size.
We take advantage of these new semantics to combine all temp buffers into a
single BufferAllocation, by laying them end-to-end in a postprocessing step -
see BufferAssigner::CombineTempAllocations. This is the same logic that
TempBufferOffsets used on the GPU side; that class has been removed.
Entry parameters (inputs) and maybe_live_out (outputs) are unchanged, and may
still occupy multiple BufferAllocations.
The rest of the CL deals with the consequences of these changes.
Change: 146800348
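
The end-to-end layout step can be sketched as an offset-assignment pass; the names and the alignment handling below are simplified relative to BufferAssigner::CombineTempAllocations, and the alignment value in the usage is illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each logical buffer gets an (offset, size) within the single combined
// temp allocation.
struct LogicalBuffer {
  std::uint64_t size;
  std::uint64_t offset = 0;
};

// Lay buffers end-to-end, rounding each offset up to the alignment
// boundary; returns the total size of the combined BufferAllocation.
std::uint64_t CombineTempAllocations(std::vector<LogicalBuffer>& buffers,
                                     std::uint64_t alignment) {
  std::uint64_t total = 0;
  for (LogicalBuffer& b : buffers) {
    total = (total + alignment - 1) / alignment * alignment;
    b.offset = total;
    total += b.size;
  }
  return total;
}
```

Note that buffers overlapping in liveness could share address ranges in a more aggressive assignment; this sketch only shows the disjoint end-to-end layout.
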
XLA is a compiler-based linear algebra execution engine that targets CPUs, GPUs and custom accelerators.
XLA is still experimental; we are releasing it early to get the community involved.
Change: 143990941