| Commit message | Author | Age |
Also don't allow parallelization for the sort op in parallel_task_assignment.
PiperOrigin-RevId: 213592046
Unfortunately this has to be one big patch: absl::StrCat, for example, does not
accept a TF StringPiece, so as soon as we switch to absl::string_view we also
have to switch away from all of the TF string functions.
PiperOrigin-RevId: 209957896
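As an illustration of the incompatibility driving the patch above, a minimal sketch (the function below is hypothetical, not code from the change itself): once a parameter is typed as absl::string_view, the absl helpers accept it directly, whereas a separate TF StringPiece type would need conversion at every call site, which is why the migration has to happen in one pass.

    #include <string>
    #include "absl/strings/str_cat.h"
    #include "absl/strings/string_view.h"

    // absl::StrCat accepts absl::string_view (and anything convertible to it);
    // it does not accept an unrelated StringPiece class.
    std::string MakeLabel(absl::string_view prefix, int index) {
      return absl::StrCat(prefix, "_", index);
    }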
Same for WrapUnique.
PiperOrigin-RevId: 209531124
PiperOrigin-RevId: 202090038
PiperOrigin-RevId: 201110240
PiperOrigin-RevId: 201033171
PiperOrigin-RevId: 201011811
PiperOrigin-RevId: 196978634
We teach TargetMachineFeatures about the alignment required for Eigen GEMM and
Conv and then pipe TargetMachineFeatures through the places that need to decide
whether a dot or a conv needs to be lowered to a call to Eigen.
I also had to fix a minor bug in our LLVM IR implementation for convolution.
PiperOrigin-RevId: 196065557
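A hypothetical sketch of the shape of that plumbing (names are illustrative, not the actual TargetMachineFeatures API): the lowering code asks a target-features object for the alignment Eigen requires before deciding to emit an Eigen call.

    #include <cstdint>

    // Illustrative only: a target-features object that knows the buffer
    // alignment Eigen's GEMM/Conv kernels require.
    class TargetFeaturesSketch {
     public:
      explicit TargetFeaturesSketch(int64_t eigen_alignment)
          : eigen_alignment_(eigen_alignment) {}

      // Minimum alignment (in bytes) a buffer needs before it may be handed
      // to an Eigen GEMM or Conv runtime call.
      int64_t minimum_alignment_for_eigen() const { return eigen_alignment_; }

     private:
      int64_t eigen_alignment_;
    };

    // The dot/conv lowering decision can then consult it:
    bool CanLowerToEigenCall(int64_t buffer_alignment,
                             const TargetFeaturesSketch& features) {
      return buffer_alignment >= features.minimum_alignment_for_eigen();
    }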
Intel MKL-DNN provides a 32-bit Conv2d method. With the INTEL_MKL flag set, the
XLA backend emits a runtime call to MKL-DNN Conv2d instead of Eigen.
PiperOrigin-RevId: 194445212
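An illustrative sketch of that dispatch, assuming a hypothetical pair of runtime symbol names (these are not the real XLA runtime entry points): the emitter picks the MKL-DNN-backed convolution when built with INTEL_MKL and falls back to Eigen otherwise.

    #include <string>

    // Hypothetical symbol names; the real emitter selects among its own
    // runtime entry points in the same spirit.
    std::string F32ConvRuntimeSymbol() {
    #ifdef INTEL_MKL
      return "conv2d_f32_mkl_dnn";  // MKL-DNN-backed implementation.
    #else
      return "conv2d_f32_eigen";    // Default Eigen implementation.
    #endif
    }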
PiperOrigin-RevId: 191824447
PiperOrigin-RevId: 191605505
The Intel GEMM API provides 32-bit and 64-bit MatMul. With the INTEL_MKL flag
set, the XLA backend emits a runtime call to the Intel GEMM MatMul instead of Eigen.
PiperOrigin-RevId: 191527251
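For reference, a minimal sketch of what the emitted F32 MatMul reduces to when routed through MKL's standard CBLAS interface (row-major layout, no transposes); this is an illustration, not the actual runtime wrapper.

    #include "mkl.h"  // Provides cblas_sgemm when building against Intel MKL.

    // C(m x n) = A(m x k) * B(k x n), computed by MKL instead of Eigen.
    void MatMulF32WithMkl(const float* a, const float* b, float* c,
                          int m, int n, int k) {
      cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  m, n, k,
                  /*alpha=*/1.0f, a, /*lda=*/k,
                  b, /*ldb=*/n,
                  /*beta=*/0.0f, c, /*ldc=*/n);
    }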
PiperOrigin-RevId: 191428965
Extend the stream interface ThenBlasGemmWithAlgorithm to support F16 matrix
multiplication with computation type FP32.
Extend the stream executor interface DoBlasGemmWithAlgorithm to support F16
GEMM with computation type FP32.
Extend the CPU IR emitter to handle F16 Dot instruction, and add F16 matrix
multiplication implementation to the CPU runtime.
Extend the GPU backend to handle FP16 GEMM Thunk.
Replicate the existing matrix multiplication test cases in
matrix_ops_simple_test and dot_operation_test for FP16.
RELNOTES:
PiperOrigin-RevId: 187369731
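A naive reference sketch of "F16 GEMM with computation type FP32" as described above: inputs and outputs are half precision, but products are accumulated in float. This only illustrates the numeric contract, not the optimized runtime implementation.

    #include "Eigen/Core"  // Eigen::half

    void HalfGemmAccumulateF32(const Eigen::half* a, const Eigen::half* b,
                               Eigen::half* c, int m, int n, int k) {
      for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
          float acc = 0.0f;  // Accumulate in F32 for accuracy.
          for (int p = 0; p < k; ++p) {
            acc += static_cast<float>(a[i * k + p]) *
                   static_cast<float>(b[p * n + j]);
          }
          c[i * n + j] = Eigen::half(acc);  // Round to F16 once per output.
        }
      }
    }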
LLVM generates calls to these functions when lowering some fp16 operations on
certain architectures. These symbols are defined in compiler-rt but we don't
always link to compiler-rt so these symbols are sometimes absent.
This change adds __gnu_f2h_ieee and __gnu_h2f_ieee as weak symbols. Making them
weak ensures that we are able to build successfully even when linking to a
compiler-rt that defines these symbols.
PiperOrigin-RevId: 186416684
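The mechanism, roughly: provide weak definitions of the fp16 helpers LLVM may emit calls to, so the link succeeds whether or not compiler-rt (which provides strong definitions) is present. The conversion bodies below are illustrative, routed through Eigen::half; the symbol names and the weak linkage are the point.

    #include <cstdint>
    #include <cstring>
    #include "Eigen/Core"

    // If compiler-rt is linked and provides strong definitions, those win;
    // if not, these weak definitions keep the link from failing.
    extern "C" __attribute__((weak)) uint16_t __gnu_f2h_ieee(float f) {
      Eigen::half h(f);
      uint16_t bits;
      std::memcpy(&bits, &h, sizeof(bits));
      return bits;
    }

    extern "C" __attribute__((weak)) float __gnu_h2f_ieee(uint16_t bits) {
      Eigen::half h;
      std::memcpy(&h, &bits, sizeof(h));
      return static_cast<float>(h);
    }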
SimpleResolver became unused after an LLVM upstream merge, and we never needed
the name mangling logic in what is now FindCompiledSymbol.
PiperOrigin-RevId: 186039307
PiperOrigin-RevId: 185979538
Enhance the CPU IR emitter to support F16 dot operation and convolution
operation.
Add a CPU runtime implementation for F16 convolution.
Enhance the GPU backend to handle F16 convolution thunk.
Convert some F32 xla convolution tests to support both F32 and F16 and disable
the tests for the CPU backend due to b/72509305.
PiperOrigin-RevId: 185862438
This was the last vectorized intrinsic for which we had to call into
C++ so also remove the associated machinery.
PiperOrigin-RevId: 185482962
PiperOrigin-RevId: 185149198
PiperOrigin-RevId: 185016276
PiperOrigin-RevId: 184856538
This lets us avoid the usual set of issues that crop up when XLA generated code
has to call into C++.
PiperOrigin-RevId: 184793093
This will require an LLVM version bump.
PiperOrigin-RevId: 182661291
PiperOrigin-RevId: 180746153
PiperOrigin-RevId: 180581912
PiperOrigin-RevId: 180301735
PiperOrigin-RevId: 180000981
GPU support includes FFT plan reuse, with a new scratch allocator per execution in fft_thunk.
PiperOrigin-RevId: 179983419
PiperOrigin-RevId: 179953488
PiperOrigin-RevId: 177526301
XLA clients can use this registry to inject client-specific behavior into how
Orc JIT manages virtual memory.
PiperOrigin-RevId: 175905401
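A purely hypothetical sketch of what such a registry can look like (names and types are illustrative, not the XLA API): the client registers a factory, and the JIT setup asks the registry for a custom section allocator before falling back to its default.

    #include <cstddef>
    #include <functional>
    #include <memory>
    #include <mutex>

    class SectionAllocator {  // client-provided memory behavior
     public:
      virtual ~SectionAllocator() = default;
      virtual void* AllocateCodeSection(std::size_t size, unsigned alignment) = 0;
    };

    class JitMemoryRegistry {
     public:
      using Factory = std::function<std::unique_ptr<SectionAllocator>()>;

      static void Register(Factory factory) {
        std::lock_guard<std::mutex> lock(Mutex());
        GetFactory() = std::move(factory);
      }

      // Returns nullptr if no client registered anything; the JIT then uses
      // its default memory management.
      static std::unique_ptr<SectionAllocator> CreateIfRegistered() {
        std::lock_guard<std::mutex> lock(Mutex());
        return GetFactory() ? GetFactory()() : nullptr;
      }

     private:
      static Factory& GetFactory() {
        static Factory* factory = new Factory;
        return *factory;
      }
      static std::mutex& Mutex() {
        static std::mutex* mu = new std::mutex;
        return *mu;
      }
    };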
171084886 had to be rolled back twice due to various open source build issues.
I'm trying again, now that I think I've addressed all the pertinent issues.
Original CL description:
Don't use dlsym to resolve symbols in the CPU JIT
Instead of resolving symbols via dlsym when JITting for the CPU backend, use a
registry based mechanism. This lets us kill off the --export_dynamic hack that
we used to need for CustomCall on the CPU backend.
PiperOrigin-RevId: 173277862
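The registry-based mechanism, in sketch form (hypothetical names; the real registration API differs): CustomCall targets are registered by name ahead of time, and the JIT's symbol resolver consults this map instead of dlsym, so nothing needs --export_dynamic.

    #include <map>
    #include <mutex>
    #include <string>

    class CustomCallTargetRegistry {
     public:
      static void Register(const std::string& name, void* function) {
        std::lock_guard<std::mutex> lock(Mutex());
        Map()[name] = function;
      }

      // Returns nullptr if nothing was registered under `name`.
      static void* Lookup(const std::string& name) {
        std::lock_guard<std::mutex> lock(Mutex());
        auto it = Map().find(name);
        return it == Map().end() ? nullptr : it->second;
      }

     private:
      static std::map<std::string, void*>& Map() {
        static auto* map = new std::map<std::string, void*>;
        return *map;
      }
      static std::mutex& Mutex() {
        static auto* mu = new std::mutex;
        return *mu;
      }
    };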
PiperOrigin-RevId: 172325692
PiperOrigin-RevId: 171915087
already has intra-op parallelism for library calls).
Adds support for parallel task assignment to instructions in entry (or embedded) computations.
Adds code to emit calls to a new runtime parallel fork/join function for instructions that have been assigned parallel tasks.
Adds a simple cost model for I/O bound instructions.
*) Translation (deleuze model) wall time (seconds).
                  large_model   small_model   small_model_small_attn
   sequential:    0.00556       0.00484       0.00155
   parallel:      0.00263       0.00163       0.00106
*) Wavenet
   sequential: Avg. latency (30 runs): 1026.13ms, min/max: 988/1108ms
   parallel:   Avg. latency (30 runs): 800.633ms, min/max: 785/818ms
*) ParallelFusion benchmark.
   Benchmark                            Time(ns)    CPU(ns)   Iterations
   ---------------------------------------------------------------------
   sequential cpu backend (at head)       610584     611467         1000
   parallel cpu backend                   153241     836097         4528
   sequential cpu backend (this CL)       113482     679535         6017
PiperOrigin-RevId: 171877766
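A self-contained sketch of the fork/join behavior the emitted calls rely on (this is not the actual XLA runtime entry point or its signature): the cost model picks a task count, each task handles a contiguous sub-range, and the call returns only after all tasks have completed.

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    // Assumes num_tasks >= 1. `task(start, limit)` processes elements
    // [start, limit).
    void ParallelForkJoin(int64_t num_elements, int64_t num_tasks,
                          const std::function<void(int64_t, int64_t)>& task) {
      std::vector<std::thread> workers;
      const int64_t chunk = (num_elements + num_tasks - 1) / num_tasks;
      for (int64_t start = 0; start < num_elements; start += chunk) {
        const int64_t limit = std::min(start + chunk, num_elements);
        workers.emplace_back(task, start, limit);  // fork one task per range
      }
      for (std::thread& t : workers) t.join();     // join before returning
    }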
LLVM does not deal well with huge arrays emitted inline into the IR. In JIT
mode, this change teaches XLA to emit large constant tensors into a side data
structure whose contents are then symbolically linked to the generated executable. It
is important to note that this works only in JIT mode, and my current
understanding is that making this work reliably in AOT will be somewhat more
difficult.
PiperOrigin-RevId: 171626043
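Sketched with the LLVM C++ API (illustrative, not the emitter's actual code): rather than materializing the tensor as an inline ConstantDataArray, the module gets only an external declaration, and the JIT later binds that symbol to the address of the constant's buffer on the host.

    #include <cstdint>
    #include <string>
    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/GlobalVariable.h"
    #include "llvm/IR/Module.h"

    llvm::GlobalVariable* DeclareExternalConstant(llvm::Module* module,
                                                  llvm::Type* element_type,
                                                  uint64_t num_elements,
                                                  const std::string& name) {
      llvm::ArrayType* array_type =
          llvm::ArrayType::get(element_type, num_elements);
      // No initializer: this is only a declaration, resolved when the JIT
      // maps the symbol to the host-side buffer holding the literal.
      return new llvm::GlobalVariable(
          *module, array_type, /*isConstant=*/true,
          llvm::GlobalValue::ExternalLinkage, /*Initializer=*/nullptr, name);
    }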
PiperOrigin-RevId: 171221629
Instead of resolving symbols via dlsym when JITting for the CPU backend, use a
registry based mechanism. This lets us kill off the --export_dynamic hack that
we used to need for CustomCall on the CPU backend.
PiperOrigin-RevId: 171084886
PiperOrigin-RevId: 170919783
Instead of resolving symbols via dlsym when JITting for the CPU backend, use a
registry based mechanism. This lets us kill off the --export_dynamic hack that
we used to need for CustomCall on the CPU backend.
PiperOrigin-RevId: 170892257
RELNOTES: n/a
PiperOrigin-RevId: 166766323
This adds log and exp for NEON.
(tanh is already supported on all platforms via the LLVM IR runtime.)
This change also fixes tf_library() to link the intrinsics into the binary.
PiperOrigin-RevId: 165782270
PiperOrigin-RevId: 164943597
This change introduces an LLVMCompiler class, of which the
CPU and GPU compilers are subclasses. The LLVMCompiler class
provides the ability to inspect the LLVM IR generated by the
compiler by registering callbacks, which can be used to analyze
the IR before and after optimizations.
This also adds a simple test for the callback mechanism.
PiperOrigin-RevId: 164805348
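In sketch form (hypothetical method names, not the actual LLVMCompiler interface), the callback mechanism looks like this: the base class holds pre- and post-optimization hooks that the CPU/GPU subclasses invoke around their optimization pipelines.

    #include <functional>
    #include "llvm/IR/Module.h"

    class LlvmCompilerBaseSketch {
     public:
      using ModuleHook = std::function<void(const llvm::Module&)>;

      void SetPreOptimizationHook(ModuleHook hook) {
        pre_opt_hook_ = std::move(hook);
      }
      void SetPostOptimizationHook(ModuleHook hook) {
        post_opt_hook_ = std::move(hook);
      }

     protected:
      // Subclasses call these around their LLVM optimization pipelines.
      void RunPreOptimizationHook(const llvm::Module& module) const {
        if (pre_opt_hook_) pre_opt_hook_(module);
      }
      void RunPostOptimizationHook(const llvm::Module& module) const {
        if (post_opt_hook_) post_opt_hook_(module);
      }

     private:
      ModuleHook pre_opt_hook_;
      ModuleHook post_opt_hook_;
    };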
This change adds a CPU-specific flag: xla_cpu_optimize_for_size
When this flag is passed, it changes the optimizers to run
more or less analogously to LLVM's -Os optimizations.
There are two things that turning on the code size optimization option
controls:
* the internal settings of some optimization passes (which is mostly
controlled through a function attribute)
* the passes that get run (which is decided by the pass manager)
This change also refactors the code by reorganizing the way
that CPU backend specific flags are queried, as well as some
other minor refactoring.
PiperOrigin-RevId: 164218771
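The "function attribute" half of that mechanism can be sketched as below (illustrative, not the actual flag-handling code): every emitted function is tagged so that individual passes tune themselves the way they do under -Os, while the choice of which passes run is made separately in the pass-manager setup.

    #include "llvm/IR/Attributes.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/Module.h"

    // Tag every function in the module for size-conscious optimization.
    void MarkFunctionsForSizeOptimization(llvm::Module* module) {
      for (llvm::Function& function : *module) {
        function.addFnAttr(llvm::Attribute::OptimizeForSize);
      }
    }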
163914294 by annarev:
Refactors build target for gradients_impl to allow code to depend on the gradient generation but not the gradients themselves.
--
163913011 by A. Unique TensorFlower:
Use an LLVM-IR version of vector hyperbolic tangent.
This lets us:
- Inline the routine where it is called, eliminating call overhead.
- Use AVX instructions in JITed code even if TensorFlow was not built with -mavx.
--
163909534 by A. Unique TensorFlower:
Add tensorflow-android to standard TF maven artifacts.
--
163908704 by A. Unique TensorFlower:
Go: Update generated wrapper functions for TensorFlow ops.
--
163907709 by A. Unique TensorFlower:
Update ops-related pbtxt files.
--
163907497 by A. Unique TensorFlower:
Remove the old TensorFlow Serving landing page in preparation for the new TF
Serving landing page. Fix bad leftnav.
--
163906225 by alive:
Refactors build target for gradients_impl to allow code to depend on the gradient generation but not the gradients themselves.
--
PiperOrigin-RevId: 163914294
PiperOrigin-RevId: 163349457
PiperOrigin-RevId: 163001060