| Commit message | Author | Age |

PiperOrigin-RevId: 171983705

PiperOrigin-RevId: 171982861

PiperOrigin-RevId: 171982493

writing tests of libraries.
PiperOrigin-RevId: 171973311

Fixes #13355.
PiperOrigin-RevId: 171972633

PiperOrigin-RevId: 171966540

- Move away from the previous TF graph executor, which provides few of the features we need and also introduces nondeterminism.
- Unlike the previous executor, the new serial graph compiler doesn't recurse into a function to inline it. Instead, it creates a computation for the function and then creates a `call` op to call into the newly created computation.
- Add an optional comparator to the DFS algorithm, which is needed to make the compiler deterministic.
RELNOTES: Use a deterministic executor to generate the XLA graph.
PiperOrigin-RevId: 171962775
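
The comparator idea in the entry above can be sketched in plain Python (a hypothetical illustration, not the TF2XLA source): an optional sort key fixes the order in which a node's successors are visited, so the traversal no longer depends on the arbitrary order of the successor lists.

```python
def dfs(graph, start, key=None):
    """Iterative depth-first traversal.

    graph: dict mapping node -> list of successor nodes.
    key:   optional sort key; when given, successors are visited in
           sorted order, making the visit order deterministic even if
           the successor lists are produced in arbitrary order.
    """
    visited, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        succ = list(graph.get(node, []))
        if key is not None:
            # Reverse so the smallest successor is popped (visited) first.
            succ = sorted(succ, key=key, reverse=True)
        stack.extend(succ)
    return order
```

With a key supplied, two graphs that differ only in successor-list order yield the same traversal, which is the property a deterministic compiler needs.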

PiperOrigin-RevId: 171961190

TPUReplicateMetadata graph node, rather than attaching a copy of it to every node that is to be replicated.
PiperOrigin-RevId: 171957514

PiperOrigin-RevId: 171956450

PiperOrigin-RevId: 171931173

faster and more readable and avoids an issue with using the Eigen generator mechanism with GPUs on Windows.
PiperOrigin-RevId: 171924800

PiperOrigin-RevId: 171919244

PiperOrigin-RevId: 171918115

PiperOrigin-RevId: 171917856

PiperOrigin-RevId: 171917834

This utility function is designed for using a `tf.data.Dataset` in a serving context, where it is useful for expressing the stateless transformation from a fed-in batch into the serving input.
PiperOrigin-RevId: 171915928

have issues with multiple servers and have intermittent failures (https://github.com/grpc/grpc/issues/10142)
PiperOrigin-RevId: 171915902

PiperOrigin-RevId: 171915087

PiperOrigin-RevId: 171914551

PiperOrigin-RevId: 171913954

PiperOrigin-RevId: 171904584

PiperOrigin-RevId: 171904046

PiperOrigin-RevId: 171900256

PiperOrigin-RevId: 171895671

Hermitian) transposition. Currently, this can only be accomplished by adding extra conjugation ops, which means reading the tensor data from memory twice. More importantly, Hermitian transpose is the most common transposition operation when using complex arithmetic, so using it in new code helps prevent "conjugation bugs" by making the math work for real and complex types alike. The alias tf.linalg.adjoint was added to help with the latter.
An optimized fused conjugate-transpose op for GPU will be added in a follow-up.
Remove some code duplication among the CPU/GPU/SYCL implementations in transpose_functor.
Support accelerating 2D transpose ops with MKL in more cases.
PiperOrigin-RevId: 171895454
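
As a plain-Python illustration of what the Hermitian (conjugate) transpose computes; this is a sketch only, unrelated to the fused op this entry describes:

```python
def adjoint(m):
    """Conjugate transpose of a 2-D list of (possibly complex) numbers."""
    rows, cols = len(m), len(m[0])
    # Transpose and conjugate in one pass. For real entries the
    # conjugation is a no-op, so adjoint == ordinary transpose.
    return [[m[r][c].conjugate() for r in range(rows)] for c in range(cols)]
```

For real matrices this reduces to the plain transpose, which is the entry's point: writing the adjoint makes the same formula correct for real and complex types alike.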

PiperOrigin-RevId: 171890081

are correct, lower the tolerance from 1e-2 to 1e-6.
PiperOrigin-RevId: 171885525

PiperOrigin-RevId: 171884257

in a "while" loop.
Benchmark results (times in ms):
nontrivial_gather.axis0_cpu:     0.110
nontrivial_gather.axis0_xla_cpu: 0.139
nontrivial_gather.axis1_cpu:     0.093
nontrivial_gather.axis1_xla_cpu: 0.142
nontrivial_gather.axis4_cpu:     1.183
nontrivial_gather.axis4_xla_cpu: 2.658
slice_gather.axis0_cpu:     0.00388
slice_gather.axis0_xla_cpu: 0.00397
slice_gather.axis1_cpu:     0.00421
slice_gather.axis1_xla_cpu: 0.00427
slice_gather.axis4_cpu:     0.252
slice_gather.axis4_xla_cpu: 0.114
As you can see, the pure-XLA implementation is slower in all the nontrivial
cases and as fast or faster in the slice-gather cases.
The slice-gather cases are gathers that can be implemented as a single XLA
dynamic-slice, so the speedup here is likely understated: once we can
simplify the gather to a single dynamic-slice, we should be able to apply
many other optimizations to it, ideally fusing it so it has zero cost.
The nontrivial gathers all gather more than one element and are implemented
with an XLA while loop. The most important one is the axis-0 gather;
gathering from an inner dimension is so slow no matter what you do that it's
probably not worth optimizing.
It's possible to make this XLA implementation faster. One option I've
considered is "unrolling" the gather into a series of dynamic-slices that are
then concatenated together; this would be totally fusable, unlike the
implementation in this CL. Another option would be adding a notion of
uninitialized memory to XLA: part of what makes us slow is that we have to
memset our output to 0 before we overwrite it.
But given that the shape we're benchmarking here is totally arbitrary, and
given that we're getting decent performance, I think this is good enough to
start with.
PiperOrigin-RevId: 171883273
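
The "unrolling" alternative floated in the entry above, a gather expressed as one slice per index with the slices concatenated, can be sketched with plain Python lists (a toy model, not XLA):

```python
def gather_axis0(data, indices):
    """Gather entries of `data` at `indices`, expressed as slice-then-concat."""
    # One single-element "dynamic-slice" per index...
    slices = [data[i:i + 1] for i in indices]
    # ...then "concat" the slices back together.
    out = []
    for s in slices:
        out.extend(s)
    return out
```

Each slice here is statically fusable on its own, which is the property that makes the unrolled form attractive compared with a single opaque while loop.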

already has intra-op parallelism for library calls).
Adds support for parallel task assignment to instructions in entry (or embedded) computations.
Adds code to emit calls to a new runtime parallel fork/join function for instructions which have been assigned parallel tasks.
Adds a simple cost model for I/O bound instructions.
*) Translation (deleuze model) wall time (seconds).
             large_model  small_model  small_model_small_attn
sequential:  0.00556      0.00484      0.00155
parallel:    0.00263      0.00163      0.00106
*) Wavenet
sequential: Avg. latency (30 runs): 1026.13ms, min/max: 988/1108ms
parallel:   Avg. latency (30 runs): 800.633ms, min/max: 785/818ms
*) ParallelFusion benchmark.
Benchmark                          Time(ns)  CPU(ns)  Iterations
----------------------------------------------------------------
sequential cpu backend (at head)     610584   611467        1000
parallel cpu backend                 153241   836097        4528
sequential cpu backend (this CL)     113482   679535        6017
PiperOrigin-RevId: 171877766

PiperOrigin-RevId: 171876670

This implementation, which applies when a loop-fusion node's root is a
dynamic-update-slice whose input operand and output share the same
buffer slice, is much faster than the out-of-place implementation.
This patch also unifies the implementation of the CPU and GPU versions
of this algorithm.
PiperOrigin-RevId: 171863142
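
A minimal sketch (plain Python, hypothetical) of why the in-place form is cheaper: when input and output share a buffer, a dynamic-update-slice only needs to write the updated region, while the out-of-place form must first copy the entire operand.

```python
def dynamic_update_slice_inplace(buf, update, start):
    """Overwrite buf[start:start+len(update)] in place; O(len(update)) writes."""
    buf[start:start + len(update)] = update
    return buf

def dynamic_update_slice_copy(buf, update, start):
    """Out-of-place variant: copies the whole buffer first; O(len(buf)) writes."""
    out = list(buf)
    out[start:start + len(update)] = update
    return out
```

Both return the same values; only the in-place version aliases its input, which is the aliasing condition the entry describes.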

PiperOrigin-RevId: 171853263

* Remove HostConstraint for ops taking Variants; they can now be copied from/to Device.
* Add ResourceVariable assign operations that support variants.
PiperOrigin-RevId: 171845029

PiperOrigin-RevId: 171843463

PiperOrigin-RevId: 171842961

(cdi@google.com), using the algorithm outlined in Mike Giles' paper: http://eprints.maths.ox.ac.uk/1079/1/NA-08-01.pdf.
This initial version has the following restrictions:
Only supports statically known inner matrix dimensions m and n.
Backpropagating through U and V (i.e. backpropagating through SVD nodes with compute_uv=True) has further restrictions:
a) Only supports real tensors.
b) Only supports square and "almost square" matrices where the number of rows and columns differ by at most 1.
c) Also, full_matrices must be true. This does not currently have severe implications, given the restriction in b).
Feature request on GitHub:
#6503
This CL also adds support for calling tf.real, tf.imag, and tf.angle with real arguments.
PiperOrigin-RevId: 171836140
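
The last line of the entry above (calling tf.angle with real arguments) comes down to the following identity, shown here with Python's cmath rather than TensorFlow:

```python
import cmath
import math

def angle(x):
    """Phase of x treated as a complex number: 0 for x >= 0, pi for x < 0."""
    return cmath.phase(complex(x))
```

For a real argument x, real(x) == x and imag(x) == 0, so the phase collapses to either 0 or pi depending on the sign.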

PiperOrigin-RevId: 171833156

There are multiple "see reroute_inputs" references, which are unhelpful because the full docstring now exists only on _reroute_sgv_inputs (likewise for reroute_outputs). Copy most of the docstring to reroute_{inputs,outputs} so that it appears in the generated docs.
Update some other dangling doc references from _reroute to _reroute_sgv, though that docstring will not be included in the docs.
PiperOrigin-RevId: 171821659

PiperOrigin-RevId: 171789232

PiperOrigin-RevId: 171788007

PiperOrigin-RevId: 171775503

PiperOrigin-RevId: 171774816

PiperOrigin-RevId: 171772766

PiperOrigin-RevId: 171769504

into GradientBoostedDecisionTreeModel.
* Export the GTFlow model into a generic format with features defined in a proto.
PiperOrigin-RevId: 171766066

Also make the dynamic-update-slice simplification respect the
is_layout_sensitive_ flag in the algebraic simplifier.
While we're here, make the algebraic-simplifier test use the new
HloVerifiedTestBase class.
PiperOrigin-RevId: 171759708

option to provide specific feed nodes to the item builder.
PiperOrigin-RevId: 171758733

PiperOrigin-RevId: 171756150