PiperOrigin-RevId: 177989542
Also add support for rank != 4 tensors to the TF/XLA fused batchnorm operators, although the TF core ops don't actually support other ranks yet so this is not tested.
PiperOrigin-RevId: 177987592
This requires absl-py 0.1.6.
Also remove the manual tag on //tensorflow/python:app_test.
PiperOrigin-RevId: 177986813
Also fix a TODO in XlaOpRegistry to filter by the types allowed by the OpDef.
Also see #14798
PiperOrigin-RevId: 177986664
This option is necessary to mimic the Python import_graph_def method's
behavior.
PiperOrigin-RevId: 177986165
PiperOrigin-RevId: 177972555
PiperOrigin-RevId: 177971801
Change dependency optimizer to remove isolated NoOps when it is safe.
Fix bug in arithmetic optimizer: Only remove deduped nodes if we know the fetches.
PiperOrigin-RevId: 177970063
PiperOrigin-RevId: 177966156
PiperOrigin-RevId: 177964932
requiring a ShapeTree.
PiperOrigin-RevId: 177956572
PiperOrigin-RevId: 177956552
PiperOrigin-RevId: 177953076
that rng instructions are not rematerialized. This also lists Rng as non-rematerializable.
PiperOrigin-RevId: 177932160
with input shape != output shape.
PiperOrigin-RevId: 177920882
PiperOrigin-RevId: 177908680
Use ShapedBuffer to allocate required memory for the shape, then transfer the
literal to the allocated addresses on each replica. Also, add Allocate() method
to ShapedBuffer.
PiperOrigin-RevId: 177900588
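
A minimal Python sketch of the scheme this commit describes: allocate one buffer per leaf of the shape, then copy the literal's bytes into the allocated addresses on every replica. The names (`transfer_to_replicas`, `alloc`) are hypothetical stand-ins, not XLA's actual API.

```python
def transfer_to_replicas(literal_leaves, leaf_sizes, num_replicas, alloc):
    """Toy model: literal_leaves is a list of byte strings (one per leaf
    buffer of the shape); alloc(size) stands in for the device allocator
    that a ShapedBuffer-style Allocate() method would call."""
    replicas = []
    for _ in range(num_replicas):
        # Allocate() step: one buffer per leaf shape.
        buffers = [alloc(size) for size in leaf_sizes]
        # Transfer step: copy the literal into the allocated addresses.
        for buf, data in zip(buffers, literal_leaves):
            buf[:len(data)] = data
        replicas.append(buffers)
    return replicas
```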
rather than Pad.
PiperOrigin-RevId: 177896187
Also arrange for continuous testing with GPUs.
PiperOrigin-RevId: 177895214
PiperOrigin-RevId: 177892591
This fixes subtle problems with partitioned variables.
PiperOrigin-RevId: 177892499
copy.
PiperOrigin-RevId: 177891209
PiperOrigin-RevId: 177890892
The colocation attrs must be updated after all NodeDefs have been
processed. The nodes are processed and uniquified in topological
order, which lets us rewrite each node's inputs as we go, but the same
trick doesn't work for the colocation groups, which may refer to nodes
that haven't been renamed yet.
I also considered updating all the NodeDefs with prefixes or unique
names at the very beginning, before starting conversion. That would
make the logic simpler, but would require potentially keeping a full
copy of all the NodeDefs in memory (so we could edit them), so I
decided to edit in place after construction. We might want to
reconsider this alternative in the future, though.
PiperOrigin-RevId: 177890362
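
The two-pass scheme described above can be sketched in Python; the dict-based node encoding and the `import_nodes` helper are illustrative inventions, not TensorFlow's real data structures.

```python
def import_nodes(nodes, existing_names):
    """Toy model of the two-pass import: `nodes` is a list of dicts in
    topological order, e.g. {"name": ..., "inputs": [...],
    "colocate_with": [...]}; `existing_names` are names already taken."""
    renames, used, imported = {}, set(existing_names), []
    for node in nodes:
        new_name, suffix = node["name"], 0
        while new_name in used:
            suffix += 1
            new_name = f'{node["name"]}_{suffix}'
        used.add(new_name)
        renames[node["name"]] = new_name
        imported.append({
            "name": new_name,
            # Inputs refer only to already-processed nodes (topological
            # order), so the rename map is complete for them here.
            "inputs": [renames.get(i, i) for i in node["inputs"]],
            "colocate_with": list(node["colocate_with"]),
        })
    # Second pass: colocation targets can point anywhere in the graph,
    # so they can only be rewritten once every rename is known.
    for node in imported:
        node["colocate_with"] = [renames.get(c, c)
                                 for c in node["colocate_with"]]
    return imported
```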
Before, we assumed that if you passed --use_fake_data, you didn't care
about the output of the computation. With this patch, we decouple the
decision of using fake data from the decision of whether or not to print
the results.
PiperOrigin-RevId: 177889877
PiperOrigin-RevId: 177886163
PiperOrigin-RevId: 177884096
Before this change, we supported two algorithms for choosing the number
of threads per block:
* "optimize-for-latency" algorithm assumed that each thread would want
the maximum number of registers it could have, and chose a block size
small enough to accommodate this.
* "optimize-for-throughput" algorithm packed as many threads into a
block as possible.
In practice we always chose the optimize-for-latency algorithm.
This change removes the choice of algorithm and changes us to
unconditionally use a new one. In our new algorithm, we choose the
smallest block size that still has the potential to allow the GPU to
reach maximum occupancy.
When each thread's register usage is small, we can pack many of these
blocks into one SM and hit maximum occupancy. When the threads'
register usage is larger, we degrade gracefully (unlike with larger
block sizes, where the occupancy degradation is more quantized).
On our benchmarks, this is a moderate (0-10%) speedup on K40, and a
large (10-25%) speedup on P100.
PiperOrigin-RevId: 177879741
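
The "smallest block size that can still reach full occupancy" rule can be sketched as follows; the per-SM limits here are hypothetical defaults, and `pick_block_size` is a toy stand-in for XLA's actual GPU-backend logic.

```python
def pick_block_size(max_threads_per_sm=2048, max_blocks_per_sm=32,
                    warp_size=32, max_block_size=1024):
    """Return the smallest block size that can still reach full occupancy,
    assuming made-up per-SM hardware caps (threads and resident blocks)."""
    for block_size in range(warp_size, max_block_size + 1, warp_size):
        # With blocks this small, can the per-SM block-count limit still
        # keep every thread slot on the SM occupied?
        resident_threads = min(max_blocks_per_sm * block_size,
                               max_threads_per_sm)
        if resident_threads == max_threads_per_sm:
            return block_size
    return max_block_size
```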
PiperOrigin-RevId: 177878887
PiperOrigin-RevId: 177877751
PiperOrigin-RevId: 177876455
* add a bfloat16 Python type and NumPy extension.
* allow the bfloat16 type in a number of places in the Python libraries.
PiperOrigin-RevId: 177875784
PiperOrigin-RevId: 177875589
Some properties of NVIDIA GPUs cannot be queried via the driver API --
these are hardcoded in the UnqueryableDeviceParams struct in
StreamExecutor.
Before this change, we only had values for sm_35. This change adds the
values for all other NVIDIA GPUs, sm_20 through sm_70.
PiperOrigin-RevId: 177874401
This change fixes the case where a newly-generated uniquified name
conflicts with another NodeDef being imported (the original NodeDef
names are required to be unique among each other, so this is only an
issue when we create new names).
Note that this behavior is not well defined in the Python
import_graph_def method. It will always generate unique names, but the
exact naming scheme may depend on the order the NodeDefs are
imported. I didn't write a corresponding Python unit test or try to
make this change produce the same names for this reason.
PiperOrigin-RevId: 177872720
PiperOrigin-RevId: 177871523
PiperOrigin-RevId: 177871286
PiperOrigin-RevId: 177870577
PiperOrigin-RevId: 177869591
We sometimes pass scalars to non-entry computations, and since these are
pointers pointing to elements in a buffer and are not individually allocated
buffers, they don't have to follow the same alignment rules as buffers, even
though they incidentally do so today.
PiperOrigin-RevId: 177868506
The optimization is to use the vhaddps instruction when possible.
PiperOrigin-RevId: 177868238
PiperOrigin-RevId: 177865604
- Uses the GuaranteeConstOp.
- Runs a backward analysis on the args to see whether all paths lead to GuaranteeConstOps/ConstOps.
PiperOrigin-RevId: 177862716
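
The backward analysis can be sketched as a graph walk; the dict-based graph encoding and `is_effectively_const` are illustrative, not TensorFlow's actual representation.

```python
def is_effectively_const(node, producers, op_type):
    """Toy backward walk: `node` is effectively constant iff every path
    from it back through `producers` terminates at a Const-like op."""
    stack, seen = [node], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if op_type[n] in ("Const", "GuaranteeConst"):
            continue  # this path terminates at a constant source
        preds = producers.get(n, [])
        if not preds:
            return False  # non-const source, e.g. a variable or feed
        stack.extend(preds)
    return True
```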
https://github.com/bazelbuild/bazel-toolchains/releases/tag/b49ba36
PiperOrigin-RevId: 177858255
LayerCollection. Replaced it with a simple function that returns a list of all the registered variables.
PiperOrigin-RevId: 177857623
Fixes list formatting and sanitizes words in angle brackets, which aren't rendered in the web doc:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/lookup/IdTableWithHashBuckets
Follows the working formatting example of TextFileInitializer.
PiperOrigin-RevId: 177856349
PiperOrigin-RevId: 177851804
to avoid port conflicts with other tests during parallel bazel tests.
PiperOrigin-RevId: 177851615
PiperOrigin-RevId: 177851421
maxntid specifies the max number of threads in a block, whereas reqntid
says that we will use *exactly* this many threads in a block.
This doesn't have any effect on the benchmarks I ran, but we might as
well do it in case it helps ptxas make a better decision at some point
on some GPU. At least it will prevent the next person to come along
from doing this same investigation I just did. :)
PiperOrigin-RevId: 177851116
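
Schematically, the change in the emitted PTX looks something like the fragment below (the kernel name and thread counts are made up; only the directive changes):

```
.visible .entry example_kernel(.param .u64 p0)
// before: an upper bound only -- ptxas must still plan for smaller launches
.maxntid 128, 1, 1
// after: an exact thread count that ptxas can rely on when allocating
// registers and scheduling
.reqntid 128, 1, 1
{ ... }
```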