| Commit message | Author | Age |
attributes, set the attributes of all the contained variables. This fixes a bug where tf.train.init_from_checkpoint did not overwrite the initialization values correctly for TPUMirroredVariable.
PiperOrigin-RevId: 216429476
tf.train.init_from_checkpoint can be supported.
PiperOrigin-RevId: 215843249
output depends on the updates across all mirrors. Before this change,
update() would return a Mirrored value where each component was
an update to a single mirror. This caused a problem: for reading
purposes, other DistributionStrategy methods would consider it okay
to read any single component, so if you, for example, did something
like session.run(strategy.update(...)), it would only perform the
update on one replica. The fix is to have the output be a Mirrored
value that is actually the identity operation returning the output on
that device, but that has a control dependency making sure that the
update actually happens on all the replicas. This fix was already
present in MirroredVariable._assign_func; this CL moves the fix into
update() and generalizes it to multiple return values.
To disable this new grouping behavior, you may now pass
"grouped=False" to update(). For example, some callers (like Optimizer)
are performing a lot of updates and they prefer to group all of them
together at once for performance reasons. In this case, we still want
to make sure the caller executes the update on all replicas, so we
return an unwrapped value instead of a Mirrored value. This has the
happy side effect of removing a bunch of unwrap calls in client code,
since unwrapping was the only safe way to use the Mirrored value we
used to return.
PiperOrigin-RevId: 215301909
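A schematic, framework-free sketch of the grouping behavior described above (the `Mirrored` and `update` names here are illustrative stand-ins, not the real DistributionStrategy API, and plain Python dicts stand in for per-device ops):

```python
class Mirrored:
    """Stand-in for a mirrored value: one component per device (illustrative)."""
    def __init__(self, components):
        self.components = components  # dict: device name -> value

def update(mirrored, fn, grouped=True):
    # Run the update on every replica, not just whichever one a reader
    # happens to pick.
    results = {dev: fn(val) for dev, val in mirrored.components.items()}
    if grouped:
        # Grouped (default): hand back a Mirrored value; in the real fix this
        # wraps identity ops with a control dependency on all the updates.
        return Mirrored(results)
    # Ungrouped: return the raw per-replica results so the caller handles
    # (and therefore executes) each one explicitly.
    return results
```

The point of the sketch is only the contract: with `grouped=True` a reader gets one value that implies all replicas updated; with `grouped=False` the caller sees every replica's result and must deal with each.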
PiperOrigin-RevId: 215027511
PiperOrigin-RevId: 214989908
We will re-enable it when it is more robust.
PiperOrigin-RevId: 214956066
distribution strategies. That is always the appropriate option.
In the existing code, we would set it to a partially specified "worker" name that was ambiguous, and we would end up on the GPU.
PiperOrigin-RevId: 214882658
supported in Graph mode using initializable iterators. In a subsequent change, we'll add in support for Eager mode as well.
This removes prefetching_ops_v2 code.
PiperOrigin-RevId: 214546754
components of a MirroredVariable. We switched to using
`_distributed_container` set in the parent class
`DistributedVariable`, but the code setting `_mirrored_container` was
accidentally added back as a result of a merge.
PiperOrigin-RevId: 211111147
step counter. This allows us to get rid of the increment_var()
function and just use a standard assign_add().
PiperOrigin-RevId: 210743165
- add `_in_graph_mode` property to DistributedVariable
PiperOrigin-RevId: 210177702
being fixed is that when you session.run(assignment), where assignment is the
MirroredVariable value returned by ResourceVariable.assign*, only one
of the components of assignment is executed. Now that it is safer,
allow session.run() on Mirrored values (not just MirroredVariables).
PiperOrigin-RevId: 210149461
wrapper for variables in collections instead of what it wraps.
PiperOrigin-RevId: 210107528
ParameterServerStrategy when using >1 device per machine. This means
wrapping the variable instances returned in that case in a class
that intercepts assign_*() method calls.
PiperOrigin-RevId: 209533673
with few dependencies. This allows us to import this in some places without creating circular dependencies as the original file imported many things.
2. Move the stack used in distribution strategy context to the graph. This allows us to use different strategies in different graphs (e.g., in train and eval).
This fixes #21412 and #21180.
PiperOrigin-RevId: 208680454
this for MirroredStrategy and OneDeviceStrategy. Implemented in TPUStrategy earlier.
PiperOrigin-RevId: 207961939
Before this change, when a function was called in a distribution
strategy context, it would capture the component variables from some
device and always use those variables, even when the function was
executed on a different device.
This CL "reevaluates" distributed variables to get the correct variable
at call time. These correct variables are then passed to the function.
We don't handle distributed tensors. First, because the mechanics for handling
distributed tensors are different from handling distributed variables,
their support added significant complexity to already complex defuns.
Second, there is no easy way for users to have a function capture a distributed
tensor or feed a distributed tensor explicitly. If this changes, we can
support them (the code exists in this CL's history).
We also don't handle distributed variables explicitly passed into the
function, for similar reasons.
PiperOrigin-RevId: 207640908
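A minimal sketch of the call-time "reevaluation" idea described above. The `DistributedVariable` and `call_on_device` names are hypothetical stand-ins; the real mechanism resolves components against the runtime's device context rather than taking the device as an argument:

```python
class DistributedVariable:
    """Stand-in: one component value per device (illustrative, not the real class)."""
    def __init__(self, components):
        self.components = components  # dict: device name -> value

def call_on_device(fn, device, *dvars):
    # Resolve each distributed variable to the component for the device the
    # call runs on, instead of baking in a component captured at trace time.
    return fn(*(v.components[device] for v in dvars))
```

Capturing at trace time would freeze one device's component into the function; resolving at call time picks the right component for wherever the function actually executes.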
PiperOrigin-RevId: 206864512
PiperOrigin-RevId: 206208637
support for calling `assign` on TowerLocalVariables.
PiperOrigin-RevId: 205595323
PiperOrigin-RevId: 205424692
estimator.
PiperOrigin-RevId: 205030626
TowerLocalVariables.
PiperOrigin-RevId: 203520287
will be used for distributed variables.
Add Enum `VariableSynchronization` with values for `synchronization`: AUTO, UNREPLICATED, ON_WRITE, ON_READ
Add Enum `VariableAggregation` with values for `aggregation`: NONE, SUM, MEAN. Replace all the aggregation method strings in distribution strategy with the enum values.
Update Mirrored strategy to use these parameters to decide on whether a variable should be Mirrored or TowerLocal.
Update the different distribution strategy value types to use the `VariableAggregation` Enum.
PiperOrigin-RevId: 202736077
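A sketch of the two enums, using exactly the member names listed in this message (the real definitions live in TensorFlow and may differ; the `is_tower_local` helper is a hypothetical illustration of the Mirrored-vs-TowerLocal decision mentioned above):

```python
import enum

class VariableSynchronization(enum.Enum):
    """When a distributed variable's copies are synchronized (sketch)."""
    AUTO = 0
    UNREPLICATED = 1
    ON_WRITE = 2
    ON_READ = 3

class VariableAggregation(enum.Enum):
    """How copies of a distributed variable are combined (sketch)."""
    NONE = 0
    SUM = 1
    MEAN = 2

def is_tower_local(sync, agg):
    # Hypothetical decision rule: ON_READ synchronization with a real
    # aggregation method is what backs a TowerLocal variable; ON_WRITE
    # backs a Mirrored one.
    return (sync is VariableSynchronization.ON_READ
            and agg is not VariableAggregation.NONE)
```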
running multiple steps at a time using the `run_steps_on_dataset` API. It allows the user's step function to specify which outputs to emit at what frequency. Currently it only supports capturing output from the last step, but it will soon be augmented to support other use cases, such as emitting output every N steps.
PiperOrigin-RevId: 202520245
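The currently supported "capture output from the last step" behavior can be sketched in plain Python (the `run_steps` name and signature are illustrative, not the real `run_steps_on_dataset` API):

```python
def run_steps(step_fn, batches, num_steps):
    """Run `step_fn` on `num_steps` batches, keeping only the last output."""
    it = iter(batches)
    last_output = None
    for _ in range(num_steps):
        # Each step consumes one batch; earlier outputs are discarded,
        # matching the last-step-only capture described above.
        last_output = step_fn(next(it))
    return last_output
```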
in cross tower and tower context.
PiperOrigin-RevId: 202162272
PiperOrigin-RevId: 201554738
was using are now deprecated.
PiperOrigin-RevId: 201478331
so we can delete it. Frequently we can now delete the call entirely,
but in other cases we switch to read_var().
This revealed some bugs also fixed in this CL:
* For MirroredStrategy: fix read_var(mean_tower_local) bug.
* Support get() for Mirrored values that are not MirroredVariables,
and make them DistributedDelegates so we can operate on them in
cross-tower mode.
* Actually iterate through the available devices in MirroredStrategy.get().
With this and already-submitted 201390698, we can pass mirrored
variables and other mirrored values directly to self.evaluate() in
tests.
PiperOrigin-RevId: 201435436
PiperOrigin-RevId: 200467472
PiperOrigin-RevId: 199241723
python/training/checkpointable/
Need to add some new checkpointable files in core (specifically I had some checkpointable data structures in mind), and prefixing more files with "checkpointable_" in python/training/ seems dirty.
No functional changes, just some branching and build/import fiddling.
PiperOrigin-RevId: 196883136
cross-tower context:
* only provide read-only access to variables via get()
* don't fail if the variable isn't copied to the current device in
get()
* make _as_graph_element() return the aggregate value for tower-local
variables (instead of the incorrect previous behavior of returning
the primary)
PiperOrigin-RevId: 195711474
TPUStrategy passes tests in minimize_loss_test. That caused me to add a capability to have `iterations x cores` inputs of any structure. I also resolved a large number of small issues and uncovered more things to resolve, which are documented as TODOs.
PiperOrigin-RevId: 195696833
PiperOrigin-RevId: 195092992
multi-node distribution strategy.
PiperOrigin-RevId: 194862215
variable.
This prevents errors like
ValueError: Fetch argument MirroredVariable({'/job:localhost/replica:0/task:0/device:GPU:0': <tf.Variable 'global_step:0' shape=() dtype=int64>, '/job:localhost/replica:0/task:0/device:GPU:1': <tf.Variable 'global_step/replica_1:0' shape=() dtype=int64>}) cannot be interpreted as a Tensor. (Device /job:localhost/replica:0/task:0/device:CPU:0 not found in ['/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1'] (current device ))
I ran distribute/examples/resnet with and without the change and it fixed the problem.
PiperOrigin-RevId: 194828672
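The fetch fallback described above can be sketched as follows (the `fetch_component` name and device strings are illustrative; the real code hooks into session fetch conversion for MirroredVariables):

```python
def fetch_component(components, current_device):
    """Pick a concrete component of a mirrored variable for fetching.

    Before the fix, fetching from a device with no copy (e.g. CPU:0 when
    the copies live on GPUs) raised the ValueError quoted above; now the
    fetch falls back to the primary component.
    """
    if current_device in components:
        return components[current_device]
    # Treat the first device as the primary, as a stand-in for the real
    # notion of a mirrored variable's primary component.
    primary_device = next(iter(components))
    return components[primary_device]
```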
TPUStrategy is added to a few more tests.
There appears to be an issue with the batch norm test in minimize_loss_test where the moving averages stay at 0. I'm trying to resolve that separately as the next CL.
PiperOrigin-RevId: 193610264
PiperOrigin-RevId: 193563912
in estimator.
PiperOrigin-RevId: 193394603
PiperOrigin-RevId: 192850372
underlying primary variable's serialization). Also, throw an exception when trying to de-serialize as we haven't implemented that yet.
PiperOrigin-RevId: 191022884
PiperOrigin-RevId: 191020351
and MirroredStrategy, and related functionality.
Also add tf.contrib.optimizer_v2, an update to the Optimizer API.
RELNOTES: Can now pass tf.contrib.distribute.MirroredStrategy() to
tf.estimator.RunConfig() to run an Estimator model on multiple GPUs
on one machine.
PiperOrigin-RevId: 190996247