path: root/third_party/examples
author    Igor Saprykin <isaprykin@google.com> 2018-01-17 11:37:48 -0800
committer TensorFlower Gardener <gardener@tensorflow.org> 2018-01-17 11:43:26 -0800
commit 3f7c05cc4e2cf823ae7825c4ccec55eef1596d49 (patch)
tree   f97f22e7616534bc835f8f40989fa61da4647fde /third_party/examples
parent a41ab15aeea526355d807fcf35e057ece0e35bc4 (diff)
Make `replicate_model_fn` friendlier to distributed training.
I verified that async distributed training works as is. One quirk is that when replicating over a single GPU, variables end up being placed on /gpu:0 on PSs, which works correctly only thanks to allow_soft_placement=True.

For sync distributed training using SyncReplicasOptimizer, the only quirk is that SyncReplicasOptimizerHook insists that SyncReplicasOptimizer.apply_gradients be called. That happens only in the last tower, yet any tower could create the hook. To accommodate that requirement, hooks from the last tower are now taken as part of this CL; before, hooks from the first tower were taken.

SyncReplicasOptimizer doesn't behave perfectly in tests: the queue keeps hanging, waiting for a new token to arrive, until `stop_grace_period_seconds` expires, which is set to 120 seconds. That setting isn't exposed through the Estimator interface, which makes the test slower.

PiperOrigin-RevId: 182245657
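For context, a minimal sketch of the usage pattern this CL affects, assuming the TF 1.x contrib API surface (`tf.contrib.estimator.replicate_model_fn`, `tf.train.SyncReplicasOptimizer`); the model and hyperparameters here are illustrative, not taken from the change itself:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy model; any model_fn producing an EstimatorSpec works here.
    logits = tf.layers.dense(features['x'], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    optimizer = tf.train.GradientDescentOptimizer(0.1)
    # SyncReplicasOptimizerHook requires SyncReplicasOptimizer.apply_gradients
    # to actually run. Under replicate_model_fn that only happens in the last
    # tower, which is why this CL switches to taking hooks from the last tower.
    optimizer = tf.train.SyncReplicasOptimizer(optimizer, replicas_to_aggregate=2)
    sync_hook = optimizer.make_session_run_hook(is_chief=True)

    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op, training_hooks=[sync_hook])

# replicate_model_fn replicates the model_fn over the available GPUs on a
# worker; the Estimator then runs the replicated towers during training.
replicated_model_fn = tf.contrib.estimator.replicate_model_fn(model_fn)
estimator = tf.estimator.Estimator(model_fn=replicated_model_fn)
```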
Diffstat (limited to 'third_party/examples')
0 files changed, 0 insertions, 0 deletions