path: root/third_party/examples
author    Igor Saprykin <isaprykin@google.com> 2018-01-17 11:37:48 -0800
committer TensorFlower Gardener <gardener@tensorflow.org> 2018-01-17 11:43:26 -0800
commit 3f7c05cc4e2cf823ae7825c4ccec55eef1596d49 (patch)
tree   f97f22e7616534bc835f8f40989fa61da4647fde /third_party/examples
parent a41ab15aeea526355d807fcf35e057ece0e35bc4 (diff)
Make `replicate_model_fn` friendlier to distributed training.
I verified that async distributed training works as is. One quirk is that when replicating over a single GPU, variables end up being placed on /gpu:0 on PSs, which works correctly only thanks to allow_soft_placement=True.

For sync distributed training using SyncReplicasOptimizer, the only quirk is that SyncReplicasOptimizerHook insists that SyncReplicasOptimizer.apply_gradients be called. That happens only in the last tower, yet any tower could create the hook. To accommodate that requirement, hooks from the last tower are now taken as part of this CL; before, hooks from the first tower were taken.

SyncReplicasOptimizer doesn't behave perfectly in tests: the queue keeps hanging, waiting for a new token to arrive, until `stop_grace_period_seconds` expires, which is set to 120 seconds. That setting isn't exposed through the Estimator interface, which makes the test slower.

PiperOrigin-RevId: 182245657
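For context, a minimal sketch of the usage pattern this CL affects, assuming the TF 1.x contrib API surface (`tf.contrib.estimator.replicate_model_fn`, `tf.train.SyncReplicasOptimizer`); the model and hyperparameters here are illustrative, not taken from the change itself:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy model; any model_fn producing an EstimatorSpec works here.
    logits = tf.layers.dense(features['x'], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    optimizer = tf.train.GradientDescentOptimizer(0.1)
    # SyncReplicasOptimizerHook requires SyncReplicasOptimizer.apply_gradients
    # to actually run. Under replicate_model_fn that only happens in the last
    # tower, which is why this CL switches to taking hooks from the last tower.
    optimizer = tf.train.SyncReplicasOptimizer(optimizer, replicas_to_aggregate=2)
    sync_hook = optimizer.make_session_run_hook(is_chief=True)

    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op, training_hooks=[sync_hook])

# replicate_model_fn replicates the model_fn over the available GPUs on a
# worker; the Estimator then runs the replicated towers during training.
replicated_model_fn = tf.contrib.estimator.replicate_model_fn(model_fn)
estimator = tf.estimator.Estimator(model_fn=replicated_model_fn)
```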
Diffstat (limited to 'third_party/examples')
0 files changed, 0 insertions, 0 deletions