author | James Qin <jamesqin@google.com> | 2018-06-10 22:15:46 -0700 |
---|---|---|
committer | TensorFlower Gardener <gardener@tensorflow.org> | 2018-06-10 22:18:40 -0700 |
commit | 3a1d8bd815b5216bc9515801e4d59cf3ebd1126d (patch) | |
tree | c0fb2c362712d62b2e8ccaba12ae700812e145ed /tensorflow/contrib/mixed_precision | |
parent | 119db15241e29587e0b6ab3912bff5ff63d123eb (diff) | |
Improve the loss_scale_optimizer docstring.
PiperOrigin-RevId: 200001771
Diffstat (limited to 'tensorflow/contrib/mixed_precision')
-rw-r--r-- | tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py | 42 |
1 file changed, 24 insertions, 18 deletions
diff --git a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
index e4e5ccc334..ef34f7bf7b 100644
--- a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
+++ b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
@@ -26,26 +26,32 @@ from tensorflow.python.training import optimizer
 
 
 class LossScaleOptimizer(optimizer.Optimizer):
+  # TODO(jamesqin): move mixed precision training explanation to __init__
+  # docstring.
   """An optimizer that applies loss scaling in backprop.
 
-  This class is useful for mixed precision training on GPUs (or other potential
-  accelerators), which is an approach to improve compute throughput without loss
-  of model quality.
-
-  The commmon configuration of mixed precision models is the following:
-  * variables are kept in high precision (e.g. float32).
-  * computations are done in lower precision (e.g. float16). variables are
-    casted to lower precision before they're used.
-  * (in training), final gradients are casted back to variable precision and get
-    applied.
-
-  Because computations happen in lower precision, gradients in the backprop pass
-  might underflow in the smaller dynamic range, causing a model to converge at a
-  suboptimal level. This optimizer multiplies the loss by a factor before
-  backprop starts to prevent underflow. Before gradients are applied, they are
-  casted to higher precision and down-scaled by the same factor, so
-  mathematically the variable updates are no different from regular
-  same-precision training.
+  This class is useful for "mixed precision training" on GPUs (or other
+  potential accelerators), an approach to improve compute throughput without
+  compromising model quality.
+
+  The canonical way to perform mixed precision training is the following:
+  * Model variables are kept in high precision (e.g. float32).
+  * Computations are done in lower precision (e.g. float16), which enjoys
+    performance speedup by virtue of hardware support. Variables are casted to
+    lower precision before they're used.
+  * Final gradients are casted back to high precision dtype, then used to update
+    variables.
+
+  The side-effect of performing computation in lower precision, is that it comes
+  with smaller numerical range. During backproping, small gradients might
+  underflow in the reduced numerical range, causing a model to converge at
+  suboptimal level.
+
+  To prevent underflow, this optimizer multiplies the loss by a factor before
+  backprop starts. Consequently, the gradients are linearly scaled up by the
+  same factor, thus not falling into the underflow zone. After that, to perserve
+  the correctness of backprop, the gradients are down-scaled by the same factor,
+  casted to the (higher) variable precision, then applied on the variables.
 
   See [Nvidia's manual on mixed precision training](
   https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
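For readers skimming the new docstring, the following is a minimal sketch (not part of the commit) of the manual loss-scaling recipe it describes, written against the TF 1.x graph API. The tiny linear model, the fixed `loss_scale` of 128, and all variable names are illustrative assumptions, not the module's API.

```python
import numpy as np
import tensorflow as tf

# Model variable is kept in high precision (float32).
w = tf.Variable(tf.random_normal([4, 1], dtype=tf.float32))

# Inputs arrive in the lower compute precision (float16).
x = tf.placeholder(tf.float16, shape=[None, 4])
y = tf.placeholder(tf.float16, shape=[None, 1])

# Cast the variable to float16 before it is used in the forward pass.
w_fp16 = tf.cast(w, tf.float16)
loss = tf.reduce_mean(tf.square(tf.matmul(x, w_fp16) - y))

loss_scale = 128.0  # Illustrative fixed factor; real setups may adjust it dynamically.
opt = tf.train.GradientDescentOptimizer(0.1)

# Scale the loss before backprop so small float16 gradients do not underflow.
grads_and_vars = opt.compute_gradients(loss * loss_scale, var_list=[w])

# Down-scale the gradients by the same factor, in float32, before applying them,
# so the variable update matches unscaled same-precision training.
unscaled = [(tf.cast(g, tf.float32) / loss_scale, v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(unscaled)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  batch_x = np.random.randn(8, 4).astype(np.float16)
  batch_y = np.random.randn(8, 1).astype(np.float16)
  for _ in range(10):
    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

The contrib module itself packages this pattern behind the usual `Optimizer` interface: `LossScaleOptimizer` wraps an existing optimizer together with a loss-scale manager that supplies (and may adaptively update) the scaling factor, so training code can keep calling `minimize()` as usual.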