author | James Qin <jamesqin@google.com> | 2018-06-10 22:15:46 -0700 |
---|---|---|
committer | TensorFlower Gardener <gardener@tensorflow.org> | 2018-06-10 22:18:40 -0700 |
commit | 3a1d8bd815b5216bc9515801e4d59cf3ebd1126d (patch) | |
tree | c0fb2c362712d62b2e8ccaba12ae700812e145ed /tensorflow/contrib/mixed_precision | |
parent | 119db15241e29587e0b6ab3912bff5ff63d123eb (diff) | |
Improve the loss_scale_optimizer docstring.
PiperOrigin-RevId: 200001771
Diffstat (limited to 'tensorflow/contrib/mixed_precision')
-rw-r--r-- | tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py | 42 |
1 file changed, 24 insertions, 18 deletions
diff --git a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
index e4e5ccc334..ef34f7bf7b 100644
--- a/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
+++ b/tensorflow/contrib/mixed_precision/python/loss_scale_optimizer.py
@@ -26,26 +26,32 @@ from tensorflow.python.training import optimizer
 
 
 class LossScaleOptimizer(optimizer.Optimizer):
+  # TODO(jamesqin): move mixed precision training explanation to __init__
+  # docstring.
   """An optimizer that applies loss scaling in backprop.
 
-  This class is useful for mixed precision training on GPUs (or other potential
-  accelerators), which is an approach to improve compute throughput without loss
-  of model quality.
-
-  The commmon configuration of mixed precision models is the following:
-  * variables are kept in high precision (e.g. float32).
-  * computations are done in lower precision (e.g. float16). variables are
-    casted to lower precision before they're used.
-  * (in training), final gradients are casted back to variable precision and get
-    applied.
-
-  Because computations happen in lower precision, gradients in the backprop pass
-  might underflow in the smaller dynamic range, causing a model to converge at a
-  suboptimal level. This optimizer multiplies the loss by a factor before
-  backprop starts to prevent underflow. Before gradients are applied, they are
-  casted to higher precision and down-scaled by the same factor, so
-  mathematically the variable updates are no different from regular
-  same-precision training.
+  This class is useful for "mixed precision training" on GPUs (or other
+  potential accelerators), an approach to improve compute throughput without
+  compromising model quality.
+
+  The canonical way to perform mixed precision training is the following:
+  * Model variables are kept in high precision (e.g. float32).
+  * Computations are done in lower precision (e.g. float16), which enjoys
+    performance speedup by virtue of hardware support. Variables are casted to
+    lower precision before they're used.
+  * Final gradients are casted back to high precision dtype, then used to update
+    variables.
+
+  The side-effect of performing computation in lower precision, is that it comes
+  with smaller numerical range. During backproping, small gradients might
+  underflow in the reduced numerical range, causing a model to converge at
+  suboptimal level.
+
+  To prevent underflow, this optimizer multiplies the loss by a factor before
+  backprop starts. Consequently, the gradients are linearly scaled up by the
+  same factor, thus not falling into the underflow zone. After that, to perserve
+  the correctness of backprop, the gradients are down-scaled by the same factor,
+  casted to the (higher) variable precision, then applied on the variables.
 
   See [Nvidia's manual on mixed precision training](
   https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
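For readers skimming the new docstring, the following is a minimal sketch (not part of the commit) of the manual loss-scaling recipe it describes, written against the TF 1.x graph API. The tiny linear model, the fixed `loss_scale` of 128, and all variable names are illustrative assumptions, not the module's API.

```python
import numpy as np
import tensorflow as tf

# Model variable is kept in high precision (float32).
w = tf.Variable(tf.random_normal([4, 1], dtype=tf.float32))

# Inputs arrive in the lower compute precision (float16).
x = tf.placeholder(tf.float16, shape=[None, 4])
y = tf.placeholder(tf.float16, shape=[None, 1])

# Cast the variable to float16 before it is used in the forward pass.
w_fp16 = tf.cast(w, tf.float16)
loss = tf.reduce_mean(tf.square(tf.matmul(x, w_fp16) - y))

loss_scale = 128.0  # Illustrative fixed factor; real setups may adjust it dynamically.
opt = tf.train.GradientDescentOptimizer(0.1)

# Scale the loss before backprop so small float16 gradients do not underflow.
grads_and_vars = opt.compute_gradients(loss * loss_scale, var_list=[w])

# Down-scale the gradients by the same factor, in float32, before applying them,
# so the variable update matches unscaled same-precision training.
unscaled = [(tf.cast(g, tf.float32) / loss_scale, v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(unscaled)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  batch_x = np.random.randn(8, 4).astype(np.float16)
  batch_y = np.random.randn(8, 1).astype(np.float16)
  for _ in range(10):
    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

The contrib module itself packages this pattern behind the usual `Optimizer` interface: `LossScaleOptimizer` wraps an existing optimizer together with a loss-scale manager that supplies (and may adaptively update) the scaling factor, so training code can keep calling `minimize()` as usual.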