author | 2018-08-31 14:28:52 -0700
---|---
committer | 2018-08-31 14:33:10 -0700
commit | b05af29fee46507f8bd688382cfcdbd500621d6a (patch)
tree | 8ae1e39d3a9a9fc1ae1250785bd2f5696476a29b /tensorflow/contrib/lite/tools
parent | 8e547f8f03b61e2b370acf29f19fbac2702371aa (diff)
Docs for quantize weights tool.
PiperOrigin-RevId: 211144861
Diffstat (limited to 'tensorflow/contrib/lite/tools')
-rw-r--r-- | tensorflow/contrib/lite/tools/optimize/g3doc/quantize_weights.md | 70 |
1 files changed, 70 insertions, 0 deletions
diff --git a/tensorflow/contrib/lite/tools/optimize/g3doc/quantize_weights.md b/tensorflow/contrib/lite/tools/optimize/g3doc/quantize_weights.md
new file mode 100644
index 0000000000..93fe576583
--- /dev/null
+++ b/tensorflow/contrib/lite/tools/optimize/g3doc/quantize_weights.md
@@ -0,0 +1,70 @@

# TFLite Quantize Weights Tool

## Recommended usage

The Quantize Weights transformation is integrated with
[tflite_convert](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/toco/g3doc/cmdline_reference.md#transformation-flags).

The recommended way of invoking this tool is to add the
`--post_training_quantize` flag to your original tflite_convert invocation. For
example:

```
tflite_convert \
  --output_file=/tmp/foo.tflite \
  --saved_model_dir=/tmp/saved_model \
  --post_training_quantize
```

## Overview

The Quantize Weights tool provides a simple way to quantize the weights of a
float TFLite model.

TODO(raghuramank): Add link to weight quantization tutorial.

### Size reduction

Float32 weights are converted to 8-bit integers, which yields a model that is
roughly a quarter of the size of the original.

### Latency reduction

TFLite also has "hybrid" kernels implemented for many operations. These
"hybrid" kernels take 8-bit integer weights and float inputs, dynamically
quantize the input tensor (based on its minimum and maximum elements), and
perform the computation using 8-bit integer values. This results in a 2-4x
latency reduction for "hybrid" kernels. In this mode the inference type is
still FLOAT, since the inputs and outputs of each operation remain float.

For operations that do not yet have "hybrid" kernels implemented, we introduce
a Dequantize operation after the 8-bit integer weights. It converts the weights
back to float32 during inference so that the original float32 kernels can run.
Since dequantized results are cached, the performance of each such dequantized
path is on par with the original float model.

TODO(yunluli): Fill in latency results from latency experiments.

### Accuracy

Since this technique quantizes the weights after the model has been trained,
there can be an accuracy drop, depending on the model. For common CNN networks,
the observed accuracy drops are small and can be seen below.

TODO(yunluli): Fill in accuracy results from accuracy experiments.

## Direct usage

One can also invoke Quantize Weights directly via C++ given a float
`::tflite::Model` to convert. The caller must provide a
`flatbuffers::FlatBufferBuilder`, which owns the underlying buffer of the
created model. Here is an example invocation:

```
// Float model to be quantized.
::tflite::Model* input_model = ...;
// The builder owns the buffer of the quantized output model.
flatbuffers::FlatBufferBuilder builder;
TfLiteStatus status =
    ::tflite::optimize::QuantizeWeights(&builder, input_model);
CHECK(status == kTfLiteOk);
const uint8_t* buffer = builder.GetBufferPointer();
const tflite::Model* output_model = ::tflite::GetModel(buffer);
```
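The builder's buffer can then be written out as a `.tflite` file. A minimal
sketch, assuming the `builder` from the snippet above; the output path and lack
of error handling are illustrative only:

```
#include <fstream>

// Persist the quantized model to disk (path is illustrative).
std::ofstream out("/tmp/foo_quantized.tflite", std::ios::binary);
out.write(reinterpret_cast<const char*>(builder.GetBufferPointer()),
          builder.GetSize());
```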