# Quantization Flow in Executorch

## 1. Capture the model with `export.capture_pre_autograd_graph`
### Process
The flow uses `PyTorch 2.0 Export Quantization` to quantize the model, which works on a model captured by `export.capture_pre_autograd_graph`. If the model is not traceable, please see [here](https://pytorch.org/docs/main/generated/exportdb/index.html) for supported constructs in `export.capture_pre_autograd_graph` and how to make the model exportable.

```
# program capture
import copy

from torch._export import capture_pre_autograd_graph

# `m` is the eager-mode model; `example_inputs` is a tuple of example inputs
m = capture_pre_autograd_graph(m, copy.deepcopy(example_inputs))
```
### Result
The result of this step will be a `torch.fx.GraphModule`.

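The captured module can be inspected with standard `torch.fx` tooling, for example:

```
# print the captured aten-level graph
print(m.graph)
# tabular view of all nodes (requires the `tabulate` package)
m.graph.print_tabular()
```
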
## 2. Quantization
### Process
Note: Before quantizing models, each backend needs to implement its own `Quantizer` by following [this tutorial](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html).

Please take a look at the [PyTorch 2.0 Export post training static quantization tutorial](https://pytorch.org/tutorials/prototype/pt2e_quant_ptq_static.html) to learn about all the steps of quantization. The main APIs used to quantize the model are (a sketch of the full flow follows the list):
* `prepare_pt2e`: used to insert observers into the model; it takes a backend-specific `Quantizer` as an argument, which annotates the nodes with the information needed to quantize the model properly for the backend
* calibration (not an API): run the model on some sample data
* `convert_pt2e`: convert an observed model to a quantized model; we have a special representation for selected ops (e.g. quantized linear), other ops are represented as (dq -> float32_op -> q), and q/dq are decomposed into more primitive operators

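For reference, here is a minimal sketch of these three steps. It uses the `XNNPACKQuantizer` only as an example backend `Quantizer`; `m` is the `fx.GraphModule` captured in step 1, and `calibration_data` is a hypothetical iterable of sample inputs:

```
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# backend-specific quantizer that annotates the graph
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())

# insert observers based on the quantizer's annotations
m = prepare_pt2e(m, quantizer)

# calibration: run the observed model on sample data to collect statistics
for data in calibration_data:
    m(*data)

# convert the observed model to a reference quantized model
m = convert_pt2e(m)
```
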
### Result
The result after these steps will be a reference quantized model, with quantize/dequantize operators being further decomposed. Example:

```
# Reference Quantized Pattern for quantized linear
def quantized_linear(
    x_int8, x_scale, x_zero_point,
    weight_int8, weight_scale, weight_zero_point,
    bias_int32, bias_scale, bias_zero_point,
    output_scale, output_zero_point,
):
    x_int16 = x_int8.to(torch.int16)
    weight_int16 = weight_int8.to(torch.int16)
    # integer matmul with an int32 accumulator
    acc_int32 = torch.ops.out_dtype(torch.mm, torch.int32, (x_int16 - x_zero_point), (weight_int16 - weight_zero_point))
    # rescale the accumulator into the output quantization domain
    acc_rescaled_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, torch.int32, acc_int32, x_scale * weight_scale / output_scale)
    bias_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, torch.int32, bias_int32 - bias_zero_point, bias_scale / output_scale)
    out_int8 = torch.ops.aten.clamp(acc_rescaled_int32 + bias_int32 + output_zero_point, qmin, qmax).to(torch.int8)
    return out_int8
```
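
In this pattern, the inputs are widened to `int16` and the matmul accumulates into `int32` via `out_dtype`, so the intermediate integer arithmetic avoids overflow; the accumulator and bias are then rescaled into the output quantization domain and clamped back to `int8`.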

See [here](https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit#heading=h.ov8z39149wy8) for some operators that have integer operator representations.

## 3. Lowering to Executorch
You can lower the quantized model to Executorch by following [this tutorial](https://github.com/pytorch/executorch/blob/main/docs/website/docs/tutorials/exporting_to_executorch.md#12-lower-to-exir-edge-dialect).
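
As a rough sketch, assuming the `exir.capture` -> `to_edge` -> `to_executorch` flow described in the linked tutorial (exact APIs and config arguments may differ between versions):

```
# a hedged sketch of lowering; `m` is the quantized model from step 2
# and `example_inputs` the same tuple used during capture
import executorch.exir as exir

edge_program = exir.capture(m, example_inputs).to_edge()
executorch_program = edge_program.to_executorch()

# serialize to a .pte file that the Executorch runtime can load
with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```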