
Commit 14d0c54

Merge remote-tracking branch 'origin/master' into pt_doc
2 parents 884c100 + 4dbf71e commit 14d0c54

File tree

17 files changed: +778 −316 lines

.azure-pipelines/scripts/ut/env_setup.sh

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ elif [[ $(echo "${test_case}" | grep -c "tf pruning") != 0 ]]; then
 fi

 if [[ $(echo "${test_case}" | grep -c "api") != 0 ]] || [[ $(echo "${test_case}" | grep -c "adaptor") != 0 ]]; then
-    pip install git+https://github.com/intel/auto-round.git@ecca5349981044e1278773a251b3fc5c0a11fe7b
+    pip install auto-round
 fi

 # test deps

docs/3x/TF_Quant.md

Lines changed: 123 additions & 0 deletions
TensorFlow Quantization
===============

1. [Introduction](#introduction)
2. [Usage](#usage)
   2.1 [Without Accuracy Aware Tuning](#without-accuracy-aware-tuning)
   2.2 [With Accuracy Aware Tuning](#with-accuracy-aware-tuning)
   2.3 [Specify Quantization Rules](#specify-quantization-rules)
3. [Examples](#examples)

## Introduction

The INC 3.x new API supports quantizing both TensorFlow and Keras models, with or without accuracy-aware tuning.

For the detailed quantization fundamentals, please refer to the document for [Quantization](../quantization.md).

## Usage

### Without Accuracy Aware Tuning

This means the user can leverage Intel(R) Neural Compressor to directly generate a fully quantized model without accuracy-aware tuning. It is the user's responsibility to ensure the accuracy of the quantized model meets expectations.

```python
# main.py
import tensorflow as tf

# Original code
model = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
val_dataset = ...
val_dataloader = MyDataloader(dataset=val_dataset)  # MyDataloader is user-defined

# Quantization code
from neural_compressor.tensorflow import quantize_model, StaticQuantConfig

quant_config = StaticQuantConfig()
qmodel = quantize_model(
    model=model,
    quant_config=quant_config,
    calib_dataloader=val_dataloader,
)
qmodel.save("./output")
```
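
`MyDataloader` above is user-defined rather than an INC class. A minimal sketch of what it might look like is shown below, under the assumption that the dataset yields `(image, label)` pairs and that the calibration loop only needs an iterable that yields batches and exposes `batch_size`:

```python
import numpy as np


class MyDataloader:
    """Hypothetical batching dataloader for calibration/evaluation (sketch only)."""

    def __init__(self, dataset, batch_size=32):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        images, labels = [], []
        for image, label in self.dataset:
            images.append(image)
            labels.append(label)
            if len(images) == self.batch_size:
                yield np.stack(images), np.stack(labels)
                images, labels = [], []
        if images:  # flush the last partial batch
            yield np.stack(images), np.stack(labels)

    def __len__(self):
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size
```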

### With Accuracy Aware Tuning

This means the user can leverage the advanced features of Intel(R) Neural Compressor to tune out the best quantized model, i.e. one that keeps the best accuracy while still performing well. The user should provide `eval_fn` and, if needed, `eval_args`.

```python
# main.py
import tensorflow as tf

# Original code
model = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
val_dataset = ...
val_dataloader = MyDataloader(dataset=val_dataset)


def eval_acc_fn(model) -> float:
    ...
    return acc


# Quantization code
from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, autotune

# it's also supported to define custom_tune_config as:
# TuningConfig(StaticQuantConfig(weight_sym=[True, False], act_sym=[True, False]))
custom_tune_config = TuningConfig(
    config_set=[
        StaticQuantConfig(weight_sym=True, act_sym=True),
        StaticQuantConfig(weight_sym=False, act_sym=False),
    ]
)
best_model = autotune(
    model=model,
    tune_config=custom_tune_config,
    eval_fn=eval_acc_fn,
    calib_dataloader=val_dataloader,
)
best_model.save("./output")
```
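
For illustration, a concrete `eval_acc_fn` could compute top-1 accuracy over the validation dataloader, as in the sketch below. It assumes the model handed to the function behaves like a Keras model (i.e. exposes `predict`); adapt it to however your model is invoked:

```python
import numpy as np


def eval_acc_fn(model) -> float:
    """Illustrative top-1 accuracy over `val_dataloader` (sketch only)."""
    correct, total = 0, 0
    for images, labels in val_dataloader:
        preds = np.argmax(model.predict(images, verbose=0), axis=-1)
        correct += int(np.sum(preds == np.asarray(labels).reshape(-1)))
        total += len(labels)
    return correct / total
```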

### Specify Quantization Rules

Intel(R) Neural Compressor supports specifying quantization rules by operator name or operator type. Users can set `local` in a config dict or use the `set_local` method of a config class to achieve this.

1. Example of setting `local` from a dict
```python
quant_config = {
    "static_quant": {
        "global": {
            "weight_dtype": "int8",
            "weight_sym": True,
            "weight_granularity": "per_tensor",
            "act_dtype": "int8",
            "act_sym": True,
            "act_granularity": "per_tensor",
        },
        "local": {
            "conv1": {
                "weight_dtype": "fp32",
                "act_dtype": "fp32",
            }
        },
    }
}
config = StaticQuantConfig.from_dict(quant_config)
```
2. Example of using `set_local`
```python
quant_config = StaticQuantConfig()
conv2d_config = StaticQuantConfig(
    weight_dtype="fp32",
    act_dtype="fp32",
)
quant_config.set_local("conv1", conv2d_config)
```
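
Either form of the customized config is then consumed exactly like a plain `StaticQuantConfig`; for example (a sketch reusing `model` and `val_dataloader` from the earlier examples):

```python
qmodel = quantize_model(
    model=model,
    quant_config=quant_config,  # config carrying the per-operator override for "conv1"
    calib_dataloader=val_dataloader,
)
```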

## Examples

Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow) on how to quantize a TensorFlow model with the INC 3.x API.

docs/3x/TF_SQ.md

Lines changed: 52 additions & 0 deletions
# Smooth Quant

1. [Introduction](#introduction)
2. [Usage](#usage)
   2.1 [Using a Fixed alpha](#using-a-fixed-alpha)
   2.2 [Determining the alpha through auto-tuning](#determining-the-alpha-through-auto-tuning)
3. [Examples](#examples)

## Introduction

Quantization is a common compression technique that reduces memory usage and accelerates inference by converting floating-point matrices to integer matrices. For large language models (LLMs) with gigantic parameter counts, systematic outliers make quantization of activations difficult. [SmoothQuant](https://arxiv.org/abs/2211.10438), a training-free post-training quantization (PTQ) solution, offline migrates this difficulty from activations to weights with a mathematically equivalent transformation.
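
For reference, the transformation at the heart of SmoothQuant (as formulated in the paper) scales activations down and weights up per channel by a smoothing factor $s$, so the product is mathematically unchanged while the `alpha` knob controls how much quantization difficulty is migrated:

$$
Y = \left(X \operatorname{diag}(s)^{-1}\right)\left(\operatorname{diag}(s)\, W\right),
\qquad
s_j = \frac{\max\left(\lvert X_j \rvert\right)^{\alpha}}{\max\left(\lvert W_j \rvert\right)^{1-\alpha}}
$$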

Please refer to the document of [Smooth Quant](../quantization.md/#smooth-quant) for detailed fundamental knowledge.

## Usage

There are two ways to apply smooth quantization: 1) using a fixed `alpha` for the entire model or 2) determining the `alpha` through auto-tuning.

### Using a Fixed `alpha`

To set a fixed alpha for the entire model, users can follow this example:

```python
from neural_compressor.tensorflow import SmoothQuantConfig, StaticQuantConfig, quantize_model

sq_config = SmoothQuantConfig(alpha=0.5)
static_config = StaticQuantConfig()
q_model = quantize_model(output_graph_def, [sq_config, static_config], calib_dataloader)
```

The `SmoothQuantConfig` should be combined with `StaticQuantConfig` in a list because we still need to insert QDQ nodes and apply pattern fusion after the smoothing process.

### Determining the `alpha` through auto-tuning

Users can search for the best `alpha` for the entire model. The tuning process looks for the optimal `alpha` value from a list of `alpha` values provided by the user.

Here is an example:

```python
from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, SmoothQuantConfig, autotune

custom_tune_config = TuningConfig(config_set=[SmoothQuantConfig(alpha=[0.5, 0.6, 0.7]), StaticQuantConfig()])
best_model = autotune(
    model="fp32_model",
    tune_config=custom_tune_config,
    eval_fn=eval_fn_wrapper,
    calib_dataloader=calib_dataloader,
)
```

> Please note that it may take a considerable amount of time, as the tuning process applies each `alpha` to the entire model and uses the evaluation result on the entire dataset as the metric to determine the best `alpha`.

## Examples

Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/tensorflow/nlp/large_language_models/quantization/ptq/smoothquant) on how to apply smooth quant to a TensorFlow model with the INC 3.x API.

docs/3x/TensorFlow.md

Lines changed: 206 additions & 0 deletions
TensorFlow
===============

1. [Introduction](#introduction)
2. [API for TensorFlow](#api-for-tensorflow)
3. [Support Matrix](#support-matrix)
   3.1 [Quantization Scheme](#quantization-scheme)
   3.2 [Quantization Approaches](#quantization-approaches)
   3.3 [Backend and Device](#backend-and-device)

## Introduction

<div align="center">
  <img src="https://www.tensorflow.org/images/tf_logo_horizontal.png">
</div>

[TensorFlow](https://www.tensorflow.org/) is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of [tools](https://www.tensorflow.org/resources/tools), [libraries](https://www.tensorflow.org/resources/libraries-extensions), and [community](https://www.tensorflow.org/community) resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications. It provides stable [Python](https://www.tensorflow.org/api_docs/python) and [C++](https://www.tensorflow.org/api_docs/cc) APIs, as well as a non-guaranteed backward compatible API for [other languages](https://www.tensorflow.org/api_docs).

Keras is a multi-backend deep learning framework, supporting JAX, TensorFlow, and PyTorch. It serves as a dependency of TensorFlow, providing high-level APIs for effortlessly building and training models for computer vision, natural language processing, audio processing, timeseries forecasting, recommender systems, and more.


## API for TensorFlow

Intel(R) Neural Compressor provides `quantize_model` and `autotune` as the main interfaces for supported algorithms on the TensorFlow framework.

**quantize_model**

The design philosophy of the `quantize_model` interface is ease of use. With a minimal set of parameters, namely `model`, `quant_config`, `calib_dataloader` and `calib_iteration`, it offers a straightforward way to quantize a TF model in one shot.

```python
def quantize_model(
    model: Union[str, tf.keras.Model, BaseModel],
    quant_config: Union[BaseConfig, list],
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
):
```
`model` should be a string of the model's location, a Keras model object, or an INC TF model wrapper object.

`quant_config` is either a `StaticQuantConfig` object or a list containing a `SmoothQuantConfig` and a `StaticQuantConfig` to indicate what algorithm should be used and what specific quantization rules should be applied.

`calib_dataloader` is used to load the data samples for the calibration phase. In most cases, it could be a subset of the evaluation dataset.

`calib_iteration` is used to decide how many iterations the calibration process will run.

Here is a simple example of using the `quantize_model` interface with a dummy calibration dataloader and the default `StaticQuantConfig`:
```python
from neural_compressor.tensorflow import StaticQuantConfig, quantize_model
from neural_compressor.tensorflow.utils import DummyDataset

dataset = DummyDataset(shape=(100, 32, 32, 3), label=True)
calib_dataloader = MyDataLoader(dataset=dataset)  # MyDataLoader is user-defined
quant_config = StaticQuantConfig()

qmodel = quantize_model("fp32_model.pb", quant_config, calib_dataloader)
```
**autotune**

The `autotune` interface, on the other hand, provides greater flexibility and power. It's particularly useful when accuracy is a critical factor. If the initial quantization doesn't meet the tolerance of accuracy loss, `autotune` will iteratively try quantization rules according to the `tune_config`.

Just like `quantize_model`, `autotune` requires `model`, `calib_dataloader` and `calib_iteration`. In addition, `eval_fn` and `eval_args` are used to build the evaluation process.

```python
def autotune(
    model: Union[str, tf.keras.Model, BaseModel],
    tune_config: TuningConfig,
    eval_fn: Callable,
    eval_args: Optional[Tuple[Any]] = None,
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
) -> Optional[BaseModel]:
```
`model` should be a string of the model's location, a Keras model object, or an INC TF model wrapper object.

`tune_config` is the `TuningConfig` object which contains multiple quantization rules.

`eval_fn` is the evaluation function that measures the accuracy of a model.

`eval_args` is the supplemental arguments required by the defined evaluation function.

`calib_dataloader` is used to load the data samples for the calibration phase. In most cases, it could be a subset of the evaluation dataset.

`calib_iteration` is used to decide how many iterations the calibration process will run.

Here is a simple example of using the `autotune` interface with different quantization rules defined by a list of `StaticQuantConfig`:
```python
from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, autotune

calib_dataloader = MyDataloader(dataset=Dataset())  # MyDataloader and Dataset are user-defined
custom_tune_config = TuningConfig(
    config_set=[
        StaticQuantConfig(weight_sym=True, act_sym=True),
        StaticQuantConfig(weight_sym=False, act_sym=False),
    ]
)
best_model = autotune(
    model="baseline_model",
    tune_config=custom_tune_config,
    eval_fn=eval_acc_fn,
    calib_dataloader=calib_dataloader,
)
```

## Support Matrix

### Quantization Scheme

| Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization |
| :-------------- |:---------------:| ---------------:|---------------:|
| TensorFlow | [oneDNN](https://github.com/oneapi-src/oneDNN) | Activation (int8/uint8), Weight (int8) | - |
| Keras | [ITEX](https://github.com/intel/intel-extension-for-tensorflow) | Activation (int8/uint8), Weight (int8) | - |

+ Symmetric Quantization (a toy numeric sketch of these formulas follows below)
  + int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
  + uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8))

+ oneDNN: [Lower Numerical Precision Deep Learning Inference and Training](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html)
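
To make the symmetric int8 formula above concrete, here is a toy numeric sketch (an illustration of the arithmetic only, not the oneDNN kernel implementation):

```python
import numpy as np

# int8 symmetric scheme from the list above:
# scale = 2 * max(|rmin|, |rmax|) / (max(int8) - min(int8) - 1)  ==  max_abs / 127
tensor = np.array([-2.5, -0.3, 0.7, 3.1], dtype=np.float32)
rmin, rmax = float(tensor.min()), float(tensor.max())
scale = 2 * max(abs(rmin), abs(rmax)) / (127 - (-128) - 1)

q = np.clip(np.round(tensor / scale), -127, 127).astype(np.int8)  # quantize
dq = q.astype(np.float32) * scale  # dequantize
print(scale, q, dq)
```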

### Quantization Approaches

The supported quantization methods for TensorFlow and Keras are listed below:

<table class="center">
    <thead>
        <tr>
            <th>Types</th>
            <th>Quantization</th>
            <th>Dataset Requirements</th>
            <th>Framework</th>
            <th>Backend</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan="2" align="center">Post-Training Static Quantization (PTQ)</td>
            <td rowspan="2" align="center">weights and activations</td>
            <td rowspan="2" align="center">calibration</td>
            <td align="center">Keras</td>
            <td align="center"><a href="https://github.com/intel/intel-extension-for-tensorflow">ITEX</a></td>
        </tr>
        <tr>
            <td align="center">TensorFlow</td>
            <td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
        </tr>
        <tr>
            <td rowspan="2" align="center">Smooth Quantization (SQ)</td>
            <td rowspan="2" align="center">weights</td>
            <td rowspan="2" align="center">calibration</td>
            <td align="center">TensorFlow</td>
            <td align="center"><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>/<a href="https://github.com/Intel-tensorflow/tensorflow">Intel TensorFlow</a></td>
        </tr>
    </tbody>
</table>
<br>
<br>

#### Post Training Static Quantization

The min/max ranges of weights and activations are collected offline on a so-called `calibration` dataset. This dataset should represent the data distribution of the unseen inference data. The `calibration` process runs on the original fp32 model and dumps out all the tensor distributions needed for the `Scale` and `ZeroPoint` calculations. Usually, preparing about 100 samples is enough for calibration; a usage sketch follows below.

Refer to the [PTQ Guide](./TF_Quant.md) for detailed information.
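
The usage sketch referenced above maps the calibration knobs onto the `quantize_model` interface; the ~100-sample figure is a rule of thumb, and `calib_dataloader` is assumed to be a user-defined dataloader yielding those samples:

```python
from neural_compressor.tensorflow import StaticQuantConfig, quantize_model

# Calibration sketch: feed roughly 100 samples and cap the number of
# calibration iterations explicitly (calib_iteration defaults to 100).
qmodel = quantize_model(
    model="fp32_model.pb",
    quant_config=StaticQuantConfig(),
    calib_dataloader=calib_dataloader,  # user-defined dataloader over ~100 samples
    calib_iteration=100,
)
```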

#### Smooth Quantization

Smooth Quantization (SQ) is an advanced quantization technique designed to optimize model performance while maintaining high accuracy. Unlike traditional quantization methods that can lead to significant accuracy loss, SQ takes a more refined approach by striking a balance between the scales of activations and weights.

Refer to the [SQ Guide](./TF_SQ.md) for detailed information.

### Backend and Device

Intel(R) Neural Compressor supports TF GPU with [ITEX-XPU](https://github.com/intel/intel-extension-for-tensorflow). The model will automatically run on GPU when ITEX is detected as installed (a small detection sketch follows the table below).

<table class="center">
    <thead>
        <tr>
            <th>Framework</th>
            <th>Backend</th>
            <th>Backend Library</th>
            <th>Backend Value</th>
            <th>Support Device(cpu as default)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan="2" align="left">TensorFlow</td>
            <td align="left">TensorFlow</td>
            <td align="left">OneDNN</td>
            <td align="left">"default"</td>
            <td align="left">cpu</td>
        </tr>
        <tr>
            <td align="left">ITEX</td>
            <td align="left">OneDNN</td>
            <td align="left">"itex"</td>
            <td align="left">cpu | gpu</td>
        </tr>
    </tbody>
</table>
<br>
<br>
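
The detection sketch referenced above; this is an assumption about the mechanism (importability of the ITEX package) rather than INC's exact implementation:

```python
import importlib.util

# If intel_extension_for_tensorflow is importable, the "itex" backend (cpu | gpu)
# can be used; otherwise fall back to the "default" oneDNN backend on cpu.
backend = "itex" if importlib.util.find_spec("intel_extension_for_tensorflow") else "default"
print(f"Selected backend: {backend}")
```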
