Commit ecffc2e

authored
PyTorch 3x document (#1824)
Signed-off-by: xin3he <[email protected]> Signed-off-by: Cheng, Zixuan <[email protected]> Signed-off-by: Kaihui-intel <[email protected]> Signed-off-by: zehao-intel <[email protected]> Signed-off-by: yiliu30 <[email protected]>
1 parent de3e94f commit ecffc2e

17 files changed

+1330
-34
lines changed

.coverage

52 KB
Binary file not shown.

docs/3x/PT_DynamicQuant.md

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
Dynamic Quantization
2+
===============
3+
4+
1. [Introduction](#introduction)
5+
2. [Getting Started with Dynamic Quantization](#getting-started-with-dynamic-quantization)
6+
3. [Examples](#examples)
7+
8+
9+
## Introduction
10+
Quantization is the process of converting floating point weights and activations to lower bitwidth tensors by multiplying the floating point values by a scale factor and rounding the results to whole numbers. Dynamic quantization determines the scale factor for activations dynamically based on the data range observed at runtime. We support W8A8 (quantizing weights and activations into 8 bits) dynamic quantization by leveraging torch's [`X86InductorQuantizer`](https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html?highlight=x86inductorquantizer).
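To make the idea of a runtime-determined scale concrete, here is a minimal pure-Python sketch of symmetric per-tensor dynamic quantization. It is only an illustration of the arithmetic described above, not the code path used by `X86InductorQuantizer`; the function names `dynamic_quantize` and `dequantize` are hypothetical.

```python
def dynamic_quantize(values, num_bits=8):
    """Symmetric per-tensor quantization with a scale derived from the data itself."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    amax = max(abs(v) for v in values)
    # The scale maps the observed range onto the integer range.
    # Guard against an all-zero input, where any scale would do.
    scale = qmax / amax if amax > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v * scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate floating point values."""
    return [q / scale for q in quantized]


# The scale is computed from the activations seen at call time (hypothetical values).
activations = [0.02, -1.5, 0.75, 3.0]
q, scale = dynamic_quantize(activations)
restored = dequantize(q, scale)
```

Because the scale is recomputed for every input, no calibration pass is needed for activations; the cost is an extra range computation at inference time.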
11+
12+
13+
## Getting Started with Dynamic Quantization
14+
There are four steps to perform W8A8 dynamic quantization: `export`, `prepare`, `convert` and `compile`.
15+
16+
```python
17+
import torch
18+
from neural_compressor.torch.export import export
19+
from neural_compressor.torch.quantization import DynamicQuantConfig, prepare, convert
20+
21+
# Prepare the float model and the example inputs used for export
22+
model = UserFloatModel()
23+
example_inputs = ...
24+
25+
# Export eager model into FX graph model
26+
exported_model = export(model=model, example_inputs=example_inputs)
27+
# Quantize the model
28+
quant_config = DynamicQuantConfig()
29+
prepared_model = prepare(exported_model, quant_config=quant_config)
30+
q_model = convert(prepared_model)
31+
# Compile the quantized model and replace the Q/DQ pattern with Q-operator
32+
from torch._inductor import config
33+
34+
config.freezing = True
35+
opt_model = torch.compile(q_model)
36+
```
37+
38+
> Note: The `set_local` of `DynamicQuantConfig` will be supported after the torch 2.4 release.
39+
40+
41+
## Examples
42+
Examples will be added later.

docs/3x/PT_MXQuant.md

Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,107 @@
1+
Microscaling Quantization
2+
===============
3+
4+
1. [Introduction](#introduction)
5+
2. [Get Started with Microscaling Quantization API](#get-started-with-microscaling-quantization-api)
6+
3. [Examples](#examples)
7+
4. [Reference](#reference)
8+
9+
## Introduction
10+
11+
Numerous breakthroughs have emerged across various fields, such as text analysis, language translation and chatbot technologies, fueled by the development of large language models (LLMs). Nevertheless, their increasing power comes with the challenge of explosive growth in parameters, posing obstacles for practical use. To balance memory limits and accuracy preservation for AI models, the Microscaling (MX) specification was developed from the well-known Microsoft Floating Point (MSFP) data type [1, 2]:
12+
13+
<table>
14+
<tr>
15+
<th>Format Name</th>
16+
<th>Element Data type</th>
17+
<th>Element Bits</th>
18+
<th>Scaling Block Size</th>
19+
<th>Scale Data Type</th>
20+
<th>Scale Bits</th>
21+
</tr>
22+
<tr>
23+
<td rowspan="2">MXFP8</td>
24+
<td>FP8 (E5M2)</td>
25+
<td rowspan="2">8</td>
26+
<td rowspan="2">32</td>
27+
<td rowspan="2">E8M0</td>
28+
<td rowspan="2">8</td>
29+
</tr>
30+
<tr>
31+
<td>FP8 (E4M3)</td>
32+
</tr>
33+
<tr>
34+
<td rowspan="2">MXFP6</td>
35+
<td>FP6 (E3M2)</td>
36+
<td rowspan="2">6</td>
37+
<td rowspan="2">32</td>
38+
<td rowspan="2">E8M0</td>
39+
<td rowspan="2">8</td>
40+
</tr>
41+
<tr>
42+
<td>FP6 (E2M3)</td>
43+
</tr>
44+
<tr>
45+
<td>MXFP4</td>
46+
<td>FP4 (E2M1)</td>
47+
<td>4</td>
48+
<td>32</td>
49+
<td>E8M0</td>
50+
<td>8</td>
51+
</tr>
52+
<tr>
53+
<td>MXINT8</td>
54+
<td>INT8</td>
55+
<td>8</td>
56+
<td>32</td>
57+
<td>E8M0</td>
58+
<td>8</td>
59+
</tr>
60+
</table>
61+
62+
63+
At an equivalent accuracy level, the MX data type demonstrates the ability to occupy a smaller area and incur lower energy costs for multiply-accumulate compared to other conventional data types on the same silicon [1].
64+
65+
Neural Compressor seamlessly applies the MX data type to post-training quantization, offering meticulously crafted recipes to empower users to quantize LLMs without sacrificing accuracy. The workflow is shown below.
66+
67+
<a target="_blank" href="./imgs/mx_workflow.png" text-align:left>
68+
<left>
69+
<img src="./imgs/mx_workflow.png" alt="Workflow of MX Quant (source [3])" height=120>
70+
</left>
71+
</a>
72+
73+
The memory and computational limits of LLMs are more severe than those of other general neural networks, so our exploration focuses on LLMs first. The following table shows the basic MX quantization recipes in Neural Compressor and enumerates distinctions among various data types. The MX data type replaces the general float scale with powers of two to be more hardware-friendly. It adopts a granularity falling between per-channel and per-tensor to balance accuracy and memory consumption.
74+
75+
| | MX Format | INT8 | FP8 |
76+
|------------|--------------|------------|------------|
77+
| Scale | $2^{exp}$ | $\frac{MAX}{amax}$ | $\frac{MAX}{amax}$ |
78+
| Zero point | 0 (None) | $2^{bits - 1}$ or $-rmin \cdot scale$ | 0 (None) |
79+
| Granularity | per-block (default blocksize is 32) | per-channel or per-tensor | per-channel or per-tensor |
80+
81+
The exponent (exp) is equal to `torch.floor(torch.log2(amax))`, MAX is the representation range of the data type (its largest representable magnitude), amax is the maximum absolute value of the per-block tensor, and rmin is the minimum value of the per-block tensor.
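As a rough illustration of the per-block, power-of-two scale described above, the following pure-Python sketch computes one shared exponent per block of 32 values using only the formula stated in this document, `floor(log2(amax))`. The helper name `mx_shared_exponents` is hypothetical, and details of the full MX specification (for example, how the exponent is offset for each element format, or how the elements themselves are encoded) are deliberately left out.

```python
import math


def mx_shared_exponents(values, block_size=32):
    """Return one power-of-two exponent per block: floor(log2(amax))."""
    exponents = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # An all-zero block has no meaningful exponent; 0 is an arbitrary placeholder.
        exponents.append(math.floor(math.log2(amax)) if amax > 0 else 0)
    return exponents


# Two blocks with very different value ranges (hypothetical data).
tensor = [0.1] * 32 + [5.0] * 32
exps = mx_shared_exponents(tensor)
scales = [2.0 ** e for e in exps]
```

Each block gets its own scale, so a block of small values is not forced to share the range of a block containing large values, which is the sense in which the granularity sits between per-channel and per-tensor.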
82+
83+
84+
## Get Started with Microscaling Quantization API
85+
86+
To get a model quantized with Microscaling Data Types, users can use the Microscaling Quantization API as follows.
87+
88+
```python
89+
from neural_compressor.torch.quantization import MXQuantConfig, prepare, convert
90+
91+
quant_config = MXQuantConfig(w_dtype=args.w_dtype, act_dtype=args.act_dtype, weight_only=args.woq)
92+
user_model = prepare(model=user_model, quant_config=quant_config)
93+
user_model = convert(model=user_model)
94+
```
95+
96+
## Examples
97+
98+
- PyTorch [huggingface models](/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/mx)
99+
100+
101+
## Reference
102+
103+
[1]: Darvish Rouhani, Bita, et al. "Pushing the limits of narrow precision inferencing at cloud scale with microsoft floating point." Advances in neural information processing systems 33 (2020): 10271-10281
104+
105+
[2]: OCP Microscaling Formats (MX) Specification
106+
107+
[3]: Rouhani, Bita Darvish, et al. "Microscaling Data Formats for Deep Learning." arXiv preprint arXiv:2310.10537 (2023).

docs/3x/PT_MixPrecision.md

Lines changed: 103 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,103 @@
1+
PyTorch Mixed Precision
2+
========================================
3+
4+
1. [Introduction](#introduction)
5+
2. [Mixed Precision Support Matrix](#mixed-precision-support-matrix)
6+
3. [Get Started with autotune API](#get-started-with-autotune-api)
7+
4. [Examples](#examples)
8+
9+
## Introduction
10+
11+
The recent growth of Deep Learning has driven the development of more complex models that require significantly more compute and memory capabilities. Several low precision numeric formats have been proposed to address the problem. Google's [bfloat16](https://cloud.google.com/tpu/docs/bfloat16) and the IEEE [FP16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) half-precision format are two of the most widely used 16-bit formats. [Mixed precision](https://arxiv.org/abs/1710.03740) training and inference using low precision formats have been developed to reduce compute and bandwidth requirements.
12+
13+
The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake), featuring Intel® Deep Learning Boost, is the first general-purpose x86 CPU to support the bfloat16 format. Specifically, three new bfloat16 instructions are added as a part of the AVX512_BF16 extension within Intel Deep Learning Boost: VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions convert FP32 values to the bfloat16 data type, while the last one performs a dot product of bfloat16 pairs and accumulates the result in FP32. Further details can be found in the [hardware numerics document](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-deep-learning-boost-new-instruction-bfloat16.html) published by Intel.
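For readers unfamiliar with the format, bfloat16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits of an FP32 value. The sketch below shows that conversion in software with round-to-nearest-even, using only the Python standard library. It is an illustration of the number format, not of how the hardware instructions are implemented; the helper names are hypothetical and NaN inputs are not handled.

```python
import struct


def fp32_to_bf16_bits(x):
    """Round an FP32 value to the nearest bfloat16 (ties to even); return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit makes exact ties round to an even result.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    # Drop the low 16 mantissa bits. NaN inputs are not handled in this sketch.
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b):
    """Widen a bfloat16 bit pattern back to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


# Round trip: only about 2-3 significant decimal digits survive.
approx_pi = bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159))
```

Because the exponent field is unchanged, bfloat16 covers the same dynamic range as FP32 and only loses mantissa precision.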
14+
15+
The 4th Gen Intel® Xeon® Scalable processor supports FP16 instruction set architecture (ISA) for Intel®
16+
Advanced Vector Extensions 512 (Intel® AVX-512). The new ISA supports a wide range of general-purpose numeric
17+
operations for 16-bit half-precision IEEE-754 floating-point and complements the existing 32-bit and 64-bit floating-point instructions already available in the Intel Xeon processor based products. Further details can be found in the [hardware numerics document](https://www.intel.com/content/www/us/en/content-details/669773/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide.html) published by Intel.
18+
19+
<p align="center" width="100%">
20+
<img src="./imgs/data_format.png" alt="Architecture" height=230>
21+
</p>
22+
23+
## Mixed Precision Support Matrix
24+
25+
<table class="center">
26+
<thead>
27+
<tr>
28+
<th>Framework</th>
29+
<th>Backend</th>
30+
<th>Backend Library</th>
31+
<th>Backend Value</th>
32+
<th>Supported Device (cpu as default)</th>
33+
<th>Support BF16</th>
34+
<th>Support FP16</th>
35+
</tr>
36+
</thead>
37+
<tbody>
38+
<tr>
39+
<td rowspan="1" align="left">PyTorch</td>
40+
<td align="left">FX</td>
41+
<td align="left">FBGEMM</td>
42+
<td align="left">"default"</td>
43+
<td align="left">cpu</td>
44+
<td align="left">&#10004;</td>
45+
<td align="left">&#10004;</td>
46+
</tr>
47+
</tbody>
48+
</table>
49+
50+
51+
### Hardware and Software requirements for **BF16**
52+
- PyTorch
53+
1. Hardware: CPU supports `avx512_bf16` instruction set.
54+
2. Software: torch >= [1.11.0](https://download.pytorch.org/whl/torch_stable.html).
55+
56+
57+
### Hardware and Software requirements for **FP16**
58+
- PyTorch
59+
1. Hardware: CPU supports `avx512_fp16` instruction set.
60+
2. Software: torch >= [1.11.0](https://download.pytorch.org/whl/torch_stable.html).
61+
62+
63+
### Accuracy-driven mixed precision
64+
BF16/FP16 conversion may cause an accuracy drop. Intel® Neural Compressor provides an accuracy-driven tuning function to reduce accuracy loss,
65+
which can fall back converted ops to FP32, if set in the config, to get better accuracy. To enable this function, users only need to provide
66+
`eval_fn` and `eval_args` for `autotune`.
67+
Note that the IPEX backend doesn't support accuracy-driven mixed precision.
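The idea of accuracy-driven tuning can be pictured as a small search loop: evaluate candidate precision settings and stop at the first one whose accuracy stays within a tolerated loss. The following pure-Python toy is only a conceptual sketch under that assumption. It is not Neural Compressor's tuning algorithm, and the names `pick_config`, `tolerable_loss`, the candidate labels, and the accuracy numbers are all hypothetical.

```python
def pick_config(candidates, eval_fn, baseline_acc, tolerable_loss=0.01, max_trials=3):
    """Try candidates in order; return the first (config, acc) within the tolerated relative loss."""
    best = None
    for trial, config in enumerate(candidates):
        if trial >= max_trials:
            break
        acc = eval_fn(config)
        # Remember the best result seen so far in case nothing meets the goal.
        if best is None or acc > best[1]:
            best = (config, acc)
        if baseline_acc - acc <= tolerable_loss * baseline_acc:
            return config, acc
    return best


# Made-up accuracies standing in for a real evaluation function.
fake_accuracy = {
    "all_bf16": 0.70,
    "bf16_with_fp32_fallback": 0.758,
    "all_fp32": 0.76,
}
chosen = pick_config(list(fake_accuracy), fake_accuracy.get, baseline_acc=0.76)
```

In this toy run the fully converted model loses too much accuracy, so the search settles on the candidate that keeps some ops in FP32.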
68+
69+
## Get Started with autotune API
70+
71+
To get a bf16/fp16 model, users can use the `autotune` interface with `MixPrecisionConfig` as follows.
72+
73+
- BF16:
74+
75+
```python
76+
from neural_compressor.torch.quantization import MixPrecisionConfig, TuningConfig, autotune
77+
78+
def eval_acc_fn(model):
79+
    ...  # evaluate `model` and compute `acc`
80+
    return acc
81+
82+
# Modules may fall back to fp32 to get better accuracy
83+
custom_tune_config = TuningConfig(config_set=[MixPrecisionConfig(dtype=["bf16", "fp32"])], max_trials=3)
84+
best_model = autotune(model=build_torch_model(), tune_config=custom_tune_config, eval_fn=eval_acc_fn)
85+
```
86+
87+
- FP16:
88+
89+
```python
90+
from neural_compressor.torch.quantization import MixPrecisionConfig, TuningConfig, autotune
91+
92+
def eval_acc_fn(model):
93+
    ...  # evaluate `model` and compute `acc`
94+
    return acc
95+
96+
# Modules may fall back to fp32 to get better accuracy
97+
custom_tune_config = TuningConfig(config_set=[MixPrecisionConfig(dtype=["fp16", "fp32"])], max_trials=3)
98+
best_model = autotune(model=build_torch_model(), tune_config=custom_tune_config, eval_fn=eval_acc_fn)
99+
```
100+
101+
## Examples
102+
103+
Examples will be added later.

docs/3x/PT_SmoothQuant.md

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
PyTorch Smooth Quantization
2+
========================================
3+
4+
1. [Introduction](#introduction)
5+
2. [Usage](#usage)
6+
3. [Validated Models](#validated-models)
7+
4. [Supported Framework Matrix](#supported-framework-matrix)
8+
9+
10+
## Introduction
11+
Quantization is a common compression operation that reduces memory usage and accelerates inference by converting floating point matrices to integer matrices. For large language models (LLMs) with a huge number of parameters, systematic outliers make quantization of activations difficult. [SmoothQuant](https://arxiv.org/abs/2211.10438), a training-free post-training quantization (PTQ) solution, migrates this difficulty offline from activations to weights with a mathematically equivalent transformation.
12+
13+
14+
## Usage
15+
### Fixed Alpha
16+
To set a fixed alpha for the entire model, users can follow this example:
17+
18+
```python
19+
from neural_compressor.torch.quantization import SmoothQuantConfig, convert, prepare
20+
21+
22+
def run_fn(model):
23+
    model(example_inputs)
24+
25+
26+
quant_config = SmoothQuantConfig(alpha=0.5)
27+
prepared_model = prepare(fp32_model, quant_config=quant_config, example_inputs=example_inputs)
28+
run_fn(prepared_model)
29+
q_model = convert(prepared_model)
30+
```
31+
`SmoothQuantConfig` description:
32+
33+
`alpha`: a smooth factor to calculate the conversion per-channel scale and balance the quantization difficulty of activation and weight. Float value, default is 0.5.
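To illustrate how `alpha` balances the two sides, the sketch below applies the per-channel smoothing scale from the SmoothQuant paper, `s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)`, to a tiny hand-made example in pure Python. It is a numerical illustration only, not Neural Compressor's implementation; the helper names and the matrices are hypothetical.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def matmul(x, w):
    """Plain nested-list matrix product, enough for a tiny example."""
    return [[sum(xi * wj for xi, wj in zip(row, col)) for col in zip(*w)] for row in x]


# X has shape (tokens, in_channels); W has shape (in_channels, out_channels).
X = [[0.5, 40.0], [-1.0, -64.0]]  # input channel 1 carries activation outliers
W = [[0.25, -1.0], [0.5, 0.125]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(2)]
weight_absmax = [max(abs(v) for v in W[j]) for j in range(2)]
s = smooth_scales(act_absmax, weight_absmax, alpha=0.5)

# Divide activations by s and multiply the matching weight rows by s.
X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(2)]
```

The product `matmul(X_smooth, W_smooth)` matches `matmul(X, W)` up to floating point error, while the outlier channel's activation range shrinks and its weight range grows. With `alpha=0.5` the two ranges meet in the middle; a larger `alpha` moves more of the difficulty onto the weights.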
34+
35+
> **Note:** `alpha="auto"` and alpha auto-tuning were supported in the old API; support for auto alpha in the new API is not available yet, so please stay tuned.
36+
37+
### Specify Quantization Rules
38+
Intel(R) Neural Compressor supports specifying quantization rules by operator type for Smooth Quantization. Users can call `set_local` on a `SmoothQuantConfig` to fall back a given op type.
39+
40+
Here we don't quantize `Linear` layers.
41+
```python
42+
# fallback by op_type
43+
quant_config.set_local("Linear", SmoothQuantConfig(w_dtype="fp32", act_dtype="fp32"))
44+
prepared_model = prepare(fp32_model, quant_config=quant_config, example_inputs=example_inputs)
45+
run_fn(prepared_model)
46+
q_model = convert(prepared_model)
47+
```
48+
49+
To get more information, please refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/llm).
50+
51+
52+
## Validated Models
53+
Neural Compressor: 2.1
54+
55+
IPEX (Intel Extension for PyTorch): 2.0/2.1
56+
57+
Dataset: lambada_openai
58+
59+
Task: text-generation provided by [ITREX](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-generation/quantization)
60+
61+
An alpha in the range [0.4, 0.6] is the sweet-spot region reported in the SmoothQuant paper.
62+
63+
A list of models that achieved a <1% accuracy drop is shown below.
64+
65+
| Model/Last token accuracy | FP32 Accuracy | INT8 (w/ SmoothQuant) | Notes |
66+
|:----------:|:------:|:------:|-----------------------------------|
67+
| bigscience/bloom-560m | 0.354 | 0.3542 | alpha=0.5, Ipex 2.1 |
68+
| bigscience/bloom-1b7 | 0.4634 | 0.4936 | alpha=0.5, Ipex 2.0 |
69+
| bigscience/bloom-3b | 0.518 | 0.5185 | alpha=0.8, Ipex 2.1 |
70+
| bigscience/bloom-7b1 | 0.5764 | 0.5977 | alpha=0.5, Ipex 2.0 |
71+
| bigscience/bloomz-560m | 0.3947 | 0.3930 | alpha=0.8, Ipex 2.1 |
72+
| bigscience/bloomz-1b7 | 0.4828 | 0.4906 | alpha=0.5, Ipex 2.1 |
73+
| bigscience/bloomz-3b | 0.5018 | 0.4980 | alpha=0.5, Ipex 2.1 |
74+
| bigscience/bloomz-7b1 | 0.5593 | 0.5552 | alpha=0.5, Ipex 2.1 |
75+
| facebook/opt-125m | 0.379 | 0.3757 | alpha=0.5, Ipex 2.1 |
76+
| facebook/opt-350m | 0.4516 | 0.4533 | alpha=0.8, Ipex 2.1 |
77+
| facebook/opt-1.3b | 0.5789 | 0.5742 | alpha=0.8, Ipex 2.0 |
78+
| facebook/opt-2.7b | 0.6365 | 0.6404 | alpha=0.5, Ipex 2.0 |
79+
| facebook/opt-6.7b | 0.6769 | 0.6804 | alpha=0.5, Ipex 2.0 |
80+
| facebook/opt-13b | 0.6872 | 0.6814 | alpha=0.5, Ipex 2.1 |
81+
| facebook/opt-30b | 0.7149 | 0.7128 | alpha=0.5, Ipex 2.1 |
82+
| facebook/opt-66b | 0.7398 | 0.7326 | alpha=0.5, Ipex 2.1 |
83+
| LLaMa-7b | 0.7361 | 0.7357 | alpha=0.8, Ipex 2.1 |
84+
| LLaMa-13b | 0.7627 | 0.7590 | alpha=0.7, Ipex 2.1 |
85+
| LLaMa-30b | 0.7759 | 0.7840 | alpha=0.7, Ipex 2.1 |
86+
| LLaMa-65b | 0.7908 | 0.7957 | alpha=0.9, Ipex 2.1 |
87+
| EleutherAI/gpt-j-6B* | 0.6831 | 0.6821 | alpha=1.0, Ipex 2.1 |
88+
| MBZUAI/LaMini-GPT-124m | 0.3804 | 0.3887 | alpha=0.5, Ipex 2.1 |
89+
| MBZUAI/LaMini-GPT-774m | 0.5048 | 0.5057 | alpha=0.5, Ipex 2.1 |
90+
| MBZUAI/LaMini-GPT-1.5b | 0.5443 | 0.5436 | alpha=0.5, Ipex 2.1 |
91+
| mosaicml/mpt-7b-chat | 0.655 | 0.6499 | alpha=0.7, Ipex 2.1 |
92+
| stabilityai/stablelm-base-alpha-3b | 0.4172 | 0.4149 | alpha=0.6, Ipex 2.1 |
93+
| togethercomputer/RedPajama-INCITE-Base-3B-v1 | 0.6542 | 0.6735 | alpha=0.5, Ipex 2.1 |
94+
| togethercomputer/RedPajama-INCITE-Chat-3B-v1* | 0.6718 | 0.6740 | alpha=0.5, Ipex 2.0 |
95+
| togethercomputer/RedPajama-INCITE-Instruct-3B-v1* | 0.6569 | 0.6621 | alpha=0.5, Ipex 2.0 |
96+
| togethercomputer/RedPajama-INCITE-Base-7B-v0.1* | 0.7143 | 0.7221 | alpha=0.5, Ipex 2.0 |
97+
| togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1* | 0.6895 | 0.6953 | alpha=0.5, Ipex 2.0 |
98+
| databricks/dolly-v1-6b* | 0.6866 | 0.6895 | alpha=0.8, Ipex 2.1 |
99+
| databricks/dolly-v2-3b* | 0.6297 | 0.6247 | alpha=0.5, Ipex 2.1 |
100+
| tiiuae/falcon-7b-instruct | 0.6437 | 0.6392 | alpha=0.7, Pytorch |
101+
102+
Please refer to the step-by-step [instructions](../../examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/ipex/README.md) for details.
103+
104+
Please note that for models marked with an asterisk (*), all add ops were set to FP32 during the quantization step to achieve the desired results.
105+
106+
107+
## Supported Framework Matrix
108+
109+
| Framework | Alpha | Folding |
110+
|:---------:|--------------|------------|
111+
| PyTorch | [0-1] | False |
112+
| IPEX | [0-1] | True / False (Version > 2.1) |
