From 7624309616b26417c7818c5942580f112ec2719c Mon Sep 17 00:00:00 2001
From: xinhe3
Date: Wed, 14 Aug 2024 09:57:37 +0300
Subject: [PATCH 1/2] update readme for fp8

Signed-off-by: xinhe3
---
 README.md                           |  2 +-
 docs/{ => source}/3x/PT_FP8Quant.md |  2 +-
 docs/source/3x/PyTorch.md           | 13 +++++++++----
 3 files changed, 11 insertions(+), 6 deletions(-)
 rename docs/{ => source}/3x/PT_FP8Quant.md (97%)

diff --git a/README.md b/README.md
index f4694e991e9..12cbe7367f9 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,7 @@ pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
 ```
 
 After successfully installing these packages, try your first quantization program.
-### [FP8 Quantization](./examples/3.x_api/pytorch/cv/fp8_quant/)
+### [FP8 Quantization](./docs/source/3x/PT_FP8Quant.md)
 Following example code demonstrates FP8 Quantization, it is supported by Intel Gaudi2 AI Accelerator.
 To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
diff --git a/docs/3x/PT_FP8Quant.md b/docs/source/3x/PT_FP8Quant.md
similarity index 97%
rename from docs/3x/PT_FP8Quant.md
rename to docs/source/3x/PT_FP8Quant.md
index a0ed3352e8e..06fd37b367f 100644
--- a/docs/3x/PT_FP8Quant.md
+++ b/docs/source/3x/PT_FP8Quant.md
@@ -108,6 +108,6 @@ model = convert(model)
 | Task | Example |
 |----------------------|---------|
 | Computer Vision (CV) | [Link](../../examples/3.x_api/pytorch/cv/fp8_quant/) |
-| Large Language Model (LLM) | [Link](https://github.com/HabanaAI/optimum-habana-fork/tree/habana-main/examples/text-generation#running-with-fp8) |
+| Large Language Model (LLM) | [Link](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8) |
 
 > Note: For LLM, Optimum-habana provides higher performance based on modified modeling files, so here the Link of LLM goes to Optimum-habana, which utilize Intel Neural Compressor for FP8 quantization internally.
diff --git a/docs/source/3x/PyTorch.md b/docs/source/3x/PyTorch.md
index a3004f6bcfb..2c2111d4d69 100644
--- a/docs/source/3x/PyTorch.md
+++ b/docs/source/3x/PyTorch.md
@@ -176,16 +176,21 @@ def load(output_dir="./saved_results", model=None):
       link
-      Static Quantization
-      Post-traning Static Quantization
-      intel-extension-for-pytorch
+      Static Quantization
+      Post-traning Static Quantization
+      intel-extension-for-pytorch (INT8)
       ✔
       link
-      TorchDynamo
+      TorchDynamo (INT8)
       link
+
+      Intel Gaudi AI accelerator (FP8)
+      ✔
+      link
+
       Dynamic Quantization

From 3091c6e8cccef9188999bd6efaca24f073689696 Mon Sep 17 00:00:00 2001
From: xinhe3
Date: Wed, 14 Aug 2024 10:15:03 +0300
Subject: [PATCH 2/2] fix per review

Signed-off-by: xinhe3
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 12cbe7367f9..fa82961dd75 100644
--- a/README.md
+++ b/README.md
@@ -147,7 +147,7 @@ Intel Neural Compressor will convert the model format from auto-gptq to hpu form
       Weight-Only Quantization
-      FP8 Quantization
+      FP8 Quantization
       MX Quantization
       Mixed Precision
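The PT_FP8Quant.md guide that this patch relocates describes the prepare/convert flow whose last step, `model = convert(model)`, appears as hunk context above. Below is a minimal, illustrative sketch of that flow; the `FP8Config` argument, the toy model, and the calibration loop are assumptions for demonstration (a real run targets the Intel Gaudi accelerator with the Habana software stack installed), not code taken from this patch.

```python
# Sketch of the FP8 quantization flow referenced in the patched docs.
# Assumed API per docs/source/3x/PT_FP8Quant.md; FP8Config arguments,
# the toy model, and the calibration data are illustrative only.
import torch
from neural_compressor.torch.quantization import FP8Config, convert, prepare

# Placeholder model and data; in practice this runs on Intel Gaudi (HPU).
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
calib_data = [torch.randn(4, 16) for _ in range(8)]

config = FP8Config(fp8_config="E4M3")  # FP8 data type; other fields left at assumed defaults
model = prepare(model, config)         # insert measurement/observer hooks

with torch.no_grad():                  # calibration pass over representative inputs
    for batch in calib_data:
        model(batch)

model = convert(model)                 # emit the FP8 model, as in the hunk context above
```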