
Add channel-wise quantization option for QDQ, and optimize for Intel NPU #669


Open — wants to merge 2 commits into base: ovep-develop
Conversation

bopeng1234

Description

Adds 4-bit channel-wise quantization capability to the DequantizeLinear op for the phi3 model; it improves TPS on Intel NPU.

JIRA - https://jira.devtools.intel.com/browse/EISW-163602

Motivation and Context

As described in Intel's NPU support for LLMs, https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support

to run an ONNX quantized model such as phi3 on an Intel NPU, the quantized model must meet two requirements:

  1. symmetric quantization, zp = 0
  2. channel-wise quantization, block_size = K

This PR therefore enables symmetric, channel-wise quantization.
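The two requirements above can be sketched in a minimal NumPy example (an illustration of the general technique, not this PR's actual implementation): each output channel gets one scale computed over the whole K dimension, and the zero point is fixed at 0 so quantization is symmetric. Dequantization then reduces to `x = q * scale`, which is what DequantizeLinear computes when zp = 0.

```python
import numpy as np

def quantize_per_channel_sym_int4(w):
    """Symmetric per-channel int4 quantization (zero point = 0).

    w: float weight of shape (K, N); one scale per output channel,
    i.e. the quantization block spans the full K dimension.
    """
    qmax = 7  # int4 range is [-8, 7]; cap at 7 so zp can stay 0
    amax = np.abs(w).max(axis=0)                 # per-channel max over K
    scale = np.where(amax > 0, amax / qmax, 1.0).astype(w.dtype)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # DequantizeLinear with zp = 0: x = q * scale
    return q.astype(scale.dtype) * scale
```

With one scale per channel and zp = 0, the round-trip error is bounded by half a quantization step (scale / 2) in each channel.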

We tested it together with onnxruntime-genai changes (we also created a PR against onnxruntime-genai to support these extra args: microsoft/onnxruntime-genai#1362).

Command:

```shell
python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified-QDQ-T -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1 use_qdq=1
```

Without channel-wise quantization, phi3 with NPUW runs at about 4000 ms per token (kv-cache model).
With this PR applied, phi3 with NPUW runs at about 150 ms per token, a speedup of more than 20x.

@bopeng1234
Author

@ankitm3k, I created this new PR with only the QDQ channel-wise changes; the QOperator-related code has been removed.

@ankitm3k

@bopeng1234 kindly resolve the conflicts.
