
Add channel-wise quantization option for QDQ, and optimize for Intel NPU #669


Open — wants to merge 2 commits into base: ovep-develop
Conversation

bopeng1234

Description

Adds 4-bit channel-wise quantization capability to the DequantizeLinear op for the phi3 model; it improves TPS on Intel NPU.

JIRA - https://jira.devtools.intel.com/browse/EISW-163602

Motivation and Context

As described in Intel's NPU support for LLMs, https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support

to run an ONNX quantized model such as phi3 on an Intel NPU, the quantized model must meet two requirements:

  1. symmetric quantization, zp = 0
  2. channel-wise quantization, block_size = K

This PR therefore enables symmetric, channel-wise quantization.
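The two requirements above can be sketched in a minimal NumPy example (an illustration of the general technique, not this PR's actual implementation): each output channel gets one scale computed over the whole K dimension, and the zero point is fixed at 0 so quantization is symmetric. Dequantization then reduces to `x = q * scale`, which is what DequantizeLinear computes when zp = 0.

```python
import numpy as np

def quantize_per_channel_sym_int4(w):
    """Symmetric per-channel int4 quantization (zero point = 0).

    w: float weight of shape (K, N); one scale per output channel,
    i.e. the quantization block spans the full K dimension.
    """
    qmax = 7  # int4 range is [-8, 7]; cap at 7 so zp can stay 0
    amax = np.abs(w).max(axis=0)                 # per-channel max over K
    scale = np.where(amax > 0, amax / qmax, 1.0).astype(w.dtype)
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # DequantizeLinear with zp = 0: x = q * scale
    return q.astype(scale.dtype) * scale
```

With one scale per channel and zp = 0, the round-trip error is bounded by half a quantization step (scale / 2) in each channel.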

We tested it together with onnxruntime-genai changes (we also created a PR against onnxruntime-genai to support these extra args: microsoft/onnxruntime-genai#1362).

Command:

```shell
python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified-QDQ-T -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1 use_qdq=1
```

Without channel-wise quantization, phi3 with NPUW runs at about 4000 ms per token (kv-cache model).
With this PR applied, phi3 with NPUW runs at about 150 ms per token, a speedup of more than 20x.

@bopeng1234
Author

@ankitm3k, I created this new PR with only the QDQ channel-wise changes; the QOperator-related code has been removed.

@ankitm3k

@bopeng1234 kindly resolve the conflicts.
