[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control #7086
Merged: simon-mo merged 69 commits into vllm-project:main from ZX-ModelCloud:compat_dynamic_bits on Feb 12, 2025
Commits
f470b26 gptq_marlin compat dynamic_bits quantize config (ZX-ModelCloud)
c56e3de Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
502edb3 Update gptq_marlin.py (Qubitium)
18064cd cleanup (ZX-ModelCloud)
1b132c3 cleanup (ZX-ModelCloud)
4b63754 cleanup (ZX-ModelCloud)
90258d2 cleanup (ZX-ModelCloud)
a5d3c8b cleanup (ZX-ModelCloud)
c84793f Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud)
5682124 load "dynamic" field from config (ZX-ModelCloud)
d651668 fix key error: change "is_sym" to "sym" (ZX-ModelCloud)
9a36694 Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
fbc594f Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
e9ae8f5 update quant_type (ZX-ModelCloud)
19d7772 update (ZX-ModelCloud)
7057dbb Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
8565328 fix judgment error (ZX-ModelCloud)
84ada54 cleanup (ZX-ModelCloud)
e81a7da cleanup (ZX-ModelCloud)
68291ce cleanup (ZX-ModelCloud)
7867405 cleanup (ZX-ModelCloud)
c63ba51 cleanup (ZX-ModelCloud)
5f9b712 Update gptq_marlin.py (Qubitium)
3692578 Update gptq_marlin.py (Qubitium)
f902b2d cleanup (ZX-ModelCloud)
a570509 Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud)
9b9d7e3 Update gptq_marlin.py (Qubitium)
0559137 cleanup (ZX-ModelCloud)
b29a094 Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud)
3a2bb94 cleanup (ZX-ModelCloud)
3c0d45a cleanup (ZX-ModelCloud)
74b1d42 add test_gptq_dynamic_cfg.py (ZX-ModelCloud)
b0672ae cleanup (ZX-ModelCloud)
066f489 Update test_gptq_dynamic_cfg.py (Qubitium)
6dc56a6 Update test_gptq_dynamic_cfg.py (Qubitium)
98a198e cleanup (ZX-ModelCloud)
b2861d8 Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud)
c4a29eb use PROMPT variable (ZX-ModelCloud)
25703e3 cleanup (ZX-ModelCloud)
1fd690e Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
4f48d1b Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
070ae3c rename method and add detailed comments (Qubitium)
13b2b7b Changed VocabParallelEmbedding.linear_method to quant_method to be co… (ZX-ModelCloud)
6850e6d Merge remote-tracking branch 'origin/compat_dynamic_bits' into compat… (ZX-ModelCloud)
40562d1 fix unittest (ZX-ModelCloud)
7b774bb cleanup (ZX-ModelCloud)
c72125a cleanup (ZX-ModelCloud)
c298195 cleanup (ZX-ModelCloud)
bbc049d Update gptq_marlin.py (Qubitium)
78f8818 format (ZX-ModelCloud)
2cfec63 Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
93ee576 Update gptq_marlin.py (Qubitium)
6ebf85c rename to parallel_lm_head_quantized for clarity (Qubitium)
59bdf54 simplify (Qubitium)
9de0382 shorten code (Qubitium)
67d0882 cleanup (ZX-ModelCloud)
5623936 cleanup (ZX-ModelCloud)
e41bdd7 make lint pass (Qubitium)
965d7da change model_id (ZX-ModelCloud)
1a34027 format (ZX-ModelCloud)
0b249a1 format code (ZX-ModelCloud)
4de04ae format code (ZX-ModelCloud)
4c0608b format code (ZX-ModelCloud)
8f21375 disable E712 ruff check (ZX-ModelCloud)
e3084e3 Extract code to gptq_utils.get_linear_quant_method() (ZX-ModelCloud)
25dbd5a cleanup (ZX-ModelCloud)
874076c cleanup (ZX-ModelCloud)
17704df Merge branch 'main' into compat_dynamic_bits (ZX-ModelCloud)
c7f10be do not use Fraction (ZX-ModelCloud)
@@ -0,0 +1,68 @@
# SPDX-License-Identifier: Apache-2.0
"""Tests whether gptq models with dynamic quantized can be loaded.

Run `pytest tests/quantization/test_gptq_dynamic.py --forked`.
"""

import pytest
import torch

from vllm.model_executor.layers.linear import UnquantizedLinearMethod
from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
from vllm.model_executor.layers.quantization.gptq_marlin import (
    GPTQMarlinLinearMethod)
from vllm.model_executor.layers.quantization.utils.gptq_utils import (
    get_dynamic_override)

PROMPT = "On the surface of Mars, we found"

# The first layer is quantized using bits=4, group_size=128
# The second layer is quantized using bits=8, group_size=32
# All other layers (layer index >= 2) are not quantized
MODEL_QUANT = [
    ("ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symTrue",
     True),
    ("ModelCloud/Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse",
     False),
]


@pytest.mark.parametrize("model_id, use_marlin_kernel", MODEL_QUANT)
def test_gptq_with_dynamic(vllm_runner, model_id: str,
                           use_marlin_kernel: bool):

    vllm_model = vllm_runner(model_id, dtype=torch.float16, max_model_len=2048)

    linear_method_cls = GPTQMarlinLinearMethod if use_marlin_kernel else (
        GPTQLinearMethod)

    for name, submodule in (vllm_model.model.llm_engine.model_executor.
                            driver_worker.model_runner.model.named_modules()):
        if name == "lm_head":
            assert isinstance(submodule.quant_method, linear_method_cls)
        elif name == 'model.layers.0.self_attn.qkv_proj':
            # The first layer is quantized using bits=4, group_size=128
            # desc_act=True
            assert isinstance(submodule.quant_method, linear_method_cls)
            config = submodule.quant_method.quant_config
            assert config.weight_bits == 4
            assert config.group_size == 128
            assert config.desc_act
        elif name == 'model.layers.1.self_attn.qkv_proj':
            # The second layer is quantized using bits=8, group_size=32
            # desc_act=False
            assert isinstance(submodule.quant_method, linear_method_cls)
            config = submodule.quant_method.quant_config
            assert get_dynamic_override(config, layer_name=name,
                                        key="bits") == 8
            assert get_dynamic_override(config,
                                        layer_name=name,
                                        key="group_size") == 32
            assert not get_dynamic_override(
                config, layer_name=name, key="desc_act")
        elif (name == 'model.layers.2.self_attn.qkv_proj'
              or name == 'model.layers.2.mlp.gate_up_proj'):
            # All other layers (layer index >= 2) are not quantized
            assert isinstance(submodule.quant_method, UnquantizedLinearMethod)

    del vllm_model
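
The test above checks three cases: a layer that keeps the base quantization settings, a layer whose settings are overridden, and layers that are left unquantized. As a rough illustration of that idea only, here is a minimal standalone sketch of per-module override resolution. It does not import vLLM and is not the `get_dynamic_override` implementation; `BASE`, `DYNAMIC`, `resolve_override`, the regex patterns, and the "+:" / "-:" prefix convention are all assumptions made for this example.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical base settings, matching the values the test expects for layer 0.
BASE: Dict[str, Any] = {"bits": 4, "group_size": 128, "desc_act": True}

# Hypothetical per-module rules keyed by regex. In this sketch a "+:" prefix
# means "apply these overrides" and a "-:" prefix means "do not quantize".
DYNAMIC: Dict[str, Dict[str, Any]] = {
    r"+:.*\.layers\.1\..*": {"bits": 8, "group_size": 32, "desc_act": False},
    r"-:.*\.layers\.(?:[2-9]|\d{2,})\..*": {},
}


def resolve_override(layer_name: str, key: str) -> Optional[Any]:
    """Return the effective value of `key` for `layer_name`.

    Returns None when the layer matches a "-:" rule, i.e. when this sketch
    treats the module as unquantized.
    """
    for pattern, overrides in DYNAMIC.items():
        negative = pattern.startswith("-:")
        # Strip the two-character prefix if present, leaving the bare regex.
        regex = pattern[2:] if pattern[:2] in ("+:", "-:") else pattern
        if re.match(regex, layer_name):
            if negative:
                return None
            # Fall back to the base value for keys the rule does not override.
            return overrides.get(key, BASE[key])
    # No rule matched: the module uses the base settings.
    return BASE[key]


if __name__ == "__main__":
    for module in ("model.layers.0.self_attn.qkv_proj",
                   "model.layers.1.self_attn.qkv_proj",
                   "model.layers.2.mlp.gate_up_proj"):
        print(module, resolve_override(module, "bits"))
```

Using the first matching rule keeps the lookup order explicit; a real implementation may resolve conflicts between rules differently.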