This repository was archived by the owner on May 11, 2025. It is now read-only.

Conversation

seungwoos
Contributor

@seungwoos seungwoos commented Feb 6, 2025

  1. Add Qwen2.5-VL model with updated util functions.

  2. Add position_embeddings to module_kwargs, since the latest Hugging Face transformers release requires pre-computed positional embeddings as a forward-pass argument (see the difference between transformers<4.48.0 and transformers>=4.48.0).
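
For context, the second point corresponds roughly to the sketch below. This is only a conceptual illustration, not the actual diff of this PR: the attribute path to the rotary embedding and the way position ids are built are assumptions, and Qwen2.5-VL's text model uses multimodal (3-axis) rotary position ids.

import torch

# Sketch: precompute the (cos, sin) pair once and pass it to every decoder layer
# via module_kwargs, because transformers>=4.48 decoder layers unpack
# position_embeddings in forward() instead of computing rotary embeddings themselves.
def prepare_module_kwargs(model, inps, module_kwargs):
    # inps: cached calibration activations, shape (batch, seq_len, hidden_size)
    batch, seq_len = inps.shape[0], inps.shape[1]
    # Qwen2.5-VL's mrope expects position ids shaped (3, batch, seq_len)
    position_ids = torch.arange(seq_len, device=inps.device)
    position_ids = position_ids.view(1, 1, -1).expand(3, batch, -1)
    # the attribute path below is illustrative; the real model layout may differ
    cos, sin = model.model.rotary_emb(inps, position_ids)
    module_kwargs["position_embeddings"] = (cos, sin)
    return module_kwargs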

@seungwoos seungwoos changed the title Add qwen2 5 vl Add qwen2.5-vl Feb 6, 2025
@seungwoos seungwoos changed the title Add qwen2.5-vl Add Qwen2.5-VL Feb 6, 2025
@BenasdTW

BenasdTW commented Feb 7, 2025

@seungwoos Is this branch usable? Can you provide some instructions on how to get it to work?
I can't get it to work, and it also causes issues with older models like Qwen2-VL.
This is how I install it:

pip install git+https://github.com/seungwoos/AutoAWQ.git@add-qwen2_5_vl --no-deps
pip install git+https://github.com/huggingface/transformers

Code:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
quant_path = "test_awq"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

Got the following error:

root@0455e7995f18:/workspaces/SpecsML# /opt/conda/bin/python /workspaces/SpecsML/quant.py
Fetching 14 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 190033.19it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.05it/s]
Repo card metadata block was not found. Setting CardData to empty.
AWQ:   0%|                                                                                                                         | 0/36 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspaces/SpecsML/quant.py", line 13, in <module>
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/awq/models/base.py", line 242, in quantize
    self.quantizer.quantize()
  File "/opt/conda/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 172, in quantize
    input_feat = self._get_input_feat(self.modules[i], named_linears)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 648, in _get_input_feat
    self.inps = self._module_forward(self.inps, layer, module_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 260, in _module_forward
    module_output = module(x, **module_kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1017, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 910, in forward
    cos, sin = position_embeddings
    ^^^^^^^^
TypeError: cannot unpack non-iterable NoneType object

@BenasdTW

BenasdTW commented Feb 7, 2025

@seungwoos Is this branch usable? Can you provide some instructions on how to get it to work? I can't get it to work, and it also causes issues with older models like Qwen2-VL.

The error is likely caused by a dependency issue: Qwen2.5-VL depends on the main branch of transformers, but AutoAWQ requires transformers<=4.47.1,>=4.45.0:

autoawq 0.2.8 requires transformers<=4.47.1,>=4.45.0, but you have transformers 4.49.0.dev0 which is incompatible.
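
A quick way to confirm that the installed transformers build actually ships the Qwen2.5-VL classes (an illustrative check, not part of this PR):

import transformers

print(transformers.__version__)  # needs a build newer than 4.48.0, e.g. 4.49.0.dev0 from main
# raises ImportError on releases that predate Qwen2.5-VL support
from transformers import Qwen2_5_VLForConditionalGeneration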

@seungwoos
Contributor Author

Hi, @BenasdTW

You should compute positional embeddings beforehand. I actually made another PR to handle this issue. There's room for enhancement since rotary embedding only requires the input device.

The current AutoAWQ release doesn't support the latest transformers version. Installing the latest transformers after installing AutoAWQ's required packages worked for me.

@seungwoos
Contributor Author

seungwoos commented Feb 7, 2025

I guess we shouldn't use pile-val-backup as a calibration dataset, but the Qwen2-VL example code doesn't seem to work properly. I'm currently working on fixing this issue.

You should add padding_side='left' to the processor.

@BenasdTW

BenasdTW commented Feb 7, 2025

You should compute positional embeddings beforehand. I actually made another PR to handle this issue. There's room for enhancement since rotary embedding only requires the input device.

Thanks for the clarification! After manually applying the patch from #705, it works as expected.

I think it would be useful to mention that this PR depends on #705.

@BenasdTW

BenasdTW commented Feb 7, 2025

@seungwoos Would you mind creating a branch that merges add-computed-position-embedding and add-qwen2_5_vl in your fork? This would make it easier for people to install and use.

@seungwoos
Contributor Author

Thanks for your comment @BenasdTW !
I just merged the previous PR into this one.

@jlia0

jlia0 commented Feb 9, 2025

The following config works for me.

import modal

# Modal image definition (the variable name is assumed)
image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("git")
    .pip_install("torch")
    .pip_install("git+https://github.com/seungwoos/AutoAWQ.git@add-qwen2_5_vl")
    .pip_install(
        "git+https://github.com/huggingface/transformers",
        "accelerate",
    )
    .pip_install("pillow")
)

@BenasdTW

BenasdTW commented Feb 9, 2025

@jlia0 I saw your comment on Hugging Face. Would you mind sharing the 72B model on Hugging Face if you manage to quantize it? I don't have a PC powerful enough to quantize the 72B model.

Here are the 3B and 7B AWQ quantized versions in case someone needs them.
https://huggingface.co/Benasd/Qwen2.5-VL-7B-Instruct-AWQ
https://huggingface.co/Benasd/Qwen2.5-VL-3B-Instruct-AWQ

@jlia0

jlia0 commented Feb 9, 2025

@jlia0 I saw your comment on Hugging Face. Would you mind sharing the 72B model on Hugging Face if you manage to quantize it? I don't have a PC powerful enough to quantize the 72B model.

Here are the 3B and 7B AWQ quantized versions in case someone needs them. https://huggingface.co/Benasd/Qwen2.5-VL-7B-Instruct-AWQ https://huggingface.co/Benasd/Qwen2.5-VL-3B-Instruct-AWQ

sure - there you go

https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ

@jlia0

jlia0 commented Feb 10, 2025

@jlia0 I saw your comment on Hugging Face. Would you mind sharing the 72B model on Hugging Face if you manage to quantize it? I don't have a PC powerful enough to quantize the 72B model.

Here are the 3B and 7B AWQ quantized versions in case someone needs them. https://huggingface.co/Benasd/Qwen2.5-VL-7B-Instruct-AWQ https://huggingface.co/Benasd/Qwen2.5-VL-3B-Instruct-AWQ

Hi could you please share your AutoAWQ quantization code for Qwen2.5-VL?

There's something wrong with my 72B-AWQ model when serving it using vLLM with --tensor-parallel-size=2.
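
For reference, serving with --tensor-parallel-size maps to the tensor_parallel_size argument of vLLM's offline Python API; a minimal smoke-test sketch, assuming a vLLM build with Qwen2.5-VL support and using the 7B checkpoint shared above:

from vllm import LLM, SamplingParams

# Load an AWQ-quantized Qwen2.5-VL checkpoint across 2 GPUs.
llm = LLM(
    model="Benasd/Qwen2.5-VL-7B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
    max_model_len=8192,  # keep the KV cache small for a quick test
)
out = llm.generate(["Describe what AWQ quantization does."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)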

@BenasdTW

Hi could you please share your AutoAWQ quantization code for Qwen2.5-VL?
Sure.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
quant_path = "Qwen2.5-VL-7B-Instruct-AWQ"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

I haven't tried --tensor-parallel-size=2, so I'm not sure if it will work.

@jlia0

jlia0 commented Feb 10, 2025

Hi could you please share your AutoAWQ quantization code for Qwen2.5-VL?

Sure.

[quantization script quoted above]

I haven't tried --tensor-parallel-size=2, so I'm not sure if it will work.

What's your setup/environment?

I have tried TP=2 with your 7B-AWQ model and it works.

However, the 72B didn't work, failing with the following error:

ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

@BenasdTW

BenasdTW commented Feb 10, 2025

What's your setup/environment?

I ran it in a VS Code devcontainer with this Dockerfile:

FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
# Install Python and other necessary packages
RUN apt-get update && \
    apt-get install -y git libgl1-mesa-glx libglib2.0-0 && \
    rm -rf /var/lib/apt/lists/*

# Upgrade pip
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install torch torchvision torchaudio
RUN python3 -m pip install git+https://github.com/huggingface/transformers
RUN python3 -m pip install git+https://github.com/huggingface/accelerate
RUN python3 -m pip install git+https://github.com/huggingface/peft
RUN python3 -m pip install git+https://github.com/huggingface/trl
RUN python3 -m pip install flash-attn --no-build-isolation
RUN python3 -m pip install datasets numpy sentencepiece gguf protobuf matplotlib
RUN python3 -m pip install bitsandbytes
RUN python3 -m pip install tensorboard
RUN python3 -m pip install qwen-vl-utils[decord]
RUN python3 -m pip install git+https://github.com/seungwoos/AutoAWQ.git@add-qwen2_5_vl --no-deps
RUN python3 -m pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Hardware: i9-12900K, RTX 3080 Ti
Host OS: Windows 10

I'm not sure, but I think it could be because TP=2 doesn't split the 7B-AWQ model; instead, it just duplicates the small model.

@BenasdTW

BenasdTW commented Feb 13, 2025

However, the 72B didn't work, failing with the following error:

ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

@jlia0 Have you found a solution to this problem?

I was able to run a quantized model with -tp 2 using a workaround: setting q_group_size to 64 during quantization. As shown here.
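
For reference, the arithmetic behind this workaround is a simple divisibility check; a sketch with assumed sizes (29568 is the published Qwen2.5-72B intermediate size):

# Under tensor parallelism each GPU holds intermediate_size / tp columns of the MLP
# weights, and that shard width must be divisible by the AWQ group size.
intermediate_size = 29568  # assumed value for the Qwen2.5-VL-72B language model
tp = 2
print((intermediate_size // tp) % 128)  # 64 -> not aligned with the default group size of 128
print((intermediate_size // tp) % 64)   # 0  -> aligned when q_group_size is 64

# The only change to the quantization script is the group size:
quant_config = {"zero_point": True, "q_group_size": 64, "w_bit": 4, "version": "GEMM"}
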
However, I'm not sure if it's working correctly, because my quantized 72B model just outputs gibberish.

Would you mind sharing your quantization code?

@casper-hansen
Owner

@seungwoos Thanks for this PR. I hope to review it soon and merge it!

@BenasdTW There is a bug in vLLM. Try inference in AutoAWQ first to see if it works. vllm-project/vllm#13227

@BenasdTW

BenasdTW commented Feb 13, 2025

Nevermind. Everything worked again after reboot.

@Cescfangs

However, the 72B didn't work, failing with the following error:

ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

@jlia0 Have you found a solution to this problem?

I was able to run a quantized model with -tp 2 using a workaround: setting q_group_size to 64 during quantization. As shown here. However, I'm not sure if it's working correctly, because my quantized 72B model just outputs gibberish.

Would you mind sharing your quantization code?

Hey @BenasdTW, I encountered the same issue: the quantized 72B model outputs gibberish. How did you solve it?

@BenasdTW

Hey @BenasdTW, I encountered the same issue: the quantized 72B model outputs gibberish. How did you solve it?

I actually just restarted the server, rebuilt the container and re-ran the exact same code. Make sure no other program is using the GPUs.

@Cescfangs

Hey @BenasdTW, I encountered the same issue: the quantized 72B model outputs gibberish. How did you solve it?

I actually just restarted the server, rebuilt the container and re-ran the exact same code. Make sure no other program is using the GPUs.

Actually, the quantized model is fine under AutoAWQ, but the inference results with the vLLM server are completely different. I was using vLLM 0.7.2; any further advice?

@BenasdTW

Actually, the quantized model is fine under AutoAWQ, but the inference results with the vLLM server are completely different. I was using vLLM 0.7.2; any further advice?

Are you using vLLM v1? I think v1 is bugged; its inference results differ from v0's.

@seungwoos
Contributor Author

seungwoos commented Feb 20, 2025

If you want to use a vision and text dataset as a calibration set, you should use processor = Qwen2_5_VLProcessor.from_pretrained(model_path, padding_side='left') instead of model.processor in this example.

@BenasdTW

If you want to use a vision and text dataset as a calibration set, you should change to processor = Qwen2_5_VLProcessor.from_pretrained(model_path, padding_side='left') in this example.

There is no processor in the example.
Did you mean replacing model.processor with Qwen2_5_VLProcessor.from_pretrained(model_path, padding_side='left')?

@BenasdTW

BenasdTW commented Feb 20, 2025

The Qwen team just released their official AWQ-quantized model.
Qwen/Qwen2.5-VL-72B-Instruct-AWQ

BTW, the official quantized version doesn't work with -tp 2 for now.

@seungwoos
Contributor Author

There is no processor in the example. Did you mean replacing model.processor with Qwen2_5_VLProcessor.from_pretrained(model_path, padding_side='left')?

Oh yes, we should import Qwen2_5_VLProcessor first, then create the processor with padding_side='left'.
Or we can just use AutoProcessor. The key point is padding_side='left'; otherwise, it does not work.
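
In code, the suggestion amounts to something like this minimal sketch (the surrounding calibration setup from the existing Qwen2-VL example is assumed):

from transformers import AutoProcessor  # or Qwen2_5_VLProcessor

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# padding_side="left" is the key point; without it the calibration run does not work.
processor = AutoProcessor.from_pretrained(model_path, padding_side="left")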

@jlia0

jlia0 commented Feb 21, 2025

@BenasdTW @seungwoos

I have updated the previously uploaded weights.

Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.

@BenasdTW

@BenasdTW @seungwoos

I have updated the previously uploaded weights.

Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.

Thanks! Good Work! This is definitely better than changing the group_size!
I've noticed that you padded the intermediate_size of the model. Would you mind telling me how to pad the model? Is fine-tuning required? I would also like to know which calibration dataset you used for AWQ.

@casper-hansen casper-hansen merged commit b6719dc into casper-hansen:main Mar 6, 2025