Add Qwen2.5-VL #706
Conversation
@seungwoos Is this branch usable? Can you provide some instructions on how to get it to work? Code:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
quant_path = "test_awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```

Got the following error:
The error is believed to be caused by a dependency issue. This Qwen2.5-VL support depends on the main branch of transformers.
Hi @BenasdTW, you should compute the positional embeddings beforehand. I actually made another PR to handle this issue. There's room for enhancement, since the rotary embedding only requires the input device. The current AutoAWQ doesn't include the latest transformers version; installing the latest transformers after installing AutoAWQ's required packages worked for me.
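For reference, a quick way to confirm that the installed transformers build actually ships Qwen2.5-VL support (the 4.49.0 cutoff is my recollection, not something stated in this thread):

```python
# Sanity check: Qwen2.5-VL classes only exist in recent transformers builds
# (to the best of my knowledge they first shipped in 4.49.0; before that you
# needed an install from the main branch).
import transformers

print(transformers.__version__)

# Raises ImportError on a transformers build without Qwen2.5-VL support.
from transformers import Qwen2_5_VLForConditionalGeneration

print("Qwen2.5-VL support available")
```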
I guess we shouldn't use …; you should add …
Thanks for the clarification! After manually applying the patch from #705, it works as expected. I think it would be useful to mention that this PR depends on #705.
@seungwoos Would you mind creating a branch that merges add-computed-position-embedding and add-qwen2_5_vl in your fork? This would make it easier for people to install and use.
Add computed position embedding (external)
Thanks for your comment, @BenasdTW!
The following config works for me.
@jlia0 I saw your comment on Hugging Face. Would you mind sharing the 72B model on Hugging Face if you manage to quantize it? I don't have a PC powerful enough to quantize the 72B model. Here are the 3B and 7B AWQ quantized versions in case someone needs them.
Sure, there you go: https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
Hi, could you please share your AutoAWQ quantization code for Qwen2.5-VL? There's something wrong with my 72B-AWQ model when serving it with vLLM using tensor parallelism.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
quant_path = "Qwen2.5-VL-7B-Instruct-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```

I haven't tried …
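For what it's worth, a minimal offline check of a quantized checkpoint through vLLM's Python API would look roughly like this (a sketch: the model path and tensor_parallel_size simply mirror the setup being debugged in this thread, not anyone's actual command):

```python
# Rough sketch of an offline-inference smoke test in vLLM. Text-only prompt,
# just to see whether the quantized weights load and produce coherent output.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen2.5-VL-7B-Instruct-AWQ",  # local quant_path from the script above
    quantization="awq",
    tensor_parallel_size=2,              # the TP=2 setup discussed here
    max_model_len=8192,
)
outputs = llm.generate(
    ["Describe AWQ quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```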
What's your setup/environment? I have tried TP=2 with your 7B-AWQ model and it works. However, the 72B didn't work, failing with the following error.
I ran it in a VS Code devcontainer with this Dockerfile:

```dockerfile
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

# Install git and other necessary system packages
RUN apt-get update && \
    apt-get install -y git libgl1-mesa-glx libglib2.0-0 && \
    rm -rf /var/lib/apt/lists/*

# Upgrade pip
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install torch torchvision torchaudio
RUN python3 -m pip install git+https://github.com/huggingface/transformers
RUN python3 -m pip install git+https://github.com/huggingface/accelerate
RUN python3 -m pip install git+https://github.com/huggingface/peft
RUN python3 -m pip install git+https://github.com/huggingface/trl
RUN python3 -m pip install flash-attn --no-build-isolation
RUN python3 -m pip install datasets numpy sentencepiece gguf protobuf matplotlib
RUN python3 -m pip install bitsandbytes
RUN python3 -m pip install tensorboard
RUN python3 -m pip install qwen-vl-utils[decord]
RUN python3 -m pip install git+https://github.com/seungwoos/AutoAWQ.git@add-qwen2_5_vl --no-deps
RUN python3 -m pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

Hardware: i9-12900K, RTX 3080 Ti

I'm not sure, but I think it could be because TP=2 doesn't actually split the 7B-AWQ model; it just duplicates the small model.
@jlia0 Have you found a solution to this problem? I was able to run a quantized model with … Would you mind sharing your quantization code?
@seungwoos Thanks for this PR. I hope to review it soon and merge it! @BenasdTW There is a bug in vLLM. Try inference in AutoAWQ first to see if it works. vllm-project/vllm#13227 |
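A minimal text-only smoke test directly in AutoAWQ might look like this (a sketch that assumes this branch's from_quantized path behaves like the other AutoAWQ model classes; the checkpoint path is a placeholder):

```python
# Sketch of a text-only check in AutoAWQ itself, to rule vLLM out.
# Assumes this branch loads quantized Qwen2.5-VL weights via from_quantized
# like other AutoAWQ models; the path below is a placeholder.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "Qwen2.5-VL-7B-Instruct-AWQ"
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

inputs = tokenizer("Describe AWQ quantization in one sentence.", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```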
Never mind. Everything worked again after a reboot.
Hey @BenasdTW, I encountered the same issue: the quantized 72B model outputs gibberish. How did you solve it?
I actually just restarted the server, rebuilt the container, and re-ran the exact same code. Make sure no other program is using the GPUs.
Actually, the quantized model is fine under AutoAWQ, but the inference result is completely different when serving with vLLM. I was using vLLM 0.7.2. Any further advice?
Are you using vLLM v1? I think v1 is bugged; the inference result is different from v0.
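If it is a V1-engine issue, one way to compare is to force the legacy engine (the VLLM_USE_V1 environment variable exists in vLLM 0.7.x; whether it changes this particular result is untested here, and the tensor_parallel_size below is only illustrative):

```python
# Force vLLM's legacy v0 engine to compare results against the V1 engine.
# VLLM_USE_V1 must be set before vllm is imported.
import os
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(
    model="PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ",
    quantization="awq",
    tensor_parallel_size=4,  # illustrative; pick what fits your GPUs
)
```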
If you want to use a vision-and-text dataset as a calibration set, you should use …
There is no processor in the example. |
The Qwen team just released their official AWQ-quantized model. BTW, the official quantized version doesn't work with …
Oh yes, we should import Qwen2_5_VLProcessor first, then set the processor with …
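For reference, loading the processor looks like this (Qwen2_5_VLProcessor ships with recent transformers; how it is then wired into the calibration/quantization call is not shown in this thread, so that part is left out):

```python
# Load the multimodal processor alongside the tokenizer. Passing it into
# quantize() for vision+text calibration depends on this PR's API and is
# not shown here.
from transformers import AutoTokenizer, Qwen2_5_VLProcessor

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = Qwen2_5_VLProcessor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(type(processor).__name__)  # Qwen2_5_VLProcessor
```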
I have updated the previously uploaded weights. Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel on 2, 4, and 8 GPUs.
Thanks! Good work! This is definitely better than changing the …
Add Qwen2.5-VL model with updated util functions.
Add `position_embeddings` to `module_kwargs`, since the latest transformers version requires pre-computed positional embeddings as a forward-pass argument (see the difference between transformers<4.48.0 and transformers>=4.48.0).
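A quick way to see the behaviour change described above (a small check, assuming a transformers build that already ships Qwen2.5-VL):

```python
# On the newer transformers layout, decoder layers expect a precomputed
# (cos, sin) pair via a `position_embeddings` argument instead of building
# RoPE internally, which is why the AWQ layer-wise forward must pass it in.
import inspect
from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl as qwen_vl

sig = inspect.signature(qwen_vl.Qwen2_5_VLDecoderLayer.forward)
print("position_embeddings" in sig.parameters)  # True on recent builds
```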