support qwen2-vl #32318
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Qwen2_VL

## Overview

The [Qwen2_VL](https://qwenlm.github.io/blog/qwen2-vl/) model is a major update to the [Qwen-VL](https://arxiv.org/pdf/2308.12966) model from the Qwen team.

The abstract from the blog is the following:

*This blog introduces Qwen2-VL, an advanced version of the Qwen-VL model that has undergone significant enhancements over the past year. Key improvements include enhanced image comprehension, advanced video understanding, integrated visual agent functionality, and expanded multilingual support. The model architecture has been optimized for handling arbitrary image resolutions through Naive Dynamic Resolution support and utilizes Multimodal Rotary Position Embedding (M-ROPE) to effectively process both 1D textual and multi-dimensional visual data. This updated model demonstrates competitive performance against leading AI systems like GPT-4o and Claude 3.5 Sonnet in vision-related tasks and ranks highly among open-source models in text capabilities. These advancements make Qwen2-VL a versatile tool for various applications requiring robust multimodal processing and reasoning abilities.*
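
To make the M-ROPE idea concrete, here is a small conceptual sketch of how the three position axes can be assigned. This is an illustration only, not the transformers implementation: the helper `toy_mrope_positions` is made up for this example. Text tokens share the same index on all three axes, while image patch tokens keep a fixed temporal index and walk the height/width patch grid.

```python
# Conceptual sketch of M-ROPE position ids; NOT the transformers implementation.
# Each token gets a (temporal, height, width) triplet: text tokens share one index
# across all three axes, image patch tokens keep a fixed temporal index while the
# height/width axes follow the patch grid.
import torch

def toy_mrope_positions(num_text_before: int, grid_h: int, grid_w: int, num_text_after: int) -> torch.Tensor:
    positions = [(i, i, i) for i in range(num_text_before)]   # text before the image
    start = num_text_before
    for h in range(grid_h):                                   # image patches
        for w in range(grid_w):
            positions.append((start, start + h, start + w))
    nxt = max(max(p) for p in positions) + 1                  # text resumes after the largest index used
    positions += [(nxt + i, nxt + i, nxt + i) for i in range(num_text_after)]
    return torch.tensor(positions).T  # shape (3, sequence_length)

print(toy_mrope_positions(num_text_before=2, grid_h=2, grid_w=3, num_text_after=2))
```

In practice the model derives these indices internally from the processor outputs, so users never need to build them by hand.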
## Usage example

### Single Media Inference

The model can accept both images and videos as input. Here is example code for inference.
```python
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {
                "type": "text",
                "text": "Describe this image."
            }
        ]
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)


# Video
def fetch_video(ele: Dict, nframe_factor=2):
    if isinstance(ele['video'], str):
        def round_by_factor(number: int, factor: int) -> int:
            return round(number / factor) * factor

        video = ele["video"]
        if video.startswith("file://"):
            video = video[7:]

        video, _, info = io.read_video(
            video,
            start_pts=ele.get("video_start", 0.0),
            end_pts=ele.get("video_end", None),
            pts_unit="sec",
            output_format="TCHW",
        )
        assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
        if "nframes" in ele:
            nframes = round_by_factor(ele["nframes"], nframe_factor)
        else:
            fps = ele.get("fps", 1.0)
            nframes = round_by_factor(video.size(0) / info["video_fps"] * fps, nframe_factor)
        idx = torch.linspace(0, video.size(0) - 1, nframes, dtype=torch.int64)
        return video[idx]

video_info = {"type": "video", "video": "/path/to/video.mp4", "fps": 1.0}
video = fetch_video(video_info)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>What happened in the video?<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```

### Batch Mixed Media Inference

The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.
```python
image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")
image3 = Image.open("/path/to/image3.jpg")
image4 = Image.open("/path/to/image4.jpg")
image5 = Image.open("/path/to/image5.jpg")
video = fetch_video({
    "type": "video",
    "video": "/path/to/video.mp4",
    "fps": 1.0
})

# Conversation for the first image
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

# Conversation with two images
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What is written in the pictures?"}
        ]
    }
]

# Conversation with pure text
conversation3 = [
    {
        "role": "user",
        "content": "who are you?"
    }
]


# Conversation with mixed media
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "What are the common elements in these media?"},
        ],
    }
]

conversations = [conversation1, conversation2, conversation3, conversation4]
# Preparation for batch inference
texts = [processor.apply_chat_template(msg, add_generation_prompt=True) for msg in conversations]
inputs = processor(
    text=texts,
    images=[image1, image2, image3, image4, image5],
    videos=[video],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to('cuda')

# Batch Inference
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```
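
One general tip for batched generation (a standard recommendation for decoder-only models in transformers, not something specific to Qwen2-VL): pad the prompts on the left so that the generated tokens start immediately after each sequence, which is what the trimming logic above assumes. A minimal sketch, reusing the `processor` defined earlier:

```python
# General decoder-only generation tip (not specific to Qwen2-VL): left-pad batched
# prompts so each prompt ends right where generation begins.
processor.tokenizer.padding_side = "left"
```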

### Usage Tips

#### Image Resolution for performance boost

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs.
```python
min_pixels = 224*224
max_pixels = 2048*2048
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
```
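
To get a feel for how the pixel budget translates into visual tokens, you can inspect the `image_grid_thw` tensor returned by the processor. The snippet below is only a sketch: it reuses the `image` from the single-media example above and assumes the processor's default spatial merge size of 2.

```python
# Sketch: inspect how many visual tokens an image is mapped to.
# Assumes `image` from the single-media example and the default spatial merge size of 2.
inputs = processor(
    text=["<|vision_start|><|image_pad|><|vision_end|>"],
    images=[image],
    return_tensors="pt",
)
t, h, w = inputs["image_grid_thw"][0].tolist()
print(f"patch grid (t, h, w): {t}x{h}x{w} -> {t * h * w // 4} image tokens")
```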

#### Multiple Image Inputs

By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos so they can be referenced unambiguously. Users can control this behavior with the `add_vision_id` argument of `apply_chat_template`:
```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Hello, how are you?"}
        ]
    },
    {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. How can I assist you today?"
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images and video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "These are from my vacation."}
        ]
    },
    {
        "role": "assistant",
        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?"
    },
    {
        "role": "user",
        "content": "It was a trip to the mountains. Can you see the details in the images and video?"
    }
]

# default:
prompt_without_id = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'

# add ids
prompt_with_id = processor.apply_chat_template(conversation, add_generation_prompt=True, add_vision_id=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
```

#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, you should have hardware that is compatible with FlashAttention-2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
```python
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```
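
If you want to confirm which attention backend was actually selected, you can check the `_attn_implementation` attribute on the loaded config; it is a private attribute, so it may change between transformers releases.

```python
# Private attribute; may change between transformers releases.
print(model.config._attn_implementation)  # "flash_attention_2" if FlashAttention-2 was picked up
```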

## Qwen2VLConfig

[[autodoc]] Qwen2VLConfig

## Qwen2VLImageProcessor

[[autodoc]] Qwen2VLImageProcessor
    - preprocess

## Qwen2VLProcessor

[[autodoc]] Qwen2VLProcessor

## Qwen2VLModel

[[autodoc]] Qwen2VLModel
    - forward

## Qwen2VLForConditionalGeneration

[[autodoc]] Qwen2VLForConditionalGeneration
    - forward