Conversation

@DarkLight1337 (Member) commented Sep 23, 2025

Purpose

Part of #22743. This PR adds data-parallel execution for the Qwen2-VL vision transformer, enabled via `--mm-encoder-tp-mode data`.

Test Plan

Run the serving benchmark below against a server launched with each `--mm-encoder-tp-mode` setting (`weights` keeps the vision encoder tensor-parallel across ranks, while `data` replicates the encoder weights and parallelizes over the image batch):
$ vllm bench serve \
    --backend openai-chat \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --endpoint /v1/chat/completions \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --hf-split train \
    --num-prompts 500

Test Result

$ vllm serve Qwen/Qwen2-VL-7B-Instruct -tp 2 --mm-encoder-tp-mode weights --limit_mm_per_prompt.image=1
============ Serving Benchmark Result ============
Successful requests:                     500       
Benchmark duration (s):                  159.99    
Total input tokens:                      34073     
Total generated tokens:                  49892     
Request throughput (req/s):              3.13      
Output token throughput (tok/s):         311.84    
Peak output token throughput (tok/s):    2314.00   
Peak concurrent requests:                500.00    
Total Token throughput (tok/s):          524.80    
---------------Time to First Token----------------
Mean TTFT (ms):                          62900.15  
Median TTFT (ms):                        52084.25  
P99 TTFT (ms):                           152029.84 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          606.95    
Median TPOT (ms):                        711.70    
P99 TPOT (ms):                           791.46    
---------------Inter-token Latency----------------
Mean ITL (ms):                           584.34    
Median ITL (ms):                         716.03    
P99 ITL (ms):                            1033.28   
==================================================

$ vllm serve Qwen/Qwen2-VL-7B-Instruct -tp 2 --mm-encoder-tp-mode data --limit_mm_per_prompt.image=1
============ Serving Benchmark Result ============
Successful requests:                     500       
Benchmark duration (s):                  128.82    
Total input tokens:                      34073     
Total generated tokens:                  49656     
Request throughput (req/s):              3.88      
Output token throughput (tok/s):         385.46    
Peak output token throughput (tok/s):    2297.00   
Peak concurrent requests:                500.00    
Total Token throughput (tok/s):          649.95    
---------------Time to First Token----------------
Mean TTFT (ms):                          50187.68  
Median TTFT (ms):                        40427.38  
P99 TTFT (ms):                           120921.60 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          488.14    
Median TPOT (ms):                        562.51    
P99 TPOT (ms):                           630.13    
---------------Inter-token Latency----------------
Mean ITL (ms):                           475.18    
Median ITL (ms):                         578.27    
P99 ITL (ms):                            728.98    
==================================================
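Comparing the two runs above, `--mm-encoder-tp-mode data` improves request throughput from 3.13 to 3.88 req/s and total token throughput from 524.80 to 649.95 tok/s (roughly 24% higher each), while reducing mean TTFT from about 62.9 s to 50.2 s and mean TPOT from about 607 ms to 488 ms (roughly 20% lower each).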

@DarkLight1337 added the ready label Sep 23, 2025
@mergify bot added the qwen (Related to Qwen models) label Sep 23, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces data parallelism support for the Vision Transformer in Qwen2-VL. The changes are well-structured and primarily involve plumbing a use_data_parallel flag through the vision model components to conditionally disable tensor parallelism. The logic for handling data-parallel execution paths appears correct. Overall, the changes are sound and should enable the intended data parallelism functionality.
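For illustration, here is a minimal sketch of the pattern the review describes, assuming hypothetical module and helper names (this is not the actual vLLM code): a `use_data_parallel` flag selects plain replicated linear layers instead of tensor-parallel ones, and each rank runs the full encoder on its own slice of the image batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionBlockMLP(nn.Module):
    """Hypothetical ViT MLP illustrating the use_data_parallel switch."""

    def __init__(self, dim: int, hidden_dim: int, use_data_parallel: bool):
        super().__init__()
        self.use_data_parallel = use_data_parallel
        if use_data_parallel:
            # DP mode: full (replicated) weights on every rank; the
            # parallelism comes from splitting the image batch instead.
            self.fc1 = nn.Linear(dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, dim)
        else:
            # TP mode: in the real model these would be sharded layers
            # (e.g. ColumnParallelLinear / RowParallelLinear); plain
            # Linear stands in here so the sketch runs single-process.
            self.fc1 = nn.Linear(dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


def shard_batch(images: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Give each rank a contiguous slice of the image batch (DP mode)."""
    return torch.tensor_split(images, world_size, dim=0)[rank]


if __name__ == "__main__":
    mlp = VisionBlockMLP(dim=64, hidden_dim=256, use_data_parallel=True)
    images = torch.randn(8, 64)                        # 8 flattened patch embeddings
    local = shard_batch(images, rank=0, world_size=2)  # rank 0's half of the batch
    print(mlp(local).shape)                            # torch.Size([4, 64])
```

In the actual model, the TP path would use sharded layers with the appropriate collective communication, and the DP path would gather the per-rank encoder outputs before the language model consumes them; the sketch only illustrates the branching on the flag.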

@Isotr0py (Member) left a comment

LGTM

@DarkLight1337 enabled auto-merge (squash) September 23, 2025 04:51
@DarkLight1337 merged commit c98be0a into vllm-project:main Sep 23, 2025
52 of 53 checks passed
@DarkLight1337 deleted the vit-dp-qwen2-vl branch September 23, 2025 05:17
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025