[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. #20105

Merged
vllm-bot merged 17 commits into vllm-project:main on Jul 2, 2025

Conversation

@huachenheli (Contributor) commented Jun 26, 2025

Purpose

New model additions currently suffer from two pain points:

  1. Adding new multimodal models or modalities requires changing the hard-coded _placeholder_str function in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py#L507. The placeholder strings are now configurable via --mm-placeholder-str-override, so future additions can be contained within the model_executor/models/ directory without modifying vLLM framework code.
  2. Video input processing currently uses a hard-coded 32-frame configuration. The new --media-io-kwargs flag passes per-modality options to media loading, which leaves room for flexible and extensible media IO policies in the future.

Example flag formats:
--mm-placeholder-str-override '{"video":"<|video_placeholder|>", "image": "<|image_placeholder|>"}'
--media-io-kwargs '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }'
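
As a rough illustration of how these two flag values could be consumed, the sketch below parses the JSON strings and applies a placeholder override with a fallback to built-in defaults. The names parse_json_flag, resolve_placeholder, and DEFAULT_PLACEHOLDERS are hypothetical and exist only for this example; they are not vLLM's actual implementation.

```python
import json
from typing import Any

# Hypothetical built-in defaults, standing in for a hard-coded lookup.
DEFAULT_PLACEHOLDERS = {"image": "<image>", "video": "<video>"}


def parse_json_flag(raw: str) -> dict[str, Any]:
    """Parse a JSON-object CLI value such as '{"video": {"num_frames": 40}}'."""
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError(f"expected a JSON object, got {type(value).__name__}")
    return value


def resolve_placeholder(modality: str, overrides: dict[str, str]) -> str:
    """Prefer the user-supplied override; otherwise fall back to the default."""
    if modality in overrides:
        return overrides[modality]
    if modality in DEFAULT_PLACEHOLDERS:
        return DEFAULT_PLACEHOLDERS[modality]
    raise KeyError(f"no placeholder known for modality {modality!r}")


# The two example flag values from above, as they would arrive from the shell.
overrides = parse_json_flag(
    '{"video":"<|video_placeholder|>", "image": "<|image_placeholder|>"}'
)
media_io_kwargs = parse_json_flag(
    '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }'
)

print(resolve_placeholder("video", overrides))
print(media_io_kwargs["video"]["num_frames"])
```

Keys that the consumer does not recognize (such as "foo" here) survive parsing untouched, so it is up to each media loader to decide which ones it uses.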

Test Plan

  • Unit tests:
    pytest tests/engine/test_arg_utils.py
    pytest tests/multimodal/test_video.py
    pytest tests/multimodal/test_utils.py

  • vllm serve test: printed the parsed args and verified that they are passed through as expected.

vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000 --host 0.0.0.0 --dtype bfloat16 --limit-mm-per-prompt image=5,video=5 --mm-placeholder-str-override '{"video":"<|video_placeholder|>"}' --media-io-kwargs '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }'

…strings.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Chenheli Hua <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@huachenheli huachenheli marked this pull request as ready for review June 26, 2025 00:18
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @huachenheli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility and extensibility of multimodal model support by introducing configurable placeholder strings and dynamic video sampling policies. These changes empower users to easily integrate new multimodal models and customize video input processing without modifying core framework code, making the system more adaptable to future model additions and diverse use cases.

Highlights

  • Configurable Multimodal Placeholder Strings: Introduced a new CLI flag --mm-placeholder-str-override that allows users to specify custom placeholder strings for different modalities (e.g., video, image). This removes the need to hardcode these strings within the framework, simplifying the addition of new multimodal models.
  • Flexible Video Sampling Policies: Added a new CLI flag --video-media-io-kwargs which enables passing arbitrary keyword arguments to the video processing pipeline. This allows for flexible and extensible video sampling policies, moving beyond the previously hard-coded 32-frame configuration.
  • Core Configuration Updates: Updated ModelConfig and MultiModalConfig to include the new video_media_io_kwargs and mm_placeholder_str_override fields, ensuring these configurations are propagated throughout the system.
  • Media Connector Enhancements: Modified the MediaConnector to accept and utilize the new video_media_io_kwargs, ensuring that video fetching and processing can leverage the newly introduced flexible policies.
  • Video Loader API Extension: Extended the VideoLoader and VideoMediaIO interfaces to accept and pass through arbitrary keyword arguments, allowing custom video loaders to define and use their own specific parameters for video processing.
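
To make the keyword-argument pass-through concrete, here is a minimal sketch of a loader that picks evenly spaced frame indices, taking num_frames from caller-supplied kwargs rather than a fixed 32. The names sample_frame_indices and load_video_sketch are hypothetical, and uniform sampling is an assumption made for illustration; neither is taken from the vLLM code.

```python
from typing import Any


def sample_frame_indices(total_frames: int, num_frames: int = 32) -> list[int]:
    """Return up to num_frames evenly spaced indices in [0, total_frames)."""
    if total_frames <= 0:
        return []
    if num_frames <= 0 or num_frames >= total_frames:
        # In this sketch a non-positive value means "no limit": keep every frame.
        return list(range(total_frames))
    if num_frames == 1:
        return [0]
    # Spread the picks so the first and last frames are both included.
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def load_video_sketch(total_frames: int, **kwargs: Any) -> list[int]:
    """Hypothetical loader: reads the kwargs it understands, ignores the rest."""
    num_frames = int(kwargs.get("num_frames", 32))
    return sample_frame_indices(total_frames, num_frames)


# Unknown keys such as "foo" are accepted and simply unused by this loader.
video_kwargs = {"num_frames": 40, "fps": 2.0, "foo": "bar"}
indices = load_video_sketch(400, **video_kwargs)
print(len(indices), indices[0], indices[-1])
```

Accepting **kwargs instead of a fixed signature is what lets a custom loader define its own parameters without changes to the caller.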
@mergify mergify bot added the frontend and multi-modality labels Jun 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the stated pain points by introducing configurable multimodal placeholder strings and flexible video sampling policies via CLI flags. The changes are well-integrated across the configuration, argument parsing, and multimodal utility layers. New test cases have been added to validate the parsing of the new CLI arguments and the propagation of video processing parameters. The overall code quality is good, with clear intent and appropriate use of dataclasses and type hints. One minor type mismatch was identified in the EngineArgs definition for mm_placeholder_str_override.

@ywang96 ywang96 self-assigned this Jun 26, 2025
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch 2 times, most recently from 1d1f103 to aa4277e Compare June 26, 2025 04:37
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch 2 times, most recently from 70368b9 to da55875 Compare June 26, 2025 04:45
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch from da55875 to 3f75b64 Compare June 26, 2025 04:54
Member

@DarkLight1337 DarkLight1337 left a comment

LGTM, thanks for working on this! @ywang96 any further comments?

@DarkLight1337
Member

Perhaps we can later work on a solution similar to #20179 to define the placeholder token inside the model class....

Member

@ywang96 ywang96 left a comment

Left a small nit but otherwise LGTM. Thanks for the contribution!

@huachenheli
Contributor Author

huachenheli commented Jun 29, 2025

Perhaps we can later work on a solution similar to #20179 to define the placeholder token inside the model class....

I suppose we could put this into BaseProcessingInfo, which multimodal models inherit from, assuming new models have the correct processor setup.
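
A rough sketch of that idea follows, in which each model's processing-info class owns its placeholder strings and a CLI override still takes precedence. All class and function names here (ProcessingInfoSketch, MyVideoModelInfo, placeholder_for) are hypothetical stand-ins; this does not reflect the real BaseProcessingInfo API.

```python
from typing import Optional


class ProcessingInfoSketch:
    """Hypothetical stand-in for a shared base class (not vLLM's real one)."""

    def get_placeholder_str(self, modality: str) -> Optional[str]:
        # The base class knows no placeholders; model subclasses override this.
        return None


class MyVideoModelInfo(ProcessingInfoSketch):
    """Hypothetical model-specific subclass that declares its own placeholders."""

    _PLACEHOLDERS = {
        "image": "<|image_placeholder|>",
        "video": "<|video_placeholder|>",
    }

    def get_placeholder_str(self, modality: str) -> Optional[str]:
        return self._PLACEHOLDERS.get(modality)


def placeholder_for(
    info: ProcessingInfoSketch,
    modality: str,
    cli_overrides: Optional[dict[str, str]] = None,
) -> str:
    """Resolve a placeholder: CLI override first, then the model's own value."""
    if cli_overrides and modality in cli_overrides:
        return cli_overrides[modality]
    value = info.get_placeholder_str(modality)
    if value is None:
        raise ValueError(f"no placeholder defined for modality {modality!r}")
    return value


info = MyVideoModelInfo()
print(placeholder_for(info, "video"))
print(placeholder_for(info, "video", {"video": "<custom_video>"}))
```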

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) July 1, 2025 04:59
@github-actions github-actions bot added the ready label Jul 1, 2025
@DarkLight1337
Member

Can you merge from main to fix the CI?

@huachenheli
Contributor Author

Can you merge from main to fix the CI?

Just merged with main. Let's see if the timeout issue goes away.

auto-merge was automatically disabled July 1, 2025 20:36

Head branch was pushed to by a user without write access

@vllm-bot vllm-bot merged commit 2e7cbf2 into vllm-project:main Jul 2, 2025
71 of 75 checks passed
Labels
frontend, multi-modality, ready
Projects
Status: Done
4 participants