[v1] Add Whisper model support (encoder-decoder) #21088
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This is a significant and well-structured pull request that adds Whisper (encoder-decoder) model support to vLLM's V1 engine. The changes are comprehensive, touching on the attention backend, KV cache management, scheduler, and GPU model runner to accommodate the new architecture.
I've identified one critical issue in `_build_encoder_attn_metadata`, where a missing `else` block could lead to a size mismatch and a runtime error. I've provided a code suggestion to fix this potential bug. Other than that, the implementation looks solid and correctly integrates encoder-decoder support into the existing V1 framework. Great work on this complex feature!
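As an illustration of the kind of issue described above (a hypothetical sketch, not the actual vLLM code or the reviewer's suggestion): when only some scheduled requests carry encoder input, the branch for requests without encoder input still has to contribute an entry, otherwise the per-request metadata list ends up shorter than the batch and tensor construction fails with a size mismatch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    # Hypothetical request object; `encoder_input` stands in for whatever
    # encoder tokens/features a request may or may not carry.
    encoder_input: Optional[List[int]] = None


def build_encoder_seq_lens(requests: List[Request]) -> List[int]:
    """Collect one encoder sequence length per scheduled request."""
    encoder_seq_lens: List[int] = []
    for req in requests:
        if req.encoder_input is not None:
            encoder_seq_lens.append(len(req.encoder_input))
        else:
            # Without this else branch the list would be shorter than the
            # batch, and building a [num_reqs]-sized tensor from it would
            # fail at runtime with a size mismatch.
            encoder_seq_lens.append(0)
    return encoder_seq_lens
```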
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically to catch errors quickly. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
There is already some work to support encoder-decoder models: #20226. Can you coordinate with @maxdebayser to avoid duplicate work?
Yeah, I've been talking with @russellb, as there are a few overlapping points in our PRs, for example disabling prefix caching and chunked prefill.
Yep, we're in contact. Did you mean to link something different than #20226? Roughly though, Max had worked on encoder-only support, and I was doing encoder-decoder, which is mostly a superset of encoder-only changes, though I haven't actually tested any encoder-only models with my branch yet.
Follow-up on next steps and collaboration with @maxdebayser: we're going to combine our work and try to land it all in a few stages.

PR 1) Combine parts of his encoder-only PR (#19988) with the encoder-without-kv-cache changes in this branch. That will be a new jointly-authored PR that will cover encoder-only attention.

PR 2) Update this PR with what's left to make Whisper / encoder-decoder work. That includes some Whisper model changes and a bunch of changes to support cross-attention (encoder-decoder type).

PR 3) Add the last parts of Max's original PR, which supports token_type_ids to run the BERT classifier models that need them.
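For context on the cross-attention piece mentioned under PR 2, here is a minimal, hypothetical sketch (not vLLM code, single-head and unbatched for brevity) of why encoder-decoder cross-attention needs its own KV cache handling: the keys and values are computed once from the encoder output and reused unchanged at every decode step, unlike decoder self-attention, which appends a new KV entry per generated token.

```python
import torch


def init_cross_attn_kv(encoder_hidden: torch.Tensor,
                       k_proj: torch.nn.Linear,
                       v_proj: torch.nn.Linear):
    # Computed once per request from the encoder output and then cached;
    # the cross-attention cache never grows during decoding.
    return k_proj(encoder_hidden), v_proj(encoder_hidden)


def cross_attn_decode_step(query: torch.Tensor,
                           cross_k: torch.Tensor,
                           cross_v: torch.Tensor) -> torch.Tensor:
    # Every decode step attends over the same fixed-length encoder KV.
    scale = cross_k.shape[-1] ** -0.5
    scores = (query @ cross_k.transpose(-1, -2)) * scale
    return torch.softmax(scores, dim=-1) @ cross_v
```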
Force-pushed from 96be9ad to 4da8b7c.
nice one!
Force-pushed from 16f557d to a9e3459.
I got this caught up with
Force-pushed from 87d9bfa to f62a66e.
This pull request has merge conflicts that must be resolved before it can be merged.
Add support for encoder models such as BERT which don't support a KV cache due to the non-causal attention. Since the KV cache spec is used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the layers and adds a function to build the attention metadata.

This PR combines elements of PRs vllm-project#21088 and vllm-project#19988.

Summary of changes:

**Flash Attention Backend:**
- Implement encoder self-attention support without using KV cache

**Scheduler:**
- Disable chunked prefill for models without KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
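A minimal sketch of the idea in that commit message, using hypothetical names rather than the actual vLLM interfaces: because encoder-only (non-causal) layers never write a KV cache, their attention metadata cannot be derived from the KV cache spec and is instead built directly from the scheduled prompt lengths.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EncoderOnlyAttnMetadata:
    # Non-causal self-attention over the full prompt; no slot mapping or
    # block tables, because nothing is ever written to a KV cache.
    seq_lens: List[int]
    max_seq_len: int


def build_encoder_only_attn_metadata(
        prompt_lens: List[int]) -> EncoderOnlyAttnMetadata:
    """Build metadata directly from the scheduled prompt lengths.

    Decoder layers derive their metadata from the KV cache spec;
    encoder-only layers cannot, so the builder is keyed off the
    layers themselves.
    """
    return EncoderOnlyAttnMetadata(
        seq_lens=list(prompt_lens),
        max_seq_len=max(prompt_lens) if prompt_lens else 0,
    )
```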
Signed-off-by: Russell Bryant <[email protected]>
Force-pushed from 5bf5479 to 7444d37.
LGTM! Thanks for your patience in iterating on this. It is really helpful for the V0 deprecation.
Signed-off-by: Russell Bryant <[email protected]>
And thank you for all of the time invested in review! The end result is a lot cleaner because of your feedback.
Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: NickLucche <[email protected]>
v1: Add Whisper encoder-decoder model support
Implements Whisper model support in the V1 engine. Key changes include updates to the attention backend, KV cache management, the scheduler, and the GPU model runner to support the encoder-decoder architecture.
This closes a major feature gap between V0 and V1, enabling Whisper transcription
in the new engine architecture while maintaining backward compatibility.
Related to V0 deprecation (#18571) and 2025 Q3 roadmap (#20336).
Closes #12761
Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: NickLucche <[email protected]>
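As a usage sketch (deployment details are assumptions, not part of this PR's text): once Whisper runs on V1, one way to exercise transcription is through vLLM's OpenAI-compatible server, e.g. started with `vllm serve openai/whisper-large-v3`, using the standard OpenAI client. The model name, port, and audio file below are placeholders.

```python
# Hypothetical client-side example; endpoint, port, and model name are
# assumptions about a local vLLM deployment serving a Whisper model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
    )

print(transcription.text)
```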
Follow-up TODO items