
Conversation

@russellb (Member) commented on Jul 17, 2025

v1: Add Whisper encoder-decoder model support

Implements Whisper model support in the V1 engine. Key changes include:

  • Add encoder-decoder architecture support with cross-attention KV cache management
  • Add CrossAttentionManager and CrossAttentionSpec for the encoder-decoder KV cache (a rough sketch follows at the end of this description)
  • Update scheduler to handle cross-attention block allocation and disable prefix caching
  • Modify GPU model runner for encoder input processing and attention metadata
  • Disable BART tests/examples (Whisper-only support for now)

This closes a major feature gap between V0 and V1, enabling Whisper transcription
in the new engine architecture while maintaining backward compatibility.

Related to V0 deprecation (#18571) and 2025 Q3 roadmap (#20336).

Closes #12761
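
A rough sketch of the cross-attention KV-cache idea referenced above (illustrative only; the class and field names below are assumptions, not this PR's actual CrossAttentionSpec): the cache for a request is sized by the fixed encoder output length (about 1500 frames for 30 s of Whisper audio), not by the number of decoded tokens.

```python
# Illustrative sketch only -- names and fields are assumptions, not the PR's
# actual CrossAttentionSpec. The cross-attention K/V are computed once from the
# encoder output and reused by every decoder step, so the cache size per
# request depends only on the fixed encoder sequence length.
from dataclasses import dataclass


@dataclass
class CrossAttentionSpecSketch:
    block_size: int       # tokens per KV-cache block
    num_kv_heads: int
    head_size: int
    encoder_seq_len: int  # fixed encoder length, e.g. 1500 for Whisper

    def num_blocks(self) -> int:
        # Ceiling division: enough blocks to hold the whole encoder output.
        return -(-self.encoder_seq_len // self.block_size)

    def bytes_per_request(self, dtype_size: int = 2) -> int:
        # K and V caches (hence the factor of 2), allocated once per request
        # and reused by every decode step's cross-attention.
        return (2 * self.num_blocks() * self.block_size
                * self.num_kv_heads * self.head_size * dtype_size)
```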

Signed-off-by: Russell Bryant [email protected]
Co-authored-by: NickLucche [email protected]


Follow-up TODO items

  • utilize encoder cache
  • remove custom cross-attention slot mapping calculation and integrate it into the proper abstraction (see the sketch after this list)
  • clean up hard-coded assumptions about Whisper / multi-modal
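
To make the second TODO item concrete, here is a hypothetical version of the kind of custom slot-mapping calculation it refers to (the function and its signature are illustrative assumptions, not code from this branch): cross-attention reads the same encoder slots at every decode step, so the mapping is derived from the request's cross-attention block table and the fixed encoder length rather than from decoded token positions.

```python
# Hypothetical illustration, not code from this branch: map each encoder
# position to a physical slot in the paged KV cache using the request's
# cross-attention block table.
from typing import List


def cross_attn_slot_mapping(block_table: List[int], block_size: int,
                            encoder_seq_len: int) -> List[int]:
    slots = []
    for pos in range(encoder_seq_len):
        block_id = block_table[pos // block_size]
        slots.append(block_id * block_size + pos % block_size)
    return slots


# Example: blocks [7, 3] of size 16 holding a 20-frame encoder output map to
# slots 112..127 followed by 48..51.
```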


mergify bot commented Jul 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist bot (Contributor) left a comment


Code Review

This is a significant and well-structured pull request that adds Whisper (encoder-decoder) model support to vLLM's V1 engine. The changes are comprehensive, touching on the attention backend, KV cache management, scheduler, and GPU model runner to accommodate the new architecture.

I've identified one critical issue in _build_encoder_attn_metadata where a missing else block could lead to a size mismatch and a runtime error. I've provided a code suggestion to fix this potential bug. Other than that, the implementation looks solid and correctly integrates encoder-decoder support into the existing V1 framework. Great work on this complex feature!
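
For context, a generic sketch of that failure mode (hypothetical code, not the PR's _build_encoder_attn_metadata): if only one branch of a conditional appends to a per-request list, the assembled metadata ends up shorter than the batch and trips a size check later.

```python
# Hypothetical sketch of the bug pattern, not the PR's code: the else branch is
# what keeps len(seq_lens) equal to the number of requests in the batch.
def build_seq_lens(requests, needs_encoder_len):
    seq_lens = []
    for req in requests:
        if needs_encoder_len(req):
            seq_lens.append(req["encoder_len"])
        else:
            # Omitting this branch silently skips some requests, so downstream
            # tensors built from seq_lens no longer line up with the batch.
            seq_lens.append(req["decoder_len"])
    return seq_lens
```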


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337 (Member)

There is already some work to support encoder-decoder models:

Can you coordinate with @maxdebayser to avoid duplicate work?

@maxdebayser (Contributor)

Yeah, I've been talking with @russellb, as there are a few overlapping points in our PRs, for example disabling prefix caching and chunked prefill.
Currently in my PR I'm not disabling the KV cache entirely, because functionally it makes no difference for the encoder attention, so I can keep the diff small. But I do want to test whether removing the KV cache brings a performance improvement for encoder models.
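
To make that trade-off concrete, a minimal sketch (assumptions only, not vLLM's attention backend API) of why encoder self-attention can skip the KV cache entirely: it is non-causal and runs exactly once per request, so K and V can be consumed straight from the batch with nothing written back.

```python
# Minimal sketch under stated assumptions, not vLLM's backend API.
import torch.nn.functional as F
from torch import Tensor


def encoder_self_attention(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
    # q, k, v: [batch, num_heads, encoder_len, head_dim]
    # Non-causal attention over the full encoder sequence; nothing is cached,
    # because no later step ever needs these keys and values again.
    return F.scaled_dot_product_attention(q, k, v, is_causal=False)
```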

@russellb (Member, Author)

> There is already some work to support encoder-decoder models:
>
> Can you coordinate with @maxdebayser to avoid duplicate work?

Yep, we're in contact.

Did you mean to link something different than #20226?

Roughly though, Max had worked on encoder-only support, and I was doing encoder-decoder, which is mostly a superset of encoder-only changes, though I haven't actually tested any encoder-only models with my branch yet.

@russellb (Member, Author)

Follow-up on next steps and collaboration with @maxdebayser:

We're going to combine our work and try to land it all in a few stages.

PR 1) Combine parts of his encoder-only PR (#19988) with the encoder-without-kv-cache changes in this branch. That will be a new jointly-authored PR that will cover encoder-only attention.

PR 2) Update this PR with what's left to make Whisper / encoder-decoder work. That includes some Whisper model changes and a bunch of changes to support cross-attention (encoder-decoder type).

PR 3) Add the last parts of Max's original PR, which supports token_type_ids to run the bert classifier models that need them.

@russellb force-pushed the v1-whisper branch 3 times, most recently from 96be9ad to 4da8b7c, on July 17, 2025 at 19:27
@NickLucche (Collaborator) left a comment


nice one!

@russellb force-pushed the v1-whisper branch 3 times, most recently from 16f557d to a9e3459, on July 18, 2025 at 20:46
mergify bot added the documentation label and removed the needs-rebase label on Jul 18, 2025
@russellb (Member, Author)

I got this caught up with main with all conflicts resolved, but I haven't addressed feedback received so far.

@russellb force-pushed the v1-whisper branch 2 times, most recently from 87d9bfa to f62a66e, on July 18, 2025 at 21:00

mergify bot commented Jul 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @russellb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jul 19, 2025
maxdebayser added a commit to maxdebayser/vllm that referenced this pull request Jul 20, 2025
Add support for encoder models such as BERT, which don't use a KV cache
because their attention is non-causal. Since the KV cache spec is used
to build the attention metadata for decoder models, this PR initializes
the attention metadata builders for encoder-only models directly from
the layers and adds a function to build the attention metadata.

This PR combines elements of PRs
vllm-project#21088
and vllm-project#19988

Summary of changes:

**Flash Attention Backend:**
- Implement encoder self-attention support without using KV cache

**Scheduler:**
- Disable chunked prefill for models without KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
@heheda12345 (Collaborator) left a comment


LGTM! Thanks for your patience in iterating on this. It is really helpful for the V0 deprecation.

@russellb (Member, Author)

> LGTM! Thanks for your patience in iterating on this. It is really helpful for the V0 deprecation.

and thank you for all of the time invested in review! The end result is a lot cleaner because of your feedback.

@heheda12345 enabled auto-merge (squash) on September 10, 2025 at 18:10
@simon-mo disabled auto-merge on September 10, 2025 at 20:53
@simon-mo merged commit 37e8182 into vllm-project:main on Sep 10, 2025
69 of 72 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Labels
ci/build, documentation, multi-modality, ready, speculative-decoding, v1

Successfully merging this pull request may close these issues:

[RFC]: Initial support for multi-model models using cross attention in V1