
Conversation

tdoublep
Member

@tdoublep tdoublep commented Aug 27, 2025

Purpose

We would like to enable V1 by default for hybrid models (or models based on "mamba" layers, where "mamba" is a stand-in for: mamba, mamba2, linear_attention, or short_conv). However, these models do not yet support prefix caching. This PR disables prefix caching by default for these models, ensuring that users do not experience a crash when using the default `vllm serve ...`.

This is just a user-experience improvement until we enable prefix caching for these models; we are aiming to put up a first PR for that later this week. A sketch of the intended behavior is shown below.
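For illustration only, here is a minimal sketch of the default-off behavior described above. The function name and the `model_layer_types` / `user_setting` parameters are assumptions made for this example, not the identifiers used in the actual change:

```python
MAMBA_LAYER_TYPES = {"mamba", "mamba2", "linear_attention", "short_conv"}

def resolve_prefix_caching(model_layer_types: set[str],
                           user_setting: bool | None) -> bool:
    """Return the effective prefix-caching setting.

    Models containing any mamba-style layer do not support prefix
    caching yet, so the default resolves to disabled for them rather
    than letting the engine crash at runtime.
    """
    if model_layer_types & MAMBA_LAYER_TYPES:
        return False
    # For all other models, keep the existing default (enabled).
    return True if user_setting is None else user_setting
```

With a change along these lines, a plain `vllm serve <hybrid-model>` should work out of the box, with no need to manually pass a flag that disables prefix caching.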

cc @heheda12345 @asafgardin

Test Plan

n/a

Test Result

n/a


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request disables prefix caching for Mamba-based and hybrid models to prevent crashes, as this feature is not yet supported for them. This is a valuable user experience improvement. My review includes a suggestion to refine the implementation. Instead of unconditionally disabling the feature, I recommend checking if the user has explicitly enabled it and then issuing a warning before disabling. This approach enhances clarity for the user and aligns better with existing configuration handling practices in the codebase.
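A minimal sketch of the pattern this review suggests, assuming a hypothetical tri-state config flag where `None` means "not set by the user":

```python
import logging

logger = logging.getLogger(__name__)

def apply_mamba_prefix_caching_override(
        user_enabled_prefix_caching: bool | None) -> bool:
    """Disable prefix caching for mamba-based models, warning the user
    only if they explicitly opted in.

    `user_enabled_prefix_caching` is a hypothetical flag: None means
    the user left it at the default, so no warning is needed.
    """
    if user_enabled_prefix_caching:
        # User explicitly enabled an unsupported feature: warn, then
        # disable rather than crashing later.
        logger.warning(
            "Prefix caching is not yet supported for mamba-based "
            "models; disabling it.")
    return False
```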

@Josephasafg
Contributor

LGTM

@heheda12345 heheda12345 enabled auto-merge (squash) August 27, 2025 06:53
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 27, 2025
@Josephasafg
Contributor

@tdoublep Should we also update `v1_guide`? Users no longer need to disable prefix caching after this change.

Signed-off-by: Thomas Parnell <[email protected]>
@tdoublep tdoublep requested a review from hmellor as a code owner August 27, 2025 07:02
@tdoublep
Member Author

@Josephasafg Good catch, thanks. I have updated the language accordingly.

@mergify mergify bot added the documentation Improvements or additions to documentation label Aug 27, 2025
Signed-off-by: Thomas Parnell <[email protected]>
@heheda12345 heheda12345 merged commit 704432a into vllm-project:main Aug 27, 2025
40 checks passed
@tdoublep tdoublep deleted the mamba-v1-prefix-caching-off branch August 27, 2025 13:03
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025