[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models #23716
Conversation
Signed-off-by: Thomas Parnell <[email protected]>
Code Review
This pull request disables prefix caching for Mamba-based and hybrid models to prevent crashes, as this feature is not yet supported for them. This is a valuable user experience improvement. My review includes a suggestion to refine the implementation. Instead of unconditionally disabling the feature, I recommend checking if the user has explicitly enabled it and then issuing a warning before disabling. This approach enhances clarity for the user and aligns better with existing configuration handling practices in the codebase.
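For illustration, here is a minimal sketch of the warn-then-disable pattern the review suggests. It assumes a `cache_config` object whose `enable_prefix_caching` field uses `None` to mean "not explicitly set" (the field name matches vLLM's CLI flag, but the helper and its signature are hypothetical, not vLLM's actual API):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_disable_prefix_caching(cache_config, is_hybrid_model: bool) -> None:
    # Hypothetical helper: only relevant for hybrid/mamba-based models,
    # which do not yet support prefix caching.
    if not is_hybrid_model:
        return
    if cache_config.enable_prefix_caching:
        # The user explicitly opted in, so explain why it is overridden
        # rather than silently changing the setting.
        logger.warning(
            "Prefix caching is not yet supported for hybrid or "
            "mamba-based models; disabling it.")
    # Whether unset (None) or explicitly enabled, fall back to the safe
    # default so the server does not crash at runtime.
    cache_config.enable_prefix_caching = False
```

The design point is the distinction between the two cases: an unset value gets the safe default quietly, while an explicit opt-in triggers a warning before being overridden.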
LGTM
@tdoublep Should we also update v1_guide? Since users don't need to disable prefix caching after this change.
Signed-off-by: Thomas Parnell <[email protected]>
@Josephasafg Good catch, thanks - I have updated the language accordingly.
Signed-off-by: Thomas Parnell <[email protected]>
…ased models (vllm-project#23716) Signed-off-by: Thomas Parnell <[email protected]>
Purpose
We would like to enable V1 by default for hybrid models (or models based on "mamba" layers, where "mamba" is a stand-in for: mamba1, mamba2, linear_attention or short_conv). However, these models do not yet support prefix caching. This PR disables prefix caching by default for these models, ensuring that users do not hit a crash when running a plain
vllm serve ...
. This is just a user experience improvement until we enable prefix caching; we are aiming to put up a first PR for that later this week.
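For illustration, a minimal sketch of what the "mamba-like" check behind this default could look like. The layer-type names come from the list above, but the helper itself and the string names are hypothetical, not vLLM's actual API:

```python
# Layer types treated as "mamba" for the purposes of this default
# (the stand-in list from the paragraph above).
MAMBA_LIKE_LAYER_TYPES = {"mamba1", "mamba2", "linear_attention", "short_conv"}


def uses_mamba_like_layers(layer_types: list[str]) -> bool:
    """Return True if any layer is mamba-based, i.e. carries recurrent
    state that the current prefix-caching implementation cannot reuse."""
    return any(t in MAMBA_LIKE_LAYER_TYPES for t in layer_types)


# Example: a hybrid model mixing attention and mamba2 layers.
assert uses_mamba_like_layers(["attention", "mamba2", "attention"])
```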
cc @heheda12345 @asafgardin
Test Plan
n/a
Test Result
n/a
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.