Commit d69b094

tdoublep authored and epwalsh committed
[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models (vllm-project#23716)
Signed-off-by: Thomas Parnell <[email protected]>
1 parent 778ead5 commit d69b094

File tree

2 files changed (+11, -8 lines changed)


docs/usage/v1_guide.md

Lines changed: 6 additions & 4 deletions
@@ -107,14 +107,16 @@ to enable simultaneous generation and embedding using the same engine instance i
 #### Mamba Models
 
 Models using selective state-space mechanisms instead of standard transformer attention are supported.
-Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported. Please note that these models currently require disabling prefix caching in V1.
+Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported.
+Please note that prefix caching is not yet supported for these models.
 
 Models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
-`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`). Please note that
-these models currently require disabling prefix caching in V1.
+`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`).
+Please note that prefix caching is not yet supported for these models.
 
 Hybrid models with mechanisms different to Mamba are also supported (e.g, `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`).
-Please note that these models currently require disabling prefix caching and enforcing eager mode in V1.
+Please note that prefix caching is not yet supported for these models.
+It is also necessary to enforce eager mode for these models in V1.
 
 #### Encoder-Decoder Models
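The updated guide leaves one manual step for the MiniMax-style hybrids: eager mode still needs to be enforced in V1. As a hedged illustration only (the model name below is a placeholder, not taken from this commit), enforcing eager mode through the offline `LLM` entrypoint might look like this:

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: load a hybrid model with eager mode enforced, as the
# updated guide recommends for V1. The model name is a placeholder. Prefix
# caching no longer has to be disabled by hand, since this commit turns it
# off automatically for hybrid or mamba-based models.
llm = LLM(
    model="MiniMaxAI/MiniMax-Text-01",  # placeholder hybrid model
    enforce_eager=True,                 # per the guide, still needed for these models
)

outputs = llm.generate(["Hybrid attention/state-space models"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```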

vllm/model_executor/models/config.py

Lines changed: 5 additions & 4 deletions
@@ -292,12 +292,13 @@ def verify_and_update_config(cls, vllm_config: "VllmConfig") -> None:
             return
 
         model_config = vllm_config.model_config
+        cache_config = vllm_config.cache_config
         compilation_config = vllm_config.compilation_config
 
-        model_cls, _ = ModelRegistry.resolve_model_cls(
-            model_config.architecture,
-            model_config=model_config,
-        )
+        # TODO(tdoublep): remove once prefix caching is enabled
+        cache_config.enable_prefix_caching = False
+        logger.info("Hybrid or mamba-based model detected: disabling prefix "
+                    "caching since it is not yet supported.")
 
         # TODO(tdoublep): remove as full cuda graph support is added
         FCG_NOT_SUPPORTED_MODELS = [
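In user-facing terms, the effect of this change is that the prefix-caching knob no longer has to be set by hand for these models. A rough sketch, using a placeholder Mamba checkpoint that is not referenced in this commit:

```python
from vllm import LLM

# Before this commit, the V1 guide asked users to disable prefix caching
# explicitly when running Mamba-based models.
llm = LLM(
    model="state-spaces/mamba-2.8b-hf",  # placeholder Mamba model
    enable_prefix_caching=False,
)

# After this commit, the config-verification step shown in the diff above
# disables prefix caching automatically and logs why, so the plain call works.
llm = LLM(model="state-spaces/mamba-2.8b-hf")
```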

0 commit comments
