Motivation.
The vLLM engine expects enough memory to serve at least one request at max-model-len [pointer]. When max-model-len is unset, vLLM reads it from the model configuration. This is problematic, as 10M-token and effectively unbounded context models will become increasingly common.
For example, when running the Llama 4 Scout (10M context) model with the following command:
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 8
It throws the following error after ~10 minutes of initialization:
ValueError: To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed, which is larger than the available KV cache memory (88.38 GiB). Based on the available memory, the estimated maximum model length is 3861424. Try increasing
`gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
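For intuition, the estimated maximum length in that message is roughly the available KV-cache memory divided by the per-token KV-cache footprint. A quick sanity check with the numbers from the log above (assuming the estimate scales linearly with available memory):

```python
# Back-of-the-envelope check of the numbers in the error above (assumption:
# the estimate is simply available KV-cache memory / per-token KV-cache size).
max_model_len = 10_485_760          # tokens needed for one full-length request
kv_needed_gib = 240.00              # KV cache required for one max-len request
kv_available_gib = 88.38            # KV cache memory actually available

gib_per_token = kv_needed_gib / max_model_len
estimated_max_len = int(kv_available_gib / gib_per_token)
print(estimated_max_len)            # ~3,861,000 tokens, matching the 3861424
                                    # in the log up to rounding of the GiB values
```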
While prior work like #16168 by @lengrongfu and @heheda12345 makes it straightforward for users to adjust max-model-len, it still requires one failed attempt whenever people try a new model.
Although it is easy to override max_model_len to get past a one-time failure, maintaining this setting across multiple hardware and parallelism configurations can be annoying. It is also not uncommon to see users confused by OOMs caused by this.
Proposed Change.
Support --max-model-len auto, which automatically truncates max-model-len to the maximum context length supportable by the available HBM capacity and warns users about the override. The actual change requires some refactoring of the initialization code to ensure the updated value is propagated properly to SchedulerConfig and CacheConfig.
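A minimal sketch of how the auto resolution could look, assuming hypothetical helper names and a pre-computed per-token KV-cache size (this is not the actual vLLM initialization path):

```python
# Sketch of the proposed "auto" resolution (names are hypothetical, not real
# vLLM internals): clamp max_model_len to what the available KV-cache memory
# can hold, and warn when the model's default gets overridden.
import logging
from typing import Union

logger = logging.getLogger(__name__)

def resolve_max_model_len(requested: Union[str, int],
                          derived_from_model_config: int,
                          available_kv_cache_bytes: int,
                          kv_cache_bytes_per_token: int) -> int:
    """Return the max_model_len the engine should actually use."""
    if requested != "auto":
        # Explicit user value: keep current behavior.
        return int(requested)

    memory_limited_len = available_kv_cache_bytes // kv_cache_bytes_per_token
    if memory_limited_len < derived_from_model_config:
        logger.warning(
            "max_model_len=auto: truncating %d -> %d based on available "
            "KV cache memory", derived_from_model_config, memory_limited_len)
        return memory_limited_len
    return derived_from_model_config
```

With something like this wired into config initialization, `vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 8 --max-model-len auto` would come up with the largest context length the GPUs can hold (and log a warning about the truncation) instead of failing after minutes of initialization.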
Looking for feedback on the idea!
Feedback Period.
7/24
CC List.
@heheda12345 @mgoin @WoosukKwon
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.