
[RFC]: Support automatic max context length via --max-model-len auto #19407

@yeqcharlotte


Motivation.

The vLLM engine expects enough memory to serve at least one request of length max-model-len [pointer]. When max-model-len is unset, vLLM reads it from the model configuration. This is problematic because models with 10M-token or effectively unbounded context windows are becoming increasingly common.

For example, consider running the Llama 4 Scout model (10M-token context) with the following command:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 8

It throws the following error after about 10 minutes of initialization:

ValueError: To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed, which is larger than the available KV cache memory (88.38 GiB). Based on the available memory, the estimated maximum model length is 3861424. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
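The figures in this error are consistent with KV cache memory scaling linearly with sequence length in this configuration (an assumption; layers with sliding-window or chunked attention would not scale this way). Dividing the required memory by the configured length gives the per-token cost, and scaling the configured length by the ratio of available to required memory reproduces the suggested maximum, up to the rounding of 88.38 GiB:

```latex
\frac{240\ \text{GiB}}{10\,485\,760\ \text{tokens}} = 24\ \text{KiB per token},
\qquad
L_{\max} \approx 10\,485\,760 \times \frac{88.38\ \text{GiB}}{240.00\ \text{GiB}} \approx 3\,861\,381 \approx 3\,861\,424
```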

While prior work such as #16168 by @lengrongfu and @heheda12345 makes it straightforward for users to adjust max-model-len, this still requires one failed attempt whenever people try a new model.

Although overriding max_model_len gets past a one-time failure, maintaining this setting across multiple hardware and parallelism configurations is tedious. Users are also frequently confused by the out-of-memory (OOM) errors this causes.

Proposed Change.

Support `--max-model-len auto`, which automatically truncates max-model-len to the maximum context length that the available HBM capacity can support, and warns users whenever the configured value is overridden.
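A minimal Python sketch of how `auto` could be resolved, assuming the engine knows its available KV cache bytes after memory profiling and has a function mapping a sequence length to the KV cache bytes it needs. `estimate_max_model_len`, `resolve_max_model_len`, and the linear 24 KiB-per-token cost model (derived from the error message above) are hypothetical illustrations, not existing vLLM APIs. The sketch binary-searches for the largest length that fits and logs a warning when it truncates the model's configured value.

```python
import logging
from typing import Callable, Union

logger = logging.getLogger("max_model_len_sketch")
GiB = 1024 ** 3


def estimate_max_model_len(
    config_max_len: int,
    available_bytes: int,
    kv_bytes_for_len: Callable[[int], int],
) -> int:
    """Hypothetical helper: largest length in [0, config_max_len] that fits.

    Binary search only requires kv_bytes_for_len to be non-decreasing in
    the sequence length, so it is not limited to a linear cost model.
    """
    lo, hi = 0, config_max_len
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if kv_bytes_for_len(mid) <= available_bytes:
            lo = mid  # mid fits: the answer is mid or larger
        else:
            hi = mid - 1  # mid does not fit: the answer is smaller
    return lo


def resolve_max_model_len(
    requested: Union[str, int, None],
    config_max_len: int,
    available_bytes: int,
    kv_bytes_for_len: Callable[[int], int],
) -> int:
    """Hypothetical resolution of a --max-model-len value, including "auto"."""
    if requested == "auto":
        fitted = estimate_max_model_len(config_max_len, available_bytes, kv_bytes_for_len)
        if fitted == 0:
            raise ValueError("Not enough KV cache memory for even a single token.")
        if fitted < config_max_len:
            logger.warning(
                "max_model_len reduced from %d to %d to fit available KV cache memory.",
                config_max_len, fitted,
            )
        return fitted
    # Unset falls back to the model config; an explicit value is used as given.
    target = config_max_len if requested is None else int(requested)
    if kv_bytes_for_len(target) > available_bytes:
        raise ValueError(
            f"max_model_len={target} needs {kv_bytes_for_len(target) / GiB:.2f} GiB "
            f"of KV cache, more than the available {available_bytes / GiB:.2f} GiB."
        )
    return target


# Assumed linear cost: 240 GiB / 10,485,760 tokens = 24 KiB per token.
BYTES_PER_TOKEN = 24 * 1024
length = resolve_max_model_len(
    "auto", 10_485_760, int(88.38 * GiB), lambda n: n * BYTES_PER_TOKEN
)
print(length)  # about 3.86M tokens, close to the estimate in the error message
```

Using binary search rather than a single division keeps the sketch valid for any cost function that grows monotonically with length, such as models that mix full and sliding-window attention; for a purely linear cost it returns floor(available_bytes / BYTES_PER_TOKEN). Unset and explicit values keep raising an error when they do not fit, so only `auto` changes behavior.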

The actual change requires some refactoring of the initialization code to ensure the updated value is propagated properly to SchedulerConfig and CacheConfig.

Looking for feedback on the idea!

Feedback Period.

7/24

CC List.

@heheda12345 @mgoin @WoosukKwon

Any Other Things.

No response
