Motivation.
The vLLM engine expects enough memory to serve at least one request at max-model-len [pointer]. When max-model-len is unset, vLLM reads it from the model configuration. This is problematic, as 10M-token and effectively unbounded context models will become increasingly common.
For example, when running the Llama 4 Scout (10M context) model with the following command:
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 8
It throws the following error after ~10 minutes of initialization:
ValueError: To serve at least one request with the models's max seq len (10485760), (240.00 GiB KV cache is needed, which is larger than the available KV cache memory (88.38 GiB). Based on the available memory, the estimated maximum model length is 3861424. Try increasing
`gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
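For intuition, the estimated maximum length in that message is roughly the available KV-cache memory divided by the per-token KV-cache footprint. A quick sanity check with the numbers from the log above (assuming the estimate scales linearly with available memory):

```python
# Back-of-the-envelope check of the numbers in the error above (assumption:
# the estimate is simply available KV-cache memory / per-token KV-cache size).
max_model_len = 10_485_760          # tokens needed for one full-length request
kv_needed_gib = 240.00              # KV cache required for one max-len request
kv_available_gib = 88.38            # KV cache memory actually available

gib_per_token = kv_needed_gib / max_model_len
estimated_max_len = int(kv_available_gib / gib_per_token)
print(estimated_max_len)            # ~3,861,000 tokens, matching the 3861424
                                    # in the log up to rounding of the GiB values
```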
While prior work like #16168 by @lengrongfu and @heheda12345 makes it straightforward for users to adjust max-model-len, it still requires one failed attempt whenever people try a new model.
Although it is easy to override max_model_len to get past a one-time failure, maintaining this setting across multiple hardware and parallelism configurations can be annoying. It is also not uncommon to see users confused by OOMs caused by this.
Proposed Change.
Support --max-model-len auto, which automatically truncates max-model-len to the maximum context length supportable by the available HBM capacity and warns users about the override. The actual change requires some refactoring of the initialization code to ensure the updated value is propagated properly to SchedulerConfig and CacheConfig.
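A minimal sketch of how the auto resolution could look, assuming hypothetical helper names and a pre-computed per-token KV-cache size (this is not the actual vLLM initialization path):

```python
# Sketch of the proposed "auto" resolution (names are hypothetical, not real
# vLLM internals): clamp max_model_len to what the available KV-cache memory
# can hold, and warn when the model's default gets overridden.
import logging
from typing import Union

logger = logging.getLogger(__name__)

def resolve_max_model_len(requested: Union[str, int],
                          derived_from_model_config: int,
                          available_kv_cache_bytes: int,
                          kv_cache_bytes_per_token: int) -> int:
    """Return the max_model_len the engine should actually use."""
    if requested != "auto":
        # Explicit user value: keep current behavior.
        return int(requested)

    memory_limited_len = available_kv_cache_bytes // kv_cache_bytes_per_token
    if memory_limited_len < derived_from_model_config:
        logger.warning(
            "max_model_len=auto: truncating %d -> %d based on available "
            "KV cache memory", derived_from_model_config, memory_limited_len)
        return memory_limited_len
    return derived_from_model_config
```

With something like this wired into config initialization, `vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 8 --max-model-len auto` would come up with the largest context length the GPUs can hold (and log a warning about the truncation) instead of failing after minutes of initialization.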
Looking for feedback on the idea!
Feedback Period.
7/24
CC List.
@heheda12345 @mgoin @WoosukKwon
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.