[Feature]: Estimate max-model-len when the KV cache memory is not enough #16118

Description

@heheda12345

🚀 The feature, motivation and pitch

When the KV cache memory is not enough to hold even a single request, vLLM v1 raises an error like this:

ERROR 04-05 01:12:55 [core.py:390] ValueError: To serve at least one request with the models's max seq len (1048576), (24.00 GiB KV cache is needed, which is larger than the available KV cache memory (9.97 GiB). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

It would be more convenient if we could include an estimated max_model_len for the user in this error message.

Since the introduction of different types of KV cache (e.g., sliding window), the estimation is more complex than max_model_len = block_size * num_gpu_blocks. Help wanted: implement the estimation as a binary search over max_model_len based on KVCacheSpec.max_memory_usage_bytes (a minimal sketch follows below).
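A minimal sketch of the binary-search idea, not vLLM's actual implementation: memory_needed_fn below is a hypothetical stand-in for summing KVCacheSpec.max_memory_usage_bytes over all layers for a candidate max_model_len, and the numbers in the usage example are made up.

```python
from typing import Callable


def estimate_max_model_len(
    available_memory_bytes: int,
    memory_needed_fn: Callable[[int], int],
    upper_bound: int,
    block_size: int = 16,
) -> int:
    """Binary-search the largest max_model_len whose worst-case KV cache
    footprint, as reported by memory_needed_fn, fits in the available memory.

    memory_needed_fn must be non-decreasing in the model length; it stands in
    for aggregating KVCacheSpec.max_memory_usage_bytes across layers.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if memory_needed_fn(mid) <= available_memory_bytes:
            lo = mid  # mid fits, so try a larger length
        else:
            hi = mid - 1  # mid does not fit, so shrink the search range
    # Round down to a multiple of the block size so the result is allocatable.
    return (lo // block_size) * block_size


if __name__ == "__main__":
    # Toy example: a full-attention model where every token costs 24 KiB of
    # KV cache (hypothetical numbers, not tied to any real model).
    per_token_bytes = 24 * 1024
    available = 10 * 1024**3  # ~10 GiB of free KV cache memory

    est = estimate_max_model_len(
        available_memory_bytes=available,
        memory_needed_fn=lambda n: n * per_token_bytes,
        upper_bound=1_048_576,
    )
    print(f"Estimated max_model_len: {est}")
```

The search returns 0 when even a single block does not fit, which the error path could treat as "decreasing max_model_len will not help; increase gpu_memory_utilization instead."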

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
