Your current environment
🐛 Describe the bug
Using `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` allows setting `--max-model-len` to be greater than what is found in the model configuration. However, this only affects the scheduler's knowledge of the max context; the model configuration itself is not changed.
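To make the scenario concrete, here is a minimal sketch of the kind of setup I mean, using the offline `LLM` API; the model name, lengths, and prompt are placeholders rather than my exact configuration:

```python
import os

# Opt in to overriding the max length derived from the model config.
# Set before importing vLLM so the env var is picked up.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

from vllm import LLM, SamplingParams

# Placeholder: a model whose config reports max_position_embeddings = 4096.
llm = LLM(
    model="some-org/some-4k-model",
    max_model_len=8192,  # deliberately larger than the model config allows
)

# The scheduler accepts a request longer than 4096 tokens,
# but the model itself still only has 4096 positions.
long_prompt = "word " * 6000
outputs = llm.generate([long_prompt], SamplingParams(max_tokens=16))
```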
In V1 with torch compilation, I've found that extending the context with `--max-model-len` causes a CUDA crash when a long-context request is processed (at least for models with `max_position_embeddings`):
...
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [112,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [113,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [114,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [115,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
...
ERROR 05-09 22:28:50 core.py:291] File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
ERROR 05-09 22:28:50 core.py:291] output = self.model_runner.execute_model(scheduler_output)
ERROR 05-09 22:28:50 core.py:291] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291] File "/workspace/my-vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-09 22:28:50 core.py:291] return func(*args, **kwargs)
ERROR 05-09 22:28:50 core.py:291] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291] File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 993, in execute_model
ERROR 05-09 22:28:50 core.py:291] valid_sampled_token_ids = sampled_token_ids.tolist()
ERROR 05-09 22:28:50 core.py:291] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291] RuntimeError: CUDA error: device-side assert triggered
When not using PyTorch compile (e.g. V0 or with `--enforce-eager`), there is no immediate crash, but generation results are gibberish once the request exceeds the model's configured length, and I suspect crashes are still possible.
Obviously this requires `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`, which is a good opt-in guard, and the warning makes it clear that this could be problematic, but I wonder if we can do more to prevent a known crash scenario.
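As a strawman for what "more" could look like, here is a rough sketch of a startup check based on the Hugging Face config; the function name, condition, and error wording are my own, not how vLLM structures its config validation today:

```python
from transformers import AutoConfig


def check_long_max_model_len(model: str, max_model_len: int) -> None:
    """Hypothetical guard: refuse a max_model_len beyond the model's
    positional range unless rope scaling is configured."""
    hf_config = AutoConfig.from_pretrained(model)
    max_pos = getattr(hf_config, "max_position_embeddings", None)
    rope_scaling = getattr(hf_config, "rope_scaling", None)
    if max_pos is not None and max_model_len > max_pos and rope_scaling is None:
        raise ValueError(
            f"max_model_len={max_model_len} exceeds max_position_embeddings="
            f"{max_pos} and no rope_scaling is configured; requests longer "
            f"than {max_pos} tokens are likely to produce gibberish or crash."
        )
```

The `rope_scaling` exception is there because extending the context can be legitimate when scaling is configured; that condition is an assumption on my part, not existing vLLM behavior.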
#17747 is an example where it would be easy to think that this should work without crashing. In that case I was using `--config-format mistral` while setting `--max-model-len 131072` to match the `config.json`.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.