
[Bug]: Usage of VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 in V1 likely to cause a crash #17924

@tjohnson31415

Description

🐛 Describe the bug

Using VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 allows setting --max-model-len to a value greater than the one derived from the model configuration. However, this only affects the scheduler's knowledge of the max context; the model configuration itself is not changed.
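
For concreteness, a minimal reproduction sketch; the model name and lengths here are assumptions, and the env var is set before importing vLLM so the override is picked up:

```python
import os

# Opt in to the long-context override before importing vLLM.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

from vllm import LLM, SamplingParams

# Assume a model whose config.json sets max_position_embeddings=4096;
# the scheduler is told the context is 8192, but the model is unchanged.
llm = LLM(model="meta-llama/Llama-2-7b-hf", max_model_len=8192)

# A prompt longer than 4096 tokens drives position indices past the
# model's limit, triggering the crash described below.
long_prompt = "word " * 6000
outputs = llm.generate([long_prompt], SamplingParams(max_tokens=16))
```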

In V1 with torch compilation, I've found that extending the context with --max-model-len causes a CUDA crash when a long-context request is processed (at least for models that set max_position_embeddings):

...
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [112,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [113,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [114,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [115,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
...
ERROR 05-09 22:28:50 core.py:291]   File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
ERROR 05-09 22:28:50 core.py:291]     output = self.model_runner.execute_model(scheduler_output)
ERROR 05-09 22:28:50 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291]   File "/workspace/my-vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-09 22:28:50 core.py:291]     return func(*args, **kwargs)
ERROR 05-09 22:28:50 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291]   File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 993, in execute_model
ERROR 05-09 22:28:50 core.py:291]     valid_sampled_token_ids = sampled_token_ids.tolist()
ERROR 05-09 22:28:50 core.py:291]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291] RuntimeError: CUDA error: device-side assert triggered
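
The failing assert (0 <= index < 4096) looks like a position-embedding lookup past the table size: the scheduler admits positions beyond max_position_embeddings, and the compiled kernel's bounds check fires. A standalone sketch of the same class of error (the table size and positions are assumptions, and nn.Embedding stands in for the model's position table):

```python
import torch

max_position_embeddings = 4096
pos_table = torch.nn.Embedding(max_position_embeddings, 64)

# Positions for a request that has run past the model's limit.
positions = torch.arange(4090, 4100)  # 4096..4099 are out of bounds

# On CPU this raises IndexError immediately; on CUDA (and inside a
# torch.compile'd kernel) it surfaces as a device-side assert like the
# one in the trace above.
pos_table(positions)
```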

When not using torch compilation (e.g. in V0 or with --enforce-eager), there is no immediate crash, but generation produces gibberish once a request exceeds the model's configured context, and I suspect crashes are still possible.

Of course, this requires opting in with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1, which is a good guard, and the warning is clear that it could be problematic, but I wonder if we can do more to prevent a known crash scenario.
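
One option, sketched below rather than taken from vLLM's code (the function name and placement are hypothetical), would be to validate request lengths against max_position_embeddings even when the scheduler limit has been raised, rejecting requests that would index past the position table:

```python
def validate_request_length(prompt_len: int, max_new_tokens: int,
                            max_position_embeddings: int) -> None:
    """Reject requests that would index past the position-embedding table.

    A hypothetical guard, not vLLM's actual code.
    """
    total = prompt_len + max_new_tokens
    if total > max_position_embeddings:
        raise ValueError(
            f"Request needs {total} positions but the model only has "
            f"{max_position_embeddings}; generation would read out of "
            "bounds even though --max-model-len permits it.")
```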

#17747 is an example where it would be easy to assume this should work without crashing: in that case I was using --config-format mistral while setting --max-model-len 131072 to match the config.json.

