Your current environment
🐛 Describe the bug
Using `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` allows setting `--max-model-len` to be greater than what is found in the model configuration. However, this only affects the scheduler's knowledge of the max context; the model configuration itself is not changed.
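To make the scenario concrete, here is a minimal sketch of the kind of setup I mean, using the offline `LLM` API; the model name, lengths, and prompt are placeholders rather than my exact configuration:

```python
import os

# Opt in to overriding the max length derived from the model config.
# Set before importing vLLM so the env var is picked up.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

from vllm import LLM, SamplingParams

# Placeholder: a model whose config reports max_position_embeddings = 4096.
llm = LLM(
    model="some-org/some-4k-model",
    max_model_len=8192,  # deliberately larger than the model config allows
)

# The scheduler accepts a request longer than 4096 tokens,
# but the model itself still only has 4096 positions.
long_prompt = "word " * 6000
outputs = llm.generate([long_prompt], SamplingParams(max_tokens=16))
```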
In V1 with torch compilation, I've found that extending the context with `--max-model-len` causes a CUDA crash when a long-context request is processed (at least for models with `max_position_embeddings`):
...
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [112,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [113,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [114,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
/workspace/my-vllm/lib64/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py:45: <module>: block: [337,0,0], thread: [115,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp10, [XBLOCK]) < 4096` failed.
...
ERROR 05-09 22:28:50 core.py:291] File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
ERROR 05-09 22:28:50 core.py:291] output = self.model_runner.execute_model(scheduler_output)
ERROR 05-09 22:28:50 core.py:291] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291] File "/workspace/my-vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-09 22:28:50 core.py:291] return func(*args, **kwargs)
ERROR 05-09 22:28:50 core.py:291] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291] File "/workspace/my-vllm/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 993, in execute_model
ERROR 05-09 22:28:50 core.py:291] valid_sampled_token_ids = sampled_token_ids.tolist()
ERROR 05-09 22:28:50 core.py:291] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-09 22:28:50 core.py:291] RuntimeError: CUDA error: device-side assert triggered
When not using PyTorch compile (e.g. V0 or with `--enforce-eager`), there is no immediate crash, but generation results are gibberish once the request exceeds the model's configured length, and I suspect crashes are still possible.
Obviously this requires `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`, which is a good opt-in guard, and the warning makes it clear that this could be problematic, but I wonder if we can do more to prevent a known crash scenario.
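As a strawman for what "more" could look like, here is a rough sketch of a startup check based on the Hugging Face config; the function name, condition, and error wording are my own, not how vLLM structures its config validation today:

```python
from transformers import AutoConfig


def check_long_max_model_len(model: str, max_model_len: int) -> None:
    """Hypothetical guard: refuse a max_model_len beyond the model's
    positional range unless rope scaling is configured."""
    hf_config = AutoConfig.from_pretrained(model)
    max_pos = getattr(hf_config, "max_position_embeddings", None)
    rope_scaling = getattr(hf_config, "rope_scaling", None)
    if max_pos is not None and max_model_len > max_pos and rope_scaling is None:
        raise ValueError(
            f"max_model_len={max_model_len} exceeds max_position_embeddings="
            f"{max_pos} and no rope_scaling is configured; requests longer "
            f"than {max_pos} tokens are likely to produce gibberish or crash."
        )
```

The `rope_scaling` exception is there because extending the context can be legitimate when scaling is configured; that condition is an assumption on my part, not existing vLLM behavior.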
#17747 is an example where it would be easy to think that this should work without crashing. In that case I was using `--config-format mistral` while setting `--max-model-len 131072` to match the `config.json`.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.