[Bugfix]: prevent crash when sampled tokens exceed max_model_len #25160
base: main
Conversation
Code Review
This pull request addresses a bug that caused a crash when the number of sampled tokens exceeded the maximum model length. The fix correctly replaces an assertion with logic that truncates the sampled tokens so they fit within max_model_len. The implementation is sound and handles edge cases, such as when a sequence has already reached its maximum length. This change effectively resolves the reported issue.
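For illustration, here is a minimal, self-contained sketch of the truncation behavior the review describes; the function and variable names (clamp_sampled_ids, remaining_slots, etc.) are placeholders for this sketch, not vLLM's actual identifiers.

```python
# Illustrative sketch only; names are placeholders, not vLLM's exact API.
def clamp_sampled_ids(sampled_ids, num_computed_tokens, max_model_len):
    """Truncate newly sampled token ids so the total never exceeds max_model_len."""
    remaining_slots = max_model_len - num_computed_tokens
    if remaining_slots <= 0:
        # The sequence already reached its maximum length; keep nothing.
        return []
    if len(sampled_ids) > remaining_slots:
        # Instead of asserting (and crashing), drop the overflow.
        sampled_ids = sampled_ids[:remaining_slots]
    return sampled_ids


# Example: a request with 4094 computed tokens, max_model_len=4096,
# and 3 sampled tokens keeps only 2 of them.
assert clamp_sampled_ids([11, 22, 33], 4094, 4096) == [11, 22]
```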
Force-pushed from 3e4cbd0 to c0c438f
vllm/v1/worker/gpu_model_runner.py (outdated)
continue
# avoid overflow.
if len(sampled_ids) > remaining_slots:
    sampled_ids = sampled_ids[:remaining_slots]
Typically, when truncating the output, vLLM would set the reason/cause in the output object. But in any case, it was extremely weird that this assert blew up.
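As a hedged illustration of the point about recording the cause, a toy sketch follows; SampleOutput and finish_reason are hypothetical stand-ins for this example, not the actual vLLM output type or field used on this code path.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for an output object; not vLLM's actual type.
@dataclass
class SampleOutput:
    token_ids: list[int]
    finish_reason: Optional[str] = None  # e.g. "length" when max_model_len is hit

def truncate_with_reason(out: SampleOutput, remaining_slots: int) -> SampleOutput:
    """Clamp token_ids to the remaining slots and record why the output was cut."""
    if len(out.token_ids) > remaining_slots:
        out.token_ids = out.token_ids[:max(remaining_slots, 0)]
        out.finish_reason = "length"  # make the truncation visible to callers
    return out

out = truncate_with_reason(SampleOutput(token_ids=[1, 2, 3]), remaining_slots=1)
assert out.token_ids == [1] and out.finish_reason == "length"
```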
I noticed that #20291
changed one line in the scheduler from:
self.max_model_len - request.num_computed_tokens
to:
self.max_model_len - 1 - request.num_computed_tokens
I haven’t directly identified the root cause of the overflow, but I did notice that the scheduler handles num_new_tokens
inconsistently between running and waiting requests, which is very likely related.
@WoosukKwon Could you share some suggestions on this?
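To make the off-by-one gap between the two expressions concrete, here is a toy arithmetic sketch (illustrative numbers only; it does not claim to reproduce the actual overflow path):

```python
# Compare the two budgets quoted above for a request near the length limit.
max_model_len = 4096
num_computed_tokens = 4095

old_budget = max_model_len - num_computed_tokens        # 1: one token may still be scheduled
new_budget = max_model_len - 1 - num_computed_tokens    # 0: scheduling stops one step earlier

print(old_budget, new_budget)  # 1 0
# If one code path budgets with the old bound while another assumes the new one,
# a sampled token could land one past max_model_len, which is where the removed
# assert used to blow up.
```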
Force-pushed from 7fccea0 to a823976
@yannicks1 Looks like we're fixing the same issue; I guessed another possible cause. I'm new to vLLM, so any feedback would be super helpful.
Force-pushed from a823976 to 299ae57
Hi @nicole-lihui, I found that "[Optimization] Cache sampled token ids in model runner" #20291 actually introduced this bug, and I addressed it accordingly in that part of the code. IMO that is safer and less intrusive than touching the scheduler as in your PR. I also provided a unit test to detect such an inconsistency in the future.
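For illustration, a rough, self-contained sketch of the kind of invariant such a unit test could assert; FakeRunner and the test body are hypothetical and do not reflect the actual test added in the other PR.

```python
# Hypothetical sketch only: FakeRunner stands in for the real model runner/test setup.
class FakeRunner:
    def __init__(self, max_model_len: int, num_computed_tokens: int):
        self.max_model_len = max_model_len
        self.num_computed_tokens = num_computed_tokens

    def sample(self, n: int) -> list[int]:
        # Pretend the sampler produced n tokens, then clamp as the fix does.
        remaining = self.max_model_len - self.num_computed_tokens
        return list(range(n))[:max(remaining, 0)]

def test_sampled_tokens_never_exceed_max_model_len():
    runner = FakeRunner(max_model_len=4096, num_computed_tokens=4095)
    sampled = runner.sample(3)
    assert runner.num_computed_tokens + len(sampled) <= runner.max_model_len

test_sampled_tokens_never_exceed_max_model_len()
```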
@nicole-lihui Could you perhaps review #24446? To me, the fix looks simpler and the PR also includes a test case.
Force-pushed from 299ae57 to 59b4f2b
Purpose
Closes #25120
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model, if applicable.