
Conversation

@nicole-lihui (Contributor) commented on Sep 18, 2025

Purpose

Closes #25120

  • Truncate sampled tokens that exceed max_model_len so the assertion no longer fires.

Test Plan

Test Result



@gemini-code-assist bot (Contributor) left a comment:


Code Review

This pull request addresses a bug that caused a crash when the number of sampled tokens exceeded the maximum model length. The fix correctly replaces an assertion with logic to truncate the sampled tokens, ensuring they fit within max_model_len. The implementation is sound and handles edge cases, such as when a sequence has already reached its maximum length. This change effectively resolves the reported issue.

continue
# Avoid overflow: keep at most `remaining_slots` of the sampled tokens.
if len(sampled_ids) > remaining_slots:
    sampled_ids = sampled_ids[:remaining_slots]
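
For context, here is a minimal, self-contained sketch of the truncation idea in the diff above. It assumes `remaining_slots` is derived from `max_model_len` and the number of tokens the request already occupies; the helper name and the numbers are illustrative only, not the actual vLLM code path.

```python
def clamp_sampled_ids(sampled_ids: list[int],
                      num_occupied_tokens: int,
                      max_model_len: int) -> list[int]:
    # Hypothetical helper, not vLLM code: drop sampled tokens that would
    # push the sequence past the model's length cap.
    remaining_slots = max_model_len - num_occupied_tokens
    if remaining_slots <= 0:
        # Sequence is already at (or past) the cap: keep nothing.
        return []
    if len(sampled_ids) > remaining_slots:
        # Keep only as many tokens as still fit.
        return sampled_ids[:remaining_slots]
    return sampled_ids


# Example: with 4094 tokens already in place and max_model_len=4096,
# only 2 of the 3 sampled tokens can be kept.
assert clamp_sampled_ids([11, 22, 33], 4094, 4096) == [11, 22]
```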

A reviewer commented:

Typically, when truncating the output, vLLM would set the reason/cause in the output object.

But in any case, it was extremely weird that this assert blew up.
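
To illustrate the point about recording the cause, here is a rough sketch of that pattern. `SamplerOutput`, `finish_reason`, and the `"length"` value are hypothetical stand-ins, not vLLM's actual output object or field names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplerOutput:
    # Hypothetical stand-in for the real output object.
    token_ids: list[int]
    finish_reason: Optional[str] = None


def truncate_with_reason(out: SamplerOutput, remaining_slots: int) -> SamplerOutput:
    # Truncate and record why, instead of silently dropping tokens.
    if len(out.token_ids) > remaining_slots:
        out.token_ids = out.token_ids[:max(remaining_slots, 0)]
        # Lets callers distinguish a natural stop from hitting the length cap.
        out.finish_reason = "length"
    return out
```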

@nicole-lihui (Contributor, Author) replied:

I noticed that #20291 changed one line in the scheduler from:

self.max_model_len - request.num_computed_tokens

to:

self.max_model_len - 1 - request.num_computed_tokens

I haven’t directly identified the root cause of the overflow, but I did notice that the scheduler handles num_new_tokens inconsistently between running and waiting requests, which is very likely related.
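
To make the suspected inconsistency concrete, here is a hypothetical numeric sketch of the two budget formulas; the values are made up and the comments are my reading of the change, not a confirmed root-cause analysis.

```python
# Hypothetical illustration of the one-token difference between the formulas.
max_model_len = 4096
num_computed_tokens = 4095

# Formula before #20291: one slot still available, so one more token
# could be scheduled, exactly filling max_model_len.
budget_old = max_model_len - num_computed_tokens      # == 1

# Formula after #20291: reserves one slot, so nothing more is scheduled.
budget_new = max_model_len - 1 - num_computed_tokens  # == 0

# If the running path budgets with one formula while the waiting path (or a
# later bookkeeping step) assumes the other, a sampled token can land one
# position past max_model_len, which is the kind of state the assert in
# GPUModelRunner._bookkeeping_sync complains about.
print(budget_old, budget_new)
```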

@WoosukKwon Could you share some suggestions on this?

@nicole-lihui force-pushed the fix-25120 branch 2 times, most recently from 7fccea0 to a823976, on September 24, 2025 at 09:29.
@nicole-lihui (Contributor, Author) commented:

@yannicks1 Looks like we’re fixing the same issue — I guessed another possible cause. I’m new to vLLM, so any feedback would be super helpful.

@yannicks1 (Contributor) commented on Sep 24, 2025:

Hi @nicole-lihui, I found that "[Optimization] Cache sampled token ids in model runner" (#20291) actually introduced this bug, and I addressed it accordingly in that part of the code. IMO that is safer and less intrusive than touching the scheduler as in your PR. I also provide a unit test to detect such an inconsistency in the future.

@tdoublep (Member) commented:

@nicole-lihui Could you perhaps review #24446? To me, the fix looks simpler and the PR also includes a test case.


Successfully merging this pull request may close these issues.

[Bug]: Strange exception in GPUModelRunner._bookkeeping_sync
4 participants