
Conversation

@nicole-lihui (Contributor) commented on Sep 18, 2025

Purpose

Closes #25120

  • Truncate sampled tokens that exceed max_model_len so the assertion no longer fires.

Test Plan

Test Result



@gemini-code-assist bot (Contributor) left a comment:


Code Review

This pull request addresses a bug that caused a crash when the number of sampled tokens exceeded the maximum model length. The fix correctly replaces an assertion with logic to truncate the sampled tokens, ensuring they fit within max_model_len. The implementation is sound and handles edge cases, such as when a sequence has already reached its maximum length. This change effectively resolves the reported issue.

continue
# Avoid overflow: keep at most `remaining_slots` of the sampled tokens.
if len(sampled_ids) > remaining_slots:
    sampled_ids = sampled_ids[:remaining_slots]
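
For context, here is a minimal, self-contained sketch of the truncation idea in the diff above. It assumes `remaining_slots` is derived from `max_model_len` and the number of tokens the request already occupies; the helper name and the numbers are illustrative only, not the actual vLLM code path.

```python
def clamp_sampled_ids(sampled_ids: list[int],
                      num_occupied_tokens: int,
                      max_model_len: int) -> list[int]:
    # Hypothetical helper, not vLLM code: drop sampled tokens that would
    # push the sequence past the model's length cap.
    remaining_slots = max_model_len - num_occupied_tokens
    if remaining_slots <= 0:
        # Sequence is already at (or past) the cap: keep nothing.
        return []
    if len(sampled_ids) > remaining_slots:
        # Keep only as many tokens as still fit.
        return sampled_ids[:remaining_slots]
    return sampled_ids


# Example: with 4094 tokens already in place and max_model_len=4096,
# only 2 of the 3 sampled tokens can be kept.
assert clamp_sampled_ids([11, 22, 33], 4094, 4096) == [11, 22]
```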

A reviewer commented:

Typically, when truncating the output, vLLM would set the reason/cause in the output object.

But in any case, it was extremely weird that this assert blew up.
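
To illustrate the point about recording the cause, here is a rough sketch of that pattern. `SamplerOutput`, `finish_reason`, and the `"length"` value are hypothetical stand-ins, not vLLM's actual output object or field names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplerOutput:
    # Hypothetical stand-in for the real output object.
    token_ids: list[int]
    finish_reason: Optional[str] = None


def truncate_with_reason(out: SamplerOutput, remaining_slots: int) -> SamplerOutput:
    # Truncate and record why, instead of silently dropping tokens.
    if len(out.token_ids) > remaining_slots:
        out.token_ids = out.token_ids[:max(remaining_slots, 0)]
        # Lets callers distinguish a natural stop from hitting the length cap.
        out.finish_reason = "length"
    return out
```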

@nicole-lihui (Contributor, Author) replied:

I noticed that #20291 changed one line in the scheduler from:

self.max_model_len - request.num_computed_tokens

to:

self.max_model_len - 1 - request.num_computed_tokens

I haven’t directly identified the root cause of the overflow, but I did notice that the scheduler handles num_new_tokens inconsistently between running and waiting requests, which is very likely related.
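
To make the suspected inconsistency concrete, here is a hypothetical numeric sketch of the two budget formulas; the values are made up and the comments are my reading of the change, not a confirmed root-cause analysis.

```python
# Hypothetical illustration of the one-token difference between the formulas.
max_model_len = 4096
num_computed_tokens = 4095

# Formula before #20291: one slot still available, so one more token
# could be scheduled, exactly filling max_model_len.
budget_old = max_model_len - num_computed_tokens      # == 1

# Formula after #20291: reserves one slot, so nothing more is scheduled.
budget_new = max_model_len - 1 - num_computed_tokens  # == 0

# If the running path budgets with one formula while the waiting path (or a
# later bookkeeping step) assumes the other, a sampled token can land one
# position past max_model_len, which is the kind of state the assert in
# GPUModelRunner._bookkeeping_sync complains about.
print(budget_old, budget_new)
```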

@WoosukKwon Could you share some suggestions on this?

@nicole-lihui force-pushed the fix-25120 branch 2 times, most recently from 7fccea0 to a823976, on September 24, 2025 at 09:29.
@nicole-lihui (Contributor, Author) commented:

@yannicks1 Looks like we’re fixing the same issue — I guessed another possible cause. I’m new to vLLM, so any feedback would be super helpful.

@yannicks1 (Contributor) commented on Sep 24, 2025:

Hi @nicole-lihui, I found that "[Optimization] Cache sampled token ids in model runner" (#20291) actually introduced this bug, and I addressed it accordingly in that part of the code. IMO that is safer and less intrusive than touching the scheduler as in your PR. I also provide a unit test to detect such an inconsistency in the future.

@tdoublep (Member) commented:

@nicole-lihui Could you perhaps review #24446? To me, the fix looks simpler and the PR also includes a test case.


Successfully merging this pull request may close these issues.

[Bug]: Strange exception in GPUModelRunner._bookkeeping_sync
4 participants