Re-enable prefill of max model length #24446
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Code Review
This pull request relaxes an assertion to re-enable prefilling up to the maximum model length and sampling a single token. While the intent is correct, the change as-is will likely cause an IndexError because the underlying buffer for token IDs is not large enough to accommodate the extra token. A fix is required in vllm/v1/worker/gpu_input_batch.py (and likely vllm/v1/worker/tpu_input_batch.py) to increase the buffer size.
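To make the concern concrete, here is a minimal sketch of the failure mode; the buffer shape is an assumption chosen for illustration, not code copied from vllm/v1/worker/gpu_input_batch.py:

import numpy as np

max_model_len = 2048
max_num_reqs = 4
# Assumed layout: one row of token IDs per request, max_model_len slots each.
token_ids_cpu = np.zeros((max_num_reqs, max_model_len), dtype=np.int32)

req_idx = 0
start_idx = max_model_len  # the prefill already occupies every slot
# Storing the single sampled token needs column index max_model_len, which
# does not exist when the buffer has exactly max_model_len columns:
token_ids_cpu[req_idx, start_idx] = 42
# -> IndexError: index 2048 is out of bounds for axis 1 with size 2048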
Signed-off-by: Yannick Schnider <[email protected]>
@WoosukKwon @LucasWilkinson tagging you guys here as author/reviewer of #20291
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
self.input_batch.token_ids_cpu[req_idx,
                               start_idx:end_idx] = sampled_ids
self.input_batch.is_token_ids[req_idx, start_idx:end_idx] = True
assert end_idx <= self.max_model_len + 1, (
assert end_idx <= self.max_model_len + 1 should fix the immediate issue and probably works with self.max_model_len - 1 - request.num_computed_tokens.
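For context, a small numeric sketch of the boundary case being discussed (the variable names and values are illustrative, not the runner's actual bookkeeping):

max_model_len = 2048      # model context window
num_prompt_tokens = 2048  # the prefill fills the entire context
num_sampled = 1           # a single new token is sampled

start_idx = num_prompt_tokens      # 2048: first free slot after the prompt
end_idx = start_idx + num_sampled  # 2049 == max_model_len + 1

# A bound of end_idx <= max_model_len rejects this request, while the relaxed
# bound accepts exactly one sampled token beyond a full prefill.
assert end_idx <= max_model_len + 1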
I’m a bit stuck because:
- One place adds +1, another -1 — I feel like they cancel out, so maybe this isn’t the real root cause.
- From the vLLM module side, I don't think the runner should care too much about how max_model_len is calculated upstairs. The assert is mostly just a safeguard.
@vadimkantorov brought up a deeper question: why is this assert even triggered? Looking at the call chain, it seems something unexpected happens in the schedule part (my PR isn’t addressing that).
Also, I really like the unit test you added — maybe we can team up and dig into the root cause together. 👍 @yannicks1
That (getting rid of the -1) is exactly what I will address in a follow-up PR (I have the changes working locally already)!
This PR is about prefill of max model length only.
For the decode stopping condition I recently merged a PR in Hugging Face which allows one last decode on the max model length of context before emitting the warning (HF simply truncates the context rather than stopping generation like vLLM does).
I split this into two separate PRs: the 1st (this one) re-enables prefill of max model length, directly addressing the assert failure introduced in #20291; the 2nd (which builds on top of this one) allows one last decode on max model length of context (that's where getting rid of the -1 will happen, along with other minor changes).
The reasons for splitting this are a) making the PRs smaller and easier to review, and b) staying consistent with HF (my HF PR just got merged into main this week, probably not in a release yet).
I’m a bit stuck because:
- One place adds +1, another -1 — I feel like they cancel out, so maybe this isn’t the real root cause.
For prefill this part of the code is untouched; it only applies to running sequences (decodes). So there is no +1 / -1 cancellation happening in my unit test. As I mentioned above, for decodes on the max model length this -1 will be gone (and that's not the only change). I can share the branch later today for clarification...
@vadimkantorov brought up a deeper question: why is this assert even triggered? Looking at the call chain, it seems something unexpected happens in the schedule part (my PR isn’t addressing that).
In my case the assertion is triggered when doing a prefill of max_model_len and requesting 1 output token. I would be surprised if you triggered it another way? @vadimkantorov
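A minimal repro along these lines might look as follows (a hypothetical script, not taken from the PR; the model name and prompt construction are placeholders):

from vllm import LLM, SamplingParams

max_model_len = 2048
llm = LLM(model="JackFram/llama-160m", max_model_len=max_model_len)

# Build a prompt whose tokenized length is exactly max_model_len.
tokenizer = llm.get_tokenizer()
prompt_token_ids = tokenizer.encode("hello " * max_model_len)[:max_model_len]

# A full prefill plus a single requested output token is the case that used
# to trip the assertion.
outputs = llm.generate(
    [{"prompt_token_ids": prompt_token_ids}],
    SamplingParams(max_tokens=1),
)
print(outputs[0].outputs[0].token_ids)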
Probably the same on my side: max_model_len = 1024 and prompt_len happened to be 1023 or something similar. If a fix is out, I can try it.
@vadimkantorov you can use this branch to run your workload. It should fix your issue.
@nicole-lihui here is the branch with the 2nd part (addressing decode): yannicks1#4
I will open the PR to vLLM upstream once this PR is merged (currently it is targeting this branch to highlight the diffs).
Awesome work! Our PRs seem complementary. I'll take inspiration from your test and check whether a concurrency issue might show up.
Signed-off-by: Yannick Schnider <[email protected]>
Hey @tdoublep, I addressed all of your feedback.
@vadimkantorov have you been able to confirm that your workload does not throw the assertion error with this branch?
Signed-off-by: Yannick Schnider <[email protected]>
tests/v1/e2e/test_context_length.py (outdated)
@pytest.mark.parametrize("model", ["JackFram/llama-160m"])
@pytest.mark.parametrize("max_model_len", [2048])
@pytest.mark.parametrize("max_tokens", [1])
def test_models(
Could we give the test a more descriptive name?
LGTM - thanks for catching this regression and adding the test
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]> Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]> Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]> Signed-off-by: Karan Goel <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Re-enable prefill at max model length
Purpose
Closes #25120.
Before #20291 it was possible to prefill the model's context to max_model_len and then request a single new token. The change in #20291 added an assertion that prevents this: it now fails when the prompt already consumes the full context and we sample one token. This PR restores the previous behavior (allowing a prefill to max_model_len and then sampling a single token), which matches the behaviour of HuggingFace Transformers.
Proposed change
Relax the assertion/check so that a single sampled token after a prefill that exactly equals max_model_len is allowed. In short: allow the runner to return one new token when the prefill already fills the model's maximum context length. This restores parity with the HuggingFace Transformers behaviour and avoids rejecting otherwise-valid generation requests that only ask for one additional token beyond a full prefill.
Test Plan
Add an end-to-end test that compares vLLM to HuggingFace Transformers: build a prompt that fills the context to exactly max_model_len, then generate with max_tokens=1 and compare the sampled output against HF. The test is parametrized to make it easy to extend to other models / lengths later; the provided version uses JackFram/llama-160m, max_model_len=2048 and max_tokens=1.
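A rough sketch of what such a parametrized test could look like (an illustrative outline, not the test added in tests/v1/e2e/test_context_length.py; the test name, prompt construction, and greedy-decoding comparison are assumptions):

import pytest
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from vllm import LLM, SamplingParams


@pytest.mark.parametrize("model", ["JackFram/llama-160m"])
@pytest.mark.parametrize("max_model_len", [2048])
@pytest.mark.parametrize("max_tokens", [1])
def test_prefill_to_max_model_len(model, max_model_len, max_tokens):
    tokenizer = AutoTokenizer.from_pretrained(model)
    # A prompt that tokenizes to exactly max_model_len tokens.
    prompt_token_ids = tokenizer.encode("hi " * max_model_len)[:max_model_len]

    # vLLM: prefill the full context, then sample greedily.
    llm = LLM(model=model, max_model_len=max_model_len)
    vllm_out = llm.generate(
        [{"prompt_token_ids": prompt_token_ids}],
        SamplingParams(temperature=0.0, max_tokens=max_tokens),
    )
    vllm_token_ids = list(vllm_out[0].outputs[0].token_ids)

    # HuggingFace reference: greedy-decode the same number of new tokens.
    hf_model = AutoModelForCausalLM.from_pretrained(model)
    input_ids = torch.tensor([prompt_token_ids])
    hf_out = hf_model.generate(input_ids, max_new_tokens=max_tokens,
                               do_sample=False)
    hf_token_ids = hf_out[0, len(prompt_token_ids):].tolist()

    assert vllm_token_ids == hf_token_ids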
Failing behavior (before this PR)
Without the change the unit test fails with the assertion raised by the runner. Example failure seen during testing:
Test result (after this PR)