Limit concurrent long partial prefills via max_long_partial_prefills #21651
base: main
Conversation
Code Review
The goal of limiting concurrent long prefills to improve TTFT for shorter requests is a valuable optimization. The implementation introduces a new mechanism to control this, and the included tests and benchmarks are helpful. I've found a critical issue in the core scheduling logic that could lead to incorrect behavior.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, it would only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
@WoosukKwon @heheda12345 can you help review this?
Force-pushed from 9baa5ae to 9d52abc
vllm/config.py (Outdated)
Would `None` be a better default to indicate that there is no maximum?
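For concreteness, the suggestion amounts to something like the following sketch (illustrative only; the real field lives in `vllm/config.py`, and the surrounding class here is heavily simplified):

```python
# Illustrative sketch of the suggested default, not the PR's actual code:
# None signals "no maximum" instead of using a sentinel integer.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfig:
    # Maximum number of long partial prefills scheduled concurrently per step.
    # None means no limit is enforced.
    max_long_partial_prefills: Optional[int] = None
```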
Fixed! PTAL
Hope this gets merged soon; this is really important for some of us and is what's keeping us on V0.
any updates?
Needs:
Force-pushed from 3178353 to a8d41c4
Force-pushed from a8d41c4 to 289340e
Signed-off-by: pansicheng <[email protected]>
Force-pushed from 289340e to c15612c
@hmellor Fixed! PTAL
Hi, any updates?
bump
I'm not the right person to review the scheduler change, @njhill could you take a look? One thing I can ask for though is to remove anything that's conditional on the V1 env var. It's now safe to assume that we are always using V1 and we do not need to accommodate V0 behaviour.
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose
To further address #14003
As mentioned in #10235, extremely long texts may cause blocking (when long texts consume the entire budget) or slowdowns (when short and long requests are batched together), which can significantly increase TTFT for short requests.
While using smaller chunk sizes helps maintain TTFT and ITL for short texts in mixed workloads, it doesn't resolve the issue of long text prefills monopolizing the budget and blocking short text prefills.
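The approach described next is instead to cap how many long prefills may be scheduled in a single step. As a minimal sketch of what such an admission check looks like (hypothetical helper, not the actual diff in this PR):

```python
# Minimal sketch of a per-step admission check for long partial prefills.
# Parameter names mirror the scheduler knobs discussed in this PR; the
# function itself is illustrative, not the PR's actual scheduler code.
def may_schedule_prefill(
    num_prompt_tokens: int,
    long_prefills_this_step: int,
    long_prefill_token_threshold: int,
    max_long_partial_prefills: int,
) -> bool:
    """Return True if this prefill may join the current scheduling step."""
    is_long = num_prompt_tokens > long_prefill_token_threshold
    if is_long and long_prefills_this_step >= max_long_partial_prefills:
        # Cap reached: defer this long prefill so budget remains for the
        # shorter requests waiting behind it in the queue.
        return False
    return True
```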
This PR leverages `max_long_partial_prefills` in V1 to limit the number of concurrent long text prefills per step, reserving capacity so that short texts can be prioritized. This optimization aims to improve P50-P90 TTFT metrics.

Test Plan
Referencing the medium dataset results from #10235:
Tested with 1,000 requests (900 small requests: <50 prompt tokens; 100 large requests: 10k–20k tokens).
Compared main branch d1fb65b against this PR.
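As a rough illustration of the request mix above (a sketch only; the actual dataset and benchmark harness follow #10235 and may differ):

```python
# Sketch of the mixed workload described above: 900 short prompts with
# fewer than 50 tokens and 100 long prompts of 10k-20k tokens, shuffled.
import random

def make_mixed_prompt_lengths(seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    short = [rng.randint(1, 49) for _ in range(900)]
    long_ = [rng.randint(10_000, 20_000) for _ in range(100)]
    lengths = short + long_
    rng.shuffle(lengths)
    return lengths
```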
Test Result
To achieve optimal throughput and TTFT, thorough tuning of both parameters (`long_prefill_token_threshold` and `max_long_partial_prefills`) is required; an illustrative configuration is sketched below.
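For example, a launch might look like the following sketch (the parameter names follow the scheduler knobs from #10235 that this PR wires into V1; the model name and values are placeholders, not the tuned settings from these benchmarks):

```python
# Hypothetical tuning sketch; values are placeholders chosen to illustrate
# the two knobs being tuned together, not recommended settings.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    # Prompts longer than this many tokens count as "long" prefills.
    long_prefill_token_threshold=2048,
    # At most this many long partial prefills are scheduled per step.
    max_long_partial_prefills=1,
)
```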
(Optional) Documentation Update