
Conversation

pansicheng
Contributor

@pansicheng pansicheng commented Jul 26, 2025


Purpose

To further address #14003

As mentioned in #10235, extremely long prompts can cause blocking (when a single long prefill consumes the entire token budget) or slowdowns (when short and long requests are batched together), both of which can significantly increase TTFT for short requests.

While using smaller chunk sizes helps maintain TTFT and ITL for short texts in mixed workloads, it doesn't resolve the issue of long text prefills monopolizing the budget and blocking short text prefills.

This PR leverages max_long_partial_prefills in V1 to limit the number of concurrent long text prefills per step, reserving capacity for short texts to be prioritized. This optimization aims to improve P50-P90 TTFT metrics.
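For illustration, the scheduling idea could be sketched as follows. This is a minimal sketch, not the actual vLLM V1 scheduler code; `Request`, `schedule_step`, and the budget handling are simplified stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int

def schedule_step(
    waiting: list[Request],
    token_budget: int,
    long_prefill_token_threshold: int,
    max_long_partial_prefills: int,
) -> list[Request]:
    """Pick requests for one step, capping concurrent long prefills."""
    scheduled: list[Request] = []
    num_long = 0
    for req in waiting:
        if token_budget == 0:
            break
        is_long = req.num_prompt_tokens >= long_prefill_token_threshold
        if is_long and num_long >= max_long_partial_prefills:
            # Defer this long prefill; short requests behind it in the
            # queue can still use the remaining budget this step.
            continue
        # A long request may receive only part of the budget (a partial
        # prefill); the rest of its prompt is processed in later steps.
        token_budget -= min(req.num_prompt_tokens, token_budget)
        scheduled.append(req)
        num_long += int(is_long)
    return scheduled
```

The key point is that once `max_long_partial_prefills` long prefills are in flight for a step, further long requests are skipped rather than allowed to exhaust the budget.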

Test Plan

Referencing the medium dataset results from #10235:

Tested with 1,000 requests (900 small requests: <50 prompt tokens; 100 large requests: 10k–20k tokens).
Compared main branch d1fb65b against this PR.

```bash
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_USE_V1=1 \
vllm serve /data/models/Meta-Llama-3.1-8B-Instruct --disable-log-requests \
    --long-prefill-token-threshold $THRESHOLD \
    --max-long-partial-prefills $MAX_LONG_PREFILLS
```

```bash
python3 benchmarks/benchmark_serving.py --model /data/models/Meta-Llama-3.1-8B-Instruct --dataset-name custom \
    --dataset-path /vllm-workspace/benchmarks/medium.jsonl --metric-percentiles 50,80,85,90,95,99 --request-rate 12
```

Test Result

[image: benchmark results comparison]

To achieve optimal throughput and TTFT, both parameters require careful tuning (a hypothetical sweep sketch follows this list):

  • long_prefill_token_threshold: if set too high, few prefills are classified as long, which limits the TTFT improvement for short requests.
  • max_long_partial_prefills: if set too low, compute may be underutilized and long requests may queue up behind the limit.
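As a starting point for that tuning, a sweep over both knobs could be generated like this (hypothetical helper; the values are illustrative, and server readiness/teardown handling is left to the operator):

```python
# Hypothetical sweep generator; prints one server/benchmark command pair
# per parameter combination, mirroring the Test Plan commands above.
import itertools

MODEL = "/data/models/Meta-Llama-3.1-8B-Instruct"

for threshold, max_long in itertools.product([2048, 4096, 8192], [1, 2, 4]):
    serve = (
        f"vllm serve {MODEL} --disable-log-requests "
        f"--long-prefill-token-threshold {threshold} "
        f"--max-long-partial-prefills {max_long}"
    )
    bench = (
        f"python3 benchmarks/benchmark_serving.py --model {MODEL} "
        f"--dataset-name custom "
        f"--dataset-path /vllm-workspace/benchmarks/medium.jsonl "
        f"--metric-percentiles 50,80,85,90,95,99 --request-rate 12"
    )
    # Start the server, wait for readiness, run the benchmark, record
    # TTFT/throughput, then shut down before the next combination.
    print(serve)
    print(bench)
```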


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The goal of limiting concurrent long prefills to improve TTFT for shorter requests is a valuable optimization. The implementation introduces a new mechanism to control this, and the included tests and benchmarks are helpful. I've found a critical issue in the core scheduling logic that could lead to incorrect behavior.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337
Member

@WoosukKwon @heheda12345 can you help review this?

@pansicheng pansicheng force-pushed the max_long_partial_prefills branch from 9baa5ae to 9d52abc on July 29, 2025 04:56
vllm/config.py Outdated
Member


Would None be a better default to indicate that there is no maximum?
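For context, the suggestion amounts to something like the following (abridged sketch of the config field, not the PR's exact diff):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfig:  # abridged; the real class has many more fields
    # None signals "no limit on concurrent long partial prefills",
    # avoiding a magic sentinel integer.
    max_long_partial_prefills: Optional[int] = None
```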

Contributor Author


Fixed! PTAL

@Ithanil
Contributor

Ithanil commented Aug 8, 2025

Hope this gets merged soon; this is really important for some users and is what keeps us on V0.

@Csrayz
Contributor

Csrayz commented Aug 11, 2025

Any updates?

@hmellor
Member

hmellor commented Sep 1, 2025

Needs:

  • merge from main
  • pre-commit fixed

@pansicheng pansicheng force-pushed the max_long_partial_prefills branch from 289340e to c15612c on September 2, 2025 09:12
@pansicheng pansicheng requested a review from hmellor September 2, 2025 11:06
@pansicheng
Contributor Author

Needs:

  • merge from main
  • pre-commit fixed

@hmellor Fixed! PTAL

@eransh777

Hi, any updates?

@Ithanil
Contributor

Ithanil commented Oct 4, 2025

bump

@hmellor
Member

hmellor commented Oct 8, 2025

I'm not the right person to review the scheduler change, @njhill could you take a look?

One thing I can ask for though is to remove anything that's conditional on the V1 env var. It's now safe to assume that we are always using V1 and we do not need to accommodate V0 behaviour.
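For illustration, the kind of cleanup being requested might look like this (hypothetical names; not taken from the PR's actual diff):

```python
# Hypothetical before/after for dropping a V0/V1 guard.

# Before: behaviour gated on the engine version.
#
#   import vllm.envs as envs
#   if envs.VLLM_USE_V1:
#       limit = scheduler_config.max_long_partial_prefills
#   else:
#       limit = None  # accommodate V0
#
# After: V1 is always used, so the guard is dropped entirely.
def long_prefill_limit(scheduler_config) -> "int | None":
    return scheduler_config.max_long_partial_prefills
```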
