Limit concurrent long partial prefills via max_long_partial_prefills #21651
base: main
Conversation
Code Review
The goal of limiting concurrent long prefills to improve TTFT for shorter requests is a valuable optimization. The implementation introduces a new mechanism to control this, and the included tests and benchmarks are helpful. I've found a critical issue in the core scheduling logic that could lead to incorrect behavior.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, it would only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
@WoosukKwon @heheda12345 can you help review this?
Force-pushed from 9baa5ae to 9d52abc
vllm/config.py (Outdated)
Would `None` be a better default to indicate that there is no maximum?
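For concreteness, the suggestion amounts to something like the following sketch (illustrative only; the real field lives in `vllm/config.py`, and the surrounding class here is heavily simplified):

```python
# Illustrative sketch of the suggested default, not the PR's actual code:
# None signals "no maximum" instead of using a sentinel integer.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfig:
    # Maximum number of long partial prefills scheduled concurrently per step.
    # None means no limit is enforced.
    max_long_partial_prefills: Optional[int] = None
```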
Fixed! PTAL
Hope this gets merged soon; this is really important for some of us and is what's keeping us on V0.
any updates?
Needs:
Force-pushed from 3178353 to a8d41c4
Force-pushed from a8d41c4 to 289340e
Signed-off-by: pansicheng <[email protected]>
Force-pushed from 289340e to c15612c
@hmellor Fixed! PTAL
Hi, any updates?
bump
I'm not the right person to review the scheduler change, @njhill could you take a look? One thing I can ask for though is to remove anything that's conditional on the V1 env var. It's now safe to assume that we are always using V1 and we do not need to accommodate V0 behaviour.
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose
To further address #14003
As mentioned in #10235, extremely long texts may cause blocking (when long texts consume the entire budget) or slowdowns (when short and long requests are batched together), which can significantly increase TTFT for short requests.
While using smaller chunk sizes helps maintain TTFT and ITL for short texts in mixed workloads, it doesn't resolve the issue of long text prefills monopolizing the budget and blocking short text prefills.
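The approach described next is instead to cap how many long prefills may be scheduled in a single step. As a minimal sketch of what such an admission check looks like (hypothetical helper, not the actual diff in this PR):

```python
# Minimal sketch of a per-step admission check for long partial prefills.
# Parameter names mirror the scheduler knobs discussed in this PR; the
# function itself is illustrative, not the PR's actual scheduler code.
def may_schedule_prefill(
    num_prompt_tokens: int,
    long_prefills_this_step: int,
    long_prefill_token_threshold: int,
    max_long_partial_prefills: int,
) -> bool:
    """Return True if this prefill may join the current scheduling step."""
    is_long = num_prompt_tokens > long_prefill_token_threshold
    if is_long and long_prefills_this_step >= max_long_partial_prefills:
        # Cap reached: defer this long prefill so budget remains for the
        # shorter requests waiting behind it in the queue.
        return False
    return True
```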
This PR leverages `max_long_partial_prefills` in V1 to limit the number of concurrent long text prefills per step, reserving capacity so that short texts can be prioritized. This optimization aims to improve P50-P90 TTFT metrics.

Test Plan
Referencing the medium dataset results from #10235:
Tested with 1,000 requests (900 small requests: <50 prompt tokens; 100 large requests: 10k–20k tokens).
Compared main branch d1fb65b against this PR.
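As a rough illustration of the request mix above (a sketch only; the actual dataset and benchmark harness follow #10235 and may differ):

```python
# Sketch of the mixed workload described above: 900 short prompts with
# fewer than 50 tokens and 100 long prompts of 10k-20k tokens, shuffled.
import random

def make_mixed_prompt_lengths(seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    short = [rng.randint(1, 49) for _ in range(900)]
    long_ = [rng.randint(10_000, 20_000) for _ in range(100)]
    lengths = short + long_
    rng.shuffle(lengths)
    return lengths
```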
Test Result
To achieve optimal throughput and TTFT, thorough tuning of both parameters (`long_prefill_token_threshold` and `max_long_partial_prefills`) is required; an illustrative configuration is sketched below.
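For example, a launch might look like the following sketch (the parameter names follow the scheduler knobs from #10235 that this PR wires into V1; the model name and values are placeholders, not the tuned settings from these benchmarks):

```python
# Hypothetical tuning sketch; values are placeholders chosen to illustrate
# the two knobs being tuned together, not recommended settings.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    # Prompts longer than this many tokens count as "long" prefills.
    long_prefill_token_threshold=2048,
    # At most this many long partial prefills are scheduled per step.
    max_long_partial_prefills=1,
)
```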
(Optional) Documentation Update