[CORE] concurrent partial prefills #2356
Conversation
Implement Concurrent Partial Prefills

Signed-off-by: Csrayz <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a mechanism to limit concurrent partial prefills for long prompts, which is a great feature for improving TTFT in mixed-load scenarios. The implementation looks mostly correct, but I've found a few issues:

- A critical bug in the scheduler logic where the budget for long prompts is incorrectly consumed by all prefill requests.
- A potential crash in the configuration logic if `max_model_len` is set to a small value.
- A documentation error for one of the new configuration parameters.

I've provided detailed comments and suggestions for each of these points. Once these are addressed, the PR should be in good shape.
if self.long_prefill_token_threshold is None:
    self.long_prefill_token_threshold = \
        int(self.max_model_len * 0.04)
The default value calculation for `long_prefill_token_threshold` can result in `0` if `self.max_model_len` is small (less than 25). This will cause the assertion `self.long_prefill_token_threshold > 0` on line 72 to fail, leading to a crash on startup. To prevent this, ensure the calculated default value is at least 1.
if self.long_prefill_token_threshold is None:
    self.long_prefill_token_threshold = max(1, int(self.max_model_len * 0.04))
# Update request info.
num_scheduled_tokens[request.request_id] = num_new_tokens
token_budget -= num_new_tokens
long_prefill_budget -= 1
The `long_prefill_budget` is intended to limit the number of concurrent long prompt prefills. However, it is currently decremented for every scheduled prefill request, regardless of its length. This will cause the budget to be consumed by short requests, incorrectly preventing subsequent long requests from being scheduled. The budget should only be decremented for long requests.
- long_prefill_budget -= 1
+ if num_new_tokens > self.vllm_config.scheduler_config.long_prefill_token_threshold:
+     long_prefill_budget -= 1
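To make the intended behavior concrete, here is a minimal, self-contained sketch of the gating logic described above. It is not the actual vLLM scheduler code: the `Request` class and `schedule_prefills` helper are hypothetical, and only the threshold/budget names mirror the snippet under review.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_prompt_tokens: int


def schedule_prefills(waiting: list[Request],
                      token_budget: int,
                      long_prefill_token_threshold: int,
                      max_long_partial_prefills: int) -> dict[str, int]:
    """Illustrative only: cap how many long prompts (above the threshold)
    are prefilled concurrently; short prompts are unaffected."""
    num_scheduled_tokens: dict[str, int] = {}
    long_prefill_budget = max_long_partial_prefills
    for request in waiting:
        num_new_tokens = min(request.num_prompt_tokens, token_budget)
        if num_new_tokens <= 0:
            break
        is_long = num_new_tokens > long_prefill_token_threshold
        # A long request is skipped for this step once the long-prefill
        # budget is exhausted; short requests are still scheduled.
        if is_long and long_prefill_budget <= 0:
            continue
        num_scheduled_tokens[request.request_id] = num_new_tokens
        token_budget -= num_new_tokens
        # Only long requests consume the long-prefill budget, which is
        # the fix proposed in the suggestion above.
        if is_long:
            long_prefill_budget -= 1
    return num_scheduled_tokens


# Example: with a 1000-token threshold and max_long_partial_prefills=1,
# only one of the two long prompts is scheduled in this step.
print(schedule_prefills(
    [Request("a", 50), Request("b", 5000), Request("c", 5000)],
    token_budget=22000,
    long_prefill_token_threshold=1000,
    max_long_partial_prefills=1,
))  # {'a': 50, 'b': 5000}
```

With this gating, a stream of short prompts can no longer exhaust the long-prefill budget.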
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine |
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | the maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
| `long_prefill_token_threshold` | Union[int, float] | `False` | a request is considered long if the prompt is longer than this number of tokens. |
The default value for `long_prefill_token_threshold` is documented as `False`, which is inconsistent with its type `Union[int, float]` and the implementation. In the code, its default is `None`, and it's then either set to `float('inf')` or calculated based on `max_model_len` if `max_long_partial_prefills` is set. Using `False` here is misleading for users. Please update the default value to `None` to align with the implementation.
- | `long_prefill_token_threshold` | Union[int, float] | `False` | a request is considered long if the prompt is longer than this number of tokens. |
+ | `long_prefill_token_threshold` | Union[int, float] | `None` | a request is considered long if the prompt is longer than this number of tokens. |
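For context, here is a rough sketch of how these options would be supplied, assuming the `additional_config` / `ascend_scheduler_config` mechanism that vllm-ascend uses for scheduler options; the specific threshold value is an arbitrary placeholder, and the exact keys and defaults should be checked against the table above and the final docs.

```python
from vllm import LLM

# Illustrative only: enable the Ascend scheduler and cap concurrent
# long-prompt prefills. Keys mirror the documented options; treat this
# as a sketch rather than the canonical API.
llm = LLM(
    model="/model/Qwen3-8B/",
    max_num_batched_tokens=22000,
    additional_config={
        "ascend_scheduler_config": {
            "enabled": True,
            "max_long_partial_prefills": 1,
            "long_prefill_token_threshold": 4096,  # placeholder value
        },
    },
)
```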
What this PR does / why we need it?
When processing a mix of large and small requests, this change significantly reduces the TTFT of responses. Please refer to vllm-project/vllm#10235, which achieves the same effect by simply limiting the number of concurrent prefills for long prompts. This solution can be applied to both the AscendScheduler (V0) and the vLLM Scheduler (V1). Tests show that TTFT can be significantly improved when handling such mixed requests.
This benchmark used the Qwen3-8B model, with a context length of 128K, running on a single card.
Regarding dataset selection, the sharegpt_clean dataset was used, with its content concatenated and cropped. Small requests with token=50 and medium requests with token=10240 were constructed (there were also large requests with token=102400, but these were ignored because, when using the Prefill First scheduling strategy, max_num_batched_tokens will not be set to such a large value). When loading vLLM, max_num_batched_tokens was set to 22000. This length can accommodate two medium-sized requests plus some short requests, reflecting an extreme scenario where the budget is almost entirely occupied by longer requests.
Next, we mix 990 small requests and 100 medium requests into one load scenario (hereinafter referred to as the 10% scenario), and similarly generate scenarios with 5% and 1% medium requests.
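As a rough illustration of the dataset construction described above (the paths, field names, and output format here are assumptions, not the exact script used for this benchmark):

```python
import json
import random

from transformers import AutoTokenizer

# Illustrative only: build fixed-length prompts by concatenating and
# cropping ShareGPT text, then mix small and medium requests at a given
# ratio. Cropping by token ids is approximate after decoding.
tokenizer = AutoTokenizer.from_pretrained("/model/Qwen3-8B/")


def make_prompt(text: str, num_tokens: int) -> str:
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return tokenizer.decode(token_ids[:num_tokens])


with open("sharegpt_clean.txt") as f:          # assumed concatenated corpus
    corpus = f.read()

small = make_prompt(corpus, 50)      # "small" requests
medium = make_prompt(corpus, 10240)  # "medium" requests

# 10% scenario: 990 small + 100 medium requests, shuffled together.
requests = [small] * 990 + [medium] * 100
random.shuffle(requests)

with open("mixed_10pct.jsonl", "w") as f:      # assumed output format
    for prompt in requests:
        f.write(json.dumps({"prompt": prompt}) + "\n")
```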
Performance tests were conducted separately with the vLLM Scheduler, the AscendScheduler, and the AscendScheduler with long prompt concurrency set to 1. The benchmark results are as follows.
python benchmarks/benchmark_serving.py \
    --host "xx" \
    --port 80 \
    --model /model/Qwen3-8B/ \
    --dataset-name "custom" \
    --dataset-path ${test_case} \
    --metric-percentiles 50,80,85,90,95,99 \
    --max-concurrency 40