
Conversation

tdoublep
Member

@tdoublep tdoublep commented Jul 22, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • Benchmarking
  • Correctness

Purpose

Allow full CUDA graph (FCG) capture to be used for batches that contain only decode requests. This PR will close the last remaining performance gap to V0 for hybrid models.

Thank you @fhl2000 for all the work on #21367 that makes these speed-ups possible.

cc @heheda12345 @tlrmchlsmth

Testing

I added a new test to explicitly verify correctness when using FCG for mamba-only and hybrid models; a rough sketch of the idea behind it follows.
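
The sketch below is not the test code itself; it assumes vLLM's offline `LLM` API accepts a `compilation_config` dict, mirroring the `--compilation-config` flag used in the benchmark commands further down, and that the environment (VLLM_USE_V1, attention backend) is set as in those commands.

```python
# Rough sketch only (not the actual test added in this PR): run the same greedy
# prompts with and without full CUDA graph and check that the outputs match.
# Assumptions: the offline LLM API accepts a compilation_config dict, and the
# environment (VLLM_USE_V1=1, attention backend) is set as in the commands below.
from vllm import LLM, SamplingParams

MODEL = "ibm-granite/granite-4.0-tiny-preview"  # hybrid model used in the benchmarks
PROMPTS = ["Explain in one sentence how a mamba layer differs from attention."]
PARAMS = SamplingParams(temperature=0.0, max_tokens=64)

def run(compilation_config=None):
    # In practice each configuration should run in its own process so that the
    # first engine's GPU memory is released before the second one starts.
    llm = LLM(model=MODEL,
              enable_prefix_caching=False,
              compilation_config=compilation_config)
    return [out.outputs[0].text for out in llm.generate(PROMPTS, PARAMS)]

baseline = run()                             # no full CUDA graph
with_fcg = run({"full_cuda_graph": True})    # decode-only FCG path from this PR
assert baseline == with_fcg, "FCG should not change greedy outputs"
```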

Benchmarking

On main without FCG:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmark_latency.py \
	--model ibm-granite/granite-4.0-tiny-preview \
	--input-len 31500 \
	--output-len 512 \
	--batch-size 1	\
	--num-iters-warmup 3 \
	--num-iters 3 

produces:

Avg latency: 9.049272118601948 seconds
10% percentile latency: 9.034534037625416 seconds
25% percentile latency: 9.037122667999938 seconds
50% percentile latency: 9.041437051957473 seconds
75% percentile latency: 9.05750403588172 seconds
90% percentile latency: 9.06714422623627 seconds
99% percentile latency: 9.072928340448998 seconds

Using this PR with full CUDA graph:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmark_latency.py \
	--model ibm-granite/granite-4.0-tiny-preview \
	--input-len 31500 \
	--output-len 512 \
	--batch-size 1	\
	--num-iters-warmup 3 \
	--num-iters 3 \
	--compilation-config='{"full_cuda_graph": true}'

produces:

Avg latency: 3.631023456264908 seconds
10% percentile latency: 3.6187003629282115 seconds
25% percentile latency: 3.6219325389247388 seconds
50% percentile latency: 3.6273194989189506 seconds
75% percentile latency: 3.6382623949320987 seconds
90% percentile latency: 3.6448281325399874 seconds
99% percentile latency: 3.648767575104721 seconds

Huge win: roughly a 2.5x reduction in end-to-end latency! 🎆

Correctness

On main without FCG:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=ibm-granite/granite-4.0-tiny-preview,enable_prefix_caching=False --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

produces:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.576|±  |0.0221|
|     |       |strict-match    |     5|exact_match|↑  |0.552|±  |0.0223|

Using this PR with full CUDA graph (I hacked the lm_eval code to pass the compilation config, since there is no way to do it via the CLI); a programmatic alternative is sketched after the table:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.578|±  |0.0221|
|     |       |strict-match    |     5|exact_match|↑  |0.560|±  |0.0222|
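
For reference, here is a hedged sketch of driving the same eval from Python instead of patching the lm_eval CLI path. It assumes lm_eval's `simple_evaluate` accepts `model_args` as a dict and that its vllm wrapper forwards unrecognized kwargs (here `compilation_config`) on to `vllm.LLM`; if that does not hold for your lm_eval version, patching lm_eval as described above is still needed.

```python
# Hedged sketch, not what produced the numbers above. Assumes lm_eval's vllm
# wrapper forwards unknown model_args kwargs (compilation_config) to vllm.LLM;
# otherwise a small patch to lm_eval, as described above, is still required.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args={
        "pretrained": "ibm-granite/granite-4.0-tiny-preview",
        "enable_prefix_caching": False,
        "compilation_config": {"full_cuda_graph": True},  # assumption: passed through
    },
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
    limit=500,
)
print(results["results"]["gsm8k"])
```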

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the rocm (Related to AMD ROCm) and v1 labels Jul 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the CUDA graph support in attention backends to be more granular by using an enum instead of a boolean. This is a good change that allows enabling full CUDA graph for decode-only batches in hybrid models, such as those using Mamba layers. The changes are well-implemented across various backend files. I've found a minor copy-paste error in a docstring/assertion in the Mamba attention backend and a style violation in the FlashInfer backend. Overall, the changes look good and address the intended purpose.
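
To make the shape of that refactor concrete, here is an illustrative sketch; the enum and member names below are hypothetical and may not match the identifiers actually introduced in this PR.

```python
# Illustrative only: the enum and member names are hypothetical and may not
# match the identifiers used in this PR; the point is the shape of the change
# from a yes/no flag to a graded capability per attention backend.
from enum import Enum

class CudaGraphSupport(Enum):
    NEVER = 0        # backend cannot be captured in a full CUDA graph
    DECODE_ONLY = 1  # capture is safe only for batches of pure decode requests
    ALWAYS = 2       # capture is safe for any batch

# Before: a boolean forces an all-or-nothing decision per backend.
# full_cuda_graph_supported: bool = False

# After: full CUDA graph capture can be enabled for a decode-only batch even
# when the backend does not support it for batches that contain prefills.
def can_capture_full_graph(support: CudaGraphSupport, pure_decode_batch: bool) -> bool:
    if support is CudaGraphSupport.ALWAYS:
        return True
    if support is CudaGraphSupport.DECODE_ONLY:
        return pure_decode_batch
    return False
```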


mergify bot commented Jul 25, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@heheda12345
Collaborator

@tdoublep #21367 is merged :-)

Signed-off-by: Thomas Parnell <[email protected]>
@mergify mergify bot removed the needs-rebase label Aug 2, 2025
@tdoublep tdoublep marked this pull request as ready for review August 2, 2025 08:53
@tdoublep
Member Author

tdoublep commented Aug 2, 2025

Ready for review cc @heheda12345 @tlrmchlsmth @LucasWilkinson

@tdoublep
Member Author

tdoublep commented Aug 2, 2025

IMO we should consider making FCG the default for mamba-based models, since it makes such a big difference in performance. Otherwise users will continue to see a performance gap relative to V0.

Signed-off-by: Thomas Parnell <[email protected]>
@tdoublep tdoublep requested a review from DarkLight1337 as a code owner August 2, 2025 09:10

mergify bot commented Aug 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 3, 2025
Signed-off-by: Thomas Parnell <[email protected]>
@mergify mergify bot removed the needs-rebase label Aug 4, 2025
@tdoublep
Member Author

tdoublep commented Aug 7, 2025

@heheda12345 @tlrmchlsmth can I get a /ready on this one?

@DarkLight1337 DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 8, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) August 8, 2025 21:37
@tdoublep
Member Author

tdoublep commented Aug 9, 2025

The hybrid tests (including the newly added one) are passing, but a bunch of other, unrelated CI tests are failing.

I've merged in main again to see if it helps, but given the current state of CI, I think this one is ready to merge. cc @heheda12345 @DarkLight1337

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

Hmm, actually the latest hybrid test looks wrong. Let me look into it.

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

I can't reproduce it locally, but the results in CI look like V1 with FCG is producing garbage:

[2025-08-09T13:43:18Z] FAILED models/language/generation/test_hybrid.py::test_full_cuda_graph[5-64-Zyphra/Zamba2-1.2B-instruct] - AssertionError: Test5:
[2025-08-09T13:43:18Z] Matched tokens:	[13, 27332, 22478, 28705, 28750, 28747, 13, 3278, 12804, 272, 3905, 302, 18278, 10895, 297, 5516, 288, 272]
[2025-08-09T13:43:18Z] hf:	'\n### Question 2:\nDiscuss the role of artificial intelligence in transforming the healthcare industry.\n\n### Question 3:\nExplain the significance of blockchain technology in enhancing supply chain management.\n\n### Question 4:\nAnalyze the potential of renewable energy sources in reducing carbon'	{15240: -2.046424388885498, 11603: -2.421424388885498, 3437: -3.171424388885498, 5593: -3.421424388885498, 4607: -3.421424388885498}
[2025-08-09T13:43:18Z] vllm-v1:	'\n### Question 2:\nDiscuss the role of artificial intelligence in transforming the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future'	{3437: Logprob(logprob=-2.817570447921753, rank=1, decoded_token='future'), 3526: Logprob(logprob=-2.942570447921753, rank=2, decoded_token='global'), 1526: Logprob(logprob=-3.192570447921753, rank=3, decoded_token='world'), 771: Logprob(logprob=-3.192570447921753, rank=4, decoded_token='work'), 1955: Logprob(logprob=-3.442570447921753, rank=5, decoded_token='business')}

I will see if I can reproduce it within the same Docker container that is used in CI.

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

I can't reproduce this exact failure but I can break it in other ways. It looks like there is indeed a bug. Please don't merge until I fix it.

Signed-off-by: Thomas Parnell <[email protected]>
auto-merge was automatically disabled August 9, 2025 16:13

Head branch was pushed to by a user without write access

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

I have fixed the bug. I think the remaining failures are due to other known CI problems. Please take another look, but I think it can now be merged safely. @heheda12345 @DarkLight1337

@vllm-bot vllm-bot merged commit 61f67d8 into vllm-project:main Aug 10, 2025
34 of 41 checks passed
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025