
Conversation

tdoublep
Member

@tdoublep tdoublep commented Jul 22, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • Benchmarking
  • Correctness

Purpose

Allow full CUDA graph (FCG) capture to be used for batches that contain only decode requests. This PR will close the last remaining performance gap to V0 for hybrid models.

Thank you @fhl2000 for all the work on #21367 that makes these speed-ups possible.

cc @heheda12345 @tlrmchlsmth

Testing

I added a new test to explicitly verify correctness when using FCG for mamba-only and hybrid models; a rough sketch of the idea behind it follows.
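
The sketch below is not the test code itself; it assumes vLLM's offline `LLM` API accepts a `compilation_config` dict, mirroring the `--compilation-config` flag used in the benchmark commands further down, and that the environment (VLLM_USE_V1, attention backend) is set as in those commands.

```python
# Rough sketch only (not the actual test added in this PR): run the same greedy
# prompts with and without full CUDA graph and check that the outputs match.
# Assumptions: the offline LLM API accepts a compilation_config dict, and the
# environment (VLLM_USE_V1=1, attention backend) is set as in the commands below.
from vllm import LLM, SamplingParams

MODEL = "ibm-granite/granite-4.0-tiny-preview"  # hybrid model used in the benchmarks
PROMPTS = ["Explain in one sentence how a mamba layer differs from attention."]
PARAMS = SamplingParams(temperature=0.0, max_tokens=64)

def run(compilation_config=None):
    # In practice each configuration should run in its own process so that the
    # first engine's GPU memory is released before the second one starts.
    llm = LLM(model=MODEL,
              enable_prefix_caching=False,
              compilation_config=compilation_config)
    return [out.outputs[0].text for out in llm.generate(PROMPTS, PARAMS)]

baseline = run()                             # no full CUDA graph
with_fcg = run({"full_cuda_graph": True})    # decode-only FCG path from this PR
assert baseline == with_fcg, "FCG should not change greedy outputs"
```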

Benchmarking

On main without FCG:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmark_latency.py \
	--model ibm-granite/granite-4.0-tiny-preview \
	--input-len 31500 \
	--output-len 512 \
	--batch-size 1	\
	--num-iters-warmup 3 \
	--num-iters 3 

produces:

Avg latency: 9.049272118601948 seconds
10% percentile latency: 9.034534037625416 seconds
25% percentile latency: 9.037122667999938 seconds
50% percentile latency: 9.041437051957473 seconds
75% percentile latency: 9.05750403588172 seconds
90% percentile latency: 9.06714422623627 seconds
99% percentile latency: 9.072928340448998 seconds

Using this PR with full CUDA graph:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmark_latency.py \
	--model ibm-granite/granite-4.0-tiny-preview \
	--input-len 31500 \
	--output-len 512 \
	--batch-size 1	\
	--num-iters-warmup 3 \
	--num-iters 3 \
	--compilation-config='{"full_cuda_graph": true}'

produces:

Avg latency: 3.631023456264908 seconds
10% percentile latency: 3.6187003629282115 seconds
25% percentile latency: 3.6219325389247388 seconds
50% percentile latency: 3.6273194989189506 seconds
75% percentile latency: 3.6382623949320987 seconds
90% percentile latency: 3.6448281325399874 seconds
99% percentile latency: 3.648767575104721 seconds

Huge win: roughly a 2.5x reduction in end-to-end latency! 🎆

Correctness

On main without FCG:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=ibm-granite/granite-4.0-tiny-preview,enable_prefix_caching=False --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

produces:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.576|±  |0.0221|
|     |       |strict-match    |     5|exact_match|↑  |0.552|±  |0.0223|

Using this PR with full CUDA graph (I hacked the lm_eval code to pass the compilation config, since there is no way to do it via the CLI); a programmatic alternative is sketched after the table:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.578|±  |0.0221|
|     |       |strict-match    |     5|exact_match|↑  |0.560|±  |0.0222|
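
For reference, here is a hedged sketch of driving the same eval from Python instead of patching the lm_eval CLI path. It assumes lm_eval's `simple_evaluate` accepts `model_args` as a dict and that its vllm wrapper forwards unrecognized kwargs (here `compilation_config`) on to `vllm.LLM`; if that does not hold for your lm_eval version, patching lm_eval as described above is still needed.

```python
# Hedged sketch, not what produced the numbers above. Assumes lm_eval's vllm
# wrapper forwards unknown model_args kwargs (compilation_config) to vllm.LLM;
# otherwise a small patch to lm_eval, as described above, is still required.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args={
        "pretrained": "ibm-granite/granite-4.0-tiny-preview",
        "enable_prefix_caching": False,
        "compilation_config": {"full_cuda_graph": True},  # assumption: passed through
    },
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
    limit=500,
)
print(results["results"]["gsm8k"])
```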

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the rocm (Related to AMD ROCm) and v1 labels Jul 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the CUDA graph support in attention backends to be more granular by using an enum instead of a boolean. This is a good change that allows enabling full CUDA graph for decode-only batches in hybrid models, such as those using Mamba layers. The changes are well-implemented across various backend files. I've found a minor copy-paste error in a docstring/assertion in the Mamba attention backend and a style violation in the FlashInfer backend. Overall, the changes look good and address the intended purpose.
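
To make the shape of that refactor concrete, here is an illustrative sketch; the enum and member names below are hypothetical and may not match the identifiers actually introduced in this PR.

```python
# Illustrative only: the enum and member names are hypothetical and may not
# match the identifiers used in this PR; the point is the shape of the change
# from a yes/no flag to a graded capability per attention backend.
from enum import Enum

class CudaGraphSupport(Enum):
    NEVER = 0        # backend cannot be captured in a full CUDA graph
    DECODE_ONLY = 1  # capture is safe only for batches of pure decode requests
    ALWAYS = 2       # capture is safe for any batch

# Before: a boolean forces an all-or-nothing decision per backend.
# full_cuda_graph_supported: bool = False

# After: full CUDA graph capture can be enabled for a decode-only batch even
# when the backend does not support it for batches that contain prefills.
def can_capture_full_graph(support: CudaGraphSupport, pure_decode_batch: bool) -> bool:
    if support is CudaGraphSupport.ALWAYS:
        return True
    if support is CudaGraphSupport.DECODE_ONLY:
        return pure_decode_batch
    return False
```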


mergify bot commented Jul 25, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@heheda12345
Collaborator

@tdoublep #21367 is merged :-)

Signed-off-by: Thomas Parnell <[email protected]>
@mergify mergify bot removed the needs-rebase label Aug 2, 2025
@tdoublep tdoublep marked this pull request as ready for review August 2, 2025 08:53
@tdoublep
Member Author

tdoublep commented Aug 2, 2025

Ready for review cc @heheda12345 @tlrmchlsmth @LucasWilkinson

@tdoublep
Member Author

tdoublep commented Aug 2, 2025

IMO we should consider making FCG the default for mamba-based models, since it makes such a big difference in performance. Otherwise users will continue to see a performance gap relative to V0.

Signed-off-by: Thomas Parnell <[email protected]>
@tdoublep tdoublep requested a review from DarkLight1337 as a code owner August 2, 2025 09:10

mergify bot commented Aug 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 3, 2025
Signed-off-by: Thomas Parnell <[email protected]>
@mergify mergify bot removed the needs-rebase label Aug 4, 2025
@tdoublep
Member Author

tdoublep commented Aug 7, 2025

@heheda12345 @tlrmchlsmth can I get a /ready on this one?

@DarkLight1337 DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 8, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) August 8, 2025 21:37
@tdoublep
Member Author

tdoublep commented Aug 9, 2025

The hybrid tests (including the newly added one) are passing, but a bunch of other, unrelated CI tests are failing.

I've merged in main again to see if it helps, but given the current state of CI, I think this one is ready to merge. cc @heheda12345 @DarkLight1337

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

Hmm, actually the latest hybrid test looks wrong. Let me look into it.

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

I can't reproduce it locally, but the results in CI look like V1 with FCG is producing garbage:

[2025-08-09T13:43:18Z] FAILED models/language/generation/test_hybrid.py::test_full_cuda_graph[5-64-Zyphra/Zamba2-1.2B-instruct] - AssertionError: Test5:
[2025-08-09T13:43:18Z] Matched tokens:	[13, 27332, 22478, 28705, 28750, 28747, 13, 3278, 12804, 272, 3905, 302, 18278, 10895, 297, 5516, 288, 272]
[2025-08-09T13:43:18Z] hf:	'\n### Question 2:\nDiscuss the role of artificial intelligence in transforming the healthcare industry.\n\n### Question 3:\nExplain the significance of blockchain technology in enhancing supply chain management.\n\n### Question 4:\nAnalyze the potential of renewable energy sources in reducing carbon'	{15240: -2.046424388885498, 11603: -2.421424388885498, 3437: -3.171424388885498, 5593: -3.421424388885498, 4607: -3.421424388885498}
[2025-08-09T13:43:18Z] vllm-v1:	'\n### Question 2:\nDiscuss the role of artificial intelligence in transforming the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future of the future'	{3437: Logprob(logprob=-2.817570447921753, rank=1, decoded_token='future'), 3526: Logprob(logprob=-2.942570447921753, rank=2, decoded_token='global'), 1526: Logprob(logprob=-3.192570447921753, rank=3, decoded_token='world'), 771: Logprob(logprob=-3.192570447921753, rank=4, decoded_token='work'), 1955: Logprob(logprob=-3.442570447921753, rank=5, decoded_token='business')}

I will see if I can reproduce it within the same Docker container that is used in CI.

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

I can't reproduce this exact failure but I can break it in other ways. It looks like there is indeed a bug. Please don't merge until I fix it.

Signed-off-by: Thomas Parnell <[email protected]>
auto-merge was automatically disabled August 9, 2025 16:13

Head branch was pushed to by a user without write access

@tdoublep
Member Author

tdoublep commented Aug 9, 2025

I have fixed the bug. I think the remaining failures are due to other known CI problems. Please take another look, but I think it can now be merged safely. @heheda12345 @DarkLight1337

@vllm-bot vllm-bot merged commit 61f67d8 into vllm-project:main Aug 10, 2025
34 of 41 checks passed
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025