[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers #21401
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request refactors the CUDA graph support in attention backends to be more granular by using an enum instead of a boolean. This is a good change that allows enabling full CUDA graph for decode-only batches in hybrid models, such as those using Mamba layers. The changes are well-implemented across various backend files. I've found a minor copy-paste error in a docstring/assertion in the Mamba attention backend and a style violation in the FlashInfer backend. Overall, the changes look good and address the intended purpose.
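For readers following the review comment above, here is a minimal sketch of what replacing the boolean with a tri-state enum could look like; the actual enum name, members, and attribute names in the PR may differ:

```python
from enum import Enum


class AttentionCGSupport(Enum):
    """How much CUDA-graph capture an attention backend claims to support."""
    NEVER = 0              # never safe to capture this backend in a CUDA graph
    PURE_DECODE_ONLY = 1   # safe only when every request in the batch is a decode
    ALWAYS = 2             # safe for prefill, decode, and mixed batches


class MambaAttentionMetadataBuilderSketch:
    # A backend would advertise its level instead of a True/False flag.
    attn_cudagraph_support = AttentionCGSupport.PURE_DECODE_ONLY
```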
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Thomas Parnell <[email protected]>
Ready for review cc @heheda12345 @tlrmchlsmth @LucasWilkinson
IMO we should consider making FCG the default for mamba-based models since it makes such a difference in perf. Otherwise users will continue to see a perf gap relative to V0.
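For context, opting in currently looks roughly like the following (a hedged sketch; the full_cuda_graph field and the model name are assumptions for the vLLM version at the time and may have changed since):

```python
from vllm import LLM
from vllm.config import CompilationConfig

# Assumed opt-in flag; newer vLLM versions may expose this differently.
llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # placeholder hybrid Mamba model
    compilation_config=CompilationConfig(full_cuda_graph=True),
)
```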
Signed-off-by: Thomas Parnell <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
@heheda12345 @tlrmchlsmth can I get a /ready on this one?
The hybrid tests (including the newly-added one) are passing, but a bunch of other unrelated CI tests are failing. I've merged in main again to see if it helps, but given the current state of CI, I think this one is ready to merge. cc @heheda12345 @DarkLight1337
Hmm, actually the latest hybrid test looks wrong. Let me look into it.
I can't reproduce locally but the results in CI look like V1 with FCG is producing garbage:
I will see if I can reproduce within same docker container that is used in CI. |
I can't reproduce this exact failure but I can break it in other ways. It looks like there is indeed a bug. Please don't merge until I fix it.
Signed-off-by: Thomas Parnell <[email protected]>
Head branch was pushed to by a user without write access
I have fixed the bug. I think the remaining failures are due to other known CI problems. Please take another look, but I think it can now be merged safely. @heheda12345 @DarkLight1337
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
Purpose
Allow full CUDA graph (FCG) to be used for batches that contain only decode requests. This PR will close the last remaining performance gap relative to V0 for hybrid models.
Thank you @fhl2000 for all the work on #21367 that makes these speed-ups possible.
cc @heheda12345 @tlrmchlsmth
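As a rough illustration of the decode-only restriction described above (not the PR's actual code), the runner replays the fully captured graph only when every scheduled request contributes a single decode token:

```python
def can_run_full_cuda_graph(num_scheduled_tokens: list[int]) -> bool:
    """Illustrative check: a pure-decode batch schedules exactly one token per
    request; anything else falls back to the piecewise/eager path."""
    return len(num_scheduled_tokens) > 0 and all(
        n == 1 for n in num_scheduled_tokens)
```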
Testing
I added a new test to explicitly verify correctness when using FCG for mamba-only and hybrid models.
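The actual test lives in the vLLM test suite; the sketch below only shows the rough shape of such a check (model name, parameters, and structure are illustrative, not the added test). In practice the two engines would be created and torn down sequentially, or in separate processes, to fit in GPU memory.

```python
import pytest
from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

PROMPTS = ["The capital of France is", "Mamba selective state spaces are"]


@pytest.mark.parametrize("model", ["ibm-ai-platform/Bamba-9B"])  # placeholder
def test_full_cuda_graph_matches_baseline(model: str):
    params = SamplingParams(temperature=0.0, max_tokens=32)

    # Greedy outputs without full CUDA graph (baseline) ...
    baseline = LLM(model=model).generate(PROMPTS, params)
    # ... and with full CUDA graph enabled (assumed opt-in flag).
    fcg = LLM(
        model=model,
        compilation_config=CompilationConfig(full_cuda_graph=True),
    ).generate(PROMPTS, params)

    for ref, out in zip(baseline, fcg):
        assert ref.outputs[0].text == out.outputs[0].text
```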
Benchmarking
On main without FCG:
produces:
Using this PR with full CUDA graph:
produces:
Huge win! 🎆
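The exact benchmark commands and numbers were not captured above; as a hedged illustration only, a decode-throughput comparison along these lines could be scripted as follows (model and parameters are placeholders):

```python
import time

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # placeholder hybrid model
    compilation_config=CompilationConfig(full_cuda_graph=True),  # omit for baseline
)
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Summarize the history of GPUs."] * 32, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s")
```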
Correctness
On main without FCG:
produces:
Using this PR with full CUDA graph (I hacked the lm_eval code to pass the compilation config, since there is no way to do it via the CLI):
(Optional) Documentation Update