
Conversation

@alexm-redhat (Collaborator) commented Jun 24, 2025

This PR adds FlashInfer ragged prefill support to the MLA backend on B200 GPUs. It depends on this FlashInfer fix from NVIDIA: flashinfer-ai/flashinfer#1198

Here are performance results for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct on a single B200 with 10000/100 prompt/output token lengths at various batch sizes. We see a 20-25% improvement in TTFT and a 15-20% improvement in TPOT.

[Image: TTFT and TPOT benchmark results across batch sizes]
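For a quick, rough sanity check of prefill-heavy latency offline, here is a minimal sketch using vLLM's public Python API. The model name matches the benchmark above, but the prompt construction, token counts, and timing are illustrative assumptions; this is not the serving benchmark that produced the TTFT/TPOT numbers in the table.

```python
import time

from vllm import LLM, SamplingParams

# Rough offline proxy for a prefill-heavy workload: one long prompt, few output tokens.
# NOT the serving benchmark used for the TTFT/TPOT numbers above.
llm = LLM(model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
          trust_remote_code=True,
          max_model_len=16384)  # leave headroom for a ~10k-token prompt

long_prompt = "def fibonacci(n):\n    pass\n" * 1500  # roughly 10k tokens (very approximate)
params = SamplingParams(temperature=0.0, max_tokens=100)

start = time.perf_counter()
outputs = llm.generate([long_prompt], params)
elapsed = time.perf_counter() - start

print(f"end-to-end latency for 1 request: {elapsed:.2f} s")
print(outputs[0].outputs[0].text[:200])
```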

github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment

Summary of Changes

Hello @alexm-redhat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the initial integration of FlashInfer's ragged prefill functionality into the MLA backend. The aim is to leverage FlashInfer for prefill operations, with current benchmarks indicating a performance regression compared to the existing FlashAttention 2 implementation. The changes primarily involve setting up FlashInfer-specific metadata, managing its prefill wrappers, and adapting the attention computation logic to utilize the new backend, along with an updated example for testing.

Highlights

  • FlashInfer Ragged Prefill Integration: I've integrated FlashInfer's BatchPrefillWithRaggedKVCacheWrapper to handle ragged prefill operations within the Multi-head Latent Attention (MLA) backend. This involves setting up the necessary FlashInfer wrappers and planning their execution.
  • FlashInfer-Specific Metadata and Hyperparameter Management: I've introduced a new FIPrefillMetadata dataclass to store FlashInfer-specific prefill parameters and wrappers. Additionally, I've added logic to extract and validate attention layer hyperparameters (such as window size and scale) from the vLLM configuration, ensuring these are consistent and correctly passed to the FlashInfer kernels (a sketch of such a consistency check follows after this list).
  • Attention Logic Adaptation for Prefill: The core attention computation paths for prefill (_compute_prefill_context and _forward_prefill) have been updated. They now utilize a new _fi_prefill_run method, which dispatches to the FlashInfer prefill kernel, replacing the previous FlashAttention 2 calls for these operations.
  • Model Update for Benchmarking: The basic.py offline inference example has been updated to use the deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct model. This change facilitates focused testing and benchmarking of the new FlashInfer ragged prefill integration with a specific model.
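As referenced in the second highlight, the PR validates that per-layer attention hyperparameters agree before they are handed to FlashInfer. Below is a minimal sketch of that kind of consistency check; the dataclass fields and helper name are illustrative assumptions, not the exact code added in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PerLayerParams:
    """Illustrative subset of per-layer attention hyperparameters."""
    window_left: int                  # -1 means no sliding window
    logits_soft_cap: Optional[float]
    sm_scale: float


def infer_global_params(layers: dict) -> PerLayerParams:
    """FlashInfer prefill is planned once for all layers, so the per-layer
    hyperparameters must be identical; fail loudly if they are not."""
    params = list(layers.values())
    assert params, "no attention layers found"
    first = params[0]
    assert all(p == first for p in params), (
        "FlashInfer ragged prefill assumes identical attention hyperparameters "
        f"across layers, got: {layers}")
    return first


# Example: two layers with matching parameters pass the check.
layers = {
    "layer.0": PerLayerParams(window_left=-1, logits_soft_cap=None, sm_scale=0.1),
    "layer.1": PerLayerParams(window_left=-1, logits_soft_cap=None, sm_scale=0.1),
}
print(infer_global_params(layers))
```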

@mergify bot added the documentation and v1 labels Jun 24, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces FlashInfer ragged prefill to the MLA backend, benchmarked with the DeepSeek-Coder-V2-Lite-Instruct model. While the changes are functionally correct, there is currently a performance slowdown; the author is actively working on improvements such as enabling CUDA graphs and trace profiling. The review identifies areas for code cleanup, consistency, and verification of key logic.

@alexm-redhat alexm-redhat self-assigned this Jul 2, 2025
@alexm-redhat alexm-redhat marked this pull request as ready for review July 2, 2025 14:52
@alexm-redhat alexm-redhat changed the title from "MLA - Flashinfer Ragged Prefill" to "MLA - Flashinfer Ragged Prefill (for B200)" Jul 2, 2025
@alexm-redhat alexm-redhat changed the title from "MLA - Flashinfer Ragged Prefill (for B200)" to "MLA - Flashinfer Ragged Prefill" Jul 2, 2025
@alexm-redhat alexm-redhat force-pushed the mla_fi_prefill branch 2 times, most recently from e3afaf1 to 72060d3 Compare July 2, 2025 15:27
@alexm-redhat alexm-redhat requested a review from mgoin July 2, 2025 15:29
@alexm-redhat (Collaborator, Author) commented Jul 2, 2025

/gemini summary

@gemini-code-assist bot commented

Summary of Changes

This pull request integrates FlashInfer's ragged prefill capabilities into the MLA attention backend, primarily targeting B200 GPUs to achieve substantial performance gains in both prefill and token generation. The changes involve adding new data structures and logic to conditionally leverage FlashInfer's optimized kernels.

Highlights

  • FlashInfer Ragged Prefill Integration: I've added support for FlashInfer's ragged prefill attention, specifically optimized for B200 GPUs (SM100a architecture). This enables the use of FlashInfer's highly optimized kernels for prefill operations.
  • Performance Improvements: This integration is designed to significantly boost prefill performance. Initial benchmarks on a single B200 GPU show a 20-25% improvement in Time To First Token (TTFT) and a 15-20% improvement in Time Per Output Token (TPOT) for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct.
  • Dynamic Backend Selection: The system now intelligently switches between the existing attention mechanism and FlashInfer's ragged prefill based on the detected GPU architecture. If an SM100a GPU is present, FlashInfer is utilized for prefill operations.
  • New Metadata and Utilities: To support FlashInfer, I've introduced new dataclasses (FlashInferPrefillMetadata, PerLayerParameters) and a FlashInferPrefill class. These manage FlashInfer-specific parameters and workspace buffers, and provide configured prefill wrappers, ensuring optimal use of the 'cutlass' backend for B200 (see the sketch below).
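To make the highlights above concrete, here is a minimal, hedged sketch of how FlashInfer's ragged-prefill wrapper is typically constructed, planned, and run. The 'NHD' layout and 'cutlass' backend come from the description above; the SM100 capability check, workspace-buffer size, head counts, head dim, and the exact plan() keyword names are illustrative assumptions and may differ between FlashInfer releases and the actual vLLM code.

```python
import torch
import flashinfer

# B200 (SM100) reports CUDA compute capability (10, 0); the PR gates this path on it.
assert torch.cuda.get_device_capability()[0] == 10, "expects an SM100-class GPU"

# Shared workspace buffer, reused across plan()/run() calls (size is an assumption).
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Ragged (non-paged) prefill wrapper with the 'NHD' layout and 'cutlass' backend
# described above.
wrapper = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(
    workspace, kv_layout="NHD", backend="cutlass")

num_qo_heads, num_kv_heads, head_dim = 16, 16, 192  # illustrative values
seq_lens = [512, 1024]                               # two prefill requests

# Cumulative token offsets per request; for ragged prefill without cached context,
# queries and keys share the same layout, so one indptr serves both sides.
indptr = torch.tensor([0, 512, 1536], dtype=torch.int32, device="cuda")

wrapper.plan(indptr, indptr, num_qo_heads, num_kv_heads, head_dim,
             causal=True, sm_scale=head_dim ** -0.5)

total = sum(seq_lens)
q = torch.randn(total, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(total, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(total, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

out = wrapper.run(q, k, v)  # shape: (total, num_qo_heads, head_dim)
```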
Changelog
  • vllm/v1/attention/backends/mla/common.py
    • Added imports for FlashInfer components, including BatchPrefillWithRaggedKVCacheWrapper and GPU capability detection (is_sm100a_supported).
    • Introduced FlashInferPrefillMetadata dataclass to store FlashInfer prefill wrappers for main and chunked contexts.
    • Implemented the FlashInferPrefill class to manage a shared workspace buffer and provide BatchPrefillWithRaggedKVCacheWrapper instances configured for the 'NHD' KV layout and 'cutlass' backend.
    • Modified MLACommonMetadata to include an optional fi_prefill field for FlashInfer-specific metadata.
    • Updated MLACommonMetadataBuilder to initialize and populate FlashInferPrefillMetadata when FlashInfer prefill is enabled and applicable, planning the prefill operations with relevant parameters.
    • Adjusted MLACommonBackend to detect SM100a GPUs and enable FlashInfer prefill, which also sets specific FlashInfer hyperparameters (sliding_window, logits_soft_cap) and disables _pad_v.
    • Modified _compute_prefill_context and _forward_prefill methods to conditionally dispatch attention calculations to FlashInfer's run method if FlashInfer prefill is active, otherwise falling back to the existing _flash_attn_varlen_diff_headdims (a simplified dispatch sketch follows below).
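The last Changelog bullet describes a conditional dispatch between the new FlashInfer path and the existing FlashAttention varlen path. The sketch below captures only the shape of that logic; the argument names and the fallback signature are simplified stand-ins for what mla/common.py actually does, not its real interfaces.

```python
from typing import Callable

import torch


def forward_prefill(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    fi_prefill_wrapper,                     # planned FlashInfer wrapper, or None
                    fallback: Callable[..., torch.Tensor],  # FlashAttention varlen path
                    scale: float) -> torch.Tensor:
    """Simplified stand-in for the dispatch added to the MLA prefill paths."""
    if fi_prefill_wrapper is not None:
        # FlashInfer path: plan() already ran in the metadata builder,
        # so the hot path only needs run().
        return fi_prefill_wrapper.run(q, k, v)
    # Otherwise fall back to the pre-existing FlashAttention varlen kernel.
    return fallback(q, k, v, softmax_scale=scale, causal=True)
```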
Activity
  • A bot (github-actions[bot]) provided initial guidance on the CI and PR workflow.
  • The author (alexm-redhat) requested a summary from a bot.
  • An automated code assist bot (gemini-code-assist[bot]) provided several review comments:
    • Four high-priority comments questioned the identical cu_seqlens_k and max_seqlen_k values being passed as their _q counterparts in the original attention function calls, suggesting a potential for incorrect attention calculations.
    • Two medium-priority comments advised removing debugging print statements that had been added.
    • Two medium-priority comments recommended replacing hardcoded head dimensions with variables or constants for better consistency in the attention function calls.

@alexm-redhat alexm-redhat force-pushed the mla_fi_prefill branch 3 times, most recently from e1f2aa4 to e52c4f1 Compare July 7, 2025 16:28
@pavanimajety left a comment

LGTM, thanks for your work! The flashinfer bugfix has been merged. Do we need to update the flashinfer commit too?

@mgoin mgoin requested review from LucasWilkinson and mgoin July 9, 2025 02:15
@LucasWilkinson LucasWilkinson changed the title from "MLA - Flashinfer Ragged Prefill" to "[Attention] MLA - Flashinfer Ragged Prefill" Jul 10, 2025
Signed-off-by: Lucas Wilkinson <[email protected]>
@simon-mo simon-mo merged commit 5b03235 into main Jul 11, 2025
65 of 69 checks passed
@simon-mo simon-mo deleted the mla_fi_prefill branch July 11, 2025 03:17
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025

Labels

  • deepseek — Related to DeepSeek models
  • documentation — Improvements or additions to documentation
  • frontend
  • performance — Performance-related issues
  • ready — ONLY add when PR is ready to merge/full CI is needed
  • v1
