
Conversation

@alexm-redhat (Collaborator) commented Jun 24, 2025

This PR adds FlashInfer ragged prefill support to the MLA backend on B200 GPUs. It depends on this FlashInfer fix from NVIDIA: flashinfer-ai/flashinfer#1198

Here are performance results for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct on a single B200 with 10000/100 prompt/output token lengths at various batch sizes. We see a 20-25% improvement in TTFT and a 15-20% improvement in TPOT.

[Image: TTFT and TPOT benchmark results across batch sizes]
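For a quick, rough sanity check of prefill-heavy latency offline, here is a minimal sketch using vLLM's public Python API. The model name matches the benchmark above, but the prompt construction, token counts, and timing are illustrative assumptions; this is not the serving benchmark that produced the TTFT/TPOT numbers in the table.

```python
import time

from vllm import LLM, SamplingParams

# Rough offline proxy for a prefill-heavy workload: one long prompt, few output tokens.
# NOT the serving benchmark used for the TTFT/TPOT numbers above.
llm = LLM(model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
          trust_remote_code=True,
          max_model_len=16384)  # leave headroom for a ~10k-token prompt

long_prompt = "def fibonacci(n):\n    pass\n" * 1500  # roughly 10k tokens (very approximate)
params = SamplingParams(temperature=0.0, max_tokens=100)

start = time.perf_counter()
outputs = llm.generate([long_prompt], params)
elapsed = time.perf_counter() - start

print(f"end-to-end latency for 1 request: {elapsed:.2f} s")
print(outputs[0].outputs[0].text[:200])
```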

github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment

Summary of Changes

Hello @alexm-redhat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the initial integration of FlashInfer's ragged prefill functionality into the MLA backend. The aim is to leverage FlashInfer for prefill operations, with current benchmarks indicating a performance regression compared to the existing FlashAttention 2 implementation. The changes primarily involve setting up FlashInfer-specific metadata, managing its prefill wrappers, and adapting the attention computation logic to utilize the new backend, along with an updated example for testing.

Highlights

  • FlashInfer Ragged Prefill Integration: I've integrated FlashInfer's BatchPrefillWithRaggedKVCacheWrapper to handle ragged prefill operations within the Multi-head Latent Attention (MLA) backend. This involves setting up the necessary FlashInfer wrappers and planning their execution.
  • FlashInfer-Specific Metadata and Hyperparameter Management: I've introduced a new FIPrefillMetadata dataclass to store FlashInfer-specific prefill parameters and wrappers. Additionally, I've added logic to extract and validate attention layer hyperparameters (such as window size and scale) from the vLLM configuration, ensuring these are consistent and correctly passed to the FlashInfer kernels (a sketch of such a consistency check follows after this list).
  • Attention Logic Adaptation for Prefill: The core attention computation paths for prefill (_compute_prefill_context and _forward_prefill) have been updated. They now utilize a new _fi_prefill_run method, which dispatches to the FlashInfer prefill kernel, replacing the previous FlashAttention 2 calls for these operations.
  • Model Update for Benchmarking: The basic.py offline inference example has been updated to use the deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct model. This change facilitates focused testing and benchmarking of the new FlashInfer ragged prefill integration with a specific model.
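As referenced in the second highlight, the PR validates that per-layer attention hyperparameters agree before they are handed to FlashInfer. Below is a minimal sketch of that kind of consistency check; the dataclass fields and helper name are illustrative assumptions, not the exact code added in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PerLayerParams:
    """Illustrative subset of per-layer attention hyperparameters."""
    window_left: int                  # -1 means no sliding window
    logits_soft_cap: Optional[float]
    sm_scale: float


def infer_global_params(layers: dict) -> PerLayerParams:
    """FlashInfer prefill is planned once for all layers, so the per-layer
    hyperparameters must be identical; fail loudly if they are not."""
    params = list(layers.values())
    assert params, "no attention layers found"
    first = params[0]
    assert all(p == first for p in params), (
        "FlashInfer ragged prefill assumes identical attention hyperparameters "
        f"across layers, got: {layers}")
    return first


# Example: two layers with matching parameters pass the check.
layers = {
    "layer.0": PerLayerParams(window_left=-1, logits_soft_cap=None, sm_scale=0.1),
    "layer.1": PerLayerParams(window_left=-1, logits_soft_cap=None, sm_scale=0.1),
}
print(infer_global_params(layers))
```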

@mergify bot added the documentation and v1 labels Jun 24, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces FlashInfer ragged prefill to the MLA backend, benchmarked with the DeepSeek-Coder-V2-Lite-Instruct model. While the changes are functionally correct, there is currently a performance slowdown; the author is actively working on improvements such as enabling CUDA graphs and trace profiling. The review identifies areas for code cleanup, consistency, and verification of key logic.

@alexm-redhat alexm-redhat self-assigned this Jul 2, 2025
@alexm-redhat alexm-redhat marked this pull request as ready for review July 2, 2025 14:52
@alexm-redhat alexm-redhat changed the title from "MLA - Flashinfer Ragged Prefill" to "MLA - Flashinfer Ragged Prefill (for B200)" Jul 2, 2025
@alexm-redhat alexm-redhat changed the title from "MLA - Flashinfer Ragged Prefill (for B200)" to "MLA - Flashinfer Ragged Prefill" Jul 2, 2025
@alexm-redhat alexm-redhat force-pushed the mla_fi_prefill branch 2 times, most recently from e3afaf1 to 72060d3 Compare July 2, 2025 15:27
@alexm-redhat alexm-redhat requested a review from mgoin July 2, 2025 15:29
@alexm-redhat (Collaborator, Author) commented Jul 2, 2025

/gemini summary

@gemini-code-assist bot commented

Summary of Changes

This pull request integrates FlashInfer's ragged prefill capabilities into the MLA attention backend, primarily targeting B200 GPUs to achieve substantial performance gains in both prefill and token generation. The changes involve adding new data structures and logic to conditionally leverage FlashInfer's optimized kernels.

Highlights

  • FlashInfer Ragged Prefill Integration: I've added support for FlashInfer's ragged prefill attention, specifically optimized for B200 GPUs (SM100a architecture). This enables the use of FlashInfer's highly optimized kernels for prefill operations.
  • Performance Improvements: This integration is designed to significantly boost prefill performance. Initial benchmarks on a single B200 GPU show a 20-25% improvement in Time To First Token (TTFT) and a 15-20% improvement in Time Per Output Token (TPOT) for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct.
  • Dynamic Backend Selection: The system now intelligently switches between the existing attention mechanism and FlashInfer's ragged prefill based on the detected GPU architecture. If an SM100a GPU is present, FlashInfer is utilized for prefill operations.
  • New Metadata and Utilities: To support FlashInfer, I've introduced new dataclasses (FlashInferPrefillMetadata, PerLayerParameters) and a FlashInferPrefill class. These manage FlashInfer-specific parameters and workspace buffers, and provide configured prefill wrappers, ensuring optimal use of the 'cutlass' backend for B200 (see the sketch below).
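To make the highlights above concrete, here is a minimal, hedged sketch of how FlashInfer's ragged-prefill wrapper is typically constructed, planned, and run. The 'NHD' layout and 'cutlass' backend come from the description above; the SM100 capability check, workspace-buffer size, head counts, head dim, and the exact plan() keyword names are illustrative assumptions and may differ between FlashInfer releases and the actual vLLM code.

```python
import torch
import flashinfer

# B200 (SM100) reports CUDA compute capability (10, 0); the PR gates this path on it.
assert torch.cuda.get_device_capability()[0] == 10, "expects an SM100-class GPU"

# Shared workspace buffer, reused across plan()/run() calls (size is an assumption).
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Ragged (non-paged) prefill wrapper with the 'NHD' layout and 'cutlass' backend
# described above.
wrapper = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(
    workspace, kv_layout="NHD", backend="cutlass")

num_qo_heads, num_kv_heads, head_dim = 16, 16, 192  # illustrative values
seq_lens = [512, 1024]                               # two prefill requests

# Cumulative token offsets per request; for ragged prefill without cached context,
# queries and keys share the same layout, so one indptr serves both sides.
indptr = torch.tensor([0, 512, 1536], dtype=torch.int32, device="cuda")

wrapper.plan(indptr, indptr, num_qo_heads, num_kv_heads, head_dim,
             causal=True, sm_scale=head_dim ** -0.5)

total = sum(seq_lens)
q = torch.randn(total, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(total, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(total, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

out = wrapper.run(q, k, v)  # shape: (total, num_qo_heads, head_dim)
```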
Changelog
  • vllm/v1/attention/backends/mla/common.py
    • Added imports for FlashInfer components, including BatchPrefillWithRaggedKVCacheWrapper and GPU capability detection (is_sm100a_supported).
    • Introduced FlashInferPrefillMetadata dataclass to store FlashInfer prefill wrappers for main and chunked contexts.
    • Implemented the FlashInferPrefill class to manage a shared workspace buffer and provide BatchPrefillWithRaggedKVCacheWrapper instances configured for the 'NHD' KV layout and 'cutlass' backend.
    • Modified MLACommonMetadata to include an optional fi_prefill field for FlashInfer-specific metadata.
    • Updated MLACommonMetadataBuilder to initialize and populate FlashInferPrefillMetadata when FlashInfer prefill is enabled and applicable, planning the prefill operations with relevant parameters.
    • Adjusted MLACommonBackend to detect SM100a GPUs and enable FlashInfer prefill, which also sets specific FlashInfer hyperparameters (sliding_window, logits_soft_cap) and disables _pad_v.
    • Modified _compute_prefill_context and _forward_prefill methods to conditionally dispatch attention calculations to FlashInfer's run method if FlashInfer prefill is active, otherwise falling back to the existing _flash_attn_varlen_diff_headdims (a simplified dispatch sketch follows below).
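The last Changelog bullet describes a conditional dispatch between the new FlashInfer path and the existing FlashAttention varlen path. The sketch below captures only the shape of that logic; the argument names and the fallback signature are simplified stand-ins for what mla/common.py actually does, not its real interfaces.

```python
from typing import Callable

import torch


def forward_prefill(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    fi_prefill_wrapper,                     # planned FlashInfer wrapper, or None
                    fallback: Callable[..., torch.Tensor],  # FlashAttention varlen path
                    scale: float) -> torch.Tensor:
    """Simplified stand-in for the dispatch added to the MLA prefill paths."""
    if fi_prefill_wrapper is not None:
        # FlashInfer path: plan() already ran in the metadata builder,
        # so the hot path only needs run().
        return fi_prefill_wrapper.run(q, k, v)
    # Otherwise fall back to the pre-existing FlashAttention varlen kernel.
    return fallback(q, k, v, softmax_scale=scale, causal=True)
```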
Activity
  • A bot (github-actions[bot]) provided initial guidance on the CI and PR workflow.
  • The author (alexm-redhat) requested a summary from a bot.
  • An automated code assist bot (gemini-code-assist[bot]) provided several review comments:
    • Four high-priority comments questioned the identical cu_seqlens_k and max_seqlen_k values being passed as their _q counterparts in the original attention function calls, suggesting a potential for incorrect attention calculations.
    • Two medium-priority comments advised removing debugging print statements that had been added.
    • Two medium-priority comments recommended replacing hardcoded head dimensions with variables or constants for better consistency in the attention function calls.

@alexm-redhat alexm-redhat force-pushed the mla_fi_prefill branch 3 times, most recently from e1f2aa4 to e52c4f1 Compare July 7, 2025 16:28
@pavanimajety left a comment

LGTM, thanks for your work! The flashinfer bugfix has been merged. Do we need to update the flashinfer commit too?

@mgoin mgoin requested review from LucasWilkinson and mgoin July 9, 2025 02:15
@LucasWilkinson LucasWilkinson changed the title from "MLA - Flashinfer Ragged Prefill" to "[Attention] MLA - Flashinfer Ragged Prefill" Jul 10, 2025
Signed-off-by: Lucas Wilkinson <[email protected]>
@simon-mo simon-mo merged commit 5b03235 into main Jul 11, 2025
65 of 69 checks passed
@simon-mo simon-mo deleted the mla_fi_prefill branch July 11, 2025 03:17
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025

Labels

  • deepseek — Related to DeepSeek models
  • documentation — Improvements or additions to documentation
  • frontend
  • performance — Performance-related issues
  • ready — ONLY add when PR is ready to merge/full CI is needed
  • v1
