[None][feat] add Hopper FP8 context MLA #7107

zhou-yuxin · 2025-08-21T04:59:34Z

Tested with DeepSeekV3Lite with FP8 KVCache enabled.

export LLM_MODELS_ROOT=/models
pytest accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=eagle-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]

Summary by CodeRabbit

New Features
- Added 192x128 attention kernel variants (separate Q/K/V) with BF16 output options and expanded public kernel coverage.
- Multimodal evaluation and MMMU support: new multimodal wrapper, MMMU evaluator/CLI, image placeholder handling, and public image/chat helpers.
Performance/Compatibility
- TMA store now respects output dtype and requires head-size alignment; FP8 MLA support expanded to SM90 and SM120.
- Added per-context QKV descaling and safer masking sentinel behavior; updated dependency minimums.
Tests
- Updated SM skip logic for FP8 MLA tests, many test list and test-case adjustments, and new/modified integration/unit test entries.
Documentation
- README/blog updates and serving/packaging guidance revised.

coderabbitai · 2025-08-21T04:59:42Z

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

Expands FMHA kernel enumeration and metadata to include SEPARATE_Q_K_V and new 192x128 context MLA variants (with output_dtype options), adjusts TMA store gating and kernel traits propagation, introduces a CUBIN-aware per-CUBIN QKV descaling factor and unified packed store path, fixes V-tile transpose indexing, and broadens FP8 MLA test gating to include SM90.

Changes

Cohort / File(s)	Summary
Tests: FP8 MLA gating `cpp/kernels/fmha_v2/fmha_test.py`	FP8 MLA test skip condition changed to run on SM90 and SM120; skip message updated.
Kernel enumeration & public API `cpp/kernels/fmha_v2/setup.py`	Added `InputLayout.SEPARATE_Q_K_V`; `enumerate_qgmma_flash_warpspec_kernels` now accepts `output_dtype`; TMA store gating uses `output_dtype` and requires `head_size % 16 == 0`; added 192x128 context MLA variants (output_dtype None/bf16); broadened Deepseek MLA allowed dtypes to include `e4m3`.
Packed store path & CUBIN-aware scale `cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_o_packed.h`	Added public `params_scale_bmm2_` with CUBIN-aware init; replaced element-wise stores with unified packing via `Acc_packer` that uses the runtime/compile-time scale; adjusted constructor col offset handling.
Warp-spec DMA V-transpose indexing `cpp/kernels/fmha_v2/src/fmha/warpspec/dma.h`	Inner loop bound changed from `D_GROUPS` → `DV_GROUPS`; destination offset uses `DV` instead of `D` for V-tile transpose addressing.
Kernel traits propagation `cpp/kernels/fmha_v2/src/fmha/warpspec/kernel_traits.h`	Extended `Base` alias to include `RETURN_SOFTMAX_STATS_`, `OutputType`, and `SAGE_BLOCK_SIZE_{Q,K,V}_` template parameters.
CUBIN declarations & metadata `cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.h`	Added extern kernels for 192x128 SEPARATE_Q_K_V (e4m3 and bf16-output) and appended four kernel-meta entries covering 192x128 tma/causal and bf16-output variants.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Build as Kernel Enumeration
  participant Enum as enumerate_qgmma_flash_warpspec_kernels
  participant Traits as Kernel_traits
  participant Cubin as CUBIN Metadata

  Build->>Enum: call(sm, dtype, output_dtype, layouts...)
  Note over Enum: include InputLayout.SEPARATE_Q_K_V and 192x128 context MLA
  Enum->>Traits: compute/instantiate traits (propagate RETURN_SOFTMAX_STATS_, OutputType, SAGE sizes)
  Note over Enum: enable_tma_store checks output_dtype and head_size%16==0
  Enum->>Cubin: emit kernel declarations/metadata for 192x128 (e4m3 + bf16-output)

sequenceDiagram
  autonumber
  participant Kernel as Gmem_tile_o_qgmma_fp32_16bits
  participant Params as Params (runtime)
  participant Packer as Acc_packer
  participant GM as GlobalMemory

  Kernel->>Params: initialize params_scale_bmm2_ (CUBIN-aware)
  Kernel->>Kernel: store(accumulators...)
  Kernel->>Packer: Acc_packer<Src,Out,Scale>::run(_src, params_scale_bmm2_)
  Packer-->>Kernel: packed uint2 (_dst)
  Kernel->>GM: stg(_dst) @ computed offset

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

[TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper #7035 — Related XQA/cross-attention wiring (encoder_input_lengths) and XQA kernel gating; overlaps on attention/KV handling.
[https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 #7076 — Same change to pattern_matcher.patched_fn signature (*args/**kwargs forwarding); directly related to compatibility adjustments.
[None][fix] Fix llama4 multimodal by skipping request validation #6957 — Llama4 multimodal shortcut in PyExecutor._validate_request; related cross-cutting multimodal/test gating changes touched in this diff.

Suggested reviewers

litaotju
kaiyux
chzblych
yuxianq
PerkzZheng
zhhuang-nv

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fd6e177 and 810beb2.

📒 Files selected for processing (59)

.github/CODEOWNERS (1 hunks)
.github/workflows/blossom-ci.yml (1 hunks)
README.md (2 hunks)
cpp/include/tensorrt_llm/common/logger.h (2 hunks)
cpp/kernels/fmha_v2/fmha_test.py (1 hunks)
cpp/kernels/fmha_v2/setup.py (5 hunks)
cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_o_packed.h (3 hunks)
cpp/kernels/fmha_v2/src/fmha/warpspec/dma.h (2 hunks)
cpp/kernels/fmha_v2/src/fmha/warpspec/kernel_traits.h (1 hunks)
cpp/kernels/xqa/mha_sm90.cu (5 hunks)
cpp/tensorrt_llm/common/attentionOp.cpp (3 hunks)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp (1 hunks)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.h (2 hunks)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/xqaParams.h (2 hunks)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h (2 hunks)
cpp/tensorrt_llm/kernels/xqaDispatcher.cpp (7 hunks)
cpp/tests/unit_tests/executor/transferAgentTest.cpp (2 hunks)
docker/common/install_nixl.sh (2 hunks)
docker/common/install_ucx.sh (1 hunks)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (7 hunks)
jenkins/L0_Test.groovy (1 hunks)
jenkins/current_image_tags.properties (1 hunks)
requirements.txt (1 hunks)
scripts/build_wheel.py (5 hunks)
tensorrt_llm/_torch/attention_backend/__init__.py (1 hunks)
tensorrt_llm/_torch/attention_backend/utils.py (1 hunks)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1 hunks)
tensorrt_llm/_torch/custom_ops/__init__.py (2 hunks)
tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py (2 hunks)
tensorrt_llm/_torch/flashinfer_utils.py (1 hunks)
tensorrt_llm/_torch/modules/rms_norm.py (1 hunks)
tensorrt_llm/_torch/modules/rotary_embedding.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
tensorrt_llm/commands/eval.py (2 hunks)
tensorrt_llm/evaluate/__init__.py (1 hunks)
tensorrt_llm/evaluate/lm_eval.py (10 hunks)
tensorrt_llm/inputs/__init__.py (2 hunks)
tensorrt_llm/inputs/utils.py (4 hunks)
tensorrt_llm/serve/chat_utils.py (2 hunks)
tests/integration/defs/.test_durations (3 hunks)
tests/integration/defs/accuracy/accuracy_core.py (1 hunks)
tests/integration/defs/accuracy/references/mmmu.yaml (1 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
tests/integration/test_lists/qa/llm_function_full.txt (2 hunks)
tests/integration/test_lists/qa/llm_function_sanity.txt (1 hunks)
tests/integration/test_lists/test-db/l0_a30.yml (1 hunks)
tests/integration/test_lists/test-db/l0_b200.yml (2 hunks)
tests/integration/test_lists/test-db/l0_gb200.yml (1 hunks)
tests/integration/test_lists/test-db/l0_gb200_multi_nodes.yml (1 hunks)
tests/integration/test_lists/test-db/l0_gb202.yml (1 hunks)
tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
tests/integration/test_lists/test-db/l0_l40s.yml (1 hunks)
tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml (1 hunks)
tests/integration/test_lists/waives.txt (2 hunks)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py (1 hunks)
tests/unittest/_torch/executor/test_resource_manager.py (2 hunks)
tests/unittest/_torch/modules/test_fused_moe.py (1 hunks)
tests/unittest/_torch/multi_gpu_modeling/test_llama4.py (2 hunks)
tests/unittest/_torch/sampler/test_trtllm_sampler.py (2 hunks)

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/kernels/fmha_v2/fmha_test.py (1)
1-1: Missing NVIDIA copyright header (2025)

Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply at the very top of the file:
+ # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0

🧹 Nitpick comments (5)

cpp/kernels/fmha_v2/fmha_test.py (1)
168-169: De-duplicate supported-arch lists for FP8 MLA gating

Supported SMs for FP8 context MLA are hardcoded here, while generation-phase FP8 MLA uses a different guard (Lines 213-215). To reduce churn when enabling more archs, centralize these into a single constant.

Apply this diff in-place:
-    if dtype in ["-e4m3", "-e4m3 -bf16-output"] and sm_version not in [90, 120]:
+    if dtype in ["-e4m3", "-e4m3 -bf16-output"] and sm_version not in SUPPORTED_FP8_CONTEXT_MLA:
         pytest.skip("FP8 MLAs are only supported on sm90 and sm120 currently.")
Then, add this near the top of the file (e.g., below imports) to define the shared constant:
SUPPORTED_FP8_CONTEXT_MLA = (90, 120)
cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_o_packed.h (1)
1267-1281: Improved packing implementation with configurable scaling.

The refactored store path using Acc_packer with template parameter Scale is cleaner and more maintainable than the previous per-element approach. The conditional scaling based on UNIFIED_EPILOGUE_SCALE provides flexibility for different build configurations.

Consider extracting the packing logic into a helper function to reduce macro usage:
-#define STORE_COLUMNS()                                                                            \
-    {                                                                                              \
-        /* we assume M = 1. some shortcuts. */                                                     \
-        static_assert(M == 1);                                                                     \
-        uint4 _src = {                                                                             \
-            .x = acc[0][mma_ni].reg(((ci + 0) * ROWS_PER_THREAD + ri) * 2),                        \
-            .y = acc[0][mma_ni].reg(((ci + 1) * ROWS_PER_THREAD + ri) * 2),                        \
-            .z = acc[0][mma_ni].reg(((ci + 0) * ROWS_PER_THREAD + ri) * 2 + 1),                    \
-            .w = acc[0][mma_ni].reg(((ci + 1) * ROWS_PER_THREAD + ri) * 2 + 1),                    \
-        };                                                                                         \
-        uint2 _dst = Acc_packer<float, Output_type, Scale>::run(this, _src);                       \
-        int64_t _offset =                                                                          \
-            (int64_t)ri * step_m + (int64_t)(ci + mma_ni * COLS_PER_THREAD) * STEP_N;              \
-        fmha::stg(o_ptr_ + _offset, _dst);                                                         \
-    }
+    template<bool Scale>
+    inline __device__ void store_columns(auto const& acc, int mma_ni, int ci, int ri, int64_t step_m) {
+        static_assert(M == 1);
+        uint4 src = {
+            .x = acc[0][mma_ni].reg(((ci + 0) * ROWS_PER_THREAD + ri) * 2),
+            .y = acc[0][mma_ni].reg(((ci + 1) * ROWS_PER_THREAD + ri) * 2),
+            .z = acc[0][mma_ni].reg(((ci + 0) * ROWS_PER_THREAD + ri) * 2 + 1),
+            .w = acc[0][mma_ni].reg(((ci + 1) * ROWS_PER_THREAD + ri) * 2 + 1),
+        };
+        uint2 dst = Acc_packer<float, Output_type, Scale>::run(this, src);
+        int64_t offset = (int64_t)ri * step_m + (int64_t)(ci + mma_ni * COLS_PER_THREAD) * STEP_N;
+        fmha::stg(o_ptr_ + offset, dst);
+    }
Then replace STORE_COLUMNS() with store_columns<Scale>(acc, mma_ni, ci, ri, step_m); in the loops.
cpp/kernels/fmha_v2/setup.py (3)
1917-1921: TMA-store gating: consider row-byte alignment (16B) rather than head_size only; optionally broaden to bf16 when validated

Using the actual output dtype here is correct. However, aligning on element count (head_size % 16) assumes 1B elements. TMA operates on 16B granularity, so the robust check is row_bytes % 16 == 0. That keeps behavior identical for FP8 today and future-proofs the condition. If/when bf16 output store via TMA is validated, you can safely include it by keeping the same row alignment check.

Apply this minimal generalization now (no behavior change for FP8), keeping bf16 disabled until you validate:
 def enable_tma_store(kspec):
-    output_dtype = kspec.output_dtype if kspec.output_dtype is not None else kspec.dtype
-    # TMA copies data in the 16B granularity.
-    return 'true' if (output_dtype in ['e4m3', 'e4m3_fp32']
-                      and kspec.head_size % 16 == 0) else 'false'
+    output_dtype = kspec.output_dtype if kspec.output_dtype is not None else kspec.dtype
+    # TMA copies data in 16B granularity: require row-size (in bytes) to be a multiple of 16.
+    row_bytes = kspec.head_size * dtype2bytes[output_dtype]
+    return 'true' if (output_dtype in ['e4m3', 'e4m3_fp32'] and (row_bytes % 16 == 0)) else 'false'
Optionally, once store path readiness for 16-bit outputs is confirmed, enable bf16 with the same alignment:
-    return 'true' if (output_dtype in ['e4m3', 'e4m3_fp32'] and (row_bytes % 16 == 0)) else 'false'
+    return 'true' if (output_dtype in ['e4m3', 'e4m3_fp32', 'bf16'] and (row_bytes % 16 == 0)) else 'false'
Would you like me to add a small guard (env flag) to toggle bf16 TMA-store at runtime for A/B perf validation without rebuilds?

3816-3818: Broadened input-layout combinations for FP8 WS kernels: OK; consider trimming to avoid generating unneeded SEPARATE_Q_K_V variants outside MLA

Including SEPARATE_Q_K_V in the general cartesian product is functionally fine, but most of those specs will be filtered out later by specs_names, adding enumeration noise. Optional: restrict the general pass to PACKED/CONTIGUOUS/Q_PAGED and handle SEPARATE_Q_K_V only in the MLA block below (lines 3932-3971).

Light refactor (paired with the MLA block tweak below) to avoid spec explosion:
-    combinations = product([False, True], \
-        [InputLayout.PACKED_QKV, InputLayout.CONTIGUOUS_Q_KV,
-         InputLayout.Q_PAGED_KV, InputLayout.SEPARATE_Q_K_V],
-        [False, True])
+    combinations = product(
+        [False, True],  # alibi
+        [InputLayout.PACKED_QKV, InputLayout.CONTIGUOUS_Q_KV, InputLayout.Q_PAGED_KV],
+        [False, True],  # enable_attn_logit_softcapping
+    )
This change should be combined with forcing SEPARATE_Q_K_V in the 192x128 MLA block (see comment at lines 3932-3971).

3932-3971: 192x128 context MLA variants: force SEPARATE_Q_K_V here to avoid generating unused variants

Great to see explicit 192x128 WS variants with output bf16 and default output. Since this block is exclusively for Deepseek context MLA (separate Q/K/V), pin the input layout here instead of inheriting it from the outer combinations. This removes many discarded specs and keeps intent explicit.

Apply this small tweak:
-        for output_type in [None, 'bf16']:
+        for output_type in [None, 'bf16']:
             specs.append(
                 kernel_spec(
                     sm=sm,
                     sm_mma=90,
                     dtype=dtype,
                     seq_len=0,  # support any sequence length
                     head_size=192,
                     head_size_v=128,
                     warps_m=4,  #4x1 warpgroups
                     warps_n=1,
                     version=2,
                     interleaved=False,
                     ldgsts_q=
                     False,  # for Hopper kernels, ldgsts = False signals TMA usage.
                     ldgsts_k=False,
                     ldgsts_v=False,
                     share_smem_k_v=False,
                     loop_step=64,
                     q_tile_buffers=1,  # only used by warp specialized kernels
                     has_noloop=0,
                     noloop_step=64,
                     kv_loop_step=128,
                     kv_tile_buffers=2,  # only used by warp specialized kernels
                     unroll_threshold=1,
                     has_scale_max=False,
                     flash_attention=True,
                     warp_specialization=True,
                     alibi=alibi,
                     enable_attn_logit_softcapping=enable_attn_logit_softcapping,
                     return_softmax_stats=
                     False,  # return softmax stats is not supported for fp8 now
                     scheduling_mode=scheduling_mode,
-                    input_layout=input_layout,
+                    input_layout=InputLayout.SEPARATE_Q_K_V,
                     sage_block_sizes=sage_block_sizes,
                     output_dtype=output_type))
This pairs with limiting the general combinations to PACKED/CONTIGUOUS/Q_PAGED (see lines 3816-3818).

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 21f4434 and b0d09a3.

📒 Files selected for processing (6)

cpp/kernels/fmha_v2/fmha_test.py (1 hunks)
cpp/kernels/fmha_v2/setup.py (5 hunks)
cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_o_packed.h (3 hunks)
cpp/kernels/fmha_v2/src/fmha/warpspec/dma.h (2 hunks)
cpp/kernels/fmha_v2/src/fmha/warpspec/kernel_traits.h (1 hunks)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.h (2 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

**/*.py