
Conversation

@hypdeb hypdeb commented Jul 18, 2025

PR created to look at diff between releases, not intended to be merged

Summary by CodeRabbit

  • New Features

    • Added support for MLA (Multi-head Latent Attention) kernels and benchmarking, including new CUDA kernels, headers, and test cases.
    • Introduced fabric memory support for cache transfer buffers to improve performance on supported hardware.
    • Enabled NIXL as a new communication backend for cache transceivers.
    • Enhanced multimodal input support in requests and executor interfaces.
    • Added new quantization mode (W4A8_MXFP4_FP8) and improved MoE backend benchmarking with additional data type combinations.
  • Improvements

    • Refactored cache formatting logic for better extensibility and duplication-aware buffer management.
    • Updated batch manager and runtime interfaces for improved buffer handling, speculative decoding, and lookahead support.
    • Enhanced MoE benchmarks with new routing strategies and data type support (FP4, FP8, WFP4AFP8).
    • Improved CUDA kernel and utility code for advanced synchronization, memory management, and warp-level operations.
    • Expanded documentation, code ownership, and configuration for better maintainability.
  • Bug Fixes

    • Fixed various documentation links and configuration typos, and improved error handling in utility scripts and Conan recipes.
  • Chores

    • Updated dependencies, Docker images, and pre-commit hook configurations.
    • Added and updated configuration files for CI, code review, and Git LFS tracking.
    • Improved test coverage and added new utility scripts for automated labeling and community PR management.
  • Style

    • Improved formatting, naming consistency, and code comments across multiple files.
  • Documentation

    • Added and updated README, contributing guidelines, and kernel documentation for new features and usage instructions.
  • Refactor

    • Modularized batch manager, cache, and kernel code for better maintainability and scalability.
    • Removed deprecated or unused files and functions to streamline the codebase.

jinyangyuan-nvidia and others added 30 commits June 10, 2025 14:20
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Dongxu Yang <[email protected]>
Signed-off-by: ShiXiaowei02 <[email protected]>
Signed-off-by: Enwei Zhu <[email protected]>
Signed-off-by: Jun Yang <[email protected]>
Co-authored-by: Dongxu Yang <[email protected]>
Co-authored-by: ShiXiaowei02 <[email protected]>
Co-authored-by: Jun Yang <[email protected]>
Signed-off-by: Po-Wei Wang (Vincent)
…a) + extras ✨ (#5066)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky <[email protected]>
…d more cases for llama_v3.1/3.3 70b fp8 models (#5149)

Signed-off-by: ruodil <[email protected]>
Co-authored-by: Larry <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
yizhang-nv and others added 24 commits July 8, 2025 20:22
…o max attention window (#5874)

Signed-off-by: Netanel Haber <[email protected]>
Signed-off-by: Netanel Haber <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Yingge He <[email protected]>
Signed-off-by: Martin Marciniszyn Mehringer <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>
Co-authored-by: nv-guomingz <[email protected]>
Co-authored-by: Yingge He <[email protected]>
Co-authored-by: Martin Marciniszyn Mehringer <[email protected]>
Co-authored-by: Kaiyu Xie <[email protected]>
Co-authored-by: zpatel <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
Co-authored-by: Iman Tabrizian <[email protected]>
Signed-off-by: junq <[email protected]>
Signed-off-by: Sharan Chetlur <[email protected]>
Signed-off-by: QI JUN <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]>
Signed-off-by: qqiao <[email protected]>
Co-authored-by: Emma Qiao <[email protected]>
@hypdeb hypdeb self-assigned this Jul 18, 2025
coderabbitai bot commented Jul 18, 2025

Walkthrough

This update introduces major enhancements and refactoring across the batch manager, executor, runtime, and kernel components. Key changes include support for a new NIXL communication backend in cache transceivers, extensive interface and buffer management updates for multimodal and MLA (Multi-head Latent Attention) features, new CUDA kernel implementations, and significant improvements in MoE benchmarking infrastructure. Build configurations, documentation, and test utilities are also updated to support these new features and maintain compatibility.

Changes

Files / Groups Change Summary
.coderabbit.yaml, .github/scripts/label_community_user.py, .github/workflows/label_community_pr.yml, ... Added CodeRabbit config, community PR labeling workflow and script, expanded authorized users, improved PR templates, and updated code/module ownership rules.
.gitattributes, .devcontainer/docker-compose.yml, constraints.txt, 3rdparty/cutlass, .pre-commit-config.yaml Updated LFS tracking patterns, Docker image tag, security constraints, submodule ref, and pre-commit hook scope.
README.md, CONTRIBUTING.md, CODING_GUIDELINES.md, benchmarks/cpp/README.md, ... Documentation: updated badges, added tech blogs/news, clarified contribution and code guidelines, and fixed dataset URLs.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h/.cpp, ... Added NIXL backend support to cache transceiver, refactored cache formatter to support MLA and duplication, and modularized cache formatting logic.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, ... Refined KV cache management: per-window block allocation, new memory calculation/grouping methods, and removed obsolete classes.
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h, ... Refactored decoder request creation: new constructor, static helpers, extended operator signatures, and enhanced multimodal/draft token support.
cpp/include/tensorrt_llm/batch_manager/llmRequest.h, ... Extended LlmRequest and related classes for multimodal fields, dummy request flag, and improved result serialization.
cpp/include/tensorrt_llm/executor/executor.h, ... Added MultimodalInput class, extended KvCacheRetentionConfig, Request, and Serialization for multimodal and agent state serialization.
cpp/include/tensorrt_llm/runtime/decoderState.h, ... DecoderState: removed buffer setup, added accessors for draft/lookahead/eagle buffers, sequence lengths, and batch size.
cpp/include/tensorrt_llm/batch_manager/decoderBuffers.h, ... Refactored buffer ownership (unique_ptr), introduced DraftBuffers class, updated MPI request pointer semantics.
cpp/include/tensorrt_llm/common/quantization.h Added new quantization mode flag (w4a8Mxfp4Fp8) and related methods.
cpp/include/tensorrt_llm/deep_gemm/*.cuh, cpp/include/tensorrt_llm/deep_gemm/*.h Major GEMM kernel overhaul: added SwapAB support, new scheduler classes, kernel and JIT utility updates, new MMA/FP8 logic, and removed paired float ops header.
cpp/include/tensorrt_llm/kernels/kvCachePartialCopy.h kvCacheBlockPartialCopy gains a kvFactor argument.
cpp/include/tensorrt_llm/runtime/gptDecoder.h/.cpp, ... Simplified GptDecoder interfaces, updated for new draft/lookahead configs, and removed maxSequenceLength from some signatures.
cpp/include/tensorrt_llm/runtime/modelConfig.h Added getNumKvHeadsForGivenLayers, removed getSumLocalKvHeads.
cpp/include/tensorrt_llm/runtime/utils/mpiUtils.h, ... Changed MPI async request ownership to unique_ptr.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp/.h, ... Modularized cache formatting, added MLA support, duplication-aware buffer management, and new factory function.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp/.h Added FabricMemory class for CUDA fabric memory, buffer manager extended for fabric memory support, and new buffer count accessors.
cpp/kernels/xqa/, cpp/kernels/xqa/test/, cpp/kernels/xqa/mla_sm120.cu, ... Introduced MLA kernel, new CUDA headers, utilities, and test logic for MLA, refactored barriers, updated type traits, and added tensor map helpers.
cpp/micro_benchmarks/*, cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixtureOss.h, ... Overhauled MoE benchmarks: new fixture for OSS, extended data/weight types, routing configs, CUDA graph support, and improved tactic selection.
cpp/kernels/fmha_v2/* Updated kernel build logic, cubin header handling, test conditions, and documentation for FMHA kernels.
cpp/kernels/xqa/pairedF32Op.cuh Removed paired float operation declarations.
Other build/config files (CMakeLists.txt, etc.) Updated build logic for OSS Cutlass options, cuFile support, and conditional linking.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Executor
    participant BatchManager
    participant CacheTransceiver
    participant CacheFormatter
    participant BufferManager
    participant ConnectionManager

    User->>Executor: Submit multimodal/generation request
    Executor->>BatchManager: Prepare request (multimodal, draft, MLA, etc.)
    BatchManager->>CacheTransceiver: Initiate cache communication (MPI/UCX/NIXL)
    CacheTransceiver->>BufferManager: Allocate send/recv buffers (optionally FabricMemory)
    CacheTransceiver->>ConnectionManager: Establish connections (including NIXL)
    CacheTransceiver->>CacheFormatter: Format cache blocks (duplication, MLA, etc.)
    CacheFormatter-->>CacheTransceiver: Provide block ranges/connections
    CacheTransceiver->>CacheTransceiver: Send/receive cache blocks
    CacheTransceiver->>BatchManager: Notify completion
    BatchManager->>Executor: Return results
    Executor->>User: Deliver output (with multimodal/MLA support)

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • nv-guomingz
  • dcampora
  • zongfeijing

Poem

In tunnels of code where the data flows,
New backends and buffers, the architecture grows.
MLA and NIXL, a hop and a leap,
MoE benchmarks and kernels, no secrets to keep.
With carrots for logic and lettuce for flair,
This rabbit approves—change is in the air! 🥕✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 32

🔭 Outside diff range comments (1)
cpp/kernels/fmha_v2/src/fmha/utils.h (1)

2889-2905: Replace integer PTX max with floating-point max for correctness

Using max.s16x2 on reinterpreted float bits can mix halves from different operands and yield invalid results. Switch to the proper FP max instructions (or the fmaxf intrinsic) to guarantee correct behavior:

• File: cpp/kernels/fmha_v2/src/fmha/utils.h
• Lines: 2897–2903

-    int32_t tmp;
-    asm volatile("max.s16x2 %0, %1, %2;\n" : "=r"(tmp) : "r"(a_), "r"(b_));
-    asm volatile("max.s16x2 %0, %0, %1;\n" : "+r"(tmp) : "r"(tmp), "r"(c_));
-    res = reinterpret_cast<float const&>(tmp);
+    float tmp_fp;
+    asm volatile("max.f32 %0, %1, %2;\n" : "=f"(tmp_fp) : "f"(a), "f"(b));
+    asm volatile("max.f32 %0, %0, %1;\n" : "+f"(tmp_fp) : "f"(c));
+    res = tmp_fp;

Alternatively, drop the inline PTX and simply do:

res = fmaxf(a, fmaxf(b, c));

This ensures correct IEEE-754 behavior across all positive float inputs.
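
To make the failure mode concrete, here is a small host-side sketch (illustrative only, not the kernel code) that simulates the per-16-bit signed max that max.s16x2 performs on float bit patterns. With the bit patterns below, the result takes its high half from one input and its low half from the other, producing a value larger than either operand:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Simulates max.s16x2: an independent signed 16-bit max on the high and low
// halves of two 32-bit values, which is what the PTX instruction computes.
static uint32_t maxS16x2(uint32_t a, uint32_t b)
{
    auto const hi = std::max<int16_t>(static_cast<int16_t>(a >> 16), static_cast<int16_t>(b >> 16));
    auto const lo = std::max<int16_t>(static_cast<int16_t>(a & 0xFFFFu), static_cast<int16_t>(b & 0xFFFFu));
    return (static_cast<uint32_t>(static_cast<uint16_t>(hi)) << 16) | static_cast<uint16_t>(lo);
}

int main()
{
    uint32_t const aBits = 0x3F800000u; // 1.0f
    uint32_t const bBits = 0x3F7F7FFFu; // ~0.998f
    float a, b, r;
    std::memcpy(&a, &aBits, sizeof(a));
    std::memcpy(&b, &bBits, sizeof(b));
    uint32_t const rBits = maxS16x2(aBits, bBits);
    std::memcpy(&r, &rBits, sizeof(r));
    // r has a's high half and b's low half (bits 0x3F807FFF, ~1.0039),
    // which exceeds both inputs; fmaxf returns the correct 1.0f.
    std::printf("a=%.7f b=%.7f s16x2=%.7f fmaxf=%.7f\n", a, b, r, fmaxf(a, b));
    return 0;
}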

🧹 Nitpick comments (41)
.github/workflows/blossom-ci.yml (1)

42-44: Deduplicate and externalise the author list

"ixlmar" now appears twice in the hard-coded JSON array.
Beyond duplicates, this single ultra-long line is becoming unmanageable. Consider moving the list into:

  1. A YAML sequence (multi-line) and using join()
  2. Or a separate JSON file checked into the repo and read with cat/jq

to keep the workflow readable and reduce merge-conflict risk.

README.md (1)

20-33: Heading style violates project markdown rules

## Tech Blogs and ## Latest News are ATX headings, while the rest of the file uses Setext style (===).
To silence markdownlint (MD003) and keep consistency, convert these two headings to Setext or disable the rule locally:

-## Tech Blogs
+Tech Blogs
+----------

Same for Latest News.

constraints.txt (1)

6-9: Good proactive pinning – add “why not upper-bounds” note

The lower-bound WAR pins are clear, but without an upper bound the environment may pull incompatible major versions in the future.
Add a brief comment explaining that only minimum versions are intended and are expected to stay compatible with the base image.

-# WAR against … tornado>=6.5.0
+# WAR against … tornado>=6.5.0  # no upper-bound; rely on image tests
cpp/kernels/fmha_v2/README.md (1)

5-5: Fix punctuation as suggested by static analysis.

The static analysis tool correctly identified a missing period after "etc." in American English.

-FMHA_v2 is just a bunch of Multi-head Attention kernels that we've enabled for known cases. It's not built as a library (cuBLAS, cuDNN, HazyResearch's MHA, etc) that is supposed to deliver good perf for all cases. End users will get access to FMHA through products or libraries, not directly through FMHA_v2.
+FMHA_v2 is just a bunch of Multi-head Attention kernels that we've enabled for known cases. It's not built as a library (cuBLAS, cuDNN, HazyResearch's MHA, etc.) that is supposed to deliver good perf for all cases. End users will get access to FMHA through products or libraries, not directly through FMHA_v2.
CONTRIBUTING.md (1)

666-666: Fix markdown linting issue with bare URL.

The static analysis tool flagged a bare URL that should be properly formatted.

-           "Allowed values are: fp8, fp4, wfp4afp8, int4, int8, float, half, bfloat16\n"
+           "Allowed values are: fp8, fp4, wfp4afp8, int4, int8, float, half, bfloat16\\n"
cpp/micro_benchmarks/CMakeLists.txt (1)

56-56: Fix typo in comment.

-# Temporary opend-sourced version. Will be daleted when open-sourced moe_gemm
+# Temporary open-sourced version. Will be deleted when open-sourced moe_gemm
.github/workflows/label_community_pr.yml (1)

28-28: Fix grammatical error in label text.

The label text "Community want to contribute" has a grammatical error. It should be "Community wants to contribute" or "Community contribution".

-          COMMUNITY_LABEL: "Community want to contribute"
+          COMMUNITY_LABEL: "Community wants to contribute"
cpp/libnuma_conan.py (2)

24-38: Good architecture-specific library path handling.

The multi-candidate approach for library paths is excellent, covering RHEL/CentOS, Ubuntu/Debian, and various architectures. However, consider enhancing the architecture handling.

Consider adding more architecture variants and improving the fallback logic:

-            elif arch in ["armv8", "aarch64"]:
+            elif arch in ["armv8", "aarch64", "armv8-a"]:
                 # Debian/Ubuntu aarch64
                 lib_candidates.append("/usr/lib/aarch64-linux-gnu/libnuma.so")
+            elif arch.startswith("arm"):
+                # Generic ARM fallback
+                lib_candidates.append("/usr/lib/arm-linux-gnueabihf/libnuma.so")
             else:
                 self.output.info(
-                    f"Unrecognized architecture: {arch}, falling back to /usr/lib/libnuma.so"
+                    f"Unrecognized architecture: {arch}, trying common paths"
                 )
                 lib_candidates.append("/usr/lib/libnuma.so")
+                lib_candidates.append("/usr/local/lib/libnuma.so")

39-46: Enhance error handling with more descriptive messages.

The error handling is good, but could provide more helpful information for debugging.

             else:
-                raise ConanInvalidConfiguration(
-                    "libnuma.so not found on system")
+                searched_paths = ", ".join(lib_candidates)
+                raise ConanInvalidConfiguration(
+                    f"libnuma.so not found on system. Searched paths: {searched_paths}. "
+                    f"Please install libnuma-dev or numactl-devel package.")
cpp/kernels/xqa/test/warmup.cu (1)

7-11: Potential issue with clock comparison logic

The while loop condition tic + cycles < clock64() is inverted: at entry clock64() is still close to tic, so the condition is immediately false and the kernel returns without spinning for the requested number of cycles. It is also fragile if the counter wraps. Loop on the elapsed cycle count instead.

-    uint64_t const tic = clock64();
-    while (tic + cycles < clock64())
+    uint64_t const tic = clock64();
+    uint64_t elapsed = 0;
+    while (elapsed < cycles)
     {
+        elapsed = clock64() - tic;
     }
.github/scripts/label_community_user.py (3)

26-27: Consider more robust rate limiting

The fixed 0.5-second sleep might not be sufficient during high API usage. Consider implementing exponential backoff or checking rate limit headers from the API response.

-            time.sleep(0.5)
             response = requests.get(url, headers=HEADERS)
+            
+            # Check rate limit headers
+            if 'X-RateLimit-Remaining' in response.headers:
+                remaining = int(response.headers['X-RateLimit-Remaining'])
+                if remaining < 10:
+                    reset_time = int(response.headers.get('X-RateLimit-Reset', 0))
+                    sleep_time = max(reset_time - time.time() + 1, 1)
+                    time.sleep(sleep_time)
+            else:
+                time.sleep(0.5)

36-39: Fix line length to comply with style guide

Line 38 exceeds the 120-character limit. Split the error message for better readability.

                 raise RuntimeError(
                     f"Forbidden (403) when fetching members for 'NVIDIA'. "
-                    f"This may be due to insufficient token permissions or rate limits. Details: {error_message}. Cannot fetch members."
+                    f"This may be due to insufficient token permissions or rate limits. "
+                    f"Details: {error_message}. Cannot fetch members."
                 )

54-56: Consider more specific exception handling

The generic Exception catch might mask unexpected errors. Consider catching specific exceptions to better handle different failure scenarios.

-        except Exception as e:
+        except requests.exceptions.RequestException as e:
             print(f"Error fetching NVIDIA members: {e}")
             return []
+        except (KeyError, ValueError) as e:
+            print(f"Error parsing API response: {e}")
+            return []
benchmarks/cpp/utils/prepare_synthetic_data.py (1)

10-32: Good refactoring to consolidate duplicate logic.

The helper function effectively consolidates task ID generation and LoRA configuration logic. The warning for inconsistent argument usage is helpful for users.

Consider adding type hints to improve code clarity:

-def _generate_task_ids_and_lora_config(root_args, num_reqs):
+def _generate_task_ids_and_lora_config(root_args, num_reqs: int) -> tuple[list[int], list[int] | None, dict[str, str] | None]:
cpp/kernels/xqa/mla_sm120.cuh (1)

78-89: Consider documenting the hash function's purpose and limitations

The hashRegData function uses a simple XOR hash. While functional, it has limitations:

  • XOR of all elements means different inputs can produce same hash
  • Only works for 4-byte types as enforced by static_assert

Consider adding a comment explaining the hash function's purpose and that it's not cryptographically secure:

+// Simple XOR-based hash for debugging/verification purposes.
+// Not suitable for cryptographic use or collision-resistant hashing.
 template <typename T, uint32_t n>
 __device__ inline uint32_t hashRegData(Vec<T, n> const& data)
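
To illustrate the collision point, a tiny host-side sketch (plain arrays standing in for Vec; values are arbitrary): any permutation of the same elements, and any pair of equal elements, fold to the same XOR hash.

#include <cstdint>
#include <cstdio>

// Stand-in for the device-side XOR fold: XOR is commutative and self-cancelling,
// so different inputs can collide.
static uint32_t xorFold(uint32_t const* data, uint32_t n)
{
    uint32_t h = 0;
    for (uint32_t i = 0; i < n; ++i)
    {
        h ^= data[i];
    }
    return h;
}

int main()
{
    uint32_t const a[3] = {1, 2, 3};
    uint32_t const b[3] = {3, 2, 1}; // permutation of a
    uint32_t const c[2] = {7, 7};    // equal elements cancel to 0
    std::printf("%u %u %u\n", xorFold(a, 3), xorFold(b, 3), xorFold(c, 2)); // prints: 0 0 0
    return 0;
}
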
cpp/include/tensorrt_llm/deep_gemm/jit_utils.cuh (1)

65-65: Consider adding parameter documentation.

The new swap_ab parameter is added to both functions but lacks documentation. Consider adding doxygen-style comments to explain:

  • What swap_ab means (swapping A and B matrix order/layout)
  • How it affects the computation
  • When it should be used

Also applies to: 68-68

cpp/include/tensorrt_llm/deep_gemm/compiler.cuh (1)

247-254: Consider documenting swapAB limitations.

The swapAB mode only supports Normal and GroupedWithOffset gemm types. Consider adding a comment explaining why other types are unsupported or plan for future support.

     else
     {
+        // SwapAB mode currently supports only Normal and GroupedWithOffset types
+        // due to specific scheduling requirements for transposed operations
         switch (gemm_type)
         {
         case deep_gemm::GemmType::Normal: input_type = "NormalSchedulerInputSwapAB"; break;
         case deep_gemm::GemmType::GroupedWithOffset: input_type = "GroupedWithOffsetSchedulerInputSwapAB"; break;
         default: throw std::runtime_error("Unsupported gemm type");
         }
     }
cpp/include/tensorrt_llm/deep_gemm/fp8_gemm_impl.cuh (1)

764-772: Consider caching scale computations.

The scale computations in the inner loop could benefit from pre-computation outside the WGMMA section if the scales don't change frequently.

                         // Each thread reads consecutive two b scales, each thread needs to read WGMMA::N / 4 * 2 b
                         // scales
                         float scale_0_0[WGMMA::kNumAccum / 4], scale_0_1[WGMMA::kNumAccum / 4];
+                        // Pre-compute scales to reduce redundant operations
 #pragma unroll
                         for (int i = 0; i < WGMMA::kNumAccum / 4; ++i)
                         {
                             float2 scale_b
                                 = ld_shared(reinterpret_cast<const float2*>(smem_scales_b[s] + i * 8 + scale_offset));
                             scale_0_0[i] = scale_a_0 * scale_b.x;
                             scale_0_1[i] = scale_a_0 * scale_b.y;
                         }
cpp/kernels/fmha_v2/setup.py (1)

3053-3059: Consider making the cubin usage conditions configurable

The hardcoded conditions for using cubin headers (sm90 with head_size 128 and sm89 with e4m3 dtype) should ideally be configurable through environment variables or a configuration file. This would make it easier to modify the behavior without changing the source code.

Consider adding configuration options:

 def use_cubin_header(sm, head_size, dtype):
+    # Check if there's an override in environment
+    if os.environ.get('FMHA_FORCE_CUBIN_HEADER'):
+        return os.environ['FMHA_FORCE_CUBIN_HEADER'].lower() == 'true'
+    
     return (sm == 90 and head_size == 128) or (sm == 89 and 'e4m3' in dtype)
cpp/include/tensorrt_llm/executor/executor.h (1)

285-304: Consider performance optimizations for MultimodalInput class

The implementation is functionally correct, but consider these performance improvements:

  1. Pass constructor parameters by const reference to avoid unnecessary copies
  2. Return const references from getter methods to avoid copying vectors
-    explicit MultimodalInput(std::vector<std::vector<SizeType32>> multimodalHashes,
-        std::vector<SizeType32> multimodalPositions, std::vector<SizeType32> multimodalLengths);
+    explicit MultimodalInput(std::vector<std::vector<SizeType32>> const& multimodalHashes,
+        std::vector<SizeType32> const& multimodalPositions, std::vector<SizeType32> const& multimodalLengths);

-    [[nodiscard]] std::vector<std::vector<SizeType32>> getMultimodalHashes() const;
-    [[nodiscard]] std::vector<SizeType32> getMultimodalPositions() const;
-    [[nodiscard]] std::vector<SizeType32> getMultimodalLengths() const;
+    [[nodiscard]] std::vector<std::vector<SizeType32>> const& getMultimodalHashes() const;
+    [[nodiscard]] std::vector<SizeType32> const& getMultimodalPositions() const;
+    [[nodiscard]] std::vector<SizeType32> const& getMultimodalLengths() const;
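
As a rough sketch of the shape this suggests (member names and the SizeType32 alias are assumptions, not the actual class definition); taking the constructor parameters by value and moving them into the members is an equally common alternative when callers pass rvalues:

#include <cstdint>
#include <vector>

using SizeType32 = std::int32_t; // assumption: mirrors the executor typedef

class MultimodalInput
{
public:
    explicit MultimodalInput(std::vector<std::vector<SizeType32>> const& multimodalHashes,
        std::vector<SizeType32> const& multimodalPositions, std::vector<SizeType32> const& multimodalLengths)
        : mMultimodalHashes(multimodalHashes)
        , mMultimodalPositions(multimodalPositions)
        , mMultimodalLengths(multimodalLengths)
    {
    }

    // Const-reference getters let callers read the data without copying the vectors.
    [[nodiscard]] std::vector<std::vector<SizeType32>> const& getMultimodalHashes() const noexcept
    {
        return mMultimodalHashes;
    }

    [[nodiscard]] std::vector<SizeType32> const& getMultimodalPositions() const noexcept
    {
        return mMultimodalPositions;
    }

    [[nodiscard]] std::vector<SizeType32> const& getMultimodalLengths() const noexcept
    {
        return mMultimodalLengths;
    }

private:
    std::vector<std::vector<SizeType32>> mMultimodalHashes;
    std::vector<SizeType32> mMultimodalPositions;
    std::vector<SizeType32> mMultimodalLengths;
};
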
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)

2259-2267: Simplify multimodal parameter initialization

The ternary operator chains for converting optional values to shared_ptrs reduce readability. Consider extracting these conversions to helper functions or local variables.

-            multimodalHashes.has_value()
-                ? std::make_shared<std::vector<std::vector<SizeType32>>>(std::move(multimodalHashes.value()))
-                : std::optional<std::shared_ptr<std::vector<std::vector<SizeType32>>>>(std::nullopt),
-            multimodalPositions.has_value()
-                ? std::make_shared<std::vector<SizeType32>>(std::move(multimodalPositions.value()))
-                : std::optional<std::shared_ptr<std::vector<SizeType32>>>(std::nullopt),
-            multimodalLengths.has_value()
-                ? std::make_shared<std::vector<SizeType32>>(std::move(multimodalLengths.value()))
-                : std::optional<std::shared_ptr<std::vector<SizeType32>>>(std::nullopt),
+            convertToSharedPtr(std::move(multimodalHashes)),
+            convertToSharedPtr(std::move(multimodalPositions)),
+            convertToSharedPtr(std::move(multimodalLengths)),

Where convertToSharedPtr is a helper template function that handles the conversion.
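
A minimal sketch of such a helper (the name is the suggestion above; this is not existing code in the PR):

#include <memory>
#include <optional>
#include <utility>

// Wraps an optional container into an optional shared_ptr, preserving
// std::nullopt and moving the payload instead of copying it.
template <typename T>
std::optional<std::shared_ptr<T>> convertToSharedPtr(std::optional<T> value)
{
    if (!value.has_value())
    {
        return std::nullopt;
    }
    return std::make_shared<T>(std::move(*value));
}

With this helper, convertToSharedPtr(std::move(multimodalHashes)) produces the same std::optional<std::shared_ptr<std::vector<std::vector<SizeType32>>>> that the ternary chain builds.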

cpp/include/tensorrt_llm/deep_gemm/fp8_gemm.cuh (1)

291-333: Well-implemented SwapAB variant with clear parameter documentation

The function correctly implements the swapped semantics with N-dimension offsets. The inline comments (/* weight*/, /* act*/) help clarify the swapped parameter roles.

Consider adding similar clarifying comments to the other SwapAB functions for consistency.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncherOss.cu (2)

1-19: Temporary file notice needs clarification.

The comment indicates this is a "temporary open-sourced version" that will be deleted when "open-sourced moe_gemm support MXFP4". This should be tracked properly to ensure cleanup happens.

Would you like me to create an issue to track the removal of this temporary file once MXFP4 support is added to the open-sourced moe_gemm?


294-487: Consider refactoring this large function for better maintainability.

The argGenLoadFile function is over 190 lines and handles multiple responsibilities including JSON parsing, routing configuration management, data type filtering, and benchmark argument generation. Consider breaking it into smaller, focused functions.

Split into functions like:

  • parseRoutingConfig() - Handle routing configuration parsing
  • filterByDataType() - Handle data type filtering logic
  • parseTactics() - Handle tactic ID parsing
  • generateBenchmarkArgs() - Handle final argument generation
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixtureOss.h (3)

44-44: Fix typos in comment.

-// Temporary opend-sourced version. Will be daleted when open-sourced moe_gemm support MXFP4
+// Temporary open-sourced version. Will be deleted when open-sourced moe_gemm supports MXFP4

133-133: Consider making random seeds configurable.

Fixed seeds (0xD5) make benchmarks reproducible but may not represent real-world variance. Consider making seeds configurable.

 struct RandomDistributionRoutingConfig : public RoutingConfig
 {
     using ElementType = float;
-    std::mt19937_64 twister{0xD5};
+    std::mt19937_64 twister;
     // ... other members ...
     
     RandomDistributionRoutingConfig(std::vector<ElementType> const& in_probabilities, int64_t num_experts, int64_t k,
-        std::string name = "random_distribution")
+        std::string name = "random_distribution", uint64_t seed = 0xD5)
-        : probabilities(std::move(in_probabilities))
+        : twister(seed)
+        , probabilities(std::move(in_probabilities))

Also applies to: 213-213


796-805: Fix incomplete label string construction.

The stringstream builds a string with all parameters but only uses the routing config name as the label.

Either use the full string or remove the unused construction:

-    std::stringstream ss;
-    ss << "Experts,K,Hidden,Inter,TP,EP,Rank,Tokens,Bias,Scale,Actfn,Tactic,Routing=";
-    for (auto v : {num_experts, top_k, hidden_size, inter_size, tp_size, ep_size, world_rank, num_tokens,
-             (int) mUseBias, (int) mUseFinalScale, (int) mActType, tactic_idx1, tactic_idx2})
-    {
-        ss << v << ",";
-    }
-    ss << routingConfigCache.at(routing_config)->getName();
-    // state.SetLabel(ss.str());
     state.SetLabel(routingConfigCache.at(routing_config)->getName());
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (5)

73-73: Remove commented out code.

-    // int selfTpRank = selfIdx % selfConfig.getParallelConfig().mTensorParallelism;
     auto targetInfo = executor::kv_cache::targetIRanks(destConfig, selfConfig, selfIdx);

228-228: Fix typo in variable name.

-    auto* agentConnnecion = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
+    auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);

516-516: Fix typo in variable name.

-                    auto* agentConnnecion
+                    auto* agentConnection
                         = dynamic_cast<executor::kv_cache::AgentConnection const*>(pickUpConnections[0]);

516-527: Simplify cacheBufferId assignment logic.

The conditional assignment makes the code harder to follow.

-                    auto* agentConnnecion
-                        = dynamic_cast<executor::kv_cache::AgentConnection const*>(pickUpConnections[0]);
-                    if (agentConnnecion != nullptr)
-                    {
-                        cacheBufferId = agentConnnecion->getCacheBufferId();
-                        TLLM_CHECK(cacheBufferId.has_value());
-                    }
-                    else
-                    {
-                        cacheBufferId = mCacheTransBufferManager->assignBufferIndexForRecv();
-                    }
-                    TLLM_CHECK(cacheBufferId.has_value());
+                    auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(pickUpConnections[0]);
+                    cacheBufferId = agentConnection != nullptr 
+                        ? agentConnection->getCacheBufferId()
+                        : mCacheTransBufferManager->assignBufferIndexForRecv();
+                    TLLM_CHECK(cacheBufferId.has_value(), "Failed to obtain cache buffer ID");

688-754: Optimize validation order for early exit.

Check simpler conditions first for better performance.

Reorder checks from cheapest to most expensive:

[[nodiscard]] bool CacheFormatter::inquireSupport(CacheState const& selfConfig, CacheState const& destConfig) const
{
    // Check simple conditions first
    if (selfConfig.getDataType() != destConfig.getDataType())
    {
        TLLM_LOG_WARNING("CacheFormatter::inquireSupport: selfConfig.getDataType() != destConfig.getDataType()");
        return false;
    }
    
    if (selfConfig.getAttentionConfig().mAttentionType != destConfig.getAttentionConfig().mAttentionType)
    {
        TLLM_LOG_WARNING("CacheFormatter::inquireSupport: only support same attention type");
        return false;
    }
    
    // Then check MLA early since it's not supported
    if (selfConfig.getAttentionConfig().mAttentionType == CacheState::AttentionType::kMLA)
    {
        TLLM_LOG_WARNING("CacheFormatter::inquireSupport: only support non-MLA");
        return false;
    }
    
    // More expensive checks last...
cpp/include/tensorrt_llm/executor/transferAgent.h (1)

235-236: Enhance error message with library name.

Include the library name in the error message for better debugging.

-        void* funcPtr = dlSym(handle, funcName.c_str());
-        TLLM_CHECK_WITH_INFO(funcPtr, funcName + " function is not open correctly.");
+        void* funcPtr = dlSym(handle, funcName.c_str());
+        TLLM_CHECK_WITH_INFO(funcPtr, "Failed to load function '" + funcName + "' from library '" + libName + "'");
cpp/kernels/xqa/utils.cuh (2)

927-985: Well-implemented cluster synchronization utilities.

The cluster query functions and synchronization primitives are correctly implemented using appropriate PTX instructions.

Consider adding brief documentation for the mapa function to explain its purpose:

/**
 * Maps a pointer from the current CTA's address space to another CTA's address space within the cluster.
 * This enables direct memory access across CTAs in a cluster.
 */
template <typename T>
__device__ inline T* mapa(T* src, uint32_t clusterCtaRank)

1033-1058: Timer class is useful but be cautious with printf in kernels.

The Timer implementation is clean, but be aware that printf in CUDA kernels can significantly impact performance and should only be used for debugging.

Consider adding a compile-time flag to disable timer output in production:

#ifdef ENABLE_KERNEL_PROFILING
    printf("%s: %u (block={%u, %u, %u})\n", name, toc - mTic, blockIdx.x, blockIdx.y, blockIdx.z);
#endif
cpp/include/tensorrt_llm/deep_gemm/scheduler.cuh (1)

739-757: Consider documenting why only Normal and GroupedWithOffset are supported.

The SchedulerSelectorSwapAB correctly implements selection for the two supported types with appropriate compile-time validation. However, it would be helpful to add a comment explaining why StridedBatched, GroupedContiguous, and GroupedMasked types are not supported for SwapAB operations.

cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (1)

74-80: Consider adding documentation for the complex return tuple.

The expanded operator() signature appropriately handles the increased complexity with clear parameter types. However, the 4-element return tuple would benefit from documentation explaining what each component represents.

Add a brief comment above the method documenting the return value components:

// Returns: tuple of <tensorPtr, samplingConfigs, tensorPtrs, lookaheadConfigs>
cpp/kernels/xqa/mla_sm120.cu (3)

245-273: Consider documenting the ping-pong synchronization pattern.

The PingPongMutex implementation is clever and correct, but the synchronization pattern would benefit from documentation explaining how the two barriers alternate to ensure mutual exclusion between groups.

Add a comment explaining the pattern:

// PingPongMutex uses two barriers alternating between groups to ensure
// only one group uses tensor cores at a time while the other prepares data

344-350: Static assertion may fail in RTC compilation.

The #ifndef __CUDACC_RTC__ guard around the static assertion is good, but consider adding a runtime assertion as backup for NVRTC compilation scenarios.

#ifdef __CUDACC_RTC__
        assert(nbBars == nbBarsRef);
#endif

1882-1885: Consider more descriptive error messages.

The beam width check throws a generic "not implemented" error. Consider providing more specific error messages to help users understand the limitation.

if (beamWidth != 1)
{
    throw std::runtime_error("MLA kernel currently only supports beam width of 1, got " + std::to_string(beamWidth));
}
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

1315-1334: Comprehensive enhancement of block allocation strategy.

The calculateMaxNumBlocks method has been significantly improved to:

  1. Consider per-window memory requirements through windowSizeToLayers parameter
  2. Account for both primary and secondary memory allocation
  3. Support the new BlocksPerWindow return type for granular allocation

This enables more efficient memory utilization across different attention window configurations and layer distributions.

Consider documenting the memory allocation algorithm in detail, especially the weighting strategy used for distributing blocks across window sizes. This will help future maintainers understand the design decisions.

#pragma once

#include <cstddef>
#include "tensorrt_llm/executor/serialization.h"

⚠️ Potential issue

Missing required header for size_t

The file uses size_t in multiple method signatures (lines 49, 51, 64) but removes <cstddef> which provides this type. While it may compile if size_t is transitively included through serialization.h, this creates a fragile dependency.

Include both headers to ensure all used types are properly declared:

+#include <cstddef>
#include "tensorrt_llm/executor/serialization.h"
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/executor/cacheCommunicator.h at line 19, the header
<cstddef> is missing, which is required to properly declare the size_t type used
in method signatures on lines 49, 51, and 64. Add #include <cstddef> alongside
the existing #include "tensorrt_llm/executor/serialization.h" to explicitly
include the definition of size_t and avoid fragile dependencies on transitive
includes.

Comment on lines +84 to +90
What's the difference between cubins and cu files?

'''
Cubins are precompiled (from the internal fmha_v2 repo) binary files and take a lot of space, cu files are generated directly from this repo. Now we replace most of the kernels with cu files and delete unused cubins.
You can modify code in this repo to change or create your own kernels and run.
Now there are some kernels still running in cubins. See use_cubin_header(setup.py#L3055) and modify_cubin_header(setup.py#L3413) for details.
'''

🛠️ Refactor suggestion

Improve FAQ formatting and avoid hardcoded line numbers.

The FAQ content is helpful but has formatting issues:

  1. Triple quotes (''') are not standard markdown syntax
  2. Hardcoded line numbers in setup.py references may become outdated
-What's the difference between cubins and cu files?
-
-'''
-Cubins are precompiled (from the internal fmha_v2 repo) binary files and take a lot of space, cu files are generated directly from this repo. Now we replace most of the kernels with cu files and delete unused cubins.
-You can modify code in this repo to change or create your own kernels and run.
-Now there are some kernels still running in cubins. See use_cubin_header(setup.py#L3055) and modify_cubin_header(setup.py#L3413) for details.
-'''
+What's the difference between cubins and cu files?
+
+Cubins are precompiled (from the internal fmha_v2 repo) binary files and take a lot of space, cu files are generated directly from this repo. Now we replace most of the kernels with cu files and delete unused cubins.
+You can modify code in this repo to change or create your own kernels and run.
+Now there are some kernels still running in cubins. See `use_cubin_header()` and `modify_cubin_header()` functions in setup.py for details.
🤖 Prompt for AI Agents
In cpp/kernels/fmha_v2/README.md around lines 84 to 90, replace the triple
quotes used for the FAQ explanation with proper markdown formatting such as
backticks for code or blockquotes for text. Remove the hardcoded line numbers in
the references to setup.py and instead describe the relevant sections or
functions by name only, so the documentation remains accurate even if the file
changes.

Comment on lines +15 to +17
uses: actions/setup-python@v3
with:
python-version: '3.x'

⚠️ Potential issue

Update to latest action version.

The actions/setup-python@v3 action is outdated and may not work properly on current GitHub Actions runners.

-        uses: actions/setup-python@v3
+        uses: actions/setup-python@v4
🧰 Tools
🪛 actionlint (1.7.7)

15-15: the runner of "actions/setup-python@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
In .github/workflows/label_community_pr.yml around lines 15 to 17, the GitHub
Action uses an outdated version actions/setup-python@v3 which may cause
compatibility issues. Update the action to the latest stable version, such as
actions/setup-python@v4, to ensure proper functionality on current GitHub
Actions runners.

runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3

⚠️ Potential issue

Update to latest action version.

The actions/checkout@v3 action is outdated and may not work properly on current GitHub Actions runners.

-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
🧰 Tools
🪛 actionlint (1.7.7)

12-12: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
In .github/workflows/label_community_pr.yml at line 12, the actions/checkout@v3
is outdated. Update the version to the latest stable release, such as
actions/checkout@v4, to ensure compatibility and proper functionality with
current GitHub Actions runners.

Comment on lines +15 to +16
uint64_t const nbCycles = std::round(prop.clockRate * ms); // clockRate is in kHz
kernel_warmup<<<16, 128, 0, stream>>>(nbCycles);

🛠️ Refactor suggestion

Verify clockRate unit assumption and improve grid configuration

The code assumes clockRate is in kHz (as stated in the comment), but this should be verified. Additionally, the hardcoded grid configuration (16, 128) might not be optimal for all GPU architectures.

Consider making the grid configuration adaptive:

-    uint64_t const nbCycles = std::round(prop.clockRate * ms); // clockRate is in kHz
-    kernel_warmup<<<16, 128, 0, stream>>>(nbCycles);
+    uint64_t const nbCycles = std::round(prop.clockRate * ms); // clockRate is in kHz
+    int const blocks = std::min(16, prop.multiProcessorCount);
+    int const threads = std::min(128, prop.maxThreadsPerBlock);
+    kernel_warmup<<<blocks, threads, 0, stream>>>(nbCycles);
🤖 Prompt for AI Agents
In cpp/kernels/xqa/test/warmup.cu around lines 15 to 16, verify that
prop.clockRate is indeed in kHz before using it to calculate nbCycles. Replace
the hardcoded kernel launch configuration (16, 128) with a dynamic calculation
based on the GPU's properties, such as the number of multiprocessors and maximum
threads per block, to optimize performance across different architectures.

Comment on lines +254 to +257
size_t ppDomainSize = targetInfo.mDomainPPSize;
size_t bufferTpRank = (processIdx / ppDomainSize) / peerDuplicateHeadFactor;
size_t bufferIdx = (bufferTpRank * ppDomainSize) + (processIdx % ppDomainSize);
if (bufferIdx < bufferCoverTargetNum)

🛠️ Refactor suggestion

Add documentation for complex index calculation.

The buffer index calculation logic is complex and needs explanation.

+            // Calculate buffer index considering duplication factors:
+            // - processIdx: current process index in the loop
+            // - ppDomainSize: pipeline parallelism domain size
+            // - peerDuplicateHeadFactor: how many peers share the same data
+            // - bufferTpRank: tensor parallel rank after accounting for duplication
+            // - bufferIdx: final buffer index combining TP and PP dimensions
             size_t ppDomainSize = targetInfo.mDomainPPSize;
             size_t bufferTpRank = (processIdx / ppDomainSize) / peerDuplicateHeadFactor;
             size_t bufferIdx = (bufferTpRank * ppDomainSize) + (processIdx % ppDomainSize);
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 254 to 257,
the calculation of bufferIdx involves multiple steps and variables, making it
hard to understand. Add a clear comment above this code block explaining the
purpose of each variable (ppDomainSize, bufferTpRank, bufferIdx), the rationale
behind the division and modulo operations, and how these relate to the indexing
logic. This will improve code readability and maintainability.
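
As a quick illustration of the mapping (made-up sizes, not production values), the same arithmetic in a standalone snippet shows that peers which duplicate the same heads land on the same buffer index:

#include <cstddef>
#include <cstdio>

int main()
{
    // Illustrative values only: PP domain of 2 and a duplication factor of 2.
    std::size_t const ppDomainSize = 2;
    std::size_t const peerDuplicateHeadFactor = 2;

    for (std::size_t processIdx = 0; processIdx < 8; ++processIdx)
    {
        std::size_t const bufferTpRank = (processIdx / ppDomainSize) / peerDuplicateHeadFactor;
        std::size_t const bufferIdx = (bufferTpRank * ppDomainSize) + (processIdx % ppDomainSize);
        std::printf("processIdx=%zu -> bufferTpRank=%zu bufferIdx=%zu\n", processIdx, bufferTpRank, bufferIdx);
    }
    // processIdx 0..3 map to bufferIdx 0,1,0,1 and 4..7 map to 2,3,2,3.
    return 0;
}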

Comment on lines +267 to 268
auto sendBufferIdx = bufferIdx % bufferCoverTargetNum;
bufferManager.copy(*outputSplitCaches[processIdx], *outputSplitCaches.at(sendBufferIdx));

⚠️ Potential issue

Fix incorrect index in buffer copy.

Using processIdx instead of bufferIdx when copying to pre-allocated buffer.

             else if (bufferCoverTargetNum > 0)
             {
                 // copy buffer allocated by cudaMallocAsync to buffer allocated by cudaMalloc before sending
                 auto sendBufferIdx = bufferIdx % bufferCoverTargetNum;
-                bufferManager.copy(*outputSplitCaches[processIdx], *outputSplitCaches.at(sendBufferIdx));
+                bufferManager.copy(*outputSplitCaches[bufferIdx], *outputSplitCaches.at(sendBufferIdx));
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp at lines 267-268, the
buffer copy uses processIdx as the source index instead of bufferIdx. Replace
processIdx with bufferIdx in the first argument of bufferManager.copy to
correctly reference the intended buffer for copying.

Comment on lines +228 to +233
auto* agentConnnecion = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
if (agentConnnecion != nullptr)
{
TLLM_CHECK_WITH_INFO(bufferCoverTargetNum == bufferTargetNum, "Agent need all buffer pre-allocated");
TLLM_CHECK(onlyUseDynamicBuffer == false);
}

🛠️ Refactor suggestion

Move agent connection validation before buffer allocation.

The check for agent connections requiring pre-allocated buffers happens after allocation, which could waste resources.

+    auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
+    if (agentConnection != nullptr)
+    {
+        // Agent connections require pre-allocated buffers
+        // Validate this early to avoid unnecessary allocations
+    }
+    
     auto result = mCacheTransBufferManager->getOrAllocateSendBuffers(
         cacheBufferId, bufferTargetNum, targetBufferSize, bufferManager);
     auto& outputSplitCaches = std::get<0>(result);
     auto& bufferCoverTargetNum = std::get<1>(result);
     auto& onlyUseDynamicBuffer = std::get<2>(result);
-    auto* agentConnnecion = dynamic_cast<executor::kv_cache::AgentConnection const*>(connections[0]);
-    if (agentConnnecion != nullptr)
+    if (agentConnection != nullptr)
     {
         TLLM_CHECK_WITH_INFO(bufferCoverTargetNum == bufferTargetNum, "Agent need all buffer pre-allocated");
         TLLM_CHECK(onlyUseDynamicBuffer == false);
     }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 228 to 233,
move the validation that checks if the agent connection requires all buffers to
be pre-allocated before any buffer allocation occurs. This prevents unnecessary
allocation of buffers when the agent connection condition is not met. Adjust the
code so the dynamic_cast and related TLLM_CHECK validations happen prior to
buffer allocation logic.

Comment on lines +50 to +57
if (requestBlockNum < blockRange.size() && requestBlockNum > 0)
{
// handle block reuse, the prefix blocks are reused
// TODO(zhengd): pass the hashes directly instead of from llmRequest; use hash instead of block num
auto const& ids = blockRange.getBlockIds();
blockRange.setBlockIds({ids.end() - requestBlockNum, ids.end()});
}
return blockRange;

⚠️ Potential issue

Add validation for requestBlockNum.

The code assumes requestBlockNum is valid but doesn't check if it exceeds the available blocks.

 if (requestBlockNum < blockRange.size() && requestBlockNum > 0)
 {
     // handle block reuse, the prefix blocks are reused
     // TODO(zhengd): pass the hashes directly instead of from llmRequest; use hash instead of block num
     auto const& ids = blockRange.getBlockIds();
+    TLLM_CHECK(requestBlockNum <= ids.size(), "Requested block number exceeds available blocks");
     blockRange.setBlockIds({ids.end() - requestBlockNum, ids.end()});
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 50 to 57, the
code uses requestBlockNum without validating if it exceeds the number of
available blocks in blockRange. Add a validation step before the if condition to
ensure requestBlockNum is not greater than the size of blockRange and is
positive. If invalid, handle the case appropriately, such as returning an empty
blockRange or throwing an error, to prevent out-of-bounds access.
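
For reference, a small standalone illustration (hypothetical block IDs) of the suffix selection together with the bounds check suggested above:

#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> const ids = {10, 11, 12, 13, 14}; // hypothetical block IDs
    std::size_t const requestBlockNum = 3;

    // Validate before slicing, mirroring the suggested TLLM_CHECK.
    assert(requestBlockNum <= ids.size() && "Requested block number exceeds available blocks");

    if (requestBlockNum > 0 && requestBlockNum < ids.size())
    {
        // Keep only the last requestBlockNum IDs (the suffix that is not reused).
        std::vector<int> const suffix(ids.end() - requestBlockNum, ids.end());
        for (int id : suffix)
        {
            std::printf("%d ", id); // prints: 12 13 14
        }
        std::printf("\n");
    }
    return 0;
}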

Comment on lines +151 to 173
template <Scope scope = defaultScope, ArriveOrder order = ArriveOrder::RELEASE>
__device__ inline mha::conditional_t<scope == Scope::CTA, ArrivalToken, void> arrive_tx_relaxed(uint32_t txCount)
{
if (arriveCount == 1)
#if __CUDA_ARCH__ >= 900
if constexpr (scope == Scope::CTA)
{
ArrivalToken token;
asm volatile("mbarrier.arrive.expect_tx.release.cta.shared::cta.b64 %0, [%1], %2;\n"
asm volatile("mbarrier.arrive.expect_tx.relaxed.cta.b64 %0, [%1], %2;\n"
: "=l"(token)
: "r"(addr()), "r"(txCount));
: "l"(addr()), "r"(txCount)
: "memory");
return token;
}
else
{
asm volatile("mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;\n" ::"r"(addr()), "r"(txCount));
return arrive(arriveCount);
asm volatile("mbarrier.arrive.expect_tx.relaxed.cluster.b64 _, [%0], %1;\n" ::"l"(addr()), "r"(txCount)
: "memory");
return;
}
#else
asm volatile("trap;\n");
#endif
}

⚠️ Potential issue

Remove the unused template parameter.

In arrive_tx_relaxed, the template parameter order is declared but never used in the function body. This appears to be a copy-paste error.

-template <Scope scope = defaultScope, ArriveOrder order = ArriveOrder::RELEASE>
+template <Scope scope = defaultScope>
__device__ inline mha::conditional_t<scope == Scope::CTA, ArrivalToken, void> arrive_tx_relaxed(uint32_t txCount)
🤖 Prompt for AI Agents
In cpp/kernels/xqa/barriers.cuh around lines 151 to 173, the template parameter
'order' of arrive_tx_relaxed is declared but never used, likely a copy-paste
mistake. Remove the unused 'order' template parameter from the function template
declaration to clean up the code.

@hypdeb hypdeb commented Jul 18, 2025

Served its purpose!

@hypdeb hypdeb closed this Jul 18, 2025