
Conversation

zongfeijing
Collaborator

@zongfeijing zongfeijing commented Aug 22, 2025

Summary by CodeRabbit

  • New Features

    • Opt-in low-precision MoE all-to-all path (per-16-element NVFP4/FP8 quantization with 128B-aligned blocks), exposed via a runtime and Python/Torch flag.
    • Exposed workspace helpers and a runtime control to tune the number of usable GPU SMs for performance.
  • Bug Fixes

    • Added CUDA synchronization after workspace initialization to improve stability.
  • Tests

    • Tests extended to cover low-precision flows and relax tolerances where applicable.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
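
For example, a comment like the following reruns a single stage with fail-fast disabled (flags as documented above; the stage name is just the documented example):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast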

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without careful validation can break the top of tree.

@zongfeijing zongfeijing requested review from a team as code owners August 22, 2025 07:49
Contributor

coderabbitai bot commented Aug 22, 2025

📝 Walkthrough

Adds an NVFP4 per-16-element low‑precision path to fused MoE all‑to‑all: device quantize/dequantize utilities, 128B‑aligned metadata and workspace changes, LOW_PRECISION template parameter and compile-time constraints, host/kernel selection by isLowPrecision, and Python/Torch bindings and tests updated to opt into the path.
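
As a rough mental model of the per-16-element scheme (not the kernel's actual implementation), a host-side quantize/dequantize round trip could look like the Python sketch below; the E2M1 value grid and the e4m3 scaling convention are assumptions about the NVFP4 format, and all names are illustrative:

import torch

# Illustrative E2M1 magnitude grid; the sign is carried separately.
E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])

def quant_dequant_nvfp4_reference(x: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Numerical round trip over per-16-element groups; no bit packing or 128B alignment."""
    assert x.numel() % group_size == 0
    groups = x.float().view(-1, group_size)
    amax = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-6)
    # Assumed convention: one global FP32 scale sized so per-group scales fit e4m3 (max 448),
    # plus one per-group scale round-tripped through FP8 e4m3 as it would be when stored.
    global_scale = amax.max() / (6.0 * 448.0)
    group_scale = ((amax / 6.0) / global_scale).to(torch.float8_e4m3fn).float() * global_scale
    group_scale = group_scale.clamp(min=1e-12)
    scaled = groups / group_scale
    # Snap each scaled value to the nearest representable E2M1 magnitude, keeping the sign.
    grid = E2M1_VALUES.to(scaled.device)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return (grid[idx] * scaled.sign() * group_scale).view_as(x)

A helper along these lines is also roughly what the test-side quant_and_dequant reference needs to do before comparing against the low-precision path with relaxed tolerances.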

Changes

  • Kernels: low-precision path, templating, workspace (cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu, .../fusedMoeCommKernels.h): Add NVFP4 device utilities (quantize_nvfp4_sharedmem, dequantize_nvfp4_sharedmem, e2m1_to_fp32_vec), per-16-element group quantization with global/per-group scales and 128B-aligned buffers, new 128B alignment constants, LOW_PRECISION template parameter (compile-time FIELD_COUNT constraint), templated kernel moeAllToAllKernel<FIELD_COUNT,LOW_PRECISION>, new workspace APIs (getFusedMoeCommWorkspaceSize, constructWorkspace, computeMoeAlltoallMaxDynamicSharedMemorySize), metadata fields (originalDataType, isLowPrecision), fillFieldPlacementInfo, public SM control state and setMaxUsableSmCount, and host launch plumbing updated to select low‑precision kernels and manage shared-memory/packing hooks.
  • Torch op and bindings (cpp/tensorrt_llm/thop/moeCommOp.cpp): Expose optional use_low_precision to moeCommOp and the Torch binding; set params.isLowPrecision; pass originalDataType into fillFieldInfo; propagate the low‑precision flag to fillMetaInfo for send/recv fields.
  • Python MNNVL utilities (tensorrt_llm/_mnnvl_utils.py): Add a use_low_precision_combine: bool = False parameter to mnnvl_moe_alltoallv_combine and forward it to moe_comm.
  • Fused MoE EP wrapper (tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py): Forward self.use_low_precision_combine into the MnnvlMoe.mnnvl_moe_alltoallv_combine call.
  • Tests (Python) (tests/unittest/_torch/thop/test_moe_alltoall.py): Add a quant_and_dequant helper; parameterize tests with use_low_precision; gate low‑precision tests by GPU SM version; when enabled, quantize/dequantize reference outputs and relax tolerances; pass the flag to moe_comm.
  • Tests (C++) (cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp): Add a getCudaDataType(int) helper; update test fillFieldInfo calls/signature to accept and pass a cudaDataType_t mapping element sizes to CUDA data types; propagate changes across the G2S/S2G/loopback/local-fifo tests.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Py as Python caller
  participant Op as moe_comm (Torch C++)
  participant Host as Host launcher
  participant Ker as moeAllToAllKernel<FIELD_COUNT,LOW_PRECISION>
  participant Dev as Device utils (NVFP4)

  Py->>Op: moe_comm(..., use_low_precision=?)
  Op->>Host: build params (isLowPrecision), fill field meta (originalDataType)
  Host->>Host: select kernel via getFunc(fieldCount, isLowPrecision)
  alt isLowPrecision == true
    Host->>Ker: launch moeAllToAllKernel<..., true>
    note right of Ker #EEF6F3: FIELD_COUNT==1 enforced for low‑precision
    Ker->>Dev: quantize_nvfp4_sharedmem (per‑16 elems) -> write global + per‑group scales
    Ker->>Ker: pack -> all‑to‑all exchange
    Ker->>Dev: dequantize_nvfp4_sharedmem (use scales) -> unpack
  else
    Host->>Ker: launch moeAllToAllKernel<..., false>
    Ker->>Ker: pack -> all‑to‑all exchange -> unpack (FP path)
  end
  Ker-->>Host: kernel complete
  Host-->>Op: return outputs
  Op-->>Py: Tensor[]

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • hlu1
  • hyukn
  • WeiHaocheng
  • yizhang-nv

@zongfeijing zongfeijing marked this pull request as draft August 22, 2025 07:54
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

66-67: Fix malformed default parameter value.

The default parameter has an incomplete expression MoEWeightLoadingMode. without specifying the enum value.

Apply this diff to fix the default parameter:

-        weight_loading_mode: MoEWeightLoadingMode = MoEWeightLoadingMode.
-        VANILLA,
+        weight_loading_mode: MoEWeightLoadingMode = MoEWeightLoadingMode.VANILLA,
🧹 Nitpick comments (15)
tests/unittest/_torch/thop/test_moe_alltoall.py (5)

75-81: Verify workspace size units (bytes vs. elements) before division by 8

For moe_comm workspace you divide by 8 (uint64 elements), while the prepare workspace uses the raw value. Ensure both size getters return consistent units; otherwise this may under/over‑allocate.

If both APIs return bytes, adjust prepare workspace allocations similarly:

- all_workspaces = torch.zeros(ep_size,
-                              workspace_size,
-                              dtype=torch.uint64,
-                              device=torch.device('cuda'))
+ all_workspaces = torch.zeros(
+     ep_size, workspace_size // 8, dtype=torch.uint64, device='cuda'
+)

If they already return element counts, remove the // 8 in the moe_comm allocations instead. Please confirm via the operator docs or quick prints.

Also applies to: 254-259, 549-556
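
If both getters do turn out to return bytes, a tiny helper along these lines (names hypothetical) makes the unit conversion explicit and avoids under-allocating on odd sizes:

import torch

def alloc_uint64_workspace(ep_size: int, workspace_size_bytes: int) -> torch.Tensor:
    # Round the byte count up to whole 8-byte uint64 elements.
    num_elems = (workspace_size_bytes + 7) // 8
    return torch.zeros(ep_size, num_elems, dtype=torch.uint64, device="cuda")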


557-563: Warmup call uses ep_size=1; likely mismatch with test parameterization

The first warmup prepare call hardcodes ep_size=1, while the rest of the test uses the parameterized ep_size. This can mask issues and/or size calculations for the workspace.

Apply this diff to use the test’s ep_size:

-            torch.ops.trtllm.mnnvl_moe_alltoallv_prepare_without_allgather(
-                expert_ids_all_ranks[0], experter_count_lists[0],
-                all_workspaces, max_token_count_per_rank, 0, 1, expert_count,
-                slot_count, top_k)
+            torch.ops.trtllm.mnnvl_moe_alltoallv_prepare_without_allgather(
+                expert_ids_all_ranks[0],
+                experter_count_lists[0],
+                all_workspaces,
+                max_token_count_per_rank,
+                0,
+                ep_size,
+                expert_count,
+                slot_count,
+                top_k,
+            )

566-586: Remove dead allocations immediately overwritten below

You allocate placeholder tensors into local_* lists and then immediately reset them to empty lists, making the allocations dead code.

Apply this diff to drop the unneeded block:

-        # Make torch alloc tensor to avoid cuda sync
-        local_send_rank_count_cumsum = []
-        local_send_rank_indices = []
-        local_recv_rank_count_cumsum = []
-        local_recv_rank_indices = []
-        backward_local_recv_rank_indices = []
-        for _ in range(ep_size):
-            local_send_rank_count_cumsum.append(
-                torch.empty(ep_size,
-                            dtype=torch.int32,
-                            device=torch.device('cuda')))
-            local_send_rank_indices.append(
-                torch.empty(max_token_count_per_rank * ep_size,
-                            dtype=torch.int32,
-                            device=torch.device('cuda')))
-            local_recv_rank_count_cumsum.append(
-                torch.empty(0, dtype=torch.int32, device=torch.device('cuda')))
-            local_recv_rank_indices.append(
-                torch.empty(0, dtype=torch.int32, device=torch.device('cuda')))
-            backward_local_recv_rank_indices.append(
-                torch.empty(0, dtype=torch.int32, device=torch.device('cuda')))
-
-        local_send_rank_count_cumsum = []
-        local_send_rank_indices = []
-        local_recv_rank_count_cumsum = []
-        local_recv_rank_indices = []
-        backward_local_recv_rank_indices = []
+        # These will be filled by the prepare op for the target rank.
+        local_send_rank_count_cumsum = []
+        local_send_rank_indices = []
+        local_recv_rank_count_cumsum = []
+        local_recv_rank_indices = []
+        backward_local_recv_rank_indices = []

Also applies to: 587-592


309-406: Make FP8 test assertions deterministic and reduce stdout noise

The FP8 test prints shapes/counts and treats any exception as failure. Prefer explicit assertions on returned shapes/dtypes and gate the test if FP8 isn’t supported on the current GPU/runtime.

Suggested adjustments:

  • Replace prints with asserts on tensor shapes, dtypes, and that the scatter/gather preserves non‑zeros for the selected indices.
  • Add a skip guard when not torch.cuda.is_available() or when runtime lacks torch.float8_e4m3fn kernels (e.g., Hopper requirement).
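
A minimal, self-contained version of that pattern might look like the following, where a toy FP8 round trip stands in for the real all-to-all call and the capability check is an assumption about typical FP8 requirements:

import pytest
import torch

def fp8_supported() -> bool:
    # FP8 e4m3 kernels typically need SM90+ and a PyTorch build with float8 support.
    return (torch.cuda.is_available()
            and hasattr(torch, "float8_e4m3fn")
            and torch.cuda.get_device_capability()[0] >= 9)

@pytest.mark.skipif(not fp8_supported(), reason="FP8 requires SM90+ and float8_e4m3fn support")
def test_fp8_roundtrip_shapes():
    x = torch.randn(8, 16, device="cuda")
    y = x.to(torch.float8_e4m3fn)
    # Assert on dtype/shape instead of printing them.
    assert y.dtype == torch.float8_e4m3fn
    assert y.shape == x.shape
    # Non-zero inputs should remain non-zero after the cast (within FP8 range).
    assert y.float().abs().sum() > 0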

526-528: Line length exceeds linter limit (Ruff E501)

A few lines exceed 120 chars. Consider wrapping with parentheses for better readability and to appease CI linters.

Also applies to: 606-607
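
For example, implicit string concatenation inside parentheses keeps long literals under the limit without changing behavior:

long_message = (
    "first part of a long diagnostic string, "
    "continued on the next line via implicit concatenation"
)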

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

587-589: Make unsupported post‑quant all‑to‑all explicit for MNNVL

Currently this branch is a no‑op (pass). Consider raising or logging to surface misconfiguration early when TRTLLM_MOE_POST_QUANT_ALLTOALLV=1 is set with MNNVL.

Apply this diff:

-            if self.alltoall_method_type == AlltoallMethodType.MNNVL:
-                pass
+            if self.alltoall_method_type == AlltoallMethodType.MNNVL:
+                raise NotImplementedError(
+                    "Post‑quant all‑to‑all is not supported for MNNVL. "
+                    "Unset TRTLLM_MOE_POST_QUANT_ALLTOALLV or switch backend."
+                )

1-1: Missing NVIDIA SPDX header

Per repo guidelines, prepend the NVIDIA SPDX header to source files.

Apply this diff:

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES.
+# SPDX-License-Identifier: Apache-2.0
+
 import os
tensorrt_llm/_torch/models/modeling_speculative.py (1)

1-1: Missing NVIDIA SPDX header

Please prepend the standard NVIDIA SPDX header.

Apply this diff:

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES.
+# SPDX-License-Identifier: Apache-2.0
+
 from typing import Any, Dict, Generic, Optional, Tuple
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (2)

34-38: Consider using consistent naming conventions for struct members.

According to the coding guidelines, member variables should use mPrefix lowerCamelCase (e.g., mEpSize, mEpRank). While public members may omit the prefix, using it is encouraged for consistency.

Apply this diff to follow the naming convention:

struct MoeEpWorldInfo
{
-    int epSize;
-    int epRank;
+    int mEpSize;
+    int mEpRank;
};

40-44: Consider using consistent naming conventions for struct members.

The member variables should follow mPrefix lowerCamelCase convention (e.g., mExpertCount, mTopK).

Apply this diff to follow the naming convention:

struct MoeExpertParallelInfo
{
-    int expertCount = -1;
-    int topK = 1;
+    int mExpertCount = -1;
+    int mTopK = 1;
};
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)

373-375: Add explicit shape assertions for x_sf before swizzling

To catch any unexpected padding or shape mismatches after communication, it’s safer to assert that x_sf really has the dimensions you expect—namely (x_row, ceil_div(x_col, scaling_vector_size))—before calling swizzle_sf. In fused_moe_cutlass.py you can insert these checks in both the AllToAll and AllGather branches:

• In the AllToAll branch (around lines 373–375):

             if x_sf is not None:
-                x_row = x_sf.shape[0]
-                x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)
+                # verify shape matches precomputed x_row/x_col
+                assert x_sf.shape[0] == x_row and x_sf.shape[1] == ceil_div(x_col, self.scaling_vector_size), \
+                    f"Unexpected x_sf shape {tuple(x_sf.shape)}, expected ({x_row}, {ceil_div(x_col, self.scaling_vector_size)})"
+                # swizzle using known dimensions
+                x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)

• In the post-quant AllGather branch (around lines 392–394):

             if x_sf is not None:
-                x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)
+                # verify shape matches gathered x_row/x_col
+                assert x_sf.shape[0] == x_row and x_sf.shape[1] == ceil_div(x_col, self.scaling_vector_size), \
+                    f"Unexpected x_sf shape {tuple(x_sf.shape)} after allgather, expected ({x_row}, {ceil_div(x_col, self.scaling_vector_size)})"
+                x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)

These assertions mirror the pattern used in fused_moe_wide_ep.py and will surface any inconsistency in the scaling‐factor tensor dimensions early in execution.

cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)

1-15: Update copyright year to 2025.

The copyright notice should be updated to include 2025 to match other files in this PR.

Apply this diff:

/*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
 *
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (2)

125-137: Optimize counter communication with reduced memory traffic.

The current implementation writes values one by one in a loop (Lines 129-137). For better performance, consider batching the writes when possible, especially when communicationCount is large.

Consider coalescing writes for better memory access patterns:

 for (int i = threadIdx.x; i < communicationCount; i += blockDim.x)
 {
     int value = i == 0 ? *(sharedSendRecvRankCount) : *(expertStatics + i - 1);
-    counter.releaseValue(value, i);
     if (i == 0)
     {
         *(sendCounts + targetRankId) = value;
     }
+    counter.releaseValue(value, i);
 }

83-88: Template function parameter not utilized consistently.

The template parameter kThreadsGroupSize is defined but the function also takes runtime parameters that might conflict with compile-time assumptions.

Consider adding a static assertion to ensure the template parameter matches the runtime expectations:

 template <int kThreadsGroupSize>
 __device__ __forceinline__ void computeCountAndSendStatics(int* experts, int tokenCount, int* sharedSendRecvRankCount,
     int* sendCounts, int* sendIndiceWorkspace, int* backwardIndiceWorkspace, int* expertStatics,
     MoeCommWorkspace workspace, int maxTokenCountPerRank, int slotCount, int expertCount, int topK, int epRank,
     int epSize)
 {
+    static_assert(kThreadsGroupSize <= 32 && (kThreadsGroupSize & (kThreadsGroupSize - 1)) == 0, 
+                  "kThreadsGroupSize must be a power of 2 and <= 32");
     cg::thread_block_tile<kThreadsGroupSize> tile = cg::tiled_partition<kThreadsGroupSize>(cg::this_thread_block());
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

169-169: Extract and document the 512 limit for consistent validation

  • The literal 512 in the TORCH_CHECK corresponds exactly to the block size used by the MOE communication kernel (see cpp/tensorrt_llm/kernels/moePrepareKernels.cu:316: dim3 block(512);). expertCount + 1 must not exceed the number of threads per block.

  • Define a shared constant, for example in a new header (moeCommon.h):

    // Maximum number of experts per block (must match GPU threads per block in MOE kernels)
    constexpr int kMaxExpertsPerBlock = 512;
  • In cpp/tensorrt_llm/thop/moeCommOp.cpp (line 169), replace:

    - TORCH_CHECK(expertCount + 1 <= 512, "expertCount + 1 is larger than 512");
    + TORCH_CHECK(expertCount + 1 <= kMaxExpertsPerBlock,
        "expertCount + 1 must not exceed the number of threads per block");
  • In cpp/tensorrt_llm/kernels/moePrepareKernels.cu (around line 316), replace:

    - dim3 block(512);
    + dim3 block(kMaxExpertsPerBlock);
  • Add a brief comment in both locations explaining that this cap exists because the kernel launches at most kMaxExpertsPerBlock threads per block, and expertCount+1 must fit within that.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 898f37f and a075095.

📒 Files selected for processing (19)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernels.h (0 hunks)
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu (7 hunks)
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h (3 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (6 hunks)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp (0 hunks)
  • cpp/tests/unit_tests/kernels/CMakeLists.txt (1 hunks)
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1 hunks)
  • tensorrt_llm/_mnnvl_utils.py (5 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5 hunks)
  • tests/integration/test_lists/waives.txt (0 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (9 hunks)
💤 Files with no reviewable changes (4)
  • cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
  • tests/integration/test_lists/waives.txt
  • cpp/tensorrt_llm/kernels/moeCommKernels.h
  • cpp/tensorrt_llm/kernels/moeCommKernels.cu
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • tensorrt_llm/_mnnvl_utils.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_mnnvl_utils.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
🧠 Learnings (8)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tensorrt_llm/kernels/moePrepareKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/moePrepareKernels.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-21T02:39:11.984Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:11.984Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-21T21:48:35.105Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.105Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
  • moe_load_balancer_set_repeated_for_next_layer (1037-1046)
tests/unittest/_torch/thop/test_moe_alltoall.py (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (16)
  • _ (13-56)
  • _ (60-62)
  • _ (65-68)
  • _ (71-76)
  • _ (79-84)
  • _ (87-99)
  • _ (102-107)
  • _ (110-115)
  • _ (118-123)
  • _ (126-138)
  • _ (141-147)
  • _ (150-151)
  • _ (154-157)
  • _ (161-162)
  • _ (165-166)
  • _ (169-180)
tensorrt_llm/_mnnvl_utils.py (1)
  • mnnvl_moe_alltoallv_prepare_without_allgather (399-443)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
  • create_moe (61-211)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (3)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7)
  • _ (239-289)
  • _ (365-373)
  • _ (451-461)
  • _ (630-657)
  • _ (690-700)
  • _ (774-784)
  • _ (874-890)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2)
  • _ (273-316)
  • _ (536-559)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
  • top_k (221-221)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
tensorrt_llm/_torch/utils.py (2)
  • _ (190-196)
  • shape (103-104)
tensorrt_llm/_mnnvl_utils.py (3)
  • MnnvlMoe (350-617)
  • mnnvl_moe_alltoallv_prepare_without_allgather (399-443)
  • mnnvl_moe_alltoallv (528-589)
tensorrt_llm/quantization/utils/fp8_utils.py (1)
  • ceil_div (10-21)
cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
  • memsetExpertIds (230-248)
  • memsetExpertIds (230-231)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
tensorrt_llm/_mnnvl_utils.py (7)
  • MnnvlMemory (53-336)
  • initialize (91-100)
  • get_moe_workspaces (358-373)
  • get_moe_prepare_workspace (376-387)
  • mnnvl_moe_alltoallv_prepare_without_allgather (399-443)
  • MoEAlltoallInfo (340-347)
  • mnnvl_moe_alltoallv (528-589)
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py

526-526: Line too long (236 > 120)

(E501)


528-528: Line too long (248 > 120)

(E501)


606-606: Line too long (186 > 120)

(E501)


607-607: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

465-465: Line too long (124 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (31)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)

82-87: No changes needed to moe_comm call sites
The C++ binding for moe_comm defines the last two parameters as optional—bool[]? need_zero_output=None and bool? use_low_precision=None—so calls may either include or omit the boolean list without breaking the API. In tests, some cases explicitly pass [True] to exercise zero-padding behavior, while others rely on the default (no zero-padding) to validate both code paths. Similarly, the MNNVL utilities intentionally omit the flag when zero-padding isn’t needed and include it in the combine path where it is needed.

Because both patterns are valid and cover distinct behaviors, unifying all calls to always include or always omit the boolean list would either force redundant defaults or lose test coverage. Please keep the existing calls as-is.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

196-202: Initialize MNNVL workspaces eagerly — LGTM

Allocating both dispatch and prepare workspaces up front for MNNVL keeps the forward path clean and avoids first‑use syncs. Looks good.


299-301: Confirm intent: always allow all‑to‑all for MNNVL, even when chunking would be used

Returning True here bypasses the chunking guard that disables all‑to‑all for other backends. If MNNVL handles chunking/pathological shapes reliably, this is fine; otherwise we may want parity with the chunking check to avoid oversized buffers.

Can you confirm MNNVL supports cases where calculate_num_chunks(all_rank_num_tokens) > 1 without extra memory pressure or perf regressions?


580-585: Correctly dispatch through the two‑step MNNVL flow

Using alltoall_dispatch with the alltoall_info produced by alltoall_prepare is correct, and passing top_k ensures memset_expert_ids can rebuild the compact expert IDs later in combine.

cpp/tests/unit_tests/kernels/CMakeLists.txt (1)

45-46: Add fusedMoeCommKernelTest — LGTM

Registering the fused MoE comm kernel tests unconditionally is consistent with the surrounding test targets.

tensorrt_llm/_torch/models/modeling_speculative.py (1)

342-344: Set load‑balancer repeat count based on next‑N layout — LGTM

Using num_nextn_predict_layers // mtp_num_layers correctly derives the repeat factor for the next layer in both MTP‑Eagle (1 layer) and standard Next‑N cases.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

61-62: Import cleanup — LGTM

Switching to import only create_moe aligns with the removal of the local call sites.


1134-1141: Confirm removal of moe_load_balancer_set_repeated_for_next_layer here

This path now no‑ops where the load‑balancer repeat count was previously set. Given that modeling_speculative.MTPForCausalLM now sets it centrally, this is likely intended to avoid double‑setting. Please confirm this covers all DeepSeek‑V3 speculative configurations (vanilla and non‑vanilla MTP).

If any non‑spec flows still rely on per‑module configuration, we might need a fallback here guarded by the relevant flags.

cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (4)

1-15: LGTM!

The copyright notice follows the Apache 2.0 license template correctly with the updated year range (2019-2025).


16-19: LGTM!

The header guard and include are properly structured - using #pragma once and including the required <stdint.h> for fixed-width integer types.


25-29: LGTM!

Good use of conditional compilation for the ALIGN_256 macro to ensure portability between CUDA and host code.


31-32: LGTM!

The WARP_SIZE and WARP_MASK constants are properly defined and follow the naming convention (kPREFIXED_UPPER_SNAKE for constants).

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (4)

190-195: LGTM! Simplified alltoall gating logic.

The removal of the experts_per_token % 4 == 0 constraint aligns with the new fused MoE communication infrastructure that handles arbitrary topK values.


354-357: LGTM! Clean adaptation to the new MoE prepare API.

The refactoring correctly updates the function call to return only alltoall_info and gathered_expert_statics, aligning with the simplified fused MoE communication interface.


363-367: LGTM! Elegant multi-input alltoall dispatch.

The new implementation cleanly handles multiple inputs in a single alltoall kernel call, improving efficiency compared to the previous multi-step approach.


369-371: LGTM! Proper expert ID initialization.

The new memset_expert_ids call correctly initializes expert metadata using the fused MoE infrastructure.

tensorrt_llm/_mnnvl_utils.py (4)

369-372: LGTM! Proper workspace initialization with synchronization.

The addition of moe_initialize_workspace followed by a barrier ensures the workspace is properly initialized before use across all ranks.


443-443: LGTM! Clean API simplification.

The function now returns only the essential outputs (alltoall_info and gathered_expert_statics), removing unused intermediate values.


528-589: Excellent multi-input handling with None-preservation logic!

The implementation elegantly handles both single tensors and lists with optional entries, preserving None positions in the output. This is a robust design that maintains API flexibility.
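
The None-preserving behavior can be pictured with a small standalone sketch (illustrative only; the real path batches all non-None tensors into a single kernel call rather than mapping them one by one):

from typing import Callable, List, Optional, Union

import torch

def apply_preserving_none(
    inputs: Union[torch.Tensor, List[Optional[torch.Tensor]]],
    fn: Callable[[torch.Tensor], torch.Tensor],
) -> Union[torch.Tensor, List[Optional[torch.Tensor]]]:
    # A single tensor passes straight through; in a list, None slots stay None.
    if isinstance(inputs, torch.Tensor):
        return fn(inputs)
    return [None if t is None else fn(t) for t in inputs]

# apply_preserving_none([x, None, y], fn) -> [fn(x), None, fn(y)]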


602-614: LGTM! Consistent adaptation to the new moe_comm interface.

The combine function correctly wraps the single input tensor in a list and extracts the first element from the result, maintaining compatibility with the new multi-input API.

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2)

182-202: LGTM! Well-designed multi-input fake implementation.

The fake implementation correctly handles the new multi-input signature, creating appropriately sized output tensors for each input.


246-250: LGTM! New memset_expert_ids op properly registered.

The new operation is correctly registered with appropriate parameter types matching the C++ implementation.

cpp/tensorrt_llm/kernels/moePrepareKernels.h (4)

35-35: LGTM! Cleaner constant definition.

Using static constexpr instead of macros is a better practice for type safety.


71-74: LGTM! Enhanced function signature for expert statistics.

The addition of expertStatics and gatheredExpertStatics parameters along with slotCount and expertCount aligns with the new statics-based counting approach in the fused MoE implementation.


82-83: LGTM! New expert ID initialization function.

The memsetExpertIds function provides necessary functionality for the fused MoE communication stack.


43-48: Verify sufficient buffer capacity for MoeCommFifoConnInfo::values.

Please ensure that the hard-coded 512-element values array can accommodate the worst-case number of in-flight entries in each FIFO channel. Specifically:

  • Audit all kernel implementations that use MoeCommFifoConnInfo (e.g. in moePrepareKernels.cu) to determine the maximum number of writes to values before reads advance tail.
  • Confirm host-side orchestration (the code that updates head/tail) never allows (head – tail) % 512 to exceed 511.
  • If any configuration (for example, channelCount × maxOutstandingRequests) could breach this limit, either increase the array size or make it a runtime parameter.
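
For the capacity audit, the invariant can be expressed as a small model of the ring buffer; the keep-one-slot-free convention below is an assumption used for illustration, not something confirmed from the implementation:

FIFO_CAPACITY = 512  # size of the values array in this sketch

def fifo_occupancy(head: int, tail: int) -> int:
    # Entries currently in flight; head is the write index, tail the read index.
    return (head - tail) % FIFO_CAPACITY

def can_push(head: int, tail: int) -> bool:
    # Keep one slot free so a full buffer stays distinguishable from an empty one,
    # i.e. occupancy never exceeds FIFO_CAPACITY - 1 (511 here).
    return fifo_occupancy(head, tail) < FIFO_CAPACITY - 1
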
cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)

84-87: Ensure consistent naming conventions for function parameters.

The parameter name slotCount already matches the lowerCamelCase pattern used by the surrounding parameters (e.g., expertCount, topK); just confirm the same name is used consistently between the declaration and the definition, per the parameter-naming guideline.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

1219-1229: Potential null pointer dereference in kernel attribute query.

The function computeMoeAlltoallMaxDynamicSharedMemorySize casts a templated function to void const* without validation, which could lead to issues if the function address is invalid.

Add error checking for the function attributes query:

 int computeMoeAlltoallMaxDynamicSharedMemorySize()
 {
     int devId = -1;
     TLLM_CUDA_CHECK(cudaGetDevice(&devId));
     cudaFuncAttributes attr{};
-    TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, (void const*) moeAllToAllKernel<1>));
+    auto kernelPtr = reinterpret_cast<void const*>(moeAllToAllKernel<1>);
+    TLLM_CHECK_WITH_INFO(kernelPtr != nullptr, "Failed to get kernel function pointer");
+    TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, kernelPtr));
     int staticSmem = static_cast<int>(attr.sharedSizeBytes);

Likely an incorrect or invalid review comment.


49-183: Add CUDA architecture guards for NVFP4 quantization functions.

The quantize_nvfp4_sharedmem function uses CUDA 10.0+ features but only checks at Lines 55-56. The entire function should be guarded to prevent compilation issues on older architectures.

Wrap the entire function with architecture guards:

 template <typename DType>
 __device__ __forceinline__ void quantize_nvfp4_sharedmem(uint8_t* compact_ptr, int sizeInBytes, int laneId)
 {
-#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
+#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ < 1000)
+    // Function not supported on this architecture
+#else
     int const numElems = sizeInBytes / sizeof(DType);
     // ... rest of the function ...
-#endif
+#endif // __CUDA_ARCH__ >= 1000
 }

Likely an incorrect or invalid review comment.

cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1)

1095-1096: Verify workspace size calculation parameters

I confirmed that the only workspace-size helper is

  • static size_t computeWorkspaceSizePreRank(int epSize, int channelCount) in
    cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h:520
    and that your test invokes it exactly once at
    cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:1095–1096.

The test passes

auto workspaceSizePerRank = FusedMoeWorkspace::computeWorkspaceSizePreRank(1, totalChannelCount);
auto totalWorkspaceSize     = workspaceSizePerRank;

– using 1 for epSize and totalChannelCount for channelCount. This is correct only if:

  • you really intend to test the single-endpoint case (epSize == 1), and
  • totalChannelCount represents the number of channels per endpoint (not the global channel count across multiple endpoints).

Please verify that, under your channel distribution:

  • epSize is always 1 in this test, and
  • the value passed as totalChannelCount is the per-endpoint channel count expected by computeWorkspaceSizePreRank.

If either assumption doesn’t hold (e.g. testing multi-endpoint or using a global channel count), you’ll need to adjust the arguments or multiply by the number of endpoints to ensure the test’s expected workspace size matches the implementation.

cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

112-113: Ignore the low-precision flag “redundancy” – it’s intentional.

The local useLowPrecisionVal is

  • assigned to params.isLowPrecision so the subsequent moeAllToAll kernel dispatch and shape checks can specialize on low precision
  • passed again into fillMetaInfo so the meta-info sizing logic handles the low-precision path correctly

No change is needed here.

Likely an incorrect or invalid review comment.

@kaiyux kaiyux requested a review from dongxuy04 August 25, 2025 00:33
@zongfeijing zongfeijing force-pushed the user/zongfeij/mnnvl_lowprecision_all2all branch from a075095 to 552be46 Compare August 25, 2025 06:49
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)

261-273: Static member initialization not thread-safe.

The setMaxUsableSmCount function modifies static members without synchronization, which could lead to race conditions if called from multiple threads.


298-309: Static map access not thread-safe.

The getMoeCommChannelCount function uses a static std::map without synchronization, which can cause race conditions.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (4)

175-216: Inline PTX assembly lacks proper CUDA architecture guards and error handling.

The e2m1_to_fp32_vec function uses inline PTX assembly with an unrealistic architecture check (__CUDA_ARCH__ >= 1000). CUDA compute capabilities don't reach 1000 - the highest current value is around 90.


419-459: Fix parameter shadowing in fillFieldInfo.

Line 458 assigns to the parameter originalDataType instead of the member variable, which is a bug.


1153-1223: Grid-level synchronization issue between send and receive phases.

The kernel launches both sender and receiver logic without proper synchronization between them, which could lead to race conditions.


1300-1311: Large memory operation without size validation.

The initializeLocalWorkspace method performs large memory operations without validating sizes.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a075095 and 552be46.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (6 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
  • tensorrt_llm/_mnnvl_utils.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_mnnvl_utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_mnnvl_utils.py
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧠 Learnings (2)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10)
  • idx (458-488)
  • idx (458-458)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
  • fillMetaInfo (1239-1269)
  • fillMetaInfo (1239-1240)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (15)
tensorrt_llm/_mnnvl_utils.py (2)

592-600: LGTM! Parameter added correctly.

The use_low_precision_combine parameter is properly added as an optional boolean with a default value of False, maintaining backward compatibility.


613-614: Parameter correctly propagated to moe_comm operation.

The new parameter is properly passed as the final argument to the torch operation, following the established pattern in this codebase.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)

72-79: LGTM! Field additions properly aligned and documented.

The new fields are correctly added to support low-precision operations:

  • alignedUnitBit/Count/Stride enable flexible data alignment
  • unalignedFieldIndex and compact16BOffset support memory layout optimization
  • originalDataType preserves type information for quantization

The alignment constants (Lines 85-88) are well-defined for memory access patterns.


408-409: Function signature properly extended for low-precision support.

The addition of the isLowPrecision parameter to fillMetaInfo correctly enables conditional sizing calculations for low-precision paths.


424-424: LGTM! Low-precision flag properly added to kernel parameters.

The isLowPrecision boolean flag is correctly placed in the parameter structure to control kernel behavior.

cpp/tensorrt_llm/thop/moeCommOp.cpp (3)

31-37: LGTM! Function correctly derives and passes originalDataType.

The helper function properly converts the torch data type to CUDA data type and passes it to fillFieldInfo.


112-117: LGTM! Low-precision flag properly handled and propagated.

The optional parameter is correctly converted to a boolean with a sensible default (false), assigned to the kernel parameters, and propagated to both send and receive metadata initialization.


257-257: LGTM! Python binding correctly updated.

The Torch binding signature properly exposes the new optional use_low_precision parameter, maintaining backward compatibility.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (7)

33-173: Quantization implementation looks good with proper CUDA architecture guards.

The NVFP4 quantization function is well-structured with:

  • Proper architecture check for SM 10.0+
  • Efficient warp-level reduction for global max
  • Per-16-element grouping with FP8 scales
  • Appropriate memory layout for packed data

The implementation follows NVIDIA's recommended patterns for low-precision quantization.


218-269: Dequantization implementation correctly mirrors quantization logic.

The dequantization function properly:

  • Reads the global scale from the expected location
  • Iterates through groups to reconstruct values
  • Applies scales in the correct order
  • Has proper architecture guards
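
To make the layout above concrete, the following is a minimal host-side round-trip sketch of the per-16-element scheme the two comments describe (one FP32 global scale, one scale per 16-element group, 4-bit E2M1 values). All names, the rounding rule, and the exact scale encoding are illustrative assumptions, not the kernel's actual code.

#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative E2M1 (4-bit) magnitude levels; sign is carried separately.
static const float kE2M1Levels[] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};

// Round a value (already divided by its scales) to the nearest E2M1 level.
static float roundToE2M1(float x)
{
    float mag = std::fabs(x);
    float best = 0.f;
    for (float level : kE2M1Levels)
    {
        if (std::fabs(mag - level) < std::fabs(mag - best))
        {
            best = level;
        }
    }
    return std::copysign(best, x);
}

// Reference quantize/dequantize round trip for one row of FP16/BF16 data
// (represented here as float): global scale + per-16-element group scales.
std::vector<float> quantDequantRow(std::vector<float> const& row)
{
    constexpr int kGroupSize = 16;
    float globalAmax = 0.f;
    for (float v : row)
    {
        globalAmax = std::max(globalAmax, std::fabs(v));
    }
    // Global scale maps the row maximum onto the E2M1 range (max magnitude 6.0).
    float globalScale = globalAmax > 0.f ? globalAmax / 6.f : 1.f;

    std::vector<float> out(row.size());
    for (size_t g = 0; g < row.size(); g += kGroupSize)
    {
        size_t end = std::min(row.size(), g + kGroupSize);
        float groupAmax = 0.f;
        for (size_t i = g; i < end; ++i)
        {
            groupAmax = std::max(groupAmax, std::fabs(row[i]));
        }
        // Per-group scale relative to the global scale (stored as FP8 in the kernel).
        float groupScale = groupAmax > 0.f ? groupAmax / (6.f * globalScale) : 1.f;
        for (size_t i = g; i < end; ++i)
        {
            float q = roundToE2M1(row[i] / (groupScale * globalScale)); // 4-bit payload
            out[i] = q * groupScale * globalScale;                      // dequantized value
        }
    }
    return out;
}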

889-892: LGTM! Proper compile-time enforcement for low-precision constraints.

The static assertion correctly enforces that low-precision mode only supports a single field, preventing misuse at compile time.


1010-1027: Quantization properly integrated in send path.

The conditional quantization block correctly:

  • Checks for LOW_PRECISION at compile time
  • Extracts necessary field metadata
  • Switches on the original data type
  • Calls the appropriate typed quantization function

1094-1110: Dequantization properly integrated in receive path.

The dequantization logic mirrors the quantization path appropriately, ensuring data is correctly restored after transfer.


1239-1269: Low-precision metadata calculation looks correct.

The function properly computes compact sizes for low-precision mode:

  • Validates single field requirement
  • Checks for supported data types (FP16/BF16)
  • Calculates space for packed values + scales + global scale
  • Aligns to 128-byte boundaries
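
As a back-of-the-envelope check of that sizing, here is a sketch of the per-token compact size implied by the layout above (constant and function names are hypothetical; the actual computation lives in fillMetaInfo):

#include <cstddef>

// Hypothetical constants mirroring the layout described above.
constexpr size_t kGroupSize = 16;   // elements covered by one FP8 scale
constexpr size_t kAlignBytes = 128; // compact blocks are 128B-aligned

constexpr size_t lowPrecisionCompactBytes(size_t hiddenSize)
{
    size_t const packedValueBytes = hiddenSize / 2;                       // 4 bits per element
    size_t const scaleBytes = (hiddenSize + kGroupSize - 1) / kGroupSize; // one FP8 scale per 16 elements
    size_t const globalScaleBytes = sizeof(float);                        // one FP32 global scale
    size_t const total = packedValueBytes + scaleBytes + globalScaleBytes;
    return (total + kAlignBytes - 1) / kAlignBytes * kAlignBytes;         // round up to 128B
}

// Example: 7168 FP16/BF16 elements -> 3584 + 448 + 4 = 4036 bytes -> 4096 bytes after alignment.
static_assert(lowPrecisionCompactBytes(7168) == 4096, "sanity check of the example above");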

1327-1349: Kernel dispatch correctly handles low-precision flag.

The kernel selection logic properly:

  • Validates that low-precision only works with single field
  • Selects appropriate kernel template based on field count and precision mode
  • Falls back to non-low-precision variants for multi-field cases

@zongfeijing force-pushed the user/zongfeij/mnnvl_lowprecision_all2all branch from 552be46 to 552a379 on August 25, 2025 at 07:09
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

31-37: Update all fillFieldInfo invocations to the new five-argument signature

The unit tests in fusedMoeCommKernelTest.cpp are still calling the old four-argument form. Please update each of these call sites to pass the additional originalDataType argument (for example, via convert_torch_dtype(...) or the appropriate DataType::… enum).

• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:241
• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:388
• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:615
• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:1070

For example, change:

fieldsInfo[i].fillFieldInfo(deviceField, elementSize, vectorSize, vectorSize);

to something like:

fieldsInfo[i].fillFieldInfo(deviceField,
                            elementSize,
                            vectorSize,
                            vectorSize,
                            /* originalDataType – e.g. */ convert_torch_dtype(tensor.scalar_type()));
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

421-462: Critical: originalDataType member is never set in MoeCommFieldInfo::fillFieldInfo.

The assignment originalDataType = originalDataType; is a self-assignment of the parameter and does not update the member. This silently breaks low-precision branching that relies on fieldsInfo[i].originalDataType.

Apply this fix:

-__host__ void MoeCommFieldInfo::fillFieldInfo(
-    uint8_t* dataPtr, size_t elementSize, int vectorSize, int stride, cudaDataType_t originalDataType)
+__host__ void MoeCommFieldInfo::fillFieldInfo(
+    uint8_t* dataPtr, size_t elementSize, int vectorSize, int stride, cudaDataType_t originalDataType)
 {
   ...
-    alignedUnitCount = vectorSize;
-    alignedUnitStride = stride;
-    originalDataType = originalDataType;
+    alignedUnitCount = vectorSize;
+    alignedUnitStride = stride;
+    this->originalDataType = originalDataType;
 }
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

177-220: Inline PTX lacks portable fallback and guards are overly restrictive.

e2m1_to_fp32_vec uses inline PTX gated at CUDA_ARCH >= 1000 with no else path. Provide a safe fallback (e.g., zero the output or a scalar decoder) and use realistic arch guards. This mirrors prior feedback.

Example minimal fallback:

 inline __device__ void e2m1_to_fp32_vec(uint64_t e2m1Vec, float2 (&array)[8])
 {
-#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
+// Blackwell fast-path
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
   ...
-#endif
+#else
+  #pragma unroll
+  for (int i = 0; i < 8; ++i) { array[i] = make_float2(0.f, 0.f); }
+#endif
 }

1301-1313: Large memset without bounds validation in initializeLocalWorkspace.

Before cuMemsetD32/cudaMemset, verify fifoSize + senderSideInfoSize + receiverSideInfoSize does not exceed rankStrideInU64*sizeof(uint64_t). This matches earlier guidance.

 void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo)
 {
   ...
-  TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE,
-      fifoSize / sizeof(uint32_t)));
+  size_t totalBytes = fifoSize + senderSideInfoSize + receiverSideInfoSize;
+  TLLM_CHECK_WITH_INFO(totalBytes <= rankStrideInU64 * sizeof(uint64_t),
+                       "Workspace size (%zu) exceeds allocated stride (%zu).",
+                       totalBytes, rankStrideInU64 * sizeof(uint64_t));
+  TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr),
+      FusedMoeProto::INITIALIZED_VALUE, fifoSize / sizeof(uint32_t)));
   TLLM_CUDA_CHECK(cudaMemset(
       reinterpret_cast<uint8_t*>(localWorkspacePtr) + fifoSize, 0, senderSideInfoSize + receiverSideInfoSize));
 }
🧹 Nitpick comments (3)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)

230-248: Minor: error message typo and clarity in memset_expert_ids.

  • Line 234: Check enforces expertsIds.dim() == 2 but the message says “1D tensor”. Please correct to “2D tensor” to avoid confusion.
-    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor");
+    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor");

171-194: Naming consistency for RecvRankCountCumSum variable.

Local variables should use lowerCamelCase per guidelines. RecvRankCountCumSum starts with an uppercase letter and is inconsistent with sendRankCountCumSum. Consider renaming to recvRankCountCumSum for readability.

Also applies to: 211-227

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1227-1237: Static shared memory attribute sampled from non-LOW_PRECISION variant.

computeMoeAlltoallMaxDynamicSharedMemorySize queries attributes from moeAllToAllKernel<1> (LOW_PRECISION=false). If static smem ever diverges between variants, this could under/over-estimate. Consider probing both or the chosen kernelFn.

-    TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, (void const*) moeAllToAllKernel<1>));
+    // Probe the conservative (largest) variant if they diverge in future.
+    TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, (void const*) moeAllToAllKernel<1, true>));
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 552be46 and 552a379.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (6 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
  • tensorrt_llm/_mnnvl_utils.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_mnnvl_utils.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10)
  • idx (458-488)
  • idx (458-458)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)

35-37: Good addition: propagate original dtype into field metadata.

Passing convert_torch_dtype(t.scalar_type()) into fillFieldInfo ensures the kernel can reason about the original element type for low-precision packing. This unblocks accurate size/meta calculations downstream.


257-257: Schema sync for new argument looks correct.

The Torch library def reflects the new optional flag at the end, maintaining backward compatibility for positional callers.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

23-24: Include dependency is appropriate.

Including quantization.cuh here is expected for NVFP4/FP8 conversions and keeps device helpers localized to this TU.


874-897: Constrain low-precision field count at compile-time (nice).

static_assert(FIELD_COUNT == 1) is a good defensive check and keeps meta/layout logic simple.

@zongfeijing force-pushed the user/zongfeij/mnnvl_lowprecision_all2all branch from 552a379 to 02c6e07 on August 25, 2025 at 07:29
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

877-1161: SingleChannelCommunicator template instantiation with LOW_PRECISION requires architecture checks.

The SingleChannelCommunicator template with LOW_PRECISION=true will call quantization/dequantization functions that are no-ops on architectures < SM100. This should be checked at instantiation time.

Add architecture validation in the constructor:

 __device__ __forceinline__ SingleChannelCommunicator(FusedMoeFieldInfo const& fieldInfo,
     MoeExpertParallelInfo const& expertParallelInfo, MoeSingleCommMeta const& commMeta,
     FusedMoeWorkspace const& workspace, FusedMoeWorldInfo const& worldInfo, FusedMoePairInfo const& pairInfo,
     uint64_t* smemBar, uint8_t* shmemBase)
     : mFieldInfo(fieldInfo)
     , mExpertParallelInfo(expertParallelInfo)
     , mCommMeta(commMeta)
     , mWorkspace(workspace)
     , mWorldInfo(worldInfo)
     , mPairInfo(pairInfo)
     , mSmemBar(smemBar)
     , mShmemBase(shmemBase)
 {
     if constexpr (LOW_PRECISION)
     {
         static_assert(FIELD_COUNT == 1, "Low precision alltoall only support 1 field");
+#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 1000
+        // This will cause a compile error on unsupported architectures
+        static_assert(false, "LOW_PRECISION requires SM100+ (CUDA_ARCH >= 1000)");
+#endif
     }
♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (6)

180-223: Duplicate comment: Missing error handling in inline PTX assembly.


463-463: Duplicate comment: fillFieldInfo method modifies originalDataType parameter incorrectly.


1015-1033: Duplicate comment: Architecture gating: low-precision path compiles to no-ops on < SM100 and will corrupt I/O.


1249-1274: Duplicate comment: Low-precision compact size calculation is inconsistent with the quantized layout.


466-599: Duplicate comment: Complex LL128Proto implementation needs documentation.


1304-1316: Duplicate comment: Large memory operation without size validation.

🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

1331-1354: Template instantiation for low-precision kernel variants lacks compile-time safety.

The getFunc lambda (Lines 1334-1353) returns kernel function pointers based on runtime parameters. The low-precision path is only instantiated for fieldCount == 1, but this constraint is not enforced at compile time.

Consider using template specialization or static_assert to enforce the constraint:

+template<int FIELD_COUNT>
+struct KernelSelector {
+    static auto get(bool lowPrecision) {
+        if constexpr (FIELD_COUNT == 1) {
+            return lowPrecision ? fused_moe_impl::moeAllToAllKernel<1, true> 
+                                : fused_moe_impl::moeAllToAllKernel<1>;
+        } else {
+            // Low precision not supported for multiple fields
+            TLLM_CHECK(!lowPrecision);
+            return fused_moe_impl::moeAllToAllKernel<FIELD_COUNT>;
+        }
+    }
+};

 auto getFunc = [](int fieldCount, bool lowPrecision)
 {
     switch (fieldCount)
     {
-    case 1:
-        if (lowPrecision)
-            return fused_moe_impl::moeAllToAllKernel<1, true>;
-        else
-            return fused_moe_impl::moeAllToAllKernel<1>;
+    case 1: return KernelSelector<1>::get(lowPrecision);
     case 2: return KernelSelector<2>::get(lowPrecision);
     // ... etc

1467-1467: Test harness doesn't test low-precision path.

The launchSingleG2S function always passes false for isLowPrecision when calling fillMetaInfo. The test infrastructure should be extended to support testing the low-precision quantization path.

Would you like me to generate test code that exercises the low-precision quantization/dequantization paths? This would help validate the implementation and catch potential issues early.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 552a379 and 02c6e07.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (6 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
  • tensorrt_llm/_mnnvl_utils.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • tensorrt_llm/_mnnvl_utils.py
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (2)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10)
  • idx (458-488)
  • idx (458-458)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1096-1115: I’ve added searches to locate the FusedMoeProto struct and inspect both protoPack and protoUnpack implementations for how quantized data (scales, values, global scale) are handled. Once we see those definitions, we can confirm if the unpacking matches the expected layout before dequantization.

@zongfeijing force-pushed the user/zongfeij/mnnvl_lowprecision_all2all branch 2 times, most recently from 94b3029 to 9e82990 on August 25, 2025 at 16:00
@zongfeijing marked this pull request as ready for review on August 25, 2025 at 16:02
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1334-1357: Host-side gating for LOW_PRECISION kernels is missing

Enforce compute capability >= 10.x when params.isLowPrecision is true. Otherwise, on Hopper/Ampere builds the device helpers are no-ops and the protocol/layout diverge.

 void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream)
 {
+    if (params.isLowPrecision)
+    {
+        int dev = -1;
+        int major = 0;
+        TLLM_CUDA_CHECK(cudaGetDevice(&dev));
+        TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev));
+        TLLM_CHECK_WITH_INFO(major >= 10,
+            "Low-precision (NVFP4) all-to-all requires SM100+ (compute capability 10.x). Detected SM%d0.", major);
+    }
@@
-    auto getFunc = [](int fieldCount, bool lowPrecision)
+    auto getFunc = [](int fieldCount, bool lowPrecision)
     {
         switch (fieldCount)
         {
         case 1:
             if (lowPrecision)
                 return fused_moe_impl::moeAllToAllKernel<1, true>;
             else
                 return fused_moe_impl::moeAllToAllKernel<1>;
♻️ Duplicate comments (10)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (6)

32-40: Volatile head/tail do not provide synchronization; use atomics or explicit fences

Using volatile for inter-CTA producer/consumer coordination is insufficient on CUDA; it does not guarantee atomicity or ordering. Replace with plain fields and perform updates with atomics and/or memory fences at the access sites (sender/receiver paths already touched by cp.async and fences), or wrap accesses with appropriate cuda::atomic_ref on supported toolchains.

 struct ALIGN_256 SenderSideFifoInfo
 {
-    volatile uint64_t head; // write position
-    volatile uint64_t tail; // read position
+    uint64_t head; // write position (updated with atomics/fences at use sites)
+    uint64_t tail; // read position  (updated with atomics/fences at use sites)
 };
 
 struct ALIGN_256 ReceiverSideFifoInfo
 {
-    volatile uint64_t head; // write position do we use this?
-    volatile uint64_t tail; // read position
+    uint64_t head; // write position (updated with atomics/fences at use sites)
+    uint64_t tail; // read position  (updated with atomics/fences at use sites)
 };

If you prefer device-side C++ atomics:

  • include <cuda/atomic>
  • use cuda::atomic_ref<uint64_t, cuda::thread_scope_system>(head).store(..., memory_order_release) etc.
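
For illustration, a standalone sketch of that atomic_ref approach (struct simplified and function names hypothetical; only the libcu++ calls are real API):

#include <cuda/atomic>
#include <cstdint>

struct SenderSideFifoInfo
{
    uint64_t head; // write position
    uint64_t tail; // read position
};

// Producer publishes a new head with release semantics; system scope keeps the
// update visible to peers accessing the same multi-node workspace.
__device__ void publishHead(SenderSideFifoInfo* info, uint64_t newHead)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> headRef(info->head);
    headRef.store(newHead, cuda::std::memory_order_release);
}

// Consumer observes the head with acquire semantics before reading FIFO data.
__device__ uint64_t readHead(SenderSideFifoInfo* info)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> headRef(info->head);
    return headRef.load(cuda::std::memory_order_acquire);
}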

156-167: Pointer arithmetic lacks bounds validation; guard and clamp the copy range

get16BAlignedLoadCopyRange() assumes rawDataPtr and sizes are valid and performs arithmetic in signed int. Use uintptr_t/size_t, clamp to raw range, and handle nulls to avoid UB.

 __device__ __forceinline__ uint8_t* get16BAlignedLoadCopyRange(int index, int* copyByteCount) const
 {
-    int rawSize;
-    uint8_t* rawDataPtr = getRawPtr(index, &rawSize);
-    uint8_t* rawEndPtr = rawDataPtr + rawSize;
-    uint8_t* alignedDataPtr
-        = reinterpret_cast<uint8_t*>(reinterpret_cast<uint64_t>(rawDataPtr) & (~kAlign16BytePtrMask));
-    uint32_t copySize = rawEndPtr - alignedDataPtr;
-    *copyByteCount
-        = (copySize & kAligned16BMask) != 0 ? (copySize & (~kAligned16BMask)) + BYTES_PER_16B_BLOCK : copySize;
-    return alignedDataPtr;
+    int rawSize = 0;
+    uint8_t* rawDataPtr = getRawPtr(index, &rawSize);
+    if (rawDataPtr == nullptr || rawSize <= 0) {
+        if (copyByteCount) *copyByteCount = 0;
+        return nullptr;
+    }
+    uint8_t* rawEndPtr = rawDataPtr + static_cast<size_t>(rawSize);
+    auto alignedDataPtr = reinterpret_cast<uint8_t*>(
+        (reinterpret_cast<uintptr_t>(rawDataPtr)) & (~static_cast<uintptr_t>(kAlign16BytePtrMask)));
+    // clamp start to rawDataPtr
+    if (alignedDataPtr < rawDataPtr) alignedDataPtr = rawDataPtr;
+    size_t copySize = static_cast<size_t>(rawEndPtr - alignedDataPtr);
+    uint32_t copySizeU32 = static_cast<uint32_t>(copySize);
+    uint32_t padded = (copySizeU32 & kAligned16BMask) != 0
+        ? (copySizeU32 & (~kAligned16BMask)) + BYTES_PER_16B_BLOCK
+        : copySizeU32;
+    if (copyByteCount) *copyByteCount = static_cast<int>(padded);
+    return alignedDataPtr;
 }

169-202: Store-range computation can underflow/overflow; compute with unsigned and clamp head/tail indices

As above, compute pointers/offsets with uintptr_t/size_t, clamp alignedDataPtr/alignedEndPtr inside raw range, and set headTail indices to -1 when out of bounds.

 __device__ __forceinline__ uint8_t* get16BAlignedStoreCopyRange(
     int index, int* copyByteCount, int laneId, int* headTailShmIdx, int* headTailGlobalIdx) const
 {
-    int rawSize;
-    uint8_t* rawDataPtr = getRawPtr(index, &rawSize);
-    uint8_t* rawEndPtr = rawDataPtr + rawSize;
-    int offset = reinterpret_cast<uint64_t>(rawDataPtr) & kAlign16BytePtrMask;
-    uint8_t* alignedDataPtr
-        = reinterpret_cast<uint8_t*>(reinterpret_cast<uint64_t>(rawDataPtr) + BYTES_PER_16B_BLOCK - offset);
-    uint8_t* alignedEndPtr
-        = reinterpret_cast<uint8_t*>(reinterpret_cast<uint64_t>(rawEndPtr) & (~kAlign16BytePtrMask));
-    int alignedCopyBytes = alignedEndPtr - alignedDataPtr;
+    int rawSize = 0;
+    uint8_t* rawDataPtr = getRawPtr(index, &rawSize);
+    if (!rawDataPtr || rawSize <= 0) { if (copyByteCount) *copyByteCount = 0; if (headTailShmIdx) *headTailShmIdx = -1; if (headTailGlobalIdx) *headTailGlobalIdx = -1; return nullptr; }
+    uint8_t* rawEndPtr = rawDataPtr + static_cast<size_t>(rawSize);
+    int offset = static_cast<int>(reinterpret_cast<uintptr_t>(rawDataPtr) & kAlign16BytePtrMask);
+    auto alignedDataPtr = reinterpret_cast<uint8_t*>(
+        reinterpret_cast<uintptr_t>(rawDataPtr) + BYTES_PER_16B_BLOCK - offset);
+    auto alignedEndPtr = reinterpret_cast<uint8_t*>(
+        reinterpret_cast<uintptr_t>(rawEndPtr) & (~static_cast<uintptr_t>(kAlign16BytePtrMask)));
+    if (alignedDataPtr < rawDataPtr) alignedDataPtr = rawDataPtr;
+    if (alignedEndPtr > rawEndPtr) alignedEndPtr = rawEndPtr;
+    int alignedCopyBytes = static_cast<int>(alignedEndPtr - alignedDataPtr);
     if (alignedCopyBytes < 0)
     {
         alignedCopyBytes = 0;
     }
     *copyByteCount = alignedCopyBytes;
@@
-    *headTailGlobalIdx = *headTailShmIdx - offset;
+    *headTailGlobalIdx = *headTailShmIdx - offset;
     if (*headTailGlobalIdx < 0 || *headTailGlobalIdx >= rawSize)
     {
         *headTailGlobalIdx = -1;
         *headTailShmIdx = -1;
     }
     return alignedDataPtr;
 }

261-273: setMaxUsableSmCount modifies static state without synchronization

Concurrent calls can race on maxSmCountUsed and maxSmCount. Guard with a mutex; also make reads in getMaxUsableSmCount consistent.

Add a mutex member and lock the setter/getter bodies:

 class FusedMoeCommunicator
 {
 public:
@@
-    static void setMaxUsableSmCount(int maxUsableSmCount)
+    static std::mutex maxSmCountMutex;
+    static void setMaxUsableSmCount(int maxUsableSmCount)
     {
+        std::lock_guard<std::mutex> lock(maxSmCountMutex);
         TLLM_CHECK_WITH_INFO(
             FusedMoeCommunicator::maxSmCountUsed == false, "setMaxUsableSmCount can be called only before it is used");
@@
-        FusedMoeCommunicator::maxSmCount = maxUsableSmCount;
+        FusedMoeCommunicator::maxSmCount = maxUsableSmCount;
     }

And in getMaxUsableSmCount():

-    static int getMaxUsableSmCount()
+    static int getMaxUsableSmCount()
     {
+        std::lock_guard<std::mutex> lock(maxSmCountMutex);
         FusedMoeCommunicator::maxSmCountUsed = true;
         if (FusedMoeCommunicator::maxSmCount == -1)
         {
             int smCount = tensorrt_llm::common::getMultiProcessorCount();
             FusedMoeCommunicator::maxSmCount = smCount;
         }
         return FusedMoeCommunicator::maxSmCount;
     }

Note: add #include <mutex> at the top and provide a definition in the .cu: std::mutex FusedMoeCommunicator::maxSmCountMutex;


298-309: Static map access is not thread-safe

Concurrent callers of getMoeCommChannelCount() can race on the static map. Guard with a static mutex.

     static int getMoeCommChannelCount(int epSize)
     {
         static std::map<int, int> channelCountMap{};
+        static std::mutex mapMutex;
+        std::lock_guard<std::mutex> lock(mapMutex);
         auto iter = channelCountMap.find(epSize);
         if (iter == channelCountMap.end())
         {
             auto channelCount = FusedMoeCommunicator::computeMoeCommChannelCount(epSize);
             channelCountMap[epSize] = channelCount;
             return channelCount;
         }
         return iter->second;
     }

518-524: Workspace size computation may overflow silently

Multiplying large epSize/channelCount can overflow size_t on some platforms or wrap intermediates. Add checked arithmetic and verify final total does not exceed the allocated rank stride.

     static size_t computeWorkspaceSizePreRank(int epSize, int channelCount)
     {
-        size_t fifoSize = static_cast<size_t>(FusedMoeCommunicator::FIFO_TOTAL_BYTES) * epSize * channelCount;
-        size_t senderSideInfoSize = sizeof(SenderSideFifoInfo) * epSize * channelCount;
-        size_t receiverSideInfoSize = sizeof(ReceiverSideFifoInfo) * epSize * channelCount;
-        return fifoSize + senderSideInfoSize + receiverSideInfoSize;
+        auto mul_ok = [](size_t a, size_t b, size_t& out) -> bool {
+            if (a == 0 || b <= std::numeric_limits<size_t>::max() / a) { out = a * b; return true; }
+            return false;
+        };
+        size_t fifoSize = 0, senderSideInfoSize = 0, receiverSideInfoSize = 0, tmp = 0;
+        TLLM_CHECK_WITH_INFO(mul_ok(static_cast<size_t>(FusedMoeCommunicator::FIFO_TOTAL_BYTES), epSize, tmp)
+                                 && mul_ok(tmp, channelCount, fifoSize),
+            "computeWorkspaceSizePreRank overflow on fifoSize");
+        TLLM_CHECK_WITH_INFO(mul_ok(sizeof(SenderSideFifoInfo), epSize, tmp) && mul_ok(tmp, channelCount, senderSideInfoSize),
+            "computeWorkspaceSizePreRank overflow on senderSideInfoSize");
+        TLLM_CHECK_WITH_INFO(mul_ok(sizeof(ReceiverSideFifoInfo), epSize, tmp) && mul_ok(tmp, channelCount, receiverSideInfoSize),
+            "computeWorkspaceSizePreRank overflow on receiverSideInfoSize");
+        size_t total = fifoSize + senderSideInfoSize + receiverSideInfoSize;
+        TLLM_CHECK_WITH_INFO(total >= fifoSize, "computeWorkspaceSizePreRank overflow on total");
+        return total;
     }
tests/unittest/_torch/thop/test_moe_alltoall.py (1)

252-269: Reference composition bug: shared output_indice_offset corrupts per-tensor references

Each tensor dimension writes into the same [start:end] slice using a single shared offset, then advances only once. Maintain independent offsets for each tensor to prevent overlap.

-        total_recv_all_ranks_cpu = []
-        output_indice_offset = 0
+        total_recv_all_ranks_cpu = []
+        # Maintain independent offsets per tensor
+        output_offsets = [0] * tensor_count
@@
-                for i in range(tensor_count):
-                    ref_output_tensors[i][output_indice_offset:output_indice_offset + local_recv_count_pair] = \
-                        input_tensors_all_ranks[i][other_rank][send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]]]
-                output_indice_offset += local_recv_count_pair
+                for i in range(tensor_count):
+                    start = output_offsets[i]
+                    end = start + local_recv_count_pair
+                    ref_output_tensors[i][start:end] = input_tensors_all_ranks[i][other_rank][
+                        send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]]
+                    ]
+                    output_offsets[i] = end
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

112-117: Bug: topK used uninitialized when hasBasicFields=false

params.expertParallelInfo.topK is never set here; passing it into fillMetaInfo is undefined behavior. When hasBasicFields=false, topK is unused; pass a deterministic 0 instead.

-    bool useLowPrecisionVal = useLowPrecision.value_or(false);
-    params.isLowPrecision = useLowPrecisionVal;
-    params.sendFieldInfo.fillMetaInfo(
-        &(params.sendCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
-    params.recvFieldInfo.fillMetaInfo(
-        &(params.recvCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
+    bool useLowPrecisionVal = useLowPrecision.value_or(false);
+    params.isLowPrecision = useLowPrecisionVal;
+    constexpr int kTopKNotUsed = 0;
+    params.sendFieldInfo.fillMetaInfo(
+        &(params.sendCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal);
+    params.recvFieldInfo.fillMetaInfo(
+        &(params.recvCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal);
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

1306-1318: Zeroing large workspace regions without size validation can overrun allocation

Before cuMemsetD32/cudaMemset, verify that the computed total fits within rankStrideInU64*8 and that fifoSize is divisible by sizeof(uint32_t) for cuMemsetD32.

 void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo)
 {
@@
-    uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64;
+    uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64;
+    size_t total = fifoSize + senderSideInfoSize + receiverSideInfoSize;
+    TLLM_CHECK_WITH_INFO(total <= rankStrideInU64 * sizeof(uint64_t),
+        "FusedMoeWorkspace overflow: required=%zu allocated=%zu", total, rankStrideInU64 * sizeof(uint64_t));
+    TLLM_CHECK_WITH_INFO((fifoSize % sizeof(uint32_t)) == 0, "fifoSize must be a multiple of 4 for cuMemsetD32");
     TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE,
         fifoSize / sizeof(uint32_t)));

1015-1032: LOW_PRECISION quantize is compiled to no-op on < SM100; add host-side capability gate

On pre-SM100 devices, the quantize/dequant helpers are empty due to arch guards, yet the LOW_PRECISION meta sizes/layout are still used, corrupting I/O. Add a CC check in moeAllToAll before launching LOW_PRECISION kernels.

Apply the host-side gate below (see moeAllToAll).

🧹 Nitpick comments (8)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)

105-116: Local shadowing of alignedUnitBytes harms readability and risks mistakes

Inside getFieldUncompactSize(), alignedUnitBytes is re-declared with a different meaning. Rename the inner constant to avoid shadowing and accidental misuse.

-        int alignedUnitBytes = 1 << alignedUnitBit;
+        int alignedUnitBytes = 1 << alignedUnitBit;
         int currentFieldSize = alignedUnitCount * alignedUnitBytes;
         if (alignedUnitBytes != 16)
         {
-            constexpr int alignedUnitBytes = BYTES_PER_16B_BLOCK;
-            currentFieldSize = currentFieldSize / alignedUnitBytes * alignedUnitBytes;
-            currentFieldSize += alignedUnitBytes * 2;
+            constexpr int kAligned16B = BYTES_PER_16B_BLOCK;
+            currentFieldSize = currentFieldSize / kAligned16B * kAligned16B;
+            currentFieldSize += kAligned16B * 2;
         }

286-296: Typo in variable name and minor nit

Variable perferredChannel should be preferredChannel.

-        int perferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication
-        int channelCount = std::max(perferredChannel, 1);          // at lease one channel
+        int preferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication
+        int channelCount = std::max(preferredChannel, 1);          // at least one channel

16-16: Repository guideline requires include guards in headers

Our project guideline mandates include guards of the form TRTLLM_<FILE_NAME>_H for headers. Please add them (you can keep #pragma once as a fast-path).

Suggested wrapper (apply at top and bottom of the header):

// Top of file:
#ifndef TRTLLM_FUSEDMOECOMMKERNELS_H
#define TRTLLM_FUSEDMOECOMMKERNELS_H
#pragma once

// ... existing header content ...

// Bottom of file:
#endif // TRTLLM_FUSEDMOECOMMKERNELS_H
tests/unittest/_torch/thop/test_moe_alltoall.py (1)

337-343: Low-precision reference conversion loops over tokens; vectorize to speed up

Per-token quant/dequant is O(T) Python overhead. Convert the entire ref_output tensor once.

-            if use_low_precision:
-                for token_id in range(ref_output_tensors[i].shape[0]):
-                    ref_output_tensors[i][token_id] = quant_and_dequant(
-                        ref_output_tensors[i][token_id])
-                atol, rtol = 1e-2, 1e-2
+            if use_low_precision:
+                # Vectorized conversion
+                ref_output_tensors[i] = quant_and_dequant(ref_output_tensors[i])
+                atol, rtol = 1e-2, 1e-2
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

74-86: Validate low-precision preconditions early (field count and dtype)

Fail fast at the Python binding when low-precision is requested but inputs are incompatible (e.g., multiple fields or unsupported dtype). Improves UX vs. hitting kernel assertions later.

     int fieldCount = inputs.size();
     TORCH_CHECK(fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, "Number of fields (", fieldCount,
         ") exceeds maximum allowed (", tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, ")");
+    bool useLowPrecisionValEarly = useLowPrecision.has_value() && useLowPrecision.value();
+    if (useLowPrecisionValEarly) {
+        TORCH_CHECK(fieldCount == 1, "Low-precision alltoall supports exactly 1 tensor field.");
+        auto dt = inputs[0].scalar_type();
+        TORCH_CHECK(dt == torch::kHalf || dt == torch::kBFloat16,
+            "Low-precision alltoall requires float16 or bfloat16 input, got ", c10::toString(dt));
+    }
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (3)

35-178: Quantize path: architecture guard OK, but still consider minimal fallback to avoid accidental use

quantize_nvfp4_sharedmem is guarded for CUDA_ARCH >= 1000. Given host-side gating is missing (see below), older architectures will compile LOW_PRECISION branches into no-ops. Add a host gate (in moeAllToAll) and optionally a debug fill or assert here under #else to catch misuse in testing.

Example debug fallback:

 #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
   // existing code...
 #else
-  // no-op
+  // Debug fallback: do nothing but could zero compact_ptr to make misuse visible
+  if (laneId == 0 && sizeInBytes > 0) {
+      for (int i = 0; i < min(sizeInBytes, 64); ++i) compact_ptr[i] = 0;
+  }
 #endif

183-223: Inline PTX helper lacks non-SM100 path; ensure callers are gated or provide stub

e2m1_to_fp32_vec is SM100-specific. Either guarantee all call sites compile only for >=1000 (host gate below), or add a stub that zero-fills the array so accidental calls on lower architectures don't propagate uninitialized data.

 #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
   // existing PTX path
 #else
-  // (no body)
+  #pragma unroll
+  for (int i = 0; i < 8; ++i) { array[i] = make_float2(0.f, 0.f); }
 #endif

877-898: Assert invariants for LOW_PRECISION and add alignment assert for shared memory base

You already static_assert FIELD_COUNT==1 under LOW_PRECISION. Add a shared-memory 128B alignment assert to catch misconfigurations early.

     , mShmemBase(shmemBase)
 {
     if constexpr (LOW_PRECISION)
     {
         static_assert(FIELD_COUNT == 1, "Low precision alltoall only support 1 field");
     }
+    // Shared memory alignment is critical for cp.async.bulk and 128B protocol
+    assert((reinterpret_cast<uintptr_t>(mShmemBase) & 0x7F) == 0 && "Shared memory must be 128B aligned");
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 94b3029 and 9e82990.

📒 Files selected for processing (6)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (4 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
  • tensorrt_llm/_mnnvl_utils.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_mnnvl_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespace when importing: from package.subpackage import foo; then use foo.SomeClass()
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; prefix k for names starting with a number (e.g., k_99th_percentile)
Global variables are UPPER_SNAKE_CASE prefixed with G (e.g., G_MY_GLOBAL)
Constants are UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in __init__
For interfaces used outside a file, prefer docstrings over comments; comments for internal code or local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Attributes and variables can be documented inline with trailing docstrings under the class or module
Avoid using reflection when easily avoidable; prefer explicit parameters/constructs over dict(**locals())
In try/except, catch the narrowest exception types possible
For duck-typing try/except, keep try body minimal and place logic in else after attribute existence checks

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA Apache-2.0 copyright header with current year to all source files

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}: Closing braces of namespaces should have a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside initialization; use named constexpr constants instead
Use Allman brace style
Put semicolon for empty for/while loop on a new line
Bodies of switch/while/do/for must be compound statements with braces
if and else must always use brace-delimited statements, even for single or empty statements
C++ filenames should be camelCase with first letter lowercase (e.g., thisIsAFilename.cpp) and case-insensitively unique within a target
Type names (classes, structs, etc.) use CamelCase starting with uppercase (e.g., FooBarClass)
Local variables, methods, and namespaces use camelCase starting lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not in anonymous namespace: prefix g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in anonymous namespace: prefix s (e.g., sMutableStaticGlobal)
Locally visible static variable names start with s (e.g., static std::once_flag sFlag;)
Class member variables use mPrefix camelCase (e.g., mNbFooValues); public members may omit but prefix encouraged for clarity
Constants (enums, global constants, static constants, function-scope magic numbers) use uppercase snake_case with k prefix (e.g., kDIGIT_NUM)
Function-scope constants that are not magic numbers/literals are named like non-constants (e.g., bool const pass = ...)
Avoid Hungarian notation except apps-Hungarian like nb for counts (e.g., mNbLayers)
If constructor parameter name conflicts with public member, add trailing underscore to parameter (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L, not 1234l)
Use spaces only; no tabs; indent by 4 spaces
Run clang-format (LLVM style) before submit...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards with symbol TRTLLM_<FILE_NAME>_H (no leading/trailing underscores; directory names excluded)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

CUDA code must be compiled with CUDA compiler; device/global/managed/constant declarations and kernels must be in CUDA sources

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (5)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
  • fillMetaInfo (1244-1276)
  • fillMetaInfo (1244-1245)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)

92-94: API extension looks good; keeping original dtype is necessary for NVFP4 path

The new fillFieldInfo signature carrying originalDataType aligns with the low-precision flow and keeps header/source in sync.

cpp/tensorrt_llm/thop/moeCommOp.cpp (2)

35-37: Good: original dtype is passed from Torch to kernel field info

Forwarding convert_torch_dtype to fillFieldInfo ensures the device path can choose BF16 vs FP16 correctly.


252-258: Binding schema updated correctly with use_low_precision

The extended def string maintains backward compatibility (defaults are None) and surfaces the new flag.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

424-464: Good: fixed member assignment for originalDataType

The fillFieldInfo implementation now assigns the member (not the parameter), resolving the earlier self-assignment bug.


1468-1474: Unit-test launchers: correct (explicit false) for isLowPrecision

fillMetaInfo calls in the test launchers explicitly pass isLowPrecision=false, which is appropriate for these non-quantized helpers.

Also applies to: 1521-1527, 1601-1605, 1664-1668

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (5)

482-616: Complex LL128Proto implementation - consider adding documentation.

The LL128Proto class implements a sophisticated 128-byte aligned protocol but lacks comprehensive documentation about the protocol design and indexing logic.

Add documentation explaining:

  • The 15:16 packing ratio (15 data blocks → 16 blocks with tail)
  • The role of step values and finish flags
  • The indexing mathematics for halfLane/halfIndex calculations
  • The purpose of INITIALIZED_VALUE constant
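
For reference, a back-of-the-envelope sketch of the 15:16 ratio mentioned above (purely illustrative; the constants and function name are not the kernel's identifiers):

BLOCK_BYTES = 128
PAYLOAD_BLOCKS_PER_GROUP = 15
WIRE_BLOCKS_PER_GROUP = 16


def wire_bytes_for_payload(payload_bytes: int) -> int:
    """Bytes sent on the wire for a given payload size under the assumed 15:16 scheme."""
    payload_blocks = (payload_bytes + BLOCK_BYTES - 1) // BLOCK_BYTES
    groups = (payload_blocks + PAYLOAD_BLOCKS_PER_GROUP - 1) // PAYLOAD_BLOCKS_PER_GROUP
    return groups * WIRE_BLOCKS_PER_GROUP * BLOCK_BYTES


# Example: 7168 FP16 values = 14336 payload bytes -> 112 payload blocks
# -> 8 groups -> 128 wire blocks = 16384 bytes on the wire.
print(wire_bytes_for_payload(7168 * 2))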

893-1179: SingleChannelCommunicator template needs alignment validation.

The template class performs pointer arithmetic without verifying 128-byte alignment requirements.

Add alignment assertion in constructor:

 __device__ __forceinline__ SingleChannelCommunicator(FusedMoeFieldInfo const& fieldInfo,
     MoeExpertParallelInfo const& expertParallelInfo, MoeSingleCommMeta const& commMeta,
     FusedMoeWorkspace const& workspace, FusedMoeWorldInfo const& worldInfo, FusedMoePairInfo const& pairInfo,
     uint64_t* smemBar, uint8_t* shmemBase)
     : mFieldInfo(fieldInfo)
     , mExpertParallelInfo(expertParallelInfo)
     , mCommMeta(commMeta)
     , mWorkspace(workspace)
     , mWorldInfo(worldInfo)
     , mPairInfo(pairInfo)
     , mSmemBar(smemBar)
     , mShmemBase(shmemBase)
 {
+    // Verify 128-byte alignment for shared memory
+    assert((reinterpret_cast<uintptr_t>(shmemBase) & 0x7F) == 0 && "Shared memory must be 128-byte aligned");

1338-1406: Host-side architecture validation needed for low-precision mode.

The moeAllToAll function should validate GPU architecture before launching low-precision kernels.

Add architecture check:

 void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream)
 {
+    if (params.isLowPrecision)
+    {
+        int dev = -1, major = 0;
+        TLLM_CUDA_CHECK(cudaGetDevice(&dev));
+        TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev));
+        TLLM_CHECK_WITH_INFO(major >= 9,
+            "Low-precision (NVFP4) all-to-all requires SM90+ (Hopper or newer). Detected SM%d0.",
+            major);
+    }

1324-1336: Workspace initialization lacks size validation.

The initializeLocalWorkspace method should validate memory sizes before operations.

Add validation:

 void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo)
 {
     int epSize = worldInfo.epInfo.epSize;
     int epRank = worldInfo.epInfo.epRank;
     size_t fifoSize = static_cast<size_t>(FusedMoeCommunicator::FIFO_TOTAL_BYTES) * epSize * channelCount;
+    size_t totalSize = fifoSize + sizeof(SenderSideFifoInfo) * epSize * channelCount 
+                     + sizeof(ReceiverSideFifoInfo) * epSize * channelCount;
+    TLLM_CHECK_WITH_INFO(totalSize <= rankStrideInU64 * sizeof(uint64_t),
+                         "Workspace size exceeds allocated memory");

185-228: Missing error handling and architecture validation in PTX assembly.

The e2m1_to_fp32_vec function uses inline PTX without proper guards or fallback implementation for unsupported architectures.

Add proper guards and fallback:

 inline __device__ void e2m1_to_fp32_vec(uint64_t e2m1Vec, float2 (&array)[8])
 {
-#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000)
+#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ < 900)
+    // Fallback for unsupported architectures  
+    #pragma unroll
+    for (int i = 0; i < 8; ++i) 
+    {
+        array[i] = make_float2(0.0f, 0.0f);
+    }
+#else
     uint32_t out_fp16[8];
     asm volatile(
         // ... existing asm code ...
     );
     // ... rest of conversion ...
 #endif
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9e82990 and f341cef.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (2)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10)
  • idx (458-488)
  • idx (458-458)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (7)

30-183: Well-documented NVFP4 quantization implementation with clear algorithm description.

The quantization function includes comprehensive documentation about the NVFP4 format, output layout, and algorithm steps. The implementation correctly handles both half and bfloat16 types with vectorized operations for efficiency.


440-480: fillFieldInfo method implementation is correct.

The method properly fills field information with alignment handling and type conversion. The bug mentioned in past comments about assigning to parameter instead of member has been fixed.


1031-1049: Low-precision quantization path correctly integrated.

The quantization and dequantization steps are properly placed after packing and before unpacking respectively, with appropriate type handling for FP16 and BF16.

Also applies to: 1116-1133


1181-1246: Kernel template instantiation handles low-precision correctly.

The moeAllToAllKernel properly instantiates different template variants based on field count and low-precision flag. The sender/receiver logic is cleanly separated.


1267-1290: Low-precision metadata calculation appears correct.

The compact size calculation for low-precision paths correctly accounts for the NVFP4 packing format with proper alignment to 128 bytes.


1436-1709: Test utilities are well-structured.

The test namespace provides comprehensive testing utilities for G2S, S2G, loopback, and FIFO operations. The kernels properly handle different configurations.


2-15: Copyright header is properly formatted.

The NVIDIA copyright header with Apache 2.0 license is correctly included.

@zongfeijing zongfeijing force-pushed the user/zongfeij/mnnvl_lowprecision_all2all branch from f341cef to f3ca1cb Compare August 26, 2025 07:12
@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16550 [ run ] triggered by Bot

@kaiyux kaiyux force-pushed the user/zongfeij/mnnvl_lowprecision_all2all branch from f3ca1cb to 88c4489 Compare August 26, 2025 16:34
@kaiyux
Member

kaiyux commented Aug 26, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16579 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16550 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1333-1370: Enforce SM capability gate before selecting low-precision variant

The host-side moeAllToAll in cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (around lines 1333–1370) unconditionally dispatches the low-precision kernel when params.isLowPrecision is true, without verifying the device’s compute capability. This risks launching low-precision kernels on unsupported GPUs. Add a CUDA compute-capability check and gate params.isLowPrecision accordingly.

Suggested fix:

  • In moeAllToAll (top of the function), query device properties:
    int devId = -1;
    TLLM_CUDA_CHECK(cudaGetDevice(&devId));
    cudaDeviceProp prop{};
    TLLM_CUDA_CHECK(cudaGetDeviceProperties(&prop, devId));
    int smVersion = prop.major * 10 + prop.minor;
    constexpr int kMinSmForLowPrecision = /* e.g. 50 for SM5.0+ */;
    if (params.isLowPrecision && smVersion < kMinSmForLowPrecision) {
        // Either override to full precision or error out:
        params.isLowPrecision = false;
        // OR:
        // TLLM_ERROR("Low-precision kernels require SM{}+ (found SM{})", kMinSmForLowPrecision, smVersion);
    }
  • Then proceed to select the kernel via getFunc(maxFieldCount, params.isLowPrecision);

This ensures the low-precision path is only taken on devices with sufficient SM support.

♻️ Duplicate comments (3)
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

112-117: Fix: uninitialized expertParallelInfo.topK used in fillMetaInfo (pass 0 when hasBasicFields=false)
This forwards an indeterminate value. Use a deterministic 0 since basic fields are disabled here.

Apply:

-    params.sendFieldInfo.fillMetaInfo(
-        &(params.sendCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
-    params.recvFieldInfo.fillMetaInfo(
-        &(params.recvCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
+    constexpr int kTopKNotUsed = 0;
+    params.sendFieldInfo.fillMetaInfo(
+        &(params.sendCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal);
+    params.recvFieldInfo.fillMetaInfo(
+        &(params.recvCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal);
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

1319-1331: InitializeLocalWorkspace: consider size validations before memset
Large cuMemsetD32/cudaMemset sizes are derived from epSize*channelCount. Validate bounds against allocated stride.

If you want, I can propose a guarded version of this block.


38-45: LOW_PRECISION path compiles to no-ops on < SM100; add host gating to prevent layout mismatch
quantize_nvfp4_sharedmem, e2m1_to_fp32_vec, and dequantize_nvfp4_sharedmem are guarded by CUDA_ARCH >= 1000. On devices compiled for lower CC, these bodies are effectively empty, but fillMetaInfo still chooses the compact NVFP4 layout leading to pack/unpack/size mismatches and potential memory corruption.

Add a device capability gate in moeAllToAll to prohibit low precision unless SM100+ is available:

@@ void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream)
-{
+{
+    if (params.isLowPrecision) {
+        int dev = -1, major = 0;
+        TLLM_CUDA_CHECK(cudaGetDevice(&dev));
+        TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev));
+        TLLM_CHECK_WITH_INFO(major >= 10,
+            "Low-precision (NVFP4) all-to-all requires SM100+ (compute capability 10.x). "
+            "Detected compute capability %d.x.", major);
+    }

If you plan to support a software fallback for < SM100, keep the check but switch to a fallback path instead of erroring. Otherwise the gate is required to avoid silent data corruption.

Also applies to: 228-236, 183-187

🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)

95-103: Initialize originalDataType in setUnused to a safe sentinel
Uninitialized fields may be accessed in debug/validation code. Initialize to a known default to avoid UB.

Apply:

     __host__ void setUnused()
     {
         dataPtrBase = nullptr;
         alignedUnitBit = 4;
         alignedUnitCount = 0;
         alignedUnitStride = 0;
         unalignedFieldIndex = 0;
         compact16BOffset = 0;
+        originalDataType = CUDA_R_32F; // sentinel for "unused"
     }
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)

276-281: Remove debug printf in device code; clamp or drop entirely
printf in kernels is brittle and floods logs. Clamp dequantScale, or move any diagnostics to host-side project logging.

Apply:

-                // This is a workaround to avoid the issue of nan or inf.
-                // TODO: remove this after the issue is fixed.
-                if (dequantScale > 1e10)
-                {
-                    printf("This is a workaround to avoid the issue of nan or inf. \n");
-                }
+                // Clamp to avoid potential overflow -> NaN/Inf in edge cases.
+                // TODO: Revisit scaling to remove this guard.
+                dequantScale = fminf(dequantScale, 1e10f);

894-906: Optional: assert 128B alignment of shared-memory base in communicator ctor
Given the protocol assumes 128B blocks, add a debug assert to catch misalignment early.

Apply:

     , mShmemBase(shmemBase)
 {
+    // Debug: shared memory base should be 128B aligned
+#ifndef NDEBUG
+    assert((reinterpret_cast<uintptr_t>(mShmemBase) & 0x7F) == 0 && "Shared memory must be 128-byte aligned");
+#endif
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f3ca1cb and 88c4489.

📒 Files selected for processing (6)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (4 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
  • tensorrt_llm/_mnnvl_utils.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_mnnvl_utils.py
  • tests/unittest/_torch/thop/test_moe_alltoall.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hh,hpp,hxx,cuh}: Closing braces of C++ namespaces in headers must include a trailing comment naming the namespace
Use Allman brace style and always use braces for control statements in headers as well
C++ header filenames must be lowerCamelCase and case-insensitively unique within a compilation target
Document public C++ interfaces with Doxygen using //! and //!<; C-style comments are not allowed except inline special cases; single-line comments should use // and be properly capitalized and punctuated if full sentences
Avoid assignment in subexpressions within header inline/template code as well
All class/function templates and their members should be instantiated at least once; if a class is not POD, its data members should be private
Use header include guards; name as TRTLLM_<FILENAME>_H (all caps of filename only, no dirs), no leading underscore and no trailing underscore

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (3)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
  • fillMetaInfo (1257-1289)
  • fillMetaInfo (1257-1258)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
  • topK (187-286)
  • topK (187-189)
  • topK (334-455)
  • topK (334-336)
  • topK (533-658)
  • topK (533-535)
  • topK (987-1139)
  • topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)

185-195: Good gating of low-precision combine via env + quant-mode (NVFP4)
The feature flag is opt-in and safely guarded behind quant mode support, which prevents accidental enablement on unsupported models. LGTM.
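
A minimal sketch of this opt-in pattern, purely for illustration — the environment variable name, argument, and helper below are placeholders, not the module's actual identifiers:

import os


def should_use_low_precision_combine(quant_supports_nvfp4: bool) -> bool:
    """Enable low-precision combine only when explicitly requested and the quant mode supports NVFP4."""
    requested = os.environ.get("EXAMPLE_USE_LOW_PRECISION_COMBINE", "0") == "1"  # placeholder env var
    return requested and quant_supports_nvfp4


# Off by default; stays off even when requested if the quant mode lacks NVFP4 support.
print(should_use_low_precision_combine(quant_supports_nvfp4=False))  # False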


901-910: mnnvl_moe_alltoallv_combine signature and call sites are consistent

  • The static method in _mnnvl_utils.py defines both token_count: int and use_low_precision_combine: bool = False in its signature.
  • In fused_moe_wide_ep.py, both token_count and use_low_precision_combine are passed explicitly.
  • In fused_moe_cutlass.py, only token_count is passed—use_low_precision_combine correctly falls back to its default (False).
  • No other call sites exist under tensorrt_llm.

No further updates are needed.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)

72-80: New originalDataType and alignment fields: looks correct
The field additions and comments are clear and align with the low-precision data-path requirements. No issues.


406-408: fillMetaInfo extended with isLowPrecision: LGTM
Signature carries the low-precision toggle into sizing. Matches usage in .cu meta logic.


422-422: FusedMoeCommKernelParam.isLowPrecision addition is appropriate
The struct-level flag cleanly propagates mode selection into kernels.

cpp/tensorrt_llm/thop/moeCommOp.cpp (3)

35-37: Correctly passing original dtype to fillFieldInfo
Using convert_torch_dtype(t.scalar_type()) wires the dtype through. LGTM.


252-258: Torch schema updated to include use_low_precision: LGTM
The def string and CUDA impl align with the new optional boolean. Good defaulting behavior.


31-37: The new use_low_precision_combine argument is already optional (defaults to False), so existing call sites don’t need to be updated unless you want to enable it. All direct invocations of torch.ops.trtllm.moe_comm in the Python wrappers and tests remain valid without modification.

I’ve verified:

• In tensorrt_llm/_mnnvl_utils.py, the Python wrapper signature ends with

    use_low_precision_combine: bool = False

and forwards it positionally to torch.ops.trtllm.moe_comm with the correct defaults.
• All direct test calls in tests/unittest/_torch/thop/test_moe_alltoall.py pass only the original required arguments (without the new flag), which continues to map use_low_precision_combine=False.

No changes are required to existing call sites.

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (4)

30-37: Appreciate the clear documentation for NVFP4 block layout
The function-level comments for quantize_nvfp4_sharedmem are helpful. Nice.


437-477: fillFieldInfo now sets originalDataType correctly
The earlier self-assignment bug is resolved; this writes the member. Good fix.


890-911: Static assert restricting LOW_PRECISION to 1 field is reasonable
The constraint is enforced at compile time. Good defensive measure.


1258-1287: Low-precision compact sizing formula looks correct and matches the packed layout
numGroups, group size (WARP*8 + WARP), and +4 global-scale with 128B alignment match quantize/dequant. LGTM.
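
To make the sizing concrete, a small back-of-the-envelope sketch assuming WARP is 32 and each lane packs 16 elements (8 bytes of FP4 data plus 1 scale byte); names are illustrative, not the kernel's constants:

WARP = 32
ELEMS_PER_LANE = 16            # 16 x 4-bit values = 8 bytes of packed data per lane
GROUP_BYTES = WARP * 8 + WARP  # packed FP4 data + one scale byte per lane


def compact_bytes(hidden_size: int) -> int:
    elems_per_group = WARP * ELEMS_PER_LANE
    num_groups = (hidden_size + elems_per_group - 1) // elems_per_group
    raw = num_groups * GROUP_BYTES + 4   # + 4 bytes for the global scale
    return (raw + 127) // 128 * 128      # round up to 128-byte alignment


# Example: hidden_size 7168 -> 14 groups -> 14*288 + 4 = 4036 -> 4096 bytes,
# versus 14336 bytes for the same token in FP16 (~3.5x smaller).
print(compact_bytes(7168))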

@tensorrt-cicd
Collaborator

PR_Github #16673 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16691 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12529 completed with status: 'FAILURE'

@zongfeijing
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16758 [ run ] triggered by Bot

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16783 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16758 [ run ] completed with state ABORTED

@kaiyux kaiyux enabled auto-merge (squash) August 28, 2025 07:49
@tensorrt-cicd
Collaborator

PR_Github #16783 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12596 completed with status: 'SUCCESS'

@zongfeijing zongfeijing disabled auto-merge August 28, 2025 09:52
Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing zongfeijing force-pushed the user/zongfeij/mnnvl_lowprecision_all2all branch from 14aab49 to 7f36008 Compare August 28, 2025 09:53
@zongfeijing zongfeijing enabled auto-merge (squash) August 28, 2025 09:55
@zongfeijing
Collaborator Author

/bot reuse-pipeline

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1382-1394: Initialize workspace: add bounds validation before large memset.

Guard against overruns and alignment issues.

Apply:

 void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo)
 {
@@
-    uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64;
+    uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64;
+    size_t totalBytes = rankStrideInU64 * sizeof(uint64_t);
+    size_t needed = fifoSize + senderSideInfoSize + receiverSideInfoSize;
+    TLLM_CHECK_WITH_INFO(needed <= totalBytes, "MOE comm workspace overflow: need=%zu, have=%zu", needed, totalBytes);
+    TLLM_CHECK_WITH_INFO((fifoSize % sizeof(uint32_t)) == 0, "fifoSize must be multiple of 4 bytes");
@@
-    TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE,
-        fifoSize / sizeof(uint32_t)));
+    TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE,
+        fifoSize / sizeof(uint32_t)));
♻️ Duplicate comments (5)
tests/unittest/_torch/thop/test_moe_alltoall.py (2)

260-278: Bug: shared output_indice_offset corrupts multi-tensor refs.

Maintain per-tensor offsets; current code overwrites slices across tensors.

Apply:

-        total_recv_all_ranks_cpu = []
-        output_indice_offset = 0
+        total_recv_all_ranks_cpu = []
+        output_offsets = [0] * tensor_count
@@
-                for i in range(tensor_count):
-                    ref_output_tensors[i][output_indice_offset:output_indice_offset + local_recv_count_pair] = \
-                        input_tensors_all_ranks[i][other_rank][send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]]]
-                output_indice_offset += local_recv_count_pair
+                for i in range(tensor_count):
+                    start = output_offsets[i]
+                    end = start + local_recv_count_pair
+                    ref_output_tensors[i][start:end] = input_tensors_all_ranks[i][other_rank][
+                        send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]]
+                    ]
+                    output_offsets[i] = end

27-41: quant_and_dequant: guard zero, avoid forced CPU hops, add docstring.

Prevent div-by-zero, prefer on-device op with CPU fallback, and document intent.

Apply:

-def quant_and_dequant(tensor):
-    tensor = tensor.reshape(1, -1)
-    global_scale = (448 * 6) / tensor.abs().max().float()
-    fp4_tensor, scale_factors = torch.ops.trtllm.fp4_quantize(
-        tensor, global_scale, 16, False, False)
-
-    dequantized_cpu = torch.ops.tensorrt_llm.e2m1_and_ufp8sf_scale_to_float_v2(
-        fp4_tensor.cpu(),
-        scale_factors.cpu(),
-        (1.0 / global_scale).cpu(),
-        16,
-        1,  # sf_type (1 for UE4M3)
-        False)
-    return dequantized_cpu.to(tensor.device).reshape(-1)
+def quant_and_dequant(tensor):
+    """Quantize to FP4 and dequantize back; used to relax asserts for low-precision paths."""
+    tensor = tensor.reshape(1, -1)
+    absmax = tensor.abs().max().float()
+    if absmax == 0:
+        return torch.zeros_like(tensor.reshape(-1))
+    global_scale = (448 * 6) / absmax
+    fp4_tensor, scale_factors = torch.ops.trtllm.fp4_quantize(
+        tensor, global_scale, 16, False, False)
+    try:
+        out = torch.ops.tensorrt_llm.e2m1_and_ufp8sf_scale_to_float_v2(
+            fp4_tensor, scale_factors, (1.0 / global_scale), 16, 1, False)
+        return out.reshape(-1)
+    except Exception:
+        out_cpu = torch.ops.tensorrt_llm.e2m1_and_ufp8sf_scale_to_float_v2(
+            fp4_tensor.cpu(), scale_factors.cpu(), (1.0 / global_scale).cpu(), 16, 1, False)
+        return out_cpu.to(tensor.device).reshape(-1)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)

74-101: Validate dtypes when use_low_precision is enabled.

Fail fast if any input isn’t fp16/bf16 to avoid silent kernel misuse.

Apply:

     int fieldCount = inputs.size();
@@
     tensorrt_llm::kernels::FusedMoeFieldInfo sendFieldInfo, recvFieldInfo;
@@
-    bool useLowPrecisionVal = useLowPrecision.value_or(false);
+    bool useLowPrecisionVal = useLowPrecision.value_or(false);
+    if (useLowPrecisionVal)
+    {
+        for (int i = 0; i < fieldCount; ++i)
+        {
+            auto dt = inputs[i].scalar_type();
+            TORCH_CHECK(dt == torch::kHalf || dt == torch::kBFloat16,
+                "use_low_precision requires float16 or bfloat16 inputs; got ", dt, " at field index ", i);
+        }
+    }

Also applies to: 112-117


112-117: Uninitialized topK forwarded into fillMetaInfo (UB).

When hasBasicFields=false, pass a known-zero topK to avoid using params.expertParallelInfo.topK.

Apply:

-    bool useLowPrecisionVal = useLowPrecision.value_or(false);
-    params.isLowPrecision = useLowPrecisionVal;
-    params.sendFieldInfo.fillMetaInfo(
-        &(params.sendCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
-    params.recvFieldInfo.fillMetaInfo(
-        &(params.recvCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
+    bool useLowPrecisionVal = useLowPrecision.value_or(false);
+    params.isLowPrecision = useLowPrecisionVal;
+    constexpr int kTopKNotUsed = 0;
+    params.sendFieldInfo.fillMetaInfo(
+        &(params.sendCommMeta), kTopKNotUsed, false, false, useLowPrecisionVal);
+    params.recvFieldInfo.fillMetaInfo(
+        &(params.recvCommMeta), kTopKNotUsed, false, false, useLowPrecisionVal);
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1396-1464: Gate LOW_PRECISION path on device CC to avoid no-op codepaths on < SM100.

Add host-side check; otherwise quant/dequant sections compile out and corrupt I/O.

Apply:

 void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream)
 {
+    if (params.isLowPrecision)
+    {
+        int dev = -1, major = 0;
+        TLLM_CUDA_CHECK(cudaGetDevice(&dev));
+        TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev));
+        TLLM_CHECK_WITH_INFO(major >= 10,
+            "Low-precision (NVFP4) all-to-all requires SM100+ (compute capability 10.x).");
+    }
     bool hasBasicFields = params.sendFieldInfo.tokenSelectedSlots != nullptr;
🧹 Nitpick comments (2)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)

345-351: Optional: vectorize low-precision ref adjust to avoid per-token Python loop.

Quant/dequant whole matrix per tensor once for speed in CI.

Example:

if use_low_precision:
    ref_output_tensors[i] = quant_and_dequant(ref_output_tensors[i])
    atol, rtol = 1e-2, 1e-2
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

230-241: Fix misleading error message (typo).

Dim check is 2D but message says 1D.

Apply:

-    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor");
+    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor");
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 14aab49 and 7f36008.

📒 Files selected for processing (7)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (4 hunks)
  • cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (5 hunks)
  • tensorrt_llm/_mnnvl_utils.py (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
  • tests/unittest/_torch/thop/test_moe_alltoall.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_mnnvl_utils.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
  • cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}: In C++/CUDA files, closing braces of namespaces must be commented with the namespace name (e.g., } // namespace foo)
Prefer const or constexpr variables over #define; variables not modified after initialization must be const
Use literals only for initialization except 0, nullptr, true, false; otherwise name them as constexpr
Use Allman indentation style; empty for/while loop’s semicolon on a new line; always use brace-delimited bodies for control statements
Type names are UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, non-anonymous-namespace global variables use g prefix (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix (e.g., sMutableStaticGlobal); locally visible static variables also use s prefix
Member variables use m prefix (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants: enums, globals, static constants, and function-scope magic-number/literal constants use k + UPPER_SNAKE_CASE (e.g., kDIGIT_NUM)
Use spaces only; indent 4 spaces
Run LLVM clang-format before submitting; enforce max 120 chars per line; only justify exceptions via clang-format off/on blocks
Use C++ style comments; allow inline C comments only for parameter labeling in calls
Disable code via #if/#endif or dedicated macros; avoid dead code; do not comment out code
Do not throw exceptions across library boundaries
Prefer least forceful casts; avoid C-style and functional casts (except explicit constructor); do not drop const/volatile; void* to T* via static_cast; use reinterpret_cast as last resort; avoid dynamic_cast
Do not use assignment in subexpressions (e.g., if (x = y))
Enum-controlled switch should cover all enum values without default; structure switch cases clearly; no fall-through except to another label; each case must end with break or throw; do not end a case with return; if case is a compound, put break inside braces
Avoid declaring l...

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cc,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use smart pointers for heap objects; prefer unique_ptr for sole ownership, shared_ptr for shared; avoid deprecated smart pointers

Files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
  • cpp/tensorrt_llm/thop/moeCommOp.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tests/unittest/_torch/thop/test_moe_alltoall.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/moeCommOp.cpp
🧬 Code graph analysis (2)
tests/unittest/_torch/thop/test_moe_alltoall.py (3)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)
  • tensorrt_llm (25-203)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1)
  • tensorrt_llm (20-47)
cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)
  • tensorrt_llm (25-89)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (2)
  • elementSize (68-79)
  • elementSize (68-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)

1320-1352: Low-precision compact size: looks correct; aligns with sender/receiver layouts.

numGroups, per-group bytes, and 128B alignment match pack/unpack. LGTM.

@tensorrt-cicd
Collaborator

PR_Github #16842 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16842 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #16783 for commit 7f36008

@zongfeijing zongfeijing merged commit 53163bf into NVIDIA:main Aug 28, 2025
7 checks passed
@zongfeijing zongfeijing deleted the user/zongfeij/mnnvl_lowprecision_all2all branch September 2, 2025 08:57