[TRTLLM-6876][feat] Add low precision all2all for mnnvl #7155
Conversation
📝 Walkthrough

Adds an NVFP4 per-16-element low-precision path to fused MoE all-to-all: device quantize/dequantize utilities, 128B-aligned metadata and workspace changes, a LOW_PRECISION template parameter with compile-time constraints, host/kernel selection by isLowPrecision, and Python/Torch bindings and tests updated to opt into the path.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Py as Python caller
    participant Op as moe_comm (Torch C++)
    participant Host as Host launcher
    participant Ker as moeAllToAllKernel<FIELD_COUNT,LOW_PRECISION>
    participant Dev as Device utils (NVFP4)
    Py->>Op: moe_comm(..., use_low_precision=?)
    Op->>Host: build params (isLowPrecision), fill field meta (originalDataType)
    Host->>Host: select kernel via getFunc(fieldCount, isLowPrecision)
    alt isLowPrecision == true
        Host->>Ker: launch moeAllToAllKernel<..., true>
        note right of Ker #EEF6F3: FIELD_COUNT==1 enforced for low-precision
        Ker->>Dev: quantize_nvfp4_sharedmem (per-16 elems) -> write global + per-group scales
        Ker->>Ker: pack -> all-to-all exchange
        Ker->>Dev: dequantize_nvfp4_sharedmem (use scales) -> unpack
    else
        Host->>Ker: launch moeAllToAllKernel<..., false>
        Ker->>Ker: pack -> all-to-all exchange -> unpack (FP path)
    end
    Ker-->>Host: kernel complete
    Host-->>Op: return outputs
    Op-->>Py: Tensor[]
```
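As a rough model of the host-side selection step in the diagram (the Python function below is purely illustrative — the real dispatch lives in the C++ launcher, and only the single-field constraint and multi-field fallback for the low-precision path are taken from this PR):

```python
def select_kernel_variant(field_count: int, is_low_precision: bool) -> str:
    """Model of getFunc(fieldCount, isLowPrecision): pick a template instantiation by name."""
    # The low-precision instantiation only exists for a single field; multi-field
    # requests fall back to the regular floating-point path.
    use_low_precision = is_low_precision and field_count == 1
    low = "true" if use_low_precision else "false"
    return f"moeAllToAllKernel<{field_count}, {low}>"

assert select_kernel_variant(1, True) == "moeAllToAllKernel<1, true>"
assert select_kernel_variant(3, True) == "moeAllToAllKernel<3, false>"
```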
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
Lines 66-67: Fix malformed default parameter value. The default parameter has an incomplete expression `MoEWeightLoadingMode.` without specifying the enum value. Apply this diff to fix the default parameter:

```diff
-        weight_loading_mode: MoEWeightLoadingMode = MoEWeightLoadingMode.
-        VANILLA,
+        weight_loading_mode: MoEWeightLoadingMode = MoEWeightLoadingMode.VANILLA,
```
🧹 Nitpick comments (15)
tests/unittest/_torch/thop/test_moe_alltoall.py (5)
Lines 75-81: Verify workspace size units (bytes vs. elements) before dividing by 8. For the moe_comm workspace you divide by 8 (uint64 elements), while the prepare workspace uses the raw value. Ensure both size getters return consistent units; otherwise this may under- or over-allocate.

If both APIs return bytes, adjust the prepare workspace allocations similarly:

```diff
-    all_workspaces = torch.zeros(ep_size,
-                                 workspace_size,
-                                 dtype=torch.uint64,
-                                 device=torch.device('cuda'))
+    all_workspaces = torch.zeros(
+        ep_size, workspace_size // 8, dtype=torch.uint64, device='cuda'
+    )
```

If they already return element counts, remove the `// 8` in the moe_comm allocations instead. Please confirm via the operator docs or quick prints (see the small sketch below).

Also applies to: 254-259, 549-556
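A minimal sketch of the unit conversion at stake, assuming the getter reports bytes (the helper name and flag are illustrative, not the test's actual API):

```python
import torch

def alloc_uint64_workspace(ep_size: int, size: int, size_is_bytes: bool) -> torch.Tensor:
    # A byte count must be divided by 8 to get torch.uint64 element counts;
    # an element count is used as-is. Mixing the two under- or over-allocates by 8x.
    num_elems = size // 8 if size_is_bytes else size
    return torch.zeros(ep_size, num_elems, dtype=torch.uint64, device="cuda")

# e.g. a 1024-byte-per-rank workspace -> shape (ep_size, 128) when size_is_bytes=True,
# but shape (ep_size, 1024) if the same number were already an element count.
```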
Lines 557-563: Warmup call uses ep_size=1; likely a mismatch with the test parameterization. The first warmup prepare call hardcodes `ep_size=1`, while the rest of the test uses the parameterized `ep_size`. This can mask issues and/or size calculations for the workspace. Apply this diff to use the test's `ep_size`:

```diff
-    torch.ops.trtllm.mnnvl_moe_alltoallv_prepare_without_allgather(
-        expert_ids_all_ranks[0], experter_count_lists[0],
-        all_workspaces, max_token_count_per_rank, 0, 1, expert_count,
-        slot_count, top_k)
+    torch.ops.trtllm.mnnvl_moe_alltoallv_prepare_without_allgather(
+        expert_ids_all_ranks[0],
+        experter_count_lists[0],
+        all_workspaces,
+        max_token_count_per_rank,
+        0,
+        ep_size,
+        expert_count,
+        slot_count,
+        top_k,
+    )
```
Lines 566-586: Remove dead allocations that are immediately overwritten below. You allocate placeholder tensors into the `local_*` lists and then immediately reset them to empty lists, making the allocations dead code. Apply this diff to drop the unneeded block:

```diff
-    # Make torch alloc tensor to avoid cuda sync
-    local_send_rank_count_cumsum = []
-    local_send_rank_indices = []
-    local_recv_rank_count_cumsum = []
-    local_recv_rank_indices = []
-    backward_local_recv_rank_indices = []
-    for _ in range(ep_size):
-        local_send_rank_count_cumsum.append(
-            torch.empty(ep_size,
-                        dtype=torch.int32,
-                        device=torch.device('cuda')))
-        local_send_rank_indices.append(
-            torch.empty(max_token_count_per_rank * ep_size,
-                        dtype=torch.int32,
-                        device=torch.device('cuda')))
-        local_recv_rank_count_cumsum.append(
-            torch.empty(0, dtype=torch.int32, device=torch.device('cuda')))
-        local_recv_rank_indices.append(
-            torch.empty(0, dtype=torch.int32, device=torch.device('cuda')))
-        backward_local_recv_rank_indices.append(
-            torch.empty(0, dtype=torch.int32, device=torch.device('cuda')))
-
-    local_send_rank_count_cumsum = []
-    local_send_rank_indices = []
-    local_recv_rank_count_cumsum = []
-    local_recv_rank_indices = []
-    backward_local_recv_rank_indices = []
+    # These will be filled by the prepare op for the target rank.
+    local_send_rank_count_cumsum = []
+    local_send_rank_indices = []
+    local_recv_rank_count_cumsum = []
+    local_recv_rank_indices = []
+    backward_local_recv_rank_indices = []
```

Also applies to: 587-592
Lines 309-406: Make the FP8 test assertions deterministic and reduce stdout noise. The FP8 test prints shapes/counts and treats any exception as failure. Prefer explicit assertions on returned shapes/dtypes and gate the test if FP8 isn't supported on the current GPU/runtime.

Suggested adjustments:
- Replace prints with asserts on tensor shapes, dtypes, and that the scatter/gather preserves non-zeros for the selected indices.
- Add a skip guard when `not torch.cuda.is_available()` or when the runtime lacks `torch.float8_e4m3fn` kernels (e.g., Hopper requirement); see the sketch below.
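A minimal skip-guard sketch along those lines; the exact capability threshold for the FP8 kernels is an assumption here:

```python
import pytest
import torch

requires_fp8 = pytest.mark.skipif(
    not torch.cuda.is_available()
    or not hasattr(torch, "float8_e4m3fn")
    # Assumed threshold: FP8 kernels typically need SM89+/Hopper-class GPUs.
    or torch.cuda.get_device_capability() < (8, 9),
    reason="float8_e4m3fn not supported on this GPU/runtime",
)

@requires_fp8
def test_fp8_roundtrip_shapes():
    x = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
    # Assert instead of printing: shape and dtype survive the cast.
    assert x.shape == (16, 32) and x.dtype == torch.float8_e4m3fn
```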
Lines 526-528: Line length exceeds the linter limit (Ruff E501). A few lines exceed 120 characters. Consider wrapping with parentheses for better readability and to appease CI linters.

Also applies to: 606-607
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
Lines 587-589: Make unsupported post-quant all-to-all explicit for MNNVL. Currently this branch is a no-op (`pass`). Consider raising or logging to surface misconfiguration early when `TRTLLM_MOE_POST_QUANT_ALLTOALLV=1` is set with MNNVL. Apply this diff:

```diff
-        if self.alltoall_method_type == AlltoallMethodType.MNNVL:
-            pass
+        if self.alltoall_method_type == AlltoallMethodType.MNNVL:
+            raise NotImplementedError(
+                "Post-quant all-to-all is not supported for MNNVL. "
+                "Unset TRTLLM_MOE_POST_QUANT_ALLTOALLV or switch backend."
+            )
```
Line 1: Missing NVIDIA SPDX header. Per repo guidelines, prepend the NVIDIA SPDX header to source files. Apply this diff:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES.
+# SPDX-License-Identifier: Apache-2.0
+
 import os
```

tensorrt_llm/_torch/models/modeling_speculative.py (1)

Line 1: Missing NVIDIA SPDX header. Please prepend the standard NVIDIA SPDX header. Apply this diff:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES.
+# SPDX-License-Identifier: Apache-2.0
+
 from typing import Any, Dict, Generic, Optional, Tuple
```

cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (2)
Lines 34-38: Consider using consistent naming conventions for struct members. According to the coding guidelines, member variables should use mPrefix lowerCamelCase (e.g., `mEpSize`, `mEpRank`). While public members may omit the prefix, using it is encouraged for consistency. Apply this diff to follow the naming convention:

```diff
 struct MoeEpWorldInfo {
-    int epSize;
-    int epRank;
+    int mEpSize;
+    int mEpRank;
 };
```

Lines 40-44: Consider using consistent naming conventions for struct members. The member variables should follow the mPrefix lowerCamelCase convention (e.g., `mExpertCount`, `mTopK`). Apply this diff to follow the naming convention:

```diff
 struct MoeExpertParallelInfo {
-    int expertCount = -1;
-    int topK = 1;
+    int mExpertCount = -1;
+    int mTopK = 1;
 };
```

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
Lines 373-375: Add explicit shape assertions for x_sf before swizzling. To catch any unexpected padding or shape mismatches after communication, it's safer to assert that `x_sf` really has the dimensions you expect — namely `(x_row, ceil_div(x_col, scaling_vector_size))` — before calling `swizzle_sf`. In `fused_moe_cutlass.py` you can insert these checks in both the AllToAll and AllGather branches:

• In the AllToAll branch (around lines 373-375):

```diff
 if x_sf is not None:
-    x_row = x_sf.shape[0]
-    x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)
+    # verify shape matches precomputed x_row/x_col
+    assert x_sf.shape[0] == x_row and x_sf.shape[1] == ceil_div(x_col, self.scaling_vector_size), \
+        f"Unexpected x_sf shape {tuple(x_sf.shape)}, expected ({x_row}, {ceil_div(x_col, self.scaling_vector_size)})"
+    # swizzle using known dimensions
+    x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)
```

• In the post-quant AllGather branch (around lines 392-394):

```diff
 if x_sf is not None:
-    x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)
+    # verify shape matches gathered x_row/x_col
+    assert x_sf.shape[0] == x_row and x_sf.shape[1] == ceil_div(x_col, self.scaling_vector_size), \
+        f"Unexpected x_sf shape {tuple(x_sf.shape)} after allgather, expected ({x_row}, {ceil_div(x_col, self.scaling_vector_size)})"
+    x_sf = swizzle_sf(x_sf, x_row, x_col, self.scaling_vector_size)
```

These assertions mirror the pattern used in `fused_moe_wide_ep.py` and will surface any inconsistency in the scaling-factor tensor dimensions early in execution.

cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)

Lines 1-15: Update the copyright year to 2025. The copyright notice should be updated to include 2025 to match other files in this PR. Apply this diff:

```diff
 /*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
  *
```

cpp/tensorrt_llm/kernels/moePrepareKernels.cu (2)

Lines 125-137: Optimize counter communication with reduced memory traffic. The current implementation writes values one by one in a loop (lines 129-137). For better performance, consider batching the writes when possible, especially when `communicationCount` is large. Consider coalescing writes for better memory access patterns:

```diff
     for (int i = threadIdx.x; i < communicationCount; i += blockDim.x)
     {
         int value = i == 0 ? *(sharedSendRecvRankCount) : *(expertStatics + i - 1);
-        counter.releaseValue(value, i);
         if (i == 0)
         {
             *(sendCounts + targetRankId) = value;
         }
+        counter.releaseValue(value, i);
     }
```

Lines 83-88: Template function parameter not utilized consistently. The template parameter `kThreadsGroupSize` is defined, but the function also takes runtime parameters that might conflict with compile-time assumptions. Consider adding a static assertion to ensure the template parameter matches the runtime expectations:

```diff
 template <int kThreadsGroupSize>
 __device__ __forceinline__ void computeCountAndSendStatics(int* experts, int tokenCount,
     int* sharedSendRecvRankCount, int* sendCounts, int* sendIndiceWorkspace, int* backwardIndiceWorkspace,
     int* expertStatics, MoeCommWorkspace workspace, int maxTokenCountPerRank, int slotCount, int expertCount,
     int topK, int epRank, int epSize)
 {
+    static_assert(kThreadsGroupSize <= 32 && (kThreadsGroupSize & (kThreadsGroupSize - 1)) == 0,
+        "kThreadsGroupSize must be a power of 2 and <= 32");
     cg::thread_block_tile<kThreadsGroupSize> tile = cg::tiled_partition<kThreadsGroupSize>(cg::this_thread_block());
```

cpp/tensorrt_llm/thop/moeCommOp.cpp (1)

Line 169: Extract and document the 512 limit for consistent validation. The literal `512` in the TORCH_CHECK corresponds exactly to the block size used by the MoE communication kernel (see `cpp/tensorrt_llm/kernels/moePrepareKernels.cu:316`: `dim3 block(512);`). `expertCount + 1` must not exceed the number of threads per block.

Define a shared constant, for example in a new header (`moeCommon.h`):

```cpp
// Maximum number of experts per block (must match GPU threads per block in MOE kernels)
constexpr int kMaxExpertsPerBlock = 512;
```

In `cpp/tensorrt_llm/thop/moeCommOp.cpp` (line 169), replace:

```diff
-    TORCH_CHECK(expertCount + 1 <= 512, "expertCount + 1 is larger than 512");
+    TORCH_CHECK(expertCount + 1 <= kMaxExpertsPerBlock, "expertCount + 1 must not exceed the number of threads per block");
```

In `cpp/tensorrt_llm/kernels/moePrepareKernels.cu` (around line 316), replace:

```diff
-    dim3 block(512);
+    dim3 block(kMaxExpertsPerBlock);
```

Add a brief comment in both locations explaining that this cap exists because the kernel launches at most `kMaxExpertsPerBlock` threads per block, and `expertCount + 1` must fit within that.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (19)
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1 hunks)
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/moeCommKernels.cu (0 hunks)
- cpp/tensorrt_llm/kernels/moeCommKernels.h (0 hunks)
- cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1 hunks)
- cpp/tensorrt_llm/kernels/moePrepareKernels.cu (7 hunks)
- cpp/tensorrt_llm/kernels/moePrepareKernels.h (3 hunks)
- cpp/tensorrt_llm/thop/moeCommOp.cpp (6 hunks)
- cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp (0 hunks)
- cpp/tests/unit_tests/kernels/CMakeLists.txt (1 hunks)
- cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1 hunks)
- tensorrt_llm/_mnnvl_utils.py (5 hunks)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5 hunks)
- tests/integration/test_lists/waives.txt (0 hunks)
- tests/unittest/_torch/thop/test_moe_alltoall.py (9 hunks)
💤 Files with no reviewable changes (4)
- cpp/tensorrt_llm/thop/moeLoadBalanceOp.cpp
- tests/integration/test_lists/waives.txt
- cpp/tensorrt_llm/kernels/moeCommKernels.h
- cpp/tensorrt_llm/kernels/moeCommKernels.cu
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tensorrt_llm/kernels/moePrepareKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
tensorrt_llm/_torch/models/modeling_speculative.py
tests/unittest/_torch/thop/test_moe_alltoall.py
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
tensorrt_llm/_mnnvl_utils.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/models/modeling_speculative.py
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_mnnvl_utils.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
🧠 Learnings (8)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tensorrt_llm/kernels/moePrepareKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/moePrepareKernels.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: 2025-08-21T02:39:11.984Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:11.984Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-21T21:48:35.105Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.105Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
- tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1): moe_load_balancer_set_repeated_for_next_layer (1037-1046)

tests/unittest/_torch/thop/test_moe_alltoall.py (2)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (16): _ (13-56), _ (60-62), _ (65-68), _ (71-76), _ (79-84), _ (87-99), _ (102-107), _ (110-115), _ (118-123), _ (126-138), _ (141-147), _ (150-151), _ (154-157), _ (161-162), _ (165-166), _ (169-180)
- tensorrt_llm/_mnnvl_utils.py (1): mnnvl_moe_alltoallv_prepare_without_allgather (399-443)

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1): create_moe (61-211)

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (3)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7): _ (239-289), _ (365-373), _ (451-461), _ (630-657), _ (690-700), _ (774-784), _ (874-890)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2): _ (273-316), _ (536-559)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): top_k (221-221)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
- tensorrt_llm/_torch/utils.py (2): _ (190-196), shape (103-104)
- tensorrt_llm/_mnnvl_utils.py (3): MnnvlMoe (350-617), mnnvl_moe_alltoallv_prepare_without_allgather (399-443), mnnvl_moe_alltoallv (528-589)
- tensorrt_llm/quantization/utils/fp8_utils.py (1): ceil_div (10-21)

cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)
- cpp/tensorrt_llm/thop/moeCommOp.cpp (2): memsetExpertIds (230-248), memsetExpertIds (230-231)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
- tensorrt_llm/_mnnvl_utils.py (7): MnnvlMemory (53-336), initialize (91-100), get_moe_workspaces (358-373), get_moe_prepare_workspace (376-387), mnnvl_moe_alltoallv_prepare_without_allgather (399-443), MoEAlltoallInfo (340-347), mnnvl_moe_alltoallv (528-589)
🪛 Ruff (0.12.2)
tests/unittest/_torch/thop/test_moe_alltoall.py
526-526: Line too long (236 > 120)
(E501)
528-528: Line too long (248 > 120)
(E501)
606-606: Line too long (186 > 120)
(E501)
607-607: Line too long (136 > 120)
(E501)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
465-465: Line too long (124 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (31)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)
82-87
: No changes needed tomoe_comm
call sites
The C++ binding formoe_comm
defines the last two parameters as optional—bool[]? need_zero_output=None
andbool? use_low_precision=None
—so calls may either include or omit the boolean list without breaking the API (). In tests, some cases explicitly pass[True]
to exercise zero-padding behavior, while others rely on the default (no zero-padding) to validate both code paths. Similarly, the MNNVL utilities intentionally omit the flag when zero-padding isn’t needed and include it in the combine path where it is .Because both patterns are valid and cover distinct behaviors, unifying all calls to always include or always omit the boolean list would either force redundant defaults or lose test coverage. Please keep the existing calls as-is.
Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
196-202
: Initialize MNNVL workspaces eagerly — LGTMAllocating both dispatch and prepare workspaces up front for MNNVL keeps the forward path clean and avoids first‑use syncs. Looks good.
299-301
: Confirm intent: always allow all‑to‑all for MNNVL, even when chunking would be usedReturning True here bypasses the chunking guard that disables all‑to‑all for other backends. If MNNVL handles chunking/pathological shapes reliably, this is fine; otherwise we may want parity with the chunking check to avoid oversized buffers.
Can you confirm MNNVL supports cases where
calculate_num_chunks(all_rank_num_tokens) > 1
without extra memory pressure or perf regressions?
580-585
: Correctly dispatch through the two‑step MNNVL flowUsing
alltoall_dispatch
with thealltoall_info
produced byalltoall_prepare
is correct, and passingtop_k
ensuresmemset_expert_ids
can rebuild the compact expert IDs later in combine.cpp/tests/unit_tests/kernels/CMakeLists.txt (1)
45-46
: Add fusedMoeCommKernelTest — LGTMRegistering the fused MoE comm kernel tests unconditionally is consistent with the surrounding test targets.
tensorrt_llm/_torch/models/modeling_speculative.py (1)
342-344
: Set load‑balancer repeat count based on next‑N layout — LGTMUsing
num_nextn_predict_layers // mtp_num_layers
correctly derives the repeat factor for the next layer in both MTP‑Eagle (1 layer) and standard Next‑N cases.tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
61-62
: Import cleanup — LGTMSwitching to import only
create_moe
aligns with the removal of the local call sites.
1134-1141
: Confirm removal of moe_load_balancer_set_repeated_for_next_layer hereThis path now no‑ops where the load‑balancer repeat count was previously set. Given that
modeling_speculative.MTPForCausalLM
now sets it centrally, this is likely intended to avoid double‑setting. Please confirm this covers all DeepSeek‑V3 speculative configurations (vanilla and non‑vanilla MTP).If any non‑spec flows still rely on per‑module configuration, we might need a fallback here guarded by the relevant flags.
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (4)
1-15
: LGTM!The copyright notice follows the Apache 2.0 license template correctly with the updated year range (2019-2025).
16-19
: LGTM!The header guard and include are properly structured - using
#pragma once
and including the required<stdint.h>
for fixed-width integer types.
25-29
: LGTM!Good use of conditional compilation for the ALIGN_256 macro to ensure portability between CUDA and host code.
31-32
: LGTM!The WARP_SIZE and WARP_MASK constants are properly defined and follow the naming convention (kPREFIXED_UPPER_SNAKE for constants).
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (4)
190-195
: LGTM! Simplified alltoall gating logic.The removal of the
experts_per_token % 4 == 0
constraint aligns with the new fused MoE communication infrastructure that handles arbitrary topK values.
354-357
: LGTM! Clean adaptation to the new MoE prepare API.The refactoring correctly updates the function call to return only alltoall_info and gathered_expert_statics, aligning with the simplified fused MoE communication interface.
363-367
: LGTM! Elegant multi-input alltoall dispatch.The new implementation cleanly handles multiple inputs in a single alltoall kernel call, improving efficiency compared to the previous multi-step approach.
369-371
: LGTM! Proper expert ID initialization.The new memset_expert_ids call correctly initializes expert metadata using the fused MoE infrastructure.
tensorrt_llm/_mnnvl_utils.py (4)
369-372
: LGTM! Proper workspace initialization with synchronization.The addition of moe_initialize_workspace followed by a barrier ensures the workspace is properly initialized before use across all ranks.
443-443
: LGTM! Clean API simplification.The function now returns only the essential outputs (alltoall_info and gathered_expert_statics), removing unused intermediate values.
528-589
: Excellent multi-input handling with None-preservation logic! The implementation elegantly handles both single tensors and lists with optional entries, preserving None positions in the output. This is a robust design that maintains API flexibility (see the sketch below).
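A small sketch of that None-preserving pattern under stated assumptions (the helper and the `dispatch_all` stand-in are illustrative, not the module's actual functions): filter out the None entries, run the multi-input op on the dense list, then re-insert None at the original positions.

```python
from typing import Callable, List, Optional

import torch

def alltoall_preserving_none(
    inputs: List[Optional[torch.Tensor]],
    dispatch_all: Callable[[List[torch.Tensor]], List[torch.Tensor]],
) -> List[Optional[torch.Tensor]]:
    # Keep the positions of real tensors, drop the Nones for the actual op.
    positions = [i for i, t in enumerate(inputs) if t is not None]
    dense_outputs = dispatch_all([inputs[i] for i in positions])
    # Re-insert None entries so callers see outputs aligned with their inputs.
    results: List[Optional[torch.Tensor]] = [None] * len(inputs)
    for pos, out in zip(positions, dense_outputs):
        results[pos] = out
    return results

# Usage with an identity stand-in for the communication op:
outs = alltoall_preserving_none([torch.ones(2), None, torch.zeros(3)], lambda ts: ts)
assert outs[1] is None and outs[0].shape == (2,) and outs[2].shape == (3,)
```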
602-614
: LGTM! Consistent adaptation to the new moe_comm interface.The combine function correctly wraps the single input tensor in a list and extracts the first element from the result, maintaining compatibility with the new multi-input API.
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2)
182-202
: LGTM! Well-designed multi-input fake implementation.The fake implementation correctly handles the new multi-input signature, creating appropriately sized output tensors for each input.
246-250
: LGTM! New memset_expert_ids op properly registered.The new operation is correctly registered with appropriate parameter types matching the C++ implementation.
cpp/tensorrt_llm/kernels/moePrepareKernels.h (4)
35-35
: LGTM! Cleaner constant definition.Using static constexpr instead of macros is a better practice for type safety.
71-74
: LGTM! Enhanced function signature for expert statistics.The addition of expertStatics and gatheredExpertStatics parameters along with slotCount and expertCount aligns with the new statics-based counting approach in the fused MoE implementation.
82-83
: LGTM! New expert ID initialization function.The memsetExpertIds function provides necessary functionality for the fused MoE communication stack.
43-48
: Verify sufficient buffer capacity for MoeCommFifoConnInfo::values.Please ensure that the hard-coded 512-element
values
array can accommodate the worst-case number of in-flight entries in each FIFO channel. Specifically:
- Audit all kernel implementations that use
MoeCommFifoConnInfo
(e.g. inmoePrepareKernels.cu
) to determine the maximum number of writes tovalues
before reads advancetail
.- Confirm host-side orchestration (the code that updates
head
/tail
) never allows(head – tail) % 512
to exceed 511.- If any configuration (for example,
channelCount × maxOutstandingRequests
) could breach this limit, either increase the array size or make it a runtime parameter.cpp/tensorrt_llm/kernels/moePrepareKernels.cu (1)
84-87
: Ensure consistent naming conventions for function parameters.The parameter name
slotCount
should follow the project's naming conventions. Based on other parameters in the function (e.g.,expertCount
,topK
), consider usingslotCount
(already correct) or ensure it aligns with the naming pattern used throughout the codebase.cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
1219-1229
: Potential null pointer dereference in kernel attribute query.The function
computeMoeAlltoallMaxDynamicSharedMemorySize
casts a templated function tovoid const*
without validation, which could lead to issues if the function address is invalid.Add error checking for the function attributes query:
int computeMoeAlltoallMaxDynamicSharedMemorySize() { int devId = -1; TLLM_CUDA_CHECK(cudaGetDevice(&devId)); cudaFuncAttributes attr{}; - TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, (void const*) moeAllToAllKernel<1>)); + auto kernelPtr = reinterpret_cast<void const*>(moeAllToAllKernel<1>); + TLLM_CHECK_WITH_INFO(kernelPtr != nullptr, "Failed to get kernel function pointer"); + TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, kernelPtr)); int staticSmem = static_cast<int>(attr.sharedSizeBytes);Likely an incorrect or invalid review comment.
49-183
: Add CUDA architecture guards for NVFP4 quantization functions.The
quantize_nvfp4_sharedmem
function uses CUDA 10.0+ features but only checks at Lines 55-56. The entire function should be guarded to prevent compilation issues on older architectures.Wrap the entire function with architecture guards:
template <typename DType> __device__ __forceinline__ void quantize_nvfp4_sharedmem(uint8_t* compact_ptr, int sizeInBytes, int laneId) { -#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) +#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ < 1000) + // Function not supported on this architecture +#else int const numElems = sizeInBytes / sizeof(DType); // ... rest of the function ... -#endif +#endif // __CUDA_ARCH__ >= 1000 }Likely an incorrect or invalid review comment.
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (1)
1095-1096
: Verify workspace size calculation parametersI confirmed that the only workspace-size helper is
static size_t computeWorkspaceSizePreRank(int epSize, int channelCount)
in
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h:520
and that your test invokes it exactly once at
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:1095–1096
.The test passes
auto workspaceSizePerRank = FusedMoeWorkspace::computeWorkspaceSizePreRank(1, totalChannelCount); auto totalWorkspaceSize = workspaceSizePerRank;– using
1
forepSize
andtotalChannelCount
forchannelCount
. This is correct only if:
- you really intend to test the single-endpoint case (
epSize == 1
), andtotalChannelCount
represents the number of channels per endpoint (not the global channel count across multiple endpoints).Please verify that, under your channel distribution:
- epSize is always 1 in this test, and
- the value passed as
totalChannelCount
is the per-endpoint channel count expected bycomputeWorkspaceSizePreRank
.If either assumption doesn’t hold (e.g. testing multi-endpoint or using a global channel count), you’ll need to adjust the arguments or multiply by the number of endpoints to ensure the test’s expected workspace size matches the implementation.
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
112-113
: Ignore the low-precision flag “redundancy” – it’s intentional.The local
useLowPrecisionVal
is
- assigned to
params.isLowPrecision
so the subsequentmoeAllToAll
kernel dispatch and shape checks can specialize on low precision- passed again into
fillMetaInfo
so the meta-info sizing logic handles the low-precision path correctlyNo change is needed here.
Likely an incorrect or invalid review comment.
Actionable comments posted: 1
♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
261-273
: Static member initialization not thread-safe.The
setMaxUsableSmCount
function modifies static members without synchronization, which could lead to race conditions if called from multiple threads.
298-309
: Static map access not thread-safe.The
getMoeCommChannelCount
function uses a staticstd::map
without synchronization, which can cause race conditions.cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (4)
175-216
: Inline PTX assembly lacks proper CUDA architecture guards and error handling.The
e2m1_to_fp32_vec
function uses inline PTX assembly with an unrealistic architecture check (__CUDA_ARCH__ >= 1000
). CUDA compute capabilities don't reach 1000 - the highest current value is around 90.
419-459
: Fix parameter shadowing in fillFieldInfo.Line 458 assigns to the parameter
originalDataType
instead of the member variable, which is a bug.
1153-1223
: Grid-level synchronization issue between send and receive phases.The kernel launches both sender and receiver logic without proper synchronization between them, which could lead to race conditions.
1300-1311
: Large memory operation without size validation.The
initializeLocalWorkspace
method performs large memory operations without validating sizes.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (6 hunks)
- cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
- tensorrt_llm/_mnnvl_utils.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_mnnvl_utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_mnnvl_utils.py
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧠 Learnings (2)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
- cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10): idx (458-488), idx (458-458), topK (187-286), topK (187-189), topK (334-455), topK (334-336), topK (533-658), topK (533-535), topK (987-1139), topK (987-990)

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2): fillMetaInfo (1239-1269), fillMetaInfo (1239-1240)
- cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8): topK (187-286), topK (187-189), topK (334-455), topK (334-336), topK (533-658), topK (533-535), topK (987-1139), topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (15)
tensorrt_llm/_mnnvl_utils.py (2)
592-600
: LGTM! Parameter added correctly.The
use_low_precision_combine
parameter is properly added as an optional boolean with a default value of False, maintaining backward compatibility.
613-614
: Parameter correctly propagated to moe_comm operation.The new parameter is properly passed as the final argument to the torch operation, following the established pattern in this codebase.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)
72-79
: LGTM! Field additions properly aligned and documented.The new fields are correctly added to support low-precision operations:
alignedUnitBit/Count/Stride
enable flexible data alignmentunalignedFieldIndex
andcompact16BOffset
support memory layout optimizationoriginalDataType
preserves type information for quantizationThe alignment constants (Lines 85-88) are well-defined for memory access patterns.
408-409
: Function signature properly extended for low-precision support.The addition of the
isLowPrecision
parameter tofillMetaInfo
correctly enables conditional sizing calculations for low-precision paths.
424-424
: LGTM! Low-precision flag properly added to kernel parameters.The
isLowPrecision
boolean flag is correctly placed in the parameter structure to control kernel behavior.cpp/tensorrt_llm/thop/moeCommOp.cpp (3)
31-37
: LGTM! Function correctly derives and passes originalDataType.The helper function properly converts the torch data type to CUDA data type and passes it to
fillFieldInfo
.
112-117
: LGTM! Low-precision flag properly handled and propagated.The optional parameter is correctly converted to a boolean with a sensible default (false), assigned to the kernel parameters, and propagated to both send and receive metadata initialization.
257-257
: LGTM! Python binding correctly updated.The Torch binding signature properly exposes the new optional
use_low_precision
parameter, maintaining backward compatibility.cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (7)
33-173
: Quantization implementation looks good with proper CUDA architecture guards.The NVFP4 quantization function is well-structured with:
- Proper architecture check for SM 10.0+
- Efficient warp-level reduction for global max
- Per-16-element grouping with FP8 scales
- Appropriate memory layout for packed data
The implementation follows NVIDIA's recommended patterns for low-precision quantization.
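As a rough host-side reference of the per-16-element scheme described above (a simplified model with an FP32 per-group scale and nearest-value E2M1 snapping — not the kernel's exact bit layout, FP8 scale encoding, or global-scale handling):

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 (NVFP4) code; the sign is the fourth bit.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_groups(x: np.ndarray, group: int = 16):
    """Quantize per 16-element group: one scale per group plus a sign and 8-level magnitude per value."""
    x = x.astype(np.float32).reshape(-1, group)
    scales = np.abs(x).max(axis=1, keepdims=True) / E2M1_VALUES[-1]  # map the group max onto +/-6
    scales[scales == 0] = 1.0                                        # avoid divide-by-zero for all-zero groups
    mags = np.abs(x) / scales
    codes = np.abs(mags[..., None] - E2M1_VALUES).argmin(axis=-1)    # nearest representable magnitude
    signs = np.sign(x)
    return codes, signs, scales

def dequantize_groups(codes, signs, scales):
    return E2M1_VALUES[codes] * signs * scales

x = np.random.randn(4, 16).astype(np.float32)
codes, signs, scales = quantize_groups(x)
x_hat = dequantize_groups(codes, signs, scales)
# Coarse sanity check: per-element error stays within one quantization step of the group scale.
assert np.abs(x - x_hat).max() <= scales.max()
```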
218-269
: Dequantization implementation correctly mirrors quantization logic.The dequantization function properly:
- Reads the global scale from the expected location
- Iterates through groups to reconstruct values
- Applies scales in the correct order
- Has proper architecture guards
889-892
: LGTM! Proper compile-time enforcement for low-precision constraints.The static assertion correctly enforces that low-precision mode only supports a single field, preventing misuse at compile time.
1010-1027
: Quantization properly integrated in send path.The conditional quantization block correctly:
- Checks for LOW_PRECISION at compile time
- Extracts necessary field metadata
- Switches on the original data type
- Calls the appropriate typed quantization function
1094-1110
: Dequantization properly integrated in receive path.The dequantization logic mirrors the quantization path appropriately, ensuring data is correctly restored after transfer.
1239-1269
: Low-precision metadata calculation looks correct.The function properly computes compact sizes for low-precision mode:
- Validates single field requirement
- Checks for supported data types (FP16/BF16)
- Calculates space for packed values + scales + global scale
- Aligns to 128-byte boundaries
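Assuming the sizing sketched in the bullets above (packed 4-bit values, one 1-byte per-16-element scale, one FP32 global scale, rounded up to 128 bytes), a back-of-envelope helper — illustrative only, not the kernel's actual layout code:

```python
def low_precision_compact_bytes(num_elems: int, group: int = 16, align: int = 128) -> int:
    """Estimate the 128B-aligned compact size for one low-precision field."""
    assert num_elems % group == 0, "hidden size is assumed to be a multiple of the group size"
    packed_values = num_elems // 2      # two 4-bit values per byte
    group_scales = num_elems // group   # one 1-byte scale per 16 elements
    global_scale = 4                    # one FP32 global scale
    raw = packed_values + group_scales + global_scale
    return (raw + align - 1) // align * align

# e.g. a 7168-wide hidden state: 3584 + 448 + 4 = 4036 bytes -> rounded up to 4096.
assert low_precision_compact_bytes(7168) == 4096
```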
1327-1349
: Kernel dispatch correctly handles low-precision flag.The kernel selection logic properly:
- Validates that low-precision only works with single field
- Selects appropriate kernel template based on field count and precision mode
- Falls back to non-low-precision variants for multi-field cases
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
31-37
: Update all fillFieldInfo invocations to the new five-argument signatureThe unit tests in
fusedMoeCommKernelTest.cpp
are still calling the old four-argument form. Please update each of these call sites to pass the additionaloriginalDataType
argument (for example, viaconvert_torch_dtype(...)
or the appropriateDataType::…
enum).• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:241
• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:388
• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:615
• cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp:1070For example, change:
fieldsInfo[i].fillFieldInfo(deviceField, elementSize, vectorSize, vectorSize);to something like:
fieldsInfo[i].fillFieldInfo(deviceField, elementSize, vectorSize, vectorSize, /* originalDataType – e.g. */ convert_torch_dtype(tensor.scalar_type()));
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
421-462
: Critical: originalDataType member is never set in MoeCommFieldInfo::fillFieldInfo.The assignment
originalDataType = originalDataType;
is a self-assignment of the parameter and does not update the member. This silently breaks low-precision branching that relies on fieldsInfo[i].originalDataType.Apply this fix:
-__host__ void MoeCommFieldInfo::fillFieldInfo( - uint8_t* dataPtr, size_t elementSize, int vectorSize, int stride, cudaDataType_t originalDataType) +__host__ void MoeCommFieldInfo::fillFieldInfo( + uint8_t* dataPtr, size_t elementSize, int vectorSize, int stride, cudaDataType_t originalDataType) { ... - alignedUnitCount = vectorSize; - alignedUnitStride = stride; - originalDataType = originalDataType; + alignedUnitCount = vectorSize; + alignedUnitStride = stride; + this->originalDataType = originalDataType; }
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
177-220
: Inline PTX lacks portable fallback and guards are overly restrictive.e2m1_to_fp32_vec uses inline PTX gated at CUDA_ARCH >= 1000 with no else path. Provide a safe fallback (e.g., zero the output or a scalar decoder) and use realistic arch guards. This mirrors prior feedback.
Example minimal fallback:
inline __device__ void e2m1_to_fp32_vec(uint64_t e2m1Vec, float2 (&array)[8]) { -#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) +// Blackwell fast-path +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) ... -#endif +#else + #pragma unroll + for (int i = 0; i < 8; ++i) { array[i] = make_float2(0.f, 0.f); } +#endif }
1301-1313: Large memset without bounds validation in initializeLocalWorkspace.

Before cuMemsetD32/cudaMemset, verify fifoSize + senderSideInfoSize + receiverSideInfoSize does not exceed rankStrideInU64 * sizeof(uint64_t). This matches earlier guidance.
 void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo)
 {
     ...
-    TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE,
-        fifoSize / sizeof(uint32_t)));
+    size_t totalBytes = fifoSize + senderSideInfoSize + receiverSideInfoSize;
+    TLLM_CHECK_WITH_INFO(totalBytes <= rankStrideInU64 * sizeof(uint64_t),
+        "Workspace size (%zu) exceeds allocated stride (%zu).",
+        totalBytes, rankStrideInU64 * sizeof(uint64_t));
+    TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr),
+        FusedMoeProto::INITIALIZED_VALUE, fifoSize / sizeof(uint32_t)));
     TLLM_CUDA_CHECK(cudaMemset(
         reinterpret_cast<uint8_t*>(localWorkspacePtr) + fifoSize, 0, senderSideInfoSize + receiverSideInfoSize));
 }
🧹 Nitpick comments (3)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
230-248: Minor: error message typo and clarity in memset_expert_ids.
- Line 234: Check enforces expertsIds.dim() == 2 but the message says “1D tensor”. Please correct to “2D tensor” to avoid confusion.
-    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor");
+    TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor");
171-194: Naming consistency for RecvRankCountCumSum variable.

Local variables should use lowerCamelCase per guidelines. RecvRankCountCumSum starts with an uppercase letter and is inconsistent with sendRankCountCumSum. Consider renaming to recvRankCountCumSum for readability.
Also applies to: 211-227
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1227-1237: Static shared memory attribute sampled from non-LOW_PRECISION variant.

computeMoeAlltoallMaxDynamicSharedMemorySize queries attributes from moeAllToAllKernel<1> (LOW_PRECISION=false). If static smem ever diverges between variants, this could under/over-estimate. Consider probing both or the chosen kernelFn.
-    TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, (void const*) moeAllToAllKernel<1>));
+    // Probe the conservative (largest) variant if they diverge in future.
+    TLLM_CUDA_CHECK(cudaFuncGetAttributes(&attr, (void const*) moeAllToAllKernel<1, true>));
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (6 hunks)
- cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
- tensorrt_llm/_mnnvl_utils.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- tensorrt_llm/_mnnvl_utils.py
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10)
idx (458-488), idx (458-458), topK (187-286), topK (187-189), topK (334-455), topK (334-336), topK (533-658), topK (533-535), topK (987-1139), topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
35-37: Good addition: propagate original dtype into field metadata.

Passing convert_torch_dtype(t.scalar_type()) into fillFieldInfo ensures the kernel can reason about the original element type for low-precision packing. This unblocks accurate size/meta calculations downstream.
257-257: Schema sync for new argument looks correct.

The Torch library def reflects the new optional flag at the end, maintaining backward compatibility for positional callers.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
23-24: Include dependency is appropriate.

Including quantization.cuh here is expected for NVFP4/FP8 conversions and keeps device helpers localized to this TU.
874-897: Constrain low-precision field count at compile-time (nice).

static_assert(FIELD_COUNT == 1) is a good defensive check and keeps meta/layout logic simple.
552a379 to 02c6e07 Compare
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
877-1161
: SingleChannelCommunicator template instantiation with LOW_PRECISION requires architecture checks.

The
SingleChannelCommunicator
template withLOW_PRECISION=true
will call quantization/dequantization functions that are no-ops on architectures < SM100. This should be checked at instantiation time.

Add architecture validation in the constructor:
__device__ __forceinline__ SingleChannelCommunicator(FusedMoeFieldInfo const& fieldInfo, MoeExpertParallelInfo const& expertParallelInfo, MoeSingleCommMeta const& commMeta, FusedMoeWorkspace const& workspace, FusedMoeWorldInfo const& worldInfo, FusedMoePairInfo const& pairInfo, uint64_t* smemBar, uint8_t* shmemBase) : mFieldInfo(fieldInfo) , mExpertParallelInfo(expertParallelInfo) , mCommMeta(commMeta) , mWorkspace(workspace) , mWorldInfo(worldInfo) , mPairInfo(pairInfo) , mSmemBar(smemBar) , mShmemBase(shmemBase) { if constexpr (LOW_PRECISION) { static_assert(FIELD_COUNT == 1, "Low precision alltoall only support 1 field"); +#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 1000 + // This will cause a compile error on unsupported architectures + static_assert(false, "LOW_PRECISION requires SM100+ (CUDA_ARCH >= 1000)"); +#endif }
♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (6)
180-223: Duplicate comment: Missing error handling in inline PTX assembly.

463-463: Duplicate comment: fillFieldInfo method modifies originalDataType parameter incorrectly.

1015-1033: Duplicate comment: Architecture gating: low-precision path compiles to no-ops on < SM100 and will corrupt I/O.

1249-1274: Duplicate comment: Low-precision compact size calculation is inconsistent with the quantized layout.

466-599: Duplicate comment: Complex LL128Proto implementation needs documentation.

1304-1316: Duplicate comment: Large memory operation without size validation.
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
1331-1354: Template instantiation for low-precision kernel variants lacks compile-time safety.

The getFunc lambda (Lines 1334-1353) returns kernel function pointers based on runtime parameters. The low-precision path is only instantiated for fieldCount == 1, but this constraint is not enforced at compile time.

Consider using template specialization or static_assert to enforce the constraint:
+template<int FIELD_COUNT> +struct KernelSelector { + static auto get(bool lowPrecision) { + if constexpr (FIELD_COUNT == 1) { + return lowPrecision ? fused_moe_impl::moeAllToAllKernel<1, true> + : fused_moe_impl::moeAllToAllKernel<1>; + } else { + // Low precision not supported for multiple fields + TLLM_CHECK(!lowPrecision); + return fused_moe_impl::moeAllToAllKernel<FIELD_COUNT>; + } + } +}; auto getFunc = [](int fieldCount, bool lowPrecision) { switch (fieldCount) { - case 1: - if (lowPrecision) - return fused_moe_impl::moeAllToAllKernel<1, true>; - else - return fused_moe_impl::moeAllToAllKernel<1>; + case 1: return KernelSelector<1>::get(lowPrecision); case 2: return KernelSelector<2>::get(lowPrecision); // ... etc
1467-1467
: Test harness doesn't test low-precision path.The
launchSingleG2S
function always passesfalse
forisLowPrecision
when callingfillMetaInfo
. The test infrastructure should be extended to support testing the low-precision quantization path.Would you like me to generate test code that exercises the low-precision quantization/dequantization paths? This would help validate the implementation and catch potential issues early.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
(17 hunks)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
(6 hunks)cpp/tensorrt_llm/thop/moeCommOp.cpp
(3 hunks)tensorrt_llm/_mnnvl_utils.py
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
- tensorrt_llm/_mnnvl_utils.py
- cpp/tensorrt_llm/thop/moeCommOp.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (2)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10)
idx
(458-488)idx
(458-458)topK
(187-286)topK
(187-189)topK
(334-455)topK
(334-336)topK
(533-658)topK
(533-535)topK
(987-1139)topK
(987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1096-1115
: I’ve added searches to locate the FusedMoeProto struct and inspect both protoPack and protoUnpack implementations for how quantized data (scales, values, global scale) are handled. Once we see those definitions, we can confirm if the unpacking matches the expected layout before dequantization.
94b3029 to 9e82990 Compare
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1334-1357
: Host-side gating for LOW_PRECISION kernels is missing

Enforce compute capability >= 10.x when params.isLowPrecision is true. Otherwise, on Hopper/Ampere builds the device helpers are no-ops and the protocol/layout diverge.
void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream) { + if (params.isLowPrecision) + { + int dev = -1; + int major = 0; + TLLM_CUDA_CHECK(cudaGetDevice(&dev)); + TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev)); + TLLM_CHECK_WITH_INFO(major >= 10, + "Low-precision (NVFP4) all-to-all requires SM100+ (compute capability 10.x). Detected SM%d0.", major); + } @@ - auto getFunc = [](int fieldCount, bool lowPrecision) + auto getFunc = [](int fieldCount, bool lowPrecision) { switch (fieldCount) { case 1: if (lowPrecision) return fused_moe_impl::moeAllToAllKernel<1, true>; else return fused_moe_impl::moeAllToAllKernel<1>;
♻️ Duplicate comments (10)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (6)
32-40
: Volatile head/tail do not provide synchronization; use atomics or explicit fences

Using volatile for inter-CTA producer/consumer coordination is insufficient on CUDA; it does not guarantee atomicity or ordering. Replace with plain fields and perform updates with atomics and/or memory fences at the access sites (sender/receiver paths already touched by cp.async and fences), or wrap accesses with appropriate cuda::atomic_ref on supported toolchains.
 struct ALIGN_256 SenderSideFifoInfo
 {
-    volatile uint64_t head; // write position
-    volatile uint64_t tail; // read position
+    uint64_t head; // write position (updated with atomics/fences at use sites)
+    uint64_t tail; // read position (updated with atomics/fences at use sites)
 };
 struct ALIGN_256 ReceiverSideFifoInfo
 {
-    volatile uint64_t head; // write position do we use this?
-    volatile uint64_t tail; // read position
+    uint64_t head; // write position (updated with atomics/fences at use sites)
+    uint64_t tail; // read position (updated with atomics/fences at use sites)
 };

If you prefer device-side C++ atomics (see the sketch after this list):
- include <cuda/atomic>
- use cuda::atomic_ref<uint64_t, cuda::thread_scope_system>(head).store(..., memory_order_release) etc.
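A minimal device-side sketch of that alternative, assuming a CUDA toolkit whose libcu++ provides cuda::atomic_ref; the function names are illustrative, and head would become a plain uint64_t field:

```cpp
#include <cstdint>
#include <cuda/atomic>

// Producer publishes a new head with release semantics; consumer reads it with
// acquire semantics. System scope matches the multi-GPU/NVLink use case.
__device__ void publishHead(uint64_t* head, uint64_t newHead)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> headRef(*head);
    headRef.store(newHead, cuda::std::memory_order_release);
}

__device__ uint64_t observeHead(uint64_t* head)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> headRef(*head);
    return headRef.load(cuda::std::memory_order_acquire);
}
```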
156-167
: Pointer arithmetic lacks bounds validation; guard and clamp the copy range

get16BAlignedLoadCopyRange() assumes rawDataPtr and sizes are valid and performs arithmetic in signed int. Use uintptr_t/size_t, clamp to raw range, and handle nulls to avoid UB.
__device__ __forceinline__ uint8_t* get16BAlignedLoadCopyRange(int index, int* copyByteCount) const { - int rawSize; - uint8_t* rawDataPtr = getRawPtr(index, &rawSize); - uint8_t* rawEndPtr = rawDataPtr + rawSize; - uint8_t* alignedDataPtr - = reinterpret_cast<uint8_t*>(reinterpret_cast<uint64_t>(rawDataPtr) & (~kAlign16BytePtrMask)); - uint32_t copySize = rawEndPtr - alignedDataPtr; - *copyByteCount - = (copySize & kAligned16BMask) != 0 ? (copySize & (~kAligned16BMask)) + BYTES_PER_16B_BLOCK : copySize; - return alignedDataPtr; + int rawSize = 0; + uint8_t* rawDataPtr = getRawPtr(index, &rawSize); + if (rawDataPtr == nullptr || rawSize <= 0) { + if (copyByteCount) *copyByteCount = 0; + return nullptr; + } + uint8_t* rawEndPtr = rawDataPtr + static_cast<size_t>(rawSize); + auto alignedDataPtr = reinterpret_cast<uint8_t*>( + (reinterpret_cast<uintptr_t>(rawDataPtr)) & (~static_cast<uintptr_t>(kAlign16BytePtrMask))); + // clamp start to rawDataPtr + if (alignedDataPtr < rawDataPtr) alignedDataPtr = rawDataPtr; + size_t copySize = static_cast<size_t>(rawEndPtr - alignedDataPtr); + uint32_t copySizeU32 = static_cast<uint32_t>(copySize); + uint32_t padded = (copySizeU32 & kAligned16BMask) != 0 + ? (copySizeU32 & (~kAligned16BMask)) + BYTES_PER_16B_BLOCK + : copySizeU32; + if (copyByteCount) *copyByteCount = static_cast<int>(padded); + return alignedDataPtr; }
169-202
: Store-range computation can underflow/overflow; compute with unsigned and clamp head/tail indices

As above, compute pointers/offsets with uintptr_t/size_t, clamp alignedDataPtr/alignedEndPtr inside raw range, and set headTail indices to -1 when out of bounds.
__device__ __forceinline__ uint8_t* get16BAlignedStoreCopyRange( int index, int* copyByteCount, int laneId, int* headTailShmIdx, int* headTailGlobalIdx) const { - int rawSize; - uint8_t* rawDataPtr = getRawPtr(index, &rawSize); - uint8_t* rawEndPtr = rawDataPtr + rawSize; - int offset = reinterpret_cast<uint64_t>(rawDataPtr) & kAlign16BytePtrMask; - uint8_t* alignedDataPtr - = reinterpret_cast<uint8_t*>(reinterpret_cast<uint64_t>(rawDataPtr) + BYTES_PER_16B_BLOCK - offset); - uint8_t* alignedEndPtr - = reinterpret_cast<uint8_t*>(reinterpret_cast<uint64_t>(rawEndPtr) & (~kAlign16BytePtrMask)); - int alignedCopyBytes = alignedEndPtr - alignedDataPtr; + int rawSize = 0; + uint8_t* rawDataPtr = getRawPtr(index, &rawSize); + if (!rawDataPtr || rawSize <= 0) { if (copyByteCount) *copyByteCount = 0; if (headTailShmIdx) *headTailShmIdx = -1; if (headTailGlobalIdx) *headTailGlobalIdx = -1; return nullptr; } + uint8_t* rawEndPtr = rawDataPtr + static_cast<size_t>(rawSize); + int offset = static_cast<int>(reinterpret_cast<uintptr_t>(rawDataPtr) & kAlign16BytePtrMask); + auto alignedDataPtr = reinterpret_cast<uint8_t*>( + reinterpret_cast<uintptr_t>(rawDataPtr) + BYTES_PER_16B_BLOCK - offset); + auto alignedEndPtr = reinterpret_cast<uint8_t*>( + reinterpret_cast<uintptr_t>(rawEndPtr) & (~static_cast<uintptr_t>(kAlign16BytePtrMask))); + if (alignedDataPtr < rawDataPtr) alignedDataPtr = rawDataPtr; + if (alignedEndPtr > rawEndPtr) alignedEndPtr = rawEndPtr; + int alignedCopyBytes = static_cast<int>(alignedEndPtr - alignedDataPtr); if (alignedCopyBytes < 0) { alignedCopyBytes = 0; } *copyByteCount = alignedCopyBytes; @@ - *headTailGlobalIdx = *headTailShmIdx - offset; + *headTailGlobalIdx = *headTailShmIdx - offset; if (*headTailGlobalIdx < 0 || *headTailGlobalIdx >= rawSize) { *headTailGlobalIdx = -1; *headTailShmIdx = -1; } return alignedDataPtr; }
261-273
: setMaxUsableSmCount modifies static state without synchronization

Concurrent calls can race on maxSmCountUsed and maxSmCount. Guard with a mutex; also make reads in getMaxUsableSmCount consistent.
Add a mutex member and lock the setter/getter bodies:
class FusedMoeCommunicator { public: @@ - static void setMaxUsableSmCount(int maxUsableSmCount) + static std::mutex maxSmCountMutex; + static void setMaxUsableSmCount(int maxUsableSmCount) { + std::lock_guard<std::mutex> lock(maxSmCountMutex); TLLM_CHECK_WITH_INFO( FusedMoeCommunicator::maxSmCountUsed == false, "setMaxUsableSmCount can be called only before it is used"); @@ - FusedMoeCommunicator::maxSmCount = maxUsableSmCount; + FusedMoeCommunicator::maxSmCount = maxUsableSmCount; }And in getMaxUsableSmCount():
- static int getMaxUsableSmCount() + static int getMaxUsableSmCount() { + std::lock_guard<std::mutex> lock(maxSmCountMutex); FusedMoeCommunicator::maxSmCountUsed = true; if (FusedMoeCommunicator::maxSmCount == -1) { int smCount = tensorrt_llm::common::getMultiProcessorCount(); FusedMoeCommunicator::maxSmCount = smCount; } return FusedMoeCommunicator::maxSmCount; }Note: add
#include <mutex>
at the top and provide a definition in the .cu:std::mutex FusedMoeCommunicator::maxSmCountMutex;
298-309
: Static map access is not thread-safeConcurrent callers of getMoeCommChannelCount() can race on the static map. Guard with a static mutex.
static int getMoeCommChannelCount(int epSize) { static std::map<int, int> channelCountMap{}; + static std::mutex mapMutex; + std::lock_guard<std::mutex> lock(mapMutex); auto iter = channelCountMap.find(epSize); if (iter == channelCountMap.end()) { auto channelCount = FusedMoeCommunicator::computeMoeCommChannelCount(epSize); channelCountMap[epSize] = channelCount; return channelCount; } return iter->second; }
518-524
: Workspace size computation may overflow silentlyMultiplying large epSize/channelCount can overflow size_t on some platforms or wrap intermediates. Add checked arithmetic and verify final total does not exceed the allocated rank stride.
static size_t computeWorkspaceSizePreRank(int epSize, int channelCount) { - size_t fifoSize = static_cast<size_t>(FusedMoeCommunicator::FIFO_TOTAL_BYTES) * epSize * channelCount; - size_t senderSideInfoSize = sizeof(SenderSideFifoInfo) * epSize * channelCount; - size_t receiverSideInfoSize = sizeof(ReceiverSideFifoInfo) * epSize * channelCount; - return fifoSize + senderSideInfoSize + receiverSideInfoSize; + auto mul_ok = [](size_t a, size_t b, size_t& out) -> bool { + if (a == 0 || b <= std::numeric_limits<size_t>::max() / a) { out = a * b; return true; } + return false; + }; + size_t fifoSize = 0, senderSideInfoSize = 0, receiverSideInfoSize = 0, tmp = 0; + TLLM_CHECK_WITH_INFO(mul_ok(static_cast<size_t>(FusedMoeCommunicator::FIFO_TOTAL_BYTES), epSize, tmp) + && mul_ok(tmp, channelCount, fifoSize), + "computeWorkspaceSizePreRank overflow on fifoSize"); + TLLM_CHECK_WITH_INFO(mul_ok(sizeof(SenderSideFifoInfo), epSize, tmp) && mul_ok(tmp, channelCount, senderSideInfoSize), + "computeWorkspaceSizePreRank overflow on senderSideInfoSize"); + TLLM_CHECK_WITH_INFO(mul_ok(sizeof(ReceiverSideFifoInfo), epSize, tmp) && mul_ok(tmp, channelCount, receiverSideInfoSize), + "computeWorkspaceSizePreRank overflow on receiverSideInfoSize"); + size_t total = fifoSize + senderSideInfoSize + receiverSideInfoSize; + TLLM_CHECK_WITH_INFO(total >= fifoSize, "computeWorkspaceSizePreRank overflow on total"); + return total; }tests/unittest/_torch/thop/test_moe_alltoall.py (1)
252-269
: Reference composition bug: shared output_indice_offset corrupts per-tensor referencesEach tensor dimension writes into the same [start:end] slice using a single shared offset, then advances only once. Maintain independent offsets for each tensor to prevent overlap.
- total_recv_all_ranks_cpu = [] - output_indice_offset = 0 + total_recv_all_ranks_cpu = [] + # Maintain independent offsets per tensor + output_offsets = [0] * tensor_count @@ - for i in range(tensor_count): - ref_output_tensors[i][output_indice_offset:output_indice_offset + local_recv_count_pair] = \ - input_tensors_all_ranks[i][other_rank][send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]]] - output_indice_offset += local_recv_count_pair + for i in range(tensor_count): + start = output_offsets[i] + end = start + local_recv_count_pair + ref_output_tensors[i][start:end] = input_tensors_all_ranks[i][other_rank][ + send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]] + ] + output_offsets[i] = endcpp/tensorrt_llm/thop/moeCommOp.cpp (1)
112-117
: Bug: topK used uninitialized when hasBasicFields=falseparams.expertParallelInfo.topK is never set here; passing it into fillMetaInfo is undefined behavior. When hasBasicFields=false, topK is unused; pass a deterministic 0 instead.
- bool useLowPrecisionVal = useLowPrecision.value_or(false); - params.isLowPrecision = useLowPrecisionVal; - params.sendFieldInfo.fillMetaInfo( - &(params.sendCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal); - params.recvFieldInfo.fillMetaInfo( - &(params.recvCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal); + bool useLowPrecisionVal = useLowPrecision.value_or(false); + params.isLowPrecision = useLowPrecisionVal; + constexpr int kTopKNotUsed = 0; + params.sendFieldInfo.fillMetaInfo( + &(params.sendCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal); + params.recvFieldInfo.fillMetaInfo( + &(params.recvCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal);cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
1306-1318
: Zeroing large workspace regions without size validation can overrun allocationBefore cuMemsetD32/cudaMemset, verify that the computed total fits within rankStrideInU64*8 and that fifoSize is divisible by sizeof(uint32_t) for cuMemsetD32.
void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo) { @@ - uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64; + uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64; + size_t total = fifoSize + senderSideInfoSize + receiverSideInfoSize; + TLLM_CHECK_WITH_INFO(total <= rankStrideInU64 * sizeof(uint64_t), + "FusedMoeWorkspace overflow: required=%zu allocated=%zu", total, rankStrideInU64 * sizeof(uint64_t)); + TLLM_CHECK_WITH_INFO((fifoSize % sizeof(uint32_t)) == 0, "fifoSize must be a multiple of 4 for cuMemsetD32"); TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE, fifoSize / sizeof(uint32_t)));
1015-1032
: LOW_PRECISION quantize is compiled to no-op on < SM100; add host-side capability gateOn pre-SM100 devices, the quantize/dequant helpers are empty due to arch guards, yet the LOW_PRECISION meta sizes/layout are still used, corrupting I/O. Add a CC check in moeAllToAll before launching LOW_PRECISION kernels.
Apply the host-side gate below (see moeAllToAll).
🧹 Nitpick comments (8)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)
105-116
: Local shadowing of alignedUnitBytes harms readability and risks mistakesInside getFieldUncompactSize(), alignedUnitBytes is re-declared with a different meaning. Rename the inner constant to avoid shadowing and accidental misuse.
- int alignedUnitBytes = 1 << alignedUnitBit; + int alignedUnitBytes = 1 << alignedUnitBit; int currentFieldSize = alignedUnitCount * alignedUnitBytes; if (alignedUnitBytes != 16) { - constexpr int alignedUnitBytes = BYTES_PER_16B_BLOCK; - currentFieldSize = currentFieldSize / alignedUnitBytes * alignedUnitBytes; - currentFieldSize += alignedUnitBytes * 2; + constexpr int kAligned16B = BYTES_PER_16B_BLOCK; + currentFieldSize = currentFieldSize / kAligned16B * kAligned16B; + currentFieldSize += kAligned16B * 2; }
286-296
: Typo in variable name and minor nitVariable perferredChannel should be preferredChannel.
- int perferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication - int channelCount = std::max(perferredChannel, 1); // at lease one channel + int preferredChannel = smCount / 2 / blockCountPerChannel; // use half SMs for communication + int channelCount = std::max(preferredChannel, 1); // at least one channel
16-16
: Repository guideline requires include guards in headersOur project guideline mandates include guards of the form TRTLLM_<FILE_NAME>_H for headers. Please add them (you can keep
#pragma once
as a fast-path).Suggested wrapper (apply at top and bottom of the header):
// Top of file: #ifndef TRTLLM_FUSEDMOECOMMKERNELS_H #define TRTLLM_FUSEDMOECOMMKERNELS_H #pragma once // ... existing header content ... // Bottom of file: #endif // TRTLLM_FUSEDMOECOMMKERNELS_Htests/unittest/_torch/thop/test_moe_alltoall.py (1)
337-343
: Low-precision reference conversion loops over tokens; vectorize to speed up

Per-token quant/dequant is O(T) Python overhead. Convert the entire ref_output tensor once.
- if use_low_precision: - for token_id in range(ref_output_tensors[i].shape[0]): - ref_output_tensors[i][token_id] = quant_and_dequant( - ref_output_tensors[i][token_id]) - atol, rtol = 1e-2, 1e-2 + if use_low_precision: + # Vectorized conversion + ref_output_tensors[i] = quant_and_dequant(ref_output_tensors[i]) + atol, rtol = 1e-2, 1e-2cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
74-86
: Validate low-precision preconditions early (field count and dtype)

Fail fast at the Python binding when low-precision is requested but inputs are incompatible (e.g., multiple fields or unsupported dtype). Improves UX vs. hitting kernel assertions later.
int fieldCount = inputs.size(); TORCH_CHECK(fieldCount <= tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, "Number of fields (", fieldCount, ") exceeds maximum allowed (", tensorrt_llm::kernels::MOE_COMM_FIELD_MAX_COUNT, ")"); + bool useLowPrecisionValEarly = useLowPrecision.has_value() && useLowPrecision.value(); + if (useLowPrecisionValEarly) { + TORCH_CHECK(fieldCount == 1, "Low-precision alltoall supports exactly 1 tensor field."); + auto dt = inputs[0].scalar_type(); + TORCH_CHECK(dt == torch::kHalf || dt == torch::kBFloat16, + "Low-precision alltoall requires float16 or bfloat16 input, got ", c10::toString(dt)); + }cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (3)
35-178
: Quantize path: architecture guard OK, but still consider minimal fallback to avoid accidental use

quantize_nvfp4_sharedmem is guarded for CUDA_ARCH >= 1000. Given host-side gating is missing (see below), older architectures will compile LOW_PRECISION branches into no-ops. Add a host gate (in moeAllToAll) and optionally a debug fill or assert here under #else to catch misuse in testing.
Example debug fallback:
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) // existing code... #else - // no-op + // Debug fallback: do nothing but could zero compact_ptr to make misuse visible + if (laneId == 0 && sizeInBytes > 0) { + for (int i = 0; i < min(sizeInBytes, 64); ++i) compact_ptr[i] = 0; + } #endif
183-223
: Inline PTX helper lacks non-SM100 path; ensure callers are gated or provide stub

e2m1_to_fp32_vec is SM100-specific. Either guarantee all call-sites compile only for >=1000 (host gate below), or add a stub that zero-fills array so accidental calls on lower arch don't propagate uninitialized data.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) // existing PTX path #else - // (no body) + #pragma unroll + for (int i = 0; i < 8; ++i) { array[i] = make_float2(0.f, 0.f); } #endif
877-898
: Assert invariants for LOW_PRECISION and add alignment assert for shared memory base

You already static_assert FIELD_COUNT==1 under LOW_PRECISION. Add a shared-memory 128B alignment assert to catch misconfigurations early.
, mShmemBase(shmemBase) { if constexpr (LOW_PRECISION) { static_assert(FIELD_COUNT == 1, "Low precision alltoall only support 1 field"); } + // Shared memory alignment is critical for cp.async.bulk and 128B protocol + assert((reinterpret_cast<uintptr_t>(mShmemBase) & 0x7F) == 0 && "Shared memory must be 128B aligned");
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
(17 hunks)cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
(4 hunks)cpp/tensorrt_llm/thop/moeCommOp.cpp
(3 hunks)tensorrt_llm/_mnnvl_utils.py
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(1 hunks)tests/unittest/_torch/thop/test_moe_alltoall.py
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_mnnvl_utils.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespace when importing: from package.subpackage import foo; then use foo.SomeClass()
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; prefix k for names starting with a number (e.g., k_99th_percentile)
Global variables are UPPER_SNAKE_CASE prefixed with G (e.g., G_MY_GLOBAL)
Constants are UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings over comments; comments for internal code or local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Attributes and variables can be documented inline with trailing docstrings under the class or module
Avoid using reflection when easily avoidable; prefer explicit parameters/constructs over dict(**locals())
In try/except, catch the narrowest exception types possible
For duck-typing try/except, keep try body minimal and place logic in else after attribute existence checks
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA Apache-2.0 copyright header with current year to all source files
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh}
: Closing braces of namespaces should have a trailing comment naming the namespace (e.g., } // namespace foo)
Prefer const or constexpr variables over #define for constants
Variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside initialization; use named constexpr constants instead
Use Allman brace style
Put semicolon for empty for/while loop on a new line
Bodies of switch/while/do/for must be compound statements with braces
if and else must always use brace-delimited statements, even for single or empty statements
C++ filenames should be camelCase with first letter lowercase (e.g., thisIsAFilename.cpp) and case-insensitively unique within a target
Type names (classes, structs, etc.) use CamelCase starting with uppercase (e.g., FooBarClass)
Local variables, methods, and namespaces use camelCase starting lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not in anonymous namespace: prefix g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in anonymous namespace: prefix s (e.g., sMutableStaticGlobal)
Locally visible static variable names start with s (e.g., static std::once_flag sFlag;)
Class member variables use mPrefix camelCase (e.g., mNbFooValues); public members may omit but prefix encouraged for clarity
Constants (enums, global constants, static constants, function-scope magic numbers) use uppercase snake_case with k prefix (e.g., kDIGIT_NUM)
Function-scope constants that are not magic numbers/literals are named like non-constants (e.g., bool const pass = ...)
Avoid Hungarian notation except apps-Hungarian like nb for counts (e.g., mNbLayers)
If constructor parameter name conflicts with public member, add trailing underscore to parameter (e.g., foo_)
Use uppercase literal suffixes (e.g., 1234L, not 1234l)
Use spaces only; no tabs; indent by 4 spaces
Run clang-format (LLVM style) before submit...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards with symbol TRTLLM_<FILE_NAME>_H (no leading/trailing underscores; directory names excluded)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
CUDA code must be compiled with CUDA compiler; device/global/managed/constant declarations and kernels must be in CUDA sources
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (5)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tests/unittest/_torch/thop/test_moe_alltoall.py
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
fillMetaInfo
(1244-1276)fillMetaInfo
(1244-1245)cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
topK
(187-286)topK
(187-189)topK
(334-455)topK
(334-336)topK
(533-658)topK
(533-535)topK
(987-1139)topK
(987-990)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
topK
(187-286)topK
(187-189)topK
(334-455)topK
(334-336)topK
(533-658)topK
(533-535)topK
(987-1139)topK
(987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)
92-94
: API extension looks good; keeping original dtype is necessary for NVFP4 pathThe new fillFieldInfo signature carrying originalDataType aligns with the low-precision flow and keeps header/source in sync.
cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
35-37
: Good: original dtype is passed from Torch to kernel field infoForwarding convert_torch_dtype to fillFieldInfo ensures the device path can choose BF16 vs FP16 correctly.
252-258
: Binding schema updated correctly with use_low_precisionThe extended def string maintains backward compatibility (defaults are None) and surfaces the new flag.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
424-464
: Good: fixed member assignment for originalDataTypeThe fillFieldInfo implementation now assigns the member (not the parameter), resolving the earlier self-assignment bug.
1468-1474
: Unit-test launchers: correct (explicit false) for isLowPrecisionfillMetaInfo calls in the test launchers explicitly pass isLowPrecision=false, which is appropriate for these non-quantized helpers.
Also applies to: 1521-1527, 1601-1605, 1664-1668
Actionable comments posted: 2
♻️ Duplicate comments (5)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (5)
482-616
: Complex LL128Proto implementation - consider adding documentation.

The LL128Proto class implements a sophisticated 128-byte aligned protocol but lacks comprehensive documentation about the protocol design and indexing logic.
Add documentation explaining (a small arithmetic sketch of the packing ratio follows the list):
- The 15:16 packing ratio (15 data blocks → 16 blocks with tail)
- The role of step values and finish flags
- The indexing mathematics for halfLane/halfIndex calculations
- The purpose of INITIALIZED_VALUE constant
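To make the first bullet concrete, a tiny host-side sketch of the payload-to-wire arithmetic; the constants come from the 15:16 ratio stated above, not from the kernel source:

```cpp
#include <cstddef>
#include <cstdio>

constexpr size_t kBlockBytes = 128;           // one LL128 block
constexpr size_t kPayloadBlocksPerGroup = 15; // data blocks per group
constexpr size_t kWireBlocksPerGroup = 16;    // data blocks plus the flag/tail block

// Bytes that travel on the wire for a given payload size, assuming 15:16 packing.
constexpr size_t wireBytesForPayload(size_t payloadBytes)
{
    size_t const payloadBlocks = (payloadBytes + kBlockBytes - 1) / kBlockBytes;
    size_t const groups = (payloadBlocks + kPayloadBlocksPerGroup - 1) / kPayloadBlocksPerGroup;
    return groups * kWireBlocksPerGroup * kBlockBytes;
}

static_assert(wireBytesForPayload(15 * 128) == 16 * 128, "one full group");

int main()
{
    std::printf("wire bytes for 4 KiB payload: %zu\n", wireBytesForPayload(4096));
    return 0;
}
```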
893-1179
: SingleChannelCommunicator template needs alignment validation.The template class performs pointer arithmetic without verifying 128-byte alignment requirements.
Add alignment assertion in constructor:
__device__ __forceinline__ SingleChannelCommunicator(FusedMoeFieldInfo const& fieldInfo, MoeExpertParallelInfo const& expertParallelInfo, MoeSingleCommMeta const& commMeta, FusedMoeWorkspace const& workspace, FusedMoeWorldInfo const& worldInfo, FusedMoePairInfo const& pairInfo, uint64_t* smemBar, uint8_t* shmemBase) : mFieldInfo(fieldInfo) , mExpertParallelInfo(expertParallelInfo) , mCommMeta(commMeta) , mWorkspace(workspace) , mWorldInfo(worldInfo) , mPairInfo(pairInfo) , mSmemBar(smemBar) , mShmemBase(shmemBase) { + // Verify 128-byte alignment for shared memory + assert((reinterpret_cast<uintptr_t>(shmemBase) & 0x7F) == 0 && "Shared memory must be 128-byte aligned");
1338-1406
: Host-side architecture validation needed for low-precision mode.The moeAllToAll function should validate GPU architecture before launching low-precision kernels.
Add architecture check:
void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream) { + if (params.isLowPrecision) + { + int dev = -1, major = 0; + TLLM_CUDA_CHECK(cudaGetDevice(&dev)); + TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev)); + TLLM_CHECK_WITH_INFO(major >= 9, + "Low-precision (NVFP4) all-to-all requires SM90+ (Hopper or newer). Detected SM%d0.", + major); + }
1324-1336
: Workspace initialization lacks size validation.The initializeLocalWorkspace method should validate memory sizes before operations.
Add validation:
void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo) { int epSize = worldInfo.epInfo.epSize; int epRank = worldInfo.epInfo.epRank; size_t fifoSize = static_cast<size_t>(FusedMoeCommunicator::FIFO_TOTAL_BYTES) * epSize * channelCount; + size_t totalSize = fifoSize + sizeof(SenderSideFifoInfo) * epSize * channelCount + + sizeof(ReceiverSideFifoInfo) * epSize * channelCount; + TLLM_CHECK_WITH_INFO(totalSize <= rankStrideInU64 * sizeof(uint64_t), + "Workspace size exceeds allocated memory");
185-228
: Missing error handling and architecture validation in PTX assembly.The
e2m1_to_fp32_vec
function uses inline PTX without proper guards or fallback implementation for unsupported architectures.Add proper guards and fallback:
inline __device__ void e2m1_to_fp32_vec(uint64_t e2m1Vec, float2 (&array)[8]) { -#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 1000) +#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ < 900) + // Fallback for unsupported architectures + #pragma unroll + for (int i = 0; i < 8; ++i) + { + array[i] = make_float2(0.0f, 0.0f); + } +#else uint32_t out_fp16[8]; asm volatile( // ... existing asm code ... ); // ... rest of conversion ... #endif }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
(17 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (2)
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (10)
idx (458-488), idx (458-458), topK (187-286), topK (187-189), topK (334-455), topK (334-336), topK (533-658), topK (533-535), topK (987-1139), topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (7)
30-183: Well-documented NVFP4 quantization implementation with clear algorithm description. The quantization function includes comprehensive documentation about the NVFP4 format, output layout, and algorithm steps. The implementation correctly handles both half and bfloat16 types with vectorized operations for efficiency.
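As background for the comment above, here is a rough standalone sketch of the two-level scaling idea it describes (one global scale plus one scale per 16-element group). The helper names, the host-side loop, and the integer rounding are illustrative assumptions only; they do not reproduce the PR's device kernel or the exact E2M1 bit encoding.

```cpp
// Illustrative host-side sketch of per-16-element block quantization.
// Real NVFP4 kernels pack 4-bit E2M1 codes and compact scale bytes; this only
// demonstrates the global-scale + per-group-scale bookkeeping.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kGroupSize = 16;     // Elements sharing one scale.
constexpr float kFp4MaxAbs = 6.0f; // Largest magnitude representable by E2M1.

struct QuantizedBlock
{
    std::vector<int8_t> codes;      // Stand-in for packed 4-bit values.
    std::vector<float> groupScales; // One scale per 16-element group.
    float globalScale = 1.0f;       // Applied on top of every group scale.
};

inline QuantizedBlock quantizePer16(std::vector<float> const& x)
{
    QuantizedBlock q;
    q.codes.resize(x.size());
    float absMax = 0.0f;
    for (float v : x)
    {
        absMax = std::max(absMax, std::fabs(v));
    }
    q.globalScale = absMax > 0.0f ? absMax / kFp4MaxAbs : 1.0f;
    for (size_t g = 0; g < x.size(); g += kGroupSize)
    {
        size_t end = std::min(x.size(), g + static_cast<size_t>(kGroupSize));
        float groupMax = 0.0f;
        for (size_t i = g; i < end; ++i)
        {
            groupMax = std::max(groupMax, std::fabs(x[i]));
        }
        float scale = groupMax > 0.0f ? groupMax / (kFp4MaxAbs * q.globalScale) : 1.0f;
        q.groupScales.push_back(scale);
        for (size_t i = g; i < end; ++i)
        {
            // Real kernels round to the nearest E2M1 code; integer rounding keeps the sketch simple.
            q.codes[i] = static_cast<int8_t>(std::lround(x[i] / (scale * q.globalScale)));
        }
    }
    return q;
}
```

Dequantization reverses the two multiplications: value is approximately code * groupScale * globalScale.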
440-480: fillFieldInfo method implementation is correct. The method properly fills field information with alignment handling and type conversion. The bug mentioned in past comments about assigning to parameter instead of member has been fixed.
1031-1049: Low-precision quantization path correctly integrated. The quantization and dequantization steps are properly placed after packing and before unpacking respectively, with appropriate type handling for FP16 and BF16.
Also applies to: 1116-1133
1181-1246: Kernel template instantiation handles low-precision correctly. The moeAllToAllKernel properly instantiates different template variants based on field count and low-precision flag. The sender/receiver logic is cleanly separated.
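A minimal sketch of what such host-side selection can look like, assuming a getFunc-style table keyed by field count and a low-precision flag. The names (getFuncSketch, kernelStub) and the field-count limit of 4 are assumptions for illustration; the stub is not a real __global__ kernel.

```cpp
// Illustrative dispatch over (fieldCount, lowPrecision): each combination maps to one
// template instantiation; the low-precision variant only exists for a single field.
#include <array>
#include <cassert>

using KernelFn = void (*)();

template <int FIELD_COUNT, bool LOW_PRECISION>
void kernelStub()
{
    static_assert(!LOW_PRECISION || FIELD_COUNT == 1, "Low-precision path supports exactly one field");
    // A real implementation would run work specialized on FIELD_COUNT / LOW_PRECISION here.
}

inline KernelFn getFuncSketch(int fieldCount, bool isLowPrecision)
{
    static constexpr std::array<KernelFn, 4> kFullPrecision
        = {&kernelStub<1, false>, &kernelStub<2, false>, &kernelStub<3, false>, &kernelStub<4, false>};
    assert(fieldCount >= 1 && fieldCount <= 4);
    if (isLowPrecision)
    {
        return &kernelStub<1, true>; // Only the single-field variant is instantiated in low precision.
    }
    return kFullPrecision[static_cast<size_t>(fieldCount) - 1];
}
```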
1267-1290: Low-precision metadata calculation appears correct. The compact size calculation for low-precision paths correctly accounts for the NVFP4 packing format with proper alignment to 128 bytes.
1436-1709: Test utilities are well-structured. The test namespace provides comprehensive testing utilities for G2S, S2G, loopback, and FIFO operations. The kernels properly handle different configurations.
2-15: Copyright header is properly formatted. The NVIDIA copyright header with Apache 2.0 license is correctly included.
f341cef to f3ca1cb Compare
/bot run --disable-fail-fast |
PR_Github #16550 [ run ] triggered by Bot |
f3ca1cb to 88c4489 Compare
/bot run --disable-fail-fast |
PR_Github #16579 [ run ] triggered by Bot |
PR_Github #16550 [ run ] completed with state |
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1333-1370: Enforce SM capability gate before selecting low-precision variant
The host-side moeAllToAll in cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (around lines 1333–1370) unconditionally dispatches the low-precision kernel when params.isLowPrecision is true, without verifying the device’s compute capability. This risks launching half-precision kernels on unsupported GPUs. You must add a CUDA compute-capability check and gate params.isLowPrecision accordingly.
Suggested fix:
- In moeAllToAll (top of the function), query device properties:

    int devId = -1;
    TLLM_CUDA_CHECK(cudaGetDevice(&devId));
    cudaDeviceProp prop{};
    TLLM_CUDA_CHECK(cudaGetDeviceProperties(&prop, devId));
    int smVersion = prop.major * 10 + prop.minor;
    constexpr int kMinSmForLowPrecision = /* e.g. 50 for SM5.0+ */;
    if (params.isLowPrecision && smVersion < kMinSmForLowPrecision)
    {
        // Either override to full precision or error out:
        params.isLowPrecision = false;
        // OR:
        // TLLM_ERROR("Low-precision kernels require SM{}/+ (found SM{})", kMinSmForLowPrecision, smVersion);
    }

- Then proceed to select the kernel via getFunc(maxFieldCount, params.isLowPrecision);
This ensures the low-precision path is only taken on devices with sufficient SM support.
♻️ Duplicate comments (3)
cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
112-117: Fix: uninitialized expertParallelInfo.topK used in fillMetaInfo (pass 0 when hasBasicFields=false)
This forwards an indeterminate value. Use a deterministic 0 since basic fields are disabled here.
Apply:

    - params.sendFieldInfo.fillMetaInfo(
    -     &(params.sendCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
    - params.recvFieldInfo.fillMetaInfo(
    -     &(params.recvCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal);
    + constexpr int kTopKNotUsed = 0;
    + params.sendFieldInfo.fillMetaInfo(
    +     &(params.sendCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal);
    + params.recvFieldInfo.fillMetaInfo(
    +     &(params.recvCommMeta), kTopKNotUsed, /*hasScales=*/false, /*hasBasicFields=*/false, useLowPrecisionVal);

cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
1319-1331: InitializeLocalWorkspace: consider size validations before memset
Large cuMemsetD32/cudaMemset sizes are derived from epSize*channelCount. Validate bounds against allocated stride.
If you want, I can propose a guarded version of this block.
38-45: LOW_PRECISION path compiles to no-ops on < SM100; add host gating to prevent layout mismatch
quantize_nvfp4_sharedmem, e2m1_to_fp32_vec, and dequantize_nvfp4_sharedmem are guarded by CUDA_ARCH >= 1000. On devices compiled for lower CC, these bodies are effectively empty, but fillMetaInfo still chooses the compact NVFP4 layout, leading to pack/unpack/size mismatches and potential memory corruption.
Add a device capability gate in moeAllToAll to prohibit low precision unless SM100+ is available:

    @@ void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream)
    -{
    +{
    +    if (params.isLowPrecision) {
    +        int dev = -1, major = 0;
    +        TLLM_CUDA_CHECK(cudaGetDevice(&dev));
    +        TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev));
    +        TLLM_CHECK_WITH_INFO(major >= 10,
    +            "Low-precision (NVFP4) all-to-all requires SM100+ (compute capability 10.x). "
    +            "Detected compute capability %d.x.", major);
    +    }

If you plan to support a software fallback for < SM100, keep the check but switch to a fallback path instead of erroring. Otherwise the gate is required to avoid silent data corruption.
Also applies to: 228-236, 183-187
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)
95-103: Initialize originalDataType in setUnused to a safe sentinel
Uninitialized fields may be accessed in debug/validation code. Initialize to a known default to avoid UB.
Apply:

    __host__ void setUnused()
    {
        dataPtrBase = nullptr;
        alignedUnitBit = 4;
        alignedUnitCount = 0;
        alignedUnitStride = 0;
        unalignedFieldIndex = 0;
        compact16BOffset = 0;
    +   originalDataType = CUDA_R_32F; // sentinel for "unused"
    }
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
276-281: Remove debug printf in device code; clamp or drop entirely
printf in kernels is brittle and floods logs. Clamp dequantScale or use project logging at host side.
Apply:

    - // This is a workaround to avoid the issue of nan or inf.
    - // TODO: remove this after the issue is fixed.
    - if (dequantScale > 1e10)
    - {
    -     printf("This is a workaround to avoid the issue of nan or inf. \n");
    - }
    + // Clamp to avoid potential overflow -> NaN/Inf in edge cases.
    + // TODO: Revisit scaling to remove this guard.
    + dequantScale = fminf(dequantScale, 1e10f);
894-906: Optional: assert 128B alignment of shared-memory base in communicator ctor
Given the protocol assumes 128B blocks, add a debug assert to catch misalignment early.
Apply:

        , mShmemBase(shmemBase)
    {
    +   // Debug: shared memory base should be 128B aligned
    +#ifndef NDEBUG
    +   assert((reinterpret_cast<uintptr_t>(mShmemBase) & 0x7F) == 0 && "Shared memory must be 128-byte aligned");
    +#endif
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (4 hunks)
cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
tensorrt_llm/_mnnvl_utils.py (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
tests/unittest/_torch/thop/test_moe_alltoall.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_mnnvl_utils.py
- tests/unittest/_torch/thop/test_moe_alltoall.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hh,hpp,hxx,cuh}
: Closing braces of C++ namespaces in headers must include a trailing comment naming the namespace
Use Allman brace style and always use braces for control statements in headers as well
C++ header filenames must be lowerCamelCase and case-insensitively unique within a compilation target
Document public C++ interfaces with Doxygen using //! and //!<; C-style comments are not allowed except inline special cases; single-line comments should use // and be properly capitalized and punctuated if full sentences
Avoid assignment in subexpressions within header inline/template code as well
All class/function templates and their members should be instantiated at least once; if a class is not POD, its data members should be private
Use header include guards; name as TRTLLM__H (all caps of filename only, no dirs), no leading underscore and no trailing underscore
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧠 Learnings (3)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (2)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (2)
fillMetaInfo (1257-1289), fillMetaInfo (1257-1258)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
topK (187-286), topK (187-189), topK (334-455), topK (334-336), topK (533-658), topK (533-535), topK (987-1139), topK (987-990)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (8)
topK (187-286), topK (187-189), topK (334-455), topK (334-336), topK (533-658), topK (533-535), topK (987-1139), topK (987-990)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2)
185-195: Good gating of low-precision combine via env + quant-mode (NVFP4)
The feature flag is opt-in and safely guarded behind quant mode support, which prevents accidental enablement on unsupported models. LGTM.
901-910: mnnvl_moe_alltoallv_combine signature and call sites are consistent
- The static method in _mnnvl_utils.py defines both token_count: int and use_low_precision_combine: bool = False in its signature.
- In fused_moe_wide_ep.py, both token_count and use_low_precision_combine are passed explicitly.
- In fused_moe_cutlass.py, only token_count is passed; use_low_precision_combine correctly falls back to its default (False).
- No other call sites exist under tensorrt_llm.
No further updates are needed.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (3)
72-80: New originalDataType and alignment fields: looks correct
The field additions and comments are clear and align with the low-precision data-path requirements. No issues.
406-408: fillMetaInfo extended with isLowPrecision: LGTM
Signature carries the low-precision toggle into sizing. Matches usage in .cu meta logic.
422-422: FusedMoeCommKernelParam.isLowPrecision addition is appropriate
The struct-level flag cleanly propagates mode selection into kernels.
cpp/tensorrt_llm/thop/moeCommOp.cpp (3)
35-37: Correctly passing original dtype to fillFieldInfo
Using convert_torch_dtype(t.scalar_type()) wires the dtype through. LGTM.
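As a reference for how such a mapping typically looks, a hedged sketch of a Torch-to-CUDA dtype converter follows. The function name is hypothetical and the coverage is deliberately minimal; the PR's actual helper may differ.

```cpp
// Illustrative converter: map a Torch scalar type onto the cudaDataType_t that the
// field metadata stores. Only the dtypes relevant to the low-precision path are handled.
#include <ATen/ATen.h>
#include <library_types.h>
#include <stdexcept>

cudaDataType_t toCudaDataType(at::ScalarType scalarType)
{
    cudaDataType_t result = CUDA_R_32F;
    switch (scalarType)
    {
    case at::kHalf: result = CUDA_R_16F; break;
    case at::kBFloat16: result = CUDA_R_16BF; break;
    case at::kFloat: result = CUDA_R_32F; break;
    default: throw std::invalid_argument("unsupported dtype for MoE all-to-all field");
    }
    return result;
}
```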
252-258: Torch schema updated to include use_low_precision: LGTM
The def string and CUDA impl align with the new optional boolean. Good defaulting behavior.
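For readers unfamiliar with optional booleans in Torch op schemas, here is a small self-contained sketch of the registration pattern. The namespace, op name, and argument list are hypothetical and intentionally much simpler than the real moe_comm schema; only the "bool? ...=None" plus value_or() defaulting idiom is the point.

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical op for illustration: shows how an optional boolean with a None default
// is declared in the schema string and consumed with value_or() on the C++ side.
at::Tensor exampleCombine(at::Tensor const& input, c10::optional<bool> useLowPrecision)
{
    bool lowPrecision = useLowPrecision.value_or(false); // Same defaulting pattern as the reviewed op.
    return lowPrecision ? input.to(at::kHalf) : input;
}

TORCH_LIBRARY_FRAGMENT(example_ns, m)
{
    m.def("example_combine(Tensor input, bool? use_low_precision=None) -> Tensor");
}

TORCH_LIBRARY_IMPL(example_ns, CompositeExplicitAutograd, m)
{
    m.impl("example_combine", &exampleCombine);
}
```

Callers that omit the new argument keep the old behavior because value_or(false) resolves the None default.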
31-37: The new use_low_precision_combine argument is already optional (defaults to False), so existing call sites don't need to be updated unless you want to enable it. All direct invocations of torch.ops.trtllm.moe_comm in the Python wrappers and tests remain valid without modification. I've verified:
• In tensorrt_llm/_mnnvl_utils.py, the Python wrapper signature ends with use_low_precision_combine: bool = False and forwards it positionally to torch.ops.trtllm.moe_comm with the correct defaults.
• All direct test calls in tests/unittest/_torch/thop/test_moe_alltoall.py pass only the original required arguments (without the new flag), which continues to map use_low_precision_combine=False.
No changes are required to existing call sites.
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (4)
30-37: Appreciate the clear documentation for NVFP4 block layout
The function-level comments for quantize_nvfp4_sharedmem are helpful. Nice.
437-477: fillFieldInfo now sets originalDataType correctly
The earlier self-assignment bug is resolved; this writes the member. Good fix.
890-911: Static assert restricting LOW_PRECISION to 1 field is reasonable
The constraint is enforced at compile time. Good defensive measure.
1258-1287: Low-precision compact sizing formula looks correct and matches the packed layout
numGroups, group size (WARP*8 + WARP), and +4 global-scale with 128B alignment match quantize/dequant. LGTM.
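To make the arithmetic in that remark concrete, here is a hedged sketch of the sizing under the stated layout: one warp-sized group covers 32 lanes of 16 elements each, contributing WARP*8 bytes of packed 4-bit payload plus WARP bytes of per-group scales, with a trailing 4-byte global scale and 128-byte padding. The constants and function names are illustrative, not the kernel's own definitions.

```cpp
// Illustrative compact-size computation for the low-precision layout described above.
#include <cstddef>

constexpr std::size_t kWarpSize = 32;   // Lanes per warp; each lane handles one 16-element group.
constexpr std::size_t kGroupElems = 16; // Elements covered by one per-group scale.
constexpr std::size_t kAlign = 128;     // Compact payload is padded to 128-byte blocks.

constexpr std::size_t alignUp(std::size_t value, std::size_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

constexpr std::size_t lowPrecisionCompactBytes(std::size_t elemCount)
{
    // Number of warp-sized groups needed to cover elemCount elements.
    std::size_t numGroups = (elemCount + kWarpSize * kGroupElems - 1) / (kWarpSize * kGroupElems);
    // Each group: WARP*8 bytes of packed 4-bit values plus WARP bytes of per-group scales.
    std::size_t payload = numGroups * (kWarpSize * 8 + kWarpSize);
    // One extra 4-byte global scale, then round the total up to the 128-byte boundary.
    return alignUp(payload + 4, kAlign);
}

// Example: 4096 fp16 elements -> 8 groups -> 8 * 288 + 4 = 2308 -> padded to 2432 bytes.
static_assert(lowPrecisionCompactBytes(4096) == 2432, "sketch sanity check");
```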
PR_Github #16673 [ run ] completed with state |
PR_Github #16691 [ run ] completed with state |
/bot run |
PR_Github #16758 [ run ] triggered by Bot |
/bot run --disable-fail-fast |
PR_Github #16783 [ run ] triggered by Bot |
PR_Github #16758 [ run ] completed with state |
PR_Github #16783 [ run ] completed with state |
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
14aab49 to 7f36008 Compare
/bot reuse-pipeline |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1382-1394
: Initialize workspace: add bounds validation before large memset.Guard against overruns and alignment issues.
Apply:
void FusedMoeWorkspace::initializeLocalWorkspace(FusedMoeWorldInfo const& worldInfo) { @@ - uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64; + uint64_t* localWorkspacePtr = workspacePtr + epRank * rankStrideInU64; + size_t totalBytes = rankStrideInU64 * sizeof(uint64_t); + size_t needed = fifoSize + senderSideInfoSize + receiverSideInfoSize; + TLLM_CHECK_WITH_INFO(needed <= totalBytes, "MOE comm workspace overflow: need=%zu, have=%zu", needed, totalBytes); + TLLM_CHECK_WITH_INFO((fifoSize % sizeof(uint32_t)) == 0, "fifoSize must be multiple of 4 bytes"); @@ - TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE, - fifoSize / sizeof(uint32_t))); + TLLM_CU_CHECK(cuMemsetD32(reinterpret_cast<CUdeviceptr>(localWorkspacePtr), FusedMoeProto::INITIALIZED_VALUE, + fifoSize / sizeof(uint32_t)));
♻️ Duplicate comments (5)
tests/unittest/_torch/thop/test_moe_alltoall.py (2)
260-278
: Bug: shared output_indice_offset corrupts multi-tensor refs.Maintain per-tensor offsets; current code overwrites slices across tensors.
Apply:
- total_recv_all_ranks_cpu = [] - output_indice_offset = 0 + total_recv_all_ranks_cpu = [] + output_offsets = [0] * tensor_count @@ - for i in range(tensor_count): - ref_output_tensors[i][output_indice_offset:output_indice_offset + local_recv_count_pair] = \ - input_tensors_all_ranks[i][other_rank][send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]]] - output_indice_offset += local_recv_count_pair + for i in range(tensor_count): + start = output_offsets[i] + end = start + local_recv_count_pair + ref_output_tensors[i][start:end] = input_tensors_all_ranks[i][other_rank][ + send_ids_all_ranks[other_rank][send_rank_start_end[0]:send_rank_start_end[1]] + ] + output_offsets[i] = end
27-41
: quant_and_dequant: guard zero, avoid forced CPU hops, add docstring.Prevent div-by-zero, prefer on-device op with CPU fallback, and document intent.
Apply:
-def quant_and_dequant(tensor): - tensor = tensor.reshape(1, -1) - global_scale = (448 * 6) / tensor.abs().max().float() - fp4_tensor, scale_factors = torch.ops.trtllm.fp4_quantize( - tensor, global_scale, 16, False, False) - - dequantized_cpu = torch.ops.tensorrt_llm.e2m1_and_ufp8sf_scale_to_float_v2( - fp4_tensor.cpu(), - scale_factors.cpu(), - (1.0 / global_scale).cpu(), - 16, - 1, # sf_type (1 for UE4M3) - False) - return dequantized_cpu.to(tensor.device).reshape(-1) +def quant_and_dequant(tensor): + """Quantize to FP4 and dequantize back; used to relax asserts for low-precision paths.""" + tensor = tensor.reshape(1, -1) + absmax = tensor.abs().max().float() + if absmax == 0: + return torch.zeros_like(tensor.reshape(-1)) + global_scale = (448 * 6) / absmax + fp4_tensor, scale_factors = torch.ops.trtllm.fp4_quantize( + tensor, global_scale, 16, False, False) + try: + out = torch.ops.tensorrt_llm.e2m1_and_ufp8sf_scale_to_float_v2( + fp4_tensor, scale_factors, (1.0 / global_scale), 16, 1, False) + return out.reshape(-1) + except Exception: + out_cpu = torch.ops.tensorrt_llm.e2m1_and_ufp8sf_scale_to_float_v2( + fp4_tensor.cpu(), scale_factors.cpu(), (1.0 / global_scale).cpu(), 16, 1, False) + return out_cpu.to(tensor.device).reshape(-1)cpp/tensorrt_llm/thop/moeCommOp.cpp (2)
74-101
: Validate dtypes when use_low_precision is enabled.Fail fast if any input isn’t fp16/bf16 to avoid silent kernel misuse.
Apply:
int fieldCount = inputs.size(); @@ tensorrt_llm::kernels::FusedMoeFieldInfo sendFieldInfo, recvFieldInfo; @@ - bool useLowPrecisionVal = useLowPrecision.value_or(false); + bool useLowPrecisionVal = useLowPrecision.value_or(false); + if (useLowPrecisionVal) + { + for (int i = 0; i < fieldCount; ++i) + { + auto dt = inputs[i].scalar_type(); + TORCH_CHECK(dt == torch::kHalf || dt == torch::kBFloat16, + "use_low_precision requires float16 or bfloat16 inputs; got ", dt, " at field index ", i); + } + }Also applies to: 112-117
112-117
: Uninitialized topK forwarded into fillMetaInfo (UB).When hasBasicFields=false, pass a known-zero topK to avoid using params.expertParallelInfo.topK.
Apply:
- bool useLowPrecisionVal = useLowPrecision.value_or(false); - params.isLowPrecision = useLowPrecisionVal; - params.sendFieldInfo.fillMetaInfo( - &(params.sendCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal); - params.recvFieldInfo.fillMetaInfo( - &(params.recvCommMeta), params.expertParallelInfo.topK, false, false, useLowPrecisionVal); + bool useLowPrecisionVal = useLowPrecision.value_or(false); + params.isLowPrecision = useLowPrecisionVal; + constexpr int kTopKNotUsed = 0; + params.sendFieldInfo.fillMetaInfo( + &(params.sendCommMeta), kTopKNotUsed, false, false, useLowPrecisionVal); + params.recvFieldInfo.fillMetaInfo( + &(params.recvCommMeta), kTopKNotUsed, false, false, useLowPrecisionVal);cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1396-1464
: Gate LOW_PRECISION path on device CC to avoid no-op codepaths on < SM100.Add host-side check; otherwise quant/dequant sections compile out and corrupt I/O.
Apply:
void moeAllToAll(FusedMoeCommKernelParam params, FusedMoeWorkspace workspace, cudaStream_t stream) { + if (params.isLowPrecision) + { + int dev = -1, major = 0; + TLLM_CUDA_CHECK(cudaGetDevice(&dev)); + TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev)); + TLLM_CHECK_WITH_INFO(major >= 10, + "Low-precision (NVFP4) all-to-all requires SM100+ (compute capability 10.x)."); + } bool hasBasicFields = params.sendFieldInfo.tokenSelectedSlots != nullptr;
🧹 Nitpick comments (2)
tests/unittest/_torch/thop/test_moe_alltoall.py (1)
345-351
: Optional: vectorize low-precision ref adjust to avoid per-token Python loop.Quant/dequant whole matrix per tensor once for speed in CI.
Example:
if use_low_precision: ref_output_tensors[i] = quant_and_dequant(ref_output_tensors[i]) atol, rtol = 1e-2, 1e-2cpp/tensorrt_llm/thop/moeCommOp.cpp (1)
230-241
: Fix misleading error message (typo).Dim check is 2D but message says 1D.
Apply:
- TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 1D tensor"); + TORCH_CHECK(expertsIds.dim() == 2, "expertsIds must be a 2D tensor");
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (17 hunks)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (4 hunks)
cpp/tensorrt_llm/thop/moeCommOp.cpp (3 hunks)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (5 hunks)
tensorrt_llm/_mnnvl_utils.py (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
tests/unittest/_torch/thop/test_moe_alltoall.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_mnnvl_utils.py
- cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h
- cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header with current year to all source files
Files:
tests/unittest/_torch/thop/test_moe_alltoall.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cuh,cpp,cc,cxx,cu}
: In C++/CUDA files, closing braces of namespaces must be commented with the namespace name (e.g., } // namespace foo)
Prefer const or constexpr variables over #define; variables not modified after initialization must be const
Use literals only for initialization except 0, nullptr, true, false; otherwise name them as constexpr
Use Allman indentation style; empty for/while loop’s semicolon on a new line; always use brace-delimited bodies for control statements
Type names are UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Non-static, non-anonymous-namespace global variables use g prefix (e.g., gDontUseGlobalFoos)
Static or anonymous-namespace globals use s prefix (e.g., sMutableStaticGlobal); locally visible static variables also use s prefix
Member variables use m prefix (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants: enums, globals, static constants, and function-scope magic-number/literal constants use k + UPPER_SNAKE_CASE (e.g., kDIGIT_NUM)
Use spaces only; indent 4 spaces
Run LLVM clang-format before submitting; enforce max 120 chars per line; only justify exceptions via clang-format off/on blocks
Use C++ style comments; allow inline C comments only for parameter labeling in calls
Disable code via #if/#endif or dedicated macros; avoid dead code; do not comment out code
Do not throw exceptions across library boundaries
Prefer least forceful casts; avoid C-style and functional casts (except explicit constructor); do not drop const/volatile; void* to T* via static_cast; use reinterpret_cast as last resort; avoid dynamic_cast
Do not use assignment in subexpressions (e.g., if (x = y))
Enum-controlled switch should cover all enum values without default; structure switch cases clearly; no fall-through except to another label; each case must end with break or throw; do not end a case with return; if case is a compound, put break inside braces
Avoid declaring l...
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
**/*.{cpp,cc,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use smart pointers for heap objects; prefer unique_ptr for sole ownership, shared_ptr for shared; avoid deprecated smart pointers
Files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
cpp/tensorrt_llm/thop/moeCommOp.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tests/unittest/_torch/thop/test_moe_alltoall.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/thop/moeCommOp.cpp
🧬 Code graph analysis (2)
tests/unittest/_torch/thop/test_moe_alltoall.py (3)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.h (1)
tensorrt_llm (25-203)
cpp/tensorrt_llm/kernels/moeCommKernelsCommon.h (1)
tensorrt_llm (20-47)
cpp/tensorrt_llm/kernels/moePrepareKernels.h (1)
tensorrt_llm (25-89)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
cpp/tests/unit_tests/kernels/fusedMoeCommKernelTest.cpp (2)
elementSize (68-79), elementSize (68-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/fusedMoeCommKernels.cu (1)
1320-1352: Low-precision compact size: looks correct; aligns with sender/receiver layouts. numGroups, per-group bytes, and 128B alignment match pack/unpack. LGTM.
PR_Github #16842 [ reuse-pipeline ] triggered by Bot |
PR_Github #16842 [ reuse-pipeline ] completed with state |
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.