[None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … #6126
Conversation
📝 Walkthrough

Introduces dynamic cluster-shape (CGA) and AB-swap support across SM100 MoE GEMM: encodes tile/cluster shapes as ints, extends CutlassGemmConfig with dynamic/fallback cluster shapes and swap_ab, adds SM100 candidate generation for dynamic clusters, updates dispatch/launcher signatures and runtime dispatch, threads dynamic_cga/swap_ab through Python kernel generation, and adjusts benchmarks/tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Gen as CandidateGen (SM100)
    participant Disp as Dispatch (SM100)
    participant L as Launcher (TMA/NoSmem)
    participant K as Kernel
    Note over Gen: Build CutlassGemmConfig list<br/>for epilogue ∈ {TMA, NO_SMEM}<br/>cluster ∈ {1x1x1,4x1x1,4x2x1}<br/>include dynamic & fallback shapes
    Gen->>Disp: Candidate configs (tile, epilogue, cluster, dynamic, fallback, swap_ab)
    alt dynamic_cga enabled
        Disp->>L: Select by (epilogue_schedule, dynamic_cga, swap_ab)
        L->>K: Launch with dynamic_cluster_shape + fallback_cluster_shape
    else dynamic_cga disabled
        Disp->>L: Select static launcher
        L->>K: Launch with static cluster_shape
    end
    Note over K: FINALIZE fusion allowed only with TMA<br/>AB-swap variants applied as configured
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Also includes changes for #4111
b97f555 to b3f2854
Actionable comments posted: 7
🔭 Outside diff range comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
83-85: FINALIZE fusion must provide router_scales (avoid nullptr in fused epilogue)

Per prior incident (#3294), the FINALIZE epilogue requires non-null router_scales. Add a guard here to fail fast.

```diff
     TLLM_CHECK_WITH_INFO(
         workspace_size || hopper_input.isValid(), "Hopper specialisation is missing additional input information");
+
+    if constexpr (FUSION == EpilogueFusion::FINALIZE)
+    {
+        TLLM_CHECK_WITH_INFO(hopper_input.router_scales != nullptr,
+            "router_scales must be provided when using FINALIZE epilogue fusion");
+    }
```
1-15: Header guard required for .h files (and update copyright year)

- Add a preprocessor guard with prefix TRTLLM_ per guidelines.
- Include the current year in the header.

Additional changes outside the selected range:

```diff
 /*
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
  *
```

At the very top, just after the license:

```diff
+#ifndef TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
+#define TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
```

At the very end of the file, add:

```diff
+#endif // TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
```
🧹 Nitpick comments (15)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (4)
84-97: Tighten shape encoding helpers; remove unnecessary static and guard valid ranges

- constexpr functions at namespace scope are implicitly inline; static is unnecessary and goes against the guideline to prefer anonymous namespaces over static for internal linkage.
- Consider guarding against out-of-range inputs to avoid accidental collisions.

Suggested refactor:

```diff
-constexpr static int shape_tuple_to_enum(int m, int n, int k)
+constexpr int shape_tuple_to_enum(int m, int n, int k)
 {
-    return m * 1000000 + n * 1000 + k;
+    // m in [1..999], n in [1..999], k in [1..999]
+    // Encodes as m*1'000'000 + n*1'000 + k
+    return m * 1000000 + n * 1000 + k;
 }

-template <typename TEnum>
-constexpr static std::tuple<int, int, int> enum_to_shape_tuple(TEnum shape_id_enum)
+template <typename TEnum>
+constexpr std::tuple<int, int, int> enum_to_shape_tuple(TEnum shape_id_enum)
 {
     static_assert(std::is_enum_v<TEnum> && std::is_same_v<std::underlying_type_t<TEnum>, int>,
         "TEnum must be an enum with underlying type int");
     auto shape_id = static_cast<int>(shape_id_enum);
     return std::make_tuple(shape_id / 1000000, (shape_id % 1000000) / 1000, shape_id % 1000);
 }
```

Optionally, add a compile-time/documented constraint that shape components must be below 1000 to avoid collisions.
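For reference, a minimal standalone sketch of the base-1000 round trip these helpers implement (simplified to plain int rather than the enum-typed overload, so it compiles on its own):

```cpp
#include <cassert>
#include <cstdio>
#include <tuple>

// Encodes (m, n, k) as m*1'000'000 + n*1'000 + k; each component must stay below 1000.
constexpr int shape_tuple_to_enum(int m, int n, int k)
{
    return m * 1000000 + n * 1000 + k;
}

constexpr std::tuple<int, int, int> enum_to_shape_tuple(int shape_id)
{
    return {shape_id / 1000000, (shape_id % 1000000) / 1000, shape_id % 1000};
}

int main()
{
    constexpr int id = shape_tuple_to_enum(4, 2, 1); // a 4x2x1 cluster shape
    static_assert(id == 4002001, "unexpected encoding");
    auto [m, n, k] = enum_to_shape_tuple(id);
    assert(m == 4 && n == 2 && k == 1); // round trip recovers the original shape
    std::printf("%dx%dx%d -> %d\n", m, n, k, id);
    return 0;
}
```

Because the components occupy disjoint decimal fields, any component of 1000 or more would spill into a neighbouring field, which is why the bounds guard suggested above matters.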
525-539: toString extended fields are useful; consider printing decoded shapes for readability

The added dynamic_cluster_shape and fallback_cluster_shape IDs are valuable for debugging. For readability, consider also printing their decoded tuples via enum_to_shape_tuple when not Undefined. No blocker.
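A self-contained sketch of that decode-and-print idea (the enum value and Undefined sentinel here are illustrative stand-ins for the real ClusterShape definitions in gemm_configs.h, not the file's actual toString):

```cpp
#include <iostream>
#include <sstream>
#include <tuple>

// Illustrative stand-ins; the real header defines ClusterShape and enum_to_shape_tuple.
enum class ClusterShape : int
{
    Undefined = 0,
    ClusterShape_4x2x1 = 4002001, // assumes the base-1000 encoding described above
};

constexpr std::tuple<int, int, int> enum_to_shape_tuple(ClusterShape shape)
{
    int const id = static_cast<int>(shape);
    return {id / 1000000, (id % 1000000) / 1000, id % 1000};
}

// Print the raw id, and append the decoded MxNxK when the shape is defined.
void appendClusterShape(std::ostringstream& out, char const* label, ClusterShape shape)
{
    out << label << ": " << static_cast<int>(shape);
    if (shape != ClusterShape::Undefined)
    {
        auto [m, n, k] = enum_to_shape_tuple(shape);
        out << " (" << m << "x" << n << "x" << k << ")";
    }
    out << "\n";
}

int main()
{
    std::ostringstream out;
    appendClusterShape(out, "dynamic_cluster_shape", ClusterShape::ClusterShape_4x2x1);
    appendClusterShape(out, "fallback_cluster_shape", ClusterShape::Undefined);
    std::cout << out.str(); // "dynamic_cluster_shape: 4002001 (4x2x1)" then "fallback_cluster_shape: 0"
    return 0;
}
```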
561-572: Operator<< label "tile_config_sm90_enum" is misleading for SM100/SM120

This label is hard-coded to "sm90" while getTileConfigAsInt() may reflect SM100/SM120. Consider a neutral label (e.g., tile_config_enum) or include the actual sm_version in the label to avoid confusion in logs.

```diff
-    out << "tile_config_sm90_enum: " << config.getTileConfigAsInt()
+    out << "tile_config_enum: " << config.getTileConfigAsInt()
```
1-15: Update copyright year

Per guidelines, OSS source headers should include the current year. This file still says 2020-2023.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)
1-15: Header hygiene: guard macro and year

- The guidelines prefer preprocessor header guards over #pragma once for *.h files. Consider adding a guard like #ifndef TRTLLM_MOE_GEMM_TMA_WS_LAUNCHER_H.
- Update the copyright year.

I can provide a patch if you want it applied across the cutlass_kernels headers for consistency.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (4)
245-251: Epilogue token sanitization looks fine; keep the hack contained

Stripping namespaces and removing "1Sm" is a reasonable WAR given the current macro set. Consider adding a TODO with a link to the CUTLASS update once "PtrArrayTmaWarpSpecialized1Sm" is available, to remove the hack.
271-273: Very long macro line; wrap for readability (ruff E501)

The instantiation macro line is well over 120 chars. Consider line-wrapping arguments to improve readability and satisfy tooling:

```diff
-    INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
-        {epi_sched}, {epi_tag}, {epi_fusion}, {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, {"true" if operation.is_mx_fpx else "false"}, {"true" if operation.dynamic_cga else "false"}, false);\n
+    INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM(
+        {arch_tag}, {act_tag}, {weight_tag}, {out_tag},
+        {epi_sched}, {epi_tag}, {epi_fusion},
+        {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]},
+        {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]},
+        {"true" if operation.is_mx_fpx else "false"},
+        {"true" if operation.dynamic_cga else "false"},
+        false
+    );\n
```
370-372: Temporary exclusion for FP4 + NoSmem 1Sm

The explicit guard against PtrArrayNoSmemWarpSpecialized1Sm for FP4 aligns with CUTLASS limitations. Please add a TODO to re-enable when the backend supports this combination.
6-6: Star import and F405 warnings

Ruff flags several symbols as possibly undefined due to from cutlass_library import *. If feasible, import the specific names used (EpilogueScheduleType, KernelScheduleType, DataType, GemmKind, etc.) to quiet lints and improve readability. Not a blocker.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (6)
71-76: Avoid copying gemm_config: pass by const reference

gemm_config is passed by value here and along the chain. Prefer passing by const& to avoid copies.

```diff
-void dispatchMoeGemmFinalDispatchTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
-    cutlass_extensions::CutlassGemmConfig gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
+void dispatchMoeGemmFinalDispatchTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
+    cutlass_extensions::CutlassGemmConfig const& gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
     size_t* workspace_size)
```

Additionally update the callers' signatures (see comments at Lines 325-328 and 366-369).
198-206: Use compile-time condition for ClusterShape gate in constexpr predicate

Prefer if constexpr here to keep this purely compile-time and avoid generating dead branches.

```diff
-    if (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1 || cute::size<2>(ClusterShape{}) != 1)
+    if constexpr (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1
+        || cute::size<2>(ClusterShape{}) != 1)
     {
         return false;
     }
```
325-328: Avoid copying gemm_config in cluster-shape dispatcher

Same rationale: pass by const&.

```diff
-void dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input,
-    int num_experts, cutlass_extensions::CutlassGemmConfig gemm_config, int multi_processor_count, cudaStream_t stream,
+void dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input,
+    int num_experts, cutlass_extensions::CutlassGemmConfig const& gemm_config, int multi_processor_count, cudaStream_t stream,
     int* occupancy, size_t* workspace_size)
```
366-369: Avoid copying gemm_config in tile-shape dispatcher

Pass by const& to prevent copies.

```diff
-void dispatchMoeGemmSelectTileShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
-    cutlass_extensions::CutlassGemmConfig gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
+void dispatchMoeGemmSelectTileShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
+    cutlass_extensions::CutlassGemmConfig const& gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
     size_t* workspace_size)
```
362-363: Incorrect namespace comment on a function's closing brace

This brace closes a function, not the namespace. Remove or correct the comment to avoid confusion.

```diff
-} // namespace tensorrt_llm
+}
```
64-64: Prefer <cmath> over <math.h>

Use C++ headers instead of C headers.

```diff
-#include <math.h>
+#include <cmath>
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)

- 3rdparty/cutlass (1 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (10 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (4 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (2)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🧬 Code Graph Analysis (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (3)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2): TileShape (213-559), ClusterShape (345-558)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/collective/epilogue_moe_finalize.hpp (2): bool (133-156), bool (164-170)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp (1): bool (563-647)

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1): std (144-180)

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (4)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7): _ (234-276), _ (355-363), _ (444-454), _ (626-653), _ (686-696), _ (773-783), _ (876-892)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1): EpilogueScheduleType (204-586)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/low_latency_gemm.h (1): KernelScheduleType (36-59)
- cpp/tensorrt_llm/runtime/torchUtils.h (1): DataType (99-115)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
271-271: Line too long (313 > 120)
(E501)
353-353: Line too long (122 > 120)
(E501)
370-370: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
532-532: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
659-659: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
702-702: Line too long (139 > 120)
(E501)
704-704: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
705-705: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
724-724: KernelScheduleType
may be undefined, or defined from star imports
(F405)
727-727: DataType
may be undefined, or defined from star imports
(F405)
728-728: DataType
may be undefined, or defined from star imports
(F405)
728-728: DataType
may be undefined, or defined from star imports
(F405)
733-733: GemmKind
may be undefined, or defined from star imports
(F405)
749-749: DataType
may be undefined, or defined from star imports
(F405)
🔇 Additional comments (10)
3rdparty/cutlass (1)

1-1: Submodule bump acknowledged and looks intentional.

Pointer advance to ebe98c5… aligns with broader PR changes adding dynamic CGA support; no direct code in this repo depends on submodule contents at review time. Proceed.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

300-303: Good defensive check when opening input file

The ifstream open check and explicit invalid_argument improve failure diagnostics and are caught upstream in parseArgsAndRunBench. LGTM.
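For illustration, the open-check-then-throw pattern being praised looks roughly like this (the function name, path handling, and message text are placeholders, not the benchmark's actual code):

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Fail fast with a descriptive exception if the benchmark input file cannot be opened;
// the caller (argument parsing, in this case) can catch it and report the bad path.
std::ifstream openBenchmarkInput(std::string const& path)
{
    std::ifstream file(path);
    if (!file.is_open())
    {
        throw std::invalid_argument("Could not open benchmark input file: " + path);
    }
    return file;
}
```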
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

26-32: Launcher signature aligned with dynamic CGA and epilogue schedule; verify include-time usage

The added EpilogueSchedule/DYNAMIC_CGA template params and runtime dynamic_cluster_shape parameter line up with the generator changes and the dispatch path. LGTM. Ensure macro instantiations and all call sites provide the new template arg and pass the runtime shape accordingly (generator has been updated).

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (5)
142-144: API: New dynamic_cga parameter plumbed into launcher

Extending the public constructor with dynamic_cga=False and storing it on the instance is consistent with the rest of the plumbing. LGTM.

157-157: Storing dynamic_cga on the launcher

Good: this is used downstream to select the correct instantiation variants.

350-356: SM100 validity: runtime cluster-shape constraints are clear

The restrictions on cga_m/cga_n/cga_k and the tile checks for FP4 look sensible given current backend support. LGTM.

659-670: SM120: epi schedule switch to TMA aligns with backend

Switching to EpilogueScheduleType.TmaWarpSpecialized here matches current SM120 expectations. LGTM.

702-706: New SM100 epilogue schedules for 1Sm variants

Using PtrArrayNoSmemWarpSpecialized1Sm and PtrArrayTmaWarpSpecialized1Sm is consistent with the launcher signature that now accepts an epilogue schedule type. Ensure these tokens exist in cutlass_library for the targets you generate.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
119-132: WFP4AFP8 scaling-mode gating looks correct

Restricting WFP4 weights with FP8 activations to MXFPX and banning MXFPX otherwise is the right enforcement.

338-341: Wiring gemm_config into final dispatch looks correct

Forwarding gemm_config to the final dispatch aligns with the new centralization. LGTM.
7ad3b48 to 2dc4072
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
17-36: Add missing include for std::get usage

The code uses std::get with the tuple returned by enum_to_shape_tuple on line 377, but <tuple> is not included. Add it to avoid relying on transitive includes:

```diff
 #include "tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h"

 #include "tensorrt_llm/common/cudaBf16Wrapper.h"
+#include <tuple>

 #ifdef __GNUC__ // Check if the compiler is GCC or Clang
```
♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
728-751: Bug: dynamic_cga variants not appended correctly; only last one survives

The validity check and append operation are outside the dynamic_cga loop, so only the last variant (dynamic_cga=False) is considered. Move them inside the loop:

```diff
     for otype in otypes:
         for dynamic_cga in [True, False]:
             moe_gemm_operation = TrtLlm_GemmLauncher(
                 GemmKind.Grouped, arch, dtype, weight_type, otype, otype, otype,
                 quant_op, epi_tag, cta_shape_mnk, warp_shape, stages, cga_shape,
                 mainloop_schedule, epi_schedule, epi_fusion,
                 is_mx_fpx=(dtype == DataType.e4m3 and weight_type == e2m1),
                 dynamic_cga=dynamic_cga)
-
-        if is_op_valid(moe_gemm_operation):
-            operations.append(moe_gemm_operation)
+            if is_op_valid(moe_gemm_operation):
+                operations.append(moe_gemm_operation)
```

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)
134-184: Guard dynamic cluster shape conversion to prevent undefined behavior.

The current code unconditionally calls enum_to_shape_tuple(gemm_config.dynamic_cluster_shape) even when dynamic_cga is false, which could lead to undefined behavior when ClusterShape::Undefined is passed.

Based on past review feedback, this issue was already identified. The recommendation is to guard the cluster shape conversion and validate SM100-specific constraints.

Apply this pattern:

```diff
-    bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-    auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-    auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-        std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+    bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
+    auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+    if (dynamic_cga)
+    {
+        auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+        // SM100 dynamic CGA supports only 1x1x1 and 2x1x1 cluster shapes
+        TLLM_CHECK_WITH_INFO(
+            (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2) &&
+                std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+            "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%lld,%lld,%lld).",
+            (long long) std::get<0>(cluster_shape), (long long) std::get<1>(cluster_shape),
+            (long long) std::get<2>(cluster_shape));
+        cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+            std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+    }
```

Apply the same pattern to the fallback cluster shape conversion.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
22-23: Add missing include for type traits

The code uses std::is_enum_v and std::underlying_type_t in enum_to_shape_tuple (lines 93-94), but <type_traits> is already included on line 23. Good!
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
247-251: Technical debt: Temporary workaround for missing type

The removal of the "1Sm" suffix is noted as a hack. Consider adding a TODO to properly address this:

```diff
         epi_sched = epi_sched.split(':')[-1]
-        epi_sched = epi_sched.replace(
-            "1Sm",
-            "")  # Hack to WAR missing `PtrArrayTmaWarpSpecialized` type
+        # TODO: Remove this hack once `PtrArrayTmaWarpSpecialized` type is properly defined
+        epi_sched = epi_sched.replace("1Sm", "")  # Hack to WAR missing `PtrArrayTmaWarpSpecialized` type
```
271-271: Fix line length exceeding 120 characters

Break this long line for better readability:

```diff
-    INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
-        {epi_sched}, {epi_tag}, {epi_fusion}, {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, {"true" if operation.is_mx_fpx else "false"}, {"true" if operation.dynamic_cga else "false"}, false);\n
+    INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
+        {epi_sched}, {epi_tag}, {epi_fusion},
+        {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]},
+        {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]},
+        {"true" if operation.is_mx_fpx else "false"},
+        {"true" if operation.dynamic_cga else "false"}, false);\n
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)

- 3rdparty/cutlass (1 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- 3rdparty/cutlass
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1): EpilogueScheduleType (205-565)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/low_latency_gemm.h (1): KernelScheduleType (36-59)

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1): FpXBlockScalingType (203-255)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (19): ClusterShape (310-502), TileShape (214-538), EpilogueScheduleType (205-565), if (244-247), if (248-251), if (252-255), if (256-259), if (260-263), if (264-267), if (268-271), if (272-275), if (276-279), if (280-283), if (284-287), if (351-354), if (355-358), if (359-362), if (363-366), if (367-370)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
271-271: Line too long (313 > 120)
(E501)
353-353: Line too long (122 > 120)
(E501)
367-367: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
529-529: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
656-656: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
699-699: Line too long (139 > 120)
(E501)
701-701: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
702-702: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
721-721: KernelScheduleType
may be undefined, or defined from star imports
(F405)
724-724: DataType
may be undefined, or defined from star imports
(F405)
725-725: DataType
may be undefined, or defined from star imports
(F405)
725-725: DataType
may be undefined, or defined from star imports
(F405)
730-730: GemmKind
may be undefined, or defined from star imports
(F405)
746-746: DataType
may be undefined, or defined from star imports
(F405)
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8)
73-75: LGTM: Function signature updated appropriately for dynamic cluster shape support.

The function rename and addition of the gemm_config parameter align well with the PR's objective of adding dynamic cluster shape support. The parameter placement is logical.

119-132: LGTM: Well-implemented weight type validation.

The weight type gating logic properly validates MXFPX scaling requirements based on the weight/input type combination. The error messages are descriptive and the validation will catch configuration errors early.

202-202: LGTM: Template signature updated for dynamic cluster shape support.

The change from deriving CtaShape from ClusterTileShape to accepting it as a template parameter is consistent with the shift to runtime-determined cluster shapes.

257-257: LGTM: Consistent template signature update.

The template parameter change for SM120 is consistent with the SM100 version and supports the dynamic cluster shape feature.

205-210: LGTM: Proper cluster shape validation for SM100.

The runtime validation correctly enforces the SM100 TMA dynamic CGA constraints (only 1x1x1 and 2x1x1 cluster shapes are supported). The logic is clear and will prevent launching unsupported configurations.

343-345: LGTM: Function call updated to match new signature.

The function call correctly includes the gemm_config parameter, enabling dynamic cluster shape support in the dispatch chain.

394-396: LGTM: Improved error message accuracy.

The updated error message correctly references the SM version-specific tile config field, improving debugging capability when unsupported configurations are encountered.

424-424: LGTM: Added support for 64x32 tile shape on SM100.

The new tile shape configuration follows the established macro pattern and expands the available options for SM100 optimization.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

371-390: Good: Added missing cluster shape conversions

The new cases for ClusterShape_1x4x1, ClusterShape_4x2x1, ClusterShape_2x4x1, and ClusterShape_4x4x1 properly handle the enum values that were added. This addresses the previous review comment about missing conversion cases.

448-463: Well-documented dynamic CGA support

The new constructor properly handles dynamic cluster shape configuration with clear documentation about the behavior when dynamic_cluster_shape is provided. The default values for the new parameters maintain backward compatibility.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

2072-2073: Test output updated for dynamic cluster shape

The config name generation correctly includes the new dynamic_cluster_shape field in the test output, maintaining consistency with the dynamic CGA feature additions.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

26-32: Launcher signature properly extended for dynamic CGA

The template parameters and function signature are correctly extended to support dynamic cluster shapes. The use of cute::Shape<int32_t, int32_t, cute::_1> appropriately represents 2D cluster shapes with the K dimension fixed at 1.
Actionable comments posted: 1
♻️ Duplicate comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
371-386: Good: added cute::Shape mappings for new ClusterShape values (resolves prior gap).

These cases complete the mapping for 1x4x1, 4x2x1, 2x4x1, 4x4x1 noted earlier.
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)

377-377: Add missing #include for std::get usage.

This file uses std::get with the tuple returned by enum_to_shape_tuple but doesn't include <tuple>. Please add the include to avoid relying on transitive includes. Add near the other standard library includes:

```diff
 #include <set>
 #include <vector>
+#include <tuple>
```

458-460: Bug: computed fallback_cluster_shape is not being used.

The fallback_cluster_shape is computed on lines 455-457 but the function call passes cluster_shape twice instead of using the computed fallback value.

```diff
-    auto configs = get_candidate_configs_sm100_dynamic_cluster_shape(
-        config, schedule, cluster_shape, cluster_shape);
+    auto configs = get_candidate_configs_sm100_dynamic_cluster_shape(
+        config, schedule, cluster_shape, fallback_cluster_shape);
```

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)

728-750: Bug: Only the last dynamic_cga variant is appended to operations list.

The is_op_valid check and append are outside the dynamic_cga loop, causing only the last variant (dynamic_cga=False) to be considered. Both variants should be validated and appended.

```diff
     for otype in otypes:
         for dynamic_cga in [True, False]:
             moe_gemm_operation = TrtLlm_GemmLauncher(
                 GemmKind.Grouped, arch, dtype, weight_type, otype, otype, otype,
                 quant_op, epi_tag, cta_shape_mnk, warp_shape, stages, cga_shape,
                 mainloop_schedule, epi_schedule, epi_fusion,
                 is_mx_fpx=(dtype == DataType.e4m3 and weight_type == e2m1),
                 dynamic_cga=dynamic_cga)
-
-        if is_op_valid(moe_gemm_operation):
-            operations.append(moe_gemm_operation)
+            if is_op_valid(moe_gemm_operation):
+                operations.append(moe_gemm_operation)
```

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

136-142: Guard dynamic cluster shape conversion and validate SM100 constraints.

The code unconditionally calls enum_to_shape_tuple even when dynamic_cga is false, and doesn't validate that SM100 only supports 1x1x1 or 2x1x1 cluster shapes.

```diff
     bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-    auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-    auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-        std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-    auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-    auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-        std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+    auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+    auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+    if (dynamic_cga)
+    {
+        auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+        // SM100 dynamic CGA only supports 1x1x1 or 2x1x1
+        TLLM_CHECK_WITH_INFO(
+            (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2)
+                && std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+            "SM100 dynamic CGA supports only 1x1x1 or 2x1x1 cluster shapes, got (%d,%d,%d).",
+            std::get<0>(cluster_shape), std::get<1>(cluster_shape), std::get<2>(cluster_shape));
+        cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+            std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+
+        auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+        cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+            std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+    }
```
🧹 Nitpick comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (4)
2-2: Update copyright year to include the current year (2025).

Guideline requires headers to include the current year. Update the range.

```diff
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
```
85-98: Harden shape encoding: avoid magic numbers, add bounds checks; drop redundant static.

- Replace 1000/1000000 with named constants.
- Add asserts to guard m,n,k in [0, 1000) to prevent collisions.
- For headers, prefer constexpr over "constexpr static" at namespace scope.

```diff
-constexpr static int shape_tuple_to_enum(int m, int n, int k)
-{
-    return m * 1000000 + n * 1000 + k;
-}
+// Encoding: MMM NNN KKK in base 1000
+constexpr int kSHAPE_BASE = 1000;
+constexpr int kSHAPE_BASE2 = kSHAPE_BASE * kSHAPE_BASE;
+
+constexpr int shape_tuple_to_enum(int m, int n, int k)
+{
+    assert(m >= 0 && m < kSHAPE_BASE && n >= 0 && n < kSHAPE_BASE && k >= 0 && k < kSHAPE_BASE);
+    return m * kSHAPE_BASE2 + n * kSHAPE_BASE + k;
+}
```
288-291: Prefer compile-time error over returning 0x0x0 for unknown TileShape.

Returning an all-zero cute::Shape can silently mask missing cases. Trigger a compile-time error instead.

```diff
-    else
-    {
-        return cute::Shape<_0, _0, _0>{};
-    }
+    else
+    {
+        static_assert(kDependentFalse<Shape_MNK>, "Unsupported TileShape in get_tile_shape()");
+    }
```

Add this helper once in the header (outside the function), e.g. near other utilities:

```cpp
template <auto> inline constexpr bool kDependentFalse = false;
```
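For context, a small self-contained sketch of why the dependent-false helper is needed: a plain static_assert(false, ...) is rejected for every instantiation, even when it sits in a discarded if constexpr branch, whereas a value that depends on the template argument is only evaluated when the fallback branch is actually selected.

```cpp
#include <type_traits>

// Always false, but dependent on its template argument, so evaluation is deferred
// until a concrete instantiation actually reaches the static_assert below.
template <auto>
inline constexpr bool kDependentFalse = false;

struct Tile128x128 {};
struct Tile256x64 {};

template <typename Shape>
constexpr int tileRows()
{
    if constexpr (std::is_same_v<Shape, Tile128x128>)
    {
        return 128;
    }
    else if constexpr (std::is_same_v<Shape, Tile256x64>)
    {
        return 256;
    }
    else
    {
        // static_assert(false, ...) here would reject every instantiation of tileRows;
        // the dependent form only fires when an unsupported Shape selects this branch.
        static_assert(kDependentFalse<sizeof(Shape)>, "Unsupported TileShape");
        return 0; // unreachable, keeps the compiler quiet about a missing return
    }
}

static_assert(tileRows<Tile128x128>() == 128);
static_assert(tileRows<Tile256x64>() == 256);
// tileRows<int>() would fail to compile with "Unsupported TileShape".
```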
512-516: Prefer human-readable schedule names in toString().

Use the existing get_mainloop_schedule_name helper.

```diff
-        << "\n\tmainloop sched: " << (int) mainloop_schedule << "\n\tepi sched: " << (int) epilogue_schedule
+        << "\n\tmainloop sched: " << get_mainloop_schedule_name(mainloop_schedule)
+        << "\n\tepi sched: " << (int) epilogue_schedule
```

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)
2-2: Update copyright year to include 2025.

The copyright notice shows 2020-2023, but modifications are being made in 2025. Please update to include the current year.

```diff
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- 3rdparty/cutlass (1 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- 3rdparty/cutlass
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/low_latency_gemm.h (1)
  KernelScheduleType (36-59)
cpp/tensorrt_llm/runtime/torchUtils.h (1)
  DataType (99-115)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (2)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
  TileShape (214-538)
  ClusterShape (310-502)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp (1)
  bool (563-647)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
271-271: Line too long (313 > 120) (E501)
353-353: Line too long (122 > 120) (E501)
367-367: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
529-529: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
656-656: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
699-699: Line too long (139 > 120) (E501)
701-701: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
702-702: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
721-721: `KernelScheduleType` may be undefined, or defined from star imports (F405)
724-724: `DataType` may be undefined, or defined from star imports (F405)
725-725: `DataType` may be undefined, or defined from star imports (F405)
725-725: `DataType` may be undefined, or defined from star imports (F405)
730-730: `GemmKind` may be undefined, or defined from star imports (F405)
746-746: `DataType` may be undefined, or defined from star imports (F405)
🔇 Additional comments (14)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
2072-2073
: LGTM! Test output properly reflects new dynamic cluster shape field. The config label now includes the dynamic_cluster_shape, which aligns with the broader changes to support dynamic CGA in the MoE implementation.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (5)
22-23
: Good: required headers added. Adding <tuple> and <type_traits> resolves previous missing-include issues for tuple utilities and type traits.
310-324
: Enum expansion looks consistent. New ClusterShape values use the encoded scheme and align with the decoding helpers.
420-421
: Fields for dynamic/fallback cluster shapes are clear with sensible defaults. Defaults to Undefined are appropriate.
448-463
: SM100 constructor: supports dynamic and fallback cluster shapes — LGTM. Comment explains semantics; initialization order and defaults look good.
549-551
: ostream dump includes dynamic/fallback cluster shapes — LGTM. Good addition for debugging/logging.
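For quick debugging of tactic selection, the new fields can be dumped directly. A minimal sketch, assuming the stream operator declared in gemm_configs.h and an include path along the lines of the cutlass_extensions layout:

```cpp
#include <iostream>
#include <vector>

#include "cutlass_extensions/gemm_configs.h"

// Print every candidate tactic; the dump now covers tile shape, cluster shape,
// and the new dynamic/fallback cluster shape fields.
void dumpCandidateConfigs(std::vector<tensorrt_llm::cutlass_extensions::CutlassGemmConfig> const& configs)
{
    for (auto const& config : configs)
    {
        std::cout << config << std::endl;
    }
}
```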
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (2)
26-28
: LGTM! Template parameters properly extended for dynamic CGA support.The addition of
EpilogueSchedule
andDYNAMIC_CGA
template parameters is well-structured and aligns with the dynamic CGA feature implementation across the codebase.
29-32
: LGTM! Function signature properly extended with dynamic cluster shape parameters.The addition of
dynamic_cluster_shape
andfallback_cluster_shape
parameters at the end of the function signature maintains backward compatibility while enabling dynamic CGA support. The use ofcute::Shape<int32_t, int32_t, cute::_1>
is consistent with CUTLASS conventions.cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
370-437
: LGTM! Well-structured SM100 dynamic cluster shape configuration generator.The function properly:
- Determines 1SM/2SM support based on dynamic_cluster_shape
- Handles FP4_ONLY configurations with appropriate TMA epilogue requirement
- Builds comprehensive tile configurations for both 1SM and 2SM variants
- Correctly uses the new 6-argument CutlassGemmConfig constructor
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
142-157
: LGTM! Clean extension of TrtLlm_GemmLauncher with dynamic_cga support.The addition of the
dynamic_cga
parameter with a default value ofFalse
maintains backward compatibility while enabling the new functionality.
247-271
: LGTM! Proper instantiation with dynamic_cga and epilogue schedule.The instantiation correctly:
- Passes the epilogue schedule after extracting the last segment
- Includes the dynamic_cga flag as a string boolean
- Documents the workaround for the missing
PtrArrayTmaWarpSpecialized
typecpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3)
73-75
: LGTM! Function rename and signature update for centralized dispatch.The rename to
dispatchMoeGemmFinalDispatchTmaWarpSpecialized
and addition of thegemm_config
parameter properly centralizes the MOE GEMM dispatch logic.
119-132
: LGTM! Proper validation of FP4/FP8 scaling requirements.The code correctly enforces that WFP4AFP8 combinations must use MXFPX scaling, while preventing its use for other weight combinations. This ensures type safety and proper configuration.
202-255
: LGTM! Proper tile shape validation for SM100 with dynamic CGA constraints.The function correctly:
- Uses CtaShape template parameter instead of ClusterTileShape
- Validates that SM100 only supports 1x1x1 or 2x1x1 cluster shapes
- Maintains proper tile size constraints for different data types
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
Force-pushed from 2dc4072 to 6edb1e5 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (3)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
85-88
: Add validation for shape dimensions to prevent encoding overflow. The encoding assumes each dimension is less than 1000; values >= 1000 will produce incorrect results.
 constexpr static int shape_tuple_to_enum(int m, int n, int k)
 {
+    assert(m >= 0 && m < 1000 && n >= 0 && n < 1000 && k >= 0 && k < 1000
+        && "Shape dimensions must be < 1000 for correct encoding");
     return m * 1000000 + n * 1000 + k;
 }
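For context, the encoding packs each dimension into three decimal digits. A standalone sketch of the round trip — the decode helper here is a hypothetical mirror of the enum_to_shape_tuple helper referenced elsewhere in this PR, not the TensorRT-LLM source:

```cpp
#include <cassert>
#include <cstdio>
#include <tuple>

constexpr int shape_tuple_to_enum(int m, int n, int k)
{
    return m * 1000000 + n * 1000 + k;
}

// Hypothetical inverse: recover (m, n, k) from the packed value.
constexpr std::tuple<int, int, int> enum_to_shape_tuple(int encoded)
{
    return {encoded / 1000000, (encoded / 1000) % 1000, encoded % 1000};
}

int main()
{
    static_assert(shape_tuple_to_enum(4, 2, 1) == 4002001, "4x2x1 packs into 4002001");
    auto [m, n, k] = enum_to_shape_tuple(shape_tuple_to_enum(4, 2, 1));
    assert(m == 4 && n == 2 && k == 1);
    std::printf("decoded %dx%dx%d\n", m, n, k);
    return 0;
}
```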
490-503
: Special-case Undefined and ChooseWithHeuristic for readable names. Currently returns "0x0x1" for special enum values. Should return meaningful names instead.
 std::string getTileConfigAsName() const
 {
     if (sm_version == 120 || sm_version == 121)
+    {
+        if (tile_config_sm120 == CutlassTileConfigSM120::Undefined) return "Undefined";
+        if (tile_config_sm120 == CutlassTileConfigSM120::ChooseWithHeuristic) return "ChooseWithHeuristic";
         return get_tile_shape_name(tile_config_sm120);
+    }
     if (sm_version >= 100 && sm_version < 120)
+    {
+        if (tile_config_sm100 == CutlassTileConfigSM100::Undefined) return "Undefined";
+        if (tile_config_sm100 == CutlassTileConfigSM100::ChooseWithHeuristic) return "ChooseWithHeuristic";
         return get_tile_shape_name(tile_config_sm100);
+    }
     if (sm_version == 90)
+    {
+        if (tile_config_sm90 == CutlassTileConfigSM90::Undefined) return "Undefined";
+        if (tile_config_sm90 == CutlassTileConfigSM90::ChooseWithHeuristic) return "ChooseWithHeuristic";
         return get_tile_shape_name(tile_config_sm90);
+    }
     if (sm_version < 90)
         return std::to_string((int) tile_config_sm80);
     assert(false && "Invalid SM version");
     return "invalid";
 }
134-142
: Guard dynamic cluster shape conversion to prevent invalid shapes
enum_to_shape_tuple
is called unconditionally, producing (0,0,0) for Undefined shapes even when dynamic CGA is disabled.

 bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-    std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+if (dynamic_cga)
+{
+    auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+    // SM100 supports only 1x1x1 or 2x1x1 dynamic CGA
+    TLLM_CHECK_WITH_INFO(
+        (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2)
+            && std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+        "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).",
+        std::get<0>(cluster_shape), std::get<1>(cluster_shape), std::get<2>(cluster_shape));
+    cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+        std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+
+    auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+    cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+        std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+}
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (3)
271-271
: Line exceeds maximum length limit. This line is 313 characters long, exceeding the 120 character limit specified in the coding guidelines.
Split the macro instantiation across multiple lines:
- INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag}, - {epi_sched}, {epi_tag}, {epi_fusion}, {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, {"true" if operation.is_mx_fpx else "false"}, {"true" if operation.dynamic_cga else "false"}, false);\n + INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag}, + {epi_sched}, {epi_tag}, {epi_fusion}, + {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, + {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, + {"true" if operation.is_mx_fpx else "false"}, + {"true" if operation.dynamic_cga else "false"}, false);\n
366-368
: Track TODO: Revert FP4 epilogue schedule restriction. The TODO comment indicates this is a temporary restriction until CUTLASS adds support for blockscaled + no smem. Consider creating a tracking issue for this limitation.
Would you like me to create a GitHub issue to track the removal of this restriction once CUTLASS support is available?
699-707
: Line length exceeds limit. Line 699 exceeds the 120 character limit.
- # Some shapes for SM100 are better with NoSmem, note the kernel will internally map to the 1 or 2 SM variants based on the cga_shape[0]
+ # Some shapes for SM100 are better with NoSmem.
+ # Note: kernel internally maps to 1 or 2 SM variants based on cga_shape[0]
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
3rdparty/cutlass (1 hunks)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- 3rdparty/cutlass
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
271-271: Line too long (313 > 120) (E501)
353-353: Line too long (122 > 120) (E501)
367-367: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
529-529: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
656-656: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
699-699: Line too long (139 > 120) (E501)
701-701: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
702-702: `EpilogueScheduleType` may be undefined, or defined from star imports (F405)
721-721: `KernelScheduleType` may be undefined, or defined from star imports (F405)
724-724: `DataType` may be undefined, or defined from star imports (F405)
725-725: `DataType` may be undefined, or defined from star imports (F405)
725-725: `DataType` may be undefined, or defined from star imports (F405)
730-730: `GemmKind` may be undefined, or defined from star imports (F405)
746-746: `DataType` may be undefined, or defined from star imports (F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (5)
142-157
: LGTM: Clean addition of dynamic_cga parameterThe dynamic_cga parameter is properly added with a sensible default value (False) to maintain backward compatibility.
350-355
: LGTM: Clear runtime CGA constraints for SM100The constraints properly limit SM100 to 1SM or 2SM variants (cga_m <= 2) with single N and K dimensions, which aligns with the hardware capabilities.
529-529
: LGTM: Appropriate epilogue schedule for SM90 grouped GEMMThe change to
PtrArrayNoSmemWarpSpecialized
aligns with the pointer array-based grouped GEMM implementation.
656-666
: LGTM: Simplified SM120 configurationGood simplification by removing the intermediate cga_tile_shape_mnk variable and using TmaWarpSpecialized epilogue schedule for SM120.
728-750
: LGTM: Dynamic CGA variant generation properly fixedGood fix! The validity check and append operations are now correctly placed inside the inner loop, ensuring both dynamic_cga variants (True/False) are properly generated and validated. This addresses the bug identified in the previous review.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)
26-32
: LGTM: Well-structured template interface extensionThe addition of
EpilogueSchedule
andDYNAMIC_CGA
template parameters, along with the runtimedynamic_cluster_shape
andfallback_cluster_shape
parameters, provides a clean interface for dynamic cluster shape support. The parameter ordering is logical and consistent with existing conventions.cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3)
371-390
: LGTM: Complete cluster shape template coverageAll new cluster shapes (1x4x1, 4x2x1, 2x4x1, 4x4x1) are properly handled with appropriate CUTE shape mappings. The default case returning
Shape<_0, _0, _0>
provides safe fallback behavior.
448-463
: LGTM: Well-designed SM100 constructor with dynamic CGA supportThe constructor properly supports dynamic cluster shapes with clear documentation explaining that when
dynamic_cluster_shape
is provided, thecluster_shape
parameter is interpreted as 1SM vs 2SM mode selection rather than a static shape.
512-516
: LGTM: Improved config string representationThe toString() method now provides much more readable output with decoded shape names for tile, cluster, dynamic cluster, and fallback cluster shapes.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3)
73-75
: LGTM: Clear function naming and proper config passingThe rename to
dispatchMoeGemmFinalDispatchTmaWarpSpecialized
better reflects this as the final dispatch point, and passinggemm_config
provides all necessary configuration in one parameter.
202-210
: LGTM: Proper SM100 cluster shape validationThe function correctly validates that SM100 only supports 1x1x1 and 2x1x1 cluster shapes for dynamic CGA, with clear constraints documented.
366-366
: LGTM: Clear and specific error messagesThe error messages properly distinguish between cluster shape and tile shape configuration errors, making debugging easier.
Also applies to: 394-396
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
1-16
: Update copyright year. Coding guideline requires the current year in the NVIDIA header. Consider updating “1993-2024” to “1993-2025”.
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
300-303
: Robust workload file-open check — LGTM. Throwing invalid_argument when the file can’t be opened is correct and integrates with the existing try/catch in parseArgsAndRunBench. No further changes needed.
462-490
: Behavior change: GEMM profiling now triggers for all non-LAYER cases. The widened condition will apply the profiling path regardless of current routing_config. This also overrides user-provided routing and tactic splits when gemm_to_profile != LAYER. Confirm this broader behavior change is intentional for the new dynamic-CGA workflows.
/bot run
PR_Github #14860 [ run ] triggered by Bot
PR_Github #14860 [ run ] completed with state
Actionable comments posted: 0
🧹 Nitpick comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
2-2
: Update copyright year to reflect current changes. The copyright header shows 2023, but this code is being modified in 2024/2025. The coding guidelines require that "All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year."
- * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (3)
1075-1075
: Good optimization to fetch tactics once. Storing the tactics vector in a local variable before the logging and counter population blocks is an efficient optimization that avoids redundant calls to `mMoERunner.getTactics()`.
1086-1099
: Excellent addition of detailed per-tactic metadata. The new counters provide comprehensive visibility into tactic characteristics including SM version, tile shape, cluster shapes (static, dynamic, and fallback), and epilogue schedule. This aligns well with the PR's objective to add dynamic cluster shape support and will be valuable for performance analysis and debugging.
The implementation correctly extracts metadata from both selected tactics using consistent naming conventions (t1_* and t2_* prefixes).
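For illustration, per-tactic counters of this kind look roughly as follows with Google Benchmark; the counter names and Tactic fields below are assumptions based on the description above, not the exact fixture code:

```cpp
#include <benchmark/benchmark.h>

// Illustrative only: expose selected-tactic metadata in the benchmark report.
template <typename Tactic>
void recordTacticCounters(benchmark::State& state, Tactic const& t1, Tactic const& t2)
{
    state.counters["t1_sm"] = t1.sm_version;
    state.counters["t1_cluster_shape"] = static_cast<int>(t1.cluster_shape);
    state.counters["t1_dynamic_cluster_shape"] = static_cast<int>(t1.dynamic_cluster_shape);
    state.counters["t1_fallback_cluster_shape"] = static_cast<int>(t1.fallback_cluster_shape);
    state.counters["t2_sm"] = t2.sm_version;
    state.counters["t2_dynamic_cluster_shape"] = static_cast<int>(t2.dynamic_cluster_shape);
}
```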
1086-1099
: Verify existence of newdynamic_cluster_shape
andfallback_cluster_shape
fieldsI wasn’t able to locate definitions for the new fields in the codebase. Please confirm that the
Tactic
type returned bymMoERunner.getTactics()
(or the underlyingCutlassGemmConfig
) actually declares:
dynamic_cluster_shape
fallback_cluster_shape
If they’re defined in a different header or via a code-gen step, ensure that file is included in your build. Otherwise, add these fields to the
Tactic
/CutlassGemmConfig
struct so the benchmark compiles and reports correct values.
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)
449-468
: Bug fix confirmed: fallback_cluster_shape now used instead of passing cluster_shape twice. The computed fallback_cluster_shape is correctly passed into get_candidate_configs_sm100_dynamic_cluster_shape.
376-378
: Build break: std::get used without including <tuple>. This TU uses std::get with enum_to_shape_tuple but doesn’t include <tuple>. Some transitive includes may hide this locally, but it’s fragile and can break builds.
Add explicit includes:
 #include <cuda_runtime_api.h>
+#include <tuple>   // for std::get
 #include <set>
 #include <vector>
+// Optional but recommended for std::pair in tile_configs
+#include <utility>
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
430-435
: Avoid shadowing parameter ‘config’ with local variable. The local CutlassGemmConfig config shadows the function parameter config, which harms readability and can confuse future edits.
Apply:
- for (auto [tile, cluster] : tile_configs) - { - CutlassGemmConfig config{ - tile, MainloopScheduleType::AUTO, schedule, cluster, dynamic_cluster_shape, fallback_cluster_shape}; - candidate_configs.push_back(config); - } + for (auto [tile, cluster] : tile_configs) + { + CutlassGemmConfig candidate{ + tile, MainloopScheduleType::AUTO, schedule, cluster, dynamic_cluster_shape, fallback_cluster_shape}; + candidate_configs.push_back(candidate); + }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1)
74-76
: Verified: All instantiations updated to 5-parameter form
I scanned the entire codebase forshould_filter_tma_warp_specialized_gemm_problem_shape_v<…>
and found no remaining uses of the old 4-parameter signature. Thefalse
value for SM90 is correctly propagated.cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1)
37-41
: Intent check: FAST_BUILD filters all dynamic-CGA pathsIncluding DYNAMIC_CGA in the FAST_BUILD filter makes any dynamic cluster-shape path get excluded. If that’s intended (to shrink compile time), LGTM; otherwise, consider gating by arch or feature flag.
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)
382-400
: FP4-only gating against non-TMA epilogues looks correctFor FP4_ONLY, returning empty set unless schedule == TMA avoids generating unsupported variants. This aligns with the PR’s epilogue constraints.
443-445
: FAST_BUILD default includes dynamic/fallback fields (Undefined) — goodThe single SM100 candidate now carries dynamic_cluster_shape and fallback_cluster_shape as Undefined, keeping the API consistent even in FAST_BUILD.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
1-5
: Missing NVIDIA copyright header (2025) — please add required header. The file is missing the mandatory NVIDIA copyright header with the current year as per project guidelines.
Apply this diff at the top of the file:
+// Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + #include "tensorrt_llm/common/cudaUtils.h"
🧹 Nitpick comments (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2)
2068-2074
: Also print fallback_cluster_shape in tactic label for debuggability. Including the fallback cluster shape will make failures easier to pinpoint when dynamic CGA falls back.
Apply this diff to extend the label:
- << " dynamic cluster shape " << (int) conf.dynamic_cluster_shape << " mainloop sched " - << (int) conf.mainloop_schedule << " epi sched " << (int) conf.epilogue_schedule; + << " dynamic cluster shape " << (int) conf.dynamic_cluster_shape + << " fallback cluster shape " << (int) conf.fallback_cluster_shape + << " mainloop sched " << (int) conf.mainloop_schedule + << " epi sched " << (int) conf.epilogue_schedule;
2095-2097
: Guard dynamic-cluster filtering to TMA-warp-specialized configs. Being explicit avoids future confusion if non-TMA configs ever carry non-Undefined dynamic_cluster_shape values.
Apply this diff:
- if (conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined - && conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1) + if (conf1.is_tma_warp_specialized + && conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined + && conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1) continue; // To reduce the number of iterations we only test one dynamic cluster shape- if (conf2.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined - && conf2.dynamic_cluster_shape - != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1) + if (conf2.is_tma_warp_specialized + && conf2.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined + && conf2.dynamic_cluster_shape + != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1) continue; // To reduce the number of iterations we only test one dynamic cluster shapeAlso applies to: 2100-2103
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🔇 Additional comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
2095-2103
: Sanity check: limiting dynamic cluster shapes to only 4x1x1 may miss regressions on other candidate shapes. If the SM100 dynamic-CGA candidate generator emits other shapes (e.g., 2x2x1, 2x1x2), this filter will skip them, reducing coverage. If runtime is a concern, consider sampling a small set of representative dynamic shapes or gating with an env var to expand coverage in CI-nightly.
Would you like me to propose a minimal sampling approach (e.g., test Undefined plus up to N dynamic shapes detected in configs), or gate via an environment variable to include all dynamic shapes in extended runs?
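One way to implement the env-var idea is a small predicate the filtering loop can call; the variable name and helper below are illustrative, assuming the ClusterShape enum already used by this test:

```cpp
#include <cstdlib>

// Default runs keep the current fast filter (Undefined plus 4x1x1);
// setting the env var lets extended runs cover every dynamic cluster shape.
bool shouldTestDynamicClusterShape(tensorrt_llm::cutlass_extensions::ClusterShape shape)
{
    using tensorrt_llm::cutlass_extensions::ClusterShape;
    if (std::getenv("TRTLLM_TEST_ALL_DYNAMIC_CLUSTER_SHAPES") != nullptr)
    {
        return true;
    }
    return shape == ClusterShape::Undefined || shape == ClusterShape::ClusterShape_4x1x1;
}
```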
Force-pushed from 357edca to 505e1b2 (Compare)
Actionable comments posted: 3
🔭 Outside diff range comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
462-490
: Guard GEMM profiling path only for valid specific-GEMM values. As written, any unexpected value (e.g., 0 or 4) will enter the profiling path, force uniform routing, and possibly yield surprising behavior. Restrict the condition to explicit GEMM_1/GEMM_2.
- if (gemm_to_profile != (int) GemmToProfile::LAYER) + if (gemm_to_profile == (int) GemmToProfile::GEMM_1 + || gemm_to_profile == (int) GemmToProfile::GEMM_2) {Optionally, also validate gemm_to_profile right after parsing and reject invalid values to fail fast. I can propose that change if desired.
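A sketch of that fail-fast validation, assuming the benchmark's existing GemmToProfile enum; the helper name is illustrative:

```cpp
#include <stdexcept>

// Accept only the values the benchmark knows how to profile.
void validateGemmToProfile(int gemm_to_profile)
{
    if (gemm_to_profile != (int) GemmToProfile::LAYER && gemm_to_profile != (int) GemmToProfile::GEMM_1
        && gemm_to_profile != (int) GemmToProfile::GEMM_2)
    {
        throw std::invalid_argument("gemm_to_profile must be LAYER, GEMM_1 or GEMM_2");
    }
}
```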
♻️ Duplicate comments (6)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
471-479
: Fix optional vs int comparison; lower log threshold for a warning. This repeats a prior review note and remains unresolved: comparing std::optional directly to an int is not portable pre-C++23 and can fail to compile. Also, a warning should not require ERROR level to be visible.
- static bool printed = false; - if (routing_config != UNIFORM_ROUTING_CONFIG && LOG_LEVEL >= ERROR && !printed) + static bool printed = false; + if (routing_config.has_value() && *routing_config != UNIFORM_ROUTING_CONFIG + && LOG_LEVEL >= INFO && !printed) { std::cerr << "Warning: Profiling a specific GEMM will always use uniform random token distribution" << std::endl; printed = true; } routing_config = UNIFORM_ROUTING_CONFIG;If you have a WARNING level, prefer
LOG_LEVEL >= WARNING
instead of INFO.cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3)
22-23
: Missing type-traits include is now explicitly addedThe earlier concern about using std::is_enum_v/std::underlying_type_t without including <type_traits> is addressed by adding <type_traits>. Good fix.
85-91
: Added bounds checks for shape encoding prevent overflow. The asserts for non-negative and <1000 dimensions address the prior overflow/encoding-range concern in shape_tuple_to_enum.
347-399
: get_cluster_shape() mapping completed for new shapes. Coverage for newly added shapes is present; the previous gap is resolved.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
244-251
: Fragile string hacks for epilogue tags/schedules; prefer explicit mapping. The split(':')[-1] and replace("1Sm","") hacks are brittle. Prefer a small explicit mapping dict for schedule/type token normalization before macro use.
I can provide a mapping-based refactor if desired.
727-751
: Fix for dynamic_cga variants append bug is in place. The validity check and append are now inside the inner dynamic_cga loop, so both True/False variants are preserved.
🧹 Nitpick comments (11)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
2-2
: Update copyright year to include 2025 to comply with org standardPer the coding guidelines, the NVIDIA copyright header should include the current year.
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1)
2-2
: Update copyright year to include 2025 per coding guidelinesAll OSS source files must include an NVIDIA copyright header that includes the current year.
Apply this diff:
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
2-2
: Update copyright year to include 2025 per coding guidelinesAll OSS source files must include an NVIDIA copyright header that includes the current year.
Apply this diff:
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
498-510
: Optional: return consistent, descriptive names for heuristics across SM familiesCurrently getTileConfigAsName() returns “heuristic” for ChooseWithHeuristic. Consider aligning the naming to “chooseWithHeuristic” for clarity and consistency with enum naming in logs.
Apply this minimal patch:
- if (sm_version == 120 || sm_version == 121) - return get_tile_shape_name(tile_config_sm120); + if (sm_version == 120 || sm_version == 121) + { + if (tile_config_sm120 == CutlassTileConfigSM120::Undefined) return "undefined"; + if (tile_config_sm120 == CutlassTileConfigSM120::ChooseWithHeuristic) return "chooseWithHeuristic"; + return get_tile_shape_name(tile_config_sm120); + } - if (sm_version >= 100 && sm_version < 120) - return get_tile_shape_name(tile_config_sm100); + if (sm_version >= 100 && sm_version < 120) + { + if (tile_config_sm100 == CutlassTileConfigSM100::Undefined) return "undefined"; + if (tile_config_sm100 == CutlassTileConfigSM100::ChooseWithHeuristic) return "chooseWithHeuristic"; + return get_tile_shape_name(tile_config_sm100); + } - if (sm_version == 90) - return get_tile_shape_name(tile_config_sm90); + if (sm_version == 90) + { + if (tile_config_sm90 == CutlassTileConfigSM90::Undefined) return "undefined"; + if (tile_config_sm90 == CutlassTileConfigSM90::ChooseWithHeuristic) return "chooseWithHeuristic"; + return get_tile_shape_name(tile_config_sm90); + }cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)
2-2
: Update copyright year to include 2025 per coding guidelinesAll OSS source files must include an NVIDIA copyright header that includes the current year.
Apply this diff:
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
236-251
: Dead/broken string manipulation; replace-in-place has no effect and isn’t used
- kernel_sched.replace(...) doesn’t assign back, so it’s a no-op.
- kernel_sched isn’t used in the macro path anyway.
Remove the unused replace and tidy the schedule handling to avoid confusion.
Apply this diff:
- kernel_sched.replace("::Kernel", "::KernelGrouped") - # epi_sched += "Grouped" + // Grouped schedules are handled by the macro; no string surgery needed here.Optionally, also remove the commented “epi_sched += 'Grouped'” line.
699-704
: Style: very long lines; consider breaking for readability and linting (E501)These lines exceed 120 characters. If ruff is enforced in CI, wrap the long string or split arguments across lines to avoid E501.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (4)
2-2
: Update copyright year to include 2025 per coding guidelinesAll OSS source files must include an NVIDIA copyright header that includes the current year.
Apply this diff:
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
171-183
: Static assert message is confusing; check epilogue schedule once, early-returnThe static_assert condition is fine, but the error string reads like guidance. Also, after selecting selected_func, early-return for clarity.
Apply this diff:
- static_assert( - is_wfp4afp8 || (std::is_same_v<T, __nv_fp4_e2m1> && std::is_same_v<WeightType, __nv_fp4_e2m1>), - "Non-block scaled shapes should try no smem epilogue"); + static_assert( + is_wfp4afp8 || (std::is_same_v<T, __nv_fp4_e2m1> && std::is_same_v<WeightType, __nv_fp4_e2m1>), + "Block-scaled MMA requires TMA epilogue; other combos should not select NoSmem here."); TLLM_CHECK_WITH_INFO(gemm_config.epilogue_schedule == cutlass_extensions::EpilogueScheduleType::TMA, "No Smem epilogue schedule is not supported for block scaled"); auto selected_func = dynamic_cga ? &kernels::cutlass_kernels::tma_warp_specialized_generic_moe_gemm_kernelLauncher<Arch, T, WeightType, OutputType, cutlass::epilogue::PtrArrayTmaWarpSpecialized, EpilogueTag, FUSION, TileShape, ClusterShape, is_wfp4afp8, true, false> : &kernels::cutlass_kernels::tma_warp_specialized_generic_moe_gemm_kernelLauncher<Arch, T, WeightType, OutputType, cutlass::epilogue::PtrArrayTmaWarpSpecialized, EpilogueTag, FUSION, TileShape, ClusterShape, is_wfp4afp8, false, false>; selected_func(hopper_input, num_experts, multi_processor_count, stream, occupancy, workspace_size, cluster_shape_cute, cluster_shape_cute_fallback); + return;
186-198
: Pass empty shapes for non-dynamic paths for consistencyFor SM120 and pre-SM100 paths you pass {}, {}; mirror that for SM100 non-dynamic to avoid propagating (0,0,0) shapes derived from Undefined.
If you adopt the earlier guard (don’t materialize shapes unless dynamic_cga), this is already achieved.
368-368
: Misleading closing-brace commentThis line closes a switch/function scope, not a namespace. Adjust the comment or remove it to avoid confusion.
Apply this diff:
-} // namespace tensorrt_llm +} // end dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
3rdparty/cutlass
(1 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
(2 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(3 hunks)cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
(12 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
(8 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
(10 hunks)cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- 3rdparty/cutlass
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
271-271: Line too long (313 > 120)
(E501)
353-353: Line too long (122 > 120)
(E501)
367-367: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
529-529: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
656-656: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
699-699: Line too long (139 > 120)
(E501)
701-701: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
702-702: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
721-721: KernelScheduleType
may be undefined, or defined from star imports
(F405)
724-724: DataType
may be undefined, or defined from star imports
(F405)
725-725: DataType
may be undefined, or defined from star imports
(F405)
725-725: DataType
may be undefined, or defined from star imports
(F405)
730-730: GemmKind
may be undefined, or defined from star imports
(F405)
746-746: DataType
may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
300-303
: Robust file-open validation — LGTM. Good addition. Throwing on file-open failure makes the failure mode explicit and avoids hard-to-debug downstream errors.
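As an aside, the pattern is simple enough to show in a standalone sketch (the file name and message below are made up for illustration; this is not the benchmark's actual code):

#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Open a file for reading and fail fast with a descriptive error instead of
// letting an unopened stream produce confusing failures later.
std::ifstream openOrThrow(std::string const& path)
{
    std::ifstream file(path);
    if (!file.is_open())
    {
        throw std::runtime_error("Failed to open file: " + path);
    }
    return file;
}

int main()
{
    try
    {
        auto file = openOrThrow("benchmark_config.json"); // hypothetical input file
        std::cout << "opened successfully\n";
    }
    catch (std::exception const& e)
    {
        std::cerr << e.what() << "\n";
    }
    return 0;
}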
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1)
74-76
: Template-gate update for DYNAMIC_CGA is correct. Passing the new bool template argument (false) to should_filter_tma_warp_specialized_gemm_problem_shape_v aligns with the updated template signature and keeps SM90 static-CGA gating behavior unchanged here.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
298-316
: Tile shape name helper: generic int-backed enum support is solid. The templated get_tile_shape_name handles Undefined/Heuristic and decodes shape triplets cleanly via enum_to_shape_tuple. LGTM.
318-333
: New ClusterShape enum values added. Extended cluster shapes (1x4x1, 4x1x1, 4x2x1, 2x4x1, 4x4x1, 1x8x1, 8x1x1) are defined and int-encoded consistently.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)
25-33
: Template parameter and launcher signature alignment with generator. The addition of EpilogueSchedule and DYNAMIC_CGA template parameters and dynamic/fallback cluster shape arguments matches the generator changes.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
125-158
: Propagate dynamic_cga through launcher config: LGTM. The new init param and attribute (dynamic_cga) are correctly added, stored, and later used in instantiation.
349-356
: SM100 validation tightened correctly for runtime CGA. The early-return filter to allow only (1|2)x1x1 shapes for SM100 aligns with the dynamic CGA model and your dispatch path.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (4)
119-133
: MXFPX/WFP4-AFP8 gating checks look correct. The runtime checks enforce MXFPX when WeightType is FP4 and T is FP8 e4m3, and ban MXFPX otherwise. Good safety.
202-255
: SM100 tile-shape predicate is cohesive. The constexpr predicate matches the Python generator's constraints (TileM in {64,128}, TileN alignment, FP4 gating, small-N FP8 exceptions). LGTM.
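For illustration only, a self-contained sketch of a compile-time tile filter in this spirit; the thresholds used here (TileM in {64,128}, TileN a multiple of 32 up to 256) are assumptions for the example and not the actual rules in the dispatch header:

#include <iostream>

// Compile-time filter over (TileM, TileN): reject tile shapes the example
// considers unsupported. The concrete limits are illustrative only.
template <int TileM, int TileN>
constexpr bool isSupportedSm100TileShape()
{
    bool const validM = (TileM == 64) || (TileM == 128);
    bool const validN = (TileN % 32 == 0) && (TileN >= 32) && (TileN <= 256);
    return validM && validN;
}

int main()
{
    static_assert(isSupportedSm100TileShape<128, 128>(), "expected supported");
    static_assert(!isSupportedSm100TileShape<32, 128>(), "expected filtered out");
    std::cout << "tile filter sketch compiles\n";
    return 0;
}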
371-459
: Tile-shape dispatch updates for SM100/SM120: LGTM. New shape cases and DEFAULT handling are consistent with the additions in gemm_configs and the generator.
465-472
: Workspace sizing reuses dispatch path cleanly. calcMaxWorkspaceSizeTmaWarpSpecialized leverages the updated dispatcher; good reuse to avoid template bloat.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
Actionable comments posted: 1
🔭 Outside diff range comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
2-2
: Update copyright year to include 2025 per OSS header policy. Per coding guidelines, the header must include the current year.
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
  *
16-16
: Add required header include guards. Headers must have a TRTLLM_-prefixed include guard. This prevents multiple inclusion and follows project conventions.
@@
+#ifndef TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
+#define TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
@@
 } // namespace tensorrt_llm::kernels::cutlass_kernels
+
+#endif // TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H

Also applies to: 474-474
♻️ Duplicate comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)
134-143
: Guard dynamic CGA shape conversion and enforce SM100 allowed shapes (1x1x1, 2x1x1). You unconditionally convert enum_to_shape_tuple even when dynamic CGA is disabled, which can materialize invalid (Undefined) shapes. Also, SM100 dynamic CGA supports only (1|2)x1x1; enforce this at runtime to fail fast. This aligns with prior feedback.
-    bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-    auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-    auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-        std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-    auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-    auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-        std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+    bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
+    // Only materialize runtime shapes if dynamic CGA is enabled.
+    auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+    auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+    if (dynamic_cga)
+    {
+        auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+        auto cluster_shape_fallback
+            = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+        // SM100 dynamic CGA supports only (1|2)x1x1. Fail fast on invalid shapes.
+        TLLM_CHECK_WITH_INFO(
+            ((std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2)
+                && std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1),
+            "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).",
+            (int) std::get<0>(cluster_shape), (int) std::get<1>(cluster_shape), (int) std::get<2>(cluster_shape));
+        cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+            std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+        cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+            std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+    }

Would you like me to push a patch updating all similar call sites that materialize dynamic CGA shapes?
🧹 Nitpick comments (7)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (7)
18-21
: Prefer #if defined(...) over #ifdef per guidelines. Use #if defined(...) to improve consistency and readability.
-#ifdef __GNUC__ // Check if the compiler is GCC or Clang
+#if defined(__GNUC__) // Check if the compiler is GCC or Clang
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Wstrict-aliasing"
-#endif
+#endif
49-51
: Prefer #if defined(...) over #ifdef per guidelines (pop block). Mirror the change for the matching pop block.
-#ifdef __GNUC__ // Check if the compiler is GCC or Clang
+#if defined(__GNUC__) // Check if the compiler is GCC or Clang
 #pragma GCC diagnostic pop
-#endif
+#endif
62-66
: Use C++ standard headers instead of C headers. Replace <math.h> with <cmath> per the guideline "C headers should not be used directly."
 #include <cuda.h>
 #include <cuda_fp16.h>
-#include <math.h>
+#include <cmath>
 #include <sstream>
162-166
: Optionally return after dispatch to make control-flow explicit. Not required due to if-constexpr branching, but adding a return clarifies intent and avoids accidental fall-through on future edits.
     func_map[tma_epilogue][dynamic_cga](hopper_input, num_experts, multi_processor_count, stream, occupancy,
         workspace_size, cluster_shape_cute, cluster_shape_cute_fallback);
+    return;
202-210
: Use compile-time condition for constant ClusterShape checks. These use compile-time constants; using if constexpr avoids generating dead runtime branches.
-    if (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1 || cute::size<2>(ClusterShape{}) != 1)
+    if constexpr (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1
+        || cute::size<2>(ClusterShape{}) != 1)
     {
         return false;
     }
275-286
: Fix minor grammar in documentation comment. Tighten the prose for clarity.
 /*
-    1x1x1 cluster shape is are supported for any tile shape.
+    1x1x1 cluster shape is supported for any tile shape.
-    2x1x1 cluster shape is only supported for when the M tile is at least 128.
+    2x1x1 cluster shape is supported only when the M tile is at least 128.
-    1x2x1 cluster shape is only supported when the N tile is at least 128.
+    1x2x1 cluster shape is supported only when the N tile is at least 128.
-    2x2x1 cluster shape is only supported when both the M and N tiles are at least 128.
+    2x2x1 cluster shape is supported only when both the M and N tiles are at least 128.
-    We make the above restrictions are to improve compilation speed in TRT-LLM by pruning kernels
+    We make the above restrictions to improve compilation speed in TRT-LLM by pruning kernels
     that may not be very useful in practice.
 */
368-368
: Remove incorrect namespace-closing comment on a function brace. This brace closes a function, not a namespace. The comment is misleading.
-} // namespace tensorrt_llm
+}
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
(8 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🧠 Learnings (2)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)
119-132
: MXFPX gating for WFP4AFP8 is correct. The scaling-type checks for the WFP4AFP8 combination are precise and fail fast for unsupported pairings. Good defensive validation.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
4e588e6 to c1c30a6
Actionable comments posted: 0
♻️ Duplicate comments (6)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
85-90
: Add validation for shape dimensions to prevent overflow. The encoding scheme assumes each dimension is less than 1000. For values >= 1000, this will produce incorrect results due to overflow in the encoding logic.
Consider adding documentation and validation:
 constexpr static int shape_tuple_to_enum(int m, int n, int k)
 {
+    // Dimensions must be < 1000 for correct encoding
     assert(m >= 0 && n >= 0 && k >= 0);
     assert(m < 1000 && n < 1000 && k < 1000);
     return m * 1000000 + n * 1000 + k;
 }
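To make the packing concrete, a small self-contained round-trip sketch of the m * 1000000 + n * 1000 + k encoding and its inverse (the function names here are illustrative; only the arithmetic mirrors the header):

#include <tuple>

// Pack (m, n, k) into one int; each dimension must stay below 1000.
constexpr int shapeTupleToEnum(int m, int n, int k)
{
    return m * 1000000 + n * 1000 + k;
}

// Inverse of the packing above.
constexpr std::tuple<int, int, int> enumToShapeTuple(int value)
{
    return {value / 1000000, (value / 1000) % 1000, value % 1000};
}

int main()
{
    constexpr int encoded = shapeTupleToEnum(128, 256, 64);
    static_assert(encoded == 128256064, "unexpected packing");
    constexpr auto decoded = enumToShapeTuple(encoded);
    static_assert(std::get<0>(decoded) == 128 && std::get<1>(decoded) == 256 && std::get<2>(decoded) == 64,
        "round trip failed");
    return 0;
}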
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)
370-437
: LGTM! Well-structured dynamic cluster shape candidate generation. The new helper function correctly:
- Selects 1SM/2SM variants based on dynamic_cluster_shape's M dimension
- Builds separate candidate sets for FP4 and non-FP4 configurations
- Properly carries dynamic_cluster_shape and fallback_cluster_shape through to the 6-argument CutlassGemmConfig constructor
449-467
: LGTM! Correct usage of dynamic cluster shape helper. The refactored SM100 candidate generation correctly (see the sketch after this list):
- Iterates over cluster shape variants (1x1x1, 4x1x1, 4x2x1)
- Derives appropriate fallback shapes
- Calls the new dynamic cluster shape helper with correct parameters
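A rough sketch of the candidate-generation loop these two comments describe, using simplified stand-in types (the Config struct, enum values, and the fallback pairing below are hypothetical; only the overall loop structure follows the description above):

#include <cstdio>
#include <utility>
#include <vector>

// Simplified stand-ins for the real enums/config; values are illustrative.
enum class ClusterShape { Shape_1x1x1, Shape_2x1x1, Shape_4x1x1, Shape_4x2x1 };
enum class TileShape { Tile_128x128, Tile_128x256 };

struct Config
{
    TileShape tile;
    ClusterShape dynamicCluster;  // preferred runtime cluster shape
    ClusterShape fallbackCluster; // used when the preferred shape cannot launch
};

// For each (dynamic, fallback) cluster-shape pair, emit one candidate per tile shape.
std::vector<Config> buildDynamicClusterCandidates(std::vector<TileShape> const& tiles)
{
    std::pair<ClusterShape, ClusterShape> const variants[] = {
        {ClusterShape::Shape_1x1x1, ClusterShape::Shape_1x1x1},
        {ClusterShape::Shape_4x1x1, ClusterShape::Shape_1x1x1},
        {ClusterShape::Shape_4x2x1, ClusterShape::Shape_2x1x1}, // assumed fallback pairing
    };

    std::vector<Config> candidates;
    for (auto const& [dynamicShape, fallbackShape] : variants)
    {
        for (auto tile : tiles)
        {
            candidates.push_back(Config{tile, dynamicShape, fallbackShape});
        }
    }
    return candidates;
}

int main()
{
    auto const candidates = buildDynamicClusterCandidates({TileShape::Tile_128x128, TileShape::Tile_128x256});
    std::printf("generated %zu candidate configs\n", candidates.size());
    return 0;
}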
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
723-745
: LGTM! Proper instantiation of dynamic CGA variants. The code correctly generates two launcher instances (dynamic_cga True/False) for each configuration and validates each before appending to operations.
236-250
: Document the epilogue schedule name workaround more clearly. The current approach uses string manipulation to work around missing type definitions. This should be better documented or fixed properly.
Consider adding a more explicit comment explaining why this workaround is necessary and what the long-term fix should be:
-        # Hack to WAR missing `PtrArrayTmaWarpSpecialized` type
+        # TODO: Fix the underlying issue by adding PtrArrayTmaWarpSpecialized type definition
+        # Current workaround: Remove '1Sm' suffix that causes macro instantiation issues
+        # The '1Sm' suffix indicates single-SM variant but the macro doesn't expect it
         epi_sched = epi_sched.replace(
             "1Sm", "")

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)
175-186
: Add validation for SM100 dynamic cluster shapes. The code processes dynamic_cluster_shape without validating that only SM100-supported shapes (1x1x1 or 2x1x1) are used.
Apply this validation:
bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined; - auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape); - auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{ - std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}}; - auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape); - auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{ - std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}}; + auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{}; + auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{}; + if (dynamic_cga) + { + auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape); + // SM100 dynamic CGA supports only (1|2)x1x1 + TLLM_CHECK_WITH_INFO( + (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2) && + std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1, + "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).", + (int) std::get<0>(cluster_shape), (int) std::get<1>(cluster_shape), (int) std::get<2>(cluster_shape)); + cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{ + std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}}; + auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape); + cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{ + std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}}; + }
🧹 Nitpick comments (1)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
298-310
: Consider a more descriptive name for special cases in get_tile_shape_name. The function returns "undefined" and "heuristic" for 0 and 1 respectively, but these magic values aren't self-documenting.
Consider using named constants or enum values for better readability:
+    constexpr int kUndefinedShape = 0;
+    constexpr int kHeuristicShape = 1;
+
     static_assert(std::is_enum_v<TEnum> && std::is_same_v<std::underlying_type_t<TEnum>, int>,
         "TEnum must be an enum with underlying type int");
-    if ((int) Shape_MNK == 0)
+    if ((int) Shape_MNK == kUndefinedShape)
     {
         return "undefined";
     }
-    else if ((int) Shape_MNK == 1)
+    else if ((int) Shape_MNK == kHeuristicShape)
     {
         return "heuristic";
     }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
3rdparty/cutlass
(1 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
(2 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(3 hunks)cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
(12 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
(7 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
(10 hunks)cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- 3rdparty/cutlass
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🧠 Learnings (3)
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
269-269: Line too long (393 > 120)
(E501)
349-349: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
349-349: Line too long (145 > 120)
(E501)
352-352: Line too long (122 > 120)
(E501)
366-366: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
526-526: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
651-651: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
694-694: Line too long (139 > 120)
(E501)
696-696: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
697-697: EpilogueScheduleType
may be undefined, or defined from star imports
(F405)
725-725: GemmKind
may be undefined, or defined from star imports
(F405)
741-741: DataType
may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
374-393
: LGTM! Complete handling of new cluster shapes. The updated get_cluster_shape() template now handles all new ClusterShape enum values (1x4x1, 4x2x1, 2x4x1, 4x4x1) and includes a proper fallback for undefined shapes.
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2)
2088-2089
: LGTM! Consistent dynamic cluster shape printing. The config label now correctly includes dynamic_cluster_shape for TMA warp specialized configs, maintaining consistency with the updated toString() method in CutlassGemmConfig.
2111-2119
: LGTM! Efficient filtering of dynamic cluster shapes. The filtering logic correctly reduces test iterations by testing only one dynamic cluster shape (ClusterShape_4x1x1) when defined, avoiding redundant test combinations.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
143-157
: LGTM! Dynamic CGA support added to launcher. The TrtLlm_GemmLauncher constructor now accepts and stores the dynamic_cga parameter, enabling dynamic cluster shape support for SM100.
349-354
: Validation logic correctly handles finalize fusion constraints. The check ensures that finalize fusion epilogue only uses PtrArrayTmaWarpSpecialized1Sm, which is the only supported schedule for this configuration.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3)
71-117
: Well-structured SM100 dispatch function with proper epilogue handling. The new getDispatchFunctionForSM100 helper correctly:
- Creates a 2D function map indexed by TMA epilogue and dynamic CGA
- Handles FP4 types by enforcing TMA epilogue schedule
- Returns appropriate function pointers based on configuration
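To show the shape of that dispatch without reproducing the real launcher signatures, here is a minimal sketch of a 2x2 function table indexed by the two booleans (the launcher functions and their parameter are placeholders):

#include <array>
#include <cstdio>

// Two boolean axes select one of four launch paths: TMA vs. no-smem epilogue,
// and dynamic vs. static cluster shapes.
using LaunchFn = void (*)(int numExperts);

void launchTmaStatic(int n)     { std::printf("TMA epilogue, static CGA, %d experts\n", n); }
void launchTmaDynamic(int n)    { std::printf("TMA epilogue, dynamic CGA, %d experts\n", n); }
void launchNoSmemStatic(int n)  { std::printf("no-smem epilogue, static CGA, %d experts\n", n); }
void launchNoSmemDynamic(int n) { std::printf("no-smem epilogue, dynamic CGA, %d experts\n", n); }

void dispatch(bool useTmaEpilogue, bool useDynamicCga, int numExperts)
{
    // funcMap[tma][dynamic] mirrors the 2D table pattern described above.
    std::array<std::array<LaunchFn, 2>, 2> const funcMap = {{
        {launchNoSmemStatic, launchNoSmemDynamic},
        {launchTmaStatic, launchTmaDynamic},
    }};
    funcMap[useTmaEpilogue][useDynamicCga](numExperts);
}

int main()
{
    dispatch(/*useTmaEpilogue=*/true, /*useDynamicCga=*/true, 8);
    dispatch(/*useTmaEpilogue=*/false, /*useDynamicCga=*/false, 8);
    return 0;
}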
207-212
: LGTM! Correct runtime cluster shape validation for SM100. The function properly validates at compile time that only 1x1x1 and 2x1x1 cluster shapes are supported.
158-171
: LGTM! Proper FP4/FP8 weight combination validation. The validation correctly ensures:
- WFP4AFP8 configurations use MXFPX scaling
- Other configurations don't use MXFPX scaling
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
c1c30a6 to 7c626c3
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)
31-33
: Missing #include <array> for std::array usage. getDispatchFunctionForSM100 uses std::array but this header does not include <array>. This will fail to compile on compilers that don't implicitly include it.
Apply:
#include "cute/tensor.hpp" +#include <array>
♻️ Duplicate comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
22-23
: Resolved: missing type-traits/tuple includes added. The explicit inclusion of <tuple> and <type_traits> addresses prior build fragility when using enum_to_shape_tuple and related static_asserts.
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1)
30-31
: Default DYNAMIC_CGA to false to avoid widespread template churn. Providing a default preserves source compatibility at call sites, reduces boilerplate, and likely fixes some CI breakages from missed updates.
Apply:
-template <class ArchTag, class TileShape, class ClusterShape, bool DYNAMIC_CGA, class ActivationType>
+template <class ArchTag, class TileShape, class ClusterShape, bool DYNAMIC_CGA = false, class ActivationType>
 struct should_filter_tma_warp_specialized_gemm_problem_shape
 {
@@
-template <class ArchTag, class TileShape, class ClusterShape, bool DYNAMIC_CGA, class ActivationType>
+template <class ArchTag, class TileShape, class ClusterShape, bool DYNAMIC_CGA = false, class ActivationType>
 constexpr static bool should_filter_tma_warp_specialized_gemm_problem_shape_v
     = should_filter_tma_warp_specialized_gemm_problem_shape<ArchTag, TileShape, ClusterShape, DYNAMIC_CGA,
         ActivationType>::value;

Also applies to: 43-46
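A small standalone illustration of why a defaulted template parameter keeps pre-existing call sites compiling (generic names; unrelated to the real filter's logic):

#include <iostream>

// A trait that grew a new boolean parameter. Defaulting it means call sites
// written before the parameter existed still compile unchanged.
template <class ArchTag, bool DynamicCga = false>
struct ShouldFilterProblemShape
{
    static constexpr bool value = DynamicCga; // placeholder logic
};

struct Sm90 {};

int main()
{
    std::cout << ShouldFilterProblemShape<Sm90>::value << "\n";       // old-style call site
    std::cout << ShouldFilterProblemShape<Sm90, true>::value << "\n"; // explicit opt-in
    return 0;
}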
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
724-746
: Resolved: dynamic_cga variants are now validated and appended per-iteration. The validity check and append are correctly inside the inner loop, so both dynamic and static variants are handled.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
158-171
: Enforce correct scaling mode and guard FINALIZE fusion's required inputs. The WFP4AFP8/MXFPX checks are good. Add a guard for FINALIZE fusion to ensure the required router scales pointer is set (CUTLASS doesn't null-check it).
Apply:
     else
     {
         TLLM_CHECK_WITH_INFO(
             hopper_input.fpX_block_scaling_type != TmaWarpSpecializedGroupedGemmInput::FpXBlockScalingType::MXFPX,
             "MXFPX is not supported for the selected weight combination");
     }
+
+    // FINALIZE fusion requires valid router scales.
+    if constexpr (FUSION == EpilogueFusion::FINALIZE)
+    {
+        TLLM_CHECK_WITH_INFO(
+            hopper_input.ptr_router_scales != nullptr,
+            "ptr_router_scales must be non-null when using FINALIZE fusion epilogue");
+    }
173-181
: Avoid building runtime cluster shapes when dynamic CGA is disabled; validate allowed shapes. Currently, you compute cluster shapes even when dynamic CGA is off; Undefined decodes to 0x0x0 and gets passed downstream. Only build and pass shapes when dynamic CGA is enabled, and validate (1|2)x1x1 for SM100.
Apply:
- bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined; - auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape); - auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{ - std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}}; - auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape); - auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{ - std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}}; + bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined; + auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{}; + auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{}; + if (dynamic_cga) + { + auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape); + auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape); + // SM100 dynamic CGA supports only (1|2)x1x1. + TLLM_CHECK_WITH_INFO( + (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2) + && std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1, + "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).", + (int) std::get<0>(cluster_shape), (int) std::get<1>(cluster_shape), (int) std::get<2>(cluster_shape)); + cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{ + std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}}; + cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{ + std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}}; + }
🧹 Nitpick comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
291-343
: Simplify get_cluster_shape with enum_to_shape_tuple (remove manual case cascade). You already generalized TileShape via enum_to_shape_tuple; do the same for ClusterShape to eliminate the long if constexpr chain and ensure all future shapes are covered automatically.
Apply:
-constexpr auto get_cluster_shape() -{ - using namespace cute; - if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x1x1) - { - return cute::Shape<_1, _1, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x1x1) - { - return cute::Shape<_2, _1, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x2x1) - { - return cute::Shape<_1, _2, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x2x1) - { - return cute::Shape<_2, _2, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x1x1) - { - return cute::Shape<_4, _1, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x8x1) - { - return cute::Shape<_1, _8, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_8x1x1) - { - return cute::Shape<_8, _1, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x4x1) - { - return cute::Shape<_1, _4, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x2x1) - { - return cute::Shape<_4, _2, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x4x1) - { - return cute::Shape<_2, _4, _1>{}; - } - else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x4x1) - { - return cute::Shape<_4, _4, _1>{}; - } - else - { - return cute::Shape<_0, _0, _0>{}; - } -} +constexpr auto get_cluster_shape() +{ + using namespace cute; + static_assert(Shape_MNK != ClusterShape::Undefined, "ClusterShape is undefined"); + constexpr auto t = enum_to_shape_tuple(Shape_MNK); + return cute::Shape<cute::Int<std::get<0>(t)>, cute::Int<std::get<1>(t)>, cute::Int<std::get<2>(t)>>{}; +}
456-489
: Improve readability: print schedule names instead of ints. toString() currently prints raw enum ints for mainloop/epilogue. Consider mapping to get_mainloop_schedule_name() and a similar helper for EpilogueScheduleType for more informative logs.
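A sketch of what such a name helper could look like, with made-up enumerator names (the real EpilogueScheduleType values are not reproduced here):

#include <iostream>
#include <string>

// Illustrative stand-in for an epilogue schedule enum.
enum class EpilogueSchedule { TMA, NO_SMEM, AUTO };

// Map each enumerator to a readable name so logs print names instead of raw ints.
std::string getEpilogueScheduleName(EpilogueSchedule schedule)
{
    std::string name = "unknown";
    switch (schedule)
    {
    case EpilogueSchedule::TMA:
    {
        name = "tma";
        break;
    }
    case EpilogueSchedule::NO_SMEM:
    {
        name = "no_smem";
        break;
    }
    case EpilogueSchedule::AUTO:
    {
        name = "auto";
        break;
    }
    }
    return name;
}

int main()
{
    std::cout << "epilogue_schedule=" << getEpilogueScheduleName(EpilogueSchedule::TMA) << "\n";
    return 0;
}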
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
370-437
: Add explicit include to avoid relying on transitive headers. This TU calls std::get on tuples returned by enum_to_shape_tuple; include <tuple> explicitly for robustness.
Proposed addition near other standard includes:
 #include <set>
+#include <tuple>
 #include <vector>
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
6-6
: Avoid star import to satisfy Ruff F405 and improve readability. from cutlass_library import * triggers F405 and makes symbol origins unclear. Prefer a qualified import.
Suggested approach:
- Replace the star import with a qualified import, e.g., import cutlass_library as cl.
- Refer to symbols as cl.EpilogueScheduleType, cl.DataType, cl.GemmKind, cl.DataTypeNames, cl.DataTypeSize, etc.
If refactoring now is too invasive, consider suppressing lint for this file or strategically importing only the used symbols.
269-269
: Long lines exceed 120 columns; consider wrapping or lint suppression. Multiple lines exceed Ruff's 120-char limit (E501), mostly macro strings and long conditionals. Either wrap them or add targeted noqa comments to keep CI green.
Also applies to: 349-349, 352-352, 366-366, 526-526, 651-651, 694-697, 725-725, 741-741
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
3rdparty/cutlass
(1 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
(2 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(3 hunks)cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
(10 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
(7 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
(10 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- 3rdparty/cutlass
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (2)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
TileShape (213-490)
ClusterShape (262-454)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/common.h (1)
ActivationType (23-33)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
EpilogueScheduleType (204-517)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
269-269: Line too long (393 > 120) (E501)
349-349: EpilogueScheduleType may be undefined, or defined from star imports (F405)
349-349: Line too long (145 > 120) (E501)
352-352: Line too long (122 > 120) (E501)
366-366: EpilogueScheduleType may be undefined, or defined from star imports (F405)
526-526: EpilogueScheduleType may be undefined, or defined from star imports (F405)
651-651: EpilogueScheduleType may be undefined, or defined from star imports (F405)
694-694: Line too long (139 > 120) (E501)
696-696: EpilogueScheduleType may be undefined, or defined from star imports (F405)
697-697: EpilogueScheduleType may be undefined, or defined from star imports (F405)
725-725: GemmKind may be undefined, or defined from star imports (F405)
741-741: DataType may be undefined, or defined from star imports (F405)
🔇 Additional comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
85-101: Shape encoding helpers look correct and safe
The encoding/decoding utilities are constexpr, validate ranges (< 1000), and avoid UB. Good foundation for uniform enum value management across SM families.
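To make the packing idea concrete, here is a minimal standalone sketch. The names, the decimal packing, and the error value are illustrative assumptions, not the actual gemm_configs.h implementation; it only shows why the < 1000 range check keeps the encoding unambiguous and constexpr-friendly.

    // Sketch only: each dimension gets three decimal digits in the packed value.
    #include <cstdint>
    #include <tuple>

    constexpr int shape_tuple_to_enum_value(int m, int n, int k)
    {
        // Reject anything that would overflow its three-digit slot.
        return (m < 1000 && n < 1000 && k < 1000) ? (m * 1000 * 1000 + n * 1000 + k) : -1;
    }

    constexpr std::tuple<int, int, int> enum_value_to_shape_tuple(int v)
    {
        return {v / (1000 * 1000), (v / 1000) % 1000, v % 1000};
    }

    static_assert(shape_tuple_to_enum_value(128, 256, 64) == 128256064);
    static_assert(std::get<1>(enum_value_to_shape_tuple(128256064)) == 256);

Because both directions are constexpr, the encoded values can be used directly as enum initializers and round-tripped at compile time.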
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1)
37-41: FAST_BUILD filter logic is concise and correct
The filter disables unsupported shapes or dynamic CGA in fast builds. Looks good.
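For readers who have not opened cutlass_heuristic.h, a hypothetical sketch of what such a candidate filter can look like; the field names and the exact predicate are assumptions, only the erase-remove shape of the filter is the point.

    #include <algorithm>
    #include <vector>

    struct CandidateConfig
    {
        bool shape_supported;   // tile/cluster shape instantiated in this build
        bool uses_dynamic_cga;  // dynamic cluster shape requested
    };

    // Drop candidates that cannot run, and additionally drop dynamic-CGA ones in fast builds.
    inline std::vector<CandidateConfig> filter_for_fast_build(std::vector<CandidateConfig> configs, bool fast_build)
    {
        configs.erase(std::remove_if(configs.begin(), configs.end(),
                          [&](CandidateConfig const& c)
                          { return !c.shape_supported || (fast_build && c.uses_dynamic_cga); }),
            configs.end());
        return configs;
    }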
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
449-467: Good: fallback_cluster_shape is now used correctly
The second argument to get_candidate_configs_sm100_dynamic_cluster_shape passes the computed fallback shape, fixing the earlier logic error.
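A toy illustration of the dynamic/fallback pairing this fix restores. The real get_candidate_configs_sm100_dynamic_cluster_shape signature and the real ClusterShape enum differ; everything below is an assumed, simplified stand-in.

    #include <vector>

    enum class ClusterShape { Shape_1x1x1, Shape_4x1x1, Shape_4x2x1 };

    struct Candidate
    {
        ClusterShape dynamic_cluster_shape;
        ClusterShape fallback_cluster_shape;
    };

    // The fallback is what the kernel drops to when the preferred dynamic CGA cannot be formed,
    // so the caller must pass the computed fallback, not the dynamic shape a second time.
    inline std::vector<Candidate> make_dynamic_cluster_candidates(ClusterShape dynamic_shape, ClusterShape fallback_shape)
    {
        return {Candidate{dynamic_shape, fallback_shape}};
    }

    // Usage mirroring the corrected call site:
    // auto candidates = make_dynamic_cluster_candidates(ClusterShape::Shape_4x2x1, ClusterShape::Shape_1x1x1);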
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
71-117: Dispatch table for SM100 is clean and extensible
The 2D function map over epilogue schedule and dynamic_cga makes the runtime selection straightforward. Once <array> is included, this pattern is solid.
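A minimal sketch of the 2D dispatch-table idea, assuming made-up launcher names; the real table in moe_gemm_template_dispatch_tma_ws.h indexes concrete kernel launcher instantiations rather than these stubs.

    #include <array>
    #include <cstdio>

    enum class EpilogueSchedule : int { TMA = 0, NO_SMEM = 1 };

    using LaunchFn = void (*)();

    void launch_tma_static()     { std::puts("TMA, static cluster"); }
    void launch_tma_dynamic()    { std::puts("TMA, dynamic cluster"); }
    void launch_nosmem_static()  { std::puts("NO_SMEM, static cluster"); }
    void launch_nosmem_dynamic() { std::puts("NO_SMEM, dynamic cluster"); }

    // dispatch_table[epilogue][dynamic_cga]
    constexpr std::array<std::array<LaunchFn, 2>, 2> dispatch_table{
        {{launch_tma_static, launch_tma_dynamic}, {launch_nosmem_static, launch_nosmem_dynamic}}};

    inline void dispatch(EpilogueSchedule epi, bool dynamic_cga)
    {
        dispatch_table[static_cast<int>(epi)][dynamic_cga ? 1 : 0]();
    }

Adding a new epilogue schedule or another boolean axis then only grows the table, not the selection logic.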
336-370: Error messages improved (config.toString), switch pruning is fine
The reporting of unsupported shapes includes the full config string, which will help debugging tactic selection. Looks good.
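A hypothetical example of the reporting style being praised here, with an invented GemmConfig stub; the actual code uses the project's own config type and error macros.

    #include <sstream>
    #include <stdexcept>
    #include <string>

    struct GemmConfig
    {
        std::string toString() const { return "tile=128x256x64, cluster=4x2x1, epilogue=TMA"; }
    };

    // Include the full config string so a failed tactic can be identified from the log alone.
    inline void reject_unsupported(GemmConfig const& config)
    {
        std::ostringstream msg;
        msg << "Unsupported tile/cluster shape for SM100 MOE GEMM. Config: " << config.toString();
        throw std::invalid_argument(msg.str());
    }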
Force-pushed from 7c626c3 to 78507c5 (compare)
/bot run |
PR_Github #17144 [ run ] triggered by Bot |
PR_Github #17144 [ run ] completed with state |
Force-pushed from e0ab0ce to 94376a3 (compare)
/bot run |
PR_Github #17255 [ run ] triggered by Bot |
PR_Github #17255 [ run ] completed with state |
…schedules
Signed-off-by: Daniel Stokes <[email protected]>
fix: Fix compilation errors for no smem epilogue
Signed-off-by: djns99 <[email protected]>
fix: Address linker issues
Signed-off-by: djns99 <[email protected]>
fix: Correct list of supported tactics
Signed-off-by: djns99 <[email protected]>
fix: Add pretty printing for tile shapes
Signed-off-by: djns99 <[email protected]>
Fix code rabbit comments
Signed-off-by: djns99 <[email protected]>
Fix benchmark tactic logic for single gemm profiling
Signed-off-by: djns99 <[email protected]>
Cleanup gemm_configs.h changes
Signed-off-by: djns99 <[email protected]>
Improve tactic reporting in MOE backend benchmarks
Signed-off-by: djns99 <[email protected]>
Reduce instantiated configs
Signed-off-by: djns99 <[email protected]>
Skip some configs to reduce test runtime
Signed-off-by: djns99 <[email protected]>
Fix code rabbit comments
Signed-off-by: djns99 <[email protected]>
Fix rebase issues in launcher logic
Signed-off-by: djns99 <[email protected]>
Address feedback
Signed-off-by: djns99 <[email protected]>
Signed-off-by: djns99 <[email protected]>
…d epilogue schedule Signed-off-by: djns99 <[email protected]>
Signed-off-by: djns99 <[email protected]>
Signed-off-by: djns99 <[email protected]>
Force-pushed from f28c8be to e4f486a (compare)
/bot run --disable-fail-fast |
PR_Github #17295 [ run ] triggered by Bot |
PR_Github #17295 [ run ] completed with state |
/bot run |
PR_Github #17414 [ run ] triggered by Bot |
PR_Github #17414 [ run ] completed with state |
@coderabbitai review |
✅ Actions performed
Review triggered.
Actionable comments posted: 0
♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1)
1-2: Add (or reuse) the version.txt guard in CI to prevent drift
Same ask as earlier threads; consolidating for both arches to catch stale hashes or wrong uploads.
#!/bin/bash
set -euo pipefail

A64="cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt"
X64="cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt"

check_ver() {
  local ver="$1"
  local dir
  dir="$(dirname "$ver")"
  local HASH FILE COMMIT
  HASH="$(awk 'NR==1{print $1}' "$ver")"
  FILE="$(awk 'NR==1{print $2}' "$ver")"
  COMMIT="$(awk 'NR==2{print $2}' "$ver")"
  [[ "$HASH" =~ ^[0-9a-f]{64}$ ]] || { echo "Invalid SHA-256 in $ver: $HASH"; exit 1; }
  [[ "$COMMIT" =~ ^[0-9a-f]{40}$ ]] || { echo "Invalid commit in $ver: $COMMIT"; exit 1; }
  if [[ -f "${dir}/${FILE}" ]]; then
    ACTUAL="$(sha256sum "${dir}/${FILE}" | awk '{print $1}')"
    [[ "$ACTUAL" == "$HASH" ]] || { echo "Checksum mismatch for ${FILE} in ${dir}"; exit 1; }
    echo "Checksum OK for ${FILE}"
  else
    echo "Artifact ${dir}/${FILE} not found; skipping checksum."
  fi
}

check_ver "$A64"
check_ver "$X64"

# Ensure both arches point to the same internal commit
C1="$(sed -n '2p' "$A64")"
C2="$(sed -n '2p' "$X64")"
[[ "$C1" == "$C2" ]] || { echo "Commit mismatch between arches: ${C1} vs ${C2}"; exit 1; }
echo "version.txt checks passed for both arches."
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
301-327: Make GEMM profiling warning visible and fix optional comparison
Two issues:
- Warning is gated at ERROR; users won’t see it.
- Comparing std::optional to int is non-portable pre-C++23 and may not compile.
Apply:
-    static bool printed = false;
-    if (routing_config != UNIFORM_ROUTING_CONFIG && LOG_LEVEL >= ERROR && !printed)
+    static bool printed = false;
+    if (routing_config.has_value() && *routing_config != UNIFORM_ROUTING_CONFIG
+        && LOG_LEVEL >= INFO && !printed)
     {
         std::cerr << "Warning: Profiling a specific GEMM will always use uniform random token distribution"
                   << std::endl;
         printed = true;
     }
     routing_config = UNIFORM_ROUTING_CONFIG;

This is the same class of issue flagged previously; reapplying here.
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
370-437
: Add missing<tuple>
include in cutlass_heuristic.cpp
At the top ofcpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
add#include <tuple>This file calls
std::get
on the result ofenum_to_shape_tuple
, so it must explicitly include<tuple>
rather than relying on transitive includes.cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
241-241
: Reassign the replace() result or drop the dead linePython strings are immutable; this call is a no-op and
kernel_sched
isn’t used later in this branch. Either reassign or delete to avoid confusion.Apply one of:
- kernel_sched.replace("::Kernel", "::KernelGrouped") + kernel_sched = kernel_sched.replace("::Kernel", "::KernelGrouped")or
- kernel_sched.replace("::Kernel", "::KernelGrouped")
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
71-110: Missing include for std::array
This block uses std::array; add the header to avoid relying on transitive includes.
#include "cute/tensor.hpp" +#include <array>
166-176: Guard dynamic-cluster shape conversion; avoid Undefined → bogus shapes
Only compute and pass cluster shapes when dynamic_cga is enabled; also keep fallback guarded. This prevents constructing (0,0,0) from Undefined.
- bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined; - bool const swap_ab = hopper_input.swap_ab; - auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape); - auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{ - std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}}; - auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape); - auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{ - std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}}; + bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined; + bool const swap_ab = hopper_input.swap_ab; + auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{}; + auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{}; + if (dynamic_cga) + { + auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape); + cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{ + std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}}; + auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape); + cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{ + std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}}; + }
🧹 Nitpick comments (6)
tests/unittest/_torch/thop/parallel/test_moe.py (3)
822-824: Preserve coverage while avoiding timeouts: gate the 4096-token case behind an env var
Nice trim for CI stability. To still exercise the largest path in extended runs, gate 4096 behind an opt-in flag.
Apply:
- @pytest.mark.parametrize("num_tokens", [16, 64, 4096]) + @pytest.mark.parametrize("num_tokens", [16, 64] + ([4096] if HEAVY else []))Add once near the imports (os is already imported):
HEAVY = os.environ.get("TRTLLM_HEAVY_TESTS") == "1"I can push the minimal patch across both FP8/FP4 suites if you’d like.
823-824
: Expert-info set reduction looks fine; keep one n_groups=1/top_k<8 light configDropping (32, 1, 1, 5) removes a quick n_groups=1/top_k<8 case. Consider re-adding that single light combo to retain a distinct routing shape without noticeable runtime impact.
941-941
: Same gating for FP4: make 4096-token run opt-inMirror the FP8 change so day-to-day runs stay fast.
Apply:
- @pytest.mark.parametrize("num_tokens", [1, 4096]) + @pytest.mark.parametrize("num_tokens", [1] + ([4096] if HEAVY else []))cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
170-172
: Avoid dumping entire JSON to stdout at INFOLarge files spam logs and slow CI. Print filename and a short preview instead.
Apply:
- if (LOG_LEVEL >= INFO) - std::cout << "Loaded benchmark file: " << file_contents << std::endl; + if (LOG_LEVEL >= INFO) + { + auto preview_len = std::min<size_t>(file_contents.size(), 512); + std::cout << "Loaded benchmark file: " << workloadFile << " (" << file_contents.size() + << " bytes) preview: " << file_contents.substr(0, preview_len) << (file_contents.size() > preview_len ? "..." : "") + << std::endl; + }cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
449-468
: Candidate explosion control: keep 4x2x1 but consider dedup or tieringThe added sweep across two epilogues and multiple cluster shapes is comprehensive; it can noticeably lengthen tactic profiling. If profiling time regresses in CI, consider:
- de-duplicating identical configs produced via different dynamic/fallback pairs, or
- gating the non-1x1x1 cluster shapes behind a FAST_BUILD-like flag.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
508-516
: Label nit: avoid “tile_config_sm90_enum” for all SMsThe stream label is SM90-specific but this path prints for SM90/100/120. Use a neutral label.
- out << "tile_config_sm90_enum: " << config.getTileConfigAsInt() + out << "tile_config_enum: " << config.getTileConfigAsInt()
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (2)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
📒 Files selected for processing (14)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2 hunks)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3 hunks)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (10 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (7 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (12 hunks)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1 hunks)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1 hunks)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2 hunks)
tests/unittest/_torch/thop/parallel/test_moe.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-21T02:41:10.565Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h:141-145
Timestamp: 2025-08-21T02:41:10.565Z
Learning: In TensorRT-LLM MOE GEMM kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h), the stride_act and stride_weight pointers in TmaWarpSpecializedGroupedGemmInput are intentionally declared as void* rather than typed pointers because the actual stride type is determined at runtime based on factors like the swap_ab flag and layout decisions. This runtime type determination makes compile-time type safety impossible, so void* is the correct approach.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-09-01T18:06:21.617Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-01T18:06:21.617Z
Learning: Applies to **/*.{h,hpp,hh,hxx} : Use include guards in headers with symbol format TRTLLM_<FILENAME>_H (no underscores prefix/suffix; filename only)
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🪛 LanguageTool
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 33f251e0599197ad3e6c59d64a42f9721...
(QB_NEW_EN)
[grammar] ~2-~2: There might be a mistake here.
Context: ...33f251e0599197ad3e6c59d64a42f9721d3cc27c
(QB_NEW_EN)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt
[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 33f251e0599197ad3e6c59d64a42f9721...
(QB_NEW_EN)
[grammar] ~2-~2: There might be a mistake here.
Context: ...33f251e0599197ad3e6c59d64a42f9721d3cc27c
(QB_NEW_EN)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
212-212: EpilogueScheduleTag may be undefined, or defined from star imports (F405)
355-355: EpilogueScheduleType may be undefined, or defined from star imports (F405)
372-372: EpilogueScheduleType may be undefined, or defined from star imports (F405)
419-419: EpilogueScheduleType may be undefined, or defined from star imports (F405)
732-732: EpilogueScheduleType may be undefined, or defined from star imports (F405)
733-733: EpilogueScheduleType may be undefined, or defined from star imports (F405)
🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1)
1-2: Format and values look good
SHA-256 (64 hex) and commit (40 hex) formats are valid; commit aligns with the x86_64 entry in this PR. Not flagging any API-drift concerns here per your prior note that internal_cutlass_kernels intentionally diverges.
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1)
1-2: Consistent update across arches
Hash/filename line and commit line conform to the expected format; the commit matches the AArch64 entry.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
163-166: Good: fail fast when the workload file can't be opened
An early, explicit invalid_argument is the right behavior here.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
353-361: SM100 validity gating looks correct
CGA limited to (1|2)x1x1 and finalize-fusion+non-TMA excluded. Matches dispatch constraints.
Also applies to: 372-373
730-746: Good: enumerate SM100 epilogues and dynamic_cga, feed into partial_args
Explicitly iterating epi_schedules and dynamic_cga improves coverage and keeps validation consistent with the SM100 path.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)
202-207: Tile-shape predicates for SM100 read well
Constraining ClusterShape to (1|2)x1x1 at compile time aligns with the generator's runtime gating.
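A hedged sketch of a compile-time predicate in the spirit of those checks; the trait names and the template shape are assumptions, only the idea of rejecting unsupported clusters via static_assert is the point.

    template <int M, int N, int K>
    struct ClusterShapeT
    {
        static constexpr int kM = M, kN = N, kK = K;
    };

    // Only 1x1x1 and 2x1x1 clusters pass; anything else fails at compile time.
    template <typename Cluster>
    constexpr bool is_supported_sm100_cluster()
    {
        return (Cluster::kM == 1 || Cluster::kM == 2) && Cluster::kN == 1 && Cluster::kK == 1;
    }

    static_assert(is_supported_sm100_cluster<ClusterShapeT<1, 1, 1>>());
    static_assert(is_supported_sm100_cluster<ClusterShapeT<2, 1, 1>>());
    static_assert(!is_supported_sm100_cluster<ClusterShapeT<4, 2, 1>>());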
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
85-101
: Nice: constexpr encode/decode of shapes
shape_tuple_to_enum
andenum_to_shape_tuple
are clear and guarded; the <1000 constraint avoids overflow.
451-463
: Readable tile-config names across SMsSwitching to get_tile_shape_name() fixes the “0x0x1” issue for heuristic/undefined.
LGTM for the tuning part. It is very clean and looks like there are no direct changes associated with the moe tunable ops' API. Thanks a lot for the work.
We can then profile the warm-up time increment and see if it is significant after the PR is merged.
/bot reuse-pipeline |
PR_Github #17439 [ reuse-pipeline ] triggered by Bot |
PR_Github #17439 [ reuse-pipeline ] completed with state |
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Tests
Tooling
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.