
Conversation

@djns99 djns99 (Collaborator) commented Jul 17, 2025

Summary by CodeRabbit

  • New Features

    • Dynamic cluster-shape (CGA) runtime support and dispatch for SM100 MoE GEMM; AB-swap variants and new epilogue schedules.
  • Improvements

    • Encoded/int-backed tile & cluster shape enums with human-readable names; expanded tile/cluster coverage and heuristic refinements.
    • Benchmarks now expose per-tactic metadata counters and streamlined warnings.
  • Bug Fixes

    • Added runtime validation when loading benchmark workload files.
  • Tests

    • Reduced/adjusted MoE autotune test matrices and config-sweep filters.
  • Tooling

    • Python kernel generator and launcher APIs accept dynamic_cga and swap_ab options.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

coderabbitai bot (Contributor) commented Jul 17, 2025

📝 Walkthrough

Introduces dynamic cluster-shape (CGA) and AB-swap support across SM100 MoE GEMM: encodes tile/cluster shapes as ints, extends CutlassGemmConfig with dynamic/fallback cluster shapes and swap_ab, adds SM100 candidate generation for dynamic clusters, updates dispatch/launcher signatures and runtime dispatch, threads dynamic_cga/swap_ab through Python kernel generation, and adjusts benchmarks/tests.
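
The int-backed shape encoding mentioned above is a simple base-1000 packing (see the shape_tuple_to_enum discussion in the review comments below). A minimal, self-contained sketch of the round trip, assuming each shape component stays below 1000:

// Minimal sketch of the base-1000 shape packing used by the new int-backed enums.
// Assumes each of m, n, k is in [0, 1000); e.g. 128x256x64 encodes to 128256064.
#include <cassert>
#include <tuple>

constexpr int shape_tuple_to_enum(int m, int n, int k)
{
    return m * 1000000 + n * 1000 + k;
}

constexpr std::tuple<int, int, int> enum_to_shape_tuple(int shape_id)
{
    return {shape_id / 1000000, (shape_id % 1000000) / 1000, shape_id % 1000};
}

int main()
{
    static_assert(shape_tuple_to_enum(128, 256, 64) == 128256064);
    auto [m, n, k] = enum_to_shape_tuple(128256064);
    assert(m == 128 && n == 256 && k == 64);
    return 0;
}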

Changes

  • Benchmarks and Fixtures (cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu, cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h):
    Add workload file open validation and adjust GEMM profiling gating; prefetch tactics and add per-tactic counters including dynamic/fallback cluster shapes and epilogue schedule; reuse fetched tactics in logs.
  • Cutlass GEMM Config/Enums (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h):
    Add shape encode/decode utilities; switch enums to int-backed values encoded via shape_tuple_to_enum; extend EpilogueScheduleType; add CutlassGemmConfig fields dynamic_cluster_shape, fallback_cluster_shape, swap_ab; add name/getters, toString and streaming output.
  • SM100 Heuristics (cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp, .../cutlass_heuristic.h):
    Add a get_candidate_configs_sm100_dynamic_cluster_shape helper; rework SM100 candidate generation to produce configs with dynamic/fallback cluster shapes across epilogue schedules; add a DYNAMIC_CGA template parameter to the problem-shape filter and propagate it.
  • MoE GEMM Dispatch/Launch (TMA WS) (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h, .../moe_gemm_template_dispatch_tma_ws.h, .../moe_gemm_template_dispatch.h):
    Reorder/extend launcher template params and runtime args to accept EpilogueSchedule, DYNAMIC_CGA, SwapAB and runtime dynamic/fallback cluster shapes; add SM100 runtime dispatch selection keyed by (epilogue_schedule, dynamic_cga, swap_ab); route cluster shapes into the launcher; add AB-swap config duplication and gating; enforce FINALIZE fusion only with TMA.
  • Python Kernel Generation (cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py):
    Thread the new dynamic_cga (a new TrtLlm_GemmLauncher attribute) and swap_ab through kernel generation, update macro instantiation ordering/args, expand SM100 epilogue/CGA options, and refine validity checks.
  • SM90 Launcher Filter Update (cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl):
    Pass an explicit false for DYNAMIC_CGA to the compile-time problem-shape filter; no runtime behavior change.
  • Unit Tests (cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu):
    Update the printed label to include dynamic_cluster_shape; restrict config sweep iterations for dynamic_cluster_shape testing.
  • Internal Kernel Versions (cpp/tensorrt_llm/kernels/internal_cutlass_kernels/*/version.txt):
    Update artifact hashes and commit refs for internal cutlass kernel static libs (metadata only).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Gen as CandidateGen (SM100)
  participant Disp as Dispatch (SM100)
  participant L as Launcher (TMA/NoSmem)
  participant K as Kernel

  Note over Gen: Build CutlassGemmConfig list<br/>for epilogue ∈ {TMA, NO_SMEM}<br/>cluster ∈ {1x1x1,4x1x1,4x2x1}<br/>include dynamic & fallback shapes
  Gen->>Disp: Candidate configs (tile, epilogue, cluster, dynamic, fallback, swap_ab)

  alt dynamic_cga enabled
    Disp->>L: Select by (epilogue_schedule, dynamic_cga, swap_ab)
    L->>K: Launch with dynamic_cluster_shape + fallback_cluster_shape
  else dynamic_cga disabled
    Disp->>L: Select static launcher
    L->>K: Launch with static cluster_shape
  end

  Note over K: FINALIZE fusion allowed only with TMA<br/>AB-swap variants applied as configured
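As a rough illustration of the selection step in the diagram above, here is a standalone sketch; the selector, launcher names, and signatures are hypothetical stand-ins, not the actual TensorRT-LLM API:

// Illustrative only: pick a launcher from the (epilogue_schedule, dynamic_cga, swap_ab)
// tuple, mirroring the dispatch flow in the diagram. All names below are invented.
#include <cstdio>

enum class EpilogueSchedule { TMA, NO_SMEM };

using LauncherFn = void (*)();

void launchTmaStatic() { std::puts("tma, static cluster"); }
void launchTmaDynamic() { std::puts("tma, dynamic cluster"); }
void launchNoSmemStatic() { std::puts("no-smem, static cluster"); }
void launchNoSmemDynamic() { std::puts("no-smem, dynamic cluster"); }

LauncherFn selectLauncher(EpilogueSchedule sched, bool dynamic_cga, bool swap_ab)
{
    (void) swap_ab; // swap_ab would select an AB-swapped instantiation in the real code
    if (sched == EpilogueSchedule::TMA)
    {
        return dynamic_cga ? launchTmaDynamic : launchTmaStatic;
    }
    return dynamic_cga ? launchNoSmemDynamic : launchNoSmemStatic;
}

int main()
{
    selectLauncher(EpilogueSchedule::TMA, /*dynamic_cga=*/true, /*swap_ab=*/false)();
    return 0;
}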

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

SW Architecture

Suggested reviewers

  • zongfeijing
  • hyukn
  • litaotju
  • liji-nv


@djns99 djns99 (Collaborator, Author) commented Jul 17, 2025

Also includes changes for #4111

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 7

🔭 Outside diff range comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)

83-85: FINALIZE fusion must provide router_scales (avoid nullptr in fused epilogue)

Per prior incident (#3294), FINALIZE epilogue requires non-null router_scales. Add a guard here to fail fast.

     TLLM_CHECK_WITH_INFO(
         workspace_size || hopper_input.isValid(), "Hopper specialisation is missing additional input information");
+
+    if constexpr (FUSION == EpilogueFusion::FINALIZE)
+    {
+        TLLM_CHECK_WITH_INFO(
+            hopper_input.router_scales != nullptr,
+            "router_scales must be provided when using FINALIZE epilogue fusion");
+    }

1-15: Header guard required for .h files (and update copyright year)

  • Add a preprocessor guard with prefix TRTLLM_ per guidelines.
  • Include the current year in the header.

Additional changes outside the selected range:

 /* 
- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
  *

At the very top, just after the license:

+#ifndef TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
+#define TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H

At the very end of the file, add:

+#endif // TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
🧹 Nitpick comments (15)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (4)

84-97: Tighten shape encoding helpers; remove unnecessary static and guard valid ranges

  • constexpr functions at namespace scope are implicitly inline; static is unnecessary and goes against the guideline to prefer anonymous namespaces over static for internal linkage.
  • Consider guarding against out-of-range inputs to avoid accidental collisions.

Suggested refactor:

-constexpr static int shape_tuple_to_enum(int m, int n, int k)
+constexpr int shape_tuple_to_enum(int m, int n, int k)
 {
-    return m * 1000000 + n * 1000 + k;
+    // m in [1..999], n in [1..999], k in [1..999]
+    // Encodes as m*1'000'000 + n*1'000 + k
+    return m * 1000000 + n * 1000 + k;
 }
 
-template <typename TEnum>
-constexpr static std::tuple<int, int, int> enum_to_shape_tuple(TEnum shape_id_enum)
+template <typename TEnum>
+constexpr std::tuple<int, int, int> enum_to_shape_tuple(TEnum shape_id_enum)
 {
     static_assert(std::is_enum_v<TEnum> && std::is_same_v<std::underlying_type_t<TEnum>, int>,
         "TEnum must be an enum with underlying type int");
     auto shape_id = static_cast<int>(shape_id_enum);
     return std::make_tuple(shape_id / 1000000, (shape_id % 1000000) / 1000, shape_id % 1000);
 }

Optionally, add a compile-time/documented constraint that shape components must be < 1000 to avoid collisions.


525-539: toString extended fields are useful; consider printing decoded shapes for readability

The added dynamic_cluster_shape and fallback_cluster_shape IDs are valuable for debugging. For readability, consider also printing their decoded tuples via enum_to_shape_tuple when not Undefined. No blocker.
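
For example, a minimal standalone sketch of the idea (the helper name is illustrative; 4002001 is the base-1000 encoding of 4x2x1):

// Illustrative sketch: render an int-backed shape id as "MxNxK" instead of the raw id.
#include <iostream>
#include <sstream>
#include <string>

std::string shapeIdToString(int shape_id)
{
    int const m = shape_id / 1000000;
    int const n = (shape_id % 1000000) / 1000;
    int const k = shape_id % 1000;
    std::ostringstream ss;
    ss << m << "x" << n << "x" << k;
    return ss.str();
}

int main()
{
    std::cout << "dynamic_cluster_shape: " << shapeIdToString(4002001) << "\n"; // prints 4x2x1
    return 0;
}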


561-572: Operator<< label “tile_config_sm90_enum” is misleading for SM100/SM120

This label is hard-coded to “sm90” while getTileConfigAsInt() may reflect SM100/SM120. Consider a neutral label (e.g., tile_config_enum) or include the actual sm_version in the label to avoid confusion in logs.

-        out << "tile_config_sm90_enum: " << config.getTileConfigAsInt()
+        out << "tile_config_enum: " << config.getTileConfigAsInt()

1-15: Update copyright year

Per guidelines, OSS source headers should include the current year. This file still says 2020-2023.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

1-15: Header hygiene: guard macro and year

  • The guidelines prefer preprocessor header guards over #pragma once for *.h. Consider adding a guard like #ifndef TRTLLM_MOE_GEMM_TMA_WS_LAUNCHER_H.
  • Update the copyright year.

I can provide a patch if you want it applied across the cutlass_kernels headers for consistency.

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (4)

245-251: Epilogue token sanitization looks fine; keep the hack contained

Stripping namespaces and removing “1Sm” is a reasonable WAR given the current macro set. Consider adding a TODO with a link to the CUTLASS update once “PtrArrayTmaWarpSpecialized1Sm” is available, to remove the hack.


271-273: Very long macro line; wrap for readability (ruff E501)

The instantiation macro line is well over 120 chars. Consider line-wrapping arguments to improve readability and satisfy tooling:

-        INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
-                {epi_sched}, {epi_tag}, {epi_fusion}, {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, {"true" if operation.is_mx_fpx else "false"}, {"true" if operation.dynamic_cga else "false"}, false);\n
+        INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM(
+            {arch_tag}, {act_tag}, {weight_tag}, {out_tag},
+            {epi_sched}, {epi_tag}, {epi_fusion},
+            {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]},
+            {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]},
+            {"true" if operation.is_mx_fpx else "false"},
+            {"true" if operation.dynamic_cga else "false"},
+            false
+        );\n

370-372: Temporary exclusion for FP4 + NoSmem 1Sm

The explicit guard against PtrArrayNoSmemWarpSpecialized1Sm for FP4 aligns with CUTLASS limitations. Please add a TODO to re-enable when the backend supports this combination.


6-6: Star import and F405 warnings

Ruff flags several symbols as possibly undefined due to from cutlass_library import *. If feasible, import the specific names used (EpilogueScheduleType, KernelScheduleType, DataType, GemmKind, etc.) to quiet lints and improve readability. Not a blocker.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (6)

71-76: Avoid copying gemm_config: pass by const reference

gemm_config is passed by value here and along the chain. Prefer passing by const& to avoid copies.

-void dispatchMoeGemmFinalDispatchTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
-    cutlass_extensions::CutlassGemmConfig gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
+void dispatchMoeGemmFinalDispatchTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
+    cutlass_extensions::CutlassGemmConfig const& gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
     size_t* workspace_size)

Additionally update the callers’ signatures (see comments at Lines 325-328 and 366-369).


198-206: Use compile-time condition for ClusterShape gate in constexpr predicate

Prefer if constexpr here to keep this purely compile-time and avoid generating dead branches.

-    if (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1 || cute::size<2>(ClusterShape{}) != 1)
+    if constexpr (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1
+        || cute::size<2>(ClusterShape{}) != 1)
     {
         return false;
     }

325-328: Avoid copying gemm_config in cluster-shape dispatcher

Same rationale: pass by const&.

-void dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input,
-    int num_experts, cutlass_extensions::CutlassGemmConfig gemm_config, int multi_processor_count, cudaStream_t stream,
+void dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input,
+    int num_experts, cutlass_extensions::CutlassGemmConfig const& gemm_config, int multi_processor_count, cudaStream_t stream,
     int* occupancy, size_t* workspace_size)

366-369: Avoid copying gemm_config in tile-shape dispatcher

Pass by const& to prevent copies.

-void dispatchMoeGemmSelectTileShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
-    cutlass_extensions::CutlassGemmConfig gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
+void dispatchMoeGemmSelectTileShapeTmaWarpSpecialized(TmaWarpSpecializedGroupedGemmInput hopper_input, int num_experts,
+    cutlass_extensions::CutlassGemmConfig const& gemm_config, int multi_processor_count, cudaStream_t stream, int* occupancy,
     size_t* workspace_size)

362-363: Incorrect namespace comment on a function’s closing brace

This brace closes a function, not the namespace. Remove or correct the comment to avoid confusion.

-} // namespace tensorrt_llm
+}

64-64: Prefer <cmath> over <math.h>

Use C++ headers instead of C headers.

-#include <math.h>
+#include <cmath>
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 49bcaa4 and b3f2854.

📒 Files selected for processing (7)
  • 3rdparty/cutlass (1 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (10 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (4 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (2)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🧬 Code Graph Analysis (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (3)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
  • TileShape (213-559)
  • ClusterShape (345-558)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/collective/epilogue_moe_finalize.hpp (2)
  • bool (133-156)
  • bool (164-170)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp (1)
  • bool (563-647)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)
  • std (144-180)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (4)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7)
  • _ (234-276)
  • _ (355-363)
  • _ (444-454)
  • _ (626-653)
  • _ (686-696)
  • _ (773-783)
  • _ (876-892)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
  • EpilogueScheduleType (204-586)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/low_latency_gemm.h (1)
  • KernelScheduleType (36-59)
cpp/tensorrt_llm/runtime/torchUtils.h (1)
  • DataType (99-115)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

271-271: Line too long (313 > 120)

(E501)


353-353: Line too long (122 > 120)

(E501)


370-370: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


532-532: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


659-659: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


702-702: Line too long (139 > 120)

(E501)


704-704: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


705-705: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


724-724: KernelScheduleType may be undefined, or defined from star imports

(F405)


727-727: DataType may be undefined, or defined from star imports

(F405)


728-728: DataType may be undefined, or defined from star imports

(F405)


728-728: DataType may be undefined, or defined from star imports

(F405)


733-733: GemmKind may be undefined, or defined from star imports

(F405)


749-749: DataType may be undefined, or defined from star imports

(F405)

🔇 Additional comments (10)
3rdparty/cutlass (1)

1-1: Submodule bump acknowledged and looks intentional.

Pointer advance to ebe98c5… aligns with broader PR changes adding dynamic CGA support; no direct code in this repo depends on submodule contents at review time. Proceed.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

300-303: Good defensive check when opening input file

The ifstream open check and explicit invalid_argument improves failure diagnostics and is caught upstream in parseArgsAndRunBench. LGTM.
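
For reference, a standalone sketch of the same defensive-open pattern (the path, message wording, and function name here are illustrative, not the benchmark's actual code):

// Sketch: fail fast with a descriptive exception when a workload file cannot be opened.
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>

std::ifstream openWorkloadFile(std::string const& path)
{
    std::ifstream file(path);
    if (!file.is_open())
    {
        throw std::invalid_argument("Failed to open workload file: " + path);
    }
    return file;
}

int main()
{
    try
    {
        auto file = openWorkloadFile("workload.json");
    }
    catch (std::invalid_argument const& e)
    {
        // In the benchmark this is caught upstream by the argument-parsing entry point.
        std::fprintf(stderr, "%s\n", e.what());
    }
    return 0;
}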

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

26-32: Launcher signature aligned with dynamic CGA and epilogue schedule; verify include-time usage

The added EpilogueSchedule/DYNAMIC_CGA template params and runtime dynamic_cluster_shape parameter line up with the generator changes and the dispatch path. LGTM. Ensure macro instantiations and all call sites provide the new template arg and pass the runtime shape accordingly (generator has been updated).

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (5)

142-144: API: New dynamic_cga parameter plumbed into launcher

Extending the public constructor with dynamic_cga=False and storing it on the instance is consistent with the rest of the plumbing. LGTM.


157-157: Storing dynamic_cga on the launcher

Good—this is used downstream to select the correct instantiation variants.


350-356: SM100 validity: runtime cluster-shape constraints are clear

The restrictions on cga_m/cga_n/cga_k and the tile checks for FP4 look sensible given current backend support. LGTM.


659-670: SM120: epi schedule switch to TMA aligns with backend

Switching to EpilogueScheduleType.TmaWarpSpecialized here matches current SM120 expectations. LGTM.


702-706: New SM100 epilogue schedules for 1Sm variants

Using PtrArrayNoSmemWarpSpecialized1Sm and PtrArrayTmaWarpSpecialized1Sm is consistent with the launcher signature that now accepts an epilogue schedule type. Ensure these tokens exist in cutlass_library for the targets you generate.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)

119-132: WFP4AFP8 scaling-mode gating looks correct

Restricting WFP4 weights with FP8 activations to MXFPX and banning MXFPX otherwise is the right enforcement.


338-341: Wiring gemm_config into final dispatch looks correct

Forwarding gemm_config to the final dispatch aligns the new centralization. LGTM.

@djns99 djns99 changed the title from "perf: Add MOE support for dynamic cluster shapes and custom epilogue …" to "[None][perf] Add MOE support for dynamic cluster shapes and custom epilogue …" on Aug 11, 2025
@djns99 djns99 marked this pull request as ready for review August 11, 2025 06:44
@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from 7ad3b48 to 2dc4072 on August 11, 2025 06:46
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

17-36: Add missing include for std::get usage

The code uses std::get with the tuple returned by enum_to_shape_tuple on line 377, but <tuple> is not included. Add it to avoid relying on transitive includes:

 #include "tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h"
 #include "tensorrt_llm/common/cudaBf16Wrapper.h"
+#include <tuple>
 
 #ifdef __GNUC__ // Check if the compiler is GCC or Clang
♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)

728-751: Bug: dynamic_cga variants not appended correctly; only last one survives

The validity check and append operation are outside the dynamic_cga loop, so only the last variant (dynamic_cga=False) is considered. Move them inside the loop:

         for otype in otypes:
             for dynamic_cga in [True, False]:
                 moe_gemm_operation = TrtLlm_GemmLauncher(
                     GemmKind.Grouped,
                     arch,
                     dtype,
                     weight_type,
                     otype,
                     otype,
                     otype,
                     quant_op,
                     epi_tag,
                     cta_shape_mnk,
                     warp_shape,
                     stages,
                     cga_shape,
                     mainloop_schedule,
                     epi_schedule,
                     epi_fusion,
                     is_mx_fpx=(dtype == DataType.e4m3 and weight_type == e2m1),
                     dynamic_cga=dynamic_cga)
-
-            if is_op_valid(moe_gemm_operation):
-                operations.append(moe_gemm_operation)
+                if is_op_valid(moe_gemm_operation):
+                    operations.append(moe_gemm_operation)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

134-184: Guard dynamic cluster shape conversion to prevent undefined behavior.

The current code unconditionally calls enum_to_shape_tuple(gemm_config.dynamic_cluster_shape) even when dynamic_cga is false, which could lead to undefined behavior when ClusterShape::Undefined is passed.

Based on past review feedback, this issue was already identified. The recommendation is to guard the cluster shape conversion and validate SM100-specific constraints.

Apply this pattern:

-            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-            auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
+            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+            if (dynamic_cga)
+            {
+                auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+                // SM100 dynamic CGA supports only 1x1x1 and 2x1x1 cluster shapes
+                TLLM_CHECK_WITH_INFO(
+                    (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2) &&
+                    std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+                    "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%lld,%lld,%lld).",
+                    (long long) std::get<0>(cluster_shape), (long long) std::get<1>(cluster_shape), (long long) std::get<2>(cluster_shape));
+                cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+            }

Apply the same pattern to fallback cluster shape conversion.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

22-23: Add missing include for type traits

The code uses std::is_enum_v and std::underlying_type_t in enum_to_shape_tuple (lines 93-94) but <type_traits> is already included on line 23. Good!

🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

247-251: Technical debt: Temporary workaround for missing type

The removal of "1Sm" suffix is noted as a hack. Consider adding a TODO to properly address this:

             epi_sched = epi_sched.split(':')[-1]
-            epi_sched = epi_sched.replace(
-                "1Sm",
-                "")  # Hack to WAR missing `PtrArrayTmaWarpSpecialized` type
+            # TODO: Remove this hack once `PtrArrayTmaWarpSpecialized` type is properly defined
+            epi_sched = epi_sched.replace("1Sm", "")  # Hack to WAR missing `PtrArrayTmaWarpSpecialized` type

271-271: Fix line length exceeding 120 characters

Break this long line for better readability:

-        INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
-                {epi_sched}, {epi_tag}, {epi_fusion}, {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, {"true" if operation.is_mx_fpx else "false"}, {"true" if operation.dynamic_cga else "false"}, false);\n
+        INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
+                {epi_sched}, {epi_tag}, {epi_fusion}, 
+                {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, 
+                {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, 
+                {"true" if operation.is_mx_fpx else "false"}, 
+                {"true" if operation.dynamic_cga else "false"}, false);\n
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3f2854 and 2dc4072.

📒 Files selected for processing (8)
  • 3rdparty/cutlass (1 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • 3rdparty/cutlass
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
🧰 Additional context used
📓 Path-based instructions (4): same CODING_GUIDELINES.md excerpts as the previous review.
🧠 Learnings (3)
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
  • EpilogueScheduleType (205-565)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/low_latency_gemm.h (1)
  • KernelScheduleType (36-59)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)
  • FpXBlockScalingType (203-255)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (19)
  • ClusterShape (310-502)
  • TileShape (214-538)
  • EpilogueScheduleType (205-565)
  • if (244-247)
  • if (248-251)
  • if (252-255)
  • if (256-259)
  • if (260-263)
  • if (264-267)
  • if (268-271)
  • if (272-275)
  • if (276-279)
  • if (280-283)
  • if (284-287)
  • if (351-354)
  • if (355-358)
  • if (359-362)
  • if (363-366)
  • if (367-370)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

271-271: Line too long (313 > 120)

(E501)


353-353: Line too long (122 > 120)

(E501)


367-367: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


529-529: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


656-656: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


699-699: Line too long (139 > 120)

(E501)


701-701: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


702-702: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


721-721: KernelScheduleType may be undefined, or defined from star imports

(F405)


724-724: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


730-730: GemmKind may be undefined, or defined from star imports

(F405)


746-746: DataType may be undefined, or defined from star imports

(F405)

🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8)

73-75: LGTM: Function signature updated appropriately for dynamic cluster shape support.

The function rename and addition of the gemm_config parameter align well with the PR's objective of adding dynamic cluster shape support. The parameter placement is logical.


119-132: LGTM: Well-implemented weight type validation.

The weight type gating logic properly validates MXFPX scaling requirements based on the weight/input type combination. The error messages are descriptive and the validation will catch configuration errors early.


202-202: LGTM: Template signature updated for dynamic cluster shape support.

The change from deriving CtaShape from ClusterTileShape to accepting it as a template parameter is consistent with the shift to runtime-determined cluster shapes.


257-257: LGTM: Consistent template signature update.

The template parameter change for SM120 is consistent with the SM100 version and supports the dynamic cluster shape feature.


205-210: LGTM: Proper cluster shape validation for SM100.

The runtime validation correctly enforces the SM100 TMA dynamic CGA constraints (only 1x1x1 and 2x1x1 cluster shapes are supported). The logic is clear and will prevent launching unsupported configurations.


343-345: LGTM: Function call updated to match new signature.

The function call correctly includes the gemm_config parameter, enabling dynamic cluster shape support in the dispatch chain.


394-396: LGTM: Improved error message accuracy.

The updated error message correctly references the SM version-specific tile config field, improving debugging capability when unsupported configurations are encountered.


424-424: LGTM: Added support for 64x32 tile shape on SM100.

The new tile shape configuration follows the established macro pattern and expands the available options for SM100 optimization.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

371-390: Good: Added missing cluster shape conversions

The new cases for ClusterShape_1x4x1, ClusterShape_4x2x1, ClusterShape_2x4x1, and ClusterShape_4x4x1 properly handle the enum values that were added. This addresses the previous review comment about missing conversion cases.


448-463: Well-documented dynamic CGA support

The new constructor properly handles dynamic cluster shape configuration with clear documentation about the behavior when dynamic_cluster_shape is provided. The default values for the new parameters maintain backward compatibility.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

2072-2073: Test output updated for dynamic cluster shape

The config name generation correctly includes the new dynamic_cluster_shape field in the test output, maintaining consistency with the dynamic CGA feature additions.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

26-32: Launcher signature properly extended for dynamic CGA

The template parameters and function signature are correctly extended to support dynamic cluster shapes. The use of cute::Shape<int32_t, int32_t, cute::_1> appropriately represents 2D cluster shapes with the K dimension fixed at 1.
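
For reference, a tiny sketch of constructing such a shape with CuTe (assumes CUTLASS's CuTe headers are on the include path and the file is built as CUDA/C++ accordingly; the values are arbitrary):

// Sketch: a cluster shape whose M/N extents are runtime values and whose K extent is a
// compile-time _1, matching the cute::Shape<int32_t, int32_t, cute::_1> type in the launcher.
#include <cute/tensor.hpp>

#include <cstdint>
#include <cstdio>

int main()
{
    cute::Shape<int32_t, int32_t, cute::_1> cluster_shape{2, 1, cute::_1{}};
    std::printf("cluster: %dx%dx%d\n", (int) cute::get<0>(cluster_shape),
        (int) cute::get<1>(cluster_shape), (int) cute::get<2>(cluster_shape));
    return 0;
}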

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

371-386: Good: added cute::Shape mappings for new ClusterShape values (resolves prior gap).

These cases complete the mapping for 1x4x1, 4x2x1, 2x4x1, 4x4x1 noted earlier.

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)

377-377: Add missing #include for std::get usage.

This file uses std::get with the tuple returned by enum_to_shape_tuple but doesn't include <tuple>. Please add the include to avoid relying on transitive includes.

Add near the other standard library includes:

 #include <set>
 #include <vector>
+#include <tuple>

458-460: Bug: computed fallback_cluster_shape is not being used.

The fallback_cluster_shape is computed on lines 455-457 but the function call passes cluster_shape twice instead of using the computed fallback value.

-                auto configs = get_candidate_configs_sm100_dynamic_cluster_shape(
-                    config, schedule, cluster_shape, cluster_shape);
+                auto configs = get_candidate_configs_sm100_dynamic_cluster_shape(
+                    config, schedule, cluster_shape, fallback_cluster_shape);
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)

728-750: Bug: Only the last dynamic_cga variant is appended to operations list.

The is_op_valid check and append are outside the dynamic_cga loop, causing only the last variant (dynamic_cga=False) to be considered. Both variants should be validated and appended.

         for otype in otypes:
             for dynamic_cga in [True, False]:
                 moe_gemm_operation = TrtLlm_GemmLauncher(
                     GemmKind.Grouped,
                     arch,
                     dtype,
                     weight_type,
                     otype,
                     otype,
                     otype,
                     quant_op,
                     epi_tag,
                     cta_shape_mnk,
                     warp_shape,
                     stages,
                     cga_shape,
                     mainloop_schedule,
                     epi_schedule,
                     epi_fusion,
                     is_mx_fpx=(dtype == DataType.e4m3 and weight_type == e2m1),
                     dynamic_cga=dynamic_cga)
-
-            if is_op_valid(moe_gemm_operation):
-                operations.append(moe_gemm_operation)
+                if is_op_valid(moe_gemm_operation):
+                    operations.append(moe_gemm_operation)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

136-142: Guard dynamic cluster shape conversion and validate SM100 constraints.

The code unconditionally calls enum_to_shape_tuple even when dynamic_cga is false, and doesn't validate that SM100 only supports 1x1x1 or 2x1x1 cluster shapes.

             bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-            auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-            auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+            if (dynamic_cga)
+            {
+                auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+                // SM100 dynamic CGA only supports 1x1x1 or 2x1x1
+                TLLM_CHECK_WITH_INFO(
+                    (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2) &&
+                    std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+                    "SM100 dynamic CGA supports only 1x1x1 or 2x1x1 cluster shapes, got (%d,%d,%d).",
+                    std::get<0>(cluster_shape), std::get<1>(cluster_shape), std::get<2>(cluster_shape));
+                cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+                
+                auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+                cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            }
🧹 Nitpick comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (4)

2-2: Update copyright year to include the current year (2025).

Guideline requires headers to include the current year. Update the range.

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.

85-98: Harden shape encoding: avoid magic numbers, add bounds checks; drop redundant static.

  • Replace 1000/1000000 with named constants.
  • Add asserts to guard m,n,k in [0, 1000) to prevent collisions.
  • For headers, prefer constexpr over “constexpr static” at namespace scope.
-constexpr static int shape_tuple_to_enum(int m, int n, int k)
-{
-    return m * 1000000 + n * 1000 + k;
-}
+// Encoding: MMM NNN KKK in base 1000
+constexpr int kSHAPE_BASE = 1000;
+constexpr int kSHAPE_BASE2 = kSHAPE_BASE * kSHAPE_BASE;
+
+constexpr int shape_tuple_to_enum(int m, int n, int k)
+{
+    assert(m >= 0 && m < kSHAPE_BASE && n >= 0 && n < kSHAPE_BASE && k >= 0 && k < kSHAPE_BASE);
+    return m * kSHAPE_BASE2 + n * kSHAPE_BASE + k;
+}

288-291: Prefer compile-time error over returning 0x0x0 for unknown TileShape.

Returning an all-zero cute::Shape can silently mask missing cases. Trigger a compile-time error instead.

-    else
-    {
-        return cute::Shape<_0, _0, _0>{};
-    }
+    else
+    {
+        static_assert(kDependentFalse<Shape_MNK>, "Unsupported TileShape in get_tile_shape()");
+    }

Add this helper once in the header (outside the function), e.g. near other utilities:

template <auto>
inline constexpr bool kDependentFalse = false;

512-516: Prefer human-readable schedule names in toString().

Use the existing get_mainloop_schedule_name helper.

-                   << "\n\tmainloop sched: " << (int) mainloop_schedule << "\n\tepi sched: " << (int) epilogue_schedule
+                   << "\n\tmainloop sched: " << get_mainloop_schedule_name(mainloop_schedule)
+                   << "\n\tepi sched: " << (int) epilogue_schedule
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

2-2: Update copyright year to include 2025.

The copyright notice shows 2020-2023, but modifications are being made in 2025. Please update to include the current year.

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3f2854 and 2dc4072.

📒 Files selected for processing (8)
  • 3rdparty/cutlass (1 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • 3rdparty/cutlass
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (3)
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/low_latency_gemm.h (1)
  • KernelScheduleType (36-59)
cpp/tensorrt_llm/runtime/torchUtils.h (1)
  • DataType (99-115)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (2)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
  • TileShape (214-538)
  • ClusterShape (310-502)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm90_mma_array_tma_gmma_rs_warpspecialized_mixed_input_.hpp (1)
  • bool (563-647)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

271-271: Line too long (313 > 120)

(E501)


353-353: Line too long (122 > 120)

(E501)


367-367: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


529-529: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


656-656: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


699-699: Line too long (139 > 120)

(E501)


701-701: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


702-702: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


721-721: KernelScheduleType may be undefined, or defined from star imports

(F405)


724-724: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


730-730: GemmKind may be undefined, or defined from star imports

(F405)


746-746: DataType may be undefined, or defined from star imports

(F405)

🔇 Additional comments (14)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

2072-2073: LGTM! Test output properly reflects new dynamic cluster shape field.

The config label now includes the dynamic_cluster_shape, which aligns with the broader changes to support dynamic CGA in the MoE implementation.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (5)

22-23: Good: required headers added.

Adding <tuple> and <type_traits> resolves previous missing-include issues for tuple utilities and type traits.


310-324: Enum expansion looks consistent.

New ClusterShape values use the encoded scheme; aligns with decoding helpers.


420-421: Fields for dynamic/fallback cluster shapes are clear with sensible defaults.

Defaults to Undefined are appropriate.


448-463: SM100 constructor: supports dynamic and fallback cluster shapes — LGTM.

Comment explains semantics; initialization order and defaults look good.


549-551: ostream dump includes dynamic/fallback cluster shapes — LGTM.

Good addition for debugging/logging.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (2)

26-28: LGTM! Template parameters properly extended for dynamic CGA support.

The addition of EpilogueSchedule and DYNAMIC_CGA template parameters is well-structured and aligns with the dynamic CGA feature implementation across the codebase.


29-32: LGTM! Function signature properly extended with dynamic cluster shape parameters.

The addition of dynamic_cluster_shape and fallback_cluster_shape parameters at the end of the function signature maintains backward compatibility while enabling dynamic CGA support. The use of cute::Shape<int32_t, int32_t, cute::_1> is consistent with CUTLASS conventions.

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

370-437: LGTM! Well-structured SM100 dynamic cluster shape configuration generator.

The function properly:

  • Determines 1SM/2SM support based on dynamic_cluster_shape
  • Handles FP4_ONLY configurations with appropriate TMA epilogue requirement
  • Builds comprehensive tile configurations for both 1SM and 2SM variants
  • Correctly uses the new 6-argument CutlassGemmConfig constructor
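
For illustration, here is a minimal self-contained sketch of this generation pattern (all type and value names below are hypothetical stand-ins, not the real cutlass_extensions types; the actual function pairs many more tiles with the 1SM/2SM cluster selectors):

#include <utility>
#include <vector>

enum class Tile { CtaShape128x128x64, CtaShape128x256x64 };  // stand-in tile enum
enum class Cluster { Undefined, Shape_1x1x1, Shape_2x1x1 };   // stand-in cluster enum
enum class EpiSched { TMA, NO_SMEM };                         // stand-in epilogue schedule

struct Config
{
    Tile tile;
    Cluster cluster;          // interpreted as the 1SM vs 2SM selector when dynamic CGA is used
    Cluster dynamicCluster;   // preferred runtime cluster shape
    Cluster fallbackCluster;  // runtime fallback cluster shape
};

std::vector<Config> candidatesForDynamicCluster(
    Cluster dynamicCluster, Cluster fallbackCluster, EpiSched sched, bool fp4Only)
{
    // FP4-only paths require the TMA epilogue, so other schedules yield no candidates.
    if (fp4Only && sched != EpiSched::TMA)
    {
        return {};
    }
    // Each (tile, cluster-selector) pair becomes one candidate carrying the runtime shapes.
    std::vector<std::pair<Tile, Cluster>> tileConfigs = {
        {Tile::CtaShape128x128x64, Cluster::Shape_1x1x1}, // 1SM variant
        {Tile::CtaShape128x256x64, Cluster::Shape_2x1x1}, // 2SM variant
    };
    std::vector<Config> out;
    for (auto [tile, cluster] : tileConfigs)
    {
        out.push_back(Config{tile, cluster, dynamicCluster, fallbackCluster});
    }
    return out;
}
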
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

142-157: LGTM! Clean extension of TrtLlm_GemmLauncher with dynamic_cga support.

The addition of the dynamic_cga parameter with a default value of False maintains backward compatibility while enabling the new functionality.


247-271: LGTM! Proper instantiation with dynamic_cga and epilogue schedule.

The instantiation correctly:

  • Passes the epilogue schedule after extracting the last segment
  • Includes the dynamic_cga flag as a string boolean
  • Documents the workaround for the missing PtrArrayTmaWarpSpecialized type
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3)

73-75: LGTM! Function rename and signature update for centralized dispatch.

The rename to dispatchMoeGemmFinalDispatchTmaWarpSpecialized and addition of the gemm_config parameter properly centralizes the MOE GEMM dispatch logic.


119-132: LGTM! Proper validation of FP4/FP8 scaling requirements.

The code correctly enforces that WFP4AFP8 combinations must use MXFPX scaling, while preventing its use for other weight combinations. This ensures type safety and proper configuration.


202-255: LGTM! Proper tile shape validation for SM100 with dynamic CGA constraints.

The function correctly:

  • Uses CtaShape template parameter instead of ClusterTileShape
  • Validates that SM100 only supports 1x1x1 or 2x1x1 cluster shapes
  • Maintains proper tile size constraints for different data types

@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from 2dc4072 to 6edb1e5 Compare August 11, 2025 22:14
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

85-88: Add validation for shape dimensions to prevent encoding overflow

The encoding assumes each dimension is less than 1000. Values >= 1000 will produce incorrect results.

 constexpr static int shape_tuple_to_enum(int m, int n, int k)
 {
+    // m, n, k are runtime values here, so a static_assert cannot be used; assert instead.
+    assert(m >= 0 && m < 1000 && n >= 0 && n < 1000 && k >= 0 && k < 1000
+        && "Shape dimensions must be in [0, 1000) for correct encoding");
     return m * 1000000 + n * 1000 + k;
 }

490-503: Special-case Undefined and ChooseWithHeuristic for readable names

Currently returns "0x0x1" for special enum values. Should return meaningful names instead.

 std::string getTileConfigAsName() const
 {
     if (sm_version == 120 || sm_version == 121)
+    {
+        if (tile_config_sm120 == CutlassTileConfigSM120::Undefined) return "Undefined";
+        if (tile_config_sm120 == CutlassTileConfigSM120::ChooseWithHeuristic) return "ChooseWithHeuristic";
         return get_tile_shape_name(tile_config_sm120);
+    }
     if (sm_version >= 100 && sm_version < 120)
+    {
+        if (tile_config_sm100 == CutlassTileConfigSM100::Undefined) return "Undefined";
+        if (tile_config_sm100 == CutlassTileConfigSM100::ChooseWithHeuristic) return "ChooseWithHeuristic";
         return get_tile_shape_name(tile_config_sm100);
+    }
     if (sm_version == 90)
+    {
+        if (tile_config_sm90 == CutlassTileConfigSM90::Undefined) return "Undefined";
+        if (tile_config_sm90 == CutlassTileConfigSM90::ChooseWithHeuristic) return "ChooseWithHeuristic";
         return get_tile_shape_name(tile_config_sm90);
+    }
     if (sm_version < 90)
         return std::to_string((int) tile_config_sm80);
     assert(false && "Invalid SM version");
     return "invalid";
 }
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

134-142: Guard dynamic cluster shape conversion to prevent invalid shapes

enum_to_shape_tuple is called unconditionally, producing (0,0,0) for Undefined shapes even when dynamic CGA is disabled.

 bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-    std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+if (dynamic_cga)
+{
+    auto cluster_shape = enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+    // SM100 supports only 1x1x1 or 2x1x1 dynamic CGA
+    TLLM_CHECK_WITH_INFO(
+        (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2) &&
+        std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+        "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).",
+        std::get<0>(cluster_shape), std::get<1>(cluster_shape), std::get<2>(cluster_shape));
+    cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+        std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+    
+    auto cluster_shape_fallback = enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+    cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+        std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+}
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (3)

271-271: Line exceeds maximum length limit

This line is 313 characters long, exceeding the 120 character limit specified in the coding guidelines.

Split the macro instantiation across multiple lines:

-        INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
-                {epi_sched}, {epi_tag}, {epi_fusion}, {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, {"true" if operation.is_mx_fpx else "false"}, {"true" if operation.dynamic_cga else "false"}, false);\n
+        INSTANTIATE_TMA_WARP_SPECIALIZED_MOE_GEMM({arch_tag}, {act_tag}, {weight_tag}, {out_tag},
+                {epi_sched}, {epi_tag}, {epi_fusion}, 
+                {operation.cta_shape[0]}, {operation.cta_shape[1]}, {operation.cta_shape[2]}, 
+                {operation.cga_shape[0]}, {operation.cga_shape[1]}, {operation.cga_shape[2]}, 
+                {"true" if operation.is_mx_fpx else "false"}, 
+                {"true" if operation.dynamic_cga else "false"}, false);\n

366-368: Track TODO: Revert FP4 epilogue schedule restriction

The TODO comment indicates this is a temporary restriction until CUTLASS adds support for blockscaled + no smem. Consider creating a tracking issue for this limitation.

Would you like me to create a GitHub issue to track the removal of this restriction once CUTLASS support is available?


699-707: Line length exceeds limit

Line 699 exceeds the 120 character limit.

-    # Some shapes for SM100 are better with NoSmem, note the kernel will internally map to the 1 or 2 SM variants based on the cga_shape[0]
+    # Some shapes for SM100 are better with NoSmem. 
+    # Note: kernel internally maps to 1 or 2 SM variants based on cga_shape[0]
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2dc4072 and 6edb1e5.

📒 Files selected for processing (8)
  • 3rdparty/cutlass (1 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • 3rdparty/cutlass
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

271-271: Line too long (313 > 120)

(E501)


353-353: Line too long (122 > 120)

(E501)


367-367: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


529-529: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


656-656: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


699-699: Line too long (139 > 120)

(E501)


701-701: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


702-702: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


721-721: KernelScheduleType may be undefined, or defined from star imports

(F405)


724-724: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


730-730: GemmKind may be undefined, or defined from star imports

(F405)


746-746: DataType may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (5)

142-157: LGTM: Clean addition of dynamic_cga parameter

The dynamic_cga parameter is properly added with a sensible default value (False) to maintain backward compatibility.


350-355: LGTM: Clear runtime CGA constraints for SM100

The constraints properly limit SM100 to 1SM or 2SM variants (cga_m <= 2) with single N and K dimensions, which aligns with the hardware capabilities.


529-529: LGTM: Appropriate epilogue schedule for SM90 grouped GEMM

The change to PtrArrayNoSmemWarpSpecialized aligns with the pointer array-based grouped GEMM implementation.


656-666: LGTM: Simplified SM120 configuration

Good simplification by removing the intermediate cga_tile_shape_mnk variable and using TmaWarpSpecialized epilogue schedule for SM120.


728-750: LGTM: Dynamic CGA variant generation properly fixed

Good fix! The validity check and append operations are now correctly placed inside the inner loop, ensuring both dynamic_cga variants (True/False) are properly generated and validated. This addresses the bug identified in the previous review.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

26-32: LGTM: Well-structured template interface extension

The addition of EpilogueSchedule and DYNAMIC_CGA template parameters, along with the runtime dynamic_cluster_shape and fallback_cluster_shape parameters, provides a clean interface for dynamic cluster shape support. The parameter ordering is logical and consistent with existing conventions.
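
As a rough sketch of the shape of this interface (the names below are placeholders; the real launcher uses the cutlass_extensions enums and cute::Shape<int32_t, int32_t, cute::_1> for the runtime cluster shapes):

#include <cstdint>

enum class EpilogueSchedule { TmaWarpSpecialized, NoSmemWarpSpecialized }; // placeholder enum

struct RuntimeClusterShape
{
    int32_t m;
    int32_t n; // K is fixed to 1 for these kernels
};

// Compile-time knobs pick the kernel variant; the runtime cluster shapes are only
// meaningful when DynamicCga is true, otherwise the statically compiled shape is used.
template <typename T, EpilogueSchedule Schedule, bool DynamicCga, bool SwapAb>
void launchMoeGemmSketch(T const* a, T const* b, T* c, int64_t numProblems,
    RuntimeClusterShape dynamicClusterShape, RuntimeClusterShape fallbackClusterShape)
{
    // A real implementation would build the CUTLASS grouped-GEMM arguments here;
    // this stub only shows where the new parameters sit relative to the data pointers.
    (void) a;
    (void) b;
    (void) c;
    (void) numProblems;
    (void) dynamicClusterShape;
    (void) fallbackClusterShape;
}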

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3)

371-390: LGTM: Complete cluster shape template coverage

All new cluster shapes (1x4x1, 4x2x1, 2x4x1, 4x4x1) are properly handled with appropriate CUTE shape mappings. The default case returning Shape<_0, _0, _0> provides safe fallback behavior.


448-463: LGTM: Well-designed SM100 constructor with dynamic CGA support

The constructor properly supports dynamic cluster shapes with clear documentation explaining that when dynamic_cluster_shape is provided, the cluster_shape parameter is interpreted as 1SM vs 2SM mode selection rather than a static shape.


512-516: LGTM: Improved config string representation

The toString() method now provides much more readable output with decoded shape names for tile, cluster, dynamic cluster, and fallback cluster shapes.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3)

73-75: LGTM: Clear function naming and proper config passing

The rename to dispatchMoeGemmFinalDispatchTmaWarpSpecialized better reflects this as the final dispatch point, and passing gemm_config provides all necessary configuration in one parameter.


202-210: LGTM: Proper SM100 cluster shape validation

The function correctly validates that SM100 only supports 1x1x1 and 2x1x1 cluster shapes for dynamic CGA, with clear constraints documented.


366-366: LGTM: Clear and specific error messages

The error messages properly distinguish between cluster shape and tile shape configuration errors, making debugging easier.

Also applies to: 394-396

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

1-16: Update copyright year

Coding guideline requires the current year in the NVIDIA header. Consider updating “1993-2024” to “1993-2025”.

- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6edb1e5 and a793106.

📒 Files selected for processing (1)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

300-303: Robust workload file-open check — LGTM

Throwing invalid_argument when the file can’t be opened is correct and integrates with the existing try/catch in parseArgsAndRunBench. No further changes needed.
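
For reference, the pattern being approved is essentially the following (a minimal sketch with a hypothetical function name, not the benchmark's actual code):

#include <fstream>
#include <stdexcept>
#include <string>

std::ifstream openWorkloadFile(std::string const& path)
{
    std::ifstream file(path);
    if (!file.is_open())
    {
        // Surfaces a clear error that the existing try/catch in parseArgsAndRunBench can report.
        throw std::invalid_argument("Could not open workload file: " + path);
    }
    return file;
}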


462-490: Behavior change: GEMM profiling now triggers for all non-LAYER cases

The widened condition will apply the profiling path regardless of current routing_config. This also overrides user-provided routing and tactic splits when gemm_to_profile != LAYER. Confirm this broader behavior change is intentional for the new dynamic-CGA workflows.

@djns99
Copy link
Collaborator Author

djns99 commented Aug 11, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14860 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14860 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11217 completed with status: 'FAILURE'

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

2-2: Update copyright year to reflect current changes.

The copyright header shows 2023, but this code is being modified in 2024/2025. The coding guidelines require that "All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year."

- * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b86aede and 9d5f8c5.

📒 Files selected for processing (1)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (3)

1075-1075: Good optimization to fetch tactics once.

Storing the tactics vector in a local variable before the logging and counter population blocks is an efficient optimization that avoids redundant calls to mMoERunner.getTactics().


1086-1099: Excellent addition of detailed per-tactic metadata.

The new counters provide comprehensive visibility into tactic characteristics including SM version, tile shape, cluster shapes (static, dynamic, and fallback), and epilogue schedule. This aligns well with the PR's objective to add dynamic cluster shape support and will be valuable for performance analysis and debugging.

The implementation correctly extracts metadata from both selected tactics using consistent naming conventions (t1_* and t2_* prefixes).
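
As a hedged illustration of what such counter population can look like (struct and counter names below are stand-ins; the real fixture reads the values from the CutlassGemmConfig tactics returned by getTactics()):

#include <map>
#include <string>

struct TacticInfo // stand-in for the relevant CutlassGemmConfig fields
{
    int smVersion;
    int dynamicClusterShape;  // encoded enum value
    int fallbackClusterShape; // encoded enum value
    int epilogueSchedule;
};

void recordTacticCounters(std::map<std::string, double>& counters, TacticInfo const& t1, TacticInfo const& t2)
{
    counters["t1_sm"] = t1.smVersion;
    counters["t1_dynamic_cluster"] = t1.dynamicClusterShape;
    counters["t1_fallback_cluster"] = t1.fallbackClusterShape;
    counters["t1_epi_sched"] = t1.epilogueSchedule;
    counters["t2_sm"] = t2.smVersion;
    counters["t2_dynamic_cluster"] = t2.dynamicClusterShape;
    counters["t2_fallback_cluster"] = t2.fallbackClusterShape;
    counters["t2_epi_sched"] = t2.epilogueSchedule;
}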


1086-1099: Verify existence of new dynamic_cluster_shape and fallback_cluster_shape fields

I wasn’t able to locate definitions for the new fields in the codebase. Please confirm that the Tactic type returned by mMoERunner.getTactics() (or the underlying CutlassGemmConfig) actually declares:

  • dynamic_cluster_shape
  • fallback_cluster_shape

If they’re defined in a different header or via a code-gen step, ensure that file is included in your build. Otherwise, add these fields to the Tactic/CutlassGemmConfig struct so the benchmark compiles and reports correct values.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)

449-468: Bug fix confirmed: fallback_cluster_shape now used instead of passing cluster_shape twice

The computed fallback_cluster_shape is correctly passed into get_candidate_configs_sm100_dynamic_cluster_shape.


376-378: Build break: std::get used without including <tuple>

This TU uses std::get with enum_to_shape_tuple but doesn't include <tuple>. Some transitive includes may hide this locally, but it's fragile and can break builds.

Add explicit includes:

 #include <cuda_runtime_api.h>
+#include <tuple>    // for std::get
 #include <set>
 #include <vector>
+// Optional but recommended for std::pair in tile_configs
+#include <utility>
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

430-435: Avoid shadowing parameter ‘config’ with local variable

The local CutlassGemmConfig config shadows the function parameter config, which harms readability and can confuse future edits.

Apply:

-    for (auto [tile, cluster] : tile_configs)
-    {
-        CutlassGemmConfig config{
-            tile, MainloopScheduleType::AUTO, schedule, cluster, dynamic_cluster_shape, fallback_cluster_shape};
-        candidate_configs.push_back(config);
-    }
+    for (auto [tile, cluster] : tile_configs)
+    {
+        CutlassGemmConfig candidate{
+            tile, MainloopScheduleType::AUTO, schedule, cluster, dynamic_cluster_shape, fallback_cluster_shape};
+        candidate_configs.push_back(candidate);
+    }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d5f8c5 and a543ec6.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1)

74-76: Verified: All instantiations updated to 5-parameter form
I scanned the entire codebase for should_filter_tma_warp_specialized_gemm_problem_shape_v<…> and found no remaining uses of the old 4-parameter signature. The false value for SM90 is correctly propagated.

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1)

37-41: Intent check: FAST_BUILD filters all dynamic-CGA paths

Including DYNAMIC_CGA in the FAST_BUILD filter makes any dynamic cluster-shape path get excluded. If that’s intended (to shrink compile time), LGTM; otherwise, consider gating by arch or feature flag.
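
For context, a heavily simplified sketch of what such a compile-time filter can look like (names and the specific kept shape are illustrative, not the real should_filter_tma_warp_specialized_gemm_problem_shape_v implementation):

// Under FAST_BUILD only one canonical static configuration is kept and every
// dynamic-CGA variant is filtered out to cut compile time.
template <int CtaM, int CtaN, int CtaK, bool DynamicCga>
inline constexpr bool should_filter_config_v =
#ifdef FAST_BUILD
    DynamicCga || !(CtaM == 128 && CtaN == 128 && CtaK == 64);
#else
    false;
#endif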

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)

382-400: FP4-only gating against non-TMA epilogues looks correct

For FP4_ONLY, returning empty set unless schedule == TMA avoids generating unsupported variants. This aligns with the PR’s epilogue constraints.


443-445: FAST_BUILD default includes dynamic/fallback fields (Undefined) — good

The single SM100 candidate now carries dynamic_cluster_shape and fallback_cluster_shape as Undefined, keeping the API consistent even in FAST_BUILD.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

1-5: Missing NVIDIA copyright header (2025) — please add required header

The file is missing the mandatory NVIDIA copyright header with the current year as per project guidelines.

Apply this diff at the top of the file:

+// Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
 #include "tensorrt_llm/common/cudaUtils.h"
🧹 Nitpick comments (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2)

2068-2074: Also print fallback_cluster_shape in tactic label for debuggability

Including the fallback cluster shape will make failures easier to pinpoint when dynamic CGA falls back.

Apply this diff to extend the label:

-                   << " dynamic cluster shape " << (int) conf.dynamic_cluster_shape << " mainloop sched "
-                   << (int) conf.mainloop_schedule << " epi sched " << (int) conf.epilogue_schedule;
+                   << " dynamic cluster shape " << (int) conf.dynamic_cluster_shape
+                   << " fallback cluster shape " << (int) conf.fallback_cluster_shape
+                   << " mainloop sched " << (int) conf.mainloop_schedule
+                   << " epi sched " << (int) conf.epilogue_schedule;

2095-2097: Guard dynamic-cluster filtering to TMA-warp-specialized configs

Being explicit avoids future confusion if non-TMA configs ever carry non-Undefined dynamic_cluster_shape values.

Apply this diff:

-            if (conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined
-                && conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1)
+            if (conf1.is_tma_warp_specialized
+                && conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined
+                && conf1.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1)
                 continue; // To reduce the number of iterations we only test one dynamic cluster shape
-                if (conf2.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined
-                    && conf2.dynamic_cluster_shape
-                        != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1)
+                if (conf2.is_tma_warp_specialized
+                    && conf2.dynamic_cluster_shape != tensorrt_llm::cutlass_extensions::ClusterShape::Undefined
+                    && conf2.dynamic_cluster_shape
+                        != tensorrt_llm::cutlass_extensions::ClusterShape::ClusterShape_4x1x1)
                     continue; // To reduce the number of iterations we only test one dynamic cluster shape

Also applies to: 2100-2103

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a543ec6 and 357edca.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🔇 Additional comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

2095-2103: Sanity check: limiting dynamic cluster shapes to only 4x1x1 may miss regressions on other candidate shapes

If the SM100 dynamic-CGA candidate generator emits other shapes (e.g., 2x2x1, 2x1x2), this filter will skip them, reducing coverage. If runtime is a concern, consider sampling a small set of representative dynamic shapes or gating with an env var to expand coverage in CI-nightly.

Would you like me to propose a minimal sampling approach (e.g., test Undefined plus up to N dynamic shapes detected in configs), or gate via an environment variable to include all dynamic shapes in extended runs?
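
If the environment-variable route is preferred, a minimal sketch (the variable name is hypothetical) could look like:

#include <cstdlib>
#include <string>

// Returns true when extended dynamic-CGA coverage is requested, e.g. in nightly CI runs.
bool includeAllDynamicClusterShapes()
{
    char const* env = std::getenv("TLLM_TEST_ALL_DYNAMIC_CGA"); // hypothetical variable name
    return env != nullptr && std::string(env) != "0";
}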

@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from 357edca to 505e1b2 Compare August 12, 2025 03:58
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

462-490: Guard GEMM profiling path only for valid specific-GEMM values

As written, any unexpected value (e.g., 0 or 4) will enter the profiling path, force uniform routing, and possibly yield surprising behavior. Restrict the condition to explicit GEMM_1/GEMM_2.

-        if (gemm_to_profile != (int) GemmToProfile::LAYER)
+        if (gemm_to_profile == (int) GemmToProfile::GEMM_1
+            || gemm_to_profile == (int) GemmToProfile::GEMM_2)
         {

Optionally, also validate gemm_to_profile right after parsing and reject invalid values to fail fast. I can propose that change if desired.

♻️ Duplicate comments (6)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

471-479: Fix optional vs int comparison; lower log threshold for a warning

This repeats a prior review note and remains unresolved: comparing a std::optional directly to an int only works when the contained type is comparable with int and can otherwise fail to compile; checking has_value() and dereferencing is clearer and more robust. Also, a warning should not require ERROR level to be visible.

-            static bool printed = false;
-            if (routing_config != UNIFORM_ROUTING_CONFIG && LOG_LEVEL >= ERROR && !printed)
+            static bool printed = false;
+            if (routing_config.has_value() && *routing_config != UNIFORM_ROUTING_CONFIG
+                && LOG_LEVEL >= INFO && !printed)
             {
                 std::cerr << "Warning: Profiling a specific GEMM will always use uniform random token distribution"
                           << std::endl;
                 printed = true;
             }
             routing_config = UNIFORM_ROUTING_CONFIG;

If you have a WARNING level, prefer LOG_LEVEL >= WARNING instead of INFO.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3)

22-23: Missing type-traits include is now explicitly added

The earlier concern about using std::is_enum_v/std::underlying_type_t without including <type_traits> is addressed by adding <type_traits>. Good fix.


85-91: Added bounds checks for shape encoding prevent overflow

The asserts for non-negative and <1000 dimensions address the prior overflow/encoding-range concern in shape_tuple_to_enum.


347-399: get_cluster_shape() mapping completed for new shapes

Coverage for newly added shapes is present; the previous gap is resolved.

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

244-251: Fragile string hacks for epilogue tags/schedules; prefer explicit mapping

The split(':')[-1] and replace("1Sm","") hacks are brittle. Prefer a small explicit mapping dict for schedule/type token normalization before macro use.

I can provide a mapping-based refactor if desired.


727-751: Fix for dynamic_cga variants append bug is in place

The validity check and append are now inside the inner dynamic_cga loop, so both True/False variants are preserved.

🧹 Nitpick comments (11)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

2-2: Update copyright year to include 2025 to comply with org standard

Per the coding guidelines, the NVIDIA copyright header should include the current year.

- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1)

2-2: Update copyright year to include 2025 per coding guidelines

All OSS source files must include an NVIDIA copyright header that includes the current year.

Apply this diff:

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

2-2: Update copyright year to include 2025 per coding guidelines

All OSS source files must include an NVIDIA copyright header that includes the current year.

Apply this diff:

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.

498-510: Optional: return consistent, descriptive names for heuristics across SM families

Currently getTileConfigAsName() returns “heuristic” for ChooseWithHeuristic. Consider aligning the naming to “chooseWithHeuristic” for clarity and consistency with enum naming in logs.

Apply this minimal patch:

-        if (sm_version == 120 || sm_version == 121)
-            return get_tile_shape_name(tile_config_sm120);
+        if (sm_version == 120 || sm_version == 121)
+        {
+            if (tile_config_sm120 == CutlassTileConfigSM120::Undefined) return "undefined";
+            if (tile_config_sm120 == CutlassTileConfigSM120::ChooseWithHeuristic) return "chooseWithHeuristic";
+            return get_tile_shape_name(tile_config_sm120);
+        }
-        if (sm_version >= 100 && sm_version < 120)
-            return get_tile_shape_name(tile_config_sm100);
+        if (sm_version >= 100 && sm_version < 120)
+        {
+            if (tile_config_sm100 == CutlassTileConfigSM100::Undefined) return "undefined";
+            if (tile_config_sm100 == CutlassTileConfigSM100::ChooseWithHeuristic) return "chooseWithHeuristic";
+            return get_tile_shape_name(tile_config_sm100);
+        }
-        if (sm_version == 90)
-            return get_tile_shape_name(tile_config_sm90);
+        if (sm_version == 90)
+        {
+            if (tile_config_sm90 == CutlassTileConfigSM90::Undefined) return "undefined";
+            if (tile_config_sm90 == CutlassTileConfigSM90::ChooseWithHeuristic) return "chooseWithHeuristic";
+            return get_tile_shape_name(tile_config_sm90);
+        }
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

2-2: Update copyright year to include 2025 per coding guidelines

All OSS source files must include an NVIDIA copyright header that includes the current year.

Apply this diff:

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

236-251: Dead/broken string manipulation; replace-in-place has no effect and isn’t used

  • kernel_sched.replace(...) doesn’t assign back, so it’s a no-op.
  • kernel_sched isn’t used in the macro path anyway.

Remove the unused replace and tidy the schedule handling to avoid confusion.

Apply this diff:

-            kernel_sched.replace("::Kernel", "::KernelGrouped")
-            # epi_sched += "Grouped"
+            # Grouped schedules are handled by the macro; no string surgery needed here.

Optionally, also remove the commented “epi_sched += 'Grouped'” line.


699-704: Style: very long lines; consider breaking for readability and linting (E501)

These lines exceed 120 characters. If ruff is enforced in CI, wrap the long string or split arguments across lines to avoid E501.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (4)

2-2: Update copyright year to include 2025 per coding guidelines

All OSS source files must include an NVIDIA copyright header that includes the current year.

Apply this diff:

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.

171-183: Static assert message is confusing; check epilogue schedule once, early-return

The static_assert condition is fine, but the error string reads like guidance. Also, after selecting selected_func, early-return for clarity.

Apply this diff:

-                static_assert(
-                    is_wfp4afp8 || (std::is_same_v<T, __nv_fp4_e2m1> && std::is_same_v<WeightType, __nv_fp4_e2m1>),
-                    "Non-block scaled shapes should try no smem epilogue");
+                static_assert(
+                    is_wfp4afp8 || (std::is_same_v<T, __nv_fp4_e2m1> && std::is_same_v<WeightType, __nv_fp4_e2m1>),
+                    "Block-scaled MMA requires TMA epilogue; other combos should not select NoSmem here.");
                 TLLM_CHECK_WITH_INFO(gemm_config.epilogue_schedule == cutlass_extensions::EpilogueScheduleType::TMA,
                     "No Smem epilogue schedule is not supported for block scaled");
                 auto selected_func = dynamic_cga
                     ? &kernels::cutlass_kernels::tma_warp_specialized_generic_moe_gemm_kernelLauncher<Arch, T,
                         WeightType, OutputType, cutlass::epilogue::PtrArrayTmaWarpSpecialized, EpilogueTag, FUSION,
                         TileShape, ClusterShape, is_wfp4afp8, true, false>
                     : &kernels::cutlass_kernels::tma_warp_specialized_generic_moe_gemm_kernelLauncher<Arch, T,
                         WeightType, OutputType, cutlass::epilogue::PtrArrayTmaWarpSpecialized, EpilogueTag, FUSION,
                         TileShape, ClusterShape, is_wfp4afp8, false, false>;
                 selected_func(hopper_input, num_experts, multi_processor_count, stream, occupancy, workspace_size,
                     cluster_shape_cute, cluster_shape_cute_fallback);
+                return;

186-198: Pass empty shapes for non-dynamic paths for consistency

For SM120 and pre-SM100 paths you pass {}, {}; mirror that for SM100 non-dynamic to avoid propagating (0,0,0) shapes derived from Undefined.

If you adopt the earlier guard (don’t materialize shapes unless dynamic_cga), this is already achieved.


368-368: Misleading closing-brace comment

This line closes a switch/function scope, not a namespace. Adjust the comment or remove it to avoid confusion.

Apply this diff:

-} // namespace tensorrt_llm
+} // end dispatchMoeGemmSelectClusterShapeTmaWarpSpecialized
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 357edca and 505e1b2.

📒 Files selected for processing (11)
  • 3rdparty/cutlass (1 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • 3rdparty/cutlass
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

271-271: Line too long (313 > 120)

(E501)


353-353: Line too long (122 > 120)

(E501)


367-367: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


529-529: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


656-656: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


699-699: Line too long (139 > 120)

(E501)


701-701: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


702-702: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


721-721: KernelScheduleType may be undefined, or defined from star imports

(F405)


724-724: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


725-725: DataType may be undefined, or defined from star imports

(F405)


730-730: GemmKind may be undefined, or defined from star imports

(F405)


746-746: DataType may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

300-303: Robust file-open validation — LGTM

Good addition. Throwing on file-open failure makes the failure mode explicit and avoids hard-to-debug downstream errors.
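For reference, the general pattern being praised looks roughly like this (illustrative only, not the benchmark's exact code):

#include <fstream>
#include <stdexcept>
#include <string>

// Throw as soon as the workload file cannot be opened instead of failing later downstream.
inline std::ifstream openWorkloadFile(std::string const& path)
{
    std::ifstream file(path);
    if (!file.is_open())
    {
        throw std::runtime_error("Failed to open workload file: " + path);
    }
    return file;
}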

cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1)

74-76: Template-gate update for DYNAMIC_CGA is correct

Passing the new bool template argument (false) to should_filter_tma_warp_specialized_gemm_problem_shape_v aligns with the updated template signature and keeps SM90 static-CGA gating behavior unchanged here.

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

298-316: Tile shape name helper: generic int-backed enum support is solid

The templated get_tile_shape_name handles Undefined/Heuristic and decodes shape triplets cleanly via enum_to_shape_tuple. LGTM.


318-333: New ClusterShape enum values added

Extended cluster shapes (1x4x1, 4x1x1, 4x2x1, 2x4x1, 4x4x1, 1x8x1, 8x1x1) are defined and int-encoded consistently.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1)

25-33: Template parameter and launcher signature alignment with generator

The addition of EpilogueSchedule and DYNAMIC_CGA template parameters and dynamic/fallback cluster shape arguments matches the generator changes.

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

125-158: Propagate dynamic_cga through launcher config: LGTM

The new init param and attribute (dynamic_cga) are correctly added, stored, and later used in instantiation.


349-356: SM100 validation tightened correctly for runtime CGA

The early-return filter to allow only (1|2)x1x1 shapes for SM100 aligns with the dynamic CGA model and your dispatch path.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (4)

119-133: MXFPX/WFP4-AFP8 gating checks look correct

The runtime checks enforce MXFPX when WeightType is FP4 and T is FP8 e4m3, and ban MXFPX otherwise. Good safety.


202-255: SM100 tile-shape predicate is cohesive

The constexpr predicate matches the Python generator’s constraints (TileM in {64,128}, TileN alignment, FP4 gating, small-N FP8 exceptions). LGTM.
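A rough sketch of this kind of compile-time predicate; the concrete bounds below are placeholders, not the header's actual constraints.

// Illustrative constexpr tile-shape filter; the real TileN alignment and FP4 gating rules differ.
template <int TileM, int TileN, bool IsFp4>
constexpr bool isSupportedSm100TileShape()
{
    bool const validM = TileM == 64 || TileM == 128;
    bool const validN = TileN % 8 == 0 && TileN >= 16; // placeholder alignment rule
    bool const fp4Ok = !IsFp4 || TileN >= 64;          // placeholder FP4 gating
    return validM && validN && fp4Ok;
}

static_assert(isSupportedSm100TileShape<128, 128, false>(), "example shape should pass");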


371-459: Tile-shape dispatch updates for SM100/SM120: LGTM

New shape cases and DEFAULT handling are consistent with the additions in gemm_configs and the generator.


465-472: Workspace sizing reuses dispatch path cleanly

calcMaxWorkspaceSizeTmaWarpSpecialized leverages the updated dispatcher; good reuse to avoid template bloat.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)

2-2: Update copyright year to include 2025 per OSS header policy

Per coding guidelines, the header must include the current year.

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.

16-16: Add required header include guards

Headers must have a TRTLLM_-prefixed include guard. This prevents multiple inclusion and follows project conventions.

@@
 
+#ifndef TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
+#define TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H
@@
 } // namespace tensorrt_llm::kernels::cutlass_kernels
+
+#endif // TRTLLM_MOE_GEMM_TEMPLATE_DISPATCH_TMA_WS_H

Also applies to: 474-474

♻️ Duplicate comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

134-143: Guard dynamic CGA shape conversion and enforce SM100 allowed shapes (1x1x1, 2x1x1)

You unconditionally convert enum_to_shape_tuple even when dynamic CGA is disabled, which can materialize invalid (Undefined) shapes. Also, SM100 dynamic CGA supports only (1|2)x1x1; enforce this at runtime to fail fast. This aligns with prior feedback.

-            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-            auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-            auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
+            // Only materialize runtime shapes if dynamic CGA is enabled.
+            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+            if (dynamic_cga)
+            {
+                auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+                auto cluster_shape_fallback
+                    = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+                // SM100 dynamic CGA supports only (1|2)x1x1. Fail fast on invalid shapes.
+                TLLM_CHECK_WITH_INFO(
+                    ((std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2)
+                        && std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1),
+                    "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).",
+                    (int) std::get<0>(cluster_shape), (int) std::get<1>(cluster_shape), (int) std::get<2>(cluster_shape));
+                cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+                cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            }

Would you like me to push a patch updating all similar call sites that materialize dynamic CGA shapes?

🧹 Nitpick comments (7)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (7)

18-21: Prefer #if defined(...) over #ifdef per guidelines

Use #if defined(...) to improve consistency and readability.

-#ifdef __GNUC__ // Check if the compiler is GCC or Clang
+#if defined(__GNUC__) // Check if the compiler is GCC or Clang
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Wstrict-aliasing"
-#endif
+#endif

49-51: Prefer #if defined(...) over #ifdef per guidelines (pop block)

Mirror the change for the matching pop block.

-#ifdef __GNUC__ // Check if the compiler is GCC or Clang
+#if defined(__GNUC__) // Check if the compiler is GCC or Clang
 #pragma GCC diagnostic pop
-#endif
+#endif

62-66: Use C++ standard headers instead of C headers

Replace <math.h> with <cmath> per the guideline “C headers should not be used directly.”

 #include <cuda.h>
 #include <cuda_fp16.h>
-#include <math.h>
+#include <cmath>
 #include <sstream>

162-166: Optionally return after dispatch to make control-flow explicit

Not required due to if-constexpr branching, but adding a return clarifies intent and avoids accidental fall-through on future edits.

                 func_map[tma_epilogue][dynamic_cga](hopper_input, num_experts, multi_processor_count, stream, occupancy,
                     workspace_size, cluster_shape_cute, cluster_shape_cute_fallback);
+                return;

202-210: Use compile-time condition for constant ClusterShape checks

These use compile-time constants; using if constexpr avoids generating dead runtime branches.

-    if (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1 || cute::size<2>(ClusterShape{}) != 1)
+    if constexpr (cute::size<0>(ClusterShape{}) > 2 || cute::size<1>(ClusterShape{}) != 1
+        || cute::size<2>(ClusterShape{}) != 1)
     {
         return false;
     }

275-286: Fix minor grammar in documentation comment

Tighten the prose for clarity.

 /*
-    1x1x1 cluster shape is are supported for any tile shape.
+    1x1x1 cluster shape is supported for any tile shape.
 
-    2x1x1 cluster shape is only supported for when the M tile is at least 128.
+    2x1x1 cluster shape is supported only when the M tile is at least 128.
 
-    1x2x1 cluster shape is only supported when the N tile is at least 128.
+    1x2x1 cluster shape is supported only when the N tile is at least 128.
 
-    2x2x1 cluster shape is only supported when both the M and N tiles are at least 128.
+    2x2x1 cluster shape is supported only when both the M and N tiles are at least 128.
 
-    We make the above restrictions are to improve compilation speed in TRT-LLM by pruning kernels
+    We make the above restrictions to improve compilation speed in TRT-LLM by pruning kernels
     that may not be very useful in practice.
  */

368-368: Remove incorrect namespace-closing comment on a function brace

This brace closes a function, not a namespace. The comment is misleading.

-} // namespace tensorrt_llm
+}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 505e1b2 and 4e588e6.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (8 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🧠 Learnings (2)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

119-132: MXFPX gating for WFP4AFP8 is correct

The scaling-type checks for the WFP4AFP8 combination are precise and fail fast for unsupported pairings. Good defensive validation.

@djns99 djns99 requested review from zongfeijing and hyukn August 12, 2025 04:26
@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from 4e588e6 to c1c30a6 Compare August 13, 2025 02:20
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (6)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

85-90: Add validation for shape dimensions to prevent overflow

The encoding scheme assumes each dimension is less than 1000. For values >= 1000, this will produce incorrect results due to overflow in the encoding logic.

Consider adding documentation and validation:

 constexpr static int shape_tuple_to_enum(int m, int n, int k)
 {
+    // Dimensions must be < 1000 for correct encoding
     assert(m >= 0 && n >= 0 && k >= 0);
     assert(m < 1000 && n < 1000 && k < 1000);
     return m * 1000000 + n * 1000 + k;
 }
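For intuition, the base-1000 packing and its inverse behave as in this standalone sketch (stand-ins for the real helpers in gemm_configs.h):

#include <tuple>

constexpr int shapeTupleToEnum(int m, int n, int k)
{
    return m * 1000000 + n * 1000 + k; // each dimension occupies three decimal digits
}

constexpr std::tuple<int, int, int> enumToShapeTuple(int encoded)
{
    return {encoded / 1000000, (encoded / 1000) % 1000, encoded % 1000};
}

static_assert(shapeTupleToEnum(128, 256, 64) == 128256064, "encode packs M, N, K");
static_assert(std::get<1>(enumToShapeTuple(128256064)) == 256, "decode recovers N");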
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)

370-437: LGTM! Well-structured dynamic cluster shape candidate generation

The new helper function correctly:

  • Selects 1SM/2SM variants based on dynamic_cluster_shape's M dimension
  • Builds separate candidate sets for FP4 and non-FP4 configurations
  • Properly carries dynamic_cluster_shape and fallback_cluster_shape through to the 6-argument CutlassGemmConfig constructor

449-467: LGTM! Correct usage of dynamic cluster shape helper

The refactored SM100 candidate generation correctly (see the sketch after this list):

  • Iterates over cluster shape variants (1x1x1, 4x1x1, 4x2x1)
  • Derives appropriate fallback shapes
  • Calls the new dynamic cluster shape helper with correct parameters
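A rough, self-contained sketch of what such a loop can look like; the enum, struct, tile list, and fallback rule below are illustrative stand-ins, not the actual code in cutlass_heuristic.cpp.

#include <initializer_list>
#include <vector>

// Standalone stand-ins so the sketch compiles; the real types live in cutlass_extensions.
enum class ClusterShapeSketch { Shape_1x1x1, Shape_4x1x1, Shape_4x2x1 };

struct GemmConfigSketch
{
    int tileId;
    ClusterShapeSketch dynamicClusterShape;
    ClusterShapeSketch fallbackClusterShape;
};

inline std::vector<GemmConfigSketch> buildSm100CandidatesSketch()
{
    std::vector<GemmConfigSketch> candidates;
    for (auto dynamicShape :
        {ClusterShapeSketch::Shape_1x1x1, ClusterShapeSketch::Shape_4x1x1, ClusterShapeSketch::Shape_4x2x1})
    {
        auto const fallbackShape = ClusterShapeSketch::Shape_1x1x1; // placeholder fallback rule
        for (int tileId : {0, 1, 2}) // placeholder tile list; the real helper also varies epilogue schedule
        {
            candidates.push_back(GemmConfigSketch{tileId, dynamicShape, fallbackShape});
        }
    }
    return candidates;
}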
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

723-745: LGTM! Proper instantiation of dynamic CGA variants

The code correctly generates two launcher instances (dynamic_cga True/False) for each configuration and validates each before appending to operations.


236-250: Document the epilogue schedule name workaround more clearly

The current approach uses string manipulation to work around missing type definitions. This should be better documented or fixed properly.

Consider adding a more explicit comment explaining why this workaround is necessary and what the long-term fix should be:

-            # Hack to WAR missing `PtrArrayTmaWarpSpecialized` type
+            # TODO: Fix the underlying issue by adding PtrArrayTmaWarpSpecialized type definition
+            # Current workaround: Remove '1Sm' suffix that causes macro instantiation issues
+            # The '1Sm' suffix indicates single-SM variant but the macro doesn't expect it
             epi_sched = epi_sched.replace(
                 "1Sm",
                 "")
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

175-186: Add validation for SM100 dynamic cluster shapes

The code processes dynamic_cluster_shape without validating that only SM100-supported shapes (1x1x1 or 2x1x1) are used.

Apply this validation:

             bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-            auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-            auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+            if (dynamic_cga)
+            {
+                auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+                // SM100 dynamic CGA supports only (1|2)x1x1
+                TLLM_CHECK_WITH_INFO(
+                    (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2) &&
+                    std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+                    "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).",
+                    (int) std::get<0>(cluster_shape), (int) std::get<1>(cluster_shape), (int) std::get<2>(cluster_shape));
+                cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+                auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+                cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            }
🧹 Nitpick comments (1)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

298-310: Consider a more descriptive name for special cases in get_tile_shape_name

The function returns "undefined" and "heuristic" for 0 and 1 respectively, but these magic values aren't self-documenting.

Consider using named constants or enum values for better readability:

+    constexpr int kUndefinedShape = 0;
+    constexpr int kHeuristicShape = 1;
+    
     static_assert(std::is_enum_v<TEnum> && std::is_same_v<std::underlying_type_t<TEnum>, int>,
         "TEnum must be an enum with underlying type int");
-    if ((int) Shape_MNK == 0)
+    if ((int) Shape_MNK == kUndefinedShape)
     {
         return "undefined";
     }
-    else if ((int) Shape_MNK == 1)
+    else if ((int) Shape_MNK == kHeuristicShape)
     {
         return "heuristic";
     }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e588e6 and c1c30a6.

📒 Files selected for processing (11)
  • 3rdparty/cutlass (1 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (12 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • 3rdparty/cutlass
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILENAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🧠 Learnings (3)
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

269-269: Line too long (393 > 120)

(E501)


349-349: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


349-349: Line too long (145 > 120)

(E501)


352-352: Line too long (122 > 120)

(E501)


366-366: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


526-526: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


651-651: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


694-694: Line too long (139 > 120)

(E501)


696-696: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


697-697: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


725-725: GemmKind may be undefined, or defined from star imports

(F405)


741-741: DataType may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

374-393: LGTM! Complete handling of new cluster shapes

The updated get_cluster_shape() template now handles all new ClusterShape enum values (1x4x1, 4x2x1, 2x4x1, 4x4x1) and includes a proper fallback for undefined shapes.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2)

2088-2089: LGTM! Consistent dynamic cluster shape printing

The config label now correctly includes dynamic_cluster_shape for TMA warp specialized configs, maintaining consistency with the updated toString() method in CutlassGemmConfig.


2111-2119: LGTM! Efficient filtering of dynamic cluster shapes

The filtering logic correctly reduces test iterations by testing only one dynamic cluster shape (ClusterShape_4x1x1) when defined, avoiding redundant test combinations.

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

143-157: LGTM! Dynamic CGA support added to launcher

The TrtLlm_GemmLauncher constructor now accepts and stores the dynamic_cga parameter, enabling dynamic cluster shape support for SM100.


349-354: Validation logic correctly handles finalize fusion constraints

The check ensures that finalize fusion epilogue only uses PtrArrayTmaWarpSpecialized1Sm, which is the only supported schedule for this configuration.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3)

71-117: Well-structured SM100 dispatch function with proper epilogue handling

The new getDispatchFunctionForSM100 helper correctly (see the sketch after this list):

  • Creates a 2D function map indexed by TMA epilogue and dynamic CGA
  • Handles FP4 types by enforcing TMA epilogue schedule
  • Returns appropriate function pointers based on configuration
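A simplified stand-in for the 2D dispatch-map pattern described above; the launcher signature and names here are illustrative, and the indexing order mirrors func_map[tma_epilogue][dynamic_cga] used in the helper.

#include <array>

using LauncherFn = void (*)(int numExperts);

template <bool TmaEpilogue, bool DynamicCga>
void launchKernelSketch(int /*numExperts*/)
{
    // The real code instantiates tma_warp_specialized_generic_moe_gemm_kernelLauncher here.
}

inline LauncherFn selectKernelSketch(bool tmaEpilogue, bool dynamicCga)
{
    // Outer index: TMA epilogue on/off; inner index: dynamic CGA on/off.
    static constexpr std::array<std::array<LauncherFn, 2>, 2> funcMap{{
        {&launchKernelSketch<false, false>, &launchKernelSketch<false, true>},
        {&launchKernelSketch<true, false>, &launchKernelSketch<true, true>},
    }};
    return funcMap[tmaEpilogue][dynamicCga];
}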

207-212: LGTM! Correct runtime cluster shape validation for SM100

The function properly validates that only 1x1x1 and 2x1x1 cluster shapes are supported at compile time.


158-171: LGTM! Proper FP4/FP8 weight combination validation

The validation correctly ensures:

  • WFP4AFP8 configurations use MXFPX scaling
  • Other configurations don't use MXFPX scaling

@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from c1c30a6 to 7c626c3 Compare August 14, 2025 00:51
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

31-33: Missing #include <array> for std::array usage

getDispatchFunctionForSM100 uses std::array but this header does not include <array>. This will fail to compile on compilers that don’t implicitly include it.

Apply:

 #include "cute/tensor.hpp"
+#include <array>
♻️ Duplicate comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

22-23: Resolved: missing type-traits/tuple includes added

The explicit inclusion of <tuple> and <type_traits> addresses prior build fragility when using enum_to_shape_tuple and related static_asserts.

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1)

30-31: Default DYNAMIC_CGA to false to avoid widespread template churn

Providing a default preserves source compatibility at call sites, reduces boilerplate, and likely fixes some CI breakages from missed updates. Note that a defaulted template parameter must come last, so DYNAMIC_CGA moves after ActivationType.

Apply:

-template <class ArchTag, class TileShape, class ClusterShape, bool DYNAMIC_CGA, class ActivationType>
+template <class ArchTag, class TileShape, class ClusterShape, class ActivationType, bool DYNAMIC_CGA = false>
 struct should_filter_tma_warp_specialized_gemm_problem_shape
 {
@@
-template <class ArchTag, class TileShape, class ClusterShape, bool DYNAMIC_CGA, class ActivationType>
+template <class ArchTag, class TileShape, class ClusterShape, class ActivationType, bool DYNAMIC_CGA = false>
 constexpr static bool should_filter_tma_warp_specialized_gemm_problem_shape_v
-    = should_filter_tma_warp_specialized_gemm_problem_shape<ArchTag, TileShape, ClusterShape, DYNAMIC_CGA,
-        ActivationType>::value;
+    = should_filter_tma_warp_specialized_gemm_problem_shape<ArchTag, TileShape, ClusterShape, ActivationType,
+        DYNAMIC_CGA>::value;

Also applies to: 43-46

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)

724-746: Resolved: dynamic_cga variants are now validated and appended per-iteration

The validity check and append are correctly inside the inner loop, so both dynamic and static variants are handled.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)

158-171: Enforce correct scaling mode and guard FINALIZE fusion’s required inputs

The WFP4AFP8/MXFPX checks are good. Add a guard for FINALIZE fusion to ensure the required router scales pointer is set (CUTLASS doesn’t null-check it).

Apply:

         else
         {
             TLLM_CHECK_WITH_INFO(
                 hopper_input.fpX_block_scaling_type != TmaWarpSpecializedGroupedGemmInput::FpXBlockScalingType::MXFPX,
                 "MXFPX is not supported for the selected weight combination");
         }
+
+        // FINALIZE fusion requires valid router scales.
+        if constexpr (FUSION == EpilogueFusion::FINALIZE)
+        {
+            TLLM_CHECK_WITH_INFO(
+                hopper_input.ptr_router_scales != nullptr,
+                "ptr_router_scales must be non-null when using FINALIZE fusion epilogue");
+        }

173-181: Avoid building runtime cluster shapes when dynamic CGA is disabled; validate allowed shapes

Currently, you compute cluster shapes even when dynamic CGA is off; Undefined decodes to 0x0x0 and gets passed downstream. Only build and pass shapes when dynamic CGA is enabled, and validate (1|2)x1x1 for SM100.

Apply:

-            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-            auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-            auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
+            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+            if (dynamic_cga)
+            {
+                auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+                auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+                // SM100 dynamic CGA supports only (1|2)x1x1.
+                TLLM_CHECK_WITH_INFO(
+                    (std::get<0>(cluster_shape) == 1 || std::get<0>(cluster_shape) == 2)
+                        && std::get<1>(cluster_shape) == 1 && std::get<2>(cluster_shape) == 1,
+                    "SM100 dynamic CGA supports only (1|2)x1x1 cluster shapes, got (%d,%d,%d).",
+                    (int) std::get<0>(cluster_shape), (int) std::get<1>(cluster_shape), (int) std::get<2>(cluster_shape));
+                cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+                cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            }
🧹 Nitpick comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

291-343: Simplify get_cluster_shape with enum_to_shape_tuple (remove manual case cascade)

You already generalized TileShape via enum_to_shape_tuple; do the same for ClusterShape to eliminate the long if constexpr chain and ensure all future shapes are covered automatically.

Apply:

-constexpr auto get_cluster_shape()
-{
-    using namespace cute;
-    if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x1x1)
-    {
-        return cute::Shape<_1, _1, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x1x1)
-    {
-        return cute::Shape<_2, _1, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x2x1)
-    {
-        return cute::Shape<_1, _2, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x2x1)
-    {
-        return cute::Shape<_2, _2, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x1x1)
-    {
-        return cute::Shape<_4, _1, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x8x1)
-    {
-        return cute::Shape<_1, _8, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_8x1x1)
-    {
-        return cute::Shape<_8, _1, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_1x4x1)
-    {
-        return cute::Shape<_1, _4, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x2x1)
-    {
-        return cute::Shape<_4, _2, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_2x4x1)
-    {
-        return cute::Shape<_2, _4, _1>{};
-    }
-    else if constexpr (Shape_MNK == ClusterShape::ClusterShape_4x4x1)
-    {
-        return cute::Shape<_4, _4, _1>{};
-    }
-    else
-    {
-        return cute::Shape<_0, _0, _0>{};
-    }
-}
+constexpr auto get_cluster_shape()
+{
+    using namespace cute;
+    static_assert(Shape_MNK != ClusterShape::Undefined, "ClusterShape is undefined");
+    constexpr auto t = enum_to_shape_tuple(Shape_MNK);
+    return cute::Shape<cute::Int<std::get<0>(t)>, cute::Int<std::get<1>(t)>, cute::Int<std::get<2>(t)>>{};
+}

456-489: Improve readability: print schedule names instead of ints

toString() currently prints raw enum ints for mainloop/epilogue. Consider mapping to get_mainloop_schedule_name() and a similar helper for EpilogueScheduleType for more informative logs.
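
To make the suggestion concrete, here is a minimal sketch of such a name helper. The enumerator list and function name are placeholders (assumptions), not the actual EpilogueScheduleType or any existing helper:

// Sketch only: the enumerators stand in for the real cutlass_extensions::EpilogueScheduleType values.
enum class EpilogueScheduleTypeSketch
{
    AUTO,
    TMA,
    NO_SMEM
};

inline char const* getEpilogueScheduleName(EpilogueScheduleTypeSketch schedule)
{
    char const* name = "unknown";
    switch (schedule)
    {
    case EpilogueScheduleTypeSketch::AUTO:
    {
        name = "auto";
        break;
    }
    case EpilogueScheduleTypeSketch::TMA:
    {
        name = "tma";
        break;
    }
    case EpilogueScheduleTypeSketch::NO_SMEM:
    {
        name = "no_smem";
        break;
    }
    }
    return name;
}

With a helper like this, toString() could stream the human-readable name alongside (or instead of) the raw integer.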

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

370-437: Add an explicit <tuple> include to avoid relying on transitive headers

This TU calls std::get on tuples returned by enum_to_shape_tuple; include <tuple> explicitly for robustness.

Proposed addition near other standard includes:

 #include <set>
+#include <tuple>
 #include <vector>
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

6-6: Avoid star import to satisfy Ruff F405 and improve readability

from cutlass_library import * triggers F405 and makes symbol origins unclear. Prefer a qualified import.

Suggested approach:

  • Replace the star import with a qualified import, e.g., import cutlass_library as cl.
  • Refer to symbols as cl.EpilogueScheduleType, cl.DataType, cl.GemmKind, cl.DataTypeNames, cl.DataTypeSize, etc.

If refactoring now is too invasive, consider suppressing lint for this file or strategically importing only the used symbols.


269-269: Long lines exceed 120 columns; consider wrapping or lint suppression

Multiple lines exceed Ruff’s 120-char limit (E501), mostly macro strings and long conditionals. Either wrap them or add targeted noqa comments to keep CI green.

Also applies to: 349-349, 352-352, 366-366, 526-526, 651-651, 694-697, 725-725, 741-741

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1c30a6 and 7c626c3.

📒 Files selected for processing (10)
  • 3rdparty/cutlass (1 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (10 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • 3rdparty/cutlass
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (2)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
  • TileShape (213-490)
  • ClusterShape (262-454)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/common.h (1)
  • ActivationType (23-33)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
  • EpilogueScheduleType (204-517)
🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

269-269: Line too long (393 > 120)

(E501)


349-349: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


349-349: Line too long (145 > 120)

(E501)


352-352: Line too long (122 > 120)

(E501)


366-366: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


526-526: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


651-651: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


694-694: Line too long (139 > 120)

(E501)


696-696: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


697-697: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


725-725: GemmKind may be undefined, or defined from star imports

(F405)


741-741: DataType may be undefined, or defined from star imports

(F405)

🔇 Additional comments (5)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

85-101: Shape encoding helpers look correct and safe

The encoding/decoding utilities are constexpr, validate ranges (<1000) and avoid UB. Good foundation for uniform enum value management across SM families.
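
For readers skimming the header, here is a self-contained sketch of the base-1000 packing idea, using illustrative helper names rather than the actual gemm_configs.h functions:

#include <tuple>

// Pack (M, N, K) into one int, assuming each dimension fits in three decimal digits (< 1000).
constexpr int shapeTupleToInt(int m, int n, int k)
{
    return (m < 1000 && n < 1000 && k < 1000) ? (m * 1000 * 1000 + n * 1000 + k) : -1;
}

// Recover (M, N, K) from the packed value.
constexpr std::tuple<int, int, int> intToShapeTuple(int encoded)
{
    return {encoded / (1000 * 1000), (encoded / 1000) % 1000, encoded % 1000};
}

// Compile-time round-trip check.
static_assert(intToShapeTuple(shapeTupleToInt(128, 256, 64)) == std::make_tuple(128, 256, 64));

The same packed value can back an int-based enum, which is what makes the enum values self-describing across SM families.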

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1)

37-41: FAST_BUILD filter logic is concise and correct

The filter disables unsupported shapes or dynamic CGA in fast builds. Looks good.
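
Roughly speaking, the gating amounts to a trait like the sketch below; the trait and macro names are assumptions that only illustrate the shape of the logic, not the real cutlass_heuristic.h code:

// In a fast build, keep only the baseline shape and reject dynamic CGA outright.
template <bool IsBaselineShape, bool DynamicCga>
struct ShouldFilterInFastBuildSketch
{
#ifdef FAST_BUILD
    static constexpr bool value = !IsBaselineShape || DynamicCga;
#else
    static constexpr bool value = false;
#endif
};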

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

449-467: Good: fallback_cluster_shape is now used correctly

The second argument to get_candidate_configs_sm100_dynamic_cluster_shape passes the computed fallback shape, fixing the earlier logic error.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)

71-117: Dispatch table for SM100 is clean and extensible

The 2D function map over epilogue schedule and dynamic_cga makes the runtime selection straightforward. Once <array> is included, this pattern is solid.
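
For reference, the pattern reads roughly like the sketch below; the launcher names and index convention are hypothetical, and the real table is assembled inside getDispatchFunctionForSM100:

#include <array>
#include <cstdio>

using LaunchFn = void (*)();

void launchTmaStatic()
{
    std::puts("TMA epilogue, static cluster");
}

void launchTmaDynamic()
{
    std::puts("TMA epilogue, dynamic cluster");
}

void launchNoSmemStatic()
{
    std::puts("no-smem epilogue, static cluster");
}

void launchNoSmemDynamic()
{
    std::puts("no-smem epilogue, dynamic cluster");
}

// Select a launcher by (epilogue schedule, dynamic_cga) via a 2D table instead of nested branches.
LaunchFn selectLauncher(int epilogueIdx /* 0 = TMA, 1 = no-smem */, bool dynamicCga)
{
    static constexpr std::array<std::array<LaunchFn, 2>, 2> sTable{
        {{launchTmaStatic, launchTmaDynamic}, {launchNoSmemStatic, launchNoSmemDynamic}}};
    return sTable[epilogueIdx][dynamicCga ? 1 : 0];
}

A caller then just indexes the table, e.g. selectLauncher(/* epilogueIdx = */ 0, /* dynamicCga = */ true)(), and adding a new epilogue schedule means appending one row rather than another branch.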


336-370: Error messages improved (config.toString), switch pruning is fine

The reporting of unsupported shapes includes the full config string, which will help debugging tactic selection. Looks good.

@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from 7c626c3 to 78507c5 Compare August 17, 2025 23:44
@djns99
Copy link
Collaborator Author

djns99 commented Sep 1, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17144 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17144 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12891 completed with status: 'FAILURE'

@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from e0ab0ce to 94376a3 Compare September 1, 2025 21:51
@djns99
Copy link
Collaborator Author

djns99 commented Sep 1, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17255 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17255 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12971 completed with status: 'FAILURE'

…schedules

Signed-off-by: Daniel Stokes <[email protected]>

fix: Fix compilation errors for no smem epilogue

Signed-off-by: djns99 <[email protected]>

fix: Address linker issues

Signed-off-by: djns99 <[email protected]>

fix: Correct list of supported tactics

Signed-off-by: djns99 <[email protected]>

fix: Add pretty printing for tile shapes

Signed-off-by: djns99 <[email protected]>

Fix code rabbit comments

Signed-off-by: djns99 <[email protected]>

Fix benchmark tactic logic for single gemm profiling

Signed-off-by: djns99 <[email protected]>

Cleanup gemm_configs.h changes

Signed-off-by: djns99 <[email protected]>

Improve tactic reporting in MOE backend benchmarks

Signed-off-by: djns99 <[email protected]>

Reduce instantiated configs

Signed-off-by: djns99 <[email protected]>

Skip some configs to reduce test runtime

Signed-off-by: djns99 <[email protected]>

Fix code rabbit comments

Signed-off-by: djns99 <[email protected]>

Fix rebase issues in launcher logic

Signed-off-by: djns99 <[email protected]>

Address feedback

Signed-off-by: djns99 <[email protected]>
@djns99 djns99 force-pushed the user/djns99/dynamic_cluster_moe branch from f28c8be to e4f486a Compare September 2, 2025 04:00
@djns99
Copy link
Collaborator Author

djns99 commented Sep 2, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17295 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17295 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12998 completed with status: 'FAILURE'

@djns99
Copy link
Collaborator Author

djns99 commented Sep 2, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17414 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17414 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13088 completed with status: 'SUCCESS'

@djns99
Copy link
Collaborator Author

djns99 commented Sep 3, 2025

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Sep 3, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (6)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1)

1-2: Add (or reuse) the version.txt guard in CI to prevent drift

Same ask as earlier threads; consolidating for both arches to catch stale hashes or wrong uploads.

#!/bin/bash
set -euo pipefail

A64="cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt"
X64="cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt"

check_ver() {
  local ver="$1"
  local dir
  dir="$(dirname "$ver")"
  local HASH FILE COMMIT
  HASH="$(awk 'NR==1{print $1}' "$ver")"
  FILE="$(awk 'NR==1{print $2}' "$ver")"
  COMMIT="$(awk 'NR==2{print $2}' "$ver")"

  [[ "$HASH" =~ ^[0-9a-f]{64}$ ]] || { echo "Invalid SHA-256 in $ver: $HASH"; exit 1; }
  [[ "$COMMIT" =~ ^[0-9a-f]{40}$ ]] || { echo "Invalid commit in $ver: $COMMIT"; exit 1; }

  if [[ -f "${dir}/${FILE}" ]]; then
    ACTUAL="$(sha256sum "${dir}/${FILE}" | awk '{print $1}')"
    [[ "$ACTUAL" == "$HASH" ]] || { echo "Checksum mismatch for ${FILE} in ${dir}"; exit 1; }
    echo "Checksum OK for ${FILE}"
  else
    echo "Artifact ${dir}/${FILE} not found; skipping checksum."
  fi
}

check_ver "$A64"
check_ver "$X64"

# Ensure both arches point to the same internal commit
C1="$(sed -n '2p' "$A64")"
C2="$(sed -n '2p' "$X64")"
[[ "$C1" == "$C2" ]] || { echo "Commit mismatch between arches: ${C1} vs ${C2}"; exit 1; }
echo "version.txt checks passed for both arches."
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

301-327: Make GEMM profiling warning visible and fix optional comparison

Two issues:

  • Warning is gated at ERROR; users won’t see it.
  • Comparing std::optional directly against the int sentinel leans on the mixed optional/value comparison overloads and reads ambiguously; an explicit has_value() check plus dereference is clearer.

Apply:

-            static bool printed = false;
-            if (routing_config != UNIFORM_ROUTING_CONFIG && LOG_LEVEL >= ERROR && !printed)
+            static bool printed = false;
+            if (routing_config.has_value() && *routing_config != UNIFORM_ROUTING_CONFIG
+                && LOG_LEVEL >= INFO && !printed)
             {
                 std::cerr << "Warning: Profiling a specific GEMM will always use uniform random token distribution"
                           << std::endl;
                 printed = true;
             }
             routing_config = UNIFORM_ROUTING_CONFIG;

This is the same class of issue flagged previously; reapplying here.

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

370-437: Add missing <tuple> include in cutlass_heuristic.cpp
At the top of cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp add

#include <tuple>

This file calls std::get on the result of enum_to_shape_tuple, so it must explicitly include <tuple> rather than relying on transitive includes.

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)

241-241: Reassign the replace() result or drop the dead line

Python strings are immutable; this call is a no-op and kernel_sched isn’t used later in this branch. Either reassign or delete to avoid confusion.

Apply one of:

-            kernel_sched.replace("::Kernel", "::KernelGrouped")
+            kernel_sched = kernel_sched.replace("::Kernel", "::KernelGrouped")

or

-            kernel_sched.replace("::Kernel", "::KernelGrouped")
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (2)

71-110: Missing include for std::array

This block uses std::array; add the header to avoid relying on transitive includes.

 #include "cute/tensor.hpp"
+#include <array>

166-176: Guard dynamic-cluster shape conversion; avoid Undefined → bogus shapes

Only compute and pass cluster shapes when dynamic_cga is enabled; also keep fallback guarded. This prevents constructing (0,0,0) from Undefined.

-            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
-            bool const swap_ab = hopper_input.swap_ab;
-            auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
-            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
-            auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
-            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
-                std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            bool const dynamic_cga = gemm_config.dynamic_cluster_shape != cutlass_extensions::ClusterShape::Undefined;
+            bool const swap_ab = hopper_input.swap_ab;
+            auto cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{};
+            auto cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{};
+            if (dynamic_cga)
+            {
+                auto cluster_shape = cutlass_extensions::enum_to_shape_tuple(gemm_config.dynamic_cluster_shape);
+                cluster_shape_cute = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape), std::get<1>(cluster_shape), cute::_1{}};
+                auto cluster_shape_fallback = cutlass_extensions::enum_to_shape_tuple(gemm_config.fallback_cluster_shape);
+                cluster_shape_cute_fallback = cute::Shape<int32_t, int32_t, cute::_1>{
+                    std::get<0>(cluster_shape_fallback), std::get<1>(cluster_shape_fallback), cute::_1{}};
+            }
🧹 Nitpick comments (6)
tests/unittest/_torch/thop/parallel/test_moe.py (3)

822-824: Preserve coverage while avoiding timeouts: gate the 4096-token case behind an env var

Nice trim for CI stability. To still exercise the largest path in extended runs, gate 4096 behind an opt-in flag.

Apply:

-    @pytest.mark.parametrize("num_tokens", [16, 64, 4096])
+    @pytest.mark.parametrize("num_tokens", [16, 64] + ([4096] if HEAVY else []))

Add once near the imports (os is already imported):

HEAVY = os.environ.get("TRTLLM_HEAVY_TESTS") == "1"

I can push the minimal patch across both FP8/FP4 suites if you’d like.


823-824: Expert-info set reduction looks fine; keep one n_groups=1/top_k<8 light config

Dropping (32, 1, 1, 5) removes a quick n_groups=1/top_k<8 case. Consider re-adding that single light combo to retain a distinct routing shape without noticeable runtime impact.


941-941: Same gating for FP4: make 4096-token run opt-in

Mirror the FP8 change so day-to-day runs stay fast.

Apply:

-    @pytest.mark.parametrize("num_tokens", [1, 4096])
+    @pytest.mark.parametrize("num_tokens", [1] + ([4096] if HEAVY else []))
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

170-172: Avoid dumping entire JSON to stdout at INFO

Large files spam logs and slow CI. Print filename and a short preview instead.

Apply:

-    if (LOG_LEVEL >= INFO)
-        std::cout << "Loaded benchmark file: " << file_contents << std::endl;
+    if (LOG_LEVEL >= INFO)
+    {
+        auto preview_len = std::min<size_t>(file_contents.size(), 512);
+        std::cout << "Loaded benchmark file: " << workloadFile << " (" << file_contents.size()
+                  << " bytes) preview: " << file_contents.substr(0, preview_len) << (file_contents.size() > preview_len ? "..." : "")
+                  << std::endl;
+    }
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

449-468: Candidate explosion control: keep 4x2x1 but consider dedup or tiering

The added sweep across two epilogues and multiple cluster shapes is comprehensive; it can noticeably lengthen tactic profiling. If profiling time regresses in CI, consider:

  • de-duplicating identical configs produced via different dynamic/fallback pairs (a minimal sketch follows below), or
  • gating the non-1x1x1 cluster shapes behind a FAST_BUILD-like flag.
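
A minimal sketch of the de-duplication option, assuming it keys on CutlassGemmConfig::toString() (the helper name here is hypothetical):

#include <set>
#include <string>
#include <vector>

// Keep only the first occurrence of each config, as identified by its string form.
template <typename ConfigT>
std::vector<ConfigT> dedupConfigs(std::vector<ConfigT> const& candidates)
{
    std::set<std::string> seen;
    std::vector<ConfigT> unique;
    for (auto const& config : candidates)
    {
        if (seen.insert(config.toString()).second)
        {
            unique.push_back(config);
        }
    }
    return unique;
}
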
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

508-516: Label nit: avoid “tile_config_sm90_enum” for all SMs

The stream label is SM90-specific but this path prints for SM90/100/120. Use a neutral label.

-        out << "tile_config_sm90_enum: " << config.getTileConfigAsInt()
+        out << "tile_config_enum: " << config.getTileConfigAsInt()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 08eebd6 and c7f965e.

⛔ Files ignored due to path filters (2)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
📒 Files selected for processing (14)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (10 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (12 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2 hunks)
  • tests/unittest/_torch/thop/parallel/test_moe.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.inl
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.h
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.h
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-08-21T02:41:10.565Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h:141-145
Timestamp: 2025-08-21T02:41:10.565Z
Learning: In TensorRT-LLM MOE GEMM kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h), the stride_act and stride_weight pointers in TmaWarpSpecializedGroupedGemmInput are intentionally declared as void* rather than typed pointers because the actual stride type is determined at runtime based on factors like the swap_ab flag and layout decisions. This runtime type determination makes compile-time type safety impossible, so void* is the correct approach.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h
📚 Learning: 2025-09-01T18:06:21.617Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-01T18:06:21.617Z
Learning: Applies to **/*.{h,hpp,hh,hxx} : Use include guards in headers with symbol format TRTLLM_<FILENAME>_H (no underscores prefix/suffix; filename only)

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
🪛 LanguageTool
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt

[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 33f251e0599197ad3e6c59d64a42f9721...

(QB_NEW_EN)


[grammar] ~2-~2: There might be a mistake here.
Context: ...33f251e0599197ad3e6c59d64a42f9721d3cc27c

(QB_NEW_EN)

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt

[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 33f251e0599197ad3e6c59d64a42f9721...

(QB_NEW_EN)


[grammar] ~2-~2: There might be a mistake here.
Context: ...33f251e0599197ad3e6c59d64a42f9721d3cc27c

(QB_NEW_EN)

🪛 Ruff (0.12.2)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py

212-212: EpilogueScheduleTag may be undefined, or defined from star imports

(F405)


355-355: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


372-372: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


419-419: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


732-732: EpilogueScheduleType may be undefined, or defined from star imports

(F405)


733-733: EpilogueScheduleType may be undefined, or defined from star imports

(F405)

🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1)

1-2: Format and values look good

SHA-256 (64 hex) and commit (40 hex) formats are valid; commit aligns with x86_64 entry in this PR. Not flagging any API-drift concerns here per your prior note that internal_cutlass_kernels intentionally diverges.

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1)

1-2: Consistent update across arches

Hash/filename line and commit line conform to expected format; commit matches the AArch64 entry.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

163-166: Good: fail fast when the workload file can’t be opened

Early, explicit invalid_argument is the right behavior here.

cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (2)

353-361: SM100 validity gating looks correct

CGA limited to (1|2)x1x1 and finalize-fusion+non-TMA excluded. Matches dispatch constraints.

Also applies to: 372-373


730-746: Good: enumerate SM100 epilogues and dynamic_cga, feed into partial_args

Explicitly iterating epi_schedules and dynamic_cga improves coverage and keeps validation consistent with SM100 path.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (1)

202-207: Tile-shape predicates for SM100 read well

Constraining ClusterShape to (1|2)x1x1 at compile-time aligns with the generator’s runtime gating.
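
The constraint is simple enough to state as a standalone compile-time predicate; the sketch below uses plain ints and an illustrative name, whereas the real code works on cute shape types:

// True only for the (1|2)x1x1 cluster shapes that SM100 dynamic CGA supports.
template <int M, int N, int K>
inline constexpr bool kIsValidSm100DynamicClusterSketch = (M == 1 || M == 2) && N == 1 && K == 1;

static_assert(kIsValidSm100DynamicClusterSketch<2, 1, 1>);
static_assert(!kIsValidSm100DynamicClusterSketch<2, 2, 1>);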

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

85-101: Nice: constexpr encode/decode of shapes

shape_tuple_to_enum and enum_to_shape_tuple are clear and guarded; the <1000 constraint avoids overflow.


451-463: Readable tile-config names across SMs

Switching to get_tile_shape_name() fixes the “0x0x1” issue for heuristic/undefined.

Copy link
Collaborator

@hyukn hyukn left a comment


LGTM for the tuning part. It is very clean, and there appear to be no direct changes to the MoE tunable ops' API. Thanks a lot for the work.
We can then profile the warm-up time increase and see whether it is significant once the PR is merged.

@djns99
Copy link
Collaborator Author

djns99 commented Sep 3, 2025

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17439 [ reuse-pipeline ] triggered by Bot

@djns99 djns99 enabled auto-merge (squash) September 3, 2025 01:44
@tensorrt-cicd
Copy link
Collaborator

PR_Github #17439 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #17414 for commit c7f965e

@djns99 djns99 merged commit 109f272 into NVIDIA:main Sep 3, 2025
7 checks passed