
Conversation

Collaborator

@djns99 djns99 commented Aug 14, 2025

Summary by CodeRabbit

  • New Features

    • Per‑GEMM tactic selection and profiling (independent choices for GEMM 1 and GEMM 2).
    • Optional "finalize‑fusion" variants where supported, with safe hardware gating.
    • Torch custom ops accept gemm_idx for per‑GEMM execution; autotuner and runners forward extra context via kwargs.
  • Documentation

    • CLI/JSON schema and help updated to use tactic_id1/tactic_id2 and show per‑GEMM listings.
  • Tests

    • Unit tests updated to cover per‑GEMM flows and fused‑finalize behavior.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from the specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
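
For example, a typical invocation that runs a single stage with fail-fast disabled would be: /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast (the stage name here is illustrative).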

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

@djns99 djns99 requested a review from a team as a code owner August 14, 2025 22:43
@djns99 djns99 requested a review from liji-nv August 14, 2025 22:43
Contributor

coderabbitai bot commented Aug 14, 2025

📝 Walkthrough


Introduces per‑GEMM tactic selection (MoeGemmId) and epilogue finalize‑fusion awareness across MOE runners, profilers, benchmarks, Torch bindings, plugins, and tests; threads kwargs through autotuner/custom ops; updates several public method signatures, data layouts, and internal kernel artifact versions.

Changes

Cohort: Per-GEMM tactics plumbing
Files: cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h, cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu, cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp, cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h, cpp/tensorrt_llm/thop/moeOp.cpp
Summary: Replace single global tactic list with gemm-specific tactic vectors via MoeGemmId; update getTactics callers, CLI/JSON parsing and help (tactic_id1/tactic_id2), logging, and Torch/FusedMoeRunner plumbing for per-GEMM profiles and tactic counts.

Cohort: MOE GEMM API & finalize-fusion support
Files: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h, cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h, cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h, cpp/tensorrt_llm/kernels/cutlass_kernels/moe_kernels.cu, cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h, cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
Summary: Add MoeGemmId enum and epilogue fusion enum/field; change getTactics/getConfigs APIs to accept gemm_id and supports_finalize_fusion; duplicate TMA-specialized configs with FINALIZE variants when allowed; add mayHaveFinalizeFused/per-SM checks; update workspace preparation to be fusion-aware and use per-sample per-fusion caches; adjust data layouts and signatures accordingly.

Cohort: Benchmark/launcher and parsing updates
Files: cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
Summary: Change listAllTactics to accept MoeGemmId; remove object-format tactic parsing helper; support independent tactic_id1/tactic_id2 and per-GEMM listings, warnings, and help text.

Cohort: Tests updated for per-GEMM & finalize
Files: cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
Summary: Split tactic filtering per GEMM, add mUseFusedFinalize flag, select configs considering epilogue_fusion_type, and sweep/test pairwise GEMM_1/GEMM_2 config combinations.

Cohort: Torch autotuner & custom ops kwargs threading
Files: tensorrt_llm/_torch/autotuner.py, tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py, tests/unittest/_torch/misc/test_autotuner.py
Summary: Thread **kwargs into get_valid_tactics and autotuner profiling; add gemm_idx argument propagation in MoERunner.forward and related wrappers; update test runners to accept **kwargs.

Cohort: Plugin type alias exposure
Files: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
Summary: Add public alias using MoeGemmId = CUTLASS_MOE_GEMM_NAMESPACE::MoeGemmId;.

Cohort: Kernel wrappers / headers adjustments
Files: cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
Summary: Minor whitespace/namespace adjustments (non-functional).

Cohort: Internal kernel artifact versions
Files: cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt, cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
Summary: Bump static library hashes and commit refs for internal Cutlass kernels.
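
To make the new plumbing concrete, here is a small self-contained C++ sketch of the API shape summarized above. All names ending in Sketch are hypothetical stand-ins rather than the PR's actual headers; only MoeGemmId's GEMM_1/GEMM_2 values and the getTactics(gemm_id) / epilogue_fusion_type names are taken from the summary.

#include <vector>

// Stand-in for the extended CutlassGemmConfig: each tactic now carries an epilogue fusion type.
struct CutlassGemmConfigSketch
{
    enum class EpilogueFusionType
    {
        NONE,
        FINALIZE
    };

    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

// Identifies which of the two MOE GEMMs a tactic list belongs to.
enum class MoeGemmId
{
    GEMM_1,
    GEMM_2
};

// Stand-in for the runner side: tactics are kept and returned per GEMM instead of globally.
struct MoeRunnerSketch
{
    std::vector<CutlassGemmConfigSketch> getTactics(MoeGemmId gemm_id) const
    {
        return gemm_id == MoeGemmId::GEMM_1 ? gemm1_configs : gemm2_configs;
    }

    std::vector<CutlassGemmConfigSketch> gemm1_configs;
    std::vector<CutlassGemmConfigSketch> gemm2_configs;
};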

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Bench as Benchmark/Plugin/Test
  participant Torch as Torch API / Autotuner
  participant Runner as CutlassMoeFCRunner
  participant Moe as MoeGemmRunner
  participant Prof as GemmProfilerBackend

  Note over Bench,Torch: Per‑GEMM tactic selection is requested
  Bench->>Runner: getTactics(gemm_id)
  Runner->>Moe: getTactics(sm, gemm_id) / getConfigs(sm, supports_finalize_fusion)
  Moe-->>Runner: configs[tactic].epilogue_fusion_type
  Runner-->>Bench: tactics list for gemm_id

  Note over Torch,Prof: Prepare TMA WS templates (both fusion types)
  Torch->>Prof: prepare(...)
  Prof->>Prof: prepareTmaWsInputs(fusion=NONE)
  Prof->>Prof: prepareTmaWsInputs(fusion=FINALIZE)

  Note over Bench,Prof: Run with chosen tactic
  Bench->>Prof: run(profile, tactic)
  alt tactic.epilogue_fusion_type == FINALIZE
    Prof->>Prof: select FINALIZE cache/template for sample
  else
    Prof->>Prof: select NONE cache/template for sample
  end
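
Building on the stand-in types sketched after the change list, the profiling branch in the diagram might look roughly like the following (illustrative only; the real profiler plumbing differs):

// Enumerate GEMM_2 tactics and branch on each tactic's epilogue fusion type, as in the diagram.
void profileGemm2Sketch(MoeRunnerSketch const& runner)
{
    for (auto const& tactic : runner.getTactics(MoeGemmId::GEMM_2))
    {
        if (tactic.epilogue_fusion_type == CutlassGemmConfigSketch::EpilogueFusionType::FINALIZE)
        {
            // Profile this tactic against the FINALIZE workspace template prepared up front.
        }
        else
        {
            // Profile this tactic against the plain (NONE) workspace template.
        }
    }
}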

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

SW Architecture

Suggested reviewers

  • litaotju
  • pcastonguay
  • nv-guomingz
  • juney-nvidia
  • yizhang-nv
  • hyukn



📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 71f9272 and 2d7fb1b.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h (0 hunks)
💤 Files with no reviewable changes (1)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@djns99 djns99 requested review from hyukn and litaotju August 14, 2025 22:43
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🔭 Outside diff range comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

52-86: parseTacticToVectorID signature must carry gemm_id (compile error).

You’re calling parseTacticToVectorID(..., MoeGemmId::GEMM_X) below, but the function signature doesn’t accept gemm_id. Also, the "all" branch calls listAllTactics<BenchClass>() with no argument.

Apply this diff to fix both the signature and calls within:

-template <class BenchClass>
-void parseTacticToVectorID(nlohmann::json& tactic, std::vector<int>& tactic_ids)
+template <class BenchClass>
+void parseTacticToVectorID(nlohmann::json& tactic, std::vector<int>& tactic_ids, MoeGemmId gemm_id)
 {
     if (tactic.is_number_integer())
     {
         tactic_ids.push_back(tactic.get<int>());
     }
     else if (tactic.is_array())
     {
         for (auto c : tactic)
         {
-            parseTacticToVectorID<BenchClass>(c, tactic_ids);
+            parseTacticToVectorID<BenchClass>(c, tactic_ids, gemm_id);
         }
     }
     else if (tactic.is_string())
     {
         assert(tactic.is_string());
         auto tactic_name = tactic.get<std::string>();
         if (tactic_name == "all")
         {
-            auto all_tactics = listAllTactics<BenchClass>();
+            auto all_tactics = listAllTactics<BenchClass>(gemm_id);
             tactic_ids.resize(all_tactics.size());
             std::iota(tactic_ids.begin(), tactic_ids.end(), 0);
         }
         else
         {
             assert(tactic.get<std::string>() == "auto");
             tactic_ids.push_back(-1);
         }
     }
     else
     {
         throw std::invalid_argument("Invalid tactic format");
     }
 }

286-301: Fix debug printing of valid tactics (calls outdated API).

This block calls listAllTactics<BenchClass>() with no gemm_id. Print per-GEMM tactics to align with the new API.

Apply this diff:

-                std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n";
-                auto confs = listAllTactics<BenchClass>();
-                for (auto c : confs)
-                    std::cerr << c.toString();
+                std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n";
+                for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2})
+                {
+                    auto confs = listAllTactics<BenchClass>(gemm_id);
+                    int i = 0;
+                    for (auto const& c : confs)
+                    {
+                        std::cerr << "[GEMM " << static_cast<int>(gemm_id) << "] Tactic " << i++ << ":\n"
+                                  << c.toString();
+                    }
+                }
🧹 Nitpick comments (6)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

1-1: Add NVIDIA SPDX header to comply with repository guidelines

Python sources must carry the NVIDIA copyright header.

Apply at file top:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/autotuner.py (1)

1-1: Add NVIDIA SPDX header to comply with repository guidelines

Python sources must carry the NVIDIA copyright header.

Apply at file top:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

1-15: Update copyright year

Header lists 2020-2023. The guideline requests current-year coverage. Consider updating to include 2025.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2)

1120-1142: Tactic selection considers finalize-fusion only for GEMM_2 (as intended).

  • is_tma_warp_specialized gating and is_finalize_fusion for GEMM_2 only is consistent with how finalize-fusion is applied in epilogue.
  • Fallback to the first tactic if a perfect match isn’t found is a pragmatic default.

Consider emitting a one-line INFO log when falling back (to help diagnose missing matches without running at VERBOSE).


1641-1642: Determinism guard is appropriate.

Using !gemm2.is_finalize_fusion || mK < 3 || getSMVersion() < 90 || getSMVersion() >= 120 captures the conditions where results should be deterministic.

This logic is duplicated in multiple places; consider extracting a small helper (e.g., shouldBeDeterministic(config, k)) to avoid divergence over time.
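
A minimal sketch of such a helper, following the reviewer's suggested name; the config struct and the explicit smVersion parameter are hypothetical stand-ins for the test's tactic type and getSMVersion():

struct Gemm2TacticSketch
{
    bool is_finalize_fusion = false;
};

// True when results should be bit-exact: finalize fusion is off, or k/SM fall outside the
// non-deterministic window (k >= 3 on SM90 through SM11x), mirroring the guard quoted above.
bool shouldBeDeterministic(Gemm2TacticSketch const& gemm2, int k, int smVersion)
{
    return !gemm2.is_finalize_fusion || k < 3 || smVersion < 90 || smVersion >= 120;
}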

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

627-632: Static getTactics path differs from instance path.

Here you rely on RunnerType::getConfigs(sm, bool) to expand finalize-fusion options, whereas the instance path uses addFinalizeFusionConfigs post-processing. If both code paths are expected to behave identically, consider unifying through addFinalizeFusionConfigs for consistency across environments.

If RunnerType::getConfigs(sm, bool) already includes fused variants, leave as-is; otherwise prefer a single expansion mechanism to avoid drift.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 078e907 and 8b09273.

📒 Files selected for processing (12)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (5 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (6 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (10 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/autotuner.py
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🔇 Additional comments (23)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

84-86: Confirmed: gemm_idx semantics — 1 → GEMM_1, otherwise → GEMM_2 (Python passes 1/2 correctly; default 0 maps to GEMM_2)

C++ binding torch_ext::FusedMoeRunner::getTacticNum(int gemm_idx) returns mGemm1Profiles.size() if gemm_idx == 1, otherwise mGemm2Profiles.size(); Python forwards gemm_idx directly and uses 1 and 2 in tuning calls.

Relevant locations:

  • cpp/tensorrt_llm/thop/moeOp.cpp: getTacticNum implementation (lines ~574–577) and Python binding (.def("get_tactic_num", ...) at ~1057–1060).
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py: get_valid_tactics (lines ~80–86) — returns range(self.fused_moe_runner.get_tactic_num(gemm_idx)); forward signature default gemm_idx=0 (lines ~88–93); tuning calls use gemm_idx=1 and gemm_idx=2 (lines ~185–199).

Conclusion: indexing matches—passing 1 selects GEMM_1 and passing 2 (or 0/other) selects GEMM_2. Optional: consider documenting or changing the forward default (gemm_idx=0) if treating 0 as GEMM_2 is surprising.
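
For quick reference, the mapping verified above boils down to the following paraphrase (stand-in member types; not the verbatim binding in moeOp.cpp):

#include <cstdint>
#include <vector>

struct FusedMoeRunnerSketch
{
    std::vector<int> mGemm1Profiles; // Stand-ins for the real per-GEMM profile lists.
    std::vector<int> mGemm2Profiles;

    // gemm_idx == 1 selects GEMM_1; any other value, including the Python-side forward() default
    // of 0, falls through to GEMM_2, which is why the default may be surprising.
    int64_t getTacticNum(int64_t gemm_idx) const
    {
        return gemm_idx == 1 ? static_cast<int64_t>(mGemm1Profiles.size())
                             : static_cast<int64_t>(mGemm2Profiles.size());
    }
};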

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

506-508: toString additions look good

Including enable cuda kernel and is_finalize_fusion in the TMA WS path is helpful for debugging.

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1)

294-300: Inline const overload delegating to static supportsTmaWarpSpecialized(int sm) is correct

Nice refactor for per-SM checks while preserving the const method call sites.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

569-574: Static supportsTmaWarpSpecialized(int sm) implementation LGTM

Covers Hopper (90), Blackwell [100,120), and SM120/121 explicitly; aligns with the new const wrapper.
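
In other words, the gating described here reduces to something like this free-function restatement (for illustration only, not the member function itself):

// True for Hopper (SM90), Blackwell in [100, 120), and the SM120/121 family.
bool supportsTmaWarpSpecializedSketch(int sm)
{
    bool const isHopper = sm == 90;
    bool const isBlackwell = sm >= 100 && sm < 120;
    bool const isSm120Family = sm == 120 || sm == 121;
    return isHopper || isBlackwell || isSm120Family;
}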

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4016-4018: Verify finalize epilogue shape ordering (rows should be num_output_tokens, not hidden_size)

Given the swap+transpose path in the finalize-fusion epilogue, the stride/shape should be set such that rows map to num_output_tokens and columns to hidden_size. Double-check that setFinalizeFusionParams(final_output, hidden_size, num_rows, use_reduction) passes arguments in the correct order for the underlying expected layout.

cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

1281-1282: LGTM: per-GEMM tactic retrieval aligns backend profiling target

Returning tactics keyed by backend.mGemmToProfile keeps the profiler-to-runner contract consistent.

Ensure backend.mGemmToProfile is set prior to this call in all code paths (it is in initialize(), but double-check for any direct usage paths).

cpp/tensorrt_llm/thop/moeOp.cpp (1)

221-223: LGTM: split per-GEMM tactic lists at construction

Using getTactics(MoeGemmId::GEMM_1/2) upfront enables independent selection and profiling per GEMM.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (4)

836-836: Per-GEMM tactic retrieval looks correct (cast guarded by call sites).

pickBestTactic is only called with GEMM_1 or GEMM_2, so the static_cast<MoeGemmId>(gemm_to_profile) here won’t see LAYER. No action needed.


928-953: Correctly splits tactics per GEMM and validates indices.

  • Fetching separate vectors for GEMM_1 and GEMM_2 and choosing based on combo.second is clean.
  • Range checks on t before indexing keep it safe.

970-979: Using the GEMM-specific tactic for profiler execution is aligned.

Assuming setTactic was called and validated earlier (as in runBenchmark), this index access is safe and consistent.


1079-1084: Improved logging for per-GEMM tactics.

Clear separation of GEMM_1 and GEMM_2 tactics with counts and toString() improves diagnosability.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (5)

373-375: New fused-finalize toggle is sane by default.

Defaulting mUseFusedFinalize to false while still enabling fused-finalize for k < 3 via runtime ensures both paths get exercised in tests.


459-459: Runtime toggle of fused-finalize matches intent.

mMoERunner.use_fused_finalize_ = k < 3 || mUseFusedFinalize; matches the comment and ensures deterministic coverage across k values.


1090-1119: Per-GEMM filtering wrapper reads cleanly.

Fetching tactics via mMoERunner.getTactics(gemm_id) and applying arch-specific filters is a good split of concerns.


1759-1763: Non-deterministic test path is properly gated.

Enabling fused-finalize only for PermuteNonDeterministic keeps the test suite behavior targeted.


2114-2121: Config sweep now respects per-GEMM tactic sets.

Iterating configs1 and configs2 and asserting both names are non-empty prevents accidental uninitialized tactics.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

45-50: Per-GEMM listAllTactics is the right API.

Returning RunnerType::getTactics(sm, gemm_id) aligns with the new per-GEMM contracts.


282-284: Calls to parseTacticToVectorID match the intended gemm_id, but rely on the wrong signature.

Once you apply the signature fix above, these calls are correct.

After applying the signature fix, please ensure there are no remaining calls to the old two-argument parseTacticToVectorID. A quick search in this file should suffice.


415-427: Verbose listing of tactics per GEMM is excellent.

This will help users construct correct tactic_id1/tactic_id2 values per architecture.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)

231-237: MoeGemmId introduction is clear and scoped.

Having a shared enum for GEMM ids aligns the API across profilers, runners, and benchmark code.


456-457: Interface change: getTactics now requires MoeGemmId.

This is a breaking API change. Ensure all implementations of CutlassMoeFCRunnerInterface and all call sites are updated accordingly (you’ve addressed key sites in this PR).

If not already done, please verify that no other non-bench code still calls the old getTactics() without arguments.


827-832: Static finalize-fusion capability check is reasonable.

Gates by TMA WS support, SM >= 90, and weight mode. Avoids coupling to instance use_fused_finalize_ which is fine for “listing” APIs.


931-931: Profiler enum alias to MoeGemmId is consistent.

This keeps the profiler aligned with the rest of the API. Note that “LAYER” sentinel isn’t part of MoeGemmId; local enums (like in benchmarks) bridge that gap safely.

Contributor

@sklevtsov-nvidia sklevtsov-nvidia left a comment


LGTM!

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from 8b09273 to e09cebe Compare August 14, 2025 23:49
@djns99 djns99 changed the title perf: Make finalize fusion part of the tactic selection logic [None][perf] Make finalize fusion part of the tactic selection logic Aug 14, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

70-76: Fix: "all" case clobbers previously accumulated tactic IDs when parsing arrays

When parsing arrays, encountering the "all" string currently resizes tactic_ids and overwrites prior entries (e.g., ["all", 2] loses "2"). This is especially problematic for sweeps that mix explicit IDs with "all".

Append instead of resize:

-        if (tactic_name == "all")
-        {
-            auto all_tactics = listAllTactics<BenchClass>(gemm_id);
-            tactic_ids.resize(all_tactics.size());
-            std::iota(tactic_ids.begin(), tactic_ids.end(), 0);
-        }
+        if (tactic_name == "all")
+        {
+            auto all_tactics = listAllTactics<BenchClass>(gemm_id);
+            auto n = static_cast<int>(all_tactics.size());
+            auto start = static_cast<int>(tactic_ids.size());
+            tactic_ids.reserve(start + n);
+            for (int i = 0; i < n; ++i)
+            {
+                tactic_ids.push_back(i);
+            }
+        }

276-284: Avoid operator[] insertion and handle partial tactic configs gracefully

Accessing run_config["tactic_id1"] / ["tactic_id2"] inserts nulls when keys are absent, causing parseTacticToVectorID to throw. This makes valid partial configs brittle. Use at() for safety and mirror the provided side when only one tactic key is present.

-        bool has_tactic_ids2 = false;
-        std::vector<int> tactic_ids1{};
-        std::vector<int> tactic_ids2{};
-        if (run_config.contains("tactic_id1") || run_config.contains("tactic_id2"))
-        {
-            has_tactic_ids2 = true;
-            parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1);
-            parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2);
-        }
+        bool has_tactic_ids2 = false;
+        std::vector<int> tactic_ids1{};
+        std::vector<int> tactic_ids2{};
+        bool hasT1 = run_config.contains("tactic_id1");
+        bool hasT2 = run_config.contains("tactic_id2");
+        if (hasT1)
+        {
+            parseTacticToVectorID<BenchClass>(run_config.at("tactic_id1"), tactic_ids1, MoeGemmId::GEMM_1);
+        }
+        if (hasT2)
+        {
+            parseTacticToVectorID<BenchClass>(run_config.at("tactic_id2"), tactic_ids2, MoeGemmId::GEMM_2);
+        }
+        // If only t2 is provided, mirror it into t1 so the sweep still runs.
+        if (!hasT1 && hasT2)
+        {
+            tactic_ids1 = tactic_ids2;
+        }
+        has_tactic_ids2 = hasT1 && hasT2;
♻️ Duplicate comments (2)
cpp/tensorrt_llm/thop/moeOp.cpp (2)

608-616: Harden profile selection and gemm_idx validation in runGemmProfile

Guard against invalid gemm_idx, empty tactic lists, and out-of-range profile_id to avoid UB from .front()/indexing.

Apply this diff:

-        auto const gemm_to_profile
-            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
-        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        TORCH_CHECK(gemm_idx == 1 || gemm_idx == 2, "gemm_idx must be 1 or 2");
+        auto const gemm_to_profile
+            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
+        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_%d", int(gemm_idx));
 
         // Get specific profile configs according to the profile_id.
         // Fallback tactic is set to be 0
         // TODO: use the best tactic id found offline for a better default inference perf
-        auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id];
+        const auto& profile = [&]() -> const Profile& {
+            if (profile_id == -1) return profiles.front();
+            TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()),
+                        "profile_id out of range for GEMM_%d: %ld", int(gemm_idx), profile_id);
+            return profiles[profile_id];
+        }();

717-726: Validate tactic vectors and ids in setRunnerProfiles

Protect against empty vectors and out-of-range indices before using .front() or .at().

Apply this diff:

-        auto best_gemm1_profile = mGemm1Profiles.front();
-        auto best_gemm2_profile = mGemm2Profiles.front();
+        TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available");
+        auto best_gemm1_profile = mGemm1Profiles.front();
+        auto best_gemm2_profile = mGemm2Profiles.front();
         if (profile_ids.has_value())
         {
             TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids");
-            best_gemm1_profile
-                = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]);
-            best_gemm2_profile
-                = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]);
+            if (profile_ids.value()[0] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[0] >= 0
+                        && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()),
+                        "profile_ids[0] out of range");
+                best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]);
+            }
+            if (profile_ids.value()[1] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[1] >= 0
+                        && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()),
+                        "profile_ids[1] out of range");
+                best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]);
+            }
         }
🧹 Nitpick comments (4)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)

513-515: Include epilogue_fusion_type in all toString branches for consistency

Currently only the TMA path prints epilogue_fusion_type. Consider appending it to the sm80-compatible toString branch as well so log output is uniform across architectures.

Outside the changed hunk, extend the sm80 branch like:

// in toString(), non-TMA branch
tactic << "\n\tstyle=compatible"
       << "\n\ttile shape ID: " << (int) tile_config_sm80 << "\n\tstages: " << (int) stages
       << "\n\tsplit k: " << (int) split_k_factor
       << "\n\tenable cuda kernel: " << (enableCudaKernel ? "true" : "false")
       << "\n\tepilogue fusion type: " << (int) epilogue_fusion_type;

546-548: Mirror epilogue_fusion_type in operator<< for non-TMA configs

Analogous to TMA printing, include epilogue_fusion_type in the non-TMA operator<< branch to keep streamed output aligned with toString.

Suggested change (outside the diff hunk):

-        out << "tile_config_enum: " << config.getTileConfigAsInt()
+        out << "tile_config_enum: " << config.getTileConfigAsInt()
             << ", split_k_style_enum: " << int(config.split_k_style)
             << ", split_k_factor: " << config.split_k_factor
             << ", stages: " << config.stages
-            << ", enable_cuda_kernel: " << (config.enableCudaKernel ? "true" : "false");
+            << ", enable_cuda_kernel: " << (config.enableCudaKernel ? "true" : "false")
+            << ", epilogue_fusion_type: " << int(config.epilogue_fusion_type);
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

66-71: Nit: Redundant assert in string branch

assert(tactic.is_string()) is redundant inside the tactic.is_string() branch. Safe to remove.

-        assert(tactic.is_string());
         auto tactic_name = tactic.get<std::string>();

393-393: Nit: Update comment to reflect new listAllTactics signature

The example still shows listAllTactics without a MoeGemmId; update for clarity.

-    auto cutlass_tactic = {-1};                           // {0,..., listAllTactics<BenchClass>(MoeGemmId).size()};
+    auto cutlass_tactic = {-1};                           // e.g. {0, ..., listAllTactics<BenchClass>(MoeGemmId::GEMM_1).size() - 1}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 8b09273 and e09cebe.

📒 Files selected for processing (13)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (10 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (2)

301-306: Public static SM-query and instance wrapper — LGTM

The inline wrapper to the static overload improves ergonomics and makes capability checks usable without an instance. No concerns.


306-306: Definition found — no action required

The static overload is defined — the inline instance method in the header forwards to it and the templated definition exists in the template dispatch header.

  • Declaration / inline forwarder: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (around lines 301–306) — the instance method returns supportsTmaWarpSpecialized(sm_) and the static declaration is present.
  • Definition: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h:569 — definition of MoeGemmRunner<T,WeightType,OutputType,ScaleBiasType>::supportsTmaWarpSpecialized(int sm).
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)

436-443: Epilogue fusion state in config — LGTM

Adding EpilogueFusionType and the epilogue_fusion_type field is a clean, explicit way to represent finalize-fusion. Good addition of printing in the TMA toString/operator<< paths.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)

2850-2854: Gate finalize-fusion workspace by epilogue_fusion_type — LGTM

Conditioning permuted_token_final_scales_ on FINALIZE via epilogue_fusion_type aligns workspace with the selected tactic.


4010-4016: Runtime alignment of finalize-fusion decision — LGTM

Replacing implicit SM checks with tactic-driven epilogue_fusion_type and asserting agreement with the runner’s flags removes ambiguity and catches mismatches early.

cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1)

46-46: Public MoeGemmId alias — LGTM

This unifies per-GEMM references in the plugins namespace and matches the rest of the PR’s per-GEMM changes.

cpp/tensorrt_llm/thop/moeOp.cpp (3)

51-51: Public MoeGemmId alias in Torch extension — LGTM

Keeps the Torch side aligned with core headers for per-GEMM flows.


222-224: Fetch tactics per GEMM — LGTM

Populating mGemm1Profiles and mGemm2Profiles via getTactics(MoeGemmId::GEMM_1/2) matches the new per-GEMM profiling model.


575-580: Validate gemm_idx in getTacticNum — LGTM

Adding TORCH_CHECK prevents silent fallbacks and returns explicit sizes. Good catch.
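
As a rough, self-contained sketch of the per-GEMM surface touched on in the two comments above: a hypothetical MoeGemmId enum, separate profile vectors per GEMM, and explicit gemm_idx validation. A plain exception stands in for TORCH_CHECK, and the type and member names are illustrative, not the real moeOp.cpp code.

#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins; the real types live in moe_kernels.h / moeOp.cpp and differ in detail.
enum class MoeGemmId : int
{
    GEMM_1 = 1,
    GEMM_2 = 2,
};

struct GemmProfile
{
    int id = 0;
};

// Pretend tactic source; the real code queries the MoE runner per GEMM.
std::vector<GemmProfile> getTactics(MoeGemmId gemm_id)
{
    return {GemmProfile{static_cast<int>(gemm_id) * 10}};
}

class MoeOpSketch
{
public:
    MoeOpSketch()
        : mGemm1Profiles(getTactics(MoeGemmId::GEMM_1))
        , mGemm2Profiles(getTactics(MoeGemmId::GEMM_2))
    {
    }

    // Explicit validation instead of a silent fallback; the thrown exception stands in for TORCH_CHECK.
    std::int64_t getTacticNum(std::int64_t gemm_idx) const
    {
        if (gemm_idx != 1 && gemm_idx != 2)
        {
            throw std::invalid_argument("gemm_idx must be 1 or 2");
        }
        return gemm_idx == 1 ? static_cast<std::int64_t>(mGemm1Profiles.size())
                             : static_cast<std::int64_t>(mGemm2Profiles.size());
    }

private:
    std::vector<GemmProfile> mGemm1Profiles;
    std::vector<GemmProfile> mGemm2Profiles;
};

int main()
{
    MoeOpSketch const op;
    return op.getTacticNum(2) == 1 ? 0 : 1;
}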

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

45-50: Per-GEMM tactic enumeration routed correctly

Passing MoeGemmId through to RunnerType::getTactics(sm, gemm_id) is the right direction for per-GEMM tactic listing. Looks good.


294-301: Helpful: per-GEMM tactic listing in warning path

Printing available tactics per GEMM is useful for debugging invalid tactic configs. LGTM.


420-431: Verbose per-GEMM tactic dump is clear and consistent

The new per-GEMM breakdown aligns with the per-GEMM selection model. Looks good.

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from 398bcac to 590e68a on August 15, 2025 05:31
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
tensorrt_llm/_torch/autotuner.py (1)

456-457: Filter kwargs before calling runner.get_valid_tactics to avoid TypeError across heterogeneous runners

Many existing runners don’t accept the new kwargs; passing all kwargs directly can break them. Filter kwargs against the target signature unless it accepts **kwargs.

-            valid_tactics = runner.get_valid_tactics(input_tensors, profile, **kwargs)
+            # Filter kwargs to those accepted by get_valid_tactics unless it has **kwargs
+            gvt_params = inspect.signature(runner.get_valid_tactics).parameters
+            accepts_var_kwargs = any(p.kind == inspect.Parameter.VAR_KEYWORD for p in gvt_params.values())
+            accepted_kw = {p.name for p in gvt_params.values()
+                           if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)}
+            filtered_kwargs = kwargs if accepts_var_kwargs else {k: v for k, v in kwargs.items() if k in accepted_kw}
+            valid_tactics = runner.get_valid_tactics(input_tensors, profile, **filtered_kwargs)
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4612-4614: Nit: make the bool→index conversion explicit for readability

Indexing with a bool works (0/1), but a cast improves clarity for future readers.

-        auto& cache_element = mTmaInputCache[i][use_finalize_fusion];
+        auto& cache_element = mTmaInputCache[i][static_cast<size_t>(use_finalize_fusion)];
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

603-621: Consider taking configs by value for broader API flexibility

Using an rvalue reference works with temporaries (e.g., getConfigs()), but taking by value supports both rvalues and lvalues and relies on moves/move elision where applicable. Optional API polish.

-    static auto addFinalizeFusionConfigs(
-        std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)
+    static std::vector<cutlass_extensions::CutlassGemmConfig> addFinalizeFusionConfigs(
+        std::vector<cutlass_extensions::CutlassGemmConfig> configs, bool use_fused_finalize)
     {
         if (!use_fused_finalize)
             return configs;
         size_t const num_configs = configs.size();
         for (size_t i = 0; i < num_configs; ++i)
         {
             if (configs[i].is_tma_warp_specialized)
             {
                 configs.push_back(configs[i]);
-                configs.back().epilogue_fusion_type
-                    = cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
+                configs.back().epilogue_fusion_type =
+                    cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
             }
         }
         return configs;
     }
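
To make the by-value call-site flexibility concrete, a small self-contained sketch; Config is a stand-in for the real CutlassGemmConfig and its fields are illustrative only.

#include <cstddef>
#include <vector>

// Stand-in for cutlass_extensions::CutlassGemmConfig; the fields are illustrative only.
struct Config
{
    bool is_tma_warp_specialized = false;
    bool finalize_fusion = false;
};

// By-value parameter: accepts lvalues (copied) and rvalues (moved) alike.
std::vector<Config> addFinalizeFusionConfigs(std::vector<Config> configs, bool use_fused_finalize)
{
    if (!use_fused_finalize)
    {
        return configs;
    }
    std::size_t const num_configs = configs.size();
    configs.reserve(num_configs * 2);
    for (std::size_t i = 0; i < num_configs; ++i)
    {
        if (configs[i].is_tma_warp_specialized)
        {
            configs.push_back(configs[i]);
            configs.back().finalize_fusion = true;
        }
    }
    return configs;
}

int main()
{
    std::vector<Config> existing{Config{/* is_tma_warp_specialized = */ true, false}, Config{false, false}};
    auto from_lvalue = addFinalizeFusionConfigs(existing, true);                                  // Copies the lvalue.
    auto from_rvalue = addFinalizeFusionConfigs(std::vector<Config>{Config{true, false}}, true);  // Moves the temporary.
    return (from_lvalue.size() == 3 && from_rvalue.size() == 2) ? 0 : 1;
}
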
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 398bcac and 31e66f0.

📒 Files selected for processing (13)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (7 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/autotuner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/autotuner.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧠 Learnings (4)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

2850-2854: LGTM: Cache finalize-fusion workspace only when tactic enables it

Gating permuted_token_final_scales_ on epilogue_fusion_type keeps workspace usage minimal and consistent with tactic selection.


4010-4016: LGTM: Explicit tactic/runner finalize-fusion alignment check

The equality check between requested finalize fusion and runner configuration prevents silent misconfigurations.


4558-4560: Signature extension for prepareTmaWsInputs is consistent with per‑GEMM fusion flows

Adding EpilogueFusion to the profiler prep API is coherent with the new per‑GEMM tactic surface.


4675-4678: LGTM: Precompute TMA inputs for both NONE and FINALIZE epilogues

Preparing both variants amortizes prep cost and enables quick switching during profiling.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)

231-236: LGTM: Introduce MoeGemmId to make per‑GEMM flows explicit

Clear, minimal enum that aids per‑GEMM tactic selection and profiling.


456-457: LGTM: getTactics now per‑GEMM

Making tactic retrieval gemm_id‑aware aligns the Python and C++ layers with the new per‑GEMM selection model.


1003-1004: LGTM: 2‑D TMA input cache keyed by finalize/non‑finalize

Keeps profiler prep artifacts separated per epilogue choice. Clean structure.


1016-1018: LGTM: prepareTmaWsInputs API updated to pass fusion

Header and implementation are in sync; enables precomputing per‑fusion TMA inputs.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4606-4616: Critical: TMA WS cache aliasing between FINALIZE and NONE in profiler

prepare() calls prepareTmaWsInputs twice (NONE and FINALIZE). Both calls start writing TMA WS templates from the same base workspace pointer (tma_ws_input_workspace), and getProfilerWorkspaces currently sizes only one block of TMA input workspace: workspaceSize * (NUM_ROUTING_SAMPLES + 1). The second call overwrites the first, so mTmaInputCache[...][false] and [true] end up sharing the same memory region, making one cache invalid after the other is prepared.

Allocate distinct workspace halves for NONE and FINALIZE and offset writes when fusion == FINALIZE.

Apply these diffs:

  • In getProfilerWorkspaces (size the TMA WS area for both NONE and FINALIZE caches):
-    size_t tma_ws_input_workspace_size = 0;
-    if (is_tma_ws_input)
-    {
-        tma_ws_input_workspace_size
-            = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
-            * (NUM_ROUTING_SAMPLES + 1);
+    size_t tma_ws_input_workspace_size = 0;
+    if (is_tma_ws_input)
+    {
+        // We pre-build BOTH NONE and FINALIZE templates in prepare(), so reserve space for both.
+        tma_ws_input_workspace_size
+            = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
+            * (NUM_ROUTING_SAMPLES + 1) * 2;
  • In prepareTmaWsInputs (offset to the second half for FINALIZE):
-    size_t tma_ws_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(mNumExpertsPerNode, mScalingType);
-
-    TmaWarpSpecializedGroupedGemmInput dummy_tma_ws_input;
-    dummy_tma_ws_input.configureWorkspace(tma_ws_input_workspace, mNumExpertsPerNode, gemm_workspace,
-        workspaces.at("gemm_workspace").first, mScalingType);
-    tma_ws_input_workspace += tma_ws_size;
+    size_t tma_ws_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(mNumExpertsPerNode, mScalingType);
+    // Reserve the first half for NONE and the second half for FINALIZE.
+    size_t cache_stride = tma_ws_size * (NUM_ROUTING_SAMPLES + 1);
+    if (use_finalize_fusion)
+    {
+        tma_ws_input_workspace += cache_stride;
+    }
+
+    TmaWarpSpecializedGroupedGemmInput dummy_tma_ws_input;
+    dummy_tma_ws_input.configureWorkspace(tma_ws_input_workspace, mNumExpertsPerNode, gemm_workspace,
+        workspaces.at("gemm_workspace").first, mScalingType);
+    tma_ws_input_workspace += tma_ws_size;

This ensures the two caches use non-overlapping memory and remain valid across runs.
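
To make the intended layout concrete, a minimal sketch of carving one buffer into two non-overlapping halves, one per epilogue variant; the constants and helper names are hypothetical stand-ins, not the profiler's actual code.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants standing in for the profiler's real values.
constexpr std::size_t kNUM_ROUTING_SAMPLES = 16;
constexpr std::size_t kTMA_WS_SIZE = 256; // Stand-in for TmaWarpSpecializedGroupedGemmInput::workspaceSize(...).

// Total bytes: one dummy slot plus kNUM_ROUTING_SAMPLES slots, for each of the two variants (NONE, FINALIZE).
constexpr std::size_t totalTmaWorkspaceBytes()
{
    return kTMA_WS_SIZE * (kNUM_ROUTING_SAMPLES + 1) * 2;
}

// Base pointer for a variant: FINALIZE uses the second half, so the two caches can never alias.
std::uint8_t* variantBase(std::uint8_t* workspace, bool use_finalize_fusion)
{
    std::size_t const cache_stride = kTMA_WS_SIZE * (kNUM_ROUTING_SAMPLES + 1);
    return workspace + (use_finalize_fusion ? cache_stride : 0);
}

int main()
{
    std::vector<std::uint8_t> workspace(totalTmaWorkspaceBytes());
    std::uint8_t* none_base = variantBase(workspace.data(), /* use_finalize_fusion = */ false);
    std::uint8_t* finalize_base = variantBase(workspace.data(), /* use_finalize_fusion = */ true);
    // Disjoint regions: [none_base, none_base + stride) vs [finalize_base, finalize_base + stride).
    return none_base == finalize_base ? 1 : 0;
}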

♻️ Duplicate comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4566-4576: Fix validated: compute groupwise flags before early-return

Declaring use_w4afp8/use_wfp4a16/use_w4_groupwise before the finalize-fusion early-exit resolves the prior “used before declaration” error and keeps the guard readable.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 31e66f0 and 19b9c1a.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (5)

2850-2854: Good: tactic-driven gating of permuted_token_final_scales workspace

Deriving permuted_token_final_scales_ from gemm2_config_->epilogue_fusion_type makes the allocation/use explicit and aligned with the GEMM2 tactic. Clean separation vs. implicit SM checks.


4010-4016: Good: finalize-fusion consistency check and gating

using_fused_finalize combines runtime config/toggles with the tactic’s epilogue_fusion_type and excludes disallowed modes (!use_w4_groupwise, !use_lora). The TLLM_CHECK enforces config-tactic agreement early, which prevents subtle misconfigurations.
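
A compact sketch of this gating shape under illustrative assumptions: a stand-in EpilogueFusionType, hypothetical flag names, and a thrown exception in place of TLLM_CHECK.

#include <stdexcept>

// Illustrative stand-ins only; the real flags and config types live in moe_kernels.cu.
enum class EpilogueFusionType
{
    NONE,
    FINALIZE
};

// Sketch of the gating shape: the tactic requests finalize fusion, and modes that cannot use it
// (w4 groupwise, LoRA) veto it. The thrown exception stands in for TLLM_CHECK.
bool resolveFusedFinalize(EpilogueFusionType tactic_fusion, bool runner_configured_for_finalize,
    bool use_w4_groupwise, bool use_lora)
{
    bool const tactic_wants_finalize = tactic_fusion == EpilogueFusionType::FINALIZE;
    bool const using_fused_finalize = tactic_wants_finalize && !use_w4_groupwise && !use_lora;
    if (using_fused_finalize != runner_configured_for_finalize)
    {
        throw std::runtime_error("Finalize-fusion tactic disagrees with the runner configuration");
    }
    return using_fused_finalize;
}

int main()
{
    bool const ok = resolveFusedFinalize(EpilogueFusionType::FINALIZE,
        /* runner_configured_for_finalize = */ true, /* use_w4_groupwise = */ false, /* use_lora = */ false);
    return ok ? 0 : 1;
}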


4558-4560: API update acknowledged: prepareTmaWsInputs now takes fusion

The additional EpilogueFusion parameter makes profiler template preparation fusion-aware. Call sites (e.g., prepare()) have been updated accordingly.


4632-4636: Finalize-fusion parameterization looks correct

Setting fusion=FINALIZE and calling setFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, mK > 1) matches the expected (hidden_size, num_output_tokens, reduction) semantics. This aligns with the epilogue’s transpose mapping.


4673-4677: Proactive prebuild of both NONE and FINALIZE templates in prepare()

Populating both caches up front reduces per-run overhead during profiling. Once the non-overlapping workspace fix above is applied, this is a solid optimization.

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from 19b9c1a to e58c302 on August 17, 2025 23:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4616-4663: Critical: TMA WS workspace overlap between NONE and FINALIZE variants can clobber cached templates

prepare() invokes prepareTmaWsInputs twice (NONE, FINALIZE), but getProfilerWorkspaces only reserves (NUM_ROUTING_SAMPLES + 1) TMA slots, and prepareTmaWsInputs always starts from the same base workspace offset. As a result:

  • mTmaInputCache[i][0] and mTmaInputCache[i][1] point into overlapping regions.
  • The second call overwrites the first call’s templates. At runtime, whichever was prepared last “wins,” and the other variant’s template may be corrupted.

Proposed fix (minimal, preserves your two-call flow):

  • Double the TMA input workspace to hold both variants.
  • Offset the workspace base when preparing the FINALIZE variant so it uses the second half.

Apply these diffs:

  1. Allocate enough workspace for both NONE and FINALIZE variants
-    size_t tma_ws_input_workspace_size
-        = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
-          * (NUM_ROUTING_SAMPLES + 1);
+    // Reserve separate slots for NONE and FINALIZE variants: one dummy + NUM_ROUTING_SAMPLES per variant.
+    size_t tma_ws_input_workspace_size
+        = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
+          * (2 * (NUM_ROUTING_SAMPLES + 1));
  2. Use distinct halves of the workspace depending on fusion
     size_t tma_ws_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(mNumExpertsPerNode, mScalingType);
 
+    // If preparing FINALIZE, jump to the second half so NONE and FINALIZE do not overlap.
+    if (use_finalize_fusion)
+    {
+        tma_ws_input_workspace += tma_ws_size * (NUM_ROUTING_SAMPLES + 1);
+    }
+
     TmaWarpSpecializedGroupedGemmInput dummy_tma_ws_input;
     dummy_tma_ws_input.configureWorkspace(tma_ws_input_workspace, mNumExpertsPerNode, gemm_workspace,
         workspaces.at("gemm_workspace").first, mScalingType);

This guarantees mTmaInputCache[i][0] and mTmaInputCache[i][1] reference non-overlapping TMA WS regions, preventing clobbering.

♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)

603-621: Make addFinalizeFusionConfigs accept by-value to support both lvalues and rvalues (and reserve capacity)

Taking an rvalue reference restricts call sites to temporaries only. By-value is simpler, safe, and enables both lvalues and rvalues with guaranteed local copy semantics. Also, reserve capacity to avoid reallocation when duplicating entries.

Apply this diff:

-    static auto addFinalizeFusionConfigs(
-        std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)
+    static std::vector<cutlass_extensions::CutlassGemmConfig> addFinalizeFusionConfigs(
+        std::vector<cutlass_extensions::CutlassGemmConfig> configs, bool use_fused_finalize)
     {
         if (!use_fused_finalize)
             return configs;

-        size_t const num_configs = configs.size();
+        size_t const num_configs = configs.size();
+        configs.reserve(num_configs * 2);
         for (size_t i = 0; i < num_configs; ++i)
         {
             if (configs[i].is_tma_warp_specialized)
             {
                 configs.push_back(configs[i]);
                 configs.back().epilogue_fusion_type
                     = cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
             }
         }
         return configs;
     }

628-633: Avoid decltype on a non-static data member in a static method

decltype(moe_gemm_runner_) ties the type alias to a non-static member from a static context and is brittle. Use the known runner type directly.

Apply this diff:

     static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
     {
-        using RunnerType = decltype(moe_gemm_runner_);
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
         return Self::addFinalizeFusionConfigs(
             RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
     }

829-833: Same static context issue in mayHaveFinalizeFused(int sm)

Avoid referencing an instance member type in a static method. Use the concrete MoeGemmRunner type.

Apply this diff:

     static bool mayHaveFinalizeFused(int sm)
     {
-        using RunnerType = decltype(moe_gemm_runner_);
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
         return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
     }
🧹 Nitpick comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

541-549: Help text: clarify fallback when only tactic_id1 is provided

Document that tactic_id2 mirrors tactic_id1 when omitted, to match generator behavior and reduce confusion.

Apply this diff:

-           "- \"tactic_id1, tactic_id2\"\n"
-           "The config for the CUTLASS GEMM. tactic_idX sets the tactic for the corresponding GEMM"
+           "- \"tactic_id1, tactic_id2\"\n"
+           "The config for the CUTLASS GEMM. tactic_idX sets the tactic for the corresponding GEMM. "
+           "If tactic_id2 is omitted, it defaults to tactic_id1.\n"
            "Valid tactics are:\n"
            " - An integer: corresponds to an index in the tactics array. WARNING this is not stable between data types "
            "or GPU architectures\n"
            " - An array: of integers, forms a list of tactics to sweep\n"
            " - The string \"all\": This will sweep through all possible tactics\n"
            " - The string \"auto\": This runs a short benchmark to pick the fastest tactic before each benchmark case. "
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 19b9c1a and e58c302.

📒 Files selected for processing (7)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧠 Learnings (4)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (17)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5)

231-236: MoeGemmId enum introduction looks correct

Enum values and scope are appropriate for per‑GEMM selection.


456-456: Interface change to per‑GEMM tactics is aligned

The new getTactics(MoeGemmId) signature makes sense for per‑GEMM flows.


622-626: Per‑GEMM tactics override logic LGTM

Gating finalize fusion to GEMM_2 looks intentional and consistent with the PR goals.


1003-1004: Cache widening to support two TMA inputs per routing sample looks good

Using a second dimension to separate the finalize and non-finalize variants is clear and fits the per‑GEMM flow.


1016-1018: prepareTmaWsInputs: epilogue fusion parameter addition is consistent

Signature extension is consistent with finalize fusion gating; call sites appear updated across the PR.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5)

836-844: Per‑GEMM tactic retrieval in pickBestTactic is correct

Casting the local GemmToProfile to MoeGemmId is safe given the defined numeric mapping.


928-936: Fetching separate tactic vectors for GEMM_1 and GEMM_2 looks good

The split avoids conflating finalize‑fused variants into GEMM_1.


952-956: Applying tactics per GEMM is correct

Passing the specific pair into setTactic aligns with the new API.


970-979: runMoEPermute: using the GEMM‑specific tactic is correct

Consistent with the profiler’s gemm selection. Good guard to ensure profiler/backend configuration match.


1079-1085: Diagnostics for selected tactics per GEMM are useful

Clearer than a single aggregated list; good for debugging.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

53-86: Per‑GEMM parsing path LGTM

Recursive parsing with support for integer, array, "all", and "auto" is clear and minimal.


295-301: Printing valid tactics per GEMM improves UX

Helpful one‑time diagnostics to guide configuration.


420-431: Verbose dump of all tactics by GEMM is useful

Clear separation per GEMM, with indices, makes CLI config simpler.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

2850-2854: Fusion-aware workspace wiring for GEMM2 scales: LGTM

Using the epilogue_fusion_type to gate permuted_token_final_scales_ is the right direction and removes SM-coupling.


4010-4016: Tactic-driven finalize gating and consistency check: LGTM

  • The compare against the tactic’s epilogue_fusion_type and the equality check are clear and prevent divergence.
  • Passing (final_output, hidden_size, num_rows, ...) to setFinalizeFusionParams matches the transpose semantics.

4558-4576: prepareTmaWsInputs now parametric on fusion + early groupwise guards: LGTM

  • The new fusion parameter makes the API explicit.
  • Computing use_w4_groupwise before the early-return resolves the prior “used before declaration” issue.

4739-4743: Per-sample/per-fusion cache lookup with validity check: LGTM

Selecting the cached template by fusion type and verifying isValid() prevents accidental use of uninitialized templates.
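
A minimal sketch of that lookup pattern with hypothetical types: a per-sample, per-fusion cache where invalid entries are rejected rather than reused.

#include <array>
#include <cstddef>

// Hypothetical stand-in for the real TMA template type.
struct TmaTemplate
{
    bool valid = false;

    bool isValid() const
    {
        return valid;
    }
};

constexpr std::size_t kNUM_ROUTING_SAMPLES = 16;
using TmaInputCache = std::array<std::array<TmaTemplate, 2>, kNUM_ROUTING_SAMPLES>;

// Select the cached template for a routing sample and fusion choice; never hand back an invalid entry.
TmaTemplate const* lookupTemplate(TmaInputCache const& cache, std::size_t sample, bool use_finalize_fusion)
{
    TmaTemplate const& entry = cache[sample][use_finalize_fusion ? 1 : 0];
    return entry.isValid() ? &entry : nullptr;
}

int main()
{
    TmaInputCache cache{};
    cache[0][1].valid = true;
    return lookupTemplate(cache, 0, /* use_finalize_fusion = */ true) != nullptr ? 0 : 1;
}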

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

66-81: Release-mode bug: unrecognized tactic strings silently treated as "auto"

In release builds, the assert(tactic.get<std::string>() == "auto") (Line 78) is compiled out, so any unrecognized string becomes -1 (“auto”). This is dangerous and hard to debug. Explicitly validate the string and throw on invalid values.

Apply this diff:

-    else if (tactic.is_string())
-    {
-        assert(tactic.is_string());
-        auto tactic_name = tactic.get<std::string>();
-        if (tactic_name == "all")
-        {
-            auto all_tactics = listAllTactics<BenchClass>(gemm_id);
-            tactic_ids.resize(all_tactics.size());
-            std::iota(tactic_ids.begin(), tactic_ids.end(), 0);
-        }
-        else
-        {
-            assert(tactic.get<std::string>() == "auto");
-            tactic_ids.push_back(-1);
-        }
-    }
+    else if (tactic.is_string())
+    {
+        auto tactic_name = tactic.get<std::string>();
+        if (tactic_name == "all")
+        {
+            auto all_tactics = listAllTactics<BenchClass>(gemm_id);
+            tactic_ids.resize(all_tactics.size());
+            std::iota(tactic_ids.begin(), tactic_ids.end(), 0);
+        }
+        else if (tactic_name == "auto")
+        {
+            tactic_ids.push_back(-1);
+        }
+        else
+        {
+            throw std::invalid_argument("Invalid tactic string: " + tactic_name + ". Expected \"all\" or \"auto\"");
+        }
+    }

Optional: replace magic -1 with a named constant (e.g., constexpr int kTACTIC_AUTO = -1) for clarity.
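
A standalone sketch of the stricter parsing suggested above, including the optional kTACTIC_AUTO constant; the function and parameter names are illustrative, and num_tactics stands in for listAllTactics<BenchClass>(gemm_id).size().

#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// Named constant instead of a magic -1, as suggested above; the value itself is unchanged.
constexpr int kTACTIC_AUTO = -1;

// Resolve a tactic string into indices, throwing on anything unrecognized.
std::vector<int> resolveTacticString(std::string const& tactic_name, std::size_t num_tactics)
{
    std::vector<int> tactic_ids;
    if (tactic_name == "all")
    {
        tactic_ids.resize(num_tactics);
        std::iota(tactic_ids.begin(), tactic_ids.end(), 0);
    }
    else if (tactic_name == "auto")
    {
        tactic_ids.push_back(kTACTIC_AUTO);
    }
    else
    {
        // Unlike assert(), this also fires in release builds.
        throw std::invalid_argument("Invalid tactic string: " + tactic_name + ". Expected \"all\" or \"auto\"");
    }
    return tactic_ids;
}

int main()
{
    auto const all_ids = resolveTacticString("all", 4);  // {0, 1, 2, 3}
    auto const auto_id = resolveTacticString("auto", 4); // {kTACTIC_AUTO}
    return (all_ids.size() == 4 && auto_id.front() == kTACTIC_AUTO) ? 0 : 1;
}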

♻️ Duplicate comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

45-50: Potentially invalid RunnerType alias; confirm mMoERunner is static or switch to a concrete runner type

using RunnerType = decltype(BenchClass::mMoERunner); only works if mMoERunner is accessible and nameable from this context, and even where it compiles it’s brittle. Prefer a concrete runner type (or a type alias from BenchClass) and call getTactics(sm, gemm_id) on that.

If mMoERunner is non-static, apply this change (as previously suggested) to make the runner type explicit:

 template <class BenchClass>
 auto listAllTactics(MoeGemmId gemm_id)
 {
     int const sm = getSMVersion();
-    using RunnerType = decltype(BenchClass::mMoERunner);
+    using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType,
+        typename BenchClass::WeightType,
+        typename BenchClass::OutputType,
+        typename BenchClass::InputType>;
     return RunnerType::getTactics(sm, gemm_id);
 }

Alternatively, if BenchClass exposes a canonical runner alias, use that (e.g., using RunnerType = typename BenchClass::RunnerType;).

To verify whether mMoERunner is static and where the canonical runner type lives, run:

#!/bin/bash
# Show definitions/usages of mMoERunner and any RunnerType aliases
rg -n -C3 --type=cpp --type=cu '\bmMoERunner\b|RunnerType|CutlassMoeFCRunner'

278-285: LGTM: tactic_id1/2 are now parsed conditionally (fixes prior crash when only one was present)

Good cleanup. This addresses the earlier issue of unconditional access causing exceptions when only one of the keys existed.

🧹 Nitpick comments (3)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

334-353: Consider defaulting unspecified tactics to “auto” instead of skipping the entire config

Currently, if either tactic_ids1 or tactic_ids2 is empty, the config is skipped. Defaulting the missing side(s) to -1 (“auto”) makes configs more user-friendly (and aligns with the help text’s emphasis on “auto” for quick sweeps).

Apply this diff:

-        if (tactic_ids1.empty() || tactic_ids2.empty())
-        {
-            std::cerr << "Warning: Skipping benchmark, no valid tactic found" << std::endl;
-            static bool printed = false;
-            if (!printed)
-            {
-                printed = true;
-                std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n";
-                for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2})
-                {
-                    std::cerr << "GEMM " << (int) gemm_id << ":\n";
-                    auto confs = listAllTactics<BenchClass>(gemm_id);
-                    for (auto c : confs)
-                        std::cerr << c.toString();
-                    std::cerr << std::endl;
-                }
-            }
-
-            continue;
-        }
+        if (tactic_ids1.empty())
+        {
+            if (LOG_LEVEL >= INFO)
+                std::cerr << "Info: 'tactic_id1' not provided, defaulting to auto (-1)\n";
+            tactic_ids1 = {-1};
+        }
+        if (tactic_ids2.empty())
+        {
+            if (LOG_LEVEL >= INFO)
+                std::cerr << "Info: 'tactic_id2' not provided, defaulting to auto (-1)\n";
+            tactic_ids2 = {-1};
+        }

If you prefer to keep the skip behavior, consider clarifying this requirement in the help text (see comment below).


388-388: Comment nit: reflect per-GEMM tactic enumeration

The comment references a non-existent single-argument listAllTactics and ignores the GEMM split. Tighten the comment to avoid confusion.

-auto cutlass_tactic = {-1};                           // {0,..., listAllTactics<BenchClass>(MoeGemmId).size()};
+auto cutlass_tactic = {-1};                           // Use -1 for "auto". To sweep by GEMM, see:
+                                                     // listAllTactics<BenchClass>(MoeGemmId::GEMM_1/2).size()

Optional: introduce a named constant for -1 (e.g., kTACTIC_AUTO) and reuse at Lines 316, 320, and here.


536-545: Help text looks aligned; optionally clarify required/optional behavior for tactic_id1/2

  • Good: “objects” removed; supported types are now integer, array of integers, or strings "all"/"auto".
  • Suggestion: Explicitly state whether both tactic_id1 and tactic_id2 must be provided. If you adopt the default-to-auto behavior, document that omission defaults to "auto" for the missing side. Otherwise, note that omission will skip the config.

Proposed tweak if you choose to default missing sides to auto:

-           "- \"tactic_id1, tactic_id2\"\n"
-           "The config for the CUTLASS GEMM. tactic_idX sets the tactic for the corresponding GEMM"
+           "- \"tactic_id1, tactic_id2\"\n"
+           "The config for the CUTLASS GEMMs. tactic_idX sets the tactic for the corresponding GEMM (1 or 2).\n"
+           "If either is omitted, it defaults to \"auto\" for that GEMM."
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 79888c4 and 8493e78.

📒 Files selected for processing (1)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

415-426: LGTM: improved per-GEMM tactic listing with indices

Clear, per-GEMM tactic dumps with stable, zero-based indices; helpful for users to map IDs to configurations.
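For reference, a helper along these lines can produce such a listing (illustrative sketch only, reusing the listAllTactics helper and MoeGemmId enum discussed in this PR; the benchmark's actual output format may differ):

#include <cstddef>
#include <iostream>

// Sketch: print every tactic for one GEMM with a stable, zero-based index so users
// can map an index from their JSON config back to a concrete configuration.
// Assumes listAllTactics<BenchClass>(MoeGemmId) from the benchmark launcher is visible.
template <class BenchClass>
void dumpTacticsForGemm(MoeGemmId gemm_id, char const* label)
{
    auto const tactics = listAllTactics<BenchClass>(gemm_id);
    std::cout << label << ": " << tactics.size() << " tactics\n";
    for (std::size_t i = 0; i < tactics.size(); ++i)
    {
        std::cout << "  [" << i << "] " << tactics[i].toString() << "\n";
    }
}

// Usage (hypothetical): dumpTacticsForGemm<BenchClass>(MoeGemmId::GEMM_1, "GEMM 1");
//                       dumpTacticsForGemm<BenchClass>(MoeGemmId::GEMM_2, "GEMM 2");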

@djns99
Collaborator Author

djns99 commented Aug 17, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15547 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15547 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11710 completed with status: 'FAILURE'

@djns99
Collaborator Author

djns99 commented Aug 18, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15596 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15596 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11742 completed with status: 'FAILURE'

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from 8493e78 to e27b017 on August 18, 2025 23:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

949-951: Guard against empty tactic vectors before indexing in deterministic mode

Accessing [0] without checking for emptiness will crash when no tactics are available (e.g., unsupported shape/quant combo). Guard and use .front().

Apply this diff:

-        gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0];
-        gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0];
+        auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1);
+        auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2);
+        TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE");
+        gemm1 = t1.front();
+        gemm2 = t2.front();
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

45-50: Fix type alias: decltype(BenchClass::mMoERunner) is not a concrete runner type

Using decltype on a (likely non-static) data member is incorrect here and will not yield a usable type for the static call. Use the concrete runner type derived from BenchClass template parameters.

Apply this diff:

 template <class BenchClass>
 auto listAllTactics(MoeGemmId gemm_id)
 {
     int const sm = getSMVersion();
-    using RunnerType = decltype(BenchClass::mMoERunner);
-    return RunnerType::getTactics(sm, gemm_id);
+    using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType,
+        typename BenchClass::WeightType,
+        typename BenchClass::OutputType,
+        typename BenchClass::InputType>;
+    return RunnerType::getTactics(sm, gemm_id);
 }
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

1178-1181: Fixed call site to updated selectTacticsForArch signature — LGTM

Removal of the extra argument addresses the prior build issue when assigning std::pair.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

829-833: Same static-context issue as above in mayHaveFinalizeFused(int sm)

Use the explicit runner type; don’t reference an instance member.

Apply this diff:

-    static bool mayHaveFinalizeFused(int sm)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
-    }
+    static bool mayHaveFinalizeFused(int sm)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
+    }

628-633: Compile-time error: referencing non-static member in static context

decltype(moe_gemm_runner_) cannot be used in a static method (no object). Use the known runner type explicitly.

Apply this diff:

-    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return Self::addFinalizeFusionConfigs(
-            RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
-    }
+    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return Self::addFinalizeFusionConfigs(
+            RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
+    }
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4675-4679: Note: No base-pointer bump between NONE and FINALIZE calls

Given the above fix (offset inside prepareTmaWsInputs and doubled allocation), no changes are needed here. If you prefer offsetting at the call site instead, ensure only tma_ws_input_workspace is shifted — not the entire workspace base (or you’ll invalidate other pointers retrieved via GET_WS_PTR).
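As a toy illustration of the pitfall (hypothetical layout and names, not the real profiler workspace): sub-regions are located at fixed offsets from the workspace base, so shifting the base pointer would move every region, while shifting only the TMA sub-region leaves the others intact.

#include <cassert>
#include <cstddef>

// Toy layout: every region is addressed relative to the same base pointer.
struct ToyLayout
{
    std::size_t tmaInputOffset;
    std::size_t tmaInputSize; // sized for two halves: NONE and FINALIZE
    std::size_t otherRegionOffset;
};

char* toyTmaRegion(char* base, ToyLayout const& layout, bool finalizeHalf)
{
    // Shift only the TMA sub-region into its second half for the FINALIZE variant.
    std::size_t const halfOffset = finalizeHalf ? layout.tmaInputSize / 2 : 0;
    return base + layout.tmaInputOffset + halfOffset;
}

char* toyOtherRegion(char* base, ToyLayout const& layout)
{
    return base + layout.otherRegionOffset; // always computed from the unmodified base
}

int main()
{
    static char workspace[1024];
    ToyLayout const layout{/*tmaInputOffset=*/0, /*tmaInputSize=*/512, /*otherRegionOffset=*/512};
    assert(toyTmaRegion(workspace, layout, /*finalizeHalf=*/true) == workspace + 256);
    assert(toyOtherRegion(workspace, layout) == workspace + 512);
    return 0;
}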

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

603-620: API ergonomics: accept configs by value instead of rvalue-ref for flexibility

addFinalizeFusionConfigs(std::vector&&) works with temporaries but forces std::move at call sites with lvalues. Taking by value is simpler and lets the compiler elide/move as needed without constraining callers.

Suggested change:

-    static auto addFinalizeFusionConfigs(
-        std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)
+    static auto addFinalizeFusionConfigs(
+        std::vector<cutlass_extensions::CutlassGemmConfig> configs, bool use_fused_finalize)
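A minimal standalone sketch of the call-site ergonomics (hypothetical function and types, not the real runner API):

#include <cstddef>
#include <utility>
#include <vector>

// By-value parameter: callers can pass lvalues, temporaries, or std::move'd values.
// An rvalue-reference parameter would force std::move (or an explicit copy) on lvalue callers.
std::vector<int> addFinalizeVariants(std::vector<int> configs)
{
    auto const baseCount = configs.size();
    for (std::size_t i = 0; i < baseCount; ++i)
    {
        configs.push_back(configs[i]); // append a duplicated "finalize" variant per config
    }
    return configs;
}

int main()
{
    std::vector<int> existing{1, 2, 3};
    auto a = addFinalizeVariants(existing);            // lvalue: copies, caller keeps its vector
    auto b = addFinalizeVariants({4, 5});              // temporary: moves
    auto c = addFinalizeVariants(std::move(existing)); // explicit move still available
    return (a.size() + b.size() + c.size()) > 0 ? 0 : 1;
}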

1016-1018: Nit: parameter name mismatch between declaration and definition

Header declares prepareTmaWsInputs(..., char* workspace, ...), while the definition uses char* workspace_ptr_char. Please align parameter names across declaration/definition to match the repo’s consistency guideline.

Example header tweak:

-    void prepareTmaWsInputs(int num_tokens, char* workspace, void const* expert_weights,
+    void prepareTmaWsInputs(int num_tokens, char* workspace_ptr_char, void const* expert_weights,
         TmaWarpSpecializedGroupedGemmInput::EpilogueFusion fusion, cudaStream_t stream);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8493e78 and e27b017.

📒 Files selected for processing (13)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (24)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3)

436-443: Well-scoped addition of EpilogueFusionType

Introducing a local EpilogueFusionType and storing it on the config is clean and future-proof. Defaulting to NONE avoids impacting existing call sites.
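A simplified sketch of the pattern (not the actual gemm_configs.h definition):

// Adding a member with a NONE default means default-constructed configs, and any
// call site that never sets the field, keep their previous behavior.
struct GemmConfigSketch
{
    enum class EpilogueFusionType
    {
        NONE,
        FINALIZE,
    };

    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

constexpr GemmConfigSketch kDEFAULT_CONFIG{};
static_assert(kDEFAULT_CONFIG.epilogue_fusion_type == GemmConfigSketch::EpilogueFusionType::NONE,
    "Existing call sites see NONE unless they opt in to FINALIZE.");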


513-515: toString: adds epilogue_fusion_type and enableCudaKernel in TMA path — good for diagnostics

The expanded diagnostic string helps correlate tactics with finalize fusion during debugging.


546-548: operator<< parity with toString for fusion type — LGTM

Including epilogue_fusion_type in the TMA path streaming output keeps logs/prints consistent.

cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

1279-1282: Per-GEMM tactic plumbing in profiler getTactics — LGTM

Returning tactics via mRunner->mMOERunner->getTactics(backend.mGemmToProfile) correctly aligns profiler enumeration with the selected GEMM.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (6)

53-86: Per-GEMM tactic parser refactor — LGTM

Signature now threads gemm_id through recursion and supports scalar, array, and "all"/"auto" inputs. Correct and robust.
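As a rough sketch of that parsing shape (assumes the nlohmann::json library and hypothetical names; the benchmark's real implementation differs in details), where -1 is the "auto" sentinel noted later in this review:

#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch only: a tactic field may be an integer index, an array of indices,
// "all" (every valid tactic), or "auto" (heuristic pick, encoded as -1).
std::vector<int> parseTacticField(nlohmann::json const& node, int num_tactics)
{
    std::vector<int> result;
    if (node.is_number_integer())
    {
        result.push_back(node.get<int>());
    }
    else if (node.is_array())
    {
        for (auto const& entry : node)
        {
            auto const nested = parseTacticField(entry, num_tactics); // recurse per element
            result.insert(result.end(), nested.begin(), nested.end());
        }
    }
    else if (node.is_string() && node.get<std::string>() == "all")
    {
        for (int i = 0; i < num_tactics; ++i)
        {
            result.push_back(i);
        }
    }
    else if (node.is_string() && node.get<std::string>() == "auto")
    {
        result.push_back(-1); // sentinel: let the runner pick
    }
    else
    {
        throw std::invalid_argument("tactic_id must be an int, an array, \"all\", or \"auto\"");
    }
    return result;
}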


279-285: Conditional parsing for tactic_id1/tactic_id2 — LGTM

Parsing each key only if present avoids exceptions and permits asymmetric specification.


334-353: Fail-fast when no valid tactics, with helpful diagnostics — LGTM

Clear warning and a dump of valid tactics per GEMM will aid users authoring configs.


388-389: Default cutlass_tactic sentinel (-1) — LGTM

Using -1 to represent "auto" aligns with the parser semantics.


415-426: Verbose listing split per GEMM — LGTM

The per-GEMM breakdown improves debuggability for tactic enumeration.


536-545: Help text corrected to remove unsupported “objects” and clarify valid forms — LGTM

Docs now match the parser. Good cleanup; reduces user confusion.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (10)

373-375: Introduce mUseFusedFinalize test control — LGTM

Clear knob for fused finalize in tests; defaults to false to cover both paths via K threshold elsewhere.


459-459: Test runner finalize gating matches production logic — LGTM

use_fused_finalize_ = (k < 3 || mUseFusedFinalize) mirrors runtime behavior and keeps tests aligned.


1090-1119: getFilteredConfigs now gemm-aware with SM-specific filtering — LGTM

Threading MoeGemmId and retaining the L40S filter preserves previous behavior while enabling per-GEMM selection.


1120-1144: selectTacticsForArch returns per-GEMM pair and accounts for fusion type — LGTM

Selecting GEMM_2 by both TMA specialization and epilogue_fusion_type is the right pivot; sensible fallback if not found.


1641-1645: Determinism check keyed off epilogue_fusion_type — LGTM

Using gemm2.epilogue_fusion_type to decide determinism is precise and reflects the fused-finalize path.


1763-1766: PermuteNonDeterministic explicitly enables fused finalize — LGTM

Targets the intended non-deterministic path for K=3 without conflating with legacy flags.


1906-1914: Determinism guard in EP/TP loop (first rank) — LGTM

Consistent with BasicPermuteTest, including the SM-range carveouts.


1928-1932: Determinism guard in EP/TP loop (subsequent ranks) — LGTM

Same policy replicated for subsequent ranks; good coverage.


2095-2095: ConfigSweep: enable fused finalize for sweep — LGTM

Ensures both fusion states are exercised during exhaustive config sweeps.


2123-2129: ConfigSweep: enumerate tactics per GEMM — LGTM

Splitting configs1/configs2 via gemm_id is necessary for the new API.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)

2850-2854: LGTM: finalize-fusion workspace pointer is now tactic-driven

Setting permuted_token_final_scales_ based on epilogue_fusion_type is a clean improvement and removes stale coupling. No issues spotted.


4008-4023: LGTM: tactic-aligned FINALIZE gating and param setup

The consistency check and the gating condition for using_fused_finalize are sound. The setFinalizeFusionParams call aligns with hidden_size-by-num_rows semantics as used elsewhere in this file.


4710-4745: LGTM: tactic-resolved access to per-sample TMA template

Selecting mTmaInputCache[mSampleIndex][finalize?1:0] with a validity check is correct and matches the new dual-variant caching model.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

1003-1004: LGTM: dual-variant cache for TMA inputs

mTmaInputCache[SAMPLE][0|1] establishes clear separation between NONE and FINALIZE variants. This matches the runtime selection in the profiler.

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from e27b017 to 975f1c3 on August 19, 2025 00:45
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4616-4640: Critical: TMA WS for NONE vs FINALIZE variants overlap; allocate 2× and separate regions

prepareTmaWsInputs is invoked twice (NONE and FINALIZE), but both runs currently configure mTmaInputCache from the same base tma_ws_input_workspace span. This causes the FINALIZE templates to clobber the NONE templates (and vice versa), making the cached TMA descriptors nondeterministic.

Fix requires two coordinated changes:

  • Double the TMA WS allocation in the profiler to reserve independent halves for NONE and FINALIZE.
  • When preparing the FINALIZE variant, offset the local workspace pointer into the second half so the two variants never overlap.

Apply the prepare-side separation (see next comment for an exact diff). Additionally, update the workspace sizing as shown below.

Additional change outside this hunk (GemmProfilerBackend::getProfilerWorkspaces):

// Before:
tma_ws_input_workspace_size
    = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
      * (NUM_ROUTING_SAMPLES + 1);

// After (reserve two halves: NONE and FINALIZE):
tma_ws_input_workspace_size
    = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
      * (2 * (NUM_ROUTING_SAMPLES + 1));
♻️ Duplicate comments (10)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

4675-4679: Separate TMA WS halves for NONE and FINALIZE during prepare()

Advance the workspace pointer into the second half before preparing the FINALIZE cache to avoid overlap with the NONE cache. This pairs with the 2× allocation change in getProfilerWorkspaces.

-    prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights,
-        TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::NONE, stream);
-    prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights,
-        TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE, stream);
+    // Prepare NONE variant in first half
+    prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights,
+        TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::NONE, stream);
+    // Advance into second half for FINALIZE variant to avoid overlap
+    auto ws = getProfilerWorkspaces(num_tokens, mSM >= 90);
+    workspace_ptr_char += ws.at("tma_ws_input_workspace").first / 2;
+    prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights,
+        TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE, stream);
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

949-951: Guard tactic access in deterministic path to avoid out-of-range access

Indexing [0] without checking can crash if no tactics are available for the chosen GEMM. Guard and use .front() after validating.

-        gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0];
-        gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0];
+        auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1);
+        auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2);
+        TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE");
+        gemm1 = t1.front();
+        gemm2 = t2.front();
cpp/tensorrt_llm/thop/moeOp.cpp (2)

630-638: Harden GEMM/profile selection: validate gemm_idx, guard empty profiles, and validate profile_id.

As written, any non-1 index maps to GEMM_2; .front() will throw on empty; direct indexing can go OOB.

-        auto const gemm_to_profile
-            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
-        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        TORCH_CHECK(gemm_idx == 1 || gemm_idx == 2, "gemm_idx must be 1 or 2");
+        auto const gemm_to_profile
+            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
+        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_", gemm_idx);
@@
-        // TODO: use the best tactic id found offline for a better default inference perf
-        auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id];
+        // TODO: use the best tactic id found offline for a better default inference perf
+        const auto& profile = [&]() -> const Profile&
+        {
+            if (profile_id == -1) return profiles.front();
+            TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()),
+                        "profile_id out of range for GEMM_%d: %ld", int(gemm_idx), profile_id);
+            return profiles[profile_id];
+        }();

740-749: Guard empty tactic lists and validate profile_ids before .front() / .at().

Avoid UB when no tactics are available and enforce index range for the provided ids.

-        auto best_gemm1_profile = mGemm1Profiles.front();
-        auto best_gemm2_profile = mGemm2Profiles.front();
+        TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available");
+        auto best_gemm1_profile = mGemm1Profiles.front();
+        auto best_gemm2_profile = mGemm2Profiles.front();
         if (profile_ids.has_value())
         {
             TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids");
-            best_gemm1_profile
-                = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]);
-            best_gemm2_profile
-                = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]);
+            if (profile_ids.value()[0] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[0] >= 0
+                        && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()),
+                        "profile_ids[0] out of range");
+                best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]);
+            }
+            if (profile_ids.value()[1] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[1] >= 0
+                        && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()),
+                        "profile_ids[1] out of range");
+                best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]);
+            }
         }
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

536-545: Docs updated to remove “objects” in tactic arrays — matches parser behavior.

This aligns the help text with the implemented JSON parsing.


45-50: Fix invalid type alias: decltype on non-static member isn’t a runner type; use the concrete runner alias.

decltype(BenchClass::mMoERunner) for a non-static member yields a pointer-to-member or won’t compile. Use the concrete runner built from BenchClass’ type aliases.

 template <class BenchClass>
-auto listAllTactics(MoeGemmId gemm_id)
+auto listAllTactics(MoeGemmId gemm_id)
 {
     int const sm = getSMVersion();
-    using RunnerType = decltype(BenchClass::mMoERunner);
-    return RunnerType::getTactics(sm, gemm_id);
+    using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType,
+        typename BenchClass::WeightType,
+        typename BenchClass::OutputType,
+        typename BenchClass::InputType>;
+    return RunnerType::getTactics(sm, gemm_id);
 }
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

1178-1181: Fixed: selectTacticsForArch called with a single sm argument.

This resolves the previous signature mismatch.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)

603-621: Fixed: addFinalizeFusionConfigs no longer tries to bind temporaries to non‑const lvalue refs.

Taking by rvalue-reference and returning by value is the right choice here.


628-633: Compile fix: don’t use decltype(moe_gemm_runner_) in a static method.

Referencing a non‑static member in static context is ill‑formed. Use the concrete runner type.

-    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return Self::addFinalizeFusionConfigs(
-            RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
-    }
+    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return Self::addFinalizeFusionConfigs(
+            RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
+    }

829-833: Compile fix: same static context issue in mayHaveFinalizeFused(int sm).

Replace decltype with the concrete MoeGemmRunner type.

-    static bool mayHaveFinalizeFused(int sm)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
-    }
+    static bool mayHaveFinalizeFused(int sm)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
+    }
🧹 Nitpick comments (2)
cpp/tensorrt_llm/thop/moeOp.cpp (1)

590-595: Nit: cast vector::size() to int64_t to avoid narrowing warnings.

Return type is int64_t, but size() is size_t. Cast explicitly.

-        return (gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size();
+        return static_cast<int64_t>((gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size());
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)

2-2: Nit: refresh copyright year.

Header still says 2020–2023; repository guidelines indicate using the current year.

- * Copyright (c) 2020-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e27b017 and 975f1c3.

📒 Files selected for processing (13)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • tensorrt_llm/_torch/autotuner.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (4)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧬 Code Graph Analysis (3)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)
  • sm (1090-1118)
  • sm (1090-1090)
  • sm (1120-1144)
  • sm (1120-1120)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2)
  • getTactics (1278-1282)
  • getTactics (1278-1278)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
  • gemm2 (702-726)
  • mK (986-986)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2)
  • gemm2 (638-662)
  • mK (923-923)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
  • mK (381-381)
cpp/include/tensorrt_llm/common/cudaUtils.h (1)
  • getSMVersion (298-364)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)
  • vector (566-570)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2)
  • getTactics (1278-1282)
  • getTactics (1278-1278)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)
  • EpilogueFusion (167-332)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (21)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)

2849-2854: Gating scales buffer by epilogue fusion looks right

Selecting permuted_token_final_scales_ only when GEMM2 uses FINALIZE epilogue matches the tactic-driven design and prevents dangling pointers when running NONE.


4010-4016: Tactic-driven finalize gating is consistent and clearer than SM checks

Replacing SM-based logic with epilogue_fusion_type and asserting alignment between runner and tactic is the right move.


4566-4576: Good fix: compute groupwise flags before early return

Declaring use_w4afp8/use_wfp4a16/use_w4_groupwise before the FINALIZE gating avoids the previous “used-before-declared” issue and makes the early return safe.

cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

1281-1282: Per-GEMM tactic retrieval in profiler is correct

Returning tactics via mRunner->mMOERunner->getTactics(backend.mGemmToProfile) keeps the profiler aligned with the GEMM being profiled. Looks good.

cpp/tensorrt_llm/thop/moeOp.cpp (3)

51-51: Good: explicit alias to MoeGemmId keeps thop consistent with kernels.

This makes the intent clear and avoids namespace churn. No issues.


218-221: Per-GEMM tactic vectors populated at construction look correct.

Pulling tactics separately for GEMM_1 and GEMM_2 aligns with the new API and avoids mixing profiles.


648-649: LGTM: propagate selected GEMM into profiler backend.

This keeps the profiler state coherent with the per-GEMM selection path.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

279-285: Good: tactic_id parsing now guards key presence.

Fixes the exception when only one of tactic_id1/tactic_id2 is provided.


334-353: Nice UX: skip with actionable dump when tactic lists are empty.

The per-GEMM listing helps users pick valid indices quickly.


415-427: Helpful: per-GEMM tactic inventory printing.

Clear visibility into available tactics by GEMM. Good for debugging configs.

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (7)

373-375: Reasonable default: don’t force fused finalize outside K<3.

Keeps test surface predictable; explicit opt-in where needed.


459-460: Good: tie runner’s finalize flag to K and test override.

This matches production behavior where finalize fusion only applies for K<3 unless forced.


1090-1118: Per‑GEMM tactic fetch in tests looks correct.

The SM-specific filtering and GEMM‑aware retrieval match the new API.


1123-1144: Finalize awareness in tactic selection is well integrated.

Filtering GEMM_2 by EpilogueFusionType and falling back when not found is pragmatic.


1641-1645: Determinism gating is sensible.

Only assert determinism when not using FINALIZE on SM90+ with K>=3.


1763-1766: Explicit non-determinism test via fused finalize.

Clear intent for the PermuteNonDeterministic case.


2093-2151: ConfigSweep updated to pairwise sweep per-GEMM configs.

Combinatorial coverage makes sense with the per‑GEMM split. Good failure messaging.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)

622-626: Per‑GEMM tactic listing with conditional finalize variants looks good.

Using runtime capability checks to gate FINALIZE copies is appropriate.


931-936: GemmProfilerBackend now aliases GemmToProfile to MoeGemmId — consistent and cleaner.

This unifies the Gemm identifier across components.


1003-1004: Good: cache two TMA inputs per sample.

The 2‑wide inner array aligns with GEMM_1/GEMM_2 profiling.


1016-1018: prepareTmaWsInputs now accepts EpilogueFusion — necessary for finalize-aware setup.

Signature change matches call sites and intent.

@djns99
Copy link
Collaborator Author

djns99 commented Aug 19, 2025

/bot run

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from 975f1c3 to a18eae9 on August 19, 2025 03:51
@tensorrt-cicd
Collaborator

PR_Github #15705 [ run ] triggered by Bot

Collaborator

@hyukn hyukn left a comment


Some comments about the tactic-generation part. LGTM.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (8)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

628-633: Fix static context: avoid referencing instance member in static getTactics(int, MoeGemmId)

Using decltype(moe_gemm_runner_) in a static method is ill‑formed. Use the concrete runner type instead.

-    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return Self::addFinalizeFusionConfigs(
-            RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
-    }
+    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return Self::addFinalizeFusionConfigs(
+            RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
+    }

829-834: Fix static context: avoid referencing instance member in mayHaveFinalizeFused(int sm)

Same issue as above; don’t use decltype(moe_gemm_runner_) in a static method.

-    static bool mayHaveFinalizeFused(int sm)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
-    }
+    static bool mayHaveFinalizeFused(int sm)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
+    }
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

949-951: Guard deterministic tactic selection against empty tactic lists

Indexing [0] on the result of getTactics() can crash if no tactics are available (unsupported shape/quant/arch).

-        gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0];
-        gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0];
+        auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1);
+        auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2);
+        TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE");
+        gemm1 = t1.front();
+        gemm2 = t2.front();
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

1079-1084: Guard INFO logging against empty tactic lists

Avoid out‑of‑range access when printing selections.

-        std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n"
-                  << tactics1[tactic_idx1].toString() << std::endl;
-        std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n"
-                  << tactics2[tactic_idx2].toString() << std::endl;
+        std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n";
+        if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size())) {
+            std::cout << tactics1[tactic_idx1].toString() << std::endl;
+        }
+        std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n";
+        if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size())) {
+            std::cout << tactics2[tactic_idx2].toString() << std::endl;
+        }
cpp/tensorrt_llm/thop/moeOp.cpp (2)

630-638: Harden runGemmProfile: check empty profiles and profile_id range

Avoid undefined behavior when no tactics are available and when profile_id is invalid.

-        auto const gemm_to_profile
-            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
-        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        auto const gemm_to_profile
+            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
+        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_", gemm_idx);
@@
-        auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id];
+        const auto& profile = [&]() -> const Profile&
+        {
+            if (profile_id == -1) return profiles.front();
+            TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()),
+                        "profile_id out of range for GEMM_%d: %ld", int(gemm_idx), profile_id);
+            return profiles[profile_id];
+        }();

740-749: Harden setRunnerProfiles: guard empty lists and id ranges

Avoid front()/at() on empty vectors and out‑of‑range indices.

-        auto best_gemm1_profile = mGemm1Profiles.front();
-        auto best_gemm2_profile = mGemm2Profiles.front();
+        TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available");
+        auto best_gemm1_profile = mGemm1Profiles.front();
+        auto best_gemm2_profile = mGemm2Profiles.front();
         if (profile_ids.has_value())
         {
             TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids");
-            best_gemm1_profile
-                = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]);
-            best_gemm2_profile
-                = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]);
+            if (profile_ids.value()[0] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[0] >= 0
+                        && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()),
+                        "profile_ids[0] out of range");
+                best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]);
+            }
+            if (profile_ids.value()[1] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[1] >= 0
+                        && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()),
+                        "profile_ids[1] out of range");
+                best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]);
+            }
         }
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

536-544: Help text fixed to remove unsupported 'objects' — thanks

This aligns with the current parser, preventing confusion and exceptions from outdated guidance.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

84-87: Return a concrete list and make gemm_idx optional for resilience

Returning a range while the signature advertises List[int] can surprise callers, and kwargs["gemm_idx"] will raise a KeyError if omitted. Use kwargs.get and wrap the result in list() for type consistency and backward compatibility.

Apply this diff:

-    def get_valid_tactics(self, inputs: List[torch.Tensor],
-                          profile: OptimizationProfile, **kwargs) -> List[int]:
-        return range(self.fused_moe_runner.get_tactic_num(kwargs["gemm_idx"]))
+    def get_valid_tactics(self, inputs: List[torch.Tensor],
+                          profile: OptimizationProfile, **kwargs) -> List[int]:
+        gemm_idx = kwargs.get("gemm_idx", 0)
+        return list(range(self.fused_moe_runner.get_tactic_num(gemm_idx)))

Additionally, confirm that the domain for gemm_idx matches the underlying implementation (0/1 vs. 1/2). If it expects MoeGemmId, keep passing the enum-compatible values.

#!/bin/bash
# Verify the expected domain/type of gemm_idx for FusedMoeRunner::get_tactic_num and run_gemm_profile.

set -euo pipefail

echo "Searching for FusedMoeRunner::get_tactic_num signature and usages..."
rg -n -C2 -P '\b(get_tactic_num|run_gemm_profile)\s*\('

echo
echo "Searching for MoeGemmId definition and values..."
rg -n -C2 -P '\benum\s+class\s+MoeGemmId\b|MoeGemmId::GEMM_'

echo
echo "Searching Python/C++ bindings for FusedMoeRunner to see the expected arg type..."
fd -t f | rg -n -C3 -P 'FusedMoeRunner|get_tactic_num|run_gemm_profile|MoeGemmId'
🧹 Nitpick comments (10)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)

2850-2854: Good gating for finalize scales; make the local flag const

The allocation of permuted_token_final_scales_ only when GEMM2 uses FINALIZE is correct and avoids dangling pointers when FINALIZE is off.

Minor style nit: declare the flag as const for clarity and consistency with nearby bool const locals.

-    bool gemm2_using_finalize_fusion
+    bool const gemm2_using_finalize_fusion
         = gemm2_config_->epilogue_fusion_type == cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
     permuted_token_final_scales_
         = gemm2_using_finalize_fusion ? getWsPtr(float{}, "permuted_token_final_scales") : nullptr;

4010-4016: Tactic-driven finalize fusion gating looks correct; prefer const locals

The alignment check between tactic request and runtime gating is a good fast-fail.

Nit: mark both locals const to match surrounding style.

-        bool gemm2_using_finalize_fusion = gemm2_config_->epilogue_fusion_type
+        bool const gemm2_using_finalize_fusion = gemm2_config_->epilogue_fusion_type
             == cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
-        bool using_fused_finalize
+        bool const using_fused_finalize
             = use_fused_finalize_ && gemm2_using_finalize_fusion && !use_w4_groupwise && !use_lora;
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

603-621: Finalize-fusion tactic derivation helper reads correctly

  • Taking configs by rvalue-ref avoids the temporary-binding issue and is efficient.
  • Appending FINALIZE epilogue only for TMA WS configs is correct.
  • Minor nit: consider returning an explicit std::vector<…> instead of auto for readability in a header.
-static auto addFinalizeFusionConfigs(
+static std::vector<cutlass_extensions::CutlassGemmConfig> addFinalizeFusionConfigs(
     std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)

17-17: Adopt include guards per repository guidelines

Headers should use TRTLLM_<NAME>_H include guards (e.g., TRTLLM_MOE_KERNELS_H) instead of #pragma once.

-#pragma once
+#ifndef TRTLLM_MOE_KERNELS_H
+#define TRTLLM_MOE_KERNELS_H

Add this at end of file (outside the selected range):

#endif // TRTLLM_MOE_KERNELS_H
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

970-979: Optional: add a bounds check before indexing tactics in runMoEPermute

While setTactic pre-validates, a defensive check here would prevent accidental misuse if runMoEPermute is called without prior setTactic.

-            auto tactics = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile))[tactic_idx];
+            auto all = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile));
+            if (tactic_idx < 0 || tactic_idx >= static_cast<int>(all.size())) {
+                throw std::runtime_error("Invalid tactic index for runMoEPermute");
+            }
+            auto tactics = all[tactic_idx];
cpp/tensorrt_llm/thop/moeOp.cpp (2)

603-606: Validate gemm_idx in runGemmProfile

Fail fast on invalid gemm_idx to avoid silently selecting the wrong set.

     {
         std::lock_guard<std::mutex> lock(mMutex);
+        TORCH_CHECK(gemm_idx == 1 || gemm_idx == 2, "gemm_idx must be 1 or 2");
 
         // TODO: support profiling under fp8 block scaling in the future

268-299: Remove duplicated bias validation block

The bias checks are duplicated back‑to‑back; keep a single block to reduce noise.

-        if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
-        {
-            CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
-            CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
-            TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
-            TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
-            TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
-                "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
-                "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
-                "fc1_expert_biases should match fc1_expert_weights output shape.");
-            TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
-                "fc2_expert_biases should match fc2_expert_weights output shape.");
-        }
-
-        if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
-        {
-            CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
-            CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
-            TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
-            TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
-            TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
-                "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
-                "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
-                "fc1_expert_biases should match fc1_expert_weights output shape.");
-            TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
-                "fc2_expert_biases should match fc2_expert_weights output shape.");
-        }
+        if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
+        {
+            CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
+            CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
+            TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
+            TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
+            TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
+                "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
+            TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
+                "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
+            TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
+                "fc1_expert_biases should match fc1_expert_weights output shape.");
+            TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
+                "fc2_expert_biases should match fc2_expert_weights output shape.");
+        }
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

53-86: Minor cleanup: redundant assert on string check

else if (tactic.is_string()) makes the assert(tactic.is_string()) immediately below redundant. Safe to remove for cleanliness.

Apply this small diff:

-    else if (tactic.is_string())
-    {
-        assert(tactic.is_string());
+    else if (tactic.is_string())
+    {
         auto tactic_name = tactic.get<std::string>();

278-285: Quality of life: default missing tactic_idX to 'auto'

When only one of tactic_id1/tactic_id2 is provided, defaulting the other to -1 ("auto") avoids skipping the benchmark later and mirrors user intent.

Apply this diff:

-        if (run_config.contains("tactic_id1"))
-        {
-            parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1);
-        }
-        if (run_config.contains("tactic_id2"))
-        {
-            parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2);
-        }
+        if (run_config.contains("tactic_id1"))
+        {
+            parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1);
+        }
+        else
+        {
+            tactic_ids1 = {-1}; // default to "auto"
+        }
+        if (run_config.contains("tactic_id2"))
+        {
+            parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2);
+        }
+        else
+        {
+            tactic_ids2 = {-1}; // default to "auto"
+        }

388-389: Update stale comment example to reflect per‑GEMM API

The comment references the old listAllTactics signature. Adjust for clarity.

-    auto cutlass_tactic = {-1};                           // {0,..., listAllTactics<BenchClass>(MoeGemmId).size()};
+    auto cutlass_tactic = {-1};                           // {0,..., listAllTactics<BenchClass>(MoeGemmId::GEMM_1).size()};
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a18eae9 and 610353c.

📒 Files selected for processing (15)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6 hunks)
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5 hunks)
  • tests/unittest/_torch/test_autotuner.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • tensorrt_llm/_torch/autotuner.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • tests/unittest/_torch/test_autotuner.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • tests/unittest/_torch/test_autotuner.py
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-19T03:35:20.797Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.797Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (26)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

4566-4576: Early-return gating for FINALIZE resolves prior compile hazard

Precomputing the groupwise flags and gating out unsupported FINALIZE conditions before use looks good and addresses the earlier “used before declaration” issue.


4616-4628: Deliberate TMA WS reuse per fusion-type; cache selection is sound

Using mTmaInputCache[i][use_finalize_fusion] and reusing the same workspace region per the design comment is aligned with the stated memory-saving approach. Selecting the cache element for GEMM1/GEMM2 based on mGemmToProfile is correct.


4634-4638: Finalize params for profiler path look correct

Enabling FINALIZE for GEMM2 and calling setFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, mK > 1) matches the expected “rows=num_tokens, cols=hidden” semantics in this path.


4675-4679: Preparing both NONE and FINALIZE variants up-front is appropriate

Seeding templates for both fusion types during prepare() aligns with the per-tactic selection in runProfiler and avoids runtime initialization hiccups.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

231-236: Per‑GEMM selector enum looks good

MoeGemmId is well‑scoped and follows naming conventions. It cleanly enables per‑GEMM tactic selection downstream.
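
For concreteness, a minimal self-contained sketch of how a per-GEMM selector like this tends to be used; the mock runner and placeholder tactic strings are illustrative assumptions (the real runner hands back cutlass_extensions::CutlassGemmConfig objects), not the actual MoeGemmRunner API.

#include <cstdio>
#include <string>
#include <vector>

// Sketch only: shapes and names are inferred from this review, not copied from the headers.
enum class MoeGemmId
{
    GEMM_1,
    GEMM_2,
};

struct MockRunner
{
    // Pretend each GEMM exposes its own tactic list; GEMM_2 additionally gets a FINALIZE variant.
    std::vector<std::string> getTactics(MoeGemmId id) const
    {
        std::vector<std::string> tactics{"tile_128x128", "tile_64x256"};
        if (id == MoeGemmId::GEMM_2)
        {
            tactics.push_back("tile_128x128+finalize_fusion");
        }
        return tactics;
    }
};

int main()
{
    MockRunner runner;
    // Per-GEMM selection: each GEMM is queried and tuned independently.
    auto const tactics1 = runner.getTactics(MoeGemmId::GEMM_1);
    auto const tactics2 = runner.getTactics(MoeGemmId::GEMM_2);
    std::printf("GEMM_1 tactics: %zu, GEMM_2 tactics: %zu\n", tactics1.size(), tactics2.size());
    return 0;
}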


622-627: Instance getTactics uses the new gemm_id correctly

Correctly composes configs and gates FINALIZE only for GEMM_2 when supported.

cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

1279-1282: Profiler tactic retrieval aligns with per‑GEMM API

Returning the per‑GEMM tactics via mGemmToProfile is correct, keeping the benchmark backend consistent with the new interface.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)

836-846: Per‑GEMM pickBestTactic path is correct

Casting GemmToProfile::GEMM_1/2 to MoeGemmId and profiling within that scope is sound.


928-953: setTactic: selects per‑GEMM tactics correctly and validates indices

The two‑vector approach and bounds checks before setTactic are appropriate.

tests/unittest/_torch/test_autotuner.py (1)

92-98: Adding kwargs to get_valid_tactics signatures is appropriate

Keeps the runner/test interfaces compatible with callers forwarding extra context (e.g., GEMM id) without altering behavior.

Also applies to: 153-156, 229-231, 317-320

cpp/tensorrt_llm/thop/moeOp.cpp (3)

51-52: MoeGemmId aliasing improves clarity and avoids coupling to namespace qualifiers

Good addition; downstream usage reads cleanly.


219-221: Eagerly caching per‑GEMM tactic vectors is sound

Fetching tactics for GEMM_1 and GEMM_2 up front simplifies subsequent indexing and avoids repeated queries.


590-595: getTacticNum: input validation added and return type widened

Validating gemm_idx and returning sizes as int64_t is correct and robust.

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5)

125-126: Signature extended with kwargs — aligns with per‑GEMM autotuner flow

Accepting **kwargs keeps this runner compatible with callers forwarding extra context such as gemm_idx. Ignoring kwargs here is fine since the underlying kernel API doesn’t take gemm id yet.


409-410: Signature extended with kwargs — aligns with per‑GEMM autotuner flow

Same note as above; forwarding compatibility without behavior change is good.


667-668: Signature extended with kwargs — aligns with per‑GEMM autotuner flow

Looks good; no behavior change and keeps the interface consistent across runners.


901-902: Signature extended with kwargs — aligns with per‑GEMM autotuner flow

Good consistency. No further action needed here.


1114-1116: Signature extended with kwargs — aligns with per‑GEMM autotuner flow

All set; consistent with the rest of the file and upstream autotuner changes.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

45-50: Confirm RunnerType deduction is valid for BenchClass::mMoERunner

Using decltype(BenchClass::mMoERunner) assumes mMoERunner is a type alias or a member whose declared type is the desired runner type. If it’s a non-static data member, depending on definitions this can be brittle across compilers. Consider explicitly naming the runner type (if available) or requiring BenchClass to provide a using RunnerType = ...; alias to make intent clear.

If BenchClass can expose a concrete alias, a safer pattern is:

-    using RunnerType = decltype(BenchClass::mMoERunner);
+    using RunnerType = typename BenchClass::RunnerType;
     return RunnerType::getTactics(sm, gemm_id);

Otherwise, verify that decltype(BenchClass::mMoERunner) resolves to the correct type on all supported toolchains.


334-353: Helpful diagnostics when no tactics — nice touch

Clear, one-time messages per run and listing per‑GEMM available tactics improves UX for misconfigured cases.


415-427: Verbose per‑GEMM tactic listing is clear and well‑structured

The per‑GEMM breakdown with numbered tactics is easy to scan and consistent with the new API.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

318-321: LGTM: returns a concrete list of tactic indices

Consistent with the type hint and other runners.


400-403: LGTM: returns a concrete list of tactic indices

Matches the declared return type and avoids iterator surprises.


512-525: LGTM: defers to kernel for valid configs with shape‑aware filtering

Interface matches the rest of the repo; no issues spotted.


726-729: LGTM: concrete list and stable interface

Consistent across runners; good.


801-805: LGTM: conforms to List[int] and aligns with tuning flow

No concerns here.

@djns99
Copy link
Collaborator Author

djns99 commented Aug 19, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15715 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15705 [ run ] completed with state ABORTED

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

608-612: Compilation issue already addressed in past review.

This is the same issue identified in the past review comments - using decltype(moe_gemm_runner_) in a static method. The fix was already suggested to use the concrete type directly.


808-812: Compilation issue already addressed in past review.

Same static context issue as above - this was already flagged in past review comments.

🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

831-833: Consider documenting the workspace size calculation behavior.

The call to getTmaWarpSpecializedConfigs(sm_, true) always requests configs with finalize fusion support for workspace sizing, even if finalize fusion won't be used. This ensures adequate workspace allocation for all possible configurations.

Consider adding a comment to clarify this intentional behavior:

 size_t MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>::calcMaxWorkspaceSize(int num_experts) const
 {
     // ... 
-    // Finalize fusion may not actually be supported by the kernel,
-    // if they are not we will catch the error and skip them
+    // Always request configs with finalize fusion support when calculating workspace size
+    // to ensure adequate allocation for all possible configurations.
+    // Finalize fusion may not actually be supported by the kernel,
+    // if they are not we will catch the error and skip them
     auto configs = getTmaWarpSpecializedConfigs(sm_, true);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 610353c and f6668fc.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-19T03:35:20.797Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.797Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (2)

292-295: LGTM!

The API evolution to support finalize fusion is well-designed. The changes appropriately propagate the supports_finalize_fusion flag through the config generation paths.

I appreciate the learning from past MOE reviews about avoiding excessive finalize fusion validation checks - the implementation here keeps it clean and focused.


300-304: LGTM!

Good refactoring to delegate to the static method, maintaining backward compatibility while centralizing the logic.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (4)

474-478: LGTM!

The refactoring to pass the supports_finalize_fusion flag through to the static method is clean and consistent with the architectural changes.


483-489: LGTM!

The signature update to accept supports_finalize_fusion and its propagation to getTmaWarpSpecializedConfigs is consistent with the per-GEMM tactic selection design.


521-571: Effective finalize fusion config duplication logic!

The implementation correctly:

  1. Gates finalize fusion configs based on the supports_finalize_fusion flag
  2. Creates duplicates with epilogue_fusion_type set to FINALIZE
  3. Appends them to the original configs

This approach allows profiling both fusion and non-fusion variants as discussed in the past review comments.
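
If it helps to picture that flow, here is a self-contained sketch of the duplicate-and-append step; the struct, tile values, and helper name are placeholders, and the real addFinalizeFusionConfigs additionally restricts the duplication to TMA warp-specialized configs.

#include <cstddef>
#include <cstdio>
#include <vector>

enum class EpilogueFusionType
{
    NONE,
    FINALIZE,
};

struct Config
{
    int tile = 0;
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

// For each base config, append a copy that requests the FINALIZE epilogue,
// so both variants show up in the profiling sweep.
std::vector<Config> addFinalizeVariants(std::vector<Config> configs, bool supports_finalize_fusion)
{
    if (supports_finalize_fusion)
    {
        auto const base_count = configs.size();
        for (std::size_t i = 0; i < base_count; ++i)
        {
            Config copy = configs[i];
            copy.epilogue_fusion_type = EpilogueFusionType::FINALIZE;
            configs.push_back(copy);
        }
    }
    return configs;
}

int main()
{
    auto const configs = addFinalizeVariants({{128}, {256}}, /* supports_finalize_fusion = */ true);
    std::printf("Configs after duplication: %zu\n", configs.size()); // Prints 4: two NONE + two FINALIZE.
    return 0;
}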


581-587: LGTM!

The static method for checking TMA warp specialized support across different SM architectures is well-structured and comprehensive (see the sketch after this list):

  • SM90 (Hopper)
  • SM100-119 (Blackwell)
  • SM120/121 (SM120 series)
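
A self-contained sketch of that SM gating, using only the ranges named in the bullets above; treat the exact boundaries as assumptions rather than a copy of the real supportsTmaWarpSpecialized.

#include <cstdio>

bool supportsTmaWarpSpecializedSketch(int sm)
{
    bool const isHopper = sm == 90;
    bool const isBlackwell = sm >= 100 && sm < 120;
    bool const isSm120Series = sm == 120 || sm == 121;
    return isHopper || isBlackwell || isSm120Series;
}

int main()
{
    int const sms[] = {80, 89, 90, 100, 119, 120, 121, 130};
    for (int sm : sms)
    {
        std::printf("SM%d: %s\n", sm, supportsTmaWarpSpecializedSketch(sm) ? "supported" : "unsupported");
    }
    return 0;
}
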
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (6)

231-236: LGTM!

The MoeGemmId enum is well-placed near MOEParallelismConfig for cross-module visibility and provides clear semantic identifiers for the different GEMMs in the MOE pipeline.


456-456: LGTM!

The API update to accept MoeGemmId enables per-GEMM tactic selection as intended.


603-606: LGTM!

The implementation correctly gates finalize fusion configs for GEMM_2 only, as expected for the second GEMM in the MOE pipeline.


911-911: LGTM!

The type alias simplifies the code and avoids redundant enum definitions.


982-982: LGTM!

The 2D array structure for mTmaInputCache appropriately handles both fusion variants (NONE and FINALIZE) for each routing sample.
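
A tiny sketch of that layout, with names and sizes made up for illustration: one cached entry per (routing sample, fusion variant), indexed by the fusion flag as described above.

#include <array>
#include <cstdio>
#include <vector>

enum class EpilogueFusion
{
    NONE,
    FINALIZE,
};

struct TmaWsInput
{
    bool prepared = false;
};

int main()
{
    int const numSamples = 4;
    // [i][0] holds the NONE variant, [i][1] the FINALIZE variant for sample i.
    std::vector<std::array<TmaWsInput, 2>> tmaInputCache(numSamples);

    // Prepare both variants up front, then pick one per tactic at run time.
    for (auto& perSample : tmaInputCache)
    {
        perSample[static_cast<int>(EpilogueFusion::NONE)].prepared = true;
        perSample[static_cast<int>(EpilogueFusion::FINALIZE)].prepared = true;
    }

    bool const useFinalizeFusion = true;
    auto const& selected = tmaInputCache[2][useFinalizeFusion ? 1 : 0];
    std::printf("Selected variant prepared: %s\n", selected.prepared ? "yes" : "no");
    return 0;
}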


995-996: LGTM!

The updated prepareTmaWsInputs signature with the fusion parameter enables proper preparation of workspaces for different epilogue fusion types.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15715 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11816 completed with status: 'FAILURE'

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from f6668fc to 513c6dd on August 19, 2025 at 21:36
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

691-701: NameError in fake kernel: act_fp8 referenced but parameter is act_fp4.

This will raise at shape-prop time. Rename the parameter to act_fp8 (or use act_fp4 consistently).

Apply this diff:

-@w4a8_mxfp4_fp8_gemm.register_fake
-def _(
-    act_fp4: torch.Tensor,
+@w4a8_mxfp4_fp8_gemm.register_fake
+def _(
+    act_fp8: torch.Tensor,
     weight: torch.Tensor,
     act_sf: torch.Tensor,
     weight_scale: torch.Tensor,
     alpha: torch.Tensor,
     output_dtype: torch.dtype,
     to_userbuffers: bool = False,
 ) -> torch.Tensor:
-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
+    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
                              dtype=output_dtype)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

928-956: Don’t index into empty tactics for unprofiled GEMM in setTactic()

When profiling a single GEMM, the other GEMM may legitimately have zero valid tactics (dtype/arch dependent). Unconditionally indexing tactics2[tactic_idx2] can go out-of-bounds. Pass nullopt for the unprofiled/empty side.

Apply this diff:

-        auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
-        auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
+        auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
+        auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
@@
-        mMoERunner.setTactic(tactics1[tactic_idx1], tactics2[tactic_idx2]);
+        // For GEMM-specific profiling, the unprofiled GEMM can have no valid tactics.
+        // Guard and pass nullopt to avoid OOB.
+        std::optional<tensorrt_llm::cutlass_extensions::CutlassGemmConfig> opt1;
+        std::optional<tensorrt_llm::cutlass_extensions::CutlassGemmConfig> opt2;
+        if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size()))
+        {
+            opt1 = tactics1[tactic_idx1];
+        }
+        if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size()))
+        {
+            opt2 = tactics2[tactic_idx2];
+        }
+        mMoERunner.setTactic(opt1, opt2);
cpp/tensorrt_llm/thop/moeOp.cpp (3)

60-96: Compile error: returning unique_ptr but storing into shared_ptr

switch_output_type returns std::unique_ptr, but mKernelRunner is a std::shared_ptr. This won’t compile at call sites. Make the factory return a std::shared_ptr and construct with make_shared.

Apply this diff:

-    std::unique_ptr<kernels::CutlassMoeFCRunnerInterface> switch_output_type(c10::ScalarType output_type)
+    std::shared_ptr<kernels::CutlassMoeFCRunnerInterface> switch_output_type(c10::ScalarType output_type)
     {
         switch (output_type)
         {
         case c10::ScalarType::Long: // INT64 == FP4
         case c10::ScalarType::Float8_e4m3fn:
             // TODO We need an atomic FP8 reduction for the finalize fusions
             C10_THROW_ERROR_FORMATTED(NotImplementedError,
                 "Outputting " << torch::toString(output_type) << " directly is not currently supported");
             // return std::make_unique<kernels::CutlassMoeFCRunner<Type, Type>>();
         case c10::ScalarType::Half:
             if constexpr (NeedQuant)
             {
-                return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, half>>();
+                return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, half>>();
             }
             else
             {
-                return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, TypeAct>>();
+                return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, TypeAct>>();
             }
         }
 #ifdef ENABLE_BF16
         case c10::ScalarType::BFloat16:
             if constexpr (NeedQuant)
             {
-                return std::make_unique<
+                return std::make_shared<
                     kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, __nv_bfloat16, __nv_bfloat16>>();
             }
             else
             {
-                return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, __nv_bfloat16, TypeAct>>();
+                return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, __nv_bfloat16, TypeAct>>();
             }
 #endif
         default:
             C10_THROW_ERROR_FORMATTED(Error,
                 "Invalid output type " << torch::toString(output_type) << " specified for "
                                        << torch::toString(mActivationDtype));
         }
     };

98-120: Compile error: factory returns unique_ptr but assigned to shared_ptr

Same issue for the weight-quant path. Return shared_ptr and use make_shared.

Apply this diff:

-    std::unique_ptr<kernels::CutlassMoeFCRunnerInterface> create_weight_quant_runner()
+    std::shared_ptr<kernels::CutlassMoeFCRunnerInterface> create_weight_quant_runner()
     {
         if (isInt8Quant())
         {
-            return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, uint8_t>>();
+            return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, uint8_t>>();
         }
         else if (isInt4Quant())
         {
 #ifdef ENABLE_FP8
             if (mUseW4GroupScaling)
             {
-                return std::make_unique<
+                return std::make_shared<
                     kernels::CutlassMoeFCRunner<__nv_fp8_e4m3, cutlass::uint4b_t, TypeAct, TypeAct>>();
             }
 #endif
-            return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, cutlass::uint4b_t>>();
+            return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, cutlass::uint4b_t>>();
         }
         else
         {
             C10_THROW_ERROR_FORMATTED(Error, "Unsupported weight quantization type");
         }
     }

147-159: Compile error: assigning unique_ptr to shared_ptr

This line uses make_unique while mKernelRunner is a shared_ptr. Also, after changing the factories to return shared_ptr, the FP8 path assignment becomes correct.

Apply this diff:

-        else if (mActivationDtype == c10::ScalarType::BFloat16 && mWeightDtype == c10::ScalarType::Float8_e4m3fn)
-        {
-            mKernelRunner = std::make_unique<kernels::CutlassMoeFCRunner<__nv_bfloat16, __nv_fp8_e4m3>>();
-        }
+        else if (mActivationDtype == c10::ScalarType::BFloat16 && mWeightDtype == c10::ScalarType::Float8_e4m3fn)
+        {
+            mKernelRunner = std::make_shared<kernels::CutlassMoeFCRunner<__nv_bfloat16, __nv_fp8_e4m3>>();
+        }
♻️ Duplicate comments (9)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

84-86: Handle missing gemm_idx and return a concrete List[int] (not range).

The current implementation assumes kwargs["gemm_idx"] exists and returns a range, which violates the List[int] return type and can raise a KeyError. Make gemm_idx optional and coerce the result to a list for type consistency.

Apply this diff:

-    def get_valid_tactics(self, inputs: List[torch.Tensor],
-                          profile: OptimizationProfile, **kwargs) -> List[int]:
-        return range(self.fused_moe_runner.get_tactic_num(kwargs["gemm_idx"]))
+    def get_valid_tactics(self, inputs: List[torch.Tensor],
+                          profile: OptimizationProfile, **kwargs) -> List[int]:
+        gemm_idx = kwargs.get("gemm_idx", 0)
+        return list(range(self.fused_moe_runner.get_tactic_num(gemm_idx)))

To ensure the default index is correct, verify whether 0 is a valid sentinel (e.g., Undefined) or if indexing starts at 1:

#!/bin/bash
# Inspect uses/definitions to confirm valid gemm index domain
rg -nP 'MoeGemmId' -C3
rg -nP 'get_tactic_num\s*\(' -C2
rg -nP 'getTacticNum\s*\(' -C2
rg -nP 'gemm_idx\s*=\s*(1|2)\b' -C2
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

949-951: Avoid out-of-range access when deterministic tactics are requested

Indexing [0] without guarding can crash if no tactics are available (e.g., unsupported shapes/dtypes/arch or misconfig). Guard and use front() after validating non-emptiness.

Apply this diff:

-        gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0];
-        gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0];
+        auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1);
+        auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2);
+        TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE");
+        gemm1 = t1.front();
+        gemm2 = t2.front();
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

1079-1084: Guard logging against empty tactic lists in GEMM-specific mode

When only one GEMM is profiled, the other may have an empty tactic list. Logging unconditionally indexes both vectors and can go OOB.

Apply this diff:

-        auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
-        auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
-        std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n"
-                  << tactics1[tactic_idx1].toString() << std::endl;
-        std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n"
-                  << tactics2[tactic_idx2].toString() << std::endl;
+        auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
+        auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
+        if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_1)
+        {
+            std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n";
+            if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size()))
+            {
+                std::cout << tactics1[tactic_idx1].toString() << std::endl;
+            }
+        }
+        if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_2)
+        {
+            std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n";
+            if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size()))
+            {
+                std::cout << tactics2[tactic_idx2].toString() << std::endl;
+            }
+        }
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

536-545: Help text: arrays can also contain the strings “auto”/“all” (not only integers)

The parser accepts arrays containing integers or strings. Update the help text accordingly to avoid user confusion.

Apply this diff:

-           " - An array: of integers, forms a list of tactics to sweep\n"
+           " - An array: of integers or strings, forms a list of tactics to sweep\n"

45-50: Fix: use the concrete runner type instead of decltype on a (likely) data member

decltype(BenchClass::mMoERunner) is unreliable here and won’t provide a type with the required static getTactics(int, MoeGemmId) API. Use the concrete runner type derived from BenchClass’ template parameters.

Apply this diff:

 template <class BenchClass>
 auto listAllTactics(MoeGemmId gemm_id)
 {
     int const sm = getSMVersion();
-    using RunnerType = decltype(BenchClass::mMoERunner);
-    return RunnerType::getTactics(sm, gemm_id);
+    using RunnerType = tensorrt_llm::kernels::cutlass_kernels::CutlassMoeFCRunner<
+        typename BenchClass::DataType,
+        typename BenchClass::WeightType,
+        typename BenchClass::OutputType,
+        typename BenchClass::InputType>;
+    return RunnerType::getTactics(sm, gemm_id);
 }
cpp/tensorrt_llm/thop/moeOp.cpp (2)

630-638: Guard empty profiles and out-of-range profile_id in runGemmProfile

profiles.front() and profiles[profile_id] can throw if the list is empty or the id is out of range. Add checks to fail fast with a clear message.

Apply this diff:

-        auto const gemm_to_profile
-            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
-        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        auto const gemm_to_profile
+            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
+        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_", gemm_idx);
@@
-        auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id];
+        const auto& profile = [&]() -> const Profile&
+        {
+            if (profile_id == -1) return profiles.front();
+            TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()),
+                        "profile_id out of range for GEMM_", gemm_idx, ": ", profile_id);
+            return profiles[profile_id];
+        }();

740-749: Guard empty tactic lists and validate profile_ids in setRunnerProfiles

Avoid undefined behavior on .front() and uncaught out_of_range exceptions from .at() when no tactics are available or ids are invalid.

Apply this diff:

-        auto best_gemm1_profile = mGemm1Profiles.front();
-        auto best_gemm2_profile = mGemm2Profiles.front();
+        TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available");
+        auto best_gemm1_profile = mGemm1Profiles.front();
+        auto best_gemm2_profile = mGemm2Profiles.front();
         if (profile_ids.has_value())
         {
             TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids");
-            best_gemm1_profile
-                = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]);
-            best_gemm2_profile
-                = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]);
+            if (profile_ids.value()[0] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[0] >= 0
+                        && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()),
+                        "profile_ids[0] out of range");
+                best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]);
+            }
+            if (profile_ids.value()[1] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[1] >= 0
+                        && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()),
+                        "profile_ids[1] out of range");
+                best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]);
+            }
         }
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

608-612: Fix static context: avoid referencing instance member in static method

decltype(moe_gemm_runner_) references a non‑static member from a static method, which is brittle and obscures the intended type. Use the known runner type directly.

Apply this diff:

-    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return RunnerType::getConfigs(sm, gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
-    }
+    static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return RunnerType::getConfigs(sm, gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
+    }

808-812: Fix static context (same issue) in mayHaveFinalizeFused(int sm)

Avoid decltype(moe_gemm_runner_) inside a static method; use the runner type.

Apply this diff:

-    static bool mayHaveFinalizeFused(int sm)
-    {
-        using RunnerType = decltype(moe_gemm_runner_);
-        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
-    }
+    static bool mayHaveFinalizeFused(int sm)
+    {
+        using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+        return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
+    }
🧹 Nitpick comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

334-353: Optional: default missing tactic lists to “auto” instead of skipping the benchmark

Skipping entirely when either tactic list is empty can be inconvenient (e.g., when a user only provides one side). Consider defaulting the missing side to -1 ("auto") and proceeding, while printing the same warning once.

Apply this diff:

-        if (tactic_ids1.empty() || tactic_ids2.empty())
-        {
-            std::cerr << "Warning: Skipping benchmark, no valid tactic found" << std::endl;
-            static bool printed = false;
-            if (!printed)
-            {
-                printed = true;
-                std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n";
-                for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2})
-                {
-                    std::cerr << "GEMM " << (int) gemm_id << ":\n";
-                    auto confs = listAllTactics<BenchClass>(gemm_id);
-                    for (auto c : confs)
-                        std::cerr << c.toString();
-                    std::cerr << std::endl;
-                }
-            }
-
-            continue;
-        }
+        if (tactic_ids1.empty() || tactic_ids2.empty())
+        {
+            std::cerr << "Warning: Missing tactic list for one GEMM; defaulting to auto (-1)." << std::endl;
+            static bool printed = false;
+            if (!printed)
+            {
+                printed = true;
+                std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n";
+                for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2})
+                {
+                    std::cerr << "GEMM " << (int) gemm_id << ":\n";
+                    auto confs = listAllTactics<BenchClass>(gemm_id);
+                    for (auto c : confs)
+                        std::cerr << c.toString();
+                    std::cerr << std::endl;
+                }
+            }
+            if (tactic_ids1.empty()) tactic_ids1 = {-1};
+            if (tactic_ids2.empty()) tactic_ids2 = {-1};
+        }
cpp/tensorrt_llm/thop/moeOp.cpp (1)

268-298: Remove duplicated bias validation block

The bias validation block appears twice back-to-back; the second occurrence is redundant and increases maintenance overhead.

Apply this diff to remove the duplicate block:

-        if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
-        {
-            CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
-            CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
-            TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
-            TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
-            TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
-                "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
-                "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
-                "fc1_expert_biases should match fc1_expert_weights output shape.");
-            TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
-                "fc2_expert_biases should match fc2_expert_weights output shape.");
-        }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f6668fc and 513c6dd.

📒 Files selected for processing (15)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (5 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6 hunks)
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5 hunks)
  • tests/unittest/_torch/test_autotuner.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • tests/unittest/_torch/test_autotuner.py
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-19T03:35:20.797Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.797Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧬 Code Graph Analysis (6)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
  • std (43-114)
  • vector (608-612)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1)
  • supportsTmaWarpSpecialized (293-312)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (3)
  • isValidHopperMOESpecialisation (63-78)
  • isValidBlackwellMOESpecialisation (47-57)
  • isValidSM120MOESpecialisation (34-42)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)
  • sm (1090-1118)
  • sm (1090-1090)
  • sm (1120-1144)
  • sm (1120-1120)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2)
  • getTactics (1278-1282)
  • getTactics (1278-1278)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (9)
  • vector (1267-1279)
  • vector (1267-1267)
  • sm (1090-1118)
  • sm (1090-1090)
  • sm (1120-1144)
  • sm (1120-1120)
  • num_tokens (409-451)
  • num_tokens (409-410)
  • stream (829-841)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2)
  • getTactics (1278-1282)
  • getTactics (1278-1278)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (2)
  • supportsTmaWarpSpecialized (293-312)
  • EpilogueFusion (168-319)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
  • prepareTmaWsInputs (4558-4663)
  • prepareTmaWsInputs (4558-4559)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)
  • EpilogueFusion (167-331)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
  • MoeGemmId (231-354)
  • std (43-114)
  • vector (608-612)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)
  • vector (566-570)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)
tests/unittest/_torch/test_autotuner.py (4)
  • get_valid_tactics (94-97)
  • get_valid_tactics (153-156)
  • get_valid_tactics (229-231)
  • get_valid_tactics (317-324)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5)
  • get_valid_tactics (125-141)
  • get_valid_tactics (409-422)
  • get_valid_tactics (667-680)
  • get_valid_tactics (901-914)
  • get_valid_tactics (1114-1127)
tensorrt_llm/_torch/autotuner.py (2)
  • get_valid_tactics (147-165)
  • OptimizationProfile (118-133)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
  • permuted_token_final_scales_ (871-871)
  • mSM (962-962)
  • mMinLatencyMode (987-987)
  • mNumExpertsPerNode (964-964)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (4)
  • permuted_token_final_scales_ (822-822)
  • mSM (920-920)
  • mMinLatencyMode (944-944)
  • mNumExpertsPerNode (922-922)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (29)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

318-320: LGTM: kwargs accepted and returns List[int].

Signature is future-proofed with **kwargs and the return type matches the hint.


400-402: LGTM: consistent with other runners.

Returns a concrete list and aligns with the tuner's expectations.


512-525: LGTM: shape-derived config query is clear and specific.

Inputs are unpacked explicitly; delegating to kernel_runner.get_valid_configs(b, m, n, k) reads well. No issues spotted.


726-728: LGTM: correct return type and signature.

Conforms to the List[int] contract; kwargs reserved for future use.


801-805: LGTM: list conversion maintains type consistency.

Good use of list(range(...)) for the advertised return type.

cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)

1281-1282: LGTM: getTactics now honors backend-selected GEMM

Returning tactics via backend.mGemmToProfile aligns with the per‑GEMM profiling path and removes the unused m/n/k coupling. Looks correct.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)

836-836: LGTM: per‑GEMM tactic retrieval in pickBestTactic

Using static_cast<MoeGemmId>(gemm_to_profile) here is fine since the callers pass GEMM_1/GEMM_2. Keeps tactic lists GEMM‑specific.


970-979: LGTM: GEMM-only profiler path uses per‑GEMM tactic

Indexing by tactic_idx here is fine because setTactic() validates the profiled GEMM index before use. With the above guard in setTactic(), the unprofiled GEMM won’t be dereferenced in this path.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (8)

2850-2854: Gating final scales buffer on tactic’s finalize epilogue — LGTM

Using gemm2_config_->epilogue_fusion_type to derive gemm2_using_finalize_fusion and only wiring permuted_token_final_scales_ in that case aligns the buffer lifetime with the chosen epilogue. Clean and clear.
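
For illustration, a minimal sketch of that wiring (gemm2_config_, epilogue_fusion_type, the FINALIZE enumerator, and permuted_token_final_scales_ come from this review; the local variable name is hypothetical):

bool const gemm2_using_finalize_fusion = gemm2_config_->epilogue_fusion_type
    == cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
// Only thread the scales buffer through when the selected GEMM2 tactic requests the fused finalize epilogue.
auto* final_scales_for_gemm2 = gemm2_using_finalize_fusion ? permuted_token_final_scales_ : nullptr;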


4010-4016: Finalize-fusion decision ties runner and tactic — LGTM

using_fused_finalize is gated by both the runner setting and the tactic’s requested epilogue, and the sanity TLLM_CHECK_WITH_INFO ensures they don’t diverge. This is the right coupling point.


4558-4560: Profiler: pass epilogue fusion into TMA-WS preparation — LGTM

Promoting fusion as an explicit parameter to prepareTmaWsInputs makes the per-fusion cache population unambiguous and keeps the API future-proof.


4566-4576: Early-exit gating for finalize fusion in profiler — LGTM

The support check (runner flag, min-latency, groupwise W4*, GEMM_2 only) avoids preparing unsupported variants. Computing the groupwise flags up front also prevents the “use-before-declaration” issue seen earlier.


4616-4621: Reuse TMA-WS buffers between NONE/FINALIZE variants is intentional — acknowledged

The comment and logic reflect the deliberate reuse to save space. No changes requested here.


4626-4638: Per-fusion template selection and parameterization — LGTM

Selecting the cache entry by use_finalize_fusion, setting fusion per-GEMM, and calling setFinalizeFusionParams(output, hidden_size, num_tokens, mK > 1) matches the expected (hidden_size, num_tokens) layout for the transposed finalize path. Looks correct.


4675-4679: Prepare both NONE and FINALIZE templates up front for the profiler — LGTM

Preparing both fusion variants in one pass improves tactic switching latency during profiling. Given the intentional workspace reuse, this remains consistent.


4742-4745: TMA-WS template lookup by tactic’s fusion type — LGTM

Indexing mTmaInputCache by epilogue fusion and asserting .isValid() ensures the profiler executes with an initialized template corresponding to the chosen tactic.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (4)

473-478: Instance-level getConfigs(bool) — LGTM

Forwarding to the SM-parametrized overload keeps call sites simple while allowing per-runner decisions when needed.
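
A minimal sketch of the forwarding pattern being approved here (the config type and sm_ member appear elsewhere in this review; the const qualifier is an assumption):

std::vector<cutlass_extensions::CutlassGemmConfig> getConfigs(bool supports_finalize_fusion) const
{
    // Forward to the SM-parametrized overload so call sites only need to state the fusion capability.
    return getConfigs(sm_, supports_finalize_fusion);
}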


482-489: SM-parametrized getConfigs(sm, supports_finalize_fusion) — LGTM

Combining TMA-WS and Ampere configs under a single API and threading the supports_finalize_fusion flag through is a good consolidation.


521-571: TMA-WS configs: duplicate for FINALIZE when supported — LGTM

Cloning configs and tagging epilogue_fusion_type = FINALIZE when finalize is supported cleanly surfaces finalize-capable tactics to the selector without touching base heuristics.


832-834: Workspace sizing includes FINALIZE-capable TMA configs — LGTM

Using getTmaWarpSpecializedConfigs(sm_, true) and trying both NONE and FINALIZE (with try/catch) ensures the allocation bounds cover the largest supported path at runtime.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

278-285: LGTM: per-GEMM tactic parsing

Conditional parsing of tactic_id1 and tactic_id2 is correct and aligns with the new per‑GEMM API.


415-426: LGTM: verbose per‑GEMM tactic listing

Clear per‑GEMM enumeration with indices will help users map IDs to configs.
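
As a rough sketch of how such an enumeration can be rendered (listAllTactics and toString() appear elsewhere in this review; the exact output format is illustrative):

for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2})
{
    auto confs = listAllTactics<BenchClass>(gemm_id);
    for (size_t i = 0; i < confs.size(); ++i)
    {
        // Print the index that users pass as tactic_id1/tactic_id2 next to the config it selects.
        std::cout << "GEMM " << (int) gemm_id << " tactic " << i << ": " << confs[i].toString();
    }
}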

cpp/tensorrt_llm/thop/moeOp.cpp (3)

154-159: LGTM after factory return-type change

With switch_output_type returning a shared_ptr, this assignment is correct and fixes the CI compile breakage in the FP8 path.


202-207: LGTM after factory return-type change

With create_weight_quant_runner returning a shared_ptr, these assignments compile and are consistent.


590-595: LGTM: input validation for gemm_idx

Explicitly checking gemm_idx prevents silent misuse and matches the Python binding contract.
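
A sketch of the kind of check being approved (the exact message is illustrative):

TORCH_CHECK(gemm_idx == 1 || gemm_idx == 2, "gemm_idx must be 1 (GEMM_1) or 2 (GEMM_2), got ", gemm_idx);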

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)

231-236: LGTM: MoeGemmId enum introduction

Clear scoped enum for GEMM routing; improves type safety across the per‑GEMM APIs.
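
For context, a minimal sketch of the scoped enum and a per-GEMM query (reconstructed from usage elsewhere in this PR, not copied from the header):

enum class MoeGemmId
{
    GEMM_1,
    GEMM_2
};

// Tactics are now queried independently per GEMM.
auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);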


456-457: LGTM: per‑GEMM tactic API on the interface

The getTactics(MoeGemmId) signature change is consistent with the PR intent and downstream usage.


603-606: Correctness: gate finalize variants per GEMM via gemm_id

The override correctly gates FINALIZE variants only for GEMM_2 when supported.


911-997: LGTM: profiler type alias and TMA input cache shape

  • Using MoeGemmId as GemmToProfile is consistent and simplifies cross‑module usage.
  • Expanding mTmaInputCache to index by fusion (NONE/FINALIZE) matches the updated workspace preparation.

@djns99 djns99 force-pushed the user/djns99/finalize-choosable-tactic branch from 5804b77 to 71f9272 on August 20, 2025 23:03
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

689-701: register_fake variable typo: act_fp8 vs act_fp4

The fake implementation for w4a8_mxfp4_fp8_gemm declares act_fp4 but returns using act_fp8, which is undefined and will raise a NameError at trace time.

Apply this diff:

-@w4a8_mxfp4_fp8_gemm.register_fake
+@w4a8_mxfp4_fp8_gemm.register_fake
 def _(
-    act_fp4: torch.Tensor,
+    act_fp8: torch.Tensor,
@@
 ) -> torch.Tensor:
-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
-                             dtype=output_dtype)
+    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
+                             dtype=output_dtype)

Alternatively, if the op truly consumes FP8 activations, rename the parameter to act_fp8 (as above). If it’s FP4, change both the annotation and the return to use act_fp4 consistently.

♻️ Duplicate comments (7)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

1079-1085: Guard logging when a GEMM has no tactics or index is invalid

Unconditional indexing can go out of bounds when a GEMM isn’t profiled or yields no valid tactics. Use bounds checks before printing.

Apply this diff:

-        auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
-        auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
-        std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n"
-                  << tactics1[tactic_idx1].toString() << std::endl;
-        std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n"
-                  << tactics2[tactic_idx2].toString() << std::endl;
+        auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
+        auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
+        if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_1)
+        {
+            std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n";
+            if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size()))
+            {
+                std::cout << tactics1[tactic_idx1].toString() << std::endl;
+            }
+        }
+        if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_2)
+        {
+            std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n";
+            if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size()))
+            {
+                std::cout << tactics2[tactic_idx2].toString() << std::endl;
+            }
+        }
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)

45-50: Compile issue: decltype(BenchClass::mMoERunner) does not reliably yield the concrete runner type

decltype(BenchClass::mMoERunner) is invalid here. Use the concrete runner alias derived from BenchClass’ template parameters.

Apply this diff:

 template <class BenchClass>
 auto listAllTactics(MoeGemmId gemm_id)
 {
     int const sm = getSMVersion();
-    using RunnerType = decltype(BenchClass::mMoERunner);
-    return RunnerType::getTactics(sm, gemm_id);
+    using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType,
+        typename BenchClass::WeightType,
+        typename BenchClass::OutputType,
+        typename BenchClass::InputType>;
+    return RunnerType::getTactics(sm, gemm_id);
 }
cpp/tensorrt_llm/thop/moeOp.cpp (3)

590-595: Make return type explicit and keep validation — minor polish

You already validate gemm_idx. Consider explicitly casting size() to int64_t to avoid implicit narrowing across platforms.

Apply this diff:

-        return (gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size();
+        return static_cast<int64_t>((gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size());

630-637: Guard empty tactic lists and out-of-range profile_id in runGemmProfile

profiles.front() and profiles[profile_id] invoke undefined behavior if profiles is empty or profile_id is invalid. Fail fast with clear messages.

Apply this diff:

-        auto const gemm_to_profile
-            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
-        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        auto const gemm_to_profile
+            = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
+        auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+        TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_", gemm_idx);
@@
-        auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id];
+        const auto& profile = [&]() -> const Profile& {
+            if (profile_id == -1) return profiles.front();
+            TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()),
+                        "profile_id out of range for GEMM_", gemm_idx, ": ", profile_id);
+            return profiles[profile_id];
+        }();

740-749: Protect .front()/.at() in setRunnerProfiles and validate profile_ids

This currently assumes non-empty mGemm{1,2}Profiles and valid indices. Add guards to prevent UB and provide clearer error messages.

Apply this diff:

-        auto best_gemm1_profile = mGemm1Profiles.front();
-        auto best_gemm2_profile = mGemm2Profiles.front();
+        TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available");
+        auto best_gemm1_profile = mGemm1Profiles.front();
+        auto best_gemm2_profile = mGemm2Profiles.front();
         if (profile_ids.has_value())
         {
             TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids");
-            best_gemm1_profile
-                = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]);
-            best_gemm2_profile
-                = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]);
+            if (profile_ids.value()[0] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[0] >= 0
+                        && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()),
+                        "profile_ids[0] out of range for GEMM_1");
+                best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]);
+            }
+            if (profile_ids.value()[1] != -1)
+            {
+                TORCH_CHECK(profile_ids.value()[1] >= 0
+                        && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()),
+                        "profile_ids[1] out of range for GEMM_2");
+                best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]);
+            }
         }
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)

1178-1181: Fixed signature mismatch — good catch

Using selectTacticsForArch(sm) resolves the earlier extra-arg build error.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

84-87: Return a list and make gemm_idx resilient in get_valid_tactics

Returning range violates the List[int] contract, and blindly indexing kwargs risks KeyError. Convert to a list and default gemm_idx to a sane value.

Apply this diff:

-    def get_valid_tactics(self, inputs: List[torch.Tensor],
-                          profile: OptimizationProfile, **kwargs) -> List[int]:
-        return range(self.fused_moe_runner.get_tactic_num(kwargs["gemm_idx"]))
+    def get_valid_tactics(self, inputs: List[torch.Tensor],
+                          profile: OptimizationProfile, **kwargs) -> List[int]:
+        gemm_idx = int(kwargs.get("gemm_idx", 1))
+        return list(range(int(self.fused_moe_runner.get_tactic_num(gemm_idx))))
🧹 Nitpick comments (16)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1)

2-2: Disambiguate the provenance of the archived kernels’ commit.

The “commit” line shows a 40-char hash but not which repo it refers to. Make provenance unambiguous so downstream triage (especially for CI failures) can trace the exact source.

Apply this minimal format tweak:

-commit 444ef1b3b06cdc7ee66b4e612ce26ad25967440b
+source internal_cutlass_kernels@444ef1b3b06cdc7ee66b4e612ce26ad25967440b

Optionally, add build metadata (compiler/CUDA/CUTLASS versions) if you already capture it during artifact build; this pays off when multiple ABI-compatible archives exist.

cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h (2)

61-63: Namespace relocation is fine; add brief doc and clarify intent.

The new nested namespace reads as an API/ABI surface change. A short comment helps future readers understand why these helpers live under oss.

Apply this small doc tweak:

-namespace oss
+// Internal helpers for int8 GEMM dispatch isolated under a sub-namespace to keep
+// the public runner API stable while we evolve tactic selection plumbing.
+namespace oss
 {

341-345: Parenthesize the condition for readability.

Make the precedence explicit to avoid misreads.

-    else if (mSm >= 80 && mSm <= 90 || mSm >= 120)
+    else if ((mSm >= 80 && mSm <= 90) || (mSm >= 120))
     {
         oss::dispatchGemmToCutlass<T, cutlass::arch::Sm80>(A, B, quantOption, alphaCol, alphaRow, C, m, n, k,
             workspacePtr, workspaceBytes, gemmConfig, stream, occupancy);
     }
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)

836-838: Avoid enum-to-enum static_cast; map GemmToProfile to MoeGemmId explicitly

Casting between distinct enums relies on coincidental underlying values and risks invalid values (e.g., LAYER). Use an explicit mapping.

Apply this diff:

-        auto tactics = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile));
+        MoeGemmId gemmId = (gemm_to_profile == GemmToProfile::GEMM_1)
+            ? MoeGemmId::GEMM_1
+            : MoeGemmId::GEMM_2;
+        auto tactics = mMoERunner.getTactics(gemmId);

970-977: Same enum mapping nit: avoid static_cast here too

Mirror the earlier explicit mapping to prevent accidental misuse if gemm_to_profile is ever LAYER.

Apply this diff:

-            auto tactics = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile))[tactic_idx];
+            MoeGemmId gemmId = (gemm_to_profile == GemmToProfile::GEMM_1)
+                ? MoeGemmId::GEMM_1
+                : MoeGemmId::GEMM_2;
+            auto tactics = mMoERunner.getTactics(gemmId)[tactic_idx];
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (2)

521-571: Pre-reserve capacity before duplicating configs for FINALIZE

Minor perf/readability: reserve to avoid reallocation while appending FINALIZE variants.

Apply this diff:

-    if (supports_finalize_fusion)
+    if (supports_finalize_fusion)
     {
-        // Duplicate the configs and set the epilogue fusion type to FINALIZE
+        // Duplicate the configs and set the epilogue fusion type to FINALIZE
+        tma_ws_configs.reserve(tma_ws_configs.size() * 2);
         auto finalize_configs = tma_ws_configs;
         std::transform(finalize_configs.begin(), finalize_configs.end(), std::back_inserter(tma_ws_configs),
             [](auto& config)
             {
                 config.epilogue_fusion_type = cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
                 return config;
             });
     }

831-834: Avoid duplicated work when computing workspace size

You already try both NONE and FINALIZE paths explicitly below. Calling getTmaWarpSpecializedConfigs(sm_, true) produces duplicates, doubling iterations. Query only base configs here.

Apply this diff:

-        // Finalize fusion may not actually be supported by the kernel,
-        // if they are not we will catch the error and skip them
-        auto configs = getTmaWarpSpecializedConfigs(sm_, true);
+        // Fetch base configs once; we explicitly try NONE and (on SM90) FINALIZE below.
+        auto configs = getTmaWarpSpecializedConfigs(sm_, /*supports_finalize_fusion=*/false);
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)

279-285: Optional: accept single-sided tactic spec by defaulting the other to auto

Right now, missing tactic_id1 or tactic_id2 results in skipping later. Defaulting the absent side to [-1] (auto) improves usability for layer profiling.

Apply this diff:

-        if (run_config.contains("tactic_id1"))
+        if (run_config.contains("tactic_id1"))
         {
             parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1);
         }
-        if (run_config.contains("tactic_id2"))
+        if (run_config.contains("tactic_id2"))
         {
             parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2);
         }
+        // Default missing side to auto for full-layer runs
+        if (tactic_ids1.empty()) { tactic_ids1 = {-1}; }
+        if (tactic_ids2.empty()) { tactic_ids2 = {-1}; }

334-353: Current behavior skips benchmarks when tactics missing — prefer auto-default instead

If you don’t adopt the defaulting above, consider at least making the skip reason explicit in logs. Otherwise, applying the suggested default avoids this branch entirely.

Apply this diff (if you keep skip):

-            std::cerr << "Warning: Skipping benchmark, no valid tactic found" << std::endl;
+            std::cerr << "Warning: Skipping benchmark: no tactics specified for one or both GEMMs "
+                         "(provide tactic_id1/tactic_id2 or use \"auto\"/\"all\")." << std::endl;
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

2849-2854: Tactic-driven gating of permuted_token_final_scales_ looks good

Using gemm2_config_->epilogue_fusion_type to decide whether to thread the scales buffer aligns the workspace wiring with the chosen tactic. No functional concerns. If you want to reduce duplication later, consider a tiny helper (e.g., gemm2RequestsFinalize()) to reuse this check across call sites.
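
If that helper is added, it could look roughly like this (the name follows the suggestion above; gemm2_config_ and the FINALIZE enumerator appear elsewhere in this review, and the optional-style has_value() access is an assumption):

bool gemm2RequestsFinalize() const
{
    // True only when the selected GEMM2 tactic asked for the fused finalize epilogue.
    return gemm2_config_.has_value()
        && gemm2_config_->epilogue_fusion_type
        == cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
}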


4010-4016: Good consistency check between tactic and runner config

The equality check between using_fused_finalize and gemm2_using_finalize_fusion is the right fail-fast. As a tiny DX improvement, consider including both boolean values in the error message for quicker diagnosis.
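
For example, something along these lines (variable names are from the review; the message format assumes TLLM_CHECK_WITH_INFO's printf-style formatting):

TLLM_CHECK_WITH_INFO(using_fused_finalize == gemm2_using_finalize_fusion,
    "Finalize fusion mismatch: runner=%d, tactic=%d", static_cast<int>(using_fused_finalize),
    static_cast<int>(gemm2_using_finalize_fusion));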


4634-4638: Finalize params ordering matches row/col semantics

setFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, mK > 1) passes (hidden_size, num_output_tokens, reduce), which is consistent with the swapped/transpose layout. Consider a brief comment here to make the rationale obvious to future readers.
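
One possible phrasing of that comment at the call site (the object name is a placeholder; the argument order is from the review):

// Finalize epilogue output is written transposed: rows map to hidden_size and columns to
// num_output_tokens, hence the (hidden size, token count, reduce) argument order.
tma_ws_input.setFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, /* reduce = */ mK > 1);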


4675-4679: Dual call to prepareTmaWsInputs is clear; optional call-site note

Preparing NONE and FINALIZE templates back-to-back is consistent with the caching design. If helpful, add a one-liner here mirroring the "reuse to save space" note from prepareTmaWsInputs to make the intentional overlap explicit at the call site too.

cpp/tensorrt_llm/thop/moeOp.cpp (2)

268-298: Remove duplicated bias validation block in runMoe

The validation of fc{1,2}_expert_biases appears twice verbatim, increasing maintenance risk.

Apply this diff to drop the second copy:

-        if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
-        {
-            CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
-            CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
-            TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
-            TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
-            TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
-                "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
-                "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
-            TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
-                "fc1_expert_biases should match fc1_expert_weights output shape.");
-            TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
-                "fc2_expert_biases should match fc2_expert_weights output shape.");
-        }

433-443: Typo in method name runMoeMinLantency

Minor naming nit: “Lantency” → “Latency”. Consider renaming the method and adjusting the TORCH_LIBRARY binding accordingly to improve readability. This is API-facing only via the binding symbol “run_moe_min_latency”, so internal rename is low-risk.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

91-94: Avoid defaulting gemm_idx to 0 in forward

C++ side now TORCH_CHECKs gemm_idx in {1,2}. Defaulting to 0 will throw if callers forget to pass it. Prefer default=1 or validate locally.

Apply this diff:

-        gemm_idx: int = 0,
+        gemm_idx: int = 1,
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 513c6dd and 71f9272.

⛔ Files ignored due to path filters (2)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz is excluded by !**/*.xz
📒 Files selected for processing (18)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h (3 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (5 hunks)
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1 hunks)
  • cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
  • cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
  • tensorrt_llm/_torch/autotuner.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6 hunks)
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5 hunks)
  • tests/unittest/_torch/misc/test_autotuner.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
  • cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/unittest/_torch/misc/test_autotuner.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/unittest/_torch/misc/test_autotuner.py
  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)
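
A hypothetical header skeleton (file name invented) that follows the guard naming rule:

```cpp
// File: cpp/include/fooBarHello.h (hypothetical path, for illustration only)
#ifndef TRTLLM_FOO_BAR_HELLO_H
#define TRTLLM_FOO_BAR_HELLO_H

namespace trt_example
{
int addOne(int value);
} // namespace trt_example

#endif // TRTLLM_FOO_BAR_HELLO_H
```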

Files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
  • cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.

Applied to files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.

Applied to files:

  • cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🧬 Code Graph Analysis (6)
tests/unittest/_torch/misc/test_autotuner.py (2)
tensorrt_llm/_torch/autotuner.py (2)
  • OptimizationProfile (118-133)
  • FakeTensor (138-141)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6)
  • get_valid_tactics (84-86)
  • get_valid_tactics (318-320)
  • get_valid_tactics (400-402)
  • get_valid_tactics (512-524)
  • get_valid_tactics (726-728)
  • get_valid_tactics (801-804)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (4)
  • permuted_token_final_scales_ (822-822)
  • mSM (920-920)
  • mMinLatencyMode (944-944)
  • mNumExpertsPerNode (922-922)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
  • std (43-114)
  • vector (608-612)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1)
  • supportsTmaWarpSpecialized (293-312)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (3)
  • isValidHopperMOESpecialisation (63-78)
  • isValidBlackwellMOESpecialisation (47-57)
  • isValidSM120MOESpecialisation (34-42)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)
  • sm (1090-1118)
  • sm (1090-1090)
  • sm (1120-1144)
  • sm (1120-1120)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2)
  • getTactics (1278-1282)
  • getTactics (1278-1278)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
  • MoeGemmId (231-354)
  • std (43-114)
  • vector (608-612)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)
tests/unittest/_torch/misc/test_autotuner.py (4)
  • get_valid_tactics (94-97)
  • get_valid_tactics (153-156)
  • get_valid_tactics (229-231)
  • get_valid_tactics (317-324)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5)
  • get_valid_tactics (125-141)
  • get_valid_tactics (409-422)
  • get_valid_tactics (667-680)
  • get_valid_tactics (901-914)
  • get_valid_tactics (1114-1127)
tensorrt_llm/_torch/autotuner.py (2)
  • get_valid_tactics (147-165)
  • OptimizationProfile (118-133)
🪛 LanguageTool
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt

[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 444ef1b3b06cdc7ee66b4e612ce26ad25...

(QB_NEW_EN)


[grammar] ~2-~2: There might be a mistake here.
Context: ...444ef1b3b06cdc7ee66b4e612ce26ad25967440b

(QB_NEW_EN)

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt

[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 444ef1b3b06cdc7ee66b4e612ce26ad25...

(QB_NEW_EN)


[grammar] ~2-~2: There might be a mistake here.
Context: ...444ef1b3b06cdc7ee66b4e612ce26ad25967440b

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (36)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (2)

1-1: Action Needed: Confirm checksum verification and clarify format

I wasn’t able to verify the checksum automatically because sha256sum isn’t available in the CI shell. Please:

  • Ensure the stored hash in
    cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
    actually matches the archive. You can manually run, for example:
    shasum -a 256 cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/libtensorrt_llm_internal_cutlass_kernels_static.a
    and compare it to the value in the file.
  • Update the first line of version.txt to include an explicit algorithm prefix, e.g.:
    -5a31acd0fb1415196bff71fa4a8d1dded147e15ea10821cc46c66986ee  libtensorrt_llm_internal_cutlass_kernels_static.a
    +sha256:5a31acd0fb1415196bff71fa4a8d1dded147e15ea10821cc46c66986ee  libtensorrt_llm_internal_cutlass_kernels_static.a
  • Add a CI step (using shasum -a 256 or openssl dgst -sha256) to validate that version.txt and the static library never drift out of sync.

1-2: Cross-arch version.txt files are in sync

Both the AArch64 and x86_64 version.txt files reference the same commit 444ef1b3b06cdc7ee66b4e612ce26ad25967440b, and the differing SHA256 lines correspond to the distinct per-architecture static libraries as expected. There’s no mismatch or drift to address here.

cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h (3)

333-335: Call-site update to oss-qualified dispatch looks correct.

Template params and argument order are unchanged; only qualification updated. Good.


338-340: LGTM on SM75 branch.

Consistent with the SM72 change; no behavioral differences introduced.


61-63: All clear—no remaining unqualified call sites detected.
The scans for both dispatchGemmToCutlass<…> and genericInt8GemmKernelLauncher<…> outside of int8_gemm_template.h returned no matches, confirming that all uses have been updated to the new oss::-qualified names. No further changes are needed here.

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1)

1-2: LGTM: artifact bump recorded

Hash and commit updated as expected. No further action from code perspective.

tests/unittest/_torch/misc/test_autotuner.py (3)

153-156: Signature extended to accept kwargs — aligned with per-GEMM context

Adding **kwargs to PartialCrashedRunner.get_valid_tactics keeps tests compatible with callers that propagate extra context (e.g., gemm_idx).


229-231: Propagating kwargs in tests is fine

GemmRunnerWithAttributes.get_valid_tactics now accepts **kwargs. This maintains API consistency across runners used by the autotuner tests.


317-320: Supports tactic config generation with kwargs

GemmRunnerWithTacticConfigs.get_valid_tactics adds **kwargs without altering behavior. Good forward-compat for additional selection hints.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

928-953: Per-GEMM tactic selection flow looks correct

Fetching tactics1/tactics2, gating on gemm_to_profile, and lazy best-pick with range checks is sound.
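
A self-contained sketch of that flow, with invented names and int standing in for the real tactic/config type (the actual fixture code differs):

```cpp
#include <stdexcept>
#include <vector>

enum class MoeGemmId
{
    GEMM_1,
    GEMM_2
};

// Pick the tactic list for the GEMM being profiled and range-check the requested index.
int selectTacticForProfiling(std::vector<int> const& tactics1, std::vector<int> const& tactics2,
    MoeGemmId gemm_to_profile, int tactic_idx)
{
    auto const& tactics = (gemm_to_profile == MoeGemmId::GEMM_1) ? tactics1 : tactics2;
    if (tactic_idx < 0 || tactic_idx >= static_cast<int>(tactics.size()))
    {
        throw std::out_of_range("Requested tactic index is outside the per-GEMM tactic list");
    }
    return tactics[tactic_idx];
}
```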

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)

473-478: New const overload is appropriate

getConfigs(bool) delegating to the SM-parametrized version cleanly exposes the new “supports finalize fusion” knob at the instance level.


481-489: Composition of TMA WS and Ampere configs looks correct

Merging getTmaWarpSpecializedConfigs(sm, supports_finalize_fusion) with Ampere configs preserves existing fallback behavior.
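
A purely illustrative sketch of the two-level getConfigs pattern described in the last two comments; CutlassGemmConfig and the helper bodies are placeholders, not the real dispatch code:

```cpp
#include <vector>

struct CutlassGemmConfig
{
    int tile_shape_id;
    bool uses_fused_finalize;
};

struct MoeGemmRunnerSketch
{
    int mSM = 90;

    std::vector<CutlassGemmConfig> getConfigs(bool supports_finalize_fusion) const
    {
        // Instance-level overload delegates to the SM-parametrized version.
        return getConfigs(mSM, supports_finalize_fusion);
    }

    static std::vector<CutlassGemmConfig> getConfigs(int sm, bool supports_finalize_fusion)
    {
        // TMA warp-specialized configs first, then the Ampere fallbacks appended.
        auto configs = getTmaWarpSpecializedConfigs(sm, supports_finalize_fusion);
        auto ampere = getAmpereConfigs(sm);
        configs.insert(configs.end(), ampere.begin(), ampere.end());
        return configs;
    }

    static std::vector<CutlassGemmConfig> getTmaWarpSpecializedConfigs(int sm, bool supports_finalize_fusion)
    {
        std::vector<CutlassGemmConfig> configs{{0, false}};
        if (supports_finalize_fusion && sm >= 90)
        {
            configs.push_back({0, true}); // add finalize-fused variants where supported
        }
        return configs;
    }

    static std::vector<CutlassGemmConfig> getAmpereConfigs(int /*sm*/)
    {
        return {{1, false}};
    }
};
```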


582-587: SM-gated specialization predicate is fine

The new (int sm) helper is clear and matches the architecture-specific validity checks.

cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)

53-86: Per-GEMM tactic parsing is clean and recursive

The refactor to thread gemm_id through parsing and to support arrays/“all”/“auto” is solid. Error handling for invalid formats is appropriate.
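
An illustrative sketch of recursive tactic-field parsing with nlohmann::json; the accepted forms ("all", "auto", integers, nested arrays) follow the description above, but the function name and -1 sentinel are inventions, not the launcher's real code:

```cpp
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>
#include <vector>

// Expand one tactic field into a flat list of tactic indices for a single GEMM.
std::vector<int> parseTacticIds(nlohmann::json const& field, int num_tactics)
{
    std::vector<int> ids;
    if (field.is_array())
    {
        for (auto const& elem : field)
        {
            auto sub = parseTacticIds(elem, num_tactics); // recurse into nested arrays
            ids.insert(ids.end(), sub.begin(), sub.end());
        }
    }
    else if (field.is_string() && field.get<std::string>() == "all")
    {
        for (int i = 0; i < num_tactics; ++i)
        {
            ids.push_back(i); // sweep every tactic reported for this GEMM
        }
    }
    else if (field.is_string() && field.get<std::string>() == "auto")
    {
        ids.push_back(-1); // sentinel: defer to the runtime heuristic
    }
    else if (field.is_number_integer())
    {
        ids.push_back(field.get<int>());
    }
    else
    {
        throw std::invalid_argument("Unsupported tactic specification");
    }
    return ids;
}
```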


415-426: Verbose dump per GEMM is helpful

Enumerating tactics per GEMM with indices improves debuggability. Nice.


536-544: Help text matches parser (no “objects”); clear on valid tactic forms

Docs now accurately reflect accepted types and warn about index instability. Good catch-up with the parser changes.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)

4558-4576: prepareTmaWsInputs: early gating and groupwise flags computed before use — solid

  • Passing the fusion enum into prepareTmaWsInputs makes the intent explicit.
  • Computing use_w4afp8/use_wfp4a16/use_w4_groupwise before the early-return resolves the prior “used before declaration” hazard.

4616-4621: Intentional WS reuse for NONE/FINALIZE variants acknowledged

The cache selects mTmaInputCache[i][use_finalize_fusion] and reuses the same underlying workspace region by design to save memory (as noted in the comment). The code matches that intent.


4626-4628: Per-GEMM cache binding during profiling is correct

Routing the active cache to the GEMM under test and using a local dummy for the other keeps configuration isolated without extra allocations. Looks good.


4742-4745: Profiler: selecting cached template by fusion type — LGTM

Indexing mTmaInputCache by (sample, isFinalize) and asserting isValid() gives a clean, explicit handoff to the run phase.

cpp/tensorrt_llm/thop/moeOp.cpp (2)

51-51: Expose MoeGemmId alias — looks good

Publicly surfacing MoeGemmId in torch_ext is helpful for downstream clarity and keeps signatures consistent across layers.


219-221: Initialize per-GEMM tactic vectors — good separation

Fetching tactics separately for GEMM_1 and GEMM_2 aligns with the new per-GEMM selection model.
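
A hedged sketch of that per-GEMM initialization; the runner and profile types are stand-ins, not the actual moeOp.cpp classes:

```cpp
#include <vector>

enum class MoeGemmId
{
    GEMM_1,
    GEMM_2
};

struct Profile
{
    int config_id;
};

struct KernelRunnerSketch
{
    std::vector<Profile> getTactics(MoeGemmId /*gemm_id*/) const
    {
        // The real runner enumerates the CUTLASS configs valid for the requested GEMM.
        return {Profile{0}, Profile{1}};
    }
};

struct FusedMoeRunnerSketch
{
    KernelRunnerSketch mKernelRunner;
    std::vector<Profile> mGemm1Profiles;
    std::vector<Profile> mGemm2Profiles;

    void initProfiles()
    {
        // Fetch tactics separately so GEMM 1 and GEMM 2 can be tuned independently.
        mGemm1Profiles = mKernelRunner.getTactics(MoeGemmId::GEMM_1);
        mGemm2Profiles = mKernelRunner.getTactics(MoeGemmId::GEMM_2);
    }
};
```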

cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (6)

373-375: Default fused-finalize flag is off — good coverage design

Defaulting mUseFusedFinalize to false ensures both fused and unfused paths get exercised as k varies.


459-460: Per-test control of finalize fusion — sensible toggle

Tying use_fused_finalize_ to (k < 3) || mUseFusedFinalize matches prod behavior and keeps determinism checks meaningful.


1090-1119: Per-GEMM filtered configs — API usage looks correct

getFilteredConfigs now threads MoeGemmId and filters by SM. The L40S filter guards make sense; EXPECT_FALSE(tactics.empty()) asserts coverage.


1120-1144: selectTacticsForArch: finalize-aware pairing — LGTM

Deriving epilogue_fusion_type and selecting a finalize-matching GEMM_2 tactic is consistent with the new model. Sensible fallback logic with a warning.


1642-1645: Determinism gate based on finalize fusion — appropriate

The condition matches known non-determinism in FINALIZE on SM90 with K≥3; the guard keeps tests stable elsewhere.


2093-2096: ConfigSweep: fused finalize set true while sweeping both configs — reasonable

Enabling fused finalize here while sweeping both GEMMs ensures coverage of FINALIZE paths; the gemm_id-aware config enumeration fits the per-GEMM API.

Also applies to: 2123-2124, 2127-2129

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (8)

318-321: Consistently return List[int] for FP8RowwiseGemmRunner.get_valid_tactics

Good: list(range(...)) aligns with the typed contract and avoids surprises.


400-403: NVFP4/FP4 get_valid_tactics returns List[int] — good

Type-consistent and mirrors other runners.


726-729: WeightOnlyQuantGemmRunner.get_valid_tactics — type fix is correct

list(range(...)) matches List[int] and the autotuner’s expectations.


801-805: FinegrainedMixedDtypeGemm.get_valid_tactics — consistent return type

Looks good.


187-207: Autotuner integration passes gemm_idx per GEMM — LGTM

Tuning two slots with explicit gemm_idx={1,2} matches the C++ API and per-GEMM tactic model.


151-163: Tuner input handling for enable_alltoall — sensible

Constraining warm-up to the non-alltoall path and reusing tuner_top_k mirrors the C++ assumptions and keeps tactic selection stable.

Also applies to: 185-186


1-941: Cross-file get_tactic_num migration sanity check passed

All occurrences of get_tactic_num in the Python codebase now include the required gemm_idx argument, and no legacy zero-argument calls remain:

  • Only one match of get_tactic_num was found in MoERunner.get_valid_tactics, where it correctly uses kwargs["gemm_idx"].
  • No calls to fused_moe_runner.get_tactic_num() without arguments were detected.

No further changes are needed.


512-525: Verify get_valid_configs return type

Adding **kwargs to get_valid_tactics is appropriate to match the autotuner interface. However, the implementation of get_valid_configs could not be found in the Python or C/C++ sources in this repository—it's likely provided by an external extension. Please manually confirm that:

  • kernel_runner.get_valid_configs(b, m, n, k) indeed returns a List[int].

Key location:
• tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, around lines 522–524 (tactics = self.kernel_runner.get_valid_configs…, return tactics)

@djns99
Copy link
Collaborator Author

djns99 commented Aug 20, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15961 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15961 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11995 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@hlu1 hlu1 merged commit f7c597e into NVIDIA:main Aug 21, 2025
5 checks passed