[None][perf] Make finalize fusion part of the tactic selection logic #6915
Conversation
📝 Walkthrough
Introduces per-GEMM tactic selection (MoeGemmId) and epilogue finalize-fusion awareness across MOE runners, profilers, benchmarks, Torch bindings, plugins, and tests; threads kwargs through the autotuner and custom ops; and updates several public method signatures, data layouts, and internal kernel artifact versions.
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
participant Bench as Benchmark/Plugin/Test
participant Torch as Torch API / Autotuner
participant Runner as CutlassMoeFCRunner
participant Moe as MoeGemmRunner
participant Prof as GemmProfilerBackend
Note over Bench,Torch: Per‑GEMM tactic selection is requested
Bench->>Runner: getTactics(gemm_id)
Runner->>Moe: getTactics(sm, gemm_id) / getConfigs(sm, supports_finalize_fusion)
Moe-->>Runner: configs[tactic].epilogue_fusion_type
Runner-->>Bench: tactics list for gemm_id
Note over Torch,Prof: Prepare TMA WS templates (both fusion types)
Torch->>Prof: prepare(...)
Prof->>Prof: prepareTmaWsInputs(fusion=NONE)
Prof->>Prof: prepareTmaWsInputs(fusion=FINALIZE)
Note over Bench,Prof: Run with chosen tactic
Bench->>Prof: run(profile, tactic)
alt tactic.epilogue_fusion_type == FINALIZE
Prof->>Prof: select FINALIZE cache/template for sample
else
Prof->>Prof: select NONE cache/template for sample
end
```
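To make the diagram concrete, here is a hedged C++ sketch of per-GEMM tactic selection with finalize-fusion-aware configs. The names MoeGemmId, getTactics, is_tma_warp_specialized, and epilogue_fusion_type come from this PR; the GemmConfig layout, the stub getTactics body, and countFinalizeFusionTactics are illustrative only, not the actual TensorRT-LLM code.

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch, not the actual TensorRT-LLM code.
enum class MoeGemmId
{
    GEMM_1,
    GEMM_2,
};

enum class EpilogueFusionType
{
    NONE,
    FINALIZE,
};

struct GemmConfig
{
    bool is_tma_warp_specialized = false;
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

// Stand-in for CutlassMoeFCRunner::getTactics(gemm_id): GEMM_2 additionally exposes a
// finalize-fusion variant of each TMA warp-specialized config.
std::vector<GemmConfig> getTactics(MoeGemmId gemm_id)
{
    std::vector<GemmConfig> configs{{true, EpilogueFusionType::NONE}, {false, EpilogueFusionType::NONE}};
    if (gemm_id == MoeGemmId::GEMM_2)
    {
        configs.push_back({true, EpilogueFusionType::FINALIZE});
    }
    return configs;
}

// Count how many tactics for the given GEMM request the fused finalize epilogue.
std::size_t countFinalizeFusionTactics(MoeGemmId gemm_id)
{
    std::size_t count = 0;
    for (auto const& c : getTactics(gemm_id))
    {
        if (c.is_tma_warp_specialized && c.epilogue_fusion_type == EpilogueFusionType::FINALIZE)
        {
            ++count;
        }
    }
    return count;
}
```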
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 12
🔭 Outside diff range comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
52-86
: parseTacticToVectorID signature must carry gemm_id (compile error).
You're calling parseTacticToVectorID(..., MoeGemmId::GEMM_X) below, but the function signature doesn't accept gemm_id. Also, the "all" branch calls listAllTactics<BenchClass>() with no argument.
Apply this diff to fix both the signature and calls within:
```diff
-template <class BenchClass>
-void parseTacticToVectorID(nlohmann::json& tactic, std::vector<int>& tactic_ids)
+template <class BenchClass>
+void parseTacticToVectorID(nlohmann::json& tactic, std::vector<int>& tactic_ids, MoeGemmId gemm_id)
 {
     if (tactic.is_number_integer())
     {
         tactic_ids.push_back(tactic.get<int>());
     }
     else if (tactic.is_array())
     {
         for (auto c : tactic)
         {
-            parseTacticToVectorID<BenchClass>(c, tactic_ids);
+            parseTacticToVectorID<BenchClass>(c, tactic_ids, gemm_id);
         }
     }
     else if (tactic.is_string())
     {
         assert(tactic.is_string());
         auto tactic_name = tactic.get<std::string>();
         if (tactic_name == "all")
         {
-            auto all_tactics = listAllTactics<BenchClass>();
+            auto all_tactics = listAllTactics<BenchClass>(gemm_id);
             tactic_ids.resize(all_tactics.size());
             std::iota(tactic_ids.begin(), tactic_ids.end(), 0);
         }
         else
         {
             assert(tactic.get<std::string>() == "auto");
             tactic_ids.push_back(-1);
         }
     }
     else
     {
         throw std::invalid_argument("Invalid tactic format");
     }
 }
```
286-301
: Fix debug printing of valid tactics (calls outdated API).
This block calls listAllTactics<BenchClass>() with no gemm_id. Print per-GEMM tactics to align with the new API.
Apply this diff:
```diff
-    std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n";
-    auto confs = listAllTactics<BenchClass>();
-    for (auto c : confs)
-        std::cerr << c.toString();
+    std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n";
+    for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2})
+    {
+        auto confs = listAllTactics<BenchClass>(gemm_id);
+        int i = 0;
+        for (auto const& c : confs)
+        {
+            std::cerr << "[GEMM " << static_cast<int>(gemm_id) << "] Tactic " << i++ << ":\n"
+                      << c.toString();
+        }
+    }
```
🧹 Nitpick comments (6)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
1-1
: Add NVIDIA SPDX header to comply with repository guidelines.
Python sources must carry the NVIDIA copyright header.
Apply at file top:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
```
tensorrt_llm/_torch/autotuner.py (1)
1-1
: Add NVIDIA SPDX header to comply with repository guidelines.
Python sources must carry the NVIDIA copyright header.
Apply at file top:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
```
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
1-15
: Update copyright year.
Header lists 2020-2023. The guideline requests current-year coverage. Consider updating to include 2025.
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (2)
1120-1142
: Tactic selection considers finalize-fusion only for GEMM_2 (as intended).
- is_tma_warp_specialized gating and is_finalize_fusion for GEMM_2 only is consistent with how finalize-fusion is applied in the epilogue.
- Falling back to the first tactic when a perfect match isn't found is a pragmatic default.
Consider emitting a one-line INFO log when falling back (to help diagnose missing matches without running at VERBOSE); a sketch follows below.
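A small, self-contained illustration of that fallback log. The type and helper names here are hypothetical stand-ins for the test's actual state; the real test would use the repository's logging macro rather than printf.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the test's tactic descriptor.
struct TacticStub
{
    bool is_tma_warp_specialized;
    bool is_finalize_fusion;
};

// Prefer an exact finalize-fusion match; otherwise log once and fall back to the first tactic
// (the pragmatic default noted above). Assumes `tactics` is non-empty.
inline TacticStub pickTactic(std::vector<TacticStub> const& tactics, bool wants_finalize_fusion)
{
    auto it = std::find_if(tactics.begin(), tactics.end(),
        [&](TacticStub const& t)
        { return t.is_tma_warp_specialized && t.is_finalize_fusion == wants_finalize_fusion; });
    if (it == tactics.end())
    {
        // In the real test this would be the repo's INFO-level logging macro instead of printf.
        std::printf("No tactic matched finalize fusion=%d; falling back to tactic 0.\n", int(wants_finalize_fusion));
        it = tactics.begin();
    }
    return *it;
}
```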
1641-1642
: Determinism guard is appropriate.
Using !gemm2.is_finalize_fusion || mK < 3 || getSMVersion() < 90 || getSMVersion() >= 120 captures the conditions where results should be deterministic.
This logic is duplicated in multiple places; consider extracting a small helper (e.g., shouldBeDeterministic(config, k)) to avoid divergence over time; a sketch follows below.
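A hedged sketch of such a helper, reusing the exact condition quoted above; the parameter names are placeholders for whatever the test fixture actually stores.

```cpp
// Hypothetical helper for the MOE test fixture; mirrors the duplicated guard
// `!gemm2.is_finalize_fusion || mK < 3 || getSMVersion() < 90 || getSMVersion() >= 120`.
inline bool shouldBeDeterministic(bool is_finalize_fusion, int k, int sm_version)
{
    return !is_finalize_fusion || k < 3 || sm_version < 90 || sm_version >= 120;
}

// Call sites would then read: shouldBeDeterministic(gemm2.is_finalize_fusion, mK, getSMVersion()).
```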
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
627-632
: Static getTactics path differs from instance path.
Here you rely on RunnerType::getConfigs(sm, bool) to expand finalize-fusion options, whereas the instance path uses addFinalizeFusionConfigs post-processing. If both code paths are expected to behave identically, consider unifying through addFinalizeFusionConfigs for consistency across environments.
If RunnerType::getConfigs(sm, bool) already includes fused variants, leave as-is; otherwise prefer a single expansion mechanism to avoid drift.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (5 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (6 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (10 hunks)
- tensorrt_llm/_torch/autotuner.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/autotuner.py
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🔇 Additional comments (23)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
84-86
: Confirmed: gemm_idx semantics — 1 → GEMM_1, otherwise → GEMM_2 (Python passes 1/2 correctly; default 0 maps to GEMM_2).
The C++ binding torch_ext::FusedMoeRunner::getTacticNum(int gemm_idx) returns mGemm1Profiles.size() if gemm_idx == 1, otherwise mGemm2Profiles.size(); Python forwards gemm_idx directly and uses 1 and 2 in tuning calls.
Relevant locations:
- cpp/tensorrt_llm/thop/moeOp.cpp: getTacticNum implementation (lines ~574–577) and Python binding (.def("get_tactic_num", ...) at ~1057–1060).
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py: get_valid_tactics (lines ~80–86) — returns range(self.fused_moe_runner.get_tactic_num(gemm_idx)); forward signature default gemm_idx=0 (lines ~88–93); tuning calls use gemm_idx=1 and gemm_idx=2 (lines ~185–199).
Conclusion: indexing matches—passing 1 selects GEMM_1 and passing 2 (or 0/other) selects GEMM_2. Optional: consider documenting or changing the forward default (gemm_idx=0) if treating 0 as GEMM_2 is surprising.
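For quick reference, a hedged C++ sketch of the mapping just described; this is not the actual moeOp.cpp code, and the free-function shape is illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch: gemm_idx == 1 selects the GEMM_1 tactic list; any other value,
// including Python's default 0, selects GEMM_2.
template <typename Profile>
std::size_t getTacticNum(
    int gemm_idx, std::vector<Profile> const& gemm1Profiles, std::vector<Profile> const& gemm2Profiles)
{
    return gemm_idx == 1 ? gemm1Profiles.size() : gemm2Profiles.size();
}
```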
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
506-508
: toString additions look good.
Including enable cuda kernel and is_finalize_fusion in the TMA WS path is helpful for debugging.
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1)
294-300
: Inline const overload delegating to static supportsTmaWarpSpecialized(int sm) is correct.
Nice refactor for per-SM checks while preserving the const method call sites.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
569-574
: Static supportsTmaWarpSpecialized(int sm) implementation LGTM.
Covers Hopper (90), Blackwell [100,120), and SM120/121 explicitly; aligns with the new const wrapper.
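A hedged sketch of the architecture gate as described above; the actual implementation in moe_gemm_template_dispatch.h may differ in detail.

```cpp
// Hedged sketch: Hopper (SM90), Blackwell (SM100..119), and SM120/121 support TMA warp specialization.
static bool supportsTmaWarpSpecialized(int sm)
{
    return sm == 90 || (sm >= 100 && sm < 120) || sm == 120 || sm == 121;
}
```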
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4016-4018
: Verify finalize epilogue shape ordering (rows should be num_output_tokens, not hidden_size).
Given the swap+transpose path in the finalize-fusion epilogue, the stride/shape should be set such that rows map to num_output_tokens and columns to hidden_size. Double-check that setFinalizeFusionParams(final_output, hidden_size, num_rows, use_reduction) passes arguments in the correct order for the underlying expected layout.
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
1281-1282
: LGTM: per-GEMM tactic retrieval aligns with the backend profiling target.
Returning tactics keyed by backend.mGemmToProfile keeps the profiler-to-runner contract consistent. Ensure backend.mGemmToProfile is set prior to this call in all code paths (it is in initialize(), but double-check for any direct usage paths).
cpp/tensorrt_llm/thop/moeOp.cpp (1)
221-223
: LGTM: split per-GEMM tactic lists at construction.
Using getTactics(MoeGemmId::GEMM_1/2) upfront enables independent selection and profiling per GEMM.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (4)
836-836
: Per-GEMM tactic retrieval looks correct (cast guarded by call sites).
pickBestTactic is only called with GEMM_1 or GEMM_2, so the static_cast<MoeGemmId>(gemm_to_profile) here won't see LAYER. No action needed.
928-953
: Correctly splits tactics per GEMM and validates indices.
- Fetching separate vectors for GEMM_1 and GEMM_2 and choosing based on combo.second is clean.
- Range checks on t before indexing keep it safe.
970-979
: Using the GEMM-specific tactic for profiler execution is aligned.
Assuming setTactic was called and validated earlier (as in runBenchmark), this index access is safe and consistent.
1079-1084
: Improved logging for per-GEMM tactics.
Clear separation of GEMM_1 and GEMM_2 tactics with counts and toString() improves diagnosability.
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (5)
373-375
: New fused-finalize toggle is sane by default.
Defaulting mUseFusedFinalize to false while still enabling fused-finalize for k < 3 at runtime ensures both paths get exercised in tests.
459-459
: Runtime toggle of fused-finalize matches intent.
mMoERunner.use_fused_finalize_ = k < 3 || mUseFusedFinalize;
matches the comment and ensures deterministic coverage across k values.
1090-1119
: Per-GEMM filtering wrapper reads cleanly.
Fetching tactics via mMoERunner.getTactics(gemm_id) and applying arch-specific filters is a good split of concerns.
1759-1763
: Non-deterministic test path is properly gated.
Enabling fused-finalize only for PermuteNonDeterministic keeps the test suite behavior targeted.
2114-2121
: Config sweep now respects per-GEMM tactic sets.
Iterating configs1 and configs2 and asserting both names are non-empty prevents accidental uninitialized tactics.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
45-50
: Per-GEMM listAllTactics is the right API.
Returning RunnerType::getTactics(sm, gemm_id) aligns with the new per-GEMM contracts.
282-284
: Calls to parseTacticToVectorID match the intended gemm_id, but rely on the wrong signature.
Once you apply the signature fix above, these calls are correct.
After applying the signature fix, please ensure there are no remaining calls to the old two-argument parseTacticToVectorID. A quick search in this file should suffice.
415-427
: Verbose listing of tactics per GEMM is excellent.
This will help users construct correct tactic_id1/tactic_id2 values per architecture.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
231-237
: MoeGemmId introduction is clear and scoped.Having a shared enum for GEMM ids aligns the API across profilers, runners, and benchmark code.
456-457
: Interface change: getTactics now requires MoeGemmId.This is a breaking API change. Ensure all implementations of
CutlassMoeFCRunnerInterface
and all call sites are updated accordingly (you’ve addressed key sites in this PR).If not already done, please verify that no other non-bench code still calls the old
getTactics()
without arguments.
827-832
: Static finalize-fusion capability check is reasonable.Gates by TMA WS support, SM >= 90, and weight mode. Avoids coupling to instance
use_fused_finalize_
which is fine for “listing” APIs.
931-931
: Profiler enum alias to MoeGemmId is consistent.This keeps the profiler aligned with the rest of the API. Note that “LAYER” sentinel isn’t part of
MoeGemmId
; local enums (like in benchmarks) bridge that gap safely.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
Outdated
Show resolved
Hide resolved
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
Outdated
Show resolved
Hide resolved
LGTM!
Branch force-pushed from 8b09273 to e09cebe.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
70-76
: Fix: "all" case clobbers previously accumulated tactic IDs when parsing arrays.
When parsing arrays, encountering the "all" string currently resizes tactic_ids and overwrites prior entries (e.g., ["all", 2] loses "2"). This is especially problematic for sweeps that mix explicit IDs with "all".
Append instead of resize:
```diff
-            if (tactic_name == "all")
-            {
-                auto all_tactics = listAllTactics<BenchClass>(gemm_id);
-                tactic_ids.resize(all_tactics.size());
-                std::iota(tactic_ids.begin(), tactic_ids.end(), 0);
-            }
+            if (tactic_name == "all")
+            {
+                auto all_tactics = listAllTactics<BenchClass>(gemm_id);
+                auto n = static_cast<int>(all_tactics.size());
+                auto start = static_cast<int>(tactic_ids.size());
+                tactic_ids.reserve(start + n);
+                for (int i = 0; i < n; ++i)
+                {
+                    tactic_ids.push_back(i);
+                }
+            }
```
276-284
: Avoid operator[] insertion and handle partial tactic configs gracefully.
Accessing run_config["tactic_id1"] / ["tactic_id2"] inserts nulls when keys are absent, causing parseTacticToVectorID to throw. This makes valid partial configs brittle. Use at() for safety and mirror the provided side when only one tactic key is present.
```diff
-    bool has_tactic_ids2 = false;
-    std::vector<int> tactic_ids1{};
-    std::vector<int> tactic_ids2{};
-    if (run_config.contains("tactic_id1") || run_config.contains("tactic_id2"))
-    {
-        has_tactic_ids2 = true;
-        parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1);
-        parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2);
-    }
+    bool has_tactic_ids2 = false;
+    std::vector<int> tactic_ids1{};
+    std::vector<int> tactic_ids2{};
+    bool hasT1 = run_config.contains("tactic_id1");
+    bool hasT2 = run_config.contains("tactic_id2");
+    if (hasT1)
+    {
+        parseTacticToVectorID<BenchClass>(run_config.at("tactic_id1"), tactic_ids1, MoeGemmId::GEMM_1);
+    }
+    if (hasT2)
+    {
+        parseTacticToVectorID<BenchClass>(run_config.at("tactic_id2"), tactic_ids2, MoeGemmId::GEMM_2);
+    }
+    // If only t2 is provided, mirror it into t1 so the sweep still runs.
+    if (!hasT1 && hasT2)
+    {
+        tactic_ids1 = tactic_ids2;
+    }
+    has_tactic_ids2 = hasT1 && hasT2;
```
♻️ Duplicate comments (2)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
608-616
: Harden profile selection and gemm_idx validation in runGemmProfile.
Guard against invalid gemm_idx, empty tactic lists, and out-of-range profile_id to avoid UB from .front()/indexing.
Apply this diff:
```diff
-    auto const gemm_to_profile
-        = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
-    auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+    TORCH_CHECK(gemm_idx == 1 || gemm_idx == 2, "gemm_idx must be 1 or 2");
+    auto const gemm_to_profile
+        = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2;
+    auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles;
+    TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_%d", int(gemm_idx));

     // Get specific profile configs according to the profile_id.
     // Fallback tactic is set to be 0
     // TODO: use the best tactic id found offline for a better default inference perf
-    auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id];
+    const auto& profile = [&]() -> const Profile& {
+        if (profile_id == -1) return profiles.front();
+        TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()),
+            "profile_id out of range for GEMM_%d: %ld", int(gemm_idx), profile_id);
+        return profiles[profile_id];
+    }();
```
717-726
: Validate tactic vectors and ids in setRunnerProfiles.
Protect against empty vectors and out-of-range indices before using .front() or .at().
Apply this diff:
```diff
-    auto best_gemm1_profile = mGemm1Profiles.front();
-    auto best_gemm2_profile = mGemm2Profiles.front();
+    TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available");
+    auto best_gemm1_profile = mGemm1Profiles.front();
+    auto best_gemm2_profile = mGemm2Profiles.front();
     if (profile_ids.has_value())
     {
         TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids");
-        best_gemm1_profile
-            = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]);
-        best_gemm2_profile
-            = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]);
+        if (profile_ids.value()[0] != -1)
+        {
+            TORCH_CHECK(profile_ids.value()[0] >= 0
+                    && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()),
+                "profile_ids[0] out of range");
+            best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]);
+        }
+        if (profile_ids.value()[1] != -1)
+        {
+            TORCH_CHECK(profile_ids.value()[1] >= 0
+                    && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()),
+                "profile_ids[1] out of range");
+            best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]);
+        }
     }
```
🧹 Nitpick comments (4)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (2)
513-515
: Include epilogue_fusion_type in all toString branches for consistency.
Currently only the TMA path prints epilogue_fusion_type. Consider appending it to the sm80-compatible toString branch as well so log output is uniform across architectures.
Outside the changed hunk, extend the sm80 branch like:
```cpp
// in toString(), non-TMA branch
tactic << "\n\tstyle=compatible"
       << "\n\ttile shape ID: " << (int) tile_config_sm80
       << "\n\tstages: " << (int) stages
       << "\n\tsplit k: " << (int) split_k_factor
       << "\n\tenable cuda kernel: " << (enableCudaKernel ? "true" : "false")
       << "\n\tepilogue fusion type: " << (int) epilogue_fusion_type;
```
546-548
: Mirror epilogue_fusion_type in operator<< for non-TMA configs.
Analogous to TMA printing, include epilogue_fusion_type in the non-TMA operator<< branch to keep streamed output aligned with toString.
Suggested change (outside the diff hunk):
```diff
-    out << "tile_config_enum: " << config.getTileConfigAsInt()
+    out << "tile_config_enum: " << config.getTileConfigAsInt()
         << ", split_k_style_enum: " << int(config.split_k_style)
         << ", split_k_factor: " << config.split_k_factor << ", stages: " << config.stages
-        << ", enable_cuda_kernel: " << (config.enableCudaKernel ? "true" : "false");
+        << ", enable_cuda_kernel: " << (config.enableCudaKernel ? "true" : "false")
+        << ", epilogue_fusion_type: " << int(config.epilogue_fusion_type);
```
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
66-71
: Nit: Redundant assert in string branch.
assert(tactic.is_string()) is redundant inside the tactic.is_string() branch. Safe to remove.
```diff
-            assert(tactic.is_string());
             auto tactic_name = tactic.get<std::string>();
```
393-393
: Nit: Update comment to reflect new listAllTactics signature.
The example still shows listAllTactics without a MoeGemmId; update for clarity.
```diff
-    auto cutlass_tactic = {-1}; // {0,..., listAllTactics<BenchClass>(MoeGemmId).size()};
+    auto cutlass_tactic = {-1}; // e.g. {0, ..., listAllTactics<BenchClass>(MoeGemmId::GEMM_1).size() - 1}
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (10 hunks)
- tensorrt_llm/_torch/autotuner.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/_torch/autotuner.py
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/thop/moeOp.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (2)
301-306
: Public static SM-query and instance wrapper — LGTM.
The inline wrapper to the static overload improves ergonomics and makes capability checks usable without an instance. No concerns.
306-306
: Definition found — no action required.
The static overload is defined — the inline instance method in the header forwards to it and the templated definition exists in the template dispatch header.
- Declaration / inline forwarder: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (around lines 301–306) — the instance method returns supportsTmaWarpSpecialized(sm_) and the static declaration is present.
- Definition: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h:569 — definition of MoeGemmRunner<T,WeightType,OutputType,ScaleBiasType>::supportsTmaWarpSpecialized(int sm).
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
436-443
: Epilogue fusion state in config — LGTM.
Adding EpilogueFusionType and the epilogue_fusion_type field is a clean, explicit way to represent finalize-fusion. Good addition of printing in the TMA toString/operator<< paths.
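A hedged sketch of the shape this adds to the config; enumerator names follow this review, the default value is assumed, and the real CutlassGemmConfig in gemm_configs.h has many more fields.

```cpp
// Hedged sketch only, not the actual gemm_configs.h definition.
struct CutlassGemmConfig
{
    enum class EpilogueFusionType
    {
        NONE,
        FINALIZE,
    };

    bool is_tma_warp_specialized = false;
    // New in this PR: which epilogue fusion the tactic requests (FINALIZE selects the fused MOE finalize).
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};
```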
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
2850-2854
: Gate finalize-fusion workspace by epilogue_fusion_type — LGTMConditioning permuted_token_final_scales_ on FINALIZE via epilogue_fusion_type aligns workspace with the selected tactic.
4010-4016
: Runtime alignment of finalize-fusion decision — LGTMReplacing implicit SM checks with tactic-driven epilogue_fusion_type and asserting agreement with the runner’s flags removes ambiguity and catches mismatches early.
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1)
46-46
: Public MoeGemmId alias — LGTM.
This unifies per-GEMM references in the plugins namespace and matches the rest of the PR's per-GEMM changes.
cpp/tensorrt_llm/thop/moeOp.cpp (3)
51-51
: Public MoeGemmId alias in Torch extension — LGTM.
Keeps the Torch side aligned with core headers for per-GEMM flows.
222-224
: Fetch tactics per GEMM — LGTM.
Populating mGemm1Profiles and mGemm2Profiles via getTactics(MoeGemmId::GEMM_1/2) matches the new per-GEMM profiling model.
575-580
: Validate gemm_idx in getTacticNum — LGTM.
Adding TORCH_CHECK prevents silent fallbacks and returns explicit sizes. Good catch.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
45-50
: Per-GEMM tactic enumeration routed correctly.
Passing MoeGemmId through to RunnerType::getTactics(sm, gemm_id) is the right direction for per-GEMM tactic listing. Looks good.
294-301
: Helpful: per-GEMM tactic listing in warning path.
Printing available tactics per GEMM is useful for debugging invalid tactic configs. LGTM.
420-431
: Verbose per-GEMM tactic dump is clear and consistent.
The new per-GEMM breakdown aligns with the per-GEMM selection model. Looks good.
Branch force-pushed from 398bcac to 590e68a.
Actionable comments posted: 3
♻️ Duplicate comments (1)
tensorrt_llm/_torch/autotuner.py (1)
456-457
: Filter kwargs before calling runner.get_valid_tactics to avoid TypeError across heterogeneous runners.
Many existing runners don't accept the new kwargs; passing all kwargs directly can break them. Filter kwargs against the target signature unless it accepts **kwargs.
```diff
-        valid_tactics = runner.get_valid_tactics(input_tensors, profile, **kwargs)
+        # Filter kwargs to those accepted by get_valid_tactics unless it has **kwargs
+        gvt_params = inspect.signature(runner.get_valid_tactics).parameters
+        accepts_var_kwargs = any(p.kind == inspect.Parameter.VAR_KEYWORD for p in gvt_params.values())
+        accepted_kw = {p.name for p in gvt_params.values()
+                       if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)}
+        filtered_kwargs = kwargs if accepts_var_kwargs else {k: v for k, v in kwargs.items() if k in accepted_kw}
+        valid_tactics = runner.get_valid_tactics(input_tensors, profile, **filtered_kwargs)
```
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4612-4614
: Nit: make the bool→index conversion explicit for readability.
Indexing with a bool works (0/1), but a cast improves clarity for future readers.
```diff
-        auto& cache_element = mTmaInputCache[i][use_finalize_fusion];
+        auto& cache_element = mTmaInputCache[i][static_cast<size_t>(use_finalize_fusion)];
```
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
603-621
: Consider taking configs by value for broader API flexibility.
Using an rvalue reference works with temporaries (e.g., getConfigs()), but taking by value supports both rvalues and lvalues and relies on move elision/moves where applicable. Optional API polish.
```diff
-    static auto addFinalizeFusionConfigs(
-        std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)
+    static std::vector<cutlass_extensions::CutlassGemmConfig> addFinalizeFusionConfigs(
+        std::vector<cutlass_extensions::CutlassGemmConfig> configs, bool use_fused_finalize)
     {
         if (!use_fused_finalize)
             return configs;
         size_t const num_configs = configs.size();
         for (size_t i = 0; i < num_configs; ++i)
         {
             if (configs[i].is_tma_warp_specialized)
             {
                 configs.push_back(configs[i]);
-                configs.back().epilogue_fusion_type
-                    = cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
+                configs.back().epilogue_fusion_type =
+                    cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
             }
         }
         return configs;
     }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (7 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
- tensorrt_llm/_torch/autotuner.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
- cpp/tensorrt_llm/thop/moeOp.cpp
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/autotuner.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/autotuner.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧠 Learnings (4)
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
2850-2854: LGTM: Cache finalize-fusion workspace only when tactic enables it
Gating permuted_token_final_scales_ on epilogue_fusion_type keeps workspace usage minimal and consistent with tactic selection.
4010-4016: LGTM: Explicit tactic/runner finalize-fusion alignment check
The equality check between requested finalize fusion and runner configuration prevents silent misconfigurations.
4558-4560: Signature extension for prepareTmaWsInputs is consistent with per‑GEMM fusion flows
Adding EpilogueFusion to the profiler prep API is coherent with the new per‑GEMM tactic surface.
4675-4678: LGTM: Precompute TMA inputs for both NONE and FINALIZE epilogues
Preparing both variants amortizes prep cost and enables quick switching during profiling.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
231-236: LGTM: Introduce MoeGemmId to make per‑GEMM flows explicit
Clear, minimal enum that aids per‑GEMM tactic selection and profiling.
456-457: LGTM: getTactics now per‑GEMM
Making tactic retrieval gemm_id‑aware aligns the Python and C++ layers with the new per‑GEMM selection model.
1003-1004: LGTM: 2‑D TMA input cache keyed by finalize/non‑finalize
Keeps profiler prep artifacts separated per epilogue choice. Clean structure.
1016-1018: LGTM: prepareTmaWsInputs API updated to pass fusion
Header and implementation are in sync; enables precomputing per‑fusion TMA inputs.
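To make the MoeGemmId and 2‑D cache comments above concrete, here is a minimal, self‑contained sketch of the shapes involved. The names MoeGemmId, GEMM_1/GEMM_2, NUM_ROUTING_SAMPLES, and TmaWarpSpecializedGroupedGemmInput come from this review; the enumerator order, the sample count, and the std::array layout are illustrative assumptions rather than the PR's actual declarations.

#include <array>
#include <cstddef>

// Stand-in for the real TMA WS input type referenced in this review.
struct TmaWarpSpecializedGroupedGemmInput
{
    bool valid = false;
    bool isValid() const { return valid; }
};
constexpr std::size_t NUM_ROUTING_SAMPLES = 16; // assumed sample count, for illustration only

// Identifies which grouped GEMM of the MoE layer a tactic or profile applies to.
enum class MoeGemmId
{
    GEMM_1, // FC1 / gate-up projection
    GEMM_2  // FC2 / down projection; per this PR, the only GEMM eligible for finalize fusion
};

// One cached TMA template per routing sample, duplicated per epilogue variant
// (index 0 = EpilogueFusion::NONE, index 1 = EpilogueFusion::FINALIZE).
using TmaInputCache = std::array<std::array<TmaWarpSpecializedGroupedGemmInput, 2>, NUM_ROUTING_SAMPLES>;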
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4606-4616: Critical: TMA WS cache aliasing between FINALIZE and NONE in profiler
prepare() calls prepareTmaWsInputs twice (NONE and FINALIZE). Both calls start writing TMA WS templates from the same base workspace pointer (tma_ws_input_workspace), and getProfilerWorkspaces currently sizes only one block of TMA input workspace: workspaceSize * (NUM_ROUTING_SAMPLES + 1). The second call overwrites the first, so mTmaInputCache[...][false] and [true] end up sharing the same memory region, making one cache invalid after the other is prepared.
Allocate distinct workspace halves for NONE and FINALIZE and offset writes when fusion == FINALIZE.
Apply these diffs:
- In getProfilerWorkspaces (size the TMA WS area for both NONE and FINALIZE caches):
-     size_t tma_ws_input_workspace_size = 0;
-     if (is_tma_ws_input)
-     {
-         tma_ws_input_workspace_size
-             = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
-             * (NUM_ROUTING_SAMPLES + 1);
+     size_t tma_ws_input_workspace_size = 0;
+     if (is_tma_ws_input)
+     {
+         // We pre-build BOTH NONE and FINALIZE templates in prepare(), so reserve space for both.
+         tma_ws_input_workspace_size
+             = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
+             * (NUM_ROUTING_SAMPLES + 1) * 2;
- In prepareTmaWsInputs (offset to the second half for FINALIZE):
-     size_t tma_ws_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(mNumExpertsPerNode, mScalingType);
-
-     TmaWarpSpecializedGroupedGemmInput dummy_tma_ws_input;
-     dummy_tma_ws_input.configureWorkspace(tma_ws_input_workspace, mNumExpertsPerNode, gemm_workspace,
-         workspaces.at("gemm_workspace").first, mScalingType);
-     tma_ws_input_workspace += tma_ws_size;
+     size_t tma_ws_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(mNumExpertsPerNode, mScalingType);
+     // Reserve the first half for NONE and the second half for FINALIZE.
+     size_t cache_stride = tma_ws_size * (NUM_ROUTING_SAMPLES + 1);
+     if (use_finalize_fusion)
+     {
+         tma_ws_input_workspace += cache_stride;
+     }
+
+     TmaWarpSpecializedGroupedGemmInput dummy_tma_ws_input;
+     dummy_tma_ws_input.configureWorkspace(tma_ws_input_workspace, mNumExpertsPerNode, gemm_workspace,
+         workspaces.at("gemm_workspace").first, mScalingType);
+     tma_ws_input_workspace += tma_ws_size;

This ensures the two caches use non-overlapping memory and remain valid across runs.
♻️ Duplicate comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4566-4576: Fix validated: compute groupwise flags before early-return
Declaring use_w4afp8/use_wfp4a16/use_w4_groupwise before the finalize-fusion early-exit resolves the prior “used before declaration” error and keeps the guard readable.
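A minimal sketch of the ordering this fix describes: the groupwise flags are computed before the finalize-fusion early exit that reads them. Only use_w4afp8, use_wfp4a16, use_w4_groupwise, and the FINALIZE enumerator are taken from this review; the enum, function name, and surrounding logic are assumptions for illustration.

enum class EpilogueFusion { NONE, FINALIZE }; // stand-in for TmaWarpSpecializedGroupedGemmInput::EpilogueFusion

void prepareTemplatesSketch(EpilogueFusion fusion, bool use_w4afp8, bool use_wfp4a16)
{
    // Compute the groupwise flags first so the early return below can safely read them.
    bool const use_w4_groupwise = use_w4afp8 || use_wfp4a16;
    if (fusion == EpilogueFusion::FINALIZE && use_w4_groupwise)
    {
        return; // Finalize-fused templates are not prepared for groupwise W4 paths.
    }
    // ... build the TMA WS templates for this fusion variant ...
}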
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
(6 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧠 Learnings (3)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (5)
2850-2854: Good: tactic-driven gating of permuted_token_final_scales workspace
Deriving permuted_token_final_scales_ from gemm2_config_->epilogue_fusion_type makes the allocation/use explicit and aligned with the GEMM2 tactic. Clean separation vs. implicit SM checks.
4010-4016: Good: finalize-fusion consistency check and gating
using_fused_finalize combines runtime config/toggles with the tactic’s epilogue_fusion_type and excludes disallowed modes (!use_w4_groupwise, !use_lora). The TLLM_CHECK enforces config-tactic agreement early, which prevents subtle misconfigurations.
4558-4560: API update acknowledged: prepareTmaWsInputs now takes fusion
The additional EpilogueFusion parameter makes profiler template preparation fusion-aware. Call sites (e.g., prepare()) have been updated accordingly.
4632-4636: Finalize-fusion parameterization looks correct
Setting fusion=FINALIZE and calling setFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, mK > 1) matches the expected (hidden_size, num_output_tokens, reduction) semantics. This aligns with the epilogue’s transpose mapping.
4673-4677: Proactive prebuild of both NONE and FINALIZE templates in prepare()
Populating both caches up front reduces per-run overhead during profiling. Once the non-overlapping workspace fix above is applied, this is a solid optimization.
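As a reference for the two GEMM2 comments above, a hedged sketch of how the tactic-driven gating could be expressed. epilogue_fusion_type, FINALIZE, use_w4_groupwise, and use_lora come from this review; the types and the function itself are illustrative, not the PR's code.

enum class EpilogueFusionType { NONE, FINALIZE }; // stand-in for the CutlassGemmConfig member

struct Gemm2ConfigSketch
{
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

// True only when the selected GEMM2 tactic carries the FINALIZE epilogue and no
// disallowed mode (groupwise W4, LoRA) is active; the runner config must agree with this.
bool shouldUseFusedFinalize(Gemm2ConfigSketch const& gemm2_config, bool use_w4_groupwise, bool use_lora)
{
    bool const tactic_wants_finalize = gemm2_config.epilogue_fusion_type == EpilogueFusionType::FINALIZE;
    return tactic_wants_finalize && !use_w4_groupwise && !use_lora;
}

When the fused path is taken, the review above notes the epilogue is parameterized via setFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, mK > 1), which matches the swap+transpose output mapping recorded in the learnings.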
19b9c1a to e58c302 (Compare)
Actionable comments posted: 3
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4616-4663: Critical: TMA WS workspace overlap between NONE and FINALIZE variants can clobber cached templates
prepare() invokes prepareTmaWsInputs twice (NONE, FINALIZE), but getProfilerWorkspaces only reserves (NUM_ROUTING_SAMPLES + 1) TMA slots, and prepareTmaWsInputs always starts from the same base workspace offset. As a result:
- mTmaInputCache[i][0] and mTmaInputCache[i][1] point into overlapping regions.
- The second call overwrites the first call’s templates. At runtime, whichever was prepared last “wins,” and the other variant’s template may be corrupted.
Proposed fix (minimal, preserves your two-call flow):
- Double the TMA input workspace to hold both variants.
- Offset the workspace base when preparing the FINALIZE variant so it uses the second half.
Apply these diffs:
- Allocate enough workspace for both NONE and FINALIZE variants
-     size_t tma_ws_input_workspace_size
-         = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
-         * (NUM_ROUTING_SAMPLES + 1);
+     // Reserve separate slots for NONE and FINALIZE variants: one dummy + NUM_ROUTING_SAMPLES per variant.
+     size_t tma_ws_input_workspace_size
+         = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType)
+         * (2 * (NUM_ROUTING_SAMPLES + 1));
- Use distinct halves of the workspace depending on fusion
      size_t tma_ws_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(mNumExpertsPerNode, mScalingType);
+     // If preparing FINALIZE, jump to the second half so NONE and FINALIZE do not overlap.
+     if (use_finalize_fusion)
+     {
+         tma_ws_input_workspace += tma_ws_size * (NUM_ROUTING_SAMPLES + 1);
+     }
+
      TmaWarpSpecializedGroupedGemmInput dummy_tma_ws_input;
      dummy_tma_ws_input.configureWorkspace(tma_ws_input_workspace, mNumExpertsPerNode, gemm_workspace,
          workspaces.at("gemm_workspace").first, mScalingType);

This guarantees mTmaInputCache[i][0] and mTmaInputCache[i][1] reference non-overlapping TMA WS regions, preventing clobbering.
♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
603-621: Make addFinalizeFusionConfigs accept by-value to support both lvalues and rvalues (and reserve capacity)
Taking an rvalue reference restricts call sites to temporaries only. By-value is simpler, safe, and enables both lvalues and rvalues with guaranteed local copy semantics. Also, reserve capacity to avoid reallocation when duplicating entries.
Apply this diff:
- static auto addFinalizeFusionConfigs(
-     std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)
+ static std::vector<cutlass_extensions::CutlassGemmConfig> addFinalizeFusionConfigs(
+     std::vector<cutlass_extensions::CutlassGemmConfig> configs, bool use_fused_finalize)
  {
      if (!use_fused_finalize)
          return configs;
-     size_t const num_configs = configs.size();
+     size_t const num_configs = configs.size();
+     configs.reserve(num_configs * 2);
      for (size_t i = 0; i < num_configs; ++i)
      {
          if (configs[i].is_tma_warp_specialized)
          {
              configs.push_back(configs[i]);
              configs.back().epilogue_fusion_type
                  = cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE;
          }
      }
      return configs;
  }
628-633: Static context type usage is ill-formed; don’t use decltype on a non-static data member
decltype(moe_gemm_runner_) in a static method is invalid and won’t compile. Use the known runner type directly.
Apply this diff:
  static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
  {
-     using RunnerType = decltype(moe_gemm_runner_);
+     using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
      return Self::addFinalizeFusionConfigs(
          RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
  }
829-833: Same static context issue in mayHaveFinalizeFused(int sm)
Avoid referencing an instance member type in a static method. Use the concrete MoeGemmRunner type.
Apply this diff:
  static bool mayHaveFinalizeFused(int sm)
  {
-     using RunnerType = decltype(moe_gemm_runner_);
+     using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
      return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
  }
🧹 Nitpick comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
541-549: Help text: clarify fallback when only tactic_id1 is provided
Document that tactic_id2 mirrors tactic_id1 when omitted, to match generator behavior and reduce confusion.
Apply this diff:
-     "- \"tactic_id1, tactic_id2\"\n"
-     "The config for the CUTLASS GEMM. tactic_idX sets the tactic for the corresponding GEMM"
+     "- \"tactic_id1, tactic_id2\"\n"
+     "The config for the CUTLASS GEMM. tactic_idX sets the tactic for the corresponding GEMM. "
+     "If tactic_id2 is omitted, it defaults to tactic_id1.\n"
      "Valid tactics are:\n"
      " - An integer: corresponds to an index in the tactics array. WARNING this is not stable between data types "
      "or GPU architectures\n"
      " - An array: of integers, forms a list of tactics to sweep\n"
      " - The string \"all\": This will sweep through all possible tactics\n"
      " - The string \"auto\": This runs a short benchmark to pick the fastest tactic before each benchmark case. "
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
(5 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(7 hunks)cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
(3 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
(7 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
(6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧠 Learnings (4)
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (17)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5)
231-236: MoeGemmId enum introduction looks correct
Enum values and scope are appropriate for per‑GEMM selection.
456-456: Interface change to per‑GEMM tactics is aligned
The new getTactics(MoeGemmId) signature makes sense for per‑GEMM flows.
622-626: Per‑GEMM tactics override logic LGTM
Gating finalize fusion to GEMM_2 looks intentional and consistent with the PR goals.
1003-1004: Cache widening to support two TMA inputs per routing sample looks good
Using a second dimension to separate the finalize and non‑finalize variants is clear and fits the per‑GEMM flow.
1016-1018: prepareTmaWsInputs: epilogue fusion parameter addition is consistent
Signature extension is consistent with finalize fusion gating; call sites appear updated across the PR.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5)
836-844: Per‑GEMM tactic retrieval in pickBestTactic is correct
Casting the local GemmToProfile to MoeGemmId is safe given the defined numeric mapping.
928-936: Fetching separate tactic vectors for GEMM_1 and GEMM_2 looks good
The split avoids conflating finalize‑fused variants into GEMM_1.
952-956: Applying tactics per GEMM is correct
Passing the specific pair into setTactic aligns with the new API.
970-979: runMoEPermute: using the GEMM‑specific tactic is correct
Consistent with the profiler’s gemm selection. Good guard to ensure profiler/backend configuration match.
1079-1085: Diagnostics for selected tactics per GEMM are useful
Clearer than a single aggregated list; good for debugging.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
53-86: Per‑GEMM parsing path LGTM
Recursive parsing with support for integer, array, "all", and "auto" is clear and minimal.
295-301: Printing valid tactics per GEMM improves UX
Helpful one‑time diagnostics to guide configuration.
420-431: Verbose dump of all tactics by GEMM is useful
Clear separation per GEMM, with indices, makes CLI config simpler.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
2850-2854: Fusion-aware workspace wiring for GEMM2 scales: LGTM
Using the epilogue_fusion_type to gate permuted_token_final_scales_ is the right direction and removes SM-coupling.
4010-4016: Tactic-driven finalize gating and consistency check: LGTM
- The compare against the tactic’s epilogue_fusion_type and the equality check are clear and prevent divergence.
- Passing (final_output, hidden_size, num_rows, ...) to setFinalizeFusionParams matches the transpose semantics.
4558-4576: prepareTmaWsInputs now parametric on fusion + early groupwise guards: LGTM
- The new fusion parameter makes the API explicit.
- Computing use_w4_groupwise before the early-return resolves the prior “used before declaration” issue.
4739-4743: Per-sample/per-fusion cache lookup with validity check: LGTM
Selecting the cached template by fusion type and verifying isValid() prevents accidental use of uninitialized templates.
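A small, self-contained sketch of the per-sample/per-fusion lookup praised above. The two-index cache shape, isValid(), and the FINALIZE choice come from this review; the stand-in types, the sample count, and the assert are assumptions for illustration.

#include <array>
#include <cassert>
#include <cstddef>

enum class EpilogueFusion { NONE, FINALIZE };                                      // stand-in
struct TmaTemplate { bool valid = false; bool isValid() const { return valid; } }; // stand-in
constexpr std::size_t NUM_ROUTING_SAMPLES = 16;                                    // assumed

using Cache = std::array<std::array<TmaTemplate, 2>, NUM_ROUTING_SAMPLES>;

// Pick the cached template for this routing sample and epilogue variant, refusing to
// run with one that was never prepared (mirrors the isValid() check in the PR).
TmaTemplate const& pickCachedTemplate(Cache const& cache, std::size_t sample, EpilogueFusion fusion)
{
    auto const& entry = cache[sample][fusion == EpilogueFusion::FINALIZE ? 1 : 0];
    assert(entry.isValid() && "TMA WS template was not prepared for this fusion type");
    return entry;
}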
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
66-81
: Release-mode bug: unrecognized tactic strings silently treated as "auto"
In release builds, the
assert(tactic.get<std::string>() == "auto")
(Line 78) is compiled out, so any unrecognized string becomes-1
(“auto”). This is dangerous and hard to debug. Explicitly validate the string and throw on invalid values.Apply this diff:
- else if (tactic.is_string()) - { - assert(tactic.is_string()); - auto tactic_name = tactic.get<std::string>(); - if (tactic_name == "all") - { - auto all_tactics = listAllTactics<BenchClass>(gemm_id); - tactic_ids.resize(all_tactics.size()); - std::iota(tactic_ids.begin(), tactic_ids.end(), 0); - } - else - { - assert(tactic.get<std::string>() == "auto"); - tactic_ids.push_back(-1); - } - } + else if (tactic.is_string()) + { + auto tactic_name = tactic.get<std::string>(); + if (tactic_name == "all") + { + auto all_tactics = listAllTactics<BenchClass>(gemm_id); + tactic_ids.resize(all_tactics.size()); + std::iota(tactic_ids.begin(), tactic_ids.end(), 0); + } + else if (tactic_name == "auto") + { + tactic_ids.push_back(-1); + } + else + { + throw std::invalid_argument("Invalid tactic string: " + tactic_name + ". Expected \"all\" or \"auto\""); + } + }Optional: replace magic
-1
with a named constant (e.g., constexpr int kTACTIC_AUTO = -1) for clarity.
♻️ Duplicate comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
45-50
: Potentially invalid RunnerType alias; confirmmMoERunner
is static or switch to a concrete runner type
using RunnerType = decltype(BenchClass::mMoERunner);
will not compile ifmMoERunner
is a non-static data member. Even if it compiles, it’s brittle. Prefer a concrete runner type (or a type alias from BenchClass) and callgetTactics(sm, gemm_id)
on that.If
mMoERunner
is non-static, apply this change (as previously suggested) to make the runner type explicit:template <class BenchClass> auto listAllTactics(MoeGemmId gemm_id) { int const sm = getSMVersion(); - using RunnerType = decltype(BenchClass::mMoERunner); + using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType, + typename BenchClass::WeightType, + typename BenchClass::OutputType, + typename BenchClass::InputType>; return RunnerType::getTactics(sm, gemm_id); }Alternatively, if BenchClass exposes a canonical runner alias, use that (e.g.,
using RunnerType = typename BenchClass::RunnerType;
).To verify whether
mMoERunner
is static and where the canonical runner type lives, run:#!/bin/bash # Show definitions/usages of mMoERunner and any RunnerType aliases rg -n -C3 --type=cpp --type=cu '\bmMoERunner\b|RunnerType|CutlassMoeFCRunner'
278-285
: LGTM: tactic_id1/2 are now parsed conditionally (fixes prior crash when only one was present)
Good cleanup. This addresses the earlier issue of unconditional access causing exceptions when only one of the keys existed.
🧹 Nitpick comments (3)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
334-353
: Consider defaulting unspecified tactics to "auto" instead of skipping the entire config
Currently, if either
tactic_ids1
ortactic_ids2
is empty, the config is skipped. Defaulting the missing side(s) to-1
(“auto”) makes configs more user-friendly (and aligns with the help text’s emphasis on “auto” for quick sweeps).Apply this diff:
- if (tactic_ids1.empty() || tactic_ids2.empty()) - { - std::cerr << "Warning: Skipping benchmark, no valid tactic found" << std::endl; - static bool printed = false; - if (!printed) - { - printed = true; - std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n"; - for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2}) - { - std::cerr << "GEMM " << (int) gemm_id << ":\n"; - auto confs = listAllTactics<BenchClass>(gemm_id); - for (auto c : confs) - std::cerr << c.toString(); - std::cerr << std::endl; - } - } - - continue; - } + if (tactic_ids1.empty()) + { + if (LOG_LEVEL >= INFO) + std::cerr << "Info: 'tactic_id1' not provided, defaulting to auto (-1)\n"; + tactic_ids1 = {-1}; + } + if (tactic_ids2.empty()) + { + if (LOG_LEVEL >= INFO) + std::cerr << "Info: 'tactic_id2' not provided, defaulting to auto (-1)\n"; + tactic_ids2 = {-1}; + }If you prefer to keep the skip behavior, consider clarifying this requirement in the help text (see comment below).
388-388
: Comment nit: reflect per-GEMM tactic enumeration
The comment references a non-existent single-argument
listAllTactics
and ignores the GEMM split. Tighten the comment to avoid confusion.-auto cutlass_tactic = {-1}; // {0,..., listAllTactics<BenchClass>(MoeGemmId).size()}; +auto cutlass_tactic = {-1}; // Use -1 for "auto". To sweep by GEMM, see: + // listAllTactics<BenchClass>(MoeGemmId::GEMM_1/2).size()Optional: introduce a named constant for
-1
(e.g., kTACTIC_AUTO) and reuse at Lines 316, 320, and here.
536-545
: Help text looks aligned; optionally clarify required/optional behavior for tactic_id1/2
- Good: “objects” removed; supported types are now integer, array of integers, or strings "all"/"auto".
- Suggestion: Explicitly state whether both
tactic_id1
andtactic_id2
must be provided. If you adopt the default-to-auto behavior, document that omission defaults to"auto"
for the missing side. Otherwise, note that omission will skip the config.Proposed tweak if you choose to default missing sides to auto:
- "- \"tactic_id1, tactic_id2\"\n" - "The config for the CUTLASS GEMM. tactic_idX sets the tactic for the corresponding GEMM" + "- \"tactic_id1, tactic_id2\"\n" + "The config for the CUTLASS GEMMs. tactic_idX sets the tactic for the corresponding GEMM (1 or 2).\n" + "If either is omitted, it defaults to \"auto\" for that GEMM."
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
415-426: LGTM: improved per-GEMM tactic listing with indices
Clear, per-GEMM tactic dumps with stable, zero-based indices; helpful for users to map IDs to configurations.
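For readers configuring the benchmark, a short sketch of the kind of per-GEMM listing this comment refers to. It is a fragment that assumes the benchmark's listAllTactics<BenchClass>() helper and the config toString() shown elsewhere in this review; the exact output format is an assumption.

// Print every available tactic for each GEMM with its zero-based index, so users can map
// the "tactic_id1"/"tactic_id2" values in a benchmark config back to a concrete configuration.
for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2})
{
    auto const confs = listAllTactics<BenchClass>(gemm_id);
    std::cout << "GEMM " << static_cast<int>(gemm_id) << ":\n";
    for (size_t i = 0; i < confs.size(); ++i)
    {
        std::cout << "  " << i << ": " << confs[i].toString() << "\n";
    }
}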
/bot run
PR_Github #15547 [ run ] triggered by Bot
PR_Github #15547 [ run ] completed with state
/bot run
PR_Github #15596 [ run ] triggered by Bot
PR_Github #15596 [ run ] completed with state
8493e78 to e27b017 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (5)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
949-951: Guard against empty tactic vectors before indexing in deterministic mode
Accessing [0] without checking for emptiness will crash when no tactics are available (e.g., unsupported shape/quant combo). Guard and use .front().
Apply this diff:
- gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0];
- gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0];
+ auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1);
+ auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2);
+ TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE");
+ gemm1 = t1.front();
+ gemm2 = t2.front();
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
45-50: Fix type alias: decltype(BenchClass::mMoERunner) is not a concrete runner type
Using decltype on a (likely non-static) data member is incorrect here and will not yield a usable type for the static call. Use the concrete runner type derived from BenchClass template parameters.
Apply this diff:
  template <class BenchClass>
  auto listAllTactics(MoeGemmId gemm_id)
  {
      int const sm = getSMVersion();
-     using RunnerType = decltype(BenchClass::mMoERunner);
-     return RunnerType::getTactics(sm, gemm_id);
+     using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType,
+         typename BenchClass::WeightType,
+         typename BenchClass::OutputType,
+         typename BenchClass::InputType>;
+     return RunnerType::getTactics(sm, gemm_id);
  }
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
1178-1181: Fixed call site to updated selectTacticsForArch signature — LGTM
Removal of the extra argument addresses the prior build issue when assigning std::pair.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
829-833: Same static-context issue as above in mayHaveFinalizeFused(int sm)
Use the explicit runner type; don’t reference an instance member.
Apply this diff:
- static bool mayHaveFinalizeFused(int sm)
- {
-     using RunnerType = decltype(moe_gemm_runner_);
-     return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
- }
+ static bool mayHaveFinalizeFused(int sm)
+ {
+     using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+     return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise;
+ }
628-633: Compile-time error: referencing non-static member in static context
decltype(moe_gemm_runner_) cannot be used in a static method (no object). Use the known runner type explicitly.
Apply this diff:
- static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
- {
-     using RunnerType = decltype(moe_gemm_runner_);
-     return Self::addFinalizeFusionConfigs(
-         RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
- }
+ static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id)
+ {
+     using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>;
+     return Self::addFinalizeFusionConfigs(
+         RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm));
+ }
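With the static-context fix above, a caller no longer needs a runner instance to enumerate tactics. A hedged usage sketch follows: the instantiation and the half type are placeholders chosen for illustration; only getTactics(sm, MoeGemmId) and the GEMM_2/finalize relationship are taken from this review.

// Hypothetical instantiation purely for illustration.
using Runner = CutlassMoeFCRunner<half, half, half, half>;

int const sm = getSMVersion();
auto const gemm1_tactics = Runner::getTactics(sm, MoeGemmId::GEMM_1);
auto const gemm2_tactics = Runner::getTactics(sm, MoeGemmId::GEMM_2); // may also contain FINALIZE-fused variants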
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4675-4679: Note: No base-pointer bump between NONE and FINALIZE calls
Given the above fix (offset inside prepareTmaWsInputs and doubled allocation), no changes are needed here. If you prefer offsetting at the call site instead, ensure only tma_ws_input_workspace is shifted — not the entire workspace base (or you’ll invalidate other pointers retrieved via GET_WS_PTR).
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
603-620: API ergonomics: accept configs by value instead of rvalue-ref for flexibility
addFinalizeFusionConfigs(std::vector&&) works with temporaries but forces std::move at call sites with lvalues. Taking by value is simpler and lets the compiler elide/move as needed without constraining callers.
Suggested change:
- static auto addFinalizeFusionConfigs(
-     std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)
+ static auto addFinalizeFusionConfigs(
+     std::vector<cutlass_extensions::CutlassGemmConfig> configs, bool use_fused_finalize)
1016-1018: Nit: parameter name mismatch between declaration and definition
Header declares prepareTmaWsInputs(..., char* workspace, ...), while the definition uses char* workspace_ptr_char. Please align parameter names across declaration/definition to match the repo’s consistency guideline.
Example header tweak:
- void prepareTmaWsInputs(int num_tokens, char* workspace, void const* expert_weights,
+ void prepareTmaWsInputs(int num_tokens, char* workspace_ptr_char, void const* expert_weights,
      TmaWarpSpecializedGroupedGemmInput::EpilogueFusion fusion, cudaStream_t stream);
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
(5 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(7 hunks)cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
(3 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
(7 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
(6 hunks)cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
(2 hunks)cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
(1 hunks)cpp/tensorrt_llm/thop/moeOp.cpp
(7 hunks)cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
(11 hunks)tensorrt_llm/_torch/autotuner.py
(1 hunks)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
- tensorrt_llm/_torch/autotuner.py
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
- cpp/tensorrt_llm/thop/moeOp.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (24)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3)
436-443
: Well-scoped addition of EpilogueFusionType
Introducing a local EpilogueFusionType and storing it on the config is clean and future-proof. Defaulting to NONE avoids impacting existing call sites.
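To make the shape of that change concrete for readers outside the CUTLASS extensions, here is a minimal standalone sketch; the struct and field names below are simplified stand-ins, not the actual gemm_configs.h definitions.

```cpp
#include <iostream>
#include <string>

// Simplified stand-in for a GEMM config carrying an epilogue fusion tag.
struct GemmConfigSketch
{
    enum class EpilogueFusionType
    {
        NONE,
        FINALIZE
    };

    bool is_tma_warp_specialized = false;
    // Defaulting to NONE keeps existing construction sites behaving as before.
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;

    std::string toString() const
    {
        return std::string("tma_ws=") + (is_tma_warp_specialized ? "1" : "0") + ", epilogue_fusion_type="
            + (epilogue_fusion_type == EpilogueFusionType::FINALIZE ? "FINALIZE" : "NONE");
    }
};

int main()
{
    GemmConfigSketch config;
    std::cout << config.toString() << std::endl; // prints epilogue_fusion_type=NONE by default
    config.epilogue_fusion_type = GemmConfigSketch::EpilogueFusionType::FINALIZE;
    std::cout << config.toString() << std::endl;
    return 0;
}
```

Keeping the member defaulted means diagnostics such as toString() can report the fusion state without any call-site changes.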
513-515
: toString: adds epilogue_fusion_type and enableCudaKernel in TMA path — good for diagnostics
The expanded diagnostic string helps correlate tactics with finalize fusion during debugging.
546-548
: operator<< parity with toString for fusion type — LGTM
Including epilogue_fusion_type in the TMA path streaming output keeps logs/prints consistent.
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
1279-1282
: Per-GEMM tactic plumbing in profiler getTactics — LGTM
Returning tactics via mRunner->mMOERunner->getTactics(backend.mGemmToProfile) correctly aligns profiler enumeration with the selected GEMM.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (6)
53-86
: Per-GEMM tactic parser refactor — LGTM
Signature now threads gemm_id through recursion and supports scalar, array, and "all"/"auto" inputs. Correct and robust.
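For reference, a rough sketch of the accepted input shapes, assuming an nlohmann::json-style value; parseTacticSketch, MoeGemmId, and numTacticsFor are simplified placeholders here, not the benchmark's real declarations.

```cpp
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>
#include <vector>

enum class MoeGemmId { GEMM_1 = 1, GEMM_2 = 2 };

// Hypothetical stand-in for querying how many tactics a GEMM exposes.
int numTacticsFor(MoeGemmId) { return 8; }

// Accepts a scalar index, an array of entries, or the strings "all"/"auto",
// mirroring the forms the launcher documents.
void parseTacticSketch(nlohmann::json const& tactic, std::vector<int>& out, MoeGemmId gemm_id)
{
    if (tactic.is_number_integer())
    {
        out.push_back(tactic.get<int>());
    }
    else if (tactic.is_array())
    {
        for (auto const& entry : tactic)
        {
            parseTacticSketch(entry, out, gemm_id); // recursion threads gemm_id through
        }
    }
    else if (tactic.is_string())
    {
        auto const name = tactic.get<std::string>();
        if (name == "all")
        {
            for (int i = 0; i < numTacticsFor(gemm_id); ++i)
            {
                out.push_back(i);
            }
        }
        else if (name == "auto")
        {
            out.push_back(-1); // -1 is the "auto" sentinel
        }
        else
        {
            throw std::invalid_argument("Unknown tactic string: " + name);
        }
    }
}
```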
279-285
: Conditional parsing for tactic_id1/tactic_id2 — LGTM
Parsing each key only if present avoids exceptions and permits asymmetric specification.
334-353
: Fail-fast when no valid tactics, with helpful diagnostics — LGTM
Clear warning and a dump of valid tactics per GEMM will aid users authoring configs.
388-389
: Default cutlass_tactic sentinel (-1) — LGTM
Using -1 to represent "auto" aligns with the parser semantics.
415-426
: Verbose listing split per GEMM — LGTM
The per-GEMM breakdown improves debuggability for tactic enumeration.
536-545
: Help text corrected to remove unsupported “objects” and clarify valid forms — LGTM
Docs now match the parser. Good cleanup; reduces user confusion.
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (10)
373-375
: Introduce mUseFusedFinalize test control — LGTM
Clear knob for fused finalize in tests; defaults to false to cover both paths via K threshold elsewhere.
459-459
: Test runner finalize gating matches production logic — LGTM
use_fused_finalize_ = (k < 3 || mUseFusedFinalize) mirrors runtime behavior and keeps tests aligned.
1090-1119
: getFilteredConfigs now gemm-aware with SM-specific filtering — LGTM
Threading MoeGemmId and retaining the L40S filter preserves previous behavior while enabling per-GEMM selection.
1120-1144
: selectTacticsForArch returns per-GEMM pair and accounts for fusion type — LGTM
Selecting GEMM_2 by both TMA specialization and epilogue_fusion_type is the right pivot; sensible fallback if not found.
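A compact sketch of that selection pattern, using a simplified stand-in for the real tactic/config type (field names are assumptions for illustration):

```cpp
#include <algorithm>
#include <vector>

// Simplified stand-in for the real tactic descriptor.
struct TacticSketch
{
    enum class Fusion { NONE, FINALIZE };
    bool is_tma_warp_specialized = false;
    Fusion epilogue_fusion_type = Fusion::NONE;
};

// Prefer a TMA warp-specialized tactic whose epilogue fusion matches the request;
// fall back to the first available tactic when no such config exists.
// Assumes the tactic list is non-empty.
TacticSketch selectGemm2Tactic(std::vector<TacticSketch> const& tactics, bool want_finalize)
{
    auto const wanted = want_finalize ? TacticSketch::Fusion::FINALIZE : TacticSketch::Fusion::NONE;
    auto const it = std::find_if(tactics.begin(), tactics.end(),
        [&](TacticSketch const& t) { return t.is_tma_warp_specialized && t.epilogue_fusion_type == wanted; });
    return it != tactics.end() ? *it : tactics.front();
}
```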
1641-1645
: Determinism check keyed off epilogue_fusion_type — LGTM
Using gemm2.epilogue_fusion_type to decide determinism is precise and reflects the fused-finalize path.
1763-1766
: PermuteNonDeterministic explicitly enables fused finalize — LGTM
Targets the intended non-deterministic path for K=3 without conflating with legacy flags.
1906-1914
: Determinism guard in EP/TP loop (first rank) — LGTM
Consistent with BasicPermuteTest, including the SM-range carveouts.
1928-1932
: Determinism guard in EP/TP loop (subsequent ranks) — LGTM
Same policy replicated for subsequent ranks; good coverage.
2095-2095
: ConfigSweep: enable fused finalize for sweep — LGTM
Ensures both fusion states are exercised during exhaustive config sweeps.
2123-2129
: ConfigSweep: enumerate tactics per GEMM — LGTM
Splitting configs1/configs2 via gemm_id is necessary for the new API.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
2850-2854
: LGTM: finalize-fusion workspace pointer is now tactic-driven
Setting permuted_token_final_scales_ based on epilogue_fusion_type is a clean improvement and removes stale coupling. No issues spotted.
4008-4023
: LGTM: tactic-aligned FINALIZE gating and param setup
The consistency check and the gating condition for using_fused_finalize are sound. The setFinalizeFusionParams call aligns with hidden_size-by-num_rows semantics as used elsewhere in this file.
4710-4745
: LGTM: tactic-resolved access to per-sample TMA template
Selecting mTmaInputCache[mSampleIndex][finalize?1:0] with a validity check is correct and matches the new dual-variant caching model.
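The dual-variant lookup reduces to something like the following toy sketch; the array shape, kNUM_ROUTING_SAMPLES value, and member names are illustrative assumptions, not the profiler's actual layout.

```cpp
#include <array>
#include <cassert>

struct TmaTemplateSketch
{
    bool is_valid = false;
};

constexpr int kNUM_ROUTING_SAMPLES = 16;

// [sample][0] holds the NONE variant, [sample][1] the FINALIZE variant.
std::array<std::array<TmaTemplateSketch, 2>, kNUM_ROUTING_SAMPLES> tma_input_cache{};

TmaTemplateSketch const& selectTemplate(int sample_index, bool uses_finalize_fusion)
{
    auto const& entry = tma_input_cache[sample_index][uses_finalize_fusion ? 1 : 0];
    assert(entry.is_valid && "Requested TMA template variant was never prepared");
    return entry;
}
```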
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
1003-1004
: LGTM: dual-variant cache for TMA inputs
mTmaInputCache[SAMPLE][0|1] establishes clear separation between NONE and FINALIZE variants. This matches the runtime selection in the profiler.
e27b017 to 975f1c3 (Compare)
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4616-4640
: Critical: TMA WS for NONE vs FINALIZE variants overlap; allocate 2× and separate regions
prepareTmaWsInputs is invoked twice (NONE and FINALIZE), but both runs currently configure mTmaInputCache from the same base tma_ws_input_workspace span. This causes the FINALIZE templates to clobber the NONE templates (and vice versa), making the cached TMA descriptors nondeterministic.
Fix requires two coordinated changes:
- Double the TMA WS allocation in the profiler to reserve independent halves for NONE and FINALIZE.
- When preparing the FINALIZE variant, offset the local workspace pointer into the second half so the two variants never overlap.
Apply the prepare-side separation (see next comment for an exact diff). Additionally, update the workspace sizing as shown below.
Additional change outside this hunk (GemmProfilerBackend::getProfilerWorkspaces):
// Before:
tma_ws_input_workspace_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType) * (NUM_ROUTING_SAMPLES + 1);
// After (reserve two halves: NONE and FINALIZE):
tma_ws_input_workspace_size = TmaWarpSpecializedGroupedGemmInput::workspaceSize(num_experts_per_node, mScalingType) * (2 * (NUM_ROUTING_SAMPLES + 1));
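As a toy illustration of the separation this comment proposes (names and the even 50/50 split are placeholders; the real code would derive sizes from the profiler's workspace map):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// One allocation, two non-overlapping halves:
// [0, half) is reserved for the NONE templates, [half, 2*half) for FINALIZE.
std::pair<uint8_t*, uint8_t*> splitTmaWorkspace(uint8_t* base, std::size_t total_bytes)
{
    std::size_t const half = total_bytes / 2;
    uint8_t* none_half = base;          // prepare(NONE) writes only here
    uint8_t* finalize_half = base + half; // prepare(FINALIZE) writes only here
    return {none_half, finalize_half};
}
```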
♻️ Duplicate comments (10)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
4675-4679
: Separate TMA WS halves for NONE and FINALIZE during prepare()
Advance the workspace pointer into the second half before preparing the FINALIZE cache to avoid overlap with the NONE cache. This pairs with the 2× allocation change in getProfilerWorkspaces.
- prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights, - TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::NONE, stream); - prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights, - TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE, stream); + // Prepare NONE variant in first half + prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights, + TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::NONE, stream); + // Advance into second half for FINALIZE variant to avoid overlap + auto ws = getProfilerWorkspaces(num_tokens, mSM >= 90); + workspace_ptr_char += ws.at("tma_ws_input_workspace").first / 2; + prepareTmaWsInputs(num_tokens, workspace_ptr_char, expert_weights, + TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE, stream);cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
949-951
: Guard tactic access in deterministic path to avoid out-of-range access
Indexing [0] without checking can crash if no tactics are available for the chosen GEMM. Guard and use .front() after validating.
- gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0]; - gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0]; + auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1); + auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2); + TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE"); + gemm1 = t1.front(); + gemm2 = t2.front();cpp/tensorrt_llm/thop/moeOp.cpp (2)
630-638
: Harden GEMM/profile selection: validate gemm_idx, guard empty profiles, and validate profile_id.
As written, any non-1 index maps to GEMM_2;
.front()
will throw on empty; direct indexing can go OOB.- auto const gemm_to_profile - = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; - auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + TORCH_CHECK(gemm_idx == 1 || gemm_idx == 2, "gemm_idx must be 1 or 2"); + auto const gemm_to_profile + = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; + auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_%d", int(gemm_idx)); @@ - // TODO: use the best tactic id found offline for a better default inference perf - auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id]; + // TODO: use the best tactic id found offline for a better default inference perf + const auto& profile = [&]() -> const Profile& + { + if (profile_id == -1) return profiles.front(); + TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()), + "profile_id out of range for GEMM_%d: %ld", int(gemm_idx), profile_id); + return profiles[profile_id]; + }();
740-749
: Guard empty tactic lists and validate profile_ids before.front()
/.at()
.Avoid UB when no tactics are available and enforce index range for the provided ids.
- auto best_gemm1_profile = mGemm1Profiles.front(); - auto best_gemm2_profile = mGemm2Profiles.front(); + TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available"); + auto best_gemm1_profile = mGemm1Profiles.front(); + auto best_gemm2_profile = mGemm2Profiles.front(); if (profile_ids.has_value()) { TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids"); - best_gemm1_profile - = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]); - best_gemm2_profile - = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]); + if (profile_ids.value()[0] != -1) + { + TORCH_CHECK(profile_ids.value()[0] >= 0 + && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()), + "profile_ids[0] out of range"); + best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]); + } + if (profile_ids.value()[1] != -1) + { + TORCH_CHECK(profile_ids.value()[1] >= 0 + && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()), + "profile_ids[1] out of range"); + best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]); + } }cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
536-545
: Docs updated to remove “objects” in tactic arrays — matches parser behavior.This aligns the help text with the implemented JSON parsing.
45-50
: Fix invalid type alias: decltype on non-static member isn’t a runner type; use the concrete runner alias.
decltype(BenchClass::mMoERunner)
for a non-static member yields a pointer-to-member or won’t compile. Use the concrete runner built from BenchClass’ type aliases.template <class BenchClass> -auto listAllTactics(MoeGemmId gemm_id) +auto listAllTactics(MoeGemmId gemm_id) { int const sm = getSMVersion(); - using RunnerType = decltype(BenchClass::mMoERunner); - return RunnerType::getTactics(sm, gemm_id); + using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType, + typename BenchClass::WeightType, + typename BenchClass::OutputType, + typename BenchClass::InputType>; + return RunnerType::getTactics(sm, gemm_id); }cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
1178-1181
: Fixed: selectTacticsForArch called with a single sm argument.This resolves the previous signature mismatch.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
603-621
: Fixed: addFinalizeFusionConfigs no longer tries to bind temporaries to non‑const lvalue refs.Taking by rvalue-reference and returning by value is the right choice here.
628-633
: Compile fix: don’t use decltype(moe_gemm_runner_) in a static method.Referencing a non‑static member in static context is ill‑formed. Use the concrete runner type.
- static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id) - { - using RunnerType = decltype(moe_gemm_runner_); - return Self::addFinalizeFusionConfigs( - RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm)); - } + static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id) + { + using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>; + return Self::addFinalizeFusionConfigs( + RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm)); + }
829-833
: Compile fix: same static context issue in mayHaveFinalizeFused(int sm).Replace decltype with the concrete MoeGemmRunner type.
- static bool mayHaveFinalizeFused(int sm) - { - using RunnerType = decltype(moe_gemm_runner_); - return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise; - } + static bool mayHaveFinalizeFused(int sm) + { + using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>; + return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise; + }
🧹 Nitpick comments (2)
cpp/tensorrt_llm/thop/moeOp.cpp (1)
590-595
: Nit: cast vector::size() to int64_t to avoid narrowing warnings.Return type is int64_t, but size() is size_t. Cast explicitly.
- return (gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size(); + return static_cast<int64_t>((gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size());cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
2-2
: Nit: refresh copyright year.Header still says 2020–2023; repository guidelines indicate using the current year.
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
- tensorrt_llm/_torch/autotuner.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
- tensorrt_llm/_torch/autotuner.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (4)
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧬 Code Graph Analysis (3)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4): sm (1090-1118), sm (1090-1090), sm (1120-1144), sm (1120-1120)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2): getTactics (1278-1282), getTactics (1278-1278)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2): gemm2 (702-726), mK (986-986)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2): gemm2 (638-662), mK (923-923)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1): mK (381-381)
- cpp/include/tensorrt_llm/common/cudaUtils.h (1): getSMVersion (298-364)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1): vector (566-570)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2): getTactics (1278-1282), getTactics (1278-1278)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1): EpilogueFusion (167-332)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (21)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
2849-2854
: Gating scales buffer by epilogue fusion looks right
Selecting permuted_token_final_scales_ only when GEMM2 uses FINALIZE epilogue matches the tactic-driven design and prevents dangling pointers when running NONE.
4010-4016
: Tactic-driven finalize gating is consistent and clearer than SM checks
Replacing SM-based logic with epilogue_fusion_type and asserting alignment between runner and tactic is the right move.
4566-4576
: Good fix: compute groupwise flags before early return
Declaring use_w4afp8/use_wfp4a16/use_w4_groupwise before the FINALIZE gating avoids the previous “used-before-declared” issue and makes the early return safe.
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
1281-1282
: Per-GEMM tactic retrieval in profiler is correct
Returning tactics via mRunner->mMOERunner->getTactics(backend.mGemmToProfile) keeps the profiler aligned with the GEMM being profiled. Looks good.
cpp/tensorrt_llm/thop/moeOp.cpp (3)
51-51
: Good: explicit alias to MoeGemmId keeps thop consistent with kernels.
This makes the intent clear and avoids namespace churn. No issues.
218-221
: Per-GEMM tactic vectors populated at construction look correct.
Pulling tactics separately for GEMM_1 and GEMM_2 aligns with the new API and avoids mixing profiles.
648-649
: LGTM: propagate selected GEMM into profiler backend.
This keeps the profiler state coherent with the per-GEMM selection path.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
279-285
: Good: tactic_id parsing now guards key presence.
Fixes the exception when only one of tactic_id1/tactic_id2 is provided.
334-353
: Nice UX: skip with actionable dump when tactic lists are empty.
The per-GEMM listing helps users pick valid indices quickly.
415-427
: Helpful: per-GEMM tactic inventory printing.
Clear visibility into available tactics by GEMM. Good for debugging configs.
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (7)
373-375
: Reasonable default: don't force fused finalize outside K<3.
Keeps test surface predictable; explicit opt-in where needed.
459-460
: Good: tie runner's finalize flag to K and test override.
This matches production behavior where finalize fusion only applies for K<3 unless forced.
1090-1118
: Per-GEMM tactic fetch in tests looks correct.
The SM-specific filtering and GEMM-aware retrieval match the new API.
1123-1144
: Finalize awareness in tactic selection is well integrated.
Filtering GEMM_2 by EpilogueFusionType and falling back when not found is pragmatic.
1641-1645
: Determinism gating is sensible.
Only assert determinism when not using FINALIZE on SM90+ with K>=3.
1763-1766
: Explicit non-determinism test via fused finalize.
Clear intent for the PermuteNonDeterministic case.
2093-2151
: ConfigSweep updated to pairwise sweep per-GEMM configs.
Combinatorial coverage makes sense with the per-GEMM split. Good failure messaging.
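For intuition, the sweep boils down to a nested loop over the two tactic lists, roughly as sketched below (simplified types; the real test also reports the failing tactic strings):

```cpp
#include <cstddef>
#include <vector>

struct ConfigSketch
{
    int id = 0;
};

// Returns true only when every (gemm1, gemm2) tactic combination passes the supplied check.
template <typename RunFn>
bool sweepAllPairs(std::vector<ConfigSketch> const& configs1, std::vector<ConfigSketch> const& configs2, RunFn&& run)
{
    bool all_passed = true;
    for (std::size_t i = 0; i < configs1.size(); ++i)
    {
        for (std::size_t j = 0; j < configs2.size(); ++j)
        {
            // Keep iterating even after a failure so every failing pair gets reported.
            all_passed = run(configs1[i], configs2[j]) && all_passed;
        }
    }
    return all_passed;
}
```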
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
622-626
: Per-GEMM tactic listing with conditional finalize variants looks good.
Using runtime capability checks to gate FINALIZE copies is appropriate.
931-936
: GemmProfilerBackend now aliases GemmToProfile to MoeGemmId — consistent and cleaner.
This unifies the Gemm identifier across components.
1003-1004
: Good: cache two TMA inputs per sample.
The 2-wide inner array aligns with GEMM_1/GEMM_2 profiling.
1016-1018
: prepareTmaWsInputs now accepts EpilogueFusion — necessary for finalize-aware setup.
Signature change matches call sites and intent.
/bot run |
975f1c3 to a18eae9 (Compare)
PR_Github #15705 [ run ] triggered by Bot |
Some comments about the generating tactic part. LGTM.
Actionable comments posted: 0
♻️ Duplicate comments (8)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
628-633
: Fix static context: avoid referencing instance member in static getTactics(int, MoeGemmId)
Using decltype(moe_gemm_runner_) in a static method is ill-formed. Use the concrete runner type instead.
- static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id) - { - using RunnerType = decltype(moe_gemm_runner_); - return Self::addFinalizeFusionConfigs( - RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm)); - } + static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id) + { + using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>; + return Self::addFinalizeFusionConfigs( + RunnerType::getConfigs(sm), gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm)); + }
829-834
: Fix static context: avoid referencing instance member in mayHaveFinalizeFused(int sm)
Same issue as above; don't use decltype(moe_gemm_runner_) in a static method.
- static bool mayHaveFinalizeFused(int sm) - { - using RunnerType = decltype(moe_gemm_runner_); - return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise; - } + static bool mayHaveFinalizeFused(int sm) + { + using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>; + return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise; + }cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
949-951
: Guard deterministic tactic selection against empty tactic lists
Indexing [0] on the result of getTactics() can crash if no tactics are available (unsupported shape/quant/arch).
- gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0]; - gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0]; + auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1); + auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2); + TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE"); + gemm1 = t1.front(); + gemm2 = t2.front();cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
1079-1084
: Guard INFO logging against empty tactic lists
Avoid out-of-range access when printing selections.
- std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n" - << tactics1[tactic_idx1].toString() << std::endl; - std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n" - << tactics2[tactic_idx2].toString() << std::endl; + std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n"; + if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size())) { + std::cout << tactics1[tactic_idx1].toString() << std::endl; + } + std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n"; + if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size())) { + std::cout << tactics2[tactic_idx2].toString() << std::endl; + }cpp/tensorrt_llm/thop/moeOp.cpp (2)
630-638
: Harden runGemmProfile: check empty profiles and profile_id range
Avoid undefined behavior when no tactics are available and when profile_id is invalid.
- auto const gemm_to_profile - = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; - auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + auto const gemm_to_profile + = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; + auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_%d", int(gemm_idx)); @@ - auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id]; + const auto& profile = [&]() -> const Profile& + { + if (profile_id == -1) return profiles.front(); + TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()), + "profile_id out of range for GEMM_%d: %ld", int(gemm_idx), profile_id); + return profiles[profile_id]; + }();
740-749
: Harden setRunnerProfiles: guard empty lists and id ranges
Avoid front()/at() on empty vectors and out-of-range indices.
- auto best_gemm1_profile = mGemm1Profiles.front(); - auto best_gemm2_profile = mGemm2Profiles.front(); + TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available"); + auto best_gemm1_profile = mGemm1Profiles.front(); + auto best_gemm2_profile = mGemm2Profiles.front(); if (profile_ids.has_value()) { TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids"); - best_gemm1_profile - = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]); - best_gemm2_profile - = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]); + if (profile_ids.value()[0] != -1) + { + TORCH_CHECK(profile_ids.value()[0] >= 0 + && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()), + "profile_ids[0] out of range"); + best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]); + } + if (profile_ids.value()[1] != -1) + { + TORCH_CHECK(profile_ids.value()[1] >= 0 + && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()), + "profile_ids[1] out of range"); + best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]); + } }cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
536-544
: Help text fixed to remove unsupported 'objects' — thanks
This aligns with the current parser, preventing confusion and exceptions from outdated guidance.
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
84-87
: Return a concrete list and make gemm_idx optional for resilienceReturning a
range
while the signature advertisesList[int]
can surprise callers andkwargs["gemm_idx"]
will KeyError if omitted. Usekwargs.get
and wrap inlist
for type consistency and backward compatibility.Apply this diff:
- def get_valid_tactics(self, inputs: List[torch.Tensor], - profile: OptimizationProfile, **kwargs) -> List[int]: - return range(self.fused_moe_runner.get_tactic_num(kwargs["gemm_idx"])) + def get_valid_tactics(self, inputs: List[torch.Tensor], + profile: OptimizationProfile, **kwargs) -> List[int]: + gemm_idx = kwargs.get("gemm_idx", 0) + return list(range(self.fused_moe_runner.get_tactic_num(gemm_idx)))Additionally, confirm that the domain for
gemm_idx
matches the underlying implementation (0/1 vs. 1/2). If it expects MoeGemmId, keep passing the enum-compatible values.#!/bin/bash # Verify the expected domain/type of gemm_idx for FusedMoeRunner::get_tactic_num and run_gemm_profile. set -euo pipefail echo "Searching for FusedMoeRunner::get_tactic_num signature and usages..." rg -n -C2 -P '\b(get_tactic_num|run_gemm_profile)\s*\(' echo echo "Searching for MoeGemmId definition and values..." rg -n -C2 -P '\benum\s+class\s+MoeGemmId\b|MoeGemmId::GEMM_' echo echo "Searching Python/C++ bindings for FusedMoeRunner to see the expected arg type..." fd -t f | rg -n -C3 -P 'FusedMoeRunner|get_tactic_num|run_gemm_profile|MoeGemmId'
🧹 Nitpick comments (10)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
2850-2854
: Good gating for finalize scales; make the local flagconst
The allocation of
permuted_token_final_scales_
only when GEMM2 uses FINALIZE is correct and avoids dangling pointers when FINALIZE is off.Minor style nit: declare the flag as const for clarity and consistency with nearby
bool const
locals.- bool gemm2_using_finalize_fusion + bool const gemm2_using_finalize_fusion = gemm2_config_->epilogue_fusion_type == cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE; permuted_token_final_scales_ = gemm2_using_finalize_fusion ? getWsPtr(float{}, "permuted_token_final_scales") : nullptr;
4010-4016
: Tactic-driven finalize fusion gating looks correct; preferconst
localsThe alignment check between tactic request and runtime gating is a good fast-fail.
Nit: mark both locals
const
to match surrounding style.- bool gemm2_using_finalize_fusion = gemm2_config_->epilogue_fusion_type + bool const gemm2_using_finalize_fusion = gemm2_config_->epilogue_fusion_type == cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE; - bool using_fused_finalize + bool const using_fused_finalize = use_fused_finalize_ && gemm2_using_finalize_fusion && !use_w4_groupwise && !use_lora;cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
603-621
: Finalize-fusion tactic derivation helper reads correctly
- Taking configs by rvalue-ref avoids the temporary-binding issue and is efficient.
- Appending FINALIZE epilogue only for TMA WS configs is correct.
- Minor nit: consider returning an explicit std::vector<…> instead of auto for readability in a header.
-static auto addFinalizeFusionConfigs( +static std::vector<cutlass_extensions::CutlassGemmConfig> addFinalizeFusionConfigs( std::vector<cutlass_extensions::CutlassGemmConfig>&& configs, bool use_fused_finalize)
17-17
: Adopt include guards per repository guidelinesHeaders should use TRTLLM__H include guards instead of #pragma once.
-#pragma once +#ifndef TRTLLM_MOE_KERNELS_H +#define TRTLLM_MOE_KERNELS_HAdd this at end of file (outside the selected range):
#endif // TRTLLM_MOE_KERNELS_H
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
970-979
: Optional: add a bounds check before indexing tactics in runMoEPermuteWhile setTactic pre-validates, a defensive check here would prevent accidental misuse if runMoEPermute is called without prior setTactic.
- auto tactics = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile))[tactic_idx]; + auto all = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile)); + if (tactic_idx < 0 || tactic_idx >= static_cast<int>(all.size())) { + throw std::runtime_error("Invalid tactic index for runMoEPermute"); + } + auto tactics = all[tactic_idx];cpp/tensorrt_llm/thop/moeOp.cpp (2)
603-606
: Validate gemm_idx in runGemmProfileFail fast on invalid gemm_idx to avoid silently selecting the wrong set.
{ std::lock_guard<std::mutex> lock(mMutex); + TORCH_CHECK(gemm_idx == 1 || gemm_idx == 2, "gemm_idx must be 1 or 2"); // TODO: support profiling under fp8 block scaling in the future
268-299
: Remove duplicated bias validation blockThe bias checks are duplicated back‑to‑back; keep a single block to reduce noise.
- if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value()) - { - CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype); - CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype); - TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D."); - TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D."); - TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0], - "fc1_expert_weights and fc1_expert_biases must have the same number of experts."); - TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0], - "fc2_expert_weights and fc2_expert_biases must have the same number of experts."); - TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1], - "fc1_expert_biases should match fc1_expert_weights output shape."); - TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1], - "fc2_expert_biases should match fc2_expert_weights output shape."); - } - - if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value()) - { - CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype); - CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype); - TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D."); - TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D."); - TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0], - "fc1_expert_weights and fc1_expert_biases must have the same number of experts."); - TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0], - "fc2_expert_weights and fc2_expert_biases must have the same number of experts."); - TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1], - "fc1_expert_biases should match fc1_expert_weights output shape."); - TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1], - "fc2_expert_biases should match fc2_expert_weights output shape."); - } + if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value()) + { + CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype); + CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype); + TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D."); + TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D."); + TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0], + "fc1_expert_weights and fc1_expert_biases must have the same number of experts."); + TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0], + "fc2_expert_weights and fc2_expert_biases must have the same number of experts."); + TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1], + "fc1_expert_biases should match fc1_expert_weights output shape."); + TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1], + "fc2_expert_biases should match fc2_expert_weights output shape."); + }cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
53-86
: Minor cleanup: redundant assert on string check
else if (tactic.is_string())
makes theassert(tactic.is_string())
immediately below redundant. Safe to remove for cleanliness.Apply this small diff:
- else if (tactic.is_string()) - { - assert(tactic.is_string()); + else if (tactic.is_string()) + { auto tactic_name = tactic.get<std::string>();
278-285
: Quality of life: default missing tactic_idX to 'auto'When only one of
tactic_id1
/tactic_id2
is provided, defaulting the other to-1
("auto"
) avoids skipping the benchmark later and mirrors user intent.Apply this diff:
- if (run_config.contains("tactic_id1")) - { - parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1); - } - if (run_config.contains("tactic_id2")) - { - parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2); - } + if (run_config.contains("tactic_id1")) + { + parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1); + } + else + { + tactic_ids1 = {-1}; // default to "auto" + } + if (run_config.contains("tactic_id2")) + { + parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2); + } + else + { + tactic_ids2 = {-1}; // default to "auto" + }
388-389
: Update stale comment example to reflect per‑GEMM APIThe comment references the old listAllTactics signature. Adjust for clarity.
- auto cutlass_tactic = {-1}; // {0,..., listAllTactics<BenchClass>(MoeGemmId).size()}; + auto cutlass_tactic = {-1}; // {0,..., listAllTactics<BenchClass>(MoeGemmId::GEMM_1).size()};
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
- tensorrt_llm/_torch/autotuner.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6 hunks)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5 hunks)
- tests/unittest/_torch/test_autotuner.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
- tensorrt_llm/_torch/autotuner.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tests/unittest/_torch/test_autotuner.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tests/unittest/_torch/test_autotuner.py
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-19T03:35:20.797Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.797Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (26)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
4566-4576: Early-return gating for FINALIZE resolves prior compile hazard
Precomputing the groupwise flags and gating out unsupported FINALIZE conditions before use looks good and addresses the earlier “used before declaration” issue.
4616-4628: Deliberate TMA WS reuse per fusion-type; cache selection is sound
Using `mTmaInputCache[i][use_finalize_fusion]` and reusing the same workspace region per the design comment is aligned with the stated memory-saving approach. Selecting the cache element for GEMM1/GEMM2 based on `mGemmToProfile` is correct.
4634-4638: Finalize params for profiler path look correct
Enabling FINALIZE for GEMM2 and calling `setFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, mK > 1)` matches the expected “rows=num_tokens, cols=hidden” semantics in this path.
4675-4679: Preparing both NONE and FINALIZE variants up-front is appropriate
Seeding templates for both fusion types during prepare() aligns with the per-tactic selection in `runProfiler` and avoids runtime initialization hiccups.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
231-236: Per‑GEMM selector enum looks good
MoeGemmId is well‑scoped and follows naming conventions. It cleanly enables per‑GEMM tactic selection downstream.
622-627: Instance getTactics uses the new gemm_id correctly
Correctly composes configs and gates FINALIZE only for GEMM_2 when supported.
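To make the gating concrete, here is a minimal, self-contained C++ sketch of the per-GEMM selection idea discussed above. The `Config` struct, the enumerator values, and the `getConfigs`/`mayHaveFinalizeFused` bodies are simplified stand-ins invented for illustration (the real code returns `cutlass_extensions::CutlassGemmConfig` from the runner's `getConfigs(sm, supports_finalize_fusion)`), so read it as a model of the logic rather than the actual API.

```cpp
#include <vector>

// Simplified stand-in for cutlass_extensions::CutlassGemmConfig.
struct Config
{
    bool finalizeFused = false;
};

enum class MoeGemmId : int
{
    GEMM_1 = 0,
    GEMM_2 = 1
};

// Stand-in for the runner-level query: plain tactics always, fused variants only on request.
std::vector<Config> getConfigs(int /*sm*/, bool supportsFinalizeFusion)
{
    std::vector<Config> configs = {{false}, {false}};
    if (supportsFinalizeFusion)
    {
        configs.push_back({true});
    }
    return configs;
}

// Stand-in for the "can this arch fuse finalize at all" check.
bool mayHaveFinalizeFused(int sm)
{
    return sm >= 90;
}

// Per-GEMM tactic list: only GEMM_2 is allowed to request finalize-fused tactics.
std::vector<Config> getTactics(int sm, MoeGemmId gemmId)
{
    return getConfigs(sm, gemmId == MoeGemmId::GEMM_2 && mayHaveFinalizeFused(sm));
}

int main()
{
    return getTactics(90, MoeGemmId::GEMM_1).size() == 2
            && getTactics(90, MoeGemmId::GEMM_2).size() == 3
        ? 0
        : 1;
}
```

The key point the sketch shows is that the finalize-fusion flag is derived from the GEMM id, so GEMM_1 never advertises FINALIZE-tagged tactics.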
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
1279-1282: Profiler tactic retrieval aligns with per‑GEMM API
Returning the per‑GEMM tactics via `mGemmToProfile` is correct, keeping the benchmark backend consistent with the new interface.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)
836-846: Per‑GEMM pickBestTactic path is correct
Casting GemmToProfile::GEMM_1/2 to MoeGemmId and profiling within that scope is sound.
928-953: setTactic: selects per‑GEMM tactics correctly and validates indices
The two‑vector approach and bounds checks before setTactic are appropriate.
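As an illustration of the bounds-checked, per-GEMM selection pattern, here is a small hedged sketch. The `Tactic` type and the `pickTactic` helper are hypothetical simplifications (the benchmark fixture works with `CutlassGemmConfig` vectors and `setTactic`), intended only to show how an empty list or out-of-range index can degrade to "no tactic" instead of undefined behavior.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Simplified tactic type; the real code passes cutlass_extensions::CutlassGemmConfig.
struct Tactic
{
    int id;
};

// Pick a tactic for one GEMM, or nullopt when the list is empty / index is out of range
// (e.g. when only the other GEMM is being profiled).
std::optional<Tactic> pickTactic(std::vector<Tactic> const& tactics, int idx)
{
    if (idx < 0 || static_cast<std::size_t>(idx) >= tactics.size())
    {
        return std::nullopt;
    }
    return tactics[idx];
}

int main()
{
    std::vector<Tactic> const gemm1 = {{0}, {1}};
    std::vector<Tactic> const gemm2 = {};   // unprofiled GEMM may have no tactics
    auto const t1 = pickTactic(gemm1, 1);
    auto const t2 = pickTactic(gemm2, 0);   // safely yields nullopt instead of UB
    return t1.has_value() && !t2.has_value() ? 0 : 1;
}
```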
tests/unittest/_torch/test_autotuner.py (1)
92-98: Adding kwargs to get_valid_tactics signatures is appropriate
Keeps the runner/test interfaces compatible with callers forwarding extra context (e.g., GEMM id) without altering behavior.
Also applies to: 153-156, 229-231, 317-320
cpp/tensorrt_llm/thop/moeOp.cpp (3)
51-52: MoeGemmId aliasing improves clarity and avoids coupling to namespace qualifiers
Good addition; downstream usage reads cleanly.
219-221: Eagerly caching per‑GEMM tactic vectors is sound
Fetching tactics for GEMM_1 and GEMM_2 up front simplifies subsequent indexing and avoids repeated queries.
590-595: getTacticNum: input validation added and return type widened
Validating gemm_idx and returning sizes as int64_t is correct and robust.
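A minimal sketch of the validate-then-count pattern described above, assuming a simplified `Profile` type and class layout; the real op caches the GEMM_1/GEMM_2 tactic vectors it queried from the kernel runner and exposes their sizes as `int64_t` through the Torch binding.

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Simplified stand-in for a GEMM tactic/profile entry.
struct Profile
{
    int id;
};

class MoeOpSketch
{
public:
    MoeOpSketch(std::vector<Profile> gemm1, std::vector<Profile> gemm2)
        : mGemm1Profiles(std::move(gemm1))
        , mGemm2Profiles(std::move(gemm2))
    {
    }

    // Reject anything other than GEMM 1/2, then report the cached count widened to int64_t.
    int64_t getTacticNum(int64_t gemmIdx) const
    {
        if (gemmIdx != 1 && gemmIdx != 2)
        {
            throw std::invalid_argument("gemm_idx must be 1 or 2");
        }
        auto const& profiles = (gemmIdx == 1) ? mGemm1Profiles : mGemm2Profiles;
        return static_cast<int64_t>(profiles.size());
    }

private:
    // Tactic vectors fetched once for GEMM_1 and GEMM_2 and reused afterwards.
    std::vector<Profile> mGemm1Profiles;
    std::vector<Profile> mGemm2Profiles;
};

int main()
{
    MoeOpSketch op({{0}, {1}, {2}}, {{0}});
    return op.getTacticNum(1) == 3 && op.getTacticNum(2) == 1 ? 0 : 1;
}
```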
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5)
125-126: Signature extended with kwargs — aligns with per‑GEMM autotuner flow
Accepting `**kwargs` keeps this runner compatible with callers forwarding extra context such as `gemm_idx`. Ignoring kwargs here is fine since the underlying kernel API doesn’t take gemm id yet.
409-410: Signature extended with kwargs — aligns with per‑GEMM autotuner flow
Same note as above; forwarding compatibility without behavior change is good.
667-668: Signature extended with kwargs — aligns with per‑GEMM autotuner flow
Looks good; no behavior change and keeps the interface consistent across runners.
901-902: Signature extended with kwargs — aligns with per‑GEMM autotuner flow
Good consistency. No further action needed here.
1114-1116: Signature extended with kwargs — aligns with per‑GEMM autotuner flow
All set; consistent with the rest of the file and upstream autotuner changes.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
45-50: Confirm RunnerType deduction is valid for BenchClass::mMoERunner
Using `decltype(BenchClass::mMoERunner)` assumes `mMoERunner` is a type alias or a member whose declared type is the desired runner type. If it’s a non-static data member, depending on definitions this can be brittle across compilers. Consider explicitly naming the runner type (if available) or requiring `BenchClass` to provide a `using RunnerType = ...;` alias to make intent clear.
If `BenchClass` can expose a concrete alias, a safer pattern is:
-    using RunnerType = decltype(BenchClass::mMoERunner);
+    using RunnerType = typename BenchClass::RunnerType;
     return RunnerType::getTactics(sm, gemm_id);
Otherwise, verify that `decltype(BenchClass::mMoERunner)` resolves to the correct type on all supported toolchains.
334-353: Helpful diagnostics when no tactics — nice touch
Clear, one-time messages per run and listing per‑GEMM available tactics improves UX for misconfigured cases.
415-427: Verbose per‑GEMM tactic listing is clear and well‑structured
The per‑GEMM breakdown with numbered tactics is easy to scan and consistent with the new API.
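For reference, a tiny stand-alone sketch of the per-GEMM listing idea; the `Tactic` struct and labels are invented placeholders (the benchmark prints `CutlassGemmConfig::toString()`), but the index-plus-description layout mirrors what makes the verbose output easy to map back to tactic ids.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Simplified tactic with a toString()-style description, standing in for CutlassGemmConfig.
struct Tactic
{
    std::string description;
};

// Print the available tactics for one GEMM with their indices, so a user can map a
// numeric tactic id from the benchmark configuration back to a concrete configuration.
void printTactics(std::string const& gemmLabel, std::vector<Tactic> const& tactics)
{
    std::cout << gemmLabel << ": " << tactics.size() << " tactics\n";
    for (std::size_t i = 0; i < tactics.size(); ++i)
    {
        std::cout << "  [" << i << "] " << tactics[i].description << "\n";
    }
}

int main()
{
    printTactics("GEMM 1", {{"tile 128x128, no fusion"}, {"tile 128x256, no fusion"}});
    printTactics("GEMM 2", {{"tile 128x128, no fusion"}, {"tile 128x128, finalize fused"}});
    return 0;
}
```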
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)
318-321: LGTM: returns a concrete list of tactic indices
Consistent with the type hint and other runners.
400-403: LGTM: returns a concrete list of tactic indices
Matches the declared return type and avoids iterator surprises.
512-525: LGTM: defers to kernel for valid configs with shape‑aware filtering
Interface matches the rest of the repo; no issues spotted.
726-729: LGTM: concrete list and stable interface
Consistent across runners; good.
801-805: LGTM: conforms to List[int] and aligns with tuning flow
No concerns here.
/bot run --disable-fail-fast
PR_Github #15715 [ run ] triggered by Bot
PR_Github #15705 [ run ] completed with state
Actionable comments posted: 0
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
608-612: Compilation issue already addressed in past review.
This is the same issue identified in the past review comments - using `decltype(moe_gemm_runner_)` in a static method. The fix was already suggested to use the concrete type directly.
808-812: Compilation issue already addressed in past review.
Same static context issue as above - this was already flagged in past review comments.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
831-833: Consider documenting the workspace size calculation behavior.
The call to `getTmaWarpSpecializedConfigs(sm_, true)` always requests configs with finalize fusion support for workspace sizing, even if finalize fusion won't be used. This ensures adequate workspace allocation for all possible configurations.
Consider adding a comment to clarify this intentional behavior:
 size_t MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>::calcMaxWorkspaceSize(int num_experts) const
 {
     // ...
-    // Finalize fusion may not actually be supported by the kernel,
-    // if they are not we will catch the error and skip them
+    // Always request configs with finalize fusion support when calculating workspace size
+    // to ensure adequate allocation for all possible configurations.
+    // Finalize fusion may not actually be supported by the kernel,
+    // if they are not we will catch the error and skip them
     auto configs = getTmaWarpSpecializedConfigs(sm_, true);
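To illustrate the sizing behavior being documented, here is a hedged, self-contained sketch of the "take the maximum over all candidate configs and skip unsupported ones" approach. The `GemmConfig` fields and the `workspaceSizeFor` helper are assumptions made up for the example, not the real kernel interfaces.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Simplified config; the real sizing walks the TMA warp-specialized configs, including
// the FINALIZE-tagged duplicates, and skips configs the kernel cannot actually run.
struct GemmConfig
{
    bool finalizeFused = false;
    std::size_t workspaceBytes = 0;
};

// Stand-in for querying one config's workspace requirement; throws for unsupported combos.
std::size_t workspaceSizeFor(GemmConfig const& config, bool finalizeSupportedByKernel)
{
    if (config.finalizeFused && !finalizeSupportedByKernel)
    {
        throw std::runtime_error("finalize fusion not supported for this kernel");
    }
    return config.workspaceBytes;
}

// Take the maximum over all candidate configs so a single allocation covers whichever
// tactic is selected later; unsupported configs are caught and skipped rather than fatal.
std::size_t calcMaxWorkspaceSize(std::vector<GemmConfig> const& configs, bool finalizeSupportedByKernel)
{
    std::size_t maxBytes = 0;
    for (auto const& config : configs)
    {
        try
        {
            maxBytes = std::max(maxBytes, workspaceSizeFor(config, finalizeSupportedByKernel));
        }
        catch (std::runtime_error const&)
        {
            // Config cannot run here; it simply does not contribute to the bound.
        }
    }
    return maxBytes;
}

int main()
{
    std::vector<GemmConfig> const configs = {{false, 1024}, {true, 4096}};
    return calcMaxWorkspaceSize(configs, /* finalizeSupportedByKernel = */ false) == 1024 ? 0 : 1;
}
```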
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-19T03:35:20.797Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.797Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (2)
292-295: LGTM!
The API evolution to support finalize fusion is well-designed. The changes appropriately propagate the `supports_finalize_fusion` flag through the config generation paths.
I appreciate the learning from past MOE reviews about avoiding excessive finalize fusion validation checks - the implementation here keeps it clean and focused.
300-304: LGTM!
Good refactoring to delegate to the static method, maintaining backward compatibility while centralizing the logic.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (4)
474-478: LGTM!
The refactoring to pass the `supports_finalize_fusion` flag through to the static method is clean and consistent with the architectural changes.
483-489: LGTM!
The signature update to accept `supports_finalize_fusion` and its propagation to `getTmaWarpSpecializedConfigs` is consistent with the per-GEMM tactic selection design.
521-571: Effective finalize fusion config duplication logic!
The implementation correctly:
- Gates finalize fusion configs based on the `supports_finalize_fusion` flag
- Creates duplicates with `epilogue_fusion_type` set to `FINALIZE`
- Appends them to the original configs
This approach allows profiling both fusion and non-fusion variants as discussed in the past review comments.
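A minimal sketch of the duplication step summarized above, assuming a simplified `GemmConfig` with an `epilogueFusionType` field (the real type is `cutlass_extensions::CutlassGemmConfig` with its `epilogue_fusion_type` member); it only models the gate-duplicate-append flow.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for CutlassGemmConfig and its epilogue fusion tag.
enum class EpilogueFusion
{
    NONE,
    FINALIZE
};

struct GemmConfig
{
    int tile;                                                   // placeholder for tile/cluster/stage choices
    EpilogueFusion epilogueFusionType = EpilogueFusion::NONE;   // which epilogue variant this tactic uses
};

// Duplicate every base config with the FINALIZE tag when the caller supports it,
// so fused and unfused variants show up as distinct tactics for profiling.
std::vector<GemmConfig> withFinalizeDuplicates(std::vector<GemmConfig> configs, bool supportsFinalizeFusion)
{
    if (supportsFinalizeFusion)
    {
        std::size_t const numBase = configs.size();
        configs.reserve(numBase * 2);
        for (std::size_t i = 0; i < numBase; ++i)
        {
            GemmConfig fused = configs[i];
            fused.epilogueFusionType = EpilogueFusion::FINALIZE;
            configs.push_back(fused);
        }
    }
    return configs;
}

int main()
{
    std::vector<GemmConfig> const base = {{0}, {1}, {2}};
    auto const all = withFinalizeDuplicates(base, /* supportsFinalizeFusion = */ true);
    return all.size() == 6 ? 0 : 1;
}
```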
581-587: LGTM!
The static method for checking TMA warp specialized support across different SM architectures is well-structured and comprehensive:
- SM90 (Hopper)
- SM100-119 (Blackwell)
- SM120/121 (SM120 series)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (6)
231-236: LGTM!
The `MoeGemmId` enum is well-placed near `MOEParallelismConfig` for cross-module visibility and provides clear semantic identifiers for the different GEMMs in the MOE pipeline.
456-456: LGTM!
The API update to accept `MoeGemmId` enables per-GEMM tactic selection as intended.
603-606: LGTM!
The implementation correctly gates finalize fusion configs for GEMM_2 only, as expected for the second GEMM in the MOE pipeline.
911-911: LGTM!
The type alias simplifies the code and avoids redundant enum definitions.
982-982: LGTM!
The 2D array structure for `mTmaInputCache` appropriately handles both fusion variants (NONE and FINALIZE) for each routing sample.
995-996: LGTM!
The updated `prepareTmaWsInputs` signature with the `fusion` parameter enables proper preparation of workspaces for different epilogue fusion types.
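A small illustrative model of the 2D cache plus up-front preparation described in these comments. The array sizes, the `TmaTemplate` struct, and the function names are invented for the sketch; the real profiler keys `mTmaInputCache` by routing sample and epilogue fusion variant and asserts validity before use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simplified model of the profiler's TMA workspace template cache: one entry per routing
// sample, and within each sample one template per epilogue fusion variant (NONE, FINALIZE).
enum class EpilogueFusion : std::size_t
{
    NONE = 0,
    FINALIZE = 1
};

struct TmaTemplate
{
    bool valid = false;
};

constexpr std::size_t kNUM_SAMPLES = 4;
constexpr std::size_t kNUM_FUSION_VARIANTS = 2;

using TmaInputCache = std::array<std::array<TmaTemplate, kNUM_FUSION_VARIANTS>, kNUM_SAMPLES>;

// Prepare both variants up front so switching tactics during profiling never has to
// re-initialize templates on the fly.
void prepareAll(TmaInputCache& cache)
{
    for (auto& sample : cache)
    {
        sample[static_cast<std::size_t>(EpilogueFusion::NONE)].valid = true;
        sample[static_cast<std::size_t>(EpilogueFusion::FINALIZE)].valid = true;
    }
}

// Look up the template that matches the fusion type requested by the selected tactic.
TmaTemplate const& select(TmaInputCache const& cache, std::size_t sample, EpilogueFusion fusion)
{
    TmaTemplate const& entry = cache[sample][static_cast<std::size_t>(fusion)];
    assert(entry.valid && "template must be prepared before running the profiler");
    return entry;
}

int main()
{
    TmaInputCache cache{};
    prepareAll(cache);
    return select(cache, 0, EpilogueFusion::FINALIZE).valid ? 0 : 1;
}
```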
PR_Github #15715 [ run ] completed with state
f6668fc to 513c6dd (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
691-701: NameError in fake kernel: act_fp8 referenced but parameter is act_fp4.
This will raise at shape-prop time. Rename the parameter to act_fp8 (or use act_fp4 consistently).
Apply this diff:
-@w4a8_mxfp4_fp8_gemm.register_fake
-def _(
-    act_fp4: torch.Tensor,
+@w4a8_mxfp4_fp8_gemm.register_fake
+def _(
+    act_fp8: torch.Tensor,
     weight: torch.Tensor,
     act_sf: torch.Tensor,
     weight_scale: torch.Tensor,
     alpha: torch.Tensor,
     output_dtype: torch.dtype,
     to_userbuffers: bool = False,
 ) -> torch.Tensor:
-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
+    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
                              dtype=output_dtype)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
928-956: Don’t index into empty tactics for unprofiled GEMM in setTactic()
When profiling a single GEMM, the other GEMM may legitimately have zero valid tactics (dtype/arch dependent). Unconditionally indexing `tactics2[tactic_idx2]` can go out-of-bounds. Pass nullopt for the unprofiled/empty side.
Apply this diff:
-    auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
-    auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
+    auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
+    auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
@@
-    mMoERunner.setTactic(tactics1[tactic_idx1], tactics2[tactic_idx2]);
+    // For GEMM-specific profiling, the unprofiled GEMM can have no valid tactics.
+    // Guard and pass nullopt to avoid OOB.
+    std::optional<tensorrt_llm::cutlass_extensions::CutlassGemmConfig> opt1;
+    std::optional<tensorrt_llm::cutlass_extensions::CutlassGemmConfig> opt2;
+    if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size()))
+    {
+        opt1 = tactics1[tactic_idx1];
+    }
+    if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size()))
+    {
+        opt2 = tactics2[tactic_idx2];
+    }
+    mMoERunner.setTactic(opt1, opt2);
cpp/tensorrt_llm/thop/moeOp.cpp (3)
60-96
: Compile error: returning unique_ptr but storing into shared_ptr
switch_output_type
returnsstd::unique_ptr
, butmKernelRunner
is astd::shared_ptr
. This won’t compile at call sites. Make the factory return astd::shared_ptr
and construct withmake_shared
.Apply this diff:
- std::unique_ptr<kernels::CutlassMoeFCRunnerInterface> switch_output_type(c10::ScalarType output_type) + std::shared_ptr<kernels::CutlassMoeFCRunnerInterface> switch_output_type(c10::ScalarType output_type) { switch (output_type) { case c10::ScalarType::Long: // INT64 == FP4 case c10::ScalarType::Float8_e4m3fn: // TODO We need an atomic FP8 reduction for the finalize fusions C10_THROW_ERROR_FORMATTED(NotImplementedError, "Outputting " << torch::toString(output_type) << " directly is not currently supported"); // return std::make_unique<kernels::CutlassMoeFCRunner<Type, Type>>(); case c10::ScalarType::Half: if constexpr (NeedQuant) { - return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, half>>(); + return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, half>>(); } else { - return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, TypeAct>>(); + return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, half, TypeAct>>(); } } #ifdef ENABLE_BF16 case c10::ScalarType::BFloat16: if constexpr (NeedQuant) { - return std::make_unique< + return std::make_shared< kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, __nv_bfloat16, __nv_bfloat16>>(); } else { - return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, __nv_bfloat16, TypeAct>>(); + return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, TypeWeight, __nv_bfloat16, TypeAct>>(); } #endif default: C10_THROW_ERROR_FORMATTED(Error, "Invalid output type " << torch::toString(output_type) << " specified for " << torch::toString(mActivationDtype)); } };
98-120
: Compile error: factory returns unique_ptr but assigned to shared_ptrSame issue for the weight-quant path. Return
shared_ptr
and usemake_shared
.Apply this diff:
- std::unique_ptr<kernels::CutlassMoeFCRunnerInterface> create_weight_quant_runner() + std::shared_ptr<kernels::CutlassMoeFCRunnerInterface> create_weight_quant_runner() { if (isInt8Quant()) { - return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, uint8_t>>(); + return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, uint8_t>>(); } else if (isInt4Quant()) { #ifdef ENABLE_FP8 if (mUseW4GroupScaling) { - return std::make_unique< + return std::make_shared< kernels::CutlassMoeFCRunner<__nv_fp8_e4m3, cutlass::uint4b_t, TypeAct, TypeAct>>(); } #endif - return std::make_unique<kernels::CutlassMoeFCRunner<TypeAct, cutlass::uint4b_t>>(); + return std::make_shared<kernels::CutlassMoeFCRunner<TypeAct, cutlass::uint4b_t>>(); } else { C10_THROW_ERROR_FORMATTED(Error, "Unsupported weight quantization type"); } }
147-159
: Compile error: assigning unique_ptr to shared_ptrThis line uses
make_unique
whilemKernelRunner
is ashared_ptr
. Also, after changing the factories to returnshared_ptr
, the FP8 path assignment becomes correct.Apply this diff:
- else if (mActivationDtype == c10::ScalarType::BFloat16 && mWeightDtype == c10::ScalarType::Float8_e4m3fn) - { - mKernelRunner = std::make_unique<kernels::CutlassMoeFCRunner<__nv_bfloat16, __nv_fp8_e4m3>>(); - } + else if (mActivationDtype == c10::ScalarType::BFloat16 && mWeightDtype == c10::ScalarType::Float8_e4m3fn) + { + mKernelRunner = std::make_shared<kernels::CutlassMoeFCRunner<__nv_bfloat16, __nv_fp8_e4m3>>(); + }
♻️ Duplicate comments (9)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
84-86
: Handle missing gemm_idx and return a concrete List[int] (not range).
Current implementation assumes kwargs["gemm_idx"] exists and returns a range, which violates the List[int] return type and can raise KeyError. Make gemm_idx optional and coerce to list for type consistency.
Apply this diff:
-    def get_valid_tactics(self, inputs: List[torch.Tensor],
-                          profile: OptimizationProfile, **kwargs) -> List[int]:
-        return range(self.fused_moe_runner.get_tactic_num(kwargs["gemm_idx"]))
+    def get_valid_tactics(self, inputs: List[torch.Tensor],
+                          profile: OptimizationProfile, **kwargs) -> List[int]:
+        gemm_idx = kwargs.get("gemm_idx", 0)
+        return list(range(self.fused_moe_runner.get_tactic_num(gemm_idx)))
To ensure the default index is correct, verify whether 0 is a valid sentinel (e.g., Undefined) or if indexing starts at 1:
#!/bin/bash
# Inspect uses/definitions to confirm valid gemm index domain
rg -nP 'MoeGemmId' -C3
rg -nP 'get_tactic_num\s*\(' -C2
rg -nP 'getTacticNum\s*\(' -C2
rg -nP 'gemm_idx\s*=\s*(1|2)\b' -C2
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
949-951
: Avoid out-of-range access when deterministic tactics are requested
Indexing [0] without guarding can crash if no tactics are available (e.g., unsupported shapes/dtypes/arch or misconfig). Guard and use front() after validating non-emptiness.
Apply this diff:
-    gemm1 = mMOERunner->getTactics(MoeGemmId::GEMM_1)[0];
-    gemm2 = mMOERunner->getTactics(MoeGemmId::GEMM_2)[0];
+    auto t1 = mMOERunner->getTactics(MoeGemmId::GEMM_1);
+    auto t2 = mMOERunner->getTactics(MoeGemmId::GEMM_2);
+    TLLM_CHECK_WITH_INFO(!t1.empty() && !t2.empty(), "No available tactics for deterministic MOE");
+    gemm1 = t1.front();
+    gemm2 = t2.front();
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
1079-1084
: Guard logging against empty tactic lists in GEMM-specific mode
When only one GEMM is profiled, the other may have an empty tactic list. Logging unconditionally indexes both vectors and can go OOB.
Apply this diff:
-    auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
-    auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
-    std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n"
-              << tactics1[tactic_idx1].toString() << std::endl;
-    std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n"
-              << tactics2[tactic_idx2].toString() << std::endl;
+    auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1);
+    auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2);
+    if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_1)
+    {
+        std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n";
+        if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size()))
+        {
+            std::cout << tactics1[tactic_idx1].toString() << std::endl;
+        }
+    }
+    if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_2)
+    {
+        std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n";
+        if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size()))
+        {
+            std::cout << tactics2[tactic_idx2].toString() << std::endl;
+        }
+    }
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
536-545
: Help text: arrays can also contain the strings “auto”/“all” (not only integers)
The parser accepts arrays containing integers or strings. Update the help text accordingly to avoid user confusion.
Apply this diff:
- " - An array: of integers, forms a list of tactics to sweep\n"
+ " - An array: of integers or strings, forms a list of tactics to sweep\n"
45-50
: Fix: use the concrete runner type instead of decltype on a (likely) data member
decltype(BenchClass::mMoERunner)
is unreliable here and won’t provide a type with the required staticgetTactics(int, MoeGemmId)
API. Use the concrete runner type derived from BenchClass’ template parameters.Apply this diff:
template <class BenchClass> auto listAllTactics(MoeGemmId gemm_id) { int const sm = getSMVersion(); - using RunnerType = decltype(BenchClass::mMoERunner); - return RunnerType::getTactics(sm, gemm_id); + using RunnerType = tensorrt_llm::kernels::cutlass_kernels::CutlassMoeFCRunner< + typename BenchClass::DataType, + typename BenchClass::WeightType, + typename BenchClass::OutputType, + typename BenchClass::InputType>; + return RunnerType::getTactics(sm, gemm_id); }cpp/tensorrt_llm/thop/moeOp.cpp (2)
630-638
: Guard empty profiles and out-of-range profile_id in runGemmProfile
profiles.front()
andprofiles[profile_id]
can throw if the list is empty or the id is out of range. Add checks to fail fast with a clear message.Apply this diff:
- auto const gemm_to_profile - = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; - auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + auto const gemm_to_profile + = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; + auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_%d", int(gemm_idx)); @@ - auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id]; + const auto& profile = [&]() -> const Profile& + { + if (profile_id == -1) return profiles.front(); + TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()), + "profile_id out of range for GEMM_%d: %ld", int(gemm_idx), profile_id); + return profiles[profile_id]; + }();
740-749
: Guard empty tactic lists and validate profile_ids in setRunnerProfilesAvoid undefined behavior on
.front()
and.at()
when no tactics are available or ids are invalid.Apply this diff:
- auto best_gemm1_profile = mGemm1Profiles.front(); - auto best_gemm2_profile = mGemm2Profiles.front(); + TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available"); + auto best_gemm1_profile = mGemm1Profiles.front(); + auto best_gemm2_profile = mGemm2Profiles.front(); if (profile_ids.has_value()) { TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids"); - best_gemm1_profile - = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]); - best_gemm2_profile - = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]); + if (profile_ids.value()[0] != -1) + { + TORCH_CHECK(profile_ids.value()[0] >= 0 + && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()), + "profile_ids[0] out of range"); + best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]); + } + if (profile_ids.value()[1] != -1) + { + TORCH_CHECK(profile_ids.value()[1] >= 0 + && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()), + "profile_ids[1] out of range"); + best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]); + } }cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
608-612
: Fix static context: avoid referencing instance member in static method
decltype(moe_gemm_runner_)
references a non‑static member in a static method and will not compile. Use the known runner type directly.Apply this diff:
- static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id) - { - using RunnerType = decltype(moe_gemm_runner_); - return RunnerType::getConfigs(sm, gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm)); - } + static std::vector<cutlass_extensions::CutlassGemmConfig> getTactics(int sm, MoeGemmId gemm_id) + { + using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>; + return RunnerType::getConfigs(sm, gemm_id == MoeGemmId::GEMM_2 && Self::mayHaveFinalizeFused(sm)); + }
808-812
: Fix static context (same issue) in mayHaveFinalizeFused(int sm)Avoid
decltype(moe_gemm_runner_)
inside a static method; use the runner type.Apply this diff:
- static bool mayHaveFinalizeFused(int sm) - { - using RunnerType = decltype(moe_gemm_runner_); - return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise; - } + static bool mayHaveFinalizeFused(int sm) + { + using RunnerType = MoeGemmRunner<T, WeightType, OutputType, ScaleBiasType>; + return RunnerType::supportsTmaWarpSpecialized(sm) && sm >= 90 && !use_w4_groupwise; + }
🧹 Nitpick comments (2)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
334-353
: Optional: default missing tactic lists to “auto” instead of skipping the benchmarkSkipping entirely when either tactic list is empty can be inconvenient (e.g., when a user only provides one side). Consider defaulting the missing side to
-1
(“auto”) and still proceed, while printing the same warning once.Apply this diff:
- if (tactic_ids1.empty() || tactic_ids2.empty()) - { - std::cerr << "Warning: Skipping benchmark, no valid tactic found" << std::endl; - static bool printed = false; - if (!printed) - { - printed = true; - std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n"; - for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2}) - { - std::cerr << "GEMM " << (int) gemm_id << ":\n"; - auto confs = listAllTactics<BenchClass>(gemm_id); - for (auto c : confs) - std::cerr << c.toString(); - std::cerr << std::endl; - } - } - - continue; - } + if (tactic_ids1.empty() || tactic_ids2.empty()) + { + std::cerr << "Warning: Missing tactic list for one GEMM; defaulting to auto (-1)." << std::endl; + static bool printed = false; + if (!printed) + { + printed = true; + std::cerr << __PRETTY_FUNCTION__ << ": Valid Tactics are:\n"; + for (auto gemm_id : {MoeGemmId::GEMM_1, MoeGemmId::GEMM_2}) + { + std::cerr << "GEMM " << (int) gemm_id << ":\n"; + auto confs = listAllTactics<BenchClass>(gemm_id); + for (auto c : confs) + std::cerr << c.toString(); + std::cerr << std::endl; + } + } + if (tactic_ids1.empty()) tactic_ids1 = {-1}; + if (tactic_ids2.empty()) tactic_ids2 = {-1}; + }cpp/tensorrt_llm/thop/moeOp.cpp (1)
268-298
: Remove duplicated bias validation blockThe bias validation block appears twice back-to-back; the second occurrence is redundant and increases maintenance overhead.
Apply this diff to remove the duplicate block:
- if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value()) - { - CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype); - CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype); - TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D."); - TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D."); - TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0], - "fc1_expert_weights and fc1_expert_biases must have the same number of experts."); - TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0], - "fc2_expert_weights and fc2_expert_biases must have the same number of experts."); - TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1], - "fc1_expert_biases should match fc1_expert_weights output shape."); - TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1], - "fc2_expert_biases should match fc2_expert_weights output shape."); - }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (5 hunks)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (7 hunks)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (3 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (5 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (6 hunks)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1 hunks)
cpp/tensorrt_llm/thop/moeOp.cpp (7 hunks)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (11 hunks)
tensorrt_llm/_torch/autotuner.py (1 hunks)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6 hunks)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5 hunks)
tests/unittest/_torch/test_autotuner.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- tensorrt_llm/_torch/autotuner.py
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
- tests/unittest/_torch/test_autotuner.py
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-19T03:35:20.797Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.797Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧬 Code Graph Analysis (6)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2): std (43-114), vector (608-612)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1): supportsTmaWarpSpecialized (293-312)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (3): isValidHopperMOESpecialisation (63-78), isValidBlackwellMOESpecialisation (47-57), isValidSM120MOESpecialisation (34-42)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4): sm (1090-1118), sm (1090-1090), sm (1120-1144), sm (1120-1120)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2): getTactics (1278-1282), getTactics (1278-1278)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (5)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (9): vector (1267-1279), vector (1267-1267), sm (1090-1118), sm (1090-1090), sm (1120-1144), sm (1120-1120), num_tokens (409-451), num_tokens (409-410), stream (829-841)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2): getTactics (1278-1282), getTactics (1278-1278)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (2): supportsTmaWarpSpecialized (293-312), EpilogueFusion (168-319)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2): prepareTmaWsInputs (4558-4663), prepareTmaWsInputs (4558-4559)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1): EpilogueFusion (167-331)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3): MoeGemmId (231-354), std (43-114), vector (608-612)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1): vector (566-570)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)
- tests/unittest/_torch/test_autotuner.py (4): get_valid_tactics (94-97), get_valid_tactics (153-156), get_valid_tactics (229-231), get_valid_tactics (317-324)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5): get_valid_tactics (125-141), get_valid_tactics (409-422), get_valid_tactics (667-680), get_valid_tactics (901-914), get_valid_tactics (1114-1127)
- tensorrt_llm/_torch/autotuner.py (2): get_valid_tactics (147-165), OptimizationProfile (118-133)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4): permuted_token_final_scales_ (871-871), mSM (962-962), mMinLatencyMode (987-987), mNumExpertsPerNode (964-964)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (4): permuted_token_final_scales_ (822-822), mSM (920-920), mMinLatencyMode (944-944), mNumExpertsPerNode (922-922)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (29)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)
318-320
: LGTM: kwargs accepted and returns List[int].Signature is future-proofed with **kwargs and the return type matches the hint.
400-402
: LGTM: consistent with other runners.Returns a concrete list and aligns with the tuner's expectations.
512-525
: LGTM: shape-derived config query is clear and specific.Inputs are unpacked explicitly; delegating to kernel_runner.get_valid_configs(b, m, n, k) reads well. No issues spotted.
726-728
: LGTM: correct return type and signature.Conforms to the List[int] contract; kwargs reserved for future use.
801-805
: LGTM: list conversion maintains type consistency.Good use of list(range(...)) for the advertised return type.
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
1281-1282: LGTM: getTactics now honors backend-selected GEMM
Returning tactics via `backend.mGemmToProfile` aligns with the per‑GEMM profiling path and removes the unused m/n/k coupling. Looks correct.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)
836-836: LGTM: per‑GEMM tactic retrieval in pickBestTactic
Using `static_cast<MoeGemmId>(gemm_to_profile)` here is fine since the callers pass GEMM_1/GEMM_2. Keeps tactic lists GEMM‑specific.
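The cast is only as safe as the agreement between the two enums, which is easy to pin down with static_asserts. A hedged sketch follows; the enumerator values here are assumptions for illustration and must match whatever the TensorRT-LLM headers actually define.

```cpp
// Hypothetical enums mirroring the benchmark's GemmToProfile and the kernel's MoeGemmId;
// the actual enumerator values live in the TensorRT-LLM headers.
enum class GemmToProfile : int
{
    GEMM_1 = 1,
    GEMM_2 = 2
};

enum class MoeGemmId : int
{
    GEMM_1 = 1,
    GEMM_2 = 2
};

// The static_cast pattern is only sound while the enumerator values line up, which is
// worth pinning with compile-time checks next to the cast.
static_assert(static_cast<int>(GemmToProfile::GEMM_1) == static_cast<int>(MoeGemmId::GEMM_1), "enum mismatch");
static_assert(static_cast<int>(GemmToProfile::GEMM_2) == static_cast<int>(MoeGemmId::GEMM_2), "enum mismatch");

MoeGemmId toGemmId(GemmToProfile gemm)
{
    return static_cast<MoeGemmId>(gemm);
}

int main()
{
    return toGemmId(GemmToProfile::GEMM_2) == MoeGemmId::GEMM_2 ? 0 : 1;
}
```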
970-979
: LGTM: GEMM-only profiler path uses per‑GEMM tacticIndexing by
tactic_idx
here is fine becausesetTactic()
validates the profiled GEMM index before use. With the above guard insetTactic()
, the unprofiled GEMM won’t be dereferenced in this path.cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (8)
2850-2854
: Gating final scales buffer on tactic’s finalize epilogue — LGTM. Using
gemm2_config_->epilogue_fusion_type
to derive gemm2_using_finalize_fusion
and only wiring permuted_token_final_scales_
in that case aligns the buffer lifetime with the chosen epilogue. Clean and clear.
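To make the coupling concrete, here is a minimal, self-contained C++ sketch of the gating idea. The type and member names (EpilogueFusionType, GemmConfig, Workspace, permuted_token_final_scales) are invented stand-ins for illustration, not the actual TensorRT-LLM declarations.

```cpp
// Illustrative only: hypothetical types and names, not the real runner code.
#include <cstdio>

enum class EpilogueFusionType { NONE, FINALIZE };

struct GemmConfig
{
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

struct Workspace
{
    float* permuted_token_final_scales = nullptr; // only meaningful for the finalize epilogue
};

// Wire the scales buffer only when the chosen GEMM2 tactic requests the FINALIZE epilogue,
// so the buffer's use follows the selected tactic rather than a global runner flag.
float const* selectFinalScales(GemmConfig const& gemm2_config, Workspace const& ws)
{
    bool const gemm2_using_finalize_fusion
        = gemm2_config.epilogue_fusion_type == EpilogueFusionType::FINALIZE;
    return gemm2_using_finalize_fusion ? ws.permuted_token_final_scales : nullptr;
}

int main()
{
    float scales[4] = {};
    Workspace ws;
    ws.permuted_token_final_scales = scales;

    GemmConfig cfg;
    cfg.epilogue_fusion_type = EpilogueFusionType::FINALIZE;
    std::printf("finalize tactic -> scales wired: %s\n",
        selectFinalScales(cfg, ws) != nullptr ? "yes" : "no");
    return 0;
}
```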
4010-4016
: Finalize-fusion decision ties runner and tactic — LGTM
using_fused_finalize
is gated by both the runner setting and the tactic’s requested epilogue, and the sanityTLLM_CHECK_WITH_INFO
ensures they don’t diverge. This is the right coupling point.
4558-4560
: Profiler: pass epilogue fusion into TMA-WS preparation — LGTMPromoting fusion as an explicit parameter to
prepareTmaWsInputs
makes the per-fusion cache population unambiguous and keeps the API future-proof.
4566-4576
: Early-exit gating for finalize fusion in profiler — LGTM. The support check (runner flag, min-latency, groupwise W4*, GEMM_2 only) avoids preparing unsupported variants. Computing the groupwise flags up front also prevents the “use-before-declaration” issue seen earlier.
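A small predicate of this shape captures the gating logic; the flag names and the ProfilerState struct below are hypothetical and only mirror the conditions listed above, not the profiler's real members.

```cpp
// Sketch only: hypothetical state mirroring the gating conditions described above.
#include <cstdio>

enum class MoeGemmId { GEMM_1, GEMM_2 };

struct ProfilerState
{
    bool runnerSupportsFinalize = true; // runner-level opt-in for the fused finalize epilogue
    bool minLatencyMode = false;        // finalize fusion is not profiled in min-latency mode
    bool usesGroupwiseW4 = false;       // groupwise W4 quantization paths skip the fusion
    MoeGemmId gemmToProfile = MoeGemmId::GEMM_2;
};

// True only when preparing a FINALIZE workspace template can pay off; otherwise the
// profiler prepares just the NONE variant and returns early.
bool shouldPrepareFinalizeTemplate(ProfilerState const& s)
{
    return s.runnerSupportsFinalize && !s.minLatencyMode && !s.usesGroupwiseW4
        && s.gemmToProfile == MoeGemmId::GEMM_2;
}

int main()
{
    ProfilerState state;
    std::printf("prepare FINALIZE template: %s\n",
        shouldPrepareFinalizeTemplate(state) ? "yes" : "no");
    return 0;
}
```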
4616-4621
: Reuse TMA-WS buffers between NONE/FINALIZE variants is intentional — acknowledged. The comment and logic reflect the deliberate reuse to save space. No changes requested here.
4626-4638
: Per-fusion template selection and parameterization — LGTM. Selecting the cache entry by
use_finalize_fusion
, setting fusion per-GEMM, and callingsetFinalizeFusionParams(output, hidden_size, num_tokens, mK > 1)
matches the expected (hidden_size, num_tokens) layout for the transposed finalize path. Looks correct.
4675-4679
: Prepare both NONE and FINALIZE templates up front for the profiler — LGTM. Preparing both fusion variants in one pass improves tactic switching latency during profiling. Given the intentional workspace reuse, this remains consistent.
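The prepare-both-then-select flow can be pictured with the following standalone sketch; Profiler, TmaWsTemplate, and the two-slot cache are invented stand-ins under stated assumptions, not the actual profiler types.

```cpp
// Minimal sketch of the caching idea with hypothetical types; not the actual profiler code.
#include <array>
#include <cassert>
#include <cstdio>

enum class EpilogueFusion { NONE, FINALIZE };

struct TmaWsTemplate
{
    bool valid = false;
    EpilogueFusion fusion = EpilogueFusion::NONE;
};

struct Profiler
{
    // One prepared template per fusion variant; index 0 = NONE, index 1 = FINALIZE.
    std::array<TmaWsTemplate, 2> cache{};

    void prepare(EpilogueFusion fusion)
    {
        auto& entry = cache[fusion == EpilogueFusion::FINALIZE ? 1 : 0];
        entry.fusion = fusion;
        entry.valid = true; // real code would build the TMA workspace template here
    }

    TmaWsTemplate const& select(EpilogueFusion tacticFusion) const
    {
        auto const& entry = cache[tacticFusion == EpilogueFusion::FINALIZE ? 1 : 0];
        assert(entry.valid && "template for this fusion type was never prepared");
        return entry;
    }
};

int main()
{
    Profiler p;
    p.prepare(EpilogueFusion::NONE);
    p.prepare(EpilogueFusion::FINALIZE); // both variants prepared once, before timing tactics
    auto const& t = p.select(EpilogueFusion::FINALIZE);
    std::printf("selected fusion=%d valid=%d\n", static_cast<int>(t.fusion),
        static_cast<int>(t.valid));
    return 0;
}
```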
4742-4745
: TMA-WS template lookup by tactic’s fusion type — LGTM. Indexing
mTmaInputCache
by epilogue fusion and asserting.isValid()
ensures the profiler executes with an initialized template corresponding to the chosen tactic.cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (4)
473-478
: Instance-level getConfigs(bool) — LGTM. Forwarding to the SM-parametrized overload keeps call sites simple while allowing per-runner decisions when needed.
482-489
: SM-parametrized getConfigs(sm, supports_finalize_fusion) — LGTM. Combining TMA-WS and Ampere configs under a single API and threading the
supports_finalize_fusion
flag through is a good consolidation.
521-571
: TMA-WS configs: duplicate for FINALIZE when supported — LGTM. Cloning configs and tagging
epilogue_fusion_type = FINALIZE
when finalize is supported cleanly surfaces finalize-capable tactics to the selector without touching base heuristics.
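For readers unfamiliar with the pattern, a compact sketch of the duplicate-and-tag step is shown below; GemmConfig and EpilogueFusionType are assumed stand-ins rather than the cutlass_extensions definitions.

```cpp
// Sketch under assumed types: duplicate each base config and tag the copy as FINALIZE-capable.
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

enum class EpilogueFusionType { NONE, FINALIZE };

struct GemmConfig
{
    int tile = 0;
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

std::vector<GemmConfig> expandWithFinalize(std::vector<GemmConfig> configs, bool supports_finalize_fusion)
{
    if (supports_finalize_fusion)
    {
        // Reserve first so appending copies cannot reallocate and invalidate the iterators.
        configs.reserve(configs.size() * 2);
        auto const base_count = static_cast<std::ptrdiff_t>(configs.size());
        std::transform(configs.begin(), configs.begin() + base_count, std::back_inserter(configs),
            [](GemmConfig config)
            {
                config.epilogue_fusion_type = EpilogueFusionType::FINALIZE;
                return config;
            });
    }
    return configs;
}

int main()
{
    auto const configs = expandWithFinalize({{128}, {256}}, /*supports_finalize_fusion=*/true);
    std::printf("%zu configs (2 NONE + 2 FINALIZE)\n", configs.size());
    return 0;
}
```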
832-834
: Workspace sizing includes FINALIZE-capable TMA configs — LGTM. Using
getTmaWarpSpecializedConfigs(sm_, true)
and trying both NONE
and FINALIZE
(with try/catch) ensures the allocation bounds cover the largest supported path at runtime.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
278-285
: LGTM: per-GEMM tactic parsing. Conditional parsing of
tactic_id1
andtactic_id2
is correct and aligns with the new per‑GEMM API.
415-426
: LGTM: verbose per‑GEMM tactic listing. Clear per‑GEMM enumeration with indices will help users map IDs to configs.
cpp/tensorrt_llm/thop/moeOp.cpp (3)
154-159
: LGTM after factory return-type change. With
switch_output_type
returning ashared_ptr
, this assignment is correct and fixes the CI compile breakage in the FP8 path.
202-207
: LGTM after factory return-type change. With
create_weight_quant_runner
returning ashared_ptr
, these assignments compile and are consistent.
590-595
: LGTM: input validation for gemm_idx. Explicitly checking
gemm_idx
prevents silent misuse and matches the Python binding contract.cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
231-236
: LGTM: MoeGemmId enum introduction. Clear scoped enum for GEMM routing; improves type safety across the per‑GEMM APIs.
456-457
: LGTM: per‑GEMM tactic API on the interface. The
getTactics(MoeGemmId)
signature change is consistent with the PR intent and downstream usage.
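The shape of such a per-GEMM API can be illustrated with the minimal sketch below; the enum values, Tactic struct, and Runner type are assumptions for demonstration and do not reproduce the real header.

```cpp
// Sketch with assumed names: a scoped id for the two MOE GEMMs and a per-GEMM tactic lookup.
#include <cstdio>
#include <stdexcept>
#include <vector>

enum class MoeGemmId : int
{
    GEMM_1 = 1,
    GEMM_2 = 2,
};

struct Tactic
{
    int id = 0;
};

struct Runner
{
    std::vector<Tactic> gemm1_tactics{{0}, {1}, {2}};
    std::vector<Tactic> gemm2_tactics{{0}, {1}}; // GEMM_2 may expose extra finalize-capable entries

    std::vector<Tactic> const& getTactics(MoeGemmId gemm_id) const
    {
        switch (gemm_id)
        {
        case MoeGemmId::GEMM_1: return gemm1_tactics;
        case MoeGemmId::GEMM_2: return gemm2_tactics;
        }
        throw std::invalid_argument("gemm_id must be GEMM_1 or GEMM_2");
    }
};

int main()
{
    Runner runner;
    std::printf("GEMM_1 tactics: %zu, GEMM_2 tactics: %zu\n",
        runner.getTactics(MoeGemmId::GEMM_1).size(), runner.getTactics(MoeGemmId::GEMM_2).size());
    return 0;
}
```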
603-606
: Correctness: gate finalize variants per GEMM via gemm_id. The override correctly gates FINALIZE variants only for GEMM_2 when supported.
911-997
: LGTM: profiler type alias and TMA input cache shape
- Using
MoeGemmId
asGemmToProfile
is consistent and simplifies cross‑module usage.- Expanding
mTmaInputCache
to index by fusion (NONE/FINALIZE) matches the updated workspace preparation.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
Signed-off-by: djns99 <[email protected]>
5804b77 to 71f9272 Compare
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
689-701
: register_fake variable typo: act_fp8 vs act_fp4. The fake implementation for w4a8_mxfp4_fp8_gemm declares act_fp4 but returns using act_fp8, which is undefined and will NameError at trace time.
Apply this diff:
-@w4a8_mxfp4_fp8_gemm.register_fake +@w4a8_mxfp4_fp8_gemm.register_fake def _( - act_fp4: torch.Tensor, + act_fp8: torch.Tensor, @@ ) -> torch.Tensor: - return act_fp8.new_empty((act_fp8.size(0), weight.size(0)), - dtype=output_dtype) + return act_fp8.new_empty((act_fp8.size(0), weight.size(0)), + dtype=output_dtype)Alternatively, if the op truly consumes FP8 activations, rename the parameter to act_fp8 (as above). If it’s FP4, change both the annotation and the return to use act_fp4 consistently.
♻️ Duplicate comments (7)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
1079-1085
: Guard logging when a GEMM has no tactics or index is invalid. Unconditional indexing can go out of bounds when a GEMM isn’t profiled or yields no valid tactics. Use bounds checks before printing.
Apply this diff:
- auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1); - auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2); - std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n" - << tactics1[tactic_idx1].toString() << std::endl; - std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n" - << tactics2[tactic_idx2].toString() << std::endl; + auto tactics1 = mMoERunner.getTactics(MoeGemmId::GEMM_1); + auto tactics2 = mMoERunner.getTactics(MoeGemmId::GEMM_2); + if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_1) + { + std::cout << "Selected tactic #1: " << tactic_idx1 << "/" << tactics1.size() << "\n"; + if (!tactics1.empty() && tactic_idx1 >= 0 && tactic_idx1 < static_cast<int>(tactics1.size())) + { + std::cout << tactics1[tactic_idx1].toString() << std::endl; + } + } + if (gemm_to_profile == GemmToProfile::LAYER || gemm_to_profile == GemmToProfile::GEMM_2) + { + std::cout << "Selected tactic #2: " << tactic_idx2 << "/" << tactics2.size() << "\n"; + if (!tactics2.empty() && tactic_idx2 >= 0 && tactic_idx2 < static_cast<int>(tactics2.size())) + { + std::cout << tactics2[tactic_idx2].toString() << std::endl; + } + }cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
45-50
: Compile error: decltype on non-static member yields pointer-to-member, not the runner type
decltype(BenchClass::mMoERunner)
is invalid here. Use the concrete runner alias derived from BenchClass’ template parameters.Apply this diff:
template <class BenchClass> auto listAllTactics(MoeGemmId gemm_id) { int const sm = getSMVersion(); - using RunnerType = decltype(BenchClass::mMoERunner); - return RunnerType::getTactics(sm, gemm_id); + using RunnerType = CutlassMoeFCRunner<typename BenchClass::DataType, + typename BenchClass::WeightType, + typename BenchClass::OutputType, + typename BenchClass::InputType>; + return RunnerType::getTactics(sm, gemm_id); }cpp/tensorrt_llm/thop/moeOp.cpp (3)
590-595
: Make return type explicit and keep validation — minor polish. You already validate gemm_idx. Consider explicitly casting size() to int64_t to avoid implicit narrowing across platforms.
Apply this diff:
- return (gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size(); + return static_cast<int64_t>((gemm_idx == 1) ? mGemm1Profiles.size() : mGemm2Profiles.size());
630-637
: Guard empty tactic lists and out-of-range profile_id in runGemmProfile. profiles.front() and profiles[profile_id] can hit UB if profiles is empty or profile_id is invalid. Fail fast with clear messages.
Apply this diff:
- auto const gemm_to_profile - = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; - auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + auto const gemm_to_profile + = (gemm_idx == 1) ? profiler_backend::GemmToProfile::GEMM_1 : profiler_backend::GemmToProfile::GEMM_2; + auto const& profiles = (gemm_idx == 1) ? mGemm1Profiles : mGemm2Profiles; + TORCH_CHECK(!profiles.empty(), "No tactics available for GEMM_%ld", gemm_idx); @@ - auto const& profile = profile_id == -1 ? profiles.front() : profiles[profile_id]; + const auto& profile = [&]() -> const Profile& { + if (profile_id == -1) return profiles.front(); + TORCH_CHECK(profile_id >= 0 && profile_id < static_cast<int64_t>(profiles.size()), + "profile_id out of range for GEMM_%ld: %ld", gemm_idx, profile_id); + return profiles[profile_id]; + }();
740-749
: Protect .front()/.at() in setRunnerProfiles and validate profile_ids. This currently assumes non-empty mGemm{1,2}Profiles and valid indices. Add guards to prevent UB and clearer error messages.
Apply this diff:
- auto best_gemm1_profile = mGemm1Profiles.front(); - auto best_gemm2_profile = mGemm2Profiles.front(); + TORCH_CHECK(!mGemm1Profiles.empty() && !mGemm2Profiles.empty(), "No tactics available"); + auto best_gemm1_profile = mGemm1Profiles.front(); + auto best_gemm2_profile = mGemm2Profiles.front(); if (profile_ids.has_value()) { TORCH_CHECK(profile_ids.value().size() == 2, "Expecting 2 profile ids"); - best_gemm1_profile - = profile_ids.value()[0] == -1 ? best_gemm1_profile : mGemm1Profiles.at(profile_ids.value()[0]); - best_gemm2_profile - = profile_ids.value()[1] == -1 ? best_gemm2_profile : mGemm2Profiles.at(profile_ids.value()[1]); + if (profile_ids.value()[0] != -1) + { + TORCH_CHECK(profile_ids.value()[0] >= 0 + && profile_ids.value()[0] < static_cast<int64_t>(mGemm1Profiles.size()), + "profile_ids[0] out of range for GEMM_1"); + best_gemm1_profile = mGemm1Profiles.at(profile_ids.value()[0]); + } + if (profile_ids.value()[1] != -1) + { + TORCH_CHECK(profile_ids.value()[1] >= 0 + && profile_ids.value()[1] < static_cast<int64_t>(mGemm2Profiles.size()), + "profile_ids[1] out of range for GEMM_2"); + best_gemm2_profile = mGemm2Profiles.at(profile_ids.value()[1]); + } }cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
1178-1181
: Fixed signature mismatch — good catch. Using selectTacticsForArch(sm) resolves the earlier extra-arg build error.
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
84-87
: Return a list and make gemm_idx resilient in get_valid_tactics. Returning range violates the List[int] contract, and blindly indexing kwargs risks KeyError. Convert to a list and default gemm_idx to a sane value.
Apply this diff:
- def get_valid_tactics(self, inputs: List[torch.Tensor], - profile: OptimizationProfile, **kwargs) -> List[int]: - return range(self.fused_moe_runner.get_tactic_num(kwargs["gemm_idx"])) + def get_valid_tactics(self, inputs: List[torch.Tensor], + profile: OptimizationProfile, **kwargs) -> List[int]: + gemm_idx = int(kwargs.get("gemm_idx", 1)) + return list(range(int(self.fused_moe_runner.get_tactic_num(gemm_idx))))
🧹 Nitpick comments (16)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (1)
2-2
: Disambiguate the provenance of the archived kernels’ commit. The “commit” line shows a 40-char hash but not which repo it refers to. Make provenance unambiguous so downstream triage (especially for CI failures) can trace the exact source.
Apply this minimal format tweak:
-commit 444ef1b3b06cdc7ee66b4e612ce26ad25967440b +source internal_cutlass_kernels@444ef1b3b06cdc7ee66b4e612ce26ad25967440bOptionally, add build metadata (compiler/CUDA/CUTLASS versions) if you already capture it during artifact build; this pays off when multiple ABI-compatible archives exist.
cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h (2)
61-63
: Namespace relocation is fine; add brief doc and clarify intent. The new nested namespace reads as an API/ABI surface change. A short comment helps future readers understand why these helpers live under oss.
Apply this small doc tweak:
-namespace oss +// Internal helpers for int8 GEMM dispatch isolated under a sub-namespace to keep +// the public runner API stable while we evolve tactic selection plumbing. +namespace oss {
341-345
: Parenthesize the condition for readability. Make the precedence explicit to avoid misreads.
- else if (mSm >= 80 && mSm <= 90 || mSm >= 120) + else if ((mSm >= 80 && mSm <= 90) || (mSm >= 120)) { oss::dispatchGemmToCutlass<T, cutlass::arch::Sm80>(A, B, quantOption, alphaCol, alphaRow, C, m, n, k, workspacePtr, workspaceBytes, gemmConfig, stream, occupancy); }cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)
836-838
: Avoid enum-to-enum static_cast; map GemmToProfile to MoeGemmId explicitly. Casting between distinct enums relies on coincidental underlying values and risks invalid values (e.g., LAYER). Use an explicit mapping.
Apply this diff:
- auto tactics = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile)); + MoeGemmId gemmId = (gemm_to_profile == GemmToProfile::GEMM_1) + ? MoeGemmId::GEMM_1 + : MoeGemmId::GEMM_2; + auto tactics = mMoERunner.getTactics(gemmId);
970-977
: Same enum mapping nit: avoid static_cast here tooMirror the earlier explicit mapping to prevent accidental misuse if
gemm_to_profile
is ever LAYER.Apply this diff:
- auto tactics = mMoERunner.getTactics(static_cast<MoeGemmId>(gemm_to_profile))[tactic_idx]; + MoeGemmId gemmId = (gemm_to_profile == GemmToProfile::GEMM_1) + ? MoeGemmId::GEMM_1 + : MoeGemmId::GEMM_2; + auto tactics = mMoERunner.getTactics(gemmId)[tactic_idx];cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (2)
521-571
: Pre-reserve capacity before duplicating configs for FINALIZE. Minor perf/readability: reserve to avoid reallocation while appending FINALIZE variants.
Apply this diff:
- if (supports_finalize_fusion) + if (supports_finalize_fusion) { - // Duplicate the configs and set the epilogue fusion type to FINALIZE + // Duplicate the configs and set the epilogue fusion type to FINALIZE + tma_ws_configs.reserve(tma_ws_configs.size() * 2); auto finalize_configs = tma_ws_configs; std::transform(finalize_configs.begin(), finalize_configs.end(), std::back_inserter(tma_ws_configs), [](auto& config) { config.epilogue_fusion_type = cutlass_extensions::CutlassGemmConfig::EpilogueFusionType::FINALIZE; return config; }); }
831-834
: Avoid duplicated work when computing workspace sizeYou already try both NONE and FINALIZE paths explicitly below. Calling
getTmaWarpSpecializedConfigs(sm_, true)
produces duplicates, doubling iterations. Query only base configs here.Apply this diff:
- // Finalize fusion may not actually be supported by the kernel, - // if they are not we will catch the error and skip them - auto configs = getTmaWarpSpecializedConfigs(sm_, true); + // Fetch base configs once; we explicitly try NONE and (on SM90) FINALIZE below. + auto configs = getTmaWarpSpecializedConfigs(sm_, /*supports_finalize_fusion=*/false);cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
279-285
: Optional: accept single-sided tactic spec by defaulting the other to autoRight now, missing
tactic_id1
ortactic_id2
results in skipping later. Defaulting the absent side to[-1]
(auto) improves usability for layer profiling.Apply this diff:
- if (run_config.contains("tactic_id1")) + if (run_config.contains("tactic_id1")) { parseTacticToVectorID<BenchClass>(run_config["tactic_id1"], tactic_ids1, MoeGemmId::GEMM_1); } - if (run_config.contains("tactic_id2")) + if (run_config.contains("tactic_id2")) { parseTacticToVectorID<BenchClass>(run_config["tactic_id2"], tactic_ids2, MoeGemmId::GEMM_2); } + // Default missing side to auto for full-layer runs + if (tactic_ids1.empty()) { tactic_ids1 = {-1}; } + if (tactic_ids2.empty()) { tactic_ids2 = {-1}; }
334-353
: Current behavior skips benchmarks when tactics missing — prefer auto-default insteadIf you don’t adopt the defaulting above, consider at least making the skip reason explicit in logs. Otherwise, applying the suggested default avoids this branch entirely.
Apply this diff (if you keep skip):
- std::cerr << "Warning: Skipping benchmark, no valid tactic found" << std::endl; + std::cerr << "Warning: Skipping benchmark: no tactics specified for one or both GEMMs " + "(provide tactic_id1/tactic_id2 or use \"auto\"/\"all\")." << std::endl;cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
2849-2854
: Tactic-driven gating of permuted_token_final_scales_ looks good. Using gemm2_config_->epilogue_fusion_type to decide whether to thread the scales buffer aligns the workspace wiring with the chosen tactic. No functional concerns. If you want to reduce duplication later, consider a tiny helper (e.g., gemm2RequestsFinalize()) to reuse this check across call sites.
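A possible form of that helper is sketched below; the member names and types are hypothetical (the real runner's fields may differ), so treat this as an illustration of the suggestion rather than a drop-in patch.

```cpp
// Hypothetical helper along the lines suggested above; names and types are illustrative only.
#include <cstdio>
#include <memory>

enum class EpilogueFusionType { NONE, FINALIZE };

struct GemmConfig
{
    EpilogueFusionType epilogue_fusion_type = EpilogueFusionType::NONE;
};

struct Runner
{
    std::shared_ptr<GemmConfig> gemm2_config_;

    // Single place that answers "does the chosen GEMM2 tactic want the fused finalize epilogue?",
    // so workspace wiring, sanity checks, and kernel launch all agree.
    bool gemm2RequestsFinalize() const
    {
        return gemm2_config_ && gemm2_config_->epilogue_fusion_type == EpilogueFusionType::FINALIZE;
    }
};

int main()
{
    Runner r;
    r.gemm2_config_ = std::make_shared<GemmConfig>(GemmConfig{EpilogueFusionType::FINALIZE});
    std::printf("gemm2 requests finalize: %s\n", r.gemm2RequestsFinalize() ? "yes" : "no");
    return 0;
}
```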
4010-4016
: Good consistency check between tactic and runner configThe equality check between using_fused_finalize and gemm2_using_finalize_fusion is the right fail-fast. As a tiny DX improvement, consider including both boolean values in the error message for quicker diagnosis.
4634-4638
: Finalize params ordering matches row/col semanticssetFinalizeFusionParams(output, mExpertHiddenSize, num_tokens, mK > 1) passes (hidden_size, num_output_tokens, reduce), which is consistent with the swapped/transpose layout. Consider a brief comment here to make the rationale obvious to future readers.
4675-4679
: Dual call to prepareTmaWsInputs is clear; optional call-site notePreparing NONE and FINALIZE templates back-to-back is consistent with the caching design. If helpful, add a one-liner here mirroring the “reuse to save space” note from prepareTmaWsInputs to make the overlaying intentional at the call site too.
cpp/tensorrt_llm/thop/moeOp.cpp (2)
268-298
: Remove duplicated bias validation block in runMoe. The validation of fc{1,2}_expert_biases appears twice verbatim, increasing maintenance risk.
Apply this diff to drop the second copy:
- if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value()) - { - CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype); - CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype); - TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D."); - TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D."); - TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0], - "fc1_expert_weights and fc1_expert_biases must have the same number of experts."); - TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0], - "fc2_expert_weights and fc2_expert_biases must have the same number of experts."); - TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1], - "fc1_expert_biases should match fc1_expert_weights output shape."); - TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1], - "fc2_expert_biases should match fc2_expert_weights output shape."); - }
433-443
: Typo in method name runMoeMinLantencyMinor naming nit: “Lantency” → “Latency”. Consider renaming the method and adjusting the TORCH_LIBRARY binding accordingly to improve readability. This is API-facing only via the binding symbol “run_moe_min_latency”, so internal rename is low-risk.
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
91-94
: Avoid defaulting gemm_idx to 0 in forward. C++ side now TORCH_CHECKs gemm_idx in {1,2}. Defaulting to 0 will throw if callers forget to pass it. Prefer default=1 or validate locally.
Apply this diff:
- gemm_idx: int = 0, + gemm_idx: int = 1,
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (2)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz
is excluded by!**/*.xz
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/tensorrt_llm_internal_cutlass_kernels_static.tar.xz
is excluded by!**/*.xz
📒 Files selected for processing (18)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
(5 hunks)cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
(7 hunks)cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
(3 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
(1 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
(7 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
(3 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
(5 hunks)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
(6 hunks)cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt
(1 hunks)cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
(1 hunks)cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
(2 hunks)cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
(1 hunks)cpp/tensorrt_llm/thop/moeOp.cpp
(7 hunks)cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
(11 hunks)tensorrt_llm/_torch/autotuner.py
(1 hunks)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
(6 hunks)tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
(5 hunks)tests/unittest/_torch/misc/test_autotuner.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h
- tensorrt_llm/_torch/autotuner.py
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/unittest/_torch/misc/test_autotuner.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/unittest/_torch/misc/test_autotuner.py
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
🧬 Code Graph Analysis (6)
tests/unittest/_torch/misc/test_autotuner.py (2)
tensorrt_llm/_torch/autotuner.py (2)
OptimizationProfile
(118-133)FakeTensor
(138-141)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6)
get_valid_tactics
(84-86)get_valid_tactics
(318-320)get_valid_tactics
(400-402)get_valid_tactics
(512-524)get_valid_tactics
(726-728)get_valid_tactics
(801-804)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (4)
permuted_token_final_scales_
(822-822)mSM
(920-920)mMinLatencyMode
(944-944)mNumExpertsPerNode
(922-922)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
std
(43-114)vector
(608-612)cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h (1)
supportsTmaWarpSpecialized
(293-312)cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (3)
isValidHopperMOESpecialisation
(63-78)isValidBlackwellMOESpecialisation
(47-57)isValidSM120MOESpecialisation
(34-42)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)
sm
(1090-1118)sm
(1090-1090)sm
(1120-1144)sm
(1120-1120)cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2)
getTactics
(1278-1282)getTactics
(1278-1278)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
MoeGemmId
(231-354)std
(43-114)vector
(608-612)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)
tests/unittest/_torch/misc/test_autotuner.py (4)
get_valid_tactics
(94-97)get_valid_tactics
(153-156)get_valid_tactics
(229-231)get_valid_tactics
(317-324)tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (5)
get_valid_tactics
(125-141)get_valid_tactics
(409-422)get_valid_tactics
(667-680)get_valid_tactics
(901-914)get_valid_tactics
(1114-1127)tensorrt_llm/_torch/autotuner.py (2)
get_valid_tactics
(147-165)OptimizationProfile
(118-133)
🪛 LanguageTool
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt
[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 444ef1b3b06cdc7ee66b4e612ce26ad25...
(QB_NEW_EN)
[grammar] ~2-~2: There might be a mistake here.
Context: ...444ef1b3b06cdc7ee66b4e612ce26ad25967440b
(QB_NEW_EN)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
[grammar] ~1-~1: There might be a mistake here.
Context: ...rt_llm_internal_cutlass_kernels_static.a commit 444ef1b3b06cdc7ee66b4e612ce26ad25...
(QB_NEW_EN)
[grammar] ~2-~2: There might be a mistake here.
Context: ...444ef1b3b06cdc7ee66b4e612ce26ad25967440b
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (36)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt (2)
1-1
: Action Needed: Confirm checksum verification and clarify formatI wasn’t able to verify the checksum automatically because
sha256sum
isn’t available in the CI shell. Please:
- Ensure the stored hash in
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/version.txt
actually matches the archive. You can manually run, for example:and compare it to the value in the file.shasum -a 256 cpp/tensorrt_llm/kernels/internal_cutlass_kernels/x86_64-linux-gnu/libtensorrt_llm_internal_cutlass_kernels_static.a- Update the first line of
version.txt
to include an explicit algorithm prefix, e.g.:-5a31acd0fb1415196bff71fa4a8d1dded147e15ea10821cc46c66986ee libtensorrt_llm_internal_cutlass_kernels_static.a +sha256:5a31acd0fb1415196bff71fa4a8d1dded147e15ea10821cc46c66986ee libtensorrt_llm_internal_cutlass_kernels_static.a- Add a CI step (using
shasum -a 256
oropenssl dgst -sha256
) to validate thatversion.txt
and the static library never drift out of sync.
1-2
: Cross-arch version.txt files are in syncBoth the AArch64 and x86_64 version.txt files reference the same commit
444ef1b3b06cdc7ee66b4e612ce26ad25967440b
, and the differing SHA256 lines correspond to the distinct per-architecture static libraries as expected. There’s no mismatch or drift to address here.cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h (3)
333-335
: Call-site update to oss-qualified dispatch looks correct. Template params and argument order are unchanged; only qualification updated. Good.
338-340
: LGTM on SM75 branch. Consistent with the SM72 change; no behavioral differences introduced.
61-63
: All clear—no remaining unqualified call sites detected.
The scans for bothdispatchGemmToCutlass<…>
andgenericInt8GemmKernelLauncher<…>
outside ofint8_gemm_template.h
returned no matches, confirming that all uses have been updated to the newoss::
-qualified names. No further changes are needed here.cpp/tensorrt_llm/kernels/internal_cutlass_kernels/aarch64-linux-gnu/version.txt (1)
1-2
: LGTM: artifact bump recorded. Hash and commit updated as expected. No further action from code perspective.
tests/unittest/_torch/misc/test_autotuner.py (3)
153-156
: Signature extended to accept kwargs — aligned with per-GEMM contextAdding
**kwargs
to PartialCrashedRunner.get_valid_tactics keeps tests compatible with callers that propagate extra context (e.g., gemm_idx).
229-231
: Propagating kwargs in tests is fineGemmRunnerWithAttributes.get_valid_tactics now accepts
**kwargs
. This maintains API consistency across runners used by the autotuner tests.
317-320
: Supports tactic config generation with kwargsGemmRunnerWithTacticConfigs.get_valid_tactics adds
**kwargs
without altering behavior. Good forward-compat for additional selection hints.cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
928-953
: Per-GEMM tactic selection flow looks correctFetching
tactics1
/tactics2
, gating ongemm_to_profile
, and lazy best-pick with range checks is sound.cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)
473-478
: New const overload is appropriate
getConfigs(bool)
delegating to the SM-parametrized version cleanly exposes the new “supports finalize fusion” knob at the instance level.
481-489
: Composition of TMA WS and Ampere configs looks correctMerging
getTmaWarpSpecializedConfigs(sm, supports_finalize_fusion)
with Ampere configs preserves existing fallback behavior.
582-587
: SM-gated specialization predicate is fineThe new
(int sm)
helper is clear and matches the architecture-specific validity checks.cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (3)
53-86
: Per-GEMM tactic parsing is clean and recursiveThe refactor to thread
gemm_id
through parsing and to support arrays/“all”/“auto” is solid. Error handling for invalid formats is appropriate.
415-426
: Verbose dump per GEMM is helpful. Enumerating tactics per GEMM with indices improves debuggability. Nice.
536-544
: Help text matches parser (no “objects”); clear on valid tactic forms. Docs now accurately reflect accepted types and warn about index instability. Good catch-up with the parser changes.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
4558-4576
: prepareTmaWsInputs: early gating and groupwise flags computed before use — solid
- Passing the fusion enum into prepareTmaWsInputs makes the intent explicit.
- Computing use_w4afp8/use_wfp4a16/use_w4_groupwise before the early-return resolves the prior “used before declaration” hazard.
4616-4621
: Intentional WS reuse for NONE/FINALIZE variants acknowledgedThe cache selects mTmaInputCache[i][use_finalize_fusion] and reuses the same underlying workspace region by design to save memory (as noted in the comment). The code matches that intent.
4626-4628
: Per-GEMM cache binding during profiling is correct. Routing the active cache to the GEMM under test and using a local dummy for the other keeps configuration isolated without extra allocations. Looks good.
4742-4745
: Profiler: selecting cached template by fusion type — LGTMIndexing mTmaInputCache by (sample, isFinalize) and asserting isValid() gives a clean, explicit handoff to the run phase.
cpp/tensorrt_llm/thop/moeOp.cpp (2)
51-51
: Expose MoeGemmId alias — looks goodPublicly surfacing MoeGemmId in torch_ext is helpful for downstream clarity and keeps signatures consistent across layers.
219-221
: Initialize per-GEMM tactic vectors — good separationFetching tactics separately for GEMM_1 and GEMM_2 aligns with the new per-GEMM selection model.
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (6)
373-375
: Default fused-finalize flag is off — good coverage designDefaulting mUseFusedFinalize to false ensures both fused and unfused paths get exercised as k varies.
459-460
: Per-test control of finalize fusion — sensible toggleTying use_fused_finalize_ to (k < 3) || mUseFusedFinalize matches prod behavior and keeps determinism checks meaningful.
1090-1119
: Per-GEMM filtered configs — API usage looks correctgetFilteredConfigs now threads MoeGemmId and filters by SM. The L40S filter guards make sense; EXPECT_FALSE(tactics.empty()) asserts coverage.
1120-1144
: selectTacticsForArch: finalize-aware pairing — LGTMDeriving epilogue_fusion_type and selecting a finalize-matching GEMM_2 tactic is consistent with the new model. Sensible fallback logic with a warning.
1642-1645
: Determinism gate based on finalize fusion — appropriate. The condition matches known non-determinism in FINALIZE on SM90 with K≥3; the guard keeps tests stable elsewhere.
2093-2096
: ConfigSweep: fused finalize set true while sweeping both configs — reasonable. Enabling fused finalize here while sweeping both GEMMs ensures coverage of FINALIZE paths; the gemm_id-aware config enumeration fits the per-GEMM API.
Also applies to: 2123-2124, 2127-2129
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (8)
318-321
: Consistently return List[int] for FP8RowwiseGemmRunner.get_valid_tactics. Good: list(range(...)) aligns with the typed contract and avoids surprises.
400-403
: NVFP4/FP4 get_valid_tactics returns List[int] — good. Type-consistent and mirrors other runners.
726-729
: WeightOnlyQuantGemmRunner.get_valid_tactics — type fix is correct. list(range(...)) matches List[int] and the autotuner’s expectations.
801-805
: FinegrainedMixedDtypeGemm.get_valid_tactics — consistent return type. Looks good.
187-207
: Autotuner integration passes gemm_idx per GEMM — LGTM. Tuning two slots with explicit gemm_idx={1,2} matches the C++ API and per-GEMM tactic model.
151-163
: Tuner input handling for enable_alltoall — sensible. Constraining warm-up to the non-alltoall path and reusing tuner_top_k mirrors the C++ assumptions and keeps tactic selection stable.
Also applies to: 185-186
1-941
: Cross-file get_tactic_num migration sanity check passedAll occurrences of
get_tactic_num
in the Python codebase now include the requiredgemm_idx
argument, and no legacy zero-argument calls remain:
- Only one match of
get_tactic_num
was found inMoERunner.get_valid_tactics
, where it correctly useskwargs["gemm_idx"]
.- No calls to
fused_moe_runner.get_tactic_num()
without arguments were detected.No further changes are needed.
512-525
: Verify get_valid_configs return typeAdding
**kwargs
toget_valid_tactics
is appropriate to match the autotuner interface. However, the implementation ofget_valid_configs
could not be found in the Python or C/C++ sources in this repository—it's likely provided by an external extension. Please manually confirm that:
kernel_runner.get_valid_configs(b, m, n, k)
indeed returns aList[int]
.Key location:
• tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, around lines 522–524 (tactics = self.kernel_runner.get_valid_configs…
,return tactics
)
cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h
Signed-off-by: djns99 <[email protected]>
/bot run --disable-fail-fast |
PR_Github #15961 [ run ] triggered by Bot |
PR_Github #15961 [ run ] completed with state |
Summary by CodeRabbit
New Features
Documentation
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.