
Conversation

hyukn (Collaborator) commented Jul 20, 2025

Enable AllReduce-associated fusion patterns with fp4 and fp8 quantization in Llama3/4.

  • Added support for controlling fusion optimizations via environment variables (see the sketch after this list).
  • Applied the AllReduce + Residual + RMS_NORM + Quant (fp4/fp8) fusion pattern. The fusion is also compatible with speculative decoding layer capture in these models.
  • Made several improvements to the two-shot allreduce kernel.
  • Disabled fusion for small models (hidden size no greater than 4096) to avoid accuracy drops.
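
A minimal sketch of the gating described above, for illustration only; the function and variable names are assumptions mirroring this PR's description, not the exact implementation:

import os

def resolve_fusion_flags(hidden_size: int, has_tp: bool,
                         enable_attention_dp: bool) -> tuple:
    # Fusion stays on unless the environment variable disables it.
    enable_fusion = os.environ.get(
        "TRTLLM_LLAMA_EAGER_FUSION_DISABLED", "0") == "0"
    # Guard against the accuracy drop observed on small models.
    enable_fusion = enable_fusion and hidden_size > 4096
    # Pre-MLP fusion additionally requires attention DP to be off.
    pre_mlp_fusion = has_tp and not enable_attention_dp and enable_fusion
    post_mlp_fusion = has_tp and enable_fusion
    return pre_mlp_fusion, post_mlp_fusion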

Summary by CodeRabbit

  • New Features

    • Introduced environment-variable control for fusion features in model layers, allowing more flexible optimization.
    • Added support for quantization-aware fusion, improving performance with quantized models.
    • Enabled cross-layer fusion by linking normalization and attention modules between layers.
  • Improvements

    • Unified and simplified logic for controlling fusion and all-reduce operations, enhancing model efficiency.
    • Optimized CUDA kernel configuration for better performance on supported hardware.
  • Bug Fixes

    • Removed redundant normalization on final hidden states to streamline output.

coderabbitai bot (Contributor) commented Jul 20, 2025

📝 Walkthrough

"""

Walkthrough

The changes refactor fusion and all-reduce logic in Llama decoder layers, introducing environment-variable-based fusion enablement, quantization-aware fusion, and cross-layer fusion via new attributes. The code unifies fusion and all-reduce control, adds quantization support, links normalization and attention modules across layers for enhanced fusion capabilities, and adds a CUDA kernel launch bounds attribute for kernel optimization.

Changes

File(s) and change summary:

  • tensorrt_llm/_torch/models/modeling_llama.py: Refactored Llama decoder layers to use environment-variable-controlled fusion flags; unified all-reduce logic; added quantization-aware fusion; introduced cross-layer fusion by linking normalization and attention modules; added new attributes and methods for fusion and weight loading; removed a redundant normalization call; updated tensor parallel checks.
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu: Added a __launch_bounds__(1024) attribute to the allreduce_fusion_kernel_twoshot_sync CUDA kernel for optimization.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LlamaForCausalLM
    participant DecoderLayer
    participant NextDecoderLayer

    User->>LlamaForCausalLM: load_weights(weights)
    LlamaForCausalLM->>DecoderLayer: set next_layer_layernorm, next_attn (from NextDecoderLayer)
    Note right of DecoderLayer: Enables cross-layer fusion

    User->>DecoderLayer: forward(input)
    DecoderLayer->>DecoderLayer: Check enable_fusion (from env)
    DecoderLayer->>DecoderLayer: Set fusion and all-reduce flags
    DecoderLayer->>DecoderLayer: If quantized, use quantization-aware fusion
    DecoderLayer->>DecoderLayer: Perform forward pass with fusion logic

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • nv-yilinf
    """


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/thop/allreduceOp.cpp (1)

416-425: Remove dead code or clarify the forced oneshot override

The unconditional allreduce_fusion_params.use_oneshot = true; (line 425) makes the preceding branch (lines 417–423) and the TWOSHOT validation block (lines 427–432) unreachable. Please either:

  • Remove the dead code (the conditional that sets use_oneshot based on strategy/seq_len and the subsequent TWOSHOT check), if this override is permanent.
  • Or, add a clear comment explaining why oneshot is being forced, when/under what conditions it will be revisited, and disable or gate the override accordingly.

Locations to update:
• cpp/tensorrt_llm/thop/allreduceOp.cpp, around lines 417–432

Suggested diff:

-        // Determine if using oneshot or twoshot allreduce kernel
-        if (strategy == AllReduceStrategyType::MIN_LATENCY)
-        {
-            allreduce_fusion_params.use_oneshot = seq_len <= tensorrt_llm::kernels::ar_fusion::kOneShotMaxToken;
-        }
-        else
-        {
-            allreduce_fusion_params.use_oneshot = strategy == AllReduceStrategyType::ONESHOT;
-        }
-        // Force use oneshot
-        allreduce_fusion_params.use_oneshot = true;
-
-        // Check for some kernel constraints if using TWOSHOT kernel
-        if (!allreduce_fusion_params.use_oneshot)
-        {
-            TORCH_CHECK(input.size(0) >= static_cast<int64_t>(tp_size),
-                "Sequence length must be greater than or equal to TP size");
-        }
+        // Force use oneshot kernel for all fusion patterns.
+        // TODO: Remove this override or restore conditional logic after benchmarking with fp4/fp8.
+        allreduce_fusion_params.use_oneshot = true;
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

626-629: Consider unifying fusion configuration between decoder classes

LlamaDecoderLayer uses instance attributes (self.PRE_MLP_FUSION, self.POST_MLP_FUSION) while Llama4DecoderLayer uses self.fusion_config.PRE_MLP_FUSION. Consider using a consistent approach across both classes for better maintainability.

+        self.fusion_config = EagerFusionConfig()
-        self.PRE_MLP_FUSION = self.mapping.has_tp(
+        self.fusion_config.PRE_MLP_FUSION = self.mapping.has_tp(
         ) and not self.enable_attention_dp and self.enable_fusion
-        self.POST_MLP_FUSION = self.mapping.has_tp() and self.enable_fusion
+        self.fusion_config.POST_MLP_FUSION = self.mapping.has_tp() and self.enable_fusion

Then update the usage in the forward method accordingly.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5300a99 and c10ac0b.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/thop/allreduceOp.cpp (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (15 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/thop/allreduceOp.cpp (1)

424-425: Verify performance impact of forcing oneshot kernel

Forcing the oneshot kernel for all cases might not be optimal, especially for longer sequences where the twoshot kernel could be more efficient. The PR aims to enable fusion patterns for fp4/fp8 quantization, but it's unclear if oneshot is the best choice for all scenarios.

Could you clarify:

  1. Is this change temporary for testing or permanent?
  2. Have you benchmarked the performance impact for various sequence lengths?
  3. Should this be configurable based on quantization type (fp4/fp8)?
tensorrt_llm/_torch/models/modeling_llama.py (4)

341-343: Ensure consistent AllReduce behavior with attention DP

The condition for performing AllReduce checks both enable_attention_dp and has_tp(). When attention data parallelism is enabled, the AllReduce is skipped. Please verify this is the intended behavior for the fusion patterns with fp4/fp8 quantization.


482-492: Consistent handling of quantization-aware fusion outputs

The quantization-aware fusion code properly handles NVFP4 outputs by unpacking them into Fp4QuantizedTensor objects. The implementation looks correct for both pre-fusion and post-fusion cases.

Good job on maintaining consistency across different fusion points!

Also applies to: 544-555, 673-691, 703-721


913-924: Well-designed cross-layer fusion setup

The load_weights method properly sets up cross-layer references (next_layer_layernorm and next_attn) to enable fusion across decoder layers. This is a clean approach to enable cross-layer optimizations.

The special handling for the last layer (using model.norm) is also correct.

Also applies to: 1051-1058
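
As a rough sketch of that cross-layer wiring, assuming the usual attribute names (model.layers, input_layernorm, self_attn, model.norm); the real load_weights performs this alongside weight loading:

def link_layers_for_fusion(model) -> None:
    layers = model.model.layers
    for idx, layer in enumerate(layers):
        if idx == len(layers) - 1:
            # The last decoder layer fuses into the model's final RMSNorm;
            # there is no next attention module to feed.
            layer.next_layer_layernorm = model.model.norm
            layer.next_attn = None
        else:
            layer.next_layer_layernorm = layers[idx + 1].input_layernorm
            layer.next_attn = layers[idx + 1].self_attn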


853-853: Mapping.has_tp() is a defined method—no change needed

The Mapping class in tensorrt_llm/mapping.py (around line 387) implements def has_tp(self):…, so invoking model_config.mapping.has_tp() is the intended API. You can safely ignore the suggestion to switch to a property-based check.

Likely an incorrect or invalid review comment.

hyukn force-pushed the feat/llama_ar_fusion branch 2 times, most recently from 4b557d0 to b73390b on July 20, 2025 at 17:03
hyukn (Collaborator, Author) commented Jul 20, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator): PR_Github #12388 [ run ] triggered by Bot

tensorrt-cicd (Collaborator): PR_Github #12388 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9207 completed with status: 'FAILURE'

hyukn force-pushed the feat/llama_ar_fusion branch 2 times, most recently from 3ca263d to 74d7950 on July 21, 2025 at 09:48
hyukn (Collaborator, Author) commented Jul 21, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

688-697: Similar code duplication issue as in Llama4DecoderLayer.

The fusion op assignment logic here has the same duplication pattern as in Llama4DecoderLayer. The same refactoring suggestion applies.

🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

699-700: Document the reason for disabling POST_MLP_FUSION.

The TODO comment indicates POST_MLP_FUSION is disabled due to accuracy drop, but lacks specificity about the issue or timeline for resolution.

Consider expanding the comment to include:

  • What kind of accuracy drop was observed
  • Under what conditions it occurs
  • Timeline or conditions for re-enabling
  • Alternative approaches being considered
-        # TODO: Disable this to avoid large accuracy drop
+        # TODO: POST_MLP_FUSION disabled due to accuracy degradation in specific workloads
+        # Re-enable after investigating root cause and implementing fixes
         self.POST_MLP_FUSION = False
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b73390b and 74d7950.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
🔇 Additional comments (14)
tensorrt_llm/_torch/models/modeling_llama.py (14)

2-2: LGTM: Import addition is necessary for environment variable access.

The os import is correctly added to support the new environment variable-based fusion control functionality.


341-341: LGTM: Improved abstraction for tensor parallel condition.

Using self.mapping.has_tp() instead of direct tensor parallel size comparison provides better encapsulation and consistency with the rest of the codebase.


388-390: LGTM: Environment variable control for fusion enablement.

The environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED provides appropriate runtime control for fusion behavior. The naming is model-specific and clear in intent.


473-481: LGTM: Consolidated allreduce disable logic.

The boolean flags disable_attn_allreduce and disable_feed_forward_allreduce properly consolidate the conditions for disabling allreduce operations, making the logic clearer and more maintainable.


511-513: LGTM: Proper usage of consolidated allreduce disable flags.

The forward method correctly uses disable_attn_allreduce and disable_feed_forward_allreduce flags to control AllReduce operations, maintaining consistency with the initialization logic.

Also applies to: 557-559


516-520: LGTM: Quantization-aware scale extraction.

The scale extraction logic properly handles both NVFP4 and FP8 quantization modes, providing the necessary scale information for fusion operations.


547-552: LGTM: Proper handling of NVFP4 quantized outputs.

The code correctly unpacks NVFP4 quantized outputs into Fp4QuantizedTensor objects, maintaining the quantization state through the computation pipeline.
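
A simplified sketch of that unpacking pattern; the three-way NVFP4 output layout and the wrapper's constructor arguments are assumptions taken from this review, and a stand-in class is used to keep the example self-contained:

from dataclasses import dataclass

import torch

@dataclass
class Fp4QuantizedTensor:
    # Stand-in for the real wrapper class referenced above.
    fp4_tensor: torch.Tensor
    scaling_factor: torch.Tensor

def unpack_fused_allreduce_output(outputs, is_nvfp4: bool):
    if is_nvfp4:
        # Quantized fusion returns activation, per-block scale, and residual.
        act_fp4, act_scale, residual = outputs
        return Fp4QuantizedTensor(act_fp4, act_scale), residual
    hidden_states, residual = outputs
    return hidden_states, residual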


570-628: LGTM: Cross-layer fusion implementation.

The cross-layer fusion logic using next_layer_layernorm and next_attn attributes enables advanced fusion patterns. The implementation handles both normal and cutlass min-latency modes appropriately, with proper quantization support.


642-650: LGTM: Proper initialization of new member variables.

The addition of mapping, quantization flags, and next layer references follows the same pattern as Llama4DecoderLayer and provides necessary context for fusion operations.

Also applies to: 672-676


682-687: LGTM: Consistent fusion control implementation.

The environment variable-based fusion control and flag initialization follows the same pattern as Llama4DecoderLayer, maintaining consistency across model variants.


728-730: LGTM: Consistent allreduce disable flag usage.

The forward method correctly uses the consolidated disable flags for controlling AllReduce operations in both attention and MLP sections.

Also applies to: 758-760


733-755: LGTM: Quantization-aware PRE_MLP_FUSION implementation.

The PRE_MLP_FUSION logic properly handles scale extraction for NVFP4 quantization and correctly unpacks fusion outputs into appropriate tensor formats.


913-913: LGTM: Consistent tensor parallel condition abstraction.

Same improvement as seen elsewhere - using has_tp() instead of direct size comparison provides better encapsulation.


973-984: LGTM: Essential cross-layer fusion setup.

The load_weights method correctly establishes the cross-layer references needed for fusion patterns. The logic properly handles both intermediate layers (linking to next layer's components) and the final layer (linking to the model's norm).

This enables the cross-layer fusion capabilities referenced in the forward methods of the decoder layers.

tensorrt-cicd (Collaborator): PR_Github #12439 [ run ] triggered by Bot

tensorrt-cicd (Collaborator): PR_Github #12439 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9251 completed with status: 'FAILURE'

hyukn marked this pull request as ready for review on July 22, 2025 at 06:10
hyukn requested a review from a team as a code owner on July 22, 2025 at 06:10
hyukn force-pushed the feat/llama_ar_fusion branch from 74d7950 to a4d3136 on July 22, 2025 at 06:12
hyukn (Collaborator, Author) commented Jul 22, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
tensorrt_llm/_torch/models/modeling_llama.py (3)

391-400: Note: Fusion op assignment logic is still duplicated.

This code segment still has the same duplication issue mentioned in the past review comment. The fusion op assignment logic is repeated between this section and the LlamaDecoderLayer class (lines 649-658). The previous suggestion to consolidate this into shared variables still applies.


643-644: Consistent environment variable naming needed.

Same issue as in Llama4DecoderLayer - the environment variable name should be consistent across both classes.


649-658: Code duplication: Fusion op assignment logic repeated.

This fusion op assignment logic is duplicated from the Llama4DecoderLayer class (lines 391-400). The previous review suggestion to consolidate this into shared variables still applies to reduce code duplication.

🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_llama.py (3)

388-390: Consider renaming the environment variable for clarity.

The environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED is used for both Llama and Llama4 models, which might be confusing. Consider using a more generic name like TRTLLM_EAGER_FUSION_DISABLED or model-specific names.

-        self.enable_fusion = os.environ.get(
-            "TRTLLM_LLAMA_EAGER_FUSION_DISABLED", "0") == "0"
+        self.enable_fusion = os.environ.get(
+            "TRTLLM_EAGER_FUSION_DISABLED", "0") == "0"

461-461: Remove debug print statement.

The debug print statement should be removed before merging to production.

-        print(f"init Llama4DecoderLayer")

579-579: Remove debug print statements.

Debug print statements should be removed before production deployment.

-            print(f"{self.layer_idx}, {self.next_layer_layernorm}")
-        print(f"in forward")

Also applies to: 583-583

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74d7950 and a4d3136.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/_torch/models/modeling_llama.py (9)

2-2: LGTM - Import addition for environment variable support.

The os import is correctly added to support the environment variable-based fusion control introduced later in the code.


341-341: LGTM - Consistent use of mapping API.

The change from tp_size > 1 to has_tp() improves consistency with the mapping API usage pattern throughout the codebase.


452-459: LGTM - Well-structured all-reduce disable flags.

The consolidation of all-reduce disable logic into boolean flags (disable_attn_allreduce and disable_feed_forward_allreduce) improves code clarity and maintainability by centralizing the conditions.
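
For reference, a hedged sketch of how such flags might be composed; the exact contributing conditions are inferred from this thread rather than copied from the code:

def build_allreduce_disable_flags(pre_fusion: bool, post_fusion: bool,
                                  has_tp: bool,
                                  enable_attention_dp: bool) -> tuple:
    # Skip the standalone all-reduce when it is already folded into a fusion
    # op, when there is no TP group to reduce over, or when attention DP owns
    # the communication.
    disable_attn_allreduce = (pre_fusion or not has_tp
                              or enable_attention_dp)
    disable_feed_forward_allreduce = (post_fusion or not has_tp
                                      or enable_attention_dp)
    return disable_attn_allreduce, disable_feed_forward_allreduce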


517-521: LGTM - Proper quantization handling in fusion.

The quantization-aware fusion logic correctly unpacks the fusion output into separate components (fp4 tensor, scale factor, and residual) when NVFP4 quantization is enabled, and wraps them appropriately.

Also applies to: 584-588


660-661: Clarify the accuracy drop issue.

The TODO comment mentions disabling fusion "to avoid large accuracy drop" but doesn't provide details about the cause or planned resolution. This could impact performance benefits.

Can you provide more context about this accuracy drop issue? Is this a temporary workaround, and what's the timeline for fixing it?


663-668: LGTM - Consistent disable flags pattern.

The all-reduce disable flags follow the same well-structured pattern as the Llama4DecoderLayer, improving code consistency and maintainability.


694-722: LGTM - Comprehensive fusion logic with quantization support.

The pre-MLP and post-MLP fusion logic properly handles both regular and quantized (NVFP4) cases, correctly unpacking fusion outputs and creating appropriate tensor wrappers. The cross-layer fusion setup is also well-implemented.

Also applies to: 731-752


874-874: LGTM - Consistent mapping API usage.

The change to use has_tp() instead of tp_size > 1 maintains consistency with the mapping API pattern used throughout the codebase.


934-944: LGTM - Well-implemented cross-layer fusion setup.

The load_weights method correctly establishes cross-layer references needed for AllReduce fusion patterns. The logic properly handles the last layer (linking to final norm) and intermediate layers (linking to next layer's input normalization and attention modules).

tensorrt-cicd (Collaborator): PR_Github #12528 [ run ] triggered by Bot

hyukn requested a review from yilin-void on July 22, 2025 at 07:12
hyukn force-pushed the feat/llama_ar_fusion branch from a4d3136 to d84c4ce on July 22, 2025 at 09:20
tensorrt-cicd (Collaborator): PR_Github #12528 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9313 completed with status: 'FAILURE'

hyukn (Collaborator, Author) commented Jul 22, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator): PR_Github #12579 [ run ] triggered by Bot

hyukn force-pushed the feat/llama_ar_fusion branch from 12fa995 to 1765565 on July 25, 2025 at 02:52
coderabbitai[bot]

This comment was marked as resolved.

hyukn force-pushed the feat/llama_ar_fusion branch from 1765565 to fe8a34a on July 25, 2025 at 04:27
NVIDIA deleted three comments from tensorrt-cicd on Jul 25, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

388-390: Consider renaming environment variable for consistency.

The environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED is used in Llama4DecoderLayer but the name suggests it's for regular Llama models. Consider using TRTLLM_LLAMA4_EAGER_FUSION_DISABLED for clarity, or use a more generic name if this applies to all Llama variants.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1765565 and fe8a34a.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
🧠 Learnings (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tensorrt_llm/_torch/models/modeling_llama.py (16)

2-2: LGTM: Import addition for environment variable support.

The os import is correctly added to support the environment variable checks used for fusion control throughout the file.


341-341: LGTM: Improved condition semantics.

The change from tp_size > 1 to has_tp() improves code readability and consistency with similar checks throughout the codebase.


391-407: LGTM: Well-structured quantization-aware fusion setup.

The fusion operation configuration properly handles different quantization modes (NVFP4, FP8, default) and correctly sets different operations for MLP vs MOE layers. The logic is clear and follows the expected pattern.


420-423: LGTM: Consistent fusion configuration logic.

The fusion flags are properly configured with consistent conditions across MLP and MOE cases. The logic correctly enables fusion only when tensor parallelism is active, attention data parallelism is disabled, and fusion is enabled via environment variable.

Also applies to: 436-439


457-464: LGTM: Clean consolidation of allreduce control logic.

The pre-computed disable flags consolidate the allreduce decision logic effectively. This improves readability and maintainability by centralizing the conditions that determine when allreduce operations should be skipped.


495-496: LGTM: Proper use of consolidated allreduce flags.

The forward method correctly uses the pre-computed disable_attn_allreduce and disable_feed_forward_allreduce flags to control allreduce operations, improving code clarity.

Also applies to: 535-536


500-519: LGTM: Well-implemented pre-fusion logic with quantization support.

The pre-fusion logic properly handles quantization by extracting scales when needed and correctly unpacking NVFP4 outputs into Fp4QuantizedTensor objects. The conditional logic is clear and handles both quantized and non-quantized cases appropriately.


524-529: LGTM: Proper speculative decoding integration.

The spec metadata handling correctly disables fusion for layers captured during speculative decoding, which is important for maintaining correctness in speculative execution scenarios.


551-592: LGTM: Comprehensive post-fusion logic with cross-layer support.

The post-fusion logic properly handles cross-layer fusion by using next_layer_layernorm and next_attn references. The quantization handling and scale extraction logic is correct, and the fallback to AllReduceFusionOp.RESIDUAL_RMS_NORM for the last layer is appropriate.


610-617: LGTM: Consistent quantization detection setup.

The quantization detection flags are properly initialized and follow the same pattern as Llama4DecoderLayer, ensuring consistency across model variants.


650-654: LGTM: Smart fusion control with size-based safeguard.

The environment variable control combined with the hidden_size > 4096 check effectively prevents fusion on small models where accuracy degradation could occur, as mentioned in the PR objectives.


655-675: LGTM: Consistent fusion configuration across model variants.

The fusion flags and operations setup mirrors the Llama4DecoderLayer implementation, maintaining consistency while properly handling quantization modes and allreduce control logic.


700-768: LGTM: Well-implemented fusion logic with proper safeguards.

The forward method correctly implements the fusion patterns with proper quantization handling, spec metadata integration, and cross-layer fusion support. The logic follows the established patterns and includes appropriate safeguards for speculative decoding.


892-892: LGTM: Consistent condition update.

The change to has_tp() maintains consistency with similar condition updates throughout the file.


952-962: LGTM: Proper cross-layer fusion setup.

The load_weights method correctly establishes cross-layer references by setting next_layer_layernorm and next_attn attributes. This enables the cross-layer fusion optimizations while properly handling the final layer case.


1090-1097: LGTM: Consistent cross-layer fusion implementation.

The cross-layer fusion setup mirrors the LlamaForCausalLM implementation, ensuring consistent behavior across model variants.

tensorrt-cicd (Collaborator): PR_Github #12942 [ run ] triggered by Bot

NVIDIA deleted a comment from tensorrt-cicd on Jul 25, 2025
tensorrt-cicd (Collaborator): PR_Github #12946 [ run ] triggered by Bot

tensorrt-cicd (Collaborator): PR_Github #12942 [ run ] completed with state ABORTED

tensorrt-cicd (Collaborator): PR_Github #12946 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9654 completed with status: 'FAILURE'

…ama3/4.

* Added support for controlling fusion optimizations via environment variables.
* Applied AR+Residual + RMS_NORM + Quant fp4/fp8 fusion. This is also compatible with the speculative decoding capturing in these models.
* Some improvements for the two-shot allreduce kernel.
* Disable fusion for small models with a hidden size no greater than 4096 to avoid accuracy drop issues.

Signed-off-by: Yukun He <[email protected]>
hyukn force-pushed the feat/llama_ar_fusion branch from fe8a34a to 42bc302 on July 25, 2025 at 13:13
hyukn (Collaborator, Author) commented Jul 25, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

388-390: Consider environment variable naming consistency.

The environment variable name TRTLLM_LLAMA_EAGER_FUSION_DISABLED is used for Llama4, but this might be confusing since it doesn't distinguish between Llama3/4 models. Consider using a more generic name like TRTLLM_EAGER_FUSION_DISABLED or model-specific names.

-        self.enable_fusion = os.environ.get(
-            "TRTLLM_LLAMA_EAGER_FUSION_DISABLED", "0") == "0"
+        self.enable_fusion = os.environ.get(
+            "TRTLLM_EAGER_FUSION_DISABLED", "0") == "0"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe8a34a and 42bc302.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
🧰 Additional context used: the same path-based coding-guideline instructions (**/*.py and **/*.{cpp,h,hpp,cc,cxx,cu,py}) as in the previous review.
🔇 Additional comments (17)
tensorrt_llm/_torch/models/modeling_llama.py (17)

2-2: LGTM: Import addition is appropriate.

The os import is necessary for environment variable access used in fusion control logic.


341-341: LGTM: Improved tensor parallelism check.

Using has_tp() is more semantic and consistent with other parts of the codebase compared to checking tp_size > 1.


394-407: LGTM: Well-structured quantization-aware fusion setup.

The fusion operation setup properly handles different quantization modes (NVFP4, FP8) and correctly defaults to standard residual RMS normalization for MoE pre-fusion.


420-424: LGTM: Consistent fusion flag setup.

The fusion configuration correctly considers tensor parallelism presence, attention data parallelism, and fusion enablement for both MLP and MoE variants.

Also applies to: 436-439


457-464: LGTM: Well-consolidated allreduce control flags.

The consolidation of allreduce disable logic into boolean flags improves readability and maintainability by centralizing the decision logic.


495-497: LGTM: Proper use of consolidated allreduce control.

Using the consolidated disable_attn_allreduce flag in AllReduceParams improves code clarity and consistency.


500-519: LGTM: Excellent quantization-aware fusion implementation.

The fusion logic properly handles quantization modes by:

  • Extracting appropriate scales for NVFP4/FP8 quantization
  • Correctly unpacking allreduce output into quantized tensor wrappers
  • Maintaining proper tensor types throughout the pipeline

524-530: LGTM: Proper speculative decoding integration.

The speculative metadata handling correctly disables fusion for captured layers and updates the allreduce control flags accordingly.


551-592: LGTM: Well-implemented cross-layer fusion.

The cross-layer fusion implementation properly:

  • Determines appropriate scales based on next layer's quantization
  • Handles both regular and MOE allreduce operations
  • Correctly unpacks quantized tensors when needed
  • Falls back to standard normalization when fusion isn't available

650-654: LGTM: Good fusion safety measures.

The environment variable control and model size check (hidden_size > 4096) provide appropriate safeguards against accuracy degradation in small models.

Note: Same environment variable naming consideration applies here as mentioned for Llama4DecoderLayer.


655-674: LGTM: Consistent fusion setup with Llama4.

The fusion configuration and allreduce control flags mirror the Llama4DecoderLayer implementation, providing consistency across model variants.


695-697: LGTM: Consistent allreduce control.

Using the consolidated disable_attn_allreduce flag maintains consistency with the Llama4 implementation.


700-721: LGTM: Robust quantization-aware pre-MLP fusion.

The implementation properly handles quantization scales and tensor unpacking while maintaining backward compatibility.


723-730: LGTM: Defensive speculative decoding handling.

The use of hasattr to check for is_layer_capture provides good backward compatibility and prevents attribute errors.


747-768: LGTM: Comprehensive cross-layer fusion implementation.

The post-MLP fusion logic correctly handles quantization modes, scale determination, and tensor unpacking with proper fallback to standard normalization.


892-892: LGTM: Consistent tensor parallelism check.

Using has_tp() aligns with the similar change in Llama4MoE and improves code consistency.


952-962: LGTM: Essential cross-layer fusion setup.

The load_weights method properly establishes the cross-layer references needed for fusion optimizations:

  • Links each layer to the next layer's input normalization and attention
  • Correctly handles the final layer by pointing to the model's output normalization
  • Enables the cross-layer fusion functionality implemented in the forward methods

tensorrt-cicd (Collaborator): PR_Github #13012 [ run ] triggered by Bot

tensorrt-cicd (Collaborator): PR_Github #13012 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9716 completed with status: 'FAILURE'

hyukn (Collaborator, Author) commented Jul 27, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd (Collaborator): PR_Github #13107 [ run ] triggered by Bot

tensorrt-cicd (Collaborator): PR_Github #13107 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9804 completed with status: 'SUCCESS'

litaotju merged commit 93a0fd0 into NVIDIA:main on Jul 28, 2025 (3 checks passed)
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025