Conversation

@yilin-void (Collaborator) commented Aug 12, 2025

DeepEP diff: https://github.com/deepseek-ai/DeepEP/compare/edf3ea2b086a393d3163bf2773eab69d9191cc01...515a311f290eb6d9592fcccfcc80c40f5123ca72?expand=1

Summary by CodeRabbit

  • New Features

    • Added optional low-precision FP4 combine path for MoE in low-latency all‑to‑all mode, gated by an environment variable (default off) and using per-token global scales.
  • Chores

    • Updated DeepEP dependency commit.
    • Adjusted CUDA architecture targets to include SM100a/SM110a/SM120a variants required for FP4-related instructions; existing 9.x targets remain unchanged.

@yilin-void self-assigned this on Aug 12, 2025
@yilin-void requested a review from a team as a code owner on August 12, 2025 09:43
@yilin-void requested a review from hlu1 on August 12, 2025 09:43
coderabbitai bot (Contributor) commented Aug 12, 2025

📝 Walkthrough

Walkthrough

Bumps DeepEP submodule commit and adjusts CUDA arch suffixing for FP4 conversions in CMake. Adds an FP4-specific low-latency combine API in DeepEP utils (replacing the prior generic combine). Adds a gated low-precision FP4 combine path to WideEPMoE that computes per-token global scales and calls the new FP4 combine.

Changes

Cohort / File(s) Summary of changes
DeepEP CMake integration
cpp/tensorrt_llm/deep_ep/CMakeLists.txt
Bumps DEEP_EP_COMMIT to 515a311f...; makes the CUDA arch append conditional so that SM100/110/120 use an “a” suffix (SM100a/SM110a/SM120a), which is required by the FP4 conversion instructions; the existing 9.x logic is unchanged.
DeepEP utils FP4 combine API
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
Adds low_latency_combine_fp4(hidden_states, global_scales, topk_idx, topk_weights, handle) and removes the previous low_latency_combine; minor cleanup of comments in low_latency_dispatch_fp4.
WideEPMoE FP4 low-precision combine path
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
Adds use_low_precision_combine flag (env-gated, requires NVFP4, only when alltoall enabled); when enabled computes per-token global_scales from final_hidden_states and calls deep_ep_buffer.low_latency_combine_fp4(...); preserves high-precision path when disabled.
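The per-token global scale is the key new quantity in this path. A minimal sketch of how it could be derived, assuming the common NVFP4 convention global_scale = (448 × 6) / amax (FP8 E4M3 max times FP4 E2M1 max); the sequence diagram below summarizes this as a per-token max-abs computation, and the actual implementation may clamp or fuse it differently:

import torch

def compute_per_token_global_scales(final_hidden_states: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: one float32 global scale per token for the FP4 combine."""
    # Per-token max-abs over the hidden dimension, promoted to float32.
    amax = final_hidden_states.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    # Guard against all-zero tokens so the scale stays finite.
    amax = torch.clamp(amax, min=1e-10)
    # Assumed NVFP4 convention: 448 (FP8 E4M3 max) * 6 (FP4 E2M1 max) over amax.
    return (448.0 * 6.0) / amax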

Sequence Diagram(s)

sequenceDiagram
    participant WideEPMoE
    participant DeepEPBuffer

    WideEPMoE->>WideEPMoE: forward_chunk(...) -> final_hidden_states
    alt use_low_precision_combine && DeepEPLowLatency
        WideEPMoE->>WideEPMoE: global_scales = max_abs(final_hidden_states) per token
        WideEPMoE->>DeepEPBuffer: low_latency_combine_fp4(hidden_states, global_scales, topk_idx, topk_weights, handle)
        DeepEPBuffer-->>WideEPMoE: combined_hidden_states (FP4)
    else
        WideEPMoE->>DeepEPBuffer: low_latency_combine(hidden_states, topk_idx, topk_weights, handle)
        DeepEPBuffer-->>WideEPMoE: combined_hidden_states (high-precision)
    end
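Expressed as Python, the branch in the diagram amounts roughly to the sketch below; the buffer and argument names (deep_ep_buffer, topk_idx, topk_weights, handle) and the guard flag are taken from the summary above, while the surrounding function is purely illustrative, not the actual forward_chunk code:

def combine_after_experts(deep_ep_buffer, final_hidden_states, topk_idx, topk_weights,
                          handle, use_low_precision_combine: bool):
    # Hypothetical wrapper mirroring the sequence diagram.
    if use_low_precision_combine:
        # Per-token global scales, e.g. via the helper sketched after the changes table.
        global_scales = compute_per_token_global_scales(final_hidden_states)
        return deep_ep_buffer.low_latency_combine_fp4(
            final_hidden_states, global_scales, topk_idx, topk_weights, handle)
    return deep_ep_buffer.low_latency_combine(
        final_hidden_states, topk_idx, topk_weights, handle)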

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • yuantailing
  • yizhang-nv
  • litaotju
  • yuxianq

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 118f9e7 and f8730d5.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@yilin-void requested review from hlu1 and removed request for hlu1 and hyukn on August 12, 2025 09:44
@yilin-void changed the title from "DeepEP LL combine FP4" to "[Feat] DeepEP LL combine FP4" on Aug 12, 2025
@yilin-void changed the title from "[Feat] DeepEP LL combine FP4" to "[None][Feat] DeepEP LL combine FP4" on Aug 12, 2025
@yilin-void changed the title from "[None][Feat] DeepEP LL combine FP4" to "[None][feat] DeepEP LL combine FP4" on Aug 12, 2025
@yilin-void (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #14947 [ run ] triggered by Bot

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)

176-176: Line too long - consider breaking for readability.

Line 176 exceeds the 120 character limit. Consider breaking it across multiple lines for better readability.

-        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
-            self.buffer.low_latency_dispatch_fp4(hidden_states, scales, topk_idx, num_max_dispatch_tokens_per_rank, num_experts)
+        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
+            self.buffer.low_latency_dispatch_fp4(
+                hidden_states, scales, topk_idx, 
+                num_max_dispatch_tokens_per_rank, num_experts
+            )
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

543-544: FP4 quantization call updated to positional arguments.

The change from named arguments to positional booleans in the FP4 quantization call maintains the same behavior (both False), but reduces code clarity. Consider adding a comment to document what these boolean parameters represent.

                         x, x_sf = torch.ops.trtllm.fp4_quantize(
                             x, self.fc31_input_scale, self.scaling_vector_size,
-                            False, False)
+                            False,  # sfUseUE8M0
+                            False)  # swizzledLayout
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8845e0f and b549573.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

176-176: Line too long (128 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)

1-1: DeepEP commit hash verified

The commit 515a311f290eb6d9592fcccfcc80c40f5123ca72 exists in the deepseek-ai/DeepEP repository, and the comparison from the previous hash (edf3ea2b086a393d3163bf2773eab69d9191cc01) is accessible. No further action needed.

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (2)

182-191: LGTM! FP4-specific combine method properly implemented.

The new low_latency_combine_fp4 method correctly delegates to the buffer's FP4 combine implementation with appropriate assertions for event and hook handling.


157-168: No breaking change: low_latency_combine signature is unchanged
The method low_latency_combine in deep_ep_utils.py still takes exactly four arguments (hidden_states, topk_idx, topk_weights, handle). Only the FP4 variant—low_latency_combine_fp4—accepts an additional global_scales parameter, and its callers in fused_moe_wide_ep.py have already been updated. You can safely ignore the original breaking-change warning.

• tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py@157–168:
def low_latency_combine(self, hidden_states, topk_idx, topk_weights, handle)
• tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py@182–187:
def low_latency_combine_fp4(self, hidden_states, global_scales, topk_idx, topk_weights, handle)
• tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py: all calls to low_latency_combine_fp4 include the new global_scales argument.

Likely an incorrect or invalid review comment.

@tensorrt-cicd (Collaborator):

PR_Github #14947 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11282 completed with status: 'FAILURE'

@yilin-void (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #14967 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #14967 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11299 completed with status: 'FAILURE'

Signed-off-by: Yilin Zhang <[email protected]>
@yilin-void force-pushed the deep_ep/fp4_combine branch from b549573 to 118f9e7 on August 13, 2025 01:04
@yilin-void (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #15040 [ run ] triggered by Bot

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)

209-214: Fix NameError in type annotations: replace undefined Map with Dict

Map is not imported and will raise at runtime (Python 3.8 evaluates annotations). Use typing.Dict.

Apply this diff in the selected lines:

-        self.buffers: Map[Mapping,
-                          weakref.ReferenceType[VariableLengthBuffer]] = {}
-        self.low_latency_buffers: Map[
-            Mapping,
-            weakref.ReferenceType[VariableLengthLowLatencyBuffer]] = {}
+        self.buffers: Dict[Mapping,
+                           weakref.ReferenceType[VariableLengthBuffer]] = {}
+        self.low_latency_buffers: Dict[
+            Mapping,
+            weakref.ReferenceType[VariableLengthLowLatencyBuffer]] = {}

Additionally, update the imports at the top of this file:

from typing import Dict, List, Tuple, Union
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

193-195: Document TRTLLM_MOE_USE_LOW_PRECISION_COMBINE and its constraints

Add a brief inline comment explaining purpose, expected values, default, and gating conditions. This mirrors an existing suggestion.

Apply this diff in the selected lines:

-            self.use_low_precision_combine = (os.environ.get(
-                "TRTLLM_MOE_USE_LOW_PRECISION_COMBINE", "0")
-                                              == "1") and qm.has_nvfp4()
+            # Enable FP4 low-precision combine for DeepEPLowLatency when set to "1".
+            # Default "0" (disabled). Effective only if quant_mode.has_nvfp4() is true.
+            # Note: The flag is only used on DeepEPLowLatency alltoall path.
+            self.use_low_precision_combine = (
+                os.environ.get("TRTLLM_MOE_USE_LOW_PRECISION_COMBINE", "0") == "1"
+            ) and qm.has_nvfp4()
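For completeness, a hedged usage sketch of how the flag would be enabled from a launcher script; only the environment variable name comes from the diff above, the rest is illustrative:

import os

# Opt in to the FP4 low-precision combine before the MoE module is constructed.
# Takes effect only with NVFP4 quantization on the DeepEP low-latency alltoall path;
# otherwise the high-precision combine path is used.
os.environ["TRTLLM_MOE_USE_LOW_PRECISION_COMBINE"] = "1"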
🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3)

182-188: Add a brief docstring for low_latency_combine_fp4 to clarify shapes and semantics

Document expected shapes and dtypes to avoid integration mistakes.

Apply this diff in the selected lines:

-    def low_latency_combine_fp4(self, hidden_states: torch.Tensor,
-                                global_scales: torch.Tensor,
-                                topk_idx: torch.Tensor,
-                                topk_weights: torch.Tensor, handle: Tuple):
+    def low_latency_combine_fp4(self, hidden_states: torch.Tensor,
+                                global_scales: torch.Tensor,
+                                topk_idx: torch.Tensor,
+                                topk_weights: torch.Tensor, handle: Tuple):
+        """
+        Low-latency combine for FP4 path.
+        Args:
+            hidden_states: [num_local_experts, tokens_per_expert, hidden_size], dtype matches compute (e.g., bf16/fp16).
+            global_scales: [num_local_experts, tokens_per_expert, 1], torch.float32 per-token global scale.
+            topk_idx: [num_tokens, top_k], torch.int32, original routing indices.
+            topk_weights: [num_tokens, top_k], torch.float32, original router weights (may be ignored by FP4 kernel).
+            handle: Opaque handle from low_latency_dispatch(_fp4).
+        Returns:
+            Combined hidden states tensor with the same dtype as input compute.
+        """

182-188: Wrap long calls to satisfy line-length check (E501 at Line 176)

Ruff flagged a long line earlier in this file. Wrap the FP4 combine/dispatch calls for readability and lint compliance.

Apply this formatting to the earlier dispatch and the call here:

# Earlier around line 176:
recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
    self.buffer.low_latency_dispatch_fp4(
        hidden_states,
        scales,
        topk_idx,
        num_max_dispatch_tokens_per_rank,
        num_experts,
    )

# Here:
combined_hidden_states, event, hook = \
    self.buffer.low_latency_combine_fp4(
        hidden_states,
        global_scales,
        topk_idx,
        topk_weights,
        handle,
    )

157-168: Missing NVIDIA copyright header

Per coding guidelines, prepend the current NVIDIA copyright header.

Add at the top of the file:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)

692-701: Add a runtime guard to ensure FP4 combine is used only when supported

Make the failure mode explicit if misconfigured in downstream usage.

Apply this diff in the selected lines:

-                if self.use_low_precision_combine:
+                if self.use_low_precision_combine:
+                    assert self.has_nvfp4, "FP4 low-precision combine requires NVFP4 support"

692-701: Add minimal tests for the FP4 low-precision combine branch

There are no tests covering low_latency_combine_fp4. Add a unit/integration test that:

  • Exercises use_low_precision_combine=True on DeepEPLowLatency path with small tensors.
  • Verifies shapes/dtypes and that scales are finite (no inf/NaN) when inputs contain zeros.

I can draft a minimal test scaffold that mocks DeepEP buffer to validate the branch logic. Want me to open a follow-up PR with tests?
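A minimal sketch of such a test, mocking the DeepEP buffer rather than running real kernels; all names here are illustrative assumptions, not the existing test harness:

import torch
from unittest import mock

def test_fp4_low_precision_combine_branch():
    # Tokens that include all-zero rows, to check the scales stay finite.
    hidden = torch.zeros(4, 8, dtype=torch.bfloat16)
    amax = hidden.abs().amax(dim=-1, keepdim=True).to(torch.float32).clamp(min=1e-10)
    global_scales = (448.0 * 6.0) / amax
    assert torch.isfinite(global_scales).all()

    # Stand-in for the DeepEP low-latency buffer; the real test would patch
    # the module's deep_ep_buffer attribute instead.
    buffer = mock.Mock()
    buffer.low_latency_combine_fp4.return_value = torch.zeros(4, 8, dtype=torch.bfloat16)
    out = buffer.low_latency_combine_fp4(hidden, global_scales, None, None, None)
    buffer.low_latency_combine_fp4.assert_called_once()
    assert out.shape == hidden.shape and out.dtype == hidden.dtype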


187-195: Missing NVIDIA copyright header

Per coding guidelines, prepend the current NVIDIA copyright header.

Add at the top of the file:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b549573 and 118f9e7.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

176-176: Line too long (128 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)

157-168: AI summary mismatch: non-FP4 low_latency_combine still exists

The summary claims the generic combine entry was removed, but this method is present and used by the fallback path. No change requested; flagging for clarity.

@tensorrt-cicd (Collaborator):

PR_Github #15040 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11360 completed with status: 'FAILURE'

@yilin-void (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #15045 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #15045 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11363 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@yilin-void enabled auto-merge (squash) on August 13, 2025 07:58
@yilin-void (Collaborator, Author):

/bot reuse-pipeline

@yilin-void (Collaborator, Author):

/bot reuse-pipeline

@tensorrt-cicd (Collaborator):

PR_Github #15098 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #15099 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #15098 [ reuse-pipeline ] completed with state ABORTED
Can't reuse PR_Github #0 with status: UNKNOWN

@tensorrt-cicd (Collaborator):

PR_Github #15099 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #15045 for commit f8730d5

@yilin-void merged commit 1d80df0 into NVIDIA:main on Aug 13, 2025
5 checks passed