[None][feat] DeepEP LL combine FP4 #6822

yilin-void · 2025-08-12T09:43:51Z

DeepEP diff: https://github.com/deepseek-ai/DeepEP/compare/edf3ea2b086a393d3163bf2773eab69d9191cc01...515a311f290eb6d9592fcccfcc80c40f5123ca72?expand=1

Summary by CodeRabbit

New Features
- Added optional low-precision FP4 combine path for MoE in low-latency all‑to‑all mode, gated by an environment variable (default off) and using per-token global scales.
Chores
- Updated DeepEP dependency commit.
- Adjusted CUDA architecture targets to include SM100a/SM110a/SM120a variants required for FP4-related instructions; existing 9.x targets remain unchanged.

coderabbitai · 2025-08-12T09:43:58Z

📝 Walkthrough

Walkthrough

Bumps DeepEP submodule commit and adjusts CUDA arch suffixing for FP4 conversions in CMake. Adds an FP4-specific low-latency combine API in DeepEP utils (replacing the prior generic combine). Adds a gated low-precision FP4 combine path to WideEPMoE that computes per-token global scales and calls the new FP4 combine.

Changes

Cohort / File(s)	Summary of changes
DeepEP CMake integration `cpp/tensorrt_llm/deep_ep/CMakeLists.txt`	Bumps `DEEP_EP_COMMIT` to `515a311f...`; make CUDA arch append conditional so SM100/110/120 use an “a” suffix (`SM100a`/`SM110a`/`SM120a`) required by FP4 conversion instructions; existing 9.x logic unchanged.
DeepEP utils FP4 combine API `tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py`	Adds `low_latency_combine_fp4(hidden_states, global_scales, topk_idx, topk_weights, handle)` and removes the previous `low_latency_combine`; minor cleanup of comments in `low_latency_dispatch_fp4`.
WideEPMoE FP4 low-precision combine path `tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py`	Adds `use_low_precision_combine` flag (env-gated, requires NVFP4, only when alltoall enabled); when enabled computes per-token `global_scales` from `final_hidden_states` and calls `deep_ep_buffer.low_latency_combine_fp4(...)`; preserves high-precision path when disabled.

Sequence Diagram(s)

sequenceDiagram
    participant WideEPMoE
    participant DeepEPBuffer

    WideEPMoE->>WideEPMoE: forward_chunk(...) -> final_hidden_states
    alt use_low_precision_combine && DeepEPLowLatency
        WideEPMoE->>WideEPMoE: global_scales = max_abs(final_hidden_states) per token
        WideEPMoE->>DeepEPBuffer: low_latency_combine_fp4(hidden_states, global_scales, topk_idx, topk_weights, handle)
        DeepEPBuffer-->>WideEPMoE: combined_hidden_states (FP4)
    else
        WideEPMoE->>DeepEPBuffer: low_latency_combine(hidden_states, topk_idx, topk_weights, handle)
        DeepEPBuffer-->>WideEPMoE: combined_hidden_states (high-precision)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

DeepEP LL dispatch FP4 #6296: Modifies DeepEP FP4 low-latency paths and touches the same DeepEP utils and WideEPMoE modules.
DeepEP LL support variable hidden size and tokens num #6141: Updates DeepEP submodule reference and alters DeepEP buffer/dispatch/combine handling; overlaps with the new combine API changes.
[None][fix] Pre-allocate workspaces for DeepGEMM MoE to avoid frequent cudaFree/cudaMalloc #6811: Changes WideEPMoE implementation and gating behavior in fused_moe_wide_ep.py, affecting the same class/file modified here.

Suggested reviewers

yuantailing
yizhang-nv
litaotju
yuxianq

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 118f9e7 and f8730d5.

📒 Files selected for processing (1)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

yilin-void · 2025-08-12T09:49:19Z

/bot run

tensorrt-cicd · 2025-08-12T09:54:35Z

PR_Github #14947 [ run ] triggered by Bot

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
176-176: Line too long - consider breaking for readability.

Line 176 exceeds the 120 character limit. Consider breaking it across multiple lines for better readability.
-        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
-            self.buffer.low_latency_dispatch_fp4(hidden_states, scales, topk_idx, num_max_dispatch_tokens_per_rank, num_experts)
+        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
+            self.buffer.low_latency_dispatch_fp4(
+                hidden_states, scales, topk_idx, 
+                num_max_dispatch_tokens_per_rank, num_experts
+            )
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
543-544: FP4 quantization call updated to positional arguments.

The change from named arguments to positional booleans in the FP4 quantization call maintains the same behavior (both False), but reduces code clarity. Consider adding a comment to document what these boolean parameters represent.
                         x, x_sf = torch.ops.trtllm.fp4_quantize(
                             x, self.fc31_input_scale, self.scaling_vector_size,
-                            False, False)
+                            False,  # sfUseUE8M0
+                            False)  # swizzedLayout

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8845e0f and b549573.

📒 Files selected for processing (3)

cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

🪛 Ruff (0.12.2)

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

176-176: Line too long (128 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (3)

cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)

1-1: DeepEP commit hash verified

The commit 515a311f290eb6d9592fcccfcc80c40f5123ca72 exists in the deepseek-ai/DeepEP repository, and the comparison from the previous hash (edf3ea2b086a393d3163bf2773eab69d9191cc01) is accessible. No further action needed.

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (2)

182-191: LGTM! FP4-specific combine method properly implemented.

The new low_latency_combine_fp4 method correctly delegates to the buffer's FP4 combine implementation with appropriate assertions for event and hook handling.

157-168: No breaking change: low_latency_combine signature is unchanged
The method low_latency_combine in deep_ep_utils.py still takes exactly four arguments (hidden_states, topk_idx, topk_weights, handle). Only the FP4 variant—low_latency_combine_fp4—accepts an additional global_scales parameter, and its callers in fused_moe_wide_ep.py have already been updated. You can safely ignore the original breaking-change warning.

• tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py@157–168:
def low_latency_combine(self, hidden_states, topk_idx, topk_weights, handle)
• tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py@182–187:
def low_latency_combine_fp4(self, hidden_states, global_scales, topk_idx, topk_weights, handle)
• tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py: all calls to low_latency_combine_fp4 include the new global_scales argument.

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/deep_ep/CMakeLists.txt

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

tensorrt-cicd · 2025-08-12T11:39:49Z

PR_Github #14947 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11282 completed with status: 'FAILURE'

yilin-void · 2025-08-12T12:11:25Z

/bot run

tensorrt-cicd · 2025-08-12T12:16:35Z

PR_Github #14967 [ run ] triggered by Bot

tensorrt-cicd · 2025-08-12T13:26:20Z

PR_Github #14967 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11299 completed with status: 'FAILURE'

Signed-off-by: Yilin Zhang <[email protected]>

yilin-void · 2025-08-13T01:04:20Z

/bot run

tensorrt-cicd · 2025-08-13T01:09:35Z

PR_Github #15040 [ run ] triggered by Bot

coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (1)

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)

209-214: Fix NameError in type annotations: replace undefined Map with Dict

Map is not imported and will raise at runtime (Python 3.8 evaluates annotations). Use typing.Dict.

Apply this diff in the selected lines:

-        self.buffers: Map[Mapping,
-                          weakref.ReferenceType[VariableLengthBuffer]] = {}
-        self.low_latency_buffers: Map[
-            Mapping,
-            weakref.ReferenceType[VariableLengthLowLatencyBuffer]] = {}
+        self.buffers: Dict[Mapping,
+                           weakref.ReferenceType[VariableLengthBuffer]] = {}
+        self.low_latency_buffers: Dict[
+            Mapping,
+            weakref.ReferenceType[VariableLengthLowLatencyBuffer]] = {}

Additionally, update the imports at the top of this file:

from typing import Dict, List, Tuple, Union

♻️ Duplicate comments (1)

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

193-195: Document TRTLLM_MOE_USE_LOW_PRECISION_COMBINE and its constraints

Add a brief inline comment explaining purpose, expected values, default, and gating conditions. This mirrors an existing suggestion.

Apply this diff in the selected lines:

-            self.use_low_precision_combine = (os.environ.get(
-                "TRTLLM_MOE_USE_LOW_PRECISION_COMBINE", "0")
-                                              == "1") and qm.has_nvfp4()
+            # Enable FP4 low-precision combine for DeepEPLowLatency when set to "1".
+            # Default "0" (disabled). Effective only if quant_mode.has_nvfp4() is true.
+            # Note: The flag is only used on DeepEPLowLatency alltoall path.
+            self.use_low_precision_combine = (
+                os.environ.get("TRTLLM_MOE_USE_LOW_PRECISION_COMBINE", "0") == "1"
+            ) and qm.has_nvfp4()

🧹 Nitpick comments (6)

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3)
182-188: Add a brief docstring for low_latency_combine_fp4 to clarify shapes and semantics

Document expected shapes and dtypes to avoid integration mistakes.

Apply this diff in the selected lines:
-    def low_latency_combine_fp4(self, hidden_states: torch.Tensor,
-                                global_scales: torch.Tensor,
-                                topk_idx: torch.Tensor,
-                                topk_weights: torch.Tensor, handle: Tuple):
+    def low_latency_combine_fp4(self, hidden_states: torch.Tensor,
+                                global_scales: torch.Tensor,
+                                topk_idx: torch.Tensor,
+                                topk_weights: torch.Tensor, handle: Tuple):
+        """
+        Low-latency combine for FP4 path.
+        Args:
+            hidden_states: [num_local_experts, tokens_per_expert, hidden_size], dtype matches compute (e.g., bf16/fp16).
+            global_scales: [num_local_experts, tokens_per_expert, 1], torch.float32 per-token global scale.
+            topk_idx: [num_tokens, top_k], torch.int32, original routing indices.
+            topk_weights: [num_tokens, top_k], torch.float32, original router weights (may be ignored by FP4 kernel).
+            handle: Opaque handle from low_latency_dispatch(_fp4).
+        Returns:
+            Combined hidden states tensor with the same dtype as input compute.
+        """
182-188: Wrap long calls to satisfy line-length check (E501 at Line 176)

Ruff flagged a long line earlier in this file. Wrap the FP4 combine/dispatch calls for readability and lint compliance.

Apply this formatting to the earlier dispatch and the call here:
# Earlier around line 176:
recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
    self.buffer.low_latency_dispatch_fp4(
        hidden_states,
        scales,
        topk_idx,
        num_max_dispatch_tokens_per_rank,
        num_experts,
    )

# Here:
combined_hidden_states, event, hook = \
    self.buffer.low_latency_combine_fp4(
        hidden_states,
        global_scales,
        topk_idx,
        topk_weights,
        handle,
    )
157-168: Missing NVIDIA copyright header

Per coding guidelines, prepend the current NVIDIA copyright header.

Add at the top of the file:
# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
692-701: Add a runtime guard to ensure FP4 combine is used only when supported

Make the failure mode explicit if misconfigured in downstream usage.

Apply this diff in the selected lines:
-                if self.use_low_precision_combine:
+                if self.use_low_precision_combine:
+                    assert self.has_nvfp4, "FP4 low-precision combine requires NVFP4 support"
692-701: Add minimal tests for the FP4 low-precision combine branch

There are no tests covering low_latency_combine_fp4. Add a unit/integration test that:

Exercises use_low_precision_combine=True on DeepEPLowLatency path with small tensors.

Verifies shapes/dtypes and that scales are finite (no inf/NaN) when inputs contain zeros.

I can draft a minimal test scaffold that mocks DeepEP buffer to validate the branch logic. Want me to open a follow-up PR with tests?

187-195: Missing NVIDIA copyright header

Per coding guidelines, prepend the current NVIDIA copyright header.

Add at the top of the file:
# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b549573 and 118f9e7.

📒 Files selected for processing (3)

cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

cpp/tensorrt_llm/deep_ep/CMakeLists.txt

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

🪛 Ruff (0.12.2)

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py

176-176: Line too long (128 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (1)

tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)

157-168: AI summary mismatch: non-FP4 low_latency_combine still exists

The summary claims the generic combine entry was removed, but this method is present and used by the fallback path. No change requested; flagging for clarity.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

tensorrt-cicd · 2025-08-13T01:27:12Z

PR_Github #15040 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11360 completed with status: 'FAILURE'

yilin-void · 2025-08-13T01:46:23Z

/bot run

tensorrt-cicd · 2025-08-13T01:56:34Z

PR_Github #15045 [ run ] triggered by Bot

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

tensorrt-cicd · 2025-08-13T07:39:48Z

PR_Github #15045 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11363 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

yilin-void · 2025-08-13T07:59:01Z

/bot reuse-pipeline

yilin-void · 2025-08-13T07:59:42Z

/bot reuse-pipeline

tensorrt-cicd · 2025-08-13T08:04:14Z

PR_Github #15098 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-08-13T08:04:59Z

PR_Github #15099 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-08-13T08:05:02Z

PR_Github #15098 [ reuse-pipeline ] completed with state ABORTED
Can't reuse PR_Github #0 with status: UNKNOWN

tensorrt-cicd · 2025-08-13T08:20:19Z

PR_Github #15099 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #15045 for commit f8730d5

yilin-void requested review from yuantailing and hyukn August 12, 2025 09:43

yilin-void self-assigned this Aug 12, 2025

yilin-void requested a review from a team as a code owner August 12, 2025 09:43

yilin-void requested a review from hlu1 August 12, 2025 09:43

yilin-void requested review from hlu1 and removed request for hlu1 and hyukn August 12, 2025 09:44

yilin-void changed the title ~~DeepEP LL combine FP4~~ [Feat] DeepEP LL combine FP4 Aug 12, 2025

yilin-void changed the title ~~[Feat] DeepEP LL combine FP4~~ [None][Feat] DeepEP LL combine FP4 Aug 12, 2025

yilin-void changed the title ~~[None][Feat] DeepEP LL combine FP4~~ [None][feat] DeepEP LL combine FP4 Aug 12, 2025

coderabbitai bot reviewed Aug 12, 2025

View reviewed changes

cpp/tensorrt_llm/deep_ep/CMakeLists.txt Show resolved Hide resolved

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py Show resolved Hide resolved

DeepEP LL combine FP4

118f9e7

Signed-off-by: Yilin Zhang <[email protected]>

yilin-void force-pushed the deep_ep/fp4_combine branch from b549573 to 118f9e7 Compare August 13, 2025 01:04

coderabbitai bot reviewed Aug 13, 2025

View reviewed changes

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py Show resolved Hide resolved

yuantailing approved these changes Aug 13, 2025

View reviewed changes

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py Outdated Show resolved Hide resolved

yuxianq approved these changes Aug 13, 2025

View reviewed changes

yilin-void enabled auto-merge (squash) August 13, 2025 07:58

Merge branch 'main' into deep_ep/fp4_combine

f8730d5

yilin-void merged commit 1d80df0 into NVIDIA:main Aug 13, 2025
5 checks passed

This was referenced Aug 21, 2025

[None][perf] Accelerate global scale calculations for deepEP fp4 combine #7126

Merged

[TRTLLM-6876][feat] Add low precision all2all for mnnvl #7155

Merged

coderabbitai bot mentioned this pull request Sep 2, 2025

[None] [test] Add MNNVL AlltoAll tests to pre-merge #7465

Merged

1 task

[None][feat] DeepEP LL combine FP4 #6822

[None][feat] DeepEP LL combine FP4 #6822

Uh oh!

Conversation

yilin-void commented Aug 12, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

yilin-void commented Aug 12, 2025

Uh oh!

tensorrt-cicd commented Aug 12, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Aug 12, 2025

Uh oh!

yilin-void commented Aug 12, 2025

Uh oh!

tensorrt-cicd commented Aug 12, 2025

Uh oh!

tensorrt-cicd commented Aug 12, 2025

Uh oh!

yilin-void commented Aug 13, 2025

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

yilin-void commented Aug 13, 2025

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

yilin-void commented Aug 13, 2025

Uh oh!

yilin-void commented Aug 13, 2025

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

tensorrt-cicd commented Aug 13, 2025

Uh oh!

Uh oh!

Uh oh!

yilin-void commented Aug 12, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 12, 2025 •

edited

Loading