[None][feat] DeepEP LL combine FP4 #6822
Conversation
📝 Walkthrough

Bumps the DeepEP submodule commit and adjusts CUDA arch suffixing for FP4 conversions in CMake. Adds an FP4-specific low-latency combine API in DeepEP utils (replacing the prior generic combine). Adds a gated low-precision FP4 combine path to WideEPMoE that computes per-token global scales and calls the new FP4 combine; a sketch of this branch follows the diagram below.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant WideEPMoE
    participant DeepEPBuffer
    WideEPMoE->>WideEPMoE: forward_chunk(...) -> final_hidden_states
    alt use_low_precision_combine && DeepEPLowLatency
        WideEPMoE->>WideEPMoE: global_scales = max_abs(final_hidden_states) per token
        WideEPMoE->>DeepEPBuffer: low_latency_combine_fp4(hidden_states, global_scales, topk_idx, topk_weights, handle)
        DeepEPBuffer-->>WideEPMoE: combined_hidden_states (FP4)
    else
        WideEPMoE->>DeepEPBuffer: low_latency_combine(hidden_states, topk_idx, topk_weights, handle)
        DeepEPBuffer-->>WideEPMoE: combined_hidden_states (high-precision)
    end
```
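To make the gated flow concrete, here is a minimal Python sketch of the branch, under stated assumptions: the `deep_ep_buffer` attribute name, the amax-based scale formula, and the tensor layouts are illustrative guesses based on this summary, not the verbatim WideEPMoE code.

```python
import torch


def combine(self, final_hidden_states, topk_idx, topk_weights, handle):
    """Illustrative sketch of the combine branch described in the walkthrough."""
    if self.use_low_precision_combine:
        # Per-token global scale: max absolute value over the hidden
        # dimension, kept in float32 for the FP4 combine kernel.
        global_scales = final_hidden_states.abs().max(
            dim=-1, keepdim=True).values.to(torch.float32)
        return self.deep_ep_buffer.low_latency_combine_fp4(
            final_hidden_states, global_scales, topk_idx, topk_weights,
            handle)
    # High-precision fallback path.
    return self.deep_ep_buffer.low_latency_combine(
        final_hidden_states, topk_idx, topk_weights, handle)
```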
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
/bot run
PR_Github #14947 [ run ] triggered by Bot
Actionable comments posted: 4
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
176-176: Line too long; consider breaking for readability.

Line 176 exceeds the 120-character limit. Consider breaking it across multiple lines for better readability.
```diff
-        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
-            self.buffer.low_latency_dispatch_fp4(hidden_states, scales, topk_idx, num_max_dispatch_tokens_per_rank, num_experts)
+        recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
+            self.buffer.low_latency_dispatch_fp4(
+                hidden_states, scales, topk_idx,
+                num_max_dispatch_tokens_per_rank, num_experts
+            )
```

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
543-544: FP4 quantization call updated to positional arguments.

The change from named arguments to positional booleans in the FP4 quantization call maintains the same behavior (both False) but reduces code clarity. Consider adding a comment to document what these boolean parameters represent.
```diff
         x, x_sf = torch.ops.trtllm.fp4_quantize(
             x, self.fc31_input_scale, self.scaling_vector_size,
-            False, False)
+            False,  # sfUseUE8M0
+            False)  # swizzledLayout
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
176-176: Line too long (128 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt (1)
1-1: DeepEP commit hash verified

The commit 515a311f290eb6d9592fcccfcc80c40f5123ca72 exists in the deepseek-ai/DeepEP repository, and the comparison from the previous hash (edf3ea2b086a393d3163bf2773eab69d9191cc01) is accessible. No further action needed.
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (2)
182-191: LGTM! FP4-specific combine method properly implemented.

The new `low_latency_combine_fp4` method correctly delegates to the buffer's FP4 combine implementation with appropriate assertions for event and hook handling.
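For reference, the delegation pattern this comment describes might look like the following sketch; the assertion details and the event-wrapper layout are assumptions inferred from the comment, not the verbatim source.

```python
def low_latency_combine_fp4(self, hidden_states, global_scales, topk_idx,
                            topk_weights, handle):
    # Delegate to the underlying DeepEP buffer; the synchronous path is
    # expected here, so no CUDA event or deferred hook should be returned.
    combined_hidden_states, event, hook = self.buffer.low_latency_combine_fp4(
        hidden_states, global_scales, topk_idx, topk_weights, handle)
    assert event.event is None  # assumed event-wrapper attribute
    assert hook is None
    return combined_hidden_states
```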
157-168: No breaking change: `low_latency_combine` signature is unchanged

The method `low_latency_combine` in `deep_ep_utils.py` still takes exactly four arguments (hidden_states, topk_idx, topk_weights, handle). Only the FP4 variant, `low_latency_combine_fp4`, accepts an additional `global_scales` parameter, and its callers in `fused_moe_wide_ep.py` have already been updated. You can safely ignore the original breaking-change warning.

- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py@157-168: `def low_latency_combine(self, hidden_states, topk_idx, topk_weights, handle)`
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py@182-187: `def low_latency_combine_fp4(self, hidden_states, global_scales, topk_idx, topk_weights, handle)`
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py: all calls to `low_latency_combine_fp4` include the new `global_scales` argument.

Likely an incorrect or invalid review comment.
PR_Github #14947 [ run ] completed with state
/bot run
PR_Github #14967 [ run ] triggered by Bot
PR_Github #14967 [ run ] completed with state
Signed-off-by: Yilin Zhang <[email protected]>
Force-pushed from b549573 to 118f9e7
/bot run
PR_Github #15040 [ run ] triggered by Bot
Actionable comments posted: 2
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
209-214: Fix NameError in type annotations: replace undefined `Map` with `Dict`

`Map` is not imported and will raise at runtime (Python 3.8 evaluates annotations on attribute targets such as `self.buffers`). Use `typing.Dict`.
Apply this diff in the selected lines:
```diff
-        self.buffers: Map[Mapping,
-                          weakref.ReferenceType[VariableLengthBuffer]] = {}
-        self.low_latency_buffers: Map[
-            Mapping,
-            weakref.ReferenceType[VariableLengthLowLatencyBuffer]] = {}
+        self.buffers: Dict[Mapping,
+                           weakref.ReferenceType[VariableLengthBuffer]] = {}
+        self.low_latency_buffers: Dict[
+            Mapping,
+            weakref.ReferenceType[VariableLengthLowLatencyBuffer]] = {}
```

Additionally, update the imports at the top of this file:

```python
from typing import Dict, List, Tuple, Union
```
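As a quick illustration of why this fails at runtime (a minimal, self-contained repro; `Map` is deliberately left undefined):

```python
class Demo:
    def __init__(self):
        # Annotations on attribute targets like `self.buffers` are
        # evaluated when __init__ runs, so an undefined name in the
        # annotation raises immediately.
        self.buffers: Map[str, int] = {}  # NameError: name 'Map' is not defined


Demo()  # raises NameError
```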
♻️ Duplicate comments (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)
193-195: Document TRTLLM_MOE_USE_LOW_PRECISION_COMBINE and its constraints

Add a brief inline comment explaining purpose, expected values, default, and gating conditions. This mirrors an existing suggestion.
Apply this diff in the selected lines:
```diff
-        self.use_low_precision_combine = (os.environ.get(
-            "TRTLLM_MOE_USE_LOW_PRECISION_COMBINE", "0")
-                                          == "1") and qm.has_nvfp4()
+        # Enable FP4 low-precision combine for DeepEPLowLatency when set to "1".
+        # Default "0" (disabled). Effective only if quant_mode.has_nvfp4() is true.
+        # Note: The flag is only used on the DeepEPLowLatency alltoall path.
+        self.use_low_precision_combine = (
+            os.environ.get("TRTLLM_MOE_USE_LOW_PRECISION_COMBINE", "0") == "1"
+        ) and qm.has_nvfp4()
```
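Since the flag is captured when the module is constructed, a user would set it in the environment before the process starts or before building the model. A minimal illustration of the ordering (nothing here beyond standard-library calls):

```python
import os

# Must be set before the MoE module is constructed, because the flag is
# read once in __init__.
os.environ["TRTLLM_MOE_USE_LOW_PRECISION_COMBINE"] = "1"
```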
🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (3)
182-188: Add a brief docstring for low_latency_combine_fp4 to clarify shapes and semantics

Document expected shapes and dtypes to avoid integration mistakes.
Apply this diff in the selected lines:
```diff
-    def low_latency_combine_fp4(self, hidden_states: torch.Tensor,
-                                global_scales: torch.Tensor,
-                                topk_idx: torch.Tensor,
-                                topk_weights: torch.Tensor, handle: Tuple):
+    def low_latency_combine_fp4(self, hidden_states: torch.Tensor,
+                                global_scales: torch.Tensor,
+                                topk_idx: torch.Tensor,
+                                topk_weights: torch.Tensor, handle: Tuple):
+        """
+        Low-latency combine for the FP4 path.
+
+        Args:
+            hidden_states: [num_local_experts, tokens_per_expert, hidden_size], dtype matches compute (e.g., bf16/fp16).
+            global_scales: [num_local_experts, tokens_per_expert, 1], torch.float32 per-token global scale.
+            topk_idx: [num_tokens, top_k], torch.int32, original routing indices.
+            topk_weights: [num_tokens, top_k], torch.float32, original router weights (may be ignored by the FP4 kernel).
+            handle: Opaque handle from low_latency_dispatch(_fp4).
+
+        Returns:
+            Combined hidden states tensor with the same dtype as the input compute.
+        """
```
182-188: Wrap long calls to satisfy the line-length check (E501 at line 176)

Ruff flagged a long line earlier in this file. Wrap the FP4 combine/dispatch calls for readability and lint compliance.

Apply this formatting to the earlier dispatch and the call here:

```python
# Earlier, around line 176:
recv_hidden_states, recv_scales, recv_expert_count, handle, event, hook = \
    self.buffer.low_latency_dispatch_fp4(
        hidden_states, scales, topk_idx,
        num_max_dispatch_tokens_per_rank, num_experts,
    )

# Here:
combined_hidden_states, event, hook = \
    self.buffer.low_latency_combine_fp4(
        hidden_states, global_scales, topk_idx, topk_weights, handle,
    )
```
157-168: Missing NVIDIA copyright header

Per coding guidelines, prepend the current NVIDIA copyright header.

Add at the top of the file:

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (3)
692-701: Add a runtime guard to ensure FP4 combine is used only when supported

Make the failure mode explicit if misconfigured in downstream usage.

Apply this diff in the selected lines:

```diff
-        if self.use_low_precision_combine:
+        if self.use_low_precision_combine:
+            assert self.has_nvfp4, "FP4 low-precision combine requires NVFP4 support"
```
692-701: Add minimal tests for the FP4 low-precision combine branch

There are no tests covering low_latency_combine_fp4. Add a unit/integration test that:
- Exercises use_low_precision_combine=True on the DeepEPLowLatency path with small tensors.
- Verifies shapes/dtypes and that scales are finite (no inf/NaN) when inputs contain zeros.

I can draft a minimal test scaffold that mocks the DeepEP buffer to validate the branch logic; a sketch follows below. Want me to open a follow-up PR with tests?
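A minimal sketch of such a test, assuming a mocked buffer object and illustrative tensor shapes (none of these names come from the repository's test suite):

```python
import unittest
from unittest import mock

import torch


class TestFp4CombineBranch(unittest.TestCase):
    """Branch-level test sketch for the FP4 low-precision combine path."""

    def test_fp4_path_selected_when_enabled(self):
        buffer = mock.Mock()
        buffer.low_latency_combine_fp4.return_value = torch.zeros(2, 8)

        hidden = torch.randn(2, 8)
        hidden[0].zero_()  # include an all-zero token
        global_scales = hidden.abs().max(dim=-1, keepdim=True).values
        # Scales must stay finite even for all-zero inputs.
        self.assertTrue(torch.isfinite(global_scales).all())

        use_low_precision_combine = True
        if use_low_precision_combine:
            out = buffer.low_latency_combine_fp4(
                hidden, global_scales, None, None, None)
        else:
            out = buffer.low_latency_combine(hidden, None, None, None)

        buffer.low_latency_combine_fp4.assert_called_once()
        self.assertEqual(out.shape, hidden.shape)


if __name__ == "__main__":
    unittest.main()
```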
187-195: Missing NVIDIA copyright header

Per coding guidelines, prepend the current NVIDIA copyright header.

Add at the top of the file:

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/tensorrt_llm/deep_ep/CMakeLists.txt
(2 hunks)tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
(1 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/deep_ep/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
176-176: Line too long (128 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py (1)
157-168: AI summary mismatch: non-FP4 low_latency_combine still exists

The summary claims the generic combine entry was removed, but this method is present and used by the fallback path. No change requested; flagging for clarity.
PR_Github #15040 [ run ] completed with state
/bot run
PR_Github #15045 [ run ] triggered by Bot
PR_Github #15045 [ run ] completed with state
/bot reuse-pipeline
/bot reuse-pipeline
PR_Github #15098 [ reuse-pipeline ] triggered by Bot
PR_Github #15099 [ reuse-pipeline ] triggered by Bot
PR_Github #15098 [ reuse-pipeline ] completed with state
PR_Github #15099 [ reuse-pipeline ] completed with state
DeepEP diff: https://github.com/deepseek-ai/DeepEP/compare/edf3ea2b086a393d3163bf2773eab69d9191cc01...515a311f290eb6d9592fcccfcc80c40f5123ca72?expand=1
Summary by CodeRabbit
New Features
Chores