[https://nvbugs/5445466][fix] Bypass MLP TP split for MNNVL in DeepSeek V3 to avoid hanging. #6886
Conversation
📝 Walkthrough

Adds stricter guards and logging around MNNVLAllReduce initialization (disabling it if tp_size != world_size or runtime checks fail), makes AllReduce/MoEAllReduce.forward require keyword-only `all_reduce_params`, and adjusts the DeepSeek V3 MLP TP-size selection based on the all-reduce strategy and MNNVL availability.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Init as AllReduce.__init__
participant AR as AllReduce
participant MNNVL as MNNVLAllReduce
Init->>AR: __init__(mapping, strategy, dtype...)
alt strategy in {AUTO, MNNVL} and mapping.tp_size > 1
alt mapping.tp_size != mapping.world_size
AR->>AR: set mnnvl_allreduce = None\nlog tp_size vs world_size mismatch
else mapping.tp_size == mapping.world_size
AR->>MNNVL: is_mnnvl(mapping, dtype)?
alt True
AR->>MNNVL: try instantiate MNNVLAllReduce(mapping, dtype)
alt success
AR->>AR: set mnnvl_allreduce instance\nlog enabled
else failure
AR->>AR: set mnnvl_allreduce = None\nlog exception
end
else False
AR->>AR: set mnnvl_allreduce = None\nlog unsupported dtype/config
end
end
else
AR->>AR: use non-MNNVL initialization/fallback
end
```
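For readers skimming the diagram, here is a minimal Python sketch of the guard flow it describes. The helper name and factoring are illustrative only; in the PR the equivalent logic lives inside `AllReduce.__init__` and sets `self.mnnvl_allreduce`:

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Mapping:  # minimal stand-in for the real mapping object
    tp_size: int
    world_size: int


def init_mnnvl_allreduce(mapping, dtype, mnnvl_cls):
    """Return an MNNVL allreduce instance, or None when it must be disabled."""
    if mapping.tp_size != mapping.world_size:
        # MNNVL assumes the TP group spans the whole world; otherwise disable it.
        logger.debug("MNNVLAllReduce disabled: tp_size(%s) != world_size(%s)",
                     mapping.tp_size, mapping.world_size)
        return None
    if not mnnvl_cls.is_mnnvl(mapping, dtype):
        logger.debug("MNNVLAllReduce disabled: is_mnnvl check failed")
        return None
    try:
        return mnnvl_cls(mapping, dtype)
    except Exception as exc:  # fall back to the regular allreduce path
        logger.debug("MNNVL AllReduce can't be enabled due to %s.", exc)
        return None
```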
```mermaid
sequenceDiagram
participant Layer as DeepseekV3DecoderLayer
participant Cfg as Config (tp, GPUs_per_node, strategy)
participant MNNVL as MNNVLAllReduce
Layer->>Cfg: read tp, GPUs_per_node, strategy
Layer->>MNNVL: MNNVLAllReduce.is_mnnvl(mapping, dtype) (runtime)
alt tp > GPUs_per_node and (strategy ∉ {AUTO, MNNVL} or MNNVL unsupported)
Layer->>Layer: mlp_tp_size = gcd(tp, GPUs_per_node)
else
Layer->>Layer: mlp_tp_size = tp
end
```
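A small, self-contained sketch of the MLP TP-size rule shown in the second diagram; the standalone helper and the simplified enum are illustrative stand-ins, while the gcd fallback mirrors the PR's logic:

```python
import math
from enum import Enum


class AllReduceStrategy(Enum):  # simplified stand-in for the real enum
    AUTO = 0
    MNNVL = 1


def compute_mlp_tp_size(tp: int, gpus_per_node: int,
                        strategy: AllReduceStrategy,
                        mnnvl_supported: bool) -> int:
    """Pick the MLP TP size as described in the diagram above."""
    if tp > gpus_per_node and (
            strategy not in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL)
            or not mnnvl_supported):
        # Avoid costly inter-node TP when MNNVL cannot be used.
        return math.gcd(tp, gpus_per_node)
    return tp


# Example: tp=16 across 8-GPU nodes without MNNVL falls back to tp=8.
assert compute_mlp_tp_size(16, 8, AllReduceStrategy.AUTO, False) == 8
assert compute_mlp_tp_size(16, 8, AllReduceStrategy.MNNVL, True) == 16
```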
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #15235 [ run ] triggered by Bot
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tensorrt_llm/_torch/distributed/ops.py (1 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/distributed/ops.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
44-45
: Importing AllReduceStrategy for TP-size selection logic is appropriate

This addition is needed for the MLP TP selection logic that branches on strategy.
54-56
: Importing MNNVLAllReduce for runtime capability checks is OK

Assuming `..distributed` re-exports `MNNVLAllReduce`, this keeps call sites consistent. If not already exported, ensure `__all__` includes it to avoid import errors.
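If the re-export were missing, a minimal sketch of what the package `__init__.py` could look like (a hypothetical excerpt; module path and symbol list are assumed from the review context, not copied from the repository):

```python
# Hypothetical excerpt of tensorrt_llm/_torch/distributed/__init__.py
from .ops import (AllReduce, AllReduceFusionOp, AllReduceParams,
                  MNNVLAllReduce, MoEAllReduce, MoEAllReduceParams, allgather)

# Re-export the public symbols so `from ..distributed import MNNVLAllReduce` works.
__all__ = [
    "AllReduce",
    "AllReduceFusionOp",
    "AllReduceParams",
    "MNNVLAllReduce",
    "MoEAllReduce",
    "MoEAllReduceParams",
    "allgather",
]
```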
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/distributed/ops.py (1)
460-481
: Tighten dtype gating and improve debug messages; remove redundant conditional

A few cleanups will make this block clearer and a tad leaner:
- Avoid calling `is_mnnvl(...)` (and its import) when `dtype` is None.
- Remove the redundant `if dtype else None` in the instantiation (it's unreachable if `is_mnnvl` returned True).
- Include actual values in the disable log to aid troubleshooting.
Apply this diff:
```diff
-            if self.mapping.tp_size != self.mapping.world_size:
-                logger.debug(
-                    f"MNNVLAllReduce is disabled due to tp_size != world_size"
-                )
-                self.mnnvl_allreduce = None
-            elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
+            if self.mapping.tp_size != self.mapping.world_size:
+                logger.debug(
+                    f"MNNVLAllReduce disabled: tp_size({self.mapping.tp_size}) != world_size({self.mapping.world_size})"
+                )
+                self.mnnvl_allreduce = None
+            elif dtype is not None and MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
                 try:
-                    self.mnnvl_allreduce = MNNVLAllReduce(
-                        self.mapping, dtype) if dtype else None
+                    self.mnnvl_allreduce = MNNVLAllReduce(self.mapping, dtype)
                     if self.mnnvl_allreduce:
                         logger.debug(f"MNNVLAllReduce is enabled")
                     else:
                         logger.debug(f"MNNVLAllReduce is disabled")
                 except Exception as e:
                     logger.debug(
                         f"MNNVL AllReduce can't be enabled due to {e}.")
                     self.mnnvl_allreduce = None
             else:
                 logger.debug(
-                    f"MNNVLAllReduce can't be enabled due to failing the is_mnnvl check."
+                    "MNNVLAllReduce can't be enabled due to failing the is_mnnvl check or missing dtype."
                 )
                 self.mnnvl_allreduce = None
```
📒 Files selected for processing (1)
tensorrt_llm/_torch/distributed/ops.py (1 hunks)
🔇 Additional comments (1)
tensorrt_llm/_torch/distributed/ops.py (1)
460-465
: Guard now short-circuits MNNVL init as intended — good fix

Switching to the `elif` branch ensures the `tp_size != world_size` path cannot be overridden by a subsequent `is_mnnvl` check. This resolves the earlier guard-ineffectiveness issue.
/bot kill
PR_Github #15239 [ kill ] triggered by Bot
PR_Github #15235 [ run ] completed with state
PR_Github #15239 [ kill ] completed with state
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
742-754
: Align MLP TP-size gating with AllReduce's MNNVL guard (require tp_size == world_size when relying on MNNVL)

Right now, the MLP TP-size logic only checks MNNVLAllReduce.is_mnnvl(mapping, dtype). AllReduce.__init__ explicitly disables MNNVL when tp_size != world_size. This discrepancy can lead to choosing inter-node TP in MLP while MNNVL is disabled at runtime, reintroducing the extra inter-node AR cost and potential instability you're trying to avoid.
Gate “MNNVL usable” exactly like AllReduce does to keep decisions consistent and predictable.
Apply this refactor:
```diff
-        if tp > self.mapping.gpus_per_node and (
-                self.model_config.allreduce_strategy not in (
-                    AllReduceStrategy.AUTO,
-                    AllReduceStrategy.MNNVL,
-                ) or not MNNVLAllReduce.is_mnnvl(
-                    self.mapping,
-                    self.model_config.pretrained_config.torch_dtype)):
-            mlp_tp_size = math.gcd(
-                tp,
-                self.mapping.gpus_per_node,
-            )  # Avoid costly inter-node TP when MNNVL is not supported
-        else:
-            mlp_tp_size = tp
+        strategy = self.model_config.allreduce_strategy
+        dtype = self.model_config.pretrained_config.torch_dtype
+        mnnvl_available = MNNVLAllReduce.is_mnnvl(self.mapping, dtype)
+        # Keep consistent with AllReduce.__init__: disable MNNVL when tp_size != world_size
+        mnnvl_usable = mnnvl_available and (
+            self.mapping.tp_size == self.mapping.world_size
+        )
+
+        if tp > self.mapping.gpus_per_node and (
+                strategy not in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL)
+                or not mnnvl_usable
+        ):
+            # Avoid costly inter-node TP when MNNVL is not usable
+            mlp_tp_size = math.gcd(tp, self.mapping.gpus_per_node)
+        else:
+            mlp_tp_size = tp
```
🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
54-58
: Consolidate ModelConfig/QuantConfig imports to avoid shadowing and confusion

ModelConfig is imported from two different modules and one import will shadow the other. This is error-prone and makes type origins ambiguous. Keep a single, canonical ModelConfig import and avoid re-importing QuantConfig unnecessarily.
Apply this cleanup:
```diff
-from ..distributed import (AllReduce, AllReduceFusionOp, AllReduceParams,
-                           MNNVLAllReduce, MoEAllReduce, MoEAllReduceParams,
-                           allgather)
+from ..distributed import (AllReduce, AllReduceFusionOp, AllReduceParams,
+                           MNNVLAllReduce, MoEAllReduce, MoEAllReduceParams,
+                           allgather)
 from ..model_config import ModelConfig
-from ..models.modeling_utils import ModelConfig, QuantConfig
+from ..models.modeling_utils import QuantConfig
```

tensorrt_llm/_torch/distributed/ops.py (2)
462-465
: Fix Ruff E501 long line and prefer structured logging

The debug string exceeds the 120-char limit. Use parameterized logging to satisfy linting and avoid building the string unnecessarily.
Apply this diff:
```diff
-                logger.debug(
-                    f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} != world_size:{self.mapping.world_size}"
-                )
+                logger.debug(
+                    "MNNVLAllReduce disabled: tp_size(%s) != world_size(%s)",
+                    self.mapping.tp_size,
+                    self.mapping.world_size,
+                )
```
465-469
: Optional: avoid calling is_mnnvl with dtype=None

Micro-clarity improvement: only evaluate is_mnnvl when dtype is provided; semantics are unchanged, but it saves a call and makes the intent explicit.
Apply this diff:
```diff
-            elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
+            elif dtype and MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
```
📒 Files selected for processing (2)
tensorrt_llm/_torch/distributed/ops.py (1 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py
462-462: Line too long (131 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
44-45
: LGTM: Added imports are correct and necessary

Bringing in AllReduceStrategy and PositionEmbeddingType aligns with their usage below (MLP TP-size logic and PositionalEmbeddingParams). No issues.
tensorrt_llm/_torch/distributed/ops.py (1)
460-481
: Short-circuit guard for tp_size != world_size in MNNVL init: LGTM

This resolves the earlier issue where the disable path could be immediately overridden. This flow is now coherent and prevents contradictory logs/behavior.
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #15240 [ run ] triggered by Bot
PR_Github #15240 [ run ] completed with state
c4a0a72 to 40154d9
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #15322 [ run ] triggered by Bot
/bot kill
Actionable comments posted: 1
🧹 Nitpick comments (3)
tensorrt_llm/_torch/distributed/ops.py (3)
91-93
: Consider documenting the purpose of the `force_applying_finalize` parameter

The new `force_applying_finalize` parameter lacks documentation about when and why it should be used. Add a docstring to clarify the parameter's purpose:

```diff
 def userbuffers_allreduce_finalize(input: torch.Tensor, force_applying_finalize: bool = False) -> torch.Tensor:
+    """Finalize the userbuffers allreduce operation.
+
+    Args:
+        input: Input tensor to finalize
+        force_applying_finalize: If True, forces finalization even if not typically required
+    """
     output = torch.ops.trtllm.userbuffers_allreduce_finalize(input, force_applying_finalize)
```
96-101
: Replace lambda expression with a proper function definition

Using lambda expressions for assignments is against Python best practices. Define a proper function instead.

```diff
-def filter_valid_input(input_list: List[torch.Tensor]) -> Tuple[List[torch.Tensor], List[bool]]:
-    func_valid = lambda x: x is not None
-    valid_list = list(map(func_valid, input_list))
-    input_list = list(filter(func_valid, input_list))
+def filter_valid_input(input_list: List[torch.Tensor]) -> Tuple[List[torch.Tensor], List[bool]]:
+    def is_valid(x):
+        return x is not None
+    valid_list = list(map(is_valid, input_list))
+    input_list = list(filter(is_valid, input_list))
     return input_list, valid_list
```
470-470
: Use idiomatic Python for boolean checks

Replace the explicit `False` comparison with a more idiomatic approach.

```diff
-        if self.mapping.tp_size == 1 or (all_reduce_params is not None and all_reduce_params.enable_allreduce == False):
+        if self.mapping.tp_size == 1 or (all_reduce_params is not None and not all_reduce_params.enable_allreduce):
```
📒 Files selected for processing (2)
tensorrt_llm/_torch/distributed/ops.py (23 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.681Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/distributed/ops.py
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
- tensorrt_llm/functional.py (1): AllReduceStrategy (3876-3885)
- tensorrt_llm/_torch/distributed/ops.py (2): MNNVLAllReduce (249-357), is_mnnvl (275-286)

tensorrt_llm/_torch/distributed/ops.py (3)
- tensorrt_llm/functional.py (6): AllReduceFusionOp (3888-3897), AllReduceParams (3900-3939), AllReduceStrategy (3876-3885), dtype (255-259), dtype (262-267), Tensor (107-602)
- tensorrt_llm/plugin/plugin.py (6): CustomAllReduceHelper (668-844), allocate_allreduce_fusion_workspace (818-844), max_workspace_size_auto (708-716), allocate_lowprecision_workspace (787-815), max_workspace_size_lowprecision (719-720), initialize_lowprecision_buffers (723-727)
- tensorrt_llm/mapping.py (3): Mapping (32-513), local_rank (399-400), is_multi_node (420-421)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py
- 56-56: Local variable `force_mn` is assigned to but never used. Remove assignment to unused variable `force_mn` (F841)
- 104-104: Do not assign a `lambda` expression, use a `def`. Rewrite `func_valid` as a `def` (E731)
- 470-470: Avoid equality comparisons to `False`; use `not all_reduce_params.enable_allreduce:` for false checks. Replace with `not all_reduce_params.enable_allreduce` (E712)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (17)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
44-44
: LGTM! Imports are correctly updated for AllReduceStrategy and PositionEmbeddingType

These imports are properly used in the new MLP TP-size selection logic below.
55-55
: LGTM! MNNVLAllReduce import is properly added for runtime MNNVL availability checks

This distributed op is correctly imported to support the `is_mnnvl` check in the MLP TP-size selection logic.
742-756
: LGTM! MLP TP-size selection now properly considers AllReduceStrategy and MNNVL availability

The implementation correctly:
- Uses `self.model_config.allreduce_strategy` instead of the uninitialized `self.allreduce` (fixing the AttributeError)
- Calls `MNNVLAllReduce.is_mnnvl()` with the required parameters `(self.mapping, self.model_config.pretrained_config.torch_dtype)` (fixing the TypeError)
- Properly restricts inter-node TP when MNNVL is not available or when the strategy doesn't support it

The design decision to have different MNNVL criteria between this component and AllReduce.__init__ is documented as intentional per the retrieved learnings.
tensorrt_llm/_torch/distributed/ops.py (14)
13-13
: LGTM! Import consolidation of AllReduceParams and AllReduceFusionOp

The imports are correctly updated to include AllReduceStrategy alongside the existing imports.
23-33
: Minor: String quoting style consistency change

The change from single to double quotes for thread-local attribute names (`"allreduce_workspaces_{mapping.pp_rank}"`) is a formatting change that maintains functionality.
30-30
: LGTM! Added support_deterministic parameter to workspace allocation

The workspace allocation now explicitly passes `support_deterministic=False` to `max_workspace_size_auto`, which provides better clarity about the deterministic behavior.
37-37
: LGTM! Thread-local storage key consistency and improved workspace initialization

The workspace initialization is properly updated with:
- Consistent thread-local storage key naming
- Explicit parameter passing for tp_size
- Separate initialization call for low precision buffers
Also applies to: 43-46
64-88
: LGTM! Improved MNNVL workspace buffer initialization

The MNNVL workspace initialization is properly structured with:
- Correct buffer size calculation aligned to stride boundaries
- Proper type hints and unpacking of the tuple return value
- Clear documentation of buffer structure
262-268
: LGTM! Improved assertion message and better workspace tuple unpacking

The changes improve code clarity with:
- Better formatted assertion message including the unsupported dtypes
- More readable unpacking of the workspace tuple
280-286
: LGTM! Consistent formatting for MNNVL availability check

The multi-line formatting improves readability while maintaining the same logic.
323-323
: LGTM! Improved buffer slicing readability

The buffer view operation is now more explicit about the slicing steps.
335-338
: LGTM! Consistent formatting for fusion condition check

The multi-line formatting improves readability of the complex condition.
421-439
: LGTM! Well-structured MNNVL initialization with proper guard and logging

The implementation correctly:
- Guards against `tp_size != world_size` mismatch early in the flow
- Only evaluates `is_mnnvl` if the guard passes
- Provides clear debug logging for all enable/disable decisions
- Properly handles exceptions during MNNVL initialization

This resolves the previous issue where the guard could be overridden.
441-446
: LGTM! Forward signature properly enforces keyword-only all_reduce_params

The change to enforce keyword-only passing of `all_reduce_params` improves API safety by preventing accidental positional argument misuse.
541-546
: LGTM! MoEAllReduce forward signature properly enforces keyword-only all_reduce_params

Consistent with the AllReduce changes, this enforces keyword-only passing to prevent API misuse.
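To make the keyword-only contract concrete, here is a minimal, self-contained sketch; the class and parameter names are simplified stand-ins rather than the library's actual definitions:

```python
from typing import Any, Optional


class MoEAllReduce:  # simplified stand-in, not the real module
    def forward(self, input: Any, *,
                all_reduce_params: Optional["MoEAllReduceParams"] = None) -> Any:
        # The bare `*` forces all_reduce_params to be passed by keyword.
        return input


op = MoEAllReduce()
op.forward("x", all_reduce_params=None)   # OK: keyword argument
# op.forward("x", None)                   # raises TypeError: too many positional arguments
```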
579-581
: LGTM! Improved assertion message formatting

The multi-line formatting improves readability of the assertion condition.
587-587
: LGTM! Simplified parameter passing

Using the direct attribute reference is cleaner than the previous multi-line form.
PR_Github #15324 [ kill ] triggered by Bot
PR_Github #15322 [ run ] completed with state
PR_Github #15324 [ kill ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
Actionable comments posted: 1
🔭 Outside diff range comments (4)
tensorrt_llm/_torch/distributed/ops.py (4)
1-21
: Remove logger shadowing and add required NVIDIA header
- You import `logger` from `tensorrt_llm.logger` and then immediately shadow it with a stdlib logger. This can bypass project-wide logging configuration unexpectedly. Pick one; recommended: keep the project logger and drop the local assignment (and the unused `import logging`).
- Per coding guidelines, prepend the NVIDIA copyright header.
Apply this diff:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
-import logging
@@
-from tensorrt_llm.logger import logger
-logger = logging.getLogger(__name__)
+from tensorrt_llm.logger import logger
```
113-120
: Fix return type annotation for get_output_info

The function returns a dict with shape and numel info, but the annotation is `-> List[int]`. This harms type checkers and readability. Apply this diff to correct typing:

```diff
-from typing import List, Optional, Tuple, Union
+from typing import Dict, List, Optional, Tuple, Union
@@
-def get_output_info(input: torch.Tensor, dim: int) -> List[int]:
+def get_output_info(input: torch.Tensor, dim: int) -> Dict[str, Union[List[int], int]]:
@@
-    return {"output_shape": output_shape, "numel_base": numel_base}
+    return {"output_shape": output_shape, "numel_base": numel_base}
```
324-337
: Align MNNVLAllReduce.forward typing and calling convention with None-fallback behavior

The method can return `None` to signal fallback, but the signature/doc don't reflect that. Also, for consistency with AllReduce/MoEAllReduce, enforce keyword-only `all_reduce_params`. Apply this diff:

```diff
-    def forward(
-        self,
-        input: torch.Tensor,
-        all_reduce_params: AllReduceParams,
-    ) -> Union[torch.Tensor, Tuple[torch.Tensor, ...]]:
+    def forward(
+        self,
+        input: torch.Tensor,
+        *,
+        all_reduce_params: AllReduceParams,
+    ) -> Optional[Union[torch.Tensor, Tuple[torch.Tensor, ...]]]:
@@
-        Returns:
-            Union[torch.Tensor, Tuple[torch.Tensor, ...]]: Reduced tensor(s)
+        Returns:
+            Optional[Union[torch.Tensor, Tuple[torch.Tensor, ...]]]:
+                - Reduced tensor(s) when MNNVL path is applicable.
+                - None to indicate the caller should fall back to regular AllReduce.
```

Note: You'll need to import Optional at the top (already present).
588-594
: Fix required: update positional MoEAllReduce calls to pass all_reduce_params as keyword

MoEAllReduce.forward now enforces a keyword-only all_reduce_params. I scanned the repo — all existing callsites already use the keyword except one that must be fixed:
- tensorrt_llm/_torch/models/modeling_llama.py (around line ~578)
```python
allreduce_output = self.moe_allreduce(
    residual,
    self.next_layer_layernorm.weight,
    device_num_experts=num_activated_experts_per_node,
    scale_input=experts_to_token_score,
    active_experts_token_input=hidden_states_activated_experts,
    token_input=shared_output,
    eps=self.next_layer_layernorm.variance_epsilon,
)
```

Action: replace the positional argument usage by constructing/passing a MoEAllReduceParams instance (or otherwise supplying all_reduce_params as a keyword). Example pattern:

```python
moe_params = MoEAllReduceParams(residual=..., norm_weight=..., device_num_experts=..., scale_input=..., active_experts_token_input=..., shared_expert_output=..., eps=...)
allreduce_output = self.moe_allreduce(active_experts_token_input, all_reduce_params=moe_params)
```
This will align the call with the new forward signature and avoid runtime TypeError.
♻️ Duplicate comments (2)
tensorrt_llm/_torch/distributed/ops.py (2)
459-483
: Guard now properly short-circuits MNNVL init when tp_size != world_size

This fixes the previously raised issue where the disable guard could be bypassed. The flow is now unambiguous with clean logging.
59-67
: Remove unused variable `force_mn`

`force_mn` is assigned but never used. It's dead code and flagged by Ruff (F841). Apply this diff:

```diff
     if not hasattr(_thread_local, f"allreduce_mnnvl_workspaces_{mapping.pp_rank}"):
         setattr(_thread_local, f"allreduce_mnnvl_workspaces_{mapping.pp_rank}", {})
-    force_mn = os.environ.get("TRTLLM_FORCE_MNNVL_AR", "0") == "1"
-
     allreduce_mnnvl_workspaces = getattr(
         _thread_local, f"allreduce_mnnvl_workspaces_{mapping.pp_rank}")
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/distributed/ops.py (2)
39-41
: Nit: function name typo (“presicion”)

The function is spelled `allocate_low_presicion_allreduce_workspace`. Consider renaming to `allocate_low_precision_allreduce_workspace` for clarity and consistency. If public, plan a deprecation shim.
209-214
: Make split dimension explicit for readability

In the `sizes is not None` branch, `x.split(sizes)` defaults to dim=0. Make it explicit for clarity and symmetry with `reducescatter`. Apply this diff:

```diff
-    else:
-        x_list = x.split(sizes)
+    else:
+        x_list = x.split(sizes, dim=0)
```
📒 Files selected for processing (1)
tensorrt_llm/_torch/distributed/ops.py (16 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.681Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py
- 63-63: Local variable `force_mn` is assigned to but never used. Remove assignment to unused variable `force_mn` (F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/distributed/ops.py (7)
24-29
: LGTM: thread-local workspace keyed by pp_rank

Using a pp_rank-scoped thread-local dictionary avoids cross-PP interference. Looks correct.
147-171
: LGTM: clearer allgather docstring

The expanded doc clarifies sizes semantics and output shaping. No issues spotted.
189-197
: LGTM: consistent 2D flattening with computed numel_base

The `view(-1, numel_base)` approach is consistent with how `numel_base` is derived. Works for both Tensor and list inputs.
251-258
: LGTM: reducescatter input reshaping logic

The reshaping and splitting by `dim` is correct and mirrors the allgather logic.
280-286
: LGTM: reducescatter output re-shaping

Restoring original shapes using cached `output_info` is correct; list handling preserves `None` positions via `restore_full_output`.
400-405
: LGTM: AllReduce now optionally takes dtype and exposes strategy

Constructor parameters are clear and maintain backward compat via defaults.
490-512
: No positional all_reduce_params usages found — OK to enforce keyword-only

I searched for direct AllReduce.forward / AllReduce instance calls that pass all_reduce_params positionally and found none; callers either use the functional wrapper or pass all_reduce_params by name (or via **params).
Checked/important locations (all looked safe):
- tensorrt_llm/_torch/distributed/ops.py — AllReduce implementation (forward).
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py — constructs AllReduce and calls torch_op(tensor, all_reduce_params=...).
- tensorrt_llm/_torch/functional.py (def allreduce) — wrapper used across the codebase (e.g. functional.py:2800,2825).
- tests/unittest/_torch/multi_gpu/test_user_buffers.py — ar.forward(...) uses all_reduce_params=ar_params.
- quantization/layers.py, modeling_qwen3_moe.py, modeling_deepseekv3.py, layers/moe.py — use the wrapper or pass all_reduce_params as a keyword.
Conclusion: no callers found that would break from making all_reduce_params keyword-only.
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #16778 [ run ] triggered by Bot
PR_Github #16778 [ run ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #16835 [ run ] triggered by Bot
PR_Github #16835 [ run ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #16873 [ run ] triggered by Bot
PR_Github #16873 [ run ] completed with state
…ek V3 to avoid hanging. (NVIDIA#6886) Signed-off-by: Shiyu Li <[email protected]>
Description
This PR fixes the hanging issue when MNNVL is used for Deepseek V3.
- The `is_mnnvl` function in MNNVLAllreduce indicates the availability of MNNVLink, so we do not need to impose the restriction on the TP size within a single node for MLP.
- MNNVL AllReduce is disabled when `tp_size != world_size` to prevent functional errors arising from unexpected parallel mapping. Subsequent work will attempt to eliminate this limitation by acquiring the correct mapping information within the multicast buffer initialization.

Relevant PR: 6860, which should be reverted once we find this fix is satisfying.
Known Issue:
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.