[https://nvbugs/5445466][fix] Bypass MLP TP split for MNNVL in DeepSeek V3 to avoid hanging. #6886
Conversation
📝 Walkthrough

Adds stricter guards and logging around MNNVLAllReduce initialization (disabling it if tp_size != world_size or runtime checks fail), makes AllReduce/MoEAllReduce.forward require keyword-only `all_reduce_params`, and adjusts the DeepSeek V3 MLP TP-size selection based on the all-reduce strategy and MNNVL availability.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Init as AllReduce.__init__
participant AR as AllReduce
participant MNNVL as MNNVLAllReduce
Init->>AR: __init__(mapping, strategy, dtype...)
alt strategy in {AUTO, MNNVL} and mapping.tp_size > 1
alt mapping.tp_size != mapping.world_size
AR->>AR: set mnnvl_allreduce = None\nlog tp_size vs world_size mismatch
else mapping.tp_size == mapping.world_size
AR->>MNNVL: is_mnnvl(mapping, dtype)?
alt True
AR->>MNNVL: try instantiate MNNVLAllReduce(mapping, dtype)
alt success
AR->>AR: set mnnvl_allreduce instance\nlog enabled
else failure
AR->>AR: set mnnvl_allreduce = None\nlog exception
end
else False
AR->>AR: set mnnvl_allreduce = None\nlog unsupported dtype/config
end
end
else
AR->>AR: use non-MNNVL initialization/fallback
end
```
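For readers skimming the diagram, here is a minimal Python sketch of the guard flow it describes. The helper name and factoring are illustrative only; in the PR the equivalent logic lives inside `AllReduce.__init__` and sets `self.mnnvl_allreduce`:

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Mapping:  # minimal stand-in for the real mapping object
    tp_size: int
    world_size: int


def init_mnnvl_allreduce(mapping, dtype, mnnvl_cls):
    """Return an MNNVL allreduce instance, or None when it must be disabled."""
    if mapping.tp_size != mapping.world_size:
        # MNNVL assumes the TP group spans the whole world; otherwise disable it.
        logger.debug("MNNVLAllReduce disabled: tp_size(%s) != world_size(%s)",
                     mapping.tp_size, mapping.world_size)
        return None
    if not mnnvl_cls.is_mnnvl(mapping, dtype):
        logger.debug("MNNVLAllReduce disabled: is_mnnvl check failed")
        return None
    try:
        return mnnvl_cls(mapping, dtype)
    except Exception as exc:  # fall back to the regular allreduce path
        logger.debug("MNNVL AllReduce can't be enabled due to %s.", exc)
        return None
```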
```mermaid
sequenceDiagram
participant Layer as DeepseekV3DecoderLayer
participant Cfg as Config (tp, GPUs_per_node, strategy)
participant MNNVL as MNNVLAllReduce
Layer->>Cfg: read tp, GPUs_per_node, strategy
Layer->>MNNVL: MNNVLAllReduce.is_mnnvl(mapping, dtype) (runtime)
alt tp > GPUs_per_node and (strategy ∉ {AUTO, MNNVL} or MNNVL unsupported)
Layer->>Layer: mlp_tp_size = gcd(tp, GPUs_per_node)
else
Layer->>Layer: mlp_tp_size = tp
end
```
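A small, self-contained sketch of the MLP TP-size rule shown in the second diagram; the standalone helper and the simplified enum are illustrative stand-ins, while the gcd fallback mirrors the PR's logic:

```python
import math
from enum import Enum


class AllReduceStrategy(Enum):  # simplified stand-in for the real enum
    AUTO = 0
    MNNVL = 1


def compute_mlp_tp_size(tp: int, gpus_per_node: int,
                        strategy: AllReduceStrategy,
                        mnnvl_supported: bool) -> int:
    """Pick the MLP TP size as described in the diagram above."""
    if tp > gpus_per_node and (
            strategy not in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL)
            or not mnnvl_supported):
        # Avoid costly inter-node TP when MNNVL cannot be used.
        return math.gcd(tp, gpus_per_node)
    return tp


# Example: tp=16 across 8-GPU nodes without MNNVL falls back to tp=8.
assert compute_mlp_tp_size(16, 8, AllReduceStrategy.AUTO, False) == 8
assert compute_mlp_tp_size(16, 8, AllReduceStrategy.MNNVL, True) == 16
```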
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #15235 [ run ] triggered by Bot
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tensorrt_llm/_torch/distributed/ops.py (1 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/distributed/ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/distributed/ops.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
44-45
: Importing AllReduceStrategy for TP-size selection logic is appropriate

This addition is needed for the MLP TP selection logic that branches on strategy.
54-56
: Importing MNNVLAllReduce for runtime capability checks is OK

Assuming `..distributed` re-exports `MNNVLAllReduce`, this keeps call sites consistent. If not already exported, ensure `__all__` includes it to avoid import errors.
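If the re-export were missing, a minimal sketch of what the package `__init__.py` could look like (a hypothetical excerpt; module path and symbol list are assumed from the review context, not copied from the repository):

```python
# Hypothetical excerpt of tensorrt_llm/_torch/distributed/__init__.py
from .ops import (AllReduce, AllReduceFusionOp, AllReduceParams,
                  MNNVLAllReduce, MoEAllReduce, MoEAllReduceParams, allgather)

# Re-export the public symbols so `from ..distributed import MNNVLAllReduce` works.
__all__ = [
    "AllReduce",
    "AllReduceFusionOp",
    "AllReduceParams",
    "MNNVLAllReduce",
    "MoEAllReduce",
    "MoEAllReduceParams",
    "allgather",
]
```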
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/distributed/ops.py (1)
460-481
: Tighten dtype gating and improve debug messages; remove redundant conditional

A few cleanups will make this block clearer and a tad leaner:
- Avoid calling `is_mnnvl(...)` (and its import) when `dtype` is None.
- Remove the redundant `if dtype else None` in the instantiation (it's unreachable if `is_mnnvl` returned True).
- Include actual values in the disable log to aid troubleshooting.
Apply this diff:
```diff
-            if self.mapping.tp_size != self.mapping.world_size:
-                logger.debug(
-                    f"MNNVLAllReduce is disabled due to tp_size != world_size"
-                )
-                self.mnnvl_allreduce = None
-            elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
+            if self.mapping.tp_size != self.mapping.world_size:
+                logger.debug(
+                    f"MNNVLAllReduce disabled: tp_size({self.mapping.tp_size}) != world_size({self.mapping.world_size})"
+                )
+                self.mnnvl_allreduce = None
+            elif dtype is not None and MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
                 try:
-                    self.mnnvl_allreduce = MNNVLAllReduce(
-                        self.mapping, dtype) if dtype else None
+                    self.mnnvl_allreduce = MNNVLAllReduce(self.mapping, dtype)
                     if self.mnnvl_allreduce:
                         logger.debug(f"MNNVLAllReduce is enabled")
                     else:
                         logger.debug(f"MNNVLAllReduce is disabled")
                 except Exception as e:
                     logger.debug(
                         f"MNNVL AllReduce can't be enabled due to {e}.")
                     self.mnnvl_allreduce = None
             else:
                 logger.debug(
-                    f"MNNVLAllReduce can't be enabled due to failing the is_mnnvl check."
+                    "MNNVLAllReduce can't be enabled due to failing the is_mnnvl check or missing dtype."
                 )
                 self.mnnvl_allreduce = None
```
📒 Files selected for processing (1)
tensorrt_llm/_torch/distributed/ops.py (1 hunks)
🔇 Additional comments (1)
tensorrt_llm/_torch/distributed/ops.py (1)
460-465
: Guard now short-circuits MNNVL init as intended — good fix

Switching to the `elif` branch ensures the `tp_size != world_size` path cannot be overridden by a subsequent `is_mnnvl` check. This resolves the earlier guard-ineffectiveness issue.
/bot kill
PR_Github #15239 [ kill ] triggered by Bot
PR_Github #15235 [ run ] completed with state
PR_Github #15239 [ kill ] completed with state
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
742-754
: Align MLP TP-size gating with AllReduce's MNNVL guard (require tp_size == world_size when relying on MNNVL)

Right now, the MLP TP-size logic only checks MNNVLAllReduce.is_mnnvl(mapping, dtype). AllReduce.__init__ explicitly disables MNNVL when tp_size != world_size. This discrepancy can lead to choosing inter-node TP in MLP while MNNVL is disabled at runtime, reintroducing the extra inter-node AR cost and potential instability you're trying to avoid.
Gate “MNNVL usable” exactly like AllReduce does to keep decisions consistent and predictable.
Apply this refactor:
```diff
-        if tp > self.mapping.gpus_per_node and (
-                self.model_config.allreduce_strategy not in (
-                    AllReduceStrategy.AUTO,
-                    AllReduceStrategy.MNNVL,
-                ) or not MNNVLAllReduce.is_mnnvl(
-                    self.mapping,
-                    self.model_config.pretrained_config.torch_dtype)):
-            mlp_tp_size = math.gcd(
-                tp,
-                self.mapping.gpus_per_node,
-            )  # Avoid costly inter-node TP when MNNVL is not supported
-        else:
-            mlp_tp_size = tp
+        strategy = self.model_config.allreduce_strategy
+        dtype = self.model_config.pretrained_config.torch_dtype
+        mnnvl_available = MNNVLAllReduce.is_mnnvl(self.mapping, dtype)
+        # Keep consistent with AllReduce.__init__: disable MNNVL when tp_size != world_size
+        mnnvl_usable = mnnvl_available and (
+            self.mapping.tp_size == self.mapping.world_size
+        )
+
+        if tp > self.mapping.gpus_per_node and (
+                strategy not in (AllReduceStrategy.AUTO, AllReduceStrategy.MNNVL)
+                or not mnnvl_usable
+        ):
+            # Avoid costly inter-node TP when MNNVL is not usable
+            mlp_tp_size = math.gcd(tp, self.mapping.gpus_per_node)
+        else:
+            mlp_tp_size = tp
```
🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
54-58
: Consolidate ModelConfig/QuantConfig imports to avoid shadowing and confusion

ModelConfig is imported from two different modules and one import will shadow the other. This is error-prone and makes type origins ambiguous. Keep a single, canonical ModelConfig import and avoid re-importing QuantConfig unnecessarily.
Apply this cleanup:
```diff
-from ..distributed import (AllReduce, AllReduceFusionOp, AllReduceParams,
-                           MNNVLAllReduce, MoEAllReduce, MoEAllReduceParams,
-                           allgather)
+from ..distributed import (AllReduce, AllReduceFusionOp, AllReduceParams,
+                           MNNVLAllReduce, MoEAllReduce, MoEAllReduceParams,
+                           allgather)
 from ..model_config import ModelConfig
-from ..models.modeling_utils import ModelConfig, QuantConfig
+from ..models.modeling_utils import QuantConfig
```

tensorrt_llm/_torch/distributed/ops.py (2)
462-465
: Fix Ruff E501 long line and prefer structured logging

The debug string exceeds the 120-char limit. Use parameterized logging to satisfy linting and avoid building the string unnecessarily.
Apply this diff:
```diff
-                logger.debug(
-                    f"MNNVLAllReduce is disabled due to tp_size:{self.mapping.tp_size} != world_size:{self.mapping.world_size}"
-                )
+                logger.debug(
+                    "MNNVLAllReduce disabled: tp_size(%s) != world_size(%s)",
+                    self.mapping.tp_size,
+                    self.mapping.world_size,
+                )
```
465-469
: Optional: avoid calling is_mnnvl with dtype=None

Micro-clarity improvement: only evaluate is_mnnvl when dtype is provided; semantics are unchanged, but it saves a call and makes the intent explicit.
Apply this diff:
```diff
-            elif MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
+            elif dtype and MNNVLAllReduce.is_mnnvl(self.mapping, dtype):
```
📒 Files selected for processing (2)
tensorrt_llm/_torch/distributed/ops.py (1 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py
462-462: Line too long (131 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
44-45
: LGTM: Added imports are correct and necessary

Bringing in AllReduceStrategy and PositionEmbeddingType aligns with their usage below (MLP TP-size logic and PositionalEmbeddingParams). No issues.
tensorrt_llm/_torch/distributed/ops.py (1)
460-481
: Short-circuit guard for tp_size != world_size in MNNVL init: LGTM

This resolves the earlier issue where the disable path could be immediately overridden. This flow is now coherent and prevents contradictory logs/behavior.
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #15240 [ run ] triggered by Bot
PR_Github #15240 [ run ] completed with state
c4a0a72 to 40154d9
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #15322 [ run ] triggered by Bot
/bot kill
Actionable comments posted: 1
🧹 Nitpick comments (3)
tensorrt_llm/_torch/distributed/ops.py (3)
91-93
: Consider documenting the purpose of the `force_applying_finalize` parameter

The new `force_applying_finalize` parameter lacks documentation about when and why it should be used. Add a docstring to clarify the parameter's purpose:

```diff
 def userbuffers_allreduce_finalize(input: torch.Tensor, force_applying_finalize: bool = False) -> torch.Tensor:
+    """Finalize the userbuffers allreduce operation.
+
+    Args:
+        input: Input tensor to finalize
+        force_applying_finalize: If True, forces finalization even if not typically required
+    """
     output = torch.ops.trtllm.userbuffers_allreduce_finalize(input, force_applying_finalize)
```
96-101
: Replace lambda expression with a proper function definition

Using lambda expressions for assignments is against Python best practices. Define a proper function instead.

```diff
-def filter_valid_input(input_list: List[torch.Tensor]) -> Tuple[List[torch.Tensor], List[bool]]:
-    func_valid = lambda x: x is not None
-    valid_list = list(map(func_valid, input_list))
-    input_list = list(filter(func_valid, input_list))
+def filter_valid_input(input_list: List[torch.Tensor]) -> Tuple[List[torch.Tensor], List[bool]]:
+    def is_valid(x):
+        return x is not None
+    valid_list = list(map(is_valid, input_list))
+    input_list = list(filter(is_valid, input_list))
     return input_list, valid_list
```
470-470
: Use idiomatic Python for boolean checks

Replace the explicit `False` comparison with a more idiomatic approach.

```diff
-        if self.mapping.tp_size == 1 or (all_reduce_params is not None and all_reduce_params.enable_allreduce == False):
+        if self.mapping.tp_size == 1 or (all_reduce_params is not None and not all_reduce_params.enable_allreduce):
```
📒 Files selected for processing (2)
tensorrt_llm/_torch/distributed/ops.py (23 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.681Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/distributed/ops.py
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
- tensorrt_llm/functional.py (1): AllReduceStrategy (3876-3885)
- tensorrt_llm/_torch/distributed/ops.py (2): MNNVLAllReduce (249-357), is_mnnvl (275-286)

tensorrt_llm/_torch/distributed/ops.py (3)
- tensorrt_llm/functional.py (6): AllReduceFusionOp (3888-3897), AllReduceParams (3900-3939), AllReduceStrategy (3876-3885), dtype (255-259), dtype (262-267), Tensor (107-602)
- tensorrt_llm/plugin/plugin.py (6): CustomAllReduceHelper (668-844), allocate_allreduce_fusion_workspace (818-844), max_workspace_size_auto (708-716), allocate_lowprecision_workspace (787-815), max_workspace_size_lowprecision (719-720), initialize_lowprecision_buffers (723-727)
- tensorrt_llm/mapping.py (3): Mapping (32-513), local_rank (399-400), is_multi_node (420-421)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py
- 56-56: Local variable `force_mn` is assigned to but never used. Remove assignment to unused variable `force_mn` (F841)
- 104-104: Do not assign a `lambda` expression, use a `def`. Rewrite `func_valid` as a `def` (E731)
- 470-470: Avoid equality comparisons to `False`; use `not all_reduce_params.enable_allreduce:` for false checks. Replace with `not all_reduce_params.enable_allreduce` (E712)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (17)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
44-44
: LGTM! Imports are correctly updated for AllReduceStrategy and PositionEmbeddingType

These imports are properly used in the new MLP TP-size selection logic below.
55-55
: LGTM! MNNVLAllReduce import is properly added for runtime MNNVL availability checks

This distributed op is correctly imported to support the `is_mnnvl` check in the MLP TP-size selection logic.
742-756
: LGTM! MLP TP-size selection now properly considers AllReduceStrategy and MNNVL availability

The implementation correctly:
- Uses `self.model_config.allreduce_strategy` instead of the uninitialized `self.allreduce` (fixing the AttributeError)
- Calls `MNNVLAllReduce.is_mnnvl()` with the required parameters `(self.mapping, self.model_config.pretrained_config.torch_dtype)` (fixing the TypeError)
- Properly restricts inter-node TP when MNNVL is not available or when the strategy doesn't support it

The design decision to have different MNNVL criteria between this component and AllReduce.__init__ is documented as intentional per the retrieved learnings.
tensorrt_llm/_torch/distributed/ops.py (14)
13-13
: LGTM! Import consolidation of AllReduceParams and AllReduceFusionOp

The imports are correctly updated to include AllReduceStrategy alongside the existing imports.
23-33
: Minor: String quoting style consistency change

The change from single to double quotes for thread-local attribute names (`"allreduce_workspaces_{mapping.pp_rank}"`) is a formatting change that maintains functionality.
30-30
: LGTM! Added support_deterministic parameter to workspace allocation

The workspace allocation now explicitly passes `support_deterministic=False` to `max_workspace_size_auto`, which provides better clarity about the deterministic behavior.
37-37
: LGTM! Thread-local storage key consistency and improved workspace initialization

The workspace initialization is properly updated with:
- Consistent thread-local storage key naming
- Explicit parameter passing for tp_size
- Separate initialization call for low precision buffers
Also applies to: 43-46
64-88
: LGTM! Improved MNNVL workspace buffer initialization

The MNNVL workspace initialization is properly structured with:
- Correct buffer size calculation aligned to stride boundaries
- Proper type hints and unpacking of the tuple return value
- Clear documentation of buffer structure
262-268
: LGTM! Improved assertion message and better workspace tuple unpacking

The changes improve code clarity with:
- Better formatted assertion message including the unsupported dtypes
- More readable unpacking of the workspace tuple
280-286
: LGTM! Consistent formatting for MNNVL availability check

The multi-line formatting improves readability while maintaining the same logic.
323-323
: LGTM! Improved buffer slicing readability

The buffer view operation is now more explicit about the slicing steps.
335-338
: LGTM! Consistent formatting for fusion condition check

The multi-line formatting improves readability of the complex condition.
421-439
: LGTM! Well-structured MNNVL initialization with proper guard and logging

The implementation correctly:
- Guards against `tp_size != world_size` mismatch early in the flow
- Only evaluates `is_mnnvl` if the guard passes
- Provides clear debug logging for all enable/disable decisions
- Properly handles exceptions during MNNVL initialization

This resolves the previous issue where the guard could be overridden.
441-446
: LGTM! Forward signature properly enforces keyword-only all_reduce_params

The change to enforce keyword-only passing of `all_reduce_params` improves API safety by preventing accidental positional argument misuse.
541-546
: LGTM! MoEAllReduce forward signature properly enforces keyword-only all_reduce_params

Consistent with the AllReduce changes, this enforces keyword-only passing to prevent API misuse.
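To make the keyword-only contract concrete, here is a minimal, self-contained sketch; the class and parameter names are simplified stand-ins rather than the library's actual definitions:

```python
from typing import Any, Optional


class MoEAllReduce:  # simplified stand-in, not the real module
    def forward(self, input: Any, *,
                all_reduce_params: Optional["MoEAllReduceParams"] = None) -> Any:
        # The bare `*` forces all_reduce_params to be passed by keyword.
        return input


op = MoEAllReduce()
op.forward("x", all_reduce_params=None)   # OK: keyword argument
# op.forward("x", None)                   # raises TypeError: too many positional arguments
```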
579-581
: LGTM! Improved assertion message formatting

The multi-line formatting improves readability of the assertion condition.
587-587
: LGTM! Simplified parameter passing

Using the direct attribute reference is cleaner than the previous multi-line form.
PR_Github #15324 [ kill ] triggered by Bot
PR_Github #15322 [ run ] completed with state
PR_Github #15324 [ kill ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
Actionable comments posted: 1
🔭 Outside diff range comments (4)
tensorrt_llm/_torch/distributed/ops.py (4)
1-21
: Remove logger shadowing and add required NVIDIA header
- You import `logger` from `tensorrt_llm.logger` and then immediately shadow it with a stdlib logger. This can bypass project-wide logging configuration unexpectedly. Pick one; recommended: keep the project logger and drop the local assignment (and the unused `import logging`).
- Per coding guidelines, prepend the NVIDIA copyright header.
Apply this diff:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
-import logging
@@
-from tensorrt_llm.logger import logger
-logger = logging.getLogger(__name__)
+from tensorrt_llm.logger import logger
```
113-120
: Fix return type annotation for get_output_info

The function returns a dict with shape and numel info, but the annotation is `-> List[int]`. This harms type checkers and readability. Apply this diff to correct typing:

```diff
-from typing import List, Optional, Tuple, Union
+from typing import Dict, List, Optional, Tuple, Union
@@
-def get_output_info(input: torch.Tensor, dim: int) -> List[int]:
+def get_output_info(input: torch.Tensor, dim: int) -> Dict[str, Union[List[int], int]]:
@@
-    return {"output_shape": output_shape, "numel_base": numel_base}
+    return {"output_shape": output_shape, "numel_base": numel_base}
```
324-337
: Align MNNVLAllReduce.forward typing and calling convention with None-fallback behavior

The method can return `None` to signal fallback, but the signature/doc don't reflect that. Also, for consistency with AllReduce/MoEAllReduce, enforce keyword-only `all_reduce_params`. Apply this diff:

```diff
-    def forward(
-        self,
-        input: torch.Tensor,
-        all_reduce_params: AllReduceParams,
-    ) -> Union[torch.Tensor, Tuple[torch.Tensor, ...]]:
+    def forward(
+        self,
+        input: torch.Tensor,
+        *,
+        all_reduce_params: AllReduceParams,
+    ) -> Optional[Union[torch.Tensor, Tuple[torch.Tensor, ...]]]:
@@
-        Returns:
-            Union[torch.Tensor, Tuple[torch.Tensor, ...]]: Reduced tensor(s)
+        Returns:
+            Optional[Union[torch.Tensor, Tuple[torch.Tensor, ...]]]:
+                - Reduced tensor(s) when MNNVL path is applicable.
+                - None to indicate the caller should fall back to regular AllReduce.
```

Note: You'll need to import Optional at the top (already present).
588-594
: Fix required: update positional MoEAllReduce calls to pass all_reduce_params as keyword

MoEAllReduce.forward now enforces a keyword-only all_reduce_params. I scanned the repo — all existing callsites already use the keyword except one that must be fixed:
- tensorrt_llm/_torch/models/modeling_llama.py (around line ~578)
```python
allreduce_output = self.moe_allreduce(
    residual,
    self.next_layer_layernorm.weight,
    device_num_experts=num_activated_experts_per_node,
    scale_input=experts_to_token_score,
    active_experts_token_input=hidden_states_activated_experts,
    token_input=shared_output,
    eps=self.next_layer_layernorm.variance_epsilon,
)
```

Action: replace the positional argument usage by constructing/passing a MoEAllReduceParams instance (or otherwise supplying all_reduce_params as a keyword). Example pattern:

```python
moe_params = MoEAllReduceParams(residual=..., norm_weight=..., device_num_experts=..., scale_input=..., active_experts_token_input=..., shared_expert_output=..., eps=...)
allreduce_output = self.moe_allreduce(active_experts_token_input, all_reduce_params=moe_params)
```
This will align the call with the new forward signature and avoid runtime TypeError.
♻️ Duplicate comments (2)
tensorrt_llm/_torch/distributed/ops.py (2)
459-483
: Guard now properly short-circuits MNNVL init when tp_size != world_size

This fixes the previously raised issue where the disable guard could be bypassed. The flow is now unambiguous with clean logging.
59-67
: Remove unused variable `force_mn`

`force_mn` is assigned but never used. It's dead code and flagged by Ruff (F841). Apply this diff:

```diff
     if not hasattr(_thread_local, f"allreduce_mnnvl_workspaces_{mapping.pp_rank}"):
         setattr(_thread_local, f"allreduce_mnnvl_workspaces_{mapping.pp_rank}", {})
-    force_mn = os.environ.get("TRTLLM_FORCE_MNNVL_AR", "0") == "1"
-
     allreduce_mnnvl_workspaces = getattr(
         _thread_local, f"allreduce_mnnvl_workspaces_{mapping.pp_rank}")
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/distributed/ops.py (2)
39-41
: Nit: function name typo (“presicion”)

The function is spelled `allocate_low_presicion_allreduce_workspace`. Consider renaming to `allocate_low_precision_allreduce_workspace` for clarity and consistency. If public, plan a deprecation shim.
209-214
: Make split dimension explicit for readability

In the `sizes is not None` branch, `x.split(sizes)` defaults to dim=0. Make it explicit for clarity and symmetry with `reducescatter`. Apply this diff:

```diff
-    else:
-        x_list = x.split(sizes)
+    else:
+        x_list = x.split(sizes, dim=0)
```
📒 Files selected for processing (1)
tensorrt_llm/_torch/distributed/ops.py (16 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
📚 Learning: 2025-08-14T06:36:40.681Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/distributed/ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/ops.py
- 63-63: Local variable `force_mn` is assigned to but never used. Remove assignment to unused variable `force_mn` (F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/distributed/ops.py (7)
24-29
: LGTM: thread-local workspace keyed by pp_rank

Using a pp_rank-scoped thread-local dictionary avoids cross-PP interference. Looks correct.
147-171
: LGTM: clearer allgather docstring

The expanded doc clarifies sizes semantics and output shaping. No issues spotted.
189-197
: LGTM: consistent 2D flattening with computed numel_base

The `view(-1, numel_base)` approach is consistent with how `numel_base` is derived. Works for both Tensor and list inputs.
251-258
: LGTM: reducescatter input reshaping logic

The reshaping and splitting by `dim` is correct and mirrors the allgather logic.
280-286
: LGTM: reducescatter output re-shaping

Restoring original shapes using cached `output_info` is correct; list handling preserves `None` positions via `restore_full_output`.
400-405
: LGTM: AllReduce now optionally takes dtype and exposes strategy

Constructor parameters are clear and maintain backward compat via defaults.
490-512
: No positional all_reduce_params usages found — OK to enforce keyword-only

I searched for direct AllReduce.forward / AllReduce instance calls that pass all_reduce_params positionally and found none; callers either use the functional wrapper or pass all_reduce_params by name (or via **params).
Checked/important locations (all looked safe):
- tensorrt_llm/_torch/distributed/ops.py — AllReduce implementation (forward).
- tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py — constructs AllReduce and calls torch_op(tensor, all_reduce_params=...).
- tensorrt_llm/_torch/functional.py (def allreduce) — wrapper used across the codebase (e.g. functional.py:2800,2825).
- tests/unittest/_torch/multi_gpu/test_user_buffers.py — ar.forward(...) uses all_reduce_params=ar_params.
- quantization/layers.py, modeling_qwen3_moe.py, modeling_deepseekv3.py, layers/moe.py — use the wrapper or pass all_reduce_params as a keyword.
Conclusion: no callers found that would break from making all_reduce_params keyword-only.
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #16778 [ run ] triggered by Bot
PR_Github #16778 [ run ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #16835 [ run ] triggered by Bot
PR_Github #16835 [ run ] completed with state
/bot run --add-multi-gpu-test --disable-fail-fast
PR_Github #16873 [ run ] triggered by Bot
PR_Github #16873 [ run ] completed with state
…ek V3 to avoid hanging. (NVIDIA#6886) Signed-off-by: Shiyu Li <[email protected]>
Description
This PR fixes the hanging issue when MNNVL is used for Deepseek V3.
- The `is_mnnvl` function in MNNVLAllreduce indicates the availability of MNNVLink, so we do not need to impose the restriction on the TP size within a single node for MLP.
- MNNVL AllReduce is disabled when `tp_size != world_size` to prevent functional errors arising from unexpected parallel mapping. Subsequent work will attempt to eliminate this limitation by acquiring the correct mapping information within the multicast buffer initialization.

Relevant PR: 6860, which should be reverted once we find this fix is satisfying.
Known Issue:
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.