Use backend to replace macro to control enablement of MNNVL all reduce #4635

HuiGao-NV · 2025-05-25T01:09:41Z

No description provided.

juney-nvidia · 2025-05-25T01:15:58Z

Thanks, Jerry. This indeed makes the code cleaner.

June

HuiGao-NV · 2025-05-26T01:18:14Z

/bot run

tensorrt-cicd · 2025-05-26T01:23:45Z

PR_Github #6402 [ run ] triggered by Bot

tensorrt_llm/_torch/distributed/ops.py

tensorrt_llm/_torch/model_config.py

tensorrt-cicd · 2025-05-26T05:45:28Z

PR_Github #6402 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4680 completed with status: 'FAILURE'

HuiGao-NV · 2025-05-26T07:10:20Z

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-Others-1"

HuiGao-NV · 2025-05-26T10:12:02Z

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-Others-1"

tensorrt-cicd · 2025-05-26T10:18:06Z

PR_Github #6459 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-26T14:49:15Z

PR_Github #6459 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4726 (Partly Tested) completed with status: 'SUCCESS'

tensorrt_llm/_torch/distributed/ops.py

HuiGao-NV · 2025-05-30T11:15:09Z

/bot run

tensorrt-cicd · 2025-05-30T11:20:45Z

PR_Github #7074 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-30T20:16:44Z

PR_Github #7074 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5119 completed with status: 'FAILURE'

HuiGao-NV · 2025-05-30T22:33:17Z

/bot run

tensorrt-cicd · 2025-05-30T22:39:25Z

PR_Github #7105 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-31T04:03:23Z

PR_Github #7105 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5136 completed with status: 'FAILURE'

HuiGao-NV · 2025-05-31T09:02:26Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-11T03:22:23Z

PR_Github #8389 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-11T12:17:27Z

PR_Github #8389 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6078 (Partly Tested) completed with status: 'FAILURE'

HuiGao-NV · 2025-06-11T13:01:46Z

The failure is in fp8_block_scaling_gemm and is not related to this change: RuntimeError: Assertion failed: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess

It's a known issue, so the stage needs a rerun.

HuiGao-NV · 2025-06-11T13:02:13Z

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

tensorrt-cicd · 2025-06-11T13:08:24Z

PR_Github #8487 [ run ] triggered by Bot

HuiGao-NV · 2025-06-11T14:00:52Z

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1" --comment="rerun after rebase with that known issue is waived "

tensorrt-cicd · 2025-06-11T14:07:04Z

PR_Github #8499 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --comment=rerun after rebase with that known issue is waived
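The usage line and the "unrecognized arguments" message above resemble what a subcommand-based command-line parser prints when an option is passed to a subcommand that does not define it. The following is a minimal sketch of that behavior using Python's `argparse`. It is not the bot's actual implementation: the `build_parser` and `unrecognized` helpers are hypothetical, and it assumes only that `run` accepts `--stage-list` and `--disable-fail-fast` while `skip` accepts `--comment`, as the commands in this thread suggest.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for a '/bot'-style parser with subcommands."""
    parser = argparse.ArgumentParser(prog="/bot")
    sub = parser.add_subparsers(dest="command", required=True)

    # Assumption: 'run' knows these two options but not --comment.
    run = sub.add_parser("run")
    run.add_argument("--stage-list")
    run.add_argument("--disable-fail-fast", action="store_true")

    # Assumption: 'skip' is the subcommand that takes --comment.
    skip = sub.add_parser("skip")
    skip.add_argument("--comment")
    return parser


def unrecognized(argv: list[str]) -> list[str]:
    """Return the arguments the parser could not match to any option."""
    _, extra = build_parser().parse_known_args(argv)
    return extra


if __name__ == "__main__":
    # --comment is not defined for 'run', so it is left over.
    print(unrecognized(["run", "--stage-list=DGX_H100-4_GPUs-PyTorch-DeepSeek-1",
                        "--comment=rerun after rebase"]))
    # The same option is accepted by 'skip'.
    print(unrecognized(["skip", "--comment=After rerun, all multi-gpu stages passed."]))
```

With `parse_args` instead of `parse_known_args`, a non-empty leftover list is what produces an "unrecognized arguments" error, which is consistent with the later `/bot skip --comment=...` command being accepted while `/bot run ... --comment=...` was rejected.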

HuiGao-NV · 2025-06-11T14:36:27Z

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

tensorrt-cicd · 2025-06-11T14:42:36Z

PR_Github #8506 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-11T16:02:59Z

PR_Github #8506 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6166 (Partly Tested) completed with status: 'FAILURE'

HuiGao-NV · 2025-06-11T22:55:07Z

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

tensorrt-cicd · 2025-06-11T23:01:31Z

PR_Github #8555 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-12T02:57:27Z

PR_Github #8555 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6206 (Partly Tested) completed with status: 'SUCCESS'

HuiGao-NV · 2025-06-12T02:59:19Z

/bot skip --comment="After rerun, all multi-gpu stages passed."

tensorrt-cicd · 2025-06-12T03:05:23Z

PR_Github #8587 [ skip ] triggered by Bot

tensorrt-cicd · 2025-06-12T03:22:48Z

PR_Github #8587 [ skip ] completed with state SUCCESS
Skipping testing for commit 5da1db9

HuiGao-NV requested a review from a team as a code owner May 25, 2025 01:09

HuiGao-NV requested review from dcampora and litaotju May 25, 2025 01:09

juney-nvidia requested a review from zongfeijing May 25, 2025 01:13

HuiGao-NV requested a review from hlu1 May 26, 2025 01:22

HuiGao-NV changed the title ~~Use backend to replace macro to control enable MNVL all reduce~~ Use backend to replace macro to control enablement of MNNVL all reduce May 26, 2025

zongfeijing reviewed May 26, 2025

tensorrt_llm/_torch/distributed/ops.py (outdated, resolved)

tensorrt_llm/_torch/distributed/ops.py (outdated, resolved)

tensorrt_llm/_torch/distributed/ops.py (outdated, resolved)

kaiyux reviewed May 26, 2025

tensorrt_llm/_torch/model_config.py (outdated, resolved)

HuiGao-NV force-pushed the ar_backend branch from b2a688f to 5536034 Compare May 26, 2025 06:27

HuiGao-NV self-assigned this May 26, 2025

hlu1 reviewed May 27, 2025

tensorrt_llm/_torch/distributed/ops.py (outdated, resolved)

hlu1 requested changes May 27, 2025

tensorrt_llm/_torch/distributed/ops.py (outdated, resolved)

HuiGao-NV marked this pull request as draft May 28, 2025 00:57

HuiGao-NV force-pushed the ar_backend branch from 99bad16 to 0921a79 Compare May 30, 2025 11:13

HuiGao-NV force-pushed the ar_backend branch from 0921a79 to 7fd2da9 Compare May 31, 2025 08:58

HuiGao-NV added 8 commits June 11, 2025 13:46

Use backend to replace macro to control enable MNVL all reduce

aaeb67b

Signed-off-by: Hui Gao <[email protected]>

Revert some change in tests

6130848

Signed-off-by: Hui Gao <[email protected]>

Fix MNNVL name

225c1b0

Signed-off-by: Hui Gao <[email protected]>

Address comments

d5b372c

Signed-off-by: Hui Gao <[email protected]>

Add strategy support in extra llm api config

bd1183d

Signed-off-by: Hui Gao <[email protected]>

Fix docs test cases

bd1b058

Signed-off-by: Hui Gao <[email protected]>

Address comments

d93a876

Signed-off-by: Hui Gao <[email protected]>

Address comments to remove code setting strategies to Linear when no mapping

a7fab8b

Signed-off-by: Hui Gao <[email protected]>

HuiGao-NV force-pushed the ar_backend branch from 0271fb5 to a7fab8b Compare June 11, 2025 13:59

Fix format

e4426ac

Signed-off-by: Hui Gao <[email protected]>

Merge branch 'main' into ar_backend

5da1db9

HuiGao-NV enabled auto-merge (squash) June 12, 2025 03:16

HuiGao-NV merged commit 4319237 into NVIDIA:main Jun 12, 2025
3 checks passed

HuiGao-NV deleted the ar_backend branch June 13, 2025 03:54
