Conversation

HuiGao-NV
Collaborator

No description provided.

@HuiGao-NV HuiGao-NV requested a review from a team as a code owner May 25, 2025 01:09
@HuiGao-NV HuiGao-NV requested review from dcampora and litaotju May 25, 2025 01:09
@juney-nvidia juney-nvidia requested a review from zongfeijing May 25, 2025 01:13
@juney-nvidia
Collaborator

Thanks, Jerry. This indeed makes the code cleaner.

June

@HuiGao-NV
Collaborator Author

/bot run

@HuiGao-NV HuiGao-NV requested a review from hlu1 May 26, 2025 01:22
@tensorrt-cicd
Collaborator

PR_Github #6402 [ run ] triggered by Bot

@HuiGao-NV HuiGao-NV changed the title from "Use backend to replace macro to control enable MNVL all reduce" to "Use backend to replace macro to control enablement of MNNVL all reduce" May 26, 2025
@tensorrt-cicd
Collaborator

PR_Github #6402 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4680 completed with status: 'FAILURE'

@HuiGao-NV
Collaborator Author

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-Others-1"

@tensorrt-cicd
Collaborator

PR_Github #6459 [ run ] triggered by Bot

@HuiGao-NV HuiGao-NV self-assigned this May 26, 2025
@tensorrt-cicd
Collaborator

PR_Github #6459 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4726 (Partly Tested) completed with status: 'SUCCESS'

@HuiGao-NV HuiGao-NV marked this pull request as draft May 28, 2025 00:57
@HuiGao-NV
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #7074 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7074 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5119 completed with status: 'FAILURE'

@HuiGao-NV
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #7105 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7105 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5136 completed with status: 'FAILURE'

@HuiGao-NV
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #8389 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8389 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6078 (Partly Tested) completed with status: 'FAILURE'

@HuiGao-NV
Collaborator Author

HuiGao-NV commented Jun 11, 2025

fp8_block_scaling_gemm fails with an error that is unrelated to this change: RuntimeError: Assertion failed: cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) == cudaSuccess

This is a known issue, so the pipeline needs a rerun.

@HuiGao-NV
Collaborator Author

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd
Collaborator

PR_Github #8487 [ run ] triggered by Bot

Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
@HuiGao-NV
Collaborator Author

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1" --comment="rerun after rebase with that known issue is waived "

@tensorrt-cicd
Collaborator

PR_Github #8499 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --comment=rerun after rebase with that known issue is waived
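
The usage line and error text above have the shape produced by Python's argparse when an option is passed to a subcommand that does not define it. The following is a minimal sketch of that behavior, not the actual bot implementation: the parser layout, the class names, and the choice of which options belong to `run` and `skip` are assumptions inferred only from the commands that appear in this thread.

```python
import argparse


class ParseError(Exception):
    """Raised instead of exiting, so the failure can be inspected."""


class BotParser(argparse.ArgumentParser):
    # argparse normally prints the message and exits; raise instead.
    def error(self, message):
        raise ParseError(message)


def build_parser():
    # Hypothetical layout: each subcommand owns its own options.
    parser = BotParser(prog="/bot")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run")
    run.add_argument("--stage-list")
    run.add_argument("--disable-fail-fast", action="store_true")

    skip = sub.add_parser("skip")
    skip.add_argument("--comment")  # defined for skip only (assumption)
    return parser


if __name__ == "__main__":
    parser = build_parser()
    try:
        parser.parse_args(
            ["run", "--stage-list=DGX_H100-4_GPUs-PyTorch-DeepSeek-1",
             "--comment=rerun after rebase"]
        )
    except ParseError as exc:
        print(f"/bot: error: {exc}")
```

Under this assumed layout, `--comment` parses for `skip` but is rejected for `run`, which is consistent with the later `/bot skip --comment=...` command in this thread being accepted.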

@HuiGao-NV
Collaborator Author

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd
Collaborator

PR_Github #8506 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8506 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6166 (Partly Tested) completed with status: 'FAILURE'

Signed-off-by: Hui Gao <[email protected]>
@HuiGao-NV
Collaborator Author

/bot run --stage-list="DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd
Collaborator

PR_Github #8555 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8555 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6206 (Partly Tested) completed with status: 'SUCCESS'

@HuiGao-NV
Collaborator Author

/bot skip --comment="After rerun, all multi-gpu stages passed."

@tensorrt-cicd
Collaborator

PR_Github #8587 [ skip ] triggered by Bot

@HuiGao-NV HuiGao-NV enabled auto-merge (squash) June 12, 2025 03:16
@tensorrt-cicd
Collaborator

PR_Github #8587 [ skip ] completed with state SUCCESS
Skipping testing for commit 5da1db9

@HuiGao-NV HuiGao-NV merged commit 4319237 into NVIDIA:main Jun 12, 2025
3 checks passed
@HuiGao-NV HuiGao-NV deleted the ar_backend branch June 13, 2025 03:54

9 participants