feat: large-scale EP(part 7: DeepEP integration) #4792
9c54709
to
c3b20cd
Compare
Notes from offline discussion
|
/bot -h |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
/bot run --stage-list "Build-Docker-Images" |
PR_Github #7071 [ run ] triggered by Bot |
PR_Github #7071 [ run ] completed with state |
@yuantailing Please fix the style following the guidance https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#coding-style |
09f4f14
to
ec39fda
Compare
/bot run --stage-list "Build-Docker-Images" |
PR_Github #7270 [ run ] triggered by Bot |
PR_Github #7270 [ run ] completed with state |
8995de6
to
9c5fc69
Compare
/bot run --disable-fail-fast --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" |
PR_Github #8785 [ run ] triggered by Bot |
PR_Github #8785 [ run ] completed with state |
OOM on a previously passed test case: The test case passed in PR_Github #8651. There is no code change between these two CI runs.
|
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" |
PR_Github #8822 [ run ] triggered by Bot |
Compare the environment of PR_Github #8651 and PR_Github #8785 Pipeline 8651 installed Pipeline 8651:
Pipeline 8785:
|
PR_Github #8822 [ run ] completed with state |
Build timeout. Note that #5027 changed |
Maybe the second build can reuse ccache. Run again. |
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" |
PR_Github #8866 [ run ] triggered by Bot |
PR_Github #8866 [ run ] completed with state |
ToT failure in the Merge main and test again. |
/bot run --disable-fail-fast --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" |
PR_Github #8873 [ run ] triggered by Bot |
PR_Github #8873 [ run ] completed with state |
The rerun test is I noticed that PR #5140 reran Both reruns happen in the same file and have the same call stack. So I believe the root cause is ToT high failure rate in Appendix: call stack
|
/bot skip --comment "PR_Github #8541, PR_Github #8651, and PR_Github #8873 form a full test. The main branch grows 39 commits from the first test." |
PR_Github #8883 [ skip ] triggered by Bot |
PR_Github #8883 [ skip ] completed with state |
@yuantailing Hi, I tried to enable DeepEP and found that num_nvl_peers and comm are not parameters of DeepEP's Buffer init function, so I guess you modified DeepEP's source code?
----
I've figured out how to install the modified DeepEP. Please see docker/common/install_deep_ep.sh |
Hi @WanchaoYao , |
DeepEP integration
Description
Support matrix:
Please refer to select_alltoall_method_type (in fused_moe_cutlass.py) for the condition of enabling DeepEP or DeepEPLowLatency. This is an experimental feature, so an environment variable TRTLLM_CAN_USE_DEEP_EP=1 is required.
One of the following lines will be printed at initialization:
Known issues:
TRTLLM_MOE_POST_QUANT_ALLTOALLV=0 instead.
Test Coverage
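The opt-in from the Description can be checked with a few lines of Python. This is only a sketch of the environment-variable gate: the helper name deep_ep_opted_in is hypothetical, treating any value other than "1" as "off" is an assumption, and the real decision in select_alltoall_method_type depends on further conditions that are not reproduced here.

```python
import os

# Environment variable named in the PR description.
DEEP_EP_ENV_VAR = "TRTLLM_CAN_USE_DEEP_EP"


def deep_ep_opted_in(environ=None) -> bool:
    """Return True when the experimental DeepEP opt-in is set to "1".

    Hypothetical helper for illustration only; it is not part of TensorRT-LLM.
    Assumption: any value other than the literal string "1" means "off".
    """
    env = os.environ if environ is None else environ
    return env.get(DEEP_EP_ENV_VAR, "0") == "1"


if __name__ == "__main__":
    # Reports the opt-in state of the current process environment.
    print(f"{DEEP_EP_ENV_VAR} opt-in set: {deep_ep_opted_in()}")
```

The variable has to be present in the environment of the process that initializes the model, for example by exporting TRTLLM_CAN_USE_DEEP_EP=1 in the launching shell.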
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
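To make the command grammar above concrete, here is a minimal Python sketch that parses /bot comments such as the ones used in this thread. It is an illustration built from the help text only: the functions make_bot_parser, parse_bot_comment, and split_stages are hypothetical names, and nothing here is claimed to match how the real bot is implemented.

```python
import argparse
import shlex


def make_bot_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the subcommands and flags listed in the help text."""
    parser = argparse.ArgumentParser(prog="/bot")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run")
    # Boolean switches of `run`, as listed in the help text.
    for flag in ("--disable-fail-fast", "--skip-test", "--add-multi-gpu-test",
                 "--only-multi-gpu-test", "--disable-multi-gpu-test", "--post-merge"):
        run.add_argument(flag, action="store_true")
    # Options of `run` that take a quoted, comma-separated value.
    for option in ("--stage-list", "--gpu-type", "--extra-stage"):
        run.add_argument(option, default=None)

    sub.add_parser("kill")

    skip = sub.add_parser("skip")
    skip.add_argument("--comment", required=True)  # the help text marks this as required

    sub.add_parser("reuse-pipeline")
    return parser


def parse_bot_comment(comment: str) -> argparse.Namespace:
    """Split a comment shell-style (honoring double quotes) and parse it."""
    tokens = shlex.split(comment)
    if not tokens or tokens[0] != "/bot":
        raise ValueError("not a /bot command")
    return make_bot_parser().parse_args(tokens[1:])


def split_stages(value: str) -> list:
    """Turn a value like "A10-1, xxx" into a list of stage names."""
    return [part.strip() for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    args = parse_bot_comment(
        '/bot run --disable-fail-fast --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"'
    )
    print(args.command, args.disable_fail_fast, split_stages(args.stage_list))
```

Quoting matters in practice: the stage list and the skip comment each have to reach the parser as a single token, which is why the commands in this thread wrap them in double quotes.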