Conversation

@yuantailing (Member) commented May 30, 2025

DeepEP integration

Description

Support matrix:

| EP Communication method | Intra-node | Inter-node | CUDA Graphs | Notes |
|---|---|---|---|---|
| MNNVL | Y | Y | Y | Only MNNVL systems |
| DeepEP | Y | Y | N | IBGDA required for inter-node |
| DeepEPLowLatency | Y | Y | Y | IBGDA required |

Please refer to select_alltoall_method_type (in fused_moe_cutlass.py) for the conditions under which DeepEP or DeepEPLowLatency is enabled. This is an experimental feature, so the environment variable TRTLLM_CAN_USE_DEEP_EP=1 is required.

One of the following lines will be printed at initialization:

[TRT-LLM] [RANK 0] [I] CutlassFusedMoE selects alltoall_method_type <AlltoallMethodType.NotEnabled: 0>
[TRT-LLM] [RANK 0] [I] CutlassFusedMoE selects alltoall_method_type <AlltoallMethodType.MNNVL: 1>
[TRT-LLM] [RANK 0] [I] CutlassFusedMoE selects alltoall_method_type <AlltoallMethodType.DeepEP: 2>
[TRT-LLM] [RANK 0] [I] CutlassFusedMoE selects alltoall_method_type <AlltoallMethodType.DeepEPLowLatency: 3>
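The selection priority can be sketched roughly as follows. This is a hypothetical illustration, not the actual select_alltoall_method_type implementation (the real conditions live in fused_moe_cutlass.py); the `is_mnnvl_system` and `use_low_latency_kernels` parameters are made-up stand-ins for the real checks:

```python
import os
from enum import Enum


class AlltoallMethodType(Enum):
    NotEnabled = 0
    MNNVL = 1
    DeepEP = 2
    DeepEPLowLatency = 3


def select_alltoall_method_type(is_mnnvl_system: bool,
                                use_low_latency_kernels: bool) -> AlltoallMethodType:
    """Pick an alltoall method (illustrative priority order only)."""
    # MNNVL is only available on MNNVL systems.
    if is_mnnvl_system:
        return AlltoallMethodType.MNNVL
    # DeepEP is experimental and gated behind an environment variable.
    if os.environ.get("TRTLLM_CAN_USE_DEEP_EP") != "1":
        return AlltoallMethodType.NotEnabled
    if use_low_latency_kernels:
        return AlltoallMethodType.DeepEPLowLatency
    return AlltoallMethodType.DeepEP
```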

Known issues:

  1. DeepEPLowLatency + FP4 postquant_alltoall is not implemented. Set TRTLLM_MOE_POST_QUANT_ALLTOALLV=0 instead.
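Putting the above together, enabling the experimental DeepEP path while working around the known FP4 postquant issue might look like this (an illustrative snippet using only the environment variables named above):

```shell
# Enable the experimental DeepEP integration.
export TRTLLM_CAN_USE_DEEP_EP=1
# Work around known issue 1: disable FP4 postquant alltoall.
export TRTLLM_MOE_POST_QUANT_ALLTOALLV=0
```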

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yuantailing yuantailing force-pushed the deepep branch 2 times, most recently from 9c54709 to c3b20cd Compare May 30, 2025 05:53
@kaiyux (Member) commented May 30, 2025

Notes from offline discussion

  1. It's OK not to support out-of-the-box pip install as a first step; going forward, we need to figure out a way.
  2. Use an environment variable to enable DeepEP as a first step.
  3. Add a multi-GPU end-to-end test to cover basic functionality.

@kaiyux (Member) commented May 30, 2025

/bot -h


@yuantailing (Member Author)

/bot run --stage-list "Build-Docker-Images"

@yuantailing (Member Author)

/bot -h


@tensorrt-cicd (Collaborator)

PR_Github #7071 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #7071 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #5117 (Partly Tested) completed with status: 'FAILURE'

@juney-nvidia juney-nvidia changed the title Add DeepEP support for large-scale EP feat: large-scale EP(part 7: DeepEP integration) Jun 1, 2025
@kaiyux (Member) commented Jun 3, 2025

PR_Github #7071 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #5117 (Partly Tested) completed with status: 'FAILURE'

@yuantailing Please fix the style following the guidance https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#coding-style

@yuantailing yuantailing force-pushed the deepep branch 3 times, most recently from 09f4f14 to ec39fda Compare June 3, 2025 02:44
@yuantailing (Member Author)

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd (Collaborator)

PR_Github #7270 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #7270 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #5268 (Partly Tested) completed with status: 'FAILURE'

@yuantailing yuantailing force-pushed the deepep branch 2 times, most recently from 8995de6 to 9c5fc69 Compare June 3, 2025 06:16
@yuantailing (Member Author)

/bot run --disable-fail-fast --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd (Collaborator)

PR_Github #8785 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8785 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6381 (Partly Tested) completed with status: 'FAILURE'

@yuantailing (Member Author) commented Jun 13, 2025

OOM on a previously passing test case: DGX_H100-4_GPUs-PyTorch-DeepSeek-1/accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]

The test case passed in PR_Github #8651.

There was no code change between these two CI runs.

[2025-06-13T14:06:48.472Z] [06/13/2025-12:08:49] [TRT-LLM] [RANK 3] [E] Encountered an error in forward function: BatchQKApplyRotaryPosIdsCosSinCache failed with error code out of memory
[2025-06-13T14:06:48.472Z] [06/13/2025-12:08:49] [TRT-LLM] [RANK 3] [E] Error in event loop: Cannot get the result for a response with an error: BatchQKApplyRotaryPosIdsCosSinCache failed with error code out of memory (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/executor/responseImpl.h:73)
[2025-06-13T14:06:48.472Z] 1       0x7fa406cd7798 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1dae798) [0x7fa406cd7798]
[2025-06-13T14:06:48.472Z] 2       0x7fa478bb1727 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x19f727) [0x7fa478bb1727]
[2025-06-13T14:06:48.472Z] 3       0x7fa478adc003 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xca003) [0x7fa478adc003]
[2025-06-13T14:06:48.472Z] 4             0x58208f /usr/bin/python() [0x58208f]
[2025-06-13T14:06:48.472Z] 5             0x549185 _PyObject_MakeTpCall + 117
[2025-06-13T14:06:48.472Z] 6             0x54ce86 /usr/bin/python() [0x54ce86]
[2025-06-13T14:06:48.472Z] 7             0x5a17b4 /usr/bin/python() [0x5a17b4]
[2025-06-13T14:06:48.472Z] 8             0x6786c8 /usr/bin/python() [0x6786c8]
[2025-06-13T14:06:48.472Z] 9             0x5a4f12 /usr/bin/python() [0x5a4f12]
[2025-06-13T14:06:48.472Z] 10            0x581fa2 /usr/bin/python() [0x581fa2]
[2025-06-13T14:06:48.472Z] 11            0x54a86d PyObject_CallOneArg + 77
[2025-06-13T14:06:48.472Z] 12            0x623635 /usr/bin/python() [0x623635]
[2025-06-13T14:06:48.472Z] 13            0x6244af /usr/bin/python() [0x6244af]
[2025-06-13T14:06:48.472Z] 14            0x622f2f /usr/bin/python() [0x622f2f]
[2025-06-13T14:06:48.472Z] 15            0x6c61ca /usr/bin/python() [0x6c61ca]
[2025-06-13T14:06:48.472Z] 16            0x581f0d /usr/bin/python() [0x581f0d]
[2025-06-13T14:06:48.472Z] 17      0x7fa89305f2cb /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x5b2cb) [0x7fa89305f2cb]
[2025-06-13T14:06:48.472Z] 18      0x7fa8930855ef /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x815ef) [0x7fa8930855ef]
[2025-06-13T14:06:48.472Z] 19      0x7fa8931020ed /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0xfe0ed) [0x7fa8931020ed]
[2025-06-13T14:06:48.472Z] 20      0x7fa893105f8a /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x101f8a) [0x7fa893105f8a]
[2025-06-13T14:06:48.472Z] 21            0x551598 /usr/bin/python() [0x551598]
[2025-06-13T14:06:48.472Z] 22            0x549b85 PyObject_Vectorcall + 53
[2025-06-13T14:06:48.472Z] 23            0x5d73c9 _PyEval_EvalFrameDefault + 2697
[2025-06-13T14:06:48.472Z] 24            0x54cd32 /usr/bin/python() [0x54cd32]
[2025-06-13T14:06:48.472Z] 25            0x5db55b _PyEval_EvalFrameDefault + 19483
[2025-06-13T14:06:48.472Z] 26            0x54cd32 /usr/bin/python() [0x54cd32]
[2025-06-13T14:06:48.472Z] 27            0x6f826c /usr/bin/python() [0x6f826c]
[2025-06-13T14:06:48.472Z] 28            0x6b917c /usr/bin/python() [0x6b917c]
[2025-06-13T14:06:48.472Z] 29      0x7fa8948d3aa4 /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fa8948d3aa4]
[2025-06-13T14:06:48.472Z] 30      0x7fa894960c3c /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fa894960c3c]
[2025-06-13T14:06:48.472Z] [06/13/2025-12:08:49] [TRT-LLM] [RANK 3] [E] Traceback (most recent call last):
[2025-06-13T14:06:48.472Z]   File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1615, in _forward_step
[2025-06-13T14:06:48.472Z]     outputs = forward(scheduled_requests, self.resource_manager,
[2025-06-13T14:06:48.472Z]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-06-13T14:06:48.472Z]   File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
[2025-06-13T14:06:48.472Z]     result = func(*args, **kwargs)
[2025-06-13T14:06:48.472Z]              ^^^^^^^^^^^^^^^^^^^^^
[2025-06-13T14:06:48.472Z]   File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1605, in forward
[2025-06-13T14:06:48.472Z]     return self.model_engine.forward(
[2025-06-13T14:06:48.472Z]            ^^^^^^^^^^^^^^^^^^^^^^^^^^

@kaiyux (Member) commented Jun 13, 2025

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd (Collaborator)

PR_Github #8822 [ run ] triggered by Bot

@yuantailing (Member Author)

Comparing the environments of PR_Github #8651 and PR_Github #8785:

Pipeline 8651 installed optimum-1.25.3
Pipeline 8785 installed optimum-1.26.1

Pipeline 8651:

Successfully installed DataProperty-1.1.0 StrEnum-0.4.15 accelerate-1.7.0 aenum-3.1.16 alembic-1.16.1 backoff-2.2.1 bandit-1.7.7 blake3-1.0.5 cfgv-3.4.0 chardet-5.2.0 click_option_group-0.5.7 colorama-0.4.6 colored-2.3.0 colorlog-6.9.0 coverage-7.9.0 cramjam-2.10.0 datasets-3.1.0 diffusers-0.33.1 dill-0.3.8 distlib-0.3.9 distro-1.9.0 docstring_parser-0.16 etcd3-0.12.0 evaluate-0.4.3 fastapi-0.115.4 fastparquet-2024.11.0 flashinfer-python-0.2.5 fsspec-2024.9.0 genai-perf-0.0.13 graphviz-0.20.3 greenlet-3.2.3 h5py-3.12.1 hf-xet-1.1.3 httpcore-1.0.9 huggingface-hub-0.33.0 identify-2.6.12 jieba-0.42.1 jiter-0.10.0 jsonlines-4.0.0 kaleido-0.2.1 lark-1.2.2 lm_eval-0.4.8 lxml-5.4.0 mako-1.3.10 mbstrdecoder-1.1.4 multiprocess-0.70.16 mypy-1.16.0 narwhals-1.42.1 nltk-3.9.1 nodeenv-1.9.1 numexpr-2.11.0 nvidia-cuda-nvrtc-cu12-12.9.86 nvidia-modelopt-0.31.0 nvidia-modelopt-core-0.31.0 nvidia-nccl-cu12-2.27.3 onnx_graphsurgeon-0.5.8 openai-1.86.0 optimum-1.25.3 optuna-3.6.1 ordered-set-4.1.0 oyaml-1.0 parameterized-0.9.0 pathvalidate-3.2.3 patsy-1.0.1 pbr-6.1.1 peft-0.15.2 pillow-10.3.0 plotly-6.1.2 portalocker-3.1.1 pre-commit-4.2.0 py-1.11.0 pybind11-stubgen-2.5.4 pytablewriter-1.2.1 pytest-8.4.0 pytest-asyncio-1.0.0 pytest-cov-6.2.1 pytest-csv-3.0.0 pytest-env-1.1.5 pytest-forked-1.6.0 pytest-mock-3.14.1 pytest-split-0.10.0 pytest-threadleak-0.5.0 pytest-timeout-2.4.0 python-rapidjson-1.20 responses-0.25.7 rouge-1.0.1 rouge_score-0.1.2 ruff-0.9.4 sacrebleu-2.5.1 sentencepiece-0.2.0 sqlalchemy-2.0.41 sqlitedict-2.1.0 starlette-0.41.3 statsmodels-0.14.4 stevedore-5.4.1 tabledata-1.3.4 tcolorpy-0.1.7 tenacity-9.1.2 tiktoken-0.9.0 tokenizers-0.21.1 tqdm-multiprocess-0.0.11 transformers-4.51.3 tritonclient-2.58.0 typepy-1.3.4 uvicorn-0.34.3 virtualenv-20.31.2 word2number-1.1 xgrammar-0.1.16 xxhash-3.5.0 zstandard-0.23.0

Pipeline 8785:

Successfully installed DataProperty-1.1.0 StrEnum-0.4.15 accelerate-1.7.0 aenum-3.1.16 alembic-1.16.1 backoff-2.2.1 bandit-1.7.7 blake3-1.0.5 cfgv-3.4.0 chardet-5.2.0 click_option_group-0.5.7 colorama-0.4.6 colored-2.3.0 colorlog-6.9.0 coverage-7.9.0 cramjam-2.10.0 datasets-3.1.0 diffusers-0.33.1 dill-0.3.8 distlib-0.3.9 distro-1.9.0 docstring_parser-0.16 etcd3-0.12.0 evaluate-0.4.3 fastapi-0.115.4 fastparquet-2024.11.0 flashinfer-python-0.2.5 fsspec-2024.9.0 genai-perf-0.0.13 graphviz-0.20.3 greenlet-3.2.3 h5py-3.12.1 hf-xet-1.1.3 httpcore-1.0.9 huggingface-hub-0.33.0 identify-2.6.12 jieba-0.42.1 jiter-0.10.0 jsonlines-4.0.0 kaleido-0.2.1 lark-1.2.2 lm_eval-0.4.8 lxml-5.4.0 mako-1.3.10 mbstrdecoder-1.1.4 multiprocess-0.70.16 mypy-1.16.0 narwhals-1.42.1 nltk-3.9.1 nodeenv-1.9.1 numexpr-2.11.0 nvidia-cuda-nvrtc-cu12-12.9.86 nvidia-modelopt-0.31.0 nvidia-modelopt-core-0.31.0 nvidia-nccl-cu12-2.27.3 onnx_graphsurgeon-0.5.8 openai-1.86.0 optimum-1.26.1 optuna-3.6.1 ordered-set-4.1.0 oyaml-1.0 parameterized-0.9.0 pathvalidate-3.2.3 patsy-1.0.1 pbr-6.1.1 peft-0.15.2 pillow-10.3.0 plotly-6.1.2 portalocker-3.1.1 pre-commit-4.2.0 py-1.11.0 pybind11-stubgen-2.5.4 pytablewriter-1.2.1 pytest-8.4.0 pytest-asyncio-1.0.0 pytest-cov-6.2.1 pytest-csv-3.0.0 pytest-env-1.1.5 pytest-forked-1.6.0 pytest-mock-3.14.1 pytest-split-0.10.0 pytest-threadleak-0.5.0 pytest-timeout-2.4.0 python-rapidjson-1.20 responses-0.25.7 rouge-1.0.1 rouge_score-0.1.2 ruff-0.9.4 sacrebleu-2.5.1 sentencepiece-0.2.0 sqlalchemy-2.0.41 sqlitedict-2.1.0 starlette-0.41.3 statsmodels-0.14.4 stevedore-5.4.1 tabledata-1.3.4 tcolorpy-0.1.7 tenacity-9.1.2 tiktoken-0.9.0 tokenizers-0.21.1 tqdm-multiprocess-0.0.11 transformers-4.51.3 tritonclient-2.58.0 typepy-1.3.4 uvicorn-0.34.3 virtualenv-20.31.2 word2number-1.1 xgrammar-0.1.16 xxhash-3.5.0 zstandard-0.23.0
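Version drift like the optimum bump above can be spotted automatically by diffing the two pip "Successfully installed" lines. A small standalone helper (illustrative only, not part of the CI tooling) might look like:

```python
def parse_installed(pip_line: str) -> dict:
    """Parse a pip 'Successfully installed ...' line into {package: version}."""
    prefix = "Successfully installed "
    if pip_line.startswith(prefix):
        pip_line = pip_line[len(prefix):]
    packages = {}
    for token in pip_line.split():
        # The version is everything after the last '-', e.g. 'optimum-1.25.3'.
        name, _, version = token.rpartition("-")
        packages[name] = version
    return packages


def diff_envs(line_a: str, line_b: str) -> dict:
    """Return {package: (version_a, version_b)} for packages that differ."""
    a, b = parse_installed(line_a), parse_installed(line_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```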

@tensorrt-cicd (Collaborator)

PR_Github #8822 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6410 (Partly Tested) completed with status: 'FAILURE'

@yuantailing (Member Author)

Build timeout. Note that #5027 changed BUILD_JOBS from 8 to 4, which may make the build take about twice as long.

@yuantailing (Member Author)

Maybe the second build can reuse ccache. Running again.

@yuantailing (Member Author)

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd (Collaborator)

PR_Github #8866 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8866 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6453 (Partly Tested) completed with status: 'FAILURE'

@yuantailing (Member Author) commented Jun 14, 2025

ToT failure in the Check Test Lists phase; it should be fixed by #5205.

Merging main and testing again.

@yuantailing (Member Author)

/bot run --disable-fail-fast --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd (Collaborator)

PR_Github #8873 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8873 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6460 (Partly Tested) completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@yuantailing (Member Author)

The rerun test is DGX_H100-4_GPUs-PyTorch-DeepSeek-1/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_attention_dp_overlap_cuda_graph[DeepSeek-V3-Lite-fp8].

I noticed that PR #5140 reran DGX_H100-4_GPUs-PyTorch-DeepSeek-1/disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_bf16_cache_aware_balance[DeepSeek-V3-Lite-bf16] (rerun report).

Both reruns happened in the same file and have the same call stack, so I believe the root cause is a high ToT failure rate in test_disaggregated.py. Therefore, ignoring the rerun in this pull request is reasonable.

Appendix: call stack

Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/LLM/main/L0_Test-x86_64/llmVanilla/TensorRT-LLM/src/examples/disaggregated/clients/disagg_client.py", line 189, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/jenkins/agent/workspace/LLM/main/L0_Test-x86_64/llmVanilla/TensorRT-LLM/src/examples/disaggregated/clients/disagg_client.py", line 182, in main
    responses = await asyncio.gather(*tasks)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/agent/workspace/LLM/main/L0_Test-x86_64/llmVanilla/TensorRT-LLM/src/examples/disaggregated/clients/disagg_client.py", line 45, in send_request
    raise Exception(f"Error: {await response.text()}")
Exception: Error: {"detail":"Internal server error [Errno 32] Broken pipe"}

@yuantailing (Member Author)

/bot skip --comment "PR_Github #8541, PR_Github #8651, and PR_Github #8873 form a full test. The main branch grows 39 commits from the first test."

@yuantailing yuantailing enabled auto-merge (squash) June 14, 2025 09:41
@tensorrt-cicd (Collaborator)

PR_Github #8883 [ skip ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #8883 [ skip ] completed with state SUCCESS
Skipping testing for commit 4181a6e

@yuantailing yuantailing disabled auto-merge June 14, 2025 09:56
@kaiyux kaiyux enabled auto-merge (squash) June 14, 2025 11:11
@kaiyux kaiyux merged commit 0b60da2 into NVIDIA:main Jun 14, 2025
3 checks passed
@WanchaoYao commented Jul 8, 2025

@yuantailing Hi, I tried to enable DeepEP and found that num_nvl_peers and comm are not parameters of DeepEP's Buffer init function, so I guess you modified DeepEP's source code? ---- I've figured out how to install the modified DeepEP; please see docker/common/install_deep_ep.sh.

@yuantailing yuantailing deleted the deepep branch July 23, 2025 07:42
@yuantailing (Member Author)

Hi @WanchaoYao,
Sorry for the late reply. Glad that you found the installation script. Also, we have since made tensorrt_llm.deep_ep self-contained (#5534) by compiling NVSHMEM and DeepEP at build time, so a system-wide DeepEP installation is no longer required, FYI.
