fix: limit process pool size when prefetching #5088

zhengd-nv · 2025-06-10T10:47:14Z

Description

Only use the necessary number of processes when prefetching.

This is probably the root cause of the flaky nvbug/5301492: the test uses 4 ranks, and prefetching used up to 8xCPU core / 64 processes, while only 4 are needed because the model is small. Some ranks may end up with the following message and stop logging after the "Prefetching 2.05GB checkpoint files." log line, without a corresponding "Loading {file}" line:

[2025-06-10T02:31:05.687Z]   File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 68, in _signal_handler_cleanup_child
[2025-06-10T02:31:05.687Z]     sys.exit(128 + signum)
[2025-06-10T02:31:05.687Z] SystemExit: 143
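As a reading aid for the traceback above: the handler exits with `128 + signum`, so a status of 143 corresponds to signal 15, which is SIGTERM on Linux. The small helper below is a hypothetical illustration (`decode_exit_status` is not part of tensorrt_llm) that decodes such a status using only the standard `signal` module.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Return the name of the signal encoded in a `128 + signum` exit status.

    Hypothetical helper for illustration only; not part of tensorrt_llm.
    """
    if status <= 128:
        raise ValueError(f"{status} does not encode a signal number")
    # signal.Signals maps the platform's signal numbers to their names.
    return signal.Signals(status - 128).name


# The status from the traceback above, decoded on the current platform.
print(decode_exit_status(143))
```

In other words, the affected rank was asked to terminate rather than crashing on its own.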

Other ranks then wait for the defunct rank, which eventually leads to a timeout or broken pipe.
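The idea of the fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in model_engine.py: the names `capped_pool_size`, `_read_through`, `prefetch`, and the `upper_bound` parameter are all hypothetical. It shows a pool sized to the number of files rather than to a CPU-based upper bound, with the empty-file-list case handled separately.

```python
import multiprocessing
from typing import List


def capped_pool_size(num_files: int, upper_bound: int) -> int:
    """Never start more workers than there are files to read.

    Hypothetical helper; `upper_bound` stands in for whatever CPU-based
    limit the caller already applies.
    """
    return max(0, min(num_files, upper_bound))


def _read_through(path: str) -> None:
    """Read a file once in chunks so its pages land in the OS page cache."""
    with open(path, "rb") as f:
        while f.read(16 * 1024 * 1024):
            pass


def prefetch(paths: List[str], upper_bound: int = 64) -> int:
    """Prefetch `paths` with a right-sized pool and return the pool size used."""
    size = capped_pool_size(len(paths), upper_bound)
    if size == 0:
        # multiprocessing.Pool(processes=0) raises ValueError, so skip the pool.
        return 0
    with multiprocessing.Pool(processes=size) as pool:
        pool.map(_read_through, paths)
    return size
```

Under these assumptions, 4 files with an upper bound of 64 yield a pool of 4 processes instead of 64, and an empty file list starts no pool at all.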

The affected tests are:

test_disaggregated_load_balance (pipeline 12885)
test_disaggregated_cache_aware_balance (pipeline 12908, 12909)
- test_workers_kv_cache_aware_router
test_disaggregated_conditional (pipeline 12894)

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Zheng Duan <[email protected]>

zhengd-nv · 2025-06-10T10:48:20Z

/bot run

tensorrt-cicd · 2025-06-10T10:54:35Z

PR_Github #8283 [ run ] triggered by Bot

tensorrt_llm/_torch/pyexecutor/model_engine.py

tensorrt-cicd · 2025-06-10T14:20:23Z

PR_Github #8283 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5996 completed with status: 'FAILURE'


zhengd-nv · 2025-06-11T01:23:41Z

/bot run

tensorrt-cicd · 2025-06-11T01:29:52Z

PR_Github #8361 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-11T03:32:35Z

PR_Github #8361 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6055 completed with status: 'FAILURE'

zhengd-nv · 2025-06-11T05:21:48Z

/bot run

tensorrt-cicd · 2025-06-11T05:27:29Z

PR_Github #8405 [ run ] triggered by Bot

zhengd-nv · 2025-06-11T08:58:37Z

/bot run

tensorrt-cicd · 2025-06-11T09:00:40Z

PR_Github #8405 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6091 completed with status: 'FAILURE'

tensorrt-cicd · 2025-06-11T09:04:27Z

PR_Github #8460 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-11T23:19:39Z

PR_Github #8460 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6129 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

limit process pool size when prefetching

8f30b7f

Signed-off-by: Zheng Duan <[email protected]>

zhengd-nv requested a review from a team as a code owner June 10, 2025 10:47

zhengd-nv requested review from pcastonguay, Shixiaowei02 and yuxianq June 10, 2025 10:47

yuxianq reviewed Jun 10, 2025


tensorrt_llm/_torch/pyexecutor/model_engine.py

zhengd-nv added 2 commits June 11, 2025 01:21

handling no files to load

60db2d6

Signed-off-by: Zheng Duan <[email protected]>

Merge branch 'main' into limit-process-pool

87e31c3

Merge branch 'main' into limit-process-pool

9cbeb62

Merge branch 'main' into limit-process-pool

085e264

zhengd-nv requested a review from yuxianq June 12, 2025 02:12

yuxianq approved these changes Jun 12, 2025


Shixiaowei02 merged commit c592798 into NVIDIA:main Jun 12, 2025
3 checks passed

zhengd-nv mentioned this pull request Jun 12, 2025

ci: waive test [NVBUGS/5301492] #5081

Closed

zhengd-nv deleted the limit-process-pool branch June 12, 2025 05:57
