[TRTLLM-5825][fix] Fix torch LoRA TP #5338

amitz-nv · 2025-06-18T16:08:54Z

Description

The torch LoRA TP tests in unittest/llmapi/test_llm_multi_gpu_pytorch.py had not been running on CI: they use only 2 GPUs, while the DGX has 4, and the pytest command line selected only tests marked with @pytest.mark.gpu4. I fixed the pytest command line to also run tests marked with @pytest.mark.gpu2.
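To illustrate the marker selection described above, here is a minimal sketch of building a pytest `-m` expression that picks up every `gpuN` marker that fits on the machine. The helper `marker_expression` and its `known_sizes` parameter are hypothetical and are not taken from this PR; the actual CI change lives in the pipeline's pytest command line, which is not shown here.

```python
def marker_expression(available_gpus: int, known_sizes=(2, 4, 8)) -> str:
    """Build a pytest -m expression for all gpuN markers that fit.

    Hypothetical helper: the marker names follow the @pytest.mark.gpu2 /
    @pytest.mark.gpu4 convention mentioned in the description.
    """
    # Keep only the marker sizes that the machine can satisfy.
    usable = [n for n in known_sizes if n <= available_gpus]
    # pytest's -m option accepts boolean expressions such as "gpu2 or gpu4".
    return " or ".join(f"gpu{n}" for n in usable)


if __name__ == "__main__":
    # On a 4-GPU machine, select both the 2-GPU and the 4-GPU tests.
    print(marker_expression(4))
```

The resulting string would be passed as `pytest -m "gpu2 or gpu4" ...`, so that a test marked only with `@pytest.mark.gpu2` is no longer filtered out on a 4-GPU machine.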
I also fixed the calculations in ModelConfig.get_bindings_model_config to match gptJsonConfig.cpp::createModelConfig, as the get_bindings_model_config docstring states.
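This description does not say which fields were miscalculated. As a rough illustration of why such calculations matter for TP>1, the sketch below derives per-rank dimensions by dividing global model dimensions by the tensor-parallel size. The function `per_rank_dims`, the `PerRankDims` type, and the specific rounding rules are assumptions made for illustration; they are not the code of get_bindings_model_config or createModelConfig.

```python
from dataclasses import dataclass


@dataclass
class PerRankDims:
    """Hypothetical container for the dimensions seen by one TP rank."""
    num_heads: int
    num_kv_heads: int
    hidden_size: int


def per_rank_dims(num_attention_heads: int, num_key_value_heads: int,
                  hidden_size: int, tp_size: int) -> PerRankDims:
    """Split global dimensions across tp_size ranks (illustrative only)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    if num_attention_heads % tp_size or hidden_size % tp_size:
        raise ValueError("dimensions must be divisible by tp_size")
    return PerRankDims(
        num_heads=num_attention_heads // tp_size,
        # Assumption: each rank keeps at least one KV head.
        num_kv_heads=max(num_key_value_heads // tp_size, 1),
        hidden_size=hidden_size // tp_size,
    )


if __name__ == "__main__":
    # Illustrative values: 40 heads, 40 KV heads, hidden size 5120, TP=2.
    print(per_rank_dims(40, 40, 5120, tp_size=2))
```

If a config reported the global values instead of the per-rank ones, any buffer sized from it (for example, LoRA weights) would not match the sharded layers, which is the kind of mismatch that only TP>1 tests can catch.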

Test Coverage

Torch LoRA TP>1 tests:

tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py::test_llama_v2_13b_lora_tp2
tests/unittest/llmapi/test_llm_multi_gpu_pytorch.py::test_llama_7b_multi_lora_tp2

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

amitz-nv · 2025-06-18T16:11:08Z

/bot run

tensorrt-cicd · 2025-06-18T16:16:48Z

PR_Github #9405 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-19T00:58:20Z

PR_Github #9405 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6901 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

shaharmor98

LGTM

amitz-nv added 2 commits June 18, 2025 15:57

Fixed ModelConfig.get_bindings_model_config to calculate fields like …

b41098a

…gptJsonConfig.cpp::createModelConfig Signed-off-by: Amit Zuker <[email protected]>

Configure pytorch LoRA TP tests to run on DGX

7b2d3de

Signed-off-by: Amit Zuker <[email protected]>

amitz-nv requested a review from a team as a code owner June 18, 2025 16:08

amitz-nv requested review from lfr-0531 and liji-nv June 18, 2025 16:08

amitz-nv changed the title ~~[[TRTLLM-5825](https://jirasw.nvidia.com/browse/TRTLLM-5825)][fix] Fix torch LoRA TP~~ [TRTLLM-5825][fix] Fix torch LoRA TP Jun 18, 2025

amitz-nv requested a review from shaharmor98 June 18, 2025 16:25

shaharmor98 approved these changes Jun 19, 2025

shaharmor98 merged commit 1753202 into NVIDIA:main Jun 19, 2025
4 checks passed

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

d0f4ed3

Signed-off-by: Amit Zuker <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

84e5ed1

Signed-off-by: Amit Zuker <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

6a59fe6

Signed-off-by: Amit Zuker <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

80903cc

Signed-off-by: Amit Zuker <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

43eb195

Signed-off-by: Amit Zuker <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

45cb4ab

Signed-off-by: Amit Zuker <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

82ef6f2

Signed-off-by: Amit Zuker <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[TRTLLM-5825][fix] Fix torch LoRA TP (NVIDIA#5338)

04a6de8

Signed-off-by: Amit Zuker <[email protected]>

amitz-nv mentioned this pull request Sep 8, 2025

[TRTLLM-7958][doc] add 1.0 release notes #7605

Merged

1 task
