[None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge #7074
Conversation
📝 Walkthrough

Reworks the DGX B200 4-GPU L0 entries in the Jenkins and test-db configuration: the stage is split into multiple parts with separate pre-merge and post-merge variants.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Dev as Developer
    participant Jenkins as Jenkins L0
    participant DB as Test DB (l0_dgx_b200.yml)
    participant SLURM as SLURM Cluster
    participant Runner as Test Runner
    Dev->>Jenkins: Trigger L0 pipeline
    Jenkins->>DB: Query tests for system_gpu_count=4, *b200*, ubuntu*, x86_64
    DB-->>Jenkins: Return expanded 4-GPU test matrix
    Note over Jenkins,SLURM: Jenkins creates staged split jobs (stage_id variants, split_count=2)
    loop for each stage/split
        Jenkins->>SLURM: Dispatch job (4 GPUs, split part)
        SLURM->>Runner: Start tests
        Runner-->>SLURM: Test results
        SLURM-->>Jenkins: Job status
    end
    Jenkins-->>Dev: Aggregate and publish results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2,DGX_B200-4_GPUs-PyTorch-Post-Merge-3,DGX_B200-4_GPUs-PyTorch-Post-Merge-4"
Signed-off-by: Yiqing Yan <[email protected]>
/bot run --gpu-type "DGX_B200"
PR_Github #15872 [ run ] triggered by Bot
Actionable comments posted: 2
📒 Files selected for processing (1): jenkins/L0_Test.groovy (1 hunks)
PR_Github #15874 [ run ] triggered by Bot
PR_Github #15872 [ run ] completed with state
PR_Github #15874 [ run ] completed with state
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2,DGX_B200-4_GPUs-PyTorch-Post-Merge-3,DGX_B200-4_GPUs-PyTorch-Post-Merge-4,DGX_B200-4_GPUs-PyTorch-Post-Merge-5,DGX_B200-4_GPUs-PyTorch-Post-Merge-6,DGX_B200-4_GPUs-PyTorch-Post-Merge-7,DGX_B200-4_GPUs-PyTorch-Post-Merge-8"
PR_Github #15969 [ run ] triggered by Bot
PR_Github #15969 [ run ] completed with state
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2,DGX_B200-4_GPUs-PyTorch-Post-Merge-3,DGX_B200-4_GPUs-PyTorch-Post-Merge-4,DGX_B200-4_GPUs-PyTorch-Post-Merge-5,DGX_B200-4_GPUs-PyTorch-Post-Merge-6,DGX_B200-4_GPUs-PyTorch-Post-Merge-7,DGX_B200-4_GPUs-PyTorch-Post-Merge-8" --disable-fail-fast
PR_Github #15981 [ run ] triggered by Bot
PR_Github #15981 [ run ] completed with state
Signed-off-by: Yanchao Lu <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (4)
Lines 43-54: Move disaggregated/UCX/NIXL tests to post_merge to reduce PR flakiness.

Disaggregated UCX/NIXL paths are infra-heavy and tend to be noisy on PRs. Suggest relocating these five cases to the post_merge block.

Apply:

```diff
-  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_ucx[DeepSeek-V3-Lite-fp8]
   ...
-  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[DeepSeek-V3-Lite-bf16]
-  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[llama-3.1-8b-instruct-hf-fp8]
-  - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
-  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
-  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8]
+  # (moved to post_merge below)
```

Then add them under the post_merge tests list (suggest placing near lines ~104):

```diff
+  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_ucx[DeepSeek-V3-Lite-fp8]
+  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[DeepSeek-V3-Lite-bf16]
+  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[llama-3.1-8b-instruct-hf-fp8]
+  - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
+  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
+  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8]
```
Lines 1-106: Minor maintainability: consider YAML anchors for repeated patterns (optional).

You repeat many nvfp4 and DeepSeekV3Lite parameter combos with only small deltas (torch_compile, mtp_nextn, backend). Anchors/aliases could reduce drift and copy-paste errors, but keep explicit lists if tooling disallows anchors.

Example pattern (only if your test list loader accepts YAML anchors):

```yaml
common_cutlass_nvfp4: &common_cutlass_nvfp4 "accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=CUTLASS-mtp_nextn=0"
# ...
- *common_cutlass_nvfp4 -tp4-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=True]
```
Lines 65-106: Verify Jenkins post-merge shards and rebalance uneven test distribution.

The Jenkins x86SlurmTestConfigs include "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" through "…-8" with stage_id 1..8 and split_count = 8, correctly targeting the l0_dgx_b200 post-merge matrix.

Test assignments per shard (via MD5 mod 8 on test names) are notably uneven:

- Shard 1: 4 tests
- Shard 2: 3 tests
- Shard 3: 4 tests
- Shard 4: 5 tests
- Shard 5: 4 tests
- Shard 6: 3 tests
- Shard 7: 6 tests
- Shard 8: 9 tests

With shards ranging from 3 to 9 tests, runtimes risk being unbalanced. Please consider redistributing some of the longer-running tests between Jenkins stages in jenkins/L0_Test.groovy to even out execution time.
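If it helps to preview balance offline before committing a split count, here is a minimal Python sketch of that kind of hash-based sharding. It assumes the splitter buckets by MD5 of the full test name modulo split_count; the actual Jenkins/pytest splitter may differ in details, so treat this as an approximation.

```python
import hashlib
from collections import Counter

def shard_of(test_name: str, split_count: int = 8) -> int:
    """Bucket a test by MD5 of its full name (1-based shard ids)."""
    digest = hashlib.md5(test_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % split_count + 1

def shard_histogram(test_names, split_count: int = 8) -> Counter:
    """Count tests per shard to spot imbalance before picking split_count."""
    return Counter(shard_of(t, split_count) for t in test_names)

if __name__ == "__main__":
    # Hypothetical sample; feed the real post-merge list from l0_dgx_b200.yml here.
    sample = [
        "accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend",
        "disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8]",
    ]
    for shard, count in sorted(shard_histogram(sample).items()):
        print(f"Shard {shard}: {count} tests")
```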
Lines 32-38: Sanity-check nvfp4 + TRTLLM/CUTLASS mixes and ep4 variants on DGX_B200.

Please verify the following before merging the new DeepSeekV3Lite nvfp4 cases in tests/integration/test_lists/test-db/l0_dgx_b200.yml (lines 32-38):

- TRTLLM-mtp_nextn=0/2 with ep4 + nvfp4: confirm that this combination is supported by the DGX_B200 wheels you're targeting.
- torch_compile=True with pp4 + nvfp4: ensure these configurations are stable under nvfp4 on B200.

If either configuration is known to be flaky on B200, consider moving it to a post-merge test list. You may cross-reference existing parametrizations in the other hardware-specific lists (for example, l0_h100.yml and l0_dgx_h100.yml) to confirm support.
📒 Files selected for processing (1): tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_b200.yml
🔇 Additional comments (6)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (6)
Lines 42-47: LGTM on adding Scout and GPTOSS coverage.

Good spread across backends (TRTLLM/CUTLASS/TRITON) and precisions (fp4/fp8/w4). This helps surface cross-backend regressions early.
Lines 16-23: Good inclusion of CLI flow along with LLM API tests.

Having both "unittest/_torch/multi_gpu_modeling -k deepseek" and API tests matches the usual practice and covers both flows.
Lines 14-15: Confirm DGX_B200 pre-merge test count and infrastructure capacity.

File: tests/integration/test_lists/test-db/l0_dgx_b200.yml

- pre_merge: 37 tests
- post_merge: 38 tests

You've shifted 37 DGX_B200 4-GPU tests into the pre_merge block, meaning they will run on every PR. Please confirm that your infra can handle this added load and that PR latency remains acceptable. If this change was unintentional, revert these tests back to post_merge.
Lines 69-97: Confirmed FP8/FP4 gating on DGX_B200.

The l0_dgx_b200.yml test list only includes FLASHINFER runs for bfloat16 and fp8kv=False; it omits any fp8kv=True or fp4 FLASHINFER variants, effectively gating those combinations on the DGX_B200 pool. No additional wheel checks or capability-flag skips are needed at this time.
Line 68: No naming inconsistency — both DeepEP and DeepEPLowLatency are valid variants.

The test definition in tests/unittest/_torch/modules/test_fused_moe.py is parameterized over three enum members — MNNVL, DeepEP, and DeepEPLowLatency — so there's no unintended renaming or mismatch within the code. You can safely ignore the suggestion to normalize to a single name here.

Likely an incorrect or invalid review comment.
Lines 39-41: No parameter-name typos — IDs are defined as written.

The strings "dep4" and "tep4" appear verbatim in the ids array for the test_nvfp4 marker in tests/integration/defs/accuracy/test_llm_api_pytorch.py, so the YAML entries will collect correctly. You can verify with, for example:

```bash
rg -nP 'parametrize.*\b(tp|pp|ep|dp)\d' -C2 tests/integration/defs/accuracy/test_llm_api_pytorch.py
```

No changes needed to the test list.
Actionable comments posted: 1
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
Lines 1867-1874: Re-verify Slurm platform key 'b200-x4' exists in SlurmConfig.

This was flagged earlier and marked addressed, but it's a hard requirement for these stages to schedule. Let's re-check the repo to avoid a surprise at runtime.

```bash
#!/bin/bash
# Verify Slurm platform key and test DB context exist
set -euo pipefail
echo "Searching for Slurm platform key 'b200-x4'..."
rg -nP -C3 '\bb200-x4\b' || { echo "MISSING: 'b200-x4' not found in Slurm configs"; exit 1; }
echo
echo "Searching for test DB context 'l0_dgx_b200'..."
fd -a -t f -g 'l0_dgx_b200.yml' || true
rg -nP -C2 '\bl0_dgx_b200\b' tests/integration/test_lists/test-db || { echo "MISSING: test-db context 'l0_dgx_b200'"; exit 1; }
```
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
Lines 1867-1874: DRY the stage map with a tiny helper to avoid drift.

Optional: generate these eight keys programmatically to prevent copy/paste mistakes when counts change.

```diff
-    x86SlurmTestConfigs = [
-        "DGX_B200-4_GPUs-PyTorch-1": ["b200-x4", "l0_dgx_b200", 1, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-2": ["b200-x4", "l0_dgx_b200", 2, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-3": ["b200-x4", "l0_dgx_b200", 3, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-4": ["b200-x4", "l0_dgx_b200", 4, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-1": ["b200-x4", "l0_dgx_b200", 1, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-2": ["b200-x4", "l0_dgx_b200", 2, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-3": ["b200-x4", "l0_dgx_b200", 3, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-4": ["b200-x4", "l0_dgx_b200", 4, 4, 4],
-    ]
+    def mkB200Stages = { String prefix, boolean postMerge ->
+        (1..4).collectEntries { i ->
+            def name = "DGX_B200-4_GPUs-${prefix}${postMerge ? '-Post-Merge' : ''}-${i}"
+            [(name): ["b200-x4", "l0_dgx_b200", i, 4, 4]]
+        }
+    }
+    x86SlurmTestConfigs = [:]
+    x86SlurmTestConfigs += mkB200Stages("PyTorch", false)
+    x86SlurmTestConfigs += mkB200Stages("PyTorch", true)
```
📒 Files selected for processing (1): jenkins/L0_Test.groovy (1 hunks)
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)
Lines 1867-1874: DGX_B200 split to four parts looks correct and aligned with filtering logic.
- Stage names match the established pattern and will set backend=pytorch and stage=pre_merge/post_merge via getMakoArgsFromStageName.
- split_id/split_count (1..4/4) and gpu_count=4 are consistent with the PR intent.
Lines 1867-1874: Add explicit 4-GPU gating to the l0_dgx_b200 test list.

I verified that l0_dgx_b200.yml already contains both stage: pre_merge and stage: post_merge blocks, but it lacks any system_gpu_count: 4 term to gate the 4-GPU variants:

- File: tests/integration/test_lists/test-db/l0_dgx_b200.yml
- Lines 14-16: stage: pre_merge present
- Lines 65-67: stage: post_merge present
- No occurrence of system_gpu_count: 4 found

Please add the GPU-count gating under each terms: block so that only the intended 4-GPU tests are selected. For example:

```diff
 terms:
   stage: pre_merge
   backend: pytorch
+  system_gpu_count: 4
 tests:
 # …
```

```diff
 terms:
   stage: post_merge
   backend: pytorch
+  system_gpu_count: 4
 tests:
 # …
```

This will ensure that the "DGX_B200-4_GPUs-PyTorch" variants only run when exactly four GPUs are available.

Likely an incorrect or invalid review comment.
Signed-off-by: Yanchao Lu <[email protected]>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
jenkins/scripts/slurm_run.sh (4)
Lines 43-58: Replace eval with safe argv execution; fix array args split.

Today the array embeds space-separated option/value pairs (e.g., "--rootdir ...") and relies on eval + word splitting. This is brittle and unsafe; it breaks on paths with spaces and invites injection if any arg is uncontrolled.

Refactor to use proper argv vectors and execute without eval:

```diff
 testCmdLines=(
     "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch"
     "pytest"
     "-v"
     "--timeout=$pytestTestTimeout"
     "--test-list=$testListPathNode"
     "--waives-file=$waivesListPathNode"
-    "--rootdir $llmSrcNode/tests/integration/defs"
+    "--rootdir=$llmSrcNode/tests/integration/defs"
     "--test-prefix=$stageName"
-    "--splits $splits"
-    "--group $splitId"
+    "--splits" "$splits"
+    "--group" "$splitId"
     "--output-dir=$jobWorkspace/"
     "--csv=$resultsPath/report.csv"
-    "--junit-xml $resultsPath/results.xml"
-    "-o junit_logging=out-err"
+    "--junit-xml=$resultsPath/results.xml"
+    "-o" "junit_logging=out-err"
 )
@@
-fullCmd="${testCmdLines[*]}"
-echo "Running: $testCase"
-echo "Full Command: $fullCmd"
-eval $fullCmd
+echo "Running: $testCase"
+printf 'Full Command (argv): %q ' "${testCmdLines[@]}"; echo
+"${testCmdLines[@]}"
```

Also applies to: 89-92
Lines 66-69: Do not mutate the wheel path; fix sed escaping for arbitrary paths.

- sed 's/[[:space:]]+/_/g' is BRE, so + is literal; it has no effect.
- Replacing whitespace with underscores would corrupt valid paths if it did run.
- The sed replacement must escape backslashes, ampersands, and your custom delimiter for safety.

Use robust extraction and escape the replacement:

```diff
-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
-echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
-sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
+trtllmWhlPath="$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{print $2; exit}')"
+echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
+# Escape sed replacement metacharacters
+trtllmWhlPathEsc=${trtllmWhlPath//\\/\\\\}
+trtllmWhlPathEsc=${trtllmWhlPathEsc//&/\\&}
+trtllmWhlPathEsc=${trtllmWhlPathEsc//|/\\|}
+sed -i "s|---wheel_path---|$trtllmWhlPathEsc|g" "$coverageConfigFile"
```
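As an aside, the same placeholder substitution is trivially safe in a language with literal string replace. A small Python sketch (paths hypothetical), shown only to illustrate why no escaping is needed when the replacement is not interpreted:

```python
from pathlib import Path

def fill_wheel_path(config_file: str, wheel_path: str) -> None:
    """Replace the ---wheel_path--- placeholder with a literal value.

    str.replace treats both arguments as plain text, so backslashes,
    ampersands, and delimiter characters in the path need no escaping.
    """
    cfg = Path(config_file)
    cfg.write_text(cfg.read_text().replace("---wheel_path---", wheel_path))

# Hypothetical usage mirroring the sed call above:
# fill_wheel_path("/tmp/job_workspace/.coveragerc", "/usr/local/lib/python3/dist-packages")
```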
Lines 77-85: Harden LD_LIBRARY_PATH augmentation and avoid bogus whitespace transforms.

- The current sed again uses a literal + and may mangle paths.
- The substring check can produce false positives (e.g., /usr/lib vs /usr/lib64).
- Avoid duplicate separators and preserve an empty LD_LIBRARY_PATH cleanly.

```diff
-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
-if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
-    containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
-    containerLDLibPath="${containerLDLibPath%:}"
-fi
+containerPipLLMLibPath="$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')"
+containerLDLibPath="${LD_LIBRARY_PATH:-}"
+if [[ ":$containerLDLibPath:" != *":$containerPipLLMLibPath:"* ]]; then
+    containerLDLibPath="$containerPipLLMLibPath${containerLDLibPath:+:$containerLDLibPath}"
+fi
 export LD_LIBRARY_PATH=$containerLDLibPath
```
Lines 6-15: Add strict error handling, guards, and proper quoting.

The ShellCheck output confirms several remaining warnings around unguarded variables, unquoted expansions, and unchecked cd commands. To eliminate these and fail fast on errors, please apply the following mandatory refactors:

- Add at the top of jenkins/scripts/slurm_run.sh:

```bash
#!/bin/bash
set -Eeuo pipefail
```

This ensures the script exits on any unhandled error, undefined variable, or failed pipeline. (Addresses SC2154 and SC2086 warnings for undefined and unquoted vars.)

- Validate required environment variables before use:

```bash
: "${resourcePathNode:?resourcePathNode must be set}"
: "${jobWorkspace:?jobWorkspace must be set}"
: "${stageName:?stageName must be set}"
```

This converts SC2154 warnings into immediate failures if any of these are unset.

- Quote all variable expansions and guard cd:

```diff
-cd $resourcePathNode
+cd "$resourcePathNode" || exit
 …
-coverageConfigFile="$jobWorkspace/.coveragerc"
+coverageConfigFile="$jobWorkspace/.coveragerc"
 …
-mkdir -p $resultsPath
+mkdir -p "$resultsPath"
 …
-if [ $SLURM_LOCALID -eq 0 ]; then
+if [ "${SLURM_LOCALID:-}" -eq 0 ]; then
 …
-wget -nv $llmTarfile
+wget -nv "$llmTarfile"
-tar -zxf $tarName
+tar -zxf "$tarName"
-cd $llmSrcNode && pip3 install …
+cd "$llmSrcNode" || exit && pip3 install …
-eval $fullCmd
+eval "$fullCmd"
```

Quoting prevents globbing/word-splitting (SC2086) and checking cd prevents silent directory-change failures (SC2164).

- Replace sed 's/[[:space:]]\+/_/g' with parameter expansion where possible:

```bash
trtllmWhlPath="${trtllmWhlPath//[[:space:]]/_}"
```

This addresses SC2001 style suggestions.

- Review or remove testList if it's truly unused (SC2034).

- Optionally, quote your here-doc delimiter (<< "EOF") if you ever need a literal $ in the config file—otherwise, leave it unquoted to allow expansion.

Applying these changes will clear all ShellCheck warnings and make the script robust against missing variables, failed commands, and unsafe expansions.
🧹 Nitpick comments (4)
jenkins/scripts/slurm_run.sh (4)
Lines 19-36: Safer lockfile handling and SLURM_LOCALID check.

- If the master path differs from worker PWDs, install_lock.lock may not be visible; anchor it in jobWorkspace.
- Clean up the lock on exit to avoid blocking subsequent runs if the master fails.
- Guard SLURM_LOCALID to avoid unbound var errors.

```diff
-resultsPath=$jobWorkspace/results
+resultsPath="$jobWorkspace/results"
 mkdir -p $resultsPath
-if [ $SLURM_LOCALID -eq 0 ]; then
+lock_file="$jobWorkspace/install_lock.lock"
+if (( ${SLURM_LOCALID:-0} == 0 )); then
@@
-    touch install_lock.lock
+    trap 'rm -f "$lock_file"' EXIT
+    : > "$lock_file"
 else
-    while [ ! -f install_lock.lock ]; do
+    while [ ! -f "$lock_file" ]; do
         sleep 5
     done
 fi
```
Lines 20-27: Improve setup robustness.

- Prefer command -v over which.
- apt-get install should be preceded by apt-get update on ephemeral workers.

```diff
-which python3
+command -v python3 || true
 python3 --version
-apt-get install -y libffi-dev
+apt-get update -y && apt-get install -y libffi-dev
```
Lines 30-31: Nit: double equals in log line; standardize formatting.

The current line prints "GPU_UUIDS = =...". Remove the extra "=" and quote vars to avoid weird spacing.

```diff
-    echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME=$HOST_NODE_NAME ; GPU_UUIDS=$gpuUuids ; STAGE_NAME=$stageName"
```
Line 37: Remove unused testList assignment in slurm_run.sh.

The assignment on line 37:

```bash
testList="$testList_$splitId"
```

is never referenced elsewhere in the script (only testListPathNode is used for --test-list). To avoid confusion and dead code, please remove this unused variable assignment.

- File: jenkins/scripts/slurm_run.sh
- Remove line 37: testList="$testList_$splitId"
📒 Files selected for processing (1): jenkins/scripts/slurm_run.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 6-6: jobWorkspace is referenced but not assigned.
(SC2154)
🔇 Additional comments (1)
jenkins/scripts/slurm_run.sh (1)
Lines 70-76: Coverage flags look consistent with the new .coveragerc path.

Using "--cov-config=$coverageConfigFile" and adding both source tree and installed wheel paths is correct and aligns with the sed replacement above.
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-2, DGX_B200-4_GPUs-PyTorch-3, DGX_B200-4_GPUs-PyTorch-4, DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2, DGX_B200-4_GPUs-PyTorch-Post-Merge-3, DGX_B200-4_GPUs-PyTorch-Post-Merge-4, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
PR_Github #16314 [ run ] triggered by Bot
Signed-off-by: Yanchao Lu <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]>
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-2, DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
PR_Github #16322 [ run ] triggered by Bot
PR_Github #16314 [ run ] completed with state
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
Lines 54-108: Post-merge block: good separation; keep duplicates intentional and watch runtime.

Defining a distinct post-merge condition with a tuned matrix is the right call. A few configs overlap conceptually with pre-merge; if intentional for stability, fine. Otherwise consider moving the heaviest variants to post-merge only to keep PR latency reasonable.

Note: The AutoDeploy test at line 107 reflects an earlier suggestion and is now present—nice.
jenkins/L0_Test.groovy (1)
Lines 1897-1901: DGX_B200 split introduced (2-way pre- and post-merge): naming, gating, and split_count align.

Stage names, platform key, test context, split_id/split_count, and gpu_count=4 are consistent with the test DB and mako gating ("-PyTorch-", "Post-Merge"). LGTM.

Reminder from the earlier thread: ensure SlurmConfig has platform key "b200-x4" configured; a previous verification flagged it once. If already added in this PR series, ignore.
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
Line 1: Avoid pinning to a dev branch of the shared Jenkins library for mainline.

Pinning bloom-jenkins-shared-lib to dev-yanchaol-empty-partition is brittle for main. Prefer main, or make the branch selectable via an env/param toggle restricted to CI experimentation.

Apply one of these:

```diff
-@Library(['bloom-jenkins-shared-lib@dev-yanchaol-empty-partition', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
```

Or parameterize:

```diff
-@Library(['bloom-jenkins-shared-lib@dev-yanchaol-empty-partition', 'trtllm-jenkins-shared-lib@main']) _
+@Library(["bloom-jenkins-shared-lib@${env.BLOOM_LIB_REF ?: 'main'}", 'trtllm-jenkins-shared-lib@main']) _
```

I can wire BLOOM_LIB_REF into your GitHub bot flags if desired.
Lines 2160-2162: The 30-minute timeout for the pip sanity check may be tight in some environments.

Previous runs sometimes spend more than 30 minutes when upstream is slow or the cache is cold. Consider 45-60 minutes, or make it configurable via env to reduce flaky timeouts.

Proposed tweak:

```diff
-        timeout(time: 30, unit: 'MINUTES') {
+        timeout(time: 60, unit: 'MINUTES') {
```

Or configurable:

```diff
-        timeout(time: 30, unit: 'MINUTES') {
+        timeout(time: (env.PIP_CHECK_TIMEOUT_MINUTES ?: '60') as Integer, unit: 'MINUTES') {
```
📒 Files selected for processing (2): jenkins/L0_Test.groovy (3 hunks), tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🔇 Additional comments (3)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (3)
Lines 14-16: Pre-merge gating is correct; verify the intended shift from post_merge to pre_merge.

Switching this block to stage: pre_merge aligns with adding a separate post-merge block below. Looks good; just confirm downstream consumers weren't relying on these tests being post-merge previously.

Would you like a quick script to compare which tests moved pre→post vs stayed the same to spot unexpected shifts? A sketch follows below.
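Here is one possible shape for that script — a hedged sketch that assumes PyYAML is available and that the file is a top-level l0_dgx_b200 list of blocks carrying terms.stage (possibly nested under a condition key) plus a tests list; adjust the traversal if the actual schema differs:

```python
import subprocess
import yaml  # assumes PyYAML is available in the environment

PATH = "tests/integration/test_lists/test-db/l0_dgx_b200.yml"

def tests_by_stage(ref: str) -> dict:
    """Map stage name -> set of test entries at a given git revision."""
    raw = subprocess.check_output(["git", "show", f"{ref}:{PATH}"], text=True)
    doc = yaml.safe_load(raw)
    stages: dict = {}
    for block in doc.get("l0_dgx_b200", []):
        # Assumed layout: terms may sit directly on the block or under "condition".
        terms = (block.get("condition") or block).get("terms", {})
        stage = terms.get("stage", "unknown")
        stages.setdefault(stage, set()).update(block.get("tests", []))
    return stages

old = tests_by_stage("origin/main")
new = tests_by_stage("HEAD")
moved = old.get("post_merge", set()) & new.get("pre_merge", set())
print(f"{len(moved)} tests moved post_merge -> pre_merge:")
for name in sorted(moved):
    print(" ", name)
```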
Lines 32-38: Unable to locate any rendered list files locally — these are generated in your CI artifacts rather than the repo. Please verify manually that the new DeepSeek V3 Lite nvfp4 test configurations are evenly distributed across your two splits (or tagged as post-merge) to avoid clustering heavy runs in one group.
Lines 39-53: All referenced test symbols verified.

All of the classes and parametrized test functions listed in l0_dgx_b200.yml were found in the codebase — there are no missing or mistyped symbols that would lead to "collected 0 tests." Closing this check.
PR_Github #16322 [ run ] completed with state
/bot skip --comment "Partial testing is sufficient"
PR_Github #16331 [ skip ] triggered by Bot
PR_Github #16331 [ skip ] completed with state
Signed-off-by: Yanchao Lu <[email protected]>
/bot skip --comment "Partial testing is sufficient"
Actionable comments posted: 0
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
Lines 1897-1901: Verify the SLURM platform key and test-db context exist for B200.

These stages reference:

- platform key: b200-x4
- test-db context: l0_dgx_b200

Please confirm both are present in SlurmConfig/partition configs and tests/integration/test_lists/test-db, or these stages will fail at scheduling/test-list rendering time.

Run this repo-scoped check:

```bash
#!/bin/bash
set -euo pipefail
echo "Search Slurm/partition configs for platform 'b200-x4' (excluding this file):"
rg -nP -C3 '\bb200-x4\b' -g '!jenkins/L0_Test.groovy' || true
echo
echo "Search for l0_dgx_b200 test-db YAML and references:"
fd -a -t f -g 'l0_dgx_b200*.yml' tests || true
rg -nP -C2 '\bl0_dgx_b200\b' tests || true
```
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
Lines 2160-2163: The hard-coded 30-minute timeout for the pip-install sanity check may be too aggressive; make it configurable.

Network/artifactory variance and cold caches can push this step beyond 30 minutes. Suggest exposing the timeout via an env/param with a sane default to reduce flakiness.

Apply:

```diff
-        timeout(time: 30, unit: 'MINUTES') {
+        def pipInstallTimeoutMin = (env.pipInstallTimeoutMinutes ?: '30') as Integer
+        timeout(time: pipInstallTimeoutMin, unit: 'MINUTES') {
             checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
         }
```

Optionally add pipInstallTimeoutMinutes to the pipeline parameters or environment with a default of "30".
📒 Files selected for processing (1): jenkins/L0_Test.groovy (2 hunks)
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
Lines 1897-1901: DGX_B200 4-GPU split and Post-Merge variants: config looks correct.
- Stage names adhere to the documented "Stage-Name": ["platform", "yaml", split_id, split_count, gpu_count] format.
- split_count=2 with indices 1..2 is internally consistent for both pre- and post-merge sets.
- gpu_count=4 is carried through, so multi-GPU filtering and phase-2 orchestration will pick these up.
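For a cheap offline re-check of that internal consistency whenever the map changes, a hypothetical Python mirror of the documented tuple format could look like this (the dict literal below is a transcription for illustration only; jenkins/L0_Test.groovy remains the source of truth):

```python
# Mirrors the documented format:
# "Stage-Name": [platform, yaml_context, split_id, split_count, gpu_count]
STAGE_CONFIGS = {
    "DGX_B200-4_GPUs-PyTorch-1": ["b200-x4", "l0_dgx_b200", 1, 2, 4],
    "DGX_B200-4_GPUs-PyTorch-2": ["b200-x4", "l0_dgx_b200", 2, 2, 4],
    "DGX_B200-4_GPUs-PyTorch-Post-Merge-1": ["b200-x4", "l0_dgx_b200", 1, 2, 4],
    "DGX_B200-4_GPUs-PyTorch-Post-Merge-2": ["b200-x4", "l0_dgx_b200", 2, 2, 4],
}

def check_stage_map(configs: dict) -> None:
    """Assert the invariants called out above: split indices in range,
    gpu_count reflected in the stage name, suffix matching split_id."""
    for name, (platform, yaml_ctx, split_id, split_count, gpu_count) in configs.items():
        assert 1 <= split_id <= split_count, f"{name}: split_id {split_id} out of 1..{split_count}"
        assert f"{gpu_count}_GPUs" in name, f"{name}: gpu_count {gpu_count} not in stage name"
        assert name.endswith(f"-{split_id}"), f"{name}: suffix should equal split_id"

check_stage_map(STAGE_CONFIGS)
print("stage map is internally consistent")
```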
PR_Github #16336 [ skip ] triggered by Bot
PR_Github #16336 [ skip ] completed with state
Summary by CodeRabbit
Tests
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.