
Conversation

@yiqingy0 (Collaborator) commented Aug 20, 2025

Summary by CodeRabbit

  • Tests

    • Expanded 4‑GPU CI: replaced one post‑merge entry with multiple split entries, adding PyTorch and post‑merge variants and increasing parallel shards.
    • Reworked 4‑GPU gating: the previous test block's stage changed and was replaced by a much larger, restructured matrix covering many models, backends, and config variants (disaggregated, bf16/fp8, compile/backends).
  • Chores

    • Shortened pip install timeout for CI and made coverage-file write behavior more explicit.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail-fast behavior on build/test/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to the ordinary L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing results without care and validation can break top of tree.

@coderabbitai bot (Contributor) commented Aug 20, 2025

📝 Walkthrough

Reworks DGX B200 4‑GPU L0 entries in jenkins/L0_Test.groovy (splits and adds PyTorch/Post‑Merge variants with split_count=2), shortens a pip-install timeout, expands the 4‑GPU test matrix in tests/.../l0_dgx_b200.yml (new 4‑GPU gated condition and many new tests), and changes how .coveragerc is written in jenkins/scripts/slurm_run.sh.

Changes

Cohort / File(s) Summary
Jenkins CI pipeline
jenkins/L0_Test.groovy
Replaced a single DGX_B200-4_GPUs-PyTorch-Post-Merge entry with four new PyTorch/Post‑Merge entries (e.g., DGX_B200-4_GPUs-PyTorch-1/2, ...Post-Merge-1/2) using split_count=2 and gpu_count=4; shortened pip-install timeout from 1 HOUR to 30 MINUTES.
Integration test DB: DGX B200 4‑GPU matrix
tests/integration/test_lists/test-db/l0_dgx_b200.yml
Added a new top-level condition gated for system_gpu_count: 4 (gpu *b200*, ubuntu*, x86_64) that switches stage semantics (previous block stage changed pre/post) and inserts a substantially expanded test matrix covering many models, backends, precisions, and disaggregated variants. Removed/restructured the previous smaller test list.
Jenkins helper script
jenkins/scripts/slurm_run.sh
Replaced here-doc redirection target with an explicit coverageConfigFile variable and redirected the here-doc to the quoted variable when writing .coveragerc (file content unchanged).
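The here-doc change described above can be sketched as follows. The variable name `coverageConfigFile` matches the summary; the workspace path and the `.coveragerc` body are illustrative assumptions, not the script's actual content.

```shell
# Sketch: write .coveragerc through an explicit, quoted variable target
# instead of a hard-coded redirection path. jobWorkspace and the config
# body below are invented for illustration.
jobWorkspace="${jobWorkspace:-/tmp/ci-job}"
mkdir -p "$jobWorkspace"

coverageConfigFile="$jobWorkspace/.coveragerc"
cat > "$coverageConfigFile" <<EOF
[run]
branch = True
source = ---wheel_path---
EOF

cat "$coverageConfigFile"
```

Quoting the target means a workspace path containing spaces still resolves to a single file.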

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Dev as Developer
    participant Jenkins as Jenkins L0
    participant DB as Test DB (l0_dgx_b200.yml)
    participant SLURM as SLURM Cluster
    participant Runner as Test Runner

    Dev->>Jenkins: Trigger L0 pipeline
    Jenkins->>DB: Query tests for system_gpu_count=4, *b200*, ubuntu*, x86_64
    DB-->>Jenkins: Return expanded 4‑GPU test matrix
    Note over Jenkins,SLURM: Jenkins creates staged split jobs (stage_id variants, split_count=2)
    loop for each stage/split
      Jenkins->>SLURM: Dispatch job (4 GPUs, split part)
      SLURM->>Runner: Start tests
      Runner-->>SLURM: Test results
      SLURM-->>Jenkins: Job status
    end
    Jenkins-->>Dev: Aggregate and publish results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • kxdc


@yiqingy0 (Collaborator, Author)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2,DGX_B200-4_GPUs-PyTorch-Post-Merge-3,DGX_B200-4_GPUs-PyTorch-Post-Merge-4"

@yiqingy0 yiqingy0 changed the title [None][infra] Split DGX_B200 stage into 4 stages [None][infra] Split DGX_B200 stage into 8 parts Aug 20, 2025
@yiqingy0 (Collaborator, Author)

/bot run --gpu-type "DGX_B200"

@tensorrt-cicd (Collaborator)

PR_Github #15872 [ run ] triggered by Bot

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d4e8ad8 and 302bd1d.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (1 hunks)

@tensorrt-cicd (Collaborator)

PR_Github #15874 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #15872 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #15874 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11933 (Partly Tested) completed with status: 'SUCCESS'

@yiqingy0 (Collaborator, Author)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2,DGX_B200-4_GPUs-PyTorch-Post-Merge-3,DGX_B200-4_GPUs-PyTorch-Post-Merge-4,DGX_B200-4_GPUs-PyTorch-Post-Merge-5,DGX_B200-4_GPUs-PyTorch-Post-Merge-6,DGX_B200-4_GPUs-PyTorch-Post-Merge-7,DGX_B200-4_GPUs-PyTorch-Post-Merge-8"

@tensorrt-cicd (Collaborator)

PR_Github #15969 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #15969 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12002 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0 (Collaborator, Author)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2,DGX_B200-4_GPUs-PyTorch-Post-Merge-3,DGX_B200-4_GPUs-PyTorch-Post-Merge-4,DGX_B200-4_GPUs-PyTorch-Post-Merge-5,DGX_B200-4_GPUs-PyTorch-Post-Merge-6,DGX_B200-4_GPUs-PyTorch-Post-Merge-7,DGX_B200-4_GPUs-PyTorch-Post-Merge-8" --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #15981 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #15981 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12010 (Partly Tested) completed with status: 'FAILURE'

@chzblych chzblych changed the title [None][infra] Split DGX_B200 stage into 8 parts [None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge Aug 24, 2025
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (4)

43-54: Move disaggregated/UCX/NIXL tests to post_merge to reduce PR flakiness.

Disaggregated UCX/NIXL paths are infra-heavy and tend to be noisy on PRs. Suggest relocating these five cases to the post_merge block.

Apply:

-  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_ucx[DeepSeek-V3-Lite-fp8]
...
-  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[DeepSeek-V3-Lite-bf16]
-  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[llama-3.1-8b-instruct-hf-fp8]
-  - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
-  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
-  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8]
+  # (moved to post_merge below)

Then add them under the post_merge tests list (suggest placing near lines ~104):

+  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_ucx[DeepSeek-V3-Lite-fp8]
+  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[DeepSeek-V3-Lite-bf16]
+  - disaggregated/test_disaggregated.py::test_disaggregated_benchmark_on_diff_backends[llama-3.1-8b-instruct-hf-fp8]
+  - accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
+  - accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
+  - disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8]

1-106: Minor maintainability: consider YAML anchors for repeated patterns (optional).

You repeat many nvfp4 and DeepSeekV3Lite parameter combos with only small deltas (torch_compile, mtp_nextn, backend). Anchors/aliases could reduce drift and copy-paste errors, but keep explicit lists if tooling disallows anchors.

Example pattern (only if your test-list loader accepts YAML anchors; note that plain YAML does not concatenate an alias with adjacent text, so the loader would also need a string-join step):

common_cutlass_nvfp4: &common_cutlass_nvfp4 "accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=CUTLASS-mtp_nextn=0"
# ...
- *common_cutlass_nvfp4 -tp4-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=True]

65-106: Verify Jenkins post-merge shards and rebalance uneven test distribution

The Jenkins x86SlurmTestConfigs include “DGX_B200-4_GPUs-PyTorch-Post-Merge-1” through “…-8” with stage_id 1..8 and split_count = 8, correctly targeting the l0_dgx_b200 post-merge matrix.
Test assignments per shard (via MD5 mod 8 on test names) are notably uneven:

  • Shard 1: 4 tests
  • Shard 2: 3 tests
  • Shard 3: 4 tests
  • Shard 4: 5 tests
  • Shard 5: 4 tests
  • Shard 6: 3 tests
  • Shard 7: 6 tests
  • Shard 8: 9 tests

With shards ranging from 3 to 9 tests, runtimes risk being unbalanced. Please consider redistributing some of the longer-running tests between Jenkins stages in jenkins/L0_Test.groovy to even out execution time.
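As a rough illustration of the hash-based sharding assumed above (MD5 digest of the test name modulo the split count — the actual Jenkins assignment logic may differ):

```shell
# Hypothetical shard assignment: md5(test_name) mod split_count.
# Uses the low 32 bits of the digest; for power-of-two split counts
# such as 8 this preserves the residue of the full digest.
shard_for() {
    local name="$1" split_count="$2"
    local digest
    digest=$(printf '%s' "$name" | md5sum | awk '{print $1}')
    echo $(( 0x${digest: -8} % split_count ))
}

shard_for "accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend" 8
```

Because names hash independently, such a scheme gives no per-shard runtime balancing — which matches the uneven 3-to-9 distribution reported above.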


32-38: Sanity-check nvfp4 + TRTLLM/CUTLASS mixes and ep4 variants on DGX_B200

Please verify the following before merging the new DeepSeekV3Lite nvfp4 cases in tests/integration/test_lists/test-db/l0_dgx_b200.yml (lines 32-38):

• TRTLLM-mtp_nextn=0/2 with ep4 + nvfp4: confirm that this combination is supported by the DGX_B200 wheels you’re targeting.
• torch_compile=True with pp4 + nvfp4: ensure these configurations are stable under nvfp4 on B200.

If either configuration is known to be flaky on B200, consider moving it to a post-merge test list.

You may cross-reference existing parametrizations in the other hardware-specific lists (for example, l0_h100.yml and l0_dgx_h100.yml) to confirm support.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between e092855 and 8aa56c3.

📒 Files selected for processing (1)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (6)

42-47: LGTM on adding Scout and GPTOSS coverage.

Good spread across backends (TRTLLM/CUTLASS/TRITON) and precisions (fp4/fp8/w4). This helps surface cross-backend regressions early.


16-23: Good inclusion of CLI flow along with LLM API tests.

Having both “unittest/_torch/multi_gpu_modeling -k deepseek” and API tests matches the usual practice and covers both flows.


14-15: Confirm DGX_B200 Pre-merge Test Count and Infrastructure Capacity

– File: tests/integration/test_lists/test-db/l0_dgx_b200.yml
• pre_merge: 37 tests
• post_merge: 38 tests

You’ve shifted 37 DGX_B200 4-GPU tests into the pre_merge block, meaning they will run on every PR. Please confirm that your infra can handle this added load and that PR latency remains acceptable. If this change was unintentional, revert these tests back to post_merge.


69-97: Confirmed FP8/FP4 gating on DGX_B200
The l0_dgx_b200.yml test list only includes FLASHINFER runs for bfloat16 and FP8-kv=False – it omits any fp8kv=True or fp4 FLASHINFER variants, effectively gating those combinations on the DGX_B200 pool. No additional wheel checks or capability-flag skips are needed at this time.


68-68: No naming inconsistency – both DeepEP and DeepEPLowLatency are valid variants.

The test definition in tests/unittest/_torch/modules/test_fused_moe.py is parameterized over three enum members—MNNVL, DeepEP, and DeepEPLowLatency—so there’s no unintended renaming or mismatch within the code. You can safely ignore the suggestion to normalize to a single name here.

Likely an incorrect or invalid review comment.


39-41: No parameter-name typos—IDs are defined as written
The strings “dep4” and “tep4” appear verbatim in the ids array for the test_nvfp4 marker in
tests/integration/defs/accuracy/test_llm_api_pytorch.py, so the YAML entries will collect correctly.
You can verify with, for example:

rg -nP 'parametrize.*\b(tp|pp|ep|dp)\d' -C2 tests/integration/defs/accuracy/test_llm_api_pytorch.py

—no changes needed to the test list.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

1867-1874: Re-verify Slurm platform key 'b200-x4' exists in SlurmConfig.

This was flagged earlier and marked addressed, but it’s a hard requirement for these stages to schedule. Let’s re-check the repo to avoid a surprise at runtime.

#!/bin/bash
# Verify Slurm platform key and test DB context exist
set -euo pipefail
echo "Searching for Slurm platform key 'b200-x4'..."
rg -nP -C3 '\bb200-x4\b' || { echo "MISSING: 'b200-x4' not found in Slurm configs"; exit 1; }
echo
echo "Searching for test DB context 'l0_dgx_b200'..."
fd -a -t f -g 'l0_dgx_b200.yml' || true
rg -nP -C2 '\bl0_dgx_b200\b' tests/integration/test_lists/test-db || { echo "MISSING: test-db context 'l0_dgx_b200'"; exit 1; }
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

1867-1874: DRY the stage map with a tiny helper to avoid drift.

Optional: generate these eight keys programmatically to prevent copy/paste mistakes when counts change.

-    x86SlurmTestConfigs = [
-        "DGX_B200-4_GPUs-PyTorch-1": ["b200-x4", "l0_dgx_b200", 1, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-2": ["b200-x4", "l0_dgx_b200", 2, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-3": ["b200-x4", "l0_dgx_b200", 3, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-4": ["b200-x4", "l0_dgx_b200", 4, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-1": ["b200-x4", "l0_dgx_b200", 1, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-2": ["b200-x4", "l0_dgx_b200", 2, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-3": ["b200-x4", "l0_dgx_b200", 3, 4, 4],
-        "DGX_B200-4_GPUs-PyTorch-Post-Merge-4": ["b200-x4", "l0_dgx_b200", 4, 4, 4],
-    ]
+    def mkB200Stages = { String prefix, boolean postMerge ->
+        (1..4).collectEntries { i ->
+            def name = "DGX_B200-4_GPUs-${prefix}${postMerge ? '-Post-Merge' : ''}-${i}"
+            [(name): ["b200-x4", "l0_dgx_b200", i, 4, 4]]
+        }
+    }
+    x86SlurmTestConfigs = [:]
+    x86SlurmTestConfigs += mkB200Stages("PyTorch", false)
+    x86SlurmTestConfigs += mkB200Stages("PyTorch", true)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 8aa56c3 and 59ab9a7.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (1 hunks)
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

1867-1874: DGX_B200 split to four parts looks correct and aligned with filtering logic.

  • Stage names match the established pattern and will set backend=pytorch and stage=pre_merge/post_merge via getMakoArgsFromStageName.
  • split_id/split_count (1..4/4) and gpu_count=4 are consistent with the PR intent.

1867-1874: Add explicit 4-GPU gating to the l0_dgx_b200 test list

I verified that l0_dgx_b200.yml already contains both stage: pre_merge and stage: post_merge blocks, but it lacks any system_gpu_count: 4 term to gate the 4-GPU variants:

  • File: tests/integration/test_lists/test-db/l0_dgx_b200.yml
    • Lines 14–16: stage: pre_merge present
    • Lines 65–67: stage: post_merge present
    • No occurrence of system_gpu_count: 4 found

Please add the GPU-count gating under each terms: block so that only the intended 4-GPU tests are selected. For example:

   terms:
     stage: pre_merge
     backend: pytorch
+    system_gpu_count: 4
   tests:
     # …
 
   terms:
     stage: post_merge
     backend: pytorch
+    system_gpu_count: 4
   tests:
     # …

This will ensure that the “DGX_B200-4_GPUs-PyTorch” variants only run when exactly four GPUs are available.

Likely an incorrect or invalid review comment.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
jenkins/scripts/slurm_run.sh (4)

43-58: Replace eval with safe argv execution; fix array args split.

Today the array embeds space-separated option/value pairs (e.g., "--rootdir ...") and relies on eval + word splitting. This is brittle and unsafe; it breaks on paths with spaces and invites injection if any arg is uncontrolled.

Refactor to use proper argv vectors and execute without eval:

 testCmdLines=(
     "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch"
     "pytest"
     "-v"
     "--timeout=$pytestTestTimeout"
     "--test-list=$testListPathNode"
     "--waives-file=$waivesListPathNode"
-    "--rootdir $llmSrcNode/tests/integration/defs"
+    "--rootdir=$llmSrcNode/tests/integration/defs"
     "--test-prefix=$stageName"
-    "--splits $splits"
-    "--group $splitId"
+    "--splits" "$splits"
+    "--group" "$splitId"
     "--output-dir=$jobWorkspace/"
     "--csv=$resultsPath/report.csv"
-    "--junit-xml $resultsPath/results.xml"
-    "-o junit_logging=out-err"
+    "--junit-xml=$resultsPath/results.xml"
+    "-o" "junit_logging=out-err"
 )
@@
-fullCmd="${testCmdLines[*]}"
-echo "Running: $testCase"
-echo "Full Command: $fullCmd"
-eval $fullCmd
+echo "Running: $testCase"
+printf 'Full Command (argv): %q ' "${testCmdLines[@]}"; echo
+"${testCmdLines[@]}"

Also applies to: 89-92
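The failure mode can be demonstrated in isolation (this snippet is a standalone sketch, not taken from slurm_run.sh): flattening an argv array into a string and eval-ing it re-splits arguments on whitespace.

```shell
# Standalone demo: why executing an argv vector beats eval on a flat string.
dir="/tmp/dir with spaces"
mkdir -p "$dir"

cmd=( ls -d "$dir" )       # argv vector: the space-containing path is one argument
"${cmd[@]}"                # succeeds, prints the directory

flat="${cmd[*]}"           # flattening to a single string drops the quoting
eval "$flat" 2>/dev/null \
  || echo "eval failed: the path was re-split into three arguments"
```

The same re-splitting happens with any path, flag value, or `-o key=value` pair containing whitespace, which is why the review suggests separate array elements per argument.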


66-69: Do not mutate the wheel path; fix sed escaping for arbitrary paths.

  • sed 's/[[:space:]]+/_/g' is BRE, so + is literal; no effect.
  • Replacing whitespace with underscores would corrupt valid paths if it did run.
  • sed replacement must escape backslashes, ampersands, and your custom delimiter for safety.

Use robust extraction and escape the replacement:

-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
-echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
-sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
+trtllmWhlPath="$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{print $2; exit}')"
+echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
+# Escape sed replacement metacharacters
+trtllmWhlPathEsc=${trtllmWhlPath//\\/\\\\}
+trtllmWhlPathEsc=${trtllmWhlPathEsc//&/\\&}
+trtllmWhlPathEsc=${trtllmWhlPathEsc//|/\\|}
+sed -i "s|---wheel_path---|$trtllmWhlPathEsc|g" "$coverageConfigFile"
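A minimal standalone check of why the escaping matters (the path here is invented): an unescaped `&` in a sed replacement re-inserts the matched text, and an unescaped `|` would terminate the `s|…|…|` expression early.

```shell
# Invented wheel path containing a sed replacement metacharacter.
path='/opt/py&libs'

# Escape \, & and the custom '|' delimiter before using it as a replacement.
escaped=$(printf '%s' "$path" | sed 's/[&|\\]/\\&/g')

printf 'wheel: ---wheel_path---\n' | sed "s|---wheel_path---|$escaped|g"
# -> wheel: /opt/py&libs
```

Without the escaping pass, the `&` would expand to the literal string `---wheel_path---` in the output.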

77-85: Harden LD_LIBRARY_PATH augmentation and avoid bogus whitespace transforms.

  • The current sed again uses a literal + and may mangle paths.
  • Substring check can produce false positives (e.g., /usr/lib vs /usr/lib64).
  • Avoid duplicate separators and preserve empty LD_LIBRARY_PATH cleanly.
-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
-if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
-  containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
-  containerLDLibPath="${containerLDLibPath%:}"
-fi
+containerPipLLMLibPath="$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')"
+containerLDLibPath="${LD_LIBRARY_PATH:-}"
+if [[ ":$containerLDLibPath:" != *":$containerPipLLMLibPath:"* ]]; then
+  containerLDLibPath="$containerPipLLMLibPath${containerLDLibPath:+:$containerLDLibPath}"
+fi
 export LD_LIBRARY_PATH=$containerLDLibPath
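The colon-wrapping trick above can be checked in isolation; `contains_path` is an invented helper name, not part of the script.

```shell
# Wrapping both the list and the entry in ':' makes the membership test
# exact per component, so /usr/lib no longer matches inside /usr/lib64.
contains_path() {
    local list="$1" entry="$2"
    [[ ":$list:" == *":$entry:"* ]]
}

contains_path "/usr/lib64:/opt/x" "/usr/lib" && echo match || echo "no match"
contains_path "/usr/lib:/opt/x"  "/usr/lib" && echo match || echo "no match"
```

The first call reports no match (substring false positive avoided), the second matches the exact component.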

6-15: Add strict error handling, guards, and proper quoting

The ShellCheck output confirms several remaining warnings around unguarded variables, unquoted expansions, and unchecked cd commands. To eliminate these and fail fast on errors, please apply the following mandatory refactors:

• Add at the top of jenkins/scripts/slurm_run.sh:

#!/bin/bash
set -Eeuo pipefail

This ensures the script exits on any unhandled error, undefined variable, or failed pipeline.
(Addresses SC2154 & SC2086 warnings for undefined and unquoted vars.)

• Validate required environment variables before use:

: "${resourcePathNode:?resourcePathNode must be set}"
: "${jobWorkspace:?jobWorkspace must be set}"
: "${stageName:?stageName must be set}"

This converts SC2154 warnings into immediate failures if any of these are unset.

• Quote all variable expansions and guard cd:

- cd $resourcePathNode
+ cd "$resourcePathNode" || exit

- coverageConfigFile=$jobWorkspace/.coveragerc
+ coverageConfigFile="$jobWorkspace/.coveragerc"

- mkdir -p $resultsPath
+ mkdir -p "$resultsPath"

- if [ $SLURM_LOCALID -eq 0 ]; then
+ if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then

- wget -nv $llmTarfile
+ wget -nv "$llmTarfile"

- tar -zxf $tarName
+ tar -zxf "$tarName"

- cd $llmSrcNode && pip3 install …
+ cd "$llmSrcNode" && pip3 install …

- eval $fullCmd
+ eval "$fullCmd"

Quoting prevents globbing/word-splitting (SC2086) and checking cd prevents silent directory-change failures (SC2164).

• Replace sed 's/[[:space:]]\+/_/g' with parameter expansion where possible:

trtllmWhlPath="${trtllmWhlPath//[[:space:]]/_}"

This addresses SC2001 style suggestions.

• Review or remove testList if it’s truly unused (SC2034).

• Optionally, quote your here-doc delimiter (<< "EOF") if you ever need literal $ in the config file—otherwise, leave it unquoted to allow expansion.

Applying these changes will clear all ShellCheck warnings and make the script robust against missing variables, failed commands, and unsafe expansions.

🧹 Nitpick comments (4)
jenkins/scripts/slurm_run.sh (4)

19-36: Safer lockfile handling and SLURM_LOCALID check.

  • If the master path differs from worker PWDs, install_lock.lock may not be visible; anchor it in jobWorkspace.
  • Clean up the lock on exit to avoid blocking subsequent runs if the master fails.
  • Guard SLURM_LOCALID to avoid unbound var errors.
-resultsPath=$jobWorkspace/results
+resultsPath="$jobWorkspace/results"
 mkdir -p $resultsPath
-if [ $SLURM_LOCALID -eq 0 ]; then
+lock_file="$jobWorkspace/install_lock.lock"
+if (( ${SLURM_LOCALID:-0} == 0 )); then
@@
-    touch install_lock.lock
+    trap 'rm -f "$lock_file"' EXIT
+    : > "$lock_file"
 else
-    while [ ! -f install_lock.lock ]; do
+    while [ ! -f "$lock_file" ]; do
         sleep 5
     done
 fi

20-27: Improve setup robustness.

  • Prefer command -v over which.
  • apt-get install should be preceded by apt-get update in ephemeral workers.
-    which python3
+    command -v python3 || true
     python3 --version
-    apt-get install -y libffi-dev
+    apt-get update -y && apt-get install -y libffi-dev

30-31: Nit: double equals in log line; standardize formatting.

Current line prints "GPU_UUIDS = =...". Remove the extra "=" and quote vars to avoid weird spacing.

-    echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME=$HOST_NODE_NAME ; GPU_UUIDS=$gpuUuids ; STAGE_NAME=$stageName"

37-37: Remove unused testList assignment in slurm_run.sh

The assignment on line 37:

testList="$testList_$splitId"

is never referenced elsewhere in the script (only testListPathNode is used for --test-list) . To avoid confusion and dead code, please remove this unused variable assignment.

• File: jenkins/scripts/slurm_run.sh
• Remove line 37: testList="$testList_$splitId"

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 59ab9a7 and 503494c.

📒 Files selected for processing (1)
  • jenkins/scripts/slurm_run.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 6-6: jobWorkspace is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/scripts/slurm_run.sh (1)

70-76: Coverage flags look consistent with the new .coveragerc path.

Using "--cov-config=$coverageConfigFile" and adding both source tree and installed wheel paths is correct and aligns with the sed replacement above.

@chzblych

Collaborator

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-2, DGX_B200-4_GPUs-PyTorch-3, DGX_B200-4_GPUs-PyTorch-4, DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2, DGX_B200-4_GPUs-PyTorch-Post-Merge-3, DGX_B200-4_GPUs-PyTorch-Post-Merge-4, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@chzblych chzblych marked this pull request as ready for review August 24, 2025 14:02
@chzblych chzblych requested a review from a team as a code owner August 24, 2025 14:02
@tensorrt-cicd
Collaborator

PR_Github #16314 [ run ] triggered by Bot

@chzblych
Collaborator

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-2, DGX_B200-4_GPUs-PyTorch-Post-Merge-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-2, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16322 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16314 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)

54-108: Post-merge block: good separation; keep duplicates intentional and watch runtime

Defining a distinct post-merge condition with a tuned matrix is the right call. A few configs overlap conceptually with pre-merge; if intentional for stability, fine. Otherwise consider moving the heaviest variants to post-merge only to keep PR latency reasonable.

Note: The AutoDeploy test at Line 107 reflects an earlier suggestion and is now present—nice.
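For readers unfamiliar with the test-db layout, the pre-merge/post-merge separation discussed here is expressed as separate condition blocks gated on a `stage` term. The sketch below is illustrative of the general l0_* shape only; field values are invented, not copied from the file under review:

```yaml
# Illustrative only: two blocks gated on the `stage` term, so heavy variants
# can live in post_merge without inflating PR latency.
- condition:
    terms:
      stage: pre_merge
      backend: pytorch
  tests:
    - unittest/_torch/some_fast_smoke_test        # hypothetical entry
- condition:
    terms:
      stage: post_merge
      backend: pytorch
  tests:
    - unittest/_torch/some_heavy_multi_gpu_test   # hypothetical entry
```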

jenkins/L0_Test.groovy (1)

1897-1901: DGX_B200 split introduced (2-way pre- and post-merge): naming, gating, and split_count align

Stage names, platform key, test context, split_id/split_count, and gpu_count=4 are consistent with the test DB and mako gating (“-PyTorch-”, “Post-Merge”). LGTM.

Reminder from earlier thread: ensure SlurmConfig has platform key "b200-x4" configured; previous verification flagged it once. If already added in this PR series, ignore.

🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)

1-1: Avoid pinning to a dev branch of the shared Jenkins library for mainline

Pinning bloom-jenkins-shared-lib to dev-yanchaol-empty-partition is brittle for main. Prefer main or make this branch selectable via an env/param toggle restricted to CI experimentation.

Apply one of these:

-@Library(['bloom-jenkins-shared-lib@dev-yanchaol-empty-partition', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

Or parameterize:

-@Library(['bloom-jenkins-shared-lib@dev-yanchaol-empty-partition', 'trtllm-jenkins-shared-lib@main']) _
+@Library(["bloom-jenkins-shared-lib@${env.BLOOM_LIB_REF ?: 'main'}", 'trtllm-jenkins-shared-lib@main']) _

I can wire BLOOM_LIB_REF into your GitHub bot flags if desired.


2160-2162: 30-minute timeout for pip sanity may be tight in some environments

Previous runs sometimes spend >30m when upstream is slow or cache is cold. Consider 45–60m or make it configurable via env to reduce flaky timeouts.

Proposed tweak:

-    timeout(time: 30, unit: 'MINUTES') {
+    timeout(time: 60, unit: 'MINUTES') {

Or configurable:

-    timeout(time: 30, unit: 'MINUTES') {
+    timeout(time: (env.PIP_CHECK_TIMEOUT_MINUTES ?: '60') as Integer, unit: 'MINUTES') {

📥 Commits

Reviewing files that changed from the base of the PR and between a355a6e and 4cea7e5.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (3 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
🔇 Additional comments (3)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (3)

14-16: Pre-merge gating is correct; verify intended shift from post_merge to pre_merge

Switching this block to stage: pre_merge aligns with adding a separate post-merge block below. Looks good; just confirm downstream consumers weren’t relying on these tests being post-merge previously.

Would you like a quick script to compare which tests moved pre→post vs stayed the same to spot unexpected shifts?


32-38: Unable to locate any rendered list files locally—these are generated in your CI artifacts rather than the repo. Please verify manually that the new DeepSeek V3 Lite nvfp4 test configurations are evenly distributed across your two splits (or tagged as post-merge) to avoid clustering heavy runs in one group.


39-53: All referenced test symbols verified
All of the classes and parametrized test functions listed in l0_dgx_b200.yml were found in the codebase—there are no missing or mistyped symbols that would lead to “collected 0 tests.” Closing this check.

@tensorrt-cicd
Collaborator

PR_Github #16322 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12272 (Partly Tested) completed with status: 'SUCCESS'

@chzblych
Collaborator

/bot skip --comment "Partial testing is sufficient"

@tensorrt-cicd
Collaborator

PR_Github #16331 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16331 [ skip ] completed with state SUCCESS
Skipping testing for commit 4cea7e5

@chzblych
Collaborator

/bot skip --comment "Partial testing is sufficient"

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

1897-1901: Verify SLURM platform key and test-db context exist for B200

These stages reference:

  • platform key: b200-x4
  • test-db context: l0_dgx_b200

Please confirm both are present in SlurmConfig/partition configs and tests/integration/test_lists/test-db, or these stages will fail at scheduling/test-list rendering time.

Run this repo-scoped check:

#!/bin/bash
set -euo pipefail

echo "Search Slurm/partition configs for platform 'b200-x4' (excluding this file):"
rg -nP -C3 '\bb200-x4\b' -g '!jenkins/L0_Test.groovy' || true

echo
echo "Search for l0_dgx_b200 test-db YAML and references:"
fd -a -t f -g 'l0_dgx_b200*.yml' tests || true
rg -nP -C2 '\bl0_dgx_b200\b' tests || true
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

2160-2163: Hard-coded 30-minute timeout for pip-install sanity may be too aggressive; make it configurable

Network/artifactory variance and cold caches can push this step beyond 30 minutes. Suggest exposing the timeout via an env/param with a sane default to reduce flakiness.

Apply:

-            timeout(time: 30, unit: 'MINUTES') {
+            def pipInstallTimeoutMin = (env.pipInstallTimeoutMinutes ?: '30') as Integer
+            timeout(time: pipInstallTimeoutMin, unit: 'MINUTES') {
                 checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
             }

Optionally add pipInstallTimeoutMinutes to the pipeline parameters or environment with default "30".


📥 Commits

Reviewing files that changed from the base of the PR and between 4cea7e5 and 6f574de.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (2 hunks)
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

1897-1901: DGX_B200 4-GPU split and Post-Merge variants: config looks correct

  • Stage names adhere to the documented "Stage-Name": ["platform", "yaml", split_id, split_count, gpu_count] format.
  • split_count=2 with indices 1..2 is internally consistent for both pre- and post-merge sets.
  • gpu_count=4 is carried through, so multi-GPU filtering and phase-2 orchestration will pick these up.

@tensorrt-cicd
Collaborator

PR_Github #16336 [ skip ] triggered by Bot

@chzblych chzblych enabled auto-merge (squash) August 25, 2025 01:00
@tensorrt-cicd
Collaborator

PR_Github #16336 [ skip ] completed with state SUCCESS
Skipping testing for commit 6f574de
