Skip to content

Conversation

yiqingy0
Copy link
Collaborator

@yiqingy0 yiqingy0 commented Aug 20, 2025

Summary by CodeRabbit

  • Chores

    • CI now uses an updated per-stage retry library reference; per-stage retry wrapper accepts a configurable sleep/delay.
  • Tests

    • Test matrix reduced—many PyTorch and end-to-end entries disabled/commented to shrink run scope.
    • Per-stage automatic retry wrapping applied across test stages for consistency.
  • Improvements

    • Rerun failure messages surface underlying error details for clearer diagnostics.
  • Bug Fixes

    • Infrastructure/machine errors now propagate immediately and skip automated reruns when detected.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Copy link
Contributor

coderabbitai bot commented Aug 20, 2025

📝 Walkthrough

Walkthrough

Switches the TRTL Jenkins shared-library reference to a user branch, wraps per-stage execution with trtllm_utils.llmStageWithRetry(..., [sleepTimeInSecs: 120]), adds infrastructure-error propagation and a hudson.AbortException inspection to skip reruns for infra/machine failures, and comments out many PyTorch/test_e2e entries in a test matrix YAML.

Changes

Cohort / File(s) Summary
Jenkins pipeline & library reference
jenkins/L0_Test.groovy
Update @Library to include trtllm-jenkins-shared-lib@user/yiqingy/retry_stage; replace stage(key) { ... } wrappers with trtllm_utils.llmStageWithRetry(pipeline, key, { ... }, [sleepTimeInSecs: 120]); add infrastructure-error propagation path and a catch (hudson.AbortException) that inspects failure text to detect infra/machine errors and skip reruns when detected; retain original stage logic and add else { values() } path for non-list/non-docker flows.
Integration test matrix
tests/integration/test_lists/test-db/l0_a10.yml
Commented-out many PyTorch and test_e2e entries (e.g., modeling_*, disaggregated_*, many test_e2e items); removed disaggregated_cuda_graph. CPP, TensorRT, and pybind sections unchanged.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Jenkins as Jenkins Pipeline
    participant Utils as trtllm_utils.llmStageWithRetry
    participant Closure as Stage Closure (tests)
    participant Shell as Shell / PyTest
    participant Catcher as AbortException inspector

    Jenkins->>Utils: llmStageWithRetry(pipeline, key, closure, [sleepTimeInSecs:120])
    Utils->>Closure: invoke closure
    Closure->>Shell: run test command(s)
    Shell-->>Closure: success / throws Exception
    alt success
        Closure-->>Utils: return success
        Utils-->>Jenkins: stage success
    else failure
        Closure-->>Utils: throw
        Utils->>Catcher: catch hudson.AbortException? / inspect message
        alt infra/machine detected
            Catcher-->>Jenkins: propagate infra error (no rerun)
        else normal failure
            Utils->>Utils: apply retry logic (sleep/attempts)
            alt retries exhausted
                Utils-->>Jenkins: propagate final failure (includes e.message)
            end
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • ZhanruiSunCh
  • chzblych
  • litaotju

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
jenkins/L0_Test.groovy (4)

1-1: Pinning shared lib to a user branch — revert before merge.

Good for experimentation, but please switch trtllm-jenkins-shared-lib back to a stable ref (e.g., main) before merging to avoid pipeline drift or accidental dependency on a personal branch.

Apply this diff when ready to merge:

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

1414-1414: Remove redundant mutation of testList.

The reassignment isn’t used after testDBList is computed and can confuse future readers.

Apply this diff:

-        testList = "${testList}_${splitId}"

1492-1495: Declare local variable for clarity and scope.

Use a local binding to avoid leaking into the script binding.

Apply this diff:

-                    isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine)
+                    def isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine)

1500-1519: Avoid re-running perf/report steps on retry.

As written, if the perf check/report fails, the retry wrapper will re-run the entire test suite. That’s expensive and unnecessary. Limit retries to the Pytest execution block and leave perf post-processing outside the retry closure.

Apply this diff to close the retry block before perf and add explicit locals:

         }
-        if (perfMode) {
-            basePerfFilename = stageName.contains("PyTorch") ? "base_perf_pytorch.csv" : "base_perf.csv"
-            basePerfPath = "${llmSrc}/tests/integration/defs/perf/${basePerfFilename}"
+    })
+
+    if (perfMode) {
+        def basePerfFilename = stageName.contains("PyTorch") ? "base_perf_pytorch.csv" : "base_perf.csv"
+        def basePerfPath = "${llmSrc}/tests/integration/defs/perf/${basePerfFilename}"
         stage("Check perf result") {
             sh """
                 python3 ${llmSrc}/tests/integration/defs/perf/sanity_perf_check.py \
                 ${stageName}/perf_script_test_results.csv \
                 ${basePerfPath}
             """
         }
         stage("Create perf report") {
             sh """
                 python3 ${llmSrc}/tests/integration/defs/perf/create_perf_comparison_report.py \
                 --output_path ${stageName}/report.pdf \
                 --files ${stageName}/perf_script_test_results.csv \
                 ${basePerfPath}
             """
         }
-    })
+    }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 983fb8e and fa4ff74.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (3 hunks)
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

1400-1411: Retry wrapper around Pytest stage — looks good.

Wrapping the Pytest stage with llmExecStepWithRetry is a sensible containment of flakiness and should improve stability without altering the test commands.


1400-1400: Confirm wrapper signature supports a closure.

Ensure trtllm_utils.llmExecStepWithRetry(pipeline, String, Closure) is available in the pinned shared-lib branch. If not, this will fail at runtime with a signature mismatch.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

1-1: Avoid pinning shared lib to a personal branch when targeting main

Using trtllm-jenkins-shared-lib@user/yiqingy/retry_stage in a top-level @Library annotation will make all downstream jobs depend on a non-main, personal branch. This is risky to merge and can break unrelated pipelines if that branch changes or is deleted.

Please revert to @main before merging this PR (or validate on a temporary job). If you need dynamic versioning in production, prefer the library() step with a parameterized version inside a guarded path rather than the annotation.

Apply this diff before merging:

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fa4ff74 and d607581.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

1400-1512: Stage-level retry wrapper is a good addition

Wrapping the Pytest stage with trtllm_utils.llmStageWithRetry(...) makes sense for infra flakiness. The closure body keeps the original logic, and the outer cacheErrorAndUploadResult(...) still handles results packaging/publishing, so the flow remains consistent.

Two asks:

  • Confirm llmStageWithRetry exists on the referenced shared-lib branch and that it only retries infra-flaky failures (not genuine test failures).
  • Ensure the wrapper does not perform its own junit/artifact publishing to avoid duplicates with cacheErrorAndUploadResult.

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15899 [ run ] triggered by Bot

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15901 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15899 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15901 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11952 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

1485-1493: Unconditional throw guarantees stage failure; remove or guard behind a flag and preserve root cause

Right after pytest completes you unconditionally throw new Exception("Infrastructure error"). The catch then replaces any failure with the same generic exception. This forces every run to fail and defeats the per-stage retry goal.

Fix by:

  • Removing the unconditional throw in the try block.
  • Preserving InterruptedException.
  • Only mapping to “Infrastructure error” when explicitly requested (e.g., to exercise retry), otherwise rethrow the original exception.

Patch:

                     sh """
                         rm -rf ${stageName}/ && \
                         cd ${llmSrc}/tests/integration/defs && \
                         ${testCmdLine.join(" ")}
                     """
-                    throw new Exception("Infrastructure error")
-                } catch (Exception e) {
-                    throw new Exception("Infrastructure error")
+                } catch (InterruptedException e) {
+                    // Preserve interruption semantics (user abort/timeout)
+                    throw e
+                } catch (Exception e) {
+                    // Optionally mark infra to exercise retry path
+                    if (env.FORCE_INFRA_RETRY?.toBoolean()) {
+                        throw new Exception("Infrastructure error", e)
+                    }
+                    // Keep the original failure for accurate diagnosis and reporting
+                    throw e
                 }

This aligns with the retry-wrapper intent without turning every attempt into a hard failure.

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

2281-2301: Stage-level retry wrapper adoption: confirm it creates a Jenkins stage and avoids duplicate JUnit on retries

Wrapping each job with trtllm_utils.llmStageWithRetry(pipeline, key, { ... }) is the right direction. Two verification points:

  • Ensure the wrapper itself opens a stage(key) so Blue Ocean and log grouping remain intact per attempt. If it does not, consider nesting a stage(key) inside the wrapper or adjusting the wrapper to do so.

  • On retries, cacheErrorAndUploadResult currently always archives and runs junit(...). Multiple attempts may re-parse/merge JUnit results and re-upload artifacts. Verify the wrapper either:

    • cleans ${stageName} before each attempt (so previous results aren’t re-ingested), and/or
    • exposes an “is final attempt” signal (env var like TRTLLM_RETRY_FINAL=true) so we can conditionally run junit(...) only once.

If the wrapper already guarantees these properties, we’re good. Otherwise, I can propose a small change to gate junit(...) on a final-attempt signal.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d607581 and 8fff0db.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@@ -1,4 +1,4 @@
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Do not merge with shared-lib pinned to a personal branch; pin to main or a commit SHA

Using trtllm-jenkins-shared-lib@user/yiqingy/retry_stage couples mainline CI to a mutable personal branch and can break/rewrite CI behavior without a code change. Please either:

  • revert to @main, or
  • pin to a reviewed tag/commit SHA for the shared lib.

Suggested change:

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

If you need the retry wrapper during iteration, gate the dev lib behind a clearly named env/param (e.g., USE_DEV_SHARED_LIB=true) and default it off.


🏁 Script executed:

#!/bin/bash
# List all @Library declarations in the repository to verify usage of personal branches
rg -n "@Library\\(" -C2

Length of output: 855


Pin shared library to a stable branch or SHA
The trtllm-jenkins-shared-lib@user/yiqingy/retry_stage reference in your CI pipeline ties mainline tests to an unreviewed, mutable personal branch. To prevent unintended breakages or behavior changes:

• In jenkins/L0_Test.groovy (line 1), replace:

- @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+ @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

—or better—pin trtllm-jenkins-shared-lib to a reviewed tag or commit SHA.
• If you still need the retry_stage logic for development, wrap it behind an explicit flag (e.g. USE_DEV_SHARED_LIB=true) that defaults to off, so CI remains stable by default.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the shared library reference uses a
personal mutable branch (`trtllm-jenkins-shared-lib@user/yiqingy/retry_stage`),
which can introduce unreviewed changes into CI; replace that reference with a
stable reviewed tag or a specific commit SHA (for example
`[email protected]` or `@<commit-sha>`) to pin behavior, or
alternatively implement a simple boolean flag (e.g. USE_DEV_SHARED_LIB=false by
default) and conditionally include the dev branch only when that flag is
explicitly set to true so CI uses the stable pinned library by default.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16004 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16004 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12035 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)

802-803: Implement a redacted environment dump helper to replace all env | sort calls

Directly dumping the full environment can leak secrets that Jenkins’s own masking won’t catch (especially third-party or custom variables). We located 10 occurrences of sh "env | sort" in jenkins/L0_Test.groovy and found no existing safeEnvDump utility to centralize redaction:

  • Lines 555, 802, 834, 1284, 1483, 1579, 2128–2129, 2137–2138, 2374, 2383

Please refactor as follows:

  1. Add a new helper in trtllm_utils, e.g.:

    // File: src/trtllm_utils.groovy (or appropriate shared library)
    class trtllm_utils {
      static void safeEnvDump(steps, File outputFile = null) {
        def cmd = '''
          set -e
          env | sort \
            | sed -E 's/(?i)(PASS|PASSWORD|TOKEN|SECRET|KEY|AUTH|COOKIE)=.*/\\1=***REDACTED***/'
        '''
        if (outputFile) {
          steps.sh "${cmd} > ${outputFile}"
        } else {
          steps.sh cmd
        }
      }
      // ...
    }
  2. Replace each sh "env | sort" (and variants that redirect to files) with:

    -    sh "env | sort${redirect}"
    +    trtllm_utils.safeEnvDump(this${redirect ? ", file('${redirectFile}')" : ""})

    For example, at lines 802–803:

    -    sh "env | sort"
    +    trtllm_utils.safeEnvDump(this)
  3. Review credential usage to ensure Jenkins masking is correctly configured—but remember that any non-standard environment variables (e.g., those injected by plugins or external tools) will still be redacted by this helper.

This change both unifies environment dumping and ensures consistent redaction across all 10 sites in jenkins/L0_Test.groovy.

♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)

1-1: Do not pin shared library to a mutable personal branch

Using trtllm-jenkins-shared-lib@user/yiqingy/retry_stage makes mainline CI depend on an unreviewed, mutable branch. Pin to a stable branch (e.g., @main) or a reviewed tag/commit SHA before merge.

Apply one of the following:

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

—or pin to a release tag/commit:

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@<reviewed-tag-or-sha>']) _

Confirm there is no other file referencing @user/yiqingy/retry_stage. If needed, I can provide a repo scan script.


1490-1490: Unconditional failure forces every run to fail

throw new Exception("Infrastructure error") guarantees this stage fails, defeats rerun logic, and matches the observed downstream failures. Remove it or guard it behind an explicit, disabled-by-default flag.

Minimal removal:

-                    throw new Exception("Infrastructure error")

Safer: feature-flag for local testing only (defaults off):

-                    throw new Exception("Infrastructure error")
+                    if (env.FORCE_INFRA_ERROR?.toBoolean()) {
+                        error "Forced infrastructure error for retry-path validation (FORCE_INFRA_ERROR=true)"
+                    }
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)

1496-1496: Sanitize exception text in user-visible failure message

Including raw ${e.message} can leak sensitive data (tokens/URLs) and can be overly verbose. Emit a bounded, sanitized summary and rely on the already-uploaded rerun report for details.

-                        error "The tests still failed after rerun attempt. ${e.message}"
+                        def msg = (e?.message ?: e?.toString() ?: "Unknown error")
+                        // Bound length to avoid log spam; basic hygiene against accidental secrets in messages.
+                        msg = msg.take(512)
+                        error "The tests still failed after rerun attempt. ${msg}"

Please confirm no secrets are ever embedded in exception messages from the test harness. If unsure, prefer not logging the raw message here and point to the rerun HTML report instead.


2286-2305: Validate llmStageWithRetry behavior: stage UI, retries, and result publication semantics

Great to add a retry wrapper. Two things to verify:

  • That llmStageWithRetry(pipeline, key, { ... }) still produces a proper Jenkins Stage in the UI and propagates currentBuild.result/failFast correctly under parallel.
  • That retries don’t double-publish JUnit results or artifacts. Given cacheErrorAndUploadResult runs inside the retried closure, the first attempt may upload results before a retry. Consider teaching the wrapper to expose “attempt index/last attempt” so we can skip JUnit/artifact upload on non-final attempts.

If the helper supports it, pass an “on-final-attempt” hook; otherwise, we can add a lightweight guard around upload steps.

Could you share the llmStageWithRetry signature/semantics or point to its implementation? I can tailor a small refactor to publish results only on the last attempt.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8fff0db and 183f139.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16108 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16108 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12116 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)

1485-1499: Remove the unconditional infrastructure exception; it forces every stage to fail

throw new Exception("Infrastructure error") guarantees failure and defeats the retry-on-infra-flakiness goal, explaining the repeated A10-PyTorch failures after triggering reruns.

Safe options:

  • Remove it entirely
-                    """
-                        rm -rf ${stageName}/ && \
-                        cd ${llmSrc}/tests/integration/defs && \
-                        ${testCmdLine.join(" ")}
-                    """
-                    throw new Exception("Infrastructure error")
+                    """
+                        rm -rf ${stageName}/ && \
+                        cd ${llmSrc}/tests/integration/defs && \
+                        ${testCmdLine.join(" ")}
+                    """
  • Or guard behind an explicit, default-off flag to exercise the retry path only when intended:
                     """
                         rm -rf ${stageName}/ && \
                         cd ${llmSrc}/tests/integration/defs && \
                         ${testCmdLine.join(" ")}
                     """
-                    throw new Exception("Infrastructure error")
+                    if (env.FORCE_INFRA_FAILURE?.toBoolean()) {
+                        error "Forced infrastructure failure to exercise retry path"
+                    }

1-1: Do not pin CI to a personal/mutable shared-lib branch

Referencing trtllm-jenkins-shared-lib@user/yiqingy/retry_stage makes mainline CI behavior mutable and unreviewed. Pin to a stable branch or commit SHA, and use a guarded flag if you must test a dev lib.

Apply this diff (or pin to a reviewed tag/SHA):

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

Optional gating approach (outside this line): add a boolean param/env like USE_DEV_SHARED_LIB=false and include the dev lib only when explicitly enabled.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 183f139 and aab700f.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

1494-1498: Good: include original exception text in rerun failure message

Appending e.message makes the rerun failure actionable in build logs. Keep this.

Comment on lines 2287 to 2341
trtllm_utils.llmStageWithRetry(pipeline, key, {
if (key in testFilter[REUSE_STAGE_LIST]) {
stage("Skip - reused") {
echo "Skip - Passed in the last pipeline."
}
} else if (values instanceof List && dockerNode == null) {
trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", {
values[1]()
})
} else if (values instanceof List && dockerNode != null) {
node(dockerNode) {
deleteDir()
docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) {
values[1]()
}
}
} else {
values()
}
} else {
values()
}
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fix NPE risk in REUSE_STAGE_LIST check and streamline wrapper logic

  • key in testFilter[REUSE_STAGE_LIST] will throw if REUSE_STAGE_LIST is null (the default in this file). Use a null-safe contains check.
  • Optional: avoid nested stage("Skip - reused") inside an outer stage(key) to keep Stage View tidy; a simple echo suffices.

Apply this minimal fix:

-                if (key in testFilter[REUSE_STAGE_LIST]) {
+                if (testFilter[REUSE_STAGE_LIST]?.contains(key)) {
                     stage("Skip - reused") {
                         echo "Skip - Passed in the last pipeline."
                     }

Optional clean-up (outside this hunk): default REUSE_STAGE_LIST to [] instead of null where testFilter is declared to eliminate null checks.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
trtllm_utils.llmStageWithRetry(pipeline, key, {
if (key in testFilter[REUSE_STAGE_LIST]) {
stage("Skip - reused") {
echo "Skip - Passed in the last pipeline."
}
} else if (values instanceof List && dockerNode == null) {
trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", {
values[1]()
})
} else if (values instanceof List && dockerNode != null) {
node(dockerNode) {
deleteDir()
docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) {
values[1]()
}
}
} else {
values()
}
} else {
values()
}
})
trtllm_utils.llmStageWithRetry(pipeline, key, {
if (testFilter[REUSE_STAGE_LIST]?.contains(key)) {
stage("Skip - reused") {
echo "Skip - Passed in the last pipeline."
}
} else if (values instanceof List && dockerNode == null) {
trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", {
values[1]()
})
} else if (values instanceof List && dockerNode != null) {
node(dockerNode) {
deleteDir()
docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) {
values[1]()
}
}
} else {
values()
}
})
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 2287 to 2306, the check `key in
testFilter[REUSE_STAGE_LIST]` can NPE when REUSE_STAGE_LIST is null and the
inner `stage("Skip - reused")` creates nesting; change the condition to a
null-safe contains check (e.g. use testFilter[REUSE_STAGE_LIST] != null &&
testFilter[REUSE_STAGE_LIST].contains(key) or the Groovy safe navigation
variant) and replace the nested stage block with a simple echo/skip message (no
inner stage) so the outer stage remains tidy; optionally set REUSE_STAGE_LIST
default to [] where testFilter is declared to avoid future null checks.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16132 [ run ] triggered by Bot

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/integration/test_lists/test-db/l0_a10.yml (3)

17-20: Pre-merge PyTorch coverage heavily reduced — prefer isolating to a temporary stage instead of editing the default matrix.

Commenting out most PyTorch entries in the pre_merge block will materially reduce signal and can mask regressions if this lands as-is. If the goal is to validate the new rerun flow quickly, create a dedicated lightweight “retry” stage (or use a stage overlay) and keep the default pre_merge matrix intact.

This also aligns with prior learnings that CLI flow tests and PyTorch API tests should coexist to cover both paths.

Example approach (add a targeted block instead of editing pre_merge):

# New block to keep experiments isolated from default pre_merge
- condition:
    ranges:
      system_gpu_count:
        gte: 1
        lte: 1
    wildcards:
      gpu:
      - '*a10*'
      linux_distribution_name: ubuntu*
    terms:
      stage: retry_stage   # trigger with /bot run --stage-list "A10-PyTorch-1" + mapping
      backend: pytorch
  tests:
  - test_e2e.py::test_openai_misc_example[pytorch]
  - test_e2e.py::test_openai_completions_example[pytorch]
  - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]

If introducing a new stage isn’t feasible, please at least add a loud TODO on this block to ensure it’s reverted before merge. I can open a tracking issue if helpful.


21-21: Avoid relying on cuda_graph as the sole PyTorch smoke — it can be flaky and environment-sensitive.

Keeping only test_disaggregated_cuda_graph as the lone active pre_merge PyTorch test is brittle for A10 CI; CUDA graph tests are more sensitive to driver/runtime details and may fail for infra reasons unrelated to the rerun logic.

Proposed minimal smoke swap: comment out the cuda_graph test and enable a stable E2E miscellany test instead.

-  - disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0]
+  # Prefer a simpler, faster PyTorch smoke to validate rerun infra
+  - test_e2e.py::test_openai_misc_example[pytorch]

22-36: Add representative PyTorch smoke tests back into the pre-merge matrix

Verification of l0_a10.yml shows only one active pre_merge/pytorch test remains:

  • disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0]

To preserve baseline signal without significantly increasing runtime, please re-enable:

  • One disaggregated variant:
    • disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]

  • Two fast E2E PyTorch tests:
    • test_e2e.py::test_openai_misc_example[pytorch]
    • test_e2e.py::test_openai_completions_example[pytorch]

Optionally, you may also enable test_openai_reasoning[pytorch] if time allows.

Please apply a diff like:

@@ tests/integration/test_lists/test-db/l0_a10.yml
-  # - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]
+  - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]

@@
-  # - test_e2e.py::test_openai_misc_example[pytorch]
+  - test_e2e.py::test_openai_misc_example[pytorch]
-  # - test_e2e.py::test_openai_completions_example[pytorch]
+  - test_e2e.py::test_openai_completions_example[pytorch]

This will bring the active count up to at least three, exercising both the rerun wrapper and core PyTorch paths without a large runtime impact.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between aab700f and 2ae88a5.

📒 Files selected for processing (1)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_a10.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16148 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16132 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16148 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12147 (Partly Tested) completed with status: 'FAILURE'

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
jenkins/L0_Test.groovy (3)

1515-1529: Unconditional failure makes every test stage fail; rerun path is unreachable

throw new Exception("Infrastructure error") at Line 1520 forces a failure even when tests pass. The second throw at Line 1524 aborts before rerunFailedTests(...), so reruns and the improved failure message never execute. This likely explains repeated A10-PyTorch-1 stage failures observed in this PR thread.

Apply this diff to let reruns happen and keep the enhanced error message:

                 sh """
                     rm -rf ${stageName}/ && \
                     cd ${llmSrc}/tests/integration/defs && \
                     ${testCmdLine.join(" ")}
                 """
-                throw new Exception("Infrastructure error")
             } catch (InterruptedException e) {
                 throw e
             } catch (Exception e) {
-                throw new Exception("Infrastructure error")
-                isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine)
-                if (isRerunFailed) {
-                    error "The tests still failed after rerun attempt. ${e.message}"
-                }
+                echo "Test execution failed, attempting rerun. Reason: ${e.message}"
+                def isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine)
+                if (isRerunFailed) {
+                    error "The tests still failed after rerun attempt. ${e.message}"
+                } else {
+                    echo "Rerun succeeded; proceeding."
+                }
             }

If you need to exercise the retry path locally, guard the artificial failure behind an env flag (e.g., only throw when env.FORCE_INFRA_FAIL == 'true') and default it off.


1-1: Pin shared library to a stable ref (not a personal branch)

Referencing trtllm-jenkins-shared-lib@user/yiqingy/retry_stage couples CI in main to a mutable personal branch. Pin to @main, a reviewed tag, or a commit SHA. If you must iterate on a dev branch, gate it behind an explicit flag and default to the stable ref.

Apply this diff:

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

Optional pattern (if you really need a dev lib temporarily, but keep default stable):

  • Keep the annotation pinned to main.
  • Load the dev lib only when explicitly requested via the imperative library 'trtllm-jenkins-shared-lib@<sha-or-dev-branch>' inside a guarded block (note: be mindful of CPS compilation order and symbol resolution).

2320-2343: Retry wrapper: avoid nested stage and fix null-safe reuse check

  • if (key in testFilter[REUSE_STAGE_LIST]) will NPE when REUSE_STAGE_LIST is null (default is null in this file).
  • Nesting stage("Skip - reused") inside stage(key) clutters Stage View.

Apply this minimal, behavior-preserving diff:

-        stage(key) {
-            trtllm_utils.llmStageWithRetry(pipeline, key, {
-                if (key in testFilter[REUSE_STAGE_LIST]) {
-                    stage("Skip - reused") {
-                        echo "Skip - Passed in the last pipeline."
-                    }
-                } else if (values instanceof List && dockerNode == null) {
+        stage(key) {
+            trtllm_utils.llmStageWithRetry(pipeline, key, {
+                if (testFilter[REUSE_STAGE_LIST]?.contains(key)) {
+                    echo "Skip - Passed in the last pipeline."
+                    return
+                } else if (values instanceof List && dockerNode == null) {
                     trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", {
                         values[1]()
                     })
                 } else if (values instanceof List && dockerNode != null) {
                     node(dockerNode) {
                         deleteDir()
                         docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) {
                             values[1]()
                         }
                     }
                 } else {
                     values()
                 }
             })
         }

Optional hardening outside this hunk: default REUSE_STAGE_LIST to [] to remove the need for null checks:

-    (REUSE_STAGE_LIST): null,
+    (REUSE_STAGE_LIST): [],
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

2320-2323: Confirm llmStageWithRetry does not create its own stage to avoid double-wrapping

stage(key) { llmStageWithRetry(pipeline, key, { ... }) } may produce nested stages if the retry helper also calls stage(key) internally. If it does, prefer a single stage wrapper to keep UI clean and timing accurate.

Two options:

  • If llmStageWithRetry already manages the stage, drop the outer stage(key) and let the helper own rendering.
  • If it doesn’t, keep the outer stage(key) and ensure the helper focuses solely on retries without creating nested Jenkins stages.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae88a5 and 81876cb.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (3 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_a10.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16346 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16346 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12286 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0
Copy link
Collaborator Author

/bot run --stage-list "A10-PyTorch-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16419 [ run ] triggered by Bot

@yiqingy0 yiqingy0 changed the title [None][infra] Rerun stage [TRTLLM-7458][infra] Rerun stage Aug 25, 2025
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16419 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12339 (Partly Tested) completed with status: 'FAILURE'

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
jenkins/L0_Test.groovy (3)

2325-2345: Null-safe reuse check and avoid nested stage inside stage.

key in testFilter[REUSE_STAGE_LIST] NPEs when the list is null (default). Also the inner stage("Skip - reused") clutters Stage View. Echo and return instead.

-            trtllm_utils.llmStageWithRetry(pipeline, key, {
-                if (key in testFilter[REUSE_STAGE_LIST]) {
-                    stage("Skip - reused") {
-                        echo "Skip - Passed in the last pipeline."
-                    }
+            trtllm_utils.llmStageWithRetry(pipeline, key, {
+                if (testFilter[REUSE_STAGE_LIST]?.contains(key)) {
+                    echo "Skip - Passed in the last pipeline."
+                    return
                 } else if (values instanceof List && dockerNode == null) {

If desired, also default REUSE_STAGE_LIST to [] where testFilter is declared to eliminate null checks.


1-1: Pin shared library to a stable ref (don’t use a personal branch).

This couples CI to a mutable user branch and risks unexpected breaks.

-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

Optionally pin to a reviewed tag or commit SHA instead of main.


1520-1527: Remove unconditional failure; it bypasses retry and guarantees stage failure.

error "Infrastructure error" will fail every run. Also, error throws FlowInterruptedException, not caught by catch (hudson.AbortException), so the rerun path won’t execute. This matches the repeated A10-PyTorch-1 failures in this PR thread.

Apply this minimal fix:

-                    error "Infrastructure error"
+                    // Do not force-fail here; rely on test outcomes and rerun logic below.

Optional safe guard for local testing only:

if (env.FORCE_INFRA_FAIL?.toBoolean()) {
    error "Forced infrastructure error (dev-only)"
}
#!/bin/bash
# Ensure no forced failures remain
rg -n 'error\s*\(\s*"(Infrastructure error|Forced infrastructure error).*"\s*\)' jenkins/L0_Test.groovy || true
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

2411-2430: Good use of llmStageWithRetry around image sanity checks.

Wrapper placement and pod launch flow look correct.

Consider centralizing [sleepTimeInSecs: 120] as a constant/param to keep behavior consistent across all stage wrappers.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2a746a6 and a762944.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (4 hunks)

Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
@yiqingy0 yiqingy0 marked this pull request as ready for review September 1, 2025 09:26
@yiqingy0 yiqingy0 requested review from a team as code owners September 1, 2025 09:26
@yiqingy0 yiqingy0 changed the title [TRTLLM-7458][infra] Rerun stage [TRTLLM-7458][infra] Rerun test stage Sep 1, 2025
@chzblych chzblych self-requested a review September 2, 2025 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants