-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[TRTLLM-7458][infra] Rerun test stage #7088
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughSwitches the TRTL Jenkins shared-library reference to a user branch, wraps per-stage execution with Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Jenkins as Jenkins Pipeline
participant Utils as trtllm_utils.llmStageWithRetry
participant Closure as Stage Closure (tests)
participant Shell as Shell / PyTest
participant Catcher as AbortException inspector
Jenkins->>Utils: llmStageWithRetry(pipeline, key, closure, [sleepTimeInSecs:120])
Utils->>Closure: invoke closure
Closure->>Shell: run test command(s)
Shell-->>Closure: success / throws Exception
alt success
Closure-->>Utils: return success
Utils-->>Jenkins: stage success
else failure
Closure-->>Utils: throw
Utils->>Catcher: catch hudson.AbortException? / inspect message
alt infra/machine detected
Catcher-->>Jenkins: propagate infra error (no rerun)
else normal failure
Utils->>Utils: apply retry logic (sleep/attempts)
alt retries exhausted
Utils-->>Jenkins: propagate final failure (includes e.message)
end
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
jenkins/L0_Test.groovy (4)
1-1
: Pinning shared lib to a user branch — revert before merge.Good for experimentation, but please switch
trtllm-jenkins-shared-lib
back to a stable ref (e.g.,main
) before merging to avoid pipeline drift or accidental dependency on a personal branch.Apply this diff when ready to merge:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
1414-1414
: Remove redundant mutation of testList.The reassignment isn’t used after
testDBList
is computed and can confuse future readers.Apply this diff:
- testList = "${testList}_${splitId}"
1492-1495
: Declare local variable for clarity and scope.Use a local binding to avoid leaking into the script binding.
Apply this diff:
- isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine) + def isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine)
1500-1519
: Avoid re-running perf/report steps on retry.As written, if the perf check/report fails, the retry wrapper will re-run the entire test suite. That’s expensive and unnecessary. Limit retries to the Pytest execution block and leave perf post-processing outside the retry closure.
Apply this diff to close the retry block before perf and add explicit locals:
} - if (perfMode) { - basePerfFilename = stageName.contains("PyTorch") ? "base_perf_pytorch.csv" : "base_perf.csv" - basePerfPath = "${llmSrc}/tests/integration/defs/perf/${basePerfFilename}" + }) + + if (perfMode) { + def basePerfFilename = stageName.contains("PyTorch") ? "base_perf_pytorch.csv" : "base_perf.csv" + def basePerfPath = "${llmSrc}/tests/integration/defs/perf/${basePerfFilename}" stage("Check perf result") { sh """ python3 ${llmSrc}/tests/integration/defs/perf/sanity_perf_check.py \ ${stageName}/perf_script_test_results.csv \ ${basePerfPath} """ } stage("Create perf report") { sh """ python3 ${llmSrc}/tests/integration/defs/perf/create_perf_comparison_report.py \ --output_path ${stageName}/report.pdf \ --files ${stageName}/perf_script_test_results.csv \ ${basePerfPath} """ } - }) + }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
(3 hunks)
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)
1400-1411
: Retry wrapper around Pytest stage — looks good.Wrapping the Pytest stage with
llmExecStepWithRetry
is a sensible containment of flakiness and should improve stability without altering the test commands.
1400-1400
: Confirm wrapper signature supports a closure.Ensure
trtllm_utils.llmExecStepWithRetry(pipeline, String, Closure)
is available in the pinned shared-lib branch. If not, this will fail at runtime with a signature mismatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
1-1
: Avoid pinning shared lib to a personal branch when targeting mainUsing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stage
in a top-level@Library
annotation will make all downstream jobs depend on a non-main, personal branch. This is risky to merge and can break unrelated pipelines if that branch changes or is deleted.Please revert to
@main
before merging this PR (or validate on a temporary job). If you need dynamic versioning in production, prefer thelibrary()
step with a parameterized version inside a guarded path rather than the annotation.Apply this diff before merging:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
1400-1512
: Stage-level retry wrapper is a good additionWrapping the Pytest stage with
trtllm_utils.llmStageWithRetry(...)
makes sense for infra flakiness. The closure body keeps the original logic, and the outercacheErrorAndUploadResult(...)
still handles results packaging/publishing, so the flow remains consistent.Two asks:
- Confirm
llmStageWithRetry
exists on the referenced shared-lib branch and that it only retries infra-flaky failures (not genuine test failures).- Ensure the wrapper does not perform its own
junit
/artifact publishing to avoid duplicates withcacheErrorAndUploadResult
.
/bot run --stage-list "A10-PyTorch-1" |
PR_Github #15899 [ run ] triggered by Bot |
/bot run --stage-list "A10-PyTorch-1" |
PR_Github #15901 [ run ] triggered by Bot |
PR_Github #15899 [ run ] completed with state |
PR_Github #15901 [ run ] completed with state |
/bot run --stage-list "A10-PyTorch-1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
1485-1493
: Unconditional throw guarantees stage failure; remove or guard behind a flag and preserve root causeRight after pytest completes you unconditionally
throw new Exception("Infrastructure error")
. The catch then replaces any failure with the same generic exception. This forces every run to fail and defeats the per-stage retry goal.Fix by:
- Removing the unconditional throw in the try block.
- Preserving
InterruptedException
.- Only mapping to “Infrastructure error” when explicitly requested (e.g., to exercise retry), otherwise rethrow the original exception.
Patch:
sh """ rm -rf ${stageName}/ && \ cd ${llmSrc}/tests/integration/defs && \ ${testCmdLine.join(" ")} """ - throw new Exception("Infrastructure error") - } catch (Exception e) { - throw new Exception("Infrastructure error") + } catch (InterruptedException e) { + // Preserve interruption semantics (user abort/timeout) + throw e + } catch (Exception e) { + // Optionally mark infra to exercise retry path + if (env.FORCE_INFRA_RETRY?.toBoolean()) { + throw new Exception("Infrastructure error", e) + } + // Keep the original failure for accurate diagnosis and reporting + throw e }This aligns with the retry-wrapper intent without turning every attempt into a hard failure.
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
2281-2301
: Stage-level retry wrapper adoption: confirm it creates a Jenkins stage and avoids duplicate JUnit on retriesWrapping each job with
trtllm_utils.llmStageWithRetry(pipeline, key, { ... })
is the right direction. Two verification points:
Ensure the wrapper itself opens a
stage(key)
so Blue Ocean and log grouping remain intact per attempt. If it does not, consider nesting astage(key)
inside the wrapper or adjusting the wrapper to do so.On retries,
cacheErrorAndUploadResult
currently always archives and runsjunit(...)
. Multiple attempts may re-parse/merge JUnit results and re-upload artifacts. Verify the wrapper either:
- cleans
${stageName}
before each attempt (so previous results aren’t re-ingested), and/or- exposes an “is final attempt” signal (env var like
TRTLLM_RETRY_FINAL=true
) so we can conditionally runjunit(...)
only once.If the wrapper already guarantees these properties, we’re good. Otherwise, I can propose a small change to gate
junit(...)
on a final-attempt signal.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
@@ -1,4 +1,4 @@ | |||
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _ | |||
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Do not merge with shared-lib pinned to a personal branch; pin to main or a commit SHA
Using trtllm-jenkins-shared-lib@user/yiqingy/retry_stage
couples mainline CI to a mutable personal branch and can break/rewrite CI behavior without a code change. Please either:
- revert to
@main
, or - pin to a reviewed tag/commit SHA for the shared lib.
Suggested change:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
If you need the retry wrapper during iteration, gate the dev lib behind a clearly named env/param (e.g., USE_DEV_SHARED_LIB=true
) and default it off.
🏁 Script executed:
#!/bin/bash
# List all @Library declarations in the repository to verify usage of personal branches
rg -n "@Library\\(" -C2
Length of output: 855
Pin shared library to a stable branch or SHA
The trtllm-jenkins-shared-lib@user/yiqingy/retry_stage
reference in your CI pipeline ties mainline tests to an unreviewed, mutable personal branch. To prevent unintended breakages or behavior changes:
• In jenkins/L0_Test.groovy
(line 1), replace:
- @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _
+ @Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
—or better—pin trtllm-jenkins-shared-lib
to a reviewed tag or commit SHA.
• If you still need the retry_stage
logic for development, wrap it behind an explicit flag (e.g. USE_DEV_SHARED_LIB=true
) that defaults to off, so CI remains stable by default.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ | |
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _ |
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the shared library reference uses a
personal mutable branch (`trtllm-jenkins-shared-lib@user/yiqingy/retry_stage`),
which can introduce unreviewed changes into CI; replace that reference with a
stable reviewed tag or a specific commit SHA (for example
`[email protected]` or `@<commit-sha>`) to pin behavior, or
alternatively implement a simple boolean flag (e.g. USE_DEV_SHARED_LIB=false by
default) and conditionally include the dev branch only when that flag is
explicitly set to true so CI uses the stable pinned library by default.
PR_Github #16004 [ run ] triggered by Bot |
PR_Github #16004 [ run ] completed with state |
/bot run --stage-list "A10-PyTorch-1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)
802-803
: Implement a redacted environment dump helper to replace allenv | sort
callsDirectly dumping the full environment can leak secrets that Jenkins’s own masking won’t catch (especially third-party or custom variables). We located 10 occurrences of
sh "env | sort"
injenkins/L0_Test.groovy
and found no existingsafeEnvDump
utility to centralize redaction:
- Lines 555, 802, 834, 1284, 1483, 1579, 2128–2129, 2137–2138, 2374, 2383
Please refactor as follows:
Add a new helper in
trtllm_utils
, e.g.:// File: src/trtllm_utils.groovy (or appropriate shared library) class trtllm_utils { static void safeEnvDump(steps, File outputFile = null) { def cmd = ''' set -e env | sort \ | sed -E 's/(?i)(PASS|PASSWORD|TOKEN|SECRET|KEY|AUTH|COOKIE)=.*/\\1=***REDACTED***/' ''' if (outputFile) { steps.sh "${cmd} > ${outputFile}" } else { steps.sh cmd } } // ... }Replace each
sh "env | sort"
(and variants that redirect to files) with:- sh "env | sort${redirect}" + trtllm_utils.safeEnvDump(this${redirect ? ", file('${redirectFile}')" : ""})For example, at lines 802–803:
- sh "env | sort" + trtllm_utils.safeEnvDump(this)Review credential usage to ensure Jenkins masking is correctly configured—but remember that any non-standard environment variables (e.g., those injected by plugins or external tools) will still be redacted by this helper.
This change both unifies environment dumping and ensures consistent redaction across all 10 sites in
jenkins/L0_Test.groovy
.
♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)
1-1
: Do not pin shared library to a mutable personal branchUsing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stage
makes mainline CI depend on an unreviewed, mutable branch. Pin to a stable branch (e.g.,@main
) or a reviewed tag/commit SHA before merge.Apply one of the following:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _—or pin to a release tag/commit:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@<reviewed-tag-or-sha>']) _Confirm there is no other file referencing
@user/yiqingy/retry_stage
. If needed, I can provide a repo scan script.
1490-1490
: Unconditional failure forces every run to fail
throw new Exception("Infrastructure error")
guarantees this stage fails, defeats rerun logic, and matches the observed downstream failures. Remove it or guard it behind an explicit, disabled-by-default flag.Minimal removal:
- throw new Exception("Infrastructure error")
Safer: feature-flag for local testing only (defaults off):
- throw new Exception("Infrastructure error") + if (env.FORCE_INFRA_ERROR?.toBoolean()) { + error "Forced infrastructure error for retry-path validation (FORCE_INFRA_ERROR=true)" + }
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
1496-1496
: Sanitize exception text in user-visible failure messageIncluding raw
${e.message}
can leak sensitive data (tokens/URLs) and can be overly verbose. Emit a bounded, sanitized summary and rely on the already-uploaded rerun report for details.- error "The tests still failed after rerun attempt. ${e.message}" + def msg = (e?.message ?: e?.toString() ?: "Unknown error") + // Bound length to avoid log spam; basic hygiene against accidental secrets in messages. + msg = msg.take(512) + error "The tests still failed after rerun attempt. ${msg}"Please confirm no secrets are ever embedded in exception messages from the test harness. If unsure, prefer not logging the raw message here and point to the rerun HTML report instead.
2286-2305
: Validate llmStageWithRetry behavior: stage UI, retries, and result publication semanticsGreat to add a retry wrapper. Two things to verify:
- That
llmStageWithRetry(pipeline, key, { ... })
still produces a proper Jenkins Stage in the UI and propagatescurrentBuild.result
/failFast correctly underparallel
.- That retries don’t double-publish JUnit results or artifacts. Given
cacheErrorAndUploadResult
runs inside the retried closure, the first attempt may upload results before a retry. Consider teaching the wrapper to expose “attempt index/last attempt” so we can skip JUnit/artifact upload on non-final attempts.If the helper supports it, pass an “on-final-attempt” hook; otherwise, we can add a lightweight guard around upload steps.
Could you share the
llmStageWithRetry
signature/semantics or point to its implementation? I can tailor a small refactor to publish results only on the last attempt.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #16108 [ run ] triggered by Bot |
PR_Github #16108 [ run ] completed with state |
/bot run --stage-list "A10-PyTorch-1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)
1485-1499
: Remove the unconditional infrastructure exception; it forces every stage to fail
throw new Exception("Infrastructure error")
guarantees failure and defeats the retry-on-infra-flakiness goal, explaining the repeated A10-PyTorch failures after triggering reruns.Safe options:
- Remove it entirely
- """ - rm -rf ${stageName}/ && \ - cd ${llmSrc}/tests/integration/defs && \ - ${testCmdLine.join(" ")} - """ - throw new Exception("Infrastructure error") + """ + rm -rf ${stageName}/ && \ + cd ${llmSrc}/tests/integration/defs && \ + ${testCmdLine.join(" ")} + """
- Or guard behind an explicit, default-off flag to exercise the retry path only when intended:
""" rm -rf ${stageName}/ && \ cd ${llmSrc}/tests/integration/defs && \ ${testCmdLine.join(" ")} """ - throw new Exception("Infrastructure error") + if (env.FORCE_INFRA_FAILURE?.toBoolean()) { + error "Forced infrastructure failure to exercise retry path" + }
1-1
: Do not pin CI to a personal/mutable shared-lib branchReferencing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stage
makes mainline CI behavior mutable and unreviewed. Pin to a stable branch or commit SHA, and use a guarded flag if you must test a dev lib.Apply this diff (or pin to a reviewed tag/SHA):
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _Optional gating approach (outside this line): add a boolean param/env like
USE_DEV_SHARED_LIB=false
and include the dev lib only when explicitly enabled.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
1494-1498
: Good: include original exception text in rerun failure messageAppending
e.message
makes the rerun failure actionable in build logs. Keep this.
jenkins/L0_Test.groovy
Outdated
trtllm_utils.llmStageWithRetry(pipeline, key, { | ||
if (key in testFilter[REUSE_STAGE_LIST]) { | ||
stage("Skip - reused") { | ||
echo "Skip - Passed in the last pipeline." | ||
} | ||
} else if (values instanceof List && dockerNode == null) { | ||
trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { | ||
values[1]() | ||
}) | ||
} else if (values instanceof List && dockerNode != null) { | ||
node(dockerNode) { | ||
deleteDir() | ||
docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { | ||
values[1]() | ||
} | ||
} | ||
} else { | ||
values() | ||
} | ||
} else { | ||
values() | ||
} | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Fix NPE risk in REUSE_STAGE_LIST check and streamline wrapper logic
key in testFilter[REUSE_STAGE_LIST]
will throw ifREUSE_STAGE_LIST
is null (the default in this file). Use a null-safe contains check.- Optional: avoid nested
stage("Skip - reused")
inside an outerstage(key)
to keep Stage View tidy; a simpleecho
suffices.
Apply this minimal fix:
- if (key in testFilter[REUSE_STAGE_LIST]) {
+ if (testFilter[REUSE_STAGE_LIST]?.contains(key)) {
stage("Skip - reused") {
echo "Skip - Passed in the last pipeline."
}
Optional clean-up (outside this hunk): default REUSE_STAGE_LIST
to []
instead of null
where testFilter
is declared to eliminate null checks.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
trtllm_utils.llmStageWithRetry(pipeline, key, { | |
if (key in testFilter[REUSE_STAGE_LIST]) { | |
stage("Skip - reused") { | |
echo "Skip - Passed in the last pipeline." | |
} | |
} else if (values instanceof List && dockerNode == null) { | |
trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { | |
values[1]() | |
}) | |
} else if (values instanceof List && dockerNode != null) { | |
node(dockerNode) { | |
deleteDir() | |
docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { | |
values[1]() | |
} | |
} | |
} else { | |
values() | |
} | |
} else { | |
values() | |
} | |
}) | |
trtllm_utils.llmStageWithRetry(pipeline, key, { | |
if (testFilter[REUSE_STAGE_LIST]?.contains(key)) { | |
stage("Skip - reused") { | |
echo "Skip - Passed in the last pipeline." | |
} | |
} else if (values instanceof List && dockerNode == null) { | |
trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { | |
values[1]() | |
}) | |
} else if (values instanceof List && dockerNode != null) { | |
node(dockerNode) { | |
deleteDir() | |
docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { | |
values[1]() | |
} | |
} | |
} else { | |
values() | |
} | |
}) |
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 2287 to 2306, the check `key in
testFilter[REUSE_STAGE_LIST]` can NPE when REUSE_STAGE_LIST is null and the
inner `stage("Skip - reused")` creates nesting; change the condition to a
null-safe contains check (e.g. use testFilter[REUSE_STAGE_LIST] != null &&
testFilter[REUSE_STAGE_LIST].contains(key) or the Groovy safe navigation
variant) and replace the nested stage block with a simple echo/skip message (no
inner stage) so the outer stage remains tidy; optionally set REUSE_STAGE_LIST
default to [] where testFilter is declared to avoid future null checks.
PR_Github #16132 [ run ] triggered by Bot |
/bot run --stage-list "A10-PyTorch-1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/integration/test_lists/test-db/l0_a10.yml (3)
17-20
: Pre-merge PyTorch coverage heavily reduced — prefer isolating to a temporary stage instead of editing the default matrix.Commenting out most PyTorch entries in the pre_merge block will materially reduce signal and can mask regressions if this lands as-is. If the goal is to validate the new rerun flow quickly, create a dedicated lightweight “retry” stage (or use a stage overlay) and keep the default pre_merge matrix intact.
This also aligns with prior learnings that CLI flow tests and PyTorch API tests should coexist to cover both paths.
Example approach (add a targeted block instead of editing pre_merge):
# New block to keep experiments isolated from default pre_merge - condition: ranges: system_gpu_count: gte: 1 lte: 1 wildcards: gpu: - '*a10*' linux_distribution_name: ubuntu* terms: stage: retry_stage # trigger with /bot run --stage-list "A10-PyTorch-1" + mapping backend: pytorch tests: - test_e2e.py::test_openai_misc_example[pytorch] - test_e2e.py::test_openai_completions_example[pytorch] - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]If introducing a new stage isn’t feasible, please at least add a loud TODO on this block to ensure it’s reverted before merge. I can open a tracking issue if helpful.
21-21
: Avoid relying on cuda_graph as the sole PyTorch smoke — it can be flaky and environment-sensitive.Keeping only
test_disaggregated_cuda_graph
as the lone active pre_merge PyTorch test is brittle for A10 CI; CUDA graph tests are more sensitive to driver/runtime details and may fail for infra reasons unrelated to the rerun logic.Proposed minimal smoke swap: comment out the cuda_graph test and enable a stable E2E miscellany test instead.
- - disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0] + # Prefer a simpler, faster PyTorch smoke to validate rerun infra + - test_e2e.py::test_openai_misc_example[pytorch]
22-36
: Add representative PyTorch smoke tests back into the pre-merge matrixVerification of
l0_a10.yml
shows only one activepre_merge
/pytorch
test remains:
- disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0]
To preserve baseline signal without significantly increasing runtime, please re-enable:
One disaggregated variant:
• disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0]Two fast E2E PyTorch tests:
• test_e2e.py::test_openai_misc_example[pytorch]
• test_e2e.py::test_openai_completions_example[pytorch]Optionally, you may also enable test_openai_reasoning[pytorch] if time allows.
Please apply a diff like:
@@ tests/integration/test_lists/test-db/l0_a10.yml - # - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0] + - disaggregated/test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0] @@ - # - test_e2e.py::test_openai_misc_example[pytorch] + - test_e2e.py::test_openai_misc_example[pytorch] - # - test_e2e.py::test_openai_completions_example[pytorch] + - test_e2e.py::test_openai_completions_example[pytorch]This will bring the active count up to at least three, exercising both the rerun wrapper and core PyTorch paths without a large runtime impact.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tests/integration/test_lists/test-db/l0_a10.yml
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_a10.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #16148 [ run ] triggered by Bot |
PR_Github #16132 [ run ] completed with state |
PR_Github #16148 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
jenkins/L0_Test.groovy (3)
1515-1529
: Unconditional failure makes every test stage fail; rerun path is unreachable
throw new Exception("Infrastructure error")
at Line 1520 forces a failure even when tests pass. The second throw at Line 1524 aborts beforererunFailedTests(...)
, so reruns and the improved failure message never execute. This likely explains repeated A10-PyTorch-1 stage failures observed in this PR thread.Apply this diff to let reruns happen and keep the enhanced error message:
sh """ rm -rf ${stageName}/ && \ cd ${llmSrc}/tests/integration/defs && \ ${testCmdLine.join(" ")} """ - throw new Exception("Infrastructure error") } catch (InterruptedException e) { throw e } catch (Exception e) { - throw new Exception("Infrastructure error") - isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine) - if (isRerunFailed) { - error "The tests still failed after rerun attempt. ${e.message}" - } + echo "Test execution failed, attempting rerun. Reason: ${e.message}" + def isRerunFailed = rerunFailedTests(stageName, llmSrc, testCmdLine) + if (isRerunFailed) { + error "The tests still failed after rerun attempt. ${e.message}" + } else { + echo "Rerun succeeded; proceeding." + } }If you need to exercise the retry path locally, guard the artificial failure behind an env flag (e.g., only throw when
env.FORCE_INFRA_FAIL == 'true'
) and default it off.
1-1
: Pin shared library to a stable ref (not a personal branch)Referencing
trtllm-jenkins-shared-lib@user/yiqingy/retry_stage
couples CI in main to a mutable personal branch. Pin to@main
, a reviewed tag, or a commit SHA. If you must iterate on a dev branch, gate it behind an explicit flag and default to the stable ref.Apply this diff:
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _Optional pattern (if you really need a dev lib temporarily, but keep default stable):
- Keep the annotation pinned to main.
- Load the dev lib only when explicitly requested via the imperative
library 'trtllm-jenkins-shared-lib@<sha-or-dev-branch>'
inside a guarded block (note: be mindful of CPS compilation order and symbol resolution).
2320-2343
: Retry wrapper: avoid nested stage and fix null-safe reuse check
if (key in testFilter[REUSE_STAGE_LIST])
will NPE whenREUSE_STAGE_LIST
is null (default is null in this file).- Nesting
stage("Skip - reused")
insidestage(key)
clutters Stage View.Apply this minimal, behavior-preserving diff:
- stage(key) { - trtllm_utils.llmStageWithRetry(pipeline, key, { - if (key in testFilter[REUSE_STAGE_LIST]) { - stage("Skip - reused") { - echo "Skip - Passed in the last pipeline." - } - } else if (values instanceof List && dockerNode == null) { + stage(key) { + trtllm_utils.llmStageWithRetry(pipeline, key, { + if (testFilter[REUSE_STAGE_LIST]?.contains(key)) { + echo "Skip - Passed in the last pipeline." + return + } else if (values instanceof List && dockerNode == null) { trtllm_utils.launchKubernetesPod(pipeline, values[0], "trt-llm", { values[1]() }) } else if (values instanceof List && dockerNode != null) { node(dockerNode) { deleteDir() docker.image(LLM_DOCKER_IMAGE).inside(dockerArgs) { values[1]() } } } else { values() } }) }Optional hardening outside this hunk: default
REUSE_STAGE_LIST
to[]
to remove the need for null checks:- (REUSE_STAGE_LIST): null, + (REUSE_STAGE_LIST): [],
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
2320-2323
: Confirm llmStageWithRetry does not create its own stage to avoid double-wrapping
stage(key) { llmStageWithRetry(pipeline, key, { ... }) }
may produce nested stages if the retry helper also callsstage(key)
internally. If it does, prefer a single stage wrapper to keep UI clean and timing accurate.Two options:
- If
llmStageWithRetry
already manages the stage, drop the outerstage(key)
and let the helper own rendering.- If it doesn’t, keep the outer
stage(key)
and ensure the helper focuses solely on retries without creating nested Jenkins stages.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
jenkins/L0_Test.groovy
(3 hunks)tests/integration/test_lists/test-db/l0_a10.yml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_a10.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
/bot run --stage-list "A10-PyTorch-1" |
PR_Github #16346 [ run ] triggered by Bot |
PR_Github #16346 [ run ] completed with state |
/bot run --stage-list "A10-PyTorch-1" |
PR_Github #16419 [ run ] triggered by Bot |
PR_Github #16419 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
jenkins/L0_Test.groovy (3)
2325-2345
: Null-safe reuse check and avoid nested stage inside stage.
key in testFilter[REUSE_STAGE_LIST]
NPEs when the list is null (default). Also the innerstage("Skip - reused")
clutters Stage View. Echo and return instead.- trtllm_utils.llmStageWithRetry(pipeline, key, { - if (key in testFilter[REUSE_STAGE_LIST]) { - stage("Skip - reused") { - echo "Skip - Passed in the last pipeline." - } + trtllm_utils.llmStageWithRetry(pipeline, key, { + if (testFilter[REUSE_STAGE_LIST]?.contains(key)) { + echo "Skip - Passed in the last pipeline." + return } else if (values instanceof List && dockerNode == null) {If desired, also default REUSE_STAGE_LIST to [] where testFilter is declared to eliminate null checks.
1-1
: Pin shared library to a stable ref (don’t use a personal branch).This couples CI to a mutable user branch and risks unexpected breaks.
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@user/yiqingy/retry_stage']) _ +@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _Optionally pin to a reviewed tag or commit SHA instead of main.
1520-1527
: Remove unconditional failure; it bypasses retry and guarantees stage failure.
error "Infrastructure error"
will fail every run. Also,error
throws FlowInterruptedException, not caught bycatch (hudson.AbortException)
, so the rerun path won’t execute. This matches the repeated A10-PyTorch-1 failures in this PR thread.Apply this minimal fix:
- error "Infrastructure error" + // Do not force-fail here; rely on test outcomes and rerun logic below.Optional safe guard for local testing only:
if (env.FORCE_INFRA_FAIL?.toBoolean()) { error "Forced infrastructure error (dev-only)" }#!/bin/bash # Ensure no forced failures remain rg -n 'error\s*\(\s*"(Infrastructure error|Forced infrastructure error).*"\s*\)' jenkins/L0_Test.groovy || true
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
2411-2430
: Good use of llmStageWithRetry around image sanity checks.Wrapper placement and pod launch flow look correct.
Consider centralizing
[sleepTimeInSecs: 120]
as a constant/param to keep behavior consistent across all stage wrappers.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
(4 hunks)
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
a762944
to
795fd0e
Compare
Summary by CodeRabbit
Chores
Tests
Improvements
Bug Fixes
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.