[None][opt] Add batch waiting when scheduling #7416
base: main
Conversation
/bot run --disable-fail-fast
PR_Github #17152 [ run ] triggered by Bot
📝 Walkthrough

Adds batch-waiting controls to the PyTorch backend: new config fields and wiring, a max token budget, and a scheduling wait mechanism in PyExecutor. Extends CLI/API args with validation and propagates them to the backend config. Updates API stability references and adds parametrized integration tests (including duplicate test definitions and test-list entries).

Changes
Sequence Diagram(s)

    sequenceDiagram
        autonumber
        participant Client
        participant LLM API
        participant PyExecutor
        participant ReqQueue as RequestQueue
        participant Scheduler
        Client->>LLM API: submit requests
        LLM API->>ReqQueue: enqueue (context/gen)
        LLM API->>PyExecutor: run_step()
        PyExecutor->>Scheduler: _schedule()
        alt attention_dp disabled AND batch-waiting enabled
            Scheduler->>ReqQueue: peek context & gen batches
            Scheduler->>Scheduler: _waiting_requests(tokens, iters, max_num_tokens, ratio)
            alt Wait condition met
                note right of Scheduler: Delay scheduling context<br/>Increment wait iters
                Scheduler-->>PyExecutor: return (possibly empty) context set
            else Proceed
                note right of Scheduler: Reset wait iters
                Scheduler-->>PyExecutor: return selected context set
            end
        else Normal scheduling
            Scheduler-->>PyExecutor: schedule as usual
        end
        PyExecutor->>LLM API: step result
        LLM API-->>Client: progress/output
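Conceptually, the gate reduces to two release conditions on deferred context requests. Below is a minimal sketch, assuming the "0 disables this knob" semantics discussed later in the review; the function and parameter names are illustrative, not the PR's actual API:

    # Illustrative only: summarizes the batch-wait release decision described above.
    def should_stop_waiting(waited_iters: int,
                            scheduled_tokens: int,
                            timeout_iters: int,
                            max_tokens_ratio: float,
                            max_num_tokens: int) -> bool:
        # A value of 0 disables the corresponding threshold.
        iter_budget_hit = timeout_iters > 0 and waited_iters >= timeout_iters
        token_budget_hit = (max_tokens_ratio > 0.0
                            and scheduled_tokens >= max_tokens_ratio * max_num_tokens)
        return iter_budget_hit or token_budget_hit

When neither threshold is hit, the scheduler defers the pending context requests for another iteration and increments its wait counter.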
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
2570-2637: Pass max_num_tokens into PyTorchConfig

In get_pytorch_backend_config() (tensorrt_llm/llmapi/llm_args.py), add:

    return PyTorchConfig(
        …
    +   max_num_tokens=(
    +       self.max_num_tokens if self.max_num_tokens is not None
    +       else PyTorchConfig.model_fields['max_num_tokens'].default
    +   ),
        torch_compile_enabled=bool(self.torch_compile_config is not None),
        …
    )

so that batch_wait_max_tokens_ratio uses the user-specified limit instead of the hardcoded default.
🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/config.py (1)
1-1: Add NVIDIA copyright header. Required by project guidelines.

Apply:

    +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
1-1: Add NVIDIA copyright header. Required by project guidelines.

Apply:

    +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/unittest/api_stability/references/llm.yaml (1)
134-141: API surface update looks consistent; please add brief comments on ranges/disable semantics. Suggest inline comments to avoid misuse and align with executor logic:

- timeout iters: 0 disables
- ratio: 0 disables; range [0.0, 1.0]

Example:

    batch_wait_timeout_ms:
      annotation: float
      default: 0
      status: prototype
    batch_wait_timeout_iters:
      annotation: int
      default: 0
      status: prototype
    + # 0 disables iteration-based waiting
    batch_wait_max_tokens_ratio:
      annotation: float
      default: 0
      status: prototype
    + # 0 disables; range [0.0, 1.0]; threshold applies to max_num_tokens

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1-1: Add NVIDIA copyright header. Required by project guidelines.

Apply:

    +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
1626-1633: Assert new knobs are plumbed end-to-end. Add explicit assertions to ensure LLM args reflect the test's batch-wait settings; this will catch wiring regressions early.

Apply this diff right after asserting NVFP4:

        assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4
    +   assert llm.args.batch_wait_timeout_iters == batch_wait_timeout_iters
    +   assert llm.args.batch_wait_max_tokens_ratio == batch_wait_max_tokens_ratio

tensorrt_llm/llmapi/llm_args.py (2)
2258-2269: Polish field docs (clarify wording and fix minor grammar). Tighten the descriptions; “new coming” → “incoming”; add missing space and clarify the “0 disables” phrasing.

    - "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.",
    + "Maximum number of scheduler iterations to wait to accumulate incoming requests for better GPU utilization. If > 0, the scheduler delays batch processing up to this iteration budget. If 0, timeout-iters-based delays are disabled.",
    @@
    - "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
    + "Token-accumulation threshold ratio. If > 0, the scheduler accumulates requests until total tokens reach (batch_wait_max_tokens_ratio * max_num_tokens). This improves GPU utilization by ensuring adequate batch sizes. If 0, token-ratio-based delays are disabled.",
2543-2558: Optional: cross-field guardrail to avoid misconfiguration. Consider warning when batch_wait_max_tokens_ratio > 0 while both timeout knobs are 0, as this can increase stall risk under sparse traffic. Not a blocker; scheduler-side defaults may already mitigate this.
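A minimal sketch of such a guardrail, assuming a pydantic v2 model as used in llm_args.py; the class and validator names here are illustrative stand-ins, not the PR's code:

    import warnings

    from pydantic import BaseModel, model_validator


    class BatchWaitKnobs(BaseModel):
        """Toy stand-in for the batch-wait fields added to the LLM args."""
        batch_wait_timeout_ms: float = 0.0
        batch_wait_timeout_iters: int = 0
        batch_wait_max_tokens_ratio: float = 0.0

        @model_validator(mode="after")
        def _warn_on_ratio_only_gating(self):
            # Ratio-only gating can stall under sparse traffic: nothing forces the
            # accumulated batch out if the token threshold is never reached.
            if (self.batch_wait_max_tokens_ratio > 0
                    and self.batch_wait_timeout_ms == 0
                    and self.batch_wait_timeout_iters == 0):
                warnings.warn(
                    "batch_wait_max_tokens_ratio is set but both batch-wait timeouts are 0; "
                    "consider a nonzero timeout to limit stalls under light load.")
            return self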
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
- tests/unittest/api_stability/references/llm.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
- Use spaces only; no tabs; indent with 4 spaces
- Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
- Python code must target Python 3.8+
- Indent Python with 4 spaces; no tabs
- Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
- Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
- Avoid shadowing outer-scope variables; initialize all externally visible members in __init__
- Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
- Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
- Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
- In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks
Files:
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (2)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
- default (50-58)

tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)

tests/integration/defs/conftest.py (2)
- parametrize_with_ids (1786-1811)
- llm_models_root (77-83)

tensorrt_llm/llmapi/llm_args.py (5)
- KvCacheConfig (961-1092)
- TorchCompileConfig (2123-2170)
- CudaGraphConfig (108-165)
- MoeConfig (168-196)
- MTPDecodingConfig (536-571)

tensorrt_llm/llmapi/llm.py (1)
- LLM (1013-1029)

tensorrt_llm/quantization/mode.py (1)
- QuantAlgo (23-46)

tests/integration/defs/accuracy/accuracy_core.py (3)
- GSM8K (293-308)
- evaluate (147-206)
- evaluate (707-717)

tensorrt_llm/evaluate/interface.py (1)
- evaluate (81-110)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/test_lists/qa/llm_function_full.txt (1)
516-516: No duplicate test entry found

Search returned only one occurrence of test_nvfp4_batch_waiting in the test lists—nothing to remove. Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
187-187: Ensure max_num_tokens reflects the engine's SequenceInfo. PyExecutor reads from model_engine.pytorch_backend_config.max_num_tokens. With the current ADEngine hardcoded default, this may diverge from seq_info. Pair with the ADEngine fix to source seq_info.max_num_tokens.

196-199: The enable_batch_waiting guard is fine, but see the waiting logic bug below. No change needed here once the waiting logic is fixed.

1374-1381: Condition to invoke waiting looks correct (prevents dead-wait with only one type present). No changes.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)
1592-1601: Prevent potential hang for ratio-only gating

The (batch_wait_timeout_iters=0, batch_wait_max_tokens_ratio=0.75) combination can starve under light load. Optional: bump the timeout to 1 for a safety net:

    - @parametrize_with_ids(
    -     "batch_wait_timeout_iters,batch_wait_max_tokens_ratio",
    -     [(0, 0), (10, 0.75), (10, 0), (0, 0.75)])
    + @parametrize_with_ids(
    +     "batch_wait_timeout_iters,batch_wait_max_tokens_ratio",
    +     [(0, 0), (10, 0.75), (10, 0), (1, 0.75)])

Please manually run the focused test matrix in your local/CI environment to confirm there are no hangs for the 0-0.75 vs. 1-0.75 cases.
1592-1634: Ignore duplicate registration concern. The test test_nvfp4_batch_waiting is listed only once in tests/integration/test_lists/qa/llm_function_full.txt (line 516); there’s no duplicate entry to remove. Likely an incorrect or invalid review comment.
tensorrt_llm/llmapi/llm_args.py (1)
2634-2637: Wiring of batch-wait knobs into PyTorchConfig — LGTM. The new fields are correctly forwarded to the backend config.
    self.pytorch_backend_config.batch_wait_timeout_iters = 0
    self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
    self.pytorch_backend_config.max_num_tokens = 8192
    self.iter_counter = 0
Fix max_num_tokens mismatch: use seq_info.max_num_tokens instead of a hardcoded 8192.
Currently PyExecutor reads model_engine.pytorch_backend_config.max_num_tokens, but ADEngine sets it to 8192, ignoring AutoDeploy/SequenceInfo. This desynchronizes token budgeting vs. sequence manager.
Apply:
- self.pytorch_backend_config.max_num_tokens = 8192
+ self.pytorch_backend_config.max_num_tokens = seq_info.max_num_tokens
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    self.pytorch_backend_config.batch_wait_timeout_iters = 0
    self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
    self.pytorch_backend_config.max_num_tokens = seq_info.max_num_tokens
    self.iter_counter = 0
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py around lines 140-143 the
code sets self.pytorch_backend_config.max_num_tokens = 8192; replace this
hardcoded value with the model's sequence info value (use
seq_info.max_num_tokens, e.g. self.pytorch_backend_config.max_num_tokens =
self.seq_info.max_num_tokens or self.sequence_info.max_num_tokens depending on
the attribute name used in this class) and add a safe fallback if seq_info is
missing (keep the existing value or raise a clear error). This ensures the
executor respects AutoDeploy/SequenceInfo token limits and stays in sync with
the sequence manager.
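A sketch of the suggested fallback, kept outside the real class for clarity; the helper name is hypothetical, and the seq_info/config attribute access is assumed from the prompt above rather than verified against ADEngine:

    import warnings


    def resolve_max_num_tokens(pytorch_backend_config, seq_info) -> int:
        """Pick the token budget from SequenceInfo when available (illustrative sketch)."""
        seq_max_tokens = getattr(seq_info, "max_num_tokens", None)
        if seq_max_tokens is not None:
            return seq_max_tokens
        # Safe fallback: keep whatever default the backend config already carries.
        warnings.warn("SequenceInfo has no max_num_tokens; keeping the configured default.")
        return pytorch_backend_config.max_num_tokens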
    max_num_tokens: int = 8192

    batch_wait_timeout_ms: float = 0
    batch_wait_timeout_iters: int = 0
    batch_wait_max_tokens_ratio: float = 0
🛠️ Refactor suggestion
Clarify semantics and add guardrails for new batch-waiting knobs (also prefer 0.0 for floats).
- Make ratio-based threshold opt-in only when > 0.0; document ranges.
- Minor: initialize float default as 0.0 for clarity.
Apply:
- max_num_tokens: int = 8192
+ # Max tokens budget per batch; used by batch-wait gating.
+ max_num_tokens: int = 8192
@@
- batch_wait_timeout_iters: int = 0
- batch_wait_max_tokens_ratio: float = 0
+ # Iterations to wait before scheduling context even if token budget not reached (0 disables).
+ batch_wait_timeout_iters: int = 0
+ # Fraction [0, 1] of max_num_tokens at which to stop waiting (0 disables).
+ batch_wait_max_tokens_ratio: float = 0.0
Optionally add a lightweight validator to prevent misconfiguration:
    # Place inside PyTorchConfig
    def __post_init__(self):
        if self.max_num_tokens <= 0:
            raise ValueError("max_num_tokens must be > 0")
        if not (0.0 <= self.batch_wait_max_tokens_ratio <= 1.0):
            raise ValueError("batch_wait_max_tokens_ratio must be in [0.0, 1.0]")
        if self.batch_wait_timeout_iters < 0:
            raise ValueError("batch_wait_timeout_iters must be >= 0")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/config.py around lines 53 to 58, the new
batch-waiting knobs need clarified semantics and guardrails: change float
defaults to 0.0 for clarity, document that batch_wait_max_tokens_ratio is opt-in
only when > 0.0 and must be in the range [0.0, 1.0], and ensure
batch_wait_timeout_iters is non-negative and max_num_tokens is > 0; implement
these checks by adding a __post_init__ validator on PyTorchConfig that raises
ValueError for invalid values (max_num_tokens <= 0, batch_wait_max_tokens_ratio
not in [0.0,1.0], or batch_wait_timeout_iters < 0) and update inline
comments/docs to state the ratio behavior and allowed ranges.
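For illustration, a standalone toy (assuming PyTorchConfig is a dataclass, as the suggested __post_init__ implies) showing how that validation would behave; the class below is a stand-in, not the real config:

    from dataclasses import dataclass


    @dataclass
    class _ToyBatchWaitConfig:
        max_num_tokens: int = 8192
        batch_wait_timeout_iters: int = 0
        batch_wait_max_tokens_ratio: float = 0.0

        def __post_init__(self):
            if self.max_num_tokens <= 0:
                raise ValueError("max_num_tokens must be > 0")
            if not (0.0 <= self.batch_wait_max_tokens_ratio <= 1.0):
                raise ValueError("batch_wait_max_tokens_ratio must be in [0.0, 1.0]")
            if self.batch_wait_timeout_iters < 0:
                raise ValueError("batch_wait_timeout_iters must be >= 0")


    _ToyBatchWaitConfig(batch_wait_max_tokens_ratio=0.75)      # accepted
    try:
        _ToyBatchWaitConfig(batch_wait_max_tokens_ratio=1.5)   # rejected: outside [0.0, 1.0]
    except ValueError as err:
        print(err)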
    def _waiting_requests(self, context_requests: list[LlmRequest],
                          generation_requests: list[LlmRequest]):
        if not self.enable_batch_waiting:
            return context_requests

        waited_context_requests = []
        stop_waiting = False
        num_scheduled_ctx_tokens = sum(
            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
        num_scheduled_gen_tokens = sum(
            len(gen_req.get_tokens(0)) for gen_req in generation_requests)
        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens

        stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
        if stop_waiting:
            waited_context_requests = context_requests
            self.batch_wait_iters_count = 0
        else:
            self.batch_wait_iters_count += 1
        return waited_context_requests
Waiting logic: ratio=0 causes immediate stop; generation token counting likely overestimates.
- Bug: when batch_wait_timeout_iters>0 and ratio=0, current code evaluates num_scheduled_tokens >= 0 as True, so it never waits by iterations.
- Token accounting: use 1 token per generation request (per-iter), not len(get_tokens(0)), which can include the full sequence length and prematurely stop waiting.
Apply:
    def _waiting_requests(self, context_requests: list[LlmRequest],
                          generation_requests: list[LlmRequest]):
        if not self.enable_batch_waiting:
            return context_requests
        waited_context_requests = []
    -   stop_waiting = False
    -   num_scheduled_ctx_tokens = sum(
    -       len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    -   num_scheduled_gen_tokens = sum(
    -       len(gen_req.get_tokens(0)) for gen_req in generation_requests)
    -   num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
    -
    -   stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
    +   # Context tokens: sum of incoming ctx chunk lengths.
    +   num_scheduled_ctx_tokens = sum(len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    +   # Generation step typically consumes 1 token per req per iter; avoid counting full prompt length.
    +   num_scheduled_gen_tokens = len(generation_requests)
    +   num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
    +
    +   iter_reached = (self.batch_wait_timeout_iters > 0 and
    +                   self.batch_wait_iters_count >= self.batch_wait_timeout_iters)
    +   ratio_reached = (self.batch_wait_max_tokens_ratio > 0.0 and
    +                    num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens)
    +   stop_waiting = iter_reached or ratio_reached
        if stop_waiting:
            waited_context_requests = context_requests
            self.batch_wait_iters_count = 0
        else:
            self.batch_wait_iters_count += 1
        return waited_context_requests
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def _waiting_requests(self, context_requests: list[LlmRequest],
                          generation_requests: list[LlmRequest]):
        if not self.enable_batch_waiting:
            return context_requests
        waited_context_requests = []
        # Context tokens: sum of incoming ctx chunk lengths.
        num_scheduled_ctx_tokens = sum(
            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
        # Generation step typically consumes 1 token per req per iter; avoid counting full prompt length.
        num_scheduled_gen_tokens = len(generation_requests)
        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
        iter_reached = (
            self.batch_wait_timeout_iters > 0 and
            self.batch_wait_iters_count >= self.batch_wait_timeout_iters
        )
        ratio_reached = (
            self.batch_wait_max_tokens_ratio > 0.0 and
            num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
        )
        stop_waiting = iter_reached or ratio_reached
        if stop_waiting:
            waited_context_requests = context_requests
            self.batch_wait_iters_count = 0
        else:
            self.batch_wait_iters_count += 1
        return waited_context_requests
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1343-1363, the
waiting logic incorrectly treats batch_wait_max_tokens_ratio == 0 as an
immediate stop (because num_scheduled_tokens >= 0 is always true) and also
overcounts generation tokens by summing full token sequences; change the
generation token accounting to count 1 token per generation request
(num_scheduled_gen_tokens = len(generation_requests)) and only evaluate the
token-threshold condition when the computed threshold is positive (e.g., compute
threshold = self.batch_wait_max_tokens_ratio * self.max_num_tokens and check
num_scheduled_tokens >= threshold only if threshold > 0); keep the existing
iteration-based timeout logic and ensure batch_wait_iters_count is
incremented/reset as before.
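To make the two fixes concrete, here is a small illustrative test of the corrected gating; the stub below mirrors only the decision logic from the suggestion and is not the real PyExecutor:

    class _GatingStub:
        """Toy stand-in for the corrected stop-waiting decision."""

        def __init__(self, timeout_iters: int, max_tokens_ratio: float, max_num_tokens: int):
            self.batch_wait_timeout_iters = timeout_iters
            self.batch_wait_max_tokens_ratio = max_tokens_ratio
            self.max_num_tokens = max_num_tokens
            self.batch_wait_iters_count = 0

        def stop_waiting(self, num_scheduled_tokens: int) -> bool:
            iter_reached = (self.batch_wait_timeout_iters > 0
                            and self.batch_wait_iters_count >= self.batch_wait_timeout_iters)
            ratio_reached = (self.batch_wait_max_tokens_ratio > 0.0
                             and num_scheduled_tokens >=
                             self.batch_wait_max_tokens_ratio * self.max_num_tokens)
            if iter_reached or ratio_reached:
                self.batch_wait_iters_count = 0
                return True
            self.batch_wait_iters_count += 1
            return False


    def test_ratio_zero_does_not_short_circuit():
        gate = _GatingStub(timeout_iters=3, max_tokens_ratio=0.0, max_num_tokens=8192)
        # With the ratio disabled, the gate should honor the full iteration budget.
        assert [gate.stop_waiting(64) for _ in range(4)] == [False, False, False, True]


    def test_ratio_threshold_releases_batch():
        gate = _GatingStub(timeout_iters=0, max_tokens_ratio=0.5, max_num_tokens=100)
        assert gate.stop_waiting(10) is False   # below the 50-token threshold
        assert gate.stop_waiting(60) is True    # threshold reached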
PR_Github #17152 [ run ] completed with state
Signed-off-by: yunruis <[email protected]>
Force-pushed from 9f937be to e84d357 (Compare)
/bot run --disable-fail-fast
PR_Github #17636 [ run ] triggered by Bot
PR_Github #17636 [ run ] completed with state
Summary by CodeRabbit
Description
From #7287
To enable the feature, add the parameters to config.yaml. For example:
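A plausible sketch of such a config.yaml entry, assuming the keys mirror the new LLM-API argument names added by this PR (values are examples only, not recommendations):

    # Illustrative config.yaml fragment
    batch_wait_timeout_ms: 0
    batch_wait_timeout_iters: 10
    batch_wait_max_tokens_ratio: 0.75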
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
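For example, a typical invocation combining the documented flags might look like:

    /bot run --disable-fail-fast --stage-list "A10-PyTorch-1"

(The stage name here is just the example used in the option description above.)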