Conversation

@yunruis (Contributor) commented on Sep 1, 2025

Summary by CodeRabbit

  • New Features
    • Added configurable batch-waiting controls: batch_wait_timeout_iters and batch_wait_max_tokens_ratio to optimize batching efficiency.
    • Introduced max_num_tokens (default 8192) for improved scheduling and capacity management.
    • Validations ensure safe ranges for new parameters; exposed via user arguments.
  • Tests
    • Added integration coverage for NVFP4 batch-waiting scenarios across multiple runtime configurations.
    • Updated test lists to include new scenarios.
  • Chores
    • Updated API stability references to include new parameters.

Description

From #7287

To enable the feature, add the parameters to config.yaml. For example:

batch_wait_timeout_iters: 20
batch_wait_max_tokens_ratio: 0.75
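
The same knobs are exposed through the LLM API arguments added in this PR. A minimal sketch, assuming the PyTorch-backend LLM constructor accepts these keyword arguments and using a placeholder model path:

from tensorrt_llm import LLM

# Sketch only: knob names follow the fields added in this PR; values are illustrative.
llm = LLM(
    model="<model-path>",               # placeholder
    max_num_tokens=8192,                # token budget used by the ratio threshold
    batch_wait_timeout_iters=20,        # wait at most 20 scheduler iterations
    batch_wait_max_tokens_ratio=0.75,   # or until 0.75 * max_num_tokens are scheduled
)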

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yunruis (Contributor, Author) commented on Sep 1, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #17152 [ run ] triggered by Bot

@coderabbitai bot (Contributor) commented on Sep 1, 2025

📝 Walkthrough

Adds batch-waiting controls to PyTorch backend: new config fields and wiring, max token budget, and a scheduling wait mechanism in PyExecutor. Extends CLI/API args with validation and propagates to backend config. Updates API stability references and adds parametrized integration tests (including duplicate test definitions and test-list entries).

Changes

Cohort / File(s) and summary of changes:

  • Backend config: add batch-waiting and token budget fields
    Files: tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/pyexecutor/config.py
    Introduce max_num_tokens, batch_wait_timeout_iters, batch_wait_max_tokens_ratio to PyTorch backend configuration initialization and schema. Defaults: 8192, 0, 0.0.
  • Scheduler: batch-waiting control flow
    Files: tensorrt_llm/_torch/pyexecutor/py_executor.py
    Add executor attributes for new fields, waiting counter, enable flag, and _waiting_requests(...) logic. Integrate waiting into _schedule when not using attention DP to gate context scheduling based on token ratio and iteration thresholds (see the sketch after this list).
  • LLM args: wiring + validation
    Files: tensorrt_llm/llmapi/llm_args.py
    Add batch_wait_timeout_iters and batch_wait_max_tokens_ratio (prototype), validate ranges, and pass through to PyTorchConfig.
  • Integration tests: NVFP4 batch waiting
    Files: tests/integration/defs/accuracy/test_llm_api_pytorch.py
    Add parametrized test_nvfp4_batch_waiting covering combinations of compile/kv dtype/cuda graph/overlap/MTP/timeout-iters/token-ratio. Test added twice with identical definitions.
  • QA test list updates
    Files: tests/integration/test_lists/qa/llm_function_full.txt
    Add new test entry for DeepSeekV3Lite NVFP4 batch-waiting; entry duplicated in two places.
  • API stability reference
    Files: tests/unittest/api_stability/references/llm.yaml
    Declare new prototype params batch_wait_timeout_iters and batch_wait_max_tokens_ratio after batch_wait_timeout_ms.
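
A rough illustration of the gating described in the scheduler row above (a sketch only: helper names such as schedule_request, active_requests, and enable_attention_dp are assumed, and the real PyExecutor._schedule differs):

def _schedule(self):
    # Illustrative sketch, not the PR's actual implementation.
    scheduled_context, scheduled_generation = self.scheduler.schedule_request(
        self.active_requests, self.inflight_req_ids)
    if (not self.enable_attention_dp and self.enable_batch_waiting
            and len(scheduled_context) > 0 and len(scheduled_generation) > 0):
        # Hold back context requests until the token-ratio or iteration
        # threshold is reached (see _waiting_requests later in this review).
        scheduled_context = self._waiting_requests(scheduled_context,
                                                   scheduled_generation)
    return scheduled_context, scheduled_generation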

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant LLM API
  participant PyExecutor
  participant ReqQueue as RequestQueue
  participant Scheduler

  Client->>LLM API: submit requests
  LLM API->>ReqQueue: enqueue (context/gen)
  LLM API->>PyExecutor: run_step()

  PyExecutor->>Scheduler: _schedule()
  alt attention_dp disabled AND batch-waiting enabled
    Scheduler->>ReqQueue: peek context & gen batches
    Scheduler->>Scheduler: _waiting_requests(tokens, iters, max_num_tokens, ratio)
    alt Wait condition met
      note right of Scheduler: Delay scheduling context<br/>Increment wait iters
      Scheduler-->>PyExecutor: return (possibly empty) context set
    else Proceed
      note right of Scheduler: Reset wait iters
      Scheduler-->>PyExecutor: return selected context set
    end
  else Normal scheduling
    Scheduler-->>PyExecutor: schedule as usual
  end

  PyExecutor->>LLM API: step result
  LLM API-->>Client: progress/output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • netanel-haber
  • QiJune
  • Superjomn
  • syuoni
  • lucaslie


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

2570-2637: Pass max_num_tokens into PyTorchConfig
In get_pytorch_backend_config() (tensorrt_llm/llmapi/llm_args.py), add:

 return PyTorchConfig(
     …  
+    max_num_tokens=(
+        self.max_num_tokens if self.max_num_tokens is not None
+        else PyTorchConfig.model_fields['max_num_tokens'].default
+    ),
     torch_compile_enabled=bool(self.torch_compile_config is not None),
     …
 )

so that batch_wait_max_tokens_ratio uses the user-specified limit instead of the hardcoded default.

🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/config.py (1)

1-1: Add NVIDIA copyright header.

Required by project guidelines.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

1-1: Add NVIDIA copyright header.

Required by project guidelines.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/unittest/api_stability/references/llm.yaml (1)

134-141: API surface update looks consistent; please add brief comments on ranges/disable semantics.

Suggest inline comments to avoid misuse and align with executor logic:

  • timeout iters: 0 disables
  • ratio: 0 disables; range [0.0, 1.0]

Example:

       batch_wait_timeout_ms:
         annotation: float
         default: 0
         status: prototype
       batch_wait_timeout_iters:
         annotation: int
         default: 0
         status: prototype
+        # 0 disables iteration-based waiting
       batch_wait_max_tokens_ratio:
         annotation: float
         default: 0
         status: prototype
+        # 0 disables; range [0.0, 1.0]; threshold applies to max_num_tokens
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Add NVIDIA copyright header.

Required by project guidelines.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1626-1633: Assert new knobs are plumbed end-to-end.

Add explicit assertions to ensure LLM args reflect the test’s batch-wait settings; this will catch wiring regressions early.

Apply this diff right after asserting NVFP4:

             assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4
+            assert llm.args.batch_wait_timeout_iters == batch_wait_timeout_iters
+            assert llm.args.batch_wait_max_tokens_ratio == batch_wait_max_tokens_ratio
tensorrt_llm/llmapi/llm_args.py (2)

2258-2269: Polish field docs (clarify wording and fix minor grammar).

Tighten the descriptions; “new coming” → “incoming”; add missing space and clarify the “0 disables” phrasing.

-        "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.",
+        "Maximum number of scheduler iterations to wait to accumulate incoming requests for better GPU utilization. If > 0, the scheduler delays batch processing up to this iteration budget. If 0, timeout-iters-based delays are disabled.",
@@
-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token-accumulation threshold ratio. If > 0, the scheduler accumulates requests until total tokens reach (batch_wait_max_tokens_ratio * max_num_tokens). This improves GPU utilization by ensuring adequate batch sizes. If 0, token-ratio-based delays are disabled.",

2543-2558: Optional: cross-field guardrail to avoid misconfiguration.

Consider warning when batch_wait_max_tokens_ratio > 0 while both timeout knobs are 0, as this can increase stall risk under sparse traffic. Not a blocker; scheduler-side defaults may already mitigate this.
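
A minimal sketch of such a guardrail, assuming pydantic v2 and a module logger are available (the real TorchLlmArgs model is much larger; field names follow this PR):

import logging

from pydantic import BaseModel, model_validator

logger = logging.getLogger(__name__)


class BatchWaitKnobs(BaseModel):
    """Illustrative stand-in for the batch-wait fields on the LLM args model."""
    batch_wait_timeout_ms: float = 0.0
    batch_wait_timeout_iters: int = 0
    batch_wait_max_tokens_ratio: float = 0.0

    @model_validator(mode="after")
    def _warn_ratio_without_timeout(self):
        # Ratio-only gating has no time/iteration escape hatch; warn about stall risk.
        if (self.batch_wait_max_tokens_ratio > 0
                and self.batch_wait_timeout_ms == 0
                and self.batch_wait_timeout_iters == 0):
            logger.warning(
                "batch_wait_max_tokens_ratio is set but both timeout knobs are 0; "
                "scheduling may stall under sparse traffic.")
        return self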

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between b0558c7 and 9f937be.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent Python with 4 spaces; no tabs
Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (2)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
  • default (50-58)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)
tests/integration/defs/conftest.py (2)
  • parametrize_with_ids (1786-1811)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (5)
  • KvCacheConfig (961-1092)
  • TorchCompileConfig (2123-2170)
  • CudaGraphConfig (108-165)
  • MoeConfig (168-196)
  • MTPDecodingConfig (536-571)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1013-1029)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (707-717)
tensorrt_llm/evaluate/interface.py (1)
  • evaluate (81-110)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/test_lists/qa/llm_function_full.txt (1)

516-516: No duplicate test entry found
Search returned only one occurrence of test_nvfp4_batch_waiting in the test lists—nothing to remove.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

187-187: Ensure max_num_tokens reflects the engine’s SequenceInfo.

PyExecutor reads from model_engine.pytorch_backend_config.max_num_tokens. With the current ADEngine hardcoded default, this may diverge from seq_info. Pair with the ADEngine fix to source seq_info.max_num_tokens.


196-199: enable_batch_waiting guard is fine; but see waiting logic bug below.

No change needed here once waiting logic is fixed.


1374-1381: Condition to invoke waiting looks correct (prevents dead-wait with only one type present).

No changes.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

1592-1601: Prevent potential hang for ratio-only gating
The (batch_wait_timeout_iters=0, batch_wait_max_tokens_ratio=0.75) combination can starve under light load. Optional: bump the timeout to 1 for a safety net:

-    @parametrize_with_ids(
-        "batch_wait_timeout_iters,batch_wait_max_tokens_ratio",
-        [(0, 0), (10, 0.75), (10, 0), (0, 0.75)])
+    @parametrize_with_ids(
+        "batch_wait_timeout_iters,batch_wait_max_tokens_ratio",
+        [(0, 0), (10, 0.75), (10, 0), (1, 0.75)])

Please manually run the focused test matrix in your local/CI environment to confirm there are no hangs for the 0-0.75 vs. 1-0.75 cases.


1592-1634: Ignore duplicate registration concern. The test test_nvfp4_batch_waiting is listed only once in tests/integration/test_lists/qa/llm_function_full.txt (line 516); there’s no duplicate entry to remove.

Likely an incorrect or invalid review comment.

tensorrt_llm/llmapi/llm_args.py (1)

2634-2637: Wiring of batch-wait knobs into PyTorchConfig — LGTM.

The new fields are correctly forwarded to the backend config.

Comment on lines +140 to 142
        self.pytorch_backend_config.batch_wait_timeout_iters = 0
        self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
        self.pytorch_backend_config.max_num_tokens = 8192
        self.iter_counter = 0

⚠️ Potential issue

Fix max_num_tokens mismatch: use seq_info.max_num_tokens instead of a hardcoded 8192.

Currently PyExecutor reads model_engine.pytorch_backend_config.max_num_tokens, but ADEngine sets it to 8192, ignoring AutoDeploy/SequenceInfo. This desynchronizes token budgeting vs. sequence manager.

Apply:

-        self.pytorch_backend_config.max_num_tokens = 8192
+        self.pytorch_backend_config.max_num_tokens = seq_info.max_num_tokens
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Current:

        self.pytorch_backend_config.batch_wait_timeout_iters = 0
        self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
        self.pytorch_backend_config.max_num_tokens = 8192
        self.iter_counter = 0

Suggested:

        self.pytorch_backend_config.batch_wait_timeout_iters = 0
        self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
        self.pytorch_backend_config.max_num_tokens = seq_info.max_num_tokens
        self.iter_counter = 0
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py around lines 140-143 the
code sets self.pytorch_backend_config.max_num_tokens = 8192; replace this
hardcoded value with the model's sequence info value (use
seq_info.max_num_tokens, e.g. self.pytorch_backend_config.max_num_tokens =
self.seq_info.max_num_tokens or self.sequence_info.max_num_tokens depending on
the attribute name used in this class) and add a safe fallback if seq_info is
missing (keep the existing value or raise a clear error). This ensures the
executor respects AutoDeploy/SequenceInfo token limits and stays in sync with
the sequence manager.

Comment on lines +53 to 58
    max_num_tokens: int = 8192

    batch_wait_timeout_ms: float = 0
    batch_wait_timeout_iters: int = 0
    batch_wait_max_tokens_ratio: float = 0


🛠️ Refactor suggestion

Clarify semantics and add guardrails for new batch-waiting knobs (also prefer 0.0 for floats).

  • Make ratio-based threshold opt-in only when > 0.0; document ranges.
  • Minor: initialize float default as 0.0 for clarity.

Apply:

-    max_num_tokens: int = 8192
+    # Max tokens budget per batch; used by batch-wait gating.
+    max_num_tokens: int = 8192
@@
-    batch_wait_timeout_iters: int = 0
-    batch_wait_max_tokens_ratio: float = 0
+    # Iterations to wait before scheduling context even if token budget not reached (0 disables).
+    batch_wait_timeout_iters: int = 0
+    # Fraction [0, 1] of max_num_tokens at which to stop waiting (0 disables).
+    batch_wait_max_tokens_ratio: float = 0.0

Optionally add a lightweight validator to prevent misconfiguration:

# Place inside PyTorchConfig
def __post_init__(self):
    if self.max_num_tokens <= 0:
        raise ValueError("max_num_tokens must be > 0")
    if not (0.0 <= self.batch_wait_max_tokens_ratio <= 1.0):
        raise ValueError("batch_wait_max_tokens_ratio must be in [0.0, 1.0]")
    if self.batch_wait_timeout_iters < 0:
        raise ValueError("batch_wait_timeout_iters must be >= 0")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/config.py around lines 53 to 58, the new
batch-waiting knobs need clarified semantics and guardrails: change float
defaults to 0.0 for clarity, document that batch_wait_max_tokens_ratio is opt-in
only when > 0.0 and must be in the range [0.0, 1.0], and ensure
batch_wait_timeout_iters is non-negative and max_num_tokens is > 0; implement
these checks by adding a __post_init__ validator on PyTorchConfig that raises
ValueError for invalid values (max_num_tokens <= 0, batch_wait_max_tokens_ratio
not in [0.0,1.0], or batch_wait_timeout_iters < 0) and update inline
comments/docs to state the ratio behavior and allowed ranges.

Comment on lines 1343 to 1378
def _waiting_requests(self, context_requests: list[LlmRequest],
                      generation_requests: list[LlmRequest]):
    if not self.enable_batch_waiting:
        return context_requests

    waited_context_requests = []
    stop_waiting = False
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    num_scheduled_gen_tokens = sum(
        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens

    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
    if stop_waiting:
        waited_context_requests = context_requests
        self.batch_wait_iters_count = 0
    else:
        self.batch_wait_iters_count += 1
    return waited_context_requests


⚠️ Potential issue

Waiting logic: ratio=0 causes immediate stop; generation token counting likely overestimates.

  • Bug: when batch_wait_timeout_iters>0 and ratio=0, current code evaluates num_scheduled_tokens >= 0 as True, so it never waits by iterations.
  • Token accounting: use 1 token per generation request (per-iter), not len(get_tokens(0)), which can include the full sequence length and prematurely stop waiting.

Apply:

 def _waiting_requests(self, context_requests: list[LlmRequest],
                       generation_requests: list[LlmRequest]):
     if not self.enable_batch_waiting:
         return context_requests

     waited_context_requests = []
-    stop_waiting = False
-    num_scheduled_ctx_tokens = sum(
-        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-    num_scheduled_gen_tokens = sum(
-        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+    # Context tokens: sum of incoming ctx chunk lengths.
+    num_scheduled_ctx_tokens = sum(len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
+    # Generation step typically consumes 1 token per req per iter; avoid counting full prompt length.
+    num_scheduled_gen_tokens = len(generation_requests)
+    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+
+    iter_reached = (self.batch_wait_timeout_iters > 0 and
+                    self.batch_wait_iters_count >= self.batch_wait_timeout_iters)
+    ratio_reached = (self.batch_wait_max_tokens_ratio > 0.0 and
+                     num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens)
+    stop_waiting = iter_reached or ratio_reached
     if stop_waiting:
         waited_context_requests = context_requests
         self.batch_wait_iters_count = 0
     else:
         self.batch_wait_iters_count += 1
     return waited_context_requests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Current:

def _waiting_requests(self, context_requests: list[LlmRequest],
                      generation_requests: list[LlmRequest]):
    if not self.enable_batch_waiting:
        return context_requests

    waited_context_requests = []
    stop_waiting = False
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    num_scheduled_gen_tokens = sum(
        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens

    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
    if stop_waiting:
        waited_context_requests = context_requests
        self.batch_wait_iters_count = 0
    else:
        self.batch_wait_iters_count += 1
    return waited_context_requests

Suggested:

def _waiting_requests(self, context_requests: list[LlmRequest],
                      generation_requests: list[LlmRequest]):
    if not self.enable_batch_waiting:
        return context_requests

    waited_context_requests = []
    # Context tokens: sum of incoming ctx chunk lengths.
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    # Generation step typically consumes 1 token per req per iter; avoid counting full prompt length.
    num_scheduled_gen_tokens = len(generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens

    iter_reached = (
        self.batch_wait_timeout_iters > 0 and
        self.batch_wait_iters_count >= self.batch_wait_timeout_iters
    )
    ratio_reached = (
        self.batch_wait_max_tokens_ratio > 0.0 and
        num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
    )
    stop_waiting = iter_reached or ratio_reached
    if stop_waiting:
        waited_context_requests = context_requests
        self.batch_wait_iters_count = 0
    else:
        self.batch_wait_iters_count += 1
    return waited_context_requests
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1343-1363, the
waiting logic incorrectly treats batch_wait_max_tokens_ratio == 0 as an
immediate stop (because num_scheduled_tokens >= 0 is always true) and also
overcounts generation tokens by summing full token sequences; change the
generation token accounting to count 1 token per generation request
(num_scheduled_gen_tokens = len(generation_requests)) and only evaluate the
token-threshold condition when the computed threshold is positive (e.g., compute
threshold = self.batch_wait_max_tokens_ratio * self.max_num_tokens and check
num_scheduled_tokens >= threshold only if threshold > 0); keep the existing
iteration-based timeout logic and ensure batch_wait_iters_count is
incremented/reset as before.
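
For intuition on the corrected thresholds, here is a worked example with illustrative numbers only, using the PR's default max_num_tokens of 8192:

# Illustrative numbers only; mirrors the stop_waiting math discussed above.
max_num_tokens = 8192
batch_wait_max_tokens_ratio = 0.75
batch_wait_timeout_iters = 10

token_threshold = batch_wait_max_tokens_ratio * max_num_tokens  # 6144.0

# e.g. two pending context requests of 2048 prompt tokens each,
# plus 64 generation requests counted as one token each per iteration
num_scheduled_tokens = 2 * 2048 + 64  # 4160

print(num_scheduled_tokens >= token_threshold)  # False -> keep waiting this iteration
# If the threshold is still not reached after batch_wait_timeout_iters iterations,
# the iteration condition releases the held context requests anyway.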

@tensorrt-cicd (Collaborator):

PR_Github #17152 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12899 completed with status: 'FAILURE'

@yunruis force-pushed the user/yunruis/opt_non_adp_wait branch from 9f937be to e84d357 on September 4, 2025 06:23
@yunruis (Contributor, Author) commented on Sep 4, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #17636 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #17636 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13257 completed with status: 'FAILURE'
