Conversation

@yunruis (Contributor) commented on Sep 1, 2025

Summary by CodeRabbit

  • New Features
    • Added configurable batch-waiting controls: batch_wait_timeout_iters and batch_wait_max_tokens_ratio to optimize batching efficiency.
    • Introduced max_num_tokens (default 8192) for improved scheduling and capacity management.
    • Validations ensure safe ranges for new parameters; exposed via user arguments.
  • Tests
    • Added integration coverage for NVFP4 batch-waiting scenarios across multiple runtime configurations.
    • Updated test lists to include new scenarios.
  • Chores
    • Updated API stability references to include new parameters.

Description

From #7287

To enable the feature, add the parameters to config.yaml. For example:

batch_wait_timeout_iters: 20
batch_wait_max_tokens_ratio: 0.75
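
The same knobs are exposed through the LLM API arguments added in this PR. A minimal sketch, assuming the PyTorch-backend LLM constructor accepts these keyword arguments and using a placeholder model path:

from tensorrt_llm import LLM

# Sketch only: knob names follow the fields added in this PR; values are illustrative.
llm = LLM(
    model="<model-path>",               # placeholder
    max_num_tokens=8192,                # token budget used by the ratio threshold
    batch_wait_timeout_iters=20,        # wait at most 20 scheduler iterations
    batch_wait_max_tokens_ratio=0.75,   # or until 0.75 * max_num_tokens are scheduled
)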

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yunruis (Contributor, Author) commented on Sep 1, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #17152 [ run ] triggered by Bot

@coderabbitai bot (Contributor) commented on Sep 1, 2025

📝 Walkthrough

Adds batch-waiting controls to PyTorch backend: new config fields and wiring, max token budget, and a scheduling wait mechanism in PyExecutor. Extends CLI/API args with validation and propagates to backend config. Updates API stability references and adds parametrized integration tests (including duplicate test definitions and test-list entries).

Changes

Cohort / File(s) and summary of changes:

  • Backend config: add batch-waiting and token budget fields
    Files: tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/pyexecutor/config.py
    Introduce max_num_tokens, batch_wait_timeout_iters, batch_wait_max_tokens_ratio to PyTorch backend configuration initialization and schema. Defaults: 8192, 0, 0.0.
  • Scheduler: batch-waiting control flow
    Files: tensorrt_llm/_torch/pyexecutor/py_executor.py
    Add executor attributes for new fields, waiting counter, enable flag, and _waiting_requests(...) logic. Integrate waiting into _schedule when not using attention DP to gate context scheduling based on token ratio and iteration thresholds (see the sketch after this list).
  • LLM args: wiring + validation
    Files: tensorrt_llm/llmapi/llm_args.py
    Add batch_wait_timeout_iters and batch_wait_max_tokens_ratio (prototype), validate ranges, and pass through to PyTorchConfig.
  • Integration tests: NVFP4 batch waiting
    Files: tests/integration/defs/accuracy/test_llm_api_pytorch.py
    Add parametrized test_nvfp4_batch_waiting covering combinations of compile/kv dtype/cuda graph/overlap/MTP/timeout-iters/token-ratio. Test added twice with identical definitions.
  • QA test list updates
    Files: tests/integration/test_lists/qa/llm_function_full.txt
    Add new test entry for DeepSeekV3Lite NVFP4 batch-waiting; entry duplicated in two places.
  • API stability reference
    Files: tests/unittest/api_stability/references/llm.yaml
    Declare new prototype params batch_wait_timeout_iters and batch_wait_max_tokens_ratio after batch_wait_timeout_ms.
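
A rough illustration of the gating described in the scheduler row above (a sketch only: helper names such as schedule_request, active_requests, and enable_attention_dp are assumed, and the real PyExecutor._schedule differs):

def _schedule(self):
    # Illustrative sketch, not the PR's actual implementation.
    scheduled_context, scheduled_generation = self.scheduler.schedule_request(
        self.active_requests, self.inflight_req_ids)
    if (not self.enable_attention_dp and self.enable_batch_waiting
            and len(scheduled_context) > 0 and len(scheduled_generation) > 0):
        # Hold back context requests until the token-ratio or iteration
        # threshold is reached (see _waiting_requests later in this review).
        scheduled_context = self._waiting_requests(scheduled_context,
                                                   scheduled_generation)
    return scheduled_context, scheduled_generation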

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant LLM API
  participant PyExecutor
  participant ReqQueue as RequestQueue
  participant Scheduler

  Client->>LLM API: submit requests
  LLM API->>ReqQueue: enqueue (context/gen)
  LLM API->>PyExecutor: run_step()

  PyExecutor->>Scheduler: _schedule()
  alt attention_dp disabled AND batch-waiting enabled
    Scheduler->>ReqQueue: peek context & gen batches
    Scheduler->>Scheduler: _waiting_requests(tokens, iters, max_num_tokens, ratio)
    alt Wait condition met
      note right of Scheduler: Delay scheduling context<br/>Increment wait iters
      Scheduler-->>PyExecutor: return (possibly empty) context set
    else Proceed
      note right of Scheduler: Reset wait iters
      Scheduler-->>PyExecutor: return selected context set
    end
  else Normal scheduling
    Scheduler-->>PyExecutor: schedule as usual
  end

  PyExecutor->>LLM API: step result
  LLM API-->>Client: progress/output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • netanel-haber
  • QiJune
  • Superjomn
  • syuoni
  • lucaslie


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

2570-2637: Pass max_num_tokens into PyTorchConfig
In get_pytorch_backend_config() (tensorrt_llm/llmapi/llm_args.py), add:

 return PyTorchConfig(
     …  
+    max_num_tokens=(
+        self.max_num_tokens if self.max_num_tokens is not None
+        else PyTorchConfig.model_fields['max_num_tokens'].default
+    ),
     torch_compile_enabled=bool(self.torch_compile_config is not None),
     …
 )

so that batch_wait_max_tokens_ratio uses the user-specified limit instead of the hardcoded default.

🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/config.py (1)

1-1: Add NVIDIA copyright header.

Required by project guidelines.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

1-1: Add NVIDIA copyright header.

Required by project guidelines.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/unittest/api_stability/references/llm.yaml (1)

134-141: API surface update looks consistent; please add brief comments on ranges/disable semantics.

Suggest inline comments to avoid misuse and align with executor logic:

  • timeout iters: 0 disables
  • ratio: 0 disables; range [0.0, 1.0]

Example:

       batch_wait_timeout_ms:
         annotation: float
         default: 0
         status: prototype
       batch_wait_timeout_iters:
         annotation: int
         default: 0
         status: prototype
+        # 0 disables iteration-based waiting
       batch_wait_max_tokens_ratio:
         annotation: float
         default: 0
         status: prototype
+        # 0 disables; range [0.0, 1.0]; threshold applies to max_num_tokens
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Add NVIDIA copyright header.

Required by project guidelines.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1626-1633: Assert new knobs are plumbed end-to-end.

Add explicit assertions to ensure LLM args reflect the test’s batch-wait settings; this will catch wiring regressions early.

Apply this diff right after asserting NVFP4:

             assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4
+            assert llm.args.batch_wait_timeout_iters == batch_wait_timeout_iters
+            assert llm.args.batch_wait_max_tokens_ratio == batch_wait_max_tokens_ratio
tensorrt_llm/llmapi/llm_args.py (2)

2258-2269: Polish field docs (clarify wording and fix minor grammar).

Tighten the descriptions; “new coming” → “incoming”; add missing space and clarify the “0 disables” phrasing.

-        "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.",
+        "Maximum number of scheduler iterations to wait to accumulate incoming requests for better GPU utilization. If > 0, the scheduler delays batch processing up to this iteration budget. If 0, timeout-iters-based delays are disabled.",
@@
-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token-accumulation threshold ratio. If > 0, the scheduler accumulates requests until total tokens reach (batch_wait_max_tokens_ratio * max_num_tokens). This improves GPU utilization by ensuring adequate batch sizes. If 0, token-ratio-based delays are disabled.",

2543-2558: Optional: cross-field guardrail to avoid misconfiguration.

Consider warning when batch_wait_max_tokens_ratio > 0 while both timeout knobs are 0, as this can increase stall risk under sparse traffic. Not a blocker; scheduler-side defaults may already mitigate this.
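
A minimal sketch of such a guardrail, assuming pydantic v2 and a module logger are available (the real TorchLlmArgs model is much larger; field names follow this PR):

import logging

from pydantic import BaseModel, model_validator

logger = logging.getLogger(__name__)


class BatchWaitKnobs(BaseModel):
    """Illustrative stand-in for the batch-wait fields on the LLM args model."""
    batch_wait_timeout_ms: float = 0.0
    batch_wait_timeout_iters: int = 0
    batch_wait_max_tokens_ratio: float = 0.0

    @model_validator(mode="after")
    def _warn_ratio_without_timeout(self):
        # Ratio-only gating has no time/iteration escape hatch; warn about stall risk.
        if (self.batch_wait_max_tokens_ratio > 0
                and self.batch_wait_timeout_ms == 0
                and self.batch_wait_timeout_iters == 0):
            logger.warning(
                "batch_wait_max_tokens_ratio is set but both timeout knobs are 0; "
                "scheduling may stall under sparse traffic.")
        return self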

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between b0558c7 and 9f937be.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent Python with 4 spaces; no tabs
Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (2)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
  • default (50-58)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)
tests/integration/defs/conftest.py (2)
  • parametrize_with_ids (1786-1811)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (5)
  • KvCacheConfig (961-1092)
  • TorchCompileConfig (2123-2170)
  • CudaGraphConfig (108-165)
  • MoeConfig (168-196)
  • MTPDecodingConfig (536-571)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1013-1029)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (707-717)
tensorrt_llm/evaluate/interface.py (1)
  • evaluate (81-110)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tests/integration/test_lists/qa/llm_function_full.txt (1)

516-516: No duplicate test entry found
Search returned only one occurrence of test_nvfp4_batch_waiting in the test lists—nothing to remove.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

187-187: Ensure max_num_tokens reflects the engine’s SequenceInfo.

PyExecutor reads from model_engine.pytorch_backend_config.max_num_tokens. With the current ADEngine hardcoded default, this may diverge from seq_info. Pair with the ADEngine fix to source seq_info.max_num_tokens.


196-199: enable_batch_waiting guard is fine; but see waiting logic bug below.

No change needed here once waiting logic is fixed.


1374-1381: Condition to invoke waiting looks correct (prevents dead-wait with only one type present).

No changes.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (2)

1592-1601: Prevent potential hang for ratio-only gating
The (batch_wait_timeout_iters=0, batch_wait_max_tokens_ratio=0.75) combination can starve under light load. Optional: bump the timeout to 1 for a safety net:

-    @parametrize_with_ids(
-        "batch_wait_timeout_iters,batch_wait_max_tokens_ratio",
-        [(0, 0), (10, 0.75), (10, 0), (0, 0.75)])
+    @parametrize_with_ids(
+        "batch_wait_timeout_iters,batch_wait_max_tokens_ratio",
+        [(0, 0), (10, 0.75), (10, 0), (1, 0.75)])

Please manually run the focused test matrix in your local/CI environment to confirm there are no hangs for the 0-0.75 vs. 1-0.75 cases.


1592-1634: Ignore duplicate registration concern. The test test_nvfp4_batch_waiting is listed only once in tests/integration/test_lists/qa/llm_function_full.txt (line 516); there’s no duplicate entry to remove.

Likely an incorrect or invalid review comment.

tensorrt_llm/llmapi/llm_args.py (1)

2634-2637: Wiring of batch-wait knobs into PyTorchConfig — LGTM.

The new fields are correctly forwarded to the backend config.

Comment on lines +140 to 142
        self.pytorch_backend_config.batch_wait_timeout_iters = 0
        self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
        self.pytorch_backend_config.max_num_tokens = 8192
        self.iter_counter = 0

⚠️ Potential issue

Fix max_num_tokens mismatch: use seq_info.max_num_tokens instead of a hardcoded 8192.

Currently PyExecutor reads model_engine.pytorch_backend_config.max_num_tokens, but ADEngine sets it to 8192, ignoring AutoDeploy/SequenceInfo. This desynchronizes token budgeting vs. sequence manager.

Apply:

-        self.pytorch_backend_config.max_num_tokens = 8192
+        self.pytorch_backend_config.max_num_tokens = seq_info.max_num_tokens
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Current:

        self.pytorch_backend_config.batch_wait_timeout_iters = 0
        self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
        self.pytorch_backend_config.max_num_tokens = 8192
        self.iter_counter = 0

Suggested:

        self.pytorch_backend_config.batch_wait_timeout_iters = 0
        self.pytorch_backend_config.batch_wait_max_tokens_ratio = 0.0
        self.pytorch_backend_config.max_num_tokens = seq_info.max_num_tokens
        self.iter_counter = 0
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py around lines 140-143 the
code sets self.pytorch_backend_config.max_num_tokens = 8192; replace this
hardcoded value with the model's sequence info value (use
seq_info.max_num_tokens, e.g. self.pytorch_backend_config.max_num_tokens =
self.seq_info.max_num_tokens or self.sequence_info.max_num_tokens depending on
the attribute name used in this class) and add a safe fallback if seq_info is
missing (keep the existing value or raise a clear error). This ensures the
executor respects AutoDeploy/SequenceInfo token limits and stays in sync with
the sequence manager.

Comment on lines +53 to 58
    max_num_tokens: int = 8192

    batch_wait_timeout_ms: float = 0
    batch_wait_timeout_iters: int = 0
    batch_wait_max_tokens_ratio: float = 0


🛠️ Refactor suggestion

Clarify semantics and add guardrails for new batch-waiting knobs (also prefer 0.0 for floats).

  • Make ratio-based threshold opt-in only when > 0.0; document ranges.
  • Minor: initialize float default as 0.0 for clarity.

Apply:

-    max_num_tokens: int = 8192
+    # Max tokens budget per batch; used by batch-wait gating.
+    max_num_tokens: int = 8192
@@
-    batch_wait_timeout_iters: int = 0
-    batch_wait_max_tokens_ratio: float = 0
+    # Iterations to wait before scheduling context even if token budget not reached (0 disables).
+    batch_wait_timeout_iters: int = 0
+    # Fraction [0, 1] of max_num_tokens at which to stop waiting (0 disables).
+    batch_wait_max_tokens_ratio: float = 0.0

Optionally add a lightweight validator to prevent misconfiguration:

# Place inside PyTorchConfig
def __post_init__(self):
    if self.max_num_tokens <= 0:
        raise ValueError("max_num_tokens must be > 0")
    if not (0.0 <= self.batch_wait_max_tokens_ratio <= 1.0):
        raise ValueError("batch_wait_max_tokens_ratio must be in [0.0, 1.0]")
    if self.batch_wait_timeout_iters < 0:
        raise ValueError("batch_wait_timeout_iters must be >= 0")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/config.py around lines 53 to 58, the new
batch-waiting knobs need clarified semantics and guardrails: change float
defaults to 0.0 for clarity, document that batch_wait_max_tokens_ratio is opt-in
only when > 0.0 and must be in the range [0.0, 1.0], and ensure
batch_wait_timeout_iters is non-negative and max_num_tokens is > 0; implement
these checks by adding a __post_init__ validator on PyTorchConfig that raises
ValueError for invalid values (max_num_tokens <= 0, batch_wait_max_tokens_ratio
not in [0.0,1.0], or batch_wait_timeout_iters < 0) and update inline
comments/docs to state the ratio behavior and allowed ranges.

Comment on lines 1343 to 1378
def _waiting_requests(self, context_requests: list[LlmRequest],
                      generation_requests: list[LlmRequest]):
    if not self.enable_batch_waiting:
        return context_requests

    waited_context_requests = []
    stop_waiting = False
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    num_scheduled_gen_tokens = sum(
        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens

    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
    if stop_waiting:
        waited_context_requests = context_requests
        self.batch_wait_iters_count = 0
    else:
        self.batch_wait_iters_count += 1
    return waited_context_requests


⚠️ Potential issue

Waiting logic: ratio=0 causes immediate stop; generation token counting likely overestimates.

  • Bug: when batch_wait_timeout_iters>0 and ratio=0, current code evaluates num_scheduled_tokens >= 0 as True, so it never waits by iterations.
  • Token accounting: use 1 token per generation request (per-iter), not len(get_tokens(0)), which can include the full sequence length and prematurely stop waiting.

Apply:

 def _waiting_requests(self, context_requests: list[LlmRequest],
                       generation_requests: list[LlmRequest]):
     if not self.enable_batch_waiting:
         return context_requests

     waited_context_requests = []
-    stop_waiting = False
-    num_scheduled_ctx_tokens = sum(
-        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-    num_scheduled_gen_tokens = sum(
-        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+    # Context tokens: sum of incoming ctx chunk lengths.
+    num_scheduled_ctx_tokens = sum(len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
+    # Generation step typically consumes 1 token per req per iter; avoid counting full prompt length.
+    num_scheduled_gen_tokens = len(generation_requests)
+    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+
+    iter_reached = (self.batch_wait_timeout_iters > 0 and
+                    self.batch_wait_iters_count >= self.batch_wait_timeout_iters)
+    ratio_reached = (self.batch_wait_max_tokens_ratio > 0.0 and
+                     num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens)
+    stop_waiting = iter_reached or ratio_reached
     if stop_waiting:
         waited_context_requests = context_requests
         self.batch_wait_iters_count = 0
     else:
         self.batch_wait_iters_count += 1
     return waited_context_requests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Current:

def _waiting_requests(self, context_requests: list[LlmRequest],
                      generation_requests: list[LlmRequest]):
    if not self.enable_batch_waiting:
        return context_requests

    waited_context_requests = []
    stop_waiting = False
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    num_scheduled_gen_tokens = sum(
        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens

    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
    if stop_waiting:
        waited_context_requests = context_requests
        self.batch_wait_iters_count = 0
    else:
        self.batch_wait_iters_count += 1
    return waited_context_requests

Suggested:

def _waiting_requests(self, context_requests: list[LlmRequest],
                      generation_requests: list[LlmRequest]):
    if not self.enable_batch_waiting:
        return context_requests

    waited_context_requests = []
    # Context tokens: sum of incoming ctx chunk lengths.
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    # Generation step typically consumes 1 token per req per iter; avoid counting full prompt length.
    num_scheduled_gen_tokens = len(generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens

    iter_reached = (
        self.batch_wait_timeout_iters > 0 and
        self.batch_wait_iters_count >= self.batch_wait_timeout_iters
    )
    ratio_reached = (
        self.batch_wait_max_tokens_ratio > 0.0 and
        num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
    )
    stop_waiting = iter_reached or ratio_reached
    if stop_waiting:
        waited_context_requests = context_requests
        self.batch_wait_iters_count = 0
    else:
        self.batch_wait_iters_count += 1
    return waited_context_requests
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1343-1363, the
waiting logic incorrectly treats batch_wait_max_tokens_ratio == 0 as an
immediate stop (because num_scheduled_tokens >= 0 is always true) and also
overcounts generation tokens by summing full token sequences; change the
generation token accounting to count 1 token per generation request
(num_scheduled_gen_tokens = len(generation_requests)) and only evaluate the
token-threshold condition when the computed threshold is positive (e.g., compute
threshold = self.batch_wait_max_tokens_ratio * self.max_num_tokens and check
num_scheduled_tokens >= threshold only if threshold > 0); keep the existing
iteration-based timeout logic and ensure batch_wait_iters_count is
incremented/reset as before.
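
For intuition on the corrected thresholds, here is a worked example with illustrative numbers only, using the PR's default max_num_tokens of 8192:

# Illustrative numbers only; mirrors the stop_waiting math discussed above.
max_num_tokens = 8192
batch_wait_max_tokens_ratio = 0.75
batch_wait_timeout_iters = 10

token_threshold = batch_wait_max_tokens_ratio * max_num_tokens  # 6144.0

# e.g. two pending context requests of 2048 prompt tokens each,
# plus 64 generation requests counted as one token each per iteration
num_scheduled_tokens = 2 * 2048 + 64  # 4160

print(num_scheduled_tokens >= token_threshold)  # False -> keep waiting this iteration
# If the threshold is still not reached after batch_wait_timeout_iters iterations,
# the iteration condition releases the held context requests anyway.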

@tensorrt-cicd (Collaborator):

PR_Github #17152 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12899 completed with status: 'FAILURE'

@yunruis force-pushed the user/yunruis/opt_non_adp_wait branch from 9f937be to e84d357 on September 4, 2025 06:23
@yunruis (Contributor, Author) commented on Sep 4, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #17636 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #17636 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13257 completed with status: 'FAILURE'
