
Conversation

@IzzyPutterman (Collaborator) commented Jun 13, 2025

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of batch processing to avoid unnecessary operations when batch size is zero.
    • Refined sorting of generation requests for more accurate processing.
    • Enhanced conditional checks for resource management during batch execution.

@litaotju (Collaborator)

Thanks for the change. How do we test it? If no new test cases are needed, which existing cases cover it? Thanks.

@IzzyPutterman force-pushed the iputterman/skip-last-overlap branch from 282cca7 to b596c3c on July 30, 2025 03:37
@coderabbitai bot (Contributor) commented Jul 30, 2025

📝 Walkthrough

The internal logic of the _executor_loop_overlap method in PyExecutor was refactored. Conditional checks around batch size and kv_cache_transceiver were tightened, the sorting criteria for generation requests were refined, and control flow was reorganized to avoid unnecessary operations when batch size is zero.

Changes

Cohort / File(s): PyExecutor Overlap Logic Refactor (tensorrt_llm/_torch/pyexecutor/py_executor.py)
Change Summary: Refined sorting key for generation requests; added and reorganized conditional checks around batch size and kv_cache_transceiver; restructured control flow to minimize unnecessary operations; adjusted assignment and processing of previous batch state.
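
For orientation, here is a minimal sketch of the refined sorting key described above. The py_batch_idx, py_decoding_iter, and py_max_new_tokens attributes are the ones named in the review excerpts below; everything else is simplified for illustration and is not the executor's actual code.

def sort_generation_requests(generation_requests):
    # Stable sort: the key is 0 or 1. Requests whose key evaluates to 0
    # (no batch index yet, or already on their final decoding iteration)
    # keep their relative order and come before requests that still have
    # decoding iterations left.
    return sorted(
        generation_requests,
        key=lambda req: int(req.py_batch_idx is not None
                            and req.py_decoding_iter + 1 < req.py_max_new_tokens),
    )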

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes


Suggested reviewers

  • pcastonguay
  • HuiGao-NV
  • leslie-fang25



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

1014-1031: Potential undefined variable error for ctx_transmission_reqs

The variable ctx_transmission_reqs is only defined inside the if self.kv_cache_transceiver condition (lines 1014-1016), but it's referenced unconditionally in line 1031. This will cause a NameError when kv_cache_transceiver is None.

Initialize ctx_transmission_reqs before the conditional block:

                    if scheduled_batch.batch_size > 0:
                        if self.kv_cache_transceiver:
                            # Return the first token to the client
                            self._handle_first_token_response(scheduled_batch)

                        previous_tensors_device = self.previous_batch and self.previous_batch.sample_state and self.previous_batch.sample_state.device

                        batch_outputs = self._forward_step(
                            scheduled_batch, previous_tensors_device)

                        if self.previous_batch is not None:
                            self._update_requests(
                                self.previous_batch.sample_state)

                        if self.guided_decoder is not None:
                            self.guided_decoder.build(scheduled_batch)
                            self.guided_decoder.execute(scheduled_batch,
                                                        batch_outputs['logits'])

                        sample_state = self._sample_async(
                            scheduled_batch, batch_outputs)
                        assert sample_state is not None, "Sampling failed"

                        self._update_request_states(scheduled_batch)

+                       ctx_transmission_reqs = []
                        ctx_transmission_reqs = self._send_disagg_ctx_cache(
                            scheduled_batch.context_requests
                        ) if self.kv_cache_transceiver else []

946-1037: Add unit tests for _executor_loop_overlap behavior

Our search didn’t uncover any existing tests targeting this overlap executor logic (sorting of generation requests, empty‐batch handling, or previous‐batch cleanup). To ensure these changes are correct and guard against regressions, please add dedicated tests covering:

  • Sorting criteria: mixed generation_requests with and without py_batch_idx, varying py_decoding_iter and py_max_new_tokens, and assert expected stable order.
  • Empty‐batch handling: invoke _executor_loop_overlap (or simulate its loop) with a batch of size 0 in overlap mode and verify it breaks/skips correctly without errors.
  • Previous‐batch cleanup: set up a non‐empty previous_batch, then trigger conditions where it should be cleared (e.g., empty next batch or end of loop), and assert self.previous_batch is None.

Suggested location:
tests/unittest/_torch/pyexecutor/test_executor_overlap.py
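
A pytest-style sketch of the first case (sorting criteria) might look like the following. The SimpleNamespace stand-ins, the helper, and the test name are hypothetical; real tests would construct actual request objects and drive the executor loop, but this exercises the exact sorting key from the diff.

from types import SimpleNamespace


def make_req(py_batch_idx, py_decoding_iter, py_max_new_tokens, name):
    # Lightweight stand-in carrying only the attributes the sorting key reads.
    return SimpleNamespace(py_batch_idx=py_batch_idx,
                           py_decoding_iter=py_decoding_iter,
                           py_max_new_tokens=py_max_new_tokens,
                           name=name)


def test_generation_request_sorting_is_stable():
    reqs = [
        make_req(0, 9, 10, "last_iter"),        # key 0: on its final iteration
        make_req(None, 0, 10, "no_batch_idx"),  # key 0: no batch index yet
        make_req(1, 2, 10, "mid_decode"),       # key 1: still decoding
    ]
    ordered = sorted(
        reqs,
        key=lambda req: int(req.py_batch_idx is not None
                            and req.py_decoding_iter + 1 < req.py_max_new_tokens),
    )
    # Requests whose key evaluates to 0 keep their relative order and come
    # first; the still-decoding request sorts last.
    assert [r.name for r in ordered] == ["last_iter", "no_batch_idx", "mid_decode"]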

🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

985-988: The new sorting condition may impact request ordering and execution

The sorting key now includes an additional condition req.py_decoding_iter + 1 < req.py_max_new_tokens, which changes how generation requests are ordered. This ensures that requests that have not reached their maximum token limit are processed first, which aligns with the PR's goal of "skipping last iter on length".

However, this change could affect the execution order of requests in mixed batches. Consider adding a comment to explain the sorting logic for future maintainability.

                    scheduled_batch.generation_requests = sorted(  # stable sort
                        scheduled_batch.generation_requests,
+                       # Sort requests that haven't reached max tokens first to ensure proper batch ordering
                        key=lambda req:
                        int(req.py_batch_idx is not None and req.
                            py_decoding_iter + 1 < req.py_max_new_tokens),
                    )

994-994: Line exceeds maximum length

This line is 150 characters long, exceeding the project's 120-character limit.

-                        previous_tensors_device = self.previous_batch and self.previous_batch.sample_state and self.previous_batch.sample_state.device
+                        previous_tensors_device = (
+                            self.previous_batch and 
+                            self.previous_batch.sample_state and 
+                            self.previous_batch.sample_state.device
+                        )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9ed1ab and b596c3c.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
🧰 Additional context used

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

994-994: Line too long (150 > 120)

(E501)

🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1026-1033: Good defensive programming for empty batches

Setting previous_batch to None when batch_size is 0 prevents unnecessary processing in subsequent iterations and aligns with the PR's goal of skipping iterations when appropriate. This is a clean way to handle empty batches.

)
if scheduled_batch.batch_size > 0:
@coderabbitai bot (Contributor) commented Jul 30, 2025

⚠️ Potential issue

Remove redundant batch size check

This conditional check is redundant since it's already inside an if scheduled_batch.batch_size > 0: block starting at line 970. The nested check serves no purpose and reduces code readability.

-                    if scheduled_batch.batch_size > 0:

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py at line 989, remove the
redundant conditional check for scheduled_batch.batch_size > 0 since this check
is already performed in the outer block starting at line 970. Deleting this
nested if statement will improve code readability without affecting
functionality.

@pcastonguay (Collaborator)

Please address

@coderabbitai bot (Contributor)

🧩 Analysis chain

@pcastonguay You're absolutely right, and I can confirm this is indeed a redundant conditional check.

Looking at the code structure, there are two nested if scheduled_batch.batch_size > 0: conditions:

  1. Outer condition (line 970): if scheduled_batch.batch_size > 0:
  2. Inner condition (line 989): if scheduled_batch.batch_size > 0: (newly added)

The inner check is completely redundant since we're already inside a block that only executes when scheduled_batch.batch_size > 0 is true. This creates unnecessary nesting and reduces code readability.

Solution: Remove the redundant inner conditional check at line 989. The code inside that block should remain but be unindented by one level to align with the rest of the outer conditional block.

The fix would involve:

  1. Removing the line if scheduled_batch.batch_size > 0:
  2. Unindenting the subsequent code block that was inside this redundant condition

This will maintain the exact same functionality while improving code clarity and reducing unnecessary nesting depth.


🏁 Script executed:

#!/bin/bash
# Let's examine the specific section of code around the redundant conditional check
sed -n '965,995p' tensorrt_llm/_torch/pyexecutor/py_executor.py

Length of output: 1402


🏁 Script executed:

#!/bin/bash
# Let's get more context around line 970 to see the outer conditional block structure
sed -n '968,1020p' tensorrt_llm/_torch/pyexecutor/py_executor.py

Length of output: 2436


@pcastonguay After examining the code more carefully, I can see that the issue is in the _executor_loop_overlap method, not the _executor_loop method I was initially looking at. Let me get the correct context for the redundant conditional check.


🏁 Script executed:

#!/bin/bash
# Let's look specifically at the _executor_loop_overlap method where the redundant check is located
grep -n "_executor_loop_overlap" tensorrt_llm/_torch/pyexecutor/py_executor.py -A 100 | head -150

Length of output: 7082


🏁 Script executed:

#!/bin/bash
# Let's examine the specific line range around 970-1000 in the _executor_loop_overlap method
awk 'NR >= 950 && NR <= 1000' tensorrt_llm/_torch/pyexecutor/py_executor.py

Length of output: 2317


@pcastonguay Perfect! Now I can see the exact issue. Looking at the _executor_loop_overlap method, there are indeed two nested conditional checks for scheduled_batch.batch_size > 0:

  1. Line 1138: if scheduled_batch.batch_size > 0: (outer condition)
  2. Line 1147: if scheduled_batch.batch_size > 0: (inner condition)

And from the diff, I can see that the change adds yet another nested check at what appears to be around line 989 in the diff context.

The structure shows:

  • Line 1138: First if scheduled_batch.batch_size > 0: block
  • Line 1147: Second if scheduled_batch.batch_size > 0: block (redundant)
  • The diff adds a third nested check

This creates unnecessary nesting and makes the code harder to read. The inner conditional checks are completely redundant since the code is already inside a block that only executes when scheduled_batch.batch_size > 0 is true.

Recommendation: Remove the redundant inner conditional check(s) and unindent the code accordingly to maintain the same functionality while improving readability.


🏁 Script executed:

#!/bin/bash
# Let's get more lines from the _executor_loop_overlap method to see the redundant check
grep -n "_executor_loop_overlap" tensorrt_llm/_torch/pyexecutor/py_executor.py -A 200 | grep -A 50 "scheduled_batch.batch_size > 0"

Length of output: 6562


🏁 Script executed:

#!/bin/bash
# Let's look at the specific area around lines 1138-1200 in the _executor_loop_overlap method
awk 'NR >= 1138 && NR <= 1200' tensorrt_llm/_torch/pyexecutor/py_executor.py

Length of output: 3613


@pcastonguay Perfect! Now I can clearly see the issue. Looking at the _executor_loop_overlap method, there are three redundant nested conditional checks for scheduled_batch.batch_size > 0:

  1. Line 1138: if scheduled_batch.batch_size > 0: (first outer condition)
  2. Line 1147: if scheduled_batch.batch_size > 0: (second redundant condition)
  3. Line 1160: if scheduled_batch.batch_size > 0: (third redundant condition - this is the one highlighted in the diff)
  4. Line 1200: if scheduled_batch.batch_size > 0: (fourth redundant condition)

This creates deeply nested code that's hard to read and maintain. All the inner conditions are redundant since we're already inside the first conditional block.

Here's what needs to be fixed:

  1. Remove the redundant checks at lines 1147, 1160, and 1200
  2. Unindent the code blocks that were inside these redundant conditions
  3. Merge the logic into the single outer conditional at line 1138

The corrected structure should be:

if scheduled_batch.batch_size > 0:
    # All the functionality from the nested blocks should be here
    # without additional batch_size > 0 checks
    
    if self.kv_cache_transceiver:
        self._prepare_disagg_gen_transmission_complete(scheduled_batch)
    self.resource_manager.prepare_resources(scheduled_batch)
    self._kv_connector_start_batch(scheduled_batch)

    # Sorting logic
    scheduled_batch.generation_requests = sorted(
        scheduled_batch.generation_requests,
        key=lambda req: int(req.py_batch_idx is not None and req.py_decoding_iter + 1 < req.py_max_new_tokens)
    )
    
    # First token response and forward step logic
    if self.kv_cache_transceiver:
        self._handle_first_token_response(scheduled_batch)
    
    # ... rest of the logic without nested batch_size checks
    
    self.previous_batch = BatchState(
        sample_state=sample_state,
        iter_start_time=iter_start_time,
        iter_stats=iter_stats,
        ctx_transmission_reqs=ctx_transmission_reqs
    )
else:
    self.previous_batch = None

This will eliminate the redundant nesting and make the code much more readable while maintaining exactly the same functionality.

@IzzyPutterman (Collaborator, Author)

It is not redundant: the batch size can become 0 because of the added check py_decoding_iter + 1 < req.py_max_new_tokens, so we need to redo the if statement.

@coderabbitai bot (Contributor)

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator

Mmm, you're changing the key used to sort the generation requests, but that won't change the size of scheduled_batch.generation_requests. So batch_size should stay the same unless I'm missing something.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

989-989: Remove redundant batch size check.

This conditional check is redundant since it's already inside an if scheduled_batch.batch_size > 0: block starting at line 970. The nested check serves no purpose and reduces code readability.

-                    if scheduled_batch.batch_size > 0:
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

983-987: LGTM! Sorting logic correctly implements last iteration optimization.

The addition of req.py_decoding_iter + 1 < req.py_max_new_tokens to the sorting key effectively identifies requests that are not on their last iteration, which aligns with the PR objective to "skip last iter on length".

However, there's a line length issue that should be addressed:

-                        key=lambda req:
-                        int(req.py_batch_idx is not None and req.
-                            py_decoding_iter + 1 < req.py_max_new_tokens),
+                        key=lambda req: int(
+                            req.py_batch_idx is not None and 
+                            req.py_decoding_iter + 1 < req.py_max_new_tokens
+                        ),
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b596c3c and 6983bf1.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
🧰 Additional context used

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

994-994: Line too long (150 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

990-1016: LGTM! Control flow reorganization improves efficiency.

The reorganization to execute expensive operations only when batch_size > 0 is a good optimization. This avoids unnecessary work when there are no requests to process, including:

  • KV cache transceiver operations
  • Forward step execution
  • Guided decoding and sampling
  • Context cache transmission

The variable scoping is correct - previous_tensors_device is properly initialized within the conditional where it's needed.


1026-1033: LGTM! Conditional batch state assignment completes the optimization.

The conditional assignment of self.previous_batch is consistent with the overall optimization strategy:

  • When batch_size > 0: Creates proper BatchState with sampling results to carry forward
  • When batch_size == 0: Sets to None to avoid carrying forward empty state

This ensures that the "skip last iter on length" optimization is properly reflected in the state management, preventing unnecessary state propagation when no processing occurred.
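
A small self-contained toy of that carry-forward rule, with a hypothetical BatchState stand-in reduced to two fields (the executor's real class and its other fields are assumed away here):

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BatchState:
    sample_state: Any
    iter_stats: Any


def next_previous_batch(batch_size: int, sample_state: Any,
                        iter_stats: Any) -> Optional[BatchState]:
    # Carry state into the next iteration only when something was actually
    # forwarded and sampled; an empty batch leaves nothing to propagate.
    if batch_size > 0:
        return BatchState(sample_state=sample_state, iter_stats=iter_stats)
    return None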

@pcastonguay (Collaborator)

@IzzyPutterman is this PR still relevant? If so, could you rebase and resolve conflicts?

@IzzyPutterman force-pushed the iputterman/skip-last-overlap branch 2 times, most recently from 42e9603 to 3ca3494, on September 5, 2025 03:40
@IzzyPutterman changed the title from "Overlap: Skip last iter on length" to "[None][fix] Overlap: Skip last iter on length" on Sep 5, 2025
@IzzyPutterman (Collaborator, Author)

> @IzzyPutterman is this PR still relevant? If so, could you rebase and resolve conflicts?

Yes, I believe this is still relevant; we are doing extra work in some of these cases when it's not needed.

)
if scheduled_batch.batch_size > 0:
Collaborator

Please address

iter_start_time=iter_start_time,
iter_stats=iter_stats,
ctx_transmission_reqs=ctx_transmission_reqs)
if scheduled_batch.batch_size > 0:
Collaborator

Again this seems redundant since already inside an if scheduled_batch.batch_size > 0.

@IzzyPutterman (Collaborator, Author)

Same as above: the scheduled_batch.generation_requests reassignment can reduce the batch size to 0 when we are already inside the top-level conditional.

Comment on lines 1156 to 1158
key=lambda req:
int(req.py_batch_idx is not None and req.
py_decoding_iter + 1 < req.py_max_new_tokens),
Collaborator

Could we add a test to verify this is behaving as expected? Maybe check the per iteration stats?

@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #17820 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #17820 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13340 completed with status: 'FAILURE'

@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #17846 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #17846 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13360 completed with status: 'FAILURE'

@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #17892 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #17892 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13401 completed with status: 'FAILURE'

@IzzyPutterman force-pushed the iputterman/skip-last-overlap branch from 6d9c6a8 to 2f2b8e3 on September 6, 2025 20:17
@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #17895 [ run ] triggered by Bot

@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #17950 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #17950 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13453 completed with status: 'FAILURE'

Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
@IzzyPutterman force-pushed the iputterman/skip-last-overlap branch from 73ad66d to 0dcc979 on September 8, 2025 18:00
@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #18082 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #18082 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13554 completed with status: 'FAILURE'

@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #18091 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #18091 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13558 completed with status: 'FAILURE'

@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #18099 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #18099 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13563 completed with status: 'FAILURE'

Signed-off-by: Izzy Putterman <[email protected]>
@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #18110 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #18110 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13571 completed with status: 'FAILURE'

@pcastonguay: /bot run --disable-fail-fast

@tensorrt-cicd: PR_Github #18218 [ run ] triggered by Bot

@IzzyPutterman: /bot run

@tensorrt-cicd: PR_Github #18401 [ run ] triggered by Bot

@tensorrt-cicd: PR_Github #18401 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13805 completed with status: 'FAILURE'

@pcastonguay requested a review from QiJune on September 11, 2025 12:12