[None][fix] Overlap: Skip last iter on length #5211
Conversation
Thanks for the change. How to test the change?
Force-pushed from 282cca7 to b596c3c
📝 Walkthrough
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes
Actionable comments posted: 1
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
1014-1031: Potential undefined variable error for ctx_transmission_reqs

The variable ctx_transmission_reqs is only defined inside the if self.kv_cache_transceiver condition (lines 1014-1016), but it is referenced unconditionally at line 1031. This will cause a NameError when kv_cache_transceiver is None. Initialize ctx_transmission_reqs before the conditional block:

    if scheduled_batch.batch_size > 0:
        if self.kv_cache_transceiver:
            # Return the first token to the client
            self._handle_first_token_response(scheduled_batch)

        previous_tensors_device = self.previous_batch and self.previous_batch.sample_state and self.previous_batch.sample_state.device

        batch_outputs = self._forward_step(scheduled_batch,
                                           previous_tensors_device)
        if self.previous_batch is not None:
            self._update_requests(self.previous_batch.sample_state)

        if self.guided_decoder is not None:
            self.guided_decoder.build(scheduled_batch)
            self.guided_decoder.execute(scheduled_batch,
                                        batch_outputs['logits'])

        sample_state = self._sample_async(scheduled_batch, batch_outputs)
        assert sample_state is not None, "Sampling failed"

        self._update_request_states(scheduled_batch)

+       ctx_transmission_reqs = []
        ctx_transmission_reqs = self._send_disagg_ctx_cache(
            scheduled_batch.context_requests
        ) if self.kv_cache_transceiver else []
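To make the failure mode concrete, here is a small self-contained sketch; the function names and list contents are illustrative stand-ins, not the real py_executor code:

def send_ctx_cache_unsafe(has_transceiver):
    # Bug pattern: the name is bound only inside the branch ...
    if has_transceiver:
        ctx_transmission_reqs = ["req0", "req1"]
    # ... so this line raises when the branch was skipped.
    return ctx_transmission_reqs


def send_ctx_cache_safe(has_transceiver):
    ctx_transmission_reqs = []  # initialize before the conditional
    if has_transceiver:
        ctx_transmission_reqs = ["req0", "req1"]
    return ctx_transmission_reqs


if __name__ == "__main__":
    print(send_ctx_cache_safe(False))          # -> []
    try:
        send_ctx_cache_unsafe(False)
    except NameError as err:                   # UnboundLocalError is a NameError
        print("unsafe variant raised:", err)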
946-1037: Add unit tests for _executor_loop_overlap behavior

Our search didn't uncover any existing tests targeting this overlap executor logic (sorting of generation requests, empty-batch handling, or previous-batch cleanup). To ensure these changes are correct and guard against regressions, please add dedicated tests covering:

- Sorting criteria: mixed generation_requests with and without py_batch_idx, varying py_decoding_iter and py_max_new_tokens, and assert the expected stable order (a sketch of such a test follows below).
- Empty-batch handling: invoke _executor_loop_overlap (or simulate its loop) with a batch of size 0 in overlap mode and verify it breaks/skips correctly without errors.
- Previous-batch cleanup: set up a non-empty previous_batch, then trigger conditions where it should be cleared (e.g., empty next batch or end of loop), and assert self.previous_batch is None.

Suggested location:
- tests/unittest/_torch/pyexecutor/test_executor_overlap.py
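As referenced in the first bullet, a minimal sketch of what the sorting-criteria test could look like, exercising only the sort key on stand-in request objects. The test name, the SimpleNamespace stand-ins, and the local copy of the key are assumptions; a real test would exercise the key through the actual executor:

from types import SimpleNamespace


def _overlap_sort_key(req):
    # Local copy of the sort key under review, kept here only for illustration.
    return int(req.py_batch_idx is not None
               and req.py_decoding_iter + 1 < req.py_max_new_tokens)


def _make_req(name, batch_idx, decoding_iter, max_new_tokens):
    return SimpleNamespace(name=name,
                           py_batch_idx=batch_idx,
                           py_decoding_iter=decoding_iter,
                           py_max_new_tokens=max_new_tokens)


def test_overlap_generation_request_sort_is_stable():
    reqs = [
        _make_req("mid_decode_a", 1, 2, 8),     # key 1: still has tokens to generate
        _make_req("no_batch_idx", None, 1, 8),  # key 0: no py_batch_idx
        _make_req("last_iter", 0, 7, 8),        # key 0: 7 + 1 == max_new_tokens
        _make_req("mid_decode_b", 2, 3, 8),     # key 1
    ]
    ordered = sorted(reqs, key=_overlap_sort_key)

    # Key-0 requests come first and key-1 requests last; relative order within
    # each group is preserved because sorted() is stable.
    assert [r.name for r in ordered] == [
        "no_batch_idx", "last_iter", "mid_decode_a", "mid_decode_b"
    ]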
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
985-988: The new sorting condition may impact request ordering and execution

The sorting key now includes an additional condition, req.py_decoding_iter + 1 < req.py_max_new_tokens, which changes how generation requests are ordered. This ensures that requests that have not reached their maximum token limit are processed first, which aligns with the PR's goal of "skipping last iter on length". However, this change could affect the execution order of requests in mixed batches. Consider adding a comment to explain the sorting logic for future maintainability.

     scheduled_batch.generation_requests = sorted(  # stable sort
         scheduled_batch.generation_requests,
+        # Sort requests that haven't reached max tokens first to ensure proper batch ordering
         key=lambda req: int(req.py_batch_idx is not None and req.
                             py_decoding_iter + 1 < req.py_max_new_tokens),
     )
994-994: Line exceeds maximum length

This line is 150 characters long, exceeding the project's 120-character limit.

-            previous_tensors_device = self.previous_batch and self.previous_batch.sample_state and self.previous_batch.sample_state.device
+            previous_tensors_device = (
+                self.previous_batch and
+                self.previous_batch.sample_state and
+                self.previous_batch.sample_state.device
+            )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
994-994: Line too long (150 > 120)
(E501)
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1026-1033: Good defensive programming for empty batches

Setting previous_batch to None when batch_size is 0 prevents unnecessary processing in subsequent iterations and aligns with the PR's goal of skipping iterations when appropriate. This is a clean way to handle empty batches.
if scheduled_batch.batch_size > 0:
Remove redundant batch size check
This conditional check is redundant since it's already inside an if scheduled_batch.batch_size > 0: block starting at line 970. The nested check serves no purpose and reduces code readability.
- if scheduled_batch.batch_size > 0:
Committable suggestion skipped: line range outside the PR's diff.
Please address
🧩 Analysis chain
@pcastonguay You're absolutely right, and I can confirm this is indeed a redundant conditional check.
Looking at the code structure, there are two nested if scheduled_batch.batch_size > 0: conditions:

- Outer condition (line 970): if scheduled_batch.batch_size > 0:
- Inner condition (line 989): if scheduled_batch.batch_size > 0: (newly added)

The inner check is completely redundant since we're already inside a block that only executes when scheduled_batch.batch_size > 0 is true. This creates unnecessary nesting and reduces code readability.
Solution: Remove the redundant inner conditional check at line 989. The code inside that block should remain but be unindented by one level to align with the rest of the outer conditional block.
The fix would involve:
- Removing the line if scheduled_batch.batch_size > 0:
- Unindenting the subsequent code block that was inside this redundant condition

This will maintain the exact same functionality while improving code clarity and reducing unnecessary nesting depth.
🏁 Script executed:
#!/bin/bash
# Let's examine the specific section of code around the redundant conditional check
sed -n '965,995p' tensorrt_llm/_torch/pyexecutor/py_executor.py
Length of output: 1402
🏁 Script executed:
#!/bin/bash
# Let's get more context around line 970 to see the outer conditional block structure
sed -n '968,1020p' tensorrt_llm/_torch/pyexecutor/py_executor.py
Length of output: 2436
@pcastonguay After examining the code more carefully, I can see that the issue is in the _executor_loop_overlap
method, not the _executor_loop
method I was initially looking at. Let me get the correct context for the redundant conditional check.
🏁 Script executed:
#!/bin/bash
# Let's look specifically at the _executor_loop_overlap method where the redundant check is located
grep -n "_executor_loop_overlap" tensorrt_llm/_torch/pyexecutor/py_executor.py -A 100 | head -150
Length of output: 7082
🏁 Script executed:
#!/bin/bash
# Let's examine the specific line range around 970-1000 in the _executor_loop_overlap method
awk 'NR >= 950 && NR <= 1000' tensorrt_llm/_torch/pyexecutor/py_executor.py
Length of output: 2317
@pcastonguay Perfect! Now I can see the exact issue. Looking at the _executor_loop_overlap method, there are indeed two nested conditional checks for scheduled_batch.batch_size > 0:

- Line 1138: if scheduled_batch.batch_size > 0: (outer condition)
- Line 1147: if scheduled_batch.batch_size > 0: (inner condition)
And from the diff, I can see that the change adds yet another nested check at what appears to be around line 989 in the diff context.
The structure shows:
- Line 1138: First if scheduled_batch.batch_size > 0: block
- Line 1147: Second if scheduled_batch.batch_size > 0: block (redundant)
- The diff adds a third nested check
This creates unnecessary nesting and makes the code harder to read. The inner conditional checks are completely redundant since the code is already inside a block that only executes when scheduled_batch.batch_size > 0
is true.
Recommendation: Remove the redundant inner conditional check(s) and unindent the code accordingly to maintain the same functionality while improving readability.
🏁 Script executed:
#!/bin/bash
# Let's get more lines from the _executor_loop_overlap method to see the redundant check
grep -n "_executor_loop_overlap" tensorrt_llm/_torch/pyexecutor/py_executor.py -A 200 | grep -A 50 "scheduled_batch.batch_size > 0"
Length of output: 6562
🏁 Script executed:
#!/bin/bash
# Let's look at the specific area around lines 1138-1200 in the _executor_loop_overlap method
awk 'NR >= 1138 && NR <= 1200' tensorrt_llm/_torch/pyexecutor/py_executor.py
Length of output: 3613
@pcastonguay Perfect! Now I can clearly see the issue. Looking at the _executor_loop_overlap method, there are three redundant nested conditional checks for scheduled_batch.batch_size > 0:

- Line 1138: if scheduled_batch.batch_size > 0: (first, outer condition)
- Line 1147: if scheduled_batch.batch_size > 0: (second, redundant condition)
- Line 1160: if scheduled_batch.batch_size > 0: (third, redundant condition - this is the one highlighted in the diff)
- Line 1200: if scheduled_batch.batch_size > 0: (fourth, redundant condition)
This creates deeply nested code that's hard to read and maintain. All the inner conditions are redundant since we're already inside the first conditional block.
Here's what needs to be fixed:
- Remove the redundant checks at lines 1147, 1160, and 1200
- Unindent the code blocks that were inside these redundant conditions
- Merge the logic into the single outer conditional at line 1138
The corrected structure should be:
if scheduled_batch.batch_size > 0:
    # All the functionality from the nested blocks should be here
    # without additional batch_size > 0 checks

    if self.kv_cache_transceiver:
        self._prepare_disagg_gen_transmission_complete(scheduled_batch)

    self.resource_manager.prepare_resources(scheduled_batch)
    self._kv_connector_start_batch(scheduled_batch)

    # Sorting logic
    scheduled_batch.generation_requests = sorted(
        scheduled_batch.generation_requests,
        key=lambda req: int(req.py_batch_idx is not None and
                            req.py_decoding_iter + 1 < req.py_max_new_tokens)
    )

    # First token response and forward step logic
    if self.kv_cache_transceiver:
        self._handle_first_token_response(scheduled_batch)

    # ... rest of the logic without nested batch_size checks

    self.previous_batch = BatchState(
        sample_state=sample_state,
        iter_start_time=iter_start_time,
        iter_stats=iter_stats,
        ctx_transmission_reqs=ctx_transmission_reqs
    )
else:
    self.previous_batch = None
This will eliminate the redundant nesting and make the code much more readable while maintaining exactly the same functionality.
It is not redundant, the batch size can be set to 0 because of the added check py_decoding_iter + 1 < req.py_max_new_tokens, so we need to redo the if statement.
Mmm, you're changing the key used to sort the generation requests, but that won't change the size of scheduled_batch.generation_requests. So batch_size should stay the same unless I'm missing something.
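As a quick illustration of this point, sorting alone re-orders a list but never changes its length; the toy request objects below are stand-ins, not the real LlmRequest class:

from types import SimpleNamespace

# Toy generation requests; only the fields used by the sort key matter here.
reqs = [
    SimpleNamespace(py_batch_idx=0, py_decoding_iter=7, py_max_new_tokens=8),
    SimpleNamespace(py_batch_idx=None, py_decoding_iter=1, py_max_new_tokens=8),
    SimpleNamespace(py_batch_idx=1, py_decoding_iter=2, py_max_new_tokens=8),
]

sorted_reqs = sorted(
    reqs,
    key=lambda req: int(req.py_batch_idx is not None and
                        req.py_decoding_iter + 1 < req.py_max_new_tokens),
)

# The key only affects ordering; the number of scheduled requests is unchanged.
assert len(sorted_reqs) == len(reqs)
print(len(reqs), len(sorted_reqs))  # 3 3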
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
989-989: Remove redundant batch size check.

This conditional check is redundant since it's already inside an if scheduled_batch.batch_size > 0: block starting at line 970. The nested check serves no purpose and reduces code readability.

-            if scheduled_batch.batch_size > 0:
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
983-987: LGTM! Sorting logic correctly implements last-iteration optimization.

The addition of req.py_decoding_iter + 1 < req.py_max_new_tokens to the sorting key effectively identifies requests that are not on their last iteration, which aligns with the PR objective to "skip last iter on length". However, there's a line-length issue that should be addressed:

-            key=lambda req:
-            int(req.py_batch_idx is not None and req.
-                py_decoding_iter + 1 < req.py_max_new_tokens),
+            key=lambda req: int(
+                req.py_batch_idx is not None and
+                req.py_decoding_iter + 1 < req.py_max_new_tokens
+            ),
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
990-1016: LGTM! Control flow reorganization improves efficiency.

The reorganization to execute expensive operations only when batch_size > 0 is a good optimization. This avoids unnecessary work when there are no requests to process, including:
- KV cache transceiver operations
- Forward step execution
- Guided decoding and sampling
- Context cache transmission
The variable scoping is correct -
previous_tensors_device
is properly initialized within the conditional where it's needed.
1026-1033: LGTM! Conditional batch state assignment completes the optimization.

The conditional assignment of self.previous_batch is consistent with the overall optimization strategy:

- When batch_size > 0: creates a proper BatchState with sampling results to carry forward
- When batch_size == 0: sets it to None to avoid carrying forward empty state

This ensures that the "skip last iter on length" optimization is properly reflected in the state management, preventing unnecessary state propagation when no processing occurred.
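A compact, purely illustrative sketch of this state-management pattern; the names and the toy loop are assumptions, not the real executor code:

def run_toy_overlap_loop(batch_sizes):
    """Toy loop: only non-empty batches do work or become the carried-over state."""
    previous_batch = None
    forward_calls = 0

    for size in batch_sizes:
        if size > 0:
            forward_calls += 1          # stand-in for the expensive forward step
            previous_batch = list(range(size))
        else:
            previous_batch = None       # do not carry stale state past an empty batch

    return forward_calls, previous_batch


if __name__ == "__main__":
    # Three scheduled iterations, one of them empty: only two forward calls,
    # and the final carried state comes from the last non-empty batch.
    print(run_toy_overlap_loop([3, 0, 2]))  # -> (2, [0, 1])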
@IzzyPutterman is this PR still relevant? If so, could you rebase and resolve conflicts?
Force-pushed from 42e9603 to 3ca3494
Yes, I believe this is still relevant; we are doing extra work in some of these cases when it's not needed.
if scheduled_batch.batch_size > 0:
Please address
iter_start_time=iter_start_time,
iter_stats=iter_stats,
ctx_transmission_reqs=ctx_transmission_reqs)
if scheduled_batch.batch_size > 0:
Again, this seems redundant since it's already inside an if scheduled_batch.batch_size > 0.
Same as above, the scheduled_batch.generation_requests reassignment can reduce BS to 0 when already inside the top-level conditional.
key=lambda req:
int(req.py_batch_idx is not None and req.
    py_decoding_iter + 1 < req.py_max_new_tokens),
Could we add a test to verify this is behaving as expected? Maybe check the per iteration stats?
/bot run
PR_Github #17820 [ run ] triggered by Bot
PR_Github #17820 [ run ] completed with state
/bot run
PR_Github #17846 [ run ] triggered by Bot
PR_Github #17846 [ run ] completed with state
/bot run
PR_Github #17892 [ run ] triggered by Bot
PR_Github #17892 [ run ] completed with state
Force-pushed from 6d9c6a8 to 2f2b8e3
/bot run
PR_Github #17895 [ run ] triggered by Bot
/bot run
PR_Github #17950 [ run ] triggered by Bot
PR_Github #17950 [ run ] completed with state
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Force-pushed from 73ad66d to 0dcc979
/bot run
PR_Github #18082 [ run ] triggered by Bot
PR_Github #18082 [ run ] completed with state
/bot run
PR_Github #18091 [ run ] triggered by Bot
PR_Github #18091 [ run ] completed with state
/bot run
PR_Github #18099 [ run ] triggered by Bot
PR_Github #18099 [ run ] completed with state
Signed-off-by: Izzy Putterman <[email protected]>
/bot run
PR_Github #18110 [ run ] triggered by Bot
PR_Github #18110 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #18218 [ run ] triggered by Bot
/bot run
PR_Github #18401 [ run ] triggered by Bot
PR_Github #18401 [ run ] completed with state