[TRTLLM-5271][feat] best_of/n for pytorch workflow #5997
Conversation
85a0b9c to e3660a1 (Compare)
Thanks for your help @evezhier 👍 .
3b5742d to 654bac2 (Compare)
⚠️ Rate limit exceeded: @evezhier has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 10 seconds before requesting another review. After the wait time has elapsed, a review can be triggered again; we recommend spacing out commits to avoid hitting the rate limit. CodeRabbit enforces hourly rate limits for each developer per organization; paid plans have higher rate limits than the trial, open-source, and free plans, and further reviews are re-allowed after a brief timeout. See the FAQ for details.

📒 Files selected for processing (9)
📝 Walkthrough

The changes add defaulted move and copy constructors to the C++ LLM request classes, expose new properties and methods in the Python bindings, extend the Python LlmRequest to handle child requests, update the PyExecutor to manage hierarchical requests and responses, introduce CLI arguments for multi-sequence generation, and add unit tests validating multi-output and child-request functionality. The ExecutorRequestQueue is enhanced to track child request IDs and incorporate them in enqueue, dequeue, latency tracking, and merging operations.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant PyExecutor
    participant LlmRequest
    participant LlmResponse
    User->>CLI: Provide prompts with --n and --best_of
    CLI->>PyExecutor: Submit ExecutorRequest(s)
    PyExecutor->>LlmRequest: Create parent LlmRequest
    loop For each child_req_id
        PyExecutor->>LlmRequest: create_child_request(child_id)
        LlmRequest->>LlmRequest: Add child to children list
    end
    PyExecutor->>LlmRequest: Merge parent and children into batch
    LlmRequest->>LlmResponse: create_response (handles parent/child IDs)
    PyExecutor->>User: Return multiple outputs per prompt
```
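For orientation, the flow above corresponds roughly to the following LLM API usage; this is a minimal sketch (model path and prompt are placeholders, not taken from the PR), using the `n`/`best_of` sampling parameters this PR wires through:

```python
from tensorrt_llm import LLM, SamplingParams

# Hypothetical usage sketch: ask for 2 returned sequences sampled from
# 3 candidates per prompt (best_of >= n), without beam search.
llm = LLM(model="/path/to/hf_model")  # placeholder model path
params = SamplingParams(max_tokens=32, n=2, best_of=3, temperature=0.8)

for output in llm.generate(["The capital of France is"], params):
    # Each RequestOutput carries n completions, produced internally via
    # the parent/child requests shown in the diagram above.
    for seq_idx, completion in enumerate(output.outputs):
        print(seq_idx, completion.text)
```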
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~45 minutes
Actionable comments posted: 2
♻️ Duplicate comments (9)
tests/unittest/_torch/test_best_of_n.py (1)
`32-37`: Consider using a fixture to improve test efficiency.

Creating the LLM instance inside the test function means it gets recreated for each parameter combination, which is time-consuming. As mentioned in previous reviews, consider using a pytest fixture to create the LLM once and reuse it across test cases.

Example fixture:

```python
@pytest.fixture(scope="module")
def llm_instance():
    return LLM(model=os.path.join(llm_models_root(), "llama-models-v2",
                                  "TinyLlama-1.1B-Chat-v1.0"),
               kv_cache_config=global_kvcache_config,
               max_batch_size=128,
               max_seq_len=128,
               enable_trtllm_sampler=True)
```
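A test could then consume the fixture instead of constructing the model itself; a rough sketch (test name and prompt are illustrative, not from the PR):

```python
import pytest
from tensorrt_llm import SamplingParams

@pytest.mark.parametrize("n, best_of", [(2, None), (2, 3)])
def test_n_outputs(llm_instance, n, best_of):
    params = SamplingParams(max_tokens=32, n=n, best_of=best_of)
    outputs = llm_instance.generate(["Hello, my name is"], params)
    # Each RequestOutput should expose n generated sequences.
    assert len(outputs[0].outputs) == n
```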
cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

`365-365`: Remove or uncomment the move constructor binding.

This commented line should either be removed or uncommented to maintain code cleanliness.
examples/llm-api/quickstart_advanced.py (1)
`234-240`: Clarify parameter usage for beam search and best_of.

The current logic mixes `max_beam_width` with `n` and `best_of` parameters, which can be confusing. Consider:

- Deprecating `max_beam_width` in favor of `use_beam_search` (as suggested in past reviews)
- Making the parameter relationships clearer

```diff
-        best_of=args.max_beam_width
-        if args.max_beam_width > 1 else args.best_of,
+        # Clearly document the parameter logic
+        # When beam search is used, best_of defaults to beam width
+        best_of=args.best_of if args.best_of is not None else (
+            args.max_beam_width if args.max_beam_width > 1 else args.n
+        ),
```

tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
`507-511`: Consider moving the child append into create_child_request.

For consistency with the C++ implementation, consider moving the append operation into the `create_child_request` method as suggested in past reviews.

`361-380`: Fix attribute copying in create_child_request.

The current implementation copies all attributes from parent to child using `__dict__.update()`, which can lead to issues:

- It overwrites the child's already initialized attributes
- It doesn't properly handle mutable attributes like lists

Based on past review comments, consider this improved approach:

```diff
 def create_child_request(self, child_id):
     child = super().create_child_request(child_id)
     py_request = LlmRequest(llm_request=child)
-    py_request.__dict__.update(**self.__dict__)
+    # Copy only Python-specific attributes from parent to child
+    for attr_name, attr_value in self.__dict__.items():
+        if attr_name.startswith('py_') and attr_name not in ['py_request_id', 'py_result']:
+            setattr(py_request, attr_name, copy.deepcopy(attr_value))
+        elif attr_name in ['is_attention_dp_dummy', 'is_cuda_graph_dummy']:
+            setattr(py_request, attr_name, attr_value)
     py_request.py_result = PyResult(
         self.py_prompt_len, self.py_max_new_tokens,
         self.py_return_logits_device_memory, self.streaming,
         self.py_return_log_probs, self.py_return_context_logits,
         self.py_return_generation_logits)
     py_request.py_request_id = child.request_id
     py_request.children = []
+    self.children.append(py_request)
```

Also, add the import at the top of the file:

```python
import copy
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
`534-542`: Simplify child request handling using extend().

As suggested in the past review, this can be simplified for better readability and efficiency.

```diff
     req_with_children = []
     for req_item in new_requests:
         req = executor_request_to_llm_request(
             req_item.id, req_item.request, req_item.child_req_ids,
             self._should_exclude_last_generation_logits())
         req_with_children.append(req)
-        for child in req.children:
-            req_with_children.append(child)
+        if req.children:
+            req_with_children.extend(req.children)
     return req_with_children
```

`2108-2111`: Consider whether child requests should terminate independently.

The current implementation terminates child requests when the parent finishes. However, a past review comment suggested that "each request should be terminated by themselves, not from the parent request."

Consider whether child requests should manage their own termination lifecycle. If they should terminate independently, simplify to:

```diff
 if response.result.is_final:
     requests_to_terminate.append(request)
-    for child in request.children:
-        requests_to_terminate.append(child)
```

`371-381`: Extract child request ID generation into a helper method.

This logic is duplicated in the `enqueue_request` method. As suggested in the past review, consider extracting this to a helper function to follow DRY principles.

Add a helper method:

```python
def _generate_child_request_ids(self, request: ExecutorRequest) -> List[int]:
    """Generate child request IDs for a given request."""
    child_req_ids = []
    num_child_requests = _get_num_child_requests(request)
    for _ in range(num_child_requests):
        self.next_req_id += 1
        child_req_ids.append(self.next_req_id)
    return child_req_ids
```

Then use it here:

```diff
-        child_req_ids = []
-        num_child_requests = _get_num_child_requests(request)
-        for _ in range(num_child_requests):
-            self.next_req_id += 1
-            child_req_ids.append(self.next_req_id)
+        child_req_ids = self._generate_child_request_ids(request)
```

`501-507`: Use the same helper method for child ID generation.

This is duplicate code that should use the same helper method suggested above.

```diff
-        child_req_ids = []
-        num_child_requests = _get_num_child_requests(request)
-        for _ in range(num_child_requests):
-            self.next_req_id += 1
-            child_req_ids.append(self.next_req_id)
+        child_req_ids = self._generate_child_request_ids(request)
```
🧹 Nitpick comments (2)
tests/unittest/_torch/test_best_of_n.py (1)

`28-31`: Consider expanding test coverage beyond the current parameters.

The current parametrization only tests `n=2` with `best_of` values of `None` and `3`. Consider adding more comprehensive test cases, such as (see the sketch after this list):

- Different values of `n` (e.g., 1, 3, 4)
- Edge cases like `n == best_of`
- Error conditions when `n > best_of`
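A sketch of how the parametrization might be extended along those lines (the fixture and test names follow the suggestion above and are illustrative; the last case assumes the API rejects `n > best_of`, so adjust it to however the error actually surfaces):

```python
import pytest
from tensorrt_llm import SamplingParams

@pytest.mark.parametrize("n, best_of", [
    (1, None),  # single output, best_of defaults to n
    (3, 3),     # edge case: n == best_of
    (3, 4),     # more candidates than returned sequences
    (4, None),  # larger n, best_of defaults to n
])
def test_n_outputs_extended(llm_instance, n, best_of):
    params = SamplingParams(max_tokens=32, n=n, best_of=best_of)
    outputs = llm_instance.generate(["Hello, my name is"], params)
    assert len(outputs[0].outputs) == n

def test_n_greater_than_best_of_rejected():
    # Hypothetical: n > best_of is expected to be rejected up front.
    with pytest.raises(Exception):
        SamplingParams(max_tokens=8, n=3, best_of=2)
```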
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`60-60`: Use a more specific type hint for `child_req_ids`.

For better type safety and clarity, consider using a more specific type hint.

```diff
-    child_req_ids: Optional[list] = None
+    child_req_ids: Optional[List[int]] = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3 hunks)
- examples/llm-api/quickstart_advanced.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
tests/unittest/_torch/test_best_of_n.py (1)
- tensorrt_llm/sampling_params.py (1): SamplingParams (125-477)

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)
- cpp/tensorrt_llm/batch_manager/llmRequest.cpp (6): validate (225-297), validate (225-226), createResponse (43-52), createResponse (43-43), createChildRequest (299-328), createChildRequest (299-299)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (5): append (77-96), append (123-140), executor_request_to_llm_request (419-512), LlmResponse (254-261), create_response (340-350)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2)
`470-472`: LGTM: Well-structured explicit defaulted constructors.

The explicit defaulting of move and copy constructors is appropriate for this complex class. Given that the class contains standard library types (optional, shared_ptr, vectors) with well-defined copy/move semantics, the default behavior should work correctly to support the enhanced child request functionality.

`2321-2323`: LGTM: Consistent constructor declarations with the base class.

The explicit defaulting of move and copy constructors in the derived class properly matches the pattern established in the base `GenericLlmRequest` class, maintaining consistency in the class hierarchy.

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

`197-198`: LGTM! Hierarchical request support properly exposed.

The new bindings correctly expose the parent-child request functionality from C++ to Python, enabling the hierarchical request model needed for best_of/n support.

Also applies to: 364-364, 371-371

examples/llm-api/quickstart_advanced.py (1)

`253-269`: LGTM! Clear output formatting for multiple sequences.

The updated output formatting with descriptive variable names and conditional sequence ID display improves readability when using beam search or multiple sequences.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

`280-301`: LGTM! Constructor and response handling properly support child requests.

The constructor changes correctly handle wrapping existing request objects, and the response ID logic properly distinguishes between parent and child requests.

Also applies to: 335-335, 347-350

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

`119-122`: LGTM! The modification correctly accounts for child requests when calculating the total request count, ensuring proper capacity management.

`2008-2008`: LGTM! The logic correctly handles cancellation for both parent and child requests by checking the appropriate request ID.

`2022-2050`: LGTM! The refactoring from dictionary to list of tuples correctly supports multiple responses per request ID, which is essential for the parent-child request hierarchy.
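For illustration only, the shape change amounts to collecting `(req_id, response)` pairs so that a single request ID can contribute several responses; a simplified sketch, not the PR's actual code:

```python
from typing import Any, List, Tuple

def collect_responses(requests) -> List[Tuple[int, Any]]:
    # A dict keyed by req_id can hold only one response per ID; a list of
    # (req_id, response) tuples keeps every parent and child response.
    new_responses: List[Tuple[int, Any]] = []
    for request in requests:
        response = request.create_response()
        if response is not None:
            new_responses.append((request.py_request_id, response))
    return new_responses
```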
Thanks @evezhier for your help. I left several minor review comments below. Overall it looks good.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unittest/_torch/test_best_of_n.py (2)
`36-41`: Consider using a more realistic model path and reducing max_batch_size.

The hardcoded model path `/model` may not exist in test environments. Additionally, the `max_batch_size=8` could be reduced to better align with actual test usage and potentially expose vector size or access pattern errors.

`108-112`: Fix the similarity check logic.

The condition `if n == best_of` for the similarity check is problematic. When `best_of > n`, the system still returns the `n` best outputs from `best_of` candidates, so the similarity check should still be meaningful. The current logic means that when `n=2` and `best_of=3`, no similarity verification is performed, reducing test coverage.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`60-60`: Improve type annotation for better type safety.

The `child_req_ids` field should use a more specific type annotation for better type safety and clarity.

```diff
-    child_req_ids: Optional[list] = None
+    child_req_ids: Optional[List[int]] = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tests/unittest/_torch/test_best_of_n.py (2)
`46-90`: LGTM! Child request creation logic is well tested.

The test comprehensively validates child request creation, including:

- Correct number of children created
- Proper ID assignment and parent-child relationships
- Attribute inheritance and independence verification
- Result and token independence checks

`114-133`: LGTM! Async test provides good coverage for batch-size exceedance.

The test effectively validates that the system can handle multiple asynchronous requests exceeding the max batch size, ensuring each result contains the expected number of outputs.
tensorrt_llm/_torch/pyexecutor/llm_request.py (6)
`1-1`: LGTM! Proper import for deep copying child request attributes.

The `deepcopy` import is correctly added to support copying complex attributes when creating child requests.

`281-302`: LGTM! Constructor properly handles both new and wrapped requests.

The constructor correctly handles the optional `llm_request` parameter to wrap existing internal request objects, which is essential for child request creation.

`336-336`: LGTM! Children tracking attribute properly initialized.

The `children` list is correctly initialized to track child requests created from this parent request.

`348-351`: LGTM! Response creation properly handles child request IDs.

The `create_response` method correctly uses `self.py_request_id` for child requests and `self.parent_request_id` for parent requests, ensuring proper request ID assignment in responses.

`362-387`: LGTM! Child request creation is well implemented.

The `create_child_request` method properly:

- Creates the child request through the superclass method
- Wraps it in the Python `LlmRequest` wrapper
- Deep copies all `py_*` attributes from the parent
- Resets child-specific attributes appropriately
- Validates parent-child relationships with assertions
- Ensures distinct sampling seeds for different outputs
- Appends the child to the parent's children list

`514-517`: LGTM! Child request creation integrated into the utility function.

The `executor_request_to_llm_request` function correctly creates child requests when `child_req_ids` is provided, ensuring a proper hierarchical request structure.

tensorrt_llm/_torch/pyexecutor/py_executor.py (8)
`112-119`: LGTM! Child request counting logic is correct.

The implementation properly accounts for both parent and child requests when determining queue capacity, ensuring the total request count doesn't exceed the maximum allowed.

`358-382`: LGTM! Request ID generation and child request logic are well implemented.

The implementation correctly:

- Uses 64-bit wraparound for request ID generation to prevent overflow (see the sketch after this list)
- Creates child requests only when appropriate (beam_width == 1 and num_return_sequences > 1)
- Properly tracks performance statistics for child requests
- Returns None when no child requests are needed
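As a minimal sketch of the wraparound idea only (attribute and method names are illustrative, not the PR's exact code):

```python
MAX_UINT64 = 2**64 - 1

def _get_next_req_id(self) -> int:
    # Advance the counter and wrap at the 64-bit boundary so the request
    # ID always fits in an unsigned 64-bit integer.
    self.next_req_id = (self.next_req_id + 1) & MAX_UINT64
    return self.next_req_id
```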
`393-403`: LGTM! Request enqueuing logic properly handles child requests.

The implementation correctly generates request IDs and child request IDs, passing them to the queue item constructor. Performance statistics are appropriately tracked for the parent request only.

`559-567`: LGTM! Request merging logic correctly handles hierarchical requests.

The implementation properly:

- Converts each request item to an LlmRequest with child request IDs
- Flattens the parent-child hierarchy into a single list
- Includes both parent and child requests in the returned list

This approach is consistent with the design of treating child requests as separate entities in the active requests list.

`2033-2033`: LGTM! Cancellation logic correctly handles child requests.

The implementation properly uses the parent request ID for child requests when checking for cancellation. This ensures that when a parent request is canceled, all its child requests are also canceled appropriately.

`2047-2047`: LGTM! Parameter type change improves type safety.

The change from a dictionary to `List[Tuple[int, LlmResponse]]` for the responses parameter improves type safety and is consistent with the new response handling approach.

`2079-2087`: LGTM! Response handling correctly uses the list-based approach.

The implementation properly creates a list of tuples for new responses, consistent with the updated response handling pattern.

`2092-2132`: Semantic equivalence confirmed: `request.is_finished` matches `response.result.is_final`.

The `is_final` boolean returned by `create_serialized_result` in `LlmResult` is set when any of the finish-reason bits (EOS, stop words, or max length) are reached. Under the hood, the C++ request object updates its finished state at the same time, and the Python binding exposes that via `request.is_finished`. You can safely use `request.is_finished` in place of checking `response.result.is_final`.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`516-539`: Fix incomplete conditional structure.

The `else:` statement on line 525 lacks a corresponding `if` statement, creating invalid syntax. Based on the context, it appears there should be an `if` condition checking for performance stats.

```diff
-        if self.enable_iter_perf_stats:
-            self.start_times[req_id] = time.time()
-
-        else:
-            child_req_ids = self._generate_child_request_ids(request)
+        if self.enable_iter_perf_stats:
+            self.start_times[req_id] = time.time()
+
+        child_req_ids = self._generate_child_request_ids(request)
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`61-61`: Improve type annotation specificity.

The `child_req_ids` field should have a more specific type annotation to indicate it's a list of integers.

```diff
-    child_req_ids: Optional[list] = None
+    child_req_ids: Optional[List[int]] = None
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (3): executor_request_to_llm_request (426-518), LlmResponse (255-262), create_response (341-351)
- tensorrt_llm/logger.py (1): debug (143-144)
- tensorrt_llm/_utils.py (1): nvtx_range (825-844)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1884-1884: Undefined name `Tuple` (F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/py_executor.py (8)
`113-120`: LGTM! Proper handling of child requests in capacity planning.

The logic correctly accounts for both parent and child requests when determining if a request can be scheduled within the capacity limits. The implementation prevents exceeding `max_req_count` by breaking when the total would exceed the limit.

`361-385`: LGTM! Well-implemented request ID generation and child request creation.

The implementation correctly:

- Uses proper 64-bit unsigned integer wraparound for request ID generation
- Creates child requests only for non-beam-search scenarios (beam_width == 1 and num_return_sequences > 1)
- Properly tracks performance statistics when enabled
- Provides clear documentation

`396-407`: LGTM! Proper integration of child request generation in batch enqueue.

The method correctly integrates the new child request generation logic while maintaining backward compatibility. Performance stats are appropriately tracked only for parent requests.

`573-581`: LGTM! Proper flattening of parent and child requests.

The implementation correctly creates a flattened list of parent and child `LlmRequest` objects, which is appropriate for the executor's processing needs. The logic properly handles both parent requests and their children.

`870-870`: LGTM! Proper cancellation handling for hierarchical requests.

The logic correctly uses the parent request ID for child requests during cancellation, ensuring that canceling a parent request properly affects all its children.

`1896-1901`: LGTM! Proper handling of the list-based response structure.

The updated logic correctly processes the new list-based response structure and properly flattens responses during gathering across distributed ranks.

`1907-1912`: LGTM! Proper response storage for multiple sequences.

The updated logic correctly stores multiple responses per request ID as lists, which is essential for supporting hierarchical requests where one parent can generate multiple child responses.

`1916-1924`: LGTM! Consistent response handling updates.

All response handling methods have been consistently updated to use the new list-based structure with (req_id, response) tuples. This maintains consistency across different response scenarios and properly supports the hierarchical request architecture.

Also applies to: 1929-1930, 1963-1965, 1969-1969
8b25d46 to 1c02f85 (Compare)
@evezhier -- I think that @NVIDIA/trtllm-bench-reviewers isn't needed here; I don't see any
Actionable comments posted: 1
♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
`517-536`: Request is never enqueued when `enable_iter_perf_stats` is True.

The `else:` block that generates `child_req_ids` and enqueues the request is only executed when `self.enable_iter_perf_stats` is False. When perf-stats are enabled, the request never reaches `self.request_queue`, effectively dropping the call.

Fix by unconditionally generating `child_req_ids` and performing the `put`, while keeping the optional timing logic:

```diff
-        if self.enable_iter_perf_stats:
-            self.start_times[req_id] = time.time()
-        else:
-            child_req_ids = self._generate_child_request_ids(request)
-            ...
-            self.request_queue.put(RequestQueueItem(...))
+        if self.enable_iter_perf_stats:
+            self.start_times[req_id] = time.time()
+
+        child_req_ids = self._generate_child_request_ids(request)
+        logger.debug(f"Executor request {req_id} child reqs: {child_req_ids}")
+        self.request_queue.put(
+            RequestQueueItem(req_id, request, child_req_ids, query=query)
+        )
```
`14-14`: Missing `Tuple` import breaks type-checking & runtime introspection.

`_enqueue_responses` uses `Tuple` in annotations (see ~1884) but `Tuple` isn't imported from `typing`, causing `NameError` under strict checking.

```diff
-from typing import List, Optional, Union
+from typing import List, Optional, Union, Tuple
```
🧹 Nitpick comments (2)
examples/llm-api/quickstart_advanced.py (1)
`237-237`: Line exceeds 120 chars.

Ruff flags this (`E501`). Consider breaking the line to stay within project style.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
`362-387`: Deep-copying a large `py_result` may waste memory.

The blanket deepcopy of every `py_*` attribute clones `py_result`, duplicating potentially large tensors (logits/log-probs) for each child. Unless children truly need a full copy of the parent's stored results, consider:

```diff
-        if attr_name.startswith('py_'):
+        if attr_name.startswith('py_') and attr_name != 'py_result':
```

and instantiating a fresh, empty `PyResult` for the child.
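Putting the two pieces together, the copy loop inside `create_child_request` might look roughly like this; a sketch assembled from the snippets above, not the final implementation:

```python
# Copy Python-side state, but skip the parent's accumulated results.
for attr_name, attr_value in self.__dict__.items():
    if attr_name.startswith('py_') and attr_name != 'py_result':
        setattr(py_request, attr_name, copy.deepcopy(attr_value))

# Give the child a fresh, empty PyResult instead of a cloned one.
py_request.py_result = PyResult(
    self.py_prompt_len, self.py_max_new_tokens,
    self.py_return_logits_device_memory, self.streaming,
    self.py_return_log_probs, self.py_return_context_logits,
    self.py_return_generation_logits)
```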
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (12 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unittest/_torch/test_best_of_n.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (3): executor_request_to_llm_request (426-518), LlmResponse (255-262), create_response (341-351)
- tensorrt_llm/logger.py (1): debug (143-144)
- tensorrt_llm/_utils.py (1): nvtx_range (825-844)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1884-1884: Undefined name `Tuple` (F821)
examples/llm-api/quickstart_advanced.py
237-237: Line too long (137 > 120) (E501)
🔇 Additional comments (2)
examples/llm-api/quickstart_advanced.py (1)
`264-270`: Minor: variable shadowing of `sequence_idx`.

`enumerate(output.outputs)` already yields `sequence_idx`; the formatted string additionally derives `sequence_id_text` from `args.n`/`args.max_beam_width`. All good, just flagging that `sequence_idx` may differ from the beam index in beam-search mode when beams are reordered.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

`349-350`: Response ID logic looks correct.

Good catch ensuring child responses report the parent's ID; this keeps client-side correlation intact.
PR_Github #13842 [ run ] triggered by Bot
PR_Github #13842 [ run ] completed with state
f20dc61 to faa0ec2 (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/llm-api/quickstart_advanced.py (1)
`234-241`: Fix `best_of` calculation timing and assertion logic.

The current implementation has the same issues identified in previous reviews:

- `best_of` is calculated before `args.n` is potentially updated, making it stale
- The assertion message contradicts the actual check

Apply the fix suggested in previous reviews:

```diff
-    use_beam_search = args.max_beam_width > 1
-    best_of = args.best_of or args.n
-    if use_beam_search:
-        if args.n == 1 and args.best_of is None:
-            args.n = args.max_beam_width
-        assert best_of <= args.max_beam_width, f"beam width: {best_of}, should be less or equal to max_beam_width: {args.max_beam_width}"
-
-    assert best_of >= args.n, f"In sampling mode best_of value: {best_of} should be less or equal to n: {args.n}"
+    use_beam_search = args.max_beam_width > 1
+    if use_beam_search and args.n == 1 and args.best_of is None:
+        args.n = args.max_beam_width
+
+    best_of = args.best_of or args.n
+    if use_beam_search:
+        assert best_of <= args.max_beam_width, f"beam width: {best_of}, should be less or equal to max_beam_width: {args.max_beam_width}"
+
+    assert best_of >= args.n, f"In sampling mode best_of value: {best_of} should be greater or equal to n: {args.n}"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
- tests/unittest/_torch/test_executor_request_queue.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tests/unittest/_torch/test_executor_request_queue.py
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- tests/unittest/_torch/test_best_of_n.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
examples/llm-api/quickstart_advanced.py
🪛 Ruff (0.12.2)
examples/llm-api/quickstart_advanced.py
239-239: Line too long (137 > 120) (E501)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
134-134: SyntaxError: Expected a statement
134-134: SyntaxError: Expected a statement
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (9)
`25-25`: LGTM: Child request IDs field added correctly.

The optional `child_req_ids` field is properly typed and defaulted, supporting the hierarchical request functionality.

`87-91`: LGTM: Child request calculation logic is correct.

The method correctly determines that child requests are needed only for non-beam-search cases (`beam_width <= 1`) with multiple return sequences (see the sketch below).
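In other words, the child count reduces to something like the following sketch (the helper name mirrors `_get_num_child_requests` used elsewhere in the PR; the sampling-config attribute names are assumptions):

```python
def _get_num_child_requests(request) -> int:
    # Children are only needed for sampling (no beam search) when more
    # than one sequence is returned; the parent itself produces one of
    # them, so num_return_sequences - 1 children are created.
    sampling_config = request.sampling_config  # assumed attribute
    if sampling_config.beam_width > 1:
        return 0
    return max(sampling_config.num_return_sequences - 1, 0)
```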
`121-126`: Child request counting logic is correct.

The queue dequeue logic properly accounts for both parent and child requests when checking capacity limits.

`164-168`: LGTM: Request ID generation with proper 64-bit wraparound.

The implementation correctly generates unique request IDs with proper 64-bit boundary handling.

`170-183`: LGTM: Child request ID generation with proper timing integration.

The method correctly generates child request IDs only when needed and properly initializes performance timing for each child request.

`190-200`: LGTM: Enqueue method properly updated for child request support.

The method correctly uses the new request ID generation and child request functionality while maintaining proper performance timing.

`227-238`: LGTM: Single-request enqueue updated consistently.

The method follows the same correct pattern as `enqueue_requests` for handling child request generation and ID management.

`574-577`: LGTM: Latency tracking extended to child requests.

The method correctly accumulates latency for both parent and child requests, providing comprehensive performance metrics.

`591-599`: LGTM: Merge requests properly handles child request integration.

The method correctly passes child request IDs to the conversion function and properly merges both parent and child requests into the final list.

examples/llm-api/quickstart_advanced.py (3)

`110-111`: LGTM: CLI arguments added with appropriate defaults.

The `--n` and `--best_of` parameters are correctly defined with defaults that align with the SamplingParams API.

`251-253`: LGTM: SamplingParams correctly updated with new parameters.

The parameters are properly passed to SamplingParams to enable the multi-sequence generation functionality.

`266-282`: LGTM: Output display properly generalized for multi-sequence generation.

The terminology change from "beam" to "sequence" and the updated condition (`args.max_beam_width > 1 or args.n > 1`) correctly handles both beam search and sampling-based multi-sequence scenarios.
PR_Github #13884 [ run ] triggered by Bot
faa0ec2 to b49bf33 (Compare)
Signed-off-by: Olya Kozlova <[email protected]>
b49bf33 to f6c7b25 (Compare)
/bot run --disable-fail-fast
PR_Github #13885 [ run ] triggered by Bot
PR_Github #13884 [ run ] completed with state
PR_Github #13885 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #13889 [ run ] triggered by Bot
PR_Github #13889 [ run ] completed with state
Signed-off-by: Olya Kozlova <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Olya Kozlova <[email protected]>
best_of/n feature for pytorch workflow
Description
Adds support for child requests and multiple-sequence generation without beam search.
Test Coverage
unittest/_torch/test_best_of_n.py
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests. Will also run the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.