[TRTLLM-5271][feat] best_of/n for pytorch workflow #5997
Conversation
85a0b9c to e3660a1 (Compare)
Thanks for your help @evezhier 👍 .
3b5742d to 654bac2 (Compare)
⚠️ Rate limit exceeded: @evezhier has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 10 seconds before requesting another review. After the wait time has elapsed, a review can be triggered again; we recommend spacing out commits to avoid hitting the rate limit. CodeRabbit enforces hourly rate limits for each developer per organization; paid plans have higher rate limits than the trial, open-source, and free plans, and further reviews are re-allowed after a brief timeout. See the FAQ for details.

📒 Files selected for processing (9)
📝 Walkthrough

The changes add defaulted move and copy constructors to the C++ LLM request classes, expose new properties and methods in the Python bindings, extend the Python LlmRequest to handle child requests, update the PyExecutor to manage hierarchical requests and responses, introduce CLI arguments for multi-sequence generation, and add unit tests validating multi-output and child-request functionality. The ExecutorRequestQueue is enhanced to track child request IDs and incorporate them in enqueue, dequeue, latency tracking, and merging operations.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant PyExecutor
    participant LlmRequest
    participant LlmResponse
    User->>CLI: Provide prompts with --n and --best_of
    CLI->>PyExecutor: Submit ExecutorRequest(s)
    PyExecutor->>LlmRequest: Create parent LlmRequest
    loop For each child_req_id
        PyExecutor->>LlmRequest: create_child_request(child_id)
        LlmRequest->>LlmRequest: Add child to children list
    end
    PyExecutor->>LlmRequest: Merge parent and children into batch
    LlmRequest->>LlmResponse: create_response (handles parent/child IDs)
    PyExecutor->>User: Return multiple outputs per prompt
```
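For orientation, the flow above corresponds roughly to the following LLM API usage; this is a minimal sketch (model path and prompt are placeholders, not taken from the PR), using the `n`/`best_of` sampling parameters this PR wires through:

```python
from tensorrt_llm import LLM, SamplingParams

# Hypothetical usage sketch: ask for 2 returned sequences sampled from
# 3 candidates per prompt (best_of >= n), without beam search.
llm = LLM(model="/path/to/hf_model")  # placeholder model path
params = SamplingParams(max_tokens=32, n=2, best_of=3, temperature=0.8)

for output in llm.generate(["The capital of France is"], params):
    # Each RequestOutput carries n completions, produced internally via
    # the parent/child requests shown in the diagram above.
    for seq_idx, completion in enumerate(output.outputs):
        print(seq_idx, completion.text)
```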
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~45 minutes
Actionable comments posted: 2
♻️ Duplicate comments (9)
tests/unittest/_torch/test_best_of_n.py (1)
`32-37`: Consider using a fixture to improve test efficiency.

Creating the LLM instance inside the test function means it gets recreated for each parameter combination, which is time-consuming. As mentioned in previous reviews, consider using a pytest fixture to create the LLM once and reuse it across test cases.

Example fixture:

```python
@pytest.fixture(scope="module")
def llm_instance():
    return LLM(model=os.path.join(llm_models_root(), "llama-models-v2",
                                  "TinyLlama-1.1B-Chat-v1.0"),
               kv_cache_config=global_kvcache_config,
               max_batch_size=128,
               max_seq_len=128,
               enable_trtllm_sampler=True)
```
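A test could then consume the fixture instead of constructing the model itself; a rough sketch (test name and prompt are illustrative, not from the PR):

```python
import pytest
from tensorrt_llm import SamplingParams

@pytest.mark.parametrize("n, best_of", [(2, None), (2, 3)])
def test_n_outputs(llm_instance, n, best_of):
    params = SamplingParams(max_tokens=32, n=n, best_of=best_of)
    outputs = llm_instance.generate(["Hello, my name is"], params)
    # Each RequestOutput should expose n generated sequences.
    assert len(outputs[0].outputs) == n
```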
cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

`365-365`: Remove or uncomment the move constructor binding.

This commented line should either be removed or uncommented to maintain code cleanliness.
examples/llm-api/quickstart_advanced.py (1)
`234-240`: Clarify parameter usage for beam search and best_of.

The current logic mixes `max_beam_width` with `n` and `best_of` parameters, which can be confusing. Consider:

- Deprecating `max_beam_width` in favor of `use_beam_search` (as suggested in past reviews)
- Making the parameter relationships clearer

```diff
-        best_of=args.max_beam_width
-        if args.max_beam_width > 1 else args.best_of,
+        # Clearly document the parameter logic
+        # When beam search is used, best_of defaults to beam width
+        best_of=args.best_of if args.best_of is not None else (
+            args.max_beam_width if args.max_beam_width > 1 else args.n
+        ),
```

tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
`507-511`: Consider moving the child append into create_child_request.

For consistency with the C++ implementation, consider moving the append operation into the `create_child_request` method as suggested in past reviews.

`361-380`: Fix attribute copying in create_child_request.

The current implementation copies all attributes from parent to child using `__dict__.update()`, which can lead to issues:

- It overwrites the child's already initialized attributes
- It doesn't properly handle mutable attributes like lists

Based on past review comments, consider this improved approach:

```diff
 def create_child_request(self, child_id):
     child = super().create_child_request(child_id)
     py_request = LlmRequest(llm_request=child)
-    py_request.__dict__.update(**self.__dict__)
+    # Copy only Python-specific attributes from parent to child
+    for attr_name, attr_value in self.__dict__.items():
+        if attr_name.startswith('py_') and attr_name not in ['py_request_id', 'py_result']:
+            setattr(py_request, attr_name, copy.deepcopy(attr_value))
+        elif attr_name in ['is_attention_dp_dummy', 'is_cuda_graph_dummy']:
+            setattr(py_request, attr_name, attr_value)
     py_request.py_result = PyResult(
         self.py_prompt_len, self.py_max_new_tokens,
         self.py_return_logits_device_memory, self.streaming,
         self.py_return_log_probs, self.py_return_context_logits,
         self.py_return_generation_logits)
     py_request.py_request_id = child.request_id
     py_request.children = []
+    self.children.append(py_request)
```

Also, add the import at the top of the file:

```python
import copy
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
`534-542`: Simplify child request handling using extend().

As suggested in the past review, this can be simplified for better readability and efficiency.

```diff
     req_with_children = []
     for req_item in new_requests:
         req = executor_request_to_llm_request(
             req_item.id, req_item.request, req_item.child_req_ids,
             self._should_exclude_last_generation_logits())
         req_with_children.append(req)
-        for child in req.children:
-            req_with_children.append(child)
+        if req.children:
+            req_with_children.extend(req.children)
     return req_with_children
```

`2108-2111`: Consider whether child requests should terminate independently.

The current implementation terminates child requests when the parent finishes. However, a past review comment suggested that "each request should be terminated by themselves, not from the parent request."

Consider whether child requests should manage their own termination lifecycle. If they should terminate independently, simplify to:

```diff
 if response.result.is_final:
     requests_to_terminate.append(request)
-    for child in request.children:
-        requests_to_terminate.append(child)
```

`371-381`: Extract child request ID generation into a helper method.

This logic is duplicated in the `enqueue_request` method. As suggested in the past review, consider extracting this to a helper function to follow DRY principles.

Add a helper method:

```python
def _generate_child_request_ids(self, request: ExecutorRequest) -> List[int]:
    """Generate child request IDs for a given request."""
    child_req_ids = []
    num_child_requests = _get_num_child_requests(request)
    for _ in range(num_child_requests):
        self.next_req_id += 1
        child_req_ids.append(self.next_req_id)
    return child_req_ids
```

Then use it here:

```diff
-        child_req_ids = []
-        num_child_requests = _get_num_child_requests(request)
-        for _ in range(num_child_requests):
-            self.next_req_id += 1
-            child_req_ids.append(self.next_req_id)
+        child_req_ids = self._generate_child_request_ids(request)
```

`501-507`: Use the same helper method for child ID generation.

This is duplicate code that should use the same helper method suggested above.

```diff
-        child_req_ids = []
-        num_child_requests = _get_num_child_requests(request)
-        for _ in range(num_child_requests):
-            self.next_req_id += 1
-            child_req_ids.append(self.next_req_id)
+        child_req_ids = self._generate_child_request_ids(request)
```
🧹 Nitpick comments (2)
tests/unittest/_torch/test_best_of_n.py (1)

`28-31`: Consider expanding test coverage beyond the current parameters.

The current parametrization only tests `n=2` with `best_of` values of `None` and `3`. Consider adding more comprehensive test cases, such as (see the sketch after this list):

- Different values of `n` (e.g., 1, 3, 4)
- Edge cases like `n == best_of`
- Error conditions when `n > best_of`
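A sketch of how the parametrization might be extended along those lines (the fixture and test names follow the suggestion above and are illustrative; the last case assumes the API rejects `n > best_of`, so adjust it to however the error actually surfaces):

```python
import pytest
from tensorrt_llm import SamplingParams

@pytest.mark.parametrize("n, best_of", [
    (1, None),  # single output, best_of defaults to n
    (3, 3),     # edge case: n == best_of
    (3, 4),     # more candidates than returned sequences
    (4, None),  # larger n, best_of defaults to n
])
def test_n_outputs_extended(llm_instance, n, best_of):
    params = SamplingParams(max_tokens=32, n=n, best_of=best_of)
    outputs = llm_instance.generate(["Hello, my name is"], params)
    assert len(outputs[0].outputs) == n

def test_n_greater_than_best_of_rejected():
    # Hypothetical: n > best_of is expected to be rejected up front.
    with pytest.raises(Exception):
        SamplingParams(max_tokens=8, n=3, best_of=2)
```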
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`60-60`: Use a more specific type hint for `child_req_ids`.

For better type safety and clarity, consider using a more specific type hint.

```diff
-    child_req_ids: Optional[list] = None
+    child_req_ids: Optional[List[int]] = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3 hunks)
- examples/llm-api/quickstart_advanced.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (6 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
tests/unittest/_torch/test_best_of_n.py (1)
- tensorrt_llm/sampling_params.py (1): SamplingParams (125-477)

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)
- cpp/tensorrt_llm/batch_manager/llmRequest.cpp (6): validate (225-297), validate (225-226), createResponse (43-52), createResponse (43-43), createChildRequest (299-328), createChildRequest (299-299)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (5): append (77-96), append (123-140), executor_request_to_llm_request (419-512), LlmResponse (254-261), create_response (340-350)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2)
`470-472`: LGTM: Well-structured explicit defaulted constructors.

The explicit defaulting of move and copy constructors is appropriate for this complex class. Given that the class contains standard library types (optional, shared_ptr, vectors) with well-defined copy/move semantics, the default behavior should work correctly to support the enhanced child request functionality.

`2321-2323`: LGTM: Consistent constructor declarations with the base class.

The explicit defaulting of move and copy constructors in the derived class properly matches the pattern established in the base `GenericLlmRequest` class, maintaining consistency in the class hierarchy.

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

`197-198`: LGTM! Hierarchical request support properly exposed.

The new bindings correctly expose the parent-child request functionality from C++ to Python, enabling the hierarchical request model needed for best_of/n support.

Also applies to: 364-364, 371-371

examples/llm-api/quickstart_advanced.py (1)

`253-269`: LGTM! Clear output formatting for multiple sequences.

The updated output formatting with descriptive variable names and conditional sequence ID display improves readability when using beam search or multiple sequences.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

`280-301`: LGTM! Constructor and response handling properly support child requests.

The constructor changes correctly handle wrapping existing request objects, and the response ID logic properly distinguishes between parent and child requests.

Also applies to: 335-335, 347-350

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

`119-122`: LGTM! The modification correctly accounts for child requests when calculating the total request count, ensuring proper capacity management.

`2008-2008`: LGTM! The logic correctly handles cancellation for both parent and child requests by checking the appropriate request ID.

`2022-2050`: LGTM! The refactoring from dictionary to list of tuples correctly supports multiple responses per request ID, which is essential for the parent-child request hierarchy.
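For illustration only, the shape change amounts to collecting `(req_id, response)` pairs so that a single request ID can contribute several responses; a simplified sketch, not the PR's actual code:

```python
from typing import Any, List, Tuple

def collect_responses(requests) -> List[Tuple[int, Any]]:
    # A dict keyed by req_id can hold only one response per ID; a list of
    # (req_id, response) tuples keeps every parent and child response.
    new_responses: List[Tuple[int, Any]] = []
    for request in requests:
        response = request.create_response()
        if response is not None:
            new_responses.append((request.py_request_id, response))
    return new_responses
```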
Thanks @evezhier for your help. I left several minor review comments below. Overall it looks good.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unittest/_torch/test_best_of_n.py (2)
`36-41`: Consider using a more realistic model path and reducing max_batch_size.

The hardcoded model path `/model` may not exist in test environments. Additionally, the `max_batch_size=8` could be reduced to better align with actual test usage and potentially expose vector size or access pattern errors.

`108-112`: Fix the similarity check logic.

The condition `if n == best_of` for the similarity check is problematic. When `best_of > n`, the system still returns the `n` best outputs from `best_of` candidates, so the similarity check should still be meaningful. The current logic means that when `n=2` and `best_of=3`, no similarity verification is performed, reducing test coverage.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`60-60`: Improve type annotation for better type safety.

The `child_req_ids` field should use a more specific type annotation for better type safety and clarity.

```diff
-    child_req_ids: Optional[list] = None
+    child_req_ids: Optional[List[int]] = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tests/unittest/_torch/test_best_of_n.py (2)
`46-90`: LGTM! Child request creation logic is well tested.

The test comprehensively validates child request creation, including:

- Correct number of children created
- Proper ID assignment and parent-child relationships
- Attribute inheritance and independence verification
- Result and token independence checks

`114-133`: LGTM! Async test provides good coverage for batch-size exceedance.

The test effectively validates that the system can handle multiple asynchronous requests exceeding the max batch size, ensuring each result contains the expected number of outputs.
tensorrt_llm/_torch/pyexecutor/llm_request.py (6)
`1-1`: LGTM! Proper import for deep copying child request attributes.

The `deepcopy` import is correctly added to support copying complex attributes when creating child requests.

`281-302`: LGTM! Constructor properly handles both new and wrapped requests.

The constructor correctly handles the optional `llm_request` parameter to wrap existing internal request objects, which is essential for child request creation.

`336-336`: LGTM! Children tracking attribute properly initialized.

The `children` list is correctly initialized to track child requests created from this parent request.

`348-351`: LGTM! Response creation properly handles child request IDs.

The `create_response` method correctly uses `self.py_request_id` for child requests and `self.parent_request_id` for parent requests, ensuring proper request ID assignment in responses.

`362-387`: LGTM! Child request creation is well implemented.

The `create_child_request` method properly:

- Creates the child request through the superclass method
- Wraps it in the Python `LlmRequest` wrapper
- Deep copies all `py_*` attributes from the parent
- Resets child-specific attributes appropriately
- Validates parent-child relationships with assertions
- Ensures distinct sampling seeds for different outputs
- Appends the child to the parent's children list

`514-517`: LGTM! Child request creation integrated into the utility function.

The `executor_request_to_llm_request` function correctly creates child requests when `child_req_ids` is provided, ensuring a proper hierarchical request structure.

tensorrt_llm/_torch/pyexecutor/py_executor.py (8)
`112-119`: LGTM! Child request counting logic is correct.

The implementation properly accounts for both parent and child requests when determining queue capacity, ensuring the total request count doesn't exceed the maximum allowed.

`358-382`: LGTM! Request ID generation and child request logic are well implemented.

The implementation correctly:

- Uses 64-bit wraparound for request ID generation to prevent overflow (see the sketch after this list)
- Creates child requests only when appropriate (beam_width == 1 and num_return_sequences > 1)
- Properly tracks performance statistics for child requests
- Returns None when no child requests are needed
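As a minimal sketch of the wraparound idea only (attribute and method names are illustrative, not the PR's exact code):

```python
MAX_UINT64 = 2**64 - 1

def _get_next_req_id(self) -> int:
    # Advance the counter and wrap at the 64-bit boundary so the request
    # ID always fits in an unsigned 64-bit integer.
    self.next_req_id = (self.next_req_id + 1) & MAX_UINT64
    return self.next_req_id
```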
`393-403`: LGTM! Request enqueuing logic properly handles child requests.

The implementation correctly generates request IDs and child request IDs, passing them to the queue item constructor. Performance statistics are appropriately tracked for the parent request only.

`559-567`: LGTM! Request merging logic correctly handles hierarchical requests.

The implementation properly:

- Converts each request item to an LlmRequest with child request IDs
- Flattens the parent-child hierarchy into a single list
- Includes both parent and child requests in the returned list

This approach is consistent with the design of treating child requests as separate entities in the active requests list.

`2033-2033`: LGTM! Cancellation logic correctly handles child requests.

The implementation properly uses the parent request ID for child requests when checking for cancellation. This ensures that when a parent request is canceled, all its child requests are also canceled appropriately.

`2047-2047`: LGTM! Parameter type change improves type safety.

The change from a dictionary to `List[Tuple[int, LlmResponse]]` for the responses parameter improves type safety and is consistent with the new response handling approach.

`2079-2087`: LGTM! Response handling correctly uses the list-based approach.

The implementation properly creates a list of tuples for new responses, consistent with the updated response handling pattern.

`2092-2132`: Semantic equivalence confirmed: `request.is_finished` matches `response.result.is_final`.

The `is_final` boolean returned by `create_serialized_result` in `LlmResult` is set when any of the finish-reason bits (EOS, stop words, or max length) are reached. Under the hood, the C++ request object updates its finished state at the same time, and the Python binding exposes that via `request.is_finished`. You can safely use `request.is_finished` in place of checking `response.result.is_final`.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`516-539`: Fix incomplete conditional structure.

The `else:` statement on line 525 lacks a corresponding `if` statement, creating invalid syntax. Based on the context, it appears there should be an `if` condition checking for performance stats.

```diff
-        if self.enable_iter_perf_stats:
-            self.start_times[req_id] = time.time()
-
-        else:
-            child_req_ids = self._generate_child_request_ids(request)
+        if self.enable_iter_perf_stats:
+            self.start_times[req_id] = time.time()
+
+        child_req_ids = self._generate_child_request_ids(request)
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
`61-61`: Improve type annotation specificity.

The `child_req_ids` field should have a more specific type annotation to indicate it's a list of integers.

```diff
-    child_req_ids: Optional[list] = None
+    child_req_ids: Optional[List[int]] = None
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (3): executor_request_to_llm_request (426-518), LlmResponse (255-262), create_response (341-351)
- tensorrt_llm/logger.py (1): debug (143-144)
- tensorrt_llm/_utils.py (1): nvtx_range (825-844)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1884-1884: Undefined name `Tuple` (F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/py_executor.py (8)
`113-120`: LGTM! Proper handling of child requests in capacity planning.

The logic correctly accounts for both parent and child requests when determining if a request can be scheduled within the capacity limits. The implementation prevents exceeding `max_req_count` by breaking when the total would exceed the limit.

`361-385`: LGTM! Well-implemented request ID generation and child request creation.

The implementation correctly:

- Uses proper 64-bit unsigned integer wraparound for request ID generation
- Creates child requests only for non-beam-search scenarios (beam_width == 1 and num_return_sequences > 1)
- Properly tracks performance statistics when enabled
- Provides clear documentation

`396-407`: LGTM! Proper integration of child request generation in batch enqueue.

The method correctly integrates the new child request generation logic while maintaining backward compatibility. Performance stats are appropriately tracked only for parent requests.

`573-581`: LGTM! Proper flattening of parent and child requests.

The implementation correctly creates a flattened list of parent and child `LlmRequest` objects, which is appropriate for the executor's processing needs. The logic properly handles both parent requests and their children.

`870-870`: LGTM! Proper cancellation handling for hierarchical requests.

The logic correctly uses the parent request ID for child requests during cancellation, ensuring that canceling a parent request properly affects all its children.

`1896-1901`: LGTM! Proper handling of the list-based response structure.

The updated logic correctly processes the new list-based response structure and properly flattens responses during gathering across distributed ranks.

`1907-1912`: LGTM! Proper response storage for multiple sequences.

The updated logic correctly stores multiple responses per request ID as lists, which is essential for supporting hierarchical requests where one parent can generate multiple child responses.

`1916-1924`: LGTM! Consistent response handling updates.

All response handling methods have been consistently updated to use the new list-based structure with (req_id, response) tuples. This maintains consistency across different response scenarios and properly supports the hierarchical request architecture.

Also applies to: 1929-1930, 1963-1965, 1969-1969
8b25d46 to 1c02f85 (Compare)
@evezhier -- I think that @NVIDIA/trtllm-bench-reviewers isn't needed here; I don't see any
Actionable comments posted: 1
♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
`517-536`: Request is never enqueued when `enable_iter_perf_stats` is True.

The `else:` block that generates `child_req_ids` and enqueues the request is only executed when `self.enable_iter_perf_stats` is False. When perf-stats are enabled, the request never reaches `self.request_queue`, effectively dropping the call.

Fix by unconditionally generating `child_req_ids` and performing the `put`, while keeping the optional timing logic:

```diff
-        if self.enable_iter_perf_stats:
-            self.start_times[req_id] = time.time()
-        else:
-            child_req_ids = self._generate_child_request_ids(request)
-            ...
-            self.request_queue.put(RequestQueueItem(...))
+        if self.enable_iter_perf_stats:
+            self.start_times[req_id] = time.time()
+
+        child_req_ids = self._generate_child_request_ids(request)
+        logger.debug(f"Executor request {req_id} child reqs: {child_req_ids}")
+        self.request_queue.put(
+            RequestQueueItem(req_id, request, child_req_ids, query=query)
+        )
```
`14-14`: Missing `Tuple` import breaks type-checking & runtime introspection.

`_enqueue_responses` uses `Tuple` in annotations (see ~1884) but `Tuple` isn't imported from `typing`, causing `NameError` under strict checking.

```diff
-from typing import List, Optional, Union
+from typing import List, Optional, Union, Tuple
```
🧹 Nitpick comments (2)
examples/llm-api/quickstart_advanced.py (1)
`237-237`: Line exceeds 120 chars.

Ruff flags this (`E501`). Consider breaking the line to stay within project style.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
`362-387`: Deep-copying a large `py_result` may waste memory.

The blanket deepcopy of every `py_*` attribute clones `py_result`, duplicating potentially large tensors (logits/log-probs) for each child. Unless children truly need a full copy of the parent's stored results, consider:

```diff
-        if attr_name.startswith('py_'):
+        if attr_name.startswith('py_') and attr_name != 'py_result':
```

and instantiating a fresh, empty `PyResult` for the child.
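Putting the two pieces together, the copy loop inside `create_child_request` might look roughly like this; a sketch assembled from the snippets above, not the final implementation:

```python
# Copy Python-side state, but skip the parent's accumulated results.
for attr_name, attr_value in self.__dict__.items():
    if attr_name.startswith('py_') and attr_name != 'py_result':
        setattr(py_request, attr_name, copy.deepcopy(attr_value))

# Give the child a fresh, empty PyResult instead of a cloned one.
py_request.py_result = PyResult(
    self.py_prompt_len, self.py_max_new_tokens,
    self.py_return_logits_device_memory, self.streaming,
    self.py_return_log_probs, self.py_return_context_logits,
    self.py_return_generation_logits)
```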
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (12 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unittest/_torch/test_best_of_n.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (3): executor_request_to_llm_request (426-518), LlmResponse (255-262), create_response (341-351)
- tensorrt_llm/logger.py (1): debug (143-144)
- tensorrt_llm/_utils.py (1): nvtx_range (825-844)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1884-1884: Undefined name `Tuple` (F821)
examples/llm-api/quickstart_advanced.py
237-237: Line too long (137 > 120) (E501)
🔇 Additional comments (2)
examples/llm-api/quickstart_advanced.py (1)
`264-270`: Minor: variable shadowing of `sequence_idx`.

`enumerate(output.outputs)` already yields `sequence_idx`; the formatted string additionally derives `sequence_id_text` from `args.n`/`args.max_beam_width`. All good, just flagging that `sequence_idx` may differ from the beam index in beam-search mode when beams are reordered.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

`349-350`: Response ID logic looks correct.

Good catch ensuring child responses report the parent's ID; this keeps client-side correlation intact.
PR_Github #13842 [ run ] triggered by Bot
PR_Github #13842 [ run ] completed with state
f20dc61 to faa0ec2 (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/llm-api/quickstart_advanced.py (1)
`234-241`: Fix `best_of` calculation timing and assertion logic.

The current implementation has the same issues identified in previous reviews:

- `best_of` is calculated before `args.n` is potentially updated, making it stale
- The assertion message contradicts the actual check

Apply the fix suggested in previous reviews:

```diff
-    use_beam_search = args.max_beam_width > 1
-    best_of = args.best_of or args.n
-    if use_beam_search:
-        if args.n == 1 and args.best_of is None:
-            args.n = args.max_beam_width
-        assert best_of <= args.max_beam_width, f"beam width: {best_of}, should be less or equal to max_beam_width: {args.max_beam_width}"
-
-    assert best_of >= args.n, f"In sampling mode best_of value: {best_of} should be less or equal to n: {args.n}"
+    use_beam_search = args.max_beam_width > 1
+    if use_beam_search and args.n == 1 and args.best_of is None:
+        args.n = args.max_beam_width
+
+    best_of = args.best_of or args.n
+    if use_beam_search:
+        assert best_of <= args.max_beam_width, f"beam width: {best_of}, should be less or equal to max_beam_width: {args.max_beam_width}"
+
+    assert best_of >= args.n, f"In sampling mode best_of value: {best_of} should be greater or equal to n: {args.n}"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (3 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (7 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (7 hunks)
- tests/unittest/_torch/test_best_of_n.py (1 hunks)
- tests/unittest/_torch/test_executor_request_queue.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tests/unittest/_torch/test_executor_request_queue.py
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- tests/unittest/_torch/test_best_of_n.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
examples/llm-api/quickstart_advanced.py
🪛 Ruff (0.12.2)
examples/llm-api/quickstart_advanced.py
239-239: Line too long (137 > 120) (E501)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py
134-134: SyntaxError: Expected a statement
134-134: SyntaxError: Expected a statement
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (9)
`25-25`: LGTM: Child request IDs field added correctly.

The optional `child_req_ids` field is properly typed and defaulted, supporting the hierarchical request functionality.

`87-91`: LGTM: Child request calculation logic is correct.

The method correctly determines that child requests are needed only for non-beam-search cases (`beam_width <= 1`) with multiple return sequences (see the sketch below).
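In other words, the child count reduces to something like the following sketch (the helper name mirrors `_get_num_child_requests` used elsewhere in the PR; the sampling-config attribute names are assumptions):

```python
def _get_num_child_requests(request) -> int:
    # Children are only needed for sampling (no beam search) when more
    # than one sequence is returned; the parent itself produces one of
    # them, so num_return_sequences - 1 children are created.
    sampling_config = request.sampling_config  # assumed attribute
    if sampling_config.beam_width > 1:
        return 0
    return max(sampling_config.num_return_sequences - 1, 0)
```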
`121-126`: Child request counting logic is correct.

The queue dequeue logic properly accounts for both parent and child requests when checking capacity limits.

`164-168`: LGTM: Request ID generation with proper 64-bit wraparound.

The implementation correctly generates unique request IDs with proper 64-bit boundary handling.

`170-183`: LGTM: Child request ID generation with proper timing integration.

The method correctly generates child request IDs only when needed and properly initializes performance timing for each child request.

`190-200`: LGTM: Enqueue method properly updated for child request support.

The method correctly uses the new request ID generation and child request functionality while maintaining proper performance timing.

`227-238`: LGTM: Single-request enqueue updated consistently.

The method follows the same correct pattern as `enqueue_requests` for handling child request generation and ID management.

`574-577`: LGTM: Latency tracking extended to child requests.

The method correctly accumulates latency for both parent and child requests, providing comprehensive performance metrics.

`591-599`: LGTM: Merge requests properly handles child request integration.

The method correctly passes child request IDs to the conversion function and properly merges both parent and child requests into the final list.

examples/llm-api/quickstart_advanced.py (3)

`110-111`: LGTM: CLI arguments added with appropriate defaults.

The `--n` and `--best_of` parameters are correctly defined with defaults that align with the SamplingParams API.

`251-253`: LGTM: SamplingParams correctly updated with new parameters.

The parameters are properly passed to SamplingParams to enable the multi-sequence generation functionality.

`266-282`: LGTM: Output display properly generalized for multi-sequence generation.

The terminology change from "beam" to "sequence" and the updated condition (`args.max_beam_width > 1 or args.n > 1`) correctly handles both beam search and sampling-based multi-sequence scenarios.
PR_Github #13884 [ run ] triggered by Bot
faa0ec2 to b49bf33 (Compare)
Signed-off-by: Olya Kozlova <[email protected]>
b49bf33 to f6c7b25 (Compare)
/bot run --disable-fail-fast
PR_Github #13885 [ run ] triggered by Bot
PR_Github #13884 [ run ] completed with state
PR_Github #13885 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #13889 [ run ] triggered by Bot
PR_Github #13889 [ run ] completed with state
Signed-off-by: Olya Kozlova <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Olya Kozlova <[email protected]>
best_of/n feature for pytorch workflow
Description
Adds support for child requests and multiple-sequence generation without beam search.
Test Coverage
unittest/_torch/test_best_of_n.py
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests. Will also run the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.