
Conversation

Collaborator

@jaedeok-nvidia jaedeok-nvidia commented Jun 24, 2025

This PR enables the n parameter (num_return_sequences) in the PyTorch backend, which is the default path for the LLM API. While this feature was already implemented in the TRT backend via the C++ Executor, it was missing in the PyExecutor. This PR closes the gap by adding the necessary APIs to the Python bindings (pybind) of the LlmRequest class.

Changes:

  • Added create_child_request method to pyexecutor.LlmRequest that wraps C++'s createChildRequest method. This allows requests to properly handle their child requests and states.
  • Updated C++ LlmRequest and related Python bindings to expose additional properties required in the PyTorch backend.
  • Enhanced PyExecutor to create child requests, ensuring requests are handled correctly when num_return_sequences > 1 (see the sketch below).
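
For illustration, here is a minimal sketch of what the executor-side expansion amounts to. The function name expand_requests, the sampling_config accessor, and the argument-free create_child_request call are assumptions made for this sketch, not the exact code in this PR:

    def expand_requests(new_requests):
        # Expand each incoming request into the parent plus (n - 1) children,
        # so every returned sequence has its own request object to schedule.
        expanded = []
        for req in new_requests:
            expanded.append(req)
            n = req.sampling_config.num_return_sequences or 1  # assumed accessor
            for _ in range(n - 1):
                # create_child_request wraps the C++ createChildRequest; each
                # child shares the parent's prompt and lifecycle state.
                expanded.append(req.create_child_request())
        return expanded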

@jaedeok-nvidia jaedeok-nvidia requested a review from a team as a code owner June 24, 2025 03:29
@jaedeok-nvidia jaedeok-nvidia changed the title fix: Enable num_return_sequences (n) support in PyTorch backend [DRAFT] fix: Enable num_return_sequences (n) support in PyTorch backend Jun 24, 2025
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch 3 times, most recently from 53c03c7 to 22001d5 on June 25, 2025 16:14
@jaedeok-nvidia jaedeok-nvidia self-assigned this Jun 25, 2025
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 22001d5 to 71811ea on June 25, 2025 16:20
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9901 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9901 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7308 completed with status: 'FAILURE'

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch 2 times, most recently from 170abb9 to 0a93904 on June 27, 2025 02:57
Comment on lines 320 to 402
# Copy Python-specific attributes from parent to child
child_request.py_client_id = self.py_client_id
child_request.py_parent_request_id = self.py_request_id
child_request.py_request_id = child_request.request_id
child_request.py_llm_request_type = child_request.llm_request_type
child_request.py_end_id = child_request.end_id
child_request.py_prompt_len = child_request.prompt_len
child_request.py_orig_prompt_len = child_request.orig_prompt_len
child_request.py_max_new_tokens = child_request.max_new_tokens

# Copy Python-specific configuration from parent
child_request.py_return_log_probs = self.py_return_log_probs
child_request.py_return_context_logits = self.py_return_context_logits
child_request.py_return_generation_logits = self.py_return_generation_logits
child_request.py_return_logits_device_memory = self.py_return_logits_device_memory
child_request.py_exclude_last_generation_logits = self.py_exclude_last_generation_logits
child_request.py_stop_words_list = self.py_stop_words_list
child_request.py_logits_post_processors = self.py_logits_post_processors
child_request.py_rewind_len = self.py_rewind_len
child_request.py_decoding_iter = self.py_decoding_iter
child_request.py_draft_tokens = self.py_draft_tokens.copy(
) if self.py_draft_tokens else []
child_request.py_last_draft_tokens = self.py_last_draft_tokens.copy(
) if self.py_last_draft_tokens else None
child_request.py_num_accepted_draft_tokens = self.py_num_accepted_draft_tokens
child_request.py_lora_task_layer_module_configs = self.py_lora_task_layer_module_configs

# Initialize Python-specific runtime state
child_request.py_batch_idx = None
child_request.is_attention_dp_dummy = self.is_attention_dp_dummy
child_request.is_cuda_graph_dummy = self.is_cuda_graph_dummy
Collaborator

If possible, we should make this happen automatically instead of manually copying every field. Otherwise, we need to maintain this list every time an attribute is added or removed.

Will a copy.deepcopy work?

Collaborator

Yeah, either copy.deepcopy, or the copy/clone should be encapsulated in a separate method of Request.

Collaborator Author

Thanks @Superjomn @syuoni for pointing out this issue. Unfortunately, there is a gap between a parent request (of class pyexecutor.LlmRequest) and a child request (of class bindings.LlmRequest).

A parent request tracks all the child requests created by create_child_request, and their states are shared with each other. All of this logic happens internally in the C++ runtime; it is used for handling termination or cancellation of requests on the executor side. The ugly part is that the result of create_child_request is of type bindings.LlmRequest, and for now I couldn't find a better or clearer way to inherit the class. I believe this issue can be resolved once #3034 is finished and brings all the required logic to the Python side.

As a WAR before #3034, the child request generated by a parent mimics pyexecutor.LlmRequest; that is what this function does. And I totally agree that encapsulation is necessary for maintainability. Since copy won't work for this case, I will copy the attributes matching the py_* pattern plus some extras like is_attention_dp_dummy. This will make the code clearer and reduce the maintenance risk. Does this make sense?

Collaborator

@Superjomn Superjomn Jul 11, 2025

I see. But considering how many members are flattened here, it would be easy to forget one when a new member is introduced. Maybe the following code can help automate copying most of the members, with a proper blacklist or whitelist:

for attr_name in dir(self):
    if attr_name.startswith("py_"):
        value = getattr(self, attr_name)
        setattr(child_request, attr_name, value)

You can try it in a subsequent PR; there is no need to change it in this PR.

Collaborator Author

@jaedeok-nvidia jaedeok-nvidia Jul 11, 2025

Agreed, it's easy to forget since there are many contributors. However, this has already been updated; GitHub is showing the original implementation, not the latest revision. Here is the latest one:

        # Copy all py_* attributes from parent to child
        for attr_name, attr_value in self.__dict__.items():
            if attr_name.startswith('py_'):
                attr_value = getattr(self, attr_name)
                setattr(child_request, attr_name, copy.deepcopy(attr_value))

Collaborator

I see. Currently there is no method to create a child request as a pyexecutor.LlmRequest.

Are those child requests processed in the Python runtime? If so, will their different type (bindings.LlmRequest) cause any issues?

Collaborator Author

@syuoni Yes, that's why the mimic functions are added. However, as mentioned, this is just a WAR until the LlmRequest and resource manager logic is properly reimplemented in the torch backend.

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 0a93904 to 28d16ac on July 8, 2025 05:34
@jaedeok-nvidia jaedeok-nvidia changed the title [DRAFT] fix: Enable num_return_sequences (n) support in PyTorch backend fix: Enable num_return_sequences (n) support in PyTorch backend Jul 8, 2025
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from ed2b471 to 4678b73 on July 8, 2025 06:18
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11227 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11227 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8304 completed with status: 'FAILURE'

@ccys-a11y

ccys-a11y commented Jul 8, 2025

@jaedeok-nvidia

I compiled and installed tensorrt_llm from your GitHub branch, but encountered two issues:

  1. The quickstart_advanced.py script still errors out when setting n=2.

Error Info:"Processed requests: 0%| | 0/4 [00:00<?, ?it/s][07/08/2025-17:05:26] [TRT-LLM] [E] Error in event loop: fail to schedule any pending request, probably run out of resource.
[07/08/2025-17:05:26] [TRT-LLM] [E] Traceback (most recent call last):
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 285, in _event_loop_wrapper
self.event_loop()
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1033, in _executor_loop_overlap
assert scheduled_batch.batch_size > 0, (
AssertionError: fail to schedule any pending request, probably run out of resource.

Exception in thread Thread-7 (_event_loop_wrapper):
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/python3/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 289, in _event_loop_wrapper
raise e
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 285, in _event_loop_wrapper
self.event_loop()
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1033, in _executor_loop_overlap
assert scheduled_batch.batch_size > 0, (
AssertionError: fail to schedule any pending request, probably run out of resource."
2. Failed to import LLM from torch: "from tensorrt_llm._torch import LLM"

Error Info:
" File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/llm.py", line 7, in init
raise ImportError(
ImportError: _torch.llm is deprecated, please use from tensorrt_llm import LLM directly"

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 4678b73 to 1b94d38 on July 9, 2025 06:37
@jaedeok-nvidia
Collaborator Author

jaedeok-nvidia commented Jul 9, 2025

Hi @ccys-a11y, sorry for the inconvenience. This PR was broken while rebasing onto main and addressing review comments. For now, I've confirmed that it works at least with quickstart_advanced.py, and I'm running further tests. Here are the commands I used for a quick test (TinyLlama-1.1B-Chat-v1.0 was used):

# Two sequences should be identical due to greedy decoding.
$ TLLM_ALLOW_N_GREEDY_DECODING=1 python quickstart_advanced.py --model_dir /path/to/model --n 2
# Two sequences are expected to differ since the high temperature makes sampling almost random.
$ python quickstart_advanced.py --model_dir /path/to/model --n 2 --top_p 0.9 --temperature 999

For the second issue (failing to import LLM via "from tensorrt_llm._torch import LLM"), I think this is not directly related to this PR. We made the torch backend the default path a few weeks ago; I guess that is the reason. You can import LLM directly via from tensorrt_llm import LLM. However, on the current branch after rebasing onto ToT, it seems to work:

>>> from tensorrt_llm._torch import LLM
...
[07/09/2025-06:47:24] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level to INFO
2025-07-09 06:47:24,959 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[07/09/2025-06:47:25] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 1.0.0rc3
>>> 

@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11404 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11404 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8435 completed with status: 'FAILURE'

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 1b94d38 to b519eca on July 9, 2025 11:37
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 733d056 to a6ca1c8 on July 11, 2025 13:09
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11654 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11654 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8630 completed with status: 'SUCCESS'

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from a6ca1c8 to 7c1d650 on July 13, 2025 12:38
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11731 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11731 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8686 completed with status: 'FAILURE'

Collaborator

@QiJune QiJune left a comment

LGTM

@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11748 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11748 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8700 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@ccys-a11y

Hi @jaedeok-nvidia, thanks for your method. I found that it works for the 'quickstart' script. However, when I benchmark Qwen3-14B on the AIME24/25 datasets with n=32, the following error occurs intermittently. It seems it's not stable enough. Can you help?

"
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/python3/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1031, in _executor_loop_overlap
self.resource_manager.prepare_resources(scheduled_batch)
File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/pyexecutor/resource_manager.py", line 793, in prepare_resources
resource_manager.prepare_resources(scheduled_batch)
File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/pyexecutor/resource_manager.py", line 307, in prepare_resources
self.impl.add_token(req.py_request_id)
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: isLeaf() (/dockerdata/caiyi/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:334)
1 0x7f8b7f48b9a9 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 76
2 0x7f8b7f4b75c2 /usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1a615c2) [0x7f8b7f4b75c2]
3 0x7f8b801d2380 tensorrt_llm::batch_manager::kv_cache_manager::WindowBlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 176
4 0x7f8b801d3f7b tensorrt_llm::batch_manager::kv_cache_manager::WindowBlockManager::allocateBlock(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, bool) + 299
5 0x7f8b801d560b tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::updateToken(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, bool) + 123
6 0x7f8b95ad7275 /usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x108275) [0x7f8b95ad7275]
7 0x7f8b95a7e040 /usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xaf040) [0x7f8b95a7e040]
8 0x7f8dc201284e /usr/local/python3/lib/libpython3.10.so.1.0(+0x1f384e) [0x7f8dc201284e]
9 0x7f8dc1fecedb _PyObject_MakeTpCall + 123
10 0x7f8dc1f1f27d /usr/local/python3/lib/libpython3.10.so.1.0(+0x10027d) [0x7f8dc1f1f27d]
11 0x7f8dc2051dc7 _PyEval_EvalFrameDefault + 20487
12 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
13 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
14 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
15 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
16 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
17 0x7f8dc1fef1ee /usr/local/python3/lib/libpython3.10.so.1.0(+0x1d01ee) [0x7f8dc1fef1ee]
18 0x7f8dc204fc90 _PyEval_EvalFrameDefault + 11984
19 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
20 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
21 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
22 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
23 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
24 0x7f8dc1fef1ee /usr/local/python3/lib/libpython3.10.so.1.0(+0x1d01ee) [0x7f8dc1fef1ee]
25 0x7f8dc20f5fa6 /usr/local/python3/lib/libpython3.10.so.1.0(+0x2d6fa6) [0x7f8dc20f5fa6]
26 0x7f8dc20da7d4 /usr/local/python3/lib/libpython3.10.so.1.0(+0x2bb7d4) [0x7f8dc20da7d4]
27 0x7f8dc1c071ca /lib64/libpthread.so.0(+0x81ca) [0x7f8dc1c071ca]
28 0x7f8dc10d88d3 clone + 67
"

@jaedeok-nvidia
Collaborator Author

@ccys-a11y Thanks for reporting the issue. The error may come from an incorrect count of the request budget. Could you share the reproduction steps with us? That would help us add more concrete tests.

FYI, we've reimplemented the fix in PR #5997 with cleaner logic (no need to mimic LlmRequest anymore). Correctly counting the request budget is also addressed there, though we need to double-check whether that was the root cause. That PR is going to be merged soon. Sorry for the delay in resolving this issue.

cc. @evezhier
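
For intuition only: "counting the request budget" here means that a parent request with n return sequences should consume n scheduler slots (itself plus its n - 1 children). The helper below is a made-up illustration of that idea, not code from either PR, and the child_requests attribute name is an assumption:

    def count_scheduled_requests(requests):
        # Each parent occupies one slot plus one per child request, so a
        # request with num_return_sequences == n counts as n toward the budget.
        total = 0
        for req in requests:
            total += 1 + len(getattr(req, "child_requests", []))
        return total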

@jaedeok-nvidia
Collaborator Author

#5997 has been merged. Closing this PR.
