
Conversation

Collaborator

@jaedeok-nvidia jaedeok-nvidia commented Jun 24, 2025

This PR enables the n parameter (num_return_sequences) in the PyTorch backend, which is the default path for the LLM API. While this feature was already implemented in the TRT backend via the C++ Executor, it was missing in the PyExecutor. This PR closes the gap by adding the necessary APIs to the Python bindings (pybind) of the LlmRequest class.

Changes:

  • Added create_child_request method to pyexecutor.LlmRequest that wraps C++'s createChildRequest method. This allows requests to properly handle their child requests and states.
  • Updated C++ LlmRequest and related Python bindings to expose additional properties required in the PyTorch backend.
  • Enhanced PyExecutor to create child requests, ensuring requests are handled correctly when num_return_sequences > 1 (see the sketch below).
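
For illustration, here is a minimal sketch of what the executor-side expansion amounts to. The function name expand_requests, the sampling_config accessor, and the argument-free create_child_request call are assumptions made for this sketch, not the exact code in this PR:

    def expand_requests(new_requests):
        # Expand each incoming request into the parent plus (n - 1) children,
        # so every returned sequence has its own request object to schedule.
        expanded = []
        for req in new_requests:
            expanded.append(req)
            n = req.sampling_config.num_return_sequences or 1  # assumed accessor
            for _ in range(n - 1):
                # create_child_request wraps the C++ createChildRequest; each
                # child shares the parent's prompt and lifecycle state.
                expanded.append(req.create_child_request())
        return expanded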

@jaedeok-nvidia jaedeok-nvidia requested a review from a team as a code owner June 24, 2025 03:29
@jaedeok-nvidia jaedeok-nvidia changed the title fix: Enable num_return_sequences (n) support in PyTorch backend [DRAFT] fix: Enable num_return_sequences (n) support in PyTorch backend Jun 24, 2025
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch 3 times, most recently from 53c03c7 to 22001d5 on June 25, 2025 16:14
@jaedeok-nvidia jaedeok-nvidia self-assigned this Jun 25, 2025
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 22001d5 to 71811ea on June 25, 2025 16:20
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9901 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9901 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7308 completed with status: 'FAILURE'

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch 2 times, most recently from 170abb9 to 0a93904 on June 27, 2025 02:57
Comment on lines 320 to 402
# Copy Python-specific attributes from parent to child
child_request.py_client_id = self.py_client_id
child_request.py_parent_request_id = self.py_request_id
child_request.py_request_id = child_request.request_id
child_request.py_llm_request_type = child_request.llm_request_type
child_request.py_end_id = child_request.end_id
child_request.py_prompt_len = child_request.prompt_len
child_request.py_orig_prompt_len = child_request.orig_prompt_len
child_request.py_max_new_tokens = child_request.max_new_tokens

# Copy Python-specific configuration from parent
child_request.py_return_log_probs = self.py_return_log_probs
child_request.py_return_context_logits = self.py_return_context_logits
child_request.py_return_generation_logits = self.py_return_generation_logits
child_request.py_return_logits_device_memory = self.py_return_logits_device_memory
child_request.py_exclude_last_generation_logits = self.py_exclude_last_generation_logits
child_request.py_stop_words_list = self.py_stop_words_list
child_request.py_logits_post_processors = self.py_logits_post_processors
child_request.py_rewind_len = self.py_rewind_len
child_request.py_decoding_iter = self.py_decoding_iter
child_request.py_draft_tokens = self.py_draft_tokens.copy(
) if self.py_draft_tokens else []
child_request.py_last_draft_tokens = self.py_last_draft_tokens.copy(
) if self.py_last_draft_tokens else None
child_request.py_num_accepted_draft_tokens = self.py_num_accepted_draft_tokens
child_request.py_lora_task_layer_module_configs = self.py_lora_task_layer_module_configs

# Initialize Python-specific runtime state
child_request.py_batch_idx = None
child_request.is_attention_dp_dummy = self.is_attention_dp_dummy
child_request.is_cuda_graph_dummy = self.is_cuda_graph_dummy
Collaborator

If possible, we should make this happen automatically instead of manually copying every field. Otherwise, we need to maintain this list every time an attribute is added or removed.

Will a copy.deepcopy work?

Collaborator

Yeah, either copy.deepcopy, or the copy/clone should be encapsulated in a separate method of Request.

Collaborator Author

Thanks @Superjomn @syuoni for pointing out this issue. Unfortunately, there is a gap between a parent request (of class pyexecutor.LlmRequest) and a child request (of class bindings.LlmRequest).

A parent request tracks all the child requests created by create_child_request, and their states are shared with each other. All of this logic happens internally in the C++ runtime; it is used for handling termination or cancellation of requests on the executor side. The ugly part is that the result of create_child_request is of type bindings.LlmRequest, and for now I couldn't find a better or clearer way to inherit the class. I believe this issue can be resolved once #3034 is finished and brings all the required logic to the Python side.

As a WAR before #3034, the child request generated by a parent mimics pyexecutor.LlmRequest; that is what this function does. And I totally agree that encapsulation is necessary for maintainability. Since copy won't work for this case, I will copy the attributes matching the py_* pattern plus some extras like is_attention_dp_dummy. This will make the code clearer and reduce the maintenance risk. Does this make sense?

Collaborator

@Superjomn Superjomn Jul 11, 2025

I see. But considering how many members are flattened here, it would be easy to forget one when a new member is introduced. Maybe the following code can help automate copying most of the members, with a proper blacklist or whitelist:

for attr_name in dir(self):
    if attr_name.startswith("py_"):
        value = getattr(self, attr_name)
        setattr(child_request, attr_name, value)

You can try it in a subsequent PR; there is no need to change it in this PR.

Collaborator Author

@jaedeok-nvidia jaedeok-nvidia Jul 11, 2025

Agreed, it's easy to forget since there are many contributors. However, this has already been updated; GitHub is showing the original implementation, not the latest revision. Here is the latest one:

        # Copy all py_* attributes from parent to child
        for attr_name, attr_value in self.__dict__.items():
            if attr_name.startswith('py_'):
                attr_value = getattr(self, attr_name)
                setattr(child_request, attr_name, copy.deepcopy(attr_value))

Collaborator

I see. Currently there is no method to create a child request as a pyexecutor.LlmRequest.

Are those child requests processed in the Python runtime? If so, will their different type (bindings.LlmRequest) cause any issues?

Collaborator Author

@syuoni Yes, that's why the mimic functions are added. However, as mentioned, this is just a WAR until the LlmRequest and resource manager logic is properly reimplemented in the torch backend.

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 0a93904 to 28d16ac on July 8, 2025 05:34
@jaedeok-nvidia jaedeok-nvidia changed the title [DRAFT] fix: Enable num_return_sequences (n) support in PyTorch backend fix: Enable num_return_sequences (n) support in PyTorch backend Jul 8, 2025
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from ed2b471 to 4678b73 on July 8, 2025 06:18
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11227 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11227 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8304 completed with status: 'FAILURE'

@ccys-a11y

ccys-a11y commented Jul 8, 2025

@jaedeok-nvidia

I compiled and installed tensorrt_llm from your GitHub branch, but encountered two issues:

  1. The quickstart_advanced.py script still errors out when setting n=2.

Error Info:"Processed requests: 0%| | 0/4 [00:00<?, ?it/s][07/08/2025-17:05:26] [TRT-LLM] [E] Error in event loop: fail to schedule any pending request, probably run out of resource.
[07/08/2025-17:05:26] [TRT-LLM] [E] Traceback (most recent call last):
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 285, in _event_loop_wrapper
self.event_loop()
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1033, in _executor_loop_overlap
assert scheduled_batch.batch_size > 0, (
AssertionError: fail to schedule any pending request, probably run out of resource.

Exception in thread Thread-7 (_event_loop_wrapper):
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/python3/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 289, in _event_loop_wrapper
raise e
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 285, in _event_loop_wrapper
self.event_loop()
File "/root/new/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1033, in _executor_loop_overlap
assert scheduled_batch.batch_size > 0, (
AssertionError: fail to schedule any pending request, probably run out of resource."
2. Failed to import LLM from torch: "from tensorrt_llm._torch import LLM"

Error Info:
" File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/llm.py", line 7, in init
raise ImportError(
ImportError: _torch.llm is deprecated, please use from tensorrt_llm import LLM directly"

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 4678b73 to 1b94d38 on July 9, 2025 06:37
@jaedeok-nvidia
Collaborator Author

jaedeok-nvidia commented Jul 9, 2025

Hi @ccys-a11y, sorry for the inconvenience. This PR was broken while rebasing onto main and addressing review comments. For now, I've confirmed that it works at least with quickstart_advanced.py, and I'm running further tests. Here are the commands I used for a quick test (TinyLlama-1.1B-Chat-v1.0 was used):

# Two sequences should be identical due to greedy decoding.
$ TLLM_ALLOW_N_GREEDY_DECODING=1 python quickstart_advanced.py --model_dir /path/to/model --n 2
# Two sequences are expected to differ since the high temperature makes sampling almost random.
$ python quickstart_advanced.py --model_dir /path/to/model --n 2 --top_p 0.9 --temperature 999

For the second issue (failing to import LLM via "from tensorrt_llm._torch import LLM"), I think this is not directly related to this PR. We made the torch backend the default path a few weeks ago; I guess that is the reason. You can import LLM directly via from tensorrt_llm import LLM. However, on the current branch after rebasing onto ToT, it seems to work:

>>> from tensorrt_llm._torch import LLM
...
[07/09/2025-06:47:24] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level to INFO
2025-07-09 06:47:24,959 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[07/09/2025-06:47:25] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 1.0.0rc3
>>> 

@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11404 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11404 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8435 completed with status: 'FAILURE'

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 1b94d38 to b519eca on July 9, 2025 11:37
@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from 733d056 to a6ca1c8 on July 11, 2025 13:09
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11654 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11654 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8630 completed with status: 'SUCCESS'

@jaedeok-nvidia jaedeok-nvidia force-pushed the fix/torch-backend-num_returns branch from a6ca1c8 to 7c1d650 on July 13, 2025 12:38
@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11731 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11731 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8686 completed with status: 'FAILURE'

Collaborator

@QiJune QiJune left a comment

LGTM

@jaedeok-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11748 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11748 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8700 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@ccys-a11y

Hi @jaedeok-nvidia, thanks for your method. I found that it works for the 'quickstart' script. However, when I benchmark Qwen3-14B on the AIME24/25 datasets with n=32, the following error occurs intermittently. It seems it's not stable enough. Can you help?

"
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/python3/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1031, in _executor_loop_overlap
self.resource_manager.prepare_resources(scheduled_batch)
File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/pyexecutor/resource_manager.py", line 793, in prepare_resources
resource_manager.prepare_resources(scheduled_batch)
File "/usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/_torch/pyexecutor/resource_manager.py", line 307, in prepare_resources
self.impl.add_token(req.py_request_id)
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: isLeaf() (/dockerdata/caiyi/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:334)
1 0x7f8b7f48b9a9 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 76
2 0x7f8b7f4b75c2 /usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x1a615c2) [0x7f8b7f4b75c2]
3 0x7f8b801d2380 tensorrt_llm::batch_manager::kv_cache_manager::WindowBlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 176
4 0x7f8b801d3f7b tensorrt_llm::batch_manager::kv_cache_manager::WindowBlockManager::allocateBlock(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, bool) + 299
5 0x7f8b801d560b tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::updateToken(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, bool) + 123
6 0x7f8b95ad7275 /usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x108275) [0x7f8b95ad7275]
7 0x7f8b95a7e040 /usr/local/python3/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xaf040) [0x7f8b95a7e040]
8 0x7f8dc201284e /usr/local/python3/lib/libpython3.10.so.1.0(+0x1f384e) [0x7f8dc201284e]
9 0x7f8dc1fecedb _PyObject_MakeTpCall + 123
10 0x7f8dc1f1f27d /usr/local/python3/lib/libpython3.10.so.1.0(+0x10027d) [0x7f8dc1f1f27d]
11 0x7f8dc2051dc7 _PyEval_EvalFrameDefault + 20487
12 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
13 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
14 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
15 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
16 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
17 0x7f8dc1fef1ee /usr/local/python3/lib/libpython3.10.so.1.0(+0x1d01ee) [0x7f8dc1fef1ee]
18 0x7f8dc204fc90 _PyEval_EvalFrameDefault + 11984
19 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
20 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
21 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
22 0x7f8dc204d49a _PyEval_EvalFrameDefault + 1754
23 0x7f8dc204bfe2 /usr/local/python3/lib/libpython3.10.so.1.0(+0x22cfe2) [0x7f8dc204bfe2]
24 0x7f8dc1fef1ee /usr/local/python3/lib/libpython3.10.so.1.0(+0x1d01ee) [0x7f8dc1fef1ee]
25 0x7f8dc20f5fa6 /usr/local/python3/lib/libpython3.10.so.1.0(+0x2d6fa6) [0x7f8dc20f5fa6]
26 0x7f8dc20da7d4 /usr/local/python3/lib/libpython3.10.so.1.0(+0x2bb7d4) [0x7f8dc20da7d4]
27 0x7f8dc1c071ca /lib64/libpthread.so.0(+0x81ca) [0x7f8dc1c071ca]
28 0x7f8dc10d88d3 clone + 67
"

@jaedeok-nvidia
Collaborator Author

@ccys-a11y Thanks for reporting the issue. The error may come from an incorrect count of the request budget. Could you share the reproduction steps with us? That would help us add more concrete tests.

FYI, we've reimplemented the fix in PR #5997 with cleaner logic (no need to mimic LlmRequest anymore). Correctly counting the request budget is also addressed there, though we need to double-check whether that was the root cause. That PR is going to be merged soon. Sorry for the delay in resolving this issue.

cc. @evezhier
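
For intuition only: "counting the request budget" here means that a parent request with n return sequences should consume n scheduler slots (itself plus its n - 1 children). The helper below is a made-up illustration of that idea, not code from either PR, and the child_requests attribute name is an assumption:

    def count_scheduled_requests(requests):
        # Each parent occupies one slot plus one per child request, so a
        # request with num_return_sequences == n counts as n toward the budget.
        total = 0
        for req in requests:
            total += 1 + len(getattr(req, "child_requests", []))
        return total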

@jaedeok-nvidia
Collaborator Author

#5997 has been merged. Closing this PR.
