fix: Enable num_return_sequences (n) support in PyTorch backend #5415
Conversation
Force-pushed from 53c03c7 to 22001d5
Force-pushed from 22001d5 to 71811ea
/bot run
PR_Github #9901 [ run ] triggered by Bot
PR_Github #9901 [ run ] completed with state
Force-pushed from 170abb9 to 0a93904
# Copy Python-specific attributes from parent to child
child_request.py_client_id = self.py_client_id
child_request.py_parent_request_id = self.py_request_id
child_request.py_request_id = child_request.request_id
child_request.py_llm_request_type = child_request.llm_request_type
child_request.py_end_id = child_request.end_id
child_request.py_prompt_len = child_request.prompt_len
child_request.py_orig_prompt_len = child_request.orig_prompt_len
child_request.py_max_new_tokens = child_request.max_new_tokens

# Copy Python-specific configuration from parent
child_request.py_return_log_probs = self.py_return_log_probs
child_request.py_return_context_logits = self.py_return_context_logits
child_request.py_return_generation_logits = self.py_return_generation_logits
child_request.py_return_logits_device_memory = self.py_return_logits_device_memory
child_request.py_exclude_last_generation_logits = self.py_exclude_last_generation_logits
child_request.py_stop_words_list = self.py_stop_words_list
child_request.py_logits_post_processors = self.py_logits_post_processors
child_request.py_rewind_len = self.py_rewind_len
child_request.py_decoding_iter = self.py_decoding_iter
child_request.py_draft_tokens = self.py_draft_tokens.copy() if self.py_draft_tokens else []
child_request.py_last_draft_tokens = self.py_last_draft_tokens.copy() if self.py_last_draft_tokens else None
child_request.py_num_accepted_draft_tokens = self.py_num_accepted_draft_tokens
child_request.py_lora_task_layer_module_configs = self.py_lora_task_layer_module_configs

# Initialize Python-specific runtime state
child_request.py_batch_idx = None
child_request.is_attention_dp_dummy = self.is_attention_dp_dummy
child_request.is_cuda_graph_dummy = self.is_cuda_graph_dummy
If possible, we should make this happen automatically instead of manually copying every field. Otherwise, we need to maintain this list every time an attribute is added or removed. Would a copy.deepcopy work?
Yeah, either copy.deepcopy, or the copy/clone should be encapsulated in a separate method of Request.
Thanks @Superjomn @syuoni for pointing out these issues. Unfortunately, there is a gap between a parent request (of class pyexecutor.LlmRequest) and a child request (of class bindings.LlmRequest).

A parent tracks all child requests created by create_child_request, and those states are shared with each other. All of this logic happens internally in the C++ runtime; it handles termination or cancellation of requests on the executor side. The ugly part is that the result of create_child_request is of type bindings.LlmRequest. For now, I couldn't find a better and clearer way to inherit the class. I believe this issue can be resolved once #3034 finishes by bringing all the required logic to the Python side.

As a WAR before #3034, here the child request generated by a parent mimics pyexecutor.LlmRequest; that's what this function does. And I totally agree that encapsulation is necessary for maintainability. Since copy won't work for this case, I will copy the attributes matching the py_* pattern plus some extras like is_attention_dp_dummy. This will make the code clearer and reduce maintenance risk. Does this make sense?
I see. But considering how many members are flattened here, it would be easy to forget one when a new member is introduced. Maybe the following code can help automate the copying of most of the members, with a proper blacklist or whitelist introduced:

for attr_name in dir(self):
    if attr_name.startswith("py_"):
        value = getattr(self, attr_name)
        setattr(child_request, attr_name, value)

You can try it in a subsequent PR; you don't need to change it in this PR.
Agree, it's easy to forget since there are many contributors. However, it's already updated; GitHub is showing the original implementation, not the latest revision. Here is the latest one:

# Copy all py_* attributes from parent to child
for attr_name, attr_value in self.__dict__.items():
    if attr_name.startswith('py_'):
        attr_value = getattr(self, attr_name)
        setattr(child_request, attr_name, copy.deepcopy(attr_value))
I see. Currently there is no method to create a child request as a pyexecutor.LlmRequest.

Are those child requests processed in the Python runtime? If so, will their different type (bindings.LlmRequest) cause any issues?
@syuoni yes, that's why the mimic functions are added. However, as mentioned, this is just a WAR until the LlmRequest and resource manager logic is properly reimplemented within the torch backend.
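As a minimal sketch of the mimic approach under the whitelist-style copy discussed above, the helper below shows how the Python-only state could be transferred onto a bindings.LlmRequest child. The name mimic_py_attributes is hypothetical and this is not the exact code in this PR:

import copy

def mimic_py_attributes(parent, child_request):
    """Hypothetical helper: make a C++ bindings.LlmRequest child behave like
    the Python-side pyexecutor.LlmRequest by copying Python-only state."""
    # Copy every Python-only attribute that follows the py_* naming pattern.
    for attr_name, attr_value in vars(parent).items():
        if attr_name.startswith("py_"):
            setattr(child_request, attr_name, copy.deepcopy(attr_value))

    # Extras that do not follow the py_* pattern are copied explicitly.
    for attr_name in ("is_attention_dp_dummy", "is_cuda_graph_dummy"):
        setattr(child_request, attr_name, getattr(parent, attr_name))

    # Child-specific fields still need to be overridden from the child itself,
    # e.g. py_request_id must reflect the child's own request_id.
    child_request.py_request_id = child_request.request_id
    child_request.py_batch_idx = None
    return child_request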
Force-pushed from 0a93904 to 28d16ac
Force-pushed from ed2b471 to 4678b73
/bot run
PR_Github #11227 [ run ] triggered by Bot
PR_Github #11227 [ run ] completed with state
I compiled and installed tensorrt_llm from your GitHub branch, but encountered two issues:
1. Error Info:
Processed requests: 0%| | 0/4 [00:00<?, ?it/s]
[07/08/2025-17:05:26] [TRT-LLM] [E] Error in event loop: fail to schedule any pending request, probably run out of resource.
Exception in thread Thread-7 (_event_loop_wrapper):
Force-pushed from 4678b73 to 1b94d38
Hi @ccys-a11y, sorry for the inconvenience. This PR was broken while rebasing on main and addressing review comments. For now, I confirmed it works at least with the following:

# Two sequences should be identical due to greedy decoding.
$ TLLM_ALLOW_N_GREEDY_DECODING=1 python quickstart_advanced.py --model_dir /path/to/model --n 2

# Two sequences are expected to differ since the high temperature makes sampling almost random.
$ python quickstart_advanced.py --model_dir /path/to/model --n 2 --top_p 0.9 --temperature 999

For the second issue (2. Failed to import LLM from torch: "from tensorrt_llm._torch import LLM"), I think this is not directly related to this PR. We made the torch backend the default path a few weeks ago, which I guess is the reason. You can import LLM directly:

>>> from tensorrt_llm._torch import LLM
...
[07/09/2025-06:47:24] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level to INFO
2025-07-09 06:47:24,959 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[07/09/2025-06:47:25] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 1.0.0rc3
>>>
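For completeness, here is a rough sketch of exercising n through the LLM API directly rather than via the quickstart script; the exact import path and parameter set are illustrative and may differ from the released API:

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/model")

# Request two sequences per prompt; with sampling enabled they can differ.
sampling_params = SamplingParams(n=2, top_p=0.9, temperature=0.8, max_tokens=64)

for output in llm.generate(["Hello, my name is"], sampling_params):
    for seq in output.outputs:
        print(seq.text)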
/bot run
PR_Github #11404 [ run ] triggered by Bot
PR_Github #11404 [ run ] completed with state
Force-pushed from 1b94d38 to b519eca
Force-pushed from 733d056 to a6ca1c8
/bot run
PR_Github #11654 [ run ] triggered by Bot
PR_Github #11654 [ run ] completed with state
This PR enables the `n` parameter (num_return_sequences) in the PyTorch backend, which is the default path for the LLM API. While this feature was already implemented in the TRT backend via the C++ Executor, it was missing in the PyExecutor. This PR fixes the gap by adding the necessary APIs to the pybind of the `LlmRequest` class.

Changes:
- Added a `create_child_request` method to `pyexecutor.LlmRequest` that wraps C++'s createChildRequest method. This allows requests to properly handle their child requests and states.
- Updated C++ LlmRequest and related Python bindings to expose additional properties required in the PyTorch backend.
- Enhanced `PyExecutor` to create child requests, ensuring proper handling of requests when `num_return_sequences > 1`.

Signed-off-by: Jaedeok Kim <[email protected]>
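As a rough illustration of the last bullet, a simplified sketch of how the executor side might expand incoming requests into child requests is shown below. This is not the actual PyExecutor code: the helper name expand_requests and the next_request_id callable are hypothetical, and only create_child_request and num_return_sequences come from the description above.

def expand_requests(requests, next_request_id):
    """Simplified sketch: expand each request into itself plus (n - 1)
    child requests when num_return_sequences > 1."""
    expanded = []
    for request in requests:
        expanded.append(request)
        n = getattr(request.sampling_config, "num_return_sequences", 1) or 1
        for _ in range(n - 1):
            # create_child_request wraps C++ createChildRequest and copies the
            # Python-side (py_*) state, as discussed in the review above.
            expanded.append(request.create_child_request(next_request_id()))
    return expanded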
Force-pushed from a6ca1c8 to 7c1d650
/bot run
PR_Github #11731 [ run ] triggered by Bot
PR_Github #11731 [ run ] completed with state
LGTM
/bot run
PR_Github #11748 [ run ] triggered by Bot
PR_Github #11748 [ run ] completed with state
Hi @jaedeok-nvidia, thanks for your method. I find it works for the quickstart script. However, when I benchmark Qwen3-14B on the AIME24/25 datasets with n=32, the following error occurs intermittently. It seems it's not stable enough. Can you help?
@ccys-a11y Thanks for reporting the issue. The error may come from an incorrect count of the request budget. Can you share the reproduction steps with us? That would help us add more concrete tests. FYI, we've reimplemented the fix in PR #5997 with cleaner logic (no need to mimic LlmRequest anymore). Correct counting of the request budget is also addressed there, though we need to double-check whether that was the root cause. That PR is going to be merged soon. Sorry for the delay in resolving this issue. cc @evezhier
#5997 has been merged. Closing this PR.