imp(torchsampler):support openai stop in text level #6450
base: main
Conversation
Walkthrough

Tokenizer support was added throughout the codebase by introducing an optional `tokenizer` parameter that is threaded from the LLM API through the executor, proxy, worker, and samplers to enable text-level stop-word handling.
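As a rough usage sketch (not taken from this PR; the model name is illustrative, and it assumes the LLM API threads the tokenizer through automatically as diagrammed below), OpenAI-style text-level stops would be requested through `SamplingParams.stop`:

```python
from tensorrt_llm import LLM, SamplingParams

# Illustrative checkpoint; any model supported by the LLM API works here.
llm = LLM(model="Qwen/Qwen2-7B-Instruct")

# "stop" is matched at the text level, so generation halts even when the
# stop string is embedded inside a larger token (e.g. "你" inside "你好").
params = SamplingParams(max_tokens=64, stop=["你"])

outputs = llm.generate(["请打个招呼"], params)
print(outputs[0].outputs[0].text)  # should not contain "你好"
```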
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant LLM as LLM API
    participant Exec as GenerationExecutor
    participant Proxy as GenerationExecutorProxy
    participant Worker as GenerationExecutorWorker
    participant Sampler as Sampler
    participant Result as GenerationResult
    LLM->>Exec: create(tokenizer)
    Exec->>Proxy: GenerationExecutorProxy(..., tokenizer)
    Proxy->>Worker: submit(..., tokenizer)
    Worker->>Sampler: instantiate_sampler(..., tokenizer)
    Worker->>Result: GenerationResult(..., tokenizer)
    Sampler->>Result: _meet_stop_token_criteria(..., tokenizer)
    Result->>Result: _check_text_stop_criteria(..., tokenizer)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~18 minutes
Actionable comments posted: 3
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
504-535: Remove duplicate tokenizer assignment.

The tokenizer is assigned twice in the TRTLLMSampler constructor (lines 506 and 534).

```diff
 def __init__(
     self,
     executor_config: ExecutorConfig,
     model,
     model_dtype,
     mapping: Mapping,
     decoding_mode: DecodingMode,
     disable_overlap_scheduler: bool,
     tokenizer: PreTrainedTokenizerBase
 ):
     self.tokenizer = tokenizer
     # ... other initialization code ...
-    self.tokenizer = tokenizer
-
     self._initialize_store()
     self._instantiate_algorithms()
```
🧹 Nitpick comments (20)
tensorrt_llm/sampling_params.py (1)
277-277: Fix type annotation spacing.

Add a space after the colon in the type annotation to follow PEP 8 guidelines.

```diff
-    tokenizer:PreTrainedTokenizerBase = None
+    tokenizer: PreTrainedTokenizerBase = None
```

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
203-204: Fix parameter formatting to follow PEP 8 guidelines.

Add proper spacing in the type annotation and around the default value assignment.

```diff
-        garbage_collection_gen0_threshold: Optional[int] = None,
-        tokenizer:PreTrainedTokenizerBase=None) -> PyExecutor:
+        garbage_collection_gen0_threshold: Optional[int] = None,
+        tokenizer: PreTrainedTokenizerBase = None) -> PyExecutor:
```

tensorrt_llm/executor/proxy.py (1)
51-51: Fix type annotation spacing.

Add a space after the colon in the type annotation to follow PEP 8 guidelines.

```diff
-    tokenizer:PreTrainedTokenizerBase = None,
+    tokenizer: PreTrainedTokenizerBase = None,
```

tensorrt_llm/_torch/pyexecutor/_util.py (4)
565-569: Fix formatting: add space after comma.

The parameter list has inconsistent spacing.

```diff
 def instantiate_sampler(engine: PyTorchModelEngine,
                         executor_config: ExecutorConfig,
                         pytorch_backend_config: PyTorchConfig,
-                        mapping: Mapping,
-                        tokenizer:PreTrainedTokenizerBase):
+                        mapping: Mapping,
+                        tokenizer: PreTrainedTokenizerBase):
```
574-577: Fix line length issue.

Line 576 exceeds the 120-character limit. Consider splitting the parameters across multiple lines.

```diff
-        enable_mixed_sampler=pytorch_backend_config.enable_mixed_sampler,)
+        enable_mixed_sampler=pytorch_backend_config.enable_mixed_sampler,
+    )
     if mapping.cp_config.get('cp_type') == 'star_attention':
         assert pytorch_backend_config.attn_backend == "FLASHINFER_STAR_ATTENTION", "attention backend of star attention should be 'FLASHINFER_STAR_ATTENTION'"
-        return TorchSampler(sampler_args,tokenizer)
+        return TorchSampler(sampler_args, tokenizer)
```
565-590: Consider making tokenizer parameter optional.

Making the `tokenizer` parameter required could be a breaking change for existing code. Consider making it optional to maintain backward compatibility.

```diff
 def instantiate_sampler(engine: PyTorchModelEngine,
                         executor_config: ExecutorConfig,
                         pytorch_backend_config: PyTorchConfig,
                         mapping: Mapping,
-                        tokenizer: PreTrainedTokenizerBase):
+                        tokenizer: Optional[PreTrainedTokenizerBase] = None):
```
585-590: Fix inconsistent spacing around comma.

Missing space after comma in function arguments.

```diff
-                            tokenizer)
+                            tokenizer)
     if not engine.model.model_config.is_generation:
         # NOTE: choose sampler based on model type
         return EarlyStopSampler()
-    return TorchSampler(sampler_args,tokenizer)
+    return TorchSampler(sampler_args, tokenizer)
```

tensorrt_llm/executor/result.py (2)
258-258: Fix formatting: add spaces after commas.

Missing spaces after commas in function arguments.

```diff
-            new_generated_text = self.sampling_params.tokenizer.decode(new_generated_token_ids[idx],skip_special_tokens=False,clean_up_tokenization_spaces=False)
+            new_generated_text = self.sampling_params.tokenizer.decode(new_generated_token_ids[idx], skip_special_tokens=False, clean_up_tokenization_spaces=False)
```
266-266: Fix formatting: add spaces after commas.

Missing spaces after commas in function arguments.

```diff
-                stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
```

tensorrt_llm/_torch/pyexecutor/sampler.py (4)
229-236: Fix formatting: add space after comma.

Missing space after comma in constructor parameter.

```diff
-    def __init__(self, args: Args,tokenizer):
+    def __init__(self, args: Args, tokenizer):
```
252-273: Consider error handling for tokenizer decode operations.

The tokenizer decode operations could potentially fail. Consider adding error handling to make the code more robust.

```diff
     @staticmethod
-    def _meet_stop_token_criteria(request: LlmRequest,tokenizer,new_token):
+    def _meet_stop_token_criteria(request: LlmRequest, tokenizer, new_token):
         if request.py_stop_words_list:
             assert isinstance(
                 request.py_stop_words_list,
                 list), "request.py_stop_words_list should be a list"
             stop_words_list, prefix_sum = request.py_stop_words_list
             tokens = request.get_tokens(0)
-            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+            try:
+                new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+            except Exception:
+                # If decode fails, fall back to token-based matching only
+                new_words = ""
             offset = 0
             for i, offset_end in enumerate(prefix_sum):
                 if i > 0:
                     offset = prefix_sum[i - 1]
                 stop_word = stop_words_list[offset:offset_end]
-                stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                try:
+                    stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+                except Exception:
+                    continue
```
260-266: Fix formatting: add spaces after commas.

Missing spaces after commas in function arguments.

```diff
-            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+            new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
             offset = 0
             for i, offset_end in enumerate(prefix_sum):
                 if i > 0:
                     offset = prefix_sum[i - 1]
                 stop_word = stop_words_list[offset:offset_end]
-                stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
```
288-288: Fix formatting: add space after comma.

Missing space after comma in function call.

```diff
-        if self._meet_stop_token_criteria(request,self.tokenizer,new_token):
+        if self._meet_stop_token_criteria(request, self.tokenizer, new_token):
```

tensorrt_llm/executor/executor.py (5)
38-40: Remove unused import.

`TokenizerBase` is imported but never used in this file.

```diff
 from transformers import PreTrainedTokenizerBase
-from ..llmapi.tokenizer import TokenizerBase
```
403-406: Improve formatting consistency.

The line continuation and indentation could be more consistent.

```diff
                 postproc_worker_config=postproc_worker_config,
                 is_llm_executor=is_llm_executor,
-                garbage_collection_gen0_threshold=
-                garbage_collection_gen0_threshold,
+                garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                 tokenizer=tokenizer)
```
415-420: Improve formatting consistency.

The line continuation and indentation could be more consistent.

```diff
             return GenerationExecutorWorker(**worker_kwargs,
                                             is_llm_executor=is_llm_executor,
-                                            garbage_collection_gen0_threshold=
-                                            garbage_collection_gen0_threshold,
+                                            garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                                             tokenizer=tokenizer)
```
432-435: Improve formatting consistency.

The line continuation and indentation could be more consistent.

```diff
                 postproc_worker_config=postproc_worker_config,
                 is_llm_executor=is_llm_executor,
-                garbage_collection_gen0_threshold=
-                garbage_collection_gen0_threshold,
+                garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                 tokenizer=tokenizer)
```
446-449: Improve formatting consistency.

The line continuation and indentation could be more consistent.

```diff
                 postproc_worker_config=postproc_worker_config,
                 is_llm_executor=is_llm_executor,
-                garbage_collection_gen0_threshold=
-                garbage_collection_gen0_threshold,
+                garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                 tokenizer=tokenizer)
```

tensorrt_llm/executor/worker.py (2)
64-64: Fix type annotation spacing.

Missing space after colon in type annotation. This violates Python style conventions.

```diff
-        tokenizer:PreTrainedTokenizerBase = None
+        tokenizer: PreTrainedTokenizerBase = None
```
618-618: Fix type annotation spacing.

Missing space after colon in type annotation, same issue as in the constructor.

```diff
-        tokenizer:PreTrainedTokenizerBase = None
+        tokenizer: PreTrainedTokenizerBase = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (7 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/result.py (1 hunk)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/llmapi/llm.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/sampling_params.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/executor/executor.py
tensorrt_llm/executor/proxy.py
tensorrt_llm/executor/result.py
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/pyexecutor/_util.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/llmapi/llm.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/sampling_params.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/executor/executor.py
tensorrt_llm/executor/proxy.py
tensorrt_llm/executor/result.py
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/pyexecutor/_util.py
🧠 Learnings (2)
tensorrt_llm/llmapi/llm.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
tensorrt_llm/executor/result.py (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/executor/result.py
257-257: Line too long (181 > 120)
(E501)
264-264: Line too long (121 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/_util.py
576-576: Line too long (158 > 120)
(E501)
🔇 Additional comments (14)
tensorrt_llm/llmapi/llm.py (2)
853-854: LGTM! Proper tokenizer propagation to TRT executor.

The tokenizer parameter is correctly passed to the executor creation, enabling text-level stop word functionality in the TensorRT backend.

988-989: LGTM! Proper tokenizer propagation to PyTorch executor.

The tokenizer parameter is correctly passed to the executor creation, enabling text-level stop word functionality in the PyTorch backend.
tensorrt_llm/sampling_params.py (2)
7-7: LGTM! Necessary import for tokenizer type annotation.

The import is properly placed and required for the new tokenizer field.

340-341: LGTM! Proper tokenizer assignment in setup method.

The tokenizer parameter is correctly assigned to the instance field, enabling tokenizer propagation through sampling configurations.
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)
30-30: LGTM! Necessary import for tokenizer type annotation.

The import is properly placed and required for the new tokenizer parameter.

345-345: LGTM! Proper tokenizer propagation to sampler instantiation.

The tokenizer parameter is correctly passed to the sampler, enabling text-level stop word functionality.
tensorrt_llm/executor/proxy.py (2)
31-32: LGTM! Necessary import for tokenizer type annotation.

The import is properly placed and required for the new tokenizer parameter.

100-101: LGTM! Proper tokenizer propagation to workers.

The tokenizer is correctly added to worker_kwargs, enabling workers to receive the tokenizer instance for text-level processing.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
34-35: LGTM!

The import is correctly placed and follows the proper namespace convention.
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
29-29: LGTM!

The import is properly placed and follows the namespace convention.
tensorrt_llm/executor/worker.py (4)
43-43: LGTM: Import statement follows coding guidelines.

The import correctly maintains the namespace and imports only the required class as specified in the coding guidelines.

88-88: LGTM: Proper instance attribute assignment.

The tokenizer is correctly stored as an instance attribute following the existing pattern.

142-142: LGTM: Tokenizer correctly forwarded to executor creation.

The tokenizer is properly passed as a keyword argument to the create_executor function, enabling tokenizer support in the executor pipeline.

745-746: LGTM: Tokenizer parameter correctly passed to worker constructor.

The tokenizer is properly forwarded from the worker_main function to the GenerationExecutorWorker constructor using appropriate keyword argument syntax.
```diff
 if mapping.cp_config.get('cp_type') == 'star_attention':
     assert pytorch_backend_config.attn_backend == "FLASHINFER_STAR_ATTENTION", "attention backend of star attention should be 'FLASHINFER_STAR_ATTENTION'"
-    return TorchSampler(sampler_args)
+    return TorchSampler(sampler_args,tokenizer)
```
Why can't this be part of sampler_args?
Thank you, I have followed your suggestion.
```diff
@@ -247,22 +249,28 @@ def _meet_max_token_stop_criteria(self, request: LlmRequest):
                 >= self.max_seq_len)

     @staticmethod
-    def _meet_stop_token_criteria(request: LlmRequest):
+    def _meet_stop_token_criteria(request: LlmRequest,tokenizer,new_token):
```
Please give types to these variables
Thank you, I have followed your suggestion.
tensorrt_llm/executor/proxy.py
Outdated
```diff
@@ -46,6 +48,7 @@ def __init__(
         postproc_worker_config: Optional[PostprocWorkerConfig] = None,
         is_llm_executor: Optional[bool] = None,
         garbage_collection_gen0_threshold: Optional[int] = None,
+        tokenizer:PreTrainedTokenizerBase = None,
```
```diff
-        tokenizer:PreTrainedTokenizerBase = None,
+        tokenizer: Optional[PreTrainedTokenizerBase] = None,
```
Thank you, I have followed your suggestion.
tensorrt_llm/executor/worker.py
Outdated
```diff
@@ -59,6 +61,7 @@ def __init__(
         is_llm_executor: Optional[bool] = None,
         lora_config: Optional[LoraConfig] = None,
         garbage_collection_gen0_threshold: Optional[int] = None,
+        tokenizer:PreTrainedTokenizerBase = None
```
```diff
-        tokenizer:PreTrainedTokenizerBase = None
+        tokenizer: Optional[PreTrainedTokenizerBase] = None
```
Thank you, I have followed your suggestion.
tensorrt_llm/executor/worker.py
Outdated
```diff
@@ -610,6 +615,7 @@ def worker_main(
             bool] = True,  # whether it's the main executor instance
         lora_config: Optional[LoraConfig] = None,
         garbage_collection_gen0_threshold: Optional[int] = None,
+        tokenizer:PreTrainedTokenizerBase = None
```
```diff
-        tokenizer:PreTrainedTokenizerBase = None
+        tokenizer: Optional[PreTrainedTokenizerBase] = None
```
Thank you, I have followed your suggestion.
tensorrt_llm/sampling_params.py
Outdated
```diff
@@ -273,6 +274,8 @@ class SamplingParams:
     # TODO: make this a per-request parameter.
     _stream_interval: Optional[int] = field(default=None, init=False, repr=False)

+    tokenizer:PreTrainedTokenizerBase = None
```
```diff
-    tokenizer:PreTrainedTokenizerBase = None
+    tokenizer: Optional[PreTrainedTokenizerBase] = None
```
Thank you, I have followed your suggestion.
```python
            stop_words_list, prefix_sum = request.py_stop_words_list
            tokens = request.get_tokens(0)
            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
```
tokenizer may be None
Thank you, I have followed your suggestion.
```python
            offset = 0
            for i, offset_end in enumerate(prefix_sum):
                if i > 0:
                    offset = prefix_sum[i - 1]
                stop_word = stop_words_list[offset:offset_end]
                stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
```
tokenizer may be None
Thank you. When the tokenizer is None, we now fall back to token-ID-level matching.
Force-pushed from f4b01bc to 01854de
Actionable comments posted: 1
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
228-228: Add Optional type annotation for tokenizer parameter.

The tokenizer parameter should be typed as Optional since it may be None.

```diff
-    tokenizer: PreTrainedTokenizerBase
+    tokenizer: Optional[PreTrainedTokenizerBase]
```
261-265: Add null check for tokenizer before use.

The tokenizer parameter may be None and should be checked before calling decode.

```diff
             try:
+                if tokenizer is None:
+                    new_words = ""
+                else:
-                new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                    new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
             except Exception:
                 # If decode fails, fall back to token-based matching only
                 new_words = ""
```
512-512: Add Optional type annotation for tokenizer parameter.

The tokenizer parameter should be typed as Optional since it may be None.

```diff
-    tokenizer: PreTrainedTokenizerBase
+    tokenizer: Optional[PreTrainedTokenizerBase]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/sampling_params.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/worker.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/executor/result.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
253-253: Line too long (138 > 120)
(E501)
272-272: Line too long (122 > 120)
(E501)
🔇 Additional comments (2)
tensorrt_llm/executor/result.py (2)
200-221: Good refactoring of stop criteria logic.

The extraction of text-based stop criteria logic into a separate method improves code readability and maintainability. The null check for tokenizer is properly implemented.

274-282: Clean integration of refactored stop criteria logic.

The modification properly integrates the new helper method while maintaining backward compatibility for non-string stop reasons.
Force-pushed from 01854de to 871eece
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/sampling_params.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/executor.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.362Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use a maximum of 120 characters per line in C++ code.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
276-276: Line too long (122 > 120)
(E501)
🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/sampler.py (7)
4-4: LGTM! Import changes support tokenizer functionality.

The addition of `Union` and `List` to typing imports and the `PreTrainedTokenizerBase` import from transformers are appropriate for the new tokenizer support. Also applies to: 29-29

228-228: LGTM! Tokenizer field properly added to Args dataclass.

The addition of the `tokenizer` field with correct type annotation follows the established pattern and supports the new functionality.

236-236: LGTM! Tokenizer properly stored as instance variable.

The tokenizer is correctly stored from the args parameter following the established constructor pattern.

253-257: LGTM! Method signature properly updated for tokenizer support.

The method signature correctly adds the tokenizer and new_token parameters with appropriate type annotations. The multi-line formatting improves readability.

300-300: LGTM! Method call correctly updated with tokenizer parameter.

The call to `_meet_stop_token_criteria` properly passes the tokenizer instance and new_token parameter as required by the updated method signature.

388-388: LGTM! Minor formatting improvement.

The added blank line improves code readability in the `sample_async` method.

516-516: LGTM! TRTLLMSampler constructor properly updated for tokenizer support.

The tokenizer parameter is correctly added to the constructor signature and stored as an instance variable, maintaining consistency with the TorchSampler implementation. Also applies to: 545-546
```python
            stop_words_list, prefix_sum = request.py_stop_words_list
            tokens = request.get_tokens(0)
            try:
                new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
```
This PR adds `tokenizer.decode` almost everywhere; I think this could cause perf issues.

@xq25478 May I have the context for why token-ID-level stop comparison is not adequate?
Of course. In fact, SGLang and vLLM already support text-level stop interception, and this feature is very necessary.
I hope the official TensorRT-LLM team can implement this feature as soon as possible.
I mean, could you please help me understand why the token-ID-level stop comparison doesn't work?
> I mean, could you please help me understand why the token-ID-level stop comparison doesn't work?

In Chinese, "你" and "你好" correspond to two different token IDs. However, at the text level, the first character of "你好" is "你". Therefore, when stop is set to "你", the correct output should not include "你好". The current output of TRT-LLM does include "你好", which is incorrect.
tensorrt_llm/sampling_params.py
Outdated
```diff
@@ -273,6 +274,8 @@ class SamplingParams:
     # TODO: make this a per-request parameter.
     _stream_interval: Optional[int] = field(default=None, init=False, repr=False)

+    tokenizer: Optional[PreTrainedTokenizerBase] = None
```
Why should we add tokenizer as a field to `SamplingParams`? The tokenizer should be the same for all requests.
I see your point. Do you have a better idea for how to provide the tokenizer to `SamplingParams`?
```python
                new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
            except Exception:
                # If decode fails, fall back to token-based matching only
                new_words = ""
```
`stop_text` could also be empty, which means `stop_text in new_words` could be True.
fixed.
```python
                try:
                    stop_text = tokenizer.decode(
                        stop_word,
                        skip_special_tokens=False,
                        clean_up_tokenization_spaces=False
                    )
                except Exception:
                    continue
```
Why are there so many `try ... except` blocks? Do we have expectations for which exception should be raised here? Also, please use a specific exception type instead of `Exception`.
fixed.
Force-pushed from 5781cb2 to 1d0b121
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (3 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/executor/worker.py (6 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tensorrt_llm/sampling_params.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/worker.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
276-276: Line too long (122 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (6)
4-4: LGTM!

The additional imports for type hints and tokenizer support are correctly added and necessary for the new functionality. Also applies to: 29-29

228-228: LGTM!

The tokenizer field addition to the Args dataclass is properly typed and follows the existing pattern.

236-236: LGTM!

The tokenizer attribute assignment follows the same pattern as other constructor arguments and is correctly implemented.

300-300: LGTM!

The method call correctly passes the required tokenizer and new_token parameters to match the updated method signature.

388-388: LGTM!

Minor formatting improvement with the added blank line.

516-516: LGTM!

The tokenizer parameter addition to the TRTLLMSampler constructor and attribute assignment follows the same pattern as TorchSampler and is correctly implemented. Also applies to: 545-546
Force-pushed from 1d0b121 to da386ad
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
253-285: Fix line length violation and improve exception handling.

The text-level stop detection implementation correctly enables OpenAI-style stop functionality, but there are code quality issues to address.

Apply this diff to fix the line length issue and improve exception handling:

```diff
-                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+                stop_text = tokenizer.decode(
+                    stop_word,
+                    skip_special_tokens=False,
+                    clean_up_tokenization_spaces=False
+                )
-            except Exception:
+            except (ValueError, TypeError, AttributeError):
-            except Exception:
+            except (ValueError, TypeError, AttributeError):
```

The specific exception types cover the most likely tokenizer decode failures while avoiding overly broad exception catching, as required by the coding guidelines.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
265-285: Consider performance optimization for tokenizer decode operations.

The implementation correctly adds text-level stop detection, but the frequent `tokenizer.decode()` calls could impact performance, especially with many stop words. Consider caching decoded stop words to avoid repeated decoding of the same sequences.

Potential optimization approach:

```python
# Cache decoded stop words at request initialization
if not hasattr(request, '_decoded_stop_words'):
    request._decoded_stop_words = {}
    for i, offset_end in enumerate(request.py_stop_words_list[1]):
        # Cache decoded versions
        ...
```

This addresses the performance concern raised in previous reviews while maintaining the valuable text-level stop functionality.
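For concreteness, a slightly fuller sketch of the caching idea (a hypothetical helper, not code from this PR; the `(stop_words_list, prefix_sum)` layout and the `_decoded_stop_words` attribute name follow the suggestion above):

```python
def decoded_stop_texts(request, tokenizer):
    """Decode each stop-word token sequence once per request and cache it."""
    cached = getattr(request, "_decoded_stop_words", None)
    if cached is None:
        stop_words_list, prefix_sum = request.py_stop_words_list
        cached = []
        offset = 0
        for i, offset_end in enumerate(prefix_sum):
            if i > 0:
                offset = prefix_sum[i - 1]
            # Decode this stop word's token slice exactly once.
            cached.append(tokenizer.decode(stop_words_list[offset:offset_end],
                                           skip_special_tokens=False))
        request._decoded_stop_words = cached
    return cached
```

With this, each sampling step only decodes the newly generated tokens; the stop words themselves are decoded once per request instead of once per step.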
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (5 hunks)
- tensorrt_llm/executor/result.py (5 hunks)
- tensorrt_llm/executor/worker.py (7 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/sampling_params.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tensorrt_llm/sampling_params.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/executor/worker.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: applies to **/*.py : when using try-except blocks in python, limit the except to the smallest set of...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : when using try-except blocks to handle multiple possible variable types in pyth...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
276-276: Line too long (122 > 120)
(E501)
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)
4-4: LGTM - Import additions support tokenizer functionality.

The new imports for `Union`, `List`, and `PreTrainedTokenizerBase` are appropriate for the enhanced stop token detection functionality and follow the coding guidelines for maintaining namespaces. Also applies to: 29-29

228-228: LGTM - Tokenizer integration in TorchSampler constructor.

The tokenizer parameter is properly typed and stored as an instance attribute, following the established patterns in the class constructor. Also applies to: 236-236

300-300: LGTM - Method call updated to support text-level stop detection.

The call to `_meet_stop_token_criteria` correctly passes the tokenizer and new_token parameters to enable the enhanced stop detection functionality.

516-516: LGTM - Consistent tokenizer integration in TRTLLMSampler.

The tokenizer parameter and storage implementation in TRTLLMSampler maintains consistency with the TorchSampler implementation, ensuring uniform tokenizer support across both sampler types. Also applies to: 545-546

388-388: Minor formatting adjustment.

This whitespace change appears to improve code formatting.
Force-pushed from 312e2ed to 84a8e58
Actionable comments posted: 2
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (5 hunks)
- tensorrt_llm/executor/result.py (5 hunks)
- tensorrt_llm/executor/worker.py (7 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tensorrt_llm/executor/proxy.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/worker.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py
276-276: Line too long (122 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (6)
4-4: LGTM!

The import additions are appropriate for the tokenizer integration functionality. Also applies to: 29-29

228-229: LGTM!

The tokenizer parameter is correctly added to the Args dataclass and stored as an instance attribute with proper type annotation. Also applies to: 236-236

300-300: LGTM!

The method call is correctly updated to pass the tokenizer instance and new_token parameter.

388-388: LGTM!

Minor formatting improvement for readability.

516-516: LGTM!

The tokenizer parameter is correctly added to the TRTLLMSampler constructor with proper type annotation and instance variable assignment. Also applies to: 545-546

228-236: Excellent integration of tokenizer support for text-level stop criteria.

The implementation successfully adds tokenizer support to both TorchSampler and TRTLLMSampler classes, enabling text-level stop word detection as requested in the PR objectives. The approach is well designed:

- Proper type annotations with `PreTrainedTokenizerBase`
- Clean integration into the existing sampler architecture
- Intelligent fallback to token-level matching when text decoding fails
- Maintains backward compatibility

The main enhancement allows for more accurate stop criteria detection by comparing decoded text rather than just token sequences, which addresses the limitation mentioned in the PR summary.

Also applies to: 253-285, 300-300, 516-516, 545-546
```diff
@@ -1,7 +1,7 @@
 from abc import ABC, abstractmethod
```
💡 Verification agent
🧩 Analysis chain
Verify NVIDIA copyright header presence.
According to the coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the current year. Please ensure this file includes the required copyright header.
🏁 Script executed:

```bash
#!/bin/bash
# Check if the file contains NVIDIA copyright header
head -10 tensorrt_llm/_torch/pyexecutor/sampler.py | grep -i "nvidia\|copyright"
```

Length of output: 82
Let’s search the entire file for any NVIDIA/copyright lines:
🏁 Script executed:

```bash
#!/bin/bash
# Search for any NVIDIA or Copyright mentions in sampler.py
grep -n -E "NVIDIA|Copyright" tensorrt_llm/_torch/pyexecutor/sampler.py || true
```

Length of output: 268
Add NVIDIA copyright header to sampler.py
This file is missing the required NVIDIA copyright header. Per CODING_GUIDELINES.md, all Python source files must begin with a header that includes the current year and NVIDIA’s copyright.
• File to update:
- tensorrt_llm/_torch/pyexecutor/sampler.py (at the very top)
Please insert a header similar to:

```python
# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES
# SPDX-License-Identifier: Apache-2.0
```

Ensure it matches the exact formatting used in other TensorRT-LLM files.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py at line 1, add the NVIDIA copyright
header at the very top of the file before any imports. Insert the following two
lines exactly as shown: "# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA
CORPORATION & AFFILIATES" and "# SPDX-License-Identifier: Apache-2.0". This will
ensure compliance with the CODING_GUIDELINES.md and match the formatting used in
other TensorRT-LLM files.
```python
    def _meet_stop_token_criteria(
            request: LlmRequest,
            tokenizer: PreTrainedTokenizerBase,
            new_token: Union[int, List[int], torch.Tensor]
    ):
        if request.py_stop_words_list:
            assert isinstance(
                request.py_stop_words_list,
                list), "request.py_stop_words_list should be a list"

            stop_words_list, prefix_sum = request.py_stop_words_list
            tokens = request.get_tokens(0)
            try:
                new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
            except Exception:
                # If decode fails, fall back to token-based matching only
                new_words = ""
            offset = 0
            for i, offset_end in enumerate(prefix_sum):
                if i > 0:
                    offset = prefix_sum[i - 1]
                stop_word = stop_words_list[offset:offset_end]
                try:
                    stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
                except Exception:
                    continue
                if len(stop_word) > len(tokens):
                    continue
                if tokens[-len(stop_word):] == stop_word:
                    return True
                if stop_text in new_words:
                    return True
```
Fix remaining issues from previous reviews.

The implementation has good logic for text-level stop criteria detection, but several issues from previous reviews still need to be addressed:

- Line length violation: line 276 exceeds the 120-character limit
- Broad exception handling: using `Exception` instead of specific exceptions
- Empty stop text check: missing validation before substring matching
Apply this diff to fix these issues:

```diff
             try:
-                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
-            except Exception:
+                stop_text = tokenizer.decode(
+                    stop_word,
+                    skip_special_tokens=False,
+                    clean_up_tokenization_spaces=False
+                )
+            except (ValueError, TypeError, AttributeError):
                 continue
             if len(stop_word) > len(tokens):
                 continue
             if tokens[-len(stop_word):] == stop_word:
                 return True
-            if stop_text in new_words:
+            if stop_text and stop_text in new_words:
                 return True
```
Also fix the first exception block:
try:
- new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
- except Exception:
+ new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+ except (ValueError, TypeError, AttributeError):
# If decode fails, fall back to token-based matching only
new_words = ""
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _meet_stop_token_criteria(
        request: LlmRequest,
        tokenizer: PreTrainedTokenizerBase,
        new_token: Union[int, List[int], torch.Tensor]
):
    if request.py_stop_words_list:
        assert isinstance(
            request.py_stop_words_list,
            list
        ), "request.py_stop_words_list should be a list"
        stop_words_list, prefix_sum = request.py_stop_words_list
        tokens = request.get_tokens(0)
        try:
            new_words = tokenizer.decode(
                new_token,
                skip_special_tokens=False,
                clean_up_tokenization_spaces=False
            )
        except (ValueError, TypeError, AttributeError):
            # If decode fails, fall back to token-based matching only
            new_words = ""
        offset = 0
        for i, offset_end in enumerate(prefix_sum):
            if i > 0:
                offset = prefix_sum[i - 1]
            stop_word = stop_words_list[offset:offset_end]
            try:
                stop_text = tokenizer.decode(
                    stop_word,
                    skip_special_tokens=False,
                    clean_up_tokenization_spaces=False
                )
            except (ValueError, TypeError, AttributeError):
                continue
            if len(stop_word) > len(tokens):
                continue
            if tokens[-len(stop_word):] == stop_word:
                return True
            if stop_text and stop_text in new_words:
                return True
```
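For context on the loop above, the (stop_words_list, prefix_sum) pair is a flattened encoding of several stop sequences. A minimal sketch of how such an encoding could be built and walked (the helper name and sample token IDs are illustrative, not from this PR):

```python
from typing import List, Tuple

def flatten_stop_words(stop_seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten several stop-token sequences into one list plus prefix sums."""
    flat: List[int] = []
    prefix_sum: List[int] = []
    for seq in stop_seqs:
        flat.extend(seq)
        prefix_sum.append(len(flat))
    return flat, prefix_sum

# Two hypothetical stop sequences: [13] and [27, 42].
flat, prefix_sum = flatten_stop_words([[13], [27, 42]])
assert flat == [13, 27, 42] and prefix_sum == [1, 3]

# Recover each stop sequence exactly the way the sampler loop does.
offset = 0
for i, offset_end in enumerate(prefix_sum):
    if i > 0:
        offset = prefix_sum[i - 1]
    print(flat[offset:offset_end])  # prints [13], then [27, 42]
```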
🧰 Tools
🪛 Ruff (0.12.2)
276-276: Line too long (122 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py between lines 253 and 285, fix the
line length violation by breaking lines exceeding 120 characters into shorter
ones, replace broad exception handling by catching more specific exceptions
related to tokenizer decoding, and add a check to ensure stop_text is not empty
before performing substring matching with new_words. Also, update the first
exception block to handle specific decode errors instead of catching all
exceptions.
84a8e58 to 3e913d0 Compare
Actionable comments posted: 2
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1-1: Add required NVIDIA copyright header.

This file still needs the NVIDIA copyright header as flagged in previous reviews. Per coding guidelines, all Python source files must include the copyright header.

Add the following header at the very top of the file:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES
+# SPDX-License-Identifier: Apache-2.0
+
 from abc import ABC, abstractmethod
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)

- tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (10 hunks)
- tensorrt_llm/executor/executor.py (6 hunks)
- tensorrt_llm/executor/proxy.py (5 hunks)
- tensorrt_llm/executor/result.py (5 hunks)
- tensorrt_llm/executor/worker.py (7 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/executor/worker.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (6)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : use a maximum of 120 characters per line....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use a maximum of 120 characters per line.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : when using try-except blocks in python, limit the except to the smallest set of...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : when using try-except blocks to handle multiple possible variable types in pyth...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.{cpp,h,cu,py} : all tensorrt-llm open source software code should contain an nvidia ...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.{cpp,h,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.{h,hpp} : use a preprocessor guard in header files. the guard name must have prefix ...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (6)

4-4: LGTM - Import changes support the new tokenizer functionality. The added imports are necessary for the tokenizer parameter and proper type annotations. Also applies to: 29-29

271-271: LGTM - Proper tokenizer integration. The tokenizer parameter is correctly added to the Args dataclass and properly stored in the constructor with appropriate type annotations. Also applies to: 279-279

301-301: LGTM - Correct integration of tokenizer-based stop criteria. The function call properly passes the tokenizer and integrates text-level stop detection into the sampling workflow.

517-517: LGTM - Consistent tokenizer integration. The tokenizer parameter is properly added to the TRTLLMSampler constructor with correct type annotations and a consistent implementation pattern. Also applies to: 546-547

782-789: LGTM - Proper integration of text-level stop detection. The stop criteria checks are correctly integrated into both the single-beam and multiple-beam update methods. The logic checks stop conditions before adding tokens and sets appropriate finish reasons when stop criteria are met. Also applies to: 857-860

297-297: Fix function call to match corrected signature. This call needs to be updated to match the corrected function signature (without self). Apply this diff:

```diff
-        if meet_max_token_stop_criteria(request,self.max_seq_len):
+        if meet_max_token_stop_criteria(request, self.max_seq_len):
```

Likely an incorrect or invalid review comment.
```python
def meet_stop_token_criteria(
        request: LlmRequest,
        tokenizer: PreTrainedTokenizerBase,
        new_token: Union[int, List[int], torch.Tensor]
):
    if request.py_stop_words_list:
        assert isinstance(
            request.py_stop_words_list,
            list), "request.py_stop_words_list should be a list"

        stop_words_list, prefix_sum = request.py_stop_words_list
        tokens = request.get_tokens(0)
        try:
            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
        except Exception:
            # If decode fails, fall back to token-based matching only
            new_words = ""
        offset = 0
        for i, offset_end in enumerate(prefix_sum):
            if i > 0:
                offset = prefix_sum[i - 1]
            stop_word = stop_words_list[offset:offset_end]
            try:
                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
            except Exception:
                continue
            if len(stop_word) > len(tokens):
                continue
            if tokens[-len(stop_word):] == stop_word:
                return True
            if stop_text in new_words:
                return True

    return False
```
Address remaining issues from previous reviews.

The function logic is sound for text-level stop detection, but several previously flagged issues still need to be fixed:

- Missing type annotations: add a proper type hint for the return value
- Broad exception handling: replace Exception with specific exceptions
- Empty stop text validation: add a check before substring matching

Apply this diff to fix these issues:

```diff
 def meet_stop_token_criteria(
         request: LlmRequest,
         tokenizer: PreTrainedTokenizerBase,
         new_token: Union[int, List[int], torch.Tensor]
-):
+) -> bool:
     if request.py_stop_words_list:
         assert isinstance(
             request.py_stop_words_list,
             list), "request.py_stop_words_list should be a list"
         stop_words_list, prefix_sum = request.py_stop_words_list
         tokens = request.get_tokens(0)
         try:
-            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
-        except Exception:
+            new_words = tokenizer.decode(
+                new_token,
+                skip_special_tokens=False,
+                clean_up_tokenization_spaces=False
+            )
+        except (ValueError, TypeError, AttributeError):
             # If decode fails, fall back to token-based matching only
             new_words = ""
         offset = 0
         for i, offset_end in enumerate(prefix_sum):
             if i > 0:
                 offset = prefix_sum[i - 1]
             stop_word = stop_words_list[offset:offset_end]
             try:
-                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
-            except Exception:
+                stop_text = tokenizer.decode(
+                    stop_word,
+                    skip_special_tokens=False,
+                    clean_up_tokenization_spaces=False
+                )
+            except (ValueError, TypeError, AttributeError):
                 continue
             if len(stop_word) > len(tokens):
                 continue
             if tokens[-len(stop_word):] == stop_word:
                 return True
-            if stop_text in new_words:
+            if stop_text and stop_text in new_words:
                 return True
     return False
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def meet_stop_token_criteria(
        request: LlmRequest,
        tokenizer: PreTrainedTokenizerBase,
        new_token: Union[int, List[int], torch.Tensor]
) -> bool:
    if request.py_stop_words_list:
        assert isinstance(
            request.py_stop_words_list,
            list
        ), "request.py_stop_words_list should be a list"
        stop_words_list, prefix_sum = request.py_stop_words_list
        tokens = request.get_tokens(0)
        try:
            new_words = tokenizer.decode(
                new_token,
                skip_special_tokens=False,
                clean_up_tokenization_spaces=False
            )
        except (ValueError, TypeError, AttributeError):
            # If decode fails, fall back to token-based matching only
            new_words = ""
        offset = 0
        for i, offset_end in enumerate(prefix_sum):
            if i > 0:
                offset = prefix_sum[i - 1]
            stop_word = stop_words_list[offset:offset_end]
            try:
                stop_text = tokenizer.decode(
                    stop_word,
                    skip_special_tokens=False,
                    clean_up_tokenization_spaces=False
                )
            except (ValueError, TypeError, AttributeError):
                continue
            if len(stop_word) > len(tokens):
                continue
            if tokens[-len(stop_word):] == stop_word:
                return True
            if stop_text and stop_text in new_words:
                return True
    return False
```
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around lines 209 to 242, add a
return type annotation to the meet_stop_token_criteria function to specify it
returns a bool. Replace the broad except Exception clauses with more specific
exceptions related to tokenizer.decode failures, such as UnicodeDecodeError or
TokenizerError if applicable. Before checking if stop_text is in new_words, add
a condition to skip empty stop_text values to avoid false positives or errors
during substring matching.
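To illustrate the failure mode this function guards against, a minimal sketch with a stub tokenizer (the stub, its vocabulary, and the token IDs are all hypothetical, not part of the PR):

```python
class StubTokenizer:
    """Hypothetical tokenizer mapping fixed token IDs to strings for the demo."""
    VOCAB = {1: "Hel", 2: "lo##", 3: "#"}

    def decode(self, ids, skip_special_tokens=False, clean_up_tokenization_spaces=False):
        if isinstance(ids, int):
            ids = [ids]
        return "".join(self.VOCAB[i] for i in ids)

tok = StubTokenizer()
# The stop string "###" tokenizes on its own as [3, 3, 3], but the model
# emitted [1, 2, 3] ("Hel" + "lo##" + "#" == "Hello###").
generated = [1, 2, 3]
stop_ids = [3, 3, 3]
assert generated[-len(stop_ids):] != stop_ids          # token-level suffix match misses
assert tok.decode(stop_ids) in tok.decode(generated)   # text-level match catches "###"
```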
```python
def meet_max_token_stop_criteria(self, request: LlmRequest,max_seq_len):
    num_tokens = request.get_num_tokens(self.BEAM)
    return (num_tokens - request.py_orig_prompt_len
            >= request.py_max_new_tokens) or (num_tokens
                                              >= max_seq_len)
```
Fix incorrect function signature.

This standalone function incorrectly includes self as a parameter, which will cause runtime errors when called.

Apply this diff to fix the function signature and add type annotations:

```diff
-def meet_max_token_stop_criteria(self, request: LlmRequest,max_seq_len):
+def meet_max_token_stop_criteria(request: LlmRequest, max_seq_len: int) -> bool:
-    num_tokens = request.get_num_tokens(self.BEAM)
+    num_tokens = request.get_num_tokens(0)  # Use beam 0 consistently
     return (num_tokens - request.py_orig_prompt_len
             >= request.py_max_new_tokens) or (num_tokens
                                               >= max_seq_len)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def meet_max_token_stop_criteria(request: LlmRequest, max_seq_len: int) -> bool:
    num_tokens = request.get_num_tokens(0)  # Use beam 0 consistently
    return (num_tokens - request.py_orig_prompt_len
            >= request.py_max_new_tokens) or (num_tokens
                                              >= max_seq_len)
```
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around lines 245 to 249, the
function meet_max_token_stop_criteria is incorrectly defined with a self
parameter, but it is a standalone function. Remove the self parameter from the
function signature and add appropriate type annotations for the parameters and
return type to fix the runtime errors.
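For reference, a rough sketch of how the corrected standalone helper could be exercised with a dummy request object (the dummy attributes mirror LlmRequest fields but are fabricated for the demo):

```python
from types import SimpleNamespace

def meet_max_token_stop_criteria(request, max_seq_len: int) -> bool:
    # Mirrors the suggested fix above, with the request typed loosely for the demo.
    num_tokens = request.get_num_tokens(0)
    return (num_tokens - request.py_orig_prompt_len
            >= request.py_max_new_tokens) or (num_tokens >= max_seq_len)

# Dummy request: 10 prompt tokens, 5 generated, budget of 5 new tokens -> stop.
request = SimpleNamespace(
    py_orig_prompt_len=10,
    py_max_new_tokens=5,
    get_num_tokens=lambda beam: 15,
)
assert meet_max_token_stop_criteria(request, max_seq_len=2048) is True
```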
Signed-off-by: xq25478 <[email protected]>
3e913d0 to 4e7fffa Compare
imp(torchsampler):support openai stop in text level
Currently, neither TorchSampler nor TRTLLMSampler in TensorRT-LLM can intercept stop words at the text level; stop words are matched only as token-ID sequences. This PR therefore implements text-level stop-word interception in TorchSampler.
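To make the intent concrete, a rough sketch of the user-facing behavior (the model name is illustrative, and the call shape follows the LLM API's SamplingParams as I understand it):

```python
from tensorrt_llm import LLM, SamplingParams

# An OpenAI-style `stop` string should be honored even when it never appears
# as one contiguous token-ID sequence in the generated stream.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(max_tokens=64, stop=["\nObservation:"])
output = llm.generate(["List three fruits."], params)[0]
# With text-level interception, generation halts as soon as the decoded text
# contains "\nObservation:", regardless of how it was tokenized.
print(output.outputs[0].text)
```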
Summary by CodeRabbit