
Conversation

xq25478
Contributor

@xq25478 xq25478 commented Jul 29, 2025

imp(torchsampler):support openai stop in text level

Currently, TorchSampler and TRTLLMSampler in TensorRT-LLM cannot intercept stop words at the text level. This PR therefore implements text-level stop word interception in TorchSampler.
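A minimal sketch of the text-level check this PR describes (placeholder names only, not the actual TorchSampler code):

from typing import List

from transformers import PreTrainedTokenizerBase


def meets_text_level_stop(tokenizer: PreTrainedTokenizerBase,
                          new_token_ids: List[int],
                          stop_strings: List[str]) -> bool:
    """Return True if any non-empty stop string appears in the decoded new text."""
    new_text = tokenizer.decode(new_token_ids, skip_special_tokens=False,
                                clean_up_tokenization_spaces=False)
    return any(stop and stop in new_text for stop in stop_strings)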

Summary by CodeRabbit

  • New Features
    • Added support for passing a tokenizer object throughout the model execution pipeline, enabling more accurate and flexible handling of stop criteria during text generation.
    • Improved stop condition detection by using tokenizer-based decoding of generated tokens.
  • Bug Fixes
    • Enhanced stop word detection to work with decoded text, reducing errors in generation termination.
  • Chores
    • Updated interfaces to accept and propagate an optional tokenizer parameter across relevant components.

@xq25478 xq25478 requested review from a team as code owners July 29, 2025 13:34
@xq25478 xq25478 requested review from Superjomn and dongxuy04 July 29, 2025 13:34
Contributor

coderabbitai bot commented Jul 29, 2025

Walkthrough

Tokenizer support was added throughout the codebase by introducing an optional tokenizer parameter (of type PreTrainedTokenizerBase) to major executor, sampler, and result classes. This parameter is propagated through constructors, factory functions, and methods, enabling text-based stop criteria detection and token decoding during generation.

Changes

  • Tokenizer propagation in executor and worker classes
    (tensorrt_llm/executor/executor.py, tensorrt_llm/executor/proxy.py, tensorrt_llm/executor/worker.py, tensorrt_llm/llmapi/llm.py)
    Added optional tokenizer parameter to executor, proxy, and worker constructors and methods. The tokenizer is passed and stored throughout the executor creation process, including within the LLM API layer.
  • Sampler and sampler argument updates
    (tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, tensorrt_llm/_torch/pyexecutor/sampler.py)
    Added tokenizer argument to sampler creation and instantiation functions, and to TorchSampler/TRTLLMSampler constructors. Tokenizer is used for decoding and enhanced stop criteria logic. Introduced standalone stop criteria functions using tokenizer decoding.
  • Enhanced stop criteria and result handling
    (tensorrt_llm/executor/result.py)
    Constructors for result classes now accept tokenizer. Introduced text-based stop criteria checking using the tokenizer for more accurate detection of stop sequences during generation.

Sequence Diagram(s)

sequenceDiagram
    participant LLM as LLM API
    participant Exec as GenerationExecutor
    participant Proxy as GenerationExecutorProxy
    participant Worker as GenerationExecutorWorker
    participant Sampler as Sampler
    participant Result as GenerationResult

    LLM->>Exec: create(tokenizer)
    Exec->>Proxy: GenerationExecutorProxy(..., tokenizer)
    Proxy->>Worker: submit(..., tokenizer)
    Worker->>Sampler: instantiate_sampler(..., tokenizer)
    Worker->>Result: GenerationResult(..., tokenizer)
    Sampler->>Result: _meet_stop_token_criteria(..., tokenizer)
    Result->>Result: _check_text_stop_criteria(..., tokenizer)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Suggested reviewers

  • litaotju
  • yilin-void
  • chzblych
  • venkywonka

Poem

In the garden of code where the rabbits hop,
We passed a tokenizer—now we never stop!
From samplers to workers, it travels with care,
Decoding each token, aware and aware.
Stop words are spotted with text-savvy eyes—
The carrots of progress, a fluffy surprise!
🥕🐇

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e913d0 and 4e7fffa.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (10 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (5 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/executor/worker.py (7 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

504-535: Remove duplicate tokenizer assignment

The tokenizer is assigned twice in the TRTLLMSampler constructor (lines 506 and 534).

     def __init__(
         self,
         executor_config: ExecutorConfig,
         model,
         model_dtype,
         mapping: Mapping,
         decoding_mode: DecodingMode,
         disable_overlap_scheduler: bool,
         tokenizer: PreTrainedTokenizerBase
     ):
         self.tokenizer = tokenizer
 
         # ... other initialization code ...
 
-        self.tokenizer = tokenizer
-
         self._initialize_store()
         self._instantiate_algorithms()
🧹 Nitpick comments (20)
tensorrt_llm/sampling_params.py (1)

277-277: Fix type annotation spacing.

Add a space after the colon in the type annotation to follow PEP 8 guidelines.

-    tokenizer:PreTrainedTokenizerBase = None
+    tokenizer: PreTrainedTokenizerBase = None
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

203-204: Fix parameter formatting to follow PEP 8 guidelines.

Add proper spacing in the type annotation and around the default value assignment.

-        garbage_collection_gen0_threshold: Optional[int] = None,
-        tokenizer:PreTrainedTokenizerBase=None) -> PyExecutor:
+        garbage_collection_gen0_threshold: Optional[int] = None,
+        tokenizer: PreTrainedTokenizerBase = None) -> PyExecutor:
tensorrt_llm/executor/proxy.py (1)

51-51: Fix type annotation spacing.

Add a space after the colon in the type annotation to follow PEP 8 guidelines.

-        tokenizer:PreTrainedTokenizerBase = None,
+        tokenizer: PreTrainedTokenizerBase = None,
tensorrt_llm/_torch/pyexecutor/_util.py (4)

565-569: Fix formatting: add space after comma

The parameter list has inconsistent spacing.

 def instantiate_sampler(engine: PyTorchModelEngine,
                         executor_config: ExecutorConfig,
                         pytorch_backend_config: PyTorchConfig,
-                        mapping: Mapping,
-                        tokenizer:PreTrainedTokenizerBase):
+                        mapping: Mapping,
+                        tokenizer: PreTrainedTokenizerBase):

574-577: Fix line length issue

Line 576 exceeds the 120 character limit. Consider splitting the parameters across multiple lines.

-        enable_mixed_sampler=pytorch_backend_config.enable_mixed_sampler,)
+        enable_mixed_sampler=pytorch_backend_config.enable_mixed_sampler,
+    )
     if mapping.cp_config.get('cp_type') == 'star_attention':
         assert pytorch_backend_config.attn_backend == "FLASHINFER_STAR_ATTENTION", "attention backend of star attention should be 'FLASHINFER_STAR_ATTENTION'"
-        return TorchSampler(sampler_args,tokenizer)
+        return TorchSampler(sampler_args, tokenizer)

565-590: Consider making tokenizer parameter optional

Making the tokenizer parameter required could be a breaking change for existing code. Consider making it optional to maintain backward compatibility.

 def instantiate_sampler(engine: PyTorchModelEngine,
                         executor_config: ExecutorConfig,
                         pytorch_backend_config: PyTorchConfig,
                         mapping: Mapping,
-                        tokenizer: PreTrainedTokenizerBase):
+                        tokenizer: Optional[PreTrainedTokenizerBase] = None):

585-590: Fix inconsistent spacing around comma

Missing space after comma in function arguments.

-                             tokenizer)
+                             tokenizer)
     if not engine.model.model_config.is_generation:
         # NOTE: choose sampler based on model type
         return EarlyStopSampler()
-    return TorchSampler(sampler_args,tokenizer)
+    return TorchSampler(sampler_args, tokenizer)
tensorrt_llm/executor/result.py (2)

258-258: Fix formatting: add spaces after commas

Missing spaces after commas in function arguments.

-                                new_generated_text = self.sampling_params.tokenizer.decode(new_generated_token_ids[idx],skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                                new_generated_text = self.sampling_params.tokenizer.decode(new_generated_token_ids[idx], skip_special_tokens=False, clean_up_tokenization_spaces=False)

266-266: Fix formatting: add spaces after commas

Missing spaces after commas in function arguments.

-                stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
tensorrt_llm/_torch/pyexecutor/sampler.py (4)

229-236: Fix formatting: add space after comma

Missing space after comma in constructor parameter.

-    def __init__(self, args: Args,tokenizer):
+    def __init__(self, args: Args, tokenizer):

252-273: Consider error handling for tokenizer decode operations

The tokenizer decode operations could potentially fail. Consider adding error handling to make the code more robust.

     @staticmethod
-    def _meet_stop_token_criteria(request: LlmRequest,tokenizer,new_token):
+    def _meet_stop_token_criteria(request: LlmRequest, tokenizer, new_token):
         if request.py_stop_words_list:
             assert isinstance(
                 request.py_stop_words_list,
                 list), "request.py_stop_words_list should be a list"
 
             stop_words_list, prefix_sum = request.py_stop_words_list
             tokens = request.get_tokens(0)
-            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+            try:
+                new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+            except Exception:
+                # If decode fails, fall back to token-based matching only
+                new_words = ""
             offset = 0
             for i, offset_end in enumerate(prefix_sum):
                 if i > 0:
                     offset = prefix_sum[i - 1]
                 stop_word = stop_words_list[offset:offset_end]
-                stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                try:
+                    stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+                except Exception:
+                    continue

260-266: Fix formatting: add spaces after commas

Missing spaces after commas in function arguments.

-            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+            new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
             offset = 0
             for i, offset_end in enumerate(prefix_sum):
                 if i > 0:
                     offset = prefix_sum[i - 1]
                 stop_word = stop_words_list[offset:offset_end]
-                stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)

288-288: Fix formatting: add space after comma

Missing space after comma in function call.

-        if self._meet_stop_token_criteria(request,self.tokenizer,new_token):
+        if self._meet_stop_token_criteria(request, self.tokenizer, new_token):
tensorrt_llm/executor/executor.py (5)

38-40: Remove unused import

TokenizerBase is imported but never used in this file.

 from transformers import PreTrainedTokenizerBase
-from ..llmapi.tokenizer import TokenizerBase

403-406: Improve formatting consistency

The line continuation and indentation could be more consistent.

                 postproc_worker_config=postproc_worker_config,
                 is_llm_executor=is_llm_executor,
-                garbage_collection_gen0_threshold=
-                garbage_collection_gen0_threshold,
+                garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                 tokenizer=tokenizer)

415-420: Improve formatting consistency

The line continuation and indentation could be more consistent.

             return GenerationExecutorWorker(**worker_kwargs,
                                             is_llm_executor=is_llm_executor,
-                                            garbage_collection_gen0_threshold=
-                                            garbage_collection_gen0_threshold,
+                                            garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                                             tokenizer=tokenizer)

432-435: Improve formatting consistency

The line continuation and indentation could be more consistent.

                 postproc_worker_config=postproc_worker_config,
                 is_llm_executor=is_llm_executor,
-                garbage_collection_gen0_threshold=
-                garbage_collection_gen0_threshold,
+                garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                 tokenizer=tokenizer)

446-449: Improve formatting consistency

The line continuation and indentation could be more consistent.

                 postproc_worker_config=postproc_worker_config,
                 is_llm_executor=is_llm_executor,
-                garbage_collection_gen0_threshold=
-                garbage_collection_gen0_threshold,
+                garbage_collection_gen0_threshold=garbage_collection_gen0_threshold,
                 tokenizer=tokenizer)
tensorrt_llm/executor/worker.py (2)

64-64: Fix type annotation spacing.

Missing space after colon in type annotation. This violates Python style conventions.

-        tokenizer:PreTrainedTokenizerBase = None
+        tokenizer: PreTrainedTokenizerBase = None

618-618: Fix type annotation spacing.

Missing space after colon in type annotation, same issue as in the constructor.

-    tokenizer:PreTrainedTokenizerBase = None
+    tokenizer: PreTrainedTokenizerBase = None
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1086e7 and f4b01bc.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (7 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (3 hunks)
  • tensorrt_llm/executor/result.py (1 hunks)
  • tensorrt_llm/executor/worker.py (6 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/sampling_params.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a class in the constructor in Python.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧠 Learnings (2)
tensorrt_llm/llmapi/llm.py (1)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/executor/result.py (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

🪛 Ruff (0.12.2)
tensorrt_llm/executor/result.py

257-257: Line too long (181 > 120)

(E501)


264-264: Line too long (121 > 120)

(E501)

tensorrt_llm/_torch/pyexecutor/_util.py

576-576: Line too long (158 > 120)

(E501)

🔇 Additional comments (14)
tensorrt_llm/llmapi/llm.py (2)

853-854: LGTM! Proper tokenizer propagation to TRT executor.

The tokenizer parameter is correctly passed to the executor creation, enabling text-level stop word functionality in the TensorRT backend.


988-989: LGTM! Proper tokenizer propagation to PyTorch executor.

The tokenizer parameter is correctly passed to the executor creation, enabling text-level stop word functionality in the PyTorch backend.

tensorrt_llm/sampling_params.py (2)

7-7: LGTM! Necessary import for tokenizer type annotation.

The import is properly placed and required for the new tokenizer field.


340-341: LGTM! Proper tokenizer assignment in setup method.

The tokenizer parameter is correctly assigned to the instance field, enabling tokenizer propagation through sampling configurations.

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)

30-30: LGTM! Necessary import for tokenizer type annotation.

The import is properly placed and required for the new tokenizer parameter.


345-345: LGTM! Proper tokenizer propagation to sampler instantiation.

The tokenizer parameter is correctly passed to the sampler, enabling text-level stop word functionality.

tensorrt_llm/executor/proxy.py (2)

31-32: LGTM! Necessary import for tokenizer type annotation.

The import is properly placed and required for the new tokenizer parameter.


100-101: LGTM! Proper tokenizer propagation to workers.

The tokenizer is correctly added to worker_kwargs, enabling workers to receive the tokenizer instance for text-level processing.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

34-35: LGTM!

The import is correctly placed and follows the proper namespace convention.

tensorrt_llm/_torch/pyexecutor/sampler.py (1)

29-29: LGTM!

Import is properly placed and follows the namespace convention.

tensorrt_llm/executor/worker.py (4)

43-43: LGTM: Import statement follows coding guidelines.

The import correctly maintains the namespace and imports only the required class as specified in the coding guidelines.


88-88: LGTM: Proper instance attribute assignment.

The tokenizer is correctly stored as an instance attribute following the existing pattern.


142-142: LGTM: Tokenizer correctly forwarded to executor creation.

The tokenizer is properly passed as a keyword argument to the create_executor function, enabling tokenizer support in the executor pipeline.


745-746: LGTM: Tokenizer parameter correctly passed to worker constructor.

The tokenizer is properly forwarded from the worker_main function to the GenerationExecutorWorker constructor using appropriate keyword argument syntax.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Jul 29, 2025
@QiJune QiJune requested a review from netanel-haber July 30, 2025 07:26
@QiJune QiJune assigned dcampora and unassigned dcampora Jul 30, 2025
@QiJune QiJune requested review from dcampora and Naveassaf July 30, 2025 07:26
if mapping.cp_config.get('cp_type') == 'star_attention':
assert pytorch_backend_config.attn_backend == "FLASHINFER_STAR_ATTENTION", "attention backend of star attention should be 'FLASHINFER_STAR_ATTENTION'"
return TorchSampler(sampler_args)
return TorchSampler(sampler_args,tokenizer)
Collaborator

Why can't this be part of sampler_args?

Contributor Author

Thank you, I have followed your suggestion.

@@ -247,22 +249,28 @@ def _meet_max_token_stop_criteria(self, request: LlmRequest):
>= self.max_seq_len)

@staticmethod
def _meet_stop_token_criteria(request: LlmRequest):
def _meet_stop_token_criteria(request: LlmRequest,tokenizer,new_token):
Collaborator

Please give types to these variables

Contributor Author

Thank you, I have followed your suggestion.

@@ -46,6 +48,7 @@ def __init__(
postproc_worker_config: Optional[PostprocWorkerConfig] = None,
is_llm_executor: Optional[bool] = None,
garbage_collection_gen0_threshold: Optional[int] = None,
tokenizer:PreTrainedTokenizerBase = None,
Collaborator

@netanel-haber netanel-haber Jul 30, 2025

Suggested change
tokenizer:PreTrainedTokenizerBase = None,
tokenizer: Optional[PreTrainedTokenizerBase] = None,

Contributor Author

Thank you, I have followed your suggestion.

@@ -59,6 +61,7 @@ def __init__(
is_llm_executor: Optional[bool] = None,
lora_config: Optional[LoraConfig] = None,
garbage_collection_gen0_threshold: Optional[int] = None,
tokenizer:PreTrainedTokenizerBase = None
Collaborator

Suggested change
tokenizer:PreTrainedTokenizerBase = None
tokenizer: Optional[PreTrainedTokenizerBase] = None

Contributor Author

Thank you, I have followed your suggestion.

@@ -610,6 +615,7 @@ def worker_main(
bool] = True, # whether it's the main executor instance
lora_config: Optional[LoraConfig] = None,
garbage_collection_gen0_threshold: Optional[int] = None,
tokenizer:PreTrainedTokenizerBase = None
Collaborator

Suggested change
tokenizer:PreTrainedTokenizerBase = None
tokenizer: Optional[PreTrainedTokenizerBase] = None

Contributor Author

Thank you, I have followed your suggestion.

@@ -273,6 +274,8 @@ class SamplingParams:
# TODO: make this a per-request parameter.
_stream_interval: Optional[int] = field(default=None, init=False, repr=False)

tokenizer:PreTrainedTokenizerBase = None
Collaborator

Suggested change
tokenizer:PreTrainedTokenizerBase = None
tokenizer: Optional[PreTrainedTokenizerBase] = None

Contributor Author

Thank you, I have followed your suggestion.

stop_words_list, prefix_sum = request.py_stop_words_list
tokens = request.get_tokens(0)
new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
Collaborator

tokenizer may be None

Contributor Author

Thank you, I have followed your suggestion.

offset = 0
for i, offset_end in enumerate(prefix_sum):
if i > 0:
offset = prefix_sum[i - 1]
stop_word = stop_words_list[offset:offset_end]
stop_text = tokenizer.decode(stop_word,skip_special_tokens=False,clean_up_tokenization_spaces=False)
Collaborator

tokenizer may be None

Contributor Author

Thank you. When the tokenizer is None, we fall back to token-ID-level matching.

@xq25478 xq25478 closed this Jul 30, 2025
@xq25478 xq25478 reopened this Jul 31, 2025
@xq25478 xq25478 force-pushed the v1.0.0rc3_torchsampler_imp_stop branch from f4b01bc to 01854de Compare July 31, 2025 03:42
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)

228-228: Add Optional type annotation for tokenizer parameter

The tokenizer parameter should be typed as Optional since it may be None.

-        tokenizer: PreTrainedTokenizerBase
+        tokenizer: Optional[PreTrainedTokenizerBase]

261-265: Add null check for tokenizer before use

The tokenizer parameter may be None and should be checked before calling decode.

             try: 
-                new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
+                if tokenizer is None:
+                    new_words = ""
+                else:
+                    new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
             except Exception:
                 # If decode fails, fall back to token-based matching only
                 new_words = ""

512-512: Add Optional type annotation for tokenizer parameter

The tokenizer parameter should be typed as Optional since it may be None.

-        tokenizer: PreTrainedTokenizerBase
+        tokenizer: Optional[PreTrainedTokenizerBase]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4b01bc and 01854de.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (3 hunks)
  • tensorrt_llm/executor/result.py (2 hunks)
  • tensorrt_llm/executor/worker.py (6 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/sampling_params.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/worker.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (1)
tensorrt_llm/executor/result.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

253-253: Line too long (138 > 120)

(E501)


272-272: Line too long (122 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/executor/result.py (2)

200-221: Good refactoring of stop criteria logic

The extraction of text-based stop criteria logic into a separate method improves code readability and maintainability. The null check for tokenizer is properly implemented.


274-282: Clean integration of refactored stop criteria logic

The modification properly integrates the new helper method while maintaining backward compatibility for non-string stop reasons.

@xq25478 xq25478 force-pushed the v1.0.0rc3_torchsampler_imp_stop branch from 01854de to 871eece Compare July 31, 2025 03:50
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01854de and 871eece.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (3 hunks)
  • tensorrt_llm/executor/result.py (2 hunks)
  • tensorrt_llm/executor/worker.py (6 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/sampling_params.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/executor/executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.362Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use a maximum of 120 characters per line in C++ code.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

276-276: Line too long (122 > 120)

(E501)

🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/sampler.py (7)

4-4: LGTM! Import changes support tokenizer functionality.

The addition of Union and List to typing imports and the PreTrainedTokenizerBase import from transformers are appropriate for the new tokenizer support.

Also applies to: 29-29


228-228: LGTM! Tokenizer field properly added to Args dataclass.

The addition of the tokenizer field with correct type annotation follows the established pattern and supports the new functionality.


236-236: LGTM! Tokenizer properly stored as instance variable.

The tokenizer is correctly stored from the args parameter following the established constructor pattern.


253-257: LGTM! Method signature properly updated for tokenizer support.

The method signature correctly adds the tokenizer and new_token parameters with appropriate type annotations. The multi-line formatting improves readability.


300-300: LGTM! Method call correctly updated with tokenizer parameter.

The call to _meet_stop_token_criteria properly passes the tokenizer instance and new_token parameter as required by the updated method signature.


388-388: LGTM! Minor formatting improvement.

The added blank line improves code readability in the sample_async method.


516-516: LGTM! TRTLLMSampler constructor properly updated for tokenizer support.

The tokenizer parameter is correctly added to the constructor signature and stored as an instance variable, maintaining consistency with the TorchSampler implementation.

Also applies to: 545-546

stop_words_list, prefix_sum = request.py_stop_words_list
tokens = request.get_tokens(0)
try:
new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
Collaborator

This PR adds tokenizer.decode almost everywhere. I think this could cause perf issues.

@xq25478 May I have some context on why token-ID-level stop comparison is not adequate?

Contributor Author

Of course. In fact, SGLang and vLLM already support text-level stop interception, and this feature is very necessary.

Contributor Author

I hope the official TRT-LLM team can implement this feature as soon as possible.

Collaborator

I mean, could you please help me understand why the token-ID-level stop comparison doesn't work?

Contributor Author

I mean, could you please help me understand why the token-ID-level stop comparison doesn't work?

In Chinese, "你" and "你好" correspond to two different token IDs. At the text level, however, "你好" begins with "你". So when the stop string is set to "你", the correct output should not include "你好"; the current TRT-LLM output does include "你好", which is incorrect.
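A small illustration of this, assuming a tokenizer whose vocabulary encodes "你好" as its own token (the model name below is only an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")  # example model only

stop_ids = tok.encode("你", add_special_tokens=False)
gen_ids = tok.encode("你好", add_special_tokens=False)

# Token-ID-level check: if "你好" is encoded as its own token, the stop
# sequence for "你" never appears as a sub-sequence of the generated IDs.
token_level_hit = any(gen_ids[i:i + len(stop_ids)] == stop_ids
                      for i in range(len(gen_ids) - len(stop_ids) + 1))

# Text-level check: decoding first makes the shared prefix visible.
text_level_hit = "你" in tok.decode(gen_ids, skip_special_tokens=False)

print(token_level_hit, text_level_hit)  # typically: False True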

@@ -273,6 +274,8 @@ class SamplingParams:
# TODO: make this a per-request parameter.
_stream_interval: Optional[int] = field(default=None, init=False, repr=False)

tokenizer: Optional[PreTrainedTokenizerBase] = None
Collaborator

Why should we add tokenizer as a field to SamplingParams? The tokenizer should be the same for all requests.

Contributor Author

I know what you mean. Do you have a better idea for how to provide the tokenizer to SamplingParams?

new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
except Exception:
# If decode fails, fall back to token-based matching only
new_words = ""
Collaborator

stop_text could also be empty, in which case stop_text in new_words would evaluate to True.
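A minimal guard for that case (a sketch, assuming stop_text holds the decoded stop word and new_words the decoded new tokens):

if stop_text and stop_text in new_words:
    return True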

Contributor Author

fixed.

Comment on lines 275 to 278
try:
stop_text = tokenizer.decode(
stop_word,
skip_special_tokens=False,
clean_up_tokenization_spaces=False
)
except Exception:
continue
Collaborator

Why are there so many try...except blocks? Do we know which exceptions are expected to be raised here?

Also, please use the specific exception type instead of Exception.

Contributor Author

fixed.

@xq25478 xq25478 requested a review from netanel-haber July 31, 2025 07:54
@xq25478 xq25478 force-pushed the v1.0.0rc3_torchsampler_imp_stop branch from 5781cb2 to 1d0b121 Compare July 31, 2025 12:24
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5781cb2 and 1d0b121.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (3 hunks)
  • tensorrt_llm/executor/result.py (2 hunks)
  • tensorrt_llm/executor/worker.py (6 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/sampling_params.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/executor/worker.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

276-276: Line too long (122 > 120)

(E501)

🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (6)

4-4: LGTM!

The additional imports for type hints and tokenizer support are correctly added and necessary for the new functionality.

Also applies to: 29-29


228-228: LGTM!

The tokenizer field addition to the Args dataclass is properly typed and follows the existing pattern.


236-236: LGTM!

The tokenizer attribute assignment follows the same pattern as other constructor arguments and is correctly implemented.


300-300: LGTM!

The method call correctly passes the required tokenizer and new_token parameters to match the updated method signature.


388-388: LGTM!

Minor formatting improvement with the added blank line.


516-516: LGTM!

The tokenizer parameter addition to TRTLLMSampler constructor and attribute assignment follows the same pattern as TorchSampler and is correctly implemented.

Also applies to: 545-546

@xq25478 xq25478 force-pushed the v1.0.0rc3_torchsampler_imp_stop branch from 1d0b121 to da386ad Compare August 4, 2025 02:50
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

253-285: Fix line length violation and improve exception handling.

The text-level stop detection implementation correctly enables OpenAI-style stop functionality, but there are code quality issues to address:

Apply this diff to fix the line length issue and improve exception handling:

-                    stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+                    stop_text = tokenizer.decode(
+                        stop_word, 
+                        skip_special_tokens=False, 
+                        clean_up_tokenization_spaces=False
+                    )
-            except Exception:
+            except (ValueError, TypeError, AttributeError):
-                except Exception:
+                except (ValueError, TypeError, AttributeError):

The specific exception types cover the most likely tokenizer decode failures while avoiding overly broad exception catching as required by the coding guidelines.

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

265-285: Consider performance optimization for tokenizer decode operations.

The implementation correctly adds text-level stop detection, but the frequent tokenizer.decode() calls could impact performance, especially with many stop words. Consider caching decoded stop words to avoid repeated decoding of the same sequences.

Potential optimization approach:

# Cache decoded stop texts on the request so each stop word is decoded only once
if not hasattr(request, '_decoded_stop_words'):
    stop_words_list, prefix_sum = request.py_stop_words_list
    starts = [0] + list(prefix_sum[:-1])
    request._decoded_stop_words = [
        tokenizer.decode(stop_words_list[start:end], skip_special_tokens=False,
                         clean_up_tokenization_spaces=False)
        for start, end in zip(starts, prefix_sum)]

This addresses the performance concern raised in previous reviews while maintaining the valuable text-level stop functionality.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d0b121 and da386ad.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (5 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/executor/worker.py (7 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/sampling_params.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/worker.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (2)
📚 Learning: applies to **/*.py : when using try-except blocks in python, limit the except to the smallest set of...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : when using try-except blocks to handle multiple possible variable types in pyth...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

276-276: Line too long (122 > 120)

(E501)

🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)

4-4: LGTM - Import additions support tokenizer functionality.

The new imports for Union, List, and PreTrainedTokenizerBase are appropriate for the enhanced stop token detection functionality and follow the coding guidelines for maintaining namespaces.

Also applies to: 29-29


228-228: LGTM - Tokenizer integration in TorchSampler constructor.

The tokenizer parameter is properly typed and stored as an instance attribute, following the established patterns in the class constructor.

Also applies to: 236-236


300-300: LGTM - Method call updated to support text-level stop detection.

The call to _meet_stop_token_criteria correctly passes the tokenizer and new_token parameters to enable the enhanced stop detection functionality.


516-516: LGTM - Consistent tokenizer integration in TRTLLMSampler.

The tokenizer parameter and storage implementation in TRTLLMSampler maintains consistency with the TorchSampler implementation, ensuring uniform tokenizer support across both sampler types.

Also applies to: 545-546


388-388: Minor formatting adjustment.

This whitespace change appears to improve code formatting.

@xq25478 xq25478 force-pushed the v1.0.0rc3_torchsampler_imp_stop branch 2 times, most recently from 312e2ed to 84a8e58 Compare August 4, 2025 03:02
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 312e2ed and 84a8e58.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (8 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (5 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/executor/worker.py (7 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/executor/proxy.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/worker.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
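As a quick aside, the two try-except guidelines above can be illustrated with a short generic sketch (not code from CODING_GUIDELINES.md or from this PR; the helper below is hypothetical):

def to_token_list(new_token):
    """Normalize an int or an iterable of token ids into a plain list of ints."""
    try:
        iterator = iter(new_token)
    except TypeError:
        # Smallest possible except: a bare int is simply not iterable.
        return [int(new_token)]
    else:
        # Keep the try body minimal; the real logic lives in the else block.
        return [int(t) for t in iterator]

print(to_token_list(7))        # [7]
print(to_token_list([1, 2]))   # [1, 2]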

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (2)
📚 Learning: applies to **/*.py : when using try-except blocks in python, limit the except to the smallest set of...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : when using try-except blocks to handle multiple possible variable types in pyth...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

276-276: Line too long (122 > 120)

(E501)

🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (6)

4-4: LGTM!

The import additions are appropriate for the tokenizer integration functionality.

Also applies to: 29-29


228-229: LGTM!

The tokenizer parameter is correctly added to the Args dataclass and stored as an instance attribute with proper type annotation.

Also applies to: 236-236


300-300: LGTM!

The method call is correctly updated to pass the tokenizer instance and new_token parameter.


388-388: LGTM!

Minor formatting improvement for readability.


516-516: LGTM!

The tokenizer parameter is correctly added to the TRTLLMSampler constructor with proper type annotation and instance variable assignment.

Also applies to: 545-546


228-236: Excellent integration of tokenizer support for text-level stop criteria.

The implementation successfully adds tokenizer support to both TorchSampler and TRTLLMSampler classes, enabling text-level stop word detection as requested in the PR objectives. The approach is well-designed:

  • Proper type annotations with PreTrainedTokenizerBase
  • Clean integration into existing sampler architecture
  • Intelligent fallback to token-level matching when text decoding fails
  • Maintains backward compatibility

The main enhancement allows for more accurate stop criteria detection by comparing decoded text rather than just token sequences, which addresses the limitation mentioned in the PR summary.

Also applies to: 253-285, 300-300, 516-516, 545-546
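To make the described fallback behaviour concrete, here is a small self-contained sketch of the pattern (the helper name and variables are hypothetical, not the PR's actual code; it only assumes a Hugging Face tokenizer):

from transformers import AutoTokenizer

def text_or_token_stop(tokenizer, generated_ids, stop_ids, new_ids):
    """Return True if a stop word matches at the token level or in the decoded text."""
    # Token-level match: compare the tail of the generated ids.
    if len(stop_ids) <= len(generated_ids) and generated_ids[-len(stop_ids):] == stop_ids:
        return True
    # Text-level match: decode both sides; fall back to token-only matching on failure.
    try:
        stop_text = tokenizer.decode(stop_ids, skip_special_tokens=False,
                                     clean_up_tokenization_spaces=False)
        new_text = tokenizer.decode(new_ids, skip_special_tokens=False,
                                    clean_up_tokenization_spaces=False)
    except (ValueError, TypeError):
        return False
    return bool(stop_text) and stop_text in new_text

tok = AutoTokenizer.from_pretrained("gpt2")
history = tok.encode("Question: 2+2?") + tok.encode("\nAnswer: 4")
# Prints True: the decoded stop text "Answer:" appears in the decoded new tokens,
# even if the stop word's token ids do not line up with the tail of the history.
print(text_or_token_stop(tok, history, tok.encode("Answer:"), tok.encode("\nAnswer: 4")))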

@@ -1,7 +1,7 @@
from abc import ABC, abstractmethod
Contributor


💡 Verification agent

🧩 Analysis chain

Verify NVIDIA copyright header presence.

According to the coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the current year. Please ensure this file includes the required copyright header.


🏁 Script executed:

#!/bin/bash
# Check if the file contains NVIDIA copyright header
head -10 tensorrt_llm/_torch/pyexecutor/sampler.py | grep -i "nvidia\|copyright"

Length of output: 82


Let’s search the entire file for any NVIDIA/copyright lines:


🏁 Script executed:

#!/bin/bash
# Search for any NVIDIA or Copyright mentions in sampler.py
grep -n -E "NVIDIA|Copyright" tensorrt_llm/_torch/pyexecutor/sampler.py || true

Length of output: 268


Add NVIDIA copyright header to sampler.py

This file is missing the required NVIDIA copyright header. Per CODING_GUIDELINES.md, all Python source files must begin with a header that includes the current year and NVIDIA’s copyright.

• File to update:

  • tensorrt_llm/_torch/pyexecutor/sampler.py (at the very top)

Please insert a header similar to:

# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES
# SPDX-License-Identifier: Apache-2.0

Ensure it matches the exact formatting used in other TensorRT-LLM files.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py at line 1, add the NVIDIA copyright
header at the very top of the file before any imports. Insert the following two
lines exactly as shown: "# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA
CORPORATION & AFFILIATES" and "# SPDX-License-Identifier: Apache-2.0". This will
ensure compliance with the CODING_GUIDELINES.md and match the formatting used in
other TensorRT-LLM files.

Comment on lines 253 to 285
    def _meet_stop_token_criteria(
        request: LlmRequest,
        tokenizer: PreTrainedTokenizerBase,
        new_token: Union[int, List[int], torch.Tensor]
        ):
        if request.py_stop_words_list:
            assert isinstance(
                request.py_stop_words_list,
                list), "request.py_stop_words_list should be a list"

            stop_words_list, prefix_sum = request.py_stop_words_list
            tokens = request.get_tokens(0)
            try:
                new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
            except Exception:
                # If decode fails, fall back to token-based matching only
                new_words = ""
            offset = 0
            for i, offset_end in enumerate(prefix_sum):
                if i > 0:
                    offset = prefix_sum[i - 1]
                stop_word = stop_words_list[offset:offset_end]
                try:
                    stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
                except Exception:
                    continue
                if len(stop_word) > len(tokens):
                    continue
                if tokens[-len(stop_word):] == stop_word:
                    return True
                if stop_text in new_words:
                    return True

Contributor


⚠️ Potential issue

Fix remaining issues from previous reviews.

The implementation has good logic for text-level stop criteria detection, but several issues from previous reviews still need to be addressed:

  1. Line length violation: Line 276 exceeds the 120-character limit
  2. Broad exception handling: Using Exception instead of specific exceptions
  3. Empty stop text check: Missing validation before substring matching

Apply this diff to fix these issues:

                try:
-                    stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
-                except Exception:
+                    stop_text = tokenizer.decode(
+                        stop_word, 
+                        skip_special_tokens=False, 
+                        clean_up_tokenization_spaces=False
+                    )
+                except (ValueError, TypeError, AttributeError):
                    continue
                if len(stop_word) > len(tokens):
                    continue
                if tokens[-len(stop_word):] == stop_word:
                    return True
-                if stop_text in new_words:
+                if stop_text and stop_text in new_words:
                    return True

Also fix the first exception block:

            try: 
-                new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
-            except Exception:
+                new_words = tokenizer.decode(new_token, skip_special_tokens=False, clean_up_tokenization_spaces=False)
+            except (ValueError, TypeError, AttributeError):
                # If decode fails, fall back to token-based matching only
                new_words = ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def _meet_stop_token_criteria(
        request: LlmRequest,
        tokenizer: PreTrainedTokenizerBase,
        new_token: Union[int, List[int], torch.Tensor]
        ):
        if request.py_stop_words_list:
            assert isinstance(
                request.py_stop_words_list,
                list), "request.py_stop_words_list should be a list"
            stop_words_list, prefix_sum = request.py_stop_words_list
            tokens = request.get_tokens(0)
            try:
                new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
            except Exception:
                # If decode fails, fall back to token-based matching only
                new_words = ""
            offset = 0
            for i, offset_end in enumerate(prefix_sum):
                if i > 0:
                    offset = prefix_sum[i - 1]
                stop_word = stop_words_list[offset:offset_end]
                try:
                    stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
                except Exception:
                    continue
                if len(stop_word) > len(tokens):
                    continue
                if tokens[-len(stop_word):] == stop_word:
                    return True
                if stop_text in new_words:
                    return True

    def _meet_stop_token_criteria(
        request: LlmRequest,
        tokenizer: PreTrainedTokenizerBase,
        new_token: Union[int, List[int], torch.Tensor]
        ):
        if request.py_stop_words_list:
            assert isinstance(
                request.py_stop_words_list,
                list
            ), "request.py_stop_words_list should be a list"
            stop_words_list, prefix_sum = request.py_stop_words_list
            tokens = request.get_tokens(0)
            try:
                new_words = tokenizer.decode(
                    new_token,
                    skip_special_tokens=False,
                    clean_up_tokenization_spaces=False
                )
            except (ValueError, TypeError, AttributeError):
                # If decode fails, fall back to token-based matching only
                new_words = ""
            offset = 0
            for i, offset_end in enumerate(prefix_sum):
                if i > 0:
                    offset = prefix_sum[i - 1]
                stop_word = stop_words_list[offset:offset_end]
                try:
                    stop_text = tokenizer.decode(
                        stop_word,
                        skip_special_tokens=False,
                        clean_up_tokenization_spaces=False
                    )
                except (ValueError, TypeError, AttributeError):
                    continue
                if len(stop_word) > len(tokens):
                    continue
                if tokens[-len(stop_word):] == stop_word:
                    return True
                if stop_text and stop_text in new_words:
                    return True
🧰 Tools
🪛 Ruff (0.12.2)

276-276: Line too long (122 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py between lines 253 and 285, fix the
line length violation by breaking lines exceeding 120 characters into shorter
ones, replace broad exception handling by catching more specific exceptions
related to tokenizer decoding, and add a check to ensure stop_text is not empty
before performing substring matching with new_words. Also, update the first
exception block to handle specific decode errors instead of catching all
exceptions.

@xq25478 xq25478 force-pushed the v1.0.0rc3_torchsampler_imp_stop branch from 84a8e58 to 3e913d0 on August 5, 2025 14:36
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1-1: Add required NVIDIA copyright header.

This file still needs the NVIDIA copyright header as flagged in previous reviews. Per coding guidelines, all Python source files must include the copyright header.

Add the following header at the very top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES
+# SPDX-License-Identifier: Apache-2.0
+
 from abc import ABC, abstractmethod
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84a8e58 and 3e913d0.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/pyexecutor/_util.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (10 hunks)
  • tensorrt_llm/executor/executor.py (6 hunks)
  • tensorrt_llm/executor/proxy.py (5 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/executor/worker.py (7 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/worker.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (6)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx} : use a maximum of 120 characters per line....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : Use a maximum of 120 characters per line.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : when using try-except blocks in python, limit the except to the smallest set of...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : when using try-except blocks to handle multiple possible variable types in pyth...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.{cpp,h,cu,py} : all tensorrt-llm open source software code should contain an nvidia ...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.{cpp,h,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: applies to **/*.{h,hpp} : use a preprocessor guard in header files. the guard name must have prefix ...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-05T07:49:09.875Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (6)

4-4: LGTM - Import changes support the new tokenizer functionality.

The added imports are necessary for the tokenizer parameter and proper type annotations.

Also applies to: 29-29


271-271: LGTM - Proper tokenizer integration.

The tokenizer parameter is correctly added to the Args dataclass and properly stored in the constructor with appropriate type annotations.

Also applies to: 279-279


301-301: LGTM - Correct integration of tokenizer-based stop criteria.

The function call properly passes the tokenizer and integrates text-level stop detection into the sampling workflow.


517-517: LGTM - Consistent tokenizer integration.

The tokenizer parameter is properly added to the TRTLLMSampler constructor with correct type annotations and consistent implementation pattern.

Also applies to: 546-547


782-789: LGTM - Proper integration of text-level stop detection.

The stop criteria checks are correctly integrated into both single beam and multiple beam update methods. The logic properly checks stop conditions before adding tokens and sets appropriate finish reasons when stop criteria are met.

Also applies to: 857-860
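For orientation, the control flow being described is roughly the following (a hedged sketch; finish() and append_token() are placeholder names, not the actual LlmRequest/TorchSampler API):

def update_one_request(request, new_token, tokenizer, max_seq_len):
    # Stop checks run before the sampled token is committed to the sequence.
    if meet_max_token_stop_criteria(request, max_seq_len):
        request.finish("length")        # placeholder: record a length finish reason
        return
    if meet_stop_token_criteria(request, tokenizer, new_token):
        request.finish("stop_words")    # placeholder: record a stop-word finish reason
        return
    request.append_token(new_token)     # placeholder: append the token for beam 0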


297-297: Fix function call to match corrected signature.

This call needs to be updated to match the corrected function signature (without self).

Apply this diff:

-        if meet_max_token_stop_criteria(request,self.max_seq_len):
+        if meet_max_token_stop_criteria(request, self.max_seq_len):

Likely an incorrect or invalid review comment.

Comment on lines +209 to +242
def meet_stop_token_criteria(
    request: LlmRequest,
    tokenizer: PreTrainedTokenizerBase,
    new_token: Union[int, List[int], torch.Tensor]
    ):
    if request.py_stop_words_list:
        assert isinstance(
            request.py_stop_words_list,
            list), "request.py_stop_words_list should be a list"

        stop_words_list, prefix_sum = request.py_stop_words_list
        tokens = request.get_tokens(0)
        try:
            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
        except Exception:
            # If decode fails, fall back to token-based matching only
            new_words = ""
        offset = 0
        for i, offset_end in enumerate(prefix_sum):
            if i > 0:
                offset = prefix_sum[i - 1]
            stop_word = stop_words_list[offset:offset_end]
            try:
                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
            except Exception:
                continue
            if len(stop_word) > len(tokens):
                continue
            if tokens[-len(stop_word):] == stop_word:
                return True
            if stop_text in new_words:
                return True

    return False
Contributor


⚠️ Potential issue

Address remaining issues from previous reviews.

The function logic is sound for text-level stop detection, but several previously flagged issues still need to be fixed:

  1. Missing type annotations: Add proper type hints for the return value
  2. Broad exception handling: Replace Exception with specific exceptions
  3. Empty stop text validation: Add check before substring matching

Apply this diff to fix these issues:

 def meet_stop_token_criteria(
     request: LlmRequest,
     tokenizer: PreTrainedTokenizerBase, 
     new_token: Union[int, List[int], torch.Tensor]
-    ):
+    ) -> bool:
     if request.py_stop_words_list:
         assert isinstance(
             request.py_stop_words_list,
             list), "request.py_stop_words_list should be a list"
 
         stop_words_list, prefix_sum = request.py_stop_words_list
         tokens = request.get_tokens(0)
         try: 
-            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
-        except Exception:
+            new_words = tokenizer.decode(
+                new_token, 
+                skip_special_tokens=False, 
+                clean_up_tokenization_spaces=False
+            )
+        except (ValueError, TypeError, AttributeError):
             # If decode fails, fall back to token-based matching only
             new_words = ""
         offset = 0
         for i, offset_end in enumerate(prefix_sum):
             if i > 0:
                 offset = prefix_sum[i - 1]
             stop_word = stop_words_list[offset:offset_end]
             try:
-                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
-            except Exception:
+                stop_text = tokenizer.decode(
+                    stop_word, 
+                    skip_special_tokens=False, 
+                    clean_up_tokenization_spaces=False
+                )
+            except (ValueError, TypeError, AttributeError):
                 continue
             if len(stop_word) > len(tokens):
                 continue
             if tokens[-len(stop_word):] == stop_word:
                 return True
-            if stop_text in new_words:
+            if stop_text and stop_text in new_words:
                 return True
 
     return False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def meet_stop_token_criteria(
    request: LlmRequest,
    tokenizer: PreTrainedTokenizerBase,
    new_token: Union[int, List[int], torch.Tensor]
    ):
    if request.py_stop_words_list:
        assert isinstance(
            request.py_stop_words_list,
            list), "request.py_stop_words_list should be a list"
        stop_words_list, prefix_sum = request.py_stop_words_list
        tokens = request.get_tokens(0)
        try:
            new_words = tokenizer.decode(new_token,skip_special_tokens=False,clean_up_tokenization_spaces=False)
        except Exception:
            # If decode fails, fall back to token-based matching only
            new_words = ""
        offset = 0
        for i, offset_end in enumerate(prefix_sum):
            if i > 0:
                offset = prefix_sum[i - 1]
            stop_word = stop_words_list[offset:offset_end]
            try:
                stop_text = tokenizer.decode(stop_word, skip_special_tokens=False, clean_up_tokenization_spaces=False)
            except Exception:
                continue
            if len(stop_word) > len(tokens):
                continue
            if tokens[-len(stop_word):] == stop_word:
                return True
            if stop_text in new_words:
                return True
    return False

def meet_stop_token_criteria(
    request: LlmRequest,
    tokenizer: PreTrainedTokenizerBase,
    new_token: Union[int, List[int], torch.Tensor]
    ) -> bool:
    if request.py_stop_words_list:
        assert isinstance(
            request.py_stop_words_list,
            list
        ), "request.py_stop_words_list should be a list"
        stop_words_list, prefix_sum = request.py_stop_words_list
        tokens = request.get_tokens(0)
        try:
            new_words = tokenizer.decode(
                new_token,
                skip_special_tokens=False,
                clean_up_tokenization_spaces=False
            )
        except (ValueError, TypeError, AttributeError):
            # If decode fails, fall back to token-based matching only
            new_words = ""
        offset = 0
        for i, offset_end in enumerate(prefix_sum):
            if i > 0:
                offset = prefix_sum[i - 1]
            stop_word = stop_words_list[offset:offset_end]
            try:
                stop_text = tokenizer.decode(
                    stop_word,
                    skip_special_tokens=False,
                    clean_up_tokenization_spaces=False
                )
            except (ValueError, TypeError, AttributeError):
                continue
            if len(stop_word) > len(tokens):
                continue
            if tokens[-len(stop_word):] == stop_word:
                return True
            if stop_text and stop_text in new_words:
                return True
    return False
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around lines 209 to 242, add a
return type annotation to the meet_stop_token_criteria function to specify it
returns a bool. Replace the broad except Exception clauses with more specific
exceptions related to tokenizer.decode failures, such as UnicodeDecodeError or
TokenizerError if applicable. Before checking if stop_text is in new_words, add
a condition to skip empty stop_text values to avoid false positives or errors
during substring matching.

Comment on lines 245 to 249
def meet_max_token_stop_criteria(self, request: LlmRequest,max_seq_len):
    num_tokens = request.get_num_tokens(self.BEAM)
    return (num_tokens - request.py_orig_prompt_len
            >= request.py_max_new_tokens) or (num_tokens
                                              >= max_seq_len)
Contributor


⚠️ Potential issue

Fix incorrect function signature.

This standalone function incorrectly includes self as a parameter, which will cause runtime errors when called.

Apply this diff to fix the function signature and add type annotations:

-def meet_max_token_stop_criteria(self, request: LlmRequest,max_seq_len):
+def meet_max_token_stop_criteria(request: LlmRequest, max_seq_len: int) -> bool:
-    num_tokens = request.get_num_tokens(self.BEAM)
+    num_tokens = request.get_num_tokens(0)  # Use beam 0 consistently
     return (num_tokens - request.py_orig_prompt_len
             >= request.py_max_new_tokens) or (num_tokens
                                                 >= max_seq_len)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def meet_max_token_stop_criteria(self, request: LlmRequest,max_seq_len):
    num_tokens = request.get_num_tokens(self.BEAM)
    return (num_tokens - request.py_orig_prompt_len
            >= request.py_max_new_tokens) or (num_tokens
                                              >= max_seq_len)

def meet_max_token_stop_criteria(request: LlmRequest, max_seq_len: int) -> bool:
    num_tokens = request.get_num_tokens(0)  # Use beam 0 consistently
    return (num_tokens - request.py_orig_prompt_len
            >= request.py_max_new_tokens) or (num_tokens
                                              >= max_seq_len)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around lines 245 to 249, the
function meet_max_token_stop_criteria is incorrectly defined with a self
parameter, but it is a standalone function. Remove the self parameter from the
function signature and add appropriate type annotations for the parameters and
return type to fix the runtime errors.
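For what it is worth, a toy walk-through of the corrected helper makes the arithmetic explicit; FakeRequest below is a made-up stand-in, not the real LlmRequest:

class FakeRequest:
    """Toy stand-in: 10 prompt tokens, a budget of 5 new tokens, 15 tokens held."""
    py_orig_prompt_len = 10
    py_max_new_tokens = 5

    def get_num_tokens(self, beam: int) -> int:
        return 15

def meet_max_token_stop_criteria(request, max_seq_len: int) -> bool:
    num_tokens = request.get_num_tokens(0)
    return (num_tokens - request.py_orig_prompt_len
            >= request.py_max_new_tokens) or (num_tokens >= max_seq_len)

# 15 - 10 >= 5: the new-token budget is exhausted long before max_seq_len,
# so the request should stop with a length finish reason.
assert meet_max_token_stop_criteria(FakeRequest(), max_seq_len=4096)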

@xq25478 xq25478 force-pushed the v1.0.0rc3_torchsampler_imp_stop branch from 3e913d0 to 4e7fffa on August 6, 2025 06:53
Labels
Community want to contribute (PRs initiated from Community)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants