Conversation

@ixlmar (Collaborator) commented Aug 27, 2025

Description

  • This PR removes the enable_mixed_sampler option, including the "fast_path" code path it controlled. At the same time, the general code path is optimized so that greedy sampling without drafting, previously covered by "fast_path", does not regress significantly.

  • Removing enable_mixed_sampler also avoids the non-intuitive behavior reported in "Investigate TRTLLM runtime repetitive issue" #5254.

  • Beyond that, the revised code streamlines request processing via batching for non-greedy sampling as well as in the presence of draft tokens, i.e., in situations previously not covered by "fast_path".

  • This PR opportunistically vectorizes TorchSampler._apply_embedding_bias (see the sketch after this list).

  • This PR opportunistically skips the softmax computation when it is not needed (e.g., for greedy sampling).

  • This PR adds @torch.inference_mode() to TorchSampler.

  • Lastly, the PR includes some minor code improvements (type annotations, etc.).
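
As a rough illustration of the embedding-bias vectorization mentioned above (a minimal sketch only; the function name, the stacked bias_table layout, and the row_to_bias mapping are assumptions for exposition, not the PR's actual implementation):

```python
import torch


def apply_embedding_bias_batched(logits: torch.Tensor,
                                 bias_table: torch.Tensor,
                                 row_to_bias: torch.Tensor) -> torch.Tensor:
    """Add a per-request bias row to each logits row in one vectorized step.

    logits:      [num_rows, vocab_size]
    bias_table:  [num_unique_biases, vocab_size], stacked unique bias tensors
    row_to_bias: [num_rows] int64 index into bias_table, or -1 for "no bias"
    """
    biased_rows = (row_to_bias >= 0).nonzero(as_tuple=True)[0]
    if biased_rows.numel() == 0:
        return logits
    logits = logits.clone()  # avoid mutating the caller's tensor
    logits[biased_rows] += bias_table[row_to_bias[biased_rows]]
    return logits
```

A single indexed update like this replaces a per-request Python loop over bias tensors.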

Test Coverage

  • Updated existing tests to account for enable_mixed_sampler option removal.
  • Since the PR contains mostly non-functional changes (performance and general code improvements), existing test coverage is assumed to be sufficient.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
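
For example, /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast runs only the listed test stage without fail-fast and, by default, reuses build artifacts and successful test results from the last pipeline.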

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Batched sampling for parallel request decoding and a combined Top-K+Top-P sampling strategy with per-request dispatch.
  • Refactor

    • Removed mixed-sampler user configuration and related flags from public configs and init paths.
    • Streamlined sampler, request-drafting, and logits-handling pipelines; per-request draft-logit policy is now sampler-driven.
    • RMS normalization now returns residual only when explicitly requested.
  • Tests

    • Unit/integration test IDs and parameterizations updated to match the new sampling flow.
  • Chores

    • Minimum Python requirement raised to 3.10.

coderabbitai bot (Contributor) commented Aug 27, 2025

📝 Walkthrough

Walkthrough

Removed the public enable_mixed_sampler flag across configs and call sites; refactored Torch sampler into a batched, per-strategy sampling pipeline with new index translators and sample() APIs; ModelDrafter always routes logits through the sampler and defers draft-prob policy to the sampler; RMSNorm replaces Ellipsis-based residual with an explicit sentinel.
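
For intuition, a minimal sketch of the per-strategy grouping idea; the attribute names (sampling_config, top_k, top_p) and the tuple-based strategy keys are placeholders and need not match the PR's actual Strategy type:

```python
from collections import defaultdict


def group_requests_by_sampling_strategy(requests) -> dict[tuple, list[int]]:
    """Map a hashable sampling-strategy key to the indices of the requests using it."""
    groups: dict[tuple, list[int]] = defaultdict(list)
    for i, request in enumerate(requests):
        cfg = request.sampling_config  # placeholder attribute name
        top_k = getattr(cfg, "top_k", None)
        top_p = getattr(cfg, "top_p", None)
        if top_k and top_p:
            key = ("top_k_top_p", top_k, top_p)
        elif top_k:
            key = ("top_k", top_k)
        elif top_p:
            key = ("top_p", top_p)
        else:
            key = ("greedy",)
        groups[key].append(i)
    return dict(groups)
```

Each group can then be handled with a single batched sampling call instead of one call per request.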

Changes

Cohort / File(s) / Summary

  • Remove enable_mixed_sampler option
    Files: tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/config.py, tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/scaffolding/worker.py, tests/unittest/llmapi/test_llm_pytorch.py
    Summary: Deletes the enable_mixed_sampler field/argument across public configs, executor/util wiring, scaffolding, and test parameterization; call sites are updated to omit the flag; TorchSampler.Args no longer includes it.

  • Sampler batched refactor & API additions
    Files: tensorrt_llm/_torch/pyexecutor/sampler.py
    Summary: Major refactor adding per-strategy grouping, batched sampling helpers (torch_multi_arange, the StepIndexTranslator family), _BatchedSamplingResult, a new sample() API, embedding-bias application, a batched pack/unpack flow, a new Strategy variant TopKTopP, should_provide_draft_probs, and removal of TorchSampler.Args.enable_mixed_sampler.

  • Speculative decoding / drafter integration
    Files: tensorrt_llm/_torch/speculative/model_drafter.py, tensorrt_llm/_torch/speculative/eagle3.py
    Summary: ModelDrafter always routes logits through the sampler, builds the draft LlmRequest with exclude_last_generation_logits=True, uses sampler.should_provide_draft_probs(request) for the draft-prob policy, and adds TYPE_CHECKING imports for type-only hints.

  • RMSNorm residual sentinel change
    Files: tensorrt_llm/_torch/modules/rms_norm.py
    Summary: Replaces the Ellipsis-based optional residual with _ARGUMENT_NOT_SPECIFIED_SENTINEL; updates forward/skip_forward signatures, presence checks, dtype handling, and conditional residual return behavior.

  • LlmRequest typing tweak
    Files: tensorrt_llm/_torch/pyexecutor/llm_request.py
    Summary: Adds the static type annotation Optional[torch.Tensor] for the private _py_embedding_bias_1d; runtime initialization unchanged.

  • Tests: single-path TorchSampler test
    Files: tests/unittest/llmapi/test_llm_pytorch.py
    Summary: Removes parameterization over enable_mixed_sampler and enable_logprobs; the test is simplified to a single TorchSampler path.

  • Test-list backend tag renames (YAML)
    Files: tests/integration/test_lists/test-db/l0_dgx_b200.yml, tests/integration/test_lists/test-db/l0_dgx_h100.yml
    Summary: Renames backend test identifiers/tags (uppercase → lowercase / new backend-tag names) for specific GPTOSS entries; pure test-id string updates.

  • Packaging requirement bump
    Files: setup.py
    Summary: Raises python_requires from ">=3.7, <4" to ">=3.10, <4".

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant TRTExec as TRT Executor
  participant Sampler as TorchSampler
  participant Gen as Generator
  participant Req as LlmRequest[]

  Client->>TRTExec: sample_async(requests)
  TRTExec->>Sampler: _process_requests(requests, logits, states)
  Sampler->>Sampler: _group_requests_by_sampling_strategy(requests)
  Sampler->>Gen: get_generator(device)
  par Per-strategy batched sampling
    Sampler->>Sampler: pack indices (torch_multi_arange / indexers)
    alt embedding-bias present
      Sampler->>Sampler: _apply_embedding_bias
    end
    Sampler->>Sampler: sample(strategy, logits_batch, generator, softmax_indices?)
  end
  Sampler-->>TRTExec: _BatchedSamplingResult (tokens, softmax, mappings)
  TRTExec->>TRTExec: unbatch -> update per-request states
  TRTExec-->>Client: next tokens (+ optional probs)
sequenceDiagram
  autonumber
  participant Drafter as ModelDrafter
  participant Sampler as TorchSampler
  participant Req as LlmRequest(draft)

  Drafter->>Sampler: should_provide_draft_probs(request)
  Drafter->>Req: create(exclude_last_generation_logits=True, return_generation_logits=from sampler)
  Drafter->>Sampler: sample_async(sample_state)
  Sampler-->>Drafter: draft tokens (+/- draft probs)
  Drafter->>Drafter: forward draft logits into target request (py_draft_logits)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • mikeiovine
  • nv-guomingz
  • netanel-haber
  • QiJune
  • syuoni

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

1-8: Missing NVIDIA license header

Add the standard 2025 NVIDIA SPDX header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
tensorrt_llm/_torch/modules/rms_norm.py (1)

56-62: FlashInfer path returns stale residual; semantics diverge from non-Flash path

When residual is provided, non-Flash returns (normed, input+residual). FlashInfer currently returns (normed, original residual). This breaks downstream skip/residual chaining.

Apply this fix to compute and return the updated residual consistently:

         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Keep semantics consistent with the non-Flash path:
+                # residual_out = (input + residual) in input dtype.
+                input_dtype = hidden_states.dtype
+                new_residual = (hidden_states.to(torch.float32) +
+                                residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                             self.weight, self.variance_epsilon)
+                residual = new_residual
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
@@
-        if residual is None:
+        if residual is None:
             return hidden_states
         else:
             return hidden_states, residual

Also applies to: 74-78

tensorrt_llm/_torch/pyexecutor/_util.py (2)

1-8: Missing NVIDIA license header

Add the standard 2025 NVIDIA SPDX header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+

576-587: Remove stale enable_mixed_sampler reference in API stability tests

The create_torch_sampler_args signature change is correctly reflected in code, and all internal callers have been updated. However, the API‐stability specification still lists the removed enable_mixed_sampler field:

• tests/unittest/api_stability/references/llm.yaml: lines 110–112 reference enable_mixed_sampler

Please remove or update that entry so the stability tests remain in sync with the new TorchSampler.Args.

🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)

17-19: TYPE_CHECKING import: LGTM

Static-only import avoids runtime dependency cycles. Optionally enable postponed annotations to drop quotes.

Apply this small tweak (optional):

+from __future__ import annotations
tests/unittest/llmapi/test_llm_pytorch.py (1)

229-233: Broaden coverage: parametrize enable_logprobs for both True/False

Current test only covers logprobs=True. Include False to exercise the non-logprobs path with embedding_bias.

Apply:

-@pytest.mark.parametrize(
-    "enable_logprobs",
-    [True],
-)
+@pytest.mark.parametrize("enable_logprobs", [False, True])

Also applies to: 253-259

tensorrt_llm/_torch/pyexecutor/sampler.py (4)

166-192: Consider extracting magic values as constants

The batched sampling functions use hardcoded default values. Consider extracting them for maintainability.

+# Sampling defaults
+DEFAULT_TOP_K = 50
+DEFAULT_TOP_P = 0.9
+DEFAULT_TEMPERATURE = 1.0
+MIN_TEMPERATURE = 1e-5
+
 def top_k_sampling_batch(
     logits,
-    top_k=50,
+    top_k=DEFAULT_TOP_K,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:
 def top_p_sampling_batch(
     logits: torch.Tensor,
-    top_p: float = 0.9,
-    temperature: float = 1.0,
+    top_p: float = DEFAULT_TOP_P,
+    temperature: float = DEFAULT_TEMPERATURE,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:
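
For context, a generic sketch of combined top-K + top-P (nucleus) sampling as it is commonly implemented; this shows the standard technique only and is not the PR's TopKTopP code:

```python
import torch
from typing import Optional


def top_k_top_p_sample(logits: torch.Tensor,
                       top_k: int = 50,
                       top_p: float = 0.9,
                       temperature: float = 1.0,
                       generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Sample one token per row after top-k filtering followed by top-p (nucleus) filtering."""
    logits = logits / max(temperature, 1e-5)
    # Top-k: mask everything below the k-th largest logit in each row.
    kth_value = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
    logits = logits.masked_fill(logits < kth_value, float("-inf"))
    # Top-p: drop the lowest-probability tail beyond cumulative mass top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    remove = (cumulative - sorted_probs) > top_p  # always keep the first sorted token
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).squeeze(-1)
```

Keeping the first token in sorted order guarantees that at least one candidate survives the nucleus cut.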

353-361: Add type hints for improved maintainability

The function would benefit from explicit return type hints.

 def _group_requests_by_sampling_strategy(
-        requests: Iterable[LlmRequest]) -> dict[Strategy, torch.IntTensor]:
+        requests: Iterable[LlmRequest]) -> dict[Strategy, torch.IntTensor]:
     strategy_dict: dict[Strategy, list[int]] = defaultdict(list)

1052-1190: Well-structured batched sampling pipeline

The _sample_batched_by_strategy method effectively groups requests by strategy and processes them in batches. The handling of draft logits indices for speculation is properly tracked.

One concern: The d2t validation at lines 1071-1075 might be better placed in ModelDrafter.prepare_draft_tokens as suggested by the FIXME comment.

Consider moving the d2t validation to ModelDrafter as suggested by the FIXME comment to maintain better separation of concerns.


1332-1333: Fix line length violations

These lines exceed the 120-character limit.

-            #    range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i], next_context_req_offsets_cuda[i])
-            # And if generation requests are present, those tensors include a trailing entries selecting all tokens/logits
+            #    range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i],
+            #          next_context_req_offsets_cuda[i])
+            # And if generation requests are present, those tensors include a trailing
+            # entries selecting all tokens/logits
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8b21613 and 91e916f.

📒 Files selected for processing (12)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
  • tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
  • tensorrt_llm/llmapi/llm_args.py (0 hunks)
  • tensorrt_llm/scaffolding/worker.py (0 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/scaffolding/worker.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • EagleDecodingConfig (418-444)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
  • llm_test_harness (109-136)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • max_seq_len (558-568)
  • max_seq_len (571-575)
tensorrt_llm/_torch/modules/rms_norm.py (4)
tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py (2)
  • flashinfer_fused_add_rmsnorm (32-35)
  • flashinfer_rmsnorm (21-23)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-422)
  • append (79-98)
  • append (125-142)
  • get_draft_token_length (556-567)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/_torch/speculative/model_drafter.py (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
  • Sampler (52-83)
  • SampleState (43-49)
  • should_provide_draft_probs (81-83)
  • should_provide_draft_probs (1413-1415)
  • beam_width (72-75)
  • is_generation_model (78-79)
  • is_generation_model (109-110)
  • is_generation_model (162-163)
  • is_generation_model (718-719)
  • is_generation_model (1449-1450)
  • sample_async (63-65)
  • sample_async (93-96)
  • sample_async (134-141)
  • sample_async (940-956)
  • sample_async (1596-1684)
  • update_requests (68-69)
  • update_requests (99-106)
  • update_requests (144-159)
  • update_requests (901-925)
  • update_requests (1688-1701)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-422)
  • append (79-98)
  • append (125-142)
  • generation_logits (214-223)
tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
  • HandleLogits (11-81)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • sample_async (271-318)
  • update_requests (245-269)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1332-1332: Line too long (125 > 120)

(E501)


1333-1333: Line too long (122 > 120)

(E501)

🔇 Additional comments (18)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

366-373: Type annotation addition: LGTM

optional 1D cache for embedding bias is clearly typed and initialized.

tensorrt_llm/_torch/modules/rms_norm.py (1)

49-53: No Ellipsis Passed to RMSNorm Residual
I ran targeted searches for any use of the literal ... passed—either positionally or via residual=...—to RMSNorm (and its implicit __call__ to forward) and found zero occurrences. This confirms that no call sites propagate Ellipsis into the new residual=None API. The change is safe to merge.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

593-596: No change needed: engine.max_seq_len already covers both KV-cache and no-cache cases

The max_seq_len property on TRTLLMEngine is implemented to:

  • return the KVCacheManager’s maximum sequence length when KV caching is enabled
  • fall back to the internally stored _max_seq_len_storage for no-cache attention

Thus, passing engine.max_seq_len into create_torch_sampler_args already applies the correct cap in both scenarios—no further modifications are required here.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/speculative/model_drafter.py (6)

59-59: LGTM! Improved clarity

The error message change from f-string to plain string is appropriate since there are no variables to interpolate.


85-86: Good comment explaining overlap scheduling

The comment explaining exclude_last_generation_logits=True for overlap scheduling provides valuable context.


268-285: Simplified sampling logic with proper error handling

The refactored _sample_async method:

  1. Correctly computes num_context_logits_prefix_sum accounting for py_return_context_logits
  2. Properly delegates to HandleLogits and the sampler's sample_async
  3. Has appropriate error handling that returns None on failure

The removal of optional sampler handling simplifies the control flow.


300-301: LGTM! Simplified request update logic

The removal of the None-check guard for sampler is consistent with the overall simplification where the sampler is now always required.


317-317: Properly handles None propagation for draft logits

The comment indicates that None values are correctly forwarded when draft logits are not available, which maintains backward compatibility.


76-90: Draft logit policy integration verified

The should_provide_draft_probs API is declared in Sampler (returns True by default) and correctly overridden in TRTLLMSampler to disable draft logits under greedy sampling. A search shows its sole invocation for draft requests in model_drafter.py, ensuring consistent, per-request logic without residual flags.

tensorrt_llm/_torch/pyexecutor/sampler.py (9)

81-83: Conservative default for draft probability handling

The base should_provide_draft_probs method returns True as a conservative default, ensuring backward compatibility. Derived classes can override for optimization.


276-285: Efficient greedy search with optional softmax filtering

The softmax_indices parameter enables selective softmax computation, skipping unnecessary calculations for greedy sampling - a nice performance optimization.
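
A minimal sketch of the idea, with an illustrative name and signature rather than the sampler's actual API:

```python
import torch
from typing import Optional


def greedy_with_optional_softmax(
        logits: torch.Tensor,                       # [num_tokens, vocab_size]
        softmax_indices: Optional[torch.Tensor] = None
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    """Greedy token selection, computing softmax only for the rows that need it."""
    next_tokens = torch.argmax(logits, dim=-1)
    probs: Optional[torch.Tensor] = None
    if softmax_indices is not None and softmax_indices.numel() > 0:
        # Only these rows (e.g. targets of rejection sampling) pay for the softmax.
        probs = torch.softmax(logits[softmax_indices], dim=-1)
    return next_tokens, probs
```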


288-310: Fixed potential issue with padded draft tokens

Good catch! The comment at line 291-292 correctly notes that draft_probs is not padded while draft_tokens may be padded by ModelDrafter._pad_to_max_draft_tokens. The implementation correctly handles this by slicing draft_tokens[:num_draft_tokens].


422-477: Excellent implementation of multi-arange utility

The torch_multi_arange function provides an efficient batched implementation with clear documentation and algorithm explanation. The use of torch.repeat_interleave() and torch.cumsum() to avoid loops is clever.
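
For readers unfamiliar with the pattern, a loop-free multi-arange can be assembled from repeat_interleave and cumsum roughly as follows (a sketch of the general technique, not necessarily the PR's torch_multi_arange):

```python
import torch


def multi_arange(starts: torch.Tensor, ends: torch.Tensor) -> torch.Tensor:
    """Concatenate arange(starts[i], ends[i]) for all i without a Python loop."""
    lengths = (ends - starts).long()                        # per-range lengths
    first_out_idx = torch.cumsum(lengths, dim=0) - lengths  # offset of each range in the output
    # Broadcast each range's start across its output slots, then add the within-range offset.
    out = torch.repeat_interleave(starts, lengths)
    within = torch.arange(int(lengths.sum()), device=starts.device) \
        - torch.repeat_interleave(first_out_idx, lengths)
    return out + within
```

For starts = [0, 5] and ends = [3, 7] this yields [0, 1, 2, 5, 6].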


767-767: Good device consistency check

Adding the assertion ensures the generator is on the expected device, preventing subtle CUDA errors.


965-1050: Sophisticated embedding bias application

The _apply_embedding_bias implementation:

  1. Handles per-request bias tensors efficiently
  2. Uses smart batching to minimize memory operations
  3. Includes detailed comments explaining the trade-offs
  4. Correctly uses non-blocking transfers

The complexity is justified given the performance considerations.


1192-1286: Complex but necessary unbatching logic

The _unbatch_sampling_results method correctly handles:

  1. Dimension ordering verification
  2. Efficient scatter operations using custom indexers
  3. Proper handling of log probabilities with different tensor layouts
  4. Non-blocking D2H transfers for performance

The FIXMEs at lines 1256 and 1260 suggest potential optimizations for memory layout that could be addressed in future iterations.
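
The scatter step can be pictured with a small sketch; the [steps, seq_slots, beam] output layout is taken from the surrounding review, while the function and argument names are illustrative:

```python
import torch


def scatter_sampled_tokens(batched_tokens: torch.Tensor,  # [num_sampled], batch order
                           step_index: torch.Tensor,      # [num_sampled] decoding step per token
                           seq_slot: torch.Tensor,        # [num_sampled] destination sequence slot
                           new_tokens: torch.Tensor) -> None:
    """Write batched sampling results back into the per-request output tensor."""
    # new_tokens is assumed to have shape [max_steps, max_seq_slots, beam_width]; beam 0 only here.
    new_tokens[step_index, seq_slot, 0] = batched_tokens
    # A device-to-host copy of new_tokens can then be issued with non_blocking=True.
```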


1344-1410: Main processing pipeline successfully refactored

The _process_requests method integrates all the batched sampling components effectively:

  1. Computes per-request steps
  2. Selects generated logits
  3. Applies embedding bias
  4. Performs batched sampling by strategy
  5. Unbatches results

This is the core of the performance improvements replacing the old "fast_path" code.


1413-1415: Good optimization for greedy sampling

The override correctly avoids requesting draft probabilities for greedy sampling, which don't require rejection sampling.

@ixlmar ixlmar force-pushed the feat/batched-sampling branch 2 times, most recently from 2b209dc to 6e89d3b Compare August 27, 2025 10:37
@ixlmar ixlmar requested a review from dcampora August 27, 2025 10:38
@ixlmar ixlmar force-pushed the feat/batched-sampling branch from 6e89d3b to f750bba Compare August 27, 2025 10:43
@ixlmar (Collaborator, Author) commented Aug 27, 2025

/bot run

@ixlmar ixlmar changed the title [TRTLLM-7156, TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler) [TRTLLM-7156][TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler) Aug 27, 2025
@tensorrt-cicd (Collaborator)

PR_Github #16682 [ run ] triggered by Bot

@ixlmar ixlmar changed the title [TRTLLM-7156][TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler) [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, TRTLLM-7156) Aug 27, 2025
@ixlmar ixlmar changed the title [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, TRTLLM-7156) [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) Aug 27, 2025
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

1-1: Add required NVIDIA copyright header

Per repo guidelines, prepend the NVIDIA copyright header (current year).

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 from __future__ import annotations
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

288-310: get_rejected_indices(): fix index dtypes and shape bug for single-step case

  • Index tensors must be int64.
  • Removing squeeze(0) avoids breaking 2D indexing when num_steps == 1.

Apply:

-    token_idx = torch.arange(num_draft_tokens,
-                             dtype=torch.int32,
-                             device=generator.device)
-    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.int32).to(
+    token_idx = torch.arange(num_draft_tokens,
+                             dtype=torch.long,
+                             device=generator.device)
+    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.long).to(
         device=generator.device, non_blocking=True)
     p = draft_probs[token_idx, draft_tokens_cuda]
-    q = target_probs.squeeze(0)[token_idx, draft_tokens_cuda]
+    q = target_probs[token_idx, draft_tokens_cuda]

763-777: Support multi-device by storing per-device RNG

Asserting a single device can break in multi-GPU contexts. Keep one deterministic generator per device.

Apply:

     def __init__(self, args: Args):
@@
-        self._generator = None
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None or gen.device != device:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

315-319: Forwarding generation logits: guard against tensor-device mismatches

Assigning req.py_result.generation_logits directly to target_model_req.py_draft_logits is fine; ensure both tensors reside on the same device (expected CUDA). If PyResult ever stores CPU tensors (e.g., for CPU-only tests), consider moving to the target device or documenting the assumption.

tensorrt_llm/_torch/pyexecutor/sampler.py (1)

860-898: Consistent attribute: use py_seq_slot instead of seq_slot

Elsewhere the code uses py_seq_slot; mixing both can cause subtle bugs if they ever diverge.

Apply:

-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e89d3b and f750bba.

📒 Files selected for processing (12)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
  • tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
  • tensorrt_llm/llmapi/llm_args.py (0 hunks)
  • tensorrt_llm/scaffolding/worker.py (0 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/scaffolding/worker.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/speculative/model_drafter.py (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
  • Sampler (52-83)
  • SampleState (43-49)
  • should_provide_draft_probs (81-83)
  • should_provide_draft_probs (1433-1435)
  • beam_width (72-75)
  • is_generation_model (78-79)
  • is_generation_model (109-110)
  • is_generation_model (162-163)
  • is_generation_model (727-728)
  • is_generation_model (1469-1470)
  • sample_async (63-65)
  • sample_async (93-96)
  • sample_async (134-141)
  • sample_async (951-967)
  • sample_async (1616-1704)
  • update_requests (68-69)
  • update_requests (99-106)
  • update_requests (144-159)
  • update_requests (911-935)
  • update_requests (1708-1721)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/llmapi/llm_args.py (1)
  • DecodingBaseConfig (349-404)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-424)
  • append (79-98)
  • append (125-142)
  • generation_logits (214-223)
tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
  • HandleLogits (11-81)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • sample_async (271-318)
  • update_requests (245-269)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1353-1353: Line too long (125 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

271-282: Prefix-sum logic: confirm 1-logit accounting when context logits are disabled

You count 1 logit per context request when return_context_logits is False, keeping the last-token logit available for potential generation logits. Please verify this matches the upstream model output layout for all decoders.

Would you like a quick script to assert that lengths of num_context_logits_prefix_sum align with actual logits slices during tests?
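
A tiny check along those lines might look like this; the shapes and names are assumptions, and a real test would use the drafter's actual tensors:

```python
import torch


def check_context_logits_prefix_sum(num_context_logits: list[int],
                                    context_logits: torch.Tensor) -> None:
    """Assert that the prefix sum delimits each request's slice of the flattened logits."""
    prefix = [0]
    for count in num_context_logits:
        prefix.append(prefix[-1] + count)
    assert prefix[-1] == context_logits.shape[0], "prefix sum must cover all context logits"
    for i, count in enumerate(num_context_logits):
        assert context_logits[prefix[i]:prefix[i + 1]].shape[0] == count
```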

tensorrt_llm/_torch/pyexecutor/sampler.py (2)

1433-1436: Good: draft probs suppressed for greedy

Skipping draft probabilities when greedy is selected avoids unnecessary softmax work.


166-173: Use LongTensor for indices and adjust type hints

Indexing tensors in PyTorch requires int64 (LongTensor). Using IntTensor/int32 for indices will raise runtime errors in advanced indexing and index_select.

Apply:

-def top_k_sampling_batch(
+def top_k_sampling_batch(
     logits,
     top_k=50,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:

And below (see related comments) ensure all index tensors are int64; also change any annotations using torch.IntTensor to torch.Tensor and cast to long at use sites.

Likely an incorrect or invalid review comment.

@tensorrt-cicd (Collaborator)

PR_Github #16682 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12523 completed with status: 'FAILURE'

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)

230-253: Force deterministic greedy sampling and avoid numeric overflow in embedding bias

The test currently relies on default sampling behavior (temperature=None → C++ default 1.0f) and an extreme bias value (torch.finfo(torch.float32).max), which can lead to non-greedy sampling and potential numeric instabilities (infs or NaNs). To ensure determinism and stability across backends, please update the test as follows:

• In tests/unittest/llmapi/test_llm_pytorch.py at test_embedding_bias_with_torch_sampler_strategies() (around line 243):

  • Replace the use of torch.finfo(torch.float32).max with a large but finite bias (e.g. 1e6).
  • Add temperature=0.0 (to force greedy sampling) and top_k=1 (to restrict choices to the top token).

Suggested diff:

 def test_embedding_bias_with_torch_sampler_strategies():
@@
-    vocab_size_padded = 32000
-    embedding_bias = torch.zeros(vocab_size_padded)
-    embedding_bias[biased_word_id] = torch.finfo(torch.float32).max
+    vocab_size_padded = 32000  # consider deriving from tokenizer/model if it varies
+    embedding_bias = torch.zeros(vocab_size_padded)
+    embedding_bias[biased_word_id] = 1e6  # large finite bias to avoid overflow

@@
-    sampling_kwargs = {
-        "max_tokens": 6,
-        "embedding_bias": embedding_bias,
-    }
+    sampling_kwargs = {
+        "max_tokens": 6,
+        "temperature": 0.0,   # force deterministic greedy sampling
+        "top_k": 1,           # restrict to the highest-scoring token
+        "embedding_bias": embedding_bias,
+    }

These changes will make the test behavior consistent and robust across different PyTorch builds and hardware.

tensorrt_llm/_torch/modules/rms_norm.py (1)

63-71: Inconsistent residual semantics between FLASHINFER and non-FLASH paths.

Non-FLASH returns updated residual (pre-norm sum), FLASHINFER returns the original residual. This diverges behavior and can corrupt downstream fused-add pipelines.

Apply:

         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Preserve parity with non-FLASH path: return pre-norm sum as residual
+                input_dtype = hidden_states.dtype
+                _updated_residual = (hidden_states.to(torch.float32) +
+                                     residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                             self.weight, self.variance_epsilon)
+                residual = _updated_residual
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
🧹 Nitpick comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)

234-241: Consider deriving vocab_size_padded instead of hard-coding 32000.

Hard-coding may break with model/tokenizer variants that pad differently.

I can propose a helper to compute padded size from tokenizer vocab and alignment used by the sampler if you confirm the pad multiple.

tensorrt_llm/_torch/modules/rms_norm.py (1)

75-78: Clarify intent: residual holds pre-norm sum (not original residual).

This is correct for fused-add pipelines; consider a brief comment to avoid future regressions.

Apply:

-            if residual is not None:
+            if residual is not None:
+                # Update residual to the pre-norm sum(hidden_states, residual) for downstream fused-add
                 hidden_states = hidden_states + residual.to(torch.float32)
                 residual = hidden_states.to(input_dtype)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f750bba and 8d6a5a6.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
  • llm_test_harness (109-136)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)

63-71: In-place semantics confirmed for flashinfer_fused_add_rmsnorm
Verified that flashinfer_fused_add_rmsnorm is registered as a custom op with mutates_args=("input", "residual") and its Python signature returns None, meaning it mutates hidden_states (and residual) in-place. No changes required.

@ixlmar (Collaborator, Author) commented Aug 27, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #16697 [ run ] triggered by Bot

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8d6a5a6 and 12e937b.

📒 Files selected for processing (2)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (2 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)

97-101: LGTM: standardized lowercase backend specifiers for GPTOSS post_merge.

Consistent with the broader rename pattern in this PR.

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)

154-159: LGTM: backend tag casing normalized (CUTLASS/TRITON → cutlass/triton).

Matches the naming scheme adopted elsewhere; should reduce param-id drift across files.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)

64-73: Residual return semantics diverge between FLASHINFER and fallback paths.

Fallback returns the pre-norm sum as residual (Line 78), while the FLASHINFER branch returns the original input residual unchanged, breaking parity and likely confusing downstream consumers expecting the updated residual. Fix by computing the summed residual for return in the FLASHINFER path without altering the inputs to the fused op.

         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Maintain the same return semantics as the non-FLASHINFER path:
+                # return the pre-norm sum as residual.
+                input_dtype = hidden_states.dtype
+                residual_out = (hidden_states.to(torch.float32)
+                                + residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states,  # normalized in-place/return
+                                             residual,
+                                             self.weight,
+                                             self.variance_epsilon)
+                residual = residual_out
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/rms_norm.py (4)

28-30: Make sentinel definition clearer and immutable.

Minor readability/typing tweak: define the alias before the constant and mark the sentinel Final so type-checkers treat it as immutable.

-from typing import Optional, Tuple, TypeAlias, Union, cast
+from typing import Final, Optional, Tuple, TypeAlias, Union, cast
@@
-    _ARGUMENT_NOT_SPECIFIED_SENTINEL = ...
-    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ARGUMENT_NOT_SPECIFIED_SENTINEL: Final[_ArgumentNotSpecifiedSentinelType] = ...

55-63: Streamline the “unspecified residual” logic.

Equivalent but tighter boolean makes intent obvious and avoids the extra flag mutation.

-        return_residual = True
-        if residual is self._ARGUMENT_NOT_SPECIFIED_SENTINEL:
-            return_residual = False
-            residual = None
+        return_residual = residual is not self._ARGUMENT_NOT_SPECIFIED_SENTINEL
+        if not return_residual:
+            residual = None

76-79: Avoid an extra allocation in the fallback add.

You can do the add in-place on the FP32 buffer to reduce one temporary tensor.

-            if residual is not None:
-                hidden_states = hidden_states + residual.to(torch.float32)
-                residual = hidden_states.to(input_dtype)
+            if residual is not None:
+                hidden_states.add_(residual.to(torch.float32))
+                residual = hidden_states.to(input_dtype)

Note: ensure no autograd requirements/in-place aliasing concerns in this inference path.


110-116: Fix mutable default and Optional types in group_rms_norm signature.

weights uses a mutable default [], and eps/weight_bias are annotated Optional but never accept None. Clean this to avoid surprises.

-def group_rms_norm(
-        inputs: list[torch.Tensor],
-        weights: Optional[list[torch.Tensor]] = [],
-        eps: Optional[float] = 1e-5,
-        weight_bias: Optional[float] = 0.0,
+def group_rms_norm(
+        inputs: list[torch.Tensor],
+        weights: Optional[list[torch.Tensor]] = None,
+        eps: float = 1e-5,
+        weight_bias: float = 0.0,
         kernel: GroupRMSNormKernelSelection = GroupRMSNormKernelSelection.heuristic,
         outputs: Optional[list[torch.Tensor]] = None) -> list[torch.Tensor]:
@@
-    out = outputs
+    out = outputs
     if out is None:
         out = [torch.empty_like(input) for input in inputs]
+    if weights is None:
+        weights = []
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 12e937b and 72ae320.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
🧠 Learnings (3)
📚 Learning: 2025-08-27T14:41:56.613Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:96-99
Timestamp: 2025-08-27T14:41:56.613Z
Learning: In tensorrt_llm/_torch/modules/rms_norm.py, the RMSNorm class uses a custom sentinel (_ARGUMENT_NOT_SPECIFIED_SENTINEL) instead of Ellipsis (...) for detecting unspecified optional arguments. Other modules in the codebase may use Ellipsis as a sentinel but do not forward it to RMSNorm methods, so there's no need for backward compatibility with Ellipsis in RMSNorm.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-27T14:23:55.545Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.545Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.py : Code must target Python 3.8+

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/rms_norm.py (3)

17-18: Python 3.10+ typing usage is fine; ensure packaging/CI reflect this floor.

Using TypeAlias and EllipsisType is correct given the project’s effective Python ≥3.10 policy. Please confirm python_requires and CI images align (≥3.10) to avoid drift.


85-88: Return contract looks good.

Tuple is returned only when the caller explicitly provided a residual (including None), matching the sentinel contract.


93-101: skip_forward parity with sentinel handling is correct.

Behavior mirrors forward: unspecified → single tensor; specified (even None) → tuple. Good.

@ixlmar
Copy link
Collaborator Author

ixlmar commented Aug 27, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16697 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12532 completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16712 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

769-783: Make generators per-device; current assert breaks multi-GPU

A single cached generator with a hard device-assert will fail when sampling across devices.

-        self._global_seed = 42
-        self._generator = None
+        self._global_seed = 42
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
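
A self-contained sketch of the per-device generator cache the diff suggests; the class and attribute names here are illustrative, not the sampler's actual ones.

import torch


class GeneratorCache:

    def __init__(self, seed: int = 42):
        self._seed = seed
        self._generators: dict[torch.device, torch.Generator] = {}

    def get_generator(self, device: torch.device) -> torch.Generator:
        # Lazily create one seeded generator per device instead of asserting
        # that every caller uses the same device.
        gen = self._generators.get(device)
        if gen is None:
            gen = torch.Generator(device=device)
            gen.manual_seed(self._seed)
            self._generators[device] = gen
        return gen


cache = GeneratorCache()
g_cpu = cache.get_generator(torch.device("cpu"))
print(torch.rand(2, generator=g_cpu))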

885-897: Use py_seq_slot consistently (fixes wrong writes with binding seq_slot)

TorchSampler elsewhere uses py_seq_slot; mixing in seq_slot can misroute writes.

-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token

Also applies to: 899-904

♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)

374-397: Int32 softmax_indices and filtering behavior LGTM

Type hint and greedy softmax-skipping logic align with established TRT-LLM conventions.


1-2: Add required NVIDIA copyright header

Per repo guidelines, prepend the NVIDIA header.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import enum

1018-1064: _apply_embedding_bias(): use id()-based dedup and Long indices for index_select

Tensor keys are unsafe/expensive; index_select requires Long indices.

-        # Indices of unique bias tensors
-        bias_to_index: dict[torch.Tensor,
-                            int] = defaultdict(provision_bias_index)
+        # Indices of unique bias tensors (group by identity, not contents)
+        bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+        biases: list[torch.Tensor] = []
@@
-            if req_bias is not None:
+            if req_bias is not None:
                 logits_bias_mask[i:(i + steps)] = True
-                req_bias_index = bias_to_index[req_bias]
+                key = id(req_bias)
+                req_bias_index = bias_to_index[key]
+                if req_bias_index == len(biases):
+                    biases.append(req_bias)
                 bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
-        bias_gather_indices_cuda = torch.tensor(bias_gather_indices,
-                                                pin_memory=True,
-                                                dtype=torch.int32).to(
-                                                    logits.device,
-                                                    non_blocking=True)
+        bias_gather_indices_cuda = torch.tensor(
+            bias_gather_indices, pin_memory=True, dtype=torch.long
+        ).to(logits.device, non_blocking=True)
@@
-        biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                                    pin_memory=True)
-        biases_tensor = torch.stack(
-            tuple(bias_to_index.keys()),
-            out=biases_tensor,
-        )
+        biases_tensor = torch.stack(
+            biases, out=torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+        )
@@
-        biases_tensor_cuda = torch.index_select(biases_tensor_cuda, 0,
-                                                bias_gather_indices_cuda)
+        biases_tensor_cuda = torch.index_select(
+            biases_tensor_cuda, 0, bias_gather_indices_cuda
+        )
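
Outside the sampler, the dedup-and-gather pattern in the diff boils down to the following sketch (the request/bias setup is invented for illustration):

import torch

vocab_size = 8
shared_bias = torch.randn(vocab_size)  # one tensor object shared by two requests
other_bias = torch.randn(vocab_size)
request_biases = [shared_bias, shared_bias, other_bias]

# Group by object identity (id), not by tensor contents.
bias_to_index: dict[int, int] = {}
unique_biases: list[torch.Tensor] = []
gather_indices: list[int] = []
for bias in request_biases:
    key = id(bias)
    if key not in bias_to_index:
        bias_to_index[key] = len(unique_biases)
        unique_biases.append(bias)
    gather_indices.append(bias_to_index[key])

biases_tensor = torch.stack(unique_biases)                # (num_unique, vocab_size)
indices = torch.tensor(gather_indices, dtype=torch.long)  # int64 indices are always accepted
per_request_bias = torch.index_select(biases_tensor, 0, indices)
print(per_request_bias.shape)  # torch.Size([3, 8])
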
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

39-56: Confirm minimum supported Python version (uses 3.10+ features)

This file uses kw_only dataclasses, PEP 634 pattern matching (match/case), and PEP 604 unions (X | Y), which require Python 3.10+. Repo guideline says “Python 3.8+”; please confirm runtime targets 3.10+ or adjust.

Would you like a codemod to replace match/case with if/elif and kw_only with runtime-safe alternatives if 3.8/3.9 must be supported?

Also applies to: 119-133, 374-397

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 72ae320 and fae40c5.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.127Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064
Timestamp: 2025-08-27T14:52:26.324Z
Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238
Timestamp: 2025-08-27T14:47:49.339Z
Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1359-1359: Line too long (125 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1235-1244: dim_order() usage OK

Public API; assertions ensure canonical layout before view/scatter.
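
For reference, a quick check of the documented behavior: a contiguous row-major tensor reports the canonical order, while a permuted view does not.

import torch

x = torch.empty(2, 3, 4)   # default (contiguous) layout
print(x.dim_order())       # (0, 1, 2) -- canonical row-major order

y = x.permute(2, 0, 1)     # same storage, permuted view
print(y.dim_order())       # (1, 2, 0) -- no longer canonical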

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

885-897: Use py_seq_slot consistently; seq_slot may be different/undefined here.

Other TorchSampler paths use py_seq_slot. Mixing the two risks indexing the wrong slot or AttributeError.

Apply:

-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1-2: Add required NVIDIA copyright header (current year).

Repository guideline requires the header on all source files.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import enum
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)

184-191: Guard top-k against k > vocab_size.

Prevents runtime errors when configs overspecify k.

Apply:

-    if top_k > 0:
-        values, indices = torch.topk(logits, top_k, dim=-1)
+    if top_k > 0:
+        top_k = min(top_k, vocab_size)
+        values, indices = torch.topk(logits, top_k, dim=-1)

249-256: Same guard for top-k in top-k+top-p path.

Apply:

-    if top_k > 0:
-        values, _ = torch.topk(logits, top_k, dim=-1)
+    if top_k > 0:
+        top_k = min(top_k, vocab_size)
+        values, _ = torch.topk(logits, top_k, dim=-1)
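
In isolation, the suggested clamp looks like this; a minimal sketch, not the sampler's actual helper.

import torch


def top_k_filter(logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Keep the top_k largest logits per row, masking the rest to -inf."""
    vocab_size = logits.size(-1)
    if top_k <= 0:
        return logits
    top_k = min(top_k, vocab_size)  # guard against k > vocab_size
    values, _ = torch.topk(logits, top_k, dim=-1)
    threshold = values[..., -1, None]  # smallest kept logit per row
    return logits.masked_fill(logits < threshold, float("-inf"))


logits = torch.randn(2, 5)
print(top_k_filter(logits, top_k=100))  # k larger than vocab_size: no error, nothing masked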

778-783: Make generator device-robust (multi-GPU).

Reinitialize when device changes instead of asserting.

Apply:

-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
+        if self._generator is None or self._generator.device != device:
+            self._generator = torch.Generator(device=device)
+            self._generator.manual_seed(self._global_seed)
         return self._generator

1026-1066: Prefer id()-based dedup for clarity; avoid relying on tensor keys.

Identity hashing of tensors is fine, but explicit id() clarifies intent and decouples from tensor-object key semantics.

Apply:

-        bias_to_index: dict[torch.Tensor,
-                            int] = defaultdict(provision_bias_index)
+        bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+        biases: list[torch.Tensor] = []
@@
-            if req_bias is not None:
+            if req_bias is not None:
                 logits_bias_mask[i:(i + steps)] = True
-                req_bias_index = bias_to_index[req_bias]
+                key = id(req_bias)
+                req_bias_index = bias_to_index[key]
+                if req_bias_index == len(biases):
+                    biases.append(req_bias)
                 bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
-        biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                                    pin_memory=True)
-        biases_tensor = torch.stack(
-            tuple(bias_to_index.keys()),
-            out=biases_tensor,
-        )
+        biases_tensor = torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+        biases_tensor = torch.stack(tuple(biases), out=biases_tensor)

52-56: Annotate optional fields as Optional[...] for type correctness.

Current defaults are None but annotations are non-optional.

Apply:

 class SampleState:
     scheduled_requests: ScheduledRequests
-
-    device: SampleStateTensors = None
-    host: SampleStateTensors = None
-
-    sampler_event: torch.cuda.Event = None
+    device: Optional[SampleStateTensors] = None
+    host: Optional[SampleStateTensors] = None
+    sampler_event: Optional[torch.cuda.Event] = None
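
The same pattern, reduced to a toy dataclass (field types are simplified; SampleState's real fields differ):

import dataclasses
from typing import Optional

import torch


@dataclasses.dataclass(kw_only=True)
class ToySampleState:
    # Optional[...] makes the None defaults consistent with the annotations,
    # which keeps strict type checkers happy.
    device: Optional[torch.Tensor] = None
    host: Optional[torch.Tensor] = None
    sampler_event: Optional[torch.cuda.Event] = None


state = ToySampleState()
print(state.device, state.sampler_event)  # None None
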
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fae40c5 and 81660f3.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.127Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064
Timestamp: 2025-08-27T14:52:26.324Z
Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238
Timestamp: 2025-08-27T14:47:49.339Z
Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:37:38.228Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:991-1070
Timestamp: 2025-08-27T15:37:38.228Z
Learning: PyTorch tensors are hashable as dict keys because hash(torch.Tensor) is equivalent to id(torch.Tensor) and uses identity-based hashing, not content-based hashing. This means tensor hashing doesn't scan tensor contents and tensors can be safely used as dictionary keys.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1362-1362: Line too long (125 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

359-371: Int32 indexing usage is consistent with established sampler design.

Acknowledging prior discussion: keeping int32 for index tensors here is intentional and works in this codepath.

Also applies to: 282-292, 1188-1194, 1301-1305, 430-488, 504-523, 1238-1247, 1260-1267


1229-1247: Use of torch.Tensor.dim_order is valid.

Public API; check is correct for canonical row-major ordering.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16712 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12543 completed with status: 'FAILURE'

@ixlmar
Copy link
Collaborator Author

ixlmar commented Aug 27, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16722 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16722 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12552 completed with status: 'FAILURE'

@dcampora dcampora enabled auto-merge (squash) September 19, 2025 17:26
Copy link
Collaborator

@dcampora dcampora left a comment

Looks good to me.

Copy link
Collaborator

@byshiue byshiue left a comment

approved from qwen-dev side

auto-merge was automatically disabled September 22, 2025 09:39

Head branch was pushed to by a user without write access

@ixlmar ixlmar force-pushed the feat/batched-sampling branch from 4c98e8b to b0981e2 on September 22, 2025 09:39
@ixlmar
Copy link
Collaborator Author

ixlmar commented Sep 22, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19563 [ run ] triggered by Bot

Copy link
Collaborator

@DomBrown DomBrown left a comment

LGTM

Copy link
Collaborator

@Superjomn Superjomn left a comment

LGTM on the llmapi changes.

Copy link
Collaborator

@syuoni syuoni left a comment

LGTM

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19563 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14707 completed with status: 'FAILURE'

@chzblych
Copy link
Collaborator

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19638 [ run ] triggered by Bot

@ixlmar
Copy link
Collaborator Author

ixlmar commented Sep 23, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19666 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19638 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14773 (Blue Ocean) completed with status: ABORTED

@dcampora dcampora enabled auto-merge (squash) September 23, 2025 09:03
@tensorrt-cicd
Copy link
Collaborator

PR_Github #19666 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14799 completed with status: 'SUCCESS'
