[TRTLLM-5670][feat] Support top log probabilities #7296
base: main
Conversation
📝 Walkthrough

Adds an optional top logprobs feature across the C++ OutputConfig, Python bindings, PyTorch backend, sampler, and API layers. Introduces configuration (max_top_logprobs, top_logprobs), propagates it through constructors, pickling, and execution paths, updates the sampler to compute and store top tokens/logprobs, and extends results and examples. Tests cover API and sampler behavior.

Changes
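For orientation, here is a minimal usage sketch of the feature as the walkthrough describes it. The top_logprobs and max_top_logprobs parameter names come from this PR; the model path and the exact constructor signature are illustrative assumptions, not verified against the branch.

```python
from tensorrt_llm import LLM, SamplingParams

# max_top_logprobs is the backend-wide cap introduced by this PR (assumed kwarg).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", max_top_logprobs=4)

params = SamplingParams(
    max_tokens=8,
    logprobs=1,        # per-token logprob of the sampled token
    top_logprobs=4,    # additionally request the 4 best candidates per position
    top_k=50,
    top_p=0.95,
    temperature=1.0,
)

for output in llm.generate(["The capital of France is"], params):
    seq = output.outputs[0]
    # seq.top_logprobs[i] is expected to map token id -> Logprob(logprob, rank)
    print(seq.text, seq.top_logprobs[0])
```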
Sequence Diagram(s)

sequenceDiagram
autonumber
actor U as User/CLI
participant Q as quickstart_advanced.py
participant L as LLM
participant SP as SamplingParams
participant OCp as OutputConfig (Python)
participant B as Bindings (pybind/nanobind)
participant OCc as OutputConfig (C++)
participant EX as Executor
participant TS as TorchSampler
participant R as Result (Python)
U->>Q: --top_logprobs / --max_top_logprobs
Q->>L: init(max_top_logprobs, enable_mixed_sampler)
Q->>SP: create(top_logprobs, logprobs, top_k/top_p)
SP-->>Q: validated params
Q->>OCp: map params to output config
OCp->>B: construct / pickle state (8-tuple incl. top_logprobs)
B->>OCc: construct with topLogProbs
OCc->>EX: submit request
EX->>TS: sample_async(...)
TS-->>EX: next_tokens, log_probs, top_tokens/top_log_probs (<= max_top_logprobs)
EX-->>R: response_tensors incl. top_log_probs
R-->>U: CompletionOutput.top_logprobs
sequenceDiagram
autonumber
participant TS as TorchSampler
participant ST as SampleState
participant HS as Host Buffers
TS->>ST: run sampling (top-k/top-p/greedy)
ST-->>TS: (next_tokens, softmax, indices/all_token_ids)
TS->>TS: compute log_probs_host()
alt need_top_logprobs
TS->>HS: write top_tokens_host, top_log_probs_host (bounded by max_top_logprobs)
else skip
TS-->>HS: no-op
end
TS-->>ST: update SampleStateTensorsHostTorch with top_tokens/top_log_probs
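The "write top_tokens_host / top_log_probs_host (bounded by max_top_logprobs)" step in the second diagram boils down to a bounded top-k over the post-sampling distribution. A standalone sketch of that idea (illustrative only, not the actual TorchSampler code; the PR rejects requests above the cap, while this sketch simply clamps):

```python
import torch

def extract_top_logprobs(softmax: torch.Tensor, top_logprobs: int,
                         max_top_logprobs: int):
    """Return the k best token ids and their log-probs per row, with k bounded by the cap."""
    k = min(top_logprobs, max_top_logprobs)
    if k <= 0:
        return None, None
    log_probs = torch.log(softmax)                    # [batch, vocab]
    top_log_probs, top_tokens = torch.topk(log_probs, k, dim=-1)
    return top_tokens, top_log_probs                  # each [batch, k]
```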
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Suggested reviewers
Tip: 👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```
Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
d70796a to f81fb63 — Compare
/bot run

PR_Github #16678 [ run ] triggered by Bot
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/sampling_params.py (1)
466-471
: Enable logprob computation when top_logprobs/prompt_logprobs are requested

Without setting return_log_probs when only top_logprobs (or prompt_logprobs) is requested, the backend may skip computing logprobs entirely, yielding empty top_logprobs. Set the flag if any of logprobs, prompt_logprobs, or top_logprobs are requested.
Apply this diff:
```diff
-        if is_pytorch_backend:
-            config_kwargs["return_log_probs"] = bool(self.logprobs)
-        else:
-            config_kwargs["return_log_probs"] = self._return_log_probs
+        if is_pytorch_backend:
+            config_kwargs["return_log_probs"] = (
+                bool(self.logprobs) or bool(self.prompt_logprobs) or (self.top_logprobs > 0)
+            )
+        else:
+            config_kwargs["return_log_probs"] = (
+                self._return_log_probs or bool(self.prompt_logprobs) or (self.top_logprobs > 0)
+            )
```
🧹 Nitpick comments (25)
tensorrt_llm/_torch/pyexecutor/config.py (1)
117-119
: New PyTorchConfig knob max_top_logprobs added — OK.

Minor: consider documenting/validating that it must be >= 0 (consistent with SamplingParams.top_logprobs).
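If such a guard is wanted, one lightweight option is a __post_init__ check. The snippet below is only a toy sketch that assumes PyTorchConfig is a plain dataclass, which may not match the actual definition in config.py:

```python
from dataclasses import dataclass

@dataclass
class PyTorchConfig:
    max_top_logprobs: int = 0

    def __post_init__(self) -> None:
        # Keep the knob consistent with SamplingParams.top_logprobs (>= 0).
        if self.max_top_logprobs < 0:
            raise ValueError("max_top_logprobs must be >= 0")
```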
tensorrt_llm/llmapi/llm_args.py (1)
1330-1332
: Validate BaseLlmArgs.max_top_logprobs is non-negative.

Add a pydantic validator to prevent accidental negatives.
```python
# Add near other validators in BaseLlmArgs
@field_validator("max_top_logprobs")
@classmethod
def validate_max_top_logprobs(cls, v: int):
    if v < 0:
        raise ValueError("max_top_logprobs must be >= 0")
    return v
```

tensorrt_llm/executor/result.py (2)
114-115
: Expose top_logprobs in CompletionOutput — add docstring entry for completeness.

Keep the Args section in sync with fields.
```python
# In CompletionOutput docstring Args section, add:
top_logprobs (TokenLogprobs, optional): For each generated position, a dict of
    top token ids to Logprob (logprob, rank). Defaults to None.
```
261-267
: Align top_logprobs truncation with logprobs (optional assert).

Mirror the post-slice assert to catch mismatches early.
```diff
             if response_tensors.top_log_probs is not None:
                 output.top_logprobs = response_tensors.top_log_probs[src_idx]
                 if finish_reasons[src_idx] != tllm.FinishReason.CANCELLED:
                     if len(output.top_logprobs) > output.length:
-                        output.top_logprobs = output.top_logprobs[:output.
-                                                                  length]
+                        output.top_logprobs = output.top_logprobs[:output.length]
+                        assert len(output.top_logprobs) == output.length
```

tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
59-64
: Make assertions robust to early EOS (avoid flakiness).

Generated length can be < max_tokens; assert against the actual output length.
```diff
-    if top_logprobs > 0:
-        assert len(output.outputs[0].logprobs) == max_tokens
-        assert len(output.outputs[0].top_logprobs) == max_tokens
-        assert len(output.outputs[0].logprobs[0]) == 1
-        assert len(output.outputs[0].top_logprobs[0]) == top_logprobs
+    if top_logprobs > 0:
+        length = output.outputs[0].length
+        assert len(output.outputs[0].logprobs) == length
+        assert len(output.outputs[0].top_logprobs) == length
+        assert len(output.outputs[0].logprobs[0]) == 1
+        assert len(output.outputs[0].top_logprobs[0]) == top_logprobs
```
21-35
: Optional: make the test deterministic.

Pass a fixed random seed to SamplingParams in the loop to reduce flakiness across CI runs.
```python
# When constructing SamplingParams in the test:
SamplingParams(..., random_seed=1234)
```

tensorrt_llm/sampling_params.py (4)
328-341
: Validation: unify greedy detection and clarify error messages
- Greedy detection here conflicts with quickstart_advanced.py (that uses top_k in {None, 0}). Pick one definition across the codebase to avoid surprising rejections. Prefer treating None as 0 to reflect TRT defaults.
- Error message “cannot exceed 0” is awkward; say “must be 0”.
Apply this diff to improve the message and (optionally) to use consistent greedy detection:
```diff
-        if self._greedy_decoding and self.top_logprobs > 0:
+        if self._greedy_decoding and self.top_logprobs > 0:  # check for greedy sampling
             raise ValueError(
-                f"top_logprobs {self.top_logprobs} cannot exceed 0 for greedy sampling"
+                f"top_logprobs must be 0 for greedy sampling; got {self.top_logprobs}"
             )
```

Optionally align _greedy_decoding with the quickstart heuristic:
```diff
     def _greedy_decoding(self) -> bool:
-        return (
-            not self.use_beam_search
-            and (self.top_k is None or self.top_k == 1)
-            and (self.top_p is None or self.top_p == 0.0)
-        )
+        tk = 0 if self.top_k is None else self.top_k
+        tp = 0.0 if self.top_p is None else self.top_p
+        return (not self.use_beam_search) and (tk == 0) and (tp == 0.0)
```
324-327
: Normalize user input type for top_logprobs (parity with logprobs/prompt_logprobs)

Users may pass booleans; normalize to int as you do for logprobs and prompt_logprobs.
Apply this diff:
```diff
         self.logprobs = self.logprobs and int(self.logprobs)
         self.prompt_logprobs = self.prompt_logprobs and int(self.prompt_logprobs)
+        self.top_logprobs = int(self.top_logprobs)
```
176-176
: Docstring accuracy: “optional” vs defaulted int

The field is an int with a default (0), not Optional[int]. Update the docstring accordingly to avoid confusion.
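For instance, the corrected Google-style entry could read roughly as follows (the wording is only a suggestion, not text from the PR):

```python
class SamplingParams:
    """...

    Args:
        top_logprobs (int): Number of top log probabilities to return per
            generated token. 0 disables the feature. Defaults to 0.
    """
```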
1-1
: Missing NVIDIA copyright header (policy)

Per guidelines, prepend the NVIDIA copyright header (2025) to this .py file.
Example:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import json
```

cpp/tensorrt_llm/executor/outputConfig.cpp (1)
25-35
: Constructor wiring for topLogProbs looks correct

The parameter is appended at the end and member initialization order is preserved. Consider documenting in the header that 0 disables top-logprobs to avoid ambiguity with std::optional storage.
tensorrt_llm/_torch/pyexecutor/_util.py (3)
662-665
: Consistency with greedy heuristic

instantiate_sampler uses enable_mixed_sampler from the CLI’s is_greedy. If you adopt the unified greedy detection in SamplingParams, consider deriving enable_mixed_sampler from the same logic to prevent drift.
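One way to prevent that drift would be a single shared predicate that both SamplingParams and instantiate_sampler consult. The helper below is hypothetical (it does not exist in the PR) and treats unset values as the TRT defaults, as suggested above:

```python
def is_greedy_sampling(top_k, top_p, use_beam_search: bool = False) -> bool:
    """Hypothetical shared helper: treat None as 0 / 0.0 and call that greedy."""
    tk = 0 if top_k is None else top_k
    tp = 0.0 if top_p is None else top_p
    return (not use_beam_search) and tk == 0 and tp == 0.0
```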
1-12
: Missing NVIDIA copyright header (policy)

Add the 2025 NVIDIA header at the top of this Python file.
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import os
```
638-651: Ensure non-negative max_top_logprobs is enforced

I don’t see any explicit validation (e.g., >= 0) on the max_top_logprobs field in the executor configuration. While the sampler itself rejects requests whose top_logprobs exceed max_top_logprobs (raising a ValueError), a negative max_top_logprobs would lead to an invalid tensor shape (1 + max_top_logprobs <= 0) later in TorchSampler and cause a runtime error.

Please add a check to clamp or validate max_top_logprobs >= 0 at the utility boundary so that invalid negative values are caught early.

• In tensorrt_llm/_torch/pyexecutor/_util.py:create_torch_sampler_args, before constructing TorchSampler.Args, add:

```diff
-    return TorchSampler.Args(
+    # Ensure max_top_logprobs is non-negative
+    max_top_logprobs = max(0, max_top_logprobs)
+    return TorchSampler.Args(
         max_seq_len=max_seq_len,
         max_draft_len=max_draft_len,
         max_num_sequences=max_num_sequences,
         max_beam_width=executor_config.max_beam_width,
-        max_top_logprobs=max_top_logprobs,
+        max_top_logprobs=max_top_logprobs,
         enable_mixed_sampler=enable_mixed_sampler,
```

• Alternatively, add a Pydantic constraint (ge=0) on the max_top_logprobs field of ExecutorConfig in config.py.

examples/llm-api/quickstart_advanced.py (4)
150-152
: CLI: add help text and bounds checking for new args

Add helpful descriptions and validate non-negative values early to fail fast.
```diff
-    parser.add_argument('--top_logprobs', type=int, default=0)
-    parser.add_argument('--max_top_logprobs', type=int, default=0)
+    parser.add_argument('--top_logprobs', type=int, default=0,
+                        help='Return the top-k log probabilities per generated token (0 disables).')
+    parser.add_argument('--max_top_logprobs', type=int, default=0,
+                        help='Upper bound for top_logprobs supported by the backend (resource cap).')
```

And before setup_llm:

```diff
+    if args.top_logprobs < 0 or args.max_top_logprobs < 0:
+        raise SystemExit("--top_logprobs and --max_top_logprobs must be >= 0")
```
281-284
: Passing boolean to logprobs relies on implicit normalization

This depends on SamplingParams converting True→1. After adding explicit normalization for top_logprobs too, behavior will be consistent.
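For reference, the conversion this relies on is plain Python behavior (bool is a subclass of int), illustrated here outside the repo code:

```python
for value in (True, False, 2):
    normalized = value and int(value)   # mirrors the SamplingParams normalization pattern
    print(value, "->", normalized)      # True -> 1, False -> False, 2 -> 2
```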
323-326
: Printing: guard against None/empty fields

In mixed backends, sequence.top_logprobs may be None/empty. Add a truthiness check to avoid noisy prints.

```diff
-            if args.top_logprobs:
+            if args.top_logprobs and sequence.top_logprobs:
                 print(
                     f"[{i}]{sequence_id_text} Top logprobs: {sequence.top_logprobs}"
                 )
```
1-1
: Missing NVIDIA copyright header (policy)

Add the 2025 NVIDIA header.

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
```

cpp/include/tensorrt_llm/executor/executor.h (3)
218-220
: API surface: constructor extension is clear

Appending topLogProbs at the end is backward-compatible. Consider documenting that 0 disables it and that None (via bindings) maps to 0 at construction time.
238-239
: Member type vs constructor parameter type

The member is std::optional but the constructor takes a SizeType32. It works (implicit optional construction), but accepting std::optional in the ctor would better reflect the member semantics.

```diff
-        SizeType32 topLogProbs = 0);
+        std::optional<SizeType32> topLogProbs = 0);
```
2-2
: Update copyright year

The file header still says 2022-2024. Update to include 2025 per guidelines.

```diff
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
```

cpp/tensorrt_llm/pybind/executor/request.cpp (1)
2-2
: Update SPDX copyright year.

Guideline requires the current year; the header shows 2022-2024.

```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
127-133
: Validate top_log_probs shape when provided.

Guard against mismatched shapes to avoid IndexError on extend.

```diff
         if self.beam_width == -1:
             self._init(new_probs)
         assert len(new_probs) == self.beam_width, "Beam width mismatch"
+        if top_log_probs is not None:
+            assert len(top_log_probs) == self.beam_width, "Beam width mismatch for top_log_probs"
```
1-3
: Missing NVIDIA 2025 header.

Guidelines require the header at the top of .py files.
Add the standard 2025 NVIDIA SPDX header to this file.
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
1-1
: Missing NVIDIA 2025 header.

Add the standard header at the top of this .py per guidelines.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (14)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (9 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (10 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a30.yml (1 hunks)
- tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
tensorrt_llm/sampling_params.py
tensorrt_llm/executor/result.py
tensorrt_llm/_torch/pyexecutor/config.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
tensorrt_llm/sampling_params.py
tensorrt_llm/executor/result.py
cpp/tensorrt_llm/executor/outputConfig.cpp
tensorrt_llm/_torch/pyexecutor/config.py
examples/llm-api/quickstart_advanced.py
cpp/include/tensorrt_llm/executor/executor.h
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/sampler.py
cpp/tensorrt_llm/nanobind/executor/request.cpp
tensorrt_llm/_torch/pyexecutor/llm_request.py
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/tensorrt_llm/nanobind/executor/request.cpp
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/include/tensorrt_llm/executor/executor.h
cpp/tensorrt_llm/nanobind/executor/request.cpp
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hh,hpp,hxx,cuh}
: Closing braces of C++ namespaces in headers must include a trailing comment naming the namespace
Use Allman brace style and always use braces for control statements in headers as well
C++ header filenames must be lowerCamelCase and case-insensitively unique within a compilation target
Document public C++ interfaces with Doxygen using //! and //!<; C-style comments are not allowed except inline special cases; single-line comments should use // and be properly capitalized and punctuated if full sentences
Avoid assignment in subexpressions within header inline/template code as well
All class/function templates and their members should be instantiated at least once; if a class is not POD, its data members should be private
Use header include guards; name as TRTLLM__H (all caps of filename only, no dirs), no leading underscore and no trailing underscore
Files:
cpp/include/tensorrt_llm/executor/executor.h
🧠 Learnings (3)
📚 Learning: 2025-08-26T09:37:10.431Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.431Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/llmapi/llm_args.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (6)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
- tensorrt_llm/sampling_params.py (1): SamplingParams (125-502)
- tensorrt_llm/llmapi/llm_args.py (2): CudaGraphConfig (108-165), KvCacheConfig (946-1077)

tensorrt_llm/sampling_params.py (2)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): top_k (221-221)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1): use_beam_search (519-520)

tensorrt_llm/executor/result.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): top_log_probs (250-251)

cpp/include/tensorrt_llm/executor/executor.h (2)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1): AdditionalModelOutput (37-41)
- tensorrt_llm/sampling_params.py (1): AdditionalModelOutput (112-121)

tensorrt_llm/_torch/pyexecutor/_util.py (1)
- tensorrt_llm/_torch/attention_backend/trtllm.py (2): max_seq_len (558-568), max_seq_len (571-575)

tensorrt_llm/_torch/pyexecutor/sampler.py (2)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (3): top_log_probs (250-251), log_probs (246-247), append_log_probs (195-206)
- tensorrt_llm/executor/result.py (1): Logprob (37-40)
🔇 Additional comments (11)
tests/integration/test_lists/test-db/l0_a30.yml (1)
23-23: Verification complete: Test entry is valid

- Confirmed that tests/unittest/_torch/sampler/test_return_top_k_logprobs.py exists
- Confirmed a single reference in tests/integration/test_lists/test-db/l0_a30.yml at line 23

LGTM.
tensorrt_llm/llmapi/llm_args.py (1)
2589-2591
: PyTorchConfig.max_top_logprobs enforcement confirmed
- In
_util.py
,create_torch_sampler_args
forwardsmax_top_logprobs
fromPyTorchConfig
intoTorchSampler.Args
unchanged (lines 640–646).- In
sampler.py
, theTorchSampler
constructor setsself.max_top_logprobs = args.max_top_logprobs
and uses it to shape the output buffer (NEW_TOKENS_SHAPE
) accordingly (around lines 388–394).- Further down in
sampler.py
, each request’spy_top_logprobs
is compared againstself.max_top_logprobs
, raising aValueError
if it exceeds the configured cap (lines 753–757).No additional action needed—your original plumbing correctly enforces the cap downstream.
examples/llm-api/quickstart_advanced.py (2)
218-225
: Greedy heuristic diverges from SamplingParams._greedy_decoding

This script treats greedy as (top_k in {None, 0} and top_p in {None, 0.0}); SamplingParams currently uses {None, 1} for top_k. Align both sides to prevent surprising validation errors.
Would you like me to patch SamplingParams._greedy_decoding to match this heuristic (tk==0 && tp==0.0)?
260-262
: Nice: propagating max_top_logprobs and gating mixed samplerWiring looks good. Once greedy detection is unified, this will stay consistent.
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
116-126
: Initialize top_log_probs alongside log_probs (good).Per-beam containers are created consistently with beam_width. No issues.
249-252
: Property exposure for top_log_probs (good).Matches PyResult API and LlmResult forwarding.
310-317
: Thread top_logprobs into LlmRequest (good plumbing).Attribute naming is consistent (py_top_logprobs).
576-578
: Forward OutputConfig.top_logprobs into LlmRequest (good).Ensures executor->torch path carries the value.
1-20
: Python 3.8 compatibility of typing.

The file uses built-in generics like list[T], which require Python 3.9+. If the project target remains 3.8 per guidelines, switch to typing.List and friends or ensure the repo targets >=3.9.
Would you like me to generate a patch converting annotations to typing.List[...] for 3.8 compatibility?
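Either remedy is mechanical; a small sketch of the two options on a made-up annotation (illustrative only, not the file's actual signatures):

```python
# Option 1: keep the modern syntax but defer annotation evaluation.
from __future__ import annotations

def head_tokens(token_ids: list[int], k: int) -> list[int]:
    return token_ids[:k]

# Option 2: use typing generics that evaluate fine on Python 3.8.
from typing import List

def head_tokens_38(token_ids: List[int], k: int) -> List[int]:
    return token_ids[:k]
```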
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
584-591
: Docstring shape updated (good).Host buffer shape matches device layout and max_top_logprobs dimensioning.
392-399
: Store sizing (good).

NEW_TOKENS_SHAPE includes the 1 + max_top_logprobs channel; aligns with later slicing.
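To make the sizing concrete, here is a toy illustration of that "1 + max_top_logprobs" channel layout. The dimension names follow the PR; the code itself is only a sketch, not repo code:

```python
import torch

max_tokens, max_num_sequences, max_beam_width, max_top_logprobs = 2, 4, 1, 3

# Slot 0 of the last dimension holds the sampled token; slots 1..max_top_logprobs
# hold the top candidates when top_logprobs is requested.
new_tokens = torch.zeros(
    (max_tokens, max_num_sequences, max_beam_width, 1 + max_top_logprobs),
    dtype=torch.long)
print(new_tokens.shape)  # torch.Size([2, 4, 1, 4])
```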
@@ -0,0 +1,72 @@
import os
Missing NVIDIA copyright header.
Add the standard header per repo guidelines.
+# Copyright (c) 2025, NVIDIA CORPORATION.
+#
+#
import os
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Copyright (c) 2025, NVIDIA CORPORATION.
#
#
import os
```
🤖 Prompt for AI Agents
In tests/unittest/_torch/sampler/test_return_top_k_logprobs.py around lines 1 to
1, the file is missing the required NVIDIA copyright header; add the
repository's standard copyright/license header at the very top of the file
(above any imports) matching other test files, ensuring year and owner fields
follow project conventions.
PR_Github #16678 [ run ] completed with state
/bot run
PR_Github #16685 [ run ] triggered by Bot
PR_Github #16685 [ run ] completed with state
51d599d to b70e895 — Compare
/bot run

PR_Github #16721 [ run ] triggered by Bot
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/sampling_params.py (1)
1-1
: Add required NVIDIA copyright header.

Per repo guidelines, prepend the SPDX header to Python sources.
Apply this diff:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import json
```

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
1-1
: Add required NVIDIA copyright header.

Per repo guidelines, prepend the SPDX header to Python sources.
Apply this diff:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from copy import deepcopy
```

tensorrt_llm/_torch/pyexecutor/sampler.py (3)
154-185: UnboundLocalError risk: indices may be undefined when top_k <= 0.

top_k_sampling_batch returns indices unconditionally, but it is only assigned inside if top_k > 0. This will raise at runtime when top_k == 0 (a valid “no-top-k” case).

Apply this diff:

```diff
 def top_k_sampling_batch(logits, top_k=50, generator: Optional[torch.Generator] = None):
 @@
     batch_size, vocab_size = logits.size()
+    indices = None
 @@
     next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1)
     return next_tokens, softmax, indices
```
230-280: Same indices initialization issue in top_k_top_p_sampling_batch.

Mirror the fix to guarantee indices is defined for top_k <= 0.

Apply this diff:

```diff
     if temperature != 0:
         logits = logits / max(temperature, 1e-5)
     batch_size, vocab_size = logits.size()
+    indices = None
 @@
     next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1)
     return next_tokens, softmax, indices
```
1-1
: Add required NVIDIA copyright header.

Per repo guidelines, prepend the SPDX header to Python sources.

Apply this diff:
Apply this diff:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + from abc import ABC, abstractmethodtests/unittest/bindings/test_executor_bindings.py (1)
1-1
: Add required NVIDIA copyright header.Per repo guidelines, prepend the SPDX header to Python sources (tests included, unless exempted by project policy).
Apply this diff:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import datetime
♻️ Duplicate comments (3)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
206-214
: Make OutputConfig pickle backward-compatible (accept 7- or 8-field state).Current
__setstate__
requires exactly 8 items, breaking round-trip with older pickles. Accept 7 or 8 and defaulttopLogProbs
to nullopt when missing.- if (state.size() != 8) + if (state.size() != 7 && state.size() != 8) { throw std::runtime_error("Invalid OutputConfig state!"); } - new (&outputConfig) tle::OutputConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]), - nb::cast<bool>(state[2]), nb::cast<bool>(state[3]), nb::cast<bool>(state[4]), nb::cast<bool>(state[5]), - nb::cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(state[6]), - nb::cast<std::optional<SizeType32>>(state[7])); + auto additional + = nb::cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(state[6]); + std::optional<SizeType32> top + = (state.size() == 8) ? nb::cast<std::optional<SizeType32>>(state[7]) : std::nullopt; + new (&outputConfig) tle::OutputConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]), + nb::cast<bool>(state[2]), nb::cast<bool>(state[3]), nb::cast<bool>(state[4]), nb::cast<bool>(state[5]), + additional, top);cpp/tensorrt_llm/pybind/executor/request.cpp (1)
191-207
: Preserve pickle round-trip and back-compat for OutputConfig.Emit 8 fields and accept both 7- and 8-field states; default
top_logprobs
when absent.auto outputConfigGetstate = [](tle::OutputConfig const& self) { return py::make_tuple(self.returnLogProbs, self.returnContextLogits, self.returnGenerationLogits, - self.excludeInputFromOutput, self.returnEncoderOutput, self.returnPerfMetrics, self.additionalModelOutputs, - self.topLogProbs); + self.excludeInputFromOutput, self.returnEncoderOutput, self.returnPerfMetrics, + self.additionalModelOutputs, self.topLogProbs); }; auto outputConfigSetstate = [](py::tuple const& state) { - if (state.size() != 8) + if (state.size() != 7 && state.size() != 8) { throw std::runtime_error("Invalid OutputConfig state!"); } - return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), - state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), - state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(), - state[7].cast<std::optional<SizeType32>>()); + auto additional = state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(); + std::optional<SizeType32> top + = (state.size() == 8) ? state[7].cast<std::optional<SizeType32>>() : std::nullopt; + return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), + state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), additional, top); };tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1)
1-1
: Add NVIDIA copyright header.Required by coding guidelines.
+#!/usr/bin/env python3 +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import os
🧹 Nitpick comments (8)
tensorrt_llm/sampling_params.py (1)
328-333
: Clarify validation error message for top_logprobs dependency.
logprobs
is an int-or-bool-like count; saying “set logprobs to True” is misleading. Recommend wording tied to count semantics.Apply this diff:
- if not self.logprobs: - raise ValueError("You need to set logprobs to True to use top_logprobs.") + if not self.logprobs: + raise ValueError("top_logprobs requires logprobs >= 1.")tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
141-147
: Guard top_log_probs beam dimension in append().When top_log_probs is provided, ensure its outer list matches beam_width to prevent IndexError/misalignment.
Apply this diff:
for beam_idx, probs in enumerate(new_probs): self.log_probs[beam_idx].extend(probs) - if top_log_probs is not None: - self.top_log_probs[beam_idx].extend(top_log_probs[beam_idx]) + if top_log_probs is not None: + if len(top_log_probs) != self.beam_width: + raise ValueError("Beam width mismatch for top_log_probs") + self.top_log_probs[beam_idx].extend(top_log_probs[beam_idx])tensorrt_llm/_torch/pyexecutor/sampler.py (2)
626-635
: Tidy long docstring line and clarify meaning.Break the long line and make explicit that extra columns are present only when top_logprobs > 0.
Apply this diff:
- def log_probs_host(self, scheduled_requests: ScheduledRequests): - """Shape: max_num_sequences, max_beam_width == 1, max_tokens, 1 + max_top_logprobs) - The last dimension contains log_probs for the sampled tokens and additionally log_probs for the top tokens if top_logprobs is specified""" + def log_probs_host(self, scheduled_requests: ScheduledRequests): + """Shape: (max_num_sequences, max_beam_width==1, max_tokens, 1 + max_top_logprobs). + The last dim holds sampled-token log_probs and, when top_logprobs > 0, the top-k token log_probs."""
796-801
: Typo in comment (“to0”).Minor nit for readability.
Apply this diff:
- # otherwise, it is None and should be defaulted to0 + # otherwise, it is None and should be defaulted to 0tests/unittest/bindings/test_executor_bindings.py (2)
883-907
: Prefer truthiness checks over ‘== True/False’ (ruff E712).Use
assert var
/assert not var
oris True/False
for clarity and to satisfy linters.Apply this diff:
- assert config.return_log_probs == False - assert config.return_context_logits == False - assert config.return_generation_logits == False - assert config.exclude_input_from_output == False - assert config.return_encoder_output == False - assert config.return_perf_metrics == False + assert not config.return_log_probs + assert not config.return_context_logits + assert not config.return_generation_logits + assert not config.exclude_input_from_output + assert not config.return_encoder_output + assert not config.return_perf_metrics @@ - assert config.return_log_probs == True - assert config.return_context_logits == False - assert config.return_generation_logits == True - assert config.exclude_input_from_output == False - assert config.return_encoder_output == True - assert config.return_perf_metrics == False + assert config.return_log_probs is True + assert not config.return_context_logits + assert config.return_generation_logits is True + assert not config.exclude_input_from_output + assert config.return_encoder_output is True + assert not config.return_perf_metrics
910-926
: Same style nit in pickle test.Align with boolean assertion style.
Apply this diff:
- assert config_copy.return_log_probs == True - assert config_copy.return_context_logits == False - assert config_copy.return_generation_logits == True - assert config_copy.exclude_input_from_output == False - assert config_copy.return_encoder_output == True - assert config_copy.return_perf_metrics == False + assert config_copy.return_log_probs is True + assert not config_copy.return_context_logits + assert config_copy.return_generation_logits is True + assert not config_copy.exclude_input_from_output + assert config_copy.return_encoder_output is True + assert not config_copy.return_perf_metricscpp/tensorrt_llm/pybind/executor/request.cpp (1)
2-2
: Update copyright year to 2025.Keep headers current per repo guideline.
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1)
80-86
: Rename test for clarity/grammar.“disabler” → “disabled”.
-def test_generate_with_top_logprobs_and_disabler_logprobs( +def test_generate_with_top_logprobs_and_disabled_logprobs( llm_torch, input_prompts):
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (10 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (12 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a30.yml (1 hunks)
- tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1 hunks)
- tests/unittest/bindings/test_executor_bindings.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/llmapi/llm_args.py
- tests/integration/test_lists/test-db/l0_a30.yml
- tensorrt_llm/executor/result.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- examples/llm-api/quickstart_advanced.py
- cpp/include/tensorrt_llm/executor/executor.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
tests/unittest/bindings/test_executor_bindings.py
cpp/tensorrt_llm/executor/outputConfig.cpp
tensorrt_llm/sampling_params.py
tensorrt_llm/_torch/pyexecutor/sampler.py
cpp/tensorrt_llm/nanobind/executor/request.cpp
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/bindings/test_executor_bindings.py
tensorrt_llm/sampling_params.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
🧬 Code graph analysis (6)
cpp/tensorrt_llm/pybind/executor/request.cpp (2)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (1)
OutputConfig
(23-35)
tests/unittest/bindings/test_executor_bindings.py (3)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (2)
OutputConfig
(23-35)AdditionalModelOutput
(37-41)tensorrt_llm/sampling_params.py (1)
AdditionalModelOutput
(112-121)
tensorrt_llm/sampling_params.py (2)
tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_beam_search
(519-520)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
top_log_probs
(250-251)calculate_top_logprobs
(422-423)LlmRequest
(306-454)log_probs
(246-247)append_log_probs
(195-206)tensorrt_llm/executor/result.py (1)
Logprob
(37-40)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
calculate_top_logprobs
(430-431)tensorrt_llm/_torch/shared_tensor/shared_tensor.py (3)
SharedTensorContainer
(95-404)from_tensor
(313-331)dump_to_dict
(375-404)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-508)tensorrt_llm/llmapi/llm_args.py (2)
CudaGraphConfig
(108-165)KvCacheConfig
(946-1077)
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
896-896: Avoid equality comparisons to True
; use config.return_log_probs:
for truth checks
Replace with config.return_log_probs
(E712)
tensorrt_llm/_torch/pyexecutor/sampler.py
628-628: Line too long (146 > 120)
(E501)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
91-91: Local variable sampling_params
is assigned to but never used
Remove assignment to unused variable sampling_params
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/executor/outputConfig.cpp (1)
23-35
: Constructor extension looks correct; ownership and member initialization LGTM.
```python
sampling_params = SamplingParams(max_tokens=max_tokens,
                                 logprobs=False,
                                 top_k=top_k,
                                 top_p=top_p,
                                 temperature=1.0,
                                 top_logprobs=top_logprobs)
```
🛠️ Refactor suggestion
Remove unused variable to satisfy Ruff F841 (second occurrence).
- sampling_params = SamplingParams(max_tokens=max_tokens,
- logprobs=False,
- top_k=top_k,
- top_p=top_p,
- temperature=1.0,
- top_logprobs=top_logprobs)
+ SamplingParams(max_tokens=max_tokens,
+ logprobs=False,
+ top_k=top_k,
+ top_p=top_p,
+ temperature=1.0,
+ top_logprobs=top_logprobs)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
SamplingParams(max_tokens=max_tokens,
               logprobs=False,
               top_k=top_k,
               top_p=top_p,
               temperature=1.0,
               top_logprobs=top_logprobs)
```
🧰 Tools
🪛 Ruff (0.12.2)
91-91: Local variable sampling_params is assigned to but never used. Remove the assignment to the unused variable sampling_params (F841).
🤖 Prompt for AI Agents
In tests/unittest/_torch/sampler/test_return_top_k_logprobs.py around lines 91
to 96, there is a second, unused declaration of a variable (triggering Ruff
F841); remove the redundant variable declaration and update the SamplingParams
call so it does not reference that removed variable (use None or omit the
parameter if optional), ensuring no unused variable remains.
PR_Github #16721 [ run ] completed with state
/bot run
b70e895 to d6e9f5e — Compare
/bot run
PR_Github #16804 [ run ] triggered by Bot
PR_Github #16805 [ run ] triggered by Bot
PR_Github #16804 [ run ] completed with state
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tests/unittest/bindings/test_executor_bindings.py (1)
896-907: Support optional top_logprobs in the OutputConfig binding

- In cpp/tensorrt_llm/pybind/executor/request.cpp, change the eighth py::init parameter from SizeType32 to std::optional (and keep py::arg("top_logprobs") = py::none()).
- Update the outputConfigSetstate lambda to accept both 7- and 8-element state tuples, defaulting a missing top_logprobs to std::nullopt.

examples/llm-api/quickstart_advanced.py (1)
1-1
: Add NVIDIA copyright header.
Required by repo guidelines for Python sources.

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
```

tensorrt_llm/_torch/pyexecutor/sampler.py (3)
1-6
: Add NVIDIA header and ensure Python 3.8+ typing compatibility.
- Missing required NVIDIA copyright header.
- File uses PEP 604 unions (|) and builtin generics (list[int], tuple[...]); add from future import annotations to keep runtime compatibility with Python 3.8.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +from __future__ import annotations
169-185
: NameError when top_k == 0: indices may be undefined.

Initialize indices = None before the conditional; return it safely.

Apply:
Apply:
- # get first top_k logits of each sample and their indices - if top_k > 0: + # get first top_k logits of each sample and their indices + indices = None + if top_k > 0: values, indices = torch.topk(logits, top_k, dim=-1) min_values = values[:, -1].unsqueeze(-1).expand(batch_size, vocab_size) @@ - next_tokens = torch.multinomial(softmax, num_samples=1, - generator=generator).squeeze(-1) + next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1) return next_tokens, softmax, indices
269-279
: Return top indices after applying the top-p mask (correctness of “top logprobs”).Currently, indices are taken before masking by top-p; this can surface masked tokens (-inf logit) in top candidates. Recompute indices after masking.
Apply:
logits = logits.masked_fill(indices_to_remove, float('-inf')) @@ - next_tokens = torch.multinomial(softmax, num_samples=1, - generator=generator).squeeze(-1) - return next_tokens, softmax, indices + next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1) + # Recompute top-k indices on masked logits so they align with the final distribution + _, indices = torch.topk(logits, top_k, dim=-1) + return next_tokens, softmax, indices
♻️ Duplicate comments (4)
tests/unittest/bindings/test_executor_bindings.py (1)
910-926
: Round-trip pickle requires 8-field state; keep back-compat for 7-field pickles.
The test is correct; update setstate to accept both sizes so older pickles still load.Apply in cpp/tensorrt_llm/pybind/executor/request.cpp (see suggested diff in that file’s comments).
cpp/tensorrt_llm/pybind/executor/request.cpp (2)
209-215
: Fix ctor signature: last param must be std::optional.
Passing py::none() to SizeType32 causes a compile-time error and is the likely CI failure.Apply:
- .def(py::init<bool, bool, bool, bool, bool, bool, std::optional<std::vector<tle::AdditionalModelOutput>>, - SizeType32>(), + .def(py::init<bool, bool, bool, bool, bool, bool, std::optional<std::vector<tle::AdditionalModelOutput>>, + std::optional<SizeType32>>(), ... - py::arg("additional_model_outputs") = py::none(), py::arg("top_logprobs") = py::none()) + py::arg("additional_model_outputs") = py::none(), py::arg("top_logprobs") = py::none())
199-207
: Back-compat: accept both 7- and 8-field OutputConfig pickle states.
Older pickles (without top_logprobs) should still unpickle. Treat missing as std::nullopt.Apply:
- auto outputConfigSetstate = [](py::tuple const& state) - { - if (state.size() != 8) - { - throw std::runtime_error("Invalid OutputConfig state!"); - } - return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), - state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), - state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(), - state[7].cast<std::optional<SizeType32>>()); - }; + auto outputConfigSetstate = [](py::tuple const& state) + { + if (state.size() != 7 && state.size() != 8) + { + throw std::runtime_error("Invalid OutputConfig state!"); + } + auto additional = state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(); + std::optional<SizeType32> top = (state.size() == 8) + ? state[7].cast<std::optional<SizeType32>>() + : std::optional<SizeType32>{std::nullopt}; + return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), + state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), additional, top); + };tensorrt_llm/_torch/pyexecutor/sampler.py (1)
809-829
: Greedy path breaks with top_logprobs > 0 (indices is None).When strategy == GREEDY, indices is None; later you slice with num_top_tokens > 0 causing shape mismatch and wrong log_probs. Compute top-k indices from logits for greedy.
Apply:
```diff
             if batched_next_tokens is None:
                 logits = self._apply_embedding_bias(logits, [req])
-                next_tokens, softmax, indices = sample(strategy, logits,
-                                                       generator)
+                next_tokens, softmax, indices = sample(strategy, logits, generator)
             else:
                 # Batched processing already applied bias, just use the results
                 next_tokens = batched_next_tokens[input_slice]
                 softmax = batched_softmax[input_slice]
-                indices = batched_indices[
-                    input_slice] if strategy != GREEDY else None
+                indices = batched_indices[input_slice] if strategy != GREEDY else None
+
+            # If greedy and top-logprobs requested, derive top indices from logits
+            if num_top_tokens > 0 and indices is None:
+                indices = torch.topk(logits, num_top_tokens, dim=-1).indices
 @@
-            if request.calculate_top_logprobs and strategy != GREEDY:
+            if num_top_tokens > 0:
                 # concatenate the sampled tokens with the top tokens
                 next_tokens = torch.cat(
-                    [next_tokens.unsqueeze(1), indices[:, :num_top_tokens]],
+                    [next_tokens.unsqueeze(1), indices[:, :num_top_tokens]],
                     dim=1)
             else:
                 next_tokens = next_tokens.unsqueeze(1)
```
🧹 Nitpick comments (14)
tensorrt_llm/sampling_params.py (1)
176-176: Docstring default mismatch for top_logprobs.
The field default is None; update the docs accordingly. Apply:

```diff
-        top_logprobs (int, optional): Number of top log probabilities to return per output token. Defaults to 0.
+        top_logprobs (int, optional): Number of top log probabilities to return per output token. Defaults to None.
```

tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
69-76: Remove the unused variable to satisfy Ruff F841.
No need to bind the instance in the negative path. Apply:

```diff
-    with pytest.raises(ValueError, match="top_logprobs.*"):
-        sampling_params = SamplingParams(max_tokens=max_tokens,
-                                         logprobs=True,
-                                         top_k=top_k,
-                                         top_p=top_p,
-                                         temperature=1.0,
-                                         top_logprobs=top_logprobs)
+    with pytest.raises(ValueError, match="top_logprobs.*"):
+        SamplingParams(max_tokens=max_tokens,
+                       logprobs=True,
+                       top_k=top_k,
+                       top_p=top_p,
+                       temperature=1.0,
+                       top_logprobs=top_logprobs)
```
91-96: Remove the second unused variable to satisfy Ruff F841.
Same pattern in another negative case. Apply:

```diff
-    sampling_params = SamplingParams(max_tokens=max_tokens,
-                                     logprobs=False,
-                                     top_k=top_k,
-                                     top_p=top_p,
-                                     temperature=1.0,
-                                     top_logprobs=top_logprobs)
+    SamplingParams(max_tokens=max_tokens,
+                   logprobs=False,
+                   top_k=top_k,
+                   top_p=top_p,
+                   temperature=1.0,
+                   top_logprobs=top_logprobs)
```

tensorrt_llm/_torch/pyexecutor/sampler.py (11)
33-34: Document shapes for the new optional tensors.
Add concise shape notes for top_tokens/top_log_probs to aid maintainability and avoid ambiguity.
Apply:

```diff
 @dataclass(kw_only=True)
 class SampleStateTensors:
     new_tokens: torch.Tensor
     log_probs: torch.Tensor | None = None
-    top_tokens: torch.Tensor | None = None
-    top_log_probs: torch.Tensor | None = None
+    top_tokens: torch.Tensor | None = None  # Shape: (max_tokens, max_num_sequences, MAX_BEAM_WIDTH, max_top_logprobs)
+    top_log_probs: torch.Tensor | None = None  # Shape matches top_tokens
```
157-162: Fix the docstring: the returned next_tokens shape is [batch_size], not [batch_size, 1]; clarify indices. Apply:

```diff
-    Returns:
-        next_tokens: [batch_size, 1] (sampled tokens)
+    Returns:
+        next_tokens: [batch_size] (sampled tokens)
         softmax: [batch_size, vocab_size] (probability distribution)
-        indices: [batch_size, top_k] (indices of top_k logits)
+        indices: None or [batch_size, top_k] (indices of top_k logits, None if top_k <= 0)
```
191-196: Fix the docstring: top_p returns sorted indices over the vocabulary. Apply:

```diff
-    Returns:
-        next_tokens: [batch_size, 1] (sampled tokens)
+    Returns:
+        next_tokens: [batch_size] (sampled tokens)
         softmax: [batch_size, vocab_size] (probability distribution)
-        indices: [batch_size, top_k] (indices of top_k logits)
+        indices: [batch_size, vocab_size] (sorted token indices by logit desc, before masking)
```
283-291: The greedy-path docstring is fine; ensure consistency with the other functions (next_tokens shape).
Consider mirroring the wording used in the other functions: "next_tokens: [batch_size]".
405-416: Validate that Args.max_top_logprobs is non-negative and small enough to fit NEW_TOKENS_SHAPE.
Avoid silent misconfigurations.
Apply:

```diff
     def __init__(self, args: Args):
+        assert args.max_top_logprobs >= 0, "max_top_logprobs must be >= 0"
         self.max_seq_len = args.max_seq_len
         self.enable_mixed_sampler = args.enable_mixed_sampler
         self.max_tokens = args.max_draft_len + 1
         self.max_top_logprobs = args.max_top_logprobs
```
416-419: Nice: the comment explains the last dimension. Add a brief example to solidify the intent. Apply:

```diff
-        # The last dimension contains the sampled tokens and additionally top tokens if top_logprobs is specified
+        # The last dimension contains the sampled token (slot 0) and, if requested, the top-k tokens (slots 1..k).
```
626-635: Fix the overlong docstring line and clarify the shapes (ruff E501). Apply:

```diff
-    """Shape: max_num_sequences, max_beam_width == 1, max_tokens, 1 + max_top_logprobs)
-    The last dimension contains log_probs for the sampled tokens and additionally log_probs for the top tokens if top_logprobs is specified"""
+    """Shape:
+    (max_num_sequences, max_beam_width == 1, max_tokens, 1 + max_top_logprobs)
+
+    The last dim packs: [sampled_token_logprob, top1_logprob, ..., topK_logprob] when requested.
+    """
```
797-803: Typo in a comment. Apply:

```diff
-        # otherwise, it is None and should be defaulted to0
+        # otherwise, it is None and should default to 0
```
491-509: Minor: return ranks starting at 1 for the top tokens, for consistency with the sampled token's rank.
Currently the sampled token uses rank=1 while the top-k entries use rank=k (0-based). Consider 1-based ranks for consistency (a toy illustration follows the diff).
Apply:

```diff
-                Logprob(logprob=logprob[k], rank=k)
+                Logprob(logprob=logprob[k], rank=k + 1)
```
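A toy illustration of the 1-based ranking this suggests (Logprob is stood in by a namedtuple here; the real class lives in tensorrt_llm/executor/result.py):

```python
from collections import namedtuple

Logprob = namedtuple("Logprob", ["logprob", "rank"])

top_token_ids = [42, 7, 99]        # hypothetical top-3 tokens for one step
top_logprobs = [-0.1, -1.3, -2.7]  # their log probabilities, best first

# rank=k + 1 keeps the top entries consistent with the sampled token's rank=1.
token_top_log_probs = {
    tok: Logprob(logprob=lp, rank=k + 1)
    for k, (tok, lp) in enumerate(zip(top_token_ids, top_logprobs))
}
assert token_top_log_probs[42].rank == 1
```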
354-366: Add an explicit default-case guard to sample().
Optional: raise for unknown strategy variants to fail fast.
Apply:

```diff
         case ("greedy", None):
             return greedy_search_sampling_batch(logits)
+        case _:
+            raise ValueError(f"Unsupported sampling strategy: {strategy}")
```
154-186: Optional: add temperature handling to top_k_sampling for parity with top_p/top_k_top_p.
If needed, add temperature scaling for top-k as in the other functions; a hedged sketch follows.
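A hedged sketch of what such temperature handling could look like (mirroring the other helpers; the function name and signature are illustrative, not the file's current code):

```python
import torch


def top_k_with_temperature(logits: torch.Tensor, top_k: int, temperature: float):
    # Scale logits by temperature before restricting to the top-k candidates.
    logits = logits / max(temperature, 1e-5)
    values, indices = torch.topk(logits, top_k, dim=-1)
    # Mask everything outside the top-k, then sample from the renormalized softmax.
    masked = torch.full_like(logits, float("-inf")).scatter(-1, indices, values)
    softmax = torch.softmax(masked, dim=-1)
    next_tokens = torch.multinomial(softmax, num_samples=1).squeeze(-1)
    return next_tokens, softmax, indices
```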
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (10 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (12 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a30.yml (1 hunks)
- tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1 hunks)
- tests/unittest/bindings/test_executor_bindings.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/integration/test_lists/test-db/l0_a30.yml
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/executor/result.py
- cpp/tensorrt_llm/nanobind/executor/request.cpp
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/executor/outputConfig.cpp
- tensorrt_llm/_torch/pyexecutor/llm_request.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/bindings/test_executor_bindings.py
tensorrt_llm/sampling_params.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tests/unittest/bindings/test_executor_bindings.py
cpp/tensorrt_llm/pybind/executor/request.cpp
tensorrt_llm/sampling_params.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
🧬 Code graph analysis (5)
tests/unittest/bindings/test_executor_bindings.py (3)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (2)
OutputConfig
(23-35)AdditionalModelOutput
(37-41)tensorrt_llm/sampling_params.py (1)
AdditionalModelOutput
(112-121)
cpp/tensorrt_llm/pybind/executor/request.cpp (2)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (1)
OutputConfig
(23-35)
tensorrt_llm/sampling_params.py (3)
tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
top_k
(221-221)tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_beam_search
(519-520)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-508)tensorrt_llm/llmapi/llm_args.py (2)
CudaGraphConfig
(108-165)KvCacheConfig
(946-1077)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
top_log_probs
(250-251)calculate_top_logprobs
(422-423)LlmRequest
(306-454)log_probs
(246-247)append_log_probs
(195-206)tensorrt_llm/executor/result.py (1)
Logprob
(37-40)
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
896-896: Avoid equality comparisons to True
; use config.return_log_probs:
for truth checks
Replace with config.return_log_probs
(E712)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
91-91: Local variable sampling_params
is assigned to but never used
Remove assignment to unused variable sampling_params
(F841)
tensorrt_llm/_torch/pyexecutor/sampler.py
628-628: Line too long (146 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/bindings/test_executor_bindings.py (1)
892-892
: Good: default top_logprobs is None is covered.
This verifies the new attribute presence and default.cpp/tensorrt_llm/pybind/executor/request.cpp (2)
191-196
: LGTM: include topLogProbs in getstate.
Serialization now emits the 8th element.
221-223
: LGTM: expose top_logprobs attribute.
The binding name matches Python-side expectations and tests.tensorrt_llm/sampling_params.py (1)
328-347
: Validation logic for top_logprobs looks correct.
Covers logprobs dependency, bounds, greedy/top-k/beam-search incompatibilities.tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1)
1-1
: Add NVIDIA copyright header.
Tests are Python and must include the standard header.Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import os⛔ Skipped due to learnings
Learnt from: CR PR: NVIDIA/TensorRT-LLM#0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-08-25T22:42:47.612Z Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Learnt from: galagam PR: NVIDIA/TensorRT-LLM#6487 File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12 Timestamp: 2025-08-06T13:58:07.506Z Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
tensorrt_llm/_torch/pyexecutor/sampler.py (4)
648-666
: Guard construction of top_ on per-request need, not only global capability.*Currently gated by self.calculate_top_logprobs (global); safe, but per-request shaping is handled later via slicing. No action needed; just a heads-up.
833-841
: After greedy fix, gather/copy shapes align; before fix it could mis-shape.Good use of gather on expanded next_tokens; nothing to change once greedy is fixed.
430-432
: Nice property. Consider exposing max_top_logprobs==0 early to skip extra host tensors.No change required; optimization is already applied via calculate_top_logprobs.
626-635
: Action: Make PEP‑604 annotations safe — either require Python ≥3.10 or addfrom __future__ import annotations
.Scan shows tensorrt_llm/_torch/pyexecutor/sampler.py (and many other files) use
X | Y
/ builtin generics in type annotations but lackfrom __future__ import annotations
. Either pin the repo/CI to Python ≥3.10 or add the future import to affected files (start with tensorrt_llm/_torch/pyexecutor/sampler.py).
PR_Github #16805 [ run ] completed with state |
d6e9f5e
to
504fcf6
Compare
/bot run |
PR_Github #16817 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
491-509
: Guard against None and slice only requested K to avoid OOB/garbage reads.This can dereference
None
when the sampler’smax_top_logprobs==0
or when request K==0; it also reads all preallocated columns, not the requested K.Apply:
def _maybe_get_top_logprobs(self, request: LlmRequest, state: SampleState, *, beam: int, count: int): """Return top_logprobs for the current request if requested, else return None.""" - - if request.calculate_top_logprobs: - current_slice = slice(0, count), request.py_seq_slot, beam - top_log_probs = state.host.top_log_probs[request.py_seq_slot, - beam, :count] - top_current_tokens = state.host.top_tokens[current_slice] - token_top_log_probs = [{ - int(token[k]): - Logprob(logprob=logprob[k], rank=k) - for k in range(request.py_top_logprobs) - } for token, logprob in zip(top_current_tokens, - top_log_probs.tolist())] - return [token_top_log_probs] - else: - return None + k = request.py_top_logprobs or 0 + if k <= 0: + return None + if state.host.top_log_probs is None or state.host.top_tokens is None: + return None + current_slice = (slice(0, count), request.py_seq_slot, beam) + # Restrict to the requested K + top_log_probs = state.host.top_log_probs[request.py_seq_slot, beam, :count, :k] + top_current_tokens = state.host.top_tokens[current_slice][:, :k] + token_top_log_probs = [ + {int(token[idx]): Logprob(logprob=row[idx], rank=idx) + for idx in range(len(token))} + for token, row in zip(top_current_tokens, top_log_probs.tolist()) + ] + return [token_top_log_probs]
♻️ Duplicate comments (4)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (3)
1-1
: Add the standard NVIDIA header.Required by repo policy; place it above imports.
+ # Copyright (c) 2025, NVIDIA CORPORATION. + # + # import os
69-76
: Remove unused variable to satisfy Ruff F841.- with pytest.raises(ValueError, match="top_logprobs.*"): - sampling_params = SamplingParams(max_tokens=max_tokens, - logprobs=True, - top_k=top_k, - top_p=top_p, - temperature=1.0, - top_logprobs=top_logprobs) + with pytest.raises(ValueError, match="top_logprobs.*"): + SamplingParams(max_tokens=max_tokens, + logprobs=True, + top_k=top_k, + top_p=top_p, + temperature=1.0, + top_logprobs=top_logprobs)
91-96
: Remove unused variable to satisfy Ruff F841 (second occurrence).- sampling_params = SamplingParams(max_tokens=max_tokens, - logprobs=False, - top_k=top_k, - top_p=top_p, - temperature=1.0, - top_logprobs=top_logprobs) + SamplingParams(max_tokens=max_tokens, + logprobs=False, + top_k=top_k, + top_p=top_p, + temperature=1.0, + top_logprobs=top_logprobs)tensorrt_llm/_torch/pyexecutor/sampler.py (1)
154-162
: Fix docstrings: shapes and semantics.
next_tokens
is 1D (squeeze(-1)
), not[batch, 1]
.- Clarify
indices
meaning for top-p paths (sorted full list).Apply:
@@ def top_k_sampling_batch(... - next_tokens: [batch_size, 1] (sampled tokens) + next_tokens: [batch_size] (sampled tokens) @@ def top_p_sampling_batch(... - next_tokens: [batch_size, 1] (sampled tokens) + next_tokens: [batch_size] (sampled tokens) - indices: [batch_size, top_k] (indices of top_k logits) + indices: [batch_size, vocab_size] (indices sorted by descending logit; first N cover top-p mass) @@ def greedy_search_sampling_batch(... - next_tokens: [batch_size, 1] (sampled tokens) + next_tokens: [batch_size] (sampled tokens)Also applies to: 187-196, 282-292
🧹 Nitpick comments (8)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (3)
38-38
: Fix misleading comment on GPU arch.Decorator forces Ampere; comment mentions H100.
-@force_ampere # Save H100 resource +@force_ampere # Save Ampere GPU resources
80-80
: Rename test for clarity/grammar.-def test_generate_with_top_logprobs_and_disabler_logprobs( +def test_generate_with_top_logprobs_with_logprobs_disabled(
97-97
: Add a negative test for exceeding max_top_logprobs (LLM-level cap).Covers the backend constraint (fixture sets max_top_logprobs=4).
def test_top_logprobs_exceeds_llm_cap(llm_torch, input_prompts): with pytest.raises(ValueError): sp = SamplingParams(max_tokens=4, logprobs=True, top_logprobs=8) # Expect backend/runner to reject > max_top_logprobs for _ in llm_torch.generate(input_prompts, sampling_params=sp): passWant me to push a patch with all suggested edits?
tests/unittest/bindings/test_executor_bindings.py (1)
896-907
: Strengthen assertions for type and pickle stability.Add a minimal type check to ensure
top_logprobs
remains anint
across pickle round-trips.Apply:
@@ - assert config.top_logprobs == 10 + assert isinstance(config.top_logprobs, int) + assert config.top_logprobs == 10 @@ - assert config_copy.top_logprobs == 10 + assert isinstance(config_copy.top_logprobs, int) + assert config_copy.top_logprobs == 10Also applies to: 910-926
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
327-328
: Include TopKTopP in the Strategy type hint.Keeps annotations consistent with actual usage in
request_strategy
.Apply:
-Strategy = TopK | TopP | Greedy +Strategy = TopK | TopP | TopKTopP | Greedy
799-803
: Nit: typo in comment.“defaulted to0” -> “defaulted to 0”.
Apply:
- # otherwise, it is None and should be defaulted to0 + # otherwise, it is None and should be defaulted to 0tensorrt_llm/sampling_params.py (2)
328-339
: Clarify validation message and align semantics with int-valuedlogprobs
.Message “set logprobs to True” is misleading since
logprobs
is numeric.Apply:
- if not self.logprobs: - raise ValueError("You need to set logprobs to True to use top_logprobs.") + if not self.logprobs: + raise ValueError("You need to set logprobs >= 1 to use top_logprobs.")
345-347
: Explicitly disallow with beam search (good), consider adding actionable hint.Optional: include “use TRTLLM sampler path instead” or “set
use_beam_search=False
” to guide users.Apply:
- if self.use_beam_search: - raise ValueError("top_logprobs + beam search is not supported") + if self.use_beam_search: + raise ValueError("top_logprobs + beam search is not supported; set use_beam_search=False.")
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
cpp/include/tensorrt_llm/executor/executor.h
(2 hunks)cpp/tensorrt_llm/executor/outputConfig.cpp
(1 hunks)cpp/tensorrt_llm/nanobind/executor/request.cpp
(1 hunks)cpp/tensorrt_llm/pybind/executor/request.cpp
(1 hunks)examples/llm-api/quickstart_advanced.py
(5 hunks)tensorrt_llm/_torch/pyexecutor/_util.py
(3 hunks)tensorrt_llm/_torch/pyexecutor/config.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/llm_request.py
(10 hunks)tensorrt_llm/_torch/pyexecutor/sampler.py
(12 hunks)tensorrt_llm/executor/result.py
(2 hunks)tensorrt_llm/llmapi/llm_args.py
(2 hunks)tensorrt_llm/sampling_params.py
(3 hunks)tests/integration/test_lists/test-db/l0_a30.yml
(1 hunks)tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
(1 hunks)tests/unittest/bindings/test_executor_bindings.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/config.py
- cpp/tensorrt_llm/pybind/executor/request.cpp
- tensorrt_llm/executor/result.py
- cpp/tensorrt_llm/nanobind/executor/request.cpp
- tests/integration/test_lists/test-db/l0_a30.yml
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/executor/outputConfig.cpp
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🧬 Code graph analysis (6)
tests/unittest/bindings/test_executor_bindings.py (3)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (2)
OutputConfig
(23-35)AdditionalModelOutput
(37-41)tensorrt_llm/sampling_params.py (1)
AdditionalModelOutput
(112-121)
tensorrt_llm/sampling_params.py (3)
tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
top_k
(221-221)tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_beam_search
(519-520)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-508)tensorrt_llm/llmapi/llm_args.py (2)
CudaGraphConfig
(108-165)KvCacheConfig
(946-1077)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
beam_width
(70-73)calculate_top_logprobs
(430-431)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
max_seq_len
(558-568)max_seq_len
(571-575)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
top_log_probs
(250-251)calculate_top_logprobs
(422-423)LlmRequest
(306-454)log_probs
(246-247)append_log_probs
(195-206)tensorrt_llm/executor/result.py (1)
Logprob
(37-40)
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
896-896: Avoid equality comparisons to True
; use config.return_log_probs:
for truth checks
Replace with config.return_log_probs
(E712)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
91-91: Local variable sampling_params
is assigned to but never used
Remove assignment to unused variable sampling_params
(F841)
tensorrt_llm/_torch/pyexecutor/sampler.py
628-628: Line too long (146 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/bindings/test_executor_bindings.py (1)
892-893
: LGTM: default is None as expected.Asserting the default
OutputConfig().top_logprobs is None
is correct and matches the new optional semantics.tensorrt_llm/_torch/pyexecutor/_util.py (1)
662-664
: LGTM: correctly wiresmax_top_logprobs
from backend config.Propagation path looks correct.
tensorrt_llm/sampling_params.py (1)
176-176
: Docstring default matches Optional semantics.“Defaults to None” aligns with the new field type.
```diff
 def create_torch_sampler_args(executor_config: ExecutorConfig, mapping: Mapping,
-                              *, max_seq_len: int, enable_mixed_sampler: bool):
+                              *, max_seq_len: int, enable_mixed_sampler: bool,
+                              max_top_logprobs: int):
     max_num_sequences = executor_config.max_batch_size * mapping.pp_size
```
🛠️ Refactor suggestion
Validate max_top_logprobs (non-negative) at construction.
Prevents accidental negative values propagating to buffer shapes.
Apply:
def create_torch_sampler_args(executor_config: ExecutorConfig, mapping: Mapping,
- *, max_seq_len: int, enable_mixed_sampler: bool,
- max_top_logprobs: int):
+ *, max_seq_len: int, enable_mixed_sampler: bool,
+ max_top_logprobs: int):
+ if max_top_logprobs is None:
+ max_top_logprobs = 0
+ if max_top_logprobs < 0:
+ raise ValueError("max_top_logprobs must be >= 0")
@@
max_beam_width=executor_config.max_beam_width,
max_top_logprobs=max_top_logprobs,
enable_mixed_sampler=enable_mixed_sampler,
Also applies to: 649-649
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/_util.py around lines 638-641 (and also at
line 649), validate that the parameter max_top_logprobs is non-negative when
constructing the sampler args: if max_top_logprobs is negative raise a
ValueError (with a clear message including the invalid value) before any buffer
shape calculations, and ensure downstream code uses the validated value; update
both places where max_top_logprobs is accepted to perform the same check.
```python
    return LLM(
        model=os.path.join(llm_models_root(), "llama-models-v2",
                           "TinyLlama-1.1B-Chat-v1.0"),
        kv_cache_config=KvCacheConfig(max_tokens=10000),
        max_batch_size=len(
            input_prompts
        ),  # use small batch size to prevent large buffers from possibly hiding wrong data accesses.
        max_seq_len=32,
        disable_overlap_scheduler=False,
        sampler_type="TorchSampler",
        cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 4, 8],
                                          enable_padding=True),
        enable_mixed_sampler=True,
        max_top_logprobs=4,
    )
```
🛠️ Refactor suggestion
Make tests robust to missing local model assets (skip when absent).
Prevents CI flakes when TinyLlama weights are not present.
- return LLM(
- model=os.path.join(llm_models_root(), "llama-models-v2",
- "TinyLlama-1.1B-Chat-v1.0"),
+ model_path = os.path.join(llm_models_root(), "llama-models-v2",
+ "TinyLlama-1.1B-Chat-v1.0")
+ if not os.path.isdir(model_path):
+ pytest.skip(f"Model not found: {model_path}")
+ return LLM(
+ model=model_path,
kv_cache_config=KvCacheConfig(max_tokens=10000),
max_batch_size=len(
input_prompts
), # use small batch size to prevent large buffers from possibly hiding wrong data accesses.
max_seq_len=32,
disable_overlap_scheduler=False,
sampler_type="TorchSampler",
cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 4, 8],
enable_padding=True),
enable_mixed_sampler=True,
max_top_logprobs=4,
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    model_path = os.path.join(llm_models_root(), "llama-models-v2",
                              "TinyLlama-1.1B-Chat-v1.0")
    if not os.path.isdir(model_path):
        pytest.skip(f"Model not found: {model_path}")
    return LLM(
        model=model_path,
        kv_cache_config=KvCacheConfig(max_tokens=10000),
        max_batch_size=len(
            input_prompts
        ),  # use small batch size to prevent large buffers from possibly hiding wrong data accesses.
        max_seq_len=32,
        disable_overlap_scheduler=False,
        sampler_type="TorchSampler",
        cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 4, 8],
                                          enable_padding=True),
        enable_mixed_sampler=True,
        max_top_logprobs=4,
    )
```
🤖 Prompt for AI Agents
In tests/unittest/_torch/sampler/test_return_top_k_logprobs.py around lines 21
to 35, the test constructs an LLM pointing to local TinyLlama weights which may
be absent in CI; add a pre-check for the model directory/file using
os.path.exists(...) and call pytest.skip("TinyLlama model not available locally
- skipping test") when missing, and ensure pytest and os are imported at top of
the file; then only construct and return the LLM if the path exists so the test
is skipped gracefully instead of failing.
686cc8b to 6cf9dc4 (Compare)
/bot run
PR_Github #18767 [ run ] triggered by Bot
PR_Github #18763 [ run ] completed with state
PR_Github #18767 [ run ] completed with state
6cf9dc4 to 337b137 (Compare)
/bot run
PR_Github #18786 [ run ] triggered by Bot
PR_Github #18786 [ run ] completed with state
…ties
- Added a new parameter `topLogProbs` to the `OutputConfig` class, allowing users to specify the number of top log probabilities to return.
- Updated related methods and bindings in the executor and API layers to accommodate this feature.
- Updated code in the Torch sampler to provide the functionality.
- Added a new dict to LogProbStorage containing the top logprobs.
- Included validation checks for top log probabilities in the sampling parameters.
- Added tests to ensure correct functionality.
Signed-off-by: Stefan Niebler <[email protected]>

…ptional top log probabilities
- Changed the `topLogProbs` parameter in `OutputConfig` to be optional, allowing for more flexible configurations.
- Updated related methods and bindings in the executor and API layers to handle the new optional type.
- Enhanced validation checks in the sampling parameters to ensure correct usage of `top_logprobs`.
- Adjusted tests to accommodate the changes in parameter handling and validation.
Signed-off-by: Stefan Niebler <[email protected]>

- Updated `OutputConfig` to use `std::optional<SizeType32>` for the `top_logprobs` parameter.
- Adjusted the docstring in `sampling_params.py` to correctly state the default of `top_logprobs` as None.
- Modified `quickstart_advanced.py` to ensure correct handling of `top_k` conditions.
- Clarified documentation in `sampler.py` regarding indices after applying top-k and top-p masks.
Signed-off-by: Stefan Niebler <[email protected]>

…robs changes
- Correctly set the max_top_logprobs parameter in create_autodeploy_executor.
- Updated DemoEngine to correctly unpack top_k and greedy_search.
Signed-off-by: Stefan Niebler <[email protected]>

…ocstring and updated yaml files for api stability testing
- Updated the test cases in `test_llm_api.py`, `completion_output.yaml`, `llm.yaml`, and `sampling_params.yaml` to include the new `top_logprobs` parameter and its associated properties.
Signed-off-by: Stefan Niebler <[email protected]>

- Updated `TorchStore` and `TorchSampler` to include the `max_top_logprobs` parameter for managing top log probabilities.
- Modified methods to handle top log probabilities during sampling and logging.
- Added new tensors to `TorchStore` containing data for top logprobs.
- Adjusted tests to validate the correct functionality of top log probabilities in various scenarios.
Signed-off-by: Stefan Niebler <[email protected]>

…rent LLM-API
- Removed the top_logprobs parameter from SamplingParams and related classes to streamline log probability management.
- Changed _get_output_config to explicitly pass logprobs as top-logprobs to the OutputConfig object.
- Updated the setup_llm function to conditionally set logprobs based on input arguments.
- Adjusted CompletionOutput and related tests to reflect the removal of top_logprobs.
- Enhanced validation checks to ensure correct usage of logprobs in sampling scenarios.
Signed-off-by: Stefan Niebler <[email protected]>

- Adjusted the validation logic in SamplingParams to ensure that logprobs must not exceed 1 when greedy sampling is enabled.
- Updated the error message for clarity regarding the constraints on logprobs in greedy sampling scenarios.
Signed-off-by: Stefan Niebler <[email protected]>

…obs of 1
- Modified the return statement in `greedy_search_sampling_batch` to include `next_tokens` as an unsqueezed tensor, ensuring compatibility with downstream processing.
- Added the `max_top_logprobs` parameter to the configuration in the test file to ensure correct execution.
Signed-off-by: Stefan Niebler <[email protected]>

- Updated the test_embedding_bias_with_torch_sampler_strategies cases to correctly handle top logprobs of 1 in the parameterization.
- Enhanced the validation logic in test_return_top_k_logprobs to accommodate the new conditions for valid setups in sampling tests (greedy sampling).
Signed-off-by: Stefan Niebler <[email protected]>
337b137 to 31bc186 (Compare)
/bot run --disable-fail-fast
PR_Github #18946 [ run ] triggered by Bot
PR_Github #18946 [ run ] completed with state
[TRTLLM-5670][feat] Support top log probabilities
- Added a new parameter topLogProbs to the OutputConfig class, allowing users to specify the number of top log probabilities to return.
- Change: Changed the API to conform better with the current API.
Summary by CodeRabbit
New Features
Documentation
Tests
Description
Allow the user to use the top_logprobs parameter to request, at each generation step, the logprobs of the top <top_logprobs> tokens in addition to the logprob of the sampled token.
Currently, only functionality has been tested. A hedged usage sketch is shown below.
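The sketch below is assembled from this PR's own tests and quickstart changes. The parameter surface moved during review (later commits derive the top count from the numeric logprobs value), so treat top_logprobs and max_top_logprobs as the original proposal, not necessarily the final API.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any small model works
    max_top_logprobs=4,        # engine-wide cap proposed by this PR
    enable_mixed_sampler=True,
)

params = SamplingParams(
    max_tokens=8,
    logprobs=True,     # top logprobs require regular logprobs
    top_k=8,
    top_p=0.9,
    temperature=1.0,
    top_logprobs=3,    # per-request count, must not exceed max_top_logprobs
)

for result in llm.generate(["The capital of France is"], sampling_params=params):
    completion = result.outputs[0]
    # Expected: one dict per generated token, mapping token_id -> Logprob(logprob, rank).
    print(completion.top_logprobs)
```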
Performance Tests
I ran performance checks on main and on this branch for greedy and top-k sampling with logprobs=false/true to see how the changes impact workflows where top-logprobs are not enabled.
Number of requests: 100
Number of concurrent requests: 55.9995
Average Input Length (tokens): 40.0000
Average Output Length (tokens): 2009.0000
TP Size: 1
PP Size: 1
EP Size: None
Max Runtime Batch Size: 64
Max Runtime Tokens: 131072
Scheduling Policy: GUARANTEED_NO_EVICT
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 3.9227E+17
This branch (needs an update since the latest changes)
main branch
Test Coverage
New test: test_return_top_k_logprobs.py
Checks that the shape of the results is correct, and additionally ensures that some failure cases raise the correct error. A hypothetical sketch of the kind of assertion involved is shown below.
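The helper below is hypothetical (names and attributes are illustrative); it only conveys the shape invariant the test is described as checking.

```python
def check_top_logprobs_shape(completion, expected_k, max_tokens):
    # One top-logprobs entry per generated token is expected.
    assert len(completion.top_logprobs) == max_tokens
    for per_token in completion.top_logprobs:
        # Each entry is assumed to map token_id -> Logprob(logprob, rank).
        assert len(per_token) == expected_k
```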
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.

skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.