
Conversation

venkywonka (Collaborator) commented Sep 6, 2025

Summary by CodeRabbit

  • New Features
    • Unified logprobs support across backends, including prompt logprobs on PyTorch.
    • Automatic capture of required logits when requesting prompt logprobs.
    • Option to clear stored context logits to reduce memory after use.
    • Improved streaming behavior with consistent logprobs reporting.
  • Bug Fixes
    • More accurate and consistent logprobs accumulation and length handling across backends.
  • Tests
    • Expanded test coverage for logprobs across TensorRT and PyTorch backends, including streaming scenarios.

Description

Also fixes https://nvbugs/5516660.

Test Coverage

- test_llm_pytorch.py::test_llm_return_logprobs
- test_llm_pytorch.py::test_llm_return_logprobs_streaming

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
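
For example, to rerun only a single test stage without fail-fast (the stage name below is illustrative):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"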

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

coderabbitai bot (Contributor) commented Sep 6, 2025

📝 Walkthrough

Adds LlmResponse wrapper with clear_context_logits; updates LlmRequest.create_response to return LlmResponse. Refactors logprobs handling across backends: executor/result accumulation, worker computation paths for PyTorch vs TRT using logits. Adjusts LLM argument checks and sampling params to auto-enable context logits. Expands tests for multi-backend and streaming.
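
As a rough illustration of the PyTorch prompt-logprobs path: prompt logprobs can be derived from the captured context logits with a log-softmax over the vocabulary and a per-position top-k. The sketch below is hypothetical (the function and argument names are not the PR's actual helpers); it only assumes context logits of shape [prompt_len, vocab_size], where the logits at position i predict token i + 1.

```python
import torch
from typing import Dict, List


def prompt_logprobs_from_context_logits(context_logits: torch.Tensor,
                                        prompt_token_ids: List[int],
                                        top_k: int) -> List[Dict[int, float]]:
    """Sketch: top-k log-probabilities for each prompt position after the first."""
    logprobs = torch.log_softmax(context_logits.float(), dim=-1)
    results: List[Dict[int, float]] = []
    for pos in range(1, len(prompt_token_ids)):
        # Logits at position pos - 1 predict the token at position pos.
        row = logprobs[pos - 1]
        top_vals, top_ids = row.topk(top_k)
        entry = {int(t): float(v) for t, v in zip(top_ids, top_vals)}
        # Always include the actual prompt token, even if it falls outside top-k.
        entry.setdefault(prompt_token_ids[pos], float(row[prompt_token_ids[pos]]))
        results.append(entry)
    return results
```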

Changes

Cohort / File(s) Summary
PyTorch response wrapper and API
tensorrt_llm/_torch/pyexecutor/llm_request.py
Introduces LlmResponse.clear_context_logits; changes LlmRequest.create_response to return LlmResponse or None; adds docstring; wires Python-side result with serialized payload.
Executor logprobs handling
tensorrt_llm/executor/result.py
Refines prompt/generation logprobs handling for TRT and PyTorch backends; accumulates only new PyTorch logprobs; clamps to sequence length; updates compute_logprobs docstring.
Worker cross-backend logprobs
tensorrt_llm/executor/worker.py
Accepts Union[tllm.Response, LlmResponse]; adds _compute_pytorch_prompt_logprobs; branches: PyTorch computes prompt logprobs via context logits; TRT computes from context/generation logits; drops logits per flags.
LLM argument checks
tensorrt_llm/llmapi/llm.py
Makes is_gen_only optional (default False); removes PyTorch-specific prompt_logprobs prohibition; keeps other validations.
Sampling params auto-enable
tensorrt_llm/sampling_params.py
Imports logger; if PyTorch prompt_logprobs requested, auto-enables return_context_logits with info log.
Tests: multi-backend and streaming
tests/unittest/llmapi/test_llm.py, tests/unittest/llmapi/test_llm_pytorch.py
Generalizes harness to accept backend/tp/streaming; runs against TRT and PyTorch; integrates streaming assertions; adds new PyTorch tests.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Client
  participant LLM
  participant Executor
  participant Worker
  participant Sampler
  participant Response

  Client->>LLM: submit(prompt, sampling_params)
  LLM->>Executor: run request
  Executor->>Sampler: generate tokens
  Sampler-->>Executor: tokens (+ backend-provided logprobs?)
  Executor-->>LLM: serialized result + logits (optional)
  LLM-->>Response: LlmResponse(create_response)
  Note over Response: result.context_logits / generation_logits may be present

  LLM->>Worker: _get_logprobs(response, backend)

  alt PyTorch backend
    Worker->>Worker: _compute_pytorch_prompt_logprobs using context_logits
    opt streaming
      Worker->>Worker: cache/use prior prompt logprobs
    end
    Worker->>Response: clear_context_logits() (optional)
  else TRT backend
    Worker->>Worker: compute from context_logits and generation_logits
    Worker->>Response: drop logits per flags
  end

  Worker-->>LLM: LogProbsResult
  LLM-->>Client: outputs + logprobs
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check (⚠️ Warning): The PR body contains the repository template and a brief "also fixes" note plus a Test Coverage list, but the required Description section is not populated with a concise summary of what was changed, why, which files/subsystems are affected, or any behavioral/compatibility impact, so the description is largely incomplete for reviewers. Resolution: populate the Description section with a short summary of the implemented changes and rationale, list the main files or subsystems touched and any behavioral or compatibility impacts, confirm which tests exercise the changes, and mark the PR checklist items (documentation, CODEOWNERS, etc.) so reviewers have sufficient context.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 35.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The title follows the repository convention ([TRTLLM-7015][feat]) and concisely summarizes the primary change — enabling prompt_logprobs support in the PyTorch backend; it is specific, relevant to the changeset, and contains no noisy extras.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot changed the title from "[TRTLLM-7015] @coderabbitai title" to "[TRTLLM-7015] [feat] Enable prompt token logprobs across backends" on Sep 6, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
tensorrt_llm/sampling_params.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

This file is missing the standard header mandated by the repo guidelines.

Apply at file top:

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/llmapi/llm.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

Header is missing.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

Header is missing.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/executor/result.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

Header is missing.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/serve/postprocess_handlers.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025) as required by guidelines.

The file lacks the mandated header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/executor/worker.py (2)

1014-1021: Guard against error responses or missing result to avoid attribute errors.

Error responses may have no result; early-return to prevent None access.

     logprobs_result = None
     generation_result = worker._results.get(response.client_id, None)
 
     if not generation_result:
         return
+    # Skip when response has errors or no result payload yet
+    if hasattr(response, "has_error") and response.has_error():
+        return
+    if not getattr(response, "result", None):
+        return

1-1: Add NVIDIA Apache-2.0 header (2025) as required by guidelines.

The file lacks the mandated header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tests/unittest/llmapi/apps/_test_openai_completions.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Per coding guidelines, prepend the NVIDIA Apache-2.0 copyright header to all .py files.

Apply at file top:

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
tensorrt_llm/serve/openai_protocol.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Per guidelines, all source files must include the NVIDIA Apache-2.0 header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
tests/unittest/llmapi/test_llm.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Tests are also source files; add the standard header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
🧹 Nitpick comments (11)
tensorrt_llm/sampling_params.py (1)

452-456: Auto-enabling context logits for PyTorch: also flag as auto-enabled.

To keep behavior consistent with TRT (and make downstream “need to return” checks accurate), set the private auto-enabled flag when you flip return_context_logits.

Apply this diff:

-            if self.prompt_logprobs and not self.return_context_logits:
-                config_kwargs["return_context_logits"] = True
+            if self.prompt_logprobs and not self.return_context_logits:
+                config_kwargs["return_context_logits"] = True
+                # Mirror TRT flow: mark as internally enabled so consumers can safely drop them.
+                self._context_logits_auto_enabled = True

Please confirm worker logic drops context logits when prompt_logprobs is requested but return_context_logits was not explicitly set by the user.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

281-291: Good: targeted API to drop context logits.

Method is concise and safely nulls the Python-side storage.

Consider adding a symmetric clear_generation_logits() for parity with drop_generation_logits flows.
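
A possible shape for that symmetric method, purely as a sketch; the attribute names used here (`_py_result`, `_generation_logits`) are assumptions about the Python-side storage rather than the file's actual fields:

```python
def clear_generation_logits(self) -> None:
    """Release stored generation logits once logprobs have been extracted.

    Sketch only: assumes generation logits live on the same Python-side
    result object that clear_context_logits() nulls out.
    """
    py_result = getattr(self, "_py_result", None)
    if py_result is not None:
        py_result._generation_logits = None
```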

tests/unittest/llmapi/apps/_test_openai_chat.py (1)

187-243: Tighten assertions; drop noisy prints for CI hygiene.

  • The conditional print statements make test output noisy and aren’t actionable. Prefer silent acceptance with explicit assertions or a short skip/xfail note.
  • Keep the “accepts but may not return” behavior, but avoid printing.

Apply this diff to remove prints while preserving intent:

@@
-        print(
-            f"✓ Found prompt_logprobs with {len(choice.prompt_logprobs)} entries"
-        )
@@
-        print(
-            f"✓ Found prompt_logprobs in message with {len(message.prompt_logprobs)} entries"
-        )
@@
-        print(
-            f"Note: prompt_logprobs accepted in request but not yet included in response (expected limitation)"
-        )
+        # prompt_logprobs currently accepted but not returned (known limitation)
+        pass
tensorrt_llm/serve/postprocess_handlers.py (2)

60-63: Fix type hint to reflect accepted input shapes.

The function consumes ints/floats, dicts, sequences, and objects with a logprob attribute; List[float] is misleading.

Apply this diff:

-from typing import List, Literal, Optional, Tuple, Union
+from typing import List, Literal, Optional, Tuple, Union, Sequence, Any
@@
-def create_logprobs(token_ids: List[int],
-                    tokenizer: TransformersTokenizer,
-                    logprobs: List[float]) -> ChatCompletionLogProbs:
+def create_logprobs(token_ids: List[int],
+                    tokenizer: TransformersTokenizer,
+                    logprobs: Sequence[Any]) -> ChatCompletionLogProbs:

27-27: Fix typo in formatter pragma.

Comment should read “enable”.

-# yapf: enale
+# yapf: enable
tensorrt_llm/executor/worker.py (2)

1009-1013: Compress one-line docstring (D200).

Ruff warns; keep on a single line.

-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+    """Compute logprobs and prompt_logprobs, and clear logits if applicable."""

1022-1036: PyTorch branch: compute only when needed; keep cleanup but be defensive.

Looks good overall. Minor nit: keep hasattr checks before touching backend-specific cleanup to avoid surprises with mixed response types.

tests/unittest/llmapi/apps/_test_openai_completions.py (3)

357-404: Make the expectation explicit; avoid prints.

Prefer an assertion over print so the test fails if the response shape unexpectedly changes. Use getattr for forward-compat.

-    # Check if prompt_logprobs are in the response (for future compatibility)
-    if hasattr(choice,
-               'prompt_logprobs') and choice.prompt_logprobs is not None:
-        # Future: When prompt_logprobs are added to the response
-        assert len(
-            choice.prompt_logprobs) > 0, "prompt_logprobs should not be empty"
-        print(
-            f"✓ Found prompt_logprobs with {len(choice.prompt_logprobs)} entries"
-        )
-    else:
-        # Current state: prompt_logprobs are computed internally but not returned
-        # This test verifies the feature doesn't cause errors
-        print(
-            f"Note: prompt_logprobs accepted in request but not yet included in response (expected limitation)"
-        )
+    plp = getattr(choice, "prompt_logprobs", None)
+    # Current behavior: not returned; assert None to catch unexpected changes.
+    # When support lands, update the assertion to validate structure.
+    assert plp is None

406-452: Strengthen generation logprobs validation; keep backend logic.

Assert expected K for returned top_logprobs to catch regressions.

-    # Verify we got generation logprobs
-    assert choice.logprobs is not None, "Generation logprobs were requested but not returned"
+    # Verify we got generation logprobs with expected top-K (allow K or K+1 for sampled token)
+    assert choice.logprobs is not None, "Generation logprobs were requested but not returned"
+    k = generation_logprobs
+    assert len(choice.logprobs.top_logprobs) >= 1
+    assert len(choice.logprobs.top_logprobs[0]) in {k, k + 1}

454-481: Assert no context logits leaked when not requested.

This confirms the auto-enable/drop path behaves as intended.

     choice = completion.choices[0]
     assert len(choice.text) >= 1
+    # Should not surface context logits when explicitly disabled.
+    assert getattr(choice, "context_logits", None) is None
tests/unittest/llmapi/test_llm.py (1)

1879-1907: Backend matrix: confirm PyTorch returns context logits.

This param set includes (prompt_logprobs=2, return_context_logits=True, backend="pytorch"). If PyTorch still doesn't expose context logits, this will fail. Either keep the case only if supported, or mark xfail/skip dynamically.

 @pytest.mark.parametrize(
     "prompt_logprobs, logprobs, return_context_logits, return_generation_logits, backend",
     [
@@
-        (2, None, True, False, "pytorch"
-         ),  # prompt_logprobs with context_logits
+        pytest.param(2, None, True, False, "pytorch", marks=pytest.mark.xfail(reason="PyTorch context logits exposure")),

If PyTorch now supports returning context logits, keep as-is and remove xfail.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 74105a4 and 4aa47fb.

⛔ Files ignored due to path filters (7)
  • docs/source/blogs/media/tech_blog10_baseline_performance_detail.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_baseline_performance_overview.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_baseline_round_robin_strategy.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_context_wait_performance.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_dataset_token_distribution.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_full_strategy_performance.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_tps_ttft_pareto_curve.png is excluded by !**/*.png
📒 Files selected for processing (11)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/executor/result.py (1 hunks)
  • tensorrt_llm/executor/worker.py (3 hunks)
  • tensorrt_llm/llmapi/llm.py (1 hunks)
  • tensorrt_llm/sampling_params.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (4 hunks)
  • tensorrt_llm/serve/postprocess_handlers.py (1 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_chat.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_completions.py (1 hunks)
  • tests/unittest/llmapi/test_llm.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/serve/postprocess_handlers.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tests/unittest/llmapi/apps/_test_openai_chat.py
  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/openai_protocol.py
  • tests/unittest/llmapi/apps/_test_openai_completions.py
  • tests/unittest/llmapi/test_llm.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/serve/postprocess_handlers.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tests/unittest/llmapi/apps/_test_openai_chat.py
  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/openai_protocol.py
  • tests/unittest/llmapi/apps/_test_openai_completions.py
  • tests/unittest/llmapi/test_llm.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/serve/postprocess_handlers.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tests/unittest/llmapi/apps/_test_openai_chat.py
  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/openai_protocol.py
  • tests/unittest/llmapi/apps/_test_openai_completions.py
  • tests/unittest/llmapi/test_llm.py
🧠 Learnings (3)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/test_llm.py
🧬 Code graph analysis (8)
tensorrt_llm/executor/result.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • cum_log_probs (230-231)
  • log_probs (226-227)
tensorrt_llm/scaffolding/task.py (2)
  • cumulative_logprob (94-96)
  • logprobs (99-100)
cpp/include/tensorrt_llm/executor/types.h (1)
  • FinishReason (502-597)
tensorrt_llm/serve/postprocess_handlers.py (1)
tensorrt_llm/serve/openai_protocol.py (1)
  • ChatCompletionLogProbsContent (368-369)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
tensorrt_llm/executor/result.py (1)
  • result (574-585)
tests/unittest/llmapi/apps/_test_openai_chat.py (1)
tests/unittest/llmapi/apps/_test_openai_completions.py (3)
  • client (44-45)
  • model_name (16-17)
  • backend (21-22)
tensorrt_llm/executor/worker.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmResponse (272-290)
  • get (100-109)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (125-491)
tensorrt_llm/executor/result.py (1)
  • LogProbsResult (57-60)
tensorrt_llm/serve/openai_protocol.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (69-96)
tensorrt_llm/builder.py (1)
  • default (50-58)
tests/unittest/llmapi/apps/_test_openai_completions.py (1)
tests/unittest/llmapi/apps/_test_openai_chat.py (3)
  • client (82-83)
  • model_name (21-22)
  • backend (26-27)
tests/unittest/llmapi/test_llm.py (3)
tensorrt_llm/scaffolding/task.py (1)
  • logprobs (99-100)
tests/unittest/llmapi/apps/_test_openai_chat.py (1)
  • backend (26-27)
tests/unittest/llmapi/apps/_test_openai_completions.py (1)
  • backend (21-22)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker.py

1035-1036: One-line docstring should fit on one line

Reformat to one line

(D200)

🔇 Additional comments (6)
tensorrt_llm/llmapi/llm.py (1)

570-573: Manual verify _TorchLLM.create’s return_logits doesn’t override context-logits capture
sampling_params.py already auto-enables context logits on PyTorch (tensorrt_llm/sampling_params.py lines 451–458) and executor/worker.py drops them when _need_return_context_logits is false (lines 1054–1058). Now confirm that the PyTorch LLM factory method (e.g. _TorchLLM.create) accepts and propagates return_context_logits alongside any return_logits flag so context logits aren’t inadvertently suppressed.

tensorrt_llm/executor/result.py (1)

251-264: Hardened logprobs handling looks correct.

  • Preserves previous length for diffs.
  • Slices only when logprobs is a list and asserts length matches output length.
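
A runnable sketch of the accumulation and clamping described above; the `_Output` container and its fields are illustrative stand-ins, not the actual classes in result.py:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class _Output:
    """Illustrative stand-in for a per-beam output during streaming."""
    length: int = 0  # tokens emitted so far
    logprobs: List[Dict[int, float]] = field(default_factory=list)


def accumulate_logprobs(output: _Output,
                        backend_logprobs: List[Dict[int, float]]) -> None:
    """Append only the logprobs for tokens added in this streaming update."""
    if isinstance(backend_logprobs, list):
        prev_len = len(output.logprobs)  # previous length, used to take the diff
        output.logprobs += backend_logprobs[prev_len:]
        if len(output.logprobs) > output.length:
            # A backend may resend logprobs for already-emitted tokens; clamp.
            output.logprobs = output.logprobs[:output.length]
    assert len(output.logprobs) == output.length
```
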
tests/unittest/llmapi/apps/_test_openai_chat.py (1)

245-291: Clarify backend expectations or align with earlier skip.

Earlier in this file, PyTorch chat logprobs are skipped; this test asserts generation logprobs for both backends. If PyTorch support is now enabled, consider removing the earlier skip to avoid inconsistent expectations; otherwise, add a targeted skip here for PyTorch until parity lands.
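
If the targeted-skip route is taken, a minimal guard could look like the following (the test name and `backend` fixture name are illustrative, taken from this test file's conventions):

```python
import pytest


def test_chat_generation_logprobs(backend: str):
    if backend == "pytorch":
        pytest.skip("Chat logprobs parity for the PyTorch backend not yet enabled")
    ...  # existing assertions for the TRT backend remain unchanged
```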

tensorrt_llm/executor/worker.py (1)

17-17: I’ve extracted the LlmResponse class definition and member methods to confirm context_logits and clear_context_logits are present. Let me know once you have the output.

tensorrt_llm/serve/openai_protocol.py (2)

306-309: Pass-through looks good.

Routing both logprobs and prompt_logprobs to SamplingParams is correct.


639-641: LGTM on SamplingParams plumbing.

Chat path correctly forwards both fields.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
docs/source/commands/trtllm-serve/trtllm-serve.rst (2)

186-186: Inconsistent config filename in multi-node example

This section still uses the old filename convention extra-llm-api-config.yml while the rest of the document has been updated to extra_llm_config.yaml. This should be updated for consistency.

-    echo -e "enable_attention_dp: true\npytorch_backend_config:\n  enable_overlap_scheduler: true" > extra-llm-api-config.yml
+    echo -e "enable_attention_dp: true\npytorch_backend_config:\n  enable_overlap_scheduler: true" > extra_llm_config.yaml

And update the reference in the command:

-        bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml"
+        bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra_llm_config.yaml"

82-86: Another filename consistency issue in multimodal example

The multimodal section also uses the old filename convention. This should be updated for consistency with the rest of the document.

-   cat >./extra-llm-api-config.yml<<EOF
+   cat >./extra_llm_config.yaml<<EOF

And update the corresponding reference:

-       --extra_llm_api_options ./extra-llm-api-config.yml
+       --extra_llm_api_options ./extra_llm_config.yaml
tests/integration/defs/test_e2e.py (1)

443-457: Critical: TemporaryDirectory is immediately eligible for cleanup; work_dir may disappear.

self.work_dir = Path(tempfile.TemporaryDirectory().name) loses the TemporaryDirectory handle, so the dir can be auto-deleted by GC, causing flaky I/O later.

Apply this diff:

-        self.work_dir = Path(tempfile.TemporaryDirectory().name)
+        # Keep a handle so the directory isn't cleaned up prematurely.
+        self._tmpdir = tempfile.TemporaryDirectory()
+        self.work_dir = Path(self._tmpdir.name)

Optional: implement __del__ to call self._tmpdir.cleanup() explicitly.
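
That optional cleanup could be as small as the sketch below, assuming the `_tmpdir` handle introduced in the diff above:

```python
def __del__(self):
    # Best-effort cleanup; TemporaryDirectory otherwise cleans up at GC/exit.
    tmpdir = getattr(self, "_tmpdir", None)
    if tmpdir is not None:
        try:
            tmpdir.cleanup()
        except OSError:
            pass
```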

Also applies to: 481-483

🧹 Nitpick comments (16)
docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1)

73-86: Notation/clarity nits: finite sum and math typesetting.

Suggest replacing the infinite sum with a finite horizon (T iterations) and using \cdot instead of *. Also define T in the “where” list for completeness.

Apply this diff:

-```math
-sol\_time = \sum_{i=0}^{\infty} time_i * balance\_ratio_i
-```
+```math
+sol\_time = \sum_{i=0}^{T-1} time_i \cdot balance\_ratio_i
+```

@@
 - $elapsed\_time$: Total empirically measured end-to-end execution time
 - $actual\_tps$: Observed throughput in tokens per second
 - $sol\_tps$: Theoretical maximum throughput under perfect load balance
+- $T$: Total number of iterations considered
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (3)

237-238: Tighten wording + proper capitalization; align with earlier section naming.

Minor edits for clarity and consistency with “Launch the TRT-LLM Server”.

-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200. 
+You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, and `max_num_tokens` when launching the TRT-LLM server, and set `reasoning-effort` when launching evaluation in GPT-OSS. Below are reference configurations for accuracy evaluation on B200. Ensure these values remain consistent with the “Launch the TRT-LLM Server” section.

239-245: Define “DEP” and map table rows to concrete flags; call out config consistency with launch flags.

Right now “DEP8 / DEP4” isn’t actionable. Please define DEP and how to set it (e.g., via enable_attention_dp: true) and explicitly map each row to --tp_size, --ep_size, and enable_attention_dp. Also, values here (e.g., max_batch_size 1024 and max_num_tokens 32768/133120) don’t match the earlier launch example (--max_batch_size 720, --max_num_tokens 16384). Readers will be confused.

 | high                 | TP8 / TP4                  | 720                | 133120             |
- 
+ 
+Note:
+- “DEP” denotes attention data parallelism; set `enable_attention_dp: true`. “TP” denotes pure tensor parallelism; set `enable_attention_dp: false`.
+- For each row, set `--tp_size` and `--ep_size` accordingly (e.g., “TP8” → `--tp_size 8`; use `--ep_size` for MoE expert parallel as needed).
+- Keep `--max_batch_size` and `--max_num_tokens` consistent with your server launch. If you launched with `--max_batch_size 720` and `--max_num_tokens 16384`, update the corresponding table entries to match.

246-246: Fix eval name: “AIME25” vs “AIME2025”.

Text says “AIME2025” but the command uses aime25. Align the prose with the flag.

-Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME2025.
+Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME25.
docs/source/reference/multimodal-feature-support-matrix.md (1)

3-14: Add a brief definition/footnote for “Encoder IFB”.

Readers may not know “Encoder IFB”. Add a one‑liner or footnote under the table clarifying it (e.g., encoder in‑flight batching for the vision tower during prefill) and any notable limitations.

Example addition:

 | Qwen2.5-VL         | Yes        | Yes                 | Yes            | No              |
 
+Note: “Encoder IFB” = encoder in‑flight batching (vision encoder). Availability may depend on model’s vision tower and scheduling flags.
docs/source/commands/trtllm-serve/trtllm-serve.rst (1)

210-210: Improve accuracy of metrics description

The line mentions "inflight-batching details" but the example metrics don't show any inflight-batching specific fields. Consider updating to be more accurate.

-The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.
+The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage, KV cache details, and request metrics.
docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (6)

14-18: Fix MD040: add a fenced code language.

Label the directory tree block to satisfy markdownlint and improve rendering.

-```
+```text
 /path/to/models/
   ├─ gpt-oss-120b/  # base model directory
   └─ eagle/         # Eagle3 speculative decoding assets

33-46: Consider a non-interactive run example.

Add a detached variant to improve copy/paste UX in automation.

-docker run --rm --ipc=host -it \
+docker run --rm --ipc=host -it \
   --ulimit stack=67108864 \
   --ulimit memlock=-1 \
   --gpus all \
   -p 8000:8000 \
   -v /path/to/models:/config/models:rw \
-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+  nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_IMAGE_TAG} \
   /bin/bash
+
+# Or detached server directly:
+# docker run --rm --ipc=host -d --name trtllm \
+#   --ulimit stack=67108864 --ulimit memlock=-1 --gpus all \
+#   -p 8000:8000 -v /path/to/models:/config/models:rw \
+#   nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_IMAGE_TAG} \
+#   bash -lc 'TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b ...'

73-74: Use proper Markdown links instead of code-formatted URLs.

Improves readability and avoids confusion with shell literals.

-References: `https://huggingface.co/openai/gpt-oss-120b` and `https://huggingface.co/nvidia/gpt-oss-120b-Eagle3`
+References: [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) and [nvidia/gpt-oss-120b-Eagle3](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3)

108-109: Sanity-check parallelism and sequence settings for the stated 8‑GPU example.

--tp_size 8 --ep_size 4 with 8 GPUs and 120B weights may be inconsistent/resource-heavy; --max_seq_len/--max_num_tokens 131072 is extreme. Provide safer defaults or a sizing note.

-TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10  --tp_size 8 --ep_size 4 --trust_remote_code --extra_llm_api_options /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072
+TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 \
+  --max_batch_size 4 --tp_size 8 --ep_size 1 --trust_remote_code \
+  --extra_llm_api_options /config/models/eagle/eagle.yaml \
+  --max_num_tokens 32768 --max_seq_len 32768  # adjust per VRAM/SLA

125-139: Showcase prompt token logprobs in the sample request (matches PR goal).

Add logprobs and prompt_logprobs to demonstrate the new capability.

   -d '{
     "model": "gpt-oss-120b",
     "messages": [
       {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."}
     ],
-    "max_tokens": 128,
+    "max_tokens": 128,
+    "logprobs": 5,
+    "prompt_logprobs": true,
     "stream": false
   }'

3-3: Either link the “previous guide” or remove the forward reference.

Avoid dangling references.

docs/source/torch/features/feature_combination_matrix.md (1)

18-18: Confirm MTP and EAGLE‑3(One Model Engine) support before marking “Yes”.

Docs now claim both are “Yes”. Ensure feature gates, required flags, and backend constraints (CUDA graph, attention backend, etc.) match actual support paths; otherwise annotate with “WIP/Untested” like other rows or add footnotes.

Would you like me to link the exact example/tests that demonstrate both features to justify the change in this matrix?

tests/integration/defs/test_e2e.py (3)

565-569: Harden CLI flag emission for concurrency/num_requests.

  • Check explicitly for None to allow zero if ever needed.
  • Cast to int once to avoid accidental string injection.

Apply:

-        if self.concurrency:
-            benchmark_cmd += f" --concurrency {self.concurrency}"
-        if self.num_requests:
-            benchmark_cmd += f" --num_requests {self.num_requests}"
+        if self.concurrency is not None:
+            benchmark_cmd += f" --concurrency {int(self.concurrency)}"
+        if self.num_requests is not None:
+            benchmark_cmd += f" --num_requests {int(self.num_requests)}"

628-687: Make MIG scaling assertion less brittle; fix minor print nit.

  • Throughput increase factor 1.3 per step may be flaky across environments. Recommend monotonic increase or a smaller factor (e.g., 1.1).
  • Remove needless f-prefix in static print.

Apply:

-    print(f"\n=== Benchmark Results Comparison ===")
+    print("\n=== Benchmark Results Comparison ===")
@@
-            prev_throughput = float(results[concurrency_list[idx - 1]].get(
-                'throughput', 0))
-            assert throughput > prev_throughput * 1.3, f"Throughput is not increasing for concurrency {concurrency_list[idx]}"
+            prev_throughput = float(results[concurrency_list[idx - 1]].get('throughput', 0))
+            # Allow modest scaling margin to reduce flakiness across MIG configs.
+            assert throughput >= prev_throughput * 1.1, (
+                f"Throughput did not sufficiently increase: {prev_throughput} -> {throughput} "
+                f"at concurrency {concurrency_list[idx]}"
+            )

1633-1638: Remove unused fixture argument to silence lint.

llm_root is unused.

Apply:

-def test_openai_responses(llm_root, llm_venv):
+def test_openai_responses(llm_venv):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4aa47fb and fbec724.

⛔ Files ignored due to path filters (3)
  • docs/source/blogs/media/tech_blog10_baseline_performance_detail.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_context_wait_performance.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_full_strategy_performance.png is excluded by !**/*.png
📒 Files selected for processing (10)
  • docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1 hunks)
  • docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (1 hunks)
  • docs/source/commands/trtllm-serve/trtllm-serve.rst (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1 hunks)
  • docs/source/index.rst (1 hunks)
  • docs/source/legacy/tensorrt_quickstart.md (1 hunks)
  • docs/source/reference/multimodal-feature-support-matrix.md (1 hunks)
  • docs/source/reference/support-matrix.md (1 hunks)
  • docs/source/torch/features/feature_combination_matrix.md (1 hunks)
  • tests/integration/defs/test_e2e.py (20 hunks)
✅ Files skipped from review due to trivial changes (2)
  • docs/source/index.rst
  • docs/source/legacy/tensorrt_quickstart.md
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/integration/defs/test_e2e.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/integration/defs/test_e2e.py
🧠 Learnings (4)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/commands/trtllm-serve/trtllm-serve.rst
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
PR: NVIDIA/TensorRT-LLM#7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/integration/defs/test_e2e.py
🧬 Code graph analysis (1)
tests/integration/defs/test_e2e.py (3)
tests/integration/defs/trt_test_alternative.py (1)
  • check_output (257-283)
tests/integration/defs/conftest.py (4)
  • llm_venv (707-723)
  • llm_root (180-181)
  • unittest_path (90-91)
  • llm_models_root (77-83)
tests/integration/defs/common.py (1)
  • parse_output (665-681)
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py

570-570: Function call with shell=True parameter identified, security issue

(S604)


659-659: f-string without any placeholders

Remove extraneous f prefix

(F541)


680-680: Use of assert detected

(S101)


681-681: Use of assert detected

(S101)


686-686: Use of assert detected

(S101)


1633-1633: Unused function argument: llm_root

(ARG001)


2806-2806: Use of assert detected

(S101)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md

14-14: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1)

74-74: Correct fix: SOL time should scale with balance ratio (not divide).

Multiplying by balance_ratio_i = avg_tokens/max_tokens correctly models the ideal per-iteration time as time_i * balance_ratio_i. Good catch.

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1)

239-245: Double-check batch/token limits for B200; numbers look unusually high.

max_num_tokens = 133120 and max_batch_size = 1024 (TP low/medium) are aggressive for 120B. Please validate these against actual memory headroom for the recommended kv_cache_free_gpu_memory_fraction and your sequence lengths (ISL/OSL). If they’re theoretical maxima, label them as such and provide a safe default row.

docs/source/reference/multimodal-feature-support-matrix.md (3)

3-14: Cross-matrix consistency check.

If other backend matrices (e.g., TensorRT) list Llama 4, mirror this change there (or explicitly scope this table to PyTorch only, which the header implies).


9-9: No change—row already uses canonical “Llama 4”.

Likely an incorrect or invalid review comment.


9-9: Verify Llama 4 encoder IFB support in Python inference
No Python code paths reference “IFB” or “in-flight batching” for Llama 4. Determine whether encoder IFB is only enabled under the C++ session (via python_e2e, batch-size limits, scheduling flags, etc.); if so, update the matrix entry or add a footnote with the gating conditions.

docs/source/commands/trtllm-serve/trtllm-serve.rst (5)

204-209: LGTM! Clear documentation of PyTorch backend limitations

The note effectively communicates the beta status and limitations of the PyTorch backend metrics, including the performance impact caveat. This helps set appropriate user expectations.


212-218: Configuration structure update is well documented

The change from nested pytorch_backend_config to top-level enable_iter_perf_stats is clearly documented with proper YAML formatting.


223-223: Server command update matches configuration changes

The updated command properly references the new config filename convention.


225-227: Important usage guidance for metrics polling

This is valuable information about the queue-based metrics storage and the need for timely polling. The guidance helps prevent users from missing metrics data.


236-256: Comprehensive metrics example output

The updated JSON example properly reflects the current PyTorch backend metrics structure, including detailed KV cache statistics. The formatting and field coverage look accurate.

docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (2)

61-71: Hugging Face repository identifiers verified

The Hugging Face repos openai/gpt-oss-120b and nvidia/gpt-oss-120b-Eagle3 both exist and correspond to the intended base model and Eagle3 speculative decoding checkpoint, respectively.


82-95: Use valid speculative_config.decoding_type value “eagle”
speculative_config.decoding_type must be one of the accepted values (explicit_draft_tokens, eagle, ngram, user_provided, auto); update from Eagle to eagle.

Likely an incorrect or invalid review comment.

tests/integration/defs/test_e2e.py (4)

488-489: Returning the parsed benchmark result is good.

This makes the runner composable in tests.


511-512: Dynamic num-requests in dataset generation — LGTM.

Keeps dataset size proportional to test intent.


1560-1570: Add cache salt test — LGTM.

Installs requirements and executes targeted test as expected.


1758-1760: Regex now handles single or double quotes — LGTM.

Improves robustness to varied output quoting.

Comment on lines 98 to 102
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.

coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Add a security warning for trust_remote_code: true.

Readers should understand the risk and how to mitigate.

 Notes:
 - Ensure your base model directory is `/config/models/gpt-oss-120b`.
 - Ensure your Eagle3 assets are present under `/config/models/eagle/`.
 - If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
+- Warning: `trust_remote_code: true` executes repository code during model load. Only enable for sources you trust and pin exact revisions.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
- Warning: `trust_remote_code: true` executes repository code during model load. Only enable for sources you trust and pin exact revisions.
🤖 Prompt for AI Agents
In docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md around lines 98 to 102,
add a clear security warning paragraph about using trust_remote_code: true
explaining that it executes arbitrary remote code, advising readers to only
enable it for trusted, vetted checkpoints; run with isolation (containers, VMs),
limited privileges, and network restrictions; pin and verify model/package
versions and checksums, review remote code before enabling, and recommend
alternatives (e.g., local converted models or vendor-signed artifacts) and
explicit steps to revert the flag.

Comment on lines 248 to 259
```shell
# execute this command in gpt-oss
python -m gpt_oss.evals \
--sampler chat_completions \
--eval gpqa,aime25 \
--model gpt-oss-120b \
--reasoning-effort low,medium
```
Contributor

🛠️ Refactor suggestion

Make the example runnable: set OpenAI-compatible endpoint env vars and align model name with server.

Without OPENAI_BASE_URL/OPENAI_API_KEY, gpt_oss.evals won’t hit the local TRT-LLM server. Also, earlier sections use openai/gpt-oss-120b; align here for consistency.

-# execute this command in gpt-oss
-python -m gpt_oss.evals \
+# In the gpt-oss repo:
+export OPENAI_BASE_URL="http://localhost:8000/v1"
+export OPENAI_API_KEY="dummy"
+
+python -m gpt_oss.evals \
   --sampler chat_completions \
   --eval gpqa,aime25 \
-  --model gpt-oss-120b \
+  --model openai/gpt-oss-120b \
   --reasoning-effort low,medium
🤖 Prompt for AI Agents
In docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
around lines 248-255, the shell example lacks the OpenAI-compatible environment
variables and uses an inconsistent model name; update the instructions to
require setting OPENAI_BASE_URL and OPENAI_API_KEY to point at the local TRT-LLM
server and change the --model value to openai/gpt-oss-120b to match earlier
sections (showing either exporting those env vars beforehand or prefixing the
run command with them).

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from fbec724 to dea4d74 Compare September 9, 2025 07:03
@venkywonka venkywonka marked this pull request as ready for review September 9, 2025 07:04
@venkywonka venkywonka requested review from a team as code owners September 9, 2025 07:04
@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from dea4d74 to 227fed5 Compare September 9, 2025 07:04
@venkywonka venkywonka self-assigned this Sep 9, 2025
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18178 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18178 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13623 completed with status: 'FAILURE'

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from 227fed5 to bd2a1b6 Compare September 9, 2025 15:55
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18244 [ run ] triggered by Bot

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from bd2a1b6 to f925fa9 Compare September 15, 2025 17:50
if finish_reasons[src_idx] != tllm.FinishReason.CANCELLED:
if len(output.logprobs) > output.length:
# Check if logprobs is a list (not a dict or other structure)
if isinstance(output.logprobs, list) and len(
Collaborator Author

@venkywonka venkywonka Sep 15, 2025

output.logprobs will always be of list type (List[float] or TokenLogprobs: TypeAlias = list[dict[int, Logprob]]), so the isinstance check is unnecessary.
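For illustration, a minimal self-contained sketch of the two list shapes; the `Logprob` record here is a stand-in for the library's own type (its `logprob`/`rank` fields are assumed):

```python
# Sketch only: the two shapes output.logprobs can take.
from dataclasses import dataclass
from typing import TypeAlias

@dataclass
class Logprob:          # stand-in for the library's Logprob (fields assumed)
    logprob: float
    rank: int

TokenLogprobs: TypeAlias = list[dict[int, Logprob]]

plain: list[float] = [-0.11, -0.42]     # one float per token
top_k: TokenLogprobs = [                # one {token_id: Logprob} dict per token
    {17: Logprob(logprob=-0.11, rank=1)},
    {99: Logprob(logprob=-0.42, rank=1)},
]
assert isinstance(plain, list) and isinstance(top_k, list)  # both are plain lists
```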

asyncio.run(main())


@pytest.mark.skip(reason="https://nvbugs/5516660")
Collaborator Author

@venkywonka venkywonka Sep 16, 2025

@chzblych @hchings @LinPoly removing this skip, as I was able to fix the bug in result.py.

@venkywonka
Collaborator Author

/bot run --extra-stage "A100X-TensorRT-Post-Merge-1, A100X-TensorRT-Post-Merge-2, A100X-TensorRT-Post-Merge-3, A100X-TensorRT-Post-Merge-4, A100X-TensorRT-Post-Merge-5, A100X-TensorRT-Post-Merge-6"

1 similar comment

@tensorrt-cicd
Collaborator

PR_Github #19032 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19032 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14273 completed with status: 'FAILURE'

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from 4fc94a4 to 142d29d Compare September 19, 2025 05:46

@pytest.mark.skip(reason="https://nvbugs/5516660")
@force_ampere
# @force_ampere
Collaborator Author

uncomment this.

query_len: int,
sampling_params: SamplingParams,
is_gen_only: bool) -> None:
streaming: bool = False,
Collaborator Author

unnecessary now

Signed-off-by: Venky Ganesh <[email protected]>
if streaming:
    # need this so that context_logits / prompt_logprobs are not dropped
    # in the 2nd reuse of llm.generate() in streaming mode
    kv_cache_args_extra["enable_block_reuse"] = False
Collaborator Author

@tijyojwad this was the knob that made things work for prompt_logprobs
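For context, a minimal sketch of how a caller might apply this knob when requesting prompt logprobs in streaming mode; the model path is a placeholder, and the exact attribute that exposes prompt logprobs on the streamed output is an assumption:

```python
# Sketch: disable KV block reuse so context logits (and hence prompt_logprobs)
# are still produced on a repeated llm.generate_async() call in streaming mode.
import asyncio

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),
)
sampling_params = SamplingParams(max_tokens=8, prompt_logprobs=1)

async def main() -> None:
    async for chunk in llm.generate_async("The future of AI is",
                                          sampling_params=sampling_params,
                                          streaming=True):
        # Attribute path assumed; adjust to where your build exposes prompt logprobs.
        print(chunk.outputs[0].prompt_logprobs)

asyncio.run(main())
```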

# each streamed response_tensors.log_probs[src_idx]
# contains a streamwise monotonically growing list of logprobs.
# so we need to accumulate only the new ones unique to that particular streamed response
output.logprobs += response_tensors.log_probs[src_idx][
Collaborator Author

@hchings @LinPoly this lack of accumulation was causing https://nvbugs/5516660; fixed here.
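For illustration, a standalone sketch of the accumulation scheme described above (simplified names, not the PR's actual code): each streamed payload carries the full logprobs list so far, so only the suffix beyond what was already consumed gets appended.

```python
# Each streamed payload is a monotonically growing list of logprobs; keep a
# counter of how many were already consumed and append only the new suffix.
def accumulate_streamed_logprobs(accumulated: list[float],
                                 last_len: int,
                                 streamed: list[float]) -> int:
    new_items = streamed[last_len:]
    accumulated.extend(new_items)
    return last_len + len(new_items)

acc: list[float] = []
consumed = 0
for payload in ([-0.1], [-0.1, -0.5], [-0.1, -0.5, -0.2]):  # growing lists
    consumed = accumulate_streamed_logprobs(acc, consumed, payload)
assert acc == [-0.1, -0.5, -0.2]   # no duplicates across streamed chunks
```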

@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19295 [ run ] triggered by Bot

@venkywonka venkywonka changed the title [TRTLLM-7015] [feat] Enable prompt token logprobs across backends [TRTLLM-7015] [feat] Enable prompt token logprobs in pytorch backend Sep 19, 2025
@venkywonka venkywonka changed the title [TRTLLM-7015] [feat] Enable prompt token logprobs in pytorch backend [TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend Sep 19, 2025
@venkywonka
Collaborator Author

@coderabbitai full review

Contributor

coderabbitai bot commented Sep 19, 2025

✅ Actions performed

Full review triggered.

prompt_logprobs=prompt_logprobs,
return_context_logits=return_context_logits,
return_generation_logits=return_generation_logits)
with llm:
Collaborator

How about restoring this to use the LLM directly without the with-statement? The LLM instance should be able to clean up without requiring an explicit __exit__ or shutdown. Introducing a with-statement might obscure issues that prevent proper destruction of the LLM instance.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
tensorrt_llm/executor/result.py (1)

247-272: Fix logprobs accumulation logic to prevent out-of-bounds access.

The current implementation slices response_tensors.log_probs[src_idx] from output._last_logprobs_len: without checking if the index is valid. If output._last_logprobs_len exceeds the length of the logprobs list, this will result in an empty slice but no error. However, the logic then assumes there are new logprobs to append, which may not be accurate.

Consider adding a bounds check:

-            output.logprobs += response_tensors.log_probs[src_idx][
-                output._last_logprobs_len:]
+            new_logprobs = response_tensors.log_probs[src_idx]
+            if output._last_logprobs_len < len(new_logprobs):
+                output.logprobs += new_logprobs[output._last_logprobs_len:]
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

281-291: Consider adding error handling for missing attributes.

The clear_context_logits method directly accesses _py_result._context_logits without checking if the attribute exists. While the docstring indicates when this is used, defensive programming would make the code more robust.

Apply this diff to add defensive checks:

 def clear_context_logits(self):
     """Clear context logits from the response result.
 
     This is used to drop context logits after prompt_logprobs have been computed
     when the user didn't explicitly request them.
     """
     if self.result and hasattr(self.result, '_py_result'):
         py_result = self.result._py_result
-        if hasattr(py_result, '_context_logits'):
-            py_result._context_logits = None
+        if hasattr(py_result, '_context_logits') and py_result._context_logits is not None:
+            py_result._context_logits = None
tensorrt_llm/sampling_params.py (1)

453-467: Consider refining the logging message for clarity.

While the implementation correctly enables return_context_logits when needed for prompt_logprobs computation, the info log message could be more concise and clearer about the internal nature of this behavior.

 if self.prompt_logprobs and not self.return_context_logits:
-    logger.info(
-        "Since prompt_logprobs is requested but return_context_logits is False, "
-        "internally enabling context logits for prompt logprobs computation. "
-        "context logits will be dropped after computation as the user didn't explicitly request them."
-    )
+    logger.debug(
+        "Internally enabling context_logits for prompt_logprobs computation "
+        "(will be dropped after use)"
+    )

Consider using debug level instead of info since this is an internal implementation detail that users don't need to act upon.

tensorrt_llm/executor/worker.py (2)

1074-1075: Add defensive check for context_logits availability.

While the assertion message is helpful, consider a more graceful error handling approach for production code.

Consider replacing the assertion with a proper exception:

-    assert context_logits is not None, "context_logits cannot be None when prompt_logprobs is requested."
+    if context_logits is None:
+        raise ValueError("context_logits cannot be None when prompt_logprobs is requested. "
+                        "Ensure return_context_logits is enabled in sampling params.")

1078-1079: Consider documenting the caching behavior.

The caching of prompt logprobs in streaming mode is an important implementation detail that should be documented.

Add a comment explaining the caching strategy:

     if generation_result._streaming:
+        # Cache prompt logprobs to avoid recomputation in subsequent streaming iterations
         generation_result._cached_prompt_logprobs = logprobs_result.prompt
tests/unittest/llmapi/test_llm.py (1)

1869-1869: Remove unused parameter id from async function.

The static analysis correctly identifies that the id parameter in the async task function is never used.

-            async def task(id: int, prompt: str):
+            async def task(prompt: str):

And update the task creation:

-                tasks = [task(id, prompt) for id, prompt in enumerate(prompts)]
+                tasks = [task(prompt) for prompt in prompts]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fcc054 and f3dcad0.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (2 hunks)
  • tensorrt_llm/executor/result.py (2 hunks)
  • tensorrt_llm/executor/worker.py (3 hunks)
  • tensorrt_llm/llmapi/llm.py (1 hunks)
  • tensorrt_llm/sampling_params.py (2 hunks)
  • tests/unittest/llmapi/test_llm.py (1 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm.py
  • tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/unittest/llmapi/test_llm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/unittest/llmapi/test_llm.py
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm.py

1869-1869: Unused function argument: id

(ARG001)

🔇 Additional comments (14)
tensorrt_llm/executor/result.py (1)

703-722: LGTM! Clear and comprehensive docstring update.

The updated docstring for compute_logprobs clearly explains the function's purpose and distinguishes between different backend behaviors for prompt and generation logprobs computation.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

390-416: Excellent documentation for the create_response method.

The comprehensive docstring clearly explains the method's behavior, parameters, and return semantics. The implementation correctly wraps the serialized result in an LlmResponse with appropriate PyTorch-specific handling.

tests/unittest/llmapi/test_llm_pytorch.py (2)

884-904: LGTM! Comprehensive test coverage for PyTorch backend logprobs.

The test properly validates various combinations of prompt_logprobs and generation logprobs for the PyTorch backend, including the case where prompt_logprobs works without explicitly enabling context_logits.


906-924: Well-structured streaming test for logprobs.

The streaming test complements the non-streaming test by covering key scenarios including prompt-only, generation-only, and combined logprobs cases.

tensorrt_llm/llmapi/llm.py (2)

572-577: LGTM! Clean handling of the is_gen_only parameter.

The addition of the default value for is_gen_only makes the method signature more flexible while maintaining backward compatibility.


578-591: Good simplification of PyTorch backend validation.

Removing the prompt_logprobs restriction for the PyTorch backend aligns with the PR's goal of enabling this feature. The remaining validations for logprobs <= 1 and max_tokens checks are appropriate.

tensorrt_llm/executor/worker.py (4)

17-17: Import statement looks good.

The addition of LlmResponse import from the PyTorch executor module is appropriate for handling PyTorch backend responses with context logits.


1061-1081: Well-structured helper function for PyTorch prompt logprobs.

The implementation correctly handles caching for streaming mode and cleanly separates PyTorch-specific logic.


1083-1085: Type annotation properly updated to support both response types.

The union type allows flexible handling of both TRT and PyTorch backend responses.


1101-1113: PyTorch backend path correctly implements prompt logprobs handling.

The logic properly:

  1. Early returns when prompt_logprobs is not requested
  2. Computes prompt logprobs using the helper function
  3. Clears context logits when requested to avoid memory overhead

Good separation of concerns between PyTorch and TRT backends.
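A schematic of that flow follows; all helper and attribute names here are hypothetical stand-ins except clear_context_logits, which this PR adds.

```python
# Schematic only: early-return when prompt logprobs were not requested,
# otherwise derive them from context logits, cache for streaming, and drop
# the logits if the user never asked for them.
import torch

def _prompt_logprobs_from_context_logits(context_logits: torch.Tensor) -> torch.Tensor:
    # stand-in helper: per-position log-probabilities over the vocabulary
    return torch.log_softmax(context_logits.float(), dim=-1)

def pytorch_logprobs_path(response, logprob_params, generation_result):
    if not logprob_params.prompt_logprobs:
        return None  # generation logprobs already come from the sampler

    prompt_logprobs = _prompt_logprobs_from_context_logits(
        response.result.context_logits)

    if generation_result._streaming:
        # reuse across streamed chunks instead of recomputing
        generation_result._cached_prompt_logprobs = prompt_logprobs

    if not logprob_params.return_context_logits:
        response.clear_context_logits()  # free memory the user did not ask for

    return prompt_logprobs
```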

tests/unittest/llmapi/test_llm.py (4)

1798-1825: Well-structured test harness with proper backend handling.

The test harness correctly handles:

  • Backend-specific LLM class selection
  • KV cache configuration for streaming mode
  • Proper resource cleanup with the context manager pattern

1867-1886: Comprehensive streaming validation logic.

The streaming test properly validates that:

  • Streaming logprobs accumulation matches non-streaming results
  • Prompt logprobs are preserved across streaming iterations

Good test coverage for the streaming scenario.


1807-1814: Smart workaround for PyTorch streaming limitation. Tests reference prompt_logprobs/context_logits behavior (tests/unittest/llmapi/test_llm_pytorch.py:888,892); no TODO or tracking issue references found in the repo — create or link a tracking issue and add a brief inline code comment documenting this limitation.


1888-1908: Add PyTorch-specific logprobs tests (tests/unittest/llmapi/test_llm_pytorch.py)

test_llm_pytorch.py imports llm_return_logprobs_test_harness but no explicit PyTorch logprobs tests were found; add parametrized cases mirroring the TRT tests to cover:

  • prompt_logprobs with context_logits
  • prompt_logprobs without context_logits
  • generation logprobs only (top-k)
  • no logprobs

Use llm_return_logprobs_test_harness to implement these cases.
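A hedged sketch of what those parametrized cases could look like; the harness signature, import path, and the backend keyword are assumptions based on how the harness is used elsewhere in this PR.

```python
# Sketch only: parametrized PyTorch-backend logprobs cases built on the shared
# harness; argument names/order are assumed, adjust to the real signature.
import pytest

from .test_llm import llm_return_logprobs_test_harness  # import path assumed

@pytest.mark.parametrize(
    "prompt_logprobs, logprobs, return_context_logits, return_generation_logits",
    [
        (1, None, True, False),     # prompt_logprobs with context_logits
        (1, None, False, False),    # prompt_logprobs without context_logits
        (None, 1, False, False),    # generation logprobs only (top-k)
        (None, None, False, False), # no logprobs
    ],
)
def test_llm_return_logprobs_pytorch(prompt_logprobs, logprobs,
                                     return_context_logits,
                                     return_generation_logits):
    llm_return_logprobs_test_harness(prompt_logprobs,
                                     logprobs,
                                     return_context_logits,
                                     return_generation_logits,
                                     backend="pytorch")
```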

@venkywonka venkywonka requested a review from Copilot September 19, 2025 15:35
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enables prompt_logprobs support in the PyTorch backend to unify logprobs functionality across TensorRT and PyTorch backends. The implementation automatically captures required context logits when prompt logprobs are requested and provides an option to clear stored context logits after use to reduce memory consumption.
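A minimal sketch (not the PR's code) of how prompt logprobs are derived from context logits, assuming the common convention that logits row i scores the token at prompt position i + 1:

```python
import torch

def prompt_logprobs_from_context_logits(
        context_logits: torch.Tensor,        # [prompt_len, vocab_size]
        prompt_token_ids: list[int]) -> list[float | None]:
    # The first prompt token has no preceding position, hence no logprob.
    logprobs: list[float | None] = [None]
    log_probs = torch.log_softmax(context_logits.float(), dim=-1)
    for pos in range(1, len(prompt_token_ids)):
        logprobs.append(log_probs[pos - 1, prompt_token_ids[pos]].item())
    return logprobs
```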

Key Changes

  • Added PyTorch backend support for prompt logprobs computation from context logits
  • Implemented automatic context logits capture when prompt logprobs are requested but context logits aren't explicitly enabled
  • Enhanced streaming mode to properly handle prompt logprobs with consistent caching behavior

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tensorrt_llm/sampling_params.py | Added logic to auto-enable context logits when prompt_logprobs is requested in PyTorch backend |
| tensorrt_llm/llmapi/llm.py | Removed PyTorch backend restriction for prompt_logprobs support |
| tensorrt_llm/executor/worker.py | Enhanced logprobs computation to handle PyTorch backend prompt logprobs with caching for streaming |
| tensorrt_llm/executor/result.py | Updated logprobs handling to differentiate between TRT and PyTorch backend provenance |
| tensorrt_llm/_torch/pyexecutor/llm_request.py | Added clear_context_logits method and improved documentation |
| tests/unittest/llmapi/test_llm.py | Updated test harness with backend parameter and streaming improvements |
| tests/unittest/llmapi/test_llm_pytorch.py | Added comprehensive test coverage for PyTorch backend logprobs functionality |


Comment on lines +459 to +465
# TODO(venky): Find a more elegant way to do this.
# NOTE: This is an internal hack, so we can entirely avoid introducing
# `prompt_logprobs` into the executor bindings and further into
# model engine / sampler.
# This is because, prompt_logprobs is a derived quantity from
# context logits, and the capability to post-compute it
# already exists in the worker. (see _get_logprobs in worker.py)
Copilot AI Sep 19, 2025

This TODO comment indicates a design concern about the current implementation being a 'hack'. Consider creating a GitHub issue to track the technical debt for refactoring this approach to a more elegant solution.


# In streaming mode, since out-of-order responses are not possible,
# each streamed response_tensors.log_probs[src_idx]
# contains a streamwise monotonically growing list of logprobs.
# so we need to accumulate only the new ones unique to that particular streamed response
Copilot AI Sep 19, 2025

The slicing logic [output._last_logprobs_len:] to get only new logprobs in streaming mode could be error-prone if _last_logprobs_len becomes inconsistent. Consider adding an assertion to verify that output._last_logprobs_len <= len(response_tensors.log_probs[src_idx]) before slicing.

Suggested change
# so we need to accumulate only the new ones unique to that particular streamed response
assert output._last_logprobs_len <= len(response_tensors.log_probs[src_idx]), (
    f"_last_logprobs_len ({output._last_logprobs_len}) > log_probs length ({len(response_tensors.log_probs[src_idx])})"
)


parent class's serialized result in a PyTorch-specific LlmResponse object.
Args:
use_fast_logits (bool, optional): Whether to use fast logits computation
Collaborator

The comment is a bit vague and probably incorrect. This boolean controls the direct logits transfer from the draft model to the target model in speculative decoding. I actually don't know if it applies to the PyTorch backend as implemented.

Args:
use_fast_logits (bool, optional): Whether to use fast logits computation
for improved performance. Defaults to False.
mpi_world_rank (int, optional): The MPI world rank for distributed
Collaborator

Same thing: this is the leader rank of the draft model when using direct logits transfer.

if is_pytorch_backend:
    if not logprob_params.prompt_logprobs:
        # PyTorch: generation logprobs computed in sampler, no post-processing needed
        return
Collaborator

return None to be explicit?
