
Conversation

venkywonka (Collaborator) commented Sep 6, 2025

Summary by CodeRabbit

  • New Features
    • Unified logprobs support across backends, including prompt logprobs on PyTorch.
    • Automatic capture of required logits when requesting prompt logprobs.
    • Option to clear stored context logits to reduce memory after use.
    • Improved streaming behavior with consistent logprobs reporting.
  • Bug Fixes
    • More accurate and consistent logprobs accumulation and length handling across backends.
  • Tests
    • Expanded test coverage for logprobs across TensorRT and PyTorch backends, including streaming scenarios.

Description

Also fixes https://nvbugs/5516660.

Test Coverage

- test_llm_pytorch.py::test_llm_return_logprobs
- test_llm_pytorch.py::test_llm_return_logprobs_streaming

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
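
For example, to rerun only a single test stage without fail-fast (the stage name below is illustrative):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"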

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

coderabbitai bot (Contributor) commented Sep 6, 2025

📝 Walkthrough

Adds LlmResponse wrapper with clear_context_logits; updates LlmRequest.create_response to return LlmResponse. Refactors logprobs handling across backends: executor/result accumulation, worker computation paths for PyTorch vs TRT using logits. Adjusts LLM argument checks and sampling params to auto-enable context logits. Expands tests for multi-backend and streaming.
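
As a rough illustration of the PyTorch prompt-logprobs path: prompt logprobs can be derived from the captured context logits with a log-softmax over the vocabulary and a per-position top-k. The sketch below is hypothetical (the function and argument names are not the PR's actual helpers); it only assumes context logits of shape [prompt_len, vocab_size], where the logits at position i predict token i + 1.

```python
import torch
from typing import Dict, List


def prompt_logprobs_from_context_logits(context_logits: torch.Tensor,
                                        prompt_token_ids: List[int],
                                        top_k: int) -> List[Dict[int, float]]:
    """Sketch: top-k log-probabilities for each prompt position after the first."""
    logprobs = torch.log_softmax(context_logits.float(), dim=-1)
    results: List[Dict[int, float]] = []
    for pos in range(1, len(prompt_token_ids)):
        # Logits at position pos - 1 predict the token at position pos.
        row = logprobs[pos - 1]
        top_vals, top_ids = row.topk(top_k)
        entry = {int(t): float(v) for t, v in zip(top_ids, top_vals)}
        # Always include the actual prompt token, even if it falls outside top-k.
        entry.setdefault(prompt_token_ids[pos], float(row[prompt_token_ids[pos]]))
        results.append(entry)
    return results
```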

Changes

Cohort / File(s) Summary
PyTorch response wrapper and API
tensorrt_llm/_torch/pyexecutor/llm_request.py
Introduces LlmResponse.clear_context_logits; changes LlmRequest.create_response to return LlmResponse or None; adds docstring; wires Python-side result with serialized payload.
Executor logprobs handling
tensorrt_llm/executor/result.py
Refines prompt/generation logprobs handling for TRT and PyTorch backends; accumulates only new PyTorch logprobs; clamps to sequence length; updates compute_logprobs docstring.
Worker cross-backend logprobs
tensorrt_llm/executor/worker.py
Accepts Union[tllm.Response, LlmResponse]; adds _compute_pytorch_prompt_logprobs; branches: PyTorch computes prompt logprobs via context logits; TRT computes from context/generation logits; drops logits per flags.
LLM argument checks
tensorrt_llm/llmapi/llm.py
Makes is_gen_only optional (default False); removes PyTorch-specific prompt_logprobs prohibition; keeps other validations.
Sampling params auto-enable
tensorrt_llm/sampling_params.py
Imports logger; if PyTorch prompt_logprobs requested, auto-enables return_context_logits with info log.
Tests: multi-backend and streaming
tests/unittest/llmapi/test_llm.py, tests/unittest/llmapi/test_llm_pytorch.py
Generalizes harness to accept backend/tp/streaming; runs against TRT and PyTorch; integrates streaming assertions; adds new PyTorch tests.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Client
  participant LLM
  participant Executor
  participant Worker
  participant Sampler
  participant Response

  Client->>LLM: submit(prompt, sampling_params)
  LLM->>Executor: run request
  Executor->>Sampler: generate tokens
  Sampler-->>Executor: tokens (+ backend-provided logprobs?)
  Executor-->>LLM: serialized result + logits (optional)
  LLM-->>Response: LlmResponse(create_response)
  Note over Response: result.context_logits / generation_logits may be present

  LLM->>Worker: _get_logprobs(response, backend)

  alt PyTorch backend
    Worker->>Worker: _compute_pytorch_prompt_logprobs using context_logits
    opt streaming
      Worker->>Worker: cache/use prior prompt logprobs
    end
    Worker->>Response: clear_context_logits() (optional)
  else TRT backend
    Worker->>Worker: compute from context_logits and generation_logits
    Worker->>Response: drop logits per flags
  end

  Worker-->>LLM: LogProbsResult
  LLM-->>Client: outputs + logprobs
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check (⚠️ Warning): The PR body contains the repository template and a brief "also fixes" note plus a Test Coverage list, but the required Description section is not populated with a concise summary of what was changed, why, which files/subsystems are affected, or any behavioral/compatibility impact, so the description is largely incomplete for reviewers. Resolution: populate the Description section with a short summary of the implemented changes and rationale, list the main files or subsystems touched and any behavioral or compatibility impacts, confirm which tests exercise the changes, and mark the PR checklist items (documentation, CODEOWNERS, etc.) so reviewers have sufficient context.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 35.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The title follows the repository convention ([TRTLLM-7015][feat]) and concisely summarizes the primary change — enabling prompt_logprobs support in the PyTorch backend; it is specific, relevant to the changeset, and contains no noisy extras.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot changed the title from "[TRTLLM-7015] @coderabbitai title" to "[TRTLLM-7015] [feat] Enable prompt token logprobs across backends" on Sep 6, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
tensorrt_llm/sampling_params.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

This file is missing the standard header mandated by the repo guidelines.

Apply at file top:

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/llmapi/llm.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

Header is missing.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

Header is missing.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/executor/result.py (1)

1-1: Add required NVIDIA Apache-2.0 header (2025).

Header is missing.

+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
tensorrt_llm/serve/postprocess_handlers.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025) as required by guidelines.

The file lacks the mandated header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/executor/worker.py (2)

1014-1021: Guard against error responses or missing result to avoid attribute errors.

Error responses may have no result; early-return to prevent None access.

     logprobs_result = None
     generation_result = worker._results.get(response.client_id, None)
 
     if not generation_result:
         return
+    # Skip when response has errors or no result payload yet
+    if hasattr(response, "has_error") and response.has_error():
+        return
+    if not getattr(response, "result", None):
+        return

1-1: Add NVIDIA Apache-2.0 header (2025) as required by guidelines.

The file lacks the mandated header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tests/unittest/llmapi/apps/_test_openai_completions.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Per coding guidelines, prepend the NVIDIA Apache-2.0 copyright header to all .py files.

Apply at file top:

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
tensorrt_llm/serve/openai_protocol.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Per guidelines, all source files must include the NVIDIA Apache-2.0 header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
tests/unittest/llmapi/test_llm.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Tests are also source files; add the standard header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
🧹 Nitpick comments (11)
tensorrt_llm/sampling_params.py (1)

452-456: Auto-enabling context logits for PyTorch: also flag as auto-enabled.

To keep behavior consistent with TRT (and make downstream “need to return” checks accurate), set the private auto-enabled flag when you flip return_context_logits.

Apply this diff:

-            if self.prompt_logprobs and not self.return_context_logits:
-                config_kwargs["return_context_logits"] = True
+            if self.prompt_logprobs and not self.return_context_logits:
+                config_kwargs["return_context_logits"] = True
+                # Mirror TRT flow: mark as internally enabled so consumers can safely drop them.
+                self._context_logits_auto_enabled = True

Please confirm worker logic drops context logits when prompt_logprobs is requested but return_context_logits was not explicitly set by the user.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

281-291: Good: targeted API to drop context logits.

Method is concise and safely nulls the Python-side storage.

Consider adding a symmetric clear_generation_logits() for parity with drop_generation_logits flows.
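
A possible shape for that symmetric method, purely as a sketch; the attribute names used here (`_py_result`, `_generation_logits`) are assumptions about the Python-side storage rather than the file's actual fields:

```python
def clear_generation_logits(self) -> None:
    """Release stored generation logits once logprobs have been extracted.

    Sketch only: assumes generation logits live on the same Python-side
    result object that clear_context_logits() nulls out.
    """
    py_result = getattr(self, "_py_result", None)
    if py_result is not None:
        py_result._generation_logits = None
```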

tests/unittest/llmapi/apps/_test_openai_chat.py (1)

187-243: Tighten assertions; drop noisy prints for CI hygiene.

  • The conditional print statements make test output noisy and aren’t actionable. Prefer silent acceptance with explicit assertions or a short skip/xfail note.
  • Keep the “accepts but may not return” behavior, but avoid printing.

Apply this diff to remove prints while preserving intent:

@@
-        print(
-            f"✓ Found prompt_logprobs with {len(choice.prompt_logprobs)} entries"
-        )
@@
-        print(
-            f"✓ Found prompt_logprobs in message with {len(message.prompt_logprobs)} entries"
-        )
@@
-        print(
-            f"Note: prompt_logprobs accepted in request but not yet included in response (expected limitation)"
-        )
+        # prompt_logprobs currently accepted but not returned (known limitation)
+        pass
tensorrt_llm/serve/postprocess_handlers.py (2)

60-63: Fix type hint to reflect accepted input shapes.

The function consumes ints/floats, dicts, sequences, and objects with a logprob attribute; List[float] is misleading.

Apply this diff:

-from typing import List, Literal, Optional, Tuple, Union
+from typing import List, Literal, Optional, Tuple, Union, Sequence, Any
@@
-def create_logprobs(token_ids: List[int],
-                    tokenizer: TransformersTokenizer,
-                    logprobs: List[float]) -> ChatCompletionLogProbs:
+def create_logprobs(token_ids: List[int],
+                    tokenizer: TransformersTokenizer,
+                    logprobs: Sequence[Any]) -> ChatCompletionLogProbs:

27-27: Fix typo in formatter pragma.

Comment should read “enable”.

-# yapf: enale
+# yapf: enable
tensorrt_llm/executor/worker.py (2)

1009-1013: Compress one-line docstring (D200).

Ruff warns; keep on a single line.

-    """Compute logprob and prompt logprob and clear out logits if applicable.
-    """
+    """Compute logprobs and prompt_logprobs, and clear logits if applicable."""

1022-1036: PyTorch branch: compute only when needed; keep cleanup but be defensive.

Looks good overall. Minor nit: keep hasattr checks before touching backend-specific cleanup to avoid surprises with mixed response types.

tests/unittest/llmapi/apps/_test_openai_completions.py (3)

357-404: Make the expectation explicit; avoid prints.

Prefer an assertion over print so the test fails if the response shape unexpectedly changes. Use getattr for forward-compat.

-    # Check if prompt_logprobs are in the response (for future compatibility)
-    if hasattr(choice,
-               'prompt_logprobs') and choice.prompt_logprobs is not None:
-        # Future: When prompt_logprobs are added to the response
-        assert len(
-            choice.prompt_logprobs) > 0, "prompt_logprobs should not be empty"
-        print(
-            f"✓ Found prompt_logprobs with {len(choice.prompt_logprobs)} entries"
-        )
-    else:
-        # Current state: prompt_logprobs are computed internally but not returned
-        # This test verifies the feature doesn't cause errors
-        print(
-            f"Note: prompt_logprobs accepted in request but not yet included in response (expected limitation)"
-        )
+    plp = getattr(choice, "prompt_logprobs", None)
+    # Current behavior: not returned; assert None to catch unexpected changes.
+    # When support lands, update the assertion to validate structure.
+    assert plp is None

406-452: Strengthen generation logprobs validation; keep backend logic.

Assert expected K for returned top_logprobs to catch regressions.

-    # Verify we got generation logprobs
-    assert choice.logprobs is not None, "Generation logprobs were requested but not returned"
+    # Verify we got generation logprobs with expected top-K (allow K or K+1 for sampled token)
+    assert choice.logprobs is not None, "Generation logprobs were requested but not returned"
+    k = generation_logprobs
+    assert len(choice.logprobs.top_logprobs) >= 1
+    assert len(choice.logprobs.top_logprobs[0]) in {k, k + 1}

454-481: Assert no context logits leaked when not requested.

This confirms the auto-enable/drop path behaves as intended.

     choice = completion.choices[0]
     assert len(choice.text) >= 1
+    # Should not surface context logits when explicitly disabled.
+    assert getattr(choice, "context_logits", None) is None
tests/unittest/llmapi/test_llm.py (1)

1879-1907: Backend matrix: confirm PyTorch returns context logits.

This param set includes (prompt_logprobs=2, return_context_logits=True, backend="pytorch"). If PyTorch still doesn't expose context logits, this will fail. Either keep the case only if supported, or mark xfail/skip dynamically.

 @pytest.mark.parametrize(
     "prompt_logprobs, logprobs, return_context_logits, return_generation_logits, backend",
     [
@@
-        (2, None, True, False, "pytorch"
-         ),  # prompt_logprobs with context_logits
+        pytest.param(2, None, True, False, "pytorch", marks=pytest.mark.xfail(reason="PyTorch context logits exposure")),

If PyTorch now supports returning context logits, keep as-is and remove xfail.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 74105a4 and 4aa47fb.

⛔ Files ignored due to path filters (7)
  • docs/source/blogs/media/tech_blog10_baseline_performance_detail.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_baseline_performance_overview.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_baseline_round_robin_strategy.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_context_wait_performance.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_dataset_token_distribution.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_full_strategy_performance.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_tps_ttft_pareto_curve.png is excluded by !**/*.png
📒 Files selected for processing (11)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/executor/result.py (1 hunks)
  • tensorrt_llm/executor/worker.py (3 hunks)
  • tensorrt_llm/llmapi/llm.py (1 hunks)
  • tensorrt_llm/sampling_params.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (4 hunks)
  • tensorrt_llm/serve/postprocess_handlers.py (1 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_chat.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_completions.py (1 hunks)
  • tests/unittest/llmapi/test_llm.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/serve/postprocess_handlers.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tests/unittest/llmapi/apps/_test_openai_chat.py
  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/openai_protocol.py
  • tests/unittest/llmapi/apps/_test_openai_completions.py
  • tests/unittest/llmapi/test_llm.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/serve/postprocess_handlers.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tests/unittest/llmapi/apps/_test_openai_chat.py
  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/openai_protocol.py
  • tests/unittest/llmapi/apps/_test_openai_completions.py
  • tests/unittest/llmapi/test_llm.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/sampling_params.py
  • tensorrt_llm/serve/postprocess_handlers.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tests/unittest/llmapi/apps/_test_openai_chat.py
  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/openai_protocol.py
  • tests/unittest/llmapi/apps/_test_openai_completions.py
  • tests/unittest/llmapi/test_llm.py
🧠 Learnings (3)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/test_llm.py
🧬 Code graph analysis (8)
tensorrt_llm/executor/result.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • cum_log_probs (230-231)
  • log_probs (226-227)
tensorrt_llm/scaffolding/task.py (2)
  • cumulative_logprob (94-96)
  • logprobs (99-100)
cpp/include/tensorrt_llm/executor/types.h (1)
  • FinishReason (502-597)
tensorrt_llm/serve/postprocess_handlers.py (1)
tensorrt_llm/serve/openai_protocol.py (1)
  • ChatCompletionLogProbsContent (368-369)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
tensorrt_llm/executor/result.py (1)
  • result (574-585)
tests/unittest/llmapi/apps/_test_openai_chat.py (1)
tests/unittest/llmapi/apps/_test_openai_completions.py (3)
  • client (44-45)
  • model_name (16-17)
  • backend (21-22)
tensorrt_llm/executor/worker.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmResponse (272-290)
  • get (100-109)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (125-491)
tensorrt_llm/executor/result.py (1)
  • LogProbsResult (57-60)
tensorrt_llm/serve/openai_protocol.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (69-96)
tensorrt_llm/builder.py (1)
  • default (50-58)
tests/unittest/llmapi/apps/_test_openai_completions.py (1)
tests/unittest/llmapi/apps/_test_openai_chat.py (3)
  • client (82-83)
  • model_name (21-22)
  • backend (26-27)
tests/unittest/llmapi/test_llm.py (3)
tensorrt_llm/scaffolding/task.py (1)
  • logprobs (99-100)
tests/unittest/llmapi/apps/_test_openai_chat.py (1)
  • backend (26-27)
tests/unittest/llmapi/apps/_test_openai_completions.py (1)
  • backend (21-22)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker.py

1035-1036: One-line docstring should fit on one line

Reformat to one line

(D200)

🔇 Additional comments (6)
tensorrt_llm/llmapi/llm.py (1)

570-573: Manual verify _TorchLLM.create’s return_logits doesn’t override context-logits capture
sampling_params.py already auto-enables context logits on PyTorch (tensorrt_llm/sampling_params.py lines 451–458) and executor/worker.py drops them when _need_return_context_logits is false (lines 1054–1058). Now confirm that the PyTorch LLM factory method (e.g. _TorchLLM.create) accepts and propagates return_context_logits alongside any return_logits flag so context logits aren’t inadvertently suppressed.

tensorrt_llm/executor/result.py (1)

251-264: Hardened logprobs handling looks correct.

  • Preserves previous length for diffs.
  • Slices only when logprobs is a list and asserts length matches output length.
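
A runnable sketch of the accumulation and clamping described above; the `_Output` container and its fields are illustrative stand-ins, not the actual classes in result.py:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class _Output:
    """Illustrative stand-in for a per-beam output during streaming."""
    length: int = 0  # tokens emitted so far
    logprobs: List[Dict[int, float]] = field(default_factory=list)


def accumulate_logprobs(output: _Output,
                        backend_logprobs: List[Dict[int, float]]) -> None:
    """Append only the logprobs for tokens added in this streaming update."""
    if isinstance(backend_logprobs, list):
        prev_len = len(output.logprobs)  # previous length, used to take the diff
        output.logprobs += backend_logprobs[prev_len:]
        if len(output.logprobs) > output.length:
            # A backend may resend logprobs for already-emitted tokens; clamp.
            output.logprobs = output.logprobs[:output.length]
    assert len(output.logprobs) == output.length
```
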
tests/unittest/llmapi/apps/_test_openai_chat.py (1)

245-291: Clarify backend expectations or align with earlier skip.

Earlier in this file, PyTorch chat logprobs are skipped; this test asserts generation logprobs for both backends. If PyTorch support is now enabled, consider removing the earlier skip to avoid inconsistent expectations; otherwise, add a targeted skip here for PyTorch until parity lands.
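
If the targeted-skip route is taken, a minimal guard could look like the following (the test name and `backend` fixture name are illustrative, taken from this test file's conventions):

```python
import pytest


def test_chat_generation_logprobs(backend: str):
    if backend == "pytorch":
        pytest.skip("Chat logprobs parity for the PyTorch backend not yet enabled")
    ...  # existing assertions for the TRT backend remain unchanged
```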

tensorrt_llm/executor/worker.py (1)

17-17: I’ve extracted the LlmResponse class definition and member methods to confirm context_logits and clear_context_logits are present. Let me know once you have the output.

tensorrt_llm/serve/openai_protocol.py (2)

306-309: Pass-through looks good.

Routing both logprobs and prompt_logprobs to SamplingParams is correct.


639-641: LGTM on SamplingParams plumbing.

Chat path correctly forwards both fields.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
docs/source/commands/trtllm-serve/trtllm-serve.rst (2)

186-186: Inconsistent config filename in multi-node example

This section still uses the old filename convention extra-llm-api-config.yml while the rest of the document has been updated to extra_llm_config.yaml. This should be updated for consistency.

-    echo -e "enable_attention_dp: true\npytorch_backend_config:\n  enable_overlap_scheduler: true" > extra-llm-api-config.yml
+    echo -e "enable_attention_dp: true\npytorch_backend_config:\n  enable_overlap_scheduler: true" > extra_llm_config.yaml

And update the reference in the command:

-        bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml"
+        bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra_llm_config.yaml"

82-86: Another filename consistency issue in multimodal example

The multimodal section also uses the old filename convention. This should be updated for consistency with the rest of the document.

-   cat >./extra-llm-api-config.yml<<EOF
+   cat >./extra_llm_config.yaml<<EOF

And update the corresponding reference:

-       --extra_llm_api_options ./extra-llm-api-config.yml
+       --extra_llm_api_options ./extra_llm_config.yaml
tests/integration/defs/test_e2e.py (1)

443-457: Critical: TemporaryDirectory is immediately eligible for cleanup; work_dir may disappear.

self.work_dir = Path(tempfile.TemporaryDirectory().name) loses the TemporaryDirectory handle, so the dir can be auto-deleted by GC, causing flaky I/O later.

Apply this diff:

-        self.work_dir = Path(tempfile.TemporaryDirectory().name)
+        # Keep a handle so the directory isn't cleaned up prematurely.
+        self._tmpdir = tempfile.TemporaryDirectory()
+        self.work_dir = Path(self._tmpdir.name)

Optional: implement __del__ to call self._tmpdir.cleanup() explicitly.
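
That optional cleanup could be as small as the sketch below, assuming the `_tmpdir` handle introduced in the diff above:

```python
def __del__(self):
    # Best-effort cleanup; TemporaryDirectory otherwise cleans up at GC/exit.
    tmpdir = getattr(self, "_tmpdir", None)
    if tmpdir is not None:
        try:
            tmpdir.cleanup()
        except OSError:
            pass
```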

Also applies to: 481-483

🧹 Nitpick comments (16)
docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1)

73-86: Notation/clarity nits: finite sum and math typesetting.

Suggest replacing the infinite sum with a finite horizon (T iterations) and using \cdot instead of *. Also define T in the “where” list for completeness.

Apply this diff:

-```math
-sol\_time = \sum_{i=0}^{\infty} time_i * balance\_ratio_i
-```
+```math
+sol\_time = \sum_{i=0}^{T-1} time_i \cdot balance\_ratio_i
+```

@@
 - $elapsed\_time$: Total empirically measured end-to-end execution time
 - $actual\_tps$: Observed throughput in tokens per second
 - $sol\_tps$: Theoretical maximum throughput under perfect load balance
+- $T$: Total number of iterations considered
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (3)

237-238: Tighten wording + proper capitalization; align with earlier section naming.

Minor edits for clarity and consistency with “Launch the TRT-LLM Server”.

-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200. 
+You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, and `max_num_tokens` when launching the TRT-LLM server, and set `reasoning-effort` when launching evaluation in GPT-OSS. Below are reference configurations for accuracy evaluation on B200. Ensure these values remain consistent with the “Launch the TRT-LLM Server” section.

239-245: Define “DEP” and map table rows to concrete flags; call out config consistency with launch flags.

Right now “DEP8 / DEP4” isn’t actionable. Please define DEP and how to set it (e.g., via enable_attention_dp: true) and explicitly map each row to --tp_size, --ep_size, and enable_attention_dp. Also, values here (e.g., max_batch_size 1024 and max_num_tokens 32768/133120) don’t match the earlier launch example (--max_batch_size 720, --max_num_tokens 16384). Readers will be confused.

 | high                 | TP8 / TP4                  | 720                | 133120             |
- 
+ 
+Note:
+- “DEP” denotes attention data parallelism; set `enable_attention_dp: true`. “TP” denotes pure tensor parallelism; set `enable_attention_dp: false`.
+- For each row, set `--tp_size` and `--ep_size` accordingly (e.g., “TP8” → `--tp_size 8`; use `--ep_size` for MoE expert parallel as needed).
+- Keep `--max_batch_size` and `--max_num_tokens` consistent with your server launch. If you launched with `--max_batch_size 720` and `--max_num_tokens 16384`, update the corresponding table entries to match.

246-246: Fix eval name: “AIME25” vs “AIME2025”.

Text says “AIME2025” but the command uses aime25. Align the prose with the flag.

-Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME2025.
+Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME25.
docs/source/reference/multimodal-feature-support-matrix.md (1)

3-14: Add a brief definition/footnote for “Encoder IFB”.

Readers may not know “Encoder IFB”. Add a one‑liner or footnote under the table clarifying it (e.g., encoder in‑flight batching for the vision tower during prefill) and any notable limitations.

Example addition:

 | Qwen2.5-VL         | Yes        | Yes                 | Yes            | No              |
 
+Note: “Encoder IFB” = encoder in‑flight batching (vision encoder). Availability may depend on model’s vision tower and scheduling flags.
docs/source/commands/trtllm-serve/trtllm-serve.rst (1)

210-210: Improve accuracy of metrics description

The line mentions "inflight-batching details" but the example metrics don't show any inflight-batching specific fields. Consider updating to be more accurate.

-The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.
+The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage, KV cache details, and request metrics.
docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (6)

14-18: Fix MD040: add a fenced code language.

Label the directory tree block to satisfy markdownlint and improve rendering.

-```
+```text
 /path/to/models/
   ├─ gpt-oss-120b/  # base model directory
   └─ eagle/         # Eagle3 speculative decoding assets

33-46: Consider a non-interactive run example.

Add a detached variant to improve copy/paste UX in automation.

-docker run --rm --ipc=host -it \
+docker run --rm --ipc=host -it \
   --ulimit stack=67108864 \
   --ulimit memlock=-1 \
   --gpus all \
   -p 8000:8000 \
   -v /path/to/models:/config/models:rw \
-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+  nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_IMAGE_TAG} \
   /bin/bash
+
+# Or detached server directly:
+# docker run --rm --ipc=host -d --name trtllm \
+#   --ulimit stack=67108864 --ulimit memlock=-1 --gpus all \
+#   -p 8000:8000 -v /path/to/models:/config/models:rw \
+#   nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_IMAGE_TAG} \
+#   bash -lc 'TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b ...'

73-74: Use proper Markdown links instead of code-formatted URLs.

Improves readability and avoids confusion with shell literals.

-References: `https://huggingface.co/openai/gpt-oss-120b` and `https://huggingface.co/nvidia/gpt-oss-120b-Eagle3`
+References: [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) and [nvidia/gpt-oss-120b-Eagle3](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3)

108-109: Sanity-check parallelism and sequence settings for the stated 8‑GPU example.

--tp_size 8 --ep_size 4 with 8 GPUs and 120B weights may be inconsistent/resource-heavy; --max_seq_len/--max_num_tokens 131072 is extreme. Provide safer defaults or a sizing note.

-TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10  --tp_size 8 --ep_size 4 --trust_remote_code --extra_llm_api_options /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072
+TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 \
+  --max_batch_size 4 --tp_size 8 --ep_size 1 --trust_remote_code \
+  --extra_llm_api_options /config/models/eagle/eagle.yaml \
+  --max_num_tokens 32768 --max_seq_len 32768  # adjust per VRAM/SLA

125-139: Showcase prompt token logprobs in the sample request (matches PR goal).

Add logprobs and prompt_logprobs to demonstrate the new capability.

   -d '{
     "model": "gpt-oss-120b",
     "messages": [
       {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."}
     ],
-    "max_tokens": 128,
+    "max_tokens": 128,
+    "logprobs": 5,
+    "prompt_logprobs": true,
     "stream": false
   }'

3-3: Either link the “previous guide” or remove the forward reference.

Avoid dangling references.

docs/source/torch/features/feature_combination_matrix.md (1)

18-18: Confirm MTP and EAGLE‑3(One Model Engine) support before marking “Yes”.

Docs now claim both are “Yes”. Ensure feature gates, required flags, and backend constraints (CUDA graph, attention backend, etc.) match actual support paths; otherwise annotate with “WIP/Untested” like other rows or add footnotes.

Would you like me to link the exact example/tests that demonstrate both features to justify the change in this matrix?

tests/integration/defs/test_e2e.py (3)

565-569: Harden CLI flag emission for concurrency/num_requests.

  • Check explicitly for None to allow zero if ever needed.
  • Cast to int once to avoid accidental string injection.

Apply:

-        if self.concurrency:
-            benchmark_cmd += f" --concurrency {self.concurrency}"
-        if self.num_requests:
-            benchmark_cmd += f" --num_requests {self.num_requests}"
+        if self.concurrency is not None:
+            benchmark_cmd += f" --concurrency {int(self.concurrency)}"
+        if self.num_requests is not None:
+            benchmark_cmd += f" --num_requests {int(self.num_requests)}"

628-687: Make MIG scaling assertion less brittle; fix minor print nit.

  • Throughput increase factor 1.3 per step may be flaky across environments. Recommend monotonic increase or a smaller factor (e.g., 1.1).
  • Remove needless f-prefix in static print.

Apply:

-    print(f"\n=== Benchmark Results Comparison ===")
+    print("\n=== Benchmark Results Comparison ===")
@@
-            prev_throughput = float(results[concurrency_list[idx - 1]].get(
-                'throughput', 0))
-            assert throughput > prev_throughput * 1.3, f"Throughput is not increasing for concurrency {concurrency_list[idx]}"
+            prev_throughput = float(results[concurrency_list[idx - 1]].get('throughput', 0))
+            # Allow modest scaling margin to reduce flakiness across MIG configs.
+            assert throughput >= prev_throughput * 1.1, (
+                f"Throughput did not sufficiently increase: {prev_throughput} -> {throughput} "
+                f"at concurrency {concurrency_list[idx]}"
+            )

1633-1638: Remove unused fixture argument to silence lint.

llm_root is unused.

Apply:

-def test_openai_responses(llm_root, llm_venv):
+def test_openai_responses(llm_venv):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4aa47fb and fbec724.

⛔ Files ignored due to path filters (3)
  • docs/source/blogs/media/tech_blog10_baseline_performance_detail.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_context_wait_performance.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog10_full_strategy_performance.png is excluded by !**/*.png
📒 Files selected for processing (10)
  • docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1 hunks)
  • docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (1 hunks)
  • docs/source/commands/trtllm-serve/trtllm-serve.rst (1 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1 hunks)
  • docs/source/index.rst (1 hunks)
  • docs/source/legacy/tensorrt_quickstart.md (1 hunks)
  • docs/source/reference/multimodal-feature-support-matrix.md (1 hunks)
  • docs/source/reference/support-matrix.md (1 hunks)
  • docs/source/torch/features/feature_combination_matrix.md (1 hunks)
  • tests/integration/defs/test_e2e.py (20 hunks)
✅ Files skipped from review due to trivial changes (2)
  • docs/source/index.rst
  • docs/source/legacy/tensorrt_quickstart.md
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/integration/defs/test_e2e.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/integration/defs/test_e2e.py
🧠 Learnings (4)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/commands/trtllm-serve/trtllm-serve.rst
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
PR: NVIDIA/TensorRT-LLM#7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/integration/defs/test_e2e.py
🧬 Code graph analysis (1)
tests/integration/defs/test_e2e.py (3)
tests/integration/defs/trt_test_alternative.py (1)
  • check_output (257-283)
tests/integration/defs/conftest.py (4)
  • llm_venv (707-723)
  • llm_root (180-181)
  • unittest_path (90-91)
  • llm_models_root (77-83)
tests/integration/defs/common.py (1)
  • parse_output (665-681)
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py

570-570: Function call with shell=True parameter identified, security issue

(S604)


659-659: f-string without any placeholders

Remove extraneous f prefix

(F541)


680-680: Use of assert detected

(S101)


681-681: Use of assert detected

(S101)


686-686: Use of assert detected

(S101)


1633-1633: Unused function argument: llm_root

(ARG001)


2806-2806: Use of assert detected

(S101)

🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md

14-14: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1)

74-74: Correct fix: SOL time should scale with balance ratio (not divide).

Multiplying by balance_ratio_i = avg_tokens/max_tokens correctly models the ideal per-iteration time as time_i * balance_ratio_i. Good catch.

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1)

239-245: Double-check batch/token limits for B200; numbers look unusually high.

max_num_tokens = 133120 and max_batch_size = 1024 (TP low/medium) are aggressive for 120B. Please validate these against actual memory headroom for the recommended kv_cache_free_gpu_memory_fraction and your sequence lengths (ISL/OSL). If they’re theoretical maxima, label them as such and provide a safe default row.

docs/source/reference/multimodal-feature-support-matrix.md (3)

3-14: Cross-matrix consistency check.

If other backend matrices (e.g., TensorRT) list Llama 4, mirror this change there (or explicitly scope this table to PyTorch only, which the header implies).


9-9: No change—row already uses canonical “Llama 4”.

Likely an incorrect or invalid review comment.


9-9: Verify Llama 4 encoder IFB support in Python inference
No Python code paths reference “IFB” or “in-flight batching” for Llama 4. Determine whether encoder IFB is only enabled under the C++ session (via python_e2e, batch-size limits, scheduling flags, etc.); if so, update the matrix entry or add a footnote with the gating conditions.

docs/source/commands/trtllm-serve/trtllm-serve.rst (5)

204-209: LGTM! Clear documentation of PyTorch backend limitations

The note effectively communicates the beta status and limitations of the PyTorch backend metrics, including the performance impact caveat. This helps set appropriate user expectations.


212-218: Configuration structure update is well documented

The change from nested pytorch_backend_config to top-level enable_iter_perf_stats is clearly documented with proper YAML formatting.


223-223: Server command update matches configuration changes

The updated command properly references the new config filename convention.


225-227: Important usage guidance for metrics polling

This is valuable information about the queue-based metrics storage and the need for timely polling. The guidance helps prevent users from missing metrics data.


236-256: Comprehensive metrics example output

The updated JSON example properly reflects the current PyTorch backend metrics structure, including detailed KV cache statistics. The formatting and field coverage look accurate.

docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (2)

61-71: Hugging Face repository identifiers verified

The Hugging Face repos openai/gpt-oss-120b and nvidia/gpt-oss-120b-Eagle3 both exist and correspond to the intended base model and Eagle3 speculative decoding checkpoint, respectively.


82-95: Use valid speculative_config.decoding_type value “eagle”
speculative_config.decoding_type must be one of the accepted values (explicit_draft_tokens, eagle, ngram, user_provided, auto); update from Eagle to eagle.

Likely an incorrect or invalid review comment.

tests/integration/defs/test_e2e.py (4)

488-489: Returning the parsed benchmark result is good.

This makes the runner composable in tests.


511-512: Dynamic num-requests in dataset generation — LGTM.

Keeps dataset size proportional to test intent.


1560-1570: Add cache salt test — LGTM.

Installs requirements and executes targeted test as expected.


1758-1760: Regex now handles single or double quotes — LGTM.

Improves robustness to varied output quoting.

Comment on lines 98 to 102
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.

coderabbitai bot (Contributor) commented:

🛠️ Refactor suggestion

Add a security warning for trust_remote_code: true.

Readers should understand the risk and how to mitigate.

 Notes:
 - Ensure your base model directory is `/config/models/gpt-oss-120b`.
 - Ensure your Eagle3 assets are present under `/config/models/eagle/`.
 - If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
+- Warning: `trust_remote_code: true` executes repository code during model load. Only enable for sources you trust and pin exact revisions.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
- Warning: `trust_remote_code: true` executes repository code during model load. Only enable for sources you trust and pin exact revisions.
🤖 Prompt for AI Agents
In docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md around lines 98 to 102,
add a clear security warning paragraph about using trust_remote_code: true
explaining that it executes arbitrary remote code, advising readers to only
enable it for trusted, vetted checkpoints; run with isolation (containers, VMs),
limited privileges, and network restrictions; pin and verify model/package
versions and checksums, review remote code before enabling, and recommend
alternatives (e.g., local converted models or vendor-signed artifacts) and
explicit steps to revert the flag.

Comment on lines 248 to 259
```shell
# execute this command in gpt-oss
python -m gpt_oss.evals \
--sampler chat_completions \
--eval gpqa,aime25 \
--model gpt-oss-120b \
--reasoning-effort low,medium
```
Contributor

🛠️ Refactor suggestion

Make the example runnable: set OpenAI-compatible endpoint env vars and align model name with server.

Without OPENAI_BASE_URL/OPENAI_API_KEY, gpt_oss.evals won’t hit the local TRT-LLM server. Also, earlier sections use openai/gpt-oss-120b; align here for consistency.

-# execute this command in gpt-oss
-python -m gpt_oss.evals \
+# In the gpt-oss repo:
+export OPENAI_BASE_URL="http://localhost:8000/v1"
+export OPENAI_API_KEY="dummy"
+
+python -m gpt_oss.evals \
   --sampler chat_completions \
   --eval gpqa,aime25 \
-  --model gpt-oss-120b \
+  --model openai/gpt-oss-120b \
   --reasoning-effort low,medium
🤖 Prompt for AI Agents
In docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
around lines 248-255, the shell example lacks the OpenAI-compatible environment
variables and uses an inconsistent model name; update the instructions to
require setting OPENAI_BASE_URL and OPENAI_API_KEY to point at the local TRT-LLM
server and change the --model value to openai/gpt-oss-120b to match earlier
sections (showing either exporting those env vars beforehand or prefixing the
run command with them).

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from fbec724 to dea4d74 Compare September 9, 2025 07:03
@venkywonka venkywonka marked this pull request as ready for review September 9, 2025 07:04
@venkywonka venkywonka requested review from a team as code owners September 9, 2025 07:04
@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from dea4d74 to 227fed5 Compare September 9, 2025 07:04
@venkywonka venkywonka self-assigned this Sep 9, 2025
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18178 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18178 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13623 completed with status: 'FAILURE'

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from 227fed5 to bd2a1b6 Compare September 9, 2025 15:55
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18244 [ run ] triggered by Bot

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from bd2a1b6 to f925fa9 Compare September 15, 2025 17:50
if finish_reasons[src_idx] != tllm.FinishReason.CANCELLED:
if len(output.logprobs) > output.length:
# Check if logprobs is a list (not a dict or other structure)
if isinstance(output.logprobs, list) and len(
Collaborator Author

@venkywonka venkywonka Sep 15, 2025

output.logprobs will always be of list type (List[float] or TokenLogprobs: TypeAlias = list[dict[int, Logprob]]), so the isinstance check is unnecessary.
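For illustration, a minimal self-contained sketch of the two list shapes; the `Logprob` record here is a stand-in for the library's own type (its `logprob`/`rank` fields are assumed):

```python
# Sketch only: the two shapes output.logprobs can take.
from dataclasses import dataclass
from typing import TypeAlias

@dataclass
class Logprob:          # stand-in for the library's Logprob (fields assumed)
    logprob: float
    rank: int

TokenLogprobs: TypeAlias = list[dict[int, Logprob]]

plain: list[float] = [-0.11, -0.42]     # one float per token
top_k: TokenLogprobs = [                # one {token_id: Logprob} dict per token
    {17: Logprob(logprob=-0.11, rank=1)},
    {99: Logprob(logprob=-0.42, rank=1)},
]
assert isinstance(plain, list) and isinstance(top_k, list)  # both are plain lists
```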

asyncio.run(main())


@pytest.mark.skip(reason="https://nvbugs/5516660")
Collaborator Author

@venkywonka venkywonka Sep 16, 2025

@chzblych @hchings @LinPoly removing this skip, as I was able to fix the bug in result.py.

@venkywonka
Collaborator Author

/bot run --extra-stage "A100X-TensorRT-Post-Merge-1, A100X-TensorRT-Post-Merge-2, A100X-TensorRT-Post-Merge-3, A100X-TensorRT-Post-Merge-4, A100X-TensorRT-Post-Merge-5, A100X-TensorRT-Post-Merge-6"

1 similar comment

@tensorrt-cicd
Collaborator

PR_Github #19032 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19032 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14273 completed with status: 'FAILURE'

@venkywonka venkywonka force-pushed the prompt-logprobs-sampling branch from 4fc94a4 to 142d29d Compare September 19, 2025 05:46

@pytest.mark.skip(reason="https://nvbugs/5516660")
@force_ampere
# @force_ampere
Collaborator Author

uncomment this.

query_len: int,
sampling_params: SamplingParams,
is_gen_only: bool) -> None:
streaming: bool = False,
Collaborator Author

unnecessary now

Signed-off-by: Venky Ganesh <[email protected]>
if streaming:
    # need this so that context_logits / prompt_logprobs are not dropped
    # in the 2nd reuse of llm.generate() in streaming mode
    kv_cache_args_extra["enable_block_reuse"] = False
Collaborator Author

@tijyojwad this was the knob that made things work for prompt_logprobs
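For context, a minimal sketch of how a caller might apply this knob when requesting prompt logprobs in streaming mode; the model path is a placeholder, and the exact attribute that exposes prompt logprobs on the streamed output is an assumption:

```python
# Sketch: disable KV block reuse so context logits (and hence prompt_logprobs)
# are still produced on a repeated llm.generate_async() call in streaming mode.
import asyncio

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),
)
sampling_params = SamplingParams(max_tokens=8, prompt_logprobs=1)

async def main() -> None:
    async for chunk in llm.generate_async("The future of AI is",
                                          sampling_params=sampling_params,
                                          streaming=True):
        # Attribute path assumed; adjust to where your build exposes prompt logprobs.
        print(chunk.outputs[0].prompt_logprobs)

asyncio.run(main())
```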

# each streamed response_tensors.log_probs[src_idx]
# contains a streamwise monotonically growing list of logprobs.
# so we need to accumulate only the new ones unique to that particular streamed response
output.logprobs += response_tensors.log_probs[src_idx][
Collaborator Author

@hchings @LinPoly this lack of accumulation was causing https://nvbugs/5516660; fixed here.
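For illustration, a standalone sketch of the accumulation scheme described above (simplified names, not the PR's actual code): each streamed payload carries the full logprobs list so far, so only the suffix beyond what was already consumed gets appended.

```python
# Each streamed payload is a monotonically growing list of logprobs; keep a
# counter of how many were already consumed and append only the new suffix.
def accumulate_streamed_logprobs(accumulated: list[float],
                                 last_len: int,
                                 streamed: list[float]) -> int:
    new_items = streamed[last_len:]
    accumulated.extend(new_items)
    return last_len + len(new_items)

acc: list[float] = []
consumed = 0
for payload in ([-0.1], [-0.1, -0.5], [-0.1, -0.5, -0.2]):  # growing lists
    consumed = accumulate_streamed_logprobs(acc, consumed, payload)
assert acc == [-0.1, -0.5, -0.2]   # no duplicates across streamed chunks
```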

@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19295 [ run ] triggered by Bot

@venkywonka venkywonka changed the title [TRTLLM-7015] [feat] Enable prompt token logprobs across backends [TRTLLM-7015] [feat] Enable prompt token logprobs in pytorch backend Sep 19, 2025
@venkywonka venkywonka changed the title [TRTLLM-7015] [feat] Enable prompt token logprobs in pytorch backend [TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend Sep 19, 2025
@venkywonka
Collaborator Author

@coderabbitai full review

Contributor

coderabbitai bot commented Sep 19, 2025

✅ Actions performed

Full review triggered.

prompt_logprobs=prompt_logprobs,
return_context_logits=return_context_logits,
return_generation_logits=return_generation_logits)
with llm:
Collaborator

How about restoring this to use the LLM directly without the with-statement? The LLM instance should be able to clean up without requiring an explicit __exit__ or shutdown. Introducing a with-statement might obscure issues that prevent proper destruction of the LLM instance.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
tensorrt_llm/executor/result.py (1)

247-272: Fix logprobs accumulation logic to prevent out-of-bounds access.

The current implementation slices response_tensors.log_probs[src_idx] from output._last_logprobs_len: without checking if the index is valid. If output._last_logprobs_len exceeds the length of the logprobs list, this will result in an empty slice but no error. However, the logic then assumes there are new logprobs to append, which may not be accurate.

Consider adding a bounds check:

-            output.logprobs += response_tensors.log_probs[src_idx][
-                output._last_logprobs_len:]
+            new_logprobs = response_tensors.log_probs[src_idx]
+            if output._last_logprobs_len < len(new_logprobs):
+                output.logprobs += new_logprobs[output._last_logprobs_len:]
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

281-291: Consider adding error handling for missing attributes.

The clear_context_logits method directly accesses _py_result._context_logits without checking if the attribute exists. While the docstring indicates when this is used, defensive programming would make the code more robust.

Apply this diff to add defensive checks:

 def clear_context_logits(self):
     """Clear context logits from the response result.
 
     This is used to drop context logits after prompt_logprobs have been computed
     when the user didn't explicitly request them.
     """
     if self.result and hasattr(self.result, '_py_result'):
         py_result = self.result._py_result
-        if hasattr(py_result, '_context_logits'):
-            py_result._context_logits = None
+        if hasattr(py_result, '_context_logits') and py_result._context_logits is not None:
+            py_result._context_logits = None
tensorrt_llm/sampling_params.py (1)

453-467: Consider refining the logging message for clarity.

While the implementation correctly enables return_context_logits when needed for prompt_logprobs computation, the info log message could be more concise and clearer about the internal nature of this behavior.

 if self.prompt_logprobs and not self.return_context_logits:
-    logger.info(
-        "Since prompt_logprobs is requested but return_context_logits is False, "
-        "internally enabling context logits for prompt logprobs computation. "
-        "context logits will be dropped after computation as the user didn't explicitly request them."
-    )
+    logger.debug(
+        "Internally enabling context_logits for prompt_logprobs computation "
+        "(will be dropped after use)"
+    )

Consider using debug level instead of info since this is an internal implementation detail that users don't need to act upon.

tensorrt_llm/executor/worker.py (2)

1074-1075: Add defensive check for context_logits availability.

While the assertion message is helpful, consider a more graceful error handling approach for production code.

Consider replacing the assertion with a proper exception:

-    assert context_logits is not None, "context_logits cannot be None when prompt_logprobs is requested."
+    if context_logits is None:
+        raise ValueError("context_logits cannot be None when prompt_logprobs is requested. "
+                        "Ensure return_context_logits is enabled in sampling params.")

1078-1079: Consider documenting the caching behavior.

The caching of prompt logprobs in streaming mode is an important implementation detail that should be documented.

Add a comment explaining the caching strategy:

     if generation_result._streaming:
+        # Cache prompt logprobs to avoid recomputation in subsequent streaming iterations
         generation_result._cached_prompt_logprobs = logprobs_result.prompt
tests/unittest/llmapi/test_llm.py (1)

1869-1869: Remove unused parameter id from async function.

The static analysis correctly identifies that the id parameter in the async task function is never used.

-            async def task(id: int, prompt: str):
+            async def task(prompt: str):

And update the task creation:

-                tasks = [task(id, prompt) for id, prompt in enumerate(prompts)]
+                tasks = [task(prompt) for prompt in prompts]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fcc054 and f3dcad0.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (2 hunks)
  • tensorrt_llm/executor/result.py (2 hunks)
  • tensorrt_llm/executor/worker.py (3 hunks)
  • tensorrt_llm/llmapi/llm.py (1 hunks)
  • tensorrt_llm/sampling_params.py (2 hunks)
  • tests/unittest/llmapi/test_llm.py (1 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm.py
  • tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/unittest/llmapi/test_llm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/unittest/llmapi/test_llm.py
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm.py

1869-1869: Unused function argument: id

(ARG001)

🔇 Additional comments (14)
tensorrt_llm/executor/result.py (1)

703-722: LGTM! Clear and comprehensive docstring update.

The updated docstring for compute_logprobs clearly explains the function's purpose and distinguishes between different backend behaviors for prompt and generation logprobs computation.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

390-416: Excellent documentation for the create_response method.

The comprehensive docstring clearly explains the method's behavior, parameters, and return semantics. The implementation correctly wraps the serialized result in an LlmResponse with appropriate PyTorch-specific handling.

tests/unittest/llmapi/test_llm_pytorch.py (2)

884-904: LGTM! Comprehensive test coverage for PyTorch backend logprobs.

The test properly validates various combinations of prompt_logprobs and generation logprobs for the PyTorch backend, including the case where prompt_logprobs works without explicitly enabling context_logits.


906-924: Well-structured streaming test for logprobs.

The streaming test complements the non-streaming test by covering key scenarios including prompt-only, generation-only, and combined logprobs cases.

tensorrt_llm/llmapi/llm.py (2)

572-577: LGTM! Clean handling of the is_gen_only parameter.

The addition of the default value for is_gen_only makes the method signature more flexible while maintaining backward compatibility.


578-591: Good simplification of PyTorch backend validation.

Removing the prompt_logprobs restriction for the PyTorch backend aligns with the PR's goal of enabling this feature. The remaining validations for logprobs <= 1 and max_tokens checks are appropriate.

tensorrt_llm/executor/worker.py (4)

17-17: Import statement looks good.

The addition of LlmResponse import from the PyTorch executor module is appropriate for handling PyTorch backend responses with context logits.


1061-1081: Well-structured helper function for PyTorch prompt logprobs.

The implementation correctly handles caching for streaming mode and cleanly separates PyTorch-specific logic.


1083-1085: Type annotation properly updated to support both response types.

The union type allows flexible handling of both TRT and PyTorch backend responses.


1101-1113: PyTorch backend path correctly implements prompt logprobs handling.

The logic properly:

  1. Early returns when prompt_logprobs is not requested
  2. Computes prompt logprobs using the helper function
  3. Clears context logits when requested to avoid memory overhead

Good separation of concerns between PyTorch and TRT backends.
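A schematic of that flow follows; all helper and attribute names here are hypothetical stand-ins except clear_context_logits, which this PR adds.

```python
# Schematic only: early-return when prompt logprobs were not requested,
# otherwise derive them from context logits, cache for streaming, and drop
# the logits if the user never asked for them.
import torch

def _prompt_logprobs_from_context_logits(context_logits: torch.Tensor) -> torch.Tensor:
    # stand-in helper: per-position log-probabilities over the vocabulary
    return torch.log_softmax(context_logits.float(), dim=-1)

def pytorch_logprobs_path(response, logprob_params, generation_result):
    if not logprob_params.prompt_logprobs:
        return None  # generation logprobs already come from the sampler

    prompt_logprobs = _prompt_logprobs_from_context_logits(
        response.result.context_logits)

    if generation_result._streaming:
        # reuse across streamed chunks instead of recomputing
        generation_result._cached_prompt_logprobs = prompt_logprobs

    if not logprob_params.return_context_logits:
        response.clear_context_logits()  # free memory the user did not ask for

    return prompt_logprobs
```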

tests/unittest/llmapi/test_llm.py (4)

1798-1825: Well-structured test harness with proper backend handling.

The test harness correctly handles:

  • Backend-specific LLM class selection
  • KV cache configuration for streaming mode
  • Proper resource cleanup with the context manager pattern

1867-1886: Comprehensive streaming validation logic.

The streaming test properly validates that:

  • Streaming logprobs accumulation matches non-streaming results
  • Prompt logprobs are preserved across streaming iterations

Good test coverage for the streaming scenario.


1807-1814: Smart workaround for PyTorch streaming limitation. Tests reference prompt_logprobs/context_logits behavior (tests/unittest/llmapi/test_llm_pytorch.py:888,892); no TODO or tracking issue references found in the repo — create or link a tracking issue and add a brief inline code comment documenting this limitation.


1888-1908: Add PyTorch-specific logprobs tests (tests/unittest/llmapi/test_llm_pytorch.py)

test_llm_pytorch.py imports llm_return_logprobs_test_harness but no explicit PyTorch logprobs tests were found; add parametrized cases mirroring the TRT tests to cover:

  • prompt_logprobs with context_logits
  • prompt_logprobs without context_logits
  • generation logprobs only (top-k)
  • no logprobs

Use llm_return_logprobs_test_harness to implement these cases.
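A hedged sketch of what those parametrized cases could look like; the harness signature, import path, and the backend keyword are assumptions based on how the harness is used elsewhere in this PR.

```python
# Sketch only: parametrized PyTorch-backend logprobs cases built on the shared
# harness; argument names/order are assumed, adjust to the real signature.
import pytest

from .test_llm import llm_return_logprobs_test_harness  # import path assumed

@pytest.mark.parametrize(
    "prompt_logprobs, logprobs, return_context_logits, return_generation_logits",
    [
        (1, None, True, False),     # prompt_logprobs with context_logits
        (1, None, False, False),    # prompt_logprobs without context_logits
        (None, 1, False, False),    # generation logprobs only (top-k)
        (None, None, False, False), # no logprobs
    ],
)
def test_llm_return_logprobs_pytorch(prompt_logprobs, logprobs,
                                     return_context_logits,
                                     return_generation_logits):
    llm_return_logprobs_test_harness(prompt_logprobs,
                                     logprobs,
                                     return_context_logits,
                                     return_generation_logits,
                                     backend="pytorch")
```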

@venkywonka venkywonka requested a review from Copilot September 19, 2025 15:35
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enables prompt_logprobs support in the PyTorch backend to unify logprobs functionality across TensorRT and PyTorch backends. The implementation automatically captures required context logits when prompt logprobs are requested and provides an option to clear stored context logits after use to reduce memory consumption.
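A minimal sketch (not the PR's code) of how prompt logprobs are derived from context logits, assuming the common convention that logits row i scores the token at prompt position i + 1:

```python
import torch

def prompt_logprobs_from_context_logits(
        context_logits: torch.Tensor,        # [prompt_len, vocab_size]
        prompt_token_ids: list[int]) -> list[float | None]:
    # The first prompt token has no preceding position, hence no logprob.
    logprobs: list[float | None] = [None]
    log_probs = torch.log_softmax(context_logits.float(), dim=-1)
    for pos in range(1, len(prompt_token_ids)):
        logprobs.append(log_probs[pos - 1, prompt_token_ids[pos]].item())
    return logprobs
```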

Key Changes

  • Added PyTorch backend support for prompt logprobs computation from context logits
  • Implemented automatic context logits capture when prompt logprobs are requested but context logits aren't explicitly enabled
  • Enhanced streaming mode to properly handle prompt logprobs with consistent caching behavior

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tensorrt_llm/sampling_params.py | Added logic to auto-enable context logits when prompt_logprobs is requested in PyTorch backend |
| tensorrt_llm/llmapi/llm.py | Removed PyTorch backend restriction for prompt_logprobs support |
| tensorrt_llm/executor/worker.py | Enhanced logprobs computation to handle PyTorch backend prompt logprobs with caching for streaming |
| tensorrt_llm/executor/result.py | Updated logprobs handling to differentiate between TRT and PyTorch backend provenance |
| tensorrt_llm/_torch/pyexecutor/llm_request.py | Added clear_context_logits method and improved documentation |
| tests/unittest/llmapi/test_llm.py | Updated test harness with backend parameter and streaming improvements |
| tests/unittest/llmapi/test_llm_pytorch.py | Added comprehensive test coverage for PyTorch backend logprobs functionality |


Comment on lines +459 to +465
# TODO(venky): Find a more elegant way to do this.
# NOTE: This is an internal hack, so we can entirely avoid introducing
# `prompt_logprobs` into the executor bindings and further into
# model engine / sampler.
# This is because, prompt_logprobs is a derived quantity from
# context logits, and the capability to post-compute it
# already exists in the worker. (see _get_logprobs in worker.py)
Copilot AI Sep 19, 2025

This TODO comment indicates a design concern about the current implementation being a 'hack'. Consider creating a GitHub issue to track the technical debt for refactoring this approach to a more elegant solution.


# In streaming mode, since out-of-order responses are not possible,
# each streamed response_tensors.log_probs[src_idx]
# contains a streamwise monotonically growing list of logprobs.
# so we need to accumulate only the new ones unique to that particular streamed response
Copilot AI Sep 19, 2025

The slicing logic [output._last_logprobs_len:] to get only new logprobs in streaming mode could be error-prone if _last_logprobs_len becomes inconsistent. Consider adding an assertion to verify that output._last_logprobs_len <= len(response_tensors.log_probs[src_idx]) before slicing.

Suggested change
# so we need to accumulate only the new ones unique to that particular streamed response
assert output._last_logprobs_len <= len(response_tensors.log_probs[src_idx]), (
    f"_last_logprobs_len ({output._last_logprobs_len}) > log_probs length ({len(response_tensors.log_probs[src_idx])})"
)


parent class's serialized result in a PyTorch-specific LlmResponse object.
Args:
use_fast_logits (bool, optional): Whether to use fast logits computation
Collaborator

The comment is a bit vague and probably incorrect. This boolean controls the direct logits transfer from the draft model to the target model in speculative decoding. I actually don't know if it applies to the PyTorch backend as implemented.

Args:
use_fast_logits (bool, optional): Whether to use fast logits computation
for improved performance. Defaults to False.
mpi_world_rank (int, optional): The MPI world rank for distributed
Collaborator

Same thing: this is the leader rank of the draft model when using direct logits transfer.

if is_pytorch_backend:
    if not logprob_params.prompt_logprobs:
        # PyTorch: generation logprobs computed in sampler, no post-processing needed
        return
Collaborator

return None to be explicit?
