[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend #7580
base: main
Conversation
📝 Walkthrough
Adds an LlmResponse wrapper with clear_context_logits; updates LlmRequest.create_response to return LlmResponse. Refactors logprobs handling across backends: executor/result accumulation, worker computation paths for PyTorch vs TRT using logits. Adjusts LLM argument checks and sampling params to auto-enable context logits. Expands tests for multi-backend and streaming.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant LLM
    participant Executor
    participant Worker
    participant Sampler
    participant Response
    Client->>LLM: submit(prompt, sampling_params)
    LLM->>Executor: run request
    Executor->>Sampler: generate tokens
    Sampler-->>Executor: tokens (+ backend-provided logprobs?)
    Executor-->>LLM: serialized result + logits (optional)
    LLM-->>Response: LlmResponse(create_response)
    Note over Response: result.context_logits / generation_logits may be present
    LLM->>Worker: _get_logprobs(response, backend)
    alt PyTorch backend
        Worker->>Worker: _compute_pytorch_prompt_logprobs using context_logits
        opt streaming
            Worker->>Worker: cache/use prior prompt logprobs
        end
        Worker->>Response: clear_context_logits() (optional)
    else TRT backend
        Worker->>Worker: compute from context_logits and generation_logits
        Worker->>Response: drop logits per flags
    end
    Worker-->>LLM: LogProbsResult
    LLM-->>Client: outputs + logprobs
```
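In the PyTorch branch above, prompt logprobs are derived from the request's context logits. Below is a minimal illustrative sketch of that kind of computation, not the PR's actual implementation; only the helper name `_compute_pytorch_prompt_logprobs` comes from the diagram, and the signature and return shape here are assumptions.

```python
from typing import Dict, List

import torch


def compute_prompt_logprobs_sketch(context_logits: torch.Tensor,
                                   prompt_token_ids: List[int],
                                   top_k: int) -> List[Dict[int, float]]:
    """Illustrative only: per-position top-k logprobs over the prompt.

    context_logits: [prompt_len, vocab_size] logits returned for the prompt.
    The logits at position i predict token i + 1, so the first prompt token
    has no logprob entry.
    """
    logprobs = torch.log_softmax(context_logits.float(), dim=-1)
    results: List[Dict[int, float]] = []
    for pos in range(len(prompt_token_ids) - 1):
        row = logprobs[pos]
        top_vals, top_ids = row.topk(top_k)
        entry = {int(t): float(v) for t, v in zip(top_ids, top_vals)}
        # Also record the actual next prompt token, mirroring typical logprob APIs.
        next_token = prompt_token_ids[pos + 1]
        entry.setdefault(next_token, float(row[next_token]))
        results.append(entry)
    return results
```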
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
tensorrt_llm/sampling_params.py (1)
1-1
: Add required NVIDIA Apache-2.0 header (2025).This file is missing the standard header mandated by the repo guidelines.
Apply at file top:
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # SPDX-License-Identifier: Apache-2.0tensorrt_llm/llmapi/llm.py (1)
1-1
: Add required NVIDIA Apache-2.0 header (2025).Header is missing.
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # SPDX-License-Identifier: Apache-2.0tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
1-1
: Add required NVIDIA Apache-2.0 header (2025).Header is missing.
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # SPDX-License-Identifier: Apache-2.0tensorrt_llm/executor/result.py (1)
1-1
: Add required NVIDIA Apache-2.0 header (2025).Header is missing.
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # SPDX-License-Identifier: Apache-2.0tensorrt_llm/serve/postprocess_handlers.py (1)
1-1
: Add NVIDIA Apache-2.0 header (2025) as required by guidelines.The file lacks the mandated header.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tensorrt_llm/executor/worker.py (2)
1014-1021
: Guard against error responses or missing result to avoid attribute errors.on error responses may have no result; early-return to prevent None access.
logprobs_result = None generation_result = worker._results.get(response.client_id, None) if not generation_result: return + # Skip when response has errors or no result payload yet + if hasattr(response, "has_error") and response.has_error(): + return + if not getattr(response, "result", None): + return
1-1
: Add NVIDIA Apache-2.0 header (2025) as required by guidelines.The file lacks the mandated header.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tests/unittest/llmapi/apps/_test_openai_completions.py (1)
1-1
: Add NVIDIA Apache-2.0 header (2025).Per coding guidelines, prepend the NVIDIA Apache-2.0 copyright header to all .py files.
Apply at file top:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License.tensorrt_llm/serve/openai_protocol.py (1)
1-1
: Add NVIDIA Apache-2.0 header (2025).Per guidelines, all source files must include the NVIDIA Apache-2.0 header.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License.tests/unittest/llmapi/test_llm.py (1)
1-1
: Add NVIDIA Apache-2.0 header (2025).Tests are also source files; add the standard header.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License.
🧹 Nitpick comments (11)
tensorrt_llm/sampling_params.py (1)
452-456
: Auto-enabling context logits for PyTorch: also flag as auto-enabled.To keep behavior consistent with TRT (and make downstream “need to return” checks accurate), set the private auto-enabled flag when you flip return_context_logits.
Apply this diff:
- if self.prompt_logprobs and not self.return_context_logits: - config_kwargs["return_context_logits"] = True + if self.prompt_logprobs and not self.return_context_logits: + config_kwargs["return_context_logits"] = True + # Mirror TRT flow: mark as internally enabled so consumers can safely drop them. + self._context_logits_auto_enabled = True

Please confirm worker logic drops context logits when prompt_logprobs is requested but return_context_logits was not explicitly set by the user.
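For reference, a hedged usage sketch of the behavior this comment targets: a caller requests `prompt_logprobs` without `return_context_logits`, and context logits are expected to be enabled internally and dropped after the computation. The model name and output fields are illustrative assumptions.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
params = SamplingParams(max_tokens=8,
                        prompt_logprobs=2,
                        return_context_logits=False)

for output in llm.generate(["The capital of France is"], params):
    print(output.outputs[0].text)
    # Expectation: prompt logprobs are populated, while context logits are
    # not surfaced because the user never asked for them explicitly.
```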
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
281-291
: Good: targeted API to drop context logits.Method is concise and safely nulls the Python-side storage.
Consider adding a symmetric clear_generation_logits() for parity with drop_generation_logits flows.
tests/unittest/llmapi/apps/_test_openai_chat.py (1)
187-243
: Tighten assertions; drop noisy prints for CI hygiene.
- The conditional print statements make test output noisy and aren’t actionable. Prefer silent acceptance with explicit assertions or a short skip/xfail note.
- Keep the “accepts but may not return” behavior, but avoid printing.
Apply this diff to remove prints while preserving intent:
@@ - print( - f"✓ Found prompt_logprobs with {len(choice.prompt_logprobs)} entries" - ) @@ - print( - f"✓ Found prompt_logprobs in message with {len(message.prompt_logprobs)} entries" - ) @@ - print( - f"Note: prompt_logprobs accepted in request but not yet included in response (expected limitation)" - ) + # prompt_logprobs currently accepted but not returned (known limitation) + passtensorrt_llm/serve/postprocess_handlers.py (2)
60-63
: Fix type hint to reflect accepted input shapes.The function consumes ints/floats, dicts, sequences, and objects with a logprob attribute; List[float] is misleading.
Apply this diff:
-from typing import List, Literal, Optional, Tuple, Union +from typing import List, Literal, Optional, Tuple, Union, Sequence, Any @@ -def create_logprobs(token_ids: List[int], - tokenizer: TransformersTokenizer, - logprobs: List[float]) -> ChatCompletionLogProbs: +def create_logprobs(token_ids: List[int], + tokenizer: TransformersTokenizer, + logprobs: Sequence[Any]) -> ChatCompletionLogProbs:
27-27
: Fix typo in formatter pragma.Comment should read “enable”.
-# yapf: enale +# yapf: enabletensorrt_llm/executor/worker.py (2)
1009-1013
: Compress one-line docstring (D200).Ruff warns; keep on a single line.
- """Compute logprob and prompt logprob and clear out logits if applicable. - """ + """Compute logprobs and prompt_logprobs, and clear logits if applicable."""
1022-1036
: PyTorch branch: compute only when needed; keep cleanup but be defensive.Looks good overall. Minor nit: keep hasattr checks before touching backend-specific cleanup to avoid surprises with mixed response types.
tests/unittest/llmapi/apps/_test_openai_completions.py (3)
357-404
: Make the expectation explicit; avoid prints.Prefer an assertion over print so the test fails if the response shape unexpectedly changes. Use getattr for forward-compat.
- # Check if prompt_logprobs are in the response (for future compatibility) - if hasattr(choice, - 'prompt_logprobs') and choice.prompt_logprobs is not None: - # Future: When prompt_logprobs are added to the response - assert len( - choice.prompt_logprobs) > 0, "prompt_logprobs should not be empty" - print( - f"✓ Found prompt_logprobs with {len(choice.prompt_logprobs)} entries" - ) - else: - # Current state: prompt_logprobs are computed internally but not returned - # This test verifies the feature doesn't cause errors - print( - f"Note: prompt_logprobs accepted in request but not yet included in response (expected limitation)" - ) + plp = getattr(choice, "prompt_logprobs", None) + # Current behavior: not returned; assert None to catch unexpected changes. + # When support lands, update the assertion to validate structure. + assert plp is None
406-452
: Strengthen generation logprobs validation; keep backend logic.Assert expected K for returned top_logprobs to catch regressions.
- # Verify we got generation logprobs - assert choice.logprobs is not None, "Generation logprobs were requested but not returned" + # Verify we got generation logprobs with expected top-K (allow K or K+1 for sampled token) + assert choice.logprobs is not None, "Generation logprobs were requested but not returned" + k = generation_logprobs + assert len(choice.logprobs.top_logprobs) >= 1 + assert len(choice.logprobs.top_logprobs[0]) in {k, k + 1}
454-481
: Assert no context logits leaked when not requested.This confirms the auto-enable/drop path behaves as intended.
choice = completion.choices[0] assert len(choice.text) >= 1 + # Should not surface context logits when explicitly disabled. + assert getattr(choice, "context_logits", None) is Nonetests/unittest/llmapi/test_llm.py (1)
1879-1907
: Backend matrix: confirm PyTorch returns context logits.This param set includes (prompt_logprobs=2, return_context_logits=True, backend="pytorch"). If PyTorch still doesn't expose context logits, this will fail. Either keep the case only if supported, or mark xfail/skip dynamically.
@pytest.mark.parametrize( "prompt_logprobs, logprobs, return_context_logits, return_generation_logits, backend", [ @@ - (2, None, True, False, "pytorch" - ), # prompt_logprobs with context_logits + pytest.param(2, None, True, False, "pytorch", marks=pytest.mark.xfail(reason="PyTorch context logits exposure")),If PyTorch now supports returning context logits, keep as-is and remove xfail.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (7)
docs/source/blogs/media/tech_blog10_baseline_performance_detail.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_baseline_performance_overview.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_baseline_round_robin_strategy.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_context_wait_performance.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_dataset_token_distribution.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_full_strategy_performance.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_tps_ttft_pareto_curve.png
is excluded by!**/*.png
📒 Files selected for processing (11)
tensorrt_llm/_torch/pyexecutor/llm_request.py
(1 hunks)tensorrt_llm/executor/result.py
(1 hunks)tensorrt_llm/executor/worker.py
(3 hunks)tensorrt_llm/llmapi/llm.py
(1 hunks)tensorrt_llm/sampling_params.py
(1 hunks)tensorrt_llm/serve/openai_protocol.py
(4 hunks)tensorrt_llm/serve/postprocess_handlers.py
(1 hunks)tests/integration/defs/test_e2e.py
(1 hunks)tests/unittest/llmapi/apps/_test_openai_chat.py
(1 hunks)tests/unittest/llmapi/apps/_test_openai_completions.py
(1 hunks)tests/unittest/llmapi/test_llm.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/executor/result.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/sampling_params.py
tensorrt_llm/serve/postprocess_handlers.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/llmapi/apps/_test_openai_chat.py
tests/integration/defs/test_e2e.py
tensorrt_llm/executor/worker.py
tensorrt_llm/serve/openai_protocol.py
tests/unittest/llmapi/apps/_test_openai_completions.py
tests/unittest/llmapi/test_llm.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/executor/result.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/sampling_params.py
tensorrt_llm/serve/postprocess_handlers.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/llmapi/apps/_test_openai_chat.py
tests/integration/defs/test_e2e.py
tensorrt_llm/executor/worker.py
tensorrt_llm/serve/openai_protocol.py
tests/unittest/llmapi/apps/_test_openai_completions.py
tests/unittest/llmapi/test_llm.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/executor/result.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/sampling_params.py
tensorrt_llm/serve/postprocess_handlers.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/llmapi/apps/_test_openai_chat.py
tests/integration/defs/test_e2e.py
tensorrt_llm/executor/worker.py
tensorrt_llm/serve/openai_protocol.py
tests/unittest/llmapi/apps/_test_openai_completions.py
tests/unittest/llmapi/test_llm.py
🧠 Learnings (3)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/test_llm.py
🧬 Code graph analysis (8)
tensorrt_llm/executor/result.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
cum_log_probs
(230-231)log_probs
(226-227)tensorrt_llm/scaffolding/task.py (2)
cumulative_logprob
(94-96)logprobs
(99-100)cpp/include/tensorrt_llm/executor/types.h (1)
FinishReason
(502-597)
tensorrt_llm/serve/postprocess_handlers.py (1)
tensorrt_llm/serve/openai_protocol.py (1)
ChatCompletionLogProbsContent
(368-369)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
tensorrt_llm/executor/result.py (1)
result
(574-585)
tests/unittest/llmapi/apps/_test_openai_chat.py (1)
tests/unittest/llmapi/apps/_test_openai_completions.py (3)
client
(44-45)model_name
(16-17)backend
(21-22)
tensorrt_llm/executor/worker.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
LlmResponse
(272-290)get
(100-109)tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-491)tensorrt_llm/executor/result.py (1)
LogProbsResult
(57-60)
tensorrt_llm/serve/openai_protocol.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
Field
(69-96)tensorrt_llm/builder.py (1)
default
(50-58)
tests/unittest/llmapi/apps/_test_openai_completions.py (1)
tests/unittest/llmapi/apps/_test_openai_chat.py (3)
client
(82-83)model_name
(21-22)backend
(26-27)
tests/unittest/llmapi/test_llm.py (3)
tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)tests/unittest/llmapi/apps/_test_openai_chat.py (1)
backend
(26-27)tests/unittest/llmapi/apps/_test_openai_completions.py (1)
backend
(21-22)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker.py
1035-1036: One-line docstring should fit on one line
Reformat to one line
(D200)
🔇 Additional comments (6)
tensorrt_llm/llmapi/llm.py (1)
570-573
: Manually verify that `_TorchLLM.create`'s `return_logits` doesn't override context-logits capture. sampling_params.py already auto-enables context logits on PyTorch (tensorrt_llm/sampling_params.py lines 451–458) and executor/worker.py drops them when `_need_return_context_logits` is false (lines 1054–1058). Now confirm that the PyTorch LLM factory method (e.g. `_TorchLLM.create`) accepts and propagates `return_context_logits` alongside any `return_logits` flag so context logits aren't inadvertently suppressed.

tensorrt_llm/executor/result.py (1)
251-264
: Hardened logprobs handling looks correct.
- Preserves previous length for diffs.
- Slices only when logprobs is a list and asserts length matches output length.
tests/unittest/llmapi/apps/_test_openai_chat.py (1)
245-291
: Clarify backend expectations or align with earlier skip.Earlier in this file, PyTorch chat logprobs are skipped; this test asserts generation logprobs for both backends. If PyTorch support is now enabled, consider removing the earlier skip to avoid inconsistent expectations; otherwise, add a targeted skip here for PyTorch until parity lands.
tensorrt_llm/executor/worker.py (1)
17-17
: I've extracted the `LlmResponse` class definition and member methods to confirm `context_logits` and `clear_context_logits` are present. Let me know once you have the output.

tensorrt_llm/serve/openai_protocol.py (2)
306-309
: Pass-through looks good.Routing both logprobs and prompt_logprobs to SamplingParams is correct.
639-641
: LGTM on SamplingParams plumbing.Chat path correctly forwards both fields.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
docs/source/commands/trtllm-serve/trtllm-serve.rst (2)
186-186
: Inconsistent config filename in multi-node exampleThis section still uses the old filename convention
extra-llm-api-config.yml
while the rest of the document has been updated toextra_llm_config.yaml
. This should be updated for consistency.- echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true" > extra-llm-api-config.yml + echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true" > extra_llm_config.yamlAnd update the reference in the command:
- bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml" + bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra_llm_config.yaml"
82-86
: Another filename consistency issue in multimodal exampleThe multimodal section also uses the old filename convention. This should be updated for consistency with the rest of the document.
- cat >./extra-llm-api-config.yml<<EOF + cat >./extra_llm_config.yaml<<EOFAnd update the corresponding reference:
- --extra_llm_api_options ./extra-llm-api-config.yml + --extra_llm_api_options ./extra_llm_config.yamltests/integration/defs/test_e2e.py (1)
443-457
: Critical: TemporaryDirectory is immediately eligible for cleanup; work_dir may disappear.
self.work_dir = Path(tempfile.TemporaryDirectory().name)
loses the TemporaryDirectory handle, so the dir can be auto-deleted by GC, causing flaky I/O later.Apply this diff:
- self.work_dir = Path(tempfile.TemporaryDirectory().name) + # Keep a handle so the directory isn't cleaned up prematurely. + self._tmpdir = tempfile.TemporaryDirectory() + self.work_dir = Path(self._tmpdir.name)Optional: implement
__del__
to callself._tmpdir.cleanup()
explicitly.Also applies to: 481-483
🧹 Nitpick comments (16)
docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1)
73-86
: Notation/clarity nits: finite sum and math typesetting.Suggest replacing the infinite sum with a finite horizon (
T
iterations) and using\cdot
instead of*
. Also defineT
in the “where” list for completeness.Apply this diff:
-```math -sol\_time = \sum_{i=0}^{\infty} time_i * balance\_ratio_i -``` +```math +sol\_time = \sum_{i=0}^{T-1} time_i \cdot balance\_ratio_i +``` @@ - $elapsed\_time$: Total empirically measured end-to-end execution time - $actual\_tps$: Observed throughput in tokens per second - $sol\_tps$: Theoretical maximum throughput under perfect load balance +- $T$: Total number of iterations considereddocs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (3)
237-238
: Tighten wording + proper capitalization; align with earlier section naming.Minor edits for clarity and consistency with “Launch the TRT-LLM Server”.
-You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200. +You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size`, and `max_num_tokens` when launching the TRT-LLM server, and set `reasoning-effort` when launching evaluation in GPT-OSS. Below are reference configurations for accuracy evaluation on B200. Ensure these values remain consistent with the “Launch the TRT-LLM Server” section.
239-245
: Define “DEP” and map table rows to concrete flags; call out config consistency with launch flags.Right now “DEP8 / DEP4” isn’t actionable. Please define DEP and how to set it (e.g., via
enable_attention_dp: true
) and explicitly map each row to--tp_size
,--ep_size
, andenable_attention_dp
. Also, values here (e.g.,max_batch_size
1024 andmax_num_tokens
32768/133120) don’t match the earlier launch example (--max_batch_size 720
,--max_num_tokens 16384
). Readers will be confused.| high | TP8 / TP4 | 720 | 133120 | - + +Note: +- “DEP” denotes attention data parallelism; set `enable_attention_dp: true`. “TP” denotes pure tensor parallelism; set `enable_attention_dp: false`. +- For each row, set `--tp_size` and `--ep_size` accordingly (e.g., “TP8” → `--tp_size 8`; use `--ep_size` for MoE expert parallel as needed). +- Keep `--max_batch_size` and `--max_num_tokens` consistent with your server launch. If you launched with `--max_batch_size 720` and `--max_num_tokens 16384`, update the corresponding table entries to match.
246-246
: Fix eval name: “AIME25” vs “AIME2025”.Text says “AIME2025” but the command uses
aime25
. Align the prose with the flag.-Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME2025. +Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME25.docs/source/reference/multimodal-feature-support-matrix.md (1)
3-14
: Add a brief definition/footnote for “Encoder IFB”.Readers may not know “Encoder IFB”. Add a one‑liner or footnote under the table clarifying it (e.g., encoder in‑flight batching for the vision tower during prefill) and any notable limitations.
Example addition:
| Qwen2.5-VL | Yes | Yes | Yes | No | +Note: “Encoder IFB” = encoder in‑flight batching (vision encoder). Availability may depend on model’s vision tower and scheduling flags.
docs/source/commands/trtllm-serve/trtllm-serve.rst (1)
210-210
: Improve accuracy of metrics descriptionThe line mentions "inflight-batching details" but the example metrics don't show any inflight-batching specific fields. Consider updating to be more accurate.
-The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details. +The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage, KV cache details, and request metrics.docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (6)
14-18
: Fix MD040: add a fenced code language.Label the directory tree block to satisfy markdownlint and improve rendering.
-``` +```text /path/to/models/ ├─ gpt-oss-120b/ # base model directory └─ eagle/ # Eagle3 speculative decoding assets--- `33-46`: **Consider a non-interactive run example.** Add a detached variant to improve copy/paste UX in automation. ```diff -docker run --rm --ipc=host -it \ +docker run --rm --ipc=host -it \ --ulimit stack=67108864 \ --ulimit memlock=-1 \ --gpus all \ -p 8000:8000 \ -v /path/to/models:/config/models:rw \ - nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \ + nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_IMAGE_TAG} \ /bin/bash + +# Or detached server directly: +# docker run --rm --ipc=host -d --name trtllm \ +# --ulimit stack=67108864 --ulimit memlock=-1 --gpus all \ +# -p 8000:8000 -v /path/to/models:/config/models:rw \ +# nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_IMAGE_TAG} \ +# bash -lc 'TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b ...'
73-74
: Use proper Markdown links instead of code-formatted URLs.Improves readability and avoids confusion with shell literals.
-References: `https://huggingface.co/openai/gpt-oss-120b` and `https://huggingface.co/nvidia/gpt-oss-120b-Eagle3` +References: [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) and [nvidia/gpt-oss-120b-Eagle3](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3)
108-109
: Sanity-check parallelism and sequence settings for the stated 8‑GPU example.
--tp_size 8 --ep_size 4
with 8 GPUs and 120B weights may be inconsistent/resource-heavy;--max_seq_len/--max_num_tokens 131072
is extreme. Provide safer defaults or a sizing note.-TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 --max_batch_size 10 --tp_size 8 --ep_size 4 --trust_remote_code --extra_llm_api_options /config/models/eagle/eagle.yaml --max_num_tokens 131072 --max_seq_len 131072 +TRTLLM_ENABLE_PDL=1 trtllm-serve /config/models/gpt-oss-120b --host 0.0.0.0 --port 8000 \ + --max_batch_size 4 --tp_size 8 --ep_size 1 --trust_remote_code \ + --extra_llm_api_options /config/models/eagle/eagle.yaml \ + --max_num_tokens 32768 --max_seq_len 32768 # adjust per VRAM/SLA
125-139
: Showcase prompt token logprobs in the sample request (matches PR goal).Add
logprobs
andprompt_logprobs
to demonstrate the new capability.-d '{ "model": "gpt-oss-120b", "messages": [ {"role": "user", "content": "Give me a two-sentence summary of Eagle3 speculative decoding."} ], - "max_tokens": 128, + "max_tokens": 128, + "logprobs": 5, + "prompt_logprobs": true, "stream": false }'
3-3
: Either link the “previous guide” or remove the forward reference.Avoid dangling references.
docs/source/torch/features/feature_combination_matrix.md (1)
18-18
: Confirm MTP and EAGLE‑3(One Model Engine) support before marking “Yes”.Docs now claim both are “Yes”. Ensure feature gates, required flags, and backend constraints (CUDA graph, attention backend, etc.) match actual support paths; otherwise annotate with “WIP/Untested” like other rows or add footnotes.
Would you like me to link the exact example/tests that demonstrate both features to justify the change in this matrix?
tests/integration/defs/test_e2e.py (3)
565-569
: Harden CLI flag emission for concurrency/num_requests.
- Check explicitly for None to allow zero if ever needed.
- Cast to int once to avoid accidental string injection.
Apply:
- if self.concurrency: - benchmark_cmd += f" --concurrency {self.concurrency}" - if self.num_requests: - benchmark_cmd += f" --num_requests {self.num_requests}" + if self.concurrency is not None: + benchmark_cmd += f" --concurrency {int(self.concurrency)}" + if self.num_requests is not None: + benchmark_cmd += f" --num_requests {int(self.num_requests)}"
628-687
: Make MIG scaling assertion less brittle; fix minor print nit.
- Throughput increase factor 1.3 per step may be flaky across environments. Recommend monotonic increase or a smaller factor (e.g., 1.1).
- Remove needless f-prefix in static print.
Apply:
- print(f"\n=== Benchmark Results Comparison ===") + print("\n=== Benchmark Results Comparison ===") @@ - prev_throughput = float(results[concurrency_list[idx - 1]].get( - 'throughput', 0)) - assert throughput > prev_throughput * 1.3, f"Throughput is not increasing for concurrency {concurrency_list[idx]}" + prev_throughput = float(results[concurrency_list[idx - 1]].get('throughput', 0)) + # Allow modest scaling margin to reduce flakiness across MIG configs. + assert throughput >= prev_throughput * 1.1, ( + f"Throughput did not sufficiently increase: {prev_throughput} -> {throughput} " + f"at concurrency {concurrency_list[idx]}" + )
1633-1638
: Remove unused fixture argument to silence lint.
llm_root
is unused.Apply:
-def test_openai_responses(llm_root, llm_venv): +def test_openai_responses(llm_venv):
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
docs/source/blogs/media/tech_blog10_baseline_performance_detail.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_context_wait_performance.png
is excluded by!**/*.png
docs/source/blogs/media/tech_blog10_full_strategy_performance.png
is excluded by!**/*.png
📒 Files selected for processing (10)
docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md
(1 hunks)docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md
(1 hunks)docs/source/commands/trtllm-serve/trtllm-serve.rst
(1 hunks)docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
(1 hunks)docs/source/index.rst
(1 hunks)docs/source/legacy/tensorrt_quickstart.md
(1 hunks)docs/source/reference/multimodal-feature-support-matrix.md
(1 hunks)docs/source/reference/support-matrix.md
(1 hunks)docs/source/torch/features/feature_combination_matrix.md
(1 hunks)tests/integration/defs/test_e2e.py
(20 hunks)
✅ Files skipped from review due to trivial changes (2)
- docs/source/index.rst
- docs/source/legacy/tensorrt_quickstart.md
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tests/integration/defs/test_e2e.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tests/integration/defs/test_e2e.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tests/integration/defs/test_e2e.py
🧠 Learnings (4)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
docs/source/commands/trtllm-serve/trtllm-serve.rst
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
PR: NVIDIA/TensorRT-LLM#7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/integration/defs/test_e2e.py
🧬 Code graph analysis (1)
tests/integration/defs/test_e2e.py (3)
tests/integration/defs/trt_test_alternative.py (1)
check_output
(257-283)tests/integration/defs/conftest.py (4)
llm_venv
(707-723)llm_root
(180-181)unittest_path
(90-91)llm_models_root
(77-83)tests/integration/defs/common.py (1)
parse_output
(665-681)
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py
570-570: Function call with shell=True
parameter identified, security issue
(S604)
659-659: f-string without any placeholders
Remove extraneous f
prefix
(F541)
680-680: Use of assert
detected
(S101)
681-681: Use of assert
detected
(S101)
686-686: Use of assert
detected
(S101)
1633-1633: Unused function argument: llm_root
(ARG001)
2806-2806: Use of assert
detected
(S101)
🪛 markdownlint-cli2 (0.17.2)
docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md
14-14: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md (1)
74-74
: Correct fix: SOL time should scale with balance ratio (not divide).Multiplying by
balance_ratio_i = avg_tokens/max_tokens
correctly models the ideal per-iteration time astime_i * balance_ratio_i
. Good catch.docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1)
239-245
: Double-check batch/token limits for B200; numbers look unusually high.
max_num_tokens = 133120
andmax_batch_size = 1024
(TP low/medium) are aggressive for 120B. Please validate these against actual memory headroom for the recommendedkv_cache_free_gpu_memory_fraction
and your sequence lengths (ISL/OSL). If they’re theoretical maxima, label them as such and provide a safe default row.docs/source/reference/multimodal-feature-support-matrix.md (3)
3-14
: Cross-matrix consistency check.If other backend matrices (e.g., TensorRT) list Llama 4, mirror this change there (or explicitly scope this table to PyTorch only, which the header implies).
9-9
: No change—row already uses canonical “Llama 4”.Likely an incorrect or invalid review comment.
9-9
: Verify Llama 4 encoder IFB support in Python inference
No Python code paths reference “IFB” or “in-flight batching” for Llama 4. Determine whether encoder IFB is only enabled under the C++ session (viapython_e2e
, batch-size limits, scheduling flags, etc.); if so, update the matrix entry or add a footnote with the gating conditions.docs/source/commands/trtllm-serve/trtllm-serve.rst (5)
204-209
: LGTM! Clear documentation of PyTorch backend limitationsThe note effectively communicates the beta status and limitations of the PyTorch backend metrics, including the performance impact caveat. This helps set appropriate user expectations.
212-218
: Configuration structure update is well documentedThe change from nested
pytorch_backend_config
to top-levelenable_iter_perf_stats
is clearly documented with proper YAML formatting.
223-223
: Server command update matches configuration changesThe updated command properly references the new config filename convention.
225-227
: Important usage guidance for metrics pollingThis is valuable information about the queue-based metrics storage and the need for timely polling. The guidance helps prevent users from missing metrics data.
236-256
: Comprehensive metrics example outputThe updated JSON example properly reflects the current PyTorch backend metrics structure, including detailed KV cache statistics. The formatting and field coverage look accurate.
docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md (2)
61-71
: Hugging Face repository identifiers verifiedThe Hugging Face repos openai/gpt-oss-120b and nvidia/gpt-oss-120b-Eagle3 both exist and correspond to the intended base model and Eagle3 speculative decoding checkpoint, respectively.
82-95
: Use validspeculative_config.decoding_type
value “eagle”
speculative_config.decoding_type
must be one of the accepted values (explicit_draft_tokens, eagle, ngram, user_provided, auto); update fromEagle
toeagle
.Likely an incorrect or invalid review comment.
tests/integration/defs/test_e2e.py (4)
488-489
: Returning the parsed benchmark result is good.This makes the runner composable in tests.
511-512
: Dynamic num-requests in dataset generation — LGTM.Keeps dataset size proportional to test intent.
1560-1570
: Add cache salt test — LGTM.Installs requirements and executes targeted test as expected.
1758-1760
: Regex now handles single or double quotes — LGTM.Improves robustness to varied output quoting.
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
🛠️ Refactor suggestion
Add a security warning for `trust_remote_code: true`.
Readers should understand the risk and how to mitigate it.
Notes:
- Ensure your base model directory is `/config/models/gpt-oss-120b`.
- Ensure your Eagle3 assets are present under `/config/models/eagle/`.
- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
+- Warning: `trust_remote_code: true` executes repository code during model load. Only enable for sources you trust and pin exact revisions.
```shell
# execute this command in gpt-oss
python -m gpt_oss.evals \
    --sampler chat_completions \
    --eval gpqa,aime25 \
    --model gpt-oss-120b \
    --reasoning-effort low,medium
```
🛠️ Refactor suggestion
Make the example runnable: set OpenAI-compatible endpoint env vars and align model name with server.
Without `OPENAI_BASE_URL`/`OPENAI_API_KEY`, `gpt_oss.evals` won't hit the local TRT-LLM server. Also, earlier sections use `openai/gpt-oss-120b`; align here for consistency.
-# execute this command in gpt-oss
-python -m gpt_oss.evals \
+# In the gpt-oss repo:
+export OPENAI_BASE_URL="http://localhost:8000/v1"
+export OPENAI_API_KEY="dummy"
+
+python -m gpt_oss.evals \
--sampler chat_completions \
--eval gpqa,aime25 \
- --model gpt-oss-120b \
+ --model openai/gpt-oss-120b \
--reasoning-effort low,medium
fbec724 to dea4d74 (Compare)
dea4d74 to 227fed5 (Compare)
/bot run
PR_Github #18178 [ run ] triggered by Bot
PR_Github #18178 [ run ] completed with state
227fed5 to bd2a1b6 (Compare)
/bot run
PR_Github #18244 [ run ] triggered by Bot
bd2a1b6 to f925fa9 (Compare)
tensorrt_llm/executor/result.py (Outdated)
if finish_reasons[src_idx] != tllm.FinishReason.CANCELLED:
    if len(output.logprobs) > output.length:
        # Check if logprobs is a list (not a dict or other structure)
        if isinstance(output.logprobs, list) and len(
`output.logprobs` will always be of list-type (`List[float]` or `TokenLogprobs: TypeAlias = list[dict[int, Logprob]]`), so the `isinstance` check is unnecessary.
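A sketch of the simplified condition the comment suggests, with the quoted variables assumed from the surrounding hunk; the trimming shown is one plausible way the overflow is handled:

```python
# output.logprobs is always a list (List[float] or list[dict[int, Logprob]]),
# so the isinstance() guard can be dropped; only the length check matters.
if finish_reasons[src_idx] != tllm.FinishReason.CANCELLED:
    if len(output.logprobs) > output.length:
        # Keep at most one logprob entry per generated token.
        output.logprobs = output.logprobs[:output.length]
```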
asyncio.run(main())

@pytest.mark.skip(reason="https://nvbugs/5516660")
/bot run --extra-stage "A100X-TensorRT-Post-Merge-1, A100X-TensorRT-Post-Merge-2, A100X-TensorRT-Post-Merge-3, A100X-TensorRT-Post-Merge-4, A100X-TensorRT-Post-Merge-5, A100X-TensorRT-Post-Merge-6"
1 similar comment
/bot run --extra-stage "A100X-TensorRT-Post-Merge-1, A100X-TensorRT-Post-Merge-2, A100X-TensorRT-Post-Merge-3, A100X-TensorRT-Post-Merge-4, A100X-TensorRT-Post-Merge-5, A100X-TensorRT-Post-Merge-6"
PR_Github #19032 [ run ] triggered by Bot
PR_Github #19032 [ run ] completed with state
Signed-off-by: Venky Ganesh <[email protected]>
4fc94a4 to 142d29d (Compare)
tests/unittest/llmapi/test_llm.py (Outdated)
@pytest.mark.skip(reason="https://nvbugs/5516660")
@force_ampere
# @force_ampere
uncomment this.
tensorrt_llm/llmapi/llm.py (Outdated)
query_len: int,
sampling_params: SamplingParams,
is_gen_only: bool) -> None:
streaming: bool = False,
unnecessary now
Signed-off-by: Venky Ganesh <[email protected]>
if streaming:
    # need this so that context_logits / prompt_logprobs are not dropped
    # in the 2nd reuse of llm.generate() in streaming mode
    kv_cache_args_extra["enable_block_reuse"] = False
@tijyojwad this was the knob that made things work for prompt_logprobs
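For readers following along, a hedged sketch of what disabling block reuse looks like through the public LLM API (the test itself threads it through `kv_cache_args_extra`); the model name is a placeholder:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Streaming + prompt_logprobs: disable KV block reuse so a second
# llm.generate() over the same prompt still produces context logits.
kv_cache_config = KvCacheConfig(enable_block_reuse=False)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          kv_cache_config=kv_cache_config)
```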
# each streamed response_tensors.log_probs[src_idx]
# contains a streamwise monotonically growing list of logprobs.
# so we need to accumulate only the new ones unique to that particular streamed response
output.logprobs += response_tensors.log_probs[src_idx][
@hchings @LinPoly this lack of accumulation was causing https://nvbugs/5516660 , fixed here
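A minimal sketch of the accumulation pattern described in the fix; the names mirror the quoted hunk, and the placement of the `_last_logprobs_len` update is an assumption:

```python
# Each streamed response carries the full, monotonically growing logprob
# list, so append only the entries that have not been seen yet.
new_logprobs = response_tensors.log_probs[src_idx]
output.logprobs += new_logprobs[output._last_logprobs_len:]
output._last_logprobs_len = len(output.logprobs)
```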
/bot run
PR_Github #19295 [ run ] triggered by Bot
@coderabbitai full review
✅ Actions performed: Full review triggered.
prompt_logprobs=prompt_logprobs,
return_context_logits=return_context_logits,
return_generation_logits=return_generation_logits)
with llm:
How about restoring this to use the LLM directly without the with-statement? The LLM instance should be able to clean up without requiring an explicit `__exit__` or `shutdown`. Introducing a with-statement might obscure issues that prevent proper destruction of the LLM instance.
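A short sketch of the test style being suggested, with placeholder fixture names (`llama_model_path`, `prompts`); the point is that the LLM instance should be destructible without an explicit `with` block or `shutdown()` call:

```python
# Hypothetical test body: rely on implicit cleanup instead of `with llm:`,
# so anything that blocks destruction of the LLM instance surfaces as a failure.
llm = LLM(model=llama_model_path)
sampling_params = SamplingParams(max_tokens=8,
                                 prompt_logprobs=prompt_logprobs,
                                 return_context_logits=return_context_logits,
                                 return_generation_logits=return_generation_logits)
outputs = llm.generate(prompts, sampling_params)
assert outputs[0].outputs[0].text
```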
Actionable comments posted: 0
🧹 Nitpick comments (6)
tensorrt_llm/executor/result.py (1)
247-272
: Fix logprobs accumulation logic to prevent out-of-bounds access.The current implementation slices
response_tensors.log_probs[src_idx]
fromoutput._last_logprobs_len:
without checking if the index is valid. Ifoutput._last_logprobs_len
exceeds the length of the logprobs list, this will result in an empty slice but no error. However, the logic then assumes there are new logprobs to append, which may not be accurate.Consider adding a bounds check:
- output.logprobs += response_tensors.log_probs[src_idx][ - output._last_logprobs_len:] + new_logprobs = response_tensors.log_probs[src_idx] + if output._last_logprobs_len < len(new_logprobs): + output.logprobs += new_logprobs[output._last_logprobs_len:]tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
281-291
: Consider adding error handling for missing attributes.The
clear_context_logits
method directly accesses_py_result._context_logits
without checking if the attribute exists. While the docstring indicates when this is used, defensive programming would make the code more robust.Apply this diff to add defensive checks:
def clear_context_logits(self): """Clear context logits from the response result. This is used to drop context logits after prompt_logprobs have been computed when the user didn't explicitly request them. """ if self.result and hasattr(self.result, '_py_result'): py_result = self.result._py_result - if hasattr(py_result, '_context_logits'): - py_result._context_logits = None + if hasattr(py_result, '_context_logits') and py_result._context_logits is not None: + py_result._context_logits = Nonetensorrt_llm/sampling_params.py (1)
453-467
: Consider refining the logging message for clarity.While the implementation correctly enables
return_context_logits
when needed for prompt_logprobs computation, the info log message could be more concise and clearer about the internal nature of this behavior.if self.prompt_logprobs and not self.return_context_logits: - logger.info( - "Since prompt_logprobs is requested but return_context_logits is False, " - "internally enabling context logits for prompt logprobs computation. " - "context logits will be dropped after computation as the user didn't explicitly request them." - ) + logger.debug( + "Internally enabling context_logits for prompt_logprobs computation " + "(will be dropped after use)" + )Consider using
debug
level instead ofinfo
since this is an internal implementation detail that users don't need to act upon.tensorrt_llm/executor/worker.py (2)
1074-1075
: Add defensive check for context_logits availability.While the assertion message is helpful, consider a more graceful error handling approach for production code.
Consider replacing the assertion with a proper exception:
- assert context_logits is not None, "context_logits cannot be None when prompt_logprobs is requested." + if context_logits is None: + raise ValueError("context_logits cannot be None when prompt_logprobs is requested. " + "Ensure return_context_logits is enabled in sampling params.")
`1078-1079`: Consider documenting the caching behavior.

The caching of prompt logprobs in streaming mode is an important implementation detail that should be documented. Add a comment explaining the caching strategy:

```diff
     if generation_result._streaming:
+        # Cache prompt logprobs to avoid recomputation in subsequent streaming iterations
         generation_result._cached_prompt_logprobs = logprobs_result.prompt
```
tests/unittest/llmapi/test_llm.py (1)
`1869-1869`: Remove the unused parameter `id` from the async function.

The static analysis correctly identifies that the `id` parameter in the async task function is never used.

```diff
-    async def task(id: int, prompt: str):
+    async def task(prompt: str):
```

And update the task creation accordingly so it no longer passes `id`.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (2 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/executor/worker.py (3 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/sampling_params.py (2 hunks)
- tests/unittest/llmapi/test_llm.py (1 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm.py
tests/unittest/llmapi/test_llm_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/unittest/llmapi/test_llm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/unittest/llmapi/test_llm.py
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm.py
1869-1869: Unused function argument: id
(ARG001)
🔇 Additional comments (14)
tensorrt_llm/executor/result.py (1)
`703-722`: LGTM! Clear and comprehensive docstring update.

The updated docstring for `compute_logprobs` clearly explains the function's purpose and distinguishes between the different backend behaviors for prompt and generation logprobs computation.
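As background for reviewers, prompt logprobs are a pure function of the context logits and the prompt token ids. A minimal sketch of that derivation (illustrative only; the real `compute_logprobs` may use different shape and offset conventions):

```python
import torch

def prompt_logprobs_from_context_logits(context_logits: torch.Tensor,
                                         prompt_token_ids: list[int]) -> list[float]:
    """Log-probability of each prompt token given the tokens before it.

    Assumes context_logits has shape [prompt_len, vocab_size], where row i
    scores the token at position i + 1; the first prompt token therefore
    has no logprob.
    """
    logprobs = torch.log_softmax(context_logits.float(), dim=-1)
    targets = torch.tensor(prompt_token_ids[1:], dtype=torch.long)
    picked = logprobs[:-1].gather(1, targets.unsqueeze(1)).squeeze(1)
    return picked.tolist()
```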
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

`390-416`: Excellent documentation for the create_response method.

The comprehensive docstring clearly explains the method's behavior, parameters, and return semantics. The implementation correctly wraps the serialized result in an LlmResponse with appropriate PyTorch-specific handling.
tests/unittest/llmapi/test_llm_pytorch.py (2)
`884-904`: LGTM! Comprehensive test coverage for PyTorch backend logprobs.

The test properly validates various combinations of prompt_logprobs and generation logprobs for the PyTorch backend, including the case where prompt_logprobs works without explicitly enabling context_logits.

`906-924`: Well-structured streaming test for logprobs.

The streaming test complements the non-streaming test by covering key scenarios including prompt-only, generation-only, and combined logprobs cases.
tensorrt_llm/llmapi/llm.py (2)
`572-577`: LGTM! Clean handling of the is_gen_only parameter.

The addition of the default value for `is_gen_only` makes the method signature more flexible while maintaining backward compatibility.

`578-591`: Good simplification of PyTorch backend validation.

Removing the prompt_logprobs restriction for the PyTorch backend aligns with the PR's goal of enabling this feature. The remaining validations for logprobs <= 1 and the max_tokens checks are appropriate.
tensorrt_llm/executor/worker.py (4)
`17-17`: Import statement looks good.

The addition of the `LlmResponse` import from the PyTorch executor module is appropriate for handling PyTorch backend responses with context logits.

`1061-1081`: Well-structured helper function for PyTorch prompt logprobs.

The implementation correctly handles caching for streaming mode and cleanly separates PyTorch-specific logic.
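The caching pattern referred to here can be illustrated with a small, self-contained sketch (all names are placeholders, not the actual worker.py implementation): compute the prompt logprobs once, stash them on the streaming result, and reuse them for every subsequent chunk.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class StreamState:
    """Stand-in for the per-request generation result (illustrative only)."""
    streaming: bool
    cached_prompt_logprobs: Optional[List[float]] = None

def prompt_logprobs_for_chunk(state: StreamState,
                              compute: Callable[[], List[float]]) -> List[float]:
    # Reuse the cached value on later streamed chunks instead of recomputing
    # from context logits every time.
    if state.streaming and state.cached_prompt_logprobs is not None:
        return state.cached_prompt_logprobs
    logprobs = compute()
    if state.streaming:
        state.cached_prompt_logprobs = logprobs
    return logprobs

# The expensive computation runs exactly once across all chunks.
calls = 0
def fake_compute() -> List[float]:
    global calls
    calls += 1
    return [-0.11, -2.35]

state = StreamState(streaming=True)
for _ in range(3):
    prompt_logprobs_for_chunk(state, fake_compute)
assert calls == 1
```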
`1083-1085`: Type annotation properly updated to support both response types.

The union type allows flexible handling of both TRT and PyTorch backend responses.
`1101-1113`: PyTorch backend path correctly implements prompt logprobs handling.

The logic properly:
- Early returns when prompt_logprobs is not requested
- Computes prompt logprobs using the helper function
- Clears context logits when requested to avoid memory overhead
Good separation of concerns between PyTorch and TRT backends.
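That control flow boils down to the sketch below. Only `clear_context_logits` comes from this PR; the remaining names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class FakeResponse:
    """Toy response object standing in for LlmResponse (illustrative only)."""
    context_logits: Optional[List[List[float]]]

    def clear_context_logits(self) -> None:
        self.context_logits = None

def pytorch_logprobs_path(response: FakeResponse,
                          want_prompt_logprobs: bool,
                          user_asked_for_context_logits: bool,
                          derive: Callable[[List[List[float]]], List[float]]
                          ) -> Optional[List[float]]:
    if not want_prompt_logprobs:
        # Generation logprobs already come from the sampler; nothing to do here.
        return None
    assert response.context_logits is not None
    prompt_logprobs = derive(response.context_logits)
    if not user_asked_for_context_logits:
        # The logits were only captured to derive prompt logprobs; drop them
        # so they are not shipped back to the client.
        response.clear_context_logits()
    return prompt_logprobs

resp = FakeResponse(context_logits=[[0.1, 0.9], [0.7, 0.3]])
print(pytorch_logprobs_path(resp, True, False, lambda _: [-0.5, -1.2]))
print(resp.context_logits)  # None: context logits were dropped after use
```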
tests/unittest/llmapi/test_llm.py (4)
`1798-1825`: Well-structured test harness with proper backend handling.

The test harness correctly handles:
- Backend-specific LLM class selection
- KV cache configuration for streaming mode
- Proper resource cleanup with the context manager pattern
`1867-1886`: Comprehensive streaming validation logic.

The streaming test properly validates that:
- Streaming logprobs accumulation matches non-streaming results
- Prompt logprobs are preserved across streaming iterations
Good test coverage for the streaming scenario.
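The invariant those two checks encode can be written down in a few lines. This is a framework-agnostic sketch that assumes each streamed chunk exposes its newly generated logprobs and the full prompt logprobs:

```python
from typing import List, Tuple

def check_stream_matches_non_stream(
        streamed_chunks: List[Tuple[List[float], List[float]]],
        non_streaming_logprobs: List[float],
        non_streaming_prompt_logprobs: List[float]) -> None:
    """streamed_chunks: (new_generation_logprobs, prompt_logprobs) per chunk."""
    accumulated: List[float] = []
    for gen_logprobs, prompt_logprobs in streamed_chunks:
        accumulated += gen_logprobs
        # Prompt logprobs should be identical on every streamed chunk.
        assert prompt_logprobs == non_streaming_prompt_logprobs
    # Concatenating the per-chunk generation logprobs reproduces the
    # non-streaming result.
    assert accumulated == non_streaming_logprobs
```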
`1807-1814`: Smart workaround for the PyTorch streaming limitation.

The tests reference the prompt_logprobs/context_logits behavior (tests/unittest/llmapi/test_llm_pytorch.py:888, 892), but no TODO or tracking-issue reference was found in the repo. Create or link a tracking issue and add a brief inline code comment documenting this limitation.
`1888-1908`: Add PyTorch-specific logprobs tests (tests/unittest/llmapi/test_llm_pytorch.py).

test_llm_pytorch.py imports llm_return_logprobs_test_harness, but no explicit PyTorch logprobs tests were found; add parametrized cases mirroring the TRT tests to cover:
- prompt_logprobs with context_logits
- prompt_logprobs without context_logits
- generation logprobs only (top-k)
- no logprobs
Use llm_return_logprobs_test_harness to implement these cases.
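A possible shape for those cases is sketched below. The harness signature, parameter values, and import path are assumptions and should be aligned with the existing TRT parametrization.

```python
import pytest

# Assumed import path; test_llm_pytorch.py already imports this harness.
from .test_llm import llm_return_logprobs_test_harness

@pytest.mark.parametrize(
    "prompt_logprobs, logprobs, return_context_logits, return_generation_logits",
    [
        (2, None, True, False),      # prompt_logprobs with context_logits
        (2, None, False, False),     # prompt_logprobs without context_logits
        (None, 1, False, False),     # generation logprobs only (top-k)
        (None, None, False, False),  # no logprobs
    ],
)
def test_llm_return_logprobs_pytorch(prompt_logprobs, logprobs,
                                     return_context_logits,
                                     return_generation_logits):
    llm_return_logprobs_test_harness(prompt_logprobs, logprobs,
                                     return_context_logits,
                                     return_generation_logits,
                                     backend="pytorch")
```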
Pull Request Overview
This PR enables `prompt_logprobs` support in the PyTorch backend to unify logprobs functionality across the TensorRT and PyTorch backends. The implementation automatically captures the required context logits when prompt logprobs are requested and provides an option to clear stored context logits after use to reduce memory consumption.
Key Changes
- Added PyTorch backend support for prompt logprobs computation from context logits
- Implemented automatic context logits capture when prompt logprobs are requested but context logits aren't explicitly enabled
- Enhanced streaming mode to properly handle prompt logprobs with consistent caching behavior
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tensorrt_llm/sampling_params.py | Added logic to auto-enable context logits when prompt_logprobs is requested in PyTorch backend |
| tensorrt_llm/llmapi/llm.py | Removed PyTorch backend restriction for prompt_logprobs support |
| tensorrt_llm/executor/worker.py | Enhanced logprobs computation to handle PyTorch backend prompt logprobs with caching for streaming |
| tensorrt_llm/executor/result.py | Updated logprobs handling to differentiate between TRT and PyTorch backend provenance |
| tensorrt_llm/_torch/pyexecutor/llm_request.py | Added clear_context_logits method and improved documentation |
| tests/unittest/llmapi/test_llm.py | Updated test harness with backend parameter and streaming improvements |
| tests/unittest/llmapi/test_llm_pytorch.py | Added comprehensive test coverage for PyTorch backend logprobs functionality |
```python
# TODO(venky): Find a more elegant way to do this.
# NOTE: This is an internal hack, so we can entirely avoid introducing
# `prompt_logprobs` into the executor bindings and further into
# model engine / sampler.
# This is because, prompt_logprobs is a derived quantity from
# context logits, and the capability to post-compute it
# already exists in the worker. (see _get_logprobs in worker.py)
```
This TODO comment indicates a design concern about the current implementation being a 'hack'. Consider creating a GitHub issue to track the technical debt for refactoring this approach to a more elegant solution.
```python
# In streaming mode, since out-of-order responses are not possible,
# each streamed response_tensors.log_probs[src_idx]
# contains a streamwise monotonically growing list of logprobs.
# so we need to accumulate only the new ones unique to that particular streamed response
```
The slicing logic `[output._last_logprobs_len:]` used to get only the new logprobs in streaming mode could be error-prone if `_last_logprobs_len` becomes inconsistent. Consider adding an assertion that `output._last_logprobs_len <= len(response_tensors.log_probs[src_idx])` holds before slicing.
```suggestion
# so we need to accumulate only the new ones unique to that particular streamed response
assert output._last_logprobs_len <= len(response_tensors.log_probs[src_idx]), (
    f"_last_logprobs_len ({output._last_logprobs_len}) > log_probs length ({len(response_tensors.log_probs[src_idx])})"
)
```
```
parent class's serialized result in a PyTorch-specific LlmResponse object.

Args:
    use_fast_logits (bool, optional): Whether to use fast logits computation
```
The comment is a bit vague and probably incorrect. This boolean controls the direct transfer from draft model to target model for speculative decoding. I actually don't know if it applies to the pytorch backend as it is implemented.
```
Args:
    use_fast_logits (bool, optional): Whether to use fast logits computation
        for improved performance. Defaults to False.
    mpi_world_rank (int, optional): The MPI world rank for distributed
```
Same thing: this is the leader rank of the draft model when using direct logits transfer.
```python
if is_pytorch_backend:
    if not logprob_params.prompt_logprobs:
        # PyTorch: generation logprobs computed in sampler, no post-processing needed
        return
```
`return None` to be explicit?
Summary by CodeRabbit
Description
Also fixes https://nvbugs/5516660.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill` : Kill all running builds associated with pull request.

skip

`skip --comment COMMENT` : Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline` : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.