[TRTLLM-5670][feat] Support top log probabilities #7296
base: main
Conversation
📝 Walkthrough

Adds an optional top logprobs feature across the C++ OutputConfig, Python bindings, PyTorch backend, sampler, and API layers. Introduces configuration (max_top_logprobs, top_logprobs), propagates it through constructors, pickling, and execution paths, updates the sampler to compute and store top tokens/logprobs, and extends results and examples. Tests cover API and sampler behavior.

Changes
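For orientation, here is a minimal usage sketch of the feature as the walkthrough describes it. The top_logprobs and max_top_logprobs parameter names come from this PR; the model path and the exact constructor signature are illustrative assumptions, not verified against the branch.

```python
from tensorrt_llm import LLM, SamplingParams

# max_top_logprobs is the backend-wide cap introduced by this PR (assumed kwarg).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", max_top_logprobs=4)

params = SamplingParams(
    max_tokens=8,
    logprobs=1,        # per-token logprob of the sampled token
    top_logprobs=4,    # additionally request the 4 best candidates per position
    top_k=50,
    top_p=0.95,
    temperature=1.0,
)

for output in llm.generate(["The capital of France is"], params):
    seq = output.outputs[0]
    # seq.top_logprobs[i] is expected to map token id -> Logprob(logprob, rank)
    print(seq.text, seq.top_logprobs[0])
```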
Sequence Diagram(s)

sequenceDiagram
autonumber
actor U as User/CLI
participant Q as quickstart_advanced.py
participant L as LLM
participant SP as SamplingParams
participant OCp as OutputConfig (Python)
participant B as Bindings (pybind/nanobind)
participant OCc as OutputConfig (C++)
participant EX as Executor
participant TS as TorchSampler
participant R as Result (Python)
U->>Q: --top_logprobs / --max_top_logprobs
Q->>L: init(max_top_logprobs, enable_mixed_sampler)
Q->>SP: create(top_logprobs, logprobs, top_k/top_p)
SP-->>Q: validated params
Q->>OCp: map params to output config
OCp->>B: construct / pickle state (8-tuple incl. top_logprobs)
B->>OCc: construct with topLogProbs
OCc->>EX: submit request
EX->>TS: sample_async(...)
TS-->>EX: next_tokens, log_probs, top_tokens/top_log_probs (<= max_top_logprobs)
EX-->>R: response_tensors incl. top_log_probs
R-->>U: CompletionOutput.top_logprobs
sequenceDiagram
autonumber
participant TS as TorchSampler
participant ST as SampleState
participant HS as Host Buffers
TS->>ST: run sampling (top-k/top-p/greedy)
ST-->>TS: (next_tokens, softmax, indices/all_token_ids)
TS->>TS: compute log_probs_host()
alt need_top_logprobs
TS->>HS: write top_tokens_host, top_log_probs_host (bounded by max_top_logprobs)
else skip
TS-->>HS: no-op
end
TS-->>ST: update SampleStateTensorsHostTorch with top_tokens/top_log_probs
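The "write top_tokens_host / top_log_probs_host (bounded by max_top_logprobs)" step in the second diagram boils down to a bounded top-k over the post-sampling distribution. A standalone sketch of that idea (illustrative only, not the actual TorchSampler code; the PR rejects requests above the cap, while this sketch simply clamps):

```python
import torch

def extract_top_logprobs(softmax: torch.Tensor, top_logprobs: int,
                         max_top_logprobs: int):
    """Return the k best token ids and their log-probs per row, with k bounded by the cap."""
    k = min(top_logprobs, max_top_logprobs)
    if k <= 0:
        return None, None
    log_probs = torch.log(softmax)                    # [batch, vocab]
    top_log_probs, top_tokens = torch.topk(log_probs, k, dim=-1)
    return top_tokens, top_log_probs                  # each [batch, k]
```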
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Suggested reviewers
Tip: 👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```
Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
d70796a to f81fb63 — Compare
/bot run

PR_Github #16678 [ run ] triggered by Bot
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/sampling_params.py (1)
466-471
: Enable logprob computation when top_logprobs/prompt_logprobs are requested

Without setting return_log_probs when only top_logprobs (or prompt_logprobs) is requested, the backend may skip computing logprobs entirely, yielding empty top_logprobs. Set the flag if any of logprobs, prompt_logprobs, or top_logprobs are requested.
Apply this diff:
```diff
-        if is_pytorch_backend:
-            config_kwargs["return_log_probs"] = bool(self.logprobs)
-        else:
-            config_kwargs["return_log_probs"] = self._return_log_probs
+        if is_pytorch_backend:
+            config_kwargs["return_log_probs"] = (
+                bool(self.logprobs) or bool(self.prompt_logprobs) or (self.top_logprobs > 0)
+            )
+        else:
+            config_kwargs["return_log_probs"] = (
+                self._return_log_probs or bool(self.prompt_logprobs) or (self.top_logprobs > 0)
+            )
```
🧹 Nitpick comments (25)
tensorrt_llm/_torch/pyexecutor/config.py (1)
117-119
: New PyTorchConfig knob max_top_logprobs added — OK.

Minor: consider documenting/validating that it must be >= 0 (consistent with SamplingParams.top_logprobs).
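If such a guard is wanted, one lightweight option is a __post_init__ check. The snippet below is only a toy sketch that assumes PyTorchConfig is a plain dataclass, which may not match the actual definition in config.py:

```python
from dataclasses import dataclass

@dataclass
class PyTorchConfig:
    max_top_logprobs: int = 0

    def __post_init__(self) -> None:
        # Keep the knob consistent with SamplingParams.top_logprobs (>= 0).
        if self.max_top_logprobs < 0:
            raise ValueError("max_top_logprobs must be >= 0")
```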
tensorrt_llm/llmapi/llm_args.py (1)
1330-1332
: Validate BaseLlmArgs.max_top_logprobs is non-negative.

Add a pydantic validator to prevent accidental negatives.
```python
# Add near other validators in BaseLlmArgs
@field_validator("max_top_logprobs")
@classmethod
def validate_max_top_logprobs(cls, v: int):
    if v < 0:
        raise ValueError("max_top_logprobs must be >= 0")
    return v
```

tensorrt_llm/executor/result.py (2)
114-115
: Expose top_logprobs in CompletionOutput — add docstring entry for completeness.

Keep the Args section in sync with fields.
```python
# In CompletionOutput docstring Args section, add:
top_logprobs (TokenLogprobs, optional): For each generated position, a dict of
    top token ids to Logprob (logprob, rank). Defaults to None.
```
261-267
: Align top_logprobs truncation with logprobs (optional assert).

Mirror the post-slice assert to catch mismatches early.
```diff
             if response_tensors.top_log_probs is not None:
                 output.top_logprobs = response_tensors.top_log_probs[src_idx]
                 if finish_reasons[src_idx] != tllm.FinishReason.CANCELLED:
                     if len(output.top_logprobs) > output.length:
-                        output.top_logprobs = output.top_logprobs[:output.
-                                                                  length]
+                        output.top_logprobs = output.top_logprobs[:output.length]
+                        assert len(output.top_logprobs) == output.length
```

tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
59-64
: Make assertions robust to early EOS (avoid flakiness).

Generated length can be < max_tokens; assert against the actual output length.
```diff
-    if top_logprobs > 0:
-        assert len(output.outputs[0].logprobs) == max_tokens
-        assert len(output.outputs[0].top_logprobs) == max_tokens
-        assert len(output.outputs[0].logprobs[0]) == 1
-        assert len(output.outputs[0].top_logprobs[0]) == top_logprobs
+    if top_logprobs > 0:
+        length = output.outputs[0].length
+        assert len(output.outputs[0].logprobs) == length
+        assert len(output.outputs[0].top_logprobs) == length
+        assert len(output.outputs[0].logprobs[0]) == 1
+        assert len(output.outputs[0].top_logprobs[0]) == top_logprobs
```
21-35
: Optional: make the test deterministic.

Pass a fixed random seed to SamplingParams in the loop to reduce flakiness across CI runs.
```python
# When constructing SamplingParams in the test:
SamplingParams(..., random_seed=1234)
```

tensorrt_llm/sampling_params.py (4)
328-341
: Validation: unify greedy detection and clarify error messages
- Greedy detection here conflicts with quickstart_advanced.py (that uses top_k in {None, 0}). Pick one definition across the codebase to avoid surprising rejections. Prefer treating None as 0 to reflect TRT defaults.
- Error message “cannot exceed 0” is awkward; say “must be 0”.
Apply this diff to improve the message and (optionally) to use consistent greedy detection:
```diff
-        if self._greedy_decoding and self.top_logprobs > 0:
+        if self._greedy_decoding and self.top_logprobs > 0:  # check for greedy sampling
             raise ValueError(
-                f"top_logprobs {self.top_logprobs} cannot exceed 0 for greedy sampling"
+                f"top_logprobs must be 0 for greedy sampling; got {self.top_logprobs}"
             )
```

Optionally align _greedy_decoding with the quickstart heuristic:
```diff
     def _greedy_decoding(self) -> bool:
-        return (
-            not self.use_beam_search
-            and (self.top_k is None or self.top_k == 1)
-            and (self.top_p is None or self.top_p == 0.0)
-        )
+        tk = 0 if self.top_k is None else self.top_k
+        tp = 0.0 if self.top_p is None else self.top_p
+        return (not self.use_beam_search) and (tk == 0) and (tp == 0.0)
```
324-327
: Normalize user input type for top_logprobs (parity with logprobs/prompt_logprobs)

Users may pass booleans; normalize to int as you do for logprobs and prompt_logprobs.
Apply this diff:
```diff
         self.logprobs = self.logprobs and int(self.logprobs)
         self.prompt_logprobs = self.prompt_logprobs and int(self.prompt_logprobs)
+        self.top_logprobs = int(self.top_logprobs)
```
176-176
: Docstring accuracy: “optional” vs defaulted int

The field is an int with a default (0), not Optional[int]. Update the docstring accordingly to avoid confusion.
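For instance, the corrected Google-style entry could read roughly as follows (the wording is only a suggestion, not text from the PR):

```python
class SamplingParams:
    """...

    Args:
        top_logprobs (int): Number of top log probabilities to return per
            generated token. 0 disables the feature. Defaults to 0.
    """
```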
1-1
: Missing NVIDIA copyright header (policy)

Per guidelines, prepend the NVIDIA copyright header (2025) to this .py file.
Example:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import json
```

cpp/tensorrt_llm/executor/outputConfig.cpp (1)
25-35
: Constructor wiring for topLogProbs looks correct

The parameter is appended at the end and member initialization order is preserved. Consider documenting in the header that 0 disables top-logprobs to avoid ambiguity with std::optional storage.
tensorrt_llm/_torch/pyexecutor/_util.py (3)
662-665
: Consistency with greedy heuristic

instantiate_sampler uses enable_mixed_sampler from the CLI’s is_greedy. If you adopt the unified greedy detection in SamplingParams, consider deriving enable_mixed_sampler from the same logic to prevent drift.
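One way to prevent that drift would be a single shared predicate that both SamplingParams and instantiate_sampler consult. The helper below is hypothetical (it does not exist in the PR) and treats unset values as the TRT defaults, as suggested above:

```python
def is_greedy_sampling(top_k, top_p, use_beam_search: bool = False) -> bool:
    """Hypothetical shared helper: treat None as 0 / 0.0 and call that greedy."""
    tk = 0 if top_k is None else top_k
    tp = 0.0 if top_p is None else top_p
    return (not use_beam_search) and tk == 0 and tp == 0.0
```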
1-12
: Missing NVIDIA copyright header (policy)

Add the 2025 NVIDIA header at the top of this Python file.
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import os
```
638-651: Ensure non-negative max_top_logprobs is enforced

I don’t see any explicit validation (e.g., >= 0) on the max_top_logprobs field in the executor configuration. While the sampler itself rejects requests whose top_logprobs exceed max_top_logprobs (raising a ValueError), a negative max_top_logprobs would lead to an invalid tensor shape (1 + max_top_logprobs <= 0) later in TorchSampler and cause a runtime error.

Please add a check to clamp or validate max_top_logprobs >= 0 at the utility boundary so that invalid negative values are caught early.

• In tensorrt_llm/_torch/pyexecutor/_util.py:create_torch_sampler_args, before constructing TorchSampler.Args, add:

```diff
-    return TorchSampler.Args(
+    # Ensure max_top_logprobs is non-negative
+    max_top_logprobs = max(0, max_top_logprobs)
+    return TorchSampler.Args(
         max_seq_len=max_seq_len,
         max_draft_len=max_draft_len,
         max_num_sequences=max_num_sequences,
         max_beam_width=executor_config.max_beam_width,
-        max_top_logprobs=max_top_logprobs,
+        max_top_logprobs=max_top_logprobs,
         enable_mixed_sampler=enable_mixed_sampler,
```

• Alternatively, add a Pydantic constraint (ge=0) on the max_top_logprobs field of ExecutorConfig in config.py.

examples/llm-api/quickstart_advanced.py (4)
150-152
: CLI: add help text and bounds checking for new args

Add helpful descriptions and validate non-negative values early to fail fast.
```diff
-    parser.add_argument('--top_logprobs', type=int, default=0)
-    parser.add_argument('--max_top_logprobs', type=int, default=0)
+    parser.add_argument('--top_logprobs', type=int, default=0,
+                        help='Return the top-k log probabilities per generated token (0 disables).')
+    parser.add_argument('--max_top_logprobs', type=int, default=0,
+                        help='Upper bound for top_logprobs supported by the backend (resource cap).')
```

And before setup_llm:

```diff
+    if args.top_logprobs < 0 or args.max_top_logprobs < 0:
+        raise SystemExit("--top_logprobs and --max_top_logprobs must be >= 0")
```
281-284
: Passing boolean to logprobs relies on implicit normalization

This depends on SamplingParams converting True→1. After adding explicit normalization for top_logprobs too, behavior will be consistent.
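For reference, the conversion this relies on is plain Python behavior (bool is a subclass of int), illustrated here outside the repo code:

```python
for value in (True, False, 2):
    normalized = value and int(value)   # mirrors the SamplingParams normalization pattern
    print(value, "->", normalized)      # True -> 1, False -> False, 2 -> 2
```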
323-326
: Printing: guard against None/empty fields

In mixed backends, sequence.top_logprobs may be None/empty. Add a truthiness check to avoid noisy prints.

```diff
-            if args.top_logprobs:
+            if args.top_logprobs and sequence.top_logprobs:
                 print(
                     f"[{i}]{sequence_id_text} Top logprobs: {sequence.top_logprobs}"
                 )
```
1-1
: Missing NVIDIA copyright header (policy)

Add the 2025 NVIDIA header.

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
```

cpp/include/tensorrt_llm/executor/executor.h (3)
218-220
: API surface: constructor extension is clear

Appending topLogProbs at the end is backward-compatible. Consider documenting that 0 disables it and that None (via bindings) maps to 0 at construction time.
238-239
: Member type vs constructor parameter type

The member is std::optional but the constructor takes a SizeType32. It works (implicit optional construction), but accepting std::optional in the ctor would better reflect the member semantics.

```diff
-        SizeType32 topLogProbs = 0);
+        std::optional<SizeType32> topLogProbs = 0);
```
2-2
: Update copyright year

The file header still says 2022-2024. Update to include 2025 per guidelines.

```diff
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
```

cpp/tensorrt_llm/pybind/executor/request.cpp (1)
2-2
: Update SPDX copyright year.

Guideline requires the current year; the header shows 2022-2024.

```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
127-133
: Validate top_log_probs shape when provided.

Guard against mismatched shapes to avoid IndexError on extend.

```diff
         if self.beam_width == -1:
             self._init(new_probs)
         assert len(new_probs) == self.beam_width, "Beam width mismatch"
+        if top_log_probs is not None:
+            assert len(top_log_probs) == self.beam_width, "Beam width mismatch for top_log_probs"
```
1-3
: Missing NVIDIA 2025 header.

Guidelines require the header at the top of .py files.
Add the standard 2025 NVIDIA SPDX header to this file.
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
1-1
: Missing NVIDIA 2025 header.

Add the standard header at the top of this .py per guidelines.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (14)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (9 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (10 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a30.yml (1 hunks)
- tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
tensorrt_llm/sampling_params.py
tensorrt_llm/executor/result.py
tensorrt_llm/_torch/pyexecutor/config.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
tensorrt_llm/sampling_params.py
tensorrt_llm/executor/result.py
cpp/tensorrt_llm/executor/outputConfig.cpp
tensorrt_llm/_torch/pyexecutor/config.py
examples/llm-api/quickstart_advanced.py
cpp/include/tensorrt_llm/executor/executor.h
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/sampler.py
cpp/tensorrt_llm/nanobind/executor/request.cpp
tensorrt_llm/_torch/pyexecutor/llm_request.py
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/tensorrt_llm/nanobind/executor/request.cpp
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/include/tensorrt_llm/executor/executor.h
cpp/tensorrt_llm/nanobind/executor/request.cpp
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hh,hpp,hxx,cuh}
: Closing braces of C++ namespaces in headers must include a trailing comment naming the namespace
Use Allman brace style and always use braces for control statements in headers as well
C++ header filenames must be lowerCamelCase and case-insensitively unique within a compilation target
Document public C++ interfaces with Doxygen using //! and //!<; C-style comments are not allowed except inline special cases; single-line comments should use // and be properly capitalized and punctuated if full sentences
Avoid assignment in subexpressions within header inline/template code as well
All class/function templates and their members should be instantiated at least once; if a class is not POD, its data members should be private
Use header include guards; name as TRTLLM__H (all caps of filename only, no dirs), no leading underscore and no trailing underscore
Files:
cpp/include/tensorrt_llm/executor/executor.h
🧠 Learnings (3)
📚 Learning: 2025-08-26T09:37:10.431Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.431Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/llmapi/llm_args.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (6)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
- tensorrt_llm/sampling_params.py (1): SamplingParams (125-502)
- tensorrt_llm/llmapi/llm_args.py (2): CudaGraphConfig (108-165), KvCacheConfig (946-1077)

tensorrt_llm/sampling_params.py (2)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): top_k (221-221)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1): use_beam_search (519-520)

tensorrt_llm/executor/result.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): top_log_probs (250-251)

cpp/include/tensorrt_llm/executor/executor.h (2)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1): AdditionalModelOutput (37-41)
- tensorrt_llm/sampling_params.py (1): AdditionalModelOutput (112-121)

tensorrt_llm/_torch/pyexecutor/_util.py (1)
- tensorrt_llm/_torch/attention_backend/trtllm.py (2): max_seq_len (558-568), max_seq_len (571-575)

tensorrt_llm/_torch/pyexecutor/sampler.py (2)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (3): top_log_probs (250-251), log_probs (246-247), append_log_probs (195-206)
- tensorrt_llm/executor/result.py (1): Logprob (37-40)
🔇 Additional comments (11)
tests/integration/test_lists/test-db/l0_a30.yml (1)
23-23: Verification complete: Test entry is valid

- Confirmed that tests/unittest/_torch/sampler/test_return_top_k_logprobs.py exists
- Confirmed a single reference in tests/integration/test_lists/test-db/l0_a30.yml at line 23

LGTM.
tensorrt_llm/llmapi/llm_args.py (1)
2589-2591
: PyTorchConfig.max_top_logprobs enforcement confirmed
- In
_util.py
,create_torch_sampler_args
forwardsmax_top_logprobs
fromPyTorchConfig
intoTorchSampler.Args
unchanged (lines 640–646).- In
sampler.py
, theTorchSampler
constructor setsself.max_top_logprobs = args.max_top_logprobs
and uses it to shape the output buffer (NEW_TOKENS_SHAPE
) accordingly (around lines 388–394).- Further down in
sampler.py
, each request’spy_top_logprobs
is compared againstself.max_top_logprobs
, raising aValueError
if it exceeds the configured cap (lines 753–757).No additional action needed—your original plumbing correctly enforces the cap downstream.
examples/llm-api/quickstart_advanced.py (2)
218-225
: Greedy heuristic diverges from SamplingParams._greedy_decoding

This script treats greedy as (top_k in {None, 0} and top_p in {None, 0.0}); SamplingParams currently uses {None, 1} for top_k. Align both sides to prevent surprising validation errors.
Would you like me to patch SamplingParams._greedy_decoding to match this heuristic (tk==0 && tp==0.0)?
260-262
: Nice: propagating max_top_logprobs and gating mixed samplerWiring looks good. Once greedy detection is unified, this will stay consistent.
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
116-126
: Initialize top_log_probs alongside log_probs (good).Per-beam containers are created consistently with beam_width. No issues.
249-252
: Property exposure for top_log_probs (good).Matches PyResult API and LlmResult forwarding.
310-317
: Thread top_logprobs into LlmRequest (good plumbing).Attribute naming is consistent (py_top_logprobs).
576-578
: Forward OutputConfig.top_logprobs into LlmRequest (good).Ensures executor->torch path carries the value.
1-20
: Python 3.8 compatibility of typing.

The file uses built-in generics like list[T], which require Python 3.9+. If the project target remains 3.8 per guidelines, switch to typing.List and friends or ensure the repo targets >=3.9.
Would you like me to generate a patch converting annotations to typing.List[...] for 3.8 compatibility?
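Either remedy is mechanical; a small sketch of the two options on a made-up annotation (illustrative only, not the file's actual signatures):

```python
# Option 1: keep the modern syntax but defer annotation evaluation.
from __future__ import annotations

def head_tokens(token_ids: list[int], k: int) -> list[int]:
    return token_ids[:k]

# Option 2: use typing generics that evaluate fine on Python 3.8.
from typing import List

def head_tokens_38(token_ids: List[int], k: int) -> List[int]:
    return token_ids[:k]
```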
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
584-591
: Docstring shape updated (good).Host buffer shape matches device layout and max_top_logprobs dimensioning.
392-399
: Store sizing (good).

NEW_TOKENS_SHAPE includes the 1 + max_top_logprobs channel; aligns with later slicing.
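To make the sizing concrete, here is a toy illustration of that "1 + max_top_logprobs" channel layout. The dimension names follow the PR; the code itself is only a sketch, not repo code:

```python
import torch

max_tokens, max_num_sequences, max_beam_width, max_top_logprobs = 2, 4, 1, 3

# Slot 0 of the last dimension holds the sampled token; slots 1..max_top_logprobs
# hold the top candidates when top_logprobs is requested.
new_tokens = torch.zeros(
    (max_tokens, max_num_sequences, max_beam_width, 1 + max_top_logprobs),
    dtype=torch.long)
print(new_tokens.shape)  # torch.Size([2, 4, 1, 4])
```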
@@ -0,0 +1,72 @@
import os
Missing NVIDIA copyright header.
Add the standard header per repo guidelines.
+# Copyright (c) 2025, NVIDIA CORPORATION.
+#
+#
import os
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Copyright (c) 2025, NVIDIA CORPORATION.
#
#
import os
```
🤖 Prompt for AI Agents
In tests/unittest/_torch/sampler/test_return_top_k_logprobs.py around lines 1 to
1, the file is missing the required NVIDIA copyright header; add the
repository's standard copyright/license header at the very top of the file
(above any imports) matching other test files, ensuring year and owner fields
follow project conventions.
PR_Github #16678 [ run ] completed with state
/bot run
PR_Github #16685 [ run ] triggered by Bot
PR_Github #16685 [ run ] completed with state
51d599d to b70e895 — Compare
/bot run

PR_Github #16721 [ run ] triggered by Bot
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/sampling_params.py (1)
1-1
: Add required NVIDIA copyright header.

Per repo guidelines, prepend the SPDX header to Python sources.
Apply this diff:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import json
```

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
1-1
: Add required NVIDIA copyright header.

Per repo guidelines, prepend the SPDX header to Python sources.
Apply this diff:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from copy import deepcopy
```

tensorrt_llm/_torch/pyexecutor/sampler.py (3)
154-185: UnboundLocalError risk: indices may be undefined when top_k <= 0.

top_k_sampling_batch returns indices unconditionally, but it is only assigned inside if top_k > 0. This will raise at runtime when top_k == 0 (a valid “no-top-k” case).

Apply this diff:

```diff
 def top_k_sampling_batch(logits, top_k=50, generator: Optional[torch.Generator] = None):
 @@
     batch_size, vocab_size = logits.size()
+    indices = None
 @@
     next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1)
     return next_tokens, softmax, indices
```
230-280: Same indices initialization issue in top_k_top_p_sampling_batch.

Mirror the fix to guarantee indices is defined for top_k <= 0.

Apply this diff:

```diff
     if temperature != 0:
         logits = logits / max(temperature, 1e-5)
     batch_size, vocab_size = logits.size()
+    indices = None
 @@
     next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1)
     return next_tokens, softmax, indices
```
1-1
: Add required NVIDIA copyright header.

Per repo guidelines, prepend the SPDX header to Python sources.

Apply this diff:
Apply this diff:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + from abc import ABC, abstractmethodtests/unittest/bindings/test_executor_bindings.py (1)
1-1
: Add required NVIDIA copyright header.Per repo guidelines, prepend the SPDX header to Python sources (tests included, unless exempted by project policy).
Apply this diff:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import datetime
♻️ Duplicate comments (3)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
206-214
: Make OutputConfig pickle backward-compatible (accept 7- or 8-field state).Current
__setstate__
requires exactly 8 items, breaking round-trip with older pickles. Accept 7 or 8 and defaulttopLogProbs
to nullopt when missing.- if (state.size() != 8) + if (state.size() != 7 && state.size() != 8) { throw std::runtime_error("Invalid OutputConfig state!"); } - new (&outputConfig) tle::OutputConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]), - nb::cast<bool>(state[2]), nb::cast<bool>(state[3]), nb::cast<bool>(state[4]), nb::cast<bool>(state[5]), - nb::cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(state[6]), - nb::cast<std::optional<SizeType32>>(state[7])); + auto additional + = nb::cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(state[6]); + std::optional<SizeType32> top + = (state.size() == 8) ? nb::cast<std::optional<SizeType32>>(state[7]) : std::nullopt; + new (&outputConfig) tle::OutputConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]), + nb::cast<bool>(state[2]), nb::cast<bool>(state[3]), nb::cast<bool>(state[4]), nb::cast<bool>(state[5]), + additional, top);cpp/tensorrt_llm/pybind/executor/request.cpp (1)
191-207
: Preserve pickle round-trip and back-compat for OutputConfig.Emit 8 fields and accept both 7- and 8-field states; default
top_logprobs
when absent.auto outputConfigGetstate = [](tle::OutputConfig const& self) { return py::make_tuple(self.returnLogProbs, self.returnContextLogits, self.returnGenerationLogits, - self.excludeInputFromOutput, self.returnEncoderOutput, self.returnPerfMetrics, self.additionalModelOutputs, - self.topLogProbs); + self.excludeInputFromOutput, self.returnEncoderOutput, self.returnPerfMetrics, + self.additionalModelOutputs, self.topLogProbs); }; auto outputConfigSetstate = [](py::tuple const& state) { - if (state.size() != 8) + if (state.size() != 7 && state.size() != 8) { throw std::runtime_error("Invalid OutputConfig state!"); } - return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), - state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), - state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(), - state[7].cast<std::optional<SizeType32>>()); + auto additional = state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(); + std::optional<SizeType32> top + = (state.size() == 8) ? state[7].cast<std::optional<SizeType32>>() : std::nullopt; + return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), + state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), additional, top); };tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1)
1-1
: Add NVIDIA copyright header.Required by coding guidelines.
+#!/usr/bin/env python3 +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import os
🧹 Nitpick comments (8)
tensorrt_llm/sampling_params.py (1)
328-333
: Clarify validation error message for top_logprobs dependency.
logprobs
is an int-or-bool-like count; saying “set logprobs to True” is misleading. Recommend wording tied to count semantics.Apply this diff:
- if not self.logprobs: - raise ValueError("You need to set logprobs to True to use top_logprobs.") + if not self.logprobs: + raise ValueError("top_logprobs requires logprobs >= 1.")tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
141-147
: Guard top_log_probs beam dimension in append().When top_log_probs is provided, ensure its outer list matches beam_width to prevent IndexError/misalignment.
Apply this diff:
for beam_idx, probs in enumerate(new_probs): self.log_probs[beam_idx].extend(probs) - if top_log_probs is not None: - self.top_log_probs[beam_idx].extend(top_log_probs[beam_idx]) + if top_log_probs is not None: + if len(top_log_probs) != self.beam_width: + raise ValueError("Beam width mismatch for top_log_probs") + self.top_log_probs[beam_idx].extend(top_log_probs[beam_idx])tensorrt_llm/_torch/pyexecutor/sampler.py (2)
626-635
: Tidy long docstring line and clarify meaning.Break the long line and make explicit that extra columns are present only when top_logprobs > 0.
Apply this diff:
- def log_probs_host(self, scheduled_requests: ScheduledRequests): - """Shape: max_num_sequences, max_beam_width == 1, max_tokens, 1 + max_top_logprobs) - The last dimension contains log_probs for the sampled tokens and additionally log_probs for the top tokens if top_logprobs is specified""" + def log_probs_host(self, scheduled_requests: ScheduledRequests): + """Shape: (max_num_sequences, max_beam_width==1, max_tokens, 1 + max_top_logprobs). + The last dim holds sampled-token log_probs and, when top_logprobs > 0, the top-k token log_probs."""
796-801
: Typo in comment (“to0”).Minor nit for readability.
Apply this diff:
- # otherwise, it is None and should be defaulted to0 + # otherwise, it is None and should be defaulted to 0tests/unittest/bindings/test_executor_bindings.py (2)
883-907
: Prefer truthiness checks over ‘== True/False’ (ruff E712).Use
assert var
/assert not var
oris True/False
for clarity and to satisfy linters.Apply this diff:
- assert config.return_log_probs == False - assert config.return_context_logits == False - assert config.return_generation_logits == False - assert config.exclude_input_from_output == False - assert config.return_encoder_output == False - assert config.return_perf_metrics == False + assert not config.return_log_probs + assert not config.return_context_logits + assert not config.return_generation_logits + assert not config.exclude_input_from_output + assert not config.return_encoder_output + assert not config.return_perf_metrics @@ - assert config.return_log_probs == True - assert config.return_context_logits == False - assert config.return_generation_logits == True - assert config.exclude_input_from_output == False - assert config.return_encoder_output == True - assert config.return_perf_metrics == False + assert config.return_log_probs is True + assert not config.return_context_logits + assert config.return_generation_logits is True + assert not config.exclude_input_from_output + assert config.return_encoder_output is True + assert not config.return_perf_metrics
910-926
: Same style nit in pickle test.Align with boolean assertion style.
Apply this diff:
- assert config_copy.return_log_probs == True - assert config_copy.return_context_logits == False - assert config_copy.return_generation_logits == True - assert config_copy.exclude_input_from_output == False - assert config_copy.return_encoder_output == True - assert config_copy.return_perf_metrics == False + assert config_copy.return_log_probs is True + assert not config_copy.return_context_logits + assert config_copy.return_generation_logits is True + assert not config_copy.exclude_input_from_output + assert config_copy.return_encoder_output is True + assert not config_copy.return_perf_metricscpp/tensorrt_llm/pybind/executor/request.cpp (1)
2-2
: Update copyright year to 2025.Keep headers current per repo guideline.
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1)
80-86
: Rename test for clarity/grammar.“disabler” → “disabled”.
-def test_generate_with_top_logprobs_and_disabler_logprobs( +def test_generate_with_top_logprobs_and_disabled_logprobs( llm_torch, input_prompts):
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (10 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (12 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a30.yml (1 hunks)
- tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1 hunks)
- tests/unittest/bindings/test_executor_bindings.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/llmapi/llm_args.py
- tests/integration/test_lists/test-db/l0_a30.yml
- tensorrt_llm/executor/result.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- examples/llm-api/quickstart_advanced.py
- cpp/include/tensorrt_llm/executor/executor.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
cpp/tensorrt_llm/executor/outputConfig.cpp
cpp/tensorrt_llm/nanobind/executor/request.cpp
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
tests/unittest/bindings/test_executor_bindings.py
cpp/tensorrt_llm/executor/outputConfig.cpp
tensorrt_llm/sampling_params.py
tensorrt_llm/_torch/pyexecutor/sampler.py
cpp/tensorrt_llm/nanobind/executor/request.cpp
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/bindings/test_executor_bindings.py
tensorrt_llm/sampling_params.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
🧬 Code graph analysis (6)
cpp/tensorrt_llm/pybind/executor/request.cpp (2)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (1)
OutputConfig
(23-35)
tests/unittest/bindings/test_executor_bindings.py (3)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (2)
OutputConfig
(23-35)AdditionalModelOutput
(37-41)tensorrt_llm/sampling_params.py (1)
AdditionalModelOutput
(112-121)
tensorrt_llm/sampling_params.py (2)
tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_beam_search
(519-520)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
top_log_probs
(250-251)calculate_top_logprobs
(422-423)LlmRequest
(306-454)log_probs
(246-247)append_log_probs
(195-206)tensorrt_llm/executor/result.py (1)
Logprob
(37-40)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
calculate_top_logprobs
(430-431)tensorrt_llm/_torch/shared_tensor/shared_tensor.py (3)
SharedTensorContainer
(95-404)from_tensor
(313-331)dump_to_dict
(375-404)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-508)tensorrt_llm/llmapi/llm_args.py (2)
CudaGraphConfig
(108-165)KvCacheConfig
(946-1077)
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
896-896: Avoid equality comparisons to True
; use config.return_log_probs:
for truth checks
Replace with config.return_log_probs
(E712)
tensorrt_llm/_torch/pyexecutor/sampler.py
628-628: Line too long (146 > 120)
(E501)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
91-91: Local variable sampling_params
is assigned to but never used
Remove assignment to unused variable sampling_params
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/executor/outputConfig.cpp (1)
23-35
: Constructor extension looks correct; ownership and member initialization LGTM.
```python
sampling_params = SamplingParams(max_tokens=max_tokens,
                                 logprobs=False,
                                 top_k=top_k,
                                 top_p=top_p,
                                 temperature=1.0,
                                 top_logprobs=top_logprobs)
```
🛠️ Refactor suggestion
Remove unused variable to satisfy Ruff F841 (second occurrence).
- sampling_params = SamplingParams(max_tokens=max_tokens,
- logprobs=False,
- top_k=top_k,
- top_p=top_p,
- temperature=1.0,
- top_logprobs=top_logprobs)
+ SamplingParams(max_tokens=max_tokens,
+ logprobs=False,
+ top_k=top_k,
+ top_p=top_p,
+ temperature=1.0,
+ top_logprobs=top_logprobs)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
SamplingParams(max_tokens=max_tokens,
               logprobs=False,
               top_k=top_k,
               top_p=top_p,
               temperature=1.0,
               top_logprobs=top_logprobs)
```
🧰 Tools
🪛 Ruff (0.12.2)
91-91: Local variable sampling_params is assigned to but never used. Remove the assignment to the unused variable sampling_params (F841).
🤖 Prompt for AI Agents
In tests/unittest/_torch/sampler/test_return_top_k_logprobs.py around lines 91
to 96, there is a second, unused declaration of a variable (triggering Ruff
F841); remove the redundant variable declaration and update the SamplingParams
call so it does not reference that removed variable (use None or omit the
parameter if optional), ensuring no unused variable remains.
PR_Github #16721 [ run ] completed with state
/bot run
b70e895 to d6e9f5e — Compare
/bot run
PR_Github #16804 [ run ] triggered by Bot
PR_Github #16805 [ run ] triggered by Bot
PR_Github #16804 [ run ] completed with state
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tests/unittest/bindings/test_executor_bindings.py (1)
896-907: Support optional top_logprobs in the OutputConfig binding

- In cpp/tensorrt_llm/pybind/executor/request.cpp, change the eighth py::init parameter from SizeType32 to std::optional (and keep py::arg("top_logprobs") = py::none()).
- Update the outputConfigSetstate lambda to accept both 7- and 8-element state tuples, defaulting a missing top_logprobs to std::nullopt.

examples/llm-api/quickstart_advanced.py (1)
1-1
: Add NVIDIA copyright header.
Required by repo guidelines for Python sources.

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
```

tensorrt_llm/_torch/pyexecutor/sampler.py (3)
1-6
: Add NVIDIA header and ensure Python 3.8+ typing compatibility.
- Missing required NVIDIA copyright header.
- File uses PEP 604 unions (|) and builtin generics (list[int], tuple[...]); add from future import annotations to keep runtime compatibility with Python 3.8.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +from __future__ import annotations
169-185
: NameError when top_k == 0: indices may be undefined.

Initialize indices = None before the conditional; return it safely.

Apply:
Apply:
- # get first top_k logits of each sample and their indices - if top_k > 0: + # get first top_k logits of each sample and their indices + indices = None + if top_k > 0: values, indices = torch.topk(logits, top_k, dim=-1) min_values = values[:, -1].unsqueeze(-1).expand(batch_size, vocab_size) @@ - next_tokens = torch.multinomial(softmax, num_samples=1, - generator=generator).squeeze(-1) + next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1) return next_tokens, softmax, indices
269-279
: Return top indices after applying the top-p mask (correctness of “top logprobs”).Currently, indices are taken before masking by top-p; this can surface masked tokens (-inf logit) in top candidates. Recompute indices after masking.
Apply:
logits = logits.masked_fill(indices_to_remove, float('-inf')) @@ - next_tokens = torch.multinomial(softmax, num_samples=1, - generator=generator).squeeze(-1) - return next_tokens, softmax, indices + next_tokens = torch.multinomial(softmax, num_samples=1, generator=generator).squeeze(-1) + # Recompute top-k indices on masked logits so they align with the final distribution + _, indices = torch.topk(logits, top_k, dim=-1) + return next_tokens, softmax, indices
♻️ Duplicate comments (4)
tests/unittest/bindings/test_executor_bindings.py (1)
910-926
: Round-trip pickle requires 8-field state; keep back-compat for 7-field pickles.
The test is correct; update setstate to accept both sizes so older pickles still load.Apply in cpp/tensorrt_llm/pybind/executor/request.cpp (see suggested diff in that file’s comments).
cpp/tensorrt_llm/pybind/executor/request.cpp (2)
209-215
: Fix ctor signature: last param must be std::optional.
Passing py::none() to SizeType32 causes a compile-time error and is the likely CI failure.Apply:
- .def(py::init<bool, bool, bool, bool, bool, bool, std::optional<std::vector<tle::AdditionalModelOutput>>, - SizeType32>(), + .def(py::init<bool, bool, bool, bool, bool, bool, std::optional<std::vector<tle::AdditionalModelOutput>>, + std::optional<SizeType32>>(), ... - py::arg("additional_model_outputs") = py::none(), py::arg("top_logprobs") = py::none()) + py::arg("additional_model_outputs") = py::none(), py::arg("top_logprobs") = py::none())
199-207
: Back-compat: accept both 7- and 8-field OutputConfig pickle states.
Older pickles (without top_logprobs) should still unpickle. Treat missing as std::nullopt.Apply:
- auto outputConfigSetstate = [](py::tuple const& state) - { - if (state.size() != 8) - { - throw std::runtime_error("Invalid OutputConfig state!"); - } - return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), - state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), - state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(), - state[7].cast<std::optional<SizeType32>>()); - }; + auto outputConfigSetstate = [](py::tuple const& state) + { + if (state.size() != 7 && state.size() != 8) + { + throw std::runtime_error("Invalid OutputConfig state!"); + } + auto additional = state[6].cast<std::optional<std::vector<tle::AdditionalModelOutput>>>(); + std::optional<SizeType32> top = (state.size() == 8) + ? state[7].cast<std::optional<SizeType32>>() + : std::optional<SizeType32>{std::nullopt}; + return tle::OutputConfig(state[0].cast<bool>(), state[1].cast<bool>(), state[2].cast<bool>(), + state[3].cast<bool>(), state[4].cast<bool>(), state[5].cast<bool>(), additional, top); + };tensorrt_llm/_torch/pyexecutor/sampler.py (1)
809-829
: Greedy path breaks with top_logprobs > 0 (indices is None).When strategy == GREEDY, indices is None; later you slice with num_top_tokens > 0 causing shape mismatch and wrong log_probs. Compute top-k indices from logits for greedy.
Apply:
```diff
             if batched_next_tokens is None:
                 logits = self._apply_embedding_bias(logits, [req])
-                next_tokens, softmax, indices = sample(strategy, logits,
-                                                       generator)
+                next_tokens, softmax, indices = sample(strategy, logits, generator)
             else:
                 # Batched processing already applied bias, just use the results
                 next_tokens = batched_next_tokens[input_slice]
                 softmax = batched_softmax[input_slice]
-                indices = batched_indices[
-                    input_slice] if strategy != GREEDY else None
+                indices = batched_indices[input_slice] if strategy != GREEDY else None
+
+            # If greedy and top-logprobs requested, derive top indices from logits
+            if num_top_tokens > 0 and indices is None:
+                indices = torch.topk(logits, num_top_tokens, dim=-1).indices
 @@
-            if request.calculate_top_logprobs and strategy != GREEDY:
+            if num_top_tokens > 0:
                 # concatenate the sampled tokens with the top tokens
                 next_tokens = torch.cat(
-                    [next_tokens.unsqueeze(1), indices[:, :num_top_tokens]],
+                    [next_tokens.unsqueeze(1), indices[:, :num_top_tokens]],
                     dim=1)
             else:
                 next_tokens = next_tokens.unsqueeze(1)
```
🧹 Nitpick comments (14)
tensorrt_llm/sampling_params.py (1)
176-176: Docstring default mismatch for top_logprobs.
The field default is None; update the docs accordingly. Apply:

```diff
-        top_logprobs (int, optional): Number of top log probabilities to return per output token. Defaults to 0.
+        top_logprobs (int, optional): Number of top log probabilities to return per output token. Defaults to None.
```

tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
69-76: Remove the unused variable to satisfy Ruff F841.
No need to bind the instance in the negative path. Apply:

```diff
-    with pytest.raises(ValueError, match="top_logprobs.*"):
-        sampling_params = SamplingParams(max_tokens=max_tokens,
-                                         logprobs=True,
-                                         top_k=top_k,
-                                         top_p=top_p,
-                                         temperature=1.0,
-                                         top_logprobs=top_logprobs)
+    with pytest.raises(ValueError, match="top_logprobs.*"):
+        SamplingParams(max_tokens=max_tokens,
+                       logprobs=True,
+                       top_k=top_k,
+                       top_p=top_p,
+                       temperature=1.0,
+                       top_logprobs=top_logprobs)
```
91-96: Remove the second unused variable to satisfy Ruff F841.
Same pattern in another negative case. Apply:

```diff
-    sampling_params = SamplingParams(max_tokens=max_tokens,
-                                     logprobs=False,
-                                     top_k=top_k,
-                                     top_p=top_p,
-                                     temperature=1.0,
-                                     top_logprobs=top_logprobs)
+    SamplingParams(max_tokens=max_tokens,
+                   logprobs=False,
+                   top_k=top_k,
+                   top_p=top_p,
+                   temperature=1.0,
+                   top_logprobs=top_logprobs)
```

tensorrt_llm/_torch/pyexecutor/sampler.py (11)
33-34: Document shapes for the new optional tensors.
Add concise shape notes for top_tokens/top_log_probs to aid maintainability and avoid ambiguity.
Apply:

```diff
 @dataclass(kw_only=True)
 class SampleStateTensors:
     new_tokens: torch.Tensor
     log_probs: torch.Tensor | None = None
-    top_tokens: torch.Tensor | None = None
-    top_log_probs: torch.Tensor | None = None
+    top_tokens: torch.Tensor | None = None  # Shape: (max_tokens, max_num_sequences, MAX_BEAM_WIDTH, max_top_logprobs)
+    top_log_probs: torch.Tensor | None = None  # Shape matches top_tokens
```
157-162: Fix the docstring: the returned next_tokens shape is [batch_size], not [batch_size, 1]; clarify indices. Apply:

```diff
-    Returns:
-        next_tokens: [batch_size, 1] (sampled tokens)
+    Returns:
+        next_tokens: [batch_size] (sampled tokens)
         softmax: [batch_size, vocab_size] (probability distribution)
-        indices: [batch_size, top_k] (indices of top_k logits)
+        indices: None or [batch_size, top_k] (indices of top_k logits, None if top_k <= 0)
```
191-196: Fix the docstring: top_p returns sorted indices over the vocabulary. Apply:

```diff
-    Returns:
-        next_tokens: [batch_size, 1] (sampled tokens)
+    Returns:
+        next_tokens: [batch_size] (sampled tokens)
         softmax: [batch_size, vocab_size] (probability distribution)
-        indices: [batch_size, top_k] (indices of top_k logits)
+        indices: [batch_size, vocab_size] (sorted token indices by logit desc, before masking)
```
283-291: The greedy-path docstring is fine; ensure consistency with the other functions (next_tokens shape).
Consider mirroring the wording used in the other functions: "next_tokens: [batch_size]".
405-416: Validate that Args.max_top_logprobs is non-negative and small enough to fit NEW_TOKENS_SHAPE.
Avoid silent misconfigurations.
Apply:

```diff
     def __init__(self, args: Args):
+        assert args.max_top_logprobs >= 0, "max_top_logprobs must be >= 0"
         self.max_seq_len = args.max_seq_len
         self.enable_mixed_sampler = args.enable_mixed_sampler
         self.max_tokens = args.max_draft_len + 1
         self.max_top_logprobs = args.max_top_logprobs
```
416-419: Nice: the comment explains the last dimension. Add a brief example to solidify the intent. Apply:

```diff
-        # The last dimension contains the sampled tokens and additionally top tokens if top_logprobs is specified
+        # The last dimension contains the sampled token (slot 0) and, if requested, the top-k tokens (slots 1..k).
```
626-635: Fix the overlong docstring line and clarify the shapes (ruff E501). Apply:

```diff
-    """Shape: max_num_sequences, max_beam_width == 1, max_tokens, 1 + max_top_logprobs)
-    The last dimension contains log_probs for the sampled tokens and additionally log_probs for the top tokens if top_logprobs is specified"""
+    """Shape:
+    (max_num_sequences, max_beam_width == 1, max_tokens, 1 + max_top_logprobs)
+
+    The last dim packs: [sampled_token_logprob, top1_logprob, ..., topK_logprob] when requested.
+    """
```
797-803: Typo in a comment. Apply:

```diff
-        # otherwise, it is None and should be defaulted to0
+        # otherwise, it is None and should default to 0
```
491-509: Minor: return ranks starting at 1 for the top tokens, for consistency with the sampled token's rank.
Currently the sampled token uses rank=1 while the top-k entries use rank=k (0-based). Consider 1-based ranks for consistency (a toy illustration follows the diff).
Apply:

```diff
-                Logprob(logprob=logprob[k], rank=k)
+                Logprob(logprob=logprob[k], rank=k + 1)
```
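A toy illustration of the 1-based ranking this suggests (Logprob is stood in by a namedtuple here; the real class lives in tensorrt_llm/executor/result.py):

```python
from collections import namedtuple

Logprob = namedtuple("Logprob", ["logprob", "rank"])

top_token_ids = [42, 7, 99]        # hypothetical top-3 tokens for one step
top_logprobs = [-0.1, -1.3, -2.7]  # their log probabilities, best first

# rank=k + 1 keeps the top entries consistent with the sampled token's rank=1.
token_top_log_probs = {
    tok: Logprob(logprob=lp, rank=k + 1)
    for k, (tok, lp) in enumerate(zip(top_token_ids, top_logprobs))
}
assert token_top_log_probs[42].rank == 1
```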
354-366: Add an explicit default-case guard to sample().
Optional: raise for unknown strategy variants to fail fast.
Apply:

```diff
         case ("greedy", None):
             return greedy_search_sampling_batch(logits)
+        case _:
+            raise ValueError(f"Unsupported sampling strategy: {strategy}")
```
154-186: Optional: add temperature handling to top_k_sampling for parity with top_p/top_k_top_p.
If needed, add temperature scaling for top-k as in the other functions; a hedged sketch follows.
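A hedged sketch of what such temperature handling could look like (mirroring the other helpers; the function name and signature are illustrative, not the file's current code):

```python
import torch


def top_k_with_temperature(logits: torch.Tensor, top_k: int, temperature: float):
    # Scale logits by temperature before restricting to the top-k candidates.
    logits = logits / max(temperature, 1e-5)
    values, indices = torch.topk(logits, top_k, dim=-1)
    # Mask everything outside the top-k, then sample from the renormalized softmax.
    masked = torch.full_like(logits, float("-inf")).scatter(-1, indices, values)
    softmax = torch.softmax(masked, dim=-1)
    next_tokens = torch.multinomial(softmax, num_samples=1).squeeze(-1)
    return next_tokens, softmax, indices
```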
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
- cpp/include/tensorrt_llm/executor/executor.h (2 hunks)
- cpp/tensorrt_llm/executor/outputConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/request.cpp (1 hunks)
- examples/llm-api/quickstart_advanced.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (10 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (12 hunks)
- tensorrt_llm/executor/result.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tensorrt_llm/sampling_params.py (3 hunks)
- tests/integration/test_lists/test-db/l0_a30.yml (1 hunks)
- tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1 hunks)
- tests/unittest/bindings/test_executor_bindings.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/integration/test_lists/test-db/l0_a30.yml
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/executor/result.py
- cpp/tensorrt_llm/nanobind/executor/request.cpp
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/executor/outputConfig.cpp
- tensorrt_llm/_torch/pyexecutor/llm_request.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/bindings/test_executor_bindings.py
tensorrt_llm/sampling_params.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tests/unittest/bindings/test_executor_bindings.py
cpp/tensorrt_llm/pybind/executor/request.cpp
tensorrt_llm/sampling_params.py
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/pybind/executor/request.cpp
🧬 Code graph analysis (5)
tests/unittest/bindings/test_executor_bindings.py (3)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (2)
OutputConfig
(23-35)AdditionalModelOutput
(37-41)tensorrt_llm/sampling_params.py (1)
AdditionalModelOutput
(112-121)
cpp/tensorrt_llm/pybind/executor/request.cpp (2)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (1)
OutputConfig
(23-35)
tensorrt_llm/sampling_params.py (3)
tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
top_k
(221-221)tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_beam_search
(519-520)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-508)tensorrt_llm/llmapi/llm_args.py (2)
CudaGraphConfig
(108-165)KvCacheConfig
(946-1077)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
top_log_probs
(250-251)calculate_top_logprobs
(422-423)LlmRequest
(306-454)log_probs
(246-247)append_log_probs
(195-206)tensorrt_llm/executor/result.py (1)
Logprob
(37-40)
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
896-896: Avoid equality comparisons to True
; use config.return_log_probs:
for truth checks
Replace with config.return_log_probs
(E712)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
91-91: Local variable sampling_params
is assigned to but never used
Remove assignment to unused variable sampling_params
(F841)
tensorrt_llm/_torch/pyexecutor/sampler.py
628-628: Line too long (146 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/bindings/test_executor_bindings.py (1)
892-892
: Good: default top_logprobs is None is covered.
This verifies the new attribute presence and default.cpp/tensorrt_llm/pybind/executor/request.cpp (2)
191-196
: LGTM: include topLogProbs in getstate.
Serialization now emits the 8th element.
221-223
: LGTM: expose top_logprobs attribute.
The binding name matches Python-side expectations and tests.tensorrt_llm/sampling_params.py (1)
328-347
: Validation logic for top_logprobs looks correct.
Covers logprobs dependency, bounds, greedy/top-k/beam-search incompatibilities.tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (1)
1-1
: Add NVIDIA copyright header.
Tests are Python and must include the standard header.Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import os⛔ Skipped due to learnings
Learnt from: CR PR: NVIDIA/TensorRT-LLM#0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-08-25T22:42:47.612Z Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Learnt from: galagam PR: NVIDIA/TensorRT-LLM#6487 File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12 Timestamp: 2025-08-06T13:58:07.506Z Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
tensorrt_llm/_torch/pyexecutor/sampler.py (4)
648-666
: Guard construction of top_ on per-request need, not only global capability.*Currently gated by self.calculate_top_logprobs (global); safe, but per-request shaping is handled later via slicing. No action needed; just a heads-up.
833-841
: After greedy fix, gather/copy shapes align; before fix it could mis-shape.Good use of gather on expanded next_tokens; nothing to change once greedy is fixed.
430-432
: Nice property. Consider exposing max_top_logprobs==0 early to skip extra host tensors.No change required; optimization is already applied via calculate_top_logprobs.
626-635
: Action: Make PEP‑604 annotations safe — either require Python ≥3.10 or addfrom __future__ import annotations
.Scan shows tensorrt_llm/_torch/pyexecutor/sampler.py (and many other files) use
X | Y
/ builtin generics in type annotations but lackfrom __future__ import annotations
. Either pin the repo/CI to Python ≥3.10 or add the future import to affected files (start with tensorrt_llm/_torch/pyexecutor/sampler.py).
PR_Github #16805 [ run ] completed with state |
d6e9f5e
to
504fcf6
Compare
/bot run |
PR_Github #16817 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
491-509
: Guard against None and slice only requested K to avoid OOB/garbage reads.This can dereference
None
when the sampler’smax_top_logprobs==0
or when request K==0; it also reads all preallocated columns, not the requested K.Apply:
def _maybe_get_top_logprobs(self, request: LlmRequest, state: SampleState, *, beam: int, count: int): """Return top_logprobs for the current request if requested, else return None.""" - - if request.calculate_top_logprobs: - current_slice = slice(0, count), request.py_seq_slot, beam - top_log_probs = state.host.top_log_probs[request.py_seq_slot, - beam, :count] - top_current_tokens = state.host.top_tokens[current_slice] - token_top_log_probs = [{ - int(token[k]): - Logprob(logprob=logprob[k], rank=k) - for k in range(request.py_top_logprobs) - } for token, logprob in zip(top_current_tokens, - top_log_probs.tolist())] - return [token_top_log_probs] - else: - return None + k = request.py_top_logprobs or 0 + if k <= 0: + return None + if state.host.top_log_probs is None or state.host.top_tokens is None: + return None + current_slice = (slice(0, count), request.py_seq_slot, beam) + # Restrict to the requested K + top_log_probs = state.host.top_log_probs[request.py_seq_slot, beam, :count, :k] + top_current_tokens = state.host.top_tokens[current_slice][:, :k] + token_top_log_probs = [ + {int(token[idx]): Logprob(logprob=row[idx], rank=idx) + for idx in range(len(token))} + for token, row in zip(top_current_tokens, top_log_probs.tolist()) + ] + return [token_top_log_probs]
♻️ Duplicate comments (4)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (3)
1-1
: Add the standard NVIDIA header.Required by repo policy; place it above imports.
+ # Copyright (c) 2025, NVIDIA CORPORATION. + # + # import os
69-76
: Remove unused variable to satisfy Ruff F841.- with pytest.raises(ValueError, match="top_logprobs.*"): - sampling_params = SamplingParams(max_tokens=max_tokens, - logprobs=True, - top_k=top_k, - top_p=top_p, - temperature=1.0, - top_logprobs=top_logprobs) + with pytest.raises(ValueError, match="top_logprobs.*"): + SamplingParams(max_tokens=max_tokens, + logprobs=True, + top_k=top_k, + top_p=top_p, + temperature=1.0, + top_logprobs=top_logprobs)
91-96
: Remove unused variable to satisfy Ruff F841 (second occurrence).- sampling_params = SamplingParams(max_tokens=max_tokens, - logprobs=False, - top_k=top_k, - top_p=top_p, - temperature=1.0, - top_logprobs=top_logprobs) + SamplingParams(max_tokens=max_tokens, + logprobs=False, + top_k=top_k, + top_p=top_p, + temperature=1.0, + top_logprobs=top_logprobs)tensorrt_llm/_torch/pyexecutor/sampler.py (1)
154-162
: Fix docstrings: shapes and semantics.
next_tokens
is 1D (squeeze(-1)
), not[batch, 1]
.- Clarify
indices
meaning for top-p paths (sorted full list).Apply:
@@ def top_k_sampling_batch(... - next_tokens: [batch_size, 1] (sampled tokens) + next_tokens: [batch_size] (sampled tokens) @@ def top_p_sampling_batch(... - next_tokens: [batch_size, 1] (sampled tokens) + next_tokens: [batch_size] (sampled tokens) - indices: [batch_size, top_k] (indices of top_k logits) + indices: [batch_size, vocab_size] (indices sorted by descending logit; first N cover top-p mass) @@ def greedy_search_sampling_batch(... - next_tokens: [batch_size, 1] (sampled tokens) + next_tokens: [batch_size] (sampled tokens)Also applies to: 187-196, 282-292
🧹 Nitpick comments (8)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (3)
38-38
: Fix misleading comment on GPU arch.Decorator forces Ampere; comment mentions H100.
-@force_ampere # Save H100 resource +@force_ampere # Save Ampere GPU resources
80-80
: Rename test for clarity/grammar.-def test_generate_with_top_logprobs_and_disabler_logprobs( +def test_generate_with_top_logprobs_with_logprobs_disabled(
97-97
: Add a negative test for exceeding max_top_logprobs (LLM-level cap).Covers the backend constraint (fixture sets max_top_logprobs=4).
def test_top_logprobs_exceeds_llm_cap(llm_torch, input_prompts): with pytest.raises(ValueError): sp = SamplingParams(max_tokens=4, logprobs=True, top_logprobs=8) # Expect backend/runner to reject > max_top_logprobs for _ in llm_torch.generate(input_prompts, sampling_params=sp): passWant me to push a patch with all suggested edits?
tests/unittest/bindings/test_executor_bindings.py (1)
896-907
: Strengthen assertions for type and pickle stability.Add a minimal type check to ensure
top_logprobs
remains anint
across pickle round-trips.Apply:
@@ - assert config.top_logprobs == 10 + assert isinstance(config.top_logprobs, int) + assert config.top_logprobs == 10 @@ - assert config_copy.top_logprobs == 10 + assert isinstance(config_copy.top_logprobs, int) + assert config_copy.top_logprobs == 10Also applies to: 910-926
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
327-328
: Include TopKTopP in the Strategy type hint.Keeps annotations consistent with actual usage in
request_strategy
.Apply:
-Strategy = TopK | TopP | Greedy +Strategy = TopK | TopP | TopKTopP | Greedy
799-803
: Nit: typo in comment.“defaulted to0” -> “defaulted to 0”.
Apply:
- # otherwise, it is None and should be defaulted to0 + # otherwise, it is None and should be defaulted to 0tensorrt_llm/sampling_params.py (2)
328-339
: Clarify validation message and align semantics with int-valuedlogprobs
.Message “set logprobs to True” is misleading since
logprobs
is numeric.Apply:
- if not self.logprobs: - raise ValueError("You need to set logprobs to True to use top_logprobs.") + if not self.logprobs: + raise ValueError("You need to set logprobs >= 1 to use top_logprobs.")
345-347
: Explicitly disallow with beam search (good), consider adding actionable hint.Optional: include “use TRTLLM sampler path instead” or “set
use_beam_search=False
” to guide users.Apply:
- if self.use_beam_search: - raise ValueError("top_logprobs + beam search is not supported") + if self.use_beam_search: + raise ValueError("top_logprobs + beam search is not supported; set use_beam_search=False.")
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
cpp/include/tensorrt_llm/executor/executor.h
(2 hunks)cpp/tensorrt_llm/executor/outputConfig.cpp
(1 hunks)cpp/tensorrt_llm/nanobind/executor/request.cpp
(1 hunks)cpp/tensorrt_llm/pybind/executor/request.cpp
(1 hunks)examples/llm-api/quickstart_advanced.py
(5 hunks)tensorrt_llm/_torch/pyexecutor/_util.py
(3 hunks)tensorrt_llm/_torch/pyexecutor/config.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/llm_request.py
(10 hunks)tensorrt_llm/_torch/pyexecutor/sampler.py
(12 hunks)tensorrt_llm/executor/result.py
(2 hunks)tensorrt_llm/llmapi/llm_args.py
(2 hunks)tensorrt_llm/sampling_params.py
(3 hunks)tests/integration/test_lists/test-db/l0_a30.yml
(1 hunks)tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
(1 hunks)tests/unittest/bindings/test_executor_bindings.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/config.py
- cpp/tensorrt_llm/pybind/executor/request.cpp
- tensorrt_llm/executor/result.py
- cpp/tensorrt_llm/nanobind/executor/request.cpp
- tests/integration/test_lists/test-db/l0_a30.yml
- cpp/include/tensorrt_llm/executor/executor.h
- cpp/tensorrt_llm/executor/outputConfig.cpp
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🧬 Code graph analysis (6)
tests/unittest/bindings/test_executor_bindings.py (3)
cpp/include/tensorrt_llm/executor/executor.h (1)
OutputConfig
(212-239)cpp/tensorrt_llm/executor/outputConfig.cpp (2)
OutputConfig
(23-35)AdditionalModelOutput
(37-41)tensorrt_llm/sampling_params.py (1)
AdditionalModelOutput
(112-121)
tensorrt_llm/sampling_params.py (3)
tensorrt_llm/scaffolding/task.py (1)
logprobs
(99-100)cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
top_k
(221-221)tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_beam_search
(519-520)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py (2)
tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-508)tensorrt_llm/llmapi/llm_args.py (2)
CudaGraphConfig
(108-165)KvCacheConfig
(946-1077)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
beam_width
(70-73)calculate_top_logprobs
(430-431)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
max_seq_len
(558-568)max_seq_len
(571-575)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (5)
top_log_probs
(250-251)calculate_top_logprobs
(422-423)LlmRequest
(306-454)log_probs
(246-247)append_log_probs
(195-206)tensorrt_llm/executor/result.py (1)
Logprob
(37-40)
🪛 Ruff (0.12.2)
tests/unittest/bindings/test_executor_bindings.py
896-896: Avoid equality comparisons to True
; use config.return_log_probs:
for truth checks
Replace with config.return_log_probs
(E712)
tests/unittest/_torch/sampler/test_return_top_k_logprobs.py
91-91: Local variable sampling_params
is assigned to but never used
Remove assignment to unused variable sampling_params
(F841)
tensorrt_llm/_torch/pyexecutor/sampler.py
628-628: Line too long (146 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/bindings/test_executor_bindings.py (1)
892-893
: LGTM: default is None as expected.Asserting the default
OutputConfig().top_logprobs is None
is correct and matches the new optional semantics.tensorrt_llm/_torch/pyexecutor/_util.py (1)
662-664
: LGTM: correctly wiresmax_top_logprobs
from backend config.Propagation path looks correct.
tensorrt_llm/sampling_params.py (1)
176-176
: Docstring default matches Optional semantics.“Defaults to None” aligns with the new field type.
```diff
 def create_torch_sampler_args(executor_config: ExecutorConfig, mapping: Mapping,
-                              *, max_seq_len: int, enable_mixed_sampler: bool):
+                              *, max_seq_len: int, enable_mixed_sampler: bool,
+                              max_top_logprobs: int):
     max_num_sequences = executor_config.max_batch_size * mapping.pp_size
```
🛠️ Refactor suggestion
Validate max_top_logprobs (non-negative) at construction.
Prevents accidental negative values propagating to buffer shapes.
Apply:
def create_torch_sampler_args(executor_config: ExecutorConfig, mapping: Mapping,
- *, max_seq_len: int, enable_mixed_sampler: bool,
- max_top_logprobs: int):
+ *, max_seq_len: int, enable_mixed_sampler: bool,
+ max_top_logprobs: int):
+ if max_top_logprobs is None:
+ max_top_logprobs = 0
+ if max_top_logprobs < 0:
+ raise ValueError("max_top_logprobs must be >= 0")
@@
max_beam_width=executor_config.max_beam_width,
max_top_logprobs=max_top_logprobs,
enable_mixed_sampler=enable_mixed_sampler,
Also applies to: 649-649
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/_util.py around lines 638-641 (and also at
line 649), validate that the parameter max_top_logprobs is non-negative when
constructing the sampler args: if max_top_logprobs is negative raise a
ValueError (with a clear message including the invalid value) before any buffer
shape calculations, and ensure downstream code uses the validated value; update
both places where max_top_logprobs is accepted to perform the same check.
```python
    return LLM(
        model=os.path.join(llm_models_root(), "llama-models-v2",
                           "TinyLlama-1.1B-Chat-v1.0"),
        kv_cache_config=KvCacheConfig(max_tokens=10000),
        max_batch_size=len(
            input_prompts
        ),  # use small batch size to prevent large buffers from possibly hiding wrong data accesses.
        max_seq_len=32,
        disable_overlap_scheduler=False,
        sampler_type="TorchSampler",
        cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 4, 8],
                                          enable_padding=True),
        enable_mixed_sampler=True,
        max_top_logprobs=4,
    )
```
🛠️ Refactor suggestion
Make tests robust to missing local model assets (skip when absent).
Prevents CI flakes when TinyLlama weights are not present.
- return LLM(
- model=os.path.join(llm_models_root(), "llama-models-v2",
- "TinyLlama-1.1B-Chat-v1.0"),
+ model_path = os.path.join(llm_models_root(), "llama-models-v2",
+ "TinyLlama-1.1B-Chat-v1.0")
+ if not os.path.isdir(model_path):
+ pytest.skip(f"Model not found: {model_path}")
+ return LLM(
+ model=model_path,
kv_cache_config=KvCacheConfig(max_tokens=10000),
max_batch_size=len(
input_prompts
), # use small batch size to prevent large buffers from possibly hiding wrong data accesses.
max_seq_len=32,
disable_overlap_scheduler=False,
sampler_type="TorchSampler",
cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 4, 8],
enable_padding=True),
enable_mixed_sampler=True,
max_top_logprobs=4,
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    model_path = os.path.join(llm_models_root(), "llama-models-v2",
                              "TinyLlama-1.1B-Chat-v1.0")
    if not os.path.isdir(model_path):
        pytest.skip(f"Model not found: {model_path}")
    return LLM(
        model=model_path,
        kv_cache_config=KvCacheConfig(max_tokens=10000),
        max_batch_size=len(
            input_prompts
        ),  # use small batch size to prevent large buffers from possibly hiding wrong data accesses.
        max_seq_len=32,
        disable_overlap_scheduler=False,
        sampler_type="TorchSampler",
        cuda_graph_config=CudaGraphConfig(batch_sizes=[1, 2, 4, 8],
                                          enable_padding=True),
        enable_mixed_sampler=True,
        max_top_logprobs=4,
    )
```
🤖 Prompt for AI Agents
In tests/unittest/_torch/sampler/test_return_top_k_logprobs.py around lines 21
to 35, the test constructs an LLM pointing to local TinyLlama weights which may
be absent in CI; add a pre-check for the model directory/file using
os.path.exists(...) and call pytest.skip("TinyLlama model not available locally
- skipping test") when missing, and ensure pytest and os are imported at top of
the file; then only construct and return the LLM if the path exists so the test
is skipped gracefully instead of failing.
686cc8b to 6cf9dc4 (Compare)
/bot run
PR_Github #18767 [ run ] triggered by Bot
PR_Github #18763 [ run ] completed with state
PR_Github #18767 [ run ] completed with state
6cf9dc4 to 337b137 (Compare)
/bot run
PR_Github #18786 [ run ] triggered by Bot
PR_Github #18786 [ run ] completed with state
…ties
- Added a new parameter `topLogProbs` to the `OutputConfig` class, allowing users to specify the number of top log probabilities to return.
- Updated related methods and bindings in the executor and API layers to accommodate this feature.
- Updated code in the Torch sampler to provide the functionality.
- Added a new dict to LogProbStorage containing the top logprobs.
- Included validation checks for top log probabilities in the sampling parameters.
- Added tests to ensure correct functionality.
Signed-off-by: Stefan Niebler <[email protected]>

…ptional top log probabilities
- Changed the `topLogProbs` parameter in `OutputConfig` to be optional, allowing for more flexible configurations.
- Updated related methods and bindings in the executor and API layers to handle the new optional type.
- Enhanced validation checks in the sampling parameters to ensure correct usage of `top_logprobs`.
- Adjusted tests to accommodate the changes in parameter handling and validation.
Signed-off-by: Stefan Niebler <[email protected]>

- Updated `OutputConfig` to use `std::optional<SizeType32>` for the `top_logprobs` parameter.
- Adjusted the docstring in `sampling_params.py` to correctly state the default of `top_logprobs` as None.
- Modified `quickstart_advanced.py` to ensure correct handling of `top_k` conditions.
- Clarified documentation in `sampler.py` regarding indices after applying top-k and top-p masks.
Signed-off-by: Stefan Niebler <[email protected]>

…robs changes
- Correctly set the max_top_logprobs parameter in create_autodeploy_executor.
- Updated DemoEngine to correctly unpack top_k and greedy_search.
Signed-off-by: Stefan Niebler <[email protected]>

…ocstring and updated yaml files for api stability testing
- Updated the test cases in `test_llm_api.py`, `completion_output.yaml`, `llm.yaml`, and `sampling_params.yaml` to include the new `top_logprobs` parameter and its associated properties.
Signed-off-by: Stefan Niebler <[email protected]>

- Updated `TorchStore` and `TorchSampler` to include the `max_top_logprobs` parameter for managing top log probabilities.
- Modified methods to handle top log probabilities during sampling and logging.
- Added new tensors to `TorchStore` containing data for top logprobs.
- Adjusted tests to validate the correct functionality of top log probabilities in various scenarios.
Signed-off-by: Stefan Niebler <[email protected]>

…rent LLM-API
- Removed the top_logprobs parameter from SamplingParams and related classes to streamline log probability management.
- Changed _get_output_config to explicitly pass logprobs as top-logprobs to the OutputConfig object.
- Updated the setup_llm function to conditionally set logprobs based on input arguments.
- Adjusted CompletionOutput and related tests to reflect the removal of top_logprobs.
- Enhanced validation checks to ensure correct usage of logprobs in sampling scenarios.
Signed-off-by: Stefan Niebler <[email protected]>

- Adjusted the validation logic in SamplingParams to ensure that logprobs must not exceed 1 when greedy sampling is enabled.
- Updated the error message for clarity regarding the constraints on logprobs in greedy sampling scenarios.
Signed-off-by: Stefan Niebler <[email protected]>

…obs of 1
- Modified the return statement in `greedy_search_sampling_batch` to include `next_tokens` as an unsqueezed tensor, ensuring compatibility with downstream processing.
- Added the `max_top_logprobs` parameter to the configuration in the test file to ensure correct execution.
Signed-off-by: Stefan Niebler <[email protected]>

- Updated the test_embedding_bias_with_torch_sampler_strategies cases to correctly handle top logprobs of 1 in the parameterization.
- Enhanced the validation logic in test_return_top_k_logprobs to accommodate the new conditions for valid setups in sampling tests (greedy sampling).
Signed-off-by: Stefan Niebler <[email protected]>
337b137 to 31bc186 (Compare)
/bot run --disable-fail-fast
PR_Github #18946 [ run ] triggered by Bot
PR_Github #18946 [ run ] completed with state
[TRTLLM-5670][feat] Support top log probabilities
- Added a new parameter topLogProbs to the OutputConfig class, allowing users to specify the number of top log probabilities to return.
- Change: Changed the API to conform better with the current API.
Summary by CodeRabbit
New Features
Documentation
Tests
Description
Allow the user to use the top_logprobs parameter to request, at each generation step, the logprobs of the top <top_logprobs> tokens in addition to the logprob of the sampled token.
Currently, only functionality has been tested. A hedged usage sketch is shown below.
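The sketch below is assembled from this PR's own tests and quickstart changes. The parameter surface moved during review (later commits derive the top count from the numeric logprobs value), so treat top_logprobs and max_top_logprobs as the original proposal, not necessarily the final API.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any small model works
    max_top_logprobs=4,        # engine-wide cap proposed by this PR
    enable_mixed_sampler=True,
)

params = SamplingParams(
    max_tokens=8,
    logprobs=True,     # top logprobs require regular logprobs
    top_k=8,
    top_p=0.9,
    temperature=1.0,
    top_logprobs=3,    # per-request count, must not exceed max_top_logprobs
)

for result in llm.generate(["The capital of France is"], sampling_params=params):
    completion = result.outputs[0]
    # Expected: one dict per generated token, mapping token_id -> Logprob(logprob, rank).
    print(completion.top_logprobs)
```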
Performance Tests
I ran performance checks on main and on this branch for greedy and top-k sampling with logprobs=false/true to see how the changes impact workflows where top-logprobs are not enabled.
Number of requests: 100
Number of concurrent requests: 55.9995
Average Input Length (tokens): 40.0000
Average Output Length (tokens): 2009.0000
TP Size: 1
PP Size: 1
EP Size: None
Max Runtime Batch Size: 64
Max Runtime Tokens: 131072
Scheduling Policy: GUARANTEED_NO_EVICT
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 3.9227E+17
This branch (needs an update since the latest changes)
main branch
Test Coverage
New test: test_return_top_k_logprobs.py
Checks that the shape of the results is correct, and additionally ensures that some failure cases raise the correct error. A hypothetical sketch of the kind of assertion involved is shown below.
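The helper below is hypothetical (names and attributes are illustrative); it only conveys the shape invariant the test is described as checking.

```python
def check_top_logprobs_shape(completion, expected_k, max_tokens):
    # One top-logprobs entry per generated token is expected.
    assert len(completion.top_logprobs) == max_tokens
    for per_token in completion.top_logprobs:
        # Each entry is assumed to map token_id -> Logprob(logprob, rank).
        assert len(per_token) == expected_k
```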
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.

skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.