[None][refactor] Improve lookahead decoding interfaces #5576
Conversation
/bot run
PR_Github #10240 [ run ] triggered by Bot
PR_Github #10240 [ run ] completed with state
/bot run
PR_Github #10283 [ run ] triggered by Bot
PR_Github #10283 [ run ] completed with state
Pull Request Overview
This PR refactors the lookahead-decoding interfaces by centralizing sampling-config handling and unifying the disable-lookahead logic across the Python and C++ layers.
- Migrate sampling-config creation into `CreateNewDecoderRequests` and return an optional config.
- Remove legacy `disableLookahead` overloads and introduce `disableLookaheadDecoder` in the batching code.
- Update the Python sampler to consume the new API and drop the old `make_sampling_config` helper.
Comments suppressed due to low confidence (3)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp:52
- New helper functions `fillBatchSlots` and `fuseSamplingConfigs` introduce core batching logic but lack corresponding unit tests. Adding targeted tests will ensure correct behavior and prevent regressions.
  `TensorPtr copySequenceLengths(RequestVector const& contextRequests, DecoderInputBuffers& inputBuffers,`

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h:384
- [nitpick] The newly added `disableLookaheadDecoder` method lacks a Doxygen comment. Adding a brief description of its purpose and parameters will improve API discoverability and maintenance.
  `void disableLookaheadDecoder(RequestVector const& genRequests, DecoderInputBuffers& inputBuffers);`
cde297f to 37b8678 (compare)
37b8678 to b8b0610 (compare)
/bot run
PR_Github #11847 [ run ] triggered by Bot
PR_Github #11847 [ run ] completed with state
b8b0610 to 8643c36 (compare)
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough
Consolidates per-request sampling configs into a single fused SamplingConfig, adds helpers to fill/return batch-slot views, updates the CreateNewDecoderRequests operator() signature, removes the per-request lookahead parameter from DecoderState::disableLookahead, and moves decoder-level lookahead disable logic into the batch manager.

Changes

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant CreateNewDecoderRequests
    participant DecoderInputBuffers
    participant UnderlyingDecoder
    Client->>CreateNewDecoderRequests: operator()(..., finishedContextRequests, inputBuffers, decoderState, runtimeStream, decoderStream, ...)
    CreateNewDecoderRequests->>DecoderInputBuffers: fillBatchSlots(finishedContextRequests, inputBuffers)
    CreateNewDecoderRequests->>CreateNewDecoderRequests: fuseSamplingConfigs(finishedContextRequests)
    CreateNewDecoderRequests->>Client: (batchSlotsView, fusedSamplingConfig, lookaheadPrompts, lookaheadAlgos)
    Client->>UnderlyingDecoder: setup(fusedSamplingConfig, batchSlotsView, ...)
```

```mermaid
sequenceDiagram
    participant BatchManager
    participant TrtGptModelInflightBatching
    participant UnderlyingDecoder
    participant DecoderState
    BatchManager->>TrtGptModelInflightBatching: changeSpecDecMode(genRequests)
    TrtGptModelInflightBatching->>TrtGptModelInflightBatching: disableLookaheadDecoder(genRequests, inputBuffers)
    TrtGptModelInflightBatching->>UnderlyingDecoder: disableLookahead(batchSlotsView, fusedSamplingConfig)
    TrtGptModelInflightBatching->>DecoderState: disableLookahead()
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
Actionable comments posted: 1
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
588-602: Verify type safety for the optional sampling config parameter.
The refactoring correctly changes from handling multiple sampling configs to a single optional config. However, there is a potential type-safety concern: the `sampling_config` returned by `create_new_decoder_requests` is now an `Optional[SamplingConfig]`. Passing it directly to `underlying_decoder().setup()` may cause a type mismatch or runtime error if the method expects a non-optional type.

```bash
#!/bin/bash
# Verify the expected parameter type for underlying_decoder().setup()
ast-grep --pattern $'def setup($$$) { $$$ }'

# Also search for the setup method signature in C++ bindings
rg -A 5 "setup.*SamplingConfig" --type cpp
```
🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (1)
74-80: Consider using a named struct for the complex return type.
The tuple return type with 4 elements reduces readability. Consider defining a struct to make the API more self-documenting:

```cpp
struct DecoderRequestsResult
{
    TensorPtr batchSlots;
    std::optional<runtime::SamplingConfig> samplingConfig;
    std::vector<runtime::ITensor::SharedConstPtr> lookaheadPrompts;
    std::vector<executor::LookaheadDecodingConfig> lookaheadConfigs;
};
```

This would make the return type clearer and easier to work with at call sites.
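To make the call-site difference concrete, here is a minimal self-contained sketch contrasting the 4-element tuple return with the named struct proposed above. The types `Tensor`, `Sampling`, `Prompt`, and `LookaheadCfg` are hypothetical stand-ins, not the real TensorRT-LLM types.

```cpp
#include <optional>
#include <tuple>
#include <vector>

struct Tensor {};
struct Sampling {};
struct Prompt {};
struct LookaheadCfg {};

struct DecoderRequestsResult
{
    Tensor batchSlots;
    std::optional<Sampling> samplingConfig;
    std::vector<Prompt> lookaheadPrompts;
    std::vector<LookaheadCfg> lookaheadConfigs;
};

// Tuple-returning variant: callers must remember the element order.
std::tuple<Tensor, std::optional<Sampling>, std::vector<Prompt>, std::vector<LookaheadCfg>> asTuple()
{
    return {};
}

// Struct-returning variant: fields document themselves.
DecoderRequestsResult asStruct()
{
    return {};
}

int main()
{
    // Tuple: order-dependent structured bindings.
    auto [slots, sampling, prompts, cfgs] = asTuple();
    (void) slots; (void) sampling; (void) prompts; (void) cfgs;

    // Named struct: named access, more robust when the signature evolves.
    auto const result = asStruct();
    (void) result.samplingConfig;
    return 0;
}
```

Either shape works with structured bindings; the struct mainly pays off when fields are added or reordered later.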
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
- cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
- cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (0 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/runtime/decoderState.cpp (1 hunks)
- cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp (0 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
💤 Files with no reviewable changes (5)
- cpp/tensorrt_llm/nanobind/common/customCasters.h
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/nanobind/bindings.cpp
- cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
cpp/tensorrt_llm/runtime/decoderState.cpp
tensorrt_llm/_torch/pyexecutor/sampler.py
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
🔇 Additional comments (15)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1)
391-392: LGTM! Well-designed interface for centralized lookahead decoding management.
The new `disableLookaheadDecoder` method appropriately centralizes lookahead decoding control at the batch manager level, taking generation requests and decoder input buffers as parameters. This aligns well with the refactoring objectives to decouple lookahead decoding from lower-level decoder interfaces.

cpp/include/tensorrt_llm/runtime/decoderState.h (1)
69-69: LGTM! Interface simplification improves separation of concerns.
Removing the `RequestVector const& genRequests` parameter from `disableLookahead()` is a good refactoring that:
- Eliminates the dependency between `DecoderState` and request vectors
- Simplifies the interface by focusing purely on decoder state management
- Delegates request-specific logic to higher-level components (e.g., `TrtGptModelInflightBatching`)
This aligns well with the principle of separation of concerns.
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
12-12: LGTM! Import removal aligns with refactoring objectives.
Removing the `make_sampling_config` import is appropriate as the sampling configuration fusion is now handled at the C++ level through the new static methods in `CreateNewDecoderRequests`.

cpp/tensorrt_llm/runtime/decoderState.cpp (1)
444-459: LGTM! Implementation correctly reflects the simplified interface.
The `disableLookahead()` method implementation has been appropriately simplified:
- Signature now matches the header file (no parameters)
- Removed the loop over generation requests, delegating that responsibility to higher-level components
- Maintains all essential decoder state operations: resetting speculative decoding mode, setting max tokens to 1, resetting lookahead inputs, and reshaping buffers
- Improves separation of concerns by focusing purely on decoder state management
The refactoring maintains functionality while simplifying the interface; a schematic stand-in of this caller/state split is sketched below.
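The following is a small, self-contained illustration of the split described above: the state object exposes a parameterless reset that touches only decoder-owned state, while the caller iterates over requests and resets per-slot token counts. All class and member names here are hypothetical stand-ins, not the real TensorRT-LLM types.

```cpp
#include <cstdio>
#include <optional>
#include <vector>

struct DecoderStateSketch
{
    int maxDecodingEngineTokens{4};
    std::vector<int> numDecodingEngineTokens;

    // Parameterless: resets only decoder-owned state (no request vector needed).
    void disableLookahead()
    {
        maxDecodingEngineTokens = 1; // mirrors "set max tokens to 1"
        // ... reset speculative-decoding mode, lookahead inputs, reshape buffers ...
    }

    void setNumDecodingEngineTokens(int seqSlot, int numTokens)
    {
        numDecodingEngineTokens.at(seqSlot) = numTokens;
    }
};

struct RequestSketch
{
    std::optional<int> seqSlot;
};

int main()
{
    DecoderStateSketch state;
    state.numDecodingEngineTokens.assign(8, 4);

    std::vector<RequestSketch> generationRequests{{0}, {3}, {std::nullopt}};

    // Caller-side responsibility after the refactoring: per-slot reset.
    state.disableLookahead();
    for (auto const& req : generationRequests)
    {
        if (req.seqSlot)
        {
            state.setNumDecodingEngineTokens(req.seqSlot.value(), 1);
        }
    }

    std::printf("slot 0 engine tokens = %d\n", state.numDecodingEngineTokens[0]);
    return 0;
}
```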
cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (3)
53-56: Good addition of [[nodiscard]] attribute for resource accessor.
The `[[nodiscard]]` attribute is appropriately added to prevent accidental discarding of the returned `CudaStreamPtr`, which is a critical resource for GPU operations.

58-61: Good addition of [[nodiscard]] attribute for decoder accessor.
The `[[nodiscard]]` attribute is appropriately added to ensure the returned decoder reference is properly utilized, preventing potential logic errors from unused decoder access.

34-79: Lookahead decoding integration verified
- No remaining calls to the old `disableLookahead` in `GptDecoderBatched`.
- All former callers have been updated to invoke `disableLookaheadDecoder` in `TrtGptModelInflightBatching`.
- The implementation still calls `mDecoderState->disableLookahead()` inside the new method, preserving state-level behavior.
- Existing TODOs in `trtGptModelInflightBatching.cpp` relate to future enabling of speculative/lookahead modes and do not block the current refactoring.
Refactoring is sound and introduces no regressions in lookahead decoding.
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3)
868-883: LGTM! Refactoring improves clarity and consistency.
The variable renaming from `samplingConfigs` to `samplingConfig` and the corresponding method call changes align well with the interface refactoring. The stream variable rename to `decoderStream` also improves code readability by being more specific.

2517-2533: Well-structured method for consolidating lookahead disable logic.
The new `disableLookaheadDecoder` method effectively encapsulates the lookahead decoding disable functionality. It properly:
- Uses the refactored static methods for batch slot filling and sampling config fusion
- Maintains correct CUDA stream synchronization between decoder and runtime streams (see the standalone sketch after this list)
- Follows established patterns in the codebase
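The cross-stream ordering mentioned above follows the standard CUDA event pattern: record an event on the stream that produced the work, then make the other stream wait on it. Below is a minimal standalone sketch using the plain CUDA runtime API rather than the TensorRT-LLM stream/event wrappers; error-code checking is omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaStream_t decoderStream{};
    cudaStream_t runtimeStream{};
    cudaEvent_t setupDone{};

    cudaStreamCreate(&decoderStream);
    cudaStreamCreate(&runtimeStream);
    cudaEventCreateWithFlags(&setupDone, cudaEventDisableTiming);

    // ... enqueue decoder setup / lookahead-disable work on decoderStream ...

    // Record completion of the decoder-side work, then make the runtime
    // stream wait for it before any subsequently enqueued runtime work runs.
    cudaEventRecord(setupDone, decoderStream);
    cudaStreamWaitEvent(runtimeStream, setupDone, 0);

    // ... subsequent runtime work enqueued on runtimeStream is now ordered after setup ...

    cudaStreamSynchronize(runtimeStream);

    cudaEventDestroy(setupDone);
    cudaStreamDestroy(decoderStream);
    cudaStreamDestroy(runtimeStream);
    std::printf("cross-stream ordering sketch done\n");
    return 0;
}
```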
2622-2630: Refactoring successfully unifies lookahead decoding disable logic.
The changes effectively replace direct decoder calls with the new unified `disableLookaheadDecoder` method and simplify the `DecoderState::disableLookahead()` interface. The addition of null checking for `llmReq->mSeqSlot` before accessing it is good defensive programming practice.

cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2)
91-91: LGTM!
The static method signature is well-designed with const-correct parameters and a clear purpose.

93-93: LGTM!
The static method appropriately uses `std::optional` to handle empty request vectors and has a descriptive name.

cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (3)
52-90: Good refactoring to return the batch slots view.
The modification to return `batchSlotsView` enables better code reuse while maintaining the original functionality. The empty check optimization at lines 79-87 is also a good addition.

154-168: Clean implementation of sampling config fusion.
The implementation properly handles empty requests and efficiently constructs the fused config. Good use of `reserve` for performance.

170-197: Well-structured refactoring using the new helper methods.
The implementation cleanly integrates the new `fuseSamplingConfigs` method and properly uses `std::move` for efficient returns. The code is more modular and easier to understand.
8643c36 to acda71e (compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)
132-152: Code duplication issue already identified.
The batch slot filling logic (lines 140-146) duplicates code from `copySequenceLengths` (lines 64-76), which was already flagged in previous reviews.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)
170-196: Good refactoring but consider safer optional access.
The changes effectively consolidate the interface by:
- Adding proper precondition assertion for non-empty requests
- Using the returned batch slots view efficiently
- Simplifying the return type with fused sampling config
However, consider using a safer approach for accessing the optional value:

```diff
-    return {std::move(batchSlotsView), std::move(samplingConfig.value()), std::move(lookaheadPrompt),
+    TLLM_CHECK(samplingConfig.has_value()); // Assert our expectation
+    return {std::move(batchSlotsView), std::move(*samplingConfig), std::move(lookaheadPrompt),
         std::move(lookaheadAlgoConfigs)};
```

This makes the expectation explicit and provides better error messages if the assumption is violated.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
- cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
- cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (0 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/runtime/decoderState.cpp (1 hunks)
- cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp (0 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
💤 Files with no reviewable changes (5)
- cpp/tensorrt_llm/nanobind/common/customCasters.h
- cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp
- cpp/tensorrt_llm/nanobind/bindings.cpp
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
🚧 Files skipped from review as they are similar to previous changes (6)
- cpp/include/tensorrt_llm/runtime/decoderState.h
- cpp/tensorrt_llm/runtime/decoderState.cpp
- cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
- tensorrt_llm/_torch/pyexecutor/sampler.py
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...
Files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧠 Learnings (1)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (2)
52-90: LGTM! Improved interface design with better error handling.
The changes enhance the function by:
- Returning the `batchSlotsView` for caller use, improving the interface
- Adding an empty request guard to prevent unnecessary device operations
- Maintaining the existing logic while improving efficiency
The return type change aligns well with the refactoring objectives.

154-168: LGTM! Clean implementation of sampling config fusion.
The function properly:
- Handles empty requests by returning `std::nullopt`
- Uses efficient memory allocation with `reserve()`
- Aggregates all sampling configs into a single unified configuration
- Uses appropriate optional return type
This aligns well with the refactoring objectives to consolidate sampling config handling. A sketch of this fusion pattern follows below.
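To illustrate the fusion idea described above, here is a hypothetical standalone sketch of folding many per-request sampling configs into one batched config. `SamplingConfigSketch` and its fields are simplified stand-ins, not the real `tensorrt_llm::runtime::SamplingConfig`; the real fusion covers many more fields.

```cpp
#include <cstdio>
#include <optional>
#include <vector>

struct SamplingConfigSketch
{
    int beamWidth{1};
    std::vector<float> temperature; // one entry per request after fusion
    std::vector<int> topK;          // one entry per request after fusion
};

std::optional<SamplingConfigSketch> fuseSamplingConfigsSketch(
    std::vector<SamplingConfigSketch> const& perRequestConfigs)
{
    if (perRequestConfigs.empty())
    {
        return std::nullopt; // mirrors the "empty requests -> nullopt" behavior
    }

    SamplingConfigSketch fused;
    fused.beamWidth = perRequestConfigs.front().beamWidth;
    fused.temperature.reserve(perRequestConfigs.size());
    fused.topK.reserve(perRequestConfigs.size());
    for (auto const& config : perRequestConfigs)
    {
        // Each per-request config contributes one entry to the batched vectors.
        fused.temperature.push_back(config.temperature.empty() ? 1.0F : config.temperature.front());
        fused.topK.push_back(config.topK.empty() ? 0 : config.topK.front());
    }
    return fused;
}

int main()
{
    std::vector<SamplingConfigSketch> configs{{1, {0.7F}, {40}}, {1, {1.0F}, {0}}};
    auto const fused = fuseSamplingConfigsSketch(configs);
    std::printf("fused %zu requests, beamWidth=%d\n", fused->temperature.size(), fused->beamWidth);
    return 0;
}
```

The single fused config can then be handed to one batched decoder setup call instead of one call per request.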
Actionable comments posted: 0
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)
132-152: Function implementation is correct.
The `fillBatchSlots` method serves its purpose as a lighter alternative to `copySequenceLengths` when device copying isn't needed. The TODO comment at line 148 should be tracked in an issue to ensure it's addressed when device copying is required for performance optimization.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...
Files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧠 Learnings (1)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (3)
51-90: LGTM! Well-implemented refactoring.
The addition of the return value and empty request guard condition improves the function's usability and efficiency. The logic remains correct and the changes align well with the interface refactoring objectives.

154-168: LGTM! Clean implementation of sampling config fusion.
The function correctly handles the empty requests case and efficiently aggregates sampling configurations. The use of `std::optional` and vector reservation demonstrates good practices.

170-197: LGTM! Well-executed interface refactoring.
The updated operator() method successfully unifies the lookahead decoding interface. The assertion ensures proper usage, and the use of `fuseSamplingConfigs` with appropriate validation maintains correctness while simplifying the interface from a vector to a single `SamplingConfig`.
/bot run
PR_Github #13804 [ run ] triggered by Bot
PR_Github #13804 [ run ] completed with state
/bot run
PR_Github #13820 [ run ] triggered by Bot
PR_Github #13820 [ run ] completed with state
62f03c9 to 5b94f79 (compare)
/bot run
PR_Github #16184 [ run ] triggered by Bot
Actionable comments posted: 2
♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)
133-154: Refactor: avoid duplicate “fill seqSlot loop” (same logic as copySequenceLengths).
The host-side loop to fill seqSlots duplicates copySequenceLengths’ loop. Extract a tiny helper and reuse it to reduce drift.

```diff
-    // fill buffers on host
-    SizeType32 batchIdx{0};
-    for (auto const& llmReq : requests)
-    {
-        auto const seqSlot = llmReq->mSeqSlot.value();
-        batchSlotsRange[batchIdx] = seqSlot;
-        ++batchIdx;
-    }
+    // fill buffers on host
+    fillSeqSlotsHost(requests, batchSlotsRange);
```

Add this helper near the top (same TU), used by both copySequenceLengths and fillBatchSlots:

```cpp
// Helper to fill seqSlots host-side to keep logic in one place.
static inline void fillSeqSlotsHost(RequestVector const& requests, tr::BufferRange<SizeType32>& batchSlotsRange)
{
    SizeType32 batchIdx{0};
    for (auto const& llmReq : requests)
    {
        TLLM_CHECK_WITH_INFO(llmReq->mSeqSlot.has_value(), "seqSlot must be assigned before decoder setup");
        batchSlotsRange[batchIdx++] = llmReq->mSeqSlot.value();
    }
}
```
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
1879-1881: Redundant >0 guard after an emptiness check.
localBatchSize > 0 is guaranteed by the preceding if (!finishedContextRequests.empty()) block. Either remove the check or assert stronger invariants.

```diff
-    auto const localBatchSize = batchSlots->getSize();
-    TLLM_CHECK_WITH_INFO(localBatchSize > 0, "Decoder setup should be called with at least one request");
+    auto const localBatchSize = batchSlots->getSize();
+    TLLM_CHECK_WITH_INFO(
+        localBatchSize == finishedContextRequests.size(),
+        "Unexpected localBatchSize (%zu) vs. requests (%zu)",
+        static_cast<size_t>(localBatchSize),
+        finishedContextRequests.size());
```

1882-1885: Pass explicitDraftTokensDType only when ExplicitDraftTokens is active.
You’re always passing mModelConfig.getDataType() to IGptDecoder::setup’s explicitDraftTokensDType parameter. That argument should be std::nullopt unless the decoding mode is ExplicitDraftTokens; otherwise it can confuse downstream logic.

```diff
-    mDecoder->getUnderlyingDecoder().setup(samplingConfig, localBatchSize, batchSlots,
-        {mDecoderState->getJointDecodingOutput()}, mModelConfig.getDataType(), lookaheadPrompt,
-        lookaheadAlgoConfigs);
+    auto explicitDraftTokensDType = mModelConfig.getSpeculativeDecodingMode().isExplicitDraftTokens()
+        ? std::optional<nvinfer1::DataType>{mModelConfig.getDataType()}
+        : std::nullopt;
+    mDecoder->getUnderlyingDecoder().setup(samplingConfig, localBatchSize, batchSlots,
+        {mDecoderState->getJointDecodingOutput()}, explicitDraftTokensDType, lookaheadPrompt,
+        lookaheadAlgoConfigs);
```

Verification: confirm all IGptDecoder::setup callsites only pass a dtype when decoding mode is ExplicitDraftTokens.
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)
52-92
Strengthen preconditions and simplify the guarded copy.
- copySequenceLengths assumes at least one request; enforce that with a CHECK at entry and remove the inner empty check.
- Validate seqSlot presence before value() to avoid UB if a call site regresses.

```diff
-//! @brief Fills the seqSlots and sequence lengths in the inputBuffers.
-TensorPtr copySequenceLengths(RequestVector const& contextRequests, DecoderInputBuffers& inputBuffers,
+//! @brief Fills the seqSlots and sequence lengths in the inputBuffers.
+TensorPtr copySequenceLengths(RequestVector const& contextRequests, DecoderInputBuffers& inputBuffers,
     ITensor& sequenceLengths, SizeType32 beamWidth, runtime::CudaStream const& stream)
 {
     auto const bufferManager = BufferManager{std::make_shared<CudaStream>(stream.get())};
     auto const batchSize = contextRequests.size();
+    TLLM_CHECK_WITH_INFO(batchSize > 0, "copySequenceLengths should be called with at least one request");
     auto batchSlotsView = tr::ITensor::slice(inputBuffers.setupBatchSlots, 0, batchSize);
     auto fillValuesView = tr::ITensor::slice(inputBuffers.fillValues, 0, batchSize);
@@
-    for (auto const& llmReq : contextRequests)
+    for (auto const& llmReq : contextRequests)
     {
+        TLLM_CHECK_WITH_INFO(llmReq->mSeqSlot.has_value(), "seqSlot must be assigned before decoder setup");
         auto const disaggFirstGenTokenSize = llmReq->getContextPhaseParams()
             ? llmReq->getContextPhaseParams().value().getFirstGenTokens().size()
             : 0;
@@
-    // copy sequence lengths
-    if (!contextRequests.empty())
-    {
-        auto batchSlotsDeviceView = tr::ITensor::slice(inputBuffers.setupBatchSlotsDevice, 0, batchSize);
-        auto fillValuesViewDevice = tr::ITensor::slice(inputBuffers.fillValuesDevice, 0, batchSize);
-
-        bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
-        bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
-        tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
-    }
+    // copy sequence lengths
+    auto batchSlotsDeviceView = tr::ITensor::slice(inputBuffers.setupBatchSlotsDevice, 0, batchSize);
+    auto fillValuesViewDevice = tr::ITensor::slice(inputBuffers.fillValuesDevice, 0, batchSize);
+    bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
+    bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
+    tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
- cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
- cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
💤 Files with no reviewable changes (1)
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
🚧 Files skipped from review as they are similar to previous changes (4)
- cpp/include/tensorrt_llm/runtime/decoderState.h
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
- cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
- cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧠 Learnings (2)
📚 Learning: 2025-08-21T09:41:49.327Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.327Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧬 Code graph analysis (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
- cpp/tensorrt_llm/pybind/runtime/bindings.cpp (4)
  - samplingConfig (163-172), samplingConfig (163-168), samplingConfig (189-193), samplingConfig (189-190)
- cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4)
  - fillBatchSlots (134-154), fillBatchSlots (134-134), fuseSamplingConfigs (156-170), fuseSamplingConfigs (156-156)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (3)
- cpp/tests/runtime/gptDecoderBatchedTest.cpp (2)
  - copySequenceLengths (153-158), copySequenceLengths (153-154)
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (4)
  - samplingConfig (66-75), samplingConfig (66-71), samplingConfig (92-96), samplingConfig (92-93)
- cpp/tensorrt_llm/pybind/runtime/bindings.cpp (4)
  - samplingConfig (163-172), samplingConfig (163-168), samplingConfig (189-193), samplingConfig (189-190)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
1886-1890: Good cross-stream synchronization.
Recording on the decoder stream and waiting on the runtime stream correctly orders the decoder setup ahead of subsequent runtime work.

2523-2539: disableLookaheadDecoder handles batching and device-side buffers correctly; no changes needed.
The existing implementation of `disableLookaheadDecoder` already covers both empty and non-empty `genRequests` correctly:
- When `genRequests` is empty, `fillBatchSlots` returns a zero-length slice and `fuseSamplingConfigs` yields `nullopt`; the call to `IGptDecoder::disableLookahead(nullopt, 0, batchSlots)` still invokes `DynamicDecodeLayer::disableLookahead(…, 0, batchSlots, …)`, which resets the decode mode and reinitializes all lookahead layers (dynamicDecodeLayer.cpp:106–114). Skipping this would leave the dynamic layer in its prior state.
- When `genRequests` is non-empty, `GptDecoder::disableLookahead` calls `workspace->setDeviceBatchSlots(batchSlots)` before forwarding to the dynamic layer, ensuring the host-filled `batchSlots` are copied onto the device for the kernels (gptDecoder.cpp:115–118 and dynamicDecodeLayer.cpp:111–118).
Because both paths correctly reset the lookahead state and perform any needed device transfer, the proposed early return for empty batches would omit the critical reset step, and a manual copy in `fillBatchSlots` is redundant.
Likely an incorrect or invalid review comment.
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (2)
156-171: LGTM: fused SamplingConfig construction.
The optional-return API and aggregation approach look correct and align with the new decoder interfaces.

183-199: LGTM: operator() contract and returned tuple match the new setup path.
- Non-empty precondition is enforced.
- copySequenceLengths uses the runtime stream (consistent with kernels that write sequenceLengths).
- Returning fused SamplingConfig and lookahead artifacts aligns with downstream setup.
```cpp
RequestVector finishedContextRequests;
std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
    [](auto const& llmReq) { return llmReq->isLastContextChunk(); });

auto [batchSlots, samplingConfigs, lookaheadPrompt, lookaheadAlgoConfigs]
    = (*mCreateNewDecoderRequests)(mModelConfig, mWorldConfig, mDecodingConfig, contextRequests, logitsType,
        inputBuffers, *mDecoderState, mRuntime->getStream(), *mDecoder->getDecoderStream(), getMaxSequenceLen(),
        mOperatingBeamWidth, buffers.mMedusaBuffers);

auto const localBatchSize = batchSlots->getSize();
if (localBatchSize > 0)
if (!finishedContextRequests.empty())
{
```
🛠️ Refactor suggestion
Don’t drop disaggregated-generation setup when no “last context chunk” is present.
setupDecoderStep now filters to last-context chunks only. When called from prepareDistGenBufferAndDecoder (Lines 1681-1683) with generation requests, isLastContextChunk() will typically be false, so finishedContextRequests becomes empty and decoder setup is skipped, breaking the disaggregated generation init path.
Fix: fall back to using the passed-in requests if no last-context chunk is found.
- RequestVector finishedContextRequests;
- std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
- [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
+ RequestVector finishedContextRequests;
+ std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
+ [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
+ // Disagg generation init path calls this with generation-ready requests (no context chunk in-flight).
+ if (finishedContextRequests.empty())
+ {
+ finishedContextRequests = contextRequests;
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The lines being replaced:

```cpp
RequestVector finishedContextRequests;
std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
    [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
auto [batchSlots, samplingConfigs, lookaheadPrompt, lookaheadAlgoConfigs]
    = (*mCreateNewDecoderRequests)(mModelConfig, mWorldConfig, mDecodingConfig, contextRequests, logitsType,
        inputBuffers, *mDecoderState, mRuntime->getStream(), *mDecoder->getDecoderStream(), getMaxSequenceLen(),
        mOperatingBeamWidth, buffers.mMedusaBuffers);
auto const localBatchSize = batchSlots->getSize();
if (localBatchSize > 0)
if (!finishedContextRequests.empty())
{
```

The suggested replacement:

```cpp
RequestVector finishedContextRequests;
std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
    [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
// Disagg generation init path calls this with generation-ready requests (no context chunk in-flight).
if (finishedContextRequests.empty())
{
    finishedContextRequests = contextRequests;
}
if (!finishedContextRequests.empty())
{
    // … existing decoder-setup logic …
}
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp around lines
1866 to 1871, the code filters contextRequests to only last-context chunks and
skips decoder setup if none are found, which drops disaggregated-generation
initialization; change the logic so that after building finishedContextRequests
you check if it is empty and, if so, assign finishedContextRequests =
contextRequests (i.e., fall back to the original passed-in requests) before
calling setupDecoderStep, ensuring decoder setup still runs for generation
requests when no last-context chunk exists.
```cpp
disableLookaheadDecoder(scheduledRequests.generationRequests, mDecoderInputBuffers.at(getFusedBufferId()));
mDecoderState->disableLookahead();

for (auto const& llmReq : scheduledRequests.generationRequests)
{
    if (llmReq->mSeqSlot)
    {
        mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
    }
```
🛠️ Refactor suggestion
Disable LAD for all active slots (ctx + gen), and reset engine tokens consistently.
The current call disables lookahead only for scheduledRequests.generationRequests. When LAD is turned off due to constraints, there may be only context requests in-flight; underlying decoder state for those slots won’t be updated. Also, resetting “numDecodingEngineTokens = 1” should apply to all slots in this transition.
- disableLookaheadDecoder(scheduledRequests.generationRequests, mDecoderInputBuffers.at(getFusedBufferId()));
- mDecoderState->disableLookahead();
-
- for (auto const& llmReq : scheduledRequests.generationRequests)
+ // Apply to all scheduled slots (both ctx and gen have seqSlot at this point)
+ RequestVector requestsForDisable;
+ requestsForDisable.reserve(
+ scheduledRequests.contextRequests.size() + scheduledRequests.generationRequests.size());
+ requestsForDisable.insert(requestsForDisable.end(),
+ scheduledRequests.contextRequests.begin(), scheduledRequests.contextRequests.end());
+ requestsForDisable.insert(requestsForDisable.end(),
+ scheduledRequests.generationRequests.begin(), scheduledRequests.generationRequests.end());
+
+ disableLookaheadDecoder(requestsForDisable, mDecoderInputBuffers.at(getFusedBufferId()));
+ mDecoderState->disableLookahead();
+
+ for (auto const& llmReq : requestsForDisable)
{
if (llmReq->mSeqSlot)
{
mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The lines being replaced:

```cpp
disableLookaheadDecoder(scheduledRequests.generationRequests, mDecoderInputBuffers.at(getFusedBufferId()));
mDecoderState->disableLookahead();

for (auto const& llmReq : scheduledRequests.generationRequests)
{
    if (llmReq->mSeqSlot)
    {
        mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
    }
```

The suggested replacement:

```cpp
// Apply to all scheduled slots (both ctx and gen have seqSlot at this point)
RequestVector requestsForDisable;
requestsForDisable.reserve(
    scheduledRequests.contextRequests.size() + scheduledRequests.generationRequests.size());
requestsForDisable.insert(requestsForDisable.end(),
    scheduledRequests.contextRequests.begin(), scheduledRequests.contextRequests.end());
requestsForDisable.insert(requestsForDisable.end(),
    scheduledRequests.generationRequests.begin(), scheduledRequests.generationRequests.end());

disableLookaheadDecoder(requestsForDisable, mDecoderInputBuffers.at(getFusedBufferId()));
mDecoderState->disableLookahead();

for (auto const& llmReq : requestsForDisable)
{
    if (llmReq->mSeqSlot)
    {
        mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
    }
}
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp around lines
2628 to 2637, the code only disables lookahead for
scheduledRequests.generationRequests and only resets numDecodingEngineTokens for
those gen slots; when LAD is turned off there may be only context requests
in-flight and their decoder state won't be updated. Fix by invoking
disableLookaheadDecoder over all active requests (both generationRequests and
contextRequests) using the same fused buffer, keep
mDecoderState->disableLookahead() as is, and iterate every active slot from both
lists to call mDecoderState->setNumDecodingEngineTokens(slot, 1) so all slots
are consistently reset when disabling lookahead.
PR_Github #16184 [ run ] completed with state
- Removed the RequestVector parameter from the disableLookahead method in DecoderState, streamlining its interface.
- Updated calls to disableLookahead in TrtGptModelInflightBatching to reflect the new method signature.
- Adjusted internal logic to maintain functionality without the need for external request data.
Signed-off-by: Robin Kobus <[email protected]>

…GptDecoderBatched
- Eliminated the disableLookahead method from both GptDecoderBatched and iGptDecoderBatched interfaces to streamline the API.
- Updated TrtGptModelInflightBatching to utilize a new internal method for handling lookahead disabling, improving encapsulation and clarity.
- Adjusted related logic in the setupDecoderStep method to maintain functionality without the removed method.
Signed-off-by: Robin Kobus <[email protected]>

…ateNewDecoderRequests
- Updated the operator() method to utilize std::optional for sampling configurations, improving flexibility in handling requests.
- Introduced fillBatchSlots and fuseSamplingConfigs methods to streamline batch slot filling and sampling configuration fusion.
- Refactored copySequenceLengths to return batch slots, enhancing clarity and separation of concerns in the code.
- Use new methods to disable lookahead in trtGptModelInflightBatching.
Signed-off-by: Robin Kobus <[email protected]>

…more effectively
- Filter out finished context requests from the contextRequests vector before calling CreateNewDecoderRequests.
- Modified the operator() method to directly use finishedContextRequests, enhancing clarity.
- Updated the handling of sampling configurations to ensure a valid value is returned, improving robustness.
Signed-off-by: Robin Kobus <[email protected]>
5b94f79 to 6062c69 (compare)
/bot run
PR_Github #16249 [ run ] triggered by Bot
Actionable comments posted: 2
♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
1866-1890: Don’t drop decoder setup when no “last context chunk” exists (disagg gen init).
When no request has `isLastContextChunk()`, `finishedContextRequests` is empty and we skip decoder setup, breaking the disaggregated-generation initialization path.
Apply:

```diff
 RequestVector finishedContextRequests;
 std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
     [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
-if (!finishedContextRequests.empty())
+// Disagg generation init path calls this with generation-ready requests (no context chunk in-flight).
+if (finishedContextRequests.empty())
+{
+    finishedContextRequests = contextRequests;
+}
+if (!finishedContextRequests.empty())
 {
     auto const logitsType = mRuntime->getEngine().getTensorDataType("logits");
```
2629-2637: Disable LAD and reset engine-token count for all active slots (ctx + gen).
Currently applies only to `generationRequests`. When switching off Lookahead, context-only in-flight slots won’t be reset.
Apply:

```diff
-    disableLookaheadDecoder(scheduledRequests.generationRequests, mDecoderInputBuffers.at(getFusedBufferId()));
-    mDecoderState->disableLookahead();
-
-    for (auto const& llmReq : scheduledRequests.generationRequests)
+    // Apply to all scheduled slots (both ctx and gen have seqSlot at this point)
+    RequestVector requestsForDisable;
+    requestsForDisable.reserve(
+        scheduledRequests.contextRequests.size() + scheduledRequests.generationRequests.size());
+    requestsForDisable.insert(requestsForDisable.end(),
+        scheduledRequests.contextRequests.begin(), scheduledRequests.contextRequests.end());
+    requestsForDisable.insert(requestsForDisable.end(),
+        scheduledRequests.generationRequests.begin(), scheduledRequests.generationRequests.end());
+
+    disableLookaheadDecoder(requestsForDisable, mDecoderInputBuffers.at(getFusedBufferId()));
+    mDecoderState->disableLookahead();
+
+    for (auto const& llmReq : requestsForDisable)
     {
         if (llmReq->mSeqSlot)
         {
             mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
         }
```

```bash
#!/bin/bash
# Quick sanity: ensure all call sites of disableLookahead() use new signature (no args).
rg -nP --type=cpp -C2 '\bdisableLookahead\s*\(' | sed -n '1,200p'
```
🧹 Nitpick comments (8)
cpp/tensorrt_llm/runtime/decoderState.cpp (1)
446-460
: API change LGTM; add note on per-slot reset behavior and consider header year.
- The parameterless
disableLookahead()
correctly resets speculative state and reshapes buffers. Callers must handle per-slotnumDecodingEngineTokens
reset (now done at call sites).- Nit: header still says 2022-2024; repo guidelines suggest current year.
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2)
73-80
: Return fused SamplingConfig: OK. Consider [[nodiscard]] and clarify empty input behavior.
- Good: operator() now returns a single fused
SamplingConfig
.- Suggest marking operator() [[nodiscard]] to avoid accidental ignoring of the return tuple.
- std::tuple<TensorPtr, runtime::SamplingConfig, std::vector<runtime::ITensor::SharedConstPtr>, + [[nodiscard]] std::tuple<TensorPtr, runtime::SamplingConfig, std::vector<runtime::ITensor::SharedConstPtr>, std::vector<executor::LookaheadDecodingConfig>>
89-92
: Expose helpers: add [[nodiscard]] and include .
- Since
fillBatchSlots
andfuseSamplingConfigs
return values that must be consumed, mark them [[nodiscard]].- Ensure this header includes
<optional>
explicitly.+#include <optional> @@ - static TensorPtr fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers); + [[nodiscard]] static TensorPtr fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers); @@ - static std::optional<SamplingConfig> fuseSamplingConfigs(RequestVector const& requests); + [[nodiscard]] static std::optional<SamplingConfig> fuseSamplingConfigs(RequestVector const& requests);cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
2523-2539: Guard against empty batches in disableLookaheadDecoder and avoid unnecessary calls.

fillBatchSlots on an empty vector returns a 0-sized view; we still call disableLookahead and record/wait. Early-out to save work and avoid edge-case behavior.

 void TrtGptModelInflightBatching::disableLookaheadDecoder(
     RequestVector const& genRequests, DecoderInputBuffers& inputBuffers)
 {
     TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
     auto batchSlots = CreateNewDecoderRequests::fillBatchSlots(genRequests, inputBuffers);
     auto samplingConfig = CreateNewDecoderRequests::fuseSamplingConfigs(genRequests);
+    if (batchSlots->getSize() == 0)
+    {
+        TLLM_LOG_TRACE("%s stop (no requests)", __PRETTY_FUNCTION__);
+        return;
+    }
     mDecoder->getUnderlyingDecoder().disableLookahead(samplingConfig, batchSlots->getSize(), batchSlots);

cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4)
57-97: copySequenceLengths: semantics look correct; minor: avoid redundant host-to-device copies when the batch is tiny.

The host-fill + device-copy + kernel looks fine. If micro-batches frequently have batch size 1, consider a small fast-path to directly fill the single entry on device to avoid two host copies. Optional.

-    bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
-    bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
-    tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
+    if (batchSize == 1) {
+        tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsView, beamWidth, *fillValuesView, stream);
+    } else {
+        bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
+        bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
+        tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
+    }
138-160: fillBatchSlots: consider DRY with copySequenceLengths and an optional device copy.

The host-side loop duplicates part of copySequenceLengths. If you plan to use the slots on device soon, consider adding an optional toDevice flag here to copy into setupBatchSlotsDevice.

-    TensorPtr CreateNewDecoderRequests::fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers)
+    TensorPtr CreateNewDecoderRequests::fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers /*, bool toDevice=false*/)
     {
@@
-    // TODO: copy to device and use in GptDecoder
+    // TODO: copy to device and use in GptDecoder (optional flag)
161-176: fuseSamplingConfigs: behavior on empty input is clear; optional: assert homogeneity where required.

If downstream requires homogeneous fields (e.g., length-1 temperature arrays), optionally add lightweight checks under debug builds.
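A rough sketch of such a debug-only check, using an illustrative stand-in struct rather than the real SamplingConfig fields:

    #include <cassert>
    #include <optional>
    #include <vector>

    // Illustrative stand-in for the per-request sampling parameters; the real
    // SamplingConfig has more fields and likely different names.
    struct SamplingConfigSketch
    {
        int beamWidth{1};
        std::optional<std::vector<float>> temperature;
    };

    // Lightweight homogeneity check; assert() compiles away under NDEBUG,
    // so release builds pay only for the loop itself if it is ever called.
    inline void checkHomogeneous(std::vector<SamplingConfigSketch> const& configs)
    {
        if (configs.empty())
        {
            return;
        }
        [[maybe_unused]] auto const& first = configs.front();
        for (auto const& config : configs)
        {
            // All fused requests must agree on beam width.
            assert(config.beamWidth == first.beamWidth);
            // Length-1 temperature arrays are expected when temperatures are set.
            assert(!config.temperature.has_value() || config.temperature->size() == 1);
        }
    }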
188-204: operator(): assert non-empty requests — good. Ensure beamWidth ≤ decoder’s max and streams are coherent.

- You already validate non-empty. Good.
- Consider asserting beamWidth <= decoderState.getMaxBeamWidth() here (even though enforced per-request later) to fail fast.
- Ensure runtimeStream is the same stream your BufferManager uses elsewhere in the call chain (looks consistent).

    TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
    return {std::move(batchSlotsView), std::move(samplingConfig.value()), std::move(lookaheadPrompt),
        std::move(lookaheadAlgoConfigs)};
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
- cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
- cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (5 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (0 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/runtime/decoderState.cpp (1 hunks)
- cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp (0 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
💤 Files with no reviewable changes (5)
- cpp/tensorrt_llm/nanobind/common/customCasters.h
- cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp
- cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/nanobind/bindings.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
- cpp/include/tensorrt_llm/runtime/decoderState.h
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
- cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/runtime/decoderState.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/runtime/decoderState.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/runtime/decoderState.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/runtime/decoderState.cpp
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
tensorrt_llm/_torch/pyexecutor/sampler.py
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM_<NAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
🧠 Learnings (3)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
905-910: No action needed: model_config.data_type is the correct dtype for the underlying decoder setup.

The C++ signature for GptDecoderBatched::setup (and its interface IGptDecoderBatched::setup) clearly takes nvinfer1::DataType dtype as its fourth parameter, representing the network’s precision (e.g. FP32, FP16, INT8), followed by ModelConfig and WorldConfig — not the logits’ torch dtype. Passing self.model_config.data_type from Python correctly maps to this TensorRT data type, so there’s no need to switch to self.logits_datatype.
  from tensorrt_llm._utils import nvtx_range, torch_dtype_to_binding
- from tensorrt_llm.bindings import (CudaStream, DataType, ModelConfig,
-                                    WorldConfig, make_sampling_config)
+ from tensorrt_llm.bindings import CudaStream, DataType, ModelConfig, WorldConfig
💡 Verification agent
🧩 Analysis chain
Remove stale use of make_sampling_config (binding removed) in ADP path.

make_sampling_config is not imported and the binding was removed in this refactor. The current ADP branch will raise NameError. Use a single sampling config (first ADP req) — consistent with the fused-config design — and avoid extra validation for performance.
Apply:
- config = make_sampling_config([r.sampling_config for r in adp])
+ # ADP: use the sampling_config from the first request (configs are expected homogeneous)
+ config = adp[0].sampling_config
slots = torch.tensor([r.py_seq_slot for r in adp], dtype=torch.int32)
self.algs.decoder.underlying_decoder().setup(config, batch_size, slots)
Run to confirm no remaining references to the removed helper:
Also applies to: 917-920
🏁 Script executed:
#!/bin/bash
# Verify that make_sampling_config is no longer referenced
rg -nP --glob '!**/build/**' '\bmake_sampling_config\b'
Length of output: 175
Remove stale make_sampling_config call in ADP path

The binding for make_sampling_config has been removed, but the code at tensorrt_llm/_torch/pyexecutor/sampler.py:917 still calls it, causing a NameError at runtime. Please replace this with direct use of the first request’s sampling_config, which matches the fused-config design and avoids extra validation overhead.
• Location:
- File: tensorrt_llm/_torch/pyexecutor/sampler.py
- Line: 917
• Recommended fix:
- config = make_sampling_config([r.sampling_config for r in adp])
+ # ADP: use the sampling_config from the first request (configs are expected homogeneous)
+ config = adp[0].sampling_config
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from tensorrt_llm.bindings import CudaStream, DataType, ModelConfig, WorldConfig
# ADP: use the sampling_config from the first request (configs are expected homogeneous)
config = adp[0].sampling_config
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around line 917, the code calls the
now-removed binding make_sampling_config which causes a NameError; replace that
call by directly using the first request's sampling_config field (e.g.,
requests[0].sampling_config) in the ADP/fused-config path, remove any
surrounding validation or conversion that invoked make_sampling_config, and
ensure downstream code expects the same sampling_config object (adapt types if
necessary) so the fused-config flow uses the request-provided config without
extra overhead.
+        finished_context_requests = [
+            r for r in requests if r.is_last_context_chunk
+        ]
+
+        if len(finished_context_requests) == 0:
+            return
+
         batch_slots, sampling_config, lookahead_prompt, lookahead_algo_configs = self.algs.create_new_decoder_requests(
             self.model_config, self.world_config, self.decoding_config,
-            requests.context_requests, self.logits_datatype,
+            finished_context_requests, self.logits_datatype,
             self.store["decoder_input_buffers"][self.micro_batch_idx],
             self.store["decoder_state"], self.store["cuda_stream"],
             self.algs.decoder.decoder_stream, self.executor_config.max_seq_len,
-            self.beam_width(requests.context_requests))
+            self.beam_width(finished_context_requests))
🛠️ Refactor suggestion
Fix: iterate context requests, and don’t skip decoder setup when no last-chunk is present (disagg gen path).
- Bug 1:
requests
is aScheduledRequests
, not an iterable ofLlmRequest
. Iterating it will not yield requests; userequests.context_requests
. - Bug 2: If no request has
is_last_context_chunk
, we currently return early and skip decoder setup. This drops the disaggregated-generation init path. Fall back to usingrequests.context_requests
when the filtered list is empty.
Apply:
@nvtx_range("setup_sampler_step")
def setup_sampler_step(self, requests):
- finished_context_requests = [
- r for r in requests if r.is_last_context_chunk
- ]
-
- if len(finished_context_requests) == 0:
- return
+ # Consider only context requests
+ finished_context_requests = [
+ r for r in requests.context_requests if r.is_last_context_chunk
+ ]
+ # Disaggregated generation init path: fall back if no "last chunk" found
+ if len(finished_context_requests) == 0:
+ finished_context_requests = list(requests.context_requests)
+ if len(finished_context_requests) == 0:
+ return
@@
batch_slots, sampling_config, lookahead_prompt, lookahead_algo_configs = self.algs.create_new_decoder_requests(
self.model_config, self.world_config, self.decoding_config,
- finished_context_requests, self.logits_datatype,
+ finished_context_requests, self.logits_datatype,
self.store["decoder_input_buffers"][self.micro_batch_idx],
self.store["decoder_state"], self.store["cuda_stream"],
self.algs.decoder.decoder_stream, self.executor_config.max_seq_len,
- self.beam_width(finished_context_requests))
+ self.beam_width(finished_context_requests))
local_batch_size = len(batch_slots)
assert local_batch_size > 0, "Decoder setup should be called with at least one request"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
finished_context_requests = [ | |
r for r in requests if r.is_last_context_chunk | |
] | |
if len(finished_context_requests) == 0: | |
return | |
batch_slots, sampling_config, lookahead_prompt, lookahead_algo_configs = self.algs.create_new_decoder_requests( | |
self.model_config, self.world_config, self.decoding_config, | |
requests.context_requests, self.logits_datatype, | |
finished_context_requests, self.logits_datatype, | |
self.store["decoder_input_buffers"][self.micro_batch_idx], | |
self.store["decoder_state"], self.store["cuda_stream"], | |
self.algs.decoder.decoder_stream, self.executor_config.max_seq_len, | |
self.beam_width(requests.context_requests)) | |
self.beam_width(finished_context_requests)) | |
@nvtx_range("setup_sampler_step") | |
def setup_sampler_step(self, requests): | |
- finished_context_requests = [ | |
- r for r in requests if r.is_last_context_chunk | |
- ] | |
- | |
- if len(finished_context_requests) == 0: | |
# Consider only context requests | |
finished_context_requests = [ | |
r for r in requests.context_requests if r.is_last_context_chunk | |
] | |
# Disaggregated generation init path: fall back if no "last chunk" found | |
if len(finished_context_requests) == 0: | |
finished_context_requests = list(requests.context_requests) | |
if len(finished_context_requests) == 0: | |
return | |
batch_slots, sampling_config, lookahead_prompt, lookahead_algo_configs = self.algs.create_new_decoder_requests( | |
self.model_config, self.world_config, self.decoding_config, | |
finished_context_requests, self.logits_datatype, | |
self.store["decoder_input_buffers"][self.micro_batch_idx], | |
self.store["decoder_state"], self.store["cuda_stream"], | |
self.algs.decoder.decoder_stream, self.executor_config.max_seq_len, | |
self.beam_width(finished_context_requests)) | |
local_batch_size = len(batch_slots) | |
assert local_batch_size > 0, "Decoder setup should be called with at least one request" |
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around lines 887 to 901, the code
incorrectly iterates over the ScheduledRequests object and returns early when no
request has is_last_context_chunk, which skips decoder setup for the
disaggregated-generation path; change the iteration to use
requests.context_requests to collect finished_context_requests (e.g., [r for r
in requests.context_requests if r.is_last_context_chunk]) and if that filtered
list is empty, fall back to using requests.context_requests (do not return) so
the decoder setup call always runs with the appropriate list when initializing
disaggregated generation.
PR_Github #16249 [ run ] completed with state |
Description
LlmRequest
fromDecoderState
andGptDecoderBatched
interfaces.Please see commit messages for details.
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
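For example, a hypothetical comment combining a few of the options above (flag values taken from the examples listed) could be:

/bot run --disable-fail-fast --stage-list "A10-1"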
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit