
Conversation


@Funatiq Funatiq commented Jun 29, 2025

Description

  • Unify lookahead decoding functions with decoder setup.
  • Remove LlmRequest from DecoderState and GptDecoderBatched interfaces.
  • Remove the binding for fusing the sampling configs.

Please see commit messages for details.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
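For example, the following comment would run only the A10-1 test stage with fail-fast disabled (an illustrative combination; substitute the stage names relevant to your change):

/bot run --stage-list "A10-1" --disable-fail-fast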

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without due care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without due care and validation can break the top of tree.

Summary by CodeRabbit

  • Refactor
    • Consolidated per-request sampling settings into a single fused sampling configuration and unified batch-slot handling to simplify decoder request flow.
    • Narrowed decoder setup to final context chunks and centralized lookahead-disabling logic, reducing redundant parameters and clarifying stream synchronization.
  • Chores
    • Removed legacy helper bindings for combining sampling configs from the Python/C++ module surfaces.


Funatiq commented Jun 29, 2025

/bot run

@tensorrt-cicd

PR_Github #10240 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10240 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7567 completed with status: 'FAILURE'


Funatiq commented Jun 30, 2025

/bot run

@tensorrt-cicd

PR_Github #10283 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10283 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7600 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@Funatiq Funatiq marked this pull request as ready for review June 30, 2025 14:17
@Funatiq Funatiq requested a review from a team as a code owner June 30, 2025 14:17
@Funatiq Funatiq requested review from schetlur-nv and Copilot June 30, 2025 14:17

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors the lookahead-decoding interfaces by centralizing sampling-config handling and unifying the disable-lookahead logic across Python and C++ layers.

  • Migrate sampling-config creation into CreateNewDecoderRequests and return an optional config.
  • Remove legacy disableLookahead overloads and introduce disableLookaheadDecoder in batching code.
  • Update the Python sampler to consume the new API and drop the old make_sampling_config helper.
Comments suppressed due to low confidence (3)

cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp:52

  • New helper functions fillBatchSlots and fuseSamplingConfigs introduce core batching logic but lack corresponding unit tests. Adding targeted tests will ensure correct behavior and prevent regressions.
TensorPtr copySequenceLengths(RequestVector const& contextRequests, DecoderInputBuffers& inputBuffers,

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h:384

  • [nitpick] The newly added disableLookaheadDecoder method lacks a Doxygen comment. Adding a brief description of its purpose and parameters will improve API discoverability and maintenance.
    void disableLookaheadDecoder(RequestVector const& genRequests, DecoderInputBuffers& inputBuffers);

@Funatiq Funatiq requested review from dcampora and DomBrown June 30, 2025 14:19
@Funatiq Funatiq force-pushed the dev/refactor_lookahead branch 3 times, most recently from cde297f to 37b8678 on July 7, 2025 15:45
@Funatiq Funatiq force-pushed the dev/refactor_lookahead branch from 37b8678 to b8b0610 on July 14, 2025 20:11

Funatiq commented Jul 14, 2025

/bot run

@tensorrt-cicd

PR_Github #11847 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11847 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8778 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@Funatiq Funatiq force-pushed the dev/refactor_lookahead branch from b8b0610 to 8643c36 on July 31, 2025 13:00
@Funatiq Funatiq removed the request for review from schetlur-nv July 31, 2025 13:00

coderabbitai bot commented Jul 31, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Consolidates per-request sampling configs into a single fused SamplingConfig, adds helpers to fill/return batch-slot views, updates CreateNewDecoderRequests operator() signature, removes per-request lookahead parameter from DecoderState::disableLookahead, and moves decoder-level lookahead disable logic into the batch manager.

Changes

Cohort / File(s) Change Summary
Batch slot & sampling-config refactor
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h, cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
operator() now returns a single runtime::SamplingConfig and a batch-slot TensorPtr; added static fillBatchSlots and fuseSamplingConfigs; copySequenceLengths now returns the batch slots view; operator() parameters expanded to accept input buffers, decoder state, streams, lengths, beam width, and medusa buffers; internal flow uses fused sampling config.
DecoderState lookahead simplification
cpp/include/tensorrt_llm/runtime/decoderState.h, cpp/tensorrt_llm/runtime/decoderState.cpp
disableLookahead signature changed to void disableLookahead() (removed RequestVector param); removed loop that set per-request decoding tokens; dropped include of llmRequest.h.
GptDecoderBatched API cleanup
cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h, cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h, cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp
Removed LlmRequest forward-decl, LlmRequestPtr/RequestVector typedefs and disableLookahead that accepted requests; removed GptDecoderBatched::disableLookahead implementation; added [[nodiscard]] to getDecoderStream() and getUnderlyingDecoder().
Batch manager lookahead handling
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp, cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
Added disableLookaheadDecoder(RequestVector const&, DecoderInputBuffers&); setupDecoderStep now operates on finished context chunks only and calls mCreateNewDecoderRequests with finished requests to get batch-slots view and fused sampling config; stream/event sync uses decoder stream; per-slot token counts adjusted when disabling lookahead.
Python bindings / casters cleanup
cpp/tensorrt_llm/nanobind/bindings.cpp, cpp/tensorrt_llm/nanobind/common/customCasters.h, cpp/tensorrt_llm/pybind/bindings.cpp
Removed make_sampling_config helper and bindings for std::vector<SamplingConfig>; removed related opaque/vector caster and samplingConfig.h include.
PyTorch sampler adjustment
tensorrt_llm/_torch/pyexecutor/sampler.py
Filter finished_context_requests (is_last_context_chunk); early return if none; consume single sampling_config from create_new_decoder_requests; removed make_sampling_config usage and assert local batch size > 0.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CreateNewDecoderRequests
    participant DecoderInputBuffers
    participant UnderlyingDecoder

    Client->>CreateNewDecoderRequests: operator()(..., finishedContextRequests, inputBuffers, decoderState, runtimeStream, decoderStream, ...)
    CreateNewDecoderRequests->>DecoderInputBuffers: fillBatchSlots(finishedContextRequests, inputBuffers)
    CreateNewDecoderRequests->>CreateNewDecoderRequests: fuseSamplingConfigs(finishedContextRequests)
    CreateNewDecoderRequests->>Client: (batchSlotsView, fusedSamplingConfig, lookaheadPrompts, lookaheadAlgos)
    Client->>UnderlyingDecoder: setup(fusedSamplingConfig, batchSlotsView, ...)
sequenceDiagram
    participant BatchManager
    participant TrtGptModelInflightBatching
    participant UnderlyingDecoder
    participant DecoderState

    BatchManager->>TrtGptModelInflightBatching: changeSpecDecMode(genRequests)
    TrtGptModelInflightBatching->>TrtGptModelInflightBatching: disableLookaheadDecoder(genRequests, inputBuffers)
    TrtGptModelInflightBatching->>UnderlyingDecoder: disableLookahead(batchSlotsView, fusedSamplingConfig)
    TrtGptModelInflightBatching->>DecoderState: disableLookahead()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • dcampora
  • DomBrown
  • tomeras91
  • shaharmor98
  • syuoni



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

588-602: Verify type safety for optional sampling config parameter.

The refactoring correctly changes from handling multiple sampling configs to a single optional config. However, there's a potential type safety concern.

The sampling_config returned by create_new_decoder_requests is now an Optional[SamplingConfig]. Passing it directly to underlying_decoder().setup() may cause a type mismatch or runtime error if the method expects a non-optional type.

#!/bin/bash
# Verify the expected parameter type for underlying_decoder().setup()
rg -n -A 5 "def setup" --type py

# Also search for the setup method signature in C++ bindings
rg -A 5 "setup.*SamplingConfig" --type cpp
🧹 Nitpick comments (1)
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (1)

74-80: Consider using a named struct for the complex return type.

The tuple return type with 4 elements reduces readability. Consider defining a struct to make the API more self-documenting:

struct DecoderRequestsResult {
    TensorPtr batchSlots;
    std::optional<runtime::SamplingConfig> samplingConfig;
    std::vector<runtime::ITensor::SharedConstPtr> lookaheadPrompts;
    std::vector<executor::LookaheadDecodingConfig> lookaheadConfigs;
};

This would make the return type clearer and easier to work with at call sites.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between baece56 and 8643c36.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
  • cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/bindings.cpp (0 hunks)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h (0 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (0 hunks)
  • cpp/tensorrt_llm/runtime/decoderState.cpp (1 hunks)
  • cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
💤 Files with no reviewable changes (5)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h
  • cpp/tensorrt_llm/pybind/bindings.cpp
  • cpp/tensorrt_llm/nanobind/bindings.cpp
  • cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • cpp/tensorrt_llm/runtime/decoderState.cpp
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🔇 Additional comments (15)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1)

391-392: LGTM! Well-designed interface for centralized lookahead decoding management.

The new disableLookaheadDecoder method appropriately centralizes lookahead decoding control at the batch manager level, taking generation requests and decoder input buffers as parameters. This aligns well with the refactoring objectives to decouple lookahead decoding from lower-level decoder interfaces.

cpp/include/tensorrt_llm/runtime/decoderState.h (1)

69-69: LGTM! Interface simplification improves separation of concerns.

Removing the RequestVector const& genRequests parameter from disableLookahead() is a good refactoring that:

  • Eliminates the dependency between DecoderState and request vectors
  • Simplifies the interface by focusing purely on decoder state management
  • Delegates request-specific logic to higher-level components (e.g., TrtGptModelInflightBatching)

This aligns well with the principle of separation of concerns.

tensorrt_llm/_torch/pyexecutor/sampler.py (1)

12-12: LGTM! Import removal aligns with refactoring objectives.

Removing the make_sampling_config import is appropriate as the sampling configuration fusion is now handled at the C++ level through the new static methods in CreateNewDecoderRequests.

cpp/tensorrt_llm/runtime/decoderState.cpp (1)

444-459: LGTM! Implementation correctly reflects the simplified interface.

The disableLookahead() method implementation has been appropriately simplified:

  • Signature now matches the header file (no parameters)
  • Removed the loop over generation requests, delegating that responsibility to higher-level components
  • Maintains all essential decoder state operations: resetting speculative decoding mode, setting max tokens to 1, resetting lookahead inputs, and reshaping buffers
  • Improves separation of concerns by focusing purely on decoder state management

The refactoring maintains functionality while simplifying the interface.
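For illustration only, a minimal self-contained sketch of such a parameterless reset, with placeholder type and member names rather than the actual DecoderState internals:

#include <cstddef>
#include <optional>
#include <vector>

struct LookaheadInputs
{
};

class DecoderStateSketch
{
public:
    void disableLookahead()
    {
        mSpeculativeDecodingEnabled = false;           // reset the speculative decoding mode
        mMaxDecodingEngineTokens = 1;                  // decode a single engine token per step from now on
        mLookaheadInputs.reset();                      // drop lookahead-specific inputs
        reshapeTokenBuffers(mMaxDecodingEngineTokens); // shrink per-step token buffers accordingly
    }

private:
    void reshapeTokenBuffers(int maxTokens)
    {
        mNewTokens.assign(static_cast<std::size_t>(maxTokens), 0);
    }

    bool mSpeculativeDecodingEnabled{true};
    int mMaxDecodingEngineTokens{4};
    std::optional<LookaheadInputs> mLookaheadInputs{LookaheadInputs{}};
    std::vector<int> mNewTokens = std::vector<int>(4, 0);
};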

cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (3)

53-56: Good addition of [[nodiscard]] attribute for resource accessor.

The [[nodiscard]] attribute is appropriately added to prevent accidental discarding of the returned CudaStreamPtr, which is a critical resource for GPU operations.


58-61: Good addition of [[nodiscard]] attribute for decoder accessor.

The [[nodiscard]] attribute is appropriately added to ensure the returned decoder reference is properly utilized, preventing potential logic errors from unused decoder access.
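As a generic aside (not the actual TensorRT-LLM class), this is the behavior the attribute enforces:

#include <memory>

struct CudaStreamStub
{
};

class DecoderSketch
{
public:
    [[nodiscard]] std::shared_ptr<CudaStreamStub> getDecoderStream() const
    {
        return mStream;
    }

private:
    std::shared_ptr<CudaStreamStub> mStream{std::make_shared<CudaStreamStub>()};
};

void example()
{
    DecoderSketch decoder;
    decoder.getDecoderStream();               // compiler warning: [[nodiscard]] return value is ignored
    auto stream = decoder.getDecoderStream(); // fine: the returned stream is used
    static_cast<void>(stream);
}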


34-79: Lookahead decoding integration verified

  • No remaining calls to the old disableLookahead in GptDecoderBatched.
  • All former callers have been updated to invoke disableLookaheadDecoder in TrtGptModelInflightBatching.
  • The implementation still calls mDecoderState->disableLookahead() inside the new method, preserving state-level behavior.
  • Existing TODOs in trtGptModelInflightBatching.cpp relate to future enabling of speculative/lookahead modes and do not block the current refactoring.

Refactoring is sound and introduces no regressions in lookahead decoding.

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3)

868-883: LGTM! Refactoring improves clarity and consistency.

The variable renaming from samplingConfigs to samplingConfig and the corresponding method call changes align well with the interface refactoring. The stream variable rename to decoderStream also improves code readability by being more specific.


2517-2533: Well-structured method for consolidating lookahead disable logic.

The new disableLookaheadDecoder method effectively encapsulates the lookahead decoding disable functionality. It properly:

  • Uses the refactored static methods for batch slot filling and sampling config fusion
  • Maintains correct CUDA stream synchronization between decoder and runtime streams
  • Follows established patterns in the codebase

2622-2630: Refactoring successfully unifies lookahead decoding disable logic.

The changes effectively replace direct decoder calls with the new unified disableLookaheadDecoder method and simplify the DecoderState::disableLookahead() interface. The addition of null checking for llmReq->mSeqSlot before accessing it is good defensive programming practice.

cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2)

91-91: LGTM!

The static method signature is well-designed with const-correct parameters and a clear purpose.


93-93: LGTM!

The static method appropriately uses std::optional to handle empty request vectors and has a descriptive name.

cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (3)

52-90: Good refactoring to return the batch slots view.

The modification to return batchSlotsView enables better code reuse while maintaining the original functionality. The empty check optimization at lines 79-87 is also a good addition.


154-168: Clean implementation of sampling config fusion.

The implementation properly handles empty requests and efficiently constructs the fused config. Good use of reserve for performance.


170-197: Well-structured refactoring using the new helper methods.

The implementation cleanly integrates the new fuseSamplingConfigs method and properly uses std::move for efficient returns. The code is more modular and easier to understand.

@Funatiq Funatiq force-pushed the dev/refactor_lookahead branch from 8643c36 to acda71e on August 1, 2025 09:57

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)

132-152: Code duplication issue already identified.

The batch slot filling logic (lines 140-146) duplicates code from copySequenceLengths (lines 64-76), which was already flagged in previous reviews.

🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)

170-196: Good refactoring but consider safer optional access.

The changes effectively consolidate the interface by:

  • Adding proper precondition assertion for non-empty requests
  • Using the returned batch slots view efficiently
  • Simplifying the return type with fused sampling config

However, consider using a safer approach for accessing the optional value:

-    return {std::move(batchSlotsView), std::move(samplingConfig.value()), std::move(lookaheadPrompt),
+    TLLM_CHECK(samplingConfig.has_value()); // Assert our expectation
+    return {std::move(batchSlotsView), std::move(*samplingConfig), std::move(lookaheadPrompt),
         std::move(lookaheadAlgoConfigs)};

This makes the expectation explicit and provides better error messages if the assumption is violated.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8643c36 and acda71e.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
  • cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/bindings.cpp (0 hunks)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h (0 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (0 hunks)
  • cpp/tensorrt_llm/runtime/decoderState.cpp (1 hunks)
  • cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
💤 Files with no reviewable changes (5)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h
  • cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp
  • cpp/tensorrt_llm/nanobind/bindings.cpp
  • cpp/tensorrt_llm/pybind/bindings.cpp
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
✅ Files skipped from review due to trivial changes (1)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
🚧 Files skipped from review as they are similar to previous changes (6)
  • cpp/include/tensorrt_llm/runtime/decoderState.h
  • cpp/tensorrt_llm/runtime/decoderState.cpp
  • cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...

Files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧠 Learnings (1)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (2)

52-90: LGTM! Improved interface design with better error handling.

The changes enhance the function by:

  • Returning the batchSlotsView for caller use, improving the interface
  • Adding an empty request guard to prevent unnecessary device operations
  • Maintaining the existing logic while improving efficiency

The return type change aligns well with the refactoring objectives.


154-168: LGTM! Clean implementation of sampling config fusion.

The function properly:

  • Handles empty requests by returning std::nullopt
  • Uses efficient memory allocation with reserve()
  • Aggregates all sampling configs into a single unified configuration
  • Uses appropriate optional return type

This aligns well with the refactoring objectives to consolidate sampling config handling.
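To make the fusion pattern concrete, here is a small self-contained sketch of the same idea with placeholder types (the real SamplingConfig carries many more per-request fields):

#include <optional>
#include <vector>

// Placeholder stand-in for a sampling config; each per-request config holds one value per field.
struct ConfigSketch
{
    std::vector<float> temperature;
    std::vector<int> topK;
};

// Fuse per-request configs into one batched config; an empty input yields std::nullopt.
std::optional<ConfigSketch> fuseConfigsSketch(std::vector<ConfigSketch> const& perRequest)
{
    if (perRequest.empty())
    {
        return std::nullopt;
    }

    ConfigSketch fused;
    fused.temperature.reserve(perRequest.size());
    fused.topK.reserve(perRequest.size());
    for (auto const& config : perRequest)
    {
        fused.temperature.push_back(config.temperature.front());
        fused.topK.push_back(config.topK.front());
    }
    return fused;
}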


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)

132-152: Function implementation is correct.

The fillBatchSlots method serves its purpose as a lighter alternative to copySequenceLengths when device copying isn't needed. The TODO comment at line 148 should be tracked in an issue to ensure it's addressed when device copying is required for performance optimization.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between acda71e and 479a99e.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in the compilation of a target must have filenames that are case-insensitive unique.
All types (including class names) are camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable uses camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic...

Files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧠 Learnings (1)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (3)

51-90: LGTM! Well-implemented refactoring.

The addition of the return value and empty request guard condition improves the function's usability and efficiency. The logic remains correct and the changes align well with the interface refactoring objectives.


154-168: LGTM! Clean implementation of sampling config fusion.

The function correctly handles the empty requests case and efficiently aggregates sampling configurations. The use of std::optional and vector reservation demonstrates good practices.


170-197: LGTM! Well-executed interface refactoring.

The updated operator() method successfully unifies the lookahead decoding interface. The assertion ensures proper usage, and the use of fuseSamplingConfigs with appropriate validation maintains correctness while simplifying the interface from vector to single SamplingConfig.


Funatiq commented Aug 1, 2025

/bot run

@tensorrt-cicd

PR_Github #13804 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #13804 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10378 completed with status: 'FAILURE'


Funatiq commented Aug 1, 2025

/bot run

@tensorrt-cicd

PR_Github #13820 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #13820 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10391 completed with status: 'FAILURE'

@Funatiq Funatiq force-pushed the dev/refactor_lookahead branch 2 times, most recently from 62f03c9 to 5b94f79 on August 22, 2025 11:18

Funatiq commented Aug 22, 2025

/bot run

@Funatiq Funatiq changed the title refactor: Improve lookahead decoding interfaces [None][refactor] Improve lookahead decoding interfaces Aug 22, 2025
@tensorrt-cicd

PR_Github #16184 [ run ] triggered by Bot


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)

133-154: Refactor: avoid duplicate “fill seqSlot loop” (same logic as copySequenceLengths).

The host-side loop to fill seqSlots duplicates copySequenceLengths’ loop. Extract a tiny helper and reuse it to reduce drift.

-    // fill buffers on host
-    SizeType32 batchIdx{0};
-    for (auto const& llmReq : requests)
-    {
-        auto const seqSlot = llmReq->mSeqSlot.value();
-        batchSlotsRange[batchIdx] = seqSlot;
-        ++batchIdx;
-    }
+    // fill buffers on host
+    fillSeqSlotsHost(requests, batchSlotsRange);

Add this helper near the top (same TU), used by both copySequenceLengths and fillBatchSlots:

// Helper to fill seqSlots host-side to keep logic in one place.
static inline void fillSeqSlotsHost(RequestVector const& requests, tr::BufferRange<SizeType32>& batchSlotsRange)
{
    SizeType32 batchIdx{0};
    for (auto const& llmReq : requests)
    {
        TLLM_CHECK_WITH_INFO(llmReq->mSeqSlot.has_value(), "seqSlot must be assigned before decoder setup");
        batchSlotsRange[batchIdx++] = llmReq->mSeqSlot.value();
    }
}
🧹 Nitpick comments (3)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

1879-1881: Redundant >0 guard after an emptiness check.

localBatchSize > 0 is guaranteed by the preceding if (!finishedContextRequests.empty()) block. Either remove the check or assert stronger invariants.

-            auto const localBatchSize = batchSlots->getSize();
-            TLLM_CHECK_WITH_INFO(localBatchSize > 0, "Decoder setup should be called with at least one request");
+            auto const localBatchSize = batchSlots->getSize();
+            TLLM_CHECK_WITH_INFO(
+                localBatchSize == finishedContextRequests.size(),
+                "Unexpected localBatchSize (%zu) vs. requests (%zu)",
+                static_cast<size_t>(localBatchSize),
+                finishedContextRequests.size());

1882-1885: Pass explicitDraftTokensDType only when ExplicitDraftTokens is active.

You’re always passing mModelConfig.getDataType() to IGptDecoder::setup’s explicitDraftTokensDType parameter. That argument should be std::nullopt unless the decoding mode is ExplicitDraftTokens; otherwise it can confuse downstream logic.

-            mDecoder->getUnderlyingDecoder().setup(samplingConfig, localBatchSize, batchSlots,
-                {mDecoderState->getJointDecodingOutput()}, mModelConfig.getDataType(), lookaheadPrompt,
-                lookaheadAlgoConfigs);
+            auto explicitDraftTokensDType = mModelConfig.getSpeculativeDecodingMode().isExplicitDraftTokens()
+                ? std::optional<nvinfer1::DataType>{mModelConfig.getDataType()}
+                : std::nullopt;
+            mDecoder->getUnderlyingDecoder().setup(samplingConfig, localBatchSize, batchSlots,
+                {mDecoderState->getJointDecodingOutput()}, explicitDraftTokensDType, lookaheadPrompt,
+                lookaheadAlgoConfigs);

Verification: confirm all IGptDecoder::setup callsites only pass a dtype when decoding mode is ExplicitDraftTokens.

cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (1)

52-92: Strengthen preconditions and simplify the guarded copy.

  • copySequenceLengths assumes at least one request; enforce that with a CHECK at entry and remove the inner empty check.
  • Validate seqSlot presence before value() to avoid UB if a call site regresses.
-//! @brief Fills the seqSlots and sequence lengths in the inputBuffers.
-TensorPtr copySequenceLengths(RequestVector const& contextRequests, DecoderInputBuffers& inputBuffers,
+//! @brief Fills the seqSlots and sequence lengths in the inputBuffers.
+TensorPtr copySequenceLengths(RequestVector const& contextRequests, DecoderInputBuffers& inputBuffers,
     ITensor& sequenceLengths, SizeType32 beamWidth, runtime::CudaStream const& stream)
 {
     auto const bufferManager = BufferManager{std::make_shared<CudaStream>(stream.get())};
 
     auto const batchSize = contextRequests.size();
+    TLLM_CHECK_WITH_INFO(batchSize > 0, "copySequenceLengths should be called with at least one request");
     auto batchSlotsView = tr::ITensor::slice(inputBuffers.setupBatchSlots, 0, batchSize);
     auto fillValuesView = tr::ITensor::slice(inputBuffers.fillValues, 0, batchSize);
@@
-    for (auto const& llmReq : contextRequests)
+    for (auto const& llmReq : contextRequests)
     {
+        TLLM_CHECK_WITH_INFO(llmReq->mSeqSlot.has_value(), "seqSlot must be assigned before decoder setup");
         auto const disaggFirstGenTokenSize
             = llmReq->getContextPhaseParams() ? llmReq->getContextPhaseParams().value().getFirstGenTokens().size() : 0;
@@
-    // copy sequence lengths
-    if (!contextRequests.empty())
-    {
-        auto batchSlotsDeviceView = tr::ITensor::slice(inputBuffers.setupBatchSlotsDevice, 0, batchSize);
-        auto fillValuesViewDevice = tr::ITensor::slice(inputBuffers.fillValuesDevice, 0, batchSize);
-
-        bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
-        bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
-        tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
-    }
+    // copy sequence lengths
+    auto batchSlotsDeviceView = tr::ITensor::slice(inputBuffers.setupBatchSlotsDevice, 0, batchSize);
+    auto fillValuesViewDevice = tr::ITensor::slice(inputBuffers.fillValuesDevice, 0, batchSize);
+    bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
+    bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
+    tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 479a99e and 5b94f79.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
  • cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
💤 Files with no reviewable changes (1)
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
🚧 Files skipped from review as they are similar to previous changes (4)
  • cpp/include/tensorrt_llm/runtime/decoderState.h
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
  • cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧠 Learnings (2)
📚 Learning: 2025-08-21T09:41:49.327Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.327Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
🧬 Code graph analysis (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
cpp/tensorrt_llm/pybind/runtime/bindings.cpp (4)
  • samplingConfig (163-172)
  • samplingConfig (163-168)
  • samplingConfig (189-193)
  • samplingConfig (189-190)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4)
  • fillBatchSlots (134-154)
  • fillBatchSlots (134-134)
  • fuseSamplingConfigs (156-170)
  • fuseSamplingConfigs (156-156)
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (3)
cpp/tests/runtime/gptDecoderBatchedTest.cpp (2)
  • copySequenceLengths (153-158)
  • copySequenceLengths (153-154)
cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (4)
  • samplingConfig (66-75)
  • samplingConfig (66-71)
  • samplingConfig (92-96)
  • samplingConfig (92-93)
cpp/tensorrt_llm/pybind/runtime/bindings.cpp (4)
  • samplingConfig (163-172)
  • samplingConfig (163-168)
  • samplingConfig (189-193)
  • samplingConfig (189-190)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

1886-1890: Good cross-stream synchronization.

Recording on decoder stream and waiting on the runtime stream correctly orders the decoder setup ahead of subsequent runtime work.
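In plain CUDA runtime API terms (not the TensorRT-LLM stream/event wrappers), the ordering described above amounts to:

#include <cuda_runtime.h>

void orderDecoderSetupBeforeRuntimeWork(cudaStream_t decoderStream, cudaStream_t runtimeStream)
{
    cudaEvent_t setupDone{};
    cudaEventCreateWithFlags(&setupDone, cudaEventDisableTiming);

    // ... decoder setup kernels/copies are enqueued on decoderStream here ...

    cudaEventRecord(setupDone, decoderStream);        // record the event on the decoder stream
    cudaStreamWaitEvent(runtimeStream, setupDone, 0); // the runtime stream waits on that event

    // ... work enqueued on runtimeStream after this point runs only once decoder setup has finished ...

    cudaEventDestroy(setupDone);
}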


2523-2539: disableLookaheadDecoder handles batching and device‐side buffers correctly—no changes needed.

The existing implementation of disableLookaheadDecoder already covers both empty and non‐empty genRequests correctly:

  • When genRequests is empty, fillBatchSlots returns a zero‐length slice and fuseSamplingConfigs yields nullopt; the call to
    IGptDecoder::disableLookahead(nullopt, 0, batchSlots) still invokes
    DynamicDecodeLayer::disableLookahead(…, 0, batchSlots, …) which resets the decode mode and reinitializes all lookahead layers (dynamicDecodeLayer.cpp:106–114). Skipping this would leave the dynamic layer in its prior state.
  • When genRequests is non‐empty, GptDecoder::disableLookahead calls workspace->setDeviceBatchSlots(batchSlots) before forwarding to the dynamic layer, ensuring the host‐filled batchSlots are copied onto the device for the kernels (gptDecoder.cpp:115–118 & dynamicDecodeLayer.cpp:111–118).

Because both paths correctly reset the lookahead state and perform any needed device transfer, the proposed early return for empty batches would omit the critical reset step, and a manual copy in fillBatchSlots is redundant.

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (2)

156-171: LGTM: fused SamplingConfig construction.

The optional-return API and aggregation approach look correct and align with the new decoder interfaces.


183-199: LGTM: operator() contract and returned tuple match the new setup path.

  • Non-empty precondition is enforced.
  • copySequenceLengths uses the runtime stream (consistent with kernels that write sequenceLengths).
  • Returning fused SamplingConfig and lookahead artifacts aligns with downstream setup.

Comment on lines +1866 to 1871
RequestVector finishedContextRequests;
std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
[](auto const& llmReq) { return llmReq->isLastContextChunk(); });

auto [batchSlots, samplingConfigs, lookaheadPrompt, lookaheadAlgoConfigs]
= (*mCreateNewDecoderRequests)(mModelConfig, mWorldConfig, mDecodingConfig, contextRequests, logitsType,
inputBuffers, *mDecoderState, mRuntime->getStream(), *mDecoder->getDecoderStream(), getMaxSequenceLen(),
mOperatingBeamWidth, buffers.mMedusaBuffers);

auto const localBatchSize = batchSlots->getSize();
if (localBatchSize > 0)
if (!finishedContextRequests.empty())
{

🛠️ Refactor suggestion

⚠️ Potential issue

Don’t drop disaggregated-generation setup when no “last context chunk” is present.

setupDecoderStep now filters to last-context chunks only. When called from prepareDistGenBufferAndDecoder (Lines 1681-1683) with generation requests, isLastContextChunk() will typically be false, so finishedContextRequests becomes empty and decoder setup is skipped, breaking the disaggregated generation init path.

Fix: fall back to using the passed-in requests if no last-context chunk is found.

-        RequestVector finishedContextRequests;
-        std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
-            [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
+        RequestVector finishedContextRequests;
+        std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
+            [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
+        // Disagg generation init path calls this with generation-ready requests (no context chunk in-flight).
+        if (finishedContextRequests.empty())
+        {
+            finishedContextRequests = contextRequests;
+        }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp around lines
1866 to 1871, the code filters contextRequests to only last-context chunks and
skips decoder setup if none are found, which drops disaggregated-generation
initialization; change the logic so that after building finishedContextRequests
you check if it is empty and, if so, assign finishedContextRequests =
contextRequests (i.e., fall back to the original passed-in requests) before
calling setupDecoderStep, ensuring decoder setup still runs for generation
requests when no last-context chunk exists.

Comment on lines +2628 to +2637
        disableLookaheadDecoder(scheduledRequests.generationRequests, mDecoderInputBuffers.at(getFusedBufferId()));
        mDecoderState->disableLookahead();

        for (auto const& llmReq : scheduledRequests.generationRequests)
        {
            if (llmReq->mSeqSlot)
            {
                mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
            }


🛠️ Refactor suggestion

Disable LAD for all active slots (ctx + gen), and reset engine tokens consistently.

The current call disables lookahead only for scheduledRequests.generationRequests. When LAD is turned off due to constraints, there may be only context requests in-flight; underlying decoder state for those slots won’t be updated. Also, resetting “numDecodingEngineTokens = 1” should apply to all slots in this transition.

-        disableLookaheadDecoder(scheduledRequests.generationRequests, mDecoderInputBuffers.at(getFusedBufferId()));
-        mDecoderState->disableLookahead();
-
-        for (auto const& llmReq : scheduledRequests.generationRequests)
+        // Apply to all scheduled slots (both ctx and gen have seqSlot at this point)
+        RequestVector requestsForDisable;
+        requestsForDisable.reserve(
+            scheduledRequests.contextRequests.size() + scheduledRequests.generationRequests.size());
+        requestsForDisable.insert(requestsForDisable.end(),
+            scheduledRequests.contextRequests.begin(), scheduledRequests.contextRequests.end());
+        requestsForDisable.insert(requestsForDisable.end(),
+            scheduledRequests.generationRequests.begin(), scheduledRequests.generationRequests.end());
+
+        disableLookaheadDecoder(requestsForDisable, mDecoderInputBuffers.at(getFusedBufferId()));
+        mDecoderState->disableLookahead();
+
+        for (auto const& llmReq : requestsForDisable)
         {
             if (llmReq->mSeqSlot)
             {
                 mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
             }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp around lines
2628 to 2637, the code only disables lookahead for
scheduledRequests.generationRequests and only resets numDecodingEngineTokens for
those gen slots; when LAD is turned off there may be only context requests
in-flight and their decoder state won't be updated. Fix by invoking
disableLookaheadDecoder over all active requests (both generationRequests and
contextRequests) using the same fused buffer, keep
mDecoderState->disableLookahead() as is, and iterate every active slot from both
lists to call mDecoderState->setNumDecodingEngineTokens(slot, 1) so all slots
are consistently reset when disabling lookahead.

@tensorrt-cicd
Collaborator

PR_Github #16184 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12172 completed with status: 'FAILURE'

- Removed the RequestVector parameter from the disableLookahead method in DecoderState, streamlining its interface.
- Updated calls to disableLookahead in TrtGptModelInflightBatching to reflect the new method signature.
- Adjusted internal logic to maintain functionality without the need for external request data.

Signed-off-by: Robin Kobus <[email protected]>
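
For context, the updated call sites follow roughly this pattern (a sketch assembled from the diffs reviewed below; the exact surrounding code lives in trtGptModelInflightBatching.cpp):

        // Reset the decoder's global lookahead state, then reset each active slot
        // back to a single decoding engine token.
        mDecoderState->disableLookahead();
        for (auto const& llmReq : scheduledRequests.generationRequests)
        {
            if (llmReq->mSeqSlot)
            {
                mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
            }
        }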
…GptDecoderBatched

- Eliminated the disableLookahead method from both GptDecoderBatched and iGptDecoderBatched interfaces to streamline the API.
- Updated TrtGptModelInflightBatching to utilize a new internal method for handling lookahead disabling, improving encapsulation and clarity.
- Adjusted related logic in the setupDecoderStep method to maintain functionality without the removed method.

Signed-off-by: Robin Kobus <[email protected]>
…ateNewDecoderRequests

- Updated the operator() method to utilize std::optional for sampling configurations, improving flexibility in handling requests.
- Introduced fillBatchSlots and fuseSamplingConfigs methods to streamline batch slot filling and sampling configuration fusion.
- Refactored copySequenceLengths to return batch slots, enhancing clarity and separation of concerns in the code.
- Use new methods to disable lookahead in trtGptModelInflightBatching.

Signed-off-by: Robin Kobus <[email protected]>
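
As a rough illustration of the fusion helper described in the commit above, a minimal sketch; member and type names such as mSamplingConfig and the vector-of-configs constructor are assumptions, not verbatim from this PR:

        // Sketch: fuse per-request sampling configs into one batched config.
        // Returns std::nullopt for an empty batch so callers can skip decoder setup.
        std::optional<runtime::SamplingConfig> fuseSamplingConfigs(RequestVector const& requests)
        {
            if (requests.empty())
            {
                return std::nullopt;
            }
            std::vector<runtime::SamplingConfig> configs;
            configs.reserve(requests.size());
            for (auto const& llmReq : requests)
            {
                configs.push_back(llmReq->mSamplingConfig);
            }
            return runtime::SamplingConfig(configs);
        }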
…more effectively

- Filter out finished context requests from the contextRequests vector before calling CreateNewDecoderRequests.
- Modified the operator() method to directly use finishedContextRequests, enhancing clarity.
- Updated the handling of sampling configurations to ensure a valid value is returned, improving robustness.

Signed-off-by: Robin Kobus <[email protected]>
@Funatiq Funatiq force-pushed the dev/refactor_lookahead branch from 5b94f79 to 6062c69 Compare August 23, 2025 05:45
@Funatiq
Collaborator Author

Funatiq commented Aug 23, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16249 [ run ] triggered by Bot

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

1866-1890: Don’t drop decoder setup when no “last context chunk” exists (disagg gen init).

When no request has isLastContextChunk(), finishedContextRequests is empty and we skip decoder setup, breaking the disaggregated-generation initialization path.

Apply:

         RequestVector finishedContextRequests;
         std::copy_if(contextRequests.begin(), contextRequests.end(), std::back_inserter(finishedContextRequests),
             [](auto const& llmReq) { return llmReq->isLastContextChunk(); });
 
-        if (!finishedContextRequests.empty())
+        // Disagg generation init path calls this with generation-ready requests (no context chunk in-flight).
+        if (finishedContextRequests.empty())
+        {
+            finishedContextRequests = contextRequests;
+        }
+        if (!finishedContextRequests.empty())
         {
             auto const logitsType = mRuntime->getEngine().getTensorDataType("logits");

2629-2637: Disable LAD and reset engine-token count for all active slots (ctx + gen).

Currently applies only to generationRequests. When switching off Lookahead, context-only in-flight slots won’t be reset.

Apply:

-        disableLookaheadDecoder(scheduledRequests.generationRequests, mDecoderInputBuffers.at(getFusedBufferId()));
-        mDecoderState->disableLookahead();
-
-        for (auto const& llmReq : scheduledRequests.generationRequests)
+        // Apply to all scheduled slots (both ctx and gen have seqSlot at this point)
+        RequestVector requestsForDisable;
+        requestsForDisable.reserve(
+            scheduledRequests.contextRequests.size() + scheduledRequests.generationRequests.size());
+        requestsForDisable.insert(requestsForDisable.end(),
+            scheduledRequests.contextRequests.begin(), scheduledRequests.contextRequests.end());
+        requestsForDisable.insert(requestsForDisable.end(),
+            scheduledRequests.generationRequests.begin(), scheduledRequests.generationRequests.end());
+
+        disableLookaheadDecoder(requestsForDisable, mDecoderInputBuffers.at(getFusedBufferId()));
+        mDecoderState->disableLookahead();
+
+        for (auto const& llmReq : requestsForDisable)
         {
             if (llmReq->mSeqSlot)
             {
                 mDecoderState->setNumDecodingEngineTokens(llmReq->mSeqSlot.value(), 1);
             }
#!/bin/bash
# Quick sanity: ensure all call sites of disableLookahead() use new signature (no args).
rg -nP --type=cpp -C2 '\bdisableLookahead\s*\(' | sed -n '1,200p'
🧹 Nitpick comments (8)
cpp/tensorrt_llm/runtime/decoderState.cpp (1)

446-460: API change LGTM; add note on per-slot reset behavior and consider header year.

  • The parameterless disableLookahead() correctly resets speculative state and reshapes buffers. Callers must handle per-slot numDecodingEngineTokens reset (now done at call sites).
  • Nit: header still says 2022-2024; repo guidelines suggest current year.
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2)

73-80: Return fused SamplingConfig: OK. Consider [[nodiscard]] and clarify empty input behavior.

  • Good: operator() now returns a single fused SamplingConfig.
  • Suggest marking operator() [[nodiscard]] to avoid accidental ignoring of the return tuple.
-    std::tuple<TensorPtr, runtime::SamplingConfig, std::vector<runtime::ITensor::SharedConstPtr>,
+    [[nodiscard]] std::tuple<TensorPtr, runtime::SamplingConfig, std::vector<runtime::ITensor::SharedConstPtr>,
         std::vector<executor::LookaheadDecodingConfig>>

89-92: Expose helpers: add [[nodiscard]] and include <optional>.

  • Since fillBatchSlots and fuseSamplingConfigs return values that must be consumed, mark them [[nodiscard]].
  • Ensure this header includes <optional> explicitly.
+#include <optional>
@@
-    static TensorPtr fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers);
+    [[nodiscard]] static TensorPtr fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers);
@@
-    static std::optional<SamplingConfig> fuseSamplingConfigs(RequestVector const& requests);
+    [[nodiscard]] static std::optional<SamplingConfig> fuseSamplingConfigs(RequestVector const& requests);
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

2523-2539: Guard against empty batches in disableLookaheadDecoder and avoid unnecessary calls.

fillBatchSlots on an empty vector returns a 0-sized view; we still call disableLookahead and record/wait. Early-out to save work and avoid edge-case behavior.

 void TrtGptModelInflightBatching::disableLookaheadDecoder(
     RequestVector const& genRequests, DecoderInputBuffers& inputBuffers)
 {
     TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
 
     auto batchSlots = CreateNewDecoderRequests::fillBatchSlots(genRequests, inputBuffers);
     auto samplingConfig = CreateNewDecoderRequests::fuseSamplingConfigs(genRequests);
 
+    if (batchSlots->getSize() == 0)
+    {
+        TLLM_LOG_TRACE("%s stop (no requests)", __PRETTY_FUNCTION__);
+        return;
+    }
     mDecoder->getUnderlyingDecoder().disableLookahead(samplingConfig, batchSlots->getSize(), batchSlots);
cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (4)

57-97: copySequenceLengths: semantics look correct; minor: avoid redundant host-to-device copies when batch is tiny.

The host-fill + device-copy + kernel looks fine. If micro-batches frequently have batch size 1, consider a small fast-path to directly fill the single entry on device to avoid two host copies. Optional.

-    bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
-    bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
-    tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
+    if (batchSize == 1) {
+        tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsView, beamWidth, *fillValuesView, stream);
+    } else {
+        bufferManager.copy(*batchSlotsView, *batchSlotsDeviceView);
+        bufferManager.copy(*fillValuesView, *fillValuesViewDevice);
+        tr::kernels::invokeFillBatch(sequenceLengths, *batchSlotsDeviceView, beamWidth, *fillValuesViewDevice, stream);
+    }

138-160: fillBatchSlots: consider DRY with copySequenceLengths and optional device copy.

The host-side loop duplicates part of copySequenceLengths. If you plan to use the slots on device soon, consider adding an optional toDevice flag here to copy into setupBatchSlotsDevice.

- TensorPtr CreateNewDecoderRequests::fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers)
+ TensorPtr CreateNewDecoderRequests::fillBatchSlots(RequestVector const& requests, DecoderInputBuffers& inputBuffers /*, bool toDevice=false*/)
 {
@@
-    // TODO: copy to device and use in GptDecoder
+    // TODO: copy to device and use in GptDecoder (optional flag)

161-176: fuseSamplingConfigs: behavior on empty input is clear; optional: assert homogeneity where required.

If downstream requires homogeneous fields (e.g., temperature arrays length-1), optionally add lightweight checks under debug builds.
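
A minimal sketch of such a debug-only check, assuming the repo's TLLM_CHECK_WITH_INFO macro and the public beamWidth member on SamplingConfig (illustrative, not part of this PR):

        #ifndef NDEBUG
        // Verify the configs being fused agree on beam width before producing a single fused config.
        if (!requests.empty())
        {
            auto const expectedBeamWidth = requests.front()->mSamplingConfig.beamWidth;
            for (auto const& llmReq : requests)
            {
                TLLM_CHECK_WITH_INFO(llmReq->mSamplingConfig.beamWidth == expectedBeamWidth,
                    "Sampling configs must be homogeneous to be fused.");
            }
        }
        #endif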


188-204: operator(): assert non-empty requests — good. Ensure beamWidth ≤ decoder’s max and streams are coherent.

  • You already validate non-empty. Good.
  • Consider asserting beamWidth <= decoderState.getMaxBeamWidth() here (even though enforced per-request later) to fail-fast.
  • Ensure runtimeStream is the same stream your BufferManager uses elsewhere in the call chain (looks consistent).
     TLLM_LOG_TRACE("%s stop", __PRETTY_FUNCTION__);
     return {std::move(batchSlotsView), std::move(samplingConfig.value()), std::move(lookaheadPrompt),
         std::move(lookaheadAlgoConfigs)};
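
The suggested fail-fast check could be as small as the following (assuming the existing TLLM_CHECK_WITH_INFO macro; placement and message are illustrative):

        TLLM_CHECK_WITH_INFO(beamWidth <= decoderState.getMaxBeamWidth(),
            "beamWidth (%d) exceeds the decoder's maxBeamWidth (%d)", beamWidth, decoderState.getMaxBeamWidth());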
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5b94f79 and 6062c69.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h (2 hunks)
  • cpp/include/tensorrt_llm/runtime/decoderState.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h (1 hunks)
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp (5 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/bindings.cpp (0 hunks)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h (0 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (0 hunks)
  • cpp/tensorrt_llm/runtime/decoderState.cpp (1 hunks)
  • cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (2 hunks)
💤 Files with no reviewable changes (5)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h
  • cpp/tensorrt_llm/runtime/gptDecoderBatched.cpp
  • cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
  • cpp/tensorrt_llm/pybind/bindings.cpp
  • cpp/tensorrt_llm/nanobind/bindings.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
  • cpp/include/tensorrt_llm/runtime/decoderState.h
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h
  • cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/runtime/decoderState.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/runtime/decoderState.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/runtime/decoderState.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/runtime/decoderState.cpp
  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/createNewDecoderRequests.h
🧠 Learnings (3)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/createNewDecoderRequests.cpp
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

905-910: No action needed: model_config.data_type is the correct dtype for the underlying decoder setup.

The C++ signature for GptDecoderBatched::setup (and its interface IGptDecoderBatched::setup) clearly takes nvinfer1::DataType dtype as its fourth parameter, representing the network’s precision (e.g. FP32, FP16, INT8), followed by ModelConfig and WorldConfig — not the logits’ torch dtype. Passing self.model_config.data_type from Python correctly maps to this TensorRT data type, so there’s no need to switch to self.logits_datatype.

 from tensorrt_llm._utils import nvtx_range, torch_dtype_to_binding
-from tensorrt_llm.bindings import (CudaStream, DataType, ModelConfig,
-                                   WorldConfig, make_sampling_config)
+from tensorrt_llm.bindings import CudaStream, DataType, ModelConfig, WorldConfig

💡 Verification agent

🧩 Analysis chain

Remove stale use of make_sampling_config (binding removed) in ADP path.

make_sampling_config is not imported and the binding was removed in this refactor. The current ADP branch will raise NameError. Use a single sampling config (first ADP req) — consistent with the fused-config design — and avoid extra validation for performance.

Apply:

-        config = make_sampling_config([r.sampling_config for r in adp])
+        # ADP: use the sampling_config from the first request (configs are expected homogeneous)
+        config = adp[0].sampling_config
         slots = torch.tensor([r.py_seq_slot for r in adp], dtype=torch.int32)
         self.algs.decoder.underlying_decoder().setup(config, batch_size, slots)

Run to confirm no remaining references to the removed helper:

Also applies to: 917-920


🏁 Script executed:

#!/bin/bash
# Verify that make_sampling_config is no longer referenced
rg -nP --glob '!**/build/**' '\bmake_sampling_config\b'

Length of output: 175


Remove stale make_sampling_config call in ADP path

The binding for make_sampling_config has been removed, but the code at tensorrt_llm/_torch/pyexecutor/sampler.py:917 still calls it, causing a NameError at runtime. Please replace this with direct use of the first request’s sampling_config, which matches the fused-config design and avoids extra validation overhead.

Location:

  • File: tensorrt_llm/_torch/pyexecutor/sampler.py
  • Line: 917

Recommended fix:

-        config = make_sampling_config([r.sampling_config for r in adp])
+        # ADP: use the sampling_config from the first request (configs are expected homogeneous)
+        config = adp[0].sampling_config
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around line 917, the code calls the
now-removed binding make_sampling_config which causes a NameError; replace that
call by directly using the first request's sampling_config field (e.g.,
requests[0].sampling_config) in the ADP/fused-config path, remove any
surrounding validation or conversion that invoked make_sampling_config, and
ensure downstream code expects the same sampling_config object (adapt types if
necessary) so the fused-config flow uses the request-provided config without
extra overhead.

Comment on lines +887 to 901
+        finished_context_requests = [
+            r for r in requests if r.is_last_context_chunk
+        ]
+
+        if len(finished_context_requests) == 0:
+            return
+
         batch_slots, sampling_config, lookahead_prompt, lookahead_algo_configs = self.algs.create_new_decoder_requests(
             self.model_config, self.world_config, self.decoding_config,
-            requests.context_requests, self.logits_datatype,
+            finished_context_requests, self.logits_datatype,
             self.store["decoder_input_buffers"][self.micro_batch_idx],
             self.store["decoder_state"], self.store["cuda_stream"],
             self.algs.decoder.decoder_stream, self.executor_config.max_seq_len,
-            self.beam_width(requests.context_requests))
+            self.beam_width(finished_context_requests))


🛠️ Refactor suggestion

Fix: iterate context requests, and don’t skip decoder setup when no last-chunk is present (disagg gen path).

  • Bug 1: requests is a ScheduledRequests, not an iterable of LlmRequest. Iterating it will not yield requests; use requests.context_requests.
  • Bug 2: If no request has is_last_context_chunk, we currently return early and skip decoder setup. This drops the disaggregated-generation init path. Fall back to using requests.context_requests when the filtered list is empty.

Apply:

 @nvtx_range("setup_sampler_step")
 def setup_sampler_step(self, requests):
-        finished_context_requests = [
-            r for r in requests if r.is_last_context_chunk
-        ]
-
-        if len(finished_context_requests) == 0:
-            return
+        # Consider only context requests
+        finished_context_requests = [
+            r for r in requests.context_requests if r.is_last_context_chunk
+        ]
+        # Disaggregated generation init path: fall back if no "last chunk" found
+        if len(finished_context_requests) == 0:
+            finished_context_requests = list(requests.context_requests)
+            if len(finished_context_requests) == 0:
+                return
@@
         batch_slots, sampling_config, lookahead_prompt, lookahead_algo_configs = self.algs.create_new_decoder_requests(
             self.model_config, self.world_config, self.decoding_config,
-            finished_context_requests, self.logits_datatype,
+            finished_context_requests, self.logits_datatype,
             self.store["decoder_input_buffers"][self.micro_batch_idx],
             self.store["decoder_state"], self.store["cuda_stream"],
             self.algs.decoder.decoder_stream, self.executor_config.max_seq_len,
-            self.beam_width(finished_context_requests))
+            self.beam_width(finished_context_requests))
 
         local_batch_size = len(batch_slots)
         assert local_batch_size > 0, "Decoder setup should be called with at least one request"
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/sampler.py around lines 887 to 901, the code
incorrectly iterates over the ScheduledRequests object and returns early when no
request has is_last_context_chunk, which skips decoder setup for the
disaggregated-generation path; change the iteration to use
requests.context_requests to collect finished_context_requests (e.g., [r for r
in requests.context_requests if r.is_last_context_chunk]) and if that filtered
list is empty, fall back to using requests.context_requests (do not return) so
the decoder setup call always runs with the appropriate list when initializing
disaggregated generation.

@tensorrt-cicd
Collaborator

PR_Github #16249 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12214 completed with status: 'FAILURE'
