
Conversation

eopXD
Collaborator

@eopXD eopXD commented Sep 23, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced KV-cache with sequence-aware management and sliding-window attention support.

Description

NOTE: The MR this change depends on is still in progress and has not been merged yet.

This MR is a continuation of #6768. In the previous merge request, OOW (out-of-window) blocks were detached only when reuse was disabled. This MR enables the KV cache manager to detach OOW blocks when reuse is enabled as well.

When reuse is enabled, an OOW block is detached while the manager keeps track of whether the block gets overwritten by another sequence. If all of a sequence's blocks are still clean when the sequence ends, its blocks are stored for reuse.
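
To illustrate the idea, here is a minimal, self-contained C++ sketch of the bookkeeping (SwaReuseBookkeeper and its method names are illustrative stand-ins, not the actual kvCacheManager API):

#include <unordered_map>
#include <unordered_set>

using SequenceId = int; // stand-ins for the real ID types
using BlockId = int;

class SwaReuseBookkeeper
{
public:
    // An OOW block is detached from its sequence and returned to the free
    // pool, but we remember which sequence it logically belongs to.
    void onDetachOutOfWindow(SequenceId seq, BlockId block)
    {
        mDetachedOwner[block] = seq;
    }

    // A free block is acquired. If another sequence takes over a detached
    // block, the original owner's cached content is overwritten, so the
    // owner is no longer eligible to store its blocks for reuse.
    void onAcquire(SequenceId seq, BlockId block)
    {
        auto it = mDetachedOwner.find(block);
        if (it != mDetachedOwner.end() && it->second != seq)
        {
            mDirtySequences.insert(it->second);
            mDetachedOwner.erase(it);
        }
    }

    // At the end of a sequence: store its blocks only if they are all clean.
    bool isValidForStore(SequenceId seq) const
    {
        return mDirtySequences.count(seq) == 0;
    }

private:
    std::unordered_map<BlockId, SequenceId> mDetachedOwner;
    std::unordered_set<SequenceId> mDirtySequences;
};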

Test Coverage

For unit tests, KVCacheManagerVariableWindowAttentionWithReuseTest, KVCacheManagerSWAInvalidateReuseTest, and KVCacheManagerMaxAttentionWindowWithReuseTest are added to verify the SWA reuse and block-invalidation mechanisms.

For end-to-end coverage, the existing VSWA tests for Gemma3 on the MMLU and GSM8K benchmarks exercise the code paths added in this merge request.
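
As a rough, self-contained sketch of the scenario these tests exercise (reusing the hypothetical SwaReuseBookkeeper from the Description above, not the real fixtures in kvCacheManagerTest.cpp):

#include <cassert>

int main()
{
    SwaReuseBookkeeper bookkeeper;

    // Sequence 0 slides past its window; block 7 is detached but remembered.
    bookkeeper.onDetachOutOfWindow(/*seq=*/0, /*block=*/7);

    // Sequence 0 re-acquiring its own detached block keeps it clean.
    bookkeeper.onAcquire(/*seq=*/0, /*block=*/7);
    assert(bookkeeper.isValidForStore(0));

    // A different sequence overwriting the detached block invalidates
    // sequence 0 for store-for-reuse.
    bookkeeper.onDetachOutOfWindow(/*seq=*/0, /*block=*/7);
    bookkeeper.onAcquire(/*seq=*/1, /*block=*/7);
    assert(!bookkeeper.isValidForStore(0));

    return 0;
}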

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
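
For example, a typical invocation combining the documented flags (assuming they compose as described above):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"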

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

@eopXD eopXD requested review from a team as code owners September 23, 2025 08:20
Contributor

coderabbitai bot commented Sep 23, 2025

📝 Walkthrough

The change set makes KV-cache management sequence-aware and SWA-aware across headers, implementation, and bindings. It replaces optional max-sequence-length with required parameters, adjusts constructor signatures and call sites, adds per-sequence bookkeeping and block reuse validity, removes cyclic KV-cache logic, updates tests, and refreshes related help text/docs.

Changes

Cohort / File(s) Summary of changes
KV-cache core API and implementation
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Introduces per-sequence block ownership/validity, SWA flag handling, front-block detachment/adjustment, and sequence-aware store/onboard/release paths. Adds kSWAExtraBlock, maxBlocksPerSeq, getAllSequenceBlocks. Changes many method signatures to thread GenerationRequest. Makes maxSequenceLength required.
Inflight batching integration
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
Removes cyclic KV-cache logic and “use one more block” path. Updates KVCacheManager construction to pass concrete max sequence/encoder length and new reuse options. Adjusts RewindInputs to fixed false.
Python bindings
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
Updates KVCacheManager constructor binding: previously-optional SizeType32 is now required; argument names aligned with new signature, including enable_partial_reuse and copy_on_partial_reuse.
C++ unit tests (constructor alignment)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp, cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp, cpp/tests/unit_tests/executor/agentCommTest.cpp, cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
Updates KVCacheManager call sites to pass required maxNumTokens/maxSequenceLength argument (replacing std::nullopt and prior parameter clusters). Adjusts parameters ordering; disables some window-related multi-GPU test instantiations.
Executor PyTorch util
tensorrt_llm/_torch/pyexecutor/_util.py
Unifies max_tokens computation; removes VSWA-specific branch and always derives from kv_cache_max_memory.
Functional docstrings
tensorrt_llm/functional.py
Updates wording for sliding-window-attention description; no functional changes.
Examples help text
examples/models/core/llama/summarize_long.py, examples/models/core/qwen2audio/utils.py, examples/utils.py
Removes “cyclic” wording from --max_attention_window_size help strings.
Integration tests (VSWA variants)
tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/test-db/l0_h100.yml
Replaces a skipped VSWA test with four explicit VSWA variants (reuse/non-reuse, chunked prefill on/off). Updates test list entries accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor App as Caller
  participant KM as KVCacheManager
  participant BM as BlockManager
  participant WBM as WindowBlockManager
  participant Seq as GenerationRequest
  rect rgba(230,240,255,0.5)
    note over App,KM: Add sequence and allocate context
    App->>KM: addSequence(Seq, ...)
    KM->>BM: holdSequence(Seq.id)
    KM->>BM: adjustBlocksIfNeeded(Seq, enableBlockReuse)
  end
  rect rgba(240,255,240,0.5)
    note over App,WBM: Onboard a new block (sequence-aware)
    App->>KM: onboardBlock(Seq, windowSize, ...)
    KM->>BM: onboardBlock(Seq, ...)
    BM->>WBM: onboardBlock(Seq, ...)
    WBM->>WBM: getFreeBlock(Seq, ...)
    WBM-->>BM: BlockPtr
    BM-->>KM: BlockPtr
    KM-->>App: BlockPtr
  end
  alt SWA sequence
    note right of WBM: SWA guards: skip store/detach on-the-fly
  else Non-SWA
    rect rgba(255,250,230,0.6)
      note over WBM: Detach/adjust front blocks when needed
      KM->>BM: adjustBlocksIfNeeded(Seq, enableBlockReuse)
      BM->>WBM: adjustBlocksIfNeeded(Seq, enableBlockReuse)
      WBM->>WBM: detachFrontBlock(Seq, enableBlockReuse)
    end
  end
  rect rgba(255,235,235,0.6)
    note over App,WBM: Store or release blocks (sequence validity tracked)
    App->>KM: releaseBlocks(Seq, ...)
    KM->>BM: releaseBlocks(Seq, ...)
    BM->>WBM: storeBlocks(Seq, ...) / release(...)
    WBM->>BM: update block->sequence mappings
  end
  KM->>BM: releaseSequence(Seq.id)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.48% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach" follows the repository convention by including a ticket and type and succinctly states the primary change: enabling out-of-window (OOW) block detachment for SWA KV-cache reuse. It is specific, on-topic, and free of noise or file lists, so a teammate scanning history will understand the main purpose of the PR.
Description Check ✅ Passed The PR description follows the repository template by providing a clear Description that explains the issue and solution, a Test Coverage section listing added unit tests and end-to-end coverage, and a PR Checklist; it also notes the dependent MR is still ongoing. The content is specific and actionable for reviewers, and it lists relevant tests that exercise the new SWA reuse and invalidation behavior.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

582-586: Bug: cacheType shadowing prevents kSELFKONLY from being set.

Inner ‘auto cacheType’ shadows the outer variable; SELF is always used even when kvFactor == 1.

Fix:

-        CacheType cacheType = CacheType::kSELF;
-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
+        CacheType cacheType = (kvFactor == 1) ? CacheType::kSELFKONLY : CacheType::kSELF;
🧹 Nitpick comments (12)
examples/utils.py (1)

356-362: Help text tweak: clarify plural semantics for nargs="+"

Since this arg accepts multiple integers, make the wording plural and add a brief example.

-        'The attention window size that controls the sliding window attention kv cache behavior'
+        'Attention window size(s) controlling sliding-window attention KV-cache behavior. '
+        'Accepts one or more ints (e.g., --max_attention_window_size 4096 or 2048 4096).'
examples/models/core/qwen2audio/utils.py (2)

41-41: Polish help text for clarity and consistency (KV‑cache, hyphenation).

Suggest concise copy and consistent hyphenation/capitalization.

-        'The attention window size that controls the sliding window attention kv cache behavior'
+        'Controls the sliding-window attention KV-cache behavior.'

1-1: Missing SPDX/NVIDIA Apache-2.0 header.

Per guidelines, add the license header to Python sources.

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 from argparse import BooleanOptionalAction
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)

59-63: Constructor signature alignment LGTM; prefer explicit SizeType32 for kvMaxNumTokens.

Avoid implicit int → SizeType32 conversions to keep types consistent across platforms.

Apply:

-        auto kvMaxNumTokens = tokensPerBlock * maxBlocksPerSeq;
+        SizeType32 kvMaxNumTokens = static_cast<SizeType32>(tokensPerBlock) * maxBlocksPerSeq;
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)

84-88: Constructor update LGTM; minor type tightening suggested.

Use SizeType32 for kvMaxNumTokens to match manager expectations.

-        auto kvMaxNumTokens = tokensPerBlock * maxBlocksPerSeq;
+        SizeType32 kvMaxNumTokens = static_cast<SizeType32>(tokensPerBlock) * maxBlocksPerSeq;
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

621-623: Constructor update LGTM; consider explicit type for maxNumTokens.

Use SizeType32 to avoid implicit narrowing and mirror other callsites.

-            dataType, sinkTokenLength, stream, maxNumTokens, enableBlockReuse, onboardBlocks, CacheType::kSELF,
+            dataType, sinkTokenLength, stream, static_cast<SizeType32>(maxNumTokens), enableBlockReuse, onboardBlocks, CacheType::kSELF,

1316-1337: Temporarily waiving isWindow tests reduces coverage.

Add a TODO and plan to re‑enable once SWA reuse/OOW detach stabilizes; otherwise we may miss regressions.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

736-736: Fix inconsistent parameter naming in onboardBlock methods.

The onboardBlock methods use inconsistent parameter naming - copyOnPartialReuse in some signatures and copyOnpartialReuse in others.

Apply this diff to fix the inconsistent casing:

-        bool copyOnpartialReuse = true,
+        bool copyOnPartialReuse = true,
-        bool copyOnpartialReuse = true,
+        bool copyOnPartialReuse = true,

Also applies to: 828-829, 951-951, 988-989, 1053-1055, 1060-1060

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

324-327: Remove or relocate debug logging from production code.

The debug logging before the assertion seems misplaced. Consider removing it or wrapping it in a debug-only build flag.

-    if (!hasRefs())
-    {
-        TLLM_LOG_DEBUG("::decRefCount Going to hit assertion for block %d", getBlockId());
-    }
     TLLM_CHECK_WITH_INFO(hasRefs(), "Can't remove link from block that is not allocated");

507-519: Consider using the intended calculateWindowSizeToShare logic.

The comment at line 519-520 indicates that the current uniform distribution (1.0f / windowSizeToLayers.size()) is temporary. The actual proportional calculation based on window sizes is already implemented but commented out. Consider enabling the intended logic if testing has been completed.

The function calculateWindowSizeToShare appears to have sophisticated logic for proportional distribution, but it's not being used. Consider tracking this as a TODO if the uniform distribution is indeed temporary:

+    // TODO(TLLM-XXXX): Enable proportional block allocation based on window sizes
+    // once testing is complete. Currently using uniform distribution for identical performance.
     std::map<SizeType32, float> windowSizeToShare;

1091-1096: Consider documenting the offloading behavior trade-offs.

The comment correctly identifies that the current default behavior may lead to unnecessary traffic. Consider tracking this as a technical debt item.

Add a TODO to track the improvement opportunity:

+    // TODO(TLLM-XXXX): Delegate offloading decisions to the eviction policy
+    // to avoid unnecessary traffic from offloading blocks that won't be reused.
     // The current default behavior is to offload the out-of-window block

1485-1547: Document the return value usage of storeBlocks.

The method now returns the number of blocks stored, but this information appears to be used only for logging. Consider documenting this behavior.

Add a comment explaining the return value usage:

 SizeType32 WindowBlockManager::storeBlocks(
     std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds)
 {
+    // Returns the number of blocks actually stored (excludes already-matched blocks)
     SizeType32 numBlocksStoredForReuse = 0;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb64e74 and 3da009f.

📒 Files selected for processing (15)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (22 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (42 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1 hunks)
  • cpp/tests/unit_tests/executor/agentCommTest.cpp (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3 hunks)
  • examples/models/core/llama/summarize_long.py (1 hunks)
  • examples/models/core/qwen2audio/utils.py (1 hunks)
  • examples/utils.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/functional.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • examples/models/core/qwen2audio/utils.py
  • examples/utils.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • examples/models/core/llama/summarize_long.py
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • examples/models/core/qwen2audio/utils.py
  • examples/utils.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • examples/models/core/llama/summarize_long.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • examples/models/core/qwen2audio/utils.py
  • examples/utils.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • examples/models/core/llama/summarize_long.py
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (9)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code graph analysis (7)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
  • stream (4654-4660)
  • stream (4740-4746)
  • stream (5099-5105)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  • storeBlocks (996-1000)
  • isSequenceValidForStoreForReuse (1243-1248)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (1)
  • KvCacheConfig (972-1106)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1022-1038)
tests/integration/defs/accuracy/accuracy_core.py (4)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (712-722)
  • MMLU (276-290)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (32)
  • nodiscard (1118-1125)
  • nodiscard (1648-1651)
  • nodiscard (2735-2753)
  • KVCacheBlock (272-285)
  • releaseBlocks (1658-1673)
  • releaseBlocks (1658-1658)
  • releaseBlocks (1739-1796)
  • releaseBlocks (1739-1739)
  • updateLastCacheBlockOffsets (2103-2119)
  • updateLastCacheBlockOffsets (2103-2103)
  • releaseLastBlock (1624-1627)
  • releaseLastBlock (1624-1624)
  • releaseLastBlock (1629-1646)
  • releaseLastBlock (1629-1629)
  • detachFrontBlock (2146-2181)
  • detachFrontBlock (2146-2146)
  • adjustBlocksIfNeeded (1365-1371)
  • adjustBlocksIfNeeded (1365-1365)
  • adjustBlocksIfNeeded (1373-1391)
  • adjustBlocksIfNeeded (1373-1373)
  • onboardBlock (1054-1058)
  • onboardBlock (1054-1055)
  • onboardBlock (1060-1080)
  • onboardBlock (1060-1061)
  • offloadBlock (1082-1086)
  • offloadBlock (1082-1083)
  • offloadBlock (1088-1116)
  • offloadBlock (1088-1089)
  • storeBlocks (1485-1548)
  • storeBlocks (1485-1486)
  • getFreeBlock (951-1024)
  • getFreeBlock (951-953)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (40)
examples/models/core/llama/summarize_long.py (1)

44-49: LGTM — help text aligned with SWA wording

Wording consistent with removing "cyclic"; no behavior change. Sandbox rg search skipped files ("No files were searched"), so verify locally:

rg -n -C2 --type=py --hidden -uu --no-ignore -g '!**/site-packages/**' -e '\bmax_attention_window_size\b' -e '\bcyclic\b' || git ls-files | xargs -r grep -n -E -C2 '\b(max_attention_window_size|cyclic)\b'
tensorrt_llm/functional.py (1)

5361-5361: Doc terminology update looks fine; confirm expected tensor shape.

The wording change is OK. Please verify whether host_max_attention_window_sizes is still expected to be shape [1] or per‑layer (vector). The docstring still says [1] while other parts of the PR imply per‑layer support.

tests/integration/test_lists/test-db/l0_h100.yml (1)

39-42: New VSWA test entries look good; verify they exist and are stable.

Ensure these four tests are implemented and not skipped, and that they replace any older VSWA variants in other lists to avoid redundant runtime.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

335-337: Unconditionally setting max_tokens may constrain VSWA unexpectedly.

Previously VSWA left max_tokens unset; now it’s always derived from memory. Confirm C++ respects this as a global capacity (not per-window) and that VSWA capacity planning remains correct for multi-window configs.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

136-139: Updated KVCacheManager callsite matches new signature.

No issues spotted.

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

218-222: KVCacheManager signature change handled correctly here.

No functional concerns with passing maxNumTokens.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)

1060-1072: LGTM! Test correctly configures KV cache without reuse.

The test properly sets up the KV cache configuration with block reuse disabled and validates the SWA window configuration through GSM8K and MMLU benchmarks.


1074-1086: LGTM! Test correctly enables KV cache block reuse.

The test properly validates the new SWA behavior with block reuse enabled. This aligns with the PR objectives to enable OOW block detachment with reuse.


1087-1107: LGTM! Chunked prefill test without reuse is well-structured.

The test correctly combines chunked prefill configuration with disabled KV cache reuse, providing good coverage for this configuration combination.


1108-1127: LGTM! Comprehensive test coverage for chunked prefill with reuse.

The test provides good coverage for the combination of chunked prefill and block reuse, which is important for validating the SWA behavior under different operational modes.

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

486-500: Constructor signature updated correctly for SWA support.

The KVCacheManager constructor binding correctly updates the signature to:

  1. Replace optional maxSequenceLength with required SizeType32
  2. Add enable_partial_reuse and copy_on_partial_reuse parameters

This aligns with the C++ implementation changes for per-sequence SWA management.

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

689-696: LGTM! KV cache manager creation updated for partial reuse.

The createKvCacheManager call correctly passes:

  1. Fixed sequence length based on cache type (encoder vs decoder)
  2. New partial reuse configuration parameters

This enables the per-sequence block management features described in the PR.


2477-2478: Confirm intent: isUseOneMoreBlock is effectively always false

isUseOneMoreBlock is declared at cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h:110, initialized to false at cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:361, and the only usage found is the kernel call at cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:2477–2478. No tests exercise this flag. Either restore dynamic assignment (and add unit tests for the "one more block" rewind path) or remove the parameter/callsite to avoid dead/unused behavior.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11)

60-62: LGTM! SWA-specific extra block constant is well-defined.

The addition of kSWAExtraBlock constant for sliding window attention (SWA) buffer allocation is appropriate and follows the established naming conventions in the codebase.


100-101: LGTM! Per-sequence window metadata additions are appropriate.

The clarification of maxTokenNum as per-sequence and addition of maxBlocksPerSeq field align well with the per-sequence tracking requirements for SWA support.


349-351: Good transition from cyclic threshold to front-block accounting.

Replacing mCyclicThreshold with mNumFrontBlocksRemoved provides clearer semantics for tracking detached blocks in SWA scenarios. The field is well-initialized to 0.


394-397: LGTM! GenerationRequest methods for front block management are well-designed.

The addition of getNumFrontBlocksRemoved(), removeFrontBlock(), and related state tracking provides a clean interface for managing out-of-window blocks in SWA.

Also applies to: 436-437, 439-442, 485-488


554-554: LGTM! WindowBlockManager SWA awareness is properly integrated.

The addition of isSWA parameter throughout the constructor and the isSWA() accessor method provide essential SWA awareness to the block management layer.

Also applies to: 608-609, 661-661, 675-675, 786-789, 871-871


588-589: Documentation improvement enhances clarity.

The updated documentation clearly explains when blocks will be stored during release, which is crucial for understanding reuse behavior in SWA contexts.


594-596: LGTM! New block management methods for SWA are well-structured.

The addition of updateLastCacheBlockOffsets, detachFrontBlock, and adjustBlocksIfNeeded provides the necessary infrastructure for SWA block lifecycle management.

Also applies to: 600-609


791-806: Sequence storage validity tracking is well-implemented.

The addition of methods to track sequence storage validity (initializeSequenceStorageValidity, releaseSequenceStorageValidity, isSequenceValidForStoreForReuse) along with the mapping structures provides necessary bookkeeping for SWA reuse scenarios.

Also applies to: 913-919


1200-1206: LGTM! Clear documentation for adjustBlocksIfNeeded in BlockManager.

The comprehensive documentation explains the method's behavior for both block allocation and detachment in SWA scenarios.


1208-1236: Excellent per-sequence lifecycle management implementation.

The addition of sequence management methods (isSequenceHeld, holdSequence, releaseSequence, isSequenceValidForStoreForReuse) and the mManagedSequences tracking set provides robust lifecycle management for sequences across all window sizes.

Also applies to: 1238-1248, 1284-1285


931-931: LGTM! Constructor signatures properly updated for maxSequenceLength.

All KVCacheManager constructors now consistently use non-optional SizeType32 maxSequenceLength parameter, aligning with the design decision to enforce this as a required parameter.

Also applies to: 1503-1504, 1514-1515, 1525-1526, 1536-1538, 1853-1854, 1887-1888

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (16)

160-191: LGTM! Well-implemented getAllSequenceBlocks helper function.

The function efficiently traverses the block chain backwards with pre-allocation for performance optimization. Good implementation with proper null checks and early return for empty sequences.


596-600: Good defensive programming with window size validation.

The warning for window sizes exceeding max sequence length helps catch configuration issues early.


625-637: Clear explanation of SWA block allocation requirements.

The detailed comment explaining the SWA linear allocation behavior and the temporary resolution for tempAttentionWindow calculation provides good context for future maintainers. The capping to maxSequenceLength is a reasonable temporary solution.


641-645: LGTM! Enhanced logging for KV cache blocks per sequence.

The improved logging now includes window size information, which is helpful for debugging variable window attention configurations.


793-798: Appropriate SWA guard for context block storage.

The check to skip storing context blocks for SWA sequences prevents potential issues with out-of-window blocks being incorrectly reused.


995-1022: Excellent per-sequence ownership tracking in getFreeBlock.

The implementation properly tracks block ownership by sequence and invalidates reuse when blocks are acquired by different sequences. The detailed debug logging helps with troubleshooting reuse issues.


1365-1391: LGTM! Well-implemented adjustBlocksIfNeeded logic.

The method correctly handles both block detachment for SWA and new block allocation at block boundaries. The while loop properly handles multiple OOW blocks in the first generation step.


1679-1684: Good SWA handling in storeNewBlock.

The check to skip storing new blocks for SWA sequences prevents issues with blocks that may go out-of-window.


1739-1796: Comprehensive releaseBlocks implementation with SWA support.

The method properly handles sequence validity checking for SWA and correctly manages the release of blocks while accounting for front blocks that have been removed.


2146-2181: LGTM! Robust detachFrontBlock implementation.

The method correctly handles front block detachment with proper ref counting and includes a check to ensure beam width is 1 (as SWA with beam search is not yet supported).


2213-2219: Good sequence lifecycle management in addSequence.

The method properly initializes sequence storage validity for new sequences, ensuring proper bookkeeping for all window sizes.


2267-2276: Good defensive programming in storeContextBlocks.

The method includes proper error handling with a warning when the sequence is not found, rather than crashing.


2311-2315: LGTM! Proper sequence cleanup in removeSequence.

The method correctly releases the sequence and verifies it's no longer held, with appropriate debug logging.


2693-2699: LGTM! Proper SWA extra block accounting.

The method correctly adds the extra block for SWA when the sequence length exceeds the window size.


2244-2244: isShareLastContextBlock logic is correct — cross-KV should always share the last context block.

WindowBlockManager already shares earlier context blocks; the final block is shared when either isCrossKv() (encoder/cross-attention outputs are the same across beams) or when the input ends exactly on a block boundary (no partial last block). Keep as‑is.


2066-2073: Verify SWA extra-block calculation (numExtraBlocksPerBeam)

Inspected KVCacheManager::getRemainingBlocksToCompletion (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2066–2073). The code sets numExtraBlocksPerBeam = 1 only when:

  • (req.mPromptLen + req.mMaxNewTokens) > windowSize, and
  • ceilDiv(req.mPromptLen + req.mMaxNewTokens) > ceilDiv(req.getNumTokens(0)), and
  • ceilDiv(req.mPromptLen + req.mMaxNewTokens) > numTotalBlocksPerBeam.

Confirm these are the intended semantics. Specifically verify:

  • temporaryAttentionWindow / mSinkBubbleLength interaction with isSlidingWindow and numTotalBlocksPerBeam (should the sliding check consider windowSize + temporaryAttentionWindow?),
  • whether willCrossBlockBoundary should compare against numAllocBlocksPerBeam (allocated blocks) instead of currentSeqlenInBlocks (req.getNumTokens(0)),
  • consistency with kSWAExtraBlock usage in calculateMaxBlockRequirements*.

@eopXD eopXD force-pushed the swa-kv-cache-reuse-oow-detach branch from 3da009f to 45070b1 on September 24, 2025 07:53
@eopXD eopXD changed the title [TLLM-6777][feature] SWA KV cache reuse OOW block detach [TLLM-6777][feature] Support SWA KV cache reuse OOW block detach Sep 24, 2025
@eopXD eopXD force-pushed the swa-kv-cache-reuse-oow-detach branch from 45070b1 to feec60c on September 24, 2025 07:56
@eopXD
Collaborator Author

eopXD commented Sep 24, 2025

@coderabbitai generate docstrings

Contributor

coderabbitai bot commented Sep 24, 2025

✅ Actions performed

Docstrings generation started.

Contributor

coderabbitai bot commented Sep 24, 2025

Caution

An unexpected error occurred while opening a pull request: Reference update failed - https://docs.github.com/rest/git/refs#create-a-reference

@eopXD
Collaborator Author

eopXD commented Sep 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19779 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19779 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #14882 completed with status: 'FAILURE'

@thorjohnsen thorjohnsen self-requested a review September 24, 2025 15:57
This MR is a continuation of NVIDIA#6768. In the previous merge request,
OOW (out-of-window) blocks were detached only when reuse was disabled.
This MR enables the KV cache manager to detach OOW blocks when reuse
is enabled as well.

When reuse is enabled, an OOW block is detached while the manager keeps
track of whether the block gets overwritten by another sequence. If all
of a sequence's blocks are still clean when the sequence ends, its
blocks are stored for reuse.

Signed-off-by: eopXD <[email protected]>
@eopXD eopXD force-pushed the swa-kv-cache-reuse-oow-detach branch from feec60c to 9e2bb25 on September 25, 2025 07:15
@eopXD
Collaborator Author

eopXD commented Sep 25, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #19916 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19916 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14992 completed with status: 'FAILURE'
