
Conversation

eopXD
Collaborator

@eopXD eopXD commented Aug 10, 2025

(Landed) This MR is based upon #6767

Description

This merge request adds more SWA KV cache functionality to the KV cache manager. Before this merge request, the KV cache for sliding window attention (SWA) held only a "window size" number of blocks and reused them in a cyclic manner. With that design we cannot utilize additional GPU memory, which limits the maximum batch size and thus throughput, and we cannot support KV cache reuse.

In this MR, we change that behavior so the manager writes blocks in a linear manner. With linear block writing, as the attention window moves on, out-of-window (OOW) blocks are detached. For now, prioritizing a correct feature first, we directly offload each OOW block from the primary block pool (GPU memory) to the secondary block pool (host memory). We will improve this in the future by delegating the block movement to the eviction policy.
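The shift from cyclic to linear block writing can be sketched as follows. This is an illustrative toy model only, not the actual TensorRT-LLM data structures; the type and member names here are invented for the sketch:

```cpp
#include <cassert>
#include <vector>

// Old behavior: a fixed ring of `windowBlocks` slots, so logical block `b` of
// the sequence always lands in slot `b % windowBlocks` and older cache data is
// overwritten in place.
int cyclicSlot(int blockIdx, int windowBlocks)
{
    return blockIdx % windowBlocks;
}

// New behavior: blocks are appended linearly; once the attention window has
// moved past a block, it becomes out-of-window (OOW) and is detached (offloaded
// from the GPU pool to the host pool) instead of being overwritten.
struct LinearSequence
{
    std::vector<int> blockIds;     // -1 marks a detached (OOW) front block
    int numFrontBlocksRemoved = 0;

    void appendBlock(int blockId)
    {
        blockIds.push_back(blockId);
    }

    // Detach the oldest still-attached block without shrinking the layout.
    void detachFrontBlock()
    {
        blockIds[numFrontBlocksRemoved] = -1;
        ++numFrontBlocksRemoved;
    }
};
```

In the cyclic scheme the slot index wraps, so reuse bookkeeping is impossible; in the linear scheme every block keeps a stable index for the lifetime of the sequence, which is what makes detach and (later) reuse tractable.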

KV cache reuse for SWA is not implemented in this merge request; it will be added in a follow-up merge request.

With blocks written linearly, the maximum number of blocks allocated for a sequence (GenerationRequest) is determined by the specified "max sequence length". The GenerationRequest, which stores the cache block bookkeeping structure, now keeps blocks for "max sequence length" tokens.
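The resulting per-sequence block count is a straightforward ceiling division; a minimal sketch (function names are illustrative, not the codebase's):

```cpp
#include <cassert>

// Ceiling division, as used throughout block-count math.
int ceilDiv(int a, int b)
{
    return (a + b - 1) / b;
}

// With linear writing, a sequence's bookkeeping must hold enough block slots
// to cover "max sequence length" tokens, regardless of the attention window.
int maxBlocksPerSequence(int maxSeqLen, int tokensPerBlock)
{
    return ceilDiv(maxSeqLen, tokensPerBlock);
}
```

For example, a max sequence length of 8192 tokens with 64 tokens per block needs 128 block slots per beam, even though an SWA window may keep far fewer of them resident on the GPU at once.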

Given the above, the main changes are (more context/explanation in the commits):

  • Remove the "cyclic" concept from the KV cache manager; this concept originally guarded block reuse in the KV cache manager.
  • Add a detach mechanism and invoke it from KVCacheManager::addToken. Please note that detach is still guarded off for SWA when reuse is enabled; a follow-up merge request will improve this.
  • Enforce "max sequence length" as a non-optional parameter of KVCacheManager/BlockManager.
  • Give every window-size resource pool an identical proportion of memory.
  • Fix the free-memory calculation in resource_manager.py.
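The identical-proportion bullet above can be sketched as an even split of the primary block budget across the distinct attention window sizes. This is a hypothetical simplification (the function name is invented, and the real manager may distribute remainder blocks differently):

```cpp
#include <cassert>
#include <map>
#include <vector>

// Divide the total primary-block budget evenly across window-size pools.
// Remainder blocks are simply left unassigned in this toy version.
std::map<int, int> splitBlocksEvenly(std::vector<int> const& windowSizes, int totalBlocks)
{
    std::map<int, int> blocksPerWindow;
    int const share = totalBlocks / static_cast<int>(windowSizes.size());
    for (int windowSize : windowSizes)
    {
        blocksPerWindow[windowSize] = share;
    }
    return blocksPerWindow;
}
```

With two window sizes (say 512 and 4096 for a VSWA model) and a budget of 1000 blocks, each pool receives 500 blocks under this scheme.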

Test Coverage

Unit tests

Tests are added under KVCacheManagerTest.cpp. The main tests added to cover the SWA detach/reuse concept are CTgjc8kegnVqvtVbGZfpP5RHLKnRNikArUYFpVHNebEN and KVCacheManagerVariableWindowAttentionWithReuseTest. Although getNeededBlocksOneStep is not currently used, the changes to the number of blocks SWA uses have been reflected in it and tests have been added.

Benchmark

Added test cases for Gemma 3 with the VSWA setting on the H100 platform.

 accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_without_reuse
 accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_reuse
 accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_chunked_prefill_without_reuse
 accuracy/test_llm_api_pytorch.py::TestGemma3_1BInstruct::test_auto_dtype_vswa_chunked_prefill_reuse

Running with GPT-OSS 120B

Batch throughput of 482 using GPT-OSS 120B as the model on GB200.

Summary by CodeRabbit

  • New Features

    • Sliding Window Attention KV-cache support for longer-context generation with stable memory use.
    • Explicit max sequence length configuration across APIs and tests.
    • New KV cache options: partial reuse and copy-on-partial-reuse.
    • Improved compatibility with beam search by removing cyclic KV-cache constraints.
  • Documentation

    • Updated CLI help and docstrings to reflect Sliding Window Attention terminology (removed “cyclic” wording).
  • Tests

    • Added VSWA coverage, including reuse and chunked prefill scenarios, to validate accuracy and behavior.

@eopXD eopXD marked this pull request as draft August 10, 2025 06:58
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🔭 Outside diff range comments (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

17-17: Use required header guard instead of pragma once

Per coding guidelines, headers must use a guard with prefix TRTLLM_. Replace #pragma once with a proper include guard.

-#pragma once
+#ifndef TRTLLM_KVCACHEMANAGER_H
+#define TRTLLM_KVCACHEMANAGER_H

And add the closing guard at the end of the file:

#endif // TRTLLM_KVCACHEMANAGER_H
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

1627-1681: Potential out-of-bounds when allocatedBlocks is empty before getAllSequenceBlocks()

If blockKeys.size() > allocatedBlocks.size() and allocatedBlocks is empty, allocatedBlocks.back() will be undefined. This could happen in edge cases where the sequence did not allocate blocks (e.g., early exit or immediate detach after minimal generation).

Guard against an empty vector:

-        if (blockKeys.size() > allocatedBlocks.size())
+        if (blockKeys.size() > allocatedBlocks.size())
         {
-            // For SWA, get all blocks in the sequence.
-            allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
+            // For SWA, get all blocks in the sequence if we have a tail; otherwise, nothing to store.
+            if (!allocatedBlocks.empty())
+            {
+                allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
+            }
+            else
+            {
+                TLLM_LOG_DEBUG("%s::releaseBlocks - no allocated blocks to store for reuse", mLogPrefix.c_str());
+            }
         }

1913-1964: willCrossWindowBlockBoundary condition is always false (logic bug)

maxSeqlenInBlocks and numTotalBlocksPerBeam are computed from the same expression, so maxSeqlenInBlocks > numTotalBlocksPerBeam can never be true. As a result, numExtraBlocksPerBeam is always 0, defeating the SWA extra-block logic.

Fix by comparing block counts within the window, not the unconstrained sequence:

-    SizeType32 const maxSeqlenInBlocks = tc::ceilDiv(req.mPromptLen + req.mMaxNewTokens, getTokensPerBlock());
-    auto const willCrossBlockBoundary = maxSeqlenInBlocks > currentSeqlenInBlocks;
-    auto const willCrossWindowBlockBoundary = maxSeqlenInBlocks > numTotalBlocksPerBeam;
+    auto const maxSeqlen = std::min(req.mPromptLen + req.mMaxNewTokens, windowSize);
+    SizeType32 const maxSeqlenInBlocks = tc::ceilDiv(maxSeqlen, getTokensPerBlock());
+    auto const willCrossBlockBoundary = maxSeqlenInBlocks > currentSeqlenInBlocks;
+    auto const windowCurrBlocks = tc::ceilDiv(std::min(req.getNumTokens(0), windowSize), getTokensPerBlock());
+    auto const windowMaxBlocks = tc::ceilDiv(std::min(req.mPromptLen + req.mMaxNewTokens, windowSize), getTokensPerBlock());
+    auto const willCrossWindowBlockBoundary = windowMaxBlocks > windowCurrBlocks;

This preserves the intended “cross both block and window boundaries” semantics.

🧹 Nitpick comments (7)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

341-341: Optional: inline the constant to reduce coupling

If BlockManager::isUseOneMoreBlock() is permanently a constant false, consider inlining here for clarity and to decouple this constructor from BlockManager internals. Easy to revert if behavior changes later.

-        auto const isUseOneMoreBlock = kv_cache_manager::BlockManager::isUseOneMoreBlock();
+        constexpr bool isUseOneMoreBlock = false;
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

77-80: Avoid magic -1; define a typed invalid ID sentinel

Counting valid blocks by comparing to a raw -1 is brittle and type-unsafe. Define a typed sentinel once and reuse it here.

Apply within the helper:

-SizeType32 getNumValidBlocks(std::vector<SizeType32> const& blockIds)
-{
-    return std::count_if(blockIds.begin(), blockIds.end(), [](SizeType32 id) { return id != -1; });
-}
+SizeType32 getNumValidBlocks(std::vector<SizeType32> const& blockIds)
+{
+    constexpr auto kInvalidBlockId = static_cast<SizeType32>(-1);
+    return std::count_if(blockIds.begin(), blockIds.end(), [&](SizeType32 id) { return id != kInvalidBlockId; });
+}

If feasible longer-term, consider exposing a dedicated invalid sentinel for block IDs (aligned with KVCacheBlock::IdType) to avoid relying on implicit -1 semantics.


2690-2690: Remove stray debug print in tests

This std::cerr debug output will add noise to test logs. Please drop it or guard behind a verbose flag.

-    std::cerr << "numBlocksPerSeq: " << numBlocksPerSeq << std::endl;
+    // (optional) enable only for local debugging
+    // std::cerr << "numBlocksPerSeq: " << numBlocksPerSeq << std::endl;
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)

56-59: Clarify rationale and scope of kSWAExtraBlock

The extra buffer block for SWA is reasonable; add a short note tying it to worst-case detachment/attachment during window slide and how it figures into capacity calculations (e.g., calculateMaxBlockRequirements*).
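One way to picture the requested rationale: the spare block absorbs the transient state where a new block has been attached but the out-of-window block is not yet detached, and capacity math must reserve it. The sketch below is illustrative only; the names parallel, but are not, the real calculateMaxBlockRequirements* functions:

```cpp
#include <algorithm>
#include <cassert>

// One spare block per beam for the attach-before-detach transient during a
// window slide (illustrative constant, mirroring kSWAExtraBlock's purpose).
constexpr int kSWAExtraBlock = 1;

// Tokens of usable capacity per beam after reserving the spare block, clamped
// so that a tiny block budget cannot underflow to a negative block count.
int capacityTokensPerBeam(int blockCapacity, int beamWidth, int tokensPerBlock)
{
    int const perBeamBlocks = std::max(0, blockCapacity / beamWidth - kSWAExtraBlock);
    return perBeamBlocks * tokensPerBlock;
}
```

The clamp matters for the low-capacity edge case flagged later in this review: with a budget of one block and two beams, the unclamped expression would go negative.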


102-110: Make toString() const

This printer doesn’t mutate state and should be marked const.

-    std::string toString()
+    std::string toString() const

1123-1127: isUseOneMoreBlock permanently false—tidy up or document

If the extra-block decision is now fully handled elsewhere (via kSWAExtraBlock), consider removing this method entirely or documenting its deprecation to avoid confusion.


2-2: Update copyright year

Guidelines require current year in OSS headers. Consider updating 2024 → 2025 if this file is modified in 2025.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee19ca5 and d751a9b.

📒 Files selected for processing (4)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (15 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (30 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (62 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (5)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (15)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

341-341: Signature update to BlockManager::isUseOneMoreBlock() looks correct

Call site updated to the new zero-arg static method. No other logic change here; the value is still passed through to mRewindInputs.


341-349: Confirm isUseOneMoreBlock Always False & Threading to Kernel

  • BlockManager::isUseOneMoreBlock() is now a zero-arg static that unconditionally returns false (kvCacheManager.h:1123–1127).
  • No legacy call sites still passing arguments were found.
  • The flag is still propagated via mRewindInputs.isUseOneMoreBlock into invokeUpdateKVBlockArrayDraftTokenLocation (trtGptModelInflightBatching.cpp:2472).

Since “one-more-block” mode is effectively disabled, please review the index math in both rewindKVCacheBlocks() and invokeUpdateKVBlockArrayDraftTokenLocation() to ensure that removing the cyclic layout hasn’t introduced any off-by-one errors.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5)

342-343: Constructor initializes mNumFrontBlocksRemoved—good

Tracking front-detached blocks explicitly is correct under SWA.


576-586: New detach/addBlockIfNeeded/cacheNewBlockOffset APIs—good for SWA

Publicly exposing block detachment and conditional allocation aligns with linearized reuse. Ensure implementations:

  • Call clearCacheBlocks/reset counters appropriately when sequences are fully reshaped.
  • Keep offset tables consistent across pools when beams share/duplicate blocks.

If needed, I can audit the .cpp implementations for these invariants.


597-598: Free-block query by cache level—nice addition

The cache-level overload eliminates ambiguity between primary/secondary pools.


936-951: Propagating cache-level awareness to per-window queries—good

This unblocks precise telemetry and simplifies test assertions.


1144-1149: Expose sequence offset caching and conditional allocation at BlockManager

Good API surface for KVCacheManager to drive SWA lifecycle. Ensure thread-safety around mSequences/mWindowManagers when called from addToken.

Happy to follow through with a quick lock-audit if you want.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)

578-589: Constructor metadata computation switch to maxTokensPerSeq looks good

Using maxSequenceLength.value_or(0) to derive maxTokensPerSeq and maxBlocksPerSeq is consistent with the header changes and avoids stale terminology.


1295-1313: Verify detach threshold logic for SWA

The minTokensForBlockDetach = windowSize + tokensPerBlock threshold intentionally delays detachment to compensate for loading the full context instead of just the last window. This is called out as a hot-fix in comments. Please confirm this behavior matches test expectations for:

  • context length slightly above window size,
  • exact window boundary,
  • large contexts (multiple window slides).

If mismatches arise, consider computing detachment strictly based on tokens within the window (e.g., based on the number of blocks representing min(windowSize, numTokens)), and adjusting context loading accordingly.
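The threshold described above can be condensed into a small predicate. This is a hedged sketch of the stated rule (the function name and exact internals are assumptions, not the TensorRT-LLM implementation):

```cpp
#include <cassert>

// A front block becomes detachable only once the sequence holds at least one
// full block's worth of tokens beyond the attention window, compensating for
// the full-context load described in the review comment.
bool shouldDetachFrontBlock(int numTokens, int windowSize, int tokensPerBlock)
{
    int const minTokensForBlockDetach = windowSize + tokensPerBlock;
    return numTokens >= minTokensForBlockDetach;
}
```

Under this rule, with a 4096-token window and 64-token blocks, a sequence at exactly the window boundary (4096 tokens) or slightly above it (4100 tokens) detaches nothing; detachment begins at 4160 tokens.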


1408-1473: storeBlocks now returns number of blocks stored — LGTM

Counting and returning numBlocksStoredForReuse is correct and aligns with the updated API. Event emission and hash updates are preserved.


1523-1526: Accessor for free blocks — LGTM

Forwarding getNumFreeBlocks to the eviction policy is straightforward.


2004-2045: Detach path still requires full offset re-cache

Even with the change above, this block’s local logic removes the front block and erases from allocatedBlocks. Ensure offsets are re-cached for all blocks post-detach (not just the newest). The suggested change in addBlockIfNeeded() addresses this; otherwise, replicate the full re-cache here.

Would you prefer re-caching all offsets in addBlockIfNeeded() (single place) or also redundantly here to be extra safe?


2618-2632: Guard against negative window result when subtracting kSWAExtraBlock

When outputBlockRequirements <= blockCapacity is false, ((blockCapacity / beamWidth) - kSWAExtraBlock) can become negative for small capacities, underflowing SizeType32.

Clamp to zero before multiplying:

-        return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock;
+        auto const perBeamBlocks = std::max<SizeType32>(0, (blockCapacity / beamWidth) - kSWAExtraBlock);
+        return perBeamBlocks * tokensPerBlock;

Please confirm unit tests cover low-capacity edge cases.


1966-1985: cacheSequenceBlockOffsets() addition — LGTM

Full re-cache helper is correct and will be useful especially upon detach. It’s fine to call this after addSequence and after detach scenarios.


2158-2165: storeContextBlocks gated by reuse and non-dummy — LGTM

This avoids unnecessary hash/trie inserts when reuse is disabled or for dummy requests.

@eopXD eopXD changed the title [TRTLLM-6341][feature][kv cache manager] Support SWA KV cache reuse [TRTLLM-6341][feature] SWA KV cache reuse Aug 11, 2025
@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from d751a9b to 93e91ab Compare August 11, 2025 16:28
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (6)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

415-421: clearCacheBlocks should reset front-detach counter

After clearing per-beam block IDs, mNumFrontBlocksRemoved must be reset; otherwise subsequent removeFrontBlock calls will index past the beginning.

 void clearCacheBlocks(SizeType32 windowSize)
 {
     for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
     {
         beamBlockIds.clear();
     }
+    mNumFrontBlocksRemoved = 0;
 }

423-432: Bounds-check and handle empty beams when removing front block

If removeFrontBlock is called more times than assigned blocks, this will write out of bounds. Also, keep logic robust for empty beams.

 void removeFrontBlock(SizeType32 windowSize)
 {
     for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
     {
-        // Does not actually remove from mCacheBlockIds like removeLastBlock
-        // Id is set to -1 instead.
-        beamBlockIds[mNumFrontBlocksRemoved] = -1;
+        if (mNumFrontBlocksRemoved < static_cast<SizeType32>(beamBlockIds.size()))
+        {
+            // Mark as invalid without shrinking; SWA keeps layout length stable.
+            beamBlockIds[mNumFrontBlocksRemoved] = -1;
+        }
+        // else: nothing to mark for this beam (already all detached)
     }
     ++mNumFrontBlocksRemoved;
 }
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

159-186: Prevent size_t underflow in getAllSequenceBlocks()

currentIndex is initialized with blockCount - 1 even when blockCount can be 0. While the loop short-circuits, this underflow is undesirable and may trigger static-analysis warnings.

 std::vector<BlockPtr> getAllSequenceBlocks(BlockPtr lastBlock)
 {
     // First count the number of blocks to pre-allocate the vector
     auto currentBlock = lastBlock;
     size_t blockCount = 0;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
         blockCount++;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }
 
+    if (blockCount == 0)
+    {
+        return {};
+    }
     // Create and pre-allocate the vector with the correct size
     std::vector<BlockPtr> sequenceBlocks(blockCount);
 
     // Now traverse backwards and fill from the end
     currentBlock = lastBlock;
-    size_t currentIndex = blockCount - 1;
+    size_t currentIndex = blockCount;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
-        sequenceBlocks[currentIndex--] = currentBlock;
+        sequenceBlocks[--currentIndex] = currentBlock;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }
 
     return sequenceBlocks;
 }

1241-1282: Page-table offsets become stale after detaching the front block

When shouldDetachBlock is true, removeFrontBlock shifts indices of every remaining block. Calling cacheNewBlockOffset() only updates the last block and leaves indices [0..n-2] stale, which will corrupt page-table reads.

Update offsets for the entire sequence after detaching:

 void WindowBlockManager::addBlockIfNeeded(GenerationRequest& sequence, bool isEnableBlockReuse)
 {
     // ... existing code ...
-    if (shouldAllocateBlock || shouldDetachBlock)
-    {
-        cacheNewBlockOffset(sequence);
-    }
+    // Update KV cache indices
+    if (shouldAllocateBlock && !shouldDetachBlock)
+    {
+        // Only appended a block; update just the last entry
+        cacheNewBlockOffset(sequence);
+    }
+    if (shouldDetachBlock)
+    {
+        // Recompute all offsets due to front removal shifting indices
+        auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+        auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+        auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+        auto const& offsetsShape = cacheBlocksTensor.getShape();
+        for (SizeType32 beamIdx = 0; beamIdx < sequence.getBeamWidth(); ++beamIdx)
+        {
+            auto const& beamCacheBlock = cacheBlocks[beamIdx];
+            for (SizeType32 blockIdx = 0; blockIdx < static_cast<SizeType32>(beamCacheBlock.size()); ++blockIdx)
+            {
+                setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, beamCacheBlock[blockIdx]);
+            }
+        }
+    }
 }

2001-2009: Remove duplicate variable declaration in addToken()

sequence is declared twice, causing a compile error.

 void KVCacheManager::addToken(RequestIdType requestId)
 {
     TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0,
         "[kv cache manager] streamLLM is not supported at the moment");
-    auto& sequence = getSequence(requestId);
+    auto& sequence = getSequence(requestId);
     TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[kv cache manager] addToken does not support beamWidth > 1");
     sequence.addNewTokens(1);
     mBlockManager.addBlockIfNeeded(sequence, mEnableBlockReuse);
 }

2399-2418: Remove duplicate variable declaration in removeToken()

sequence is declared twice, causing a compile error.

 void KVCacheManager::removeToken(RequestIdType requestId)
 {
     TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0,
         "[kv cache manager] streamLLM is not supported at the moment");
-    auto& sequence = getSequence(requestId);
+    auto& sequence = getSequence(requestId);
     if (sequence.getNumTokens() == 0)
     {
         return;
     }
     TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[kv cache manager] removeToken does not support beamWidth > 1");
     sequence.removeTokens(1);
     // ... rest of the function
 }
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

983-989: Consider implementing a smarter eviction policy for offloading

The comment indicates that the current default behavior offloads out-of-window blocks to secondary memory without considering their usefulness. This could lead to unnecessary memory traffic. Consider implementing an eviction policy that evaluates block reuse potential before offloading.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d751a9b and 93e91ab.

📒 Files selected for processing (8)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (15 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (25 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (14 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (65 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2 hunks)
✅ Files skipped from review due to trivial changes (2)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
👮 Files not reviewed due to content moderation or server errors (3)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (4)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
  • KVCacheManager (1661-1672)
  • KVCacheManager (1674-1688)
  • KVCacheManager (1690-1720)
  • KVCacheManager (1722-1735)
  • calculateMaxBlockRequirementsPerBeam (2492-2504)
  • calculateMaxBlockRequirementsPerBeam (2492-2493)
  • calculateMaxBlockRequirements (2506-2538)
  • calculateMaxBlockRequirements (2506-2507)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
  • createKvCacheManager (624-722)
  • createKvCacheManager (624-626)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (35)
  • nodiscard (1010-1017)
  • nodiscard (1482-1485)
  • nodiscard (2540-2559)
  • addSequence (1188-1192)
  • addSequence (1188-1189)
  • addSequence (1196-1231)
  • addSequence (1196-1197)
  • addSequence (1286-1290)
  • addSequence (1286-1287)
  • addSequence (1294-1310)
  • addSequence (1294-1295)
  • addSequence (2018-2079)
  • addSequence (2018-2019)
  • releaseBlocks (1492-1500)
  • releaseBlocks (1492-1492)
  • releaseBlocks (1586-1637)
  • releaseBlocks (1586-1586)
  • schedulingReleaseBlocks (1639-1645)
  • schedulingReleaseBlocks (1639-1639)
  • schedulingReleaseBlocks (1647-1659)
  • schedulingReleaseBlocks (1647-1647)
  • releaseLastBlock (1458-1461)
  • releaseLastBlock (1458-1458)
  • releaseLastBlock (1463-1480)
  • releaseLastBlock (1463-1463)
  • detachBlock (1960-1999)
  • detachBlock (1960-1960)
  • cacheNewBlockOffset (1942-1958)
  • cacheNewBlockOffset (1942-1942)
  • getNumFreeBlocks (1482-1482)
  • storeBlocks (1370-1433)
  • storeBlocks (1370-1371)
  • KVCacheBlock (257-270)
  • cacheSequenceBlockOffsets (1922-1940)
  • cacheSequenceBlockOffsets (1922-1922)
cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp (2)
  • getNumFreeBlocks (179-182)
  • getNumFreeBlocks (179-179)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17)
  • storeBlocks (910-914)
  • nodiscard (215-1154)
  • nodiscard (586-594)
  • nodiscard (608-616)
  • nodiscard (618-626)
  • nodiscard (628-631)
  • nodiscard (638-641)
  • nodiscard (643-653)
  • nodiscard (664-667)
  • nodiscard (669-672)
  • nodiscard (684-687)
  • nodiscard (692-695)
  • nodiscard (736-747)
  • nodiscard (938-941)
  • nodiscard (943-956)
  • nodiscard (958-971)
  • nodiscard (973-979)
cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp (2)
  • getNumFreeBlocks (179-182)
  • getNumFreeBlocks (179-179)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1241-1282: I need to inspect the WindowBlockManager class to verify whether its block pools are protected by any mutex during addBlockIfNeeded.

@eopXD eopXD self-assigned this Aug 12, 2025
@eopXD eopXD added the KV-Cache Management kv-cache management for efficient LLM inference label Aug 12, 2025
@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from 93e91ab to b8fa5b6 Compare August 13, 2025 09:03
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1991-2008: Compile error: misuse of mWindowSize and wrong setOffsets overload

This BlockManager method references mWindowSize (not a member of BlockManager) and calls setOffsets without the windowSize parameter. Use the provided windowSize argument and route to the WindowBlockManager overload.

Apply:

-void BlockManager::updateLastCacheBlockOffsets(GenerationRequest& sequence, SizeType32 windowSize)
-{
-    auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
-    auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+void BlockManager::updateLastCacheBlockOffsets(GenerationRequest& sequence, SizeType32 windowSize)
+{
+    auto const& cacheBlocks = sequence.getCacheBlockIds(windowSize);
+    auto& cacheBlocksTensor = sequence.getCacheBlockIndices(windowSize);
     auto const beamWidth = sequence.getBeamWidth();

     auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
     auto const& offsetsShape = cacheBlocksTensor.getShape();

     for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx)
     {
         auto const& beamCacheBlock = cacheBlocks[beamIdx];
         auto const blockId = beamCacheBlock.back();
         auto const blockIdx = static_cast<SizeType32>(beamCacheBlock.size() - 1);
-        setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+        mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
     }
 }
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)

52-59: Fix: SizeType32 alias used before declaration (compile error).

kPrimaryLevel/kSecondaryLevel/kSWAExtraBlock are defined before SizeType32 is introduced (using alias appears at Line 65). This will not compile.

Apply this diff to use the fully qualified type in these constant definitions:

-static constexpr SizeType32 kPrimaryLevel = 0;
+static constexpr tensorrt_llm::runtime::SizeType32 kPrimaryLevel = 0;

-static constexpr SizeType32 kSecondaryLevel = 1;
+static constexpr tensorrt_llm::runtime::SizeType32 kSecondaryLevel = 1;

-// Extra block buffer allocated for SWA to be able to always keep "window size"
-// tokens held in the blocks.
-static constexpr SizeType32 kSWAExtraBlock = 1;
+// Extra block buffer allocated for SWA to be able to always keep "window size"
+// tokens held in the blocks.
+static constexpr tensorrt_llm::runtime::SizeType32 kSWAExtraBlock = 1;

621-624: Correct allocated-blocks computation under multi-level pools.

getNumAllocatedBlocks() currently uses getMaxNumBlocks() - getNumFreeBlocks(), but getNumFreeBlocks() now defaults to primary-level only. This makes the computation inconsistent with total max blocks when secondary is present. Compute primary usage explicitly.

Apply this diff:

-    [[nodiscard]] SizeType32 getNumAllocatedBlocks() const noexcept
-    {
-        return getMaxNumBlocks() - getNumFreeBlocks();
-    }
+    [[nodiscard]] SizeType32 getNumAllocatedBlocks() const noexcept
+    {
+        // Report primary-level allocation to match free-block reporting defaults.
+        return getNumPrimaryBlocks() - getNumFreeBlocks(kPrimaryLevel);
+    }

1441-1445: Align “used blocks” with primary-level semantics.

getUsedNumBlocks() returns BlockManager::getNumAllocatedBlocks(), which currently subtracts primary free blocks from a possibly combined (primary+secondary) maximum, leading to inconsistent stats. Compute primary used blocks directly.

Apply this diff:

-    [[nodiscard]] SizeType32 getUsedNumBlocks() const override
-    {
-        return mBlockManager.getNumAllocatedBlocks();
-    }
+    [[nodiscard]] SizeType32 getUsedNumBlocks() const override
+    {
+        return mBlockManager.getNumPrimaryBlocks() - mBlockManager.getNumFreeBlocks(kPrimaryLevel);
+    }

1481-1497: Fix KvCacheStats consistency (primary-level vs. total).

KvCacheStats comments say maxNumBlocks refers to the primary pool. Ensure all reported metrics are aligned to the primary level to avoid confusing stats when a secondary pool exists.

Apply this diff:

-        kvCacheStats.maxNumBlocks = getMaxNumBlocks();
-        kvCacheStats.freeNumBlocks = getNumFreeBlocks();
-        kvCacheStats.usedNumBlocks = getUsedNumBlocks();
+        kvCacheStats.maxNumBlocks = mBlockManager.getNumPrimaryBlocks();
+        kvCacheStats.freeNumBlocks = mBlockManager.getNumFreeBlocks(kPrimaryLevel);
+        kvCacheStats.usedNumBlocks = kvCacheStats.maxNumBlocks - kvCacheStats.freeNumBlocks;

The remaining fields can stay unchanged.

♻️ Duplicate comments (7)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

2864-2873: Stop inferring “allocated primary blocks” by subtracting secondary capacity

Computing allocated primary blocks as getNumAllocatedBlocks() - blocksInSecondaryPool assumes that the aggregated count includes secondary and that the secondary pool is “full,” which couples tests to BlockManager internals.

Expose an API that reports allocated-by-level to make tests precise and resilient. For example:

  • BlockManager::getNumAllocatedBlocks(SizeType32 cacheLevel)
  • WindowBlockManager::getNumAllocatedBlocks(SizeType32 cacheLevel)

Then use getNumAllocatedBlocks(kPrimaryLevel) here.

If you want, I can draft the change across BlockManager/WindowBlockManager and update affected tests.


3030-3033: Same issue: don’t derive primary allocation counts via subtraction

This repeats the subtract-secondary pattern. Prefer a direct API: getNumAllocatedBlocks(kPrimaryLevel).

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

159-186: Fix potential size_t underflow in getAllSequenceBlocks

When there are zero blocks, initializing currentIndex with blockCount - 1 underflows. Guard zero and use pre-decrement indexing.

Apply:

 std::vector<BlockPtr> getAllSequenceBlocks(BlockPtr lastBlock)
 {
     // First count the number of blocks to pre-allocate the vector
     auto currentBlock = lastBlock;
     size_t blockCount = 0;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
         blockCount++;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }

-    // Create and pre-allocate the vector with the correct size
-    std::vector<BlockPtr> sequenceBlocks(blockCount);
+    // Create and pre-allocate the vector with the correct size
+    if (blockCount == 0)
+    {
+        return {};
+    }
+    std::vector<BlockPtr> sequenceBlocks(blockCount);

     // Now traverse backwards and fill from the end
     currentBlock = lastBlock;
-    size_t currentIndex = blockCount - 1;
+    size_t currentIndex = blockCount;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
-        sequenceBlocks[currentIndex--] = currentBlock;
+        sequenceBlocks[--currentIndex] = currentBlock;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }

     return sequenceBlocks;
 }

1241-1282: Undefined cacheNewBlockOffset; update KV offsets correctly after detach/allocate

cacheNewBlockOffset(sequence) is not defined and will not compile. Also, when detaching the front block under SWA, page-table offsets shift for all remaining blocks, so updating only the last block is insufficient.

Update offsets in-place here:

  • If shouldDetachBlock: recompute offsets for all blocks/beams.
  • Else if shouldAllocateBlock: update just the last block.

Apply:

-    if (shouldAllocateBlock || shouldDetachBlock)
-    {
-        cacheNewBlockOffset(sequence);
-    }
+    // Refresh KV cache indices
+    if (shouldDetachBlock)
+    {
+        // Detaching the front block shifts all indices; recompute all offsets.
+        auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+        auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+        auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+        auto const& offsetsShape = cacheBlocksTensor.getShape();
+        auto const beamWidth = sequence.getBeamWidth();
+        for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx)
+        {
+            auto const& beamCacheBlock = cacheBlocks[beamIdx];
+            for (SizeType32 blockIdx = 0; blockIdx < static_cast<SizeType32>(beamCacheBlock.size()); ++blockIdx)
+            {
+                setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, beamCacheBlock[blockIdx]);
+            }
+        }
+    }
+    else if (shouldAllocateBlock)
+    {
+        // Only the last block appended needs offsets.
+        auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+        auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+        auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+        auto const& offsetsShape = cacheBlocksTensor.getShape();
+        auto const beamWidth = sequence.getBeamWidth();
+        for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx)
+        {
+            auto const& beamCacheBlock = cacheBlocks[beamIdx];
+            if (beamCacheBlock.empty())
+            {
+                continue;
+            }
+            auto const blockIdx = static_cast<SizeType32>(beamCacheBlock.size() - 1);
+            auto const blockId = beamCacheBlock.back();
+            setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+        }
+    }

This also addresses the previously raised “stale offsets after detaching the front block” issue.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)

421-427: Reset front-detach counter when clearing cache blocks.

After clearing per-beam block IDs, mNumFrontBlocksRemoved must be reset; otherwise subsequent removeFrontBlock calls will index past the beginning.

Apply this diff:

     void clearCacheBlocks(SizeType32 windowSize)
     {
         for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
         {
             beamBlockIds.clear();
         }
+        mNumFrontBlocksRemoved = 0;
     }

1109-1113: Clarify intent or make isUseOneMoreBlock configurable.

This now always returns false with only a placeholder comment. It influences block budgeting hooks; please document why it’s disabled or wire it to a config if needed. If truly obsolete, remove and clean up call sites.

Would you like me to draft a small config flag (env/ctor) plumbed into BlockManager to control this?


429-438: Bounds-check when marking front block as removed; avoid magic -1.

If removeFrontBlock is called more times than there are assigned blocks, this writes out-of-bounds. Also, prefer a named invalid-id sentinel over magic -1.

Apply this diff:

-    void removeFrontBlock(SizeType32 windowSize)
-    {
-        for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
-        {
-            // Does not actually remove from mCacheBlockIds like removeLastBlock
-            // Id is set to -1 instead.
-            beamBlockIds[mNumFrontBlocksRemoved] = -1;
-        }
-        ++mNumFrontBlocksRemoved;
-    }
+    void removeFrontBlock(SizeType32 windowSize)
+    {
+        auto& perBeam = mCacheBlockIds.at(windowSize);
+        for (auto& beamBlockIds : perBeam)
+        {
+            if (mNumFrontBlocksRemoved < static_cast<SizeType32>(beamBlockIds.size()))
+            {
+                // Mark as invalid without shrinking; SWA keeps layout length stable.
+                beamBlockIds[mNumFrontBlocksRemoved] = KVCacheBlock::kInvalidBlockId;
+            }
+            // else: nothing to mark for this beam (already all detached)
+        }
+        ++mNumFrontBlocksRemoved;
+    }

Additionally, define a dedicated invalid id sentinel to avoid reusing other sentinels:

// Add inside class KVCacheBlock public section (near kCachedBlocksRootId)
static constexpr IdType kInvalidBlockId = std::numeric_limits<IdType>::min();
🧹 Nitpick comments (15)
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2)

751-753: Restoring streaming-param coverage via GTEST_SKIP instead of removing the parameter

Dropping streaming from the parameterization reduces coverage and can mask regressions. Prefer to keep the parameter and skip at runtime until streamLLM is supported.

Apply this diff to restore the parameter:

-        // TODO: Support and add coverage for streamLLM
-        testing::Values(false),
+        // TODO: Support and add coverage for streamLLM
+        testing::Values(false, true),

Additionally, add an early guard in the test body to avoid failures until support lands (outside the selected range; place near the top of createResponse()):

// Early skip for streaming path until streamLLM is supported in tests.
if (streaming) {
    GTEST_SKIP() << "TODO(TRTLLM-XXXX): enable streaming coverage when streamLLM is supported.";
}

513-515: Fix typo: beamWdith → beamWidth

Minor naming nit for readability and consistency.

Apply this diff:

-    auto const beamWdith = std::get<3>(info.param);
+    auto const beamWidth = std::get<3>(info.param);
...
-    name += "Bw" + std::to_string(beamWdith);
+    name += "Bw" + std::to_string(beamWidth);

Also applies to: 529-529

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (3)

373-407: Remove std::cerr debug prints from unit tests

These ad-hoc prints add noisy output and can flake CI logs. Prefer GTest SCOPED_TRACE or dedicated logging macros, or just remove them.

Apply this diff to remove the debug prints:

-                std::cerr << "isContextInitState, itCount: " << itCount << std::endl;
...
-                    std::cerr << "isFirstContextChunk, itCount: " << itCount << std::endl;
...
-                    std::cerr << "addNewTokens, itCount: " << itCount << std::endl;
...
-                std::cerr << "generation phase, itCount: " << itCount << std::endl;
-                std::cerr << "add token for kvCacheManager" << std::endl;
...
-                    std::cerr << "add token for crossKvCacheManager" << std::endl;
...
-                std::cerr << "add token for llmReq" << std::endl;
...
-                std::cerr << "GENERATION_COMPLETE" << std::endl;

If you want structured context on failure, wrap specific expectations with SCOPED_TRACE instead.

Also applies to: 413-422, 426-427, 413-427


465-466: Consolidate repeated TODOs for sinkTokenLen > 0 coverage and track with an issue

Multiple TODOs across tests signal temporarily reduced coverage. Consolidate this into a single fixture-level toggle or a parameterized/typed test to avoid duplication, and link to a tracking issue (e.g., TRTLLM-XXXX).

Consider:

  • A single constexpr flag or environment-driven toggle in the test fixture.
  • A typed test with sinkTokenLen as a parameter; temporarily constrain its values to {0} and re-enable {>0} later.
  • Add a single comment at the fixture level with the tracking ID instead of repeating per-test.

Also applies to: 519-520, 609-610, 719-721, 845-852, 1101-1103, 1195-1196, 1289-1291, 1510-1511, 1529-1531, 1580-1582


845-852: Document removal of configurations and add a re-enable plan

You’ve removed specific configurations (e.g., (kMAX_UTILIZATION, 4, 125) and (kGUARANTEED_NO_EVICT, 1, 160)). Please add a brief comment with a tracking ticket and rationale to ease re-enabling later.

Example:

// TODO(TRTLLM-XXXX): Re-enable sinkTokenLen>0 cases after SWA reuse path supports bubbles; 
// removed temporarily to stabilize CI (see PR 6768).
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

523-526: Float equality check is brittle; use tolerance

TLLM_CHECK(total == 1.0f) risks false negatives due to float rounding. Prefer a tolerance, e.g., fabs(total - 1.0f) < 1e-6f.

I can submit a small patch if you want this changed globally in this helper.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (9)

56-59: Nit: Constant naming style.

Per guidelines, constants use kPREFIXED_UPPER_SNAKE. Consider renaming kSWAExtraBlock to kSWA_EXTRA_BLOCK for consistency. This would require touching all references (e.g., .cpp sites using this symbol), so it can be a follow-up cleanup.


102-109: Make WindowSizeMetadata::toString() const.

Method does not mutate state and should be marked const.

Apply this diff:

-    std::string toString()
+    std::string toString() const

347-349: Avoid needless copy in structured binding.

Use const reference to avoid copying metadata on each iteration.

Apply this diff:

-        for (auto const [windowSize, metadata] : windowSizeToMetadata)
+        for (auto const& [windowSize, metadata] : windowSizeToMetadata)

748-755: Remove redundant clamp and fix type safety.

The guard ensures inputs->maxInputLen > mWindowSize, so window is non-negative. The std::max(window, 0) mixes signed/unsigned and is unnecessary.

Apply this diff:

-            window = std::max(window, 0); // clamp negative values to 0
+            // window is non-negative due to the guard above.

Alternatively, if you prefer an explicit clamp, use:

  • std::max(window, static_cast<SizeType32>(0))

734-736: Propagate storeBlocks() return value to callers.

WindowBlockManager::storeBlocks now returns the number of actual blocks stored. BlockManager::storeBlocks discards this. Consider adding a returning variant in BlockManager (and/or KVCacheManager) to preserve the signal for telemetry or decision-making.

One option:

-    void storeBlocks(std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds,
-        SizeType32 windowSize)
-    {
-        mWindowBlockManagers.at(windowSize).storeBlocks(blockKeys, blockIds);
-    }
+    SizeType32 storeBlocks(std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds,
+        SizeType32 windowSize)
+    {
+        return mWindowBlockManagers.at(windowSize).storeBlocks(blockKeys, blockIds);
+    }

Update call sites accordingly.


1391-1394: Typo: copyOnpartialReuse → copyOnPartialReuse.

Parameter naming should be consistent (lowerCamelCase) and match similar usage elsewhere (e.g., WindowBlockManager). Please fix the capitalization.

Apply this diff:

-        std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
-        bool copyOnpartialReuse = true);
+        std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
+        bool copyOnPartialReuse = true);

1401-1404: Typo: copyOnpartialReuse → copyOnPartialReuse (constructor overload).

Apply this diff:

-        std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
-        bool copyOnpartialReuse = true);
+        std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
+        bool copyOnPartialReuse = true);

1411-1414: Typo: copyOnpartialReuse → copyOnPartialReuse (constructor overload).

Apply this diff:

-        std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
-        bool copyOnpartialReuse = true);
+        std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
+        bool copyOnPartialReuse = true);

1419-1422: Typo: copyOnpartialReuse → copyOnPartialReuse (constructor overload).

Apply this diff:

-        bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+        bool enablePartialReuse = true, bool copyOnPartialReuse = true);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93e91ab and b8fa5b6.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (15 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (25 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (17 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (64 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILENAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (4)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • KVCacheManager (1711-1722)
  • KVCacheManager (1724-1738)
  • KVCacheManager (1740-1772)
  • KVCacheManager (1774-1787)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • KVCacheManager (140-930)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
  • createKvCacheManager (624-722)
  • createKvCacheManager (624-626)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (26)
  • addSequence (1188-1192)
  • addSequence (1188-1189)
  • addSequence (1196-1231)
  • addSequence (1196-1197)
  • addSequence (1286-1290)
  • addSequence (1286-1287)
  • addSequence (1294-1310)
  • addSequence (1294-1295)
  • addSequence (2107-2165)
  • addSequence (2107-2108)
  • detachBlock (2026-2065)
  • detachBlock (2026-2026)
  • addBlockIfNeeded (1233-1239)
  • addBlockIfNeeded (1233-1233)
  • addBlockIfNeeded (1241-1282)
  • addBlockIfNeeded (1241-1241)
  • getNumFreeBlocks (1532-1532)
  • storeBlocks (1370-1433)
  • storeBlocks (1370-1371)
  • KVCacheBlock (257-270)
  • updateSequenceCacheBlockOffsets (1971-1989)
  • updateSequenceCacheBlockOffsets (1971-1971)
  • updateLastCacheBlockOffsets (1991-2007)
  • updateLastCacheBlockOffsets (1991-1991)
  • updateCacheBlockOffsetsAtIdx (2009-2024)
  • updateCacheBlockOffsetsAtIdx (2009-2009)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
  • kvCacheManager (2969-2971)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (16)
  • nodiscard (215-1175)
  • nodiscard (594-602)
  • nodiscard (616-624)
  • nodiscard (626-634)
  • nodiscard (636-639)
  • nodiscard (646-649)
  • nodiscard (651-661)
  • nodiscard (672-675)
  • nodiscard (677-680)
  • nodiscard (692-695)
  • nodiscard (700-703)
  • nodiscard (744-755)
  • nodiscard (953-956)
  • nodiscard (958-971)
  • nodiscard (973-986)
  • nodiscard (988-994)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

341-341: Stubbed isUseOneMoreBlock always returns false – please confirm intent

The new parameterless isUseOneMoreBlock() (cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:1109–1113) is a stub that unconditionally returns false, so there’s no longer any per-request logic (e.g., attention window, sequence length, beam width) influencing this decision.
Please verify that always disabling the “one more block” path is intended for all models and request scenarios. If any runtime variability is required, update this method to compute the correct value based on the appropriate parameters.

Key locations to review:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:341
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:1109–1113
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

136-139: Constructor signature verified across call sites

All KVCacheManager instantiations—including the one in capacitySchedulerTest.cpp (lines 136–139)—now match the updated overload (with the added maxSequenceLength, enableBlockReuse, onboardBlocks, and cacheType parameters in the correct order). No mismatches were found in tests or code.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

77-80: Helper correctly counts valid block IDs

The sentinel-aware helper is concise and used consistently. Return type fits the test domain.


4708-4709: Ensure allocatePools default argument exists (or pass explicit bool)

Tests call kvCacheManager->allocatePools() without an argument, but KVCacheManager::allocatePools(bool useUvm) in C++ requires a bool unless a default is declared in the header. Verify that the header provides a default (e.g., useUvm = false) or pass false explicitly here.

Would you like me to sweep the headers and update calls to consistently pass false where missing?

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

2067-2097: SWA: addToken path updates offsets via updateLastCacheBlockOffsets

Once updateLastCacheBlockOffsets is fixed (see earlier comment), this full-attention path is fine. For completeness, consider adding an unconditional offsets refresh after the SWA path too, in case future addBlockIfNeeded implementations stop doing in-place updates.

Would you like me to wire a conservative BlockManager::updateSequenceCacheBlockOffsets call for SWA here as well?

@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from b8fa5b6 to 6390a11 on August 13, 2025 09:13

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

341-349: Document disabled one-more-block behavior

The new static BlockManager::isUseOneMoreBlock() always returns false, which means the “one-more-block” path is effectively disabled. We should document this decision at the call site to make it clear that downstream kernels and rewind inputs no longer need to handle the extra‐block cases.

• In cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (around line 341):

  • Add a brief comment before or after the call to isUseOneMoreBlock() explaining that it always returns false and why it’s safe to disable the extra-block logic (e.g., SWA/FHMA alignment no longer requires it).
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

292-298: Bug: wrong container used in EXPECT_NE for disagg-gen-init verification

You’re searching in scheduledRequests but comparing to scheduledDisaggGenInitRequests.end(), which is a mismatch and can mask failures.

Apply this fix:

-                    EXPECT_NE(
-                        scheduledRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
+                    EXPECT_NE(
+                        scheduledDisaggGenInitRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
                         << "itCount: " << itCount << "mRequestId:" << scheduledReqState.mRequestId;
♻️ Duplicate comments (8)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (8)

519-520: Same note as previous: sinkTokenLen coverage reduced

Replicate the rationale comment or a shared helper to centralize this decision.


564-565: Same note as previous: sinkTokenLen coverage reduced


609-610: Same note as previous: sinkTokenLen coverage reduced


719-721: Same note as previous: sinkTokenLen coverage reduced


1194-1196: Same note as previous: sinkTokenLen coverage reduced


1380-1382: Same note as previous: sinkTokenLen coverage reduced


1528-1530: Same note as previous: sinkTokenLen coverage reduced


1580-1582: Same note as previous: sinkTokenLen coverage reduced

🧹 Nitpick comments (4)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (4)

373-427: Drop noisy std::cerr traces or guard them behind a flag

The ad-hoc std::cerr prints in a unit-test utility will produce noisy test output and make CI logs harder to parse. Prefer GTest’s SCOPED_TRACE, TLLM_LOG_* with adjustable severity, or guard with a macro/env flag.

Here’s a minimal cleanup that simply removes the debug prints:

-                std::cerr << "isContextInitState, itCount: " << itCount << std::endl;
@@
-                    std::cerr << "isFirstContextChunk, itCount: " << itCount << std::endl;
@@
-                    std::cerr << "addNewTokens, itCount: " << itCount << std::endl;
@@
-                std::cerr << "generation phase, itCount: " << itCount << std::endl;
-                std::cerr << "add token for kvCacheManager" << std::endl;
@@
-                    std::cerr << "add token for crossKvCacheManager" << std::endl;
@@
-                std::cerr << "add token for llmReq" << std::endl;
@@
-                std::cerr << "GENERATION_COMPLETE" << std::endl;

If you want to keep them for debugging, consider:

  • Replace with SCOPED_TRACE("...") where appropriate, or
  • Wrap in an if (std::getenv("TLLM_TEST_TRACE")) guard.

465-466: Reduced sinkTokenLen coverage: add rationale and tracking

Limiting sinkTokenLen to {0} reduces coverage. If this is temporary (e.g., reuse+sink tokens not supported yet), please add a brief rationale and a ticket reference. Consider adding a GTEST_SKIP-based param test for sinkTokenLen > 0 to keep the intent visible without failing.


845-849: Document removed configurations and plan to re-enable

You removed some configurations. If disabled due to current limitations, please add a short explanatory comment with an issue link to make the intent explicit and to plan re-enablement.


851-852: Narrowed configurations: consider parameterized tests or GTEST_SKIP for unsupported cases

Reducing to one configuration speeds tests but loses coverage. Consider keeping the matrix via parameterized tests and skipping unsupported combinations with GTEST_SKIP to preserve visibility.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8fa5b6 and 6390a11.

📒 Files selected for processing (5)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (15 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (25 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (17 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (64 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
👮 Files not reviewed due to content moderation or server errors (2)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • KVCacheManager (1711-1722)
  • KVCacheManager (1724-1738)
  • KVCacheManager (1740-1772)
  • KVCacheManager (1774-1787)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
  • numLayers (32-69)
  • numLayers (32-33)
  • numLayers (77-88)
  • numLayers (77-77)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
  • kvCacheManager (2969-2971)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
  • storeBlocks (925-929)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

137-139: KVCacheManager ctor update looks correct

Passing maxNumTokensPerSeq (per sequence budget / max sequence length) after the stream conforms to the updated ctor signature and matches the usage in other tests. No issues.

@eopXD eopXD force-pushed the swa-kv-cache-reuse branch 5 times, most recently from 306debd to 44643d2 on August 13, 2025 10:08
@NVIDIA NVIDIA deleted a comment from coderabbitai bot Aug 13, 2025

coderabbitai bot commented Aug 13, 2025

📝 Walkthrough


Adds Sliding Window Attention (SWA) support and front-block management to KV cache. Replaces cyclic logic with detach/adjust of front blocks, introduces kSWAExtraBlock and kInvalidBlockId, and makes maxSequenceLength mandatory in constructors. Updates APIs, bindings, PyTorch executor config, and tests; modifies help texts and integration tests accordingly.

Changes

Cohort / File(s) Change summary
Core KV cache header (APIs, types)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
Adds SWA wiring (mIsSWA), kSWAExtraBlock, kInvalidBlockId; replaces cyclic logic with front-block removal; makes maxSequenceLength non-optional in constructors; modifies signatures (WindowBlockManager, BlockManager, KVCacheManager); adds methods: updateLastCacheBlockOffsets, detachFrontBlock, adjustBlocksIfNeeded, GenerationRequest getters/removeFrontBlock; storeBlocks now returns SizeType32; removes isUseOneMoreBlock and isCyclic.
Core KV cache implementation
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Implements SWA-aware block adjust/detach, extra-block accounting, uniform window-size share; updates addToken/removeToken, remaining/required blocks calculations; updates constructors (non-optional maxSequenceLength, new params like event manager/retention); logging; refactors helpers; storeBlocks returns count.
TRT model inflight batching integration
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
Removes cyclic KV assertion; updates KVCacheManager construction with explicit maxSequenceLength and new partial reuse flags; disables “one more block” rewind logic.
Python bindings
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
Updates KVCacheManager binding: parameter changed from optional to required runtime::SizeType32; adjusts nb::init argument style.
PyTorch executor resource/config
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py
Removes VSWA-specific branch; always compute max_tokens from memory; notes dependency on max_tokens. Removes storing context blocks in update_resources.
Examples/help texts
examples/models/core/llama/summarize_long.py, examples/models/core/qwen2audio/utils.py, examples/utils.py, tensorrt_llm/functional.py
Updates help/docstrings: remove “cyclic” wording; no functional changes.
Unit tests: batch manager
cpp/tests/unit_tests/batch_manager/*
Updates KVCacheManager constructors to pass maxSequenceLength; adds helpers for valid blocks; adapts expectations to kInvalidBlockId and window behavior; extends scenarios for SWA/variable windows and one-step estimation.
Unit tests: executor/agent
cpp/tests/unit_tests/executor/agentCommTest.cpp
Supplies concrete kvMaxNumTokens to KVCacheManager; unchanged flow otherwise.
Unit tests: multi-GPU
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
Passes maxNumTokens to KVCacheManager; reduces/waives windowed test variants; updates parameterizations.
Integration tests
tests/integration/defs/accuracy/test_llm_api_pytorch.py
Replaces single VSWA test with four explicit cases (reuse/non-reuse, chunked prefill variants); removes skip; configures max_attention_window per test.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App as Inference Loop
  participant KVC as KVCacheManager
  participant BM as BlockManager
  participant WBM as WindowBlockManager
  participant GR as GenerationRequest

  App->>KVC: addToken(requestId)
  KVC->>BM: adjustBlocksIfNeeded(GR, enableReuse)
  alt SWA path (mIsSWA)
    BM->>WBM: adjustBlocksIfNeeded(GR, enableReuse)
    opt Window slide needed
      WBM->>WBM: detachFrontBlock(GR, enableReuse)\n(mark OOW block kInvalidBlockId)
      WBM->>WBM: updateLastCacheBlockOffsets(GR)
    end
  else Full attention
    BM->>BM: allocate/store as needed
  end
  BM-->>KVC: done
  KVC-->>App: token added

  note over KVC,WBM: On storage
  KVC->>WBM: storeBlocks(keys, ids)
  WBM-->>KVC: returns numBlocksStored

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


@eopXD eopXD force-pushed the swa-kv-cache-reuse branch 2 times, most recently from c380a15 to 061fd2c on August 13, 2025 17:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)

52-59: Fix compile error: SizeType32 used before it’s declared

SizeType32 is first introduced at Line 65, but is used to define kPrimaryLevel, kSecondaryLevel, and kSWAExtraBlock earlier (Lines 52–59). This will not compile.

Use the fully qualified type for these constants or move the alias before the constants.

Apply this diff to use the fully qualified type:

-static constexpr SizeType32 kPrimaryLevel = 0;
+static constexpr tensorrt_llm::runtime::SizeType32 kPrimaryLevel = 0;

-static constexpr SizeType32 kSecondaryLevel = 1;
+static constexpr tensorrt_llm::runtime::SizeType32 kSecondaryLevel = 1;

 // Extra block buffer allocated for SWA to be able to always keep "window size"
 // tokens held in the blocks.
-static constexpr SizeType32 kSWAExtraBlock = 1;
+static constexpr tensorrt_llm::runtime::SizeType32 kSWAExtraBlock = 1;

34-43: Include headers for types used in this header (avoid transitive includes)

This header uses std::map, std::tuple, std::deque, std::string, std::function, std::numeric, and std::chrono types without including their respective headers. Relying on transitive includes is brittle.

Add the missing headers:

 #include <array>
 #include <cstdint>
 #include <limits>
 #include <list>
 #include <memory>
 #include <optional>
 #include <set>
 #include <unordered_map>
 #include <vector>
+#include <map>
+#include <tuple>
+#include <deque>
+#include <string>
+#include <functional>
+#include <numeric>
+#include <chrono>

17-17: Add required include guard for headers (per coding guidelines)

Guidelines require header guards named TRTLLM_<NAME>_H. Replace or complement #pragma once with an include guard.

Apply this diff at the top and bottom of the file:

-#pragma once
+#ifndef TRTLLM_KVCACHEMANAGER_H
+#define TRTLLM_KVCACHEMANAGER_H
-} // namespace tensorrt_llm::batch_manager::kv_cache_manager
-
+} // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_KVCACHEMANAGER_H

Note: You can keep #pragma once as an optimization, but the guard is mandatory per the project rules.

Also applies to: 1704-1705

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

295-297: Fix container used in EXPECT_NE for disagg init scheduled set

This line is checking scheduledDisaggGenInitRequests, but it searches in scheduledRequests. This causes false negatives.

Apply this diff:

-                    EXPECT_NE(
-                        scheduledRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
+                    EXPECT_NE(
+                        scheduledDisaggGenInitRequests.find(scheduledReqState.mRequestId),
+                        scheduledDisaggGenInitRequests.end())
                         << "itCount: " << itCount << "mRequestId:" << scheduledReqState.mRequestId;
♻️ Duplicate comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

422-429: Reset counter after clearCacheBlocks: good fix

Resetting mNumFrontBlocksRemoved here resolves the earlier bug raised in a past review. Thanks for addressing it.

🧹 Nitpick comments (9)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)

755-766: Avoid implicit signed/unsigned mixing in std::max/min

To prevent unintended integral promotion (and to make intent clear), cast the zero literal to SizeType32.

-            auto window = std::min(inputs->maxNumTokens, inputs->maxInputLen - mWindowSize);
-            window = std::max(window, 0); // clamp negative values to 0
+            auto window = std::min(inputs->maxNumTokens, inputs->maxInputLen - mWindowSize);
+            window = std::max(window, static_cast<SizeType32>(0)); // clamp negative values to 0

590-603: New SWA helpers are well-scoped

The additions updateLastCacheBlockOffsets, detachBlock, and addBlockIfNeeded make responsibilities explicit. Keep them minimal and cohesive; watch for duplication with any BlockManager helpers.


1393-1407: Parameter name typo: copyOnpartialReuse → copyOnPartialReuse

There’s an inconsistent parameter name across constructors: copyOnpartialReuse vs. copyOnPartialReuse. Even though names don’t affect ABI, consistency improves readability and avoids confusion with the config flags.

Apply this rename in the declarations:

-        bool copyOnpartialReuse = true);
+        bool copyOnPartialReuse = true);

And propagate the same rename to the other overload below.

Please ensure the definitions in the corresponding .cpp also use the same parameter name.

Also applies to: 1418-1425


21-21: Consider forward-declaring LlmRequest to reduce header coupling

The TODO suggests it; forward-declaring class LlmRequest; and moving the include to the .cpp would reduce rebuild churn. Validate that only pointers/refs are used here before doing so.

cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)

87-89: Ack: TODO for beamWidth > 1 coverage

Noted. Please track this with an issue if it’s not on your immediate roadmap.

I can draft additional tests for beamWidth > 1 once SWA-related paths are stable. Want me to propose them?

cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (1)

749-758: Dropping streaming=true reduces coverage; add a tracking plan

Instantiating with testing::Values(false) removes all streaming=true scenarios. If this is temporary, add a TODO with a concrete link/issue and timeframe to re-enable streaming coverage, especially for beamWidth > 1 constraints.

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

347-348: Hard-coding isUseOneMoreBlock=false: document rationale and implications

isUseOneMoreBlock is now always false. That changes rewind behavior for speculative decoding. Please add a clear comment explaining why we no longer use “one more block” (e.g., incompatibility with SWA linearized allocation, correctness constraints) and verify any sizing assumptions in invokeUpdateKVBlockArrayDraftTokenLocation.

Optionally, gate this behind a config if future modes need to flip it back.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)

373-384: Remove noisy std::cerr debug prints from unit tests

These prints make test output noisy and brittle. Prefer GTest SCOPED_TRACE or logs behind a verbosity flag. If needed, use TLLM_LOG_* with a controllable level.

Apply this diff to remove the ad-hoc prints:

-                std::cerr << "isContextInitState, itCount: " << itCount << std::endl;
@@
-                    std::cerr << "isFirstContextChunk, itCount: " << itCount << std::endl;
@@
-                    std::cerr << "addNewTokens, itCount: " << itCount << std::endl;
@@
-                std::cerr << "generation phase, itCount: " << itCount << std::endl;
-                std::cerr << "add token for kvCacheManager" << std::endl;
@@
-                    std::cerr << "add token for crossKvCacheManager" << std::endl;
@@
-                std::cerr << "add token for llmReq" << std::endl;
@@
-                std::cerr << "GENERATION_COMPLETE" << std::endl;

Also applies to: 407-427


462-466: TODOs for sinkTokenLen > 0: track re-enabling scenarios

Acknowledged the temporary restriction to sinkTokenLen=0. Please add an issue link or re-enable coverage once SWA + sink-bubble semantics are finalized.

I can help craft parameterized cases for sinkTokenLen > 0 once the allocator semantics stabilize.

Also applies to: 518-520, 563-565, 719-721, 845-852, 1188-1196, 1528-1536, 1580-1588

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bda42f8 and 061fd2c.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (28 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (17 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (64 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (1 hunks)
👮 Files not reviewed due to content moderation or server errors (2)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (1)
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🔇 Additional comments (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

548-554: Constructor change: isSWA flag addition looks correct

Adding the isSWA toggle into WindowBlockManager aligns with the SWA behavior changes. Ensure the .cpp honors this to linearize writes and detach OOW blocks as described in the PR.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

136-139: Constructor update looks correct

Passing maxNumTokensPerSeq into the KVCacheManager ctor matches the new signature. Ensure all other test helpers and sites were updated consistently.

@eopXD eopXD changed the title [TRTLLM-6341][feature] SWA KV cache reuse [TRTLLM-6341][feature] Support SWA KV cache reuse Aug 14, 2025
@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from 061fd2c to 9aa7fa0 Compare August 15, 2025 07:16
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

95-96: LGTM! Clearer naming for maxTokensPerSeq field.

The rename from maxTokenNum to maxTokensPerSeq improves clarity and consistency with the codebase naming conventions.


431-448: Minor issue: Fix format specifier for requestId logging.

mRequestId is of type LlmRequest::RequestIdType (likely uint64_t), not int. Using %d may cause undefined behavior or incorrect logging output.

Apply this diff to fix the format specifier:

-                TLLM_LOG_WARNING("RequestID %d: removeFrontBlock called but nothing to remove", mRequestId);
+                TLLM_LOG_WARNING("RequestID %lu: removeFrontBlock called but nothing to remove",
+                    static_cast<unsigned long>(mRequestId));
🧹 Nitpick comments (2)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)

404-404: Consider removing or converting debug output to proper test logging.

The std::cerr statements appear to be debugging artifacts. Consider using proper test logging mechanisms or removing them if no longer needed.

-                    std::cerr << "addNewTokens, itCount: " << itCount << std::endl;
+                    // Token addition completed for context
-            std::cerr << "kvCacheManager" << std::endl;
+            // Initialize KV cache manager
-            std::cerr << "crossKvCacheManager" << std::endl;
+            // Initialize cross KV cache manager

Also applies to: 513-513, 516-516


453-454: Document the reasoning for temporarily disabled sinkTokenLen > 0 tests.

Multiple TODO comments indicate that sinkTokenLen > 0 tests are disabled. Please add an explanation of why this configuration is not currently supported with SWA and when it might be addressed.

Add a more detailed comment explaining the limitation:

-    // TODO: Support and add coverage for sinkTokenLen > 0. (e.g. 4)
+    // TODO(TRTLLM-6341): Support and add coverage for sinkTokenLen > 0.
+    // SWA KV cache reuse currently doesn't support sink tokens.
+    // This will be addressed in a future iteration.

Also applies to: 507-508, 554-555, 599-600, 710-711, 836-842, 1038-1039, 1092-1093, 1185-1186, 1371-1372, 1519-1520, 1571-1572, 1852-1853

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 061fd2c and 9aa7fa0.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (29 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (15 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (64 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
👮 Files not reviewed due to content moderation or server errors (2)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-while/for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (5)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (7)

56-58: LGTM! Clear naming for SWA extra block constant.

The constant is well-documented, follows the naming convention (kPREFIXED_UPPER_SNAKE), and clearly indicates its purpose for sliding-window attention buffer management.


210-210: LGTM! Appropriate invalid block ID constant.

Using -2 as an invalid block ID sentinel value is correct, distinguishing it from -1 (kCachedBlocksRootId).


343-343: LGTM! Proper state management for front block removal.

The addition of mNumFrontBlocksRemoved with getter method and reset in clearCacheBlocks ensures consistent tracking of detached front blocks for SWA.

Also applies to: 386-389, 428-429, 481-482


550-550: LGTM! SWA flag properly integrated into WindowBlockManager.

The addition of the mIsSWA flag and its propagation through the constructor enables proper conditional behavior for sliding-window attention.

Also applies to: 568-568, 827-827


585-585: LGTM! Well-designed SWA-specific block management methods.

The new methods provide clear semantics for SWA operations:

  • updateLastCacheBlockOffsets: Updates offsets for the last block
  • detachBlock: Cleanly detaches blocks from sequences
  • addBlockIfNeeded: Conditionally adds blocks with detachment of out-of-window blocks
  • storeBlocks now returns the count of actually stored blocks for better tracking

Also applies to: 591-592, 596-602, 614-614, 744-745


906-907: LGTM! Clear API updates for SWA support.

The addition of isShareLastContextBlock parameter and cache-level support in getNumFreeBlocks methods provides necessary control for SWA behavior.

Also applies to: 949-950, 960-962


1136-1144: LGTM! Well-designed block offset management methods for SWA.

The new private helper methods in BlockManager provide clean separation of concerns for managing cache block offsets and block addition logic.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

138-138: Constructor signature properly updated for maxNumTokensPerSeq parameter.

The test helper correctly passes the new maxNumTokensPerSeq parameter matching the updated KVCacheManager constructor.

@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from 9aa7fa0 to 3354203 Compare August 15, 2025 08:10
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (2)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

291-299: Fix: wrong container used when verifying scheduled disagg-gen-init requests

You’re checking membership in scheduledDisaggGenInitRequests but calling scheduledRequests.find(...). This will produce false negatives and can hide scheduling bugs.

Apply this diff:

-                    // Check that scheduleId is found in scheduled Requests
-                    EXPECT_NE(
-                        scheduledRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
+                    // Check that scheduleId is found in scheduled disaggregated generation-init Requests
+                    EXPECT_NE(
+                        scheduledDisaggGenInitRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
                         << "itCount: " << itCount << "mRequestId:" << scheduledReqState.mRequestId;
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

4179-4183: Potential undefined type alias: use explicit CudaStream type for clarity

createKvCacheManager takes a parameter StreamPtr stream, but this alias isn’t defined in this TU (and may not exist). All other places use std::shared_ptr<tr::CudaStream>.

Replace the parameter type with the explicit stream pointer type to avoid build breaks:

-std::shared_ptr<KVCacheManager> createKvCacheManager(
-    KvCacheManagerInstantiationParameters const& kvCacheInstantiationParameters, StreamPtr stream)
+std::shared_ptr<KVCacheManager> createKvCacheManager(
+    KvCacheManagerInstantiationParameters const& kvCacheInstantiationParameters,
+    std::shared_ptr<tr::CudaStream> stream)
♻️ Duplicate comments (1)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

2863-2866: Primary allocation check relies on subtracting secondary capacity (test-only workaround)

You compute “allocated primary blocks” via:

  • blockManager.getNumAllocatedBlocks() - blocksInSecondaryPool
  • and similarly for multi-window: - blocksInSecondaryPoolPerWindow * numWindows

This couples tests to internal pool sizing semantics.

If feasible, expose an API to query allocated counts per cache level (symmetrical to getNumFreeBlocks(level)), e.g., getNumAllocatedBlocks(level). If you prefer to keep current tests as-is (previously noted), no action needed now.

Also applies to: 3030-3033

🧹 Nitpick comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

1191-1200: Fix incorrect comment about block reuse versions.

The comments state "There are two versions of BlockManager::addSequence function" but this is misleading. These are overloaded methods with different purposes, not just enabled/disabled versions of the same functionality.

Update the comments to be more accurate:

-// There are two versions of BlockManager::addSequence function.
-// This is called when block reuse is enabled.
+// Add a sequence with KV cache block reuse support.
+// Attempts to reuse existing blocks from the cache tree when possible.
-// There are two versions of BlockManager::addSequence function.
-// This is called when block reuse is disabled.
+// Add a sequence without block reuse.
+// Allocates fresh blocks for the entire sequence.

Also applies to: 1277-1300


2593-2594: Add defensive check for potential underflow.

When blockCapacity / beamWidth <= kSWAExtraBlock, the subtraction could underflow. While the current code path might prevent this scenario, adding a defensive check would make the code more robust.

The issue was already addressed in a past review comment. Consider applying:

-        return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock;
+        auto const blocksPerBeam = blockCapacity / beamWidth;
+        auto const usableBlocksPerBeam = blocksPerBeam > kSWAExtraBlock ? blocksPerBeam - kSWAExtraBlock : 0;
+        return usableBlocksPerBeam * tokensPerBlock;

Similarly for line 2601:

-    return std::min(
-        outputLength + (leftoverBlockCapacity - kSWAExtraBlock) * tokensPerBlock, inputLength + outputLength);
+    auto const usableLeftoverCapacity = leftoverBlockCapacity > kSWAExtraBlock ? leftoverBlockCapacity - kSWAExtraBlock : 0;
+    return std::min(
+        outputLength + usableLeftoverCapacity * tokensPerBlock, inputLength + outputLength);

Also applies to: 2600-2601

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)

2811-2811: Remove unused variable ‘temporaryAttentionWindow’

temporaryAttentionWindow is defined but not used, which can trigger warnings.

-    auto constexpr temporaryAttentionWindow = 0;
+    // (unused) auto constexpr temporaryAttentionWindow = 0;

Or remove the line entirely.

Also applies to: 3081-3081


4071-4071: Typo in test name: ‘Caculate’ → ‘Calculate’

Minor naming polish to avoid misspellings in test outputs.

-TEST_P(BlockRequirementsParamTest, TestCaculateMaxBlocksRequirement)
+TEST_P(BlockRequirementsParamTest, TestCalculateMaxBlocksRequirement)

103-111: Non-deterministic event polling can be flaky

Using a fixed sleep_for(100ms) to wait for event flushing can cause intermittent flakes on slow or loaded CI.

Consider polling with a bounded timeout (e.g., loop on getLatestEvents until non-empty or a deadline is reached), or instrument the event manager to signal completion to avoid sleeping.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9aa7fa0 and 3354203.

📒 Files selected for processing (5)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (29 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (14 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (61 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-while/for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (5)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • KVCacheManager (1687-1698)
  • KVCacheManager (1700-1714)
  • KVCacheManager (1716-1748)
  • KVCacheManager (1750-1763)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (27)
  • nodiscard (1015-1022)
  • nodiscard (1522-1525)
  • nodiscard (2583-2602)
  • KVCacheBlock (261-274)
  • addSequence (1193-1197)
  • addSequence (1193-1194)
  • addSequence (1201-1236)
  • addSequence (1201-1202)
  • addSequence (1279-1283)
  • addSequence (1279-1280)
  • addSequence (1287-1300)
  • addSequence (1287-1288)
  • addSequence (2055-2113)
  • addSequence (2055-2056)
  • updateLastCacheBlockOffsets (1968-1984)
  • updateLastCacheBlockOffsets (1968-1968)
  • detachBlock (2011-2046)
  • detachBlock (2011-2011)
  • addBlockIfNeeded (1238-1244)
  • addBlockIfNeeded (1238-1238)
  • addBlockIfNeeded (1246-1275)
  • addBlockIfNeeded (1246-1246)
  • getNumFreeBlocks (1522-1522)
  • updateSequenceCacheBlockOffsets (1948-1966)
  • updateSequenceCacheBlockOffsets (1948-1948)
  • updateCacheBlockOffsetsAtIdx (1986-2001)
  • updateCacheBlockOffsetsAtIdx (1986-1986)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
  • llmRequest (421-450)
  • llmRequest (421-421)
  • llmRequest (872-903)
  • llmRequest (872-872)
  • llmRequest (905-912)
  • llmRequest (905-905)
  • llmRequest (914-933)
  • llmRequest (914-914)
cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp (2)
  • getNumFreeBlocks (179-182)
  • getNumFreeBlocks (179-179)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (20)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (8)

58-58: LGTM! Well-structured SWA constant addition.

The kSWAExtraBlock constant is appropriately defined and documented, providing a clear buffer for sliding window attention. The comment explains its purpose well.


95-108: LGTM! Consistent naming improvement for window metadata.

The rename from maxTokenNum to maxTokensPerSeq is more descriptive and aligns well with maxBlocksPerSeq. The toString() method is correctly updated.


210-210: Good addition of invalid block sentinel value.

The kInvalidBlockId = -2 constant provides a clear distinction from kCachedBlocksRootId = -1, making detached/removed blocks easily identifiable in SWA operations.


343-348: LGTM! Well-designed front block removal for SWA.

The implementation correctly:

  • Resets the counter in clearCacheBlocks (addressing the past review comment)
  • Uses bounds checking to prevent out-of-bounds access
  • Preserves block layout for reuse by marking with kInvalidBlockId instead of removing
  • Includes appropriate warning logging

Also applies to: 386-390, 428-429, 431-448, 481-483


550-551: LGTM! Clean SWA flag integration.

The isSWA parameter is properly propagated through the constructor chain and stored for controlling SWA-specific behavior. The boolean expression windowSize < maxSequenceLength correctly identifies SWA scenarios.

Also applies to: 565-566, 606-608, 619-620, 827-828


591-603: LGTM! Well-structured SWA block management methods.

The new methods provide clear separation of concerns:

  • updateLastCacheBlockOffsets: Efficiently updates only the last block
  • detachBlock: Handles out-of-window block removal
  • addBlockIfNeeded: Orchestrates allocation and detachment

The documentation is clear and helpful.


745-746: LGTM! Return value addition enables tracking of stored blocks.

Changing storeBlocks to return the count of blocks stored is a good improvement for monitoring and debugging reuse effectiveness.


902-908: LGTM! Clean API surface for SWA block management.

The BlockManager methods properly delegate to WindowBlockManager instances and provide a clean abstraction for managing blocks across different window sizes.

Also applies to: 1137-1144

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (12)

159-191: LGTM! Well-implemented sequence block traversal utility.

The getAllSequenceBlocks function efficiently handles the traversal with proper early return for empty sequences and pre-allocation for performance. This addresses the past review comment about size_t underflow.


564-566: LGTM! Proper SWA flag initialization.

The SWA flag is correctly computed as windowSize < maxSequenceLength.value_or(0), which properly identifies sliding window scenarios.


607-620: LGTM! Clean WindowBlockManager constructor with SWA support.

The constructor properly initializes the mIsSWA flag and maintains the correct parameter order.


1238-1276: LGTM! Well-implemented SWA block management.

The addBlockIfNeeded implementation correctly:

  1. Calculates when to detach blocks based on window boundaries
  2. Performs the detachment before allocation to keep memory usage bounded
  3. Updates offsets appropriately

The TODO comments provide valuable context about current limitations.
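The detach condition described above can be sketched in isolation. This is an illustrative model, not the actual `addBlockIfNeeded` logic; the function name and the exact boundary convention are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the SWA detach condition: the front block of a
// sequence falls out of the sliding window once every token it holds is older
// than the window start.
bool frontBlockOutOfWindow(
    std::int32_t numTokens, std::int32_t windowSize, std::int32_t tokensPerBlock, std::int32_t frontBlockIdx)
{
    std::int32_t const windowStart = numTokens > windowSize ? numTokens - windowSize : 0;
    std::int32_t const frontBlockEnd = (frontBlockIdx + 1) * tokensPerBlock; // one past the block's last token
    return frontBlockEnd <= windowStart;
}
```

Under this model, with 100 tokens, a 32-token window, and 16 tokens per block, block 0 (tokens 0-15) is fully behind the window start at token 68 and is eligible for detachment.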


1360-1423: LGTM! Proper return value for tracking stored blocks.

The storeBlocks method now correctly returns the count of blocks stored for reuse, which is useful for monitoring and debugging.


1615-1663: Excellent SWA-aware block release implementation!

The releaseBlocks method properly handles SWA by:

  1. Using getAllSequenceBlocks to gather all blocks including detached ones
  2. Checking for blocks that may not have ref counts (detached blocks)
  3. Logging the number of blocks stored for reuse

This is a robust implementation that handles the complexities of SWA.


1737-1738: LGTM! Clear runtime guard against unsupported features.

The check appropriately prevents streamLLM usage with SWA, which is documented as not yet supported.


1949-2001: Clean refactoring of offset update methods.

The separation of concerns is well done:

  • updateSequenceCacheBlockOffsets: Full sequence update
  • updateLastCacheBlockOffsets: Efficient last block update
  • updateCacheBlockOffsetsAtIdx: Targeted block update

2003-2009: LGTM! Clean public API for token addition.

The new addToken method provides a clean interface that hides the complexity of block management.


2011-2046: Well-structured block detachment for SWA.

The detachBlock method correctly:

  1. Validates beam width constraint (documented limitation)
  2. Decrements ref counts appropriately
  3. Chooses between offload and release based on configuration
  4. Updates sequence state and allocated blocks

The TODO comment about future eviction policy integration is helpful.


2057-2102: LGTM! Comprehensive sequence addition with SWA support.

The method properly handles:

  • Context length calculation
  • Block allocation (with/without reuse)
  • Sharing decisions for cross-attention
  • Cache offset updates

2536-2547: LGTM! Correct SWA block calculation with extra buffer.

The block requirement calculations properly account for the kSWAExtraBlock buffer when sequences exceed the window size, ensuring sufficient blocks for smooth window sliding.

Also applies to: 2549-2581
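A minimal sketch of the block-budget rule described above: when the sequence exceeds the window, one extra block is reserved so the window can slide across a block boundary before the out-of-window block is detached. The value of `kSWAExtraBlock` and the ceil-division shape are assumptions for illustration, not the source's exact formula.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration; the real constant lives in kvCacheManager.h.
constexpr std::int32_t kSWAExtraBlock = 1;

// Hypothetical sketch: blocks needed for a sequence under SWA. Only up to
// windowSize tokens are cached, plus one extra block while sliding.
std::int32_t blocksNeeded(std::int32_t seqLen, std::int32_t windowSize, std::int32_t tokensPerBlock)
{
    std::int32_t const cachedTokens = seqLen < windowSize ? seqLen : windowSize;
    std::int32_t const baseBlocks = (cachedTokens + tokensPerBlock - 1) / tokensPerBlock; // ceil-div
    return seqLen > windowSize ? baseBlocks + kSWAExtraBlock : baseBlocks;
}
```

For instance, a 100-token sequence with a 32-token window and 16 tokens per block needs 2 window blocks plus the 1-block sliding buffer; a 20-token sequence that fits inside the window needs no buffer.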

@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from 3354203 to 762d9d4 Compare August 15, 2025 09:15


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

581-583: Default fallback for maxSequenceLength may cause block allocation failure.

When maxSequenceLength is std::nullopt, setting maxTokensPerSeq = 0 results in maxBlocksPerSeq = 0, which prevents KV-cache allocation. This could break allocation flows when callers pass std::nullopt.

Apply this fix to derive a sensible default:

-        auto const maxTokensPerSeq = maxSequenceLength.value_or(0);
+        // Derive a sensible default when maxSequenceLength is not provided
+        auto const maxTokensPerSeq = maxSequenceLength.has_value() 
+            ? maxSequenceLength.value() 
+            : allottedPrimaryBlocks * tokensPerBlock / beamWidth;

1287-1300: Fix parameter name consistency between declaration and definition.

The definition of WindowBlockManager::addSequence names its last parameter isShareLastContextBlock; verify that the declaration in the header uses the same spelling. The coding guidelines require parameter names to match exactly between declarations and definitions, so align whichever side differs.

2592-2600: Fix potential underflow in tight-capacity SWA calculation.

The expression ((blockCapacity / beamWidth) - kSWAExtraBlock) can underflow when blockCapacity/beamWidth <= kSWAExtraBlock, causing incorrect attention window calculation.

Apply the clamping fix:

-        return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock;
+        auto const blocksPerBeam = blockCapacity / beamWidth;
+        auto const usableBlocksPerBeam = blocksPerBeam > kSWAExtraBlock ? blocksPerBeam - kSWAExtraBlock : 0;
+        return usableBlocksPerBeam * tokensPerBlock;
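The effect of the suggested clamp can be shown in a standalone sketch. Names and the assumed `kSWAExtraBlock` value are illustrative; the point is that with unsigned-style arithmetic, clamping returns zero instead of wrapping around when capacity per beam is at or below the reserved extra block.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration; the real constant lives in kvCacheManager.h.
constexpr std::uint32_t kSWAExtraBlock = 1;

// Hypothetical sketch of the clamped tight-capacity window calculation.
std::uint32_t maxWindowTokens(std::uint32_t blockCapacity, std::uint32_t beamWidth, std::uint32_t tokensPerBlock)
{
    std::uint32_t const blocksPerBeam = blockCapacity / beamWidth;
    // Clamp to zero rather than letting unsigned subtraction wrap around.
    std::uint32_t const usable = blocksPerBeam > kSWAExtraBlock ? blocksPerBeam - kSWAExtraBlock : 0;
    return usable * tokensPerBlock;
}
```

Without the clamp, a capacity of 2 blocks split over 2 beams would underflow to a huge window; with it, the result is simply 0 tokens.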

564-566: Default fallback for maxSequenceLength may cause block allocation failure.

When maxSequenceLength is std::nullopt, the fallback value of 0 results in maxTokensPerSeq = 0 and maxBlocksPerSeq = 0 (line 583), which prevents KV-cache allocation. This could break allocation flows when callers pass std::nullopt.

Apply this fix to derive a sensible default:

-        /*isSWA=*/windowSize < maxSequenceLength.value_or(0), allottedPrimaryBlocks,
+        /*isSWA=*/maxSequenceLength.has_value() && windowSize < maxSequenceLength.value(), allottedPrimaryBlocks,
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

2976-2978: Incorrect use of EXPECT_THAT with a raw value (won’t compile).

EXPECT_THAT requires a matcher as the second argument. Use EXPECT_EQ or ::testing::Eq instead. This was flagged previously and still applies.

Apply this minimal fix:

-    EXPECT_THAT(allBlocksInPrimaryPools, blocksInPrimaryPoolPerWindow * numWindows);
+    EXPECT_EQ(allBlocksInPrimaryPools, blocksInPrimaryPoolPerWindow * numWindows);

Alternatively:

-    EXPECT_THAT(allBlocksInPrimaryPools, blocksInPrimaryPoolPerWindow * numWindows);
+    EXPECT_THAT(allBlocksInPrimaryPools, ::testing::Eq(blocksInPrimaryPoolPerWindow * numWindows));
🧹 Nitpick comments (4)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)

452-454: Make sinkTokenLen coverage TODOs explicit to avoid accidental rot.

Multiple TODOs defer sinkTokenLen > 0 coverage. Suggest one of:

  • Mark corresponding parameterized cases as disabled (prefix test name with DISABLED_) until support lands.
  • Keep the param values {0,4} but GTEST_SKIP() early when sinkTokenLen != 0, so the shape remains visible in CI without failing.

This helps track intent and prevents accidental deletion of these scenarios.

Example pattern within affected loops:

 for (auto sinkTokenLen : sinkTokenLens)
 {
+    if (sinkTokenLen != 0) {
+        GTEST_SKIP() << "sinkTokenLen > 0 temporarily unsupported; tracked by TODO.";
+    }
     auto kvCacheManager = getKvCacheManager(
         maxNumRequests, kvCacheTokensPerBlock, kvCacheMaxNumTokens, kvCacheMaxNumTokensPerSeq, sinkTokenLen);

Also applies to: 506-508, 551-553, 707-709, 833-840, 1090-1091, 1182-1184, 1368-1370, 1516-1518, 1568-1570, 1850-1851


1729-1745: Dead branch comment about STATIC_BATCH in chunked test.

This test declares only GUARANTEED_NO_EVICT and kMAX_UTILIZATION in capacitySchedulerPolicies, yet retains a STATIC_BATCH branch/comment. Either include STATIC_BATCH in the vector or remove the dead branch to avoid confusion.

If you want to exercise STATIC_BATCH here as well:

-    auto capacitySchedulerPolicies = std::vector<CapacitySchedulerPolicy>{
-        CapacitySchedulerPolicy::kGUARANTEED_NO_EVICT, CapacitySchedulerPolicy::kMAX_UTILIZATION};
+    auto capacitySchedulerPolicies = std::vector<CapacitySchedulerPolicy>{
+        CapacitySchedulerPolicy::kGUARANTEED_NO_EVICT,
+        CapacitySchedulerPolicy::kMAX_UTILIZATION,
+        CapacitySchedulerPolicy::kSTATIC_BATCH};
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

2810-2812: Remove unused variable temporaryAttentionWindow.

temporaryAttentionWindow is declared but never used in these tests. Clean it up to avoid warnings.

-    auto constexpr temporaryAttentionWindow = 0;

Also applies to: 3081-3081


4706-4707: Be explicit with allocatePools(false) for consistency.

Most tests explicitly pass false to avoid UVM. For consistency and to prevent accidental defaults changing behavior, prefer allocatePools(false) here as well.

-        kvCacheManager->allocatePools();
+        kvCacheManager->allocatePools(false);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3354203 and 762d9d4.

📒 Files selected for processing (5)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (29 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (14 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (61 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧬 Code Graph Analysis (2)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • KVCacheManager (1686-1697)
  • KVCacheManager (1699-1713)
  • KVCacheManager (1715-1747)
  • KVCacheManager (1749-1762)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • KVCacheManager (141-931)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (26)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (23)

159-190: LGTM! Well-structured utility for SWA block traversal.

The function pre-allocates the vector and correctly handles the edge case of empty blocks. The early return guard protects against underflow issues when blockCount is zero.


607-619: LGTM! Clear SWA integration in WindowBlockManager constructor.

The addition of the isSWA parameter properly initializes SWA support based on the window size comparison.


1191-1244: LGTM! Appropriate function signature documentation and implementation.

The function correctly implements block allocation/detachment logic for SWA, with proper bounds checking and beam width validation for the detach mechanism.


1246-1275: Well-documented SWA block management logic.

The implementation correctly handles both block allocation and detachment for SWA with appropriate conditions. The comments clearly explain the current sub-optimal nature and areas for future improvement.


1360-1422: LGTM! Enhanced storeBlocks with return value for SWA tracking.

The function now returns the count of blocks stored for reuse, which is valuable for SWA logging and monitoring. The implementation correctly tracks and returns numBlocksStoredForReuse.


1522-1525: LGTM! Improved function signature with explicit const qualifier.

The addition of the const noexcept qualifiers correctly indicates that this function doesn't modify the object state and provides strong exception safety guarantees.


1615-1645: Well-implemented SWA-aware block release with reuse support.

The function correctly handles SWA scenarios by using getAllSequenceBlocks() to collect all sequence blocks when needed, and properly logs the number of blocks stored for reuse.


1650-1654: LGTM! Proper handling of detached blocks without references.

The conditional check correctly handles blocks that may not have reference counts due to SWA detachment, preventing unnecessary decrements on blocks that are already detached.


1736-1737: Appropriate guard for unsupported streamLLM with SWA.

The runtime check correctly prevents the use of streamLLM functionality when sink-related lengths are non-zero, providing clear error messaging for unsupported configurations.


1876-1879: LGTM! Correct usage of metadata for token limit checking.

The check properly uses maxTokensPerSeq from the window size metadata to prevent allocation beyond sequence limits.


1925-1944: Well-implemented SWA extra block logic for generation planning.

The calculation correctly accounts for sliding window scenarios where both block and window boundaries are crossed, requiring an extra block for smooth window transitions.


1947-1965: LGTM! Comprehensive cache block offset updates.

The function correctly updates all cache block offsets for the entire sequence, ensuring proper page table maintenance across all beams and blocks.


1967-1983: LGTM! Efficient last block offset update.

The function correctly updates only the last block's offsets, which is appropriate when only a new block has been appended to the sequence.


1985-2000: LGTM! Targeted block offset update at specific index.

The function correctly updates offsets for a specific block index across all beams, which is useful for SWA scenarios where specific blocks need offset recalculation.


2002-2008: LGTM! Clean implementation of the new addToken API.

The function correctly advances the sequence token count and triggers block allocation/detachment logic through the block manager. The TODO comment appropriately notes the need for future streamLLM support.


2010-2045: Well-implemented SWA block detachment with proper validation.

The function correctly implements out-of-window block detachment for SWA with appropriate beam width validation, reference count management, and offloading logic when reuse is enabled.


2079-2100: LGTM! Clear routing logic for block reuse vs. non-reuse paths.

The conditional logic correctly routes to different addSequence implementations based on block reuse settings, with appropriate warnings for retention configurations when reuse is disabled.


2120-2128: LGTM! Proper sequence existence validation.

The function correctly checks for sequence existence before attempting to store context blocks, with appropriate warning logging for missing sequences.


2138-2152: LGTM! Clear validation and logging for storeNewBlock constraints.

The function correctly validates beam width and block reuse requirements with informative warning messages when constraints are not met.


2444-2459: LGTM! Comprehensive token removal with multi-window support.

The function correctly handles token removal across all window sizes with proper block release logic when tokens align with block boundaries.


2540-2545: LGTM! Proper SWA extra block allowance in capacity calculation.

The addition of kSWAExtraBlock correctly accounts for the extra block needed during sliding window transitions when the sequence length exceeds the window size.


2574-2578: LGTM! Consistent SWA extra block handling for beam search.

The logic correctly adds kSWAExtraBlock when the sequence length exceeds the window size, maintaining consistency with single-beam calculations.


1848-1852: Verify SWA extra block allowance in block calculation.

The addition of chunkSize to the window size when calculating promptCacheLen may not align with SWA semantics. Ensure this calculation properly accounts for the attention window boundaries.

Review the calculation to verify the SWA window boundary logic.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

137-139: Constructor wiring LGTM: maxSequenceLength threaded correctly.

The helper updates the KVCacheManager constructor to pass maxNumTokensPerSeq as both the window size entry and the explicit maxSequenceLength argument. This aligns with the new API and SWA semantics and keeps the blocksPerWindow configuration coherent.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

77-81: Helper to count valid block IDs is clear and correct.

getNumValidBlocks() correctly filters out KVCacheBlock::kInvalidBlockId and is used to assert allocation invariants under SWA. Good addition.


145-146: addSequence(isShareLastContextBlock) coverage is solid.

The test now exercises both non-sharing and sharing of the last context block and validates block allocation patterns and equality across beams. This meaningfully covers the new API.

Also applies to: 160-161, 173-189

@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from 762d9d4 to 169d317 Compare August 18, 2025 05:29

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

52-59: Fix use-before-declaration of SizeType32 for kPrimaryLevel/kSecondaryLevel (compile-time error).

SizeType32 is introduced at Line 65, but it's used earlier at Lines 52 and 54. This will not compile unless another alias is already in scope. Use a fully qualified type for these constants (or move the alias above).

Apply this diff:

-static constexpr SizeType32 kPrimaryLevel = 0;
-
-static constexpr SizeType32 kSecondaryLevel = 1;
+static constexpr tensorrt_llm::runtime::SizeType32 kPrimaryLevel = 0;
+
+static constexpr tensorrt_llm::runtime::SizeType32 kSecondaryLevel = 1;
♻️ Duplicate comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

431-446: Use correct format specifier and cast for requestId in warning log.

mRequestId isn't guaranteed to be an unsigned long, so using %lu without an explicit cast is risky; previous reviews requested this fix.

Apply this diff:

-                TLLM_LOG_WARNING("RequestID %lu: removeFrontBlock called but nothing to remove", mRequestId);
+                TLLM_LOG_WARNING("RequestID %lu: removeFrontBlock called but nothing to remove",
+                    static_cast<unsigned long>(mRequestId));
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)

2589-2601: Clamp SWA extra-block arithmetic to avoid underflow in tight-capacity cases.

When blockCapacity/beamWidth <= kSWAExtraBlock, subtracting kSWAExtraBlock underflows (unsigned) and returns a huge window. Similarly for leftoverBlockCapacity - kSWAExtraBlock.

Apply this diff:

-    if (outputBlockRequirements > blockCapacity)
-    {
-        return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock;
-    }
+    if (outputBlockRequirements > blockCapacity)
+    {
+        auto const blocksPerBeam = static_cast<SizeType32>(blockCapacity / beamWidth);
+        auto const usableBlocksPerBeam = blocksPerBeam > kSWAExtraBlock ? (blocksPerBeam - kSWAExtraBlock) : 0;
+        return usableBlocksPerBeam * tokensPerBlock;
+    }
@@
-    return std::min(
-        outputLength + (leftoverBlockCapacity - kSWAExtraBlock) * tokensPerBlock, inputLength + outputLength);
+    auto const usableLeftoverBlocks =
+        leftoverBlockCapacity > kSWAExtraBlock ? (leftoverBlockCapacity - kSWAExtraBlock) : 0;
+    return std::min(outputLength + usableLeftoverBlocks * tokensPerBlock, inputLength + outputLength);

1246-1275: Re-cache KV offsets after detaching front block; prevent stale/invalid page-table entries.

After detachBlock, indices shift but the page-table (offset tensor) isn’t recomputed. Copying offsets later will include stale entries or garbage for invalid slots.

Apply this diff to recompute all offsets and mark invalid entries with a sentinel:

     if (shouldDetachBlock && mIsSWA)
     {
-        detachBlock(sequence, isEnableBlockReuse);
+        detachBlock(sequence, isEnableBlockReuse);
+
+        // Recompute offsets for the entire sequence after front-removal.
+        auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+        auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+        auto const& offsetsShape = cacheBlocksTensor.getShape();
+        auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+        auto const beamWidth = sequence.getBeamWidth();
+        auto constexpr kIdx = 0;
+        auto constexpr vIdx = 1;
+        tk::KVCacheIndex const invalidIndex{
+            std::numeric_limits<tk::KVCacheIndex::UnderlyingType>::max()};
+        for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx)
+        {
+            auto const& beamCacheBlock = cacheBlocks[beamIdx];
+            for (SizeType32 blockIdx = 0; blockIdx < static_cast<SizeType32>(beamCacheBlock.size()); ++blockIdx)
+            {
+                auto const blockId = beamCacheBlock[blockIdx];
+                if (blockId != KVCacheBlock::kInvalidBlockId)
+                {
+                    setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+                }
+                else
+                {
+                    // Fill sentinel for both K and V across all pools for invalid block slots.
+                    for (SizeType32 poolIdx = 0; poolIdx < static_cast<SizeType32>(mPools.size()); ++poolIdx)
+                    {
+                        for (auto const xIdx : {kIdx, vIdx})
+                        {
+                            auto const offsetIndex =
+                                tensorrt_llm::common::flat_index(offsetsShape.d, poolIdx, beamIdx, xIdx, blockIdx);
+                            offsetsPtr[offsetIndex] = invalidIndex;
+                        }
+                    }
+                }
+            }
+        }
     }

581-591: Avoid metadata.maxBlocksPerSeq == 0 when maxSequenceLength is not provided.

Defaulting maxTokensPerSeq to 0 yields maxBlocksPerSeq = 0 and zero-capacity offset tables that are later written into. This is a correctness issue.

Apply this diff:

-        auto const maxTokensPerSeq = maxSequenceLength.value_or(0);
+        // If maxSequenceLength is not provided, fall back to a sane per-seq capacity:
+        // at least the window, plus one extra SWA buffer block.
+        auto const defaultBlocksPerSeq =
+            tensorrt_llm::common::ceilDiv(windowSize, tokensPerBlock) + kSWAExtraBlock;
+        auto const maxBlocksPerSeq = maxSequenceLength.has_value()
+            ? tensorrt_llm::common::ceilDiv(*maxSequenceLength, tokensPerBlock)
+            : defaultBlocksPerSeq;
+        auto const maxTokensPerSeq = maxBlocksPerSeq * tokensPerBlock;
-        auto const temporaryAttentionWindow = manager.calculateTemporaryAttentionWindow(tempAttentionWindowInputs);
-        auto const maxBlocksPerSeq = tc::ceilDiv(maxTokensPerSeq, tokensPerBlock);
+        auto const temporaryAttentionWindow = manager.calculateTemporaryAttentionWindow(tempAttentionWindowInputs);
         auto const [allottedPrimaryBlocks, allottedSecondaryBlocks] = blocksPerWindow.at(windowSize);
         mWindowSizeToMetadata[windowSize]
-            = WindowSizeMetadata{allottedPrimaryBlocks, allottedSecondaryBlocks, absolutePoolsOffset, numPools,
-                maxTokensPerSeq, maxBlocksPerSeq, manager.getMaxNumBlocks(), temporaryAttentionWindow};
+            = WindowSizeMetadata{allottedPrimaryBlocks, allottedSecondaryBlocks, absolutePoolsOffset, numPools,
+                maxTokensPerSeq, maxBlocksPerSeq, manager.getMaxNumBlocks(), temporaryAttentionWindow};
         TLLM_LOG_INFO(
             "Max KV cache blocks per sequence: %d [window size=%d], tokens per block=%d, primary blocks=%d, secondary "
             "blocks=%d",
             maxBlocksPerSeq, windowSize, tokensPerBlock, allottedPrimaryBlocks, allottedSecondaryBlocks);
🧹 Nitpick comments (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

548-554: Constructor’s new isSWA parameter: good surface for SWA-specific flows.

Adding isSWA to WindowBlockManager enables localized SWA behavior. Ensure call sites compute isSWA robustly (see cpp fix for nullopt maxSequenceLength).

Would you like me to scan all call sites to confirm they pass a well-defined SWA signal across variable-window configs?


17-17: Prefer include guards over #pragma once (guideline).

The codebase guidelines specify include guards named TRTLLM__H. Consider adding them for consistency.

Example:

// #pragma once  // optional to keep as well
#ifndef TRTLLM_KVCACHEMANAGER_H
#define TRTLLM_KVCACHEMANAGER_H
...
#endif // TRTLLM_KVCACHEMANAGER_H
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

159-190: Avoid potential size_t underflow pattern; simplify index logic.

The early return for blockCount == 0 helps, but the post-decrement pattern can still trigger static-analysis warnings. Start currentIndex at blockCount and use pre-decrement for clarity.

Apply this diff:

-    size_t currentIndex = blockCount - 1;
-    while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
-    {
-        sequenceBlocks[currentIndex--] = currentBlock;
+    size_t currentIndex = blockCount;
+    while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
+    {
+        sequenceBlocks[--currentIndex] = currentBlock;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 762d9d4 and 169d317.

📒 Files selected for processing (3)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (29 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (4)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🔇 Additional comments (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

428-429: LGTM: Resetting front-detach counter after clear.

Resetting mNumFrontBlocksRemoved avoids indexing past the beginning on subsequent detach operations.


614-616: Expose getNumFreeBlocks cacheLevel overload across Block/Window managers: solid addition.

This aligns stats reporting with multi-level pools (primary/secondary).

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)

1360-1423: Returning number of stored blocks from storeBlocks is a good improvement.

This allows upstream code to track reuse effectiveness per call. The bookkeeping added in release path leverages this count.


1615-1662: Release path covers SWA-detached blocks correctly.

Conditional decRefCount() avoids underflow for already-detached blocks; bulk clearing of per-seq storage is correct.


2444-2459: Single-beam removeToken guardrails look good.

Beam>1 guard and block-boundary logic for releasing last block are consistent with SWA restrictions in this PR.

…t the possible maximum for kv_cache_config.max_tokens

This way, we are not constrained by the max_tokens limit.

Signed-off-by: eopXD <[email protected]>
@tensorrt-cicd
Collaborator

PR_Github #19441 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14610 completed with status: 'FAILURE'

@eopXD eopXD force-pushed the swa-kv-cache-reuse branch from ae9a42c to 3598e9f Compare September 22, 2025 08:01
@eopXD
Collaborator Author

eopXD commented Sep 22, 2025

/bot run --disable-fail-fast --post-merge

@tensorrt-cicd
Collaborator

PR_Github #19539 [ run ] triggered by Bot

@nvbrantz
Collaborator

/bot run --disable-fail-fast --post-merge

@tensorrt-cicd
Collaborator

PR_Github #19602 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19539 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14687 (Blue Ocean) completed with status: ABORTED

@eopXD
Collaborator Author

eopXD commented Sep 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19653 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19602 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14743 (Blue Ocean) completed with status: ABORTED

eopXD added a commit to eopXD/TensorRT-LLM that referenced this pull request Sep 23, 2025
This MR is a continuation of NVIDIA#6768. In the previous merge request, OOW
(out-of-window) blocks are only detached when reuse is not enabled.
This MR enables the KV cache manager to detach the block when reuse is
enabled.

When reuse is enabled, the OOW block gets detached while the manager
bookkeeps whether the block is overwritten by another sequence. If, at the
end of the sequence, all of its blocks are still clean, the sequence blocks
will be stored.

Signed-off-by: eopXD <[email protected]>
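The clean-block bookkeeping described in the commit message above can be sketched as follows. The structures and names are purely illustrative assumptions, not the TRT-LLM API:

```cpp
#include <vector>

// Illustrative structures only, not the TRT-LLM API.
struct Block
{
    bool overwritten = false; // Set if another sequence reclaims and rewrites this block.
};

struct Sequence
{
    std::vector<Block*> blocks; // Includes blocks already detached as out-of-window.

    // Storing the sequence's blocks for reuse is only valid if no detached
    // block was overwritten while it sat out of the attention window.
    bool allBlocksClean() const
    {
        for (auto const* block : blocks)
        {
            if (block->overwritten)
            {
                return false;
            }
        }
        return true;
    }
};
```

In this sketch, a sequence whose detached blocks were never reclaimed passes the check and its blocks become eligible for storage; a single overwritten block disqualifies the whole sequence.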
@tensorrt-cicd
Collaborator

PR_Github #19653 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14788 completed with status: 'SUCCESS'

@thorjohnsen thorjohnsen self-requested a review September 23, 2025 18:06
Collaborator

@jaedeok-nvidia jaedeok-nvidia left a comment


I left another comment for a reminder, but it is ok since KV cache mgr doesn't affect the performance unless a user explicitly controls. I approved, please fix the calculation in a follow-up PR.

@eopXD eopXD merged commit cf10093 into NVIDIA:main Sep 24, 2025
9 checks passed
@eopXD
Collaborator Author

eopXD commented Sep 24, 2025

I left another comment for a reminder, but it is ok since KV cache mgr doesn't affect the performance unless a user explicitly controls. I approved, please fix the calculation in a follow-up PR.

@jaedeok-nvidia Thank you for the reminder. I will queue this into my todo list.

@eopXD eopXD changed the title [TRTLLM-6341][feature] Support SWA KV cache reuse [TRTLLM-6341][feature] Support SWA KV cache Sep 24, 2025
eopXD added a commit to eopXD/TensorRT-LLM that referenced this pull request Sep 24, 2025
This MR is a continuation of NVIDIA#6768. In the previous merge request, OOW
(out-of-window) blocks are only detached when reuse is not enabled.
This MR enables the KV cache manager to detach the block when reuse is
enabled.

When reuse is enabled, the OOW block gets detached while the manager
bookkeeps whether the block is overwritten by another sequence. If, at the
end of the sequence, all of its blocks are still clean, the sequence blocks
will be stored.

Signed-off-by: eopXD <[email protected]>
eopXD added a commit to eopXD/TensorRT-LLM that referenced this pull request Sep 24, 2025
This MR is a continuation of NVIDIA#6768. In the previous merge request, OOW
(out-of-window) blocks are only detached when reuse is not enabled.
This MR enables the KV cache manager to detach the block when reuse is
enabled.

When reuse is enabled, the OOW block gets detached while the manager
bookkeeps whether the block is overwritten by another sequence. If, at the
end of the sequence, all of its blocks are still clean, the sequence blocks
will be stored.

Signed-off-by: eopXD <[email protected]>
eopXD added a commit to eopXD/TensorRT-LLM that referenced this pull request Sep 25, 2025
This MR is a continuation of NVIDIA#6768. In the previous merge request, OOW
(out-of-window) blocks are only detached when reuse is not enabled.
This MR enables the KV cache manager to detach the block when reuse is
enabled.

When reuse is enabled, the OOW block gets detached while the manager
bookkeeps whether the block is overwritten by another sequence. If, at the
end of the sequence, all of its blocks are still clean, the sequence blocks
will be stored.

Signed-off-by: eopXD <[email protected]>
Labels
KV-Cache Management kv-cache management for efficient LLM inference