
Conversation

eopXD
Collaborator

@eopXD eopXD commented Aug 10, 2025

Description

This is a preliminary MR toward implementing the SWA reuse mechanism in the KV cache manager. Please be aware that no functional change is intended in this merge request. The purpose of the clean-up is to decouple and remove existing functions so that the upcoming SWA KV cache reuse change will be more natural and easier to review.

Right now, (1) streamLLM and (2) beam search with SWA are broken. We do not want to complicate the code base by stacking more features on top of something that does not work. This MR prunes the related logic and adds assertions so we can come back later, re-support the broken features, and remove the assertions.

Since streamLLM (sink attention) is broken now, an assertion is added in the KVCacheManager ctor to guard the values of mSinkBlockTokenLength and mSinkBubbleLength. Compute logic related to them is pruned.

Beam search with SWA will still be broken when SWA KV cache reuse is introduced. We will revisit this problem in the future.

On top of this, we should update the support matrix of the KV cache manager after merging SWA KV cache reuse support.

The changes are as follows:

  • Separate KVCacheManager::updateToken into KVCacheManager::addToken and KVCacheManager::removeToken. The functionality should be decoupled.
  • Push the utilities cacheSequenceBlockOffsets and cacheNewBlockOffset from KVCacheManager down to WindowBlockManager. KVCacheManager-exposed functions should be real utilities that users of the structure can leverage; implementation-detail calls should not live at this level.
  • Simplify "is shared last context block" logic under KVCacheManager::addSequence.
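The addToken/removeToken split above can be sketched with a toy model. This is a minimal illustration under assumed values, not the actual TensorRT-LLM implementation: ToySequence, kTokensPerBlock, and kMaxTokensInWindow are all hypothetical stand-ins for GenerationRequest, getTokensPerBlock(), and the per-window metadata.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for the sequence state tracked by the KV cache manager.
struct ToySequence
{
    std::int32_t numTokens = 0;
    std::int32_t numBlocks = 0;
};

constexpr std::int32_t kTokensPerBlock = 4;
constexpr std::int32_t kMaxTokensInWindow = 16;

// Adding a token allocates a block whenever the token count inside the
// attention window lands on a block boundary (mirrors the PR's condition
// `tokensInWindow % getTokensPerBlock() == 0`).
void addToken(ToySequence& seq)
{
    seq.numTokens += 1;
    std::int32_t const tokensInWindow = seq.numTokens % kMaxTokensInWindow;
    if (tokensInWindow % kTokensPerBlock == 0)
    {
        seq.numBlocks += 1; // stand-in for mBlockManager.allocateBlock(...)
    }
}

// Removing a token mirrors addToken: the last block is released when the
// count falls back onto a block boundary.
void removeToken(ToySequence& seq)
{
    if (seq.numTokens == 0)
    {
        return;
    }
    seq.numTokens -= 1;
    std::int32_t const tokensInWindow = seq.numTokens % kMaxTokensInWindow;
    if (tokensInWindow % kTokensPerBlock == 0)
    {
        seq.numBlocks -= 1; // stand-in for mBlockManager.releaseLastBlock(...)
    }
}
```

Keeping the two directions as separate, symmetric functions is what makes the per-window loop easy to audit in review.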

Test Coverage

Since no functional change is intended in this merge request, no test case is added. Several comments are added as reminders for future test coverage.

For LlmRequestTest.ParamTest, streaming=True is commented out because sink attention is now guarded with an assertion.

In capacitySchedulerTest, the addToken action on crossKVCacheManager is removed because in encoder-decoder models, generation tokens are added only to the decoder, not the encoder.

Summary by CodeRabbit

  • New Features

    • Per-window KV-cache flows and a new explicit option to share the last context block; window-aware public APIs to update KV-cache offsets.
  • Refactor

    • Centralized block allocation and offset-update responsibilities under window-aware block managers; per-window routing for sequence/token handling.
  • Behavior Changes

    • Streaming generation disabled/guarded; legacy shared/unshared replacement paths removed; sequence/block APIs changed (window-aware, share-last flag).
  • Tests

    • Unit tests simplified; sink-token >0 and multi-beam coverage deferred or disabled.

@eopXD eopXD changed the title [TRTLLM-6341][feature][kv cache manager] Preliminary refactors before supporting swa kv cahce reuse [TRTLLM-6341][chore][kv cache manager] Preliminary refactors before supporting swa kv cahce reuse Aug 10, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)

1-3: Update copyright year to include 2025

Per coding guidelines, OSS code should include the current year. Header shows 2022-2024 while the .cpp is 2025.

-/* 
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+/*
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.

17-17: Use header include guards instead of #pragma once

Guideline requires preprocessor guard in headers. Keep #pragma once if you like, but add the guard.

 #pragma once
+#ifndef TRTLLM_KVCACHEMANAGER_H
+#define TRTLLM_KVCACHEMANAGER_H
...
-} // namespace tensorrt_llm::batch_manager::kv_cache_manager
+} // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_KVCACHEMANAGER_H

1383-1396: Fix parameter name typo: copyOnpartialReuse → copyOnPartialReuse

Guideline: parameter names should be consistent between declaration and definition. This is also a typo.

-        bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+        bool enablePartialReuse = true, bool copyOnPartialReuse = true);
...
-        bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+        bool enablePartialReuse = true, bool copyOnPartialReuse = true);
...
-        bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+        bool enablePartialReuse = true, bool copyOnPartialReuse = true);
...
-        bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+        bool enablePartialReuse = true, bool copyOnPartialReuse = true);

Also applies to: 1403-1414

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

131-136: Bug: wrong multimodal start offset within block

mmStartInBlock is reversed. For content starting inside the block, the offset should be (startPos - startTokenIdx); otherwise 0.

-            SizeType32 mmStartInBlock = (startPos >= startTokenIdx) ? 0 : startTokenIdx - startPos;
+            SizeType32 mmStartInBlock = (startPos >= startTokenIdx) ? (startPos - startTokenIdx) : 0;
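The corrected arithmetic can be checked in isolation. This is a minimal sketch: `mmStartInBlock` is factored out as a free function with illustrative values, not the real call site.

```cpp
#include <cstdint>

// Offset of multimodal content inside a KV-cache block: zero when the
// content starts at or before the block's first token, and the distance
// from the block start (startPos - startTokenIdx) when it starts inside.
std::int32_t mmStartInBlock(std::int32_t startPos, std::int32_t startTokenIdx)
{
    return startPos >= startTokenIdx ? startPos - startTokenIdx : 0;
}
```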
🧹 Nitpick comments (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

549-550: Document the new parameter isShareLastContextBlock

Add a short note explaining when the last context block can be shared across beams.

-//! \brief Assign blocks for new sequence. Does not try to reuse blocks.
+//! \brief Assign blocks for new sequence. Does not try to reuse blocks.
+//! \param isShareLastContextBlock If true, the last context block is shared across beams (e.g., when fully filled).

879-880: Doc parity for BlockManager overload

Mirror the parameter documentation here to keep interfaces consistent.

-void addSequence(
-    GenerationRequest& sequence, SizeType32 numContextBlocks, SizeType32 windowSize, bool isShareLastContextBlock);
+//! \param isShareLastContextBlock If true, share the last context block across beams for this window size.
+void addSequence(GenerationRequest& sequence, SizeType32 numContextBlocks, SizeType32 windowSize,
+    bool isShareLastContextBlock);
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)

1194-1204: Comment improvement

These comments are helpful. Consider also noting the conditions that select each overload (reuse enabled vs disabled) to avoid drift.


1864-1881: cacheNewBlockOffset: LGTM with a tiny guard suggestion

Optional: early-continue if a beam has zero blocks to avoid .back() on empty vectors.

-        auto const blockId = beamCacheBlock.back();
+        if (beamCacheBlock.empty()) continue;
+        auto const blockId = beamCacheBlock.back();

1949-1951: Comment/code mismatch: temporaryAttentionWindow not used

The comment says we consider temporaryAttentionWindow, but effectiveInputLength ignores it.

  • Either remove the comment or incorporate it into the calculation (and document the rule).
  • Example if intended: effectiveInputLength = std::min(inputLength + temporaryAttentionWindow, maxTokenNum);
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee19ca5 and 0202754.

📒 Files selected for processing (2)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (13 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🔇 Additional comments (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

1128-1133: New caching APIs look good

Centralizing offset caching in BlockManager simplifies KVCacheManager. Interface and comments are clear.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6)

1844-1862: cacheSequenceBlockOffsets: LGTM

Offset materialization is correct and respects per-pool layout.


1918-1920: Assertion matches PR scope

Good to hard-stop unsupported streamLLM.


1994-1995: Offset caching on addSequence

Caching offsets at the end of sequence allocation is correct and keeps copyBlockOffsets O(1) per batch slot.


1807-1819: Assertion aligns with streamLLM pruning; context blocks calculation

Using floor division for numContextBlocks is fine with streamLLM disabled. Make sure this aligns with how addSequence computes context blocks (ceilDiv) to avoid off-by-one in heterogeneous paths.

Would you like me to generate a small unit test sketch to compare getRemainingBlocksToCompletion vs calculateMaxBlockRequirements for representative inputs?
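The floor-vs-ceiling discrepancy flagged above is easy to demonstrate. A small sketch with assumed helper names (floorDiv/ceilDiv here are illustrative, not the project's actual utilities):

```cpp
#include <cstdint>

// Floor and ceiling division disagree by one whenever the token count is
// not a multiple of the block size; that is exactly the off-by-one risk
// when one path uses floor division and another uses ceilDiv.
std::int32_t floorDiv(std::int32_t a, std::int32_t b)
{
    return a / b;
}

std::int32_t ceilDiv(std::int32_t a, std::int32_t b)
{
    return (a + b - 1) / b;
}
```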


1882-1906: Fix duplicate declaration and enforce beamWidth assertion

  • Remove the duplicate auto& sequence = getSequence(requestId); (only one declaration needed after the sink‐check).
  • Replace the warning on beamWidth > 1 with a hard assertion:
    TLLM_CHECK_WITH_INFO(
        sequence.getBeamWidth() == 1,
        "[KV cache manager] addToken does not support beamWidth > 1"
    );
  • Simplify the window boundary check to:
    if (tokensInWindow % getTokensPerBlock() == 0)
    (the tokensInWindow <= maxTokensInWindow check is always true by construction).

Proposed diff:

 void KVCacheManager::addToken(RequestIdType requestId)
 {
-    auto& sequence = getSequence(requestId);
     TLLM_CHECK_WITH_INFO(
         mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0,
         "streamLLM is not supported at the moment"
     );
-    auto& sequence = getSequence(requestId);
-    if (sequence.getBeamWidth() > 1)
-    {
-        TLLM_LOG_WARNING("[KV cache manager] addToken does not support beamWidth > 1");
-    }
+    auto& sequence = getSequence(requestId);
+    TLLM_CHECK_WITH_INFO(
+        sequence.getBeamWidth() == 1,
+        "[KV cache manager] addToken does not support beamWidth > 1"
+    );
     sequence.addNewTokens(1);
     for (auto const [windowSize, metadata] : mBlockManager.getWindowSizesMetadata())
     {
         SizeType32 const maxTokensInWindow = metadata.maxTokenNum;
         SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow;
-        // NOTE: If beam search is involved, added blocks should be shared.
-        // Please check out TODO(file MR number here) for utility function pruned.
-        if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow)
+        // Allocate a new block on each block boundary within the window.
+        if (tokensInWindow % getTokensPerBlock() == 0)
         {
             mBlockManager.allocateBlock(sequence, windowSize);
             mBlockManager.cacheNewBlockOffset(sequence, windowSize);
         }
     }
 }

Please confirm whether the “one extra block” behavior on exact window rollover (when tokensInWindow == 0) is intended. If not, change the condition to:

if (tokensInWindow > 0 && tokensInWindow % getTokensPerBlock() == 0)
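The rollover case can be worked through directly: when numTokens equals maxTokensInWindow, the modulo yields 0, which the looser condition treats as a block boundary. A sketch comparing the two candidate conditions (function names and values are illustrative):

```cpp
#include <cstdint>

// Would addToken allocate a block under the condition currently in the PR?
bool allocatesLoose(std::int32_t numTokens, std::int32_t maxTokensInWindow,
    std::int32_t tokensPerBlock)
{
    std::int32_t const tokensInWindow = numTokens % maxTokensInWindow;
    return tokensInWindow % tokensPerBlock == 0;
}

// Would it allocate under the stricter condition proposed above?
bool allocatesStrict(std::int32_t numTokens, std::int32_t maxTokensInWindow,
    std::int32_t tokensPerBlock)
{
    std::int32_t const tokensInWindow = numTokens % maxTokensInWindow;
    return tokensInWindow > 0 && tokensInWindow % tokensPerBlock == 0;
}
```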

2319-2339: Compile error and missing assertion in removeToken

Same issues as addToken: duplicate sequence declaration and lack of assertion for beamWidth > 1.

 void KVCacheManager::removeToken(RequestIdType requestId)
 {
-    auto& sequence = getSequence(requestId);
     TLLM_CHECK_WITH_INFO(
         mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0, "streamLLM is not supported at the moment");
-    auto& sequence = getSequence(requestId);
-    if (sequence.getBeamWidth() > 1)
-    {
-        TLLM_LOG_WARNING("[KV cache manager] removeToken does not support beamWidth > 1");
-    }
+    auto& sequence = getSequence(requestId);
+    TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[KV cache manager] removeToken does not support beamWidth > 1");
     if (sequence.getNumTokens() == 0)
     {
         return;
     }
     sequence.removeTokens(1);
     for (auto const [windowSize, metadata] : mBlockManager.getWindowSizesMetadata())
     {
         SizeType32 const maxTokensInWindow = metadata.maxTokenNum;
         SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow;
         if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow)
         {
             mBlockManager.releaseLastBlock(sequence, windowSize);
         }
     }
 }
⛔ Skipped due to learnings
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

@eopXD eopXD changed the title [TRTLLM-6341][chore][kv cache manager] Preliminary refactors before supporting swa kv cahce reuse [TRTLLM-6341][chore][kv cache manager] Preliminary refactors before supporting swa kv cache reuse Aug 11, 2025
@eopXD eopXD force-pushed the refactor-before-swa-kv-cache-reuse branch from 0202754 to 2e74a61 Compare August 11, 2025 07:22
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)

1207-1218: Guard against underflow when numContextBlocks == 0

If SizeType32 is unsigned, numContextBlocks - 1 underflows. Safer to compute the shared count explicitly.

-    for (SizeType32 bi = 0; bi < numContextBlocks - 1; ++bi)
+    SizeType32 sharedCount = (numContextBlocks > 0) ? (numContextBlocks - 1) : 0;
+    for (SizeType32 bi = 0; bi < sharedCount; ++bi)
     {
         allocateBlock(sequence, /*shareAmongBeams=*/true);
     }
-    allocateBlock(sequence, /*shareAmongBeams=*/isShareLastContextBlock);
+    if (numContextBlocks > 0) {
+        allocateBlock(sequence, /*shareAmongBeams=*/isShareLastContextBlock);
+    }
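The underflow hazard is concrete if the size type is unsigned: `0u - 1` wraps to the type's maximum, so the loop bound becomes astronomically large. A minimal sketch, assuming SizeType32 were an unsigned alias (in the real code base it may be signed, which is why this is hedged):

```cpp
#include <cstdint>

// Hypothetical unsigned alias to show the wraparound; the guard computes
// the shared-block count explicitly instead of relying on subtraction.
using SizeType32 = std::uint32_t;

SizeType32 sharedBlockCount(SizeType32 numContextBlocks)
{
    return numContextBlocks > 0 ? numContextBlocks - 1 : 0;
}
```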

1881-1884: Use assertion instead of warning for unsupported beam-search

PR intent is to prune unsupported paths with assertions. Replace warning with TLLM_CHECK to avoid undefined behavior later in allocation paths.

-    if (sequence.getBeamWidth() > 1)
-    {
-        TLLM_LOG_WARNING("[KV cache manager] Beam search is not supported at the moment");
-    }
+    TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[KV cache manager] Beam search is not supported at the moment");
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1829-1853: Update TODO comment with actual MR reference

Line 1845 contains a placeholder TODO comment that should be updated with the actual merge request number before merging.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0202754 and 2e74a61.

📒 Files selected for processing (2)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (12 hunks)
🧬 Code Graph Analysis (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (14)
  • addSequence (1151-1155)
  • addSequence (1151-1152)
  • addSequence (1159-1194)
  • addSequence (1159-1160)
  • addSequence (1198-1202)
  • addSequence (1198-1199)
  • addSequence (1206-1218)
  • addSequence (1206-1207)
  • addSequence (1862-1929)
  • addSequence (1862-1863)
  • cacheSequenceBlockOffsets (1791-1809)
  • cacheSequenceBlockOffsets (1791-1791)
  • cacheNewBlockOffset (1811-1827)
  • cacheNewBlockOffset (1811-1811)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (7)
  • llmRequest (421-450)
  • llmRequest (421-421)
  • llmRequest (872-903)
  • llmRequest (872-872)
  • llmRequest (905-912)
  • requestId (200-208)
  • requestId (200-201)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (14)
  • requestId (93-96)
  • requestId (93-93)
  • requestId (98-103)
  • requestId (98-99)
  • requestId (105-109)
  • requestId (105-106)
  • requestId (111-114)
  • requestId (111-111)
  • requestId (116-119)
  • requestId (116-116)
  • requestId (152-155)
  • requestId (152-152)
  • requestId (174-179)
  • requestId (174-175)
cpp/tests/executor/encDecTest.cpp (1)
  • req (224-225)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)

548-548: Good API simplification

The new signature with isShareLastContextBlock boolean parameter is clearer and more intuitive than the previous unsharedBlockIdx approach.


865-865: Consistent API update

The signature change maintains consistency with WindowBlockManager's addSequence method.


1099-1103: Good architectural refactoring

Moving the offset caching methods from KVCacheManager to BlockManager improves encapsulation by placing the functionality closer to where the block data is managed.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

1149-1155: Clear separation of block reuse paths

Good practice to have separate overloads with clear comments distinguishing the block reuse enabled vs disabled paths.


1752-1789: Good simplification of block calculation

The removal of temporaryAttentionWindow and sink bubble length from the calculation makes the logic clearer and aligns with the PR's goal of removing streaming LLM support.


1791-1827: Correct implementation of offset caching methods

The new BlockManager methods properly delegate to the appropriate WindowBlockManager, maintaining good separation of concerns.


2239-2263: Consistent implementation with addToken

The removeToken implementation properly mirrors the addToken logic, maintaining consistency in handling block boundaries and unsupported features.

@eopXD eopXD changed the title [TRTLLM-6341][chore][kv cache manager] Preliminary refactors before supporting swa kv cache reuse [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse Aug 11, 2025
@eopXD eopXD force-pushed the refactor-before-swa-kv-cache-reuse branch 3 times, most recently from 7ed955e to 2a54634 Compare August 11, 2025 13:29
@eopXD eopXD force-pushed the refactor-before-swa-kv-cache-reuse branch from 2a54634 to d04b8ee Compare August 11, 2025 13:31
@eopXD eopXD enabled auto-merge (squash) August 11, 2025 13:43
@eopXD eopXD disabled auto-merge August 11, 2025 13:44
@eopXD eopXD enabled auto-merge (squash) August 11, 2025 13:45
@eopXD eopXD requested a review from netanel-haber August 11, 2025 13:45
Collaborator

@tomeras91 tomeras91 left a comment


Overall looks good, although I do have a few comments/questions:

  1. You say StreamingLLM (i.e. sink tokens) and beam search are not supported with SWA.
    1.1. What broke this support?
    1.2. You changed the unittests to reflect the fact they aren't supported. I assume this means these unittests passed before your changes (i.e. with sink tokens and beam search). So how did these unittests pass if it's not supported?

  2. I asked in the PR comments as well: is beam search really not supported at all? That's what the assertions you added suggest, but you wrote in the description that it's unsupported only for SWA...

BTW - Great description! It was very helpful for reviewing

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (3)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (3)

289-315: Unreachable code after premature exit in TestBufferIndexAssignment1

Child branch calls exit at Line 289, making the subsequent recv-buffer test (Lines 291–314) unreachable. Remove the early exit and keep the final exit at Line 315.

Apply this minimal diff:

-        exit(testing::Test::HasFailure() ? 1 : 0);
+        // Continue with recv buffer index assignment test; final exit occurs below.

381-383: Fix format specifier for size_t in TLLM_LOG_INFO

defaultTransSize is a size-like value; using "%d" is incorrect on LP64 and can cause UB. Use "%zu".

-        TLLM_LOG_INFO("defaultTransSize: %d", defaultTransSize);
+        TLLM_LOG_INFO("defaultTransSize: %zu", static_cast<size_t>(defaultTransSize));

18-26: Include required POSIX headers for fork and waitpid

This file uses fork() and waitpid() but does not include their headers explicitly. Add these to ensure correct declarations across platforms/compilers.

 #include "tensorrt_llm/runtime/iTensor.h"
 #include <gtest/gtest.h>
 #include <memory>
+// For fork(), waitpid()
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/wait.h>
🧹 Nitpick comments (13)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)

38-41: Beam width forced to 1: consistent with current constraints

Setting maxBeamWidth to 1 with a clear TODO is aligned with the PR’s temporary limitation. Please add a tracking issue reference in the TODO for discoverability.

cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2)

87-89: Beam width TODO acknowledged

The TODO to add coverage for beamWidth > 1 is appropriate given current limitations. Consider adding a tracking ticket reference for traceability.


1-11: Update copyright header

Per repo guidelines, OSS source should include the current year. This file was modified; update “2023-2024” to include 2025.

cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (4)

407-411: Avoid double-disabling a test

Using both DISABLED_ prefix and GTEST_SKIP() is redundant. Keep a single mechanism (prefer DISABLED_ for long-term disable, or GTEST_SKIP with TODO if temporary).

-TEST_F(LlmRequestTest, DISABLED_testLastTokensSetIndependence)
+TEST_F(LlmRequestTest, DISABLED_testLastTokensSetIndependence)
 {
-    // TODO: Support and add coverage for beamWidth > 1
-    GTEST_SKIP();
+    // TODO: Support and add coverage for beamWidth > 1
+    // Disabled via DISABLED_ prefix above.

515-515: Typo in local variable name: beamWdith → beamWidth

Minor naming fix for consistency/readability.

-    auto const beamWdith = std::get<3>(info.param);
+    auto const beamWidth = std::get<3>(info.param);
...
-    name += "Bw" + std::to_string(beamWdith);
+    name += "Bw" + std::to_string(beamWidth);

753-761: Streaming parameter reduced to false: OK with TODO

Constraining streaming values to {false} is consistent with current gating. Consider adding a reference to a tracking issue in the TODO comment.


1-11: Update copyright header

Please include the current year in the header (add 2025).

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)

456-458: Sink-token coverage gated to 0: OK; add tracking reference and keep code ready to re-enable

Switching sinkTokenLen variants from {0, 4} to {0} with TODO is consistent with current constraints. Two suggestions:

  • Add a tracking issue ID to the TODO comment so the missing 4-case coverage isn’t lost.
  • Keep the plumbing (loops/param values) but guard with a single flag or a constexpr to simplify re-enabling later.

Also applies to: 511-512, 556-557, 601-602, 711-713, 837-844, 1040-1041, 1093-1095, 1187-1188, 1520-1522, 1572-1574, 1853-1855


1-11: Update copyright header

Please update to include 2025.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (4)

120-123: Beam width > 1 temporarily gated: good; add tracking references

Multiple tests now fix beamWidth/maxBeamWidth at 1 with TODOs. This is consistent with the current limitation. Please add links to a tracking issue next to the TODOs for future re-enablement.

Also applies to: 172-177, 2372-2374, 2436-2437, 3665-3666, 3908-3911, 4151-4153, 4257-4259


3502-3507: Avoid double-disabling and fix typo

  • The test is both DISABLED_ and calls GTEST_SKIP(); remove one (prefer DISABLED_ here).
  • Typo in TODO: “attetnion” → “attention”.
 TEST_P(KVCacheManagerTest, DISABLED_KVCacheManagerSinkTokenLengthTest)
 {
-    // TODO: Support sink attetnion and add coverage
+    // TODO: Support sink attention and add coverage
     // TODO: Support and add coverage for beamWidth > 1
-    GTEST_SKIP();

3796-3799: Prefer test-level gating over runtime checks

This helper asserts beamWidth == 1 via TLLM_CHECK. Since callers already gate beam tests, consider replacing this with a gtest ASSERT/ASSUME inside the test or relying on the existing gating to avoid hard aborts.

-    TLLM_CHECK(beamWidth == 1);
+    ASSERT_EQ(beamWidth, 1);

1-11: Update copyright header

Please update header to include 2025 as per guidelines.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e74a61 and d04b8ee.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (10 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (13 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (19 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (14)
  • addSequence (1151-1155)
  • addSequence (1151-1152)
  • addSequence (1159-1194)
  • addSequence (1159-1160)
  • addSequence (1198-1202)
  • addSequence (1198-1199)
  • addSequence (1206-1222)
  • addSequence (1206-1207)
  • addSequence (1867-1926)
  • addSequence (1867-1868)
  • cacheSequenceBlockOffsets (1799-1817)
  • cacheSequenceBlockOffsets (1799-1799)
  • cacheNewBlockOffset (1819-1835)
  • cacheNewBlockOffset (1819-1819)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
  • llmRequest (70-70)
  • llmRequest (70-70)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)
  • llmRequest (421-450)
  • llmRequest (421-421)
  • llmRequest (872-903)
  • requestId (200-208)
  • requestId (200-201)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (14)
  • requestId (93-96)
  • requestId (93-93)
  • requestId (98-103)
  • requestId (98-99)
  • requestId (105-109)
  • requestId (105-106)
  • requestId (111-114)
  • requestId (111-111)
  • requestId (116-119)
  • requestId (116-116)
  • requestId (152-155)
  • requestId (152-152)
  • requestId (174-179)
  • requestId (174-175)
🔇 Additional comments (11)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

140-141: New BlockManager::addSequence signature usage looks correct

Calls updated to pass maxAttentionWindow and the isShareLastContextBlock flag. Assumptions and expectations (occupiedBlocks and sharing behavior) are consistent with beamWidth = 1. LGTM.

Also applies to: 155-165, 168-178, 181-187


2771-2774: Sliding-window expectations updated

The free block counts after addToken steps align with cyclic/window semantics. Looks correct for the specified maxAttentionWindow scenario.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)

548-548: Good API simplification

The change from unsharedBlockIdx to isShareLastContextBlock makes the API clearer and more intuitive.


865-865: Consistent API design

The BlockManager signature change mirrors the WindowBlockManager change, maintaining consistency across the API.


1099-1103: Good architectural refactoring

Moving offset caching methods to BlockManager improves separation of concerns by centralizing block-related operations in the appropriate class.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6)

1149-1151: Good documentation for overloaded methods

Clear comments distinguishing between block reuse enabled/disabled versions improve code maintainability.

Also applies to: 1157-1158


1799-1835: Clean per-window routing implementation

The offset caching methods properly delegate to window-specific managers, maintaining good separation of concerns.


1837-1858: Proper guards for unsupported features

Good use of assertions to guard against unsupported streamLLM and beam search features, with helpful comments explaining the current limitations.


1867-1926: Well-structured sequence addition with proper window handling

The implementation correctly calculates effective input length per window and determines block sharing appropriately. Good assertion placement for unsupported features.


2253-2261: Correct per-window block release logic

The implementation properly handles block release when crossing block boundaries, respecting per-window constraints.


1196-1197: Clear method documentation

Comments clearly distinguish between the block reuse enabled and disabled versions of the methods.

Also applies to: 1204-1205

@eopXD
Collaborator Author

eopXD commented Aug 11, 2025

@tomeras91 Thank you for the swift review.

You say StreamingLLM (i.e. sink tokens) and beam search are not supported with SWA..
1.1. What broke this support?

The recent non-cyclic kernel changes don't support them. #6379

1.2. You changed the unittests to reflect the fact they aren't supported. I assume this means these unittests passed before your changes (i.e. with sink tokens and beam search). So how did these unittests pass if it's not supported?

This has been a problem because the coverage is not full, and even though the specific functionality tested in the unit tests does work as expected, overall the feature is not ready (or shall I say, back to unsupported). So right now I prefer to first prune them out to not further complicate computation on SWA reuse. As long as we keep clear context and fully guard its uses, we can revisit this in the future. Also, you can find that many tests under the UT file only cover sinkTokenLength = 0, and I don't dare to add test coverage now given my first priority is to have SWA KV cache reuse implemented.

I asked in the PR comments as well.. is beam search really not supported at all? That's what it seems from the assertions you added, but you wrote in the description it's not supported only for SWA...

Good question. What it seems to me now is that the KV cache manager manages all attention variants with a specified "window size". Since we are planning to support SWA KV cache reuse growing linearly just like full attention, the breakage of "beam search with SWA" also implies the breakage of "beam search with an infinite window, that is, full attention".

Beam search was added before the SWA concept was introduced into the KV cache manager, and I think except for some UTs we see here, we don't have other e2e coverage of it. So although it is not ideal, it is borderline acceptable for me to prune out the feature and revisit it in the near future.

@eopXD eopXD force-pushed the refactor-before-swa-kv-cache-reuse branch from d04b8ee to 60b1e17 Compare August 11, 2025 14:57
@eopXD eopXD disabled auto-merge August 11, 2025 15:01
@eopXD eopXD enabled auto-merge (squash) August 11, 2025 15:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (13)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)

38-40: Constrain beamWidth to 1 for now — aligned with guarded support

LGTM. Matches PR scope. Consider linking a tracking ticket in the TODO for re-enabling beamWidth > 1 coverage to avoid lingering tech debt.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (4)

456-458: Sink token coverage reduced to {0} with TODO — ok; suggest centralizing

Reducing sinkTokenLen to {0} is consistent with temporarily guarding unsupported paths. To reduce duplication and ease re-enablement later, consider a single helper (e.g., a function or a test fixture field) that returns the currently-supported set and documents the TODO once.

Would you like a script to list remaining tests still attempting sinkTokenLen > 0 so we can ensure consistency across the suite?

Also applies to: 510-512, 555-557, 598-602, 711-713, 1039-1041, 1093-1095, 1186-1188, 1372-1374, 1520-1522, 1572-1574, 1853-1855


837-844: Configurations pared down to supported case

Trimming to MAX_UTILIZATION with sinkTokenLen=0 and explicit expected iterations keeps this test stable during the guard period. If you intend to restore others later, consider adding a reference to a tracking issue (besides the TODO above) to avoid forgetting the removed cases.


3502-3507: Redundant DISABLED_ and GTEST_SKIP()

This test is already disabled via the DISABLED_ prefix; GTEST_SKIP() won’t execute. Drop the skip call to avoid unreachable code.

 TEST_P(KVCacheManagerTest, DISABLED_KVCacheManagerSinkTokenLengthTest)
 {
-    // TODO: Support sink attetnion and add coverage
+    // TODO: Support sink attention and add coverage
     // TODO: Support and add coverage for beamWidth > 1
-    GTEST_SKIP();

4151-4152: Global TODOs for beamWidth>1 and sink attention

Consider linking a tracking issue ID next to these TODOs to make the intent actionable and visible in dashboards.

Also applies to: 4257-4258

cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2)

407-411: Remove redundant GTEST_SKIP() from a DISABLED_ test

The DISABLED_ prefix already prevents execution. Drop GTEST_SKIP() to keep the body clean.

 TEST_F(LlmRequestTest, DISABLED_testLastTokensSetIndependence)
 {
     // TODO: Support and add coverage for beamWidth > 1
-    GTEST_SKIP();

751-766: Param space narrowed (streaming=false, beamWidth=1)

LGTM. Matches the temporary guard. Consider adding the tracking issue ID to the TODO comments so this can be expanded back later with traceability.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (6)

168-178: Expectation updates for addSequence capacity and duplicate request IDs

LGTM. The EXPECT_NO_THROW/EXPECT_THROW combinations are coherent. The inline TODO on beam width clarifies the temporary limitation.


3797-3799: Explicit guard: TLLM_CHECK(beamWidth == 1)

Reasonable defensive check for the temporary state. When re-enabling multi-beam, remember to remove this or convert to a parameterized constraint.


2362-2374: Misc: tests narrowed to maxBeamWidth=1

Multiple tests enforce maxBeamWidth=1 with TODOs. To minimize repetition, you can centralize this in the test fixture or a shared header constant so you only have to change it once when re-enabling multi-beam.

Also applies to: 2436-2437, 3665-3666
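
As a sketch of that centralization (names are illustrative, not from the codebase), a single shared constant keeps the eventual re-enable work to one edit:

```cpp
// Hypothetical shared test constant: bump once multi-beam support returns.
// TODO: Support and add coverage for beamWidth > 1.
constexpr int kCURRENT_MAX_BEAM_WIDTH = 1;

// Tests would read this instead of hard-coding 1 in each fixture.
int maxBeamWidthForTest()
{
    return kCURRENT_MAX_BEAM_WIDTH;
}
```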


3502-3507: Duplicate skip in DISABLED_ test (sink token length)

As noted in capacitySchedulerTest, DISABLED_ already prevents execution. Consider removing GTEST_SKIP() here as well (if introduced in future diffs).


4151-4152: Global TODOs for beamWidth>1 and sink attention

Suggest linking a tracking issue to keep this visible and actionable to the team.

Also applies to: 4257-4258


2679-2682: Comment clarity

Minor nit: “Enable cyclic kv cache for all new generated tokens” comment could be rephrased for accuracy if the window semantics are not fully cyclic for the entire sequence. Not a blocker.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d04b8ee and 60b1e17.

📒 Files selected for processing (7)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (10 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (13 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (19 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
cpp/tests/utils/common.h (1)
  • beamWidth (214-214)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (14)
  • addSequence (1151-1155)
  • addSequence (1151-1152)
  • addSequence (1159-1194)
  • addSequence (1159-1160)
  • addSequence (1198-1202)
  • addSequence (1198-1199)
  • addSequence (1206-1222)
  • addSequence (1206-1207)
  • addSequence (1867-1927)
  • addSequence (1867-1868)
  • cacheSequenceBlockOffsets (1799-1817)
  • cacheSequenceBlockOffsets (1799-1799)
  • cacheNewBlockOffset (1819-1835)
  • cacheNewBlockOffset (1819-1819)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (5)
  • llmRequest (421-450)
  • llmRequest (421-421)
  • llmRequest (872-903)
  • requestId (200-208)
  • requestId (200-201)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (14)
  • requestId (93-96)
  • requestId (93-93)
  • requestId (98-103)
  • requestId (98-99)
  • requestId (105-109)
  • requestId (105-106)
  • requestId (111-114)
  • requestId (111-111)
  • requestId (116-119)
  • requestId (116-116)
  • requestId (152-155)
  • requestId (152-152)
  • requestId (174-179)
  • requestId (174-175)
cpp/tests/utils/common.h (1)
  • beamWidth (214-214)
🔇 Additional comments (22)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)

548-548: LGTM: Clear API documentation for addSequence variants

The updated comment clearly distinguishes between the two addSequence overloads, improving API clarity.


864-865: Verify consistent parameter naming across overloads

The new isShareLastContextBlock parameter is well-named and makes the sharing behavior explicit. However, ensure this parameter is consistently used across all related methods.


1099-1104: New offset caching methods properly centralize functionality

The addition of cacheSequenceBlockOffsets and cacheNewBlockOffset methods to BlockManager follows good separation of concerns by moving implementation-specific caching from KVCacheManager to the appropriate manager class.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (9)

1149-1151: Clear documentation improves code maintainability

The comments clearly distinguish between the two addSequence overloads and their use cases (block reuse enabled vs disabled), improving code readability.


1196-1202: Proper delegation to per-window managers

The implementation correctly delegates to the appropriate WindowBlockManager based on windowSize, maintaining the per-window routing architecture.


1206-1222: Block allocation logic correctly implements sharing behavior

The implementation properly allocates blocks with the correct sharing semantics:

  • Lines 1214-1217: Allocate shared blocks for all but the last context block
  • Lines 1218-1221: Allocate the last block with the specified sharing behavior

1799-1817: Offset caching properly centralized in BlockManager

The cacheSequenceBlockOffsets method correctly:

  • Routes to the appropriate WindowBlockManager
  • Handles multiple beams and blocks
  • Maintains the existing caching interface

1819-1835: Single block offset caching implementation is correct

The cacheNewBlockOffset method properly handles caching for just the newest block, complementing the full sequence caching method.


1837-1858: Proper safeguarding against unsupported features

The assertions correctly guard against:

  • StreamLLM functionality (sink tokens)
  • Beam search with width > 1
    This aligns with the PR's goal of pruning unsupported paths.

1870-1885: Consistent guard implementation in addSequence

The same safeguarding pattern is properly applied in addSequence, ensuring consistent behavior across the API.


1911-1915: Correct sharing logic and offset caching

The logic properly:

  • Determines sharing behavior based on cross-KV, cyclic state, and beam width
  • Routes through the new per-window addSequence API
  • Caches offsets using the new centralized method

2242-2259: Complete implementation of removeToken with proper guards

The removeToken method correctly:

  • Guards against unsupported features consistently
  • Implements per-window token removal logic
  • Maintains the block management invariants
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (3)

2436-2437: Force maxBeamWidth=1 in DISABLED_KVCacheManagerAllocationTest

LGTM. Matches the temporary guard. No action needed beyond tracking re-enable work.


3958-3960: Beam width explicitly set to 1 in reference/capacity tests

LGTM. This keeps the math and expectations coherent while multi-beam is guarded.

Also applies to: 3969-3971


2679-2682: Adjusted free block expectations around addToken

The updated expectations reflect the first addToken not allocating a new block but the second one does. This aligns with the cyclic window semantics exercised by the test. LGTM.

If you want an extra sanity check, I can generate a one-off assertion helper to compute expected free blocks from test parameters.

Also applies to: 2771-2774
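
Such a helper might look like the following. The arithmetic is purely illustrative (ceil-divide the window-capped sequence length into blocks, times the beam width), not the manager's actual accounting:

```cpp
#include <algorithm>

// Illustrative only: blocks one sequence needs under a capped attention window.
int expectedUsedBlocks(int seqLen, int attentionWindow, int tokensPerBlock, int beamWidth)
{
    int const cappedLen = std::min(seqLen, attentionWindow);
    int const blocksPerBeam = (cappedLen + tokensPerBlock - 1) / tokensPerBlock; // ceil division
    return blocksPerBeam * beamWidth;
}
```

Expected free blocks would then be the pool size minus this value, summed over live sequences.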

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (7)

120-123: Force beamWidth=1 in BlockManagerTest

LGTM. Keeps the test aligned with temporarily unsupported multi-beam behavior.


140-141: Adopt new BlockManager::addSequence signature

Correct use of the new 4th parameter (isShareLastContextBlock) in both false and true variants. Good coverage of both paths.

Also applies to: 155-156


2679-2682: MaxAttentionWindow test: updated expectations after addToken

The first addToken not changing free blocks and the second reducing by maxBeamWidth matches the intended semantics here. LGTM.

Also applies to: 2771-2774


2777-2787: Batch test: free block accounting expectations updated

After addSequence + addToken, expectations reflect only shared+one leaf allocation per request. This mirrors the updated semantics. LGTM.


3956-3966: Reference helpers (calculateMax): beamWidth set to 1 in tests*

Temporarily constraining beamWidth to 1 is fine. The comments correctly mark follow-ups. No further issues.

Also applies to: 3967-3981, 3983-4007


4257-4467: FillKvCacheAndCompleteRequests param set: constrained but representative

Parameter sets cover diverse prompt/output sizes and windows with reuse on/off, while keeping beamWidth = 1. Good coverage given current constraints.

Also applies to: 4469-4548


3797-3854: neededBlocksOneStep tests: guard + expectations

Using TLLM_CHECK(beamWidth == 1) and validating used blocks against getNeededBlocksOneStep is coherent and keeps the test meaningful even with the constraint. LGTM.

Also applies to: 3861-3891

@eopXD eopXD requested a review from tomeras91 August 11, 2025 15:26
@eopXD eopXD self-assigned this Aug 11, 2025
@eopXD eopXD added the KV-Cache Management kv-cache management for efficient LLM inference label Aug 11, 2025
@tomeras91
Collaborator

Good question. What it seems to me now is that the KV cache manager manages all attentions with a "window size" specified. Since we are planning to support SWA KV cache reuse to grow linearly just like a full attention, the breakage for "beam search with SWA" also implies the breakage for "beam search with an infinite window, that is, full attention".
Beam search was added before SWA concept was introduced into KV cache manager, and I think except some UT-s we see here, we don't have other e2e coverage of it. So as it seems not ideal now, it is borderline-acceptable for me to prune out the feature and revisit it in the near future.

While it is true that we can think of full attention as SWA with an infinite window, there is still an important difference. If the window size is equal to the maximum sequence length supported by the model, we're not actually in SWA. This is a fundamentally different scenario from actual SWA, where the attention window is smaller than the maximum sequence length.

Beam search is a pretty widely used feature and I would be very surprised if it is currently broken on main. Here are a few links to end-to-end tests using beam search. They should work, as recent commits passed CI:
https://github.com/NVIDIA/TensorRT-LLM/blob/7e33ed6d61e30f8e1e4e2765d5d275726e7f0a6a/tests/unittest/_torch/test_beam_search.py


@tensorrt-cicd
Collaborator

PR_Github #14823 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11195 completed with status: 'FAILURE'

@eopXD eopXD force-pushed the refactor-before-swa-kv-cache-reuse branch from 78cbfc8 to 5b847f4 Compare August 15, 2025 06:58
@eopXD
Copy link
Collaborator Author

eopXD commented Aug 15, 2025

/bot run --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

295-297: Bug: wrong container used in EXPECT_NE for disagg-gen-init scheduled set

The check for scheduledDisaggGenInitRequests erroneously searches in scheduledRequests, making the assertion ineffective.

Apply this diff to fix the container lookup:

-                    EXPECT_NE(
-                        scheduledRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
+                    EXPECT_NE(
+                        scheduledDisaggGenInitRequests.find(scheduledReqState.mRequestId),
+                        scheduledDisaggGenInitRequests.end())
🧹 Nitpick comments (11)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (11)

452-453: Track the narrowed sink-token coverage with a concrete reference

Acknowledged narrowing to sinkTokenLen = 0. Consider linking the TODO to the root-cause PR/issue (e.g., PR #6379 context) so it’s actionable when re-enabling.


504-508: Cross-blocks test: same sink-token gating comment

Same note: adding an issue/PR reference to the TODO will make the intent traceable.


551-552: Lora duplicate task: add reference to TODO

Link the TODO to the tracked breakage for easier re-enablement later.


596-598: Lora doesn’t fit: add reference to TODO

Same as above; a ticket/PR reference helps future follow-up.


707-709: Max utilization: add reference to TODO

Please link to the tracked item for sinkTokenLen > 0 so this doesn’t get lost.


1090-1091: Cross-blocks negative-fit: add reference to TODO

Same suggestion: link the TODO to the underlying issue/PR.


1182-1184: Adding new requests (max util.): add reference to TODO

Consider attaching an issue/PR ID to the TODO for sink tokens.


1368-1370: Adding new requests with priorities: add reference to TODO

Same note about linking TODOs for clarity.


1516-1518: Guaranteed completion (adding new requests): add reference to TODO

Please link TODO to a tracking item.


1568-1570: Guaranteed completion in chunk: add reference to TODO

Same recommendation to link TODO to an issue/PR.


1849-1851: Static batch: add reference to TODO

Same note here; a concrete reference will help re-enable coverage later.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 78cbfc8 and 5b847f4.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (13 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

833-841: Thanks for documenting removed configurations

The added TODO and explicit “Removed configuration” notes make the coverage contraction clear. This addresses prior tracking concerns.

@tensorrt-cicd
Collaborator

PR_Github #15413 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15412 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #15413 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11617 completed with status: 'FAILURE'

@eopXD eopXD force-pushed the refactor-before-swa-kv-cache-reuse branch from 5b847f4 to 2add619 Compare August 18, 2025 03:34
@eopXD
Collaborator Author

eopXD commented Aug 18, 2025

/bot run --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

872-879: Nice docstring addition for the new overload

The parameter docs clarify intent and match implementation in the .cpp. This addresses the prior request to document the new overload.

🧹 Nitpick comments (7)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (7)

1157-1194: Reuse path looks correct; minor guard optional

The reuse variant ignores the last token for hashing and pads an empty block when needed, which aligns with prefill semantics. If you want an early fail, consider asserting inputLength > 0 at function entry, though callers likely already ensure it.

Would you like me to add a TLLM_CHECK guard at entry?
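The "ignore the last token" chopping behavior can be sketched as follows (an illustrative standalone function; the real `chopVectorIntoBlocks` is templated and has a different signature):

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch only: chop a token sequence into fixed-size blocks,
// dropping the final token because it cannot be recovered on reuse.
std::vector<std::vector<int>> chopIgnoringLastToken(std::vector<int> const& tokens, int tokensPerBlock)
{
    std::vector<std::vector<int>> blocks;
    int const usable = static_cast<int>(tokens.size()) - 1; // ignore last token
    for (int i = 0; i < usable; i += tokensPerBlock)
    {
        int const end = std::min(i + tokensPerBlock, usable);
        blocks.emplace_back(tokens.begin() + i, tokens.begin() + end);
    }
    return blocks;
}
```

With five tokens and two tokens per block, this yields two full blocks and drops the fifth token, which matches the prefill-recovery semantics described above.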


1247-1251: Nit: clarify the boolean literal with a named-argument comment

For consistency with nearby calls that use `/*shareAmongBeams=*/` comments, consider annotating this call as well.

Apply this minimal change for readability:

-    mWindowBlockManagers.at(windowSize).allocateBlock(sequence, false);
+    mWindowBlockManagers.at(windowSize).allocateBlock(sequence, /*shareAmongBeams=*/false);

1847-1866: Use BlockManager::setOffsets wrapper for consistency

These new helpers are great. You already provide BlockManager::setOffsets(..., windowSize); prefer using it instead of reaching into mWindowBlockManagers directly. Keeps the abstraction clean and consistent across the file.

Apply:

-            mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+            setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId, windowSize);

1867-1884: Same as above: route via BlockManager::setOffsets

Mirror the change here to use the existing wrapper.

-        mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+        setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId, windowSize);

1885-1901: Same as above: route via BlockManager::setOffsets

Mirror the change here to use the existing wrapper.

-        mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+        setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId, windowSize);

1902-1927: Comment vs behavior clarity for beam search and SWA

The comment says “shared block is replaced with unshared ones,” but in this PR the addToken path allocates unshared blocks by design. replaceSharedBlock becomes a no-op unless a shared block exists from prefill/full-attention flows. Consider clarifying the comment to avoid confusion.

-                // For beam search, shared block is replaced with unshared ones
+                // For beam search: if a previously shared context block exists (e.g. from prefill/full-attention),
+                // replace it with unshared ones when sliding past the window. No-op if blocks are already unshared.

2311-2330: Simplify condition for releasing last block on token removal

tokensInWindow is modulo maxTokensInWindow, so “<= maxTokensInWindow” is always true. You can simplify the condition.

-        if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow)
+        if (tokensInWindow % getTokensPerBlock() == 0)
         {
             mBlockManager.releaseLastBlock(sequence, windowSize);
         }
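The redundancy can be demonstrated with a minimal standalone sketch (names mirror the snippet above, but none of the real KVCacheManager types are used):

```cpp
// Minimal sketch: decide whether the last block can be released after a token
// is removed. The "tokensInWindow <= maxTokensInWindow" clause is omitted
// because a remainder is always strictly less than its positive modulus.
bool shouldReleaseLastBlock(int numTokens, int tokensPerBlock, int maxTokensInWindow)
{
    int const tokensInWindow = numTokens % maxTokensInWindow;
    return tokensInWindow % tokensPerBlock == 0;
}
```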
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 5b847f4 and 2add619.

📒 Files selected for processing (2)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (10 hunks)
🧰 Additional context used
📓 Path-based instructions
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H without underscore prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (5)

1149-1156: Good separation: per-window reuse routing moved into BlockManager/WindowBlockManager

The split clearly delineates the reuse path and delegates appropriately to the per-window manager. No issues spotted.


1196-1203: API simplification: non-reuse addSequence wrapper is clean

The new signature with isShareLastContextBlock is straightforward and matches the updated call sites.


1204-1219: Edge cases handled: single-block and unsigned underflow are guarded

  • numContextBlocks > 0 asserted
  • Loop allocates numContextBlocks - 1 shared blocks, then allocates the last with the caller’s sharing choice
    This addresses the earlier underflow/single-block concerns.
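Under those assumptions, the allocation pattern reads roughly as follows (an illustrative sketch, not the actual WindowBlockManager code):

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of the described allocation order: the first
// numContextBlocks - 1 context blocks are shared among beams, and the last
// one follows the caller's isShareLastContextBlock choice.
std::vector<bool> planContextBlockSharing(int numContextBlocks, bool isShareLastContextBlock)
{
    assert(numContextBlocks > 0); // mirrors the asserted precondition
    std::vector<bool> shareAmongBeams(numContextBlocks - 1, true);
    shareAmongBeams.push_back(isShareLastContextBlock);
    return shareAmongBeams;
}
```

Note that the single-block case (numContextBlocks == 1) degenerates cleanly: the loop portion is empty and only the caller's sharing choice applies, which is why the earlier underflow concern no longer arises.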

1959-1987: Non-reuse path logic is sound; sharing rule reads clearly

The isShareLastContextBlock decision captures cross-KV, cyclic single-beam, and “full last block” cases. The subsequent per-window offset update centralizes the canonical offsets state. Looks good.
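A condensed sketch of that decision, with illustrative parameter names standing in for the real predicates:

```cpp
// Illustrative sketch of the sharing rule: the last context block is shared
// among beams when the cache is a cross-KV cache, when the cache is cyclic
// with a single beam, or when the last context block is already full.
bool shareLastContextBlock(bool isCrossKv, bool isCyclic, int beamWidth, bool isLastBlockFull)
{
    return isCrossKv || (isCyclic && beamWidth == 1) || isLastBlockFull;
}
```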


1651-1653: Intentional guard: streamLLM (sink attention) disabled via constructor assertion

This matches PR objectives to prune unsupported paths now. Good fail-fast location.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

554-555: Header signature reflects the non-reuse path accurately

Good to expose the simplified non-reuse addSequence with explicit isShareLastContextBlock.


1114-1122: Public, window-aware offset update APIs are clear and cohesive

The trio of updateSequenceCacheBlockOffsets/updateLastCacheBlockOffsets/updateCacheBlockOffsetsAtIdx improves readability of callers and encapsulates per-window routing.

@tensorrt-cicd
Collaborator

PR_Github #15575 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15575 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11728 completed with status: 'FAILURE'

eopXD added 4 commits August 18, 2025 20:21
…nager::updateToken` into `addToken` and `removeToken`

Given that streamLLM is currently broken, remove the related code logic and add
an assertion to guard the sink attention settings. This allows us to revisit
the computation in the future.

Main reasons for this change:
- Token addition and removal logic should be decoupled.
- Flattens the call stack and nested ifs.
- The std::swap usage is bizarre.
- Broken features should be guarded safely.

Rename cache offset bookkeeping utilities for clarity.
These utilities are pushed down from `KVCacheManager` to `BlockManager`:
- `cacheBlockOffsets` --> `updateSequenceCacheBlockOffsets`
- `cacheNewBlockOffsets` --> `updateLastCacheBlockOffsets`
- `updateNewBlockPointer` --> `updateCacheBlockOffsetsAtIdx`

Comments have been added to test cases as reminders of future test
coverage for feature support.

Signed-off-by: eopXD <[email protected]>
…e context to distinguish `BlockManager::addSequence` and `WindowBlockManager::addSequence`

Signed-off-by: eopXD <[email protected]>
…hared last context block logic under `KVCacheManager::addSequence`

The last context block is shared when:
- Operating on a cyclic kv cache and with no beam search
- Operating on cross kv cache
- Last context block is full

Prune incorrect comment left under `KVCacheManager::getNeededBlocksOneStep`.

No need to complicate things.

Signed-off-by: eopXD <[email protected]>
…ache block allocation on token addition for cross kv manager

For an encoder-decoder model, generation tokens are added only to the
decoder and not the encoder.

Signed-off-by: eopXD <[email protected]>
@eopXD eopXD force-pushed the refactor-before-swa-kv-cache-reuse branch from 2add619 to 7e56c2c Compare August 19, 2025 03:21
@eopXD
Collaborator Author

eopXD commented Aug 19, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #15697 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

872-879: Doc added for BlockManager::addSequence is on point; thanks.

This addresses the earlier request to document the new overload and its parameters.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

2664-2664: BeamWidth>1 and sink-attention TODOs: consider explicit test gating.

As previously suggested, prefer GTEST_SKIP() guards (with a short rationale) over TODO comments so CI visibly tracks feature gaps without marking tests “DISABLED” or leaving TODOs. That said, keeping current coverage intact is fine for now.

Also applies to: 4136-4138, 4242-4244

🧹 Nitpick comments (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

553-555: Clarify parameter semantics in WindowBlockManager::addSequence (no-reuse path).

The brief says “Does not try to reuse blocks” but the signature adds isShareLastContextBlock. To avoid confusion, clarify that “sharing” here is intra-request (across beams), not cross-request reuse.

Apply this diff to extend the comment:

-    //! \brief Assign blocks for new sequence. Does not try to reuse blocks.
+    //! \brief Assign blocks for a new sequence without cross-request reuse.
+    //! \param sequence  The GenerationRequest to process.
+    //! \param numContextBlocks  Number of context blocks to allocate per beam.
+    //! \param isShareLastContextBlock  If true, share the last context block across beams (not across requests).
     void addSequence(GenerationRequest& sequence, SizeType32 numContextBlocks, bool isShareLastContextBlock);
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

3497-3501: Disabled sink-token-length test is an acceptable interim gate.

Given sink-attention is temporarily unsupported, disabling the suite is reasonable. When re-enabling, prefer parameterized GTEST_SKIP() over DISABLED_ to keep compilation paths active and intent visible.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)

1157-1194: Guard against inputLength == 0 to avoid unsigned underflow

inputLength - 1 underflows if inputLength == 0 (SizeType32 is unsigned). Add a precondition or compute safely.

Apply this diff:

 void WindowBlockManager::addSequence(
     GenerationRequest& sequence, SizeType32 inputLength, SizeType32 numContextBlocks, LlmRequest& llmRequest)
 {
+    TLLM_CHECK_WITH_INFO(inputLength > 0, "inputLength must be greater than 0");
     auto const requestId = sequence.getRequestId();
     auto const [seqIt, emplaceDone] = mAllocatedBlocksPerSeq.emplace(requestId, std::vector<BlockPtr>{});
     TLLM_CHECK(emplaceDone);
 @@
     // Ignore last token because it can't be recovered
     auto blockedUniqueTokens = chopVectorIntoBlocks<UniqueToken>(uniqueTokens, inputLength - 1, mTokensPerBlock, true);
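The hazard the guard prevents can be shown in isolation (plain `uint32_t` is used here for illustration, standing in for an unsigned length type):

```cpp
#include <cstdint>

// Standalone illustration of the wraparound the guard prevents: subtracting 1
// from an unsigned zero wraps to the type's maximum value instead of -1.
std::uint32_t lengthMinusOne(std::uint32_t inputLength)
{
    return inputLength - 1; // wraps to 4294967295 when inputLength == 0
}
```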

1651-1652: Constructor guard for streamLLM is good; consider richer diagnostic

Asserting once in the ctor is the right place. Consider including the resolved sink lengths in the message for faster diagnostics.

Apply this diff:

-    TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0,
-        "[kv cache manager] streamLLM is not supported at the moment");
+    TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0,
+        "[kv cache manager] streamLLM is not supported at the moment (sinkBlockTokenLength=%d, sinkBubbleLength=%d)",
+        static_cast<int>(mSinkBlockTokenLength), static_cast<int>(mSinkBubbleLength));

2313-2330: removeToken: redundant condition and minor simplification

tokensInWindow <= maxTokensInWindow is always true by construction. Can be dropped for clarity.

Apply this diff:

-        SizeType32 const maxTokensInWindow = metadata.maxTokenNum;
-        SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow;
-        if (tokensInWindow % getTokensPerBlock() == 0 && tokensInWindow <= maxTokensInWindow)
+        SizeType32 const maxTokensInWindow = metadata.maxTokenNum;
+        SizeType32 const tokensInWindow = sequence.getNumTokens() % maxTokensInWindow;
+        if (tokensInWindow % getTokensPerBlock() == 0)
         {
             mBlockManager.releaseLastBlock(sequence, windowSize);
         }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2add619 and 7e56c2c.

📒 Files selected for processing (6)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (10 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (13 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (11 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILE_NAME>_H, with no leading or trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T23:23:27.420Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.420Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

1114-1122: BlockManager offset APIs implemented and legacy helpers removed

  • Confirmed implementations in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:
    BlockManager::updateSequenceCacheBlockOffsets at line 1847
    BlockManager::updateLastCacheBlockOffsets at line 1867
    BlockManager::updateCacheBlockOffsetsAtIdx at line 1885
  • Verified usage in KVCacheManager at lines 1916 and 1986
  • Deprecated helpers (cacheBlockOffsets, cacheNewBlockOffsets, updateNewBlockPointer, updateToken) are no longer referenced
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)

139-139: Signature migration to BlockManager::addSequence w/ isShareLastContextBlock looks correct.

  • The false/true toggles for sharing last context block align with the updated API and the occupancy assertions.
  • Duplicate/oom-path EXPECT_THROW checks still capture the intended failure cases.

Also applies to: 154-154, 168-173, 176-182


2415-2415: Comment-only changes (Full attention).

No action needed.

Also applies to: 3645-3645


2674-2674: Updated free-block accounting after addToken matches the new sharing semantics.

The expectations reflect the decoupled “shared context + per-beam decode” model. Looks good.

Also applies to: 2767-2769

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (7)

1149-1155: Per-window delegation for reuse path looks good

Clear separation of the reuse-enabled overload with explicit window routing. No issues.


1196-1203: Non-reuse addSequence (BlockManager) API simplification is clear

The explicit isShareLastContextBlock flag removes hidden sharing semantics. Delegation per window is correct.


1204-1219: Single-block case and unsigned safety handled

  • Validates numContextBlocks > 0 before looping with numContextBlocks - 1.
  • Always allocates the last block honoring isShareLastContextBlock.

Looks correct after the earlier discussion.


1847-1865: Per-window offsets update: good layering

Routing via BlockManager::updateSequenceCacheBlockOffsets to per-window setOffsets improves modularity and keeps window concerns localized. No issues.


1867-1883: Targeted last-block offsets update is correct

Only computes offsets for the last block across beams, avoiding redundant work. Delegation to window manager is consistent.


1885-1900: Index-specific offsets update is correct

The index-based update delegates per-window and iterates beams. API shape and usage look consistent.


1939-1987: Non-reuse addSequence: sharing rule simplified and per-window offsets updated

  • The isShareLastContextBlock condition is explicit and easier to reason about.
  • Offsets are updated per window after allocation.

This aligns with the PR goal of simplifying non-reuse flows.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15697 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11806 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@shivghai
Copy link

Just here to say I'm looking forward to this landing!

Labels
KV-Cache Management kv-cache management for efficient LLM inference