[TRTLLM-6341][feature] Support SWA KV cache #6768
Actionable comments posted: 7
🔭 Outside diff range comments (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
17-17: Use required header guard instead of pragma once

Per coding guidelines, headers must use a guard with prefix TRTLLM_. Replace `#pragma once` with a proper include guard:

```diff
-#pragma once
+#ifndef TRTLLM_KVCACHEMANAGER_H
+#define TRTLLM_KVCACHEMANAGER_H
```

And add the closing guard at the end of the file:

```diff
+#endif // TRTLLM_KVCACHEMANAGER_H
```
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
1627-1681: Potential out-of-bounds when allocatedBlocks is empty before getAllSequenceBlocks()

If `blockKeys.size() > allocatedBlocks.size()` and `allocatedBlocks` is empty, `allocatedBlocks.back()` is undefined behavior. This could happen in edge cases where the sequence did not allocate blocks (e.g., early exit or immediate detach after minimal generation).

Guard against an empty vector:

```diff
     if (blockKeys.size() > allocatedBlocks.size())
     {
-        // For SWA, get all blocks in the sequence.
-        allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
+        // For SWA, get all blocks in the sequence if we have a tail; otherwise, nothing to store.
+        if (!allocatedBlocks.empty())
+        {
+            allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
+        }
+        else
+        {
+            TLLM_LOG_DEBUG("%s::releaseBlocks - no allocated blocks to store for reuse", mLogPrefix.c_str());
+        }
     }
```
1913-1964: willCrossWindowBlockBoundary condition is always false (logic bug)

`maxSeqlenInBlocks` and `numTotalBlocksPerBeam` are computed from the same expression, so `maxSeqlenInBlocks > numTotalBlocksPerBeam` can never be true. As a result, `numExtraBlocksPerBeam` is always 0, defeating the SWA extra-block logic.

Fix by comparing block counts within the window, not the unconstrained sequence:

```diff
-    SizeType32 const maxSeqlenInBlocks = tc::ceilDiv(req.mPromptLen + req.mMaxNewTokens, getTokensPerBlock());
-    auto const willCrossBlockBoundary = maxSeqlenInBlocks > currentSeqlenInBlocks;
-    auto const willCrossWindowBlockBoundary = maxSeqlenInBlocks > numTotalBlocksPerBeam;
+    auto const maxSeqlen = std::min(req.mPromptLen + req.mMaxNewTokens, windowSize);
+    SizeType32 const maxSeqlenInBlocks = tc::ceilDiv(maxSeqlen, getTokensPerBlock());
+    auto const willCrossBlockBoundary = maxSeqlenInBlocks > currentSeqlenInBlocks;
+    auto const windowCurrBlocks = tc::ceilDiv(std::min(req.getNumTokens(0), windowSize), getTokensPerBlock());
+    auto const windowMaxBlocks = tc::ceilDiv(std::min(req.mPromptLen + req.mMaxNewTokens, windowSize), getTokensPerBlock());
+    auto const willCrossWindowBlockBoundary = windowMaxBlocks > windowCurrBlocks;
```

This preserves the intended "cross both block and window boundaries" semantics.
🧹 Nitpick comments (7)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
341-341: Optional: inline the constant to reduce coupling

If `BlockManager::isUseOneMoreBlock()` is permanently a constant false, consider inlining here for clarity and to decouple this constructor from BlockManager internals. Easy to revert if behavior changes later.

```diff
-    auto const isUseOneMoreBlock = kv_cache_manager::BlockManager::isUseOneMoreBlock();
+    constexpr bool isUseOneMoreBlock = false;
```

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
77-80: Avoid magic -1; define a typed invalid ID sentinel

Counting valid blocks by comparing to a raw -1 is brittle and type-unsafe. Define a typed sentinel once and reuse it here.

Apply within the helper:

```diff
-SizeType32 getNumValidBlocks(std::vector<SizeType32> const& blockIds)
-{
-    return std::count_if(blockIds.begin(), blockIds.end(), [](SizeType32 id) { return id != -1; });
-}
+SizeType32 getNumValidBlocks(std::vector<SizeType32> const& blockIds)
+{
+    constexpr auto kInvalidBlockId = static_cast<SizeType32>(-1);
+    return std::count_if(blockIds.begin(), blockIds.end(), [&](SizeType32 id) { return id != kInvalidBlockId; });
+}
```

If feasible longer-term, consider exposing a dedicated invalid sentinel for block IDs (aligned with `KVCacheBlock::IdType`) to avoid relying on implicit -1 semantics.
2690-2690: Remove stray debug print in tests

This `std::cerr` debug output will add noise to test logs. Please drop it or guard it behind a verbose flag.

```diff
-    std::cerr << "numBlocksPerSeq: " << numBlocksPerSeq << std::endl;
+    // (optional) enable only for local debugging
+    // std::cerr << "numBlocksPerSeq: " << numBlocksPerSeq << std::endl;
```

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
56-59: Clarify rationale and scope of kSWAExtraBlock

The extra buffer block for SWA is reasonable; add a short note tying it to worst-case detachment/attachment during the window slide and how it figures into capacity calculations (e.g., calculateMaxBlockRequirements*).
102-110: Make toString() const

This printer does not mutate state and should be marked const.

```diff
-    std::string toString()
+    std::string toString() const
```
1123-1127: isUseOneMoreBlock permanently false; tidy up or document

If the extra-block decision is now fully handled elsewhere (via kSWAExtraBlock), consider removing this method entirely or documenting its deprecation to avoid confusion.
2-2: Update copyright year

Guidelines require the current year in OSS headers. Consider updating 2024 to 2025 if this file is modified in 2025.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)

- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (15 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (30 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (62 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (5)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (15)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
341-341: Signature update to BlockManager::isUseOneMoreBlock() looks correct

Call site updated to the new zero-arg static method. No other logic change here; the value is still passed through to mRewindInputs.

341-349: Confirm isUseOneMoreBlock is always false and threaded through to the kernel
- BlockManager::isUseOneMoreBlock() is now a zero-arg static that unconditionally returns false (kvCacheManager.h:1123–1127).
- No legacy call sites still passing arguments were found.
- The flag is still propagated via mRewindInputs.isUseOneMoreBlock into invokeUpdateKVBlockArrayDraftTokenLocation (trtGptModelInflightBatching.cpp:2472).
Since “one-more-block” mode is effectively disabled, please review the index math in both rewindKVCacheBlocks() and invokeUpdateKVBlockArrayDraftTokenLocation() to ensure that removing the cyclic layout hasn’t introduced any off-by-one errors.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5)
342-343: Constructor initializes mNumFrontBlocksRemoved

Tracking front-detached blocks explicitly is correct under SWA.
576-586: New detach/addBlockIfNeeded/cacheNewBlockOffset APIs are good for SWA

Publicly exposing block detachment and conditional allocation aligns with linearized reuse. Ensure implementations:
- Call clearCacheBlocks/reset counters appropriately when sequences are fully reshaped.
- Keep offset tables consistent across pools when beams share/duplicate blocks.
If needed, I can audit the .cpp implementations for these invariants.
597-598: Free-block query by cache level is a nice addition

The cache-level overload eliminates ambiguity between primary and secondary pools.
936-951: Propagating cache-level awareness to per-window queries looks good

This unblocks precise telemetry and simplifies test assertions.
1144-1149: Expose sequence offset caching and conditional allocation at BlockManager

Good API surface for KVCacheManager to drive the SWA lifecycle. Ensure thread-safety around mSequences/mWindowManagers when called from addToken.
Happy to follow through with a quick lock-audit if you want.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8)
578-589: Constructor metadata computation switch to maxTokensPerSeq looks good

Using maxSequenceLength.value_or(0) to derive maxTokensPerSeq and maxBlocksPerSeq is consistent with the header changes and avoids stale terminology.
1295-1313: Verify detach threshold logic for SWA

The minTokensForBlockDetach = windowSize + tokensPerBlock threshold intentionally delays detachment to compensate for loading the full context instead of just the last window. This is called out as a hot-fix in comments. Please confirm this behavior matches test expectations for:
- context length slightly above window size,
- exact window boundary,
- large contexts (multiple window slides).
If mismatches arise, consider computing detachment strictly based on tokens within the window (e.g., based on the number of blocks representing min(windowSize, numTokens)), and adjusting context loading accordingly.
1408-1473: storeBlocks now returns the number of blocks stored (LGTM)

Counting and returning numBlocksStoredForReuse is correct and aligns with the updated API. Event emission and hash updates are preserved.
1523-1526: Accessor for free blocks (LGTM)

Forwarding getNumFreeBlocks to the eviction policy is straightforward.
2004-2045: Detach path still requires full offset re-cache

Even with the change above, this block's local logic removes the front block and erases from allocatedBlocks. Ensure offsets are re-cached for all blocks post-detach (not just the newest). The suggested change in addBlockIfNeeded() addresses this; otherwise, replicate the full re-cache here.
Would you prefer re-caching all offsets in addBlockIfNeeded() (single place) or also redundantly here to be extra safe?
2618-2632: Guard against a negative result when subtracting kSWAExtraBlock

When outputBlockRequirements <= blockCapacity does not hold, ((blockCapacity / beamWidth) - kSWAExtraBlock) can become negative for small capacities, producing a nonsensical negative token budget.

Clamp to zero before multiplying:

```diff
-    return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock;
+    auto const perBeamBlocks = std::max<SizeType32>(0, (blockCapacity / beamWidth) - kSWAExtraBlock);
+    return perBeamBlocks * tokensPerBlock;
```

Please confirm unit tests cover low-capacity edge cases.
1966-1985: cacheSequenceBlockOffsets() addition (LGTM)

The full re-cache helper is correct and will be useful, especially upon detach. It is fine to call this after addSequence and after detach scenarios.
2158-2165: storeContextBlocks gated by reuse and non-dummy (LGTM)

This avoids unnecessary hash/trie inserts when reuse is disabled or for dummy requests.
d751a9b to 93e91ab (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (6)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
415-421: clearCacheBlocks should reset front-detach counter

After clearing per-beam block IDs, mNumFrontBlocksRemoved must be reset; otherwise subsequent removeFrontBlock calls will index past the beginning.

```diff
 void clearCacheBlocks(SizeType32 windowSize)
 {
     for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
     {
         beamBlockIds.clear();
     }
+    mNumFrontBlocksRemoved = 0;
 }
```
423-432: Bounds-check and handle empty beams when removing the front block

If removeFrontBlock is called more times than there are assigned blocks, this will write out of bounds. Also keep the logic robust for empty beams.

```diff
 void removeFrontBlock(SizeType32 windowSize)
 {
     for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
     {
-        // Does not actually remove from mCacheBlockIds like removeLastBlock
-        // Id is set to -1 instead.
-        beamBlockIds[mNumFrontBlocksRemoved] = -1;
+        if (mNumFrontBlocksRemoved < static_cast<SizeType32>(beamBlockIds.size()))
+        {
+            // Mark as invalid without shrinking; SWA keeps layout length stable.
+            beamBlockIds[mNumFrontBlocksRemoved] = -1;
+        }
+        // else: nothing to mark for this beam (already all detached)
     }
     ++mNumFrontBlocksRemoved;
 }
```

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
159-186: Prevent size_t underflow in getAllSequenceBlocks()

currentIndex is initialized with blockCount - 1 even though blockCount can be 0. While the loop short-circuits, this underflow is undesirable and may trigger static-analysis warnings.

```diff
 std::vector<BlockPtr> getAllSequenceBlocks(BlockPtr lastBlock)
 {
     // First count the number of blocks to pre-allocate the vector
     auto currentBlock = lastBlock;
     size_t blockCount = 0;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
         blockCount++;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }
+    if (blockCount == 0)
+    {
+        return {};
+    }
     // Create and pre-allocate the vector with the correct size
     std::vector<BlockPtr> sequenceBlocks(blockCount);
     // Now traverse backwards and fill from the end
     currentBlock = lastBlock;
-    size_t currentIndex = blockCount - 1;
+    size_t currentIndex = blockCount;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
-        sequenceBlocks[currentIndex--] = currentBlock;
+        sequenceBlocks[--currentIndex] = currentBlock;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }
     return sequenceBlocks;
 }
```
1241-1282: Page-table offsets become stale after detaching the front block

When shouldDetachBlock is true, removeFrontBlock shifts the index of every remaining block. Calling cacheNewBlockOffset() only updates the last block and leaves indices [0..n-2] stale, which will corrupt page-table reads.

Update offsets for the entire sequence after detaching:

```diff
 void WindowBlockManager::addBlockIfNeeded(GenerationRequest& sequence, bool isEnableBlockReuse)
 {
     // ... existing code ...
-    if (shouldAllocateBlock || shouldDetachBlock)
-    {
-        cacheNewBlockOffset(sequence);
-    }
+    // Update KV cache indices
+    if (shouldAllocateBlock && !shouldDetachBlock)
+    {
+        // Only appended a block; update just the last entry
+        cacheNewBlockOffset(sequence);
+    }
+    if (shouldDetachBlock)
+    {
+        // Recompute all offsets because front removal shifts indices
+        auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+        auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+        auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+        auto const& offsetsShape = cacheBlocksTensor.getShape();
+        for (SizeType32 beamIdx = 0; beamIdx < sequence.getBeamWidth(); ++beamIdx)
+        {
+            auto const& beamCacheBlock = cacheBlocks[beamIdx];
+            for (SizeType32 blockIdx = 0; blockIdx < static_cast<SizeType32>(beamCacheBlock.size()); ++blockIdx)
+            {
+                setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, beamCacheBlock[blockIdx]);
+            }
+        }
+    }
 }
```
2001-2009: Remove duplicate variable declaration in addToken()

sequence is declared twice, causing a compile error.

```diff
 void KVCacheManager::addToken(RequestIdType requestId)
 {
     TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0,
         "[kv cache manager] streamLLM is not supported at the moment");
-    auto& sequence = getSequence(requestId);
     auto& sequence = getSequence(requestId);
     TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[kv cache manager] addToken does not support beamWidth > 1");
     sequence.addNewTokens(1);
     mBlockManager.addBlockIfNeeded(sequence, mEnableBlockReuse);
 }
```
2399-2418: Remove duplicate variable declaration in removeToken()

sequence is declared twice, causing a compile error.

```diff
 void KVCacheManager::removeToken(RequestIdType requestId)
 {
     TLLM_CHECK_WITH_INFO(mSinkBlockTokenLength == 0 && mSinkBubbleLength == 0,
         "[kv cache manager] streamLLM is not supported at the moment");
-    auto& sequence = getSequence(requestId);
     auto& sequence = getSequence(requestId);
     if (sequence.getNumTokens() == 0)
     {
         return;
     }
     TLLM_CHECK_WITH_INFO(sequence.getBeamWidth() == 1, "[kv cache manager] removeToken does not support beamWidth > 1");
     sequence.removeTokens(1);
     // ... rest of the function
 }
```
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
983-989: Consider implementing a smarter eviction policy for offloading

The comment indicates that the current default behavior offloads out-of-window blocks to secondary memory without considering their usefulness. This could lead to unnecessary memory traffic. Consider an eviction policy that evaluates block reuse potential before offloading.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)

- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (15 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (25 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (14 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (65 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2 hunks)
✅ Files skipped from review due to trivial changes (2)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
- cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
👮 Files not reviewed due to content moderation or server errors (3)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧰 Additional context used
📓 Path-based instructions (3)

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md): the same path-based guidelines quoted in the first review round apply (C++ style rules for **/*.{cpp,h,hpp,cc,cxx}, NVIDIA copyright headers for **/*.{cpp,h,hpp,cc,cxx,cu,py}, and TRTLLM_ header guards for **/*.{h,hpp}).

Files covered this round:

- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧬 Code Graph Analysis (4)

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (8): KVCacheManager (1661-1672, 1674-1688, 1690-1720, 1722-1735), calculateMaxBlockRequirementsPerBeam (2492-2504, 2492-2493), calculateMaxBlockRequirements (2506-2538, 2506-2507)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2): createKvCacheManager (624-722, 624-626)

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
- cpp/include/tensorrt_llm/runtime/decodingInput.h (1): sinkTokenLength (47-47)

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (35): nodiscard (1010-1017, 1482-1485, 2540-2559), addSequence (1188-1192, 1188-1189, 1196-1231, 1196-1197, 1286-1290, 1286-1287, 1294-1310, 1294-1295, 2018-2079, 2018-2019), releaseBlocks (1492-1500, 1492-1492, 1586-1637, 1586-1586), schedulingReleaseBlocks (1639-1645, 1639-1639, 1647-1659, 1647-1647), releaseLastBlock (1458-1461, 1458-1458, 1463-1480, 1463-1463), detachBlock (1960-1999, 1960-1960), cacheNewBlockOffset (1942-1958, 1942-1942), getNumFreeBlocks (1482-1482), storeBlocks (1370-1433, 1370-1371), KVCacheBlock (257-270), cacheSequenceBlockOffsets (1922-1940, 1922-1922)
- cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp (2): getNumFreeBlocks (179-182, 179-179)

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17): storeBlocks (910-914), nodiscard (215-1154, 586-594, 608-616, 618-626, 628-631, 638-641, 643-653, 664-667, 669-672, 684-687, 692-695, 736-747, 938-941, 943-956, 958-971, 973-979)
- cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp (2): getNumFreeBlocks (179-182, 179-179)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1241-1282: I need to inspect the WindowBlockManager class to verify whether its block pools are protected by any mutex during addBlockIfNeeded.
93e91ab to b8fa5b6 (Compare)
Actionable comments posted: 1
🔭 Outside diff range comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1991-2008
: Compile error: misuse of mWindowSize and wrong setOffsets overloadThis BlockManager method references mWindowSize (not a member of BlockManager) and calls setOffsets without the windowSize parameter. Use the provided windowSize argument and route to the WindowBlockManager overload.
Apply:
-void BlockManager::updateLastCacheBlockOffsets(GenerationRequest& sequence, SizeType32 windowSize) -{ - auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize); - auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize); +void BlockManager::updateLastCacheBlockOffsets(GenerationRequest& sequence, SizeType32 windowSize) +{ + auto const& cacheBlocks = sequence.getCacheBlockIds(windowSize); + auto& cacheBlocksTensor = sequence.getCacheBlockIndices(windowSize); auto const beamWidth = sequence.getBeamWidth(); auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor); auto const& offsetsShape = cacheBlocksTensor.getShape(); for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx) { auto const& beamCacheBlock = cacheBlocks[beamIdx]; auto const blockId = beamCacheBlock.back(); auto const blockIdx = static_cast<SizeType32>(beamCacheBlock.size() - 1); - setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId); + mWindowBlockManagers.at(windowSize).setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId); } }cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
52-59
: Fix: SizeType32 alias used before declaration (compile error).
kPrimaryLevel/kSecondaryLevel/kSWAExtraBlock are defined before SizeType32 is introduced (the using alias appears at line 65). This will not compile.
Apply this diff to use the fully qualified type in these constant definitions:
```diff
-static constexpr SizeType32 kPrimaryLevel = 0;
+static constexpr tensorrt_llm::runtime::SizeType32 kPrimaryLevel = 0;
-static constexpr SizeType32 kSecondaryLevel = 1;
+static constexpr tensorrt_llm::runtime::SizeType32 kSecondaryLevel = 1;
 // Extra block buffer allocated for SWA to be able to always keep "window size"
 // tokens held in the blocks.
-static constexpr SizeType32 kSWAExtraBlock = 1;
+static constexpr tensorrt_llm::runtime::SizeType32 kSWAExtraBlock = 1;
```
621-624
: Correct allocated-blocks computation under multi-level pools.
getNumAllocatedBlocks() currently uses getMaxNumBlocks() - getNumFreeBlocks(), but getNumFreeBlocks() now defaults to primary-level only. This makes the computation inconsistent with total max blocks when secondary is present. Compute primary usage explicitly.
Apply this diff:
```diff
-    [[nodiscard]] SizeType32 getNumAllocatedBlocks() const noexcept
-    {
-        return getMaxNumBlocks() - getNumFreeBlocks();
-    }
+    [[nodiscard]] SizeType32 getNumAllocatedBlocks() const noexcept
+    {
+        // Report primary-level allocation to match free-block reporting defaults.
+        return getNumPrimaryBlocks() - getNumFreeBlocks(kPrimaryLevel);
+    }
```
1441-1445
: Align "used blocks" with primary-level semantics.
getUsedNumBlocks() returns BlockManager::getNumAllocatedBlocks(), which currently subtracts primary free blocks from a possibly combined (primary+secondary) maximum, leading to inconsistent stats. Compute primary used blocks directly.
Apply this diff:
```diff
-    [[nodiscard]] SizeType32 getUsedNumBlocks() const override
-    {
-        return mBlockManager.getNumAllocatedBlocks();
-    }
+    [[nodiscard]] SizeType32 getUsedNumBlocks() const override
+    {
+        return mBlockManager.getNumPrimaryBlocks() - mBlockManager.getNumFreeBlocks(kPrimaryLevel);
+    }
```
1481-1497
: Fix KvCacheStats consistency (primary-level vs. total).
KvCacheStats comments say maxNumBlocks refers to the primary pool. Ensure all reported metrics are aligned to the primary level to avoid confusing stats when a secondary pool exists.
Apply this diff:
```diff
-    kvCacheStats.maxNumBlocks = getMaxNumBlocks();
-    kvCacheStats.freeNumBlocks = getNumFreeBlocks();
-    kvCacheStats.usedNumBlocks = getUsedNumBlocks();
+    kvCacheStats.maxNumBlocks = mBlockManager.getNumPrimaryBlocks();
+    kvCacheStats.freeNumBlocks = mBlockManager.getNumFreeBlocks(kPrimaryLevel);
+    kvCacheStats.usedNumBlocks = kvCacheStats.maxNumBlocks - kvCacheStats.freeNumBlocks;
```
The remaining fields can stay unchanged.
♻️ Duplicate comments (7)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
2864-2873
: Stop inferring "allocated primary blocks" by subtracting secondary capacity.
Computing allocated primary blocks as getNumAllocatedBlocks() - blocksInSecondaryPool assumes that the aggregated count includes secondary and that the secondary pool is "full," which couples tests to BlockManager internals.
Expose an API that reports allocated-by-level to make tests precise and resilient. For example:
- BlockManager::getNumAllocatedBlocks(SizeType32 cacheLevel)
- WindowBlockManager::getNumAllocatedBlocks(SizeType32 cacheLevel)
Then use getNumAllocatedBlocks(kPrimaryLevel) here.
If you want, I can draft the change across BlockManager/WindowBlockManager and update affected tests.
3030-3033
: Same issue: don't derive primary allocation counts via subtraction.
This repeats the subtract-secondary pattern. Prefer a direct API: getNumAllocatedBlocks(kPrimaryLevel).
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
159-186
: Fix potential size_t underflow in getAllSequenceBlocks.
When there are zero blocks, initializing currentIndex with blockCount - 1 underflows. Guard zero and use pre-decrement indexing.
Apply:
```diff
 std::vector<BlockPtr> getAllSequenceBlocks(BlockPtr lastBlock)
 {
     // First count the number of blocks to pre-allocate the vector
     auto currentBlock = lastBlock;
     size_t blockCount = 0;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
         blockCount++;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }
-    // Create and pre-allocate the vector with the correct size
-    std::vector<BlockPtr> sequenceBlocks(blockCount);
+    // Create and pre-allocate the vector with the correct size
+    if (blockCount == 0)
+    {
+        return {};
+    }
+    std::vector<BlockPtr> sequenceBlocks(blockCount);
     // Now traverse backwards and fill from the end
     currentBlock = lastBlock;
-    size_t currentIndex = blockCount - 1;
+    size_t currentIndex = blockCount;
     while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
     {
-        sequenceBlocks[currentIndex--] = currentBlock;
+        sequenceBlocks[--currentIndex] = currentBlock;
         currentBlock = currentBlock->getPrevBlockInSeq();
     }
     return sequenceBlocks;
 }
```
1241-1282
: Undefined cacheNewBlockOffset; update KV offsets correctly after detach/allocate.
cacheNewBlockOffset(sequence) is not defined and will not compile. Also, when detaching the front block under SWA, page-table offsets shift for all remaining blocks, so updating only the last block is insufficient.
Update offsets in-place here:
- If shouldDetachBlock: recompute offsets for all blocks/beams.
- Else if shouldAllocateBlock: update just the last block.
Apply:
```diff
-    if (shouldAllocateBlock || shouldDetachBlock)
-    {
-        cacheNewBlockOffset(sequence);
-    }
+    // Refresh KV cache indices
+    if (shouldDetachBlock)
+    {
+        // Detaching the front block shifts all indices; recompute all offsets.
+        auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+        auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+        auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+        auto const& offsetsShape = cacheBlocksTensor.getShape();
+        auto const beamWidth = sequence.getBeamWidth();
+        for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx)
+        {
+            auto const& beamCacheBlock = cacheBlocks[beamIdx];
+            for (SizeType32 blockIdx = 0; blockIdx < static_cast<SizeType32>(beamCacheBlock.size()); ++blockIdx)
+            {
+                setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, beamCacheBlock[blockIdx]);
+            }
+        }
+    }
+    else if (shouldAllocateBlock)
+    {
+        // Only the last block appended needs offsets.
+        auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+        auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+        auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+        auto const& offsetsShape = cacheBlocksTensor.getShape();
+        auto const beamWidth = sequence.getBeamWidth();
+        for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx)
+        {
+            auto const& beamCacheBlock = cacheBlocks[beamIdx];
+            if (beamCacheBlock.empty())
+            {
+                continue;
+            }
+            auto const blockIdx = static_cast<SizeType32>(beamCacheBlock.size() - 1);
+            auto const blockId = beamCacheBlock.back();
+            setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+        }
+    }
```
This also addresses the previously raised "stale offsets after detaching the front block" issue.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
421-427
: Reset front-detach counter when clearing cache blocks.
After clearing per-beam block IDs, mNumFrontBlocksRemoved must be reset; otherwise subsequent removeFrontBlock calls will index past the beginning.
Apply this diff:
```diff
 void clearCacheBlocks(SizeType32 windowSize)
 {
     for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
     {
         beamBlockIds.clear();
     }
+    mNumFrontBlocksRemoved = 0;
 }
```
1109-1113
: Clarify intent or make isUseOneMoreBlock configurable.
This now always returns false with only a placeholder comment. It influences block budgeting hooks; please document why it's disabled or wire it to a config if needed. If truly obsolete, remove and clean up call sites.
Would you like me to draft a small config flag (env/ctor) plumbed into BlockManager to control this?
429-438
: Bounds-check when marking front block as removed; avoid magic -1.
If removeFrontBlock is called more times than there are assigned blocks, this writes out-of-bounds. Also, prefer a named invalid-id sentinel over magic -1.
Apply this diff:
```diff
-    void removeFrontBlock(SizeType32 windowSize)
-    {
-        for (auto& beamBlockIds : mCacheBlockIds.at(windowSize))
-        {
-            // Does not actually remove from mCacheBlockIds like removeLastBlock
-            // Id is set to -1 instead.
-            beamBlockIds[mNumFrontBlocksRemoved] = -1;
-        }
-        ++mNumFrontBlocksRemoved;
-    }
+    void removeFrontBlock(SizeType32 windowSize)
+    {
+        auto& perBeam = mCacheBlockIds.at(windowSize);
+        for (auto& beamBlockIds : perBeam)
+        {
+            if (mNumFrontBlocksRemoved < static_cast<SizeType32>(beamBlockIds.size()))
+            {
+                // Mark as invalid without shrinking; SWA keeps layout length stable.
+                beamBlockIds[mNumFrontBlocksRemoved] = KVCacheBlock::kInvalidBlockId;
+            }
+            // else: nothing to mark for this beam (already all detached)
+        }
+        ++mNumFrontBlocksRemoved;
+    }
```
Additionally, define a dedicated invalid id sentinel to avoid reusing other sentinels:
```cpp
// Add inside class KVCacheBlock public section (near kCachedBlocksRootId)
static constexpr IdType kInvalidBlockId = std::numeric_limits<IdType>::min();
```
🧹 Nitpick comments (15)
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (2)
751-753
: Restore streaming-param coverage via GTEST_SKIP instead of removing the parameter.
Dropping streaming from the parameterization reduces coverage and can mask regressions. Prefer to keep the parameter and skip at runtime until streamLLM is supported.
Apply this diff to restore the parameter:
```diff
-    // TODO: Support and add coverage for streamLLM
-    testing::Values(false),
+    // TODO: Support and add coverage for streamLLM
+    testing::Values(false, true),
```
Additionally, add an early guard in the test body to avoid failures until support lands (outside the selected range; place near the top of createResponse()):
```cpp
// Early skip for streaming path until streamLLM is supported in tests.
if (streaming)
{
    GTEST_SKIP() << "TODO(TRTLLM-XXXX): enable streaming coverage when streamLLM is supported.";
}
```
513-515
: Fix typo: beamWdith → beamWidth.
Minor naming nit for readability and consistency.
Apply this diff:
```diff
-    auto const beamWdith = std::get<3>(info.param);
+    auto const beamWidth = std::get<3>(info.param);
 ...
-    name += "Bw" + std::to_string(beamWdith);
+    name += "Bw" + std::to_string(beamWidth);
```
Also applies to: 529-529
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (3)
373-407
: Remove std::cerr debug prints from unit tests.
These ad-hoc prints add noisy output and can flake CI logs. Prefer GTest SCOPED_TRACE or dedicated logging macros, or just remove them.
Apply this diff to remove the debug prints:
```diff
-    std::cerr << "isContextInitState, itCount: " << itCount << std::endl;
 ...
-    std::cerr << "isFirstContextChunk, itCount: " << itCount << std::endl;
 ...
-    std::cerr << "addNewTokens, itCount: " << itCount << std::endl;
 ...
-    std::cerr << "generation phase, itCount: " << itCount << std::endl;
-    std::cerr << "add token for kvCacheManager" << std::endl;
 ...
-    std::cerr << "add token for crossKvCacheManager" << std::endl;
 ...
-    std::cerr << "add token for llmReq" << std::endl;
 ...
-    std::cerr << "GENERATION_COMPLETE" << std::endl;
```
If you want structured context on failure, wrap specific expectations with SCOPED_TRACE instead.
Also applies to: 413-422, 426-427, 413-427
465-466
: Consolidate repeated TODOs for sinkTokenLen > 0 coverage and track with an issue.
Multiple TODOs across tests signal temporarily reduced coverage. Consolidate this into a single fixture-level toggle or a parameterized/typed test to avoid duplication, and link to a tracking issue (e.g., TRTLLM-XXXX).
Consider:
- A single constexpr flag or environment-driven toggle in the test fixture.
- A typed test with sinkTokenLen as a parameter; temporarily constrain its values to {0} and re-enable {>0} later.
- Add a single comment at the fixture level with the tracking ID instead of repeating per-test.
Also applies to: 519-520, 609-610, 719-721, 845-852, 1101-1103, 1195-1196, 1289-1291, 1510-1511, 1529-1531, 1580-1582
845-852
: Document removal of configurations and add a re-enable plan.
You've removed specific configurations (e.g., (kMAX_UTILIZATION, 4, 125) and (kGUARANTEED_NO_EVICT, 1, 160)). Please add a brief comment with a tracking ticket and rationale to ease re-enabling later.
Example:
```cpp
// TODO(TRTLLM-XXXX): Re-enable sinkTokenLen>0 cases after SWA reuse path supports bubbles;
// removed temporarily to stabilize CI (see PR 6768).
```
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
523-526
: Float equality check is brittle; use a tolerance.
TLLM_CHECK(total == 1.0f) risks false negatives due to float rounding. Prefer a tolerance, e.g., fabs(total - 1.0f) < 1e-6f.
I can submit a small patch if you want this changed globally in this helper.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (9)
56-59
: Nit: Constant naming style.
Per guidelines, constants use kPREFIXED_UPPER_SNAKE. Consider renaming kSWAExtraBlock to kSWA_EXTRA_BLOCK for consistency. This would require touching all references (e.g., .cpp sites using this symbol), so it can be a follow-up cleanup.
102-109
: Make WindowSizeMetadata::toString() const.
The method does not mutate state and should be marked const.
Apply this diff:
```diff
-    std::string toString()
+    std::string toString() const
```
347-349
: Avoid needless copy in structured binding.
Use a const reference to avoid copying metadata on each iteration.
Apply this diff:
```diff
-    for (auto const [windowSize, metadata] : windowSizeToMetadata)
+    for (auto const& [windowSize, metadata] : windowSizeToMetadata)
```
748-755
: Remove redundant clamp and fix type safety.
The guard ensures inputs->maxInputLen > mWindowSize, so window is non-negative. The std::max(window, 0) mixes signed/unsigned and is unnecessary.
Apply this diff:
```diff
-    window = std::max(window, 0); // clamp negative values to 0
+    // window is non-negative due to the guard above.
```
Alternatively, if you prefer an explicit clamp, use:
```cpp
std::max(window, static_cast<SizeType32>(0))
```
734-736
: Propagate storeBlocks() return value to callers.
WindowBlockManager::storeBlocks now returns the number of actual blocks stored. BlockManager::storeBlocks discards this. Consider adding a returning variant in BlockManager (and/or KVCacheManager) to preserve the signal for telemetry or decision-making.
One option:
```diff
-    void storeBlocks(std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds,
-        SizeType32 windowSize)
-    {
-        mWindowBlockManagers.at(windowSize).storeBlocks(blockKeys, blockIds);
-    }
+    SizeType32 storeBlocks(std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds,
+        SizeType32 windowSize)
+    {
+        return mWindowBlockManagers.at(windowSize).storeBlocks(blockKeys, blockIds);
+    }
```
Update call sites accordingly.
1391-1394
: Typo: copyOnpartialReuse → copyOnPartialReuse.
Parameter naming should be consistent (lowerCamelCase) and match similar usage elsewhere (e.g., WindowBlockManager). Please fix the capitalization.
Apply this diff:
```diff
-    std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
-    bool copyOnpartialReuse = true);
+    std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
+    bool copyOnPartialReuse = true);
```
1401-1404
: Typo: copyOnpartialReuse → copyOnPartialReuse (constructor overload).
Apply this diff:
```diff
-    std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
-    bool copyOnpartialReuse = true);
+    std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
+    bool copyOnPartialReuse = true);
```
1411-1414
: Typo: copyOnpartialReuse → copyOnPartialReuse (constructor overload).
Apply this diff:
```diff
-    std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
-    bool copyOnpartialReuse = true);
+    std::shared_ptr<KVCacheEventManager> eventManager = nullptr, bool enablePartialReuse = true,
+    bool copyOnPartialReuse = true);
```
1419-1422
: Typo: copyOnpartialReuse → copyOnPartialReuse (constructor overload).
Apply this diff:
```diff
-    bool enablePartialReuse = true, bool copyOnpartialReuse = true);
+    bool enablePartialReuse = true, bool copyOnPartialReuse = true);
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(15 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(25 hunks)cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(17 hunks)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
(64 hunks)cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code Graph Analysis (4)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
KVCacheManager
(1711-1722)KVCacheManager
(1724-1738)KVCacheManager
(1740-1772)KVCacheManager
(1774-1787)tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
KVCacheManager
(140-930)cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
createKvCacheManager
(624-722)createKvCacheManager
(624-626)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (26)
addSequence
(1188-1192)addSequence
(1188-1189)addSequence
(1196-1231)addSequence
(1196-1197)addSequence
(1286-1290)addSequence
(1286-1287)addSequence
(1294-1310)addSequence
(1294-1295)addSequence
(2107-2165)addSequence
(2107-2108)detachBlock
(2026-2065)detachBlock
(2026-2026)addBlockIfNeeded
(1233-1239)addBlockIfNeeded
(1233-1233)addBlockIfNeeded
(1241-1282)addBlockIfNeeded
(1241-1241)getNumFreeBlocks
(1532-1532)storeBlocks
(1370-1433)storeBlocks
(1370-1371)KVCacheBlock
(257-270)updateSequenceCacheBlockOffsets
(1971-1989)updateSequenceCacheBlockOffsets
(1971-1971)updateLastCacheBlockOffsets
(1991-2007)updateLastCacheBlockOffsets
(1991-1991)updateCacheBlockOffsetsAtIdx
(2009-2024)updateCacheBlockOffsetsAtIdx
(2009-2009)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength
(47-47)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
kvCacheManager
(2969-2971)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (16)
nodiscard
(215-1175)nodiscard
(594-602)nodiscard
(616-624)nodiscard
(626-634)nodiscard
(636-639)nodiscard
(646-649)nodiscard
(651-661)nodiscard
(672-675)nodiscard
(677-680)nodiscard
(692-695)nodiscard
(700-703)nodiscard
(744-755)nodiscard
(953-956)nodiscard
(958-971)nodiscard
(973-986)nodiscard
(988-994)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
341-341
: Stubbed isUseOneMoreBlock always returns false – please confirm intentThe new parameterless
isUseOneMoreBlock()
(cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:1109–1113) is a stub that unconditionally returnsfalse
, so there’s no longer any per-request logic (e.g., attention window, sequence length, beam width) influencing this decision.
Please verify that always disabling the "one more block" path is intended for all models and request scenarios. If any runtime variability is required, update this method to compute the correct value based on the appropriate parameters.
Key locations to review:
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:341
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:1109–1113
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
136-139
: Constructor signature verified across call sitesAll
KVCacheManager
instantiations—including the one incapacitySchedulerTest.cpp
(lines 136–139)—now match the updated overload (with the addedmaxSequenceLength
,enableBlockReuse
,onboardBlocks
, andcacheType
parameters in the correct order). No mismatches were found in tests or code.cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
77-80
: Helper correctly counts valid block IDs.
The sentinel-aware helper is concise and used consistently. The return type fits the test domain.
4708-4709
: Ensure allocatePools default argument exists (or pass an explicit bool).
Tests call kvCacheManager->allocatePools() without an argument, but KVCacheManager::allocatePools(bool useUvm) in C++ requires a bool unless a default is declared in the header. Verify that the header provides a default (e.g., useUvm = false) or pass false explicitly here.
Would you like me to sweep the headers and update calls to consistently pass false where missing?
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
2067-2097
: SWA: addToken path updates offsets via updateLastCacheBlockOffsets.
Once updateLastCacheBlockOffsets is fixed (see earlier comment), this full-attention path is fine. For completeness, consider adding an unconditional offsets refresh after the SWA path too, in case future addBlockIfNeeded implementations stop doing in-place updates.
Would you like me to wire a conservative BlockManager::updateSequenceCacheBlockOffsets call for SWA here as well?
b8fa5b6
to
6390a11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
341-349
: Document disabled one-more-block behaviorThe new static BlockManager::isUseOneMoreBlock() always returns false, which means the “one-more-block” path is effectively disabled. We should document this decision at the call site to make it clear that downstream kernels and rewind inputs no longer need to handle the extra‐block cases.
• In cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (around line 341):
- Add a brief comment before or after the call to isUseOneMoreBlock() explaining that it always returns false and why it’s safe to disable the extra-block logic (e.g., SWA/FHMA alignment no longer requires it).
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
292-298
: Bug: wrong container used in EXPECT_NE for disagg-gen-init verification.
You're searching in scheduledRequests but comparing to scheduledDisaggGenInitRequests.end(), which is a mismatch and can mask failures.
Apply this fix:
```diff
-    EXPECT_NE(
-        scheduledRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
+    EXPECT_NE(
+        scheduledDisaggGenInitRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
         << "itCount: " << itCount << "mRequestId:" << scheduledReqState.mRequestId;
```
♻️ Duplicate comments (8)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (8)
519-520
: Same note as previous: sinkTokenLen coverage reduced.
Replicate the rationale comment or a shared helper to centralize this decision.
564-565
: Same note as previous: sinkTokenLen coverage reduced
609-610
: Same note as previous: sinkTokenLen coverage reduced
719-721
: Same note as previous: sinkTokenLen coverage reduced
1194-1196
: Same note as previous: sinkTokenLen coverage reduced
1380-1382
: Same note as previous: sinkTokenLen coverage reduced
1528-1530
: Same note as previous: sinkTokenLen coverage reduced
1580-1582
: Same note as previous: sinkTokenLen coverage reduced
🧹 Nitpick comments (4)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (4)
373-427
: Drop noisy std::cerr traces or guard them behind a flag.
The ad-hoc std::cerr prints in a unit-test utility will produce noisy test output and make CI logs harder to parse. Prefer GTest's SCOPED_TRACE, TLLM_LOG_* with adjustable severity, or guard with a macro/env flag.
Here’s a minimal cleanup that simply removes the debug prints:
```diff
-    std::cerr << "isContextInitState, itCount: " << itCount << std::endl;
@@
-    std::cerr << "isFirstContextChunk, itCount: " << itCount << std::endl;
@@
-    std::cerr << "addNewTokens, itCount: " << itCount << std::endl;
@@
-    std::cerr << "generation phase, itCount: " << itCount << std::endl;
-    std::cerr << "add token for kvCacheManager" << std::endl;
@@
-    std::cerr << "add token for crossKvCacheManager" << std::endl;
@@
-    std::cerr << "add token for llmReq" << std::endl;
@@
-    std::cerr << "GENERATION_COMPLETE" << std::endl;
```
If you want to keep them for debugging, consider:
- Replace with SCOPED_TRACE("...") where appropriate, or
- Wrap in an if (std::getenv("TLLM_TEST_TRACE")) guard.
465-466
: Reduced sinkTokenLen coverage: add rationale and tracking.
Limiting sinkTokenLen to {0} reduces coverage. If this is temporary (e.g., reuse+sink tokens not supported yet), please add a brief rationale and a ticket reference. Consider adding a GTEST_SKIP-based param test for sinkTokenLen > 0 to keep the intent visible without failing.
845-849
: Document removed configurations and plan to re-enable.
You removed some configurations. If disabled due to current limitations, please add a short explanatory comment with an issue link to make the intent explicit and to plan re-enablement.
851-852
: Narrowed configurations: consider parameterized tests or GTEST_SKIP for unsupported cases.
Reducing to one configuration speeds tests but loses coverage. Consider keeping the matrix via parameterized tests and skipping unsupported combinations with GTEST_SKIP to preserve visibility.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(15 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(25 hunks)cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(17 hunks)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
(64 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
👮 Files not reviewed due to content moderation or server errors (2)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
KVCacheManager (1711-1722), KVCacheManager (1724-1738), KVCacheManager (1740-1772), KVCacheManager (1774-1787)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (4)
numLayers (32-69), numLayers (32-33), numLayers (77-88), numLayers (77-77)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
kvCacheManager (2969-2971)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
storeBlocks
(925-929)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
137-139
: KVCacheManager ctor update looks correct
Passing maxNumTokensPerSeq (per-sequence budget / max sequence length) after the stream conforms to the updated ctor signature and matches the usage in other tests. No issues.
306debd
to
44643d2
Compare
📝 Walkthrough
Adds Sliding Window Attention (SWA) support and front-block management to the KV cache. Replaces cyclic logic with detach/adjust of front blocks, introduces kSWAExtraBlock and kInvalidBlockId, and makes maxSequenceLength mandatory in constructors. Updates APIs, bindings, PyTorch executor config, and tests; modifies help texts and integration tests accordingly.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant App as Inference Loop
participant KVC as KVCacheManager
participant BM as BlockManager
participant WBM as WindowBlockManager
participant GR as GenerationRequest
App->>KVC: addToken(requestId)
KVC->>BM: adjustBlocksIfNeeded(GR, enableReuse)
alt SWA path (mIsSWA)
BM->>WBM: adjustBlocksIfNeeded(GR, enableReuse)
opt Window slide needed
WBM->>WBM: detachFrontBlock(GR, enableReuse)\n(mark OOW block kInvalidBlockId)
WBM->>WBM: updateLastCacheBlockOffsets(GR)
end
else Full attention
BM->>BM: allocate/store as needed
end
BM-->>KVC: done
KVC-->>App: token added
note over KVC,WBM: On storage
KVC->>WBM: storeBlocks(keys, ids)
WBM-->>KVC: returns numBlocksStored
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
c380a15
to
061fd2c
Compare
Actionable comments posted: 1
🔭 Outside diff range comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
52-59
: Fix compile error: SizeType32 used before it’s declared
SizeType32
is first introduced at Line 65, but is used to define kPrimaryLevel, kSecondaryLevel, and kSWAExtraBlock earlier (Lines 52–59). This will not compile.
Use the fully qualified type for these constants, or move the alias before the constants.
Apply this diff to use the fully qualified type:
-static constexpr SizeType32 kPrimaryLevel = 0; +static constexpr tensorrt_llm::runtime::SizeType32 kPrimaryLevel = 0; -static constexpr SizeType32 kSecondaryLevel = 1; +static constexpr tensorrt_llm::runtime::SizeType32 kSecondaryLevel = 1; // Extra block buffer allocated for SWA to be able to always keep "window size" // tokens held in the blocks. -static constexpr SizeType32 kSWAExtraBlock = 1; +static constexpr tensorrt_llm::runtime::SizeType32 kSWAExtraBlock = 1;
34-43
: Include headers for types used in this header (avoid transitive includes)
This header uses std::map, std::tuple, std::deque, std::string, std::function, std::numeric, and std::chrono types without including their respective headers. Relying on transitive includes is brittle.
Add the missing headers:
#include <array> #include <cstdint> #include <limits> #include <list> #include <memory> #include <optional> #include <set> #include <unordered_map> #include <vector> +#include <map> +#include <tuple> +#include <deque> +#include <string> +#include <functional> +#include <numeric> +#include <chrono>
17-17
: Add required include guard for headers (per coding guidelines)
Guidelines require header guards named TRTLLM__H. Replace or complement
#pragma once
with an include guard.Apply this diff at the top and bottom of the file:
-#pragma once
+#ifndef TRTLLM_KVCACHEMANAGER_H
+#define TRTLLM_KVCACHEMANAGER_H
} // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_KVCACHEMANAGER_H
Note: You can keep
#pragma once
as an optimization, but the guard is mandatory per the project rules.Also applies to: 1704-1705
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
295-297
: Fix container used in EXPECT_NE for disagg init scheduled set
This line is checking scheduledDisaggGenInitRequests, but it searches in scheduledRequests. This causes false negatives.
Apply this diff:
♻️ Duplicate comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
422-429
: Reset counter after clearCacheBlocks: good fixResetting
mNumFrontBlocksRemoved
here resolves the earlier bug raised in a past review. Thanks for addressing it.
🧹 Nitpick comments (9)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
755-766
: Avoid implicit signed/unsigned mixing in std::max/minTo prevent unintended integral promotion (and to make intent clear), cast the zero literal to
SizeType32
.- auto window = std::min(inputs->maxNumTokens, inputs->maxInputLen - mWindowSize); - window = std::max(window, 0); // clamp negative values to 0 + auto window = std::min(inputs->maxNumTokens, inputs->maxInputLen - mWindowSize); + window = std::max(window, static_cast<SizeType32>(0)); // clamp negative values to 0
590-603
: New SWA helpers are well-scopedThe additions
updateLastCacheBlockOffsets
,detachBlock
, andaddBlockIfNeeded
make responsibilities explicit. Keep them minimal and cohesive; watch for duplication with any BlockManager helpers.
1393-1407
: Parameter name typo: copyOnpartialReuse → copyOnPartialReuseThere’s an inconsistent parameter name across constructors:
copyOnpartialReuse
vs.copyOnPartialReuse
. Even though names don’t affect ABI, consistency improves readability and avoids confusion with the config flags.Apply this rename in the declarations:
- bool copyOnpartialReuse = true); + bool copyOnPartialReuse = true);And propagate the same rename to the other overload below.
Please ensure the definitions in the corresponding .cpp also use the same parameter name.
Also applies to: 1418-1425
21-21
: Consider forward-declaring LlmRequest to reduce header coupling
The TODO suggests it; forward-declaring
class LlmRequest;
and moving the include to the .cpp would reduce rebuild churn. Validate that only pointers/refs are used here before doing so.cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (1)
87-89
: Ack: TODO for beamWidth > 1 coverage
Noted. Please track this with an issue if it's not on your immediate roadmap.
I can draft additional tests for beamWidth > 1 once SWA-related paths are stable. Want me to propose them?
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp (1)
749-758
: Dropping streaming=true reduces coverage; add a tracking plan
Instantiating with
testing::Values(false)
removes all streaming=true scenarios. If this is temporary, add a TODO with a concrete link/issue and timeframe to re-enable streaming coverage, especially for beamWidth > 1 constraints.cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
347-348
: Hard-coding isUseOneMoreBlock=false: document rationale and implications
isUseOneMoreBlock
is now always false. That changes rewind behavior for speculative decoding. Please add a clear comment explaining why we no longer use “one more block” (e.g., incompatibility with SWA linearized allocation, correctness constraints) and verify any sizing assumptions ininvokeUpdateKVBlockArrayDraftTokenLocation
.Optionally, gate this behind a config if future modes need to flip it back.
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
373-384
: Remove noisy std::cerr debug prints from unit tests
These prints make test output noisy and brittle. Prefer GTest SCOPED_TRACE or logs behind a verbosity flag. If needed, use TLLM_LOG_* with a controllable level.
Apply this diff to remove the ad-hoc prints:
- std::cerr << "isContextInitState, itCount: " << itCount << std::endl; @@ - std::cerr << "isFirstContextChunk, itCount: " << itCount << std::endl; @@ - std::cerr << "addNewTokens, itCount: " << itCount << std::endl; @@ - std::cerr << "generation phase, itCount: " << itCount << std::endl; - std::cerr << "add token for kvCacheManager" << std::endl; @@ - std::cerr << "add token for crossKvCacheManager" << std::endl; @@ - std::cerr << "add token for llmReq" << std::endl; @@ - std::cerr << "GENERATION_COMPLETE" << std::endl;Also applies to: 407-427
462-466
: TODOs for sinkTokenLen > 0: track re-enabling scenarios
Acknowledged the temporary restriction to sinkTokenLen=0. Please add an issue link or re-enable coverage once SWA + sink-bubble semantics are finalized.
I can help craft parameterized cases for sinkTokenLen > 0 once the allocator semantics stabilize.
Also applies to: 518-520, 563-565, 719-721, 845-852, 1188-1196, 1528-1536, 1580-1588
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(17 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(28 hunks)cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(17 hunks)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
(64 hunks)cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
(1 hunks)
👮 Files not reviewed due to content moderation or server errors (2)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (1)
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🔇 Additional comments (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
548-554
: Constructor change: isSWA flag addition looks correct
Adding the
isSWA
toggle intoWindowBlockManager
aligns with the SWA behavior changes. Ensure the.cpp
honors this to linearize writes and detach OOW blocks as described in the PR.cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
136-139
: Constructor update looks correct
Passing
maxNumTokensPerSeq
into the KVCacheManager ctor matches the new signature. Ensure all other test helpers and sites were updated consistently.
061fd2c
to
9aa7fa0
Compare
Actionable comments posted: 0
♻️ Duplicate comments (2)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
95-96
: LGTM! Clearer naming for maxTokensPerSeq field.
The rename from
maxTokenNum
tomaxTokensPerSeq
improves clarity and consistency with the codebase naming conventions.
431-448
: Minor issue: Fix format specifier for requestId logging.
mRequestId
is of typeLlmRequest::RequestIdType
(likelyuint64_t
), notint
. Using%d
may cause undefined behavior or incorrect logging output.Apply this diff to fix the format specifier:
- TLLM_LOG_WARNING("RequestID %d: removeFrontBlock called but nothing to remove", mRequestId); + TLLM_LOG_WARNING("RequestID %lu: removeFrontBlock called but nothing to remove", + static_cast<unsigned long>(mRequestId));
🧹 Nitpick comments (2)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
404-404
: Consider removing or converting debug output to proper test logging.
The
std::cerr
statements appear to be debugging artifacts. Consider using proper test logging mechanisms or removing them if no longer needed.
- std::cerr << "addNewTokens, itCount: " << itCount << std::endl;
+ // Token addition completed for context
- std::cerr << "kvCacheManager" << std::endl;
+ // Initialize KV cache manager
- std::cerr << "crossKvCacheManager" << std::endl;
+ // Initialize cross KV cache manager
453-454
: Document the reasoning for temporarily disabled sinkTokenLen > 0 tests.Multiple TODO comments indicate that
sinkTokenLen > 0
tests are disabled. Please add an explanation of why this configuration is not currently supported with SWA and when it might be addressed.Add a more detailed comment explaining the limitation:
- // TODO: Support and add coverage for sinkTokenLen > 0. (e.g. 4) + // TODO(TRTLLM-6341): Support and add coverage for sinkTokenLen > 0. + // SWA KV cache reuse currently doesn't support sink tokens. + // This will be addressed in a future iteration.Also applies to: 507-508, 554-555, 599-600, 710-711, 836-842, 1038-1039, 1092-1093, 1185-1186, 1371-1372, 1519-1520, 1571-1572, 1852-1853
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(17 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(29 hunks)cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
(15 hunks)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
(64 hunks)cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
(1 hunks)cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/tests/unit_tests/batch_manager/llmRequestTest.cpp
👮 Files not reviewed due to content moderation or server errors (2)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (7)
56-58: LGTM! Clear naming for SWA extra block constant. The constant is well-documented, follows the naming convention (kPREFIXED_UPPER_SNAKE), and clearly indicates its purpose for sliding-window attention buffer management.
210-210: LGTM! Appropriate invalid block ID constant. Using -2 as an invalid block ID sentinel value is correct, distinguishing it from -1 (kCachedBlocksRootId).
343-343: LGTM! Proper state management for front block removal. The addition of mNumFrontBlocksRemoved with a getter method and a reset in clearCacheBlocks ensures consistent tracking of detached front blocks for SWA.
Also applies to: 386-389, 428-429, 481-482
550-550: LGTM! SWA flag properly integrated into WindowBlockManager. The addition of the mIsSWA flag and its propagation through the constructor enables proper conditional behavior for sliding-window attention.
Also applies to: 568-568, 827-827
585-585: LGTM! Well-designed SWA-specific block management methods. The new methods provide clear semantics for SWA operations:
- updateLastCacheBlockOffsets: updates offsets for the last block
- detachBlock: cleanly detaches blocks from sequences
- addBlockIfNeeded: conditionally adds blocks, detaching out-of-window blocks
- storeBlocks: now returns the count of actually stored blocks for better tracking
Also applies to: 591-592, 596-602, 614-614, 744-745
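The out-of-window detachment that this kind of block management performs can be illustrated with a small standalone predicate. This is a hedged sketch, not the project's code: the name frontBlockOutOfWindow and its parameters are hypothetical, and it assumes each block holds exactly tokensPerBlock tokens.

```cpp
#include <cstdint>

using SizeType32 = std::int32_t;

// A front block can be detached once every token it holds has slid out of the
// attention window. The oldest token still attended to has index
// numTokens - windowSize; the current front block covers token indices
// [numFrontBlocksRemoved * tokensPerBlock, (numFrontBlocksRemoved + 1) * tokensPerBlock).
bool frontBlockOutOfWindow(SizeType32 numTokens, SizeType32 windowSize, SizeType32 tokensPerBlock,
    SizeType32 numFrontBlocksRemoved)
{
    auto const frontBlockEnd = (numFrontBlocksRemoved + 1) * tokensPerBlock; // one past the block's last token
    return numTokens - windowSize >= frontBlockEnd;
}
```

For example, with a 128-token window and 64-token blocks, the first block only becomes detachable once the sequence reaches 192 tokens.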
906-907: LGTM! Clear API updates for SWA support. The addition of the isShareLastContextBlock parameter and cache-level support in the getNumFreeBlocks methods provides necessary control for SWA behavior.
Also applies to: 949-950, 960-962
1136-1144: LGTM! Well-designed block offset management methods for SWA. The new private helper methods in BlockManager provide clean separation of concerns for managing cache block offsets and block addition logic.
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
138-138: Constructor signature properly updated for the maxNumTokensPerSeq parameter. The test helper correctly passes the new maxNumTokensPerSeq parameter, matching the updated KVCacheManager constructor.
9aa7fa0 to 3354203 Compare
Actionable comments posted: 4
🔭 Outside diff range comments (2)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
291-299: Fix: wrong container used when verifying scheduled disagg-gen-init requests. You're checking membership in scheduledDisaggGenInitRequests but calling scheduledRequests.find(...). This will produce false negatives and can hide scheduling bugs. Apply this diff:

```diff
-            // Check that scheduleId is found in scheduled Requests
-            EXPECT_NE(
-                scheduledRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
+            // Check that scheduleId is found in scheduled disaggregated generation-init Requests
+            EXPECT_NE(
+                scheduledDisaggGenInitRequests.find(scheduledReqState.mRequestId), scheduledDisaggGenInitRequests.end())
                 << "itCount: " << itCount << "mRequestId:" << scheduledReqState.mRequestId;
```

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
4179-4183: Potential undefined type alias: use explicit CudaStream type for clarity. createKvCacheManager takes a parameter StreamPtr stream, but this alias isn't defined in this TU (and may not exist). All other places use std::shared_ptr<tr::CudaStream>. Replace the parameter type with the explicit stream pointer type to avoid build breaks:

```diff
-std::shared_ptr<KVCacheManager> createKvCacheManager(
-    KvCacheManagerInstantiationParameters const& kvCacheInstantiationParameters, StreamPtr stream)
+std::shared_ptr<KVCacheManager> createKvCacheManager(
+    KvCacheManagerInstantiationParameters const& kvCacheInstantiationParameters,
+    std::shared_ptr<tr::CudaStream> stream)
```
♻️ Duplicate comments (1)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
2863-2866: Primary allocation check relies on subtracting secondary capacity (test-only workaround). You compute "allocated primary blocks" via blockManager.getNumAllocatedBlocks() - blocksInSecondaryPool, and similarly for multi-window via blockManager.getNumAllocatedBlocks() - blocksInSecondaryPoolPerWindow * numWindows. This couples tests to internal pool sizing semantics.
If feasible, expose an API to query allocated counts per cache level (symmetrical to getNumFreeBlocks(level)), e.g., getNumAllocatedBlocks(level). If you prefer to keep current tests as-is (previously noted), no action needed now.
Also applies to: 3030-3033
🧹 Nitpick comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2)
1191-1200: Fix incorrect comment about block reuse versions. The comments state "There are two versions of BlockManager::addSequence function" but this is misleading. These are overloaded methods with different purposes, not just enabled/disabled versions of the same functionality. Update the comments to be more accurate:

```diff
-// There are two versions of BlockManager::addSequence function.
-// This is called when block reuse is enabled.
+// Add a sequence with KV cache block reuse support.
+// Attempts to reuse existing blocks from the cache tree when possible.
```

```diff
-// There are two versions of BlockManager::addSequence function.
-// This is called when block reuse is disabled.
+// Add a sequence without block reuse.
+// Allocates fresh blocks for the entire sequence.
```

Also applies to: 1277-1300
2593-2594: Add defensive check for potential underflow. When blockCapacity / beamWidth <= kSWAExtraBlock, the subtraction could underflow. While the current code path might prevent this scenario, adding a defensive check would make the code more robust. The issue was already addressed in a past review comment. Consider applying:

```diff
-    return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock;
+    auto const blocksPerBeam = blockCapacity / beamWidth;
+    auto const usableBlocksPerBeam = blocksPerBeam > kSWAExtraBlock ? blocksPerBeam - kSWAExtraBlock : 0;
+    return usableBlocksPerBeam * tokensPerBlock;
```

Similarly for line 2601:

```diff
-    return std::min(
-        outputLength + (leftoverBlockCapacity - kSWAExtraBlock) * tokensPerBlock, inputLength + outputLength);
+    auto const usableLeftoverCapacity = leftoverBlockCapacity > kSWAExtraBlock ? leftoverBlockCapacity - kSWAExtraBlock : 0;
+    return std::min(
+        outputLength + usableLeftoverCapacity * tokensPerBlock, inputLength + outputLength);
```

Also applies to: 2600-2601
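To see why the clamp matters, here is a minimal standalone sketch of the clamped calculation. The function name usableTokens and the kSWAExtraBlock value of 1 are assumptions chosen for illustration; they mirror, but are not, the actual TensorRT-LLM code.

```cpp
#include <cstdint>

using SizeType32 = std::int32_t;

// Assumed value for illustration; the real constant lives in kvCacheManager.h.
constexpr SizeType32 kSWAExtraBlock = 1;

// Clamp before subtracting so a tight capacity yields 0 usable tokens
// instead of a negative (or, with unsigned types, wrapped-around) value.
SizeType32 usableTokens(SizeType32 blockCapacity, SizeType32 beamWidth, SizeType32 tokensPerBlock)
{
    auto const blocksPerBeam = blockCapacity / beamWidth;
    auto const usableBlocksPerBeam = blocksPerBeam > kSWAExtraBlock ? blocksPerBeam - kSWAExtraBlock : 0;
    return usableBlocksPerBeam * tokensPerBlock;
}
```

With the clamp, a capacity of one block per beam simply reports zero usable tokens rather than an incorrect window size.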
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
2811-2811: Remove unused variable 'temporaryAttentionWindow'. temporaryAttentionWindow is defined but not used, which can trigger warnings.

```diff
-    auto constexpr temporaryAttentionWindow = 0;
+    // (unused) auto constexpr temporaryAttentionWindow = 0;
```

Or remove the line entirely.
Also applies to: 3081-3081
4071-4071: Typo in test name: 'Caculate' → 'Calculate'. Minor naming polish to avoid misspellings in test outputs.

```diff
-TEST_P(BlockRequirementsParamTest, TestCaculateMaxBlocksRequirement)
+TEST_P(BlockRequirementsParamTest, TestCalculateMaxBlocksRequirement)
```
103-111: Non-deterministic event polling can be flaky. Using a fixed sleep_for(100ms) to wait for event flushing can cause intermittent flakes on slow or loaded CI. Consider polling with a bounded timeout (e.g., loop on getLatestEvents until non-empty or a deadline is reached), or instrument the event manager to signal completion to avoid sleeping.
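A bounded-timeout poll of the kind suggested above might look like the following sketch. The helper name pollUntilNonEmpty is hypothetical, and the callable stands in for the event manager's getLatestEvents accessor; treat this as an illustration of the pattern, not the project's API.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Poll an event source until it returns something or a deadline passes,
// instead of sleeping for a fixed interval and hoping events arrived.
template <typename EventFn>
auto pollUntilNonEmpty(EventFn getEvents, std::chrono::milliseconds timeout)
{
    auto const deadline = std::chrono::steady_clock::now() + timeout;
    auto events = getEvents();
    while (events.empty() && std::chrono::steady_clock::now() < deadline)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(5)); // short back-off between polls
        events = getEvents();
    }
    return events;
}
```

On a fast machine the call returns as soon as events are available; on a loaded CI machine it keeps retrying up to the deadline, which is what removes the flakiness of a single fixed sleep.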
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (29 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (14 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (61 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (5)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧬 Code Graph Analysis (3)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
- KVCacheManager (1687-1698), KVCacheManager (1700-1714), KVCacheManager (1716-1748), KVCacheManager (1750-1763)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (27)
- nodiscard (1015-1022), nodiscard (1522-1525), nodiscard (2583-2602)
- KVCacheBlock (261-274)
- addSequence (1193-1197), addSequence (1193-1194), addSequence (1201-1236), addSequence (1201-1202), addSequence (1279-1283), addSequence (1279-1280), addSequence (1287-1300), addSequence (1287-1288), addSequence (2055-2113), addSequence (2055-2056)
- updateLastCacheBlockOffsets (1968-1984), updateLastCacheBlockOffsets (1968-1968)
- detachBlock (2011-2046), detachBlock (2011-2011)
- addBlockIfNeeded (1238-1244), addBlockIfNeeded (1238-1238), addBlockIfNeeded (1246-1275), addBlockIfNeeded (1246-1246)
- getNumFreeBlocks (1522-1522)
- updateSequenceCacheBlockOffsets (1948-1966), updateSequenceCacheBlockOffsets (1948-1948)
- updateCacheBlockOffsetsAtIdx (1986-2001), updateCacheBlockOffsetsAtIdx (1986-1986)
cpp/tests/batch_manager/cacheTransceiverTest.cpp (8)
- llmRequest (421-450), llmRequest (421-421), llmRequest (872-903), llmRequest (872-872), llmRequest (905-912), llmRequest (905-905), llmRequest (914-933), llmRequest (914-914)
cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp (2)
- getNumFreeBlocks (179-182), getNumFreeBlocks (179-179)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
- sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (20)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (8)
58-58: LGTM! Well-structured SWA constant addition. The kSWAExtraBlock constant is appropriately defined and documented, providing a clear buffer for sliding window attention. The comment explains its purpose well.
95-108: LGTM! Consistent naming improvement for window metadata. The rename from maxTokenNum to maxTokensPerSeq is more descriptive and aligns well with maxBlocksPerSeq. The toString() method is correctly updated.
210-210: Good addition of invalid block sentinel value. The kInvalidBlockId = -2 constant provides a clear distinction from kCachedBlocksRootId = -1, making detached/removed blocks easily identifiable in SWA operations.
343-348: LGTM! Well-designed front block removal for SWA. The implementation correctly:
- Resets the counter in clearCacheBlocks (addressing the past review comment)
- Uses bounds checking to prevent out-of-bounds access
- Preserves block layout for reuse by marking with kInvalidBlockId instead of removing
- Includes appropriate warning logging
Also applies to: 386-390, 428-429, 431-448, 481-483
550-551: LGTM! Clean SWA flag integration. The isSWA parameter is properly propagated through the constructor chain and stored for controlling SWA-specific behavior. The boolean expression windowSize < maxSequenceLength correctly identifies SWA scenarios.
Also applies to: 565-566, 606-608, 619-620, 827-828
591-603: LGTM! Well-structured SWA block management methods. The new methods provide clear separation of concerns:
- updateLastCacheBlockOffsets: efficiently updates only the last block
- detachBlock: handles out-of-window block removal
- addBlockIfNeeded: orchestrates allocation and detachment
The documentation is clear and helpful.
745-746: LGTM! Return value addition enables tracking of stored blocks. Changing storeBlocks to return the count of blocks stored is a good improvement for monitoring and debugging reuse effectiveness.
902-908: LGTM! Clean API surface for SWA block management. The BlockManager methods properly delegate to WindowBlockManager instances and provide a clean abstraction for managing blocks across different window sizes.
Also applies to: 1137-1144
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (12)
159-191: LGTM! Well-implemented sequence block traversal utility. The getAllSequenceBlocks function efficiently handles the traversal, with a proper early return for empty sequences and pre-allocation for performance. This addresses the past review comment about size_t underflow.
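The general shape of such a traversal — walking parent links from the tail block and then restoring front-to-back order — can be sketched as below. This is an illustration only: the real getAllSequenceBlocks works on KVCacheBlock pointers, while this sketch uses a hypothetical minimal Block struct.

```cpp
#include <algorithm>
#include <vector>

// Minimal stand-in for a KV cache block: an id plus a link to the previous
// block in the sequence (nullptr for the first block).
struct Block
{
    int id;
    Block* prev;
};

// Gather the ids of every block in a sequence, given only its tail block.
std::vector<int> allSequenceBlockIds(Block* tail)
{
    std::vector<int> ids;
    for (Block* b = tail; b != nullptr; b = b->prev)
    {
        ids.push_back(b->id);
    }
    // The walk produced newest-to-oldest order; flip to front-to-back.
    std::reverse(ids.begin(), ids.end());
    return ids;
}
```

A nullptr tail cleanly yields an empty vector, which mirrors the early-return-for-empty-sequences behavior the review praises.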
564-566: LGTM! Proper SWA flag initialization. The SWA flag is correctly computed as windowSize < maxSequenceLength.value_or(0), which properly identifies sliding window scenarios.
607-620: LGTM! Clean WindowBlockManager constructor with SWA support. The constructor properly initializes the mIsSWA flag and maintains the correct parameter order.
1238-1276: LGTM! Well-implemented SWA block management. The addBlockIfNeeded implementation correctly:
- Calculates when to detach blocks based on window boundaries
- Handles the detachment before allocation to maintain memory
- Updates offsets appropriately
The TODO comments provide valuable context about current limitations.
1360-1423: LGTM! Proper return value for tracking stored blocks. The storeBlocks method now correctly returns the count of blocks stored for reuse, which is useful for monitoring and debugging.
1615-1663: Excellent SWA-aware block release implementation! The releaseBlocks method properly handles SWA by:
- Using getAllSequenceBlocks to gather all blocks, including detached ones
- Checking for blocks that may not have ref counts (detached blocks)
- Logging the number of blocks stored for reuse
This is a robust implementation that handles the complexities of SWA.
1737-1738: LGTM! Clear runtime guard against unsupported features. The check appropriately prevents streamLLM usage with SWA, which is documented as not yet supported.
1949-2001: Clean refactoring of offset update methods. The separation of concerns is well done:
- updateSequenceCacheBlockOffsets: full sequence update
- updateLastCacheBlockOffsets: efficient last-block update
- updateCacheBlockOffsetsAtIdx: targeted block update
2003-2009: LGTM! Clean public API for token addition. The new addToken method provides a clean interface that hides the complexity of block management.
2011-2046: Well-structured block detachment for SWA. The detachBlock method correctly:
- Validates the beam width constraint (documented limitation)
- Decrements ref counts appropriately
- Chooses between offload and release based on configuration
- Updates sequence state and allocated blocks
The TODO comment about future eviction policy integration is helpful.
2057-2102: LGTM! Comprehensive sequence addition with SWA support. The method properly handles:
- Context length calculation
- Block allocation (with/without reuse)
- Sharing decisions for cross-attention
- Cache offset updates
2536-2547: LGTM! Correct SWA block calculation with extra buffer. The block requirement calculations properly account for the kSWAExtraBlock buffer when sequences exceed the window size, ensuring sufficient blocks for smooth window sliding.
Also applies to: 2549-2581
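The block-requirement rule described above — capacity for min(seqLen, windowSize) tokens plus an extra block once the window slides — can be sketched as a standalone calculation. The function blocksForSwaSeq and the kSWAExtraBlock value of 1 are assumptions for illustration; the real calculation in kvCacheManager.cpp also accounts for beam width and sink tokens.

```cpp
#include <algorithm>
#include <cstdint>

using SizeType32 = std::int32_t;

// Assumed value for illustration; the real constant is defined in kvCacheManager.h.
constexpr SizeType32 kSWAExtraBlock = 1;

// Blocks needed for one SWA sequence: enough to hold min(seqLen, windowSize)
// cached tokens, plus one extra block once the window starts sliding.
SizeType32 blocksForSwaSeq(SizeType32 seqLen, SizeType32 windowSize, SizeType32 tokensPerBlock)
{
    auto const cachedTokens = std::min(seqLen, windowSize);
    auto blocks = (cachedTokens + tokensPerBlock - 1) / tokensPerBlock; // ceiling division
    if (seqLen > windowSize)
    {
        blocks += kSWAExtraBlock; // buffer so the window can slide without stalling
    }
    return blocks;
}
```

For a 128-token window with 64-token blocks, a 300-token sequence needs three blocks: two to cover the window plus the sliding buffer.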
3354203 to 762d9d4 Compare
Actionable comments posted: 0
♻️ Duplicate comments (5)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
581-583
: Default fallback for maxSequenceLength may cause block allocation failure.When
maxSequenceLength
isstd::nullopt
, settingmaxTokensPerSeq = 0
results inmaxBlocksPerSeq = 0
, which prevents KV-cache allocation. This could break allocation flows when callers passstd::nullopt
.Apply this fix to derive a sensible default:
- auto const maxTokensPerSeq = maxSequenceLength.value_or(0); + // Derive a sensible default when maxSequenceLength is not provided + auto const maxTokensPerSeq = maxSequenceLength.has_value() + ? maxSequenceLength.value() + : allottedPrimaryBlocks * tokensPerBlock / beamWidth;
1287-1300
: Fix parameter name consistency between declaration and definition.The implementation uses
isShareLastContextBlock
while the header likely declaresisShareLastContextBlock
. Ensure parameter names match exactly between header and implementation.Verify and align parameter names:
void WindowBlockManager::addSequence( - GenerationRequest& sequence, SizeType32 numContextBlocks, bool isShareLastContextBlock) + GenerationRequest& sequence, SizeType32 numContextBlocks, bool isShareLastContextBlock)
2592-2600
: Fix potential underflow in tight-capacity SWA calculation.The expression
((blockCapacity / beamWidth) - kSWAExtraBlock)
can underflow whenblockCapacity/beamWidth <= kSWAExtraBlock
, causing incorrect attention window calculation.Apply the clamping fix:
- return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock; + auto const blocksPerBeam = blockCapacity / beamWidth; + auto const usableBlocksPerBeam = blocksPerBeam > kSWAExtraBlock ? blocksPerBeam - kSWAExtraBlock : 0; + return usableBlocksPerBeam * tokensPerBlock;
564-566
: Default fallback for maxSequenceLength may cause block allocation failure.When
maxSequenceLength
isstd::nullopt
, the fallback value of 0 results inmaxTokensPerSeq = 0
andmaxBlocksPerSeq = 0
(line 583), which prevents KV-cache allocation. This could break allocation flows when callers passstd::nullopt
.Apply this fix to derive a sensible default:
- /*isSWA=*/windowSize < maxSequenceLength.value_or(0), allottedPrimaryBlocks, + /*isSWA=*/maxSequenceLength.has_value() && windowSize < maxSequenceLength.value(), allottedPrimaryBlocks,cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
2976-2978
: Incorrect use of EXPECT_THAT with a raw value (won’t compile).EXPECT_THAT requires a matcher as the second argument. Use EXPECT_EQ or ::testing::Eq instead. This was flagged previously and still applies.
Apply this minimal fix:
- EXPECT_THAT(allBlocksInPrimaryPools, blocksInPrimaryPoolPerWindow * numWindows); + EXPECT_EQ(allBlocksInPrimaryPools, blocksInPrimaryPoolPerWindow * numWindows);Alternatively:
- EXPECT_THAT(allBlocksInPrimaryPools, blocksInPrimaryPoolPerWindow * numWindows); + EXPECT_THAT(allBlocksInPrimaryPools, ::testing::Eq(blocksInPrimaryPoolPerWindow * numWindows));
🧹 Nitpick comments (4)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (2)
452-454
: Make sinkTokenLen coverage TODOs explicit to avoid accidental rot.Multiple TODOs defer sinkTokenLen > 0 coverage. Suggest one of:
- Mark corresponding parameterized cases as disabled (prefix test name with DISABLED_) until support lands.
- Keep the param values {0,4} but GTEST_SKIP() early when sinkTokenLen != 0, so the shape remains visible in CI without failing.
This helps track intent and prevents accidental deletion of these scenarios.
Example pattern within affected loops:
```diff
 for (auto sinkTokenLen : sinkTokenLens)
 {
+    if (sinkTokenLen != 0)
+    {
+        GTEST_SKIP() << "sinkTokenLen > 0 temporarily unsupported; tracked by TODO.";
+    }
     auto kvCacheManager = getKvCacheManager(
         maxNumRequests, kvCacheTokensPerBlock, kvCacheMaxNumTokens, kvCacheMaxNumTokensPerSeq, sinkTokenLen);
```
Also applies to: 506-508, 551-553, 707-709, 833-840, 1090-1091, 1182-1184, 1368-1370, 1516-1518, 1568-1570, 1850-1851
1729-1745
: Dead branch comment about STATIC_BATCH in chunked test.This test declares only GUARANTEED_NO_EVICT and kMAX_UTILIZATION in capacitySchedulerPolicies, yet retains a STATIC_BATCH branch/comment. Either include STATIC_BATCH in the vector or remove the dead branch to avoid confusion.
If you want to exercise STATIC_BATCH here as well:
```diff
-    auto capacitySchedulerPolicies = std::vector<CapacitySchedulerPolicy>{
-        CapacitySchedulerPolicy::kGUARANTEED_NO_EVICT, CapacitySchedulerPolicy::kMAX_UTILIZATION};
+    auto capacitySchedulerPolicies = std::vector<CapacitySchedulerPolicy>{
+        CapacitySchedulerPolicy::kGUARANTEED_NO_EVICT,
+        CapacitySchedulerPolicy::kMAX_UTILIZATION,
+        CapacitySchedulerPolicy::kSTATIC_BATCH};
```
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
2810-2812
: Remove unused variable temporaryAttentionWindow.temporaryAttentionWindow is declared but never used in these tests. Clean it up to avoid warnings.
```diff
-    auto constexpr temporaryAttentionWindow = 0;
```
Also applies to: 3081-3081
4706-4707
: Be explicit with allocatePools(false) for consistency.Most tests explicitly pass false to avoid UVM. For consistency and to prevent accidental defaults changing behavior, prefer allocatePools(false) here as well.
```diff
-    kvCacheManager->allocatePools();
+    kvCacheManager->allocatePools(false);
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (29 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (14 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (61 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.938Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.938Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧬 Code Graph Analysis (2)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
KVCacheManager (1686-1697)
KVCacheManager (1699-1713)
KVCacheManager (1715-1747)
KVCacheManager (1749-1762)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
KVCacheManager (141-931)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (26)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (23)
159-190
: LGTM! Well-structured utility for SWA block traversal. The function pre-allocates the vector and correctly handles the edge case of empty blocks. The early return guard protects against underflow issues when blockCount is zero.
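The back-to-front traversal pattern described above can be sketched in isolation. Note that Node, collectSequenceBlocks, and the nullptr stopping condition are illustrative stand-ins for the real KVCacheBlock chain (which stops at kCachedBlocksRootId), not the production API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the manager's block node; only the prev-pointer
// chain shape matters here.
struct Node
{
    int id;
    Node* prev;
};

// Walk backwards from the tail and fill a pre-sized vector back-to-front,
// mirroring the traversal the review describes. Pre-decrementing the index
// also sidesteps the size_t underflow pattern flagged later in the review.
std::vector<Node*> collectSequenceBlocks(Node* tail, std::size_t blockCount)
{
    std::vector<Node*> blocks(blockCount, nullptr);
    if (blockCount == 0)
    {
        return blocks; // Early return guards the index arithmetic below.
    }
    std::size_t idx = blockCount;
    for (Node* cur = tail; cur != nullptr && idx > 0; cur = cur->prev)
    {
        blocks[--idx] = cur;
    }
    return blocks;
}
```

Filling from the back keeps the output in sequence order without a second reversal pass.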
607-619
: LGTM! Clear SWA integration in WindowBlockManager constructor. The addition of the isSWA parameter properly initializes SWA support based on the window size comparison.
1191-1244
: LGTM! Appropriate function signature documentation and implementation. The function correctly implements block allocation/detachment logic for SWA, with proper bounds checking and beam width validation for the detach mechanism.
1246-1275
: Well-documented SWA block management logic. The implementation correctly handles both block allocation and detachment for SWA with appropriate conditions. The comments clearly explain the current sub-optimal nature and areas for future improvement.
1360-1422
: LGTM! Enhanced storeBlocks with return value for SWA tracking. The function now returns the count of blocks stored for reuse, which is valuable for SWA logging and monitoring. The implementation correctly tracks and returns numBlocksStoredForReuse.
1522-1525
: LGTM! Improved function signature with explicit const qualifier. The addition of the const noexcept qualifiers correctly indicates that this function doesn't modify the object state and guarantees it won't throw.
1615-1645
: Well-implemented SWA-aware block release with reuse support. The function correctly handles SWA scenarios by using getAllSequenceBlocks() to collect all sequence blocks when needed, and properly logs the number of blocks stored for reuse.
1650-1654
: LGTM! Proper handling of detached blocks without references. The conditional check correctly handles blocks that may not have reference counts due to SWA detachment, preventing unnecessary decrements on blocks that are already detached.
1736-1737
: Appropriate guard for unsupported streamLLM with SWA. The runtime check correctly prevents the use of streamLLM functionality when sink-related lengths are non-zero, providing clear error messaging for unsupported configurations.
1876-1879
: LGTM! Correct usage of metadata for token limit checking. The check properly uses maxTokensPerSeq from the window size metadata to prevent allocation beyond sequence limits.
1925-1944
: Well-implemented SWA extra block logic for generation planning. The calculation correctly accounts for sliding window scenarios where both block and window boundaries are crossed, requiring an extra block for smooth window transitions.
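The condition described above can be sketched as a simple predicate. The function and parameter names here are illustrative, not the manager's real API, and the exact boundary test in production may differ:

```cpp
#include <cassert>

// The next token opens a fresh block exactly when the current count is a
// multiple of the block size.
bool crossesBlockBoundary(int numTokens, int tokensPerBlock)
{
    return numTokens % tokensPerBlock == 0;
}

// An extra block is needed only when the sequence has outgrown the attention
// window AND the next token starts a fresh block, i.e. both boundaries are
// crossed at once.
bool needsSwaExtraBlock(int numTokens, int tokensPerBlock, int windowSize)
{
    return numTokens > windowSize && crossesBlockBoundary(numTokens, tokensPerBlock);
}
```

When only one of the two boundaries is crossed, the existing allocation already covers the window, so no extra block is planned.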
1947-1965
: LGTM! Comprehensive cache block offset updates. The function correctly updates all cache block offsets for the entire sequence, ensuring proper page table maintenance across all beams and blocks.
1967-1983
: LGTM! Efficient last block offset update. The function correctly updates only the last block's offsets, which is appropriate when only a new block has been appended to the sequence.
1985-2000
: LGTM! Targeted block offset update at specific index. The function correctly updates offsets for a specific block index across all beams, which is useful for SWA scenarios where specific blocks need offset recalculation.
2002-2008
: LGTM! Clean implementation of the new addToken API. The function correctly advances the sequence token count and triggers block allocation/detachment logic through the block manager. The TODO comment appropriately notes the need for future streamLLM support.
2010-2045
: Well-implemented SWA block detachment with proper validation. The function correctly implements out-of-window block detachment for SWA with appropriate beam width validation, reference count management, and offloading logic when reuse is enabled.
2079-2100
: LGTM! Clear routing logic for block reuse vs. non-reuse paths. The conditional logic correctly routes to different addSequence implementations based on block reuse settings, with appropriate warnings for retention configurations when reuse is disabled.
2120-2128
: LGTM! Proper sequence existence validation. The function correctly checks for sequence existence before attempting to store context blocks, with appropriate warning logging for missing sequences.
2138-2152
: LGTM! Clear validation and logging for storeNewBlock constraints. The function correctly validates beam width and block reuse requirements with informative warning messages when constraints are not met.
2444-2459
: LGTM! Comprehensive token removal with multi-window support. The function correctly handles token removal across all window sizes with proper block release logic when tokens align with block boundaries.
2540-2545
: LGTM! Proper SWA extra block allowance in capacity calculation. The addition of kSWAExtraBlock correctly accounts for the extra block needed during sliding window transitions when the sequence length exceeds the window size.
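The capacity rule above can be illustrated with a small model. The constant, names, and exact formula are assumptions inferred from the review text, not the production code:

```cpp
#include <cassert>

// Buffer block added while the sliding window straddles a block boundary.
constexpr int kSwaExtraBlock = 1;

int ceilDiv(int a, int b)
{
    return (a + b - 1) / b;
}

// A sequence attends to at most windowSize tokens, so it holds at most
// ceil(window / tokensPerBlock) blocks; once it outgrows the window, one
// extra block is reserved for the transition.
int maxBlocksForSequence(int seqLen, int windowSize, int tokensPerBlock)
{
    int const attendedTokens = seqLen < windowSize ? seqLen : windowSize;
    int blocks = ceilDiv(attendedTokens, tokensPerBlock);
    if (seqLen > windowSize)
    {
        blocks += kSwaExtraBlock; // Buffer block for the window transition.
    }
    return blocks;
}
```

The extra block is only charged once the sequence is longer than the window, matching the `sequence length exceeds the window size` condition in the comment.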
2574-2578
: LGTM! Consistent SWA extra block handling for beam search. The logic correctly adds kSWAExtraBlock when the sequence length exceeds the window size, maintaining consistency with single-beam calculations.
1848-1852
: Verify SWA extra block allowance in block calculation. The addition of chunkSize to the window size when calculating promptCacheLen may not align with SWA semantics. Ensure this calculation properly accounts for the attention window boundaries. Review the calculation to verify the SWA window boundary logic:
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
137-139
: Constructor wiring LGTM: maxSequenceLength threaded correctly. The helper updates the KVCacheManager constructor to pass maxNumTokensPerSeq as both the window size entry and the explicit maxSequenceLength argument. This aligns with the new API and SWA semantics and keeps the blocksPerWindow configuration coherent.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
77-81
: Helper to count valid block IDs is clear and correct. getNumValidBlocks() correctly filters out KVCacheBlock::kInvalidBlockId and is used to assert allocation invariants under SWA. Good addition.
145-146
: addSequence(isShareLastContextBlock) coverage is solid. The test now exercises both non-sharing and sharing of the last context block and validates block allocation patterns and equality across beams. This meaningfully covers the new API.
Also applies to: 160-161, 173-189
762d9d4
to
169d317
Compare
Actionable comments posted: 2
🔭 Outside diff range comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
52-59
: Fix use-before-declaration of SizeType32 for kPrimaryLevel/kSecondaryLevel (compile-time error).
SizeType32 is introduced at Line 65, but it's used earlier at Lines 52 and 54. This will not compile unless another alias is already in scope. Use a fully qualified type for these constants (or move the alias above). Apply this diff:
-static constexpr SizeType32 kPrimaryLevel = 0;
-
-static constexpr SizeType32 kSecondaryLevel = 1;
+static constexpr tensorrt_llm::runtime::SizeType32 kPrimaryLevel = 0;
+
+static constexpr tensorrt_llm::runtime::SizeType32 kSecondaryLevel = 1;
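The ordering rule behind this comment can be reproduced in a minimal, standalone form. Here std::int32_t stands in for the runtime alias; the names are illustrative only:

```cpp
#include <cassert>
#include <cstdint>

// Fully qualified types carry no dependency on where an alias is declared,
// so these constants compile regardless of declaration order.
static constexpr std::int32_t kPrimaryLevel = 0;
static constexpr std::int32_t kSecondaryLevel = 1;

// An alias is only usable below the point where it is introduced.
using SizeType32 = std::int32_t;

// Constants declared after the alias may use it freely.
static constexpr SizeType32 kTertiaryLevel = 2;
```

Moving the two constants above the `using` declaration and spelling them as `SizeType32` would fail to compile, which is exactly the hazard the review flags.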
♻️ Duplicate comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
431-446
: Use correct format specifier and cast for requestId in warning log.
mRequestId isn't guaranteed to be an unsigned long, and using %lu without an explicit cast is risky; previous reviews requested this fix. Apply this diff:
- TLLM_LOG_WARNING("RequestID %lu: removeFrontBlock called but nothing to remove", mRequestId);
+ TLLM_LOG_WARNING("RequestID %lu: removeFrontBlock called but nothing to remove",
+     static_cast<unsigned long>(mRequestId));

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
2589-2601
: Clamp SWA extra-block arithmetic to avoid underflow in tight-capacity cases. When blockCapacity/beamWidth <= kSWAExtraBlock, subtracting kSWAExtraBlock underflows (unsigned) and returns a huge window. Similarly for leftoverBlockCapacity - kSWAExtraBlock. Apply this diff:
- if (outputBlockRequirements > blockCapacity)
- {
-     return ((blockCapacity / beamWidth) - kSWAExtraBlock) * tokensPerBlock;
- }
+ if (outputBlockRequirements > blockCapacity)
+ {
+     auto const blocksPerBeam = static_cast<SizeType32>(blockCapacity / beamWidth);
+     auto const usableBlocksPerBeam = blocksPerBeam > kSWAExtraBlock ? (blocksPerBeam - kSWAExtraBlock) : 0;
+     return usableBlocksPerBeam * tokensPerBlock;
+ }
@@
- return std::min(
-     outputLength + (leftoverBlockCapacity - kSWAExtraBlock) * tokensPerBlock, inputLength + outputLength);
+ auto const usableLeftoverBlocks =
+     leftoverBlockCapacity > kSWAExtraBlock ? (leftoverBlockCapacity - kSWAExtraBlock) : 0;
+ return std::min(outputLength + usableLeftoverBlocks * tokensPerBlock, inputLength + outputLength);
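The "huge window" symptom comes from unsigned wraparound, which a tiny self-contained demo makes concrete. usableBlocks is an illustrative helper mirroring the clamp in the diff, not the production function:

```cpp
#include <cassert>
#include <cstdint>

// Clamped subtraction: when capacity <= extra, return 0 instead of letting
// unsigned arithmetic wrap around.
std::uint32_t usableBlocks(std::uint32_t capacity, std::uint32_t extra)
{
    return capacity > extra ? capacity - extra : 0;
}

// Shows the hazard itself: an unclamped unsigned subtraction that wraps
// produces a value larger than the original capacity (0u - 1u == UINT32_MAX).
bool wrapsWithoutClamp(std::uint32_t capacity, std::uint32_t extra)
{
    return capacity - extra > capacity;
}
```

Unsigned overflow is well-defined modulo 2^N in C++, so no sanitizer or compiler warning is guaranteed to catch this; the explicit comparison is the reliable guard.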
1246-1275
: Re-cache KV offsets after detaching front block; prevent stale/invalid page-table entries. After detachBlock, indices shift but the page-table (offset tensor) isn't recomputed. Copying offsets later will include stale entries or garbage for invalid slots. Apply this diff to recompute all offsets and mark invalid entries with a sentinel:
 if (shouldDetachBlock && mIsSWA)
 {
-    detachBlock(sequence, isEnableBlockReuse);
+    detachBlock(sequence, isEnableBlockReuse);
+
+    // Recompute offsets for the entire sequence after front-removal.
+    auto& cacheBlocksTensor = sequence.getCacheBlockIndices(mWindowSize);
+    auto* offsetsPtr = bufferCast<tk::KVCacheIndex>(cacheBlocksTensor);
+    auto const& offsetsShape = cacheBlocksTensor.getShape();
+    auto const& cacheBlocks = sequence.getCacheBlockIds(mWindowSize);
+    auto const beamWidth = sequence.getBeamWidth();
+    auto constexpr kIdx = 0;
+    auto constexpr vIdx = 1;
+    tk::KVCacheIndex const invalidIndex{
+        std::numeric_limits<tk::KVCacheIndex::UnderlyingType>::max()};
+    for (SizeType32 beamIdx = 0; beamIdx < beamWidth; ++beamIdx)
+    {
+        auto const& beamCacheBlock = cacheBlocks[beamIdx];
+        for (SizeType32 blockIdx = 0; blockIdx < static_cast<SizeType32>(beamCacheBlock.size()); ++blockIdx)
+        {
+            auto const blockId = beamCacheBlock[blockIdx];
+            if (blockId != KVCacheBlock::kInvalidBlockId)
+            {
+                setOffsets(offsetsPtr, offsetsShape, beamIdx, blockIdx, blockId);
+            }
+            else
+            {
+                // Fill sentinel for both K and V across all pools for invalid block slots.
+                for (SizeType32 poolIdx = 0; poolIdx < static_cast<SizeType32>(mPools.size()); ++poolIdx)
+                {
+                    for (auto const xIdx : {kIdx, vIdx})
+                    {
+                        auto const offsetIndex =
+                            tensorrt_llm::common::flat_index(offsetsShape.d, poolIdx, beamIdx, xIdx, blockIdx);
+                        offsetsPtr[offsetIndex] = invalidIndex;
+                    }
+                }
+            }
+        }
+    }
 }
581-591
: Avoid metadata.maxBlocksPerSeq == 0 when maxSequenceLength is not provided. Defaulting maxTokensPerSeq to 0 yields maxBlocksPerSeq=0 and zero-capacity offset tables, which later get written into. This is a correctness issue. Apply this diff:

- auto const maxTokensPerSeq = maxSequenceLength.value_or(0);
+ // If maxSequenceLength is not provided, fall back to a sane per-seq capacity:
+ // at least the window, plus one extra SWA buffer block.
+ auto const defaultBlocksPerSeq =
+     tensorrt_llm::common::ceilDiv(windowSize, tokensPerBlock) + kSWAExtraBlock;
+ auto const maxBlocksPerSeq = maxSequenceLength.has_value()
+     ? tensorrt_llm::common::ceilDiv(*maxSequenceLength, tokensPerBlock)
+     : defaultBlocksPerSeq;
+ auto const maxTokensPerSeq = maxBlocksPerSeq * tokensPerBlock;
- auto const temporaryAttentionWindow = manager.calculateTemporaryAttentionWindow(tempAttentionWindowInputs);
- auto const maxBlocksPerSeq = tc::ceilDiv(maxTokensPerSeq, tokensPerBlock);
+ auto const temporaryAttentionWindow = manager.calculateTemporaryAttentionWindow(tempAttentionWindowInputs);
  auto const [allottedPrimaryBlocks, allottedSecondaryBlocks] = blocksPerWindow.at(windowSize);
  mWindowSizeToMetadata[windowSize]
-     = WindowSizeMetadata{allottedPrimaryBlocks, allottedSecondaryBlocks, absolutePoolsOffset, numPools,
-         maxTokensPerSeq, maxBlocksPerSeq, manager.getMaxNumBlocks(), temporaryAttentionWindow};
+     = WindowSizeMetadata{allottedPrimaryBlocks, allottedSecondaryBlocks, absolutePoolsOffset, numPools,
+         maxTokensPerSeq, maxBlocksPerSeq, manager.getMaxNumBlocks(), temporaryAttentionWindow};
  TLLM_LOG_INFO(
      "Max KV cache blocks per sequence: %d [window size=%d], tokens per block=%d, primary blocks=%d, secondary "
      "blocks=%d",
      maxBlocksPerSeq, windowSize, tokensPerBlock, allottedPrimaryBlocks, allottedSecondaryBlocks);
🧹 Nitpick comments (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
548-554
: Constructor’s new isSWA parameter: good surface for SWA-specific flows. Adding isSWA to WindowBlockManager enables localized SWA behavior. Ensure call sites compute isSWA robustly (see cpp fix for nullopt maxSequenceLength). Would you like me to scan all call sites to confirm they pass a well-defined SWA signal across variable-window configs?
17-17
: Prefer include guards over #pragma once (guideline). The codebase guidelines specify include guards named TRTLLM__H. Consider adding them for consistency.
Example:
// #pragma once // optional to keep as well
#ifndef TRTLLM_KVCACHEMANAGER_H
#define TRTLLM_KVCACHEMANAGER_H
...
#endif // TRTLLM_KVCACHEMANAGER_H

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
159-190
: Avoid potential size_t underflow pattern; simplify index logic. You added an early return for blockCount==0 (good), but the post-decrement pattern can still flag static-analysis warnings. Use pre-decrement with an initial blockCount value for clarity. Apply this diff:
- size_t currentIndex = blockCount - 1;
- while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
- {
-     sequenceBlocks[currentIndex--] = currentBlock;
+ size_t currentIndex = blockCount;
+ while (currentBlock != nullptr && currentBlock->getBlockId() != KVCacheBlock::kCachedBlocksRootId)
+ {
+     sequenceBlocks[--currentIndex] = currentBlock;
      currentBlock = currentBlock->getPrevBlockInSeq();
  }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (29 hunks)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (4)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.853Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.853Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-14T21:04:50.208Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-15T06:46:53.780Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.780Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🔇 Additional comments (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
428-429
: LGTM: Resetting front-detach counter after clear. Resetting mNumFrontBlocksRemoved avoids indexing past the beginning on subsequent detach operations.
614-616: Exposing the getNumFreeBlocks cacheLevel overload across Block/Window managers is a solid addition. This aligns stats reporting with multi-level pools (primary/secondary).
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
1360-1423: Returning the number of stored blocks from storeBlocks is a good improvement. This allows upstream code to track reuse effectiveness per call. The bookkeeping added in the release path leverages this count.
1615-1662: The release path covers SWA-detached blocks correctly. The conditional decRefCount() avoids underflow for already-detached blocks; bulk clearing of per-sequence storage is correct.
2444-2459: Single-beam removeToken guardrails look good. The beam>1 guard and the block-boundary logic for releasing the last block are consistent with the SWA restrictions in this PR.
…st for reuse/no reuse case Signed-off-by: eopXD <[email protected]>
…t the possible maximum for kv_cache_config.max_tokens This way, we are not constrained by the max_tokens limit. Signed-off-by: eopXD <[email protected]>
PR_Github #19441 [ run ] completed with state |
Force-pushed from ae9a42c to 3598e9f (Compare)
/bot run --disable-fail-fast --post-merge |
PR_Github #19539 [ run ] triggered by Bot |
/bot run --disable-fail-fast --post-merge |
PR_Github #19602 [ run ] triggered by Bot |
PR_Github #19539 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #19653 [ run ] triggered by Bot |
PR_Github #19602 [ run ] completed with state |
This MR is a continuation of NVIDIA#6768. In the previous merge request, OOW (out-of-window) blocks were only detached when reuse was not enabled. This MR enables the KV cache manager to detach the block when reuse is enabled. When reuse is enabled, an OOW block gets detached while the manager bookkeeps whether the block is overwritten by another sequence. If at the end of the sequence all of its blocks are still clean, the sequence blocks will be stored. Signed-off-by: eopXD <[email protected]>
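The clean/dirty bookkeeping described above can be sketched as follows. All names here are hypothetical and do not reflect the actual TensorRT-LLM classes: a detached OOW block is remembered as clean, marked dirty if another sequence overwrites it, and the sequence's blocks are stored for reuse only if every detached block stayed clean.

```cpp
#include <unordered_map>

// Minimal sketch of per-sequence bookkeeping for detached OOW blocks.
struct SwaReuseTracker
{
    std::unordered_map<int, bool> detachedClean; // blockId -> still clean?

    // Block leaves the attention window: detach it, assume it is clean.
    void onDetach(int blockId) { detachedClean[blockId] = true; }

    // Another sequence reclaimed this block: mark it dirty.
    void onOverwrite(int blockId)
    {
        auto it = detachedClean.find(blockId);
        if (it != detachedClean.end())
            it->second = false;
    }

    // Store blocks for reuse only if no detached block was overwritten.
    bool canStoreForReuse() const
    {
        for (auto const& entry : detachedClean)
            if (!entry.second)
                return false;
        return true;
    }
};
```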
PR_Github #19653 [ run ] completed with state |
I left another comment as a reminder, but it is OK since the KV cache manager doesn't affect performance unless a user explicitly controls it. I approved; please fix the calculation in a follow-up PR.
@jaedeok-nvidia Thank you for the reminder. I will queue this into my todo list.
(Landed) This MR is based upon #6767
Description
This merge request attempts to support more SWA KV cache functionality inside the KV cache manager. Before this merge request, the KV cache for sliding window attention (SWA) only holds "window size" number of blocks and reuses them in a cyclic manner. We are not able to utilize more GPU memory with this design, leading to limited max batch size throughput. Additionally, we are not able to support KV cache reuse with this design.
In this MR, we change this behavior to let the manager write blocks in a linear manner. With linear block writing, as the attention window moves on, the out-of-window (OOW) blocks are detached. For now, for the sake of a correct feature first, we directly offload the OOW block from the primary block pool (GPU memory) to the secondary block pool (host memory). We will improve this in the future by delegating the block movement to the eviction policy.
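The linear-writing scheme above implies a simple rule for how many leading blocks have fallen out of the window at any point. The helper below is a sketch under that assumption, not the actual KVCacheManager interface.

```cpp
// Sketch: with blocks written linearly, after numTokens tokens the leading
// blocks that hold only tokens older than the attention window can be
// detached and offloaded. Returns how many such fully out-of-window
// blocks exist at the front of the sequence.
int numOutOfWindowBlocks(int numTokens, int windowSize, int tokensPerBlock)
{
    if (numTokens <= windowSize)
        return 0;
    return (numTokens - windowSize) / tokensPerBlock;
}
```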
KV cache reuse for SWA is not developed in this merge request and will be amended in a follow-up merge request.
Writing the blocks linearly, the maximum number of blocks allocated for a sequence (GenerationRequest) is determined by the specified "max sequence length". The GenerationRequest that stores the cache block bookkeeping structure will now keep "max sequence length" tokens of blocks.
Given the above, the main changes are (more context/explanation in the commits):
- KVCacheManager::addToken. Please note that detach is still guarded off for SWA when reuse is enabled. A follow-up merge request will proceed to improve this.
- KVCacheManager / BlockManager
- resource_manager.py
Test Coverage
Unit tests
Tests are added under KVCacheManagerTest.cpp. The main tests added to cover the SWA detach/reuse concept include KVCacheManagerVariableWindowAttentionWithReuseTest. Although getNeededBlocksOneStep is not a used function right now, its handling of the blocks SWA uses has been updated and tests have been added.
Benchmark
Added test case on gemma3 with VSWA setting on the H100 platform.
Running with GPT-OSS 120B
Batch size throughput 482 using GPT-OSS 120B as model on GB200.