[TLLM-6777][feature] Support SWA KV cache reuse OOW block detach #7922
base: main
Conversation
📝 Walkthrough
The change set makes KV-cache management sequence-aware and SWA-aware across headers, implementation, and bindings. It replaces optional max-sequence-length with required parameters, adjusts constructor signatures and call sites, adds per-sequence bookkeeping and block reuse validity, removes cyclic KV-cache logic, updates tests, and refreshes related help text/docs.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor App as Caller
participant KM as KVCacheManager
participant BM as BlockManager
participant WBM as WindowBlockManager
participant Seq as GenerationRequest
rect rgba(230,240,255,0.5)
note over App,KM: Add sequence and allocate context
App->>KM: addSequence(Seq, ...)
KM->>BM: holdSequence(Seq.id)
KM->>BM: adjustBlocksIfNeeded(Seq, enableBlockReuse)
end
rect rgba(240,255,240,0.5)
note over App,WBM: Onboard a new block (sequence-aware)
App->>KM: onboardBlock(Seq, windowSize, ...)
KM->>BM: onboardBlock(Seq, ...)
BM->>WBM: onboardBlock(Seq, ...)
WBM->>WBM: getFreeBlock(Seq, ...)
WBM-->>BM: BlockPtr
BM-->>KM: BlockPtr
KM-->>App: BlockPtr
end
alt SWA sequence
note right of WBM: SWA guards: skip store/detach on-the-fly
else Non-SWA
rect rgba(255,250,230,0.6)
note over WBM: Detach/adjust front blocks when needed
KM->>BM: adjustBlocksIfNeeded(Seq, enableBlockReuse)
BM->>WBM: adjustBlocksIfNeeded(Seq, enableBlockReuse)
WBM->>WBM: detachFrontBlock(Seq, enableBlockReuse)
end
end
rect rgba(255,235,235,0.6)
note over App,WBM: Store or release blocks (sequence validity tracked)
App->>KM: releaseBlocks(Seq, ...)
KM->>BM: releaseBlocks(Seq, ...)
BM->>WBM: storeBlocks(Seq, ...) / release(...)
WBM->>BM: update block->sequence mappings
end
KM->>BM: releaseSequence(Seq.id)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
582-586
: Bug: cacheType shadowing prevents kSELFKONLY from being set. The inner ‘auto cacheType’ shadows the outer variable; kSELF is always used even when kvFactor == 1.
Fix:
- CacheType cacheType = CacheType::kSELF;
- if (kvFactor == 1)
- {
-     auto cacheType = CacheType::kSELFKONLY;
- }
+ CacheType cacheType = (kvFactor == 1) ? CacheType::kSELFKONLY : CacheType::kSELF;
🧹 Nitpick comments (12)
examples/utils.py (1)
356-362
: Help text tweak: clarify plural semantics for nargs="+". Since this arg accepts multiple integers, make the wording plural and add a brief example.
- 'The attention window size that controls the sliding window attention kv cache behavior'
+ 'Attention window size(s) controlling sliding-window attention KV-cache behavior. '
+ 'Accepts one or more ints (e.g., --max_attention_window_size 4096 or 2048 4096).'
examples/models/core/qwen2audio/utils.py (2)
41-41
: Polish help text for clarity and consistency (KV-cache, hyphenation). Suggest concise copy and consistent hyphenation/capitalization.
- 'The attention window size that controls the sliding window attention kv cache behavior'
+ 'Controls the sliding-window attention KV-cache behavior.'
1-1
: Missing SPDX/NVIDIA Apache-2.0 header. Per guidelines, add the license header to Python sources.
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
  from argparse import BooleanOptionalAction
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
59-63
: Constructor signature alignment LGTM; prefer explicit SizeType32 for kvMaxNumTokens. Avoid implicit int → SizeType32 conversions to keep types consistent across platforms.
Apply:
- auto kvMaxNumTokens = tokensPerBlock * maxBlocksPerSeq;
+ SizeType32 kvMaxNumTokens = static_cast<SizeType32>(tokensPerBlock) * maxBlocksPerSeq;
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
84-88
: Constructor update LGTM; minor type tightening suggested. Use SizeType32 for kvMaxNumTokens to match manager expectations.
- auto kvMaxNumTokens = tokensPerBlock * maxBlocksPerSeq;
+ SizeType32 kvMaxNumTokens = static_cast<SizeType32>(tokensPerBlock) * maxBlocksPerSeq;
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)
621-623
: Constructor update LGTM; consider explicit type for maxNumTokens. Use SizeType32 to avoid implicit narrowing and mirror other callsites.
- dataType, sinkTokenLength, stream, maxNumTokens, enableBlockReuse, onboardBlocks, CacheType::kSELF,
+ dataType, sinkTokenLength, stream, static_cast<SizeType32>(maxNumTokens), enableBlockReuse, onboardBlocks, CacheType::kSELF,
1316-1337
: Temporarily waiving isWindow tests reduces coverage. Add a TODO and plan to re-enable once SWA reuse/OOW detach stabilizes; otherwise we may miss regressions.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
736-736
: Fix inconsistent parameter naming in onboardBlock methods. The onboardBlock methods use inconsistent parameter naming: copyOnPartialReuse in some signatures and copyOnpartialReuse in others. Apply this diff to fix the inconsistent casing:
- bool copyOnpartialReuse = true,
+ bool copyOnPartialReuse = true,
- bool copyOnpartialReuse = true,
+ bool copyOnPartialReuse = true,
Also applies to: 828-829, 951-951, 988-989, 1053-1055, 1060-1060
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
324-327
: Remove or relocate debug logging from production code. The debug logging before the assertion seems misplaced. Consider removing it or wrapping it in a debug-only build flag.
- if (!hasRefs())
- {
-     TLLM_LOG_DEBUG("::decRefCount Going to hit assertion for block %d", getBlockId());
- }
  TLLM_CHECK_WITH_INFO(hasRefs(), "Can't remove link from block that is not allocated");
507-519
: Consider using the intended calculateWindowSizeToShare logic. The comment at line 519-520 indicates that the current uniform distribution (1.0f / windowSizeToLayers.size()) is temporary. The actual proportional calculation based on window sizes is already implemented but commented out. Consider enabling the intended logic if testing has been completed.
The function calculateWindowSizeToShare appears to have sophisticated logic for proportional distribution, but it's not being used. Consider tracking this as a TODO if the uniform distribution is indeed temporary:
+ // TODO(TLLM-XXXX): Enable proportional block allocation based on window sizes
+ // once testing is complete. Currently using uniform distribution for identical performance.
  std::map<SizeType32, float> windowSizeToShare;
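For illustration only, here is a minimal sketch of how a proportional share could be derived from a window-size-to-layers map, assuming each window size is weighted by windowSize times its layer count. The weighting and names are assumptions, not the actual calculateWindowSizeToShare implementation.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using SizeType32 = std::int32_t;

// Illustrative only: weight each window size by (windowSize * layer count),
// then normalize so the shares sum to 1.0. The real calculateWindowSizeToShare
// in kvCacheManager.cpp may use a different weighting.
std::map<SizeType32, float> proportionalWindowSizeToShare(
    std::map<SizeType32, std::vector<SizeType32>> const& windowSizeToLayers)
{
    double totalWeight = 0.0;
    for (auto const& [windowSize, layers] : windowSizeToLayers)
    {
        totalWeight += static_cast<double>(windowSize) * layers.size();
    }

    std::map<SizeType32, float> share;
    for (auto const& [windowSize, layers] : windowSizeToLayers)
    {
        share[windowSize]
            = static_cast<float>(static_cast<double>(windowSize) * layers.size() / totalWeight);
    }
    return share;
}
```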
1091-1096
: Consider documenting the offloading behavior trade-offs. The comment correctly identifies that the current default behavior may lead to unnecessary traffic. Consider tracking this as a technical debt item.
Add a TODO to track the improvement opportunity:
+ // TODO(TLLM-XXXX): Delegate offloading decisions to the eviction policy
+ // to avoid unnecessary traffic from offloading blocks that won't be reused.
  // The current default behavior is to offload the out-of-window block
1485-1547
: Document the return value usage of storeBlocks. The method now returns the number of blocks stored, but this information appears to be used only for logging. Consider documenting this behavior.
Add a comment explaining the return value usage:
  SizeType32 WindowBlockManager::storeBlocks(
      std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds)
  {
+     // Returns the number of blocks actually stored (excludes already-matched blocks)
      SizeType32 numBlocksStoredForReuse = 0;
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (22 hunks)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (42 hunks)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1 hunks)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1 hunks)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3 hunks)
examples/models/core/llama/summarize_long.py (1 hunks)
examples/models/core/qwen2audio/utils.py (1 hunks)
examples/utils.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
tensorrt_llm/functional.py (1 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
examples/models/core/qwen2audio/utils.py
examples/utils.py
tensorrt_llm/functional.py
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
tests/integration/defs/accuracy/test_llm_api_pytorch.py
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
examples/models/core/llama/summarize_long.py
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
examples/models/core/qwen2audio/utils.py
examples/utils.py
tensorrt_llm/functional.py
tensorrt_llm/_torch/pyexecutor/_util.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
examples/models/core/llama/summarize_long.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
examples/models/core/qwen2audio/utils.py
examples/utils.py
tensorrt_llm/functional.py
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
tests/integration/defs/accuracy/test_llm_api_pytorch.py
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
examples/models/core/llama/summarize_long.py
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (9)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Applied to files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧬 Code graph analysis (7)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
stream (4654-4660)
stream (4740-4746)
stream (5099-5105)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
storeBlocks (996-1000)
isSequenceValidForStoreForReuse (1243-1248)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (1)
KvCacheConfig (972-1106)
tensorrt_llm/llmapi/llm.py (1)
LLM (1022-1038)
tests/integration/defs/accuracy/accuracy_core.py (4)
GSM8K (293-308)
evaluate (147-206)
evaluate (712-722)
MMLU (276-290)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (32)
nodiscard (1118-1125)
nodiscard (1648-1651)
nodiscard (2735-2753)
KVCacheBlock (272-285)
releaseBlocks (1658-1673)
releaseBlocks (1658-1658)
releaseBlocks (1739-1796)
releaseBlocks (1739-1739)
updateLastCacheBlockOffsets (2103-2119)
updateLastCacheBlockOffsets (2103-2103)
releaseLastBlock (1624-1627)
releaseLastBlock (1624-1624)
releaseLastBlock (1629-1646)
releaseLastBlock (1629-1629)
detachFrontBlock (2146-2181)
detachFrontBlock (2146-2146)
adjustBlocksIfNeeded (1365-1371)
adjustBlocksIfNeeded (1365-1365)
adjustBlocksIfNeeded (1373-1391)
adjustBlocksIfNeeded (1373-1373)
onboardBlock (1054-1058)
onboardBlock (1054-1055)
onboardBlock (1060-1080)
onboardBlock (1060-1061)
offloadBlock (1082-1086)
offloadBlock (1082-1083)
offloadBlock (1088-1116)
offloadBlock (1088-1089)
storeBlocks (1485-1548)
storeBlocks (1485-1486)
getFreeBlock (951-1024)
getFreeBlock (951-953)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (40)
examples/models/core/llama/summarize_long.py (1)
44-49
: LGTM — help text aligned with SWA wording. Wording consistent with removing "cyclic"; no behavior change. Sandbox rg search skipped files ("No files were searched"), so verify locally:
rg -n -C2 --type=py --hidden -uu --no-ignore -g '!**/site-packages/**' -e '\bmax_attention_window_size\b' -e '\bcyclic\b' || git ls-files | xargs -r grep -n -E -C2 '\b(max_attention_window_size|cyclic)\b'
tensorrt_llm/functional.py (1)
5361-5361
: Doc terminology update looks fine; confirm expected tensor shape. The wording change is OK. Please verify whether host_max_attention_window_sizes is still expected to be shape [1] or per-layer (vector). The docstring still says [1] while other parts of the PR imply per-layer support.
tests/integration/test_lists/test-db/l0_h100.yml (1)
39-42
: New VSWA test entries look good; verify they exist and are stable. Ensure these four tests are implemented and not skipped, and that they replace any older VSWA variants in other lists to avoid redundant runtime.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
335-337
: Unconditionally setting max_tokens may constrain VSWA unexpectedly. Previously VSWA left max_tokens unset; now it’s always derived from memory. Confirm C++ respects this as a global capacity (not per-window) and that VSWA capacity planning remains correct for multi-window configs.
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
136-139
: Updated KVCacheManager callsite matches new signature. No issues spotted.
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
218-222
: KVCacheManager signature change handled correctly here. No functional concerns with passing maxNumTokens.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
1060-1072
: LGTM! Test correctly configures KV cache without reuse. The test properly sets up the KV cache configuration with block reuse disabled and validates the SWA window configuration through GSM8K and MMLU benchmarks.
1074-1086
: LGTM! Test correctly enables KV cache block reuse. The test properly validates the new SWA behavior with block reuse enabled. This aligns with the PR objectives to enable OOW block detachment with reuse.
1087-1107
: LGTM! Chunked prefill test without reuse is well-structured. The test correctly combines chunked prefill configuration with disabled KV cache reuse, providing good coverage for this configuration combination.
1108-1127
: LGTM! Comprehensive test coverage for chunked prefill with reuse. The test provides good coverage for the combination of chunked prefill and block reuse, which is important for validating the SWA behavior under different operational modes.
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
486-500
: Constructor signature updated correctly for SWA support. The KVCacheManager constructor binding correctly updates the signature to:
- Replace optional maxSequenceLength with required SizeType32
- Add enable_partial_reuse and copy_on_partial_reuse parameters
This aligns with the C++ implementation changes for per-sequence SWA management.
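To make the shape of such a binding concrete, here is a hypothetical nanobind sketch of a required integer argument plus keyword arguments with defaults. The module, class, and argument names are invented for illustration; the real binding in cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp differs.

```cpp
#include <nanobind/nanobind.h>

namespace nb = nanobind;

// Hypothetical stand-in type; not the TensorRT-LLM KVCacheManager.
struct KvCacheManagerSketch
{
    KvCacheManagerSketch(int maxSequenceLength, bool enablePartialReuse, bool copyOnPartialReuse)
        : mMaxSequenceLength(maxSequenceLength)
        , mEnablePartialReuse(enablePartialReuse)
        , mCopyOnPartialReuse(copyOnPartialReuse)
    {
    }

    int mMaxSequenceLength;
    bool mEnablePartialReuse;
    bool mCopyOnPartialReuse;
};

NB_MODULE(kv_cache_sketch, m)
{
    // max_sequence_length is required; the partial-reuse flags default to true.
    nb::class_<KvCacheManagerSketch>(m, "KvCacheManagerSketch")
        .def(nb::init<int, bool, bool>(), nb::arg("max_sequence_length"),
            nb::arg("enable_partial_reuse") = true, nb::arg("copy_on_partial_reuse") = true);
}
```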
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
689-696
: LGTM! KV cache manager creation updated for partial reuse. The createKvCacheManager call correctly passes:
- Fixed sequence length based on cache type (encoder vs decoder)
- New partial reuse configuration parameters
This enables the per-sequence block management features described in the PR.
2477-2478
: Confirm intent: isUseOneMoreBlock is effectively always false. isUseOneMoreBlock is declared at cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h:110, initialized to false at cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:361, and the only usage found is the kernel call at cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:2477–2478. No tests exercise this flag. Either restore dynamic assignment (and add unit tests for the "one more block" rewind path) or remove the parameter/callsite to avoid dead/unused behavior.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (11)
60-62
: LGTM! SWA-specific extra block constant is well-defined. The addition of the kSWAExtraBlock constant for sliding window attention (SWA) buffer allocation is appropriate and follows the established naming conventions in the codebase.
100-101
: LGTM! Per-sequence window metadata additions are appropriate. The clarification of maxTokenNum as per-sequence and the addition of the maxBlocksPerSeq field align well with the per-sequence tracking requirements for SWA support.
349-351
: Good transition from cyclic threshold to front-block accounting. Replacing mCyclicThreshold with mNumFrontBlocksRemoved provides clearer semantics for tracking detached blocks in SWA scenarios. The field is well-initialized to 0.
394-397
: LGTM! GenerationRequest methods for front block management are well-designed. The addition of getNumFrontBlocksRemoved(), removeFrontBlock(), and related state tracking provides a clean interface for managing out-of-window blocks in SWA.
Also applies to: 436-437, 439-442, 485-488
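For illustration only, a minimal sketch of what such per-sequence front-block bookkeeping could look like; the member names echo the review, but the body is an assumption rather than the actual GenerationRequest implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

using SizeType32 = std::int32_t;

// Illustrative sketch: a sequence keeps its block ids in order and counts how
// many front (oldest) blocks have been detached as they fell out of the window.
class GenerationRequestSketch
{
public:
    [[nodiscard]] SizeType32 getNumFrontBlocksRemoved() const
    {
        return mNumFrontBlocksRemoved;
    }

    // Drop the oldest block from the sequence's bookkeeping once it is
    // out-of-window; the block itself is freed or reused by the block manager.
    void removeFrontBlock()
    {
        assert(!mCacheBlockIds.empty());
        mCacheBlockIds.pop_front();
        ++mNumFrontBlocksRemoved;
    }

    void addBlock(SizeType32 blockId)
    {
        mCacheBlockIds.push_back(blockId);
    }

private:
    std::deque<SizeType32> mCacheBlockIds;
    SizeType32 mNumFrontBlocksRemoved{0};
};
```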
554-554
: LGTM! WindowBlockManager SWA awareness is properly integrated. The addition of the isSWA parameter throughout the constructor and the isSWA() accessor method provide essential SWA awareness to the block management layer.
Also applies to: 608-609, 661-661, 675-675, 786-789, 871-871
588-589
: Documentation improvement enhances clarity. The updated documentation clearly explains when blocks will be stored during release, which is crucial for understanding reuse behavior in SWA contexts.
594-596
: LGTM! New block management methods for SWA are well-structured. The addition of updateLastCacheBlockOffsets, detachFrontBlock, and adjustBlocksIfNeeded provides the necessary infrastructure for SWA block lifecycle management.
Also applies to: 600-609
791-806
: Sequence storage validity tracking is well-implemented. The addition of methods to track sequence storage validity (initializeSequenceStorageValidity, releaseSequenceStorageValidity, isSequenceValidForStoreForReuse) along with the mapping structures provides the necessary bookkeeping for SWA reuse scenarios.
Also applies to: 913-919
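For illustration only, a minimal sketch of the kind of validity bookkeeping these method names suggest, assuming a per-sequence flag that is cleared once any of the sequence's detached blocks is taken by another sequence; the structure and names are assumptions, not the actual WindowBlockManager code.

```cpp
#include <cstdint>
#include <unordered_map>

using SequenceIdType = std::uint64_t;

// Illustrative sketch of per-sequence "still clean for store-for-reuse" flags.
class SequenceStorageValiditySketch
{
public:
    void initializeSequenceStorageValidity(SequenceIdType seqId)
    {
        mValidForStore[seqId] = true; // clean until one of its blocks is overwritten
    }

    void invalidateSequence(SequenceIdType seqId)
    {
        auto it = mValidForStore.find(seqId);
        if (it != mValidForStore.end())
        {
            it->second = false; // a detached block was acquired by another sequence
        }
    }

    [[nodiscard]] bool isSequenceValidForStoreForReuse(SequenceIdType seqId) const
    {
        auto it = mValidForStore.find(seqId);
        return it != mValidForStore.end() && it->second;
    }

    void releaseSequenceStorageValidity(SequenceIdType seqId)
    {
        mValidForStore.erase(seqId);
    }

private:
    std::unordered_map<SequenceIdType, bool> mValidForStore;
};
```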
1200-1206
: LGTM! Clear documentation for adjustBlocksIfNeeded in BlockManager. The comprehensive documentation explains the method's behavior for both block allocation and detachment in SWA scenarios.
1208-1236
: Excellent per-sequence lifecycle management implementation. The addition of sequence management methods (isSequenceHeld, holdSequence, releaseSequence, isSequenceValidForStoreForReuse) and the mManagedSequences tracking set provides robust lifecycle management for sequences across all window sizes.
Also applies to: 1238-1248, 1284-1285
931-931
: LGTM! Constructor signatures properly updated for maxSequenceLength. All KVCacheManager constructors now consistently use a non-optional SizeType32 maxSequenceLength parameter, aligning with the design decision to enforce this as a required parameter.
Also applies to: 1503-1504, 1514-1515, 1525-1526, 1536-1538, 1853-1854, 1887-1888
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (16)
160-191
: LGTM! Well-implemented getAllSequenceBlocks helper function. The function efficiently traverses the block chain backwards with pre-allocation for performance optimization. Good implementation with proper null checks and early return for empty sequences.
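For illustration only, a sketch of a backward block-chain walk with pre-allocation, assuming each block exposes a previous-block pointer and the caller knows the expected block count; this is not the actual getAllSequenceBlocks code.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative block node with a link to its previous block in the sequence.
struct BlockNode
{
    int blockId{0};
    std::shared_ptr<BlockNode> prevBlock; // nullptr for the first block
};

// Walk from the last block back to the front, reserving space up front, then
// reverse so the result is ordered front-to-back.
std::vector<std::shared_ptr<BlockNode>> getAllSequenceBlocksSketch(
    std::shared_ptr<BlockNode> lastBlock, std::size_t expectedNumBlocks)
{
    std::vector<std::shared_ptr<BlockNode>> blocks;
    if (lastBlock == nullptr)
    {
        return blocks; // empty sequence
    }
    blocks.reserve(expectedNumBlocks);
    for (auto block = lastBlock; block != nullptr; block = block->prevBlock)
    {
        blocks.push_back(block);
    }
    std::reverse(blocks.begin(), blocks.end());
    return blocks;
}
```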
596-600
: Good defensive programming with window size validation. The warning for window sizes exceeding max sequence length helps catch configuration issues early.
625-637
: Clear explanation of SWA block allocation requirements. The detailed comment explaining the SWA linear allocation behavior and the temporary resolution for tempAttentionWindow calculation provides good context for future maintainers. The capping to maxSequenceLength is a reasonable temporary solution.
641-645
: LGTM! Enhanced logging for KV cache blocks per sequence. The improved logging now includes window size information, which is helpful for debugging variable window attention configurations.
793-798
: Appropriate SWA guard for context block storage. The check to skip storing context blocks for SWA sequences prevents potential issues with out-of-window blocks being incorrectly reused.
995-1022
: Excellent per-sequence ownership tracking in getFreeBlock. The implementation properly tracks block ownership by sequence and invalidates reuse when blocks are acquired by different sequences. The detailed debug logging helps with troubleshooting reuse issues.
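For illustration only, a sketch of the acquisition-time invalidation idea described above, assuming a block-to-owner map and a per-sequence validity flag; names are illustrative and this is not the actual getFreeBlock logic.

```cpp
#include <cstdint>
#include <unordered_map>

using SequenceIdType = std::uint64_t;
using BlockIdType = std::int32_t;

// Illustrative sketch: when a free block previously belonged to another
// sequence whose reuse bookkeeping is still pending, taking the block
// invalidates that sequence's store-for-reuse eligibility.
class BlockOwnershipSketch
{
public:
    void onBlockAcquired(BlockIdType blockId, SequenceIdType newOwner,
        std::unordered_map<SequenceIdType, bool>& validForStore)
    {
        auto it = mBlockToOwner.find(blockId);
        if (it != mBlockToOwner.end() && it->second != newOwner)
        {
            // The previous owner's detached block is being overwritten:
            // its blocks are no longer clean, so it cannot be stored for reuse.
            auto prev = validForStore.find(it->second);
            if (prev != validForStore.end())
            {
                prev->second = false;
            }
        }
        mBlockToOwner[blockId] = newOwner;
    }

private:
    std::unordered_map<BlockIdType, SequenceIdType> mBlockToOwner;
};
```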
1365-1391
: LGTM! Well-implemented adjustBlocksIfNeeded logic. The method correctly handles both block detachment for SWA and new block allocation at block boundaries. The while loop properly handles multiple OOW blocks in the first generation step.
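For illustration only, a sketch of the detach-while-out-of-window loop described above, with simplified token accounting; it is an assumption about the general shape of the logic, not the TensorRT-LLM adjustBlocksIfNeeded implementation.

```cpp
#include <deque>

// Illustrative sketch: detach front blocks while the oldest block's tokens are
// entirely outside the attention window. Several blocks may be detached in one
// call, e.g. on the first generation step after a long prompt.
struct SequenceSketch
{
    int numTokens{0};             // tokens produced so far
    int numFrontTokensDropped{0}; // tokens covered by already-detached blocks
    std::deque<int> blockIds;     // remaining block ids, front-to-back
};

void adjustBlocksIfNeededSketch(SequenceSketch& seq, int windowSize, int tokensPerBlock)
{
    // The front block is fully out-of-window once every token it holds is
    // older than the start of the attention window.
    while (!seq.blockIds.empty()
        && seq.numTokens - seq.numFrontTokensDropped - tokensPerBlock >= windowSize)
    {
        seq.blockIds.pop_front(); // detach the oldest block
        seq.numFrontTokensDropped += tokensPerBlock;
    }
}
```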
1679-1684
: Good SWA handling in storeNewBlock. The check to skip storing new blocks for SWA sequences prevents issues with blocks that may go out-of-window.
1739-1796
: Comprehensive releaseBlocks implementation with SWA support. The method properly handles sequence validity checking for SWA and correctly manages the release of blocks while accounting for front blocks that have been removed.
2146-2181
: LGTM! Robust detachFrontBlock implementation. The method correctly handles front block detachment with proper ref counting and includes a check to ensure beam width is 1 (as SWA with beam search is not yet supported).
2213-2219
: Good sequence lifecycle management in addSequence. The method properly initializes sequence storage validity for new sequences, ensuring proper bookkeeping for all window sizes.
2267-2276
: Good defensive programming in storeContextBlocks. The method includes proper error handling with a warning when the sequence is not found, rather than crashing.
2311-2315
: LGTM! Proper sequence cleanup in removeSequence. The method correctly releases the sequence and verifies it's no longer held, with appropriate debug logging.
2693-2699
: LGTM! Proper SWA extra block accounting. The method correctly adds the extra block for SWA when the sequence length exceeds the window size.
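For illustration only, a worked example of the extra-block idea, assuming an extra-block value of 1 (mirroring kSWAExtraBlock) and simple ceiling division; the exact formula in the manager may differ.

```cpp
#include <cstdio>

// Illustrative: with SWA, block usage is bounded by the window rather than the
// full sequence, but one extra block may be needed because the window start
// usually does not align with a block boundary.
int blocksNeededSketch(int seqLen, int windowSize, int tokensPerBlock)
{
    int const kSWAExtraBlockSketch = 1; // assumption mirroring kSWAExtraBlock
    int const attendedTokens = seqLen < windowSize ? seqLen : windowSize;
    int blocks = (attendedTokens + tokensPerBlock - 1) / tokensPerBlock; // ceil div
    if (seqLen > windowSize)
    {
        blocks += kSWAExtraBlockSketch; // window straddles a block boundary
    }
    return blocks;
}

int main()
{
    // windowSize = 96, tokensPerBlock = 32: a 200-token sequence needs
    // ceil(96/32) + 1 = 4 blocks; a 64-token sequence needs ceil(64/32) = 2.
    std::printf("%d\n", blocksNeededSketch(200, 96, 32)); // prints 4
    std::printf("%d\n", blocksNeededSketch(64, 96, 32));  // prints 2
    return 0;
}
```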
2244-2244
: isShareLastContextBlock logic is correct — cross-KV should always share the last context block. WindowBlockManager already shares earlier context blocks; the final block is shared when either isCrossKv() (encoder/cross-attention outputs are the same across beams) or when the input ends exactly on a block boundary (no partial last block). Keep as-is.
2066-2073
: Verify SWA extra-block calculation (numExtraBlocksPerBeam). Inspected KVCacheManager::getRemainingBlocksToCompletion (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2066–2073). The code sets numExtraBlocksPerBeam = 1 only when:
- (req.mPromptLen + req.mMaxNewTokens) > windowSize, and
- ceilDiv(req.mPromptLen + req.mMaxNewTokens) > ceilDiv(req.getNumTokens(0)), and
- ceilDiv(req.mPromptLen + req.mMaxNewTokens) > numTotalBlocksPerBeam.
Confirm these are the intended semantics. Specifically verify:
- temporaryAttentionWindow / mSinkBubbleLength interaction with isSlidingWindow and numTotalBlocksPerBeam (should the sliding check consider windowSize + temporaryAttentionWindow?),
- whether willCrossBlockBoundary should compare against numAllocBlocksPerBeam (allocated blocks) instead of currentSeqlenInBlocks (req.getNumTokens(0)),
- consistency with kSWAExtraBlock usage in calculateMaxBlockRequirements*.
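For illustration only, a tiny sketch of the ceiling-division helper and a "will cross a block boundary" style of check that conditions like these rely on, with tokensPerBlock made explicit; the actual predicate in getRemainingBlocksToCompletion may differ.

```cpp
using SizeType32 = int;

// Standard ceiling division for positive operands.
constexpr SizeType32 ceilDiv(SizeType32 numerator, SizeType32 denominator)
{
    return (numerator + denominator - 1) / denominator;
}

// Illustrative predicate: one more block per beam is needed only if the final
// token count lands in a later block than the current token count.
constexpr bool willCrossBlockBoundarySketch(
    SizeType32 currentTokens, SizeType32 finalTokens, SizeType32 tokensPerBlock)
{
    return ceilDiv(finalTokens, tokensPerBlock) > ceilDiv(currentTokens, tokensPerBlock);
}

int main()
{
    static_assert(ceilDiv(100, 32) == 4, "100 tokens occupy 4 blocks of 32");
    static_assert(willCrossBlockBoundarySketch(100, 130, 32), "130 tokens need a 5th block");
    static_assert(!willCrossBlockBoundarySketch(100, 120, 32), "120 tokens still fit in 4 blocks");
    return 0;
}
```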
Force-pushed 3da009f to 45070b1
Force-pushed 45070b1 to feec60c
@coderabbitai generate docstrings
✅ Actions performed: Docstrings generation started.
Caution: An unexpected error occurred while opening a pull request: Reference update failed - https://docs.github.com/rest/git/refs#create-a-reference
/bot run --disable-fail-fast
PR_Github #19779 [ run ] triggered by Bot
PR_Github #19779 [ run ] completed with state
This MR is a continuation of NVIDIA#6768. In the previous merge request, OOW (out-of-window) blocks are only detached when reuse is not enabled. This MR enables the KV cache manager to detach the block when reuse is enabled. When reuse is enabled, an OOW block gets detached while the manager bookkeeps whether the block is overwritten by another sequence. If at the end of the sequence all of its blocks are still clean, the sequence's blocks will be stored. Signed-off-by: eopXD <[email protected]>
Force-pushed feec60c to 9e2bb25
/bot run --disable-fail-fast
PR_Github #19916 [ run ] triggered by Bot
PR_Github #19916 [ run ] completed with state
Summary by CodeRabbit
Description
NOTE: The MR this change depends on is still in progress and not merged yet.
This MR is a continuation of #6768. In the previous merge request, OOW (out-of-window) blocks were only detached when reuse was not enabled. This MR enables the KV cache manager to detach such blocks when reuse is enabled.
When reuse is enabled, an OOW block is detached while the manager bookkeeps whether the block is later overwritten by another sequence. If, at the end of the sequence, all of its blocks are still clean, the sequence's blocks are stored for reuse.
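For illustration only, a toy model of this bookkeeping under stated assumptions: a block-to-owner map plus a set of "dirtied" sequences, where acquiring another sequence's detached block marks that sequence as not clean. Names are invented and this is not the KVCacheManager API.

```cpp
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

// Toy model of the described bookkeeping: sequence A detaches an out-of-window
// block; if sequence B later acquires that block, A is no longer "clean" and
// its blocks are not stored for reuse when A finishes.
int main()
{
    std::unordered_map<int, int> blockOwner; // blockId -> last owning sequence
    std::unordered_set<int> dirtySequences;  // sequences whose detached blocks were overwritten

    int const seqA = 1;
    int const seqB = 2;
    int const oowBlock = 42;

    blockOwner[oowBlock] = seqA; // A detaches block 42 (out-of-window), block becomes free

    // B acquires the free block 42, so A's cached content is overwritten.
    if (blockOwner[oowBlock] != seqB)
    {
        dirtySequences.insert(blockOwner[oowBlock]);
        blockOwner[oowBlock] = seqB;
    }

    bool const storeA = dirtySequences.count(seqA) == 0;
    std::printf("store sequence A blocks for reuse: %s\n", storeA ? "yes" : "no"); // prints "no"
    return 0;
}
```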
Test Coverage
Regarding unit tests, KVCacheManagerVariableWindowAttentionWithReuseTest, KVCacheManagerSWAInvalidateReuseTest, and KVCacheManagerMaxAttentionWindowWithReuseTest are added to verify SWA reuse and the block-invalidating mechanism.
Regarding end-to-end coverage, the existing VSWA tests for Gemma3 on the MMLU and GSM8K benchmarks cover the code paths added in this merge request.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.