[None][feature] Add environment variable to adjust block pool allocation ratio under kv cache manager #7923
📝 Walkthrough

Introduces Sliding Window Attention (SWA) awareness across KV-cache management: new flags, bookkeeping, and block lifecycle methods; removes cyclic logic; makes maxSequenceLength non-optional across constructors; updates reuse/store/release flows; adjusts capacity calculations; aligns bindings, C++ call sites, tests, Python utilities, and example help texts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Req as GenerationRequest
    participant BM as BlockManager
    participant WBM as WindowBlockManager
    participant Pool as Block Pools
    Note over Req,BM: Token generation step (SWA-aware)
    Req->>BM: requestNextStep()
    BM->>WBM: adjustBlocksIfNeeded(reqState)
    alt SWA and window exceeded
        WBM->>WBM: detachFrontBlock()
        WBM->>Pool: storeBlocks(...) -> numStored
        Note right of WBM: Store count returned
    else No detachment needed
        WBM->>WBM: no-op
    end
    WBM->>Pool: acquireIfNeeded()
    Pool-->>WBM: block(s) or wait/fail
    WBM-->>BM: updated block layout
    BM-->>Req: proceed with step
```
```mermaid
sequenceDiagram
    autonumber
    participant BM as BlockManager
    participant WBM as WindowBlockManager
    participant Pool as Block Pools
    Note over BM,WBM: Release path (SWA-aware)
    BM->>WBM: releaseSequence(seqId)
    alt SWA sequence
        WBM->>WBM: getAllSequenceBlocks()
        WBM->>Pool: storeBlocks(...) -> numStored
    else Non-SWA
        WBM->>Pool: storeBlocks(...) -> numStored
    end
    Pool-->>BM: updated free counts
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/models/core/qwen2audio/utils.py (1)

1-1: Add SPDX + NVIDIA Apache-2.0 header.

Per coding guidelines, prepend the NVIDIA header with the current year to all source files. Apply this at the top of the file:

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from argparse import BooleanOptionalAction
```

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
36-40: Add missing standard headers to cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp.

The file uses std::stringstream (line ~2458), std::getenv (line ~2453), and std::accumulate (lines ~529/541/551) but does not include the corresponding headers. Add:

- `#include <sstream>`
- `#include <cstdlib>`
- `#include <numeric>`

Insert them alongside the other standard includes.
🧹 Nitpick comments (18)
tensorrt_llm/functional.py (1)

5361-5361: Docstring wording consistency (SWA/KV cache).

Prefer consistent capitalization and be more explicit about scope (per-layer max attention window sizes). Suggest:

```diff
-    This controls the sliding-window-attention kv-cache features.
+    Controls Sliding Window Attention (SWA) KV cache behavior (per-layer max attention window sizes).
```

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
1115-1122: VSWA + chunked prefill (reuse) — OK; consider a light param refactor.

This mirrors the previous test with only reuse toggled. Consider parametrizing reuse to reduce duplication across the two VSWA chunked-prefill tests.
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)
1316-1321: Temporarily waived window tests — please track.

Waiving is fine for now. Add a TODO with a tracking issue to restore SWA window variants.

```diff
-// (eop) Waive off isWindow test for now
+// TODO(eop): Restore isWindow test variants after SWA rework is complete. Track: <issue-id>
```
1323-1330: Avoid commented-out INSTANTIATE block in test sources.

Prefer removing or gating by a macro/flag to keep test code clean.

```diff
-// INSTANTIATE_TEST_CASE_P(AsymmetricCaseTestWithWindow, AsymmetricalCacheTest,
-//     testing::Combine(testing::Values(1), testing::Values(1), testing::Values(1), testing::Values(1),
-//         testing::Values(1),
-//         testing::Values(1), testing::Values(5), testing::Values(4), testing::Values(4), testing::Values(8),
-//         testing::Values(nvinfer1::DataType::kFLOAT, nvinfer1::DataType::kINT8), testing::Values(2),
-//         testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(true)));
+#if 0 // TODO(eop): Re-enable after SWA window tests are stable.
+INSTANTIATE_TEST_CASE_P(AsymmetricCaseTestWithWindow, AsymmetricalCacheTest,
+    testing::Combine(testing::Values(1), testing::Values(1), testing::Values(1), testing::Values(1),
+        testing::Values(1), testing::Values(1), testing::Values(5), testing::Values(4), testing::Values(4),
+        testing::Values(8), testing::Values(nvinfer1::DataType::kFLOAT, nvinfer1::DataType::kINT8), testing::Values(2),
+        testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(true)));
+#endif
```
1334-1337: Nit: remove inline commented parameter hints in values lists.

The trailing comments inside testing::Values hurt readability.

```diff
-        testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(false /*, true*/)));
+        testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(false)));
```
583-587: Shadowing bug: cacheType never updated for kvFactor == 1.

`auto cacheType` in the if-block shadows the outer variable, so kSELFKONLY is never used.

```diff
-    CacheType cacheType = CacheType::kSELF;
-    if (kvFactor == 1)
-    {
-        auto cacheType = CacheType::kSELFKONLY;
-    }
+    CacheType cacheType = CacheType::kSELF;
+    if (kvFactor == 1)
+    {
+        cacheType = CacheType::kSELFKONLY;
+    }
```

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
351-352: Front-block removal bookkeeping — minor polish.

API is fine. The windowSize param in removeFrontBlock is unused; mark it to avoid warnings.

```diff
-    void removeFrontBlock(SizeType32 windowSize)
+    void removeFrontBlock(SizeType32 /*windowSize*/)
     {
         ++mNumFrontBlocksRemoved;
     }
```

Also applies to: 394-398, 436-442
1428-1468: KVCacheManager constructors updated — minor naming nit.

The parameter name copyOnpartialReuse is inconsistently cased vs. copyOnPartialReuse elsewhere. Consider normalizing for consistency.

```diff
-    bool copyOnpartialReuse = true,
+    bool copyOnPartialReuse = true,
```

Note: update definitions/uses accordingly across the TU.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (4)
618-619: Use 0 (not false) for sinkTokenLength; consider naming maxSequenceLength explicitly.

Passing `false` for `sinkTokenLength` is confusing; prefer `0`. Also, here `maxAttentionWindow` is used as `maxSequenceLength`. For consistency with other tests, define `maxSequenceLength` and pass that.

```diff
-        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
-        false, stream, maxAttentionWindow, true, onboardBlocks);
+        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
+        0, stream, maxAttentionWindow, true, onboardBlocks);
```
3059-3163: New test for SWA window smaller than block size is solid; tighten types.

Logic/expectations look right. Minor nit: prefer SizeType32 for `inputLength` to match APIs.

```diff
-    int inputLength = 2;
+    SizeType32 inputLength = 2;
```

Please confirm this test is stable across devices; edge-case scheduling of detach/allocate can vary slightly if the implementation changes heuristics.
3920-3933: NeededBlocksOneStep: assertions refined; minor style nits.

The step-wise assertions are precise. Minor: prefer `auto const` and fixed-width ints for counters to align with surrounding types.

```diff
-    auto numUsedBlocksThisStep = kvCacheManager.getNumAllocTotalBlocks() - currentNumAllocTotalBlocks;
+    auto const numUsedBlocksThisStep = kvCacheManager.getNumAllocTotalBlocks() - currentNumAllocTotalBlocks;
```

Also applies to: 3950-4011
4161-4186: Potential misuse of maxNumSequences in createKvCacheManager.

`numBlocksInPrimaryPool` is passed as `maxNumSequences`. That parameter represents "max concurrent sequences", not a block count. It may work incidentally but is semantically off.

```diff
-        numBlocksInPrimaryPool, kvCacheInstantiationParameters.maxBeamWidth,
+        /* maxNumSequences = */ kvCacheInstantiationParameters.maxNumSequences, kvCacheInstantiationParameters.maxBeamWidth,
```

Outside this hunk, add a field and populate it:

```cpp
// In KvCacheManagerInstantiationParameters
SizeType32 maxNumSequences = 8; // or a test-specific value
```

Confirm no tests rely on maxNumSequences being huge; if they do, set it explicitly in the parameter sets.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6)
1325-1352: SWA detach/allocate step logic is reasonable.

- Detach only when past `window + block` and block reuse is disabled.
- Allocate on block boundaries.

Minor: the `manager` variable in the caller loop isn't used.

```diff
-    for (auto& [windowSize, manager] : mWindowBlockManagers)
+    for (auto& [windowSize, _] : mWindowBlockManagers)
```
1698-1746: Release flow for SWA: guard empty and clarify.

- Replacing `allocatedBlocks` with `getAllSequenceBlocks` is fine but will crash if empty. Add a quick check before `allocatedBlocks.back()`.
- The loop correctly accounts for previously detached front blocks.

```diff
-    if (mIsSWA)
-    {
-        // For SWA, get all blocks in the sequence.
-        allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
-    }
+    if (mIsSWA)
+    {
+        if (!allocatedBlocks.empty())
+        {
+            // For SWA, get all blocks in the sequence.
+            allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
+        }
+    }
```

Also, the in-function comment about not releasing when reuse is enabled contradicts the call site (detach only when reuse is disabled). Consider updating the comment.
2096-2125: detachFrontBlock: unused parameter; tighten comments.

`isEnableBlockReuse` isn't used inside; either remove the param or use it to control release behavior. Current call sites only invoke this when reuse is disabled.

```diff
-void WindowBlockManager::detachFrontBlock(GenerationRequest& sequence, bool const isEnableBlockReuse)
+void WindowBlockManager::detachFrontBlock(GenerationRequest& sequence)
```

Update the caller accordingly.
505-517: Docstring for proportion function: minor nits.

Good high-level doc; consider referencing the new ENV override here to avoid stale docs.
1053-1059: Offload policy should be centralized.

The comment acknowledges current behavior; long term, prefer delegating the decision fully to the eviction policy to avoid extra traffic.
2648-2660: Rename kSWAExtraBlock to kSWA_EXTRA_BLOCK and apply consistently.

A static constexpr is already defined in cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:62 — change to:

```cpp
static constexpr SizeType32 kSWA_EXTRA_BLOCK = 1;
```

and update all usages at:

- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2657
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2691
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp:4089
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (30 hunks)
- cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1 hunks)
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (49 hunks)
- cpp/tests/unit_tests/executor/agentCommTest.cpp (1 hunks)
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3 hunks)
- examples/models/core/llama/summarize_long.py (1 hunks)
- examples/models/core/qwen2audio/utils.py (1 hunks)
- examples/utils.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/functional.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
examples/models/core/llama/summarize_long.py
tensorrt_llm/functional.py
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
examples/utils.py
examples/models/core/qwen2audio/utils.py
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
tests/integration/defs/accuracy/test_llm_api_pytorch.py
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
examples/models/core/llama/summarize_long.py
tensorrt_llm/functional.py
tensorrt_llm/_torch/pyexecutor/_util.py
examples/utils.py
examples/models/core/qwen2audio/utils.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
examples/models/core/llama/summarize_long.py
tensorrt_llm/functional.py
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
examples/utils.py
examples/models/core/qwen2audio/utils.py
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
tests/integration/defs/accuracy/test_llm_api_pytorch.py
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (10)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
cpp/tests/unit_tests/executor/agentCommTest.cpp
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.
Applied to files:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧬 Code graph analysis (8)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength
(47-47)cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
stream
(4231-4237)stream
(4317-4323)stream
(4676-4682)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength
(47-47)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength
(47-47)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (1)
KvCacheConfig (972-1106)
tensorrt_llm/llmapi/llm.py (1)
LLM (1022-1038)
tests/integration/defs/accuracy/accuracy_core.py (4)
GSM8K (293-308)
evaluate (147-206)
evaluate (712-722)
MMLU (276-290)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (19)
releaseBlocks (1619-1634)
releaseBlocks (1619-1619)
releaseBlocks (1698-1746)
releaseBlocks (1698-1698)
updateLastCacheBlockOffsets (2053-2069)
updateLastCacheBlockOffsets (2053-2053)
releaseLastBlock (1585-1588)
releaseLastBlock (1585-1585)
releaseLastBlock (1590-1607)
releaseLastBlock (1590-1590)
detachFrontBlock (2096-2125)
detachFrontBlock (2096-2096)
adjustBlocksIfNeeded (1325-1331)
adjustBlocksIfNeeded (1325-1325)
adjustBlocksIfNeeded (1333-1352)
adjustBlocksIfNeeded (1333-1333)
storeBlocks (1446-1509)
storeBlocks (1446-1447)
KVCacheBlock (272-285)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
storeBlocks (969-973)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
KVCacheManager (1770-1781)
KVCacheManager (1783-1797)
KVCacheManager (1799-1831)
KVCacheManager (1833-1846)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
KVCacheManager (142-1011)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
createKvCacheManager (637-723)
createKvCacheManager (637-639)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (52)
examples/models/core/llama/summarize_long.py (1)
48-49: Wording tweak LGTM. Help text reads clearly and matches current terminology.
examples/models/core/qwen2audio/utils.py (1)
41-42: Wording tweak LGTM. Terminology is now consistent with SWA/KV cache usage elsewhere.
examples/utils.py (1)
361-362: Wording tweak LGTM. Minor doc text change only; no behavioral impact.
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
86-87: Constructor update looks correct. Passing kvMaxNumTokens as the new non-optional parameter after stream aligns with the updated KVCacheManager signature.
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
136-139: Constructor update LGTM. Providing maxNumTokensPerSeq in the KVCacheManager ctor matches the non-optional max sequence length change.
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)
59-63: KVCacheManager ctor arg shift LGTM. Supplying kvMaxNumTokens after stream matches the updated API; no other changes needed here.
tests/integration/test_lists/test-db/l0_h100.yml (1)
39-42: Verified — VSWA tests exist and are discoverable. The four tests are defined in tests/integration/defs/accuracy/test_llm_api_pytorch.py (defs at ~1060, 1074, 1087, 1108) and are listed in tests/integration/test_lists/test-db/l0_h100.yml (lines 39–42).
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
361-361: Rewind uses no extra block — please confirm boundary safety. Setting isUseOneMoreBlock to false changes rewind behavior at block boundaries. Please confirm no off-by-one issues when rewinding exactly at tokensPerBlock multiples and that invokeUpdateKVBlockArrayDraftTokenLocation handles this without needing a guard block.
688-697: KVCacheManager ctor args and cross/self maxSequenceLength look correct. Passing streamPtr, dtype, sinkTokenLen, and selecting maxEncoderLen for CROSS vs getMaxSequenceLen for SELF aligns with the updated ctor and the non-optional maxSequenceLength requirement.
Please confirm:
- The nanobind bindings and all call sites use the same ctor signature/order.
- Cross-cache paths don’t (implicitly) rely on sinkTokenLen semantics intended for self-attention.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
1060-1066: VSWA test (no reuse) looks good; window vector cycling is assumed. The per-layer max_attention_window vector size likely differs from numLayers and will be cycled. If Gemma3-1B layer count changes, this remains robust.
If flakiness appears, consider asserting the effective expanded window vector in the runtime to ensure expected cycling.
1074-1081: VSWA with reuse enabled — ensure SWA+beam isn’t exercised. Reuse with SWA is fine; the known broken path is SWA+beam search. GSM8K/MMLU default to sampling, so this should be safe.
Please confirm no hidden beam-search params are applied via harness defaults.
1101-1106: VSWA + chunked prefill (no reuse) — OK. Good coverage for chunked prefill + VSWA with reuse disabled.
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)
218-222: Constructor arg update looks correct (tokens, not blocks). Passing maxNumTokens as max_sequence_length aligns with the non-optional constructor. Units are tokens; computed as tokensPerBlock * maxBlocksPerSeq. LGTM.
621-623: Consistent KVCacheManager signature usage. Updated argument ordering (sink_token_length, stream, max_sequence_length, …) matches the new API. LGTM.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (6)
60-63: Add kSWAExtraBlock constant — OK. Clear constant for SWA buffer accounting.
100-114: WindowSizeMetadata additions — OK. maxTokenNum and maxBlocksPerSeq are useful; toString updated accordingly.
551-557: SWA-aware lifecycle in WindowBlockManager — OK. Constructor now carries isSWA; added adjust/detach/update helpers and isSWA(). Interfaces look consistent with the cpp implementation.
Please ensure all call sites construct WindowBlockManager with the correct isSWA value per window size.
Also applies to: 592-608, 783-788, 850-852
902-913: BlockManager constructor: non-optional maxSequenceLength — OK. Signature aligns with retrieved learnings. Verify warnings are emitted when windowSize > maxSequenceLength (per policy).
1549-1555: Offset table dims selection — OK. Using the widest window’s maxBlocksPerSeq for a single offset table is reasonable. LGTM.
1173-1180: Expose adjustBlocksIfNeeded on BlockManager — OK. Surface mirrors the WindowBlockManager API; ensures per-window SWA handling is invoked.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (19)
708-711: Helpful inline comments. The clarifications around block contents during reuse improve test readability.
1962-1981: Constructor updates to include non-optional maxSequenceLength look correct. Defining maxSequenceLength and passing it through matches the production signature.
2123-2142: Consistent use of maxSequenceLength in decode priority test. Good alignment with the new constructor signature.
2231-2249: Timed eviction test updated for maxSequenceLength. Looks correct.
2379-2396: Secondary block primary child test: signature alignment LGTM. maxSequenceLength defined and used correctly.
2455-2471: Leaf block test: constructor call updated properly. No issues.
2650-2654: Allocation test: introduces maxSequenceLength and forwards it. Consistent with the API change. Calculated values look sane.
Also applies to: 2670-2674
2712-2715: KVCacheManagerTest: consistent maxSequenceLength and forwarding. Good cleanup and consistency.
Also applies to: 2732-2735
2828-2828: Adjusted expectation accounts for two addToken calls. Matches the updated allocation behavior.
2861-2864: Rewind tokens test updated for maxSequenceLength. Looks correct.
Also applies to: 2880-2883
2945-2952: MaxAttentionWindow test: explicit maxSequenceLength and derived values. Consistent and clear.
Also applies to: 2966-2970
3185-3191: Event stream test: uses maxSequenceLength consistently. No issues.
3342-3343: Event stream overflow/blocking tests: constructor updates LGTM. Signature alignment is correct.
Also applies to: 3348-3349, 3490-3490
3401-3402: Event stream priority test: constructor updates LGTM. All good.
Also applies to: 3407-3408
3484-3484: Event stream blocking test: KVCacheManagerTest instance uses updated signature. Looks correct.
3530-3532: Window size event stream test: forwards maxSequenceLength. No issues.
Also applies to: 3540-3541
3845-3863: Commented sections (shared vs unshared blocks) improve understanding. Nice clarity without affecting logic.
4089-4090: Confirm kSWAExtraBlock visibility. Ensure kSWAExtraBlock is included in this TU (via the header) to avoid ODR/visibility issues on some compilers/configs.
4659-4970: Parametrized NeededBlocksOneStepTest adds strong coverage. Good matrix (context vs decode, twoStepsLookAhead, drafts). No issues.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (13)
160-191: Helper to gather sequence blocks looks correct. Backwards traversal with root guard and preallocation are sound; no off‑by‑one. LGTM.
592-597: Correct: warn when windowSize > maxSequenceLength. Matches prior guidance; prevents silent misconfig.
603-607: SWA flag propagation LGTM. isSWA = windowSize < maxSequenceLength is the right criterion.
621-642: Capacity calc and logging LGTM (with SWA cap note). Capping the temp window to maxSequenceLength avoids over‑alloc now; follow‑up removal is noted in comments. Logging is helpful. Please ensure any downstream logic that reads maxBlocksPerSeq expects “full‑attention” sized sequences for SWA windows.
655-673: Constructor changes LGTM. New isSWA arg stored in mIsSWA and other initializations are consistent.
1446-1509: Return count from storeBlocks is useful; keep the BlockManager wrapper consistent. WindowBlockManager::storeBlocks now returns a count. BlockManager::storeBlocks (header inline) ignores the return; that’s fine but surprising.
Consider updating the BlockManager wrapper to return the count (or comment that it’s intentionally ignored).
1934-1939: One‑step needs calc: looks consistent. Capping prompt cache length to windowSize + chunkSize is reasonable with draft tokens.
2053-2069: Offset refresh only for last block: LGTM. Efficiently updates just the newest block offsets per beam.
2089-2095: Hooking adjustBlocksIfNeeded on addToken is correct. Ensures timely detach/allocate on token additions.
2205-2213: Graceful warning when sequence missing: OK. Warning on a missing sequence for storeContextBlocks is appropriate.
2216-2227: Gate storeNewBlock correctly. Skip when beamWidth > 1 or reuse is disabled; otherwise delegate. LGTM.
589-597: Window > maxSequenceLength warning LGTM. Matches retrieved learnings; no action.
2011-2031: SWA extra block: definition exists — rename to follow naming guidelines. kSWAExtraBlock is defined at cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:62 (static constexpr SizeType32 kSWAExtraBlock = 1) and no duplicate definitions were found. Recommend renaming to kSWA_EXTRA_BLOCK and updating usages at:
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2657, 2691
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp:4089
Likely an incorrect or invalid review comment.
54a398d to 6546070 Compare
6546070 to 9709231 Compare
/bot run
PR_Github #19777 [ run ] triggered by Bot
PR_Github #19777 [ run ] completed with state
9709231 to 295dbf3 Compare
/bot run
PR_Github #19915 [ run ] triggered by Bot
PR_Github #19915 [ run ] completed with state
…mory proportion shared
Usage example: export TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6
Signed-off-by: eopXD <[email protected]>
295dbf3 to 70feb7c Compare
/bot run --disable-fail-fast
PR_Github #19974 [ run ] triggered by Bot
PR_Github #19974 [ run ] completed with state
LGTM
Thx for the work. Overall, LGTM, left several comments.
// Normalize shares to 1.0
for (auto& s : shares)
{
    s /= sumShares;
nit) Might be good to log so the user knows the shares were scaled down to these numbers.
For example, when they give [0.6, 1.4], print that it went from [0.6, 1.4] to [0.3, 0.7].
// then setting TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6 will be allocating
// 40% of the memory to window size 512 and 60% of the memory to window
// size 32768.
if (auto envStr = std::getenv("TRTLLM_WINDOW_SIZE_SHARES"))
Do you think we need to add tests for this under cpp/tests/unit_tests/kvCacheManagerTest.cpp?
Summary by CodeRabbit
Description
Usage example:
export TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6
Test Coverage
No test coverage added.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.