
Conversation

@eopXD (Collaborator) commented Sep 23, 2025

Summary by CodeRabbit

  • New Features
    • New configuration knob to enable customized pool allocation for the block pool in the KV cache manager.

Description

Usage example: export TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6
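The variable holds comma-separated fractions, one share per attention window size, splitting the block pool between pools. The helper below is a hypothetical Python sketch of the implied parse-and-validate semantics (the actual implementation is in the C++ KV cache manager; the function name and the sum-to-1.0 check are assumptions, not the PR's code):

```python
import os

def parse_window_size_shares(env_value):
    """Parse comma-separated pool-share fractions, e.g. "0.4,0.6".

    Hypothetical illustration: one fractional share per attention
    window size, assumed to be non-negative and to sum to 1.0.
    """
    shares = [float(s) for s in env_value.split(",")]
    if any(s < 0 for s in shares):
        raise ValueError("shares must be non-negative")
    total = sum(shares)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"shares must sum to 1.0, got {total}")
    return shares

# Two window sizes splitting the block pool 40/60, as in the example above.
os.environ["TRTLLM_WINDOW_SIZE_SHARES"] = "0.4,0.6"
print(parse_window_size_shares(os.environ["TRTLLM_WINDOW_SIZE_SHARES"]))
```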

Test Coverage

No test coverage added.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.


coderabbitai bot commented Sep 23, 2025

📝 Walkthrough

Walkthrough

Introduces Sliding Window Attention (SWA) awareness across KV-cache management: new flags, bookkeeping, and block lifecycle methods; removes cyclic logic; makes maxSequenceLength non-optional across constructors; updates reuse/store/release flows; adjusts capacity calculations; aligns bindings, C++ call sites, tests, Python utilities, and example help texts.

Changes

Cohort / File(s) Summary
Core KV-cache management APIs
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Add SWA flag/paths, extra block constant, front-block bookkeeping, adjust/detach methods, storeBlocks returns count; remove cyclic logic; propagate non-optional maxSequenceLength; update block requirement calculations and env-driven window sharing.
Block/Window managers
.../kvCacheManager.h (WindowBlockManager/BlockManager decls), .../kvCacheManager.cpp (impl)
Constructors gain isSWA; add isSWA() accessor; introduce adjustBlocksIfNeeded, detachFrontBlock; expose free-block queries; align reuse/release/store flows with SWA.
Inflight batching integration
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
KVCacheManager construction updated: pass non-optional maxSequenceLength, new enablePartialReuse/copyOnPartialReuse; simplify rewind inputs (fixed false); remove cyclic assertions.
Python bindings
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
Nanobind constructor switches from std::optional<SizeType32> to SizeType32 for max_sequence_length; updates arg list/names and defaults; adds binding args ordering changes.
Unit tests (KVCacheManager API calls)
cpp/tests/unit_tests/batch_manager/*
Update all KVCacheManager invocations to pass concrete maxSequenceLength instead of std::nullopt; adapt helpers and expectations to new sizing semantics and SWA behavior.
Unit tests (multi-GPU)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
Insert maxNumTokens param in constructor; waive/comment some windowed variants; adjust parameter orders.
Executor tests
cpp/tests/unit_tests/executor/agentCommTest.cpp
Replace std::nullopt with explicit token limit in KVCacheManager constructor.
Python runtime utilities/docs
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/functional.py
Unify KV capacity calculation (no VSWA branch); docstring wording updates for attention window feature.
Examples (CLI help text)
examples/models/core/llama/summarize_long.py, examples/models/core/qwen2audio/utils.py, examples/utils.py
Update --max_attention_window_size descriptions to remove “cyclic” phrasing.
Integration tests
tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/test-db/l0_h100.yml
Replace one VSWA test with four variants (reuse/no-reuse and chunked prefill combinations); update test list entries accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Req as GenerationRequest
  participant BM as BlockManager
  participant WBM as WindowBlockManager
  participant Pool as Block Pools

  Note over Req,BM: Token generation step (SWA-aware)
  Req->>BM: requestNextStep()
  BM->>WBM: adjustBlocksIfNeeded(reqState)
  alt SWA and window exceeded
    WBM->>WBM: detachFrontBlock()
    WBM->>Pool: storeBlocks(...) -> numStored
    Note right of WBM: Store count returned
  else No detachment needed
    WBM->>WBM: no-op
  end
  WBM->>Pool: acquireIfNeeded()
  Pool-->>WBM: block(s) or wait/fail
  WBM-->>BM: updated block layout
  BM-->>Req: proceed with step
sequenceDiagram
  autonumber
  participant BM as BlockManager
  participant WBM as WindowBlockManager
  participant Pool as Block Pools

  Note over BM,WBM: Release path (SWA-aware)
  BM->>WBM: releaseSequence(seqId)
  alt SWA sequence
    WBM->>WBM: getAllSequenceBlocks()
    WBM->>Pool: storeBlocks(...) -> numStored
  else Non-SWA
    WBM->>Pool: storeBlocks(...) -> numStored
  end
  Pool-->>BM: updated free counts
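The token-generation diagram above can be modeled as a toy Python sketch. Assumptions flagged here: a block-granular sliding window, one extra block of slack retained past the window (mirroring the PR's kSWAExtraBlock), and method names borrowed from the PR (adjustBlocksIfNeeded, detachFrontBlock); this is an illustration of the lifecycle, not the C++ implementation.

```python
class SwaSequence:
    """Toy model of SWA-aware block bookkeeping for one sequence."""

    def __init__(self, window_size, tokens_per_block):
        self.window_size = window_size
        self.tokens_per_block = tokens_per_block
        self.blocks = []             # block ids currently attached
        self.num_front_removed = 0   # bookkeeping akin to mNumFrontBlocksRemoved
        self.num_tokens = 0
        self._next_block_id = 0

    def add_token(self):
        """One generation step: allocate on block boundaries, then adjust."""
        self.num_tokens += 1
        if (self.num_tokens - 1) % self.tokens_per_block == 0:
            self.blocks.append(self._next_block_id)
            self._next_block_id += 1
        self._adjust_blocks_if_needed()

    def _adjust_blocks_if_needed(self):
        # Detach the front block once the held tokens exceed the window
        # plus one extra block's worth of slack.
        held = self.num_tokens - self.num_front_removed * self.tokens_per_block
        while held > self.window_size + self.tokens_per_block:
            self.blocks.pop(0)       # detachFrontBlock
            self.num_front_removed += 1
            held -= self.tokens_per_block


seq = SwaSequence(window_size=4, tokens_per_block=2)
for _ in range(9):
    seq.add_token()
print(seq.blocks, seq.num_front_removed)  # steady state: 3 blocks held
```

With window 4 and block size 2, the sequence settles at three attached blocks: two covering the window plus the extra slack block, while num_front_removed tracks the detached front blocks for later release/store accounting.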

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 14.89%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check ⚠️ Warning — The PR description largely retains the template scaffolding without the required summary or detailed explanation; the Description section contains only a usage example, and the Test Coverage section simply states "No test coverage added." Resolution: remove the template instructions, add a concise summary of the changes and their rationale under Description, and list the relevant tests under Test Coverage.
✅ Passed checks (1 passed)
  • Title Check ✅ Passed — The title clearly reflects the primary change (adding an environment variable to adjust the KV cache manager's block-pool allocation) and follows the repository's "[ticket][type] Summary" format. It contains a minor typo ("ration" → "ratio") but remains understandable.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/models/core/qwen2audio/utils.py (1)

1-1: Add SPDX + NVIDIA Apache-2.0 header.

Per coding guidelines, prepend the NVIDIA header with current year to all source files.

Apply this at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from argparse import BooleanOptionalAction
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

36-40: Add missing standard headers to cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp

The file uses std::stringstream (line ~2458), std::getenv (line ~2453), and std::accumulate (lines ~529/541/551) but does not include the corresponding headers. Add:

  • #include <sstream>
  • #include <cstdlib>
  • #include <numeric>

Insert them alongside the other standard includes.

🧹 Nitpick comments (18)
tensorrt_llm/functional.py (1)

5361-5361: Docstring wording consistency (SWA/KV cache).

Prefer consistent capitalization and be more explicit about scope (per-layer max attention window sizes). Suggest:

-            This controls the sliding-window-attention kv-cache features.
+            Controls Sliding Window Attention (SWA) KV cache behavior (per-layer max attention window sizes).
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1115-1122: VSWA + chunked prefill (reuse) — OK; consider light param refactor.

This mirrors the previous test with only reuse toggled. Consider parametrizing reuse to reduce duplication across the two VSWA chunked-prefill tests.

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

1316-1321: Temporarily waived window tests — please track.

Waiving is fine for now. Add a TODO with a tracking issue to restore SWA window variants.

-// (eop) Waive off isWindow test for now
+// TODO(eop): Restore isWindow test variants after SWA rework is complete. Track: <issue-id>

1323-1330: Avoid commented-out INSTANTIATE block in test sources.

Prefer removing or gating by a macro/flag to keep test code clean.

-// INSTANTIATE_TEST_CASE_P(AsymmetricCaseTestWithWindow, AsymmetricalCacheTest,
-//     testing::Combine(testing::Values(1), testing::Values(1), testing::Values(1), testing::Values(1),
-//     testing::Values(1),
-//         testing::Values(1), testing::Values(5), testing::Values(4), testing::Values(4), testing::Values(8),
-//         testing::Values(nvinfer1::DataType::kFLOAT, nvinfer1::DataType::kINT8), testing::Values(2),
-//         testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(true)));
+#if 0  // TODO(eop): Re-enable after SWA window tests are stable.
+INSTANTIATE_TEST_CASE_P(AsymmetricCaseTestWithWindow, AsymmetricalCacheTest,
+    testing::Combine(testing::Values(1), testing::Values(1), testing::Values(1), testing::Values(1),
+        testing::Values(1), testing::Values(1), testing::Values(5), testing::Values(4), testing::Values(4),
+        testing::Values(8), testing::Values(nvinfer1::DataType::kFLOAT, nvinfer1::DataType::kINT8), testing::Values(2),
+        testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(true)));
+#endif

1334-1337: Nit: remove inline commented parameter hints in values lists.

The trailing comments inside testing::Values hurt readability.

-    testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(false /*, true*/)));
+    testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(false)));

583-587: Shadowing bug: cacheType never updated for kvFactor == 1.

'auto cacheType' in the if-block shadows the outer variable, so kSELFKONLY is never used.

-        CacheType cacheType = CacheType::kSELF;
-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
+        CacheType cacheType = CacheType::kSELF;
+        if (kvFactor == 1)
+        {
+            cacheType = CacheType::kSELFKONLY;
+        }
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

351-352: Front-block removal bookkeeping — minor polish.

API is fine. The windowSize param in removeFrontBlock is unused; mark to avoid warnings.

-    void removeFrontBlock(SizeType32 windowSize)
+    void removeFrontBlock(SizeType32 /*windowSize*/)
     {
         ++mNumFrontBlocksRemoved;
     }

Also applies to: 394-398, 436-442


1428-1468: KVCacheManager constructors updated — minor naming nit.

Parameter name copyOnpartialReuse is inconsistently cased vs. copyOnPartialReuse elsewhere. Consider normalizing for consistency.

-        bool copyOnpartialReuse = true,
+        bool copyOnPartialReuse = true,

Note: update definitions/uses accordingly across TU.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (4)

618-619: Use 0 (not false) for sinkTokenLength; consider naming maxSequenceLength explicitly.

Passing false for sinkTokenLength is confusing; prefer 0. Also, here maxAttentionWindow is used as maxSequenceLength. For consistency with other tests, define maxSequenceLength and pass that.

-        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
-        false, stream, maxAttentionWindow, true, onboardBlocks);
+        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
+        0, stream, maxAttentionWindow, true, onboardBlocks);

3059-3163: New test for SWA window smaller than block size is solid; tighten types.

Logic/expectations look right. Minor nit: prefer SizeType32 for inputLength to match APIs.

-    int inputLength = 2;
+    SizeType32 inputLength = 2;

Please confirm this test is stable across devices; edge-case scheduling of detach/allocate can vary slightly if implementation changes heuristics.


3920-3933: NeededBlocksOneStep: assertions refined; minor style nits.

The step-wise assertions are precise. Minor: prefer auto const and fixed-width ints for counters to align with surrounding types.

-            auto numUsedBlocksThisStep = kvCacheManager.getNumAllocTotalBlocks() - currentNumAllocTotalBlocks;
+            auto const numUsedBlocksThisStep = kvCacheManager.getNumAllocTotalBlocks() - currentNumAllocTotalBlocks;

Also applies to: 3950-4011


4161-4186: Potential misuse of maxNumSequences in createKvCacheManager.

numBlocksInPrimaryPool is passed as maxNumSequences. That parameter represents “max concurrent sequences”, not block count. It may work incidentally but is semantically off.

-            numBlocksInPrimaryPool, kvCacheInstantiationParameters.maxBeamWidth,
+            /* maxNumSequences = */ kvCacheInstantiationParameters.maxNumSequences, kvCacheInstantiationParameters.maxBeamWidth,

Outside this hunk, add a field and populate it:

// In KvCacheManagerInstantiationParameters
SizeType32 maxNumSequences = 8; // or a test-specific value

Confirm no tests rely on maxNumSequences being huge; if they do, set it explicitly in the parameter sets.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6)

1325-1352: SWA detach/allocate step logic is reasonable

  • Detach only when past window + block and block‑reuse disabled.
  • Allocate on block boundaries.

Minor: the manager variable in the caller loop isn’t used.

-    for (auto& [windowSize, manager] : mWindowBlockManagers)
+    for (auto& [windowSize, _] : mWindowBlockManagers)

1698-1746: Release flow for SWA: guard empty and clarify

  • Replacing allocatedBlocks with getAllSequenceBlocks is fine but will crash if empty. Add a quick check before allocatedBlocks.back().
  • Loop correctly accounts for previously detached front blocks.
-    if (mIsSWA)
-    {
-        // For SWA, get all blocks in the sequence.
-        allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
-    }
+    if (mIsSWA)
+    {
+        if (!allocatedBlocks.empty())
+        {
+            // For SWA, get all blocks in the sequence.
+            allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
+        }
+    }

Also, the in‑function comment about not releasing when reuse is enabled contradicts the call site (detach only when reuse is disabled). Consider updating the comment.


2096-2125: detachFrontBlock: unused parameter; tighten comments

isEnableBlockReuse isn’t used inside; either remove the param or use it to control release behavior. Current call sites only invoke this when reuse is disabled.

-void WindowBlockManager::detachFrontBlock(GenerationRequest& sequence, bool const isEnableBlockReuse)
+void WindowBlockManager::detachFrontBlock(GenerationRequest& sequence)

Update the caller accordingly.


505-517: Docstring for proportion function: minor nits

Good high‑level doc; consider referencing the new ENV override here to avoid stale docs.


1053-1059: Offload policy should be centralized

Comment acknowledges current behavior; long‑term, prefer delegating the decision fully to the eviction policy to avoid extra traffic.


2648-2660: Rename kSWAExtraBlock to kSWA_EXTRA_BLOCK and apply consistently

A static constexpr is already defined in cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:62 — change to:
static constexpr SizeType32 kSWA_EXTRA_BLOCK = 1;
and update all usages at:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2657
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2691
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp:4089
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb64e74 and 54a398d.

📒 Files selected for processing (16)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (30 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (49 hunks)
  • cpp/tests/unit_tests/executor/agentCommTest.cpp (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3 hunks)
  • examples/models/core/llama/summarize_long.py (1 hunks)
  • examples/models/core/qwen2audio/utils.py (1 hunks)
  • examples/utils.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/functional.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • examples/models/core/llama/summarize_long.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • examples/utils.py
  • examples/models/core/qwen2audio/utils.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • examples/models/core/llama/summarize_long.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • examples/utils.py
  • examples/models/core/qwen2audio/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • examples/models/core/llama/summarize_long.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • examples/utils.py
  • examples/models/core/qwen2audio/utils.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (10)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧬 Code graph analysis (8)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
  • stream (4231-4237)
  • stream (4317-4323)
  • stream (4676-4682)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (1)
  • KvCacheConfig (972-1106)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1022-1038)
tests/integration/defs/accuracy/accuracy_core.py (4)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (712-722)
  • MMLU (276-290)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (19)
  • releaseBlocks (1619-1634)
  • releaseBlocks (1619-1619)
  • releaseBlocks (1698-1746)
  • releaseBlocks (1698-1698)
  • updateLastCacheBlockOffsets (2053-2069)
  • updateLastCacheBlockOffsets (2053-2053)
  • releaseLastBlock (1585-1588)
  • releaseLastBlock (1585-1585)
  • releaseLastBlock (1590-1607)
  • releaseLastBlock (1590-1590)
  • detachFrontBlock (2096-2125)
  • detachFrontBlock (2096-2096)
  • adjustBlocksIfNeeded (1325-1331)
  • adjustBlocksIfNeeded (1325-1325)
  • adjustBlocksIfNeeded (1333-1352)
  • adjustBlocksIfNeeded (1333-1333)
  • storeBlocks (1446-1509)
  • storeBlocks (1446-1447)
  • KVCacheBlock (272-285)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
  • storeBlocks (969-973)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • KVCacheManager (1770-1781)
  • KVCacheManager (1783-1797)
  • KVCacheManager (1799-1831)
  • KVCacheManager (1833-1846)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • KVCacheManager (142-1011)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
  • createKvCacheManager (637-723)
  • createKvCacheManager (637-639)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (52)
examples/models/core/llama/summarize_long.py (1)

48-49: Wording tweak LGTM.

Help text reads clearly and matches current terminology.

examples/models/core/qwen2audio/utils.py (1)

41-42: Wording tweak LGTM.

Terminology is now consistent with SWA/KV cache usage elsewhere.

examples/utils.py (1)

361-362: Wording tweak LGTM.

Minor doc text change only; no behavioral impact.

cpp/tests/unit_tests/executor/agentCommTest.cpp (1)

86-87: Constructor update looks correct.

Passing kvMaxNumTokens as the new non-optional parameter after stream aligns with the updated KVCacheManager signature.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

136-139: Constructor update LGTM.

Providing maxNumTokensPerSeq in the KVCacheManager ctor matches the non-optional max sequence length change.

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)

59-63: KVCacheManager ctor arg shift LGTM.

Supplying kvMaxNumTokens after stream matches the updated API; no other changes needed here.

tests/integration/test_lists/test-db/l0_h100.yml (1)

39-42: Verified — VSWA tests exist and are discoverable.

The four tests are defined in tests/integration/defs/accuracy/test_llm_api_pytorch.py (defs at ~1060, 1074, 1087, 1108) and are listed in tests/integration/test_lists/test-db/l0_h100.yml (lines 39–42).

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

361-361: Rewind uses no extra block — please confirm boundary safety.

Setting isUseOneMoreBlock to false changes rewind behavior at block boundaries. Please confirm no off-by-one issues when rewinding exactly at tokensPerBlock multiples and that invokeUpdateKVBlockArrayDraftTokenLocation handles this without needing a guard block.


688-697: KVCacheManager ctor args and cross/self maxSequenceLength look correct.

Passing streamPtr, dtype, sinkTokenLen, and selecting maxEncoderLen for CROSS vs getMaxSequenceLen for SELF aligns with the updated ctor and the non-optional maxSequenceLength requirement.

Please confirm:

  • The nanobind bindings and all call sites use the same ctor signature/order.
  • Cross-cache paths don’t (implicitly) rely on sinkTokenLen semantics intended for self-attention.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)

1060-1066: VSWA test (no reuse) looks good; window vector cycling is assumed.

The per-layer max_attention_window vector size likely differs from numLayers and will be cycled. If Gemma3-1B layer count changes, this remains robust.

If flakiness appears, consider asserting the effective expanded window vector in the runtime to ensure expected cycling.


1074-1081: VSWA with reuse enabled — ensure SWA+beam isn’t exercised.

Reuse with SWA is fine; the known broken path is SWA+beam search. GSM8K/MMLU default to sampling, so this should be safe.

Please confirm no hidden beam-search params are applied via harness defaults.


1101-1106: VSWA + chunked prefill (no reuse) — OK.

Good coverage for chunked prefill + VSWA with reuse disabled.

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

218-222: Constructor arg update looks correct (tokens, not blocks).

Passing maxNumTokens as max_sequence_length aligns with the non-optional constructor. Units are tokens; computed as tokensPerBlock * maxBlocksPerSeq. LGTM.


621-623: Consistent KVCacheManager signature usage.

Updated argument ordering (sink_token_length, stream, max_sequence_length, …) matches the new API. LGTM.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (6)

60-63: Add kSWAExtraBlock constant — OK.

Clear constant for SWA buffer accounting.


100-114: WindowSizeMetadata additions — OK.

maxTokenNum and maxBlocksPerSeq are useful; toString updated accordingly.


551-557: SWA-aware lifecycle in WindowBlockManager — OK.

Constructor now carries isSWA; added adjust/detach/update helpers and isSWA(). Interfaces look consistent with cpp implementation.

Please ensure all call sites construct WindowBlockManager with the correct isSWA value per window size.

Also applies to: 592-608, 783-788, 850-852


902-913: BlockManager constructor: non-optional maxSequenceLength — OK.

Signature aligns with retrieved learnings. Verify warnings are emitted when windowSize > maxSequenceLength (per policy).


1549-1555: Offset table dims selection — OK.

Using widest window’s maxBlocksPerSeq for a single offset table is reasonable. LGTM.


1173-1180: Expose adjustBlocksIfNeeded on BlockManager — OK.

Surface mirrors WindowBlockManager API; ensures per-window SWA handling is invoked.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (19)

708-711: Helpful inline comments.

The clarifications around block contents during reuse improve test readability.


1962-1981: Constructor updates to include non-optional maxSequenceLength look correct.

Defining maxSequenceLength and passing it through matches the production signature.


2123-2142: Consistent use of maxSequenceLength in decode priority test.

Good alignment with the new constructor signature.


2231-2249: Timed eviction test updated for maxSequenceLength.

Looks correct.


2379-2396: Secondary block primary child test: signature alignment LGTM.

maxSequenceLength defined and used correctly.


2455-2471: Leaf block test: constructor call updated properly.

No issues.


2650-2654: Allocation test: introduces maxSequenceLength and forwards it.

Consistent with the API change. Calculated values look sane.

Also applies to: 2670-2674


2712-2715: KVCacheManagerTest: consistent maxSequenceLength and forwarding.

Good cleanup and consistency.

Also applies to: 2732-2735


2828-2828: Adjusted expectation accounts for two addToken calls.

Matches the updated allocation behavior.


2861-2864: Rewind tokens test updated for maxSequenceLength.

Looks correct.

Also applies to: 2880-2883


2945-2952: MaxAttentionWindow test: explicit maxSequenceLength and derived values.

Consistent and clear.

Also applies to: 2966-2970


3185-3191: Event stream test: uses maxSequenceLength consistently.

No issues.


3342-3343: Event stream overflow/blocking tests: constructor updates LGTM.

Signature alignment is correct.

Also applies to: 3348-3349, 3490-3490


3401-3402: Event stream priority test: constructor updates LGTM.

All good.

Also applies to: 3407-3408


3484-3484: Event stream blocking test: KVCacheManagerTest instance uses updated signature.

Looks correct.


3530-3532: Window size event stream test: forwards maxSequenceLength.

No issues.

Also applies to: 3540-3541


3845-3863: Commented sections (shared vs unshared blocks) improve understanding.

Nice clarity without affecting logic.


4089-4090: Confirm kSWAExtraBlock visibility.

Ensure kSWAExtraBlock is included in this TU (header) to avoid ODR/visibility issues on some compilers/configs.


4659-4970: Parametrized NeededBlocksOneStepTest adds strong coverage.

Good matrix (context vs decode, twoStepsLookAhead, drafts). No issues.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (13)

160-191: Helper to gather sequence blocks looks correct

Backwards traversal with root guard and preallocation are sound; no off‑by‑one. LGTM.


592-597: Correct: warn when windowSize > maxSequenceLength

Matches prior guidance; prevents silent misconfig.


603-607: SWA flag propagation LGTM

isSWA = windowSize < maxSequenceLength is the right criterion.


621-642: Capacity calc and logging LGTM (with SWA cap note)

Capping temp window to maxSequenceLength avoids over‑alloc now; follow‑up removal noted in comments. Logging is helpful.

Please ensure any downstream logic that reads maxBlocksPerSeq expects “full‑attention” sized sequences for SWA windows.


655-673: Constructor changes LGTM

New isSWA arg stored in mIsSWA and other initializations are consistent.


1446-1509: Return count from storeBlocks is useful; keep BlockManager wrapper consistent

WindowBlockManager::storeBlocks now returns count. BlockManager::storeBlocks (header inline) ignores the return; that’s fine but surprising.

Consider updating the BlockManager wrapper to return the count (or comment that it’s intentionally ignored).


1934-1939: One‑step needs calc: looks consistent

Capping prompt cache length to windowSize + chunkSize is reasonable with draft tokens.


2053-2069: Offset refresh only for last block: LGTM

Efficiently updates just the newest block offsets per beam.


2089-2095: Hooking adjustBlocksIfNeeded on addToken is correct

Ensures timely detach/allocate on token additions.


2205-2213: Graceful warning when sequence missing: OK

Warning on missing sequence for storeContextBlocks is appropriate.


2216-2227: Gate storeNewBlock correctly

Skip when beamWidth > 1 or reuse disabled; otherwise delegate. LGTM.


589-597: Window > maxSequenceLength warning LGTM

Matches retrieved learnings; no action.


2011-2031: SWA extra block: definition exists — rename to follow naming guidelines

kSWAExtraBlock is defined at cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:62 (static constexpr SizeType32 kSWAExtraBlock = 1) and no duplicate definitions were found. Recommend renaming to kSWA_EXTRA_BLOCK and updating usages at:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2657, 2691
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp:4089

Likely an incorrect or invalid review comment.

@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 54a398d to 6546070 Compare September 24, 2025 07:34
@eopXD eopXD changed the title [None][feature] Environment variable to adjust block pool allocation ratio under kv cache manager [None][feature] Add environment variable to adjust block pool allocation ratio under kv cache manager Sep 24, 2025
@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 6546070 to 9709231 Compare September 24, 2025 07:42
@eopXD
Copy link
Collaborator Author

eopXD commented Sep 24, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19777 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19777 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14880 completed with status: 'FAILURE'

@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 9709231 to 295dbf3 Compare September 25, 2025 07:19
@eopXD
Copy link
Collaborator Author

eopXD commented Sep 25, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19915 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19915 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14991 completed with status: 'FAILURE'

…mory proportion shared

Usage example:

export TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6

Signed-off-by: eopXD <[email protected]>
@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 295dbf3 to 70feb7c Compare September 25, 2025 15:21
@eopXD
Copy link
Collaborator Author

eopXD commented Sep 25, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19974 [ run ] triggered by Bot

@eopXD eopXD enabled auto-merge (squash) September 25, 2025 15:35
@tensorrt-cicd
Copy link
Collaborator

PR_Github #19974 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15036 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Copy link
Collaborator

@QiJune QiJune left a comment


LGTM

@eopXD eopXD merged commit 2db22fb into NVIDIA:main Sep 26, 2025
7 checks passed
Copy link
Collaborator

@yechank-nvidia yechank-nvidia left a comment


Thx for the work. Overall, LGTM, left several comments.

// Normalize shares to 1.0
for (auto& s : shares)
{
    s /= sumShares;
}
Copy link
Collaborator


nit) Might be good to log so the user knows that the shares were scaled down to these numbers.

For example, when they give [0.6, 1.4], log that the shares were rescaled from [0.6, 1.4] to [0.3, 0.7].

// then setting TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6 will be allocating
// 40% of the memory to window size 512 and 60% of the memory to window
// size 32768.
if (auto envStr = std::getenv("TRTLLM_WINDOW_SIZE_SHARES"))
Copy link
Collaborator


Do you think we need to add tests for this under cpp/tests/unit_tests/kvCacheManagerTest.cpp?
