
Conversation

@eopXD (Collaborator) commented Sep 23, 2025

Summary by CodeRabbit

  • New Features
    • New configuration knob to enable customized pool allocation for the block pool in the KV cache manager.

Description

Usage example: export TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6
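The variable holds comma-separated fractions, one share per attention window size, splitting the block pool between pools. The helper below is a hypothetical Python sketch of the implied parse-and-validate semantics (the actual implementation is in the C++ KV cache manager; the function name and the sum-to-1.0 check are assumptions, not the PR's code):

```python
import os

def parse_window_size_shares(env_value):
    """Parse comma-separated pool-share fractions, e.g. "0.4,0.6".

    Hypothetical illustration: one fractional share per attention
    window size, assumed to be non-negative and to sum to 1.0.
    """
    shares = [float(s) for s in env_value.split(",")]
    if any(s < 0 for s in shares):
        raise ValueError("shares must be non-negative")
    total = sum(shares)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"shares must sum to 1.0, got {total}")
    return shares

# Two window sizes splitting the block pool 40/60, as in the example above.
os.environ["TRTLLM_WINDOW_SIZE_SHARES"] = "0.4,0.6"
print(parse_window_size_shares(os.environ["TRTLLM_WINDOW_SIZE_SHARES"]))
```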

Test Coverage

No test coverage added.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.


coderabbitai bot commented Sep 23, 2025

📝 Walkthrough

Walkthrough

Introduces Sliding Window Attention (SWA) awareness across KV-cache management: new flags, bookkeeping, and block lifecycle methods; removes cyclic logic; makes maxSequenceLength non-optional across constructors; updates reuse/store/release flows; adjusts capacity calculations; aligns bindings, C++ call sites, tests, Python utilities, and example help texts.

Changes

Cohort / File(s) Summary
Core KV-cache management APIs
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Add SWA flag/paths, extra block constant, front-block bookkeeping, adjust/detach methods, storeBlocks returns count; remove cyclic logic; propagate non-optional maxSequenceLength; update block requirement calculations and env-driven window sharing.
Block/Window managers
.../kvCacheManager.h (WindowBlockManager/BlockManager decls), .../kvCacheManager.cpp (impl)
Constructors gain isSWA; add isSWA() accessor; introduce adjustBlocksIfNeeded, detachFrontBlock; expose free-block queries; align reuse/release/store flows with SWA.
Inflight batching integration
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
KVCacheManager construction updated: pass non-optional maxSequenceLength, new enablePartialReuse/copyOnPartialReuse; simplify rewind inputs (fixed false); remove cyclic assertions.
Python bindings
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
Nanobind constructor switches from std::optional<SizeType32> to SizeType32 for max_sequence_length; updates arg list/names and defaults; adds binding args ordering changes.
Unit tests (KVCacheManager API calls)
cpp/tests/unit_tests/batch_manager/*
Update all KVCacheManager invocations to pass concrete maxSequenceLength instead of std::nullopt; adapt helpers and expectations to new sizing semantics and SWA behavior.
Unit tests (multi-GPU)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
Insert maxNumTokens param in constructor; waive/comment some windowed variants; adjust parameter orders.
Executor tests
cpp/tests/unit_tests/executor/agentCommTest.cpp
Replace std::nullopt with explicit token limit in KVCacheManager constructor.
Python runtime utilities/docs
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/functional.py
Unify KV capacity calculation (no VSWA branch); docstring wording updates for attention window feature.
Examples (CLI help text)
examples/models/core/llama/summarize_long.py, examples/models/core/qwen2audio/utils.py, examples/utils.py
Update --max_attention_window_size descriptions to remove “cyclic” phrasing.
Integration tests
tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/test-db/l0_h100.yml
Replace one VSWA test with four variants (reuse/no-reuse and chunked prefill combinations); update test list entries accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Req as GenerationRequest
  participant BM as BlockManager
  participant WBM as WindowBlockManager
  participant Pool as Block Pools

  Note over Req,BM: Token generation step (SWA-aware)
  Req->>BM: requestNextStep()
  BM->>WBM: adjustBlocksIfNeeded(reqState)
  alt SWA and window exceeded
    WBM->>WBM: detachFrontBlock()
    WBM->>Pool: storeBlocks(...) -> numStored
    Note right of WBM: Store count returned
  else No detachment needed
    WBM->>WBM: no-op
  end
  WBM->>Pool: acquireIfNeeded()
  Pool-->>WBM: block(s) or wait/fail
  WBM-->>BM: updated block layout
  BM-->>Req: proceed with step
sequenceDiagram
  autonumber
  participant BM as BlockManager
  participant WBM as WindowBlockManager
  participant Pool as Block Pools

  Note over BM,WBM: Release path (SWA-aware)
  BM->>WBM: releaseSequence(seqId)
  alt SWA sequence
    WBM->>WBM: getAllSequenceBlocks()
    WBM->>Pool: storeBlocks(...) -> numStored
  else Non-SWA
    WBM->>Pool: storeBlocks(...) -> numStored
  end
  Pool-->>BM: updated free counts
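The token-generation diagram above can be modeled as a toy Python sketch. Assumptions flagged here: a block-granular sliding window, one extra block of slack retained past the window (mirroring the PR's kSWAExtraBlock), and method names borrowed from the PR (adjustBlocksIfNeeded, detachFrontBlock); this is an illustration of the lifecycle, not the C++ implementation.

```python
class SwaSequence:
    """Toy model of SWA-aware block bookkeeping for one sequence."""

    def __init__(self, window_size, tokens_per_block):
        self.window_size = window_size
        self.tokens_per_block = tokens_per_block
        self.blocks = []             # block ids currently attached
        self.num_front_removed = 0   # bookkeeping akin to mNumFrontBlocksRemoved
        self.num_tokens = 0
        self._next_block_id = 0

    def add_token(self):
        """One generation step: allocate on block boundaries, then adjust."""
        self.num_tokens += 1
        if (self.num_tokens - 1) % self.tokens_per_block == 0:
            self.blocks.append(self._next_block_id)
            self._next_block_id += 1
        self._adjust_blocks_if_needed()

    def _adjust_blocks_if_needed(self):
        # Detach the front block once the held tokens exceed the window
        # plus one extra block's worth of slack.
        held = self.num_tokens - self.num_front_removed * self.tokens_per_block
        while held > self.window_size + self.tokens_per_block:
            self.blocks.pop(0)       # detachFrontBlock
            self.num_front_removed += 1
            held -= self.tokens_per_block


seq = SwaSequence(window_size=4, tokens_per_block=2)
for _ in range(9):
    seq.add_token()
print(seq.blocks, seq.num_front_removed)  # steady state: 3 blocks held
```

With window 4 and block size 2, the sequence settles at three attached blocks: two covering the window plus the extra slack block, while num_front_removed tracks the detached front blocks for later release/store accounting.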

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 14.89%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check ⚠️ Warning — The PR description largely retains the template scaffolding without the required summary or detailed explanation; the Description section contains only a usage example, and the Test Coverage section simply states "No test coverage added." Resolution: remove the template instructions, add a concise summary of the changes and their rationale under Description, and list the relevant tests under Test Coverage.
✅ Passed checks (1 passed)
  • Title Check ✅ Passed — The title clearly reflects the primary change (adding an environment variable to adjust the KV cache manager's block-pool allocation) and follows the repository's "[ticket][type] Summary" format. It contains a minor typo ("ration" → "ratio") but remains understandable.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/models/core/qwen2audio/utils.py (1)

1-1: Add SPDX + NVIDIA Apache-2.0 header.

Per coding guidelines, prepend the NVIDIA header with current year to all source files.

Apply this at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from argparse import BooleanOptionalAction
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

36-40: Add missing standard headers to cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp

The file uses std::stringstream (line ~2458), std::getenv (line ~2453), and std::accumulate (lines ~529/541/551) but does not include the corresponding headers. Add:

  • #include <sstream>
  • #include <cstdlib>
  • #include <numeric>

Insert them alongside the other standard includes.

🧹 Nitpick comments (18)
tensorrt_llm/functional.py (1)

5361-5361: Docstring wording consistency (SWA/KV cache).

Prefer consistent capitalization and be more explicit about scope (per-layer max attention window sizes). Suggest:

-            This controls the sliding-window-attention kv-cache features.
+            Controls Sliding Window Attention (SWA) KV cache behavior (per-layer max attention window sizes).
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1115-1122: VSWA + chunked prefill (reuse) — OK; consider light param refactor.

This mirrors the previous test with only reuse toggled. Consider parametrizing reuse to reduce duplication across the two VSWA chunked-prefill tests.

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (4)

1316-1321: Temporarily waived window tests — please track.

Waiving is fine for now. Add a TODO with a tracking issue to restore SWA window variants.

-// (eop) Waive off isWindow test for now
+// TODO(eop): Restore isWindow test variants after SWA rework is complete. Track: <issue-id>

1323-1330: Avoid commented-out INSTANTIATE block in test sources.

Prefer removing or gating by a macro/flag to keep test code clean.

-// INSTANTIATE_TEST_CASE_P(AsymmetricCaseTestWithWindow, AsymmetricalCacheTest,
-//     testing::Combine(testing::Values(1), testing::Values(1), testing::Values(1), testing::Values(1),
-//     testing::Values(1),
-//         testing::Values(1), testing::Values(5), testing::Values(4), testing::Values(4), testing::Values(8),
-//         testing::Values(nvinfer1::DataType::kFLOAT, nvinfer1::DataType::kINT8), testing::Values(2),
-//         testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(true)));
+#if 0  // TODO(eop): Re-enable after SWA window tests are stable.
+INSTANTIATE_TEST_CASE_P(AsymmetricCaseTestWithWindow, AsymmetricalCacheTest,
+    testing::Combine(testing::Values(1), testing::Values(1), testing::Values(1), testing::Values(1),
+        testing::Values(1), testing::Values(1), testing::Values(5), testing::Values(4), testing::Values(4),
+        testing::Values(8), testing::Values(nvinfer1::DataType::kFLOAT, nvinfer1::DataType::kINT8), testing::Values(2),
+        testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(true)));
+#endif

1334-1337: Nit: remove inline commented parameter hints in values lists.

The trailing comments inside testing::Values hurt readability.

-    testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(false /*, true*/)));
+    testing::Values(false), testing::Values(false), testing::Values(false), testing::Values(false)));

583-587: Shadowing bug: cacheType never updated for kvFactor == 1.

'auto cacheType' in the if-block shadows the outer variable, so kSELFKONLY is never used.

-        CacheType cacheType = CacheType::kSELF;
-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
+        CacheType cacheType = CacheType::kSELF;
+        if (kvFactor == 1)
+        {
+            cacheType = CacheType::kSELFKONLY;
+        }
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

351-352: Front-block removal bookkeeping — minor polish.

API is fine. The windowSize param in removeFrontBlock is unused; mark to avoid warnings.

-    void removeFrontBlock(SizeType32 windowSize)
+    void removeFrontBlock(SizeType32 /*windowSize*/)
     {
         ++mNumFrontBlocksRemoved;
     }

Also applies to: 394-398, 436-442


1428-1468: KVCacheManager constructors updated — minor naming nit.

Parameter name copyOnpartialReuse is inconsistently cased vs. copyOnPartialReuse elsewhere. Consider normalizing for consistency.

-        bool copyOnpartialReuse = true,
+        bool copyOnPartialReuse = true,

Note: update definitions/uses accordingly across TU.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (4)

618-619: Use 0 (not false) for sinkTokenLength; consider naming maxSequenceLength explicitly.

Passing false for sinkTokenLength is confusing; prefer 0. Also, here maxAttentionWindow is used as maxSequenceLength. For consistency with other tests, define maxSequenceLength and pass that.

-        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
-        false, stream, maxAttentionWindow, true, onboardBlocks);
+        beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kFP4,
+        0, stream, maxAttentionWindow, true, onboardBlocks);

3059-3163: New test for SWA window smaller than block size is solid; tighten types.

Logic/expectations look right. Minor nit: prefer SizeType32 for inputLength to match APIs.

-    int inputLength = 2;
+    SizeType32 inputLength = 2;

Please confirm this test is stable across devices; edge-case scheduling of detach/allocate can vary slightly if implementation changes heuristics.


3920-3933: NeededBlocksOneStep: assertions refined; minor style nits.

The step-wise assertions are precise. Minor: prefer auto const and fixed-width ints for counters to align with surrounding types.

-            auto numUsedBlocksThisStep = kvCacheManager.getNumAllocTotalBlocks() - currentNumAllocTotalBlocks;
+            auto const numUsedBlocksThisStep = kvCacheManager.getNumAllocTotalBlocks() - currentNumAllocTotalBlocks;

Also applies to: 3950-4011


4161-4186: Potential misuse of maxNumSequences in createKvCacheManager.

numBlocksInPrimaryPool is passed as maxNumSequences. That parameter represents “max concurrent sequences”, not block count. It may work incidentally but is semantically off.

-            numBlocksInPrimaryPool, kvCacheInstantiationParameters.maxBeamWidth,
+            /* maxNumSequences = */ kvCacheInstantiationParameters.maxNumSequences, kvCacheInstantiationParameters.maxBeamWidth,

Outside this hunk, add a field and populate it:

// In KvCacheManagerInstantiationParameters
SizeType32 maxNumSequences = 8; // or a test-specific value

Confirm no tests rely on maxNumSequences being huge; if they do, set it explicitly in the parameter sets.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (6)

1325-1352: SWA detach/allocate step logic is reasonable

  • Detach only when past window + block and block‑reuse disabled.
  • Allocate on block boundaries.

Minor: the manager variable in the caller loop isn’t used.

-    for (auto& [windowSize, manager] : mWindowBlockManagers)
+    for (auto& [windowSize, _] : mWindowBlockManagers)

1698-1746: Release flow for SWA: guard empty and clarify

  • Replacing allocatedBlocks with getAllSequenceBlocks is fine but will crash if empty. Add a quick check before allocatedBlocks.back().
  • Loop correctly accounts for previously detached front blocks.
-    if (mIsSWA)
-    {
-        // For SWA, get all blocks in the sequence.
-        allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
-    }
+    if (mIsSWA)
+    {
+        if (!allocatedBlocks.empty())
+        {
+            // For SWA, get all blocks in the sequence.
+            allocatedBlocks = getAllSequenceBlocks(allocatedBlocks.back());
+        }
+    }

Also, the in‑function comment about not releasing when reuse is enabled contradicts the call site (detach only when reuse is disabled). Consider updating the comment.


2096-2125: detachFrontBlock: unused parameter; tighten comments

isEnableBlockReuse isn’t used inside; either remove the param or use it to control release behavior. Current call sites only invoke this when reuse is disabled.

-void WindowBlockManager::detachFrontBlock(GenerationRequest& sequence, bool const isEnableBlockReuse)
+void WindowBlockManager::detachFrontBlock(GenerationRequest& sequence)

Update the caller accordingly.


505-517: Docstring for proportion function: minor nits

Good high‑level doc; consider referencing the new ENV override here to avoid stale docs.


1053-1059: Offload policy should be centralized

Comment acknowledges current behavior; long‑term, prefer delegating the decision fully to the eviction policy to avoid extra traffic.


2648-2660: Rename kSWAExtraBlock to kSWA_EXTRA_BLOCK and apply consistently

A static constexpr is already defined in cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:62 — change to:
static constexpr SizeType32 kSWA_EXTRA_BLOCK = 1;
and update all usages at:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2657
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2691
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp:4089
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb64e74 and 54a398d.

📒 Files selected for processing (16)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (17 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (30 hunks)
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (49 hunks)
  • cpp/tests/unit_tests/executor/agentCommTest.cpp (1 hunks)
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3 hunks)
  • examples/models/core/llama/summarize_long.py (1 hunks)
  • examples/models/core/qwen2audio/utils.py (1 hunks)
  • examples/utils.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/functional.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • examples/models/core/llama/summarize_long.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • examples/utils.py
  • examples/models/core/qwen2audio/utils.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • examples/models/core/llama/summarize_long.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • examples/utils.py
  • examples/models/core/qwen2audio/utils.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • examples/models/core/llama/summarize_long.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • examples/utils.py
  • examples/models/core/qwen2audio/utils.py
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧠 Learnings (10)
📓 Common learnings
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp
  • cpp/tests/unit_tests/executor/agentCommTest.cpp
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/test_lists/test-db/l0_h100.yml
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: There is a planned refactoring to move cache block bookkeeping utilities from BlockManager/WindowBlockManager into the GenerationRequest class itself to improve code organization and make responsibilities clearer.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-13T16:20:37.987Z
Learnt from: dcampora
PR: NVIDIA/TensorRT-LLM#6867
File: tensorrt_llm/_torch/pyexecutor/sampler.py:67-72
Timestamp: 2025-08-13T16:20:37.987Z
Learning: In TensorRT-LLM sampler code, performance is prioritized over additional validation checks. The beam_width helper method intentionally returns the first request's beam_width without validating consistency across all requests to avoid performance overhead from iterating through the entire batch.

Applied to files:

  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
🧬 Code graph analysis (8)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
  • stream (4231-4237)
  • stream (4317-4323)
  • stream (4676-4682)
cpp/tests/unit_tests/executor/agentCommTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
tensorrt_llm/llmapi/llm_args.py (1)
  • KvCacheConfig (972-1106)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1022-1038)
tests/integration/defs/accuracy/accuracy_core.py (4)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (712-722)
  • MMLU (276-290)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (19)
  • releaseBlocks (1619-1634)
  • releaseBlocks (1619-1619)
  • releaseBlocks (1698-1746)
  • releaseBlocks (1698-1698)
  • updateLastCacheBlockOffsets (2053-2069)
  • updateLastCacheBlockOffsets (2053-2053)
  • releaseLastBlock (1585-1588)
  • releaseLastBlock (1585-1585)
  • releaseLastBlock (1590-1607)
  • releaseLastBlock (1590-1590)
  • detachFrontBlock (2096-2125)
  • detachFrontBlock (2096-2096)
  • adjustBlocksIfNeeded (1325-1331)
  • adjustBlocksIfNeeded (1325-1325)
  • adjustBlocksIfNeeded (1333-1352)
  • adjustBlocksIfNeeded (1333-1333)
  • storeBlocks (1446-1509)
  • storeBlocks (1446-1447)
  • KVCacheBlock (272-285)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
  • storeBlocks (969-973)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (3)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
  • KVCacheManager (1770-1781)
  • KVCacheManager (1783-1797)
  • KVCacheManager (1799-1831)
  • KVCacheManager (1833-1846)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • KVCacheManager (142-1011)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)
  • createKvCacheManager (637-723)
  • createKvCacheManager (637-639)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (52)
examples/models/core/llama/summarize_long.py (1)

48-49: Wording tweak LGTM.

Help text reads clearly and matches current terminology.

examples/models/core/qwen2audio/utils.py (1)

41-42: Wording tweak LGTM.

Terminology is now consistent with SWA/KV cache usage elsewhere.

examples/utils.py (1)

361-362: Wording tweak LGTM.

Minor doc text change only; no behavioral impact.

cpp/tests/unit_tests/executor/agentCommTest.cpp (1)

86-87: Constructor update looks correct.

Passing kvMaxNumTokens as the new non-optional parameter after stream aligns with the updated KVCacheManager signature.

cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp (1)

136-139: Constructor update LGTM.

Providing maxNumTokensPerSeq in the KVCacheManager ctor matches the non-optional max sequence length change.

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)

59-63: KVCacheManager ctor arg shift LGTM.

Supplying kvMaxNumTokens after stream matches the updated API; no other changes needed here.

tests/integration/test_lists/test-db/l0_h100.yml (1)

39-42: Verified — VSWA tests exist and are discoverable.

The four tests are defined in tests/integration/defs/accuracy/test_llm_api_pytorch.py (defs at ~1060, 1074, 1087, 1108) and are listed in tests/integration/test_lists/test-db/l0_h100.yml (lines 39–42).

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (2)

361-361: Rewind uses no extra block — please confirm boundary safety.

Setting isUseOneMoreBlock to false changes rewind behavior at block boundaries. Please confirm no off-by-one issues when rewinding exactly at tokensPerBlock multiples and that invokeUpdateKVBlockArrayDraftTokenLocation handles this without needing a guard block.


688-697: KVCacheManager ctor args and cross/self maxSequenceLength look correct.

Passing streamPtr, dtype, sinkTokenLen, and selecting maxEncoderLen for CROSS vs getMaxSequenceLen for SELF aligns with the updated ctor and the non-optional maxSequenceLength requirement.

Please confirm:

  • The nanobind bindings and all call sites use the same ctor signature/order.
  • Cross-cache paths don’t (implicitly) rely on sinkTokenLen semantics intended for self-attention.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)

1060-1066: VSWA test (no reuse) looks good; window vector cycling is assumed.

The per-layer max_attention_window vector size likely differs from numLayers and will be cycled. If Gemma3-1B layer count changes, this remains robust.

If flakiness appears, consider asserting the effective expanded window vector in the runtime to ensure expected cycling.


1074-1081: VSWA with reuse enabled — ensure SWA+beam isn’t exercised.

Reuse with SWA is fine; the known broken path is SWA+beam search. GSM8K/MMLU default to sampling, so this should be safe.

Please confirm no hidden beam-search params are applied via harness defaults.


1101-1106: VSWA + chunked prefill (no reuse) — OK.

Good coverage for chunked prefill + VSWA with reuse disabled.

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (2)

218-222: Constructor arg update looks correct (tokens, not blocks).

Passing maxNumTokens as max_sequence_length aligns with the non-optional constructor. Units are tokens; computed as tokensPerBlock * maxBlocksPerSeq. LGTM.


621-623: Consistent KVCacheManager signature usage.

Updated argument ordering (sink_token_length, stream, max_sequence_length, …) matches the new API. LGTM.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (6)

60-63: Add kSWAExtraBlock constant — OK.

Clear constant for SWA buffer accounting.


100-114: WindowSizeMetadata additions — OK.

maxTokenNum and maxBlocksPerSeq are useful; toString updated accordingly.


551-557: SWA-aware lifecycle in WindowBlockManager — OK.

Constructor now carries isSWA; added adjust/detach/update helpers and isSWA(). Interfaces look consistent with cpp implementation.

Please ensure all call sites construct WindowBlockManager with the correct isSWA value per window size.

Also applies to: 592-608, 783-788, 850-852


902-913: BlockManager constructor: non-optional maxSequenceLength — OK.

Signature aligns with retrieved learnings. Verify warnings are emitted when windowSize > maxSequenceLength (per policy).


1549-1555: Offset table dims selection — OK.

Using widest window’s maxBlocksPerSeq for a single offset table is reasonable. LGTM.


1173-1180: Expose adjustBlocksIfNeeded on BlockManager — OK.

Surface mirrors WindowBlockManager API; ensures per-window SWA handling is invoked.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (19)

708-711: Helpful inline comments.

The clarifications around block contents during reuse improve test readability.


1962-1981: Constructor updates to include non-optional maxSequenceLength look correct.

Defining maxSequenceLength and passing it through matches the production signature.


2123-2142: Consistent use of maxSequenceLength in decode priority test.

Good alignment with the new constructor signature.


2231-2249: Timed eviction test updated for maxSequenceLength.

Looks correct.


2379-2396: Secondary block primary child test: signature alignment LGTM.

maxSequenceLength defined and used correctly.


2455-2471: Leaf block test: constructor call updated properly.

No issues.


2650-2654: Allocation test: introduces maxSequenceLength and forwards it.

Consistent with the API change. Calculated values look sane.

Also applies to: 2670-2674


2712-2715: KVCacheManagerTest: consistent maxSequenceLength and forwarding.

Good cleanup and consistency.

Also applies to: 2732-2735


2828-2828: Adjusted expectation accounts for two addToken calls.

Matches the updated allocation behavior.


2861-2864: Rewind tokens test updated for maxSequenceLength.

Looks correct.

Also applies to: 2880-2883


2945-2952: MaxAttentionWindow test: explicit maxSequenceLength and derived values.

Consistent and clear.

Also applies to: 2966-2970


3185-3191: Event stream test: uses maxSequenceLength consistently.

No issues.


3342-3343: Event stream overflow/blocking tests: constructor updates LGTM.

Signature alignment is correct.

Also applies to: 3348-3349, 3490-3490


3401-3402: Event stream priority test: constructor updates LGTM.

All good.

Also applies to: 3407-3408


3484-3484: Event stream blocking test: KVCacheManagerTest instance uses updated signature.

Looks correct.


3530-3532: Window size event stream test: forwards maxSequenceLength.

No issues.

Also applies to: 3540-3541


3845-3863: Commented sections (shared vs unshared blocks) improve understanding.

Nice clarity without affecting logic.


4089-4090: Confirm kSWAExtraBlock visibility.

Ensure kSWAExtraBlock is included in this TU (header) to avoid ODR/visibility issues on some compilers/configs.


4659-4970: Parametrized NeededBlocksOneStepTest adds strong coverage.

Good matrix (context vs decode, twoStepsLookAhead, drafts). No issues.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (13)

160-191: Helper to gather sequence blocks looks correct

Backwards traversal with root guard and preallocation are sound; no off‑by‑one. LGTM.


592-597: Correct: warn when windowSize > maxSequenceLength

Matches prior guidance; prevents silent misconfig.


603-607: SWA flag propagation LGTM

isSWA = windowSize < maxSequenceLength is the right criterion.


621-642: Capacity calc and logging LGTM (with SWA cap note)

Capping temp window to maxSequenceLength avoids over‑alloc now; follow‑up removal noted in comments. Logging is helpful.

Please ensure any downstream logic that reads maxBlocksPerSeq expects “full‑attention” sized sequences for SWA windows.


655-673: Constructor changes LGTM

New isSWA arg stored in mIsSWA and other initializations are consistent.


1446-1509: Return count from storeBlocks is useful; keep BlockManager wrapper consistent

WindowBlockManager::storeBlocks now returns count. BlockManager::storeBlocks (header inline) ignores the return; that’s fine but surprising.

Consider updating the BlockManager wrapper to return the count (or comment that it’s intentionally ignored).


1934-1939: One‑step needs calc: looks consistent

Capping prompt cache length to windowSize + chunkSize is reasonable with draft tokens.


2053-2069: Offset refresh only for last block: LGTM

Efficiently updates just the newest block offsets per beam.


2089-2095: Hooking adjustBlocksIfNeeded on addToken is correct

Ensures timely detach/allocate on token additions.


2205-2213: Graceful warning when sequence missing: OK

Warning on missing sequence for storeContextBlocks is appropriate.


2216-2227: Gate storeNewBlock correctly

Skip when beamWidth > 1 or reuse disabled; otherwise delegate. LGTM.


589-597: Window > maxSequenceLength warning LGTM

Matches retrieved learnings; no action.


2011-2031: SWA extra block: definition exists — rename to follow naming guidelines

kSWAExtraBlock is defined at cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:62 (static constexpr SizeType32 kSWAExtraBlock = 1) and no duplicate definitions were found. Recommend renaming to kSWA_EXTRA_BLOCK and updating usages at:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2657, 2691
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp:4089

Likely an incorrect or invalid review comment.

@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 54a398d to 6546070 Compare September 24, 2025 07:34
@eopXD eopXD changed the title [None][feature] Environment variable to adjust block pool allocation ratio under kv cache manager [None][feature] Add environment variable to adjust block pool allocation ratio under kv cache manager Sep 24, 2025
@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 6546070 to 9709231 Compare September 24, 2025 07:42
@eopXD
Copy link
Collaborator Author

eopXD commented Sep 24, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19777 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19777 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14880 completed with status: 'FAILURE'

@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 9709231 to 295dbf3 Compare September 25, 2025 07:19
@eopXD
Copy link
Collaborator Author

eopXD commented Sep 25, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19915 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19915 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14991 completed with status: 'FAILURE'

…mory proportion shared

Usage example:

export TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6

Signed-off-by: eopXD <[email protected]>
@eopXD eopXD force-pushed the swa-block-ratio-knob branch from 295dbf3 to 70feb7c Compare September 25, 2025 15:21
@eopXD
Copy link
Collaborator Author

eopXD commented Sep 25, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19974 [ run ] triggered by Bot

@eopXD eopXD enabled auto-merge (squash) September 25, 2025 15:35
@tensorrt-cicd
Copy link
Collaborator

PR_Github #19974 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15036 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Copy link
Collaborator

@QiJune QiJune left a comment


LGTM

@eopXD eopXD merged commit 2db22fb into NVIDIA:main Sep 26, 2025
7 checks passed
Copy link
Collaborator

@yechank-nvidia yechank-nvidia left a comment


Thx for the work. Overall, LGTM, left several comments.

// Normalize shares to 1.0
for (auto& s : shares)
{
    s /= sumShares;
}
Copy link
Collaborator


nit) Might be good to log so the user knows that the shares were scaled down to these numbers.

For example, when they give [0.6, 1.4], log that the shares were rescaled from [0.6, 1.4] to [0.3, 0.7].

// then setting TRTLLM_WINDOW_SIZE_SHARES=0.4,0.6 will be allocating
// 40% of the memory to window size 512 and 60% of the memory to window
// size 32768.
if (auto envStr = std::getenv("TRTLLM_WINDOW_SIZE_SHARES"))
Copy link
Collaborator


Do you think we need to add tests for this under cpp/tests/unit_tests/kvCacheManagerTest.cpp?
