Fix: Enhance ModelConfig for kv cache size calculations #5868
Conversation
/bot run --disable-fail-fast

PR_Github #11402 [ run ] triggered by Bot

Thanks for fixing this missing part. This PR will resolve the memory underutilization issue of the VSWA model. There are still more things to do. Let's do it step by step. Thanks again for your help!

PR_Github #11402 [ run ] completed with state
Do colleagues working on kernels know about this? Should we keep the test around and waive it with a bug instead of removing? Chunked prefill is important and we'll eventually need it. |
I will add it back and waive it with the related nvbug link. Thanks for the suggestion!
Just the one comment about keeping the test around. |
LGTM
/bot run --disable-fail-fast

PR_Github #11466 [ run ] triggered by Bot

PR_Github #11466 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #11471 [ run ] triggered by Bot

PR_Github #11471 [ run ] completed with state
Force-pushed from 54ad131 to 7cee8a3
/bot run --disable-fail-fast

PR_Github #11678 [ run ] triggered by Bot

PR_Github #11678 [ run ] completed with state
Force-pushed from 7cee8a3 to c7c9eae
PR_Github #12021 [ run ] triggered by Bot

PR_Github #12021 [ run ] completed with state
- Added support for setting num_kv_heads in ModelConfig.
- Implemented logic to determine head_size from head_size or head_dim in pretrained_config.
- Added warnings for default values when configurations are not set.

Signed-off-by: qixiang-99 <[email protected]>
- Expose `num_free_blocks_per_window_size` via kv_cache_stats.
- Using `num_free_blocks_per_window_size`, update `get_num_free_blocks()` and `get_num_available_tokens()`.

Signed-off-by: qixiang-99 <[email protected]>
Walkthrough

The changes expand kv cache management and reporting to include per-window-size free block statistics, update Python bindings and resource management logic to support Variable Sliding Window Attention (VSWA), and enhance model configuration handling for kv cache sizing. Integration tests are updated with new test cases and a bug-skipped scenario.
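As a rough illustration of the per-window accounting described above, here is a minimal sketch with made-up numbers; the mapping shape follows the `num_free_blocks_per_window_size` statistic this PR introduces:

```python
# Illustrative only: each attention window size maintains its own block pool,
# so the schedulable capacity is bounded by the scarcest pool.
num_free_blocks_per_window_size = {512: 40, 4096: 12}  # window size -> free blocks

# A conservative free-block figure for the whole cache is the minimum:
free_blocks = min(num_free_blocks_per_window_size.values())
assert free_blocks == 12
```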
/bot reuse-pipeline

PR_Github #12113 [ reuse-pipeline ] triggered by Bot
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
34-41: `<map>` header missing – compilation depends on transitive includes

`std::map` is used in several declarations (e.g. `BlocksPerWindow`, the new `numFreeBlocksPerWindowSize`) but `<map>` is not explicitly included. Relying on a transitive include is brittle and may break with future standard-library or upstream header changes.

+#include <map>
🧹 Nitpick comments (4)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
1462-1466: Consider returning a `const&` to avoid an extra copy

`getNumFreeBlocksPerWindowSize()` currently copies the map on every call. If callers only need read-only access (which seems to be the case), returning a `const std::map<…>&` would avoid the extra allocation and copy.

-[[nodiscard]] std::map<SizeType32, SizeType32> getNumFreeBlocksPerWindowSize() const
+[[nodiscard]] std::map<SizeType32, SizeType32> const& getNumFreeBlocksPerWindowSize() const

That would require `BlockManager::getNumFreeBlocksPerWindowSize()` to return a reference as well. Not blocking, but worth considering for hot paths.
1478-1482: Move-construct the map to avoid one copy

When populating the stats object we can transfer ownership of the temporary map produced by `getNumFreeBlocksPerWindowSize()`:

-    kvCacheStats.numFreeBlocksPerWindowSize = getNumFreeBlocksPerWindowSize();
+    kvCacheStats.numFreeBlocksPerWindowSize = std::move(getNumFreeBlocksPerWindowSize());

This yields a move instead of a deep copy (requires the `<utility>` include). Minor, but free.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
525-533: Fix line length and consider logging optimization.

The VSWA logic is correct, but there are a few improvements needed:

- Line 528 exceeds the 120-character limit (static analysis flag)
- Info-level logging might be too verbose for production use
- Consider caching the stats object to avoid repeated property access

Apply this diff to address these issues:

-        if self.is_vswa:
-            logger.info(
-                f"For VSWA case, we return the minimum of the number of free blocks for each window size: {self.impl.get_kv_cache_stats().num_free_blocks_per_window_size}"
-            )
-            return min(self.impl.get_kv_cache_stats().
-                       num_free_blocks_per_window_size.values())
-        else:
-            return self.impl.get_kv_cache_stats().free_num_blocks
+        if self.is_vswa:
+            stats = self.impl.get_kv_cache_stats()
+            logger.debug(
+                f"VSWA: returning minimum free blocks across window sizes: "
+                f"{stats.num_free_blocks_per_window_size}"
+            )
+            return min(stats.num_free_blocks_per_window_size.values())
+        else:
+            return self.impl.get_kv_cache_stats().free_num_blocks
538-548: Improve consistency and optimize repeated property access.

The VSWA logic is correct, but there are consistency and performance improvements to consider:

- Use `self.is_vswa` consistently instead of checking `len(self.max_attention_window_vec) > 1`
- Cache the stats object to avoid repeated property access
- Simplify the conditional structure for better readability

Apply this diff to improve consistency and performance:

-        if self.max_attention_window_vec and len(
-                self.max_attention_window_vec) > 1:
-            # VSWA case, the available tokens should the the minimum of the available tokens for each window size
-            min_free_blocks = min(self.impl.get_kv_cache_stats().
-                                  num_free_blocks_per_window_size.values())
-            res = min_free_blocks * self.tokens_per_block - self.num_extra_kv_tokens - max_num_draft_tokens
-        else:
-            res = (self.get_num_free_blocks() * self.tokens_per_block -
-                   self.num_extra_kv_tokens - max_num_draft_tokens)
-        return res
+        if self.is_vswa:
+            # VSWA case: use minimum free blocks across all window sizes
+            stats = self.impl.get_kv_cache_stats()
+            min_free_blocks = min(stats.num_free_blocks_per_window_size.values())
+            available_tokens = min_free_blocks * self.tokens_per_block
+        else:
+            available_tokens = self.get_num_free_blocks() * self.tokens_per_block
+
+        return available_tokens - self.num_extra_kv_tokens - max_num_draft_tokens
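To make the token-budget arithmetic in the suggested diff concrete, here is a tiny worked example; all numbers are made up for illustration:

```python
# Toy numbers, purely illustrative: converting the minimum free-block count
# across window sizes into an available-token budget.
tokens_per_block = 32
min_free_blocks = 12        # min over num_free_blocks_per_window_size
num_extra_kv_tokens = 0
max_num_draft_tokens = 4

available_tokens = (min_free_blocks * tokens_per_block
                    - num_extra_kv_tokens - max_num_draft_tokens)
assert available_tokens == 380  # 12 * 32 - 0 - 4
```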
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
- tensorrt_llm/_torch/model_config.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
528-528: Line too long (171 > 120) (E501)
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_h100.yml (1)
27-27: LGTM: VSWA test addition aligns with PR objectives.

The addition of the `test_auto_dtype_vswa` test case is consistent with the PR's goal of enhancing VSWA support and kv cache management improvements.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
301-302: LGTM: Property binding follows established patterns.

The addition of the `num_free_blocks_per_window_size` property to the `KvCacheStats` binding is correct and follows the same pattern as other properties in the class. This change enables Python access to per-window-size free block statistics, which supports the VSWA functionality as described in the PR objectives.
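For orientation, reading the newly bound property from Python might look like the following sketch; the function name is hypothetical, and `kv_cache_manager` stands in for an already-constructed manager binding object:

```python
def log_free_blocks_per_window(kv_cache_manager):
    """Print per-window free block counts from the newly bound stats property."""
    stats = kv_cache_manager.get_kv_cache_stats()
    # Map from attention window size to number of free blocks in that pool.
    for window_size, free_blocks in stats.num_free_blocks_per_window_size.items():
        print(f"window {window_size}: {free_blocks} free blocks")
```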
tensorrt_llm/_torch/model_config.py (3)

309-316: LGTM: Improved robustness with user feedback.

The logic to handle the `tokens_per_block` parameter is well implemented:

- Provides a clear warning when the parameter is not set
- Uses appropriate default fallback behavior
- Ensures the parameter is set when provided

This enhances the robustness of the kv cache size calculation configuration.
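A hedged sketch of the fallback behaviour described above; the default value, helper name, and warning wording are assumptions for illustration, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_TOKENS_PER_BLOCK = 64  # assumed default for illustration


def resolve_tokens_per_block(tokens_per_block=None):
    """Return tokens_per_block, warning and falling back when unset."""
    if tokens_per_block is None:
        logger.warning("tokens_per_block is not set, using default value %d",
                       DEFAULT_TOKENS_PER_BLOCK)
        return DEFAULT_TOKENS_PER_BLOCK
    return tokens_per_block
```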
317-322: LGTM: Correct kv heads calculation for parallelization.

The calculation of `num_kv_heads` is correctly implemented:

- Properly retrieves `num_key_value_heads` from the pretrained config, with a fallback to `num_heads`
- Correctly accounts for tensor parallelism (tp_size) and context parallelism (cp_size)
- Uses the appropriate getter method to retrieve the attribute

This supports the enhanced kv cache management for VSWA functionality.
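A minimal sketch of the described calculation, assuming Hugging Face-style attribute names (`num_key_value_heads`, `num_attention_heads`); the real code may differ in detail:

```python
def num_kv_heads_per_rank(pretrained_config, tp_size, cp_size):
    # Fall back to the attention head count for models without GQA/MQA.
    num_kv_heads = getattr(pretrained_config, "num_key_value_heads",
                           pretrained_config.num_attention_heads)
    # Each tensor-/context-parallel rank holds a shard of the kv heads.
    return num_kv_heads // (tp_size * cp_size)
```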
344-355: LGTM: Robust head size determination with fallback logic.

The enhanced logic for determining `size_per_head` is well designed:

- Iterates through multiple possible attribute names (`head_size`, `head_dim`)
- Uses the first available attribute found
- Provides a clear warning when falling back to the computed value
- Maintains backward compatibility with existing configurations

This improves the robustness of head size detection across different model configurations.
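The fallback chain described above could look roughly like this sketch; the helper name, exact attribute handling, and warning text are assumptions:

```python
import logging

logger = logging.getLogger(__name__)


def resolve_head_size(pretrained_config, hidden_size, num_heads):
    # Try the known attribute names in order; use the first one that is set.
    for attr in ("head_size", "head_dim"):
        value = getattr(pretrained_config, attr, None)
        if value is not None:
            return value
    # Neither attribute is present: fall back to the computed value.
    head_size = hidden_size // num_heads
    logger.warning("head size not found in config, falling back to %d", head_size)
    return head_size
```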
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
539-557: Well-implemented test for chunked prefill with VSWA configuration.

The test method correctly implements chunked prefill testing with Variable Sliding Window Attention (VSWA) configuration. The skip decorator is appropriately applied with a clear reference to the kernel compatibility issue (nvbug 5338620), which aligns with the PR objectives of temporarily handling kernel compatibility issues.
The test configuration properly:
- Uses the class-level kv_cache_config with block reuse disabled
- Sets up VSWA with Gemma3 1B specific attention window pattern
- Enables chunked prefill with reasonable max_num_tokens setting
- Runs appropriate evaluation task
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
183-185: Extension of `KvCacheStats` looks good

The addition of `numFreeBlocksPerWindowSize` cleanly extends the stats struct without breaking existing initialisation semantics (a default-constructed `std::map` is empty).

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
is empty).tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
196-196
: Good refactoring to centralize VSWA condition.Converting the VSWA check to an instance variable improves code maintainability and avoids repeated calculations throughout the class methods.
PR_Github #12113 [ reuse-pipeline ] completed with state
[nvbug][fix] Enhance ModelConfig for kv cache size calculations
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-1, xxx". Note: does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests. Will also run the L0 pre-merge pipeline.

--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
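For instance, a typical invocation combining a couple of these options (GPU type taken from the examples above) might be:

/bot run --disable-fail-fast --gpu-type "H100_PCIe"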
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.
Summary by CodeRabbit

New Features
- Per-window-size free block statistics are exposed through kv cache stats and the Python bindings.

Improvements
- Resource management and model configuration handling are updated to support Variable Sliding Window Attention (VSWA) kv cache sizing.

Tests
- Integration tests are updated with new test cases and a bug-skipped scenario.