[None][fix] Fix free memory fraction calculation in resource manager #6807
Conversation
📝 Walkthrough

calculate_max_num_blocks_from_cpp now budgets primary_pool_memory_bytes as a fraction of free GPU memory (kv_cache_config.free_gpu_memory_fraction, default 0.9), with an assertion that free_mem_fraction < 1.0 and a debug log; the remaining memory adjustment and block-calculation logic is unchanged.
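The budgeting described in the walkthrough can be sketched as a small standalone function. The name and the default fraction mirror the PR description, but the function itself is illustrative, not the actual TensorRT-LLM source.

```python
def primary_pool_budget(free_mem_bytes: int, free_mem_fraction: float = 0.9) -> int:
    """Budget the KV-cache primary pool as a fraction of free GPU memory.

    Illustrative stand-in for the change described above; the real code reads
    the fraction from kv_cache_config.free_gpu_memory_fraction.
    """
    assert free_mem_fraction < 1.0, (
        f"free_gpu_memory_fraction must be < 1.0, got {free_mem_fraction}")
    return int(free_mem_bytes * free_mem_fraction)

# With 8 GiB free and the default fraction, roughly 7.2 GiB is budgeted.
budget_bytes = primary_pool_budget(8 * 1024**3)
```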
Sequence diagrams omitted: the changes are a localized calculation update and do not alter high-level control flow.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
830-840: Account for extra_cost_memory in the VSWA window-size adjustment

The Python-side window clamping currently uses the full primary_pool_memory_bytes, while the C++ call subtracts extra_cost_memory. To keep both sides consistent, subtract extra_cost_memory before calling adjust_window_sizes_for_vswa.

Please update tensorrt_llm/_torch/pyexecutor/resource_manager.py around the call to adjust_window_sizes_for_vswa:

```diff
@@ def calculate_max_num_blocks_from_cpp(…):
-        # Adjust the window sizes to fit the memory if even a single sequence
-        # cannot fit in the memory.
-        window_size_to_layers = self.adjust_window_sizes_for_vswa(
-            window_size_to_layers=window_size_to_layers,
-            model_config=model_config,
-            kv_cache_config=kv_cache_config,
-            pool_memory_bytes=primary_pool_memory_bytes,
-            kv_factor=self.kv_factor,
-            dtype=self.dtype,
-            is_cross_attention=is_cross_attention,
-        )
+        # Adjust memory for extra costs before clamping VSWA windows.
+        mem_for_adjustment = max(0, primary_pool_memory_bytes - extra_cost_memory)
+        window_size_to_layers = self.adjust_window_sizes_for_vswa(
+            window_size_to_layers=window_size_to_layers,
+            model_config=model_config,
+            kv_cache_config=kv_cache_config,
+            pool_memory_bytes=mem_for_adjustment,
+            kv_factor=self.kv_factor,
+            dtype=self.dtype,
+            is_cross_attention=is_cross_attention,
+        )
```

This ensures the Python-side window resizing matches the final C++ memory budget.
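The clamp proposed in the diff can be exercised in isolation; this sketch simply restates it outside the class for clarity (the function name is made up for the example).

```python
def clamped_pool_bytes(primary_pool_memory_bytes: int, extra_cost_memory: int) -> int:
    """Subtract extra costs from the pool budget, never going negative.

    The max(0, ...) guard keeps the budget valid even when extra costs
    exceed the pool size.
    """
    return max(0, primary_pool_memory_bytes - extra_cost_memory)
```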
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
1-5: Repository guideline: missing NVIDIA copyright header

Per the coding guidelines, all TensorRT-LLM OSS source files must include an NVIDIA copyright header with the current year. This file appears to be missing it.
Add at the very top:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+#
```

If the repo uses a different canonical header text, please apply that instead for consistency.
782-856: Add a focused unit/integration test for the VSWA budgeting path

Given the prior bug and this critical fix, please add a test that verifies the VSWA path respects free_gpu_memory_fraction:
- Mock torch.cuda.mem_get_info to return deterministic bytes.
- Create a minimal kv_cache_config with free_gpu_memory_fraction set (e.g., 0.5).
- Use a tiny model_config and window pattern such that adjustment is triggered.
- Assert that allotted_primary_mem_bytes passed to KVCacheManagerCpp.calculate_max_num_blocks scales with the fraction (by spying or, if spying is hard, validate adjusted window sizes change when the fraction changes).
I can scaffold a test harness that stubs KVCacheManagerCpp.calculate_max_num_blocks and asserts the computed memory budget and window adjustments. Want me to draft it?
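As a rough shape for such a test, here is a self-contained sketch. The real version would patch torch.cuda.mem_get_info and spy on KVCacheManagerCpp.calculate_max_num_blocks; both are replaced with stand-ins here so the structure runs without TensorRT-LLM installed.

```python
def fake_mem_get_info():
    """Deterministic stand-in for torch.cuda.mem_get_info: (free, total) bytes."""
    return (4 * 1024**3, 8 * 1024**3)

def compute_primary_pool_budget(free_gpu_memory_fraction: float) -> int:
    """Stand-in for the code path under test: a fraction of free GPU memory."""
    free_mem, _total = fake_mem_get_info()
    return int(free_mem * free_gpu_memory_fraction)

def test_budget_scales_with_fraction():
    half = compute_primary_pool_budget(0.5)
    nine_tenths = compute_primary_pool_budget(0.9)
    assert half == int(4 * 1024**3 * 0.5)
    assert nine_tenths > half  # a larger fraction budgets more memory

test_budget_scales_with_fraction()
```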
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
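A few of the guidelines above (naming conventions and the small-try/else shape) can be illustrated together; all identifiers here are made up for the example.

```python
G_RETRY_LIMIT = 3  # global: upper snake_case with a G prefix

MY_TIMEOUT_SECONDS = 30  # constant: upper snake_case

def parse_port(raw_value: str):
    """Parse a TCP port string, keeping the try body as small as possible.

    Returns:
        The port as an int if valid, otherwise None.
    """
    try:
        port = int(raw_value)  # only the risky conversion sits inside try
    except ValueError:
        return None
    else:
        # Per the guideline, the follow-up logic lives in the else block.
        return port if 0 < port < 65536 else None
```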
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
817-824: Correct fix: now honoring free_gpu_memory_fraction for the VSWA primary pool budget

Good catch. Computing primary_pool_memory_bytes as a fraction of free_mem aligns VSWA behavior with the Python path and prevents overallocation leading to OOM.
Respect the fraction specified; otherwise we disregard the memory taken for storing the model and hit out-of-memory (OOM) from allocating too many blocks. Signed-off-by: eopXD <[email protected]>
/bot run --disable-fail-fast
Dropping this. #5933 is covering this fix and more!
Description
The calculation in calculate_max_num_blocks_from_cpp is incorrect: on close inspection, it does not take the provided memory-fraction coefficient into account. This MR changes it to respect the specified fraction; otherwise we disregard the memory taken by the model weights and hit out-of-memory (OOM) from allocating too many blocks.

Test Coverage
This bug slipped through because the function is covered only on the C++ side (kvCacheManagerTest.cpp), where it is assumed to receive a free-memory size and allocate blocks against that number. The test should live on the Python side, but I don't see any tests for this function in test_resource_manager.py.
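To see why ignoring the fraction over-allocates, compare block counts with and without it; the block size and free-memory figures below are purely hypothetical.

```python
BYTES_PER_BLOCK = 2 * 1024 * 1024  # hypothetical KV-cache block size
FREE_MEM_BYTES = 6 * 1024**3       # hypothetical free GPU memory

# Without the fraction, the whole free pool is handed to the KV cache,
# leaving no headroom for other allocations.
blocks_without_fraction = FREE_MEM_BYTES // BYTES_PER_BLOCK

# With free_gpu_memory_fraction = 0.9, 10% of free memory stays reserved.
blocks_with_fraction = int(FREE_MEM_BYTES * 0.9) // BYTES_PER_BLOCK

assert blocks_with_fraction < blocks_without_fraction
```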