feat: Add max_free_gpu_memory_size support for KV cache configuration
- Introduced max_free_gpu_memory_size to manage GPU memory allocation for KV cache.
- Updated KvCacheConfig and related methods to handle the new parameter.
- Modified estimation logic in KvCacheCreator to utilize max_free_gpu_memory_size for VSWA cases.
- Adjusted resource management to ensure compatibility with the new memory allocation strategy.
Signed-off-by: qixiang-99 <[email protected]>
```diff
+        # overwrite max_tokens in VSWA case, use max_free_gpu_memory_size instead
+        kv_cache_config.max_tokens = None
+        primary_pool_memory_bytes = min(
+            kv_cache_config.max_free_gpu_memory_size,
+            primary_pool_memory_bytes)
         secondary_pool_memory_bytes = 0
         logger.debug(
             f"primary_pool_memory_bytes is set to {primary_pool_memory_bytes/1024**3}GB, \nsecondary_pool_memory_bytes is set to {secondary_pool_memory_bytes/1024**3}GB"
```
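The VSWA sizing rule in this hunk can be sketched as a small standalone helper. This is a minimal illustration, not the actual TensorRT-LLM code: the function name `choose_kv_pool_sizes` is hypothetical, and only the min-cap logic mirrors the diff.

```python
from typing import Optional, Tuple


def choose_kv_pool_sizes(primary_pool_memory_bytes: int,
                         max_free_gpu_memory_size: Optional[int]
                         ) -> Tuple[int, int]:
    """Sketch of the VSWA pool-sizing rule from the diff above.

    When the user sets `max_free_gpu_memory_size`, the primary pool is
    capped at the smaller of the estimated size and the user-supplied
    byte budget; the secondary (host) pool is set to zero.
    """
    if max_free_gpu_memory_size is not None:
        primary_pool_memory_bytes = min(max_free_gpu_memory_size,
                                        primary_pool_memory_bytes)
    secondary_pool_memory_bytes = 0
    return primary_pool_memory_bytes, secondary_pool_memory_bytes
```

For example, with an 8 GiB estimate and a 4 GiB user cap, the primary pool is capped at 4 GiB; with no cap set, the estimate is used unchanged.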
tensorrt_llm/llmapi/llm_args.py (7 additions & 1 deletion)
```diff
@@ -796,6 +796,11 @@ class KvCacheConfig(BaseModel, PybindMirror):
     )
     use_uvm: bool = Field(default=False,
                           description="Whether to use UVM for the KV cache.")
+    max_free_gpu_memory_size: Optional[int] = Field(
+        default=None,
+        description=
+        "The maximum size in bytes of GPU memory that can be allocated for the KV cache. This is currently only used for the VSWA case, as an alternative to `max_tokens`. If both `max_free_gpu_memory_size` and `free_gpu_memory_fraction` are specified, memory corresponding to the minimum of the two will be allocated."
```
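The "minimum of the two" semantics in the field description can be illustrated with a short sketch. This is a hypothetical stand-in, not the real `KvCacheConfig`: a plain dataclass replaces the pydantic model, and the helper `kv_cache_budget_bytes` is invented for illustration; only the field names come from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KvCacheConfigSketch:
    """Hypothetical stand-in for KvCacheConfig (field names from the diff)."""
    free_gpu_memory_fraction: Optional[float] = None
    max_free_gpu_memory_size: Optional[int] = None  # bytes; VSWA only


def kv_cache_budget_bytes(cfg: KvCacheConfigSketch,
                          free_mem_bytes: int) -> int:
    """Resolve the KV-cache byte budget.

    Per the field description: if both `max_free_gpu_memory_size` and
    `free_gpu_memory_fraction` are specified, the minimum of the two
    resulting budgets is allocated.
    """
    candidates = [free_mem_bytes]
    if cfg.free_gpu_memory_fraction is not None:
        candidates.append(int(free_mem_bytes * cfg.free_gpu_memory_fraction))
    if cfg.max_free_gpu_memory_size is not None:
        candidates.append(cfg.max_free_gpu_memory_size)
    return min(candidates)
```

For example, with 16 GiB free, a fraction of 0.5 (8 GiB), and an explicit 2 GiB cap, the 2 GiB cap wins because it is the smaller budget.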