🚀 The feature, motivation and pitch
The feature request is to add support for a load/unload endpoint/API in vLLM to dynamically load and unload multiple LLMs within a single GPU instance. This feature aims to enhance resource utilization and scalability by allowing concurrent operation of multiple LLMs on the same GPU.
A load/unload endpoint in vLLM would facilitate:
- Increased Resource Utilization: Enables concurrent operation of multiple LLMs on a single GPU, optimizing computational resources and system efficiency.
- Enhanced Scalability: Allows dynamic model loading and unloading based on demand, adapting to varying workloads and user requirements.
- Improved Cost-effectiveness: Maximizes throughput and performance without additional hardware investments, which is ideal for organizations with budget constraints.
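To make the request concrete, here is a minimal sketch of what such an endpoint could look like. This is not an existing vLLM API: the route names, payloads, and the per-model registry are assumptions for illustration, and the `del` + `empty_cache()` teardown in the unload path is only best-effort (reliably reclaiming the memory is exactly what official support would need to guarantee).

```python
# Hypothetical load/unload endpoint; illustrative only, not part of vLLM today.
import gc

import torch
from fastapi import FastAPI, HTTPException
from vllm import LLM, SamplingParams

app = FastAPI()
models: dict[str, LLM] = {}  # model name -> loaded vLLM engine


@app.post("/v1/models/load")
def load_model(name: str, gpu_memory_utilization: float = 0.3):
    if name in models:
        raise HTTPException(409, f"{name} is already loaded")
    # Cap per-model GPU memory so several small models can share one GPU.
    models[name] = LLM(model=name, gpu_memory_utilization=gpu_memory_utilization)
    return {"loaded": name}


@app.post("/v1/models/unload")
def unload_model(name: str):
    llm = models.pop(name, None)
    if llm is None:
        raise HTTPException(404, f"{name} is not loaded")
    # Drop the engine and try to return its GPU memory to the pool.
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    return {"unloaded": name}


@app.post("/v1/generate")
def generate(name: str, prompt: str):
    llm = models.get(name)
    if llm is None:
        raise HTTPException(404, f"{name} is not loaded")
    outputs = llm.generate([prompt], SamplingParams(max_tokens=128))
    return {"text": outputs[0].outputs[0].text}
```

The key point is the unload path: today there is no supported call that guarantees the engine's KV cache and weights are released, so the registry above can only hope the memory comes back.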
Alternatives
Alternatively, exposing an API for manually unloading a model (without a full load/unload endpoint) would still offer finer-grained control over resource management.
Additional context
- The models in my context are mainly small LLMs (<= 10B parameters).
- Several community members have raised issues asking how to unload models or release GPU memory in vLLM. Workarounds exist, but their efficacy is inconsistent, so official support for these functions would be very welcome.
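For reference, one workaround that circulates in the community issues is shown below: tear the engine down manually and force CUDA to release its cached blocks. Its reliability varies, and the import path of `destroy_model_parallel` has moved between vLLM releases, so treat this as an illustrative sketch rather than a supported recipe.

```python
# Community workaround sketch for releasing GPU memory; behavior and import
# paths differ across vLLM versions.
import gc

import torch
from vllm import LLM
from vllm.distributed.parallel_state import destroy_model_parallel  # older releases use a different module path

llm = LLM(model="facebook/opt-1.3b", gpu_memory_utilization=0.3)
# ... serve requests ...

destroy_model_parallel()   # tear down vLLM's parallel/distributed state
del llm                    # drop the engine and its weights
gc.collect()
torch.cuda.empty_cache()   # return cached allocations to the driver
```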