[Feature]: load/unload API to run multiple LLMs in a single GPU instance #5491

@lizzzcai

🚀 The feature, motivation and pitch

This feature request is to add a load/unload endpoint/API to vLLM so that multiple LLMs can be dynamically loaded and unloaded within a single GPU instance. The goal is to improve resource utilization and scalability by allowing multiple LLMs to run concurrently on the same GPU.

A load/unload endpoint in vLLM would provide the following benefits; a rough sketch of what such an API could look like follows the list:

  • Increased Resource Utilization: Enables concurrent operation of multiple LLMs on a single GPU, optimizing computational resources and system efficiency.

  • Enhanced Scalability: Allows dynamic model loading and unloading based on demand, adapting to varying workloads and user requirements.

  • Improved Cost-effectiveness: Maximizes throughput and performance without additional hardware investments, ideal for organizations with budget constraints.
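As a purely illustrative sketch (not an existing vLLM interface), a management layer on top of `vllm.LLM` could expose load/unload endpoints like the ones below. The endpoint paths, request fields, and the idea of keeping several engine instances in one process, each with its own `gpu_memory_utilization` slice, are all assumptions for discussion:

```python
# Hypothetical sketch only: vLLM does not currently expose these endpoints.
# Endpoint names, request shapes, and the in-process registry are assumptions.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM

app = FastAPI()
models: dict[str, LLM] = {}  # model name -> loaded engine


class LoadRequest(BaseModel):
    model: str                            # e.g. "facebook/opt-1.3b"
    gpu_memory_utilization: float = 0.3   # leave room for other models on the GPU


class UnloadRequest(BaseModel):
    model: str


@app.post("/v1/models/load")
def load_model(req: LoadRequest):
    if req.model in models:
        raise HTTPException(409, f"{req.model} is already loaded")
    # Each LLM instance reserves its own slice of GPU memory.
    models[req.model] = LLM(
        model=req.model,
        gpu_memory_utilization=req.gpu_memory_utilization,
    )
    return {"status": "loaded", "model": req.model}


@app.post("/v1/models/unload")
def unload_model(req: UnloadRequest):
    llm = models.pop(req.model, None)
    if llm is None:
        raise HTTPException(404, f"{req.model} is not loaded")
    # Dropping the reference alone does not always return GPU memory;
    # see the teardown sketch under "Additional context" below.
    del llm
    return {"status": "unloaded", "model": req.model}
```

The key design question such an API raises is how GPU memory is partitioned and reclaimed between models, which is exactly the part that is hard to do reliably from outside the engine today.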

Alternatives

Alternatively, an API for manually unloading a model would still offer finer-grained control over resource management, even without full dynamic load/unload support.

Additional context

  • The models in my context are mainly small LLMs (<= 10B parameters).
  • Several community members have raised issues asking for a way to unload models or release GPU memory in vLLM. Workarounds exist (one common pattern is sketched below), but their effectiveness is inconsistent. Official support for these operations would be very welcome.
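For reference, the commonly shared workaround looks roughly like the sketch below. It is not an official API: it relies on internal attributes (`llm_engine`, `model_executor`) and on `destroy_model_parallel`, whose import path has moved between vLLM versions, which is part of why its effectiveness is inconsistent:

```python
# Community workaround sketch; internal attribute names and the
# destroy_model_parallel import path vary across vLLM versions,
# and full memory release is not guaranteed.
import contextlib
import gc

import torch
from vllm import LLM
from vllm.distributed.parallel_state import destroy_model_parallel

llm = LLM(model="facebook/opt-125m")
print(llm.generate("Hello, my name is")[0].outputs[0].text)

# Attempt to tear the engine down and return GPU memory.
destroy_model_parallel()
with contextlib.suppress(AttributeError):
    del llm.llm_engine.model_executor  # internal attribute, version-dependent
del llm
gc.collect()
torch.cuda.empty_cache()
print(f"GPU memory still allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```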
