
[RFC]: Add control panel support for vLLM #4873

@leiwen83


Motivation.

The Fastchat-vLLM operational model offers significant advantages in deploying large language models (LLMs) for product services [1].

The controller architecture in Fastchat is particularly beneficial for LLM deployment, owing to its loosely coupled design with the vLLM backend. This allows for:

  • Autoscaling: The vLLM backend can join and exit the cluster freely, enabling dynamic scaling capabilities.

  • Rolling Updates: The introduction of new models with distinct names allows the cluster to gradually update models, a process known as rolling updates.

  • Centralized Access: Users are relieved from the burden of tagging different URLs or IPs for various models; they simply send their requests to the controller, which then manages the rest, including dispatching requests to the appropriate backend based on the model name and ensuring effective load balancing.

However, Fastchat has to manage multiple backends, vLLM among them, and this complexity appears to keep it from tracking the rapid evolution of vLLM. It is disheartening to observe that Fastchat currently does not support the latest vLLM features, such as multi-LoRA, fragmented chat stream support, and guided decoding, among others.

Reference:
[1] https://blog.vllm.ai/2023/06/20/vllm.html

Proposed Change.

As a heads-up, I have ported the key features of the controller from Fastchat and reduced them to a minimal form. For interfaces like /v1/../completions, the controller simply extracts the model name and forwards everything else to the backend unchanged, so that all features of vLLM can be used.

Current implementation: #4861
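The forwarding rule above can be sketched in a few lines. This is only an illustration of the idea, not the code in the PR: `route_request`, `UnknownModelError`, and the `model_to_worker` mapping are hypothetical names, and the actual HTTP forwarding is left out.

```python
import json


class UnknownModelError(Exception):
    """Raised when no registered worker serves the requested model."""


def route_request(path: str, raw_body: bytes,
                  model_to_worker: dict[str, str]) -> tuple[str, bytes]:
    """Pick the backend URL for a request and return the body untouched.

    Only the "model" field is read. Every other field is passed through
    as-is, so options the controller knows nothing about still reach vLLM.
    """
    payload = json.loads(raw_body)
    model = payload.get("model")
    if model not in model_to_worker:
        raise UnknownModelError(f"no worker registered for model {model!r}")
    # Hypothetical mapping: model name -> worker base URL.
    target = model_to_worker[model].rstrip("/") + path
    return target, raw_body
```

Because the body is never re-serialized, the controller does not need to be updated when vLLM adds a new request parameter.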

  • /v1/completions: same interface as vLLM's
  • /v1/chat/completions: same interface as vLLM's
  • /list_models: lists the names of the models registered with the controller
  • /health: checks the controller's health status
  • /list_workers: lists each worker's detailed status, the models it provides, and its serving status
  • load balancing with a shortest-queue algorithm
  • heartbeat keep-alive between the controller and each worker
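To make the last two items concrete, here is a minimal in-memory sketch of worker registration, heartbeat expiry, and shortest-queue selection. It is an assumption-laden illustration rather than the implementation in #4861: the `Controller` and `Worker` classes, their method names, and the 90-second timeout are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

HEARTBEAT_TIMEOUT_S = 90.0  # assumed value, not taken from the PR


@dataclass
class Worker:
    address: str
    models: list[str]
    queue_length: int = 0
    last_heartbeat: float = field(default_factory=time.monotonic)


class Controller:
    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.workers: dict[str, Worker] = {}

    def register(self, address: str, models: list[str],
                 now: Optional[float] = None) -> None:
        """Add a worker, or replace its entry if it registers again."""
        ts = time.monotonic() if now is None else now
        self.workers[address] = Worker(address, list(models), 0, ts)

    def heartbeat(self, address: str, queue_length: int,
                  now: Optional[float] = None) -> bool:
        """Record a heartbeat. False tells an unknown worker to re-register."""
        worker = self.workers.get(address)
        if worker is None:
            return False
        worker.queue_length = queue_length
        worker.last_heartbeat = time.monotonic() if now is None else now
        return True

    def remove_stale(self, now: Optional[float] = None) -> None:
        """Drop workers whose last heartbeat is older than the timeout."""
        ts = time.monotonic() if now is None else now
        self.workers = {a: w for a, w in self.workers.items()
                        if ts - w.last_heartbeat <= self.timeout_s}

    def list_models(self) -> list[str]:
        return sorted({m for w in self.workers.values() for m in w.models})

    def pick_worker(self, model: str) -> Optional[str]:
        """Shortest queue: the least-loaded worker serving `model`, or None."""
        candidates = [w for w in self.workers.values() if model in w.models]
        if not candidates:
            return None
        return min(candidates, key=lambda w: w.queue_length).address
```

In this sketch the queue length is whatever the worker last reported in its heartbeat, so the choice is only as fresh as the heartbeat interval; a real controller might also count requests it has dispatched but not yet seen complete.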

Future directions:

  • possibly reimplement the controller in Rust, if we find that it improves performance significantly
  • more load-balancing algorithms
  • unified metrics exposed by the controller, collected from each worker
  • support for more interfaces, such as embeddings

Feedback Period.

No response

CC List.

@simon-mo @robertgshaw2-neuralmagic

Any Other Things.

No response
