[RFC]: Add control panel support for vLLM

### Motivation.

Running vLLM behind FastChat offers significant advantages when deploying large language models (LLMs) in production services. [1](https://blog.vllm.ai/2023/06/20/vllm.html)

The controller architecture in FastChat is particularly well suited to LLM deployment because it is only loosely coupled to the vLLM backend. This allows for:

* Autoscaling: vLLM backends can join and leave the cluster freely, which enables dynamic scaling.

* Rolling Updates: New models can be introduced under distinct names, so the cluster can move to them gradually.

* Centralized Access: Users do not need to track different URLs or IPs for different models. They send every request to the controller, which dispatches it to an appropriate backend based on the model name and balances load across backends.

However, FastChat has to manage multiple backends, vLLM being only one of them. This complexity appears to keep it from tracking the rapid evolution of vLLM: FastChat currently does not support recent vLLM features such as multi-LoRA, fragmented chat stream support, and guided decoding, among others.

Reference:
[1] https://blog.vllm.ai/2023/06/20/vllm.html

### Proposed Change.


As a heads-up, I have ported the key features of the FastChat controller and reduced them to a minimal form. For endpoints such as /v1/../completions, the controller simply extracts the model name and forwards the request unchanged to the backend, so that every vLLM feature remains usable.
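
The forwarding rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in the linked PR: `pick_backend`, `UnknownModelError`, the `workers_by_model` registry, and the `extra_option` field are all hypothetical names chosen for the example. The point it shows is that the controller only reads the `model` field and hands the original bytes to the backend untouched.

```python
from __future__ import annotations

import json


class UnknownModelError(Exception):
    """Raised when no registered worker serves the requested model (hypothetical)."""


def pick_backend(
    raw_body: bytes, workers_by_model: dict[str, list[str]]
) -> tuple[str, bytes]:
    """Return (worker_url, body_to_forward) for an OpenAI-style request body.

    Only the "model" field is inspected. The body is returned as-is, so any
    request field the backend understands passes through without the
    controller having to know about it.
    """
    payload = json.loads(raw_body)
    model = payload.get("model") if isinstance(payload, dict) else None
    workers = workers_by_model.get(model, []) if isinstance(model, str) else []
    if not workers:
        raise UnknownModelError(f"no worker registered for model {model!r}")
    # Worker selection is deliberately trivial here; load balancing is a
    # separate concern.
    return workers[0], raw_body


# Hypothetical registry and request, for illustration only.
registry = {"my-model": ["http://worker-a:8000"]}
body = b'{"model": "my-model", "prompt": "hi", "extra_option": true}'
url, forwarded = pick_backend(body, registry)
```

Because the body is never re-serialized, fields the controller has never heard of (here the made-up `extra_option`) reach the backend exactly as the client sent them.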

Current implementation: #4861

- [x] /v1/completions: same interface as vLLM's
- [x] /v1/chat/completions: same interface as vLLM's
- [x] /list_models: lists the names of the models registered with the controller
- [x] /health: checks the controller's health status
- [x] /list_workers: lists each worker's detailed status, the models it provides, and its serving status
- [x] load balancing with a shortest-queue algorithm
- [x] heartbeat keep-alive between the controller and workers
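
To make the last two items concrete, here is a minimal sketch of shortest-queue selection combined with heartbeat expiry. It is an illustration under stated assumptions, not the implementation in the PR: the `Controller` and `WorkerInfo` classes, the 90-second default timeout, and the idea that each heartbeat reports the worker's current queue length are all assumptions made for the example.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class WorkerInfo:
    models: list[str]
    queue_length: int = 0
    last_heartbeat: float = 0.0


class Controller:
    """Hypothetical in-memory registry; names and defaults are assumptions."""

    def __init__(self, timeout_s: float = 90.0) -> None:
        self.timeout_s = timeout_s
        self.workers: dict[str, WorkerInfo] = {}

    def register(self, url: str, models: list[str], now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        self.workers[url] = WorkerInfo(list(models), 0, now)

    def heartbeat(self, url: str, queue_length: int, now: float | None = None) -> bool:
        """Record a heartbeat; return False if the worker must re-register."""
        worker = self.workers.get(url)
        if worker is None:
            return False
        worker.queue_length = queue_length
        worker.last_heartbeat = time.monotonic() if now is None else now
        return True

    def remove_stale(self, now: float | None = None) -> None:
        """Drop workers whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        stale = [
            url
            for url, w in self.workers.items()
            if now - w.last_heartbeat > self.timeout_s
        ]
        for url in stale:
            del self.workers[url]

    def pick_worker(self, model: str, now: float | None = None) -> str | None:
        """Return the live worker serving `model` with the shortest queue."""
        self.remove_stale(now)
        candidates = [url for url, w in self.workers.items() if model in w.models]
        if not candidates:
            return None
        best = min(candidates, key=lambda url: self.workers[url].queue_length)
        # Count the dispatched request until the next heartbeat corrects it.
        self.workers[best].queue_length += 1
        return best
```

The `now` parameter exists only to make the behaviour easy to test deterministically; a real controller would rely on the clock. Incrementing the queue length at dispatch time keeps a burst of requests arriving between two heartbeats from all landing on the same worker.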

Future directions:
- [ ] possibly reimplement the controller in Rust, if that turns out to improve performance substantially
- [ ] more load-balancing algorithms
- [ ] unified metrics exposed by the controller, collected from each worker
- [ ] support for more interfaces, such as embeddings

### Feedback Period.

_No response_

### CC List.

@simon-mo @robertgshaw2-neuralmagic 

### Any Other Things.

_No response_
