
Commit 0113281

simplify readme (#110)
Parent: 7c62e51

File tree: 4 files changed (+33 / -137 lines)


README.md

Lines changed: 10 additions & 63 deletions
@@ -8,6 +8,8 @@ We gained a great deal of inspiration and motivation from [this open source proj

 <img src="./docs/llm-inference.png" alt="image" width=600 height="auto">

+### TL;DR
+
 Llm-inference is a platform for deploying and managing LLM (Large Language Model) inference tasks with the following features:

 - Utilizes Ray technology to organize multiple nodes into a cluster, achieving centralized management of computational resources and distributing resources required for each inference task.
@@ -22,100 +24,45 @@ Llm-inference is a platform for deploying and managing LLM (Lifelong Learning Ma

 More features in [Roadmap](./Roadmap.md) are coming soon.

-## Getting started

-### Deploy locally
+## Deployment

-#### Install `LLM Inference` and dependencies
+### Install `LLM Inference` and dependencies

 You can start by cloning the repository and pip install `llm-serve`. It is recommended to deploy `llm-serve` with Python 3.10+.

 ```
 git clone https://github.com/OpenCSGs/llm-inference.git
 cd llm-inference
-pip install .
-```
-
-Option to use another pip source for faster transfer if needed.
-
-```
-pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple/
 ```

 Install specified dependencies by components:

 ```
 pip install '.[backend]'
-pip install '.[frontend]'
 ```

-**Note:** Install vllm dependency if runtime supports GPUs, run the following command:
+**Note:** `vllm` is optional, since it requires a GPU:

 ```
 pip install '.[vllm]'
 ```

-Option to use other pip sources for faster transfers if needed.
-
+Install `llm-inference`:
 ```
-pip install '.[backend]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
-pip install '.[frontend]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
-pip install '.[vllm]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
-```
-
-#### Install Ray and start a Ray Cluster locally
-
-Pip install Ray:
-
-```
-pip install -U "ray[serve-grpc]==2.9.3"
+pip install .
 ```

-Option to use another pip source for faster transfer if needed.
-
-```
-pip install -U "ray[serve-grpc]==2.9.3" -i https://pypi.tuna.tsinghua.edu.cn/simple/
-```
-
-> **Note:** ChatGLM2-6b requires transformers<=4.33.3, while the latest vllm requires transformers>=4.36.0.
-
-Start cluster then:
+### Start a Ray Cluster locally

 ```
 ray start --head --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=8265
 ```

-See reference [here](https://docs.ray.io/en/releases-2.9.3/ray-overview/installation.html).
-
-#### Quick start
-
-You can follow the [quick start](./docs/quick_start.md) to run an end-to-end case for model serving.
-
-#### Uninstall
-
-Uninstall `llm-serve` package:
-
-```
-pip uninstall llm-serve
-```
-
-Then shutdown the `Ray` cluster:
-
-```
-ray stop
-```
-
-### API server
-
-See the [guide](./docs/api_server.md) for API server and API documents.
-
-### Deploy on bare metal
-
-See the [guide](./docs/deploy_on_bare_metal.md) to deploy on bare metal.
+### Quick start

-### Deploy on kubernetes
+You can follow the [quick start](./docs/quick_start.md) to run an end-to-end case.

-See the [guide](./docs/deploy_on_kubernetes.md) to deploy on kubernetes.

 ## FAQ
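
After this change, the README's local deployment path ends at the `ray start` command above. As an optional sanity check before deploying any model (a minimal sketch using Ray's public Python API; it is not part of the README itself):

```python
import ray

# Attach to the cluster started above with `ray start --head ...`.
ray.init(address="auto")

# Print the resources the head node reports (CPUs, GPUs, memory, ...).
print(ray.cluster_resources())

ray.shutdown()
```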

README_cn.md

Lines changed: 9 additions & 56 deletions
@@ -19,101 +19,54 @@ Llm-inference is a platform for deploying and managing LLM (Large Language Model)

 More features in the [Roadmap](./Roadmap.md) are under development; contributions are welcome.

-## Getting started
+## Local deployment

-### Deploy locally
-
-#### Deploy `LLM Inference` and its dependencies
+### Deploy `LLM Inference` and its dependencies

 You can download the project code and then install `llm-serve` with pip. It is recommended to deploy `llm-serve` with Python 3.10+.

 ```
 git clone https://github.com/OpenCSGs/llm-inference.git
 cd llm-inference
-pip install .
-```
-
-If you are limited by network transfer speed, you can use a faster pip mirror.
-
-```
-pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple/
 ```

 Install the specified dependencies by component:

 ```
 pip install '.[backend]'
-pip install '.[frontend]'
 ```

-**Note:** If the runtime supports GPUs, run the following command to install the vllm dependency:
+**Note:** vllm is optional since it requires a GPU; install it depending on your environment:

 ```
 pip install '.[vllm]'
 ```

-If you are limited by network transfer speed, you can use a faster pip mirror.
-
-```
-pip install '.[backend]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
-pip install '.[frontend]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
-pip install '.[vllm]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
-```
-
-#### Install Ray and start a Ray Cluster locally
-
-Install Ray:
-
+Install `llm-inference`:
 ```
-pip install -U "ray[serve-grpc]==2.8.0"
+pip install .
 ```

 If you are limited by network transfer speed, you can use a faster pip mirror.

 ```
-pip install -U "ray[serve-grpc]==2.8.0" -i https://pypi.tuna.tsinghua.edu.cn/simple/
+pip install '.[backend]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
+pip install '.[vllm]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
+pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple/
 ```

-> **Note:** ChatGLM2-6b requires transformers<=4.33.3, while the latest vllm requires transformers>=4.36.0.
+#### Install Ray and start a Ray Cluster locally

 Start the Ray cluster:

 ```
 ray start --head --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=8265
 ```

-See [this document](https://docs.ray.io/en/releases-2.8.0/ray-overview/installation.html) for more information on installing and starting Ray.
-
 #### Quick start

 You can follow the [quick start](./docs/quick_start.md) to run an end-to-end model serving example.

-#### Uninstall
-
-Uninstall `llm-serve`:
-
-```
-pip uninstall llm-serve
-```
-
-Stop the `Ray` cluster:
-
-```
-ray stop
-```
-
-### API server
-
-For details about the API server and the APIs, see [this document](./docs/api_server.md)
-
-### Deploy on bare metal
-
-See [this document](./docs/deploy_on_bare_metal.md) for how to deploy on bare metal.
-
-### Deploy on Kubernetes
-
-See [this document](./docs/deploy_on_kubernetes.md) for how to deploy on Kubernetes.
-
 ## Miscellaneous

 ### Using models from a local path, a git server, S3 storage, or the OpenCSG Model Hub

docs/quick_start.md

Lines changed: 13 additions & 18 deletions
@@ -2,7 +2,7 @@

 ## Introduction to llm-serve

-`llmserve` comes with its own CLI, `llm-serve`, which allows you to interact directly with the backend without having to use the Gradio frontend.
+`llmserve` comes with its own CLI, `llm-serve`, which allows you to interact directly with the backend.

 Installing `llmserve` also installs the `llm-serve` CLI, and you can get a list of all available commands by running `llm-serve --help`.

@@ -25,57 +25,52 @@ Installing `llmserve` also installs the `llm-serve` CLI, and you can get a list

 ## Start a model serving

-You can deploy any model in the `models` directory of this repo, or define your own model YAML file and run that instead.
+You can deploy any model in the [models](../models) directory of this repo, or define your own model YAML file and run that instead.
 For example:

 ```
-llm-serve start serving-rest --model=models/text-generation--gpt2.yaml
-
-# You can start mutiple models serving at once.
-llm-serve start serving-rest --model=models/text-generation--facebook--opt-125m.yaml --model=models/text-generation--gpt2.yaml
+llm-serve start serving-rest --model models/text-generation--facebook--opt-125m.yaml
 ```

 ## Check model serving status and predict URL

 Check model serving status and predict URL by:

 ```SHELL
-# llm-serve list serving --name gpt2
+# llm-serve list serving --appname default
 {
-  "gpt2": {
+  "default": {
     "status": {
-      "gpt2": {
+      "default": {
         "application_status": "RUNNING",
         "deployments_status": {
-          "gpt2": "HEALTHY",
-          "RouterDeployment": "HEALTHY"
+          "facebook--opt-125m": "HEALTHY",
+          "facebook--opt-125m-router": "HEALTHY"
         }
       }
     },
     "url": {
-      "prodict_url": "http://0.0.0.0:8000/api/v1/default/gpt2/run/predict"
+      "facebook/opt-125m": "http://0.0.0.0:8000/api/v1/default/facebook--opt-125m/run/predict"
     }
   }
 }
 ```

-## Using the model serving
+## Invoke model serving

 Invoke the model with the command `llm-serve predict`:

 ```
-llm-serve predict --model gpt2 --prompt "I am going to do" --prompt "What do you like"
+llm-serve predict --model facebook/opt-125m --prompt "I am going to do"
 ```

 If you start the model using `llm-serve start serving-rest`, you can also run the following `curl` command to call the model predict API shown above.

 ```
-curl -H "Content-Type: application/json" -X POST -d '{"prompt": "What can I do"}' "http://127.0.0.1:8000/api/v1/default/gpt2/run/predict"
-
-curl -H "Content-Type: application/json" -X POST -d '[{"prompt":"How can you"}, {"prompt": "What can I do"}]' "http://127.0.0.1:8000/api/v1/default/gpt2/run/predict"
+curl -H "Content-Type: application/json" -X POST -d '{"prompt": "What can I do"}' "http://127.0.0.1:8000/api/v1/default/facebook--opt-125m/run/predict"
 ```

-## Start your trial
+## Start a model serving with Gradio UI

 You can start a trial with the following command, which will start a serving and built-in UI for the model running on <http://127.0.0.1:8000/facebook--opt-125m>.
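
The predict endpoint shown in the `curl` example can also be called from Python. A minimal sketch using only the standard library (it assumes the `facebook--opt-125m` serving started above is still running; the response format is not shown in this diff, so the body is simply printed):

```python
import json
import urllib.request

# Same endpoint and JSON body as the curl example above.
url = "http://127.0.0.1:8000/api/v1/default/facebook--opt-125m/run/predict"
data = json.dumps({"prompt": "What can I do"}).encode("utf-8")

req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))  # response payload depends on the deployment
```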

setup.py

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@
     "torchmetrics==1.2.1",
     "llama_cpp_python==0.2.57",
     "transformers==4.39.1",
+    "ray[serve]==2.9.3",
 ],
 "vllm": [
     "vllm>=0.2.0,<0.2.6",
